
What Is Big Data?


Evolution of Big Data

As inconceivable as it seems today, the Apollo Guidance Computer took the first spaceship to the moon with fewer than 80 kilobytes of memory. Since then, computer technology has grown at an exponential rate – and data generation along with it. In fact, the world’s technological capacity to store data has been doubling about every three years since the 1980s. Just over 50 years ago when Apollo 11 lifted off, the amount of digital data generated in the entire world could have fit on the average laptop. Today, the IDC estimates that number to be at 44 zettabytes (or 44 trillion gigabytes) and further predicts it to grow to 163 zettabytes by 2025.  


As software and technology become more and more advanced, non-digital systems become less and less viable by comparison. Data that is generated and gathered digitally demands more advanced data management systems to handle it. In addition, the exponential growth of social media platforms, smartphone technologies, and digitally connected IoT devices has helped create the current Big Data era.

What is structured and unstructured data?

Datasets are typically categorized into three types, based on their structure and how straightforward (or not) they are to index.

Structured data

This kind of data is the simplest to organize and search. It can include things like financial data, machine logs, and demographic details. An Excel spreadsheet, with its layout of pre-defined columns and rows, is a good way to envision structured data. Its components are easily categorized, allowing database designers and administrators to define simple algorithms for search and analysis. Even in enormous volumes, structured data doesn’t necessarily qualify as Big Data on its own, because it is relatively simple to manage and therefore doesn’t meet the defining criteria of Big Data. Traditionally, databases have used a programming language called Structured Query Language (SQL) to manage structured data. SQL was developed by IBM in the 1970s to allow developers to build and manage the relational (spreadsheet-style) databases that were beginning to take off at that time.
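
As a simple illustration (not from the article), here is a minimal sketch of how structured data lends itself to pre-defined columns and direct querying, using Python’s built-in sqlite3 module as a stand-in for a relational database. The “customers” table and its values are hypothetical.

```python
import sqlite3

# Structured data: every value sits in a known, pre-defined column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT, lifetime_value REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, country, lifetime_value) VALUES (?, ?, ?)",
    [("Ada", "DE", 1200.50), ("Grace", "US", 980.00), ("Linus", "FI", 450.75)],
)

# Because the layout is fixed, a simple SQL query can search and
# aggregate the data directly.
for row in conn.execute(
    "SELECT country, COUNT(*), AVG(lifetime_value) FROM customers GROUP BY country"
):
    print(row)

conn.close()
```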

Unstructured data

This category of data can include things like social media posts, audio files, images, and open-ended customer comments. This kind of data cannot be easily captured in standard row-column relational databases. Traditionally, companies that wanted to search, manage, or analyze large amounts of unstructured data had to use laborious manual processes. There was never any question as to the potential value of analyzing and understanding such data, but the cost of doing so was often too high to make it worthwhile. And given the time the analysis took, results were often obsolete before they were even delivered. Instead of spreadsheets or relational databases, unstructured data is usually stored in data lakes, data warehouses, and NoSQL databases.

Semi-structured data

As it sounds, semi-structured data is a hybrid of structured and unstructured data. E-mails are a good example: they include unstructured data in the body of the message, as well as organizational properties such as sender, recipient, subject, and date. Devices that use geo-tagging, time stamps, or semantic tags can also deliver structured data alongside unstructured content. A smartphone photo with no caption, for instance, can still tell you when and where it was taken, and image recognition can identify it as, say, a selfie. A modern database running AI technology can not only instantly identify different types of data, it can also apply algorithms in real time to effectively manage and analyze the disparate datasets involved.
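
To make the e-mail example concrete, here is a minimal sketch (not from the article) of a semi-structured record in Python: the header fields are structured and can be filtered like database columns, while the body is free-form text. The message content is hypothetical.

```python
import json
from datetime import datetime

# A semi-structured record: structured header fields plus an unstructured body.
raw_message = json.dumps({
    "from": "alice@example.com",
    "to": "bob@example.com",
    "subject": "Q3 report",
    "sent_at": "2024-07-01T09:15:00",
    "body": "Hi Bob, attached is the draft of the Q3 report. Let me know what you think!",
})

message = json.loads(raw_message)

# The structured parts can be indexed and filtered directly...
sent_at = datetime.fromisoformat(message["sent_at"])
print(message["from"], "->", message["to"], "on", sent_at.date())

# ...while the unstructured body needs text processing (here, a naive keyword check).
print("mentions a report:", "report" in message["body"].lower())
```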

What are the sources of Big Data?

The range of data-generating things is growing at a phenomenal rate – from drones and satellites to toasters. But for the purposes of categorization, data sources are generally broken down into three types:

Social data

As it sounds, social data is generated by social media comments, posts, images, and, increasingly, video. And with the growing global ubiquity of 4G and 5G cellular networks, the number of people in the world who regularly watch video content on their smartphones is estimated to rise to 2.72 billion by 2023. Although trends in social media and its usage tend to change quickly and unpredictably, what does not change is its steady growth as a generator of digital data.

Machine data

IoT devices and machines are fitted with sensors and have the ability to send and receive digital data. IoT sensors help companies collect and process machine data from devices, vehicles, and equipment across the business. Globally, the number of data-generating things is rapidly growing – from weather and traffic sensors to security surveillance. The IDC estimates that by 2025 there will be over 40 billion IoT devices on earth, generating almost half the world’s total digital data.

Transactional data

This is some of the world’s fastest moving and growing data. For example, a large international retailer is known to process over one million customer transactions every hour. And when you add in all the world’s purchasing and banking transactions, you get a picture of the staggering volume of data being generated. Furthermore, transactional data increasingly includes semi-structured elements, such as images and comments, making it all the more complex to manage and process.

What are the five V’s of Big Data?

Just because a dataset is big doesn’t mean it is Big Data. To qualify as such, data must possess at least the following five characteristics:

Volume

While volume is by no means the only component that makes Big Data “big,” it is certainly a primary feature. To fully manage and utilize Big Data, advanced algorithms and AI-driven analytics are required. But before any of that can happen, there needs to be a secure and reliable means of storing, organizing, and retrieving the many terabytes of data that are held by large companies.

Velocity

In the past, any data that was generated had to later be entered into a traditional database system – often manually – before it could be analyzed or retrieved. Today, Big Data technology allows databases to process, analyze, and act on data while it is being generated – sometimes within milliseconds. For businesses, that means real-time data can be used to capture financial opportunities, respond to customer needs, thwart fraud, and address any other activity where speed is critical.

Variety

Datasets composed solely of structured data are not necessarily Big Data, regardless of how voluminous they are. Big Data typically combines structured, unstructured, and semi-structured data. Traditional databases and data management solutions lack the flexibility and scope to manage the complex, disparate datasets that make up Big Data.

Veracity

While modern database technology makes it possible for companies to amass and make sense of staggering amounts and types of Big Data, that data is only valuable if it is accurate, relevant, and timely. For traditional databases that were populated only with structured data, syntactical errors and typos were the usual culprits when it came to data accuracy. With unstructured data comes a whole new set of veracity challenges: human bias, social noise, and data provenance issues can all have an impact on data quality.

Value

Without question, the results that come from Big Data analysis are often fascinating and unexpected. But for businesses, Big Data analytics must deliver insights that help them become more competitive and resilient – and better serve their customers. Modern Big Data technologies open up the capacity for collecting and retrieving data that can provide measurable benefit to both the bottom line and operational resilience.

The importance of Big Data analytics

Modern Big Data management solutions allow companies to turn raw data into relevant insights – with unprecedented speed and accuracy.

  • Product and service development: Big Data analytics allows product developers to analyze unstructured data, such as customer reviews and cultural trends, and respond quickly.
  • Predictive maintenance: In an international survey, McKinsey found that the analysis of Big Data from IoT-enabled machines reduced equipment maintenance costs by up to 40%.
  • Customer experience: In a 2020 survey of global business leaders, Gartner determined that “growing companies are more actively collecting customer experience data than nongrowth companies.” Big Data analysis allows businesses to improve and personalize their customers’ experience with their brand.
  • Resilience and risk management: The COVID-19 pandemic was a sharp awakening for many business leaders as they realized just how vulnerable their operations were to disruption. Big Data insights can help companies anticipate risk and prepare for the unexpected.
  • Cost savings and greater efficiency: When businesses apply advanced Big Data analytics across all processes within their organization, they are able not only to spot inefficiencies but also to implement fast and effective solutions.
  • Improved competitiveness: The insights gleaned from Big Data can help companies save money, please customers, make better products, and innovate business operations.

AI and Big Data

Big Data management is dependent upon systems with the power to process and meaningfully analyze vast amounts of disparate and complex information. In this regard, Big Data and AI have a somewhat reciprocal relationship. Big Data would not have a lot of practical use without AI to organize and analyze it. And AI depends upon the breadth of the datasets contained within Big Data to deliver analytics that are sufficiently robust to be actionable. As Forrester Research analyst Brandon Purcell puts it, “Data is the lifeblood of AI. An AI system needs to learn from data in order to be able to fulfill its function.”


Machine learning and Big Data

Machine learning algorithms ingest incoming data and identify patterns within it. Those insights are delivered to help inform business decisions and automate processes. Machine learning thrives on Big Data, because the more robust the datasets being analyzed, the greater the opportunity for the system to learn and to continuously evolve and adapt its processes.
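
As a minimal sketch of the pattern-learning idea described above (not from the article), the example below trains a simple classifier on synthetic data using scikit-learn, which is assumed to be installed. Real Big Data pipelines would train on far larger, distributed datasets, but the principle is the same: the model learns patterns from training data and applies them to new records.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a hypothetical dataset: 10,000 records with 20 features each.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The model identifies patterns in the training data...
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# ...and those patterns are then used to score new, unseen records.
print("accuracy on unseen data:", round(model.score(X_test, y_test), 3))
```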

Explore data management solutions from SAP

Manage your diverse data landscape – across data warehouses, data lakes, and databases – with a choice of on-premise and cloud solutions to meet your specific needs.

Big Data FAQs

What is Big Data used for?

Big Data consists of all potentially business-relevant data – both structured and unstructured – from a variety of disparate sources. Once analyzed, it is used to provide deeper insight and more accurate information about all operational areas of a business and its market.

What is Big Data technology?

Big Data technology applies to all the tools, software, and techniques that are used to process and analyze Big Data – including (but not limited to) data mining, data storage, data sharing, and data visualization.

What is Hadoop used for?

Apache Hadoop is an open-source, distributed processing software solution. It is used to speed up and facilitate Big Data management by connecting several computers and allowing them to process Big Data in parallel.

What is Spark used for?

Apache Spark is an open-source, distributed processing software solution. It is used to speed up and facilitate Big Data management by connecting several computers and allowing them to process Big Data in parallel. Hadoop, which preceded it, is still more widely deployed, but Spark is gaining popularity because it processes data in memory – making it significantly faster for many workloads – and ships with built-in libraries for machine learning and stream processing.
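
To illustrate the parallel-processing idea that both Hadoop and Spark are built around, here is a minimal PySpark sketch (not from the article). It assumes the pyspark package is installed and a local Spark runtime is available; the file name "transactions.csv" and the columns "store_id" and "amount" are hypothetical. In production, Spark would read from a distributed store such as HDFS or an object store and run across a cluster rather than a single machine.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session that uses all available CPU cores.
spark = SparkSession.builder.appName("transactions-demo").master("local[*]").getOrCreate()

# Load a (hypothetical) CSV of transactions and aggregate it in parallel.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)
summary = df.groupBy("store_id").agg(
    F.count("*").alias("num_transactions"),
    F.sum("amount").alias("total_amount"),
)
summary.show()

spark.stop()
```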

What is a data lake?

A data lake is a repository in which large amounts of raw, unstructured data can be stored and retrieved. Data lakes are necessary because much of Big Data is unstructured and cannot be stored in a traditional row-column relational database. 

What is dark data?

Dark data is all the data that companies collect as part of their regular business operations (such as surveillance footage and website log files). It is saved for compliance purposes but is typically never used, and these large datasets often cost more to store than the value they deliver.

What is data fabric?

Data fabric is the integration of Big Data architecture and technologies across an entire business ecosystem. Its purpose is to connect Big Data from all sources and of all types, with all data management services across the business.  
