What is Big Data?
Big Data is the ocean of information we swim in every day: vast zettabytes of data flowing from our computers, mobile devices, and machine sensors. This data is used by organizations to drive decisions, improve processes and policies, and create customer-centric products, services, and experiences. Big Data is defined as “big” not just because of its volume, but also due to the variety and complexity of its nature. Typically, it exceeds the capacity of traditional databases to capture, manage, and process it. Big Data can come from anywhere or anything on earth that we’re able to monitor digitally. Weather satellites, Internet of Things (IoT) devices, traffic cameras, and social media trends are just a few of the data sources being mined and analyzed to make businesses more resilient and competitive.
The true value of Big Data is measured by the degree to which you are able to analyze and understand it. Artificial intelligence (AI), machine learning, and modern database technologies allow for Big Data visualization and analysis to deliver actionable insights – in real time. Big Data analytics help companies put their data to work – to realize new opportunities and build business models. As Geoffrey Moore, author and management analyst, aptly stated, “Without Big Data analytics, companies are blind and deaf, wandering out onto the Web like deer on a freeway.”
As inconceivable as it seems today, the Apollo Guidance Computer took the first spaceship to the moon with fewer than 80 kilobytes of memory. Since then, computer technology has grown at an exponential rate, and data generation along with it. In fact, the world’s technological capacity to store data has been doubling about every three years since the 1980s. Just over 50 years ago, when Apollo 11 lifted off, the amount of digital data generated in the entire world could have fit on the average laptop. Today, IDC estimates that number at 44 zettabytes (or 44 trillion gigabytes) and predicts that it will grow to 163 zettabytes by 2025.
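The unit conversion and the doubling rate quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units; the function names are illustrative and do not come from any library.

```python
# SI (power-of-ten) storage units: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes.
GIGABYTES_PER_ZETTABYTE = 10**21 // 10**9  # 10**12, i.e. one trillion


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert zettabytes to gigabytes using SI units."""
    return zettabytes * GIGABYTES_PER_ZETTABYTE


def growth_factor(years: float, doubling_period_years: float = 3.0) -> float:
    """How many times larger a quantity gets if it doubles every period."""
    return 2 ** (years / doubling_period_years)


print(f"{zettabytes_to_gigabytes(44):,} GB")  # 44,000,000,000,000 GB
print(growth_factor(30))  # 1024.0
```

Under the three-year doubling assumption, thirty years of growth corresponds to roughly a thousandfold increase in storage capacity.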
As software and technology become more advanced, non-digital systems become less viable by comparison. Data generated and gathered digitally demands more advanced data management systems to handle it. In addition, the exponential growth of social media platforms, smartphone technologies, and digitally connected IoT devices has helped create the current Big Data era.
Data sets are typically categorized into three types based on their structure and how straightforward (or not) they are to index.
- Structured data: This kind of data is the simplest to organize and search. It can include things like financial data, machine logs, and demographic details. An Excel spreadsheet, with its layout of pre-defined columns and rows, is a good way to envision structured data. Its components are easily categorized, allowing database designers and administrators to define simple algorithms for search and analysis. Even when structured data exists in enormous volume, it doesn’t necessarily qualify as Big Data because structured data on its own is relatively simple to manage and therefore doesn’t meet the defining criteria of Big Data. Traditionally, databases have used a programming language called Structured Query Language (SQL) in order to manage structured data. SQL was developed by IBM in the 1970s to allow developers to build and manage relational (spreadsheet style) databases that were beginning to take off at that time.
- Unstructured data: This category of data can include things like social media posts, audio files, images, and open-ended customer comments. This kind of data cannot be easily captured in standard row-column relational databases. Traditionally, companies that wanted to search, manage, or analyze large amounts of unstructured data had to use laborious manual processes. There was never any question as to the potential value of analyzing and understanding such data, but the cost of doing so was often too exorbitant to make it worthwhile. Considering the time it took, results were often obsolete before they were even delivered. Instead of spreadsheets or relational databases, unstructured data is usually stored in data lakes, data warehouses, and NoSQL databases.
- Semi-structured data: As it sounds, semi-structured data is a hybrid of structured and unstructured data. E-mails are a good example as they include unstructured data in the body of the message, as well as more organizational properties such as sender, recipient, subject, and date. Devices that use geo-tagging, time stamps, or semantic tags can also deliver structured data alongside unstructured content. An unidentified smartphone image, for instance, can still tell you that it is a selfie, and the time and place where it was taken. A modern database running AI technology can not only instantly identify different types of data, it can also generate algorithms in real time to effectively manage and analyze the disparate data sets involved.
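The difference between structured and semi-structured data can be sketched with Python's standard library. The example below is illustrative only: the table, the customer records, and the e-mail are made-up sample data, and an in-memory SQLite database stands in for a relational database.

```python
import sqlite3
from email import message_from_string

# Structured data: fixed columns, so SQL can query it directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", "PT", 34), ("Ben", "US", 51), ("Chen", "US", 28)],
)
us_customers = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name", ("US",)
).fetchall()
print(us_customers)  # [('Ben',), ('Chen',)]

# Semi-structured data: an e-mail has structured headers and a free-text body.
raw_email = (
    "From: ana@example.com\n"
    "To: support@example.com\n"
    "Subject: Late delivery\n"
    "Date: Mon, 06 Jul 2020 09:30:00 +0000\n"
    "\n"
    "My order arrived a week late and the box was damaged.\n"
)
message = message_from_string(raw_email)
headers = {key: message[key] for key in ("From", "To", "Subject", "Date")}
body = message.get_payload()  # unstructured text, not queryable by column

print(headers["Subject"])  # Late delivery
print(body.strip())
```

The headers can be indexed and filtered like table columns, while the body would need text analysis before it could answer a question such as "what was the complaint about?".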
The range of data-generating things is growing at a phenomenal rate – from drone satellites to toasters. But for the purposes of categorization, data sources are generally broken down into three types:
- Social data: As it sounds, social data is generated by social media comments, posts, images, and, increasingly, video. And with the growing global ubiquity of 4G and 5G cellular networks, it is estimated that the number of people in the world who regularly watch video content on their smartphones will rise to 2.72 billion by 2023. Although trends in social media and its usage tend to change quickly and unpredictably, what does not change is its steady growth as a generator of digital data.
- Machine data: IoT devices and machines are fitted with sensors and have the ability to send and receive digital data. IoT sensors help companies collect and process machine data from devices, vehicles, and equipment across the business. Globally, the number of data-generating things is rapidly growing, from weather and traffic sensors to security surveillance. IDC estimates that by 2025 there will be over 40 billion IoT devices on earth, generating almost half the world’s total digital data.
- Transactional data: This is some of the world’s fastest moving and growing data. For example, a large international retailer is known to process over one million customer transactions every hour. And when you add in all the world’s purchasing and banking transactions, you get a picture of the staggering volume of data being generated. Furthermore, transactional data increasingly includes semi-structured data, such as images and comments, making it all the more complex to manage and process.
Just because a data set is big, it isn’t necessarily Big Data. To qualify as such, data must possess at least the following five characteristics:
- Volume: While volume is by no means the only component that makes Big Data “big,” it is certainly a primary feature. To fully manage and utilize Big Data, advanced algorithms and AI-driven analytics are required. But before any of that can happen, there needs to be a secure and reliable means of storing, organizing, and retrieving the many terabytes of data that are held by large companies.
- Velocity: In the past, any data that was generated had to later be entered into a traditional database system – often manually – before it could be analyzed or retrieved. Today, Big Data technology allows databases to process, analyze, and configure data while it is being generated – sometimes within milliseconds. For businesses, that means real-time data can be used to capture financial opportunities, respond to customer needs, thwart fraud, and address any other activity where speed is critical.
- Variety: Data sets that consist solely of structured data are not necessarily Big Data, regardless of how voluminous they are. Big Data typically combines structured, unstructured, and semi-structured data. Traditional databases and data management solutions lack the flexibility and scope to manage the complex, disparate data sets that make up Big Data.
- Veracity: While modern database technology makes it possible for companies to amass and make sense of staggering amounts and types of Big Data, it’s only valuable if it is accurate, relevant, and timely. For traditional databases that were populated only with structured data, syntactical errors and typos were the usual culprits when it came to data accuracy. With unstructured data, there is a whole new set of veracity challenges. Human bias, social noise, and data provenance issues can all have an impact upon the quality of data.
- Value: Without question, the results that come from Big Data analysis are often fascinating and unexpected. But for businesses, Big Data analytics must deliver insights that can help them become more competitive and resilient, and better serve their customers. Modern Big Data technologies open up the capacity for collecting and retrieving data that can provide measurable benefit to both bottom lines and operational resilience.
Modern Big Data management solutions allow companies to turn raw data into relevant insights – with unprecedented speed and accuracy.
- Product and service development: Big Data analytics allows product developers to analyze unstructured data, such as customer reviews and cultural trends, and respond quickly.
- Predictive maintenance: In an international survey, McKinsey found that the analysis of Big Data from IoT-enabled machines reduced equipment maintenance costs by up to 40%.
- Customer experience: In a 2020 survey of global business leaders, Gartner determined that “growing companies are more actively collecting customer experience data than nongrowth companies.” Big Data analysis allows businesses to improve and personalize their customers’ experience with their brand.
- Resilience and risk management: The COVID-19 pandemic was a sharp awakening for many business leaders as they realized just how vulnerable their operations were to disruption. Big Data insights can help companies anticipate risk and prepare for the unexpected.
- Cost savings and greater efficiency: When businesses apply advanced Big Data analytics across all processes within their organization, they are able not only to spot inefficiencies but also to implement fast and effective solutions.
- Improved competitiveness: The insights gleaned from Big Data can help companies save money, please customers, make better products, and innovate business operations.
Big Data management is dependent upon systems with the power to process and meaningfully analyze vast amounts of disparate and complex information. In this regard, Big Data and AI have a somewhat reciprocal relationship. Big Data would not have a lot of practical use without AI to organize and analyze it. And AI depends upon the breadth of the data sets contained within Big Data to deliver analytics that are sufficiently robust to be actionable. As Forrester Research analyst Brandon Purcell puts it, “Data is the lifeblood of AI. An AI system needs to learn from data in order to be able to fulfill its function.”
Machine learning algorithms define the incoming data and identify patterns within it. These insights are delivered to help inform business decisions and automate processes. Machine learning thrives on Big Data because the more robust the data sets being analyzed, the greater the opportunity for the system to learn and continuously evolve and adapt its processes.
Big Data architecture
As with architecture in building construction, Big Data architecture provides a blueprint for the foundational structure of how businesses will manage and analyze their data. Big Data architecture maps the processes necessary to manage Big Data on its journey across four basic “layers,” from data sources, to data storage, then on to Big Data analysis, and finally through the consumption layer in which the analyzed results are presented as business intelligence.
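The four layers described above can be pictured as a chain of functions, each handing its output to the next. This is a toy sketch, not a reference architecture: the function names and sample sales events are hypothetical, and a plain Python list stands in for a real storage system.

```python
from collections import Counter


def source_layer():
    """Data sources: yield raw events (hypothetical sample records)."""
    yield {"store": "north", "amount": 20.0}
    yield {"store": "south", "amount": 35.5}
    yield {"store": "north", "amount": 4.5}


def storage_layer(events):
    """Data storage: persist raw events; a list stands in for a real store."""
    return list(events)


def analysis_layer(stored):
    """Analysis: aggregate revenue per store."""
    totals = Counter()
    for event in stored:
        totals[event["store"]] += event["amount"]
    return dict(totals)


def consumption_layer(totals):
    """Consumption: present the results as a simple text report."""
    return [f"{store}: {total:.2f}" for store, total in sorted(totals.items())]


report = consumption_layer(analysis_layer(storage_layer(source_layer())))
print("\n".join(report))
# north: 24.50
# south: 35.50
```

Keeping each layer behind its own function boundary mirrors the point of the blueprint: any one layer can be swapped out without rewriting the others.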
Big Data analytics
This process allows for meaningful data visualization through the use of data modeling and algorithms specific to Big Data characteristics. In an in-depth study and survey from the MIT Sloan School of Management, over 2,000 business leaders were asked about their company’s experience with Big Data analysis. Unsurprisingly, those who were engaged in and supportive of developing their Big Data management strategies achieved the most measurably beneficial business results.
Big Data and Apache Hadoop
Picture 10 dimes in a single large box mixed in with 100 nickels. Then picture 10 smaller boxes, side by side, each with 10 nickels and only one dime. In which scenario will it be easier to spot the dimes? Hadoop basically works on this principle. It is an open-source framework for managing distributed Big Data processing across a network of many connected computers. So instead of using one large computer to store and process all the data, Hadoop clusters multiple computers into an almost infinitely scalable network and analyzes the data in parallel. This process typically uses a programming model called MapReduce, which coordinates Big Data processing by marshalling the distributed computers.
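The MapReduce programming model can be illustrated with the coin analogy itself. The sketch below runs in a single Python process and does not use Hadoop; the function names are illustrative, and in a real cluster each box would be mapped on a different machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(box):
    """Map: emit a (coin, 1) pair for every coin in one box."""
    return [(coin, 1) for coin in box]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Ten small boxes, each holding ten nickels and one dime, as in the analogy.
boxes = [["nickel"] * 10 + ["dime"] for _ in range(10)]

# Here the built-in map() processes the boxes one after another; a cluster
# would run map_phase on many boxes at the same time.
mapped = chain.from_iterable(map(map_phase, boxes))
coin_counts = reduce_phase(shuffle_phase(mapped))
print(coin_counts)  # {'nickel': 100, 'dime': 10}
```

Because each box is mapped independently of the others, the map step is the part that can be spread across many computers.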
Data lakes, data warehouses, and NoSQL
Traditional SQL spreadsheet-style databases are used for storing structured data. Unstructured and semi-structured Big Data requires unique storage and processing paradigms, as it does not lend itself to being indexed and categorized. Data lakes, data warehouses, and NoSQL databases are all data repositories that manage non-traditional data sets. A data lake is a vast pool of raw data which has yet to be processed. A data warehouse is a repository for data that has already been processed for a specific purpose. NoSQL databases provide a flexible schema that can be modified to suit the nature of the data to be processed. Each of these systems has its strengths and weaknesses and many businesses use a combination of these different data repositories to best suit their needs.
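A flexible schema means that records in the same collection do not all need the same fields. The sketch below imitates a document store with a plain Python list; the documents and the `find` helper are hypothetical and do not represent the API of any particular NoSQL database.

```python
import json

# Hypothetical documents: each record carries only the fields it needs.
documents = [
    {"id": 1, "type": "review", "text": "Great battery life", "stars": 5},
    {"id": 2, "type": "photo", "tags": ["selfie"], "taken_at": "2020-07-06T09:30:00Z"},
    {"id": 3, "type": "review", "text": "Stopped working", "stars": 1, "verified": True},
]


def find(docs, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


reviews = find(documents, type="review")
review_ids = [doc["id"] for doc in reviews]
print(review_ids)  # [1, 3]

# A document serializes to JSON as-is; no table definition has to change
# when a new field such as "verified" appears.
print(json.dumps(documents[1]))
```

In a relational table, adding the `verified` or `tags` field would first require altering the table definition for every row.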
Traditional disk-based databases were developed with SQL and relational database technologies in mind. While they may be able to handle large volumes of structured data, they simply aren’t designed to best store and process unstructured data. With in-memory databases, processing and analysis take place entirely in RAM, as opposed to having to retrieve the data from a disk-based system. Many in-memory databases are also built on distributed architectures, which means they can achieve far greater speeds through parallel processing than single-node, disk-based database models.
Big Data works when its analysis delivers relevant and actionable insights that measurably improve the business. In preparation for Big Data transformation, businesses should ensure that their systems and processes are sufficiently ready to gather, store, and analyze Big Data.
- Gather Big Data. Much of Big Data is comprised of massive sets of unstructured data, flooding in from disparate and inconsistent sources. Traditional disk-based databases and data integration mechanisms are simply not equal to the task of handling this. Big Data management requires the adoption of in-memory database solutions and software solutions specific to Big Data acquisition.
- Store Big Data. By its very name, Big Data is voluminous. Many businesses have on-premises storage solutions for their existing data and hope to economize by repurposing those repositories to meet their Big Data processing needs. However, Big Data works best when it is unconstrained by size and memory limitations. Businesses that fail to incorporate cloud storage solutions into their Big Data models from the beginning often regret this a few months down the road.
- Analyze Big Data. Without the application of AI and machine learning technologies to Big Data analysis, it is simply not feasible to realize its full potential. One of the five V’s of Big Data is “velocity.” For Big Data insights to be actionable and valuable, they must come quickly. Analytics processes have to be self-optimizing and able to learn from experience on a regular basis – an outcome which can only be achieved with AI functionality and modern database technologies.
The insights and deep learning afforded by Big Data can offer benefit to virtually any business or industry. However, large organizations with complex operational remits are often able to make the most meaningful use of Big Data.
- Finance
In the Journal of Big Data, a 2020 study points out that Big Data “plays an important role in changing the financial services sector, particularly in trade and investment, tax reform, fraud detection and investigation, risk analysis, and automation.” Big Data has also helped to transform the financial industry by analyzing customer data and feedback to gain the valuable insights needed to improve customer satisfaction and experience. Transactional data sets are some of the fastest moving and largest in the world. The growing adoption of advanced Big Data management solutions will help banks and financial institutions protect this data and use it in ways that benefit and protect both the customer and the business.
- Healthcare
Big Data analysis allows healthcare professionals to make more accurate and evidence-based diagnoses. Additionally, Big Data helps hospital administrators spot trends, manage risks, and minimize unnecessary spending – driving the highest possible budgets to areas of patient care and research. In the midst of the pandemic, research scientists around the world are racing toward better ways to treat and manage COVID-19 – and Big Data is playing an enormous role in this process. A July 2020 article in The Scientist describes how medical teams were able to collaborate and analyze Big Data to help fight coronavirus: “We may transform the way clinical science is done, leveraging the tools and resources of Big Data and data science in ways that have not been possible.”
- Transportation and Logistics
The Amazon Effect is a term that describes how Amazon has set the bar for next-day delivery expectations to where customers now demand that kind of shipping speed for anything they order online. Entrepreneur magazine points out that as a direct result of the Amazon Effect, “the ‘last mile’ logistics race will grow more competitive.” Logistics companies are increasingly relying upon Big Data analytics to optimize route planning, load consolidation, and fuel efficiency measures.
- Education
During the pandemic, educational institutions around the world have had to reinvent their curricula and teaching methods to support remote learning. A major challenge to this process has been finding reliable ways to analyze and assess students’ performance and the overall effectiveness of online teaching methods. A 2020 article about the impact of Big Data on education and online learning makes an observation about teachers: “Big data makes them feel much more confident in personalizing education, developing blended learning, transforming assessment systems, and promoting life-long learning.”
- Energy and Utilities
According to the U.S. Bureau of Labor Statistics, utility companies spend over US$1.4 billion on meter readers and typically rely upon analog meters and infrequent manual readings. Smart meters deliver digital data many times a day and, with the benefit of Big Data analysis, this information can inform more efficient energy usage and more accurate pricing and forecasting. Furthermore, when field workers are freed up from meter reading, data capture and analysis can help reallocate them more quickly to where repairs and upgrades are most urgently needed.
Big Data FAQs
Big Data is comprised of all potentially business-relevant data – both structured and unstructured – from a variety of disparate sources. Once analyzed, it is used to provide deeper insight and more accurate information about all operational areas of a business and its market.
Big Data technology applies to all the tools, software, and techniques that are used to process and analyze Big Data – including (but not limited to) data mining, data storage, data sharing, and data visualization.
Apache Hadoop is an open-source, distributed processing software solution. It is used to speed up and facilitate Big Data management by connecting several computers and allowing them to process Big Data in parallel.
Apache Spark is an open-source, distributed processing software solution. It is used to speed up and facilitate Big Data management by connecting several computers and allowing them to process Big Data in parallel. Hadoop predates Spark, but Spark has gained popularity because it keeps working data in memory where possible, which often makes it faster than Hadoop’s disk-based MapReduce, and because it includes libraries for machine learning and other workloads.
A data lake is a repository in which large amounts of raw, unstructured data can be stored and retrieved. Data lakes are necessary because much of Big Data is unstructured and cannot be stored in a traditional row-column relational database.
Dark data is all the data that companies collect as part of their regular business operations (such as surveillance footage and website log files). It is saved for compliance purposes but is typically never used. These large data sets often cost more to store than the value they bring.
Data fabric is the integration of Big Data architecture and technologies across an entire business ecosystem. Its purpose is to connect Big Data from all sources and of all types, with all data management services across the business.