What is data mining?
Data mining is the process of extracting useful information from an accumulation of data, often from a data warehouse or collection of linked data sets. Data mining tools include powerful statistical, mathematical, and analytics capabilities whose primary purpose is to sift through large sets of data to identify trends, patterns, and relationships to support informed decision-making and planning.
Often associated with marketing department inquiries, data mining is seen by many executives as a way to help them better understand demand and to see the effect that changes in products, pricing, or promotion have on sales. But data mining has considerable benefit for other business areas as well. Engineers and designers can analyze the effectiveness of product changes and look for possible causes of product success or failure related to how, when, and where products are used. Service and repair operations can better plan parts inventory and staffing. Professional service organizations can use data mining to identify new opportunities from changing economic trends and demographic shifts.
Data mining becomes more useful and valuable with bigger data sets and with more user experience. Logically, the more data, the more insights and intelligence should be buried there. Also, the more familiar users become with the tools and the better they understand the database, the more creative they can be with their explorations and analyses.
The primary benefit of data mining is its power to identify patterns and relationships in large volumes of data from multiple sources. With more and more data available – from sources as varied as social media, remote sensors, and increasingly detailed reports of product movement and market activity – data mining offers the tools to fully exploit Big Data and turn it into actionable intelligence. What’s more, it can act as a mechanism for “thinking outside the box.”
The data mining process can detect surprising and intriguing relationships and patterns in seemingly unrelated bits of information. Because information tends to be compartmentalized, it has historically been difficult or impossible to analyze as a whole. However, there may be a relationship between external factors – perhaps demographic or economic factors – and the performance of a company’s products. And while executives regularly look at sales numbers by territory, product line, distribution channel, and region, they often lack external context for this information. Their analysis points out “what happened” but does little to uncover the “why it happened this way.” Data mining can fill this gap.
Data mining can look for correlations with external factors; while correlation does not always indicate causation, these trends can be valuable indicators to guide product, channel, and production decisions. The same analysis benefits other parts of the business from product design to operational efficiency and service delivery.
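As a rough illustration of the correlation check described above, the sketch below computes a Pearson correlation coefficient between an internal series and an external one. The data values and the names `monthly_sales` and `unemployment_rate` are hypothetical, chosen only to show the calculation, and a strong coefficient here says nothing about cause.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: monthly unit sales and a regional unemployment rate (%).
monthly_sales = [120, 135, 150, 160, 172, 180]
unemployment_rate = [6.1, 5.8, 5.5, 5.2, 5.0, 4.7]

r = pearson_correlation(monthly_sales, unemployment_rate)
print(f"correlation between sales and unemployment: {r:.3f}")
```

A value near -1 or +1 suggests the two series move together closely; a value near 0 suggests little linear relationship.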
People have been collecting and analyzing data for thousands of years and, in many ways, the process has remained the same: identify the information needed, find quality data sources, collect and combine the data, use the most effective tools available to analyze the data, and capitalize on what you’ve learned. As computing and data-based systems have grown and advanced, so have the tools for managing and analyzing data. The real inflection point came in the 1970s with the development of relational database technology and user-oriented query languages like Structured Query Language (SQL). No longer was data only available through custom-coded programs. With this breakthrough, business users could interactively explore their data and tease out the hidden gems of intelligence buried inside.
Data mining has traditionally been a specialty skill set within data science. Every new generation of analytical tools, however, starts out requiring advanced technical skills but quickly evolves to become accessible to users. Interactivity – the ability to let the data talk to you – is the key advancement. Ask a question; see the answer. Based on what you learn, ask another question. This kind of unstructured roaming through the data takes the user beyond the confines of the application-specific database design and allows for the discovery of relationships that cross functional and organizational boundaries.
Data mining is a key component of business intelligence. Data mining tools are built into executive dashboards, harvesting insight from Big Data, including data from social media, Internet of Things (IoT) sensor feeds, location-aware devices, unstructured text, video, and more. Modern data mining relies on the cloud and virtual computing, as well as in-memory databases, to manage data from many sources cost-effectively and to scale on demand.
- Understand the problem – or at least the area of inquiry. The business decision-maker, who should be in the driver’s seat for this data mining off-road adventure, needs a general understanding of the domain they will be working in – the types of internal and external data that are to be a part of this exploration. It is assumed that they have intimate knowledge of the business and the functional areas involved.
- Data gathering. Start with your internal systems and databases. Link them through their data models and various relational tools or gather the data together into a data warehouse. This includes any data from external sources that are part of your operations, like field sales and/or service data, IoT, or social media data. Seek out and acquire the rights to external data including demographics, economic data, and market intelligence, such as industry trends and financial benchmarks from trade associations and governments. Bring them into the tool kit’s purview (that is, load them into your data warehouse or link them to the data mining environment).
- Data preparation and understanding. Use your business’ subject matter experts to help define, categorize, and organize the data. This part of the process is sometimes called data wrangling or munging. Some of the data may need cleaning or “cleansing” to remove duplication, inconsistencies, incomplete records, or outdated formats. Data preparation and cleansing may be an ongoing task as new projects or data from new fields of inquiry become of interest.
- User training. You wouldn’t give your teenager the keys to the family Ferrari without having them go through driver’s education, on-the-road training, and some supervised practice with a licensed driver – so be sure to provide formal training to your future data miners as well as some supervised practice as they start to get familiar with these powerful tools. Continuing education is also a good idea once they have mastered the basics and can move on to more advanced techniques.
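The data preparation step in the list above (removing duplication, incomplete records, and outdated formats) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the record layout, the field names `customer_id`, `city`, and `signup`, and the two date formats are all hypothetical, and real cleansing rules would come from your subject matter experts.

```python
from datetime import datetime

# Hypothetical raw customer records gathered from two source systems.
RAW_RECORDS = [
    {"customer_id": "C001", "city": " Austin ", "signup": "2021-03-15"},
    {"customer_id": "C001", "city": "Austin", "signup": "2021-03-15"},  # duplicate
    {"customer_id": "C002", "city": "Denver", "signup": "03/22/2021"},  # older date format
    {"customer_id": "C003", "city": "", "signup": "2021-04-01"},        # incomplete
]

# Assumed date formats: ISO first, then an older month/day/year style.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def normalize_date(text):
    """Return the date as an ISO string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(records):
    """Trim whitespace, normalize dates, drop incomplete rows and duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        row = {
            "customer_id": rec.get("customer_id", "").strip(),
            "city": rec.get("city", "").strip(),
            "signup": normalize_date(rec.get("signup", "").strip()),
        }
        if not all(row.values()):
            continue  # skip rows with an empty or unparseable field
        key = (row["customer_id"], row["city"], row["signup"])
        if key in seen:
            continue  # skip exact duplicates after normalization
        seen.add(key)
        cleaned.append(row)
    return cleaned

cleaned_records = clean(RAW_RECORDS)
for row in cleaned_records:
    print(row)
```

Dropping incomplete rows is only one possible policy; depending on the analysis, flagging them for review may be the better choice.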
Keep in mind that data mining is based on a tool kit rather than a fixed routine or process. Specific data mining techniques cited here are merely examples of how the tools are being used by organizations to explore their data in search of trends, correlations, intelligence, and business insight.
Generally speaking, data mining approaches can be categorized as directed – focused on a specific desired result – or undirected as a discovery process. Other explorations might be aimed at sorting or classifying data, such as grouping prospective customers according to business attributes like industry, products, size, and location. A similar objective, outlier or anomaly detection, is an automated method of recognizing real anomalies (rather than simple variability) within a set of data that displays identifiable patterns.
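One simple way to separate real anomalies from ordinary variability, as described above, is a robust score based on the median. The sketch below uses the modified z-score (median absolute deviation) with a cutoff of 3.5, a commonly used convention; the technique, the threshold, and the `daily_orders` figures are assumptions made for illustration, not a method named in this article.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # too little variation to score reliably in this simple sketch
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical daily order counts with one unusual day.
daily_orders = [102, 98, 105, 99, 101, 97, 250, 103, 100]

outliers = find_outliers(daily_orders)
print(outliers)  # prints [250]
```

Because the median and the median absolute deviation shift very little when a single extreme value is present, the ordinary day-to-day variation (97 to 105 here) is not flagged.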
Another interesting goal is association – linking two seemingly unrelated events or activities. A classic story from the early days of analytics and data mining, perhaps fictitious, has a convenience store chain discovering a correlation between sales of beer and diapers. The chain speculates that harried new fathers who run out late in the evening to get diapers may grab a couple of six-packs while they are there. The stores then position the beer and diapers in close proximity and increase beer sales as a result.
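The beer-and-diapers story can be made concrete with the three measures commonly used to score an association rule: support, confidence, and lift. The sketch below is illustrative only; the `baskets` data is invented and the function names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"beer", "diapers", "chips"},
    {"diapers", "wipes"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "wipes"},
    {"milk", "diapers"},
    {"bread", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(set(antecedent) | set(consequent), transactions)
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    if s_a == 0 or s_c == 0:
        raise ValueError("antecedent and consequent must each appear at least once")
    confidence = s_both / s_a   # how often the consequent appears given the antecedent
    lift = confidence / s_c     # values above 1 mean more often than chance alone
    return s_both, confidence, lift

rule_support, rule_confidence, rule_lift = rule_metrics({"diapers"}, {"beer"}, baskets)
print(f"support={rule_support:.3f} confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift above 1 indicates that the two items appear together more often than would be expected if they were bought independently, which is the kind of signal the store in the story would act on.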
Clustering is aimed at grouping data by similarities rather than pre-defined assumptions. For example, when you mine your customer sales information combined with external consumer credit and demographic data, you may discover that your most profitable customers are from midsize cities.
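Grouping by similarity, as described above, can be sketched with k-means, one widely used technique among several. This is a minimal version under stated assumptions: the customer points (annual spend in thousands, orders per year) are hypothetical, and the starting centroids are supplied by the caller rather than chosen automatically.

```python
from math import dist

def k_means(points, initial_centroids, max_iterations=100):
    """Assign each point to its nearest centroid, then recompute centroids, until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: index of the closest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Hypothetical customers: (annual spend in $ thousands, orders per year).
customers = [(2, 3), (3, 4), (2.5, 3.5), (10, 12), (11, 13), (10.5, 12.5)]

centroids, labels = k_means(customers, initial_centroids=[(2, 3), (10.5, 12.5)])
print(centroids)
print(labels)
```

The algorithm is never told which customers belong together; the two groups emerge from the distances alone. Results depend on the number of clusters and the starting centroids, so in practice several starting points are usually tried.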
Much of the time, data mining is pursued in support of prediction or forecasting. The better you understand patterns and behaviors, the better you can forecast future actions related to those causations or correlations.
One of the mathematical techniques offered in data mining tool kits, regression analysis predicts a number based on historic patterns projected into the future. Various other pattern detection and tracking algorithms provide flexible tools to help users better understand the data and the behavior it represents.
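To show how regression projects a historic pattern forward, the sketch below fits a straight line by ordinary least squares and extends it one period ahead. It assumes a simple linear trend; the `quarters` and `revenue` figures and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

# Hypothetical quarterly revenue (in $ thousands) for six past quarters.
quarters = [1, 2, 3, 4, 5, 6]
revenue = [102, 109, 121, 128, 141, 149]

slope, intercept = fit_line(quarters, revenue)
forecast_q7 = predict(7, slope, intercept)
print(f"trend: {slope:.2f} per quarter, forecast for quarter 7: {forecast_q7:.1f}")
```

A straight line is the simplest case. Data mining tool kits typically offer other regression forms for patterns that curve or depend on several inputs at once.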
These are just a few of the techniques and tools available in data mining tool kits. The choice of tool or technique is somewhat automated in that the techniques will be applied according to how the question is posed. In earlier times, data mining was referred to as “slicing and dicing” the database, but the practice is more sophisticated now and terms like association, clustering, and regression are commonplace.
Data mining is key to sentiment analysis, price optimization, database marketing, credit risk management, training and support, fraud detection, healthcare and medical diagnoses, risk assessment, recommendation systems (“customers who bought this also liked… ”), and much more. It can be an effective tool in just about any industry, including retail, wholesale distribution, service industries, manufacturing, telecom, communications, insurance, education, healthcare, banking, science, engineering, and online marketing or social media.
- Product Development: Companies that design, make, or distribute physical products can pinpoint opportunities to better target their products by analyzing purchasing patterns coupled with economic and demographic data. Their designers and engineers can also cross-reference customer and user feedback, repair records, and other data to identify product improvement opportunities.
- Manufacturing: Manufacturers can track quality trends, repair data, production rates, and product performance data from the field to identify production concerns. They can also recognize possible process upgrades that would improve quality, save time and cost, improve product performance, and/or point to the need for new or better factory equipment.
- Service Industries: In service industries, users can find similar opportunities for product improvement by cross-referencing customer feedback (direct or from social media or other sources) with specific services, channels, peer performance data, region, pricing, demographics, economic data, and more.
Finally, all of these findings should be fed back to forecasting and planning so that the entire organization is attuned to anticipated changes in demand based on more intimate knowledge of the customer – and is better positioned to exploit newly identified opportunities.
- Big Data: Data is being generated at a rapidly accelerating pace, offering ever more opportunities for data mining. However, modern data mining tools are required to extract meaning from Big Data, given the high volume, high velocity, and wide variety of data structures as well as the increasing volume of unstructured data. Many existing systems struggle to handle, store, and make use of this flood of input.
- User competency: Data mining and analysis tools are designed to help users and decision makers make sense of masses of data and coax meaning and insight from them. While highly technical, these powerful tools are now packaged with excellent user experience design so virtually anyone can use them with minimal training. However, to fully gain the benefits, the user must understand the data available and the business context of the information they are seeking. They must also know, at least generally, how the tools work and what they can do. This is not beyond the reach of the average manager or executive, but it is a learning process and users need to put some effort into developing this new skill set.
- Data quality and availability: With masses of new data, there are also masses of incomplete, incorrect, misleading, fraudulent, damaged, or just plain useless data. The tools can help sort this all out, but the users must be continually aware of the source of the data and its credibility and reliability. Privacy concerns are also important, both in terms of the acquisition of the data and the care and handling once it is in your possession.
Data mining FAQs
Data mining is the process of using advanced analytical tools to extract useful information from an accumulation of data. Machine learning is a type of artificial intelligence (AI) that allows systems to learn from experience. Data mining may make use of machine learning, when the analytical programs have the ability to adapt their functionality in response to the data analysis they perform.
Data analysis or analytics are general terms for the broad set of practices focused on identifying useful information, evaluating it, and providing specific answers. Data mining is one type of data analysis that is focused on digging into large, combined sets of data to discover patterns, trends, and relationships that can lead to insights and predictions.
Data science is a term that includes many information technologies including statistics, mathematics, and sophisticated computational techniques as applied to data. Data mining is a use case for data science focused on the analysis of large data sets from a broad range of sources.