What is a data warehouse?
What is a data warehouse?
A data warehouse (DW) is a digital storage system that connects and harmonizes large amounts of data from many different sources. Its purpose is to feed business intelligence (BI), reporting, and analytics, and support regulatory requirements – so companies can turn their data into insight and make smart, data-driven decisions. Data warehouses store current and historical data in one place and act as the single source of truth for an organization.
Data flows into a data warehouse from operational systems (like ERP and CRM), databases, and external sources such as partner systems, Internet of Things (IoT) devices, weather apps, and social media – usually on a regular cadence. The emergence of cloud computing has caused a shift in the landscape. In recent years, data storage locations have moved away from traditional on-premise infrastructure to multiple locations, including on premise, private cloud, and public cloud.
Modern data warehouses are designed to handle both structured and unstructured data, like videos, image files, and sensor data. Some leverage integrated analytics and in-memory database technology (which holds the data set in computer memory rather than in disk storage) to provide real-time access to trusted data and drive confident decision-making. Without data warehousing, it’s very difficult to combine data from heterogeneous sources, ensure it’s in the right format for analytics, and get both a current and long-range view of data over time.
Benefits of data warehousing
A well-designed data warehouse is the foundation for any successful BI or analytics program. Its main job is to power the reports, dashboards, and analytical tools that have become indispensable to businesses today. A data warehouse provides the information for your data-driven decisions – and helps you make the right call on everything from new product development to inventory levels. There are many benefits of a data warehouse. Here are just a few:
- Better business analytics: With data warehousing, decision-makers have access to data from multiple sources and no longer have to make decisions based on incomplete information.
- Faster queries: Data warehouses are built specifically for fast data retrieval and analysis. With a DW, you can very rapidly query large amounts of consolidated data with little to no support from IT.
- Improved data quality: Before being loaded into the DW, data cleansing cases are created by the system and entered in a worklist for further processing, ensuring data is transformed into a consistent format to support analytics – and decisions – based on high quality, accurate data.
- Historical insight: By storing rich historical data, a data warehouse lets decision-makers learn from past trends and challenges, make predictions, and drive continuous business improvement.
What can a data warehouse store?
When data warehouses first became popular in the late 1980s, they were designed to store information about people, products, and transactions. This data – called structured data – was neatly organized and formatted for easy access. However, businesses soon wanted to store, retrieve, and analyze unstructured data – such as documents, images, videos, emails, social media posts, and raw data from machine sensors.
A modern data warehouse can accommodate both structured and unstructured data. By merging these data types and breaking down silos between the two, businesses can get a complete, comprehensive picture for the most valuable insights.
Some key terms
There are lots of terms to make sense of in the world of DW. Here are some of the most important. Explore some other terms and FAQs in our glossary.
Data warehouse vs. database
Databases and data warehouses are both data storage systems; however, they serve different purposes. A database stores data usually for a particular business area. A data warehouse stores current and historical data for the entire business and feeds BI and analytics. Data warehouses use a database server to pull in data from an organization’s databases and have additional functionalities for data modeling, data lifecycle management, data source integration, and more.
Data warehouse vs. data lake
Both data warehouses and data lakes are used for storing Big Data, but they are very different storage systems. A data warehouse stores data that has been formatted for a specific purpose, whereas a data lake stores data in its raw, unprocessed state – the purpose of which has not yet been defined. Data warehouses and lakes often complement each other. For example, when raw data stored in a lake is needed to answer a business question, it can be extracted, cleaned, transformed, and used in a data warehouse for analysis. The volume of data, database performance, and storage pricing play important role in helping you choose the right storage solution.
Data warehouse vs. data mart
A data mart is a subsection of a data warehouse, partitioned specifically for a department or line of business – like sales, marketing, or finance. Some data marts are created for standalone operational purposes as well. While a data warehouse serves as the central data store for an entire company, a data mart serves relevant data to a select group of users. This simplifies data access, speeds up analysis, and gives them control over their own data. Multiple data marts are often deployed within a data warehouse.
What are the key components of a data warehouse?
A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools, metadata, and access tools. All of these components are engineered for speed so that you can get results quickly and analyze data on the fly.
- Central database: A database serves as the foundation of your data warehouse. Traditionally, these have been standard relational databases running on premise or in the cloud. But because of Big Data, the need for true, real-time performance, and a drastic reduction in the cost of RAM, in-memory databases are rapidly gaining in popularity.
- Data integration: Data is pulled from source systems and modified to align the information for rapid analytical consumption using a variety of data integration approaches such as ETL (extract, transform, load) and ELT as well as real-time data replication, bulk-load processing, data transformation, and data quality and enrichment services.
- Metadata: Metadata is data about your data. It specifies the source, usage, values, and other features of the data sets in your data warehouse. There is business metadata, which adds context to your data, and technical metadata, which describes how to access data – including where it resides and how it is structured.
- Data warehouse access tools: Access tools allow users to interact with the data in your data warehouse. Examples of access tools nclude: query and reporting tools, application development tools, data mining tools, and OLAP tools.
Data warehouse architecture
In the past, data warehouses operated in layers that matched the flow of the business data.
- Data layer: Data is extracted from your sources and then transformed and loaded into the bottom tier using ETL tools. The bottom tier consists of your database server, data marts, and data lakes. Metadata is created in this tier – and data integration tools, like data virtualization, are used to seamlessly combine and aggregate data.
- Semantics layer: In the middle tier, online analytical processing (OLAP) and online transactional processing (OLTP) servers restructure the data for fast, complex queries and analytics.
- Analytics layer: The top tier is the front-end client layer. It holds the data warehouse access tools that let users interact with data, create dashboards and reports, monitor KPIs, mine and analyze data, build apps, and more. This tier often includes a workbench or sandbox area for data exploration and new data model development.
Data warehouses have been designed to support decision making and have been primarily built and maintained by IT teams, but over the past few years they have evolved to empower business users – reducing their reliance on IT to get access to the data and derive actionable insights. A few key data warehousing capabilities that have empowered business users are:
- The semantic or business layer that provides natural language phrases and allows everyone to instantly understand data, define relationships between elements in the data model, and enrich data fields with new business information.
- Virtual workspaces allow teams to bring data models and connections into one secured and governed place supporting better collaborating with colleagues through one common space and one common data set.
- Cloud has further improved decision making by globally empowering employees with a rich set of tools and features to easily perform data analysis tasks. They can connect new apps and data sources without much IT support.
Top seven benefits of a cloud data warehouse
Cloud-based data warehouses are rising in popularity – for good reason. These modern warehouses offer several advantages over traditional, on-premise versions. Here are the top seven benefits of a cloud data warehouse:
- Quick to deploy: With cloud data warehousing, you can purchase nearly unlimited computing power and data storage in just a few clicks – and you can build your own data warehouse, data marts, and sandboxes from anywhere, in minutes.
- Low total cost of ownership (TCO): Data warehouse-as-a-service (DWaaS) pricing models are set up so you only pay for the resources you need, when you need them. You don’t have to forecast your long-term needs or pay for more compute throughout the year than necessary. You can also avoid upfront costs like expensive hardware, server rooms, and maintenance staff. Separating the storage pricing from the computing pricing also gives you a way to drive down the costs.
- Elasticity: With a cloud data warehouse, you can dynamically scale up or down as needed. Cloud gives us a virtualized, highly distributed environment that can manage huge volumes of data that can scale up and down.
- Security and disaster recovery: In many cases, cloud data warehouses actually provide stronger data security and encryption than on-premise DWs. Data is also automatically duplicated and backed-up, so you can minimize the risk of lost data.
- Real-time technologies: Cloud data warehouses built on in-memory database technology can provide extremely fast data processing speeds to deliver real-time data for instantaneous situational awareness.
- New technologies: Cloud data warehouses allow you to easily integrate new technologies such as machine learning, which can provide a guided experience for business users and decision support in the form of recommended questions to ask, as an example.
- Empower business users: Cloud data warehouses empower employees equally and globally with a single view of data from numerous sources and a rich set of tools and features to easily perform data analysis tasks. They can connect new apps and data sources without IT.
Data warehousing best practices
When you build a new data warehouse or add new applications to an existing warehouse, there are proven steps for achieving your goals while saving time and money. Some are focused on your business use, and other practices are part of your overall IT program. The following list is a good starting point, and you will pick up additional best practices as you work with your technology and services partners.
Business Best Practices
- Define the information you require. Once you have a good understanding of your initial needs, you can find the data sources to support them. Often, trade groups, customers, and suppliers will have data recommendations for you.
- Document the location, structure, and quality of your current data. Then, you can identify data gaps and business rules for transforming the data to meet your warehouse requirements.
- Build a team. This includes executive sponsors, managers, and staff who will be using and providing the information. For example, identify the standard reporting and KPIs they need to do their jobs.
- Prioritize your data warehouse applications. Pick one or two pilot projects that have reasonable requirements and good business value.
- Pick a strong data warehouse technology partner. They must have the implementation services and experience needed for your projects. Make sure that they support your deployment needs, including both cloud services and on-premise options.
- Develop a good project plan. Work with your team on a realistic blueprint and schedule that supports communications and status reporting.
IT Best Practices
- Monitor performance and security. The information in your data warehouse is valuable, though it must be readily accessible to provide value to the organization. Monitor system usage carefully to ensure that performance levels are high.
- Maintain data quality standards, metadata, structure, and governance. New sources of valuable data are becoming available routinely, but they require consistent management as part of a data warehouse. Follow procedures for data cleaning, defining metadata, and meeting governance standards.
- Provide an agile architecture. As your corporate and business unit usage increases, you will discover a wide range of data mart and warehouse needs. A flexible platform will support them far better than a limited, restrictive product.
- Automate processes such as maintenance. In addition to adding value to business intelligence, machine learning can automate data warehouse technical management functions to maintain speed and reduce operating costs.
- Use the cloud strategically. Business units and departments have different deployment needs. Use on-premise systems when required, and capitalize on cloud data warehouses for scalability, reduced cost, and phone and tablet access.
Modern data warehouses, and increasingly cloud data warehouses, will be a key part of any digital transformation initiative for parent companies and their business units. They capitalize on current business systems, particularly when you combine data from multiple internal systems with new, important information from outside organizations.
Dashboards, KPIs, alerts, and reporting support executive, management, and staff requirements, as well as important customer and supplier needs. Data warehouses also provide fast, complex data mining and analytics, and they don’t disrupt the performance of other business systems.
Given the flexibility to start small and expand as needed, both corporate offices and business units can improve decision-making and bottom-line performance with modern data warehouse technology.
More in this series
Data warehouse glossary
A data lake is a place to store all kinds of Big Data, whether it’s structured data from business applications or unstructured data from mobile apps, social media, or Internet of Things (IoT) devices. Because data is stored in its natural format – structured, unstructured, semi-structured, or binary – conversion, normalization, or other processing may be needed to enable analytics across multiple data types. Most data lakes are cloud based due to the large volumes of data they store, the need for high-speed connections to distributed sources, and the need for scalability.
ETL stands for “extract, transform, and load.” Together these activities make up the process used to take data from the source and convert it into a usable format – and then move it into a data warehouse or other data store. ETL is especially useful on transactional data, but more advanced tools can also manage a variety of unstructured data types.
A data mart is a partitioned segment of a data warehouse that is oriented to a specific business area or team, such as finance or marketing. Data marts make it easier for departments to quickly access the data and insights that are relevant to them, and also to control their own data sets within the larger data store.
Data models are a foundational element of software development and analytics. A data model is a description of how data is structured, and the form in which the data will be stored in the database. A data model provides a framework of relationships between data elements within a database, as well as a guide for use of the data.
Data modeling is the process of creating data models. When creating a database or data warehouse structure, the designer starts with a diagram of how data will flow into and out of the database or data warehouse. This flow diagram is used to define the characteristics of the data formats, structures, and database handling functions to efficiently support the data flow requirements. The modeling provides a standardized method for defining and formatting database contents consistently across systems, enabling different applications to share the same data.
An enterprise data warehouse (EDW) stores all current and historical business data in one place – the embodiment of master data management, data warehousing, and a data strategy based on a holistic approach to data management. EDWs provide a welcoming environment for analytics software and the maintenance of accurate, company-wide KPIs and reporting. Many EDWs are cloud-based for scalability, access, and ease of use.
SAP Insights Newsletter
Gain key insights by subscribing to our newsletter.