Data lake vs. data warehouse

Data lakes store raw data in its native format, regardless of how it arrives. Data warehouses store data that has been cleansed and structured in a predefined way.

Introduction to data lakes and data warehouses

Data lakes and data warehouses are systems that store, manage, and retrieve large volumes of digital data. Businesses collect data to help them gain insights into their operations, customers, markets, and supply chains so they can respond more strategically.

Data warehouses emerged as a solution to break down data silos and address the challenge of business data scattered across multiple systems, formats, and departments.

This fragmentation made it difficult for users to access, integrate, and analyse the data to spot patterns, forecast demand, or evaluate business performance. Data warehouses were developed to consolidate this data into a centralised repository, where it could be integrated, cleaned, and structured for analysis. This approach established a “single source of truth” to support compliance, performance monitoring, and business intelligence processes.

Data lakes, in turn, emerged as a solution to the limitations of data warehouses, which could not adequately handle the explosion of unstructured and semi-structured data generated from new sources such as social media, IoT devices, sensors, and mobile apps. Storing and processing vast amounts of diverse data, such as images, video, and text, proved too costly and inefficient, as traditional data warehouses initially required data to be cleaned and processed before storage.

Businesses needed a more flexible, low-cost way to store data in its raw, original format, and data lakes were created as the solution.

Today, many modern enterprises adopt a hybrid approach involving both data warehouses and data lakes: the data lakehouse. This architecture offers both the rapid, structured reporting capabilities of the former and the potential for AI and machine learning applications of the latter.

Data lakes vs. data warehouses: key differences

The main difference between data lakes and data warehouses lies in the type of data they store and how they store that data, both of which play a key role in an organisation’s data strategy.

Data warehouses store structured data that has been cleaned and processed according to a predefined structure, or schema. Because the schema is applied before the data is stored, the approach is known as schema-on-write.

For example, a schema may require that customer IDs are integers, that order dates use the YYYY-MM-DD format, and that total sale amounts are decimals. Because all data adheres to these rules, users can run queries such as “find the total sales per customer in April 2025” quickly and reliably. This speed and accuracy make data warehouses ideal for reporting, dashboards, and business intelligence use cases.
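As a minimal sketch of schema-on-write, the example below uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table name `sales`, the column names, the sample rows, and the helpers `build_warehouse`, `load_row`, and `april_2025_totals` are all hypothetical and chosen only to mirror the rules described above. Amounts are stored as whole cents so that sums stay exact.

```python
import re
import sqlite3
from datetime import date
from decimal import Decimal, InvalidOperation


def build_warehouse():
    """Create an in-memory table whose structure is fixed before any data arrives."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "customer_id INTEGER NOT NULL, "
        "order_date TEXT NOT NULL, "
        "total_cents INTEGER NOT NULL)"
    )
    return conn


def load_row(conn, customer_id, order_date, total):
    """Check one row against the schema rules and store it only if it passes."""
    if not isinstance(customer_id, int):
        raise ValueError("customer_id must be an integer")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", order_date):
        raise ValueError("order_date must use the YYYY-MM-DD format")
    date.fromisoformat(order_date)  # also rejects impossible dates such as 2025-02-30
    try:
        amount = Decimal(total)
    except InvalidOperation:
        raise ValueError("total must be a decimal amount") from None
    conn.execute(
        "INSERT INTO sales VALUES (?, ?, ?)",
        (customer_id, order_date, int(amount * 100)),  # stored as whole cents
    )


def april_2025_totals(conn):
    """Total sales per customer in April 2025, returned as Decimal amounts."""
    rows = conn.execute(
        "SELECT customer_id, SUM(total_cents) FROM sales "
        "WHERE order_date BETWEEN '2025-04-01' AND '2025-04-30' "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    return {customer_id: Decimal(cents) / 100 for customer_id, cents in rows}


conn = build_warehouse()
load_row(conn, 1, "2025-04-03", "19.99")
load_row(conn, 1, "2025-04-20", "5.01")
load_row(conn, 2, "2025-04-11", "42.50")
load_row(conn, 2, "2025-05-02", "10.00")  # valid row, but outside April
print(april_2025_totals(conn))
```

Because every stored row has already passed the checks in `load_row`, the query itself needs no defensive parsing. A row such as `load_row(conn, 3, "03/04/2025", "1.00")` is rejected before it ever reaches the table.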

In contrast, data lakes can store raw data in its original format, regardless of how it is structured. No predefined schema is required.

The schema is defined only when the data is queried, so the approach is known as schema-on-read. Only then is the raw data parsed, structured, and interpreted according to the query.
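Schema-on-read can be sketched in a similar way. In the example below, a Python list of raw strings stands in for files in a data lake; the records, field names, and the helpers `read_with_schema` and `totals_for_month` are hypothetical. Nothing is validated when the data is stored, and types are applied only inside the query. Floats are used for amounts purely to keep the sketch short.

```python
import json

# Raw events kept exactly as they arrived; nothing was checked on write.
RAW_EVENTS = [
    '{"customer_id": "1", "order_date": "2025-04-03", "total": "19.99"}',
    '{"customer_id": 2, "order_date": "2025-04-11", "total": 42.5, "channel": "app"}',
    '{"customer_id": 1, "order_date": "2025-05-02", "total": "10.00"}',
    "not json at all",
]


def read_with_schema(raw_lines):
    """Apply a schema at read time: yield typed records, skip lines that do not fit."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            yield {
                "customer_id": int(record["customer_id"]),
                "order_date": str(record["order_date"]),
                "total": float(record["total"]),
            }
        except (ValueError, KeyError, TypeError):
            continue  # unparseable or incomplete raw data is ignored by this query


def totals_for_month(raw_lines, month_prefix):
    """Sum totals per customer for order dates starting with month_prefix."""
    totals = {}
    for rec in read_with_schema(raw_lines):
        if rec["order_date"].startswith(month_prefix):
            totals[rec["customer_id"]] = totals.get(rec["customer_id"], 0.0) + rec["total"]
    return totals


print(totals_for_month(RAW_EVENTS, "2025-04"))
```

The storage step accepted everything, including the malformed line and the extra `channel` field. The cost of that flexibility is that each query has to do its own parsing and decide what to do with data that does not fit.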

To summarise, data warehouses apply a schema before storing data to ensure all data is structured and cleaned for use. Data lakes apply schema when the data is queried and can store any data, structured or otherwise, from the outset.

Differences between data lakes and data warehouses

| | Data lakes | Data warehouses |
| --- | --- | --- |
| Data type | Stores structured, semi-structured, and unstructured data (e.g., logs, videos, text). | Stores only structured data (e.g., sales transactions, financial data). |
| Schema | Schema-on-read: schema is applied when the data is queried. | Schema-on-write: schema is applied before data is stored. |
| Users | Data scientists, engineers, and analysts exploring patterns, training models, or running machine learning workflows. | Business analysts, executives, and operations teams producing reports and KPIs. |
| Purpose | Flexible storage for large volumes of raw, diverse data used for data exploration, AI, and machine learning. | Centralised repository for structured, processed data used for reporting, dashboards, and business intelligence. |
| Cost | Lower-cost object storage. | Higher storage and processing costs due to pre-processing and optimisation. |

Choosing between data lakes and data warehouses

Since data lakes can store raw data in any format, they are ideal for organisations that require flexibility. Retailers, for example, collect vast amounts of data from multiple sources, such as websites, mobile apps, social media, and point-of-sale systems. Because this data does not need to be cleaned, transformed, or structured before storage, they can use more cost-effective storage systems that scale easily. However, processing raw data at query time can cost more than running a data warehouse’s optimised queries.

In comparison, storage and preparation costs are typically higher with data warehouses. The cleaning, transforming, and structuring steps before loading, as well as indexing and partitioning after loading, require additional resources and storage. However, this optimisation results in ready-to-use data for business intelligence, reporting, and operational analytics. With data warehouses, analysts and executives can generate reports, monitor KPIs, and make informed decisions quickly and easily.

Data lakes, for their part, unlock new opportunities for AI and machine learning applications. The vast and varied datasets they store enable data scientists to identify trends, build predictive models, and run machine learning applications. Examples include recommendation systems that suggest products to users based on past interactions and natural language processing tools that carry out sentiment analysis on customer reviews or social media comments.

Today, many modern enterprises run data architectures that are essentially combinations of both. These data lakehouses aim to offer the flexibility of a data lake with the governance and performance of a data warehouse. While adoption is growing rapidly, many businesses still rely on traditional warehouses for essential reporting.

Real-world examples and use cases

Here are examples of how different industries use data lakes, data warehouses, or a combination of elements from both to support their unique requirements.

Healthcare: Hospitals often use a data lake architecture to store, manage, and analyse the vast amounts and varied types of data their operations generate. This includes unstructured wearable data and medical images, semi-structured HL7 patient data, and structured laboratory test results. By consolidating everything in a central repository, they can apply advanced analytics and AI to the raw data to, for example, identify patients at risk or analyse genomics to personalise treatment plans. With patients now equipped with “smart” wearable devices that stream data on vital signs, healthcare providers can even detect early warning signs and intervene more quickly.

Finance: Banks and other financial institutions must comply with anti-money laundering (AML) rules and strict financial reporting regulations (such as Sarbanes-Oxley in the US or Basel III internationally). By using data warehouses to store structured financial data from multiple systems, including transaction records, account balances, and trading data, they can generate regulatory reports that meet governance and security requirements. In addition to compliance, financial institutions also use data warehouses to support their business intelligence, manage risk, and detect fraud by running complex queries across historical and current datasets.

Media: Video streaming services use a data lakehouse approach to collect, store, and analyse user data to deliver personalised experiences. They ingest diverse types of data from multiple sources, such as streaming logs and social media feedback, and store it in a central repository. This data can then be used to build machine learning models that recommend the most relevant content. The same data can also be curated and structured into subsets for analytics or reporting requirements, powering dashboards on retention rates or informing decisions on content acquisitions.

Data lakehouses are rapidly becoming the preferred option for organisations seeking to maximise the value of their data. They can support business intelligence as well as AI and machine learning use cases on a single platform. However, they are still evolving, and some organisations continue to rely on traditional data warehouses for mission-critical reporting.

The potential of AI as a driver of productivity and efficiency has particularly influenced data architectures, with some emerging data lake and data lakehouse platforms now integrated with LLMs. This enables non-technical users to explore and analyse data by asking queries in plain language. For example, a user can ask “show me sales trends in Q2,” and the LLM can generate SQL that the system can understand. This democratises access to data-driven insights.

Serverless architectures are also emerging as a strategy, in which a cloud provider manages the data infrastructure on the business’s behalf. In this arrangement, a company pays for access to a data platform instead of setting up and managing its own. The advantages are easier scalability and cost-efficiency: the provider scales capacity up or down in response to spikes in data volume or query load, and the business pays only for what it uses. Developers can also deploy more quickly, as they do not have to deal with infrastructure considerations.

Some businesses even choose a multi-cloud strategy, distributing their data lakes and warehouses across several cloud services. The main benefit is resilience through redundancy: if one cloud goes offline, the business can continue operating on another. They can also run specific workloads on the cloud best suited to them, for example when one service specialises in machine learning. A multi-cloud setup can also help with data residency, since in certain industries or countries sensitive data must be stored in a region or with a cloud provider that meets local compliance requirements.

To connect, manage, and govern data across multiple cloud environments, organisations can implement data fabric architectures. A data fabric provides real-time access to data across separate but synchronised systems and applications, creating a unified view across the landscape.

To protect sensitive data such as medical records, National Insurance numbers, and source code, organisations are also adopting policies such as zero-trust access controls in their data platforms. These controls require every user to verify their identity before each access and grant access only to the data that user needs.

FAQs

What is a data lake?

A data lake is a storage system designed to hold large volumes of raw data in its original format, such as numbers, text, images, videos, or logs. Think of it as a giant “digital reservoir” where all kinds of information can flow in without being organised immediately.

Data lakes are useful for data scientists who wish to train machine learning models that power content recommendation systems.

What is a data warehouse?

A data warehouse is a storage system primarily designed to hold large volumes of structured data. Structured data is cleaned, organised, and formatted in a certain way. (Think of the defined rows and columns of a spreadsheet.) More modern warehouses can also handle certain semi-structured formats such as JSON or XML.

Businesses use data warehouses to answer questions quickly, generate reports, and track key performance indicators. These functions are categorised as business intelligence.

What is a data lakehouse?

A data lakehouse is a modern data platform that combines the best features of data lakes and data warehouses. It can store all types of data (structured, semi-structured, or unstructured) in raw form without needing to organise it first, while still allowing fast, structured analysis and reporting when required.

What is a schema? What is the difference between schema-on-read and schema-on-write?

Schemas are rules for how data is organised, such as what kind of data can be stored (numbers, dates), how the data is arranged (tables and columns), and how different pieces of information relate to one another.

Schema-on-write means the data must fit into a predefined structure (schema) before being stored. This is how data warehouses operate. They ensure the data is clean and ready for analysis from the outset.

Schema-on-read means the structure is only applied when someone wishes to use or analyse the data. This is how data lakes work. They allow greater flexibility as the data can be stored in any form initially, and you do not have to organise it straight away. However, the trade-offs of this approach include slower query times and potential inconsistency, since different users might interpret the same raw data differently.

By contrast, schema-on-write enforces consistency upfront but reduces flexibility.
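The inconsistency risk of schema-on-read can be illustrated with a small sketch. The raw value and the two reader functions below are hypothetical: the same stored string is interpreted by two users who assume different date conventions, and each gets a valid but different answer.

```python
from datetime import datetime

# Stored as-is; nothing in the raw data says which date convention applies.
RAW_ORDER_DATE = "03/04/2025"


def read_day_first(raw):
    """One user's read-time schema: day/month/year."""
    return datetime.strptime(raw, "%d/%m/%Y").date()


def read_month_first(raw):
    """Another user's read-time schema: month/day/year."""
    return datetime.strptime(raw, "%m/%d/%Y").date()


print(read_day_first(RAW_ORDER_DATE))    # 2025-04-03
print(read_month_first(RAW_ORDER_DATE))  # 2025-03-04
```

With schema-on-write, the ambiguity would have to be resolved once, at load time, and every later query would see the same date.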

What is the difference between structured, unstructured, and semi-structured data?

Structured data is highly organised, easy to search, and can usually be stored in tables, such as customer names, sales figures, and dates.

Unstructured data has no fixed format and is more difficult to organise, such as videos, images, audio files and social media posts.

Semi-structured data is somewhere in between. It has some organisation, but it is not as strict as a table. Think JSON files, XML documents, and e-mails.
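As a rough illustration of the first and last of these categories, the sketch below compares a small CSV table with a JSON document using only the Python standard library. The sample values and field names are made up. Every CSV row has the same columns, while the JSON records share some keys but differ in others and can nest values.

```python
import csv
import io
import json

# Structured: every row has exactly the same columns.
STRUCTURED = (
    "customer_name,sales,order_date\n"
    "Ada,19.99,2025-04-03\n"
    "Lin,42.50,2025-04-11\n"
)

# Semi-structured: records are organised by keys, but the keys can vary and nest.
SEMI_STRUCTURED = (
    '[{"customer": {"name": "Ada"}, "tags": ["new"]},'
    ' {"customer": {"name": "Lin", "email": "lin@example.com"}}]'
)

rows = list(csv.DictReader(io.StringIO(STRUCTURED)))
column_sets = {tuple(row.keys()) for row in rows}

records = json.loads(SEMI_STRUCTURED)
key_sets = {tuple(sorted(record.keys())) for record in records}

print(len(column_sets))  # 1: one shared set of columns
print(len(key_sets))     # 2: the two records have different top-level keys
```

Unstructured data such as images or audio has no comparable set of keys or columns to inspect at all, which is why it usually needs specialised processing before it can be analysed.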
