What is a data lake?
In essence, a data lake is a repository of information. Data lakes are often confused with data warehouses, yet the two serve different business needs and have different architectures. In particular, cloud data lakes are a vital component of a modern data management strategy as the proliferation of social data, Internet of Things (IoT) machine data, and transactional data keeps accelerating. The ability to store, transform, and analyse any data type paves the way for new business opportunities and digital transformation – and herein lies the role of a data lake.
A data lake is a central data repository that helps to address data silo issues. Importantly, a data lake stores vast amounts of raw data in its native – or original – format. That format could be structured, unstructured, or semi-structured. Data lakes, especially those in the cloud, are low-cost, easily scalable, and often used for machine learning and analytics.
Data warehouses and data lakes often complement each other. For instance, when raw data stored in a data lake is needed to answer a business question, it can be extracted, cleaned, transformed, and used in a data warehouse for further analysis.
A “data lakehouse” is a new and evolving concept, which adds data management capabilities on top of a traditional data lake. In essence, it’s the combination of a data lake and a data warehouse.
In addition to the differences in data type and process noted above, here are some details comparing a data lake with a data warehouse solution.
Ultimately, the volume of data, database performance, and storage pricing will play an important role in choosing the right storage solution.
- Data movement: Data lakes allow the importing of any data type from multiple sources in its native format. This allows businesses to scale to data size on an as-needed basis without having to define data structures, schema, and transformations, which can result in overhead cost savings.
- Securely store and catalogue data: A data lake stores structured, semi-structured, and unstructured data from a variety of sources, such as business data from CRM or ERP software, IoT devices, social media, or even historical data from legacy systems. Data lakes allow you to capture batch and streaming data while applying governance, security, and control. Data can be queried directly or ingested into a data warehouse with the right tools.
- Analytics and machine learning: Data lakes allow role-based access to the information to run analytics and machine learning analysis without the need to move data to a separate analytics database. As well, data lakes allow historical data to be combined with real-time data to refine machine learning or predictive analytics models to provide better and/or new results.
A modern data lake has three main features:
- A landing zone for your raw data
- A staging zone where data is transformed with an analytic purpose in mind
- A data exploration zone where data is used by analytics and applications, and to feed machine learning models
From the data lake, the information is fed to a variety of destinations – such as analytics or other business applications, or machine learning tools – for further analysis.
Data lake use cases
Here are two examples of data lake use cases in retail.
Long-term sales data is stored in a data lake alongside unstructured data like website clickstreams, weather, news, and micro/macroeconomic data. Having this data stored together and accessible makes it easier for a data scientist to combine these different sources of information into a model that will forecast demand for a specific product or line of products. This information is then used as inputs to the retail ERP system to drive increased or decreased production plans.
In parallel, a marketing expert may access this same data lake and combine a sentiment analysis of website and social media engagement with news, macroeconomic, and sales history data to determine which products to focus on and how best to maximise sales, profit, and/or adoption.
Data lakes can reside on premises, in the cloud, a hybrid of both, and across multiple cloud hyperscalers, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud.
By far, the most popular type of data lake is a cloud data lake. A cloud data lake provides all the usual data lake features, but in a fully managed cloud service.
- On-premises data lake: With an on-premises data lake, in-house IT engineering resources manage the hardware, software, and processes. This approach has a higher capital expenditure (CAPEX) commitment, and data tends to be siloed.
- Cloud data lake: In a cloud data lake, the on-premises infrastructure is outsourced. There is a higher operational expenditure (OPEX) commitment, but this deployment approach allows businesses to scale more easily, along with many other benefits (see below).
- Hybrid data lake: In select cases, some companies choose to maintain both on-premises and cloud data lakes concurrently. This situation is relatively rare and mostly seen during migration from on premises to the cloud.
- Multi-cloud data lake: In a multi-cloud data lake, two or more cloud offerings are combined; for example, a business may use both AWS and Azure to manage and maintain cloud data lakes. This requires greater expertise to ensure these disparate platforms communicate with one another.
Why choose a cloud data lake? Turning data into a high-value business asset drives digital transformation. The strengths of the cloud combined with a data lake provide this foundation. A cloud data lake permits companies to apply analytics to historical data as well as new data sources, such as log files, clickstreams, social media, Internet-connected devices, and more, for actionable insights.
Here are some of the key benefits you should expect:
- Cost efficiency: Cloud storage providers offer a range of storage tiers and pay-as-you-go pricing options.
- Auto-scaling: Cloud services are designed to provide scaling functionality, allowing businesses to tap into compute and storage capacity on demand.
- Central data repository: A cloud data lake brings information together, serving as a single source of truth with governed data access that allows for process efficiency among teams.
- Data security: Cloud storage providers secure the underlying infrastructure through a shared responsibility model, while the business retains control over access to its data.
- Tools: Cloud storage providers and other vendors provide ETL tools that crawl data, build a data catalogue, and perform data preparation, data transformation, and data ingestion to make data queryable.
- Improved analytics for new insights and better business outcomes: A cloud data lake can combine data in new ways. For example, CRM data and social media analytics can provide new customer insights into the cause of churn or show which promotions increase loyalty. Also, operational efficiency can be improved through the analysis of IoT data.
Data lake frequently asked questions
The term “data lake” evolved to reflect the concept of a fluid, larger store of data – as compared to a more siloed, well-defined, and structured data mart.
More than a decade ago, as data sources grew, data lakes changed to address the need to store petabytes of undefined data for later analysis. Early data lakes were built on the Hadoop Distributed File System (HDFS) and commodity hardware in on-premises data centres. However, the inherent challenges of a distributed architecture and the need for custom data transformation and analysis contributed to the suboptimal performance of Hadoop-based systems.
A data warehouse (DW) is a digital storage system that connects and harmonises large amounts of structured and formatted data from many different sources. In contrast, a data lake stores data in its original form – and is not structured or formatted.
A data lakehouse adds data management and warehouse capabilities on top of the capabilities of a traditional data lake. This is a new and evolving area that’s changing rapidly.
Multicloud is the use of multiple cloud computing and storage services in a single heterogeneous architecture. It refers to the distribution of cloud assets, software, and applications across several cloud-hosting environments.
File storage organises and represents data as a hierarchy of files in folders; block storage splits data into evenly sized blocks, each with its own address; and object storage manages data as objects, each linked to its associated metadata. Object storage systems allow for the retention of massive amounts of unstructured data.