Explore some of our FAQs on data lakes below, and review our data management glossary for even more definitions.
The term “data lake” evolved to reflect the concept of a fluid, larger store of data – in contrast to the more siloed, well-defined, and structured data mart.
More than a decade ago, as data sources grew, data lakes emerged to address the need to store petabytes of raw, undefined data for later analysis. Early data lakes were based on the Hadoop Distributed File System (HDFS) and commodity hardware in on-premises data centres. However, the inherent challenges of a distributed architecture, along with the need for custom data transformation and analysis, contributed to the suboptimal performance of Hadoop-based systems.
Cloud computing and data storage technologies are now the main foundation for the modern data stack – and for cloud data lakes.
A data warehouse (DW) is a digital storage system that connects and harmonises large amounts of structured and formatted data from many different sources. In contrast, a data lake stores data in its original form, without requiring it to be structured or formatted first.
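To make the contrast concrete, here is a minimal sketch in Python: the same event is kept as-is in a lake-style file store and, separately, loaded into a typed warehouse-style table. The lake directory, the event fields, and the SQLite table are illustrative assumptions, not part of any specific product.

```python
# A minimal sketch contrasting lake and warehouse storage.
# The paths, fields, and table schema below are illustrative assumptions.
import json
import os
import sqlite3

os.makedirs("lake", exist_ok=True)

# Data lake: keep the record exactly as it arrived; schema is decided later.
raw_event = {
    "device": "sensor-a",
    "payload": {"temp_c": 21.4},
    "ts": "2024-01-01T00:00:00Z",
}
with open("lake/event-001.json", "w") as f:
    json.dump(raw_event, f)

# Data warehouse: the same fact, structured and typed up front.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS readings (device TEXT, temp_c REAL, ts TEXT)"
)
conn.execute(
    "INSERT INTO readings VALUES (?, ?, ?)",
    (raw_event["device"], raw_event["payload"]["temp_c"], raw_event["ts"]),
)
conn.commit()
conn.close()
```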
Data management is the process of collecting, organising, and accessing data to support productivity, efficiency, and decision-making.
A data lakehouse adds data management and warehouse capabilities on top of a traditional data lake. This is a new area that is evolving rapidly.
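As one hedged illustration of the idea, the open-source Delta Lake project layers transactions and SQL over ordinary files in a lake. The sketch below assumes a Spark environment with the delta-spark extension available; the /lake/events path and the sample rows are illustrative.

```python
# A minimal lakehouse-style sketch, assuming a Spark session with the
# open-source Delta Lake extension (delta-spark) configured.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a table in Delta format: ordinary files in the lake, but with
# ACID transactions and schema enforcement layered on top.
df = spark.createDataFrame([(1, "sensor-a"), (2, "sensor-b")], ["id", "source"])
df.write.format("delta").mode("overwrite").save("/lake/events")

# Warehouse-style SQL over the same files.
spark.sql("SELECT count(*) FROM delta.`/lake/events`").show()
```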
Multicloud is the use of multiple cloud computing and storage services in a single heterogeneous architecture. In practice, this means distributing cloud assets, software, and applications across several cloud-hosting environments.
File storage organises and represents data as a hierarchy of files in folders; block storage chunks data into arbitrarily organised, evenly sized volumes; and object storage manages data and links it to associated metadata. Object storage systems allow for the retention of massive amounts of unstructured data.
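To illustrate what “data linked to associated metadata” looks like in practice, here is a minimal sketch using the AWS S3 API via boto3. The bucket name, object key, and metadata fields are illustrative assumptions, and any S3-compatible object store behaves the same way.

```python
# A minimal object-storage sketch, assuming boto3 credentials and an
# existing bucket named "example-lake" (both illustrative assumptions).
import boto3

s3 = boto3.client("s3")

# Object storage pairs the data itself with user-defined metadata,
# which is what lets a lake retain huge amounts of unstructured data
# while keeping it findable.
s3.put_object(
    Bucket="example-lake",
    Key="raw/clickstream/2024/01/01/events.json",
    Body=b'{"user": "123", "action": "view"}',
    Metadata={"source": "web", "ingest-date": "2024-01-01"},
)

# The metadata travels with the object and can be read back on retrieval.
head = s3.head_object(
    Bucket="example-lake",
    Key="raw/clickstream/2024/01/01/events.json",
)
print(head["Metadata"])
```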