The term “data lake” evolved to reflect the concept of a fluid, larger store of data, as compared with a more siloed, well-defined, and structured data mart.
More than a decade ago, as data sources grew, data lakes emerged to address the need to store petabytes of undefined data for later analysis. Early data lakes were built on the Hadoop Distributed File System (HDFS) and commodity hardware in on-premises data centres. However, the inherent challenges of a distributed architecture, along with the need for custom data transformation and analysis, contributed to the suboptimal performance of Hadoop-based systems.
A data warehouse (DW) is a digital storage system that connects and harmonises large amounts of structured and formatted data from many different sources. In contrast, a data lake stores data in its original, raw form – neither structured nor formatted.
A data lakehouse adds data management and warehouse capabilities on top of the capabilities of a traditional data lake. This is a new and evolving area that’s changing rapidly.
Multicloud is the use of multiple cloud computing and storage services in a single heterogeneous architecture. It refers to the distribution of cloud assets, software, and applications across several cloud-hosting environments.
File storage organises and represents data as a hierarchy of files in folders; block storage chunks data into evenly sized blocks with no inherent organisation; and object storage manages each piece of data together with its associated metadata. Object storage systems allow for the retention of massive amounts of unstructured data.
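The contrast above can be sketched in code. The following is an illustrative toy, not any vendor's API: it models the defining trait of object storage – a flat key space where each object carries its own metadata – with hypothetical names (`ObjectStore`, `put`, `get`) chosen purely for this example.

```python
# Toy in-memory object store, assuming a flat key space and
# per-object metadata (the defining traits of object storage).
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class ObjectStore:
    """Unlike file storage there is no folder hierarchy, and unlike
    block storage each object keeps its metadata alongside its data."""

    def __init__(self):
        self._objects = {}  # key -> StoredObject

    def put(self, key, data, **metadata):
        # The key may *look* like a path, but it is just an opaque string.
        self._objects[key] = StoredObject(data, metadata)

    def get(self, key):
        return self._objects[key]


store = ObjectStore()
store.put("logs/2024/app.log", b"error: timeout",
          content_type="text/plain", retention_days=90)
obj = store.get("logs/2024/app.log")
```

Because metadata travels with each object, systems built this way can index and manage billions of unstructured items without imposing a directory tree on them.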