Big Data comprises all potentially business-relevant data – both structured and unstructured – from a variety of disparate sources. Once analysed, it provides deeper insight and more accurate information about every operational area of a business and its market.
Big Data technology refers to all the tools, software, and techniques used to process and analyse Big Data – including (but not limited to) data mining, data storage, data sharing, and data visualisation.
Apache Hadoop is an open-source, distributed processing framework. It is used to speed up and facilitate Big Data management by connecting several computers and allowing them to process Big Data in parallel, typically storing data in the Hadoop Distributed File System (HDFS) and processing it with the MapReduce programming model.
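To make the parallel-processing idea concrete, below is a minimal sketch of the classic word-count job written for Hadoop Streaming, which lets scripts in any language (Python here) act as the mapper and reducer. The script names and input format are illustrative assumptions, not fixed Hadoop conventions.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming runs one copy of this per input split,
# in parallel across the cluster. It reads raw text on stdin and emits
# one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts the mappers' output by key before this runs,
# so all counts for a given word arrive as consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
        continue
    if current_word is not None:
        print(f"{current_word}\t{current_count}")
    current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is submitted with Hadoop's hadoop-streaming JAR (its exact location varies by installation), passing the scripts via -mapper and -reducer and pointing -input and -output at HDFS directories; Hadoop then fans the mapper out across the cluster and feeds the sorted results into the reducer.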
Apache Spark is another open-source, distributed processing engine for Big Data. Whereas Hadoop's MapReduce writes intermediate results to disk between steps, Spark keeps data in memory across processing stages, which makes it significantly faster for iterative workloads. Hadoop arrived first and remains more widely deployed, but Spark is gaining popularity thanks to this speed advantage and its built-in libraries for machine learning (MLlib), streaming, and SQL.
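As a small sketch of what this looks like in practice, here is the same word count expressed in PySpark; the HDFS input path is an illustrative assumption.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or attach to) a Spark session; on a cluster, the same code
# transparently fans out across all worker nodes.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# spark.read.text yields one row per input line, in a single "value" column.
lines = spark.read.text("hdfs:///data/books")  # illustrative path

counts = (
    lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .where(F.col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)

counts.show(20)  # an action: only now does the distributed job actually run
spark.stop()
```

Because Spark evaluates the pipeline lazily, it can plan the whole chain of transformations before executing it and keep intermediate data in memory rather than spilling it to disk between steps – the key difference from MapReduce noted above.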
A data lake is a repository in which large amounts of raw, unstructured data can be stored and retrieved. Data lakes are necessary because much of Big Data is unstructured and cannot be stored in the rows and columns of a traditional relational database.
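As a rough sketch of the idea, the snippet below drops heterogeneous raw objects into an S3-style object store (a common data-lake substrate) using boto3; the bucket name and keys are made-up placeholders.

```python
import boto3

# A data lake enforces no schema on write: JSON events, CSV exports,
# and binary media such as images can all sit side by side as-is.
s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

s3.put_object(
    Bucket=BUCKET,
    Key="raw/events/2024-01-01.json",
    Body=b'{"user": 42, "action": "login"}',
)
s3.put_object(
    Bucket=BUCKET,
    Key="raw/exports/orders.csv",
    Body=b"order_id,total\n1001,19.99\n",
)

# Structure is imposed only when the data is read ("schema on read"),
# e.g. by pointing Spark or another engine at the relevant key prefix.
```

This is the inverse of a relational database, which validates data against a schema on write; the lake defers that work to whichever tool later reads the raw objects.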
Dark data is all the data that companies collect as part of their regular business operations (such as surveillance footage and website log files). It is retained for compliance purposes but is typically never analysed, and these large data sets often cost more to store than the value they return.
Data fabric is the integration of Big Data architecture and technologies across an entire business ecosystem. Its purpose is to connect Big Data of all types, from all sources, with the data management services used across the business.