A database is a facility for organizing, storing, managing, safeguarding, and controlling access to data. Databases are designed according to a number of different schemes (schemas), many of which adhere to the relational model for ease of access by programs and data queries. Common types of databases include relational database management systems (RDBMS), in-memory databases, object-oriented databases (OODBMS), NoSQL databases, and NewSQL databases, each with its own advantages.
Data management refers to all the functions necessary to collect, control, safeguard, manipulate, and deliver data. Data management systems include databases, data warehouses, and data marts; tools for data collection, storage, and retrieval; and utilities to assist with validation, quality, and integration with applications and analytical tools. Businesses need a data strategy to establish accountability for data that originates in, or is endemic to, particular areas of responsibility.
A database management system (DBMS) is the software toolkit that provides the storage structure and data management facilities for a database. The DBMS may be an integral part of a licensed enterprise resource planning (ERP) system, a required separate purchase, a part of the system software (operating system), or a separately licensed software product. No matter the source, it is essential that applications are built around or completely integrated with the DBMS, because the applications and the DBMS depend on each other to function effectively.
A relational database is a type of database that organizes data into tables. These tables can be linked (or related) to each other to help users understand the relationships among all available data points. Relational databases use structured query language (SQL) to let administrators communicate with the database, join tables, insert and delete data, and more.
An SQL database is a relational database that stores data in tables and rows. Rows are linked through common data items (shared key values) to enable efficiency, avoid redundancy, and facilitate easy, flexible retrieval. The name SQL derives from Structured Query Language, the standardized query language that users can learn and apply to any compliant database for data storage, manipulation, and retrieval.
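As a minimal sketch of linked tables and SQL statements, the following uses Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables, their columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; all table names, columns, and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total       REAL NOT NULL
    );
""")

# Insert rows; orders refer to customers through the shared customer id.
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(1, 20.0), (1, 5.5), (2, 12.0)],
)

# Join the two tables on the common data item and aggregate per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name
""").fetchall()
print(rows)

# Delete small orders, then count what is left.
conn.execute("DELETE FROM orders WHERE total < ?", (10,))
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)
```

Because each customer's name is stored once and orders refer to it by id, the name is not repeated in every order row, which is the redundancy-avoidance the definition describes.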
NoSQL databases were developed to handle unstructured data that SQL databases cannot easily support because the data lacks a predefined structure. NoSQL databases overcome this limitation with techniques such as dynamic schemas and various pre-processing steps. The most common types of NoSQL databases are key-value, document, column, and graph databases, and the data they hold often includes things like video, graphics, free text, and raw sensor output.
Structured data is neatly formatted into rows and columns and mapped to predefined fields. It is typically stored in Excel spreadsheets or relational databases; examples include financial transactions, demographic information, and machine logs. Until recently, structured data was the only usable type of data for businesses.
Unstructured data is not organized into rows and columns – making it more difficult to store, analyze, and search. Examples include raw Internet of Things (IoT) data, video and audio files, social media comments, and call center transcripts. Unstructured data is usually stored in data lakes, NoSQL databases, or modern data warehouses.
Semi-structured data has some organizational properties, such as semantic tags or metadata, but does not conform to the rows and columns of a spreadsheet or relational database. A good example of semi-structured data is e-mail – which includes some structured data, like the sender and recipient addresses, but also unstructured data, like the message itself.
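To illustrate the e-mail example, here is a small sketch using Python's standard `email` package. The message text and addresses are made up. The headers can be read as named fields (the structured part), while the body comes back as free text (the unstructured part).

```python
from email import message_from_string

# A made-up message: headers first, then a blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "Hi Bob, the figures look better than expected.\n"
)

msg = message_from_string(raw)

# Structured part: well-defined header fields.
structured = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
}

# Unstructured part: the free-text body of a non-multipart message.
unstructured = msg.get_payload()

print(structured)
print(unstructured)
```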
Data mapping is the process of matching fields between different data structures or databases. This is a necessary step if databases are to be combined, if data is being migrated from one system or database to another, or if different data sources are to be used within a single application or analytical tool – as happens frequently in data warehousing. Data mapping will identify unique, conflicting, and duplicate information so that a set of rules can be developed for bringing all the data into a coordinated schema or format.
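The idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two source systems ("CRM" and "billing"), their field names, the target schema, and the choice of e-mail address as the matching key. Real mapping tools handle types, transformations, and fuzzy matches that this sketch ignores.

```python
# Hypothetical field maps: source field name -> target field name.
CRM_MAP = {"cust_name": "name", "e_mail": "email", "zip": "postal_code"}
BILLING_MAP = {"fullName": "name", "emailAddress": "email", "postcode": "postal_code"}


def apply_map(record, field_map):
    """Rename a record's fields into the target schema."""
    return {target: record[source]
            for source, target in field_map.items() if source in record}


def classify(left, right, key):
    """Label each key value as unique, duplicate, or conflicting."""
    left_by_key = {r[key]: r for r in left}
    right_by_key = {r[key]: r for r in right}
    report = {"unique": [], "duplicate": [], "conflicting": []}
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        a, b = left_by_key.get(k), right_by_key.get(k)
        if a is None or b is None:
            report["unique"].append(k)       # present in one source only
        elif a == b:
            report["duplicate"].append(k)    # same content in both sources
        else:
            report["conflicting"].append(k)  # same key, different content
    return report


crm = [
    {"cust_name": "Ada", "e_mail": "ada@example.com", "zip": "10001"},
    {"cust_name": "Grace", "e_mail": "grace@example.com", "zip": "20002"},
]
billing = [
    {"fullName": "Ada", "emailAddress": "ada@example.com", "postcode": "10001"},
    {"fullName": "Grace H.", "emailAddress": "grace@example.com", "postcode": "20002"},
    {"fullName": "Linus", "emailAddress": "linus@example.com", "postcode": "30003"},
]

mapped_crm = [apply_map(r, CRM_MAP) for r in crm]
mapped_billing = [apply_map(r, BILLING_MAP) for r in billing]
report = classify(mapped_crm, mapped_billing, key="email")
print(report)
```

The resulting report is the kind of input from which reconciliation rules can be written, for example which source wins when names conflict.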
In creating a new or alternate database structure, the designer starts with a diagram of how data will flow into and out of the database. Diagramming the data flows is called data modeling. From this flow diagram, software engineers can define the characteristics of the data formats, structures, and database handling functions to efficiently support the data flow requirements.
A data warehouse provides a single, comprehensive storage facility for data from many different sources – both internal and external. Its main purpose is to supply the data for business intelligence (BI), reporting, and analytics. Modern data warehouses can store and manage all data types, structured and unstructured, and are typically deployed in the cloud for greater scalability and ease of use.
Big Data is a term that describes extremely large datasets of structured, unstructured, and semi-structured data. Big Data is often characterized by the five Vs: the sheer volume of data collected, the variety of data types, the velocity at which the data is generated, the veracity of the data, and the value of it. With Big Data management systems and analytics, companies can mine Big Data for deep insights that guide decision-making and actions.
Data integration is the practice of ingesting, transforming, combining, and provisioning data, where and when it’s needed. This integration takes place in the enterprise and beyond – across partners as well as third-party data sources and use cases – to meet the data consumption requirements of all applications and business processes. Techniques include bulk/batch data movement, extract, transform, load (ETL), change data capture, data replication, data virtualization, streaming data integration, data orchestration, and more.
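Of the techniques listed, ETL is the easiest to show in miniature. The sketch below uses only Python's standard library. The CSV content, the `orders` table, and the rule of skipping rows without an amount are all assumptions made for the example, not part of any particular product.

```python
import csv
import io
import sqlite3

# Made-up source data with inconsistent spacing, casing, and a missing amount.
RAW_CSV = "order_id,amount,currency\n1, 19.99 ,usd\n2,5.00,USD\n3,,usd\n"


def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, convert types, normalize the currency code."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # assumed rule: skip rows that have no amount
        cleaned.append(
            (int(row["order_id"]), float(amount), row["currency"].strip().upper())
        )
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```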
Data virtualization provides companies with a unified view of all enterprise data – across disparate systems and formats – in a virtual data layer. Instead of duplicating data, data virtualization leaves the data in its source systems and simply exposes a virtual representation of it to users and applications in real time. Data virtualization is a modern approach to data integration that lets users discover and manipulate data regardless of its physical location, format, or protocol.
A data fabric is a customized combination of architecture and technology. It uses dynamic data integration and orchestration to connect different locations, sources, and types of data. With the right structures and flows as defined within the data fabric platform, companies can quickly access and share data regardless of where it is or how it was generated.
A data pipeline describes a set of automated and repeatable processes for finding, cleansing, transforming, and analyzing any type of data at its source. Because data is analyzed near where it’s generated, business users can quickly analyze and share the information they need at a lower cost to the organization. Data pipelines can also be enhanced by technologies such as machine learning to make them faster and more effective.
Data silo is an informal term for a situation in which individual departments or functional areas within an enterprise do not share data and information with other departments. This isolation prevents coordinated efforts toward company goals and results in poor performance (and poor customer service), high costs, and a general inability to respond to market demands and changes. Duplicate and redundant data is difficult to reconcile, further preventing any attempt to coordinate activities and effectively manage the business.
Data wrangling is the process of taking raw data and transforming it into a format that is compatible with established databases and applications. The process may include structuring, cleaning, enriching, and validating data as necessary to make raw data useful.
Data security is the act of making data safe and secure – safe from unauthorized access or exposure, disaster, or system failure, and, at the same time, readily accessible to legitimate users and applications. Methods and tools include data encryption, key management, redundancy and backup practices, and access controls. Data security is a requirement for organizations of all sizes and types to safeguard customer and organizational data against the ever-increasing threat of data breaches and privacy risks. Redundancy and backups are important for business continuity and disaster recovery.
Data privacy refers to the policies and practices for handling data in ways that protect it from unauthorized access or disclosure. Data privacy policies and practices cover how information is collected and stored per the organization’s data strategy, how it may or may not be shared with third parties, and how to comply with regulatory restrictions. Data privacy is a business imperative that satisfies client expectations while protecting the integrity and safety of stored information.
Data quality is a nebulous term describing the suitability and reliability of data. Good-quality data is accurate (truly representative of what it describes), reliable (consistent, auditable, properly managed, and protected), and complete to the extent that users and applications require. Data quality can only be ensured by a properly devised and executed data strategy carried out with industrial-strength tools and systems along with scrupulously followed data management policies and procedures.
Data validation is the process of determining the quality, accuracy, and validity of data before importing or using it. Validation can consist of a series of activities and processes for authenticating the data and generally “cleaning up” data items, including removal of duplicates, correction of obvious errors or missing items, and possible formatting changes (data cleansing). Data validation ensures the information you need for making important decisions is accurate and trustworthy.
Data cleansing is the process of removing or correcting errors from a dataset, table, or database. These errors can include corrupt, inaccurate, irrelevant, or incomplete information. This process, also called data scrubbing, finds duplicate data and other inconsistencies, like typos and numerical sets that don’t add up. Data cleansing may remove incorrect information or fix obvious mistakes, such as empty fields or missing codes.
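A few of these cleansing steps can be sketched in plain Python. The record layout and the rules are assumptions chosen for the example: names and e-mail addresses are normalized, records without a usable e-mail address are set aside for review, and later duplicates of an already-seen address are dropped.

```python
def cleanse(records):
    """Normalize fields, set aside incomplete records, and drop duplicates."""
    seen = set()
    clean, rejected = [], []
    for rec in records:
        # Fix obvious formatting problems: stray whitespace and casing.
        name = (rec.get("name") or "").strip().title()
        email = (rec.get("email") or "").strip().lower()
        if "@" not in email:
            rejected.append(rec)  # incomplete or invalid: needs manual review
            continue
        if email in seen:
            continue              # duplicate of a record already kept
        seen.add(email)
        clean.append({"name": name, "email": email})
    return clean, rejected


# Made-up input with a duplicate, inconsistent formatting, and an empty field.
raw_records = [
    {"name": " ada lovelace ", "email": "Ada@Example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},
    {"name": "Grace Hopper", "email": ""},
    {"name": "linus", "email": "linus@example.com"},
]

clean, rejected = cleanse(raw_records)
print(clean)
print(rejected)
```

Keeping rejected records in a separate list, rather than silently discarding them, leaves room for someone to correct and re-import them later.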
Data integrity refers to the veracity of data over the long term. Once data is entered or imported, wrangled, validated, cleansed, and stored, data integrity is a statement that data quality is maintained and users can rest assured that the data that went in has not and will not change. The data that is retrieved is the same as what was originally stored. Sometimes used as a synonym for data quality, data integrity is more about reliability and dependability.
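One common way to check that retrieved data matches what was stored is to record a checksum at write time and compare it at read time. The sketch below uses SHA-256 from Python's `hashlib`; the payloads are made up, and the text above does not prescribe this or any other specific mechanism.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def is_intact(payload: bytes, expected: str) -> bool:
    """True if the payload still matches the checksum recorded at write time."""
    return fingerprint(payload) == expected


# At write time: store the data together with its checksum.
stored = b"order=42;total=19.99"
checksum_at_write = fingerprint(stored)

# At read time: an unchanged copy passes, an altered copy does not.
unchanged_copy = b"order=42;total=19.99"
altered_copy = b"order=42;total=91.99"

print(is_intact(unchanged_copy, checksum_at_write))
print(is_intact(altered_copy, checksum_at_write))
```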
Data governance is a set of policies and practices for ensuring proper data management across an organization. It establishes the IT infrastructure and names the individuals (or positions) that have the authority and responsibility for the handling and safeguarding of specific types of data. Effective data governance ensures that data is available, trustworthy, secure, and compliant – and that it doesn’t get misused.
Data stewardship is the implementation of data governance policies and procedures for establishing data accuracy, reliability, integrity, and security. Individuals assigned with data stewardship responsibilities manage and oversee the procedures and tools used to handle, store, and protect data.
Data architecture is the overall design for the structure, policies, and rules that define an organization’s data and how it will be used and managed. Data architecture includes the details of how the data strategy is implemented in support of business needs and goals – and serves as the foundation for development of databases, procedures, safeguards, security, and data governance.
Master data management (MDM) is the practice of creating a single “master” reference source for all important business data. It includes policies and procedures for defining, managing, and controlling (or governing) the handling of master data. Centralized master data management eliminates the conflict and confusion that stem from scattered databases with duplicate information and uncoordinated data that might be out-of-date, corrupted, or displaced in time – updated in one place but not in another. Having one version to serve the entire enterprise means that all parts of the organization are working with the same definitions, standards, and assumptions.
The term analytics refers to the systematic analysis of data. Analytics applications and toolkits contain mathematical algorithms and computational engines that can manipulate large datasets to uncover patterns, trends, relationships, and other intelligence that allow users to ask questions and gain useful insights about their business, operations, and markets. Many modern analytics toolkits are designed for use by non-technical business people, allowing them to perform these analyses with minimal assistance from data scientists or IT specialists.
Data mining is the act of extracting useful information from large datasets. Data mining is often done by business users employing analytics tools to uncover patterns, trends, anomalies, relationships, dependencies, and other useful intelligence. Data mining has a broad range of applications, from detecting fraud and cybersecurity concerns to improving forecasts and finding performance improvement opportunities.
Data profiling is the practice of collecting statistics and traits about a dataset, such as its accuracy, completeness, and validity. Data profiling is one of the techniques used in data validation and data cleansing efforts, as it can help detect data quality issues like redundancies, missing values, and inconsistencies.
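As a rough sketch, a basic profile can be computed per column with plain Python. The dataset and the choice of statistics (missing count, distinct count, completeness ratio) are assumptions for the example, and empty strings are treated as missing values here.

```python
def profile(rows):
    """Collect simple per-column statistics for a list of dictionaries."""
    columns = sorted({key for row in rows for key in row})
    stats = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Assumed convention: None and empty strings count as missing.
        present = [v for v in values if v not in (None, "")]
        stats[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "completeness": len(present) / len(values),
        }
    return stats


# Made-up dataset with a repeated id and a missing country.
dataset = [
    {"id": 1, "country": "DE"},
    {"id": 2, "country": ""},
    {"id": 3, "country": "DE"},
    {"id": 3, "country": "FR"},
]

stats = profile(dataset)
print(stats)
```

In this sample, `id` has four values but only three distinct ones, which points to a possible redundancy, and `country` is 75% complete, which points to a missing value.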