
Data warehouse systems: Success with big data analytics

Data acquisition level

Before data can be loaded into the DWH, the often highly heterogeneous information must be converted into a uniform representation. A DWH is fed both from a company's internal data sources and from relevant external data sources:

  • Internal data: operational systems such as enterprise resource planning (ERP) and customer relationship management (CRM) systems; operational databases; content management systems (CMS); flat files (e.g. Excel, CSV, text files); emails, etc.
  • External data: applications and systems from external service providers, websites/the internet, social media, cloud services, etc.

Systems on the data acquisition level provide interfaces to a company's operational systems and are used in the first phase of data warehousing: the central functions of these DWH components are data acquisition and data integration.

As part of data acquisition, the following extraction techniques are used:

  • Triggers: If the operational systems of a company support database triggers, these can be used to automate data extraction. Triggers make it possible to define operations that are carried out automatically when certain events occur. As a rule, the triggering events are changes in the database of the source systems, which lead to an extraction of the modified data into the DWH.
  • Log files: If an operational system does not support triggers, the data acquisition level of a DWH can contain programs that evaluate the log files of the source systems and extract the operations recorded in them.
  • Monitor programs: If neither triggers nor log data are available for extraction, monitor programs are usually used. These detect changes in the database of an operational system using algorithms that take images (snapshots) of the monitored data at regular intervals and compare them with previous ones (a minimal sketch follows this list).
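
To make the snapshot technique concrete, here is a minimal Python sketch that compares two images of a monitored table and classifies the differences; the table, the column names and the in-memory SQLite database standing in for the operational system are illustrative assumptions:

```python
import sqlite3

def take_snapshot(conn, table, key_column):
    """Capture the current state of a monitored table as {key: row} pairs."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    key_index = columns.index(key_column)
    return {row[key_index]: row for row in cur.fetchall()}

def diff_snapshots(previous, current):
    """Compare two snapshots and classify the changes to be extracted."""
    inserted = [current[k] for k in current.keys() - previous.keys()]
    deleted = [previous[k] for k in previous.keys() - current.keys()]
    updated = [current[k] for k in current.keys() & previous.keys()
               if current[k] != previous[k]]
    return inserted, updated, deleted

# Demo with an in-memory stand-in for the operational database; snapshots
# would be taken at regular intervals, and only the detected changes are
# extracted into the DWH.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 19.99), (2, 5.00)")
before = take_snapshot(conn, "orders", "order_id")
conn.execute("UPDATE orders SET amount = 7.50 WHERE order_id = 2")
conn.execute("INSERT INTO orders VALUES (3, 42.00)")
after = take_snapshot(conn, "orders", "order_id")
print(diff_snapshots(before, after))  # one insert, one update, no deletes
```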

If none of the techniques described is supported because access to the data of the operational system is not possible, the source system itself must log its changes independently and transmit the relevant modifications to the data warehouse.
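
What such application-side change logging could look like is sketched below; the file name and the event format are assumptions made purely for illustration:

```python
import json
import time

def log_change(logfile, operation, table, row):
    """Record a modification in the application itself so that it can later
    be shipped to the data warehouse."""
    event = {"ts": time.time(), "op": operation, "table": table, "row": row}
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# The application calls this alongside every write it performs; a separate
# job periodically reads the file and transmits the changes to the DWH.
log_change("changes.jsonl", "INSERT", "orders", {"order_id": 4, "amount": 12.0})
log_change("changes.jsonl", "UPDATE", "orders", {"order_id": 2, "amount": 8.0})
```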

As part of data integration, most DWHs provide OLAP functionalities that make it possible to represent data in multidimensional structures. Online analytical processing (OLAP) is an analysis method used to aggregate management-relevant company data. The procedure is based on the ETL process (a sketch of the full pipeline follows the list):

  • E = Extraction: Data extraction involves reading relevant information out of various data sources. This can be implemented as a push or a pull strategy. With a push strategy, the data sources are prompted to generate extracts at regular intervals and transmit them to the DWH. With a pull strategy, the DWH initiates the data extraction itself.
  • T = Transformation: The extracted data is cleansed during transformation and translated into the uniform format of the target database.
  • L = Loading: The loading phase comprises saving the transformed data in the respective target databases of the DWH.
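
The following minimal Python sketch runs all three phases end to end against a flat-file source; the file name, the schema and the cleansing rules are illustrative assumptions:

```python
import csv
import sqlite3

def extract(csv_path):
    """E: read relevant records from a flat-file source (here: CSV)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(records):
    """T: clean the extracted data and translate it into the target format."""
    cleaned = []
    for r in records:
        if not r.get("customer_id"):                     # drop incomplete records
            continue
        cleaned.append((int(r["customer_id"]),
                        r["name"].strip().title(),       # unify spelling
                        round(float(r["revenue"]), 2)))  # unify number format
    return cleaned

def load(rows, dwh):
    """L: save the transformed data in the DWH target table."""
    dwh.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    dwh.commit()

# Demo: write a small source extract, then run the pipeline. Pull strategy:
# the DWH side initiates the extraction itself.
with open("extract.csv", "w", encoding="utf-8") as f:
    f.write("customer_id,name,revenue\n17,  alice gmbh ,199.901\n,broken row,0\n")
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales (customer_id INTEGER, name TEXT, revenue REAL)")
load(transform(extract("extract.csv")), dwh)
print(dwh.execute("SELECT * FROM sales").fetchall())  # [(17, 'Alice Gmbh', 199.9)]
```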

The data acquisition level of a DWH can include a so-called staging area (also called a work area). This is a temporary area of the database in which the data to be loaded is preprocessed. Such staging can be necessary especially for complex ETL processes.
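
A minimal sketch of such a staging area, assuming a SQLite-based DWH with hypothetical table names: the raw extract first lands in an untyped staging table, is preprocessed there, and only then moves into the target table:

```python
import sqlite3

dwh = sqlite3.connect(":memory:")
# Staging table: a raw, untyped copy of the source extract for preprocessing.
dwh.execute("CREATE TABLE stg_sales (customer_id TEXT, name TEXT, revenue TEXT)")
# Target table: the cleaned, typed destination of the ETL process.
dwh.execute("CREATE TABLE sales (customer_id INTEGER, name TEXT, revenue REAL)")

dwh.execute("INSERT INTO stg_sales VALUES (' 17 ', '  Alice GmbH ', '199.90')")

# Preprocessing happens inside the staging area before the final load.
dwh.execute("""
    INSERT INTO sales (customer_id, name, revenue)
    SELECT CAST(TRIM(customer_id) AS INTEGER), TRIM(name), CAST(revenue AS REAL)
    FROM stg_sales
    WHERE TRIM(customer_id) <> ''
""")
dwh.execute("DELETE FROM stg_sales")  # the staging area is temporary by design
print(dwh.execute("SELECT * FROM sales").fetchall())  # [(17, 'Alice GmbH', 199.9)]
```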

Since data from a wide variety of sources is brought together in the DWH, data integration relies on various tools that enable transformation and cleansing of the extracted data. These can be assigned to the following categories:

  • Data migration tools: Programs for data migration make it possible to define simple transformation rules in order to convert heterogeneous source data into a uniform target format.
  • Data scrubbing tools: In the context of data scrubbing, programs are used that are based on fuzzy logic and neural networks. The aim is to improve data quality by correcting errors, gaps and duplicates in data records using predefined rules, algorithms or lookup tables (LUTs); this is also referred to as quality management (a simplified sketch follows this list).
  • Data auditing tools: Data auditing tools are used in the context of data integration to determine rules and relationships between data. In addition, programs of this type make it possible to identify data that violates these rules and is therefore likely to be incorrect.
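
Real scrubbing tools rely on fuzzy logic or neural networks; as a simplified stand-in, the following sketch combines a lookup table with plain string similarity from the Python standard library to correct known errors and drop likely duplicates. The LUT entries, field names and threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

# Lookup table (LUT) of known corrections, as used by scrubbing tools.
LUT = {"Muenchen": "München", "N/A": None}

def scrub(records, similarity_threshold=0.9):
    """Correct known errors via the LUT and drop near-duplicate records."""
    cleaned = []
    for record in records:
        record = {k: LUT.get(v, v) for k, v in record.items()}
        is_duplicate = any(
            SequenceMatcher(None, record["name"], kept["name"]).ratio()
            >= similarity_threshold
            for kept in cleaned)
        if not is_duplicate:
            cleaned.append(record)
    return cleaned

rows = [{"name": "Maier GmbH", "city": "Muenchen"},
        {"name": "Meier GmbH", "city": "München"},  # likely the same customer
        {"name": "Schulz AG", "city": "N/A"}]
print(scrub(rows))  # the near-duplicate and the 'N/A' placeholder are cleaned
```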

Data integration is followed by the transfer of the extracted data to the central database, the so-called core data warehouse. This step is supported by programs that provide the following functions (sketched below):

  • Check constraints
  • Sort data
  • Calculate aggregations
  • Calculate access structures
  • Partitioning data for efficient access
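
The following sketch illustrates the first three functions plus partitioning on plain Python data structures; building the actual access structures (e.g. indexes) would be left to the target database. All field names and rules are illustrative assumptions:

```python
from collections import defaultdict

def prepare_for_core_dwh(rows):
    """Prepare transformed rows (dicts with region, customer_id, revenue)
    for loading into the core data warehouse."""
    # 1. Check constraints: reject rows that violate basic integrity rules.
    valid = [r for r in rows
             if r["customer_id"] is not None and r["revenue"] >= 0]

    # 2. Sort data, e.g. by the key the core tables are clustered on.
    valid.sort(key=lambda r: (r["region"], r["customer_id"]))

    # 3. Calculate aggregations that later OLAP queries will need.
    revenue_by_region = defaultdict(float)
    for r in valid:
        revenue_by_region[r["region"]] += r["revenue"]

    # 4. Partition data for efficient access, here horizontally by region.
    partitions = defaultdict(list)
    for r in valid:
        partitions[r["region"]].append(r)

    return valid, dict(revenue_by_region), dict(partitions)

rows = [{"region": "EU", "customer_id": 2, "revenue": 10.0},
        {"region": "EU", "customer_id": 1, "revenue": 5.0},
        {"region": "US", "customer_id": 3, "revenue": -1.0}]  # violates a rule
valid, aggregates, partitions = prepare_for_core_dwh(rows)
print(aggregates)  # {'EU': 15.0}
```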