What were the people doing before MapReduce

Big Data Analytics with Hadoop and MapReduce

In view of the steadily growing amounts of data, intelligent and targeted data analyzes form the basis for the entire company performance. The right combination of methods and technologies is increasingly being used to decide which companies are ahead of the curve. The variety of key technologies for big data analytics is becoming more and more opaque and complex. With this in mind, we would like to introduce the most important technologies with this series of articles.

Hadoop - The Big Data Enabler

Hadoop allows as an open source technology simultaneous distributed processing of mass data. The data is processed by several computers in a network at the same time. The underlying file system for data processing distributed over several computers in a network is called Hadoop Distributed File System (HDFS) designated. This file system enables the distribution and parallel processing of data over several computers connected to a network in a so-called Hadoop cluster. In order to be able to operate a Hadoop cluster, nothing more than commercially available hardware is required. On the software side, Hadoop has low demands, because every Hadoop computer (node) requires nothing more than Linux as the operating system and Java as the framework.

Hadoop draws on the ability to process large amounts of data by parallelizing the mapping processes in the MapReduce algorithm, which can also be used to evaluate unstructured data very well. Only with the Processing of databases in the multi-digit terabyte range The advantages of Hadoop with distributed and parallel computing come into their own, as each individual computer only evaluates a subset and the data is only merged again shortly before the reducer process.

The principle of MapReduce can easily be explained with the following example:
Assume that the highest age information per place of residence is to be determined in a social media network (such as Facebook) in order to clarify the question of which city has the oldest person who is still alive. Since established social media networks manage a lot of data, the data is distributed over several servers. A map process would be on all servers (Data nodes) collect all years of life per active member in a table and then assign them to the respective place of residence. Here all age information becomes clear place of residence mapped. Since each node accomplishes this task for itself on the basis of its own database, this mapping process is parallelized.

A server, on the other hand, has a special role (Name node), because this has not only initiated the mapping process, but also receives the results from all data nodes and aggregates them into the result (reduce process). For this example, the name node sorts the age information for all places of residence and reduces the result to the numerically highest age information in each case, so that only one result remains for each city. MapReduce may seem quite simple at first, but more complex analyzes require several combined MapReduce processing.

However, Hadoop is not able to evaluate data in almost real time. However, there are add-on projects compatible with HDFS, such as Apache Spark and Apache Flink, for this purpose.

 

About DATANOMIQ

DATANOMIQ is a leading solution and service partner for Business analytics and Data science. We open up undreamt-of potential for companies of all sizes and in all industries, which was previously beyond our reach. With a team of first-class data scientists, we use the world's most innovative technologies such as Process Analytics, Predictive modeling or Machine learningto discover new ways to improve business performance. DATANOMIQ achieves measurable results that make everyday work easier and more successful - 'simplicity at work'.

 

Press contact

DATANOMIQ GmbH
Mr. Benjamin Aunkofer
Ostendstrasse 25
D-12459 Berlin
Tel: +49 (0) 30 20653828
I: www.datanomiq.de
E: [email protected]