Hadoop is shaking up the big data market

The Hadoop ecosystem

The free Hadoop project hosted by the Apache Software Foundation (ASF) comprises only the two core components MapReduce and HDFS and is also known as Hadoop Core. However, few experts are likely to be able to set up their own big data project on Hadoop Core alone. Over the years, a whole ecosystem of additional projects has grown up around Hadoop Core, making Hadoop more usable, more secure and more flexible, and enabling SQL-style access to the data. The individual Hadoop distributions bundle some of these projects with Hadoop Core and their own developments into an easy-to-use overall system. Hortonworks, one of the driving forces behind Hadoop, also calls this ecosystem the project network. Each individual project is dedicated to a single main function, has its own community of developers and follows its own release cycles.

  1. Apache YARN - Hadoop architecture
    Since version 2.0 of the MapReduce framework (MRv2), first introduced with Hadoop 0.23 and part of the stable Hadoop 2.x releases, the framework has been called NextGen MapReduce (YARN). YARN compensates for a number of deficits of the previous version in the areas of real-time capability and security. The fundamental innovation in the YARN architecture is the split of the JobTracker's two main functions, resource management and job scheduling/monitoring, into two separate daemons: a global ResourceManager and a per-application ApplicationMaster.
  2. Apache Hive query
    The graphic shows the query editor in Apache Hive. Hive extends the MapReduce framework by an abstraction level with the SQL-like query language HiveQL. HiveQL makes it possible to send classic queries and thus analyze the data stored in HDFS (a minimal JDBC sketch follows after this list). You can also think of Hive as the data warehouse component of the Hadoop framework.
  3. Apache HCatalog management
    HCatalog handles metadata management in Hadoop: it exposes Hive's table metadata to other tools such as Pig and MapReduce. The graphic shows a table list in HCatalog.
  4. Apache Pig - Scripting Engine
    The graphic shows the Apache Pig scripting engine, which lets users formulate data transformations in the Pig Latin scripting language instead of writing raw MapReduce code.
  5. Apache Knox security
    Apache Knox operates primarily at the cluster level and extends the Hadoop security model with authentication roles for all users who access the cluster's data.
  6. Hortonworks and Hadoop
    This is how Hortonworks envisions a modern data architecture: at the data-system level, big data solutions such as Hadoop operate on an equal footing with SQL databases such as SQL Server, Oracle or SAP HANA and can exchange data with each other if necessary. Applications such as SAP access the available data as needed.
  7. Cloudera and Hadoop
    The Hadoop distribution "CDH" from Cloudera is labeled "enterprise ready" by the provider and contains a number of in-house developments. At its core, however, CDH also uses YARN for workload management and either HDFS or HBase as the storage engine.
  8. Amazon EMR and Hadoop
    Amazon Elastic MapReduce (EMR) now supports all Hadoop versions from 0.20 through 1.0.3 up to the current versions 2.2 and 2.4. In addition, EMR also works with the cluster types Hive, Custom JAR, Pig, HBase and Streaming.
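
As a concrete illustration of the Hive item above, the following minimal Java sketch sends a HiveQL query to a running HiveServer2 instance via the standard Hive JDBC driver. Host name, credentials and the employees table are hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlExample {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // A classic SQL-style aggregation; Hive compiles it into
                // jobs that run over the data stored in HDFS.
                ResultSet rs = stmt.executeQuery(
                        "SELECT department, COUNT(*) AS cnt "
                        + "FROM employees GROUP BY department");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }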

Since the release of Hadoop 2 / YARN, however, the system no longer sees itself as a pure MapReduce runtime environment. Rather, YARN combines a number of individual projects for data access under the name "data operating system", including the Pig scripting engine. The projects Hive, Tez and HCatalog provide SQL-style access to the stored data, while HBase and Accumulo add NoSQL databases. Solr serves as the search engine within the YARN framework, Storm enables real-time processing of streaming data, and Spark contributes a powerful in-memory processing engine.

Apache Hive extends the MapReduce framework by an abstraction level with the SQL-like query language HiveQL. HiveQL makes it possible to send classic queries and thus analyze the data stored in HDFS; Hive can therefore also be thought of as the data warehouse component of the Hadoop framework. HCatalog for metadata management and the NoSQL database HBase are also of great importance for Hadoop. HBase is used when Hadoop's essentially batch-oriented mode of operation, optimized for writing data once and reading it many times, does not fit the problem, or when the data needs to be manipulated.
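
Where such manipulation is needed, applications talk to HBase through its client API. The following minimal Java sketch, based on the newer HBase client interface (HBase 1.x and later), writes and re-reads a single cell; table, column family and row key are invented for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("clickstream"))) {
                // Unlike files in HDFS, an HBase row can be updated in place.
                Put put = new Put(Bytes.toBytes("user-4711"));
                put.addColumn(Bytes.toBytes("visits"), Bytes.toBytes("last_url"),
                        Bytes.toBytes("/products/42"));
                table.put(put);
                // Random reads by row key are HBase's core strength.
                Result result = table.get(new Get(Bytes.toBytes("user-4711")));
                System.out.println(Bytes.toString(result.getValue(
                        Bytes.toBytes("visits"), Bytes.toBytes("last_url"))));
            }
        }
    }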

Apache Spark

Another popular framework for real-time data analysis is Apache Spark. Spark provides APIs for Java, Scala and Python and can natively read data from HDFS (Hadoop Distributed File System), the Hadoop database HBase and the Cassandra data store. Thanks to its in-memory technology, Spark queries and analyzes data much faster than the MapReduce implementation under YARN (Yet Another Resource Negotiator). While real-time analysis with Hadoop 1.x only worked with additional products, YARN is more flexible in this regard: with YARN, MapReduce is only one of several ways to run a Hadoop cluster. Apache Spark is certified for the Hortonworks Data Platform as part of the Hortonworks YARN Ready Program and thus fits seamlessly into the YARN architecture.
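
A rough Java sketch of the in-memory approach: the filtered dataset is cached once, so every subsequent query runs against memory rather than re-reading HDFS. The file path is a placeholder, and the lambda syntax assumes Spark 1.x or later with Java 8.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkLogAnalysis {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SparkLogAnalysis");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Read a log file directly from HDFS (path is a placeholder).
                JavaRDD<String> lines = sc.textFile("hdfs:///logs/access.log");
                // cache() keeps the filtered data in memory, so every further
                // query runs against RAM instead of re-reading from disk,
                // which is the source of Spark's speed advantage over MapReduce.
                JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache();
                System.out.println("Errors total:   " + errors.count());
                System.out.println("Timeout errors: "
                        + errors.filter(l -> l.contains("timeout")).count());
            }
        }
    }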

In addition to Spark, Hadoop users can implement real-time analysis with the Apache Lucene-based Elasticsearch project, the Swiss Army knife for Hadoop applications, so to speak. Elasticsearch can evaluate data from CRM and ERP systems and process click streams and log information. In addition, the Hadoop framework includes projects from the areas of security, corporate workflow and governance integration, which extend Hadoop with additional functions where necessary and are used in the Hadoop distributions. Apache ZooKeeper, for example, coordinates the numerous distributed processes.
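
To give an idea of the coordination primitives ZooKeeper offers, here is a minimal Java sketch in which a process registers itself under an ephemeral znode. The ensemble address and the paths are placeholders, and the parent node /workers is assumed to already exist.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
        public static void main(String[] args) throws Exception {
            // Connect to the ZooKeeper ensemble (address is a placeholder);
            // production code would wait for the connection event first.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {});
            // An ephemeral znode disappears automatically when its session
            // ends, a simple building block for worker registration,
            // leader election and service discovery.
            zk.create("/workers/worker-1", "host:4711".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // All coordinated processes see the same view of /workers.
            System.out.println(zk.getChildren("/workers", false));
            zk.close();
        }
    }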

One of the main reasons for using a Hadoop distribution is that installing and administering Hadoop by hand is usually quite time-consuming. The Apache Ambari project promises a remedy here and allows a Hadoop cluster to be installed, administered and monitored via a web interface. Apache Oozie helps when operating a Hadoop cluster by enabling processing chains (workflows) to be created and automated. In addition, Apache Sqoop allows large amounts of data to be imported from and exported to relational databases. Today, Hadoop can access not only relational databases but also a whole range of other data sources. Apache Flume, for example, enables log data to be collected and aggregated.
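
Such a process chain is typically described in a workflow.xml stored in HDFS and submitted to the Oozie server, for instance via Oozie's Java client API. The following sketch assumes a reachable Oozie server; the URL and application path are placeholders.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieExample {
        public static void main(String[] args) throws Exception {
            // Oozie server URL and workflow path in HDFS are placeholders.
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");
            Properties conf = client.createConfiguration();
            // The workflow.xml at this path could chain, for example, a Sqoop
            // import, a Pig transformation and a Hive query into one process.
            conf.setProperty(OozieClient.APP_PATH, "hdfs:///apps/etl-workflow");
            String jobId = client.run(conf);
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println("Workflow " + jobId + ": " + job.getStatus());
        }
    }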