Can process Apache Flink batch data

What is Azure HDInsight?

Azure HDInsight is a comprehensive, managed, open source analytics service in the cloud for businesses. You can use open source frameworks like Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, R, and others.

What are HDInsight and the Hadoop technology stack?

Azure HDInsight is a cloud distribution of Hadoop components. It makes processing large amounts of data simple, fast, and cost-effective. You can use the most popular open source frameworks, such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, and R. With these frameworks, you can enable a wide range of scenarios, such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.

For information about the available components of the Hadoop technology stack in HDInsight, see "What Hadoop components and versions are available in HDInsight?" For more information about Hadoop in HDInsight, see the Azure Features for HDInsight page.

What does "Big Data" mean?

Large amounts of data, i.e. "Big Data", are being recorded in ever larger volumes, at ever higher speeds, and in more formats than ever before. This data can be historical (stored) or real time (streamed from the source). For the most common use cases for big data, see Usage scenarios for HDInsight.

Arguments for using Azure HDInsight

This section lists the capabilities of Azure HDInsight.

Cloud based: With Azure HDInsight, you can create optimized clusters for Hadoop, Spark, Interactive Query (LLAP), Kafka, Storm, HBase, and ML Services in Azure. In addition, HDInsight offers an end-to-end SLA for all of your production workloads.
Inexpensive and scalable: With HDInsight, you can scale workloads up and down. By creating clusters on demand, you pay only for what you actually use, which reduces costs. You can also create data pipelines to operationalize your jobs. The decoupling of compute and storage provides better performance and more flexibility.
Secure and compliant: With HDInsight, you can protect your company's data assets by using Azure Virtual Network, encryption, and integration with Azure Active Directory. In addition, HDInsight meets the most common industry and government compliance standards.
Monitoring: Through integration with Azure Monitor logs, Azure HDInsight offers a single interface for monitoring all of your clusters.
Global availability: HDInsight is available in more regions than any other big data analytics solution. In addition, Azure HDInsight is available in Azure Government, China, and Germany, which lets you meet business requirements in key sovereign areas.
Productivity: With Azure HDInsight, you can use extensive productivity tools for Hadoop and Spark in your preferred development environment. These development environments include Visual Studio, VS Code, Eclipse, and IntelliJ, with support for Scala, Python, R, Java, and .NET. Data analysts can also collaborate using popular notebooks such as Jupyter and Zeppelin.
Extensibility: You can extend HDInsight clusters with installed components (e.g. Hue, Presto) by using script actions, by adding edge nodes, or by integrating other applications that are certified for big data. HDInsight can be integrated with the most popular big data solutions using one-click provisioning.

Usage scenarios for HDInsight

Azure HDInsight can be used for big data processing in a wide variety of scenarios. The data can be historical (already collected and stored) or real time (streamed directly from the source). The scenarios for processing this data can be divided into the following categories:

Batch processing (ETL)

Extract, transform, and load (ETL) is a process in which unstructured or structured data is extracted from heterogeneous data sources, transformed into a structured format, and loaded into a data store. You can use the transformed data for data science or data warehousing purposes.
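
The three ETL stages can be illustrated with a minimal sketch. It is not HDInsight-specific: it uses only the Python standard library, with two in-memory strings as stand-ins for heterogeneous sources and an in-memory SQLite database as a stand-in for the data store. All names in it (`CSV_SOURCE`, `JSON_SOURCE`, `readings`, and the functions) are hypothetical and chosen for illustration only.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two heterogeneous sources.
CSV_SOURCE = "device_id,temp_c\nd1,21.5\nd2,19.0\n"
JSON_SOURCE = '[{"device": "d3", "temperature": "22.25"}]'


def extract():
    """Extract: read raw records from a CSV source and a JSON source."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: normalize both shapes into (device_id, temp_c) tuples."""
    records = [(r["device_id"], float(r["temp_c"])) for r in csv_rows]
    records += [(r["device"], float(r["temperature"])) for r in json_rows]
    return records


def load(records, conn):
    """Load: write the structured records into a table in the data store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temp_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


def run_etl():
    """Run all three stages against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(*extract()), conn)
    return conn


if __name__ == "__main__":
    conn = run_etl()
    print(conn.execute("SELECT device_id, temp_c FROM readings").fetchall())
```

On a real cluster, the same stages would typically be expressed with a framework such as Hive or Spark; the sketch only shows how the stages divide the work.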

Data warehousing

With HDInsight, you can perform interactive queries on petabytes of structured or unstructured data in any format. You can also create models for interfacing with BI tools.

Internet of Things (IoT)

With HDInsight, you can process streaming data received in real time from various types of devices. For more information, see this Azure blog announcing the public preview of Apache Kafka on HDInsight with Azure Managed Disks.
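
One common way to process device data of this kind is to aggregate readings over fixed time windows. The following is a minimal sketch of a tumbling-window average in plain Python, under the assumption that each event is a `(timestamp_seconds, device_id, value)` tuple. The function name and event shape are hypothetical, and the sketch consumes a finite list rather than a live stream, so it shows the windowing logic only, not how Kafka or Storm would deliver the events.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_seconds=60):
    """Average values per (window_start, device_id) over fixed-size windows.

    events: iterable of (timestamp_seconds, device_id, value) tuples
            (a hypothetical event shape used for illustration).
    """
    # Each bucket holds a running [sum, count] for one window and device.
    buckets = defaultdict(lambda: [0.0, 0])
    for timestamp, device_id, value in events:
        # Align the timestamp to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        bucket = buckets[(window_start, device_id)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in buckets.items()}


if __name__ == "__main__":
    sample = [(0, "d1", 10.0), (30, "d1", 20.0), (65, "d1", 30.0), (10, "d2", 5.0)]
    for key, average in sorted(tumbling_window_averages(sample).items()):
        print(key, average)
```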

Data science

With HDInsight, you can build applications that extract critical insights from data. In addition, you can use Azure Machine Learning to forecast future trends for your company. Please see this customer report for more information.


Hybrid

With HDInsight, you can extend your existing on-premises big data infrastructure to Azure and benefit from the advanced analytics capabilities of the cloud.

Cluster types in HDInsight

HDInsight includes specific cluster types and cluster customization features, such as the ability to add components, utilities, and languages. HDInsight offers the following cluster types:

Apache Hadoop: A framework that uses the Hadoop Distributed File System, YARN resource management, and a simple MapReduce programming model for parallel processing and analysis of batch data.
Apache Spark: An open source parallel processing framework that supports in-memory processing to improve the performance of big data analysis applications. See What is Apache Spark in HDInsight?
Apache HBase: A Hadoop-based NoSQL database that provides random access and strong consistency for large amounts of unstructured and semi-structured data, on a scale of potentially billions of rows by billions of columns. See What is HBase in HDInsight?
ML Services: A server for hosting and managing parallel, distributed R processes. This feature gives data analysts, statisticians, and R programmers on-demand access to scalable, distributed analysis methods in HDInsight. See Introduction to R Server and Open Source R Features in HDInsight.
Apache Storm: A distributed real-time computation system for the rapid processing of large data streams. Storm is offered as a managed cluster in HDInsight. See Analyzing Real-Time Sensor Data Using Storm and Hadoop.
Interactive Query: In-memory caching for interactive and faster Hive queries. See Use Interactive Query in HDInsight.
Apache Kafka: An open source platform for building streaming data pipelines and applications. Kafka also offers a message queuing feature that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.
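
The MapReduce programming model mentioned under Apache Hadoop can be illustrated with a word count, the usual introductory example. The following is a single-process sketch in plain Python, not Hadoop code: the function names are hypothetical, and the shuffle step that Hadoop performs across the cluster is simulated here with a dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can in principle run in parallel across many nodes, which is the property the model relies on.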

Open source components in HDInsight

Azure HDInsight enables the creation of clusters with open source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, HBase, and R. By default, these clusters include additional integrated open source components such as Apache Ambari, Avro, Apache Hive, HCatalog, Apache Mahout, Apache Hadoop MapReduce, Apache Hadoop YARN, Apache Phoenix, Apache Pig, Apache Sqoop, Apache Tez, Apache Oozie, and Apache ZooKeeper.

Programming languages in HDInsight

HDInsight clusters, e.g. Spark, HBase, Kafka, Hadoop, and others, support many programming languages. Some programming languages are not installed by default. Use a script action to install any library, module, or package that is not installed by default.

Default programming language support: By default, HDInsight clusters support the following languages:
Java Virtual Machine (JVM) languages: In addition to Java, many other languages can run on a Java Virtual Machine (JVM). However, running some of these languages may require you to install additional components in the cluster. The following JVM-based languages are supported in HDInsight clusters:
  • Clojure
  • Jython (Python for Java)
  • Scala
Hadoop-specific languages: HDInsight clusters provide support for the following languages that are specific to the Hadoop technology stack:
  • Pig Latin for Pig jobs
  • HiveQL for Hive jobs and SparkSQL

Development tools for HDInsight

You can use HDInsight development tools for IntelliJ, Eclipse, Visual Studio Code, and Visual Studio to create and submit HDInsight data queries and jobs, with seamless integration with Azure.

  • Azure Toolkit for IntelliJ
  • Azure Toolkit for Eclipse
  • Azure HDInsight Tools for VS Code
  • Azure Data Lake Tools for Visual Studio

Business Intelligence in HDInsight

Well-known business intelligence (BI) tools can retrieve, analyze, and report on data integrated with HDInsight through either the Power Query add-in or the Microsoft Hive ODBC driver.

Data residency in the region

Spark, Hadoop, LLAP, Storm, and ML Services do not store customer data, so these services automatically meet the data residency requirements in the region, including those specified in the Trust Center.

Kafka and HBase store customer data. Both services automatically store this data in a single region, so they meet the data residency requirements in the region, including those specified in the Trust Center.

Next Steps