What is Azure HDInsight?
Azure HDInsight is a comprehensive, managed, open source analytics service in the cloud for businesses. You can use open source frameworks like Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, R, and others.
What are HDInsight and the Hadoop technology stack?
Azure HDInsight is a cloud distribution of Hadoop components that makes it simple, fast, and cost-effective to process large amounts of data. You can use the most popular open source frameworks, such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, and R. These frameworks enable a wide range of scenarios, such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.
For information about the available components of the Hadoop technology stack in HDInsight, see What Hadoop components and versions are available in HDInsight?. For more information about Hadoop in HDInsight, see the Azure Features for HDInsight page.
What does "Big Data" mean?
Big data is being collected in ever larger volumes, at ever higher speeds, and in more formats than ever before. It can be historical (stored) data or real-time data (streamed from the source). For the most common use cases for big data, see Usage scenarios for HDInsight.
Why use Azure HDInsight?
This section lists the capabilities of Azure HDInsight.
|Cloud-based||With Azure HDInsight, you can create optimized clusters for Hadoop, Spark, Interactive Query (LLAP), Kafka, Storm, HBase, and ML Services on Azure. HDInsight also provides an end-to-end SLA for all of your production workloads.|
|Low-cost and scalable||With HDInsight, you can scale workloads up and down. By creating clusters on demand, you reduce costs because you pay only for what you actually use. You can also build data pipelines to operationalize your jobs. Decoupling compute from storage provides better performance and more flexibility.|
|Secure and compliant||With HDInsight, you can protect your company's data assets by using Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the most common industry and government compliance standards.|
|Monitoring||Through its integration with Azure Monitor logs, Azure HDInsight provides a single interface for monitoring all of your clusters.|
|Global availability||HDInsight is available in more regions than any other big data analytics solution. Azure HDInsight is also available in Azure Government, China, and Germany, which lets you meet business requirements in key sovereign regions.|
|Productivity||With Azure HDInsight, you can use rich productivity tools for Hadoop and Spark in your preferred development environment. These development environments include Visual Studio, Visual Studio Code, Eclipse, and IntelliJ, with support for Scala, Python, R, Java, and .NET. Data analysts can also collaborate using popular notebooks such as Jupyter and Zeppelin.|
|Extensibility||You can extend HDInsight clusters with additional components (for example, Hue or Presto) by using script actions, by adding edge nodes, or by integrating other applications that are certified for big data. HDInsight can be integrated with the most popular big data solutions using one-click provisioning.|
Usage scenarios for HDInsight
Azure HDInsight can be used for big data processing in a wide variety of scenarios. This can be historical (data that has already been collected and stored) or real-time data (data streamed directly from the source). The scenarios for processing this data can be divided into the following categories:
Batch processing (ETL)
Extract, transform, and load (ETL) is a process in which unstructured and structured data is extracted from heterogeneous data sources. The extracted data is then transformed into a structured format and loaded into a data store. You can use the transformed data for data science or data warehousing purposes.
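The three ETL stages described above can be illustrated with a minimal, single-machine sketch. It uses only the Python standard library and an in-memory SQLite database as a stand-in for the target data store; the sample inputs, the `readings` table, and all function names are hypothetical and are not part of any HDInsight API. On a real cluster, the same stages would typically be expressed as Hive, Pig, or Spark jobs.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two heterogeneous sources.
CSV_SOURCE = "device_id,temp_c\nd1,21.5\nd2,not-a-number\nd3,19.0\n"
JSON_SOURCE = '[{"device": "d4", "temperature": 25.0}]'


def extract():
    """Extract: yield raw records from a CSV source and a JSON source."""
    for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
        yield {"device": row["device_id"], "temperature": row["temp_c"]}
    for item in json.loads(JSON_SOURCE):
        yield item


def transform(records):
    """Transform: coerce records to one schema and skip unparseable rows."""
    for rec in records:
        try:
            yield (str(rec["device"]), float(rec["temperature"]))
        except (KeyError, ValueError, TypeError):
            continue  # drop malformed rows instead of failing the whole batch


def load(rows, conn):
    """Load: write the structured rows into a table in the data store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (device TEXT, temperature REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    """Run all three stages and return the loaded rows, ordered by device."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    return conn.execute(
        "SELECT device, temperature FROM readings ORDER BY device"
    ).fetchall()
```

Calling `run_etl()` loads the three valid readings and silently skips the malformed `d2` row. Whether bad rows should be dropped, logged, or routed to a separate store is a design choice that depends on the workload.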
Data warehousing
With HDInsight, you can perform interactive queries on petabytes of structured or unstructured data in any format. You can also create models for interfacing with BI tools.
Internet of Things (IoT)
With HDInsight, you can process streaming data received in real time from various types of devices. For more information, see this Azure blog announcing the public preview of Apache Kafka on HDInsight with Azure Managed Disks.
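As a rough illustration of what processing device data in real time can involve, the following sketch averages sensor readings per device over fixed, non-overlapping (tumbling) time windows. It is plain Python with hypothetical names and event tuples, not code for Kafka, Storm, or any HDInsight API; a streaming engine would apply similar logic continuously to an unbounded stream.

```python
from collections import defaultdict


def tumbling_window_average(events, window_seconds):
    """Average readings per device within fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, device_id, value) tuples.
    Returns a dict mapping (window_start, device_id) to the average value.
    """
    # Each bucket holds [running_sum, count] for one window and device.
    buckets = defaultdict(lambda: [0.0, 0])
    for timestamp, device, value in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        bucket = buckets[(window_start, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in buckets.items()}
```

For example, with a 10-second window, readings at seconds 0 and 5 fall into the window starting at 0, and a reading at second 12 falls into the window starting at 10.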
Data science
With HDInsight, you can build applications that extract critical insights from data. In addition, you can use Azure Machine Learning to forecast future trends for your company. For more information, see this customer report.
Hybrid
With HDInsight, you can extend your existing on-premises big data infrastructure to Azure and benefit from the advanced analytics capabilities of the cloud.
Cluster types in HDInsight
HDInsight includes certain types of clusters and cluster customization features, such as the ability to add components, utilities, and languages. HDInsight offers the following types of clusters:
|Apache Hadoop||A framework that uses the Hadoop Distributed File System, YARN resource management and a simple MapReduce programming model for parallel processing and analysis of batch data.|
|Apache Spark||An open source parallel processing framework that supports in-memory processing to improve the performance of big data analysis applications. See What is Apache Spark in HDInsight?|
|Apache HBase||A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data, potentially billions of rows by millions of columns. See What is HBase in HDInsight?|
|ML Services||A server for hosting and managing parallel, distributed R processes. This feature enables data analysts, statisticians, and R programmers to access scalable, distributed analysis methods in HDInsight when needed. See Introduction to R Server and Open Source R Features in HDInsight.|
|Apache Storm||A distributed real-time computation system for the rapid processing of large data streams. Storm is offered as a managed cluster in HDInsight. See Analyzing Real-Time Sensor Data Using Storm and Hadoop.|
|Apache Interactive Query||In-memory caching for interactive and faster Hive queries. See Use Interactive Query in HDInsight.|
|Apache Kafka||An open source platform for building streaming data pipelines and applications. Kafka also offers a message queuing feature that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.|
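The MapReduce programming model mentioned in the Apache Hadoop row can be sketched as a word count in plain Python. This is a single-process illustration with hypothetical function names, not Hadoop's Java API: on a cluster, the framework runs the map and reduce phases in parallel across nodes and performs the shuffle between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over the input lines in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each `map_phase` call depends only on its own line and each `reduce_phase` call depends only on one key, both phases can be distributed without coordination, which is what makes the model suitable for batch processing of large data sets.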
Open source components in HDInsight
Azure HDInsight enables the creation of clusters with open source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, HBase, and R. By default, these clusters include additional integrated open source components such as Apache Ambari, Avro, Apache Hive, HCatalog, Apache Mahout, Apache Hadoop MapReduce, Apache Hadoop YARN, Apache Phoenix, Apache Pig, Apache Sqoop, Apache Tez, Apache Oozie, and Apache ZooKeeper.
Programming languages in HDInsight
HDInsight clusters, such as Spark, HBase, Kafka, and Hadoop clusters, support many programming languages. Some programming languages are not installed by default. Use a script action to install any library, module, or package that is not installed by default.
|Standard support for programming languages||By default, HDInsight clusters support the following languages:|
|Java Virtual Machine (JVM) languages||In addition to Java, many other languages can also be executed on a Java Virtual Machine (JVM). However, if you are running some of these languages, you may need to install additional components in the cluster. The following JVM-based languages are supported in HDInsight clusters: |
|Hadoop-specific languages||HDInsight clusters provide support for the following languages that are specific to the Hadoop technology stack: |
Development tools for HDInsight
You can use HDInsight development tools such as IntelliJ, Eclipse, Visual Studio Code, and Visual Studio to create and submit HDInsight data queries and jobs - with seamless integration with Azure.
- Azure Toolkit for IntelliJ
- Azure Toolkit for Eclipse
- Azure HDInsight Tools for VS Code
- Azure Data Lake Tools for Visual Studio
Business Intelligence in HDInsight
Well-known business intelligence (BI) tools can retrieve, analyze, and report on data in HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC driver.
Data residency in the region
Spark, Hadoop, LLAP, Storm, and ML Services do not store customer data, so these services automatically meet the in-region data residency requirements, including those listed in the Trust Center.
Kafka and HBase do store customer data. Kafka and HBase automatically store this data in a single region, so these services also meet the in-region data residency requirements, including those specified in the Trust Center.