Introduction to Microsoft Azure HDInsight by Dattatrey Sindhol
2
Agenda
Introduction
Hadoop Distributions
Microsoft Azure HDInsight
Microsoft BI and Data Platform
HDInsight - Use Cases
HDInsight - Typical Implementation
Further Learning
3
Introduction
What is Big Data?
“Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”.
4
Introduction
What is Hadoop?
Hadoop is an open source framework from the Apache Software Foundation, capable of processing very large volumes of heterogeneous data sets in a distributed fashion across clusters of commodity computers and hardware using a simplified programming model.
6
Hadoop Distributions
Amazon Elastic MapReduce (EMR)
Cloudera
Hortonworks
IBM InfoSphere BigInsights
MapR
Pivotal
Teradata
Intel
Azure HDInsight
Reference: How the 9 Leading Commercial Hadoop Distributions Stack Up
7
Which Distribution Should I Use?
Cost
Scalability
Availability
Existing Technology Stack
Existing Infrastructure
Existing Skillset
8
HDInsight - Overview
Microsoft’s Hadoop Distribution in the Cloud
Offers Hadoop on Windows Platform
Based on Hortonworks Data Platform (HDP)
Tightly integrated with Microsoft Technology Stack
11
Why HDInsight?
Microsoft Stack
Runs on Windows
Create & Destroy On-Demand
DFS Implementation in Blob Storage
Store data on Blob Storage for Later Use
Automation using PowerShell
Orchestration/Workflow using SSIS
Scheduling using SQL Agent
BI & Analytics with Power BI
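As an illustration of the "DFS Implementation in Blob Storage" point: HDInsight jobs address data in Blob Storage through wasb:// URIs of the form wasb://<container>@<account>.blob.core.windows.net/<path>. The sketch below only builds such a URI as a string. The helper name wasb_uri and the container, account, and file names are hypothetical and do not come from the slides.

```python
def wasb_uri(container, account, path="", secure=False):
    """Build a wasb:// (or wasbs:// for SSL) URI for a blob path.

    Hypothetical helper for illustration; it only formats a string and
    does not contact Azure.
    """
    scheme = "wasbs" if secure else "wasb"
    # Strip a leading slash so the result has exactly one separator.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Container, account, and path are made-up example values.
    print(wasb_uri("mycontainer", "mystorageaccount", "/example/data/weblogs.txt"))
```

Because the data lives in the storage account and not on the cluster nodes, a URI like this stays valid after the cluster is destroyed and can be reused by a cluster created later.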
12
Considerations
Requires dropping and re-creating the cluster to scale up/down
Storage and Cluster should be in the same Data Center
13
HDInsight Versions
COMPONENT | VERSION 1.6 | VERSION 2.1 | VERSION 3.0 | VERSION 3.1 (Current/Default)
Hortonworks Data Platform (HDP) | 1.1 | 1.3 | 2.0 | 2.1.7
Apache Hadoop & YARN | 1.0.3 | 1.2.0 | 2.2.0 | 2.4.0
Tez | - | - | - | 0.4.0
Apache Pig | 0.9.3 | 0.11.0 | 0.12.0 | 0.12.1
Apache Hive & HCatalog | 0.9.0 | 0.11.0 | 0.12.0 | 0.13.1
HBase | - | - | - | 0.98.0
Apache Sqoop | 1.4.2 | 1.4.3 | 1.4.4 | 1.4.4
Apache Oozie | 3.2.0 | 3.3.2 | 4.0.0 | 4.0.0
Apache HCatalog | 0.4.1 | Merged with Hive | Merged with Hive | Merged with Hive
Apache Templeton | 0.1.4 | Merged with Hive | Merged with Hive | Merged with Hive
Ambari | - | API v1.0 | 1.4.1 | >=1.5.1
Zookeeper | - | - | 3.4.5 | 3.4.5
Storm | - | - | - | 0.9.1
Mahout | - | - | - | 0.9.0
Phoenix | - | - | - | 4.0.0.2.1.7.0-2162
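One practical use of the version table is checking whether a given HDInsight version runs on YARN, which arrived with the Hadoop 2.x line. The sketch below copies only the "Apache Hadoop & YARN" row from the table. The dictionary and function names are hypothetical.

```python
# Hadoop release bundled with each HDInsight version, copied from the
# "Apache Hadoop & YARN" row of the version table.
HADOOP_BY_HDINSIGHT = {
    "1.6": "1.0.3",
    "2.1": "1.2.0",
    "3.0": "2.2.0",
    "3.1": "2.4.0",
}


def runs_yarn(hdinsight_version):
    """Return True when the bundled Hadoop release is 2.x or later."""
    hadoop = HADOOP_BY_HDINSIGHT[hdinsight_version]
    major = int(hadoop.split(".")[0])
    return major >= 2


if __name__ == "__main__":
    for version in sorted(HADOOP_BY_HDINSIGHT):
        engine = "YARN" if runs_yarn(version) else "MapReduce v1"
        print(version, HADOOP_BY_HDINSIGHT[version], engine)
```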
18
Typical Implementation
[Diagram: data flow from sources through Azure Blob Storage and HDInsight to reporting]
Data sources: Transactional, Social, Warehouse, Web Logs, Clickstream, Files (TXT, XML, JSON, ..)
Storage: Azure Blob Storage
Processing: Multi-Node HDInsight Cluster (MapReduce, Hive, Java)
Reporting and Analytics: SSRS, Excel, Power BI
Collaboration: Office 365 / SharePoint
19
Typical Implementation (Contd…)
[Diagram: detailed data flow with data movement, metadata, and automation layers]
Data sources: E-Commerce, Internal Systems, OLTP, Transactional, Warehouse, Web Logs, Social
Other diagram labels: Customers, Internal Systems, Team
Data movement: Sqoop or AzCopy
Storage: Azure Blob Storage
Metadata: Hive Metastore
Processing: Multi-Node HDInsight Cluster (MapReduce, Hive, Pig, Java, Python)
Collaboration, Reporting, and Analytics: SSRS, Excel, Power BI
Automation: PowerShell / SSIS / SQL Agent (Subscription & Cluster Management | Data Movement | Job Execution)
20
Further Reading and Learning Resources
• HDInsight Emulator
• http://azure.microsoft.com
• Learning map for HDInsight: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-learn-map
21
References
• http://msdn.microsoft.com/en-us/library/dn749804.aspx
• http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/
• http://msdn.microsoft.com/en-us/library/dn749848.aspx
• http://msdn.microsoft.com/en-us/library/dn749787.aspx
• http://msdn.microsoft.com/en-us/library/dn749805.aspx
• http://msdn.microsoft.com/en-us/library/dn749876.aspx
22
Related Apache Projects
Term | Description
Ambari / HUE | Deployment, Configuration, and Monitoring
Avro / Parquet / RC / Sequence | Data serialization system
Flume / S4 / Storm | Collection and import of log and event data
HBase / Cassandra | Column-oriented database scaling to billions of rows
HCatalog | Schema and Data Type Sharing over Pig, Hive, and MapReduce
Hive / Drill / Impala | Data Warehouse with SQL-Like Access
Hive-QL/HQL | SQL-Like Language to Query Hive
Mahout | Library of machine learning and data mining algorithms
Pig | High-level programming for Hadoop computations
Oozie | Orchestration and workflow management
Sqoop | Imports data from relational databases
Tez | Application framework for graph
Whirr | Cloud-agnostic deployment of clusters
MapReduce / YARN | MapReduce is a programming model for distributed data processing. MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have Map-Reduce 2.0 (MRv2) or YARN.
Zookeeper | Configuration management and coordination
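To make the MapReduce entry above concrete, here is a minimal single-process sketch of the programming model in plain Python: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. It does not use Hadoop or HDInsight; all function names and the sample log lines are hypothetical, and a real job would run these phases in parallel across cluster nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Made-up web log fragments standing in for files held in Blob Storage.
    logs = ["GET /home", "GET /cart", "POST /cart"]
    print(word_count(logs))
```

Hive and Pig, listed in the same table, generate jobs of this shape from higher-level queries and scripts, so hand-written map and reduce functions are usually reserved for logic those tools cannot express.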