What is Hadoop? Oct 17 2013
DESCRIPTION
A brief “What is Hadoop” intro for the Georgian Partners CTO Conference. It outlines the origins of open source Apache Hadoop and how Hortonworks fits into the picture. There is also a brief introduction to YARN, the new resource negotiation layer.
TRANSCRIPT
WELCOME TO HADOOP Adam Muise – Hortonworks
Who am I?
Why are we here?
Data
“Big Data” is the marketing term of the decade
What lurks behind the marketing and hype is a legitimate movement
forward in dealing with data
You need to deal with Data
Put it away, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
Let’s talk challenges…
Volume
[Slide build: “Volume” repeated many times over]
Storage, Management, Processing all become challenges with Data at
Volume
Traditional technologies adopt a divide, drop, and conquer approach
The solution?
[Diagram: data divided among separate stores: EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW]
Ummm…you dropped something
[Diagram: the same stores (EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW) alongside a large pile of data that none of them holds]
Analyzing the data usually raises more interesting questions…
…which leads to more data
Wait, you’ve seen this before.
[Diagram: piles of data around a “Sausage Factory”, with still more data following]
Your data silos are lonely places.
[Diagram: four separate silos of data: EDW, Accounts, Customers, Web Properties]
… Data likes to be together.
[Diagram: the same four silos: EDW, Accounts, Customers, Web Properties]
New types of data don’t quite fit your pristine view of the world
[Diagram: “My Little Data Empire” next to new data sources: Logs, CDR/SIP, and others marked “?”]
To resolve this, some people take hints from Lord of the Rings…
…and create One-Schema-To-Rule-Them-All…
[Diagram: an EDW holding all the data under a single Schema]
…but that has its problems too.
[Diagram: the EDW and its Schema, surrounded by new data and ETL jobs]
So what is the answer?
Enter the Hadoop.
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
………
Hadoop was created because Big IT never cut it for the Internet
Properties like Google, Yahoo, Facebook, Twitter, LinkedIn
Traditional architecture didn’t scale enough…
[Diagram: the same stack repeated three times, each with Apps, DBs, and a SAN]
Traditional architectures cost too much at that volume…
$/TB
$pecial Hardware
$upercomputing
So what is the answer?
If you could design a system that would handle this, what would it
look like?
It would probably need a highly resilient, self-healing, cost-efficient,
distributed file system…
[Diagram: a grid of Storage nodes]
It would probably need a completely parallel processing framework that
took tasks to the data…
[Diagram: a grid of Storage nodes, each paired with Processing]
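One way to picture “took tasks to the data” is a scheduler that prefers to run each task on a node that already stores the block it needs. The sketch below is a toy illustration only; the node names, block names, and the `assign_tasks` function are made up here and are not Hadoop APIs.

```python
# Toy illustration of "take the task to the data".
# Node names, block names, and assign_tasks are hypothetical.
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores it.

    block_locations: dict mapping block id -> list of nodes holding a copy
    free_slots: dict mapping node -> number of tasks it can still accept
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No holder has capacity: fall back to any free node (a remote read).
            candidates = sorted(n for n, s in slots.items() if s > 0)
            if not candidates:
                raise RuntimeError("no free slots left")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


blocks = {"block1": ["node1", "node2"], "block2": ["node2"], "block3": ["node3"]}
plan = assign_tasks(blocks, {"node1": 1, "node2": 1, "node3": 1})
for block, (node, is_local) in plan.items():
    print(block, "->", node, "(local)" if is_local else "(remote read)")
```

In this example every task lands on a node that holds its block, so no data has to cross the network before processing starts.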
It would probably run on commodity hardware, virtualized machines, and
common OS platforms
[Diagram: the same grid of Storage and Processing nodes]
It would probably be open source so innovation could happen as quickly
as possible
It would need a critical mass of users
{Processing + Storage} =
{MapReduce/YARN + HDFS}
HDFS stores data in blocks and replicates those blocks
[Diagram: the grid of Storage and Processing nodes, with block1, block2, and block3 each stored three times across the nodes]
If a block fails then HDFS always has the other copies and heals itself
[Diagram: the same grid with one failure marked X; block1, block2, and block3 each still appear three times]
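The two HDFS slides above can be sketched as a small simulation: place three copies of each block, lose a node, then copy the affected blocks again until each is back at three. This is a simplified model under stated assumptions. The round-robin placement and the function names are invented for illustration; real HDFS placement also considers racks and is not modelled here.

```python
import itertools

REPLICATION = 3  # three copies per block, as drawn on the slides


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Round-robin placement: each block goes to `replication` distinct nodes."""
    if len(nodes) < replication:
        raise ValueError("need at least as many nodes as replicas")
    ring = itertools.cycle(nodes)
    return {block: [next(ring) for _ in range(replication)] for block in blocks}


def heal(placement, live_nodes, replication=REPLICATION):
    """Drop replicas on dead nodes, then re-copy to live nodes until each
    block is back at the target replication (or no spare nodes remain)."""
    healed = {}
    for block, holders in placement.items():
        alive = [n for n in holders if n in live_nodes]
        if not alive:
            raise RuntimeError(f"all copies of {block} were lost")
        spares = [n for n in live_nodes if n not in alive]
        missing = replication - len(alive)
        healed[block] = alive + spares[:missing]
    return healed


nodes = [f"node{i}" for i in range(1, 10)]
placement = place_blocks(["block1", "block2", "block3"], nodes)
live = [n for n in nodes if n != "node2"]  # node2 fails
healed = heal(placement, live)
print(healed["block1"])
```

A block is only lost in this model if every node holding a copy fails before healing runs, which is why the slides stress replication.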
MapReduce is a programming paradigm that is completely parallel
[Diagram: data split across five Mappers, whose output feeds three Reducers]
MapReduce has three phases: Map, Sort/Shuffle, Reduce
[Diagram: five Mappers emit Key, Value pairs, which are sorted and shuffled to three Reducers]
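The three phases can be illustrated with a word count written in plain Python. This is a single-process sketch of the idea, not Hadoop code; the function names are chosen for this example, and a real job would run many mappers and reducers on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (key, value) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_sort(pairs):
    """Sort/Shuffle: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, then sort/shuffle, then reduce over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]


counts = run_job(["Data likes to be together", "data silos are lonely"])
print(counts)
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what lets the real framework run them in parallel.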
MapReduce applies to a lot of data processing problems
[Diagram: data split across five Mappers, whose output feeds three Reducers]
Introducing YARN
YARN = Yet Another Resource Negotiator
YARN abstracts resource management so you can run more
than just MapReduce
[Diagram: HDFS2 and YARN, with applications alongside: MapReduce V2, MapReduce V?, Storm, MPI, Giraph, HBase, Tez … and more]
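The idea of separating resource management from any one processing model can be sketched with a toy resource manager that grants memory “containers” to whichever application asks. Everything here is hypothetical (the class, its methods, the application names, and the memory sizes); it is not the YARN API, only an illustration of the concept.

```python
class ToyResourceManager:
    """Hands out memory 'containers' from a shared pool to any kind of app."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)  # node -> free memory in MB
        self.grants = []                  # list of (app, node, mb)

    def request(self, app, containers, mb_each):
        """Grant up to `containers` containers of `mb_each` MB.

        Returns the number actually granted, which may be fewer than asked.
        """
        granted = 0
        for node in sorted(self.free):
            while granted < containers and self.free[node] >= mb_each:
                self.free[node] -= mb_each
                self.grants.append((app, node, mb_each))
                granted += 1
        return granted

    def release(self, app):
        """Return every container held by `app` to the pool."""
        kept = []
        for owner, node, mb in self.grants:
            if owner == app:
                self.free[node] += mb
            else:
                kept.append((owner, node, mb))
        self.grants = kept


rm = ToyResourceManager({"node1": 4096, "node2": 4096})
print(rm.request("mapreduce-job", containers=3, mb_each=1024))
print(rm.request("storm-topology", containers=4, mb_each=2048))
rm.release("mapreduce-job")
print(rm.free)
```

The manager never inspects what an application does with its containers, which is the property that lets more than just MapReduce share one cluster.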
YARN turns Hadoop into a smart phone: An App Ecosystem
hortonworks.com/yarn/
Check out the book too…
Preview at: hortonworks.com/yarn/
YARN is an essential part of a balanced breakfast in Hadoop 2.0
Oct 15 2013: Apache Community releases Hadoop 2.2.0
Halloween 2013: Hortonworks releases HDP 2.0 GA
Hadoop has other open source projects…
Hive = {SQL -> MapReduce} SQL-IN-HADOOP
Pig = {PigLatin -> MapReduce}
HCatalog = {metadata* for MapReduce, Hive, Pig, HBase, etc} *metadata = tables, columns, partitions, types
Oozie = Job::{Task, Task, if Task, then Task, final Task}
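To make the “SQL -> MapReduce” idea from the Hive line concrete, here is how a simple GROUP BY count could be expressed as a map step and a reduce step. The query, table, and column names are hypothetical, and this is only a conceptual sketch; it does not show how Hive actually plans or compiles queries.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical query, for illustration only:
#   SELECT page, COUNT(*) FROM weblogs GROUP BY page


def mapper(row):
    """Emit the GROUP BY column as the key and 1 as the value."""
    return row["page"], 1


def reducer(key, values):
    """COUNT(*) for one group is the sum of the 1s emitted for its key."""
    return key, sum(values)


def group_by_count(rows):
    """Map every row, sort by key (standing in for sort/shuffle), then reduce."""
    pairs = sorted(mapper(row) for row in rows)
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


weblogs = [{"page": "/home"}, {"page": "/yarn"}, {"page": "/home"}]
result = group_by_count(weblogs)
print(result)
```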
Falcon
[Diagram: Feeds moving between two Hadoop clusters, labeled DR and Replication]
Flume
[Diagram: JMS, Weblogs, Events, and Files passing through Flume into Hadoop]
Sqoop
[Diagram: Sqoop between DBs and Hadoop]
Ambari = {install, manage, monitor}
HBase = {real-time, distributed-map, big-tables}
Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
Apache Hadoop
[Diagram: Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN]
Hortonworks Data Platform
[Diagram: the same projects: Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN]
What else are we working on?
hortonworks.com/labs/
Hadoop is the new Data Operating System for the Enterprise
© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
There is NO second place
Hortonworks …the Bull Elephant of Hadoop Innovation