2014 July 24: What is Hadoop?
DESCRIPTION
Presentation for the Silicon Peel group at Microsoft Canada HQ, July 24, 2014
TRANSCRIPT
EVERYONE LIKES ELEPHANTS
Adam, [email protected]
Architect, Hortonworks
Who am I?
Who is Hortonworks?
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive Innovation in the platform – We lead the roadmap
100% Open Source – Democratized Access to Data
We do Hadoop successfully.
Support
Professional Services
Training
What is Hadoop? What is everyone talking about?
Data
“Big Data” is the marketing term of the decade in IT
What lurks behind the hype is the democratization of Data: a move to aggregate disparate data silos into one shiny pile of analytic gold
So what are the problems with Big Data?
Let’s talk challenges…
[Slide: the word "Volume" repeated over and over, filling the screen]
Storage, Management, Processing all become challenges with Data at Volume
Traditional technologies adopt a divide, drop, and conquer approach
The solution?
[Slide: separate silos (EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW), each stuffed with Data]
Ummm…you dropped something
[Slide: the same silos (EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW), with Data spilling out around them]
Analyzing the data usually raises more interesting questions…
…which leads to more data
Wait, you’ve seen this before.
[Slide: Data flowing into an Analytics Sausage Factory and coming out as yet more Data]
Data begets Data.
What keeps us from our Data?
“Prices, Stupid passwords, and Boring Statistics.”
- Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
Your data silos are lonely places.
[Slide: lonely silos labeled EDW, Accounts, Customers, Web Properties, each holding its own Data]
… Data likes to be together.
[Slide: the same silos again: EDW, Accounts, Customers, Web Properties]
Data likes to socialize too.
[Slide: EDW, Accounts, Customers, and Web Properties joined by new sources: Machine Data, CDR, Weather Data]
New types of data don’t quite fit into your pristine view of the world.
My Little Data Empire
[Slide: a tidy Data empire confronted by new arrivals (Logs, Machine Data) marked with question marks]
To resolve this, some people take hints from The Lord of the Rings...
…and create One-Schema-To-Rule-Them-All…
[Slide: one giant EDW with a single Schema imposed on all the Data]
…but that has its problems too.
[Slide: the One-Schema EDW fed by ETL jobs on every side]
What if the data was processed and stored centrally? What if you didn't need to force it into a single schema? We call it a Data Lake.
[Slide: a Data Lake at the center, fed by Data Sources, applying Schemas at read time, and feeding the EDW and BI & Analytics]
A Data Lake Architecture enables:
- Landing data without forcing a single schema
- Landing a large volume and variety of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
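The "land first, apply schema later" idea (schema-on-read) can be sketched in a few lines of plain Python. This is a toy illustration, not Hadoop code; the record fields and jobs are invented for the example: raw records land untouched, and each consumer projects its own schema when it reads.

```python
import json

# Raw events land in the lake as-is; no upfront schema is enforced.
raw_lake = [
    '{"user": "al", "amount": "19.99", "ts": "2014-07-24"}',
    '{"user": "bo", "amount": "5.00"}',  # missing a field, but it still lands
    '{"user": "cy", "amount": "7.25", "ts": "2014-07-24", "extra": "x"}',
]

def read_with_schema(lake, fields):
    """Apply a consumer-specific schema at read time (schema-on-read)."""
    for line in lake:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

# A billing job and a sessions job each project their own view of the same raw data.
billing = list(read_with_schema(raw_lake, ["user", "amount"]))
sessions = list(read_with_schema(raw_lake, ["user", "ts"]))
print(billing[0])   # {'user': 'al', 'amount': '19.99'}
print(sessions[1])  # {'user': 'bo', 'ts': None}
```

Note how the record missing a field is not rejected at load time; each reader decides what to do with gaps.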
In most cases, more data is better.
Work with the population, not just a sample.
Your view of a client today.
Male
Female
Age: 25-30
Town/City
Middle Income Band
Product Category Preferences
Your view with more data.
Male
Female
Age: 27 but feels old
GPS coordinates
$65-68k per year
Product recommendations
Tea Party / Hippie
Looking to start a business
Walking into Starbucks right now…
A depressed Toronto Maple Leafs fan
Products left in basket indicate drunk Amazon shopper
Gene Expression for Risk Taker
Thinking about a new house
Unhappy with his cell phone plan
Pregnant
Spent 25 minutes looking at tea cozies
So what is the answer?
Enter the Hadoop.
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
Hadoop was created because traditional technologies just didn't cut it for Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn
Traditional architecture didn’t scale enough…
[Slide: repeated stacks of App, DB, and SAN tiers multiplying as load grows]
Traditional architectures cost too much at that volume…
$/TB
$pecial Hardware
$upercomputing
So what is the answer?
If you could design a system that would handle this, what would it look like?
It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Slide: a grid of Storage nodes]
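HDFS gets that resilience by splitting files into blocks and keeping several copies of each block (3 by default) on different nodes; when a node dies, the missing copies are re-created elsewhere. A toy Python sketch of the idea (the placement policy here is invented for illustration, not the real HDFS algorithm):

```python
import itertools

REPLICATION = 3  # HDFS default replication factor

def place_blocks(blocks, nodes):
    """Assign each block to REPLICATION distinct nodes (toy round-robin placement)."""
    ring = itertools.cycle(range(len(nodes)))
    return {b: {nodes[next(ring)] for _ in range(REPLICATION)} for b in blocks}

def heal(placement, dead, live):
    """Self-healing: re-create replicas lost when a node dies."""
    for b, replicas in placement.items():
        if dead in replicas:
            replicas.discard(dead)
            # copy the block to any live node that doesn't already hold it
            replicas.add(next(n for n in live if n not in replicas))
    return placement

nodes = ["node%d" % i for i in range(1, 7)]
placement = place_blocks(["blk_1", "blk_2", "blk_3"], nodes)
placement = heal(placement, dead="node1", live=[n for n in nodes if n != "node1"])
print(placement)  # every block is back to 3 replicas, none on node1
```

The point of the sketch: no single disk or node is precious, because losing one just triggers re-replication from the surviving copies.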
It would probably need a completely parallel processing framework that took tasks to the data…
[Slide: the same grid, each node now pairing Storage with Processing]
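"Taking the task to the data" is the MapReduce model: a map function runs locally against each node's slice of the data, and a reduce step merges the partial results. A minimal word-count sketch in plain Python (illustrative only; the slices and words are made up, and real jobs run distributed under YARN):

```python
from collections import Counter
from functools import reduce

# Each "node" holds its own slice of the data; map runs where the data lives.
node_slices = [
    "hadoop stores data in hdfs",
    "yarn schedules processing near the data",
    "data data everywhere",
]

def map_phase(text):
    """Map: compute per-slice word counts locally, with no data movement."""
    return Counter(text.split())

def reduce_phase(a, b):
    """Reduce: merge partial counts coming back from the nodes."""
    return a + b

partials = [map_phase(s) for s in node_slices]
totals = reduce(reduce_phase, partials)
print(totals["data"])  # 4
```

Only the small partial counts cross the network, never the raw data, which is what makes the pattern scale.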
It would probably run on commodity hardware, virtualized machines, and common OS platforms
It would probably be open source so innovation could happen as quickly as possible
It would need a critical mass of users
{Processing + Storage}=
{YARN + HDFS}
Want to get your hands dirty?
To do this, we need to install Hadoop right?
Nope.
Enter the Sandbox.
The Sandbox is 'Hadoop in a Can'. It contains one copy of each of the Master and Worker node processes used in a cluster, only in a single virtual node.
[Slide: the full cluster grid shrinking down to a single Linux VM with Storage + Processing]
Getting started with the Sandbox VM:
- Pick your flavor of VM at http://www.hortonworks.com/sandbox
- Start the Sandbox VM and find the IP displayed, then go to it (e.g. http://172.16.130.137)
- Register
- Click on 'Start Tutorials'
- On the left-hand nav, click on 'HCatalog, Basic Pig & Hive Commands'
http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/
In this tutorial you can:
- Land files in HDFS
- Assign metadata with HCatalog
- Use SQL with Hive
- Learn to process data with Pig
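The Hive step in that tutorial is essentially SQL over files that HCatalog has given table structure. As a rough local analogy (plain Python with sqlite3, not Hive, and the table and rows are invented for the example), the shape of the workflow looks like this:

```python
import sqlite3

# "Land" some raw rows, then give them table structure (roughly HCatalog's job)...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trucks (driver TEXT, miles INTEGER)")
conn.executemany("INSERT INTO trucks VALUES (?, ?)",
                 [("ann", 120), ("bob", 80), ("ann", 60)])

# ...then query with ordinary SQL, as you would in Hive.
rows = conn.execute(
    "SELECT driver, SUM(miles) FROM trucks GROUP BY driver ORDER BY driver"
).fetchall()
print(rows)  # [('ann', 180), ('bob', 80)]
```

In the Sandbox, the same GROUP BY query would run in Hive over files sitting in HDFS rather than in a local database.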
Hadoop has other open source projects…
Apache Hadoop
Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez
Hortonworks Data Platform
Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez
What else are we working on?
hortonworks.com/labs/
There is NO second place
Hortonworks. We do Hadoop.