introduction to hadoop v2
![Page 1: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/1.jpg)
Introduction to Hadoop
![Page 2: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/2.jpg)
• Tarjei Romtveit
• Co-founder of Monokkel AS
• Former CTO – Integrasco AS
• My story with Hadoop
www.monokkel.io
![Page 3: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/3.jpg)
• CEO of Monokkel AS
• Former COO of Integrasco AS
• Persistence, processing and presentation of data
Persistence – Processing – Presentation
![Page 4: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/4.jpg)
Bombshell
If you work with data today and do not start learning the Hadoop ecosystem, you may be unemployed soon
![Page 5: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/5.jpg)
Agenda
• Context – Big Data and how to handle it
• What is Hadoop?
• Demo
• Distributions and/or demo
• “Deep dive” into Hadoop architecture
– HDFS
– YARN
– MapReduce
• Languages and ecosystem
![Page 6: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/6.jpg)
What we will not cover
• Security
• Integrations with database X or system Y
• Running Hadoop in production
![Page 7: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/7.jpg)
Big Data
![Page 8: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/8.jpg)
Big Data – hype and hipsters
![Page 9: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/9.jpg)
Big Data
The originator
![Page 10: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/10.jpg)
Big Data – Let’s add some letters
• Volume
• Variety
• Velocity
• Variability
• Veracity / Data quality
and the step-brother
• Complexity
Relatively boring stuff
![Page 11: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/11.jpg)
Big Data – Example
This is a CEO
![Page 12: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/12.jpg)
The Nordic Hotel Tycoon
1600 Hotels in 5 countries
![Page 13: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/13.jpg)
I am a digital champion: The website
![Page 14: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/14.jpg)
I am a digital champion: The desk
![Page 15: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/15.jpg)
I am a digital champion: The external provider
![Page 16: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/16.jpg)
I am a digital champion: The IoT case
![Page 17: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/17.jpg)
I am a digital champion: Social
![Page 18: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/18.jpg)
Houston, we have a problem
• Sales are declining and my stock price is tumbling
![Page 19: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/19.jpg)
The CEO
No clue what is happening
![Page 20: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/20.jpg)
How can the CEO manage his problem?
• Get control over the data
• Implement analytical processes to aid sales
![Page 21: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/21.jpg)
BUT HE DOES NOT WANT TO PAY $10 000 000 000 000 FOR IT
![Page 22: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/22.jpg)
The data he needs to handle
• Volume – Gigabytes/Terabytes
• Variety – Click stream, Voice, emails, sensor data, social data, different languages, timestamp data, transactional data, third party data
• Variability – Various quality
• Velocity – MB per second
![Page 23: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/23.jpg)
The data he needs to handle
• Veracity / Data quality – Inconsistent data quality
• Complexity – Many legacy domain models
![Page 24: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/24.jpg)
How to handle?
Web
Emails
Sensors
Social
Storage
Processing
RDBMS
Search
![Page 25: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/25.jpg)
How to understand?
Web
Emails
Sensors
Social
Storage
Processing
RDBMS
Search
![Page 26: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/26.jpg)
So what does Hadoop solve?
Storage
Processing
![Page 27: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/27.jpg)
What is Hadoop?
![Page 28: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/28.jpg)
What is Hadoop? An operating system for data
![Page 29: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/29.jpg)
An OS needs software on top
![Page 30: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/30.jpg)
Distributions
![Page 31: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/31.jpg)
Distributions
• “Stable” compilation of the Hadoop ecosystem
• Operational tools
• Integration tools and frameworks
• Data governance and data management tools
• Security
![Page 32: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/32.jpg)
Distributions
![Page 33: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/33.jpg)
HADOOP An operating system for data
Layman’s terms
• Store huge files (unstructured) on many machines
• Query and modify data
• Can run sophisticated analytics on top
![Page 34: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/34.jpg)
How to start:
Alt 1
• https://hadoop.apache.org/
• Getting Started
• Download
• Unzip
• bin/hadoop <commandline arguments>
Alt 2
• http://hortonworks.com/products/hortonworks-sandbox/#install
• Install VMware Player or VirtualBox
• Download image (6 GB)
• Install and run (give it lots of memory)
![Page 35: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/35.jpg)
DEMO
– Transform and modify data
– Machine learning with Spark
– Integrate with ElasticSearch
NEXT: ARCHITECTURE AND HOW IT WORKS
![Page 36: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/36.jpg)
DEMO
• Hortonworks Sandbox
• Hortonworks Ambari
• Hortonworks Hue
![Page 37: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/37.jpg)
Hadoop - Architecture
HDFS
YARN
MapReduce
![Page 38: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/38.jpg)
2.X.X
• Hadoop Distributed File System (HDFS)
• YARN (Yet Another Resource Negotiator)
• MapReduce
![Page 39: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/39.jpg)
HDFS
D1
D2
DX
Name Node
Failover Name Node
Client
![Page 40: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/40.jpg)
HDFS
Data Nodes: D1, D2, D3
Name node block index:
B: 1, D1
B: 2, D2
B: 3, D3
B: 4, D1
B: 5, D2
B: 6, D3
![Page 41: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/41.jpg)
HDFS
Data Nodes: D1, D2, D3
Name node block index:
B: 1, D1
B: 2, D2
B: 3, D3
B: 4, D1
B: 5, D2
B: 6, D3
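The block index above maps each block to the data node that holds it. A minimal Python sketch of that idea follows. It assumes plain round-robin assignment, which happens to reproduce the mapping on the slide; the function name `build_block_index` is hypothetical, and real HDFS block placement is more involved (for example, it is rack-aware).

```python
from itertools import cycle


def build_block_index(block_ids, data_nodes):
    """Assign each block to a data node in round-robin order (assumed policy)."""
    nodes = cycle(data_nodes)
    return {block_id: next(nodes) for block_id in block_ids}


# Six blocks spread over three data nodes, as on the slide.
block_index = build_block_index(range(1, 7), ["D1", "D2", "D3"])

for block_id, node in block_index.items():
    print(f"B: {block_id}, {node}")
```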
![Page 42: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/42.jpg)
HDFS Write
Client to Name Node: I need to write a document!
/path/to/document1, R:2, B:{1,2}
![Page 43: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/43.jpg)
HDFS Write
Client to Name Node: I need to write
/path/to/document1, R:2, B:{1,2}
/path/to/document1, R:2, B:{3,4}
/path/to/document1, R:2, B:{5,6}
![Page 44: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/44.jpg)
HDFS Write
Name Node to Client: You can write to D1, D2, D3
Data Nodes: D1, D2, D3
![Page 45: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/45.jpg)
HDFS Write
Client: Split and write
B:{D1:1,D2:2}
B:{D3:3,D1:4}
B:{D2:5,D3:6}
Name Node
Data Nodes: D1, D2, D3
![Page 46: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/46.jpg)
HDFS Write
Client
Name Node
D1
D2
D3
Replicate B:1 to D2:2
Success
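The write path on these slides (split the document, ask where to write, write each piece to R data nodes) can be sketched in a few lines of Python. This is a simplified illustration, not the HDFS client: `split_into_blocks` and `plan_write` are hypothetical names, the block size is tiny so the example stays readable, and replicas are placed by rotating through the node list, an assumption chosen because it reproduces the D1/D2, D3/D1, D2/D3 placement shown above.

```python
def split_into_blocks(data, block_size):
    """Cut the payload into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def plan_write(data, block_size, data_nodes, replication):
    """Return (chunk, replica_nodes) pairs for every chunk of the payload."""
    plan = []
    for index, chunk in enumerate(split_into_blocks(data, block_size)):
        # Rotate through the node list so replicas spread evenly (assumption).
        start = (index * replication) % len(data_nodes)
        replicas = [
            data_nodes[(start + r) % len(data_nodes)] for r in range(replication)
        ]
        plan.append((chunk, replicas))
    return plan


payload = b"Deer Bear River Car Car River"
plan = plan_write(payload, block_size=10, data_nodes=["D1", "D2", "D3"], replication=2)

for chunk, replicas in plan:
    print(chunk, "->", replicas)
```

With R:2, losing any single data node still leaves one copy of every chunk.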
![Page 47: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/47.jpg)
HDFS Read
Client
Name Node
D1
D2
D3
I want to read /path/to/document1
B:{D3:3,D3:6}
B:{D2:2,D2:5}
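The read path can be sketched the same way: the client asks the name node where the blocks of a path live, then fetches each block from a data node that holds a copy. The Python below is a toy model under stated assumptions: the dictionaries `BLOCK_LOCATIONS` and `DATA_NODES` and the function `read_file` are hypothetical, and blocks use one logical ID per chunk rather than the per-replica numbering on the slides.

```python
# Name node view: path -> ordered list of (block id, nodes holding a replica).
BLOCK_LOCATIONS = {
    "/path/to/document1": [
        (1, ["D1", "D2"]),
        (2, ["D3", "D1"]),
        (3, ["D2", "D3"]),
    ]
}

# Data node view: node -> {block id: bytes stored on that node}.
DATA_NODES = {
    "D1": {1: b"Deer Bear ", 2: b"River Car "},
    "D2": {1: b"Deer Bear ", 3: b"Car River"},
    "D3": {2: b"River Car ", 3: b"Car River"},
}


def read_file(path, live_nodes):
    """Reassemble a file by reading each block from the first live replica."""
    parts = []
    for block_id, replicas in BLOCK_LOCATIONS[path]:
        node = next((n for n in replicas if n in live_nodes), None)
        if node is None:
            raise IOError(f"no live replica for block {block_id}")
        parts.append(DATA_NODES[node][block_id])
    return b"".join(parts)


print(read_file("/path/to/document1", live_nodes={"D1", "D2", "D3"}))
# The read still succeeds when one data node is down.
print(read_file("/path/to/document1", live_nodes={"D2", "D3"}))
```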
![Page 48: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/48.jpg)
HDFS Delete/Update
• HDFS blocks are immutable: you cannot change them!
• Deletes and updates are written as new blocks
• The name node takes care of overwriting deleted blocks
• Small files consume a lot of name node memory
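The point about small files can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a commonly cited rule of thumb of roughly 150 bytes of name node memory per file or block object, and a 128 MB block size; both figures are assumptions for illustration, not measurements, and `name_node_bytes` is a hypothetical helper.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block entry


def name_node_bytes(num_files, blocks_per_file):
    """Estimate name node memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Roughly 10 TB stored as ten million 1 MB files (one block each) ...
small_files = name_node_bytes(num_files=10_000_000, blocks_per_file=1)

# ... versus roughly the same volume as 1000 files of 10 GB (80 blocks of 128 MB each).
large_files = name_node_bytes(num_files=1_000, blocks_per_file=80)

print(f"small files: {small_files / 1e9:.1f} GB of name node memory")
print(f"large files: {large_files / 1e6:.1f} MB of name node memory")
```

Under these assumptions the same data costs a few gigabytes of name node memory as small files and a few megabytes as large ones.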
![Page 49: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/49.jpg)
HDFS Scalability
D1
D2
DX
Name Node
Failover Name Node
![Page 50: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/50.jpg)
YARN
HOW DOES HADOOP PROCESS THE DATA STORED IN HDFS?
![Page 51: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/51.jpg)
YARN
Client
Resource Manager
Scheduler
Applications manager
I want to process file “document1” with my-app.jar.
![Page 52: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/52.jpg)
YARN
Resource Manager
Scheduler
Applications manager
You can process on D1!
![Page 53: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/53.jpg)
YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Applications manager
Start my-app.jar
Application Master
![Page 54: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/54.jpg)
YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Applications manager
Application Master
AM to RM: “document1” is located on D1 and D2 and I need X GB of RAM
![Page 55: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/55.jpg)
YARN
D1 D2
Node Manager Node Manager
Application Master Container
Resource Manager
Scheduler
Applications manager
my-app.jar is running here!
Start my-app.jar
![Page 56: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/56.jpg)
YARN + HDFS
D1
D2
D3
Name Node
Client
Client
Client
• YARN will try to make sure data is processed where it is stored
• … this is data locality
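Data locality can be illustrated with a small scheduling sketch: prefer a node that already stores the block, and fall back to any other node with enough free memory. This is a toy model, not the YARN scheduler; `pick_node` and the memory figures are hypothetical.

```python
def pick_node(block_replicas, free_memory_gb, needed_gb):
    """Return (node, is_local); prefer nodes that already hold the block."""
    local = [n for n in block_replicas if free_memory_gb.get(n, 0) >= needed_gb]
    if local:
        return local[0], True
    # No local node has room: fall back to any node with enough memory.
    remote = [n for n, free in sorted(free_memory_gb.items()) if free >= needed_gb]
    if remote:
        return remote[0], False
    return None, False


# "document1" is stored on D1 and D2; D1 is busy, so D2 gets the container.
print(pick_node(["D1", "D2"], {"D1": 0, "D2": 4, "D3": 8}, needed_gb=2))

# Neither D1 nor D2 has room, so the work runs on D3 and reads over the network.
print(pick_node(["D1", "D2"], {"D1": 0, "D2": 1, "D3": 8}, needed_gb=2))
```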
![Page 57: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/57.jpg)
YARN + HDFS
• Blocks are immutable. This enables high write speeds
• Data is schema-free! You can store any data you want
• Data locality is what differentiates HDFS from other data storage
• You can read massive amounts of data, limited only by disk read speed
![Page 58: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/58.jpg)
MapReduce and others
OK… BUT HOW DO I PROCESS?
![Page 59: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/59.jpg)
YARN
Tez MapReduce <Name here>
Libraries: Mahout, MLlib, GraphX, Oryx
Languages: Hive, Pig, R, Spark SQL, Stinger
![Page 60: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/60.jpg)
YARN
Tez <Name here>
Languages: Hive, Pig, R, Spark SQL, Stinger
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
MapReduce
![Page 61: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/61.jpg)
MapReduce
![Page 62: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/62.jpg)
Document
Deer Bear River
Car Car River
Deer Car Bear
Document stored in HDFS
![Page 63: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/63.jpg)
Splitting
Deer Bear River
Car Car River
Deer Car Bear
Split 1: Deer Bear River
Split 2: Car Car River
Split 3: Deer Car Bear
![Page 64: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/64.jpg)
Mapping
Deer Bear River: Deer 1, Bear 1, River 1
Car Car River: Car 1, Car 1, River 1
Deer Car Bear: Deer 1, Car 1, Bear 1
![Page 65: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/65.jpg)
Shuffling
Deer 1, Bear 1, River 1
Car 1, Car 1, River 1
Deer 1, Car 1, Bear 1
Deer 1, Deer 1
Bear 1, Bear 1
Car 1, Car 1, Car 1
River 1, River 1
![Page 66: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/66.jpg)
Reduce
Deer 1, Deer 1: Deer 2
Bear 1, Bear 1: Bear 2
Car 1, Car 1, Car 1: Car 3
River 1, River 1: River 2
Result written to HDFS: Deer 2, Bear 2, Car 3, River 2
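The four phases on these slides (splitting, mapping, shuffling, reducing) can be simulated in plain Python on the same three-line document. This is an in-memory sketch of the data flow only: it does not use Hadoop, and the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(groups)


def reduce_phase(word, counts):
    """Sum the values collected for one key."""
    return word, sum(counts)


document = "Deer Bear River\nCar Car River\nDeer Car Bear"

splits = document.splitlines()                                   # splitting
mapped = [pair for line in splits for pair in map_phase(line)]   # mapping
grouped = shuffle_phase(mapped)                                  # shuffling
result = dict(reduce_phase(w, c) for w, c in grouped.items())    # reducing

for word, total in result.items():
    print(word, total)
```

On a cluster each split is mapped in its own container and the shuffle moves data between machines; here everything runs in one process.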
![Page 67: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/67.jpg)
API: Mapper interface
![Page 68: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/68.jpg)
API: Reduce interface
![Page 69: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/69.jpg)
API: Main
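The code shown on the three API slides did not survive in this transcript. As a stand-in, here is a Python sketch of the same three roles (mapper, reducer, driver) written in a line-oriented style similar to what Hadoop Streaming scripts use, where records travel as tab-separated `key<TAB>value` text. It is an assumption-laden analogue, not the Java API from the slides; `mapper`, `reducer` and `main` are hypothetical names, and the `sorted` call stands in for the framework's sort and shuffle.

```python
from itertools import groupby


def mapper(lines):
    """Turn each input line into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must be sorted so equal keys are adjacent."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def main(lines):
    """Driver: run the mapper, sort its output, then run the reducer."""
    mapped = sorted(mapper(lines))  # stands in for the framework's sort/shuffle
    return list(reducer(mapped))


output = main(["Deer Bear River", "Car Car River", "Deer Car Bear"])
print("\n".join(output))
```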
![Page 70: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/70.jpg)
How to run
$ bin/hadoop jar wc.jar WordCount /hdfs/dir/in /hdfs/dir/out
![Page 71: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/71.jpg)
MapReduce
• Mappers and reducers are distributed in YARN containers
• Chaining MapReduce jobs makes them slow
• Easy to scale but difficult to code
• … use the data DSLs instead
![Page 72: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/72.jpg)
Languages
![Page 73: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/73.jpg)
YARN
Tez MapReduce <Name here>
Languages: Hive, Pig, R, Spark SQL, Stinger
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
![Page 74: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/74.jpg)
”Languages”
![Page 75: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/75.jpg)
PIG
• Procedural language
• Executes on YARN
• Great for
– Structuring
– Moving
– Transforming
![Page 76: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/76.jpg)
Hive/Drill/Spark SQL
• Declarative / SQL-like languages
• Great for
– Column data / Database dumps
– Aggregations
– Connecting BI tools and dashboards
• Data Warehouse for Hadoop++
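To show what "declarative" means in practice, here is an aggregation written as a single SQL statement. The sketch runs against Python's built-in `sqlite3` module purely as a stand-in so it is runnable anywhere; it is not Hive, Drill or Spark SQL, and the `bookings` table and its rows are made-up illustration data in the spirit of the hotel example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (hotel TEXT, country TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?)",
    [
        ("Oslo Plaza", "NO", 1200.0),      # made-up rows for illustration
        ("Bergen Inn", "NO", 800.0),
        ("Stockholm Grand", "SE", 1500.0),
    ],
)

# The query states what result is wanted; the engine decides how to compute it.
rows = conn.execute(
    """
    SELECT country, COUNT(*) AS n_bookings, SUM(revenue) AS total_revenue
    FROM bookings
    GROUP BY country
    ORDER BY country
    """
).fetchall()

for country, n_bookings, total_revenue in rows:
    print(country, n_bookings, total_revenue)

conn.close()
```

The contrast with hand-written MapReduce is that the grouping and summing are expressed in one statement rather than as separate mapper and reducer code.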
![Page 77: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/77.jpg)
Spark
• Core language (runs in YARN or standalone)
• Great for
– Anything that MapReduce can do
– Analytics, Machine Learning
• In-memory processing, with APIs in Java, Scala and Python
![Page 78: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/78.jpg)
Summary
• Hadoop is designed to handle and process massive amounts of data through HDFS and/or YARN
• The data does not need to be structured before it is stored in HDFS
• Hadoop is an ecosystem and has languages/frameworks for data extraction, data management, data analysis and data integration
• The most convenient way to begin with Hadoop is to test a distribution, e.g. Hortonworks, Cloudera or MapR
• Learn MapReduce, and learn to understand the languages and a few integration tools
![Page 79: Introduction to hadoop V2](https://reader031.vdocuments.mx/reader031/viewer/2022021423/58ac56bb1a28ab8e258b55e7/html5/thumbnails/79.jpg)
Is it a fad?