Big Data with Apache Hadoop
DESCRIPTION
Slide deck from our seminar about Hadoop (08/10/2014). Topics covered:
- What is Big Data?
- About Apache Hadoop
- HDFS
- MapReduce
- Pig
- Hive
- HBase
- Mahout & Machine Learning
- Other tooling: Sqoop, Oozie, ...
- Hadoop deployment options
- Real-life cases
TRANSCRIPT
![Page 1: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/1.jpg)
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data Science Company
Big Data with Apache Hadoop
12/04/2023
![Page 2: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/2.jpg)
Who am I
BEN VERMEERSCH
Big Data Consultant
Cloudera Certified Developer for Apache Hadoop
![Page 3: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/3.jpg)
About InfoFarm
Data Science
Big Data
Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.
![Page 4: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/4.jpg)
About InfoFarm
2 Data Scientists
4 Big Data Consultants
1 Infrastructure Specialist
![Page 5: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/5.jpg)
Java
PHP E-Commerce
Mobile
Web Development
![Page 6: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/6.jpg)
![Page 7: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/7.jpg)
Agenda
• 09:30 – What is Big Data?
• 09:45 – Hadoop – HDFS & MapReduce
• 10:00 – HDFS & MapReduce in Practice
• 10:30 – The Hadoop Ecosystem
• 11:30 – Examples
• 12:00 – Wrap up and Lunch
![Page 10: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/10.jpg)
What is Big Data not?
• a technology
• a solution (certainly not a silver bullet) to any IT problem
• a replacement for an RDBMS
• a cloud storage system
• …
![Page 11: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/11.jpg)
Big Data definition attempt
“a description of a problem domain with specific challenges and solutions which has become relevant with increasing volume, velocity and variety in business data and the increasing requirements towards processing of this data”
![Page 13: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/13.jpg)
![Page 14: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/14.jpg)
Working the (Hadoop) Big Data way
• Bringing data processing to the data (vs centralized db)
• Using unstructured or semi-structured data
• Store first, process later
• Simple techniques applied at massive scale
• Your hardware will fail!
![Page 15: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/15.jpg)
Hadoop (limited) overview
• Storage: HDFS (Distributed File System), Amazon S3, Local FS
• YARN: Distributed Data Processing
• MapReduce
• HBase: NoSQL
• Hive: Data Mart
• Pig: Scripting
• Sqoop: SQL Import/Export
• Mahout: Machine Learning
• Oozie: Workflow
• …
![Page 18: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/18.jpg)
MapReduce
• A method for distributing tasks across multiple nodes
• Data is processed where it is stored (where possible)
• Two phases:
– Map
– Reduce
• Both phases have key-value pairs as input and output; the keys and values may be chosen by the programmer
• The output from the mappers is used by the reducers
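The two phases can be imitated in plain Python. This is only a single-process sketch of the idea, not the Hadoop API: the word-count task, the function names `map_fn`, `reduce_fn` and `run_job`, and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_fn(_offset, line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce phase: receive all values for one key and emit a single total."""
    yield word, sum(counts)

def run_job(lines):
    # Map: call map_fn once per input record (key = line offset, value = line).
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))

    # Group the intermediate values by key (the framework does this for you).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: call reduce_fn once per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

word_counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(word_counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the grouping step moves data over the network; the sketch only mirrors the data flow.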
![Page 19: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/19.jpg)
Map & Reduce
Mapper input Mapper output Reducer input Reducer output
![Page 20: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/20.jpg)
Map function
Input.txt: Block 1, Block 2, Block 3
Node 1: Block 1, Block 2
Node 2: Block 2, Block 3
Node 3: Block 1, Block 3
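The block layout on this slide is what makes "processing where the data is stored" possible: every block has a copy on two nodes, so a map task can usually run next to its input. The following toy scheduler is an assumption made for illustration and is not how YARN actually places tasks; it simply picks the least-loaded node that holds a replica.

```python
# Replica placement as drawn on the slide: each block is stored on two nodes.
replicas = {
    "Block 1": ["Node 1", "Node 3"],
    "Block 2": ["Node 1", "Node 2"],
    "Block 3": ["Node 2", "Node 3"],
}

def assign_map_tasks(replicas):
    """Give each block's map task to the least-loaded node holding a replica."""
    load = {node: 0 for nodes in replicas.values() for node in nodes}
    assignment = {}
    for block, nodes in replicas.items():
        # min() returns the first node among ties, so the choice is deterministic.
        node = min(nodes, key=lambda n: load[n])
        assignment[block] = node
        load[node] += 1
    return assignment

assignment = assign_map_tasks(replicas)
print(assignment)
```

Every task ends up on a node that already stores its block, so no input data has to be copied before the map phase starts.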
![Page 21: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/21.jpg)
Shuffle and sort
• Hadoop automatically sorts and merges output from all map tasks
• This intermediate process is known as the shuffle and sort
• The result is supplied to reduce tasks
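A rough way to picture the shuffle and sort: each map task produces output sorted by key, the sorted runs are merged, and values that share a key are grouped. The sketch below uses only the Python standard library; the two map outputs and the name `shuffle_and_sort` are hypothetical.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter

# Hypothetical output of two map tasks, each already sorted by key.
map_task_1 = [("data", 1), ("hadoop", 1), ("stores", 1)]
map_task_2 = [("data", 1), ("hadoop", 1), ("processes", 1)]

def shuffle_and_sort(*map_outputs):
    """Merge sorted map outputs and group the values that share a key."""
    merged = merge(*map_outputs, key=itemgetter(0))
    return [(key, [value for _, value in pairs])
            for key, pairs in groupby(merged, key=itemgetter(0))]

reducer_input = shuffle_and_sort(map_task_1, map_task_2)
print(reducer_input)
```

Each `(key, values)` pair in `reducer_input` corresponds to one call of the reduce function.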
![Page 22: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/22.jpg)
Reduce function
• Reducer input comes from the shuffle and sort process
• The reducer:
– receives one record at a time
– receives all records for a given key
– emits zero or more output records
• Example: a reduce function sums the total per person and emits the employee name (key) and total (value) as output
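The example reducer could look roughly like this in Python. The amounts and the name `reduce_total` are made up for illustration; a real Hadoop job would implement the same logic in a Reducer class.

```python
def reduce_total(employee, amounts):
    """Reducer: called once per key with all values for that key."""
    yield employee, sum(amounts)

# Hypothetical reducer input, as produced by the shuffle and sort.
reducer_input = [("Jane", [120, 80]), ("John", [50]), ("Maria", [200, 25, 75])]

totals = [record
          for employee, amounts in reducer_input
          for record in reduce_total(employee, amounts)]
print(totals)
```

The reducer is written as a generator because a reduce call may emit zero, one or many output records.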
![Page 23: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/23.jpg)
MapReduce under the hood
Client
ResourceManager
AppMaster
Node 1
Node 2
Node 3
HDFS
![Page 25: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/25.jpg)
Joining
User Name
1 John
2 Maria
3 Jane
User Comment
1 Cool
2 Nonono
2 Hi there
3 Hadoop is awesome
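Mapper output for the user table (values tagged with A):
Key Value
1 AJohn
2 AMaria
3 AJane
Mapper output for the comment table (values tagged with B):
Key Value
1 BCool
2 BNonono
2 BHi there
3 BHadoop is awesome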
![Page 26: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/26.jpg)
Joining
Key Value
1 AJohn
2 AMaria
3 AJane
Key Value
1 BCool
2 BNonono
2 BHi there
3 BHadoop is awesome
Shuffle/Sort
Reducer input:
Key Values
1 AJohn; BCool
2 AMaria; BNonono; BHi there
3 AJane; BHadoop is awesome
![Page 27: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/27.jpg)
Joining
Reducer
Key Values
1 AJohn; BCool
2 AMaria; BNonono; BHi there
3 AJane; BHadoop is awesome
Userid Name Comment
1 John Cool
2 Maria Nonono
2 Maria Hi there
3 Jane Hadoop is awesome
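The three Joining slides describe a reduce-side join: each mapper tags its records with the table they came from, the shuffle brings all records for one user id together, and the reducer combines them. The sketch below reproduces that flow in a single Python process using the data from the slides; the function names are assumptions.

```python
from collections import defaultdict

users = [(1, "John"), (2, "Maria"), (3, "Jane")]
comments = [(1, "Cool"), (2, "Nonono"), (2, "Hi there"), (3, "Hadoop is awesome")]

def map_users(user_id, name):
    yield user_id, "A" + name        # tag records from the user table with "A"

def map_comments(user_id, comment):
    yield user_id, "B" + comment     # tag records from the comment table with "B"

def reduce_join(user_id, values):
    """Split the tagged values by source table and emit one row per match."""
    names = [v[1:] for v in values if v.startswith("A")]
    texts = [v[1:] for v in values if v.startswith("B")]
    for name in names:
        for text in texts:
            yield user_id, name, text

def run_join(users, comments):
    groups = defaultdict(list)       # stands in for the shuffle and sort
    for user_id, name in users:
        for key, value in map_users(user_id, name):
            groups[key].append(value)
    for user_id, comment in comments:
        for key, value in map_comments(user_id, comment):
            groups[key].append(value)
    rows = []
    for key in sorted(groups):
        rows.extend(reduce_join(key, groups[key]))
    return rows

joined = run_join(users, comments)
for row in joined:
    print(row)
```

The output rows match the Userid / Name / Comment table on the last Joining slide.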
![Page 28: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/28.jpg)
MapReduce Design Patterns
• More info:
• Frameworks on top of MapReduce like Hive or Pig make this easier
![Page 29: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/29.jpg)
The Hadoop Ecosystem
• Storage: HDFS (Distributed File System), Amazon S3, Local FS
• YARN: Distributed Data Processing
• MapReduce
• HBase: NoSQL
• Hive: Data Mart
• Pig: Scripting
• Sqoop: SQL Import/Export
• Mahout: Machine Learning
• Oozie: Workflow
• …
![Page 30: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/30.jpg)
Apache Pig
• Processing framework for (large) datasets
• Pig Latin
• Runs on Hadoop (or local) with MapReduce
• Extensible with UDFs
![Page 32: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/32.jpg)
Apache Hive
• SQL-like querying on Hadoop datasets
• Translates to MapReduce under the hood
• Originally developed at Facebook
• Now Apache Top Level project
![Page 33: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/33.jpg)
Hive <-> Traditional RDBMS
Hive:
• Schema on read
• Fast initial load
• Flexible schema
• No update or delete (only insert into)
• HiveQL (subset of SQL)
Traditional RDBMS:
• Schema on write
• Slow initial load
• Fixed schema
• Updates, deletes, inserts all possible
• SQL compliant
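Schema on read means the raw data is stored untouched and a schema is only applied when it is queried, which is why the initial load is fast. A minimal Python sketch of that idea follows; it is not Hive code, and the sample rows, the `read_with_schema` helper and the choice to turn unparsable fields into `None` are assumptions made for illustration.

```python
# Raw lines are stored as-is: loading them involves no parsing or validation.
raw_rows = ["1\tJohn\t34", "2\tMaria\tn/a", "3\tJane\t29"]

def read_with_schema(lines, schema):
    """Apply (name, converter) pairs while reading; unparsable fields become None."""
    for line in lines:
        record = {}
        for (name, convert), field in zip(schema, line.split("\t")):
            try:
                record[name] = convert(field)
            except ValueError:
                record[name] = None
        yield record

# The schema lives with the query side and can change without reloading the data.
schema = [("id", int), ("name", str), ("age", int)]
rows = list(read_with_schema(raw_rows, schema))
print(rows)
```

A schema-on-write system would instead reject or convert the `n/a` value at load time, which is part of why its initial load is slower.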
![Page 35: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/35.jpg)
HBase
• Column-oriented Data Store
• Distributed
• Type of NoSQL-DB
• Based on Google BigTable
![Page 36: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/36.jpg)
HBase
• Lots and lots of data
• Large number of clients
• Single selects
• Range scan by key
• Variable schema
• Not a traditional RDBMS, so no:
– Transactions
– Group by
– Join
– Where
– Like
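The access patterns listed here (single selects, range scans by key, variable schema) follow from keeping rows sorted by row key. The class below is a toy in-memory model of that idea in Python, not the HBase client API; the class name, the row keys and the column values are hypothetical.

```python
from bisect import bisect_left, insort

class SortedRowStore:
    """Toy in-memory model of a table kept sorted by row key (not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, columns):
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)  # rows may have different columns

    def get(self, row_key):
        """Single select by row key."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Yield rows with start_key <= row key < stop_key, in key order."""
        lo = bisect_left(self._keys, start_key)
        hi = bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

store = SortedRowStore()
store.put("user#001", {"name": "John"})
store.put("user#003", {"name": "Jane", "city": "Kontich"})
store.put("user#002", {"name": "Maria"})
scanned = [key for key, _ in store.scan("user#001", "user#003")]
print(scanned)
```

Lookups and scans by key are cheap in this layout, while anything that needs a join, a group by or a filter on a non-key column has no index to lean on, which is why the slide lists those as missing.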
![Page 38: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/38.jpg)
Sqoop
• Import data from structured data sources (typically RDBMS) into Hadoop
• Export data into structured data sources from Hadoop
• sqoop import --connect jdbc:mysql://localhost/salesdb --table orders
• sqoop export --connect jdbc:mysql://localhost/salesdb --table orders --export-dir /user/test/orders --input-fields-terminated-by '\t'
![Page 39: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/39.jpg)
Mahout
• Scalable Machine Learning
Recommendation
Classification
Clustering
![Page 43: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/43.jpg)
More information:
• Free seminar: Machine Learning in practice
• Fri 7th of November 2014 12:00 – 16:00
• Kontich
• http://www.buzzberry.be/events/
![Page 44: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/44.jpg)
Integrating Hadoop in your IT landscape
JDBC
![Page 45: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/45.jpg)
Tools – BigData – IT options
• Hadoop is not a trivial piece of software to manage!
• On-premise
– Commodity Hardware
– Advantage: full control & performance
– Disadvantage: required skills, migrations, backup, ...
• Cloud – Amazon AWS
– EMR (Elastic MapReduce)
– Storage in S3
– Very competitive offering financially
– Manageability and flexibility
• Cloud – IBM SoftLayer
• Hardware options (performance)
![Page 48: Big Data with Apache Hadoop](https://reader036.vdocuments.mx/reader036/viewer/2022062312/556f68a2d8b42a9d338b4977/html5/thumbnails/48.jpg)
Oak3 Courses
• Data Science
• Hadoop
• HBase
• http://www.oak3.be/