Introduction to the Hadoop-MapReduce Platform
Presented by:
Monzur Morshed, Habibur Rahman
TigerHATS (www.tigerhats.org)
The International Research group dedicated to Theories, Simulation and Modeling, New Approaches, Applications, Experiences, Development, Evaluations, Education, Human, Cultural and Industrial Technology
TigerHATS - Information is power
Hadoop
Hadoop is an open-source implementation of the MapReduce platform and distributed file system, written in Java. This module explains the basics of how to begin using Hadoop to experiment and learn from the rest of this tutorial. It covers setting up the platform and connecting other tools to use it.
Source: http://developer.yahoo.com/hadoop/tutorial/module3.html
What Hadoop is
• Inspired by Google
• Distributed file system similar to Google File System
• Parallel programming model similar to Google MapReduce
• Parallel database similar to Google Bigtable
• Open-source Java project
Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.
Hadoop
• Distributed file system (HDFS)
• Distributed execution framework (MapReduce)
• Query language (Pig)
• Distributed, column-oriented data store (HBase)
• Machine learning (Mahout)
Hadoop Distributed File System
• Cluster file system
• Designed for huge files (many GBs)
• Designed for many streaming reads and infrequent writes
• Not a POSIX file system: requires client-side help
What Hadoop isn’t
• Hadoop is not a “classical” grid solution
• HDFS is not a POSIX file system
• HDFS is not designed for low-latency access to a huge number of small files
• Hadoop MapReduce is not designed for interactive applications
• HBase is not a relational database and does not have transactions or SQL support
• HDFS and HBase are not focused on security, encryption, or multi-tenancy
HDFS, MapReduce
Typical Hadoop Cluster
Commodity Hardware
Typically a two-level architecture:
– Nodes are commodity PCs
– 30-40 nodes per rack
– Uplink from each rack is 3-4 gigabit
– Rack-internal bandwidth is 1 gigabit
HDFS Architecture
[Diagram: a Client contacts the NameNode (assisted by a SecondaryNameNode) with (1) a filename, receives (2) the block IDs and the DataNodes holding them, then (3) reads the data directly from those DataNodes. The NameNode also tracks cluster membership of the DataNodes.]
NameNode: maps a file to a file ID and a list of DataNodes
DataNode: maps a block ID to a physical location on disk
SecondaryNameNode: periodically merges the transaction log
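These three mappings can be sketched as a toy in-memory model. The dictionaries and the `locate` helper below are illustrative inventions for this tutorial, not Hadoop's actual data structures or API:

```python
# Toy model of the HDFS metadata mappings described above (illustrative only).

# NameNode state: file name -> list of block IDs; block ID -> DataNodes with a replica
namenode_files = {"/logs/day1.txt": ["blk_1", "blk_2"]}
namenode_blocks = {"blk_1": ["datanode-a", "datanode-b"],
                   "blk_2": ["datanode-b", "datanode-c"]}

# DataNode state: block ID -> physical location on its local disk
datanode_a_disk = {"blk_1": "/data/dn/current/blk_1"}

def locate(path):
    """Resolve a file to (block ID, replica DataNodes) pairs, as in the read flow above."""
    return [(blk, namenode_blocks[blk]) for blk in namenode_files[path]]

print(locate("/logs/day1.txt"))
```

A client would use such a lookup to then fetch each block directly from one of its DataNodes.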
Data Flow
[Diagram: Web Servers feed Scribe Servers, which write to Network Storage; data then flows into a Hadoop Cluster and on to Oracle RAC and MySQL.]
Image Source: http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
Very large distributed file system
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
Runs in user space on heterogeneous operating systems
HDFS –Hadoop Distributed File System
Data coherency
– Write-once-read-many access model
– Clients can only append to existing files
Files are broken up into blocks
– Typically 128 MB per block
– Each block is replicated on multiple DataNodes
Intelligent client
– The client can find the location of blocks
– The client accesses data directly from the DataNode
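The block splitting described above can be illustrated with a short sketch; `split_into_blocks` is a hypothetical helper for this tutorial, not an HDFS API:

```python
# Sketch: how a file is broken into fixed-size blocks (128 MB, as noted above).
BLOCK_SIZE = 128 * 1024 * 1024  # bytes

def split_into_blocks(file_size):
    """Return the (offset, length) of each block for a file of the given size."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(BLOCK_SIZE, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file needs three blocks: 128 MB, 128 MB, and a 44 MB tail.
print(len(split_into_blocks(300 * 1024 * 1024)))  # 3
```

Each of these blocks would then be replicated independently across DataNodes.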
Distributed File System
Simple data-parallel programming model designed for scalability and fault tolerance
Framework for distributed processing of large data sets
Originally designed by Google
Pluggable user code runs in a generic framework
Pioneered by Google, which processes 20 petabytes of data per day
MapReduce Paradigm
At Google:
– Index construction for Google Search
– Article clustering for Google News
– Statistical machine translation
At Yahoo!:
– “Web map” powering Yahoo! Search
– Spam detection for Yahoo! Mail
At Facebook:
– Data mining
– Ad optimization
– Spam detection
What is MapReduce used for?
In research:
– Astronomical image analysis (Washington)
– Bioinformatics (Maryland)
– Analyzing Wikipedia conflicts (PARC)
– Natural language processing (CMU)
– Particle physics (Nebraska)
– Ocean climate simulation (Washington)
What is MapReduce used for?
MapReduce processing model
How the final multi-node cluster will look
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Data type: key-value records
Map function: (K_in, V_in) -> list(K_inter, V_inter)
Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out)
MapReduce Programming Model
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
Example: Word Count
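Assuming the pseudocode above, a minimal single-process simulation of the map, shuffle, and reduce phases might look like this; `run_local` is an illustrative harness for this tutorial, not part of Hadoop:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word, as in the slide's pseudocode.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_local(lines):
    """Single-process sketch of the map -> shuffle -> reduce phases."""
    groups = defaultdict(list)
    for line in lines:                  # map phase
        for k, v in mapper(line):
            groups[k].append(v)         # shuffle: group values by key
    return dict(kv for k, vs in groups.items() for kv in reducer(k, vs))

print(run_local(["the quick fox", "the lazy dog"]))  # {'the': 2, ...}
```

In a real cluster, the shuffle step is what moves the grouped values across the network to the reducers.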
A single master controls job execution on multiple slaves
Mappers are preferentially placed on the same node or same rack as their input block
– Minimizes network usage
Mappers save outputs to local disk before serving them to reducers
– Allows recovery if a reducer crashes
– Allows having more reducers than nodes
MapReduce Execution Details
1. If a task crashes: retry on another node
• OK for a map because it has no dependencies
• OK for a reduce because map outputs are on disk
• If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
Fault Tolerance in MapReduce
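The retry policy can be sketched as follows. `run_with_retries` is a hypothetical scheduler loop written for this tutorial, not Hadoop's code; the limit of four attempts mirrors Hadoop's default per-task attempt count:

```python
# Sketch of the task-retry policy described above (hypothetical scheduler).
MAX_ATTEMPTS = 4  # mirrors Hadoop's default number of attempts per task

def run_with_retries(task, nodes, skip_bad_records=False):
    """Retry a failed task on other nodes; after MAX_ATTEMPTS, skip or fail."""
    for node in nodes[:MAX_ATTEMPTS]:
        try:
            return task(node)
        except RuntimeError:
            continue  # task crashed on this node; retry on the next one
    if skip_bad_records:
        return None  # ignore this input block (user-controlled)
    raise RuntimeError("task failed %d times; failing the job" % MAX_ATTEMPTS)

failures = {"n1", "n2"}
def flaky(node):
    if node in failures:
        raise RuntimeError("task crashed on " + node)
    return "output-of-" + node

print(run_with_retries(flaky, ["n1", "n2", "n3"]))  # output-of-n3
```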
2. If a node crashes: re-launch its current tasks on other nodes, and re-run any maps the node previously ran
• Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3. If a task is going slowly (straggler):
• Launch a second copy of the task on another node (“speculative execution”)
• Take the output of whichever copy finishes first, and kill the other
Surprisingly important in large clusters
– Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
– A single straggler may noticeably slow down a job
Fault Tolerance in MapReduce
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
• Automatic division of a job into tasks
• Automatic placement of computation near data
• Automatic load balancing
• Recovery from failures and stragglers
Takeaways
1. Search
Input: (lineNumber, line) records
Output: lines matching a given pattern
Map:
    if line matches pattern:
        output(line)
Reduce: identity function
Alternative: no reducer (map-only job)
Some practical MapReduce examples
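A minimal in-process sketch of the map-only search job; the sample records and helper name are illustrative:

```python
import re

def search_mapper(record, pattern):
    """Map for the search job: emit the line if it matches the pattern."""
    line_number, line = record
    if re.search(pattern, line):
        yield line

# Map-only job: with no reducer, the mapper outputs are the final result.
records = [(1, "error: disk full"), (2, "ok"), (3, "error: timeout")]
matches = [out for rec in records for out in search_mapper(rec, r"error")]
print(matches)
```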
2. Sort
Input: (key, value) records
Output: same records, sorted by key
Map: identity function
Reduce: identity function
Trick: pick a partitioning function h such that k1 < k2 implies h(k1) <= h(k2)
Some practical MapReduce examples
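The partitioning trick can be sketched as a range partitioner: because h preserves key order, reducer i receives only keys smaller than those of reducer i+1, so the concatenated reducer outputs are globally sorted. The cut points below are hypothetical split points for three reducers:

```python
# Order-preserving (range) partitioner for the sort job described above.
CUTS = ["g", "p"]  # hypothetical split points for 3 reducers

def h(key):
    """Partitioner such that k1 < k2 implies h(k1) <= h(k2)."""
    for i, cut in enumerate(CUTS):
        if key < cut:
            return i
    return len(CUTS)

# Keys in increasing order land in non-decreasing partitions.
print(h("apple"), h("melon"), h("zebra"))  # 0 1 2
```

In practice, good cut points are chosen by sampling the input keys so the reducers receive balanced loads.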
3. Inverted Index
Input: (filename, text) records
Output: list of files containing each word
Map:
    for word in text.split():
        output(word, filename)
Combine: deduplicate file names for each word
Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
Some practical MapReduce examples
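A single-process sketch of this job, with the combine step folded into the reducer; the `run` harness is illustrative, not a Hadoop API:

```python
from collections import defaultdict

def ii_mapper(filename, text):
    for word in text.split():
        yield word, filename

def ii_reducer(word, filenames):
    # Deduplicate and sort the file names for each word, as described above.
    yield word, sorted(set(filenames))

def run(records):
    """Sketch of map -> shuffle -> reduce for the inverted index job."""
    groups = defaultdict(list)
    for fname, text in records:
        for w, f in ii_mapper(fname, text):
            groups[w].append(f)
    return dict(kv for w, fs in groups.items() for kv in ii_reducer(w, fs))

index = run([("a.txt", "to be or not"), ("b.txt", "to do")])
print(index["to"])  # ['a.txt', 'b.txt']
```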
Inverted Index Example
4. Most Popular Words
Input: (filename, text) records
Output: the top 100 words occurring in the most files
Two-stage solution:
Job 1:
– Create an inverted index, giving (word, list(file)) records
Job 2:
– Map each (word, list(file)) to (count, word)
– Sort these records by count, as in the sort job
Some practical MapReduce examples
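Chained in-process, the two jobs above might be sketched like this; the function names and sample documents are illustrative:

```python
# Two-stage pipeline for "most popular words": Job 1 builds the inverted
# index, Job 2 maps each (word, files) record to (count, word) and sorts.
from collections import defaultdict

def job1(records):
    """Job 1: inverted index as (word -> set of files containing it)."""
    index = defaultdict(set)
    for filename, text in records:
        for word in text.split():
            index[word].add(filename)
    return index

def job2(index, top_n=100):
    """Job 2: (count, word) records sorted by count, descending."""
    counted = [(len(files), word) for word, files in index.items()]
    return sorted(counted, reverse=True)[:top_n]

docs = [("a.txt", "big data"), ("b.txt", "big files"), ("c.txt", "big data sets")]
print(job2(job1(docs), top_n=2))  # [(3, 'big'), (2, 'data')]
```

On a real cluster, Job 1's output files on HDFS would simply become Job 2's input.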
Three ways to write jobs in Hadoop:
– Java API
– Hadoop Streaming (for Python, Perl, etc.)
– Pipes API (C++)
MapReduce in Hadoop
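With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout; the framework sorts the map output by key before the reduce phase. The sketch below simulates that contract in-process for word count; the jar path in the comment is hypothetical:

```python
# Hadoop Streaming sketch: both phases consume and produce text lines.
# A typical invocation looks like (paths hypothetical):
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input in/ -output out/

def map_stream(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_stream(lines):
    """Reducer: lines arrive sorted by key; sum consecutive counts per word."""
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

mapped = sorted(map_stream(["b a b"]))   # the framework sorts between phases
print(list(reduce_stream(mapped)))       # ['a\t1', 'b\t2']
```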
MapReduce architecture
Scope of MapReduce
http://developer.yahoo.com/hadoop/tutorial/module3.html
http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
Hadoop-Mapreduce Tutorial
We introduced the MapReduce programming model for processing large-scale data
We discussed the supporting Hadoop Distributed File System
The concepts were illustrated using a simple example
We reviewed some important parts of the source code for the example
Summary
HDFS is not a POSIX file system, but the Gfarm file system can be used in place of HDFS. Gfarm has the advantage that it supports not only MapReduce applications but also POSIX and MPI-IO applications.
Ref Article:
a) Hadoop MapReduce on Gfarm File System Download:www.hpcs.cs.tsukuba.ac.jp/~mikami/publications/pragma18.pdf
b) Using the Gfarm File System as a POSIX compatible storage platform for Hadoop MapReduce applications
Download: www.shun0102.net/wp-content/uploads/PID2037887.pdf
Gfarm file system for POSIX & MPI-IO support
Thank You