CSE509 Lecture 4
Transcript of Lecture 4 of the CSE509: Web Science and Technology summer course
Muhammad Atif Qureshi
Web Science Research Group
Institute of Business Administration (IBA)
CSE509: Introduction to Web Science
and Technology
Lecture 4: Dealing with Large-Scale Web Data: Large-Scale File Systems and MapReduce
Last Time…
Search Engine Architecture
Overview of Web Crawling
Web Link Structure Ranking Problem
SEO and Web Spam
Web Spam Research
Today
Web Data Explosion
Part I: MapReduce
MapReduce Basics
MapReduce Example and Details
MapReduce Case Study: A Web Crawler Based on the MapReduce Architecture
Part II: Large-Scale File Systems
Google File System Case Study
Introduction
Web data sets can be very large: tens to hundreds of terabytes
Cannot mine on a single server (why?)
“Big data” is a fact of life on the World Wide Web
Larger data sets demand effective, scalable algorithms
Web-scale processing means data-intensive processing
This also applies to startups and niche players
How Much Data?
Google processes 20 PB a day (2008)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s LHC will generate 15 PB a year (??)
Cluster Architecture
[Figure: typical cluster architecture. Racks of commodity nodes, each node with its own CPU, memory, and disk, connected through a rack-level switch; rack switches connect to a higher-level switch. Each rack contains 16-64 nodes, with 1 Gbps bandwidth between any pair of nodes in a rack and a 2-10 Gbps backbone between racks.]
Concerns
If we had to abort and restart the computation every time one component fails, then the computation might never complete successfully
If one node fails, all its files are unavailable until the node is replaced
This can also lead to permanent loss of files
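A back-of-envelope sketch in Python of why this matters at scale; the cluster size and per-node reliability below are illustrative assumptions, not figures from the lecture.

# Failures are the norm, not the exception, at cluster scale.
nodes = 10_000      # cluster size (assumed for illustration)
mtbf_days = 1_000   # mean time between failures per node (assumed)
print(f"expected node failures per day: {nodes / mtbf_days:.0f}")  # -> 10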
Solutions: MapReduce and the Google File System
PART I: MapReduce
July 30, 2011
9
Major Ideas
Scale “out”, not “up” (distributed vs. SMP)
Limits of SMP and large shared-memory machines
Move processing to the data
Clusters have limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, but disk throughput is reasonable (worked numbers below)
Seamless scalability
From the traditional “mythical man-month” to the newly tradable “machine-hour”
(Twenty-one chickens together cannot make an egg hatch in a day)
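To put numbers on the sequential-access bullet above, here is a small worked calculation in Python; the seek time and throughput are assumptions for illustration only.

# Time to process 1 TB of data: sequential scan vs. one seek per record.
data_bytes = 10**12            # 1 TB
record_bytes = 100             # small records (assumed)
seek_s = 0.010                 # 10 ms per random seek (assumed)
throughput_bps = 100 * 10**6   # 100 MB/s sequential read (assumed)

sequential_hours = data_bytes / throughput_bps / 3600
seek_hours = (data_bytes / record_bytes) * seek_s / 3600
print(f"sequential scan: {sequential_hours:.1f} hours")   # ~2.8 hours
print(f"seek per record: {seek_hours:,.0f} hours")        # ~27,778 hours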
Traditional Parallelization: Divide and Conquer
[Figure: divide and conquer. A body of “work” is partitioned into units w1, w2, w3; each unit is processed by a worker, producing partial results r1, r2, r3, which are combined into the final “result”.]
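A minimal runnable sketch of this partition/worker/combine pattern, using Python's standard library; the function names and the toy summing task are illustrative, not from the lecture.

from concurrent.futures import ProcessPoolExecutor

def partition(work, n):
    # Split the work into n roughly equal units w1..wn.
    return [work[i::n] for i in range(n)]

def worker(chunk):
    # Turn one work unit w_i into a partial result r_i (here: a sum).
    return sum(chunk)

def combine(partials):
    # Merge the partial results r1..rn into the final result.
    return sum(partials)

if __name__ == "__main__":
    work = list(range(1_000_000))
    with ProcessPoolExecutor(max_workers=3) as pool:
        partials = list(pool.map(worker, partition(work, 3)))
    print(combine(partials) == sum(work))  # True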
Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
Common Theme
Parallelization problems arise from:
Communication between workers (e.g., to exchange state)
Access to shared resources (e.g., data)
Thus, we need a synchronization mechanism
Parallelization is Hard
Traditionally, concurrency is difficult to reason about (on uniprocessor up to small-scale architectures)
Concurrency is even more difficult to reason about:
At the scale of datacenters (even across datacenters)
In the presence of failures
In terms of multiple interacting services
Not to mention debugging…
The reality: write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything
Solution: MapReduce
Programming model for expressing distributed computations at a massive scale
Hides system-level details from developers
No more race conditions, lock contention, etc.
Separates the what from the how
The developer specifies the computation that needs to be performed
The execution framework (“runtime”) handles the actual execution
What is MapReduce Used For?
At Google: index building for Google Search; article clustering for Google News; statistical machine translation
At Yahoo!: index building for Yahoo! Search; spam detection for Yahoo! Mail
At Facebook: data mining; ad optimization; spam detection
Typical MapReduce Execution
Iterate over a large number of records
Extract something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
Key idea: provide a functional abstraction for these two operations: Map and Reduce
(Dean and Ghemawat, OSDI 2004)
MapReduce Basics
Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
All values with the same key are sent to the same reducer
The execution framework handles everything else…
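To make the contract concrete, here is a single-process simulation in Python of what the framework does between map and reduce; map_reduce() is an illustrative sketch, not the real implementation.

from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    # Map phase: apply the mapper to every (k, v) input record.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(mapper(k, v))
    # Shuffle and sort: group intermediate values by key, so that all
    # values with the same key reach the same reducer call.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reducer call per distinct intermediate key.
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output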
Warm Up Example: Word Count
We have a large file of words, one word to a line
Count the number of times each distinct word appears in the file
Sample application: analyze web server logs to find popular URLs
Word Count (2)
Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory
Case 3: File on disk, too many distinct words to fit in memory:
sort datafile | uniq -c
Word Count (3)
To make it slightly harder, suppose we have a large corpus of documents
Count the number of times each distinct word occurs in the corpus:
words(docs/*) | sort | uniq -c
where words takes a file and outputs the words in it, one to a line
The above captures the essence of MapReduce; the great thing is that it is naturally parallelizable
Word Count using MapReduce
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
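The same pseudocode transcribed into runnable Python, written against the map_reduce() simulation sketched on the MapReduce Basics slide:

def wc_map(doc_name, text):
    # key: document name; value: text of the document
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    # key: a word; values: the list of counts emitted for that word
    return [(word, sum(counts))]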
Word Count Illustration
map(key=url, val=contents):
    for each word w in contents, emit (w, “1”)

reduce(key=word, values=uniq_counts):
    sum all “1”s in values list
    emit result “(word, sum)”

Input documents: “see bob run”, “see spot”, “throw”
Map output: see 1, bob 1, run 1, see 1, spot 1, throw 1
Reduce output: bob 1, run 1, see 2, spot 1, throw 1
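Feeding this slide's input through the Python sketch above reproduces the reduce output shown:

docs = [("doc1", "see bob run"), ("doc2", "see spot"), ("doc3", "throw")]
print(map_reduce(docs, wc_map, wc_reduce))
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]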
Implementation Overview
100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
Limited bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data (SOSP'03)
Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation at Google is a C++ library linked to user programs
Distributed Execution Overview
[Figure: distributed execution overview, adapted from (Dean and Ghemawat, OSDI 2004). (1) The user program submits the job to the master. (2) The master schedules map tasks and reduce tasks onto workers. (3) Map workers read the input splits (split 0 through split 4) and (4) write intermediate files to their local disks. (5) Reduce workers remotely read the intermediate files and (6) write the final output files (output file 0, output file 1).]
MapReduce Implementations
Google has a proprietary implementation in C++, with bindings in Java and Python
Hadoop is an open-source implementation in Java: development led by Yahoo!, used in production; now an Apache project with a rapidly expanding software ecosystem (see the streaming sketch after this list)
Lots of custom research implementations, e.g., for GPUs, Cell processors, etc.
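As a taste of Hadoop from Python, Hadoop Streaming runs any pair of programs that exchange tab-separated key/value lines over stdin/stdout; below is a minimal word-count pair in that style (the file names mapper.py and reducer.py are illustrative). The pair can be tested locally with: cat input.txt | python mapper.py | sort | python reducer.py

# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- streaming delivers mapper output sorted by key, so equal
# words arrive on consecutive lines; sum each run and emit the total.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")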
Bonus Assignment
Write a MapReduce version of Assignment 2
MapReduce in VisionerBOT
VisionerBOT Distributed Design
PART II: Google File System
Distributed File System
Don’t move data to workers… move workers to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that holds the data locally
Why?
Not enough RAM to hold all the data in memory
Disk access is slow, but disk throughput is reasonable
A distributed file system is the answer:
GFS (Google File System) for Google’s MapReduce
HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
Commodity hardware over “exotic” hardware
Scale “out”, not “up”
High component failure rates
Inexpensive commodity components fail all the time
“Modest” number of huge files
Multi-gigabyte files are common, if not encouraged
Files are write-once, mostly appended to
Perhaps concurrently
Large streaming reads over random access
High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
Files stored as chunks
Fixed size (64 MB)
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata (see the sketch below)
Simple centralized management
No data caching
Little benefit due to large data sets, streaming reads
Simplify the API
Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
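A quick worked calculation in Python of why large chunks keep the single-master design feasible; the total data size and per-chunk metadata cost are illustrative assumptions, not figures from the slide.

# With 64 MB chunks, even a petabyte of data needs modest master state.
chunk_bytes = 64 * 2**20     # 64 MB chunk size (from the slide)
data_bytes = 2**50           # 1 PB of file data (assumed)
meta_per_chunk = 64          # bytes of master metadata per chunk (assumed)

chunks = data_bytes // chunk_bytes
print(f"chunks: {chunks:,}")                                           # 16,777,216
print(f"master metadata: ~{chunks * meta_per_chunk / 2**30:.0f} GiB")  # ~1 GiB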
QUESTIONS?