
Page 1: CSE509 Lecture 4

Muhammad Atif Qureshi
Web Science Research Group
Institute of Business Administration (IBA)

CSE509: Introduction to Web Science and Technology

Lecture 4: Dealing with Large-Scale Web Data: Large-Scale File Systems and MapReduce

July 30, 2011

Page 2: CSE509 Lecture 4


Last Time…

Search Engine Architecture

Overview of Web Crawling

Web Link Structure and the Ranking Problem

SEO and Web Spam

Web Spam Research

Page 3: CSE509 Lecture 4

Today

Web Data Explosion

Part I
MapReduce Basics
MapReduce Example and Details
MapReduce Case-Study: Web Crawler Based on MapReduce Architecture

Part II
Large-Scale File Systems

Google File System Case-Study

Page 4: CSE509 Lecture 4

Introduction

Web data sets can be very large: tens to hundreds of terabytes

Cannot mine on a single server (why?)

“Big data” is a fact on the World Wide Web
Larger data implies effective algorithms

Web-scale processing: data-intensive processing
Also applies to startups and niche players

Page 5: CSE509 Lecture 4

How Much Data?

Google processes 20 PB a day (2008)

Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

eBay has 6.5 PB of user data + 50 TB/day (5/2009)

CERN’s LHC will generate 15 PB a year (??)

Page 6: CSE509 Lecture 4

Cluster Architecture

[Figure: cluster architecture. Each rack contains 16-64 nodes (each with CPU, memory, and disk) connected by a rack switch; there is roughly 1 Gbps of bandwidth between any pair of nodes in a rack, and a 2-10 Gbps backbone between racks.]

Page 7: CSE509 Lecture 4

Concerns

If we had to abort and restart the computation every time one component fails, the computation might never complete successfully

If one node fails, all of its files become unavailable until the node is replaced
This can also lead to permanent loss of files

Solutions: MapReduce and the Google File System

Page 8: CSE509 Lecture 4

PART I: MapReduce

Page 9: CSE509 Lecture 4

Major Ideas

Scale “out”, not “up” (distributed vs. SMP)
Limits of SMP and large shared-memory machines

Move processing to the data
Clusters have limited bandwidth

Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable

Seamless scalability
From the traditional mythical man-month approach to the new phenomenon of the tradable machine-hour
Twenty-one chickens together cannot make an egg hatch in a day

Page 10: CSE509 Lecture 4

Traditional Parallelization: Divide and Conquer

[Figure: divide and conquer. The “work” is partitioned into units w1, w2, w3, each assigned to a “worker”; the workers produce partial results r1, r2, r3, which are combined into the final “result”.]
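A minimal single-machine sketch of this partition/combine pattern, using Python's multiprocessing pool (the worker function and inputs are illustrative stand-ins, not from the slides):

from multiprocessing import Pool

def worker(w):
    # one unit of work; squaring a number stands in for real processing
    return w * w

if __name__ == "__main__":
    work = [1, 2, 3, 4, 5, 6]              # "work", partitioned into units w1..wn
    with Pool(processes=3) as pool:        # three "workers"
        partial = pool.map(worker, work)   # each worker returns a partial result r_i
    result = sum(partial)                  # "combine" the partial results
    print(result)                          # 91

Every question on the next slide (how work is assigned, how partial results are aggregated, what happens when workers die) is hidden inside pool.map on one machine; at cluster scale each must be answered explicitly.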

Page 11: CSE509 Lecture 4

Parallelization Challenges

How do we assign work units to workers?

What if we have more work units than workers?

What if workers need to share partial results?

How do we aggregate partial results?

How do we know all the workers have finished?

What if workers die?

Page 12: CSE509 Lecture 4

Common Theme

Parallelization problems arise from:
Communication between workers (e.g., to exchange state)
Access to shared resources (e.g., data)

Thus, we need a synchronization mechanism
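A toy illustration (not from the slides) of why a synchronization mechanism is needed: two workers updating a shared counter must coordinate, or updates are lost.

import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:              # the synchronization mechanism: one update at a time
            counter += 1

threads = [threading.Thread(target=add_many, args=(100000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                  # 200000; without the lock, increments can be lost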

Page 13: CSE509 Lecture 4

Parallelization is Hard

Traditionally, concurrency is difficult to reason about (from uniprocessors to small-scale architectures)

Concurrency is even more difficult to reason about:
At the scale of datacenters (even across datacenters)
In the presence of failures
In terms of multiple interacting services

Not to mention debugging…

The reality: write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything

Page 14: CSE509 Lecture 4

Solution: MapReduce

Programming model for expressing distributed computations at a massive scale

Hides system-level details from the developers
No more race conditions, lock contention, etc.

Separating the what from how
Developer specifies the computation that needs to be performed
Execution framework (“runtime”) handles actual execution

Page 15: CSE509 Lecture 4

What is MapReduce Used For?

At Google:
Index building for Google Search
Article clustering for Google News
Statistical machine translation

At Yahoo!:
Index building for Yahoo! Search
Spam detection for Yahoo! Mail

At Facebook:
Data mining
Ad optimization
Spam detection

Page 16: CSE509 Lecture 4

Typical MapReduce Execution

Iterate over a large number of records
Extract something of interest from each (Map)
Shuffle and sort intermediate results
Aggregate intermediate results (Reduce)
Generate final output

Key idea: provide a functional abstraction for these two operations, Map and Reduce

(Dean and Ghemawat, OSDI 2004)

Page 17: CSE509 Lecture 4

MapReduce Basics

Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
All values with the same key are sent to the same reducer

The execution framework handles everything else…
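A minimal single-machine sketch of this contract in Python (illustrative only, not Google's C++ library or Hadoop's API): the framework's essential job is to group all intermediate values by key so that each key goes to exactly one reduce call.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # map phase: map_fn(k, v) yields zero or more (k', v') pairs
    intermediate = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            intermediate[k2].append(v2)
    # shuffle and sort: all values with the same key end up together,
    # so each key is handed to exactly one reduce_fn call
    output = []
    for k2, values in sorted(intermediate.items()):
        output.extend(reduce_fn(k2, values))
    return output

The word count example a few slides ahead plugs directly into this pattern.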

Page 18: CSE509 Lecture 4

Warm Up Example: Word Count

We have a large file of words, one word to a line

Count the number of times each distinct word appears in the file

Sample application: analyze web server logs to find popular URLs

Page 19: CSE509 Lecture 4

Word Count (2)

Case 1: Entire file fits in memory

Case 2: File too large for mem, but all <word, count> pairs fit in mem

Case 3: File on disk, too many distinct words to fit in memory

sort datafile | uniq -c
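A quick sketch of Case 2 in Python: stream the file one line (one word) at a time, so the file itself never has to fit in memory while the <word, count> pairs do. The file name datafile is taken from the command above.

from collections import Counter

counts = Counter()
with open("datafile") as f:
    for line in f:                  # the file is streamed, never fully loaded
        word = line.strip()
        if word:
            counts[word] += 1       # only the <word, count> pairs live in memory

for word, count in counts.most_common():
    print(count, word)

Case 3 is what sort datafile | uniq -c handles: sort falls back to an external, disk-based sort, so neither the file nor the set of distinct words needs to fit in memory.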

Page 20: CSE509 Lecture 4

Word Count (3)

To make it slightly harder, suppose we have a large corpus of documents

Count the number of times each distinct word occurs in the corpus

words(docs/*) | sort | uniq -c

where words takes a file and outputs the words in it, one to a line

The above captures the essence of MapReduce
The great thing is that it is naturally parallelizable
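A minimal stand-in for the words helper in Python (the slide specifies only its behavior; the script name words.py is an assumption), usable in the pipeline above as ./words.py docs/* | sort | uniq -c:

#!/usr/bin/env python3
# words: read the named files and print the words in them, one to a line
import sys

for path in sys.argv[1:]:
    with open(path) as f:
        for line in f:
            for word in line.split():
                print(word)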

Page 21: CSE509 Lecture 4

Word Count using MapReduce

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
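A direct translation of the pseudocode above into runnable Python, with a tiny in-process driver standing in for the execution framework (a single-machine sketch, not a distributed implementation):

from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of document
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: an iterable of counts
    yield (key, sum(values))

def word_count(documents):
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            intermediate[word].append(count)        # shuffle: group counts by word
    result = {}
    for word, counts in intermediate.items():
        for k, v in reduce_fn(word, counts):
            result[k] = v
    return result

print(word_count({"d1": "see bob run", "d2": "see spot throw"}))
# {'see': 2, 'bob': 1, 'run': 1, 'spot': 1, 'throw': 1}

The example input is the one illustrated on the next slide.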

Page 22: CSE509 Lecture 4

Word Count Illustration

map(key=url, val=contents):
For each word w in contents, emit (w, “1”)

reduce(key=word, values=uniq_counts):
Sum all “1”s in values list
Emit result “(word, sum)”

Input documents: “see bob run”, “see spot throw”

Map output:
see 1, bob 1, run 1, see 1, spot 1, throw 1

Reduce output:
bob 1, run 1, see 2, spot 1, throw 1

Page 23: CSE509 Lecture 4

Implementation Overview

100s/1000s of 2-CPU x86 machines, 2-4 GB of memory

Limited bandwidth

Storage is on local IDE disks

GFS: distributed file system manages data (SOSP'03)

Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines


Implementation at Google is a C++ library linked to user programs

Page 24: CSE509 Lecture 4

Distributed Execution Overview

[Figure: distributed execution overview, adapted from (Dean and Ghemawat, OSDI 2004). (1) The user program submits the job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read the input files (split 0 through split 4); (4) map output is written as intermediate files on local disk; (5) reduce workers remote-read the intermediate files; (6) reduce workers write the final output files (output file 0, output file 1).]

Page 25: CSE509 Lecture 4

MapReduce Implementations

Google has a proprietary implementation in C++
Bindings in Java, Python

Hadoop is an open-source implementation in Java
Development led by Yahoo, used in production
Now an Apache project
Rapidly expanding software ecosystem

Lots of custom research implementations
For GPUs, cell processors, etc.

Page 26: CSE509 Lecture 4

Bonus Assignment

Write a MapReduce version of Assignment no. 2

Page 27: CSE509 Lecture 4

MapReduce in VisionerBOT

Page 28: CSE509 Lecture 4

VisionerBOT Distributed Design

Page 29: CSE509 Lecture 4

PART II: Google File System

Page 30: CSE509 Lecture 4

Distributed File System

Don’t move data to workers… move workers to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that has the data local

Why?
Not enough RAM to hold all the data in memory
Disk access is slow, but disk throughput is reasonable

A distributed file system is the answer
GFS (Google File System) for Google’s MapReduce
HDFS (Hadoop Distributed File System) for Hadoop

Page 31: CSE509 Lecture 4

GFS: Assumptions

Commodity hardware over “exotic” hardware
Scale “out”, not “up”

High component failure rates
Inexpensive commodity components fail all the time

“Modest” number of huge files
Multi-gigabyte files are common, if not encouraged

Files are write-once, mostly appended to
Perhaps concurrently

Large streaming reads over random access
High sustained throughput over low latency

GFS slides adapted from material by (Ghemawat et al., SOSP 2003)

Page 32: CSE509 Lecture 4

GFS: Design Decisions

Files stored as chunks
Fixed size (64MB)

Reliability through replication
Each chunk replicated across 3+ chunkservers

Single master to coordinate access, keep metadata
Simple centralized management

No data caching
Little benefit due to large datasets, streaming reads

Simplify the API
Push some of the issues onto the client (e.g., data layout)

HDFS = GFS clone (same basic ideas)
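A toy sketch of these design decisions in Python (illustrative data structures only, not the real GFS or HDFS API; the file and server names are made up): the master holds metadata only, mapping each file to fixed-size chunks and each chunk to the chunkservers holding its replicas. At this chunk size, a 1 GB file becomes 16 chunks, or 48 chunk replicas with 3x replication.

import random

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB chunks
REPLICAS = 3                         # each chunk lives on 3+ chunkservers

class Master:
    # toy GFS-style master: keeps metadata only, never the file data itself
    def __init__(self, chunkservers):
        self.chunkservers = chunkservers
        self.files = {}              # file name -> list of chunk ids
        self.locations = {}          # chunk id -> chunkservers holding a replica

    def create(self, name, size_bytes):
        num_chunks = -(-size_bytes // CHUNK_SIZE)    # ceiling division
        chunk_ids = ["%s#%d" % (name, i) for i in range(num_chunks)]
        self.files[name] = chunk_ids
        for cid in chunk_ids:
            self.locations[cid] = random.sample(self.chunkservers, REPLICAS)
        return chunk_ids

    def lookup(self, name, offset):
        # the client asks the master where a byte offset lives, then reads
        # the chunk data directly from a chunkserver (not through the master)
        cid = self.files[name][offset // CHUNK_SIZE]
        return cid, self.locations[cid]

master = Master(["cs%d" % i for i in range(10)])
master.create("crawl.log", 1024 ** 3)                 # 1 GB file -> 16 chunks
print(master.lookup("crawl.log", 200 * 1024 * 1024))  # offset 200 MB -> chunk #3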

Page 33: CSE509 Lecture 4

QUESTIONS?
