Processing Big Data (Chapter 3, SC 11 Tutorial)


TRANSCRIPT

Page 1: Processing Big Data (Chapter 3, SC 11 Tutorial)

An Introduction to Data Intensive Computing

Chapter 3: Processing Big Data

Robert Grossman, University of Chicago and Open Data Group
Collin Bennett, Open Data Group

November 14, 2011

Page 2: Processing Big Data (Chapter 3, SC 11 Tutorial)

1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)

2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)

3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems

4. Lab using Amazon's Elastic MapReduce (1100-1200)

 

Page 3: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.1: Processing Big Data Using Utility and Data Clouds

A Google production rack of servers from about 1999.

Page 4: Processing Big Data (Chapter 3, SC 11 Tutorial)

• How do you do analytics over commodity disks and processors?

• How do you improve the efficiency of programmers?

Page 5: Processing Big Data (Chapter 3, SC 11 Tutorial)

Serial & SMP Algorithms

[Diagram: a serial algorithm runs a single task against local disk and memory; a symmetric multiprocessing (SMP) algorithm runs several tasks that share the same local disk and memory.]

Page 6: Processing Big Data (Chapter 3, SC 11 Tutorial)

Pleasantly (= Embarrassingly) Parallel

• Need to partition the data, start the tasks, and collect the results.
• Often the tasks are organized into a DAG.
• A minimal sketch of this pattern follows the diagram below.

[Diagram: several nodes, each with a local disk and several tasks, coordinated with MPI.]
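As a concrete sketch of the pattern above (not part of the original slides), the code below partitions the work across processes with Python's multiprocessing module and then merges the results; the partition file names and the word-counting task are hypothetical placeholders.

# Minimal sketch of a pleasantly parallel job: partition the input,
# run independent tasks, then collect and merge the results.
from multiprocessing import Pool
from collections import Counter

def count_words(path):
    # Independent task: count the words in one partition (one file).
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    partitions = ["part-000.txt", "part-001.txt", "part-002.txt"]  # hypothetical files
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, partitions)   # start the tasks
    total = Counter()
    for c in partial_counts:                                  # collect the results
        total.update(c)
    print(total.most_common(10))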

Page 7: Processing Big Data (Chapter 3, SC 11 Tutorial)

How Do You Program A Data Center?


Page 8: Processing Big Data (Chapter 3, SC 11 Tutorial)

The Google Data Stack

• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)


Page 9: Processing Big Data (Chapter 3, SC 11 Tutorial)

Google's Large Data Cloud

[Stack diagram, Google's early data stack circa 2000: Applications on top, then Data Services (Google's BigTable), Compute Services (Google's MapReduce), and Storage Services (the Google File System, GFS) at the bottom.]

Page 10: Processing Big Data (Chapter 3, SC 11 Tutorial)

Hadoop's Large Data Cloud (Open Source)

[Stack diagram, Hadoop's stack: Applications on top, then Data Services (NoSQL, e.g. HBase), Compute Services (Hadoop's MapReduce), and Storage Services (the Hadoop Distributed File System, HDFS) at the bottom.]

Page 11: Processing Big Data (Chapter 3, SC 11 Tutorial)

A very nice recent book by Barroso and Hölzle.

Page 12: Processing Big Data (Chapter 3, SC 11 Tutorial)

The Amazon Data Stack

"Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados."

SOSP’07  

Page 13: Processing Big Data (Chapter 3, SC 11 Tutorial)

Amazon Style Data Cloud

[Diagram: a load balancer in front of pools of EC2 instances; the instances use the S3 storage services, the Simple Queue Service, and SimpleDB (SDB).]

Page 14: Processing Big Data (Chapter 3, SC 11 Tutorial)

Open Source Versions

• Eucalyptus – ability to launch VMs; S3-like storage
• OpenStack – ability to launch VMs; S3-like storage (Swift)
• Cassandra – key-value store like S3; columns like BigTable
• Many other open source Amazon-style services are available.

Page 15: Processing Big Data (Chapter 3, SC 11 Tutorial)

Some Programming Models for Data Centers

• Operations over a data center of disks
  – MapReduce ("string-based" scans of data)
  – User-Defined Functions (UDFs) over the data center
  – Launch VMs that all have access to highly scalable and available disk-based data
  – SQL and NoSQL over the data center

• Operations over a data center of memory
  – Grep over distributed memory
  – UDFs over distributed memory
  – Launch VMs that all have access to highly scalable and available memory-based data
  – SQL and NoSQL over distributed memory

Page 16: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.2: Processing Data by Scaling Out Virtual Machines

Page 17: Processing Big Data (Chapter 3, SC 11 Tutorial)

Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service

Page 18: Processing Big Data (Chapter 3, SC 11 Tutorial)

Task with a Messaging Service & Use S3 (Variant 1)

[Diagram: a control VM launches and tasks worker VMs through a messaging service (AWS SQS, an AMQP service, etc.); each worker VM runs a task and reads and writes its data in S3.]
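As a rough illustration of this pattern (not from the original tutorial), the sketch below shows a worker VM's loop: it pulls task messages from SQS and reads and writes objects in S3 using boto3. The queue URL, bucket name, and message format are hypothetical.

# Minimal worker-VM sketch for Variant 1, assuming boto3 and hypothetical names.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-queue"  # hypothetical
BUCKET = "example-task-data"                                               # hypothetical

def process(local_path):
    # Placeholder for the real per-task computation.
    with open(local_path) as f:
        return {"lines": sum(1 for _ in f)}

# Long-lived worker loop: the control VM only has to enqueue task messages.
while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])        # e.g. {"input": "in/part-000", "output": "out/part-000"}
        s3.download_file(BUCKET, task["input"], "/tmp/input")
        result = process("/tmp/input")
        s3.put_object(Bucket=BUCKET, Key=task["output"], Body=json.dumps(result).encode())
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])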

Page 19: Processing Big Data (Chapter 3, SC 11 Tutorial)

Task with a Messaging Service & Use a NoSQL DB (Variant 2)

[Diagram: the same control VM, messaging service, and worker VM arrangement, with the workers reading and writing AWS SimpleDB instead of S3.]

Page 20: Processing Big Data (Chapter 3, SC 11 Tutorial)

Task with a Messaging Service & Use a Clustered FS (Variant 3)

[Diagram: the same arrangement, with the workers reading and writing a clustered file system such as GlusterFS.]

Page 21: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.3: MapReduce

Google 2004 Technical Report

Page 22: Processing Big Data (Chapter 3, SC 11 Tutorial)

Core Concepts

• Data are (key, value) pairs and that's it.
• Partition the data over commodity nodes filling racks in a data center.
• Software handles failures, restarts, etc. This is the hard part.
• Basic examples:
  – Word count
  – Inverted index

Page 23: Processing Big Data (Chapter 3, SC 11 Tutorial)

Processing Big Data Pattern 2: MapReduce

Page 24: Processing Big Data (Chapter 3, SC 11 Tutorial)

[Diagram: on each node, map tasks run under a Task Tracker, read their input from HDFS, and write intermediate output to local disk; a shuffle & sort phase moves the intermediate data to reduce tasks, which write their results back to HDFS.]

Page 25: Processing Big Data (Chapter 3, SC 11 Tutorial)

Example: Word Count & Inverted Index

• How do you count the words in a million books?
  – (best, 7)
• Inverted index:
  – (best; page 1, page 82, …)
  – (worst; page 1, page 12, …)

[Image: cover of serial Vol. V, 1859, London.]
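To make the inverted-index case concrete (this sketch is not from the original slides, and the page contents are invented), map emits a (word, page) pair for every word, and reduce collects the pages on which each word occurs:

# Build a small inverted index with an explicit map and reduce step.
from collections import defaultdict

pages = {                       # hypothetical page contents
    "page 1": "it was the best of times it was the worst of times",
    "page 12": "the worst was yet to come",
    "page 82": "the best is yet to be",
}

# Map: emit (word, page) for every word on every page.
mapped = [(word, page) for page, text in pages.items() for word in text.split()]

# Reduce: for each word, collect the set of pages it appears on.
index = defaultdict(set)
for word, page in mapped:
    index[word].add(page)

print(sorted(index["best"]))    # ['page 1', 'page 82']
print(sorted(index["worst"]))   # ['page 1', 'page 12']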

Page 26: Processing Big Data (Chapter 3, SC 11 Tutorial)

• Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages.

• What is a simple parallel programming framework that would support the computation of word counts and inverted indices?

Page 27: Processing Big Data (Chapter 3, SC 11 Tutorial)

Basic Pattern: Strings

1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.
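The hashing in step 2 is what routes every occurrence of a word to the same counting task. A minimal sketch of that routing (not from the original slides; the number of reducers is an arbitrary choice):

# Route words to reduce partitions by hashing, so identical words land together.
from collections import defaultdict
import zlib

NUM_REDUCERS = 4                      # arbitrary choice for this sketch

def partition(word):
    # Use a stable hash (crc32) so the routing is the same on every node.
    return zlib.crc32(word.encode()) % NUM_REDUCERS

buckets = defaultdict(list)
for word in "it was the best of times it was the worst of times".split():
    buckets[partition(word)].append(word)

for r in sorted(buckets):
    print(r, sorted(buckets[r]))      # each reducer sees every copy of its words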

Page 28: Processing Big Data (Chapter 3, SC 11 Tutorial)

What About Data Records?

For web pages:
1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.

For data records:
1. Extract binned field values from data records in parallel.
2. Hash and sort the binned field values.
3. Count (or construct an inverted index) in parallel.
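A small sketch of step 1 for data records (not from the original slides; the record layout and bin width are hypothetical): a numeric field is mapped to a (field, bin) pair that then plays the role of a word.

# Map a data record's numeric field to a binned (field, value) pair.
records = [{"age": 23}, {"age": 27}, {"age": 41}]   # hypothetical records
BIN_WIDTH = 10                                       # hypothetical bin width

def binned(field, value):
    lo = (value // BIN_WIDTH) * BIN_WIDTH
    return (field, "%d-%d" % (lo, lo + BIN_WIDTH - 1))

for rec in records:
    for field, value in rec.items():
        print(binned(field, value))    # e.g. ('age', '20-29'), counted just like a word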

Page 29: Processing Big Data (Chapter 3, SC 11 Tutorial)

Map-Reduce Example

• Input is files with one document per record.
• User specifies the map function:
  – key = document URL
  – value = document contents

Input of map:
  ("doc cdickens two cities", "it was the best of times")

Output of map:
  ("it", 1), ("was", 1), ("the", 1), ("best", 1), …

Page 30: Processing Big Data (Chapter 3, SC 11 Tutorial)

Example (cont'd)

• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.

Input of reduce:
  key = "it",    values = 1, 1
  key = "was",   values = 1, 1
  key = "best",  values = 1
  key = "worst", values = 1

Output of reduce:
  ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
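To make the shuffle/sort step concrete, here is a pure-Python simulation of the map, shuffle/sort, and reduce phases (not from the original slides; a second, invented document is added so the counts of 2 match the slide):

# In-memory simulation of map -> shuffle/sort -> reduce for word count.
from itertools import groupby
from operator import itemgetter

documents = {
    "doc cdickens two cities p1": "it was the best of times",
    "doc cdickens two cities p2": "it was the worst of times",   # hypothetical second record
}

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for contents in documents.values() for word in contents.split()]

# Shuffle/sort phase: bring all pairs that share a key together.
mapped.sort(key=itemgetter(0))

# Reduce phase: combine the values associated with each key.
counts = {key: sum(v for _, v in group) for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)   # includes 'it': 2, 'was': 2, 'best': 1, 'worst': 1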

Page 31: Processing Big Data (Chapter 3, SC 11 Tutorial)

Why Is Word Count Important?

• It is one of the most important examples of the type of text processing often done with MapReduce.
• There is an important mapping:

  document  <----->  data record
  words     <----->  (field, value)

Inversion  

Page 32: Processing Big Data (Chapter 3, SC 11 Tutorial)

                  Pleasantly Parallel          MapReduce
Data structure    Arbitrary                    (key, value) pairs
Functions         Arbitrary                    Map & Reduce
Middleware        MPI (message passing)        Hadoop
Ease of use       Difficult                    Medium
Scope             Wide                         Narrow
Challenge         Getting something working    Moving to MapReduce

Page 33: Processing Big Data (Chapter 3, SC 11 Tutorial)

Common MapReduce Design Patterns

• Word count
• Inversion – inverted index
• Computing simple statistics
• Computing windowed statistics
• Sparse matrices (document-term, data record-FieldBinValue, …)
• Site-entity statistics
• PageRank
• Partitioned and ensemble models
• EM

Page 34: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.4: User Defined Functions over a DFS

sector.sf.net

Page 35: Processing Big Data (Chapter 3, SC 11 Tutorial)

Processing Big Data Pattern 3: User Defined Functions over Distributed File Systems

Page 36: Processing Big Data (Chapter 3, SC 11 Tutorial)

Sector/Sphere

• Sector/Sphere is a platform for data intensive computing.

Page 37: Processing Big Data (Chapter 3, SC 11 Tutorial)

Idea 1: Apply User Defined Functions (UDFs) to Files in a Distributed File System

[Diagram: a map/shuffle UDF followed by a reduce UDF, applied directly to files.]

This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
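As a rough, local illustration of the idea (this is not Sector/Sphere's actual API), the sketch below applies a user-defined function to every file in a directory in parallel and writes one output file per input; the directory names and the UDF itself are hypothetical.

# Apply a user-defined function (UDF) to each file in a directory, in parallel.
import os
from multiprocessing import Pool

def udf(text):
    # Hypothetical UDF: uppercase the contents (a stand-in for a real analysis step).
    return text.upper()

def apply_udf(paths):
    in_path, out_path = paths
    with open(in_path) as fin, open(out_path, "w") as fout:
        fout.write(udf(fin.read()))

if __name__ == "__main__":
    in_dir, out_dir = "input", "output"          # hypothetical directories
    os.makedirs(out_dir, exist_ok=True)
    jobs = [(os.path.join(in_dir, name), os.path.join(out_dir, name))
            for name in os.listdir(in_dir)]
    with Pool() as pool:
        pool.map(apply_udf, jobs)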

Page 38: Processing Big Data (Chapter 3, SC 11 Tutorial)

Idea 2: Add Security from the Start

• A security server maintains information about users and slaves.
• User access control: password and client IP address.
• File-level access control.
• Messages are encrypted over SSL. A certificate is used for authentication.
• Sector is a good basis for HIPAA-compliant applications.

[Diagram: a client connects to the master and to the security server over SSL (AAA); the master manages the slaves, which hold the data.]

Page 39: Processing Big Data (Chapter 3, SC 11 Tutorial)

Idea 3: Extend the Stack to Include Network Transport Services

[Diagram: the Google/Hadoop stack (Compute Services, Data Services, Storage Services) next to the Sector stack, which adds a Routing & Transport Services layer beneath its Storage Services.]

Page 40: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.5: Computing with Streams: Warming Up with Means and Variances

Page 41: Processing Big Data (Chapter 3, SC 11 Tutorial)

Warm Up: Partitioned Means

• Means and variances cannot be computed naively when the data sit in distributed partitions.

Step 1. Compute the local statistics (Σ xi, Σ xi², ni) in parallel for each partition.
Step 2. Compute the global mean and variance from these tuples.
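A minimal sketch of the two-step computation (not from the original slides), with the partitions held as plain Python lists:

# Step 1: per-partition sufficient statistics (sum, sum of squares, count).
# Step 2: combine them into the global mean and variance.
partitions = [
    [1.0, 2.0, 3.0],          # hypothetical partition 1
    [4.0, 5.0],               # hypothetical partition 2
    [6.0, 7.0, 8.0, 9.0],     # hypothetical partition 3
]

def local_stats(xs):
    return sum(xs), sum(x * x for x in xs), len(xs)

tuples = [local_stats(p) for p in partitions]     # Step 1 (done in parallel in practice)

total_sum = sum(s for s, _, _ in tuples)
total_sumsq = sum(q for _, q, _ in tuples)
total_n = sum(n for _, _, n in tuples)

mean = total_sum / total_n                        # Step 2
variance = total_sumsq / total_n - mean ** 2      # population variance
print(mean, variance)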

Page 42: Processing Big Data (Chapter 3, SC 11 Tutorial)

Trivial Observation 1

• If si = Σ xi is the i-th local sum and ni the i-th local count, then the global mean = Σ si / Σ ni.
• If only the local means for each partition are passed (without the corresponding counts), there is not enough information to compute the global mean.
• The same trick works for the variance, but you need to pass the triples (Σ xi, Σ xi², ni).

 

Page 43: Processing Big Data (Chapter 3, SC 11 Tutorial)

Trivial Observation 2

• To reduce the data passed over the network, combine the appropriate statistics as early as possible.
• Consider the average. Recall that with MapReduce there are four steps (Map, Shuffle, Sort, and Reduce), and Reduce pulls its data from the local disks of the nodes that performed the Map.
• A Combine step in MapReduce combines local data before it is pulled for the Reduce step.
• There are built-in combiners for counts, means, etc.
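A sketch of the idea behind a combiner for averages (a generic illustration, not a specific Hadoop combiner): each node merges its (sum, count) pairs per key locally, so only one small record per key crosses the network. The key names and values are hypothetical.

# Combiner-style local aggregation for averages: merge (sum, count) pairs per key
# before they are sent to the reducer.
from collections import defaultdict

# Hypothetical mapper output on one node: (key, (value_sum, count)) pairs.
mapper_output = [("temp", (21.0, 1)), ("temp", (23.0, 1)), ("pressure", (101.0, 1))]

combined = defaultdict(lambda: (0.0, 0))
for key, (s, n) in mapper_output:
    cs, cn = combined[key]
    combined[key] = (cs + s, cn + n)        # combine locally

# Only one (sum, count) per key is shipped to the reducer, which repeats the same
# merge across nodes and finally divides: mean = sum / count.
for key, (s, n) in combined.items():
    print(f"{key}\t{s}\t{n}")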

Page 44: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.6: Hadoop Streams

Page 45: Processing Big Data (Chapter 3, SC 11 Tutorial)

Processing Big Data Pattern 4: Streams over Distributed File Systems

Page 46: Processing Big Data (Chapter 3, SC 11 Tutorial)

Hadoop Streams

• In addition to the Java API, Hadoop offers:
  – a streaming interface for any language that can read and write standard input and output
  – Pipes for C++
• Why would I want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to:
  – C++ libraries such as Boost and the GNU Scientific Library (GSL)
  – R modules

Page 47: Processing Big Data (Chapter 3, SC 11 Tutorial)

Pros and Cons

• Java
  + Best documented
  + Largest community
  – More LOC per MR job

• Python
  + Efficient memory handling
  + Programmers can be very efficient
  – Limited logging / debugging

• R
  + Vast collection of statistical algorithms
  – Poor error handling and memory handling
  – Less familiar to developers

Page 48: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count Python Mapper

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # emit one tab-separated (word, 1) pair per word
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

Page 49: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    # input arrives from Hadoop already sorted by key
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()

Page 50: Processing Big Data (Chapter 3, SC 11 Tutorial)

MalStone Benchmark

                           MalStone A    MalStone B
Hadoop MapReduce           455m 13s      840m 50s
Hadoop Streams (Python)    87m 29s       142m 32s
C++ implemented UDFs       33m 40s       43m 44s

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 20 nodes with 500 million 100-byte records per node.

Page 51: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# splitIntoWords is assumed here; a typical definition (not shown on the slide) is:
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep=""), sep="")
}
close(con)

Page 52: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count

Page 53: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count R Reducer (cont'd)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Page 54: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

 

Page 55: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Page 56: Processing Big Data (Chapter 3, SC 11 Tutorial)

Code Comparison – Word Count Mapper

Python:

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

R:

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep=""), sep="")
}
close(con)

Java:

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Page 57: Processing Big Data (Chapter 3, SC 11 Tutorial)

Code Comparison – Word Count Reducer

Python:

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

R:

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Java:

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Page 58: Processing Big Data (Chapter 3, SC 11 Tutorial)

Questions?

For the most current version of these notes, see rgrossman.com