Processing Big Data (Chapter 3, SC 11 Tutorial)


TRANSCRIPT

Page 1: Processing Big Data (Chapter 3, SC 11 Tutorial)

An Introduction to Data Intensive Computing

Chapter 3: Processing Big Data

Robert Grossman, University of Chicago and Open Data Group
Collin Bennett, Open Data Group

November 14, 2011

Page 2: Processing Big Data (Chapter 3, SC 11 Tutorial)

1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)

2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)

3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems

4. Lab using Amazon's Elastic MapReduce (1100-1200)

 

Page 3: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.1: Processing Big Data Using Utility and Data Clouds

A Google production rack of servers from about 1999.

Page 4: Processing Big Data (Chapter 3, SC 11 Tutorial)

• How do you do analytics over commodity disks and processors?

• How do you improve the efficiency of programmers?

Page 5: Processing Big Data (Chapter 3, SC 11 Tutorial)

Serial & SMP Algorithms

[Diagram: a serial algorithm runs a single task against local disk and memory; a symmetric multiprocessing (SMP) algorithm runs several tasks that share the same local disk and memory.]

Page 6: Processing Big Data (Chapter 3, SC 11 Tutorial)

Pleasantly (= Embarrassingly) Parallel

• Need to partition the data, start the tasks, and collect the results.
• Often the tasks are organized into a DAG.
• A minimal sketch of this pattern follows the diagram below.

[Diagram: several nodes, each with a local disk and several tasks, coordinated with MPI.]
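As a concrete sketch of the pattern above (not part of the original slides), the code below partitions the work across processes with Python's multiprocessing module and then merges the results; the partition file names and the word-counting task are hypothetical placeholders.

# Minimal sketch of a pleasantly parallel job: partition the input,
# run independent tasks, then collect and merge the results.
from multiprocessing import Pool
from collections import Counter

def count_words(path):
    # Independent task: count the words in one partition (one file).
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    partitions = ["part-000.txt", "part-001.txt", "part-002.txt"]  # hypothetical files
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, partitions)   # start the tasks
    total = Counter()
    for c in partial_counts:                                  # collect the results
        total.update(c)
    print(total.most_common(10))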

Page 7: Processing Big Data (Chapter 3, SC 11 Tutorial)

How Do You Program A Data Center?


Page 8: Processing Big Data (Chapter 3, SC 11 Tutorial)

The Google Data Stack

• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)


Page 9: Processing Big Data (Chapter 3, SC 11 Tutorial)

Google's Large Data Cloud

[Stack diagram, Google's early data stack circa 2000: Applications on top, then Data Services (Google's BigTable), Compute Services (Google's MapReduce), and Storage Services (the Google File System, GFS) at the bottom.]

Page 10: Processing Big Data (Chapter 3, SC 11 Tutorial)

Hadoop's Large Data Cloud (Open Source)

[Stack diagram, Hadoop's stack: Applications on top, then Data Services (NoSQL, e.g. HBase), Compute Services (Hadoop's MapReduce), and Storage Services (the Hadoop Distributed File System, HDFS) at the bottom.]

Page 11: Processing Big Data (Chapter 3, SC 11 Tutorial)

A very nice recent book by Barroso and Hölzle.

Page 12: Processing Big Data (Chapter 3, SC 11 Tutorial)

The Amazon Data Stack

"Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados."

SOSP’07  

Page 13: Processing Big Data (Chapter 3, SC 11 Tutorial)

Amazon Style Data Cloud

[Diagram: a load balancer in front of pools of EC2 instances; the instances use the S3 storage services, the Simple Queue Service, and SimpleDB (SDB).]

Page 14: Processing Big Data (Chapter 3, SC 11 Tutorial)

Open Source Versions

• Eucalyptus – ability to launch VMs; S3-like storage
• OpenStack – ability to launch VMs; S3-like storage (Swift)
• Cassandra – key-value store like S3; columns like BigTable
• Many other open source Amazon-style services are available.

Page 15: Processing Big Data (Chapter 3, SC 11 Tutorial)

Some Programming Models for Data Centers

• Operations over a data center of disks
  – MapReduce ("string-based" scans of data)
  – User-Defined Functions (UDFs) over the data center
  – Launch VMs that all have access to highly scalable and available disk-based data
  – SQL and NoSQL over the data center

• Operations over a data center of memory
  – Grep over distributed memory
  – UDFs over distributed memory
  – Launch VMs that all have access to highly scalable and available memory-based data
  – SQL and NoSQL over distributed memory

Page 16: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.2: Processing Data by Scaling Out Virtual Machines

Page 17: Processing Big Data (Chapter 3, SC 11 Tutorial)

Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service

Page 18: Processing Big Data (Chapter 3, SC 11 Tutorial)

Task with a Messaging Service & Use S3 (Variant 1)

[Diagram: a control VM launches and tasks worker VMs through a messaging service (AWS SQS, an AMQP service, etc.); each worker VM runs a task and reads and writes its data in S3.]
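As a rough illustration of this pattern (not from the original tutorial), the sketch below shows a worker VM's loop: it pulls task messages from SQS and reads and writes objects in S3 using boto3. The queue URL, bucket name, and message format are hypothetical.

# Minimal worker-VM sketch for Variant 1, assuming boto3 and hypothetical names.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-queue"  # hypothetical
BUCKET = "example-task-data"                                               # hypothetical

def process(local_path):
    # Placeholder for the real per-task computation.
    with open(local_path) as f:
        return {"lines": sum(1 for _ in f)}

# Long-lived worker loop: the control VM only has to enqueue task messages.
while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])        # e.g. {"input": "in/part-000", "output": "out/part-000"}
        s3.download_file(BUCKET, task["input"], "/tmp/input")
        result = process("/tmp/input")
        s3.put_object(Bucket=BUCKET, Key=task["output"], Body=json.dumps(result).encode())
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])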

Page 19: Processing Big Data (Chapter 3, SC 11 Tutorial)

Task with a Messaging Service & Use a NoSQL DB (Variant 2)

[Diagram: the same control VM, messaging service, and worker VM arrangement, with the workers reading and writing AWS SimpleDB instead of S3.]

Page 20: Processing Big Data (Chapter 3, SC 11 Tutorial)

Task with a Messaging Service & Use a Clustered FS (Variant 3)

[Diagram: the same arrangement, with the workers reading and writing a clustered file system such as GlusterFS.]

Page 21: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.3: MapReduce

Google 2004 Technical Report

Page 22: Processing Big Data (Chapter 3, SC 11 Tutorial)

Core Concepts

• Data are (key, value) pairs and that's it.
• Partition the data over commodity nodes filling racks in a data center.
• Software handles failures, restarts, etc. This is the hard part.
• Basic examples:
  – Word count
  – Inverted index

Page 23: Processing Big Data (Chapter 3, SC 11 Tutorial)

Processing Big Data Pattern 2: MapReduce

Page 24: Processing Big Data (Chapter 3, SC 11 Tutorial)

[Diagram: on each node, map tasks run under a Task Tracker, read their input from HDFS, and write intermediate output to local disk; a shuffle & sort phase moves the intermediate data to reduce tasks, which write their results back to HDFS.]

Page 25: Processing Big Data (Chapter 3, SC 11 Tutorial)

Example: Word Count & Inverted Index

• How do you count the words in a million books?
  – (best, 7)
• Inverted index:
  – (best; page 1, page 82, …)
  – (worst; page 1, page 12, …)

[Image: cover of serial Vol. V, 1859, London.]
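To make the inverted-index case concrete (this sketch is not from the original slides, and the page contents are invented), map emits a (word, page) pair for every word, and reduce collects the pages on which each word occurs:

# Build a small inverted index with an explicit map and reduce step.
from collections import defaultdict

pages = {                       # hypothetical page contents
    "page 1": "it was the best of times it was the worst of times",
    "page 12": "the worst was yet to come",
    "page 82": "the best is yet to be",
}

# Map: emit (word, page) for every word on every page.
mapped = [(word, page) for page, text in pages.items() for word in text.split()]

# Reduce: for each word, collect the set of pages it appears on.
index = defaultdict(set)
for word, page in mapped:
    index[word].add(page)

print(sorted(index["best"]))    # ['page 1', 'page 82']
print(sorted(index["worst"]))   # ['page 1', 'page 12']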

Page 26: Processing Big Data (Chapter 3, SC 11 Tutorial)

• Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages.

• What is a simple parallel programming framework that would support the computation of word counts and inverted indices?

Page 27: Processing Big Data (Chapter 3, SC 11 Tutorial)

Basic Pattern: Strings

1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.
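The hashing in step 2 is what routes every occurrence of a word to the same counting task. A minimal sketch of that routing (not from the original slides; the number of reducers is an arbitrary choice):

# Route words to reduce partitions by hashing, so identical words land together.
from collections import defaultdict
import zlib

NUM_REDUCERS = 4                      # arbitrary choice for this sketch

def partition(word):
    # Use a stable hash (crc32) so the routing is the same on every node.
    return zlib.crc32(word.encode()) % NUM_REDUCERS

buckets = defaultdict(list)
for word in "it was the best of times it was the worst of times".split():
    buckets[partition(word)].append(word)

for r in sorted(buckets):
    print(r, sorted(buckets[r]))      # each reducer sees every copy of its words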

Page 28: Processing Big Data (Chapter 3, SC 11 Tutorial)

What About Data Records?

For web pages:
1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.

For data records:
1. Extract binned field values from data records in parallel.
2. Hash and sort the binned field values.
3. Count (or construct an inverted index) in parallel.
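A small sketch of step 1 for data records (not from the original slides; the record layout and bin width are hypothetical): a numeric field is mapped to a (field, bin) pair that then plays the role of a word.

# Map a data record's numeric field to a binned (field, value) pair.
records = [{"age": 23}, {"age": 27}, {"age": 41}]   # hypothetical records
BIN_WIDTH = 10                                       # hypothetical bin width

def binned(field, value):
    lo = (value // BIN_WIDTH) * BIN_WIDTH
    return (field, "%d-%d" % (lo, lo + BIN_WIDTH - 1))

for rec in records:
    for field, value in rec.items():
        print(binned(field, value))    # e.g. ('age', '20-29'), counted just like a word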

Page 29: Processing Big Data (Chapter 3, SC 11 Tutorial)

Map-Reduce Example

• Input is files with one document per record.
• User specifies the map function:
  – key = document URL
  – value = document contents

Input of map:
  ("doc cdickens two cities", "it was the best of times")

Output of map:
  ("it", 1), ("was", 1), ("the", 1), ("best", 1), …

Page 30: Processing Big Data (Chapter 3, SC 11 Tutorial)

Example (cont'd)

• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.

Input of reduce:
  key = "it",    values = 1, 1
  key = "was",   values = 1, 1
  key = "best",  values = 1
  key = "worst", values = 1

Output of reduce:
  ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
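To make the shuffle/sort step concrete, here is a pure-Python simulation of the map, shuffle/sort, and reduce phases (not from the original slides; a second, invented document is added so the counts of 2 match the slide):

# In-memory simulation of map -> shuffle/sort -> reduce for word count.
from itertools import groupby
from operator import itemgetter

documents = {
    "doc cdickens two cities p1": "it was the best of times",
    "doc cdickens two cities p2": "it was the worst of times",   # hypothetical second record
}

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for contents in documents.values() for word in contents.split()]

# Shuffle/sort phase: bring all pairs that share a key together.
mapped.sort(key=itemgetter(0))

# Reduce phase: combine the values associated with each key.
counts = {key: sum(v for _, v in group) for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)   # includes 'it': 2, 'was': 2, 'best': 1, 'worst': 1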

Page 31: Processing Big Data (Chapter 3, SC 11 Tutorial)

Why Is Word Count Important?

• It is one of the most important examples of the type of text processing often done with MapReduce.
• There is an important mapping:

  document  <----->  data record
  words     <----->  (field, value)

Inversion  

Page 32: Processing Big Data (Chapter 3, SC 11 Tutorial)

                  Pleasantly Parallel          MapReduce
Data structure    Arbitrary                    (key, value) pairs
Functions         Arbitrary                    Map & Reduce
Middleware        MPI (message passing)        Hadoop
Ease of use       Difficult                    Medium
Scope             Wide                         Narrow
Challenge         Getting something working    Moving to MapReduce

Page 33: Processing Big Data (Chapter 3, SC 11 Tutorial)

Common MapReduce Design Patterns

• Word count
• Inversion – inverted index
• Computing simple statistics
• Computing windowed statistics
• Sparse matrices (document-term, data record-FieldBinValue, …)
• Site-entity statistics
• PageRank
• Partitioned and ensemble models
• EM

Page 34: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.4: User Defined Functions over a DFS

sector.sf.net

Page 35: Processing Big Data (Chapter 3, SC 11 Tutorial)

Processing Big Data Pattern 3: User Defined Functions over Distributed File Systems

Page 36: Processing Big Data (Chapter 3, SC 11 Tutorial)

Sector/Sphere

• Sector/Sphere is a platform for data intensive computing.

Page 37: Processing Big Data (Chapter 3, SC 11 Tutorial)

Idea 1: Apply User Defined Functions (UDFs) to Files in a Distributed File System

[Diagram: a map/shuffle UDF followed by a reduce UDF, applied directly to files.]

This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
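As a rough, local illustration of the idea (this is not Sector/Sphere's actual API), the sketch below applies a user-defined function to every file in a directory in parallel and writes one output file per input; the directory names and the UDF itself are hypothetical.

# Apply a user-defined function (UDF) to each file in a directory, in parallel.
import os
from multiprocessing import Pool

def udf(text):
    # Hypothetical UDF: uppercase the contents (a stand-in for a real analysis step).
    return text.upper()

def apply_udf(paths):
    in_path, out_path = paths
    with open(in_path) as fin, open(out_path, "w") as fout:
        fout.write(udf(fin.read()))

if __name__ == "__main__":
    in_dir, out_dir = "input", "output"          # hypothetical directories
    os.makedirs(out_dir, exist_ok=True)
    jobs = [(os.path.join(in_dir, name), os.path.join(out_dir, name))
            for name in os.listdir(in_dir)]
    with Pool() as pool:
        pool.map(apply_udf, jobs)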

Page 38: Processing Big Data (Chapter 3, SC 11 Tutorial)

Idea 2: Add Security from the Start

• A security server maintains information about users and slaves.
• User access control: password and client IP address.
• File-level access control.
• Messages are encrypted over SSL. A certificate is used for authentication.
• Sector is a good basis for HIPAA-compliant applications.

[Diagram: a client connects to the master and to the security server over SSL (AAA); the master manages the slaves, which hold the data.]

Page 39: Processing Big Data (Chapter 3, SC 11 Tutorial)

Idea 3: Extend the Stack to Include Network Transport Services

[Diagram: the Google/Hadoop stack (Compute Services, Data Services, Storage Services) next to the Sector stack, which adds a Routing & Transport Services layer beneath its Storage Services.]

Page 40: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.5: Computing with Streams: Warming Up with Means and Variances

Page 41: Processing Big Data (Chapter 3, SC 11 Tutorial)

Warm Up: Partitioned Means

• Means and variances cannot be computed naively when the data sit in distributed partitions.

Step 1. Compute the local statistics (Σ xi, Σ xi², ni) in parallel for each partition.
Step 2. Compute the global mean and variance from these tuples.
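A minimal sketch of the two-step computation (not from the original slides), with the partitions held as plain Python lists:

# Step 1: per-partition sufficient statistics (sum, sum of squares, count).
# Step 2: combine them into the global mean and variance.
partitions = [
    [1.0, 2.0, 3.0],          # hypothetical partition 1
    [4.0, 5.0],               # hypothetical partition 2
    [6.0, 7.0, 8.0, 9.0],     # hypothetical partition 3
]

def local_stats(xs):
    return sum(xs), sum(x * x for x in xs), len(xs)

tuples = [local_stats(p) for p in partitions]     # Step 1 (done in parallel in practice)

total_sum = sum(s for s, _, _ in tuples)
total_sumsq = sum(q for _, q, _ in tuples)
total_n = sum(n for _, _, n in tuples)

mean = total_sum / total_n                        # Step 2
variance = total_sumsq / total_n - mean ** 2      # population variance
print(mean, variance)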

Page 42: Processing Big Data (Chapter 3, SC 11 Tutorial)

Trivial Observation 1

• If si = Σ xi is the i-th local sum and ni the i-th local count, then the global mean = Σ si / Σ ni.
• If only the local means for each partition are passed (without the corresponding counts), there is not enough information to compute the global mean.
• The same trick works for the variance, but you need to pass the triples (Σ xi, Σ xi², ni).

 

Page 43: Processing Big Data (Chapter 3, SC 11 Tutorial)

Trivial Observation 2

• To reduce the data passed over the network, combine the appropriate statistics as early as possible.
• Consider the average. Recall that with MapReduce there are four steps (Map, Shuffle, Sort, and Reduce), and Reduce pulls its data from the local disks of the nodes that performed the Map.
• A Combine step in MapReduce combines local data before it is pulled for the Reduce step.
• There are built-in combiners for counts, means, etc.
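A sketch of the idea behind a combiner for averages (a generic illustration, not a specific Hadoop combiner): each node merges its (sum, count) pairs per key locally, so only one small record per key crosses the network. The key names and values are hypothetical.

# Combiner-style local aggregation for averages: merge (sum, count) pairs per key
# before they are sent to the reducer.
from collections import defaultdict

# Hypothetical mapper output on one node: (key, (value_sum, count)) pairs.
mapper_output = [("temp", (21.0, 1)), ("temp", (23.0, 1)), ("pressure", (101.0, 1))]

combined = defaultdict(lambda: (0.0, 0))
for key, (s, n) in mapper_output:
    cs, cn = combined[key]
    combined[key] = (cs + s, cn + n)        # combine locally

# Only one (sum, count) per key is shipped to the reducer, which repeats the same
# merge across nodes and finally divides: mean = sum / count.
for key, (s, n) in combined.items():
    print(f"{key}\t{s}\t{n}")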

Page 44: Processing Big Data (Chapter 3, SC 11 Tutorial)

Section 3.6: Hadoop Streams

Page 45: Processing Big Data (Chapter 3, SC 11 Tutorial)

Processing Big Data Pattern 4: Streams over Distributed File Systems

Page 46: Processing Big Data (Chapter 3, SC 11 Tutorial)

Hadoop Streams

• In addition to the Java API, Hadoop offers:
  – a streaming interface for any language that can read and write standard input and output
  – Pipes for C++
• Why would I want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to:
  – C++ libraries such as Boost and the GNU Scientific Library (GSL)
  – R modules

Page 47: Processing Big Data (Chapter 3, SC 11 Tutorial)

Pros and Cons

• Java
  + Best documented
  + Largest community
  – More LOC per MR job

• Python
  + Efficient memory handling
  + Programmers can be very efficient
  – Limited logging / debugging

• R
  + Vast collection of statistical algorithms
  – Poor error handling and memory handling
  – Less familiar to developers

Page 48: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count Python Mapper

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # emit one tab-separated (word, 1) pair per word
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

Page 49: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    # input arrives from Hadoop already sorted by key
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()

Page 50: Processing Big Data (Chapter 3, SC 11 Tutorial)

MalStone Benchmark

                           MalStone A    MalStone B
Hadoop MapReduce           455m 13s      840m 50s
Hadoop Streams (Python)    87m 29s       142m 32s
C++ implemented UDFs       33m 40s       43m 44s

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 20 nodes with 500 million 100-byte records per node.

Page 51: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# splitIntoWords is assumed here; a typical definition (not shown on the slide) is:
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep=""), sep="")
}
close(con)

Page 52: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count

Page 53: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count R Reducer (cont'd)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Page 54: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

 

Page 55: Processing Big Data (Chapter 3, SC 11 Tutorial)

Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Page 56: Processing Big Data (Chapter 3, SC 11 Tutorial)

Code Comparison – Word Count Mapper

Python:

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

R:

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep=""), sep="")
}
close(con)

Java:

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Page 57: Processing Big Data (Chapter 3, SC 11 Tutorial)

Code Comparison – Word Count Reducer

Python:

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

R:

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Java:

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Page 58: Processing Big Data (Chapter 3, SC 11 Tutorial)

Questions?

For the most current version of these notes, see rgrossman.com