Managing Big Data (Chapter 2, SC 11 Tutorial)


Page 1: Managing Big Data (Chapter 2, SC 11 Tutorial)

An Introduction to Data Intensive Computing

Chapter 2: Data Management

Robert Grossman, University of Chicago and Open Data Group

Collin Bennett, Open Data Group

November 14, 2011

Page 2: Managing Big Data (Chapter 2, SC 11 Tutorial)

1. Introduction (0830-0900)
   a. Data clouds (e.g., Hadoop)
   b. Utility clouds (e.g., Amazon)

2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g., Hadoop)
   c. NoSQL databases (e.g., HBase)

3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines and message queues
   b. MapReduce
   c. Streams over distributed file systems

4. Lab using Amazon's Elastic MapReduce (1100-1200)

 

Page 3: Managing Big Data (Chapter 2, SC 11 Tutorial)

What Are the Choices?

Databases (SQL Server, Oracle, DB2)

File systems

Distributed file systems (Hadoop, Sector)

Clustered file systems (GlusterFS, …)

NoSQL databases (HBase, Accumulo, Cassandra, SimpleDB, …)

Applications (R, SAS, Excel, etc.)

Page 4: Managing Big Data (Chapter 2, SC 11 Tutorial)

What Is the Fundamental Trade-Off?

Scale up vs. scale out.

Page 5: Managing Big Data (Chapter 2, SC 11 Tutorial)

2.1 Databases

Page 6: Managing Big Data (Chapter 2, SC 11 Tutorial)

Advice From Jim Gray

1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."

Page 7: Managing Big Data (Chapter 2, SC 11 Tutorial)

Pattern 1: Put the metadata in a database and point to files in a file system.

Page 8: Managing Big Data (Chapter 2, SC 11 Tutorial)

Example: Sloan Digital Sky Survey

• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 40 TB of raw data
  – 5 TB of processed catalogs
  – 2.5 terapixels of images
• The catalog uses Microsoft SQL Server.
• Started in 1992, finished in 2008.
• The JHU SkyServer serves millions of queries.
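A minimal sketch of Pattern 1, assuming a hypothetical PostgreSQL catalog and file paths (this is not code from the tutorial): the relational table holds the searchable metadata plus a pointer to the file, while the file itself stays on the file system.

```java
// Minimal sketch of Pattern 1 (hypothetical schema and paths): keep the
// searchable metadata in a relational table and store only a pointer to the
// large file, which lives on an ordinary file system.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class MetadataCatalog {
    public static void main(String[] args) throws Exception {
        // Hypothetical PostgreSQL connection; any JDBC-accessible database works.
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/catalog", "user", "password")) {

            try (Statement st = db.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS images ("
                        + "  id SERIAL PRIMARY KEY,"
                        + "  ra_deg DOUBLE PRECISION,"   // metadata columns
                        + "  dec_deg DOUBLE PRECISION,"
                        + "  path TEXT)");               // pointer to the file
            }

            // Register one file: the database stores the pointer, not the pixels.
            try (PreparedStatement ins = db.prepareStatement(
                    "INSERT INTO images (ra_deg, dec_deg, path) VALUES (?, ?, ?)")) {
                ins.setDouble(1, 180.0);
                ins.setDouble(2, -1.5);
                ins.setString(3, "/data/survey/run42/frame-000123.fits");
                ins.executeUpdate();
            }
        }
    }
}
```

Queries against the metadata stay fast and SQL-friendly, while the bulk data never passes through the database.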

Page 9: Managing Big Data (Chapter 2, SC 11 Tutorial)

Example: Bionimbus Genomics Cloud

www.bionimbus.org

Page 10: Managing Big Data (Chapter 2, SC 11 Tutorial)

[Architecture diagram: a GWT-based front end sits over database services, analysis pipeline and re-analysis services, data cloud services, data ingestion services, utility cloud services, and intercloud services.]

Page 11: Managing Big Data (Chapter 2, SC 11 Tutorial)

[The same architecture with implementations: database services (PostgreSQL); analysis pipeline and re-analysis services; large data cloud services (Hadoop, Sector/Sphere); data ingestion services; elastic cloud services (Eucalyptus, OpenStack); intercloud services; and an ID service (UDT, replication), all behind the GWT-based front end.]

Page 12: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.2 Distributed File Systems

Sector/Sphere

Page 13: Managing Big Data (Chapter 2, SC 11 Tutorial)

Hadoop's Large Data Cloud

[Diagram of Hadoop's stack: applications at the top; compute services (Hadoop's MapReduce) and data services (NoSQL databases) in the middle; storage services (the Hadoop Distributed File System, HDFS) at the bottom.]

Page 14: Managing Big Data (Chapter 2, SC 11 Tutorial)

Pattern 2: Put the data into a distributed file system.

Page 15: Managing Big Data (Chapter 2, SC 11 Tutorial)

Hadoop Design

• Designed to run over commodity components that fail.
• Data is replicated, typically three times.
• Block-based storage.
• Optimized for efficient scans with high throughput, not low-latency access.
• Designed for write once, read many.
• An append operation is planned for the future.

(A sketch of writing to and reading from HDFS follows.)
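A minimal sketch of Pattern 2 using the Hadoop FileSystem API; the name-node URI and the path are assumptions for illustration.

```java
// Minimal sketch of Pattern 2: write a file into HDFS and read it back using
// the Hadoop FileSystem API. The cluster URI and the path are assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name-node address; it normally comes from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path path = new Path("/user/tutorial/dataset1.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {   // write once
            out.writeUTF("hello, data cloud");
        }
        try (FSDataInputStream in = fs.open(path)) {             // read many
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```

Behind this API, the client asks the Name Node for block locations and then streams block data directly to and from the Data Nodes.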

Page 16: Managing Big Data (Chapter 2, SC 11 Tutorial)

Hadoop Distributed File System (HDFS) Architecture

[Architecture diagram: a client sends control traffic to the Name Node and exchanges data directly with Data Nodes, which are spread across racks.]

• HDFS is block-based.
• Written in Java.

Page 17: Managing Big Data (Chapter 2, SC 11 Tutorial)

Sector Distributed File System (SDFS) Architecture

• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block-based.
• Has a security server that provides authorizations.
• Has multiple master name servers, so there is no single point of failure.
• Uses UDT to support wide-area operations.

Page 18: Managing Big Data (Chapter 2, SC 11 Tutorial)

Sector Distributed File System (SDFS) Architecture

[Architecture diagram: a client sends control traffic to multiple Master Nodes, which consult a Security Server, and exchanges data directly with Slave Nodes spread across racks.]

• Sector is file-based.
• Written in C++.
• Security server.
• Multiple masters.

Page 19: Managing Big Data (Chapter 2, SC 11 Tutorial)

GlusterFS Architecture

• No metadata server.
• No single point of failure.
• Uses algorithms to determine the location of data.
• Can scale out by adding more bricks.

Page 20: Managing Big Data (Chapter 2, SC 11 Tutorial)

GlusterFS Architecture

[Architecture diagram: a client exchanges data directly with GlusterFS server bricks spread across racks; there is no metadata server.]

• File-based.

Page 21: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.3 NoSQL Databases

Page 22: Managing Big Data (Chapter 2, SC 11 Tutorial)

Evolution

• Standard architecture for simple web applications:
  – Presentation: front-end, load-balanced web servers
  – Business logic layer
  – Backend database
• The database layer does not scale with large numbers of users or large amounts of data.
• Alternatives arose:
  – Sharded (partitioned) databases or master-slave databases
  – memcache

Page 23: Managing Big Data (Chapter 2, SC 11 Tutorial)

Scaling RDBMS

• Master-slave database systems
  – Writes go to the master.
  – Reads go to the slaves.
  – Writing to the slaves can be a bottleneck, and reads from them can be inconsistent.
• Sharded databases (see the sketch below)
  – Applications and queries must understand the sharding schema.
  – Both reads and writes scale.
  – No native, direct support for joins across shards.
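A minimal sketch of shard-key routing, assuming hypothetical shard URLs and a user-id-based shard key; it illustrates why the application must understand the sharding schema and why cross-shard joins fall back to the application.

```java
// Minimal sketch of routing by shard key, assuming a few JDBC-backed shards.
// The shard URLs and the user-id shard key are assumptions for illustration.
import java.sql.Connection;
import java.sql.DriverManager;

public class ShardRouter {
    private final String[] shardUrls = {
        "jdbc:postgresql://shard0/app",
        "jdbc:postgresql://shard1/app",
        "jdbc:postgresql://shard2/app"
    };

    // The application must know the sharding schema: here, hash the user id.
    public Connection connectionFor(long userId) throws Exception {
        int shard = (int) Math.floorMod(userId, (long) shardUrls.length);
        return DriverManager.getConnection(shardUrls[shard], "user", "password");
    }

    // Note: a join across users that live on different shards cannot be pushed
    // to a single database; the application has to gather and merge rows itself.
}
```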

Page 24: Managing Big Data (Chapter 2, SC 11 Tutorial)

NoSQL Systems

• The name suggests no SQL support; it is also read as "Not Only SQL".
• One or more of the ACID properties is not supported.
• Joins are generally not supported.
• Usually flexible schemas.
• Some well-known examples: Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra.
• Quite a few recent open source systems.

Page 25: Managing Big Data (Chapter 2, SC 11 Tutorial)

Pattern 3: Put the data into a NoSQL application.


Page 27: Managing Big Data (Chapter 2, SC 11 Tutorial)

CAP – Choose Two Per Operation

[Diagram: a triangle with vertices Consistency (C), Availability (A), and Partition-resiliency (P).]

• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent (e.g., Dynamo, Cassandra).
• CP: always consistent, even in a partition, but a reachable replica may deny service without a quorum (e.g., BigTable, HBase).

Page 28: Managing Big Data (Chapter 2, SC 11 Tutorial)

CAP Theorem

• Proposed by Eric Brewer, 2000.
• Three properties of a shared-data system: consistency, availability, and partition tolerance.
• You can have at most two of these three properties for any shared-data system.
• Scale-out requires partitions.
• Most large web-based systems choose availability over consistency.

Reference: Brewer, PODC 2000; Gilbert and Lynch, SIGACT News 2002.

Page 29: Managing Big Data (Chapter 2, SC 11 Tutorial)

Eventual Consistency

• If no updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent.
• Eventually, a node is either updated or removed from service.
• Can be implemented with a gossip protocol (see the sketch below).
• Amazon's Dynamo popularized this approach.
• Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
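A minimal sketch of gossip-based anti-entropy with last-writer-wins versions (an illustration of the idea, not any particular system's implementation): each node periodically merges its state with a random peer, so an update made at one replica eventually reaches them all.

```java
// Minimal sketch of gossip-based eventual consistency: each node keeps a
// (value, version) pair per key and periodically merges state with a random
// peer, keeping the newer version. Real systems (e.g., Dynamo) add vector
// clocks, hinted handoff, and read repair on top of this idea.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class GossipNode {
    static final class Versioned {
        final String value;
        final long version;
        Versioned(String value, long version) { this.value = value; this.version = version; }
    }

    private final Map<String, Versioned> store = new HashMap<>();
    private final String name;

    GossipNode(String name) { this.name = name; }

    void put(String key, String value, long version) { merge(key, new Versioned(value, version)); }

    void merge(String key, Versioned incoming) {
        Versioned current = store.get(key);
        if (current == null || incoming.version > current.version) {
            store.put(key, incoming);                  // keep the newer write
        }
    }

    void gossipWith(GossipNode peer) {
        store.forEach(peer::merge);                    // push my state to the peer
        peer.store.forEach(this::merge);               // pull the peer's state
    }

    String get(String key) {
        Versioned v = store.get(key);
        return v == null ? null : v.value;
    }

    public static void main(String[] args) {
        List<GossipNode> nodes = List.of(new GossipNode("a"), new GossipNode("b"), new GossipNode("c"));
        nodes.get(0).put("k", "v1", 1);                // the write lands on one replica only
        Random rnd = new Random(7);
        int rounds = 0;
        while (nodes.stream().anyMatch(n -> n.get("k") == null) && rounds++ < 50) {
            for (GossipNode n : nodes) {
                n.gossipWith(nodes.get(rnd.nextInt(nodes.size())));
            }
        }
        System.out.println("converged after " + rounds + " gossip rounds");
        nodes.forEach(n -> System.out.println(n.name + " sees k = " + n.get("k")));
    }
}
```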

Page 30: Managing Big Data (Chapter 2, SC 11 Tutorial)

Different Types of NoSQL Systems

• Distributed key-value systems
  – Amazon's S3 key-value store (Dynamo)
  – Voldemort
  – Cassandra
• Column-based systems
  – BigTable
  – HBase
  – Cassandra
• Document-based systems
  – CouchDB

Page 31: Managing Big Data (Chapter 2, SC 11 Tutorial)

HBase Architecture

[Architecture diagram: clients reach HBase through a Java client or a REST API; an HBaseMaster coordinates multiple HRegionServers, each backed by its own disk.]

Source: Raghu Ramakrishnan

Page 32: Managing Big Data (Chapter 2, SC 11 Tutorial)

HRegionServer

• Records are partitioned by column family into HStores.
  – Each HStore contains many MapFiles.
• All writes to an HStore are applied to a single memcache.
• Reads consult the MapFiles and the memcache.
• Memcaches are flushed to disk as MapFiles (HDFS files) when full.
• Compactions limit the number of MapFiles.

[Diagram: writes go to the HRegionServer's memcache, which is flushed to MapFiles on disk; reads consult both the memcache and the MapFiles.]

Source: Raghu Ramakrishnan

Page 33: Managing Big Data (Chapter 2, SC 11 Tutorial)

Facebook's Cassandra

• Modeled after BigTable's data model.
• Modeled after Dynamo's eventual consistency.
• Peer-to-peer storage architecture using consistent hashing (as in Chord); see the sketch below.
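A minimal sketch of consistent hashing, with node names and a truncated MD5 hash chosen for illustration: nodes and keys are hashed onto the same ring, and each key is stored on the first node clockwise from its hash, so adding a node only remaps the keys between that node and its predecessor.

```java
// Minimal sketch of a consistent-hash ring as used by Dynamo-style peer-to-peer
// stores. Node names and the truncated-MD5 hash are illustration choices; real
// systems also place many virtual nodes per server to balance load.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) { ring.put(hash(node), node); }

    public String nodeFor(String key) {
        // First node at or after the key's position; wrap around at the end of the ring.
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);   // first 8 bytes as a long
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing();
        for (String n : new String[] { "node-a", "node-b", "node-c" }) r.addNode(n);
        System.out.println("row42 -> " + r.nodeFor("row42"));
        // Adding a node only remaps the keys between its predecessor and itself.
        r.addNode("node-d");
        System.out.println("row42 -> " + r.nodeFor("row42"));
    }
}
```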

Page 34: Managing Big Data (Chapter 2, SC 11 Tutorial)

Databases vs. NoSQL Systems

• Scalability: databases handle 100s of TB; NoSQL systems handle 100s of PB.
• Functionality: databases offer full SQL-based queries, including joins; NoSQL systems offer optimized access to sorted tables (tables with single keys).
• Optimization: databases are optimized for safe writes; NoSQL clouds are optimized for efficient reads.
• Consistency model: databases provide ACID (Atomicity, Consistency, Isolation and Durability), so the database is always consistent; NoSQL systems provide eventual consistency, in which updates eventually propagate through the system.
• Parallelism: difficult for databases because of the ACID model, although shared-nothing designs are possible; the basic design of NoSQL systems incorporates parallelism over commodity components.
• Scale: databases scale to racks; NoSQL systems scale to data centers.

Page 35: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.3 Case Study: Project Matsu

Page 36: Managing Big Data (Chapter 2, SC 11 Tutorial)

Zoom Levels / Bounds

• Zoom level 1: 4 images
• Zoom level 2: 16 images
• Zoom level 3: 64 images
• Zoom level 4: 256 images

Source: Andrew Levine

Page 37: Managing Big Data (Chapter 2, SC 11 Tutorial)

Build Tile Cache in the Cloud – Mapper

• Step 1: Input to the mapper. The input key is a bounding box (e.g., minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5) and the input value is the corresponding image.
• Step 2: Processing in the mapper. The mapper resizes and/or cuts up the original image into pieces.
• Step 3: Mapper output. One record per piece, with the piece's bounding box as the output key and the image tile as the output value.

Source: Andrew Levine

Page 38: Managing Big Data (Chapter 2, SC 11 Tutorial)

Build Tile Cache in the Cloud – Reducer

• Step 1: Input to the reducer. The input key is a bounding box (e.g., minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375) and the input values are the image tiles for that bounding box.
• Step 2: Reducer output. The reducer assembles the images for each bounding box, writes the result to HBase, and builds up the layers of a WMS for various datasets. (A schematic MapReduce sketch follows.)

Source: Andrew Levine
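A schematic Hadoop MapReduce sketch of the tile-cache job described above (not the actual Project Matsu code): the "minx,miny,maxx,maxy" key format, the 2x2 split, and the PNG encoding are assumptions for illustration.

```java
// Schematic sketch of the tile-cache job: the mapper cuts each input image into
// a 2x2 grid of tiles keyed by child bounding box; the reducer collects the tiles
// for each bounding box (the real job composites them and writes to HBase).
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TileCacheSketch {

    public static class TileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void map(Text boundingBox, BytesWritable image, Context ctx)
                throws IOException, InterruptedException {
            double[] b = parseBox(boundingBox.toString());   // minx, miny, maxx, maxy
            BufferedImage img = ImageIO.read(
                    new ByteArrayInputStream(image.getBytes(), 0, image.getLength()));
            int w = img.getWidth() / 2, h = img.getHeight() / 2;
            double dx = (b[2] - b[0]) / 2, dy = (b[3] - b[1]) / 2;
            for (int i = 0; i < 2; i++) {
                for (int j = 0; j < 2; j++) {                // j = 0 is the top (north) row
                    BufferedImage tile = img.getSubimage(i * w, j * h, w, h);
                    ByteArrayOutputStream png = new ByteArrayOutputStream();
                    ImageIO.write(tile, "png", png);
                    String childBox = (b[0] + i * dx) + "," + (b[3] - (j + 1) * dy) + ","
                                    + (b[0] + (i + 1) * dx) + "," + (b[3] - j * dy);
                    ctx.write(new Text(childBox), new BytesWritable(png.toByteArray()));
                }
            }
        }

        private static double[] parseBox(String s) {
            String[] p = s.split(",");
            return new double[] { Double.parseDouble(p[0]), Double.parseDouble(p[1]),
                                  Double.parseDouble(p[2]), Double.parseDouble(p[3]) };
        }
    }

    public static class TileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void reduce(Text boundingBox, Iterable<BytesWritable> tiles, Context ctx)
                throws IOException, InterruptedException {
            // The real job composites the tiles for one bounding box into a single image
            // and stores it in HBase for the WMS layer; here they are passed through.
            for (BytesWritable t : tiles) {
                ctx.write(boundingBox, t);
            }
        }
    }
}
```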

Page 39: Managing Big Data (Chapter 2, SC 11 Tutorial)

HBase Tables

• An Open Geospatial Consortium (OGC) Web Map Service (WMS) query translates to the HBase schema: layers, styles, projection, size.
• Table name: WMS layer
  – Row ID: bounding box of the image
  – Column family: style name and projection
  – Column qualifier: width x height
  – Value: buffered image

(A sketch of this layout follows.)
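A minimal sketch of writing and reading one tile with the HBase client API using the layout above; the table name, column family, ZooKeeper quorum, and tile size are assumptions for illustration, and the table is assumed to already exist.

```java
// Minimal sketch of the HBase layout described above: one table per WMS layer,
// row key = bounding box, column family = style + projection, qualifier =
// width x height, value = the encoded image. Names and addresses are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WmsTileStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.org");          // hypothetical quorum

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table layer = conn.getTable(TableName.valueOf("wms_layer_landsat"))) {

            byte[] rowKey = Bytes.toBytes("-135.0,45.0,-112.5,67.5");   // bounding box
            byte[] tilePng = new byte[0];                               // encoded image bytes go here

            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("default_epsg4326"),            // style + projection
                          Bytes.toBytes("256x256"),                     // width x height
                          tilePng);
            layer.put(put);

            // A WMS GetMap request translates to a get/scan on the same coordinates.
            Result r = layer.get(new Get(rowKey));
            byte[] image = r.getValue(Bytes.toBytes("default_epsg4326"), Bytes.toBytes("256x256"));
            System.out.println("tile bytes: " + (image == null ? 0 : image.length));
        }
    }
}
```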

Page 40: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.4 Distributed Key-Value Stores

S3

Page 41: Managing Big Data (Chapter 2, SC 11 Tutorial)

Pattern 4: Put the data into a distributed key-value store.

Page 42: Managing Big Data (Chapter 2, SC 11 Tutorial)

S3 Buckets

• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt.
• If you own osdc.org, you can create a DNS CNAME entry so that the file can be accessed as tutorial.osdc.org/dataset1.txt.

Page 43: Managing Big Data (Chapter 2, SC 11 Tutorial)

S3 Keys

• Keys must be unique within a bucket.
• Values can be as large as 5 TB (formerly 5 GB).

(A sketch of storing and retrieving an object follows.)
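A minimal sketch of Pattern 4 against Amazon S3 using the AWS SDK for Java (v1); the bucket name follows the domain-style convention above and is an assumption, and the region and credentials (access key and secret key) come from the default configuration chain.

```java
// Minimal sketch of Pattern 4 with Amazon S3: store and retrieve an object by
// (bucket, key). The bucket name and file are assumptions for illustration.
import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3Example {
    public static void main(String[] args) {
        // Region and credentials come from the default provider chain.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        String bucket = "tutorial.osdc.org";   // hypothetical, must be globally unique
        String key = "dataset1.txt";           // must be unique within the bucket

        // Upload: the object is then addressable as
        // tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
        s3.putObject(bucket, key, new File("dataset1.txt"));

        // Check the stored object (a full download would use s3.getObject(bucket, key)).
        long size = s3.getObjectMetadata(bucket, key).getContentLength();
        System.out.println("stored " + key + " (" + size + " bytes)");
    }
}
```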

Page 44: Managing Big Data (Chapter 2, SC 11 Tutorial)

S3 Security

• AWS access key: functions as your S3 username. It is an alphanumeric text string that uniquely identifies a user.
• AWS secret key: functions as your password.

Page 45: Managing Big Data (Chapter 2, SC 11 Tutorial)

AWS Account Information

[Screenshot of the AWS account page.]

Page 46: Managing Big Data (Chapter 2, SC 11 Tutorial)

Access Keys

[Screenshot of the access keys page: the access key ID functions as the user name and the secret access key as the password.]

Page 47: Managing Big Data (Chapter 2, SC 11 Tutorial)

Other Amazon Data Services

• Amazon SimpleDB (simple database service)
• Amazon Elastic Block Store (EBS)

Page 48: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.5 Moving Large Data Sets

Page 49: Managing Big Data (Chapter 2, SC 11 Tutorial)

The Basic Problem

• TCP was never designed to move large data sets over wide-area, high-performance networks.
• As a general rule, reading data off disks is slower than transporting it over the network.

Page 50: Managing Big Data (Chapter 2, SC 11 Tutorial)

TCP Throughput vs. RTT and Packet Loss

[Figure: TCP throughput (Mb/s) versus round-trip time (ms), with curves for packet loss rates of 0.01%, 0.05%, 0.1%, and 0.5%; markers indicate typical LAN, US, US-EU, and US-ASIA round-trip times. Throughput drops sharply as RTT and loss grow.]

Source: Yunhong Gu, 2007, experiments over a wide-area 1G network.
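The shape of these curves matches the standard back-of-the-envelope model for a single TCP stream (this formula is not on the slide and is added only for context): with maximum segment size MSS, round-trip time RTT, and packet loss rate p, the Mathis et al. approximation gives

```latex
% Mathis et al. approximation for a single TCP stream (added for context):
\[
  \text{throughput} \;\lesssim\; \frac{MSS}{RTT \cdot \sqrt{p}}
\]
% Example: MSS = 1460 bytes, RTT = 100 ms, p = 0.001 gives roughly
% 1460 / (0.1 * 0.0316) ~ 460 KB/s, i.e., only a few Mb/s per stream,
% which is why long, lossy paths need parallel streams or protocols such as UDT.
```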

Page 51: Managing Big Data (Chapter 2, SC 11 Tutorial)

The Solution

• Use parallel TCP streams (e.g., GridFTP).
• Use specialized network protocols (e.g., UDT, FAST).
• Use RAID to stripe data across disks to improve throughput when reading.
• These techniques are well understood in high-energy physics and astronomy, but not yet in biology.

Page 52: Managing Big Data (Chapter 2, SC 11 Tutorial)

Case Study: Bio-Mirror

"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport." -- Don Gilbert, August 2010, bio-mirror.net

Page 53: Managing Big Data (Chapter 2, SC 11 Tutorial)

Moving 113 GB of Bio-Mirror Data

Site      RTT (ms)   TCP (min)   UDT (min)   TCP/UDT   Km
NCSA      10         139         139         1         200
Purdue    17         125         125         1         500
ORNL      25         361         120         3         1,200
TACC      37         616         120         5         2,000
SDSC      65         750         475         1.6       3,300
CSTNET    274        3722        304         12        12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times are in minutes. Source: http://gridip.bio-mirror.net/biomirror/

Page 54: Managing Big Data (Chapter 2, SC 11 Tutorial)

Case Study: CGI 60 Genomes

• A trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
• Approximately 18 TB at about 0.5 Gb/s on a 1G link.

Source: Complete Genomics.

Page 55: Managing Big Data (Chapter 2, SC 11 Tutorial)

Resource Use

Protocol        CPU Usage*     Memory*
GridFTP (UDT)   1.0% - 3.0%    40 MB
GridFTP (TCP)   0.1% - 0.6%    6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/

Page 56: Managing Big Data (Chapter 2, SC 11 Tutorial)

Sector/Sphere

• Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.

Page 57: Managing Big Data (Chapter 2, SC 11 Tutorial)

Questions?

For the most current version of these notes, see rgrossman.com