Managing Big Data (Chapter 2, SC 11 Tutorial)


Page 1: Managing Big Data (Chapter 2, SC 11 Tutorial)

An Introduction to Data Intensive Computing

Chapter 2: Data Management

Robert Grossman, University of Chicago and Open Data Group

Collin Bennett, Open Data Group

November 14, 2011

Page 2: Managing Big Data (Chapter 2, SC 11 Tutorial)

1. Introduction (0830-0900)
   a. Data clouds (e.g., Hadoop)
   b. Utility clouds (e.g., Amazon)

2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g., Hadoop)
   c. NoSQL databases (e.g., HBase)

3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines and message queues
   b. MapReduce
   c. Streams over distributed file systems

4. Lab using Amazon's Elastic MapReduce (1100-1200)

 

Page 3: Managing Big Data (Chapter 2, SC 11 Tutorial)

What Are the Choices?

Databases (SQL Server, Oracle, DB2)

File systems

Distributed file systems (Hadoop, Sector)

Clustered file systems (GlusterFS, …)

NoSQL databases (HBase, Accumulo, Cassandra, SimpleDB, …)

Applications (R, SAS, Excel, etc.)

Page 4: Managing Big Data (Chapter 2, SC 11 Tutorial)

What Is the Fundamental Trade-Off?

Scale up vs. scale out.

Page 5: Managing Big Data (Chapter 2, SC 11 Tutorial)

2.1 Databases

Page 6: Managing Big Data (Chapter 2, SC 11 Tutorial)

Advice From Jim Gray

1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."

Page 7: Managing Big Data (Chapter 2, SC 11 Tutorial)

Pattern 1: Put the metadata in a database and point to files in a file system.

Page 8: Managing Big Data (Chapter 2, SC 11 Tutorial)

Example: Sloan Digital Sky Survey

• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 40 TB of raw data
  – 5 TB of processed catalogs
  – 2.5 terapixels of images
• The catalog uses Microsoft SQL Server.
• Started in 1992, finished in 2008.
• The JHU SkyServer serves millions of queries.
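A minimal sketch of Pattern 1, assuming a hypothetical PostgreSQL catalog and file paths (this is not code from the tutorial): the relational table holds the searchable metadata plus a pointer to the file, while the file itself stays on the file system.

```java
// Minimal sketch of Pattern 1 (hypothetical schema and paths): keep the
// searchable metadata in a relational table and store only a pointer to the
// large file, which lives on an ordinary file system.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class MetadataCatalog {
    public static void main(String[] args) throws Exception {
        // Hypothetical PostgreSQL connection; any JDBC-accessible database works.
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/catalog", "user", "password")) {

            try (Statement st = db.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS images ("
                        + "  id SERIAL PRIMARY KEY,"
                        + "  ra_deg DOUBLE PRECISION,"   // metadata columns
                        + "  dec_deg DOUBLE PRECISION,"
                        + "  path TEXT)");               // pointer to the file
            }

            // Register one file: the database stores the pointer, not the pixels.
            try (PreparedStatement ins = db.prepareStatement(
                    "INSERT INTO images (ra_deg, dec_deg, path) VALUES (?, ?, ?)")) {
                ins.setDouble(1, 180.0);
                ins.setDouble(2, -1.5);
                ins.setString(3, "/data/survey/run42/frame-000123.fits");
                ins.executeUpdate();
            }
        }
    }
}
```

Queries against the metadata stay fast and SQL-friendly, while the bulk data never passes through the database.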

Page 9: Managing Big Data (Chapter 2, SC 11 Tutorial)

Example: Bionimbus Genomics Cloud

www.bionimbus.org

Page 10: Managing Big Data (Chapter 2, SC 11 Tutorial)

[Architecture diagram: a GWT-based front end sits over database services, analysis pipeline and re-analysis services, data cloud services, data ingestion services, utility cloud services, and intercloud services.]

Page 11: Managing Big Data (Chapter 2, SC 11 Tutorial)

[The same architecture with implementations: database services (PostgreSQL); analysis pipeline and re-analysis services; large data cloud services (Hadoop, Sector/Sphere); data ingestion services; elastic cloud services (Eucalyptus, OpenStack); intercloud services; and an ID service (UDT, replication), all behind the GWT-based front end.]

Page 12: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.2 Distributed File Systems

Sector/Sphere

Page 13: Managing Big Data (Chapter 2, SC 11 Tutorial)

Hadoop's Large Data Cloud

[Diagram of Hadoop's stack: applications at the top; compute services (Hadoop's MapReduce) and data services (NoSQL databases) in the middle; storage services (the Hadoop Distributed File System, HDFS) at the bottom.]

Page 14: Managing Big Data (Chapter 2, SC 11 Tutorial)

Pattern 2: Put the data into a distributed file system.

Page 15: Managing Big Data (Chapter 2, SC 11 Tutorial)

Hadoop Design

• Designed to run over commodity components that fail.
• Data is replicated, typically three times.
• Block-based storage.
• Optimized for efficient scans with high throughput, not low-latency access.
• Designed for write once, read many.
• An append operation is planned for the future.

(A sketch of writing to and reading from HDFS follows.)
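A minimal sketch of Pattern 2 using the Hadoop FileSystem API; the name-node URI and the path are assumptions for illustration.

```java
// Minimal sketch of Pattern 2: write a file into HDFS and read it back using
// the Hadoop FileSystem API. The cluster URI and the path are assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name-node address; it normally comes from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path path = new Path("/user/tutorial/dataset1.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {   // write once
            out.writeUTF("hello, data cloud");
        }
        try (FSDataInputStream in = fs.open(path)) {             // read many
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```

Behind this API, the client asks the Name Node for block locations and then streams block data directly to and from the Data Nodes.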

Page 16: Managing Big Data (Chapter 2, SC 11 Tutorial)

Hadoop Distributed File System (HDFS) Architecture

[Architecture diagram: a client sends control traffic to the Name Node and exchanges data directly with Data Nodes, which are spread across racks.]

• HDFS is block-based.
• Written in Java.

Page 17: Managing Big Data (Chapter 2, SC 11 Tutorial)

Sector Distributed File System (SDFS) Architecture

• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block-based.
• Has a security server that provides authorizations.
• Has multiple master name servers, so there is no single point of failure.
• Uses UDT to support wide-area operations.

Page 18: Managing Big Data (Chapter 2, SC 11 Tutorial)

Sector Distributed File System (SDFS) Architecture

[Architecture diagram: a client sends control traffic to multiple Master Nodes, which consult a Security Server, and exchanges data directly with Slave Nodes spread across racks.]

• Sector is file-based.
• Written in C++.
• Security server.
• Multiple masters.

Page 19: Managing Big Data (Chapter 2, SC 11 Tutorial)

GlusterFS Architecture

• No metadata server.
• No single point of failure.
• Uses algorithms to determine the location of data.
• Can scale out by adding more bricks.

Page 20: Managing Big Data (Chapter 2, SC 11 Tutorial)

GlusterFS Architecture

[Architecture diagram: a client exchanges data directly with GlusterFS server bricks spread across racks; there is no metadata server.]

• File-based.

Page 21: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.3 NoSQL Databases

Page 22: Managing Big Data (Chapter 2, SC 11 Tutorial)

Evolution

• Standard architecture for simple web applications:
  – Presentation: front-end, load-balanced web servers
  – Business logic layer
  – Backend database
• The database layer does not scale with large numbers of users or large amounts of data.
• Alternatives arose:
  – Sharded (partitioned) databases or master-slave databases
  – memcache

Page 23: Managing Big Data (Chapter 2, SC 11 Tutorial)

Scaling RDBMS

• Master-slave database systems
  – Writes go to the master.
  – Reads go to the slaves.
  – Writing to the slaves can be a bottleneck, and reads from them can be inconsistent.
• Sharded databases (see the sketch below)
  – Applications and queries must understand the sharding schema.
  – Both reads and writes scale.
  – No native, direct support for joins across shards.
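A minimal sketch of shard-key routing, assuming hypothetical shard URLs and a user-id-based shard key; it illustrates why the application must understand the sharding schema and why cross-shard joins fall back to the application.

```java
// Minimal sketch of routing by shard key, assuming a few JDBC-backed shards.
// The shard URLs and the user-id shard key are assumptions for illustration.
import java.sql.Connection;
import java.sql.DriverManager;

public class ShardRouter {
    private final String[] shardUrls = {
        "jdbc:postgresql://shard0/app",
        "jdbc:postgresql://shard1/app",
        "jdbc:postgresql://shard2/app"
    };

    // The application must know the sharding schema: here, hash the user id.
    public Connection connectionFor(long userId) throws Exception {
        int shard = (int) Math.floorMod(userId, (long) shardUrls.length);
        return DriverManager.getConnection(shardUrls[shard], "user", "password");
    }

    // Note: a join across users that live on different shards cannot be pushed
    // to a single database; the application has to gather and merge rows itself.
}
```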

Page 24: Managing Big Data (Chapter 2, SC 11 Tutorial)

NoSQL Systems

• The name suggests no SQL support; it is also read as "Not Only SQL".
• One or more of the ACID properties is not supported.
• Joins are generally not supported.
• Usually flexible schemas.
• Some well-known examples: Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra.
• Quite a few recent open source systems.

Page 25: Managing Big Data (Chapter 2, SC 11 Tutorial)

Pattern 3: Put the data into a NoSQL application.


Page 27: Managing Big Data (Chapter 2, SC 11 Tutorial)

CAP – Choose Two Per Operation

[Diagram: a triangle with vertices Consistency (C), Availability (A), and Partition-resiliency (P).]

• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent (e.g., Dynamo, Cassandra).
• CP: always consistent, even in a partition, but a reachable replica may deny service without a quorum (e.g., BigTable, HBase).

Page 28: Managing Big Data (Chapter 2, SC 11 Tutorial)

CAP Theorem

• Proposed by Eric Brewer, 2000.
• Three properties of a shared-data system: consistency, availability, and partition tolerance.
• You can have at most two of these three properties for any shared-data system.
• Scale-out requires partitions.
• Most large web-based systems choose availability over consistency.

Reference: Brewer, PODC 2000; Gilbert and Lynch, SIGACT News 2002.

Page 29: Managing Big Data (Chapter 2, SC 11 Tutorial)

Eventual Consistency

• If no updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent.
• Eventually, a node is either updated or removed from service.
• Can be implemented with a gossip protocol (see the sketch below).
• Amazon's Dynamo popularized this approach.
• Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
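A minimal sketch of gossip-based anti-entropy with last-writer-wins versions (an illustration of the idea, not any particular system's implementation): each node periodically merges its state with a random peer, so an update made at one replica eventually reaches them all.

```java
// Minimal sketch of gossip-based eventual consistency: each node keeps a
// (value, version) pair per key and periodically merges state with a random
// peer, keeping the newer version. Real systems (e.g., Dynamo) add vector
// clocks, hinted handoff, and read repair on top of this idea.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class GossipNode {
    static final class Versioned {
        final String value;
        final long version;
        Versioned(String value, long version) { this.value = value; this.version = version; }
    }

    private final Map<String, Versioned> store = new HashMap<>();
    private final String name;

    GossipNode(String name) { this.name = name; }

    void put(String key, String value, long version) { merge(key, new Versioned(value, version)); }

    void merge(String key, Versioned incoming) {
        Versioned current = store.get(key);
        if (current == null || incoming.version > current.version) {
            store.put(key, incoming);                  // keep the newer write
        }
    }

    void gossipWith(GossipNode peer) {
        store.forEach(peer::merge);                    // push my state to the peer
        peer.store.forEach(this::merge);               // pull the peer's state
    }

    String get(String key) {
        Versioned v = store.get(key);
        return v == null ? null : v.value;
    }

    public static void main(String[] args) {
        List<GossipNode> nodes = List.of(new GossipNode("a"), new GossipNode("b"), new GossipNode("c"));
        nodes.get(0).put("k", "v1", 1);                // the write lands on one replica only
        Random rnd = new Random(7);
        int rounds = 0;
        while (nodes.stream().anyMatch(n -> n.get("k") == null) && rounds++ < 50) {
            for (GossipNode n : nodes) {
                n.gossipWith(nodes.get(rnd.nextInt(nodes.size())));
            }
        }
        System.out.println("converged after " + rounds + " gossip rounds");
        nodes.forEach(n -> System.out.println(n.name + " sees k = " + n.get("k")));
    }
}
```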

Page 30: Managing Big Data (Chapter 2, SC 11 Tutorial)

Different Types of NoSQL Systems

• Distributed key-value systems
  – Amazon's S3 key-value store (Dynamo)
  – Voldemort
  – Cassandra
• Column-based systems
  – BigTable
  – HBase
  – Cassandra
• Document-based systems
  – CouchDB

Page 31: Managing Big Data (Chapter 2, SC 11 Tutorial)

HBase Architecture

[Architecture diagram: clients reach HBase through a Java client or a REST API; an HBaseMaster coordinates multiple HRegionServers, each backed by its own disk.]

Source: Raghu Ramakrishnan

Page 32: Managing Big Data (Chapter 2, SC 11 Tutorial)

HRegionServer

• Records are partitioned by column family into HStores.
  – Each HStore contains many MapFiles.
• All writes to an HStore are applied to a single memcache.
• Reads consult the MapFiles and the memcache.
• Memcaches are flushed to disk as MapFiles (HDFS files) when full.
• Compactions limit the number of MapFiles.

[Diagram: writes go to the HRegionServer's memcache, which is flushed to MapFiles on disk; reads consult both the memcache and the MapFiles.]

Source: Raghu Ramakrishnan

Page 33: Managing Big Data (Chapter 2, SC 11 Tutorial)

Facebook's Cassandra

• Modeled after BigTable's data model.
• Modeled after Dynamo's eventual consistency.
• Peer-to-peer storage architecture using consistent hashing (as in Chord); see the sketch below.
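A minimal sketch of consistent hashing, with node names and a truncated MD5 hash chosen for illustration: nodes and keys are hashed onto the same ring, and each key is stored on the first node clockwise from its hash, so adding a node only remaps the keys between that node and its predecessor.

```java
// Minimal sketch of a consistent-hash ring as used by Dynamo-style peer-to-peer
// stores. Node names and the truncated-MD5 hash are illustration choices; real
// systems also place many virtual nodes per server to balance load.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) { ring.put(hash(node), node); }

    public String nodeFor(String key) {
        // First node at or after the key's position; wrap around at the end of the ring.
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);   // first 8 bytes as a long
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing();
        for (String n : new String[] { "node-a", "node-b", "node-c" }) r.addNode(n);
        System.out.println("row42 -> " + r.nodeFor("row42"));
        // Adding a node only remaps the keys between its predecessor and itself.
        r.addNode("node-d");
        System.out.println("row42 -> " + r.nodeFor("row42"));
    }
}
```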

Page 34: Managing Big Data (Chapter 2, SC 11 Tutorial)

Databases vs. NoSQL Systems

• Scalability: databases handle 100s of TB; NoSQL systems handle 100s of PB.
• Functionality: databases offer full SQL-based queries, including joins; NoSQL systems offer optimized access to sorted tables (tables with single keys).
• Optimization: databases are optimized for safe writes; NoSQL clouds are optimized for efficient reads.
• Consistency model: databases provide ACID (Atomicity, Consistency, Isolation and Durability), so the database is always consistent; NoSQL systems provide eventual consistency, in which updates eventually propagate through the system.
• Parallelism: difficult for databases because of the ACID model, although shared-nothing designs are possible; the basic design of NoSQL systems incorporates parallelism over commodity components.
• Scale: databases scale to racks; NoSQL systems scale to data centers.

Page 35: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.3 Case Study: Project Matsu

Page 36: Managing Big Data (Chapter 2, SC 11 Tutorial)

Zoom Levels / Bounds

• Zoom level 1: 4 images
• Zoom level 2: 16 images
• Zoom level 3: 64 images
• Zoom level 4: 256 images

Source: Andrew Levine

Page 37: Managing Big Data (Chapter 2, SC 11 Tutorial)

Build Tile Cache in the Cloud – Mapper

• Step 1: Input to the mapper. The input key is a bounding box (e.g., minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5) and the input value is the corresponding image.
• Step 2: Processing in the mapper. The mapper resizes and/or cuts up the original image into pieces.
• Step 3: Mapper output. One record per piece, with the piece's bounding box as the output key and the image tile as the output value.

Source: Andrew Levine

Page 38: Managing Big Data (Chapter 2, SC 11 Tutorial)

Build Tile Cache in the Cloud – Reducer

• Step 1: Input to the reducer. The input key is a bounding box (e.g., minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375) and the input values are the image tiles for that bounding box.
• Step 2: Reducer output. The reducer assembles the images for each bounding box, writes the result to HBase, and builds up the layers of a WMS for various datasets. (A schematic MapReduce sketch follows.)

Source: Andrew Levine
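A schematic Hadoop MapReduce sketch of the tile-cache job described above (not the actual Project Matsu code): the "minx,miny,maxx,maxy" key format, the 2x2 split, and the PNG encoding are assumptions for illustration.

```java
// Schematic sketch of the tile-cache job: the mapper cuts each input image into
// a 2x2 grid of tiles keyed by child bounding box; the reducer collects the tiles
// for each bounding box (the real job composites them and writes to HBase).
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TileCacheSketch {

    public static class TileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void map(Text boundingBox, BytesWritable image, Context ctx)
                throws IOException, InterruptedException {
            double[] b = parseBox(boundingBox.toString());   // minx, miny, maxx, maxy
            BufferedImage img = ImageIO.read(
                    new ByteArrayInputStream(image.getBytes(), 0, image.getLength()));
            int w = img.getWidth() / 2, h = img.getHeight() / 2;
            double dx = (b[2] - b[0]) / 2, dy = (b[3] - b[1]) / 2;
            for (int i = 0; i < 2; i++) {
                for (int j = 0; j < 2; j++) {                // j = 0 is the top (north) row
                    BufferedImage tile = img.getSubimage(i * w, j * h, w, h);
                    ByteArrayOutputStream png = new ByteArrayOutputStream();
                    ImageIO.write(tile, "png", png);
                    String childBox = (b[0] + i * dx) + "," + (b[3] - (j + 1) * dy) + ","
                                    + (b[0] + (i + 1) * dx) + "," + (b[3] - j * dy);
                    ctx.write(new Text(childBox), new BytesWritable(png.toByteArray()));
                }
            }
        }

        private static double[] parseBox(String s) {
            String[] p = s.split(",");
            return new double[] { Double.parseDouble(p[0]), Double.parseDouble(p[1]),
                                  Double.parseDouble(p[2]), Double.parseDouble(p[3]) };
        }
    }

    public static class TileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void reduce(Text boundingBox, Iterable<BytesWritable> tiles, Context ctx)
                throws IOException, InterruptedException {
            // The real job composites the tiles for one bounding box into a single image
            // and stores it in HBase for the WMS layer; here they are passed through.
            for (BytesWritable t : tiles) {
                ctx.write(boundingBox, t);
            }
        }
    }
}
```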

Page 39: Managing Big Data (Chapter 2, SC 11 Tutorial)

HBase Tables

• An Open Geospatial Consortium (OGC) Web Map Service (WMS) query translates to the HBase schema: layers, styles, projection, size.
• Table name: WMS layer
  – Row ID: bounding box of the image
  – Column family: style name and projection
  – Column qualifier: width x height
  – Value: buffered image

(A sketch of this layout follows.)
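A minimal sketch of writing and reading one tile with the HBase client API using the layout above; the table name, column family, ZooKeeper quorum, and tile size are assumptions for illustration, and the table is assumed to already exist.

```java
// Minimal sketch of the HBase layout described above: one table per WMS layer,
// row key = bounding box, column family = style + projection, qualifier =
// width x height, value = the encoded image. Names and addresses are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WmsTileStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.org");          // hypothetical quorum

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table layer = conn.getTable(TableName.valueOf("wms_layer_landsat"))) {

            byte[] rowKey = Bytes.toBytes("-135.0,45.0,-112.5,67.5");   // bounding box
            byte[] tilePng = new byte[0];                               // encoded image bytes go here

            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("default_epsg4326"),            // style + projection
                          Bytes.toBytes("256x256"),                     // width x height
                          tilePng);
            layer.put(put);

            // A WMS GetMap request translates to a get/scan on the same coordinates.
            Result r = layer.get(new Get(rowKey));
            byte[] image = r.getValue(Bytes.toBytes("default_epsg4326"), Bytes.toBytes("256x256"));
            System.out.println("tile bytes: " + (image == null ? 0 : image.length));
        }
    }
}
```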

Page 40: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.4 Distributed Key-Value Stores

S3

Page 41: Managing Big Data (Chapter 2, SC 11 Tutorial)

Pattern 4: Put the data into a distributed key-value store.

Page 42: Managing Big Data (Chapter 2, SC 11 Tutorial)

S3 Buckets

• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt.
• If you own osdc.org, you can create a DNS CNAME entry so that the file can be accessed as tutorial.osdc.org/dataset1.txt.

Page 43: Managing Big Data (Chapter 2, SC 11 Tutorial)

S3 Keys

• Keys must be unique within a bucket.
• Values can be as large as 5 TB (formerly 5 GB).

(A sketch of storing and retrieving an object follows.)
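A minimal sketch of Pattern 4 against Amazon S3 using the AWS SDK for Java (v1); the bucket name follows the domain-style convention above and is an assumption, and the region and credentials (access key and secret key) come from the default configuration chain.

```java
// Minimal sketch of Pattern 4 with Amazon S3: store and retrieve an object by
// (bucket, key). The bucket name and file are assumptions for illustration.
import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3Example {
    public static void main(String[] args) {
        // Region and credentials come from the default provider chain.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        String bucket = "tutorial.osdc.org";   // hypothetical, must be globally unique
        String key = "dataset1.txt";           // must be unique within the bucket

        // Upload: the object is then addressable as
        // tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
        s3.putObject(bucket, key, new File("dataset1.txt"));

        // Check the stored object (a full download would use s3.getObject(bucket, key)).
        long size = s3.getObjectMetadata(bucket, key).getContentLength();
        System.out.println("stored " + key + " (" + size + " bytes)");
    }
}
```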

Page 44: Managing Big Data (Chapter 2, SC 11 Tutorial)

S3 Security

• AWS access key: functions as your S3 username. It is an alphanumeric text string that uniquely identifies a user.
• AWS secret key: functions as your password.

Page 45: Managing Big Data (Chapter 2, SC 11 Tutorial)

AWS Account Information

[Screenshot of the AWS account page.]

Page 46: Managing Big Data (Chapter 2, SC 11 Tutorial)

Access Keys

[Screenshot of the access keys page: the access key ID functions as the user name and the secret access key as the password.]

Page 47: Managing Big Data (Chapter 2, SC 11 Tutorial)

Other Amazon Data Services

• Amazon SimpleDB (simple database service)
• Amazon Elastic Block Store (EBS)

Page 48: Managing Big Data (Chapter 2, SC 11 Tutorial)

Section 2.5 Moving Large Data Sets

Page 49: Managing Big Data (Chapter 2, SC 11 Tutorial)

The Basic Problem

• TCP was never designed to move large data sets over wide-area, high-performance networks.
• As a general rule, reading data off disks is slower than transporting it over the network.

Page 50: Managing Big Data (Chapter 2, SC 11 Tutorial)

TCP Throughput vs. RTT and Packet Loss

[Figure: TCP throughput (Mb/s) versus round-trip time (ms), with curves for packet loss rates of 0.01%, 0.05%, 0.1%, and 0.5%; markers indicate typical LAN, US, US-EU, and US-ASIA round-trip times. Throughput drops sharply as RTT and loss grow.]

Source: Yunhong Gu, 2007, experiments over a wide-area 1G network.
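The shape of these curves matches the standard back-of-the-envelope model for a single TCP stream (this formula is not on the slide and is added only for context): with maximum segment size MSS, round-trip time RTT, and packet loss rate p, the Mathis et al. approximation gives

```latex
% Mathis et al. approximation for a single TCP stream (added for context):
\[
  \text{throughput} \;\lesssim\; \frac{MSS}{RTT \cdot \sqrt{p}}
\]
% Example: MSS = 1460 bytes, RTT = 100 ms, p = 0.001 gives roughly
% 1460 / (0.1 * 0.0316) ~ 460 KB/s, i.e., only a few Mb/s per stream,
% which is why long, lossy paths need parallel streams or protocols such as UDT.
```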

Page 51: Managing Big Data (Chapter 2, SC 11 Tutorial)

The Solution

• Use parallel TCP streams (e.g., GridFTP).
• Use specialized network protocols (e.g., UDT, FAST).
• Use RAID to stripe data across disks to improve throughput when reading.
• These techniques are well understood in high-energy physics and astronomy, but not yet in biology.

Page 52: Managing Big Data (Chapter 2, SC 11 Tutorial)

Case Study: Bio-Mirror

"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport." -- Don Gilbert, August 2010, bio-mirror.net

Page 53: Managing Big Data (Chapter 2, SC 11 Tutorial)

Moving 113 GB of Bio-Mirror Data

Site      RTT (ms)   TCP (min)   UDT (min)   TCP/UDT   Km
NCSA      10         139         139         1         200
Purdue    17         125         125         1         500
ORNL      25         361         120         3         1,200
TACC      37         616         120         5         2,000
SDSC      65         750         475         1.6       3,300
CSTNET    274        3722        304         12        12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times are in minutes. Source: http://gridip.bio-mirror.net/biomirror/

Page 54: Managing Big Data (Chapter 2, SC 11 Tutorial)

Case Study: CGI 60 Genomes

• A trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
• Approximately 18 TB at about 0.5 Gb/s on a 1G link.

Source: Complete Genomics.

Page 55: Managing Big Data (Chapter 2, SC 11 Tutorial)

Resource Use

Protocol        CPU Usage*     Memory*
GridFTP (UDT)   1.0% - 3.0%    40 MB
GridFTP (TCP)   0.1% - 0.6%    6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/

Page 56: Managing Big Data (Chapter 2, SC 11 Tutorial)

Sector/Sphere

• Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.

Page 57: Managing Big Data (Chapter 2, SC 11 Tutorial)

Questions?

For the most current version of these notes, see rgrossman.com