
System Software Considerations for Cloud Computing on Big Data

Michael Kozuch

Intel Labs Pittsburgh

March 17, 2011

Outline

1. Background: Open Cirrus

2. Cluster software stack

3. Big Data

4. Power

5. Recent news


Open Cirrus

Open Cirrus* Cloud Computing Testbed

Sites include: MIMOS*, ETRI*, ISPRAS*, KIT*, UIUC*, IDA*, CMU*, China Mobile*, China Telecom*, CESGA*, GaTech*

Sponsored by HP, Intel, and Yahoo! (with additional support from NSF)

14 sites currently, target of around 20 in the next two years

Collaboration between industry and academia, sharing

• hardware infrastructure

• software infrastructure

• research

• applications and data sets

Open Cirrus*

Objectives

– Foster systems research around cloud computing

– Enable federation of heterogeneous datacenters

– Vendor-neutral open-source stacks and APIs for the cloud

– Expose research community to enterprise-level requirements

– Capture realistic traces of cloud workloads

Each Site

– Runs its own research and technical teams

– Contributes individual technologies

– Operates some of the global services

Independently-managed sites… providing a cooperative research testbed

http://opencirrus.org

Intel BigData Cluster

[Figure: cluster network diagram. Racks connect through switches rated 48 Gb/s (one is 24 Gb/s) over bundles of 1 Gb/s links; a 45 Mb/s T3 connects the cluster to the Internet. Rack position labels shown include r1r1 through r1r5, r2r1 through r2r3, r3r2, and r3r3.]

Key: rXrY = row X rack Y; rXrYcZ = row X rack Y chassis Z

Node configurations shown in the diagram:

• 3U rack, 5 storage nodes: 12 1TB disks

• Blade rack, 40 nodes: 20 nodes with 1 Xeon (single-core) [Irwindale], 6GB DRAM, 366GB disk; 10 nodes with 2 Xeon 5160 (dual-core) [Woodcrest], 4GB RAM, 2 75GB disks; 10 nodes with 2 Xeon E5345 (quad-core) [Clovertown], 8GB DRAM, 2 150GB disks

• Blade rack, 40 nodes: 2 Xeon E5345 (quad-core) [Clovertown], 8GB DRAM, 2 150GB disks

• 1U rack, 15 nodes: 2 Xeon E5420 (quad-core) [Harpertown], 8GB DRAM, 2 1TB disks

• 2U rack, 15 nodes

• 2U rack, 15 nodes: [Nehalem-EP], 16GB DRAM, 6 1TB disks

• Mobile rack, 8 (1U) nodes: 2 Xeon E5440 (quad-core) [Harpertown], 16GB DRAM, 2 1TB disks

• 12 nodes: 2 Xeon X5650 (six-core) [Westmere-EP], 48GB DRAM, 6 0.5TB disks

• 1U rack, 15 nodes

Totals by rack location:

 | r2r1c1-4 | r2r2c1-4 | r1r1 | r1r2 | r1r3, r1r4, r2r3 | r3r2, r3r3 | mobile | storage | TOTAL
Nodes | 40 | 40 | 15 | 27 | 45 | 30 | 8 | 5 | 210
Cores | 140 | 320 | 120 | 264 | 360 | 240 | 64 | - | 1508
DRAM (GB) | 240 | 320 | 120 | 696 | 360 | 480 | 128 | - | 2344
Spindles | 80 | 80 | 30 | 102 | 270 | 180 | 16 | 60 | 818
Storage (TB) | 12 | 12 | 30 | 66 | 270 | 180 | 16 | 60 | 646

Cloud Software Stack

Cloud Software Stack – Key Learnings

• Enable use of application frameworks (Hadoop, Maui-Torque)

• Enable general IaaS use

• Provide Big Data storage service

• Enable physical resource allocation

Resource Allocator

IaaS

Storage Service

Application Frameworks

Why Physical?

1. Virtualization overhead

2. Access to physical resources

3. Security issues

Zoni Functionality

• Allocation

– Assignment of physical resources to users

• Isolation

– Allow multiple mini-clusters to co-exist without interference

• Provisioning

– Booting of specified OS

• Management

– Out-of-band (OOB) power management

• Debugging

– OOB console access

[Figure: two isolated domains (Domain 0 and Domain 1), each with its own server pool and its own PXE/DNS/DHCP services, plus a gateway.]

Provides each project with a mini-datacenter

Isolation of experiments

Intel BigData Cluster Dashboard

Big Data


Example Applications

Application | Big Data | Algorithms | Compute Style
Scientific study (e.g. earthquake study) | Ground model | Earthquake simulation, thermal conduction, … | HPC
Internet library search | Historic web snapshots | Data mining | MapReduce
Virtual world analysis | Virtual world database | Data mining | TBD
Language translation | Text corpuses, audio archives, … | Speech recognition, machine translation, text-to-speech, … | MapReduce & HPC
Video search | Video data | Object/gesture identification, face recognition, … | MapReduce

“There has been more video uploaded to YouTube in the last 2 months than if ABC, NBC, and CBS had been airing content 24/7/365 continuously since 1948.” - Gartner


Big Data

Interesting applications are data hungry

The data grows over time

The data is immobile

– 100 TB @ 1 Gbps ~= 10 days

Compute comes to the data

Big Data clusters are the new libraries

The value of a cluster is its data
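The 10-day figure follows from simple arithmetic. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes), a fully utilized link, and no protocol overhead; the function name is illustrative:

```python
def transfer_days(size_tb: float, link_gbps: float) -> float:
    """Days needed to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8            # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9)   # ideal link, no overhead
    return seconds / 86_400              # seconds per day


print(f"{transfer_days(100, 1):.1f} days")  # about 9.3 days
```

Real transfers would take longer once protocol overhead and contention are counted, which only strengthens the case for moving compute to the data.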


Example Motivating Application: Online Processing of Archival Video

• Research project: Develop a context recognition system that is 90% accurate over 90% of your day

• Leverage a combination of low- and high-rate sensing for perception

• Federate many sensors for improved perception

• Big Data: Terabytes of archived video from many egocentric cameras

• Example query 1: “Where did I leave my briefcase?”

– Sequential search through all video streams [Parallel Camera]

• Example query 2: “Now that I’ve found my briefcase, track it”

– Cross-cutting search among related video streams [Parallel Time]


Big Data Cluster

Big Data System Requirements

• Provide high-performance execution over Big Data repositories

– Many spindles, many CPUs

– Parallel processing

• Enable multiple services to access a repository concurrently

• Enable low-latency scaling of services

• Enable each service to leverage its own software stack

– IaaS, file-system protections where needed

• Enable slow resource scaling for growth

• Enable rapid resource scaling for power/demand

– Scaling-aware storage


Storing the Data – Choices

Model 1: Separate Compute/Storage (compute servers and storage servers)

• Compute and storage can scale independently

• Many opportunities for reliability

Model 2: Co-located Compute/Storage (combined compute/storage servers)

• No compute resources are under-utilized

• Potential for higher throughput

Cluster Model

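[Figure: cluster model. A cluster switch links to the external network and has connections to R racks. Each rack holds N server nodes under a top-of-rack (TOR) switch; each node has p cores and d disks. Labeled bandwidths: BWswitch, BWnode, BWdisk.]

The cluster switch quickly becomes the bottleneck.

Local computation is crucial.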

I/O Throughput Analysis

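[Figure: bar chart of data throughput (Gb/s, axis from 0 to 6000) for four configurations (Disk-1G, SSD-1G, Disk-10G, SSD-10G), comparing Random Placement with Location-Aware Placement. Improvement labels, in the order shown: 3.6X, 11X, 3.5X, 9.2X.]

Setup: 20 racks of 20 2-disk servers; BWswitch = 10 Gbps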
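The gap between random and location-aware placement can be illustrated with a back-of-the-envelope bottleneck model. This is a simplified sketch, not the analysis behind the figure, and it does not attempt to reproduce the figure's ratios. Assumptions: BWswitch is treated as each rack's uplink to the cluster switch, readers are spread evenly over all nodes, and the per-disk bandwidth (0.64 Gb/s) and 1 Gb/s node link are illustrative values.

```python
def location_aware_throughput(racks, nodes_per_rack, disks_per_node, bw_disk):
    """All reads are node-local, so only the disks limit throughput (Gb/s)."""
    return racks * nodes_per_rack * disks_per_node * bw_disk


def random_placement_throughput(racks, nodes_per_rack, disks_per_node,
                                bw_disk, bw_node, bw_switch):
    """Blocks are uniformly spread, so most reads cross the network (Gb/s)."""
    total_nodes = racks * nodes_per_rack
    frac_nonlocal = 1 - 1 / total_nodes      # reads served by another node
    frac_cross_rack = (racks - 1) / racks    # reads served by another rack
    limits = [disks_per_node * bw_disk]      # per-node disk limit
    if frac_nonlocal > 0:
        limits.append(bw_node / frac_nonlocal)
    if frac_cross_rack > 0:
        # Assumption: each rack's uplink (bw_switch) is shared by its nodes.
        limits.append(bw_switch / (nodes_per_rack * frac_cross_rack))
    return total_nodes * min(limits)


# Setup from the slide: 20 racks of 20 2-disk servers, BWswitch = 10 Gb/s.
aware = location_aware_throughput(20, 20, 2, bw_disk=0.64)
rand = random_placement_throughput(20, 20, 2, bw_disk=0.64,
                                   bw_node=1.0, bw_switch=10.0)
print(f"location-aware: {aware:.0f} Gb/s, random: {rand:.0f} Gb/s")
```

Under these assumptions the rack uplink is the binding limit for random placement, which matches the qualitative point that the switch becomes the bottleneck.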

Data Location Information

Issues:

• Many different file system possibilities (HDFS, PVFS, Lustre, etc.)

• Many different application framework possibilities

• Consumers could be virtualized

Solution:

• Standard cluster-wide Data Location Service

• Resource Telemetry Service to evaluate scheduling choices

• Enables virtualized location info and file system agnosticism
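A minimal sketch of how a scheduler might combine the two services. All class, method, and host names here are hypothetical; the slides do not specify an API. The idea shown is only that location lookups and load reports are kept separate from any particular file system.

```python
from dataclasses import dataclass, field


@dataclass
class DataLocationService:
    """Hypothetical cluster-wide map from (file, block index) to hosts."""
    _locations: dict = field(default_factory=dict)

    def register(self, path, block, hosts):
        self._locations[(path, block)] = list(hosts)

    def hosts_for(self, path, block):
        return self._locations.get((path, block), [])


@dataclass
class ResourceTelemetryService:
    """Hypothetical per-host load report (0.0 idle to 1.0 saturated)."""
    load: dict = field(default_factory=dict)

    def load_of(self, host):
        return self.load.get(host, 0.0)


def pick_host(path, block, dls, telemetry, candidates):
    """Prefer the least-loaded host holding the block; else least-loaded overall."""
    local = [h for h in dls.hosts_for(path, block) if h in candidates]
    pool = local or list(candidates)
    return min(pool, key=telemetry.load_of)


dls = DataLocationService()
dls.register("/video/cam1", 0, ["n1", "n2"])
telemetry = ResourceTelemetryService({"n1": 0.9, "n2": 0.2, "n3": 0.0})
print(pick_host("/video/cam1", 0, dls, telemetry, ["n1", "n2", "n3"]))  # n2
```

Because the scheduler only sees host names and loads, the same logic could sit above HDFS, PVFS, or a virtualized guest, provided something populates the location map.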


Exposing Location Information

[Figure: two configurations, both using the Data Location Service and Resource Telemetry Service.

(a) non-virtualized: OS, DFS, LA runtime, LA application.

(b) virtualized: OS, DFS, and VMM on the host; virtual machines containing Guest OS, DFS, LA runtime, and LA application; VM Runtime.]


“A Taxonomy and Survey of Energy-Efficient Data Centers and Cloud Computing Systems,” Anton Beloglazov, Rajkumar Buyya, Young Choon Lee, and Albert Zomaya


(System) Efficiency

Demand Scaling / Power Proportionality

Power Proportionality and Big Data

[Figure: number of blocks stored on node i versus node number i (up to i=100; vertical axis up to 2000), for 10K blocks. Possible power savings: The Hadoop Filesystem ~0%; Rabbit Filesystem ~66%; Optimal: ~95%.]

Rabbit Filesystem


A reliable, power-proportional filesystem for Big Data workloads

Simple Strategy: Maintain a “primary replica”
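A minimal sketch of the primary-replica idea: if one replica of every block lives on a small set of always-on nodes, the remaining nodes can be powered down without losing availability. This is an illustration under simple assumptions (round-robin placement, hypothetical function names), not Rabbit's actual layout algorithm.

```python
def place_blocks(num_blocks, nodes, num_primary, replicas=3):
    """Return {block: [node, ...]} with the first replica on a primary node."""
    primary, secondary = nodes[:num_primary], nodes[num_primary:]
    placement = {}
    for b in range(num_blocks):
        first = primary[b % len(primary)]
        # Remaining replicas go round-robin over the non-primary nodes.
        rest = [secondary[(b * (replicas - 1) + k) % len(secondary)]
                for k in range(replicas - 1)]
        placement[b] = [first] + rest
    return placement


def available(placement, powered_on):
    """True if every block has at least one replica on a powered-on node."""
    on = set(powered_on)
    return all(on.intersection(hosts) for hosts in placement.values())


nodes = [f"n{i}" for i in range(10)]
placement = place_blocks(100, nodes, num_primary=2)
print(available(placement, nodes[:2]))  # True: the 2 primaries cover all blocks
print(available(placement, ["n5"]))     # False: one secondary is not enough
```

In this toy layout, 8 of the 10 nodes can be switched off while every block stays readable; a real design also has to balance read load and handle writes while nodes are off.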

Recent News

Recent News

• “Intel Labs to Invest $100 Million in U.S. University Research”

• Over five years

• Intel Science and Technology Centers

– 3+2 year sponsored research

• Half-dozen or more by 2012

• Each can have small number of Intel research staff on site

• New ISTC focusing on cloud computing possible


Tentative Research Agenda Framing

Potential Questions

Potential Research Questions

Software stack

• Is physical allocation an interesting paradigm for the public cloud?

• What are the right interfaces between the layers?

• Can multi-variable optimization work across layers?

Big Data

• Can a hybrid cloud-HPC file system provide best-of-both-worlds?

• How should the file system deal with heterogeneity?

• What are the right file system sharing models for the cloud?

• Can physical resources be taken from the FS and given back?


Potential Research Questions

Power

• Can storage service power be reduced without reducing availability?

• How should a power-proportional FS maintain a good data layout?

Federation

• Which applications can cope with limited bandwidth between sites?

• What are the optimal ways to join data across clusters?

• How necessary is federation?


How should compute, storage, and power be managed to optimize for performance, energy, and fault-tolerance?

Backup

Scaling – Power Proportionality

Demand scaling presents perf./power trade-off

• Our servers: 250W loaded, 150W idle, 10W off, 200s setup

Research underway for scaling cloud applications

• Control theory

• Load prediction

• Autoscaling

Scaling beyond single tier less well-understood

Request rate: λ

Cloud-based App

Note: proportionality issue is orthogonal to FAWN design
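The performance/power trade-off can be made concrete with the server figures quoted on this slide (250W loaded, 150W idle, 10W off). A minimal sketch, assuming power rises linearly between idle and loaded and that load can be packed onto the fewest servers; it ignores the 200s setup time, which is what makes the real problem hard. Function names and the 100-server example are illustrative.

```python
import math

P_LOADED_W, P_IDLE_W, P_OFF_W = 250.0, 150.0, 10.0  # figures from the slide


def always_on_power(servers, utilization):
    """Cluster power (W) with every server on and load spread evenly."""
    # Assumption: power is linear in utilization between idle and loaded.
    per_server = P_IDLE_W + utilization * (P_LOADED_W - P_IDLE_W)
    return servers * per_server


def scaled_power(servers, utilization):
    """Cluster power (W) with load packed onto few servers, the rest off."""
    # round() guards against float error such as 100 * 0.3 = 30.000000000000004
    active = math.ceil(round(servers * utilization, 9))
    return active * P_LOADED_W + (servers - active) * P_OFF_W


print(always_on_power(100, 0.25))  # 17500.0
print(scaled_power(100, 0.25))     # 7000.0
```

At 25% load the scaled cluster draws 40% of the always-on power in this model; the savings shrink if servers must be kept on in reserve to cover the setup delay.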

Scaling – Power Proportionality

Project 1: Multi-tier power management

• E.g. Facebook

Project 2: Multi-variable optimization

Project 3: Collective optimization

• Open Cirrus may have key role

[Figure: request rate λ entering the software stack. Layers shown: physical resources; resource allocator (e.g. Zoni); IaaS (e.g. Tashi); distributed file system (e.g. Rabbit).]
