CS 294-8 ISTORE: Hardware Overview and Software Challenges
http://www.cs.berkeley.edu/~yelick/294


Page 1: CS 294-8 ISTORE: Hardware Overview and Software Challenges

http://www.cs.berkeley.edu/~yelick/294

Page 2: The ISTORE Gang

James Beck, Aaron Brown, Daniel Hettena, Jon Kuroda, David Oppenheimer, David Patterson, Alex Stamos, Randi Thomas, Noah Treuhaft, Kathy Yelick

Page 3: Today's Agenda

• ISTORE History & Vision
• ISTORE-1 Hardware
• Proposed ISTORE-2 Hardware
• Software Techniques
• Administrivia

Page 4: Related Project: IRAM

• Microprocessor & DRAM on a chip:
  – Better latency: 5-10X
  – Better bandwidth: 50-100X
  – Improve energy 2X-4X
  – Smaller board area/volume

[Block diagram: CPU + cache, I/O, 4 vector pipes/lanes, crossbar (Xbar), and memory banks (64 Mbits / 8 MBytes each).]

• IRAM focus is ~1-chip systems (PDAs, etc.)

Page 5: A glimpse into the future?

• Add system-on-a-chip to disk: computer, memory, redundant network interfaces without significantly increasing size of disk
• ISTORE HW in 5-7 years:
  – building block: MicroDrive integrated with IRAM
  – if low power, 10,000 nodes fit into one rack!
• Design for O(10,000) nodes

Page 6: The real scalability problems: AME

• Availability
  – systems should continue to meet quality-of-service goals despite hardware and software failures
• Maintainability
  – systems should require only minimal ongoing human administration, regardless of scale or complexity; today, cost of maintenance = 10X cost of purchase
• Evolutionary Growth
  – systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today's scales, and will only get worse as systems grow

Page 7: Principles for achieving AME (1)

• No single points of failure
• Redundancy everywhere
• Introspection
  – reactive techniques to detect and adapt to failures, workload variations, and system evolution
  – proactive techniques to anticipate and avert problems before they happen

Page 8: Principles for achieving AME (2)

• Performance robustness is more important than peak performance
  – performance under real-world, imperfect scenarios, including failures
• Performance can be sacrificed for improvements in AME
  – resources should be dedicated to AME
    • compare: biological systems spend > 50% of resources on maintenance
  – scale the system to meet performance needs

Page 9: The Applications: Big Data

• Two key application domains:
  – storage: public, private, and institutional data
  – search: building static indexes, dynamic discovery
• Applications demand large storage, some computation

Page 10: Today's Agenda

• ISTORE History & Vision
• ISTORE-1 Hardware
• Proposed ISTORE-2 Hardware
• Software Techniques
• Administrivia

Page 11: HW (1): SON

• Cluster of Storage-Oriented Nodes (SON)
• Distribute processing with storage
  – Most storage servers limited by speed of CPUs
  – Amortize sheet metal, power, cooling, network for the disk to add processor, memory, and a real network?
  – Embedded processors: 2/3 perf, 1/10 cost, power?
  – Switches also growing with Moore's Law
• Advantages of cluster organization
  – Truly scalable architecture
  – Architecture that tolerates partial failure
  – Automatic hardware redundancy

Page 12: HW (2+3): Instruments + DP

• Heavily instrumented hardware
  – sensors for temp, vibration, humidity, power, intrusion
  – helps detect environmental problems before they can affect system integrity
• Independent diagnostic processor on each node
  – provides remote control of power, remote console access to the node, selection of node boot code
  – collects, stores, processes environmental data for abnormalities
  – non-volatile "flight recorder" functionality
  – all diagnostic processors connected via independent diagnostic network

Page 13: HW (4): Subset Isolation

• On-demand network partitioning/isolation
  – Internet applications must remain available despite failures of components; therefore a subset can be isolated for preventative maintenance
  – Allows testing, repair of the online system
  – Managed by diagnostic processor and network switches via the diagnostic network

Page 14: HW (5): Fault Injection

• Built-in fault injection capabilities
  – Power control to individual node components
  – Can inject glitches into I/O and memory busses
  – Managed by diagnostic processor
  – Used for hardware introspection
    • automated detection of flaky components
    • controlled testing of error-recovery mechanisms
  – Important for AME benchmarking

Page 15: ISTORE-1 hardware platform

• 80-node x86-based cluster, 1.4 TB storage
  – cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
    • a single field-replaceable unit to simplify maintenance
  – each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
  – more CPU than NAS; fewer disks/node than a cluster

[Figure: ISTORE chassis with 80 nodes, 8 per tray; 2 levels of switches (twenty 100 Mbit/s, two 1 Gbit/s); environment monitoring via UPS, fans, heat and vibration sensors. Each intelligent disk "brick" holds a portable PC CPU (Pentium II/266) plus DRAM, redundant NICs (4 x 100 Mb/s links), a diagnostic processor, and a disk in a half-height canister.]

Page 16: ISTORE Brick Block Diagram

[Block diagram: Mobile Pentium II module (CPU + North Bridge), 256 MB DRAM, PCI bus to SCSI (18 GB disk) and 4 x 100 Mb/s Ethernets, South Bridge, Super I/O, BIOS, dual UART; diagnostic processor with Flash, RTC, RAM, and monitor & control, attached to the diagnostic net.]

• Sensors for heat and vibration
• Control over power to individual nodes

Page 17: ISTORE-1 System Layout

[Rack layout: brick shelves, patch panels, and UPSs in two racks, with PE5200 and PE1000 switches between them.]

• PE1000s: PowerEngines 100 Mb switches
• PE5200s: PowerEngines 1 Gb switches
• UPSs: "used"

Page 18: ISTORE-1 Status

• All boards fabbed
• Boots OS
• Diagnostic Processor interface SW complete
• Minor boards (backplane, patch panel, power supply) in design/layout/fab
• Finish 64-80 node system: October

Page 19: Onward to ISTORE-2?

• ISTORE-1: present UCB project
• ISTORE-2: possible joint research prototype between UCB and IBM
  – IBM: Winfried Wilcke, Dick Booth, …
  – ~2000 nodes
  – Split between UCB, IBM, and others
  – Hardware similar to ISTORE-1
  – Focus on real applications and management software

Page 20: HW (6): RAIN

• Switches for ISTORE-1 are a substantial fraction of space, power, and cost, and that is with just 80 nodes!
• Redundant Array of Inexpensive Disks (RAID):
  – replace large, expensive disks by many small, inexpensive disks, saving volume, power, cost
• Redundant Array of Inexpensive Network switches (RAIN) for ISTORE-2:
  – replace large, expensive switches by many small, inexpensive switches, saving volume, power, cost?
  – e.g., replace 2 16-port 1-Gbit switches by a fat tree of 8 8-port switches, or 24 4-port switches?

Page 21: Redundant Networking

• ISTORE-1 has redundant networks
  – Each node has 4 100 Mb connections
  – 2 Gb switches at the top
  – CAN-bus network for DPs
• Using the 4 Ethernets
  – Striping for performance (on ISTORE-0)
  – Adaptation for single switch or interface failures

Page 22: Underlying Beliefs...

• Commodity components are quickly winning the server wars
  – Gigabit Ethernet will win everything
  – x86 processors
  – Linux OS will prosper
• Large servers (100-10k nodes) will be quite common, and most are storage-centric
• What matters most:
  – Ease of management, density of nodes, and seamless geographical interconnect

Page 23: ISTORE-2 Footprint

• 16 storage (19") racks
  – 64 storage bricks/rack
    • 8 type-1 storage bricks/drawer
    • 8 storage drawers/rack
  – Ethernet switches in rack
• 8 global Ethernet switch (19") racks
• Requires a 600 sq ft lab!!
  – Soda II or better packaging and integration (2.5" disks…) necessary

Page 24: Today's Agenda

• ISTORE History & Vision
• ISTORE-1 Hardware
• Proposed ISTORE-2 Hardware
• Software Techniques
• Administrivia

Page 25: Software Techniques

• Reactive introspection
• Proactive introspection
• Semantic redundancy
• Self-scrubbing data structures
• Language for system administration
• Load adaptation for faults
• Benchmarking
• Your ideas go here!

Page 26: SW1: Reactive Introspection

• Automatic fault detection
• "Mining" any available system data (see the sketch below)
  – Disk logs (aka Tertiary Disk experience)
    • Disk "firings" are predictable
  – Load, memory usage, etc.
  – Network traffic
  – Program counter?
• Untested code is always buggy
• Exception code is often untested
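
Below is a minimal sketch of the kind of reactive introspection described above: mining a stream of monitoring samples (here, per-node SCSI error counts) and raising an alert when the recent error rate looks abnormal. The record format, window size, and threshold are illustrative assumptions, not part of the ISTORE software.

```python
from collections import defaultdict, deque

# Hypothetical record format: (timestamp, node_id, scsi_error_count).
# Reactive introspection: flag a disk whose error rate over a sliding
# window exceeds a threshold, before it fails outright.
WINDOW = 100          # number of recent samples kept per node (illustrative)
ERROR_THRESHOLD = 5   # errors within the window that trigger an alert (illustrative)

history = defaultdict(lambda: deque(maxlen=WINDOW))

def observe(timestamp, node_id, scsi_errors):
    """Ingest one monitoring sample and return an alert string or None."""
    history[node_id].append(scsi_errors)
    total = sum(history[node_id])
    if total >= ERROR_THRESHOLD:
        return f"node {node_id}: {total} SCSI errors in last {len(history[node_id])} samples"
    return None

# Example: feed samples from a (hypothetical) log stream.
samples = [(t, "brick-07", 1 if t % 3 == 0 else 0) for t in range(20)]
for t, node, errs in samples:
    alert = observe(t, node, errs)
    if alert:
        print("ALERT:", alert)
        break
```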

Page 27: SW2: Proactive Introspection

• Test the recovery system
  – Are replicas, backups, etc. consistent?
  – Does recovery code work in the current setting?
• Isolate a subset of the system
• Inject faults using hardware or software techniques
• Automate this process and schedule for regular runs (see the sketch below)
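
A minimal sketch of a proactive-introspection drill along the lines of the bullets above: isolate a subset, inject a fault, exercise recovery, and check consistency, on a schedule. The Cluster class and its hooks (isolate_subset, inject_fault, run_recovery, replicas_consistent) are hypothetical stand-ins, not ISTORE APIs.

```python
import random
from contextlib import contextmanager

class Cluster:
    """Hypothetical stand-in for a cluster control interface."""
    def __init__(self, nodes):
        self.nodes = list(nodes)

    @contextmanager
    def isolate_subset(self, victims):
        print("isolating", victims)        # e.g., partition via switches / diagnostic net
        yield
        print("rejoining", victims)

    def inject_fault(self, node, fault):
        print(f"injecting {fault} on {node}")   # HW or SW fault injection would go here

    def run_recovery(self, victims):
        print("running recovery on", victims)

    def replicas_consistent(self, victims):
        return True                         # placeholder consistency check

def run_drill(cluster, subset_size=2):
    """Isolate a few nodes, inject one fault, and verify recovery works."""
    victims = random.sample(cluster.nodes, subset_size)
    fault = random.choice(["kill_disk", "drop_link", "corrupt_block"])
    with cluster.isolate_subset(victims):
        cluster.inject_fault(victims[0], fault)
        cluster.run_recovery(victims)
        ok = cluster.replicas_consistent(victims)
    return fault, ok

# Automate and schedule for regular runs (here: one run, for illustration).
print(run_drill(Cluster([f"brick-{i:02d}" for i in range(8)])))
```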

Page 28: SW3: Semantic Redundancy

• Replication is the simplest form of redundancy
• How can other forms of redundant information help reliability?
  – Multiple paths in linked data structures
  – Coding-based replication (see the sketch below)
  – Checkpointing of computations using application-specific knowledge
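
As one concrete reading of "coding-based replication", here is a small sketch using simple XOR parity: one extra block of redundant information lets any single lost data block be rebuilt. This is illustrative only and is not the ISTORE scheme.

```python
# Simple XOR parity (RAID-5 style): any single lost block can be rebuilt
# from the surviving blocks plus the parity block.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]      # three equal-size data blocks
parity = xor_blocks(data)               # one extra block of redundant info

# Lose block 1, then rebuild it from the survivors and the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
print("recovered:", rebuilt)
```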

Page 29: SW4: Self-Scrubbing Data Structures

• Distributed data structures
  – Often implicit: based on communication patterns in the application
    • Trees, hash tables, etc.
  – Complex invariants may exist across the structure
    • Maintain n copies of every data element
    • Balance size of each sub-container within p%
  – Check invariants at runtime (and repair?); see the sketch below
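
A minimal sketch of a scrubber for the two invariants named above (every element has n copies, and each sub-container stays within p% of the average size), run over a toy partitioned hash table. The data layout and thresholds are assumptions for illustration.

```python
from collections import Counter

def scrub(buckets, n_copies=2, p=20):
    """buckets: list of dicts (one per node) mapping key -> value."""
    problems = []

    # Invariant 1: each key is replicated exactly n_copies times.
    replica_count = Counter(k for b in buckets for k in b)
    for key, count in replica_count.items():
        if count != n_copies:
            problems.append(f"key {key!r} has {count} copies, expected {n_copies}")

    # Invariant 2: each bucket's size is within p% of the mean size.
    mean = sum(len(b) for b in buckets) / len(buckets)
    for i, b in enumerate(buckets):
        if mean and abs(len(b) - mean) / mean * 100 > p:
            problems.append(f"bucket {i} holds {len(b)} items vs mean {mean:.1f}")

    return problems

buckets = [{"a": 1, "b": 2}, {"a": 1, "c": 3}, {"b": 2, "c": 3}]
print(scrub(buckets) or "all invariants hold")
```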

Page 30: SW5: Safer System Management

• ~40% of system failures are due to administrative errors
• The most common tools used by administrators include:
  – Perl, csh, awk, sed
  – These lack the features of good language design:
    • Little or no redundancy, e.g., untyped
    • Syntax based on minimizing keystrokes: error prone
    • Little or no support for procedural or data abstraction
    • No static checking of program correctness, e.g., by a compiler
  – Systems like Cfengine provide limited support
• Apply language techniques to an age-old problem? (see the sketch below)
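
As a hedged illustration of "apply language techniques", the sketch below expresses an administrative change with typed data, abstraction, and validation before anything is deployed, in contrast to an untyped one-line sed edit. The ServerConfig fields and checks are hypothetical, not from any ISTORE tool.

```python
from dataclasses import dataclass

# Hypothetical example: a typed, checked administrative change, instead of an
# untyped one-liner such as  sed -i 's/maxconn=.*/maxconn=1000/' server.conf

@dataclass
class ServerConfig:
    hostname: str
    max_connections: int
    data_dir: str

    def validate(self):
        errors = []
        if not (1 <= self.max_connections <= 10_000):
            errors.append(f"max_connections out of range: {self.max_connections}")
        if not self.data_dir.startswith("/"):
            errors.append(f"data_dir must be absolute: {self.data_dir}")
        return errors

def apply_change(cfg: ServerConfig, **updates) -> ServerConfig:
    """Check the whole config before it is written anywhere."""
    new_cfg = ServerConfig(**{**cfg.__dict__, **updates})
    problems = new_cfg.validate()
    if problems:
        raise ValueError("; ".join(problems))   # refuse to deploy a bad config
    return new_cfg

cfg = ServerConfig("brick-01", 256, "/var/istore")
print(apply_change(cfg, max_connections=1000))      # OK
# apply_change(cfg, max_connections=-5)             # would raise ValueError
```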

Page 31: SW6: Load Adaptation for Faults

• ISTORE looks homogeneous now, but
  – Didn't get enough disks (only 70)
    • New model, IDE vs. SCSI?
  – Large systems change over time
    • Grow based on need
    • Decay (components replaced) based on faults
  – Even physically homogeneous systems may have performance heterogeneity
    • Process or network load, sharing the machine, …
  – Hardware faults + redundancy can lead to performance faults

Page 32: SW7: I/O Performance Heterogeneity

• Delivered disk performance is subject to many factors: specs, layout, load
• Static I/O load balancing turns a local slowdown into a global slowdown
• Two solutions
  – Virtual Streams (River system)
  – Graduated Declustering implementation and performance

Page 33: Context: Bulk I/O

• Applications in decision support, data mining, and simulation (oil reservoir) use
  – Batch-style, bulk-synchronous computations
  – Large reads (e.g., sorting) and writes (e.g., checkpoints)

[Figure: processes reading from disks across ISTORE bricks.]

Page 34: Record-breaking performance is not the common case

• NOW-Sort experience: heterogeneity is the norm
  – disks: inner vs. outer track (50%), fragmentation
  – processors: load (1.5-5x) and heat
• E.g., perturb just 1 of 8 nodes and...

[Bar chart: slowdown for the best case vs. bad disk layout, busy disk, light CPU load, heavy CPU load, and paging. Credit: R. Arpaci-Dusseau]

Page 35: Virtual Streams

• Dynamic load balancing for I/O: replicas of data serve as second sources
• Maintain a notion of each process's progress
• Arbitrate use of disks to ensure equal progress
• The right behavior, but what mechanism? (see the sketch below)

[Figure: processes read through the Virtual Streams software, which arbitrates access to the disks.]
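
A minimal simulation of the arbitration idea on this slide, under stated assumptions: each process's data is replicated on two disks, the arbiter always advances the process with the least progress, and each read goes to the replica disk that frees up first. Disk speeds, placement, and policy details are illustrative, not the Virtual Streams implementation.

```python
# Equal-progress arbitration with replicas as second sources (illustrative).
disk_speed = {"d0": 1.0, "d1": 1.0, "d2": 0.3, "d3": 1.0}    # d2 is a slow disk
replicas = {p: (f"d{p % 4}", f"d{(p + 1) % 4}") for p in range(4)}
disk_busy_until = {d: 0.0 for d in disk_speed}                # per-disk queued time
progress = {p: 0 for p in range(4)}                           # blocks read so far

def issue_read(process):
    """Send the next block to the less-busy replica holding this process's data."""
    d = min(replicas[process], key=lambda disk: disk_busy_until[disk])
    disk_busy_until[d] += 1.0 / disk_speed[d]                 # service time on that disk
    progress[process] += 1

for _ in range(40):
    slowest = min(progress, key=progress.get)                 # advance the laggard
    issue_read(slowest)

print("per-process progress:", progress)
print("per-disk load (time units):",
      {d: round(t, 1) for d, t in disk_busy_until.items()})   # work shifts off slow d2
```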

Page 36: Performance Heterogeneity

• System performance is limited by the weakest link
• Virtual Streams: dynamically off-load I/O work from slower disks to faster ones

[Bar chart: minimum per-process bandwidth (MB/sec) vs. efficiency of a single slow disk (100%, 67%, 39%, 29%), comparing Ideal, Virtual Streams, and Static. Credit: Treuhaft]

Page 37: Read Performance: Multiple Slow Disks

[Line chart: minimum per-process bandwidth (MB/sec) vs. number of slow disks (out of 8), comparing Ideal, Virtual Streams, and Static.]

Page 38: HW/SW: Benchmarking

• Benchmarking
  – One reason for 1000X processor performance was the ability to measure (vs. debate) which is better
    • e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
  – Need AME benchmarks
    • "what gets measured gets done"
    • "benchmarks shape a field"
    • "quantification brings rigor"

Page 39: Availability benchmarks

• Goal: quantify variation in QoS metrics as events occur that affect system availability
• Leverage existing performance benchmarks
  – to generate fair workloads
  – to measure & trace quality-of-service metrics
• Use fault injection to compromise the system (see the sketch below)
  – hardware faults (disk, memory, network, power)
  – software faults (corrupt input, driver error returns)
  – maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
  – the availability analogues of performance micro- and macro-benchmarks
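
A minimal sketch of the availability-benchmark loop: drive a steady workload, inject a fault partway through, and trace a quality-of-service metric over time. The workload, fault model, and numbers are fabricated for illustration; the fault-injection print marks where a real hardware or software injection hook would go.

```python
import random

def qos_sample(t, fault_time, recovery_time):
    baseline = 200 + random.gauss(0, 3)          # normal hits/sec with noise (made up)
    if fault_time <= t < recovery_time:
        return baseline * 0.6                    # degraded service during recovery
    return baseline

def run_availability_benchmark(duration=120, fault_time=40, recovery=30):
    trace = []
    for t in range(duration):
        if t == fault_time:
            print(f"t={t}: injecting disk failure")   # hook for real fault injection
        trace.append((t, qos_sample(t, fault_time, fault_time + recovery)))
    return trace

trace = run_availability_benchmark()
degraded = [t for t, hits in trace if hits < 150]
print(f"QoS below 150 hits/sec for {len(degraded)} of {len(trace)} seconds")
```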

Page 40: Methodology for reporting

• Results are most accessible graphically
  – plot change in QoS metrics over time
  – compare to "normal" behavior
• 99% confidence intervals calculated from no-fault runs (see the sketch below)

[Figure: performance vs. time, showing the normal-behavior band (99% confidence), the injected disk failure, and the reconstruction period.]
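
A small sketch of the reporting step: estimate the "normal behavior" band from no-fault runs as a 99% interval (normal approximation, z = 2.576) and flag the times at which a faulted run's QoS leaves that band. The traces are fabricated for illustration.

```python
import statistics

no_fault_runs = [
    [200, 198, 203, 201, 199, 202, 200, 197],
    [201, 199, 200, 202, 198, 201, 203, 200],
    [199, 200, 202, 198, 201, 200, 199, 202],
]

samples = [x for run in no_fault_runs for x in run]
mean = statistics.mean(samples)
sd = statistics.stdev(samples)
z = 2.576                                    # two-sided 99%, normal approximation
low, high = mean - z * sd, mean + z * sd
print(f"normal band: {low:.1f} .. {high:.1f} hits/sec")

faulted_run = [200, 201, 150, 140, 160, 185, 199, 201]   # dip after injected fault
outside = [t for t, q in enumerate(faulted_run) if not (low <= q <= high)]
print("QoS outside the normal band at times:", outside)
```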

Page 41: Example single-fault result

[Figure: hits per second and number of failures tolerated over time (0-110 minutes) during reconstruction after an injected disk failure, shown for Linux and for Solaris.]

• Compares Linux and Solaris reconstruction
  – Linux: minimal performance impact but longer window of vulnerability to a second fault
  – Solaris: large perf. impact but restores redundancy fast

Page 42: Summary

• ISTORE: hardware for availability, reliability, maintainability
• Needs software
  – To use the hardware fault-tolerance features
  – For problems not solvable by HW
• Other ideas?

Page 43: Administrivia

• Homework 1 due next Tuesday (9/5)
  – Anatomy of a failure
• Read the Grapevine and Porcupine papers for 9/5
  – Porcupine link fixed
  – Grapevine paper copies available
• Read the wireless (Baker) paper for 9/7
  – Short discussion next Thursday 9/7 (4:30-5:00 only)