Building PetaByte Servers


Page 1: Building PetaByte Servers

1

Building PetaByte Servers

Jim Gray
Microsoft Research
[email protected]
http://www.Research.Microsoft.com/~Gray/talks

Kilo  10^3
Mega  10^6
Giga  10^9
Tera  10^12   <- today, we are here
Peta  10^15
Exa   10^18

Page 2: Building PetaByte Servers

2

Outline

• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system
• Conclusion 1:
  – Think about MOX and SCANS
• Conclusion 2:
  – Think about Clusters
  – SMP report
  – Cluster report

Page 3: Building PetaByte Servers

3

The Challenge -- EOS/DIS

• Antarctica is melting -- 77% of fresh water liberated
  – sea level rises 70 meters
  – Chico & Memphis are beach-front property
  – New York, Washington, SF, LA, London, Paris
• Let's study it! Mission to Planet Earth
• EOS: Earth Observing System (17 B$ => 10 B$)
  – 50 instruments on 10 satellites, 1997-2001
  – Landsat (added later)
• EOS DIS: Data Information System
  – 3-5 MB/s raw, 30-50 MB/s processed
  – 4 TB/day
  – 15 PB by year 2007
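A quick sanity check of these rates, as a sketch only: the ~10-year accumulation window (roughly mission start through 2007) is my assumption, not stated on the slide.

```python
# Rough check that ~4 TB/day of processed data accumulates to ~15 PB.
TB = 10**12
PB = 10**15

processed_rate = 40e6                      # midpoint of 30-50 MB/s processed
per_day = processed_rate * 86_400
print(f"processed volume: {per_day / TB:.1f} TB/day")   # ~3.5 TB/day

years = 10                                 # assumed accumulation window
total = 4 * TB * 365 * years               # using the slide's 4 TB/day figure
print(f"after {years} years: {total / PB:.1f} PB")       # ~14.6 PB, close to 15 PB
```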

Page 4: Building PetaByte Servers

4

The Process Flow

• Data arrives and is pre-processed.
  – instrument data is calibrated, gridded, averaged
  – geophysical data is derived
• Users ask for stored data OR to analyze and combine data.
• Can make the pull-push split dynamically.

[Diagram: Pull Processing vs. Push Processing, plus Other Data]

Page 5: Building PetaByte Servers

5

Designing EOS/DIS

• Expect that millions will use the system (online)
• Three user categories:
  – NASA 500 -- funded by NASA to do science
  – Global Change 10 k -- other dirt bags
  – Internet 20 m -- everyone else
  – Grain speculators, Environmental Impact Reports, New applications
  => discovery & access must be automatic
• Allow anyone to set up a peer node (DAAC & SCF)
• Design for ad hoc queries, NOT standard data products
  – If push is 90%, then 10% of data is read (on average).
  => A failure: no one uses the data; in DSS, push is 1% or less.
  => Computation demand is enormous (pull:push is 100:1)

Page 6: Building PetaByte Servers

6

Obvious Points: EOS/DIS will be a cluster of SMPs

• It needs 16 PB of storage
  – = 1 M disks in current technology
  – = 500 K tapes in current technology
• It needs 100 TeraOps of processing
  – = 100 K processors (current technology)
  – and ~100 Terabytes of DRAM
• 1997 requirements are 1000x smaller
  – smaller data rate
  – almost no re-processing work
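Dividing the slide's totals by its component counts gives the implied per-unit sizes; this is just the arithmetic behind the bullets above, as a sketch.

```python
# Implied per-component capacity/throughput from the slide's totals.
PB, GB = 10**15, 10**9

storage = 16 * PB
print(f"per disk: {storage / 1_000_000 / GB:.0f} GB")        # ~16 GB per disk
print(f"per tape: {storage / 500_000 / GB:.0f} GB")          # ~32 GB per tape

teraops = 100e12
print(f"per processor: {teraops / 100_000 / 1e9:.0f} GOps")  # ~1 GOps each
print(f"DRAM fraction: {100e12 / storage:.2%} of the data")  # ~0.6%
```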

Page 7: Building PetaByte Servers

7

The Architecture

• 2+N data center design
• Scaleable OR-DBMS
• Emphasize Pull vs. Push processing
• Storage hierarchy
• Data Pump
• Just-in-time acquisition

Page 8: Building PetaByte Servers

8

2+N Data Center Design

• Duplex the archive (for fault tolerance)
• Let anyone build an extract (the +N)
• Partition data by time and by space (store 2 or 4 ways)
• Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs)
• Clients and partitions interact via standard protocols
  – OLE-DB, DCOM/CORBA, HTTP, ...

Page 9: Building PetaByte Servers

9

Hardware Architecture

• 2 huge data centers
• Each has 50 to 1,000 nodes in a cluster
• Each node has about 25...250 TB of storage:
  – SMP             0.5 Bips to 50 Bips       20 K$
  – DRAM            50 GB to 1 TB             50 K$
  – 100 disks       2.3 TB to 230 TB         200 K$
  – 10 tape robots  25 TB to 250 TB          200 K$
  – 2 interconnects 1 GBps to 100 GBps        20 K$
• Node costs 500 K$
• Data center costs 25 M$ (capital cost)
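Summing the listed component costs reproduces the node and data-center totals; a sketch, with the 50-node figure taken from the low end of the slide's 50..1,000 range.

```python
# Roll up the per-node cost figures from the slide.
node_components_k = {
    "SMP":           20,
    "DRAM":          50,
    "100 disks":    200,
    "10 robots":    200,
    "interconnect":  20,
}
node_cost_k = sum(node_components_k.values())
print(f"node cost: {node_cost_k} K$")                         # 490 K$, i.e. ~500 K$

nodes = 50                                                    # low end of 50..1,000
print(f"data center: {nodes * node_cost_k / 1000:.1f} M$")    # ~25 M$ capital cost
```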

Page 10: Building PetaByte Servers

10

Scaleable OR-DBMS

• Adopt the cluster approach (Tandem, Teradata, VMScluster, DB2/PE, Informix, ...)
• System must scale to many processors, disks, links
• OR-DBMS based on a standard object model
  – CORBA or DCOM (not vendor specific)
• Grow by adding components
• System must be self-managing

Page 11: Building PetaByte Servers

11

Storage Hierarchy

• Cache hot 10% (1.5 PB) on disk
• Keep cold 90% on near-line tape
• Remember recent results on speculation

[Diagram: 10-TB RAM across 500 nodes; 1 PB of disk on 10,000 drives; 15 PB of tape robot, 4 x 1,000 robots]

Page 12: Building PetaByte Servers

12

Data Pump

• Some queries require reading ALL the data (for reprocessing)
• Each data center scans the data every 2 weeks
  – Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
• Compute on demand for small jobs:
  – less than 1,000 tape mounts
  – less than 100 M disk accesses
  – less than 100 TeraOps
  – (less than 30-minute response time)
• For BIG JOBS, scan the entire 15 PB database
• Queries (and extracts) "snoop" this data pump
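A unit-conversion check on the per-node pump rate quoted above (only the conversion is shown; no node count is assumed):

```python
# 10 TB per node per day expressed as a sustained transfer rate.
TB, MB = 10**12, 10**6
per_node_per_day = 10 * TB
rate = per_node_per_day / 86_400
print(f"{rate / MB:.0f} MB/s")   # ~116 MB/s, i.e. the slide's ~120 MB/s
```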

Page 13: Building PetaByte Servers

13

Just-in-Time Acquisition 30%

• Hardware prices decline 20%-40%/year
• So buy at the last moment
• Buy the best product that day: commodity
• Depreciate over 3 years so that the facility is fresh
  – (after 3 years, cost is 23% of original); 60% decline peaks at 10 M$

[Chart: EOS DIS Disk Storage Size and Cost, 1994-2008; Data Need (TB) and Storage Cost (M$) on a log scale from 10 to 10^5; assumes 40% price decline/year]
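The "23% of original after 3 years" figure follows from compounding a steady 40%/year price decline; a sketch of that arithmetic:

```python
# Residual price after n years of a steady 40%/year decline.
decline = 0.40
for years in range(1, 4):
    residual = (1 - decline) ** years
    print(f"after {years} year(s): {residual:.0%} of original price")
# year 3 gives ~22%, close to the slide's 23%
```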

Page 14: Building PetaByte Servers

14

Problems

• HSM
• Design and meta-data
• Ingest
• Data discovery, search, and analysis
• Reorg / reprocess
• Disaster recovery
• Cost

Page 15: Building PetaByte Servers

16

What's a Terabyte?

1 Terabyte:
  1,000,000,000 business letters      150 miles of bookshelf
    100,000,000 book pages             15 miles of bookshelf
     50,000,000 FAX images              7 miles of bookshelf
     10,000,000 TV pictures (mpeg)     10 days of video
          4,000 LandSat images

Library of Congress (in ASCII) is 25 TB

1980: 200 M$ of disc (10,000 discs); 5 M$ of tape silo (10,000 tapes)
1994: 1 M$ of magnetic disc (120 discs); 500 K$ of optical disc robot (250 platters); 50 K$ of tape silo (50 tapes)

Terror Byte! (.1% of a PetaByte!)

Page 16: Building PetaByte Servers

17

The Cost of Storage & Access

• File Cabinet:
    cabinet (4 drawer)         250$
    paper (24,000 sheets)      250$
    space (2x3 @ 10$/ft2)      180$
    total                      700$   =>  3.0 ¢/sheet
• Disk:
    disk (9 GB)              2,000$
    ASCII: 5 M pages           0.04 ¢/sheet  (100x cheaper than paper)
• Image:
    200 K pages                1 ¢/sheet     (similar to paper)
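The per-sheet figures above are total cost divided by page count; a sketch of the division:

```python
# Cost per sheet/page for the three storage options on the slide.
print(f"file cabinet: {700 / 24_000 * 100:.1f} cents/sheet")      # ~2.9, i.e. ~3.0
print(f"disk, ASCII:  {2_000 / 5_000_000 * 100:.2f} cents/page")  # 0.04 cents
print(f"disk, image:  {2_000 / 200_000 * 100:.0f} cent/page")     # 1 cent
```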

Page 17: Building PetaByte Servers

18

Standard Storage Metrics

• Capacity:
  – RAM:  MB and $/MB; today at 100 MB & 10 $/MB
  – Disk: GB and $/GB; today at 10 GB and 200 $/GB
  – Tape: TB and $/TB; today at .1 TB and 100 k$/TB (nearline)
• Access time (latency):
  – RAM:  100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate:
  – RAM:  1 GB/s
  – Disk: 5 MB/s (arrays can go to 1 GB/s)
  – Tape: 3 MB/s (not clear that striping works)

Page 18: Building PetaByte Servers

19

New Storage Metrics: KOXs, MOXs, GOXs, SCANs

• KOX:   how many kilobyte objects served per second
  – the file server, transaction processing metric
• MOX:   how many megabyte objects served per second
  – the Mosaic metric
• GOX:   how many gigabyte objects served per hour
  – the video & EOSDIS metric
• SCANS: how many scans of all the data per day
  – the data mining and utility metric
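One way to make these metrics concrete for a single server, as a sketch: the store size, bandwidth, and IOPS figures below are illustrative assumptions, not numbers from the talk, and the KOX formula (small objects limited by whichever of IOPS or bandwidth runs out first) is my own simplification.

```python
# Illustrative KOX / MOX / GOX / SCANS figures for a hypothetical server.
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12

total_data = 1 * TB      # size of the whole store (assumed)
bandwidth  = 100 * MB    # aggregate sustained bandwidth, bytes/s (assumed)
small_iops = 5_000       # random small-object requests per second (assumed)

kox   = min(small_iops, bandwidth / KB)   # KB objects served per second
mox   = bandwidth / MB                    # MB objects served per second
gox   = bandwidth / GB * 3_600            # GB objects served per hour
scans = bandwidth * 86_400 / total_data   # full scans of the data per day

print(f"KOX={kox:.0f}/s  MOX={mox:.0f}/s  GOX={gox:.0f}/h  SCANS={scans:.1f}/day")
```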

Page 19: Building PetaByte Servers

20

Summary (of new ideas)

• Storage accesses are the bottleneck
• Accesses are getting larger (MOX, GOX, SCANS)
• Capacity and cost are improving
• BUT
• Latencies and bandwidth are not improving much
• SO
• Use parallel access (disk and tape farms)

Page 20: Building PetaByte Servers

21

How To Get Lots of MOX, GOX, SCANS

• Parallelism: use many little devices in parallel
• Beware of the media myth
• Beware of the access time myth

[Diagram: at 10 MB/s it takes 1.2 days to scan 1 Terabyte; with 1,000x parallelism, the scan takes 1.5 minutes]

Parallelism: divide a big problem into many smaller ones to be solved in parallel.
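The scan times above follow directly from size divided by bandwidth; a sketch:

```python
# Time to scan 1 TB serially at 10 MB/s vs. with 1,000-way parallelism.
TB, MB = 10**12, 10**6
serial_seconds = 1 * TB / (10 * MB)
print(f"serial:   {serial_seconds / 86_400:.1f} days")     # ~1.2 days
parallel_seconds = serial_seconds / 1_000
print(f"parallel: {parallel_seconds / 60:.1f} minutes")    # ~1.7 min (slide: 1.5)
```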

Page 21: Building PetaByte Servers

22

Meta-Message: Technology Ratios Are Important

• If everything gets faster & cheaper at the same rate, then nothing really changes.
• Some things are getting MUCH BETTER:
  – communication speed & cost: 1,000x
  – processor speed & cost: 100x
  – storage size & cost: 100x
• Some things are staying about the same:
  – speed of light (more or less constant)
  – people (10x worse)
  – storage speed (only 10x better)

Page 22: Building PetaByte Servers

23

Outline

• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system
• Conclusion 1:
  – Think about MOX and SCANS
• Conclusion 2:
  – Think about Clusters
  – SMP report
  – Cluster report

Page 23: Building PetaByte Servers

24

Scaleable Computers: BOTH SMP and Cluster

• Grow UP with SMP: 4xP6 is now standard
• Grow OUT with Cluster: a cluster has inexpensive parts

[Diagram: Personal System -> Departmental Server -> SMP Super Server, and a Cluster of PCs]

Page 24: Building PetaByte Servers

25

TPC-C Current Results

• Best performance is 30,390 tpmC @ $305/tpmC (Oracle/DEC)
• Best price/perf. is 7,693 tpmC @ $43.5/tpmC (MS SQL/Dell)
• Graphs show:
  – UNIX high price
  – UNIX scale-up diseconomy

[Charts: $/tpmC vs. tpmC for DB2, Informix, MS SQL Server, Oracle, and Sybase; full range (0-30,000 tpmC, $0-$300/tpmC) and low end (0-10,000 tpmC, $0-$200/tpmC)]
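Since TPC-C price/performance is total (5-year) system price divided by throughput, multiplying the two reported numbers back together recovers the system prices; a quick check:

```python
# Recover total system price from tpmC and $/tpmC.
results = [
    ("Oracle/DEC",  30_390, 305.0),   # best performance
    ("MS SQL/Dell",  7_693,  43.5),   # best price/performance
]
for name, tpmc, dollars_per_tpmc in results:
    print(f"{name}: ~${tpmc * dollars_per_tpmc / 1e6:.1f} M system price")
# Oracle/DEC ~ $9.3 M; MS SQL/Dell ~ $0.3 M
```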

Page 25: Building PetaByte Servers

26

Compare SMP Performance

[Charts: tpmC vs. number of CPUs (0-20 CPUs, up to ~25,000 tpmC) and SMP scaleability (up to ~20,000 tpmC), comparing Sun/Sybase and SQL Server]

Page 26: Building PetaByte Servers

27

TPC-C Improved Fast

[Charts: $/tpmC vs. time ($10-$1,000, log scale) and tpmC vs. time (100-100,000, log scale), Mar-94 through Jun-97: 250%/year improvement!]

40% hardware, 100% software, 100% PC technology

Page 27: Building PetaByte Servers

28

Where the Money Goes

[Chart: TPC price/tpmC broken down by processor, disk, software, and network for eight configurations: Oracle on DEC Unix, Oracle on UltraSparc/Solaris, Oracle on Compaq/NT, Sybase on Compaq/NT, Microsoft on Compaq with Visigenics, Microsoft on Intergraph with IIS, Microsoft on Compaq with IIS, Microsoft on Dell with IIS]

Page 28: Building PetaByte Servers

29

What Does This Mean?

• PC technology is 3x cheaper than high-end SMPs
• PC node performance is 1/2 of high-end SMPs
  – 4xP6 vs. 20xUltraSparc
• Peak performance is a cluster
  – Tandem 100-node cluster
  – DEC Alpha 4x8 cluster
• Commodity solutions WILL come to this market

Page 29: Building PetaByte Servers

30

Cluster: Shared What?

• Shared Memory Multiprocessor
  – multiple processors, one memory
  – all devices are local
  – DEC, SGI, Sun, Sequent: 16..64 nodes
  – easy to program, not commodity
• Shared Disk Cluster
  – an array of nodes
  – all share common disks
  – VAXcluster + Oracle
• Shared Nothing Cluster
  – each device local to a node
  – ownership may change
  – Tandem, SP2, Wolfpack

Page 30: Building PetaByte Servers

31

Clusters Being Built

• Teradata: 1,500 nodes + 24 TB disk (50 k$/slice)
• Tandem, VMScluster: 150 nodes (100 k$/slice)
• Intel: 9,000 nodes @ 55 M$ (~6 k$/slice)
• Teradata, Tandem, DEC moving to NT + low slice price
• IBM: 512 nodes @ 100 M$ (200 k$/slice)
• PC clusters (bare handed) at dozens of nodes
  – web servers (msn, PointCast, ...), DB servers
• KEY TECHNOLOGY HERE IS THE APPS.
  – Apps distribute data
  – Apps distribute execution
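The per-slice prices quoted above are just total cost divided by node count; a sketch:

```python
# Cost per slice (node) for two of the listed clusters.
clusters = {
    "Intel": (55e6, 9_000),
    "IBM":   (100e6, 512),
}
for name, (total_dollars, nodes) in clusters.items():
    print(f"{name}: ~{total_dollars / nodes / 1e3:.0f} k$/slice")
# Intel ~6 k$/slice; IBM ~195 k$/slice (the slide rounds to 200 k$)
```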

Page 31: Building PetaByte Servers

32

Cluster Advantages

• Clients and servers made from the same stuff
  – inexpensive: built with commodity components
• Fault tolerance
  – spare modules mask failures
• Modular growth
  – grow by adding small modules
• Parallel data search
  – use multiple processors and disks

Page 32: Building PetaByte Servers

33

Clusters Are Winning the High End

• You saw that a 4x8 cluster has the best TPC-C performance.
• This year, a 95xUltraSparc cluster won the MinuteSort speed trophy
  (see NOWsort at www.now.cs.berkeley.edu).
• Ordinal 16x on SGI Origin is close (but the loser!).

[Chart: sort records/second vs. time, 1985-2000, roughly 10^2 to 10^7; data points for M68000, Cray YMP, IBM 3090, Tandem, Hardware Sorter, Sequent, Intel Hyper, SGI, IBM RS6000, Alpha, NOW (95 nodes), Ordinal + SGI]

Page 33: Building PetaByte Servers

34

Clusters (Plumbing)

• Single system image
  – naming
  – protection/security
  – management/load balance
• Fault tolerance
  – Wolfpack demo
• Hot-pluggable hardware & software

Page 34: Building PetaByte Servers

35

So, What's New?

• When slices cost 50 k$, you buy 10 or 20.
• When slices cost 5 k$, you buy 100 or 200.
• Manageability, programmability, usability become key issues (total cost of ownership).
• PCs are MUCH easier to use and program.

[Diagram: the MPP vicious cycle (each new MPP and new OS needs a new app, so there are no customers) vs. the commodity virtuous cycle (standard OS & hardware attract apps and customers; standards allow progress and investment protection)]

Page 35: Building PetaByte Servers

36

Windows NT Server Clustering: High Availability on Standard Hardware

• Standard API for clusters on many platforms
• No special hardware required
• Resource Group is the unit of failover
• Typical resources:
  – shared disk, printer, ...
  – IP address, NetName
  – Service (Web, SQL, File, Print, Mail, MTS)
• API to define resource groups, dependencies, resources
• GUI administrative interface
• A consortium of 60 HW & SW vendors (everybody who is anybody)
• 2-node cluster in beta test now; available 97H1; >2 nodes is next
• SQL Server and Oracle demo on it today

Key concepts:
  – System: a node
  – Cluster: systems working together
  – Resource: hardware / software module
  – Resource dependency: one resource needs another
  – Resource group: fails over as a unit
  – Dependencies do not cross group boundaries
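A minimal sketch of the resource-group model just described: resources with dependencies, a group that fails over as one unit, and a check that dependencies never cross group boundaries. This is an illustrative model only, not the Wolfpack/NT cluster API; all class and node names are made up.

```python
# Illustrative model of cluster resource groups (not the real NT cluster API).
from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str
    depends_on: list["Resource"] = field(default_factory=list)

@dataclass
class ResourceGroup:
    name: str
    resources: list[Resource]
    owner: str                          # node currently hosting the group

    def validate(self) -> None:
        # Dependencies must not cross group boundaries.
        members = {id(r) for r in self.resources}
        for r in self.resources:
            for dep in r.depends_on:
                assert id(dep) in members, f"{r.name} depends outside its group"

    def fail_over(self, to_node: str) -> None:
        # The whole group moves together: it is the unit of failover.
        print(f"moving group '{self.name}' from {self.owner} to {to_node}")
        self.owner = to_node

# Example: a web service that needs a shared disk and an IP address.
disk = Resource("shared disk")
ip   = Resource("IP address")
web  = Resource("Web service", depends_on=[disk, ip])

group = ResourceGroup("web group", [disk, ip, web], owner="Alice")
group.validate()
group.fail_over("Betty")                # Alice fails; Betty takes over the group
```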

Page 36: Building PetaByte Servers

37

Wolfpack NT Clusters 1.0

• Two-node file and print failover
• GUI admin interface

[Diagram: two nodes, Alice and Betty, each with private disks, connected to shared SCSI disk strings and to clients]

Page 37: Building PetaByte Servers

39

Where We Are Today

• Clusters moving fast
  – OLTP
  – Sort
  – WolfPack
• Technology ahead of schedule
  – cpus, disks, tapes, wires, ...
• OR databases are evolving
• Parallel DBMSs are evolving
• HSM still immature

Page 38: Building PetaByte Servers

40

Outline

• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system
• Conclusion 1:
  – Think about MOX and SCANS
• Conclusion 2:
  – Think about Clusters
  – SMP report
  – Cluster report

Page 39: Building PetaByte Servers

41

Building PetaByte Servers

Jim Gray
Microsoft Research
[email protected]
http://www.Research.Microsoft.com/~Gray/talks

Kilo  10^3
Mega  10^6
Giga  10^9
Tera  10^12   <- today, we are here
Peta  10^15
Exa   10^18