Rules of Thumb in Data Engineering

Jim Gray, CMU, 8 Oct 2001, [email protected], http://research.Microsoft.com/~Gray/Talks/


Page 1: Rules of Thumb in Data Engineering

1

Rules of Thumb in Data Engineering

Jim Gray, CMU, 8 Oct 2001, [email protected], http://research.Microsoft.com/~Gray/Talks/

Page 2: Rules of Thumb in Data Engineering

2

Outline

Moore’s Law and consequences

Storage rules of thumb

Balanced systems rules revisited

Networking rules of thumb

Caching rules of thumb

Page 3: Rules of Thumb in Data Engineering

3

Meta-Message: Technology Ratios Matter

Price and Performance change.

If everything changes in the same way, then nothing really changes.
If some things get much cheaper/faster than others, then that is real change.
Some things are not changing much:
  Cost of people
  Speed of light
  …

And some things are changing a LOT

Page 4: Rules of Thumb in Data Engineering

4

Trends: Moore’s Law

Performance/Price doubles every 18 months
100x per decade
Progress in next 18 months = ALL previous progress
  New storage = sum of all old storage (ever)
  New processing = sum of all old processing.

E. coli doubles every 20 minutes!

[Figure label: 15 years ago]
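The doubling arithmetic on this slide can be checked with a short sketch. It assumes an idealized, exactly exponential trend with an 18-month doubling time; the function names are illustrative and do not come from the talk.

```python
DOUBLING_YEARS = 1.5  # Moore's law as stated on the slide: 2x every 18 months


def growth_factor(years: float, doubling_years: float = DOUBLING_YEARS) -> float:
    """Improvement factor after `years` of steady doubling."""
    return 2.0 ** (years / doubling_years)


def cumulative_before(periods: int) -> int:
    """Total shipped in all doubling periods before period `periods`,
    assuming 1 unit in period 0 and twice as much in each later period."""
    return sum(2 ** k for k in range(periods))


if __name__ == "__main__":
    print(f"per decade: {growth_factor(10):.0f}x")
    n = 10
    # The newest period ships 2**n units; everything before it sums to 2**n - 1.
    print(2 ** n, cumulative_before(n))
```

Under these assumptions a decade gives roughly 100x, and each new period roughly equals the sum of all earlier ones.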

Page 5: Rules of Thumb in Data Engineering

5

Trends: ops/s/$ Had Three Growth Phases

1890-1945: Mechanical, Relay: 7-year doubling
1945-1985: Tube, transistor, ...: 2.3-year doubling
1985-2000: Microprocessor: 1.0-year doubling

[Figure: ops per second/$ (log scale, 1.E-06 to 1.E+09) vs. year (1880 to 2000); curve annotations: doubles every 7.5 years, doubles every 2.3 years, doubles every 1.0 years]

Page 6: Rules of Thumb in Data Engineering

6

So: a problem

Suppose you have a ten-year compute job on the world’s fastest supercomputer. What should you do?
  Commit 250M$ now?
  Program for 9 years
    Software speedup: 2^6 = 64x
    Moore’s law speedup: 2^6 = 64x
    so 4,000x speedup: spend 1M$ (not 250M$) on hardware; runs in 2 weeks, not 10 years.

Homework problem: What is the optimum strategy?
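One way to frame the homework, as a rough sketch only: assume hardware speed doubles every 18 months, ignore software gains and budget, wait some number of years, then buy a machine and run. The model and every name below are assumptions for illustration, not the talk's answer.

```python
import math


def finish_time(wait_years: float, job_years: float = 10.0,
                doubling_years: float = 1.5) -> float:
    """Years until the job completes if we idle for `wait_years` and then
    run on hardware that is 2**(wait/doubling) times faster."""
    return wait_years + job_years / 2.0 ** (wait_years / doubling_years)


def best_wait(job_years: float = 10.0, doubling_years: float = 1.5) -> float:
    """Wait that minimizes finish_time. Setting the derivative to zero gives
    2**(t/d) = job_years * ln(2) / d; the wait is never negative."""
    ratio = job_years * math.log(2) / doubling_years
    return max(0.0, doubling_years * math.log2(ratio))


if __name__ == "__main__":
    t = best_wait()
    print(f"wait {t:.2f} y, done after {finish_time(t):.2f} y "
          f"(vs {finish_time(0):.0f} y starting now)")
```

In this simplified model, waiting a few years and then starting finishes sooner than starting today; adding software speedups or cost constraints would shift the optimum.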

Page 7: Rules of Thumb in Data Engineering

7

Disk TB Shipped per Year

1998 Disk Trend (Jim Porter) http://www.disktrend.com/pdf/portrpkg.pdf

Storage capacity beating Moore’s law
  2 k$/TB today (raw disk)
  1 k$/TB by end of 2002

[Figure: disk TB shipped per year (log scale, 1E+3 to 1E+7, ExaByte marked) vs. year (1988 to 2000); disk TB growth: 112%/y; Moore's Law: 58.7%/y]

Moore's law: 58.70%/year
Revenue: 7.47%
TB growth: 112.30% (since 1993)
Price decline: 50.70% (since 1993)
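To compare the annual rates above with the "doubles every 18 months" phrasing, a compound growth rate can be converted to a doubling time. This is a small sketch assuming steady compound growth; the helper name is made up for the example.

```python
import math


def doubling_months(annual_growth_pct: float) -> float:
    """Months needed to double at a compound annual growth rate given in %."""
    return 12.0 * math.log(2) / math.log(1.0 + annual_growth_pct / 100.0)


if __name__ == "__main__":
    for label, pct in [("Moore's law", 58.7), ("disk TB shipped", 112.3)]:
        print(f"{label}: {pct}%/y doubles every {doubling_months(pct):.1f} months")
```

A 58.7% annual rate corresponds to an 18-month doubling, while 112.3% per year doubles in a bit under a year.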

Page 8: Rules of Thumb in Data Engineering

11

Consequence of Moore’s law: Need an address bit every 18 months.

Moore’s law gives you 2x more in 18 months.
RAM
  Today we have 10 MB to 100 GB machines (24-36 bits of addressing)
  In 9 years we will need 6 more bits: 30-42 bit addressing (4TB ram).
Disks
  Today we have 10 GB to 100 TB file systems/DBs (33-47 bit file addresses)
  In 9 years, we will need 6 more bits: 40-53 bit file addresses (100 PB files)
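The address-bit arithmetic is a base-2 logarithm plus one bit per doubling. The sketch below assumes decimal units (1 MB = 10^6 bytes), so its results only approximate the rounded bit counts quoted on the slide; the function names are illustrative.

```python
import math


def address_bits(size_bytes: float) -> float:
    """log2 of a byte count: bits needed to address that many bytes."""
    return math.log2(size_bytes)


def extra_bits(years: float, doubling_years: float = 1.5) -> float:
    """Additional address bits needed after `years` of Moore's-law growth."""
    return years / doubling_years


if __name__ == "__main__":
    MB, GB, TB = 1e6, 1e9, 1e12  # decimal units, an assumption
    for label, size in [("10 MB", 10 * MB), ("100 GB", 100 * GB),
                        ("10 GB", 10 * GB), ("100 TB", 100 * TB)]:
        print(f"{label}: {address_bits(size):.1f} bits")
    print(f"9 years: +{extra_bits(9):.0f} bits")
```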

Page 9: Rules of Thumb in Data Engineering

12

Architecture could change this

1-level store: System/38, AS/400 has 1-level store. Never re-uses an address. Needs 96-bit addressing today.

NUMAs and Clusters: Willing to buy a 100 M$ computer? Then add 6 more address bits.

Only 1-level store pushes us beyond 64 bits.
Still, these are “logical” addresses; 64-bit physical will last many years.

Page 10: Rules of Thumb in Data Engineering

13

Trends: Gilder’s Law: 3x bandwidth/year for 25 more years

Today:
  40 Gbps per channel (λ)
  12 channels per fiber (wdm): 500 Gbps
  32 fibers/bundle = 16 Tbps/bundle
In lab: 3 Tbps/fiber (400 x WDM)
In theory: 25 Tbps per fiber
1 Tbps = USA 1996 WAN bisection bandwidth
Aggregate bandwidth doubles every 8 months!

[Figure label: 1 fiber = 25 Tbps]

Page 11: Rules of Thumb in Data Engineering

14

Outline

Moore’s Law and consequences

Storage rules of thumb

Balanced systems rules revisited

Networking rules of thumb

Caching rules of thumb

Page 12: Rules of Thumb in Data Engineering

15

How much storage do we need?

Soon everything can be recorded and indexed
Most bytes will never be seen by humans.
Data summarization, trend detection, anomaly detection are key technologies

See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html

See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/

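[Figure: scale of stored information from Kilo through Mega, Giga, Tera, Peta, Exa, Zetta to Yotta, with markers: A Book, A Photo, Movie, All LoC books (words), All Books MultiMedia, Everything! Recorded]

Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli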

Page 13: Rules of Thumb in Data Engineering

16

Storage Latency: How Far Away is the Data?

[Figure: storage latency shown as distance and travel time]
  Registers: 1 (My Head, 1 min)
  On Chip Cache: 2 (This Room)
  On Board Cache: 10 (This Campus, 10 min)
  Memory: 100 (Springfield, 1.5 hr)
  Disk: 10^6 (Pluto, 2 Years)
  Tape / Optical Robot: 10^9 (Andromeda, 2,000 Years)

Page 14: Rules of Thumb in Data Engineering

17

Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs

[Figure, two panels. "Size vs Speed": typical system size (bytes, 10^3 to 10^15) vs. access time (seconds, 10^-9 to 10^3). "Price vs Speed": $/MB (10^-6 to 10^2) vs. access time (seconds, 10^-9 to 10^3). Both panels plot Cache, Main, Secondary, Disc, Online Tape, Nearline Tape, Offline Tape.]

Page 15: Rules of Thumb in Data Engineering

18

Disks: Today

Disk is 18GB to 180 GB
10-50 MBps
5k-15k rpm (6ms-2ms rotational latency)
12ms-7ms seek
2K$/IDE-TB, 7k$/SCSI-TB
For shared disks most time spent waiting in queue for access to arm/controller

[Figure: disk access timelines. Dedicated: Seek, Rotate, Transfer. Shared: Wait, Seek, Rotate, Transfer.]

Page 16: Rules of Thumb in Data Engineering

20

Standard Storage Metrics

Capacity:
  RAM: MB and $/MB: today at 512MB and 200$/GB
  Disk: GB and $/GB: today at 80GB and 70k$/TB
  Tape: TB and $/TB: today at 40GB and 10k$/TB (nearline)
Access time (latency)
  RAM: 100 ns
  Disk: 15 ms
  Tape: 30 second pick, 30 second position
Transfer rate
  RAM: 1-10 GB/s
  Disk: 10-50 MB/s (arrays can go to 10GB/s)
  Tape: 5-15 MB/s (arrays can go to 1GB/s)

Page 17: Rules of Thumb in Data Engineering

21

New Storage Metrics: Kaps, Maps, SCAN

Kaps: How many kilobyte objects served per second
  The file server, transaction processing metric
  This is the OLD metric.

Maps: How many megabyte objects served per sec
  The Multi-Media metric

SCAN: How long to scan all the data
  The data mining and utility metric

And Kaps/$, Maps/$, TBscan/$
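The three metrics can be estimated from a drive's basic parameters. The sketch below assumes each access pays one seek, one rotational delay, and the transfer time, and it borrows the disk figures from the talk's Disk vs Tape slide (80 GB, 20 MBps, 5 ms seek, 3 ms rotate); the class and method names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    capacity_gb: float    # GB
    transfer_mbps: float  # MB/s, sequential
    seek_ms: float
    rotate_ms: float

    def access_seconds(self, object_mb: float) -> float:
        """Seek + rotate + transfer time for one randomly placed object."""
        return (self.seek_ms + self.rotate_ms) / 1000.0 + object_mb / self.transfer_mbps

    def kaps(self) -> float:
        """Kilobyte objects served per second."""
        return 1.0 / self.access_seconds(0.001)

    def maps(self) -> float:
        """Megabyte objects served per second."""
        return 1.0 / self.access_seconds(1.0)

    def scan_hours(self) -> float:
        """Time to read the whole disk sequentially."""
        return self.capacity_gb * 1000.0 / self.transfer_mbps / 3600.0


if __name__ == "__main__":
    d = Disk(capacity_gb=80, transfer_mbps=20, seek_ms=5, rotate_ms=3)
    print(f"{d.kaps():.0f} Kaps, {d.maps():.0f} Maps, {d.scan_hours():.1f} h scan")
```

With these inputs the estimate comes out near 120 Kaps and a scan of about an hour, in line with the 2001 disk figures quoted elsewhere in the talk.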

Page 18: Rules of Thumb in Data Engineering

24

Disk Changes

Disks got cheaper: 20k$ -> 200$
  $/Kaps etc improved 100x (Moore’s law!) (or even 500x)
  One-time event (went from mainframe prices to PC prices)
Disk data got cooler (10x per decade):
  1990 disk ~ 1GB and 50Kaps and 5 minute scan
  2001 disk ~160GB and 120Kaps and 1 hour scan
So
  1990: 1 Kaps per 20 MB
  2001: 1 Kaps per 1,000 MB
  disk scans take longer (10x per decade)
Backup/restore takes a long time (too long)
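"Cooling" here is the amount of stored data behind each access per second the drive can deliver. A minimal sketch of that ratio, using the slide's 1990 and 2001 figures and decimal units (an assumption); the helper name is illustrative.

```python
def mb_per_kaps(capacity_gb: float, kaps: float) -> float:
    """MB of stored data behind each 1 KB access/second the disk can deliver."""
    return capacity_gb * 1000.0 / kaps


if __name__ == "__main__":
    print(f"1990: 1 Kaps per {mb_per_kaps(1, 50):.0f} MB")
    print(f"2001: 1 Kaps per {mb_per_kaps(160, 120):.0f} MB")
```

The 2001 value is on the order of the 1,000 MB that the slide rounds to.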

Page 19: Rules of Thumb in Data Engineering

27

Data on Disk Can Move to RAM in 10 years

[Figure: Storage Price vs Time. Megabytes per kilo-dollar (MB/k$, log scale 0.1 to 10,000) vs. Year (1980 to 2000); annotations: 100:1, 10 years]

Page 20: Rules of Thumb in Data Engineering

28

The “Absurd” 10x (=4 year) Disk

1 TB, 100 MB/s, 200 Kaps
2.5 hr scan time (poor sequential access)
1 aps / 5 GB (VERY cold data)
It’s a tape!

Page 21: Rules of Thumb in Data Engineering

29

Disk vs Tape

Disk
  80 GB
  20 MBps
  5 ms seek time
  3 ms rotate latency
  3$/GB for drive, 3$/GB for ctlrs/cabinet
  15 TB/rack
  1 hour scan

Tape
  40 GB
  10 MBps
  10 sec pick time
  30-120 second seek time
  2$/GB for media, 8$/GB for drive+library
  10 TB/rack
  1 week scan

The price advantage of tape is narrowing, and the performance advantage of disk is growing.
At 10K$/TB, disk is competitive with nearline tape.

Guestimates: Cern: 200 TB; 3480 tapes; 2 col = 50GB; Rack = 1 TB = 8 drives

Page 22: Rules of Thumb in Data Engineering

31

It’s Hard to Archive a Petabyte

It takes a LONG time to restore it.
  At 1GBps it takes 12 days!
Store it in two (or more) places online (on disk?): a geo-plex.
Scrub it continuously (look for errors).
On failure, use other copy until failure repaired, refresh lost copy from safe copy.
Can organize the two copies differently (e.g.: one by time, one by space)
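The 12-day figure follows from dividing a petabyte by the restore bandwidth. A quick check, assuming decimal units and a perfectly sustained 1 GB/s; the function name is illustrative.

```python
def restore_days(size_bytes: float, bytes_per_second: float) -> float:
    """Days needed to stream `size_bytes` at a sustained rate."""
    return size_bytes / bytes_per_second / 86_400  # 86,400 seconds per day


if __name__ == "__main__":
    PB, GBPS = 1e15, 1e9  # decimal petabyte and GB/s, an assumption
    print(f"{restore_days(PB, GBPS):.1f} days")
```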

Page 23: Rules of Thumb in Data Engineering

32

Auto Manage Storage

1980 rule of thumb: A DataAdmin per 10GB, SysAdmin per mips
2000 rule of thumb: A DataAdmin per 5TB, SysAdmin per 100 clones (varies with app).
Problem: 5TB is 50k$ today, 5k$ in a few years.
  Admin cost >> storage cost !!!!
Challenge: Automate ALL storage admin tasks

Page 24: Rules of Thumb in Data Engineering

33

How to cool disk data:

Cache data in main memory
  See 5 minute rule later in presentation
Fewer-larger transfers
  Larger pages (512 -> 8KB -> 256KB)
Sequential rather than random access
  Random 8KB IO is 1.5 MBps
  Sequential IO is 30 MBps (20:1 ratio is growing)
Raid1 (mirroring) rather than Raid5 (parity).

Page 25: Rules of Thumb in Data Engineering

37

Summarizing storage rules of thumb (1)

Moore’s law: 4x every 3 years, 100x more per decade
  Implies 2 bits of addressing every 3 years.
Storage capacities increase 100x/decade
Storage costs drop 100x per decade
Storage throughput increases 10x/decade
Data cools 10x/decade
Disk page sizes increase 5x per decade.

Page 26: Rules of Thumb in Data Engineering

38

Summarizing storage rules of thumb (2)

RAM:Disk and Disk:Tape cost ratios are 100:1 and 3:1
So, in 10 years, disk data can move to RAM since prices decline 100x per decade.
A person can administer a million dollars of disk storage: that is 1TB - 100TB today
Disks are replacing tapes as backup devices.
You can’t backup/restore a Petabyte quickly, so geoplex it.
Mirroring rather than Parity to save disk arms

Page 27: Rules of Thumb in Data Engineering

39

Outline

Moore’s Law and consequences

Storage rules of thumb

Balanced systems rules revisited

Networking rules of thumb

Caching rules of thumb

Page 28: Rules of Thumb in Data Engineering

40

Standard Architecture (today)

[Figure: standard architecture diagram; labels: System Bus, PCI Bus 1, PCI Bus 2]

Page 29: Rules of Thumb in Data Engineering

41

Amdahl’s Balance Laws

parallelism law: If a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S.

balanced system law: A system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps.

memory law: α=1: the MB/MIPS ratio (called alpha (α)), in a balanced system is 1.

IO law: Programs do one IO per 50,000 instructions.
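The parallelism law is easy to evaluate numerically. In the sketch below, `max_speedup` is the bound stated above; `speedup` is the usual finite-processor form of Amdahl's law, included as an assumption since the slide gives only the limit. Names are illustrative.

```python
def max_speedup(serial: float, parallel: float) -> float:
    """Upper bound on speedup with unlimited processors: (S + P) / S."""
    return (serial + parallel) / serial


def speedup(serial: float, parallel: float, processors: int) -> float:
    """Speedup when the parallel part is divided evenly over `processors`."""
    return (serial + parallel) / (serial + parallel / processors)


if __name__ == "__main__":
    # A job that is 10% serial can never run more than 10x faster.
    print(max_speedup(1, 9))
    for n in (9, 90, 900):
        print(n, round(speedup(1, 9, n), 2))
```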

Page 30: Rules of Thumb in Data Engineering

42

Amdahl’s Laws Valid 35 Years Later?

Parallelism law is algebra: so SURE!
Balanced system laws? Look at tpc results (tpcC, tpcH) at http://www.tpc.org/
Some imagination needed:
  What’s an instruction (CPI varies from 1-3)? RISC, CISC, VLIW, … clocks per instruction, …
  What’s an I/O?

Page 31: Rules of Thumb in Data Engineering

43

TPC systems

Normalize for CPI (clocks per instruction)
  TPC-C has about 7 ins/byte of IO
  TPC-H has 3 ins/byte of IO
TPC-H needs ½ as many disks, sequential vs random
Both use 9GB 10 krpm disks (need arms, not bytes)

                      MHz/cpu  CPI  mips  KB/IO  IO/s/disk  Disks  Disks/cpu  MB/s/cpu  Ins/IO Byte
Amdahl                      1    1     1      6                                                   8
TPC-C = random            550  2.1   262      8        100    397         50        40            7
TPC-H = sequential        550  1.2   458     64        100    176         22       141            3
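The derived columns of the TPC table appear to follow from the measured ones: mips = MHz / CPI, MB/s per cpu = KB/IO x IO/s/disk x disks per cpu, and instructions per IO byte = mips / (MB/s). The sketch below recomputes them under that reading; the relationships are inferred from the numbers, and the function name is made up.

```python
def derived_columns(mhz, cpi, kb_per_io, ios_per_disk, disks_per_cpu):
    """Return (mips, MB/s per cpu, instructions per byte of IO)."""
    mips = mhz / cpi
    mb_per_s = kb_per_io * ios_per_disk * disks_per_cpu / 1000.0
    return mips, mb_per_s, mips / mb_per_s


if __name__ == "__main__":
    rows = {"TPC-C": (550, 2.1, 8, 100, 50), "TPC-H": (550, 1.2, 64, 100, 22)}
    for name, row in rows.items():
        mips, mb_per_s, ins_per_byte = derived_columns(*row)
        print(f"{name}: {mips:.0f} mips, {mb_per_s:.0f} MB/s/cpu, "
              f"{ins_per_byte:.1f} ins/byte")
```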

Page 32: Rules of Thumb in Data Engineering

46

Amdahl’s Balance Laws Revised

Laws right, just need “interpretation” (imagination?)

Balanced System Law: A system needs 8 MIPS per MBps of IO, but instruction rate must be measured on the workload.
  Sequential workloads have low CPI (clocks per instruction),
  random workloads tend to have higher CPI.

Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue.

One random IO per 50k instructions.
  Sequential IOs are larger: one sequential IO per 200k instructions

Page 33: Rules of Thumb in Data Engineering

48

Outline

Moore’s Law and consequences

Storage rules of thumb

Balanced systems rules revisited

Networking rules of thumb

Caching rules of thumb

Page 34: Rules of Thumb in Data Engineering

51

Networking

WANs are getting faster than LANs
  G8 = OC192 = 8Gbps is “standard”
Link bandwidth improves 4x per 3 years
Speed of light (60 ms round trip in US)
Software stacks have always been the problem.

Time = SenderCPU + ReceiverCPU + bytes/bandwidth

This has been the problem

Page 35: Rules of Thumb in Data Engineering

54

How much does wire-time cost? $/Mbyte?

                     Cost      Time
Gbps Ethernet        .2µ$      10 ms
100 Mbps Ethernet    .3µ$      100 ms
OC12 (650 Mbps)      .003$     20 ms
DSL                  .0006$    25 sec
POTs                 .002$     200 sec
Wireless:            .80$      500 sec

            Seat cost ($/3y)   Bandwidth (B/s)   $/MB     Time (s)
GBpsE                  2000          1.00E+08    2.E-07      0.010
100MbpsE                700          1.00E+07    7.E-07      0.100
OC12               12960000          5.00E+07    3.E-03      0.020
OC3                 3132000          3.00E+06    1.E-02      0.333
T1                    28800          1.00E+05    3.E-03     10.000
DSL                    2300          4.00E+04    6.E-04     25.000
POTS                   1180          5.00E+03    2.E-03    200.000
Wireless                  ?          2.00E+03    8.E-01    500.000

seconds in 3 years: 94608000
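The $/MB and Time columns look like straightforward amortization: spread the three-year seat cost over every byte the link could carry in three years, and divide one megabyte by the bandwidth. The sketch below reproduces a few rows under that assumption (full utilization, decimal megabytes); the function names are illustrative.

```python
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 3600  # 94,608,000, matching the slide


def dollars_per_mb(seat_cost_3y: float, bytes_per_second: float) -> float:
    """Seat cost amortized over all bytes a fully used link moves in 3 years."""
    return seat_cost_3y / (bytes_per_second * SECONDS_IN_3_YEARS) * 1e6


def seconds_per_mb(bytes_per_second: float) -> float:
    """Wire time to move one megabyte."""
    return 1e6 / bytes_per_second


if __name__ == "__main__":
    links = {"GbpsE": (2000, 1e8), "T1": (28800, 1e5), "POTS": (1180, 5e3)}
    for name, (cost, bw) in links.items():
        print(f"{name}: {dollars_per_mb(cost, bw):.1e} $/MB, "
              f"{seconds_per_mb(bw):.3f} s/MB")
```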

Page 36: Rules of Thumb in Data Engineering

55

Data delivery costs 1$/GB today

Rent for “big” customers: 300$/megabit per second per month
Improved 3x in last 6 years (!).
That translates to 1$/GB at each end.

You can mail a 160 GB disk for 20$.
  That’s 16x cheaper
  If overnight it’s 3 MBps.

[Figure label: 3x160 GB ~ ½ TB]
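The 1$/GB and 16x figures can be reproduced with a little arithmetic. This sketch assumes a 30-day month, a fully used link, and decimal gigabytes; the function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def wan_dollars_per_gb(rent_per_mbps_month: float = 300.0) -> float:
    """Cost per GB at one end of a link rented by the megabit per second."""
    gb_per_month = 1e6 / 8 * SECONDS_PER_MONTH / 1e9  # 1 Mbps sustained
    return rent_per_mbps_month / gb_per_month


def mail_dollars_per_gb(disk_gb: float = 160.0, postage: float = 20.0) -> float:
    """Cost per GB of mailing a disk."""
    return postage / disk_gb


if __name__ == "__main__":
    print(f"WAN: {wan_dollars_per_gb():.2f} $/GB per end")
    print(f"mail: {mail_dollars_per_gb():.3f} $/GB")
    # Using the slide's rounded 1 $/GB at each of the two ends:
    print(f"ratio: {2 * 1.0 / mail_dollars_per_gb():.0f}x")
```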

Page 37: Rules of Thumb in Data Engineering

56

Outline

Moore’s Law and consequences

Storage rules of thumb

Balanced systems rules revisited

Networking rules of thumb

Caching rules of thumb

Page 38: Rules of Thumb in Data Engineering

57

The Five Minute Rule

Trade DRAM for Disk Accesses
Cost of an access (Drive_Cost / Access_per_second)
Cost of a DRAM page ($/MB / pages_per_MB)
Break even has two terms: Technology term and an Economic term

$$\text{BreakEvenReferenceInterval} = \frac{\text{PagesPerMBofDRAM}}{\text{AccessPerSecondPerDisk}} \times \frac{\text{PricePerDiskDrive}}{\text{PricePerMBofDRAM}}$$

Grew page size to compensate for changing ratios.
Now at 5 minutes for random, 10 seconds sequential

Page 39: Rules of Thumb in Data Engineering

58

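The 5 Minute Rule Derived

T = TimeBetweenReferences to Page

Cost of a RAM page:

$$\frac{\text{RAM\_\$\_Per\_MB}}{\text{PagesPerMB}}$$

Disk access cost / T:

$$\frac{\text{DiskPrice} / \text{AccessesPerSecond}}{T}$$

Breakeven:

$$\frac{\text{RAM\_\$\_Per\_MB}}{\text{PagesPerMB}} = \frac{\text{DiskPrice}}{T \times \text{AccessesPerSecond}}$$

$$T = \frac{\text{DiskPrice} \times \text{PagesPerMB}}{\text{RAM\_\$\_Per\_MB} \times \text{AccessesPerSecond}}$$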

Page 40: Rules of Thumb in Data Engineering

59

Plugging in the Numbers

$$\text{BreakEvenReferenceInterval} = \frac{\text{PagesPerMBofDRAM}}{\text{AccessPerSecondPerDisk}} \times \frac{\text{PricePerDiskDrive}}{\text{PricePerMBofDRAM}}$$

              PPM/aps        disk$/Ram$      Break Even
Random        128/120 ~1     1000/3 ~300     5 minutes
Sequential    1/30 ~ .03     ~ 300           10 seconds

Trend is longer times because disk$ not changing much, RAM$ declining 100x/decade

5 Minutes & 10 second rule
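Plugging the table's values into the break-even formula takes only a few lines. The sketch below uses the slide's figures (128 pages/MB at 120 accesses/s for random IO, 1 page/MB at 30 accesses/s for sequential IO, a 1000$ disk, 3$/MB RAM); the function and argument names are illustrative.

```python
def break_even_seconds(pages_per_mb: float, accesses_per_second: float,
                       disk_price: float, ram_price_per_mb: float) -> float:
    """Reference interval below which keeping a page in RAM beats re-reading it:
    (pages per MB / accesses per second) x (disk price / RAM price per MB)."""
    return (pages_per_mb / accesses_per_second) * (disk_price / ram_price_per_mb)


if __name__ == "__main__":
    random_io = break_even_seconds(128, 120, 1000, 3)
    sequential_io = break_even_seconds(1, 30, 1000, 3)
    print(f"random: {random_io / 60:.1f} minutes")
    print(f"sequential: {sequential_io:.0f} seconds")
```

The unrounded results land close to the slide's rounded five minutes and ten seconds.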

Page 41: Rules of Thumb in Data Engineering

60

The 10 Instruction Rule

Spend 10 instructions/second to save 1 byte
Cost of instruction: I = ProcessorCost / (MIPS x LifeTime)
Cost of byte: B = RAM_$_Per_B / LifeTime
Breakeven: N x I = B
  N = B/I = (RAM_$_B x MIPS) / ProcessorCost
    ~ (3E-6 x 5E8)/500 = 3 ins/B for Intel
    ~ (3E-6 x 3E8)/10 = 10 ins/B for ARM

Page 42: Rules of Thumb in Data Engineering

62

When to Cache Web Pages.

Caching saves user time
Caching saves wire time
Caching costs storage
Caching only works sometimes:
  New pages are a miss
  Stale pages are a miss

Page 43: Rules of Thumb in Data Engineering

63

Web Page Caching Saves People Time

Assume people cost 20$/hour (or .2 $/hr ???)
Assume 20% hit in browser, 40% in proxy
Assume 3 second server time
Caching saves people time: 28$/year to 150$/year of people time, or .28$ to 1.5$/year.

connection  cache     R_remote (seconds)  R_local (seconds)  H (hit rate)  People Savings (¢/page)
LAN         proxy                      3                0.3           0.4                      0.6
LAN         browser                    3                0.1           0.2                      0.3
Modem       proxy                      5                  2           0.4                      0.7
Modem       browser                    5                0.1           0.2                      0.5
Mobile      proxy                     13                 10           0.4                      0.7
Mobile      browser                   13                0.1           0.2                      1.4
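The People Savings column is consistent with (R_remote - R_local) x H x (cost of a person-second) at 20$/hour. That formula is inferred from the numbers rather than stated on the slide, so treat this as a sketch; the function name is made up.

```python
def savings_cents(r_remote: float, r_local: float, hit_rate: float,
                  dollars_per_hour: float = 20.0) -> float:
    """Expected people-time saved per page view, in cents."""
    cents_per_second = dollars_per_hour * 100.0 / 3600.0
    return (r_remote - r_local) * hit_rate * cents_per_second


if __name__ == "__main__":
    rows = [("LAN proxy", 3, 0.3, 0.4), ("LAN browser", 3, 0.1, 0.2),
            ("Modem proxy", 5, 2, 0.4), ("Modem browser", 5, 0.1, 0.2),
            ("Mobile proxy", 13, 10, 0.4), ("Mobile browser", 13, 0.1, 0.2)]
    for name, remote, local, hit in rows:
        print(f"{name}: {savings_cents(remote, local, hit):.1f} cents/page")
```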

Page 44: Rules of Thumb in Data Engineering

64

Web Page Caching Saves Resources

Wire cost is penny (wireless) to 100µ$ LAN
Storage is 8 µ$/mo
Breakeven: wire cost = storage rent: 4 to 7 months
Add people cost: breakeven is ~ 4 years.
“cheap people” (.2$/hr): 6 to 8 months.

Columns:
  A = $/10 KB download (network)
  B = $/10 KB storage/mo
  Time = A/B: break-even cache storage time
  C = people cost of download ($)
  Time = (A+C)/B: break even

               A        B        Time = A/B    C      Time = (A+C)/B
Internet/LAN   1.E-04   8.E-06   18 months     0.02   15 years
Modem          2.E-04   8.E-06   36 months     0.03   21 years
Wireless       1.E-02   2.E-04   300 years     0.07   >999 years

Page 45: Rules of Thumb in Data Engineering

65

Caching

Disk caching
  5 minute rule for random IO
  10 second rule for sequential IO

Web page caching:
  If page will be re-referenced in
    18 months: with free users
    15 years: with valuable users
  then cache the page in the client/proxy.

Challenge:
  guessing which pages will be re-referenced
  detecting stale pages (page velocity)

Page 46: Rules of Thumb in Data Engineering

66

Meta-Message: Technology Ratios Matter

Price and Performance change.

If everything changes in the same way, then nothing really changes.
If some things get much cheaper/faster than others, then that is real change.
Some things are not changing much:
  Cost of people
  Speed of light
  …

And some things are changing a LOT

Page 47: Rules of Thumb in Data Engineering

67

Outline

Moore’s Law and consequences

Storage rules of thumb

Balanced systems rules revisited

Networking rules of thumb

Caching rules of thumb