PennySort Award Ceremony, Beijing, China, 23 October 2006

PennySort Award Ceremony

Beijing, China, 23 October 2006

Outline

• Penny Sort history and Award

• What I have been doing.

Benchmark History

[Figure: timeline of benchmarks, 1970 to 2010. Labels: Wisconsin (Bitton, Boral, DeWitt, Turbyfill); IBM TP 1-7 (CA and Tony Lukes); Debit Credit (Gray); Datamation (Anon et al); MCC (Boral &...); Teradata (Bollinger &...); TPC-A; TPC-B; TPC-C; TPC-D; TPC-H; TPC-W ?; Sort; PennySort; MinuteSort]

A Short History of Sort
• April Fools 1985: Datamation Sort
– Sort 1M 100-byte records
– An IO benchmark: 15 min to 1 hr!
• 1993: {Minute | Penny} x {Daytona | Indy}
• 1998: TeraByte Sort
• Web site: http://research.Microsoft.com/barc/SortBenchmark/

Ground Rules
• How much can you sort for a penny (or in a minute)?
– Hardware cost
– Depreciated over 3 years
– 1M$ system gets about 1 second,
– 1K$ system gets about 1,000 seconds.
– Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
– 100-byte records (random data)
– key is first 10 bytes.
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
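
The penny time budget follows from the 3-year depreciation rule. A minimal sketch of the arithmetic, assuming 365-day years; the name `penny_seconds` is ours and not part of the benchmark definition.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000, assuming 365-day years
DEPRECIATION_YEARS = 3


def penny_seconds(system_price_dollars: float) -> float:
    """Seconds of machine time that one penny buys on a system of this price."""
    lifetime_seconds = DEPRECIATION_YEARS * SECONDS_PER_YEAR   # 94,608,000
    price_in_pennies = system_price_dollars * 100
    return lifetime_seconds / price_in_pennies


print(round(penny_seconds(1_000_000), 2))  # 0.95 (about 1 second)
print(round(penny_seconds(1_000), 1))      # 946.1 (about 1,000 seconds)
```

The constant 946,080 is just 94,608,000 seconds divided by 100 pennies per dollar, so a cheaper system gets proportionally more time.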

1998 PennySort
• Hardware
– 266 MHz Intel PPro
– 64 MB SDRAM (10ns)
– Dual Fujitsu DMA 3.2GB EIDE disks
• Software
– NT workstation 4.3
– NT 5 sort
• Performance
– sort 15 M 100-byte records (~1.5 GB)
– Disk to disk
– elapsed time 820 sec
• cpu time = 404 sec

[Pie chart: PennySort Machine (1107$). Labels: cpu 32%; Disk 25%; Other 22%; board 13%; Network, Video, floppy 9%; Memory 8%; Cabinet + Assembly 7%; Software 6%]
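
The 1998 figures can be turned into throughput numbers and compared with the penny budget for a 1107$ machine. This is a rough sketch using only the numbers quoted above; the function and dictionary keys are illustrative names, and the budget assumes 365-day years.

```python
def summarize_run(records, record_bytes, elapsed_s, cpu_s, price_dollars):
    """Derive throughput, CPU utilization, and the penny time budget for one run."""
    budget_s = 3 * 365 * 24 * 3600 / (price_dollars * 100)  # seconds per penny
    return {
        "records_per_sec": records / elapsed_s,
        "mb_per_sec": records * record_bytes / elapsed_s / 1e6,
        "cpu_utilization": cpu_s / elapsed_s,
        "penny_budget_s": budget_s,
        "fits_in_budget": elapsed_s <= budget_s,
    }


run_1998 = summarize_run(
    records=15_000_000, record_bytes=100,
    elapsed_s=820, cpu_s=404, price_dollars=1107,
)
for name, value in run_1998.items():
    print(name, value)
```

With these inputs the run sorts roughly 18,000 records per second at under 2 MB/s, keeps the CPU busy about half the time, and finishes inside the roughly 855-second budget.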

2004 Daytona Terabyte Sort
• NEC Express/5800/1320Xd: 32x Itanium2 1.5 GHz, 128GB, 900 disk TPC-C machine
• Striped across 20 HBA
– Read and write at 3.5 GBps
– Sort 34GB in 60 seconds.
– Sort 1 TB in 33 minutes

[Figure: Input Phase of 1 TB nSort]

2006 Sort Records (Daytona and Indy categories)

Penny
• GpuTeraSort: 590 M records (55 GB) in 644 seconds. 1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP. Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, Jim Gray. U. North Carolina at Chapel Hill, USA.
• Bytes-Split-Index Sort (BSIS): 344 million records (32 GB) in 1,679 seconds. $760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP. Xing Huang and BinHeng Song, School of Software, Tsinghua U., Beijing, China; Bo Huang, Math&CS, Hunan U. of Technology, Zhuzhou, China.

Minute
• NeoSort: 40 GB (400 million records). Windows, Fujitsu 32 Itanium2, 128 SAN disks. Chris Nyberg, Charles Koester, Ordinal Technology.
• (2005) SCS: 116GB (125 M records) in 58.7 seconds. Linux, 80 Itanium2, 2,520 SAN disks. Jim Wyllie, IBM Almaden Research.

TeraByte
• (2004) Nsort: 33 minutes. Windows, 32 Itanium2, 2,350 SAN disks. Chris Nyberg, Charles Koester, Ordinal Technology.
• (2005) SCS: 435 seconds (7.25 minutes). Linux, 80 Itanium2, 2,520 SAN disks. Jim Wyllie, IBM Almaden Research.

Bytes Split Index Sort (BSIS)
Xing Huang & BinHeng Song, Tsinghua
Bo Huang, Hunan U. of Technology

• A radix-partition sort.
• Then merge the partitions.
• 344 million records (32 GB) in 1,679 seconds
• $760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP
• Phase 1: 66 MB/s, Phase 2: 28 MB/s
• See http://research.microsoft.com/barc/SortBenchmark/BSIS-PennySort_2006.pdf
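
The general idea of a radix-partition sort can be sketched in a few lines. This is not the BSIS implementation (see the linked report for that); it is an in-memory illustration that partitions 100-byte records on the first key byte, sorts each partition on the 10-byte key, and concatenates the partitions in order. All names here are hypothetical.

```python
import os

KEY_BYTES = 10       # key is the first 10 bytes of each record
RECORD_BYTES = 100   # benchmark record size


def make_records(n):
    """Generate n random 100-byte records (stand-in for the benchmark input)."""
    return [os.urandom(RECORD_BYTES) for _ in range(n)]


def radix_partition_sort(records):
    """Partition on the leading key byte, sort each partition, then concatenate."""
    buckets = [[] for _ in range(256)]
    for rec in records:
        buckets[rec[0]].append(rec)          # rec[0] is an int in 0..255
    output = []
    for bucket in buckets:                   # buckets are already in key order
        bucket.sort(key=lambda r: r[:KEY_BYTES])
        output.extend(bucket)
    return output


records = make_records(10_000)
result = radix_partition_sort(records)
print(len(result), result[0][:KEY_BYTES].hex())
```

Because the partitions cover disjoint key ranges, each one can be sorted independently (for example, one partition at a time in limited RAM) and the final combine step is a simple ordered pass over the partitions.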

Sort 100-byte records (minute / penny) Shows We Hit Memory Ceiling in 1995

http://research.microsoft.com/barc/SortBenchmark/

• Sort recs/s/cpu plateaued in 1995

[Figure: Records per Second per CPU, slow improvement after 1995. Y axis: records/sec/cpu, log scale 1.E+1 to 1.E+6. X axis: 1985 to 2005. Labels: Mini, Super, cache conscious]

Technology Trends: CPU and GPU

[Figure: Log of Relative Processing Power vs. year, 2002 to 2008, for CPU and GPU. Curves and annotations: Moore’s Law Trajectory; Corporate DT SW Requirements; Graphics Req’mts (enhanced experience); Cooling (Cost) Limitations; “Moore’s Law 3 for 18 mo”, then Moore’s Law trajectory. Segments: Value, Value / UMA, Mobile, Mainstream Desktop, DT ‘Replacement’, Leading Edge, Enthusiast / Specialty. Data labels: 2.2GHz, 4.4GHz, 31 GHz, 0.8 GHz, 1.6 GHz, 11.2, 4.2]

Moore’s Wall: Chip Heat Death
• Processor power density going to infinity.
• Solution: stabilize clock at ~5GHz, go multi-core (aka MTA) (1,000 core?)

GPU TeraSort
Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, U. North Carolina at Chapel Hill

• Use GPU for Phase 1 bitonic sort
• 590 M records (55GB) in 644 seconds
• 1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP
• Phase 1: 185 MB/s, Phase 2: 150 MB/s
• See http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2005-183
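
For readers unfamiliar with bitonic sort, here is a minimal CPU sketch of the sorting network in Python. It is only an illustration of the technique, not the GpuTeraSort code, and it assumes the input length is a power of two. Within each inner pass every compare-exchange touches a distinct pair of elements, which is what makes the method a natural fit for data-parallel hardware such as a GPU.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two using a bitonic network."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # compare-exchange distance within a merge
            for i in range(n):    # these iterations are independent of each other
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([5, 3, 8, 1, 9, 2, 7, 0]))  # [0, 1, 2, 3, 5, 7, 8, 9]
```

The network performs O(n log² n) comparisons, more than a comparison sort needs, but the fixed, data-independent access pattern is the property a GPU can exploit.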

Sort 100-byte records (minute / penny) Shows We Hit Memory Ceiling in 1995

http://research.microsoft.com/barc/SortBenchmark/

• Sort recs/s/cpu plateaued in 1995
• Had to get GPU to get better memory bandwidth
• SIGMOD 2006 GpuTeraSort

[Figure: Records per Second per CPU, slow improvement after 1995. Y axis: records/sec/cpu, log scale 1.E+1 to 1.E+6. X axis: 1985 to 2005. Labels: Mini, Super, cache conscious. Annotation: GPU better memory architecture, so finally more records/second]

2006 PennySort Price Breakdown

Component          BSIS ($760)   GpuTeraSort ($1470)
Motherboard        14%           16%
CPU                26%           12%
GPU                0%            18%
RAM                11%           10%
Disk controller    0%            6%
Disks              36%           33%
Case, power, fan   9%            3%
Assembly           4%            2%

Sort Performance/Price improved

• Based on parallelism and “commodity” not per-cpu performance.

[Figure: Sort Records/second vs Time, log scale 1E+2 to 1E+8, 1985 to 2005. Systems: M68000, Cray YMP, IBM 3090, Tandem, Hardware Sorter, Sequent, Intel Hyper, SGI, IBM RS6000, NOW, Alpha, NOW, PennySort, TeraByte Sort, Minute Sort]

[Figure: Speed (Sorted Records/Sec) and Performance/Price (GB Sorted/$), log scale 1E+0 to 1E+9, 1985 to 2005]

Sort 68%/y performance/price improvement
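
To get a feel for what 68% per year means, the compound-growth arithmetic can be sketched as follows. The 20-year horizon is our choice, matching the 1985 to 2005 span of the charts; the function names are illustrative.

```python
import math

ANNUAL_IMPROVEMENT = 0.68   # 68% per year, from the slide


def doubling_time_years(rate):
    """Years needed for performance/price to double at a constant annual rate."""
    return math.log(2) / math.log(1 + rate)


def cumulative_factor(rate, years):
    """Total improvement factor after compounding for the given number of years."""
    return (1 + rate) ** years


print(doubling_time_years(ANNUAL_IMPROVEMENT))      # roughly 1.3 years
print(cumulative_factor(ANNUAL_IMPROVEMENT, 20))    # on the order of 30,000x
```

At this rate performance/price doubles about every 16 months, which compounds to more than four orders of magnitude over two decades.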

Musings: PennySort=TBsort
• 2 pass so 3TB of disk
• = 8 disks if 400GB/disk
• = 0.5GBps (if each disk = 65 MBps)
• So, 6000 seconds (3TB / 0.5GBps)
• So, node can cost 200$
• Costs 10x that today
• maybe in 5 years?
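
The chain of estimates above can be reproduced with a short calculation. This sketch uses the slide's inputs (8 disks, 65 MB/s each, 3 TB moved) and the 3-year depreciation rule with 365-day years; the function name and decimal units (1 TB = 1e12 bytes) are our assumptions.

```python
def terabyte_pennysort(disks=8, disk_mb_per_s=65, terabytes_moved=3):
    """Return (aggregate bytes/s, elapsed seconds, max system price in dollars)."""
    bandwidth = disks * disk_mb_per_s * 1e6          # bytes per second
    seconds = terabytes_moved * 1e12 / bandwidth     # time to move all the data
    lifetime_seconds = 3 * 365 * 24 * 3600           # 3-year depreciation
    # One penny must buy `seconds` of machine time, so the whole system
    # may cost at most lifetime_seconds / seconds pennies.
    max_price_dollars = lifetime_seconds / seconds / 100
    return bandwidth, seconds, max_price_dollars


bandwidth, seconds, max_price = terabyte_pennysort()
print(bandwidth / 1e9, "GB/s")   # about 0.5 GB/s
print(seconds, "seconds")        # a little under 6,000 seconds
print(max_price, "dollars")      # in the same ballpark as the slide's 200$
```

Without the slide's rounding the budget comes out somewhat below 200$, which does not change the conclusion that the node would need to be roughly 10x cheaper than a 2006 machine.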

Musings: MinuteSort=TBsort
• Sorts 1TB in 1 minute
• 1 pass so 1TB of RAM
• 266Gbps bisection bandwidth
• 1 pass so 2TB of IO in 60 sec
=> 600 disks
=> ~80 nodes: 8 disks, 2GB ram
=> interconnect with 10Gbps Ethernet
• or 300 nodes at 1Gbps Ethernet.
• doable today
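
One way to read these numbers is as a bandwidth budget: 1 TB read plus 1 TB written in 60 seconds. The sketch below reproduces that arithmetic; treating the 266 Gbps figure as 2 TB of traffic per minute is our interpretation, and the function name and decimal units are assumptions.

```python
def minute_terasort(data_tb=1, seconds=60, disks=600, disks_per_node=8):
    """Bandwidth budget for sorting data_tb terabytes in one pass."""
    io_bytes = 2 * data_tb * 1e12             # read input once, write output once
    io_rate = io_bytes / seconds              # bytes per second, whole cluster
    return {
        "aggregate_gb_per_s": io_rate / 1e9,
        "per_disk_mb_per_s": io_rate / disks / 1e6,
        "aggregate_gbps": io_rate * 8 / 1e9,
        "nodes": disks / disks_per_node,
    }


budget = minute_terasort()
for name, value in budget.items():
    print(name, value)
```

The result is about 33 GB/s in aggregate, roughly 56 MB/s per disk across 600 disks, about 267 Gbps of total traffic, and 75 nodes at 8 disks each, which lines up with the "~80 nodes" estimate.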

What I Have Been Doing
• Traveling & Talking
• Helping Build the SkyServer and the Virtual Observatory
• Doing spatial geometry in SQL (no kidding)!
• Trying to get all science literature and data online and interlinked.
• and…
– to blob or not to blob
– disk reliability

To Blob or Not To Blob
• For objects X smaller than 1MB,
    Select X into x from T where key = 123
  is faster than
    h = open(X); read(h,x,n); close(h)
• So, blob beats file for objects < 1MB (on SQL Server – what about other DBs?)
• Because DB is CISC and FS is RISC
• Most things are less than 1MB
• DB should work to make this 10MB
• File system should borrow ideas from DB.

“To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?” Rusty Sears, Catharine Van Ingen, Jim Gray, MSR-TR-2006-45, April 2006
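
The two access paths being compared look roughly like this. The sketch uses Python's built-in `sqlite3` module purely for illustration; the measurements in the report were made on SQL Server, so nothing here says which path is faster on any given system. The table layout, column names, and 64 KB object size are assumptions.

```python
import os
import sqlite3
import tempfile


def read_from_db(conn, key):
    """Blob path: one SELECT returns the whole object."""
    row = conn.execute("SELECT x FROM T WHERE obj_key = ?", (key,)).fetchone()
    return row[0]


def read_from_file(path):
    """File path: open, read, close."""
    with open(path, "rb") as h:
        return h.read()


def demo(size=64 * 1024):
    """Store the same object both ways and check both reads return it intact."""
    payload = os.urandom(size)
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "object.bin")
        with open(path, "wb") as f:
            f.write(payload)
        conn = sqlite3.connect(os.path.join(d, "objects.db"))
        conn.execute("CREATE TABLE T (obj_key INTEGER PRIMARY KEY, x BLOB)")
        conn.execute("INSERT INTO T VALUES (?, ?)", (123, payload))
        conn.commit()
        from_db = read_from_db(conn, 123)
        from_file = read_from_file(path)
        conn.close()
    return from_db == payload and from_file == payload


print(demo())
```

To turn this into a measurement, each read function could be wrapped in `timeit` and run across a range of object sizes; caching effects would need to be controlled before drawing any conclusion.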

How Often do Disks Fail?

Observed failure rates.

System             Source    Type          Part Years   Fails   Fails/Year
TerraServer SAN    Barclay   SCSI 10krpm   858          24      2.8%
                             controllers   72           2       2.8%
                             san switch    9            1       11.1%
TerraServer Brick  Barclay   SATA 7krpm    138          10      7.2%
Web Property 1     anon      SCSI 10krpm   15,805       972     6.0%
                             controllers   900          139     15.4%
Web Property 2     anon      PATA 7krpm    22,400       740     3.3%
                             motherboard   3,769        66      1.7%
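
The Fails/Year column is simply failures divided by part-years of exposure. A minimal sketch that recomputes it from the figures above; the tuple layout and function name are ours.

```python
# (system, part type, part-years of exposure, observed failures), from the table
OBSERVATIONS = [
    ("TerraServer SAN", "SCSI 10krpm", 858, 24),
    ("TerraServer SAN", "controllers", 72, 2),
    ("TerraServer SAN", "san switch", 9, 1),
    ("TerraServer Brick", "SATA 7krpm", 138, 10),
    ("Web Property 1", "SCSI 10krpm", 15_805, 972),
    ("Web Property 1", "controllers", 900, 139),
    ("Web Property 2", "PATA 7krpm", 22_400, 740),
    ("Web Property 2", "motherboard", 3_769, 66),
]


def annual_failure_rate(part_years, fails):
    """Observed failures per part per year of operation."""
    return fails / part_years


for system, part, part_years, fails in OBSERVATIONS:
    rate = annual_failure_rate(part_years, fails)
    print(f"{system:18} {part:12} {100 * rate:.1f}%")
```

The recomputed rates agree with the table to within a tenth of a percentage point; small differences come from rounding. Rows with few part-years (the SAN switch, for instance) rest on one or two failures and carry wide uncertainty.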

What About Bit Error Rates?
• Uncorrectable Errors on Read (UERs)
– Quoted uncorrectable bit error rates 10^-13 to 10^-15
– That’s 1 error in 1TB to 1 error in 100TB
– WOW!!!
• We moved 1.5 PB looking for errors
• Saw 5 UER events
– 3 real, 3 of them were masked by retry
• Many controller fails and system security reboots
• Conclusion:
– UER not a useful metric – want mean time to data loss
– UER better than advertised.

Empirical Measurements of Disk Failure Rates and Error Rates, Jim Gray, Catharine van Ingen, Microsoft Technical Report MSR-TR-2005-166
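
The conversion from a quoted bit error rate to "bytes per error", and the number of errors one would expect after moving 1.5 PB, can be sketched as follows. Decimal units (1 TB = 1e12 bytes) and the function names are our assumptions.

```python
def bytes_per_error(bit_error_rate):
    """Average number of bytes read per uncorrectable error at the quoted rate."""
    return 1 / bit_error_rate / 8


def expected_errors(bytes_moved, bit_error_rate):
    """Expected uncorrectable errors when reading bytes_moved bytes."""
    return bytes_moved * 8 * bit_error_rate


PB = 1e15
for rate in (1e-13, 1e-15):
    print(rate, bytes_per_error(rate) / 1e12, "TB per error,",
          expected_errors(1.5 * PB, rate), "errors expected in 1.5 PB")
```

The quoted rates work out to one error per roughly 1.25 TB to 125 TB, so moving 1.5 PB would be expected to produce somewhere between about a dozen and about 1,200 errors. Seeing only a handful of events is the basis for "better than advertised".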

So, You Want to Copy a Petabyte?
• Today, that’s 4,000 disks (read 2k write 2k)

• Takes ~4 hours if they run in parallel, but…

• Probably not one file.

• You will see a few UERs.

• What’s the best strategy?

• How fast can you move a Petabyte from CERN to Pasadena? Is sneaker-net fastest and cheapest?
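
A back-of-envelope sketch of the numbers behind these questions. The first function derives the per-disk transfer rate implied by "2,000 source disks, ~4 hours"; the second estimates how long a petabyte takes over a network link. The link speeds tried below are hypothetical, not figures from the talk, and units are decimal.

```python
PB = 1e15


def implied_disk_rate_mb_s(total_bytes=PB, source_disks=2000, hours=4):
    """Per-disk sequential rate needed to read total_bytes in the given time."""
    bytes_per_disk = total_bytes / source_disks      # 500 GB per disk for 1 PB
    return bytes_per_disk / (hours * 3600) / 1e6


def network_days(total_bytes, link_gbps):
    """Days to push total_bytes through a fully utilized link of link_gbps."""
    seconds = total_bytes * 8 / (link_gbps * 1e9)
    return seconds / 86_400


print(implied_disk_rate_mb_s())          # about 35 MB/s per disk
for gbps in (1, 10):                     # hypothetical WAN link speeds
    print(gbps, "Gbps:", network_days(PB, gbps), "days")
```

At 1 Gbps the transfer takes about three months and at 10 Gbps a bit over nine days, assuming the link is saturated the whole time. That is the gap the sneaker-net question is pointing at.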

UER things I wish I knew

• Better statistics from larger farms, and more diversity.

• What is the UER on a LAN, WAN?
• What is the UER over time:
– for a file on disk
– for a disk
• What’s the best replication strategy?
– Symmetric (1+1)+(1+1) or triplex (1+1) + 1