1
PennySort Award Ceremony
Beijing, China, 23 October 2006
2
Outline
• Penny Sort history and Award
• What I have been doing.
3
Benchmark History
[Figure: timeline of benchmarks, 1970 to 2010]
• Wisconsin (Bitton, Boral, DeWitt, Turbyfill)
• IBM TP 1-7 (CA and Tony Lukes)
• Debit Credit (Gray)
• Datamation (Anon et al)
• MCC (Boral &...)
• Teradata (Bollinger &...)
• TPC-A, TPC-B, TPC-C, TPC-D, TPC-H, TPC-W ?
• Sort, PennySort, MinuteSort
4
A Short History of Sort
• April Fools 1985: Datamation Sort
– Sort 1M 100 B records
– An IO benchmark: 15-min to 1 hr!
• 1993: {Minute | Penny} x {Daytona | Indy}
• 1998: TeraByte Sort
• Web site: http://research.Microsoft.com/barc/SortBenchmark/
5
Ground Rules
• How much can you sort for a penny (or in a minute).
– Hardware cost
– Depreciated over 3 years
– 1M$ system gets about 1 second,
– 1K$ system gets about 1,000 seconds.
– Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
– 100-byte records (random data)
– key is first 10 bytes.
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
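The penny budget in the ground rules follows from spreading the hardware price over three years of seconds. A minimal sketch of that arithmetic is below; the function name is illustrative, and the $1,469 price is the GpuTeraSort system cited elsewhere in this deck.

```python
# Three years of depreciation, expressed in seconds (94,608,000).
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 3600


def penny_time_budget(system_price_dollars: float) -> float:
    """Seconds of machine time that one penny buys on a system of this price."""
    price_in_pennies = system_price_dollars * 100
    return SECONDS_IN_3_YEARS / price_in_pennies


if __name__ == "__main__":
    # 1M$ and 1K$ are the slide's examples; 1,469$ is the GpuTeraSort machine.
    for price in (1_000_000, 1_000, 1_469):
        print(f"${price:>9,}: {penny_time_budget(price):8.1f} s")
```

A cheaper machine gets proportionally more time, which is why the budget is the constant 946,080 divided by the price in dollars.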
6
1998 PennySort
• Hardware
– 266 MHz Intel PPro
– 64 MB SDRAM (10ns)
– Dual Fujitsu DMA 3.2GB EIDE disks
• Software
– NT workstation 4.3
– NT 5 sort
• Performance
– sort 15 M 100-byte records (~1.5 GB)
– Disk to disk
– elapsed time 820 sec
• cpu time = 404 sec
[Figure: pie chart, PennySort Machine ($1,107) price breakdown]
• cpu 32%
• Disk 25%
• board 13%
• Memory 8%
• Other 22%
– Cabinet + Assembly 7%
– Network, Video, floppy 9%
– Software 6%
7
2004 Daytona Terabyte Sort
• NEC Express/5800/1320Xd: 32x Itanium2 1.5 GHz, 128GB, 900 disk TPC-C machine
• Striped across 20 HBA
– Read and write at 3.5 GBps
– Sort 34GB in 60 seconds.
– Sort 1 TB in 33 minutes
[Figure: Input Phase of 1 TB nSort]
8
1999 Sort Records
2006 Sort Records (Daytona and Indy)
• Penny, Daytona: 344 million records (32 GB) in 1,679 seconds
– Bytes-Split-Index Sort (BSIS), $760 system
– 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP
– Xing Huang and BinHeng Song, School of Software, Tsinghua U., Beijing, China
– Bo Huang, Math&CS, Hunan U. of Technology, Zhuzhou, China
• Penny, Indy: 590 M records (55GB) in 644 seconds
– GpuTeraSort, 1,469$ system
– 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP
– Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, Jim Gray; U. North Carolina at Chapel Hill, USA
• Minute, Daytona: 40 GB (400 million records)
– NeoSort
– Windows, Fujitsu 32 Itanium2, 128 SAN disks
– Chris Nyberg, Charles Koester, Ordinal Technology
• Minute, Indy (2005): 116GB (125 M records) in 58.7 seconds
– SCS
– Linux, 80 Itanium2, 2,520 SAN disks
– Jim Wyllie, IBM Almaden Research
• TeraByte, Daytona (2004): 33 minutes
– Nsort
– Windows, 32 Itanium2, 2,350 SAN disks
– Chris Nyberg, Charles Koester, Ordinal Technology
• TeraByte, Indy (2005): 435 seconds (7.25 minutes)
– SCS
– Linux, 80 Itanium2, 2,520 SAN disks
– Jim Wyllie, IBM Almaden Research
9
Bytes Split Index Sort (BSIS)
Xing Huang & BinHeng Song, Tsinghua
Bo Huang, Hunan U. of Technology
• A radix-partition sort.
• Then merge the partitions.
• 344 million records (32 GB) in 1,679 seconds
– $760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP
• Phase 1: 66 MB/s, Phase 2: 28 MB/s
• See http://research.microsoft.com/barc/SortBenchmark/BSIS-PennySort_2006.pdf
10
Sort 100 byte records (minute / penny) Shows We Hit Memory Ceiling in 1995
http://research.microsoft.com/barc/SortBenchmark/
• Sort recs/s/cpu plateaued in 1995
[Figure: Records per Second per CPU, slow improvement after 1995. Log-scale y-axis: records/sec/cpu, 1.E+1 to 1.E+6; x-axis: 1985 to 2005; labeled points: Mini, Super, cache conscious]
11
Technology Trends: CPU and GPU
[Figure: log of relative processing power, 2002 to 2008, for CPU and GPU. CPU curve: Moore’s Law trajectory, with labels 0.8 GHz, 1.6 GHz, 2.2GHz, 4.4GHz and segments Value, Mobile, Mainstream Desktop, DT ‘Replacement’, Enthusiast / Specialty, Leading Edge, plotted against Corporate DT SW Requirements. GPU curve: Moore’s Law 3 for 18 mo, then Moore’s Law trajectory, bounded by Cooling (Cost) Limitations, with segments Value / UMA and Leading Edge, plotted against Graphics Req’mts (enhanced experience). Other labels: 31 GHz, 11.2, 4.2]
12
Moore’s Wall: Chip Heat Death
• Processor power density going to infinity.
• Solution: stabilize clock at ~5GHz
– Multi-core (aka MTA) (1,000 core?)
13
GPU TeraSort
Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, U. North Carolina at Chapel Hill
• Use GPU for Phase 1 bitonic sort
• 590 M records (55GB) in 644 seconds
– 1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP
• Phase 1: 185 MB/s, Phase 2: 150 MB/s
• See http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2005-183
14
Sort 100 byte records (minute / penny) Shows We Hit Memory Ceiling in 1995
http://research.microsoft.com/barc/SortBenchmark/
• Sort recs/s/cpu plateaued in 1995
• Had to get GPU to get better Memory bandwidth
• SIGMOD 2006 GpuTeraSort
[Figure: Records per Second per CPU, slow improvement after 1995. Log-scale y-axis: records/sec/cpu, 1.E+1 to 1.E+6; x-axis: 1985 to 2005; labeled points: Mini, Super, cache conscious. Annotation: GPU better memory architecture, so finally more records/second]
15
2006 PennySort Price Breakdown
Component           BSIS ($760)   GpuTeraSort ($1470)
Motherboard         14%           16%
CPU                 26%           12%
GPU                 0%            18%
RAM                 11%           10%
Disk controller     0%            6%
Disks               36%           33%
Case, power, fan    9%            3%
Assembly            4%            2%
16
Sort Performance/Price improved
• Based on parallelism and “commodity” not per-cpu performance.
• Sort 68%/y performance/price improvement
[Figure: Sort Records/second vs Time, 1985 to 2005, log scale 1E+2 to 1E+8. Labeled systems: M68000, Cray YMP, IBM 3090, Tandem, Hardware Sorter, Sequent, Intel Hyper, SGI, IBM RS6000, NOW, Alpha. Series: PennySort, TeraByte Sort, Minute Sort]
[Figure: Speed (Sorted Records/Sec) and Performance/Price (GB Sorted/$), 1985 to 2005, log scale 1E+0 to 1E+9]
17
Musings: PennySort=TBsort
• 2 pass so 3TB of disk
• = 8 disks if 400GB/disk
• = 0.5GBps (if each disk = 65 MBps)
• So, 6000 seconds (3TB/0.5GBps)
• So, node can cost 200$
• Costs 10x that today
• maybe in 5 years?
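The chain of estimates on this slide can be replayed step by step. The sketch below is a rough check under stated assumptions: the per-disk rate is read as about 65 megabytes per second (the reading that yields roughly 0.5 GBps from 8 disks), the penny budget is 946,080 seconds divided by the system price in dollars, and the function and parameter names are illustrative.

```python
import math

# Penny budget from the ground rules: seconds = 946,080 / price in dollars.
PENNY_SECONDS_TIMES_DOLLARS = 946_080


def pennysort_terabyte(data_tb=1.0, disk_gb=400, disk_mb_per_s=65):
    """Replay the slide's estimate for sorting 1 TB within a penny."""
    disk_space_bytes = 3 * data_tb * 1e12           # 2-pass sort: 3 TB of disk
    disks = math.ceil(disk_space_bytes / (disk_gb * 1e9))
    aggregate_bytes_per_s = disks * disk_mb_per_s * 1e6
    seconds = disk_space_bytes / aggregate_bytes_per_s  # slide: 3 TB / 0.5 GBps
    max_price = PENNY_SECONDS_TIMES_DOLLARS / seconds    # price that buys this time
    return disks, aggregate_bytes_per_s, seconds, max_price


if __name__ == "__main__":
    disks, rate, seconds, price = pennysort_terabyte()
    print(f"{disks} disks, {rate / 1e9:.2f} GB/s, {seconds:.0f} s, about ${price:.0f}")
```

Without the slide's rounding the result is a little under 6,000 seconds and a node price somewhat below the slide's 200$, which is the same order of magnitude.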
18
Musings: MinuteSort=TBsort
• Sorts 1TB in 1 Minute
• 1 pass so 1TB of ram
• 266Gbps bisection bandwidth
• 1 pass so 2TB of IO in 60 sec
=> 600 disks
=> ~80 nodes: 8 disks 2GB ram
=> interconnect with 10Gbps Ethernet
• or 300 nodes at 1Gbps Ethernet.
• doable today
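The numbers on this slide can be cross-checked with a few divisions. The sketch below is one reading, not the author's derivation: it assumes the 266 Gbps figure corresponds to moving 2 TB in 60 seconds, uses decimal units, and the function name is illustrative.

```python
import math


def minutesort_terabyte(io_bytes=2e12, seconds=60, disks=600, disks_per_node=8):
    """Cross-check the slide's disk, node and bandwidth estimates."""
    aggregate_bytes_per_s = io_bytes / seconds             # total IO rate needed
    per_disk_mb_per_s = aggregate_bytes_per_s / disks / 1e6
    aggregate_gbps = aggregate_bytes_per_s * 8 / 1e9       # same rate in Gbit/s
    nodes_for_disks = math.ceil(disks / disks_per_node)
    return per_disk_mb_per_s, aggregate_gbps, nodes_for_disks


if __name__ == "__main__":
    per_disk, gbps, nodes = minutesort_terabyte()
    print(f"{per_disk:.1f} MB/s per disk, {gbps:.0f} Gbps aggregate, {nodes} nodes")
```

Under these assumptions each of the 600 disks must sustain roughly 56 MB/s, and 600 disks at 8 per node gives 75 nodes, in line with the slide's ~80. An aggregate of about 267 Gbps is also consistent with needing on the order of 300 nodes if each has only a 1 Gbps link.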
19
What I Have Been Doing
• Traveling & Talking
• Helping Build the SkyServer and the Virtual Observatory
• Doing spatial geometry in SQL (no kidding)!
• Trying to get all science literature and data online and interlinked.
• and…
– to blob or not to blob
– disk reliability
20
To Blob or Not To Blob
• For objects X smaller than 1MB
Select X into x from T where key = 123
faster than
h = open(X); read(h,x,n); close(h)
• So, blob beats file for objects < 1MB (on SQL Server – what about other DBs?)
• Because DB is CISC and FS is RISC
• Most things are less than 1MB
• DB should work to make this 10MB
• File system should borrow ideas from DB.
“To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?” Rusty Sears, Catharine Van Ingen, Jim Gray, MSR-TR-2006-45, April 2006
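The two access paths being compared can be put side by side in a few lines. The sketch below is only an illustration and makes several assumptions: it uses SQLite from the Python standard library rather than SQL Server, the table and column names mirror the slide's query, and the timings it returns are dominated by caching on a small test, so they say nothing about the study's results.

```python
import os
import sqlite3
import tempfile
import time


def time_blob_vs_file(size_bytes=64 * 1024, reads=200):
    """Read the same object repeatedly as a database blob and as a file."""
    payload = os.urandom(size_bytes)
    with tempfile.TemporaryDirectory() as tmp:
        # File path: one object stored as one file.
        file_path = os.path.join(tmp, "object.bin")
        with open(file_path, "wb") as f:
            f.write(payload)

        # Blob path: the same object stored as one row in table T.
        db = sqlite3.connect(os.path.join(tmp, "objects.db"))
        db.execute("CREATE TABLE T (key INTEGER PRIMARY KEY, X BLOB)")
        db.execute("INSERT INTO T (key, X) VALUES (?, ?)", (123, payload))
        db.commit()

        start = time.perf_counter()
        for _ in range(reads):
            (blob,) = db.execute("SELECT X FROM T WHERE key = 123").fetchone()
        blob_secs = time.perf_counter() - start

        start = time.perf_counter()
        for _ in range(reads):
            with open(file_path, "rb") as f:  # open, read, close
                data = f.read()
        file_secs = time.perf_counter() - start

        db.close()
    return blob, data, blob_secs, file_secs


if __name__ == "__main__":
    blob, data, blob_secs, file_secs = time_blob_vs_file()
    print(f"blob: {blob_secs:.4f} s, file: {file_secs:.4f} s, equal: {blob == data}")
```

Varying `size_bytes` across the 1MB boundary is the natural experiment to try, with the caveat that a meaningful comparison needs cold caches and a realistic number of objects.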
21
How Often do Disks Fail?
Observed failure rates.
System              Source    Type          Part Years   Fails   Fails/Year
TerraServer SAN     Barclay   SCSI 10krpm   858          24      2.8%
                              controllers   72           2       2.8%
                              san switch    9            1       11.1%
TerraServer Brick   Barclay   SATA 7krpm    138          10      7.2%
Web Property 1      anon      SCSI 10krpm   15,805       972     6.0%
                              controllers   900          139     15.4%
Web Property 2      anon      PATA 7krpm    22,400       740     3.3%
                              motherboard   3,769        66      1.7%
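The last column appears to be failures divided by part-years of observation. A minimal sketch that recomputes it from the other two columns is below; the tuple layout and function name are illustrative, and the figures are copied from the slide.

```python
# (system, part type, part-years observed, failures), as listed on the slide.
OBSERVED = [
    ("TerraServer SAN", "SCSI 10krpm", 858, 24),
    ("TerraServer SAN", "controllers", 72, 2),
    ("TerraServer SAN", "san switch", 9, 1),
    ("TerraServer Brick", "SATA 7krpm", 138, 10),
    ("Web Property 1", "SCSI 10krpm", 15_805, 972),
    ("Web Property 1", "controllers", 900, 139),
    ("Web Property 2", "PATA 7krpm", 22_400, 740),
    ("Web Property 2", "motherboard", 3_769, 66),
]


def annual_failure_rate(fails: int, part_years: float) -> float:
    """Failures per part per year of observation."""
    return fails / part_years


if __name__ == "__main__":
    for system, part, part_years, fails in OBSERVED:
        rate = annual_failure_rate(fails, part_years)
        print(f"{system:18} {part:12} {100 * rate:5.1f}%")
```

Most rows reproduce the printed percentages. Small differences, such as about 6.1% against the printed 6.0% for the Web Property 1 disks, may come from rounding or from source data not shown on the slide.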
22
What About Bit Error Rates
• Uncorrectable Errors on Read (UERs)
– Quoted uncorrectable bit error rates 10^-13 to 10^-15
– That’s 1 error in 1TB to 1 error in 100TB
– WOW!!!
• We moved 1.5 PB looking for errors
• Saw 5 UER events
– 3 real, 3 of them were masked by retry
• Many controller fails and system security reboots
• Conclusion:
– UER not a useful metric – want mean time to data loss
– UER better than advertised.
“Empirical Measurements of Disk Failure Rates and Error Rates” Jim Gray, Catharine van Ingen, Microsoft Technical Report MSR-TR-2005-166
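The "better than advertised" conclusion can be checked against the quoted rates. The sketch below assumes decimal units (1 PB = 10^15 bytes) and treats the quoted figures as errors per bit read; the function names are illustrative.

```python
def expected_uers(bytes_moved: float, bit_error_rate: float) -> float:
    """Expected uncorrectable read errors for a given volume and per-bit rate."""
    return bytes_moved * 8 * bit_error_rate


def bytes_per_error(bit_error_rate: float) -> float:
    """Average number of bytes read per uncorrectable error."""
    return 1 / bit_error_rate / 8


if __name__ == "__main__":
    moved = 1.5e15  # 1.5 PB, as on the slide
    for rate in (1e-13, 1e-15):
        print(f"rate {rate:g}: one error per {bytes_per_error(rate) / 1e12:.2f} TB, "
              f"about {expected_uers(moved, rate):.0f} expected in 1.5 PB")
```

At the quoted rates, 1.5 PB would be expected to produce somewhere between roughly 12 and 1,200 events, so the 5 events observed fall below even the optimistic end of that range.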
23
So, You Want to Copy a Petabyte?
• Today, that’s 4,000 disks (read 2k write 2k)
• Takes ~4 hours if they run in parallel, but…
• Probably not one file.
• You will see a few UERs.
• What’s the best strategy?
• How fast can you move a Petabyte from CERN to Pasadena? Is sneaker-net fastest and cheapest?
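The disk count and the ~4 hour figure together imply a per-disk capacity and transfer rate. The sketch below derives both from the slide's numbers only, assuming decimal units and an even split of the data across the 2,000 source disks; the function name is illustrative.

```python
def petabyte_copy(total_bytes=1e15, source_disks=2000, hours=4):
    """Per-disk share and the sustained rate implied by the slide's estimate."""
    bytes_per_disk = total_bytes / source_disks
    implied_mb_per_s = bytes_per_disk / (hours * 3600) / 1e6
    return bytes_per_disk, implied_mb_per_s


if __name__ == "__main__":
    per_disk, rate = petabyte_copy()
    print(f"{per_disk / 1e9:.0f} GB per disk, about {rate:.1f} MB/s sustained per disk")
```

That works out to 500 GB per disk read at roughly 35 MB/s, with all disks running in parallel for the full four hours.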
24
UER things I wish I knew
• Better statistics from larger farms, and more diversity.
• What is the UER on a LAN, WAN?
• What is the UER over time:
– for a file on disk
– for a disk
• What’s the best replication strategy?
– Symmetric (1+1)+(1+1) or triplex (1+1) + 1