TRANSCRIPT
Supercomputing on Windows Clusters: Experience and Future Directions
Andrew A. Chien
CTO, Entropia, Inc.
SAIC Chair Professor, Computer Science and Engineering, UCSD
National Computational Science Alliance
Invited Talk, USENIX Windows, August 4, 2000
Entropia, Inc -- University of California, San Diego (UCSD/CSE) -- NCSA
Overview
External Technology Factors
Critical Enabling Technologies
The Alliance’s Windows Supercluster
– Design and Performance
Other Windows Cluster Efforts
Future
– Terascale Clusters
– Entropia
Microprocessor Performance
[Chart: clock period (ns, log scale 1-100) vs. year introduced, 1975-1995. Vector supercomputers: Cray 1S (12.5), Cray X-MP (8.5), Cray Y-MP (6), Cray C90 (4.2). Microprocessors: MIPS R2000 (125), MIPS R3000 (40), HP 7000 (15), R4000 (10), R4400 (6.7), DEC Alpha (5), x86/Alpha (1).]
Micros: 10MF -> 100 MF -> 1GF -> 3GF -> 6GF (2001?) => Memory system performance catching up (2.6 GB/s 21264 memory BW)
Adapted from Baskett, SGI and CSC Vanguard
Killer Networks
LAN: 10 Mb/s -> 100 Mb/s -> ?
SAN: 12 MB/s -> 110 MB/s (Gbps) -> 1100 MB/s -> ?
– Myricom, Compaq, Giganet, Intel, ...
Network bandwidths now limited by system internal memory bandwidths
Cheap and very fast communication hardware
[Chart: delivered bandwidth by interconnect: Ethernet 1 MB/s, FastE 12 MB/s, UW SCSI 40 MB/s, GigSAN/GigE 110 MB/s.]
Rich Desktop Operating System Environments
Desktop (PC) operating systems now provide
– richest OS functionality
– best program development tools
– broadest peripheral/driver support
– broadest application software/ISV support
[Timeline, 1981-1999: basic device access; graphical interfaces, audio/graphics; HD storage, networks; multiprocess protection, SMP support; clustering, performance, mass store, HP networking, management, availability, etc.]
Critical Enabling Technologies
Critical Enabling Technologies
Cluster management and resource integration (“use like” one system)
Delivered communication performance
– IP protocols inappropriate
Balanced systems
– Memory bandwidth
– I/O capability
The HPVM System
Goals
– Enable tightly coupled and distributed clusters with high efficiency and low effort (integrated solution)
– Provide usable access through convenient standard parallel interfaces
– Deliver the highest possible performance and a simple programming model
Delivered Communication Performance
Early 1990s, Gigabit testbeds
– 500 Mbits (~60 MB/s) @ 1 MegaByte packets
– IP protocols not suited to Gigabit SANs
Cluster objective: high-performance communication for small and large messages
Performance balance shift: networks faster than I/O, memory, processor
Fast Messages Design Elements
User-level network access
Lightweight protocols
– flow control, reliable delivery
– tightly coupled link, buffer, and I/O bus management
Poll-based notification
Streaming API for efficient composition
Many generations, 1994-1999
– [IEEE Concurrency, 6/97]
– [Supercomputing ’95, 12/95]
Related efforts: UCB AM, Cornell U-Net, RWCP PM, Princeton VMMC/Shrimp, Lyon BIP => VIA standard
(A minimal code sketch of the poll-based style follows below.)
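To make the poll-based notification concrete, here is a minimal runnable sketch in C of a Fast-Messages-style interface. The fm_* names and signatures are illustrative, not the actual FM API, and a single in-process ring buffer stands in for the NIC receive queue:

```c
/* Minimal single-process sketch of a Fast-Messages-style interface.
 * The fm_* names/signatures are illustrative, not the real FM API.
 * Real FM maps the NIC into user space; here a ring buffer stands in
 * for the receive queue so the control flow is runnable end to end. */
#include <stdio.h>
#include <string.h>

#define QLEN 64
typedef void (*fm_handler_t)(const void *buf, int len, int src);

static struct { int id, src, len; char data[256]; } rq[QLEN];
static int head = 0, tail = 0;
static fm_handler_t handlers[16];

void fm_register(int id, fm_handler_t h) { handlers[id] = h; }

/* Send: real FM writes directly to the NIC from user level (no system
 * call); lightweight protocols do flow control and reliable delivery. */
void fm_send(int dest, int id, const void *buf, int len) {
    (void)dest;                         /* single-process stand-in */
    rq[tail].id = id; rq[tail].src = 0; rq[tail].len = len;
    memcpy(rq[tail].data, buf, len);
    tail = (tail + 1) % QLEN;
}

/* Poll-based notification: the receiver drains pending messages and
 * runs handlers; no interrupts, no kernel crossing on the fast path. */
int fm_poll(void) {
    int n = 0;
    for (; head != tail; head = (head + 1) % QLEN, n++)
        handlers[rq[head].id](rq[head].data, rq[head].len, rq[head].src);
    return n;
}

static void on_msg(const void *buf, int len, int src) {
    printf("handler: %.*s (from %d)\n", len, (const char *)buf, src);
}

int main(void) {
    fm_register(7, on_msg);
    fm_send(0, 7, "hello", 5);
    fm_poll();                          /* handler fires here */
    return 0;
}
```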
Improved Bandwidth
20 MB/s -> 200+ MB/s (10x)
– Much of the advance is software structure: APIs and implementation
– Deliver *all* of the underlying hardware performance
[Chart: delivered bandwidth (megabytes/sec, 0-250) by year, 1995-1999, rising from roughly 20 MB/s to 200+ MB/s.]
Improved Latency
100µs -> 2µs overhead (50x)
– Careful design to minimize overhead while maintaining throughput
– Efficient event handling, fine-grained resource management, and inter-layer coordination
– Deliver *all* of the underlying hardware performance
[Chart: one-way latency (microseconds, 0-25) by year, 1995-1999, falling toward single digits.]
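For reference, one-way latency figures like those plotted above are conventionally measured with a ping-pong microbenchmark: bounce a small message back and forth many times and halve the average round trip. A minimal sketch in plain MPI (illustrative; not the HPVM test harness):

```c
/* Standard ping-pong microbenchmark for one-way latency: send a small
 * message back and forth many times and halve the average round-trip
 * time. Plain MPI; nothing HPVM-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, iters = 10000;
    char buf[8] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * iters) * 1e6);
    MPI_Finalize();
    return 0;
}
```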
HPVM = Cluster Supercomputers
Turnkey cluster computing; standard APIs
Network hardware and APIs increase leverage for users, achieve critical mass for the system
Each involved new research challenges and provided deeper insights into the research issues
– Drove continually better solutions (e.g., multi-transport integration, robust flow control and queue management)
[HPVM software stack: standard interfaces (MPI, Put/Get, Global Arrays, BSP) layered over Fast Messages; Fast Messages layered over the transports (Myrinet, ServerNet, Giganet VIA, SMP, WAN); alongside scheduling & management (LSF) and performance tools.]
Releases: HPVM 1.0 (8/1997); HPVM 1.2 (2/1999) – multi, dynamic, install; HPVM 1.9 (8/1999) – Giganet, SMP
(A minimal example of the standard-API point follows.)
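The practical point of the stack: applications write to standard interfaces and the fast path is supplied underneath. A minimal sketch of what such an application looks like; it is ordinary MPI, nothing HPVM-specific:

```c
/* Ordinary MPI; the same source runs over Myrinet, Giganet VIA, SMP,
 * or WAN transports because MPI is layered on Fast Messages, which
 * hides the transport. Nothing here is HPVM-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs, one = 1, total = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* A trivial collective; the MPI-on-FM layer maps it onto whatever
       transport mix (network + shared memory) the nodes provide. */
    MPI_Allreduce(&one, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d of %d processes reporting\n", total, nprocs);
    MPI_Finalize();
    return 0;
}
```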
HPVM Communication Performance
Delivers the underlying performance even for small messages; the endpoints are now the limits
100 MB/s at 1 KB vs. the gigabit testbeds’ 60 MB/s at 1000 KB packets
– >1500x improvement
[Chart: bandwidth (MB/s, 0-120) vs. message size (4 bytes to ~16 KB) for FM on Myrinet and MPI on FM-Myrinet; both curves approach ~100 MB/s, with FM slightly above MPI.]
• N1/2 ~ 400 Bytes
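N1/2 here is the standard half-performance message size. A short derivation under the usual linear cost model (the model is my framing; the 400-byte figure is from the slide):

```latex
% Linear cost model: time to move an n-byte message is
%   T(n) = t_0 + n / B_inf,
% with t_0 the fixed per-message cost and B_inf the asymptotic bandwidth.
% Delivered bandwidth is B(n) = n / T(n); setting B(N_{1/2}) = B_inf / 2:
\[
  \frac{N_{1/2}}{t_0 + N_{1/2}/B_\infty} = \frac{B_\infty}{2}
  \quad\Longrightarrow\quad
  N_{1/2} = t_0\, B_\infty .
\]
% With B_inf ~ 100 MB/s, the quoted N_{1/2} ~ 400 bytes implies an
% effective per-message cost t_0 ~ 4 microseconds.
```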
HPVM/FM on VIA
FM protocol/techniques portable to Giganet VIA
Slightly lower performance, comparable N1/2
Commercial version: WSDI (stay tuned)
[Chart: bandwidth (MB/s, 0-90) vs. message size (bytes) for FM on Giganet VIA and MPI-FM on Giganet VIA; both curves level off in the 80-90 MB/s range.]
• N1/2 ~ 400 Bytes
Unified Transfer and Notification (all transports)
Solution: uniform notify and poll (single queue representation)
Scalability: n into k (hash); arbitrary SMP size or number of NIC cards
Key: integrate variable-sized messages; achieve a single DMA transfer
– no pointer-based memory management, no special synchronization primitives, no complex computation
Memory format provides atomic notification in a single contiguous memory transfer (bcopy or DMA), sketched in code after the diagram below
[Diagram: per-processor receive regions of fixed-size frames at increasing addresses; each frame holds variable-size data followed by a fixed-size trailer carrying the length and a flag.]
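A sketch in C of the frame layout just described. The right-justified payload placement is my reading of how one contiguous ascending-address write can end in the trailer; the exact HPVM layout may differ:

```c
/* Sketch of the receive-queue memory format: fixed-size frames, a
 * variable-size payload, and a fixed-size trailer carrying length and
 * a valid flag at the *end* of the frame. Because the NIC (or bcopy)
 * writes payload + trailer as one contiguous, ascending transfer, the
 * flag becomes visible only after the payload is in place, so a single
 * load gives atomic notification. Layout and names are illustrative. */
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE 2048

typedef struct {
    uint32_t len;                 /* bytes of valid payload            */
    uint32_t flag;                /* nonzero => frame complete         */
} trailer_t;

typedef struct {
    uint8_t   data[FRAME_SIZE - sizeof(trailer_t)];  /* payload area   */
    trailer_t trailer;                               /* written last   */
} frame_t;

/* Poll: one load on the trailer flag decides whether a message has
 * arrived; no pointer chasing, no locks, no descriptor reconciliation. */
static inline int frame_ready(const volatile frame_t *f) {
    return f->trailer.flag != 0;
}

static inline uint32_t frame_consume(frame_t *f, void *dst) {
    uint32_t n = f->trailer.len;
    /* Sender right-justifies the payload against the trailer so that
       payload + trailer form one contiguous ascending write. */
    const uint8_t *src = f->data + (sizeof f->data - n);
    memcpy(dst, src, n);
    f->trailer.flag = 0;          /* recycle frame for the next message */
    return n;
}
```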
Integrated Notification Results
No polling or discontiguous-access performance penalties
Uniform high performance, stable across configuration changes or the addition of new transports
– no custom tuning for configuration required
Framework is scalable to large numbers of SMP processors and network interfaces
                          Single Transport   Integrated
Myrinet (latency)         8.3 µs             8.4 µs
Myrinet (BW)              101 MB/s           101 MB/s
Shared memory (latency)   3.4 µs             3.5 µs
Shared memory (BW)        200+ MB/s          200+ MB/s
Supercomputer Performance Characteristics (11/99)
Machine                 MF/Proc      Flops/Byte   Flops/NetworkRT
Cray T3E                1200         ~2           ~2,500
SGI Origin2000          500          ~0.5         ~1,000
HPVM NT Supercluster    600          ~8           ~12,000
IBM SP2 (4- or 8-way)   2.6-5.2 GF   ~12-25       ~150-300K
Beowulf (100 Mbit)      600          ~50          ~200,000
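How to read the balance ratios, using round numbers for the HPVM row (the ~75 MB/s delivered bandwidth and ~20 µs round trip are my assumptions, chosen to be consistent with the ratios shown):

```latex
% Flops/Byte: per-processor compute rate over delivered network bandwidth.
\[
  \frac{600\times 10^{6}\ \text{flops/s}}{75\times 10^{6}\ \text{bytes/s}}
  \approx 8\ \text{flops/byte}
\]
% Flops/NetworkRT: flops a processor completes in one network round trip.
\[
  600\times 10^{6}\ \text{flops/s}\times 20\times 10^{-6}\ \text{s}
  \approx 12{,}000\ \text{flops}
\]
% Higher ratios mean the machine must find more work per byte (or per
% round trip) to stay busy; hence the Beowulf/100Mbit row's ~200,000.
```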
The NT (Windows) Supercluster
Windows Clusters
Early prototypes in CSAG
– 1/1997, 30P, 6 GF
– 12/1997, 64P, 20 GF
Alliance’s Supercluster
– 4/1998, 256P, 77 GF
– 6/1999, 256P*, 109 GF
NCSA’s Windows Supercluster
Rob Pennington (NCSA), Andrew Chien (UCSD)
Using NT, Myrinet interconnect, and HPVM
#207 in Top 500 Supercomputing Sites
[System diagram:
– 128 compute nodes (256 CPUs): HP Kayak XU, dual Pentium III 550 MHz, 1 GB RAM each, running Windows NT, Myrinet, and HPVM
– Front-end systems for application development and job submission; LSF batch job scheduler
– File servers and LSF master on Fast Ethernet; 128 GB home, 200 GB scratch storage
– FTP to mass storage over the Internet; daily backups
– Infrastructure and development testbeds: 8 4p 550 MHz + 32 2p 300 MHz + 8 2p 333 MHz nodes, Windows 2000 and NT
– Inset: engineering fluid flow problem (D. Tafti, NCSA), performance compared across an SGI Origin, 550 MHz nodes, and 300 MHz nodes]
(courtesy Rob Pennington, NCSA)
Example Application Results
MILC – QCD
Navier-Stokes kernel
Zeus-MP – astrophysics CFD
Large-scale science and engineering codes
Comparisons to SGI O2K and Linux clusters
MILC Performance
[Chart: GFLOPs (0-12) vs. number of processors (0-100) for IA-32/Win NT 300 MHz PII, 250 MHz SGI O2K, T3E 900, and IA-32/Win NT 550 MHz Xeon.]
Source: D. Toussaint and K. Orginos, Arizona
Zeus-MP (Astrophysics CFD)
[Chart: MFlops/sec (0-10,000) vs. number of processors (1-256) for SGI O2K, Janus (ASCI Red), and the NT Supercluster (550 MHz).]
2D Navier Stokes Kernel
Source: Danesh Tafti, NCSA
[Chart: AS-PCG MPI performance, 2D Navier-Stokes kernel; GFLOPs (0-20) vs. processors (0-256) for SGI O2000 (250 MHz R10000), NT cluster (550 MHz PIII Xeon, HP Kayak), NT cluster (300 MHz PII, HP Kayak), and a mixed cluster of 128 300 MHz Pentium II + 128 550 MHz Pentium III Xeon.]
Applications with High Performance on Windows Supercluster
Zeus-MP (256P, Mike Norman)
ISIS++ (192P, Robert Clay)
ASPCG (256P, Danesh Tafti)
Cactus (256P, Paul Walker/John Shalf/Ed Seidel)
MILC QCD (256P, Lubos Mitas)
QMC Nanomaterials (128P, Lubos Mitas)
Boeing CFD test codes, CFD Overflow (128P, David Levine)
freeHEP (256P, Doug Toussaint)
ARPI3D (256P, weather code, Dan Weber)
GMIN (L. Munro in K. Jordan’s group)
DSMC-MEMS (Ravaioli)
FUN3D with PETSc (Kaushik)
SPRNG (Srinivasan)
MOPAC (McKelvey)
Astrophysical N-body codes (Bode)
Parallel sorting (Rivera, CSAG): 18.3 GB MinuteSort world record
=> Little code retuning, and quickly running ...
MinuteSort
Sort the most data disk-to-disk in one minute
“Indy sort” rules
– fixed-size keys, special-purpose sorter and file format
HPVM/Windows cluster was the winner for 1999 (10.3 GB) and 2000 (18.3 GB)
– Adaptation of the Berkeley NOWSort code (Arpaci-Dusseau)
Commodity configuration ($$ not a metric)
– PCs, IDE disks, Windows
– HPVM and 1 Gb/s Myrinet
(a sketch of the sort’s partition step follows below)
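The NOWSort approach is a one-pass parallel bucket sort: read local records, partition by key into per-destination buckets, exchange buckets, then sort and write the received range. A sketch in C of the partition step (record layout and key-to-node split are illustrative, not the contest code):

```c
/* Sketch of the one-pass partition step in a NOWSort-style parallel
 * sort: scan local records and scatter them into per-destination
 * buckets by key range; buckets then go to their owner nodes (over
 * HPVM/Myrinet in the record runs), which sort and write what they
 * receive. Record layout and key split are illustrative. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODES 64
#define RECSZ 100                 /* classic sort-benchmark record size */

typedef struct { uint8_t key[10]; uint8_t payload[RECSZ - 10]; } rec_t;

/* Uniform keys (true of the Indy-sort input) let the leading key byte
 * choose the destination node directly. */
static int dest_of(const rec_t *r) {
    return ((int)r->key[0] * NODES) / 256;
}

/* Partition n local records into per-node buckets. Overlapping this
 * scan with disk reads and network sends is the pipeline that the
 * "Sort Scaling" slide identifies as the bottleneck. */
void partition(const rec_t *in, size_t n,
               rec_t *bucket[NODES], size_t count[NODES]) {
    memset(count, 0, NODES * sizeof count[0]);
    for (size_t i = 0; i < n; i++) {
        int d = dest_of(&in[i]);
        bucket[d][count[d]++] = in[i];
    }
}
```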
MinuteSort Architecture
(Luis Rivera, UIUC; Xianan Zhang, UCSD)
[Diagram: 32 HP Kayaks with 3Ware controllers and 4 x 20 GB IDE disks each; 32 HP Netservers with 2 x 16 GB SCSI disks each; connected by HPVM and 1 Gb/s Myrinet.]
Sort Scaling
The concurrent read/bucket-sort/communicate phase is the bottleneck
– faster I/O infrastructure required (buses and memory, not disks)
MinuteSort Execution Time
Reliability
Gossip: “Windows platforms are not reliable”
– Larger systems => intolerably low MTBF
Our experience: “Nodes don’t crash”
– Application runs of 1000s of hours
– A node failure means an application failure; effectively not a problem
Hardware
– Short term: infant mortality (1-month burn-in)
– Long term:
  • ~1 hardware problem / 100 machines / month
  • disks, network interfaces, memory
  • no processor or motherboard problems
Windows Cluster Usage
Lots of large jobs; runs up to ~14,000 CPU-hours (64p * 9 days)
[Chart: NT cluster usage by number of processors, May 1999 to Jul 2000; CPU hours (0-500,000) binned by job size: 1-31, 32-63, and 64-256 processors.]
Other Large Windows Clusters
Sandia’s Kudzu cluster (144 procs, 550 disks, 10/98)
Cornell’s AC3 Velocity cluster (256 procs, 8/99)
Others (sampled from vendors):
– GE Research Labs (16, scientific)
– Boeing (32, scientific)
– PNNL (96, scientific)
– Sandia (32, scientific)
– NCSA (32, scientific)
– Rice University (16, scientific)
– U. of Houston (16, scientific)
– U. of Minnesota (16, scientific)
– Oil & gas (8, scientific)
– Merrill Lynch (16, e-commerce)
– UIT (16, ASP/e-commerce)
(courtesy David A. Lifka, Cornell TC)
The AC3 Velocity
64 Dell PowerEdge 6350 servers
• Quad Pentium III 500 MHz / 2 MB cache processors (SMP)
• 4 GB RAM per node
• 50 GB disk (RAID 0) per node
Giganet full interconnect
• 100 MB/s bandwidth between any two nodes
• Very low latency
2 terabytes of Dell PowerVault 200S storage
• 2 Dell PowerEdge 6350 dual-processor file servers
• 4 PowerVault 200S units per file server
• 8 x 36 GB disk drives per PowerVault 200S
• Quad-channel SCSI RAID adapter
• 180 MB/s sustained throughput per server
2-terabyte PowerVault 130T tape library
• 4 DLT 7000 tape drives
• 28-tape capacity
#381 in Top 500 Supercomputing Sites
Recent AC3 Additions
8 Dell PowerEdge 2450 servers (serial nodes)
• Pentium III 600 MHz / 512 KB cache
• 1 GB RAM per node
• 50 GB disk (RAID 0) per node
7 Dell PowerEdge 2450 servers (first all-NT-based AFS cell)
• Dual-processor Pentium III 600 MHz / 512 KB cache
• 1 GB RAM per fileserver node, 512 MB RAM per database-server node
• 1 TB SCSI-based RAID 5 storage
• Cross-platform filesystem support
64 Dell PowerEdge 2450 servers (protein folding, fracture analysis)
• Dual-processor Pentium III 733 MHz / 256 KB cache
• 2 GB RAM per node
• 27 GB disk (RAID 0) per node
• Full Giganet interconnect
3 Intel ES6000 and 1 ES1000 Gigabit switches
• Upgrading the server backbone network to Gigabit Ethernet
(courtesy David A. Lifka, Cornell TC)
AC3 Goals
Only commercially supported technology
– Rapid spin-up and spin-out
– Package technologies for vendors to sell as integrated systems
=> All of the commercial packages were moved from SP2 to Windows; all users are back, and more!
Users: “I don’t do Windows” => “I’m agnostic about operating systems, and just focus on getting my work done.”
Protein Folding
http://www.tc.cornell.edu/reports/NIH/resource/CompBiologyTools/
The cooperative motion of ion and water through the gramicidin ion channel. The effective quasi-particle that permeates through the channel includes eight water molecules and the ion. Work of Ron Elber with Bob Eisenberg, Danuta Rojewska and Duan Pin.
Reaction path study of ligand diffusion in leghemoglobin. The ligand is CO (white), moving from the binding site (the heme pocket) to the protein exterior. A study by Wieslaw Nowak and Ron Elber.
(courtesy David A. Lifka, Cornell TC)
Protein Folding: Per-Processor Performance

Results on different computers for protein structures:

Machine                        System      CPU                CPU speed [MHz]   Compiler   Energy evaluations/sec
Blue Horizon (SP, San Diego)   AIX 4       Power3             222               xlf        44.3
Linux cluster                  Linux 2.2   Pentium III        650               PGF 3.1    59.1
Velocity (CTC)                 Win 2000    Pentium III Xeon   500               df v6.1    46.0
Velocity+ (CTC)                Win 2000    Pentium III        733               df v6.1    59.2

Results on different computers for α/β proteins:

Machine                        System      CPU                CPU speed [MHz]   Compiler   Energy evaluations/sec
Blue Horizon (SP, San Diego)   AIX 4       Power3             222               xlf        15.0
Linux cluster                  Linux 2.2   Pentium III        650               PGF 3.1    21.0
Velocity (CTC)                 Win 2000    Pentium III Xeon   500               df v6.1    16.9
Velocity+ (CTC)                Win 2000    Pentium III        733               df v6.1    22.4
(courtesy David A. Lifka, Cornell TC)
AC3 Corporate Members
– Air Products and Chemicals
– Candle Corporation
– Compaq Computer Corporation
– Conceptual Reality Presentations
– Dell Computer Corporation
– Etnus, Inc.
– Fluent, Inc.
– Giganet, Inc.
– IBM Corporation
– ILOG, Inc.
– Intel Corporation
– KLA-Tencor Corporation
– Kuck & Associates, Inc.
– Lexis-Nexis
– MathWorks, Inc.
– Microsoft Corporation
– MPI Software Technologies, Inc.
– Numerical Algorithms Group
– Portland Group, Inc.
– Reed Elsevier, Inc.
– Reliable Network Solutions, Inc.
– SAS Institute, Inc.
– Seattle Lab, Inc.
– Visual Numerics, Inc.
– Wolfram Research, Inc.
(courtesy David A. Lifka, Cornell TC)
Windows Cluster Summary
Good performance
Lots of applications
Good reliability
Reasonable management complexity (TCO)
The future is bright; uses are proliferating!
Windows Cluster Resources
NT Supercluster, NCSA
– http://www.ncsa.uiuc.edu/General/CC/ntcluster/
– http://www-csag.ucsd.edu/projects/hpvm.html
AC3 Cluster, TC
– http://www.tc.cornell.edu/UserDoc/Cluster/
University of Southampton
– http://www.windowsclusters.org/
=> application and hardware/software evaluation
=> many of these folks will work with you on deployment
Tools and Technologies for Building Windows Clusters
Communication hardware
– Myrinet, http://www.myri.com/
– Giganet, http://www.giganet.com/
– ServerNet II, http://www.compaq.com/
Cluster management and communication software
– LSF, http://www.platform.com/
– Codine, http://www.gridware.net/
– Cluster CoNTroller, MPI, http://www.mpi-softtech.com/
– Maui Scheduler, http://www.cs.byu.edu/
– MPICH, http://www-unix.mcs.anl.gov/mpi/mpich/
– PVM, http://www.epm.ornl.gov/pvm/
Microsoft cluster info
– Win2000, http://www.microsoft.com/windows2000/
– MSCS, http://www.microsoft.com/ntserver/ntserverenterprise/exec/overview/clustering.asp
Future Directions
– Terascale Clusters
– Entropia
A Terascale Cluster
NSF is currently running a $36M Terascale competition
The budget could buy:
– an Itanium cluster (3000+ processors)
– ~3 TB of main memory
– >1.5 Gb/s high-speed network interconnect
10+ teraflops in 2000?
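A sanity check on the headline number, assuming roughly 3 GF peak per Itanium-class processor (my assumption; early Itanium peak figures vary):

```latex
\[
  3000\ \text{processors} \times 3\ \text{GF/processor} \approx 10\ \text{TF peak}
\]
```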
? #1 in Top 500 Supercomputing Sites ?
Entropia: Beyond Clusters
• COTS, SHV enable larger, cheaper, faster systems
• Supercomputers (MPPs) to…
• Commodity clusters (NT Supercluster) to…
• Entropia
Internet Computing
Idea: assemble large numbers of idle PCs, in people’s homes and offices, into a massive computational resource
– Enabled by broadband connections, fast microprocessors, and huge PC volumes
Unprecedented Power
Entropia network: ~30,000 machines (and growing fast!)
– 100,000 machines at 1 GHz => a 100 TeraOp system
– 1,000,000 machines at 1 GHz => a 1,000 TeraOp system (1 PetaOp)
Compare: IBM ASCI White (12 TeraOp, 8K processors, $110 million system)
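The aggregate arithmetic behind these figures, assuming roughly one operation per cycle per machine (peak, not delivered):

```latex
\[
  10^{5}\ \text{machines}\times 10^{9}\ \text{ops/s} = 10^{14}\ \text{ops/s}
  = 100\ \text{TeraOps}
\]
\[
  10^{6}\ \text{machines}\times 10^{9}\ \text{ops/s} = 10^{15}\ \text{ops/s}
  = 1\ \text{PetaOp}
\]
% Availability, heterogeneity, and communication (see Technical
% Challenges below) determine what fraction is actually delivered.
```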
Why Participate: Cause Computing!
People Will Contribute
Millions have demonstrated willingness to donate their idle cycles
“Great Cause” computing
– Current: find ET, large primes, crack DES, …
– Next: find cures for cancer, muscular dystrophy, air and water pollution, …
  • understand the human genome, ecology, fundamental properties of matter, the economy
Participate in science, medical research, and promotion of causes that you care about!
Technical Challenges
– Heterogeneity (machine, configuration, network)
– Scalability (thousands to millions)
– Reliability (turn off, disconnect, fail)
– Security (integrity, confidentiality)
– Performance
– Programming
– …
Entropia: harnessing the computational power of the Internet
Entropia is . . .
Power: a network with unprecedented power and scale
Empower: ordinary people to participate in solving the great social challenges and mysteries of our time
Solve: team solving fascinating technical problems
Summary
Windows clusters are powerful, successful high-performance platforms
– Cost effective, with excellent performance
– Poised for rapid proliferation
Beyond clusters are Internet computing systems
– Radical technical challenges; vast and profound opportunities
For more information, see
– HPVM: http://www-csag.ucsd.edu/
– Entropia: http://www.entropia.com/
Credits
NT Cluster team members
– CSAG (UIUC and UCSD Computer Science) – my research group
– NCSA Leading Edge Site – Robert Pennington’s team
Talk materials
• NCSA (Rob Pennington, numerous application groups)
• Cornell TC (David Lifka)
• Boeing (David Levine)
• MPISoft (Tony Skjellum)
• Giganet (David Wells)
• Microsoft (Jim Gray)