
CENIC '07: Making Waves
March 12-14, 2007, La Jolla, CA
cenic07.cenic.org

Improving End2End Performance for the Columbia Supercomputer
Mark Foster
Computer Sciences Corp., NASA Ames Research Center
March 2007

This work is supported by the NASA Advanced Supercomputing Division under Task Order A61812D (ITOP Contract DTTS59-99-D-00437/TO #A61812D) with Advanced Management Technology Incorporated (AMTI).


end2end for Columbia

• overview
• Columbia system
• LAN
• WAN
• e2e efforts
  – what we observed
  – constraints, and tools used
  – impact of efforts
• sample applications
  – earth, astro, science, aero, spaceflight


overview

• scientists using large scale supercomputing resources to investigate problems: work is time critical
  – limited computational cycles allocated
  – results needed to feed into other projects
• 100's of GBs to multiple-TB data sets now common and increasing
  – data transfer performance becomes a crucial bottleneck
• many scientists from many locations/hosts: no simple solution
• bringing network engineers to the edge, we have been able to improve transfer rates from a few Mbps to a few Gbps for some applications
• system utilization now often well above 90%


shared challenges

• Chris Thomas @ UCLA:
  – 10 Mbps end hosts, OC3 campus/group access
  – asymmetric (campus) path
  – firewall performance consideration
  – end users: not network engineers
• Russ Hobby on Cyber Infrastructure:
  – it is a system (complex, but not as complex as the earth/ocean systems John Delaney described)
  – composition of components that must work together (efficiently)
  – not all problems are purely technical


the Columbia supercomputer

Systems: SGI Altix 3700, 3700-BX2 and 4700
Processors: 10,240 Intel Itanium 2 (single and dual core)
Global Shared Memory: 20 Terabytes
Front-End: SGI Altix 3700 (64 proc.)
Online Storage: 1.1 Petabytes RAID
Offline Storage: 6 Petabytes STK Silo
Internode Comm: InfiniBand
Hi-Speed Data Transfer: 10 Gigabit Ethernet
2048p subcluster: NUMAlink4 interconnect

• 8th fastest supercomputer in the world: 62 Tflops peak
• supporting a wide variety of projects
  – >160 projects; >900 accounts; ~150 simultaneous logins
  – users from across and outside NASA
  – 24x7 support
• effective architecture: easier application scaling for high-fidelity, shorter time-to-solution, higher throughput
  – 20 x 512p/1TB shared memory nodes
  – some applications scaling to 2048p and above
• fast build: order to full ops in 120 days; dedicated Oct. 2004
  – unique partnership with industry (SGI, Intel, Voltaire)


Columbia configuration

[configuration diagram]
- capability system: 13 TF; capacity system: 50 TF
- front ends (3): 28p Altix 3700; hyperwall access (HWvis): 16p Altix 3700
- compute nodes (single system image, 20 x 512p): Altix 3700 (A), Altix 3700 BX2 (T), Altix 4700 (M); Altix 3700 BX2 2048p NUMAlink node
- networking: 10GigE switches, 10GigE cards (1 per 512p), InfiniBand switch (288 port), InfiniBand cards (6 per 512p)
- storage area network: Brocade switch 2 x 128 port, FC switches
- online storage (1,040 TB, 24 racks): SATA RAID (35 TB and 75 TB units), FC RAID (20 TB units)


Columbia access LAN

[LAN diagram] Columbia nodes (C1 ... Cn) connect to 6500 switches for interconnect and aggregation; a PE 6500 provides access and border peering to external peers (NISN, NREN).


wide area network - NREN

10G waves at the core, dark fiber to end sites

[NREN WAN map legend: external peering points, distributed exchanges, NLR/regional nets, 10 GigE, 1 GigE. Sites and exchange points shown: ARC, JPL, LRC, GSFC, MSFC, Sunnyvale CA, Los Angeles CA, McLean VA, Norfolk VA, Huntsville AL, Atlanta GA, PacWave, CENIC, NLR, ESNet, NGIX-E, NGIX-W/AIX (in progress), MATP/ELITE, SLR, MAX/DRAGON.]

• National and regional optical networks provide links over which 10 Gbps and 1 Gbps waves can be established.
• Distributed exchange points provide interconnect in metro and regional areas to other networks and research facilities.


end2end efforts

what we observed
– long running but low data rates (Kbps, Mbps)
– very slow bulk file moves reported
– bad mix: untuned systems, small windows, small mtu, long rtt (see the worked example below)
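For a sense of scale (a generic worked example, not a measurement from these hosts): with a default 64 KB TCP window and an 85 ms RTT, a single stream is limited to roughly 64 KB / 0.085 s ≈ 6 Mbps no matter how fast the link is, while sustaining 1 Gbps over the same path needs a window of about 1 Gbps x 0.085 s ≈ 10.6 MB; hence the window, buffer, and MTU work described in the following slides.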

(insert historical graph here)


end2end efforts

constraints, and tools used
– facilities leveraging web100 could be really helpful, but…
– local policies/procedures sometimes preclude helpful changes
  • system admin practices: "standardization" for lowest common denominator, "fear" of impact (mtu, buffer size increases)
  • IT security policies, firewalls: "just say no"
  • WAN performance issues: "we don't observe a problem on our LAN"
– path characterization: ndt, npad, nuttcp, iperf, ping, traceroute (example commands after this list)
  • solve obvious issues early (duplex mismatch, mtu limitation, poor route)
– flow monitoring: netflow, flow-tools (Fullmer), FlowViewer (Loiacono)
– bulk transfer: bbftp (IN2P3/Gilles Farrache), bbscp (NASA), HPN-SSH (PSC/Rapier); starting to look at others: VFER & UDT
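As a rough illustration of the path-characterization step, the commands below sketch typical nuttcp and iperf (v2) runs; the host name, window sizes, rates, and stream counts are placeholders rather than values from the Columbia environment.

  # on a test host near the far end of the path:
  nuttcp -S                                      # nuttcp server
  iperf -s -w 32M                                # or an iperf server with a large window

  # from the near end:
  nuttcp -i 1 -T 30 -w 4m far.host.example       # 30 s TCP test, 1 s reports, 4 MB window
  nuttcp -u -R 900m -i 1 -T 30 far.host.example  # UDP at ~900 Mbps, to separate loss from TCP limits
  iperf -c far.host.example -w 32M -t 30 -P 8    # 8 parallel TCP streams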


initial investigations

• scp 2-5 Mbps (or worse): cpu limits, and tcp limits
  – can achieve much better results with HPN-SSH (enables tcp window scaling), and by using RC4 encryption (much more efficient on some processors; use "openssl speed" to assess the cpu's performance - a sketch follows this list)
  – even with these improvements, still need to use 8-12 concurrent streams to get maximum performance with small MTUs
• nuttcp shows udp performance near line rate in many cases, but tcp performance still lacking
  – examine tcp behavior (ndt, npad, tcptrace)
  – tcp buffer sizes main culprit in large RTT environments; small amounts of loss can be hard to detect/resolve
  – mid-span (or nearby) test platforms helpful
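A hedged sketch of the cipher check and a multi-stream copy; the file names, host, and chunking are illustrative only, and arcfour is OpenSSH's name for the RC4 cipher.

  # compare symmetric-cipher throughput on this CPU (RC4 vs. AES):
  openssl speed rc4 aes-128-cbc

  # single copy using the cheaper cipher:
  scp -c arcfour bigfile.dat user@far.host.example:/scratch/

  # with small MTUs, several concurrent copies may still be needed to fill the path:
  for f in chunk.0 chunk.1 chunk.2 chunk.3; do
    scp -c arcfour "$f" user@far.host.example:/scratch/ &
  done
  wait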


recommend TCP adjustments

typical linux example for 85 ms rtt:

  # Set maximum TCP window sizes to 100 megabytes
  net.core.rmem_max = 104857600
  net.core.wmem_max = 104857600
  # Set minimum, default, and maximum TCP buffer limits
  net.ipv4.tcp_rmem = 4096 524288 104857600
  net.ipv4.tcp_wmem = 4096 524288 104857600
  # Set maximum network input buffer queue length
  net.core.netdev_max_backlog = 30000
  # Disable caching of TCP congestion state (2.6 only)
  # (workaround a bug in some Linux stacks)
  net.ipv4.tcp_no_metrics_save = 1
  # Ignore ARP requests for local IP received on wrong interface
  net.ipv4.conf.all.arp_ignore = 1
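These are sysctl keys; a minimal way to apply them on a 2.6-era Linux host (assuming root) is to append the lines above to /etc/sysctl.conf and reload, or to set a key directly:

  sysctl -p                                  # reload /etc/sysctl.conf
  sysctl -w net.core.rmem_max=104857600      # or set an individual key immediately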

ref: "Enabling High Performance Data Transfers", www.psc.edu/networking/projects/tcptune


recommend ssh changes

• at least OpenSSH 4.3p2, using OpenSSL 0.9.8b (May 2006)
• use faster ciphers than the default (RC4 leverages processor-specific coding); a sample configuration is sketched below
• OpenSSH should be patched (HPN-SSH) to support large buffers and congestion windows
  www.psc.edu/networking/projects/hpn-ssh
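A minimal per-user configuration reflecting these recommendations; the host alias and cipher ordering are illustrative (standard OpenSSH syntax), and the commented HPN option only exists in HPN-SSH-patched builds, so its exact name should be checked against the patch documentation.

  # ~/.ssh/config
  Host columbia-xfer
      HostName far.host.example
      # prefer the cheaper RC4-family ciphers over the defaults
      Ciphers arcfour,arcfour128,aes128-cbc
      # HPN-SSH patched builds also accept buffer options, e.g.:
      # HPNBufferSize 16384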


firewall impacts

Prior to firewall upgrade: 199 - 644 Mbps
After firewall upgrade: 792 - 980 Mbps


end host aggregate improvement

host performance using multiple streams, with some tuning - 8 streams: 257 Mbps
after more tuning and firewall upgrade - 4 streams: 4.7 Gbps


example application increase

[chart comparing file transfer methods]
Standard File Transfer Application (SCP): 5.7
Improved File Transfer Application (Multi-Stream bbFTP): 169
Improved File Transfer Application & Jumbo Frames (Multi-Stream bbFTP + Jumbo Frames): 308

• NASA Goddard's 3-D Cloud-Resolving Model: 54x throughput performance gains (a sample bbftp invocation is sketched below)
• collaboration between NREN and the GSFC Scientific & Engineering Network (SEN) and High-End Computing Network (HECN) teams
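For reference, a hedged sketch of what a jumbo-frame, multi-stream bbftp transfer can look like; the interface, host, user, stream count, and paths are placeholders, and option details should be checked against the bbftp man page for the installed version.

  # enable jumbo frames on the transfer interface (both ends and the path must support a 9000-byte MTU):
  ip link set dev eth2 mtu 9000

  # ssh-started bbftp session with 8 parallel streams:
  bbftp -s -u user -p 8 -e "put model_output.nc /archive/model_output.nc" far.host.example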


factors driving traffic growth

• increased Columbia usage
• storage and file system upgrades on Columbia
• aggressive campaign to work with users to improve performance to Columbia through the use of better file transfer tools, end system tuning and user education
• network bandwidth increases across the NREN wide area network, local area networks and firewalls


impact of e2e efforts

• trends show aggregate 5 TB/mo increased to more than 20 TB/mo
• three of the previous five months exceed 1 TB/day

efforts do not just result in increased bandwidth; improved network performance results in improved capability, increased fidelity, more efficient computing, & better productivity


ECCO at JPL/MIT - visualization

High Temporal Resolution Visualization Provides New Insights for Ocean Researchers

• NAS Visualization group completed a 110-hour computational run of the Massachusetts Institute of Technology's general circulation model, MITgcm, simulating an entire year of ocean dynamics
• These visualizations are allowing researchers at MIT and NASA's Jet Propulsion Laboratory to investigate model dynamics with unprecedented temporal resolution

salt concentration, parts per thousand, at 15 m depth


dark matter at UCSC

• Madau, Diemand, and Kuhlen at UC Santa Cruz simulate the evolution of a dark matter halo
• Projected dark matter density-square maps of the simulated Milky Way-size halo at 13.3 billion years ago, 460 million years after the Big Bang.


combustion science at LBL

• Marc Day and collaborators at LBL perform high-fidelity numerical simulations on Columbia
• results aid in the development of clean, fuel-efficient combustion systems for transportation and stationary power generation

partial period of a combustion simulation, colored by the local fuel consumption rate


national combustion code

• Combustor hardware is complex and the turbulent reacting flow process is complicated (and still not well understood).
• Massively parallel processing via the message passing interface speeds up the calculations to acceptable levels (approximately a wall-clock week).

Partially resolved Navier-Stokes simulations of the GE LM6000 combustor

Combustion instabilities in a LOX-methane rocket engine


Ares I stage separation

• The Ares I rocket launch system concept is similar to the Saturn rocket of the Apollo Program.
• A simulation of a high-altitude stage separation computes the flow of air around the vehicle and the resultant aerodynamic forces.

early stages of separation

flowfield around the vehicle post-separation


Orion - crew vehicle

re-entry wake turbulence


Orion - crew vehicle

re-entry wake turbulence - IB WAN at SC06 (Henze)

Obsidian Longbow IB over NREN via 10 GbE NLR FrameNet between NASA Ames and Tampa


NLCS awards - March 2007

4.75 million hours of supercomputing time under NASA's National Leadership Computing System (NLCS) initiative: computationally intensive research projects of national interest

• Transition in High-Speed Boundary Layers: Numerical Investigations Using DNS and LES: led by Hermann Fasel, University of Arizona, Tucson: high-fidelity simulations to understand how turbulence starts in high-speed airflow over air vehicles
• Large Scale URANS/DES Ship Hydrodynamics Computations with CFDShip-Iowa: led by Fredrick Stern, University of Iowa: accelerate code development for viscous ship hydrodynamics simulation
• Flame Dynamics and Emission Chemistry in High-Pressure Industrial Burners: led by Marcus Day, Lawrence Berkeley National Laboratory: simulate natural gas combustion in power-generation turbines to quantify the mechanisms that control the formation of pollutants
• Multi-Scale Modeling and Computation of Convective Geophysical Turbulence: led by Keith Julien, University of Colorado, Boulder: new algorithms in large-scale simulations to study the role of global ocean thermohaline circulation (THC) in modulating the world's climate


some tuning references

• NREN TCP Performance Tuning Guide: www.nren.nasa.gov/tcp_tuning.html (also has links for bbftp, bbscp)
• Other useful guides:
  – WAN Tuning and Troubleshooting: www.internet2.edu/~shalunov/writing/tcp-perf.html
  – Enabling High Performance Data Transfers: www.psc.edu/networking/projects/tcptune
  – TCP Tuning Guide: www-didc.lbl.gov/TCP-tuning


thank you

Mark Foster
Computer Sciences Corp.
[email protected]