Architectures for Exascale Computing
Peter Strazdins, Computer Systems Group,
Research School of Computer Science
Computational Science at the Petascale and Beyond Workshop, Australian National University
13 February 2012
(slides available from http://cs.anu.edu.au/~Peter.Strazdins/seminars)
1 Overview
• major components & attributes
• system requirements for exascale computing
• architectural complexity challenges
• resiliency/reliability issues; checkpointing
• energy efficiency challenges: on-chip and system-wide
• examples from recent petascale architectures
  • Jaguar, Tsubame, TianHe (FeiTeng), Blue Gene/Q
• outlook and conclusions
2 Exascale Architecture: Components & Attributes
• processor
  • floating point / integer speed, memory latency and bandwidth
• interconnect: interface chips, switches and wiring
  • bandwidth: point-to-point and bisectional (see the sketch below); latency!!
• input / output
  • bandwidth, asynchronicity (output), ||ism
• attributes for all:
  • performance, reliability / resiliency, and energy/power
  • system design must manage these
  • must be balanced for their intended use
    • long-running, ultra large-scale simulations
    • highly compute and memory-intensive; moderately I/O intensive
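To make "bisectional bandwidth" concrete: a minimal Python sketch comparing a non-blocking fat tree against a k-ary n-cube torus. The node count and link rates are hypothetical, chosen only for illustration.

```python
# Toy bisection-bandwidth comparison; all figures are illustrative.

def fat_tree_bisection(nodes, link_gbs):
    # non-blocking fat tree: either half of the machine can stream
    # to the other half at full per-link rate
    return nodes / 2 * link_gbs

def torus_bisection(k, n, link_gbs):
    # k-ary n-cube: bisecting one dimension severs 2 * k**(n-1) links
    # (the factor of 2 counts the wrap-around links)
    return 2 * k ** (n - 1) * link_gbs

nodes = 4096                              # hypothetical machine size
print(fat_tree_bisection(nodes, 8.0))     # 16384.0 GB/s
print(torus_bisection(16, 3, 8.0))        # 4096.0 GB/s (16x16x16 torus)
```

At the same per-link rate the fat tree bisects 4× wider here, which is why topology choice recurs in the system slides below.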
3 System Requirements for Exascale Computing
[Figure omitted: courtesy of Lucy Nowell & Sonia Sachs, DOE Office of Science]
4 System Architectural Complexity Challenges
• trend towards heterogeneous node architectures
  • GPUs, FPGAs, custom accelerators?
• trend towards deeper hierarchies
  • deeper memory hierarchy (now 3 levels of cache)
  • I/O: solid-state storage (on node?), I/O nodes
  • distributed memory: also within node?
  • also within chip? (100s of cores)
!"#$%&
'()"*+#& '()"*+#& '()"*+#&
,-.+/0-(1.& ,-.+/0-(1.&,-.+/0-(1.&
'%"2/ -$3+#& '%"2/ -$3+#&'%"2/ -$3+#&
'-1+& '-1+& '-1+& '-1+&
4&
4&
'-1+&
!"#$%&
'()"*+#& '()"*+#& '()"*+#&
,-.+/0-(1.& ,-.+/0-(1.&,-.+/0-(1.&
"2/ -$3+#& '%"2/ -$3+#&'%"2/ -$3+#&
4&
4&
4& ;<=&;<=&;<=&
(courtesy Jack Dongarra, SCā11)
• these, coupled with massive increases in concurrency, present major challenges to the software!
  • debugging, performance analysis, automatic tuning . . .
5 Challenges for Resiliency as Core Count Grows
[Figure omitted: courtesy of John Daly]
6 Resiliency: Issues with System-wide Checkpointing Approach
• process-level checkpointing: highly I/O intensive
  • increased capacity of the I/O system exacerbates the MTTI problem (see the sketch after this list)
  • also increases power costs!
• note: applications tend to have poor memory scalability
• application-level checkpointing is better, but complex
[Figure omitted: courtesy of Lucy Nowell & Sonia Sachs]
• other approaches
  • hardware redundancy (cores, network links, nodes)
  • algorithmic methods
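One way to quantify the MTTI problem is Young's first-order approximation for the optimal checkpoint interval, tau_opt ≈ sqrt(2 δ M), where δ is the time to write one checkpoint and M is the MTTI. A minimal sketch with illustrative (not measured) numbers:

```python
import math

def optimal_interval(ckpt_cost_s, mtti_s):
    # Young's approximation: tau_opt = sqrt(2 * delta * MTTI)
    return math.sqrt(2 * ckpt_cost_s * mtti_s)

ckpt_cost = 30 * 60                # assume 30 min to drain memory to I/O
for mtti_h in (24, 4, 1):          # MTTI shrinks as core counts grow
    tau = optimal_interval(ckpt_cost, mtti_h * 3600)
    print(f"MTTI {mtti_h:2d} h: checkpoint every {tau / 3600:.1f} h, "
          f"~{ckpt_cost / tau:.0%} of runtime checkpointing")
```

At a one-hour MTTI this machine would spend roughly half its time writing checkpoints, which is why system-wide checkpointing alone does not scale.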
7 Energy / Power Challenges: On-chip
• energy cost to access data off-chip (DRAM) is 100× more than from registers
  • going across the system network costs only 2× as much (see the sketch below)
• within each manycore chip, there will be non-trivial interconnects too
• maintaining coherency is not scalable
  • cache misses and atomic instructions have O(p) energy cost each (O(p²) total!)
  • cache line size is sub-optimal for messages on on-chip networks
• soon, interconnects are expected to consume 50× more energy than logic circuits!
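A back-of-envelope rendering of the ratios above (register = 1, off-chip DRAM ≈ 100×, network ≈ 2× DRAM); the absolute picojoule scale is an assumed figure for illustration only.

```python
# Relative data-movement energy per 64-bit word, per this slide's ratios.
REGISTER_PJ = 1.0                    # assumed absolute scale (pJ per word)
COST_PJ = {
    "register": REGISTER_PJ,
    "dram":     100 * REGISTER_PJ,   # off-chip: ~100x a register access
    "network":  200 * REGISTER_PJ,   # across the network: ~2x DRAM
}

def movement_energy_j(n_words, level):
    return n_words * COST_PJ[level] * 1e-12

n = 2 ** 27                          # one pass over a 1 GiB double array
for level in COST_PJ:
    print(f"{level:8s}: {movement_energy_j(n, level):.4f} J")
```

The lesson is the usual one: at this scale, energy goes into moving data, not computing on it.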
8 Energy / Power Challenges: System Wide
• it is essential not only that the system's peak power be bounded, but that its current energy use be minimized
• need to be able to throttle down components when not in (full) use
  • network and I/O as well as compute processors
  • e.g. the Intel Single-chip Cloud Computer has frequency and voltage domains (P ∝ fV²; see the sketch after the figure)
• an exascale system will need a hierarchical power and thermal management system
[Figure omitted: Intel Single-chip Cloud Computer die, a mesh of tiles and routers with memory controllers and per-domain frequency/voltage control (courtesy Intel)]
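A sketch of what P ∝ fV² buys: scaling frequency alone saves power linearly, but dropping voltage along with it saves quadratically. The 20% voltage reduction below is a hypothetical figure, not an SCC specification.

```python
def relative_power(f_scale, v_scale):
    # dynamic power P ~ f * V^2, normalized to full frequency and voltage
    return f_scale * v_scale ** 2

print(relative_power(0.5, 1.0))   # halve f only:       0.50 of full power
print(relative_power(0.5, 0.8))   # also drop V by 20%: 0.32 of full power
```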
9 Current Systems: TSUBAME2: Commodity + Accelerators
• system: 2.4 PFLOPS (peak), 1.2 (LINPACK); 1.6 MW (peak)
• 1408 Intel Westmere nodes (6 cores @ 2.93 GHz)
  • 3 NVIDIA Tesla M2050 GPUs (8 GB/s PCI Express) each!
  • memory bandwidth is now 32 GB/s per CPU (see the sketch below)
[Figure omitted: courtesy TITECH]
• 2 PB of fast SSD-backed local storage + 9 PB Lustre (30 nodes)
• unified system and storage interconnect:
  • fat-tree, InfiniBand (12× 324-port + 179× 36-port switches), 8 GB/s per node
• racks: controlled heat-exchangers for high density (32 cooling towers)
• only system in the top ten of the Top500, Green500 and Graph500
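A rough machine-balance (bytes per flop) check for a TSUBAME2 node, using the bandwidth figures above. The peak rates (4 DP flops/cycle/core for Westmere, 515 DP GFLOPS for an M2050) are assumptions, not slide data.

```python
cpu_gflops = 6 * 2.93 * 4    # assumed Westmere peak: ~70 DP GFLOPS/socket
cpu_bw_gbs = 32.0            # per-CPU memory bandwidth (from the slide)
pcie_gbs   = 8.0             # PCI Express rate to each GPU (from the slide)
gpu_gflops = 515.0           # assumed M2050 double-precision peak

print(f"CPU balance:  {cpu_bw_gbs / cpu_gflops:.2f} bytes/flop")   # ~0.46
print(f"GPU via PCIe: {pcie_gbs / gpu_gflops:.3f} bytes/flop")     # ~0.016
```

The PCIe line shows why data must stay resident on the GPU for the accelerators to pay off.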
10 Current Systems: Jaguar: Proprietary + Accelerators
• 1.75 PFLOPS (LINPACK), 2.35 (peak); 3 MW (peak), air cooled
• 18,688 Cray XT5 compute nodes, each with dual hex-core AMD Istanbul CPUs
• SeaStar2+ interconnect: 3D torus (see the hop-count sketch below)
  • based on HyperTransport and proprietary protocols
  • 1 ASIC per node, 6 ports, 9.6 GB/s each
[Figure omitted: Cray SeaStar2+ interconnect ASIC with six 9.6 GB/s links (courtesy Cray)]
ā¢ external 10 PB Lustre file system, 240 GB/s read/write
ā¢ reliability & management system has own 100 Mb/s management fabricbetween blades & cabinet-level controllers
ā¢ expected to be upgraded in 2012 with GPU accelerators (20 PF peak)
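To give a feel for torus latency, a hop-count sketch; the dimensions below are illustrative, not Jaguar's actual configuration.

```python
def torus_diameter(dims):
    # with wrap-around links, the farthest node along a dimension of
    # size k is k // 2 hops away
    return sum(k // 2 for k in dims)

dims = (24, 16, 24)                    # hypothetical 3D torus, 9216 nodes
print(torus_diameter(dims), "hops worst case")    # 12 + 8 + 12 = 32
```

Worst-case distance grows with the cube root of the node count, versus logarithmically for a fat tree; the trade-off is wiring cost.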
11 Current Systems: TianHe-1A: Proprietary + Accelerators
• 4.7 PFLOPS (peak), 2.3 (LINPACK); 4 MW
• 700 m² area (112 compute, 8 service, 6 comm. & 14 I/O racks)
• 7168 compute nodes: 2 Xeon X5670 hex-core CPUs + NVIDIA M2050 GPU
• service nodes: 2 FT-1000 CPUs (similar architecture to UltraSPARC T3); 2 PB storage
• high-radix routing chip, 8 × 1.2 GB/s links
• hierarchical fat tree (11× 384-port switches, optical fibres) to connect racks (8× 8 torus)
• closed air cooling (2 air-conditioning units per rack)
• 3-layer power management
[Figure omitted: courtesy Yang et al, NUDT]
• predecessor system was based on VLIW "stream processors" (SPARC ISA)
• successor system (FeiTeng, v3): SIMD multicore chip (SPARC ISA)
12 Current Systems: Blue Gene/Q: Proprietary + Multicore
• nodes: 1.6 GHz PowerPC A2: 18 cores (1 for OS, 1 spare), 4-way HMT, quad d.p. SIMD (205 GFLOPS @ 55 W)
• on-chip network: crossbar to 0.8 GHz 32 MB L2$
• integrates logic for chip-to-chip communications in a 5D torus
• supports Transactional Memory and speculative execution (!)
[Figure omitted: courtesy The Register]
• 4-rack system achieved 0.67 PFLOPS (0.84 peak) in Nov '11
• top on the Green500 @ 2 GFLOPS/W: 165% (77%) more efficient than Tianhe-1A (Tsubame 2.0); see the arithmetic below
• Sequoia system at LLNL expected to reach 20 PFLOPS (peak) in 2012 (96 racks, 6 MW)
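A quick check of the efficiency figures quoted on this slide:

```python
# per-chip peak efficiency: 205 GFLOPS at 55 W (from this slide)
print(f"A2 chip peak: {205 / 55:.1f} GFLOPS/W")       # ~3.7

# Sequoia: 20 PFLOPS peak in 6 MW (both from this slide)
print(f"Sequoia peak: {20e6 / 6e6:.1f} GFLOPS/W")     # ~3.3
```

The gap from ~3.7 peak down to the 2 GFLOPS/W Green500 figure reflects sustained (LINPACK) rather than peak throughput, plus the non-chip components of the system.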
13 Outlook and Conclusions
• 2 main swim-lanes: accelerators vs customized multicore
  • accelerators will be integrated on-chip (e.g. Fusion, Knights Corner)
  • in either case, some heterogeneity (cores to support OS, comm.)
• NVIDIA's vision: the Echelon architecture
  • heterogeneous chips: 28 in-order stream processors (VLIW with register hierarchy) + 8 post-RISC cores
  • self-aware RTS + OS, locality-aware compiler + autotuner
• custom (proprietary) components needed for power efficiency
• 100s of cores per chip required: issues for programming models (coherency, core hierarchies) and scalability of communication
• data locality is crucial: deeper hierarchies over several scales expected
• interconnects: fat-tree or high-D torus or combination?
• "self-aware" systems to manage power and reliability (also redundancy)
• latency (esp. for globals) and power are likely the limiting issues