![Page 1: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/1.jpg)
1
EXASCALE RADIO ASTRONOMY
M Clark
![Page 2: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/2.jpg)
2
Contents
- GPU Computing
- GPUs for Radio Astronomy
- The problem is power
- Astronomy at the Exascale
![Page 3: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/3.jpg)
3
The March of GPUs
![Page 4: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/4.jpg)
4
What is a GPU?
- Kepler K20X (2012)
  - 2688 processing cores
  - 3995 SP Gflops peak
- Effective SIMD width of 32 threads (warp)
- Deep memory hierarchy
  - As we move away from registers
    - Bandwidth decreases
    - Latency increases
- Limited on-chip memory
  - 65,536 32-bit registers per SM
  - 48 KiB shared memory per SM
  - 1.5 MiB L2
- Programmed using a thread model
![Page 5: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/5.jpg)
5
Minimum Port, Big Speed-up
Application code: GPU (only critical functions) + CPU (rest of sequential CPU code)
![Page 6: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/6.jpg)
6
![Page 7: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/7.jpg)
7
Strong CUDA GPU Roadmap
[Figure: normalized SGEMM / W by year (2008 to 2016, y-axis 0 to 20) across GPU generations: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), Pascal (Unified Memory, 3D Memory, NVLink)]
![Page 8: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/8.jpg)
8
Introducing NVLINK and Stacked Memory
- NVLINK
  - GPU high speed interconnect
  - 80-200 GB/s
  - Planned support for POWER CPUs
- Stacked Memory
  - 4x Higher Bandwidth (~1 TB/s)
  - 3x Larger Capacity
  - 4x More Energy Efficient per bit
![Page 9: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/9.jpg)
9
NVLink Enables Data Transfer At Speed of CPU Memory
[Diagram: Tesla GPU with stacked memory (HBM, 1 Terabyte/s) linked to a CPU by NVLink (80 GB/s); CPU attached to DDR memory (DDR4, 50-75 GB/s)]
![Page 10: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/10.jpg)
10
Radio Telescope Data Flow
[Diagram: N digital RF samplers feed the correlator (real-time), which feeds calibration & imaging (real-time and post real-time). Scaling labels as extracted: O(N), O(N); O(N^2), O(N log N); O(N), O(N^2)]
![Page 11: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/11.jpg)
11
Where can GPUs be Applied?
- Cross correlation: GPUs are ideal
  - Performance similar to CGEMM
  - High performance open-source library: https://github.com/GPU-correlators/xGPU
- Calibration and Imaging
  - Gridding: coordinate mapping of input data to a regular grid
    - Arithmetic intensity scales with convolution kernel size
    - Compute-bound problem maps well to GPUs
    - Dominant time sink in compute pipeline
  - Other image processing steps
    - CUFFT: highly optimized Fast Fourier Transform library
    - PFB: computational intensity increases with number of taps
    - Coordinate transformations and resampling
![Page 12: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/12.jpg)
12
GPUs in Radio Astronomy
- Already an essential tool in radio astronomy
  - ASKAP: Western Australia
  - LEDA: United States of America
  - LOFAR: Netherlands (+ Europe)
  - MWA: Western Australia
  - NCRA: India
  - PAPER: South Africa

[Map of sites: LOFAR, LEDA, MWA, ASKAP, PAPER]
![Page 13: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/13.jpg)
13
Cross Correlation on GPUs
- Cross correlation is essentially GEMM
- Hierarchical locality
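To make the GEMM analogy concrete, here is a minimal pure-Python sketch for a single frequency channel. It is not xGPU's implementation; the function name `correlate` and the toy `samples` data are hypothetical. Each visibility is an inner product over time between one input and the conjugate of another, so the full visibility matrix is the product of the sample matrix with its own conjugate transpose.

```python
def correlate(samples):
    """Return V[i][j] = sum over t of x_i(t) * conj(x_j(t)).

    samples[i][t] is the complex voltage of input i at time t (illustrative
    layout, assumed here). The triple loop is the same shape as a matrix
    multiply X * X^H, which is why the problem behaves like GEMM.
    """
    n = len(samples)
    vis = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # lower triangle, autocorrelations included
            acc = 0j
            for xi, xj in zip(samples[i], samples[j]):
                acc += xi * xj.conjugate()
            vis[i][j] = acc
            vis[j][i] = acc.conjugate()  # Hermitian symmetry gives the rest
    return vis


# Two inputs, three time samples (made-up values).
samples = [
    [1 + 1j, 2 - 1j, 0 + 1j],
    [1 - 1j, 0 + 2j, 1 + 0j],
]
vis = correlate(samples)
```

Because the result is Hermitian, only half of the matrix needs computing, and each loaded sample is reused across a whole row or column of outputs. That reuse is the locality the slide refers to.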
![Page 14: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/14.jpg)
14
Correlator Efficiency
[Figure: X-engine GFLOPS per watt (log scale, 1 to 64) by year (2008 to 2016) for Tesla, Fermi, Kepler, Maxwell and Pascal; annotated points: 0.35 TFLOPS sustained, >1 TFLOPS sustained, >2.5 TFLOPS sustained]
![Page 15: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/15.jpg)
15
Software Correlation Flexibility
- Why do software correlation?
  - Software correlators inherently have a great degree of flexibility
- Software correlation can do on-the-fly reconfiguration
  - Subset correlation at increased bandwidth
  - Subset correlation at decreased integration time
  - Pulsar binning
  - Easy classification of data (RFI threshold)
- Software is portable: correlator unchanged since 2010
  - Already running on 2016 architecture
![Page 16: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/16.jpg)
16
HPC’s Biggest Challenge: Power
Power of a 300 Petaflop CPU-only supercomputer = power for the city of San Francisco
![Page 17: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/17.jpg)
17
GPUs Power World’s 10 Greenest Supercomputers
| Green500 Rank | MFLOPS/W | Site |
|---|---|---|
| 1 | 4,503.17 | GSIC Center, Tokyo Tech |
| 2 | 3,631.86 | Cambridge University |
| 3 | 3,517.84 | University of Tsukuba |
| 4 | 3,185.91 | Swiss National Supercomputing (CSCS) |
| 5 | 3,130.95 | ROMEO HPC Center |
| 6 | 3,068.71 | GSIC Center, Tokyo Tech |
| 7 | 2,702.16 | University of Arizona |
| 8 | 2,629.10 | Max-Planck |
| 9 | 2,629.10 | (Financial Institution) |
| 10 | 2,358.69 | CSIRO |
| 37 | 1,959.90 | Intel Endeavor (top Xeon Phi cluster) |
| 49 | 1,247.57 | Météo France (top CPU cluster) |
![Page 18: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/18.jpg)
18
The End of Historic Scaling
C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
![Page 19: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/19.jpg)
19
The End of Voltage Scaling
In the Good Old Days
- Leakage was not important, and voltage scaled with feature size
  - $L' = L/2$
  - $V' = V/2$
  - $E' = CV^2 = E/8$
  - $f' = 2f$
  - $D' = 1/L^2 = 4D$
  - $P' = P$
- Halve L and get 4x the transistors and 8x the capability for the same power

The New Reality
- Leakage has limited threshold voltage, largely ending voltage scaling
  - $L' = L/2$
  - $V' \approx V$
  - $E' = CV^2 = E/2$
  - $f' = 2f$
  - $D' = 1/L^2 = 4D$
  - $P' = 4P$
- Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in ¼ the area.
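The two sets of scaling factors can be checked with a few lines of arithmetic. This is a sketch under two assumptions not spelled out on the slide: capacitance C halves along with L, and power is taken per unit area as energy per operation times frequency times density. The function name `shrink` is hypothetical.

```python
from fractions import Fraction


def shrink(voltage_scale):
    """Scale factors after halving feature size L.

    voltage_scale is V'/V. Assumes C' = C/2 (capacitance tracks L), so
    switching energy E = C V^2 scales as (1/2) * voltage_scale**2.
    """
    energy = Fraction(1, 2) * voltage_scale ** 2
    frequency = Fraction(2)   # f' = 2f
    density = Fraction(4)     # D' = 4D
    power = energy * frequency * density   # per unit area
    capability = frequency * density       # operations per second per area
    return {"energy": energy, "power": power, "capability": capability}


good_old_days = shrink(Fraction(1, 2))   # voltage scales with L
new_reality = shrink(Fraction(1))        # voltage stays roughly constant
```

With voltage scaling the result is E/8, the same power and 8x the capability. With constant voltage it is E/2 and 4x the power for the same 8x capability. Holding power fixed then means using a quarter of the area: the original transistor count at twice the clock, or 2x the capability.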
![Page 20: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/20.jpg)
20
Major Software Implications
- Need to expose massive concurrency
  - Exaflop at O(GHz) clocks => O(billion-way) parallelism!
- Need to expose and exploit locality
  - Data motion more expensive than computation
  - More than 100:1 global vs. local energy
- Need to start addressing resiliency in the applications
![Page 21: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/21.jpg)
21
How Parallel is Astronomy?
- SKA1-LOW specifications
  - 1024 dual-pol stations => 2,096,128 visibilities
  - 262,144 frequency channels
  - 300 MHz bandwidth
- Correlator
  - 5 Pflops of computation
  - Data-parallel across visibilities
  - Task-parallel across frequency channels
  - O(trillion-way) parallelism
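The visibility count above is consistent with counting pairs of distinct signal paths, assuming two polarizations per station (2048 inputs) and no autocorrelations. That reading is inferred from the number itself, not stated on the slide. A short check of the arithmetic, with hypothetical variable names:

```python
stations = 1024
polarizations = 2          # "dual-pol"
channels = 262_144

inputs = stations * polarizations
# Pairs of distinct inputs (assumed interpretation of "visibilities").
visibilities = inputs * (inputs - 1) // 2
# One independent work item per (visibility, frequency channel) pair.
correlator_work_items = visibilities * channels
```

The product is about 5.5 x 10^11 independent accumulations, which is the order of magnitude the slide rounds to "trillion-way".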
![Page 22: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/22.jpg)
22
How Parallel is Astronomy?
- SKA1-LOW specifications
  - 1024 dual-pol stations => 2,096,128 visibilities
  - 262,144 frequency channels
  - 300 MHz bandwidth
- Gridding (W-projection)
  - Kernel size 100x100
  - Parallel across kernel size and visibilities (J. Romein)
  - O(10 billion-way) parallelism
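A minimal sketch of convolutional gridding shows where that parallelism comes from. This is a plain serial illustration, not the W-projection code or Romein's GPU algorithm; the function `grid_visibilities`, the integer grid coordinates and the toy 2x2 kernel are all assumptions made for the example.

```python
def grid_visibilities(grid_size, kernel, visibilities):
    """Add each visibility to the grid, spread over a k x k kernel.

    visibilities is a list of (u, v, value), where (u, v) is the grid cell
    of the kernel's top-left corner (illustrative convention). Returns the
    grid and the number of multiply-add updates performed.
    """
    grid = [[0j] * grid_size for _ in range(grid_size)]
    k = len(kernel)
    updates = 0
    for u, v, value in visibilities:
        for dy in range(k):
            for dx in range(k):
                grid[v + dy][u + dx] += kernel[dy][dx] * value
                updates += 1
    return grid, updates


kernel = [[0.25, 0.25], [0.25, 0.25]]
grid, updates = grid_visibilities(4, kernel, [(0, 0, 4 + 0j), (1, 1, 8j)])

# Same counting applied to the slide's numbers: kernel cells x visibilities.
ska_updates = 100 * 100 * 2_096_128
```

Every (kernel cell, visibility) pair is one multiply-add, giving roughly 2 x 10^10 updates for the slide's figures. Updates from different visibilities can land on the same grid cell, so a parallel version has to coordinate those writes.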
![Page 23: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/23.jpg)
23
Energy Efficiency Drives Locality
[Figure: energy costs on a 28 nm IC (20 mm die). Labels as extracted: 64-bit DP; 256 bits; 20 pJ, 26 pJ, 256 pJ, 1000 pJ; 256-bit access to 8 kB SRAM: 50 pJ; efficient off-chip link: 500 pJ; DRAM Rd/Wr: 16000 pJ]
![Page 24: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/24.jpg)
24
Energy Efficiency Drives Locality
[Figure: energy per operation in picojoules (log scale, 10 to 100000) for FMA, registers, L1, L2, DRAM, stacked, point to point]
![Page 25: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/25.jpg)
25
Energy Efficiency Drives Locality
- This is observable today
- We have lots of tunable parameters:
  - Register tile size: how much work should each thread do?
  - Thread block size: how many threads should work together?
  - Input precision: size of the input words
- Quick and dirty cross correlation example
  - 4x4 => 8x8 register tiling
  - 8.5% faster, 5.5% lower power => 14% improvement in GFLOPS / watt
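One way to see why a larger register tile helps is to count arithmetic per value loaded. The sketch below assumes a simple model in which a thread computing a tile x tile block of outputs loads one row vector and one column vector of inputs per step; the function name `tile_intensity` is hypothetical and the model ignores caching effects.

```python
def tile_intensity(tile):
    """Multiply-accumulates per input value loaded for a tile x tile block."""
    loads = 2 * tile      # one row of inputs plus one column of inputs
    macs = tile * tile    # every row value meets every column value
    return macs / loads


# Combine the slide's two measurements into a GFLOPS / watt ratio.
speedup = 1.085               # 8.5% faster
power_ratio = 1 - 0.055       # 5.5% lower power
efficiency_gain = speedup / power_ratio
```

Under this model, going from 4x4 to 8x8 doubles the work done per loaded value, so fewer trips to the more expensive levels of the memory hierarchy are needed. Dividing the speedup by the power ratio gives a gain of a little under 15%, in line with the roughly 14% quoted.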
![Page 26: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/26.jpg)
26
SKA1 LOW Sketch
[Diagram: digital RF samplers, correlator, calibration & imaging. Labels as extracted: N = 1024; 8-bit digitization; O(10) PFLOPS; O(100) PFLOPS; data rates 10 Tb/s and 50 Tb/s]
![Page 27: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/27.jpg)
27
Energy Efficiency Drives Locality
[Figure: energy per operation in picojoules (log scale, 10 to 100000) for FMA, registers, L1, L2, DRAM, stacked, point to point]
![Page 28: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/28.jpg)
28
Do we need Moore’s Law?
- Moore’s law comes from shrinking process
- Moore’s law is slowing down
  - Dennard scaling is dead
  - Increasing wafer costs mean that it takes longer to move to the next process
![Page 29: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/29.jpg)
29
Improving Energy Efficiency @ Iso-Process
- We don’t know how to build the perfect processor
  - Huge focus on improved architecture efficiency
  - Better understanding of a given process
- Compare Fermi vs. Kepler vs. Maxwell architectures @ 28 nm
  - GF117: 96 cores, peak 192 Gflops
  - GK107: 384 cores, peak 770 Gflops
  - GM107: 640 cores, peak 1330 Gflops
- Use cross-correlation benchmark
  - Only measure GPU power
![Page 30: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/30.jpg)
30
Improving Energy Efficiency @ 28 nm
[Figure: bar chart of GFLOPS / watt (0 to 25) for Fermi, Kepler and Maxwell; bars annotated 80%, 55%, 80%]
![Page 31: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/31.jpg)
31
How Hot is Your Supercomputer?
1. TSUBAME-KFC: Tokyo Tech, oil cooled, 4503 MFLOPS / watt
2. Wilkes Cluster: U. Cambridge, air cooled, 3631 MFLOPS / watt

Number 1 is 24% more efficient than number 2
![Page 32: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/32.jpg)
32
Temperature is Power
- Power is dynamic and static
  - Dynamic power is work
  - Static power is leakage
- Dominant term from sub-threshold leakage
- Voltage terms:
  - Vs: gate to source voltage
  - Vth: switching threshold voltage
  - n: transistor sub-threshold swing coefficient
- Device specifics:
  - A: technology specific constant
  - L, W: device channel length & width
- Thermal voltage:
  - Boltzmann constant: 8.62×10^-5 eV/K
  - 26 mV at room temperature
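The leakage equation itself did not survive in the transcript. As a rough illustration of the temperature dependence, the sketch below uses a common textbook form of the sub-threshold current, proportional to v_T^2 * exp((V_gs - V_th) / (n * v_T)), with the drain-voltage factor and the constant A * W / L dropped. This form and the parameter values (V_th = 0.3 V, n = 1.5) are assumptions for illustration, not values from the slides.

```python
import math

BOLTZMANN_EV_PER_K = 8.62e-5   # value quoted on the slide


def thermal_voltage(temperature_k):
    """Thermal voltage kT/q in volts (k in eV/K makes q drop out)."""
    return BOLTZMANN_EV_PER_K * temperature_k


def relative_leakage(temperature_k, v_gs, v_th, n):
    """Sub-threshold leakage up to a constant factor (assumed textbook form).

    Holds V_th fixed with temperature, which is a simplification.
    """
    vt = thermal_voltage(temperature_k)
    return vt ** 2 * math.exp((v_gs - v_th) / (n * vt))


room = thermal_voltage(300.0)   # close to the 26 mV quoted on the slide
# Hypothetical device: transistor off (V_gs = 0), V_th = 0.3 V, n = 1.5.
ratio = relative_leakage(360.0, 0.0, 0.3, 1.5) / relative_leakage(300.0, 0.0, 0.3, 1.5)
```

Because the thermal voltage sits in the denominator of a negative exponent, leakage grows steeply with temperature under this model. That is the sense in which a cooler chip draws less static power.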
![Page 33: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/33.jpg)
33
Temperature is Power
- Tesla K20m: GK110, 28 nm
- GeForce GTX 580: GF110, 40 nm
![Page 34: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/34.jpg)
34
Tuning for Power Efficiency
- A given processor does not have a fixed power efficiency
- Dependent on
  - Clock frequency
  - Voltage
  - Temperature
  - Algorithm
- Tune in this multi-dimensional space for power efficiency
  - E.g., cross-correlation on Kepler K20: 12.96 -> 18.34 GFLOPS / watt
- Bad news: no two processors are exactly alike
![Page 35: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/35.jpg)
35
Precision is Power
- Power scales approximately with the square of the multiplier width
  - Most computation done in FP32 / FP64
  - Should use the minimum precision required by science needs
- Maxwell GPUs have 16-bit integer multiply-add at FP32 rate
- Algorithms should increasingly use hierarchical precision
  - Only invoke high precision when necessary
- Signal processing folks have known this for a long time
  - Lesson feeding back into the HPC community...
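As a rough guide to what the quadratic rule implies, the sketch below compares multiplier cost across formats using the width of the value actually multiplied (the significand for floating point, including the implicit leading bit). It is a back-of-envelope model only: it ignores exponent handling, adders and data movement, and the names `SIGNIFICAND_BITS` and `relative_multiplier_cost` are hypothetical.

```python
# Bits fed to the multiplier array for each format.
SIGNIFICAND_BITS = {"int8": 8, "int16": 16, "fp32": 24, "fp64": 53}


def relative_multiplier_cost(fmt, baseline="fp32"):
    """Multiplier power relative to the baseline, assuming cost ~ width^2."""
    return (SIGNIFICAND_BITS[fmt] / SIGNIFICAND_BITS[baseline]) ** 2


costs = {fmt: relative_multiplier_cost(fmt) for fmt in SIGNIFICAND_BITS}
```

Under this model an FP64 multiply costs close to five times an FP32 one, while 16-bit and 8-bit integer multiplies cost under half and about a ninth respectively. That gap is why 8-bit digitized telescope data is a good fit for low-precision arithmetic.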
![Page 36: 1 M Clark. 2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale](https://reader035.vdocuments.mx/reader035/viewer/2022062621/551c2acc550346b24f8b60a2/html5/thumbnails/36.jpg)
36
Conclusions
- Astronomy has an insatiable appetite for compute
- Many-core processors are a perfect match to the processing pipeline
- Power is a problem, but
  - Astronomy has oodles of parallelism
  - Key algorithms possess locality
  - Precision requirements are well understood
- Scientists and engineers are wedded to the problem
- Astronomy is perhaps the ideal application for the exascale