Parallel Programming Many-Core Computing: Intro (1/5)
Schedule
1. Introduction, performance metrics & analysis
2. Many-core hardware
3. CUDA class 1: basics
4. CUDA class 2: advanced
5. Case study: LOFAR telescope with many-cores
What are many-cores?
From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”
How many is many?
Several tens of cores
How are they different from multi-core CPUs?
Non-uniform memory access (NUMA)
Private memories
Network-on-chip
Examples
Multi-core CPUs (48-core AMD Magny-Cours)
Graphics Processing Units (GPUs)
GPGPU = general-purpose programming on GPUs
Cell processor (PlayStation 3)
Server processors (Sun Niagara)
Many-core questions
The search for performance:
Build hardware: what architectures?
Evaluate hardware: what metrics? How do we measure?
Use it: what workloads? What performance can we expect?
Program it: how to program? How to optimize?
Benchmark: how to analyze performance?
Today’s Topics
Introduction
Why many-core programming?
History
Hardware introduction
Performance model: Arithmetic Intensity and Roofline
Why do we need many-cores?
[Chart: peak GFLOPS over time of NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere)]
China's Tianhe-1A
#5 in the TOP500 list
4.701 PFLOPS peak
2.566 PFLOPS sustained
14,336 Intel Xeon X5670 processors
7,168 NVIDIA Tesla M2050 GPUs x 448 cores = 3,211,264 GPU cores
Power efficiency
Graphics in 1980
Graphics in 2000
Graphics now: GPU movie
Realism of modern GPUs
http://www.youtube.com/watch?v=bJDeipvpjGQ&feature=player_embedded#t=49s
Courtesy techradar.com
Why do we need many-cores?
Performance: large-scale parallelism
Power efficiency: use transistors more efficiently
Price (GPUs): huge market, bigger than Hollywood; mass production, economy of scale; “spotty teenagers” pay for our HPC needs!
GPGPU history
GPU transistor counts, 1995–2010:
RIVA 128: 3M transistors
GeForce 256: 23M transistors
GeForce 3: 60M transistors
GeForce FX: 125M transistors
GeForce 8800: 681M transistors
“Fermi”: 3B transistors
GPGPU History
Use graphics primitives for HPC
Ikonas [England 1978]
Pixel Machine [Potmesil & Hoffert 1989]
Pixel-Planes 5 [Rhoades et al. 1992]
Programmable shaders, around 1998
DirectX / OpenGL
Map the application onto the graphics domain!
GPGPU
Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...
CUDA C/C++ Continuous Innovation (2007–2010)
CUDA Toolkit 1.0 (July 07): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
CUDA Toolkit 1.1 (Nov 07): Win XP 64, atomics support, multi-GPU support
CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OSX, 3D textures, HW interpolation
CUDA Visual Profiler 2.2
CUDA Toolkit 2.3 (July 09): DP FFT, 16-32 conversion intrinsics, performance enhancements
cuda-gdb HW debugger
Parallel Nsight Beta
CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi support, tools updates, driver / RT interop
CUDA Tools
Parallel Nsight: Visual Studio
Visual Profiler: for Linux
cuda-gdb: for Linux
Many-core hardware introduction
The search for performance
The search for performance
We have M(o)ore transistors …
Bigger cores? We are hitting the walls: power, memory, instruction-level parallelism (ILP)
How do we use them? Large-scale parallelism: many-cores!
Choices …
Core type(s): fat or slim? Vectorized (SIMD)? Homogeneous or heterogeneous?
Number of cores: few or many?
Memory: shared-memory or distributed-memory?
Parallelism: instruction-level parallelism, threads, vectors, …
A taxonomy
Based on “field-of-origin”:
General-purpose
Intel, AMD
Graphics Processing Units (GPUs)
NVIDIA, ATI
Gaming/Entertainment
Sony/Toshiba/IBM
Embedded systems
Philips/NXP, ARM
Servers
Oracle, IBM, Intel
High Performance Computing
Intel, IBM, …
General Purpose Processors
Architecture
Few fat cores
Vectorization (SSE, AVX)
Homogeneous
Stand-alone
Memory
Shared, multi-layered
Per-core cache and shared cache
Programming
Processes (OS Scheduler)
Message passing
Multi-threading
Coarse-grained parallelism
Server-side
General-purpose-like, with more hardware threads
Lower performance per thread, but high throughput
Examples
Sun Niagara II
8 cores x 8 threads
IBM POWER7
8 cores x 4 threads
Intel SCC
48 cores, all can run their own OS
Graphics Processing Units
Architecture
Hundreds/thousands of slim cores
Homogeneous
Accelerator
Memory
Very complex hierarchy
Both shared and per-core
Programming
Off-load model
Many fine-grained symmetrical threads
Hardware scheduler
Cell/B.E.
Architecture
Heterogeneous
8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
Memory
Per-core memory, network-on-chip
Programming
User-controlled scheduling
6 levels of parallelism, all under user control
Fine- and coarse-grain parallelism
Take-home message
Variety of platforms
Core types & counts
Memory architecture & sizes
Parallelism layers & types
Scheduling
Open questions:
Why so many?
How many platforms do we need?
Can any application run on any platform?
Hardware performance metrics
Clock frequency [GHz]: absolute hardware speed (memories, CPUs, interconnects)
Operational speed [GFLOPs]: operations per cycle
Memory bandwidth [GB/s]: differs a lot between different memories on the chip
Power [Watt]
Derived metrics
FLOP/Byte, FLOP/Watt
Theoretical peak performance
Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency
Examples from DAS-4:
Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
DRAM memory bandwidth
Throughput = memory bus frequency * transfers per cycle * bus width
Memory clock != CPU clock!
The result is in bits; divide by 8 for GB/s
Examples:
Intel Core i7, DDR3: 1.333 GHz * 2 * 64 bits = 21 GB/s
NVIDIA GTX 580, GDDR5: 1.002 GHz * 4 * 384 bits = 192 GB/s
ATI HD 6970, GDDR5: 1.375 GHz * 4 * 256 bits = 176 GB/s
Memory bandwidths
On-chip memory can be orders of magnitude faster: registers, shared memory, caches, …
E.g., the AMD HD 7970’s L1 cache achieves 2 TB/s
For other memories, bandwidth depends on the interconnect:
Intel’s technology: QPI (QuickPath Interconnect), 25.6 GB/s
AMD’s technology: HT3 (HyperTransport 3), 19.2 GB/s
Accelerators: PCIe 2.0, 8 GB/s
Power
Chip manufacturers specify a Thermal Design Power (TDP)
We can measure the dissipated power of the whole system; it is typically (much) lower than the TDP
Power efficiency: FLOPs/Watt
Examples (using theoretical peak and TDP):
Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W
ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
Summary

| Platform | Cores | Threads/ALUs | GFLOPS | Bandwidth (GB/s) | FLOPs/Byte |
|---|---|---|---|---|---|
| Sun Niagara 2 | 8 | 64 | 11.2 | 76 | 0.1 |
| IBM BG/P | 4 | 8 | 13.6 | 13.6 | 1.0 |
| IBM POWER7 | 8 | 32 | 265 | 68 | 3.9 |
| Intel Core i7 | 4 | 16 | 85 | 25.6 | 3.3 |
| AMD Barcelona | 4 | 8 | 37 | 21.4 | 1.7 |
| AMD Istanbul | 6 | 6 | 62.4 | 25.6 | 2.4 |
| AMD Magny-Cours | 12 | 12 | 125 | 25.6 | 4.9 |
| Cell/B.E. | 8 | 8 | 205 | 25.6 | 8.0 |
| NVIDIA GTX 580 | 16 | 512 | 1581 | 192 | 8.2 |
| NVIDIA GTX 680 | 8 | 1536 | 3090 | 192 | 16.1 |
| AMD HD 6970 | 384 | 1536 | 2703 | 176 | 15.4 |
| AMD HD 7970 | 32 | 2048 | 3789 | 264 | 14.4 |
Absolute hardware performance
Only achieved under optimal conditions:
Processing units 100% used
All parallelism 100% exploited
All data transfers at maximum bandwidth
No application behaves like this; it is even difficult to write micro-benchmarks that do
Performance analysis
Operational Intensity and the Roofline model
Software performance metrics (the 3 P’s)
Performance:
Execution time
Speed-up vs. the best available sequential implementation
Achieved GFLOPs (computational efficiency)
Achieved GB/s (memory efficiency)
Productivity and Portability:
Programmability
Production costs
Maintenance costs
Arithmetic intensity
The number of arithmetic (floating-point) operations per byte of memory that is accessed
Is the program compute-intensive or data-intensive on a particular architecture?
Ignore “overheads”: loop counters, array index calculations, etc.
RGB to gray
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
Pixel pixel = RGB[y][x];
gray[y][x] =
0.30 * pixel.R
+ 0.59 * pixel.G
+ 0.11 * pixel.B;
}
}
2 additions, 3 multiplies = 5 operations
3 reads, 1 write = 4 memory accesses
AI = 5/4 = 1.25
Compute or memory intensive?
[Bar chart: the FLOPs/Byte balance of each platform from the summary table (Sun Niagara 2 through AMD HD 6970), on a 0–17 scale, compared against the RGB-to-gray AI of 1.25]
Applications’ arithmetic intensity
O(1): SpMV, BLAS 1 & 2; stencils (PDEs); lattice methods
O(log(N)): FFTs
O(N): dense linear algebra (BLAS3); particle methods
Operational intensity
The number of operations per byte of DRAM traffic
Differences with arithmetic intensity:
Operations, not just arithmetic ones
Caches: accesses are counted “after they have been filtered by the cache hierarchy”; not between processor and cache, but between cache and DRAM memory
Attainable performance
Attainable GFLOPs/sec = min(Peak Floating-Point Performance, Peak Memory Bandwidth * Operational Intensity)
The Roofline model
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: comparing architectures
AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17; AMD Opteron X4: 73.6 GFLOPS, 15 GB/s, ops/byte = 4.9
Roofline: computational ceilings
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: bandwidth ceilings
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: optimization regions
Use the Roofline model
Determine what to do first to gain performance:
Increase the memory streaming rate
Apply in-core optimizations
Increase arithmetic intensity
Reading:
Samuel Williams, Andrew Waterman, David Patterson, “Roofline: an insightful visual performance model for multicore architectures”, Communications of the ACM, 2009