Parallel Programming Many-Core Computing: Intro (1/5)
Schedule
1. Introduction, performance metrics & analysis
2. Many-core hardware
3. CUDA class 1: basics
4. CUDA class 2: advanced
5. Case study: LOFAR telescope with many-cores
What are many-cores?
From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”
How many is many?
Several tens of cores
How are they different from multi-core CPUs?
Non-uniform memory access (NUMA)
Private memories
Network-on-chip
Examples
Multi-core CPUs (48-core AMD Magny-Cours)
Graphics Processing Units (GPUs)
GPGPU = general-purpose programming on GPUs
Cell processor (PlayStation 3)
Server processors (Sun Niagara)
Many-core questions
The search for performance:
Build hardware: what architectures?
Evaluate hardware: what metrics? How do we measure?
Use it: what workloads? What performance can we expect?
Program it: how to program? How to optimize?
Benchmark: how to analyze performance?
Today’s Topics
Introduction
Why many-core programming?
History
Hardware introduction
Performance model: Arithmetic Intensity and Roofline
Why do we need many-cores?
[Chart: peak GFLOPS over time of NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere)]
China's Tianhe-1A
#5 in the TOP500 list
4.701 PFLOPS peak
2.566 PFLOPS sustained
14,336 Intel Xeon X5670 processors
7,168 NVIDIA Tesla M2050 GPUs x 448 cores = 3,211,264 GPU cores
Power efficiency
Graphics in 1980
Graphics in 2000
Graphics now: GPU movie
Realism of modern GPUs
http://www.youtube.com/watch?v=bJDeipvpjGQ&feature=player_embedded#t=49s
Courtesy techradar.com
Why do we need many-cores?
Performance: large-scale parallelism
Power efficiency: use transistors more efficiently
Price (GPUs): huge market, bigger than Hollywood; mass production, economy of scale; “spotty teenagers” pay for our HPC needs!
GPGPU history
GPU transistor counts, 1995–2010:
RIVA 128: 3M transistors
GeForce 256: 23M transistors
GeForce 3: 60M transistors
GeForce FX: 125M transistors
GeForce 8800: 681M transistors
“Fermi”: 3B transistors
GPGPU History
Use graphics primitives for HPC
Ikonas [England 1978]
Pixel Machine [Potmesil & Hoffert 1989]
Pixel-Planes 5 [Rhoades et al. 1992]
Programmable shaders, around 1998
DirectX / OpenGL
Map the application onto the graphics domain!
GPGPU
Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...
CUDA C/C++ Continuous Innovation (2007–2010)
CUDA Toolkit 1.0 (July 07): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
CUDA Toolkit 1.1 (Nov 07): Win XP 64, atomics support, multi-GPU support
CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OSX, 3D textures, HW interpolation
CUDA Visual Profiler 2.2
CUDA Toolkit 2.3 (July 09): DP FFT, 16-32 conversion intrinsics, performance enhancements
cuda-gdb HW debugger
Parallel Nsight Beta
CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi support, tools updates, driver / RT interop
CUDA Tools
Parallel Nsight: Visual Studio
Visual Profiler: for Linux
cuda-gdb: for Linux
Many-core hardware introduction
The search for performance
The search for performance
We have M(o)ore transistors …
Bigger cores? We are hitting the walls: power, memory, instruction-level parallelism (ILP)
How do we use them? Large-scale parallelism: many-cores!
Choices …
Core type(s): fat or slim? Vectorized (SIMD)? Homogeneous or heterogeneous?
Number of cores: few or many?
Memory: shared-memory or distributed-memory?
Parallelism: instruction-level parallelism, threads, vectors, …
A taxonomy
Based on “field-of-origin”:
General-purpose
Intel, AMD
Graphics Processing Units (GPUs)
NVIDIA, ATI
Gaming/Entertainment
Sony/Toshiba/IBM
Embedded systems
Philips/NXP, ARM
Servers
Oracle, IBM, Intel
High Performance Computing
Intel, IBM, …
General Purpose Processors
Architecture
Few fat cores
Vectorization (SSE, AVX)
Homogeneous
Stand-alone
Memory
Shared, multi-layered
Per-core cache and shared cache
Programming
Processes (OS Scheduler)
Message passing
Multi-threading
Coarse-grained parallelism
Server-side
General-purpose-like, with more hardware threads
Lower performance per thread, but high throughput
Examples
Sun Niagara II
8 cores x 8 threads
IBM POWER7
8 cores x 4 threads
Intel SCC
48 cores, all can run their own OS
Graphics Processing Units
Architecture
Hundreds/thousands of slim cores
Homogeneous
Accelerator
Memory
Very complex hierarchy
Both shared and per-core
Programming
Off-load model
Many fine-grained symmetrical threads
Hardware scheduler
Cell/B.E.
Architecture
Heterogeneous
8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
Memory
Per-core memory, network-on-chip
Programming
User-controlled scheduling
6 levels of parallelism, all under user control
Fine- and coarse-grain parallelism
Take-home message
Variety of platforms
Core types & counts
Memory architecture & sizes
Parallelism layers & types
Scheduling
Open questions:
Why so many?
How many platforms do we need?
Can any application run on any platform?
Hardware performance metrics
Clock frequency [GHz]: absolute hardware speed (memories, CPUs, interconnects)
Operational speed [GFLOPs]: operations per cycle
Memory bandwidth [GB/s]: differs a lot between different memories on the chip
Power [Watt]
Derived metrics
FLOP/Byte, FLOP/Watt
Theoretical peak performance
Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency
Examples from DAS-4:
Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
DRAM memory bandwidth
Throughput = memory bus frequency * transfers per cycle * bus width
Memory clock != CPU clock!
The result is in bits; divide by 8 for GB/s
Examples:
Intel Core i7, DDR3: 1.333 GHz * 2 * 64 bits = 21 GB/s
NVIDIA GTX 580, GDDR5: 1.002 GHz * 4 * 384 bits = 192 GB/s
ATI HD 6970, GDDR5: 1.375 GHz * 4 * 256 bits = 176 GB/s
Memory bandwidths
On-chip memory can be orders of magnitude faster: registers, shared memory, caches, …
E.g., the AMD HD 7970’s L1 cache achieves 2 TB/s
For other memories, bandwidth depends on the interconnect:
Intel’s technology: QPI (QuickPath Interconnect), 25.6 GB/s
AMD’s technology: HT3 (HyperTransport 3), 19.2 GB/s
Accelerators: PCIe 2.0, 8 GB/s
Power
Chip manufacturers specify a Thermal Design Power (TDP)
We can measure the dissipated power of the whole system; it is typically (much) lower than the TDP
Power efficiency: FLOPs/Watt
Examples (using theoretical peak and TDP):
Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W
ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
Summary

| Platform | Cores | Threads/ALUs | GFLOPS | Bandwidth (GB/s) | FLOPs/Byte |
|---|---|---|---|---|---|
| Sun Niagara 2 | 8 | 64 | 11.2 | 76 | 0.1 |
| IBM BG/P | 4 | 8 | 13.6 | 13.6 | 1.0 |
| IBM POWER7 | 8 | 32 | 265 | 68 | 3.9 |
| Intel Core i7 | 4 | 16 | 85 | 25.6 | 3.3 |
| AMD Barcelona | 4 | 8 | 37 | 21.4 | 1.7 |
| AMD Istanbul | 6 | 6 | 62.4 | 25.6 | 2.4 |
| AMD Magny-Cours | 12 | 12 | 125 | 25.6 | 4.9 |
| Cell/B.E. | 8 | 8 | 205 | 25.6 | 8.0 |
| NVIDIA GTX 580 | 16 | 512 | 1581 | 192 | 8.2 |
| NVIDIA GTX 680 | 8 | 1536 | 3090 | 192 | 16.1 |
| AMD HD 6970 | 384 | 1536 | 2703 | 176 | 15.4 |
| AMD HD 7970 | 32 | 2048 | 3789 | 264 | 14.4 |
Absolute hardware performance
Only achieved under optimal conditions:
Processing units 100% used
All parallelism 100% exploited
All data transfers at maximum bandwidth
No application behaves like this; it is even difficult to write micro-benchmarks that do
Performance analysis
Operational Intensity and the Roofline model
Software performance metrics (the 3 P’s)
Performance:
Execution time
Speed-up vs. the best available sequential implementation
Achieved GFLOPs (computational efficiency)
Achieved GB/s (memory efficiency)
Productivity and Portability:
Programmability
Production costs
Maintenance costs
Arithmetic intensity
The number of arithmetic (floating-point) operations per byte of memory that is accessed
Is the program compute-intensive or data-intensive on a particular architecture?
Ignore “overheads”: loop counters, array index calculations, etc.
RGB to gray
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
Pixel pixel = RGB[y][x];
gray[y][x] =
0.30 * pixel.R
+ 0.59 * pixel.G
+ 0.11 * pixel.B;
}
}
2 additions, 3 multiplies = 5 operations
3 reads, 1 write = 4 memory accesses
AI = 5/4 = 1.25
Compute or memory intensive?
[Bar chart: the FLOPs/Byte balance of each platform from the summary table (Sun Niagara 2 through AMD HD 6970), on a 0–17 scale, compared against the RGB-to-gray AI of 1.25]
Applications’ arithmetic intensity
O(1): SpMV, BLAS 1 & 2; stencils (PDEs); lattice methods
O(log(N)): FFTs
O(N): dense linear algebra (BLAS3); particle methods
Operational intensity
The number of operations per byte of DRAM traffic
Differences with arithmetic intensity:
Operations, not just arithmetic ones
Caches: accesses are counted “after they have been filtered by the cache hierarchy”; not between processor and cache, but between cache and DRAM memory
Attainable performance
Attainable GFLOPs/sec = min(Peak Floating-Point Performance, Peak Memory Bandwidth * Operational Intensity)
The Roofline model
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: comparing architectures
AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17; AMD Opteron X4: 73.6 GFLOPS, 15 GB/s, ops/byte = 4.9
Roofline: computational ceilings
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: bandwidth ceilings
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: optimization regions
Use the Roofline model
Determine what to do first to gain performance:
Increase the memory streaming rate
Apply in-core optimizations
Increase arithmetic intensity
Reading:
Samuel Williams, Andrew Waterman, David Patterson, “Roofline: an insightful visual performance model for multicore architectures”, Communications of the ACM, 2009