NVIDIA’S FERMI: THE FIRST COMPLETE GPU COMPUTING ARCHITECTURE
A WHITE PAPER BY PETER N. GLASKOWSKY
Presented by: Ahmad Hammad
Course: CSE 661 - Fall 2011


Page 1:

NVIDIA’S FERMI: THE FIRST COMPLETE GPU COMPUTING ARCHITECTURE
A WHITE PAPER BY PETER N. GLASKOWSKY

Presented by: Ahmad Hammad
Course: CSE 661 - Fall 2011

Page 2:

Outline

- Introduction
- What is GPU Computing?
- Fermi
  - The Programming Model
  - The Streaming Multiprocessor
  - The Cache and Memory Hierarchy
- Conclusion

Page 3:

Introduction

Traditional microprocessor technology sees diminishing returns: improvement in clock speeds and architectural sophistication is slowing, and the focus has shifted to multicore designs. These too are reaching practical limits for personal computing.

Page 4:

Introduction (2)

CPUs are optimized for applications where:
- Work is done by a limited number of threads
- Threads exhibit high data locality
- There is a mix of different operations
- There is a high percentage of conditional branches

CPUs are inefficient for high-performance computing applications:
- The space devoted to integer and floating-point execution units is small
- Most of the CPU’s area, complexity, and the heat it generates are devoted to caches, instruction decoders, branch predictors, and other features that enhance single-threaded performance

Page 5:

Introduction (3)

GPU design targets applications with multiple threads dominated by long sequences of computational instructions.

Over time, GPUs have also become much better at thread handling, data caching, virtual memory management, flow control, and other CPU-like features.

CPUs will never go away, but for these workloads GPUs deliver more cost-effective and energy-efficient performance.

Page 6:

Introduction (4)

The key GPU design goal is to maximize floating-point throughput. Most of the circuitry within each core is dedicated to computation rather than to speculative features, so most of the power consumed by a GPU goes into the application’s actual algorithmic work.

Page 7:

What is GPU Computing?

The use of a graphics processing unit (GPU) to do general-purpose scientific and engineering computing.

GPU computing is not a replacement for CPU computing; each approach has advantages for certain kinds of software. The CPU and GPU work together in a heterogeneous co-processing computing model: the sequential part of the application runs on the CPU, while the computationally intensive part is accelerated by the GPU.

Page 8:

What is GPU Computing?

From the user’s perspective, the application simply runs faster because it uses the GPU to boost performance.

Page 9:

Fermi

Code name for NVIDIA’s next-generation CUDA architecture:
- 16 streaming multiprocessors (SMs), each consisting of 32 cores
- Each core can execute one floating-point or integer instruction per clock
- The SMs are supported by a second-level cache, a host interface, the GigaThread scheduler, and multiple DRAM interfaces

Page 10:

Fermi


Page 11:

The Programming Model

The complexity of the Fermi architecture is managed by a multi-level programming model that allows software developers to focus on algorithm design, with no need to know the details of mapping the algorithm to the hardware, which improves productivity.

Page 12:

The Programming Model: Kernels

In NVIDIA’s CUDA software platform, the computational elements of algorithms are called kernels. Kernels can be written in the C language, extended with additional keywords to express parallelism directly. Once compiled, a kernel consists of many threads that execute the same program in parallel.
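As an illustrative sketch (this vector-addition kernel and its names are assumptions for illustration, not taken from the white paper), a CUDA kernel written in extended C looks like:

```cuda
// Hypothetical example: element-wise vector addition.
// __global__ is one of the CUDA keywords added to C; it marks a
// function as a kernel that many threads execute in parallel.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Each thread computes one element; its index is derived from
    // built-in variables identifying its block and thread position.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```

Every thread runs the same program; only the computed index i differs from thread to thread.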

Page 13:

The Programming Model: Thread Blocks

Page 14:

The Programming Model: Warps

- Thread blocks are divided into warps of 32 threads.
- The warp is the fundamental unit of dispatch within a single SM.
- Two warps from different thread blocks can be issued and executed concurrently, increasing hardware utilization and energy efficiency.
- Thread blocks are grouped into grids, each of which executes a unique kernel.
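This hierarchy can be sketched in CUDA C (the kernel, array size, and launch dimensions here are illustrative assumptions): the host launches a grid of thread blocks, and the hardware divides each block into warps of 32 threads for dispatch.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scale every element of an array.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 128 threads per block = 4 warps of 32 threads each;
    // enough blocks in the grid to cover all n elements.
    int threadsPerBlock = 128;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```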

Page 15:

The Programming Model

At any one time, the entire Fermi device is dedicated to a single application, but an application may include multiple kernels. Fermi supports simultaneous execution of multiple kernels from the same application, with each kernel distributed to one or more SMs. This capability increases the utilization of the device.

Page 16:

The Programming Model: GigaThread

Switching from one application to another takes about 25 µs, short enough to maintain high utilization even when running multiple applications. This switching is managed by GigaThread, the hardware thread scheduler, which manages 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels.

Page 17:

The Programming Model: Languages

Fermi supports:
- C
- FORTRAN (with independent solutions from The Portland Group and NOAA)
- Java, Matlab, and Python

Fermi brings new instruction-level support for C++, previously unsupported on GPUs, which will make GPU computing more widely available than ever.

Page 18:

Supported Software Platforms

- NVIDIA’s own CUDA development environment
- The OpenCL standard managed by the Khronos Group
- Microsoft’s DirectCompute API

Page 19:

The Streaming Multiprocessor

Each SM comprises:
- 32 cores, each able to perform floating-point and integer operations
- 16 load-store units for memory operations
- Four special-function units
- 64KB of local SRAM split between cache and local memory

Page 20:

Page 21:

The Streaming Multiprocessor: Core

Floating-point operations follow the IEEE 754-2008 floating-point standard. Each core can perform:
- One single-precision fused multiply-add (FMA) operation in each clock period
- One double-precision FMA in two clock periods

FMA performs no rounding of the intermediate result. Fermi performs more than 8x as many double-precision operations per clock as previous GPU generations.
Page 22:

The Streaming Multiprocessor: Core (2)

FMA support increases the accuracy and performance of other mathematical operations:
- Division and square root
- Extended-precision arithmetic
- Interval arithmetic
- Linear algebra

The integer ALU supports the usual mathematical and logical operations, including multiplication, on both 32-bit and 64-bit values.

Page 23:

The Streaming Multiprocessor: Memory Operations

Memory operations are handled by a set of 16 load-store units in each SM. Load/store instructions can refer to memory in terms of two-dimensional arrays, providing addresses in terms of x and y values. Data can be converted from one format to another as it passes between DRAM and the core registers at the full rate. These are examples of optimizations unique to GPUs.

Page 24:

The Streaming Multiprocessor: Special Function Units

Each SM has four special function units (SFUs) that handle special operations such as sin, cos, and exp. Four of these operations can be issued per cycle in each SM.

Page 25:

The Streaming Multiprocessor: Execution Blocks

Each Fermi SM has four execution blocks:
- The cores are divided into two execution blocks of 16 cores each
- One block offers the 16 load-store units
- One block offers the four SFUs

In each cycle, 32 instructions can be dispatched from one or two warps to these blocks. It takes two cycles to execute the 32 instructions on the cores or load/store units. 32 special-function instructions can be issued in a single cycle but take eight cycles to complete on the four SFUs (32/4 = 8).

Page 26:

This figure shows how instructions are issued to the execution blocks.

Page 27:

The Cache and Memory Hierarchy: L1

The Fermi architecture provides local memory in each SM that can be split between shared memory and a first-level (L1) cache for global memory references. The local memory is 64KB in size and can be split 16KB/48KB or 48KB/16KB between L1 cache and shared memory. The choice depends on how much shared memory is needed and on how predictable the kernel’s accesses to global memory are likely to be.
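The CUDA runtime lets the programmer request either split on a per-kernel basis; a minimal sketch, assuming a hypothetical kernel named myKernel:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the cache-config call.
__global__ void myKernel(float *out) { /* ... */ }

int main(void)
{
    // Prefer the 48KB L1 / 16KB shared split: useful when the
    // kernel's global-memory accesses benefit from a larger cache.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    // Alternatively, prefer the 16KB L1 / 48KB shared split when
    // the kernel stages a lot of data in shared memory:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    return 0;
}
```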

Page 28:

The Cache and Memory Hierarchy: L2

Fermi comes with an L2 cache, 768KB in size for a 512-core chip, that covers GPU local DRAM as well as system memory. The L2 cache subsystem implements a set of memory read-modify-write atomic operations for managing access to data shared across thread blocks or kernels. These atomic operations are 5x to 20x faster than conventional synchronization methods on previous GPUs.
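CUDA C exposes these read-modify-write operations as atomic intrinsics; a minimal sketch (the histogram use case is an illustrative assumption):

```cuda
// Hypothetical example: build a histogram across many thread blocks.
__global__ void histogram(const int *values, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // atomicAdd performs an atomic read-modify-write, so threads
        // from different blocks (or kernels) can safely update the
        // same counter without explicit synchronization.
        atomicAdd(&bins[values[i]], 1);
    }
}
```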

Page 29:

The Cache and Memory Hierarchy: DRAM

DRAM is the final stage of the local memory hierarchy. Fermi provides six 64-bit DRAM channels that support SDDR3 and GDDR5 DRAMs. Up to 6GB of GDDR5 DRAM can be connected to the chip.

Page 30:

Error-Correcting Code (ECC)

Fermi is the first GPU to provide ECC, which protects DRAM, the register files, shared memories, and the L1 and L2 caches. The level of protection is known as SECDED: single (bit) error correction, double error detection. Instead of having each 64-bit memory channel carry eight extra bits of ECC information, NVIDIA uses a proprietary solution that packs the ECC bits into reserved lines of memory.

Page 31:

Conclusion

CPUs are best for dynamic workloads with short sequences of computational operations and unpredictable control flow.

Workloads dominated by computational work performed within a simpler control flow are best served by the GPU.