name: kaiyong zhao supervisor: dr. x. -w chu. background & related work multiple-precision...

28
Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu

Post on 20-Jan-2016

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

Name: Kaiyong ZhaoSupervisor: Dr. X. -W Chu

Page 2: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 3: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 4: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 5: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

0

20

40

60

80

100

120

2003 2004 2005 2006 2007M

em

ory

ba

nd

wid

th (

GB

/s)

GPU

CPUG80 Ultra

G80

G71

NV40

NV30 Hapertown

W oodcrestPrescott EENorthwood

0

20

40

60

80

100

120

2003 2004 2005 2006 2007M

em

ory

ba

nd

wid

th (

GB

/s)

GPU

CPUG80 Ultra

G80

G71

NV40

NV30 Hapertown

W oodcrestPrescott EENorthwood

•Computing Capability

•Memory Bandwidth

Page 6: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

L2

FB

SP SP

L1

TF

Th

rea

d P

roc

es

so

r

Vtx Thread Issue

Setup / Rstr / ZCull

Geom Thread Issue Pixel Thread Issue

Input Assembler

Host

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

L2

FB

L2

FB

L2

FB

L2

FB

L2

FB

Streaming Multiprocessor (SM)

Streaming Processor (SP)

Page 7: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

CUDA: CPU + GPU CParallel Computing modal

Single instruction Multiple Thread (SIMT)All threads run the same function(1000s threads on the fly)Each core deal with different data

Hidden the IO by multiple-threads(more than 1000s threads)Speed up Computing / IO Translation Coalesce the IO one time When half warp thread access neighboring data1 cycle@GPU vs. ~1000 cycles@CPU

Page 8: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 9: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 10: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 11: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 12: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 13: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 14: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

C = vectorA * Matrix B % prime

Page 15: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

There is no cache for global memory on G80/G200

Constant memory & texture memory have little cache

IO latency400-600 clock cycles

This is the bottle neckKey to Optimization!

(Device) Grid

ConstantMemory

TextureMemory

GlobalMemory

Block (0, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Block (1, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Host

(Device) Grid

ConstantMemory

TextureMemory

GlobalMemory

Block (0, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Block (1, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Host

Page 16: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 17: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 18: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 19: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

Global memory access by threads in a half-warp can be coalesced

When the words accessed by all threads lie in the same segment of size equal to:

32 bytes if all threads access 8-bit words64 bytes if all threads access 16-bit words 128 bytes if all threads access 32-bit or 64-bit words

Any pattern of addresses requested by the half-warp

Including patterns where multiple threads access the same address

Page 20: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

Address 0

Thread 0

Address 4

Address …

Address 116

Address 120

Address 124

Address 128

Address …

Address 172

Address 176

Address 180

Address 184

Address 188

Address 252

Thread 1

Thread 2

Thread 3

Thread …

Thread 14

Thread 15

Segment 0 (128B) Segment 1 (128B)

Reduced to 32B Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.

Page 21: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

C = vectorA * Matrix B % prime

Page 22: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 23: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic
Page 24: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)GPU: XFX GTX280, 1.24 GHz

Page 25: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

C = vectorA * Matrix B % prime

Page 26: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)GPU: XFX GTX280, 1.24 GHz

Page 27: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic

Summary

Page 28: Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic