gpu architecture overview
DESCRIPTION
GPU Architecture Overview. Patrick Cozzi University of Pennsylvania CIS 565 - Fall 2012. Announcements. Project 0 due Monday 09/17 Karl’s office hours Tuesday, 4:30-6pm Friday, 2-5pm. Course Overview Follow-Up. Read before or after lecture? Project vs. final project - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/1.jpg)
GPU Architecture Overview
Patrick CozziUniversity of PennsylvaniaCIS 565 - Fall 2012
![Page 2: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/2.jpg)
Announcements
Project 0 due Monday 09/17 Karl’s office hours
Tuesday, 4:30-6pmFriday, 2-5pm
![Page 3: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/3.jpg)
Course Overview Follow-Up
Read before or after lecture? Project vs. final project Closed vs. open source Feedback vs. grades Guest lectures
![Page 4: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/4.jpg)
Acknowledgements
CPU slides – Varun Sampath, NVIDIA GPU slides
Kayvon Fatahalian, CMUMike Houston, AMD
![Page 5: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/5.jpg)
CPU and GPU Trends
FLOPS – FLoating-point OPerations per Second
GFLOPS - One billion (109) FLOPS TFLOPS – 1,000 GFLOPS
![Page 7: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/7.jpg)
CPU and GPU Trends
Compute Intel Core i7 – 4 cores – 100 GFLOPNVIDIA GTX280 – 240 cores – 1 TFLOP
Memory BandwidthSystem Memory – 60 GB/sNVIDIA GT200 – 150 GB/s
Install BaseOver 200 million NVIDIA G80s shipped
![Page 8: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/8.jpg)
CPU Review
What are the major components in a CPU die?
![Page 9: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/9.jpg)
CPU Review• Desktop Applications
– Lightly threaded– Lots of branches– Lots of memory accesses
Profiled with psrun on ENIAC
Slide from Varun Sampath
![Page 10: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/10.jpg)
A Simple CPU Core
Image: Penn CIS501
Fetch Decode Execute Memory Writeback
PC I$ RegisterFile
s1 s2 d D$
+4
control
![Page 11: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/11.jpg)
Pipelining
Image: Penn CIS501
PC InsnMem
RegisterFile
s1 s2 dDataMem
+4
Tinsn-mem Tregfile TALU Tdata-mem Tregfile
Tsinglecycle
![Page 12: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/12.jpg)
Branches
Image: Penn CIS501
jeq loop??????
RegisterFile
SX
s1 s2 dDataMem
a
d
IR
A
B
IR
O
B
IR
O
D
IR
F/D D/X X/M M/W
nop
![Page 13: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/13.jpg)
Branch Prediction
+ Modern predictors > 90% accuracyo Raise performance and energy efficiency
(why?)– Area increase– Potential fetch stage latency increase
![Page 14: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/14.jpg)
Memory Hierarchy Memory: the larger it gets, the slower it gets Rough numbers:
Latency Bandwidth SizeSRAM (L1, L2, L3)
1-2ns 200GBps 1-20MB
DRAM (memory)
70ns 20GBps 1-20GB
Flash (disk) 70-90µs 200MBps 100-1000GBHDD (disk) 10ms 1-150MBps 500-3000GB
![Page 15: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/15.jpg)
Caching
Keep data you need close Exploit:
Temporal locality Chunk just used likely to be used again soon
Spatial locality Next chunk to use is likely close to previous
![Page 16: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/16.jpg)
Cache Hierarchy
Hardware-managed L1 Instruction/Data
caches L2 unified cache L3 unified cache
Software-managed Main memory Disk
I$ D$
L2
L3
Main Memory
Disk
(not to scale)
Larger
Faster
![Page 17: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/17.jpg)
Intel Core i7 3960X – 15MB L3 (25% of die). 4-channel Memory Controller, 51.2GB/s totalSource: www.lostcircuits.com
![Page 18: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/18.jpg)
Improving IPC
IPC (instructions/cycle) bottlenecked at 1 instruction / clock
Superscalar – increase pipeline width
Image: Penn CIS371
![Page 19: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/19.jpg)
Scheduling
Consider instructions:xor r1,r2 -> r3 add r3,r4 -> r4 sub r5,r2 -> r3 addi r3,1 -> r1
xor and add are dependent (Read-After-Write, RAW)
sub and addi are dependent (RAW) xor and sub are not (Write-After-Write, WAW)
![Page 20: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/20.jpg)
Register Renaming
How about this instead:xor p1,p2 -> p6 add p6,p4 -> p7 sub p5,p2 -> p8 addi p8,1 -> p9
xor and sub can now execute in parallel
![Page 21: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/21.jpg)
Out-of-Order Execution
Reordering instructions to maximize throughput Fetch Decode Rename Dispatch Issue
Register-Read Execute Memory Writeback Commit
Reorder Buffer (ROB) Keeps track of status for in-flight instructions
Physical Register File (PRF) Issue Queue/Scheduler
Chooses next instruction(s) to execute
![Page 22: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/22.jpg)
Parallelism in the CPU
Covered Instruction-Level (ILP) extractionSuperscalarOut-of-order
Data-Level Parallelism (DLP)Vectors
Thread-Level Parallelism (TLP)Simultaneous Multithreading (SMT)Multicore
![Page 23: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/23.jpg)
Vectors Motivation
for (int i = 0; i < N; i++)A[i] = B[i] + C[i];
![Page 24: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/24.jpg)
CPU Data-level Parallelism Single Instruction Multiple Data (SIMD)
Let’s make the execution unit (ALU) really wide Let’s make the registers really wide too
for (int i = 0; i < N; i+= 4) {// in parallelA[i] = B[i] + C[i];A[i+1] = B[i+1] + C[i+1];A[i+2] = B[i+2] + C[i+2];A[i+3] = B[i+3] + C[i+3];
}
![Page 25: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/25.jpg)
Vector Operations in x86
SSE2 4-wide packed float and packed integer instructions Intel Pentium 4 onwards AMD Athlon 64 onwards
AVX 8-wide packed float and packed integer instructions Intel Sandy Bridge AMD Bulldozer
![Page 26: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/26.jpg)
Thread-Level Parallelism
Thread Composition Instruction streamsPrivate PC, registers, stackShared globals, heap
Created and destroyed by programmer Scheduled by programmer or by OS
![Page 27: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/27.jpg)
Simultaneous Multithreading
Instructions can be issued from multiple threads
Requires partitioning of ROB, other buffers+ Minimal hardware duplication+ More scheduling freedom for OoO– Cache and execution resource contention can reduce single-threaded performance
![Page 28: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/28.jpg)
Multicore
Replicate full pipeline Sandy Bridge-E: 6 cores+ Full cores, no resource sharing other than last-level cache+ Easier way to take advantage of Moore’s Law– Utilization
![Page 29: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/29.jpg)
CPU Conclusions
CPU optimized for sequential programmingPipelines, branch prediction, superscalar,
OoOReduce execution time with high clock speeds
and high utilization Slow memory is a constant problem Parallelism
Sandy Bridge-E great for 6-12 active threadsHow about 12,000?
![Page 30: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/30.jpg)
How did this happen?
http://proteneer.com/blog/?p=263
![Page 31: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/31.jpg)
Graphics Workloads
Triangles/vertices and pixels/fragments
Right image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html
![Page 32: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/32.jpg)
Early 90s – Pre GPU
Slide from http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf
![Page 33: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/33.jpg)
Why GPUs?
Graphics workloads are embarrassingly parallelData-parallelPipeline-parallel
CPU and GPU execute in parallel Hardware: texture filtering, rasterization,
etc.
![Page 34: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/34.jpg)
Data Parallel
Beyond GraphicsCloth simulationParticle systemMatrix multiply
Image from: https://plus.google.com/u/0/photos/100838748547881402137/albums/5407605084626995217/5581900335460078306
![Page 35: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/35.jpg)
![Page 36: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/36.jpg)
![Page 37: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/37.jpg)
![Page 38: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/38.jpg)
![Page 39: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/39.jpg)
![Page 40: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/40.jpg)
![Page 41: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/41.jpg)
![Page 42: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/42.jpg)
![Page 43: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/43.jpg)
![Page 44: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/44.jpg)
![Page 45: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/45.jpg)
![Page 46: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/46.jpg)
![Page 47: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/47.jpg)
![Page 48: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/48.jpg)
![Page 49: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/49.jpg)
![Page 50: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/50.jpg)
![Page 51: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/51.jpg)
![Page 52: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/52.jpg)
![Page 53: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/53.jpg)
![Page 54: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/54.jpg)
![Page 55: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/55.jpg)
![Page 56: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/56.jpg)
![Page 57: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/57.jpg)
![Page 58: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/58.jpg)
![Page 59: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/59.jpg)
![Page 60: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/60.jpg)
![Page 61: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/61.jpg)
![Page 62: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/62.jpg)
![Page 63: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/63.jpg)
![Page 64: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/64.jpg)
![Page 65: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/65.jpg)
![Page 66: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/66.jpg)
![Page 67: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/67.jpg)
![Page 68: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/68.jpg)
![Page 69: GPU Architecture Overview](https://reader035.vdocuments.mx/reader035/viewer/2022062218/568161a3550346895dd15d39/html5/thumbnails/69.jpg)
Reminders
Piazza Signup: https://piazza.com/upenn/fall2012/cis565/
GitHub Create an account: https://github.com/signup/free Change it to an edu account: https://github.com/edu Join our organization: https://github.com/CIS565-Fall-2012
No class Wednesday, 09/12