attila research group



Page 1: Attila Research Group

Attila Research Group

attila.ac.upc.edu

Computer Architecture Department
Universitat Politècnica de Catalunya (UPC)

Page 2: Attila Research Group

Attila Project

• Started 2003
• Research on GPUs
  – Focus on the microarchitecture
  – Use real games as workloads
  – Analyze bandwidth/latency/threading tradeoffs
• Spent large fraction of time developing tools
• Currently three PhDs in progress
• Funding from
  – CICYT / Ministry of Education, Spain (2)
  – Intel (1)
• 2 students spent 6 months with ATI

Page 3: Attila Research Group

Attila Team

• Faculty
  – Agustin Fernandez
• 3 Ph.D. Students
  – Victor Moya: hired by Intel / VCG ’06
  – Carlos González: 6 months internship at ATI (Jun ’07)
  – Jordi Roca: 6 months internship at ATI (Jun ’07)
• Master Thesis
  – Chema Solis: DX9 Driver Development
• Alumni
  – David Abella: DX9 Player and PIX reader
  – Christian Perez: Color Compression in Attila
• Industrial Advisor
  – Roger Espasa, Intel VCG

Page 4: Attila Research Group

Attila Facts

• Simulation time
  – 1 frame @1280x1024 per hour
• Lines of code
  – Simulator: 142697 lines
  – Library, driver and trace tools: 217266 lines
    • ACDL: 37791 lines
    • OpenGL: 35960 lines
    • D3D9: 17348 lines

Page 5: Attila Research Group

Attila Publications

• Conference Papers
  – Workload Characterization of 3D Games. Jordi Roca, Victor Moya, Carlos González, Chema Solis, Agustín Fernández and Roger Espasa. IEEE International Symposium on Workload Characterization (IISWC-2006), pp. - , January 2006.
  – ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2006), March 2006.
  – Shader Performance Analysis on a Modern GPU Architecture. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. The 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38), November 2005.
  – A Single (Unified) Shader GPU Microarchitecture for Embedded Systems. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. 2005 International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), November 2005.
• Master Theses
  – Caracterización e implementación de algoritmos de compresión en la GPU ATILA (Text in Spanish). Christian Perez. Master Thesis for the Graduate Studies, January 2008.
  – Extensión a Direct3D del driver de un simulador de GPU (Text in Spanish). Chema Solis. Master Thesis for the Graduate Studies, July 2007.
  – Librería Direct3D (Text in Catalan). David Abella. Master Thesis for the Graduate Studies, July 2007.
  – Shader generation and compilation for a programmable GPU (Text in Spanish). Jordi Roca. Master Thesis for the Graduate Studies, July 2005.
  – Support tools for a 3D graphics processor simulation framework (Text in Spanish). Carlos González. Master Thesis for the Graduate Studies, June 2004.

Page 6: Attila Research Group

Outline

• Attila Tracing Environment
• Attila Architecture & Simulator
• Current Research
  – Shaders
  – Memory Hierarchy
  – Micropolygons
  – DX9 Driver Development

Page 7: Attila Research Group

Supported workloads

• Prey
• Riddick
• Quake 4
• Doom 3
• UT2004
• Half Life 2

and upcoming D3D games …

Page 8: Attila Research Group

[Figure: Attila tracing environment, a four-stage flow: Collect, Verify, Simulate, Analyze]

• Collect: OGL/D3D App, OGL/D3D Capturer (or Microsoft PIX Capturer), Vendor OGL/D3D Driver, ATI R600/NVIDIA G80, Framebuffer; produces the Trace and API Stats
• Verify: OGL/D3D Player (or Attila PIX Player), Vendor OGL/D3D Driver, ATI R600/NVIDIA G80, Framebuffer (CHECK)
• Simulate: ATTILA OGL/D3D Driver, ATTILA Simulator, Framebuffer (CHECK)
• Analyze: µ-Arch Statistics, Signal Traffic, internal traces (mem, $, …), Signal Trace Visualizer for detailed cycle-to-cycle visualization

Page 9: Attila Research Group

Collect

[Figure: tracing environment diagram with the Collect stage highlighted]

API Capturers
• Capture API calls from a real game
• Gather API level statistics
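A minimal sketch of the capture idea, assuming a wrapper that sits between the application and the real API entry point. The `Capturer` class and the two stand-in functions are hypothetical names for illustration; they are not the Attila capturer tools or real OpenGL/D3D calls.

```python
from collections import Counter
import functools


class Capturer:
    """Records every intercepted call and counts calls per function name."""

    def __init__(self):
        self.trace = []         # ordered list of (function name, arguments)
        self.stats = Counter()  # API-level statistics: calls per function

    def intercept(self, func):
        @functools.wraps(func)
        def wrapper(*args):
            self.trace.append((func.__name__, args))
            self.stats[func.__name__] += 1
            return func(*args)  # forward the call to the real implementation
        return wrapper


capturer = Capturer()


# Stand-ins for real API entry points (hypothetical names and bodies).
@capturer.intercept
def bind_texture(texture_id):
    return None


@capturer.intercept
def draw_triangles(first, count):
    return count


bind_texture(4)
draw_triangles(0, 300)
draw_triangles(300, 150)
print(capturer.stats["draw_triangles"], len(capturer.trace))
```

The ordered `trace` list is what a player would later replay, and `stats` is the kind of per-call count that API-level statistics build on.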

Page 10: Attila Research Group

Verify

[Figure: tracing environment diagram with the Verify stage highlighted]

API Players
• Trace checking/integrity
• Batch-to-batch playing (helps debug)

Page 11: Attila Research Group

Simulate

[Figure: tracing environment diagram with the Simulate stage highlighted]

Simulation
• Attila Drivers
  – AOGL (90%)
  – AD3D9 (60%)
• Attila Simulator
  – Detailed cycle-to-cycle simulation
  – 20 Boxes modeling 100-deep pipeline
  – Execute@Execute: Functionality embedded at each pipeline stage

Page 12: Attila Research Group

Analyze

[Figure: tracing environment diagram with the Analyze stage highlighted]

Simulation output
• Micro-architectural statistics
• Traffic for cache, mem, …
• Signal trace (input for STV tool)
  – Debug simulation performance

Page 13: Attila Research Group

Attila Drivers

• OpenGL driver
  – 200 API calls supported.
  – 80% OpenGL 2.0 fixed functionality
• DirectX9 driver
  – About 50 calls supported.
  – 60% API functionality.

[Figure: driver stack with the Attila OpenGL Driver (GLLIB) and the Attila DX9 Driver (D3DLIB), the HAL, and the ATTILA Architecture]

Page 14: Attila Research Group

Unified Driver Architecture

• Currently stalled due to lack of resources
• Runs basic traces
  – Non-textured torus with simple vtx shader.

[Figure: unified driver stack with AOGL*, AGL/ES, ADX9*, ADX10 and AREY, then ACDLX and ACDL, then the HAL and the ATTILA Architecture]

Page 15: Attila Research Group

Outline

• Attila Tracing Environment
• Attila Architecture & Simulator
• Current Research
  – Shaders
  – Memory Hierarchy
  – Micropolygons
  – DX9 Driver Development

Page 16: Attila Research Group

Attila Architecture

[Figure: block diagram of the Attila GPU: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, four Shader units, ROP units and four Memory Controllers]

Unified shaders, multithreaded … GDDR4 detailed protocol, selectable memory schedulers…

Page 17: Attila Research Group

Attila Simulator Implementation
Using Boxes & Signals

[Figure: simulator boxes connected by signals]
• CommandProcessor
• MemoryController
• STREAMER/VERTEX FETCH: StreamerFetch, StreamerOutputCache, StreamerCommit, StreamerLoader
• PrimitiveAssembly
• Clipper
• TriangleSetup
• FragmentGenerator
• HierarchicalZ
• SHADER: ShaderFetch, ShaderDecodeExecute
• TextureUnit
• FragmentFIFO
• Interpolator
• Z StencilTest
• ColorWrite
• DAC

Data-driven & cycle-accurate
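The Boxes & Signals idea can be sketched as follows: boxes are clocked once per cycle and exchange data only through signals that have a fixed latency. This is a toy model under that assumption; `Signal`, `Producer`, `Consumer` and `simulate` are hypothetical names, not the classes of the Attila simulator.

```python
from collections import deque


class Signal:
    """Fixed-latency wire between two boxes (hypothetical, not the Attila class)."""

    def __init__(self, name, latency):
        self.name = name
        self.latency = latency
        self._in_flight = deque()  # (arrival_cycle, payload), in send order

    def write(self, cycle, payload):
        # Data written at `cycle` becomes visible `latency` cycles later.
        self._in_flight.append((cycle + self.latency, payload))

    def read(self, cycle):
        # Return the oldest payload that has arrived by `cycle`, or None.
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Producer:
    """Box that emits one item per cycle until its work list is empty."""

    def __init__(self, out_signal, items):
        self.out = out_signal
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Consumer:
    """Box that records every item it receives together with the arrival cycle."""

    def __init__(self, in_signal):
        self.inp = in_signal
        self.received = []

    def clock(self, cycle):
        item = self.inp.read(cycle)
        if item is not None:
            self.received.append((cycle, item))


def simulate(boxes, cycles):
    # Every box is clocked once per simulated cycle.
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


wire = Signal("vertices", latency=3)
producer = Producer(wire, ["v0", "v1", "v2"])
consumer = Consumer(wire)
simulate([producer, consumer], cycles=8)
print(consumer.received)
```

Because a signal never delivers data in the cycle it was written, the order in which boxes are clocked within a cycle does not change the result, which is one reason this style suits cycle-level models.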

Page 18: Attila Research Group

Lots of configurable parameters

GPU Unit              Params  Examples
COMMAND PROCESSOR          1  Batch pipelining
MEMORY CONTROLLER         42  Size, channels and banks (number and interleaving).
STREAMER                  13  Fetched indices and attributes per cycle
PRIMITIVE ASSEMBLY         4  Assembled triangles per cycle
CLIPPER                    5  Clipping latency
SETUP + RASTERIZER        43  MSAA samples/cycle, Enabled HZ
UNIFIED SHADER UNIT       39  Fetch Instrs/cycle, temp regs, scalar ALU
TEXTURE CACHE             19  Line size, ways, port width
ROP (Z + COLOR)           47  Compression, cache size.
DAC                        9  Refresh rate
TOTAL                    222

Page 19: Attila Research Group

Statistics – High Level

• API level
• µ-arch level
• “Workload Characterization of 3D Games”, IEEE International Symposium on Workload Characterization (IISWC), 2006

[Figures: six per-frame plots]
• Quake 4 Vertex Cache Hit Rate (hit rate vs. frames)
• UT2004 Mem BW Consumed per Unit (MBs vs. frames; series: color, z, texture, vertex, dac)
• Doom3 Unit Utilization (vs. frames; series: mem, shExec, txALU, rop)
• Prey Average Shader Instructions (instructions vs. frame; series: vertex, fragment and texture instructions)
• FEAR Texture Filter (% filter vs. frame; series: NEAREST, NEAREST_MIPMAP_LINEAR, ANISO BILINEAR, ANISO TRILINEAR)
• Half Life 2 Triangles per frame (triangles vs. frame)

Page 20: Attila Research Group

Statistics – Zooming In

[Figure: Doom3, frame 0836, unit utilization detail (Texture, ZStencil, Shader) across a stencil pass and a shading pass for Light 0 and again for Light 1]

• Fine-grain stats at configurable intervals, e.g. every 100, 1K, 10K or 100K execution cycles.
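A small sketch of what sampling at a configurable interval means, assuming the simulator exposes a per-cycle busy flag for a unit. The function name and the synthetic trace are hypothetical and only illustrate the windowing arithmetic.

```python
def windowed_utilization(busy_per_cycle, window):
    """Average a per-cycle busy flag (0 or 1) over consecutive windows of `window` cycles."""
    samples = []
    for start in range(0, len(busy_per_cycle), window):
        chunk = busy_per_cycle[start:start + window]
        samples.append(sum(chunk) / len(chunk))
    return samples


# Hypothetical shader-unit trace: idle for 100 cycles, busy for 200, idle for 100.
trace = [0] * 100 + [1] * 200 + [0] * 100

print(windowed_utilization(trace, 100))  # four samples of 100 cycles each
print(windowed_utilization(trace, 400))  # one coarse sample for the whole trace
```

The fine window exposes the idle and busy phases, while the coarse window averages them into a single value; that trade-off is why the interval is worth making configurable.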

Page 21: Attila Research Group


Statistics – Cycle Level

Page 22: Attila Research Group

Outline

• Attila Tracing Environment
• Attila Architecture & Simulator
• Current Research
  – Shaders
  – Memory Hierarchy
  – Micropolygons
  – DX9 Driver Development

Page 23: Attila Research Group

GPU Memory Hierarchy Optimizations

Carlos González


Page 24: Attila Research Group

Previous Work

1. Initial version of Attila’s Boxes & Signals framework
2. Tracing Framework
   – GLInterceptor & GLPlayer Tools
   – OpenGL Driver for Attila
   – Signal Trace Visualizer tool
3. New highly-detailed Memory Controller for Attila
4. Internship at ATI (6 months, ’07)
   – Work mainly focused on the MC block
   – Analysis of bandwidth and latency by means of simulation techniques
   – Some contributions to the initial system
     • Mechanisms to pinpoint sources of latency and analyze bandwidth over time slices

Page 25: Attila Research Group

Remarks on today’s GPUs

• Tremendous bandwidth available
  – Core 2: 12 GB/sec vs. NVIDIA G80: > 100 GB/sec
• But…
  – Dozens of clients accessing memory simultaneously
  – Unbalanced and inefficient scheduling of memory transactions can lead to poor performance
    • Workload imbalance
      – Total available BW decreases
    • Inefficient scheduling
      – Latency increases (DDR protocol overhead)
• Overall performance degradation
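To illustrate why request order matters, here is a toy model of a single DRAM bank with one open row, where every change of row costs an extra activation. Both the model and the "row hit first" reordering are assumptions chosen for illustration; they are not the memory controller schedulers implemented in Attila.

```python
def count_row_activations(rows):
    """Count how often the open row changes when requests are served in the given order."""
    activations = 0
    open_row = None
    for row in rows:
        if row != open_row:
            activations += 1
            open_row = row
    return activations


def row_hit_first(rows):
    """Reorder a pending queue: keep serving the open row while requests to it remain."""
    pending = list(rows)
    order = []
    open_row = None
    while pending:
        # Prefer the oldest request that hits the open row, else the oldest request.
        pick = next((i for i, r in enumerate(pending) if r == open_row), 0)
        open_row = pending.pop(pick)
        order.append(open_row)
    return order


# Two hypothetical clients interleave requests to rows 7 and 3 of the same bank.
queue = [7, 3, 7, 3, 7, 3]

print(count_row_activations(queue))                 # served in arrival order
print(count_row_activations(row_hit_first(queue)))  # served after reordering
```

In arrival order every request opens a new row, while the reordered queue needs only two activations for the same six requests.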

Page 26: Attila Research Group

Thesis Goals

1. Optimize bank mapping and load balancing among memory channels. Also, propose multiple separate address spaces (one per client)
2. Propose efficient memory controller scheduling algorithms
   • Also: measure the DRAM chip consumption of our proposals
3. Propose new cache hierarchies for ROP and Texture units
4. Research on interconnection topologies

Page 27: Attila Research Group

Some experiments…

[Figure: Doom 3 total execution time, in millions of cycles, against channel interleaving (64, 128, 256, 512, 1024, 2048, 4096 bytes)]

[Figure: Doom3 scheduler performance analysis: execution time, in millions of cycles, for frames 229, 240, 260, 325, 350 and 375; bar labels 17.8%, 17.8%, 17.6%, 17.1%, 13.6%, 17.1%]

Channel Interleaving Analysis

Some config. parameters of the simulation
• 4 channels of 64-bit
• 8 banks per 32-bit IO chip
• Channel interleaving = 256 bytes
• Bank interleaving = 1024 bytes
• 4 unified shaders (4x)
• Texture cache line (L1) = 64 bytes
• Texture cache ways (L1) = 16
• Texture cache lines (L1) = 16
• Color and Zstencil caches: 4 ways, line size = 256 bytes, 16 cache lines

Some config. parameters of the experiment
• 8 channels of 32-bit
• 8 banks per 32-bit IO chip
• Bank interleaving fixed to 256
• 4 unified shaders (4x)

Memory Scheduling Analysis
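The channel and bank interleaving parameters listed above can be read as an address mapping. The sketch below assumes a plain modulo scheme and a hypothetical strided access pattern; the real mapping used by the Attila memory controller may differ.

```python
from collections import Counter


def map_address(addr, channels, channel_interleave, banks, bank_interleave):
    """Split a byte address into (channel, bank) using plain modulo interleaving."""
    channel = (addr // channel_interleave) % channels
    # Rebuild the address as seen inside the selected channel.
    block = addr // (channel_interleave * channels)
    local = block * channel_interleave + addr % channel_interleave
    bank = (local // bank_interleave) % banks
    return channel, bank


def channel_histogram(addresses, channels, channel_interleave):
    """Count how many accesses land on each channel."""
    hits = Counter((a // channel_interleave) % channels for a in addresses)
    return dict(sorted(hits.items()))


# 64 accesses with a 1024-byte stride (a hypothetical access pattern).
strided = [i * 1024 for i in range(64)]

print(map_address(4352, channels=4, channel_interleave=256, banks=8, bank_interleave=1024))
print(channel_histogram(strided, channels=4, channel_interleave=256))
print(channel_histogram(strided, channels=4, channel_interleave=1024))
```

With 4 channels and 256-byte interleaving, the 1024-byte stride sends every access to channel 0, while 1024-byte interleaving spreads the same accesses evenly. This is the kind of workload imbalance a channel interleaving sweep is meant to expose.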

Page 28: Attila Research Group

Micropolygon Rendering

Jordi Roca

Page 29: Attila Research Group

Past work

1. OpenGL Fixed Function to ARB vp/fp 1.0 translator.
2. Workload Characterization of 3D Games (IISWC ’06):
   – Extensive analysis of current games in terms of both API call and µarchitectural level stats.
3. Multi-GPU performance evaluation project (during the 2007 internship at ATI):
   – Hybrid SFR/AFR modes.
   – Alternatives for RTT surface synchronization.
   – Scaling of current PCIe BW.
   (A related paper has been submitted to IISWC 2008.)

Page 30: Attila Research Group

Micropolygon rendering

• Understanding and characterizing the pipeline backend imbalance due to very small polygons.
  – Newer games tend to render outdoor scenes, thus projecting polygons only a few pixels in size.

Synthetic micropolygon test:
Fills the screen with 1-pixel aligned quads:
  Raster Input: 1 triangle/clock
  Raster Output: 15/16 empty slots/clock (high-end cards).

[Figure: unit utilization (0 to 1) for Mem, Shader, Raster and ROP]
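The 15/16 figure from the synthetic test follows from simple arithmetic. The sketch assumes the rasterizer emits 16 fragment slots per clock, processes one triangle per clock, and that each pixel-sized triangle contributes a single fragment; the function name and defaults are hypothetical.

```python
def backend_slot_utilization(fragments_per_triangle, slots_per_clock=16, triangles_per_clock=1):
    """Fraction of rasterizer output slots carrying a fragment, capped at 1.0."""
    filled = fragments_per_triangle * triangles_per_clock
    return min(filled, slots_per_clock) / slots_per_clock


micro = backend_slot_utilization(fragments_per_triangle=1)
large = backend_slot_utilization(fragments_per_triangle=64)

print(1 - micro)  # share of empty slots with pixel-sized triangles
print(1 - large)  # share of empty slots with large triangles
```

Under these assumptions a pixel-sized triangle fills one slot out of 16, so 15 of every 16 backend slots stay empty, whereas a triangle covering 16 or more fragments can keep every slot busy.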

Page 31: Attila Research Group

Research on:

• Proposal #1: µpolygon grid traversal scheme:
  – An alternative rasterization path to detect and efficiently traverse grids of adjacent pixel-size primitives:
    • Fill backend slots combining fragments of different primitives.
    • Reuse triangle setup and traversal computations for pixel-proximate primitives.
• Proposal #2: Dynamic balancing of rasterization workload:
  – Assign & schedule shader threads for rasterization.

Page 32: Attila Research Group

DX9 Driver Development

Chema Solís

Page 33: Attila Research Group

Project target

• The project target is to use D3D9 games as workloads for the ATTILA GPU simulator.
• Two main tasks:
  – Trace D3D9 calls executed by the games.
  – Build a D3D9 driver on top of the GPU simulator.

[Figure: D3D application, Microsoft D3D9, D3D9 Trace, ATTILA D3D9 driver]

Page 34: Attila Research Group

PixRun Player

• Executes traces of calls to D3D9 captured by Microsoft PIX.
• Analyses how the game is using D3D9.

Page 35: Attila Research Group

D3D9 Driver

• D3D9 functionality is being added progressively.
• The driver is close to supporting commercial games.

Page 36: Attila Research Group

Unified Shader Architecture

Victor Moya

Page 37: Attila Research Group

Unified Shader Architecture

• Evaluated performance of a unified vertex and fragment shader architecture on legacy applications
  – Evaluated area vs. performance
• Evaluated the performance of implementing Triangle Setup on the shader for embedded GPU architectures
• Evaluated bottlenecks of GPU architectures with a high shader ALU to texture ratio.

Page 38: Attila Research Group

Current Research

• Evaluate thread and resource scheduling in a unified shader architecture
• Implement blending on the shader