Mass Market Applications of Massively Parallel Computing

Chas. Boyd

Outline

• Projections of future hardware
• The client computing space
• Mass-market parallel applications
• Common application characteristics
• Interesting processor features

The Physics of Silicon

• The way processors get faster has fundamentally changed
• No more free performance gains due to clock rate and Instruction-Level Parallelism
• Yet gates-per-die continues to grow
  • Possibly faster now that clock rate isn’t an issue
  • Estimate: doubling every 2-2.5 years
• New area means more cores and caches
  • In-order core counts may grow faster than Out-of-Order core counts do

The Old Story

A Surplus of Cores

• ‘More cores than we know what to do with’
  • Literally
• Servers scale with transaction counts
• Technical Computing
  • History of dealing with parallel workloads
• What are the opportunities for the PC client?
  • Are there mass market applications that are parallelizable?

Requirements of Mass Market Space

• Fairly easy to program and maintain
• Cannot break on future hardware or operating systems
  • Transparent backward compatibility, forward compatibility
  • Mass market customers hate regressions!
  • Consumer software must operate for decades
• Must get faster automatically
  • Why we are here

AMD Term:

• Personal Stream Computing
  • Actually nothing like ‘stream processing’ as used by Stanford Brook, etc.

Data-Parallel Processing

• Key technique, how do we apply it to consumers?
• What takes lots of data?
  • Media, pixels, audio samples
  • Video, imaging, audio
  • Games

Video

• Decode, encode, transcode
  • Motion Estimation, DCT, Quantization
• Effects
  • Anything you would want to do to an image
  • Scaling, sepia, DVE effects (transitions)
• Indexing
  • Search/Recognition: convolutions

Imaging

• Demosaicing
  • Extract colors with knowledge of sensor layout
• Segmentation
  • Identify areas of image to process
• Cleanup
  • Color correction, noise removal, etc.
• Indexing
  • Identify areas for tagging

User Interaction with Media

• Client applications can/should be interactive
  • Mass market wants full automation
  • ‘Pro-sumer’ wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images
• Automating media processing requires analysis
  • Recognition, segmentation, image understanding
  • Is this image outdoors or inside?
  • Is this image right-side up?
  • Does it contain faces?
  • Are their eyes red?

Imaging Markets

• In some sense, the broader the market, the more sophisticated the algorithm required
  • Although pro-sumers care more about performance, and they are the ones that write the reviews

FFT Performance

Game Applications of Mass Parallel

• Rendering
• Imaging
• Physics
• IK
• AI

Ultima Underworld 1993

Dark Messiah 2007

Game Rendering

• Well established at this point, but new techniques keep being discovered
• Rendering different terms at different spatial scales
  • E.g. irradiance can be computed more sparsely than exit radiance, enabling large increases in the number of incident light sources considered
  • Spherical harmonic coefficient manipulations

Game Imaging

• Post processing
• Reduction (histogram or single average value)
  • Exposure estimation based on log average luminance
• Exposure correction
• Oversaturation extraction
• Large blurs (proportional to screen size)
• Depth of field
• Motion blur
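The exposure-estimation bullet describes a reduction over the whole frame. A minimal sketch of that reduction follows, assuming linear RGB pixels in the range 0..1, Rec. 709 luminance weights, a small `delta` to keep the logarithm finite on black pixels, and a middle-gray `key` of 0.18. None of these constants come from the slides; they are common choices used here for illustration only.

```python
import math

def log_average_luminance(pixels, delta=1e-4):
    """Reduce an image to one value: exp(mean(log(delta + luminance)))."""
    total = 0.0
    for r, g, b in pixels:
        # Rec. 709 luminance weights (an assumption, not from the slides).
        lum = 0.2126 * r + 0.7152 * g + 0.0722 * b
        total += math.log(delta + lum)
    return math.exp(total / len(pixels))

def exposure_scale(pixels, key=0.18):
    """Scale factor that maps the log-average luminance to the chosen key value."""
    return key / log_average_luminance(pixels)
```

Each pixel contributes one independent term to the sum, which is why this step maps well onto a data-parallel reduction. The log average is also far less sensitive to a few very bright pixels than a plain arithmetic mean.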

Half-Life 2

Game Physics

• Particles: non-interacting
• Particles: interacting
• Rigid bodies
• Deformable bodies
• Etc.

Game Processor Evolution

[Diagram: the Game Stack, spanning Offline to Real Time and CPU to GPU. Stages shown: Content Creation Process, Mesh Modeling, Texture Creation, Physics, AI, Animation, Vertex Shader, Pixel Shader.]

Common Properties of Mass Apps

• Results of client computations are displayed at interactive rates
  • Fundamental requirement of client systems
• Tight coupling with graphics is optimal
  • Physical proximity to renderer is beneficial
• Smaller data types are key

Support for Image Data Types

• Pixels, texels, motion vectors, etc.
  • Image data more important than float32s
• Data declines in size as importance drops
  • Bytes, words, fp11, fp16, single, double
• Bytes may be declining in importance
• Hardware support for formatting is useful
  • Clock cycles required by shift/or/mul, etc. cost too much power
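To illustrate the shift/or/mul work the last bullet refers to, here is a minimal sketch of unpacking and packing one 8-bit-per-channel pixel in software. The function names and the byte layout (red in the highest byte, alpha in the lowest) are assumptions chosen for the example. Hardware format conversion would do the same work without spending instruction slots on it.

```python
def unpack_rgba8(word):
    """Split a 32-bit packed pixel into four floats in 0..1 (assumed layout: R high byte, A low byte)."""
    r = (word >> 24) & 0xFF   # shift + mask per channel
    g = (word >> 16) & 0xFF
    b = (word >> 8) & 0xFF
    a = word & 0xFF
    return tuple(c / 255.0 for c in (r, g, b, a))  # one scale per channel

def pack_rgba8(rgba):
    """Inverse of unpack_rgba8: clamp, scale, round, then shift/or the bytes together."""
    word = 0
    for c in rgba:
        c = min(max(c, 0.0), 1.0)
        word = (word << 8) | int(c * 255.0 + 0.5)
    return word
```

Even this tiny conversion costs several integer and floating-point operations per pixel in each direction, which is the overhead the slide argues should be absorbed by dedicated formatting hardware.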

I/O Considerations

• Like most computations other than 3-D rendering, these applications are i/o bound on GPUs
• Their arithmetic intensity is lower than what GPUs are designed for
  • E.g. convolutions
• Support for efficient data types is very important

GPU Arithmetic Intensity Projection

• 2-3 more process doublings before new memory technologies will help much
  • Stacked die? 2k-wide bus? Optical?
• Estimate at least 4x increase in number of compute instructions per read operation
• Arithmetic intensities reach 64??
  • This is fine for 3-D rendering
  • Other data-parallel apps need more i/o
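The projection can be checked with simple arithmetic. The sketch below assumes that compute throughput doubles with each process doubling while memory bandwidth stays flat, so instructions per read double each generation. The baseline of 16 is not stated on the slide; it is inferred here from the slide's own figures (64 divided by the 4x increase).

```python
def projected_intensity(current, doublings):
    """Instructions per read after `doublings` process generations,
    assuming compute doubles each generation and memory bandwidth does not grow."""
    return current * 2 ** doublings

# Inferred baseline (64 / 4), an assumption rather than a figure from the slide.
BASELINE = 16

two_generations = projected_intensity(BASELINE, 2)    # the "at least 4x" case
three_generations = projected_intensity(BASELINE, 3)  # upper end of "2-3 doublings"
```

Under these assumptions, two doublings give the slide's figure of 64, and a third would double it again.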

I/O Patterns

• Solutions will have a variety of mechanisms to help with worsening i/o constraints
• Data re-use (at cache size scales) is relatively rare in media applications
• Read-write use of memory is rare
  • Read-write caches are less critical
  • Streaming data behavior is sufficient
• Read contention and write contention are the issue, not read-after-write scenarios

Interesting Techniques

• Shared registers
  • Possibly interesting to help with i/o bandwidth
  • Reducing on-chip bandwidth may help power/heat
• Scatter
  • Can be useful in scenarios that don’t thrash output subsystem
  • Can reduce pressure on gather input system

Convolution

• Key element of almost all image and video processing operations
  • Scaling, glows, blurs, search, segmentation
• Algorithm has very low arithmetic intensity
  • 1 MAD per sample
• Also has huge re-use (order of kernel size)
  • Shared registers should cut memory reads by a factor of kernel size, raising arithmetic intensity accordingly
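The re-use argument can be made concrete by counting sample fetches. The following is a minimal 1-D sketch, not a GPU implementation: `convolve_naive` fetches one sample per multiply-add, while `convolve_shared` fetches each sample once into a sliding window that stands in for shared registers. Both function names and the read counters are illustrative assumptions.

```python
def convolve_naive(samples, kernel):
    """Valid-region 1-D convolution (correlation form); one sample fetch per MAD."""
    k = len(kernel)
    out, reads = [], 0
    for i in range(len(samples) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += samples[i + j] * kernel[j]  # one multiply-add (MAD)
            reads += 1                         # one fetch for that MAD
        out.append(acc)
    return out, reads

def convolve_shared(samples, kernel):
    """Same outputs, but each sample is fetched once and re-used across neighbours."""
    k = len(kernel)
    window = list(samples[:k - 1])  # warm-up fetches
    reads = len(window)
    out = []
    for i in range(k - 1, len(samples)):
        window.append(samples[i])   # the only new fetch for this output
        reads += 1
        out.append(sum(w * c for w, c in zip(window, kernel)))
        window.pop(0)               # slide the window by one sample
    return out, reads
```

For n samples and a kernel of size k, the naive version performs (n - k + 1) * k fetches and the shared version performs n, so the ratio approaches k for long inputs.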

Processor Core Types

• Heterogeneous Many-core
  • In-Order vs. Out-of-Order
• Distinction arose from targeting 2 different workload design points
• Software can select ideal core type for each algorithm (workload design point)
  • Since not all cores can be powered anyway
• Hardware can make trade-offs on:
  • Power, area, performance growth rate

Workloads

[Chart: workloads plotted by memory access pattern (Local Memory Accesses to Streaming Memory Access) against Code Branchiness, with regions labeled CPUs and GPUs.]

Workload Differences

General Processing
• Small batches
• Frequent branches
• Many data inter-dependencies
• Scalar ops
• Vector ops

Media Processing
• Large batches
• Few branches
• Few data inter-dependencies
• Scalar ops
• Vector ops

Lesson from GPGPU Research

• Many important tasks have data-parallel implementations
  • Typically requires a new algorithm
  • May be just as maintainable
  • Definitely more scalable with core count

APIs Must Hide Implementations

• Implementation attributes must be hidden from apps to enable scaling over time
  • Number of cores operating
  • Number of registers available
  • Number of i/o ports
  • Sizes of caches
  • Thread scheduling policies
• Otherwise, these cannot change, and performance will not grow
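One way to picture this requirement is an API in which the application supplies only a per-element kernel and the runtime alone decides how many workers run it. The sketch below uses Python's standard `concurrent.futures` thread pool as a stand-in for such a runtime. `parallel_map`, its `_workers` override (present only so the behaviour can be tested), and the `brighten` kernel are hypothetical names, not an API from the talk.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_map(kernel, data, _workers=None):
    """Apply `kernel` to every element. The worker count is chosen here, never by the caller's algorithm."""
    workers = _workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order regardless of scheduling.
        return list(pool.map(kernel, data))

def brighten(value):
    """Example per-element kernel: depends only on its own input, not on core count or ordering."""
    return min(value + 16, 255)
```

Because `brighten` never sees the worker count or the scheduling policy, the same application code produces the same result on one core or on many, which leaves the implementation free to change underneath it.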

Order of Thread Execution

• Shared registers and scatter share a pitfall:
  • It may be possible to write code that is dependent on the order of thread execution
  • This violates scaling requirement
• The order of thread execution may vary from run-to-run (each frame)
• Will certainly vary between implementations
  • Cross-vendor and within vendor product lines
• All such code is considered incorrect
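The pitfall can be shown with a small simulation. In the sketch below, each "thread" performs one scatter write, and `order` is the sequence in which a hypothetical scheduler happens to run them. The function names and the tiny data set are assumptions for illustration. Integer values are used so that the accumulating version is exactly order-independent.

```python
def scatter_overwrite(writes, size, order):
    """Each thread stores its value at its address; colliding writes make the result order-dependent."""
    out = [0] * size
    for t in order:
        addr, value = writes[t]
        out[addr] = value          # last writer wins
    return out

def scatter_accumulate(writes, size, order):
    """Each thread adds its value instead; integer addition commutes, so order does not matter."""
    out = [0] * size
    for t in order:
        addr, value = writes[t]
        out[addr] += value         # stands in for an atomic add
    return out

# Threads 0 and 1 both target address 0.
writes = [(0, 5), (0, 7), (1, 3)]
forward, backward = [0, 1, 2], [2, 1, 0]
```

With these inputs the overwrite version yields `[7, 3]` in forward order and `[5, 3]` in backward order, so its output depends on scheduling and would count as incorrect by the slide's rule. The accumulate version yields `[12, 3]` either way.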

System Design Goals

• Enable massively parallel implementations
  • Efficient scaling to 1000s of cores
  • No blocking/waiting
  • No constraints on order of thread execution
  • No read-after-write hazards
• Enable future compatibility
  • New hardware releases, new operating systems

Other Computing Paradigms

• CPU-originated:
  • Lock-based, Lockless
  • Message Passing
  • Transactional Memory
  • May not scale well to 1000s of cores
• GPU Paradigms
  • CUDA, CtM
  • May not scale well over time

High Level APIs

• Microsoft Accelerator
• Google PeakStream
• RapidMind
• Acceleware
• Stream processing
  • Brook, Sequoia

Additional GPU Features

• Linear Filtering
  • 1-D, 2-D, 3-D floating point array indices
  • Image and video data benefit
• Triangle Interpolators
  • Address calculations take many clocks
• Blenders
  • Atomic reduction ops reduce ordering concerns
• 4-vector operations
  • Vector data, syntactic convenience
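As an illustration of the linear filtering bullet, the sketch below reads a 1-D array at a floating point index by blending the two nearest elements. It is a simplified software model with clamping at the edges; real GPU texture units use their own addressing conventions (for example, texel-center offsets), which are not modeled here. `sample_linear` is a hypothetical name.

```python
import math

def sample_linear(data, x):
    """Read `data` at fractional index x, blending the two nearest elements (clamped at the edges)."""
    x = min(max(x, 0.0), len(data) - 1)   # clamp the index into the valid range
    i0 = int(math.floor(x))               # left neighbour
    i1 = min(i0 + 1, len(data) - 1)       # right neighbour, clamped
    t = x - i0                            # blend weight in 0..1
    return data[i0] * (1.0 - t) + data[i1] * t
```

In software this costs a floor, two loads, and several multiply-adds per sample; in 2-D and 3-D the cost grows to four and eight loads respectively, which is why dedicated filtering hardware benefits image and video data.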

Processor Opportunities

• Client computing performance can improve
• Client space is a large untapped opportunity for parallel processing
• Hardware changes required are minimal and fairly obvious
  • Fast display, efficient i/o, scalable over time