Mass Market Applications of Massively Parallel Computing

Chas. Boyd

Outline

• Projections of future hardware
• The client computing space
• Mass-market parallel applications
• Common application characteristics
• Interesting processor features

The Physics of Silicon

• The way processors get faster has fundamentally changed
  • No more free performance gains due to clock rate and Instruction-Level Parallelism
• Yet gates-per-die continues to grow
  • Possibly faster now that clock rate isn’t an issue
  • Estimate: doubling every 2-2.5 years
• New area means more cores and caches
  • In-order core counts may grow faster than Out-of-Order core counts do

The Old Story

A Surplus of Cores

• ‘More cores than we know what to do with’
  • Literally
• Servers scale with transaction counts
• Technical Computing
  • History of dealing with parallel workloads
• What are the opportunities for the PC client?
  • Are there mass market applications that are parallelizable?

Requirements of Mass Market Space

• Fairly easy to program and maintain
• Cannot break on future hardware or operating systems
  • Transparent backward compatibility, forward compatibility
  • Mass market customers hate regressions!
  • Consumer software must operate for decades
• Must get faster automatically
  • Why we are here

AMD Term:

• Personal Stream Computing
  • Actually nothing like ‘stream processing’ as used by Stanford Brook, etc.

Data-Parallel Processing

• Key technique, how do we apply it to consumers?

• What takes lots of data?
  • Media, pixels, audio samples
  • Video, imaging, audio
  • Games

Video

• Decode, encode, transcode
  • Motion Estimation, DCT, Quantization
• Effects
  • Anything you would want to do to an image
  • Scaling, sepia, DVE effects (transitions)
• Indexing
  • Search/Recognition - convolutions
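
As an illustration of why per-pixel effects such as sepia are data-parallel, here is a minimal sketch in which every output pixel depends only on its own input pixel. The weights are the commonly cited sepia matrix, not values from the slides, and `sepia_pixel` and `apply_effect` are hypothetical names.

```python
# Commonly cited sepia weights (an assumption, not taken from the slides).
SEPIA = (
    (0.393, 0.769, 0.189),  # output red
    (0.349, 0.686, 0.168),  # output green
    (0.272, 0.534, 0.131),  # output blue
)

def sepia_pixel(rgb):
    """Map one 8-bit (r, g, b) pixel to sepia, clamping to 255."""
    return tuple(
        min(255, int(sum(w * c for w, c in zip(row, rgb)) + 0.5))
        for row in SEPIA
    )

def apply_effect(pixels, effect):
    """Apply a per-pixel effect; no pixel reads any other pixel,
    so the iterations could run in any order or all at once."""
    return [effect(p) for p in pixels]
```

Because no iteration reads another pixel's result, the same loop could be split across any number of cores without changing the output.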

Imaging

• Demosaicing
  • Extract colors with knowledge of sensor layout
• Segmentation
  • Identify areas of image to process
• Cleanup
  • Color correction, noise removal, etc.
• Indexing
  • Identify areas for tagging

User Interaction with Media

• Client applications can/should be interactive
  • Mass market wants full automation
  • ‘Pro-sumer’ wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images
• Automating media processing requires analysis
  • Recognition, segmentation, image understanding
  • Is this image outdoors or inside?
  • Is this image right-side up?
  • Does it contain faces?
  • Are their eyes red?

Imaging Markets

• In some sense, the broader the market, the more sophisticated the algorithm required
  • Although pro-sumers care more about performance, and they are the ones that write the reviews

FFT Performance

Game Applications of Mass Parallel

• Rendering
• Imaging
• Physics
• IK
• AI

Ultima Underworld 1993

Dark Messiah 2007

Game Rendering

• Well established at this point, but new techniques keep being discovered

• Rendering different terms at different spatial scales
  • E.g. irradiance can be computed more sparsely than exit radiance, enabling large increases in the number of incident light sources considered

• Spherical harmonic coefficient manipulations

Game Imaging

• Post processing
• Reduction (histogram or single average value)
  • Exposure estimation based on log average luminance
• Exposure correction
• Oversaturation extraction
• Large blurs (proportional to screen size)
• Depth of field
• Motion blur
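
The reduction to a single log average luminance can be sketched as follows. This is a minimal illustration, assuming linear RGB floats, Rec. 709 luma weights, and a middle-grey key of 0.18 for the exposure scale; none of these constants come from the slides, and the function names are hypothetical.

```python
import math

def log_average_luminance(pixels, delta=1e-4):
    """Geometric mean of luminance over a non-empty list of (r, g, b) floats.

    delta avoids log(0) on black pixels (an assumed small constant).
    """
    total = 0.0
    for r, g, b in pixels:
        lum = 0.2126 * r + 0.7152 * g + 0.0722 * b  # Rec. 709 weights
        total += math.log(delta + lum)
    return math.exp(total / len(pixels))

def exposure_scale(pixels, key=0.18):
    """Scale factor that maps the log average luminance to `key`."""
    return key / log_average_luminance(pixels)
```

The sum of logs is an ordinary reduction, so partial sums can be computed per tile and combined afterwards; averaging in log space keeps a few very bright pixels from dominating the estimate.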

Half-Life 2

Game Physics

• Particles - non-interacting
• Particles - interacting
• Rigid bodies
• Deformable bodies
• Etc.
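
The simplest case above, non-interacting particles, can be sketched as one independent update per particle. This is a minimal illustration assuming semi-implicit Euler integration and constant gravity; the integrator choice and the name `step_particles` are assumptions, not from the slides.

```python
def step_particles(positions, velocities, dt, gravity=(0.0, -9.81, 0.0)):
    """Advance every particle by one time step of length dt.

    Each particle reads only its own state, so the loop body could run
    as one thread per particle with no synchronization.
    """
    new_positions, new_velocities = [], []
    for p, v in zip(positions, velocities):
        # Update velocity first, then position (semi-implicit Euler).
        v2 = tuple(vi + gi * dt for vi, gi in zip(v, gravity))
        p2 = tuple(pi + vi * dt for pi, vi in zip(p, v2))
        new_velocities.append(v2)
        new_positions.append(p2)
    return new_positions, new_velocities
```

Interacting particles and rigid bodies break this independence, because each update then needs to read neighbouring state.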

Game Processor Evolution

[Figure: diagram of the game stack. Labels: Vertex Shader, Pixel Shader, Animation, AI, Texture Creation, Mesh Modeling, Physics, Content Creation Process. Stages: Offline, CPU, GPU, Real Time.]

Common Properties of Mass Apps

• Results of client computations are displayed at interactive rates
  • Fundamental requirement of client systems
• Tight coupling with graphics is optimal
  • Physical proximity to renderer is beneficial
• Smaller data types are key

Support for Image Data Types

• Pixels, texels, motion vectors, etc.
  • Image data more important than float32s
• Data declines in size as importance drops
  • Bytes, words, fp11, fp16, single, double
• Bytes may be declining in importance
• Hardware support for formatting is useful
  • Clock cycles required by shift/or/mul, etc. cost too much power
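
To make the shift/or/mul cost concrete, here is a sketch of what software has to do when hardware does not format pixels for it. It assumes a 32-bit 0xRRGGBBAA layout with 8 bits per channel; the layout and the function names are illustrative assumptions.

```python
def unpack_rgba8(word):
    """Split a 32-bit 0xRRGGBBAA word into four floats in [0, 1].

    Each channel costs a shift, a mask, and a multiply.
    """
    scale = 1.0 / 255.0
    return tuple(((word >> shift) & 0xFF) * scale for shift in (24, 16, 8, 0))

def pack_rgba8(r, g, b, a):
    """Pack four floats in [0, 1] back into a 0xRRGGBBAA word.

    Each channel costs a multiply, a round, a clamp, a shift, and an or.
    """
    word = 0
    for value in (r, g, b, a):
        byte = min(255, max(0, int(value * 255.0 + 0.5)))
        word = (word << 8) | byte
    return word
```

Roughly a dozen extra operations per pixel are spent on formatting alone here, which is the overhead that dedicated format-conversion hardware would remove.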

I/O Considerations

• For most computations other than 3-D rendering, GPUs are i/o bound
• Arithmetic intensity of these workloads is lower than GPUs are balanced for
  • Convolutions
• Support for efficient data types is very important

GPU Arithmetic Intensity Projection

• 2-3 more process doublings before new memory technologies will help much
  • Stacked die?, 2k wide bus?, optical?
• Estimate at least 4x increase in number of compute instructions per read operation
  • Arithmetic intensities reach 64??
  • This is fine for 3-D rendering
  • Other data-parallel apps need more i/o
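
The arithmetic behind this estimate can be sketched under two loudly stated assumptions: compute throughput doubles with each process doubling while memory bandwidth stays flat, and the present-day intensity is about 16 (a hypothetical baseline inferred from 64 divided by 4, not a figure given in the slides).

```python
def projected_intensity(baseline, doublings):
    """Compute instructions per read after `doublings` process generations,
    assuming compute doubles each generation and bandwidth does not grow."""
    return baseline * 2 ** doublings

BASELINE = 16  # hypothetical present-day value (assumption)

# Two doublings give the 4x floor; three give 8x.
projections = {n: projected_intensity(BASELINE, n) for n in (2, 3)}
```

Under these assumptions two doublings yield 64 and three yield 128, which is consistent with "at least 4x".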

I/O Patterns

• Solutions will have a variety of mechanisms to help with worsening i/o constraints

• Data re-use (at cache size scales) is relatively rare in media applications

• Read-write use of memory is rare
  • Read-write caches are less critical
  • Streaming data behavior is sufficient

• Read contention and write contention are the issue, not read-after-write scenarios

Interesting Techniques

• Shared registers
  • Possibly interesting to help with i/o bandwidth
  • Reducing on-chip bandwidth may help power/heat
• Scatter
  • Can be useful in scenarios that don’t thrash output subsystem
  • Can reduce pressure on gather input system

Convolution

• Key element of almost all image and video processing operations
  • Scaling, glows, blurs, search, segmentation
• Algorithm has very low arithmetic intensity
  • 1 MAD per sample
• Also has huge re-use (order of kernel size)
  • Shared registers should reduce memory reads by a factor of kernel size, raising effective arithmetic intensity accordingly
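
The re-use noted above can be made concrete by counting reads and MADs for a 1-D convolution. This is a sketch only: the sliding window below stands in for shared registers, the kernel is applied unflipped (identical to convolution for symmetric kernels), and both function names are hypothetical.

```python
from collections import deque

def convolve_naive(signal, kernel):
    """Each output re-reads all of its inputs: one read per MAD."""
    k = len(kernel)
    out, reads, mads = [], 0, 0
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]
            reads += 1
            mads += 1
        out.append(acc)
    return out, reads, mads

def convolve_windowed(signal, kernel):
    """Each input is read once and kept in a k-entry window."""
    k = len(kernel)
    window = deque(signal[:k - 1], maxlen=k)
    out, reads, mads = [], k - 1, 0
    for i in range(k - 1, len(signal)):
        window.append(signal[i])  # the only new read for this output
        reads += 1
        out.append(sum(w * c for w, c in zip(window, kernel)))
        mads += k
    return out, reads, mads
```

Both versions perform the same MADs, but the windowed one reads each sample once, so reads per output fall from the kernel size to one.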

Processor Core Types

• Heterogeneous Many-core
  • In-Order vs. Out-of-Order
  • Distinction arose from targeting 2 different workload design points
• Software can select ideal core type for each algorithm (workload design point)
  • Since not all cores can be powered anyway
• Hardware can make trade-offs on:
  • Power, area, performance growth rate

Workloads

[Figure: workloads plotted by memory access pattern (Local Memory Accesses to Streaming Memory Access) against Code Branchiness, with regions labeled CPUs and GPUs.]

Workload Differences

General Processing
• Small batches
• Frequent branches
• Many data inter-dependencies
• Scalar ops
• Vector ops

Media Processing
• Large batches
• Few branches
• Few data inter-dependencies
• Scalar ops
• Vector ops

Lesson from GPGPU Research

• Many important tasks have data-parallel implementations
  • Typically requires a new algorithm
  • May be just as maintainable
  • Definitely more scalable with core count

APIs Must Hide Implementations

• Implementation attributes must be hidden from apps to enable scaling over time
  • Number of cores operating
  • Number of registers available
  • Number of i/o ports
  • Sizes of caches
  • Thread scheduling policies

• Otherwise, these cannot change, and performance will not grow
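
One way to picture such an API is a call that takes only a per-element kernel and the data, and decides everything else internally. The sketch below is illustrative: `run_kernel` is a hypothetical name, and Python threads are used only to show the API shape, not as a claim about performance.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_kernel(kernel, inputs):
    """Apply `kernel` to every element and return results in input order.

    The worker count is chosen here, inside the runtime; the caller
    cannot see it or depend on it, so it can change on new hardware.
    """
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, inputs))

# The application states what to compute, never how many cores to use.
squares = run_kernel(lambda x: x * x, range(8))
```

Because the application never names a core count, cache size, or scheduling policy, the same call can run unchanged on a machine with 2 cores or 2000.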

Order of Thread Execution

• Shared registers and scatter share a pitfall:
  • It may be possible to write code that is dependent on the order of thread execution
  • This violates scaling requirement
• The order of thread execution may vary from run-to-run (each frame)
  • Will certainly vary between implementations
  • Cross-vendor and within vendor product lines

• All such code is considered incorrect
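
The pitfall can be reproduced by simulating the same scatter under two thread orders. This is a sketch with hypothetical names: `order` stands in for whatever schedule the hardware happens to pick, and integer values are used so that addition is exactly order-independent (floating-point sums can still differ in the last bits).

```python
def scatter_overwrite(indices, values, size, order):
    """Thread t writes values[t] to out[indices[t]]; last writer wins."""
    out = [0] * size
    for t in order:
        out[indices[t]] = values[t]  # result depends on execution order
    return out

def scatter_add(indices, values, size, order):
    """Thread t adds values[t] into out[indices[t]]."""
    out = [0] * size
    for t in order:
        out[indices[t]] += values[t]  # commutative, so order is irrelevant
    return out

indices, values = [0, 0, 1], [5, 7, 9]  # threads 0 and 1 collide on slot 0
forward, backward = [0, 1, 2], [2, 1, 0]
```

With these inputs the overwrite version returns different arrays for `forward` and `backward`, while the accumulate version returns the same array for both, which is why a reduction-style write sidesteps the ordering problem.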

System Design Goals

• Enable massively parallel implementations
  • Efficient scaling to 1000s of cores
  • No blocking/waiting
  • No constraints on order of thread execution
  • No read-after-write hazards
• Enable future compatibility
  • New hardware releases, new operating systems

Other Computing Paradigms

• CPU-originated:
  • Lock-based, Lockless
  • Message Passing
  • Transactional Memory
  • May not scale well to 1000s of cores
• GPU Paradigms
  • CUDA, CtM
  • May not scale well over time

High Level APIs

• Microsoft Accelerator
• Google PeakStream
• RapidMind
• Acceleware
• Stream processing
  • Brook, Sequoia

Additional GPU Features

• Linear Filtering
  • 1-D, 2-D, 3-D floating point array indices
  • Image and video data benefit
• Triangle Interpolators
  • Address calculations take many clocks
• Blenders
  • Atomic reduction ops reduce ordering concerns
• 4-vector operations
  • Vector data, syntactic convenience
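
Linear filtering with floating point array indices can be sketched in software as follows. This is a 1-D illustration assuming that integer indices fall exactly on samples and that out-of-range indices clamp to the edge; real GPU samplers use their own coordinate conventions, and both function names are hypothetical.

```python
import math

def sample_linear(data, x):
    """Read a 1-D array at fractional index x, blending the two neighbours."""
    x = min(max(x, 0.0), len(data) - 1.0)  # clamp to the valid range
    i0 = int(math.floor(x))
    i1 = min(i0 + 1, len(data) - 1)
    t = x - i0                              # blend weight toward i1
    return data[i0] * (1.0 - t) + data[i1] * t

def resize_row(data, new_len):
    """Rescale a row of samples by sampling at evenly spaced indices."""
    if new_len == 1:
        return [data[0]]
    step = (len(data) - 1) / (new_len - 1)
    return [sample_linear(data, i * step) for i in range(new_len)]
```

Each sample costs two reads, an address calculation, and a blend in software; a hardware filtering unit performs the same work as part of the fetch, which is why image and video scaling benefit.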

Processor Opportunities

• Client computing performance can improve
• Client space is a large un-tapped opportunity for parallel processing
• Hardware changes required are minimal and fairly obvious
  • Fast display, efficient i/o, scalable over time
