Innovative Applications and Technology Pivots – A Perfect Storm in Computing Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign with Izzat El Hajj, Liwen Chang, Simon Garcia, and Carl Pearson


Post on 17-Apr-2020


Innovative Applications and Technology Pivots – A Perfect Storm in Computing

Wen-mei Hwu

Professor and Sanders-AMD Chair, ECE, NCSA

University of Illinois at Urbana-Champaign

with

Izzat El Hajj, Liwen Chang, Simon Garcia, and Carl Pearson


Agenda

• Revolutionary paradigm shift in applications

• Post-Dennard technology pivot - heterogeneity

• An example of positive application-technology spiral

• Engineering high-efficiency software for heterogeneous computing


A major paradigm shift

In the 20th Century, we were able to understand, design, and manufacture what we can measure

• Physical instruments and computing systems allowed us to see farther, capture more, communicate better, understand natural processes, control artificial processes…


In the 21st Century, we are able to understand, design, and create what we can compute

• Computational models are allowing us to see even farther, go back and forth in time, learn better, test hypotheses that cannot be verified any other way, create safe artificial processes…

Examples of Paradigm Shift

20th Century

Small mask patterns

Electron microscope and crystallography with computational image processing

Anatomic imaging with computational image processing

Teleconference

GPS

21st Century

Optical proximity correction

Computational microscope with initial conditions from crystallography

Metabolic imaging sees disease before visible anatomic change

Tele-immersion

Self-driving cars

Diving deeper into computational microscope

• Large clusters (scale out) allow simulation of biological systems of realistic space dimensions
  • 0.5 Å (0.05 nm) lattice spacing needed for accuracy
  • Interesting biological systems have dimensions of mm or larger
  • Thousands of nodes are required to hold and update all the grid points.

• Fast nodes (scale up) allow simulation at realistic time scales
  • Simulation time steps at femtosecond (10^-15 second) level needed for accuracy
  • Biological processes take milliseconds or longer
  • Current molecular dynamics simulations progress at about one day for each 10-100 microseconds of the simulated process.
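The arithmetic behind the scale-out and scale-up bullets can be sketched in C. This is only an illustration: the function names are hypothetical, and the 1 mm cube and 1 ms process used in the note below are assumed values taken from the lower bounds quoted on the slide.

```c
#include <assert.h>

/* Grid points needed to cover a cube with edge `edge_m` (meters)
   at a lattice spacing of `spacing_m` (meters). */
double grid_points(double edge_m, double spacing_m) {
    assert(spacing_m > 0.0);
    double per_edge = edge_m / spacing_m;
    return per_edge * per_edge * per_edge;
}

/* Time steps needed to simulate `process_s` seconds
   with a step of `step_s` seconds. */
double time_steps(double process_s, double step_s) {
    assert(step_s > 0.0);
    return process_s / step_s;
}

/* Wall-clock days needed when one day of computing advances the
   simulation by `rate_s_per_day` simulated seconds. */
double wall_clock_days(double process_s, double rate_s_per_day) {
    assert(rate_s_per_day > 0.0);
    return process_s / rate_s_per_day;
}
```

Under these assumptions, `grid_points(1e-3, 0.5e-10)` gives about 8 x 10^21 grid points, `time_steps(1e-3, 1e-15)` gives 10^12 steps, and `wall_clock_days(1e-3, 100e-6)` to `wall_clock_days(1e-3, 10e-6)` gives roughly 10 to 100 days for one simulated millisecond.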

Blue Waters Science Breakthrough Example

Determination of the structure of the HIV capsid at atomic level

Collaborative effort of experimental groups at the U. of Pittsburgh and Vanderbilt U., and Schulten’s computational team at the U. of Illinois.

64-million-atom HIV capsid simulation of the process through which the capsid disassembles, releasing its genetic material, a critical step in understanding HIV infection and finding a target for antiviral drugs.

Post-Dennard technology pivot - heterogeneity

Dennard Scaling of MOS Devices

In this ideal scaling, as $L \to \alpha L$

• $V_{DD} \to \alpha V_{DD}$, $C \to \alpha C$, $I \to \alpha I$

• Delay $= C V_{DD} / I$ scales by $\alpha$, so $f$ scales by $1/\alpha$

• Power for each transistor is $C V_{DD}^2 f$ and scales by $\alpha^2$

• keeping total power constant for same chip area

JSSC Oct 1974, page 256
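The scaling rules above can be checked numerically. The following C sketch is illustrative only: the `Device` struct, the function names, and the normalization (every unscaled quantity set to 1) are assumptions. Transistor density is taken to grow as 1/L^2, which is what "same chip area" implies.

```c
#include <assert.h>

/* Per-transistor quantities, in normalized (unitless) form. */
typedef struct {
    double length;   /* L    */
    double vdd;      /* V_DD */
    double cap;      /* C    */
    double current;  /* I    */
} Device;

/* Ideal Dennard scaling: every quantity is multiplied by alpha (alpha < 1 shrinks). */
Device dennard_scale(Device d, double alpha) {
    assert(alpha > 0.0);
    Device s = { d.length * alpha, d.vdd * alpha, d.cap * alpha, d.current * alpha };
    return s;
}

double delay(Device d)         { return d.cap * d.vdd / d.current; }
double frequency(Device d)     { return 1.0 / delay(d); }
double power(Device d)         { return d.cap * d.vdd * d.vdd * frequency(d); }
/* Transistors per unit area grow as 1/L^2. */
double density(Device d)       { return 1.0 / (d.length * d.length); }
double power_density(Device d) { return power(d) * density(d); }
```

With `alpha = 0.5`, delay halves, frequency doubles, per-transistor power drops to one quarter, and four times as many transistors fit in the same area, so `power_density` is unchanged.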

Frequency Scaled Too Fast 1993-2003

[Figure: clock frequency (MHz), log scale from 10 to 10,000, by year from 1985 to 2005]

Total Processor Power Increased (super-scaling of frequency and chip size)

[Figure: total processor power, log scale from 1 to 100, by year from 1985 to 2003]


Post-Dennard Pivoting

Multiple cores with more moderate clock frequencies

Heavy use of vector execution

Employ both latency-oriented and throughput-oriented cores

3D packaging for more memory bandwidth

Blue Waters Computing System

Operational at Illinois since 3/2013

Sonexion: 26 PBs

>1 TB/sec

100 GB/sec

10/40/100 Gb Ethernet Switch

Spectra Logic: 300 PBs

120+ Gb/sec

WAN

IB Switch

12.5 PF

1.6 PB DRAM

$250M

49,504 CPUs -- 4,224 GPUs

CPUs: Latency Oriented Design

High clock frequency

Large caches
• Convert long latency memory accesses to short latency cache accesses

Sophisticated control
• Branch prediction for reduced branch latency
• Data forwarding for reduced data latency

Powerful ALU
• Reduced operation latency

[Figure: CPU block diagram with control logic, four ALUs, cache, and DRAM]

GPUs: Throughput Oriented Design

Moderate clock frequency

Small caches
• To boost memory throughput

Simple control
• No branch prediction
• No data forwarding

Energy efficient ALUs
• Many, long latency but heavily pipelined for high throughput

Require massive number of threads to tolerate latencies

[Figure: GPU block diagram with DRAM]

Applications Benefit from Both CPU and GPU

CPUs for sequential parts where latency matters
• CPUs can be 10+X faster than GPUs for sequential code

GPUs for parallel parts where throughput wins
• GPUs can be 10+X faster than CPUs for parallel code
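One way to see why an application needs both kinds of cores is to estimate end-to-end speedup when only the parallel part is accelerated. The sketch below uses Amdahl's law, which is not stated on the slides and is brought in here as an assumption; the function name and the fractions in the note are hypothetical.

```c
#include <assert.h>

/* End-to-end speedup when a fraction `parallel_fraction` of the original
   run time is accelerated by `kernel_speedup` and the sequential rest
   runs at its original speed (Amdahl's law, an assumption not on the slides). */
double end_to_end_speedup(double parallel_fraction, double kernel_speedup) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / kernel_speedup);
}
```

For example, with a 10X faster parallel part, an application that spends half of its time in parallel code gains about 1.8X overall, and one that spends three quarters gains about 3.1X. The sequential remainder dominates unless it also runs on a latency-oriented core.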

Initial Production Use Results

Application | Description | Application Speedup
NAMD | 100 million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included | 1.8
Chroma | Lattice QCD parameters: grid size of 48^3 x 512 running at the physical values of the quark masses | 2.4
QMCPACK | Full run Graphite 4x4x1 (256 electrons), QMC followed by VMC | 2.7
ChaNGa | Collisionless N-body stellar dynamics with multipole expansion and hydrodynamics | 2.1
AWP | Anelastic wave propagation with staggered-grid finite-difference and realistic plastic yielding | 1.2


An example of positive application-technology spiral

DEEP LEARNING IN COMPUTER VISION

Traditional Computer Vision: Experts + Time

Deep Learning Object Detection: DNN + Data + HPC

Deep Learning Achieves “Superhuman” Results

[Figure: ImageNet results from 2009 to 2016 on a 0% to 100% scale, comparing Traditional CV with Deep Learning]

Slide courtesy of Steve Oberlin, NVIDIA

DIFFERENT MODALITIES OF REAL-WORLD DATA

Images/video: Image → Vision features → Detection

Audio: Audio → Audio features → Speaker ID

Text: Text → Text features → Text classification, machine translation, information retrieval, ....

Slide courtesy of Andrew Ng, Stanford University


A long way to go towards cognitive computing

Image Recognition

Text Extraction

Human Instructions

Speech Recognition

Natural Language Processing

Diagram Understanding

IR

Knowledge Indexing

Knowledge Inferencing

Programming Framework

Hardware Platform

More Heterogeneity Is Coming

Beyond traditional CPUs and GPUs
• FPGAs (e.g., Microsoft FPGA cloud)
• ASICs (e.g., Google’s TPU)

Beyond traditional DRAM
• Stacked DRAM for more memory bandwidth
• Non-volatile RAM for memory capacity
• Near/in memory computing for reduced power used in data movement

Some Lessons Learned

• Throughput computing using GPUs can result in 2-3X end-to-end application-level performance improvement

• GPUs, big data and deep learning have formed a positive spiral for the industry

• GPU computing has so far had narrow but deep impact in the application space
  • Data movement overhead and small GPU memory
    • Unified memory, HBM, NVLink, and HSA-style systems will help
  • Low-level programming interfaces with poor performance portability


Engineering high-efficiency software for heterogeneous computing

Performance-Portability: One Source for All

Challenges | Solutions
Levels of Hierarchy | Codelet Composition
Memory Characteristics | Automatic Data Placement
Resource Sizes | Autotuning
Micro-architecture | Algorithmic Choice
Granularity of Parallelism | Coarsening

Coarsening Scheduling Alternatives

Depth First Order (DFO) Scheduling

DFO Scheduling with Vectorization

Breadth First Order (BFO) Scheduling

BFO with Vectorization

[Figure: four panels, one per scheduling alternative; time progresses as color gets darker]
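The difference between the two orders can be sketched as two loop nests over the same (work-item, statement) pairs. This is a minimal illustration, not the MxPA implementation: the sizes `ITEMS` and `STMTS`, the function names, and the encoding of each pair as `item * STMTS + stmt` are all assumptions.

```c
#include <assert.h>

#define ITEMS 4   /* work-items in a work-group (hypothetical size) */
#define STMTS 3   /* statements in the kernel body (hypothetical size) */

/* order[k] holds item * STMTS + stmt for the k-th executed pair. */

/* Depth-first order: one work-item runs the whole kernel body
   before the next work-item starts. */
void schedule_dfo(int order[ITEMS * STMTS]) {
    int k = 0;
    for (int item = 0; item < ITEMS; ++item)
        for (int stmt = 0; stmt < STMTS; ++stmt)
            order[k++] = item * STMTS + stmt;
}

/* Breadth-first order: every work-item runs one statement
   before any work-item moves on to the next statement. */
void schedule_bfo(int order[ITEMS * STMTS]) {
    int k = 0;
    for (int stmt = 0; stmt < STMTS; ++stmt)
        for (int item = 0; item < ITEMS; ++item)
            order[k++] = item * STMTS + stmt;
}
```

In the breadth-first nest the inner loop applies the same statement to consecutive work-items, which is the shape a compiler can map to vector lanes. Which order gives better locality depends on the kernel's memory access pattern.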

Performance Results

[Figure: speedup (normalized to fastest, 0 to 1) of AMD, Intel, LC (no vec.), and LC on ctcp, hst, hw, kmns, lkct, lmd, lud, mrig, mriq, nw, pbfs, pf, rbfs, sad, sc, sgm, spmv, tpcf, and geo]

Speedups of 3.32x and 1.71x over AMD and Intel OpenCL implementations

Kim et al., CGO’15


Hierarchical Compute Organization of Devices

CPU

1. Process

2. Thread (vector-capable)

3. Vector Lane

4. Instruction-level Parallelism

GPU

1. Grid

2. Block

3. Warp

4. Thread

5. Instruction-level Parallelism

Hierarchical Compute Organization of Devices

CPU

1. Process

2. Thread (vector-capable)

3. Vector Lane

4. Instruction-level Parallelism

    int nt = 1;
    #pragma omp parallel
    {
        // omp_get_num_threads() reports the team size only inside a parallel region
        #pragma omp single
        nt = omp_get_num_threads();    // implicit barrier: every thread sees nt afterwards
        int tile = (len + nt - 1)/nt;
        int j = omp_get_thread_num();  // j and accum are private to each thread
        int accum = 0;
        #pragma unroll
        for(int i = 0; i < tile && j*tile + i < len; ++i) {  // do not run past the end of in
            accum += in[j*tile + i];
        }
        partial[j] = accum;
    }
    int sum = 0;
    for(int j = 0; j < nt; ++j) {
        sum += partial[j];
    }
    return sum;

Hierarchical Compute Organization of Devices

GPU

1. Grid

2. Block

3. Warp

4. Thread

5. Instruction-level Parallelism

    tile = (len + gridDim.x - 1)/gridDim.x;
    sub_tile = (tile + blockDim.x - 1)/blockDim.x;
    accum = 0;
    #pragma unroll
    for(unsigned i = 0; i < sub_tile; ++i) {
        unsigned off = i*blockDim.x + threadIdx.x;      // offset within this block's tile
        if(off < tile && blockIdx.x*tile + off < len)    // stay inside the tile and the array
            accum += in[blockIdx.x*tile + off];
    }
    tmp[threadIdx.x] = accum;
    __syncthreads();
    for(unsigned s = 1; s < blockDim.x; s *= 2) {
        // read the neighbor before any thread overwrites it
        int v = (threadIdx.x >= s) ? tmp[threadIdx.x - s] : 0;
        __syncthreads();
        tmp[threadIdx.x] += v;
        __syncthreads();
    }
    if(threadIdx.x == 0) partial[blockIdx.x] = tmp[blockDim.x-1];
    return; // Launch new kernel to sum up partial

Tangram: Codelet-based Programming Model

(a) Atomic autonomous codelet

    __codelet
    int sum(const Array<1,int> in) {
        unsigned len = in.size();
        int accum = 0;
        for(unsigned i=0; i < len; ++i) {
            accum += in[i];
        }
        return accum;
    }

(b) Atomic cooperative codelet

    __codelet __coop __tag(kog)
    int sum(const Array<1,int> in) {
        __shared int tmp[coopDim()];
        unsigned len = in.size();
        unsigned id = coopIdx();
        tmp[id] = (id < len)? in[id] : 0;
        for(unsigned s=1; s<coopDim(); s *= 2) {
            if(id >= s)
                tmp[id] += tmp[id - s];
        }
        return tmp[coopDim()-1];
    }

(c) Compound codelet using adjacent tiling

    __codelet __tag(asso_tiled)
    int sum(const Array<1,int> in) {
        __tunable unsigned p;
        unsigned len = in.size();
        unsigned tile = (len+p-1)/p;
        return sum( map( sum, partition(in, p,
            sequence(0,tile,len), sequence(1), sequence(tile,tile,len+1))));
    }

(d) Compound codelet using strided tiling

    __codelet __tag(stride_tiled)
    int sum(const Array<1,int> in) {
        __tunable unsigned p;
        unsigned len = in.size();
        unsigned tile = (len+p-1)/p;
        return sum( map( sum, partition(in, p,
            sequence(0,1,p), sequence(p), sequence((p-1)*tile,1,len+1))));
    }

[Figure: composition diagrams for the codelets, with nodes labeled ca, cb, pc, and pd]
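The two compound codelets differ only in how the input is partitioned into `p` tiles. The plain C sketch below shows one reading of that difference; it is not Tangram code, and the helper names `sum_tile`, `sum_adjacent`, and `sum_strided` are hypothetical. Adjacent tiling is assumed to give tile `j` a contiguous range, and strided tiling is assumed to give tile `j` the elements `j, j+p, j+2p, ...`.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of one tile: elements start, start+inc, ... strictly below end. */
static int sum_tile(const int *in, size_t start, size_t inc, size_t end) {
    int accum = 0;
    for (size_t i = start; i < end; i += inc)
        accum += in[i];
    return accum;
}

/* Adjacent tiling: tile j covers the contiguous range [j*tile, (j+1)*tile),
   clipped to the array length. */
int sum_adjacent(const int *in, size_t len, size_t p) {
    assert(p > 0);
    size_t tile = (len + p - 1) / p;
    int total = 0;
    for (size_t j = 0; j < p; ++j) {
        size_t start = j * tile;
        size_t end = start + tile < len ? start + tile : len;
        total += sum_tile(in, start, 1, end);
    }
    return total;
}

/* Strided tiling: tile j covers elements j, j+p, j+2p, ... */
int sum_strided(const int *in, size_t len, size_t p) {
    assert(p > 0);
    int total = 0;
    for (size_t j = 0; j < p; ++j)
        total += sum_tile(in, j, p, len);
    return total;
}
```

Both functions return the same sum for any `p`. The choice matters for memory behavior: contiguous tiles suit one sequential thread per tile, while strided tiles let neighboring threads touch neighboring elements in the same step.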

Tangram: Composition Example

[Figure: composition of codelets ca, cb, pc, and pd in several stages, with __syncthreads() shown between stages]

Tangram Results

[Figure: normalized performance (higher is better, 0 to 1) for scan, spmv, dgemm, kmeans, and bfs; series: Fermi (Reference), Fermi (Tangram), Kepler (Reference), Kepler (Tangram), CPU (Reference), CPU (Tangram)]

Need for Run-time Selection

• Statically determining the best algorithm could be difficult or infeasible
  • Sometimes it is input dependent

• Even a robust compiler or an expert could select a suboptimal sequence of optimizations
  • A catastrophic performance loss could happen

DySel Runtime Selects the Best Version

• Application or compiler provides multiple versions
  • Typically 4-10

• Runtime performs the final selection
  • Apply micro-profiling to sample the performance of each candidate
    • Use a small subset of the actual workload per candidate
    • Contributes to final result
  • Profile candidates concurrently
    • Reduces profiling overhead
  • Incurs less than 8% of overhead in the worst observed case

Productive Profiling Mode

• Computation in profiling also contributes to the final output

[Figure: the workload space is divided into a probational period, in which Version A and Version B are each profiled on part of the workload, and a tenured period, in which the remaining workload is computed; all parts feed the output]
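The idea of productive profiling can be sketched in C as follows. This is a simplified illustration, not the DySel runtime: `Version`, `dysel_run`, and the two example versions are hypothetical names, candidates are profiled one after another rather than concurrently, and `clock()` is a coarse timer that a real runtime would replace.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A candidate version processes work items [begin, end) and writes out[i]. */
typedef void (*Version)(const int *in, int *out, size_t begin, size_t end);

/* Two hypothetical versions of the same computation (out[i] = 2 * in[i]). */
void double_by_mul(const int *in, int *out, size_t b, size_t e) {
    for (size_t i = b; i < e; ++i) out[i] = in[i] * 2;
}
void double_by_add(const int *in, int *out, size_t b, size_t e) {
    for (size_t i = b; i < e; ++i) out[i] = in[i] + in[i];
}

/* Returns the index of the selected version. Each candidate is timed on
   its own slice of `sample` items, and those results stay in `out`. */
size_t dysel_run(const Version *versions, size_t nv,
                 const int *in, int *out, size_t n, size_t sample) {
    assert(nv > 0 && sample > 0);
    size_t best = 0, next = 0;
    double best_time = -1.0;
    for (size_t v = 0; v < nv && next < n; ++v) {    /* probational period */
        size_t end = next + sample < n ? next + sample : n;
        clock_t t0 = clock();
        versions[v](in, out, next, end);
        double t = (double)(clock() - t0) / (double)(end - next);
        if (best_time < 0.0 || t < best_time) { best_time = t; best = v; }
        next = end;
    }
    if (next < n)                                    /* tenured period */
        versions[best](in, out, next, n);
    return best;
}
```

Because every profiled slice writes real output, the only cost of selection is running some slices with a slower version. The output is complete and correct whichever candidate wins.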

Case Study: Input-dependent Optimizations

• Best optimizations could be input-dependent

[Figure, two panels: relative execution time over oracle (lower is better, axis from 0.00 to 3.00) for a random matrix and a diagonal matrix.
(a) CPU: Oracle, Sync, Async (best initial selection), Async (worst initial selection), scalar DFO, scalar BFO, vector DFO, vector BFO, Worst; bars beyond the axis are labeled 8.63, 8.63, and 8.60.
(b) GPU: Oracle, Sync, Async (best initial selection), Async (worst initial selection), Scalar, Vector, Worst; bars beyond the axis are labeled 4.73, 4.73, 22.73, and 22.73.]

Conclusion and Outlook

• Applications have a very large appetite for more computing power
  • Both larger scale clusters and faster devices

• Heterogeneity has become the norm for all hardware systems
  • The HPC community is currently seeing about 2-3x application speedup
  • Recent positive spiral between deep learning and GPU computing
  • More positive spirals are yet to come

• Performance portability is critical for broad software adoption
  • There is a critical need for programming systems with strong support for portability
  • Performance portability involves several dimensions of technical challenges
  • Unfortunately, vendors have not been interested in solving this problem.


Thank you!


Backup Slides


[Figure: coarse-grain CPU threads and fine-grain GPU threads, related by Automatic Parallelization and Thread Coarsening]


OpenCL/CUDA to CPU Compilers

Compiler | Basic Coarsening (DFO) | Vectorization | Locality-aware Scheduling (DFO vs. BFO)
AMD | No | No | No
MCUDA | Yes | No | No
SnuCL | Yes | No | No
Karrenberg & Hack | Yes | Yes | No
pocl | Yes | Yes | No
Intel | Yes | Yes | No
MxPA | Yes | Yes | Yes


Data Placement Options

CPU
• Global memory
• Caches (data tiling)
• Registers

GPU
• Global memory
• Caches (data tiling)
• Registers
+
• Scratchpad memory
• Constant memory
• Texture memory

Rule-based vs. Model-based

• Rule-based (e.g., Jang et al.)
  • Heuristics on the memory access pattern

• Model-based (e.g., PORPLE)
  • Create a model of the memory subsystem
  • Slower but more accurate

Tangram’s Rule-based Data Placement

[Figure: decision flowchart for placing a container. Questions on memory access characteristics (shared?, stride 0?, stride 1?, constant stride) and on memory system features (texture?, scratchpad?, shuffle?, cache?) lead to Transpose, Texture, Scratchpad, Registers, or Global; one path is marked "candidate for on-chip memory"]


GPU Tuning: Scan Case Study

[Figure: performance (% of best, higher is better) in four configurations: tuned for Fermi and run on Fermi; tuned for Fermi and run on Kepler (architecture upgrade); tuned for Kepler and run on Kepler (retune); re-optimized for Kepler and run on Kepler (re-optimize)]


Scratchpad atomics performance (stream compaction)

[Figure: execution time (ms, axis from 0.0 to 10.0) versus fraction of filtered items (5 to 100) for GTX580 (Fermi GF110), K40c (Kepler GK110), and GTX980 (Maxwell GM204); two data points are labeled 0.3380 and 0.7782]


Motivation Backup

• Codesign among diverse areas will be required to reach exascale
  • Every level of the computational stack is a potential bottleneck.

• XPACC code will need to run efficiently and portably on next-generation heterogeneous platforms (CPUs, GPUs, Xeon Phis)

Initial Production Use Results

• NAMD
  • 100 million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included
  • 768 nodes, Kepler+Interlagos is 3.9X faster than Interlagos-only
  • 768 nodes, XK7 is 1.8X XE6

• Chroma
  • Lattice QCD parameters: grid size of 48^3 x 512 running at the physical values of the quark masses
  • 768 nodes, Kepler+Interlagos is 4.9X faster than Interlagos-only
  • 768 nodes, XK7 is 2.4X XE6

• QMCPACK
  • Full run Graphite 4x4x1 (256 electrons), QMC followed by VMC
  • 700 nodes, Kepler+Interlagos is 4.9X faster than Interlagos-only
  • 700 nodes, XK7 is 2.7X XE6

Blue Waters Science Production Applications

• Work with science teams to effectively use GPUs in their production code.
  • ChaNGa – cosmological simulation, University of Washington
  • AWP – earthquake simulation, Southern California Earthquake Center

• Significant speedup by tuning kernels to specific GPU characteristics
  • Real-world opportunities for performance portability

GPU Kernel Optimizations

Application | Baseline Running Time (ms) | Optimized Running Time (ms) | Speedup
ChaNGa | 1.35 | 1.16 | 2.11
AWP | 61.6 | 43.3 | 1.33

Levels of GPU Programming Interfaces

Current generation: CUDA, OpenCL, DirectCompute

Next generation: OpenACC, HCC++, Thrust, Bolt
• Simplifies data movement, kernel details and kernel launch
• Same GPU execution model (but less boilerplate)

Prototype & in development: X10, Chapel, Nesl, Delite, Par4all, Tangram...
• Implementation manages GPU threading and synchronization invisibly to user

Portability - CPU vs. GPU Code Versions

Maintaining multiple code versions is extremely expensive

Most CUDA/OpenCL developers maintain the original CPU version

Many developers report that when they back-ported the CUDA/OpenCL algorithms to CPU, they got better performing code
• Locality, SIMD, multicore

MxPA is designed to automate this process (John Stratton, Hee-Seok Kim, Izzat El Hajj)

Performance Library

A major qualifying factor for new computing platforms
• MKL, BLAS, CUSPARSE, Thrust, FFT, OpenCV, CUDNN, etc.
• Currently redeveloped and hand-tuned for each HW type/generation

Exascale HW expected to have increasing levels of heterogeneity, parallelism, and hierarchy
• Increasing levels of memory heterogeneity and hierarchy
• Increasing SIMD width and types/number of cores

Performance library development process must keep up with the HW evolution and diversification
• Performance portability

SCF 2016

[Figure: CPU timeline: 2003, 1 core; 2005, 2 cores; 2006, 4 cores; 2010, 6 cores]

[Figure: the same timeline extended with heterogeneous devices; labels include many-core (2007, 2010, 2012), NVIDIA Maxwell, Stellarton, CPU+FPGA, SoC (1 core, 2 cores, 6 cores), APU (1st, 2nd, and 3rd gen), Kaveri, and the years 2008, 2011, and 2014]


Portability Backup

Granularity Tuning (OpenCL)

Results of thread coarsening for Parboil benchmarks (written for NVIDIA SIMT GPUs) on AMD Radeon HD6990 (VLIW-5)

Results compiled using MulticoreWare’s SlotMaximizer

* Not a single kernel
** Results from more than one dimension coarsening

Reduction – CPU vs. GPU (Part 1)

CPUs favor intra-thread locality
GPUs favor inter-thread locality (within Work Groups)

Tree-shaped parallel reduction

Reduction – CPU vs. GPU (Part 2)

CPU: 2-level hierarchy
GPU: 4-level hierarchy

Collect from Work Group partial results
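
The contrast between the two reduction styles can be sketched in plain C++. This is only an illustration under stated assumptions, not the code behind the slides: `reduce_cpu_style`, `reduce_tree_style`, and their parameters are hypothetical names, and the GPU-style version emulates a work-group's tree-shaped reduction sequentially on the host.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU style: each (logical) thread walks its own contiguous chunk, which
// favors intra-thread locality; the partial sums are combined afterwards.
long long reduce_cpu_style(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        for (std::size_t i = begin; i < end; ++i) partial[t] += data[i];
    }
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}

// GPU style (emulated sequentially): each work-group reduces its tile with a
// tree-shaped reduction, then the work-group partial results are collected.
long long reduce_tree_style(const std::vector<int>& data, std::size_t group_size) {
    assert(group_size > 0);
    long long total = 0;
    for (std::size_t base = 0; base < data.size(); base += group_size) {
        const std::size_t n = std::min(group_size, data.size() - base);
        std::vector<long long> local(data.begin() + base, data.begin() + base + n);
        // At each step, "work-item" i adds the element `stride` slots away.
        for (std::size_t stride = 1; stride < n; stride *= 2) {
            for (std::size_t i = 0; i + stride < n; i += 2 * stride) {
                local[i] += local[i + stride];
            }
        }
        total += local[0];  // collect this work-group's partial result
    }
    return total;
}
```

Both functions return the same sum; they differ only in the order in which elements are visited, which is what determines the locality each device favors.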

CPU Parameter Tuning

Figure: Mandelbrot performance with vector width. Speedup (y-axis, 0 to 8) vs. image size (x-axis, 128 to 4096) for Scalar, SSE, and AVX versions. Results courtesy of intel.com

GPU Parameter Tuning

Non-portable tile sizes

Figure: relative performance (y-axis, 0% to 100%, all running on Fermi GPU) of three versions: original version tuned for Tesla, tiling parameters retuned for Fermi, and tiling restructured. Bar labels shown: 58% and 68%.


Slide courtesy of nvidia.com

Figure: programming interfaces by platform.
• Multicore CPU, Xeon Phi: C/FORTRAN with OpenMP, TBB, Pthreads, Cilk… + SIMD intrinsics
• GPU: CUDA, OpenCL
• FPGA: Verilog, VHDL

MxPA: CUDA, OpenCL
• Locality-centric work-item scheduling
• Speedups of 3.32x and 1.71x over AMD and Intel OpenCL implementations

Tangram


Tangram Backup


Devices have different architectural hierarchies

Computation Codelets

Decomposition Codelets

Programmer writes architecture-neutral computations and decomposition rules

Compiler maps computations to each level of the hierarchy…

…and decomposition rules between each level


DySel Backup

• Pronounced as diesel /ˈdiːzəl/
• Implies low cost and high efficiency
  • Diesel was cheaper than regular gas when we submitted the paper… :v
• A small but useful tool to help compiler optimization developers

Motivation

• Statically determining the optimal code can be difficult or even infeasible
  • Sometimes it is input dependent
• Even a robust compiler or an expert can select a suboptimal sequence of optimizations
  • A catastrophic performance loss can result

Example: Intel OpenCL Vectorization for CPU

• Suboptimal heuristic for vectorization in sgemm and spmv-jds

Figure annotations: 2.13X↓, 1.24X↓

Relax the Constraints

• Instead of asking the compiler for the one optimized version it thinks is best
• Ask the compiler for multiple competitive versions
  • A typical number is around 4-10
• Let the runtime do the final selection

Version Selection at Runtime

• We propose DySel for dynamic version selection at runtime
• Apply micro-profiling to sample the performance of each candidate

Micro-Profiling

• Profile a kernel with a smaller workload
  • A smaller number of work-groups/thread blocks
• Avoid a large impact on performance
• Multiple micro-profiling runs can be scheduled and even executed concurrently

Productive Profiling Mode

• Computation in profiling also contributes to the final output

Figure: profile runs of Version A and Version B, followed by compute, all feeding the Output.
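
Micro-profiling with productive profiling can be sketched as follows. This is a minimal host-side illustration under stated assumptions, not the DySel runtime: `KernelVersion` and `run_with_selection` are hypothetical names, candidates are profiled one after another rather than concurrently, and wall-clock timing stands in for whatever measurement the real runtime uses.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// A kernel version computes work-groups [begin, end) and writes their results.
using KernelVersion = std::function<void(std::size_t begin, std::size_t end)>;

// Each candidate is timed on its own small slice of the workload, so profiled
// work still contributes to the output. The fastest candidate then computes
// the remaining work-groups. Returns the index of the selected version.
std::size_t run_with_selection(const std::vector<KernelVersion>& versions,
                               std::size_t num_groups,
                               std::size_t groups_per_profile) {
    assert(!versions.empty() && groups_per_profile > 0);
    using Clock = std::chrono::steady_clock;

    // Too little work to profile every candidate: just run the first version.
    if (num_groups < versions.size() * groups_per_profile) {
        versions[0](0, num_groups);
        return 0;
    }

    std::size_t next = 0;  // first work-group not yet computed
    std::size_t best = 0;
    Clock::duration best_time = Clock::duration::max();
    for (std::size_t v = 0; v < versions.size(); ++v) {
        const auto start = Clock::now();
        versions[v](next, next + groups_per_profile);  // productive profiling
        const Clock::duration elapsed = Clock::now() - start;
        next += groups_per_profile;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = v;
        }
    }

    versions[best](next, num_groups);  // remaining workload uses the winner
    return best;
}
```

Because every slice is computed exactly once, the output is complete no matter which version wins; only the time to produce it depends on the selection.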

Synchronous vs Asynchronous Scheduling

• Synchronous: Schedule the remaining workload after the best version is finalized
• Asynchronous: Schedule remaining workload eagerly in a batch using the current best candidate

Figure: workload-vs-time timelines of the profile and compute phases for (a) Sync (blocking), (b) Async (bad default), (c) Async (good default).

Sync vs Async Scheduling

Figure: DySel runtime flow for the two scheduling modes.

(a) Sync
• Kernel Version Generator produces candidates K1 … K5
• Assign workgroups to each Ki
• Productive micro-profiling of Ki
• Kselect = best Ki
• Apply Kselect to compute the remaining workload

(b) Async
• Kernel Version Generator produces candidates K1 … K5
• Suggest an initial Kdefault
• Assign workgroups to each Ki
• Productive micro-profiling of Ki
• Profiling finished?
  • no: schedule Kdefault for a batch of work groups; update Kdefault using the best profiled Ki so far
  • yes: Kselect = best Ki; apply Kselect to compute the remaining workload


Things I skipped

• The two extra profiling modes
• Applicability and resource requirements of each mode
• What kinds of compiler analyses are needed for the different modes
• Where compilers add profiling code on both CPU and GPU
• More details about the DySel runtime using TBB and CUDA

DySel Interface

(a) Kernel Implementation Registration API

DySelAddKernel(
    string kernel_sig,              // kernel name
    func_ptr implementation,        // kernel implementation
    dim3 wa_factor,                 // work assignment factor
    vector<int> sandbox_index = []  // argument offsets for sandboxes/private outputs
);

(b) Kernel Launch API

DySelLaunchKernel(
    string kernel_sig,              // kernel name
    bool profiling = true,          // profiling activation flag
    enum mode = fully_async         // profiling mode
);

Case Study: Locality-centric Scheduling for CPU OpenCL

• Choose whether to iterate in-kernel loops or work-item loops first for OpenCL on CPU (CGO’15), using MxPA
  • Decided by analyzing access patterns
• It is open-source and robust
  • “3.32x over AMD, 1.71x over Intel OpenCL stacks”

Figure annotation: 1.15X↓

Case Study: Data Placement for GPU

• Data placement optimizations are crucial for performance on GPUs (TPDS 2011 & MICRO 2014)
  • Although they are not open-source, they did show the transformed results
• Suboptimal decisions due to inaccurate model or improper heuristic

Figure: relative execution time over oracle (lower is better, y-axis 0.00 to 2.50) for spmv-csr and particlefilter. Series: Oracle, Sync, Async (best initial selection), Async (worst initial selection), PORPLE, Heuristic-based, Worst. Annotations: 1.29X↓, 2.29X↓

Case Study: Experts’ Mixed Optimizations

• Parboil provides multiple versions with different optimization strategies
  • Optimized versions usually run better
• Some optimizations are improper or redundant
  • E.g. loop unrolling and prefetching in spmv-jds on Kepler

Figure: relative execution time over oracle (lower is better, y-axis 0.00 to 2.00) for cutcp, sgemm, spmv-jds, stencil, and GeoMean on (a) CPU and (b) GPU. Series: Oracle, Sync, Async (best initial selection), Async (worst initial selection), Worst. Off-scale bars are labeled 7.74 and 2.28.

Case Study: Input-dependent Optimizations

• Best optimizations could be input-dependent

Figure: relative execution time over oracle (lower is better, y-axis 0.00 to 3.00) for a random matrix and a diagonal matrix on (a) CPU and (b) GPU.
• First panel series: Oracle, Sync, Async (best initial selection), Async (worst initial selection), scalar DFO, scalar BFO, vector DFO, vector BFO, Worst. Off-scale bars are labeled 8.63, 8.63, and 8.60.
• Second panel series: Oracle, Sync, Async (best initial selection), Async (worst initial selection), Scalar, Vector, Worst. Off-scale bars are labeled 4.73, 4.73, 22.73, and 22.73.

Conclusion

• DySel can deliver high accuracy and low overhead for dynamic version selection in data-parallel programming models
  • Incurs less than 8% overhead in the worst observed case
• Using DySel is like buying insurance…


MxPA Backup

Contributions

• Exploiting data locality in scheduling work-items for performance
• Measurements on a real system demonstrate speedups of 3.32x and 1.71x over AMD and Intel OpenCL implementations
  • 18 benchmarks from Parboil and Rodinia
• Nominated for best paper award at CGO’15
• AE certified

OpenCL Programming Model

Figure: a Device contains a Global Memory and several Compute Units, each with its own Local Memory. A Kernel is launched as Work Groups, and each Work Group consists of Work Items.

OpenCL Execution Model

Kernel code:

void kernel(…) {
    i_0; i_1; …; i_(a-1);
    barrier();
    i_a; i_(a+1); …; i_(b-1);
}

Legend:
• i_i: instruction or instruction block; arrows mark immediate dependencies
• barrier for work-items in a work-group
• wi = work-item, wg = work-group, LS = local size, GS = global size

Figure: execution graph. Each work-item wi_0 … wi_(GS-1) executes instructions i_0 … i_(n-1); the barrier separates region0 (i_0 … i_(a-1)) from region1 (starting at i_a). Work-items are grouped into work-groups: wg_0 holds wi_0 … wi_(LS-1), wg_1 holds wi_LS … wi_(2LS-1), and so on up to wg_(GS/LS-1).

How to schedule this execution graph on a multicore CPU?

Work-group Scheduling

• Assign whole work-groups to different CPU cores
  • Considerations: Locality, Load balance


Region Scheduling

• Serialize barrier-separated regions

Work-item Scheduling

• How to schedule work-items within a region?
  • Different approaches by different compilers
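
The three scheduling levels (work-groups, barrier-separated regions, work-items) can be sketched as nested loops. The kernel here is a made-up example chosen so that the barrier matters, and `run_rotate_kernel` is a hypothetical name; a real implementation would distribute the outer work-group loop across CPU cores instead of running it sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel with one barrier, i.e. two regions:
//   region0: local[lid] = in[gid];
//   barrier();
//   region1: out[gid] = local[(lid + 1) % LS];
std::vector<int> run_rotate_kernel(const std::vector<int>& in, std::size_t local_size) {
    assert(local_size > 0 && in.size() % local_size == 0);
    std::vector<int> out(in.size());
    const std::size_t num_groups = in.size() / local_size;

    for (std::size_t wg = 0; wg < num_groups; ++wg) {  // work-group scheduling
        std::vector<int> local(local_size);            // per-work-group local memory

        // region0 for every work-item; completing this loop acts as the barrier
        for (std::size_t lid = 0; lid < local_size; ++lid) {
            local[lid] = in[wg * local_size + lid];
        }
        // region1 for every work-item, safe to read neighbors' local data
        for (std::size_t lid = 0; lid < local_size; ++lid) {
            out[wg * local_size + lid] = local[(lid + 1) % local_size];
        }
    }
    return out;
}
```

Serializing the regions removes the need for a real barrier on the CPU; what remains open is the order in which work-items are visited inside each region.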

Existing Approaches

• Industry
  • Intel
  • AMD (Twin Peaks)
• Academia
  • Karrenberg & Hack
  • SnuCL
  • pocl

Depth First Order (DFO) Scheduling

DFO Scheduling with Vectorization (time progresses as color gets darker)

Memory Access Patterns

• e.g. bfs (each thread traverses a list of neighbors)
• e.g. sgemm (threads computing adjacent outputs access adjacent inputs)
• e.g. kmeans (all threads loop over the same mean values)


DFO and Locality


Alternative Schedule: BFO

Breadth First Order (BFO) Scheduling

BFO with Vectorization (time progresses as color gets darker)
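
The difference between the two orders comes down to which loop is outermost. The sketch below records the addresses touched under each order; the row-per-work-item `address` layout and all function names are assumptions made for illustration, not taken from MxPA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Address touched by work-item `wi` in loop iteration `it`. Assumed layout:
// loop iteration stride 1, work-item stride `iters` (one row per work-item).
std::size_t address(std::size_t wi, std::size_t it, std::size_t iters) {
    assert(it < iters);
    return wi * iters + it;
}

// DFO: finish all loop iterations of one work-item before the next work-item.
std::vector<std::size_t> trace_dfo(std::size_t work_items, std::size_t iters) {
    std::vector<std::size_t> trace;
    for (std::size_t wi = 0; wi < work_items; ++wi)
        for (std::size_t it = 0; it < iters; ++it)
            trace.push_back(address(wi, it, iters));
    return trace;
}

// BFO: run one loop iteration for every work-item before the next iteration.
std::vector<std::size_t> trace_bfo(std::size_t work_items, std::size_t iters) {
    std::vector<std::size_t> trace;
    for (std::size_t it = 0; it < iters; ++it)
        for (std::size_t wi = 0; wi < work_items; ++wi)
            trace.push_back(address(wi, it, iters));
    return trace;
}
```

With 2 work-items and 3 iterations, DFO touches addresses 0, 1, 2, 3, 4, 5 while BFO touches 0, 3, 1, 4, 2, 5. For this layout DFO is the contiguous order; a layout with work-item stride 1 would reverse the outcome.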

DFO’s vs. BFO’s Impact on Locality

Figure: L1 data cache misses (normalized to worst, y-axis 0 to 1) under DFO and BFO for sgm, ctcp, mrig, tpcf, sc, hw, kmns, hst, mriq, nw, spmv, lkct, lud, pf, sad, pbfs, rbfs, lmd, and geo. Benchmarks are grouped into those where BFO has better locality and those where DFO has better locality.

BFO has better locality for 13 benchmarks, DFO has better locality for 5 benchmarks. No schedule is always the best.

Locality Centric (LC) Scheduling

Figure: decision flow applied to each kernel region: contains loop? If yes, classify memory accesses in loop, then: prefers BFO? If yes, BFO scheduling; otherwise DFO scheduling.

Figure: DFO and BFO schedules for work-items wi_0, wi_1, …, wi_(LS-1), each executing i_before, the loop iterations i_0, i_1, …, i_(N-1), and i_after.

Locality Centric (LC) Scheduling

                         Work-item Stride
Loop Iteration Stride    0      1      Other
0                        -      DFO    DFO
1                        BFO    -      DFO
Other                    BFO    BFO    -

Classify memory accesses per loop body and tally which schedule has greater popularity
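
The table and the tally can be sketched as a small voting function. The type and function names are hypothetical, and two details are assumptions rather than something the slide states: ties fall back to DFO, and the function is only called for loops with at least one classified access.

```cpp
#include <cassert>
#include <vector>

enum class Stride { Zero, One, Other };
enum class Schedule { DFO, BFO };

struct Access {
    Stride work_item;  // stride between adjacent work-items
    Stride loop_iter;  // stride between consecutive loop iterations
};

// Order strides so that a smaller value means tighter locality: 0 < 1 < Other.
int stride_rank(Stride s) {
    return s == Stride::Zero ? 0 : (s == Stride::One ? 1 : 2);
}

// Vote of a single access, following the table:
// +1 for BFO, -1 for DFO, 0 for the "-" entries on the diagonal.
int vote(const Access& a) {
    if (stride_rank(a.work_item) < stride_rank(a.loop_iter)) return +1;
    if (stride_rank(a.loop_iter) < stride_rank(a.work_item)) return -1;
    return 0;
}

// Tally the votes of all accesses in one loop body.
Schedule choose_schedule(const std::vector<Access>& loop_accesses) {
    assert(!loop_accesses.empty());  // assumed precondition
    int tally = 0;
    for (const Access& a : loop_accesses) tally += vote(a);
    return tally > 0 ? Schedule::BFO : Schedule::DFO;  // ties go to DFO (assumption)
}
```

The rule the table encodes is that the dimension with the smaller stride should be iterated innermost, so an access whose work-item stride is smaller than its loop iteration stride votes for BFO, and the reverse votes for DFO.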

LC’s Impact on Locality

Figure: L1 data cache misses (normalized to worst, y-axis 0 to 1) under DFO, BFO, and LC for sgm, ctcp, mrig, tpcf, sc, hw, kmns, hst, mriq, nw, spmv, lkct, lud, pf, sad, pbfs, rbfs, lmd, and geo. Benchmarks are grouped into those where BFO has better locality and those where DFO has better locality.

LC captures the best of both schedules

Locality Results

Figure: L1 data cache misses (normalized to worst, y-axis 0 to 1) for AMD, Intel, and LC on sgm, ctcp, tpcf, mrig, lkct, sc, lmd, kmns, hw, hst, pf, lud, mriq, nw, spmv, sad, pbfs, rbfs, and geo.

LC has best locality for most benchmarks

Performance Results

Figure: speedup (normalized to fastest, y-axis 0 to 1) of AMD, Intel, LC (no vec.), and LC on ctcp, hst, hw, kmns, lkct, lmd, lud, mrig, mriq, nw, pbfs, pf, rbfs, sad, sc, sgm, spmv, tpcf, and geo.

LC (with vec.) outperforms AMD (without vec.) and Intel (with vec.) by 3.32x and 1.71x

LC (without vec.) is faster than Intel (with vec.) by 1.04x

Summary

• Proposed an alternative scheduling approach to the state-of-the-art
• Demonstrated that no schedule is always best and proposed a static schedule selection
• Outperformed industry implementations in memory system efficiency and performance

Heterogeneous Computing in Blue Waters

Blue Waters contains 4,224 Cray XK7 compute nodes.

Dual-socket Node
• One AMD Interlagos chip
  • 8 core modules, 32 threads
  • 156.5 GFLOPS peak performance
  • Consumes 2,504 GB of data per second
  • 32 GB memory
  • 51 GB/s bandwidth
• One NVIDIA Kepler chip
  • 1.3 TFLOPS peak performance
  • Consumes 20,800 GB of data per second
  • 6 GB GDDR5 memory
  • 250 GB/s bandwidth
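
The "consumes" figures match the peak rates multiplied by 16 bytes per floating-point operation. Reading that as two 8-byte operands per operation is an assumption; the slide does not say how the figures were derived. Dividing each figure by the corresponding memory bandwidth gives the gap that data reuse in the memory hierarchy has to cover:

```latex
\begin{aligned}
\text{Interlagos:}\quad & 156.5\ \text{GFLOPS} \times 16\ \text{B/FLOP} = 2{,}504\ \text{GB/s},
  & \frac{2{,}504\ \text{GB/s}}{51\ \text{GB/s}} &\approx 49 \\
\text{Kepler:}\quad & 1{,}300\ \text{GFLOPS} \times 16\ \text{B/FLOP} = 20{,}800\ \text{GB/s},
  & \frac{20{,}800\ \text{GB/s}}{250\ \text{GB/s}} &\approx 83
\end{aligned}
```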