Optimizing Loop Performance for Clustered VLIW Architectures

by Yi Qian (Texas Instruments)

Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)

Post on 05-Jan-2016


Clustered VLIW Architecture

Motivation

Clustered VLIW architectures have been adopted to improve ILP and keep the port requirement of the register files low.

The compiler must

Expose maximal parallelism,

Maintain minimal communication overhead.

High-level optimizations can improve loop performance on clustered VLIW machines.

Background

Software Pipelining – modulo scheduling

Achieve ILP by overlapping the execution of different loop iterations.

Initiation Interval (II)

ResII -- constraints from the machine resources.

RecII -- constraints from the dependence recurrences.

MinII = max(ResII, RecII)
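As a rough illustration of the bound above, the following C sketch computes ResII, RecII and MinII using the usual modulo-scheduling formulation (ceiling of operations per unit class, and ceiling of latency over distance per recurrence cycle). The function names and the array-based machine description are assumptions made for this example and do not come from the slides.

```c
/* Ceiling division for positive integers. */
int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* ResII: the most heavily used resource class bounds the interval.
   ops[k]   = operations per iteration that need unit class k
   units[k] = number of units of class k (hypothetical machine model) */
int res_ii(const int *ops, const int *units, int classes) {
    int ii = 1;
    for (int k = 0; k < classes; ++k) {
        int need = ceil_div(ops[k], units[k]);
        if (need > ii) ii = need;
    }
    return ii;
}

/* RecII: every dependence cycle needs ceil(latency / distance) cycles.
   latency[c]  = total latency along recurrence cycle c
   distance[c] = total iteration distance along recurrence cycle c */
int rec_ii(const int *latency, const int *distance, int cycles) {
    int ii = 1;
    for (int c = 0; c < cycles; ++c) {
        int need = ceil_div(latency[c], distance[c]);
        if (need > ii) ii = need;
    }
    return ii;
}

/* MinII = max(ResII, RecII) */
int min_ii(int res, int rec) { return res > rec ? res : rec; }
```

With 4 memory operations on 1 memory unit and 2 computational operations on 1 computational unit, `res_ii` returns 4; a single recurrence with latency 3 and distance 1 gives a RecII of 3, so MinII is 4.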

Loop Transformations

Scalar Replacement

replace array references with scalar variables.

improve register usage

Original loop:

    for (i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            a[i] = a[i] + b[j] * x[i][j];

After scalar replacement:

    for (i=0; i<n; ++i) {
        t = a[i];
        for (j=0; j<n; ++j)
            t = t + b[j] * x[i][j];
        a[i] = t;
    }
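The two versions can be compared directly. The sketch below wraps each loop nest in a function (the function names and the use of C99 variable-length array parameters are choices made for this example) so that the results can be checked against each other. The scalar-replaced version keeps the running sum in `t`, which a compiler can hold in a register, so `a[i]` is loaded once and stored once per outer iteration.

```c
/* Original loop nest: a[i] is loaded and stored in every inner iteration. */
void accumulate_original(int n, double a[n], double b[n], double x[n][n]) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            a[i] = a[i] + b[j] * x[i][j];
}

/* Scalar-replaced loop nest: the array element lives in the scalar t
   for the duration of the inner loop. */
void accumulate_scalar_replaced(int n, double a[n], double b[n], double x[n][n]) {
    for (int i = 0; i < n; ++i) {
        double t = a[i];
        for (int j = 0; j < n; ++j)
            t = t + b[j] * x[i][j];
        a[i] = t;
    }
}
```

Both functions perform the same floating-point operations in the same order, so they produce identical results.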

Loop Transformations

Unrolling

reduce inter-iteration overhead

enlarge loop body size

Unroll-and-jam

balance the computation and memory-access requirements

improve uMinII (MinII / unrollAmount)

Example (1 computational unit, 1 memory unit)

original loop:

    for (i=1; i<=2*n; ++i)
        for (j=1; j<=n; ++j)
            a[i][j] = a[i][j] + b[j] * c[j];

uMinII = 4

unroll-and-jammed loop:

    for (i=1; i<=2*n; i+=2)
        for (j=1; j<=n; ++j) {
            a[i][j]   = a[i][j]   + b[j] * c[j];
            a[i+1][j] = a[i+1][j] + b[j] * c[j];
        }

uMinII = 3
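The uMinII values of this example can be reproduced with a small calculation. The C sketch below is only an illustration: it assumes that `b[j]`, `c[j]` and their product are shared by all unrolled copies, that the loop has no binding recurrence, and that every operation occupies its unit for one cycle. The operation-count formulas and function names are assumptions of this sketch, not taken from the slides.

```c
/* Ceiling division for positive integers. */
int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Memory operations per unrolled inner iteration:
   u loads and u stores of a[..][j], plus one load each of b[j] and c[j]. */
int mem_ops(int u) { return 2 * u + 2; }

/* Computational operations: one shared multiply and u adds. */
int compute_ops(int u) { return u + 1; }

/* uMinII = MinII / unroll amount, with MinII taken as the resource bound. */
double u_min_ii(int u, int mem_units, int compute_units) {
    int res_mem = ceil_div(mem_ops(u), mem_units);
    int res_cmp = ceil_div(compute_ops(u), compute_units);
    int min_ii = res_mem > res_cmp ? res_mem : res_cmp;
    return (double)min_ii / u;
}
```

With one unit of each kind, `u_min_ii(1, 1, 1)` is 4.0 and `u_min_ii(2, 1, 1)` is 3.0, matching the two uMinII values on the slide. Under the same assumptions an unroll amount of 4 would give 2.5.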

Loop Transformations

Unroll-and-jam/unrolling

generate intercluster parallelism

    for (i=0; i<2*n; ++i)
        a[i] = a[i] + 1;

    for (i=0; i<2*n; i+=2) {
        /* cluster 0 */ a[i]   = a[i] + 1;
        /* cluster 1 */ a[i+1] = a[i+1] + 1;
    }

    for (i=0; i<2*n; ++i)
        a[i] = a[i-1] + 1;

    for (i=0; i<2*n; i+=2) {
        /* cluster 0 */ a[i]   = a[i-1] + 1;
        /* cluster 1 */ a[i+1] = a[i] + 1;
    }
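The second pair of loops carries a dependence from each iteration to the next, so after unrolling each cluster consumes a value produced on the other cluster. A minimal C sketch of that case follows. It starts the loop at index 1 so that `a[i-1]` is always in bounds (a deviation from the slide made for this example), assumes an even trip count, and uses hypothetical function names.

```c
/* Original recurrence: each element depends on its predecessor. */
void recurrence(int *a, int len) {
    for (int i = 1; i < len; ++i)
        a[i] = a[i - 1] + 1;
}

/* Unrolled by 2; assumes (len - 1) is even.
   The comments mark the cluster each copy would be assigned to. */
void recurrence_unrolled(int *a, int len) {
    for (int i = 1; i < len; i += 2) {
        /* cluster 0: reads a value written on cluster 1 */
        a[i] = a[i - 1] + 1;
        /* cluster 1: reads a value written on cluster 0 */
        a[i + 1] = a[i] + 1;
    }
}
```

Both functions compute the same array, but in the unrolled form every value crosses a cluster boundary. This is the kind of communication the cost model has to account for.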

Loop Transformations

Loop Alignment

Remove loop-carried dependences

Alignment conflicts

Used to determine intercluster communication cost

Before alignment:

    for (i=1; i<n; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i-1] * q;
    }

After alignment:

    x[1] = a[0] * q;
    for (i=1; i<n-1; ++i) {
        a[i] = b[i] + c[i];
        x[i+1] = a[i] * q;
    }
    a[n-1] = b[n-1] + c[n-1];

Loops with alignment conflicts:

    for (i=1; i<n; ++i) {
        a[i] = b[i] + q;
        c[i] = a[i-1] + a[i-2];
    }

    for (i=1; i<n; ++i)
        a[i] = a[i-1] + b[i];

[Figure labels: <1>, <2>]
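The alignment example can be checked by running both forms on the same input. The C sketch below wraps the original and the aligned loop in functions (the names and parameter order are choices made for this example) and assumes `n >= 2` so that the peeled statements are valid.

```c
/* Original loop: x[i] uses a[i-1], a value written in the previous iteration. */
void loop_original(int n, const double *b, const double *c, double q,
                   double *a, double *x) {
    for (int i = 1; i < n; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i - 1] * q;
    }
}

/* Aligned loop: the use of a[] is shifted by one iteration so that it reads
   the value written in the same iteration. The first use and the last
   definition are peeled out of the loop. Assumes n >= 2. */
void loop_aligned(int n, const double *b, const double *c, double q,
                  double *a, double *x) {
    x[1] = a[0] * q;
    for (int i = 1; i < n - 1; ++i) {
        a[i] = b[i] + c[i];
        x[i + 1] = a[i] * q;
    }
    a[n - 1] = b[n - 1] + c[n - 1];
}
```

After alignment the dependence between the two statements stays inside one iteration, so it no longer crosses iterations.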

Related Work

Partitioning Problem

Ellis -- BUG

Capitanio et al. -- LC-VLIW

Nystrom et al. -- cluster assignment & software pipelining

Ozer et al. -- UAS

Sanchez et al. -- unified method

Hiser et al. -- RCG

Aleta et al. -- pseudo-scheduler

Loop Transformations

Scalar Replacement

Callahan et al. -- pipelined architectures

Carr, Kennedy -- general algorithm

Duesterwald -- data flow framework

Loop Alignment

Allen et al. -- shared-memory machines

Unrolling/Unroll-and-jam

Callahan et al. -- pipelined architectures

Carr, Kennedy -- ILP

Carr, Guan -- linear algebra

Carr -- cache, software pipelining

Sarkar -- ILP, IC

Sanchez et al. -- clustered machines

Huang et al. -- clustered machines

Shin et al. -- superword register files

Optimization Strategy

[Flow diagram] Source Code -> Unroll-and-jam/Unrolling -> Scalar Replacement -> Intermediate Code Generator -> Data-flow Optimization -> Value Cloning -> Register Partitioning -> Software Pipelining -> Assembly Code Generator -> Target Code

Our Method

Picking loops to unroll

Computing uMinII

Computing register pressure (see paper)

Determining unroll amounts

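Picking Loops to Unroll

$l_a$ (unroll amount $U_a$): carries the most dependences that are amenable to scalar replacement.

$l_p$ (unroll amount $U_p$): contains the fewest alignment conflicts.

Computing uMinII

uRecII does not increase

\[ \mathrm{uResII} = \frac{\max\{F/FU_F,\ M/FU_M,\ C/FU_C\}}{U_a \times U_p} \]

where

\[ F = f \times U_a \times U_p, \qquad M = M_L \times U_n \]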
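The resource bound per original iteration can be sketched as a maximum of three ratios divided by the total unroll amount U_a × U_p. The C function below is an illustration only. It takes F, M and C as already-computed operation counts of the unrolled loop body and reads them as computational operations, memory operations and intercluster copies, with FU_F, FU_M and FU_C as the matching unit counts. That reading and the parameter names are assumptions of this sketch.

```c
/* Unrolled resource-constrained II per original iteration:
   uResII = max(F/FU_F, M/FU_M, C/FU_C) / (Ua * Up)
   F, M, C          : operation counts of the unrolled body (assumed meaning:
                      computational ops, memory ops, intercluster copies)
   fu_f, fu_m, fu_c : number of units available for each kind
   ua, up           : unroll amounts of the two chosen loops */
double u_res_ii(int F, int M, int C,
                int fu_f, int fu_m, int fu_c,
                int ua, int up) {
    double rf = (double)F / fu_f;
    double rm = (double)M / fu_m;
    double rc = (double)C / fu_c;
    double worst = rf;
    if (rm > worst) worst = rm;
    if (rc > worst) worst = rc;
    return worst / (ua * up);
}
```

For instance, an unrolled body with F = 8, M = 6 and C = 2 on a machine with 2 computational units, 1 memory unit and 1 copy unit, unrolled with U_a = 2 and U_p = 1, gives max{4, 6, 2} / 2 = 3.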

Computing Communication Cost for Unrolled Loops

Intercluster Copies

    single loop
        innermost loop is unrolled
            invariant dep.
            variant dep.
        innermost loop is not unrolled
            invariant dep.
            variant dep.
    multiple loops (see paper)

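Unrolling a Single Loop

Variant Dep.

[Figure: dependence edges from source nodes $v_0, \ldots, v_{U_a-1}$ to sink nodes $w_n$ across clusters, before and after unrolling]

Before unrolling: dependence $e$ with distance $d_l(e)$

After unrolling:

sinks of the new dependences: $w_n$, where $n = (m + d_l(e)) \bmod (U_p \times U_a)$ for source $v_m$

copies per cluster: $uC_l(e)$ = # of $e$ where $n \ge U_a$

total costs:

\[ C_l^C = \sum_{e \in E_l^C} uC_l(e) \times U_p \]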

Unrolling a Single Loop

Variant Dep.

Special Cases

if $d_l(e) \bmod (U_p \times U_a) = 0$, then $uC_l(e) = 0$

4 clusters: $U_a = 1$, $U_p = 4$

    for (i=0; i<4*n; i+=4) {
        a[i]   = a[i-4];
        a[i+1] = a[i-3];
        a[i+2] = a[i-2];
        a[i+3] = a[i-1];
    }

if $d_l(e) \le U_a$, then $uC_l(e) = d_l(e)$

2 clusters: $U_a = 3$, $U_p = 2$

    for (i=0; i<6*n; i+=6) {
        a[i]   = a[i-2];
        a[i+1] = a[i-1];
        a[i+2] = a[i];
        a[i+3] = a[i+1];
        a[i+4] = a[i+2];
        a[i+5] = a[i+3];
    }
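The copy counts in these two cases can be checked by direct enumeration. The C sketch below assumes that the U_a × U_p unrolled copies are numbered from 0, that each cluster holds U_a consecutive copies, and that a source copy m < U_a needs an intercluster copy whenever its sink index n = (m + d) mod (U_a × U_p) is at least U_a. These assumptions and the function names are an interpretation made for this example, not a statement of the authors' implementation.

```c
/* Copies needed per cluster for one variant dependence of distance d. */
int copies_per_cluster(int d, int ua, int up) {
    int total = ua * up;   /* number of unrolled copies of the statement */
    int count = 0;
    for (int m = 0; m < ua; ++m) {
        int n = (m + d) % total;   /* sink of the dependence leaving copy m */
        if (n >= ua)               /* sink lies on another cluster */
            ++count;
    }
    return count;
}

/* Total cost over a set of variant dependences: every one of the
   up clusters pays the per-cluster count. */
int variant_copy_cost(const int *dist, int edges, int ua, int up) {
    int sum = 0;
    for (int e = 0; e < edges; ++e)
        sum += copies_per_cluster(dist[e], ua, up) * up;
    return sum;
}
```

For the first example (distance 4, U_a = 1, U_p = 4) the per-cluster count is 0. For the second (distance 2, U_a = 3, U_p = 2) it is 2, giving 4 copies in total across the two clusters.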

Unrolling a Single Loop

Invariant Dep.

$U_a \times U_p$ references can be eliminated by scalar replacement.

$U_p - 1$ clusters need a copy operation.

    for (j=1; j<=4*n; ++j)
        for (i=1; i<=m; ++i)
            a[j][i] = a[j][i-1] + b[i];

    for (j=1; j<=4*n; j+=4)
        for (i=1; i<=m; ++i) {
            t = b[i];
            a[j][i]   = a[j][i-1]   + t;
            a[j+1][i] = a[j+1][i-1] + t;
            a[j+2][i] = a[j+2][i-1] + t;
            a[j+3][i] = a[j+3][i-1] + t;
        }

\[ C_l^I = \sum_{e \in E_l^I} (U_p - 1) \]

Determining Unroll Amounts

Integer optimization problem

\[ \min\ \mathrm{uMinII} \quad \text{subject to} \quad U_a, U_p \ge 1, \quad R \le R_M \]

Exhaustive search

Heuristic method
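An exhaustive search over the unroll amounts might look like the following C sketch. It is a minimal illustration under stated assumptions: the cost and register-pressure models are passed in as function pointers, `toy_cost` and `toy_pressure` are made-up placeholder models rather than the ones from the paper, and the search assumes that leaving the loop as it is (U_a = U_p = 1) always fits in the register file.

```c
typedef struct {
    int ua, up;       /* chosen unroll amounts */
    double u_min_ii;  /* estimated uMinII for that choice */
} UnrollChoice;

typedef double (*CostFn)(int ua, int up);   /* estimates uMinII */
typedef int (*PressureFn)(int ua, int up);  /* estimates register pressure R */

/* Try every (ua, up) pair up to max_unroll and keep the pair with the
   smallest estimated uMinII whose register pressure fits reg_limit. */
UnrollChoice search_unroll(int max_unroll, int reg_limit,
                           CostFn cost, PressureFn pressure) {
    UnrollChoice best = {1, 1, cost(1, 1)};  /* assumed feasible */
    for (int ua = 1; ua <= max_unroll; ++ua) {
        for (int up = 1; up <= max_unroll; ++up) {
            if (pressure(ua, up) > reg_limit)
                continue;                    /* violates R <= R_M */
            double c = cost(ua, up);
            if (c < best.u_min_ii) {
                best.ua = ua;
                best.up = up;
                best.u_min_ii = c;
            }
        }
    }
    return best;
}

/* Hypothetical placeholder models, for demonstration only. */
double toy_cost(int ua, int up) {
    int u = ua * up;
    return (2.0 * u + 2.0) / u;
}

int toy_pressure(int ua, int up) { return 4 * ua * up; }
```

With the placeholder models, `search_unroll(4, 32, toy_cost, toy_pressure)` settles on a total unroll amount of 8, the largest that fits in 32 registers. A heuristic method would replace the double loop with a cheaper way of proposing candidate pairs.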

Experimental Results

Benchmarks

119 DSP loops from TI's benchmark suite

DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc.

Architectures

URM, a simulated architecture

8 functional units - 2 clusters, 4 clusters (1 copy unit)

16 functional units - 2 clusters, 4 clusters (2 copy units)

TMS320C64x

Unroll-and-jam/unrolling is applicable to 71 loops.

URM Speedups: Transformed vs. Original

    Width       8      8      16     16
    Clusters    2      4      2      4
    Speedup
    Harmonic    1.39   1.68   1.4    1.43
    Median      1.52   1.78   1.6    1.6
    Improved    50     69     50     51

Our Algorithm vs. Fixed Unroll Amounts

Using a fixed unroll amount may cause performance degradation when communication costs are dominant.

    Width              8      8      16     16
    Clusters           2      4      2      4
    Speedup
    Harmonic           1      0.91   1      1.07
    Harmonic (fixed)   0.88   0.84   0.88   0.95
    # of loops         9      4      9      21

C64x Results

TMS320C64x Speedups: Unrolled vs. Original

Speedup

Harmonic 1.7

Median 2

Improved 55

Accuracy of Communication Cost Model

Compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops

2-cluster: 66 exact prediction

4-cluster: 64 exact prediction

Conclusion

Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops.

70%-90% of the 71 loops can be improved, with speedups of 1.4-1.7.

High-level transformations should be an integral part of compilation for clustered VLIW machines.