
Optimizing Loop Performance for Clustered VLIW Architectures


by

Yi Qian (Texas Instruments)

Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)


Clustered VLIW Architecture


Motivation

Clustered VLIW architectures have been adopted to improve ILP and keep the port requirement of the register files low.

The compiler must

Expose maximal parallelism,

Maintain minimal communication overhead.

High-level optimizations can improve loop performance on clustered VLIW machines.


Background

Software Pipelining – modulo scheduling

Achieve ILP by overlapping the execution of different loop iterations.

Initiation Interval (II)

ResII -- constraints from the machine resources.

RecII -- constraints from the dependence recurrences.

MinII = max(ResII, RecII)
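The bounds above can be sketched in C using the usual modulo-scheduling definitions. The function names, the per-class operation counts, and the per-cycle latency/distance arrays are illustrative assumptions, not taken from the talk.

```c
#include <assert.h>

/* Integer ceiling of a / b, for a >= 0 and b > 0. */
static int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* ResII: the most heavily used resource class bounds the interval.
 * ops[k]   = operations per iteration that need class k (assumed input)
 * units[k] = functional units of class k (assumed input) */
int res_ii(const int *ops, const int *units, int num_classes) {
    int ii = 1;
    for (int k = 0; k < num_classes; ++k) {
        int need = ceil_div(ops[k], units[k]);
        if (need > ii) ii = need;
    }
    return ii;
}

/* RecII: a dependence cycle with total latency lat[c] that spans
 * dist[c] iterations needs at least ceil(lat / dist) cycles. */
int rec_ii(const int *lat, const int *dist, int num_cycles) {
    int ii = 1;
    for (int c = 0; c < num_cycles; ++c) {
        int need = ceil_div(lat[c], dist[c]);
        if (need > ii) ii = need;
    }
    return ii;
}

/* MinII = max(ResII, RecII). */
int min_ii(int res, int rec) {
    return res > rec ? res : rec;
}
```

With 4 memory operations on 1 memory unit and 2 computations on 1 computational unit, ResII is 4; a single recurrence of latency 3 and distance 1 gives RecII 3, so MinII is 4.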


Loop Transformations

Scalar Replacement

replace array references with scalar variables.

improve register usage

Before:

for (i=0; i<n; ++i)
    for (j=0; j<n; ++j)
        a[i] = a[i] + b[j] * x[i][j];

After:

for (i=0; i<n; ++i) {
    t = a[i];
    for (j=0; j<n; ++j)
        t = t + b[j] * x[i][j];
    a[i] = t;
}
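A minimal sketch that wraps both versions of the loop as functions so they can be compared. The function names and the row-major layout of x are assumptions made for the illustration; the rewrite is only safe when a does not overlap b or x, which the sketch checks only partially.

```c
#include <assert.h>
#include <stddef.h>

/* x is an n-by-n matrix stored row by row: x[i][j] is x[i * n + j]. */
void update_original(size_t n, double *a, const double *b, const double *x) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            a[i] = a[i] + b[j] * x[i * n + j];
}

void update_scalar_replaced(size_t n, double *a, const double *b, const double *x) {
    /* Partial aliasing check only: a must not overlap b or x at all. */
    assert(a != b && a != x);
    for (size_t i = 0; i < n; ++i) {
        double t = a[i];              /* one load of a[i] per outer iteration */
        for (size_t j = 0; j < n; ++j)
            t = t + b[j] * x[i * n + j];
        a[i] = t;                     /* one store of a[i] per outer iteration */
    }
}
```

The inner loop of the second version touches a[i] only through the scalar t, so the load and store of a[i] leave the innermost loop body.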



Unrolling

reduce inter-iteration overhead

enlarge loop body size

Unroll-and-jam

balance the computation and memory-access requirements

improve uMinII (MinII / unrollAmount)

Original loop (1 computational unit, 1 memory unit):

for (i=1; i<=2*n; ++i)
    for (j=1; j<=n; ++j)
        a[i][j] = a[i][j] + b[j] * c[j];

uMinII = 4

Unroll-and-jammed loop:

for (i=1; i<=2*n; i+=2)
    for (j=1; j<=n; ++j) {
        a[i][j] = a[i][j] + b[j] * c[j];
        a[i+1][j] = a[i+1][j] + b[j] * c[j];
    }

uMinII = 3
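The two uMinII values can be reproduced with a small resource count. This is a sketch under stated assumptions: every operation occupies its unit for one cycle, b[j] and c[j] are loaded once per jammed iteration and their product is reused, and recurrences are ignored. The helper names are hypothetical.

```c
#include <assert.h>

/* Integer ceiling of a / b, for a >= 0 and b > 0. */
static int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Operation counts for a[i][j] = a[i][j] + b[j] * c[j]
 * after the i loop has been unroll-and-jammed u times. */
int memory_ops(int u)  { return 2 * u + 2; }  /* u loads and u stores of a, plus b[j] and c[j] */
int compute_ops(int u) { return u + 1; }      /* u adds plus one shared multiply */

/* uMinII = MinII / unrollAmount, using only the resource bound. */
double u_min_ii(int u, int mem_units, int comp_units) {
    assert(u > 0);
    int mem = ceil_div(memory_ops(u), mem_units);
    int comp = ceil_div(compute_ops(u), comp_units);
    int min_ii = mem > comp ? mem : comp;
    return (double)min_ii / u;
}
```

With one unit of each kind, the original loop (u = 1) needs 4 memory slots, so uMinII is 4; the jammed loop (u = 2) needs 6 memory slots for two original iterations, so uMinII is 3.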



Unroll-and-jam/unrolling

generate intercluster parallelism

for (i=0; i<2*n; ++i)
    a[i] = a[i] + 1;

for (i=0; i<2*n; i+=2) {
    /* cluster 0 */ a[i] = a[i] + 1;
    /* cluster 1 */ a[i+1] = a[i+1] + 1;
}

for (i=0; i<2*n; ++i)
    a[i] = a[i-1] + 1;

for (i=0; i<2*n; i+=2) {
    /* cluster 0 */ a[i] = a[i-1] + 1;
    /* cluster 1 */ a[i+1] = a[i] + 1;
}



Loop Alignment

Remove loop-carried dependences

Alignment conflicts

Used to determine intercluster communication cost

Before alignment:

for (i=1; i<n; ++i) {
    a[i] = b[i] + c[i];
    x[i] = a[i-1] * q;
}

After alignment:

x[1] = a[0] * q;
for (i=1; i<n-1; ++i) {
    a[i] = b[i] + c[i];
    x[i+1] = a[i] * q;
}
a[n-1] = b[n-1] + c[n-1];

Alignment conflict <1>:

for (i=1; i<n; ++i) {
    a[i] = b[i] + q;
    c[i] = a[i-1] + a[i-2];
}

Alignment conflict <2>:

for (i=1; i<n; ++i)
    a[i] = a[i-1] + b[i];
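A minimal sketch that wraps the first loop and its aligned form as functions so the two can be checked against each other. The function names and the use of int arrays are assumptions for the illustration; the aligned form as written needs n >= 2.

```c
#include <assert.h>

/* Loop with a loop-carried dependence from a[i] to the read of a[i-1]. */
void loop_original(int n, int *a, const int *b, const int *c, int *x, int q) {
    for (int i = 1; i < n; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i - 1] * q;
    }
}

/* Aligned loop: the use of a[i] now sits in the iteration that defines it. */
void loop_aligned(int n, int *a, const int *b, const int *c, int *x, int q) {
    assert(n >= 2);                    /* the peeled statements touch index 1 and n-1 */
    x[1] = a[0] * q;                   /* peeled first use */
    for (int i = 1; i < n - 1; ++i) {
        a[i] = b[i] + c[i];
        x[i + 1] = a[i] * q;           /* dependence is now loop independent */
    }
    a[n - 1] = b[n - 1] + c[n - 1];    /* peeled last definition */
}
```

After alignment the value of a[i] is produced and consumed in the same iteration, so unrolled copies placed on different clusters do not need that value from each other.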


Related Work

Partitioning Problem

Ellis -- BUG

Capitanio et al. -- LC-VLIW

Nystrom et al. -- cluster assignment & software pipelining

Ozer et al. -- UAS

Sanchez et al. -- unified method

Hiser et al. -- RCG

Aleta et al. -- pseudo-scheduler



Scalar Replacement

Callahan et al. -- pipelined architectures

Carr, Kennedy -- general algorithm

Duesterwald -- data flow framework

Loop Alignment

Allen et al. -- shared-memory machines

Unrolling/Unroll-and-jam

Callahan et al. -- pipelined architectures

Carr, Kennedy -- ILP

Carr, Guan -- linear algebra

Carr -- cache, software pipelining

Sarkar -- ILP, IC

Sanchez et al. -- clustered machines

Huang et al. -- clustered machines

Shin et al. -- superword register files


Optimization Strategy

Source Code

-> Unroll-and-jam/Unrolling

-> Scalar Replacement

-> Intermediate Code Generator

-> Data-flow Optimization

-> Value Cloning

-> Register Partitioning

-> Software Pipelining

-> Assembly Code Generator

-> Target Code


Our Method

Picking loops to unroll

Computing uMinII

Computing register pressure (see paper)

Determining unroll amounts


Picking Loops to Unroll

$l_a$: carries the most dependences that are amenable to scalar replacement.

$l_p$: contains the fewest alignment conflicts.

Computing uMinII

uRecII does not increase

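uResII, with $l_a$ unrolled by $U_a$ and $l_p$ unrolled by $U_p$:

$$uResII = \frac{\max\left\{\dfrac{F}{FU_F},\ \dfrac{M}{FU_M},\ \dfrac{C}{FU_C}\right\}}{U_a \times U_p}$$

where

$F = f \times U_a \times U_p$, $M = M_L \times U_n$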
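A minimal sketch of the resource bound, assuming it reads as the largest of F/FU_F, M/FU_M and C/FU_C divided by U_a × U_p, with no rounding of the individual ratios. The function and parameter names are hypothetical, and the counts F, M and C are taken as inputs rather than derived.

```c
#include <assert.h>

static double max3(double a, double b, double c) {
    double m = a > b ? a : b;
    return m > c ? m : c;
}

/* F, M, C: operation counts of the unrolled body for each unit class (assumed inputs).
 * fu_f, fu_m, fu_c: number of functional units available to each class.
 * ua, up: unroll amounts U_a and U_p. */
double u_res_ii(int F, int M, int C,
                int fu_f, int fu_m, int fu_c,
                int ua, int up) {
    assert(fu_f > 0 && fu_m > 0 && fu_c > 0 && ua > 0 && up > 0);
    double bound = max3((double)F / fu_f, (double)M / fu_m, (double)C / fu_c);
    return bound / (ua * up);   /* cost per original iteration */
}
```

For example, F = 4, M = 6 and C = 0 on one unit of each class with U_a = 1 and U_p = 2 gives 6 / 2 = 3 cycles per original iteration.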


Computing Communication Cost for Unrolled Loops

Intercluster Copies

single loop

    innermost loop is unrolled: invariant dep., variant dep.

    innermost loop is not unrolled: invariant dep., variant dep.

multiple loops (see paper)

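Unrolling a Single Loop: Variant Dep.

[Figure: sources $v_0, v_1, \ldots, v_{U_a-1}$ in one cluster and their sinks $w_n$ in the other clusters, before and after unrolling]

Before unrolling: $d_l(e)$

After unrolling:

sinks of the new dependences: $n = (m + d_l(e)) \bmod (U_p \times U_a)$

copies per cluster: $uC_l(e)$ = # of $e$ where $n \ge U_a$

total costs:

$$C_l^C = \sum_{e \in E_l^C} uC_l(e) \times U_p$$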


Unrolling a Single Loop: Variant Dep.

Special Cases

if $d_l(e) \bmod (U_p \times U_a) = 0$, then $uC_l(e) = 0$

4 clusters: $U_a = 1$, $U_p = 4$

for (i=0; i<4*n; i+=4) {
    a[i] = a[i-4];
    a[i+1] = a[i-3];
    a[i+2] = a[i-2];
    a[i+3] = a[i-1];
}

if $U_a \ge d_l(e)$, then $uC_l(e) = d_l(e)$

2 clusters: $U_a = 3$, $U_p = 2$

for (i=0; i<6*n; i+=6) {
    a[i] = a[i-2];
    a[i+1] = a[i-1];
    a[i+2] = a[i];
    a[i+3] = a[i+1];
    a[i+4] = a[i+2];
    a[i+5] = a[i+3];
}
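The counting rule for variant dependences can be sketched as follows, under the assumption that each cluster holds U_a consecutive copies of the loop body, so that a source copy m in the first cluster has its sink in copy n = (m + d) mod (U_a × U_p) and needs an intercluster copy whenever n >= U_a. The function names are hypothetical.

```c
#include <assert.h>

/* Intercluster copies needed by one cluster for a dependence of distance d.
 * ua = copies of the body per cluster, up = number of clusters used. */
int copies_per_cluster(int d, int ua, int up) {
    assert(d >= 0 && ua > 0 && up > 0);
    int body = ua * up;              /* copies in the unrolled loop body */
    int copies = 0;
    for (int m = 0; m < ua; ++m) {   /* sources placed in the first cluster */
        int n = (m + d) % body;      /* copy that holds the sink */
        if (n >= ua)                 /* sink lies outside the first cluster */
            ++copies;
    }
    return copies;
}

/* Every cluster sees the same pattern, so the total is scaled by up. */
int total_copies(int d, int ua, int up) {
    return copies_per_cluster(d, ua, up) * up;
}
```

Both special cases fall out of the count: distance 4 with U_a = 1 and U_p = 4 needs no copies, and distance 2 with U_a = 3 and U_p = 2 needs 2 copies per cluster, 4 in total.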


Unrolling a Single Loop: Invariant Dep.

$U_a \times U_p$ references can be eliminated by scalar replacement.

$U_p - 1$ clusters need a copy operation.

for (j=1; j<=4*n; ++j)
    for (i=1; i<=m; ++i)
        a[j][i] = a[j][i-1] + b[i];

for (j=1; j<=4*n; j+=4)
    for (i=1; i<=m; ++i) {
        t = b[i];
        a[j][i] = a[j][i-1] + t;
        a[j+1][i] = a[j+1][i-1] + t;
        a[j+2][i] = a[j+2][i-1] + t;
        a[j+3][i] = a[j+3][i-1] + t;
    }

$$C_l^I = \sum_{e \in E_l^I} (U_p - 1)$$


Determining Unroll Amounts

Integer optimization problem

Exhaustive search

Heuristic method

min uMinII

subject to $R \le R_M$ and $U_a, U_p \ge 1$
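A minimal sketch of the exhaustive search over unroll amounts. The cost and register-pressure models are passed in as callbacks because the talk defers their details to the paper; the two toy models at the end are purely illustrative placeholders and ignore communication cost. All names here are hypothetical.

```c
#include <assert.h>

typedef struct {
    int ua;           /* unroll amount U_a */
    int up;           /* unroll amount U_p */
    double u_min_ii;  /* predicted cost per original iteration */
} unroll_choice;

typedef double (*cost_fn)(int ua, int up);     /* predicts uMinII */
typedef int (*pressure_fn)(int ua, int up);    /* predicts registers needed */

/* Try every (ua, up) pair within the limits and keep the cheapest one that
 * fits in the register file. Assumes not unrolling (1, 1) is always allowed. */
unroll_choice pick_unroll_amounts(int max_ua, int max_up, int reg_limit,
                                  cost_fn cost, pressure_fn pressure) {
    assert(max_ua >= 1 && max_up >= 1);
    unroll_choice best = {1, 1, cost(1, 1)};
    for (int ua = 1; ua <= max_ua; ++ua) {
        for (int up = 1; up <= max_up; ++up) {
            if (pressure(ua, up) > reg_limit)
                continue;                      /* violates the register constraint */
            double c = cost(ua, up);
            if (c < best.u_min_ii) {
                best.ua = ua;
                best.up = up;
                best.u_min_ii = c;
            }
        }
    }
    return best;
}

/* Toy models for illustration only: 2u + 2 memory operations on one
 * memory unit, and 2u + 2 live registers, where u = ua * up. */
double toy_cost(int ua, int up) {
    int u = ua * up;
    return (2.0 * u + 2.0) / u;
}

int toy_pressure(int ua, int up) {
    return 2 * ua * up + 2;
}
```

The search space grows with the product of the two limits, which is one reason a heuristic can be preferable to exhaustive search for larger bounds.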


Experimental Results

Benchmarks

119 DSP loops from TI's benchmark suite

DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc.

Architectures

URM, a simulated architecture

8 functional units - 2 clusters, 4 clusters (1 copy unit)

16 functional units - 2 clusters, 4 clusters (2 copy units)

TMS320C64x


Unroll-and-jam/unrolling is applicable to 71 loops.

URM Speedups: Transformed vs. Original

Width                       8       8       16      16
Clusters                    2       4       2       4
Speedup (harmonic mean)     1.39    1.68    1.4     1.43
Speedup (median)            1.52    1.78    1.6     1.6
Improved                    50      69      50      51
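For reference, a harmonic mean of per-loop speedups can be computed as below. The function name and the input array are illustrative; the per-loop speedups themselves are not given in the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Harmonic mean: count / sum(1 / speedup_i). Slowdowns (values below 1)
 * pull it down more than the arithmetic mean would. */
double harmonic_mean(const double *speedups, size_t count) {
    assert(count > 0);
    double sum_of_inverses = 0.0;
    for (size_t i = 0; i < count; ++i) {
        assert(speedups[i] > 0.0);
        sum_of_inverses += 1.0 / speedups[i];
    }
    return (double)count / sum_of_inverses;
}
```

A loop that doubles in speed and one that halves average to 0.8 under the harmonic mean, whereas the arithmetic mean of the same two values is 1.25.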


Our Algorithm vs. Fixed Unroll Amounts

Using a fixed unroll amount may cause performance degradation when communication costs are dominant.

Width                               8       8       16      16
Clusters                            2       4       2       4
Speedup (harmonic mean)             1       0.91    1       1.07
Speedup (harmonic mean, fixed)      0.88    0.84    0.88    0.95
# of loops                          9       4       9       21


C64x Results

TMS320C64x Speedups: Unrolled vs. Original

Speedup

Harmonic 1.7

Median 2

Improved 55


Accuracy of Communication Cost Model

Compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops

2-cluster: 66 exact prediction

4-cluster: 64 exact prediction


Conclusion

Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops.

70%-90% of the 71 loops improve, with speedups of 1.4-1.7.

High-level transformations should be an integral part of compilation for clustered VLIW machines.
