
Software Pipelining in Pegasus/CASH

Cody Hartwig

Elie Krevat

{chartwig,ekrevat}@cs.cmu.edu

Software Pipelining

Software pipelining is a method for increasing the available parallelism for instruction scheduling. Data dependencies limit the opportunity for parallel execution, and software pipelining overlaps loop iterations to increase the number of operations available to schedule between dependencies. Many techniques exist [classification by Allan et al.]:

Kernel recognition (e.g., Aiken & Nicolau): assumes the schedule for each iteration is fixed, unrolls the loop n times, and uses pattern recognition to identify a repeating kernel.

Modulo scheduling: analyzes data dependencies (resource and precedence constraints) and finds the minimum initiation interval to use when scheduling, as recalled below.
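For reference, the standard modulo-scheduling bound (from the general literature, e.g., Rau's iterative modulo scheduling; not specific to this work) takes the initiation interval as the larger of the resource-constrained and recurrence-constrained minima:

$$\mathrm{MII} = \max(\mathrm{ResMII},\ \mathrm{RecMII}), \qquad \mathrm{RecMII} = \max_{c\,\in\,\text{dependence cycles}} \left\lceil \frac{\sum_{e \in c} \mathrm{delay}(e)}{\sum_{e \in c} \mathrm{distance}(e)} \right\rceil$$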

Software Pipelining in Pegasus/CASH

Pegasus is an intermediate representation used by the CASH compiler; the Pegasus graph models both control flow and data flow.

Our approach: apply optimizations to the Pegasus graph, not the generated assembly. This abstracts away resource constraints, and a feedback loop is possible after scheduling and register allocation (e.g., to implement less aggressive pipelining because of register spilling).

How Operations are Pipelined

Our approach computes an operation's outputs for future loop iterations during the current iteration. The operation is copied into the pre-header, and the data flow for its value both before and after execution is fed into the loop hyperblock.

Each loop iteration then uses the value of the operation that was already computed, and computes the operation's value for the next iteration.

This approach is analogous to preparing temporary variables for future iterations to make the loop body schedule more efficient, as the sketch below illustrates.
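A minimal source-level sketch of that analogy (hypothetical code, not compiler output; the function and array names are made up):

/* Each iteration's input value is prepared one iteration ahead; the first
 * instance is prepared in the pre-header, mirroring the pre-header copy of
 * the pipelined operation. */
void scale(int *a, const int *b, int n) {
    int next = (n > 0) ? b[0] : 0;               /* pre-header: value for the first iteration */
    for (int i = 0; i < n; i++) {
        int cur = next;                          /* value computed one iteration earlier */
        next = (i + 1 < n) ? b[i + 1] : 0;       /* compute the value for the next iteration */
        a[i] = cur * 2;                          /* the use no longer waits on this iteration's load */
    }
}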

Choosing Operations to Pipeline via Pattern Matching

An operation may be pipelined if it matches one of a number of possible patterns. Patterns depend only on the type of the operation and the source of its inputs, and the operation type must allow speculative execution (e.g., loads are OK, but not stores).

Operations on the most expensive paths to etas are the first ones moved. The most expensive path is not necessarily the longest (e.g., a single 'load' operation is more expensive than two 'add' operations). A selection sketch follows this slide.
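A hedged sketch of this selection idea (the node layout, cost values, and function names are assumptions for illustration, not the actual CASH data structures; it also assumes an acyclic view of the hyperblock body):

#include <stddef.h>

typedef enum { OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_CAST, OP_ETA } OpKind;

typedef struct Node {
    OpKind kind;
    struct Node *inputs[2];
    int num_inputs;
} Node;

/* Only side-effect-free operations may execute speculatively. */
static int is_speculatable(OpKind k) {
    return k == OP_ADD || k == OP_MUL || k == OP_LOAD || k == OP_CAST;
}

/* Rough per-operation latencies: a single load costs more than two adds. */
static int op_cost(OpKind k) {
    switch (k) {
    case OP_LOAD: return 3;
    case OP_MUL:  return 2;
    default:      return 1;
    }
}

/* Cost of the most expensive path ending at node n. */
static int path_cost(const Node *n) {
    int worst = 0;
    for (int i = 0; i < n->num_inputs; i++) {
        int c = path_cost(n->inputs[i]);
        if (c > worst) worst = c;
    }
    return worst + op_cost(n->kind);
}

/* Walk the most expensive input chain back from an eta and return the first
 * speculatable operation encountered (NULL if none qualifies). */
static Node *pick_candidate(Node *eta) {
    Node *n = eta;
    while (n->num_inputs > 0) {
        Node *best = n->inputs[0];
        for (int i = 1; i < n->num_inputs; i++)
            if (path_cost(n->inputs[i]) > path_cost(best))
                best = n->inputs[i];
        if (is_speculatable(best->kind))
            return best;
        n = best;
    }
    return NULL;
}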

Recognized Patterns

Arithmetic operation
Load operation
Cast operation

As operations are moved, new operations will form the recognized patterns.

Example

int i = 0;
char a[100];
while (i < 100) {
    char tmp = a[i];
    tmp = tmp * 2;
    a[i] = tmp;
    i++;
}

The load and store are forced to execute in series. Operations shown in red in the graph figures are available to move.

[Figures: Step 1 and Step 2 of moving operations out of the loop body]

After the moves, the load and store are no longer dependent!
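A hedged source-level view of the result (an illustrative sketch in the style of the snippet above, not the actual Pegasus graph transformation): the load for the next iteration no longer has to wait for the current iteration's store.

int i = 0;
char a[100];
char tmp = a[0];                                /* pre-header: load for the first iteration */
while (i < 100) {
    char next = (i + 1 < 100) ? a[i + 1] : 0;   /* next iteration's load; independent of the store below */
    a[i] = tmp * 2;
    tmp = next;
    i++;
}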

Evaluation – Moving Average

void move_avg(int *a) {
    int i = 1;
    while (i < 100) {
        int t1 = a[i];
        int t2 = a[i - 1];
        a[i] = (t1 + t2) / 2;
        i++;
    }
}
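For intuition, a hypothetical source-level analogue of pipelining just the t1 load (the actual transformation operates on the Pegasus graph and moved 11 operations, so this is only a sketch):

void move_avg_pipelined(int *a) {
    int i = 1;
    int t1 = a[1];                           /* pre-header: a[i] for the first iteration */
    while (i < 100) {
        int t2 = a[i - 1];
        a[i] = (t1 + t2) / 2;
        t1 = (i + 1 < 100) ? a[i + 1] : 0;   /* next iteration's a[i]; guarded here, though
                                                a speculative load would also be legal */
        i++;
    }
}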

Schedule Length Statistics (after moving 11 operations)

             Before   After
Pre-header      8       14
Loop Body      22       18

Cost of entire function ≈ Cost(Pre-header) + 100 × Cost(Loop Body)

Cost before Software Pipelining ≈ 8 + 100 × 22 = 2208

Cost after Software Pipelining ≈ 14 + 100 × 18 = 1814

Software Pipelining improves performance here by ≈ 18%

Moving Average – Before Software Pipelining

Moving Average – After Software Pipelining

Pipelined graphs are considerably more complex.

Conclusion

Software pipelining at the Pegasus level can achieve significant loop performance improvement.

Most regular operation types are pipelinable via our iterative pattern matching algorithm.

The cost of this improvement is increased register pressure and more complicated Pegasus graphs.