
Generation of highly parallel code for 2106X processors

An introduction

Developed by M. R. Smith Presented by S. Lei

SHARC2000 Workshop, Boston, September 2000


04/18/23 Introduction to highly parallel SHARC code

Copyright M. Smith and S. Lei Contact [email protected] / 45 + B14

Background assumed

Familiarity with SHARC 2106X architecture

Familiarity with SHARC programmer’s model for registers

Some assembly experience

An interest in beating the compiler in those special cases when you need the last drop of blood out of the CPU :-)


To be tackled

What’s causing the problem
– General limitations of instruction sets

How to recognize when you might be coming up against SHARC architecture limitations

A process for optimizing the SHARC parallelism
– Example -- Temperature conversion
– Bonus if time permits -- Average and instantaneous power


Efficient Move for 68k -- MOVEQ.L

Want instruction to work with 1 memory FETCH – 16 bits available to describe operation

5 bits taken up to say MOVEQ.L instruction and not something else

3 bits taken up for the 8 possible destination data registers

ONLY 8 bits left to describe value
– Value = +127 to -128 -- NOTHING ELSE
– Value is sign extended to 32 bits

0 1 1 1 D D D 0 P P P P P P P P
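The bit budget above can be checked with a few lines of C. This is only a sketch: the function names `moveq_fits`, `encode_moveq` and `moveq_operand` are made up for illustration, and the code simply packs the fields in the layout shown on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Only values that fit in a signed byte can be described in 8 bits. */
int moveq_fits(int value)
{
    return value >= -128 && value <= 127;
}

/* Pack MOVEQ.L #value, Dn into one 16-bit word: 0111 DDD 0 PPPPPPPP. */
uint16_t encode_moveq(unsigned dreg, int value)
{
    assert(dreg <= 7u && moveq_fits(value));
    return (uint16_t)(0x7000u | (dreg << 9) | ((unsigned)value & 0xFFu));
}

/* Recover the operand the CPU would load: the 8-bit field sign extended. */
int32_t moveq_operand(uint16_t opcode)
{
    int32_t field = opcode & 0xFF;
    return field >= 128 ? field - 256 : field;
}
```

For example, `encode_moveq(3, -1)` packs register D3 and the byte 0xFF into a single word; anything outside -128 to +127 needs a longer instruction.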


Same basic issues on SHARC

You can’t do EVERYTHING with ALL possible resources

Compute/dreg<->DM/dreg<->PM
– 3 bits opcode
– 2 bits for direction of memory ops
– ONLY 12 bits available to describe 4 DAG registers
– 8 bits to describe which registers used for destination/source
– 23 bits to describe Compute operations
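As a quick sanity check, the field widths listed above add up to the 48-bit SHARC instruction word. The sketch below just does that arithmetic; the enum and function names are illustrative, not taken from any Analog Devices header.

```c
#include <assert.h>

/* Field widths for the Compute/dreg<->DM/dreg<->PM instruction type,
   as listed on the slide. */
enum {
    OPCODE_BITS    = 3,
    DIRECTION_BITS = 2,
    DAG_BITS       = 12,  /* shared by 4 DAG registers */
    DREG_BITS      = 8
};

/* Bits left over for the compute operation in an instruction word. */
int compute_bits_left(int instruction_bits)
{
    int fixed = OPCODE_BITS + DIRECTION_BITS + DAG_BITS + DREG_BITS;
    assert(instruction_bits >= fixed);
    return instruction_bits - fixed;
}

/* Bits available to select each of the 4 DAG registers. */
int bits_per_dag_register(void)
{
    return DAG_BITS / 4;
}
```

With a 48-bit word this leaves 23 bits for the compute field, and only 3 bits to pick each DAG register, which is why not every register can appear in every parallel instruction.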


When are DSP instructions valid?

You are going to customize
– When can you use the DSP instructions?
– Most -- From Monday to Friday
– Some -- Only between 9:00 a.m. and 9:00 p.m.

Check against architecture

21k -- Parallel ops MUST be able to do this
– Can it be fetched in one cycle (op-code size)
– Can it be executed in one cycle (resource question)
– Can it execute without conflicting with other instructions?
– Then PROBABLY legal

HOWEVER -- The designers had the final decision and you have to live by that decision!


You can’t do parallel Memory to UREG ops

Note you need 8-bits to describe just one UREG out of all possible UREGs

dm(<addr>) = ureg
– instruction = ? bits
– addr described in 32 bits
– UREG description needs 8 bits

JUST enough instruction bits to allow dm(<offset>, Ireg) = Ureg
– NOTE that the maximum number of bits is needed to describe the offset even if offset = 1


Pipeline considerations -- REAL ISSUE!

R2 = R1 + R3, R3 = dm(I2, M2), pm(I8,M9) = R2

The R2 in R2 = R1 + R3 is not the R2 in pm(I8,M9) = R2

The R3 in R2 = R1 + R3 is not the R3 in R3 = dm(I2, M2)

---------------------------------------------------------------

You can do R3 = dm(I2, M2), pm(I8,M9) = R2

but you can’t do R3 = dm(I2, M2), dm(I3,M3) = R2

even though it looks like the data bus is free for accesses at the beginning and end of a cycle, because it ain’t.

Memory accesses take the WHOLE cycle to complete
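One way to picture why "the R2 is not the R2" is to model the compound instruction in C so that every source is sampled before any destination is updated. This is a rough model under that assumption, not a cycle-accurate simulator; the `Regs` struct and `compound_step` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int r1, r2, r3;
} Regs;

/* Model of one compound instruction:
       R2 = R1 + R3, R3 = dm(I2, M2), pm(I8, M9) = R2
   All sources are taken from the register values at the start of the cycle. */
void compound_step(Regs *regs, const int *dm_word, int *pm_word)
{
    assert(regs && dm_word && pm_word);
    const Regs old = *regs;       /* snapshot of the sources */
    regs->r2 = old.r1 + old.r3;   /* adder sees the OLD R3 */
    regs->r3 = *dm_word;          /* DM load delivers the NEW R3 */
    *pm_word = old.r2;            /* PM store writes the OLD R2 */
}
```

Starting from R1 = 1, R2 = 10, R3 = 100 with 7 waiting in data memory, the step leaves R2 = 101, R3 = 7 and stores 10 to program memory.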


Compute operations

Only 23 bits available

Requires 1 destination and 2 sources

ONLY work on data registers as there are not enough instruction bits to describe 3 uregs

R1 = R2 + R3 ALLOWED
R2 = R3 + 2 NOT ALLOWED
I1 = I2 + I3 NOT ALLOWED

Compute operations can be made conditional, and also combined with UREG to UREG moves (instead of memory operations)

NO PARKING BETWEEN 8:30 and 9:30 IF R IN THE MONTH

R1 = R2 + R3 can sometimes be ILLEGAL


Under best conditions

If instruction described the right way
– 1 data memory access (in or out) with a REQUIRED post modification operation, possibly with a modify register containing the value 0
– 1 program memory access (in or out) PROVIDED that the NEXT instruction being fetched is stored in the instruction cache
– 1 compute operation on data registers (EXCEPT for certain multi-function instructions with specific registers)


Introduction to PPPPIC

Professor’s Personal Process for Parallel Instruction Coding


Basic code development -- any system

Write the “C” code for the function

void Convert(float *temperature, int N)

which converts an array of temperatures measured in “Celsius” (Canadian Market) to “Fahrenheit” (Tourist Trade)

Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices, or just use the compiler to do the job for you


Standard “C” code

void Convert(float *temperature, int N) {
    int count;

    for (count = 0; count < N; count++) {
        *temperature = (*temperature) * 9 / 5 + 32;
        temperature++;
    }
}


Process for developing custom code

Rewrite the “C” code using “LOAD/STORE” techniques -- 2106X is essentially super-scalar RISC

Write the assembly code using a hardware loop
– Check that end of loop label is in the correct place

REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach

Move algorithm to “Resource Usage Chart”

Optimize (Attempt to)

Compare and contrast time -- include set up and loop control time -- was it worth the effort?


21061-style load/store “C” code

void Convert(register float *temperature, register int N) {
    register int count;
    register float *pt = temperature;
    register float scratch;

    for (count = 0; count < N; count++) {
        scratch = *pt;
        scratch = scratch * (9.0f / 5.0f); // float constants: integer 9 / 5 evaluates to 1
        scratch = scratch + 32;
        *pt = scratch;
        pt++;
    }
}
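A small check that the load/store rewrite still computes the same thing as the one-line version is cheap insurance before moving to assembly. The sketch below puts the two forms side by side; the function names are illustrative, and the load/store form uses a float scale factor because the integer expression 9 / 5 evaluates to 1 in C.

```c
#include <assert.h>

/* Direct form: one expression per element. */
void convert_direct(float *temperature, int n)
{
    assert(n >= 0);
    for (int count = 0; count < n; count++) {
        *temperature = (*temperature) * 9 / 5 + 32;
        temperature++;
    }
}

/* Load/store form: one operation per line, as the hardware will see it. */
void convert_load_store(float *temperature, int n)
{
    assert(n >= 0);
    const float scale = 9.0f / 5.0f;   /* computed once, in float */
    float *pt = temperature;
    for (int count = 0; count < n; count++) {
        float scratch = *pt;           /* load  */
        scratch = scratch * scale;     /* multiply */
        scratch = scratch + 32.0f;     /* add   */
        *pt = scratch;                 /* store */
        pt++;
    }
}
```

The two versions can differ in the last bit of a float because one divides and the other multiplies by a rounded constant, so any comparison should use a small tolerance.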


All assembly code routines REQUIRE

PROLOGUE
– Appropriate defines to make easy reading of code
– Saving of non-volatile registers

CODE BODY -- what you want to do
– Try to plan ahead for parallel operations
– Know which 21k “multi-functions” are valid with which registers.

EPILOGUE
– Recover non-volatile registers


Straight conversion -- PROLOGUE

// void Convert(reg float *temperature, reg int N) {
.segment/pm seg_pmco;
.global _Convert;
_Convert:
// register int count = GARBAGE;
#define count scratchR1
// register float *pt = temperature;
#define pt scratchDMpt
pt = INPAR1;
// float scratch = GARBAGE;
#define scratchF2 F2
// For the CURRENT code -- no non-volatile
// registers are needed -- may not remain true


Straight conversion of BODY and EPILOGUE

// for (count = 0; count < N; count++) {
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
// scratch = *pt;
scratchF2 = dm(pt, 0); // Not ++ as pt re-used
// scratch = scratch * (9 / 5);
// INPAR1 (R4) is dead -- can reuse as F4
#define constantF4 F4 // Must be float
constantF4 = 1.8; // No division needed, use register constant
scratchF2 = scratchF2 * constantF4;
// scratch = scratch + 32, Register constant;
#define F0_32 F0 // Must be float
F0_32 = 32.0;
scratchF2 = scratchF2 + F0_32;
// *pt = scratch; pt++;
LOOP_END: dm(pt, 1) = scratchF2;
// 5 magic lines of code used to return -- EPILOGUE


Speed rules for memory access

CAN’T USE

scratch = dm(pt, 0); // Not ++ as to be re-used
dm(pt, 1) = scratch;

Use of constants as modifiers is not allowed -- not enough bits in the opcode for parallel ops!

Must use Modify registers already defined

scratch = dm(pt, zeroDM); // Not ++ as to be re-used
dm(pt, plus1DM) = scratch;


Speed rules IF you want adds and multiplies to occur on the same line

F1 = F2 * F3, F4 = F5 + F6;
– Want to do as a single instruction
– Not enough bits in the opcode
• Register description 4 + 4 + 4 + 4 + 4 + 4 (bits)
• Plus how many bits for operation description?

Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
– Rearrange register usage for this instruction to work
– Register description 4 + 2 + 2 + 4 + 2 + 2 (bits)
• Inconvenient rather than really limiting -- can still use more than half of the SHARC data registers in 1 instruction
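The register rule above is easy to get wrong by hand, so a tiny checker can help while planning register assignments. This is a sketch of the rule exactly as stated on the slide; `parallel_mul_add_ok` is a hypothetical helper, and it does not model any further restrictions the real assembler may apply (for example, both results landing in the same register).

```c
#include <assert.h>
#include <stdbool.h>

static bool in_range(int reg, int lo, int hi)
{
    return reg >= lo && reg <= hi;
}

/* Check  Fn = Fx * Fy, Fm = Fa + Fb  against the slide's rule:
   x in F0-F3, y in F4-F7, a in F8-F11, b in F12-F15,
   with any data register allowed as a destination. */
bool parallel_mul_add_ok(int n, int x, int y, int m, int a, int b)
{
    assert(in_range(n, 0, 15) && in_range(m, 0, 15));  /* destinations: any of F0-F15 */
    return in_range(x, 0, 3) && in_range(y, 4, 7) &&
           in_range(a, 8, 11) && in_range(b, 12, 15);
}

/* Register-description bits once the sources are restricted. */
int restricted_register_bits(void)
{
    return 4 + 2 + 2 + 4 + 2 + 2;
}
```

For instance, `F8 = F2 * F4, F9 = F8 + F12` passes the check, while `F2 = F2 * F4, F2 = F2 + F0` does not, because F2 and F0 are outside the adder's source groups.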


When to worry about the register assignment

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, 0); // Not ++ as to be re-used
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4 // Must be float
constantF4 = 1.8;
scratchF2 = scratchF2 * constantF4; // Parallel later
#define F0_32 F0 // Must be float
F0_32 = 32.0;
scratchF2 = scratchF2 + F0_32; // Parallel later
LOOP_END: dm(pt, 1) = scratchF2;


Check on required register use

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM); // Any special requirements here on F2??
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4 // Must be float
constantF4 = 1.8;
scratchF2 = scratchF2 * constantF4; // Fn = F(0,1,2 or 3) * F(4,5,6 or 7)
#define F0_32 F0 // Must be float
F0_32 = 32.0;
scratchF2 = scratchF2 + F0_32; // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END: dm(pt, plus1DM) = scratchF2;


Register re-assignment -- Step 1

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2 // APPEARS OKAY
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4 // Must be float -- APPEARS OKAY
constantF4 = 1.8;
scratchF2 = scratchF2 * constantF4; // APPEARS OKAY -- Fn = F(0,1,2 or 3) * F(4,5,6 or 7)
#define F0_32 F0 // Must be float
F0_32 = 32.0; // WRONG to use F0
scratchF2 = scratchF2 + F0_32; // WRONG to use F2 -- Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END: dm(pt, plus1DM) = scratchF2;


Register re-assignment -- Step 2

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
#define scratchF8 F8 // assumed define -- scratchF8 is used below
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4 // Must be float
constantF4 = 1.8;
scratchF8 = scratchF2 * constantF4; // FOR LATER USE answer must be in F(8, 9, 10 or 11)
#define F12_32 F12 // INPAR3 is available
F12_32 = 32.0;
scratchF2 = scratchF8 + F12_32; // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END: dm(pt, plus1DM) = scratchF2;


MOVE “CONSTANT” OPERATIONS

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
#define scratchF8 F8 // assumed define -- scratchF8 is used below
LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4 // Must be float
constantF4 = 1.8; // MOVE OUTSIDE LOOP
scratchF8 = scratchF2 * constantF4; // answer must be in F(8, 9, 10 or 11)
#define F12_32 F12 // INPAR3 is available
F12_32 = 32.0; // MOVE OUTSIDE LOOP
scratchF2 = scratchF8 + F12_32; // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
LOOP_END: dm(pt, plus1DM) = scratchF2;


Resource Chart -- Basic code

_Convert:
    pt = INPAR1;
    F12_32 = 32.0;    // bring constants outside the loop
    F4_1_8 = 1.8;
    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;

          | ADDER            | MULTIPLIER       | DM ACCESS            | PM ACCESS
          |                  |                  | F2 = dm(pt, ZERODM)  |
          |                  | F8 = F2 * F4_1_8 |                      |
          | F2 = F8 + F12_32 |                  |                      |
LOOP_END: |                  |                  | dm(pt, PLUS1DM) = F2 |

    5 magic lines of “C”

Time = 4 + N * 4 + 5 + 5 to do the call


Un-roll the loop

Temporarily straight line your code

Key technique for deciding where parallel operations are possible

Careful -- will re-roll the straight line code later and then the number of parallel operations in the loop is important.

Final code may require different loops coded for different values of the loop size
– Loop size N = 3p where p is an integer
– N = 3p + 1 etc
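The "different loops for different N" point is the usual price of unrolling. A common C-level way to handle it, shown here only as a sketch and not as the approach the final assembly takes, is an unrolled main loop plus a short clean-up loop for the leftover one or two elements. The names `c_to_f` and `convert_unrolled3` are illustrative.

```c
#include <assert.h>

static float c_to_f(float celsius)
{
    return celsius * 1.8f + 32.0f;
}

/* Unrolled by 3: the main loop handles N = 3p elements,
   the clean-up loop handles the remaining 0, 1 or 2. */
void convert_unrolled3(float *t, int n)
{
    assert(n >= 0);
    int i = 0;
    for (; i + 3 <= n; i += 3) {
        t[i]     = c_to_f(t[i]);
        t[i + 1] = c_to_f(t[i + 1]);
        t[i + 2] = c_to_f(t[i + 2]);
    }
    for (; i < n; i++) {
        t[i] = c_to_f(t[i]);
    }
}
```

With N = 7 the main loop runs twice and the clean-up loop once.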


Step 1 -- unroll the loop -- 5 times here

Stage | ADDER            | MULTIPLIER       | DM ACCESS
R1    |                  |                  | F2 = dm(pt, ZERODM)
M1    |                  | F8 = F2 * F4_1_8 |
A1    | F2 = F8 + F12_32 |                  |
W1    |                  |                  | dm(pt, PLUS1DM) = F2
R2    |                  |                  | F2 = dm(pt, ZERODM)
M2    |                  | F8 = F2 * F4_1_8 |
A2    | F2 = F8 + F12_32 |                  |
W2    |                  |                  | dm(pt, PLUS1DM) = F2
R3    |                  |                  | F2 = dm(pt, ZERODM)
M3    |                  | F8 = F2 * F4_1_8 |
A3    | F2 = F8 + F12_32 |                  |
W3    |                  |                  | dm(pt, PLUS1DM) = F2
R4    |                  |                  | F2 = dm(pt, ZERODM)
M4    |                  | F8 = F2 * F4_1_8 |
A4    | F2 = F8 + F12_32 |                  |
W4    |                  |                  | dm(pt, PLUS1DM) = F2
R5    |                  |                  | F2 = dm(pt, ZERODM)
M5    |                  | F8 = F2 * F4_1_8 |
A5    | F2 = F8 + F12_32 |                  |
W5    |                  |                  | dm(pt, PLUS1DM) = F2


Step 2 -- Identify resource usage in SOURCE and DESTINATION stages of the instructions

-- then try to move the instructions into compound (super-scalar) operations

Instruction          | Resource   | SRC stage      | DEST stage
F2 = dm(pt, ZERODM)  | DM ACCESS  | Decode(Mem)    | Writeback(F2)
F8 = F2 * F4_1_8     | MULTIPLIER | Decode(F2,F4)  | Writeback(F8)
F2 = F8 + F12_32     | ADDER      | Decode(F8,F12) | Writeback(F2)
dm(pt, PLUS1DM) = F2 | DM ACCESS  | Decode(F2)     | Writeback(Mem)
F2 = dm(pt, ZERODM)  | DM ACCESS  | Decode(Mem)    | Writeback(F2)
F8 = F2 * F4_1_8     | MULTIPLIER | Decode(F2,F4)  | Writeback(F8)
F2 = F8 + F12_32     | ADDER      | Decode(F8,F12) | Writeback(F2)
dm(pt, PLUS1DM) = F2 | DM ACCESS  | Decode(F2)     | Writeback(Mem)


Step 3 -- Carefully check what instructions can be moved for earlier execution

Flag | Instruction          | Resource   | SRC stage      | DEST stage
     | F2 = dm(pt, ZERODM)  | DM ACCESS  | Decode(Mem)    | Writeback(F2)
     | F8 = F2 * F4_1_8     | MULTIPLIER | Decode(F2,F4)  | Writeback(F8)
     | F2 = F8 + F12_32     | ADDER      | Decode(F8,F12) | Writeback(F2)
NO   | dm(pt, PLUS1DM) = F2 | DM ACCESS  | Decode(F2)     | Writeback(Mem)
NO   | F2 = dm(pt, ZERODM)  | DM ACCESS  | Decode(Mem)    | Writeback(F2)
     | F8 = F2 * F4_1_8     | MULTIPLIER | Decode(F2,F4)  | Writeback(F8)
     | F2 = F8 + F12_32     | ADDER      | Decode(F8,F12) | Writeback(F2)
     | dm(pt, PLUS1DM) = F2 | DM ACCESS  | Decode(F2)     | Writeback(Mem)


Memory resource availability

Move up F2 = dm(pt, ZERODM) from second loop into first loop
– Okay since F2 is in use as source in one part of the proposed compound instruction and destination in another

F8 = F2 * F4, F2 = dm(Ix, My)

However now we have a possible conflict about which F2 should be used for the

dm(pt, plus1DM) = F2

instruction at the end of the first loop, especially if the final code is going to involve multiple loops all intertwined and executing simultaneously


Step 3A -- What’s up, Doc?

Flag | Registers to track | Instruction          | Resource   | SRC stage      | DEST stage
     |                    | F2 = dm(pt, ZERODM)  | DM ACCESS  | Decode(Mem)    | Writeback(F2)
     | F2 =               | F8 = F2 * F4_1_8     | MULTIPLIER | Decode(F2,F4)  | Writeback(F8)
     | F8 =  F2 =         | F2 = F8 + F12_32     | ADDER      | Decode(F8,F12) | Writeback(F2)
NO   | F2 =  F8 =         | dm(pt, PLUS1DM) = F2 | DM ACCESS  | Decode(F2)     | Writeback(Mem)
NO   |                    | F2 = dm(pt, ZERODM)  | DM ACCESS  | Decode(Mem)    | Writeback(F2)
     |                    | F8 = F2 * F4_1_8     | MULTIPLIER | Decode(F2,F4)  | Writeback(F8)
     |                    | F2 = F8 + F12_32     | ADDER      | Decode(F8,F12) | Writeback(F2)
     |                    | dm(pt, PLUS1DM) = F2 | DM ACCESS  | Decode(F2)     | Writeback(Mem)


Step 4 -- Solution -- Use F9 (after saving)

Any data destination is allowed for parallel +/*

Instruction          | Resource   | SRC stage      | DEST stage
F2 = dm(pt, ZERODM)  | DM ACCESS  | Decode(Mem)    | Writeback(F2)
F8 = F2 * F4_1_8     | MULTIPLIER | Decode(F2,F4)  | Writeback(F8)
F9 = F8 + F12_32     | ADDER      | Decode(F8,F12) | Writeback(F9)
dm(pt, PLUS1DM) = F9 | DM ACCESS  | Decode(F9)     | Writeback(Mem)
F2 = dm(pt, ZERODM)  | DM ACCESS  | Decode(Mem)    | Writeback(F2)
F8 = F2 * F4_1_8     | MULTIPLIER | Decode(F2,F4)  | Writeback(F8)
F9 = F8 + F12_32     | ADDER      | Decode(F8,F12) | Writeback(F9)
dm(pt, PLUS1DM) = F9 | DM ACCESS  | Decode(F9)     | Writeback(Mem)


Step 5 -- Faster solution than original

But no one resource is in full use

Limiting resource should be data memory access

Stage  | ADDER            | MULTIPLIER       | DM ACCESS
R1     |                  |                  | F2 = dm(pt, ZERODM)
M1, R2 |                  | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)
A1, M2 | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 |
W1, A2 | F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9
W2     |                  |                  | dm(pt, PLUS1DM) = F9
R3     |                  |                  | F2 = dm(pt, ZERODM)
M3, R4 |                  | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)
A3, M4 | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | STALL
W3, A4 | F9 = F8 + F12_32 | STALL            | dm(pt, PLUS1DM) = F9
W4     | STALL            | STALL            | dm(pt, PLUS1DM) = F9
R5     | STALL            | STALL            | F2 = dm(pt, ZERODM)
M5     | STALL            | F8 = F2 * F4_1_8 |
A5     | F9 = F8 + F12_32 |                  |
W5     |                  |                  | dm(pt, PLUS1DM) = F9


Step 6 -- unroll the loop a bit more

Stage      | ADDER            | MULTIPLIER       | DM ACCESS
R1         |                  |                  | F2 = dm(pt, ZERODM)
M1, R2     |                  | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)
A1, M2     | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 |
W1, A2     | F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9
W2         |                  |                  | dm(pt, PLUS1DM) = F9
R3         |                  |                  | F2 = dm(pt, ZERODM)
M3, R4     |                  | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)
A3, M4, R5 | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)
W3, A4, M5 | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | dm(pt, PLUS1DM) = F9
W4, A5     | F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9
W5         |                  |                  | dm(pt, PLUS1DM) = F9
R6         |                  |                  | F2 = dm(pt, ZERODM)
M6, R7     |                  | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)
A6, M7, R8 | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)
W6, A7, M8 | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | dm(pt, PLUS1DM) = F9
W7, A8     | F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9
W8         |                  |                  | dm(pt, PLUS1DM) = F9


Now to “re-roll the loop”

Execution involves overlapped loop components where the loop counter has the value p, p+1 and p+2

Where the original loop went around N times, there are now three stages associated with any “re-rolled loop”

1) Fill the ALU pipeline

2) Overlap N - 2 times around the loop

3) Empty the ALU pipeline


Step -- Final code version

_Convert:
    Modify(CTOPofSTACK, -1);
    dm(FP, -2) = R9;
    pt = INPAR1;
    F12_32 = 32.0;    // bring constants outside the loop
    F4_1_8 = 1.8;

          | ADDER            | MULTIPLIER       | DM ACCESS            | Stage
          |                  |                  | F2 = dm(pt, ZERODM)  | R1
          |                  | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | M1, R2
          | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 |                      | A1, M2
          | F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9 | W1, A2
          |                  |                  | dm(pt, PLUS1DM) = F9 | W2

    LCNTR = (N-2)/3, DO LOOP_END UNTIL LCE;

          |                  |                  | F2 = dm(pt, ZERODM)  | R3
          |                  | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | M3, R4
          | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | A3, M4, R5
          | F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | dm(pt, PLUS1DM) = F9 | W3, A4, M5
          | F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9 | W4, A5
LOOP_END: |                  |                  | dm(pt, PLUS1DM) = F9 | W5

    R9 = dm(FP, -2);
    5 magic lines of C
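The fill / overlap / empty structure can be prototyped in C before committing to assembly. The sketch below is a simplified model, not a transcription of the chart: it overlaps one read, one multiply and one add per pass, and it uses separate read and write indices (`rd`, `wr`), which is an assumption made here so that a read issued ahead of the previous store still fetches the next element. The variable names mirror the registers F2, F8, F9, F4 and F12.

```c
#include <assert.h>

void convert_pipelined(float *t, int n)
{
    assert(n >= 0);
    const float f4 = 1.8f, f12 = 32.0f;
    if (n < 2) {                       /* too short to overlap anything */
        if (n == 1) t[0] = t[0] * f4 + f12;
        return;
    }

    float f2, f8, f9;
    int rd = 0, wr = 0;

    /* 1) Fill the pipeline */
    f2 = t[rd++];                      /* R1 */
    f8 = f2 * f4;                      /* M1 */
    f2 = t[rd++];                      /* R2 */

    /* 2) Overlap N - 2 times around the loop */
    for (int k = 0; k < n - 2; k++) {
        f9 = f8 + f12;                 /* add for element k      */
        f8 = f2 * f4;                  /* multiply for k + 1     */
        f2 = t[rd++];                  /* read element k + 2     */
        t[wr++] = f9;                  /* write element k        */
    }

    /* 3) Empty the pipeline */
    f9 = f8 + f12;
    f8 = f2 * f4;
    t[wr++] = f9;
    f9 = f8 + f12;
    t[wr++] = f9;
}
```

Inside the loop the statements are ordered so that each one consumes the value produced on the previous pass, which is the same reason the adder result needs its own register (F9) instead of sharing F2 with the incoming read.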


Speed improvements

BEFORE ANY PARALLELISM WAS INTRODUCED

START LOOP EXIT ENTRY

4 + N * 4 + 5 + 5 = 14 + 4 * N

NOW with 2-fold loop unfolding

START LOOP EXIT ENTRY

4 + 7 + (N – 2) * 5 / 2 + 5 + 8 + 5 = 24 + 2.5 * N

NOW with 3-fold loop unfolding

START LOOP EXIT ENTRY

4 + 5 + (N – 2) * 6 / 3 + 5 + 1 + 5 = 16 + 2 * N

WARNING -- Will need 3 different coding situations

N = 3p, 3p + 1, 3p + 2
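To judge "was it worth the effort?" for a particular N, the three cycle-count expressions above can be evaluated directly. The functions below are a sketch that just restates those expressions; the names are illustrative, and the asserts encode the assumption that the unrolled versions are only used when N - 2 divides evenly by the unrolling factor.

```c
#include <assert.h>

/* 4 + N*4 + 5 + 5 = 14 + 4*N */
int cycles_basic(int n)
{
    assert(n >= 0);
    return 4 + n * 4 + 5 + 5;
}

/* 4 + 7 + (N-2)*5/2 + 5 + 8 + 5 = 24 + 2.5*N */
int cycles_unrolled2(int n)
{
    assert(n >= 2 && (n - 2) % 2 == 0);
    return 4 + 7 + (n - 2) * 5 / 2 + 5 + 8 + 5;
}

/* 4 + 5 + (N-2)*6/3 + 5 + 1 + 5 = 16 + 2*N */
int cycles_unrolled3(int n)
{
    assert(n >= 2 && (n - 2) % 3 == 0);
    return 4 + 5 + (n - 2) * 6 / 3 + 5 + 1 + 5;
}
```

For N = 101 the basic loop costs 418 cycles and the 3-fold version 218; for very short arrays the fixed overhead dominates and the gain shrinks accordingly.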


Question to Ask

We now know the final code

Should we have made the substitution F2 to F9?

Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms!
– No real disadvantage since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!


Parallelism requires

Standard Code Development

Custom Code development
– Rewrite with specialized resources
– Move to “resource chart”
– Unroll the loop
– Adjust code
– Re-roll the loop
– Check if worth the effort
  • Probably NOT -- Remember that this code runs in the middle of a lot of other code!!!!!

Page 45: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


Resources for more detail

Smith, M. R., "Code Optimization Techniques for DSP Applications", 9th IEEE DSP (DSP2000) Workshop, Hunt, Texas, October 2000.

Smith, M. R., "The SHARC in the C", Circuit Cellar Online Magazine, April 2000.

Smith, M. R., "Code Optimization Techniques -- the case of 'The SHARC versus the Minnow' -- Part 1 -- The Minnow's Viewpoint", accepted for publication in Electronic Design Magazine, September 2000.

Smith, M. R., "Code Optimization Techniques -- the case of 'The SHARC versus the Minnow' -- Part 2 -- The Byte of the SHARC", accepted for publication in Electronic Design Magazine, October 2000.

Smith, M. R. and L. E. Turner, "Are you hurting your data through a lack of bit cushions? -- The effect of finite precision in embedded systems", based on a SHARC99 paper, submitted January 2000 for publication in Circuit Cellar Online.

Page 46: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September

Another example

Probably not enough time to cover in the workshop

Page 47: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


Calculate instantaneous and average power of a complex signal

// short ints are 16-bit values on this machine

short int Power(short int real[ ], short int imag[ ],
                short int power[ ], short int Npts) {

    short int count = 0;
    short int totalpower = 0;
    short int re_power, im_power;

    for (count = 0; count < Npts; count++) {
        re_power = real[count] * real[count];
        im_power = imag[count] * imag[count];
        power[count] = re_power + im_power;
        totalpower += re_power + im_power;
    }

    return (totalpower / Npts);
}


Page 48: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


Code rewritten to give the VisualDSP compiler the opportunity to do some parallel optimization, including using multiple data busses

Page 49: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


#include <stdlib.h>   // for exit()

float Power(float dm *real, float pm *imag, float dm *power, short int Npts) {
    short int count = 0;
    float totalpower = 0;
    float re_power, im_power;
    float temp;

    // Following unrolled code works for Npts divisible by 2
    if ( (Npts % 2) != 0 ) exit (0);

    for (count = 0; count < Npts / 2; count++) {
        // First calculation of the pair
        re_power = *real++;
        im_power = *imag++;
        temp = re_power * re_power + im_power * im_power;
        *power++ = temp;
        totalpower += temp;

        // Second calculation of the pair
        re_power = *real++;
        im_power = *imag++;
        temp = re_power * re_power + im_power * im_power;
        *power++ = temp;
        totalpower += temp;
    }
    return (totalpower / Npts);
}
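For readers without the VisualDSP tools, the same 2-fold unrolling can be tried in portable C. The sketch below is an assumption-laden variant, not the workshop code: it drops the `dm` and `pm` memory qualifiers, uses a hypothetical function name, and handles an odd point count with one clean-up calculation instead of calling `exit`.

```c
#include <stddef.h>

/* Writes the instantaneous power re*re + im*im of each point into power[]
   and returns the average power. Returns 0 when npts is 0. */
float power_unrolled2(const float *real, const float *imag,
                      float *power, size_t npts)
{
    float total = 0.0f;
    size_t i = 0;

    if (npts == 0) {
        return 0.0f;
    }

    /* Two independent calculations per pass. */
    for (; i + 2 <= npts; i += 2) {
        float p0 = real[i] * real[i] + imag[i] * imag[i];
        float p1 = real[i + 1] * real[i + 1] + imag[i + 1] * imag[i + 1];
        power[i] = p0;
        power[i + 1] = p1;
        total += p0;
        total += p1;
    }

    /* Clean-up for an odd number of points. */
    if (i < npts) {
        float p = real[i] * real[i] + imag[i] * imag[i];
        power[i] = p;
        total += p;
    }

    return total / (float)npts;
}
```

Keeping the two calculations in separate temporaries (`p0`, `p1`) makes their independence explicit, which is what gives a compiler room to overlap them.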

Page 50: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


    r13=dm(i0,dm_one);          // real[ ] on dm

// Hardware loop
    lcntr=r11, do(pc,_L$816004-1) until lce;

// Access to imag[ ] data along pm bus as wanted
        r3=pm(i8,pm_one);       // imag[ ] on pm
        F8=F13*F13;
        F12=F3*F3;
        F13=F8+F12;

// Part of second part of the loop
        r3=pm(i8,pm_one);       // imag[ ] on pm

        F10=F10+F13;
        dm(i1,dm_one)=r13;      // power[ ] on dm

        r13=dm(i0,dm_one);      // real[ ] on dm

        F9=F13*F13;
        F14=F3*F3;
        F13=F9+F14;

        dm(i1,dm_one)=r13;      // power[ ] on dm
        F10=F10+F13;

        r13=dm(i0,dm_one);      // real[ ] on dm
!end loop

_L$816004:

Page 51: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


VisualDSP Compiler generates code using program memory bus for data movement, but does not do any optimizing.

Hand optimizing can reduce these 14 lines generated by the compiler to just 7 without getting particularly fancy

Page 52: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


Hand optimizing the compiler output

14 cycles reduced to 7

    lcntr=r11, do(pc,_L$816004-1) until lce;

// Dual access along dm and pm data busses
        r13=dm(i0,dm_one), r3=pm(i8,pm_one);

// pm_zero contains zero to suppress the auto-incrementing mode
        F8=F13*F13, r1=dm(i0,dm_one), r4=pm(i8,pm_zero);

// The value in F1 must be passed over to F5 in order to
// prepare for the combined multiplication and addition
// operation
        F12=F3*F3, F5 = F1;

// Accessing pm memory is an alternate approach to preparing
// for parallel multiplication and addition operations
// One cycle overhead first time round the loop.
        F9=F1*F5, F13=F8+F12, r2=pm(i8,pm_one);

        F14=F2*F4, F10=F10+F13, dm(i1,dm_one)=r13;

        F13=F9+F14;
        F10=F10+F13, dm(i1,dm_one)=r13;
!end loop

_L$816004:

Page 53: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


Process for developing custom code

Rewrite the “C” code using “LOAD/STORE” techniques

Write the assembly code using a hardware loop
– Check that end of loop label is in the correct place

REWRITE the assembly code using registers and instructions that COULD be used in parallel IF you could find the correct optimization approach

Move algorithm to “Resource Usage Chart”

Optimize (Attempt to)

Compare and contrast time -- include set up and loop control time -- was it worth the effort?

Page 54: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


Generate the resource usage chart

Here are the 7 cycles needed during EVERY calculation of the power

Page 55: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


2 cycles / calculation on average after pipelining -- IF you ignore DM memory operations

Page 56: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


Know the processor characteristics

Need two extra DM cycles or equivalent

Reorder the code to give
– 1 extra DM cycle in parallel with a register to register move

But R1 = R2 form of operation is a UREG to UREG move and will not fit into the instruction

So REPLACE UREG to UREG move with a COMPUTE OPERATION

R1 = PASS R2

End up with 2.5 cycles/calculation instead of original 7
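The arithmetic behind figures like these can be sketched as a simple resource bound: if each unit (multiplier, ALU, DM bus, PM bus) can do at most one operation per cycle, one pass of the loop needs at least as many cycles as its busiest unit has operations. The C below is a minimal model only; the struct, the function name, and the operation tallies in the usage note are our assumptions, inferred from the slides rather than taken from the resource chart itself.

```c
/* Number of operations one pass of a loop body needs from each resource. */
struct resource_use {
    int multiplier;  /* multiplies                        */
    int alu;         /* additions and PASS operations     */
    int dm_access;   /* reads and writes along the DM bus */
    int pm_access;   /* reads along the PM bus            */
};

/* Lower bound on cycles per pass, assuming each resource handles at most
   one operation per cycle and everything else overlaps perfectly. */
int min_cycles(struct resource_use use)
{
    int bound = use.multiplier;
    if (use.alu > bound)       bound = use.alu;
    if (use.dm_access > bound) bound = use.dm_access;
    if (use.pm_access > bound) bound = use.pm_access;
    return bound;
}
```

With one pass covering two power calculations, we assume 4 multiplies, 4 additions and 2 PM reads, which gives a bound of 4 cycles (2 per calculation) when DM accesses are ignored. Counting 2 DM reads, 2 DM writes and 2 extra DM accesses raises the bound to 6. Trading one extra DM access for a PASS in the ALU gives 5 ALU operations and 5 DM accesses, so 5 cycles per pass, or 2.5 per calculation.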

Page 57: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


Final code -- testing a pain

Page 58: Generation of highly parallel code for 2106X processors An introduction Developed by M. R. Smith Presented by S. Lei SHARC2000 Workshop, Boston, September


More luck than judgement

Unlike the first “easier” Temperature Conversion code, this “hard” example actually optimizes much more, especially in terms of overall code length.

This particular length of code happens to work REGARDLESS of the size of N
