Doing Lab. 3 the TDD way: First optimizing step

Using the example of a program that converts temperature from Centigrade to Fahrenheit

Conversion of Temperature

F = (9 / 5) C + 32

void Convert1pt_Temperature(C, F)

void ConvertBlock_Temperature(C[], F[], N)

Has very similar properties to FIR when run in a loop
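As a reference point for the later assembly versions, the two routines might look like this in C. This is only a sketch: the slides give the names and argument order but not the types, so the `float` values, the pointer used for the single-point output and the `size_t` count are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Convert one Centigrade value to Fahrenheit: F = (9 / 5) C + 32.
   Written as 9.0f / 5.0f because the integer expression 9 / 5 is 1 in C.
   Assumed signature: the slide only shows Convert1pt_Temperature(C, F). */
void Convert1pt_Temperature(float C, float *F)
{
    assert(F != NULL);   /* contract: caller supplies an output location */
    *F = (9.0f / 5.0f) * C + 32.0f;
}

/* Convert N Centigrade values. Each pass is a load, a multiply, an add and
   a store, which is why the loop behaves much like an FIR loop. */
void ConvertBlock_Temperature(const float C[], float F[], size_t N)
{
    assert(C != NULL && F != NULL);
    for (size_t i = 0; i < N; i++) {
        F[i] = (9.0f / 5.0f) * C[i] + 32.0f;
    }
}
```

Calling Convert1pt_Temperature 1024 times and calling ConvertBlock_Temperature once with N = 1024 give the same answers, which is what makes the two routines comparable in cycles / pt.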

Write the first test. When run, expect the test to fail since there is no code yet.

It does NOT FAIL: it DOES NOT COMPILE

Which line is causing the problem?

Hidden (ghost) hardware breakpoint not allowing tests to complete

Clear the ghost breakpoint and continue testing

Three indicators that tests “probably completed”

Stack overflow could give these “good” results but still be wrong

All tests should fail – but don’t

Now the “code review” spots the error. Which line is wrong in the tests?

Fix that line – and now we get the expected failures
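The "expected failures" stage can be sketched in plain C as below. This is a hypothetical stand-in, not the test framework used in the lab: the stub, the function-pointer type, the sample values and the sentinel are all assumptions made for illustration. The point is that a do-nothing stub should fail every check, and a sentinel in the output array is one way to stop an untouched result from looking correct.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed signature for the routine under test. */
typedef void (*ConvertBlockFn)(const float C[], float F[], size_t N);

/* Hypothetical stub: lets the tests compile and link, but converts nothing. */
void ConvertBlock_Temperature_Stub(const float C[], float F[], size_t N)
{
    (void)C;
    (void)F;
    (void)N;
}

/* Returns how many known Centigrade/Fahrenheit pairs the routine gets wrong. */
size_t CountConversionFailures(ConvertBlockFn convert)
{
    static const float inputC[4]    = { -40.0f,  0.0f, 37.0f, 100.0f };
    static const float expectedF[4] = { -40.0f, 32.0f, 98.6f, 212.0f };
    float outputF[4];
    size_t failures = 0;

    assert(convert != NULL);
    for (size_t i = 0; i < 4; i++) {
        outputF[i] = -999.0f;   /* sentinel: an untouched slot never matches */
    }
    convert(inputC, outputF, 4);
    for (size_t i = 0; i < 4; i++) {
        float error = outputF[i] - expectedF[i];
        if (error < 0.0f) {
            error = -error;
        }
        if (error > 0.01f) {
            failures++;
        }
    }
    return failures;
}
```

With the stub, CountConversionFailures reports 4 failures out of 4; once the real routine is passed in, the count should drop to 0.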

Do line-by-line translation. Watch my expected “exam coding” format.

WATCH FOR THOSE VLIW INSTRUCTION DELIMITERS ;;

CHECK ;; AGAIN

Expect BTB errors. Do a temporary fix with nop; lines

How well are we doing? 30% worse

Working code

Note all the “assembler” issues we have to resolve

In the exam, leave them unresolved unless told otherwise

Exams are hard enough

Compare

C code (cycles / pt)            Debug    Release
Single Point C, 1024 calls      90       23
Block C                         69       14

Assembly code (cycles / pt)
My first assembly code          18

But this 23 cycles / pt was 30 cycles / pt yesterday. Are we seeing cache issues?

• Data cache: NO, as we have not activated it
• Branch target cache: DON’T KNOW. Is it automatically activated?
• Alignment of loops: DON’T KNOW. Expect the “compiler” to take care of that

Design by contract

• Attempt to switch to MIMD mode, where additions, multiplications and memory ops occur in parallel

• Check the tests

• Switch to super-Harvard mode (dual memory fetches)

• Switch to MIMD mode with SIMD overtones
– Use both X and Y compute blocks

• Try to persuade “C” to do the same

Quick build of tests for parallel add and multiply ASM

• Use a C define statement (Line 4) to change name of function called

• Refactor later so that we have all tests available

Make a copy of RealASM_MultiplePointProcess.asm and perform the function name change

Can use the test as we refactor the code for speed

ORIGINAL CODE

OTHER ADDER MULT J-BUS

LC0 = N
LOOP:
XR0 = 9/5
XR1 = 32
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
IF NCL0E GOTO LOOP

Cycles / Pt: 18.18

MOVE CONSTANTS OUTSIDE LOOP

LC0 = N
XR0 = 9/5
XR1 = 32
LOOP:
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
IF NCL0E GOTO LOOP

Cycles / Pt: expect an improvement of 2 cycles / pt. Actual is a little bit better.
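The same change can be written in C to make it easier to see what is being moved. This is an illustrative sketch only; the function names and `float` types are assumptions, and an optimizing C compiler may well hoist the constants itself, so the cycle counts above apply to the assembly, not to this code.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors ORIGINAL CODE: the two constants are set up again on every pass. */
void ConvertBlock_ConstantsInsideLoop(const float C[], float F[], size_t N)
{
    assert(C != NULL && F != NULL);
    for (size_t i = 0; i < N; i++) {
        float scale  = 9.0f / 5.0f;   /* XR0 = 9/5, inside LOOP */
        float offset = 32.0f;         /* XR1 = 32,  inside LOOP */
        F[i] = scale * C[i] + offset;
    }
}

/* Mirrors MOVE CONSTANTS OUTSIDE LOOP: set up once, reused on every pass. */
void ConvertBlock_ConstantsOutsideLoop(const float C[], float F[], size_t N)
{
    const float scale  = 9.0f / 5.0f; /* XR0 = 9/5, before LOOP */
    const float offset = 32.0f;       /* XR1 = 32,  before LOOP */

    assert(C != NULL && F != NULL);
    for (size_t i = 0; i < N; i++) {
        F[i] = scale * C[i] + offset;
    }
}
```

Both versions must produce identical output, so the existing tests can be run unchanged against each one.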

MOVE CONSTANTS OUTSIDE LOOP

LC0 = N
XR0 = 9/5
XR1 = 32
LOOP:
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
IF NCL0E GOTO LOOP

Cycles / Pt: expect an improvement of 2 cycles / pt. Actual is a little bit better: 15 cycles / pt

Does .align_code 4 help?

LC0 = N
XR0 = 9/5
XR1 = 32
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
.align_code 4
IF NCL0E GOTO LOOP

Cycles / Pt: big change, 11 cycles / pt

Does .align_code 4 help?

LC0 = N
XR0 = 9/5
XR1 = 32
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
.align_code 4
IF NCL0E GOTO LOOP

Cycles / Pt: big change, 11 cycles / pt

Process multiple points inside loop

LC0 = N / 2;
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
.align_code 4
IF NCL0E GOTO LOOP

Should work the same. Actual works faster: 8 cycles / pt


Process multiple points inside loop

LC0 = N / 2;
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
.align_code 4
IF NCL0E GOTO LOOP

Problem if N is odd

IF N is even Jump LOOP
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
LC0 = N / 2;
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
.align_code 4
IF NCL0E GOTO LOOP

Actual Code – How would C know to do this?

OTHER ADDER MULTIPLIER J‐BUS K‐BUS

IF N is even Jump LOOP_START
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
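One way to ask C for the same structure is to write the odd-N fix-up and the two-points-per-pass loop by hand, as sketched below. The function name and types are assumptions, and nothing here guarantees that a given compiler generates the assembly shown above. It is only meant to make the control flow explicit.

```c
#include <assert.h>
#include <stddef.h>

/* Two points per pass, with a single-point fix-up when N is odd. */
void ConvertBlock_TwoPointsPerPass(const float C[], float F[], size_t N)
{
    const float scale  = 9.0f / 5.0f;
    const float offset = 32.0f;
    size_t i = 0;

    assert(C != NULL && F != NULL);

    /* "IF N is even Jump LOOP_START": otherwise handle one point first. */
    if (N % 2 != 0) {
        F[0] = scale * C[0] + offset;
        i = 1;
    }

    /* LOOP_START: LC0 = N / 2, then two points on every pass. */
    for (size_t pass = N / 2; pass > 0; pass--) {
        F[i]     = scale * C[i]     + offset;
        F[i + 1] = scale * C[i + 1] + offset;
        i += 2;
    }
}
```

For N = 5 the fix-up handles point 0 and the loop runs twice for points 1 to 4. For N = 4 the fix-up is skipped and the loop covers all four points.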




CODE FOR 1 POINT

LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
MEMORY STALL
XFR3 = R0 * R2
MULTIPLY STALL
XFR4 = R3 + R1
ADD STALL
J5++ = XR4
XR2 = J4++
MEMORY STALL
XFR3 = R0 * R2
MULTIPLY STALL
XFR4 = R3 + R1
ADD STALL

15 cycles for 2 points
EXPECTED 8.5 cycles / point
ACTUAL 10.6 cycles

Many ways to handle this parallelization process

• More than 1 point inside loop

• Rename registers in last part of loop

• This avoids conflicts and allows instructions to be moved up into empty slots
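The renaming idea can be sketched in C by giving each point its own temporaries, so that no statement for the second point has to wait for a value belonging to the first. The function and variable names are hypothetical, and the even-N requirement is stated as a contract rather than handled. The register names in the comments are the ones used for the renamed assembly in these slides.

```c
#include <assert.h>
#include <stddef.h>

/* Two points per pass with separate temporaries for each point.
   Contract (assumed for this sketch): N is even. */
void ConvertBlock_RenamedTemporaries(const float C[], float F[], size_t N)
{
    const float scale  = 9.0f / 5.0f;   /* XR0 */
    const float offset = 32.0f;         /* XR1 */

    assert(C != NULL && F != NULL);
    assert(N % 2 == 0);

    for (size_t i = 0; i < N; i += 2) {
        float in0      = C[i];              /* XR2   = J4++       */
        float in1      = C[i + 1];          /* XR12  = J4++       */
        float product0 = scale * in0;       /* XFR3  = R0 * R2    */
        float product1 = scale * in1;       /* XFR13 = R0 * R12   */
        float sum0     = product0 + offset; /* XFR4  = R3 + R1    */
        float sum1     = product1 + offset; /* XFR14 = R13 + R1   */
        F[i]     = sum0;                    /* J5++ = XR4         */
        F[i + 1] = sum1;                    /* J5++ = XR14        */
    }
}
```

Because the two chains share only the constants, the second point's load can sit in the first point's memory stall, its multiply in the multiply stall, and so on.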



CODE FOR 1 POINT

CHECK CODE STILL WORKS WITH EACH STAGE

LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
MEMORY STALL
XFR3 = R0 * R2
MULTIPLY STALL
XFR4 = R3 + R1
ADD STALL
J5++ = XR4
XR12 = J4++
MEMORY STALL
XFR13 = R0 * R12
MULTIPLY STALL
XFR14 = R13 + R1
ADD STALL
J5++ = XR14




LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
MEMORY STALL        XR12 = J4++
XFR3 = R0 * R2      MEMORY STALL
MULTIPLY STALL      XFR13 = R0 * R12
XFR4 = R3 + R1      MULTIPLY STALL
ADD STALL           XFR14 = R13 + R1
J5++ = XR4          ADD STALL
J5++ = XR14
.align_code 4
IF NCL0E GOTO LOOP

Did not optimize code carefully enough

Move all defines to one location

Error is now obvious: pattern broken




LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
XR12 = J4++         (in MEMORY STALL slot)
XFR3 = R0 * R2      MEMORY STALL
XFR13 = R0 * R12    (in MULTIPLY STALL slot)
XFR4 = R3 + R1      MULTIPLY STALL
XFR14 = R13 + R1    (in ADD STALL slot)
J5++ = XR4          ADD STALL
J5++ = XR14
.align_code 4
IF NCL0E GOTO LOOP

9 cycles / loop
EXPECT 4.5 cycles / pt
Actual 8.68
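The "EXPECT" figures on these slides are the loop length divided by the number of points handled per pass. A small C helper makes that arithmetic explicit. The function names are hypothetical, and the calculation deliberately ignores anything outside the loop body, which may be one reason the measured figures come out higher.

```c
#include <assert.h>

/* Expected cycles per point for a loop body of known length. */
double CyclesPerPoint(unsigned cyclesPerLoop, unsigned pointsPerLoop)
{
    assert(pointsPerLoop > 0);   /* contract: at least one point per pass */
    return (double)cyclesPerLoop / (double)pointsPerLoop;
}

/* Ratio of measured to expected cycles per point (1.0 means "as predicted"). */
double SlowdownVersusExpected(double actualCyclesPerPoint,
                              double expectedCyclesPerPoint)
{
    assert(expectedCyclesPerPoint > 0.0);
    return actualCyclesPerPoint / expectedCyclesPerPoint;
}
```

For the 9-cycle loop handling 2 points, CyclesPerPoint(9, 2) gives the 4.5 cycles / pt quoted above. Against the measured 8.68 cycles / pt, that is a slowdown of roughly 1.9 times, so something other than the instruction count is costing cycles.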


4 points / loop

.align_code 4;
LOOP: XR2 = J4++
[Listing: loop body processing 4 points per pass, with the load, multiply, add and store of each point placed in the MEMORY STALL, MULTIPLY STALL and ADD STALL slots of the others]

11 cycles / loop
EXPECT 2.75 cycles / pt
ACTUAL 4.83 cycles / pt

PASS 1024
FAIL 1021, 1022, 1023
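A plausible reading of the failing sizes is that they are the test lengths that are not a multiple of 4, so a loop running N / 4 passes leaves up to three points unconverted. The C sketch below shows one way to cover those leftovers. It is an assumption about the cause, not the fix used in the lab, and the function name and types are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Four points per pass, with the N % 4 leftover points handled one at a time. */
void ConvertBlock_FourPointsPerPass(const float C[], float F[], size_t N)
{
    const float scale  = 9.0f / 5.0f;
    const float offset = 32.0f;
    size_t i = 0;

    assert(C != NULL && F != NULL);

    /* Leftover points first, so the main loop always sees a multiple of 4. */
    for (size_t leftover = N % 4; leftover > 0; leftover--) {
        F[i] = scale * C[i] + offset;
        i++;
    }

    /* Main loop: N / 4 passes, four independent points on each pass. */
    for (size_t pass = N / 4; pass > 0; pass--) {
        F[i]     = scale * C[i]     + offset;
        F[i + 1] = scale * C[i + 1] + offset;
        F[i + 2] = scale * C[i + 2] + offset;
        F[i + 3] = scale * C[i + 3] + offset;
        i += 4;
    }
}
```

Running the same tests with N = 1021, 1022, 1023 and 1024 exercises every leftover count from 0 to 3.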


Compare

C code (cycles / pt)                       Debug    Release
Single Point C, 1024 calls                 90       23
Block C                                    69       14

Assembly code (cycles / pt)
My first assembly code                     18
Parallel add / mult, 2 points / loop       Expect 8.5, Actual 10.6
Parallel add / mult, 4 points / loop       Expect 2.75, Actual 4.83
Next step: use K-BUS for output
