Doing Lab. 3 the TDD way
First optimizing step
Using example of a program that converts temperature from Centigrade to Fahrenheit
Conversion of Temperature
F = 9 / 5 C + 32;
void Convert1pt_Temperature(C, F)
void ConvertBlock_Temperature(C[], F[], N)
Has very similar properties to FIR when run in a loop
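A minimal C sketch of the two routines named above. The slides do not give parameter types, so the float arguments, the pointer used to return the single-point result, and the int count N are assumptions. Note that writing 9 / 5 with integer literals in C would evaluate to 1, so the sketch uses floating-point constants.

```c
/* Sketch only: parameter types are assumptions, not taken from the slides. */

/* Convert one Centigrade value; the result is returned through F. */
void Convert1pt_Temperature(float C, float *F) {
    *F = 9.0f / 5.0f * C + 32.0f;   /* 9 / 5 with ints would truncate to 1 */
}

/* Convert a block of N values, one point per pass of the loop. */
void ConvertBlock_Temperature(const float C[], float F[], int N) {
    for (int i = 0; i < N; i++) {
        F[i] = 9.0f / 5.0f * C[i] + 32.0f;
    }
}
```

Each pass of the block loop is a load, a multiply, an add and a store, which is why it behaves much like an FIR inner loop.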
Write first test
When run, expect test to fail since no code
NOT FAIL, DOES NOT COMPILE
Which line is causing the problem?
Hidden (ghost) hardware breakpoint not allowing tests to complete
Clear ghost breakpoint and continue testing
Three indicators that tests “probably completed”
Stack overflow could give these “good” results but still be wrong
All tests should fail – but don’t
Now the “code review” spots the error
Which line is wrong in the tests?
Fix that line – and now we get the expected failures
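The "tests should fail first" step above can be sketched in plain C. This is an illustration only: the lab's actual test framework and test code are not shown in the slides, so ConvertBlock_Stub and CountMismatches are hypothetical names. With a stub that does nothing, every point should be reported as a mismatch; a test that reports no mismatches against the stub is itself wrong.

```c
#include <stddef.h>

/* Hypothetical stub standing in for code that has not been written yet. */
void ConvertBlock_Stub(const float C[], float F[], size_t N) {
    (void)C; (void)F; (void)N;   /* deliberately does nothing */
}

/* Count outputs that differ from 9/5 * C + 32 by more than tol. */
size_t CountMismatches(const float C[], const float F[], size_t N, float tol) {
    size_t bad = 0;
    for (size_t i = 0; i < N; i++) {
        float diff = F[i] - (9.0f / 5.0f * C[i] + 32.0f);
        if (diff < 0.0f) diff = -diff;   /* absolute error */
        if (diff > tol) bad++;
    }
    return bad;
}
```

Running the check against the stub first confirms that the test can fail at all, before any real conversion code exists.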
Do line by line translation
Watch my expected “exam coding” format
WATCH FOR THOSE VLIW INSTRUCTION DELIMITERS ;;
CHECK ;; AGAIN
Expect BTB errors
Do temp fix with nop; lines
How well are we doing? 30% worse
Working code
Note all the “assembler” issues we have to resolve
In exam, leave unresolved unless told otherwise
Exams are hard enough
Compare
Single Point C, 1024 calls: Debug 90 cycles / pt, Release 23 cycles / pt
Block C: Debug 69 cycles / pt, Release 14 cycles / pt
My first assembly code: Debug 18 cycles / pt
But this 23 cycles / pt was 30 cycles / pt yesterday. Are we seeing cache issues?
• Data cache: NO, as we have not activated it
• Branch target cache: DON’T KNOW, is it automatically activated?
• Alignment of loops? DON’T KNOW, expect “compiler” to take care of that
Design by contract
• Attempt to switch to MIMD mode where additions and multiplications and memory ops occur in parallel
• Check the tests
• Switch to super‐Harvard mode (dual memory fetches)
• Switch to MIMD mode with SIMD overtones
– Use both X and Y compute blocks
• Try and persuade “C” to do the same
Quick build of tests for parallel add and multiply ASM
• Use a C define statement (Line 4) to change name of function called
• Refactor later so have all tests available
Make a copy of RealASM_MultiplePointProcess.asm and perform function name change
Can use the test as we refactor the code for speed
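A small sketch of the define trick described above, assuming the tests call the routine through a macro name. Every identifier here (Original_MultiplePointProcess, Parallel_MultiplePointProcess, FUNCTION_UNDER_TEST, Test_BoilingPoint) is hypothetical, and the two C functions merely stand in for the original and the copied-and-renamed assembly routines.

```c
/* Sketch only: all names here are hypothetical stand-ins. */

/* Stand-in for the original routine. */
void Original_MultiplePointProcess(const float C[], float F[], int N) {
    for (int i = 0; i < N; i++) F[i] = 9.0f / 5.0f * C[i] + 32.0f;
}

/* Stand-in for the renamed copy that will be refactored for speed. */
void Parallel_MultiplePointProcess(const float C[], float F[], int N) {
    for (int i = 0; i < N; i++) F[i] = 9.0f / 5.0f * C[i] + 32.0f;
}

/* Change this one line to point every test at a different version. */
#define FUNCTION_UNDER_TEST Parallel_MultiplePointProcess

/* A test body written once against the macro name. Returns 1 on pass. */
int Test_BoilingPoint(void) {
    float C[1] = {100.0f};
    float F[1] = {0.0f};
    FUNCTION_UNDER_TEST(C, F, 1);
    return F[0] > 211.99f && F[0] < 212.01f;
}
```

The same test source then exercises whichever version the define names, so the tests keep guarding the code while it is being refactored.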
ORIGINAL CODE
OTHER ADDER MULT J‐BUS
LC0 = N
LOOP:
XR0 = 9/5
XR1 = 32
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
IF NCL0E GOTO LOOP
Cycles / Pt 18.18
MOVE CONSTANTS OUTSIDE LOOP
OTHER ADDER MULT J‐BUS
LC0 = N
XR0 = 9/5
XR1 = 32
LOOP:
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
IF NCL0E GOTO LOOP
Cycles / Pt: expect improve 2 cycles / pt, actual a little bit better
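The difference between ORIGINAL CODE and MOVE CONSTANTS OUTSIDE LOOP can be mirrored in C as follows. This is an illustrative sketch with assumed function names and float types; an optimizing C compiler will normally perform this hoisting by itself, whereas in hand-written assembly it has to be done explicitly.

```c
/* Mirrors ORIGINAL CODE: constants are set up on every pass of the loop. */
void ConvertBlock_ConstantsInside(const float C[], float F[], int N) {
    for (int i = 0; i < N; i++) {
        float scale = 9.0f / 5.0f;   /* XR0 = 9/5 */
        float offset = 32.0f;        /* XR1 = 32  */
        F[i] = scale * C[i] + offset;
    }
}

/* Mirrors MOVE CONSTANTS OUTSIDE LOOP: constants are set up once. */
void ConvertBlock_ConstantsOutside(const float C[], float F[], int N) {
    const float scale = 9.0f / 5.0f;
    const float offset = 32.0f;
    for (int i = 0; i < N; i++) {
        F[i] = scale * C[i] + offset;
    }
}
```

Both versions must produce the same results, which is exactly what the existing tests check after the change.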
MOVE CONSTANTS OUTSIDE LOOP
Cycles / Pt: expect improve 2 cycles / pt, actual a little bit better: 15 cycles / pt
Does .align_code 4 help?
OTHER ADDER MULT J‐BUS
LC0 = N
XR0 = 9/5
XR1 = 32
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
.align_code 4
IF NCL0E GOTO LOOP
Cycles / Pt: big change, 11 cycles / pt
Does .align_code 4 help?
Cycles / Pt: big change, 11 cycles / pt
Process multiple points inside loop
OTHER ADDER MULT J‐BUS
LC0 = N / 2;
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
.align_code 4
IF NCL0E GOTO LOOP
Should work same. Actual works faster: 8 cycles / pt
Process multiple points inside loop: problem if N is odd
OTHER ADDER MULT J‐BUS
IF N is even Jump LOOP
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
IF NCL0E GOTO LOOP
Actual Code – How would C know to do this?
OTHER ADDER MULTIPLIER J‐BUS K‐BUS
IF N is even Jump LOOP_START
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
LOOP_START: LC0= N / 2
.align_code 4;
LOOP: XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3 + R1
J5++ = XR4
XR2 = J4++
XFR3 = R0 * R2
XFR4 = R3+R1
J5++ = XR4
.align_code 4
IF NCL0E GOTO LOOP
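One way to answer "How would C know to do this?" is to write the structure out by hand. The sketch below is a C rendering of the assembly above under assumed types and an assumed function name (ConvertBlock_TwoPerLoop): one point is handled up front when N is odd, then the loop runs N / 2 times with two points per pass. Whether a given compiler would generate this shape on its own is not something the slides state.

```c
/* Sketch only: name and types are assumptions. */
void ConvertBlock_TwoPerLoop(const float C[], float F[], int N) {
    const float scale = 9.0f / 5.0f;   /* XR0 */
    const float offset = 32.0f;        /* XR1 */
    int i = 0;

    if (N <= 0) return;

    /* IF N is even Jump LOOP_START; otherwise CODE FOR 1 POINT. */
    if (N % 2 != 0) {
        F[0] = scale * C[0] + offset;
        i = 1;
    }

    /* LOOP_START: LC0 = N / 2, two points per pass. */
    for (int pass = 0; pass < N / 2; pass++) {
        F[i] = scale * C[i] + offset;
        i++;
        F[i] = scale * C[i] + offset;
        i++;
    }
}
```

Integer division makes N / 2 correct in both cases: for odd N the single leading point accounts for the remainder.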
OTHER ADDER MULTIPLIER J‐BUS K‐BUS
IF N is even Jump LOOP_START
CODE FOR 1 POINT
LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
MEMORY STALL
XFR3 = R0 * R2
MULTIPLY STALL
XFR4 = R3 + R1
ADD STALL
J5++ = XR4
XR2 = J4++
MEMORY STALL
XFR3 = R0 * R2
MULTIPLY STALL
XFR4 = R3 + R1
ADD STALL
15 cycles for 2 points
EXPECTED 8.5 cycles / point
ACTUAL 10.6 cycles
Many ways to handle this parallelization process
• More than 1 point inside loop
• Rename registers in last part of loop
• This avoids conflicts and allows instructions to be moved up into empty slots
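The register renaming described in the bullets above can be sketched in C by giving the second point its own temporaries. The variable names follow the register names used on the slides (r2/r12, r3/r13, r4/r14); the function name and float types are assumptions. Because the two points no longer share a temporary, their loads, multiplies and adds are independent and can be interleaved. How a C compiler actually schedules them is up to the compiler.

```c
/* Sketch only: name and types are assumptions. */
void ConvertBlock_Renamed(const float C[], float F[], int N) {
    const float r0 = 9.0f / 5.0f;
    const float r1 = 32.0f;
    int i = 0;

    if (N <= 0) return;
    if (N % 2 != 0) {              /* odd N: one point up front */
        F[0] = r0 * C[0] + r1;
        i = 1;
    }
    for (; i + 1 < N; i += 2) {
        float r2  = C[i];          /* XR2   = J4++      */
        float r12 = C[i + 1];      /* XR12  = J4++      */
        float r3  = r0 * r2;       /* XFR3  = R0 * R2   */
        float r13 = r0 * r12;      /* XFR13 = R0 * R12  */
        float r4  = r3 + r1;       /* XFR4  = R3 + R1   */
        float r14 = r13 + r1;      /* XFR14 = R13 + R1  */
        F[i]     = r4;             /* J5++  = XR4       */
        F[i + 1] = r14;            /* J5++  = XR14      */
    }
}
```

With a shared temporary, the second load would overwrite the first point's value before its multiply had used it, which is the conflict the renaming avoids.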
OTHER ADDER MULTIPLIER J‐BUS K‐BUS
IF N is even Jump LOOP_START
CODE FOR 1 POINT
CHECK CODE STILL WORKS WITH EACH STAGE
LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
MEMORY STALL
XFR3 = R0 * R2
MULTIPLY STALL
XFR4 = R3 + R1
ADD STALL
J5++ = XR4
XR12 = J4++
MEMORY STALL
XFR13 = R0 * R12
MULTIPLY STALL
XFR14 = R13 + R1
ADD STALL
J5++ = XR14
OTHER ADDER MULTIPLIER J‐BUS K‐BUS
IF N is even Jump LOOP_START
CODE FOR 1 POINT
CHECK CODE STILL WORKS WITH EACH STAGE
LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
MEMORY STALL    XR12 = J4++
XFR3 = R0 * R2    MEMORY STALL
MULTIPLY STALL    XFR13 = R0 * R12
XFR4 = R3 + R1    MULTIPLY STALL
ADD STALL    XFR14 = R13 + R1
ADD STALL    J5++ = XR4
J5++ = XR14
.align_code 4
IF NCL0E GOTO LOOP
Did not optimize code carefully enough
Move all defines to one location
Error is now obvious – Pattern broken
OTHER ADDER MULTIPLIER J‐BUS K‐BUS
IF N is even Jump LOOP_START
CODE FOR 1 POINT
CHECK CODE STILL WORKS WITH EACH STAGE
LOOP_START: LC0 = N / 2
.align_code 4;
LOOP: XR2 = J4++
MEMORY STALL    XR12 = J4++
XFR3 = R0 * R2    MEMORY STALL
MULTIPLY STALL    XFR13 = R0 * R12
XFR4 = R3 + R1    MULTIPLY STALL
ADD STALL    XFR14 = R13 + R1
ADD STALL    J5++ = XR4
J5++ = XR14
.align_code 4
IF NCL0E GOTO LOOP
9 cycles / loop
EXPECT 4.5 cycles / pt
Actual 8.68
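The expected figures on these slides follow from dividing the cycles per pass of the loop by the points handled per pass. A small C sketch of that arithmetic, with hypothetical helper names:

```c
/* Expected cycles per point for a loop body of known length. */
double ExpectedCyclesPerPoint(int cycles_per_loop, int points_per_loop) {
    return (double)cycles_per_loop / (double)points_per_loop;
}

/* How far a measured figure sits above the expected one, in percent. */
double PercentAboveExpected(double actual, double expected) {
    return 100.0 * (actual - expected) / expected;
}
```

With the slide's numbers, 9 cycles per loop over 2 points gives the expected 4.5 cycles / pt, and the measured 8.68 cycles / pt is roughly 93% above that.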
OTHER ADDER MULTIPLIER J‐BUS K‐BUS
4 points / loop
11 cycles / loop
EXPECT 2.75 cycles / pt
ACTUAL 4.83 cycles / pt
PASS 1024
FAIL 1021, 1022, 1203
[Schedule table for 4 points / loop: loads (XR2 = J4++, XR12 = J4++), multiplies (XFR3 = R0 * R2, XFR13 = R0 * R12), adds (XFR4 = R3 + R1, XFR14 = R13 + R1) and stores (J5++ = XR4, J5++ = XR14) placed in the MEMORY, MULTIPLY and ADD STALL slots, ending with .align_code 4 and IF NCL0E GOTO LOOP; exact slot layout not recoverable]
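The slide does not say why the listed sizes fail. One likely reading, stated here as an assumption, is that a loop handling 4 points per pass only covers every point when N is a multiple of 4. The C sketch below (hypothetical function name, assumed float types) extends the earlier "code for 1 point" idea to handle up to three leftover points before the main loop.

```c
/* Sketch only: name and types are assumptions. */
void ConvertBlock_FourPerLoop(const float C[], float F[], int N) {
    const float scale = 9.0f / 5.0f;
    const float offset = 32.0f;
    int i = 0;

    if (N <= 0) return;

    /* Leftover points first: N % 4 of them, one at a time. */
    for (; i < N % 4; i++) {
        F[i] = scale * C[i] + offset;
    }

    /* The remaining count is now a multiple of 4. */
    for (; i < N; i += 4) {
        F[i]     = scale * C[i]     + offset;
        F[i + 1] = scale * C[i + 1] + offset;
        F[i + 2] = scale * C[i + 2] + offset;
        F[i + 3] = scale * C[i + 3] + offset;
    }
}
```

Tests with sizes on either side of a multiple of 4 are what expose a missing remainder step.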
Compare
Single Point C, 1024 calls: Debug 90 cycles / pt, Release 23 cycles / pt
Block C code: Debug 69 cycles / pt, Release 14 cycles / pt
My first assembly code: Debug 18 cycles / pt
Parallel add / mult, 2 points / loop: Expect 8.5, Actual 10.6 cycles / pt
Parallel add / mult, 4 points / loop: Expect 2.75, Actual 4.83 cycles / pt
Next step: Use K‐BUS for output