Doing Lab. 3 the TDD way: First optimizing step

Using example of a program that converts temperature from Centigrade to Fahrenheit

Conversion of Temperature

F = (9 / 5) * C + 32;

void Convert1pt_Temperature(C, F)

void ConvertBlock_Temperature(C[], F[], N)

Has very similar properties to FIR when run in a loop
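For reference, the two routines could look like this in C. This is a minimal sketch, not the lab's actual code: the slides give only the function names and argument lists, so the `float` data type, the output pointer on the single-point version, and the `size_t` count are assumptions.

```c
#include <stddef.h>

/* Convert one temperature from Centigrade to Fahrenheit: F = (9/5) C + 32.
   Assumption: the result is returned through a pointer. */
void Convert1pt_Temperature(float C, float *F) {
    *F = (9.0f / 5.0f) * C + 32.0f;
}

/* Convert a block of N temperatures, one multiply and one add per point. */
void ConvertBlock_Temperature(const float C[], float F[], size_t N) {
    for (size_t i = 0; i < N; i++) {
        F[i] = (9.0f / 5.0f) * C[i] + 32.0f;
    }
}
```

The block version has the same shape as an FIR inner loop: fetch, multiply, add, store, repeated N times.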

Write first test

When run, expect the test to fail since there is no code

NOT FAIL, DOES NOT COMPILE
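A first test can be sketched in plain C as follows. The slides do not show the test framework here, so the helper `CountFailures_ConvertBlock`, the `BlockConverter` type, and the stub are hypothetical names used only for illustration. An empty stub is enough to get past the "does not compile" stage and to see the expected failures.

```c
#include <stddef.h>

/* Hypothetical function-pointer type for any block converter under test. */
typedef void (*BlockConverter)(const float C[], float F[], size_t N);

/* Returns how many points differ from the known Fahrenheit values. */
size_t CountFailures_ConvertBlock(BlockConverter convert) {
    const float C[4] = {-40.0f, 0.0f, 37.0f, 100.0f};
    const float expected[4] = {-40.0f, 32.0f, 98.6f, 212.0f};
    float F[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t failures = 0;

    convert(C, F, 4);
    for (size_t i = 0; i < 4; i++) {
        float diff = F[i] - expected[i];
        if (diff < 0.0f) {
            diff = -diff;
        }
        if (diff > 0.001f) {
            failures++;
        }
    }
    return failures;
}

/* Empty stub: lets the test build before any real code exists. */
void ConvertBlock_Stub(const float C[], float F[], size_t N) {
    (void)C;
    (void)F;
    (void)N;
}
```

With the stub in place every point is reported as a failure, which is the starting point TDD asks for.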

Which line is causing the problem?

Hidden (ghost) hardware breakpoint not allowing tests to complete

Clear the ghost breakpoint and continue testing

Three indicators that tests “probably completed”

Stack overflow could give these “good” results but still be wrong

All tests should fail – but don’t

Now the “code review” spots the error

Which line is wrong in the tests?

Fix that line – and now we get the expected failures

Do line by line translation

Watch my expected “exam coding” format

WATCH FOR THOSE VLIW INSTRUCTION DELIMITERS ;;

CHECK ;; AGAIN

Expect BTB errors

Do temp fix with nop; lines

How well are we doing? 30% worse

Working code

Note all the “assembler” issues we have to resolve

In exam: leave unresolved unless told otherwise

Exams are hard enough

Compare

Single-point C (1024 calls): Debug 90 cycles / pt, Release 23 cycles / pt

Block C: Debug 69 cycles / pt, Release 14 cycles / pt

My first assembly code: 18 cycles / pt

But this 23 cycles / pt was 30 cycles / pt yesterday. Are we seeing cache issues?

Data cache: NO, as we have not activated it

Branch target cache: DON’T KNOW. Is it automatically activated?

Alignment of loops: DON’T KNOW. Expect the “compiler” to take care of that

Design by contract

• Attempt to switch to MIMD mode, where additions, multiplications and memory ops occur in parallel

• Check the tests

• Switch to super‐Harvard mode (dual memory fetches)

• Switch to MIMD mode with SIMD overtones: use both X and Y compute blocks

• Try and persuade “C” to do the same

Quick build of tests for parallel add and multiply ASM

• Use a C define statement (Line 4) to change name of function called

• Refactor later so have all tests available

Make a copy of RealASM_MultiplePointProcess.asm and perform function name change

Can use the test as we refactor the code for speed
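The define trick can be sketched as below. The function names and the macro name `FUNCTION_UNDER_TEST` are hypothetical, and the two converters are written in C only as stand-ins so that the sketch builds on its own; in the lab they would be the original assembly routine and the renamed copy.

```c
#include <stddef.h>

/* C stand-in for the original assembly routine (hypothetical name). */
void ConvertBlock_ASM(const float C[], float F[], size_t N) {
    for (size_t i = 0; i < N; i++) {
        F[i] = (9.0f / 5.0f) * C[i] + 32.0f;
    }
}

/* C stand-in for the renamed copy being refactored for speed. */
void ConvertBlock_ParallelASM(const float C[], float F[], size_t N) {
    for (size_t i = 0; i < N; i++) {
        F[i] = (9.0f / 5.0f) * C[i] + 32.0f;
    }
}

/* Change this one line to aim the same tests at a different routine. */
#define FUNCTION_UNDER_TEST ConvertBlock_ParallelASM

/* Runs the shared checks against whichever routine the define selects
   and returns the number of mismatched points. */
size_t RunBlockTests(void) {
    const float C[3] = {-40.0f, 0.0f, 100.0f};
    const float expected[3] = {-40.0f, 32.0f, 212.0f};
    float F[3] = {0.0f, 0.0f, 0.0f};
    size_t failures = 0;

    FUNCTION_UNDER_TEST(C, F, 3);
    for (size_t i = 0; i < 3; i++) {
        float diff = F[i] - expected[i];
        if (diff < 0.0f) {
            diff = -diff;
        }
        if (diff > 0.001f) {
            failures++;
        }
    }
    return failures;
}
```

Because the test body never names the routine directly, the same checks can be rerun after every speed refactoring step.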

ORIGINAL CODE

Columns: OTHER, ADDER, MULT, J‐BUS

          LC0 = N
    LOOP: XR0 = 9/5
          XR1 = 32
          XR2 = J4++
          XFR3 = R0 * R2
          XFR4 = R3 + R1
          J5++ = XR4
          IF NLC0E GOTO LOOP

Cycles / pt: 18.18

MOVE CONSTANTS OUTSIDE LOOP

          LC0 = N
          XR0 = 9/5
          XR1 = 32
    LOOP: XR2 = J4++
          XFR3 = R0 * R2
          XFR4 = R3 + R1
          J5++ = XR4
          IF NLC0E GOTO LOOP

Cycles / pt: expect an improvement of 2 cycles / pt. Actual: a little bit better

MOVE CONSTANTS OUTSIDE LOOP

Cycles / pt: expected an improvement of 2 cycles / pt. Actual: a little bit better, 15 cycles / pt

Does .align_code 4 help?

          LC0 = N
          XR0 = 9/5
          XR1 = 32
          .align_code 4;
    LOOP: XR2 = J4++
          XFR3 = R0 * R2
          XFR4 = R3 + R1
          J5++ = XR4
          .align_code 4
          IF NLC0E GOTO LOOP

Cycles / pt: big change, 11 cycles / pt

Does .align_code 4 help?

Cycles / pt: big change, 11 cycles / pt

Process multiple points inside loop

          LC0 = N / 2;
          .align_code 4;
    LOOP: XR2 = J4++
          XFR3 = R0 * R2
          XFR4 = R3 + R1
          J5++ = XR4
          XR2 = J4++
          XFR3 = R0 * R2
          XFR4 = R3 + R1
          J5++ = XR4
          .align_code 4
          IF NLC0E GOTO LOOP

Should work the same. Actual: works faster, 8 cycles / pt

Process multiple points inside loop

Problem if N is odd

          IF N is even Jump LOOP
          XR2 = J4++
          XFR3 = R0 * R2
          XFR4 = R3 + R1
          J5++ = XR4
          .align_code 4;
    LOOP: XR2 = J4++
          XFR3 = R0 * R2
          XFR4 = R3 + R1
          J5++ = XR4
          XR2 = J4++
          XFR3 = R0 * R2
          XFR4 = R3 + R1
          J5++ = XR4
          IF NLC0E GOTO LOOP

Actual Code – How would C know to do this?

OTHER ADDER MULTIPLIER J‐BUS K‐BUS

                IF N is even Jump LOOP_START
                XR2 = J4++
                XFR3 = R0 * R2
                XFR4 = R3 + R1
                J5++ = XR4
    LOOP_START: LC0 = N / 2
                .align_code 4;
    LOOP:       XR2 = J4++
                XFR3 = R0 * R2
                XFR4 = R3 + R1
                J5++ = XR4
                XR2 = J4++
                XFR3 = R0 * R2
                XFR4 = R3 + R1
                J5++ = XR4
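To see what the C compiler would have to work out for itself, the same structure can be written by hand in C. This is a minimal sketch with a hypothetical function name; it assumes `float` data and mirrors the assembly: one leading point when N is odd, then N / 2 passes that each convert two points.

```c
#include <stddef.h>

void ConvertBlock_TwoPerLoop(const float C[], float F[], size_t N) {
    const float scale = 9.0f / 5.0f;  /* plays the role of XR0 */
    const float offset = 32.0f;       /* plays the role of XR1 */
    size_t i = 0;

    /* If N is odd, convert one point before entering the loop. */
    if (N % 2 != 0) {
        F[i] = scale * C[i] + offset;
        i++;
    }

    /* LOOP_START: the loop count is N / 2, two points per pass. */
    for (size_t count = N / 2; count > 0; count--) {
        F[i] = scale * C[i] + offset;
        F[i + 1] = scale * C[i + 1] + offset;
        i += 2;
    }
}
```

A compiler can only do this automatically if it can prove, or is told, how N relates to the unrolling factor; otherwise it has to emit the odd-N fix-up code itself.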

CODE FOR 1 POINT

    LOOP_START: LC0 = N / 2
                .align_code 4;
    LOOP:       XR2 = J4++
                MEMORY STALL
                XFR3 = R0 * R2
                MULTIPLY STALL
                XFR4 = R3 + R1
                ADD STALL
                J5++ = XR4
                XR2 = J4++
                MEMORY STALL
                XFR3 = R0 * R2
                MULTIPLY STALL
                XFR4 = R3 + R1
                ADD STALL
                J5++ = XR4

15 cycles for 2 points

EXPECTED 8.5 cycles / point

ACTUAL 10.6 cycles

Many ways to handle this parallelization process

• More than 1 point inside loop

• Rename registers in last part of loop

• This avoids conflicts and allows instructions to be moved up into empty slots
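The renaming idea in the bullets above can be illustrated in C. In this sketch (hypothetical function name, `float` data assumed) the local variables are named after the registers used in the assembly listings: the second point uses `r12`, `r13`, `r14` instead of reusing `r2`, `r3`, `r4`, so the two chains of operations no longer depend on each other and can be interleaved.

```c
#include <stddef.h>

void ConvertBlock_Renamed(const float C[], float F[], size_t N) {
    const float r0 = 9.0f / 5.0f;  /* scale constant, as XR0 */
    const float r1 = 32.0f;        /* offset constant, as XR1 */
    size_t i = 0;

    /* Odd N: handle one point up front so the loop sees an even count. */
    if (N % 2 != 0) {
        F[i] = r0 * C[i] + r1;
        i++;
    }

    for (size_t count = N / 2; count > 0; count--) {
        float r2 = C[i];        /* point 1 fetch */
        float r12 = C[i + 1];   /* point 2 fetch fills point 1's memory stall */
        float r3 = r0 * r2;     /* point 1 multiply */
        float r13 = r0 * r12;   /* point 2 multiply fills the multiply stall */
        float r4 = r3 + r1;     /* point 1 add */
        float r14 = r13 + r1;   /* point 2 add fills the add stall */
        F[i] = r4;
        F[i + 1] = r14;
        i += 2;
    }
}
```

Had the second point reused `r2`, `r3` and `r4`, its fetch could not be moved ahead of the first point's multiply without overwriting a value still in use.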

CODE FOR 1 POINT

    LOOP_START: LC0 = N / 2
                .align_code 4;
    LOOP:       XR2 = J4++
                MEMORY STALL
                XFR3 = R0 * R2
                MULTIPLY STALL
                XFR4 = R3 + R1
                ADD STALL
                J5++ = XR4
                XR12 = J4++
                MEMORY STALL
                XFR13 = R0 * R12
                MULTIPLY STALL
                XFR14 = R13 + R1
                ADD STALL
                J5++ = XR14

CHECK CODE STILL WORKS WITH EACH STAGE

    LOOP_START: LC0 = N / 2
                .align_code 4;
    LOOP:       XR2 = J4++
                MEMORY STALL         XR12 = J4++
                XFR3 = R0 * R2       MEMORY STALL
                MULTIPLY STALL       XFR13 = R0 * R12
                XFR4 = R3 + R1       MULTIPLY STALL
                ADD STALL            XFR14 = R13 + R1
                J5++ = XR4           ADD STALL
                J5++ = XR14
                .align_code 4
                IF NLC0E GOTO LOOP

Did not optimize code carefully enough

Move all defines to one location

Error is now obvious – Pattern broken

    LOOP_START: LC0 = N / 2
                .align_code 4;
    LOOP:       XR2 = J4++
                MEMORY STALL         XR12 = J4++
                XFR3 = R0 * R2       MEMORY STALL
                MULTIPLY STALL       XFR13 = R0 * R12
                XFR4 = R3 + R1       MULTIPLY STALL
                ADD STALL            XFR14 = R13 + R1
                J5++ = XR4           ADD STALL
                J5++ = XR14
                .align_code 4
                IF NLC0E GOTO LOOP

9 cycles / loop

EXPECT 4.5 cycles / pt

Actual 8.68

4 points / loop

[Listing: the interleaved loop body extended to 4 points per loop, starting at .align_code 4; LOOP: XR2 = J4++]

11 cycles / loop

PASS: 1024

FAIL: 1021, 1022, 1023

EXPECT 2.75 cycles / pt

ACTUAL 4.83 cycles / pt
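The failing tests are what one would expect when the loop count is simply N / 4 and N is not a multiple of 4, for example N = 1021 or 1022, while N = 1024 passes. One possible fix, sketched in C with a hypothetical function name and `float` data assumed, is to convert the N % 4 leftover points first and then run the 4-points-per-pass loop.

```c
#include <stddef.h>

void ConvertBlock_FourPerLoop(const float C[], float F[], size_t N) {
    const float scale = 9.0f / 5.0f;
    const float offset = 32.0f;
    const size_t leftover = N % 4;  /* 0 to 3 points that do not fill a pass */
    size_t i = 0;

    /* Convert the leftover points one at a time. */
    for (; i < leftover; i++) {
        F[i] = scale * C[i] + offset;
    }

    /* Main loop: N / 4 passes, four points per pass. */
    for (size_t count = N / 4; count > 0; count--) {
        F[i] = scale * C[i] + offset;
        F[i + 1] = scale * C[i + 1] + offset;
        F[i + 2] = scale * C[i + 2] + offset;
        F[i + 3] = scale * C[i + 3] + offset;
        i += 4;
    }
}
```

Testing with lengths just below a multiple of 4, as the slide does, is what exposes a missing leftover step.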

Compare

Single-point C (1024 calls): Debug 90 cycles / pt, Release 23 cycles / pt

Block C: Debug 69 cycles / pt, Release 14 cycles / pt

My first assembly code: 18 cycles / pt

Parallel add / mult, 2 points / loop: Expect 8.5 cycles / pt, Actual 10.6 cycles / pt

Parallel, 4 points / loop: Expect 2.75 cycles / pt, Actual 4.83 cycles / pt

Next step: Use K‐BUS for output
