Doing Lab. 3 the TDD way: first optimizing step, using the example of a program that converts temperature from Centigrade to Fahrenheit

Post on 11-Aug-2020


Page 1: Doing Lab. 3 the TDD way First optimizing step (people.ucalgary.ca/~smithmr/2010webs/encm515_10/10Lectures/10…)

Doing Lab. 3 the TDD way
First optimizing step

Using example of a program that converts temperature from Centigrade to Fahrenheit

Page 2:

Conversion of Temperature

F = 9 / 5 C + 32;

void Convert1pt_Temperature(C, F)

void ConvertBlock_Temperature(C[], F[], N)

Has very similar properties to FIR when run in a loop
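A minimal C sketch of the two functions named on this slide. The slide gives only the parameter names, so the types (float values, a pointer for the single-point result, an int count) are assumptions. Like one FIR tap, each point costs one memory read, one multiply, one add and one memory write.

```c
/* Sketch only: parameter types are assumed, the slide shows just the names. */
void Convert1pt_Temperature(float C, float *F)
{
    /* Use 9.0f / 5.0f: in C the integer expression 9 / 5 evaluates to 1. */
    *F = (9.0f / 5.0f) * C + 32.0f;
}

void ConvertBlock_Temperature(const float C[], float F[], int N)
{
    /* Same arithmetic as the single-point version, applied to N points. */
    for (int i = 0; i < N; i++) {
        F[i] = (9.0f / 5.0f) * C[i] + 32.0f;
    }
}
```

Calling Convert1pt_Temperature 1024 times and calling ConvertBlock_Temperature once with N = 1024 give the "single point" and "block" cases compared later in the slides.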

Page 3:

Write first test
When run, expect the test to fail since there is no code

Result: NOT a FAIL, the test DOES NOT COMPILE

Which line is causing the problem?

Page 4:

Hidden (ghost) hardware breakpoint not allowing tests to complete

Clear the ghost breakpoint and continue testing

Three indicators that the tests "probably completed"

Stack overflow could give these "good" results but still be wrong

Page 5:

All tests should fail – but don’t

Page 6:

Now the "code review" spots the error
Which line is wrong in the tests?

Fix that line – and now we get the expected failures
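The "expected failures" stage can be sketched in plain C as below. This is an illustration, not the test framework used in the lab: the stub, the test function name and the 0.01 tolerance are all hypothetical. The stub exists only so that the test compiles and links, and it returns a deliberately wrong value so that every check fails until real code replaces it.

```c
#include <stdio.h>

/* Hypothetical stub: lets the test compile, but gives a wrong answer. */
void Stub_Convert1pt_Temperature(float C, float *F)
{
    (void)C;
    *F = 0.0f;
}

/* Runs three checks against the function passed in.
   Returns the number of failed checks. */
int Test_Convert1pt(void (*convert)(float, float *))
{
    static const float input[]    = { 0.0f, 100.0f, -40.0f };
    static const float expected[] = { 32.0f, 212.0f, -40.0f };
    int failures = 0;

    for (int i = 0; i < 3; i++) {
        float F = -999.0f;
        convert(input[i], &F);
        float error = F - expected[i];
        if (error < -0.01f || error > 0.01f) {
            printf("FAIL: %g C gave %g F, expected %g F\n",
                   input[i], F, expected[i]);
            failures++;
        }
    }
    return failures;
}
```

Passing the stub to Test_Convert1pt should report a failure for every check. A test that passes against the stub points to an error in the test itself, which is the situation the code review above is looking for.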

Page 7:

Do line-by-line translation
Watch my expected "exam coding" format

WATCH FOR THOSE VLIW INSTRUCTION DELIMITERS ;;

CHECK ;; AGAIN

Expect BTB errors
Do a temporary fix with nop; lines

Page 8:

How well are we doing?  30% worse

Page 9:

Working code

Note all the "assembler" issues we have to resolve

In exam, leave unresolved unless told otherwise

Exams are hard enough

Page 10:

Compare

                              Debug             Release
Single point C, 1024 calls    90 cycles / pt    23 cycles / pt
Block C                       69 cycles / pt    14 cycles / pt

My first assembly code: 18 cycles / pt

But this 23 cycles / pt was 30 cycles / pt yesterday. Are we seeing cache issues?
Data cache: NO, as we have not activated it
Branch target cache: DON'T KNOW. Is it automatically activated?
Alignment of loops: DON'T KNOW. Expect the "compiler" to take care of that
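One reading of the earlier "30% worse" remark is that the first assembly version is being compared against the release build of the block C code. That pairing is an assumption; the slides do not say which two figures are meant. Under that assumption, and using the 18.18 cycles / pt quoted later for the original assembly loop, the arithmetic is:

```c
/* Percentage by which `measured` cycles / pt exceeds `baseline` cycles / pt. */
double PercentWorse(double measured, double baseline)
{
    return 100.0 * (measured - baseline) / baseline;
}
```

PercentWorse(18.18, 14.0) comes out just under 30, which is consistent with the remark.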

Page 11:

Design by contract

• Attempt to switch to MIMD mode, where additions, multiplications and memory ops occur in parallel

• Check the tests

• Switch to super-Harvard mode (dual memory fetches)

• Switch to MIMD mode with SIMD overtones
  – Use both X and Y compute blocks

• Try and persuade "C" to do the same

Page 12:

Quick build of tests for parallel add and multiply ASM

• Use a C define statement (Line 4) to change name of function called

• Refactor later so have all tests available
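The define trick in the first bullet can be sketched as follows. All names here are hypothetical, and the two routines are written in C only so that the sketch compiles; in the lab they would be the assembly versions. Changing the single #define line points the existing test at a different implementation without touching the test body.

```c
/* Hypothetical stand-ins for two assembly versions of the block routine. */
void FirstASM_ConvertBlock(const float C[], float F[], int N)
{
    for (int i = 0; i < N; i++) F[i] = 1.8f * C[i] + 32.0f;
}

void ParallelASM_ConvertBlock(const float C[], float F[], int N)
{
    for (int i = 0; i < N; i++) F[i] = 1.8f * C[i] + 32.0f;
}

/* Change this one line to run the same tests on another version. */
#define FUNCTION_UNDER_TEST ParallelASM_ConvertBlock

/* Returns 1 if the selected function passes a simple block check. */
int Test_ConvertBlock(void)
{
    const float C[2] = { 0.0f, 100.0f };
    float F[2] = { 0.0f, 0.0f };

    FUNCTION_UNDER_TEST(C, F, 2);
    return F[0] > 31.99f && F[0] < 32.01f
        && F[1] > 211.99f && F[1] < 212.01f;
}
```

The drawback, which the second bullet addresses, is that only one version is tested per build; a later refactor can give each version its own test so that all of them run together.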

Page 13:

Make a copy of RealASM_MultiplePointProcess.asm and perform function name change

Can use the test as we refactor the code for speed

Page 14:

ORIGINAL CODE

OTHER ADDER MULT J-BUS

      LC0 = N
LOOP: XR0 = 9/5
      XR1 = 32
      XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      IF NCL0E GOTO LOOP

Cycles / Pt: 18.18

MOVE CONSTANTS OUTSIDE LOOP

OTHER ADDER MULT J-BUS

      LC0 = N
      XR0 = 9/5
      XR1 = 32
LOOP: XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      IF NCL0E GOTO LOOP

Cycles / Pt: expect an improvement of 2 cycles / pt. Actual is a little bit better
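The same change expressed in C, as a sketch. The function name and variable names are hypothetical, and the comments map each statement to the pseudo-assembly line it corresponds to. An optimizing C compiler will normally hoist loop-invariant constants by itself; in hand-written assembly it has to be done explicitly.

```c
/* Sketch: constants are set up once, before the loop, instead of per point. */
void Convert_ConstantsOutsideLoop(const float C[], float F[], int N)
{
    const float scale  = 9.0f / 5.0f;   /* role of XR0 = 9/5 */
    const float offset = 32.0f;         /* role of XR1 = 32  */

    for (int i = 0; i < N; i++) {       /* LC0 = N ... IF NCL0E GOTO LOOP */
        float point  = C[i];            /* XR2  = J4++    */
        float scaled = scale * point;   /* XFR3 = R0 * R2 */
        F[i] = scaled + offset;         /* XFR4 = R3 + R1, then J5++ = XR4 */
    }
}
```

Two instructions leave the loop body, which is where the expected saving of 2 cycles / pt comes from.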

Page 15:

MOVE CONSTANTS OUTSIDE LOOP

OTHER ADDER MULT J-BUS

      LC0 = N
      XR0 = 9/5
      XR1 = 32
LOOP: XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      IF NCL0E GOTO LOOP

Cycles / Pt: expect an improvement of 2 cycles / pt. Actual is a little bit better: 15 cycles / pt

Does .align_code 4 help?

OTHER ADDER MULT J-BUS

      LC0 = N
      XR0 = 9/5
      XR1 = 32
      .align_code 4;
LOOP: XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      .align_code 4
      IF NCL0E GOTO LOOP

Cycles / Pt: big change, 11 cycles / pt

Page 16:

Does .align_code 4 help?

OTHER ADDER MULT J-BUS

      LC0 = N
      XR0 = 9/5
      XR1 = 32
      .align_code 4;
LOOP: XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      .align_code 4
      IF NCL0E GOTO LOOP

Cycles / Pt: big change, 11 cycles / pt

Process multiple points inside loop

OTHER ADDER MULT J-BUS

      LC0 = N / 2;
      .align_code 4;
LOOP: XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      .align_code 4
      IF NCL0E GOTO LOOP

Should work the same. Actual works faster: 8 cycles / pt

Page 17:

Page 18:

Process multiple points inside loop

OTHER ADDER MULT J-BUS

      LC0 = N / 2;
      .align_code 4;
LOOP: XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      .align_code 4
      IF NCL0E GOTO LOOP

Problem if N is odd

OTHER ADDER MULT J-BUS

      IF N is even Jump LOOP
      XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      .align_code 4;
LOOP: XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      XR2 = J4++
      XFR3 = R0 * R2
      XFR4 = R3 + R1
      J5++ = XR4
      .align_code 4
      IF NCL0E GOTO LOOP

Page 19:

Actual Code – How would C know to do this?

Page 20:

OTHER ADDER MULTIPLIER J-BUS K-BUS

            IF N is even Jump LOOP_START
            XR2 = J4++
            XFR3 = R0 * R2
            XFR4 = R3 + R1
            J5++ = XR4
LOOP_START: LC0 = N / 2
            .align_code 4;
LOOP:       XR2 = J4++
            XFR3 = R0 * R2
            XFR4 = R3 + R1
            J5++ = XR4
            XR2 = J4++
            XFR3 = R0 * R2
            XFR4 = R3 + R1
            J5++ = XR4
            .align_code 4
            IF NCL0E GOTO LOOP
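The structure of this listing (one point handled up front when N is odd, then a loop that runs N / 2 times over two points) can be sketched in C. The function name is hypothetical and the sketch makes no claim about how a particular compiler would schedule it.

```c
/* Sketch: two points per loop pass, with one point done first when N is odd. */
void Convert_TwoPointsPerLoop(const float C[], float F[], int N)
{
    const float scale  = 9.0f / 5.0f;
    const float offset = 32.0f;
    int i = 0;

    if (N % 2 != 0) {                           /* "IF N is even Jump LOOP_START" */
        F[0] = scale * C[0] + offset;           /* code for 1 point */
        i = 1;
    }

    for (int pass = 0; pass < N / 2; pass++) {  /* LC0 = N / 2 */
        F[i]     = scale * C[i]     + offset;   /* first point of the pair  */
        F[i + 1] = scale * C[i + 1] + offset;   /* second point of the pair */
        i += 2;
    }
}
```

Without the odd-N step, the loop would silently skip the last point of an odd-length block, which is the "problem if N is odd" noted earlier.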

Page 21:

OTHER ADDER MULTIPLIER J-BUS K-BUS

            IF N is even Jump LOOP_START
            CODE FOR 1 POINT
LOOP_START: LC0 = N / 2
            .align_code 4;
LOOP:       XR2 = J4++
            MEMORY STALL
            XFR3 = R0 * R2
            MULTIPLY STALL
            XFR4 = R3 + R1
            ADD STALL
            J5++ = XR4
            XR2 = J4++
            MEMORY STALL
            XFR3 = R0 * R2
            MULTIPLY STALL
            XFR4 = R3 + R1
            ADD STALL

15 cycles for 2 points
EXPECTED 8.5 cycles / point
ACTUAL 10.6 cycles

Page 22:

Many ways to handle this parallelization process

• More than 1 point inside loop

• Rename registers in last part of loop

• This avoids conflicts and allows instructions to be moved up into empty slots
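The renaming idea can be illustrated in C with separate temporaries for the two points. This is a sketch with a hypothetical function name, and it assumes N is even. Because the second point uses its own variables (named after the XR12, XR13 and XR14 registers used in the following slides), none of its steps depends on the first point, so each one can be placed in a slot where the first point would otherwise be waiting.

```c
/* Sketch: two independent dependency chains, interleaved step by step.
   Assumes N is even. */
void Convert_RenamedRegisters(const float C[], float F[], int N)
{
    const float scale  = 9.0f / 5.0f;   /* R0 */
    const float offset = 32.0f;         /* R1 */

    for (int i = 0; i + 1 < N; i += 2) {
        float r2  = C[i];               /* XR2   = J4++     */
        float r12 = C[i + 1];           /* XR12  = J4++     */
        float r3  = scale * r2;         /* XFR3  = R0 * R2  */
        float r13 = scale * r12;        /* XFR13 = R0 * R12 */
        float r4  = r3 + offset;        /* XFR4  = R3 + R1  */
        float r14 = r13 + offset;       /* XFR14 = R13 + R1 */
        F[i]     = r4;                  /* J5++  = XR4      */
        F[i + 1] = r14;                 /* J5++  = XR14     */
    }
}
```

If both points shared r2, r3 and r4, the second load could not be moved above the first multiply without overwriting a value that is still needed.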

Page 23:

OTHER ADDER MULTIPLIER J-BUS K-BUS

CHECK CODE STILL WORKS WITH EACH STAGE

            IF N is even Jump LOOP_START
            CODE FOR 1 POINT
LOOP_START: LC0 = N / 2
            .align_code 4;
LOOP:       XR2 = J4++
            MEMORY STALL
            XFR3 = R0 * R2
            MULTIPLY STALL
            XFR4 = R3 + R1
            ADD STALL
            J5++ = XR4
            XR12 = J4++
            MEMORY STALL
            XFR13 = R0 * R12
            MULTIPLY STALL
            XFR14 = R13 + R1
            ADD STALL
            J5++ = XR14

Page 24:

OTHER ADDER MULTIPLIER J-BUS K-BUS

CHECK CODE STILL WORKS WITH EACH STAGE

            IF N is even Jump LOOP_START
            CODE FOR 1 POINT
LOOP_START: LC0 = N / 2
            .align_code 4;
LOOP:       XR2 = J4++
            MEMORY STALL        XR12 = J4++
            XFR3 = R0 * R2      MEMORY STALL
            MULTIPLY STALL      XFR13 = R0 * R12
            XFR4 = R3 + R1      MULTIPLY STALL
            ADD STALL           XFR14 = R13 + R1
            J5++ = XR4          ADD STALL
                                J5++ = XR14
            .align_code 4
            IF NCL0E GOTO LOOP

Page 25:

Did not optimize code carefully enough

Page 26:

Move all defines to one location

Page 27:

Error is now obvious – Pattern broken

Page 28:

OTHER ADDER MULTIPLIER J-BUS K-BUS

CHECK CODE STILL WORKS WITH EACH STAGE

            IF N is even Jump LOOP_START
            CODE FOR 1 POINT
LOOP_START: LC0 = N / 2
            .align_code 4;
LOOP:       XR2 = J4++
            MEMORY STALL        XR12 = J4++
            XFR3 = R0 * R2      MEMORY STALL
            MULTIPLY STALL      XFR13 = R0 * R12
            XFR4 = R3 + R1      MULTIPLY STALL
            ADD STALL           XFR14 = R13 + R1
            J5++ = XR4          ADD STALL
                                J5++ = XR14
            .align_code 4
            IF NCL0E GOTO LOOP

9 cycles / loop
EXPECT 4.5 cycles / pt
Actual 8.68
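The expected figure here appears to be the cycles in one loop pass divided by the points handled per pass; that reading is an inference from the numbers, not something the slide states. A one-line helper (hypothetical name) makes the arithmetic explicit:

```c
/* Expected cycles per point if every loop pass takes exactly
   cycles_per_loop cycles and processes points_per_loop points. */
double ExpectedCyclesPerPoint(int cycles_per_loop, int points_per_loop)
{
    return (double)cycles_per_loop / points_per_loop;
}
```

ExpectedCyclesPerPoint(9, 2) gives 4.5, and the same division for the 11-cycle, 4-point loop on the next slide gives 2.75. The measured values are higher in both cases, so the estimate leaves out some cost that the hardware actually pays.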

Page 29:

OTHER ADDER MULTIPLIER J-BUS K-BUS

      .align_code 4;
LOOP: XR2 = J4++
      MEMORY STALL      XR12 = J4++
      XFR3 = R0 * R2    MEMORY STALL      XR2 = J4++
      MULTIPLY STALL    XFR13 = R0 * R12  MEMORY STALL      XR12 = J4++
      XFR4 = R3 + R1    MULTIPLY STALL    XFR3 = R0 * R2    MEMORY STALL
      ADD STALL         XFR14 = R13 + R1  MULTIPLY STALL    XFR13 = R0 * R12
      J5++ = XR4        ADD STALL         XFR4 = R3 + R1    MULTIPLY STALL
                        J5++ = XR14       ADD STALL         XFR14 = R13 + R1
                                          J5++ = XR4        ADD STALL
                                                            J5++ = XR14
      .align_code 4
      IF NCL0E GOTO LOOP

11 cycles / loop
4 points / loop
EXPECT 2.75 cycles / pt
ACTUAL 4.83 cycles / pt

PASS 1024
FAIL 1021, 1022, 1203
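A likely reason why a block size of 1024 passes while sizes such as 1021 and 1022 fail is that a 4-points-per-pass loop only covers block sizes that are multiples of 4. This is an assumption; the slide does not state the cause. The C sketch below (hypothetical function name) shows the same unrolling with a clean-up loop for the leftover points.

```c
/* Sketch: four points per loop pass, plus a clean-up loop for N % 4 points. */
void Convert_FourPointsPerLoop(const float C[], float F[], int N)
{
    const float scale  = 9.0f / 5.0f;
    const float offset = 32.0f;
    int i = 0;

    for (int pass = 0; pass < N / 4; pass++) {
        F[i]     = scale * C[i]     + offset;
        F[i + 1] = scale * C[i + 1] + offset;
        F[i + 2] = scale * C[i + 2] + offset;
        F[i + 3] = scale * C[i + 3] + offset;
        i += 4;
    }

    /* Without this loop the last N % 4 points would never be written. */
    for (; i < N; i++) {
        F[i] = scale * C[i] + offset;
    }
}
```

Keeping tests for several block sizes, not only 1024, is what exposes this kind of boundary problem as the loop is refactored for speed.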

Page 30:

Compare

                              Debug             Release
Single point C, 1024 calls    90 cycles / pt    23 cycles / pt
Block C code                  69 cycles / pt    14 cycles / pt

My first assembly code: 18 cycles / pt
Parallel add / mult, 2 points / loop: expect 8.5, actual 10.6
Parallel add / mult, 4 points / loop: expect 2.75, actual 4.83
Next step: use K-BUS for output