March 23, 2017 11:40 am    EE457 MT - Spring 2017    1 / 12    © Copyright 2017 Gandhi Puvvada

EE457 Midterm (~24%-28%)
Closed-book Closed-notes Exam; No cheat sheets; No cell phones or computers

Calculators and Verilog Guides are not allowed.

Spring 2017
Instructor: Gandhi Puvvada

Thursday, 3/23/2017 (A 2H 50M exam)
05:00 PM - 07:50 PM (170 min) in SGM124

Viterbi School of Engineering
University of Southern California

Ques#  Topic                              Page#   Time     Points  Score
1      Lab 6 EX_MEM combined              2-4     40 min   80
2      Pipelining miscellaneous           5-6     50 min   93
3      Virtual Memory                     7       10 min   29
4      Cache                              8-9     40 min   66
5      Lab 7 Part 3 SP1 Multi-cycle CPU   10-11   20 min   33
Total  Cover + 10 + Blank = 12 pages              2H 40M   301

Perfect Score: 285

Student’s Last Name: _______________________________________

Student’s First Name: _______________________________________

Student’s DEN D2L username: _______________________________________


1 ( 40 + 40 = 80 points) 40 min. Lab 6 Early Branch Design -- EX and MEM stages combined:

Nothing needs to be done on this page. Complete your design on the next two pages. On the next page, I have mainly removed the EX/MEM stage register. Remove the unneeded connections and unneeded items of the original 5-stage pipeline on that page. Revise FU_Br, FU, HDU_Br, and HDU as needed. On the page after that, complete the new EX stage stalling, Memory address, MR and MW.

Consider the left two instructions and compare them with the right two instructions.

lw $8, 0($2);        lw $8, 40($2);
sw $8, 0($2);        sw $8, 60($2);

Let us say, for the sake of this problem, that we observed that many programs quite often use lw and sw instructions with 0 offset (for their address calculation), as shown on the left.

So we proposed to merge the EX and MEM stages into one stage called new EX. Now lw $8, 0($2) on the left side bypasses the ALU, and uses the raw $rs ($2 here) as the memory address. So the left-hand instructions require only 1 clock in the new EX stage. Hence instructions dependent on lw $8, 0($2) do not need to stall for one clock. This is a gain.

But the instructions on the right need two clocks in the new EX stage so that the address calculation completes before the memory is accessed. This is a loss. We use our standard mechanism of stalling a stage for exactly one clock (from our Lab 7 Part 3 SP2) to provide an extra clock for the right-side instructions in the new EX stage.
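The gain and loss described above can be weighed with a small counting model. This is only a rough sketch under stated assumptions, not part of the exam: the `MemInstr` record, the `next_uses_result` flag, and the sample trace are hypothetical, branch-related stalls are ignored, and it is assumed that a dependent instruction needs no further stall once a non-zero-offset lw has spent its second clock in the new EX stage.

```python
from dataclasses import dataclass

@dataclass
class MemInstr:
    """One lw/sw in a hypothetical instruction trace."""
    op: str                  # "lw" or "sw"
    offset: int              # immediate offset used for the address
    next_uses_result: bool   # True if the very next instruction needs the lw result

def stalls_original(trace):
    # Original 5-stage pipeline: only a lw immediately followed by a
    # dependent instruction costs one stall clock.
    return sum(1 for i in trace if i.op == "lw" and i.next_uses_result)

def stalls_merged(trace):
    # Merged EX/MEM (assumed model): every lw/sw with a non-zero offset
    # spends one extra clock in the new EX stage; zero-offset ones do not.
    return sum(1 for i in trace if i.offset != 0)

# The four instructions from the problem statement, with assumed dependencies.
trace = [
    MemInstr("lw", 0, True),
    MemInstr("sw", 0, False),
    MemInstr("lw", 40, True),
    MemInstr("sw", 60, False),
]
print(stalls_original(trace), stalls_merged(trace))
```

Under this model the merged design wins only when zero-offset accesses are frequent enough to outweigh the extra clock paid by every non-zero-offset lw and sw.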

Certainly, in the case of sw $8, 60($2), we should not convey the MemWrite control signal to the Data Memory in the first clock of the two-clock process, before the address 60($2) is fully calculated by the ALU. Otherwise you would corrupt the Data Memory. It is also important in the case of lw $8, 40($2) not to convey the MemRead control signal in the first clock, to avoid unwanted side effects in the Data Cache. Make sure that MR and MW are inactive at all other times (other than the needed times).

We should convey the right address to the memory either bypassing the ALU or as calculated by the ALU. Hence a mux is added in the address path of the Memory.
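The behavior described in the paragraphs above (one-clock stall for a non-zero offset, gated MR and MW, and the address mux) can be written down as a truth-function. This is a behavioral sketch only, not the expected circuit answer: the function names and argument names are hypothetical, and `second_clock` stands for the state of the "2nd" flip-flop under the assumption that it is set exactly in the clock following a stall.

```python
def new_ex_control(mem_read, mem_write, offset_is_zero, second_clock):
    """Return (stall, mr, mw, addr_from_alu) for one clock of the new EX stage.

    mem_read / mem_write : MemRead / MemWrite of the instruction in new EX
    offset_is_zero       : True when the offset field of the lw/sw is 0
    second_clock         : assumed state of the '2nd' flip-flop
    """
    is_mem = mem_read or mem_write
    needs_two = is_mem and not offset_is_zero      # address must come from the ALU
    first_of_two = needs_two and not second_clock  # ALU still computing the address
    stall = first_of_two                           # hold the stage for exactly one clock
    access_ok = is_mem and not first_of_two        # memory may be touched in this clock
    mr = mem_read and access_ok                    # gated MemRead
    mw = mem_write and access_ok                   # gated MemWrite
    addr_from_alu = needs_two                      # mux select: ALU result vs raw $rs
    return stall, mr, mw, addr_from_alu

def next_second_clock(stall):
    # Assumed flip-flop behavior: set only during the clock after a stall.
    return stall

# sw $8, 60($2): first clock stalls with MW held off, second clock writes.
first = new_ex_control(False, True, False, False)
second = new_ex_control(False, True, False, next_second_clock(first[0]))
print(first, second)
```

A zero-offset lw, by contrast, returns no stall and an active MR in its single clock, with the mux selecting the raw $rs value.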

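[Figure: detail of the new EX stage control. Recoverable labels: Offset0 detector ("Offset is 0" / "Offset is not 0"); a D flip-flop named 2nd (D, Q, CLK, CLR, RESET_B); ID/EX control fields EX, MEM, WB (ALUSrc, ALUOp, RegDst, MemRead, MemWrite); EN; SW_LW_0; SW_LW_N0; MR; MW; a 2-bit 2-to-1 mux (I0[1:0], I1[1:0], Y[1:0], S); ALU with ALUSrc; Data memory (@, W, R).]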


[Figure (40 pts): "Early Branch with incomplete modifications to merge EX & MEM stages". Pipeline datapath with IF/ID, ID/EX and EX/WB stage registers (IF-Stage, ID-Stage, EX_Stage, WB-Stage). Recoverable labels: PC, Instruction memory, Registers, Control, Sign ext., Shift Left 2, branch comparator (=), ALU and ALU ctrl, Data memory, Forwarding Unit (FW_RS, FW_RT), FU_Br (FW_RS_MEM, FW_RS_WB, FW_RT_MEM, FW_RT_WB), Hazard detection unit, HDU_Br (STALL_BEQ, STALL_LW, STALL), IF.Flush, Branch, MemRead_EX, MemRead_MEM, WriteRegister_EX, WriteRegister_MEM, RegWrite_EX, MemtoReg, Zero, the Offset0 detector, and the 2nd flip-flop.]

Remove unneeded connections and unneeded items. Revise FU_Br, FU, HDU_Br, HDU.


[Figure (40 pts): "Early Branch with incomplete modifications to merge EX & MEM stages". Same datapath with the forwarding and hazard detection units left out. Recoverable labels: IF/ID, ID/EX and EX/WB stage registers with EN and RESET_B inputs, PC, Instruction memory, Registers, Control, Sign ext., Shift Left 2, branch comparator (=), ALU, Data memory, IF.Flush, Branch, STALL, MemRead, MemWrite, MemtoReg, Zero, Offset0 detector ("Offset is 0" / "Offset is not 0"), the 2nd flip-flop, SW_LW_0, SW_LW_N0, MR, MW, and a 2-bit 2-to-1 mux (I0[1:0], I1[1:0], Y[1:0], S).]

Complete the rest of the design: new EX stage stalling, Memory address, MR, MW.


2 ( 23+16+6+5+8+11+5+4+15 = 93 points) 50 min. Pipelining miscellaneous

2.1 In the Lab 6 early branch design, the Branch signal X can be tapped from ______________ (A / B / C / multiple (state which)), whereas signal Y can be tapped from __________________ (A / B / C / multiple (state which)).

The HDU_Br has 5-bit comparison units, _____ (2 / 4 / 6 / 8 / 10) in the 5-stage pipeline, _____ (2 / 4 / 6 / 8 / 10) in the 7-stage pipeline, _____ (2 / 4 / 6 / 8 / 10) in the 9-stage pipeline.

We used a ______ -bit wide bubble-injecting multiplexer in the late branch but here it is only a ______-bit wide mux as branch signal is used in the ID stage. These can be reduced to the bare minimum of _____ -bit and ____-bit respectively. In the 5-stage early branch design, the significant control signals in the EX, MEM and WB stages are MemWrite, ____________________________________________.

2.2 WB_After stage: Recently we were discussing a Spring 2012 Midterm question that allowed stalling in the EX1 stage instead of the ID stage in our Lab 7 Part 1 (the 3-element adder) by adding a WB_After stage. It ______ (is / isn’t) possible to move stalling to the EX2 stage by adding another WB_After2 stage.

Now let us consider the three (5-stage, 7-stage, and the 9-stage) pipelines of the late branch (we said "late") and consider moving the HDU from ID stage to the EX stage. I am only showing the 7-stage on the side.

2.3 In the Lab 7 Part 1 (the 3-element adder), we have 4 forwarding muxes (X_mux, Y_mux, and Z1_mux in EX1, and Z2_mux in EX2), 6 comparison units in the comparison station in the ID stage, and 3 comparison units in the Register File for IFRF. Now consider a 4-element adder, adding (((X+Y)+Z)+W). It has EX3 to add W. This will have ____ forwarding muxes ( ________________________________________________________________________________ ) and also _____ comparison units in the comparison station in the ID stage, and _____ comparison units in the Register File for IFRF.

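[Figure for 2.1 (23 pts): ID stage of the Lab 6 early branch design. Recoverable labels: IF/ID and ID/EX registers, opcode, Control, (PC), control fields EX, MEM, WB with the bubble-injecting mux (0 input), Hazard detection unit, HDU_Br (STALL_BEQ, STALL_LW, STALL), Branch, tap points A, B, C, and signals X and Y.]

[Figure for 2.2 (16 pts): Lab 7 Part 1 pipeline with stages IF, ID, EX1, EX2, WB, WB_after. Recoverable labels: PC, IM, RF, and two A/B/S adder blocks.]

[Figure: 7-stage late branch pipeline with stages IF1, IF2, ID, EX, MEM1, MEM2, WB. Recoverable labels: PC, Instr. TLB, Instr. cache, Reg, Data TLB, Data cache, FU, HDU, control, Zero, BRANCH, BR.]

(11 pts)
                                                          5-stage                          7-stage                          9-stage
Possible to move HDU to EX?                               Yes / No                         Yes / No                         Yes / No
If possible, how many WB_After stages needed?             0 / 1 / 2 / 3 / 4                0 / 1 / 2 / 3 / 4                0 / 1 / 2 / 3 / 4
If possible, writing to the Register File occurs from:    WB / WBA1 / WBA2 / WBA3 / WBA4   WB / WBA1 / WBA2 / WBA3 / WBA4   WB / WBA1 / WBA2 / WBA3 / WBA4
Optional comments/observations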


2.4 Spurious _______________ (STALLs / FORWARDs / neither / both) only reduce performance but still yield right program results. There ________ (is / is no) harm if forwarding help is offered from a senior NOP. There ________ (is / is no) harm if forwarding help is offered to a jump instruction which has no sources in the late branch design where jump also executes from the MEM stage.

2.5 Subdividing stages thinner and increasing the number of stages in a CPU pipeline without increasing its clock frequency is likely to __________ (increase / decrease) its performance because _____________________________________________________________________

2.6 We concluded that the marked pair of 2-to-1 muxes ________ (can / can’t) be removed ________________________ (as they / because though they seem to) provide the same help to the dependent instruction in EX stage from its senior 2 in WB, that was already provided through the FU_Br in the previous clock ____________________________________________________________

2.7 In the Lab 7 Part 3 SP1 (with SUB3 in EX1 and ADD4 in EX2), a MOV instruction in EX2 can help its junior _____________________ (always / sometimes / never) ________________ (because / though) it itself may be dependent on its senior in WB.

2.8 In both Lab 7 Part 1 and Lab 7 Part 3, there ____________ (is a / isn’t any) wrist-band Flip-Flop ____________________________________________________________________________

2.9 Your VLSI engineer wanted to add a dummy stage for some layout convenience, either on the upstream or on the downstream of the Register File, as shown below, to our 7-stage late branch implementation. Which is less expensive and why? What are the disadvantages of each?

Disadvantages (of DURF and of DDRF): ____________________________________________________________________________
____________________________________________________________________________
Cost comparison (between DURF and DDRF): ____________________________________________________________________________
Performance comparison (between DURF and DDRF): ____________________________________________________________________________

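[Margin point marks on this page, in order: 6 pts, 5 pts, 8 pts, 5 pts, 4 pts, 15 pts.]

[Figure for 2.6: 5-stage early branch pipeline with stages IF, ID, EX, MEM, WB. Recoverable labels: PC, Instr., Reg, Data, HDU, HDU_Br, FU, FU_Br, control, Zero, BRANCH, BR. The pair of 2-to-1 muxes in question is marked "Remove??".]

[Figures for 2.9: two copies of the 7-stage late branch pipeline (PC, Instr. TLB, Instr. cache, Reg, Data TLB, Data cache, FU, HDU, control, Zero, BRANCH, BR), each with an added Dummy stage. Left: DURF (Dummy on Upstream of RF). Right: DDRF (Dummy on Downstream of RF).]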


3 ( 7 + 22 = 29 points) 10 min. Virtual Memory

3.1 MMU stands for __________________________________.

VPN stands for _____________________. PPFN stands for ____________________________

PTBR stands for _______________________________________________________________

PTBR is changed by the ______________________ (MMU/Operating System) at the time of ___________________________________________________________________________

TLB stands for ______________________________________ .

3.2 If the TLB is flushed on a context switch, then that TLB _________________ (holds / doesn’t hold) an ASN (Address Space Number (= Process ID)). TLBs in multi-threaded processors _________________ (hold / don’t hold) the ASN (Address Space Number (= Process ID)), hence those TLBs ___________ (are / aren’t) flushed on a context switch.

Given _______ (VPN/PPFN) the TLB provides __________ (VPN/PPFN).
Given _______ (VPN/PPFN) the page table provides __________ (VPN/PPFN).

We ___________ (use / don’t use) parallel search to search the page table. We ___________ (use / don’t use) binary search (also called dictionary search) to search the page table.
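As background for the TLB and page-table questions above, here is a minimal sketch of address translation. It is an illustration under assumptions, not the course's model: 4KB pages, a single-level page table held in a Python list, a TLB modeled as a dictionary, and no page-fault or ASN handling. The names `translate`, `tlb`, and `page_table` are hypothetical.

```python
PAGE_BITS = 12  # assumed 4KB pages

def translate(vaddr, tlb, page_table):
    """Translate a virtual address; returns (physical_address, tlb_hit)."""
    vpn = vaddr >> PAGE_BITS                    # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)     # page offset is not translated
    if vpn in tlb:
        # Models the TLB's associative match of the VPN against stored tags.
        ppfn, hit = tlb[vpn], True
    else:
        # The page table is simply indexed by the VPN.
        ppfn, hit = page_table[vpn], False
        tlb[vpn] = ppfn                         # refill the TLB for next time
    return (ppfn << PAGE_BITS) | offset, hit

page_table = [7, 3, 9, 1]   # hypothetical mapping: index = VPN, value = PPFN
tlb = {}
print(translate(0x2ABC, tlb, page_table))   # first access misses in the TLB
print(translate(0x2ABC, tlb, page_table))   # second access hits
```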

Page table becomes _________ (smaller / bigger) if the page size is increased from 4KB to 16KB.
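The effect of page size on page-table size is plain arithmetic. The sketch below assumes a 32-bit virtual address and a flat, single-level page table with one entry per virtual page; the function name is hypothetical.

```python
def page_table_entries(virtual_address_bits, page_size_bytes):
    """Number of entries in a flat page table (one entry per virtual page)."""
    offset_bits = page_size_bytes.bit_length() - 1   # log2 of a power-of-two size
    return 1 << (virtual_address_bits - offset_bits)

for size in (4 * 1024, 16 * 1024):
    print(size, page_table_entries(32, size))
```

With these assumptions, 4KB pages leave 20 bits of VPN and 16KB pages leave 18 bits.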

Changing the 2-level Page Table to a 3-level Page Table has an advantage and also a disadvantage.
Advantage: ___________________________________________________________________
Disadvantage: _________________________________________________________________

Address coming out of the inner CPU block is a _____________ (Virtual / Physical) address.
Address coming out of the CPU chip (which includes the on-chip MMU) is a _____________ (Virtual / Physical) address.

In the 7-stage CPU in the Lab 6 Part 4, we have replaced the Data Memory stage with the two stages called ___________________________________________.

7 pts

22 pts

Rough work area


4 ( 8 + 8 + 11 + 10 + 10 + 8 + 11 = 66 points) 40 min. Cache

For the sake of this problem, we are choosing the L1 cache for our USC80486 processor (32-bit address, 32-bit data, byte-addressable processor) to be 1KB (Data RAM of 256 32-bit words in 4 byte-wide banks). The block size is four 32-bit words. Hence there are 64 block frames. We are now exploring the following 4+3 = 7 choices: (1) Direct, (2) Set-Associative with DoSA (Degree of Set Associativity) = 2, (3) Set-Associative with DoSA = 32, (4) Fully Associative, (5) Repeat #4 with the cache size doubled to 512 words, (6) Repeat #2 with a cache size of 150% (384 words) and DoSA of 3, and (7) Repeat #3 with the cache size doubled to 512 words. For each of the four mappings, divide the address into appropriate fields, show how the Data RAM is organized, and show how the Tag RAM(s) (or the CAM(s) holding the TAGs) is/are organized. Add all missing information (depth, width, size, address labels, etc.).
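The address-field split for each of the seven choices follows from the parameters given above. The sketch below is a sanity-check aid under assumptions, not a filled-in answer sheet: it assumes the number of sets is always a power of two, takes 150% of 256 words as 384 words, and uses hypothetical names (`cache_fields`, `CONFIGS`).

```python
ADDRESS_BITS = 32
BYTES_PER_WORD = 4
WORDS_PER_BLOCK = 4

def cache_fields(cache_words, ways):
    """Return (sets, tag_bits, index_bits, offset_bits) for one configuration."""
    blocks = cache_words // WORDS_PER_BLOCK          # number of block frames
    sets = blocks // ways                            # block frames per way
    # Offset covers byte-in-word plus word-in-block: log2(16) = 4 bits.
    offset_bits = (BYTES_PER_WORD * WORDS_PER_BLOCK).bit_length() - 1
    index_bits = sets.bit_length() - 1               # log2(sets), sets assumed 2^k
    tag_bits = ADDRESS_BITS - index_bits - offset_bits
    return sets, tag_bits, index_bits, offset_bits

# (label, cache size in words, degree of set associativity)
CONFIGS = [
    ("1 direct",            256, 1),
    ("2 DoSA=2",            256, 2),
    ("3 DoSA=32",           256, 32),
    ("4 fully associative", 256, 64),    # one set holding all 64 block frames
    ("5 fully assoc, 512",  512, 128),
    ("6 DoSA=3, 384",       384, 3),
    ("7 DoSA=32, 512",      512, 32),
]

for label, words, ways in CONFIGS:
    print(label, cache_fields(words, ways))
```

Note that choice (6) keeps the same number of sets as choice (2), which is why a non-power-of-two cache size still yields a clean index field in this model.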

4.1 Direct Mapping:

4.2 Set-Associative Mapping with DoSA (Degree of Set Associativity) = 2

4.3 Set-Associative Mapping with DoSA (Degree of Set Associativity) = 32 using CAMs to do TAG search in each set.

[Answer template for 4.1 (8 pts): CPU address bit strip A31 ... A0 to be divided into fields. DATA RAM built from four byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (size of one byte-wide bank: _____ x 8; ___ more like this) with Address, Data_in, Data_out, and byte enables /BE[3:0]. TAG RAM with a 1-bit Valid field ( ___ more like this). Comp unit __ bits wide. Hit/Miss.]

[Answer template for 4.2: CPU address bit strip A31 ... A0. DATA RAM built from four byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (size of one byte-wide bank: _____ x 8) with Address, Data_in, Data_out, and byte enables /BE[3:0]. Tag RAM with a 1-bit Valid field. Comp unit __ bits wide. Hit/Miss.]

[Answer template for 4.3 (11 pts): CPU address bit strip A31 ... A0 with Block address, Word address (Read Address A[3:2]), and byte enables /BE[3:0]. DATA RAM built from four byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (size of one byte-wide bank: _____ x 8). CAM with TAG_IN, Valid, [Hit/Miss]_i, CAM size, Write, Address_j, CCU_WE_j, CCU_RE, uP_Rd, uP_Wr, WE, RE. # of TAGs = __; # of comp units = __; # of sets: ______; # of total such CAM + Data RAM combinations: _____.]


4.4 Fully Associative Mapping

4.5 Do the Fully Associative Mapping of Q#4.4 again for a cache size of 512 words (of 32-bit words) (256*2=512).

4.6 In Q#4.2, we have DoSA = 2 for a 256-word cache. Here we change the DoSA to 3 and the cache size to 384 words (256 + 128 = 384).

4.7 In Q#4.3, we used 256 word cache. Now double it to 512 words. Do the Set-Associative Mapping with DoSA (Degree of Set Associativity) = 32 using CAMs to do TAG search in each set again.


[Answer templates for 4.4 and 4.5 (one each): Block address, Word address (Read Address A[3:2]), and byte enables /BE[3:0]. DATA RAM built from four byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (size of one byte-wide bank: _____ x 8; # of such: ___). CAM with TAG_IN, Valid, [Hit/Miss], CAM size, Write, Address, CCU_WE, CCU_RE, uP_Wr, WE, RE.]

[Answer template for 4.6: CPU address bits and byte enables /BE[3:0]. DATA RAM built from four byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (size of one byte-wide bank: _____ x 8) with Address, Data_in, Data_out. Tag RAM with a 1-bit Valid field. Comp unit __ bits wide. Hit/Miss.]

[Answer template for 4.7: Block address, Word address (Read Address A[3:2]), and byte enables /BE[3:0]. DATA RAM built from four byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (size of one byte-wide bank: _____ x 8). CAM with TAG_IN, Valid, [Hit/Miss]_i, CAM size, Write, Address_j, CCU_WE_j, CCU_RE, uP_Wr, WE, RE. # of sets: ______; # of such: ___.]


5 ( 8 + 10 + 15 = 33 points) 20 min. Multi-cycle CPU

The following figure of Lab 7 Part 3 Sub part 1 is provided for reference only. You do not have to complete it. _________ (Like / Unlike) in the MIPS Multi-cycle CPU with 10 states, here we ________________ (increment / do not increment) the PC in the first clock while fetching the instruction. Explain: _____________________________________________________________________________________

8 pts

[Figure, for reference only: Lab 7 Part 3 Sub part 1 design. Recoverable labels: states (IF), (ID), (EX12_1), (EX12_2), (WB); transition labels MOV, SUB3 MOV, ADD4 MOV, and ADD1 + ADD4 + SUB3 + MOV. Note: NOP = (ADD1 + ADD4 + SUB3 + MOV).]


The datapath on the previous page is suitable for a Single Cycle CPU. T / F

Explain: ___________________________________________________________________________

We modified the previous page datapath by adding an IR (Instruction Register) and an IR_write control signal. We revised the control unit and also completed it below. Compare this design with the design on the previous page (completed by you recently for lab paper submission) and state which is more expensive and which performs better. ______________________________________________________________________________________________________________________________________

Convert the control unit to a MEALY and PREFETCH the next instruction as you complete the current instruction. Do you need to modify the DPU for this? Yes / No. If needed, modify the DPU also.

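[Margin point marks: 10 pts, 15 pts.]

[Figure: completed control unit state diagram for the modified design. Recoverable labels: IR_Write = 1; transition conditions ADD1, ADD4 + SUB3, MOV; binary values 0, 101, 0, 11, 1.]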


Blank page: Please write your name and email. Tear it off and use for rough work. Do not submit.

Student’s First & Last Name: ______________________ email: ________________
