March 23, 2017 11:40 am EE457 MT - Spring 2017 1 / 12 C Copyright 2017 Gandhi Puvvada
EE457 Midterm (~24%-28%)
Closed-book Closed-notes Exam; No cheat sheets; No cell phones or computers
Calculators and Verilog Guides are not allowed.
Spring 2017
Instructor: Gandhi Puvvada
Thursday, 3/23/2017 (A 2H 50M exam)
05:00 PM - 07:50 PM (170 min) in SGM124
Viterbi School of Engineering
University of Southern California
Ques#   Topic                               Page#    Time      Points   Score
1       Lab 6 EX_MEM combined               2-4      40 min    80
2       Pipelining miscellaneous            5-6      50 min    93
3       Virtual Memory                      7        10 min    29
4       Cache                               8-9      40 min    66
5       Lab 7 Part 3 SP1 Multi-cycle CPU    10-11    20 min    33
Total   Cover + 10 + Blank = 12 pages                2H 40M    301
Perfect Score                                                  285
Student’s Last Name: _______________________________________
Student’s First Name: _______________________________________
Student’s DEN D2L username: _______________________________________
1 ( 40 + 40 = 80 points) 40 min. Lab 6 Early Branch Design -- EX and MEM stages combined:
Nothing needs to be done on this page. Complete your design on the next two pages. On the next page, I have mainly removed the EX/MEM stage register. Remove unneeded connections and unneeded items of the original 5-stage pipeline on the next page. Revise FU_Br, FU, HDU_Br, HDU as needed. On the page after that, complete the new EX stage stalling, the Memory address, MR and MW.
Consider the left two instructions and compare them with the right two instructions.
lw $8, 0($2);        lw $8, 40($2);
sw $8, 0($2);        sw $8, 60($2);
Let us say, for the sake of this problem, that we observed that many programs quite often use lw and sw instructions with 0 offset (for their address calculation), as shown on the left.
So we proposed to merge the EX and MEM stages into one stage called new EX. Now lw $8, 0($2) on the left side bypasses the ALU, and uses the raw $rs ($2 here) as the memory address. So the left-hand instructions require only 1 clock in the new EX stage. Hence instructions dependent on lw $8, 0($2) do not need to stall for one clock. This is a gain.
But the instructions on the right need two clocks in the new EX stage, so that the address calculation completes before the memory is accessed. This is a loss. We use our standard mechanism of stalling a stage for exactly one clock (from our Lab 7 Part 3 SP2) to provide an extra clock for the right-side instructions in the new EX stage.
Certainly, in the case of sw $8, 60($2), we should not convey the MemWrite control signal to the Data Memory in the first clock of the two-clock process, before the address 60($2) is fully calculated by the ALU. Otherwise you would be corrupting the Data Memory. It is also important in the case of lw $8, 40($2) not to convey the MemRead control signal in the first clock, to avoid unwanted side-effects in the Data Cache. Make sure that MR and MW are inactive at all other times (other than the needed times).
We should convey the right address to the memory either bypassing the ALU or as calculated by the ALU. Hence a mux is added in the address path of the Memory.
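The timing described above can be summarized with a small behavioral sketch. It is only an illustration of the stated behavior (one clock for a zero offset, two clocks otherwise, with MR and MW held inactive in the first of the two clocks); it is not the gate-level design asked for on the following pages. The names `ExClock`, `new_ex_clocks`, and `addr_from_alu` are hypothetical and do not appear in the exam.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExClock:
    """Memory-side signals during one clock of the new (merged) EX stage."""
    addr_from_alu: bool  # address-mux select: False = raw $rs, True = ALU result
    mem_read: bool       # MR as seen by the Data Memory
    mem_write: bool      # MW as seen by the Data Memory


def new_ex_clocks(op: str, offset: int) -> List[ExClock]:
    """Return the per-clock signals for one instruction in the new EX stage.

    Assumption: `op` is a lowercase mnemonic such as "lw", "sw", or "add".
    """
    is_lw, is_sw = op == "lw", op == "sw"
    if not (is_lw or is_sw):
        # Non-memory instruction: one clock, MR and MW stay inactive
        # (the address-mux select is a don't-care, shown here as False).
        return [ExClock(False, False, False)]
    if offset == 0:
        # Zero offset: bypass the ALU and use raw $rs as the address, one clock.
        return [ExClock(False, is_lw, is_sw)]
    # Non-zero offset: the first clock only calculates the address, so MR and
    # MW are held inactive; the second clock accesses memory with the ALU result.
    return [
        ExClock(True, False, False),
        ExClock(True, is_lw, is_sw),
    ]
```

For example, `new_ex_clocks("sw", 60)` yields two clocks with `mem_write` active only in the second, while `new_ex_clocks("lw", 0)` yields a single clock that reads memory at the raw $rs address.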
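[Figure: control logic for the new EX stage. Recoverable labels: Offset0; a D flip-flop (D, Q, CLK, CLR) with RESET_B and an output marked "2nd"; the EX / ME / WB control fields (ALUSrc, ALUOp, RegDst, MemRead, MemWrite); EN; SW_LW_0 ("Offset is 0") and SW_LW_N0 ("Offset is not 0"); MR and MW; a 2-bit-wide 2-to-1 mux with inputs I0[1:0], I1[1:0], output Y[1:0], and select S; the ALU with ALUSrc; and the Data memory with its address (@), W, and R ports behind a 2-to-1 mux.]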
[Figure: Early Branch with incomplete modifications to merge EX & MEM stages. Stages: IF-Stage, ID-Stage, EX_Stage, WB-Stage, separated by the IF/ID, ID/EX, and EX/WB registers. Blocks: PC, +4 adder, Instruction memory, Registers, Control, Sign ext., Shift Left 2, branch comparator (=), ALU, ALU ctrl, Data memory, Hazard detection unit, HDU_Br, Forwarding Unit, FU_Br, and the Offset0 D flip-flop (CLK, CLR, RESET_B) with the output marked "2nd". Signals: opcode, rs, rt, rd, shift, funct, s_ext, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg, RegWrite, RegWrite_EX, MemRead_EX, MemRead_MEM, WriteRegister_EX, WriteRegister_MEM, MEM_data, REG_data, IF.Flush, Branch, Zero, STALL, STALL_BEQ, STALL_LW, FW_RS, FW_RT, FW_RS_WB, FW_RS_MEM, FW_RT_WB, FW_RT_MEM, forwarding_mux_control.]
Remove unneeded connections and unneeded items. Revise FU_Br, FU, HDU_Br, HDU.
40 pts
[Figure: Early Branch with incomplete modifications to merge EX & MEM stages (second copy, without the hazard detection and forwarding units). Stages: IF-Stage, ID-Stage, EX_Stage, WB-Stage, separated by the IF/ID, ID/EX, and EX/WB registers. Blocks: PC, +4 adder, Instruction memory, Registers, Control, Sign ext., Shift Left 2, branch comparator (=), ALU, Data memory, the D flip-flop (CLK, CLR, RESET_B) with the output marked "2nd", and a 2-bit-wide 2-to-1 mux (I0[1:0], I1[1:0], Y[1:0], S). Signals: opcode, rs, rt, rd, shift, funct, s_ext, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg, RegWrite, MEM_data, REG_data, IF.Flush, Branch, Zero, STALL, EN, RESET_B, Offset0, SW_LW_0 ("Offset is 0"), SW_LW_N0 ("Offset is not 0"), MR, MW.]
Complete the rest of the design: New EX stage stalling, Memory address, MR, MW.
40 pts
2 ( 23+16+6+5+8+11+5+4+15 = 93 points) 50 min. Pipelining miscellaneous
2.1 In Lab 6 early branch design, the Branch signal X can be tapped from ______________ (A / B / C / multiple (state which)), whereas signal Y can be tapped from __________________ (A / B / C / multiple (state which)).
The HDU_Br has 5-bit comparison units, _____ (2 / 4 / 6 / 8 / 10) in the 5-stage pipeline, _____ (2 / 4 / 6 / 8 / 10) in the 7-stage pipeline, _____ (2 / 4 / 6 / 8 / 10) in the 9-stage pipeline.
We used a ______ -bit wide bubble-injecting multiplexer in the late branch but here it is only a ______-bit wide mux as branch signal is used in the ID stage. These can be reduced to the bare minimum of _____ -bit and ____-bit respectively. In the 5-stage early branch design, the significant control signals in the EX, MEM and WB stages are MemWrite, ____________________________________________.
2.2 WB_After stage: Recently we were discussing a Spring 2012 Midterm question that allowed stalling in the EX1 stage instead of stalling in the ID stage in our Lab 7 Part 1 (the 3-element adder) by adding a WB_After stage. It ______ (is / isn’t) possible to move stalling to the EX2 stage by adding another WB_After2 stage.
Now let us consider the three (5-stage, 7-stage, and the 9-stage) pipelines of the late branch (we said "late") and consider moving the HDU from ID stage to the EX stage. I am only showing the 7-stage on the side.
2.3 In the Lab 7 Part 1 (the 3-element adder), we have 4 forwarding muxes (X_mux, Y_mux, and Z1_mux in EX1, and Z2_mux in EX2), 6 comparison units in the comparison station in the ID stage, and 3 comparison units in the Register File for IFRF. Now consider a 4-element adder, adding (((X+Y)+Z)+W). It has EX3 to add W. This will have ____ forwarding muxes ( ________________________________________________________________ ) and also _____ comparison units in the comparison station in the ID stage, and _____ comparison units in the Register File for IFRF.
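Point tags on this page, in order: 23 pts, 16 pts, 11 pts.
[Figure for 2.1: ID-stage detail of the early branch design showing opcode, Control, (PC), the EX / ME / WB control fields, the IF/ID and ID/EX registers, the Hazard detection unit, HDU_Br, STALL_BEQ, STALL_LW, STALL, Branch, the candidate tap points A, B, and C, and the signals X and Y.]
[Figure for 2.2: Lab 7 Part 1 pipeline with stages IF, ID, EX1, EX2, WB, and WB_after; blocks PC, IM, RF, and two adders (inputs A, B; output S).]
[Figure: 7-stage late branch pipeline with stages IF1, IF2, ID, EX, MEM1, MEM2, WB; blocks PC, Instr. TLB, Instr. cache, Reg, Data TLB, Data cache, FU, HDU, control; signals Zero, BRANCH, BR.]

                                                          5-stage              7-stage              9-stage
Possible to move HDU to EX?                               Yes / No             Yes / No             Yes / No
If possible, how many WB_After stages needed?             0 / 1 / 2 / 3 / 4    0 / 1 / 2 / 3 / 4    0 / 1 / 2 / 3 / 4
If possible, writing to the Register File occurs from:    WB / WBA1 / WBA2 / WBA3 / WBA4 (one choice per pipeline)
Optional comments/observations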
2.4 Spurious _______________ (STALLs / FORWARDs / neither / both) only reduce performance but still yield right program results. There ________ (is / is no) harm if forwarding help is offered from a senior NOP. There ________ (is / is no) harm if forwarding help is offered to a jump instruction which has no sources in the late branch design where jump also executes from the MEM stage.
2.5 Subdividing stages thinner and increasing the number of stages in a CPU pipeline without increasing its clock frequency is likely to __________ (increase / decrease) its performance because _____________________________________________________________________
2.6 We concluded that the marked pair of 2-to-1 muxes ________ (can / can’t) be removed ________________________ (as they / because though they seem to) provide the same help to the dependent instruction in EX stage from its senior 2 in WB, that was already provided through the FU_Br in the previous clock ____________________________________________________________
2.7 In the Lab 7 Part 3 SP1 (with SUB3 in EX1 and ADD4 in EX2), a MOV instruction in EX2 can help its junior _____________________ (always / sometimes / never) ________________ (because / though) it itself may be dependent on its senior in WB.
2.8 In both Lab 7 Part 1 and Lab 7 Part 3, there ____________ (is a / isn’t any) wrist-band Flip-Flop ____________________________________________________________________________
2.9 Your VLSI engineer wanted to add a dummy stage for some layout convenience, either on the upstream or on the downstream of the Register File, as shown below, to our 7-stage late branch implementation. Which is less expensive and why? What are the disadvantages of each?
Disadvantages (of DURF and of DDRF): ____________________________________________________________________________
Cost comparison (between DURF and DDRF): ____________________________________________________________________________
Performance comparison (between DURF and DDRF): ____________________________________________________________________________
Point tags on this page, in order: 6 pts, 5 pts, 8 pts, 5 pts, 4 pts, 15 pts.
[Figure for 2.6: 5-stage early branch pipeline (IF, ID, EX, MEM, WB) with PC, Instr., Reg, Data, control, HDU, HDU_Br, FU, FU_Br, Zero, BRANCH / BR, and a pair of 2-to-1 muxes marked "Remove??".]
[Figures for 2.9: two copies of the 7-stage late branch pipeline (IF1, IF2, ID, EX, MEM1, MEM2, WB; PC, Instr. TLB, Instr. cache, Reg, Data TLB, Data cache, FU, HDU, control, Zero, BRANCH / BR), each with an added Dummy stage. Captions: DURF (Dummy on Upstream of RF) and DDRF (Dummy on Downstream of RF).]
3 ( 7 + 22 = 29 points) 10 min. Virtual Memory
3.1 MMU stands for __________________________________.
VPN stands for _____________________. PPFN stands for ____________________________
PTBR stands for _______________________________________________________________
PTBR is changed by the ______________________ (MMU/Operating System) at the time of ___________________________________________________________________________
TLB stands for ______________________________________ .
3.2 If TLB is flushed on context switch, then that TLB _________________ (holds / doesn’t hold) an ASN (Address Space Number ( =Process ID)). TLBs in multi-threaded processors _________________ (hold / don’t hold) the ASN (Address Space Number (= Process ID)), hence those TLBs ___________ (are/ aren’t) flushed on a context switch.
Given _______ (VPN/PPFN) the TLB provides __________ (VPN/PPFN).
Given _______ (VPN/PPFN) the page table provides __________ (VPN/PPFN).
We ___________ (use / don’t use) parallel search to search the page table. We ___________ (use / don’t use) binary search (also called dictionary search) to search the page table.
Page table becomes _________ (smaller / bigger) if the page size is increased from 4KB to 16KB.
Changing the 2-level Page Table to a 3-level Page Table has an advantage and also a disadvantage.
Advantage: ___________________________________________________________________
Disadvantage: _________________________________________________________________
Address coming out of the inner CPU block is a _____________ (Virtual / Physical) address.
Address coming out of the CPU chip (which includes the on-chip MMU) is a _____________ (Virtual / Physical) address.
In the 7-stage CPU in the Lab 6 Part 4, we have replaced the Data Memory stage with the two stages called ___________________________________________.
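The page-size question in 3.2 comes down to how a virtual address splits into a virtual page number and a page offset. The sketch below works through that arithmetic under two assumptions that the exam does not state: a 32-bit virtual address and a flat (single-level) page table with one entry per virtual page. The function name `flat_page_table_entries` is hypothetical.

```python
def flat_page_table_entries(virtual_address_bits: int, page_bytes: int) -> int:
    """Number of entries in a flat page table (one entry per virtual page).

    Assumes page_bytes is a power of two.
    """
    if page_bytes <= 0 or page_bytes & (page_bytes - 1):
        raise ValueError("page size must be a power of two")
    offset_bits = page_bytes.bit_length() - 1        # log2(page size)
    vpn_bits = virtual_address_bits - offset_bits    # bits left for the VPN
    return 1 << vpn_bits


# Assumed 32-bit virtual address, as an illustration only.
entries_4kb = flat_page_table_entries(32, 4 * 1024)    # 12 offset bits, 20 VPN bits
entries_16kb = flat_page_table_entries(32, 16 * 1024)  # 14 offset bits, 18 VPN bits
```

Under these assumptions, each extra offset bit removes one VPN bit, so quadrupling the page size divides the number of entries by four.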
7 pts
22 pts
Rough work area
4 ( 8 + 8 + 11 + 10 + 10 + 8 + 11 = 66 points) 40 min. Cache
For the sake of this problem, we are choosing the L1 cache for our USC80486 processor (32-bit address, 32-bit Data, Byte addressable processor) to be 1KB (Data RAM of 256 words of 32-bit words in 4 byte-wide banks). The block size is 4 32-bit words. Hence there are 64 Block frames. We are now exploring the following 4+3 = 7 choices: (1) Direct, (2) Set-Associative with DoSA (Degree of Set Associativity) = 2, (3) Set-Associative with DoSA = 32, (4) Fully Associative, (5) Repeat #4 with Cache size doubled to 512 words, (6) Repeat #2 with Cache size of 150% (384 words) and DoSA of 3, and (7) Repeat #3 with Cache size doubled to 512 words. For each of the four mappings, divide the address into appropriate fields, show how the Data RAM is organized, and show how the Tag RAM(s) (or the CAM(s) holding the TAGs) is/are organized. Add all missing information (depth, width, size, Address labels, etc.).
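The address-field division asked for here follows from the parameters given in the problem (32-bit byte address, 4-byte words, 4-word blocks). The sketch below is one way to check the arithmetic; the function name `address_fields` and the dictionary keys are hypothetical. It assumes the number of sets is a power of two, and it counts A1:A0 as part of the byte offset even though this processor presents them as byte enables.

```python
ADDRESS_BITS = 32     # 32-bit byte address, as stated in the problem
BYTES_PER_WORD = 4    # 32-bit words
WORDS_PER_BLOCK = 4   # block size given in the problem


def address_fields(cache_words: int, ways: int) -> dict:
    """Split a 32-bit address into tag / set-index / block-offset widths."""
    blocks = cache_words // WORDS_PER_BLOCK   # number of block frames
    sets = blocks // ways                     # ways == blocks: fully associative
    if sets < 1 or sets * ways != blocks or sets & (sets - 1):
        raise ValueError("ways must divide the blocks into a power-of-two set count")
    block_bytes = WORDS_PER_BLOCK * BYTES_PER_WORD
    offset_bits = block_bytes.bit_length() - 1   # byte-within-block bits
    index_bits = sets.bit_length() - 1           # set-select bits
    tag_bits = ADDRESS_BITS - index_bits - offset_bits
    return {"blocks": blocks, "sets": sets, "tag": tag_bits,
            "index": index_bits, "offset": offset_bits}


direct = address_fields(256, ways=1)           # choice (1)
two_way = address_fields(256, ways=2)          # choice (2)
fully = address_fields(256, ways=64)           # choice (4): one set of 64 frames
three_way = address_fields(256 + 128, ways=3)  # choice (6): 150% of 256 words
```

The same call with `cache_words=512` covers the doubled-size variants. The sketch only sizes the address fields; the Data RAM and Tag RAM / CAM organization still has to be drawn from those widths.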
4.1 Direct Mapping:
4.2 Set-Associative Mapping with DoSA (Degree of Set Associativity) = 2
4.3 Set-Associative Mapping with DoSA (Degree of Set Associativity) = 32 using CAMs to do TAG search in each set.
[Answer template for 4.1 (8 pts): a row of address bits A31 ... A0 labeled (CPU address bits) and (Byte enables), with /BE[3:0]; a DATA RAM built from byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (Size of one Byte-wide Bank: _____ x 8; ___ more like this) with Address, Data_in, and Data_out; a TAG RAM ( ___ more like this) with a Valid bit; a Comp unit __ - bits wide; Hit/Miss.]
[Answer template for 4.2 (8 pts): same elements as the 4.1 template.]
[Answer template for 4.3 (11 pts): a row of address bits A31 ... A0 labeled (Block address), (Word address), and (Byte enables), with Read Address A[3:2] and /BE[3:0]; a DATA RAM built from byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (Size of one Byte-wide Bank: _____ x 8); a CAM with TAG_IN, Valid, Write, CCU_WE_j, CCU_RE, Address_j, and [Hit/Miss]_i; WE and RE driven from uP_Rd, uP_Wr, and CCU_WE_j. Blanks to fill: CAM Size; # of TAGs = __; # of comp units = __; # of sets: ______; # of total such CAM + Data RAM combinations: _____.]
4.4 Fully Associative Mapping
4.5 Do the Fully Associative Mapping of Q#4.4 again for a cache size of 512 words (of 32-bit words) (256*2=512).
4.6 In Q#4.2, we have DoSA = 2 for a 256-word cache. Here we change the DoSA to 3 and the cache size to 384 words (256 + 128 = 384).
4.7 In Q#4.3, we used 256 word cache. Now double it to 512 words. Do the Set-Associative Mapping with DoSA (Degree of Set Associativity) = 32 using CAMs to do TAG search in each set again.
[Answer template for 4.4 (10 pts): a row of address bits A31 ... A0 labeled (Block address), (Word address), and (Byte enables), with Read Address A[3:2] and /BE[3:0]; a DATA RAM built from byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (Size of one Byte-wide Bank: _____ x 8); a CAM with TAG_IN, Valid, Write, CCU_WE, CCU_RE, Address, and [Hit/Miss]; WE and RE driven from uP_Rd, uP_Wr, and CCU_WE. Blanks to fill: CAM Size; # of TAGs = __; # of comp units = __; # of such CAM + Data RAM combinations: _____.]
[Answer template for 4.5 (10 pts): same elements as the 4.4 template.]
[Answer template for 4.6 (8 pts): a row of address bits A31 ... A0 labeled (CPU address bits) and (Byte enables), with /BE[3:0]; a DATA RAM built from byte-wide banks D[7:0], D[15:8], D[23:16], D[31:24] (Size of one Byte-wide Bank: _____ x 8; ___ more like this) with Address, Data_in, and Data_out; a TAG RAM ( ___ more like this) with a Valid bit; a Comp unit __ - bits wide; Hit/Miss.]
[Answer template for 4.7 (11 pts): same elements as the 4.4 template, with per-set signals CCU_WE_j, Address_j, and [Hit/Miss]_i, and the additional blank # of sets: ______.]
5 ( 8 + 10 + 15 = 33 points) 20 min. Multi-cycle CPU
The following figure of Lab 7 Part 3 Sub part 1 is provided for reference only. You do not have to complete it. _________ (Like / Unlike) in the MIPS Multi-cycle CPU with 10 states, here we ________________ (increment / do not increment) the PC in the first clock while fetching the instruction. Explain: _____________________________________________________________________________________
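8 pts
[Figure, for reference only: Lab 7 Part 3 SP1 datapath and control with the states (IF), (ID), (EX12_1), (EX12_2), and (WB). Recoverable transition labels: ADD1 + ADD4 + SUB3 + MOV; SUB3 MOV; ADD4 MOV; MOV. Note on the figure: NOP = (ADD1 + ADD4 + SUB3 + MOV).]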
The datapath on the previous page is suitable for a Single Cycle CPU. T / F
Explain: ___________________________________________________________________________
We modified the previous page datapath by adding an IR (Instruction Register) and an IR_write control signal. We revised the control unit and also completed it below. Compare this design with the design on previous page (completed by you recently for lab paper submission) and state which is more expensive and which performs better. ______________________________________________________________________________________________________________________________________
Convert the control unit to a MEALY and PREFETCH the next instruction as you complete the current instruction. Do you need to modify the DPU for this? Yes / No . If needed, modify the DPU also.
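10 pts
15 pts
[Figure: revised datapath and completed control unit with the added IR. Recoverable labels: IR_Write = 1; transition labels ADD1, ADD4 + SUB3, and MOV; assorted 0/1 control values.]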
Blank page: Please write your name and email. Tear it off and use for rough work. Do not submit.
Student’s First & Last Name: ______________________ email: ________________