Introduction to Processor Architecture
www.company.com
Contents
• Introduction
• Processor architecture overview
• ISA (Instruction Set Architecture)
  – RISC example (SMIPs)
  – CISC example (Y86)
• Processor architecture
  – Single-cycle processor example (SMIPs)
• Pipelining
  – Control hazard
  – Branch predictor
  – Data hazard
• Cache memory
Introduction
Processors
• What is a processor?
• What’s the difference among them?
Processor architecture and program
• Understanding the architecture gives you more opportunities to optimize your program.
• Let’s see some examples.
Example 1
Version 1:
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            sum += array[i][j];
        }
    }
Version 2:
    for (j = 0; j < size; j++) {
        for (i = 0; i < size; i++) {
            sum += array[i][j];
        }
    }
Keyword: Cache
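The two loop orders of Example 1 compute the same sum but touch memory in different orders. The sketch below is a minimal illustration, assuming a fixed matrix size N and the function names sum_row_major and sum_col_major (both hypothetical, not from the slides). C stores 2-D arrays row by row, so the first function reads consecutive addresses while the second jumps N * sizeof(int) bytes between accesses.

```c
#include <stddef.h>

#define N 256  /* assumed matrix size, chosen only for illustration */

/* Version 1: the inner loop walks along a row, so consecutive
   accesses touch adjacent addresses (row-major layout in C). */
long sum_row_major(int a[N][N]) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Version 2: the inner loop walks down a column, so consecutive
   accesses are N * sizeof(int) bytes apart. */
long sum_col_major(int a[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same value; any difference shows up only in run time, and how large it is depends on the cache parameters of the machine.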
Example 2 (1/2)
Version 1:
    for (i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }
Example 2 (2/2)
Version 2:
    for (i = 0; i < size; i += 2) {
        action_even();
    }
    for (i = 1; i < size; i += 2) {
        action_odd();
    }
Keyword: Branch predictor and pipeline
Processor architecture overview
Von Neumann Architecture
• Input -> process -> output model
• Integrated Instruction Memory and Data Memory
Basic components of x86 CPU
[Figure: block diagram of an x86 CPU with the following components]
• Register file: %eax, %ecx, %edx, %ebx, %esp, %ebp, %esi, %edi
• Status registers: Zero flag, Sign flag, Overflow flag, Carry flag
• Program counter: %eip
• CPU pipeline: Fetch → Decode → Execution Units → Commit
• Cache memory ($)
• Memory (external)
Register file
• What is a register?
A simple memory element (e.g., built from edge-triggered flip-flops)
Register file
• A collection of registers
  – 8 registers are visible to the programmer
• In fact, many more registers are hidden and used for other purposes.
  e.g., there are 168 registers in Intel’s Haswell
Program counter
• Holds the address of the instruction that the processor should execute in the next cycle.
• %eip is the name of the program counter register in x86.
• The naming convention differs between ISAs (Instruction Set Architectures).
Status registers
• Also a collection of registers
• Boolean registers that represent the processor’s status
• Used to evaluate conditions
• Flags: Zero flag, Sign flag, Overflow flag, Carry flag
Memory
• Main memory, usually DRAM
• In the Von Neumann architecture, instructions (code) and data reside in the same memory module
CPU pipeline
• Where actual operation occurs
• Details will be explained later
[Figure: CPU pipeline: Fetch → Decode → Execution Units → Commit]
Instruction Set Architecture
Instruction Set Architecture (ISA)
• How you actually talk to a processor
Instruction Set Architecture (ISA)
• Mapping between assembly code and machine code
  – Which assembly instructions are included?
  – How are assembly instructions represented as byte codes?
Instruction
• A command that makes the processor perform specific task(s)
  – Ex1) mov 4(%eax), %esp (x86) -> move the data at address (%eax) + 4 to %esp
  – Ex2) irmovl $256, %eax (Y86) -> store the value 256 in the register %eax
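Representation of instructions
• Instructions are represented in byte codes
  – pushl %ebx => 0xa03f
  – irmovl $256, %eax => 0x30f000010000
• Encodings (byte offsets in parentheses; x = unused register field, written as f in the examples above):
  pushl rA:      A 0 (0) | rA x (1)
  popl rA:       B 0 (0) | rA x (1)
  irmovl V, rB:  3 0 (0) | x rB (1) | Immediate value (2-5)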
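The byte layouts above can be sketched as two small encoder functions. This is a minimal illustration, assuming the standard Y86 register numbering (%eax = 0, %ecx = 1, %edx = 2, %ebx = 3, %esp = 4, %ebp = 5, %esi = 6, %edi = 7), which the slides do not list; the names encode_pushl and encode_irmovl are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed Y86 register numbering; REG_NONE marks an unused field. */
enum { REG_EAX = 0, REG_ECX = 1, REG_EDX = 2, REG_EBX = 3,
       REG_ESP = 4, REG_EBP = 5, REG_ESI = 6, REG_EDI = 7,
       REG_NONE = 0xF };

/* pushl rA: byte 0 = A0, byte 1 = rA in the high nibble, F in the low. */
size_t encode_pushl(uint8_t ra, uint8_t out[2]) {
    out[0] = 0xA0;
    out[1] = (uint8_t)((ra << 4) | REG_NONE);
    return 2;
}

/* irmovl V, rB: byte 0 = 30, byte 1 = F:rB, bytes 2-5 = V, little-endian. */
size_t encode_irmovl(uint32_t imm, uint8_t rb, uint8_t out[6]) {
    out[0] = 0x30;
    out[1] = (uint8_t)((REG_NONE << 4) | rb);
    for (int k = 0; k < 4; k++)
        out[2 + k] = (uint8_t)(imm >> (8 * k));
    return 6;
}
```

Calling encode_irmovl(256, REG_EAX, buf) fills buf with 30 f0 00 01 00 00, which matches the irmovl byte code shown on the slide.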
CISC vs RISC
CISC(Y86) RISC(sMips)
CISC
• Basic idea: give programmers powerful instructions, so that fewer instructions are needed to complete a task
• One instruction can do several pieces of work
• A lot of instructions! (over 300 in x86)
• Many instructions can access memory
• Variable instruction length
RISC
• Basic idea: write complex programs using simple instructions
• Each instruction does only one task
• Small instruction set (about 100 in MIPS)
• Only load and store instructions can access memory
• Fixed instruction length
RISC example: SMIPs ISA
Instruction formats
• Only three formats, but the fields are used differently by different types of instructions
  (field widths in bits are given in parentheses)
  R-type:  opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)
  I-type:  opcode (6) | rs (5) | rt (5) | immediate (16)
  J-type:  opcode (6) | target (26)
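Because all three formats are 32 bits wide with fields at fixed positions, a decoder can slice every field out up front and let the opcode decide which ones are meaningful. The sketch below assumes the usual MIPS bit positions (opcode in bits 31-26, and so on); the struct and function names are hypothetical.

```c
#include <stdint.h>

/* All fields of the three formats, extracted unconditionally. */
typedef struct {
    uint8_t  opcode;  /* bits 31-26, all formats  */
    uint8_t  rs;      /* bits 25-21, R- and I-type */
    uint8_t  rt;      /* bits 20-16, R- and I-type */
    uint8_t  rd;      /* bits 15-11, R-type        */
    uint8_t  shamt;   /* bits 10-6,  R-type        */
    uint8_t  func;    /* bits 5-0,   R-type        */
    uint16_t imm;     /* bits 15-0,  I-type        */
    uint32_t target;  /* bits 25-0,  J-type        */
} Fields;

Fields split_fields(uint32_t inst) {
    Fields f;
    f.opcode = (uint8_t)((inst >> 26) & 0x3F);
    f.rs     = (uint8_t)((inst >> 21) & 0x1F);
    f.rt     = (uint8_t)((inst >> 16) & 0x1F);
    f.rd     = (uint8_t)((inst >> 11) & 0x1F);
    f.shamt  = (uint8_t)((inst >> 6) & 0x1F);
    f.func   = (uint8_t)(inst & 0x3F);
    f.imm    = (uint16_t)(inst & 0xFFFF);
    f.target = inst & 0x03FFFFFFu;
    return f;
}
```

Fixed field positions are what make this slicing trivial; a variable-length ISA such as Y86 has to look at the first byte before it knows how many more bytes to read.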
Instruction formats
• Computational instructions
  R-type:  0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)      rd ← (rs) func (rt)
  I-type:  opcode (6) | rs (5) | rt (5) | immediate (16)            rt ← (rs) op immediate
• Load/Store instructions
  opcode (6, bits 31-26) | rs (5, bits 25-21) | rt (5, bits 20-16) | displacement (16, bits 15-0)
  – Addressing mode: (rs) + displacement
  – rs is the base register
  – rt is the destination of a Load or the source for a Store
Control instructions
• Conditional (on GPR) PC-relative branch: BEQZ, BNEZ
  opcode (6) | rs (5) | unused (5) | offset (16)
  – target address = (offset in words) × 4 + (PC + 4)
  – range: 128 KB
• Unconditional register-indirect jumps: JR, JALR
  opcode (6) | rs (5) | unused (5) | unused (16)
• Unconditional absolute jumps: J, JAL
  opcode (6) | target (26)
  – target address = {PC<31:28>, target × 4}
  – range: 256 MB
• Jump-and-link stores PC+4 into the link register (R31)
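The two target-address formulas can be written out directly. This is a minimal sketch following the formulas on the slide; the function names are hypothetical, and the 16-bit offset is assumed to be a signed count of words.

```c
#include <stdint.h>

/* PC-relative branch: target = (offset in words) * 4 + (PC + 4).
   The 16-bit offset is sign-extended before scaling. */
uint32_t branch_target(uint32_t pc, uint16_t offset16) {
    int32_t words = (offset16 & 0x8000)
                        ? (int32_t)offset16 - 0x10000
                        : (int32_t)offset16;
    return (pc + 4) + (uint32_t)(words * 4);
}

/* Absolute jump: target = {PC<31:28>, target26 * 4}.
   The top 4 bits come from the PC, the rest from the instruction. */
uint32_t jump_target(uint32_t pc, uint32_t target26) {
    return (pc & 0xF0000000u) | ((target26 & 0x03FFFFFFu) << 2);
}
```

A signed 16-bit word offset covers 2^16 words of 4 bytes around the branch, and a 26-bit word target covers 2^28 bytes, in line with the stated ranges.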
CISC example: Y86 ISA
Instruction formats
  1 byte:   iCd iFun (byte 0)
  2 bytes:  iCd iFun (byte 0) | rA rB (byte 1)
  5 bytes:  iCd iFun (byte 0) | Destination (bytes 1-4)
  6 bytes:  iCd iFun (byte 0) | rA rB (byte 1) | Immediate/Offset (bytes 2-5)
iCd: instruction code; iFun: function code; rA, rB: register index
1-byte instructions: halt, nop
• halt (encoding: 0 0): used as a sign of program termination
  - Changes processor state to halt (HLT)
• nop (encoding: 1 0): no operation. Used as a bubble.
2-byte instruction: OPl
• OPl (encoding: 6 Op | rA rB): performs one of 4 basic ALU operations: add, sub, and, xor
  - R[rB] <- R[rB] Op R[rA]
  - Condition flags are set depending on the result.
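The OPl semantics can be sketched as a function that updates the register array and returns the condition flags. This assumes the standard Y86 function codes (add = 0, sub = 1, and = 2, xor = 3) and the three Y86 flags ZF, SF, and OF; the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { OP_ADD = 0, OP_SUB = 1, OP_AND = 2, OP_XOR = 3 } AluOp;
typedef struct { bool zf, sf, of; } Flags;

/* Performs R[rB] <- R[rB] Op R[rA] and returns the resulting flags. */
Flags opl(AluOp op, uint32_t reg[8], int ra, int rb) {
    uint32_t a = reg[ra], b = reg[rb], r;
    Flags f = { false, false, false };
    switch (op) {
    case OP_ADD:
        r = b + a;
        /* signed overflow: operands share a sign, result does not */
        f.of = ((~(a ^ b) & (a ^ r)) >> 31) != 0;
        break;
    case OP_SUB:
        r = b - a;
        /* signed overflow: operands differ in sign, result differs from b */
        f.of = (((a ^ b) & (b ^ r)) >> 31) != 0;
        break;
    case OP_AND:
        r = b & a;
        break;
    default: /* OP_XOR */
        r = b ^ a;
        break;
    }
    f.zf = (r == 0);          /* zero flag */
    f.sf = (r >> 31) != 0;    /* sign flag: top bit of the result */
    reg[rb] = r;
    return f;
}
```

A conditional jump that follows the operation only needs to read these flags, which is why a subl is typically placed right before a je.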
5-byte instruction: call
• call dest (encoding: 8 0 | Destination in bytes 1-4)
  - R[esp] <- R[esp] - 4 (Update the stack pointer; move stack top)
  - M[R[esp]] <- pc + 5 (Store the return address on the stack top)
  - pc <- Destination (Jump to Destination address)
6-byte instructions: rmmovl, mrmovl
• rmmovl rA, Offset(rB) (encoding: 4 0 | rA rB | Offset in bytes 2-5): Store
  - target address = R[rB] + offset
  - M[target address] <- R[rA]
• mrmovl Offset(rB), rA (encoding: 5 0 | rA rB | Offset in bytes 2-5): Load
  - source address = R[rB] + offset
  - R[rA] <- M[source address]
Processor Architecture
Simple processor architecture
Simplified version (a lot..)
[Figure: a large block of sequential logic, driven by a clock, that loads program code from Memory, stores data to Memory, and outputs register values]
Sequential design
[Figure: Fetch → Decode → Execution Units → Commit, connected to Memory, the Register File, and %EIP]
Fetch unit
[Figure: the Fetch unit connected to %EIP and Memory]
1) Get the PC
2) Request the next instruction
3) Get the next instruction
4) Update the PC
5) Pass the instruction (byte code) on to the next stage
Decode unit (1/2)
1) Slice the input instruction into its fields: iCd, fCd, rA, rB, imm
2) The Decode combinational logic fills the information structure for execution:
   Inst Type, Target Register A, Target Register B, Immediate value, Register value A, Register value B, … (depends on ISA)
Decode unit (2/2)
3) Read register values: the RegisterRead logic takes the information structure (Inst Type, Target Register A, Target Register B, Immediate value, Register value A, Register value B, …, depending on the ISA), reads the Register File, and outputs the Decoded Instruction
Execution unit (1/2)
Input: Inst Type, Target Register A, Target Register B, Immediate value, Register value A, Register value B
1) Select the inputs for the ALU
2) Perform the appropriate ALU operation
3) Using the ALU result, the Execute combinational logic fills the information structure for the memory & register update
Output (Executed Instruction): Inst Type, Memory Data, Register Data, Target Register, Memory Addr
Execution unit (2/2)
4) The Memory Operation Logic performs memory operations (Ld, St) on Memory, using the Executed Instruction (Inst Type, Memory Data, Register Data, Target Register, Memory Addr)
5) Update the field (if a load instruction was executed), producing the updated Executed Instruction
Commit unit
[Figure: the Register Update Logic takes the Executed Instruction (Inst Type, Memory Data, Register Data, Target Register, Memory Addr) and writes to the Register File]
Single-cycle processor example: SMIPs
Single-Cycle SMIPS
[Figure: datapath: PC (with +4 increment) → Inst Memory → Decode → Register File → Execute → Data Memory]
• The register file has 2 read ports & 1 write port
• Separate instruction & data memories
• SMIPs instructions are all 4 bytes long
Single-Cycle SMIPS
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;

      rule doProc;
        let inst = iMem.req(pc);
        let dInst = decode(inst);
        let rVal1 = rf.rd1(validRegValue(dInst.src1));
        let rVal2 = rf.rd2(validRegValue(dInst.src2));
        let eInst = exec(dInst, rVal1, rVal2, pc);
        if (eInst.iType == Ld)
          eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
        else if (eInst.iType == St)
          let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
        if (isValid(eInst.dst))
          rf.wr(validRegValue(eInst.dst), eInst.data);
        pc <= eInst.brTaken ? eInst.addr : pc + 4;
      endrule
    endmodule
Single-Cycle SMIPS: step by step
[Figure, repeated on each step: datapath with PC (+4), Inst Memory, Decode, Register File, Execute, Data Memory]
• Declaration of components
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;
• Instruction fetch
    rule doProc;
      let inst = iMem.req(pc);
• Decode
      let dInst = decode(inst);
• Register read
      let rVal1 = rf.rd1(validRegValue(dInst.src1));
      let rVal2 = rf.rd2(validRegValue(dInst.src2));
• Execute
      let eInst = exec(dInst, rVal1, rVal2, pc);
• Memory access
      if (eInst.iType == Ld)
        eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
      else if (eInst.iType == St)
        let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
• Register write
      if (isValid(eInst.dst))
        rf.wr(validRegValue(eInst.dst), eInst.data);
• PC update
      pc <= eInst.brTaken ? eInst.addr : pc + 4;
Improve processor performance: Pipelining
Pipelining
• Apply the idea of a conveyor belt to instruction processing
Pipelining
[Figure: Fetch → Decode → Execution Units → Commit, with a FIFO or register between adjacent stages; the stages connect to Memory, the Register File, and %EIP]
Pipelining
• In this case, 4 instructions are executing at the same time
[Figure: pipeline with one instruction in each of Fetch, Decode, Execution Units, and Commit]
Pipelining
• Is the throughput the same?
  – Sequential: 1 instruction / 1 cycle
  – Pipelined: 4 instructions / 4 cycles
• No, a pipelined design allows a faster clock
• Because less work is done per cycle, we can use a shorter clock period
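The clock-period argument can be made concrete with a small timing model. The sketch below is only an illustration under stated assumptions: stage delays are given in nanoseconds, the sequential design needs one long cycle equal to the sum of all stage delays, and the pipelined design needs a cycle equal to the slowest stage plus a pipeline-register overhead. All names and numbers are hypothetical.

```c
#include <stddef.h>

/* Sequential design: every instruction takes one cycle whose length
   is the sum of all stage delays. */
double sequential_time_ns(const double *stage_ns, size_t stages,
                          long n_insts) {
    double cycle = 0.0;
    for (size_t s = 0; s < stages; s++)
        cycle += stage_ns[s];
    return cycle * (double)n_insts;
}

/* Pipelined design: the cycle is set by the slowest stage plus the
   register overhead; the first instruction needs `stages` cycles and
   each later one completes one cycle after its predecessor. */
double pipelined_time_ns(const double *stage_ns, size_t stages,
                         double reg_overhead_ns, long n_insts) {
    double slowest = 0.0;
    for (size_t s = 0; s < stages; s++)
        if (stage_ns[s] > slowest)
            slowest = stage_ns[s];
    double cycle = slowest + reg_overhead_ns;
    return cycle * (double)((long)stages + n_insts - 1);
}
```

With four 2 ns stages and 0.5 ns of register overhead, 1000 instructions take 8000 ns sequentially and 2507.5 ns pipelined in this model, ignoring hazards.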
Pipeline hazard: control hazard
Control hazard (1/5)
• Which instruction will this assembly code execute after the branch?
        mov  %ecx, %ebx
        subl %eax, (%ebx)
        je   B
    A:  addl %ecx, %edi
    B:  leave
        ret
• We don’t know the branch condition until the code runs
Control hazard (2/5)
[Figure: pipeline (Fetch → Decode → Execution Units → Commit) with Memory, Register File, and %EIP. Execution flow: mov %ecx, %ebx; subl %eax, (%ebx); je B]
• What’s next? addl? leave?
Control hazard (3/5)
• Which instruction will this assembly code execute after the branch?
        mov  %ecx, %ebx
        subl %eax, (%ebx)
        je   B
    A:  addl %ecx, %edi
        mov  %edi, %eax
    B:  leave
        ret
• Since we can’t know the future, fetch the instruction at the next sequential position
Control hazard (4/5)
[Figure: pipeline (Fetch → Decode → Execution Units → Commit) with Memory, Register File, and %EIP. Execution flow: subl %eax, (%ebx); je B; addl %ecx, %edi]
• What if the jump is taken?
Control hazard (5/5)
[Figure: pipeline (Fetch → Decode → Execution Units → Commit) with Memory, Register File, and %EIP. Execution flow: je B; addl %ecx, %edi; mov %edi, %eax]
• The wrong instructions are in the pipeline
Control hazard - analysis
• We must discard some instructions when we mispredict the branch direction.
• The longer the pipeline, the more instructions must be discarded on a branch misprediction.
Epoch method
• Add an epoch register to the processor state
• The Execute stage changes the epoch whenever the pc prediction is wrong and sets the pc to the correct value
• The Fetch stage associates the current epoch with every instruction when it is fetched
• The epoch of the instruction is examined when it is ready to execute. If the processor epoch has changed, the instruction is thrown away
[Figure: Fetch (PC, iMem, pred) passes inst through f2d to Execute; Execute sends targetPC back to the PC; both stages use the Epoch register]
Epoch method - Summary
• Add a ‘tag’ to each instruction
• There are two tag machines: one in Fetch and one in Execute
• If a tag machine recognizes that something went wrong, it changes the tag it tests against
Epoch method example (SMIPs)
[Figure: pipelined datapath: PC (+4) → Inst Memory → f2d FIFO → Decode → Register File → Execute → Data Memory, with a redirect FIFO from Execute back to Fetch; fEpoch sits in Fetch and eEpoch in Execute]
• Execute sends information about the target pc to Fetch, which updates fEpoch and pc whenever it looks at the redirect PC FIFO
Epoch method example (SMIPs)
    rule doFetch;
      let inst = iMem.req(pc);
      let ppc = nextAddr(pc);
      pc <= ppc;
      f2d.enq(Fetch2Decode{pc: pc, ppc: ppc, epoch: epoch, inst: inst});
    endrule

    rule doExecute;
      let x = f2d.first;
      let inpc = x.pc;
      let inEp = x.epoch;
      let ppc = x.ppc;
      let inst = x.inst;
      if (inEp == epoch) begin
        let dInst = decode(inst);
        ... register fetch ...;
        let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);
        ... memory operation ...
        ... rf update ...
        if (eInst.mispredict) begin
          pc <= eInst.addr;
          epoch <= inEp + 1;
        end
      end
      f2d.deq;
    endrule
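The epoch mechanism can also be modeled in plain C as a cycle-by-cycle loop. This is a simplified sketch, not the Bluespec design itself: it assumes a two-stage pipeline with a one-entry f2d buffer, instructions addressed by index, and a fetch stage that always predicts the next sequential instruction. The types Inst and Stats and the function run_epoch_pipeline are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool   taken_branch;  /* true if this instruction is a taken branch */
    size_t target;        /* branch target, as an instruction index */
} Inst;

typedef struct { size_t executed, squashed; } Stats;

Stats run_epoch_pipeline(const Inst *prog, size_t n) {
    size_t pc = 0;
    unsigned epoch = 0;
    bool f2d_valid = false;   /* one-entry buffer between the stages */
    size_t f2d_pc = 0;
    unsigned f2d_epoch = 0;
    Stats s = { 0, 0 };

    while (pc < n || f2d_valid) {
        size_t next_pc = pc + 1;      /* prediction: next sequential */
        unsigned next_epoch = epoch;

        /* Execute: act only on instructions from the current epoch. */
        if (f2d_valid) {
            if (f2d_epoch == epoch) {
                const Inst *in = &prog[f2d_pc];
                s.executed++;
                if (in->taken_branch && in->target != f2d_pc + 1) {
                    next_pc = in->target;    /* redirect fetch */
                    next_epoch = epoch + 1;  /* invalidate wrong path */
                }
            } else {
                s.squashed++;                /* stale epoch: discard */
            }
        }

        /* Fetch: tag the fetched instruction with the current epoch. */
        f2d_valid = (pc < n);
        f2d_pc = pc;
        f2d_epoch = epoch;

        pc = next_pc;
        epoch = next_epoch;
    }
    return s;
}
```

For a four-instruction program whose second instruction branches over the third, the model executes three instructions and squashes one: the instruction fetched in the cycle when the branch was resolved.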
Branch Prediction
Need to predict next PC
• We must fetch an instruction every cycle
• But as we saw in the control hazard part, we can’t know exactly which instruction comes next
• So we must predict the next instruction
How to predict next PC?
• We can rely on history
  – Record the history of past branches and make use of it
2-bit counter branch predictor
[Figure: state diagram with four states: Strongly not taken, Weakly not taken, Weakly taken, Strongly taken]
2-bit counter branch predictor
• Assume 2 BP bits per instruction
• Use a saturating counter: count up on taken, count down on ¬taken
  1 1  Strongly taken
  1 0  Weakly taken
  0 1  Weakly ¬taken
  0 0  Strongly ¬taken
• Direction prediction changes only after two successive bad predictions
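The 2-bit saturating counter fits in a few lines of C. This is a minimal sketch using the state encoding from the table (0 = strongly not taken, 3 = strongly taken); the names Counter2, predict_taken, and counter_update are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t Counter2;  /* holds a value in 0..3 */

/* Predict taken in the two upper states (binary 10 and 11). */
bool predict_taken(Counter2 c) {
    return c >= 2;
}

/* Count up on taken, down on not taken, saturating at 3 and 0. */
Counter2 counter_update(Counter2 c, bool taken) {
    if (taken)
        return (Counter2)(c < 3 ? c + 1 : 3);
    return (Counter2)(c > 0 ? c - 1 : 0);
}
```

Starting from strongly taken, one not-taken outcome only moves the counter to weakly taken, so the prediction flips after the second consecutive miss.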
Branch History Table (BHT)
[Figure: k bits of the Fetch PC (above the two low-order 0 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, which outputs Taken/¬Taken?; the instruction’s opcode answers Branch?, and its offset is added to the PC to form the Target PC]
• 4K-entry BHT, 2 bits/entry, ~80-90% correct direction predictions
• After decoding the instruction, if it turns out to be a branch, we can consult the BHT using the pc; if this prediction differs from the incoming predicted pc (from Fetch), we can redirect Fetch
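A BHT of this kind is an array of 2-bit counters indexed by low-order PC bits. The sketch below assumes 4-byte instructions (so the two lowest PC bits are dropped) and a 4K-entry table; the names Bht, bht_index, bht_predict, and bht_train are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS    12                  /* k = 12, i.e. 4K entries */
#define BHT_ENTRIES (1u << BHT_BITS)

typedef struct { uint8_t counter[BHT_ENTRIES]; } Bht;  /* 2-bit values */

/* Drop the two always-zero low bits, keep the next k bits. */
static uint32_t bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);
}

bool bht_predict(const Bht *t, uint32_t pc) {
    return t->counter[bht_index(pc)] >= 2;
}

/* Train the entry with the actual outcome (saturating at 0 and 3). */
void bht_train(Bht *t, uint32_t pc, bool taken) {
    uint8_t *c = &t->counter[bht_index(pc)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

Because only k bits of the PC are used, two branches whose addresses differ by a multiple of 4 × 2^k bytes share one entry and can disturb each other’s prediction.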
Let’s see the program example again
    for (i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }
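The branch on i % 2 alternates between taken and not taken on every iteration, which is a bad case for a single 2-bit counter. The sketch below counts mispredictions for a given outcome sequence under that simple model only; real predictors that also track branch history can behave very differently. The function name and the start_state parameter are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Feeds a sequence of branch outcomes through one 2-bit saturating
   counter (0 = strongly not taken, 3 = strongly taken) and returns
   how many predictions were wrong. */
size_t count_mispredictions(const bool *outcomes, size_t n,
                            unsigned start_state) {
    unsigned c = start_state & 3u;
    size_t wrong = 0;
    for (size_t k = 0; k < n; k++) {
        bool predicted = (c >= 2);
        if (predicted != outcomes[k])
            wrong++;
        if (outcomes[k]) {
            if (c < 3) c++;
        } else {
            if (c > 0) c--;
        }
    }
    return wrong;
}
```

In this model an alternating pattern is mispredicted at least half of the time, and every time if the counter starts in a weak state, whereas a loop branch that is taken seven times and then falls through is mispredicted once. Splitting the loop as in Example 2 (2/2) removes the alternating branch entirely.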
Pipeline hazard: data hazard
Data hazard by flow dependence
• Sometimes an instruction uses the result of an earlier instruction
    I1  addl %eax, %ebx
    I2  subl %ebx, %ecx
• I2 must wait until I1 updates the register file, so that I2 can see the result.
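Detecting this kind of flow dependence comes down to comparing the destination register of the older instruction with the source registers of the younger one. The sketch below is a minimal illustration; the RegUse struct, the convention that -1 means "no register", and the function name raw_hazard are all assumptions.

```c
#include <stdbool.h>

/* Registers an instruction reads (src1, src2) and writes (dst).
   A value of -1 means the field is unused. */
typedef struct { int src1, src2, dst; } RegUse;

/* True if `younger` reads a register that `older` has yet to write. */
bool raw_hazard(RegUse older, RegUse younger) {
    if (older.dst < 0)
        return false;
    return younger.src1 == older.dst || younger.src2 == older.dst;
}
```

With %eax = 0, %ecx = 1, and %ebx = 3, I1 is {0, 3, 3} and I2 is {3, 1, 1}, so raw_hazard(I1, I2) is true: the pipeline must either stall I2 or forward I1’s result to it.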
Dealing with data hazards
• We can wait until the desired value is updated
  – Stall method
• Or, we can send the value directly
  – Bypass method
Pipeline stalling example
Data bypass example
Improve processor performance: Cache Memory
Memory operations are a bottleneck!
• Memory transfer rate of DDR3 RAM
  – Peak transfer rate: 6400 MB/s
• Assume that we have a single-core processor with a clock speed of 3.0 GHz, and that we process one word (32 bits) every cycle.
  – We then process approximately 12 GB/s, almost twice the peak memory transfer rate
    • 3.0 GHz × 32 bits (word size) / 8 bits per byte = 12 GB/s
Memory hierarchy
Locality of reference
• Temporal locality
  – If a value is used, it is likely to be used again soon.
        mov  $100, %ecx
        mov  $Array, %ebx     // %ebx = &Array
        xorl %eax, %eax       // %eax = 0
    Loop:
        mov  (%ebx), %esi     // %esi = *(%ebx)
        addl %esi, %eax       // %eax += %esi
        addl $4, %ebx         // %ebx += 4
        subl $1, %ecx         // %ecx = %ecx - 1
        jne  Loop             // repeat until %ecx reaches 0
    Fin:
        leave
        ret
Locality of reference
• Spatial locality
  – If a value is used, nearby values are likely to be used
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            sum += array[i][j];
        }
    }
Cache memory
• A buffer between processor and memory
  – Often several levels of caches
• Small but fast
  – Old values will be removed from the cache to make space for new values
• Capitalizes on spatial locality and temporal locality
• Parameters vary by system and are unknown to the programmer
Cache memory
• Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  – Hold frequently accessed blocks of main memory
• CPU looks first for data in L1, then in L2, then in main memory.
• Typical bus structure:
[Figure: the CPU chip contains the register file, ALU, L1 cache, and bus interface; the L2 cache attaches to the bus interface over the cache bus; the bus interface connects over the system bus to the I/O bridge, which connects over the memory bus to main memory]
![Page 92: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/92.jpg)
Structure of cache memory
• Basically, an array of memory elements
[Figure: each cache entry pairs an address tag (for example 100, 304, 6848, 416) with a data block of several data bytes; the entry tagged 100 holds a copy of main memory locations 100, 101, ...]
• How many bits are needed for the tag? Enough to uniquely identify the block.
![Page 93: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/93.jpg)
Read behavior
• Search the cache tags for a match with the processor-generated address
– Found in cache (a.k.a. HIT): return a copy of the data from the cache
– Not in cache (a.k.a. MISS): read the block of data from main memory (this may require writing back a cache line), then return the data to the processor and update the cache
• Which line do we replace?
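The hit/miss flow can be sketched in C for the simplest case, a direct-mapped cache, where the line to replace is fixed by the address. This is a minimal model under stated assumptions: NUM_LINES, BLOCK_BYTES, and all type and function names are hypothetical, and only tags are tracked (no data, no write-back).

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES   16u   /* hypothetical number of cache lines */
#define BLOCK_BYTES 16u   /* hypothetical block size in bytes */

typedef struct {
    bool     valid;
    uint32_t tag;
} CacheLine;

typedef struct {
    CacheLine lines[NUM_LINES];
} Cache;

/* Returns true on a hit. On a miss, models the fetch from main memory
   by installing the new tag in the line selected by the index. */
bool cache_read(Cache *c, uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t index = block % NUM_LINES;
    uint32_t tag   = block / NUM_LINES;
    CacheLine *line = &c->lines[index];

    if (line->valid && line->tag == tag)
        return true;      /* HIT: data would be returned from the cache */

    line->valid = true;   /* MISS: the old block in this line is replaced */
    line->tag   = tag;
    return false;
}
```

Starting from an empty cache, the first read of an address misses, a second read within the same block hits, and a read of another block with the same index evicts the first one.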
![Page 94: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/94.jpg)
Write behavior
• On a write hit
– Write-through: write to both the cache and the next-level memory
– Write-back: write only to the cache and update the next-level memory when the line is evicted
• On a write miss
– Allocate: because of multi-word lines, we first fetch the line and then update a word in it
– No allocate: the word is modified in memory only
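The difference between the two write-hit policies shows up in how often the next-level memory is written. The sketch below models a single cache line and only counts memory writes; the Line type and the three function names are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    bool     dirty;        /* set when the cached copy differs from memory */
    unsigned mem_writes;   /* number of writes sent to the next-level memory */
} Line;

/* Write-through: every write hit also goes to the next-level memory. */
void write_through_hit(Line *l)
{
    l->mem_writes++;
}

/* Write-back: a write hit only marks the line dirty. */
void write_back_hit(Line *l)
{
    l->dirty = true;
}

/* Eviction: a dirty line is written to memory once before it is dropped. */
void evict(Line *l)
{
    if (l->dirty) {
        l->mem_writes++;
        l->dirty = false;
    }
}
```

Ten write hits to the same line followed by an eviction cost ten memory writes under write-through but only one under write-back.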
![Page 95: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/95.jpg)
Direct mapped cache
![Page 96: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/96.jpg)
Cache line size
• A cache line usually holds more than one word (32 bits)
– Reduces the number of tags and the tag size needed to identify memory locations
– Exploits spatial locality
![Page 97: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/97.jpg)
Types of misses
• Compulsory misses (cold start): a cold fact of life!
– First time data is referenced
– Become insignificant once billions of instructions have run
• Capacity misses
– Working set is larger than the cache size
– Solution: increase cache size
• Conflict misses
– Usually multiple memory locations are mapped to the same cache location to simplify implementation
– Thus the designated cache location may be full while there are empty locations elsewhere in the cache
– Solution: set-associative caches
![Page 98: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/98.jpg)
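Direct-mapped cache
[Figure: the requested address is split into a block number (tag, t bits; index, k bits) and a block offset (b bits). The index selects one of 2^k lines, each holding a valid bit (V), a tag, and a data block. The stored tag is compared with the address tag to signal HIT, and the offset selects the data word or byte.]
• What is a bad reference pattern? A stride equal to the size of the cache.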
![Page 99: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/99.jpg)
Addressing example
• Bitwise truncation
• Requested address: 00000100001100 00000100001110 01 00
– Tag = 268 (00000100001100)
– Index = 270 (00000100001110)
– Block offset = 1 (01)
• Go to cache entry 270 and compare the tag
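Bitwise truncation amounts to shifts and masks. The sketch below assumes the field widths suggested by the example address: a 14-bit tag, a 14-bit index, a 2-bit block (word) offset, and 2 low bits read here as a byte offset within the word. The macro and function names are hypothetical, and the reading of the last two bits is an assumption.

```c
#include <stdint.h>

/* Field widths assumed from the 32-bit example address:
   14-bit tag | 14-bit index | 2-bit block offset | 2-bit byte offset. */
#define BYTE_BITS    2u
#define OFFSET_BITS  2u
#define INDEX_BITS  14u

/* Word position inside the cache line. */
uint32_t block_offset(uint32_t addr)
{
    return (addr >> BYTE_BITS) & ((1u << OFFSET_BITS) - 1u);
}

/* Cache entry selected by the address. */
uint32_t cache_index(uint32_t addr)
{
    return (addr >> (BYTE_BITS + OFFSET_BITS)) & ((1u << INDEX_BITS) - 1u);
}

/* Remaining high-order bits, compared against the stored tag. */
uint32_t cache_tag(uint32_t addr)
{
    return addr >> (BYTE_BITS + OFFSET_BITS + INDEX_BITS);
}
```

The bit pattern in the example is 0x043010E4 in hexadecimal; for that address cache_index returns 270 and block_offset returns 1.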
![Page 100: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/100.jpg)
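Address selection
[Figure: a direct-mapped cache in which the index (k bits) is taken from the high-order address bits rather than from the bits just above the block offset (b bits). Each of the 2^k lines holds a valid bit (V), a tag (t bits), and a data block; a tag match signals HIT and the offset selects the data word or byte.]
• Why might this be undesirable? Spatially local blocks conflict.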
![Page 101: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/101.jpg)
Reduce conflict misses
• Memory time = Hit time + Prob(miss) * Miss penalty
• Associativity: allow a block to go to any of several lines within its set
– 2-way set associative: each block maps to either of the 2 lines in its set
– Fully associative: each block maps to any cache frame
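The memory-time formula translates directly into code. The function name memory_time and the numbers in the usage note are hypothetical, chosen only to show how the miss probability drives the result.

```c
/* Average memory access time, in the same unit as hit_time and
   miss_penalty (for example, clock cycles). miss_rate is Prob(miss),
   a value between 0.0 and 1.0. */
double memory_time(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

With an assumed 1-cycle hit time and 100-cycle miss penalty, a miss rate of 0.05 gives about 6 cycles per access and a miss rate of 0.02 gives about 3, which is why removing conflict misses matters.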
![Page 102: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/102.jpg)
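2-way Set-Associative cache
[Figure: the index (k bits) selects one set. Each of the two ways holds a valid bit (V), a tag (t bits), and a data block. Both stored tags are compared with the address tag in parallel; a match in either way signals HIT, and the block offset (b bits) selects the data word or byte.]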
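A minimal C model of a 2-way set-associative lookup is sketched below. NUM_SETS, BLOCK_BYTES, and all names are hypothetical; only tags are tracked, and replacement inside a set uses a single least-recently-used marker, which is one possible policy rather than a required one.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS    8u    /* hypothetical number of sets */
#define BLOCK_BYTES 16u   /* hypothetical block size in bytes */

typedef struct {
    bool     valid[2];
    uint32_t tag[2];
    unsigned lru;         /* way that will be replaced on the next miss */
} Set;

typedef struct {
    Set sets[NUM_SETS];
} Cache2Way;

/* Returns true on a hit; on a miss, installs the tag in the LRU way. */
bool cache2_read(Cache2Way *c, uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;
    Set *s = &c->sets[block % NUM_SETS];
    uint32_t tag = block / NUM_SETS;

    for (unsigned way = 0; way < 2; way++) {
        if (s->valid[way] && s->tag[way] == tag) {
            s->lru = 1u - way;   /* the other way is now least recently used */
            return true;
        }
    }

    unsigned victim = s->lru;
    s->valid[victim] = true;
    s->tag[victim]   = tag;
    s->lru = 1u - victim;
    return false;
}
```

Two blocks that share an index can now stay in the cache together: after one miss each, alternating reads of the two blocks all hit, whereas a direct-mapped cache would miss every time.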
![Page 103: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/103.jpg)
Replacement policy
• In order to bring in a new cache line, usually another cache line has to be thrown out. Which one?
– No choice in replacement if the cache is direct mapped
• Replacement policy for set-associative caches
– One that is not dirty, i.e., has not been modified
• In the I-cache all lines are clean
• In the D-cache, if a dirty line has to be thrown out, it must be written back first
– Least recently used?
– Most recently used?
– Random?
• How much is performance affected by the choice? Difficult to know without quantitative measurements.
![Page 104: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/104.jpg)
Implementing LRU
• We need time stamps for all lines
• We also need to compare the time stamps!
– Log-scale overhead
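True LRU with time stamps can be sketched as follows. WAYS, LruSet, lru_touch, and lru_victim are hypothetical names; the loop compares the stamps one after another, whereas hardware would need comparator logic to do the same job.

```c
#include <stdint.h>

#define WAYS 8u   /* hypothetical associativity */

typedef struct {
    uint64_t last_used[WAYS];  /* time stamp of the most recent reference */
    uint64_t clock;            /* incremented on every reference */
} LruSet;

/* Record a reference to the given way. */
void lru_touch(LruSet *s, unsigned way)
{
    s->last_used[way] = ++s->clock;
}

/* The victim is the way with the oldest (smallest) time stamp. */
unsigned lru_victim(const LruSet *s)
{
    unsigned victim = 0;
    for (unsigned way = 1; way < WAYS; way++)
        if (s->last_used[way] < s->last_used[victim])
            victim = way;
    return victim;
}
```

Every line carries a full time stamp and every replacement compares all of them, which is the cost that motivates cheaper approximations.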
![Page 105: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/105.jpg)
Pseudo LRU example
• So use pseudo LRU instead of true LRU
• We'll use an 8-way set-associative cache
• Use three history bits
• If a line is referenced, record the following code in the history bits
– Line 0: 000
– Line 1: 001
– Line 2: 010
– Line 3: 011
…
![Page 106: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/106.jpg)
Pseudo LRU example
![Page 107: Introduction to Processor Architecture](https://reader030.vdocuments.mx/reader030/viewer/2022012900/5681658a550346895dd84faf/html5/thumbnails/107.jpg)
Pseudo LRU example