
When Corona interrupted, we had finished the software part of the class including procedures, stack frames (activation records), data structures, and recursion. Most recently we completed the material on translation of MIPS code to binary. Both these topics have good review links on the course web page:

Methods in MIPS - Recursive Factorial and StackTrace
MIPS_Translation_Examples

We are now up to Assemblers. Assemblers are programs that convert MIPS code to binary executable code. Effectively, an Assembler assembles an assembly language program to a machine language program. Assemblers are much simpler than compilers. Most of what they do is what we did by hand: translate each MIPS instruction to binary. It is easiest to do this by scanning through the code twice. This is called a 2-pass assembler. And, it runs in linear time. You can try to write an assembler using one pass, but that makes the program harder and it runs in O(n^2). The problem with a one-pass assembler is that every time it encounters a label that it hasn’t yet seen (a forward reference), it has to look ahead and count how far away the label is from the current instruction. In the worst case, this means scanning ahead through the entire program at every line, hence the potential O(n^2). The two-pass assembly process avoids this issue. See sections 7.2.2 and 7.2.3 of http://users.ece.utexas.edu/~patt/05f.360N/handouts/360n_ch07.pdf for details, or you can check Appendix A (sections 1-4) of our text.

Pass One: Find all labels and store their names, along with their actual addresses, in an array called a symbol table.

Pass Two: Translate each instruction like we did in class. If you see a label reference, look it up in the symbol table to calculate the offset from the current instruction. See section 7.2.3 of the previous link.
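To make the two passes concrete, here is a minimal sketch in Python. It assumes a made-up, greatly simplified text format (just labels, add, and beq), not the CLO format of assignment 5 and not real MIPS encodings; it is only meant to show the shape of the symbol-table pass followed by the translation pass.

def pass_one(lines):
    # Pass 1: record the address of every label in the symbol table.
    symbol_table = {}
    address = 0
    for line in lines:
        line = line.strip()
        if line.endswith(":"):            # a label definition such as "loop:"
            symbol_table[line[:-1]] = address
        elif line:                        # every real instruction occupies one word
            address += 4
    return symbol_table

def pass_two(lines, symbol_table):
    # Pass 2: translate each instruction; label references become offsets.
    output = []
    address = 0
    for line in lines:
        line = line.strip()
        if not line or line.endswith(":"):
            continue
        op, *args = line.replace(",", " ").split()
        if op == "beq":                   # branch offset is relative to PC+4, in words
            offset = (symbol_table[args[2]] - (address + 4)) // 4
            output.append((op, args[0], args[1], offset))
        else:                             # other instructions are left symbolic in this sketch
            output.append((op, *args))
        address += 4
    return output

program = ["loop:", "add $3, $4, $5", "beq $3, $6, loop"]
print(pass_two(program, pass_one(program)))   # the beq offset resolves to -2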

In assignment 5, you are asked to write an assembler for a CLO accumulator-style machine language. You can write the assembler in the language of your choice -- Java -- whatever, but it does not have to be in MIPS.

In MIPS, we assemble each program as though it starts at the same hex address 0x4000000. But this is unrealistic. After a program is assembled, it needs to be linked and loaded by the operating system so that it can be executed. It is the operating system’s job to put each program in RAM, at an actual starting address. Thus, a machine language program needs to be relocatable. Appendix A discusses all this.

Hardware: CPU, Pipelining, Memory


The rest of the class is all about the hardware of MIPS. This topic is extremely diagram focused and you will have to look at the various links on the webpage to understand it.

We first discuss a multi-cycle approach along with the finite state machine of its control unit. In this approach, every MIPS instruction executes in at most 5 cycles. The advantage of a multi-cycle approach is that we can avoid duplicate hardware that might be necessary with a single cycle. For example, with a multi-cycle approach, we can get by with a single ALU, which is used for different purposes depending on which of the five cycles we are in. At cycle 1 we use the ALU for incrementing the PC; at cycle 3 we use it for adding registers. If we executed each instruction in a single cycle, we would need two different ALUs, because one cannot do two things at the same time. This is also true for memory. With a multi-cycle approach, we need one RAM, but with a single cycle approach we need RAM that is reserved only for instructions and RAM that is used for data, because you cannot access RAM for different things in one cycle.

Furthermore, the 5-cycle approach is a natural lead-in to pipelining. So, you might wonder why I discuss the single-cycle approach at all. It is because when we introduce pipelining, a lot of the duplicated hardware that the single-cycle approach needs becomes necessary again, so seeing the hardware of the single-cycle approach is helpful for understanding the pipelined data-path.

End of Week 1

Let’s start with the multicycle approach and you will need to look at the two links:

 MultiCycleDataPath FiniteStateMachineMultiCycle

In class I would trace through with you what happens with each type of instruction: a simple add instruction, a branch instruction, a load, a store, and a jump. You will need to try this yourselves and ask me if you are confused. You must understand the encoding of each instruction in order to follow the data-path and control lines. The control lines are orange and the data lines are black.

The multicycle machine has one ALU used for different things in different cycles. It has one Memory unit, used for different purposes in different cycles.

Here are the five cycles with the book’s nicknames: IF, Reg, ALU, Mem, RegWrite.

Cycle 1: IF – Instruction Fetch: Fetch and increment PC by 4.

For every instruction, Memory is used to get the instruction, and ALU is used to add 4 to PC.

Cycle 2: Reg - Register Read, Branch Target Calculation Stage


The register unit is read, supplying the register values named in the instruction so they are ready for the ALU. These values are stored in buffer registers A and B so they are available to be used in later stages. These “intermediate buffer” registers (along with ALUOut) are going to be prominent when we switch to pipelining.

The ALU is used to calculate a possible Branch target address in case the instruction turns out to be a branch or jump instruction. This is done by taking the 16-bit offset from bits 0-15 of the instruction, sign-extending it to 32 bits, and adding it to the PC. The relevant control signals are: ALUSrcA = 0, ALUSrcB = 11, and ALUOp = 00. Note that the value of the ALU calculation is stored in a buffer register called ALUOut, where it can be used later.

Jump instructions could be done here, but in practice the FSM has them finish next cycle.

Cycle 3: ALU

The ALU is used for different things depending on the instruction. For arithmetic instructions, it takes registers and operates on them. For Loads/Stores, it adds an offset to a register and forms a target address in memory from which it will extract data in stage 4. For branches, it decides whether or not to branch by using the ALU to subtract one register from another. Branch instructions are done.

Cycle 4: Mem - Memory

Arithmetic instructions write their results to the Register unit. Arithmetic instructions are done. This stage could be called the RegWrite stage for arithmetic instructions. Loads/Stores access memory to read/write values. Store instructions are now done.

Cycle 5: RegWrite - Register Write

Only Loads are left, and they write their values to the register unit.

Note that there is no duplication of hardware. There is no waste, no muss, no fuss. Also note that different instructions take different numbers of cycles.

Branches: 3 cycles

Arithmetic: 4 cycles

Stores: 4 cycles

Loads: 5 cycles


Jump: 3 cycles

Some brief notes about control signals (the orange/amber lines):

1. The lines that control multiplexers are making a choice between 2 or 4 inputs to decide which one gets through. A 2-choice has one bit for control, and a 4-choice has two bits for control. These control lines are listed with = signs in the FSM.

2. If a control line is listed in the FSM without an equal sign – like PCWrite, it just means that it is asserted (1). Other control lines like IorD (Instruction or Data) are indicated 0 or 1.

3. The ALU-Op control line is 2 bits: 00 means add, 01 means subtract, and 10 means that the ALU should look at instruction bits 0-5 (the function field) to decide what operation to perform. For example, 00 is used for Branch Target calculation, PC increment, and memory address calculation, while 01 is used for the BEQ instruction, and 10 for any general ALU instruction.

4. Every control line in every stage has some value (0 or 1) but not all of these values are listed in the circles of the FSM. If a control line is not asserted, it is not written at all in the circle. If a control line to a multiplexer is irrelevant to the state, meaning it doesn’t matter what you set it to because that hardware is off, then it also won’t be listed in the circle.

To make sure you understand this data-path and control unit, you should follow the finite state machine through for each kind of instruction. I can do that with you in our online meeting, if you have questions. I cannot figure out how to do it easily in notes.

Here are the instructions I would normally do with you in class:

loop: add $3, $4, $5

lw $3, 16($10)

beq $3, $6, loop # the offset is -6

j back # assume back is 0x4000000a

It is important at this point to remember or recalculate how each instruction is encoded, otherwise you won’t know what data is where in the diagram.
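If you want to check your hand encodings, here is a small Python sketch that packs the standard MIPS R-type and I-type fields into a 32-bit word. The function names are made up, but the field layout is the usual opcode/rs/rt/rd/shamt/funct and opcode/rs/rt/immediate.

# Pack MIPS R-type and I-type fields into 32-bit words (standard field layout).
def r_type(opcode, rs, rt, rd, shamt, funct):
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def i_type(opcode, rs, rt, imm):
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# add $3, $4, $5  ->  opcode 0, rs=4, rt=5, rd=3, funct 0x20
print(hex(r_type(0, 4, 5, 3, 0, 0x20)))   # 0x851820

# lw $3, 16($10)  ->  opcode 0x23, rs=10 (base), rt=3 (destination), imm=16
print(hex(i_type(0x23, 10, 3, 16)))       # 0x8d430010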

To test your understanding, you should see if you can modify the diagram and/or the finite state machine to accommodate new instructions. See this link:

Practice Adding Instructions to MultiCycle Implementation


Read through the whole thing and do the exercises. It is really useful – and it takes work to make these things up. Doing the practice exercises is the only way you will know whether you get it, outside of chatting me up.

For example, consider the new swap instruction, swap $3, $2, which swaps the values in registers 3 and 2. As the problem suggests, this instruction is really stored as swap $3, $3, $2 in disguise. Here are some hints:

Stage   Action
1       Fetch
2       Read registers 3 and 2
3       Write to register 2 with data from register 3
4       Write to register 3 with data from register 2

You will need to add an expanded 4-way multiplexer for the register-data, and Rd and Rt are each sent to it with new wires. The finite state machine needs two new states, at levels 3 and 4 and a choice from level 2 for instruction SWAP. The first two states of the FSM are unchanged. The register-write multiplexer can be used as is.

The jal problem is easier. You need to store the value of the PC into register 31, and then set PC equal to BranchTarget. There is very little modification to the hardware and finite state machine.

Here is an additional kind of new instruction to try to add (it is equivalent to the LDI instruction in the practice link). It is a load instruction with a new address mode: double indirect.

lw $6, (12 ($5))

The meaning is to add 12 to the value in $5, then go to that address in memory and retrieve the data there. The data there is itself an address. You then go back to memory with that address, retrieve the data stored there, and finally put it into $6.

FWIW, this address mode is one of the standard ones in the 1970-80’s CISC style VAX assembly language of the Digital Equipment Corporation (DEC), which was acquired by Compaq in the late 1990s, and subsequently by HP.

End of Week 2

Now we will consider a single long cycle approach to the datapath. Note that now we have three ALUs: PC increment, Arithmetic, and BranchTarget; and two Memories: Instruction and Data. This is because now all the actions happen in one cycle, and you can’t use the same ALU to do three different things in one cycle.


Look here to see the differences: SingleCycleDataPath  (Also, Figure 4.17 in text)

Now you can look through this pdf to see the instructions executed on the SingleCycleDataPath. Look at Figures 4.19, 4.20, 4.21, and 4.24, for arithmetic, Load, Branch and Jump instructions respectively.

Performance Comparison Between Single-Cycle and Multi-Cycle Data Paths.

Let’s now consider a performance problem comparing the Single-Cycle versus the Multi-Cycle. Generally, a multicycle approach helps you tighten up the times by letting the shorter instructions end faster. In the single cycle approach, the clock needs to wait as long as the longest instruction and the other instructions have idle time during the cycle.

Let’s assume that the five stages which we labeled IF, Reg, ALU, Mem, RegWrite need 3, 2, 2, 3, and 2 ns, respectively. For the multicycle machine, this means the clock needs to tick at 3 ns, because that is the slowest (critical) state. For a single cycle machine, the cycle needs to wait for all the stages to occur, one after the other. The Load instruction needs all the stages, so the worst case is the sum of all the stage times, namely 12 ns.

Furthermore, recall (look at the FSM link - FiniteStateMachineMultiCycle) that Loads need 5 cycles, Stores and Arithmetic need 4 cycles, Branches need 3, and Jumps need 2 (FSM shows these needing 3 but they do not use stage 2, so stages 2 and 3 of a jump instruction could be merged in the FSM).

Now, assume we have a program with the following distribution of instructions:

Loads 20% Stores 30% Arithmetic 20% Branches 20% Jumps 10%

Let’s calculate the average time per instruction for the single-cycle and the multicycle machines.

Single-cycle: Every cycle takes 12 ns, so the average time is simply 12 ns.

Multi-cycle: Every cycle takes 3ns. Loads (20%) need 5 cycles; Stores and Arithmetic (50%) need 4 cycles, Branches (20%) need 3 cycles, and Jumps (10%) need 2 cycles. This gives:

3 (5*.2 + 4*.5 + 3*.2 + 2*.1) = 11.4 ns
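As a quick check on that arithmetic, the same calculation in Python:

cycle_ns = 3
cycles = {"load": 5, "store": 4, "arith": 4, "branch": 3, "jump": 2}
mix    = {"load": .2, "store": .3, "arith": .2, "branch": .2, "jump": .1}
avg_ns = cycle_ns * sum(cycles[k] * mix[k] for k in mix)
print(round(avg_ns, 1))   # 11.4 ns per instruction on average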

So, the multicycle approach lets us save some time, by allowing the simpler instructions to finish earlier than they would in a single-cycle machine.

End of Week 3


Pipelining

Now, we are about to enter the modern world of pipelining, where the instructions are done somewhat in parallel. This will really save a lot more time.

This topic eventually needs detailed and complex diagrams to make things clear, but let’s start with just general concepts:

Pipelining only makes sense with a multi-cycle approach. If the instructions are divided into 5 stages, then we can start the next instruction before the previous instruction has finished. The idea is that the different stages in the multi-cycle approach use different parts of the CPU hardware, so that in any particular cycle, there is lots of hardware sitting idle. Pipelining is going to make use of that hardware.

In particular, instruction 1 runs for one stage, and when that is done, instruction 2 begins its first stage, while instruction 1 goes to its second stage. At the third cycle, instruction 1 is on stage three, instruction 2 is on stage two, and instruction 3 is on stage one. It looks like this as time moves on:

       1    2    3    4    5         6         7         8         9         10
1      IF   Reg  ALU  Mem  RegWrite
2           IF   Reg  ALU  Mem       RegWrite
3                IF   Reg  ALU       Mem       RegWrite
4                     IF   Reg       ALU       Mem       RegWrite
5                          IF        Reg       ALU       Mem       RegWrite
6                                    IF        Reg       ALU       Mem       RegWrite

The instructions are numbered on the left 1 through 6, and cycles are numbered on the top 1 through 10. Notice that by cycle 5, the entire CPU is being used, each part by a different instruction! This makes tracing the hardware much harder than before, because there are now five different instructions all being executed at the same time, and each one is at a different stage in its execution!

Before we consider any more details, let’s first calculate how much this idea will speed up our computer from before. Recall that in the multicycle computer, the clock ticks every 3 ns, but now a new instruction is finished after each 3 ns period! Indeed, the nth instruction is finished after 3n + 12 ns. The extra 12 is for the startup cycles. That is, the first instruction finishes in 15 ns, and the second in 18, and the third in 21, etc.


So to conclude, for n instructions, we have the following times:

Single-cycle: 12n ns.

Multi-cycle: 11.4n ns.

Pipelined: 3n + 12 ns.

Pipelining kills it. And, engineers have known this for at least 30 years.

Unfortunately, pipelining is not so simple and comes with three major types of hazards:

Structural Hazards, Data Hazards, and Branch Hazards.

Structural Hazards:

The first obstacle you hit immediately when you try to pipeline is that the hardware used in the different stages is not independent. Each piece is active in more than one cycle. Indeed, as we mentioned earlier, the ALU is used in Stages 1, 2, and 3 for different purposes. If you only have one ALU, then you cannot use it simultaneously for Stage 1 of instruction three, Stage 2 of instruction 2, and Stage 3 of instruction 1! That is a clash and what we call a “structural hazard”. The solution is to add hardware. In fact, we end up adding hardware just like we had to do for the single-cycle machine. This is really the main reason we ever considered the single-cycle machine: to preview what the multi-cycle pipelined data-path would look like.

Another structural hazard occurs because if we are executing 5 instructions at a time, then the data and control for each instruction must be stored in between the stages in large buffers, so that the next stage of that instruction will have access to the correct information, and we can keep track of each instruction without having their information interfere with each other. The purpose of these large “in between” buffers is reminiscent of the reason we have the A, B, and ALUOut registers in the multicycle datapath. That is, we need to store calculated information from the Register Unit and the ALU for use in a later stage, because these hardware units will be calculating other stuff in those stages and will overwrite the earlier stuff if we do not save it.

Data Hazards:

A subtler obstacle when you pipeline instructions is that an instruction that stores a value in a register will not finish in time for an instruction following it that reads that register. Writing a register value occurs in the last stage, while reading the registers occurs in Stage 2.

       1    2    3    4    5         6         7         8
1      IF   Reg  ALU  Mem  RegWrite
2           IF   Reg  ALU  Mem       RegWrite
3                IF   Reg  ALU       Mem       RegWrite
4                     IF   Reg       ALU       Mem       RegWrite

For example, instruction 1 will write a register at the fifth cycle, but instructions 2 and 3 have already read this register value (incorrectly) from the register unit at cycles 3 and 4 respectively. That is a serious problem. There are many ways to fix data hazards. One way is called data forwarding. This involves adding complicated hardware and wires to the data path to move the data from one part of the data-path to where it needs to be so that the later instructions get the correct data in time, even though the data has not yet been written to the register unit. Details will be discussed later.

Branch (or Control) Hazards

The last type of hazard is a branch hazard. This hazard is more obvious than a data hazard. This happens when a branch instruction is executed. The problem is that we do not know whether or not the branch will be taken until Stage 3. Meanwhile, we have already started to execute the next two instructions! If the branch ends up not being taken, then no problem. But if the branch is taken, then we have two bogus instructions running through the pipeline and they need to get flushed out. We handle branch hazards in a number of ways. First, we move up the branch decision one stage, so we only have one bad instruction in the pipeline. Second, we try to predict accurately when a branch will be taken and hope to make the right guess. Details will follow later.

Now we will try to follow a set of instructions going through a pipelined machine:

This is impossible without elaborate diagrams. Please refer to your text

https://web.stonehill.edu/compsci/Architecture/Chapter04.pdf

and/or this power point:

https://web.stonehill.edu/compsci/Architecture/Pipelining.pptx

End of Week 4

This week we will look in detail at the PipeliningDataPath.   Among other details, the link shows how we insert buffer registers in between each of the five stages of the multi-cycle approach in order to keep the control and data organized and not overlapped.


After we trace through a few simultaneous instructions through the pipelined data path, we will revisit data hazards in great detail.

Data Hazards in Detail

There is one way a data hazard can occur. That is when a register is written in one instruction and that same register is read in a subsequent instruction. If the reading occurs in time before the register is written, then the subsequent instruction has gotten the old (wrong) value.

For example,

1. add $5, $6, $7
2. add $3, $5, $4
3. sub $8, $5, $4
4. and $9, $5, $4

The first instruction does not write $5 until its fifth stage. Meanwhile, instruction 2 is reading $5 in stage 3 of instruction 1, and instruction 3 is reading $5 in stage 4 of instruction 1. Both instructions 2 and 3 have read the old (wrong) value of $5. See figure 4.52.


What about instruction 4? Did it get the correct value of $5? Yes it did! It reads $5 in stage 5 of instruction 1 at the exact same time that instruction 1 is writing the value of $5. You should know (or remember from CLO) that a register unit can read and write at the same time, and when it does so, the write occurs in the first half of the cycle so the read gets the new value. Thus, instruction 4 has no hazard with instruction 1.

In general, an arithmetic instruction can cause a data hazard only with the next two instructions that follow: in the example above that means instructions 2 and 3.

The coolest thing about the hazards in instructions 2 and 3 is that they can be fixed by a slick trick called forwarding. The idea is illustrated in figure 4.53.

Here is how it works. The value of $5 that instruction 1 writes in stage 5 is actually known at the end of stage 3 and is passed through the buffer registers to stage 5. That is, we know the value of $5 much earlier than we write it. We will try to forward this information to instructions 2 and 3. We cannot get the value to the register unit, but we can get it to the ALU in time, which is all that matters. In other words, instructions 2 and 3 will read the wrong value of $5, but the correct value will replace this wrong value in time for their respective ALU stages.


The ALU stage of instruction 2 occurs at stage 4 of instruction 1, and the ALU stage of instruction 3 occurs at stage 5 of instruction 1. Since we know the value of $5 after stage 3 of instruction 1, we can send it from the appropriate buffer to the ALU in time. You should look at figure 4.53 to see how this all happens.

These two hazards can be checked for as written on page 306 of your text:

EX/MEM.RD == ID/EX.(RS or RT)
and
MEM/WB.RD == ID/EX.(RS or RT)

In our example, the first condition checks for hazards between instruction 1 and instruction 2, and the second condition checks for hazards between instruction 1 and instruction 3.

The rest of the details are in the text pages 308-312, Figures 4.56 and 4.60, where the hardware to do the checks and accomplish the forwarding is all drawn in. It is complex looking but conceptually straight-forward if you followed the discussion up to now. Figure 4.56 shows the data forwarding unit without the detection, and Figure 4.60 adds the detection unit.

Figure 4.60 shows all the hardware put together: the multiplexers for forwarding, the hazard detection unit, and the hazard forwarding unit.


There is one detail left to discuss. All the data hazards can be handled by forwarding when the offending instruction that writes to the register is an arithmetic instruction. However, when the offending instruction 1 is a load instruction (the only other instruction that writes to the register unit), then we do not know the value to be written until after stage 4 of instruction 1. And, this is too late to forward it to the ALU stage of instruction 2. Figure 4.58 shows this.


Instruction 2 needs the register value when it starts its own stage 3, which is the same as the start of stage 4 of instruction 1. And, we do not know the value at that point in time, so there is no way to forward it in time.

To summarize, there is one kind of hazard that we cannot fix with forwarding, and that is a load instruction that writes a register followed immediately by an instruction that reads that register.

It is checked like this (see page 314):

ID/EX.MEMREAD           # i.e., instruction 1 is a load instruction
and
ID/EX.RT == IF/ID.(RS or RT)

When this occurs, we need to stall the pipeline. We insert a nop instruction in between instruction 1 and instruction 2, which delays instruction 2 for one cycle – enough time to be able to forward properly. Figure 4.59 shows the insertion of a nop instruction to cause a stall. The nop instruction in practice is sll $0, $0, 0, which does nothing because $0 is hardwired to 0.


For practice, let’s find all the data hazards in this next example and indicate which can be handled by forwarding, and which need a stall.

1. add $4, $5, $6
2. lw $3, 10($4)
3. sub $6, $3, $4
4. or $7, $3, $6
5. sw $8, 10($3)

Instruction 1 has a data hazard with instruction 2 on register $4. This is handled by forwarding.
Instruction 1 has a data hazard with instruction 3 on register $4. This is handled by forwarding.
Instruction 2 has a data hazard with instruction 3 on register $3. This requires a stall.
Instruction 2 has a data hazard with instruction 4 on register $3. This is handled by forwarding.
Instruction 3 has a data hazard with instruction 4 on register $6. This is handled by forwarding.
There is no data hazard from instruction 2 to instruction 5. It is too far away.

Summary: Data hazards are only possible for the two instructions that follow a register write. They can be handled with forwarding unless the register write was done by a load instruction.
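Here is a small Python sketch that applies exactly this summary rule to the five-instruction practice example above. The tuple representation of the instructions is made up for illustration, but the hazards it prints match the list given earlier.

# Find data hazards: an instruction that writes a register can conflict only with
# the next two instructions that read it; forwarding fixes it unless the writer is
# a load and the reader is the very next instruction (then a stall is needed).

# (dest_written, [regs_read], is_load) for:
# 1. add $4,$5,$6  2. lw $3,10($4)  3. sub $6,$3,$4  4. or $7,$3,$6  5. sw $8,10($3)
program = [
    ("$4", ["$5", "$6"], False),
    ("$3", ["$4"],       True),
    ("$6", ["$3", "$4"], False),
    ("$7", ["$3", "$6"], False),
    (None, ["$8", "$3"], False),   # sw writes memory, not a register
]

for i, (dest, _, is_load) in enumerate(program):
    if dest is None:
        continue
    for j in (i + 1, i + 2):                       # only the next two instructions matter
        if j < len(program) and dest in program[j][1]:
            fix = "stall" if (is_load and j == i + 1) else "forwarding"
            print(f"hazard: instr {i+1} -> instr {j+1} on {dest} ({fix})")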


All the referenced figures and examples are reviewed in the links below, in case you are confused by the book.

PipeliningHazards DataHazard-Forwarding

Compiler Rewrites

There is one last way to handle a data hazard and that is by using a compiler rewrite. Indeed, inserting a nop instruction for a stall is a kind of compiler rewrite. But you can be more creative. For example, consider the two lines of code in some Java program:

A = B + E;
C = B + F;

Let’s say memory values A, B, C, E, and F are 12, 0, 16, 4, and 8 bytes offset from some base address stored in $t0. Then normally, a compiler would create these 7 MIPS instructions:

lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)    # These lines do A = B + E

lw  $t4, 8($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)    # These lines do C = B + F

You should verify that there are five data hazards, two of which cannot be handled by forwarding and require stalls. The five hazards are:

Lines 1 and 3
Lines 2 and 3 -- requires a stall
Lines 3 and 4
Lines 5 and 6 -- requires a stall
Lines 6 and 7

Now, if the compiler is clever, it might notice that you can move line 5 in between lines 2 and 3, to completely get rid of any need for a stall! Amazing trick! The code is shown below. There are still data hazards, but none that require a stall!

lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)     # These lines load up B, E, and F
add $t3, $t1, $t2
sw  $t3, 12($t0)    # These lines do A = B + E
add $t5, $t1, $t4
sw  $t5, 16($t0)    # These lines do C = B + F


Branch Hazards in Detail

If a branch instruction enters the pipeline, and it ends up needing to be taken, then we have already begun two instructions in the pipeline that need to be flushed out. This hazard occurs much less frequently than a data hazard, but we cannot solve it with forwarding. There are a number of ways to minimize the stalls required by a branch hazard.

1. Stall every branch instruction for two cycles by putting nop instructions in the pipeline. This is the worst solution.

2. Move the branch target hardware and the branch decision hardware to stage 2. This means we will know whether we are branching, and where to, by the end of the second stage. That means only one incorrect instruction will have entered into the pipeline. Note that this solution will create new data hazards.

3. Assume the branch is not taken. If the branch ends up being taken, then we can “flush” the bad instructions out of the pipeline because no MEM or WB stage has yet occurred, so nothing has been changed in memory or the register unit. We just set all the buffers to zero as though the bad instructions were nop instructions. This means we will stall only when the branch is taken.

4. Assume the branch is taken. This can only be done if we know the branch target (where to branch) before we know whether we are branching or not. This is indeed the case in MIPS, where the branch target calculation occurs early (as early as stage 2 - see option 2). This will necessitate a stall when the branch is not taken.

5. Branch Prediction. This idea tries to leverage the notion that a branch tends to do what it did last time. Think of a loop. There are 1-bit predictors and 2-bit predictors. A 1-bit predictor just does what was done last time. A 2-bit predictor will not change what it did last time until it is wrong twice. A 1-bit predictor makes more mistakes than a 2-bit predictor. Consider a situation where the branch will be taken 9 times and then not taken once. The 1-bit predictor will get 8/10 predictions correct, but the 2-bit predictor will get 9/10 correct. See the figures below. (A small simulation of both predictors appears after these options.)

6. Branch Delay Slot. This last technique is done by a clever compiler and it is the coolest of all. The idea is to reorder the instructions, if possible, so that the instruction that follows the branch becomes one that needs to get executed regardless of the branch outcome. This avoids the need for a stall. For example, consider the instructions:

add $1, $2, $3
beq $2, $0, label
Anything

Note that the add instruction must get done before the branch, but the branch happens to not depend on its outcome, so we can safely move the add instruction to the branch delay slot, where it will get executed before Anything and before the instruction at label.

The new code looks like this:

beq $2, $0, label
add $1, $2, $3
Anything

The branch delay slot can also be used to implement options 3 and 4 above – (assume branch taken or not taken). These three options are explained by this guy pretty nicely – with a pleasant accent: https://www.youtube.com/watch?v=vgMwKpp3L9o.
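To make option 5 concrete, here is a small Python simulation of a 1-bit predictor and a 2-bit saturating-counter predictor on the pattern described there (taken 9 times, then not taken once, over and over). The exact numbers depend slightly on the starting state of each predictor, but in steady state they come out to about 80% and 90% correct.

# Compare a 1-bit predictor with a 2-bit saturating-counter predictor on a loop
# branch that is taken 9 times and then not taken once, repeated many times.

outcomes = ([True] * 9 + [False]) * 100       # True = taken

def one_bit(outcomes):
    pred, correct = True, 0
    for taken in outcomes:
        correct += (pred == taken)
        pred = taken                          # always predict what happened last time
    return correct / len(outcomes)

def two_bit(outcomes):
    state, correct = 3, 0                     # 0,1 predict not-taken; 2,3 predict taken
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

print(one_bit(outcomes))    # 0.801 (about 80%: two mispredictions per pass in steady state)
print(two_bit(outcomes))    # 0.9   (only the loop-exit branch is mispredicted)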

Pipelining Performance Problem

Let’s consider a performance problem that incorporates all the hazards. Your machine runs at 500 MHz, i.e., the clock cycle is 2 ns (set by the slowest stage). The five stages of the instructions take 2, 1, 2, 2, and 1 ns respectively. Loads need 5 cycles, Stores and ALUs need 4, Branches need 3, and Jumps need 2.

Your program has the following distribution of instructions:

Loads: 20% Stores 10% ALU 50% Branches 18% Jumps 2%

The Loads have a 50% data dependency. That means that a stall is needed 50% of the time. All other data hazards are handled by forwarding. Branches are mis-predicted 15% of the time with no delay slot to help. Jumps always require one stall, because you only know the jump target after two cycles.

Now let’s compare single-cycle, multi-cycle, and pipelined machines.

Single Cycle: The length of a cycle is 8 ns. The CPI is 1. That is, 8 ns per instruction.

Multi Cycle: The length of a cycle is 2 ns. The CPI = 5(.2) + 4(.1) + 4(.5) + 3(.18) + 2(.02) = 3.98.

That is, 2(3.98) = 7.96 ns per instruction.

Finally,


Pipelined: The length of a cycle is 2 ns. Half the loads take 2 cycles, and half take 1 cycle. Branches take one cycle 85% of the time and 2 cycles 15% of the time. Stores and ALUs each take one cycle. And, Jumps need 2 cycles (one stall because of the unavoidable branch hazard with no usable delay slot). This gives CPI =

  .5(.2*2 + .2*1)        Loads
+ .1(1)                  Stores
+ .5(1)                  ALUs
+ .18(.85*1 + .15*2)     Branches
+ .02(2)                 Jumps

= .3 + .1 + .5 + .153 + .054 + .04 = 1.147

This gives 2(1.147) ≈ 2.3 ns per instruction. The throughput of the pipelined machine soundly beats both the single-cycle and multi-cycle machines.
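As a check on the arithmetic, the same CPI bookkeeping in Python:

# Effective CPI for the pipelined machine (2 ns cycle).
cpi = (
    0.20 * (0.5 * 2 + 0.5 * 1)     # Loads: half need one stall
  + 0.10 * 1                       # Stores
  + 0.50 * 1                       # ALU instructions
  + 0.18 * (0.85 * 1 + 0.15 * 2)   # Branches: 15% mispredicted
  + 0.02 * 2                       # Jumps: always one stall
)
print(round(cpi, 3), round(cpi * 2, 2))   # 1.147 CPI, about 2.3 ns per instruction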

End of Week 5

Cache, Virtual Memory, and Memory Hierarchy

We finish up the hardware of MIPS by discussing Cache and Virtual Memory. These two ideas together comprise the important computer engineering idea of memory hierarchy. Space is memory. In a computer we keep data and programs in different places depending on how often we expect to use them. The simplest hierarchy is RAM (primary memory) versus disk (secondary memory). RAM is faster, smaller, and more expensive than disks.

A good analogy to memory is the space you store your stuff in your dorm. You keep the stuff you use all the time - like your phone - on a shelf near your bed. The stuff you need every day – like your wallet and your jewelry - you keep on your desk. The stuff that you use once a semester – like the pretty dress for the Senior dance, or your dress shoes, or the lacrosse stick you brought to have a catch with your friend who forgot his at home – you keep in your closet or under the bed. RAM is like your desk and your closet is like a disk.

As you probably should know, RAM is electronic while disks are electromechanical. Thus, RAM is more expensive per bit than disk space, and RAM is much faster (100,000 times faster) than disks for accessing a particular address. It’s not called random access memory for nothing. On the other hand, RAM speeds for transferring data once you have found the start address are only 100-1000 times faster than a disk. This is because the seek part of a disk access is mechanical (spinning platters under a moving read head on a mechanical arm), while the transfer is electronic. And, because of these differences, RAM is used for running programs and their data, while disks are used to store stuff you are not currently running, stuff you might want to use one day – both data and programs.


And, if you think that if only I had enough money I would buy RAM big enough for everything, that is silly. You want to have slow memory and fast memory – because you don’t need to keep everything right on your night table. Memory hierarchy is good. It is efficient. It is the way to design a well-balanced machine. There will always be a need for different levels of performance. The stuff we don’t use much can sit in slow access disk and we can wait a few seconds as it loads into RAM. The stuff we need on our desk can stay in RAM until we are done with it, and then we put it back in the closet (on the disk). Slow goes with large and cheap. Fast goes with small and expensive.

The interesting part is that it is not just RAM and disks. There are different kinds of RAM – static and dynamic. Dynamic RAM needs to be refreshed all the time or it forgets its values. Static RAM does not. Thus, Static RAM is faster and more expensive than dynamic RAM. That gives us a hierarchy in RAM: static RAM is used for a smaller, faster subset of RAM called cache. Some machines even have two levels of cache. Level 1 is the fastest, smallest, and most expensive; level 2 cache is less fast, larger, and cheaper, and RAM is the slowest, largest, and cheapest of the three. Of course, disks or flash memory are slower, larger, and cheaper than all these kinds of memory. It doesn’t matter what the current speeds of all these memories are today, because the times are always improving, and there will always be a hierarchy.

That is the general idea, but the specifics get complicated.

When we access memory in RAM, we would like 90% or more of those accesses to hit in the cache. In order to accomplish this, we make use of two concepts:

a. Temporal locality.
b. Spatial locality.

Temporal (time) locality means that if we access a memory element, then we are much more likely to access it again soon. Thus, whenever we access memory, we copy the value into cache, so that the next access will be faster. How this copying is done, and how we efficiently check whether a value is in cache or not are important details that we will discuss later.

Spatial locality means that if we access a memory element, then we are more likely to access memory elements near it. Think of arrays, and how many programs are traversing arrays. Thus, whenever we access a value, we copy a block of memory elements that are contiguous with the referenced element. How big these blocks should be, how we blend this idea with temporal locality, how we choose when to replace a block, how we keep the values in cache and in RAM consistent, and how we determine whether we get a hit in this cache are all details we will discuss.

We are hoping to get 90% or more references to come from cache. Cache is 10 times faster than RAM, so we want to spend less time in RAM and more in cache. And, obviously, we do not want to spend any significant time deciding whether a reference is in cache (hit) or not (miss). We need a scheme where determining hits and misses takes virtually no time.


You should know that the speed of cache is similar to that of the CPU registers and other hardware, on the order of 1-2 cycles, while the speed of RAM (tens of cycles) would effectively stall the CPU. We really do not want to access RAM that often.

Cache in Detail

We will review these slides in class: Cache Review Slides. They are a good overview with a slightly different organization than my notes that follow below. Both sets of notes complement each other. It is worth reviewing both. You can skip these now and come back later, or do them first, as you choose.

Direct Mapping

The basic way to implement a cache uses a strategy called direct mapping. This strategy makes sure that determining when a memory access is a hit or a miss is accomplished with no delay at all. There are other strategies that trade some time on the hit/miss determination in order to increase the hit rate. They are called associative and set-associative. We will discuss them later, only after you understand the basic direct mapping strategy.

Cache Example 1:  RAM = 2^32 bytes  Cache = 2^10 bytes    Single-byte blocks

The RAM address is 9392 in decimal, and it holds the value B, a byte. In binary this address is:

0000000000000000001001 0010110000   (the space is for your reading convenience)

You calculate 9392%1024 = 0010110000 = 176 to find the byte address in cache (176) for RAM address 9392. That is, 9392 in RAM direct-maps to 176 in cache. The data from 9392 is stored in cache location 176 along with a tag that tells you where it came from in RAM. The tag in our case is 9. Finally, a valid bit is stored and is set to 1 whenever that cache entry has been written to.

The calculation of 9392%1024 = 0010110000 = 176 is done in hardware by, in this case, simply taking the last ten wires of the 32-bit RAM address.

The cache itself looks like this:

Address (10 bits)   Data (8 bits)   Tag (32 - 10 = 22 bits)    Valid (1 bit)
0010110000          B               0000000000000000001001     1

(The data, tag, and valid bit together make up the 31-bit contents of each cache entry.)
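The “splicing of wires” above is just bit slicing. Here is a tiny Python sketch of the address breakdown for Example 1; it simply mirrors the arithmetic already done above.

address = 9392
index = address & 0x3FF          # low 10 bits: address % 1024
tag   = address >> 10            # remaining 22 bits: address // 1024
print(index, tag)                # 176 9
print(format(address, "032b"))   # full 32-bit address; top 22 bits are the tag, low 10 the index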

Cache Example 2:  RAM = 2^32 bytes  Cache = 2^10 blocks, Block = 4 bytes

The figure below shows the structure of Cache Example 2, and how a hit is checked in hardware.


Before any RAM addresses, the cache is empty, and its structure looks like this:

Block Address (10 bits)   Data (32 bits)   Tag (32 - 12 = 20 bits)   Valid (1 bit)

(The data, tag, and valid bit together make up the 53-bit contents of each cache block entry.)

(Note that the diagram has the order Valid, Tag, and Data, but that is a trivial difference.)

Consider the following three 32-bit RAM addresses, the last two being identical, except for the byte in the block.

00000000000000000010 0000000011 00   (spaces are for your reading convenience)
00000000000000000011 0000000011 00
00000000000000000011 0000000011 01


The first 20 bits end up being the tag; the next 10 bits give you the block in the cache; the last two bits tell you the byte in that block.

The results of these three addresses are: Miss, Miss, Hit.

Before the addresses are accessed, all the valid bits in every block are set to zero. Assume that the first address contains four bytes D3 D2 D1 D0, and the last two contain E3 E2 E1 E0.

After 00000000000000000010 0000000011 00, the cache looks like this:

Block Address (10 bits)   Data (32 bits)   Tag (20 bits)            Valid (1 bit)
0000000011                D3 D2 D1 D0      00000000000000000010     1

After 00000000000000000011 0000000011 00, the cache looks like this:

Block Address (10 bits)   Data (32 bits)   Tag (20 bits)            Valid (1 bit)
0000000011                E3 E2 E1 E0      00000000000000000011     1

The third address is a hit, and the cache is unchanged. We are accessing E1, rather than E0.

Now consider this subtle point: suppose that on the second miss (trying to read E0) we had only copied E0 into the block rather than the entire 4-byte word. Then the cache after the second RAM address would look like this:

0000000011                D3 D2 D1 E0      00000000000000000011     1

And, when we get a hit with the next RAM address, 00000000000000000011 0000000011 01 (tags match), we would incorrectly access D1 when it should be E1.

Subtle point:  Therefore, on a miss, we must move the entire block from RAM to cache, to make sure subsequent hits access the correct data even if it is a different byte in that block.

What happens to miss penalty and hit ratio when block size increases?


Answer: They both increase. A larger block size means you are more likely to get a hit, due to spatial locality. However, when you miss, a larger block size means you must move more data back and forth between cache and RAM.

The best size for the block is determined by experimentation because a larger miss penalty is bad, while a larger hit ratio is good.

Cache Example 3:  RAM = 2^32 bytes    Cache = 2^14 bytes   Block = 2^6 bytes (2^4 words)

a. How many bits in the tag?
Answer: 32 - 14 = 18.

b. How many blocks of data in cache?
Answer: 2^14/2^6 = 2^8.

c. How many blocks of data in RAM?
Answer: 2^32/2^6 = 2^26.

d. Which bits in the RAM address tell you the block in RAM?
Answer: The 26 bits: 31 through 6.

e. Which bits in the RAM address tell you the block in cache?
Answer: The 8 bits: 13 through 6.

f. Which bits indicate the byte in a block?
Answer: The 6 bits: 5 through 0.

g. How large is the cache?
Answer: 2^8 (1 + 18 + 64*8) bits.
There are 2^8 blocks. Each block holds a 1-bit valid bit, an 18-bit tag, and 64 bytes of data.

A RAM address is parsed this way:

Tag       Block in Cache   Word in Block   Byte in Word
18 bits   8 bits           4 bits          2 bits

The cache structure looks like this:

Block Address (8 bits)   Data (64 bytes)   Tag (32 - 14 = 18 bits)   Valid (1 bit)

(The data, tag, and valid bit together make up the 64*8 + 19 = 531-bit contents of each cache block entry.)
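If you want to double-check the Example 3 answers, here is the same arithmetic in Python (the variable names are just for illustration):

# Cache Example 3: RAM = 2^32 bytes, cache = 2^14 bytes, block = 2^6 bytes.
ram_bits, cache_bits, block_bits = 32, 14, 6

tag_bits    = ram_bits - cache_bits            # 18
index_bits  = cache_bits - block_bits          # 8  -> 2^8 blocks in the cache
offset_bits = block_bits                       # 6  -> byte within the block
blocks_in_ram = 2 ** (ram_bits - block_bits)   # 2^26

# Total cache storage: per block, 1 valid bit + tag + 64 bytes of data.
total_bits = (2 ** index_bits) * (1 + tag_bits + (2 ** block_bits) * 8)
print(tag_bits, index_bits, offset_bits, blocks_in_ram, total_bits)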

Writes into Cache

Everything we discussed so far deals with reads in the cache and whether they are hits or misses. However, you can have writes also, and that complicates things by giving us two options for single-word blocks.


a. Write Back: This tests for a hit or miss as usual. If it is a miss and at least one write has occurred in the cache, then the entire cache block is copied back to RAM, updating RAM, and then the new RAM block is copied into cache. A “dirty” bit is used and maintained to keep track of whether the cache block has had at least one write. At any time, this strategy might leave RAM with out-of-date data until a miss occurs, at which time the block in cache is replaced and the entire block in RAM is updated at once. The advantage of this strategy is that we do not access RAM on every write; going to RAM on every write would slow writes down.

b. Write Through: This writes directly to RAM and cache and does not even bother to check whether or not the memory write access was a hit or miss. With this strategy any updates to memory are made immediately in both RAM and cache. When a block is replaced in cache, it is not necessary to copy the cache back to old RAM, because the RAM has been updated all along.

The write to the cache is synchronous and needs no stall, but the write to RAM normally would require a stall. A buffer can be used to make write-through to RAM synchronous and more efficient. A buffer is simply an extra register(s) used to store the write synchronously. This avoids any delay at the moment of the write. Meanwhile, the actual write from that register to RAM occurs a few cycles later, but almost always before that value in RAM needs to be accessed again.

For large multi-word blocks only the write-back strategy is practical because of the high miss penalty. That is, it would take way too long to be copying entire blocks to RAM with every write, even with the help of a buffer.
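Here is a minimal sketch in Python of a direct-mapped, single-word-block, write-back cache with a dirty bit, following the read/write cases in the summary below. The class and its methods are made up purely to illustrate the bookkeeping; they are not meant to model any real hardware interface.

# Single-word-block, direct-mapped, write-back cache sketch with a dirty bit.
# "ram" is a dict of word-address -> value; one cache line per index.

class WriteBackCache:
    def __init__(self, num_lines, ram):
        self.lines = [{"valid": False, "dirty": False, "tag": None, "data": None}
                      for _ in range(num_lines)]
        self.num_lines = num_lines
        self.ram = ram

    def _line(self, addr):
        return self.lines[addr % self.num_lines], addr // self.num_lines

    def _evict_if_dirty(self, line, index):
        if line["valid"] and line["dirty"]:
            # copy the cache word back to its *old* RAM address
            self.ram[line["tag"] * self.num_lines + index] = line["data"]

    def read(self, addr):
        line, tag = self._line(addr)
        if not (line["valid"] and line["tag"] == tag):          # read miss
            self._evict_if_dirty(line, addr % self.num_lines)
            line.update(valid=True, dirty=False, tag=tag, data=self.ram[addr])
        return line["data"]

    def write(self, addr, value):
        line, tag = self._line(addr)
        if not (line["valid"] and line["tag"] == tag):          # write miss
            self._evict_if_dirty(line, addr % self.num_lines)
            line.update(valid=True, tag=tag)
        line.update(dirty=True, data=value)                      # write only to cache

ram = {a: 0 for a in range(64)}
c = WriteBackCache(8, ram)
c.write(3, 99)              # writes the cache only; RAM is not touched yet
print(ram[3], c.read(3))    # 0 99
c.read(11)                  # 11 maps to the same line as 3, so the dirty word is copied back
print(ram[3])               # 99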

This is my summary sheet for how cache works, followed by my handwritten version for those of you who miss my chicken scrawl.

The Big Cache Summary

Single-Word Blocks

Write-Back:
Read Hit:   Access cache.
Read Miss:  *Copy the cache word back to its old RAM address; copy the new RAM word to cache; replace the tag; access cache.
Write Hit:  Write the word to cache.
Write Miss: *Copy the cache word back to its old RAM address; write the word to cache; replace the tag.

Write-Through (use a buffer for efficiency):
Read Hit:   Access cache.
Read Miss:  Copy the new RAM word to cache; replace the tag; access cache.
Write:      Do not check for hit or miss. Write the word to cache and RAM; replace the tag.


* On a miss, you copy back the entire word even if only one byte is changed. (A “dirty” bit is used and maintained to keep track if the cache block has had at least one write.) This is to ensure that tags are correct for the entire word. Otherwise, a future hit with a different byte will access wrong data. See previous “subtle point”.

Multi-Word Blocks (Write-Through is not an option because of high miss penalty)

Read Hit:   Access cache.
Read Miss:  Copy the cache block back to old RAM if the cache block is “dirty”; copy the new RAM block to cache; replace the tag; access cache.

Write Hit:  Write the word to cache.
Write Miss: Copy the cache block back to old RAM if the cache block is “dirty”; write to RAM; copy the new RAM block to cache.


Let’s finish our discussion with a performance problem.


A computer has 5% instruction memory reference misses, and 10% data misses. The average CPI is 4. There is a 12-cycle miss penalty. And, 33% of the instructions are Loads/Stores. What is the effective CPI including cache misses?

First of all, note that a perfect cache with a 100% hit rate, and no misses, would leave the CPI = 4. On the flip side, with no cache at all, each memory access costs an extra 12 cycles. So, we would have 12 extra cycles to access memory for the fetch, and 1/3 (12) for the data memory access due to the 33% Load and Store instructions. This gives a CPI = 4 + 12 + (1/3)12 = 20.

Now let’s calculate the actual effective CPI in our example. The instruction references miss 5% of the time. The data references miss 10% of 1/3 of the time, because they only happen with Load and Store instructions. So, the CPI with cache misses = 4 + 1/3 x 1/10 x 12 + 1/20 x 12 = 5.

Now let’s assume that the speed of the computer doubles, so that 12 old cycles is now the same as 24 new cycles. And, furthermore, assume that the memory unit does not speed up, so that the miss penalty is now 24 new cycles. Let’s recalculate the effective CPI now. The effective CPI = 4 + (1/3 x 1/10 x 24) + (1/20 x 24) = 6 new cycles, which is the same amount of time as 3 old cycles. That is, the new machine is only 67% faster than the old one, even though it is 100% faster in its clock. This is an example of Amdahl’s Law, which says that the overall speedup is limited by how much of the work is actually improved. Here we are improving the clock, but not the memory unit.
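The same bookkeeping in Python, for the original machine and for the doubled-clock machine:

# Effective CPI = base CPI + instruction-miss penalty + data-miss penalty.
base_cpi     = 4
instr_miss   = 0.05          # 5% of instruction fetches miss
data_miss    = 0.10          # 10% of data references miss
loads_stores = 1 / 3         # fraction of instructions that touch data memory

def effective_cpi(miss_penalty):
    return (base_cpi
            + instr_miss * miss_penalty
            + loads_stores * data_miss * miss_penalty)

print(round(effective_cpi(12), 2))   # 5.0 (original machine)
print(round(effective_cpi(24), 2))   # 6.0 new cycles = 3 old cycles on the doubled clock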

End of Week 6

Direct versus Set-Associative versus Fully-Associative Cache

The cache implementation we discussed so far is called direct mapping. Its feature is that every block in RAM has a specific place it is mapped to in the cache. This modulus calculation is done directly in hardware and is very fast. Thus, the determination of a hit or a miss is extremely fast.

An alternative is to allow a block in RAM to map to anywhere at all in the cache. This is called a fully associative cache. The implication is that the hit rate goes way up, because you don’t get a miss unless the cache is full. As long as there is any room in the cache, you can map a RAM block to whatever is left. When the cache is full, the standard strategy is to replace the block that was least recently used (LRU).

The downside of the fully associative scheme, and it is a killer downside, is that it becomes very hard to determine whether or not a memory reference is a hit or a miss. You would have to search through the entire cache looking for the tag. There would be no direct mapped spot to look for. This penalty is a deal breaker for cache.

In practice, we compromise and use something in between direct-mapped and fully-associative, called set-associative. In this scheme, you partition the cache into sets of blocks. A given block in RAM must be mapped to a specific set of blocks in the cache, but within that set, the block can be put anywhere. The determination of a hit or miss is still done in hardware as long as the sets are not very large, but the hit rate increases. The figure below shows how we determine a hit using parallel hardware for a set-associative strategy using sets of size 4. A set goes across the diagram from left to right. If each set was very large, then the hardware needs would not be practical. The extreme version is one set consisting of the entire cache, and that is the fully associative option we rejected at the start.

Let’s look at an example.

Example 1:

Consider a cache of size 2^8 bytes, partitioned into 16 blocks of 2^4 bytes per block. Then consider that the cache is divided into 2 sets of 8 blocks each. Every RAM block address is reduced mod 2 to decide which set it is mapped to. Within the set, the block can be placed anywhere.

Let’s process the following sequence of memory references:


0, 3, 5, 19, 6, 2, 20, 5, 7, 9, 6, 10.

Set I:  0, 6, 2, 20, 10
Set II: 3, 5, 19, 7, 9

Each set holds 8 blocks and neither set fills up, so no block is ever evicted and we never get a conflict miss: the repeated references to 5 and 6 are both hits.

If we tried to process these references using direct mapping, each reference would be taken mod 16 to find its cache block. In that case, 19 is a miss, because 19 mod 16 = 3, so it maps to the same block as 3.
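As a quick check of these mappings (again treating the references as block numbers), a couple of lines of Python reproduce the set split and the collision between 3 and 19:

```python
refs = [0, 3, 5, 19, 6, 2, 20, 5, 7, 9, 6, 10]

# Two sets: the remainder mod 2 picks the set.
set_I  = [r for r in refs if r % 2 == 0]   # [0, 6, 2, 20, 6, 10]
set_II = [r for r in refs if r % 2 == 1]   # [3, 5, 19, 5, 7, 9]

# Direct mapping with 16 blocks: the remainder mod 16 picks the block.
print(3 % 16, 19 % 16)   # 3 3 -> 19 collides with 3, so 19 is a miss
```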

Example 2:

Now consider a cache of 4 single-word blocks with three different schemes:
a. Direct-mapped
b. Set-associative, with two sets of 2 words each
c. Fully-associative

Consider these references: 0, 3, 8, 5, 0, 3, 6, 8, 7, 0. In all three schemes there are a number of misses until the blocks or sets get filled up.

Direct-mapped (blocks 0-3):
Block 0: 0 miss, 8 miss, 0 miss, 8 miss, 0 miss
Block 1: 5 miss
Block 2: 6 miss
Block 3: 3 miss, 3 hit, 7 miss

Set-associative (sets 0-1):
Set 0: 0 miss, 8 miss, 0 hit, 6 miss, 8 miss, 0 miss
Set 1: 3 miss, 5 miss, 3 hit, 7 miss

Fully-associative:
0 miss, 3 miss, 8 miss, 5 miss, 0 hit, 3 hit, 6 miss, 8 hit, 7 miss, 0 miss

You can see that as you move from direct-mapped to set-associative to fully-associative, the number of hits increases (1, 2, and 3 hits, respectively). Here is another example of the same phenomenon.


Virtual Memory

Virtual memory (VM) is a way to make RAM look bigger. In the old days, computers had small RAMs that could not fit all the applications at the same time. Virtual memory in those days was necessary to give the user the illusion that RAM was bigger than it was, while the hard drive was actually used to store chunks of RAM contents whenever RAM ran out of space. Nowadays, with huge RAMs, VM is mainly used by the operating system to maintain security in the memory system. Here, we discuss the way VM works.

Cache is a real physical memory that sits in between RAM and the CPU in order to take advantage of a memory hierarchy. It makes RAM look faster. In some sense, Virtual memory (VM) is to the hard drive (secondary memory) what cache is to RAM. VM sits in between RAM and the hard drive. So, in principle it is just another stage of the memory hierarchy. That is where the similarity to cache ends.

VM is not a real physical device like cache is. VM is virtual: it is simulated. Furthermore, since RAM is the center of the computer’s world and cache is closer to the CPU, the cache makes RAM look faster. VM, on the other hand, is closer to the secondary memory (it sits on the slow side of RAM), so rather than making secondary memory look faster, VM makes RAM look bigger. This is a very important distinction, and it is the entire concept behind VM as part of the memory hierarchy.

In general, if you have a memory hierarchy A to B, decreasing in speed and cost, and increasing in size, then A makes B look faster, and B makes A look bigger. With a computer, you have the hierarchy cache, RAM, VM, but you are concerned primarily about RAM, so cache makes RAM look faster, and VM makes RAM look bigger. We do not normally think of RAM making cache seem bigger, or RAM making VM or secondary memory seem faster. We are focused on how RAM is affected. This is what causes the lack of symmetry between VM and cache in the memory hierarchy.

Another big difference between cache and VM is that the speed gap between cache and RAM is on the order of 10-100x, while the speed gap between RAM and the disk behind VM is on the order of 10,000-100,000x. This asymmetry means that VM is implemented very differently from cache. So, let’s find out how.

VM is virtual because it is part of the hard drive or secondary memory. Effectively, we set aside a section of the hard drive to expand the RAM. Typically, this space is anywhere from 2 to 16 times the size of RAM. When you try to access this expanded RAM, sometimes you get a hit (the access is in RAM), and sometimes you get a miss because the value is off on the hard drive. The penalty for a miss is hundreds of thousands of cycles! Because of this, we need a miss rate that is tiny, something like 0.001%. To accomplish this, we use a fully associative scheme and very large block sizes, which for VM we call pages. A typical page size in VM is 4 KB to 128 KB. A miss in VM is called a page fault.


When a page fault occurs, a page must be copied from disk into RAM. This takes a long time, so to minimize page faults we replace the page that was least recently used (LRU). We approximate LRU by keeping a reference (use) bit for each page: the bit is set to one whenever that page is accessed, and periodically all the reference bits are reset to zero. When it comes time to replace a page, we replace the first page we find whose reference bit is zero. (The dirty bit is a separate flag, recording whether the page has been written and therefore must be copied back to disk before being replaced.)
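Here is a minimal sketch of that reference-bit approximation in Python (the class and method names are mine, just for illustration):

```python
class PageFrames:
    """Approximate LRU replacement with a per-frame reference bit."""

    def __init__(self, num_frames):
        self.ref_bit = [0] * num_frames   # 1 means "used since the last reset"

    def touch(self, frame):
        self.ref_bit[frame] = 1           # set on every access to that page

    def reset(self):
        # Done periodically (e.g. on a timer) so old accesses age out.
        self.ref_bit = [0] * len(self.ref_bit)

    def choose_victim(self):
        # Replace the first page whose reference bit is still 0.
        for frame, bit in enumerate(self.ref_bit):
            if bit == 0:
                return frame
        return 0                          # everything was touched recently; pick any
```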

Let’s look at an example to get the basic idea.

Example 1:

RAM = 2^30 bytes = 1 GB, and VM = 2^32 bytes = 4 GB. So VM expands RAM by a factor of 4, and 4 GB of the hard drive is reserved for this RAM expansion. A page holds 2^12 bytes.

Addresses in VM are called virtual addresses. In our example, a virtual address has 32 bits to address the 2^32-byte virtual space. Pages can be placed anywhere in RAM, fully-associative style. However, no tags are used, because it would be far too slow to find a page by searching RAM for a tag. Instead, a page table is stored in RAM; it is a ledger of every virtual page, recording whether (and where) that page is currently in RAM. This table is used to convert a virtual address to a physical address, and it lets us find pages and detect hits or page faults quickly. The price we pay is that the page table can get very large.

The first 20 bits of the virtual address select an entry in the page table, and the last 12 bits give the byte within the page. For each of the 2^20 entries, the page table holds an 18-bit physical page address in RAM and a valid bit that indicates whether that virtual page is currently in RAM (1) or not (0).

Virtual address:
Bits 31-12: 20-bit page-table index P
Bits 11-0: offset M (byte within the page)

Every program has its own page table. The base address of the page table is added to the 20-bit index P to find the page-table entry, which holds an 18-bit physical page number Q. Q is concatenated with M to form a 30-bit physical RAM address.
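In code, the translation for Example 1 might look like the sketch below; the page_table argument is a hypothetical Python list standing in for the real per-program table in RAM, with one (valid, physical page number) pair per entry.

```python
PAGE_BITS = 12                                      # 2^12-byte pages

def translate(virtual_addr, page_table):
    """32-bit virtual address -> 30-bit physical address.
    page_table[P] is a (valid, physical_page_number) pair; PPNs are 18 bits."""
    p = virtual_addr >> PAGE_BITS                   # upper 20 bits: page-table index P
    m = virtual_addr & ((1 << PAGE_BITS) - 1)       # lower 12 bits: offset M
    valid, q = page_table[p]
    if not valid:
        raise LookupError("page fault on virtual page %d" % p)
    return (q << PAGE_BITS) | m                     # 18-bit Q followed by 12-bit M = 30 bits
```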


Example 2:

VM is 8 MB, with 2KB per page.

How many bits in a virtual address?
Answer: 23, because 8 MB is 2^23 bytes. Bits 22-11 are the index into the page table, and bits 10-0 are the byte offset within the page.


Each page-table entry in RAM is two hex digits (8 bits). The format of each entry: bit 7 is the valid bit; bits 5-0 are the physical page number in RAM if valid is 1; bits 6-0 are the location of the page on the hard drive if valid is 0.

How large is RAM?
Answer: There are 6 bits for physical page numbers, so RAM holds 2^6 pages of 2 KB each = 2^17 bytes = 128 KB.

A sample page table is shown below, with the four entries at offsets 0, 1, 2, and 3 bytes from the base address of the table.

Offset 0: 64
Offset 1: A2
Offset 2: 9B
Offset 3: 13

How many pages are in RAM, and which ones?
Answer: There are two pages in RAM, the ones whose entries are A2 and 9B. Both these entries have a 1 in the valid bit, showing that those pages are in RAM.

Given the virtual address 000000000010 10010000001, where is this address, RAM or hard drive? What is its physical address?

Answer: The first 12 bits of the virtual address are 000000000010 = 2. This index is added to the base address of the page table to reach the entry 9B = 10011011, which means the page is in RAM because valid = 1. The physical page number is 011011 (bits 5-0 of the entry), and the offset from the virtual address is 10010000001. Thus the address is in RAM, and the physical address is 01101110010000001. Note that a RAM address has 17 bits.
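The same bit manipulation can be checked with a few lines of Python (the list below is the sample page table above; everything else is just shifting and masking):

```python
page_table = [0x64, 0xA2, 0x9B, 0x13]            # one byte per entry, as above

va     = 0b000000000010_10010000001              # the 23-bit virtual address
vpn    = va >> 11                                # upper 12 bits -> 2
offset = va & 0x7FF                              # lower 11 bits -> 0b10010000001

entry  = page_table[vpn]                         # 0x9B = 0b10011011
valid  = (entry >> 7) & 1                        # 1 -> the page is in RAM
ppn    = entry & 0x3F                            # bits 5-0 -> 0b011011

pa = (ppn << 11) | offset                        # physical page number ++ offset
print(f"{pa:017b}")                              # 01101110010000001 (17-bit RAM address)
```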

Putting It All Together

What happens first, virtual memory or cache? The answer is virtual memory. Every memory reference starts as a virtual address, which must be translated into a physical address. If the page is on disk, we take a page fault and bring the page into RAM, replacing the least recently used (LRU) page. If it is in RAM, we then check whether the address is in the cache.

Unfortunately, the page table scheme for VM means we would have to access RAM for every memory reference just to read the page table, which ruins all the speedup we were getting from the cache! The solution is to store recently used chunks of the page table in a special cache called the translation lookaside buffer (TLB). The figure below shows the entire scheme. First, we look in the TLB for the translation of the virtual address; then we find the page either in RAM or, on a page fault, on the hard disk; finally, with the physical address in hand, we try the cache hoping for a hit, and otherwise take a cache miss.
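Putting the order of operations into a toy Python model (my own sketch, with unrealistically simple structures and no eviction) makes the flow explicit:

```python
PAGE_BITS = 12

tlb        = {}        # virtual page number -> physical page number (cached translations)
page_table = {}        # virtual page number -> physical page number, for pages in RAM
cache      = set()     # physical block numbers currently held in the cache

def access(virtual_addr):
    vpn    = virtual_addr >> PAGE_BITS
    offset = virtual_addr & ((1 << PAGE_BITS) - 1)

    # 1. Translate the address: TLB first, then the page table in RAM, then disk.
    if vpn in tlb:
        ppn = tlb[vpn]                            # TLB hit: no extra RAM access needed
    elif vpn in page_table:
        ppn = tlb[vpn] = page_table[vpn]          # TLB miss: one RAM access to the table
    else:
        ppn = page_table[vpn] = len(page_table)   # page fault: toy placement, no eviction
        tlb[vpn] = ppn

    physical_addr = (ppn << PAGE_BITS) | offset

    # 2. Only now is the cache checked, using the physical address.
    block = physical_addr >> 4                    # pretend 16-byte cache blocks
    hit   = block in cache
    cache.add(block)
    return physical_addr, hit
```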


The link below combines the cache, virtual memory, and the translation lookaside buffer in an example for you to try:

Practice with Cache and Virtual Memory

You should definitely try this example. Once you get it, you will know that you understand the entire picture.

Finally, all this complex memory management is just the tip of the iceberg. There are machines with multiple levels of cache: the level-1 cache acts as a cache for the level-2 cache, which in turn acts as a cache for RAM.

End of Week 7