A.S.D.F. (Amazingly Super Duper Fast)

Marco Carloni, Dan Goldberg, Jesse Rankin, Norman Zhou

T.A. Kelvin Lwin

Dec 9, 1999

The A.S.D.F. (Amazingly Super Duper Fast) Processor is a MIPS processor featuring:

- 2-way superscalar core
- pipelined 2-stage execution
- 16-word fully associative data and instruction caches
- 8-entry branch prediction table
- 64-bit memory bus
- burst-mode DRAM controller
- 4-word split pre-fetching stream buffer
- 8-word victim cache
- 4-word write buffer

Our goal when designing this processor was to build a machine with the smallest clock period possible, while accounting for the limitations of an in-order pipeline. Our first observation focused on the ALU: if we kept a single execute stage, the minimum clock period would be 15ns + 3ns (the register delay) + other component delays. So we decided to split the execute stage in half and pipeline the ALU. The project specifications for an extended pipeline suggested dividing the memory stage into two stages, allowing the cache two cycles to hit. However, we were confident our cache would take less than 10ns and therefore rival the access time of the register file. Since a fundamental limit on cycle time lay at around 14.5ns inside the decode stage (register file + register access + muxes), assuming we would not pipeline the decode stage, we realized no significant performance gain would come from dividing the memory stage in half. Therefore, we decided to build a 6-stage pipeline with two execute stages.
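
A quick back-of-the-envelope check of this reasoning, using the component delays quoted in this report (the exact per-stage groupings are our own illustrative assumption, not measured paths):

    # Cycle time is set by the slowest pipeline stage.
    REG_DELAY = 3.0      # ns, pipeline register (quoted above)
    ALU_DELAY = 15.0     # ns, full ALU (quoted above)
    DECODE_DELAY = 14.5  # ns, register file + register access + muxes

    single_execute = {
        "decode": DECODE_DELAY,
        "execute (full ALU)": ALU_DELAY + REG_DELAY,  # 18ns before other gates
    }
    split_execute = {
        "decode": DECODE_DELAY,
        "execute 1 (half ALU)": ALU_DELAY / 2 + REG_DELAY,  # 10.5ns
        "execute 2 (half ALU)": ALU_DELAY / 2 + REG_DELAY,  # 10.5ns
    }

    for name, stages in (("single execute", single_execute),
                         ("split execute", split_execute)):
        period = max(stages.values())
        print(f"{name}: min period {period:.1f} ns -> {1000 / period:.1f} MHz")

Splitting the ALU moves the cycle-time floor from the 18ns execute stage down to the 14.5ns decode stage, which is also why pipelining the memory stage would not have helped.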

Dividing the execute stage in half decreased the cycle time, allowing instructions to complete faster. The next step was to execute more instructions at the same time, so we implemented a 2-way superscalar core to take advantage of instructions that can execute in parallel. To complement that core, we built a comprehensive memory hierarchy able to handle the increased demands of parallel instruction issue.

Memory Enhancements

With clock rate in mind, we set out to build a memory system with an access time equal to that of the register file, which is one of the fundamental clock rate limits of our pipeline stages. To take advantage of spatial and temporal locality, a stream buffer and a victim cache were added. The stream buffer exploits spatial locality by using burst reads to prefetch 2 extra instructions per DRAM access. The victim cache exploits temporal locality by keeping cache lines that would normally be thrown out of the cache around longer. A 2-cache-line write buffer was also added to buffer memory writes and keep the pipeline going instead of stalling for memory write-backs. Testing showed that the memory system can run at a synchronous clock rate of 100MHz. This result met our goal of making memory access match the critical path of the other parts of the pipeline.

Adding these components makes a memory read more complicated, because an L1 cache read miss must search the stream buffer, victim cache, and write buffer for a match before making a request to the DRAM controller. This did not hurt L1 cache hit time, because we used a 3-port interface to the arbiter in which the data cache is given its own distinct read and write ports. This allows the victim cache, stream buffer, and write buffer to search for hits at the same time as the L1 cache. Hits in these additional memory components do, however, carry a 1-cycle penalty, because they sit behind the L1 cache rather than parallel to it.
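
The resulting read path can be summarized as follows; this is a minimal Python sketch under our own assumptions (dictionary-like structures standing in for the real hardware, and a hypothetical dram object), not the actual design files:

    def read(addr, l1, victim, stream_buf, write_buf, dram):
        """Return (data, latency in cycles) for a load."""
        if addr in l1:
            return l1[addr], 1            # L1 hit: 10ns, no penalty
        # The victim cache, stream buffer, and write buffer are searched in
        # parallel with the L1 via the 3-port arbiter interface, but sitting
        # behind the L1 costs one extra cycle on a hit.
        for backup in (victim, stream_buf, write_buf):
            if addr in backup:
                return backup[addr], 2
        return dram.read(addr), 6         # full miss: 60ns at a 10ns cycle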

Branching

When deciding whether a branch is taken, the processor must access the register file and then compare the values. Assuming standardized times, those accesses would take 10ns + 10ns = 20ns. This meant that if we branched out of the decode stage, the minimum clock period would be 20ns plus extra gate delays. So we decided to move the compare into the execute stage and branch from there. This meant we would have to implement branch prediction or assume branch-not-taken for all instructions fetched after the delay slot. Moreover, because we were already executing two instructions in parallel, we could encounter the case where a branch sits at an even instruction address and its delay slot at the odd address; by that point another instruction would already be in the IF stage, so even deciding branches in the decode stage would imply some form of branch prediction and recovery. Moving the branch decision later increased the penalty for mispredicted branches, so we implemented branch prediction to amortize that penalty. We decided to index branches in the target buffer by their delay slot addresses; this guarantees that when a prediction is made, the predicted target is fetched on the very next cycle.
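
The delay-slot indexing can be illustrated with a small Python sketch; the direct-mapped 8-entry organization and the field names here are our own illustrative assumptions:

    BTB_ENTRIES = 8
    btb = [None] * BTB_ENTRIES  # each entry: (tag, target, taken)

    def predict(delay_slot_pc):
        """Look up a prediction keyed by the delay slot's address, so a hit
        always redirects fetch on the very next cycle."""
        entry = btb[(delay_slot_pc >> 2) % BTB_ENTRIES]  # word-aligned PCs
        if entry is not None:
            tag, target, taken = entry
            if tag == delay_slot_pc and taken:
                return target          # predicted taken: fetch target next
        return delay_slot_pc + 4       # otherwise fall through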

JR Optimization

Because the memory stage contained our critical path, we were careful not to forward data from that stage. In particular, JR stood to add several extra mux and gate delays to the critical path if it forwarded from the memory stage: JR needs to forward its value to the decode stage and then through a mux in front of the PC register. Instead, we let JR stall one extra cycle, until the producing instruction writes back to the register file, and read its register directly from the register file on the following clock cycle. This special case kept the cycle time lower: by delaying an unlikely sequence of events that would otherwise lengthen the cycle time, we increased the overall throughput of the processor.
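
The stall condition amounts to holding JR in decode while its source register is still in flight; a minimal sketch, with hypothetical stage objects carrying a dest field:

    def jr_must_stall(jr_src_reg, ex1, ex2, mem):
        """Hold JR until the producer has written back, so JR can read the
        register file directly instead of forwarding from the memory stage."""
        return any(stage is not None and stage.dest == jr_src_reg
                   for stage in (ex1, ex2, mem))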

LW – LW Optimization

While working out issue dependencies, we had to decide how to handle two memory operations issued in the same cycle. Because the cache is not dual ported, the obvious approach is to stall the instruction at the odd address. However, we observed that instruction sequences such as lw $r2, 0($r1) followed by lw $r3, 4($r1) can hit in the same cache line. Because the cache already returns a 64-bit cache line that we subsequently mux in half, we realized we could service such parallel lw operations together. Therefore, we postponed the lw-lw stall logic to the memory stage, stalling only when the two loads fall in different cache lines.
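
The deferred check itself is simple; a minimal sketch, assuming only the 64-bit (8-byte) line returned by our cache:

    LINE_BYTES = 8  # the cache returns a 64-bit line

    def lw_lw_stall(addr_even, addr_odd):
        """Two loads issued together proceed in parallel only when they fall
        in the same cache line; otherwise the odd-slot load stalls."""
        return (addr_even // LINE_BYTES) != (addr_odd // LINE_BYTES)

    # lw $r2, 0($r1) / lw $r3, 4($r1) with $r1 = 0x1000: same line, no stall
    assert not lw_lw_stall(0x1000, 0x1004)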

Critical Path and Timing Information

A hit in the cache takes 10ns.
A cache miss that hits in the victim cache, write buffer, or stream buffer costs one extra cycle.
A cache miss that also misses in the stream buffer, victim cache, and write buffer costs 60ns.
The memory read throughput is 128 bits per 175ns.
The memory write throughput is 64 bits per 125ns.
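
These latencies combine into an average access time once hit rates are known; the rates below are hypothetical placeholders for illustration, not measurements:

    HIT_NS = 10.0        # L1 hit
    SECONDARY_NS = 20.0  # miss caught by victim/stream/write buffer (+1 cycle)
    FULL_MISS_NS = 60.0  # miss everywhere

    def avg_access_ns(l1_hit_rate, secondary_catch_rate):
        miss = 1.0 - l1_hit_rate
        return (l1_hit_rate * HIT_NS
                + miss * secondary_catch_rate * SECONDARY_NS
                + miss * (1.0 - secondary_catch_rate) * FULL_MISS_NS)

    print(avg_access_ns(0.90, 0.50))  # 90% L1 hits, half of misses caught: 13.0ns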

Top 3 critical paths:

In the memory stage:
Cache + tri-state + mux2x32 (forwarding) + register = 10 + 1.5 + 1.5 + 3 = 16ns

In the decode stage:
Register file + mux2x32 + tri-state + register = 10 + 1.5 + 1.5 + 3 = 16ns

In the execute 1 stage:
Comparator + gate logic + gate logic + tri-state + register = 4 + 4 + 1 + 1 + 3 = 13ns

Performance Information

In terms of pure MHz, our lab 7 processor is about 3 times faster than the lab 6 one. However, as we increase the speed of the processor, we become more and more bound by memory performance. To counteract that effect, many enhancements were made to the memory subsystem. In spite of these enhancements, initial testing with the lab 5 mystery program showed that our CPI will probably increase anyway (since the lab 5 mystery program has no real loops, actual performance could be better on other programs). Final results for our processor could not be obtained because it still does not run the test programs correctly. The branch predictor, 2-way issue, forwarding, and dependency stall units appear to work with ideal memory; however, we encounter problems when memory stalls are present. Hand analysis shows the critical path to be around 16ns (62.5MHz). Overall run time of a program should be faster despite the increase in CPI because of the clock rate increase, and the enhanced memory system should decrease run time further.

Testing Methodologies

Many pieces of the datapath could be tested independently before being combined into the full datapath. For instance, the branch target buffer and every component of the memory hierarchy were tested extensively before being added to the datapath. Furthermore, our datapath is divided into blocks, one per pipeline stage, which let us test a single block before wiring it into the datapath. After the datapath was assembled from the individual stages, we placed ideal memories in place of each cache and simulated programs. This let us test the forwarding logic from each stage for all instructions; it is actually a more efficient way to test forwarding, because execution with real memory causes too many stalls for forwarding to even occur. Our final test placed the memory hierarchy back in place of the ideal memories and simulated the entire processor. To make testing easier, we rewrote our processor monitor and added special monitor registers to the pipeline to track debugging information at each stage, so we could see which instruction was in each stage of both 6-stage pipelines at all times. We used previous test scripts and mystery programs to test this processor, and also wrote special test cases for unusual circumstances that only occur in superscalar and branch-predicting processors.
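
The monitor registers amount to one slot per stage per pipe; a minimal sketch of the idea (the stage names and output format are our own assumptions):

    STAGES = ["IF", "ID", "EX1", "EX2", "MEM", "WB"]

    def snapshot(cycle, even_pipe, odd_pipe):
        """Print the instruction occupying each stage of both 6-stage pipes;
        each pipe maps stage name -> disassembled instruction or None."""
        for name, pipe in (("even", even_pipe), ("odd", odd_pipe)):
            row = " | ".join(f"{s}:{pipe.get(s) or 'bubble'}" for s in STAGES)
            print(f"cycle {cycle:5d} [{name}] {row}")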

Appendix: Schematics

Main datapath
Instruction Fetch
Decode stage (page 1)
Decode stage (page 2)
EXE stage 1
Memory stage
First page of cache
First page of victim cache
Write buffer
First page of the pipelined ALU