Architecture of Aca
8/10/2019
Branch Prediction
BTB basics, return address prediction, correlating prediction
Why? What??
All the unwanted, creepy things that happen in a pipeline, stalls and the rest, are bound up with one another.
We want to break this bond and achieve our ultimate aim: the ideal pipeline CPI, which of course we know is impossible.
What is the human tendency? We challenge.
So here I am, a human, thinking about handling branch problems and such. Actually, Hennessy and Patterson have already done it; I am redoing it.
Speculation: what do you mean by that?
Speculative execution is an optimization technique where a computer system performs some task that may not actually be needed. The main idea is to do work before it is known whether that work will be needed at all, so as to prevent the delay that would be incurred by doing the work after it is known whether it is needed. If it turns out the work was not needed after all, any changes made by the work are reverted and the results are ignored.
Branch Target Buffer
Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken). Note: we must check for a branch match now, since we can't use a wrong branch address.
[Figure: a BTB combined with a BHT. The PC of the instruction being fetched is compared (=?) against the stored branch PCs; each entry holds a predicted PC plus optional extra prediction state bits. Yes: the instruction is a branch, so use the predicted PC as the next PC. No: the branch is not predicted, so proceed normally (next PC = PC + 4).]
What does the BTB do?
The PC of the instruction being fetched is matched against a set of instruction addresses stored in the first column; these represent the addresses of known branches. If the PC matches one of these entries, then the instruction being fetched is a taken branch, and the second field, predicted PC, contains the prediction for the next PC after the branch. Fetching begins immediately at that address. The third field, which is optional, may be used for extra prediction state bits.
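The lookup described above can be sketched in a few lines. This is a minimal, illustrative model (the direct-mapped indexing, 16-entry size, and 2-bit state default are assumptions, not from the slides):

```python
# Minimal sketch of a Branch Target Buffer (illustrative parameters).
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        # Each slot holds (branch_pc, predicted_pc, state_bits) or None.
        self.table = [None] * entries

    def lookup(self, pc):
        """Return the predicted next PC, or pc + 4 on a miss."""
        slot = self.table[(pc // 4) % self.entries]
        if slot is not None and slot[0] == pc:   # tag match: known taken branch
            return slot[1]                       # fetch from the predicted PC
        return pc + 4                            # not predicted: fall through

    def update(self, pc, target, state=0b11):
        """Install or refresh an entry after a branch resolves as taken."""
        self.table[(pc // 4) % self.entries] = (pc, target, state)

btb = BTB()
btb.update(0x100, 0x200)            # branch at 0x100 jumps to 0x200
assert btb.lookup(0x100) == 0x200   # hit: use predicted PC
assert btb.lookup(0x104) == 0x108   # miss: next PC = PC + 4
```

A real BTB would be set-associative and would clear or retrain entries on mispredictions; this sketch only shows the match-then-redirect behaviour.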
Return Address Prediction
Why?
A register-indirect branch is hard to predict: many callers, one callee. A single instruction jumps to multiple return addresses (no PC-target correlation). In SPEC89, 85% of such branches are procedure returns.
Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate.
HOW??? The return address stack (RAS)
Return address stacks are also very simple: they are fixed-size stacks of return addresses.
To use a return address stack, we push PC+4 onto the stack when we execute a procedure-call instruction. This pushes the return address of the call instruction onto the stack: when the call is finished, it will return to PC+4 of the procedure-call instruction. When we execute a return instruction, we pop an address off the stack and predict that the return instruction will return to the popped address.
Since return instructions almost always return to the last procedure-call instruction, return address stacks are highly accurate.
Remember that return address stacks only generate predictions for return instructions. They don't help at all for procedure-call instructions (we use the BTB to predict calls).
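The push-on-call, pop-on-return behaviour above can be sketched directly. The 8-entry size is an assumption within the 8-16 range mentioned earlier, and discarding the oldest entry on overflow is one common policy, not the only one:

```python
from collections import deque

# Sketch of a fixed-size return address stack (illustrative size/policy).
class ReturnAddressStack:
    def __init__(self, size=8):
        # maxlen silently discards the oldest entry when the stack overflows.
        self.stack = deque(maxlen=size)

    def on_call(self, pc):
        # A call at `pc` will return to pc + 4, so push that address.
        self.stack.append(pc + 4)

    def predict_return(self):
        # Pop the most recent return address; None if the stack is empty.
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1000)                     # outer call
ras.on_call(0x2000)                     # nested call
assert ras.predict_return() == 0x2004   # inner call returns first
assert ras.predict_return() == 0x1004   # then the outer one
```

The nesting in the example is why a stack is the right structure: returns unwind in the reverse order of the calls.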
Problems
Suppose I have a standard 5-stage pipeline where branches and jumps are resolved in the execute stage. 20% of my instructions are branches, and they're taken 60% of the time. 5% of my instructions are return instructions (return instructions are not branches!).
What's the CPI of my system if I always stall my processor on branches and jumps? What if I always predict that branches and jumps are taken?
My lazy idea: waste no thought on CPI, chuck parallelism, and enjoy serial execution.
But again, the human tendency? I am a human; I don't give up. That kind of attitude.
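One possible worked reading of the exercise, under assumptions the slide leaves open: resolution in EX of a 5-stage pipeline costs 2 bubble cycles per redirect, returns count as jumps, and predict-taken delivers the correct target whenever the branch is in fact taken (returns are register-indirect, so we assume they always pay the penalty):

```python
# Worked CPI estimate (interpretive assumptions stated in the lead-in).
penalty = 2                 # cycles lost per redirect (resolve in EX)
f_branch, f_return = 0.20, 0.05
p_taken = 0.60

# (a) Always stall on branches and jumps: every one pays the full penalty.
cpi_stall = 1 + (f_branch + f_return) * penalty
assert abs(cpi_stall - 1.5) < 1e-9

# (b) Predict taken: branches mispredict when not taken (40% of the time);
# returns have an unknown target, so we assume they always pay.
cpi_taken = 1 + f_branch * (1 - p_taken) * penalty + f_return * penalty
assert abs(cpi_taken - 1.26) < 1e-9
```

Under these assumptions, stalling gives CPI 1.5 and predict-taken gives CPI 1.26; a different penalty model (e.g. resolving in ID, or charging taken branches for target computation) would change the numbers.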
What to do?
Integrated Instruction Fetch Units
To meet the demands of multiple-issue processors, many recent designers have chosen to implement an integrated instruction fetch unit: a separate, autonomous unit that feeds instructions to the rest of the pipeline. Essentially, this amounts to recognizing that characterizing instruction fetch as a simple single pipe stage is no longer valid, given the complexities of multiple issue.
Instead, recent designs have used an integrated instruction fetch unit that integrates several functions:
1. Integrated branch prediction. The branch predictor becomes part of the instruction fetch unit and is constantly predicting branches, so as to drive the fetch pipeline.
2. Instruction prefetch. To deliver multiple instructions per clock, the instruction fetch unit will likely need to fetch ahead. The unit autonomously manages the prefetching of instructions (see Chapter 5).
Because Patterson says so.
3. Instruction memory access and buffering. When fetching multiple instructions per cycle, a variety of complexities are encountered, including the difficulty that fetching multiple instructions may require accessing multiple cache lines. The instruction fetch unit encapsulates this complexity, using prefetch to try to hide the cost of crossing cache blocks. The instruction fetch unit also provides buffering, essentially acting as an on-demand unit that provides instructions to the issue stage as needed and in the quantity needed.
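The buffering role described above amounts to a queue between fetch and issue. A minimal sketch (the 4-wide issue width and function names are illustrative assumptions):

```python
from collections import deque

# Sketch of the fetch buffer inside an integrated fetch unit:
# fetch runs ahead and fills the queue; issue drains it on demand.
fetch_buffer = deque()

def prefetch(instructions):
    # The fetch unit runs ahead, possibly across cache-line boundaries,
    # and deposits instructions into the buffer.
    fetch_buffer.extend(instructions)

def issue(width=4):
    # The issue stage takes up to `width` instructions, as available.
    return [fetch_buffer.popleft()
            for _ in range(min(width, len(fetch_buffer)))]

prefetch(["i0", "i1", "i2", "i3", "i4", "i5"])
assert issue() == ["i0", "i1", "i2", "i3"]
assert issue() == ["i4", "i5"]   # the buffer supplies what it has
```

The point of the buffer is exactly what the second call shows: issue gets as many instructions as are ready, decoupled from the rhythm of cache accesses on the fetch side.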
Terminology??? ROB
Which place: a bank or a mint??
Sorry, it's the re-order buffer.
So what do you mean by that??
A re-order buffer (ROB) is used in Tomasulo's algorithm for out-of-order instruction execution. It allows instructions to be committed in order.
Normally, there are three stages for an instruction: "Issue", "Execute", "Write Result". In Tomasulo's algorithm there is an additional stage, "Commit". In this stage, the results of instructions are stored in a register or memory. In the "Write Result" stage, the results are just put in the re-order buffer. All contents of this buffer can then be used when executing other instructions that depend on these results.
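The key property, completion in any order but commit in program order, can be sketched as a queue that only retires from its head. This is a toy model (the entry layout and function names are illustrative, not Tomasulo's actual data structures):

```python
from collections import deque

# Toy sketch of in-order commit through a re-order buffer.
regfile = {}
rob = deque()                       # entries: [dest_reg, value, done]

def issue(dest):
    entry = [dest, None, False]
    rob.append(entry)               # allocate at the tail, in program order
    return entry

def write_result(entry, value):     # "Write Result": out-of-order completion
    entry[1], entry[2] = value, True

def commit():                       # "Commit": retire only from the head
    while rob and rob[0][2]:
        dest, value, _ = rob.popleft()
        regfile[dest] = value       # architectural state updated in order

i1 = issue("r1")
i2 = issue("r2")
write_result(i2, 42)    # i2 finishes first...
commit()
assert regfile == {}    # ...but cannot commit past the unfinished i1
write_result(i1, 7)
commit()
assert regfile == {"r1": 7, "r2": 42}
```

The empty register file after the first `commit()` is the whole idea: the ROB holds i2's result until everything older has also finished.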
Register Renaming versus Reorder Buffers
With the addition of speculation, register values may also temporarily reside in the ROB. In either case, if the processor does not issue new instructions for a period of time, all existing instructions will commit, and the register values will appear in the register file, which directly corresponds to the architecturally visible registers.
So what? In the register-renaming approach, an extended set of physical registers is used to hold both the architecturally visible registers and temporary values. Thus, the extended registers replace the function of both the ROB and the reservation stations. During instruction issue, a renaming process maps the names of architectural registers to physical register numbers in the extended register set, allocating a new, unused register for the destination.
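The rename step at issue can be sketched as a map plus a free list. Register counts and names here are illustrative assumptions:

```python
# Sketch of register renaming at issue time (illustrative sizes/names).
free_list = [f"p{i}" for i in range(8)]   # unused physical registers
rename_map = {}                            # architectural reg -> physical reg

def rename(dest, *sources):
    # Sources read the current mapping; the destination gets a fresh
    # physical register, hiding the old value until commit frees it.
    srcs = tuple(rename_map[s] for s in sources)
    new_p = free_list.pop(0)
    old_p = rename_map.get(dest)    # freed later, when the write commits
    rename_map[dest] = new_p
    return new_p, srcs, old_p

rename_map["r1"] = free_list.pop(0)        # p0 holds the initial r1
p, srcs, old = rename("r1", "r1")          # e.g. an add that reads/writes r1
assert srcs == ("p0",)                     # the source reads the old mapping
assert p == "p1" and old == "p0"           # the dest gets a fresh register
```

Note how the instruction reads the old physical register while writing a new one: that separation is what removes WAR and WAW hazards and, as the next slide argues, what makes commit so cheap.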
An advantage??
An advantage of the renaming approach versus the ROB approach is that instruction commit is simplified, since it requires only two simple actions: record that the mapping between an architectural register number and a physical register number is no longer speculative, and free up any physical register being used to hold the older value of the architectural register. In a design with reservation stations, a station is freed up when the instruction using it completes execution, and a ROB entry is freed up when the corresponding instruction commits.
How do we ever know which registers are the architectural registers if they are constantly changing? Most of the time, when the program is executing, it does not matter. There are clearly cases, however, where another process, such as the operating system, must be able to know exactly where the contents of a certain architectural register reside. To understand how this capability is provided, assume the processor does not issue instructions for some period of time. Eventually all instructions in the pipeline will commit, and the mapping between the architecturally visible registers and physical registers will become stable. At that point, a subset of the physical registers contains the architecturally visible registers, and the value of any physical register not associated with an architectural register is unneeded. It is then easy to move the architectural registers to a fixed subset of physical registers so that the values can be communicated to another process.
How Much to Speculate
One of the significant advantages of speculation is its ability to uncover events that would otherwise stall the pipeline early, such as cache misses. This potential advantage, however, comes with a significant potential disadvantage. Speculation is not free: it takes time and energy, and the recovery of incorrect speculation further reduces performance. In addition, to support the higher instruction execution rate needed to benefit from speculation, the processor must have additional resources, which take silicon area and power. Finally, if speculation causes an exceptional event to occur, such as a cache or TLB miss, the potential for significant performance loss increases if that event would not have occurred without speculation.
To maintain most of the advantage while minimizing the disadvantages, most pipelines with speculation allow only low-cost exceptional events (such as a first-level cache miss) to be handled in speculative mode. If an expensive exceptional event occurs, such as a second-level cache miss or a translation lookaside buffer (TLB) miss, the processor waits until the instruction causing the event is no longer speculative before handling it. Although this may slightly degrade the performance of some programs, it avoids significant performance losses in others, especially those that suffer from a high frequency of such events coupled with less-than-excellent branch prediction.
Accuracy of Return Address Predictor