Post on 20-Dec-2015
XBC - eXtended Block Cache
Lihu Rappoport
Stephan Jourdan
Yoav Almog
Mattan Erez
Adi Yoaz
Ronny Ronen
Intel Corporation
The Frontend
Frontend goal: supply instructions to execution
– Predict which instructions to fetch
– Fetch the instructions from cache / memory
– Decode the instructions
– Deliver the decoded instructions to execution
[Figure: the processor, with Frontend, Memory and Execution blocks connected by Instructions and Data paths]
Requirements from the Frontend
High bandwidth
Low latency
The Traditional Solution: Instruction Cache
Basic unit: cache line
– A sequence of consecutive instructions in memory
Deficiencies:
– Low bandwidth
  – Jump into the line
  – Jump out of the line
– High latency
  – Instructions need decoding
[Figure: a cache line with one jmp entering the line and one jmp leaving it]
Trace Cache
TC goals: high bandwidth & low latency
Basic unit: trace
– A sequence of dynamically executed instructions
Instructions are decoded into uops
– Fixed-length, RISC-like instructions
Traces have a single entry and multiple exits
– Trace tag/index is derived from the starting IP
[Figure: a trace spanning several jmp instructions up to the trace end condition]
Redundancy in the TC
Code: if (cond) A; B
Possible traces: (i) AB (ii) B
Block B is stored in both traces
Space inefficiency → low hit rate
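The redundancy argument can be illustrated with a minimal sketch. The block sizes below (A = 3 uops, B = 5 uops) are hypothetical values chosen for illustration, not figures from the slides; the point is only that storing each trace in full duplicates block B.

```python
# Hypothetical uop counts per basic block (illustrative, not from the slides).
BLOCK_UOPS = {"A": 3, "B": 5}

# The two traces that "if (cond) A; B" can produce.
TRACES = [("A", "B"), ("B",)]

def stored_uops(traces):
    """Uops a trace cache stores when every trace is kept in full."""
    return sum(BLOCK_UOPS[b] for trace in traces for b in trace)

def unique_uops(traces):
    """Uops needed if each block were stored only once."""
    blocks = {b for trace in traces for b in trace}
    return sum(BLOCK_UOPS[b] for b in blocks)

# Uops that are stored more than once.
redundant = stored_uops(TRACES) - unique_uops(TRACES)
```

With these assumed sizes, the redundant uops are exactly the second copy of B.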
XBC Goals
High bandwidth
Low latency
High hit rate
XBC - eXtended Block Cache
Basic unit: XB - eXtended Block
[Figure: an XB example containing jcc and jmp instructions]
XB features
– Multiple entry, single exit
– Tag / index derived from ending instruction IP
– Instructions are decoded
XB end conditions
– Conditional or indirect branches
– Call/Return
– Quota (16 uops)
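The end conditions above can be sketched as a splitter over a dynamic instruction stream. This is a rough illustration under stated assumptions: the `Inst` type, the kind labels and the rule that an instruction is never split across two XBs are hypothetical; only the end conditions and the 16-uop quota come from the slide.

```python
from dataclasses import dataclass

QUOTA = 16  # uop quota per XB, from the slide

# Instruction kinds that end an XB; the kind names are hypothetical labels.
XB_ENDERS = {"jcc", "indirect", "call", "ret"}

@dataclass
class Inst:
    ip: int
    kind: str      # e.g. "alu", "jmp", "jcc", "indirect", "call", "ret"
    uops: int = 1  # number of uops the instruction decodes into

def split_into_xbs(stream):
    """Split a dynamic instruction stream into XBs (lists of Inst)."""
    xbs, current, count = [], [], 0
    for inst in stream:
        # Assumed: close the block first if this instruction would exceed the quota.
        if current and count + inst.uops > QUOTA:
            xbs.append(current)
            current, count = [], 0
        current.append(inst)
        count += inst.uops
        if inst.kind in XB_ENDERS or count == QUOTA:
            xbs.append(current)
            current, count = [], 0
    if current:
        xbs.append(current)
    return xbs

def xb_tag(xb):
    """An XB is tagged by the IP of its ending instruction."""
    return xb[-1].ip
```

Note that an unconditional `jmp` does not appear in `XB_ENDERS`, so the block continues through it.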
XBC Fetch Bandwidth
Fetch multiple XBs per cycle
– A conditional branch ends an XB
– Need to predict only 1 branch per XB
– Predicting 2 branches/cycle → fetch 2 XBs/cycle
Promote 99% biased conditional branches*
– Build longer XBs
– Maximize XBC bandwidth for a given #predictions/cycle
[Figure: an XB containing jcc and jmp instructions, with one jcc marked as 99% biased]
*[Patel 98]
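A minimal sketch of how promotion lengthens XBs, assuming a promoted branch simply stops acting as an end-of-XB condition. The function names and the per-branch taken/not-taken counters are hypothetical; only the 99% threshold is from the slide, and the 16-uop quota is ignored to keep the example short.

```python
PROMOTE_BIAS = 0.99  # bias threshold from the slide

def is_promoted(taken, not_taken, threshold=PROMOTE_BIAS):
    """True if a conditional branch goes one way at least `threshold` of the time."""
    total = taken + not_taken
    if total == 0:
        return False  # no history yet: treat as a normal branch
    return max(taken, not_taken) / total >= threshold

def xb_lengths(branch_gaps, promoted):
    """Merge blocks across promoted branches.

    branch_gaps[i] is the uop count up to and including branch i;
    promoted[i] says whether branch i is promoted (does not end the XB).
    """
    lengths, run = [], 0
    for gap, prom in zip(branch_gaps, promoted):
        run += gap
        if not prom:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths
```

For example, `xb_lengths([4, 5, 6], [False, True, False])` merges the second and third blocks into one longer XB.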
XB Length
Block type   Description                    Average length
BB           basic block                    7.7
XB           don't break on unconditional   8.0
XBp          XB + promotion                 10.0
DBL          group 2 XBp                    12.7
[Figure: distribution of block lengths from 1 to 16 for BB, XB, XBp and DBL; y-axis from 0% to 50%]
XBC Structure
A banked structure which supports
– Variable length XBs (minimize fragmentation)
– Fetching multiple XBs/cycle
[Figure: Bank0 to Bank3, each 4 uops wide, feeding a Reorder & Align stage]
Support Variable Length XBs
An XB may spread over several banks in the same set
[Figure: bank0 to bank3 feeding Reorder & Align, with lines labeled 0 and 1 placed in different banks]
Support Fetching 2 XBs/cycle
Data may be received from all banks in the same cycle
[Figure: bank0 to bank3 feeding Reorder & Align; lines labeled 0 and 1 are read from all four banks in one cycle]
Support Fetching 2 XBs/cycle
Actual bandwidth may sometimes be less than 4 banks per cycle
[Figure: bank0 to bank3 feeding Reorder & Align; lines labeled 0 and 1 are read from only three banks in one cycle]
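The bandwidth loss on the two fetch slides can be sketched with bank masks. This is an illustration under assumptions that the slides do not state explicitly: each bank delivers at most one line per cycle, and the first XB has priority over the second when both need the same bank.

```python
NUM_BANKS = 4
UOPS_PER_BANK = 4  # each bank line holds 4 uops, per the structure slide

def fetch_two_xbs(mask0, mask1):
    """Return (banks read for XB0, banks read for XB1) in one cycle.

    Masks are bit vectors: bit i set means the XB has a line in bank i.
    Assumption: a bank delivers one line per cycle and XB0 has priority.
    """
    all_banks = (1 << NUM_BANKS) - 1
    granted0 = mask0 & all_banks
    granted1 = mask1 & ~mask0 & all_banks
    return granted0, granted1

def banks_delivered(mask0, mask1):
    """Number of banks that deliver a line this cycle."""
    g0, g1 = fetch_two_xbs(mask0, mask1)
    return bin(g0 | g1).count("1")
```

When the two masks are disjoint, all four banks deliver; when they overlap, the second XB loses the shared bank for that cycle.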
Reordering and Aligning Uops
[Figure: the outputs of bank0 to bank3 pass through Mux 1 (Reorder Banks) and then Mux 2 (Align Uops); empty uops are marked in the lines]
XBC Structure
The average XB length is > 8 uops
→ a 16-uop line holds fewer than 2 XBs, so it is less than 2-XB set associative
[Figure: bank0 to bank3 feeding Reorder & Align; one 16-uop line holding lines labeled 0, 1 and 2]
XBC Structure
The average XB length is > 8 uops
→ make each bank set-associative
[Figure: bank0 to bank3 feeding Reorder & Align; lines labeled 0, 1 and 2 placed in set-associative banks]
The XBTB
The XBTB provides the next XB for each XB
– XBs are indexed according to ending IP
  → Cannot directly look up the next IP in the XBC
  → The XBC can only be accessed using the XBTB
XBTB provides the info needed to access the next XB
– The IP of the next XB
  – Defines the set in which the XB resides
– A masking vector, indicating the banks in which the XB resides
– The #uops counted backward from the end of the XB
  – Defines where to enter the XB
XBTB provides the next 2 XBs
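The three pieces of information listed above can be pictured as one record per successor XB. The sketch below is hypothetical: the field names, the number of sets and the modulo index function are assumptions for illustration, since the slides only name the fields and their roles.

```python
from dataclasses import dataclass

NUM_SETS = 64  # hypothetical; the slides do not give the number of sets

@dataclass
class XBTBEntry:
    next_ip: int        # ending IP of the next XB (selects the set)
    bank_mask: int      # bit i set: the XB has a line in bank i
    uops_from_end: int  # entry point, counted backward from the end of the XB

def set_index(ip, num_sets=NUM_SETS):
    """Assumed index function: low bits of the ending IP."""
    return ip % num_sets

def uops_to_deliver(xb_uops, entry):
    """Uops fetched when entering an XB `uops_from_end` uops before its end.

    xb_uops is given in program order.
    """
    return xb_uops[len(xb_uops) - entry.uops_from_end:]
```

Counting the entry point from the end is what lets different predecessors enter the same XB at different places while sharing one tag.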
XBC Structure: the whole picture
[Figure: block diagram with Memory / Cache, BTB, Decoder and Fill Unit on one side and XBTB, XBC, Priority Encode and XBQ on the other; paths are labeled Build mode and Delivery mode]
XB Build Algorithm
XBTB lookup fails → build a new XB into the fill buffer
End-of-XB condition reached → look up the XBC for the new XB
– No match → store the new XB in the XBC, and update the XBTB
– Match → there are three cases:
  – XBnew is contained in XBexist → update XBTB
  – XBnew contains XBexist → extend XBexist, update XBTB
  – XBnew and XBexist have the same suffix but different prefixes → complex XB, update XBTB
[Figure: for each case, XBnew and XBexist drawn side by side, both ending at IP1]
The XBC has NO redundancy
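The case analysis of the build algorithm can be sketched as a classifier over two uop sequences that end at the same IP. This is an illustrative reading of the slide, not the authors' implementation; the function names and the returned labels are hypothetical.

```python
def common_suffix_len(a, b):
    """Number of trailing uops that two uop sequences share."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def build_action(xb_new, xb_exist):
    """Classify what the fill unit does with a newly built XB.

    xb_exist is the XB found in the XBC under the same ending IP, or None.
    Sequences are in program order; the returned strings are informal labels.
    """
    if xb_exist is None:
        return "store new XB, update XBTB"
    shared = common_suffix_len(xb_new, xb_exist)
    if shared == len(xb_new):
        return "update XBTB"                  # XBnew is contained in XBexist
    if shared == len(xb_exist):
        return "extend XBexist, update XBTB"  # XBexist is a suffix of XBnew
    return "complex XB, update XBTB"          # same suffix, different prefix
```

In every branch the shared suffix is stored once, which is the no-redundancy property the slide emphasizes.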
Complex XBs
XBnew and XBexist have the same suffix but different prefixes:
– Possible solution, complying with no-redundancy (the wrong way)
– Drawback: we get 2 short XBs instead of a single long XB
[Figure: Wrong Way. XBexist and XBnew both end at IP1; the result is XBexist plus a separate Prefixnew]
Complex XBs
XBnew and XBexist have the same suffix but different prefixes:
– Second solution: a single "complex XB" (the right way)
Complex XBs: no redundancy, but still high bandwidth
[Figure: Right Way. Prefixcur and Prefixnew both lead into one shared Suffix ending at IP1]
Extending an Existing XB
An XB can only be extended at its beginning
If we store an XB in the usual way, then when the XB is extended we need to move all its uops
Since the existing uops move, the pointers in the XBTB become stale
[Figure: uops 1 to 9 stored in forward order across bank0 to bank3; adding uop 0 at the beginning shifts every uop]
Storing Uops in Reverse Order
The solution is to store the uops of an XB in reverse order
– XB IP is the IP of the ending instruction → extending the XB does not change the XB IP
– When an XB is extended, there is no need to move uops
[Figure: uops 1 to 9 stored in reverse order across bank0 to bank3; uop 0 is added without moving the others]
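A minimal sketch of why reverse-order storage keeps existing pointers valid. The class and method names are hypothetical; the 4-uop line width is taken from the structure slide. Slot 0 always holds the ending uop, so an entry point counted backward from the end refers to the same slots before and after an extension.

```python
UOPS_PER_LINE = 4  # bank line width from the structure slide

class ReversedXB:
    """Uops kept last-first, so slot 0 always holds the ending uop."""

    def __init__(self, uops_in_program_order):
        self.slots = list(reversed(uops_in_program_order))

    def extend_at_beginning(self, earlier_uops):
        # New uops precede the XB in program order, so in reversed storage
        # they are appended after the existing slots; nothing moves.
        self.slots.extend(reversed(earlier_uops))

    def enter(self, uops_from_end):
        """Uops delivered when entering `uops_from_end` uops before the end."""
        return list(reversed(self.slots[:uops_from_end]))

    def lines(self):
        """Slots grouped into bank-line-sized chunks."""
        return [self.slots[i:i + UOPS_PER_LINE]
                for i in range(0, len(self.slots), UOPS_PER_LINE)]
```

Extending the XB built from uops 1 to 9 with uop 0 only fills the free slot in the last line; the first two lines are untouched.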
Set Search
An XB is replaced and then placed again
– Not in the same set → different XB
– Same set, same banks → no problem
– Same set but not in the same banks → XBTB entries which point to the old location of the XB are erroneous
Solution - Set Search
– On an XBTB hit & XBC miss, try to locate the XB in other banks in the same set
– Calculate a new mask according to the offset
– Only a small penalty: cycle loss, but no switch to build mode
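Set search can be sketched as a fallback lookup over all banks of one set. The model below is a simplification with several assumptions: one line per bank per set, a one-cycle penalty, and a mask rebuilt by scanning tags rather than computed from an offset as the slide describes. All names are hypothetical.

```python
NUM_BANKS = 4

def lookup(set_lines, tag, mask):
    """True if every bank selected by `mask` holds a line of XB `tag`."""
    banks = [b for b in range(NUM_BANKS) if (mask >> b) & 1]
    return bool(banks) and all(set_lines[b] == tag for b in banks)

def set_search(set_lines, tag):
    """Scan all banks of the set and return a repaired mask (0 if not found)."""
    mask = 0
    for bank, line_tag in enumerate(set_lines):
        if line_tag == tag:
            mask |= 1 << bank
    return mask

def fetch(set_lines, tag, xbtb_mask):
    """Return (mask used, extra cycles, action)."""
    if lookup(set_lines, tag, xbtb_mask):
        return xbtb_mask, 0, "deliver"
    new_mask = set_search(set_lines, tag)
    if new_mask:
        # Assumed one-cycle penalty; the frontend stays in delivery mode.
        return new_mask, 1, "deliver"
    return 0, 0, "switch to build mode"
```

A stale mask costs a cycle but still delivers the XB, which is the point of the slide's last bullet.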
XB Replacement
Use LRU among all the lines in a given set
LRU also makes sure that we do not evict a line other than the first line of an XB (a head line)
– There is no point in retaining the head line while evicting another line
  – If we enter the XB at the head line, we will get a miss when we reach the evicted line
  – If a head line is evicted, but we enter the XB in its middle, we may still avoid a miss
A non-head line is always accessed after a head line is accessed → its LRU will be higher → it will not be evicted before the head line
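The ordering argument above can be checked with a small LRU model for one set. The class is a hypothetical sketch, not the hardware mechanism: it only shows that touching the head line first leaves it older than the other lines of the same XB, so it is the first of them to be chosen as a victim.

```python
from collections import OrderedDict

class SetLRU:
    """LRU order over the lines of one set; the oldest entry is evicted first."""

    def __init__(self):
        self.order = OrderedDict()  # line id -> None, oldest first

    def touch(self, line):
        self.order.pop(line, None)
        self.order[line] = None     # most recently used goes last

    def access_xb(self, lines_head_first):
        # The head line is touched first, so later lines end up more recent.
        for line in lines_head_first:
            self.touch(line)

    def victim(self):
        """Least recently used line in the set."""
        return next(iter(self.order))
```

Entering an XB in its middle touches only the non-head lines, which is consistent with the head line ageing out first.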
XB Placement
Build-mode placement algorithm
– New XB is placed in banks such that it does not have a bank conflict with the previous XB (if possible)
– LRU ordering is maintained by switching the LRU line with the non-conflicting line before the new XB is placed
– Set-search repairs the XBTB
Delivery-mode placement algorithm
– Repeating bandwidth losses due to bank conflicts found → conflicting lines are moved to non-conflicting banks
– Each XB is augmented with a counter
  – Incremented when the XB has a bank conflict
  – When the counter reaches a threshold, the conflicting lines are switched with other lines in non-conflicting banks
  – A line can be switched with another line only if its LRU is higher, or if both gain from the switch
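The delivery-mode counter can be sketched as follows. The threshold value and the reset after a switch are assumptions, since the slide gives neither; the class and function names are hypothetical.

```python
CONFLICT_THRESHOLD = 4  # hypothetical; the slides do not give the value

class XBConflictCounter:
    """Per-XB counter used to decide when to move conflicting lines."""

    def __init__(self, threshold=CONFLICT_THRESHOLD):
        self.threshold = threshold
        self.count = 0

    def record_fetch(self, had_bank_conflict):
        """Return True when the conflicting lines should be switched."""
        if had_bank_conflict:
            self.count += 1
        if self.count >= self.threshold:
            self.count = 0  # assumed: start over after a switch is triggered
            return True
        return False

def may_switch(line_lru, other_lru, both_gain):
    """A line may be switched only if its LRU is higher or both lines gain."""
    return line_lru > other_lru or both_gain
```

The counter filters out one-off conflicts, so lines are only moved when the bandwidth loss repeats.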
XBC vs. TC Delivery Bandwidth
[Figure: bar chart of uops per cycle (axis 0 to 7) for TC and XBC on Games, SpecINT, SysmarkNT and Average]
Miss Rate as a Function of Size
[Figure: uop miss rate (axis 0% to 10%) for TC and XBC at sizes of 16K, 32K and 64K uops; annotations: 29% and >50%]
Miss Rate as a Function of Associativity
[Figure: uop miss rate (axis 0% to 9%) for XBC and TC at associativity 1, 2 and 4]
XBC Features Summary
Basic unit - XB
– Ends with a conditional branch
– Multiple entries, single exit
– Indexed according to ending IP
– Branch promotion → longer XBs
XBC uses a banked structure
– Supports fetching multiple XBs/cycle
– Supports variable length XBs
– Uops within XBs are stored in reverse order
Conclusions
Instruction Cache has high hit rate, but …
– Low bandwidth, high latency
TC has high bandwidth, low latency, but …
– Low hit rate
XBC combines the best of both worlds
– High bandwidth, low latency and high hit rate