guaranteeing hits to improve the efficiency of a small ...€¦ · guaranteeing hits to improve the...
TRANSCRIPT
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Guaranteeing Hits to Improve the Efficiency ofa Small Instruction Cache
Stephen Hines, David Whalley, and Gary Tyson
Department of Computer ScienceFlorida State University
December 5, 2007
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
L0/Filter Instruction Caches
Programs spend a great deal of time executing small loops
Small, direct-mapped→ Better energy/access
Low hit rate→Worse execution times (4–8%)Power/Performance tradeoff in embedded systems
Prediction can reduce some of this cycle penalty ...
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
L0/Filter Instruction Caches
Programs spend a great deal of time executing small loops
Small, direct-mapped→ Better energy/access
Low hit rate→Worse execution times (4–8%)Power/Performance tradeoff in embedded systems
Prediction can reduce some of this cycle penalty ...But guarantees will give us the best of both worlds
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Instruction Fetch
Types of Instruction Fetch
Sequential FetchesNon-sequential (branching) fetches
Direct branchesIndirect branches
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Instruction Fetch
Types of Instruction Fetch
Sequential FetchesNon-sequential (branching) fetches
Direct branchesIndirect branches
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Intra-line Sequential Fetch
Traditional Line Buffers
Perform tag comparison early→ disable L1-IC on hit
May lengthen cycle time for fetch
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Intra-line Sequential Fetch
Traditional Line Buffers
Perform tag comparison early→ disable L1-IC on hit
May lengthen cycle time for fetch
Tagless Hit Line Buffer (TH-LB)
Keep track of previous cycle’s branch status/prediction
If not-taken and intra-line (not last inst), access TH-LB;otherwise fetch from the L1-IC and write into TH-LB
Eliminates the need for TH-LB tag comparison or storage
Identical hit and miss behavior as LB and cycle time is thesame as just L1-IC
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Guaranteeing Tagless Hits
A TH-LB makes guarantees about intra-line sequentialfetch behavior
Extend this principle to handle other regular fetches
Tagless Hit Instruction Cache (TH-IC)
Tag comparisons for most accesses
Recognize/capture regularity inherent to fetch
Employ more useful metadata to handle links betweeninstructions in the cache
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Tagless Hits
Fetching Sequential Lines
Next Sequential bit (NS) associated with each TH-IC lineSequential fetches that cross a line boundary examine NSto determine the next cycle’s fetch
NS is not set – fetch from L1-IC and set the NS bitNS is set – guaranteed hit in TH-IC (L1-IC, tag check)
When a line is replaced, we clear its NS bit as well as theprevious line’s NS bit
Essentially we now have a multiple line buffer
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Tagless Hits
Handling Direct Branches
Next Target bit (NT) associated with each TH-ICinstructionDirect branches that are predicted taken examine NT todetermine next cycle’s fetch
NT is not set – fetch from L1-IC and set the NT bitNT is set – guaranteed hit in TH-IC (L1-IC, tag check)
On line replacement, we will need to invalidate some ofthe NT bits contained in the TH-IC
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Tagless Hits
TH-IC Guaranteed Hit and False Miss Rates
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Tagless Hits
TH-IC Line and State Information
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1inst 1
inst 2inst 3inst 4
inst 5inst 6inst 7
inst 8. . .
0
0000
000
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 4
inst 5inst 5inst 6inst 7
inst 8. . .
01
0000
000
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 4
inst 5inst 6inst 6inst 7inst 7
inst 8. . .
01
0000
000
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hits
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 2inst 3inst 4
inst 5inst 6inst 7
inst 8. . .
01
0000
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 3inst 4inst 4
inst 5inst 6inst 7
inst 8. . .
01
0000
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hits
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 4
inst 5inst 5inst 6inst 7
inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 4
inst 5inst 6inst 6inst 7inst 7
inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
insts 6,7 hits
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 2inst 3inst 4
inst 5inst 6inst 7
inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
insts 6,7 hitsinst 2 hit
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 3inst 4inst 4
inst 5inst 6inst 7
inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
insts 6,7 hitsinst 2 hit
insts 3,4 hits
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 4
inst 5inst 5inst 6inst 7
inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
insts 6,7 hitsinst 2 hit
insts 3,4 hitsinst 5 hit
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 4
inst 5inst 6inst 6inst 7inst 7
inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
insts 6,7 hitsinst 2 hit
insts 3,4 hitsinst 5 hit
insts 6,7 hits
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 2inst 3inst 3inst 4inst 4
inst 5inst 5inst 6inst 6inst 7inst 7
inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
insts 6,7 hitsinst 2 hit
insts 3,4 hitsinst 5 hit
insts 6,7 hits
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 4
inst 5inst 6inst 7
inst 8inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
insts 6,7 hitsinst 2 hit
insts 3,4 hitsinst 5 hit
insts 6,7 hitsinst 8 hit
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Example
Fetching a Loop
. . .
inst 1
inst 2inst 3inst 4
inst 5inst 6inst 7
inst 8. . .
01
00001
0001
00
11
NSNT
Line
1Li
ne2
Fetched Result Metadata Set L1/ITLB?
inst 1 miss set line 0 NS bit X
inst 5 miss set inst 1 NT bit X
insts 6,7 hitsinst 2 false miss set inst 7 NT bit X
insts 3,4 hitsinst 5 false miss set line 1 NS bit X
insts 6,7 hitsinst 2 hit
insts 3,4 hitsinst 5 hit
insts 6,7 hitsinst 8 hit
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Tag Comparisons in TH-IC
Guaranteed HitsTag comparison is completely unnecessary
ITLB access can also be skipped
Potential Misses
Access L1-IC and TH-ICDetermine if line is actually in TH-IC (false miss)
TH-IC is inclusive of L1-IC, so . . .
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Tag Comparisons in TH-IC
Guaranteed HitsTag comparison is completely unnecessary
ITLB access can also be skipped
Potential Misses
Access L1-IC and TH-ICDetermine if line is actually in TH-IC (false miss)
TH-IC is inclusive of L1-IC, so . . .Only need to check whether TH-IC line points to L1-IC lineto verify a false miss
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Reducing TH-IC Tag Size
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Reducing TH-IC Tag Size
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Reducing TH-IC Tag Size
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Metadata Invalidation
TH-IC line replacements can corrupt NT and NS links
NS: Clear previous line’s NS (easy)NT: Clear any NT that points to replaced line (harder)
Too much metadata – inefficient energy usage due to extratracking bitsToo little metadata – overly aggressive invalidation kicks outlinks that are still valid (i.e. point to other lines)
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Invalidation Policies
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Invalidation Policies
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Invalidation Policies
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Metadata Management
Invalidation Policies
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Configuration
SimpleScalar PISA with Wattch extensions forpower/energy modelingStrongARM-derived configuration (in-order, 1-issue, . . . )
L0-IC (128B – 512B)TH-IC (128B – 512B) x (TN, TT, TL, TI) + TH-LBSlides only show 256B (16x4) configurations
VPO optimized MiBench benchmarks
Benchmarks run to completion
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Total Processor Energy
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Fetch Efficiency Across Different Application Domains
Fetch Statistics
MiBench Average 176.gcc (SPECInt2k)L0_16x4 TL_16x4 L0_16x4 TL_16x4
Execution Cycles 106.05% 100.00% 104.10% 100.00%Total Energy 75.17% 68.58% 83.81% 79.03%Small Cache Hit Rate 87.63% 84.96% 77.86% 73.57%Fetch Power 43.81% 35.47% 63.72% 56.07%Energy-Delay Squared 84.57% 68.58% 90.82% 79.03%
TH-IC is beneficial even for applications with more diverseinstruction and data access behavior like 176.gcc
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Fetch Efficiency Across Different Application Domains
Fetch Statistics
MiBench Average 176.gcc (SPECInt2k)L0_16x4 TL_16x4 L0_16x4 TL_16x4
Execution Cycles 106.05% 100.00% 104.10% 100.00%Total Energy 75.17% 68.58% 83.81% 79.03%Small Cache Hit Rate 87.63% 84.96% 77.86% 73.57%Fetch Power 43.81% 35.47% 63.72% 56.07%Energy-Delay Squared 84.57% 68.58% 90.82% 79.03%
TH-IC is beneficial even for applications with more diverseinstruction and data access behavior like 176.gcc
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Fetch Efficiency Across Different Application Domains
Fetch Statistics
MiBench Average 176.gcc (SPECInt2k)L0_16x4 TL_16x4 L0_16x4 TL_16x4
Execution Cycles 106.05% 100.00% 104.10% 100.00%Total Energy 75.17% 68.58% 83.81% 79.03%Small Cache Hit Rate 87.63% 84.96% 77.86% 73.57%Fetch Power 43.81% 35.47% 63.72% 56.07%Energy-Delay Squared 84.57% 68.58% 90.82% 79.03%
TH-IC is beneficial even for applications with more diverseinstruction and data access behavior like 176.gcc
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Fetch Efficiency Across Different Application Domains
Fetch Statistics
MiBench Average 176.gcc (SPECInt2k)L0_16x4 TL_16x4 L0_16x4 TL_16x4
Execution Cycles 106.05% 100.00% 104.10% 100.00%Total Energy 75.17% 68.58% 83.81% 79.03%Small Cache Hit Rate 87.63% 84.96% 77.86% 73.57%Fetch Power 43.81% 35.47% 63.72% 56.07%Energy-Delay Squared 84.57% 68.58% 90.82% 79.03%
TH-IC is beneficial even for applications with more diverseinstruction and data access behavior like 176.gcc
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Related Work
L0/Filter Cache Prediction
Cannot eliminate performance penalty completely
Still using full size tags along with prediction metadata andthere are still tag checks for hits
Way Memoization
Utilizes NT/NS concept to avoid 64-way tag comparisons inL1-IC for many fetches
Large amount of metadata required and expensiveinvalidation mitigates much of the energy consumptionbenefit
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Conclusions
TH-IC simultaneously eliminates performance penalty ofsmall caches, and further reduces fetch energy
Eliminates majority of tag checks and ITLB accessesReduces size of tag/ID check on miss due to inclusion
Make guarantees, not predictions for regularly behavedpipeline features like instruction fetch
Suitable for high-performance computing due to energyreductions and lack of performance penalty
Ease of integration with nearly any processor design
Introduction Tagless Hit Instruction Cache (TH-IC) Experimental Evaluation Summary
Questions???
Backup Slides Graphs
Backup Slides Graphs
Execution Time
Backup Slides Graphs
Average Fetch Power
Backup Slides Graphs
Energy-Delay2