Post on 14-Dec-2015
Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue
David Tarjan*, Michael Boyer, and Kevin Skadron*
University of Virginia
Department of Computer Science
* Currently on internship/sabbatical at NVIDIA Research
Motivation
[Figure: three chip organizations built from cores and L2 blocks: Homogeneous, Heterogeneous, and Adaptive (Federation). Legend: multithreaded scalar IO core, 2-way OO core.]
Basic Insights
A multithreaded in-order core has many registers which can be reused for a reorder buffer or active list
If cores are small, single-cycle communication between neighbors is feasible
Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible
In-order & Out-of-order Pipelines
[Figure: in-order and out-of-order pipelines side by side. Both show Fetch, Decode, Execute, Mem, Writeback; additional stage labels in the figure: Bpred, Allocate, Rename, Issue, Commit.]
Issue Queue Example
[Figure: issue queue entries 1 to 5, each with Ready Bits, Subscriber Slot 1, and Subscriber Slot 2; subscriber slots hold issue queue slot numbers such as IQ2 and IQ3.]
Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002
Sassone et al., Matrix Scheduler Reloaded, ISCA 2007
Simplified Load-Store Queue
Memory Alias Table (MAT)
No store forwarding
No conservative waiting on stores
Only detect memory order violations after they have occurred and flush the pipeline when the offending instruction commits
Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005
MAT Example
Program order: st 0x13, r5, then ld r1, 0x13 (ld in EXE)
MAT entries 0-7: 0 0 0 1 0 0 0 0
ld executes and increments counter
MAT Example
st 0x13, r5 in COM; ld r1, 0x13 not yet committed
MAT entries 0-7: 0 0 0 1! 0 0 0 0
st commits and sets flag
MAT Example
ld r1, 0x13 in COM
MAT entries 0-7: 0 0 0 1! 0 0 0 0
Flush
ld commits, sees flag, and flushes pipeline
MAT Example
ld r1, 0x13
MAT entries 0-7: 0 0 0 0 0 0 0 0
MAT is reset and execution resumes
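The four example slides above can be traced in a short Python sketch. It is an illustration only, not the authors' hardware design: the class and method names are hypothetical, indexing by the low address bits is an assumption (it happens to map 0x13 to entry 3 in an 8-entry table, as on the slides), and decrementing the counter when a load commits without a flag is also an assumption that the slides do not show.

```python
class MemoryAliasTable:
    """Hypothetical model of the MAT walk-through on the slides."""

    def __init__(self, entries=8):
        self.entries = entries
        self.counters = [0] * entries   # loads executed but not yet committed
        self.flags = [False] * entries  # set when a store commits over a live counter

    def _index(self, addr):
        # Assumption: index with the low address bits (0x13 -> entry 3).
        return addr % self.entries

    def load_execute(self, addr):
        # "ld executes and increments counter"
        self.counters[self._index(addr)] += 1

    def store_commit(self, addr):
        # "st commits and sets flag" when a younger load already executed
        i = self._index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr):
        """Return True if the pipeline must be flushed."""
        i = self._index(addr)
        if self.flags[i]:
            # "ld commits, sees flag, and flushes pipeline", then the MAT is reset
            self.reset()
            return True
        # Assumption: a load that commits cleanly releases its counter.
        self.counters[i] -= 1
        return False

    def reset(self):
        self.counters = [0] * self.entries
        self.flags = [False] * self.entries


mat = MemoryAliasTable()
mat.load_execute(0x13)            # ld r1, 0x13 runs ahead of the older store
mat.store_commit(0x13)            # st 0x13, r5 commits
flushed = mat.load_commit(0x13)   # flushed is True, MAT is cleared
```

Because different addresses can share an entry, a model like this also flushes on false positives, which matches the "(or false positives)" remark on the Commit slide.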
Performance Impact
[Figure: bar chart of average IPC loss (axis 0% to 6%) for consumer-based issue queue, pseudo-random scheduling, MAT, and commit-time branch recovery. Bar labels in extracted order: 0.00%, 2.67%, 1.71%, 5.46%.]
Performance
[Figure: bar chart of average IPC (axis 0 to 1.4) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO on spec, specint, and specfp.]
Energy Efficiency
[Figure: bar chart of normalized BIPS^3/Watt (axis 0 to 2.5) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO on spec, specint, and specfp.]
Area Efficiency
[Figure: bar chart of normalized BIPS^3/(Watt*mm^2) (axis 0 to 1.2) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO on spec, specint, and specfp.]
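The two efficiency metrics named on the chart axes, BIPS^3/Watt and BIPS^3/(Watt*mm^2), can be computed as in the sketch below. The function names are hypothetical, and the slides do not state which configuration the charts are normalized to, so the baseline is left as a parameter.

```python
def bips3_per_watt(ipc, freq_ghz, watts):
    """Energy-efficiency metric: (billions of instructions per second)^3 / W."""
    bips = ipc * freq_ghz  # IPC times clock in GHz gives BIPS
    return bips ** 3 / watts


def bips3_per_watt_mm2(ipc, freq_ghz, watts, area_mm2):
    """Combined area- and energy-efficiency metric."""
    return bips3_per_watt(ipc, freq_ghz, watts) / area_mm2


def normalize(values, baseline):
    """Scale every entry so that the chosen baseline configuration equals 1."""
    return {name: v / values[baseline] for name, v in values.items()}
```

Cubing BIPS weights performance much more heavily than power, so a design that doubles throughput can spend up to eight times the power before the metric drops.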
Conclusions
Two in-order cores can be federated at run-time to form a 2-way OO core
Almost doubling IPC of throughput core is possible with very little extra hardware
Don’t want traditional OO structures because their performance comes at too high a price
Best combined area- and energy-efficiency
Core Fusion Data
Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors”, ISCA 2007
Overall Results
Scalar in-order core is 8KB I/D, 256KB L2
Base 2-way core has 16KB I and D-Caches, 256KB L2, 32 entry ROB, 16 entry issue queue, 16 entry LSQ, bimodal bpred
4-way core is 32KB I/D, 2MB L2, 128 entry ROB, 32 IQ and LSQ, tournament bpred
Branch Prediction
Use only a Next Line and Set (NLS) predictor, a Bimodal predictor and a Return Address Stack (RAS)
NLS is OK if your instruction working set is not > I$ size
Small bimodal predictor is OK for a small window processor
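For readers unfamiliar with the bimodal predictor mentioned above, here is a minimal sketch of the textbook scheme: a table of 2-bit saturating counters indexed by low PC bits. The table size, the 4-byte instruction alignment, and the initial counter value are assumptions chosen for illustration; the slides do not give the predictor's parameters.

```python
class BimodalPredictor:
    """Textbook bimodal predictor: 2-bit saturating counters indexed by PC."""

    def __init__(self, entries=512):
        self.entries = entries              # assumed size, not from the slides
        self.counters = [1] * entries       # 1 = weakly not-taken

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the two low PC bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 predict taken, 0 and 1 predict not-taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The predictor keeps no global history, which is part of why it is cheap enough for a small core.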
Fetch
Two I$’s act as one I$ of twice the size and associativity (and random replacement)
More logic and buffers to capture two instructions
Extra cycle to route instructions from two I$’s to two decoders
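One way to picture two caches acting as a single cache of twice the associativity is sketched below: both halves use the same set index, a lookup probes both, and a miss fills a randomly chosen way in a randomly chosen half. This is a behavioral illustration under assumed parameters (set count, ways per core, line size), with hypothetical names, and it is not a description of the actual fetch hardware.

```python
import random


class FederatedCache:
    """Two per-core caches probed together as one cache with 2x the ways."""

    def __init__(self, sets=64, ways_per_core=2, line_bytes=32, seed=0):
        self.sets = sets
        self.line_bytes = line_bytes
        # halves[core][set] is the list of tags held by that core's cache.
        self.halves = [
            [[None] * ways_per_core for _ in range(sets)] for _ in range(2)
        ]
        self.rng = random.Random(seed)

    def _split(self, addr):
        line = addr // self.line_bytes
        return line % self.sets, line // self.sets  # (set index, tag)

    def lookup(self, addr):
        s, tag = self._split(addr)
        # Both halves are checked for the same set.
        return any(tag in half[s] for half in self.halves)

    def access(self, addr):
        """Return True on a hit; on a miss, fill with random replacement."""
        if self.lookup(addr):
            return True
        s, tag = self._split(addr)
        core = self.rng.randrange(2)
        way = self.rng.randrange(len(self.halves[core][s]))
        self.halves[core][s][way] = tag
        return False
```

The same picture applies to the data side, where the slides say each D$ is responsible for half of the merged cache's ways.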
Decode
Cancel second instruction if first turns out to be branch
Extra cycle to route decoded instructions to new allocate stage
Allocate
New logic and free lists to allocate ROB, IQ entries
Rename
New table since it has too many ports
One, centralized rename table, not distributed
Has separate table (or field in each RAT entry) for each register's producer instruction's IQ-slot number (see our new issue queue)
Issue
Uses a simple lookup table as wakeup structure, where instructions subscribe to their input instructions (explained in detail later)
Centralized, one IQ for the two cores
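The subscription idea can be sketched as follows: each issue queue entry records which entries consume its result, so wakeup is a table lookup rather than a broadcast. Two subscriber slots per entry follow the Issue Queue Example slide; everything else is an assumption made for illustration, including the class and method names, the queue size, stalling the consumer when a producer's slots are full, and ignoring execution latency.

```python
class SubscriptionIssueQueue:
    """Hypothetical model of wakeup by subscription instead of broadcast."""

    SLOTS = 2  # subscriber slots per entry, as on the example slide

    def __init__(self, size=16):
        self.size = size
        self.valid = [False] * size
        self.pending = [0] * size                     # unready producers per entry
        self.subscribers = [[] for _ in range(size)]  # consumer slot numbers

    def insert(self, slot, producer_slots):
        """Place an instruction in `slot` and subscribe it to its producers.

        Returns False if any producer has no free subscriber slot; the
        caller is assumed to stall and retry (not specified on the slides).
        """
        producers = set(producer_slots)
        if any(len(self.subscribers[p]) >= self.SLOTS for p in producers):
            return False
        self.valid[slot] = True
        self.pending[slot] = len(producers)
        self.subscribers[slot] = []
        for p in producers:
            self.subscribers[p].append(slot)
        return True

    def ready(self):
        return [s for s in range(self.size)
                if self.valid[s] and self.pending[s] == 0]

    def issue(self, slot):
        """Issue `slot` and wake its subscribers by direct lookup."""
        assert self.valid[slot] and self.pending[slot] == 0
        for consumer in self.subscribers[slot]:
            self.pending[consumer] -= 1
        self.subscribers[slot] = []
        self.valid[slot] = False
```

In this model the producer slot numbers passed to `insert` would come from the per-register IQ-slot field that the Rename slide describes.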
Register File
Register file is mirrored in the two cores
No extra copy instructions or load-balancing questions
Execute
Add extra cycle for copying result to other core’s register file (like EV6)
Memory Access
The two D$s are checked in parallel, each responsible for half of the merged D$’s ways
No standard LSQ, only a Memory Alias Table (details later)
Only detects ordering violations and sends a signal to the pipeline
Commit
Centralized commit, no slippage
Recover from branch mispredictions at commit, since there are no checkpoints of the RAT on branches
Recover from memory order violations (or false positives) from MAT