2/15/2006 "Software-Hardware Cooperative Memory Disambiguation", Alok Garg, HPCA 2006 1
Software-Hardware Cooperative Memory Disambiguation
Ruke Huang, Alok Garg, and Michael Huang
Department of Electrical & Computer Engineering
University of Rochester
Motivation
Hiding long latencies
  Scaling up of many structures
    Complex, hard to design
    Consumes more energy
    Slower
Inefficiency in hardware
  Meticulously keep track of all instructions
  No prior knowledge of out-of-order execution
  Simply cross-compare all loads and stores
[Figure: chart over LQ size (configuration: ROB size: 320, SQ size: 48, LQ size: 48); annotated value: 16%]
Software Assistance
Global information
  Statically identify non-conflicting memory accesses
Advantages
  Reduced resource pressure
  Energy savings
Loads not requiring memory disambiguation
  On average, 43% of dynamic loads in SPEC FP applications
Recent Research
Chrysos and Emer (ISCA’98)
Sethumadhavan et al. (MICRO’03)
Park et al. (MICRO’03)
Baugh and Zilles (PACC’04)
Akkary et al. (MICRO’03)
Gandhi et al. (ISCA’05), etc.
Hardware-only: Provisioning, recurring overhead
Cooperative: Consumption, one-time overhead
Outline
  Cooperative Memory Disambiguation Framework
  Evaluation
  Conclusion
Cooperative Memory Disambiguation: Resource-Effective Approach
90% of dynamic loads do not communicate with in-flight stores
Many loads do not require memory disambiguation resources
Safe loads: Software analyzer can identify them
  Can exploit hardware-specific information
  Hardware resources only for non-safe loads
int A[1000], B[1000];
void VecAdd() {
    for (int i = 0; i < 1000; i++)
        A[i] = A[i] + B[i];
}
Cooperative Memory Disambiguation Framework
Software-hardware interface
  Decoupled ISA (no compatibility obligations)
Software support
  Binary-to-binary translator: alto (Muth et al.)
  Binary analyzer
    Identify read-only data loads
    Identify other general safe loads
Architectural support
  Light-weight
[Figure: translation flow. Labels: Source compiler, Compilation, Original binary, ISA, Hardware-specific translator (Translator), Extended instruction set, Hardware-specific internal binary, Hardware]
General Safe Loads
Scope of parser analysis
  Steady state loop
  No internal control flow
Limited in-flight instructions
  ROB size, store queue size
[Figure: a simple loop body (… Load, Load, …, Store, Branch) and the instruction window during steady-state loop execution, holding iterations i, i-1, and i-2]
General Safe Loads (Cont.): Real example from a SPEC FP application
One loop from galgel:
0x120033140: ldl   r31, 256(r3)     ; prefetch
0x120033144: ldt   f21, 0(r3)       ; Ld1
0x120033148: lda   r27, -2(r27)     ; r27 = r27-2
0x12003314c: lda   r3, 16(r3)       ; r3 = r3+16
0x120033150: ldt   f22, -8(r3)      ; Ld2
0x120033154: ldt   f23, 0(r11)      ; Ld3 (labeled Ld2 in the address analysis)
0x120033158: cmple r27, 0x1, r1     ;
0x12003315c: lda   r11, 16(r11)     ; r11 = r11+16
0x120033160: ldt   f24, -8(r11)     ; Ld4
0x120033164: lds   f31, 240(r11)    ; prefetch
0x120033168: mult  f20, f21, f21    ;
0x12003316c: mult  f20, f22, f22    ;
0x120033170: addt  f23, f21, f21    ;
0x120033174: addt  f24, f22, f22    ;
0x120033178: stt   f21, -16(r11)    ; St1
0x12003317c: stt   f22, -8(r11)     ; St2
0x120033180: beq   r1, 0x120033140  ;
AddrLd1 = _R3 + 16*i
AddrLd2 = _R11 + 16*i
AddrSt1 = _R11 + 16*i
AddrSt2 = _R11 + 16*i + 8
Analysis window: 16 iterations
Address range of in-flight stores = _R11 + (i-16)*16 to _R11 + (i-1)*16 + 8
Ld2 is statically determined to be safe
Ld1 needs run-time evaluation
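The address-range reasoning on this slide can be sketched in C. This is only an illustration under stated assumptions, not the analyzer used in the talk: the `AffineAddr` type, the `classify` function, the verdict names, and the 8-byte access width are all introduced here. The sketch also assumes, as in the galgel loop, that stores follow the loads within an iteration, so a load only has to be checked against stores from earlier iterations in the window.

```c
#include <assert.h>
#include <stdlib.h>

/* Affine address: (base register value at loop entry) + stride*i + offset. */
typedef struct {
    int  base_reg;  /* symbolic base, e.g. 3 for _R3, 11 for _R11 */
    long stride;    /* bytes advanced per iteration */
    long offset;    /* constant byte offset */
} AffineAddr;

typedef enum { SAFE_STATIC, NEEDS_RUNTIME_CHECK, UNSAFE } Verdict;

/* Classify one load against one store over `window` earlier iterations.
   `size` is the access width in bytes (assumed 8 for ldt/stt). */
Verdict classify(AffineAddr ld, AffineAddr st, int window, long size)
{
    assert(window > 0 && size > 0);
    if (ld.base_reg != st.base_reg || ld.stride != st.stride)
        return NEEDS_RUNTIME_CHECK;  /* distance is not known statically */
    for (int d = 1; d <= window; d++) {
        /* load in iteration i versus store in iteration i-d */
        long dist = ld.stride * d + ld.offset - st.offset;
        if (labs(dist) < size)
            return UNSAFE;           /* byte ranges overlap */
    }
    return SAFE_STATIC;
}
```

With Ld2 = {11, 16, 0}, St1 = {11, 16, 0}, St2 = {11, 16, 8} and a 16-iteration window, `classify` returns `SAFE_STATIC` for Ld2 against both stores, while Ld1 = {3, 16, 0} returns `NEEDS_RUNTIME_CHECK`, in line with the two conclusions on the slide.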
General Safe Loads (Cont.): Real example from a SPEC FP application
Modified code:
New_entry:   mark_sq
             if (r3-r11+8 > 0) or (r3-r11+264 < 0) then
                 cset CR0, 1
0x120033144: sldt  f21, 0(r3), [CR0]       ; Ld1 (safe)
...
0x12003314c: lda   r3, 16(r3)              ; r3 = r3+16
...
0x120033154: sldt  f23, 0(r11), [CR_TRUE]  ; Ld2 (safe)
0x120033158: cmple r27, 0x1, r1            ;
0x12003315c: lda   r11, 16(r11)            ; r11 = r11+16
...
0x120033174: addt  f24, f22, f22           ;
0x120033178: stt   f21, -16(r11)           ; St1
0x12003317c: stt   f22, -8(r11)            ; St2
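The run-time condition guarding `cset CR0, 1` can be written out as a small C predicate. This is a sketch of one reading of the slide, not the authors' code: the function name `ld1_is_safe` is hypothetical, and the interpretation of the constants is an assumption. Under that reading, the in-flight stores cover _R11 + 16*i - 256 up to _R11 + 16*i - 8, Ld1 reads at _R3 + 16*i, and the 16*i terms cancel, so only the difference r3 - r11 at loop entry matters.

```c
#include <stdbool.h>

/* Run-time safety test for Ld1, as written on the slide.
   r3 and r11 are the register values at New_entry. */
bool ld1_is_safe(long r3, long r11)
{
    long diff = r3 - r11;
    return (diff + 8 > 0)     /* Ld1 lies above the in-flight store range */
        || (diff + 264 < 0);  /* Ld1 lies below the in-flight store range */
}
```

For example, `ld1_is_safe(r11, r11)` holds because the load sits just past the most recent store, whereas a load 128 bytes below r11 falls inside the 16-iteration store range and fails both tests.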
Safe stores
Safe stores
  A store is safe if it does not communicate with future loads
  Indirectly discover safe loads
Un-analyzable store
  Load is safe if all stores in SQ are safe
Summary of safe load detection
  Simple loop body
  All stores must be analyzable
  Address range calculation
[Figure: loop body (… Load (A) … Store1 (UA) … Store2 (A) … Branch) next to the in-flight instructions (… Load (A) … Store1 (UA) … Store2 (A) … Branch … Load (A) ...); A and UA presumably mark analyzable and un-analyzable accesses]
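The rule "a load is safe if all stores in the SQ are safe" can be illustrated with a short C sketch. The `StoreEntry` type and both function names are hypothetical, and the way a software-marked safe load is combined with the store-queue check is an assumption made for illustration, not a description of the actual hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One in-flight store; `safe` means it does not communicate with future loads. */
typedef struct { bool safe; } StoreEntry;

/* True when every store currently in the store queue is a safe store. */
bool all_stores_safe(const StoreEntry *sq, size_t n)
{
    assert(sq != NULL || n == 0);
    for (size_t k = 0; k < n; k++)
        if (!sq[k].safe)
            return false;
    return true;
}

/* A load can skip disambiguation if software marked it safe,
   or if no in-flight store could possibly feed it. */
bool load_skips_disambiguation(bool marked_safe, const StoreEntry *sq, size_t n)
{
    return marked_safe || all_stores_safe(sq, n);
}
```

In this model a single un-analyzable (non-safe) store in the queue is enough to force every unmarked load back onto the normal disambiguation path.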
Architectural Support
Safe loads
  Boolean condition registers
  cset (instruction)
Safe stores
  Scope marker
Indirect jumps
  Flash-reset all condition registers
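A software model of the condition registers may make the mechanism easier to picture. Everything here is a sketch under assumptions: the register count `NUM_CR` is invented (the slides do not give one), `CR_TRUE` is modeled as a pseudo-register that always reads true, and the function names are hypothetical. Treating an indirect jump as the trigger for the flash-reset is one reading of the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_CR  8     /* hypothetical register count */
#define CR_TRUE (-1)  /* modeled as a pseudo-register that always reads true */

typedef struct { bool cr[NUM_CR]; } CondRegs;

/* cset: write a Boolean condition register (e.g. "cset CR0, 1"). */
void cset(CondRegs *c, int idx, bool value)
{
    assert(idx >= 0 && idx < NUM_CR);
    c->cr[idx] = value;
}

/* A safe load (sldt ..., [CRn]) bypasses disambiguation only if CRn is set. */
bool safe_load_bypasses(const CondRegs *c, int idx)
{
    if (idx == CR_TRUE)
        return true;
    assert(idx >= 0 && idx < NUM_CR);
    return c->cr[idx];
}

/* Flash-reset: clear every condition register at once. */
void flash_reset(CondRegs *c)
{
    memset(c->cr, 0, sizeof c->cr);
}
```

After a reset, a load guarded by CR0 is treated as an ordinary load until the next `cset` sets the register again, while a load guarded by `CR_TRUE` is unaffected.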
Experimental Setup
  Modified SimpleScalar 3.0b simulator
  Wattch to estimate dynamic energy consumption
  SPEC CPU2000 benchmark suite
Breakdown of Safe Loads (FP)
[Figure: breakdown of safe loads for FP applications; annotated values: 97%, 43%]
Performance Improvement (FP)
[Figure: performance improvement for FP applications; annotated value: 40/48%]
Breakdown of Safe Loads (INT)
Performance Improvement (INT)
Energy Savings
Floating-point applications
Integer applications
Conclusions
Software assistance improves LSQ efficiency
  Detects an average of 43% of loads as safe
  Average 10% performance gain
Compiler techniques for optimization of micro-architecture resources
Future work
  More powerful static analyzer
  Manage other micro-architecture resources
    E.g., register file
Thank you!
Questions?
Support for Coherency
  Hash table: 2-bit
  Total entries: 512
  Details: http://www.ece.rochester.edu/~mihuang/PAPERS/hpca06tr.pdf
[Figure: Table 1 and Table 2; labels: access bit, invalidation bit]
Read-Only Data Loads
Alpha COFF binary header
  Global pointer (GP)
  Read-only sections
Access address calculation
  Algorithm: extended constant propagation
[Figure: gp = 0x120022000; read-only section with Start: 0x120023000 and End: 0x120024000]
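Once constant propagation has resolved a load's address to the global pointer plus a known displacement, the read-only test reduces to a range check. The following C sketch illustrates that last step only. The `Section` type and the function name are hypothetical, and treating `End` as an exclusive bound is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A section of the binary, taken here as the half-open range [start, end). */
typedef struct { uint64_t start, end; } Section;

/* True if a load at gp + disp falls inside the read-only section. */
bool is_read_only_load(uint64_t gp, int64_t disp, Section ro)
{
    assert(ro.start <= ro.end);
    uint64_t addr = gp + (uint64_t)disp;  /* modular add handles negative disp */
    return addr >= ro.start && addr < ro.end;
}
```

With the values from the slide (gp = 0x120022000, section 0x120023000 to 0x120024000), a displacement of 0x1000 lands on the first byte of the read-only section, while a displacement of 0 does not.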