![Page 1: Code Layout Optimization for Transaction Processing Workloads](https://reader035.vdocuments.mx/reader035/viewer/2022062305/5681593b550346895dc676be/html5/thumbnails/1.jpg)
Code Layout Optimization for Transaction Processing Workloads
2006/05/29
KINS
Kyuhwan Kim
Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo,
Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, and Mateo Valero
Introduction
OLTP (OnLine Transaction Processing)
- A form of transaction processing conducted over a computer network.
- Examples: electronic banking, order processing, e-commerce.
- A large number of clients continually access and update small portions of the database through short-running transactions.
- Large memory stall time, caused by large instruction and data footprints and high communication miss rates.
Introduction (cont.)
Code Layout Optimization
- Large applications have a particular problem: a lot of instructions.
- The entire application cannot be held on-chip at any one time.
- The processor stalls while waiting to fetch new instructions from memory.
- Holding more useful instructions on-chip improves performance.
Outline
- Introduction
- Code Layout Optimizations
- Methodology
- Behavior of the Database Application in Isolation
- Combined Database Application and O/S Behavior
- Conclusion
Code Layout Optimizations
Spike
- DTKS tool for performing code optimization after linking.
- Profile-driven optimization.
Three parts of the Spike optimizer algorithm
- Basic Block Chaining
- Fine-Grain Procedure Splitting
- Procedure Ordering
Basic Block Chaining
Definition
- Order the basic blocks within a procedure.
Algorithm
- Simple greedy algorithm:
1. Sort the flow edges by weight.
2. Chain the two blocks joined by the heaviest edge.
Gain
- Improves instruction cache behavior.
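The greedy algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not Spike's implementation: the function name `chain_basic_blocks`, the block names, and the edge weights are all hypothetical, and the rule that an edge only links the tail of one chain to the head of another is an assumption about how "chaining" is applied.

```python
def chain_basic_blocks(blocks, edges):
    """Greedy basic block chaining (illustrative sketch).

    blocks: block names in their original order.
    edges:  dict mapping (src, dst) -> profiled execution weight.
    Returns a list of chains; each chain is a list of block names.
    """
    chain_of = {b: [b] for b in blocks}  # every block starts as its own chain
    # Step 1: visit flow edges from heaviest to lightest.
    for (src, dst), _weight in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        head, tail = chain_of[src], chain_of[dst]
        # Step 2: link only if src ends one chain and dst starts a different one
        # (assumed rule, so that dst becomes the fall-through of src).
        if head is not tail and head[-1] == src and tail[0] == dst:
            head.extend(tail)
            for blk in tail:
                chain_of[blk] = head
    # Emit each chain once, in order of first appearance in the original layout.
    seen, layout = set(), []
    for blk in blocks:
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.append(list(chain))
    return layout


# Hypothetical procedure: a hot path and a rarely taken cold path.
blocks = ["entry", "test", "cold", "hot", "exit"]
edges = {
    ("entry", "test"): 10,
    ("test", "hot"): 9,
    ("test", "cold"): 1,
    ("hot", "exit"): 9,
    ("cold", "exit"): 1,
}
print(chain_basic_blocks(blocks, edges))
```

In this example the hot path ends up contiguous and the cold block is left in a chain of its own, so the frequently executed instructions share cache lines.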
Ex) Basic Block Chaining
[Figure: control-flow graph of basic blocks A1 to A8, shown before and after chaining. Legend: unconditional branch / fall-through edges, conditional branch edges, node weights (e.g., A1 has weight 10), and branch probabilities (e.g., 0.6 / 0.4). Node weights shown: 10, 10, 10, 6, 4, 2.4, 7.6, 10.]
Fine-Grain Procedure Splitting
Definition
- Divide the chain into multiple code segments, which become new procedures.
Algorithm
- End a segment at each unconditional branch or return. (study only)
- Split into a hot part and a cold part. (currently available)
Gain
- Extra degree of flexibility for the procedure ordering algorithm.
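The fine-grain rule can be sketched as follows. This is an illustration under assumptions: a chain is modeled as a list of `(block, terminator)` pairs, and the terminator labels `"br"` (unconditional branch), `"ret"` (return), `"cond"`, and `"fall"` are hypothetical names, not Spike's representation.

```python
def split_chain(chain):
    """Split a chain of basic blocks into fine-grain segments.

    chain: list of (block_name, terminator) pairs, where terminator is one of
           "br", "ret", "cond", or "fall" (labels assumed for this sketch).
    A segment ends after any block that finishes with an unconditional
    branch or a return, because control never falls through past it.
    """
    segments, current = [], []
    for name, terminator in chain:
        current.append(name)
        if terminator in ("br", "ret"):
            segments.append(current)
            current = []
    if current:  # trailing blocks with no closing branch or return
        segments.append(current)
    return segments


# Hypothetical chain: one unconditional branch followed by three returns.
chain = [
    ("b1", "cond"), ("b2", "br"),
    ("b3", "fall"), ("b4", "ret"),
    ("b5", "ret"),
    ("b6", "cond"), ("b7", "ret"),
]
print(split_chain(chain))
```

Each resulting segment can then be placed independently by the procedure ordering step, which is where the extra flexibility comes from.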
Ex) Fine-Grain Procedure Splitting
[Figure: one chain split into four procedures. Procedure 1 ends at an unconditional branch; Procedures 2, 3, and 4 each end at a subroutine return (RET).]
Procedure Ordering
Definition
- Place related procedures near one another.
Algorithm
1. Build the call graph and weight each edge by its number of calls.
2. Select the most heavily weighted edge and merge its two nodes.
3. When merging, use the weights in the original graph.
4. Iterate until the graph is reduced to a single node.
Gain
- Improves instruction cache behavior.
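A minimal Python sketch of the four steps above. The names `order_procedures` and `call_weight` and the call counts are illustrative. Step 3 is interpreted here as picking the orientation of the two merged chains whose touching procedures have the heaviest edge in the original graph; that interpretation is an assumption, not a statement of Spike's exact rule.

```python
def call_weight(orig, a, b):
    """Call count between procedures a and b in the original graph."""
    return orig.get(frozenset((a, b)), 0)


def order_procedures(call_counts):
    """Greedy procedure ordering; call_counts maps (caller, callee) -> calls."""
    # Step 1: build the (undirected) weighted call graph.
    orig = {}
    for (p, q), n in call_counts.items():
        key = frozenset((p, q))
        orig[key] = orig.get(key, 0) + n

    # Every node of the working graph is a chain (list) of procedures.
    chains = [[p] for p in sorted({p for edge in orig for p in edge})]
    while len(chains) > 1:  # Step 4: repeat until one node remains.
        # Step 2: find the pair of nodes joined by the heaviest merged edge.
        pairs = [(i, j) for i in range(len(chains)) for j in range(i + 1, len(chains))]
        i, j = max(pairs, key=lambda ij: sum(
            call_weight(orig, a, b) for a in chains[ij[0]] for b in chains[ij[1]]))
        x, y = chains[i], chains[j]
        # Step 3 (assumed rule): try all four orientations and keep the one
        # whose junction has the heaviest edge in the original graph.
        candidates = [x + y, x + y[::-1], x[::-1] + y, x[::-1] + y[::-1]]
        joined = max(candidates, key=lambda c: call_weight(orig, c[len(x) - 1], c[len(x)]))
        chains = [c for k, c in enumerate(chains) if k not in (i, j)] + [joined]
    return chains[0] if chains else []


# Illustrative call counts for five procedures.
calls = {
    ("A", "B"): 4, ("A", "C"): 10, ("B", "D"): 8,
    ("B", "C"): 3, ("C", "E"): 1, ("D", "E"): 1,
}
print(order_procedures(calls))
```

With these counts, A and C are merged first, then B and D, and the two chains are joined so that A and B sit next to each other. Ties (such as where E attaches) are broken arbitrarily in this sketch.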
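Ex) Procedure Ordering
[Figure: call graph over procedures A, B, C, D, E, merged step by step.]
1. Initial graph: nodes A, B, C, D, E; edge weights 4, 10, 8, 1, 3, 1.
2. Merge A and C: nodes B, (A,C), D, E; edge weights 8, 1, 7, 1.
3. Merge B and D: nodes (B,D), (A,C), E; edge weights 1, 1, 7.
4. Merge (B,D) and (A,C): nodes (D,B,A,C) and E; edge weight 2.
5. Final order: E,D,B,A,C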
Methodology
OLTP Workload
- TPC-B
- Oracle 8.0.4
Collecting Profiles
- OLTP profile data: Pixie.
- Kernel profile: Tru64 Unix kprofile tool.
Hardware and Simulation Platforms
- SimOS-Alpha environment
Behavior of the DB App. Only
Instruction cache misses
- X-axis: cache line size
- Y-axis: number of instruction cache misses
- Misses are reduced by 55% to 65%.
[Figure: instruction cache misses for the baseline OLTP binary and the optimized OLTP binary.]
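To make the plotted metric concrete, here is a toy direct-mapped instruction cache that counts misses for a given line size. It is only a sketch: the results in this deck come from the SimOS-Alpha environment, not from a model like this, and the function name, cache size, and address trace are all assumptions chosen for illustration.

```python
def count_icache_misses(addresses, cache_bytes, line_bytes):
    """Count misses of a direct-mapped instruction cache (toy model).

    addresses:   sequence of instruction fetch addresses (bytes).
    cache_bytes: total cache capacity in bytes.
    line_bytes:  cache line size in bytes.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines  # which memory line each cache slot holds
    misses = 0
    for addr in addresses:
        line = addr // line_bytes   # memory line containing this address
        index = line % num_lines    # direct-mapped slot for that line
        if tags[index] != line:
            tags[index] = line      # fill the slot on a miss
            misses += 1
    return misses


# Hypothetical trace: 64 sequential 4-byte instructions.
trace = list(range(0, 256, 4))
for line_size in (32, 64):
    print(line_size, count_icache_misses(trace, cache_bytes=128, line_bytes=line_size))
```

For purely sequential code, doubling the line size halves the misses in this model, which is why a layout that makes execution more sequential benefits more from long cache lines.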
Experiment (cont.)
Impact of the different code layout optimizations
- Procedure ordering by itself increases cache misses.
- The largest benefit comes from basic block chaining.
- Procedure ordering after splitting improves performance further.
Experiment (cont.)
Sequentially executed instructions
- Optimized binary: from 7.3 to over 10 instructions.
Temporal locality
- Measured as the number of instructions reused before eviction.
- Optimized binary: the number of instructions reused increases.
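One way to read the "sequentially executed instructions" metric is as the average length of a run of consecutive instruction addresses between control transfers. The sketch below computes that from an address trace. The function name and the trace are hypothetical, and the exact definition used for the numbers above is not given in these slides; the 4-byte instruction size matches the Alpha architecture.

```python
def average_sequence_length(addresses, instr_bytes=4):
    """Average number of instructions executed sequentially per run.

    A run ends whenever the next fetched address is not the one that
    immediately follows the current instruction.
    """
    if not addresses:
        return 0.0
    runs = 1
    for prev, cur in zip(addresses, addresses[1:]):
        if cur != prev + instr_bytes:
            runs += 1  # control transfer: a new sequential run starts
    return len(addresses) / runs


# Hypothetical trace with three sequential runs of lengths 3, 2, and 4.
trace = [0, 4, 8, 100, 104, 0, 4, 8, 12]
print(average_sequence_length(trace))
```

A layout that turns taken branches into fall-throughs lengthens these runs, which is the effect reported for the optimized binary.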
Behavior of Combined DB App. & OS
Instruction cache misses
- Misses are reduced by 45% to 60%.
- For comparison, the reduction is 55% to 65% for the application in isolation.
[Figure: instruction cache misses for the baseline OLTP binary and the optimized OLTP binary.]
Experiment (cont.)
Interference between App. and OS
- The majority of application misses arise from self-interference.
- The kernel interferes very little with itself.
[Figure: interference for the baseline OLTP binary and the optimized OLTP binary.]
Conclusion
- Profile-driven compiler optimization improves code layout in OLTP workloads.
- Application in isolation: cache misses are reduced by 55% to 65%.
- With the OS: cache misses are reduced by 45% to 60%.
- Overall, these optimizations yield a performance improvement of 1.33 times.