De-optimization
Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed
Finding the Limits of Hardware Optimization through Software De-optimization

Outline:
- Introduction
- Project Structure
- Judging de-optimizations
- What does a de-op look like?
- General Areas of Focus
  - Instruction Fetching and Decoding
  - Instruction Scheduling
  - Instruction Type Usage (e.g. Integer vs. FP)
  - Branch Prediction
- Conclusion

What are we doing?
De-optimization? That's crazy! Why???
In the world of hardware development, when optimizations are compared, the comparisons often concern just how fast a piece of hardware can run an algorithm
Yet, in the world of software development, the hardware is often a distant afterthought
Given this dichotomy, how relevant are these standard analyses and comparisons?
What are we doing?
So, why not find out how bad it can get?
By de-optimizing software, we can see how bad algorithmic performance can be if hardware isn't considered
At a minimum, we want to be able to answer two questions:
How good a compiler writer must someone be?
How good a programmer must someone be?
Our project structure
For our research project:
- We have been studying instruction fetching, decoding, and scheduling, and branch optimization
- We have been using knowledge of optimizations to design and predict de-optimizations
- We have been studying the Opteron in detail
Our project structure
For our implementation project:
- We will choose de-optimizations to implement
- We will choose algorithms that best reflect our de-optimizations
- We will implement the de-optimizations
- We will report the results
Judging de-optimizations (de-ops)
We need to decide on an overall metric for comparison. Whether the de-op affects scheduling, caching, branching, etc., its impact will be felt in the clock cycles needed to execute an algorithm.
So, our metric of choice will be CPU clock cycles.
Judging de-optimizations (de-ops)
With our metric, we can compare de-ops, but should we? Inevitably, we will ask which de-ops had the greater impact, i.e., caused the greatest jump in clock cycles. So, yes, we should.
But this has to be done very carefully, since an intended de-op may not be the actual or full cause of a bump in clock cycles. The bump could be a side effect of the new code combination.
Of course, this would still be some kind of de-op, just not the intended one.
What does a de-op look like?
Definition: A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question.
Is an infinite loop a de-op? -- NO. Why not? It tells us nothing about the hardware.
Is a loop that executes more cycles than necessary a de-op? -- NO. Again, it tells us nothing about the CPU.
Is a combination of instructions that causes increased branch mispredictions a de-op? -- YES.
General Areas of Focus
Given some CPU, what aspects can we optimize code for? These aspects will be our focus for de-optimization.
In general, when optimizing software, the following are the areas to focus on:
- Instruction Fetching and Decoding
- Instruction Scheduling
- Instruction Type Usage (e.g. Integer vs. FP)
- Branch Prediction
These will also be our areas for de-optimization.
Conclusion to course, lecture, et al.
Some General Findings
In class, when we discussed dynamic scheduling, for example, our team was not sanguine about being able to truly de-optimize code.
In fact, we even imagined that our result might be that CPUs are now generally so good that true de-optimization is very difficult to achieve. In principle, we still believe this.
In retrospect, we should have been wiser. Just like Plato's Forms, there is a significant, if not absolute, difference between something imagined in the abstract and its worldly representation. There can be no perfect circles in the real world.
Thus, in practice, as Gita has stressed, CPU designers make choices in their designs that are driven by cost, energy consumption, aesthetics, etc.
Some General Findings
These choices, when it comes time to write software for a CPU, become idiosyncrasies that must be accounted for when optimizing.
For those writing optimal code, they are hassles that demand attention.
For our project team, these idiosyncrasies are potential "gold mines" for de-optimization.
In fact, the AMD Opteron (K10 architecture) exhibits a number of idiosyncrasies. You will see some of these today.
Examples of idiosyncrasies: AMD Opteron (K10)
- The dynamic scheduling pick window is 32 bytes in length, while instructions can be 1 to 16 bytes in length. So, scheduling can be adversely affected by instruction length.
- The branch target buffer (BTB) can only maintain 3 branch history entries per 16 bytes.
- Branch indicators are aligned at odd-numbered positions within 16-byte code blocks. So, 1-byte branches, like return instructions, will be mispredicted if misaligned.
Examples of idiosyncrasies: Intel i7 (Nehalem)
- The number of read ports for the register file is too small. This can result in stalls when reading registers.
- Instruction fetch/decode bandwidth is limited to 16 bytes per cycle. Instruction density can overwhelm the predecoder, which can only manage 6 instructions (per 16 bytes) per cycle.
Format of the de-op discussion
In the upcoming discussion of de-optimization techniques, we will present...
- ...the area of the CPU that the de-op derives from
- ...some, hopefully, illuminating title
- ...a general characterization of the de-op. This characterization may apply to many different CPU architectures. Generally, each of these represents a choice that may be made by a hardware designer
- ...a specific characterization of the de-op on the AMD Opteron. This characterization will apply only to the Opterons on Hydra
The De-optimizations
So, without further ado...

Instruction Fetching and Decoding
- Decoding Bandwidth
- Execution Latency

Instruction Fetching and Decoding
De-optimization #1 - Decrease Decoding Bandwidth [AMD05]
Scenario #1
Many CISC architectures offer combined load-and-execute instructions as well as the typical discrete versions.
Often, using the discrete versions can decrease the instruction decoding bandwidth.
Example (combined):
    add rax, QWORD PTR [foo]

De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
In Practice #1 - The Opteron
The Opteron can decode 3 combined load-execute (LE) instructions per cycle.
Using the discrete form instead will allow us to decrease the decode rate.
Example (discrete):
    mov rbx, QWORD PTR [foo]
    add rax, rbx

De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
Scenario #2
Use instructions with longer encodings rather than those with shorter encodings to decrease the average decode rate by decreasing the number of instructions that can fit into the L1 instruction cache.
This also effectively shrinks the scheduling pick window
For example, use 32-bit displacements instead of 8-bit displacements, and the 2-byte opcode form instead of the 1-byte opcode form, of simple integer instructions.
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
In Practice #2 - The Opteron
The Opteron has short and long variants of a number of its instructions, like indirect add, for example. We can use the long variants of these instructions in order to drive down the decode rate.
This will also have the effect of shrinking the Opteron's 32-byte pick window for instruction scheduling.
Example of long variants:
    81 C0 78 56 34 12    add eax, 12345678h   ; 32-bit immediate value
    81 C3 FB FF FF FF    add ebx, -5          ; 32-bit immediate instead of 8-bit
    0F 84 05 00 00 00    jz  label1           ; 2-byte opcode, 32-bit displacement
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
A balancing act
The scenarios for this de-optimization have flip sides that could make them difficult to implement.
For example, scenario #1 describes using discrete load-execute instructions in order to decrease the average decode rate. However, sometimes discrete load-execute instructions are called for:
- The discrete load-execute instructions can provide the scheduler with more flexibility when scheduling
- In addition, on the Opteron, they consume less of the 32-byte pick window, thereby giving the scheduler more options
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
When could this happen?
This de-optimization could occur naturally when:
- A compiler does a very poor job
- The memory model forces long encodings of instructions, e.g. 32-bit displacements
Our prediction for implementation
We predict mixed results when trying to implement this de-optimization.
Instruction Fetching and Decoding
De-optimization #2 - Increase Execution Latency [AMD05]
Scenario
CPUs often have instructions that can perform almost the same operation.
Yet, in spite of their seeming similarity, they have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized.

De-optimization #2 - Increase Execution Latency (cont'd)
In Practice - The Opteron
We can use the 16-bit LEA instruction, which is a VectorPath instruction, to reduce the decode bandwidth and increase execution latency.
The LOOP instruction on the Opteron has a latency of 8 cycles, while a decrement-and-jump combination (like DEC/JNZ) has a latency of less than 4 cycles.
Therefore, substituting LOOP instructions for DEC/JNZ combinations will be a de-optimization.

De-optimization #2 - Increase Execution Latency (cont'd)
When could this happen?
This de-optimization could occur if the user simply writes:
    float a, b;
    b = a / 100.0;
instead of:
    float a, b;
    b = a * 0.01;
Our prediction for implementation
We expect this de-op to be clearly reflected in an increase in clock cycles.
Instruction Scheduling
- Address-Generation Interlocks
- Register Pressure
- Loop Re-rolling

Instruction Scheduling
De-optimization #1 - Address-Generation Interlocks [AMD05]
Scenario
Scheduling loads and stores whose addresses cannot be calculated quickly -- because generating their addresses requires a long dependency chain -- ahead of loads and stores whose addresses can be calculated quickly creates address-generation interlocks.
Example (optimized, avoiding the interlock):
    add ebx, ecx                   ; Instruction 1
    mov eax, DWORD PTR [10h]       ; Instruction 2
    mov edx, DWORD PTR [24h]       ; Place load above instruction 3
                                   ; to avoid an AGI stall
    mov ecx, DWORD PTR [eax+ebx]   ; Instruction 3

De-optimization #1 - Address-Generation Interlocks (cont'd)
In Practice - The Opteron
The processor schedules instructions that access the data cache (loads and stores) in program order.
By randomly choosing the order of loads and stores, we can induce address-generation interlocks.
Example (de-optimized):
    add ebx, ecx                   ; Instruction 1
    mov eax, DWORD PTR [10h]       ; Instruction 2 (fast address calc.)
    mov ecx, DWORD PTR [eax+ebx]   ; Instruction 3 (slow address calc.)
    mov edx, DWORD PTR [24h]       ; This load is stalled from accessing
                                   ; the data cache due to the long
                                   ; latency caused by generating the
                                   ; address for instruction 3

De-optimization #1 - Address-Generation Interlocks (cont'd)
When could this happen?
This happens when a load or store whose address depends on a long dependency chain is scheduled ahead of one whose address can be calculated quickly.
Our prediction for implementation:
We expect an increase in the number of clock cycles from this de-optimization technique.
Instruction Scheduling
De-optimization #2 - Increase Register Pressure [AMD05]
Scenario
Avoid pushing memory data directly onto the stack; instead, load it into a register first. This increases register pressure and creates data dependencies.
Example (optimized):
    push mem

In Practice - The Opteron
Use code that first loads the memory data into a register and then pushes the register onto the stack, increasing register pressure and creating data dependencies.
Example (de-optimized):
    mov rax, mem
    push rax

De-optimization #2 - Increase Register Pressure (cont'd)
When could this happen?
This could happen whenever memory data bound for the stack is first loaded into a register and then pushed, instead of being pushed directly.
Our prediction for implementation:
We expect performance to be affected by the increased register pressure.

Instruction Scheduling
De-optimization #3 - Loop Re-rolling
Scenario
Loops not only affect branch prediction; they can also affect dynamic scheduling. How? Let instructions 1 and 2 be within loops A and B, respectively. Instructions 1 and 2 could be part of a unified loop. If they were, then they could be scheduled together. Yet, they are separate and cannot be.
In Practice - The Opteron
Given that the Opteron is 3-way superscalar, this de-optimization could significantly reduce IPC.
De-optimization #3 - Loop Re-rolling (cont'd)
When could this happen?
Easily. In C, this would be two consecutive loops, each containing one or more instructions, such that the loops could be combined.
Our prediction for implementation
We expect this de-op to be clearly reflected in an increase in clock cycles.
Example:
--- Version 1 (unified loop) ---
    for( i = 0; i < n; i++ ) {
        quadratic_array[i] = i * i;
        cubic_array[i] = i * i * i;
    }
--- Version 2 (re-rolled into separate loops) ---
    for( i = 0; i < n; i++ ) {
        quadratic_array[i] = i * i;
    }
    for( i = 0; i < n; i++ ) {
        cubic_array[i] = i * i * i;
    }
Instruction Type Usage
- Store-to-Load Dependency
- Costly Instructions

Instruction Type Usage
De-optimization #1 - Store-to-Load Dependency
Scenario
A store-to-load dependency takes place when stored data needs to be loaded again shortly after it is stored.
This pattern is common.
This type of dependency increases the pressure on the load/store unit and might cause the CPU to stall, especially when the dependency occurs frequently.
Example:
for (k=1;k