De-optimization
Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed
Finding the Limits of Hardware Optimization through Software De-optimization

Outline:
- Introduction
- Project Structure
- Judging de-optimizations
- What does a de-op look like?
- General Areas of Focus
  - Instruction Fetching and Decoding
  - Instruction Scheduling
  - Instruction Type Usage (e.g. Integer vs. FP)
  - Branch Prediction
- Conclusion

What are we doing?
De-optimization? That's crazy! Why???
In the world of hardware development, when optimizations are compared, the comparisons often concern just how fast a piece of hardware can run an algorithm
Yet, in the world of software development, the hardware is often a distant afterthought
Given this dichotomy, how relevant are these standard analyses and comparisons?
What are we doing?
So, why not find out how bad it can get?
By de-optimizing software, we can see how bad algorithmic performance can be if hardware isn't considered
At a minimum, we want to be able to answer two questions:
How good a compiler writer must someone be?
How good a programmer must someone be?
Our project structure
For our research project:
- We have been studying instruction fetching, decoding, and scheduling, and branch optimization
- We have been using knowledge of optimizations to design and predict de-optimizations
- We have been studying the Opteron in detail
Our project structure
For our implementation project:
- We will choose de-optimizations to implement
- We will choose algorithms that best reflect our de-optimizations
- We will implement the de-optimizations
- We will report the results
Judging de-optimizations (de-ops)
We need to decide on an overall metric for comparison. Whether the de-op affects scheduling, caching, branching, etc., its impact will be felt in the clock cycles needed to execute an algorithm.
So, our metric of choice will be CPU clock cycles.
Judging de-optimizations (de-ops)
With our metric, we can compare de-ops, but should we? Inevitably, we will ask which de-ops had the greater impact, i.e., caused the greatest jump in clock cycles. So, yes, we should.
But this has to be done very carefully, since an intended de-op may not be the actual or full cause of a bump in clock cycles. The bump could be a side effect of the new code combination.
Of course, this would still be some kind of de-op, just not the intended one.
What does a de-op look like?
Definition: A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question.
Is an infinite loop a de-op? -- NO. Why not? It tells us nothing about the hardware.
Is a loop that executes more cycles than necessary a de-op? -- NO. Again, it tells us nothing about the CPU.
Is a combination of instructions that causes increased branch mispredictions a de-op? -- YES.
General Areas of Focus
Given some CPU, what aspects can we optimize code for? These aspects will be our focus for de-optimization.
In general, when optimizing software, the following are the areas to focus on:
- Instruction Fetching and Decoding
- Instruction Scheduling
- Instruction Type Usage (e.g. Integer vs. FP)
- Branch Prediction
These will also be our areas for de-optimization.
Conclusion to course, lecture, et al.
Some General Findings
In class, when we discussed dynamic scheduling, for example, our team was not sanguine about being able to truly de-optimize code.
In fact, we even imagined that our result might be that CPUs are now generally so good that true de-optimization is very difficult to achieve. In principle, we still believe this.
In retrospect, we should have been wiser. Just like Plato's Forms, there is a significant, if not absolute, difference between something imagined in the abstract and its worldly representation. There can be no perfect circles in the real world.
Thus, in practice, as Gita has stressed, CPU designers make choices in their designs that are driven by cost, energy consumption, aesthetics, etc.
Some General Findings
These choices, when it comes time to write software for a CPU, become idiosyncrasies that must be accounted for when optimizing.
For those writing optimal code, they are hassles that demand attention.
For our project team, these idiosyncrasies are potential "gold mines" for de-optimization.
In fact, the AMD Opteron (K10 architecture) exhibits a number of idiosyncrasies. You will see some of these today.
Examples of idiosyncrasies: AMD Opteron (K10)
- The dynamic scheduling pick window is 32 bytes in length, while instructions can be 1 to 16 bytes in length. So, scheduling can be adversely affected by instruction length.
- The branch target buffer (BTB) can only maintain 3 branch history entries per 16 bytes.
- Branch indicators are aligned at odd-numbered positions within 16-byte code blocks. So, 1-byte branches, like return instructions, will be mispredicted if misaligned.
Examples of idiosyncrasies: Intel i7 (Nehalem)
- The number of read ports for the register file is too small. This can result in stalls when reading registers.
- Instruction fetch/decode bandwidth is limited to 16 bytes per cycle. Instruction density can overwhelm the predecoder, which can only manage 6 instructions (per 16 bytes) per cycle.
Format of the de-op discussion
In the upcoming discussion of de-optimization techniques, we will present...
- ...the area of the CPU that the de-op derives from
- ...some, hopefully, illuminating title
- ...a general characterization of the de-op. This characterization may apply to many different CPU architectures. Generally, each of these represents a choice that may be made by a hardware designer
- ...a specific characterization of the de-op on the AMD Opteron. This characterization will apply only to the Opterons on Hydra
The De-optimizations
So, without further ado...

Instruction Fetching and Decoding
- Decoding Bandwidth
- Execution Latency

Instruction Fetching and Decoding
De-optimization #1 - Decrease Decoding Bandwidth [AMD05]
Scenario #1
Many CISC architectures offer combined load-and-execute instructions as well as the typical discrete versions.
Often, using the discrete versions can decrease the instruction decoding bandwidth.
Example (combined):
    add rax, QWORD PTR [foo]

De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
In Practice #1 - The Opteron
The Opteron can decode 3 combined load-execute (LE) instructions per cycle.
Using the discrete form instead will allow us to decrease the decode rate.
Example (discrete):
    mov rbx, QWORD PTR [foo]
    add rax, rbx

De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
Scenario #2
Use instructions with longer encodings rather than those with shorter encodings to decrease the average decode rate by decreasing the number of instructions that can fit into the L1 instruction cache.
This also effectively shrinks the scheduling pick window
For example, use 32-bit displacements instead of 8-bit displacements, and the 2-byte opcode form instead of the 1-byte opcode form, of simple integer instructions.
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
In Practice #2 - The Opteron
The Opteron has short and long variants of a number of its instructions, like indirect add, for example. We can use the long variants of these instructions in order to drive down the decode rate.
This will also have the effect of shrinking the Opteron's 32-byte pick window for instruction scheduling.
Example of long variants:
    81 C0 78 56 34 12    add eax, 12345678h   ; 32-bit immediate value
    81 C3 FB FF FF FF    add ebx, -5          ; 32-bit immediate instead of 8-bit
    0F 84 05 00 00 00    jz  label1           ; 2-byte opcode, 32-bit displacement
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
A balancing act
The scenarios for this de-optimization have flip sides that could make them difficult to implement.
For example, scenario #1 describes using discrete load-execute instructions in order to decrease the average decode rate. However, sometimes discrete load-execute instructions are called for:
- The discrete load-execute instructions can provide the scheduler with more flexibility when scheduling
- In addition, on the Opteron, they consume less of the 32-byte pick window, thereby giving the scheduler more options
De-optimization #1 - Decrease Decoding Bandwidth (cont'd)
When could this happen?
This de-optimization could occur naturally when:
- A compiler does a very poor job
- The memory model forces long encodings of instructions, e.g. 32-bit displacements
Our prediction for implementation
We predict mixed results when trying to implement this de-optimization.
Instruction Fetching and Decoding
De-optimization #2 - Increase Execution Latency [AMD05]
Scenario
CPUs often have instructions that can perform almost the same operation.
Yet, in spite of their seeming similarity, they have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized.

De-optimization #2 - Increase Execution Latency (cont'd)
In Practice - The Opteron
We can use the 16-bit LEA instruction, which is a VectorPath instruction, to reduce the decode bandwidth and increase execution latency.
The LOOP instruction on the Opteron has a latency of 8 cycles, while a decrement-and-jump combination (like DEC/JNZ) has a latency of less than 4 cycles.
Therefore, substituting LOOP instructions for DEC/JNZ combinations will be a de-optimization.

De-optimization #2 - Increase Execution Latency (cont'd)
When could this happen?
This de-optimization could occur if the user simply writes:
    float a, b;
    b = a / 100.0;
instead of:
    float a, b;
    b = a * 0.01;
Our prediction for implementation
We expect this de-op to be clearly reflected in an increase in clock cycles.
Instruction Scheduling
- Address-Generation Interlocks
- Register Pressure
- Loop Re-rolling

Instruction Scheduling
De-optimization #1 - Address-Generation Interlocks [AMD05]
Scenario
Scheduling loads and stores whose addresses cannot be calculated quickly -- because generating their addresses requires a long dependency chain -- ahead of loads and stores whose addresses can be calculated quickly creates address-generation interlocks.
Example (optimized, avoiding the interlock):
    add ebx, ecx                   ; Instruction 1
    mov eax, DWORD PTR [10h]       ; Instruction 2
    mov edx, DWORD PTR [24h]       ; Place load above instruction 3
                                   ; to avoid an AGI stall
    mov ecx, DWORD PTR [eax+ebx]   ; Instruction 3

De-optimization #1 - Address-Generation Interlocks (cont'd)
In Practice - The Opteron
The processor schedules instructions that access the data cache (loads and stores) in program order.
By randomly choosing the order of loads and stores, we can induce address-generation interlocks.
Example (de-optimized):
    add ebx, ecx                   ; Instruction 1
    mov eax, DWORD PTR [10h]       ; Instruction 2 (fast address calc.)
    mov ecx, DWORD PTR [eax+ebx]   ; Instruction 3 (slow address calc.)
    mov edx, DWORD PTR [24h]       ; This load is stalled from accessing
                                   ; the data cache due to the long
                                   ; latency caused by generating the
                                   ; address for instruction 3

De-optimization #1 - Address-Generation Interlocks (cont'd)
When could this happen?
This happens when a load or store whose address depends on a long dependency chain is scheduled ahead of one whose address can be calculated quickly.
Our prediction for implementation:
We expect an increase in the number of clock cycles from this de-optimization technique.
Instruction Scheduling
De-optimization #2 - Increase Register Pressure [AMD05]
Scenario
Avoid pushing memory data directly onto the stack; instead, load it into a register first. This increases register pressure and creates data dependencies.
Example (optimized):
    push mem

In Practice - The Opteron
Use code that first loads the memory data into a register and then pushes the register onto the stack, increasing register pressure and creating data dependencies.
Example (de-optimized):
    mov rax, mem
    push rax

De-optimization #2 - Increase Register Pressure (cont'd)
When could this happen?
This could happen whenever memory data bound for the stack is first loaded into a register and then pushed, instead of being pushed directly.
Our prediction for implementation:
We expect performance to be affected by the increased register pressure.

Instruction Scheduling
De-optimization #3 - Loop Re-rolling
Scenario
Loops not only affect branch prediction; they can also affect dynamic scheduling. How? Let instructions 1 and 2 be within loops A and B, respectively. Instructions 1 and 2 could be part of a unified loop. If they were, then they could be scheduled together. Yet, they are separate and cannot be.
In Practice - The Opteron
Given that the Opteron is 3-way superscalar, this de-optimization could significantly reduce IPC.
De-optimization #3 - Loop Re-rolling (cont'd)
When could this happen?
Easily. In C, this would be two consecutive loops, each containing one or more instructions, such that the loops could be combined.
Our prediction for implementation
We expect this de-op to be clearly reflected in an increase in clock cycles.
Example:
--- Version 1 (unified loop) ---
    for( i = 0; i < n; i++ ) {
        quadratic_array[i] = i * i;
        cubic_array[i] = i * i * i;
    }
--- Version 2 (re-rolled into separate loops) ---
    for( i = 0; i < n; i++ ) {
        quadratic_array[i] = i * i;
    }
    for( i = 0; i < n; i++ ) {
        cubic_array[i] = i * i * i;
    }
Instruction Type Usage
- Store-to-Load Dependency
- Costly Instructions

Instruction Type Usage
De-optimization #1 - Store-to-Load Dependency
Scenario
A store-to-load dependency takes place when stored data needs to be loaded again shortly after it is stored.
This pattern is common.
This type of dependency increases the pressure on the load/store unit and might cause the CPU to stall, especially when the dependency occurs frequently.
Example:
for (k=1;k