Compiler optimizations based on call-graph flattening
Carlo Alberto Ferraris
professor Silvano Rivoira
Master of Science in Telecommunication Engineering
Third School of Engineering: Information Technology
Politecnico di Torino
July 6th, 2011

TRANSCRIPT

Page 1: Increasing complexities

Compiler optimizations based on call-graph flattening
Carlo Alberto Ferraris
professor Silvano Rivoira

Master of Science in Telecommunication Engineering
Third School of Engineering: Information Technology
Politecnico di Torino
July 6th, 2011

Page 2: Increasing complexities

Increasing complexities
Everyday objects are becoming
multi-purpose, networked, interoperable, customizable, reusable, upgradeable

Page 3: Increasing complexities

Increasing complexities
Everyday objects are becoming

more and more complex

Page 4: Increasing complexities

Increasing complexities
Software that runs smart objects is becoming
more and more complex

Page 5: Increasing complexities

Diminishing resources
Systems have to be resource-efficient

Page 6: Increasing complexities

Diminishing resources
Systems have to be resource-efficient

Resources come in many different flavours

Page 7: Increasing complexities

Diminishing resources
Systems have to be resource-efficient

Resources come in many different flavours
Power: especially valuable in battery-powered scenarios such as mobile, sensor, and third-world applications

Page 8: Increasing complexities

Diminishing resources
Systems have to be resource-efficient

Resources come in many different flavours
Power, density: a critical factor in data-center and product design

Page 9: Increasing complexities

Diminishing resources
Systems have to be resource-efficient

Resources come in many different flavours
Power, density, computational: CPU, RAM, storage, etc. often grow more slowly than the potential applications

Page 10: Increasing complexities

Diminishing resources
Systems have to be resource-efficient

Resources come in many different flavours
Power, density, computational, development: development time and costs should be as low as possible for low TTM and profitability

Page 11: Increasing complexities

Diminishing resources
Systems have to be resource-efficient

Resources come in many non-orthogonal flavours
Power, density, computational, development

Page 12: Increasing complexities

Do more with less

Page 13: Increasing complexities

Abstractions
We need to modularize and hide the complexity
Operating systems, frameworks, libraries, managed languages, virtual machines, …

Page 14: Increasing complexities

Abstractions
We need to modularize and hide the complexity
Operating systems, frameworks, libraries, managed languages, virtual machines, …

All of this comes with a cost: generic solutions are generally less efficient than ad-hoc ones

Page 15: Increasing complexities

Abstractions
We need to modularize and hide the complexity

Palm webOS
User interface running on HTML+CSS+JavaScript

Page 16: Increasing complexities

Abstractions
We need to modularize and hide the complexity

JavaScript PC emulator
Running Linux inside a browser

Page 17: Increasing complexities

Optimizations
We need to modularize and hide the complexity without sacrificing performance

Page 18: Increasing complexities

Optimizations
We need to modularize and hide the complexity without sacrificing performance

Compiler optimizations trade compilation time for development and execution time

Page 19: Increasing complexities

Vestigial abstractions
The natural subdivision of code into functions is maintained in the compiler and all the way down to the processor

Each function is self-contained, with strict conventions regulating how it relates to other functions

Page 20: Increasing complexities

Vestigial abstractions
Processors don’t care about functions; respecting the conventions is just additional work

Push the contents of the registers and the return address on the stack, jump to the callee; execute the callee, jump to the return address; restore the registers from the stack

Page 21: Increasing complexities

Vestigial abstractions
Many optimizations are simply not feasible when functions are present

int replace(int *ptr, int value) {
  int tmp = *ptr;
  *ptr = value;
  return tmp;
}

int A(int *ptr, int value) {
  return replace(ptr, value);
}

int B(int *ptr, int value) {
  replace(ptr, value);
  return value;
}

void *malloc(size_t size) {
  void *ret;
  // [various checks]
  ret = imalloc(size);
  if (ret == NULL)
    errno = ENOMEM;
  return ret;
}

// ...
type *ptr = malloc(size);
if (ptr == NULL)
  return NOT_ENOUGH_MEMORY;
// ...

Page 22: Increasing complexities

Vestigial abstractions
Many optimizations are simply not feasible when functions are present

interpreter_setup();
while (opcode = get_next_instruction())
  interpreter_step(opcode);
interpreter_shutdown();

function interpreter_step(opcode) {
  switch (opcode) {
    case opcode_instruction_A:
      execute_instruction_A();
      break;
    case opcode_instruction_B:
      execute_instruction_B();
      break;
    // ...
    default:
      abort("illegal opcode!");
  }
}

Page 23: Increasing complexities

Vestigial abstractions
Many optimization efforts are directed at working around the overhead caused by functions

Inlining clones the body of the callee into the caller; it is the optimal solution w.r.t. calling overhead, but it causes code-size increases and cache pollution, so it is useful only on small, hot functions

Page 24: Increasing complexities

Call-graph flattening

Page 25: Increasing complexities

Call-graph flattening
What if we dismiss functions during early compilation…

Page 26: Increasing complexities

Call-graph flattening
What if we dismiss functions during early compilation and track the control flow explicitly instead?


Page 29: Increasing complexities

Call-graph flattening
We get most benefits of inlining, including the ability to perform contextual code optimizations, without the code-size issues

Page 30: Increasing complexities

Call-graph flattening
We get most benefits of inlining, including the ability to perform contextual code optimizations, without the code-size issues

Where’s the catch?

Page 31: Increasing complexities

Call-graph flattening
The load on the compiler increases greatly, both directly due to CGF itself and indirectly due to subsequent optimizations

Worst-case complexity (number of edges) is quadratic w.r.t. the number of callsites being transformed, since in the worst case every flattened callsite can reach, and return through, every other; heuristics may help

Page 32: Increasing complexities

Call-graph flattening
During CGF we need to statically keep track of all live values across all callsites in all functions

A value is alive if it will be needed by subsequent instructions:

A = 5, B = 9, C = 0;  // live: A, B
C = sqrt(B);          // live: A, C
return A + C;

Page 33: Increasing complexities

Call-graph flattening
Basically, the compiler has to statically emulate ahead-of-time all the possible stack usages of the program

This has already been done on microcontrollers, where it resulted in a 23% decrease in stack usage (and a 5% performance increase)

Page 34: Increasing complexities

Call-graph flattening
The indirect cause of increased compiler load comes from standard optimizations that are run after CGF

CGF does not create new branches (each call and return instruction is turned into exactly one jump), but other optimizations can

Page 35: Increasing complexities

Call-graph flattening
The indirect cause of increased compiler load comes from standard optimizations that are run after CGF

Most optimizations are designed to operate on small functions with limited numbers of branches

Page 36: Increasing complexities

Call-graph flattening
Many possible application scenarios besides inlining

Page 37: Increasing complexities

Call-graph flattening
Many possible application scenarios besides inlining

Code motion
Move instructions across function boundaries; avoid unneeded computations, alleviate register pressure, improve cache locality

Page 38: Increasing complexities

Call-graph flattening
Many possible application scenarios besides inlining

Code motion, macro compression
Find similar code sequences in different parts of the code and merge them; reduce code size and cache pollution

Page 39: Increasing complexities

Call-graph flattening
Many possible application scenarios besides inlining

Code motion, macro compression, nonlinear CF
CGF natively supports nonlinear control flows; almost-zero-cost exception handling and coroutines

Page 40: Increasing complexities

Call-graph flattening
Many possible application scenarios besides inlining

Code motion, macro compression, nonlinear CF, stackless execution
No runtime stack is needed in fully-flattened programs

Page 41: Increasing complexities

Call-graph flattening
Many possible application scenarios besides inlining

Code motion, macro compression, nonlinear CF, stackless execution, stack protection
Effective stack-poisoning attacks become much harder or even impossible

Page 42: Increasing complexities

Implementation
To test whether CGF is applicable to complex architectures as well, and to validate some of the ideas presented in the thesis, a pilot implementation was written against the open-source LLVM compiler framework

Page 43: Increasing complexities

Implementation
Operates on LLVM IR; host- and target-architecture agnostic; roughly 800 lines of C++ code in 4 classes

The pilot implementation cannot flatten recursive, indirect or variadic callsites; they can still be used anyway

Page 44: Increasing complexities

Implementation
Enumerate suitable functions
Enumerate suitable callsites (and their live values)
Create the dispatch function and populate it with code
Transform the callsites
Propagate live values
Remove the original functions or create wrappers

Page 45: Increasing complexities

int a(int n) {
  return n + 1;
}

int b(int n) {
  int i;
  for (i = 0; i < 10000; i++)
    n = a(n);
  return n;
}

Examples


Page 48: Increasing complexities

int a(int n) {
  return n + 1;
}

int b(int n) {
  n = a(n);
  n = a(n);
  n = a(n);
  n = a(n);
  return n;
}

Examples


Page 50: Increasing complexities

.type .Ldispatch,@function
.Ldispatch:
  movl $.Ltmp4, %eax     # store the return dispatcher of a in rax
  jmpq *%rdi             # jump to the requested outer disp.
.Ltmp2:                  # outer dispatcher of b
  movl $.LBB2_4, %eax    # store the address of %10
.Ltmp0:                  # outer dispatcher of a
  movl (%rsi), %ecx      # load the argument n in ecx
  jmp .LBB2_4
.Ltmp8:                  # block %17
  movl $.Ltmp6, %eax
  jmp .LBB2_4
.Ltmp6:                  # block %18
  movl $.Ltmp7, %eax
.LBB2_4:                 # block %10
  movq %rax, %rsi
  incl %ecx              # n = n + 1
  movl $.Ltmp8, %eax
  jmpq *%rsi             # indirectbr
.Ltmp4:                  # return dispatcher of a
  movl %ecx, (%rdx)      # store the return value in ecx through
  ret                    # pointer rdx and return to the wrapper
.Ltmp7:                  # return dispatcher of b
  movl %ecx, (%rdx)
  ret

Page 51: Increasing complexities

Fuzzing
To stress-test the pilot implementation and to perform benchmarks, a tunable fuzzer has been written

int f_1_2(int a) {
  a += 1;
  switch (a % 3) {
    case 0: a += f_0_2(a); break;
    case 1: a += f_0_4(a); break;
    case 2: a += f_0_6(a); break;
  }
  return a;
}

Page 52: Increasing complexities
Page 53: Increasing complexities

Benchmarks
Due to the shortcomings in the optimizations currently available in LLVM, the only meaningful benchmarks that can be done are those concerning code size and stack usage

In the literature, average code-size increases of 13% due to CGF have been reported

Page 54: Increasing complexities

Benchmarks
Using our tunable fuzzer, different programs were generated and key statistics of the compiled code were gathered


Page 56: Increasing complexities

Benchmarks
In short, when the optimizations work, the resulting code size is better than that reported in the literature

Page 57: Increasing complexities

Benchmarks
In short, when the optimizations work, the resulting code size is better than that reported in the literature

When they don’t, the register spiller and allocator perform so badly that most instructions simply shuffle data around on the stack

Page 58: Increasing complexities

Benchmarks

Page 59: Increasing complexities

Next steps
Reduce live-value verbosity
Alternative indirection schemes
Tune the available optimizations for CGF constructs
Better register spiller and allocator
Ad-hoc optimizations (code threader, adaptive flattening)
Support recursion and indirect calls; better wrappers

Page 60: Increasing complexities

Conclusions
“Do more with less”: optimizations are required
CGF removes unneeded overhead due to low-level abstractions and enables powerful global optimizations

Benchmark results of the pilot implementation are better than those in the literature, when the available LLVM optimizations can cope

Page 61: Increasing complexities

Compiler optimizations based on call-graph flattening
Carlo Alberto Ferraris
professor Silvano Rivoira

Page 62: Increasing complexities
Page 63: Increasing complexities
Page 64: Increasing complexities

.type wrapper,@function
wrapper:
  subq $24, %rsp         # allocate space on the stack
  movl %edi, 16(%rsp)    # store the argument n on the stack
  movl $.Ltmp0, %edi     # address of the outer dispatcher
  leaq 16(%rsp), %rsi    # address of the incoming argument(s)
  leaq 12(%rsp), %rdx    # address of the return value(s)
  callq .Ldispatch       # call the dispatch function
  movl 12(%rsp), %eax    # load the return value from the stack
  addq $24, %rsp         # deallocate the stack space
  ret                    # return