apple llvm gpu compilersllvm.org/devmtg/2017-10/slides/chandrasekaran... · apple llvm gpu...

71
Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Upload: others

Post on 26-May-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Apple LLVM GPU Compiler: Embedded Dragons

Charu Chandrasekaran, Apple Marcello Maggioni, Apple

1

Page 2: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

• How Apple uses LLVM to build a GPU Compiler

• Factors that affect GPU performance

• The Apple GPU compiler

• Pipeline passes

• Challenges

2

Agenda

Page 3: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

• Live on Trunk and merge continuously

• Benefit from latest improvements on trunk

• Identify any regressions immediately and report back

• Minimize changes to open source llvm code

• Reuse as much as possible

3

How Apple uses LLVM

Page 4: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Continuous Integration

4

LLVM Trunk

GPU Compiler

Year 1 production compiler

Page 5: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

5

LLVM Trunk

GPU Compiler

Continuous Integration

Year 1 production compiler

Page 6: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

LLVM Trunk

36

GPU Compiler

Continuous Integration

Year 1 production compiler

Year 2 production compiler

Year 3 production compiler

Page 7: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

• Regression testing involves:

• register count

• instruction count

• FileCheck : correctness

• compile time

• compiler size

• runtime performance

7

Testing

Page 8: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

App

The GPU SW Stack

iOS / watchOS / tvOS Process

Metal-FEXPC Service

Result

User

.metal

Shader

.obj.exec

IR

IR

8

BackendXPC Service

Metal Framework, GPU Driver

Interacts

Page 9: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

About GPUs

9

Page 10: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

GPUs are massively parallel vector processors

Threads are grouped together and execute in lockstep (they share the same PC)

About GPUs

10

Shader CorePC

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

Page 11: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

The parallelism is implicit, a single thread looks like normal CPU code

float kernel(float a, float b) { float c = a + b; return c; }

11

About GPUs

Shader CorePC

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

Page 12: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

The parallelism is implicit, a single thread looks like normal CPU code

float8 kernel(float8 a, float8 b) { float8 c = add_v8(a, b); return c; }

12

About GPUs

Shader CorePC

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

Page 13: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Multiple groups of threads are resident on the GPU at the same time for latency hiding

Shader Core

About GPUs : Latency hiding

float kernel(struct In_PS ) { float4 color = texture_fetch();

float4 c = In_PS.a * In_PS.b; … float4 d = c + color;

… }

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

PCPC

13

PC PC

Page 14: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

The GPU picks up work from the various different groups of threads to hide the latency from the other groups

Shader Core

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

PCPCPCfloat kernel(struct In_PS ) { float4 color = texture_fetch();

float4 c = In_PS.a * In_PS.b; … float4 d = c + color;

… }

14

About GPUs : Latency hiding

PC

Page 15: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

The GPU picks up work from the various different groups of threads to hide the latency from the other groups

Shader Core

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

PCPCPCfloat kernel(struct In_PS ) { float4 color = texture_fetch();

float4 c = In_PS.a * In_PS.b; … float4 d = c + color;

… }

15

About GPUs : Latency hiding

PC

Page 16: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

The GPU picks up work from the various different groups of threads to hide the latency from the other groups

Shader Core

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

float kernel(struct In_PS ) { float4 color = texture_fetch();

float4 c = In_PS.a * In_PS.b; … float4 d = c + color;

… }

16

About GPUs : Latency hiding

PCPCPC PC

Page 17: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Shader Core

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

PCPCPC PC

17

About GPUs : Latency hiding

The GPU picks up work from the various different groups of threads to hide the latency from the other groups

float kernel(struct In_PS ) { float4 color = texture_fetch();

float4 c = In_PS.a * In_PS.b; … float4 d = c + color;

… }

Page 18: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Shader Core

Registers per lane

Registers per thread

18

About GPUs: Register file

The groups of threads share a big register file that is split between the threads

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

PCPCPC PC

0b0a 0c 0d

4b4a 4c 4d

6b6a 6c 6d

2b2a 2c 2d

1b1a 1c 1d

3b3a 3c 3d

5b5a 5c 5d

7b7a 7c 7d

Register File

ba c d

Page 19: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

19

About GPUs: Register file

The number of registers used per-thread impact the number of resident group of threads on the machine (occupancy)

Shader Core

Registers per thread

PC

2 3

6 7

0 1

4 5

Register File

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

Page 20: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

This in turn will impact the latency hiding capability

20

About GPUs: Register file

Shader Core

Registers per thread

PC

2 3

6 7

0 1

4 5

Register File

LANE 0 LANE 1 LANE 2 LANE 3 LANE 4 LANE 5 LANE 6 LANE 7

VERY IMPORTANT!

Page 21: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

The huge register file and number of concurrent threads makes spilling pretty costly

21

About GPUs: Spilling

Register File L1$

Page 22: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Example (spilling 1 register): 1024 threads x 32-bit register = 4 KB !

Register File L1$

22

About GPUs: Spilling

The huge register file and number of concurrent threads makes spilling pretty costly

Spilling is typically not an effective way of reducing register pressure to increase occupancy and should be avoided at all costs

Page 23: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Pipeline

23

Page 24: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

We support function calls and we try to exploit them

Like most GPU programming models though, we can inline everything if we want

Unoptimized IR

Inlining

24

Inlining

All functions + main kernel linked together in a single module

Page 25: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

I-Cache savings!

25

Not inlining showed significant speedup on some shaders where big functions were called multiple times

Inlining

Page 26: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Dead Arg Elimination

Get rid of dead arguments to functions

26

Inlining

Page 27: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Convert to pass by value as many objects as we can

27

Dead Arg Elimination

Argument Promotion

Inlining

Page 28: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Proceed to the actual inlining

28

Inlining

Dead Arg Elimination

Argument Promotion

Inlining

Page 29: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Inlining decision based on standard LLVM inlining policy + custom threshold + additional constraints

29

Inlining

Dead Arg Elimination

Argument Promotion

Inlining

Page 30: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Inlining

int function(int addrspace(stack)* v) { … }

int function(int addrspace(constant)* v) { … }

We force inline these cases

30

Objective of our inlining policy is to be very conservative while trying exploit cases where we can keep a function call can benefit us potentially a lot

Custom policies try to minimize the impact that not inlining could have on other key optimizations for performance (SROA, Buffer preloading)

Page 31: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

int callee() { add r1, r2, r3 ret }

int caller () { mul r4, r1, r3 push r4 call callee() pop r4 add r1, r1, r4 }

int callee() { add r1, r2, r3 ret }

int caller () { mul r4, r1, r3 push r4 call callee() pop r4 add r1, r1, r4 }

Without IPRA With IPRA

31

The new IPRA support in LLVM has been key in avoiding pointless calling convention register store/reload

Inlining

Page 32: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Inlining

SROA

32

SROA

Argument Promotion

Page 33: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

33

Inlining

SROA

SROA

We run it multiple times in our pipeline in order to be sure that we promote as many allocas to register values as possibleInlining

SROA

Argument Promotion

Page 34: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Inlining

SROA

Alloca Opt

int function(int i) { int a[4] = { x, y, z, w }; … … = a[i]; }

34

Alloca Opt

Argument Promotion

Page 35: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Inlining

SROA

Alloca Opt

int function(int i) { int a[4] = { x, y, z, w }; … … = i == 0 ? x : (i == 1 ? y : i == 2 ? z : w); }

Less stack accesses!

35

Alloca Opt

Argument Promotion

Page 36: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

SROA

Alloca Opt

Loop Unrolling

36

Loop Unrolling

Page 37: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Completely unrolling loops allows SROA to remove stack accesses

If we have dynamic memory access to stack or constant memory that we can promote to uniform memory we want to greatly increase the unrolling thresholds

int a[5] = { x, y, z, w, q }; int b = 0;

for (int i = 0; i < 5; ++i) { b += a[i]; }

int a[5] = { x, y, z, w, q }; int b = x; b += y; b += z; b += w; b += q;

37

Loop Unrolling

Page 38: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

We also keep track of register pressure

Our scheduler is very eager to try and help latency hiding by moving most of memory accesses at the top of the shader (and is difficult to teach it otherwise) so we limit unrolling when we detect we could blow up the register pressure

for (int i = 0; i < 5; ++i) { float4 a = texture_fetch(); float4 b = texture_fetch(); float4 c = texture_fetch(); float4 d = texture_fetch(); float4 e = texture_fetch();

// Math involving the above }

38

Loop Unrolling

Page 39: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

We allow partial unrolling if we detect a static loop count and the loop would be bigger than our unrolling threshold

for (int i = 0; i < 16; ++i) { float4 a = texture_fetch();

// Math involving the above }

for (int i = 0; i < 4; ++i) { float4 a1 = texture_fetch(); float4 a2 = texture_fetch(); float4 a3 = texture_fetch(); float4 a4 = texture_fetch(); … // Unrolled 4 times }

39

Loop Unrolling

Page 40: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Loop Unrolling

Flatten CFGif (val == x) { a = v + z; c = q + a; } else { b = v * z; c = q * b; } … = c;

40

Flatten CFG

Speculation helps in creating bigger blocks for the scheduler to do a better job

and reduces the total overhead introduced by small blocks

Page 41: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Loop Unrolling

Flatten CFG

Speculation helps in creating bigger blocks for the scheduler to do a better job

and reduces the total overhead introduced by small blocks

41

Flatten CFG

a = v + z; c1 = q + a; b = v * z; c2 = q * b; c = (val == x) ? c1 : c2; … = c;

if (val == x) { a = v + z; c = q + a; } else { b = v * z; c = q * b; } … = c;

Page 42: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Flatten CFG

Uniformity Hoisting

42

GPUs are massively parallel, but often some computation in shader can be statically determined to be the same for all the threads

Some of these patterns are really convenient or difficult for the shader writer to extract from the program

Uniformity Hoisting

Page 43: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Flatten CFG

Uniformity Hoistingvoid kernel(constant float4 *A, constant bool *b global float *C) { float4 f_vec = *b ? *A : float4(1.0); … = f_vec * C[tid]; }

43

GPUs are massively parallel, but often some computation in shader can be statically determined to be the same for all the threads

Some of these patterns are really convenient or difficult for the shader writer to extract from the program

Uniformity Hoisting

Page 44: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Flatten CFG

Uniformity Hoisting

void uniform_kernel(constant float4 *A, constant bool *b) { // uni_f_vec lives in uniform memory uni_f_vec = *b ? *A : float4(1.0); } void kernel(constant float4 *A, constant bool *b global float *C) { … = uni_f_vec * C[tid]; }

44

We can move such computation to a program that runs at a lower rate (once)

Even one instruction is a lot of parallel work saved

Uniformity Hoisting

Page 45: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Flatten CFG

Uniformity Hoisting

void kernel(constant float4 *A, constant bool *b global float *C) { const int a[5] = { 3, 2, 1, 4, 2 }; … = a[i]; }

Never stored to

45

Some stack arrays that are initialized and never stored to (and haven’t been optimized away previously) can be turned into global loads instead

Uniformity Hoisting

Page 46: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Flatten CFG

Uniformity Hoisting

const int a[5] = { 3, 2, 1, 4, 2 };

void kernel(constant float4 *A, constant bool *b global float *C) { … = a[i]; }

46

File scope constants can be initialized more efficiently before running the program

In the stack also the array is replicated for every thread, while in global memory the array memory is shared by all the threads

Uniformity Hoisting

Page 47: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Uniformity Hoisting

CFG Structurization

A

B C

D

47

When control-flow is unstructured (e.g., a block is controlled by multiple predecessors) execution on GPUs require some special handling

CFG Structurization

Page 48: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Uniformity Hoisting

CFG Structurization

A

B C

D

48

Our backend supports full execution of unstructured control-flow handled at MI-level with little overhead

So we need only limited structurization (we require loops to be transformed in LoopSimplify form though)

CFG Structurization

Page 49: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

For relatively small unstructured blocks employ structurization based on duplication

Uniformity Hoisting

CFG Structurization

A

B C

D

C’

49

CFG Structurization

Page 50: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Uniformity Hoisting

CFG Structurization

50

A

B C

D

C’

We thought about employing the LLVM StructurizeCFG pass, but the way it translated control-flow wasn’t optimal for us (Higher register pressure on avg, more control instructions)

CFG Structurization

Page 51: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

We run a bunch of optimizations (multiple times) in between passes

InstCombineDCE

SCCP

CSE

SimplifyCFG

Reassociate

51

Misc. optimizations

Page 52: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Instruction Selection is one of the most expensive steps of our compilation pipeline

We use lots of custom combines to extract performance from our hardware

52

Instruction Selection

SelectionDAG FastISel

Page 53: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Takes between 15% to 35% of our compile time!

53

Instruction Selection is one of the most expensive steps of our compilation pipeline

We use lots of custom combines to extract performance from our hardware

Instruction Selection

SelectionDAG FastISel

Page 54: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

SelectionDAG FastISel

Takes between 15% to 35% of our compile time!

On some devices FastISel helps keeping compile time in check

54

Instruction Selection is one of the most expensive steps of our compilation pipeline

We use lots of custom combines to extract performance from our hardware

Instruction Selection

Page 55: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

SelectionDAGFastISelGlobalISel

Plan is to switch to GlobalISel in the near future as our main compiler ISel

The switch should give us a better infrastructure while improving compile time

55

Instruction Selection

Page 56: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Scheduling is key for exploiting ILP, improve latency hiding and reducing power consumption by reducing register accesses

We try to achieve the above while being very careful at not to cause register pressure problems

add r5, r0, r3 mul r7, r3, r4 sub r6, r5, r4 load r1 add r3, r1, r6 load r2 mul r4, r2, r3

56

Scheduling

Page 57: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Adding unrelated after memory accesses helps with in-thread latency hiding so that other instructions can be executed while the load or texture fetch results are ready

load r1 load r2 add r5, r0, r3 mul r7, r3, r4 sub r6, r5, r4 add r3, r1, r6 mul r4, r2, r3

Wait here for the loads

Independent operations here

57

Scheduling

Page 58: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

load r1 load r2 add r5, r0, r3 mul r7, r3, r4 sub r6, r5, r4 add r3, r1, r6 mul r4, r2, r3

58

Interleaving independent operations to improve ILP

Forwarding instruction results help reducing register file traffic (lower power)

This is pretty standard scheduling

Scheduling

Page 59: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

load r1 load r2 add r5, r0, r3 mul r7, r3, r4 sub r6, r5, r4 add r3, r1, r6 mul r4, r2, r3

59

Many other target specific policies are enforced, all aimed at improving ILP, latency hiding and power (for example grouping instructions by type), all of this while battling with register pressure

We are willing to spend a lot of compile time on scheduling

Scheduling

Page 60: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

60

Challenges

Page 61: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

• Being JITs GPU compilers care about compile time very much

61

Compile-time and being a JIT

Page 62: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

• Being JITs GPU compilers care about compile time very much

• We optimize our pipeline to obtain the best results with the least wasted compile time

62

Compile-time and being a JIT

Page 63: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Main offenders:

• Instruction Selection: 15% - 35% compile-time

• Scheduling: 5% - 15% compile-time

• Instruction combining: ~10% compile-time

• Register Allocation/Register Coalescing: ~10% compile-time

63

Compile-time and being a JIT

Page 64: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

• Being JITs GPU compilers care about compile time very much

• We optimize our pipeline to obtain the best results with the least wasted compile time

• Having a custom pipeline often times creates problems as changing the order of the passes can unveil nasty bugs that used to be hidden …

64

Compile-time and being a JIT

Page 65: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

• Being JITs GPU compilers care about compile time very much

• We optimize our pipeline to obtain the best results with the least wasted compile time

• Having a custom pipeline often times creates problems as changing the order of the passes can unveil nasty bugs that used to be hidden …

• We also reuse a single compiler instance for multiple compilations … this also uncovered some nasty bugs!

65

Compile-time and being a JIT

Page 66: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

66

Register definitions

Tuples of 4

Tuples of 2r0 r1

r0 r1 r2 r3

r0 r1 r2 r3 r4 r5 r6 r7 r8

r2 r3 r4 r5 r6 r7

r1 r2 r3 r4 r5 r6 r7 r8

r4 r5 r6 r7

r1 r2 r3 r4 r5 r6 r7 r8

Page 67: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Some instructions support complex input/output operands loaded in contiguous registers

GPUs typically support register tuples with overlapping tuple elements sharing many RUs

67

Register definitions

Tuples of 4

Tuples of 2r0 r1

r0 r1 r2 r3

r0 r1 r2 r3 r4 r5 r6 r7 r8

r2 r3 r4 r5 r6 r7

r1 r2 r3 r4 r5 r6 r7 r8

r4 r5 r6 r7

r1 r2 r3 r4 r5 r6 r7 r8

Page 68: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

This kind of register hierarchy generates a substantial amount of LLVM register definitions (one per each element of each tuple)

Tuples can go up to 16-wide on some architectures!

68

Register definitions

Tuples of 4

Tuples of 2r0 r1

r0 r1 r2 r3

r0 r1 r2 r3 r4 r5 r6 r7 r8

r2 r3 r4 r5 r6 r7

r1 r2 r3 r4 r5 r6 r7 r8

r4 r5 r6 r7

r1 r2 r3 r4 r5 r6 r7 r8

Page 69: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Algorithms that scale with the number or registers or iterate over all the registers containing a RU can take a hit

We had problem with IPRA implementation for example where in our case for determining the registers used by a function was O(N2) on the number of registers

69

Register definitions

Tuples of 4

Tuples of 2r0 r1

r0 r1 r2 r3

r0 r1 r2 r3 r4 r5 r6 r7 r8

r2 r3 r4 r5 r6 r7

r1 r2 r3 r4 r5 r6 r7 r8

r4 r5 r6 r7

r1 r2 r3 r4 r5 r6 r7 r8

Page 70: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

• LLVM has limited support for register pressure awareness

• IR passes largely ignore register pressure (example LICM)

• Machine-level has some register pressure estimation, but most passes care only if they are running out of registers

• For us increasing register pressure is potentially bad even if we don’t end up spilling as it reduces occupancy

70

Register pressure awareness

Page 71: Apple LLVM GPU Compilersllvm.org/devmtg/2017-10/slides/Chandrasekaran... · Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1

Q&A

71