optimizing builds - llvm

58
OPTIMIZING BUILDS ON WINDOWS SOME PRACTICAL CONSIDERATIONS Alexandre Ganea, Ubisoft [email protected] 2019 Bay Area LLVM Developers' Meeting, Oct.22-23 1

Upload: others

Post on 02-Dec-2021

14 views

Category:

Documents


0 download

TRANSCRIPT

OPTIMIZING BUILDSON WINDOWS

SOME PRACTICAL CONSIDERATIONS

Alexandre Ganea, [email protected]

2019 Bay Area LLVM Developers' Meeting, Oct.22-231

2

SUMMARY

PART 1

PREAMBLE

PART 2

EXPERIMENTS

PART 3

PROPOSAL

PART 4

NEXT STEPS

PART 1

PREAMBLE:CHALLENGES

3

4PART 1 – PREAMBLE

Lines of Code(Assassins’ Creed, Far Cry)

5PART 1 – PREAMBLE

Editor Build 20,000 .CPP25,000 .H

23 GB .OBJ9 GB .DEBUG$T

10 M TYPE RECORDS42 M SYMBOLS

300 M .EXE2 GB .PDB

Windows 10Fastbuild, distributedAlways Unity builds

Concurent AAA games 20 – 25LoC/game 30 - 50 M

Programmers/title 100 – 250Code Changes/day 100 – 150

(peak:400)

Build targets/platform 5 – 6Platforms/Game 4+

Code workspace 70 - 100 GBData workspace 100 - 200 GB

Game builds/day 100 – 150Stripped Build 1 - 6 GBFinal Build 50 - 90 GB

Game production constraints @ Ubisoft

6PART 1 – PREAMBLE

08 min 50 sec

08 min 33 sec

08 min 50 sec

08 min 20 sec

07 min 00 sec

04 min 00 sec

04 min 15 sec

10 min 20 sec

06 min 46 sec

01 min 18 sec

43 sec

29 sec

29 sec

29 sec

00 min 00 sec 02 min 53 sec 05 min 46 sec 08 min 38 sec 11 min 31 sec 14 min 24 sec 17 min 17 sec 20 min 10 sec

2017 (MSVC)

2018 (MSVC)

Fall 2018 (MSVC + LLD)

2019 (MSVC + LLD)

2019 (Clang)

100% cache hit, local SSD

100% cache hit, 1 Gpbs network

AAA GAME, CLEAN REBUILDX64 EDITOR RELEASE (FASTBUILD)

Compiler Linker

PART 2

EXPERIMENTS

7

2.1Clang-scan-deps &

Fastbuild cache

8PART 2 – EXPERIMENTS

9

clang-cl /E md5sum curl https://store/ clang-cl

clang-scan-deps

while read x; do

md5sum $x;

done

deps.txt

a.cpp

a.cpp

found

not found

5-10 sec

0.02 sec 0.02 sec

FASTBUILD CACHE READ ALGORITHM

PART 2 – EXPERIMENTS

deps+MD5.txt

10

06 min 10 sec

04 min 05 sec

35 sec

40 sec

40 sec

40 sec

VS2017 15.9.16

Network cache

Network cache + clang-scan-deps

100% NETWORK CACHE HITSAAA GAME, X64 EDITOR RELEASE (FASTBUILD)

Compiler/Cache Linker

PART 2 – EXPERIMENTS

11

clang-scan-deps+ network cache

LLD(MSVC OBJs + ghash)

Intel Xeon W-2135 @ 3.7 GHz, 128 GB, NVMe SSD, 1Gbps Network

7 GB –> 22.6 GB50k files

PART 2 – EXPERIMENTS

(ms)

2.2StringMap

12PART 2 – EXPERIMENTS

Title of Document 13

11.5% process time

CLANG-SCAN-DEPS STANDALONE(50K FILES)

avg ~90% cpu

Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD

14PART 2 – EXPERIMENTS

STRINGMAP

15PART 2 – EXPERIMENTS

STRINGMAP

16PART 2 – EXPERIMENTS

sizeof(std::error_code) -> 16 bytes

sizeof(llvm::ErrorOr<DirectoryEntry&>) -> 24 bytes

sizeof(llvm::StringMapEntry<llvm::ErrorOr<DirectoryEntry&>>) –> 32 bytes (+string contents)

DOWN THE RABBIT HOLE

17

nullptr

nullptr

nullptr

0x15f238a92

nullptr

nullptr

nullptr

NumBuckets 0

0

0

0x12345678

0

0

0

uint32_t

StringMapEntry*

NumBuckets

count

value

string

size_t

T

count

PART 2 – EXPERIMENTS

STRINGMAP: MEMORY LAYOUT

18PART 2 – EXPERIMENTS

STRINGMAP (VTUNE)

19

0.0%

10.0%

20.0%

30.0%

40.0%

50.0%

60.0%

70.0%

1 5 9 13 17 21 25 29 33 37 41 45 49

60

.2%

14.7

%8

.0%

5.3

%3

.5%

1.7%

0.8

%0

.5%

0.2

%0

.1%0

.1%

Hash collisions / call

0.0%

10.0%

20.0%

30.0%

40.0%

50.0%

60.0%

70.0%

80.0%

90.0%

1 5 9 13 17 21 25 29 33 37 41 45

79.4

%11

.7%

3.5

%1.6

%1.0

%0

.5%

0.3

%0

.2%

0.1%

0.1%

Cachelines hit / call

187 M samplesPART 2 – EXPERIMENTS

STRINGMAP STATS

20PART 2 – EXPERIMENTS

DenseMap<uint64_t,T> + xxHash64()+ StringSaver

21PART 2 – EXPERIMENTS

DenseMap<__int128,T> + XXH128()+ StringSaver

2.4Multithreading LLD

(COFF driver)

23PART 2 – EXPERIMENTS

24

LINK AAA GAME, X64 EDITOR RELEASE(22.8GB MSVC OBJS)

VS2019 16.2

LLD 9.0

LLD 8 + // GHASH

Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD

PART 2 – EXPERIMENTS

58 sec

62 sec

49 sec

25

19.21 s

7.42 s

5.29 s

4.20 s

.0 s 5.0 s 10.0 s 15.0 s 20.0 s 25.0 s

Clang 9.0, no Ghash

Clang 8.0 + // Ghash (12-byte buckets)

Clang 8.0 + // Ghash (8-byte buckets)

Clang 8.0 + // Ghash (8-byte buckets) + 2MB pages

GHash

uint64_t

TypeIndex

uint32_t

GHash

uint64_t

TypeIndex

PART 2 – EXPERIMENTS

2.3Process Creation

26PART 2 – EXPERIMENTS

27PART 2 – EXPERIMENTS

COMPILING WITH CLANG 9.0

28

93 ms

PART 2 – EXPERIMENTS

CLANG CC1 IN PROCMON

29

int main(int argc_, const char **argv_) {noteBottomOfStack();llvm::InitLLVM X(argc_, argv_);SmallVector<const char *, 256> argv(argv_, argv_ + argc_);

if (llvm::sys::Process::FixupStandardFileDescriptors())return 1;

llvm::InitializeAllTargets();return ClangDriverMain(argv);

}

int ClangDriverMain(SmallVectorImpl<const char *>& argv) {static LLVM_THREAD_LOCAL bool EnterPE = true;if (EnterPE) {llvm::sys::DynamicLibrary::AddSymbol("ClangDriverMain", (void*)(i..EnterPE = false;

} else {llvm::cl::ResetAllOptionOccurrences();

}

auto TargetAndMode = ToolChain::getTargetAndModeFromProgramName(arg..

clang/tools/driver/driver.cpp int Command::Execute(ArrayRef<llvm::Optional<StringRef>> Redirects,std::string *ErrMsg, bool *ExecutionFailed) const {

[...]typedef int (*ClangDriverMainFunc)(SmallVectorImpl<const char *> &);ClangDriverMainFunc ClangDriverMain = nullptr;

[...]if (ClangDriverMain) {[...]llvm::CrashRecoveryContext CRC;CRC.EnableExceptionHandler = true;

const void *PrettyState = llvm::SavePrettyStackState();

int Ret = 0;auto ExecuteClangMain = [&]() { Ret = ClangDriverMain(Argv); };

if (!CRC.RunSafely(ExecuteClangMain)) {llvm::RestorePrettyStackState(PrettyState);return CRC.RetCode;

}return Ret;

} else {auto Args = llvm::toStringRefArray(Argv.data());return llvm::sys::ExecuteAndWait(Executable, Args, Env, Redirects,

/*secondsToWait*/ 0,/*memoryLimit*/ 0, ErrMsg,ExecutionFailed);

}}

clang/lib/driver/Job.cpp

PART 2 – EXPERIMENTS

MAKING CC1 REENTRANT

30PART 2 – EXPERIMENTS

CLANG DRIVER & CC1 MERGED

31

34 min 00 sec

28 min 00 sec

12 min 00 sec

32 min 30 sec

30 min 16 sec

13 min 10 sec

22 min 46 sec

19 min 54 sec

07 min 10 sec

6-core - W10 build 1803

6-core - W10 build 1903

36-core - W10 build 1709

BYPASSING THE CC1 PROCESSCLEAN REBUILD LLVM, CLANG & LLD

VS2019 16.2 Clang 9.0 Clang 9.0 + cc1 bypass

PART 2 – EXPERIMENTS

2.5CRT Allocator

32PART 2 – EXPERIMENTS

33

LINKING RAINBOW6: SIEGE WITH THINLTO :-(

96% idle

4%

PART 2 – EXPERIMENTS

34PART 2 – EXPERIMENTS

THINLTO: ALLOCATOR CONTENTION

35

$ LD_PRELOAD=/path/to/my/malloc.so /bin/ls

#include "rpmalloc/rpmalloc.c"

extern "C" {_ACRTIMP _CRTRESTRICT void *malloc(size_t size) {

return rpmalloc(size);}

_ACRTIMP void free(void *p) { rpfree(p); }

_ACRTIMP _CRTRESTRICT void *calloc(size_t n, size_t elem_size) {return rpcalloc(n, elem_size);

}_ACRTIMP _CRTRESTRICT void *realloc(void *ptr, size_t size) {

return rprealloc(ptr, size);}}

// Bypass CRT debug allocator#ifdef _DEBUGvoid *operator new(decltype(sizeof(0)) n) noexcept(false) { return malloc(n); }void __CRTDECL operator delete(void *const block) noexcept { free(block); }

void *operator new[](std::size_t s) throw(std::bad_alloc) { return malloc(s); }void operator delete[](void *p) throw() { free(p); }#endif

https://github.com/mjansson/rpmalloc

llvm/lib/Support/Windows/Memory.inc

PART 2 – EXPERIMENTS

REPLACING THE CRT ALLOCATOR

36

57 min 00 sec

20 min 13 sec

16 min 19 sec

37 min 12 sec

> 1 h 30 min

03 min 57 sec

VS 2017 15.9.16

Clang 9.0 ThinLTO

Clang 9.0 ThinLTO + rpmalloc

THINLTO (CLEAN REBUILD)RAINBOW 6: SIEGE, PC GAME PROFILE

6-core (W10 build 1903) 36-core (W10 build 1709)

PART 2 – EXPERIMENTS

PART 3

PROPOSALPROOF-OF-CONCEPT

37

PART 3 – PROPOSAL 38

FASTBUILD

clang.exe

lld-link.exe

llvm-tblgen.exe

clang-tblgen.exe

llvm-lib.exe

ml64.exe (masm)

rc.exe

cmake.exe

PREVIOUS BUILD PROCESS

39PART 3 – PROPOSAL

Maybe there’s a better way

40

Image Credit: Caterpillar

PART 3 – PROPOSAL

PART 3 – PROPOSAL 41

FASTBUILD

ml64.exe (masm)

rc.exe

cmake.exe

LLVM-BUILDOZER

clang.exe

lld-link.exe

llvm-tblgen.exe

clang-tblgen.exe

llvm-lib.exe

BUILDING WITH BUILDOZER

PART 3 – PROPOSAL 42

FASTBUILD

ml64.exe (masm)

rc.exe

cmake.exe

Worker 1 Worker 2 Worker 3 Worker 4 Worker 5

Local

Local Local

43

int buildozer::ImportEXE(llvm::StringRef EXE) {[..]HINSTANCE H = LoadLibraryA(EXE.data());if (!H)return 0;

RemapIAT(H);InitDebInfo();PatchRPMalloc(M);InitializeStaticTLS(H);InitializeCRT(M);FindEntryPoints(M);[..]

}

PART 3 – PROPOSAL

RUNNING THE DOZER

”LoadLibrary can also be used to load other executable modules.[..]However, do not use LoadLibrary to run an .exe file.Instead, use the CreateProcess function.” (MSDN)

44

int buildozer::ImportEXE(llvm::StringRef EXE) {[..]HINSTANCE H = LoadLibraryA(EXE.data());if (!H)return 0;

RemapImportAddressTable(H);InitDebInfo();PatchRPMalloc(M);InitializeStaticTLS(H);InitializeCRT(M);FindEntryPoints(M);[..]

}

PART 3 – PROPOSAL

RUNNING THE DOZER

45

Pool.emplace(NumWorkers, [&]() {while (true) {

buildozer::WorkUnit *WU = AcquireWork(..);if (!WU)

break;

int Mod = IdentifyMOD(WU);

llvm::CrashRecoveryContext CRC;CRC.RunSafely([&] {

buildozer::Launch(Mod, WU->Directory, WU->Arguments);});[..]

}});Pool.join();

PART 3 – PROPOSAL

RUNNING THE DOZER

46PART 3 – PROPOSAL

RUNNING THE DOZER

47

19 min 34 sec

11 min 53 sec

Clang 9.0

Buildozer

Local build, AAA game, x64 Editor Release

PART 3 – PROPOSAL

Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD

PART 4

NEXT STEPS

48

SHORT TERM

PART 4– NEXT STEPS 49

• Remove OS jitter (in-RAM file content & stat cache)

• OBJ cache (in-RAM)

• Clang-LLD in-memory bridge

• Incrementally link along the way

• Incrementally compile along the way (SN Systems’ Program Repository)

• Remote API for distribution & caching

LONG TERM

PART 4– NEXT STEPS 50

BUILD TARGET

PLATFORM

LONG TERM

PART 4– NEXT STEPS 51

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

LONG TERM

PART 4– NEXT STEPS 52

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

LONG TERM

PART 4– NEXT STEPS 53

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

DAILY COMMITS

LONG TERM

PART 4– NEXT STEPS 54

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

DAILY COMMITS

ACTIVE BRANCHES

LONG TERM

PART 4– NEXT STEPS 55

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

DAILY COMMITS

ACTIVE BRANCHES

GAME PRODUCTION

LONG TERM

PART 4– NEXT STEPS 56

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

PLATFORM

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

BUILD TARGETBUILD TARGET

DAILY COMMITS

ACTIVE BRANCHES5 min

x6

x6

x100

x4

GAME PRODUCTIONx20

57

Is there a better way?

58

THANK YOU

Q&A

59

Alexandre Ganea, [email protected]