

Optimizing Memory Allocation in C++

Sebastien Ponce

CERN

February 5th 2018

S. Ponce Optimizing Memory Allocation in C++ 1 / 41


Context

We are spending (far) too much time allocating and deallocating memory

Initially 25% of the total HLT1 time !


Why ?

Main issue

We are allocating too many small bits

We should allocate large chunks

Source of the problem

Our object model, full of containers of pointers

Plus our bad coding, not reserving the space


The solution

What to do

use containers of objects

and references, move semantics, emplace, ... so that things are never copied

reserve the full size of your container at creation

Why is it hard ?

Lots of code to be adapted

non-trivial C++ concepts at work


Outline

1 Basics of memory allocation

2 Detect suboptimal allocations

3 How to improve

Change containers

Adapt creation code

Adapt insertion code

Adapt read accesses


Foreword

the concepts and techniques presented here are generic

they apply to basically all containers

for simplicity, I’ll show them on vectors and maps


Basics of memory allocation


Basic container in memory

Simple vector case

std::vector<int> v;

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ...

Vector of objects

struct A { float x, y, z; };

std::vector<A> v;

A0: x0 y0 z0 | A1: x1 y1 z1 | A2: x2 y2 z2 | A3: x3 ...
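A minimal sketch of how one could check this layout, assuming the struct A from the slide. The helper names strideInBytes and isContiguous are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct A { float x, y, z; };

// Byte distance between the first two elements of the vector.
// std::vector stores its elements contiguously, so this is sizeof(A).
std::ptrdiff_t strideInBytes(const std::vector<A>& v) {
  assert(v.size() >= 2);
  const char* first  = reinterpret_cast<const char*>(&v[0]);
  const char* second = reinterpret_cast<const char*>(&v[1]);
  return second - first;
}

// True when element i sits exactly i elements after the start of the storage.
bool isContiguous(const std::vector<A>& v) {
  for (std::size_t i = 0; i < v.size(); ++i) {
    if (&v[i] != v.data() + i) return false;
  }
  return true;
}
```

With a std::vector<A> of 10 elements, isContiguous returns true and strideInBytes returns sizeof(A), typically 12 bytes for three floats.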


Container of pointers

Naive view

struct A { float x, y, z; };

std::vector<A*> v;

ptr0 ptr1 ptr2 ptr3 ptr4 ptr5 ptr6 ptr7 ptr8 ptr9 ...

Realistic view

ptr0 ptr1 ptr2 ptr3 ptr4 ptr5 ptr6 ptr7 ptr8 ptr9 ...

Each pointer refers to a separately allocated object (x0 y0 z0), (x1 y1 z1), ..., (x9 y9 z9), scattered across memory


Consequences : memory allocations

Number of allocations

vector<A> -> optimally 1 allocation

vector<A*> -> minimum n+1 allocations

What is an allocation ?

finding an empty piece of memory

going through a list/map of free blocks held by the allocator, which asks the Linux kernel for more memory when needed

and taking a lock to make it thread safe

So allocations are costly !
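The 1 versus n+1 count can be reproduced with a small counting allocator. This is only a sketch: CountingAllocator, allocationCount and the two helper functions are hypothetical names, and each A in the pointer case is obtained through the same allocator so that it gets counted.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

struct A { float x, y, z; };

static std::size_t allocationCount = 0;  // incremented on every allocate()

// Minimal allocator that forwards to std::allocator and counts the requests.
template <typename T>
struct CountingAllocator {
  using value_type = T;
  CountingAllocator() = default;
  template <typename U>
  CountingAllocator(const CountingAllocator<U>&) noexcept {}
  T* allocate(std::size_t n) {
    ++allocationCount;
    return std::allocator<T>{}.allocate(n);
  }
  void deallocate(T* p, std::size_t n) noexcept {
    std::allocator<T>{}.deallocate(p, n);
  }
};
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

// Container of objects: one block holds all n elements.
std::size_t allocationsForObjects(std::size_t n) {
  allocationCount = 0;
  std::vector<A, CountingAllocator<A>> v;
  v.reserve(n);
  for (std::size_t i = 0; i < n; ++i) v.push_back(A{1.f, 2.f, 3.f});
  return allocationCount;
}

// Container of pointers: one block for the pointers plus one block per object.
std::size_t allocationsForPointers(std::size_t n) {
  allocationCount = 0;
  CountingAllocator<A> objectAllocator;
  std::vector<A*, CountingAllocator<A*>> v;
  v.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    A* p = objectAllocator.allocate(1);
    v.push_back(new (p) A{1.f, 2.f, 3.f});
  }
  for (A* p : v) objectAllocator.deallocate(p, 1);  // A is trivially destructible
  return allocationCount;
}
```

For n = 10 the first function reports 1 allocation and the second reports 11, matching the counts on the slide.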


More consequences : reading data

Memory view for vector<A>

Each line corresponds to a cache line (64 bytes, 16 floats)

0x0000: x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 x5
0x0040: y5 z5 x6 y6 z6 x7 y7 z7 x8 y8 z8 x9 y9 z9 .  .
0x0080: (unused)
0x00C0: (unused)

One read from RAM to Level 1 Cache is enough (2 lines in one go)
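The "2 lines" figure follows from simple arithmetic, sketched below. The 64-byte line size is the one given on the slide; the function name cacheLinesTouched and the idea of passing a start address are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;  // line size quoted on the slide

// Number of cache lines covered by `count` objects of `objectBytes` bytes
// stored back to back, starting at byte address `startAddress`.
constexpr std::size_t cacheLinesTouched(std::size_t startAddress,
                                        std::size_t count,
                                        std::size_t objectBytes) {
  if (count == 0) return 0;
  const std::size_t firstLine = startAddress / kCacheLineBytes;
  const std::size_t lastLine =
      (startAddress + count * objectBytes - 1) / kCacheLineBytes;
  return lastLine - firstLine + 1;
}
```

Ten objects of 12 bytes starting at address 0 occupy bytes 0 to 119, so cacheLinesTouched(0, 10, 12) gives 2. A single 12-byte object that starts at byte 60 already straddles two lines.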


More consequences : reading data

Memory view for vector<A*>

Each line corresponds to a cache line (64 bytes, 16 floats)

Cache lines 0x0000, 0x0040, ..., 0x0240 (10 lines)

The pointers p0 ... p9 are stored together; the objects (x0 y0 z0) ... (x9 y9 z9) are scattered over the other lines, one or two objects per line

You need to read many lines, in several accesses

Remember that a RAM access costs on the order of 100 cycles


Detect suboptimal allocations


The main tools

vtune

uses internal processor counters to estimate what is going on in real execution

and in particular the (estimated) number of cycles spent in memory allocations

as well as the cache misses

best used in opt mode

callgrind

simulates a processor, which allows counting what is going on

and in particular the (estimated) number of cycles spent in memory allocations

as well as the cache misses

best used in dbg mode


vtune in practice

allow vtune to be found in your environment :

source /cvmfs/projects.cern.ch/intelsw/psxe/linux/18-all-setup.sh

run your program with a command line like this :

amplxe-cl -collect hotspots -start-paused -- \

./Brunel/run gaudirun.py MiniBrunelHLT1fast.py

-start-paused lets vtune start collecting only when needed, but requires the option mbrunel.IntelProfile = True in the MiniBrunel config file

use enough events, typically >10000 for MiniBrunel HLT1 only

you will get a directory called r000hs

visualize results with

amplxe-gui r000hs


vtune in practice

you get a summary with hotspots, out of which new

go to bottom-up tab


vtune in practice

new is indeed in the top CPU consumers

click on the triangle on the left to see who calls it


vtune in practice

many culprits ! Choose yours and go to the caller-callee tab

find it on the right (Ctrl-f) and click on it


vtune in practice

I’ve chosen FTMeasurementProvider::measurement

one can see where it's called, and where it spends its time


vtune in practice

And the culprit is...

unsigned int HLT1Fitter::fit(LHCb::Track& ...) {

...

// Store results of the Kalman fit

std::vector<LHCb::Measurement*> measurements;

...

}


Callgrind in practice

run your program with a command line like this :

./Brunel/run \

/cvmfs/lhcbdev.cern.ch/tools/valgrind/3.12.0/x86_64-centos7/bin/valgrind \

--tool=callgrind --instr-atstart=no --dump-instr=yes --cache-sim=yes \

python $(./Brunel/run which gaudirun.py) MiniBrunelHLT1fast.py

--instr-atstart=no lets callgrind start instrumenting only when needed, but requires the option

mbrunel.CallgrindProfile = True

in the MiniBrunel config file

use only a few events, typically <1000

you will get 2 files, callgrind.out.<tid> and callgrind.out.<tid>.1

open the '.1' one with kcachegrind


callgrind in practice

you get a list of functions and time spent in each

search for new on top left and click on operator new

look at the bottom right panel for who calls new


callgrind in practice

one can double click on any item to see its callers/callees

one can display source code on the upper right panel

let’s look at FTMeasurementProvider::measurement


callgrind in practice

show where new is called

one can check the call graph


callgrind in practice

all comes from the Fitter, where you find the wrong container

unsigned int HLT1Fitter::fit(LHCb::Track& ...) const {

// Store results of the Kalman fit

std::vector<LHCb::Measurement*> measurements;


How to improve


Step 1 : change container to container of objects

prefer standard containers

especially, run away from KeyedContainer

be aware of the variety of containers and their specificities

e.g. do you know of flat_map or small_vector ?

in practice, you often only need to drop a * in the container definition. In our case :

std::vector<LHCb::Measurement*> measurements;

becomes

std::vector<LHCb::Measurement> measurements;


Step 1 : possible impact

if the container is a member of a class, you can then probably get rid of quite some code

destructor

copy/move constructors

assignment operators

all of these can now be defaulted, whereas before they had to release the content of the container
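A sketch of what this looks like in practice. Measurement is a hypothetical stand-in for LHCb::Measurement, and both track classes are invented for the comparison; they are not taken from the LHCb code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Measurement { double value; };  // hypothetical stand-in

// Before: the class owns raw pointers, so it needs hand-written special members.
class TrackWithPointers {
 public:
  TrackWithPointers() = default;
  TrackWithPointers(const TrackWithPointers& other) {
    m_measurements.reserve(other.m_measurements.size());
    for (const Measurement* m : other.m_measurements)
      m_measurements.push_back(new Measurement(*m));  // deep copy
  }
  TrackWithPointers& operator=(TrackWithPointers other) {  // copy and swap
    m_measurements.swap(other.m_measurements);
    return *this;
  }
  ~TrackWithPointers() {
    for (Measurement* m : m_measurements) delete m;
  }
  void add(double v) { m_measurements.push_back(new Measurement{v}); }
  double& at(std::size_t i) { return m_measurements[i]->value; }

 private:
  std::vector<Measurement*> m_measurements;
};

// After: destructor, copy and move operations are all compiler-generated.
class TrackWithObjects {
 public:
  void add(double v) { m_measurements.push_back(Measurement{v}); }
  double& at(std::size_t i) { return m_measurements[i].value; }

 private:
  std::vector<Measurement> m_measurements;
};
```

Copying a TrackWithObjects still gives an independent deep copy, with no code written for it.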


Step 2 : deal with container creation

should be trivial, using default

that’s ok for static containers where the size is fixed

std::array typically

not enough for growing containers

including std::vector, std::map, ...


How does a vector grow ?

struct A { float x, y, z; };

std::vector<A> v;

Construction

Default constructor creates an empty vector, with no storage

start = 0x0, finish = 0x0, end of storage = 0x0

First push

allocates storage for the first element only !

start = 0x1234, finish = 0x1240, end of storage = 0x1240

storage: x0 y0 z0


Second push

old storage at 0x1234: x0 y0 z0

new storage at 0x5678: x0 y0 z0 x1 y1 z1 (start = 0x5678, finish = 0x5690, end of storage = 0x5690)

1 allocate a new piece of memory for 2 items

2 copy existing content

3 write new content

4 update pointers

5 deallocate the original piece of memory


Third push

old storage at 0x5678: x0 y0 z0 x1 y1 z1

new storage at 0x9ABC: x0 y0 z0 x1 y1 z1 x2 y2 z2 plus one free slot (start = 0x9ABC, finish = 0x9AE0, end of storage = 0x9AEC)

1 allocate a new piece of memory for 4 items (the size doubles at each reallocation)

2 copy existing content

3 write new content

4 update pointers

5 deallocate the original piece of memory


How to make proper allocation for vectors

Why avoid the default ?

content of vectors is reallocated and copied when they grow

the first item of a 1000-element vector will be copied 10 times !

when reaching 1000 items, you will have copied 1023 items in total and allocated 11 pieces of memory, releasing 10
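These numbers can be checked with a small simulation of the growth policy described on the previous slides. It is a sketch under one assumption: capacity goes 0, 1, 2, 4, ... by doubling. Real standard libraries may use another growth factor. GrowthCost and simulateDoubling are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct GrowthCost {
  std::size_t allocations;
  std::size_t copies;
  std::size_t deallocations;
};

// Simulates `pushes` push_back calls on an initially empty vector whose
// capacity doubles each time it is full (assumed policy, see lead-in).
GrowthCost simulateDoubling(std::size_t pushes) {
  GrowthCost cost{0, 0, 0};
  std::size_t size = 0;
  std::size_t capacity = 0;
  for (std::size_t i = 0; i < pushes; ++i) {
    if (size == capacity) {
      ++cost.allocations;                         // new, larger block
      cost.copies += size;                        // existing items are copied over
      if (capacity != 0) ++cost.deallocations;    // old block is released
      capacity = (capacity == 0) ? 1 : 2 * capacity;
    }
    ++size;
  }
  return cost;
}
```

simulateDoubling(1000) yields 11 allocations, 1 + 2 + 4 + ... + 512 = 1023 copies and 10 deallocations, the figures quoted above.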

Solution

you can avoid all that thanks to reserve

std::vector<int> v;

v.reserve(1000);

ensures single allocation, no copies, no reallocations

start = 0x1234, finish = 0x1234, end of storage = 0x21d4 (room for 1000 ints of 4 bytes)
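The effect of reserve can be observed on a real std::vector by watching its capacity. A minimal sketch; countCapacityChanges is a hypothetical helper, and the number of changes without reserve depends on the standard library's growth policy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends n ints and counts how many times the capacity changes while doing
// so. Every capacity change during push_back means a reallocation.
std::size_t countCapacityChanges(std::size_t n, bool reserveFirst) {
  std::vector<int> v;
  if (reserveFirst) v.reserve(n);  // the single up-front allocation
  std::size_t changes = 0;
  std::size_t previousCapacity = v.capacity();
  for (std::size_t i = 0; i < n; ++i) {
    v.push_back(static_cast<int>(i));
    if (v.capacity() != previousCapacity) {
      ++changes;
      previousCapacity = v.capacity();
    }
  }
  return changes;
}
```

With reserveFirst set, the count is 0 for any n; without it, the vector reallocates at least once and usually several times.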


Step 3 : deal with insertions

What happens by default ?

1 std::vector<A> v;

2 v.reserve(10);

3 A tmp{args};

4 v.push_back(tmp);

What actually happens :

allocate space in the vector (line 2)

allocate space for the temporary A object (line 3)

call A constructor (line 3)

call copy constructor for A (line 4)

deallocate temporary A (end of scope)


Ok, but we have move

What is the default ?

std::vector<A> v;

v.reserve(10);

A tmp{args};

v.push_back(std::move(tmp));

What actually happens :

allocate space in the vector (line 2)

allocate space for the temporary A object (line 3)

call A constructor (line 3)

call move constructor for A (line 4)

can be much better than copy, e.g. for vectors

can be identical, e.g. for plain objects

deallocate temporary A (end of scope)

We would like to completely avoid the temporary object


Proper solution for vectors

In place construction

std::vector<A> v;

v.reserve(10);

v.emplace_back(args);

What actually happens :

allocate space in the vector

call constructor for A

using args as the constructor arguments

using the space allocated in the vector

For the record, this is using variadic templates, new in C++11
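The three insertion styles (copy, move, emplace) can be compared with a type that counts its own constructor calls. This is an illustrative sketch: Tracked, Counts and the three functions are hypothetical names, with Tracked standing in for A.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Counts { int constructions, copies, moves; };

// Stand-in for A that records which constructor was used.
struct Tracked {
  static Counts counts;
  int value;
  explicit Tracked(int v) : value(v) { ++counts.constructions; }
  Tracked(const Tracked& o) : value(o.value) { ++counts.copies; }
  Tracked(Tracked&& o) noexcept : value(o.value) { ++counts.moves; }
};
Counts Tracked::counts = {0, 0, 0};

Counts viaPushBackCopy() {
  Tracked::counts = {0, 0, 0};
  std::vector<Tracked> v;
  v.reserve(10);
  Tracked tmp{42};
  v.push_back(tmp);             // copy constructor
  return Tracked::counts;
}

Counts viaPushBackMove() {
  Tracked::counts = {0, 0, 0};
  std::vector<Tracked> v;
  v.reserve(10);
  Tracked tmp{42};
  v.push_back(std::move(tmp));  // move constructor
  return Tracked::counts;
}

Counts viaEmplaceBack() {
  Tracked::counts = {0, 0, 0};
  std::vector<Tracked> v;
  v.reserve(10);
  v.emplace_back(42);           // constructed directly in the vector's storage
  return Tracked::counts;
}
```

The first variant reports one construction and one copy, the second one construction and one move, and emplace_back a single construction with no temporary.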


In place construction and maps

Naive code

1 std::map<int,A> m;

2 std::pair<int,A> item(5, A(args));

3 m.insert(item);

Problems :

copy on line 2

copy on insertion on line 3

With emplace

1 std::map<int,A> m;

2 m.emplace(5, A(args));

Problem :

the pair is constructed in place, not A

still a move/copy for A
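The extra copies and moves in both map variants can be made visible with a counting value type. A sketch only: Tracked and Counts are hypothetical names, with Tracked standing in for A, and the exact split between copies and moves may vary with the standard library.

```cpp
#include <cassert>
#include <map>
#include <utility>

struct Counts { int constructions, copies, moves; };

// Stand-in for A that records which constructor was used.
struct Tracked {
  static Counts counts;
  int value;
  explicit Tracked(int v) : value(v) { ++counts.constructions; }
  Tracked(const Tracked& o) : value(o.value) { ++counts.copies; }
  Tracked(Tracked&& o) noexcept : value(o.value) { ++counts.moves; }
};
Counts Tracked::counts = {0, 0, 0};

// Naive code: a pair is built first, then inserted into the map.
Counts viaInsert() {
  Tracked::counts = {0, 0, 0};
  std::map<int, Tracked> m;
  std::pair<int, Tracked> item(5, Tracked(7));  // temporary transferred into the pair
  m.insert(item);                               // pair content copied into the map
  return Tracked::counts;
}

// emplace: the pair is built in place, but from an already constructed value.
Counts viaEmplace() {
  Tracked::counts = {0, 0, 0};
  std::map<int, Tracked> m;
  m.emplace(5, Tracked(7));                     // temporary still transferred once
  return Tracked::counts;
}
```

Both functions construct the value once, but viaInsert adds at least two copy/move operations and viaEmplace at least one.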


In place construction and maps

piecewise_construct

To solve this problem, std::pair has a dedicated constructor

this constructor takes 2 tuples, holding the arguments to construct key and value

std::map<int,A> m;
m.emplace(std::piecewise_construct,
          std::make_tuple(5),
          std::make_tuple(args));

so A is now constructed in place, in the pair inside the map

Is that optimal ?

close to, but not completely

we are copying args into the tuple now !


In place construction and maps

piecewise_construct + forward_as_tuple

this can be solved by using a tuple of references

std::map<int,A> m;
m.emplace(std::piecewise_construct,
          std::make_tuple(5),
          std::forward_as_tuple(args));

forward_as_tuple creates a tuple of references

this avoids copying args twice
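The difference between make_tuple and forward_as_tuple can be measured by counting copies of the constructor argument. In this sketch Arg is a hypothetical argument type that counts its copies, and A is reduced to a struct built from an Arg; neither comes from the original code.

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical constructor argument that counts how often it is copied.
struct Arg {
  static int copies;
  int v;
  explicit Arg(int value) : v(value) {}
  Arg(const Arg& o) : v(o.v) { ++copies; }
};
int Arg::copies = 0;

// Simplified A, constructed from an Arg taken by const reference.
struct A {
  int v;
  explicit A(const Arg& a) : v(a.v) {}
};

int copiesWithMakeTuple() {
  Arg::copies = 0;
  const Arg arg{7};
  std::map<int, A> m;
  m.emplace(std::piecewise_construct,
            std::make_tuple(5),
            std::make_tuple(arg));        // arg is copied into the tuple
  return Arg::copies;
}

int copiesWithForwardAsTuple() {
  Arg::copies = 0;
  const Arg arg{7};
  std::map<int, A> m;
  m.emplace(std::piecewise_construct,
            std::forward_as_tuple(5),
            std::forward_as_tuple(arg));  // tuple only holds a reference to arg
  return Arg::copies;
}
```

copiesWithMakeTuple reports at least one copy of the argument, while copiesWithForwardAsTuple reports none. Since C++17, std::map::try_emplace(key, args...) is another way to forward the value arguments without building the tuples by hand.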


Step 4 : deal with accesses

The easy part

std::vector<A*> v;
void bar(const A* a);
...
v[n]->foo();
bar(v[n]);

becomes

std::vector<A> v;
void bar(const A& a); // now const is real !
...
v[n].foo();
bar(v[n]);
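A compilable version of the "after" form, as a sketch. The member function sum() plays the role of foo(), and bar() and sumAll() are hypothetical; the point is only that elements are reached through references, without any pointer dereference or copy.

```cpp
#include <cassert>
#include <vector>

struct A {
  float x, y, z;
  float sum() const { return x + y + z; }  // hypothetical stand-in for foo()
};

// Takes the element by const reference: no copy, no null pointer to check.
float bar(const A& a) { return a.sum(); }

// Reads every element in place through a const reference.
float sumAll(const std::vector<A>& v) {
  float total = 0.f;
  for (const A& a : v) total += bar(a);
  return total;
}
```

For a vector holding {1, 2, 3} and {4, 5, 6}, sumAll returns 21.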


Conclusion

Memory allocations/deallocations are not cheap

Optimizing them will lead to ∼20% gain in HLT1

New code should take that into account

key words are :

container of objects

reserve

emplace

references
