
The ABC’s of Atomics

An Introduction to std::atomic<T> and the C++11 Memory Model

The ABC’s of Atomics

• C++11 atomic operations library
    • Header <atomic>
    • Atomic types
        • std::atomic<T>
        • std::atomic_flag
    • Functions
        • Standalone fences
        • Dozens of functions for C compatibility

ATOMICITY

Data Races

A data race is a race condition that occurs if multiple threads concurrently access the same memory location, without synchronisation, and at least one of those accesses is a write.
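A minimal sketch of the definition in practice: several threads write to the same counter concurrently. With a plain int this would be a data race; making the counter a std::atomic<int> removes it. The helper name count_with_atomic and its parameters are assumptions made for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: 'threads' threads each increment one shared counter
// 'iterations' times. With 'int counter' the concurrent writes would be a
// data race (undefined behaviour); with std::atomic<int> each increment is
// an indivisible read-modify-write.
int count_with_atomic(int threads, int iterations)
{
    assert(threads >= 0 && iterations >= 0);
    std::atomic<int> counter(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i)
                ++counter;
        });
    }
    for (auto& th : pool)
        th.join();
    return counter.load();
}
```

Calling count_with_atomic(4, 10000) returns 40000 on every run, because no increment can be lost.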

Atomicity

Atomic (a-tomos, undividable) == indivisible, data-race-free

Atomic Load & Store

std::atomic<int> x;
void f(int);

// Stores
x.store(1);
x = 2;
std::atomic_store(&x, 3);       // C interoperability

// Loads
f(x.load());
f(x);
int i = x;
int j = std::atomic_load(&x);   // C interoperability

Atomic Load & Store

x = 1; assert(x == 1 || x == ~0ul);

std::atomic<std::uint64_t> x;

x = ~0ul;

Atomic Load & Store

x = 1; assert(x == 0 || x == 1 || x == ~0ul);

std::atomic<int> y(0);


x = ~0ul;

Atomic Load & Store

x = 1; auto r = x.load(); assert(r == 0 || r == 1 || r == ~0ul);


x = ~0ul;

Atomic Load & Store

x = 1;// ...

x = ~0ul;

assert(x == 0 || x == 1 || x == ~0ul);


Atomic Load & Store

For any (trivially copyable) type T

std::atomic<int> x;
std::atomic<long long int> y;

struct S { /*...*/ };
std::atomic<S> z;

assert(x.is_lock_free());   // Most platforms
assert(!z.is_lock_free());

Atomic Exchange


// Pseudocode: the whole body executes as one indivisible step.
T exchange(T newValue)
{
    T oldValue = load();
    store(newValue);
    return oldValue;
}

const auto before = x.exchange(123);
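One common use of exchange is a one-shot claim: many callers may try, but only the caller that swaps false for true wins. The sketch below assumes a hypothetical helper named try_claim; it is an illustration, not part of the original slides.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: returns true for exactly one caller, namely the one
// whose exchange() observes the old value 'false'.
bool try_claim(std::atomic<bool>& claimed)
{
    return !claimed.exchange(true);
}
```

Because the read of the old value and the write of the new value happen as one step, two threads can never both observe false.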

Atomic Compare Exchange


// Pseudocode: the whole body executes as one indivisible step.
bool compare_exchange(T& expected, T desired)
{
    const T currentValue = load();
    if (currentValue == expected) {
        store(desired);
        return true;
    } else {
        expected = currentValue;
        return false;
    }
}

Atomic Compare Exchange

• For any (trivially copyable) type T
• Powerful tool for lock-free data structures
• Two flavors
    • compare_exchange_strong(T& expected, T desired)
        • For use as standalone operation
    • compare_exchange_weak(T& expected, T desired)
        • May “fail” spuriously from time to time, even if the value equals expected
        • For use in a loop
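The loop idiom for compare_exchange_weak can be sketched as follows. std::atomic<int> has no atomic multiply, so the example builds one from a compare-exchange loop; fetch_multiply is a hypothetical name chosen for this illustration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically multiplies 'value' by 'factor' and returns
// the value seen just before the multiplication took effect.
int fetch_multiply(std::atomic<int>& value, int factor)
{
    int expected = value.load();
    // On failure (including a spurious one) compare_exchange_weak writes the
    // current value into 'expected', so each retry works with fresh data.
    while (!value.compare_exchange_weak(expected, expected * factor)) {
    }
    return expected;
}
```

A spurious failure only costs one extra iteration here, which is why the weak flavor is the usual choice inside a loop.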

Atomic Operations

For any (trivially copyable) type T
• Atomic load
• Atomic store
• Atomic exchange
• Atomic compare_exchange_strong
• Atomic compare_exchange_weak

Specialisations & Typedefs

std::atomic_bool        std::atomic<bool>
std::atomic_char        std::atomic<char>
std::atomic_schar       std::atomic<signed char>
std::atomic_uchar       std::atomic<unsigned char>
std::atomic_short       std::atomic<short>
std::atomic_ushort      std::atomic<unsigned short>
std::atomic_int         std::atomic<int>
std::atomic_uint        std::atomic<unsigned int>
std::atomic_long        std::atomic<long>
std::atomic_ulong       std::atomic<unsigned long>
std::atomic_llong       std::atomic<long long>
std::atomic_ullong      std::atomic<unsigned long long>
std::atomic_char16_t    std::atomic<char16_t>
std::atomic_char32_t    std::atomic<char32_t>
std::atomic_wchar_t     std::atomic<wchar_t>

Specialisations & Typedefs

std::atomic_int_least8_t     std::atomic<std::int_least8_t>
std::atomic_uint_least8_t    std::atomic<std::uint_least8_t>

std::atomic_int_fast8_t      std::atomic<std::int_fast8_t>
std::atomic_uint_fast8_t     std::atomic<std::uint_fast8_t>






Specialised Atomic Operations

• For integral types T
    • operator++, operator++(int)
    • operator+=, operator-=, operator|=, operator&=, operator^=
    • fetch_add, fetch_sub, fetch_or, fetch_and, fetch_xor
• For pointer types T*
    • operator++, operator++(int)
    • operator+=, operator-=
    • fetch_add, fetch_sub
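The fetch_ functions all return the value held before the modification, which is what makes them useful as building blocks. A short sketch, with take_ticket and set_bits as hypothetical helper names:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical ticket dispenser: fetch_add returns the value *before* the
// addition, so every caller receives a distinct ticket number.
unsigned take_ticket(std::atomic<unsigned>& next)
{
    return next.fetch_add(1);
}

// Hypothetical flag setter: sets the bits in 'mask' and reports whether all
// of them were already set beforehand.
bool set_bits(std::atomic<unsigned>& flags, unsigned mask)
{
    const unsigned before = flags.fetch_or(mask);
    return (before & mask) == mask;
}
```

The operator forms (++x, x += n, x |= m) perform the same atomic modification but return the new value instead.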

std::atomic_flag

• Guaranteed lock-free
• No load, store, exchange, compare_exchange
• Instead:
    • Initialisation with ATOMIC_FLAG_INIT (copy assignment is deleted)
    • bool test_and_set() // Set to true, returns the previous value
    • void clear() // Set to false
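The two operations are enough to build a simple spinlock. The sketch below is an illustration under assumptions: SpinLock and count_with_spinlock are hypothetical names, and the default (sequentially consistent) ordering is used throughout.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical spinlock: test_and_set() returns the previous value, so the
// loop exits only for the thread that changed the flag from false to true.
class SpinLock {
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (m_flag.test_and_set()) std::this_thread::yield(); }
    void unlock() { m_flag.clear(); }
};

// Protects a plain (non-atomic) counter with the spinlock.
int count_with_spinlock(int threads, int iterations)
{
    assert(threads >= 0 && iterations >= 0);
    SpinLock lock;
    int counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;      // only one thread at a time gets here
                lock.unlock();
            }
        });
    }
    for (auto& th : pool)
        th.join();
    return counter;
}
```

In production code std::mutex is usually the better choice; the spinlock only shows what test_and_set and clear provide.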

Takeaways

Atomics offer data-race-free operations
• Any (trivially copyable) type, integral and pointer types in particular
• Load, store, compare exchange, increment, …
• Portable
• Efficient (lock-free)

The As-If Rule

A conforming implementation is free to choose how it executes a well-formed program,

as long as the program’s observable behaviour is as if it were executed as written.

Single-threaded Optimisations

for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        sum += array[c*rows+r];

for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
        sum += array[c*rows+r];

int acc = sum;
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        acc += array[c*rows+r];
sum = acc;

int acc = sum, i = 0;
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        acc += array[i++];
sum = acc;

Single-threaded Optimisations

for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
        sum += array[c*rows+r];

int acc = sum, i = 0;
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        acc += array[i++];
sum = acc;

sum = 42;
int acc = sum, i = 0;
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        acc += array[i++];
if (i) sum = acc;

Code Transformations

bool x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

Compiler Transformations

bool x, y;

// Thread 1
if (y)
    cout << "y";
x = true;

// Thread 2
if (x)
    cout << "x";
y = true;

Real-life Examples


bool x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

Compiler Transformations

bool x, y;

// Thread 1
bool r = y;
x = true;
if (r)
    cout << "y";

// Thread 2
bool r = x;
y = true;
if (r)
    cout << "x";


bool x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

Store Buffer

Processor 1
    x = true;
    if (y)
        cout << "y";

Processor 2
    y = true;
    if (x)
        cout << "x";

[Figure: Processor 1 and Processor 2, each with its own store buffer, in front of a coherent cache / main memory; animation steps 1 to 6]


Source Code Transformations

Takeaways

1. Implementation rarely executes what you wrote
    • Code is reordered, omitted, invented
    • Compiler, processor, cache: all equivalent
    • Critical for performance
2. Atomicity + As-if rule: not enough!
    • Need to restrict code transformations

BARRIERS

Critical Regions using Mutexes

std::string x;
std::mutex x_mutex;

x_mutex.lock(); x = "Hello, world"; x_mutex.unlock();
x = "Hello, world"; x_mutex.lock(); x_mutex.unlock();
x_mutex.lock(); x_mutex.unlock(); x = "Hello, world";

std::string x, y, z;
std::mutex y_mutex;

x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";
y_mutex.lock(); x = "Hell"; y = "o, w"; y_mutex.unlock(); z = "orld";
y_mutex.lock(); x = "Hell"; y = "o, w"; z = "orld"; y_mutex.unlock();
y_mutex.lock(); z = "orld"; y = "o, w"; x = "Hell"; y_mutex.unlock();
x = "Hell"; z = "orld"; y_mutex.lock(); y = "o, w"; y_mutex.unlock();
x = "Hell"; y_mutex.lock(); y = "o, w"; z = "orld"; y_mutex.unlock();
x = "Hell"; y_mutex.lock(); z = "orld"; y = "o, w"; y_mutex.unlock();
y_mutex.lock(); y = "o, w"; y_mutex.unlock(); x = "Hell"; z = "orld";




Memory Barriers



Acquire barrier

Release barrier

Atomic Barriers

bool x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

Atomic Barriers

“synchronised” bool x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

Atomic Barriers

std::atomic<bool> x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

Atomic Store == Release Barrier
Atomic Load == Acquire Barrier

Takeaways

• Acquire barriers
    • mutex::lock, atomic::load, …
    • Code may flow down, but not up
    • “Wait until acquired”
• Release barriers
    • mutex::unlock, atomic::store, …
    • Code may flow up, but not down
    • “Finish before releasing”
• Atomicity + acquire/release barriers: not enough!

CONSISTENCY

Acquire & Release Barriers

std::atomic<bool> x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

[Figure: “Plain” Acquire & Release versus SC Acquire & Release, each drawn as a sequence of release and acquire barriers]

Sequential Consistency (SC)

Sequentially Consistent Barriers

std::atomic<bool> x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

Atomic Store == SC Release Barrier
Atomic Load == SC Acquire Barrier
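With sequentially consistent atomics, the two-thread example can no longer end with both threads missing the other's store: at least one of "x" or "y" is printed. The sketch below repeats the experiment and counts the forbidden outcome; count_neither_seen is a hypothetical helper name, and the outcome is recorded in local variables instead of being printed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: runs the two-thread example 'runs' times and counts
// how often neither thread observed the other's store. With the default
// (sequentially consistent) operations this outcome is not allowed.
int count_neither_seen(int runs)
{
    assert(runs >= 0);
    int neither = 0;
    for (int i = 0; i < runs; ++i) {
        std::atomic<bool> x(false), y(false);
        bool saw_y = false, saw_x = false;   // each written by one thread only
        std::thread t1([&] { x = true; saw_y = y; });
        std::thread t2([&] { y = true; saw_x = x; });
        t1.join();
        t2.join();
        if (!saw_y && !saw_x)
            ++neither;
    }
    return neither;
}
```

A test like this can only ever demonstrate the absence of the outcome on one machine; the guarantee itself comes from the memory model, not from the experiment.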

Sequential Consistency

std::string s;
std::atomic<bool> ready;

// Thread 1
s = "Hello, world!";
ready = true;

// Thread 2
while (!ready) {}
cout << s;
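The same pattern as a self-contained sketch: the plain std::string is written before the atomic flag is set and read only after the flag has been seen, so there is no data race on it. The function name receive_message and the yield inside the wait loop are additions for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical wrapper around the example: returns what the consumer read.
std::string receive_message()
{
    std::string s;                      // plain, non-atomic payload
    std::atomic<bool> ready(false);     // atomic flag guarding the payload
    std::string received;

    std::thread consumer([&] {
        while (!ready)                  // atomic load (acquire barrier)
            std::this_thread::yield();
        received = s;                   // safe: happens after the write to s
    });
    std::thread producer([&] {
        s = "Hello, world!";
        ready = true;                   // atomic store (release barrier)
    });

    producer.join();
    consumer.join();
    return received;
}
```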

Sequentially Consistent Pointers

std::atomic<std::string*> s;

// Thread 1
s = new string("Hello");

// Thread 2
while (!s) {}
cout << *s;

Sequential Consistency

std::atomic<std::string*> s;

// Thread 1
auto temp = new string("Hello");
s = temp;

// Thread 2
while (!s) {}
cout << *s;

Double-Checked Locking is Unbroken

class Singleton { /*...*/ };
std::atomic<Singleton*> instance;
std::mutex m;

Singleton* GetInstance()
{
    if (instance == nullptr) {
        std::lock_guard<std::mutex> lock(m);
        if (instance == nullptr)
            instance = new Singleton();
    }
    return instance;
}

Sequential Consistency: Transitivity

bool g;
std::atomic<bool> x, y;

// Thread 1
g = true;
x = true;

// Thread 2
if (x) y = true;

// Thread 3
if (y) assert(g);

Sequential Consistency: Total Order

std::atomic<bool> x, y;

// Thread 1
x = true;

// Thread 2
if (x && !y) cout << "x first";

// Thread 3
y = true;

// Thread 4
if (y && !x) cout << "y first";

Key Takeaway

Don’t write race conditions, and use sequentially consistent atomics,

and your code will do what you think.

DON’T DO IT

EXPERTS ONLY

Don’t Do It, Experts Only

The First Rule of Program Optimization: Don't do it

The Second Rule of Program Optimization (for experts only!): Don't do it yet.

Memory Order

std::atomic_bool x, y;

// Thread 1
x = true;
if (y)
    cout << "y";

// Thread 2
y = true;
if (x)
    cout << "x";

Memory Order

// Thread 1
x.store(true);
if (y.load())
    cout << "y";

// Thread 2
y.store(true);
if (x.load())
    cout << "x";

Memory Order

// Thread 1
x.store(true, std::memory_order_seq_cst);
if (y.load(std::memory_order_seq_cst))
    cout << "y";

// Thread 2
y.store(true, std::memory_order_seq_cst);
if (x.load(std::memory_order_seq_cst))
    cout << "x";

Memory Order

enum memory_order
{
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst    // default
};

Relaxed Memory Order

// Thread 1
x.store(true, std::memory_order_relaxed);
if (y.load(std::memory_order_relaxed))
    cout << "y";

// Thread 2
y.store(true, std::memory_order_relaxed);
if (x.load(std::memory_order_relaxed))
    cout << "x";

Acquire/Release Memory Order

// Thread 1
x.store(true, std::memory_order_release);
if (y.load(std::memory_order_acquire))
    cout << "y";

// Thread 2
y.store(true, std::memory_order_release);
if (x.load(std::memory_order_acquire))
    cout << "x";

Acquire/Release Memory Order

std::atomic_int x, y;
#define acquire std::memory_order_acquire
#define release std::memory_order_release

// Thread 1
x.store(1, release);

// Thread 2
if (x.load(acquire) && !y.load(acquire)) cout << "x first";

// Thread 3
y.store(1, release);

// Thread 4
if (y.load(acquire) && !x.load(acquire)) cout << "y first";
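For contrast, one case where plain acquire/release is sufficient: publishing data from one thread to another through a single flag. A release store that is observed by an acquire load on the same atomic makes everything written before the store visible after the load. The name publish_and_read and the payload value are assumptions for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: one writer publishes a plain int, one reader waits
// for the flag and then reads the int.
int publish_and_read()
{
    int payload = 0;                    // plain, non-atomic data
    std::atomic<bool> ready(false);
    int seen = -1;

    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        seen = payload;                 // ordered after the acquire load
    });
    std::thread writer([&] {
        payload = 42;                   // ordered before the release store
        ready.store(true, std::memory_order_release);
    });

    writer.join();
    reader.join();
    return seen;
}
```

What acquire/release does not give is a single order of independent stores that all threads agree on, which is the property the four-thread example depends on.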

Don’t Do It, Experts Only

The difference between acq_rel and seq_cst is generally whether the operation is required to participate in the single global order of sequentially consistent operations. This has subtle and unintuitive effects. The fences in the current standard may be the most experts-only construct [in C++].

Peterson’s Mutex (Bartosz Milewski)

class PetersonMutexBM
{
    std::atomic<bool> m_interested[2];
    std::atomic<unsigned> m_victim;

public:
    void lock()
    {
        const auto me = binary_thread_id(); // 0 or 1
        const unsigned he = 1 - me;         // 1 or 0

        m_interested[me].exchange(true, acq_rel);
        m_victim.store(me, release);

        while (m_interested[he].load(acquire)
            && m_victim.load(acquire) == me);
    }
    // ...
};

Peterson’s Mutex (Dmitriy V'jukov)

class PetersonMutexDV
{
    std::atomic<bool> m_interested[2];
    std::atomic<unsigned> m_victim;

public:
    void lock()
    {
        const auto me = binary_thread_id(); // 0 or 1
        const unsigned he = 1 - me;         // 1 or 0

        m_interested[me].store(true, relaxed);
        m_victim.exchange(me, acq_rel);

        while (m_interested[he].load(acquire)
            && m_victim.load(relaxed) == me);
    }
    // ...
};

• Questions
    1. Is it correct?
    2. Is it worth it?
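As a baseline for comparing the two variants above, here is a sketch of Peterson's algorithm that uses only the default sequentially consistent operations, the setting in which the textbook algorithm is known to work. PetersonMutexSC and peterson_count are hypothetical names; the thread index is passed explicitly because binary_thread_id() is not defined in the slides, and unlock() is an assumption since the slides show only lock().

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical baseline: Peterson's two-thread mutex, seq_cst everywhere.
class PetersonMutexSC {
    std::atomic<bool> m_interested[2];
    std::atomic<unsigned> m_victim;
public:
    PetersonMutexSC() : m_victim(0)
    {
        m_interested[0] = false;
        m_interested[1] = false;
    }
    void lock(unsigned me)               // me is 0 or 1
    {
        const unsigned he = 1 - me;      // 1 or 0
        m_interested[me] = true;         // "I want to enter"
        m_victim = me;                   // "but you may go first"
        while (m_interested[he] && m_victim == me)
            std::this_thread::yield();
    }
    void unlock(unsigned me) { m_interested[me] = false; }
};

// Two threads increment a plain int under the mutex.
int peterson_count(int iterations)
{
    assert(iterations >= 0);
    PetersonMutexSC mutex;
    int counter = 0;
    auto work = [&](unsigned me) {
        for (int i = 0; i < iterations; ++i) {
            mutex.lock(me);
            ++counter;
            mutex.unlock(me);
        }
    };
    std::thread t0(work, 0u);
    std::thread t1(work, 1u);
    t0.join();
    t1.join();
    return counter;
}
```

Any weaker ordering has to be argued against this baseline, operation by operation, which is the point of the two questions.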

Relaxed Double Checked Locking (Herb Sutter)

Takeaway: Relaxed, Don’t Do it

FENCES

Standalone Fences

• Fence == barrier
• std::atomic_thread_fence(std::memory_order) & std::atomic_signal_fence(std::memory_order)
    • memory_order_relaxed // does nothing
    • memory_order_consume
    • memory_order_acquire
    • memory_order_release
    • memory_order_acq_rel
    • memory_order_seq_cst // (no default)

Fences

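std::atomic<bool> x, y; // fences only order atomic accesses; plain bools would still race
using namespace std;

// Thread 1
x.store(true, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
if (y.load(memory_order_relaxed))
    cout << "y";

// Thread 2
y.store(true, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
if (x.load(memory_order_relaxed))
    cout << "x";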

Takeaway

• Standalone fences are suboptimal
    • Error-prone
    • Suboptimal performance
    • Cf. “Atomic<> Weapons” by Herb Sutter

LOCK-FREE PROGRAMMING

Lock-Free Programming

Don’t do it, experts only!
• New lock-free data structure == research article
• Cf. “Lock-Free Programming (Or, Juggling Razor Blades)” by Herb Sutter

Double-Checked Locking

class Singleton { /*...*/ };
std::atomic<Singleton*> instance;
std::mutex m;

Singleton* GetInstance()
{
    if (instance == nullptr) {
        std::lock_guard<std::mutex> lock(m);
        if (instance == nullptr)
            instance = new Singleton();
    }
    return instance;
}

Magic Statics Work…

class Singleton { /*...*/ };

// Magic statics are thread-safe in C++11...
Singleton& GetInstance()
{
    static Singleton instance;
    return instance;
}

// ..., but check your compiler documentation!
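A small sketch to check the guarantee on a given compiler: several threads call GetInstance() at the same time, and the constructor must still run exactly once. The counter g_constructions and the function constructions_after_concurrent_access are hypothetical instrumentation added for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> g_constructions(0);    // hypothetical instrumentation

class Singleton {
public:
    Singleton() { ++g_constructions; }
};

Singleton& GetInstance()
{
    static Singleton instance;          // initialised once, even under contention
    return instance;
}

// Calls GetInstance() from 'threads' threads and reports how many times the
// constructor has run in total.
int constructions_after_concurrent_access(int threads)
{
    assert(threads > 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([] { GetInstance(); });
    for (auto& th : pool)
        th.join();
    return g_constructions.load();
}
```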

THREADS

std::thread

Constructor              Release barrier
Start thread function    Acquire barrier
End thread function      Release barrier
Join                     Acquire barrier

• Essentially:
    • Everything written prior to the thread’s launch can safely be read from the function it executes
    • Everything written during the thread’s execution can safely be read after std::thread::join()
• std::async & std::future are similar
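Both guarantees in one sketch: plain data written before the thread is constructed is read inside the thread function, and a plain result written inside the thread is read after join(). No atomics or mutexes are needed. The function name sum_in_thread is an assumption for this illustration.

```cpp
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical example: sums a vector on a worker thread.
int sum_in_thread(const std::vector<int>& input)
{
    std::vector<int> data = input;  // written before the thread is launched
    int result = 0;                 // written by the thread, read after join()

    std::thread worker([&] {
        // Safe: the constructor of std::thread acts as a release barrier,
        // the start of the thread function as an acquire barrier.
        result = std::accumulate(data.begin(), data.end(), 0);
    });

    worker.join();                  // acquire barrier for the thread's writes
    return result;
}
```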

VOLATILE

Volatile in C++

• Unoptimisable variables for talking (I/O) to something outside the program
    • E.g. hardware registers etc.
• Deliberately underspecified
    • Not necessarily atomic
    • Similar but different reordering constraints
    • No optimisation
        • Not even e.g. “v=1; v=2;” or “v=1; r=v;”

Volatile in MS VC++ (MSDN)

“Visual Studio interprets the volatile keyword differently depending on the target architecture. For ARM, [the default is] /volatile:iso, [otherwise it is] /volatile:ms; [but] we strongly recommend that you specify /volatile:iso, and use explicit synchronization primitives […] when you are dealing with memory that is shared across threads. […]

Microsoft Specific [/volatile:ms]
• A write to a volatile object […] has Release semantics; […]
• A read of a volatile object […] has Acquire semantics; […]

Note
[Code that relies on] the enhanced guarantee that's provided when the /volatile:ms […] is used, […] is non-portable.”

Volatile in Java / .Net

volatile in Java / .Net ≈ std::atomic in C++
• Java
    • Main inspiration for C++11 memory model
    • Atomic load & store, sequentially consistent ordering
    • java.util.concurrent.atomic
        • Atomic arrays
        • Atomic increment, exchange, cas, etc.
• .Net
    • Plain acquire & release, no sequential consistency

Takeaway

• volatile in C++ ≠ std::atomic in C++
    • Don’t use MS-specific volatile
• volatile in Java / .Net ≈ std::atomic in C++

NUTSHELL

Key Takeaway

Don’t write race conditions, and don’t use relaxed atomics,

and your code will do what you think.

QUESTIONS?

More Information

• Herb Sutter
    • Atomic<> Weapons, C++ and Beyond 2012 talk
        • part 1: http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
        • part 2: http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-2-of-2
    • Lock-Free Programming (Or, Juggling Razor Blades), CppCon 2014 talk
        • part 1: http://channel9.msdn.com/Events/CPP/C-PP-Con-2014/Lock-Free-Programming-or-Juggling-Razor-Blades-Part-I
        • part 2: http://channel9.msdn.com/Events/CPP/C-PP-Con-2014/Lock-Free-Programming-or-Juggling-Razor-Blades-Part-II
    • …
• Hans Boehm
    • Threads Basics (article): http://www.hpl.hp.com/techreports/2009/HPL-2009-259html.html
    • A Less Formal Explanation of the Proposed C++ Concurrency Memory Model (C++11 standard proposal article): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2480.html
    • …
• Anthony Williams
    • C++ Concurrency in Action (book): http://www.cplusplusconcurrencyinaction.com/
    • Just Software Solutions blog: https://www.justsoftwaresolutions.co.uk/threading/
    • …
