The ABC’s of Atomics: An Introduction to std::atomic and the C++11 Memory Model
The ABC’s of Atomics
• C++11 atomic operations library
• Header <atomic>
• Atomic types
  • std::atomic<T>
  • std::atomic_flag
• Functions
  • Standalone fences
  • Dozens of functions for C compatibility
Data Races
A data race is a race condition that occurs if multiple threads concurrently
access the same memory location, without synchronisation,
and at least one of those accesses is a write.
Atomic Load & Store
std::atomic<int> x;
void f(int);

x.store(1);
x = 2;
// C interoperability:
std::atomic_store(&x, 3);

f(x.load());
f(x);
int i = x;
int j = std::atomic_load(&x);
Atomic Load & Store
std::atomic<std::uint64_t> x;
std::atomic<int> y(0);

x = 1;

assert(x == 0 || x == 1 || x == ~0ul);

x = ~0ul;
Atomic Load & Store
std::atomic<std::uint64_t> x;

x = 1;

auto r = x.load();
assert(r == 0 || r == 1 || r == ~0ul);

x = ~0ul;
Atomic Load & Store
std::atomic<std::uint64_t> x;

x = 1;
// ...
x = ~0ul;

assert(x == 0 || x == 1 || x == ~0ul);
Atomic Load & Store
For any (trivially copyable) type T
std::atomic<int> x;
std::atomic<long long int> y;
struct S { /*...*/ };
std::atomic<S> z;
assert(x.is_lock_free());   // Most platforms
assert(!z.is_lock_free());  // Typical when S is too large for a native atomic instruction
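As a minimal sketch of the "any trivially copyable type" point, the following wraps a small struct in std::atomic and stores and loads it as a whole. The type `Point` and the helper `store_and_load` are hypothetical names chosen for illustration only.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical trivially copyable type, used only for illustration.
struct Point {
    int x;
    int y;
};

// Stores a Point atomically and reads it back as one indivisible value.
inline Point store_and_load(Point p) {
    std::atomic<Point> shared(Point{0, 0});
    shared.store(p);       // the whole struct is replaced in one atomic step
    return shared.load();  // a load never observes a half-written Point
}
```

Whether such an atomic is lock-free depends on the platform and on the size of the struct; the code is correct either way.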
Atomic Exchange
For any (trivially copyable) type T
// Pseudocode: the real operation performs these steps as one atomic unit.
T exchange(T newValue)
{
    T oldValue = load();
    store(newValue);
    return oldValue;
}
const auto before = x.exchange(123);
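One common use of exchange is to let exactly one caller win a one-time claim. The sketch below assumes a hypothetical helper `claim`; it is an illustration of the operation, not part of the original slides.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: returns true only for the first caller.
// exchange() writes true and hands back the previous value in one atomic step,
// so at most one caller can ever observe the previous value false.
inline bool claim(std::atomic<bool>& taken) {
    return !taken.exchange(true);
}
```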
Atomic Compare Exchange
For any (trivially copyable) type T
// Pseudocode: the real operation performs these steps as one atomic unit.
bool compare_exchange(T& expected, T desired)
{
    const T currentValue = load();
    if (currentValue == expected) {
        store(desired);
        return true;
    } else {
        expected = currentValue;
        return false;
    }
}
Atomic Compare Exchange
• For any (trivially copyable) type T
• Powerful tool for lock-free data structures
• Two flavors
  • compare_exchange_strong(T& exp, T des)
    • For use as standalone operation
  • compare_exchange_weak(T& exp, T des)
    • May “fail” spuriously from time to time
    • For use in a loop
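The "use in a loop" advice for compare_exchange_weak can be sketched with an atomic maximum. The helper name `atomic_fetch_max` is hypothetical and is not a standard library function.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically raises `target` to at least `candidate`
// and returns the value that was stored before the call took effect.
inline int atomic_fetch_max(std::atomic<int>& target, int candidate) {
    int observed = target.load();
    // On failure (including a spurious one) compare_exchange_weak refreshes
    // `observed` with the current value, so the condition is re-evaluated.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate)) {
    }
    return observed;
}
```

Because a spurious failure just triggers another iteration, the weak flavor is sufficient here.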
Atomic Operations
For any (trivially copyable) type T
• Atomic load
• Atomic store
• Atomic exchange
• Atomic compare_exchange_strong
• Atomic compare_exchange_weak
Specialisations & Typedefs
std::atomic_bool std::atomic<bool>
std::atomic_char std::atomic<char>
std::atomic_schar std::atomic<signed char>
std::atomic_uchar std::atomic<unsigned char>
std::atomic_short std::atomic<short>
std::atomic_ushort std::atomic<unsigned short>
std::atomic_int std::atomic<int>
std::atomic_uint std::atomic<unsigned int>
std::atomic_long std::atomic<long>
std::atomic_ulong std::atomic<unsigned long>
std::atomic_llong std::atomic<long long>
std::atomic_ullong std::atomic<unsigned long long>
std::atomic_char16_t std::atomic<char16_t>
std::atomic_char32_t std::atomic<char32_t>
std::atomic_wchar_t std::atomic<wchar_t>
Specialisations & Typedefs
std::atomic_int_least8_t std::atomic<std::int_least8_t>
std::atomic_uint_least8_t std::atomic<std::uint_least8_t>
std::atomic_int_least16_t std::atomic<std::int_least16_t>
std::atomic_uint_least16_t std::atomic<std::uint_least16_t>
std::atomic_int_least32_t std::atomic<std::int_least32_t>
std::atomic_uint_least32_t std::atomic<std::uint_least32_t>
std::atomic_int_least64_t std::atomic<std::int_least64_t>
std::atomic_uint_least64_t std::atomic<std::uint_least64_t>
std::atomic_int_fast8_t std::atomic<std::int_fast8_t>
std::atomic_uint_fast8_t std::atomic<std::uint_fast8_t>
std::atomic_int_fast16_t std::atomic<std::int_fast16_t>
std::atomic_uint_fast16_t std::atomic<std::uint_fast16_t>
std::atomic_int_fast32_t std::atomic<std::int_fast32_t>
std::atomic_uint_fast32_t std::atomic<std::uint_fast32_t>
std::atomic_int_fast64_t std::atomic<std::int_fast64_t>
Specialised Atomic Operations
• For integral types T
  • operator++, operator++(int)
  • operator+=, operator-=, operator|=, operator&=, operator^=
  • fetch_add, fetch_sub, fetch_or, fetch_and, fetch_xor
• For pointer types T*
  • operator++, operator++(int)
  • operator+=, operator-=
  • fetch_add, fetch_sub
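A minimal sketch of the integral specialisation in use: several threads increment one shared counter without a mutex. The helper `count_concurrently` and its parameters are hypothetical names for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: starts `threads` workers that each increment a shared
// counter `iterations` times, then returns the final count.
inline long count_concurrently(int threads, int iterations) {
    std::atomic<long> counter(0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                ++counter;  // atomic read-modify-write, like fetch_add(1)
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

With a plain `long` instead of `std::atomic<long>`, the same code would contain a data race and could lose increments.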
std::atomic_flag
• Guaranteed lock-free
• No load, store, exchange, compare_exchange
• Not copyable or assignable (operator= is deleted); initialise with ATOMIC_FLAG_INIT
• Instead:
  • bool test_and_set() // Set to true, return previous value
  • void clear() // Set to false
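The two operations of std::atomic_flag are enough to build a simple spinlock. The sketch below uses hypothetical names (`SpinLock`, `sum_with_spinlock`) and is meant as an illustration rather than a production-quality lock.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock built on std::atomic_flag.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    // test_and_set returns the previous value: true means someone else holds the lock.
    void lock()   { while (flag_.test_and_set()) { /* spin */ } }
    void unlock() { flag_.clear(); }
};

// Hypothetical helper: increments a plain int from several threads,
// protecting every increment with the spinlock.
inline int sum_with_spinlock(int threads, int iterations) {
    SpinLock lock;
    int total = 0;  // plain int, only touched while the lock is held
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &total, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++total;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```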
Takeaways
Atomics offer data-race-free operations
• Any trivially copyable type, integral and pointer types in particular
• Load, store, compare exchange, increment, …
• Portable
• Efficient (lock-free on most platforms for the built-in types)
The As-If Rule
A conforming implementation is free to choose how it executes a well-formed program,
as long as the program’s observable behaviour is as if it were executed as written.
Single-threaded Optimisations
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        sum += array[c*rows+r];

for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
        sum += array[c*rows+r];

int acc = sum;
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        acc += array[c*rows+r];
sum = acc;

int acc = sum, i = 0;
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        acc += array[i++];  // i++ rather than ++i, so indexing starts at array[0]
sum = acc;
Single-threaded Optimisations
for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
        sum += array[c*rows+r];

int acc = sum, i = 0;
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        acc += array[i++];
sum = acc;

sum = 42;

int acc = sum, i = 0;
for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
        acc += array[i++];
if (i) sum = acc;
Compiler Transformations
bool x, y;

bool r = y;
x = true;
if (r)
    cout << "y";

bool r = x;
y = true;
if (r)
    cout << "x";
Store Buffer
Processor 1
x = true;
if (y)
    cout << "y";
Processor 2
y = true;
if (x)
    cout << "x";
[Diagram: Processor 1 and Processor 2 each write through their own store buffer to the coherent cache / main memory; the steps are numbered 1 to 6.]
Takeaways
1. Implementation rarely executes what you wrote
  • Code is reordered, omitted, invented
  • Compiler, processor, cache: all equivalent
  • Critical for performance
2. Atomicity + As-if rule: not enough!
  • Need to restrict code transformations
Critical Regions using Mutexes
std::string x; std::mutex x_mutex;
x_mutex.lock(); x = "Hello, world"; x_mutex.unlock();   // as written
x = "Hello, world"; x_mutex.lock(); x_mutex.unlock();   // reordered
Critical Regions using Mutexes
std::string x; std::mutex x_mutex;
x_mutex.lock(); x = "Hello, world"; x_mutex.unlock();   // as written
x = "Hello, world"; x_mutex.lock(); x_mutex.unlock();   // reordered
x_mutex.lock(); x_mutex.unlock(); x = "Hello, world";   // reordered
Critical Regions using Mutexes
std::string x; std::mutex x_mutex;
x_mutex.lock(); x = "Hello, world"; x_mutex.unlock();   // as written
x = "Hello, world"; x_mutex.lock(); x_mutex.unlock();   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";   // as written
y_mutex.lock(); x = "Hell"; y = "o, w"; y_mutex.unlock(); z = "orld";   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";   // as written
y_mutex.lock(); x = "Hell"; y = "o, w"; y_mutex.unlock(); z = "orld";   // reordered
y_mutex.lock(); x = "Hell"; y = "o, w"; z = "orld"; y_mutex.unlock();   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";   // as written
y_mutex.lock(); z = "orld"; y = "o, w"; x = "Hell"; y_mutex.unlock();   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";   // as written
x = "Hell"; z = "orld"; y_mutex.lock(); y = "o, w"; y_mutex.unlock();   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";   // as written
x = "Hell"; y_mutex.lock(); y = "o, w"; z = "orld"; y_mutex.unlock();   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";   // as written
x = "Hell"; y_mutex.lock(); z = "orld"; y = "o, w"; y_mutex.unlock();   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";   // as written
x = "Hell"; z = "orld"; y_mutex.lock(); y = "o, w"; y_mutex.unlock();   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";   // as written
y_mutex.lock(); y = "o, w"; y_mutex.unlock(); x = "Hell"; z = "orld";   // reordered
Critical Regions using Mutexes
std::string x, y, z; std::mutex y_mutex;
x = "Hell"; y_mutex.lock(); y = "o, w"; y_mutex.unlock(); z = "orld";
Memory Barriers
std::string x, y, z;
std::mutex y_mutex;

x = "Hell";
y_mutex.lock();      // Acquire barrier
y = "o, w";
y_mutex.unlock();    // Release barrier
z = "orld";
Atomic Barriers
std::atomic<bool> x, y;

x = true;
if (y)
    cout << "y";

y = true;
if (x)
    cout << "x";

Atomic Store == Release Barrier
Atomic Load == Acquire Barrier
Takeaways
• Acquire barriers
  • mutex::lock, atomic::load, …
  • Code may flow down, but not up
  • “Wait until acquired”
• Release barriers
  • mutex::unlock, atomic::store, …
  • Code may flow up, but not down
  • “Finish before releasing”
• Atomicity + acquire/release barriers: not enough!
Acquire & Release Barriers
std::atomic<bool> x, y;

x = true;
if (y)
    cout << "y";

y = true;
if (x)
    cout << "x";

[Diagram: two columns, “Plain” Acquire & Release and SC Acquire & Release, each showing a sequence of Release and Acquire barriers.]
Sequential Consistency (SC)
Sequentially Consistent Barriers
std::atomic<bool> x, y;

x = true;
if (y)
    cout << "y";

y = true;
if (x)
    cout << "x";

Atomic Store == SC Release Barrier
Atomic Load == SC Acquire Barrier
Sequential Consistency
std::string s;
std::atomic<bool> ready;

s = "Hello, world!";
ready = true;

while (!ready) {}
cout << s;
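The slide's two code fragments can be put into one runnable sketch, assuming one writer thread and one reader thread. The function name `pass_message` and the variable `received` are hypothetical additions for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical helper: a writer publishes a plain std::string through a
// sequentially consistent flag, and a reader waits for the flag before reading.
inline std::string pass_message() {
    std::string s;                   // plain data, written before the flag
    std::atomic<bool> ready(false);
    std::string received;

    std::thread writer([&] {
        s = "Hello, world!";
        ready = true;                // sequentially consistent store
    });
    std::thread reader([&] {
        while (!ready.load()) {}     // sequentially consistent load
        received = s;                // safe: the write to s happened before
    });
    writer.join();
    reader.join();
    return received;
}
```

The string itself is never accessed concurrently: the reader only touches it after it has observed the flag.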
Sequentially Consistent Pointers
std::atomic<std::string*> s;

s = new string("Hello");

while (!s) {}
cout << *s;
Sequential Consistency
std::atomic<std::string*> s;

auto temp = new string("Hello");
s = temp;

while (!s) {}
cout << *s;
Double-Checked Locking is Unbroken
class Singleton { /*...*/ };
std::atomic<Singleton*> instance;
std::mutex m;

Singleton* GetInstance() {
    if (instance == nullptr) {
        std::lock_guard<std::mutex> lock(m);
        if (instance == nullptr)
            instance = new Singleton();
    }
    return instance;
}
Sequential Consistency: Transitivity
bool g;
std::atomic<bool> x, y;

g = true;
x = true;

if (x) y = true;

if (y) assert(g);
Sequential Consistency: Total Order
std::atomic<bool> x, y;

x = true;

y = true;

if (x && !y) cout << "x first";

if (y && !x) cout << "y first";
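A sketch of the total-order guarantee, assuming the four statements run on four separate threads: with sequentially consistent atomics the two observers can never both conclude that "their" variable was written first. The function `observers_disagree` and the flags `x_first` and `y_first` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns true only if one observer saw x written before y
// while the other saw y written before x, which SC atomics rule out.
inline bool observers_disagree() {
    std::atomic<bool> x(false), y(false);
    bool x_first = false, y_first = false;  // each written by one thread only

    std::thread a([&] { x = true; });
    std::thread b([&] { y = true; });
    std::thread c([&] { if (x.load() && !y.load()) x_first = true; });
    std::thread d([&] { if (y.load() && !x.load()) y_first = true; });
    a.join(); b.join(); c.join(); d.join();

    return x_first && y_first;
}
```

A single run cannot demonstrate the absence of a behaviour, so this only illustrates the shape of the guarantee.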
Key Takeaway
Don’t write race conditions, and use sequentially consistent atomics,
and your code will do what you think.
Don’t Do It, Experts Only
The First Rule of Program Optimization: Don't do it
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
Memory Order
std::atomic_bool x, y;

x.store(true);
if (y.load())
    cout << "y";

y.store(true);
if (x.load())
    cout << "x";
Memory Order
std::atomic_bool x, y;

x.store(true, std::memory_order_seq_cst);
if (y.load(std::memory_order_seq_cst))
    cout << "y";

y.store(true, std::memory_order_seq_cst);
if (x.load(std::memory_order_seq_cst))
    cout << "x";
Memory Order
enum memory_order
{
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst  // default
};
Relaxed Memory Order
std::atomic_bool x, y;

x.store(true, std::memory_order_relaxed);
if (y.load(std::memory_order_relaxed))
    cout << "y";

y.store(true, std::memory_order_relaxed);
if (x.load(std::memory_order_relaxed))
    cout << "x";
Acquire/Release Memory Order
std::atomic_bool x, y;

x.store(true, std::memory_order_release);
if (y.load(std::memory_order_acquire))
    cout << "y";

y.store(true, std::memory_order_release);
if (x.load(std::memory_order_acquire))
    cout << "x";
Acquire/Release Memory Order
std::atomic_int x, y;
#define acquire std::memory_order_acquire
#define release std::memory_order_release

x.store(1, release);

y.store(1, release);

if (x.load(acquire) && !y.load(acquire)) cout << "x first";

if (y.load(acquire) && !x.load(acquire)) cout << "y first";
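For contrast, a pattern where plain acquire/release ordering is sufficient is one-way publication of data from a producer to a consumer. The sketch below assumes one thread of each kind; `publish_and_consume`, `payload`, and `published` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: the producer writes a plain int, then sets a flag with
// release ordering; the consumer reads the flag with acquire ordering first.
inline int publish_and_consume(int value) {
    int payload = 0;                     // plain data
    std::atomic<bool> published(false);
    int seen = -1;

    std::thread producer([&] {
        payload = value;                                    // may not sink below the release store
        published.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        while (!published.load(std::memory_order_acquire)) {}
        seen = payload;                                     // may not rise above the acquire load
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Only a single flag is involved here, so no global order across several atomics is needed.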
Don’t Do It, Experts Only
The difference between acq_rel and seq_cst is generally whether the operation is required to participate in the single global order of sequentially consistent operations. This has subtle and unintuitive effects. The fences in the current standard may be the most experts-only construct [in C++].
Peterson’s Mutex (Bartosz Milewski)
class PetersonMutexBM {
    std::atomic<bool> m_interested[2];
    std::atomic<unsigned> m_victim;

public:
    void lock() {
        const auto me = binary_thread_id();  // 0 or 1
        const unsigned he = 1 - me;          // 1 or 0

        m_interested[me].exchange(true, acq_rel);
        m_victim.store(me, release);

        while (m_interested[he].load(acquire) && m_victim.load(acquire) == me)
            ;
    }
    // ...
};
Peterson’s Mutex (Dmitriy V'jukov)
class PetersonMutexDV {
    std::atomic<bool> m_interested[2];
    std::atomic<unsigned> m_victim;

public:
    void lock() {
        const auto me = binary_thread_id();  // 0 or 1
        const unsigned he = 1 - me;          // 1 or 0

        m_interested[me].store(true, relaxed);
        m_victim.exchange(me, acq_rel);

        while (m_interested[he].load(acquire) && m_victim.load(relaxed) == me)
            ;
    }
    // ...
};
Standalone Fences
• Fence == barrier
• std::atomic_thread_fence(std::memory_order) & std::atomic_signal_fence(std::memory_order)
  • memory_order_relaxed // does nothing
  • memory_order_consume
  • memory_order_acquire
  • memory_order_release
  • memory_order_acq_rel
  • memory_order_seq_cst // (no default)
Fences
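using namespace std;
// x and y must be atomic: a fence does not make concurrent access to plain bools race-free.
atomic<bool> x, y;

x.store(true, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
if (y.load(memory_order_relaxed))
    cout << "y";

y.store(true, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
if (x.load(memory_order_relaxed))
    cout << "x";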
Takeaway
• Standalone fences are suboptimal
  • Error-prone
  • Suboptimal performance
  • Cf. “Atomic<> Weapons” by Herb Sutter
Lock-Free Programming
Don’t do it, experts only!
• New lock-free data structure == research article
• Cf. “Lock-Free Programming (Or, Juggling Razor Blades)” by Herb Sutter
Double-Checked Locking
class Singleton { /*...*/ };
std::atomic<Singleton*> instance;
std::mutex m;

Singleton* GetInstance() {
    if (instance == nullptr) {
        std::lock_guard<std::mutex> lock(m);
        if (instance == nullptr)
            instance = new Singleton();
    }
    return instance;
}
Magic Statics Work…
class Singleton { /*...*/ };

// Magic statics are thread-safe in C++11...
Singleton& GetInstance() {
    static Singleton instance;
    return instance;
}
// ..., but check your compiler documentation!
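One way to check the magic-static behaviour on a given compiler is to count constructor calls while several threads request the instance at once. The counter `g_constructions` and the helper `construct_from_many_threads` are hypothetical instrumentation added for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical instrumentation: counts how often the constructor runs.
std::atomic<int> g_constructions(0);

class Singleton {
public:
    Singleton() { ++g_constructions; }
};

Singleton& GetInstance() {
    static Singleton instance;  // C++11: initialised exactly once, even under contention
    return instance;
}

// Hypothetical helper: calls GetInstance() from several threads at once and
// reports how many times the Singleton was constructed.
inline int construct_from_many_threads(int threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([] { GetInstance(); });
    for (auto& w : workers) w.join();
    return g_constructions.load();
}
```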
std::thread
Constructor: release barrier
Start of thread function: acquire barrier
End of thread function: release barrier
Join: acquire barrier
• Essentially:
  • Everything written prior to the thread’s launch can safely be read from the function it executes
  • Everything written during the thread’s execution can safely be read after std::thread::join()
• std::async & std::future are similar
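The two guarantees above can be sketched with plain, non-atomic variables. The helper `square_in_thread` is a hypothetical name for this illustration.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: passes a value into a thread and reads the result back
// after join(), using no atomics or mutexes at all.
inline int square_in_thread(int input) {
    int result = 0;  // plain int
    std::thread worker([&result, input] {
        // `input` was written before the thread was launched, so it is visible here.
        result = input * input;
    });
    worker.join();   // everything the worker wrote is visible after join()
    return result;
}
```

Reading `result` before the call to `join()` would be a data race.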
Volatile in C++
• Unoptimisable variables for talking (I/O) to something outside the program
  • E.g. hardware registers etc.
• Deliberately underspecified
  • Not necessarily atomic
  • Similar but different reordering constraints
  • No optimisation
    • Not even e.g. “v=1; v=2;” or “v=1; r=v;”
Volatile in MS VC++ (MSDN)
“Visual Studio interprets the volatile keyword differently depending on the target architecture. For ARM, [the default is] /volatile:iso, [otherwise it is] /volatile:ms; [but] we strongly recommend that you specify /volatile:iso, and use explicit synchronization primitives […] when you are dealing with memory that is shared across threads. […]
Microsoft Specific [/volatile:ms]
• A write to a volatile object […] has Release semantics; […]
• A read of a volatile object […] has Acquire semantics; […]
Note: [Code that relies on] the enhanced guarantee that's provided when the /volatile:ms […] is used, […] is non-portable.”
Volatile in Java / .Net
volatile in Java / .Net ≈ std::atomic in C++
• Java
  • Main inspiration for C++11 memory model
  • Atomic load & store, sequentially consistent ordering
  • java.util.concurrent.atomic
    • Atomic arrays
    • Atomic increment, exchange, CAS, etc.
• .Net
  • Plain acquire & release, no sequential consistency
Takeaway
• volatile in C++ ≠ std::atomic in C++
  • Don’t use MS-specific volatile
• volatile in Java / .Net ≈ std::atomic in C++
Key Takeaway
Don’t write race conditions, and don’t use relaxed atomics,
and your code will do what you think.
More Information
• Herb Sutter
  • Atomic<> Weapons, C++ and Beyond 2012 talk (part 1, part 2)
  • Lock-Free Programming (Or, Juggling Razor Blades), CppCon 2014 talk (part 1, part 2)
  • …
• Hans Boehm
  • Threads Basics (article)
  • A Less Formal Explanation of the Proposed C++ Concurrency Memory Model (C++11 standard proposal article)
  • …
• Anthony Williams
  • C++ Concurrency in Action (book)
  • Just Software Solutions blog
  • …