TRANSCRIPT
CS4021 LOCKLESS ALGORITHMS
© 2012 [email protected] 11-Dec-12
School of Computer Science and Statistics, Trinity College Dublin 1
CS4021 Advanced Computer Architecture
• concurrent programming with and without locks
• atomic instructions / updates
• lock implementations and performance
• lockless [non blocking] data structures and algorithms
    CAS based, MCAS based, memory management [e.g. hazard pointers]
• hardware transactional memory [HTM]
    Herlihy and Moss [1993], Intel Haswell CPU [2012]
• The Art of Multiprocessor Programming, Maurice Herlihy and Nir Shavit
• CS4021 website https://www.scss.tcd.ie/Jeremy.Jones/CS4021/CS4021.htm
Why be concerned?
• clock rate of a single CPU core appears to be limited to ≈ 4GHz
• single CPU core processing power falls very far short of doubling every 18 months
• Intel, AMD, Sun, IBM, … producing multicore CPUs instead
• typical desktop has 4 cores with each core capable of executing 2 threads [hyper-threading] giving a total of 8 concurrent threads
• typical desktop in 2014 16 threads, 2016 32 threads, … [Moore's Law and Joy's Law]
• need to be able to exploit cheap threads on multicore CPUs
• lock based solutions are simply not scalable as a lock inhibits parallelism
• need to explore lockless data structures and algorithms
Consider an ordered linked list [set]
• each node has a key and a next field
• NB: list doesn't contain duplicate keys
• add(25) adds a node containing 25 to the list [does nothing if item already in list]
Consider an ordered linked list…
• remove(30) removes the node containing 30 from the list [does nothing if item NOT in list]
Concurrent Updates
• conventional approach is to protect the list with a single lock [CriticalSection, Mutex, …] which prevents concurrent accesses by different threads
• if the list is protected by a lock, it is clear that ONLY ONE operation can occur at a time [access to list serialised by lock]
• ALSO clear that if the list is long enough, multiple add and remove operations can occur concurrently as they will update pointers in disjoint parts of the list [disjoint access parallelism]
• lockless approach allows multiple add and remove operations to occur concurrently
• remedial action taken if a clash is detected [non disjoint updates]
Spin Lock Implementations
• implementations should minimise bus traffic, especially when a lock is heavily contested
• CPUs waiting for a lock are idle and shouldn't generate unnecessary bus traffic which slows the CPUs doing real work
• spin lock implementations usually rely on atomic instructions which comprise an indivisible read-modify-write [RMW] access to a shared memory location
• in a single CPU system, many instructions are effectively atomic because interrupts are ONLY recognised between instructions
Spin Lock Implementations…
• consider a spinlock implementation based on an IA32 logical shift right instruction [shr]

    ;
    ; simple spin lock (NB: 1 == free, 0 == taken)
    ;
    wait  shr lock, 1     ; lock in memory
          jnc wait        ; jump no carry (retry if C == 0)
          ret             ; return
    free  mov lock, 1     ; lock = 1 (free)
          ret             ; return

• if the lock is free and "shr lock, 1" is executed, the lock becomes taken and the carry flag is set; the instruction atomically/simultaneously sets the lock as taken and returns the fact that the lock has been acquired in the carry flag
• works in a single CPU system, but not in a multiprocessor
• why? determined by how the CPU updates memory
Bus Arbiter
• CPUs given access to the bus and shared memory, one at a time, by a bus arbiter
• if a CPU wishes to access shared memory, it asserts its bus request signal [/BREQn]
• arbiter grants access to one CPU at a time by asserting its bus grant signal [/BGRNTn]
• arbiter normally grants the bus to CPUs on a cycle by cycle basis in a fair manner [round robin]
• ONLY one CPU at a time can access shared memory
Atomic Instructions
• atomic RMW memory accesses [read cycle followed by a write cycle] must NOT be interleaved with memory accesses made by other CPUs
• CPUs generally have special atomic instructions which indicate externally that an atomic RMW memory access is being performed
• if bus cycles are arbitrated on a cycle by cycle basis [i.e. NON atomic] then a CPU could read a lock and find it free; on the next bus cycle another CPU could also read the lock and find it still free before the first CPU has been given a bus cycle to set the lock; this would result in the lock being allocated to both CPUs
• an IA32/x64 CPU asserts a /LOCK signal [external pin on chip] to inform the bus arbiter that it is trying to perform an atomic RMW memory access
• the bus arbiter must simply lock the CPU onto the bus while the /LOCK signal is asserted
IA32/x64 Atomic Instructions
• XCHG [exchange] instruction generates an atomic read-modify-write memory access
• use the variant which exchanges [swaps] a register with a memory location

    ;
    ; testAndSet lock [NB: 0 = free, 1 = taken]
    ;
    wait  mov eax, 1      ; eax = 1
          xchg eax, lock  ; exchange eax and lock in memory
          test eax, eax   ; test eax [result of xchg]
          jne wait        ; re-try if unsuccessful
          ret             ; return
    free  mov lock, 0     ; clear lock
          ret             ; return

• XCHG asserts /LOCK when executed, hence atomic
IA32/x64 Atomic Instructions
• a selection of other IA32/x64 instructions can perform atomic RMW cycles if preceded with a LOCK prefix instruction
• bts, btr, btc, xadd, cmpxchg, cmpxchg8b, inc, dec, not, neg, add, adc, sub, sbb, and, or, & xor [only valid if the instruction performs a read-modify-write access to memory]
• consider the useful exchange and add instruction xadd

    lock               ; lock prefix
    xadd reg, mem      ; tmp = reg + mem, reg = mem, mem = tmp

• without a LOCK prefix, XADD is executed non atomically
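The effect of a locked xadd is exposed directly in C11 as atomic_fetch_add, which compilers typically lower to lock xadd on IA32/x64: the old value is returned and the sum is stored in one indivisible RMW access. A minimal sketch (function name is illustrative):

```c
#include <stdatomic.h>

/* fetch-and-add: returns the value read and stores the sum in one
   indivisible RMW access, like `lock xadd` [old value plays the role
   of reg after the instruction] */
long fetch_and_add(atomic_long *counter, long v)
{
    return atomic_fetch_add(counter, v);   /* returns the OLD value */
}
```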
Windows Example
• can mix assembly language and C++, BUT…
• the x64 VC++ compiler doesn't support an inline assembler, so for Win32/x64 portability can use the intrinsics defined in intrin.h instead

    LONG __cdecl InterlockedExchange(
        _Inout_ LONG volatile *Target,
        _In_    LONG Value
    );

• could be used as follows

    volatile long lock = 0;                   // declare and initialise lock
    while (InterlockedExchange(&lock, 1));    // acquire lock

• NB: even though long and int are both 32 bit signed integers, the types are NOT equivalent
Volatile • lock must be declared as volatile
• description of volatile from Visual Studio 2012 documentation
objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment.
• to declare the object pointed to by a pointer as volatile use:

    volatile int *p;            // what p points to is volatile

• to declare the pointer itself volatile use:

    int * volatile p;           // the pointer p itself is volatile

• to declare both use:

    volatile int * volatile p;  // p and what p points to are both volatile
Windows Example…
• x64 Release code for a call to InterlockedExchange() obtained using the Visual Studio Debugger [VC++ compiler generates inline code rather than a function call]

    000000013F3B1330  mov eax,1
    000000013F3B1335  xchg eax,dword ptr [rsi+8]     ; [rsi+8] contains address of lock
    000000013F3B1338  test eax,eax
    000000013F3B133A  jne worker+0D0h (013F3B1330h)  ; retry if unsuccessful
Serializing Instructions…
• consider the following testAndSet code for obtaining a lock

    CPU0                       shared data           CPU1
    wait  mov eax, 1                                 wait  mov eax, 1
          xchg eax, lock       obtain lock                 xchg eax, lock
          test eax, eax                                    test eax, eax
          jne wait                                         jne wait
    <update shared data>       update shared data    <update shared data>
    mov lock, 0                release lock          mov lock, 0
Serializing Instructions…
• need to consider memory read and write ordering if locks are to work correctly
• CPU must NOT read ahead data in the shared data structure before it has obtained the lock [otherwise the CPU with the lock may not have finished updating the shared data structure and out of date data will be read]
• CPU must not release the lock until ALL its writes to the shared data structure have been completed [otherwise the next lock holder could read out of date data]
• LOCKED instructions [e.g. xchg, lock xadd] act implicitly as a memory barrier or fence
• reads/writes cannot pass [be carried out ahead of] locked [serialising] instructions
Serializing Instructions…
• CPUs often have explicit memory barrier or fence instructions to flush the write buffer and to enforce ordering
• IA32/x64 have the following fence instructions

    SFENCE   store fence    flush all writes before executing instruction
    LFENCE   load fence     don't read ahead until instruction executed
    MFENCE   memory fence   flush all writes before executing instruction AND don't read ahead until instruction executed
• see section 8.2 on Memory Ordering in Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1
• and also Intel® 64 Architecture Memory Ordering White Paper
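The store-fence/load-fence pairing above is what a release/acquire publish pattern needs; a portable sketch using C11 atomic_thread_fence (names publish/consume and the ready flag are illustrative, not from the slides):

```c
#include <stdatomic.h>

int data;           /* payload written before the flag            */
atomic_int ready;   /* 0 = not yet published, 1 = published       */

/* writer: make all earlier stores visible before raising the flag
   [the role SFENCE plays] */
void publish(int v)
{
    data = v;
    atomic_thread_fence(memory_order_release);     /* no store passes this  */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* reader: don't read the payload ahead of seeing the flag
   [the role LFENCE plays]; returns 1 if data was published */
int consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        return 0;
    atomic_thread_fence(memory_order_acquire);     /* no load passes this   */
    *out = data;
    return 1;
}
```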
Serializing Instructions
• from a hardware perspective
• CPU has an internal write buffer which is used to buffer writes to the memory hierarchy [for speed]
• data in the write buffer is not visible externally until written to memory, in this case to the first level cache
Serializing Instructions…
• why does the previous testAndSet code work on an IA32/x64 CPU?
1) writes are made to memory in program order so that when the lock is cleared and visible [mov lock, 0] ALL previous writes to the shared data structure are also visible
2) the lock is obtained using a serialising instruction [xchg eax, lock] which prevents read ahead so that data in the shared data structure will not be read until the lock is obtained
3) executing serialising instructions reduces CPU performance as it prevents the CPU from reading and writing ahead
Load Locked / Store Conditional Instructions
• alternative approach for performing atomic RMW accesses to memory
• an atomic RMW access is performed by executing a load locked [LL] instruction followed by a store conditional [SC] instruction
• first used by the MIPS CPU [ll/sc]
• also used by Alpha [ldq_l/stq_c], IBM PowerPC [lwarx/stwcx] and ARM [ldrex/strex] CPUs
Alpha LL/SC Implementation
• each CPU has a lockFlag [LF] and a lockPhysicalAddressRegister [LPAR] used by the LL and SC instructions
• LDQ_L Ra, va ; load quadword locked

    lockFlag = 1
    lockPhysicalAddressRegister = physicalAddress(va)
    Ra = [va]

• STQ_C Ra, va ; conditionally store quadword

    if (lockFlag == 1)    ; check lock flag
        [va] = Ra         ; conditional store if lockFlag is set
    Ra = lockFlag         ; used to test if store occurred
    lockFlag = 0          ; clear lock flag
Alpha LL/SC Implementation…
• where is the magic?
• if the per CPU lockFlag is still set when an associated STQ_C is executed, the store occurs, otherwise NO store takes place [conditional store]
• what clears the lockFlag? if any CPU does a store [write] to the physical memory address contained in a lockPhysicalAddressRegister, the associated CPU clears its lockFlag
Alpha LL/SC Hardware Perspective
• consider CPU0 executing a LDQ_L ra, lock instruction [NB: lock is the virtual address of the lock]
• state just after CPU0 has executed LDQ_L ra, lock
• NB: LF and LPAR values
Alpha LL/SC Hardware Perspective…
• if any other CPU writes to the lock variable, CPU0's lockFlag will be cleared
• CPU2 writes to the lock variable resulting in CPU0's lockFlag being cleared
• when CPU0 executes the store conditional associated with the LL, it will not write to memory
Using Alpha LL/SC to Perform an Atomic RMW
• if the following sequence of instructions is successfully executed on a given CPU [BEQ rb, XXX doesn't branch back to XXX] …

    XXX:  LDQ_L ra, va    ; read
          <modify>        ; modify
          STQ_C rb, va    ; conditional write
          BEQ rb, XXX     ; retry if unsuccessful [rb == 0 means the store failed]

• it means that the CPU has performed an atomic RMW access to memory location va
• if the conditional store fails, must retry
• LL/SC implementation means that a write to va is ONLY performed if there's a guarantee of an atomic RMW access to va
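The same read/modify/conditional-write/retry pattern can be sketched with compare-and-swap, which is how CPUs without LL/SC (e.g. IA32/x64) build an arbitrary atomic RMW; an increment as the <modify> step (function name is illustrative):

```c
#include <stdatomic.h>

/* read, compute the new value, conditionally write it back, and retry
   if the location changed in between [the role a failed STQ_C plays] */
long atomic_increment(atomic_long *a)
{
    long old = atomic_load(a);                           /* read  [LDQ_L] */
    while (!atomic_compare_exchange_weak(a, &old, old + 1))
        ;   /* on failure, old is reloaded with the current value; retry */
    return old + 1;
}
```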
Using Alpha LL/SC to Implement a TestAndSet Lock

    ACQUIRE  LDQ_L R1, lock     ; read lock
             BLBS R1, ACQUIRE   ; retry if already set
             OR R1, #1, R2      ; R2 = R1 | 1
             STQ_C R2, lock     ; store conditional
             BEQ R2, ACQUIRE    ; retry if unsuccessful
             MB                 ; memory barrier

             <update shared data structure>

             MB                 ; memory barrier
             STQ R31, lock      ; clear lock [R31 always 0]

• BLBS: branch if register low bit set
• MB: memory barrier which is equivalent to an IA32/x64 memory fence [see the earlier section on serialising instructions]
• means that a write to the lock is ONLY performed when the lock is obtained, resulting in much less bus traffic when the lock is contested
Cost of Sharing Data Between Threads
• no problem if threads only make read accesses
• with the MESI cache coherency protocol, writes to shared data will cause copies in the other caches to be invalidated
• program read and write accesses are typically 32 or 64 bits, while the size of a cache line is typically 64 bytes [with the latest Intel CPUs]
• means that data in a cache line can be invalidated by a write to another part of the cache line
• known as false sharing [cache line shared rather than the data]
• can be prevented by storing data in its own cache line
• sharing.cpp written to evaluate the cost of sharing
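Storing each item in its own cache line can be sketched with C11's alignas; the 64-byte figure is an assumption matching the line size quoted above (a real program would query it, e.g. via CPUID as sharing.cpp does):

```c
#include <stdalign.h>
#include <stddef.h>

/* one counter per cache line: alignas forces every element of an array of
   padded_counter onto its own 64-byte line, so a write by one thread
   cannot invalidate the line holding another thread's counter */
enum { LINE_SZ = 64 };   /* assumed cache line size */

typedef struct {
    alignas(LINE_SZ) long count;   /* struct size is rounded up to LINE_SZ */
} padded_counter;
```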
Cost of Sharing Data Between Threads…
• uses the following data structure
• each long variable stored in its own cache line
• code uses CPUID instruction to find CPU cache line size [see CPUID application Note]
• each thread repeatedly executes InterlockedExchangeAdd() to increment a thread specific or a shared variable for NSECONDS
• 0%, 25%, 50%, 75% and 100% sharing determined by how often the shared variable is incremented relative to the thread specific variable
Cost of Sharing Data Between Threads…
• for 25% sharing, for example, each thread executes

    InterlockedExchangeAdd(GINDX(thread), 1);     // thread specific
    InterlockedExchangeAdd(GINDX(thread), 1);     // thread specific
    InterlockedExchangeAdd(GINDX(thread), 1);     // thread specific
    InterlockedExchangeAdd(GINDX(maxThread), 1);  // shared

    NB: threads numbered 0 .. maxThread-1

• use _aligned_malloc to allocate data on a cache line boundary

    volatile long *g;   // NB: position of volatile
    g = (long*) _aligned_malloc((maxThread+1)*lineSz, lineSz);   // shared global variable

• GINDX macro defined as follows

    #define GINDX(n) (g + n*lineSz/sizeof(long))   // index into g
Code to Create and Run N Threads for NSECONDS…

    int _tmain(int argc, _TCHAR* argv[])
    {
        …
        for (sharing = 0; sharing <= 100; sharing += 25) {           // sharing range
            for (int nt = 1; nt <= 2*ncpus; nt *= 2) {               // thread range
                tstart = clock();
                for (int thread = 0; thread < nt; thread++)          // create and start nt threads
                    threadH[thread] = CreateThread(NULL, 0, worker, (LPVOID) thread, 0, NULL);
                WaitForMultipleObjects(nt, threadH, true, INFINITE); // wait for ALL threads to finish
                printResults();                                      // print results
                for (int thread = 0; thread < nt; thread++)          // delete thread handles
                    CloseHandle(threadH[thread]);
            }
        }
        …
    }
Worker Function

    DWORD WINAPI worker(LPVOID thread)
    {
        long long ops = 0;                                  // 64 bit local counter
        while (1) {
            for (int i = 0; i < NOPS / 4; i++) {            // NOPS/4 since work comprises…
                // do some work                             // …4 InterlockedExchangeAdd operations
            }
            ops += NOPS;                                    // local to thread
            if (clock() - tstart > NSECONDS*CLOCKS_PER_SEC) // NSECONDS of work?
                break;
        }
        cnt[(int) thread] = ops;                            // remember in global cnt array
        return 0;
    }
Cost of Sharing Data Between Threads… • NB: cache data retrieved from CPU using CPUID instruction
Cost of Sharing Data Between Threads…

[Figure: bar chart "Cost of CPUs Sharing Write Data" — throughput relative to a single thread with 0% sharing, plotted against # threads (1, 2, 4, 8, 16) for 0%, 25%, 50%, 75% and 100% sharing]
Cost of Sharing Data Between Threads…
• comments on the graph…
• replacing InterlockedExchangeAdd(g, 1) with (*g)++
TestAndSet Lock
• declaration of lock

    volatile long lock = 0;   // lock stored in shared memory

• to acquire lock

    while (InterlockedExchange(&lock, 1));   // wait for lock [0:free 1:taken]

• to release lock

    lock = 0;   // clear lock

• if an xchg instruction [InterlockedExchange] is used to obtain a lock, performance is poor when there's contention for the lock
• need to remember how the MESI cache coherency protocol operates if considering IA32/x64 CPUs
TestAndSet Lock…
• ALL waiting CPUs repeatedly execute an xchg instruction trying to get hold of the lock
• the memory accesses made by the xchg instruction don't benefit from having a cache: the shared cache lines are continually overwritten [even if the lock is a 1, it is overwritten with a 1], which invalidates the entries in the other caches and results in bus cycles for both the read and write parts of ALL xchg instructions [think MESI]
• ALL the xchg reads and writes will be to memory
• a write update cache coherency protocol would allow the reads to be local cache reads [Firefly]
• the lock is overwritten even if there is NO chance of obtaining the lock
• why is there not an instruction which conditionally writes a 1 if the value read is 0 [e.g. conditional testAndSet]?
TestAndTestAndSet Lock
• designed to take advantage of underlying cache behaviour
• to acquire lock [optimistic version]

    while (InterlockedExchange(&lock, 1))  // try for lock
        while (lock == 1)                  // wait until lock free
            _mm_pause();                   // intrinsic, see next slide

• to acquire lock [pessimistic version]

    do {
        while (lock == 1)                  // wait until lock free
            _mm_pause();                   // intrinsic, see next slide
    } while (InterlockedExchange(&lock, 1)); // try for lock

• optimistic version assumes lock is going to be free
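The optimistic acquire above can be sketched portably with C11 atomics in place of InterlockedExchange (type and function names are illustrative); the inner loop spins on a plain load which is satisfied from the local cache:

```c
#include <stdatomic.h>

typedef atomic_int ttas_lock;                  /* 0 = free, 1 = taken */

/* test-and-test-and-set: only attempt the atomic exchange when the
   lock has been observed free */
void ttas_acquire(ttas_lock *lock)
{
    while (atomic_exchange(lock, 1))           /* try for lock         */
        while (atomic_load(lock) == 1)         /* wait until lock free */
            ;                                  /* _mm_pause() here     */
}

void ttas_release(ttas_lock *lock)
{
    atomic_store(lock, 0);                     /* clear lock */
}
```

In a real spin loop the empty body would issue a pause hint; it is omitted here to stay portable.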
IA32 TestAndTestAndSet Lock… • 7.11.2 PAUSE Instruction
The PAUSE instruction can improve the performance of processors supporting Hyper-Threading Technology when executing “spin-wait loops” and other routines where one thread is accessing a shared lock or semaphore in a tight polling loop. When executing a spin-wait loop, the processor can suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation and flushes the core processor’s pipeline. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation and prevent the pipeline flush. In addition, the PAUSE instruction de-pipelines the spin-wait loop to prevent it from consuming execution resources excessively. (See Section 7.11.6.1, “Use the PAUSE Instruction in Spin-Wait Loops,” for more information about using the PAUSE instruction with IA-32 processors supporting Hyper-Threading Technology.)
TestAndTestAndSet Lock…
• the advantage is that the test of the lock [lock == 1] is executed entirely within the cache and the xchg instruction is only executed when the lock is known to be free and there is a chance of acquiring the lock
• the cached lock variable will be invalidated or updated when the lock is released and only then is an attempt made to obtain the lock by executing an xchg instruction
• if the release of the lock invalidates the other shared cache lines then O(n²) [where n is the number of CPUs waiting for the lock] bus cycles will be generated? a claim from the literature
• ALL n waiting CPUs continuously read the lock [from their own local cache]; these cache lines will be invalidated when the lock is released; subsequent reads of the lock will appear on the bus, be serialised by a typical round-robin bus arbiter, and each CPU, in turn, will see the lock free
TestAndTestAndSet Locks…
• an individual CPU executes its xchg instruction but then sees the remaining CPUs executing their xchg instructions which will invalidate its cache line, so a bus cycle has to be performed to read the lock again, i.e. O(n²)
• however, won't the bus cycles for the xchg be such that all CPUs will execute them one after another [thanks to the round robin arbiter] so that a CPU's cache line is effectively invalidated only once? i.e. O(n)
• if the release of the lock updates the other caches directly then the generated bus traffic will only be O(n)
• either way there will be enough bus activity to interfere with the process in the critical section as well as the other processes not involved with the lock
• if the lock is held for a long time the impact is unimportant, but for short critical sections the lock will be released before the last spurt of activity has subsided, resulting in continued bus saturation
TestAndSet Lock with Exponential Back Off
• don't continuously try to acquire the lock, delay between attempts; to acquire lock:

    d = 1;                                   // initialise back off delay
    while (InterlockedExchange(&lock, 1)) {  // if unsuccessful…
        delay(d);                            // delay d time units
        d *= 2;                              // exponential back off
    }

• testAndTestAndSet lock NOT necessary when using a back off scheme
• the longer a CPU has been waiting for the lock, the longer it will have to wait before it attempts to acquire the lock again, hence the possibility of starvation
• supposed to work well in practice
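A portable sketch of the back off acquire with C11 atomics; delay() is modelled here as a simple busy loop, and a real implementation might also cap d (names are illustrative):

```c
#include <stdatomic.h>

/* stand-in for delay(d): burn time locally, generating no bus traffic */
static void delay(long units)
{
    for (volatile long i = 0; i < units; i++)
        ;
}

/* test-and-set with exponential back off [lock: 0 = free, 1 = taken] */
void backoff_acquire(atomic_int *lock)
{
    long d = 1;                            /* initialise back off delay */
    while (atomic_exchange(lock, 1)) {     /* if unsuccessful…          */
        delay(d);                          /* delay d time units        */
        d *= 2;                            /* exponential back off      */
    }
}
```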
Ticket Lock with Proportional Back Off

    class TicketLock {
    public:
        volatile long ticket;       // initialise to 0
        volatile long nowServing;   // initialise to 0
    };

    inline void acquire(TicketLock *lock)                         // acquire lock
    {
        int myTicket = InterlockedExchangeAdd(&lock->ticket, 1);  // get ticket [atomic]
        while (myTicket != lock->nowServing)                      // if not our turn…
            delay(myTicket - lock->nowServing);                   // delay relative to…
    }                                                             // position in Q

    inline void release(TicketLock *lock)                         // release lock
    {
        lock->nowServing++;                                       // give lock to next CPU
    }                                                             // NB: not atomic
Ticket Lock with Proportional Back Off…
• think of waiting in a Q in the Andrews St. tourist office, the ISS computer help desk, A&E, …
• deterministic
• ONLY 1 atomic instruction executed per lock acquisition
• FAIR, locks granted in order of request which eliminates the possibility of starvation
• back off proportional to position in Q
• if time in the critical section is constant, the delay can be calculated such that the subsequent test of lock->nowServing will just succeed
• still polls a common location [lock->nowServing] which will cause some bus traffic with an invalidate protocol
• delay not necessary with a write-update protocol [Firefly]
MCS Lock [Mellor-Crummey and Scott]
• lockless queue of waiting threads
• each thread has its own QNode which is linked into a Q of QNodes waiting for the lock
• a global variable lock points to the tail of the Q
• acquire lock by adding a thread’s QNode [qn] to the tail of the Q and waiting until qn->waiting == 0
• release lock by setting qn->next->waiting = 0 [if qn not at the tail of the Q]
MCS Lock…
• before looking at the code for the MCS lock need to discuss
    the Compare and Swap (CAS) instruction
    how to allocate objects on cache line boundaries
    thread local storage
Compare and Swap [CAS]
• pseudo C version of CAS

    atomic long CAS(long *a, long e, long n)   // memory address, expected value, new value
    {
        long r = *a;    // read contents of memory address
        if (r == e)     // compare with expected value and if equal…
            *a = n;     // update memory with new value
        return r;       // success if e returned
    }

• NB: returns expected value if exchange took place
• CAS can be mapped onto the IA32/x64 compare and exchange instruction

    cmpxchg reg, mem    // if (eax == mem)
                        //     ZF = 1, mem = reg
                        // else
                        //     ZF = 0, eax = mem
Compare and Swap…
• make use of the following intrinsic defined in intrin.h

    long InterlockedCompareExchange(long volatile *a, long n, long e);

  NB: different parameter order than the previous/normal definition of CAS
• for convenience can always define

    #define CAS(a, e, n) InterlockedCompareExchange(a, n, e)
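The same re-ordering trick applies to C11, whose atomic_compare_exchange_strong takes (address, &expected, new); a small wrapper, sketched here, restores the (a, e, n) order and the "returns the value read" convention of the pseudo-C CAS:

```c
#include <stdatomic.h>

/* CAS with the pseudo-C signature: (address, expected, new), returning the
   value read from memory; the swap happened iff the return value equals e */
long CAS(atomic_long *a, long e, long n)
{
    long observed = e;                                /* value we expect to find */
    atomic_compare_exchange_strong(a, &observed, n);  /* observed updated on failure */
    return observed;                                  /* == e iff swap happened  */
}
```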
How to allocate objects aligned on a cache line
• can allocate objects in their own cache line(s) to avoid false sharing
• one straightforward approach is to use a template class to override new and delete

    //
    // derive from ALIGNEDMA for aligned memory allocation
    //
    template <class T>
    class ALIGNEDMA {
    public:
        void* operator new(size_t);    // override new
        void operator delete(void*);   // override delete
    };

• C++ magic
How to allocate objects aligned to a cache line…

    //
    // new
    //
    template <class T>
    void* ALIGNEDMA<T>::operator new(size_t sz)
    {
        sz = (sz + lineSz - 1) / lineSz * lineSz;  // make sz a multiple of lineSz
        return _aligned_malloc(sz, lineSz);        // allocate on a lineSz boundary
    }

    //
    // delete
    //
    template <class T>
    void ALIGNEDMA<T>::operator delete(void *p)
    {
        _aligned_free(p);   // free object
    }
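A portable analog of the operator new above: round the size up to a multiple of the line size and use C11 aligned_alloc, which requires the size to be a multiple of the alignment. LINE_SZ = 64 is an assumption standing in for the lineSz discovered via CPUID:

```c
#include <stdlib.h>

enum { LINE_SZ = 64 };   /* assumed cache line size */

/* allocate sz bytes on a LINE_SZ boundary, padded to whole lines */
void *aligned_ma_new(size_t sz)
{
    sz = (sz + LINE_SZ - 1) / LINE_SZ * LINE_SZ;  /* make sz a multiple of LINE_SZ  */
    return aligned_alloc(LINE_SZ, sz);            /* allocate on a LINE_SZ boundary */
}
```

The returned pointer is released with free(), the portable counterpart of _aligned_free.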
MCS Lock…
• derive QNode from ALIGNEDMA
• each QNode will be allocated its own cache line aligned on a cache line boundary

    class QNode : public ALIGNEDMA<QNode> {
    public:
        volatile int waiting;
        volatile QNode *next;
    };

• when a new QNode is created…

    QNode *qn = new QNode();

• the ALIGNEDMA new function is called to allocate space for the QNode
Thread Local Storage [Tls]
• allocate next available Tls index

    DWORD tlsIndex = TlsAlloc();   // get a Tls index which all threads can use

• set value stored at tlsIndex

    QNode *qn = new QNode();       // at start of worker function
    TlsSetValue(tlsIndex, qn);

• get value stored at tlsIndex

    volatile QNode *qn = (QNode*) TlsGetValue(tlsIndex);

• TlsGetValue used by acquire() and release() to get a pointer to the thread’s local QNode
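With compiler support the Tls calls above collapse to a single thread-local variable: each thread sees its own copy without explicit index management. A sketch in C11 (the QNode layout and function names are illustrative stand-ins for the class and calls used by acquire/release):

```c
/* minimal QNode layout matching the fields the MCS lock needs */
typedef struct QNode {
    volatile int waiting;
    struct QNode * volatile next;
} QNode;

/* one slot per thread: plays the role of the value stored at tlsIndex */
static _Thread_local QNode *myQNode;

void set_my_qnode(QNode *qn) { myQNode = qn; }    /* TlsSetValue */
QNode *get_my_qnode(void)    { return myQNode; }  /* TlsGetValue */
```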
MCS Lock acquire

    inline void acquire(QNode **lock)
    {
        volatile QNode *qn = (QNode*) TlsGetValue(tlsIndex);
        qn->next = NULL;
        volatile QNode *pred = (QNode*) InterlockedExchangePointer((PVOID*) lock, (PVOID) qn);
        if (pred == NULL)
            return;
        qn->waiting = 1;
        pred->next = qn;
        while (qn->waiting);
    }
MCS Lock…

inline void release(QNode **lock) {
    volatile QNode *qn = (QNode*) TlsGetValue(tlsIndex);
    volatile QNode *succ;
    if (!(succ = qn->next)) {
        if (InterlockedCompareExchangePointer((PVOID*) lock, NULL, (PVOID) qn) == qn)
            return;
        do {
            succ = qn->next;
        } while (!succ);
    }
    succ->waiting = 0;
}
MCS Lock acquire
• pred = InterlockedExchange(lock, qn) performed atomically (1) • think about what happens if two or more threads try to acquire lock simultaneously • if pred is NULL [previous value of lock] then at head of Q so have lock otherwise… • set qn->waiting = 1 and… • link thread’s QNode to tail of existing Q by setting pred->next = qn (2) • wait until qn->waiting == 0
MCS Lock release
• if (qn->next != NULL) set qn->waiting = 0 which passes lock to next thread in Q
• if (qn->next == NULL) use InterlockedCompareExchangePointer(lock, NULL, qn) to atomically set lock = NULL if lock == qn and return if successful [there are no more threads waiting for lock] otherwise…
• a call to acquire() by another thread must have added a QNode between qn and lock • follow qn->next until not NULL and assign to succ which then points to next QNode in Q • set succ->waiting = 0 to pass lock to next thread [no explicit removal of QNodes from Q]
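• putting acquire and release together, a portable sketch of the MCS lock [std::atomic operations stand in for the Interlocked* intrinsics and a thread_local QNode stands in for the node held in Tls; this is an illustrative translation, not the course's Win32 implementation]

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct QNode {
    std::atomic<int> waiting{0};
    std::atomic<QNode*> next{nullptr};
};

thread_local QNode myNode;                // per-thread node, as in the Tls slides

void acquire(std::atomic<QNode*>& lock) {
    QNode* qn = &myNode;
    qn->next.store(nullptr);
    QNode* pred = lock.exchange(qn);      // atomically swap self onto tail of Q
    if (pred == nullptr) return;          // Q was empty: lock acquired
    qn->waiting.store(1);
    pred->next.store(qn);                 // link behind predecessor
    while (qn->waiting.load()) { }        // spin on own cache line only
}

void release(std::atomic<QNode*>& lock) {
    QNode* qn = &myNode;
    QNode* succ = qn->next.load();
    if (succ == nullptr) {
        QNode* expected = qn;
        if (lock.compare_exchange_strong(expected, nullptr))
            return;                       // no waiters: lock now free
        while ((succ = qn->next.load()) == nullptr) { }  // waiter mid-enqueue
    }
    succ->waiting.store(0);               // pass lock to next thread in Q
}
```

Each waiter spins on its own QNode, which is the point of MCS: the handover write touches only the successor's cache line.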
Testing Framework • create a framework to compare the performance of locked and lockless lists
• will use VC++
• source code on CS4021 web site [single source for Win32 and x64]
• implement an ordered list with add(key) and remove(key) operations
• create n threads which pseudo randomly add or remove items from a list
• add and remove operations occur with equal probability
• generate keys pseudo randomly in range 0 .. maxkey-1
• changing key range controls the length of the list and also the amount of contention between threads [less contention with longer lists]
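• a minimal sketch of such a harness [a std::mutex-protected std::set stands in for the lists under test; the thread count, key range, seed constants and runtime are illustrative]

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// n threads pseudo-randomly add or remove keys in [0, maxKey) for a fixed
// interval; returns the total number of completed operations
long runTest(int nthreads, int maxKey, std::chrono::milliseconds runtime) {
    std::set<int> list;
    std::mutex m;
    std::atomic<long> ops{0};
    std::atomic<bool> stop{false};
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t) {
        ts.emplace_back([&, t] {
            unsigned seed = 12345u + t;       // per-thread PRNG [rand() shares state]
            long local = 0;
            while (!stop.load()) {
                seed = seed * 1103515245u + 12345u;
                int key = (seed >> 16) % maxKey;
                std::lock_guard<std::mutex> g(m);
                if (seed & 1) list.insert(key); else list.erase(key);
                ++local;
            }
            ops += local;
        });
    }
    std::this_thread::sleep_for(runtime);
    stop = true;
    for (auto& th : ts) th.join();
    return ops.load();
}
```

The real framework swaps the mutex-protected set for each lock variant and the lockless list, and reports ops per second per configuration.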
Testing Framework… • vary key range [16, 64, 256, …] and number of threads [1, 2, 4, 8, …]
• limit maximum number of threads to be twice the number of cores
• run each test for NSECONDS [e.g. 10 seconds] and report results
      test set up and configuration and date
      key range
      number of threads
      runtime
      operations per second
      performance relative to a single thread
• make sure tests are run with PC set to the high performance power plan
• results generated on a DELL M4600 precision laptop
C++ Node and List class definitions
• develop test framework for testing the performance of a list protected by different kinds of locks [CriticalSection, testAndSet, testAndTestAndSet,…]
class Node: public ALIGNEDMA<Node> {    // derive from ALIGNEDMA
public:
    int key;                // key
    Node *next;             // points to next node in list
};

class List: public ALIGNEDMA<List> {    // derive from ALIGNEDMA
private:
    Node *head;             // head of list
    DECLARE();              // macro to declare CriticalSection, testAndSet lock, …
public:
    List();                 // constructor
    ~List();                // destructor
    int add(int key);       // return 1 if successful
    int remove(int key);    // return 1 if successful
};
Updating List
• List constructor

List::List() {                  // constructor
    head = new Node(0, NULL);   // sentinel
    INIT();                     // macro to initialise lock
}
• ONLY ONE thread can update list at a time, protect by acquiring lock
int List::add(int key) {
    ACQUIRE();      // macro to acquire CriticalSection, testAndSet lock, …
    …               // update protected by lock
    RELEASE();      // macro to release CriticalSection, testAndSet lock, …
}
testAndSet Results
testAndSet Results…
[graph: MOps per second vs # threads (1–16) for a list protected by a testAndSet lock; one curve per key range: key=16, 64, 256, 1024, 4096, 16384; y-axis 0 to 18,000]
Spreadsheet Model of testAndSet Lock
• work in progress, will put on CS4021 web site
[graph: spreadsheet model — ops relative to a single thread (key=16) vs # threads (1–16); one curve per key range: key=16, 64, 256, 1024, 4096, 16384; y-axis 0.00 to 1.20]
testAndSet Results… • do the results make sense? and why are they so poor?
      one thread will be updating the list while all others will be trying to obtain the lock
      each attempt to acquire the lock requires the execution of an xchg instruction
      each exchange instruction not only reads memory but also writes a 1 to the lock [even if it's already a 1], invalidating copies of the lock in other caches [MESI protocol]
      this greatly increases the bus traffic [reads and writes of the lock will be to/from memory] which significantly reduces the speed of the thread that has the lock
      if a thread is pre-empted holding the lock, it will obstruct other threads from making progress [this effect is probably not too significant]
      significantly reduced performance due to increased bus traffic from (1) continuously executing the xchg instruction and (2) sharing modified list nodes
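• the testAndTestAndSet variant mentioned earlier addresses point (1): spin on an ordinary read and only execute the expensive exchange when the lock looks free [a sketch with std::atomic; exchange plays the role of xchg]

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct TASLock {
    std::atomic<int> lock{0};
    void acquire() { while (lock.exchange(1)) { } }  // xchg on every iteration
    void release() { lock.store(0); }
};

struct TTASLock {
    std::atomic<int> lock{0};
    void acquire() {
        while (true) {
            while (lock.load()) { }          // spin locally: read only, no invalidations
            if (!lock.exchange(1)) return;   // xchg only when lock looks free
        }
    }
    void release() { lock.store(0); }
};
```

While a TTAS waiter spins, it reads its own cached copy of the lock; the broadcast invalidation happens only on the occasional exchange.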
Ticket Lock Results
NB: unusual results when threads = 16
Ticket Lock Results… • unusual results when threads = 16
• otherwise better performance than testAndSet lock
[graph: ticket lock — ops relative to a single thread (maxKey=16) vs # threads (1–16); one curve per key range: key=16, 64, 256, 1024, 4096, 16384; y-axis 0.00 to 1.20]
Ticket Lock Results • unusual results when threads > NCPUS [threads = 16, NCPUS= 8]
• assume the following code for acquire
inline void acquire(TicketLock *lock) {
    int myTicket = InterlockedExchangeAdd(&lock->ticket, 1);
    while (myTicket != lock->nowServing)
        _mm_pause();
}
• why? what's happening?
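• for reference, the complete lock sketched portably [std::atomic fetch_add in place of InterlockedExchangeAdd; the _mm_pause spin hint is omitted]

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct TicketLock {
    std::atomic<int> ticket{0};       // next ticket to hand out
    std::atomic<int> nowServing{0};   // ticket currently holding the lock
};

void acquire(TicketLock* lock) {
    int myTicket = lock->ticket.fetch_add(1);        // take a ticket atomically
    while (myTicket != lock->nowServing.load()) { }  // wait until called
}

void release(TicketLock* lock) {
    lock->nowServing.fetch_add(1);    // serve next ticket: strict FIFO handover
}
```

The strict FIFO handover is what makes the lock sensitive to pre-emption: if the thread holding the next ticket is descheduled, every other thread must wait for it.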
Ticket Lock Results…
• idealised diagram of what is happening • to simplify diagram assume 4 cores and 8 threads
• threads run for an OS time quantum
• need to wait for the quantum to end before tickets 4, 8, … start to run
• hence 4 tickets/updates per OS time quantum
• what is the OS time quantum?
Ticket Lock Results…
• exchangeTicketRate.cpp
• simply count how many ticket lock acquire and release operations can be performed per second
[graph: Ticket Exchange Rate — millions of ticket exchanges per second vs # threads (1–31); y-axis 0 to 45; NB: 8 cores]
Ticket Lock Results… • graph in terms of how long it takes to exchange ticket lock between threads
• estimate of OS time quantum is 8 x 0.01ms = 0.08ms
• seems too fast [need to find another way to get this value]
[graph: time to exchange ticket between two threads (ms), derived from ticket exchanges per second, vs # threads (1–32); y-axis 0.00 to 0.09; NB: 8 cores]
Lockless List Implementation
• objective is to implement an ordered list where op/s increases with the number of threads
• need to consider calls to new and delete which are called inside add and remove
• new and delete need to be lockless otherwise they will become the bottleneck
• memory management is critical
• same argument for rand()
• quite a challenge ahead
Need to know the meaning of the following terms
      deadlock
livelock
convoying
priority inversion
obstruction free
lock free
wait free
linearisation point
Lockless List Implementation using CAS
• use CAS to add nodes 15 and 35
• search for insertion point and execute a CAS with correct parameters

      CAS(&a->next, b, c)
      CAS(&d->next, e, f)
• disjoint-access parallelism
Lockless List Implementation using CAS
• if 2 threads try to add nodes at the same position
      CAS(&a->next, b, c)   // assume this CAS executes first and succeeds…
      CAS(&a->next, b, d)   // consequently this CAS will fail
• first CAS executed succeeds, second fails as a->next != b • on failure need to RETRY operation • search AGAIN for insertion point and, if found, re-execute CAS [costly if list long]
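• the add path described above can be sketched as follows [insert-only list with a sentinel head; removal, marking and the ABA hazards discussed next are deliberately left out]

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int key;
    std::atomic<Node*> next{nullptr};
    Node(int k) : key(k) {}
};

// CAS-based add: search for the insertion point and retry the CAS on failure
bool add(Node* head, int key) {
    Node* n = new Node(key);
    while (true) {
        Node* pred = head;                 // sentinel: never removed
        Node* curr = pred->next.load();
        while (curr && curr->key < key) {  // find insertion point
            pred = curr;
            curr = curr->next.load();
        }
        if (curr && curr->key == key) {    // no duplicate keys in the set
            delete n;
            return false;
        }
        n->next.store(curr);
        Node* expected = curr;
        if (pred->next.compare_exchange_strong(expected, n))
            return true;
        // CAS failed: another thread changed pred->next — search again and retry
    }
}
```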
What can go wrong with an add?
• imagine insertion point found, BUT before CAS(&a->next, b, c) is executed, the thread is suspended
• another thread then removes b from list and frees the memory block used by b [free(b)]
What can go wrong with an add?
• another thread adds another key [12] at the same position in the list using the SAME memory block which just happens to be returned by the memory allocator [malloc()]
• if the suspended thread now resumes and executes its
      CAS(&a->next, b, c)
• the CAS will succeed as a->next still equals b, BUT the node is NOT inserted at the correct position
• known as the ABA problem
Using CAS to remove nodes • search for node and then execute CAS with correct parameters • consider 2 threads removing non-adjacent nodes [disjoint-access parallelism]
      CAS(&a->next, b, c)   // both will succeed
      CAS(&c->next, d, 0)   // both will succeed
Using CAS to remove nodes • if two threads try to remove the same node
      CAS(&a->next, b, c)
      CAS(&a->next, b, c)
• first CAS executed succeeds
• second CAS executed fails as a->next no longer equals b • retry on failure, which means searching AGAIN for node [NB. may now not be found!]
What can go wrong with remove?
• imagine adding a node and removing a node concurrently
      CAS(&a->next, b, c);  // delete 20
      CAS(&b->next, c, d);  // insert 25
• NOT what was intended!
What can go wrong with remove? • consider deleting adjacent nodes
      CAS(&a->next, b, c)   // delete 20
      CAS(&b->next, c, d)   // delete 30
• AGAIN NOT what was intended
A Pragmatic Implementation of Non-Blocking Linked Lists
Tim Harris [2001]
• two step removal [consider remove(20)]
• node atomically marked [logically deleted] before updating pointer using CAS
• marked node indicated by an odd address in next field [possible as nodes normally aligned on 4 byte boundaries]

      is_marked_reference(r)        // returns 1 if marked
      get_marked_reference(r)       // convert to marked reference
      get_unmarked_reference(r)     // convert to unmarked reference
• tests, sets and clears LSB of address [which is stored in next field]
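• one possible realisation of the three helpers, tagging the least significant bit of the pointer stored in next [assumes nodes are at least 4-byte aligned, so the LSB of a real node address is always 0]

```cpp
#include <cassert>
#include <cstdint>

struct Node;   // only pointers are manipulated here

inline int is_marked_reference(Node* r) {
    return (std::uintptr_t)r & 1;                          // test LSB
}
inline Node* get_marked_reference(Node* r) {
    return (Node*)((std::uintptr_t)r | 1);                 // set LSB
}
inline Node* get_unmarked_reference(Node* r) {
    return (Node*)((std::uintptr_t)r & ~(std::uintptr_t)1);  // clear LSB
}
```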
A Pragmatic Implementation of Non-Blocking Linked Lists… • to atomically mark node [logically delete]
CAS(&b->next, c, get_marked_reference(c)); • then use CAS to update pointer
CAS(&a->next, b, c)
• can only update an unmarked pointer
• an intelligent find() removes ALL marked nodes to the immediate left of the insertion point or node to be deleted
A Pragmatic Implementation of Non-Blocking Linked Lists… • examine code taken directly from paper, but note that it…
"is intended merely as pseudo-code and does not reflect an optimised (or even necessarily) correct implementation"
class List<KeyType> {
Node<KeyType> *head;
Node<KeyType> *tail;
List() {
head = new Node<KeyType>();
tail = new Node<KeyType>();
head.next = tail;
}
}
List and Node Class Definitions
• how many mistakes can you spot in the code snippet above? will need to fix errors in order to get the code working
class Node<KeyType> {
KeyType key;
Node *next;
Node (KeyType key) {
this.key = key;
}
}
add [insert]
public boolean List::insert(KeyType key) {
    Node *new_node = new Node(key);
    Node *right_node, *left_node;
    do {
        right_node = search(key, &left_node);
        if ((right_node != tail) && (right_node.key == key))    // T1
            return false;
        new_node.next = right_node;
        if (CAS(&left_node.next, right_node, new_node))         // C2
            return true;
    } while (true);                                             // B3
}
      allocate new node to insert
      search returns pointers to the unmarked nodes to left and right of insertion point
      return false if key already in list
      try to insert node by using CAS to update pointer
      keep trying if the CAS fails
add [insert]…
insertion is reasonably straightforward, but makes use of an intelligent find function
• find should return adjacent left and right pointers
• CAS(&left.next, right, new) will only succeed if there are no nodes [marked or unmarked] between left and right and if left is also unmarked
• of course, another thread could have inserted a node between left and right before the CAS is executed
• a node cannot be inserted by linking to a marked [logically deleted] node thus avoiding one of the problems mentioned in the previous slides
remove [delete]

public boolean List::delete(KeyType search_key) {
    Node *right_node, *right_node_next, *left_node;
    do {
        right_node = search(search_key, &left_node);
        if (right_node == tail || right_node.key != search_key)     // T1
            return false;
        right_node_next = right_node.next;
        if (!is_marked_reference(right_node_next))
            if (CAS(&right_node.next, right_node_next,
                    get_marked_reference(right_node_next)))
                break;
    } while (true);                                                 // B4
    if (!CAS(&left_node.next, right_node, right_node_next))         // C4
        right_node = search(right_node.key, &left_node);
    return true;
}
      search returns pointer to unmarked node to delete [if not present then unmarked node with next higher key] and address of unmarked node to its left
      return false if key not in list
      try to mark unmarked node; once marked, node is logically deleted
      keep trying
      try to remove node from list by using CAS to update pointer [can remove a number of adjacent marked nodes]
      if CAS fails, use search to remove marked nodes from list
remove [delete]…
• assume initial search has returned left and right and that the right node has been marked [logically deleted]
• imagine that before CAS(&left->next, right, right->next) is executed to remove the node from the list, another thread inserts a node between left and right
remove [delete]…
• CAS to remove node will fail
• since node is logically deleted there is no point in calling delete again…
• BUT calling search again will remove any marked node(s) immediately before key
• NOT calling search would simply mean that the marked node(s) would remain in the list until another node is inserted after 20 [in this example state]
• how could the list get into the following state?
find [search]

private Node *List::search(KeyType search_key, Node **left_node) {
    Node *left_node_next, *right_node;
search_again:
    do {
        Node *t = head;
        Node *t_next = head.next;
        do {
            if (!is_marked_reference(t_next)) {
                *left_node = t;
                left_node_next = t_next;
            }
            t = get_unmarked_reference(t_next);
            if (t == tail)
                break;
            t_next = t.next;
        } while (is_marked_reference(t_next) || (t.key < search_key));  // T1
        right_node = t;
        if (left_node_next == right_node)
            if ((right_node != tail) && is_marked_reference(right_node.next))
                goto search_again;                                      // G1
            else
                return right_node;                                      // R1
        if (CAS(&(left_node.next), left_node_next, right_node))         // C1
            if ((right_node != tail) && is_marked_reference(right_node.next))
                goto search_again;                                      // G2
            else
                return right_node;                                      // R2
    } while (true);                                                     // B2
}
      find left_node and right_node, ignoring any marked nodes
      check left_node and right_node are adjacent
      remove one or more marked nodes
      optimisation [G1, G2: search again if the right node has become marked]
find [search]
• step 1: iterates along list to find the first unmarked node >= key; this is the right node; the left node refers to the previous unmarked node found
• step 2: if the left node is the immediate predecessor of the right node, the search returns [returns with no marked nodes between left and right]
• step 3: use CAS to remove marked node(s) between the left and right nodes; on failure the search is retried
• the optimisation checks if the right node has become marked [logically deleted] and performs the search again rather than returning and then failing in add or remove
A Pragmatic Implementation of Non-Blocking Linked Lists…
• what is NOT said!
• insert allocates a new node even if insertion fails
• NO code for freeing or re-using nodes
• nodes never become unmarked
• avoids ABA problem by not re-using nodes which also…
• avoids problem of threads traversing list using pointers to freed nodes
• assumes nodes are garbage collected in a safe way [not an easy problem to solve]
• ONLY a partial solution without memory management [perhaps the harder problem]
Memory Management
• use garbage collection [Java, but not yet in C++]
• reference counting
• deferred freeing of nodes [see end of section 6 in Harris paper]
      each node contains an additional link field so that it can be added to a per thread retireQ and reuseQ
      each thread takes a copy of a global timer [e.g. clock()] before starting an add or remove operation and saves it in a global startOp array [each thread's startOp stored in its own cache line for speed]
      add and remove operations add any freed nodes to the retireQ and set the key field to the startOp of the thread
      add and remove operations, before they exit, can traverse the retireQ and transfer nodes to a reuseQ if their startOp is less than the minimum startOp of any thread, since no thread can still have a reference to the node
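• a compact sketch of this scheme [the link field threads nodes onto intrusive retire/reuse queues and the key field is reused to hold the retire stamp, as described above; the startOp array here is unpadded for brevity]

```cpp
#include <cassert>

struct RNode {
    long key = 0;            // reused to hold the retire stamp once retired
    RNode* next = nullptr;   // list linkage
    RNode* link = nullptr;   // retireQ/reuseQ linkage
};

struct PerThread {
    RNode* retireHead = nullptr;
    RNode* retireTail = nullptr;
    RNode* reuseHead  = nullptr;

    void retire(RNode* n, long stamp) {    // freed node enters retireQ, stamped
        n->key = stamp;
        n->link = nullptr;
        if (retireTail) retireTail->link = n; else retireHead = n;
        retireTail = n;
    }

    void scan(const long* startOp, int nthreads) {
        long minStart = startOp[0];        // minimum startOp over all threads
        for (int i = 1; i < nthreads; ++i)
            if (startOp[i] < minStart) minStart = startOp[i];
        while (retireHead && retireHead->key < minStart) {
            RNode* n = retireHead;         // no thread can still reference n
            retireHead = n->link;
            if (!retireHead) retireTail = nullptr;
            n->link = reuseHead;           // transfer to reuseQ
            reuseHead = n;
        }
    }

    RNode* alloc() {                       // reuse before calling new
        if (!reuseHead) return new RNode();
        RNode* n = reuseHead;
        reuseHead = n->link;
        return n;
    }
};
```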
Memory Management
• add retired nodes to end of retireQ
• the minimum thread startOp time is 129
• can transfer all nodes in retireQ with startOp < 129 to reuseQ [first three nodes]
• allocate nodes from per thread reuseQ and only call new/malloc if empty
• why is a link field needed? why not use next?
Hazard Pointers
• Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects
      Maged Michael (2004)
      IEEE Transactions on Parallel and Distributed Systems 15 (8): 491–504
• in terms of an ordered linked list, there are two active pointers as the list is traversed during a find operation [number will be different for other algorithms]
• these active pointers called hazard pointers [used to save cur and next p499 Fig 9]
• idea is not to reuse/delete/free nodes if they have hazard pointers pointing to them
Hazard Pointers…
• maintain a global array of per thread hazard pointers [each thread saving its hazard pointers in its own cache line for speed]
• use a per thread retireQ and reuseQ as per previous example
• retire node by adding to retireQ and when length >= 2*nthreads*HAZARDSPERTHREAD
      make a local copy of all hazard pointers in global array [allocate a local array]
      sort hazard pointers in local array [optional]
      for each node on retireQ, if node address doesn't match any hazard pointer in local array, transfer to reuseQ
• again need to allocate nodes from per thread reuseQ and only call new/malloc if empty
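• the scan step can be sketched as follows [std::vector queues instead of intrusive lists, fixed thread and hazard counts; illustrative only]

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct HNode { int key = 0; HNode* next = nullptr; };

const int NTHREADS = 4, HAZARDS_PER_THREAD = 2;
HNode* hazard[NTHREADS][HAZARDS_PER_THREAD];   // global array, zero-initialised

// move every retired node not protected by a hazard pointer to reuseQ
void scan(std::vector<HNode*>& retireQ, std::vector<HNode*>& reuseQ) {
    std::vector<HNode*> hp;                    // local copy of all hazard pointers
    for (int t = 0; t < NTHREADS; ++t)
        for (int h = 0; h < HAZARDS_PER_THREAD; ++h)
            if (hazard[t][h]) hp.push_back(hazard[t][h]);
    std::sort(hp.begin(), hp.end());           // optional: enables binary search
    std::vector<HNode*> keep;
    for (HNode* n : retireQ) {
        if (std::binary_search(hp.begin(), hp.end(), n))
            keep.push_back(n);                 // still hazardous: stays retired
        else
            reuseQ.push_back(n);               // no thread can reach n: reuse
    }
    retireQ.swap(keep);
}
```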
Some Results [NO memory management]
• NB: number of nodes allocated [nmalloc]
Some Results…
[graph: lockless list (no memory management) — MOps per second vs # threads (1–16); one curve per key range: key=16, 64, 256, 1024, 4096, 16384; y-axis 0 to 35,000]
Some Results…
[graph: lockless list (no memory management) — speed-up relative to a single thread vs # threads (1–16); one curve per key range: key=16, 64, 256, 1024, 4096, 16384; y-axis 0.00 to 7.00]
Transactional Memory
• locks hard to manage effectively
      pessimistic – inhibits parallelism
      priority inversion – lower priority thread pre-empted while holding a lock needed by a higher priority thread
      convoying – thread holding lock is descheduled and other threads queue up unable to progress
      deadlock – can be difficult to avoid in complex systems
• atomic primitives such as CAS operate on one word at a time resulting in complex algorithms
• MCAS [multiple compare and swap] some help
      no hardware implementation
      list of addresses, expected values and new values
      can be implemented using CAS
Transaction
• a sequence of steps executed by a single thread
• transactions must be serializable meaning that they appear to execute sequentially in a one-at-a-time order
• serializability is a kind of coarse-grained version of linearizability [atomic method calls on a given object appear to take effect instantaneously]
• correctly implemented transactions do not deadlock or livelock
• composing atomic method calls is straightforward from a programming perspective [the implementation, however, is far from straightforward]
atomic {
    x = q0.remove();    // atomic remove
    q1.add(x);          // atomic add
}
• atomic removal from one list and addition to another
Transaction…
• transactions are executed speculatively; as a transaction executes it makes tentative changes to objects [memory locations]
• if it completes without encountering a conflict it then commits [the tentative changes become permanent] OR
• it aborts [the tentative changes are discarded]
• the tentative changes are NOT visible to other transactions until the transaction commits
• each transaction maintains a read set and a write set
• each transactional load instruction adds the memory address to the read set and each transactional store adds the memory address and value to the write set
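• a toy model of these sets over a flat memory, with lazy conflict detection at commit [purely illustrative and single-threaded; not how an HTM realises this]

```cpp
#include <cassert>
#include <map>

struct Txn {
    std::map<int, int>* mem;       // shared memory: address -> value
    std::map<int, int> readSet;    // address -> value observed at load time
    std::map<int, int> writeSet;   // address -> tentative value

    int load(int addr) {
        auto w = writeSet.find(addr);
        if (w != writeSet.end()) return w->second;  // see own tentative write
        int v = (*mem)[addr];
        readSet[addr] = v;                          // record what was read
        return v;
    }

    void store(int addr, int v) { writeSet[addr] = v; }  // buffered, not yet visible

    bool commit() {   // lazy detection: abort if anything read has since changed
        for (auto& rv : readSet)
            if ((*mem)[rv.first] != rv.second) return false;
        for (auto& wv : writeSet) (*mem)[wv.first] = wv.second;
        return true;
    }
};
```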
Transaction…
• two transactions conflict if one writes to a location accessed [read or written] by another
• conflict detection can be eager or lazy
• eager detection checks every read or write to see if there is a conflicting operation in another transaction
      requires all read and write sets to be visible to other transactions
• lazy detection checks when a transaction commits
Conflict Detection Example
• in both sequences, eager detection would detect a conflict at Read X because the other transaction has already written to X
• (a) lazy conflict detection would detect a conflict in T1 because T2 commits first, implying that T1 should have used the result of the T2 Write X operation
• (b) lazy conflict detection would allow both T1 and T2 to commit because T1 commits first and its Read X need not use the result of the T2 Write X
How to resolve conflicts? • consider eager detection
• when T1 performs Read X, a conflict is detected with Write X in T2 and one transaction has to be aborted
• if T2 is aborted, T1 will conflict later with T3 [Read Y and Write Y] and another transaction will have to be aborted
• if, however, the policy had decided to abort T1, T2 and T3 could have finished with only one transaction aborted instead of two
Other issues • transactions can be nested
      nested transactions are especially useful if a nested transaction can abort without aborting its parent
• zombies are transactions that are destined to abort, but are still running
• such transactions may have an inconsistent read set which could lead to erroneous behaviour [e.g. infinite loop, index out of bounds]
• zombies can be avoided by validating the entire read set after each transactional load [expensive]
• validating explicitly checks for conflicts and will abort the transaction immediately
Other issues…
• consider the code for adding to a bounded [fixed size] transactional Q – items stored in a "circular" array items

void add(int x) {
    atomic {
        if (count == items.length)      // Q full
            retry;
        items[tail] = x;                // add item
        if (++tail == items.length)     // if necessary…
            tail = 0;                   // wrap tail index
        ++count;                        // increase count
    }
}
Other issues… • retry rolls back the enclosing transaction
• can be used to wait on multiple conditions
atomic {
    x = q0.remove();
} orElse {
    x = q1.remove();
}

• if q0 empty, retry is called, which is detected by orElse so q1.remove() is tried instead
Hardware Transactional Memory
• Transactional Memory: Architectural Support for Lock-Free Data Structures
      Maurice Herlihy and J. Eliot B. Moss
      Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993
• motivations
      lock-free – operations on a data structure will not be prevented if one process/thread stalls mid execution
      avoids common problems with mutual exclusion
      outperforms best known locking techniques
Hardware Transactional Memory
• a transaction [as defined here] is a finite sequence of machine instructions, executed by a single process/thread, satisfying the following properties:
      serializability: transactions appear to execute serially, meaning that the steps of one transaction never appear to be interleaved with the steps of another; committed transactions are never observed by different processes/threads to execute in different orders
      atomicity: each transaction makes a sequence of tentative changes to shared memory; when a transaction completes it either commits, making its changes visible to other processes/threads, or it aborts, causing its changes to be discarded
Hardware Transactional Memory
• implemented by modifying a multiprocessor cache coherency protocol [Write-Once in this example]
• tentative changes are made to a separate transaction cache
• ONLY when the transaction is committed do changes become visible atomically to other CPUs
Hardware Transactional Memory
• basic idea is that any cache coherency protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost
• instructions added to CPU instruction set for handling transactions – would be automatically generated by a compiler
• Load transactional [LT] reads a value from a shared memory location into the transaction cache [and a CPU register]
• Load transactional exclusive [LTX] reads a value of a shared memory location into the transaction cache and marks it as RESERVED [use LTX if location likely to be updated]
• Store transactional [ST] tentatively writes a value to a copy of the data in the transaction cache which does NOT become visible to other processors until the transaction successfully commits
Hardware Transactional Memory…
• Commit [COMMIT] attempts to make a transaction's tentative changes permanent and visible to other caches
  succeeds ONLY if no other transaction has written to any location in this transaction's read or write set [and no other transaction has read any location in this transaction's write set]
  on failure all tentative changes to the write set are discarded
  returns success or failure
• Abort [ABORT] discards all updates to the write set
• Validate [VALIDATE] tests the current transaction's status
  returns true if the transaction has not aborted [thus far]
  returns false if the current transaction has aborted, and discards its tentative updates
• CPU also keeps a TACTIVE flag indicating a transaction is in progress and a TSTATUS flag indicating if the transaction is active or aborted; VALIDATE returns TSTATUS
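The semantics of LT, ST, VALIDATE and COMMIT can be modelled in software, assuming a small fixed-size write set and ignoring the cache-line granularity of the real proposal; all names here are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

#define TX_MAX 8   /* toy bound on the write set */

/* single-CPU software model of the transactional instructions */
static struct { long *addr; long tentative; } write_set[TX_MAX];
static int  tx_size;
static bool tx_active, tx_status;          /* TACTIVE / TSTATUS */

static void tx_begin(void) { tx_size = 0; tx_active = true; tx_status = true; }

static long lt(long *addr) {               /* Load Transactional */
    for (int i = 0; i < tx_size; i++)      /* see our own tentative writes */
        if (write_set[i].addr == addr) return write_set[i].tentative;
    return *addr;
}

static void st(long *addr, long v) {       /* Store Transactional: buffer only */
    for (int i = 0; i < tx_size; i++)
        if (write_set[i].addr == addr) { write_set[i].tentative = v; return; }
    write_set[tx_size].addr = addr;
    write_set[tx_size].tentative = v;
    tx_size++;
}

static bool validate(void) {               /* VALIDATE returns TSTATUS */
    if (!tx_status) tx_size = 0;           /* aborted: discard tentative writes */
    return tx_status;
}

static bool commit(void) {                 /* COMMIT: publish the write set */
    if (tx_status)
        for (int i = 0; i < tx_size; i++)
            *write_set[i].addr = write_set[i].tentative;
    tx_size = 0;
    tx_active = false;
    return tx_status;
}
```

In the model, `tx_status` would be set false by a BUSY response from another CPU's cache; the real mechanism for that is on the following slides.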
Transaction Cache States
• transaction cache lines have a write-once state AND a transaction state
• a memory location cannot be in a CPU’s normal cache and transaction cache simultaneously [exclusive caches]
• transactional cache states:
  EMPTY contains no data [invalid]
  NORMAL contains committed data
  XCOMMIT [discard on commit] contains the original value read from “memory”
  XABORT [discard on abort] holds the tentative writes made to a cache line during a transaction [always paired with an XCOMMIT cache line]
• if a transaction commits successfully, the XCOMMIT lines are set to EMPTY and the XABORT lines switch to NORMAL
• this must occur atomically using appropriate hardware support so ALL changes become visible “instantaneously”
Transaction Cache States…
• the snooping mechanism returns a BUSY status if CPUx tries to transactionally read a memory location that is in another CPU’s transactional cache in the RESERVED or DIRTY state [because the other CPU has written, or intends to write, to it]
• CPUx’s TSTATUS is set false [aborted] if it receives a BUSY status when attempting to execute an LT, LTX or ST
• if the transaction aborts, the XABORT lines are set to EMPTY and the XCOMMIT lines are set to NORMAL
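The commit and abort transitions for a single transactional cache line can be sketched as follows (illustrative C; real hardware applies the transition to ALL lines atomically):

```c
#include <assert.h>

typedef enum { EMPTY, NORMAL, XCOMMIT, XABORT } tstate;

/* on COMMIT: the saved original is discarded, the tentative write wins */
static tstate on_commit(tstate s) {
    if (s == XCOMMIT) return EMPTY;   /* original value no longer needed */
    if (s == XABORT)  return NORMAL;  /* tentative write becomes committed */
    return s;
}

/* on ABORT: the tentative write is discarded, the original is restored */
static tstate on_abort(tstate s) {
    if (s == XABORT)  return EMPTY;   /* drop the tentative write */
    if (s == XCOMMIT) return NORMAL;  /* original value survives */
    return s;
}
```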
Intended use of transactional instructions
1. use LT or LTX to read a set of locations
2. use VALIDATE to check that the read set is consistent
on failure goto 1
3. use ST to modify a set of locations
4. use COMMIT to make changes permanent
on failure goto 1
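For a single memory word, this read–validate–modify–commit retry loop has the same shape as a compare-and-swap loop; the C11 sketch below is a software analogue of the pattern, not the HTM instructions themselves:

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic long counter = 0;

/* the four-step loop collapsed onto one word: read (1), compute the
   update (3), and let CAS play the role of COMMIT (4) — on failure,
   retry from the top (goto 1) */
static void add3(void) {
    long seen, want;
    do {
        seen = atomic_load(&counter);
        want = seen + 3;
    } while (!atomic_compare_exchange_weak(&counter, &seen, want));
}
```

The key difference is that HTM extends this commit-or-retry guarantee from one word to an arbitrary read/write set.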
Consider the following transaction
atomic {
    a0 += 3;    // add 3
    a1 -= 3;    // subtract 3
}
• and the compiler-generated code sequence for the transaction
tstart: ltx    r1, a0     // know a0 will be modified
        ltx    r2, a1     // know a1 will be modified
        add    r1, 3, r1  // add 3
        sub    r2, 3, r2  // subtract 3
        st     r1, a0     // tentative store
        st     r2, a1     // tentative store
        commit            // commit
        jeq    tstart     // retry on failure
• could add validate instructions to test for abort status earlier
Example Transaction
[figure: transactional cache contents, showing each line’s address and value, its write-once state (I, V, R or D) and its transaction state]
• assume a0 = 0 and a1 = 3 initially
• consider transaction state when executed on a single CPU just before COMMIT
Example Transaction…
• memory locations a0 and a1 are read using LTX and so enter the cache in the Reserved | XCOMMIT state
• a copy of a0 and a1 is also made in the Reserved | XABORT state
• the memory locations are then written and the XABORT cache lines are changed to the Dirty | XABORT state
• note that the per-CPU TACTIVE and TSTATUS flags indicate that a transaction is active and that its status is also active [rather than aborted]
Example Transaction…
• state after COMMIT executed
• transaction successfully committed
• transactional cache lines of type XCOMMIT set to EMPTY
• transactional cache lines of type XABORT set to NORMAL
• TACTIVE set to false ready for the next transaction
• the COMMIT operation updates the state of ALL transaction cache lines atomically [needs appropriate hardware]
• other CPUs can now obtain the updated contents of a0 and a1 from the transactional cache
Example Transaction…
• how are conflicts detected between concurrent transactions?
• assume CPU 0 has executed its LTX r1, a0 and CPU 1 has been granted access to the bus to execute its LTX r1, a0
• CPU 0 has loaded a0 into its transactional cache in the Reserved state [LT would have loaded it in the Valid state]
Example Transaction…
• CPU 1 tries to read a0, but since CPU 0 has a copy in its transactional cache in the RESERVED state, the transactional cache will detect the conflict and assert the BUSY signal
• when the BUSY response is received by CPU 1
  it marks the transaction as aborted by setting TACTIVE and TSTATUS to false [eager conflict detection]
  the LTX will return arbitrary data
• when CPU 1 [eventually] VALIDATEs or COMMITs the transaction it will fail
  setting all XABORT entries to EMPTY and all XCOMMIT entries to NORMAL
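The BUSY/abort rule can be sketched for two CPUs contending for one cache line (a simplified model; `snoop_busy`, `ltx` and the per-CPU arrays are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { INVALID, VALID, RESERVED, DIRTY } wstate;

/* one copy of the line per CPU, plus each CPU's TSTATUS flag */
static wstate line[2]    = { INVALID, INVALID };
static bool   tstatus[2] = { true, true };

/* snoop: BUSY if the *other* CPU holds the line RESERVED or DIRTY */
static bool snoop_busy(int requester) {
    int other = 1 - requester;
    return line[other] == RESERVED || line[other] == DIRTY;
}

/* LTX: on BUSY the requester aborts (eager conflict detection);
   otherwise it acquires the line in the RESERVED state */
static void ltx(int cpu) {
    if (snoop_busy(cpu)) {
        tstatus[cpu] = false;   /* transaction marked aborted */
        return;
    }
    line[cpu] = RESERVED;
}
```

Note that the loser aborts immediately on the conflicting access, long before it reaches VALIDATE or COMMIT.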
Summary
• you are now able to: