Concurrent Data Structures in Architectures with
Limited Shared Memory Support
Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden
Yiannis Nikolakopoulos [email protected]
Concurrent Data Structures
• Parallel/concurrent programming:
– Share data among threads/processes sharing a uniform address space (shared memory)
• Inter-process/thread communication and synchronization
– Both a tool and a goal
Concurrent Data Structures: Implementations
• Coarse-grained locking
– Easy but slow...
• Fine-grained locking
– Fast/scalable, but error-prone and susceptible to deadlocks
• Non-blocking
– Atomic hardware primitives (e.g. TAS, CAS)
– Good progress guarantees (lock-/wait-freedom)
– Scalable
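A test-and-set (TAS) lock, the simplest use of these primitives, can be sketched with C11 atomics. This is a generic single-address-space sketch, not the SCC-specific TAS register discussed later:

```c
#include <stdatomic.h>

/* Minimal test-and-set spinlock built on a C11 atomic_flag.
 * tas_lock_t and the tas_* names are illustrative, not from the talk. */
typedef struct { atomic_flag flag; } tas_lock_t;

static void tas_init(tas_lock_t *l) { atomic_flag_clear(&l->flag); }

static void tas_acquire(tas_lock_t *l) {
    /* spin until the previous value was clear, i.e. the lock was free */
    while (atomic_flag_test_and_set(&l->flag)) { /* busy-wait */ }
}

static void tas_release(tas_lock_t *l) { atomic_flag_clear(&l->flag); }
```

Acquire succeeds exactly when the atomic exchange observes the flag clear, which is why a single hardware TAS register per core is enough to build mutual exclusion.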
What’s happening in hardware?
• Multi-cores → many-cores
– “Cache coherency wall” [Kumar et al 2011]
– Shared address space will not scale
– Universal atomic primitives (CAS, LL/SC) harder to implement
• Shared memory → message passing
[Diagram: tile with IA cores, caches, and shared local memory]
• Networks on chip (NoC)
• Short distance between cores
• Message passing model support
• Shared memory support
• Eliminated cache coherency
• Limited support for synchronization primitives
Can we have data structures that are fast and scalable, with good progress guarantees?
[Diagram: tile with IA cores, caches, and shared local memory]
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Single-chip Cloud Computer (SCC)
• Experimental processor by Intel
• 48 independent x86 cores arranged on 24 tiles
• NoC connects all tiles
• TestAndSet register per core
SCC: Architecture Overview
• Memory controllers: to private & shared main memory
• Message Passing Buffer (MPB): 16 KB per tile
Programming Challenges in SCC
• Message passing, but…
– MPB too small for large data transfers
– Data replication is difficult
• No universal atomic primitives (CAS); no wait-free implementations [Herlihy 91]
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Concurrent FIFO Queues
• Main idea:
– Data are stored in shared off-chip memory
– Message passing for communication/coordination
• 2 design methodologies:
– Lock-based synchronization (2-lock Queue)
– Message-passing-based synchronization (MP-Queue, MP-Acks)
2-lock Queue
• Array-based, in shared off-chip memory (SHM)
• Head/Tail pointers in MPBs
• 1 lock for each pointer [Michael & Scott 96]
• TAS-based locks on 2 cores
2-lock Queue: “Traditional” Enqueue Algorithm
• Acquire lock
• Read & update Tail pointer (MPB)
• Add data (SHM)
• Release lock
2-lock Queue: Optimized Enqueue Algorithm
• Acquire lock
• Read & update Tail pointer (MPB)
• Release lock
• Add data to node (SHM)
• Set memory flag to dirty
Why? No cache coherency!
2-lock Queue: Dequeue Algorithm
• Acquire lock
• Read & update Head pointer
• Release lock
• Check flag
• Read node data
What about progress?
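The optimized enqueue/dequeue pair might be sketched in C as follows, assuming a single address space with hypothetical names; on the SCC the pointers would sit in the MPB, the nodes in off-chip SHM, and the locks in TAS registers. The empty/full checks are simplified for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 1024

typedef struct {
    int         value;
    atomic_bool dirty;   /* raised after the data is written (no coherency!) */
} node_t;

typedef struct {
    node_t      nodes[QSIZE];        /* would live in off-chip SHM */
    size_t      head, tail;          /* would live in the MPB */
    atomic_bool head_lock, tail_lock; /* stand-ins for TAS-based locks */
} queue_t;

static void lock(atomic_bool *l)   { while (atomic_exchange(l, true)) ; }
static void unlock(atomic_bool *l) { atomic_store(l, false); }

/* Optimized enqueue: reserve a slot under the lock, then write the
 * data and raise the dirty flag outside the critical section. */
static bool enqueue(queue_t *q, int v) {
    lock(&q->tail_lock);
    if (q->tail - q->head == QSIZE) { unlock(&q->tail_lock); return false; }
    size_t slot = q->tail++ % QSIZE;
    unlock(&q->tail_lock);
    q->nodes[slot].value = v;
    atomic_store(&q->nodes[slot].dirty, true);
    return true;
}

/* Dequeue: advance Head under the lock, then wait for the slot's
 * dirty flag before reading the data — the "what about progress?"
 * question on the slide refers to this spin. */
static bool dequeue(queue_t *q, int *out) {
    lock(&q->head_lock);
    if (q->head == q->tail) { unlock(&q->head_lock); return false; }
    size_t slot = q->head++ % QSIZE;
    unlock(&q->head_lock);
    while (!atomic_load(&q->nodes[slot].dirty)) ;  /* spin until flagged */
    *out = q->nodes[slot].value;
    atomic_store(&q->nodes[slot].dirty, false);
    return true;
}
```

Moving the data write outside the critical section shortens lock hold times, but it means a dequeuer can observe an advanced Head before the matching data is visible, hence the per-node dirty flag.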
2-lock Queue: Implementation
[Diagram: Head/Tail pointers in the MPB, data nodes in SHM]
Locks? On which tile(s)?
Message Passing-based Queue
• Data nodes in SHM
• Access coordinated by a server node that keeps the Head/Tail pointers
• Enqueuers/dequeuers request access through dedicated slots in the MPB
• Successfully enqueued data are flagged with a dirty bit
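The server's coordination loop might look roughly like this (all names are hypothetical; real SCC code would poll per-core MPB slots and cross-tile memory rather than a plain array):

```c
#include <stdbool.h>
#include <stddef.h>

#define NCORES 48
#define QSIZE  1024

typedef enum { REQ_NONE, REQ_ENQ, REQ_DEQ } req_t;

typedef struct {
    req_t  req;   /* request written into the core's dedicated MPB slot */
    size_t slot;  /* server's reply: which SHM node to use */
    bool   done;
} mpb_slot_t;

typedef struct {
    mpb_slot_t slots[NCORES]; /* one dedicated slot per core */
    size_t     head, tail;    /* kept privately by the server */
} server_t;

/* One polling pass of the server: serve each core's pending request by
 * handing out a Tail slot (enqueue) or a Head slot (dequeue). The
 * client then writes/reads the SHM node and its dirty bit itself. */
static void server_poll(server_t *s) {
    for (int c = 0; c < NCORES; c++) {
        mpb_slot_t *m = &s->slots[c];
        if (m->req == REQ_ENQ) {
            m->slot = s->tail++ % QSIZE;  /* client writes data + dirty bit */
            m->req = REQ_NONE; m->done = true;
        } else if (m->req == REQ_DEQ && s->head != s->tail) {
            m->slot = s->head++ % QSIZE;  /* client spins on the dirty bit */
            m->req = REQ_NONE; m->done = true;
        }
    }
}
```

Because only the server touches Head and Tail, no locks are needed on the pointers; the cost is that a dequeuer may still spin on a dirty bit that an enqueuer has yet to set, which is the blocking scenario the next slide points out.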
MP-Queue
[Diagram: enqueuers send ENQ and receive the Tail slot; dequeuers send DEQ and receive the Head slot, then spin while the enqueuer adds its data]
• What if an enqueue fails and the node is never flagged?
• “Pairwise blocking”: only 1 dequeue blocks
Adding Acknowledgements
• No more flags! Enqueue sends ACK when done
• Server maintains in SHM a private queue of pointers
• On ACK: server adds the data location to its private queue
• On Dequeue: server returns only ACKed locations
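One way to sketch the server-side ACK handling (again with hypothetical names): a location is moved into the server's private pointer queue only when its ACK arrives, and dequeuers are served only from that queue:

```c
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 1024

/* Server state for the ACK-based design. The `acked` ring is the
 * server's private queue of pointers, kept in SHM. */
typedef struct {
    size_t acked[QSIZE];       /* ACKed SHM locations, FIFO order of ACKs */
    size_t ack_head, ack_tail;
    size_t next_slot;          /* Tail: next SHM slot handed to an enqueuer */
} ack_server_t;

/* Enqueue request: hand out a fresh slot; the client writes its data
 * there and sends an ACK when done. */
static size_t on_enqueue(ack_server_t *s) { return s->next_slot++ % QSIZE; }

/* ACK: the slot is now complete and safe to hand to a dequeuer. */
static void on_ack(ack_server_t *s, size_t slot) {
    s->acked[s->ack_tail++ % QSIZE] = slot;
}

/* Dequeue request: only ACKed locations are ever returned, so no
 * dequeuer can block on an unfinished enqueue. */
static bool on_dequeue(ack_server_t *s, size_t *slot) {
    if (s->ack_head == s->ack_tail) return false; /* nothing ACKed yet */
    *slot = s->acked[s->ack_head++ % QSIZE];
    return true;
}
```

A slot that is handed out but never ACKed simply never enters the private queue, which is how this design removes the pairwise blocking of the plain MP-Queue.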
MP-Acks
[Diagram: enqueuers send ENQ, receive the Tail slot, and reply with an ACK; dequeuers send DEQ and receive the Head slot]
• No blocking between enqueues/dequeues
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Evaluation
Benchmark:
• Each core performs Enq/Deq at random
• High/low contention
• Performance? Scalability? Is it the same for all cores?
Measures
• Throughput: data structure operations completed per time unit [Cederman et al 2013]
• Fairness: operations completed by core i relative to the average operations per core
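These measures can be computed directly from per-core operation counts. The fairness function below takes each core's count over the average per-core count, which is one plausible reading of the slide's fragments, not necessarily the paper's exact definition:

```c
#include <stddef.h>

/* Throughput: total operations completed per time unit. */
static double throughput(const long ops[], size_t n, double seconds) {
    long total = 0;
    for (size_t i = 0; i < n; i++) total += ops[i];
    return (double)total / seconds;
}

/* Per-core fairness: operations by core i divided by the average
 * operations per core; 1.0 means the core got a perfectly fair share. */
static double fairness(const long ops[], size_t n, size_t i) {
    long total = 0;
    for (size_t j = 0; j < n; j++) total += ops[j];
    double avg = (double)total / (double)n;
    return (double)ops[i] / avg;
}
```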
Conclusion
• Lock-based queue
– High throughput
– Less fair
– Sensitive to lock locations, NoC performance
• MP-based queues
– Lower throughput
– Fairer
– Better liveness properties
– Promising scalability
Experimental Setup
• 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
• Randomized Enq/Deq operations
• High/low contention
• One thread per core
• 600 ms per execution
• Averaged over 12 runs
Concurrent FIFO Queues
• Typical 2-lock queue [Michael & Scott 96]