Parallel Programming
TRANSCRIPT
Learning and Development
Presents
OPEN TALK SERIES
A series of illuminating talks and interactions that open our minds to new ideas
and concepts; that make us look for newer or better ways of doing what we do;
or that point us to exciting things we have never done before. A range of topics on
Technology, Business, Fun and Life.
Be part of the learning experience at Aditi.
Join the talks. It's free. Free as in freedom at work, not free beer.
Speak at these events. Or bring an expert/friend to talk.
Mail LEAD with topic and availability.
Parallel Programming
Sundararajan Subramanian
Aditi Technologies
Introduction to Parallel Computing
• The challenge
– Provide the abstractions, programming
paradigms, and algorithms needed to
effectively design, implement, and maintain
applications that exploit the parallelism
provided by the underlying hardware in order
to solve modern problems.
Single-core CPU chip
[Diagram: one core occupying the whole chip]
Multi-core architectures
[Diagram: a multi-core CPU chip with Core 1, Core 2, Core 3, and Core 4]
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
[Diagram: cores 1 through 4 side by side on one chip]
The cores run in parallel
[Diagram: thread 1 through thread 4, one running on each of the four cores]
Within each core, threads are time-sliced
(just like on a uniprocessor)
[Diagram: several threads time-sliced within each of the four cores]
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline
instructions, split them into
micro-instructions, do aggressive branch
prediction, etc.
• Instruction-level parallelism enabled rapid
increases in processor speeds over the
last 15 years
Instruction-level parallelism: example
• for (int i = 0; i < 1000; i++) { a[0]++; a[0]++; }
– the second increment depends on the first (same element), so the two cannot overlap
• for (int i = 0; i < 1000; i++) { a[0]++; a[1]++; }
– the increments touch different elements, so the processor can execute them in parallel
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate
thread (Web server, database server)
• A computer game can do AI, graphics, and
physics in three separate threads
• Single-core superscalar processors cannot
fully exploit TLP
• Multi-core architectures are the next step in
processor evolution: explicitly exploiting TLP
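A minimal sketch of thread-level parallelism, assuming the game scenario from the bullets above: three coarse-grained activities, each on its own thread. The activity names and result strings are hypothetical placeholders for real AI, graphics, and physics work.

```csharp
using System;
using System.Threading;

// Three independent, coarse-grained activities, each on its own thread,
// as a game might run AI, graphics, and physics concurrently.
var results = new string[3];
var threads = new[]
{
    new Thread(() => results[0] = "AI updated"),
    new Thread(() => results[1] = "frame rendered"),
    new Thread(() => results[2] = "physics stepped"),
};
foreach (var t in threads) t.Start();
foreach (var t in threads) t.Join();   // wait for all three to finish
Console.WriteLine(string.Join(", ", results));
```

On a multi-core machine the operating system can schedule these threads onto different cores, which is exactly the TLP that multi-core architectures exploit.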
A technique complementary to multi-core:
Simultaneous multithreading
• Problem addressed: The processor pipeline can get stalled:
– Waiting for the result of a long floating point (or integer) operation
– Waiting for data to arrive from memory
– Other execution units wait unused
[Diagram: processor pipeline: BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point execution units, L1 D-Cache and D-TLB, uCode ROM, L2 Cache and Control Bus. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
on the same core
• Example: if one thread is waiting for a floating
point operation to complete, another thread can
use the integer units
Without SMT, only a single thread can run at any given time
[Diagram: pipeline occupied by Thread 1 (floating point) alone]
[Diagram: pipeline occupied by Thread 2 (integer operation) alone]
SMT processor: both threads can run concurrently
[Diagram: Thread 1 (floating point) and Thread 2 (integer operation) active in the same pipeline at the same time]
But: Can’t simultaneously use the same functional unit
[Diagram: Thread 1 and Thread 2 both targeting the single integer unit; this scenario is impossible with SMT on a single core (assuming a single integer unit)]
SMT not a “true” parallel processor
• Enables better use of the pipeline (throughput gains of up to roughly 30%)
• OS and applications perceive each
simultaneous thread as a separate
“virtual processor”
• The chip has only a single copy
of each resource
• Compare to multi-core:
each core has its own copy of resources
Multi-core:
threads can run on separate cores
[Diagram: two complete pipelines, one per core; Thread 1 runs on the first core and Thread 2 on the second]
Multi-core: threads can run on separate cores
[Diagram: Thread 3 and Thread 4 running on two further cores, each with its own pipeline]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads can run
concurrently
[Diagram: two SMT-enabled cores; Thread 1 and Thread 3 share the first core, Thread 2 and Thread 4 share the second, so all four threads run concurrently]
Designs with private L2 caches
[Diagram: two cores (CORE 0, CORE 1), each with its own private L1 and private L2 cache, connected to memory]
Both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D
[Diagram: the same per-core L1/L2 design with an added L3 cache between the L2 caches and memory]
A design with L3 caches. Example: Intel Itanium 2
Private vs shared caches?
• Advantages/disadvantages?
Private vs shared caches
• Advantages of private:
– They are closer to the core, so access is faster
– Reduces contention
• Advantages of shared:
– Threads on different cores can share the
same cache data
– More cache space available if a single (or a
few) high-performance thread runs on the
system
Parallel Architectures
• Use multiple
– Datapaths
– Memory units
– Processing units
Parallel Architectures
• SIMD
– Single instruction stream, multiple data streams
[Diagram: one Control Unit driving several Processing Units through an interconnect]
Parallel Architectures
• MIMD
– Multiple instruction streams, multiple data streams
[Diagram: several independent Processing/Control Units linked by an interconnect]
Parallelism in Visual Studio 2010
[Diagram: the Visual Studio 2010 parallel stack.
Managed library: programming models (Task Parallel Library, PLINQ), data structures, and a concurrency runtime (the CLR ThreadPool with its Task Scheduler and Resource Manager).
Native library: programming models (Parallel Pattern Library, Agents Library), data structures, and a concurrency runtime (Task Scheduler and Resource Manager).
Both stacks run on operating-system threads.
Integrated tooling: Parallel Debugger tool windows and a Concurrency Analysis profiler.]
Multithreading today
• Divide the total number of activities across n processors
• In the case of 2 processors, divide the work by 2
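A minimal sketch of this manual division of work, assuming 2 processors: the loop range is split into equal halves, one thread per half, with each half summed privately and combined at the end.

```csharp
using System;
using System.Threading;

// Split a loop of N items across `procs` threads, one chunk each,
// as the slide describes for a 2-processor machine.
const int N = 1000, procs = 2;
long total = 0;
var threads = new Thread[procs];
for (int p = 0; p < procs; p++)
{
    int start = p * (N / procs), end = start + N / procs;
    threads[p] = new Thread(() =>
    {
        long local = 0;                       // accumulate privately...
        for (int i = start; i < end; i++) local += i;
        Interlocked.Add(ref total, local);    // ...then combine once, safely
    });
    threads[p].Start();
}
foreach (var t in threads) t.Join();
Console.WriteLine(total);  // 0+1+...+999 = 499500
```

The per-thread local sum avoids contention on the shared `total`; only the final combine needs synchronization.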
User Mode Scheduler
[Diagram: the program thread queues work items into the CLR thread pool's global queue; worker threads 1 through p dequeue and run them]
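The global-queue model above can be sketched with the thread pool directly. This is a hedged illustration, not the talk's demo: four trivial work items are queued, and a CountdownEvent tells us when the pool's worker threads have run them all.

```csharp
using System;
using System.Threading;

// Queue work items onto the CLR thread pool's global queue;
// worker threads pick them up and run them.
var allDone = new CountdownEvent(4);
for (int i = 0; i < 4; i++)
    ThreadPool.QueueUserWorkItem(_ => allDone.Signal());
allDone.Wait();   // blocks until every queued item has run
Console.WriteLine("all work items completed");
```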
User Mode Scheduler for Tasks
[Diagram: the work-stealing CLR thread pool; the program thread pushes tasks (Task 1 through Task 6) onto a global queue, each worker thread also keeps a local queue, and an idle worker steals tasks from the local queues of busy workers]
DEMO
Task-based Programming Summary
• Starting: ThreadPool: ThreadPool.QueueUserWorkItem(…); System.Threading.Tasks: Task.Factory.StartNew(…);
• Parent/Child: var p = new Task(() => { var t = new Task(…); });
• Tasks with results: Task<int> f = new Task<int>(() => C()); … int result = f.Result;
• Continue/Wait/Cancel: Task t = …; Task p = t.ContinueWith(…); t.Wait(2000); t.Cancel();
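The summary snippets above can be combined into one runnable sketch: start a task, block on its result, and chain a continuation. The computation (6 * 7) is a placeholder for real work.

```csharp
using System;
using System.Threading.Tasks;

// Start a task that produces a value.
Task<int> f = Task.Factory.StartNew(() => 6 * 7);

// Reading Result blocks until the task completes.
int result = f.Result;

// Chain a continuation that runs after f finishes.
Task done = f.ContinueWith(t => Console.WriteLine("answer: " + t.Result));
done.Wait();
```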
Coordination Data Structures (1 of 3)
Concurrent Collections
• BlockingCollection<T>
• ConcurrentBag<T>
• ConcurrentDictionary<TKey,TValue>
• ConcurrentLinkedList<T>
• ConcurrentQueue<T>
• ConcurrentStack<T>
• IProducerConsumerCollection<T>
• Partitioner, Partitioner<T>, OrderablePartitioner<T>
[Diagram: several producers (P) adding to a BlockingCollection<T> and several consumers (C) taking from it; producers block if the collection is full, consumers block if it is empty]
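The block-if-full / block-if-empty behavior can be sketched with a bounded BlockingCollection<T> and a single producer/consumer pair (the item values are arbitrary):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Bounded collection: Add blocks when 4 items are already queued.
var queue = new BlockingCollection<int>(boundedCapacity: 4);

var producer = Task.Factory.StartNew(() =>
{
    for (int i = 1; i <= 10; i++) queue.Add(i);  // blocks if full
    queue.CompleteAdding();                      // signal "no more items"
});

int sum = 0;
foreach (int item in queue.GetConsumingEnumerable())  // blocks if empty
    sum += item;                                      // ends after CompleteAdding

producer.Wait();
Console.WriteLine(sum);  // 1+2+...+10 = 55
```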
Coordination Data Structures (2 of 3)
Synchronization Primitives
• Barrier
• CountdownEvent
• ManualResetEventSlim
• SemaphoreSlim
• SpinLock
• SpinWait
[Diagram: threads looping through Barrier phases, with a postPhaseAction run after each phase, and a CountdownEvent being signaled down to zero]
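A minimal sketch of the Barrier loop and CountdownEvent from the diagram, assuming three participants and two phases (the counts are illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Barrier: 3 participants; the postPhaseAction runs once per phase,
// after all participants have arrived (no other thread runs then,
// so the increment needs no locking).
int phasesCompleted = 0;
var barrier = new Barrier(3, b => phasesCompleted++);

var workers = new Task[3];
for (int w = 0; w < 3; w++)
    workers[w] = Task.Factory.StartNew(() =>
    {
        barrier.SignalAndWait();  // end of phase 1
        barrier.SignalAndWait();  // end of phase 2
    });
Task.WaitAll(workers);
Console.WriteLine(phasesCompleted);  // 2

// CountdownEvent: Wait returns once 3 signals have arrived.
var countdown = new CountdownEvent(3);
for (int i = 0; i < 3; i++)
    Task.Factory.StartNew(() => countdown.Signal());
countdown.Wait();
```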
Coordination Data Structures (3 of 3)
Initialization Primitives
• Lazy<T>, LazyVariable<T>, LazyInitializer
• ThreadLocal<T>
Cancellation Primitives
• CancellationToken
• CancellationTokenSource
• ICancelableOperation
[Diagram: a Cancellation Source (CancellationTokenSource) passes its Cancellation Token across thread boundaries to Foo(…, CancellationToken ct), Bar(…, CancellationToken ct), ManualResetEventSlim.Wait( ct ), and MyMethod( )]
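The source/token split in the diagram can be sketched as follows: the owner holds the CancellationTokenSource, while the worker only sees the CancellationToken and polls it cooperatively (the worker body is a hypothetical stand-in for real work):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// The owner creates the source; workers receive only the token.
var cts = new CancellationTokenSource();
CancellationToken ct = cts.Token;

var worker = Task.Factory.StartNew(() =>
{
    while (true)
    {
        if (ct.IsCancellationRequested) return;  // cooperative exit
        Thread.Sleep(10);                        // simulated work
    }
});

cts.Cancel();      // request cancellation via the source
worker.Wait();     // the worker observes the token and returns
Console.WriteLine("worker stopped");
```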