Parallel Programming
TRANSCRIPT
Learning and Development
Presents
OPEN TALK SERIES
A series of illuminating talks and interactions that open our minds to new ideas
and concepts; that make us look for newer or better ways of doing what we do;
or that point us to exciting things we have never done before. A range of topics on
Technology, Business, Fun and Life.
Be part of the learning experience at Aditi.
Join the talks. It's free. Free as in freedom at work, not free beer.
Speak at these events. Or bring an expert/friend to talk.
Mail LEAD with topic and availability.
Parallel Programming
Sundararajan Subramanian
Aditi Technologies
Introduction to Parallel Computing
• The challenge
– Provide the abstractions, programming
paradigms, and algorithms needed to
effectively design, implement, and maintain
applications that exploit the parallelism
provided by the underlying hardware in order
to solve modern problems.
Single-core CPU chip
[Diagram: one core occupying the whole chip]
Multi-core architectures
[Diagram: a multi-core CPU chip with Core 1, Core 2, Core 3, and Core 4]
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
[Diagram: cores 1 through 4 side by side on one chip]
The cores run in parallel
[Diagram: thread 1 through thread 4, one running on each of the four cores]
Within each core, threads are time-sliced
(just like on a uniprocessor)
[Diagram: several threads time-sliced within each of the four cores]
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline
instructions, split them into
micro-instructions, do aggressive branch
prediction, etc.
• Instruction-level parallelism enabled rapid
increases in processor speeds over the
last 15 years
Instruction-level parallelism: example
• for (int i = 0; i < 1000; i++) { a[0]++; a[0]++; }
– the second increment depends on the first (same element), so the two cannot overlap
• for (int i = 0; i < 1000; i++) { a[0]++; a[1]++; }
– the increments touch different elements, so the processor can execute them in parallel
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate
thread (Web server, database server)
• A computer game can do AI, graphics, and
physics in three separate threads
• Single-core superscalar processors cannot
fully exploit TLP
• Multi-core architectures are the next step in
processor evolution: explicitly exploiting TLP
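A minimal sketch of thread-level parallelism, assuming the game scenario from the bullets above: three coarse-grained activities, each on its own thread. The activity names and result strings are hypothetical placeholders for real AI, graphics, and physics work.

```csharp
using System;
using System.Threading;

// Three independent, coarse-grained activities, each on its own thread,
// as a game might run AI, graphics, and physics concurrently.
var results = new string[3];
var threads = new[]
{
    new Thread(() => results[0] = "AI updated"),
    new Thread(() => results[1] = "frame rendered"),
    new Thread(() => results[2] = "physics stepped"),
};
foreach (var t in threads) t.Start();
foreach (var t in threads) t.Join();   // wait for all three to finish
Console.WriteLine(string.Join(", ", results));
```

On a multi-core machine the operating system can schedule these threads onto different cores, which is exactly the TLP that multi-core architectures exploit.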
A technique complementary to multi-core:
Simultaneous multithreading
• Problem addressed: The processor pipeline can get stalled:
– Waiting for the result of a long floating point (or integer) operation
– Waiting for data to arrive from memory
– Other execution units wait unused
[Diagram: processor pipeline: BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, Uop queues, Schedulers, Integer and Floating Point execution units, L1 D-Cache and D-TLB, uCode ROM, L2 Cache and Control Bus. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
on the same core
• Example: if one thread is waiting for a floating
point operation to complete, another thread can
use the integer units
Without SMT, only a single thread can run at any given time
[Diagram: pipeline occupied by Thread 1 (floating point) alone]
[Diagram: pipeline occupied by Thread 2 (integer operation) alone]
SMT processor: both threads can run concurrently
[Diagram: Thread 1 (floating point) and Thread 2 (integer operation) active in the same pipeline at the same time]
But: Can’t simultaneously use the same functional unit
[Diagram: Thread 1 and Thread 2 both targeting the single integer unit; this scenario is impossible with SMT on a single core (assuming a single integer unit)]
SMT not a “true” parallel processor
• Enables better use of the pipeline (throughput gains of up to roughly 30%)
• OS and applications perceive each
simultaneous thread as a separate
“virtual processor”
• The chip has only a single copy
of each resource
• Compare to multi-core:
each core has its own copy of resources
Multi-core:
threads can run on separate cores
[Diagram: two complete pipelines, one per core; Thread 1 runs on the first core and Thread 2 on the second]
Multi-core: threads can run on separate cores
[Diagram: Thread 3 and Thread 4 running on two further cores, each with its own pipeline]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads can run
concurrently
[Diagram: two SMT-enabled cores; Thread 1 and Thread 3 share the first core, Thread 2 and Thread 4 share the second, so all four threads run concurrently]
Designs with private L2 caches
[Diagram: two cores (CORE 0, CORE 1), each with its own private L1 and private L2 cache, connected to memory]
Both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D
[Diagram: the same per-core L1/L2 design with an added L3 cache between the L2 caches and memory]
A design with L3 caches. Example: Intel Itanium 2
Private vs shared caches?
• Advantages/disadvantages?
Private vs shared caches
• Advantages of private:
– They are closer to the core, so access is faster
– Reduces contention
• Advantages of shared:
– Threads on different cores can share the
same cache data
– More cache space available if a single (or a
few) high-performance thread runs on the
system
Parallel Architectures
• Use multiple
– Datapaths
– Memory units
– Processing units
Parallel Architectures
• SIMD
– Single instruction stream, multiple data streams
[Diagram: one Control Unit driving several Processing Units through an interconnect]
Parallel Architectures
• MIMD
– Multiple instruction streams, multiple data streams
[Diagram: several independent Processing/Control Units linked by an interconnect]
Parallelism in Visual Studio 2010
[Diagram: the Visual Studio 2010 parallel stack.
Managed library: programming models (Task Parallel Library, PLINQ), data structures, and a concurrency runtime (the CLR ThreadPool with its Task Scheduler and Resource Manager).
Native library: programming models (Parallel Pattern Library, Agents Library), data structures, and a concurrency runtime (Task Scheduler and Resource Manager).
Both stacks run on operating-system threads.
Integrated tooling: Parallel Debugger tool windows and a Concurrency Analysis profiler.]
Multithreading today
• Divide the total number of activities across n processors
• In the case of 2 processors, divide the work by 2
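A minimal sketch of this manual division of work, assuming 2 processors: the loop range is split into equal halves, one thread per half, with each half summed privately and combined at the end.

```csharp
using System;
using System.Threading;

// Split a loop of N items across `procs` threads, one chunk each,
// as the slide describes for a 2-processor machine.
const int N = 1000, procs = 2;
long total = 0;
var threads = new Thread[procs];
for (int p = 0; p < procs; p++)
{
    int start = p * (N / procs), end = start + N / procs;
    threads[p] = new Thread(() =>
    {
        long local = 0;                       // accumulate privately...
        for (int i = start; i < end; i++) local += i;
        Interlocked.Add(ref total, local);    // ...then combine once, safely
    });
    threads[p].Start();
}
foreach (var t in threads) t.Join();
Console.WriteLine(total);  // 0+1+...+999 = 499500
```

The per-thread local sum avoids contention on the shared `total`; only the final combine needs synchronization.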
User Mode Scheduler
[Diagram: the program thread queues work items into the CLR thread pool's global queue; worker threads 1 through p dequeue and run them]
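The global-queue model above can be sketched with the thread pool directly. This is a hedged illustration, not the talk's demo: four trivial work items are queued, and a CountdownEvent tells us when the pool's worker threads have run them all.

```csharp
using System;
using System.Threading;

// Queue work items onto the CLR thread pool's global queue;
// worker threads pick them up and run them.
var allDone = new CountdownEvent(4);
for (int i = 0; i < 4; i++)
    ThreadPool.QueueUserWorkItem(_ => allDone.Signal());
allDone.Wait();   // blocks until every queued item has run
Console.WriteLine("all work items completed");
```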
User Mode Scheduler for Tasks
[Diagram: the work-stealing CLR thread pool; the program thread pushes tasks (Task 1 through Task 6) onto a global queue, each worker thread also keeps a local queue, and an idle worker steals tasks from the local queues of busy workers]
DEMO
Task-based Programming Summary
• Starting: ThreadPool: ThreadPool.QueueUserWorkItem(…); System.Threading.Tasks: Task.Factory.StartNew(…);
• Parent/Child: var p = new Task(() => { var t = new Task(…); });
• Tasks with results: Task<int> f = new Task<int>(() => C()); … int result = f.Result;
• Continue/Wait/Cancel: Task t = …; Task p = t.ContinueWith(…); t.Wait(2000); t.Cancel();
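The summary snippets above can be combined into one runnable sketch: start a task, block on its result, and chain a continuation. The computation (6 * 7) is a placeholder for real work.

```csharp
using System;
using System.Threading.Tasks;

// Start a task that produces a value.
Task<int> f = Task.Factory.StartNew(() => 6 * 7);

// Reading Result blocks until the task completes.
int result = f.Result;

// Chain a continuation that runs after f finishes.
Task done = f.ContinueWith(t => Console.WriteLine("answer: " + t.Result));
done.Wait();
```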
Coordination Data Structures (1 of 3)
Concurrent Collections
• BlockingCollection<T>
• ConcurrentBag<T>
• ConcurrentDictionary<TKey,TValue>
• ConcurrentLinkedList<T>
• ConcurrentQueue<T>
• ConcurrentStack<T>
• IProducerConsumerCollection<T>
• Partitioner, Partitioner<T>, OrderablePartitioner<T>
[Diagram: several producers (P) adding to a BlockingCollection<T> and several consumers (C) taking from it; producers block if the collection is full, consumers block if it is empty]
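The block-if-full / block-if-empty behavior can be sketched with a bounded BlockingCollection<T> and a single producer/consumer pair (the item values are arbitrary):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Bounded collection: Add blocks when 4 items are already queued.
var queue = new BlockingCollection<int>(boundedCapacity: 4);

var producer = Task.Factory.StartNew(() =>
{
    for (int i = 1; i <= 10; i++) queue.Add(i);  // blocks if full
    queue.CompleteAdding();                      // signal "no more items"
});

int sum = 0;
foreach (int item in queue.GetConsumingEnumerable())  // blocks if empty
    sum += item;                                      // ends after CompleteAdding

producer.Wait();
Console.WriteLine(sum);  // 1+2+...+10 = 55
```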
Coordination Data Structures (2 of 3)
Synchronization Primitives
• Barrier
• CountdownEvent
• ManualResetEventSlim
• SemaphoreSlim
• SpinLock
• SpinWait
[Diagram: threads looping through Barrier phases, with a postPhaseAction run after each phase, and a CountdownEvent being signaled down to zero]
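A minimal sketch of the Barrier loop and CountdownEvent from the diagram, assuming three participants and two phases (the counts are illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Barrier: 3 participants; the postPhaseAction runs once per phase,
// after all participants have arrived (no other thread runs then,
// so the increment needs no locking).
int phasesCompleted = 0;
var barrier = new Barrier(3, b => phasesCompleted++);

var workers = new Task[3];
for (int w = 0; w < 3; w++)
    workers[w] = Task.Factory.StartNew(() =>
    {
        barrier.SignalAndWait();  // end of phase 1
        barrier.SignalAndWait();  // end of phase 2
    });
Task.WaitAll(workers);
Console.WriteLine(phasesCompleted);  // 2

// CountdownEvent: Wait returns once 3 signals have arrived.
var countdown = new CountdownEvent(3);
for (int i = 0; i < 3; i++)
    Task.Factory.StartNew(() => countdown.Signal());
countdown.Wait();
```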
Coordination Data Structures (3 of 3)
Initialization Primitives
• Lazy<T>, LazyVariable<T>, LazyInitializer
• ThreadLocal<T>
Cancellation Primitives
• CancellationToken
• CancellationTokenSource
• ICancelableOperation
[Diagram: a Cancellation Source (CancellationTokenSource) passes its Cancellation Token across thread boundaries to Foo(…, CancellationToken ct), Bar(…, CancellationToken ct), ManualResetEventSlim.Wait( ct ), and MyMethod( )]
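The source/token split in the diagram can be sketched as follows: the owner holds the CancellationTokenSource, while the worker only sees the CancellationToken and polls it cooperatively (the worker body is a hypothetical stand-in for real work):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// The owner creates the source; workers receive only the token.
var cts = new CancellationTokenSource();
CancellationToken ct = cts.Token;

var worker = Task.Factory.StartNew(() =>
{
    while (true)
    {
        if (ct.IsCancellationRequested) return;  // cooperative exit
        Thread.Sleep(10);                        // simulated work
    }
});

cts.Cancel();      // request cancellation via the source
worker.Wait();     // the worker observes the token and returns
Console.WriteLine("worker stopped");
```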