Parallel Computing, 5 cr, 3621528
Fall 2012
University of Eastern Finland, Computer Science
Simo.Juvaste@uef.fi
http://cs.uef.fi/pages/sjuva/parallel.html

Sitting placements at the first lectures:
1) Sit within reach of someone (several) else.
2) The whole class must be connected.

Course contents (preliminary)

• Chapter 1: An Introduction to Parallel Computing
  • What? Why? How?
• Chapter 2: PRAM
  • A simple model of parallelism
• Chapter 3: Parallel algorithms (in PRAM notation)
  • Basic algorithms, e.g., counting, prefix, sorting, etc.
• Chapter 4: Taking the real world into account
  • Network delay models, memory access models
• Chapter 5: Message passing programming (with MPI)
  • Real parallel programming work.
• Chapter 6: Other stuff
  • OpenMP, Fortran 90, HPF, functional, data flow. GPU programming, CUDA/OpenCL.
  • Everyday (especially in a few years) parallel (and concurrent) programming. Processes, IPC, shared memory, pthreads, Java threads.

Chapter 1
An Introduction to Parallel Computing

What? Why? How?
Some key concepts
Pros, cons
Other similar terms
Examples
An animal experiment
Design issues



What is Parallel Computing?

⇒ Use several computers to solve a single computational task in parallel!
  • Two is better than one.
  • One thousand is better than two…
  • Think human (manual) work.
⇒ The single task has to be divided into several parts.
  • Some tasks are easy to divide, some are not.
⇒ The cooperating computers have to be able to communicate.
  • One task, one solution.
  • There are many ways to communicate.
⇒ The participating "computers" do not need to be complete!
  • Processor, memory, communication medium (processing unit).
  • Monitors do not process.
  • The whole parallel computer still needs to have some I/O, etc.

Example 1-1: A human example: manual sorting of papers

• Input: a bunch of A4 papers, each having a name.
• Input size: 10, 100, 1000, or 10000 papers (1 mm, 1 cm, 10 cm, 1 m).
• Task: sort the bunch (alphabetically).

One (quick) person alone: [1st exercise in Data Structures and Algorithms]
• 10 papers: 30 s [3 s/paper]
  • method insignificant
• 100 papers: 8 min [5 s/paper]
  • divide into 10 (5-27) substacks according to the first letter, sort substacks, combine
• 1000 papers: 2 h [7 s/paper]
  • divide into 10 substacks according to the first letter, apply the previous 100-sort recursively.
• 10000 papers: 25 h [9 s/paper]
  • divide into 10 substacks according to the first letter, apply the previous 1000-sort recursively.
• You might want some help...

Parallel manual paper sorting:

• 10, 100, 1000, 10000 helpers!
• Work organization is more difficult than in a single-person sort.
• Exercise 1.
⇒ The important question:
• Will 10 helpers speed up the work 10 times?
  • 10 papers task: no (one helper can help a little).
  • 10000 papers task: yes (at least almost 10 times).
• Will 10000 helpers speed up the work 10000 times?
  • 10 papers task: no.
  • 10000 papers task: no (but we can exploit more than 10 helpers).
  • 100,000,000 papers task: yes (almost).

⇒ What is the optimal number of helpers for each number of papers?
• What is the goal? What does optimal mean?
  • Minimal wall-clock time?
  • Efficiency (minimal person work hours, i.e., euros)?
  • ?

Little practice

Rules
• Physical messages, writing on a piece of paper.
• A written message may include instructions, addresses, data.
• Connections to neighbours without standing up.
• Sending a message (synchronous communication):
  • Ask the neighbour to receive, wait until he/she is ready.
  • Hand out the message, say "here you are".
• Receiving a message:
  • Agree to receive.
  • Receive, say "thank you".
• You can see and communicate only with your neighbours.
• Local operations are unlimited.

Tasks
• Max, count, search (single value, pattern), sum, sort, ...

Algorithm?
• For the above rules?
• For different rules?
• Without rules (but no magic)?

Physical conditions/restrictions (i.e., challenges):
• Open hall, no restrictions.
• Coordination: loudspeakers (for leaders), person-to-person communication, guidance painted on the floor, rehearsal, etc.
• "Cluster" of two door-connected halls?
• Sitting here, no person movement allowed.
• Paper delivery only to neighbours vs. to anyone?
• Only one paper at a time vs. a bunch at a time.
• How to benefit from a blackboard or an electronic message board?
• How to benefit from shouting?
• Without sight contact to neighbours.
• Load balancing (fast and slow workers).
• Fault tolerance (temporary, permanent).

Why is parallel computing needed?

⇒ Why are computers needed?
• Because computers can compute (calculate) fast and they can have huge memory.

Why is an i7 at 3.50 GHz (2003 slides: 3 GHz) not enough???
• Computing power will ~double every ~two years. ["Moore"]
• Intel/AMD 4/6/8-core processors at 2-4 GHz are very cheap (from 100 e)!
• 20 years ago governments would have paid millions for a 2012 PC.

What else do we need?
• Humans are greedy and impatient...
• Some tasks are too demanding and urgent to be computed by one processor only.
• Some tasks are more valuable the more computing power we can use on them.

What is so demanding and urgent?
• Word processing?
• WWW surfing?
• Bank / stock exchange?
• eCommerce?
• Gaming?
• Real-world simulation!
  • Matter consists of very tiny particles!
  • Every visible piece consists of very many particles.
  • We cannot simulate every (sub)atomic particle of a large (visible) object!
⇒ But: the smaller the particles we can simulate, the more accurate a simulation we have!
• Smaller particles ⇒ more particles ⇒ more calculations to do!
⇒ An unbounded amount of calculations!

Why do we want to simulate the real world?
• "Test" a piece of equipment without building it.
• Prediction of natural phenomena.
• Prediction of consequences of changes.
• "See" artificial things.
• Optimizing structures or models.

Example: weather forecasts
• History data, constants, measurements.
• Simulation of the future movement of air particles.
• Simulation of physical changes (temperature, pressure, humidity, velocity, etc.) of air in the atmosphere.
• Huge amounts of molecules move and interact quickly for several days.
• An incomprehensible amount of calculations.

• Resolution reduction:
  • 50×50×1 km (×5 min) block of air as 1 entity.
  • Penalty: accuracy and reliability are reduced.
• Forecast as far into the future as possible.
  • Unfortunately: inaccuracies multiply.
⇒ A more powerful computer or more time immediately yields more accurate forecasts (and longer forecasts).
⇒ (Reliable) weather forecasts are very valuable!
• In real forecasts, the models exploit grid-wide differential equations instead of local simulation...

Block size (km), height 0.5 km | Gflop/s needed for "real"-time simulation (2-minute steps) | Gflop/s needed for 5 days in 2 hours
1                              | 1 804 492                                                  | 108 269 544
32                             | 1 762                                                      | 105 732
1024                           | 1.7                                                        | 103

• A late forecast is worthless.
• Finnish Meteorological Institute: (about)
  • 7.5×7.5×0.3 km (×6 min) Canada .. Ural, 3-10 days
  • 2.5×2.5×0.? km (×? min) Sweden .. Finland
  • (44 km -> 7.5 km in 14 years)
• Cray XT5m, 656 × 6-core Opteron, 35 TFLOPS (theoretical)

⇒ Conclusion
• We want as powerful a computer as possible!
• We are willing to pay for it.
⇒ Unfortunately
• No IA256 @ 300 GHz ever(?) (until 2030+?)
• Even if we pay all the money in the world.

Thus
⇒ We'll use several processors to achieve more computing power.
• Finnish CSC currently (louhi.csc.fi): Cray XT4/5
  • 2716 × (4-core 2.3 GHz Opteron, 4-8 GB, 25 GB/s)
  • Theoretically 102.3 TFlop/s, measured 76.5 TFlop/s (Linpack)
  • http://www.csc.fi/english/research/Computing_services/computing
  • see "Current parallel computers (briefly)" below
• Ordered: Cray Cascade (10 Me, 1 PFLOPS?)

Other applications for processing power (parallelism)
• Huge databases, urgent queries, data mining
• Digital signal/image/video processing
• Complex user interfaces (virtual reality, games)
• DNA modelling
• DNA matching
• Molecular modelling
• Environmental modelling (storms, pollution, earthquakes, sea currents)
• Astronomical modelling
• Optimization (aero/hydrodynamics, etc.)
• Structure strength calculations (car crash simulations, etc.)
• Cryptanalysis
• Pattern recognition, audio/image surveillance
• Data mining/indexing/classification
• Artificial intelligence
• Measurement data analysis and modelling (sensor values to big picture)

Some key concepts

Example 1-2: Building a small house:
• One skilled man can build a house in one year.
• Two skilled men can do it in about half a year.
• 12 men, one month: requires very careful planning (at least).
• 365 men, one day: probably impossible.
• 1 million men, 10 seconds: definitely impossible.

⇒ How to coordinate the fast (1-5 day) parallel building of a house?
• Skilled workers
• Synchronization of work
• Partly independent components (roof, walls, etc.)
• More than one (level of) leader(s)
• Good instructions and communication
• Detailed plan available to all (at least many) workers
  • Problem: a single plan will be crowded
  • Solution: local partial copies of the plan

⇒ Lessons learned:
• Parallelization possibilities depend on the problem (ditch vs. well).
• Communication and coordination are vital.
• Access to a SHARED plan with local copies is a fairly good communication method.
⇒ There is a limit on the efficient number of workers.
• Key concepts:
  • speedup, extra work, efficiency

Example 1-3: Which one to choose?

Labour   | Calendar time | Speedup | Work    | Labour expenses | Efficiency
1 man    | 1 year        | 1.00    | 1.00 my | 48,000 e        | 1.00
2 men    | 7 months      | 1.71    | 1.17 my | 56,000 e        | 0.86
4 men    | 4.5 months    | 2.67    | 1.50 my | 72,000 e        | 0.66
365 men  | 5 days        | 73.00   | 5.00 my | 240,000 e       | 0.20

Think BIG!
• Great Wall of China (in a day?)
  • 5 mm / ~300 kg of wall for each Chinese
• Great Pyramid of Giza (in ???)
  • ~60 kg for each Egyptian

Limits of parallelization
• Can we speed up a computation infinitely by adding more and more processors?
  • Not infinitely; most problems have a lower time bound (usually (poly)logarithmic, with a polynomial number of processors).
• In practice, the limit is money.
  • Hard problems are huge (input size N is large).
  • Huge problems have a lot of potential parallel parts.
  • E.g., a high-rise building vs. a single-family house.
  • Small problems are fast enough with one processor.
• In theory, the limit is 3-dimensional space and the speed of light (we cannot reach an exponential number (as a function of time) of processors): $T(N,P) = \Omega(P^{1/3})$.

Speedup (nopeutus), work (työ), efficiency (tehokkuus, hyötysuhde)
• An optimal sequential (uniprocessor) algorithm time = Ts(N).
• Parallel algorithm with P processors, time = Tp(N,P).
• Speedup is defined as the ratio Ts/Tp.
• Speedup Ts/Tp = O(P).
  • I.e., superlinear speedup is not possible, as it would imply a faster sequential algorithm.
• Work (used resources) = Tp × P.
  • If Tp × P = O(Ts), the algorithm is work optimal (työoptimaalinen).
  • Tp × P = o(Ts) is impossible!
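A minimal C sketch of these definitions, replaying the house-building rows of Example 1-3 above (times in years, Ts = 1 year; the numbers are from the table, the code itself is only illustrative):

    #include <stdio.h>

    /* Definitions from above:
     *   speedup    = Ts / Tp
     *   work       = Tp * P
     *   efficiency = speedup / P  (= Ts / (Tp * P))
     */
    static void report(double Ts, double Tp, double P) {
        double speedup = Ts / Tp;
        double work = Tp * P;
        double efficiency = speedup / P;
        printf("P=%6.0f  speedup=%6.2f  work=%5.2f my  efficiency=%4.2f\n",
               P, speedup, work, efficiency);
    }

    int main(void) {
        report(1.0, 1.0, 1);          /* 1 man,   1 year     */
        report(1.0, 7.0/12.0, 2);     /* 2 men,   7 months   */
        report(1.0, 4.5/12.0, 4);     /* 4 men,   4.5 months */
        report(1.0, 5.0/365.0, 365);  /* 365 men, 5 days     */
        return 0;
    }

Running it reproduces the Speedup, Work, and Efficiency columns of the table (1.71, 1.17 my, 0.86 for two men, and so on).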

Amdahl's law on serial fractions within parallel programs
• If an algorithm has an (inherently) serial part that will not be parallelized, it will limit the whole parallelization.
• Or, if we do not bother to parallelize some difficult part.
• Whole algorithm (serial) time T, sequential fraction α (0..1).

  $T(N,P) = \alpha T + \frac{(1-\alpha)T}{P}$   (1-1)

  $\mathrm{Speedup}(P) = \frac{T}{\alpha T + \frac{(1-\alpha)T}{P}} = \frac{1}{\alpha + \frac{1-\alpha}{P}} \rightarrow \frac{1}{\alpha}$ when $P \rightarrow \infty$   (1-2)

  $\mathrm{Efficiency}(N,P) = \frac{T}{P\left(\alpha T + \frac{(1-\alpha)T}{P}\right)} = \frac{1}{\alpha P + 1 - \alpha}$   $(P \rightarrow \infty)$   (1-3)
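A minimal C sketch of (1-2) and (1-3), assuming a 5% serial fraction (the value of alpha is an arbitrary example); it shows the speedup saturating at 1/α no matter how many processors we add:

    #include <stdio.h>

    /* Amdahl's law (1-2): speedup(P) = 1 / (alpha + (1 - alpha) / P). */
    int main(void) {
        const double alpha = 0.05;          /* 5 % of the work is serial */
        for (int P = 1; P <= 4096; P *= 4) {
            double speedup = 1.0 / (alpha + (1.0 - alpha) / P);
            double efficiency = speedup / P;    /* (1-3) */
            printf("P=%5d  speedup=%6.2f  efficiency=%5.3f\n",
                   P, speedup, efficiency);
        }
        printf("limit: 1/alpha = %.1f\n", 1.0 / alpha);
        return 0;
    }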

Possible goals for speedup and/or efficiency
• As fast as possible.
  • No matter how many processors.
  • For most problems, there exists a (poly)logarithmic-time ((log n)^k) algorithm (very fast!).
• As good efficiency as possible.
  • Unfortunately, the sequential algorithm is always the most efficient.
⇒ As fast as possible while maintaining (asymptotically full, or given) efficiency.
• Something in between, or in real life:
  • In a given time, with as few (and cheap) processors (and other resources) as possible.
  • With a given number of processors (and other resources), as fast as possible.

Brent's theorem
• If our algorithm works with P processors in time T, we can execute it with P' < P processors in time T × P/P'.
  • E.g., an algorithm designed for P = 1024 processors and time T runs on P' = 64 processors in time 16T.
⇒ We can always design algorithms for as many processors as possible/efficient. The algorithm will work nicely with fewer processors.
• Even if we won't have thousands of processors, multithreaded processors work more efficiently with more threads.
⇒ In some cases, though, an algorithm that is designed for fewer processors may be more efficient.

What is so difficult in parallel programming?
• Sometimes even sequential programming is difficult.
• In parallel programming we have to manage several processors, each of which must work correctly.
• The processors must communicate correctly.
• Some problems are easy to parallelize, some difficult or inefficient.
⇒ Parallel programming is difficult.
⇒ We often need more abstraction levels than in sequential programming.
• Concentrate on data and operations on data.

Parallelism is natural!
• In fact, sequential order is (sometimes) artificial.
• A "typical" algorithm segment:

    for each elem in array A do
        elem ← elem × 2

• A sequential programmer implements:

    for (i = 1; i <= N; i++)
        A[i] = A[i] * 2;

• Why serialize an originally parallel (simultaneous) operation?
• Sometimes serialization might be a source of errors.
• A parallel version can be flexibly implemented with 1..N processors.
• The real world is concurrent (and very parallel) anyway.
• Parallelism is (almost) as old as Life.
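The same loop maps directly onto, e.g., an OpenMP parallel for (OpenMP returns in Chapter 6); a minimal sketch, compiled with e.g. gcc -fopenmp (without OpenMP the pragma is ignored and the loop simply runs sequentially):

    #include <stdio.h>

    /* "for each elem in A do elem <- elem*2" without serializing:
     * the iterations are independent, so they may all run at once. */
    #define N 8
    int main(void) {
        int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        #pragma omp parallel for      /* iterations are independent */
        for (int i = 0; i < N; i++)
            A[i] = A[i] * 2;
        for (int i = 0; i < N; i++)
            printf("%d ", A[i]);
        printf("\n");
        return 0;
    }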

Some similar terms (that are sometimes mixed up)

Distributed system (hajautettu järjestelmä)
⇒ A distributed system is a collection of autonomous computers linked by a computer network that appears to the users of the system as a single computer.
• The machines are autonomous; this means they are computers which, in principle, could work independently.
• Separate computers work concurrently, without a global clock, and may appear, fail, and recover independently.
• The user's perception: the distributed system is perceived as a single system solving a certain problem (even though, in reality, we have several computers placed in different locations).
⇒ Each part of the distributed system may be a part of (i.e., participate in) several distributed systems.
• Not part of this course.

Distributed computing (hajautettu laskenta)
• Term often used when several computers (often geographically distributed) are used to compute a single computational problem in parallel.
• Message-passing programming; tolerate long and/or unpredictable delays, low bandwidth.
• E.g., SETI@home, distributed DNA matching, etc.
• The boundary between parallel and distributed computing depends on the speaker.
• Sometimes, "distributed computing" is used of "distributed systems".
• "Grid computing".
• Part of this course.

Concurrent system (samanaikainen)
• Things occurring apparently simultaneously.
• In reality, only one (process, etc.) is executing at a time, and the process is changed frequently enough.
  • E.g., processes in a multitasking OS execute at ~10 ms time slices.
• Can also occur really simultaneously in multiprocessor systems.
• Concurrency is defined with respect to a slow observer (human).
• The order of concurrent events is nondeterministic.
• Can be (usually is) implemented using time-sharing (sometimes several processors).
• Tasks are not necessarily (tightly) related.
• Parallel and distributed systems are concurrent by nature.
  • Processes in different computers execute simultaneously.
  • The communication in asynchronous distributed systems is concurrent.
  • To achieve most flexibility and performance, the processes (computers, software) that participate in a DS are usually concurrent (multithreaded).
• Concurrency theory (or practical handling) is not part of this course.

Multithreading (säikeistys)
• The standard mechanism to implement a concurrent process (one process).
• As opposed to distinct processes, the threads of a single process share the same data.
• Not part of this course.

Multithreading according to processor manufacturers
• The processor includes special circuits to execute several processes simultaneously.
• Depending on the implementation, the processes may execute at full speed, or at slightly lower speed.
• Benefit: more efficient utilization of functional units.
• The OS (and processes) "see" several processors.
• E.g., Intel HyperThreading(tm), SUN CMT.
• Relates to this course.
• See Processor multithreading below.

Distributed operating system
• Single system image (for the user) over several computers.
• The user will not know in which physical computer their processes run.
• Automatic job/process distribution, balancing, migration.
• "Grid computing"
• E.g., Mosix

Parallel computation/computer (rinnakkaislaskenta, -tietokone)
• Use several processors/computers to solve a single computation in parallel.
• The only goal is to make hard computing faster.
  • Up to P times faster using P processors.
• Useful (only) if we are in a hurry (simulation/forecast, real-time applications).
• A parallel computer often has dozens..thousands of similar processors with a tight interconnection and often a (virtual) shared memory.

Parallel, distributed, and concurrent systems and programming have a lot in common.
• Task division.
• Interprocess communication, dividing data.
• Nondeterminism.
• Synchronization challenges.
• Deadlock possibility.
• Load balancing.
• Error possibilities, fault-tolerance techniques.
⇒ Hardware, tools, and goals differ.
• In this course, we concentrate on parallelism, but we might have something (threads, processes) on concurrency.

Current parallel computers (briefly)

SMP (Symmetric MultiProcessor)
• 2-16 (-64) processors on the same memory bus (or switch).
• Several banks of memory.
• Each processor has its own cache (to reduce bus traffic).
• Not a very scalable approach (as a bus; a bit more with a switch).

Figure 1-1: Bus-based SMP computer. (Processors, each with its own cache, and memory banks and I/O on a central system bus.)

• E.g., cs: Sun M4000 (2 × 4-core SPARC64 VII 2.4 GHz).
• In larger units (P ≥ 8-16), processors are usually clustered.
• Processors do not communicate directly; memory is used for communication.
• Usually used to improve throughput in a concurrent system; can be used for parallel computation as well.

Figure 1-2: Crossbar-based SMP computer. (Processors with caches connected to memory banks and I/O through a crossbar switch.)

Why parallel (once again) [Gordon Moore, ISSCC 2003, www.intel.com]

Multicore SMP, SMT, CMT
• As the silicon manufacturing process improves, more and more transistors can be fitted on a chip (mainframe/supercomputer: on a board).
• How to use the exponentially growing transistor count efficiently?
  • 1940's to 70's: more and more bit-parallelism and instructions.
    • Eventually diminishing returns.
  • (70's), 80's, 90's: deeper pipelining, wider superscalar.
    • Usefulness of deeper pipelines and wider superscalar is limited by code/compilers; eventually diminishing returns.
  • Since late 80's: more and more cache to balance slow memory.
    • The difference between 2 MB and 4 MB L2 caches is small in speed, but takes more transistors than an ALU; eventually diminishing returns.
  • Since mid 2000's: more cores.
    • (And more integration for cheap PCs)
    • Same transistor count: 6000 × i386, or a single 2-core Itanium 2!

• Multicore SMP means several CPUs within a single silicon chip.
• Each CPU has its own ALU(s), L1 (& L2) cache, usually also an FPU.
• The CPUs share the L3 (& L2) cache, MMU, and external connections.
• Multicore benefit:
  • P times the processing potential for approximately the same price.
• Drawback:
  • Memory and I/O bandwidth do not increase accordingly; eventually diminishing returns.

• Sun UltraSPARC IV processor [www.sun.com]

Processor multithreading
• Each core executes several processes (threads).
• Reduces the impact of memory latency by making each virtual processor slower.
• Sun UltraSPARC T3
  • 16 cores, 8 threads each → the OS sees 128 threads ("processors")
• Cray XMT
  • 128 threads per processor.

⇒ Multicore is mainstream now (2006 slides: "soon").
• XBox 360
  • CPU: triple-core PowerPC, two threads each (total 6 threads)
  • GPU: 48 ALUs
• Playstation 3
  • 8 VLIW processors (APU), each 4+4 pipelines = 256 pipelines.
• Intel
  • Since 2003: Hyperthreading provides 2 virtual processors for the OS
  • 8-core i7/Xeon (multi-chip)
  • Dual-core P4 in 2005, quad core in 2007, 48?-core in 2010.
• AMD 2*8-core Opteron; dual-core Athlon in 2005, quad in 2007.
• SUN/ORACLE quad-core SPARC64 VII, 16-core T3
  • SUN dual-core UltraSPARC IV in 2004, 8-core T1 in 2006.
• IBM 8-core POWER7; dual-core PPC970 in 2004.
• Nvidia Kepler: 1536 cores, up to 96 threads/core, 500 e.
⇒ Nowadays, we can assume that our software is run mostly on parallel machines!


Vector (super)computers
• Classical supercomputers since the Cray 1 in 1977.
• 1-32 (more if clustered) extremely powerful processors.
• Each up to 100 GFLOPS (2008).
  • ~8 MUL-and-ADD floating point operations / clock cycle / processor
  • E.g., dot product
• Requires several long (1000-element) arrays (vectors) for peak performance.
• On each clock cycle, up to 16 words (64 B) from/to memory.
  • Average PC: 0.1 .. 1 B/cc
• No caches, but hardware prefetch (very deep pipeline) and very wide memory channels (and SRAM memory).
• Cray, Hitachi, Fujitsu, NEC.
• Very expensive, even per FLOPS.
• Nearly extinct in the original form; current implementations approach MPPs, see below.
• NEC SX-9: 100 GFLOPS/proc, 256 GB/s memory bandwidth/proc

• http://www.nec.com/de/en/prod/servers/hpc/material/255_e_sx9.pdf

MPP (Massively Parallel Processing)
• Tens..thousands of processors.
• Each processing node is a 1-4 processor SMP with memory.
• Separate I/O nodes.
• Processing nodes are connected by an interconnection network; topologies vary.

Figure 1-3: A 64-node 3D mesh, a 32-node binary hypercube, and an 80-node butterfly (with 16 input/output nodes).

• Usually the hardware supports virtual shared memory.
• Scales enough (can be built to consume any budget).
• The communication network is expensive (up to half of the machine cost).
• Special-purpose machines can be tailor-designed to balance the costs of subsystems (processors, memory, bandwidth, I/O) with the given task.
• General-purpose computers provide compromises between price and interconnection and memory performance.
• E.g., (ILLIAC IV), Thinking Machines CM-1, -2, -5, Cray T3E, XT4/5, XE6, Digital (HP) Alphaserver SC, IBM eServer, Intel ASCI Red, SGI, etc.

NOW (Network of Workstations)
⇒ Personal workstations are 99% idle (nights, editor usage).
• Free cycles can be used by: nice compute.
• "Free" (unused) computing power:
  • cs department: 400 PCs × 3 GFLOPS = 1.2 TFLOPS.
  • UEF: 5000 PCs × 3 GFLOPS = 15 TFLOPS.
  • Finland: 1.5M PCs × 2 GFLOPS = 3 PFLOPS > Blue Gene.
• Ordinary Unix (WinNT) workstations, TCP/IP connection.
  • A switch ... LAN ... WAN ... Internet.
• Sometimes (nowadays) also a dedicated cluster (ryväs).
  • 1(0) Gb Ethernet, Infiniband, ATM, FC, or Myrinet; no displays, etc.
  • Blade racks to save space, reduce loose wires.
⇒ Slow(ish) communication restricts algorithm choice.
⇒ Cheapest FLOPS because of mass production!
• See exercise 4-5.

Parallel architectures seem to converge towards each other.
• In SMP computers, buses are replaced by clustered networks.
• Vector supercomputers are implemented in CMOS, use caches and DRAM, P increases, nodes are clustered (memory performance degrades, or there is no shared memory anymore).
• Vector techniques and virtual shared memory are used in MPP computers.
• Multithreading and multicore are used in CPUs and GPUs.
• Workstations (or server computing nodes) have parallel vector units.
• MPP computers are built from commodity parts like NOWs.
• Dedicated "NOWs" are used for parallel computation.
• Several (even heterogeneous) computers are connected for joint work (grid computing).
• Blade server racks look like a mainframe...

Current top computers: http://www.top500.org/

IBM Sequoia - BlueGene/Q
• 98,304 × 16-core PowerPC
• 16 PFLOPS, 7900 kW

Tianhe-1A
• http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&prid=678988&releasejsp=release_157
• 7,168 NVIDIA Tesla M2050 GPUs
  • 448 cores each ⇒ 3.2M cores
  • ~1 GFLOPS / core ⇒ 500 GFLOPS / GPU
  • But only 3 GB memory / GPU
  • ~3.5 PFLOPS theoretical, 2.5 PFLOPS LINPACK
  • tens of threads / core = tens of millions of threads!
• 14,336 Xeon CPUs.

An additional bonus of parallel computers
• As we can have unlimited performance via parallelization, we do not need the fastest processor. Instead, we'll select the best by performance/price. (www.verkkokauppa.com 2010)

Intel Core 2 Duo E7500    2×2.9 GHz,  3 MB     118.90 e
Intel Core 2 Quad Q8400   4×2.66 GHz, 6 MB     151.90 e
Intel Core 2 Quad Q9650   4×3.0 GHz,  12 MB    330.90 e
Intel i5-760              4×2.8 GHz,  8 MB     193.90 e
Intel i7-950              4×3.06 GHz, 8 MB     514.90 e
Intel i980X EE            6×3.3 GHz,  12 MB    989.90 e
Intel Xeon X7460          6×2.66 GHz, 16 MB    2578.90 e

• Not quite as simple as GFLOPS/e.
• We need more than processors (motherboards, network cards, switches).
• An algorithm may be less efficient with more processing nodes.
• See exercises 4-5.

Chapter 2
PRAM

A simple model of parallelism
PRAM programming
PRAM physical implementation possibilities

⇒ PRAM is used to avoid dirty details.

PRAM shortly

How was PRAM born?
⇒ A familiar computer abstraction (for programmers, etc.):
• RAM (Random Access Machine)
  • A processor
  • A memory
• Procedural (or OO) programming, especially variables.
• Not quite accurate anymore, but good enough.

Figure 2-1: RAM (von Neumann): a processor connected to a memory.

A natural extension:
• PRAM (Parallel Random Access Machine)
• Fortune and Wyllie 1978, many others
⇒ Increase the number of processors.
• All processors can equally access the shared memory.
⇒ Programming is like RAM, except memory (variables) is shared.
• All processors have to be programmed.
• Memory access conflicts have to be avoided.

Figure 2-2: The structure of the PRAM model: P processors (P1..PP) performing read/write operations from/to a word-wise accessible shared memory.

Why PRAM is good:
• Simple and strong model.
  • If a parallel algorithm can be done, it can be done for PRAM.
• Reminds us of real computers (like RAM).
• Flexible: tens of different variations.
• Generally used.
  • Most parallel algorithms are designed for PRAM.
  • Existing set of algorithms and other theory.

Why PRAM is bad:
• A P-port shared memory cannot be built (easily).
• Real-world delays are ignored.
• Does not account for building costs.
• Does not guide towards saving resources.

Still
• A handy tool (abstraction) for research and teaching.
• Algorithms can be adapted for real computers.

PRAM models

Processors are processors, brand does not matter.
• If needed, we can define each processor (processing node) to have local memory and I/O.
• Especially the program can be stored as local copies, but as a plain model, it does not matter.
• Usually we assume the same program but own program counters at every processor (MIMD, multiple instruction stream, multiple data).
• SIMD (single instruction stream) is an option for cheaper implementation.

The shared memory in PRAM is interesting.
• To cooperate efficiently, the processors need to be able to exploit memory.
• Up to a read/write at every clock cycle by every processor.
• Is it possible/feasible to define/implement a memory that can handle P simultaneous memory accesses every clock cycle?
  • It is easy to define.
  • It is attractive to use.
  • It might be possible to implement (with some tricks).
  • It is not currently feasible to implement, though.
  • For a while we assume that it is possible, and we'll exploit it to achieve the easiest possible parallelism.

Processor - memory speed comparison (Random Access Machine):
• 8 bits/DRAM chip, 50 ns random access latency, 3 GHz 64-bit processor:
  • 3 × 50 × 64/8 = 1200 DRAM chips/processor for full random access of one word at every clock cycle!
• Actually, modern (SD)RAM should not be considered as RAM...

PRAM memory model
• A single memory, indexed memory locations (e.g., 1..m).
• m usually "unlimited" (as in RAM).
• Each memory reference (read/write) is done in unit time (O(1), 1 cc).
• Also, all other machine instructions take 1 clock cycle.
⇒ What if simultaneous memory references hit the same memory bank or even the same memory location?
• Simultaneous: on the exactly same clock cycle, no timesharing possible within a clock cycle. Also called concurrent.

Same bank, different address:
• For the model, there is no such problem.
• For a real implementation, we need more circuitry and/or tricks (see below).

Several simultaneous memory references to the same memory address:
• The references could possibly be combined.
• Write requests: something is written.
• Read requests: the result is copied to all accessing processors.
⇒ In a model, we just define what will happen.
• Several simultaneous reads is a strong operation, but very easy to define.
• Simultaneous read(s) and a write can be defined as, e.g., every write occurring before every read (two stages = O(1)).
• Several simultaneous writes is much more difficult to define.
  • Each memory location will always contain only one value.
⇒ In the PRAM model, these are considered as model variations.

PRAM variations
• The memory models differ on restrictions/results on what can happen at a single memory location in a single clock cycle.
• If the restrictions are violated, the whole machine halts immediately (in a model), or the results are unknown (in real life).

E/C/O × R/W
• EREW (Exclusive Read, Exclusive Write)
  • Both several simultaneous reads and writes are forbidden.
• CREW (Concurrent Read, Exclusive Write)
  • Several processors may read simultaneously, but writing is allowed to one processor at a time.
• CRCW (Concurrent Read, Concurrent Write)
  • An unlimited number of reads and writes is permitted simultaneously.
  • The result of simultaneous writes has to be resolved somehow, see below.
• CROW (Concurrent Read, Owner Write)
  • Each memory location is owned by a processor; others may only read it.
• ERCW (Exclusive Read, Concurrent Write)

CW variation examples
• On concurrent access to a single memory location.
• In ascending (partial) order of strength.
• WEAK
  • Only simultaneous writing of zeroes is allowed.
• COMMON
  • Only simultaneous writing of the same value is allowed.
• TOLERANT
  • Nothing happens if several processors try to write simultaneously.
• COLLISION
  • A special collision symbol is written if several processors try to write simultaneously.
• COLLISION+
  • A special collision symbol is written if several processors try to write different values simultaneously. (see COMMON)

• ARBITRARY
  • Some (random) value survives if several processors try to write simultaneously.
• PRIORITY
  • The processor with the lowest PID will succeed, others fail.
• STRONG
  • A combination of the values is written,
  • e.g., ADD&WRITE, AND&WRITE, PREFIX-SCAN
  • Different variations have been suggested.

Examples of potency differences:
• Spreading a word to every processor (or to P memory locations).
  • CREW: every processor reads the same memory location: O(1)
  • EREW: the value is doubled (as in a binary tree) until all processors have read it: O(log P)
• Maximum of an array.
  • CREW: O(log N)
  • WEAK CRCW: O(1)
• Sorting
  • EREW: O(log N)
  • STRONG CRCW: O(1)
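The O(1) CRCW maximum can be sketched as follows: with N² (virtual) processors, every pair of elements is compared at once, and losers' flags are cleared. All concurrent writes store the same value 0, so even the WEAK variation suffices. A sequential C simulation of the idea (not the canonical formulation; variable names are illustrative):

    #include <stdio.h>
    #include <stdbool.h>

    /* O(1)-time CRCW maximum with N*N processors, simulated sequentially.
     * Step 1: each pair (i,j) has its own processor; the loser's flag is
     *         cleared. All concurrent writes write the same value 'false'
     *         (i.e., zero), which WEAK/COMMON CRCW allows.
     * Step 2: the only index whose flag survived holds the maximum.     */
    #define N 6
    int main(void) {
        int A[N] = {3, 9, 4, 1, 7, 5};
        bool is_max[N];
        for (int i = 0; i < N; i++) is_max[i] = true;  /* N processors   */
        for (int i = 0; i < N; i++)                    /* N*N processors */
            for (int j = 0; j < N; j++)
                if (A[i] < A[j])
                    is_max[i] = false;   /* concurrent writes of zero */
        for (int i = 0; i < N; i++)                    /* N processors   */
            if (is_max[i])
                printf("max = %d (index %d)\n", A[i], i);
        return 0;
    }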

PRAM "programming"
⇒ As in sequential programming, we'll use several abstraction levels.
• Describe the algorithm in a natural language and a picture.
• Describe the algorithm in an algorithm notation.
• Transform the algorithm to adapt to real-world (machine and programming environment) restrictions.
• Write the algorithm in a programming language.
• Compile the program into machine language.

(Data)parallel algorithm notation
⇒ As sequential, with an additional statement to express parallelism:

    for i ∈ 1..N pardo          // or, e.g., for each element in A pardo
        statement;              // e.g., if A[i] = 0 then A[i] := ...

• statement is executed once for each value of i (1..N) (as in a sequential for-do).
• All N executions are done in parallel, if we have at least N processors.
• Time complexity:
  • Tst + O(1) if we have enough processors (Tst = time of a single statement).
  • Tst × N/P + O(1) if we take P into account.
  • Remember Brent's theorem.

⇒ Different parallel executions may not disturb each other.

    for i ∈ 1..N pardo
        A[A[i]] := A[i];        // result very unclear, not allowed!

• If we need local variables (memory), we can use the keywords private and shared to clarify the situation (see the OpenMP-flavoured sketch below).
⇒ Creative freedom is allowed in algorithm notation as long as exactness and comprehensibility are maintained.
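OpenMP (Chapter 6) uses the same shared/private vocabulary for the same purpose; a minimal C sketch, assuming the OpenMP view rather than the PRAM notation itself:

    #include <stdio.h>

    /* shared vs. private, OpenMP flavour: A is shared by all threads,
     * tmp is private, so parallel iterations cannot disturb each other. */
    #define N 8
    int main(void) {
        int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        int tmp;
        #pragma omp parallel for shared(A) private(tmp)
        for (int i = 0; i < N; i++) {
            tmp = A[i];           /* each thread has its own tmp      */
            A[i] = tmp * tmp;     /* writes go to disjoint elements   */
        }
        for (int i = 0; i < N; i++) printf("%d ", A[i]);
        printf("\n");
        return 0;
    }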

procedure Odd-even_mergesort (A : array[1..N]);
    if Processors = 1 then
        Sequential_mergesort(A);
    else
        par i = 1 to 2 do
            Odd-even_mergesort(i:th half of A);
        Odd-even_merge(halves of A);
    synchronize;

procedure Odd-even_merge (A : array[1..N]);
    if Processors = 1 then
        Sequential_merge(A);
    else
        par i = 0 to 1 do
            Odd-even_merge(halves of odd/even (2n+i) elements of A);
        par i = 2 to N–1 by 2 do
            pipelined_compare-exchange (A[i], A[i+1]);
    synchronize;

Algorithm 2-1: Parallel odd-even mergesort, informal version.

Parallel programming languages
⇒ Variety is huge, few established standards.
• We'll describe some real languages/standards later on.
⇒ PRAM programming with paper (or with a PRAM emulator) can be done as easily as moving from sequential algorithms to sequential programs.
• Local and shared variables.
• Processor ID (PID) to distinguish between processors.
• Synchronization.
• I/O is either forgotten, or we'll use parallel I/O.
• Example: (Parallel Modula-2 for F-PRAM)

procedure oemerge(sharedvar S : array of word; Start, Length, Stride : word);
var a, b : word;
    i, j, k, Length2 : register word;
begin
    Length2 := Length / 2;
    par i := 0 to 1 do
        oemerge(S, Start + i * Stride, Length2, Stride * 2);
    end;
    par i := 1 to Length2 - 1 do
        j := i * 2;
        a := S[Start + (j - 1) * Stride];
        b := S[Start + j * Stride];
        if a > b then
            S[Start + (j - 1) * Stride] := b;
            S[Start + j * Stride] := a;
        end;
    end;
    synchronize;
end oemerge;

Algorithm 2-2: Odd-even merge in fpm.

PRAM machine language
• As any RAM machine language, possibly also LOAD PID, and separate operations to access local and shared memory.
• Usually one shared program for every processor.
• The same program is loaded to every processor node; processors will branch according to PID.
• We can use assembler as an intermediate stage.
• E.g., F-PRAM.

# macro assembler                  # macros opened
else5:  LOAD  =0                           LOAD  =0
        STORE TMP15                        STORE 24
        STORE TMP11                        STORE 20
        LOAD  =1                           LOAD  =1
        STORE TMP10                        STORE 19
        LOAD  PROS                         LOAD  9
        SUB   TMP10                        SUB   19
        ADD   TMP11                        ADD   20
        SUB   =1                           SUB   =1
        JPOS  overpar0                     JPOS  322

Figure 2-3: (F)PRAM machine language.

Implementing PRAM
⇒ Using shared memory (a memory reference is a read or write) in one clock cycle is impossible.
• It has not succeeded even on uniprocessors since the 1 MHz times in the 80's.
• Today, we could achieve 20 MHz on DRAM, 300 MHz on (nonembedded) SRAM.
• In addition to DRAM latency, the physical distances of large computers make access slow.
  • In 0.3 ns (3 GHz), light will travel 10 cm in free space, electricity ~7 cm in a coaxial cable, even less on a circuit board, only a few cm on a semiconductor.
⇒ Moreover, building a P-port memory is expensive/impossible if P is large.

The extra cost factor for P ports is Ω(P²) (as VLSI area).
• E.g., let us consider technology for 4 Gbit (0.5 GB) memory chips.
  • It will yield 16 Mbit (2 MB) memory with 16 ports.
  • Moreover, each of the 16 processors will need 24 address lines and 2 data lines, totalling more than 416 pins for the 16 Mbit (2 MB) memory chip.
  • Packaging costs for a modest 1 GB memory (64 MB/pr) would be 100000's e.
• At 64 ports, a 1 Mbit (128 kB) chip would be more complex (>1800 pins) than an Itanium 2 Quad.
  • 64 GB would take 0.5M chips, 1000 m², and cost > 10⁹ e.
• And the access latency would still be long...

PRAM can be implemented more easily by simulating the shared memory with distributed memory.
⇒ P processors, M memory banks.

Figure 2-4: Distributed memory model: P processing nodes (P0..PP−1), each containing a processor, memory, and network interface, connected by an interconnection network.

• Often it is assumed that M = P, i.e., each processing node contains a memory module.
  • Good: easier construction, fewer nodes, fewer communication connections.
  • Poor: more traffic in each node/connection; in real life, memories are slower than processors.
• For reasonable performance, M = CP, where C is the speed difference factor between processors and memory.

Overloading (ylikuormitus)
⇒ Let us assume that a memory reference from/to a (virtual) shared memory takes h clock cycles.
• The computer has P physical processors.
• Each physical processor executes the tasks of h PRAM processors (h virtual processors per physical processor).
  • The processor executes only one instruction at a time for each PRAM processor it is responsible for.
  • After each clock cycle it changes to the next PRAM processor.
  • After serving all h PRAM processors, it starts over by executing the next instruction of each PRAM processor.
⇒ The memory references made by the PRAM processors have completed within the h clock cycles.
• In algorithm notation, see Algorithm 2-3.

while not all processors halted do
    for each thread i do
        PCi := PCi + 1;
        if op = write then
            send write-reference
        else if op = read then
            send read-reference
        else
            execute operation
    for each thread do
        if op = read then
            receive read-reference

Algorithm 2-3: PRAM simulation algorithm.
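A sequential C sketch of the same round-robin idea, assuming h = 4 virtual processors and a purely illustrative "instruction" (the names vp, pc, and round are not from the handout):

    #include <stdio.h>

    /* Round-robin simulation of h virtual PRAM processors on one
     * physical processor. Each round issues one instruction per virtual
     * processor, so a memory reference started in round r is needed
     * only in round r+1, i.e. h instruction slots later, which hides a
     * memory latency of h clock cycles. */
    #define H 4                     /* virtual processors (slackness) */
    int main(void) {
        int pc[H] = {0};            /* one program counter per thread */
        for (int round = 0; round < 3; round++) {
            for (int vp = 0; vp < H; vp++) {
                /* "execute" one instruction of virtual processor vp */
                printf("round %d: vp %d executes instruction %d\n",
                       round, vp, pc[vp]);
                pc[vp]++;           /* switch thread after one cycle */
            }
        }
        return 0;
    }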

⇒ What do we gain?
• For each PRAM processor ("virtual processor") everything occurs in one clock cycle.
• The clock frequency of each PRAM processor is only 1/h of the real processor.
• There are h×P PRAM processors.
• Processing power is (h×P) × (1/h) = P, i.e., the same as with direct P processors.
⇒ If the program can exploit h×P processors, it will execute work-optimally.
• h is also called parallel slackness.

How large does h need to be?
• Depends on the network and routing protocol.
• At least twice the diameter of the interconnection network.
• Even a bit more, as the routing algorithm needs slackness to handle congestion.
• E.g., in a butterfly network: O(log P loglog P).
• It has been done (Saarbrücken SB-PRAM, Tera MTA / Cray XMT).
• The same technique is used in GPU units, e.g., Nvidia G8x, etc.
• Bonus: no caches needed.

Requirements for overloading
• Multithreading processor (switch after every clock cycle)
  • Implementation similar to superpipelining (Forsell).
• Huge memory bandwidth.
  • E.g., fully populated grids have too narrow bisection bandwidth, see Figure 1-3.

Lesson learned
⇒ A parallel algorithm should be designed to use as many processors as (efficiently) possible.
• PRAM is not completely utopistic.
• Especially if we use local memories to decrease the traffic to the shared memory.

Chapter 3
Parallel algorithms (in PRAM notation)

Goals
Techniques
Some algorithms

Parallel algorithm design goals

Either
• maximal speedup (and parallelism), or
• maximal speedup while still maintaining work-optimality.

More formally, an algorithm classification
• According to time complexity
  • NC: polylogarithmic time complexity, polynomial number of processors (Nick's class).
  • P: polynomial speedup
    • A different P than in sequential algorithms (solvable in polynomial time).
  • note: NC and P are not disjoint
• According to work optimality
  • E: efficient
  • A: polylogarithmic inefficiency (almost efficient)
  • S: polynomial inefficiency (semi-efficient)
• Combining these we'll get six classes of algorithms: ENC, ANC, SNC, EP, AP, SP.
  • ENC would be nice.
  • EP is usually good enough.

Parallel algorithm design methods
⇒ Concentrate on (operations for) data, not (operations by) processors!

Parallelizing sequential parts of an existing sequential algorithm
⇒ This is not a real design method, but in real life this is what we'll face (as ad hoc programmers have sequentialized parallel problems).
• Suits well for linear algebra.
• Analysing for-do loops (and other sequential sections).
• If the sequential parts are independent, we can parallelize them.
• Sometimes inner loops are parallel, sometimes outer loops.
• Loop rearranging may help.

• E.g., matrix multiplication C = A ⋅ B,

  $c_{ij} = \sum_{k=0}^{N-1} a_{ik} b_{kj}$   (3-1)

• An easy sequential algorithm and an easy parallelization.
• N×N matrix, O(N³) sequential algorithm, O(N) parallel algorithm with O(N²) processors.
• PRAM variant? Exercise.

for i := 1 to N do                  // ⇒ pardo
    for j := 1 to N do              // ⇒ pardo
        for k := 1 to N do
            C[i, j] := C[i, j] + A[i, k] * B[k, j];

Algorithm 3-1: Matrix multiplication.
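The two outer pardo loops translate directly to, e.g., OpenMP; a minimal C sketch (collapse(2) mimics the N² parallel (i,j) pairs; the inner k-loop stays sequential, as in the slide; the test data is illustrative):

    #include <stdio.h>

    /* Algorithm 3-1 with the two outer loops parallelized:
     * every (i,j) pair is independent, the k-loop is sequential. */
    #define N 4
    int main(void) {
        double A[N][N], B[N][N], C[N][N] = {{0}};
        for (int i = 0; i < N; i++)           /* some test data */
            for (int j = 0; j < N; j++) {
                A[i][j] = i + 1;
                B[i][j] = (i == j);           /* identity matrix */
            }
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
        printf("C[0][0]=%g  C[%d][%d]=%g\n",
               C[0][0], N-1, N-1, C[N-1][N-1]);   /* C = A, since B = I */
        return 0;
    }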

• Parallelizing the innermost for-loop is not quite as straightforward (unless we use the STRONG CRCW model).
• However, the innermost product-sum can be evaluated in O(log N) time using O(N) processors, see Parallel tournament (turnaustekniikka) below.
  • Even with O(N/log N) processors, see Blocking (lohkominen).
• Thus, the whole algorithm runs in O(log N) time with O(N³/log N) processors (exercise).
• For real computers and real input sizes, it is often enough to parallelize only one of the nested loops.

• In algorithms with several stages, we should parallelize all (demanding) stages to achieve full efficiency (processor utilization).

for i := 1 to N pardo               // O(1)
    for j := 1 to N pardo
        statement1;                 // O(1)
for i := 1 to N do                  // O(N)
    for j := 1 to N pardo
        statement2;                 // O(1)

Algorithm 3-2: An uneven parallelization: O(N) time, O(N²) processors (but O(N) with O(N) processors).

Divide-and-conquer
⇒ Divide the input in two parts, solve the halves recursively in parallel, combine the results (in parallel).
• A familiar technique in sequential algorithms.
• Parallel recursion is terminated when either
  • the input is trivial (as in sequential programming), or
  • there is only 1 processor left, when we can switch to a sequential algorithm (see Blocking (lohkominen) and Algorithm 2-1).
• Subresults are combined to larger subresults on returning from recursion.


• E.g., mergesort.
• Sequential algorithm: Ts(N) = 2·Ts(N/2) + O(N) = O(N log N).
• Recursive calls at lines 3 and 4 can be executed in parallel (as they work on disjoint parts of the array).
• Using sequential merge, Tp(N) = Tp(N/2) + O(N) = O(N), O(N) processors, O(N^2) work, not good.
⇒ Also combining of subresults must be parallelized!
• Combining is often more difficult than dividing.
• Sometimes combining is trivial, though.
• E.g., in search algorithms (only the discoverer acts), especially using CRCW.

procedure mergesort(var A : array; first, last : index);          1
  if (last–first) > 0 then                                        2
    mergesort(A, first, (last+first)/2);                          3
    mergesort(A, (last+first)/2+1, last);                         4
    merge(A, first, (last+first)/2, (last+first)/2+1, last);      5

Algorithm 3-3: Mergesort.


• In mergesort, combining is the merging phase, which is more difficult to parallelize.
• If we could merge in O(1) time using O(P) processors, the sorting time would be Tp(N) = Tp(N/2) + O(1) = O(log N) time, O(N) processors, O(N log N) work.
• Unfortunately, merging in O(1) time is impossible (using realistic models).
• O(1) amortized time is possible, but unfeasibly complex.
• Merging in O(log N) or O(log log N) time is much easier, but does not offer work optimality unless we use fewer processors, see "Odd-even merge" p. 136.
• Division can be made in more than two parts to reduce the number of stages.
• E.g., division in √N parts, combining in unit time: T(N) = T(√N) + O(1) = O(log log N).
• Obviously, combining might not be as easy anymore, see the raw power and waterfall techniques below.


Parallel tournament (turnaustekniikka)

• Also called balanced tree.
• If divide-and-conquer is a top-down approach, we can also apply a similar technique bottom-up.
• We'll skip (recursive/parallel) dividing in parts; instead we'll start from ready "sequences" of length one element.
• Compare input elements pairwise, the winner continues to the next round.
• The definition of winner depends on the application; e.g., a combination can be used.
• A stage can be done in O(1) time using N/2 processors.
• The same is repeated again and again among the winners (N/4, N/8, ... pairs) until the ultimate winner is left.
• log N stages, each O(1) time ⇒ O(log N) time, O(N) processors.
• As in divide-and-conquer, more than two elements can be handled at each stage, see below.


Raw power (raaka voima)

• As fast as possible.
• "Overkill".
• Almost: using as many processors as possible.
⇒ We'll try to evaluate all possibilities at once.
• E.g., we'll compare all pairs simultaneously.
• O(N^2) comparisons in O(1) time using O(N^2) processors.
• N input elements will transform to N^2 subresults!
• Combining may be hard to do fast, usually requires CRCW.
• The goal is an O(1) or logarithmic time algorithm.
• Rarely work-optimal.
• Often used as a final stage of an algorithm, see below.


Blocking (lohkominen)

• The previous methods often result in unbalanced processor utilization, which implies non-optimal work.
• E.g., at the beginning of a tournament, N/2 processors are used, but the number of active processors reduces on every round; the last comparison is made by one processor only.
• We'll restrict parallelism appropriately to achieve work-optimality.
• Idea:
  • Fewer processors.
  • More work to do for each processor.
  • At the beginning, each processor (in parallel) evaluates its own block sequentially.
  • Switch to the fast parallel algorithm only when each processor has a single intermediate result.
• Usually used with other techniques, e.g., divide-and-conquer.


• E.g., in a tournament of O(N) sequential work:
• The actual tournament stage will take O(log P) time.
• To maintain work-efficiency, we can use at most O(N/log P) processors (if the block part can also be done in O(log P) time; fewer if it takes more).
• We'll choose P = N/log N.
• Each processor will have a log N-element block; the sequential algorithm is used, O(log N) time.
• The remaining N/log N elements will be processed using parallel tournament in O(log N) time using N/log N processors.
⇒ Whole algorithm in O(log N) time with N/log N processors.
• If the sequential part with blocks takes more than O(N) time, smaller blocks are enough.


Waterfall technique (vesiputoustekniikka)

• Also called accelerated cascading.
• Combines the best parts of the previous methods.
• Switch to a faster algorithm after the size of the input has shrunk enough to be executed faster using the given P.


Other methods

• Some basic algorithms, e.g., prefix sums (see p. 119), binary search, and tree/path compaction, are useful as parts of larger algorithms. They often help at the combining parts.
• Randomization (breaking patterns), useful in real-world EREW-like variants to avoid memory congestion.
• Parallel Monte Carlo / genetic methods (all processors try (random) solutions).
• Sampling:
  • Take a (smallish, but as large as possible without disturbing the efficiency) sample of the whole data, analyse it using a fast algorithm (raw power).
  • Divide input according to the distribution of the sample.
  • Input will hopefully be divided more evenly to processors.
  • Helps on real data with inconvenient patterns.


Maximum finding

⇒ A very simple problem; examples of each technique.
• Input: a shared array A[0..N–1].
• Output: the largest element and/or its index.
• Sequential algorithm: O(N).


Standard tournament

⇒ Compare elements pairwise, the winner continues to the next iteration.
• After log N iterations, only one element is left.
• Intermediate results have to be stored somewhere.
• For each comparison, we need two values which were compared on the previous iteration by different processors.
• If we want to leave the original array intact, we'll use an auxiliary array.
• Here we'll use the original for simplicity.


• Winner placement can be done in many ways, see below.
• Here we'll store all winners at the beginning part of the array. That part reduces to half on every iteration.
• The most difficult part is to make indices match on every iteration.
• Iterations have to be executed in strict synchrony.
• We can assume this in PRAM algorithm notation (we can mention it, though). In real machines we need explicit synchronization.

Figure 3-1: Tournament maximum. [Figure omitted: successive rounds of pairwise maxima over the array.]


• By some clever organization, the synchronization requirement can be eased, even removed (with auxiliary data structures).
• If/when the input size N is not of the form 2^k, we'll have to refine line 4 to, e.g.,

      A[j] := max(((j*2 < N) ? A[j*2] : A[j]), ((j*2+1 < N) ? A[j*2+1] : A[j]));    4

• Time: log N (line 2) × O(1) (lines 3-4) + O(1) (lines 1 and 5) = O(log N).
• Number of processors: N/2 = O(N).
• Work: O(N log N), not work-optimal (inefficient by a factor of O(log N)).
• EREW PRAM is sufficient.

function tournament-max(var A : array[0..N–1]);       1
  for i := log N–1 to 0 do                            2
    for j := 0 to 2^i–1 pardo                         3
      A[j] := max(A[j*2], A[j*2+1]);                  4
  return A[0];                                        5

Algorithm 3-4: Maximum using standard tournament.


• The same set of indices can be written in different ways:
• Also, you may use any indices, or a new array to store the intermediate results.

function tournament-max2(var A : array[0..N–1]);
  i := N;
  while i > 0 do
    i := i/2;
    for j := 0 to i pardo
      if j*2 < N–1 then
        A[j] := max(A[j*2], A[j*2+1]);
      elseif j*2 = N–1 then
        A[j] := A[j*2];
  end while;
  return A[0];

Algorithm 3-5: Tournament-max, alternative implementation.


• E.g., using a doubling/halving stride works well.
• If counting twice does not hurt, modulo helps on boundaries.

function tournament-max3(var A : array[0..N–1]);
  s := 1;  // stride
  while s < N do
    for j := 0 to N–s–1 by s*2 pardo
      A[j] := max(A[j], A[j+s]);
    s := s * 2;
  end while;
  return A[0];

Algorithm 3-6: Tournament-max, yet another alternative implementation.
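
As a rough real-machine rendering of Algorithm 3-6, here is a C/OpenMP sketch (mine, not from the handout). The strided scheme is convenient in practice: within one round, every write goes to an index that no other iteration reads, so the implicit barrier at the end of each parallel for is the only synchronization needed.

#include <stddef.h>

/* Tournament maximum with a doubling stride, as in Algorithm 3-6.
 * Overwrites a[]; the maximum ends up in a[0]. Works for any n >= 1,
 * not only powers of two. */
int tournament_max3(int *a, size_t n)
{
    for (size_t s = 1; s < n; s *= 2) {
        #pragma omp parallel for
        for (size_t j = 0; j < n - s; j += 2*s)
            if (a[j + s] > a[j])      /* compare pair (j, j+s) */
                a[j] = a[j + s];
    }
    return a[0];
}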


Figure 3-2: Binary tree of Algorithm 3-6. [Figure omitted: indices 0..N–1 combined pairwise at strides 1, 2, 4, 8, ...]


A variation: maximum for every processor

• Often, the maximum has to be spread to all processors (or indices of the array).
• This is useful especially on EREW PRAM.
• We could make the spreading by using another log N "tree".
• But in the previous algorithm, most processors are idle most of the time. They can be exploited in "concurrent spreading".
• Each processor evaluates its own "local" maximum tree.
• Even if all processors do useful work during the whole execution, this is not work-optimal.

Figure 3-3: An "array of trees" of degree 2. Dashed lines represent wrap-around edges. [Figure omitted.]


Divide-and-conquer

• Actually works like the tournament, with slightly different notation.
• Divide recursively until the input is trivial.
• On returning from recursion, compare, and return the larger one.
• Managing array boundaries and synchrony is easier.
• Parallelism representation possibly more difficult / inefficient.
• Time: T(N) = T(N/2) + O(1) = O(log N), O(N) proc, O(N log N) work.

function divide_conquer-max(var A : array[0..N–1]; low, high : index);
  if (low = high) then
    return A[low];
  else
    pardo
      x := divide_conquer-max(A, low, (high+low)/2);
      y := divide_conquer-max(A, (high+low)/2+1, high);
    return max(x, y);

Algorithm 3-7: Maximum finding using the divide-and-conquer technique.
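
The pardo of Algorithm 3-7 maps naturally onto OpenMP tasks. A sketch (mine; the cutoff constant 1024 is an arbitrary illustration of switching to a sequential scan when we "run out of processors", anticipating the blocking idea below):

#include <stddef.h>

static int seq_max(const int *a, size_t low, size_t high)
{
    int m = a[low];
    for (size_t i = low + 1; i <= high; i++)
        if (a[i] > m) m = a[i];
    return m;
}

/* Divide-and-conquer maximum; call inside "#pragma omp parallel"
 * + "#pragma omp single" so the tasks have threads to run on. */
int dc_max(const int *a, size_t low, size_t high)
{
    if (high - low < 1024)           /* small input: go sequential */
        return seq_max(a, low, high);
    size_t mid = low + (high - low) / 2;
    int x, y;
    #pragma omp task shared(x)       /* one half as a child task */
    x = dc_max(a, low, mid);
    y = dc_max(a, mid + 1, high);    /* other half in this task */
    #pragma omp taskwait
    return (x > y) ? x : y;
}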


Blocking and tournament

• None of the previous algorithms is work-optimal.
• Without Concurrent Write, we cannot achieve O(1) time with O(N) processors; thus, we'll have to reduce the number of processors for work-optimality.
⇒ We'll first use N/log N processors, with a goal of O(log N) time.


• Idea: reduce the input to N/log N elements, after which we'll use the tournament in O(log N) time using N/log N processors.
• Each processor first finds the maximum of its own block of size log N sequentially (but all processors in parallel).
• After O(log N) time, we'll have an intermediate input of size N/log N.
• Then we'll do the tournament for the smaller input.
• Total time O(log N), N/log N processors ⇒ O(N) work!
• EREW is still enough.

function blocking_tournament-max(var A : array[0..N–1]);
  for i := 0 to N/log N – 1 pardo
    B[i] := A[i*log N];
    for j := 1 to log N – 1 do
      B[i] := max(B[i], A[i*log N + j]);
  return tournament-max(B[0..N/log N – 1]);

Algorithm 3-8: Blocking technique in maximum finding.
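
On a real shared-memory machine, the whole of Algorithm 3-8 collapses into one reduction loop; a C/OpenMP sketch (mine, assuming OpenMP 3.1+ for the max reduction). With static scheduling each thread scans one contiguous block sequentially and the runtime combines the partial maxima, which is exactly the blocking + tournament structure:

#include <limits.h>
#include <stddef.h>

int blocking_max(const int *a, size_t n)
{
    int m = INT_MIN;
    /* each thread reduces its own block, then the partials meet */
    #pragma omp parallel for schedule(static) reduction(max:m)
    for (size_t i = 0; i < n; i++)
        if (a[i] > m) m = a[i];
    return m;
}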


Raw power (raaka voima)

• Let us assume that any element could be the maximum.
• We'll prove other elements not to be the maximum; only the maximum is left.
• Initialize an array of 1's of size N (a bit for every element of the input).
• Compare all pairs simultaneously (about N^2/2 pairs).
• The smaller of a pair cannot be the maximum, thus mark it with 0 in the boolean array.
• Draws are decided according to the index (below, the one with the smaller index wins).
• Only the maximum value retains its 1.
• All stages in O(1) time, N^2/2 processors, O(N^2) work.
• Concurrent read is needed at line 4, concurrent write at lines 7 and 9.
• Only zeros are written concurrently, thus WEAK CRCW suffices.


function raw-max(var A : array[0..N–1]);      1
  for i := 0 to N–1 pardo                     2
    V[i] := 1;                                3
  for i := 0 to N–1 pardo                     4
    for j := i+1 to N–1 pardo                 5
      if A[i] < A[j] then                     6
        V[i] := 0;                            7
      else                                    8
        V[j] := 0;                            9
  for i := 0 to N–1 pardo                     10
    if V[i] ≠ 0 then                          11
      return A[i];                            12

Algorithm 3-9: Maximum with raw power.


Divide-and-conquer & raw power

• Divide-and-conquer can be used with division in more than 2 parts.
• Combining fast enough is harder.
• Using the raw-power maximum Algorithm 3-9, we can combine (find the maximum of) M results with M^2 processors in unit time.
• If we have N processors, we can combine √N subresults by raw-maximum.
• Divide input in √N parts, solve them recursively, find the maximum with raw-max.

function root-max(var A : array[0..N–1]; low, high : index);
  if (low = high) then
    return A[low];
  else
    k := high – low + 1;
    for i := 0 to √k – 1 pardo
      B[i] := root-max(A, low + i*√k, low + (i+1)*√k – 1);
    return raw-max(B[0..√k – 1]);

Algorithm 3-10: √N-divide-and-conquer maximum.


• If N is not of the form 2^(2^n), we have to refine the algorithm a bit (exercise).
• Time T(N) = T(√N) + O(1) = O(log log N), O(N) processors, O(N log log N) work.


Waterfall = blocking & divide-and-conquer & raw power

• Reduce N elements to N/log log N elements sequentially in log log N time using N/log log N processors (blocking).
• Solve the remaining N/log log N elements with N/log log N processors using Algorithm 3-10 (divide-and-conquer & raw power).
⇒ A work-optimal O(log log N) time (weak) CRCW algorithm.


Using stronger CRCW models

• STRONG CW has a ready operation for maximum.
• PRIORITY CW can solve maximum easily in O(1) time using O(N+M) processors:

function crcw_priority_max(shared var A : array[0..N–1]);
  shared var maxvalue, winnerindex;
  for i := 0 to max_val pardo
    counts[i] := –1;
  for i := 0 to N–1 pardo
    counts[A[i]] := i;
  for i := max_val to 0 by –1 pardo   // process with largest i will win
    if counts[i] >= 0 then
      maxvalue := i;
      winnerindex := counts[i];
  return (maxvalue, winnerindex);

Algorithm 3-11: Using PRIORITY CRCW for maximum.


Other similar problems

• Most of the previous algorithms can be used (with small changes) for many similar tasks.
• Especially all problems where the result is atomic and combining is easy.
• Finding, selecting, counting, sum, and, or, etc.
• Or, the algorithms can be used in the opposite direction to spread data.


Prefix sum (alkusumma)

• Input: array A[0..N–1] (or [1..N]).
• Result: array (A[0], A[0]+A[1], ..., Σ_{j=0}^{i} A[j], ..., Σ_{j=0}^{N–1} A[j]),   (3-2)
  or (0, A[0], A[0]+A[1], ..., Σ_{j=0}^{N–2} A[j])   ("0-prefix sum").   (3-3)


• E.g., (4 5 2 5 6) ⇒ (4 9 11 16 22).
• E.g., (1 0 1 1 0 0 1) ⇒ (1 1 2 3 3 3 4).
• Applications: counting, array/list compression (removing empty elements), load balancing, radix sort, graph algorithms, etc.
• Algorithm similar to maximum-for-all.
• Use blocking to make it work-optimal (exercise).
• Again, synchrony is crucial; array boundaries are more difficult if N is not a power of 2; use another array if the original is needed.

procedure prefix-sum(var A : array[0..N–1]);
  for i := 1 to log N do
    for j := 2^(i–1) to N–1 pardo
      A[j] := A[j – 2^(i–1)] + A[j];

Algorithm 3-12: Basic parallel prefix sum.
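
A C/OpenMP sketch of Algorithm 3-12 (mine, not from the handout). A real machine lacks the strict PRAM synchrony noted above, so instead of updating A in place the sketch ping-pongs between A and a caller-provided helper array:

#include <stddef.h>
#include <string.h>

/* Doubling prefix sum: O(log N) rounds, O(N log N) work.
 * tmp must have room for n values. */
void prefix_sum(long *a, long *tmp, size_t n)
{
    long *src = a, *dst = tmp;
    for (size_t d = 1; d < n; d *= 2) {
        #pragma omp parallel for
        for (size_t j = 0; j < n; j++)
            dst[j] = (j >= d) ? src[j - d] + src[j] : src[j];
        long *t = src; src = dst; dst = t;    /* swap buffers */
    }
    if (src != a)                             /* copy result back */
        memcpy(a, src, n * sizeof *a);
}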


Figure 3-4: Data movement in prefix sum. [Figure omitted: rounds of the doubling additions over the array.]


Merging and sorting algorithms

⇒ Parallel sorting can be approached in several ways (as sequential sorting).
• We'll present:
  • Raw power.
  • Mergesort (with a couple of possible approaches to merging in parallel).
  • Sampling bucket sort.
  • Radix sort.
• Later, we'll present some sorting algorithms suitable for a message-passing environment.


Parallel "bubblesort" (odd-even transposition)

• Compare-exchange odd and even pairs N times; see the sketch below.
• N/2 processors, 2N = O(N) time, O(N^2) work.
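
A C/OpenMP sketch (mine): each phase compare-exchanges disjoint pairs, so the inner loop parallelizes safely; N phases suffice for an N-element array.

#include <stddef.h>

static void cmp_exch(int *x, int *y)
{
    if (*x > *y) { int t = *x; *x = *y; *y = t; }
}

void odd_even_sort(int *a, size_t n)
{
    if (n < 2) return;
    for (size_t phase = 0; phase < n; phase++) {
        size_t start = phase % 2;          /* even, odd, even, ... */
        #pragma omp parallel for
        for (size_t i = start; i < n - 1; i += 2)
            cmp_exch(&a[i], &a[i + 1]);    /* disjoint pairs */
    }
}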


Raw power sort (by ranking)

⇒ Presents PRAM at its best and worst!
• Exploits STRONG ADD CRCW.

Compute the correct location of each element at once:
• Count how many smaller elements there are in the array.
• I.e., the rank (rankkaus, sijoitus) of each element.
• Ranks are evaluated as in raw-max: compare all pairs, increase the rank of the larger element by one (cf. zeroing the smaller in raw-max).
• Several increments of the same element at once (STRONG ADD CRCW needed).
• After ranking, we'll know the number of smaller elements for each element, i.e., the location of each element.
• Draws have to be resolved.
• O(1) time, O(N^2) processors, O(N^2) work.
⇒ Ranks can also be counted in different (more efficient) ways.


Figure 3-5: Direct sorting by ranking. [Figure omitted: input A → rank array V → A[V[i]] := A[i].]

procedure raw-sort(var A : array[0..N–1]);
  for i := 0 to N–1 pardo
    V[i] := 0;
  for i := 0 to N–1 pardo        // rank
    for j := 0 to N–1 pardo
      if A[i] < A[j] then
        V[j] := V[j] + 1;        // STRONG ADD CRCW
  for i := 0 to N–1 pardo        // sort
    A[V[i]] := A[i];

Algorithm 3-13: Sorting by raw power.
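
On shared memory, the STRONG ADD CRCW increment can be imitated with atomic additions. A C/OpenMP sketch (mine; keys assumed distinct as in the text, and the output goes to a separate array instead of in place):

#include <stdlib.h>

/* O(N^2) work, like Algorithm 3-13; "#pragma omp atomic" stands
 * in for the STRONG ADD concurrent write. */
void raw_sort(const int *a, int *out, size_t n)
{
    size_t *v = calloc(n, sizeof *v);      /* ranks, initially 0 */
    #pragma omp parallel for collapse(2)
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (a[i] < a[j]) {
                #pragma omp atomic         /* concurrent increments */
                v[j]++;
            }
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        out[v[i]] = a[i];                  /* place by rank */
    free(v);
}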


Mergesort (lomituslajittelu)

• The actual sort is trivial, presented earlier.
• Merging in parallel is interesting; we'll present a few examples.
• Merging in O(N) time (sequentially): O(N) time full sort (O(N^2) work).
• Merging in O(log N) time: O(log^2 N) time sort.
• Merging in O(log log N) time: O(log N log log N) time sort.
• Merging in O(1) (amortized) time: O(log N) time, O(N log N) work.

procedure mergesort(var A : array; first, last : index);
  if (last–first) > 0 then
    pardo
      mergesort(A, first, (last+first)/2);
      mergesort(A, (last+first)/2+1, last);
    merge(A, first, (last+first)/2, (last+first)/2+1, last);

Algorithm 3-14: Mergesort.


Merging by ranking

• We assume elements to be distinct (use the index to resolve draws).
• Let us define the rank of an element x in an array A[0..N–1] as the number of smaller elements in array A:

      rank(x, A) := max i : A[i] ≤ x   (3-4)

⇒ Computing the rank is much easier if A is in increasing order (sorted).
• Using one processor: binary search in time O(log N).
• With P processors, we can divide into P+1 parts (P division points) instead of two.
• Thus parallel "binary search" in time

      Tp(N, P) = Ts(N/(P+1)) + O(1) = O(log_{P+1} N) = O(log N / log P).   (3-5)

• One processor finds the correct interval, the others follow. Exercise.


• Using raw power, we can find one rank in O(1) time using O(N) processors.
• If needed, we can refine this with one processor writing (instead of returning) and the rest of the processors reading the result.
• CREW suffices.
• Later we'll show how to do this more efficiently.

function raw-rank(x : element; var A : array[0..N–1]);
  if x < A[0] then
    return 0;
  else if x ≥ A[N–1] then
    return N;
  else
    for i := 0 to N–2 pardo
      if A[i] ≤ x and x ≤ A[i+1] then
        return i+1;

Algorithm 3-15: Rank in unit time by raw power.


Merging with ranking

• Input: readily sorted arrays A and B (often halves of the same array).
• Rank of element A[i] in array A is i.
• Rank of element A[i] in array B is rank(A[i], B).
• Rank of element A[i] in the final array is i + rank(A[i], B).
• We can place every element to the final array independently!
⇒ For the whole merge, we'll need the rank of each element of A in B, and the rank of each element of B in A.
• This can easily be converted to restore elements back to A and B and/or to merge halves of a single array.

function rank-merge(A, B : array[0..N–1]) : array[0..N*2–1];
  for i := 0 to N–1 pardo
    C[i + rank(A[i], B)] := A[i];
    C[i + rank(B[i], A)] := B[i];
  return C;

Algorithm 3-16: Direct merge by rank.
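
A C sketch of Algorithm 3-16 (mine). rank_of() is a binary search returning the number of elements smaller than x, per the prose definition above; with distinct elements every slot of C is written exactly once, so the loop parallelizes directly.

#include <stddef.h>

/* number of elements of sorted b[0..n-1] that are < x */
static size_t rank_of(int x, const int *b, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (b[mid] < x) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* merge sorted a[] and b[] (distinct values) into c[0..2n-1] */
void rank_merge(const int *a, const int *b, size_t n, int *c)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        c[i + rank_of(a[i], b, n)] = a[i];
        c[i + rank_of(b[i], a, n)] = b[i];
    }
}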


• We need CREW PRAM since N simultaneous ranking processes read the same array (using binary search) in parallel (though only a constant penalty on EREW).
• If parallelization and synchronization are made carefully, the merging can be done in place.
• But we need N processors, all of which use O(1) helper space; thus it actually uses O(N) extra space.
• Later, with fewer processors, we need O(N) extra space anyway and have to move elements to/from a helper array.
• Ω(N log N) work.
• More accurate analysis of rank-merge-sort with P = N, P = N^2, and arbitrary P as an exercise.


Faster merging algorithms

Merging in O(log N) time, O(N) work

• Input: arrays A and B (of length N).
• Choose regularly N/log N elements of B.
• Rank each of these (with sequential binary search) in A (one element per processor, total N/log N processors).
• Now we have N/log N pairs of subsequences, each of which can be merged sequentially:

      a_1 ... a_{j_1}        and  b_1 ... b_{log n},         where j_i = rank(b_{i·log n}, A)   (3-6)
      a_{j_1+1} ... a_{j_2}  and  b_{log n+1} ... b_{2·log n}
      ...
      a_{j_{n/log n – 1}+1} ... a_n  and  b_{n–log n+1} ... b_n

• From the section boundaries, we know the location of the merged section in the new array – merging tasks are independent.
• On average, the lengths are O(log N), thus the whole algorithm runs in O(log N) time.


• Unfortunately, the subsequences of A can be longer if the data is uneven.
• Either:
  • Symmetric ranking & partitioning:
    • Choose N/log N elements of both A and B.
    • Rank each of these (with binary search) in the other array.
    • Now we have to merge 2×N/log N pairs of sequences of length at most log N.
• Or:
  • Repartition the (few) too-large sequences.


Merging in O(log log N) time, O(N) proc, O(N log log N) work

• Exploits the more efficient two-step ranking of Algorithm 3-17.
• Take regularly √N samples of each array A and B.
• Rank samples of A in samples of B (not in the whole B!).
• √N ranks on √N elements with N proc in O(1) time (raw-rank).
• Same for samples of B in A (as in symmetric ranking above).
• Now we have 2√N subsequences, but the boundaries are still inaccurate (we only know in which block of the other array each sample belongs).
• Rank each sample of A in the subsequence of B it belongs to.
• 2√N ranks on √N elements with N procs in O(1) time (raw-rank).
• Same for samples of B in A.
• Now we have 2√N subsequences with accurate boundaries in O(1) time.
• Apply the algorithm recursively to each of the 2√N subsequences (of average length √N/2) with √N/2 processors for each subsequence.
• T(N) = T(√N) + O(1) = O(log log N).


function root-raw-rank(x : element; var A : array[0..N–1]) : index;
  if x < A[0] then
    return 0;
  else if x ≥ A[N–1] then
    return N;
  else
    for i := 0 to √N–1 pardo
      B[i] := A[i*√N];
    block := raw-rank(x, B);                           // O(1) time with √N proc
    brank := raw-rank(x, A[block*√N..(block+1)*√N]);   // O(1)
    return block*√N + brank;

Algorithm 3-17: Rank in O(1) time with √N processors. (TODO: check indices.)


Merging in O(log log N) time, O(N) work

• Work- and time-optimal merge!
• N/log log N processors.
• Partition A and B into blocks of size log log N.
• Rank the boundaries (N/log log N of them) in each other with the previous algorithm (O(log log N) time).
• Rank each of the boundaries sequentially within the corresponding subsection of length O(log log N) (O(log log log N) time with binary search).
• Now we have accurate boundaries (ranks) of 2×N/log log N pairs of sequences of length at most log log N.
• Merge each pair of sequences independently using a sequential algorithm (O(log log N) time).
• Yields an O(log N log log N) time, O(N log N) work sorting algorithm.


Odd-even merge

• Batcher 68: odd-even merge and bitonic merge.
• Input: array halves A and B.
• In practice, halves of the same array are named A and B for easier reference.
• Merge (recursively) odd elements of A and odd elements of B; and merge (recursively) even elements of A and even elements of B.
• Merging is done in place.


• After these merges, consecutive pairs may be out of order; we'll check the order of each pair and swap if needed.
• Merge in time: T(N) = T(N/2) + O(1) = O(log N), O(1) space.

Figure 3-6: Odd-even merge [5]. [Figure omitted.]


Figure 3-7: Recursion in odd-even merge [5]. [Figure omitted.]


procedure Odd-even_merge(A : array[0..N–1]);
  pardo
    Odd-even_merge(halves of odd elements of A);
    Odd-even_merge(halves of even elements of A);
  par i := 1 to N–2 by 2 do
    compare-exchange(A[i], A[i+1]);

Algorithm 3-18: Parallel odd-even merge informally.

procedure oemerge(var S : array; First, Length, Stride : index);
  par i := 0 to 1 do
    oemerge(S, First + i * Stride, Length/2, Stride * 2);
  par i := 1 to Length/2 – 1 do
    j := i * 2;                       // j := 2 to Length–2 by 2
    if S[First + (j–1) * Stride] > S[First + j * Stride] then
      swap(S[First + (j–1)*Stride], S[First + j*Stride]);

Algorithm 3-19: Parallel in-place odd-even merge procedure (FPM).


OEM-sort performance

• Mergesort with odd-even merge exploits at most N/2 processors, executes in O(log^2 N) time, and thus uses O(N log^2 N) work, which is inefficient by a factor of O(log N).
⇒ We can improve the efficiency by reducing the number of processors.
• If there are fewer than N/2 processors, we can switch to sequential sort/merge as soon as we run out of processors.
• The recursive sort branches according to P.
• Also, merging can run out of processors; thus the merge will also branch according to P.
• Time complexity will be

      T(N, P) = O((N/P) × (log^2 P + log(N/P))).   (3-7)


• In theory, we cannot exploit very many processors efficiently.
• E.g., to ensure 50% efficiency, we would have to settle for

      P ≤ 2^√(log N).   (3-8)

• The same plotted (Figure 3-8):
• In practice, though, we can efficiently use slightly more processors, as the slow recursion tails are removed if N is clearly larger than P.
• Measured performance on F-PRAM: Figure 3-9.


Figure 3-8: Maximum efficiently useful P as a function of N as predicted by Formula (3-8), odd-even mergesort, logarithmic x-axis. [Plot omitted: maximum efficient input size N (1.04×10^6 ... 6.87×10^10) versus number of processors (16 ... 262144).]


Figure 3-9: Speedup of odd-even mergesort as a function of the number of processors for different input sizes. Both scales are logarithmic. [Plot omitted: speedup 1–1024 versus 1–4096 processors, curves for N = 256 ... 262144, with linear, 50% and 10% efficiency reference lines.]


Cole's optimal parallel mergesort (1986)

• The first almost practical time- and work-optimal O(log N) sort.
• The first asymptotically optimal one was Ajtai, Komlós, Szemerédi (AKS) 1983.
⇒ In fact, we do not need O(1) time merging; a merging with O(1) amortized cost for each phase is sufficient.
• The merge operations in different stages of the sort can be pipelined.
• We collect samples (border values, "cover") in different stages.
• We collect the ranks of the samples in halves of the data.
• According to the ranks of the samples we can do the next stage faster.
• Because of large constants, Cole's sort is faster than odd-even mergesort (or bitonic) only if N > 10^21 [6].
• See, e.g., JáJá or Akl.


Sampling parallel bucketsort (kaukalolajittelu)

• Let us assume that N >> P.
• Each processor samples its own part of the array.
• Samples are sorted in some fast (parallel) way.
• According to the samples, the processors decide P–1 division points (values).
• Each processor partitions its part of the input to the other processors according to the division points.
• Each processor receives one subsection of the input from all others.
• Each processor sorts its own section.
• In the shared memory model, we need some amount of additional space.
• In the message passing model, we need all-to-all communication.


Radix sort in parallel (kantalukulajittelu)

⇒ Probably the fastest sequential sort if keys are reasonably short and the input is large.
• Sequential time O((m/r)·(n + 2^r)), where m is the key size (in bits) and r is the radix size (bits).
• Sorting in stages:
  • Divide the key in parts.
  • Sort according to the least significant part.
  • Sort according to the next least significant part.
  • ...
  • Sort according to the most significant part.
• The sorts have to be stable, i.e., the order of elements with the same subkey has to be sustained.

Figure 3-10: Sorting in stages. [Figure omitted: passes over three-digit keys, least significant digit first.]


• As each subkey is short (a reasonable number of different possible subkeys), we could use bucketsort.
• As we have a lot of keys (a lot for each bucket), the use of lists in bucketsort gets slower, thus we'll use a slightly different method:
  • First count the number of occurrences of each subkey.
  • Compute a 0-prefix sum of the count array.
  • The prefix sum tells us into which position each "bucket" will be stored.
  • Contents of each "bucket" will be stored in original order.
  • The prefix-sum location (bucket pointer) is increased after each assignment.
• If/when keys are not integers, we'll use the bit representation of the keys: r bits at a time yields 2^r buckets; r is typically 12-20.


Merging and sorting algorithms

Fig

ure 3

-11

:S

equential radix sort using a histogram.

12

3

34

5

54

3

23

0

53

3

32

5

34

3

23

1

32

4

01

23

45

11

04

12

Occurrences

01

23

45

01

22

67

0-prefi

x sum

23

1

12

3

54

3

23

0

53

3

34

3

32

4

34

5

32

5

013 245678

013 245678

T1:

T2:

R:

R:

1.2.

3.
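
A sequential C sketch of one counting pass (mine, not from the handout), mirroring steps 1-3 of Figure 3-11. The full sort applies it with shift = 0, r, 2r, ..., and stability comes from scanning the input in order at step 3:

#include <stdlib.h>

/* One stable radix-sort pass on r-bit digits: histogram,
 * 0-prefix sum, then placement in original order. */
void radix_pass(const unsigned *in, unsigned *out, size_t n,
                unsigned shift, unsigned r)
{
    size_t buckets = (size_t)1 << r;
    unsigned mask = (unsigned)(buckets - 1);
    size_t *count = calloc(buckets, sizeof *count);

    for (size_t i = 0; i < n; i++)            /* 1. occurrences */
        count[(in[i] >> shift) & mask]++;

    size_t sum = 0;                           /* 2. 0-prefix sum */
    for (size_t b = 0; b < buckets; b++) {
        size_t c = count[b];
        count[b] = sum;                       /* start of bucket b */
        sum += c;
    }

    for (size_t i = 0; i < n; i++)            /* 3. stable placement */
        out[count[(in[i] >> shift) & mask]++] = in[i];

    free(count);
}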


Parallelization

• If several processors count occurrences in parallel, the prefix sum needs to be counted over all P×2^r buckets.
• Result like in Figure 3-12, but a linear (sequential) scan is too slow.

Figure 3-12: Linear scan for radix sort [Culler & al]. [Figure omitted.]


Prefix in three stages

• Prefix sum each row to the last column (2^r × P / P = 2^r time).
• Broadcast all values of the last column to all processors (2^r; or skip in CREW).
• Prefix sum the last column.
• Evaluate final prefix sums by adding also the previous row sums (2^r).
• Assignment stage of the local input as in the sequential version.
• Processes can work independently.
⇒ Comparison of different sorts on CM-5 [Culler & al]: Figure 3-13.