A Coherence Protocol for Optimizing Global Shared Data Accesses

Posted on 24-Feb-2016

DESCRIPTION

A Coherence Protocol for Optimizing Global Shared Data Accesses. Jeeva Paudel, University of Alberta, Canada; J. Nelson Amaral, University of Alberta, Canada; Olivier Tardieu, IBM T. J. Watson, USA. Shared Variables are Fundamental Abstractions in Parallel and Distributed Programming.

TRANSCRIPT

1

A Coherence Protocol for Optimizing Global Shared Data Accesses

Jeeva Paudel, University of Alberta, Canada

J. Nelson Amaral, University of Alberta, Canada

Olivier Tardieu, IBM T. J. Watson, USA

2

Shared Variables are Fundamental Abstractions in Parallel and Distributed Programming

3

[Figure: Read and Write accesses to shared data across Node 1 … Node N. Panels: Monte Carlo Estimation of PI; 5-Point Stencil Operations (weights -1, -1, 4, -1, -1); Turing Ring Simulation. The panel diagrams span Node 1 through Node 4.]

4

Challenge: Minimize Communication Latency

[Figure: 5-point stencil (weights -1, -1, 4, -1, -1) split across Node 1 and Node 2.]

Ghost Cell Pattern for Data Sharing
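The ghost cell pattern named above can be illustrated with a small sketch. It assumes a grid split by rows between two nodes and the stencil weights from the figure (4 at the center, -1 at each of the four neighbors). The names `StencilNode`, `exchange_ghosts`, and `stencil_global` are hypothetical and do not come from the slides.

```python
def stencil_row(above, row, below):
    """Apply the 5-point stencil (4 at the center, -1 at each neighbor) to one row's interior."""
    return [4 * row[c] - above[c] - below[c] - row[c - 1] - row[c + 1]
            for c in range(1, len(row) - 1)]


def stencil_global(grid):
    """Reference result: the stencil over the interior rows of one undivided grid."""
    return [stencil_row(grid[r - 1], grid[r], grid[r + 1])
            for r in range(1, len(grid) - 1)]


class StencilNode:
    """Hypothetical node that owns a block of rows plus ghost copies of neighbor rows."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.ghost_above = None  # copy of the upper neighbor's last row
        self.ghost_below = None  # copy of the lower neighbor's first row

    def local_stencil(self):
        """Compute the stencil using only node-local data (owned rows and ghost rows)."""
        padded = []
        if self.ghost_above is not None:
            padded.append(self.ghost_above)
        padded.extend(self.rows)
        if self.ghost_below is not None:
            padded.append(self.ghost_below)
        return stencil_global(padded)


def exchange_ghosts(upper, lower):
    """One bulk boundary exchange per step instead of one remote read per cell."""
    upper.ghost_below = list(lower.rows[0])
    lower.ghost_above = list(upper.rows[-1])
```

After `exchange_ghosts`, each node computes its part of the stencil without further remote reads, and the concatenated results match the stencil over the undivided grid.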

Two-sided Message: Message id, Data payload

One-sided Message: Address, Data payload

[Figure: Network Interface, Host CPU, and Memory]

Remote Direct Memory Access (RDMA)
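The difference between the two message formats above can be sketched as a toy model. It assumes that a two-sided message is matched by its id on the receiving host CPU, while a one-sided message names the target address directly. `Endpoint`, `send`, and `rdma_put` are hypothetical names for illustration, not a real network API.

```python
class Endpoint:
    """Hypothetical node: local memory, an inbox, and a count of host-CPU work."""

    def __init__(self, size):
        self.memory = [0] * size
        self.inbox = []      # pending two-sided messages
        self.cpu_events = 0  # times the host CPU had to act on incoming data

    def recv(self, message_id, address):
        """Two-sided receive: the host CPU matches the id and places the payload."""
        for index, (mid, payload) in enumerate(self.inbox):
            if mid == message_id:
                del self.inbox[index]
                self.memory[address:address + len(payload)] = payload
                self.cpu_events += 1
                return True
        return False


def send(dest, message_id, payload):
    """Two-sided send: the message carries a message id and a data payload."""
    dest.inbox.append((message_id, list(payload)))


def rdma_put(dest, address, payload):
    """One-sided put: the message carries the target address and a data payload."""
    dest.memory[address:address + len(payload)] = payload
```

In this model the one-sided put never increments `cpu_events`, which is the property that makes one-sided access attractive for remote shared-variable reads and writes.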

Communication Optimization Techniques

Transfer Referencing Task to SV Home: atomic at (p) sv();

Remote Task Creation for SV Access: atomic at (p) async sv();

[Figure: shared-variable access patterns across Node 1 … Node N, built up over several slides]

Write-Once / Read-Mostly: Replication

Result Object: Collecting Sum Reducer

General Read-Write: SV state kept at each node

11

Coordinate Multiple Protocols for Different Access Patterns

A static data management scheme may not yield performance improvements on varied data access patterns.

[Figure: nodes holding SV state; one box is labeled GR state]

Composite Protocols

1. One-sided PUT/GET to SV home

2. Migrate Referencing Task to SV home

3. Directory-based Protocol

Directory Entries

〈Shared Variable (SV)〉 ↔ directory entry: a dirty bit followed by n bits, one per node

0 0 0 0 ⋅⋅⋅ 0

0 1 0 0 ⋅⋅⋅ 0 SV is only in its allocated node

1 0 0 1 ⋅⋅⋅ 0 Only one node can have a dirty SV

0 0 1 1 ⋅⋅⋅ 1 Multiple nodes may have clean SVs
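A directory entry of this shape can be sketched as a small data structure. This is a minimal illustration, assuming the dirty bit is printed first and is followed by one presence bit per node, as in the examples above. The class name `DirectoryEntry` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry: one dirty bit plus one presence bit per node."""
    n_nodes: int
    dirty: bool = False
    present: list = field(default_factory=list)

    def __post_init__(self):
        # Start with no node holding the SV unless presence bits were supplied.
        if not self.present:
            self.present = [False] * self.n_nodes

    def holders(self):
        """Return the ids of the nodes whose presence bit is set."""
        return [node for node, bit in enumerate(self.present) if bit]

    def is_valid(self):
        """A dirty SV may be held by exactly one node; clean SVs may be shared."""
        return (not self.dirty) or len(self.holders()) == 1

    def bits(self):
        """Render the entry as in the slides: dirty bit first, then the n presence bits."""
        return " ".join(str(int(b)) for b in [self.dirty] + self.present)
```

For example, an entry for four nodes with only node 0 holding a clean copy renders as `0 1 0 0 0`, and setting the dirty bit while two presence bits are set makes `is_valid()` return `False`.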

14

[Figure: nodes connected by a network; each node holds SV state.]

Home node for SV: the node where SV is allocated

Remote node for SV: a node whose memory does not store SV

18

[Figure: nodes connected by a network, each holding SV state; node j and node i are labeled.]

Example: Home node is j

Read/Write activity at node i

19

Read Miss at node i

[Figure: a Request travels over the network from node i to home node j.]

20

Read Miss at node i

Case 1: SV is in node j in clean state.

Directory entry before the request: 0 0 1 0 ⋅⋅⋅ 0 0

Data Copy

Directory entry after the data copy: 0 0 1 1 ⋅⋅⋅ 0 0

[Figure: node j and node i connected by the network; the labels i and j mark the directory bits of those nodes.]

24

Read Miss at node i

Case 2: SV is copied in node j and is in dirty state.

Directory entry before the request: 1 0 1 0 ⋅⋅⋅ 0 0

Write back

Directory entry after the write back: 0 0 1 0 ⋅⋅⋅ 0 0

SV Copy

Directory entry after the SV copy: 0 0 1 1 ⋅⋅⋅ 0 0

[Figure: node j and node i connected by the network; the labels i and j mark the directory bits of those nodes.]
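The two read-miss cases can be sketched as one handler at the home node. This is a simplified model, assuming a single shared variable, a backing `memory` value at the home, and per-node cached copies. `HomeDirectory`, `install`, and `read_miss` are hypothetical names, and the sketch is not the authors' implementation.

```python
class HomeDirectory:
    """Hypothetical home-node directory for one shared variable (SV)."""

    def __init__(self, n_nodes, memory_value):
        self.dirty = False
        self.present = [False] * n_nodes
        self.memory = memory_value  # backing copy of the SV at the home node
        self.cached = {}            # node id -> that node's cached copy

    def bits(self):
        """Dirty bit first, then one presence bit per node."""
        return " ".join(str(int(b)) for b in [self.dirty] + self.present)

    def install(self, node, value, dirty=False):
        """Setup helper: give `node` a copy; a dirty copy is the only copy."""
        if dirty:
            self.present = [False] * len(self.present)
            self.cached = {}
        self.present[node] = True
        self.cached[node] = value
        self.dirty = dirty

    def read_miss(self, requester):
        """Serve a read miss; return the value and the protocol steps taken."""
        steps = ["request"]
        if self.dirty:
            # Case 2: the single dirty holder writes back, then the bit is cleared.
            owner = self.present.index(True)
            self.memory = self.cached[owner]
            self.dirty = False
            steps.append("write back")
        # Case 1 (or Case 2 after the write back): copy the clean SV to the requester.
        self.cached[requester] = self.memory
        self.present[requester] = True
        steps.append("copy")
        return self.memory, steps
```

With four nodes, j = 1 and i = 2, a clean entry `0 0 1 0 0` becomes `0 0 1 1 0` after the miss, and a dirty entry `1 0 1 0 0` passes through the write back before reaching the same final state.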

28

Performance Evaluation

29

What do We Want in Benchmarks?

Communication Patterns

Test Composite Protocols

Data Structures / Granularities

30

Best Hand Coded Versions

Performance Comparison

• X10’s Shared Memory Protocol (X10-Mem)
• Directory-based Protocol (GR-Mem)
• Combination (X10-Mem/GR-Mem)

31

Code- and Data-Layout Restructurings

Patterns of Shared Variable Accesses

A Read-mostly: Replicate node-local copies to reduce remote access

B Write-mostly: Leave intact; localize write access to the site of allocation

C Aggregate Data: Refactor into individual objects for element-wise access to reduce false sharing

D Write-Following-Read from each place: Use a Collecting Sum Reducer to reduce frequent remote writes

E Write-Once: Replicate node-local copies to reduce remote access
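Restructuring D can be illustrated with a Monte Carlo estimate of PI, one of the benchmarks in the deck. The sketch below assumes that each place accumulates hits in a node-local partial sum and that the partial sums are combined once at the end. `CollectingSumReducer` and `estimate_pi` are hypothetical names here, and places are simulated sequentially rather than run in parallel.

```python
import random


class CollectingSumReducer:
    """Hypothetical reducer: each place adds locally; partial sums are combined once."""

    def __init__(self, n_places):
        self.partial = [0] * n_places
        self.remote_writes = 0  # messages that would cross the network

    def add(self, place, amount):
        """Node-local update; no communication is modeled here."""
        self.partial[place] += amount

    def collect(self):
        """Combine the partial sums, modeled as one remote write per place."""
        self.remote_writes += len(self.partial)
        return sum(self.partial)


def estimate_pi(n_places, samples_per_place, seed=0):
    """Estimate PI by sampling points in the unit square at each place."""
    rng = random.Random(seed)
    reducer = CollectingSumReducer(n_places)
    for place in range(n_places):
        for _ in range(samples_per_place):
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:
                reducer.add(place, 1)
    hits = reducer.collect()
    return 4.0 * hits / (n_places * samples_per_place), reducer.remote_writes
```

Without the reducer, every hit would be a remote write to the shared counter; with it, the modeled number of remote writes equals the number of places.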

32

Code Restructurings in Hand-coded Versions

Benchmarks: Code Restructurings (columns A B C D E)

FSSimpleDist ✔ ✔
K-Means ✔
MontePiDist ✔
N-Body ✔ ✔ ✔
Jacobi ✔
RayTracer ✔
Unbalanced Tree Search ✔ ✔ ✔
Linear Regression ✔
Delaunay Mesh Generation (DMG) ✔ ✔ ✔
Delaunay Mesh Refinement (DMR) ✔ ✔ ✔

33

Platform

Heldar (Opteron)
No. nodes: 16
Cores per node: 8
Memory per node: 8 GB

• CentOS Linux 6.0
• 1 Node = 2 HyperTransport connected CPUs
• QuadCore Opteron Processors

[Figure, built up over several slides: Speedup Over Sequential, using 128 workers. Y-axis: 0 to 90. Benchmarks on the x-axis: FSSimpleDist, K-Means, MontePiDist, N-Body, Jacobi, RayTracer, UTS, LinearRegression, DMG, DMR. Series: X10-Mem, Manual, GR-Mem; the final frame shows X10-Mem, Manual, GR-Mem/X10-Mem.]


[Figure: nodes, each holding GR state]

1. One-sided PUT/GET to GR home

2. Migrate Referencing Task to GR home

3. Directory-based Protocol

[Figure: K-Means speedup with 128 workers; bars labeled 68 and 79.]

Benchmarks: Code Restructurings (columns A B C D E)

FSSimpleDist ✔ ✔
K-Means ✔
MontePiDist ✔
N-Body ✔
Jacobi ✔
RayTracer ✔
Unbalanced Tree Search ✔ ✔ ✔
Linear Regression ✔
Delaunay Mesh Generation (DMG)

Delaunay Mesh Refinement (DMR)

✔ ✔ ✔

Applicable to (A)PGAS Languages

Chapel, Fortress

41

Questions?
