Post on 24-Feb-2016


Page 1

A Coherence Protocol for Optimizing Global Shared Data Accesses

Jeeva Paudel, University of Alberta, Canada
J. Nelson Amaral, University of Alberta, Canada
Olivier Tardieu, IBM T. J. Watson, USA

Page 2

Shared Variables are Fundamental Abstractions in Parallel and Distributed Programming

Page 3

[Figure: example applications that use shared variables: Monte Carlo Estimation of PI, 5-Point Stencil Operations, and Turing Ring Simulation. Diagram labels: Read, Write, Node 1 … Node N, Node 1 to Node 4, and stencil weights -1 and 4.]

Page 4

Challenge: Minimize Communication Latency

Page 5

Communication Optimization Techniques

[Figure: Ghost Cell Pattern for Data Sharing between Node 1 and Node 2, drawn on a stencil with weights -1 and 4.]

Two-sided Message: Message id + Data payload
One-sided Message: Address + Data payload
Remote Direct Memory Access (RDMA): Network Interface, Host CPU, Memory
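The difference between the two message formats on this slide can be illustrated with a small simulation. This is a minimal sketch, not the authors' runtime: the `Node` class, its `posted_receives` table, and the `cpu_interventions` counter are all hypothetical names introduced here. It assumes that a two-sided message needs the receiving host CPU to match the message id against a posted receive, while a one-sided message carries the target address and can be written straight into memory, as RDMA hardware would do.

```python
from dataclasses import dataclass, field


@dataclass
class TwoSidedMessage:
    message_id: int   # matched by the receiver against a posted receive
    payload: bytes


@dataclass
class OneSidedMessage:
    address: int      # offset in the target node's memory
    payload: bytes


@dataclass
class Node:
    """Hypothetical node model: flat memory plus a table of posted receives."""
    memory: bytearray = field(default_factory=lambda: bytearray(16))
    posted_receives: dict = field(default_factory=dict)  # message_id -> offset
    cpu_interventions: int = 0

    def post_receive(self, message_id: int, offset: int) -> None:
        self.posted_receives[message_id] = offset

    def deliver_two_sided(self, msg: TwoSidedMessage) -> None:
        # The host CPU takes part: it matches the id, then copies the payload.
        self.cpu_interventions += 1
        offset = self.posted_receives.pop(msg.message_id)
        self.memory[offset:offset + len(msg.payload)] = msg.payload

    def deliver_one_sided(self, msg: OneSidedMessage) -> None:
        # Modeled as a direct write by the network interface; no CPU step.
        end = msg.address + len(msg.payload)
        self.memory[msg.address:end] = msg.payload


node = Node()
node.post_receive(message_id=7, offset=0)
node.deliver_two_sided(TwoSidedMessage(message_id=7, payload=b"ab"))
node.deliver_one_sided(OneSidedMessage(address=4, payload=b"cd"))
```

In this model only the two-sided delivery increments `cpu_interventions`, which is the cost that one-sided PUT/GET operations aim to avoid.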

Page 6

Communication Optimization Techniques

Transfer Referencing Task to SV Home: atomic at (p) sv();

Page 7

Communication Optimization Techniques

Remote Task Creation for SV Access: atomic at (p) async sv();
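The two X10 forms above differ in whether the referencing task waits for the access at the SV home. The following is a rough Python analogy, not X10 semantics: the `Place` class and its methods are hypothetical, and a single-worker executor stands in for the home place so that accesses to its counter are serialized.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class Place:
    """Hypothetical stand-in for the home place of a shared variable."""

    def __init__(self) -> None:
        self.counter = 0
        # One worker: tasks at this place run one at a time, in order.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def _sv(self) -> int:
        self.counter += 1
        return self.counter

    def at(self) -> int:
        # Analogy for "at (p) sv()": run at the home and wait for the result.
        return self._executor.submit(self._sv).result()

    def at_async(self) -> Future:
        # Analogy for "at (p) async sv()": create the remote task and return.
        return self._executor.submit(self._sv)

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)


home = Place()
first = home.at()                                # blocks until done
pending = [home.at_async() for _ in range(3)]    # returns immediately
results = [f.result() for f in pending]
home.shutdown()
```

The synchronous form pays a full round trip per access; the asynchronous form lets the caller overlap other work with the remote access and collect results later.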

Page 8

[Figure: a Write-Once / Read-Mostly shared variable accessed by Node 1 … Node N, handled by Replication.]

Page 9

[Figure: a Result Object shared by Node 1 … Node N, handled by a Collecting Sum Reducer.]

Page 10

[Figure: General Read-Write access across Node 1 … Node N, with SV state held at each node.]

Page 11

Coordinate Multiple Protocols for Different Access Patterns

A static data management scheme may not yield performance improvements on varied data access patterns

Page 12

Composite Protocols

[Figure: nodes holding SV state and GR state.]

1. One-sided PUT/GET to SV home
2. Migrate Referencing Task to SV home
3. Directory-based Protocol

Page 13

Directory Entries

Each Shared Variable (SV) ↔ one directory entry: a dirty bit followed by n bits, one per node, for example 0 0 0 0 ⋅⋅⋅ 0

0 1 0 0 ⋅⋅⋅ 0: SV is only in its allocated node
1 0 0 1 ⋅⋅⋅ 0: Only one node can have a dirty SV
0 0 1 1 ⋅⋅⋅ 1: Multiple nodes may have clean SVs
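The directory entry described here can be written down as a small data structure. This is a minimal sketch under assumptions: the class name `DirectoryEntry`, its methods, and the choice to print node 0 leftmost are illustrative and not taken from the slides. It keeps one dirty bit plus one presence bit per node and checks the rule that a dirty SV may be held by only one node.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry: a dirty bit plus n presence bits."""
    n_nodes: int
    dirty: bool = False
    presence: int = 0  # bit k set means node k holds a copy of the SV

    def add_holder(self, node: int) -> None:
        self.presence |= 1 << node

    def holders(self) -> list:
        return [k for k in range(self.n_nodes) if (self.presence >> k) & 1]

    def is_consistent(self) -> bool:
        # A dirty SV may live on exactly one node; clean SVs may be shared.
        return (not self.dirty) or len(self.holders()) == 1

    def bits(self) -> str:
        # Dirty bit first, then one bit per node (node 0 leftmost).
        node_bits = [str((self.presence >> k) & 1) for k in range(self.n_nodes)]
        return " ".join([str(int(self.dirty))] + node_bits)


entry = DirectoryEntry(n_nodes=4)
entry.add_holder(1)
entry.add_holder(2)
shared_clean = entry.bits()          # "0 0 1 1 0": two clean copies

owned = DirectoryEntry(n_nodes=4, dirty=True)
owned.add_holder(2)
owned_dirty = owned.bits()           # "1 0 0 1 0": one dirty copy
```

A bit vector keeps the entry compact (n + 1 bits per shared variable) and makes "who holds a copy" a constant-time lookup at the home node.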

Page 14

[Figure: a single node holding SV state.]

Page 15

[Figure: four nodes, each holding SV state, connected by a network.]

Page 17

[Figure: four nodes, each holding SV state, connected by a network.]

Home node for SV: the node where SV is allocated
Remote node for SV: a node whose memory does not store SV

Page 18

[Figure: four nodes, each holding SV state, connected by a network; nodes j and i are marked.]

Example: Home node is j

Read/Write activity at node i

Page 19

Read Miss at node i

[Figure: Node j, Node i, and two other nodes, each holding SV state, connected by a network; a Request message is shown.]

Page 20

Read Miss at node i
Case 1: SV is in node j in clean state.

Page 21

Directory entry: 0 0 1 0 ⋅⋅⋅ 0 (the bits for nodes i and j are labeled in the figure)

Page 22

Data Copy; directory entry still 0 0 1 0 ⋅⋅⋅ 0

Page 23

Directory entry: 0 0 1 1 ⋅⋅⋅ 0
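The Case 1 sequence (request, data copy, presence bit set for node i) can be traced in a few lines. This is a sketch under assumptions, not the protocol implementation: the `Directory` class, its fields, and `read_miss_clean` are hypothetical names, and the network is reduced to a dictionary of per-node copies.

```python
class Directory:
    """Hypothetical home-node bookkeeping for one shared variable."""

    def __init__(self, n_nodes: int, home: int, value) -> None:
        self.home = home
        self.dirty = False
        self.present = [False] * n_nodes
        self.present[home] = True
        self.copies = {home: value}  # node id -> that node's copy of the SV


def read_miss_clean(directory: Directory, requester: int):
    """Serve a read miss when the SV is clean at its home node."""
    if directory.dirty:
        raise ValueError("SV is dirty; this sketch covers the clean case only")
    value = directory.copies[directory.home]   # data copy from the home node
    directory.copies[requester] = value
    directory.present[requester] = True        # record the new sharer
    return value


d = Directory(n_nodes=4, home=1, value=42)
before = list(d.present)         # only the home node holds the SV
got = read_miss_clean(d, requester=2)
after = list(d.present)          # home node and requester both hold it
```

The dirty bit stays clear throughout, so later readers can be served the same way without invalidating anyone.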

Page 24

Read Miss at node i
Case 2: SV is copied in node j and is in dirty state.

Directory entry: 1 0 1 0 ⋅⋅⋅ 0 (the bits for nodes i and j are labeled in the figure)

Page 25

Write back; directory entry still 1 0 1 0 ⋅⋅⋅ 0

Page 26

SV Copy; directory entry 0 0 1 0 ⋅⋅⋅ 0

Page 27

Directory entry: 0 0 1 1 ⋅⋅⋅ 0
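Case 2 adds a write-back step before the copy. The sketch below follows the order shown on the slides (write back, clear the dirty bit, copy the SV, set the requester's bit). It is illustrative only: `SharedVariable`, its fields, and `read_miss_dirty` are hypothetical names, and "memory" stands for the home node's backing store.

```python
class SharedVariable:
    """Hypothetical state for one SV whose only copy is dirty."""

    def __init__(self, n_nodes: int, owner: int, memory_value, dirty_value):
        self.dirty = True
        self.present = [False] * n_nodes
        self.present[owner] = True
        self.memory = memory_value           # stale value in home memory
        self.copies = {owner: dirty_value}   # the owner's modified copy


def read_miss_dirty(sv: SharedVariable, requester: int):
    """Serve a read miss when exactly one node holds a dirty copy."""
    if not sv.dirty:
        raise ValueError("SV is clean; no write-back is needed")
    owner = sv.present.index(True)

    # Step 1: write the modified copy back, then clear the dirty bit.
    sv.memory = sv.copies[owner]
    sv.dirty = False

    # Step 2: copy the now up-to-date SV to the requesting node.
    sv.copies[requester] = sv.memory

    # Step 3: record the requester as a sharer.
    sv.present[requester] = True
    return sv.copies[requester]


sv = SharedVariable(n_nodes=4, owner=1, memory_value=0, dirty_value=7)
value_read = read_miss_dirty(sv, requester=2)
```

After the call both nodes hold clean copies, which matches the final directory entry on the last slide of this sequence.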

Page 28

Performance Evaluation

Page 29

What do We Want in Benchmarks?

Communication Patterns
Test Composite Protocols
Data Structures / Granularities

Page 30

Performance Comparison

Best Hand Coded Versions

• X10’s Shared Memory Protocol (X10-Mem)
• Directory-based Protocol (GR-Mem)
• Combination (X10-Mem/GR-Mem)

Page 31

Code- and Data-Layout Restructurings

Patterns of Shared Variable Accesses

A Read-mostly: Replicate node-local copies to reduce remote accesses
B Write-mostly: Intact; localize write accesses to the site of allocation
C Aggregate Data: Refactor into individual objects for element-wise access to reduce false sharing
D Write-Following-Read from each place: Collecting Sum Reducer to reduce frequent remote writes
E Write-Once: Replicate node-local copies to reduce remote accesses
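Pattern D above replaces many remote writes with local accumulation followed by one combining step. A minimal sketch of that idea, assuming a sum reduction over per-place partial values; the class name, the `remote_writes` counter, and the message-counting model are illustrative assumptions, not X10 library API.

```python
class CollectingSumReducer:
    """Hypothetical reducer: each place accumulates locally, then combines."""

    def __init__(self, n_places: int) -> None:
        self.partial = [0] * n_places
        self.remote_writes = 0

    def add(self, place: int, value: int) -> None:
        # Local update at the contributing place; no communication modeled.
        self.partial[place] += value

    def collect(self) -> int:
        # One message per place carries its partial sum to the result object.
        self.remote_writes += len(self.partial)
        return sum(self.partial)


reducer = CollectingSumReducer(n_places=4)
for place in range(4):
    for x in range(10):
        reducer.add(place, x)
total = reducer.collect()
```

In this run the 40 contributions cost 4 remote writes instead of 40, which is the saving the restructuring targets.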

Page 32

Code Restructurings in Hand-coded Versions

Benchmarks: Code Restructurings (A B C D E)

FSSimpleDist ✔ ✔
K-Means ✔
MontePiDist ✔
N-Body ✔ ✔ ✔
Jacobi ✔
RayTracer ✔
Unbalanced Tree Search ✔ ✔ ✔
Linear Regression ✔
Delaunay Mesh Generation (DMG) ✔ ✔ ✔
Delaunay Mesh Refinement (DMR) ✔ ✔ ✔

Page 33

Platform

Heldar (Opteron)

No. nodes: 16
Cores per node: 8
Memory per node: 8 GB

• CentOS Linux 6.0
• 1 Node = 2 HyperTransport-connected CPUs
• Quad-Core Opteron Processors

Page 34

[Figure: bar chart of Speedup Over Sequential (axis 0 to 90) using 128 workers, for FSSimpleDist, K-Means, MontePiDist, N-Body, Jacobi, RayTracer, UTS, LinearRegression, DMG, and DMR.]

Page 35

[Figure: bar chart of Speedup Over Sequential (axis 0 to 90) using 128 workers, for FSSimpleDist, K-Means, MontePiDist, N-Body, Jacobi, RayTracer, UTS, LinearRegression, DMG, and DMR. Series: X10-Mem, Manual, GR-Mem.]


Page 39

[Figure: bar chart of Speedup Over Sequential (axis 0 to 90) using 128 workers, for FSSimpleDist, K-Means, MontePiDist, N-Body, Jacobi, RayTracer, UTS, LinearRegression, DMG, and DMR. Series: X10-Mem, Manual, GR-Mem/X10-Mem.]

Page 40

[Figure: Write-Once / Read-Mostly with Replication; Result Object with Collecting Sum Reducer; Node 1 … Node N.]

[Figure: nodes holding GR state.]

1. One-sided PUT/GET to GR home
2. Migrate Referencing Task to GR home
3. Directory-based Protocol

[Figure: K-Means Speedup with 128 Workers; values 68 and 79.]

Benchmarks: Code Restructurings (A B C D E)

FSSimpleDist ✔ ✔
K-Means ✔
MontePiDist ✔
N-Body ✔
Jacobi ✔
RayTracer ✔
Unbalanced Tree Search ✔ ✔ ✔
Linear Regression ✔
Delaunay Mesh Generation (DMG)
Delaunay Mesh Refinement (DMR)
✔ ✔ ✔

Applicable to (A)PGAS Languages: Chapel, Fortress

Page 41

Questions?