open mpi explorations in process affinity (eurompi'13 presentation)

Advancing Application Process Affinity Experimentation: Open MPI's LAMA-Based Affinity Interface Jeff Squyres September 18, 2013 Joshua Hursey

Upload: jeff-squyres

Post on 12-May-2015

DESCRIPTION

Presentation given at EuroMPI'13 by Jeff Squyres describing the flexible process affinity system in Open MPI 1.7.2 (and later).

TRANSCRIPT

Page 1: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Advancing Application Process Affinity Experimentation: Open MPI's LAMA-Based Affinity Interface

Jeff Squyres and Joshua Hursey

September 18, 2013

Page 2: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Locality Matters

• Multiple talks here at EuroMPI’13 about network locality

• Goals:
  • Minimize data transfer distance
  • Reduce network congestion and contention

• …this matters inside the server, too!

Page 3: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Intel Xeon E5-2690 (“Sandy Bridge”): 2 sockets, 8 cores, 64GB per socket

[Figure: server topology diagram. Labels: 1G NICs, 10G NICs (two groups), L1 and L2, shared L3, hyperthreading enabled]

Page 4: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

A User’s Playground

The intent of this work is to provide a mechanism that allows users to explore the process-placement space within the scope of their own applications.

Page 5: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

LAMA

• Locality-Aware Mapping Algorithm (LAMA)
  • Supports a wide range of regular mapping patterns.

• Adapts at runtime to available hardware
  • Supports homogeneous and heterogeneous systems.

• Extensible to any depth of server topology
  • Naturally supports potentially deeper topologies of future server architectures.

Page 6: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

LAMA Inspiration

• Drawn from much prior work

• Most notably, heavily inspired by BlueGene/P and /Q mapping systems
  • LAMA’s mapping specification is similar

Page 7: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Launching MPI Applications

• Three steps in MPI process placement:
  1. Mapping
  2. Ordering
  3. Binding

• Let's discuss how these work in Open MPI

Page 8: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

1. Mapping

• Create a layout of processes-to-resources

[Figure: a grid of 16 servers (4 × 4), each populated with MPI processes]

Page 9: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Mapping

• MPI's runtime must create a map, pairing processes-to-processors (and memory).

• Basic technique:
  • Gather hwloc topologies from allocated nodes.
  • Mapping agent then makes a plan for which resources are assigned to processes.

Page 10: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Mapping Agent

• Act of planning mappings:
  • Specify which process will be launched on each server
  • Identify if any hardware resource will be oversubscribed

• Processes are mapped to the resolution of a single processing unit (PU)
  • Smallest unit of allocation: hardware thread
  • In HPC, usually the same as a processor core
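
As a rough illustration of what a mapping plan can look like, the sketch below fills servers one PU at a time and reports any PU that ends up with more than one process. It is not Open MPI's mapping agent: the function name `plan_mapping`, the server names, and the reduction of each topology to a simple PU count are all assumptions made for this example.

```python
from collections import Counter

def plan_mapping(pus_per_server, num_procs):
    """Assign processes to (server, PU) pairs by filling each server in turn.

    pus_per_server: dict of server name -> number of PUs (a hypothetical
    stand-in for the topology data gathered from each node).
    Returns (plan, oversubscribed): plan holds one (server, pu) pair per
    process; oversubscribed lists PUs that hold more than one process.
    """
    slots = [(server, pu)
             for server, count in pus_per_server.items()
             for pu in range(count)]
    if not slots:
        raise ValueError("no processing units available")
    # Wrap around when there are more processes than PUs.
    plan = [slots[i % len(slots)] for i in range(num_procs)]
    usage = Counter(plan)
    oversubscribed = sorted(slot for slot, n in usage.items() if n > 1)
    return plan, oversubscribed

# Five processes on two hypothetical servers with two PUs each.
plan, over = plan_mapping({"node0": 2, "node1": 2}, 5)
print(plan)
print(over)  # [('node0', 0)]
```

The fifth process wraps around to the first PU, which is why that PU is reported as oversubscribed under the "more than one process per PU" definition.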

Page 11: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Oversubscription

• Common / usual definition:
  • When a single PU is assigned more than one process

• Complicating the definition:
  • Some applications may need more than one PU per process (multithreaded applications)

• How can the user express what their application means by “oversubscription”?

Page 12: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

2. Ordering: By “Slot”

Assigning MCW ranks to mapped processes

[Figure: MCW ranks numbered consecutively within each server: 0 to 15 on the first server, 16 to 31 on the second, and so on]

Page 13: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

2. Ordering: By Node

Assigning MCW ranks to mapped processes

[Figure: MCW ranks assigned round-robin across servers: the first server holds ranks 0, 16, 32, …, 240; the second holds 1, 17, 33, …, 241; and so on]

Page 14: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Ordering

• Each process must be assigned a unique rank in MPI_COMM_WORLD

• Two common types of ordering:
  • natural
    • The order in which processes are mapped determines their rank in MCW
  • sequential
    • The processes are sequentially numbered starting at the first processing unit, and continuing until the last processing unit
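
The difference between the two orderings can be sketched as follows. This is a minimal illustration, assuming PUs are identified by a single global index and that `mapped_pus` lists them in the order the mapper visited them; both function names are invented for the example.

```python
def natural_order(mapped_pus):
    """Rank i goes to the i-th process mapped: rank follows mapping order."""
    return {rank: pu for rank, pu in enumerate(mapped_pus)}

def sequential_order(mapped_pus):
    """Ranks are numbered from the first PU to the last, ignoring mapping order."""
    return {rank: pu for rank, pu in enumerate(sorted(mapped_pus))}

# Hypothetical mapping that alternates between two sockets of four PUs each:
# PUs 0-3 sit on socket 0, PUs 4-7 on socket 1.
mapped = [0, 4, 1, 5]
print(natural_order(mapped))     # {0: 0, 1: 4, 2: 1, 3: 5}
print(sequential_order(mapped))  # {0: 0, 1: 1, 2: 4, 3: 5}
```

With natural ordering, neighbouring ranks land on different sockets because the mapper alternated; with sequential ordering, neighbouring ranks sit on neighbouring PUs regardless of how the mapping was produced.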

Page 15: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

3. Binding

• Launch processes and enforce the layout

[Figure: MCW ranks 0, 1, 2, … shown bound to their assigned resources]

Page 16: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Binding

• Process-launching agent working with the OS to limit where each process can run:
  1. No restrictions
  2. Limited set of restrictions
  3. Specific resource restrictions

• “Binding width”
  • The number of PUs to which a process is bound
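
To make "binding width" concrete, the sketch below computes the set of PUs a process would be allowed to run on when bound to a number of consecutive cores. The function `pus_for_binding` and the core-by-core PU numbering are assumptions for illustration; real systems may number PUs differently, and this is not how Open MPI computes bindings.

```python
def pus_for_binding(start_core, width, pus_per_core):
    """Return the set of PU indices covered by `width` consecutive cores.

    Assumes PUs are numbered core by core (core 0 owns PUs
    0..pus_per_core-1, and so on).
    """
    if width < 1 or pus_per_core < 1:
        raise ValueError("width and pus_per_core must be positive")
    first = start_core * pus_per_core
    return set(range(first, first + width * pus_per_core))

# Bound to 3 cores with 2 hardware threads per core, starting at core 0.
allowed = pus_for_binding(start_core=0, width=3, pus_per_core=2)
print(sorted(allowed))  # [0, 1, 2, 3, 4, 5]
print(len(allowed))     # binding width in PUs: 6
```

On Linux, a set like this could be handed to Python's `os.sched_setaffinity(0, allowed)` to restrict the calling process, provided those PUs actually exist on the machine.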

Page 17: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Command Line Interface (CLI)

• 4 levels of abstraction for the user
  • Level 1: None
  • Level 2: Simple, common patterns
  • Level 3: LAMA process layout regular patterns
  • Level 4: Irregular patterns

Page 18: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

CLI: Level 1 (none)

• No mapping or binding options specified
  • May or may not specify the number of processes to launch (-np)
  • If not specified, default to the number of cores available in the allocation
  • One process is mapped to each core in the system in a "by-core" style
  • Processes are not bound

• …for backwards compatibility reasons

Page 19: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

CLI: Level 2 (common)

• Simple, common patterns for mapping and binding
  • Specify mapping pattern with:
    • --map-by X (e.g., --map-by socket)
  • Specify binding option with:
    • --bind-to Y (e.g., --bind-to core)
  • All of these options are translated to Level 3 options for processing by LAMA

(full list of X / Y values shown later)

Page 20: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

CLI: Level 3 (regular patterns)

• LAMA process layout regular patterns
  • Power users wanting something unique for their application
  • Four MCA run-time parameters
    • rmaps_lama_map: Mapping process layout
    • rmaps_lama_bind: Binding width
    • rmaps_lama_order: Ordering of MCW ranks
    • rmaps_lama_mppr: Maximum allowable number of processes per resource (oversubscription)

Page 21: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

rmaps_lama_map (map)

• Takes as an argument the "process layout"
  • A series of nine tokens
    • allowing 9! (362,880) mapping permutation options.
  • Preferred iteration order for LAMA
    • innermost iteration specified first
    • outermost iteration specified last
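
A process layout string can be split into tokens with a few lines of code, which also makes the 9! figure easy to check. The token set below is partly an assumption: `s`, `c`, `b`, `n`, `h`, `N` and the L2 cache level appear in this presentation, while `L1` and `L3` are filled in by analogy. The helper `parse_layout` is hypothetical and is not part of Open MPI.

```python
import math

# Assumed token set (see the note above): cache levels plus single letters.
TOKENS = ("L1", "L2", "L3", "n", "b", "s", "c", "h", "N")

def parse_layout(layout):
    """Split a layout string such as 'scbnh' into tokens, innermost first."""
    tokens, i = [], 0
    while i < len(layout):
        # Try two-character cache tokens (L1, L2, L3) before single letters.
        for size in (2, 1):
            piece = layout[i:i + size]
            if piece in TOKENS:
                if piece in tokens:
                    raise ValueError(f"duplicate token {piece!r}")
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"unknown token at position {i}: {layout[i:]!r}")
    return tokens

print(parse_layout("scbnh"))        # ['s', 'c', 'b', 'n', 'h']
print(math.factorial(len(TOKENS)))  # 362880 orderings of nine tokens
```

The first token returned is the innermost iteration level and the last is the outermost, matching the order in which the layout is written.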

Page 22: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Example system

2 servers (nodes), 4 sockets, 2 cores, 2 PUs

Page 23: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

rmaps_lama_map (map)

• map=scbnh (a.k.a., by socket, then by core)
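
The iteration order implied by a layout such as scbnh can be sketched with nested loops in which the first token varies fastest. The sketch reads the example system as 2 nodes, each with 4 sockets, 2 cores per socket and 2 PUs per core; the single board per node, the meaning of `b` as "board", and the function `mapping_order` are assumptions for this example. It ignores mppr limits and is not LAMA's actual implementation.

```python
from itertools import product

# Example system, with one board per node assumed.
SIZES = {"n": 2, "b": 1, "s": 4, "c": 2, "h": 2}

def mapping_order(layout, sizes, num_procs):
    """Yield one {level: index} dict per process, iterating innermost-first."""
    levels = list(layout)          # single-letter tokens only in this sketch
    outer_to_inner = levels[::-1]  # product() varies its last argument fastest
    ranges = [range(sizes[level]) for level in outer_to_inner]
    for count, combo in enumerate(product(*ranges)):
        if count == num_procs:
            break
        yield dict(zip(outer_to_inner, combo))

for rank, place in enumerate(mapping_order("scbnh", SIZES, 6)):
    print(rank, "node", place["n"], "socket", place["s"],
          "core", place["c"], "hwt", place["h"])
```

With scbnh, the first four processes go to core 0 of sockets 0 through 3 on node 0, and the fifth moves to core 1 of socket 0: by socket, then by core. Passing "nbsch" instead alternates between the two nodes before moving to the next socket.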

Page 28: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

rmaps_lama_bind (bind)

• “Binding width” and layer

• Example: bind=3c (3 cores)

Page 29: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

rmaps_lama_bind (bind)

• “Binding width” and layer

• Example: bind=2s (2 sockets)

Page 30: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

rmaps_lama_bind (bind)

• “Binding width” and layer

• Example: bind=1L2 (all PUs in an L2)

Page 31: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

rmaps_lama_bind (bind)

• “Binding width” and layer

• Example: bind=1N (all PUs in NUMA locality)

Page 32: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

rmaps_lama_order (order)

• Select which ranks are assigned to processes in MCW

• There are other possible orderings, but no one has asked for them yet…

Natural order for map-by-node (default)

Sequential order for any mapping

Page 33: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

rmaps_lama_mppr (mppr)

• mppr (mip-per) sets the Maximum number of allowable Processes Per Resource
  • User-specified definition of oversubscription

• Comma-delimited list of <#:resource>
  • 1:c (at most one process per core)
  • 1:c,2:s (at most one process per core, and at most two processes per socket)
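
The mppr format is simple enough to parse and check by hand. The sketch below turns a string such as 1:c,2:s into per-resource limits and reports which resources exceed them for a given placement. The helpers `parse_mppr` and `violations`, and the representation of a placement as a socket index plus a socket-local core index, are assumptions for illustration only.

```python
from collections import Counter

def parse_mppr(spec):
    """Parse '1:c,2:s' into {'c': 1, 's': 2}."""
    limits = {}
    for item in spec.split(","):
        count, sep, resource = item.strip().partition(":")
        if not sep or not count.isdigit() or not resource:
            raise ValueError(f"bad mppr entry: {item!r}")
        limits[resource] = int(count)
    return limits

def violations(placements, limits):
    """Return resources holding more processes than the limits allow.

    placements: one dict per process, e.g. {'s': 0, 'c': 1}, where each
    core index is local to its socket (an assumption of this sketch).
    """
    counts = Counter()
    for place in placements:
        counts[("s", place["s"])] += 1
        counts[("c", place["s"], place["c"])] += 1
    return sorted(key for key, n in counts.items()
                  if key[0] in limits and n > limits[key[0]])

# Three processes on three different cores of socket 0.
three = [{"s": 0, "c": 0}, {"s": 0, "c": 1}, {"s": 0, "c": 2}]
print(violations(three, parse_mppr("1:c")))      # []
print(violations(three, parse_mppr("1:c,2:s")))  # [('s', 0)]
```

The same placement is acceptable under 1:c but oversubscribed under 1:c,2:s, which is the sense in which mppr lets the user define what oversubscription means.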

Page 34: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

MPPR

1:c At most one process per core

Page 35: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

MPPR

1:c,2:s At most one process per core and two processes per socket

Page 36: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

CLI: Level 4 (rankfile)

• Complete specification of the process-to-resource mapping
  • Bypasses LAMA

• Not described in the paper

Page 37: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Level 2 to Level 3 Chart

Page 38: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Remember the prior example?

• -np 24 -mppr 2:c -map scbnh

Page 39: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Same example, different mapping

• -np 24 -mppr 2:c -map nbsch

Page 40: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Report Bindings

• --report-bindings displays a pretty-printed representation of the binding actually used for each process.
  • Visual feedback = quite helpful when exploring

mpirun -np 4 --mca rmaps lama --mca rmaps_lama_bind 1c --mca rmaps_lama_map nbsch --mca rmaps_lama_mppr 1:c --report-bindings hello_world

MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
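
The bracketed mask at the end of each report line can be decoded mechanically. The sketch below assumes the format seen in the output above: one pair of brackets per socket, "/" between cores, one character per hardware thread, and "B" marking a thread the process is bound to. The function `bound_threads` is a hypothetical helper, not an Open MPI utility.

```python
import re

def bound_threads(mask):
    """Return (socket, core, hwt) triples marked 'B' in a bindings mask.

    Core numbers are local to each socket here, unlike the global core
    numbers printed in the textual part of the report.
    """
    bound = []
    for socket, cores in enumerate(re.findall(r"\[([^\]]*)\]", mask)):
        for core, threads in enumerate(cores.split("/")):
            for hwt, flag in enumerate(threads):
                if flag == "B":
                    bound.append((socket, core, hwt))
    return bound

# Mask reported for MCW rank 3 in the output above.
line = "[../../../../../../../..][../BB/../../../../../..]"
print(bound_threads(line))  # [(1, 1, 0), (1, 1, 1)]
```

Rank 3 is reported as bound to core 9 on socket 1, which corresponds to the second core (local index 1) of the second socket in the mask.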

Page 41: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Future Work

• Available in Open MPI v1.7.2 (and later)

• Open questions to users:
  • Are more flexible ordering options useful?
  • What common mapping patterns are useful?
  • What additional features would you like to see?

Page 42: Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Thank You