MPI Sessions: a proposal to the MPI Forum


Upload: jeff-squyres

Post on 06-Jan-2017


TRANSCRIPT

Page 1: MPI Sessions: a proposal to the MPI Forum

How to make MPI Awesome: MPI Sessions

Follow-on to Jeff’s crazy thoughts discussed in Bordeaux

Random group of people who have been talking about this stuff: Wesley Bland, Ryan Grant, Dan Holmes, Kathryn Mohror, Martin Schulz, Anthony Skjellum, Jeff Squyres, and more

Page 2: MPI Sessions: a proposal to the MPI Forum

What we want

• Any thread (e.g., library) can use MPI any time it wants
• But still be able to totally clean up MPI if/when desired
• New parameters to initialize the MPI API

[Figure: a single MPI process containing Library 1 through Library 12, each of which calls MPI_Init(…)]

Page 3: MPI Sessions: a proposal to the MPI Forum

Before MPI-3.1, this could be erroneous

int my_thread1_main(void *context) {
    MPI_Initialized(&flag);
    // …
}

int my_thread2_main(void *context) {
    MPI_Initialized(&flag);
    // …
}

int main(int argc, char **argv) {
    MPI_Init_thread(…, MPI_THREAD_FUNNELED, …);
    pthread_create(…, my_thread1_main, NULL);
    pthread_create(…, my_thread2_main, NULL);
    // …
}

The two MPI_Initialized calls might run at the same time (!)

Page 4: MPI Sessions: a proposal to the MPI Forum

The MPI-3.1 solution

• MPI_INITIALIZED (and friends) are allowed to be called at any time
  – …even by multiple threads
  – …regardless of MPI_THREAD_* level
• This is a simple, easy-to-explain solution
  – And probably what most applications do, anyway
• But many other paths were investigated

Page 5: MPI Sessions: a proposal to the MPI Forum

MPI-3.1 MPI_INIT / FINALIZE limitations

• Cannot init MPI from different entities within a process without a priori knowledge / coordination
  – I.e., MPI-3.1 (intentionally) still did not solve the underlying problem

MPI Process

// Library 1 (thread)
MPI_Initialized(&flag);
if (!flag) MPI_Init(…);

// Library 2 (thread)
MPI_Initialized(&flag);
if (!flag) MPI_Init(…);

THIS IS INSUFFICIENT / POTENTIALLY ERRONEOUS
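The check-then-initialize pattern races because another thread can initialize MPI between the MPI_Initialized call and the MPI_Init call. One way to picture the a priori coordination it would need is a shared once-guard that every library in the process agrees to use. The sketch below is a plain C11 model, not MPI code: `fake_mpi_init`, `ensure_initialized`, `init_call_count`, and the state constants are hypothetical names, and a counter stands in for the real initialization.

```c
#include <stdatomic.h>

/* Hypothetical states of the shared guard. */
enum { UNINIT = 0, IN_PROGRESS = 1, READY = 2 };

static atomic_int init_state = UNINIT;
static int init_calls = 0; /* how many times the stand-in init really ran */

/* Stand-in for MPI_Init(...); only the winning caller ever runs it. */
static void fake_mpi_init(void) { init_calls++; }

/* Every library calls this instead of "check, then init". */
void ensure_initialized(void)
{
    int expected = UNINIT;
    if (atomic_compare_exchange_strong(&init_state, &expected, IN_PROGRESS)) {
        fake_mpi_init();                  /* exactly one caller gets here */
        atomic_store(&init_state, READY);
    } else {
        while (atomic_load(&init_state) != READY) {
            /* another caller is still initializing: wait for it */
        }
    }
}

/* Report how often the stand-in initialization actually ran. */
int init_call_count(void) { return init_calls; }
```

The catch is that every library has to know about and use the same guard, which is exactly the coordination that cannot be assumed between independently written libraries.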

Page 6: MPI Sessions: a proposal to the MPI Forum

(More of) What we want

• Fix MPI-3.1 limitations:
  – Cannot init MPI from different entities within a process without a priori knowledge / coordination
  – Cannot initialize MPI more than once
  – Cannot set error behavior of MPI initialization
  – Cannot re-initialize MPI after it has been finalized

Page 7: MPI Sessions: a proposal to the MPI Forum

All these things overlap

[Figure: four overlapping regions]
• Still be able to finalize MPI
• Any thread can use MPI any time
• Re-initialize MPI
• Affect MPI initialization error behavior

Page 8: MPI Sessions: a proposal to the MPI Forum

How do we get those things?

Page 9: MPI Sessions: a proposal to the MPI Forum

KEEP CALM AND LISTEN TO THE ENTIRE PROPOSAL

Page 10: MPI Sessions: a proposal to the MPI Forum

New concept: “session”

• A local handle to the MPI library
  – Implementation intent: lightweight / uses very few resources
  – Can also cache some local state
• Can have multiple sessions in an MPI process
  – MPI_Session_init(…, &session);
  – MPI_Session_finalize(…, &session);

Page 11: MPI Sessions: a proposal to the MPI Forum

MPI Session

[Figure: one MPI process in which an ocean library and an atmosphere library each call MPI_SESSION_INIT(…) on the same underlying MPI library]

Page 12: MPI Sessions: a proposal to the MPI Forum

MPI Session

[Figure: one MPI process in which the ocean library holds an ocean session and the atmosphere library holds an atmosphere session, both on top of the same MPI library]

Unique handles to the underlying MPI library

Page 13: MPI Sessions: a proposal to the MPI Forum

Initialize / finalize a session

• MPI_Session_init(
  – IN MPI_Info info,
  – IN MPI_Errhandler errhandler,
  – OUT MPI_Session *session)
• MPI_Session_finalize(
  – INOUT MPI_Session *session)
• Parameters described in next slides…

Page 14: MPI Sessions: a proposal to the MPI Forum

Session init params

• Info: for future expansion
• Errhandler: to be invoked if MPI_SESSION_INIT errors
  – Likely need a new type of errhandler
    • …or a generic errhandler
    • The FT working group is discussing exactly this topic

Page 15: MPI Sessions: a proposal to the MPI Forum

MPI Session

[Figure: one MPI process with two sessions on the same MPI library; errors in the ocean session return, errors in the atmosphere session abort]

Unique errhandlers, info, local state, etc.

Page 16: MPI Sessions: a proposal to the MPI Forum

Great. I have a session. Now what?

Page 17: MPI Sessions: a proposal to the MPI Forum

Fair warning

• The MPI runtime has long since been a bastard stepchild
  – Barely acknowledged in the standard
  – Mainly in the form of non-normative suggestions
• It’s time to change that

Page 18: MPI Sessions: a proposal to the MPI Forum

Overview

• General scheme:
  – Query the underlying run-time system
    • Get a “set” of processes
  – Determine the processes you want
    • Create an MPI_Group
  – Create a communicator with just those processes
    • Create an MPI_Comm

[Figure: MPI_Session, then query runtime for set of processes, then MPI_Group, then MPI_Comm]

Page 19: MPI Sessions: a proposal to the MPI Forum

Runtime concepts

• Expose 2 concepts to MPI from the runtime:
  1. Static sets of processes
  2. Each set caches (key,value) string tuples

These slides only discuss static sets (unchanged for the life of the process). However, there are several useful scenarios that involve dynamic membership of sets over time. More discussion needs to occur for these scenarios. For the purposes of these slides, just consider static sets.

Page 20: MPI Sessions: a proposal to the MPI Forum

Static sets of processes

• Sets are identified by string name
• Two sets are mandated
  – “mpi://WORLD”
  – “mpi://SELF”
• Other sets can be defined by the system (these names are implementation-dependent):
  – “location://rack/19”
  – “network://leaf-switch/37”
  – “arch://x86_64”
  – “job://12942”
  – … etc.
• Processes can be in more than one set

Page 21: MPI Sessions: a proposal to the MPI Forum

Examples of sets

MPI process 0 MPI process 1 MPI process 2 MPI process 3

mpi://WORLD

Page 22: MPI Sessions: a proposal to the MPI Forum

Examples of sets

MPI process 0 MPI process 1 MPI process 2 MPI process 3

mpi://WORLD

arch://x86_64

Page 23: MPI Sessions: a proposal to the MPI Forum

Examples of sets

MPI process 0 MPI process 1 MPI process 2 MPI process 3

mpi://WORLD

job://12942

arch://x86_64

Page 24: MPI Sessions: a proposal to the MPI Forum

Examples of sets

MPI process 0 MPI process 1 MPI process 2 MPI process 3

mpi://SELF mpi://SELF mpi://SELF mpi://SELF

Page 25: MPI Sessions: a proposal to the MPI Forum

Examples of sets

MPI process 0 MPI process 1 MPI process 2 MPI process 3

location://rack/self location://rack/self

location://rack/17 location://rack/23

Page 26: MPI Sessions: a proposal to the MPI Forum

Examples of sets

MPI process 0 MPI process 1 MPI process 2 MPI process 3

user://ocean user://atmosphere

mpiexec \
    --np 2 --set user://ocean ocean.exe : \
    --np 2 --set user://atmosphere atmosphere.exe

Page 27: MPI Sessions: a proposal to the MPI Forum

Querying the run-time

• MPI_Session_get_names(
  – IN MPI_Session session,
  – OUT char **set_names)
• Returns argv-style list of \0-terminated names
  – Must be freed by caller

Example list of set names returned:
mpi://WORLD
mpi://SELF
arch://x86_64
location://rack/17
job://12942
user://ocean
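A caller will typically scan the returned list for a set it knows how to use. The sketch below is plain C, not MPI code, and assumes that "argv-style" means an array of string pointers terminated by a NULL entry; the slide does not spell this out. `find_set_name` is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/* Return the index of `wanted` in a NULL-terminated list of set names
 * (assumed layout, like argv), or -1 if it is not present. */
int find_set_name(char *const names[], const char *wanted)
{
    for (int i = 0; names[i] != NULL; i++) {
        if (strcmp(names[i], wanted) == 0) {
            return i;
        }
    }
    return -1;
}
```

Because only "mpi://WORLD" and "mpi://SELF" are mandated, a portable caller should be prepared for any other name to be missing and fall back accordingly.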

Page 28: MPI Sessions: a proposal to the MPI Forum

Values in sets

• Each set has an associated MPI_Info object
• One mandated key in each info:
  – “size”: number of processes in this set
• Runtime may also provide other keys
  – Implementation-dependent

Page 29: MPI Sessions: a proposal to the MPI Forum

Querying the run-time

• MPI_Session_get_info(
  – IN MPI_Session session,
  – IN const char *set_name,
  – OUT MPI_Info *info)

• Use existing MPI_Info functions to retrieve (key,value) tuples

Page 30: MPI Sessions: a proposal to the MPI Forum

Example

MPI_Info info;
MPI_Session_get_info(session, "mpi://WORLD", &info);

// Buffer of characters (not of pointers), with room for the terminating \0
char size_str[MPI_MAX_INFO_VAL + 1];
MPI_Info_get(info, "size", …, size_str, …);
int size = atoi(size_str);
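Since info values are strings, the "size" value has to be converted before use, and atoi cannot tell "0" from garbage. A slightly more defensive conversion could look like the sketch below. It is plain C with no MPI calls, and `parse_set_size` is a hypothetical helper name, not part of the proposal.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse the value of the "size" key. Returns the size, or -1 if the string
 * is not a non-negative decimal integer that fits in an int. */
int parse_set_size(const char *size_str)
{
    char *end = NULL;
    errno = 0;
    long value = strtol(size_str, &end, 10);

    if (end == size_str || *end != '\0') {
        return -1;                        /* empty string or trailing junk */
    }
    if (errno == ERANGE || value < 0 || value > INT_MAX) {
        return -1;                        /* out of range for a process count */
    }
    return (int)value;
}
```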

Page 31: MPI Sessions: a proposal to the MPI Forum

Ummmm… great. What’s the point of that?

Page 32: MPI Sessions: a proposal to the MPI Forum

Make MPI_Groups!

• MPI_Group_create_from_session(
  – IN MPI_Session session,
  – IN const char *set_name,
  – OUT MPI_Group *group);

Advice to implementers: This MPI_Group can still be a lightweight object (even if there are a large number of processes in it)

Page 33: MPI Sessions: a proposal to the MPI Forum

Example

// Make a group of procs from "location://rack/self"
MPI_Create_group_from_session_name(session, "location://rack/self", &group);

// Use just the even procs
MPI_Group_size(group, &size);
ranges[0][0] = 0;          // first rank
ranges[0][1] = size - 1;   // last rank (must itself be a valid rank)
ranges[0][2] = 2;          // stride
MPI_Group_range_incl(group, 1, ranges, &group_of_evens);
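MPI_Group_range_incl takes (first, last, stride) triplets and selects first, first + stride, and so on, up to and including last when the stride lands on it. The last element is therefore itself a rank, so for a group of `size` processes the largest valid value is size - 1. The sketch below models that rule in plain C for positive strides only; `expand_range` is a hypothetical helper, not an MPI function.

```c
/* Expand one (first, last, stride) triplet into the ranks it selects.
 * Only positive strides are handled in this sketch. Writes at most `max`
 * ranks into `out` and returns how many ranks the triplet selects. */
int expand_range(int first, int last, int stride, int *out, int max)
{
    int count = 0;

    if (stride <= 0) {
        return 0;                 /* not modelled here */
    }
    for (int r = first; r <= last; r += stride) {
        if (count < max) {
            out[count] = r;       /* record the selected rank */
        }
        count++;
    }
    return count;
}
```

For a group of 6 processes, the triplet (0, 5, 2) selects ranks 0, 2, and 4, whereas (0, 6, 2) would also select rank 6, which does not exist in that group.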

Page 34: MPI Sessions: a proposal to the MPI Forum

Make a communicator from that group

• MPI_Create_comm_from_group(
  – IN MPI_Group group,
  – IN const char *tag, // for matching (see next slide)
  – IN MPI_Info info,
  – IN MPI_Errhandler errhandler,
  – OUT MPI_Comm *comm)

Note: this is different from the existing function MPI_Comm_create_group(oldcomm, group, (int) tag, &newcomm)

Might need a better name for this new function…?

Page 35: MPI Sessions: a proposal to the MPI Forum

String tag is used to match concurrent creations by different entities

[Figure: three MPI processes, each containing an ocean library and an atmosphere library. The ocean libraries call MPI_Create_comm_from_group(…, tag = "gov.anl.ocean", …); the atmosphere libraries call MPI_Create_comm_from_group(…, tag = "gov.llnl.atmosphere", …)]

Page 36: MPI Sessions: a proposal to the MPI Forum

Make any kind of communicator

• MPI_Create_cart_comm_from_group(
  – IN MPI_Group group,
  – IN const char *tag,
  – IN MPI_Info info,
  – IN MPI_Errhandler errhandler,
  – IN int ndims,
  – IN const int dims[],
  – IN const int periods[],
  – IN int reorder,
  – OUT MPI_Comm *comm)

Page 37: MPI Sessions: a proposal to the MPI Forum

Make any kind of communicator

• MPI_Create_graph_comm_from_group(…)
• MPI_Create_dist_graph_comm_from_group(…)
• MPI_Create_dist_graph_adjacent_comm_from_group(…)

Page 38: MPI Sessions: a proposal to the MPI Forum

Run-time static sets across different sessions in the same process

• Making communicators from the same static set will always result in the same local rank
  – Even if created from different sessions

See example in the next slide…

Page 39: MPI Sessions: a proposal to the MPI Forum

Run-time static sets across different sessions in the same process

// Session, group, and communicator 1
MPI_Create_group_from_session_name(session_1, "mpi://WORLD", &group1);
MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
MPI_Comm_rank(comm1, &rank1);

// Session, group, and communicator 2
MPI_Create_group_from_session_name(session_2, "mpi://WORLD", &group2);
MPI_Create_comm_from_group(group2, "atmosphere", …, &comm2);
MPI_Comm_rank(comm2, &rank2);

// Ranks are guaranteed to be the same
assert(rank1 == rank2);

Law of Least Astonishment

Page 40: MPI Sessions: a proposal to the MPI Forum

Mixing requests from different sessions: disallowed

// Session, group, and communicator 1
MPI_Create_group_from_session_name(session_1, "mpi://WORLD", &group1);
MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
MPI_Isend(…, &req[0]);

// Session, group, and communicator 2
MPI_Create_group_from_session_name(session_2, "mpi://WORLD", &group2);
MPI_Create_comm_from_group(group2, "atmosphere", …, &comm2);
MPI_Isend(…, &req[1]);

// Mixing requests from different
// sessions is disallowed
MPI_Waitall(2, req, …);

Rationale: this is difficult to optimize, particularly if a session maps to hardware resources
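An application that juggles several sessions could guard against this mistake with its own bookkeeping. The sketch below is a plain C model under the assumption that the application records, next to each pending request, which session it came from. `struct tracked_request` and `same_session` are hypothetical names; the proposal defines no such query.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping: each pending request remembers the id of the
 * session its communicator was derived from. */
struct tracked_request {
    int session_id;
};

/* True if all n requests belong to one session (trivially true for n <= 1),
 * i.e. if it would be legal to complete them in a single wait call. */
bool same_session(const struct tracked_request *reqs, int n)
{
    for (int i = 1; i < n; i++) {
        if (reqs[i].session_id != reqs[0].session_id) {
            return false;
        }
    }
    return true;
}
```

When the check fails, the straightforward fix is to keep one request array per session and wait on each array separately.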

Page 41: MPI Sessions: a proposal to the MPI Forum

MPI_Session_finalize

• Analogous to MPI_FINALIZE
  – Can block waiting for the destruction of the objects derived from that session
    • Communicators, Windows, Files, … etc.
  – Each session that is initialized must be finalized

Page 42: MPI Sessions: a proposal to the MPI Forum

Well, that all sounds great.

…but who calls MPI_INIT?

And what session does MPI_COMM_WORLD / MPI_COMM_SELF belong to?

Page 43: MPI Sessions: a proposal to the MPI Forum

New concept: no longer require MPI_INIT / MPI_FINALIZE

Page 44: MPI Sessions: a proposal to the MPI Forum

New concept: no longer require MPI_INIT / MPI_FINALIZE

• WHAT?!
• When will MPI initialize itself?
• How will MPI finalize itself?
  – It is still (very) desirable to allow MPI to clean itself up so that MPI processes can be “valgrind clean” when they exit

Page 45: MPI Sessions: a proposal to the MPI Forum

Split MPI APIs into two sets

Performance doesn’t matter (as much):
• Functions that create / query / destroy: MPI_Comm, MPI_File, MPI_Win, MPI_Info, MPI_Op, MPI_Errhandler, MPI_Datatype, MPI_Group, MPI_Session, attributes, processes
• MPI_T

Performance absolutely matters:
• Point to point
• Collectives
• I/O
• RMA
• Test/Wait
• Handle language xfer

Page 46: MPI Sessions: a proposal to the MPI Forum

Split MPI APIs into two sets

Performance doesn’t matter (as much):
• Functions that create / query / destroy: MPI_Comm, MPI_File, MPI_Win, MPI_Info, MPI_Op, MPI_Errhandler, MPI_Datatype, MPI_Group, MPI_Session, attributes, processes
• MPI_T
Ensure that MPI is initialized (and/or finalized) by these functions

Performance absolutely matters:
• Point to point
• Collectives
• I/O
• RMA
• Test/Wait
• Handle language xfer
These functions still can’t be used unless MPI is initialized

Page 47: MPI Sessions: a proposal to the MPI Forum

Split MPI APIs into two sets

Performance doesn’t matter (as much):
• Functions that create / query / destroy: MPI_Comm, MPI_File, MPI_Win, MPI_Info, MPI_Op, MPI_Errhandler, MPI_Datatype, MPI_Group, MPI_Session, attributes, processes
• MPI_T
These functions init / finalize MPI transparently

Performance absolutely matters:
• Point to point
• Collectives
• I/O
• RMA
• Test/Wait
• Handle language xfer
These functions can’t be called without a handle created from the left-hand column

Page 48: MPI Sessions: a proposal to the MPI Forum

Split MPI APIs into two sets

Performance doesn’t matter (as much):
• Functions that create / query / destroy: MPI_Comm, MPI_File, MPI_Win, MPI_Info, MPI_Op, MPI_Errhandler, MPI_Datatype, MPI_Group, MPI_Session, attributes, processes
• MPI_T

Performance absolutely matters:
• Point to point
• Collectives
• I/O
• RMA
• Test/Wait
• Handle language xfer

MPI_COMM_WORLD and MPI_COMM_SELF are notable exceptions.

…I’ll address this shortly.

Page 49: MPI Sessions: a proposal to the MPI Forum

Example

int main() {
    // Create a datatype – initializes MPI
    MPI_Type_contiguous(2, MPI_INT, &mytype);
    // …
}

The creation of the first user-defined MPI object initializes MPI

Initialization can be a local action!

Page 50: MPI Sessions: a proposal to the MPI Forum

Example

int main() {
    // Create a datatype – initializes MPI
    MPI_Type_contiguous(2, MPI_INT, &mytype);
    // Free the datatype – finalizes MPI
    MPI_Type_free(&mytype);
    // Valgrind clean
    return 0;
}

The destruction of the last user-defined MPI object finalizes / cleans up MPI. This is guaranteed.

There are some corner cases described on the following slides.

Page 51: MPI Sessions: a proposal to the MPI Forum

Example

int main() {
    // Create a datatype – initializes MPI
    MPI_Type_contiguous(2, MPI_INT, &mytype);
    // Free the datatype – finalizes MPI
    MPI_Type_free(&mytype);

    // Re-initialize MPI!
    MPI_Type_dup(MPI_INT, &mytype);
    // …
}

We can also re-initialize MPI! (It’s transparent to the user, so why not?)

Page 52: MPI Sessions: a proposal to the MPI Forum

Example

int main() {
    // Create a datatype – initializes MPI
    MPI_Type_contiguous(2, MPI_INT, &mytype);
    // Free the datatype – finalizes MPI
    MPI_Type_free(&mytype);

    // Re-initialize MPI!
    MPI_Type_dup(MPI_INT, &mytype);
    return 0;
}

(Sometimes) Not an error to exit the process with MPI still initialized

Page 53: MPI Sessions: a proposal to the MPI Forum

The overall theme

• Just use MPI functions whenever you want
  – MPI will initialize as it needs to
  – Initialization essentially becomes an implementation detail
• Finalization will occur whenever all user-defined handles are destroyed
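One way to picture this theme is a counter of live user-defined handles: the first creation brings the library up, the last destruction tears it down, and a later creation brings it up again. The sketch below is a single-threaded plain C model of that idea, not a description of how any MPI implementation does it. All names are hypothetical, and a real implementation would also need to make the counter thread-safe.

```c
#include <stdbool.h>

static int  live_handles = 0;   /* user-defined handles currently alive */
static bool lib_up = false;     /* is the modelled library initialized? */
static int  init_count = 0;     /* how many times it has been brought up */

/* Called by every "create" function in the model. */
void handle_created(void)
{
    if (live_handles == 0) {    /* first handle: initialize transparently */
        lib_up = true;
        init_count++;
    }
    live_handles++;
}

/* Called by every "destroy" function in the model. */
void handle_destroyed(void)
{
    if (live_handles == 0) {
        return;                 /* nothing left to destroy */
    }
    live_handles--;
    if (live_handles == 0) {    /* last handle: finalize / clean up */
        lib_up = false;
    }
}

bool library_is_up(void)     { return lib_up; }
int  times_initialized(void) { return init_count; }
```

In this model, creating a handle after the count has dropped to zero simply initializes again, which mirrors the re-initialization example with MPI_Type_dup.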

Page 54: MPI Sessions: a proposal to the MPI Forum

Wait a minute – What about MPI_COMM_WORLD?

int main() {
    // Can’t I do this?
    MPI_Send(…, MPI_COMM_WORLD);
    // …
}

This would be calling a “performance matters” function before a “performance doesn’t matter” function

I.e., MPI has not initialized yet

Page 55: MPI Sessions: a proposal to the MPI Forum

Wait a minute – What about MPI_COMM_WORLD?

int main() {
    // This is valid
    MPI_Init(NULL, NULL);
    MPI_Send(…, MPI_COMM_WORLD);
    // …
}

Re-define MPI_INIT and MPI_FINALIZE: constructor and destructor for MPI_COMM_WORLD and MPI_COMM_SELF

Page 56: MPI Sessions: a proposal to the MPI Forum

INIT and FINALIZE

int main() {
    MPI_Init(NULL, NULL);
    MPI_Send(…, MPI_COMM_WORLD);
    MPI_Finalize();
}

INIT and FINALIZE continue to exist for two reasons:
1. Backwards compatibility
2. Convenience

So let’s keep them as close to MPI-3.1 as possible:
• If you call INIT, you have to call FINALIZE
• You can only call INIT / FINALIZE once
• INITIALIZED / FINALIZED only refer to INIT / FINALIZE (not sessions)

If you want different behavior, use sessions

Page 57: MPI Sessions: a proposal to the MPI Forum

INIT and FINALIZE

• INIT/FINALIZE create an implicit session
  – You cannot extract an MPI_Session handle for the implicit session created by MPI_INIT[_THREAD]
• Yes, you can use INIT/FINALIZE in the same MPI process as other sessions

Page 58: MPI Sessions: a proposal to the MPI Forum

Backwards compatibility: INITIALIZED and FINALIZED behavior

int main() {
    MPI_Initialized(&flag); assert(flag == false);
    MPI_Finalized(&flag);   assert(flag == false);

    MPI_Session_create(…, &session1);
    MPI_Initialized(&flag); assert(flag == false);
    MPI_Finalized(&flag);   assert(flag == false);

    MPI_Init(NULL, NULL);
    MPI_Initialized(&flag); assert(flag == true);
    MPI_Finalized(&flag);   assert(flag == false);

    MPI_Session_free(…, &session1);
    MPI_Initialized(&flag); assert(flag == true);
    MPI_Finalized(&flag);   assert(flag == false);

    MPI_Session_create(…, &session2);
    MPI_Initialized(&flag); assert(flag == true);
    MPI_Finalized(&flag);   assert(flag == false);

    MPI_Finalize();
    MPI_Initialized(&flag); assert(flag == true);
    MPI_Finalized(&flag);   assert(flag == true);

    MPI_Session_free(…, &session2);
    MPI_Initialized(&flag); assert(flag == true);
    MPI_Finalized(&flag);   assert(flag == true);
}

Short version: INITIALIZED, FINALIZED, IS_THREAD_MAIN all still refer to INIT / FINALIZE
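The rule can be modelled as two flags that only INIT and FINALIZE ever touch, with sessions tracked separately. The sketch below is a plain C model of that bookkeeping, not MPI code. `struct mpi_model` and the `model_*` functions are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the state behind MPI_Initialized / MPI_Finalized. */
struct mpi_model {
    bool initialized;    /* what MPI_Initialized would report */
    bool finalized;      /* what MPI_Finalized would report */
    int  open_sessions;  /* sessions never touch the two flags */
};

/* Only INIT and FINALIZE change the flags. */
void model_init(struct mpi_model *m)     { m->initialized = true; }
void model_finalize(struct mpi_model *m) { m->finalized = true; }

/* Creating or freeing a session only changes the session count. */
void model_session_create(struct mpi_model *m) { m->open_sessions++; }
void model_session_free(struct mpi_model *m)   { m->open_sessions--; }
```

Replaying the sequence from the slide against this model gives the same flag values at every step: sessions can come and go before INIT, between INIT and FINALIZE, and after FINALIZE without changing either flag.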

Page 59: MPI Sessions: a proposal to the MPI Forum

FIN

(for the main part of the proposal)

Page 60: MPI Sessions: a proposal to the MPI Forum

Items that still need more discussion

Page 61: MPI Sessions: a proposal to the MPI Forum

Issues that still need more discussion

• Dynamic runtime sets
  – Temporal
  – Membership
• Covered in other proposals:
  – Thread concurrent vs. non-concurrent
  – Generic error handlers

Page 62: MPI Sessions: a proposal to the MPI Forum

Issues that still need more discussion

• If COMM_WORLD|SELF are not available by default:
  – Do we need new destruction hooks to replace SELF attribute callbacks on FINALIZE?
  – What is the default error handler behavior for functions without comm/file/win?
• Do we need syntactic sugar to get a comm from mpi://WORLD?
• How do tools hook into MPI initialization and finalization?

Page 63: MPI Sessions: a proposal to the MPI Forum

Session queries

• Query session handle equality
  – MPI_Session_query(handle1, handle1_type, handle2, handle2_type, bool *are_they_equal)
  – Not 100% sure we need this…?

Page 64: MPI Sessions: a proposal to the MPI Forum

Session thread support

• Associate thread level support with sessions
• Three options:

1. Similar to MPI-3.1: “first” initialization picks thread level

2. Let each session pick its own thread level (via info key in SESSION_CREATE)

3. Just make MPI always be THREAD_MULTIPLE