Programming Models and FastOS Bill Gropp and Rusty Lusk


Page 1: Programming Models and FastOS Bill Gropp and Rusty Lusk

Programming Models and FastOS

Bill Gropp and Rusty Lusk

Page 2: Programming Models and FastOS Bill Gropp and Rusty Lusk


Application View of the OS

- The application makes use of the programming model (may be calls, may be compiler-generated)
- It considers all calls the same, or at most distinguishes libc from the programming model
- Deciding what goes in the runtime and what goes in the OS is very important
  - But not to the application or (for the most part) the programming model
  - Just make it fast and correct

[Figure: software stack on a node: Application / Programming Model / Node Runtime / Operating System]


Page 4: Programming Models and FastOS Bill Gropp and Rusty Lusk


Parallel Programming Models

- Shared-nothing (see the MPI sketch below)
  - Typically communicating OS processes, but they need not be OS processes
  - Need the appearance of a separate address space (as visible to the programmer) and a program counter
  - Need not have separate OS entries for each “application process”
- Shared-all (see the threads sketch below)
  - Typically either OS processes with shared address spaces, or one process plus threads
  - Need not involve the OS in each thread of control
- Processes and processors are different
  - A single “application process” could use many processors (and what is a “processor”?)
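As a concrete illustration (not from the slides), the two styles can be contrasted with a minimal reduction written both ways; the values computed are arbitrary. The shared-nothing version uses MPI processes with private data and explicit communication:

/* Shared-nothing: each MPI process owns a private partial value and the
 * result is combined only through explicit communication. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1.0;          /* private data: no shared address space */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("sum over %d processes = %g\n", size, total);
    MPI_Finalize();
    return 0;
}

The shared-all version uses POSIX threads updating a variable in one shared address space; the OS is involved only in creating the threads, not in each unit of work:

/* Shared-all: threads in a single process update one shared variable. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    double my = (double)(long)arg + 1.0;   /* thread-private value */
    pthread_mutex_lock(&lock);
    total += my;                           /* shared address space */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work, (void *)i);
    for (long i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("sum over %d threads = %g\n", NTHREADS, total);
    return 0;
}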

Page 5: Programming Models and FastOS Bill Gropp and Rusty Lusk


Some Needs of Programming Models

- Job startup: acquire and set up resources
- Job rundown: release resources, even on abnormal exit
- Scheduling: schedule as a job, to match collective operations
- Communication support: allocate and manage resources
- Control: signals, interaction with other jobs, external services

Page 6: Programming Models and FastOS Bill Gropp and Rusty Lusk


Locality of Calls from the Application Viewpoint

- Local: only affects resources on the processing element
- Collective: all (or any subset) perform a coordinated operation, such as file access or “symmetric malloc”
- Independent non-local: uncoordinated access to an externally managed resource, such as a file system or network
  - Potential scalability hazard
  - Two important subsets: cacheable and noncacheable

Page 7: Programming Models and FastOS Bill Gropp and Rusty Lusk


Local Calls

[Figure: the App calls the NodeOS, which manages Node Resources; everything stays on the node]

Page 8: Programming Models and FastOS Bill Gropp and Rusty Lusk


Independent Non-Local Calls

[Figure: many Node/App pairs each making uncoordinated calls to a single Remote Service: “Argh!!!”]

Page 9: Programming Models and FastOS Bill Gropp and Rusty Lusk


Collective Calls

- Note that collective calls can be implemented (but not efficiently) with non-local independent calls
- Metrics are needed to identify and measure scalability goals

[Figure: Node/App pairs routed through Collective Management to the Remote Service]

Page 10: Programming Models and FastOS Bill Gropp and Rusty Lusk


Job Startup

- Independent read of the executable and shared libraries
- Hack (useful, but still a hack): capture the file accesses on startup and provide a scalable distribution of the needed data
- Better solution: define the operations as collective, a “collective exec” (sketched below)
- In-between solution: define as a non-local, independent operation on cacheable (read-only) data
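A minimal sketch of the “collective exec” idea, assuming MPI is available to the job-startup service; the helper name, the temporary-file path, and the omission of error handling are all illustrative rather than part of the slides. Rank 0 reads the executable once and broadcasts it, so the file server sees one read instead of one per node:

/* Hypothetical sketch of a "collective exec": one process reads the
 * executable, everyone else receives it by broadcast, stages a node-local
 * copy, and execs it.  Conceptual only; error handling is omitted. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

void collective_exec(const char *path, MPI_Comm comm)
{
    int rank, fd;
    long size = 0;
    char *image = NULL;
    char local[] = "/tmp/collexec-XXXXXX";

    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {                       /* only rank 0 touches the file system */
        FILE *f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        fseek(f, 0, SEEK_SET);
        image = malloc(size);
        if (fread(image, 1, size, f) != (size_t)size) { /* handle read error */ }
        fclose(f);
    }
    MPI_Bcast(&size, 1, MPI_LONG, 0, comm);
    if (rank != 0) image = malloc(size);
    MPI_Bcast(image, (int)size, MPI_BYTE, 0, comm);   /* scalable distribution */

    fd = mkstemp(local);                   /* stage a node-local copy */
    if (write(fd, image, size) != (ssize_t)size) { /* handle short write */ }
    fchmod(fd, 0700);
    close(fd);
    free(image);

    execl(local, local, (char *)NULL);     /* replace this process with the image */
}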

Page 11: Programming Models and FastOS Bill Gropp and Rusty Lusk


Many Other Examples

- Collective scheduling and signaling (avoid batch)
- gettimeofday
- It is not practical to implement a special-case solution for each system call
  - Must identify a few strategies and apply them

Page 12: Programming Models and FastOS Bill Gropp and Rusty Lusk


Implementing Independent Non-Local Calls

- For each routine, implement special caching code
  - Example: a DNS cache (a sketch follows this list)
- More interesting approach: exploit techniques used for coherent shared-memory caches
  - Virtual “system pages” (a kind of distributed, coherent /proc)
  - Read-only references can be cached
  - Write references must invalidate
  - Special case: allow no consistency if desired (e.g., NFS semantics)
  - Syscalls could choose, but must be consistent by default. Correctness über alles!
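As a hedged illustration of the per-routine caching approach, a node-local read-through cache for hostname lookup might look like the following; the table size, hash, and wrapper name are invented for the example:

/* Illustrative read-through cache for one "independent non-local" call:
 * hostname lookup.  Repeated lookups on a node are served locally instead
 * of going back to the remote, shared DNS service. */
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define CACHE_SLOTS 64

struct dns_entry { char name[256]; struct in_addr addr; int valid; };
static struct dns_entry cache[CACHE_SLOTS];

int cached_lookup(const char *name, struct in_addr *out)
{
    unsigned h = 0;
    const char *p;
    struct hostent *he;

    for (p = name; *p; p++) h = 31 * h + (unsigned char)*p;
    h %= CACHE_SLOTS;

    if (cache[h].valid && strcmp(cache[h].name, name) == 0) {  /* hit: local only */
        *out = cache[h].addr;
        return 0;
    }
    he = gethostbyname(name);              /* miss: one remote query */
    if (!he || he->h_addrtype != AF_INET) return -1;

    memcpy(out, he->h_addr_list[0], sizeof(*out));
    strncpy(cache[h].name, name, sizeof(cache[h].name) - 1);
    cache[h].name[sizeof(cache[h].name) - 1] = '\0';
    cache[h].addr = *out;
    cache[h].valid = 1;                    /* read-only data, so caching is safe */
    return 0;
}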

Page 13: Programming Models and FastOS Bill Gropp and Rusty Lusk


Exploiting a Cached-Data View

- A shared OS-space approach provides a common way to support scalable implementation of independent non-local calls
- Caching algorithms provide guidance for the implementation and the definition: routines should provide useful operations that can be implemented efficiently
- For operations without a useful caching strategy, use the caching model to implement flow control
- Provides a naturally distributed approach
  - Must be careful of faults!
- Is this the answer? Research will tell us. It does point out one possible strategy.

Page 14: Programming Models and FastOS Bill Gropp and Rusty Lusk


Case Study of Collective Operations

- MPI I/O provides an example of the benefit of collective semantics
- MPI I/O is not POSIX; however, it is well defined and provides precise semantics that match applications’ needs (unlike NFS)
- The benefit is large (100x in some cases)
- More than just collective

Page 15: Programming Models and FastOS Bill Gropp and Rusty Lusk


MPI Code to Write a Distributed Mesh to a Single File

- MPI datatypes define the memory layout and the placement in the file
- A collective write provides scalable, correct output of data from multiple processes to a single file

MPI_File_open( comm, ..., &fh );
MPI_Type_create_subarray( ..., &subarray, ... );
MPI_Type_commit( &subarray );
MPI_Type_vector( ..., &memtype );
MPI_Type_commit( &memtype );
MPI_File_set_view( fh, ..., subarray, ... );
MPI_File_write_all( fh, A, 1, memtype, ... );
MPI_File_close( &fh );
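To make the fragment above concrete, here is one complete, illustrative version for a 2-D array distributed by blocks of rows. The array sizes and file name are arbitrary, and because each local block is contiguous in memory the slide's memory-side MPI_Type_vector is not needed here:

/* Each process owns one block of rows of a 2-D array; all processes write
 * it to a single file with one collective call. */
#include <mpi.h>
#include <stdlib.h>

#define NROWS 64          /* global rows (assumed divisible by nprocs) */
#define NCOLS 64          /* global columns */

int main(int argc, char **argv)
{
    int rank, nprocs, lrows;
    int gsizes[2], lsizes[2], starts[2];
    double *A;
    MPI_Datatype subarray;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    lrows = NROWS / nprocs;                       /* local block of rows */
    A = (double *)malloc(lrows * NCOLS * sizeof(double));
    for (int i = 0; i < lrows * NCOLS; i++) A[i] = rank;

    /* Filetype: where this process's block lives in the global array/file */
    gsizes[0] = NROWS;  gsizes[1] = NCOLS;
    lsizes[0] = lrows;  lsizes[1] = NCOLS;
    starts[0] = rank * lrows;  starts[1] = 0;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &subarray);
    MPI_Type_commit(&subarray);

    MPI_File_open(MPI_COMM_WORLD, "mesh.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, subarray, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, A, lrows * NCOLS, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&subarray);
    free(A);
    MPI_Finalize();
    return 0;
}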

Page 16: Programming Models and FastOS Bill Gropp and Rusty Lusk


The Four Levels of Access

[Figure: file space vs. processes 0-3, showing the four access levels (Level 0 through Level 3): independent accesses, collective along one axis, and collective along both]

Page 17: Programming Models and FastOS Bill Gropp and Rusty Lusk


Distributed Array Access: Write Bandwidth

[Chart: write bandwidth for the access levels on several systems (8, 32, 64, and 256 processes); array size 512 x 512 x 512]

Page 18: Programming Models and FastOS Bill Gropp and Rusty Lusk


Unstructured Code: Read Bandwidth

[Chart: read bandwidth for the access levels on several systems (8, 32, 64, and 256 processes)]

Page 19: Programming Models and FastOS Bill Gropp and Rusty Lusk


MPI’s Collective I/O Calls

- Includes the usual: Open, Close, Seek, Get_position
- Includes collective versions: Read_all, Write_all
- Includes thread-safe versions: Read_at_all, Write_at_all
- Includes nonblocking versions: Read_all_begin/end, Write_all_begin/end, Read_at_all_begin/end, Write_at_all_begin/end (see the sketch after this list)
- Includes general data patterns
  - The application can make a single system call instead of many
  - Only four types cover very general patterns
- Includes explicit coherency control: MPI_File_sync, MPI_File_set_atomicity

Page 20: Programming Models and FastOS Bill Gropp and Rusty Lusk


MPI as Init

- MPI provides a rich set of collective operations
- Includes collective process creation (MPI_Comm_spawn) and parallel I/O

Sample MPI init process (pseudocode):

while (1) {
    recv syscall
    switch (syscall_id) {
    case pexec:
        ... use MPI_Comm_split to create a communicator for this process creation
        ... use MPI File I/O to move the executable to the nodes
        ... MPI_Comm_spawn( ... ) to create the processes
        ... remember the new intercommunicator as the handle for the processes
        break;
    ...
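A hedged C rendering of that loop; the request structure, message tag, and PEXEC code are invented for the example, the communicator splitting and executable staging from the pseudocode are omitted, and only the MPI calls themselves are standard:

/* Toy "MPI as init" server: receive syscall-like requests and create
 * processes collectively with MPI_Comm_spawn. */
#include <mpi.h>

#define TAG_SYSCALL 1
#define PEXEC       1        /* hypothetical request code */

struct request { int syscall_id; int nprocs; char command[256]; };

void init_server(MPI_Comm clients)
{
    struct request req;
    MPI_Status status;
    MPI_Comm children;
    int running = 1;

    while (running) {
        /* Receive one "syscall" request from any client process */
        MPI_Recv(&req, (int)sizeof(req), MPI_BYTE, MPI_ANY_SOURCE,
                 TAG_SYSCALL, clients, &status);

        switch (req.syscall_id) {
        case PEXEC:
            /* Collective process creation; the new intercommunicator is the
             * handle used to manage the spawned processes afterwards. */
            MPI_Comm_spawn(req.command, MPI_ARGV_NULL, req.nprocs,
                           MPI_INFO_NULL, 0, MPI_COMM_SELF,
                           &children, MPI_ERRCODES_IGNORE);
            break;
        default:
            running = 0;     /* unknown request: stop, in this toy version */
            break;
        }
    }
}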

Page 21: Programming Models and FastOS Bill Gropp and Rusty Lusk


What’s Missing

- Process control (e.g., signals, ptrace)
  - Some of this was considered; see the MPI “Journal of Development”
- Wait-for-any (no probe on any intercommunicator), like wait or poll
  - A “system”-like operation, not normally appropriate for a well-designed MPI program
- Precisely defined error actions (not inconsistent with the MPI spec, but because they are not defined, they would need to be added)

Page 22: Programming Models and FastOS Bill Gropp and Rusty Lusk


What’s Not Missing

- Most of I/O (but directory operations are missing)
- Process creation
- Fault tolerance
  - The MPI spec is relatively friendly to fault tolerance, more so than current implementations
- Scalability: most (all common) routines are scalable
- Thread safety: e.g., MPI_File_read_at; no global state or (non-constant) global objects
- Most communication
- Could implement much of an OS

Page 23: Programming Models and FastOS Bill Gropp and Rusty Lusk


Linux System Calls

_llseek, _newselect, _sysctl, access, acct, adjtimex, afs_syscall, alarm, bdflush, break, brk, chdir, chmod, chown, chroot, clone, close, creat, create_module, delete_module, dup, dup2, execve, exit, fchdir, fchmod, fchown,
fcntl, fdatasync, flock, fork, fstat, fstatfs, fsync, ftime, ftruncate, get_kernel_syms, getdents, getegid, geteuid, getgid, getgroups, getitimer, getpgid, getpgrp, getpid, getppid, getpriority, getrlimit, getrusage, getsid, gettimeofday, getuid, gtty, idle,
init_module, ioctl, ioperm, iopl, ipc, kill, link, lock, lseek, lstat, mkdir, mknod, mlock, mlockall, mmap, modify_ldt, mount, mprotect, mpx, mremap, msync, munlock, munlockall, munmap, nanosleep, nice, oldfstat,
oldlstat, oldolduname, oldstat, olduname, open, pause, personality, phys, pipe, prof, profil, ptrace, quotactl, read, readdir, readlink, readv, reboot, rename, rmdir, sched_get_priority_max, sched_get_priority_min, sched_getparam, sched_getscheduler, sched_rr_get_interval, sched_setparam, sched_setscheduler, sched_yield,
select, setdomainname, setfsgid, setfsuid, setgid, setgroups, sethostname, setitimer, setpgid, setpriority, setregid, setreuid, setrlimit, setsid, settimeofday, setuid, setup, sgetmask, sigaction, signal, sigpending, sigprocmask, sigreturn, sigsuspend, socketcall, ssetmask, stat,
statfs, stime, stty, swapoff, swapon, symlink, sync, sysfs, sysinfo, syslog, time, times, truncate, ulimit, umask, umount, uname, unlink, uselib, ustat, utime, vhangup, vm86, wait4, waitpid, write, writev


Page 25: Programming Models and FastOS Bill Gropp and Rusty Lusk


Should an OS Be Implemented with MPI?

- Probably not (but it would be interesting to see how close you could come and what else you’d need)
- But many of the concepts used in MPI are applicable
  - Do not reinvent; exploit existing technologies
  - Use an open process to ensure a solid design
  - Encourage simultaneous experimentation and development
  - Build code that can be run
  - Have a testbed to run it on