TRANSCRIPT
Programming Models and FastOS
Bill Gropp and Rusty Lusk
Bill Gropp <www.mcs.anl.gov/~gropp>, Argonne/MCS/U Chicago
Application View of the OS

- The application makes use of the programming model (may be library calls, may be compiler-generated code).
- It considers all calls the same, or at most distinguishes libc from the programming model.
- Deciding what goes in the runtime and what goes in the OS is very important, but not to the application or (for the most part) the programming model. Just make it fast and correct.

[Diagram: software stack, top to bottom: Application, Programming Model, Node Runtime, Operating System]
Parallel Programming Models

- Shared-nothing: typically communicating OS processes, but they need not be OS processes. What is needed is the appearance of a separate address space (as visible to the programmer) and a separate program counter; there need not be a separate OS entry for each "application process".
- Shared-all: typically either OS processes with shared address spaces, or one process plus threads. The OS need not be involved in each thread of control.
- Processes and processors are different: a single "application process" could use many processors (and what is a "processor"?).
Some Needs of Programming Models

- Job startup: acquire and set up resources
- Job rundown: release resources, even on abnormal exit
- Scheduling: schedule as a job, to match collective operations
- Communication support: allocate and manage resources
- Control: signals, interaction with other jobs, external services
Locality of Calls from the Application Viewpoint

- Local: affects only resources on the processing element.
- Collective: all processes (or any subset) perform a coordinated operation, such as file access or "symmetric malloc".
- Independent non-local: uncoordinated access to an externally managed resource, such as a file system or network. A potential scalability hazard. Two important subsets: cacheable and noncacheable.
Local Calls
[Diagram: the App calls into the NodeOS, which manages the Node Resources]
Independent Non-Local Calls
[Diagram: many Node/App pairs all contact a single Remote Service, uncoordinated ("Argh!!!")]
Collective Calls
- Note that collective calls can be implemented (but not efficiently) with independent non-local calls.
- Metrics are needed to identify and measure scalability goals.

[Diagram: Node/App pairs funnel through a Collective Management layer to the Remote Service]
Job Startup

- Independent read of the executable and shared libraries.
- Hack (useful, but still a hack): capture the file accesses on startup and provide a scalable distribution of the needed data.
- Better solution: define the operations as collective ("collective exec").
- In-between solution: define them as non-local, independent operations on cacheable (read-only) data.
Many Other Examples

- Collective scheduling and signaling (avoid batch)
- gettimeofday
- It is not practical to implement a special-case solution for each system call; we must identify a few strategies and apply them.
Implementing Independent Non-Local Calls

- For each routine, implement special caching code. Example: a DNS cache.
- A more interesting approach: exploit the techniques used for coherent shared-memory caches.
  - Virtual "system pages" (a kind of distributed, coherent /proc)
  - Read-only references can be cached
  - Write references must invalidate
- Special case: allow no consistency if desired (e.g., NFS semantics). Syscalls could choose, but must be consistent by default. Correctness über alles!
Exploiting a Cached-Data View

- The shared OS space approach provides a common way to support scalable implementations of independent non-local calls.
- Caching algorithms provide guidance for both the implementation and the definition: routines should provide useful operations that can be implemented efficiently.
- For operations without a useful caching strategy, use the caching model to implement flow control.
- Provides a naturally distributed approach, but must be careful of faults!
- Is this the answer? Research will tell us. It does point out one possible strategy.
Case Study of Collective Operations

- MPI I/O provides an example of the benefit of collective semantics.
- MPI I/O is not POSIX; however, it is well defined and provides precise semantics that match applications' needs (unlike NFS).
- The benefit is large (100x in some cases), and it comes from more than just the collective calls.
MPI Code to Write a Distributed Mesh to a Single File

- MPI datatypes define the memory layout and the placement in the file.
- The collective write provides scalable, correct output of data from multiple processes to a single file.

    MPI_File_open( comm, ..., &fh );
    MPI_Type_create_subarray( ..., &subarray );
    MPI_Type_commit( &subarray );
    MPI_Type_vector( ..., &memtype );
    MPI_Type_commit( &memtype );
    MPI_File_set_view( fh, ..., subarray, ... );
    MPI_File_write_all( fh, A, 1, memtype, ... );
    MPI_File_close( &fh );
The Four Levels of Access

[Figure: the four access levels plotted as file space versus processes 0-3; Level 0 is independent, Levels 1 and 2 are collective along one axis, and Level 3 is collective along both]
Distributed Array Access: Write Bandwidth
[Chart: write bandwidth at each access level, for runs on 8, 32, 64, and 256 processes; array size: 512 x 512 x 512]
Unstructured Code: Read Bandwidth

[Chart: read bandwidth at each access level, for runs on 8, 32, 64, and 256 processes]
MPI's Collective I/O Calls

- Includes the usual: Open, Close, Seek, Get_position
- Includes collective versions: Read_all, Write_all
- Includes thread-safe versions: Read_at_all, Write_at_all
- Includes nonblocking versions: Read_all_begin/end, Write_all_begin/end, Read_at_all_begin/end, Write_at_all_begin/end
- Includes general data patterns: the application can make a single call instead of many system calls; only a few types cover very general patterns
- Includes explicit coherency control: MPI_File_sync, MPI_File_set_atomicity
MPI as Init

- MPI provides a rich set of collective operations, including collective process creation (MPI_Comm_spawn) and parallel I/O.
- Sample MPI init process:

    while (1) {
        recv syscall;
        switch (syscall_id) {
        case pexec:
            ... use MPI_Comm_split to create a communicator for this process creation
            ... use MPI File I/O to move the executable to the nodes
            ... MPI_Comm_spawn( ... ) to create the processes
            ... remember the new intercommunicator as the handle for the processes
            break;
        ...
        }
    }
What's Missing

- Process control (e.g., signals, ptrace). Some of this was considered; see the MPI "Journal of Development".
- Wait-for-any (no probe on "any intercommunicator"), like wait or poll. A "system"-like operation, not normally appropriate for a well-designed MPI program.
- Precisely defined error actions (not inconsistent with the MPI spec, but because they are not defined there, they would need to be added).
What's Not Missing

- Most of I/O (but directory operations are missing)
- Process creation
- Fault tolerance: the MPI spec is relatively friendly to fault tolerance, more so than current implementations
- Scalability: most (all common) routines are scalable
- Thread safety: e.g., MPI_File_read_at; no global state or (non-constant) global objects
- Most communication
- Could implement much of an OS
Linux System Calls

_llseek, _newselect, _sysctl, access, acct, adjtimex, afs_syscall, alarm, bdflush, break, brk, chdir, chmod, chown, chroot, clone, close, creat, create_module, delete_module, dup, dup2, execve, exit, fchdir, fchmod, fchown, fcntl, fdatasync, flock, fork, fstat, fstatfs, fsync, ftime, ftruncate, get_kernel_syms, getdents, getegid, geteuid, getgid, getgroups, getitimer, getpgid, getpgrp, getpid, getppid, getpriority, getrlimit, getrusage, getsid, gettimeofday, getuid, gtty, idle, init_module, ioctl, ioperm, iopl, ipc, kill, link, lock, lseek, lstat, mkdir, mknod, mlock, mlockall, mmap, modify_ldt, mount, mprotect, mpx, mremap, msync, munlock, munlockall, munmap, nanosleep, nice, oldfstat, oldlstat, oldolduname, oldstat, olduname, open, pause, personality, phys, pipe, prof, profil, ptrace, quotactl, read, readdir, readlink, readv, reboot, rename, rmdir, sched_get_priority_max, sched_get_priority_min, sched_getparam, sched_getscheduler, sched_rr_get_interval, sched_setparam, sched_setscheduler, sched_yield, select, setdomainname, setfsgid, setfsuid, setgid, setgroups, sethostname, setitimer, setpgid, setpriority, setregid, setreuid, setrlimit, setsid, settimeofday, setuid, setup, sgetmask, sigaction, signal, sigpending, sigprocmask, sigreturn, sigsuspend, socketcall, ssetmask, stat, statfs, stime, stty, swapoff, swapon, symlink, sync, sysfs, sysinfo, syslog, time, times, truncate, ulimit, umask, umount, uname, unlink, uselib, ustat, utime, vhangup, vm86, wait4, waitpid, write, writev
Should An OS Be Implemented with MPI?

- Probably not (but it would be interesting to see how close you could come and what else you'd need).
- But many of the concepts used in MPI are applicable:
  - Do not reinvent; exploit existing technologies
  - Use an open process to ensure a solid design
  - Encourage simultaneous experimentation and development
  - Build code that can be run
  - Have a testbed to run it on