TRANSCRIPT
FastOS Workshop: PROSE Presentation
Systems Support for
Many Task Computing
Holistic Aggregate Resource Environment
Eric Van Hensbergen (IBM) and Ron Minnich (Sandia National Labs)
Motivation
Overview of Approach
Targeting Blue Gene/P to provide a complementary runtime environment
Using Plan 9 Research Operating System
Right Weight Kernel: balances simplicity and function
Built from the ground up as a distributed system
Leverage HPC interconnects for system services
Distribute system services among compute nodes
Leverage aggregation as a first-class systems construct to help manage complexity and provide a foundation for scalability, reliability, and efficiency.
Related Work
Default Blue Gene runtime
Linux on I/O nodes + CNK on compute nodes
High Throughput Computing (HTC) Mode
Compute Node Linux
ZeptoOS
Kittyhawk
Foundation: Plan 9 Distributed System
Right Weight Kernel
General-purpose multi-threaded, multi-user environment
Pleasantly portable
Relatively Lightweight (compared to Linux)
Core Principles
All resources are synthetic file hierarchies
Local & remote resources accessed via simple API
Each thread can dynamically organize local and remote resources via a dynamic private namespace (a minimal sketch follows below)
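As a concrete illustration of the namespace principle, here is a minimal Plan 9 C sketch. The path /n/remote/bin is an assumption, standing for any remote resource already mounted under /n.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char *argv[] = { "date", nil };

        /* Detach from the parent's namespace: all changes below
         * are private to this process and its children. */
        if(rfork(RFNAMEG) < 0)
            sysfatal("rfork: %r");

        /* Union a remote node's /bin after the local one; the
         * source path is illustrative and assumed to be mounted. */
        if(bind("/n/remote/bin", "/bin", MAFTER) < 0)
            sysfatal("bind: %r");

        /* The process now sees the combined hierarchy; nothing
         * else on the system is affected. */
        exec("/bin/date", argv);
        sysfatal("exec: %r");
    }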
Everything Represented as File Systems
[Diagram: hardware devices, system services, and application services all exposed as file hierarchies]
Hardware devices: disk (/dev/hda1, /dev/hda2), network (/dev/eth0), console, audio, etc.
System services: TCP/IP stack (/net: /arp, /udp, /stats, plus per-connection /tcp/0, /tcp/1, ... each with /ctl, /data, /listen, /local, /remote, /status, allocated via /tcp/clone), DNS (/net/cs, /net/dns), GUI (/win: /clone and per-window /0, /1, ... each with /ctl, /data, /refresh)
Application services: wiki, authentication, and service control; process control, debug, etc.
Dialing a TCP connection through /net with nothing but ordinary file operations is sketched below.
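Because the TCP/IP stack is a file hierarchy, dialing a connection needs only open, read, and write. A minimal Plan 9 C sketch; 192.0.2.1!80 is a placeholder address.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char num[16], path[64];
        int ctl, data, n;

        /* Reading the clone file allocates a fresh connection and
         * returns its number; the fd doubles as its ctl file. */
        ctl = open("/net/tcp/clone", ORDWR);
        if(ctl < 0)
            sysfatal("open clone: %r");
        n = read(ctl, num, sizeof num - 1);
        if(n <= 0)
            sysfatal("read clone: %r");
        num[n] = '\0';

        /* The connection is driven by plain-text ctl commands. */
        if(fprint(ctl, "connect 192.0.2.1!80") < 0)
            sysfatal("connect: %r");

        /* Payload moves through the data file like any other file. */
        snprint(path, sizeof path, "/net/tcp/%d/data", atoi(num));
        data = open(path, ORDWR);
        if(data < 0)
            sysfatal("open data: %r");
        fprint(data, "GET / HTTP/1.0\r\n\r\n");
    }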
Plan 9 Networks
[Diagram: a typical Plan 9 network. CPU servers, a file server, and content-addressable storage share a high-bandwidth (10 Gb/s) network; terminals hang off a LAN (1 Gb/s); PDAs, smartphones, set-top boxes, and screen phones reach the same services over the Internet via WiFi/EDGE and cable/DSL.]
An Issue of Scale
Chip: BG/P, 4-way
Compute card: 2 chips
Node card (4x4x2): 32 compute nodes, 0-2 I/O cards
Rack: 32 node cards
System: 72 racks
Aggregation as a First Class Concept
[Diagram: a proxy service interposes between local clients and a set of remote services, presenting them as a single aggregate service alongside purely local services. A sketch of building such an aggregate view with union binds follows below.]
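One plausible realization in Plan 9 is to union-bind per-node resources into a single hierarchy that a proxy can then re-export. A sketch, assuming remote nodes are already imported under the illustrative paths /n/node0 through /n/node3.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char path[64];
        char *argv[] = { "ls", "/proc", nil };
        int i;

        rfork(RFNAMEG);

        /* Stack several remote process trees behind the local one,
         * yielding one aggregate /proc. Paths are illustrative. */
        for(i = 0; i < 4; i++){
            snprint(path, sizeof path, "/n/node%d/proc", i);
            if(bind(path, "/proc", MAFTER) < 0)
                fprint(2, "skipping %s: %r\n", path);
        }

        /* A proxy service could now re-export this union (e.g. with
         * exportfs), so clients see many nodes as one service. */
        exec("/bin/ls", argv);
        sysfatal("exec: %r");
    }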
Issues of Topology
File Cache Example
Proxy Service
Monitors access to remote file server & local resources
Local cache mode
Collaborative cache mode
Designated cache server(s)
Integrate replication and redundancy
Explore write coherence via territories, à la Envoy
Based on experiences with the Xget deployment model
Leverage the natural topology of the machine where possible (a hypothetical cache ctl interface is sketched below)
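If the proxy exposed its cache policy as a synthetic ctl file, switching between local, collaborative, and designated-server modes would be a one-line write. The path /n/cache/ctl and the mode command below are hypothetical; the slides do not define this interface.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        int fd;

        /* Hypothetical ctl file exposed by the caching proxy. */
        fd = open("/n/cache/ctl", OWRITE);
        if(fd < 0)
            sysfatal("open ctl: %r");

        /* Hypothetical command: switch to collaborative caching. */
        if(fprint(fd, "mode collaborative") < 0)
            sysfatal("write ctl: %r");
        close(fd);
    }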
Monitoring Example
Distribute monitoring throughout the system
Use for system health monitoring and load balancing
Allow for application-specific monitoring agents
Distribute filtering & control agents at key points in topology
Allow for localized monitoring and control as well as high-level global reporting and control
Explore both push and pull models
Based on experiences with the supermon system (a stats-reader sketch follows below)
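A filtering agent could publish aggregated statistics as a synthetic file that consumers simply read. A Plan 9 C sketch, assuming a hypothetical /n/mon/stats file with one "node load" pair per line:

    #include <u.h>
    #include <libc.h>
    #include <bio.h>

    void
    main(void)
    {
        Biobuf *b;
        char *line, *f[2];
        double sum;
        int n;

        /* Hypothetical aggregate published by a filtering agent. */
        b = Bopen("/n/mon/stats", OREAD);
        if(b == nil)
            sysfatal("Bopen: %r");

        sum = 0;
        n = 0;
        while((line = Brdline(b, '\n')) != nil){
            line[Blinelen(b)-1] = '\0';
            /* Each line: node name, then its load. */
            if(tokenize(line, f, 2) == 2){
                sum += atof(f[1]);
                n++;
            }
        }
        if(n > 0)
            print("mean load over %d nodes: %g\n", n, sum/n);
        Bterm(b);
    }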
Workload Management Example
Provide a file system interface to job execution and scheduling.
Allows scheduling of new work from within the cluster, using localized as well as global scheduling controls.
Can allow for more organic growth of workloads as well as top-down and bottom-up models.
Can be extended to allow direct access from end-user workstations.
Based on experiences with the Xcpu mechanism (an Xcpu-style session is sketched below).
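In this style, spawning work is just file I/O: clone a session, write a command to its ctl file, read results from stdout. The paths and the exec command below illustrate the flavor of an Xcpu-like interface rather than its exact protocol.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char sess[32], path[128], buf[256];
        int fd, n;

        /* Reading clone allocates a job session on the target node
         * and returns its directory name. Paths are illustrative. */
        fd = open("/n/xcpu/clone", ORDWR);
        if(fd < 0)
            sysfatal("clone: %r");
        n = read(fd, sess, sizeof sess - 1);
        if(n <= 0)
            sysfatal("read clone: %r");
        sess[n] = '\0';

        /* Describe the job as text written to the session's ctl. */
        snprint(path, sizeof path, "/n/xcpu/%s/ctl", sess);
        fd = open(path, OWRITE);
        fprint(fd, "exec /bin/date");
        close(fd);

        /* Results come back as ordinary reads on stdout. */
        snprint(path, sizeof path, "/n/xcpu/%s/stdout", sess);
        fd = open(path, OREAD);
        while((n = read(fd, buf, sizeof buf)) > 0)
            write(1, buf, n);
    }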
Status
Initial Port to BG/P 90% Complete
Applications
Linux emulation environment
CNK emulation environment
Native ports of applications
Also have a port of the Inferno virtual machine to BG/P
Runs on Kittyhawk as well as natively
Baseline boot & runtime infrastructure complete
HARE Team
David Eckhardt (Carnegie Mellon University)
Charles Forsyth (Vitanuova)
Jim McKie (Bell Labs)
Ron Minnich (Sandia National Labs)
Eric Van Hensbergen (IBM Research)
Thanks
Funding
This material is based upon work supported by the Department of Energy under Award Number DE-FG02-08ER25851.
Resources
This work is being conducted on resources provided by the Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.
Information
The authors would also like to thank the IBM Research Blue Gene team along with the IBM Research Kittyhawk team for their assistance.
Questions? Discussion?
Links
FastOS Web Site: http://www.cs.unm.edu/~fastos/
Phase II CFP: http://www.sc.doe.gov/grants/FAPN07-23.html
Blue Gene: http://www.research.ibm.com/bluegene/
Plan 9: http://plan9.bell-labs.com/plan9
LibraryOS: http://www.research.ibm.com/prose
Plan 9 Characteristics
Kernel Breakdown - Lines of Code
Architecture-specific code (BG/L): ~10,000 lines of code
Portable code (port): ~25,000 lines of code
TCP/IP stack: ~14,000 lines of code
Binary Sizes
415k text + 140k data + 107k BSS (≈662k total)
Runtime Memory Footprint
~4 MB for compute node kernels; could be smaller or larger depending on application-specific tuning
Why not Linux?
Not a distributed system
Core systems inflexible
VM subsystem designed around the x86 MMU
Networking tightly tied to sockets & TCP/IP, with a long call path
Typical installations extremely overweight and noisy
Benefits of modularity and open source are outweighed by complexity, dependencies, and a rapid rate of change
Community has become conservative
Support for alternative interfaces is waning
Support for large systems that hurts small systems is not acceptable
Ultimately a customer constraint
FastOS was developed to prevent an OS monoculture in HPC
Few Linux projects were even invited to submit final proposals
[Figure: FTQ benchmark on a BG/L I/O node running Linux]
[Figure: FTQ benchmark on a BG/L I/O node running Plan 9]
Right Weight Kernels Project (Phase I)
Motivation
OS effect on applications: the metric is OS interference on the FWQ & FTQ benchmarks (a minimal FTQ loop is sketched after this list)
AIX/Linux has more capability than many apps need
LWK and CNK have less capability than apps want
Approach
Customize the kernel to the application
Ongoing Challenges
Need to balance capability with overhead
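For reference, the core of FTQ fits in a few lines: count units of work completed in each fixed time quantum; on a quiet kernel the counts are nearly constant, and dips expose OS interference. A minimal Plan 9 C sketch (the real benchmark calibrates work units and timer overhead much more carefully):

    #include <u.h>
    #include <libc.h>

    enum {
        QUANTUM = 1000000,      /* fixed quantum: 1 ms in ns */
        SAMPLES = 1000,
    };

    void
    main(void)
    {
        vlong start, work, counts[SAMPLES];
        int i;

        for(i = 0; i < SAMPLES; i++){
            start = nsec();
            work = 0;
            /* Spin until the quantum expires, counting work done. */
            while(nsec() - start < QUANTUM)
                work++;
            counts[i] = work;
        }

        /* Variation across samples is the OS noise signal. */
        for(i = 0; i < SAMPLES; i++)
            print("%lld\n", counts[i]);
    }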
Why Blue Gene?
Readily available large-scale cluster
Minimum allocation is 37 nodes
Easy to get 512 and 1024 node configurations
Up to 8192 nodes available upon request internally
FastOS will make 64k configuration available
DOE interest: Blue Gene was a specified target
Variety of interconnects allows exploration of alternatives
Embedded-core design provides a simple architecture that is quick to port to and doesn't require heavyweight systems software management, device drivers, or firmware
Department of Energy FastOS CFP
a.k.a. Operating and Runtime System for Extreme Scale Scientific Computation (DE-PS02-07ER07-23)
Goal
Stimulate R&D related to operating and runtime systems for petascale systems in the 2010 to 2015 time frame.
Expected Output
A unified operating and runtime system that could fully support and exploit petascale and beyond systems.
Near-Term Hardware Targets
Blue Gene, Cray XD3, and HPCS machines.
Blue Gene Interconnects
3-Dimensional Torus
Interconnects all compute nodes (65,536)
Virtual cut-through hardware routing
1.4 Gb/s on all 12 node links (12 × 1.4 Gb/s ≈ 16.8 Gb/s, i.e. 2.1 GB/s per node)
1 µs latency between nearest neighbors, 5 µs to the farthest
4 µs latency for one hop with MPI, 10 µs to the farthest
Communications backbone for computations
0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
Global Tree
Interconnects all compute and I/O nodes (1024)
One-to-all broadcast functionality
Reduction operations functionality
2.8 Gb/s of bandwidth per link
Latency of one-way tree traversal: 2.5 µs
~23 TB/s total binary tree bandwidth (64k machine)
Ethernet
Incorporated into every node ASIC
Active in the I/O nodes (1:64)
All external comm. (file I/O, control, user interaction, etc.)
Low-Latency Global Barrier and Interrupt
Round-trip latency: 1.3 µs
Control Network