Clusters, Fault Tolerance, and Other Thoughts
Daniel S. Katz
JPL/Caltech
SOS7 Meeting
4 March 2003
Cluster 2002 http://www.mcs.anl.gov/cluster2002/
• 2002 IEEE International Conference on Cluster Computing, Chicago, 23-26 Sep. 2002
• Next 2 meetings are:
  • December 2003 in Hong Kong
  • September 2004 in San Diego
• Of the 284 attendees at Cluster 2002 and 120 at SOS7, 23 are common to both meetings
• Motivation:
  • The series of conferences and its sponsor, the Task Force for Cluster Computing (TFCC), were created to:
    • Bring together the cluster community
    • Establish best practices
    • Provide educational material
    • Cross-fertilize ideas between industry and academia
Cluster 2002 Topics
• Running a cluster and making it usable
  • Software for management, including configuration
  • Middleware software
• Building a cluster
  • Software and hardware for networking
  • Choosing node hardware
  • Packaging hardware
• Making use of a cluster
  • New and innovative applications
Cluster 2002 Results and Conclusions
• Positives:
  • Software tools are getting better: management, configuration, and administration
  • Interesting and promising work ongoing in:
    • Self-tuning software
    • Component redundancy
    • Applications
  • Clusters are enabling platforms due to low entry cost
• Negatives:
  • Large (possibly heterogeneous) systems are not easy to build or maintain
  • Systems administration is normally underestimated and un(der)funded
  • Component failure in large systems can be a problem
• Other:
  • Clusters are good for work for which we know they are good
  • Minimum-cost clusters can handle some jobs well
  • A cluster should be designed and built to suit its application's needs
FALSE 2002 http://false2002.vanderbilt.edu/
• Workshop on Fault-Adaptive Large-Scale Real-Time Systems
• Held at Vanderbilt, 14-15 Nov. 2002
• Sponsored by NSF ITR Project: BTeV Real Time Embedded Systems
• Of the 42 attendees at FALSE 2002 and 120 attendees at SOS7, 2 are common to both meetings (Tony Skjellum and I)
• Motivation:
  • The High Energy Physics community wants to build systems to monitor experiments
  • Others (DARPA, NASA) have an interest in similar systems
  • An occasion to share knowledge and plan future research
• Topics:
  • Scaling fault tolerance up to large systems (the Fermi system will have 2-5K PEs)
  • Novel approaches to achieving fault tolerance at low cost (< 10% overhead)
  • How to make fault responses domain-specific (tools that enable the user to specify the response to different failures, and to implement these responses throughout the system); see the sketch after this slide
• Results/Consensus:
  • No results from this initial meeting; just information sharing (with complete consensus)
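Not part of the original slides: a minimal sketch of the domain-specific fault-response idea mentioned above, assuming a hypothetical FaultManager registry. The user registers a response per failure class, and a dispatcher applies the registered response when a fault is reported. All names (FaultManager, node_down, data_corruption) are illustrative assumptions, not from the FALSE 2002 tools.

```python
# Sketch (an assumption, not from the talk) of domain-specific fault responses:
# the application registers a response per failure class, and the dispatcher
# applies the registered response when a fault is reported.

from typing import Callable, Dict

class FaultManager:
    """Registry mapping failure classes to user-specified responses."""

    def __init__(self) -> None:
        self._responses: Dict[str, Callable[[dict], None]] = {}

    def register(self, fault_kind: str, response: Callable[[dict], None]) -> None:
        """Associate a fault kind (e.g. 'node_down') with a response callback."""
        self._responses[fault_kind] = response

    def report(self, fault_kind: str, details: dict) -> None:
        """Dispatch a detected fault to its registered response, if any."""
        handler = self._responses.get(fault_kind)
        if handler is None:
            print(f"unhandled fault: {fault_kind} {details}")  # fall back to logging
        else:
            handler(details)

# Hypothetical application-specific responses.
def restart_task_elsewhere(details: dict) -> None:
    print(f"migrating task {details['task']} off node {details['node']}")

def drop_corrupted_event(details: dict) -> None:
    print(f"discarding event {details['event']} (checksum mismatch)")

manager = FaultManager()
manager.register("node_down", restart_task_elsewhere)
manager.register("data_corruption", drop_corrupted_event)

manager.report("node_down", {"task": 42, "node": "n17"})
manager.report("data_corruption", {"event": 9001})
```

In a real large-scale system the registry and dispatcher would run alongside the fault detectors on every node rather than in a single process; the sketch only illustrates the registration-and-dispatch pattern.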
General Thoughts
• Fault tolerance is becoming important to large-scale systems
  • Embedded and non-embedded systems
  • Real-time and non-real-time systems
• Is there a common solution (or partial solution) to this issue?
• “There is no software problem an additional layer of abstraction won’t solve”
Thanks
• Questions?