Clusters, Fault Tolerance, and Other Thoughts
Daniel S. Katz
JPL/Caltech
SOS7 Meeting
4 March 2003
Cluster 2002 http://www.mcs.anl.gov/cluster2002/
• 2002 IEEE International Conference on Cluster Computing, Chicago, 23-26 Sep. 2002
• Next 2 meetings are:
  • December 2003 in Hong Kong
  • September 2004 in San Diego
• Of the 284 attendees at Cluster 2002 and 120 at SOS7, 23 are common to both meetings
• Motivation:
  • The series of conferences and its sponsor, the Task Force for Cluster Computing (TFCC), were created to:
    • Bring together the cluster community
    • Establish best practices
    • Provide educational material
    • Cross-fertilize ideas between industry and academia
Cluster 2002 Topics
• Running a cluster and making it usable
  • Software for management, including configuration
  • Middleware software
• Building a cluster
  • Software and hardware for networking
  • Choosing node hardware
  • Packaging hardware
• Making use of a cluster
  • New and innovative applications
Cluster 2002 Results and Conclusions
• Positives:
  • Software tools are getting better: management, configuration, and administration
  • Interesting and promising work ongoing in:
    • Self-tuning software
    • Component redundancy
    • Applications
  • Clusters are enabling platforms due to low entry cost
• Negatives:
  • Large (possibly heterogeneous) systems are not easy to build or maintain
  • Systems administration is normally underestimated and un(der)funded
  • Component failure in large systems can be a problem
• Other:
  • Clusters are good for work for which we know they are good
  • Minimum-cost clusters can handle some jobs well
  • A cluster should be designed and built to suit its application's needs
FALSE 2002 http://false2002.vanderbilt.edu/
• Workshop on Fault-Adaptive Large-Scale Real-Time Systems
• Held at Vanderbilt, 14-15 Nov. 2002
• Sponsored by NSF ITR Project: BTeV Real Time Embedded Systems
• Of the 42 attendees at FALSE 2002 and 120 attendees at SOS7, 2 are common to both meetings (Tony Skjellum and I)
• Motivation:
  • The High Energy Physics community wants to build systems to monitor experiments
  • Others (DARPA, NASA) have an interest in similar systems
  • An occasion to share knowledge and plan future research
• Topics:
  • Scaling fault tolerance up to large systems (the Fermi system will have 2-5K PEs)
  • Novel approaches to achieving fault tolerance at low cost (< 10% overhead)
  • How to make fault responses domain-specific (tools that enable the user to specify the response to different failures, and to implement these responses throughout the system); see the sketch after this slide
• Results/Consensus:
  • No results from this initial meeting; just information sharing (with complete consensus)
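Not part of the original slides: a minimal sketch of the domain-specific fault-response idea mentioned above, assuming a hypothetical FaultManager registry. The user registers a response per failure class, and a dispatcher applies the registered response when a fault is reported. All names (FaultManager, node_down, data_corruption) are illustrative assumptions, not from the FALSE 2002 tools.

```python
# Sketch (an assumption, not from the talk) of domain-specific fault responses:
# the application registers a response per failure class, and the dispatcher
# applies the registered response when a fault is reported.

from typing import Callable, Dict

class FaultManager:
    """Registry mapping failure classes to user-specified responses."""

    def __init__(self) -> None:
        self._responses: Dict[str, Callable[[dict], None]] = {}

    def register(self, fault_kind: str, response: Callable[[dict], None]) -> None:
        """Associate a fault kind (e.g. 'node_down') with a response callback."""
        self._responses[fault_kind] = response

    def report(self, fault_kind: str, details: dict) -> None:
        """Dispatch a detected fault to its registered response, if any."""
        handler = self._responses.get(fault_kind)
        if handler is None:
            print(f"unhandled fault: {fault_kind} {details}")  # fall back to logging
        else:
            handler(details)

# Hypothetical application-specific responses.
def restart_task_elsewhere(details: dict) -> None:
    print(f"migrating task {details['task']} off node {details['node']}")

def drop_corrupted_event(details: dict) -> None:
    print(f"discarding event {details['event']} (checksum mismatch)")

manager = FaultManager()
manager.register("node_down", restart_task_elsewhere)
manager.register("data_corruption", drop_corrupted_event)

manager.report("node_down", {"task": 42, "node": "n17"})
manager.report("data_corruption", {"event": 9001})
```

In a real large-scale system the registry and dispatcher would run alongside the fault detectors on every node rather than in a single process; the sketch only illustrates the registration-and-dispatch pattern.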
General Thoughts
• Fault tolerance is becoming important to large-scale systems
  • Embedded and non-embedded systems
  • Real-time and non-real-time systems
• Is there a common solution (or partial solution) to this issue?
• “There is no software problem an additional layer of abstraction won’t solve”
Thanks
• Questions?