TRANSCRIPT
Slide 1: Establishing an inter-organisational OGSA Grid: Lessons Learned
Wolfgang Emmerich
London Software Systems, Dept. of Computer Science
University College London, Gower St, London WC1E 6BT, U.K.
http://www.sse.ucl.ac.uk/UK-OGSA
Slide 2: An Experimental UK OGSA Testbed
• Established 12/03-12/04
• Four nodes:
  – UCL (coordinator)
  – NeSC
  – NEReSC
  – LeSC
• Deployed Globus Toolkit 3.2 throughout onto heterogeneous HW/OS:
  – Linux
  – Solaris
  – Windows XP
Slide 3: Experience with GT3.2 Installation
• Different levels of experience within the team
• Heterogeneity:
  – HW (Intel/SPARC)
  – Operating system (Windows/Solaris/Linux)
  – Servlet container (Tomcat/GT3 container)
• Interaction with previous GT versions
• Departure from web service standards prevented standard tool use:
  – JMeter
  – Development environments (Eclipse)
  – Exception management tools (AmberPoint)
• Interaction with system administration
• Platform dependencies
Slide 4: Performance and Scalability
• Developed GTMark:
  – Server-side load model: SciMark 2.0 (http://math.nist.gov/SciMark)
  – Client-side load model, configuration and metrics collection based on the J2EE benchmark StockOnline
• Configurable benchmark (see the sketch below):
  – Static vs. dynamic discovery of nodes
  – Load applied for a fixed period of time or until a steady state is obtained
  – Constant or varying number of concurrent requests
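The slides do not include GTMark's source, so the following is a minimal sketch, in plain Java, of how such a client-side load driver can be structured: a fixed pool of worker threads issues synchronous calls for a fixed period and reports average response time (ART) and throughput in calls per minute (CPM), the two metrics plotted on the next slides. The class name and the invokeService() stub are illustrative assumptions, not GTMark code.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

/**
 * Minimal load-driver sketch in the spirit of GTMark's client side:
 * worker threads issue synchronous service calls for a fixed period
 * and report average response time (ART) and throughput (CPM).
 */
public class LoadDriverSketch {

    // Hypothetical stand-in for a synchronous grid service invocation.
    static void invokeService() throws InterruptedException {
        Thread.sleep(50); // simulate server-side work
    }

    public static void main(String[] args) throws Exception {
        final int threads = 16;              // concurrent requests
        final long runMillis = 10_000;       // fixed measurement period
        final AtomicLong calls = new AtomicLong();
        final AtomicLong totalLatencyNs = new AtomicLong();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final long deadline = System.currentTimeMillis() + runMillis;

        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (System.currentTimeMillis() < deadline) {
                    long t0 = System.nanoTime();
                    try {
                        invokeService();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    totalLatencyNs.addAndGet(System.nanoTime() - t0);
                    calls.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(runMillis + 5_000, TimeUnit.MILLISECONDS);

        double artMs = totalLatencyNs.get() / 1e6 / Math.max(1, calls.get());
        double cpm = calls.get() * 60_000.0 / runMillis;
        System.out.printf("ART = %.1f ms, throughput = %.1f CPM%n", artMs, cpm);
    }
}
```

Dynamic node discovery, steady-state detection and a varying thread count would layer on top of this loop; the sketch shows only the fixed-period, constant-concurrency configuration.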
Slide 5: Performance Results
[Figure: Average response time, ART (s), vs. number of concurrent request threads (0-70), for UCL (4-CPU Sun), Newcastle (2-CPU Intel), Imperial (2-CPU Intel), Edinburgh (4 hyperthreaded-CPU Intel), and all nodes combined.]
Slide 6: Scalability Results
[Figure: Throughput (CPM, 0-80) vs. number of concurrent request threads (0-80), for UCL (4-CPU Sun), Newcastle (2-CPU Intel), Imperial (2-CPU Intel), Edinburgh (4 hyperthreaded-CPU Intel), all nodes combined (12 CPUs), and the theoretical maximum.]
Slide 7: Performance Results
• Performance and scalability of GT3.2 with Tomcat/Axis were surprisingly good
• Performance overhead of security is negligible
• Good scalability: reached 96% of the theoretical maximum (see the sketch below)
• Tomcat performs better than the GT3.2 container on slow machines
• Surprising results on raw CPU performance
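On the 96% claim: the natural reading of the "Theoretical Maximum" curve on the previous slide is the sum of the peak per-node throughputs, with scaling efficiency defined as the measured combined throughput divided by that sum. The sketch below shows that calculation; all CPM values in it are invented placeholders, not the measured data.

```java
/**
 * Sketch of the efficiency calculation behind the "96% of theoretical
 * maximum" claim: theoretical maximum = sum of peak per-node CPM,
 * efficiency = measured combined CPM / theoretical maximum.
 * All numbers below are illustrative placeholders.
 */
public class ScalingEfficiency {
    public static void main(String[] args) {
        double[] peakCpmPerNode = {25.0, 15.0, 15.0, 20.0}; // placeholders
        double theoreticalMax = 0;
        for (double cpm : peakCpmPerNode) theoreticalMax += cpm;

        double measuredCombined = 72.0; // placeholder combined-grid CPM
        double efficiency = measuredCombined / theoreticalMax;
        System.out.printf("Scaling efficiency: %.0f%%%n", efficiency * 100);
    }
}
```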
Slide 8: Reliability
• Tomcat more reliable than the GT3.2 container:
  – The Tomcat container sustained 100% reliability under load
  – The GT3.2 container failed once every 300 invocations (99.67% reliability)
• Denial-of-service attack possible by (see the sketch below):
  – Concurrently invoking an operation on the same service instance (they are not thread safe!)
  – Fully exhausting resources
• Problem of hosting more than one service in one container:
  – Trade-off between reliability and reuse of containers across multiple users/services
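The thread-safety point deserves emphasis. The sketch below is not GT3.2 code; it is a minimal Java illustration of the underlying hazard: when concurrent requests reach one stateful, unsynchronised service instance, read-modify-write sequences interleave and updates are silently lost, and the same interleavings can corrupt a grid service instance's state.

```java
import java.util.concurrent.*;

/**
 * Illustration (not GT3.2 code) of why concurrent calls to a single,
 * stateful, non-thread-safe service instance are dangerous: two
 * unsynchronised read-modify-write sequences interleave and corrupt
 * the instance state.
 */
public class UnsafeServiceInstance {
    private int counter = 0; // per-instance state, no synchronisation

    // Stands in for a service operation that updates instance state.
    void operation() {
        int snapshot = counter;    // read
        Thread.yield();            // widen the race window
        counter = snapshot + 1;    // write back (lost updates likely)
    }

    public static void main(String[] args) throws Exception {
        UnsafeServiceInstance instance = new UnsafeServiceInstance();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        int invocations = 10_000;
        for (int i = 0; i < invocations; i++) {
            pool.submit(instance::operation);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // With correct serialisation this would print 10000.
        System.out.println("Expected " + invocations + ", got " + instance.counter);
    }
}
```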
Slide 9: Security
• Interesting effect of firewalls on testing and debugging
• Accountability and audit trails demand that users be given individual accounts on each node
• Overhead of node and user certificates (they always expire at the wrong time)
• Current security model does not scale (see the cost sketch below):
  – Assuming a cost of £18/admin hour
  – 10 users per node (site)
  – It will cost approx. £300,000 to set up a 100-node grid with 1,000 users
  – It will be prohibitively expensive to scale up to 1,000 nodes (with admin costs in excess of £6M)
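The cost estimate can be reconstructed with a simple quadratic model, sketched below. The £18/hour rate and the 10 users per node are from the slide; that every user needs an account on every node follows from the accountability bullet above; the 10 minutes of admin time per account is my assumption, chosen because it reproduces the £300,000 figure.

```java
/**
 * Back-of-the-envelope sketch of the slide's admin-cost estimate.
 * Assumed (not stated on the slide): each user needs an individual
 * account on every node, and one account takes ~10 minutes of
 * administrator time to set up.
 */
public class GridAdminCost {
    public static void main(String[] args) {
        double poundsPerAdminHour = 18.0;   // from the slide
        double hoursPerAccount = 10.0 / 60; // assumption: 10 min/account

        for (int nodes : new int[] {100, 1000}) {
            int users = 10 * nodes;                 // 10 users per node
            long accounts = (long) users * nodes;   // one account per user per node
            double cost = accounts * hoursPerAccount * poundsPerAdminHour;
            System.out.printf("%4d nodes, %6d users: ~£%,.0f%n", nodes, users, cost);
        }
    }
}
```

Under these assumptions a 1,000-node grid implies 10,000 users and 10 million accounts, roughly £30M of admin time, comfortably beyond the slide's "in excess of £6M".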
Slide 10: Deployment
• How do admins get grid middleware deployed systematically onto grid nodes?
• How can users get their services onto remote hosts?
• We tried out SmartFrog (http://www.smartfrog.org)
• Worked very well inside a node
• Impossible across organisations:
  – The SmartFrog daemon would need to execute actions with root privileges, which some site admins simply did not agree to (see the sketch below)
  – Security is paramount (SmartFrog would be the perfect virus distribution engine)
  – SmartFrog's security infrastructure is incompatible with the GT3.2 infrastructure
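To make the root-privilege objection concrete: the sketch below, which is not SmartFrog code, shows the kind of agent remote sites were prepared to host, one that stages middleware only under a user-writable prefix and rejects anything that escapes it. A daemon that instead executes arbitrary install actions as root is precisely the "virus distribution engine" that site admins refused.

```java
import java.io.IOException;
import java.nio.file.*;

/**
 * Minimal sketch (not SmartFrog code) of an unprivileged deployment
 * agent: it installs middleware files only below a user-owned prefix
 * and rejects any target outside it, so no root privileges are needed.
 */
public class UnprivilegedDeployAgent {
    private final Path prefix; // everything must stay below this prefix

    UnprivilegedDeployAgent(Path prefix) {
        this.prefix = prefix.toAbsolutePath().normalize();
    }

    /** Stage one file of a middleware bundle under the prefix. */
    void install(String relativePath, byte[] content) throws IOException {
        Path target = prefix.resolve(relativePath).normalize();
        if (!target.startsWith(prefix)) {
            // Reject path-traversal attempts such as "../../etc/passwd".
            throw new SecurityException("Refusing to write outside " + prefix);
        }
        Files.createDirectories(target.getParent());
        Files.write(target, content);
    }

    public static void main(String[] args) throws IOException {
        Path home = Paths.get(System.getProperty("user.home"), "grid-middleware");
        UnprivilegedDeployAgent agent = new UnprivilegedDeployAgent(home);
        agent.install("gt3.2/README", "staged without root".getBytes());
        System.out.println("Installed under " + home);
    }
}
```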
Slide 11: Looking Ahead
• Installation efforts need to be reduced significantly:
  – Binary distributions
  – For a few selected HW/OS platforms
• Standards compliance:
  – Track standards by all means
  – Otherwise no economies of scale
• Management console:
  – Add/remove grid hosts
  – Need to be able to monitor the status of grid resources
  – Across organisational boundaries
• More lightweight security model needed (see the sketch below):
  – Role-based access control
  – Trust delegation
• Deployment is a first-class citizen:
  – Avoid adding it as an afterthought
  – Needs to be built into the middleware stack
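To make the RBAC bullet concrete, here is a minimal sketch; every role and operation name in it is illustrative rather than taken from any grid middleware. The point is the cost structure: adding a user becomes one role assignment instead of one account per node, turning the quadratic admin cost from the Security slide into a linear one.

```java
import java.util.*;

/**
 * Sketch of role-based access control as a lighter-weight alternative
 * to per-user, per-node accounts: roles are defined once, each node
 * maps roles to permitted operations, and adding a user is a single
 * grid-wide role assignment. All names are illustrative.
 */
public class RbacSketch {
    enum Role { GRID_USER, NODE_ADMIN }

    // Per-node policy: which role may perform which operations.
    static final Map<Role, Set<String>> POLICY = Map.of(
        Role.GRID_USER, Set.of("invoke-service", "query-status"),
        Role.NODE_ADMIN, Set.of("invoke-service", "query-status", "deploy-service")
    );

    // One grid-wide assignment per user replaces N per-node accounts.
    static final Map<String, Role> ASSIGNMENTS = new HashMap<>();

    static boolean authorise(String user, String operation) {
        Role role = ASSIGNMENTS.get(user);
        return role != null && POLICY.get(role).contains(operation);
    }

    public static void main(String[] args) {
        ASSIGNMENTS.put("alice", Role.GRID_USER); // single admin action
        System.out.println(authorise("alice", "invoke-service"));   // true
        System.out.println(authorise("alice", "deploy-service"));   // false
        System.out.println(authorise("mallory", "invoke-service")); // false
    }
}
```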
Slide 12: Conclusions
• Very interesting experience
• Building a distributed system across organisational boundaries is different from building a system over a LAN
• Insights that might prove useful for:
  – OMII
  – Globus
  – ETF
• There is a lot more work to do before we realize the vision of the Grid!
Slide 13: Acknowledgements
• A large number of people have helped with this project, including:
  – Dave Berry (NeSC)
  – Paul Brebner (UCL, now CSIRO)
  – Tom Jones (UCL, now Symantec)
  – Oliver Malham (NeSC)
  – David McBride (LeSC)
  – Savas Parastatidis (NEReSC)
  – Steven Newhouse (OMII)
  – Jake Wu (NEReSC)
• For further details (including the IGR) check out http://sse.cs.ucl.ac.uk/UK-OGSA