TRANSCRIPT
Slide 1: Establishing an inter-organisational OGSA Grid: Lessons Learned
Wolfgang Emmerich
London Software Systems, Dept. of Computer Science
University College London, Gower St, London WC1E 6BT, U.K.
http://www.sse.ucl.ac.uk/UK-OGSA
Slide 2: An Experimental UK OGSA Testbed
• Established 12/03-12/04
• Four nodes:
  – UCL (coordinator)
  – NeSC
  – NEReSC
  – LeSC
• Deployed Globus Toolkit 3.2 throughout onto heterogeneous HW/OS:
  – Linux
  – Solaris
  – Windows XP
Slide 3: Experience with GT3.2 Installation
• Different levels of experience within the team
• Heterogeneity:
  – HW (Intel/SPARC)
  – Operating system (Windows/Solaris/Linux)
  – Servlet container (Tomcat/GT3 container)
• Interaction with previous GT versions
• Departure from web service standards prevented standard tool use:
  – JMeter
  – Development environments (Eclipse)
  – Exception management tools (AmberPoint)
• Interaction with system administration
• Platform dependencies
Slide 4: Performance and Scalability
• Developed GTMark:
  – Server-side load model: SciMark 2.0 (http://math.nist.gov/SciMark)
  – Client-side load model, configuration and metrics collection based on the J2EE benchmark StockOnline
• Configurable benchmark (see the sketch below):
  – Static vs. dynamic discovery of nodes
  – Load applied for a fixed period of time or until a steady state is obtained
  – Constant or varying number of concurrent requests
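The slides do not include GTMark's source, so the following is a minimal sketch, in plain Java, of how such a client-side load driver can be structured: a fixed pool of worker threads issues synchronous calls for a fixed period and reports average response time (ART) and throughput in calls per minute (CPM), the two metrics plotted on the next slides. The class name and the invokeService() stub are illustrative assumptions, not GTMark code.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

/**
 * Minimal load-driver sketch in the spirit of GTMark's client side:
 * worker threads issue synchronous service calls for a fixed period
 * and report average response time (ART) and throughput (CPM).
 */
public class LoadDriverSketch {

    // Hypothetical stand-in for a synchronous grid service invocation.
    static void invokeService() throws InterruptedException {
        Thread.sleep(50); // simulate server-side work
    }

    public static void main(String[] args) throws Exception {
        final int threads = 16;              // concurrent requests
        final long runMillis = 10_000;       // fixed measurement period
        final AtomicLong calls = new AtomicLong();
        final AtomicLong totalLatencyNs = new AtomicLong();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final long deadline = System.currentTimeMillis() + runMillis;

        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (System.currentTimeMillis() < deadline) {
                    long t0 = System.nanoTime();
                    try {
                        invokeService();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    totalLatencyNs.addAndGet(System.nanoTime() - t0);
                    calls.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(runMillis + 5_000, TimeUnit.MILLISECONDS);

        double artMs = totalLatencyNs.get() / 1e6 / Math.max(1, calls.get());
        double cpm = calls.get() * 60_000.0 / runMillis;
        System.out.printf("ART = %.1f ms, throughput = %.1f CPM%n", artMs, cpm);
    }
}
```

Dynamic node discovery, steady-state detection and a varying thread count would layer on top of this loop; the sketch shows only the fixed-period, constant-concurrency configuration.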
Slide 5: Performance Results
[Figure: Average response time, ART (s), vs. number of concurrent request threads (0-70), for UCL (4-CPU Sun), Newcastle (2-CPU Intel), Imperial (2-CPU Intel), Edinburgh (4 hyperthreaded-CPU Intel), and all nodes combined.]
Slide 6: Scalability Results
[Figure: Throughput (CPM, 0-80) vs. number of concurrent request threads (0-80), for UCL (4-CPU Sun), Newcastle (2-CPU Intel), Imperial (2-CPU Intel), Edinburgh (4 hyperthreaded-CPU Intel), all nodes combined (12 CPUs), and the theoretical maximum.]
Slide 7: Performance Results
• Performance and scalability of GT3.2 with Tomcat/Axis were surprisingly good
• Performance overhead of security is negligible
• Good scalability: reached 96% of the theoretical maximum (see the sketch below)
• Tomcat performs better than the GT3.2 container on slow machines
• Surprising results on raw CPU performance
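On the 96% claim: the natural reading of the "Theoretical Maximum" curve on the previous slide is the sum of the peak per-node throughputs, with scaling efficiency defined as the measured combined throughput divided by that sum. The sketch below shows that calculation; all CPM values in it are invented placeholders, not the measured data.

```java
/**
 * Sketch of the efficiency calculation behind the "96% of theoretical
 * maximum" claim: theoretical maximum = sum of peak per-node CPM,
 * efficiency = measured combined CPM / theoretical maximum.
 * All numbers below are illustrative placeholders.
 */
public class ScalingEfficiency {
    public static void main(String[] args) {
        double[] peakCpmPerNode = {25.0, 15.0, 15.0, 20.0}; // placeholders
        double theoreticalMax = 0;
        for (double cpm : peakCpmPerNode) theoreticalMax += cpm;

        double measuredCombined = 72.0; // placeholder combined-grid CPM
        double efficiency = measuredCombined / theoreticalMax;
        System.out.printf("Scaling efficiency: %.0f%%%n", efficiency * 100);
    }
}
```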
Slide 8: Reliability
• Tomcat more reliable than the GT3.2 container:
  – The Tomcat container sustained 100% reliability under load
  – The GT3.2 container failed once every 300 invocations (99.67% reliability)
• Denial-of-service attack possible by (see the sketch below):
  – Concurrently invoking an operation on the same service instance (they are not thread safe!)
  – Fully exhausting resources
• Problem of hosting more than one service in one container:
  – Trade-off between reliability and reuse of containers across multiple users/services
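The thread-safety point deserves emphasis. The sketch below is not GT3.2 code; it is a minimal Java illustration of the underlying hazard: when concurrent requests reach one stateful, unsynchronised service instance, read-modify-write sequences interleave and updates are silently lost, and the same interleavings can corrupt a grid service instance's state.

```java
import java.util.concurrent.*;

/**
 * Illustration (not GT3.2 code) of why concurrent calls to a single,
 * stateful, non-thread-safe service instance are dangerous: two
 * unsynchronised read-modify-write sequences interleave and corrupt
 * the instance state.
 */
public class UnsafeServiceInstance {
    private int counter = 0; // per-instance state, no synchronisation

    // Stands in for a service operation that updates instance state.
    void operation() {
        int snapshot = counter;    // read
        Thread.yield();            // widen the race window
        counter = snapshot + 1;    // write back (lost updates likely)
    }

    public static void main(String[] args) throws Exception {
        UnsafeServiceInstance instance = new UnsafeServiceInstance();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        int invocations = 10_000;
        for (int i = 0; i < invocations; i++) {
            pool.submit(instance::operation);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // With correct serialisation this would print 10000.
        System.out.println("Expected " + invocations + ", got " + instance.counter);
    }
}
```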
Slide 9: Security
• Interesting effect of firewalls on testing and debugging
• Accountability and audit trails demand that users be given individual accounts on each node
• Overhead of node and user certificates (they always expire at the wrong time)
• Current security model does not scale (see the cost sketch below):
  – Assuming a cost of £18/admin hour
  – 10 users per node (site)
  – It will cost approx. £300,000 to set up a 100-node grid with 1,000 users
  – It will be prohibitively expensive to scale up to 1,000 nodes (with admin costs in excess of £6M)
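The cost estimate can be reconstructed with a simple quadratic model, sketched below. The £18/hour rate and the 10 users per node are from the slide; that every user needs an account on every node follows from the accountability bullet above; the 10 minutes of admin time per account is my assumption, chosen because it reproduces the £300,000 figure.

```java
/**
 * Back-of-the-envelope sketch of the slide's admin-cost estimate.
 * Assumed (not stated on the slide): each user needs an individual
 * account on every node, and one account takes ~10 minutes of
 * administrator time to set up.
 */
public class GridAdminCost {
    public static void main(String[] args) {
        double poundsPerAdminHour = 18.0;   // from the slide
        double hoursPerAccount = 10.0 / 60; // assumption: 10 min/account

        for (int nodes : new int[] {100, 1000}) {
            int users = 10 * nodes;                 // 10 users per node
            long accounts = (long) users * nodes;   // one account per user per node
            double cost = accounts * hoursPerAccount * poundsPerAdminHour;
            System.out.printf("%4d nodes, %6d users: ~£%,.0f%n", nodes, users, cost);
        }
    }
}
```

Under these assumptions a 1,000-node grid implies 10,000 users and 10 million accounts, roughly £30M of admin time, comfortably beyond the slide's "in excess of £6M".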
Slide 10: Deployment
• How do admins get grid middleware deployed systematically onto grid nodes?
• How can users get their services onto remote hosts?
• We tried out SmartFrog (http://www.smartfrog.org)
• Worked very well inside a node
• Impossible across organisations:
  – The SmartFrog daemon would need to execute actions with root privileges, which some site admins simply did not agree to (see the sketch below)
  – Security is paramount (SmartFrog would be the perfect virus distribution engine)
  – SmartFrog's security infrastructure is incompatible with the GT3.2 infrastructure
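To make the root-privilege objection concrete: the sketch below, which is not SmartFrog code, shows the kind of agent remote sites were prepared to host, one that stages middleware only under a user-writable prefix and rejects anything that escapes it. A daemon that instead executes arbitrary install actions as root is precisely the "virus distribution engine" that site admins refused.

```java
import java.io.IOException;
import java.nio.file.*;

/**
 * Minimal sketch (not SmartFrog code) of an unprivileged deployment
 * agent: it installs middleware files only below a user-owned prefix
 * and rejects any target outside it, so no root privileges are needed.
 */
public class UnprivilegedDeployAgent {
    private final Path prefix; // everything must stay below this prefix

    UnprivilegedDeployAgent(Path prefix) {
        this.prefix = prefix.toAbsolutePath().normalize();
    }

    /** Stage one file of a middleware bundle under the prefix. */
    void install(String relativePath, byte[] content) throws IOException {
        Path target = prefix.resolve(relativePath).normalize();
        if (!target.startsWith(prefix)) {
            // Reject path-traversal attempts such as "../../etc/passwd".
            throw new SecurityException("Refusing to write outside " + prefix);
        }
        Files.createDirectories(target.getParent());
        Files.write(target, content);
    }

    public static void main(String[] args) throws IOException {
        Path home = Paths.get(System.getProperty("user.home"), "grid-middleware");
        UnprivilegedDeployAgent agent = new UnprivilegedDeployAgent(home);
        agent.install("gt3.2/README", "staged without root".getBytes());
        System.out.println("Installed under " + home);
    }
}
```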
Slide 11: Looking Ahead
• Installation efforts need to be reduced significantly:
  – Binary distributions
  – For a few selected HW/OS platforms
• Standards compliance:
  – Track standards by all means
  – Otherwise no economies of scale
• Management console:
  – Add/remove grid hosts
  – Need to be able to monitor the status of grid resources
  – Across organisational boundaries
• More lightweight security model needed (see the sketch below):
  – Role-based access control
  – Trust delegation
• Deployment is a first-class citizen:
  – Avoid adding it as an afterthought
  – Needs to be built into the middleware stack
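To make the RBAC bullet concrete, here is a minimal sketch; every role and operation name in it is illustrative rather than taken from any grid middleware. The point is the cost structure: adding a user becomes one role assignment instead of one account per node, turning the quadratic admin cost from the Security slide into a linear one.

```java
import java.util.*;

/**
 * Sketch of role-based access control as a lighter-weight alternative
 * to per-user, per-node accounts: roles are defined once, each node
 * maps roles to permitted operations, and adding a user is a single
 * grid-wide role assignment. All names are illustrative.
 */
public class RbacSketch {
    enum Role { GRID_USER, NODE_ADMIN }

    // Per-node policy: which role may perform which operations.
    static final Map<Role, Set<String>> POLICY = Map.of(
        Role.GRID_USER, Set.of("invoke-service", "query-status"),
        Role.NODE_ADMIN, Set.of("invoke-service", "query-status", "deploy-service")
    );

    // One grid-wide assignment per user replaces N per-node accounts.
    static final Map<String, Role> ASSIGNMENTS = new HashMap<>();

    static boolean authorise(String user, String operation) {
        Role role = ASSIGNMENTS.get(user);
        return role != null && POLICY.get(role).contains(operation);
    }

    public static void main(String[] args) {
        ASSIGNMENTS.put("alice", Role.GRID_USER); // single admin action
        System.out.println(authorise("alice", "invoke-service"));   // true
        System.out.println(authorise("alice", "deploy-service"));   // false
        System.out.println(authorise("mallory", "invoke-service")); // false
    }
}
```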
Slide 12: Conclusions
• Very interesting experience
• Building a distributed system across organisational boundaries is different from building a system over a LAN
• Insights that might prove useful for:
  – OMII
  – Globus
  – ETF
• There is a lot more work to do before we realize the vision of the Grid!
Slide 13: Acknowledgements
• A large number of people have helped with this project, including:
  – Dave Berry (NeSC)
  – Paul Brebner (UCL, now CSIRO)
  – Tom Jones (UCL, now Symantec)
  – Oliver Malham (NeSC)
  – David McBride (LeSC)
  – Savas Parastatidis (NEReSC)
  – Steven Newhouse (OMII)
  – Jake Wu (NEReSC)
• For further details (including the IGR) check out http://sse.cs.ucl.ac.uk/UK-OGSA