John Kewley, e-Science Centre

CCLRC Daresbury Laboratory

15th March 2005, Paradyn / Condor Week, Madison, WI

Caging the CCLRC Compute Zoo

(Activities at CCLRC)

John Kewley, j.kewley@dl.ac.uk

http://www.e-science.clrc.ac.uk/web/staff/john_kewley

Outline

What is a Compute Zoo?

Caging Problems

A Trip to the Zoo

Uses for a Compute Zoo

What is a Compute Zoo?

Compute Farm

Homogeneous: large numbers of (near) identical resources

Often co-located physically: a training room, lab workstations or a large cluster

Centrally managed, often by dedicated staff

Typical of many Condor Pools: excellent for High Throughput Computing

Compute Zoo

Heterogeneous: resources are of many different operating systems and architectures

Located across a site

Individually or variously managed

Of minimal use for HTC

Caging Problems (Firewall Mirroring)

Firewalls within a Condor Pool

Some resource owners have firewalls on their personal workstations.

Since Condor needs each submit node to be able to talk to every potential execute node, this necessitates the opening of every firewall in the pool to every submit node when it is added.

Between adding the new node and the firewalls being updated, the firewalled nodes will be unavailable for use.

Or are they? Maybe someone should tell Condor!

Adding a new machine to the pool

If we add a new machine to the pool, the existing firewalls may not have anticipated this.

The firewalls will likely block this new machine.

A job submitted from the newly added machine may still be matched to a firewalled resource.

This job will not be able to run.

Parts of the system can jam as a result:

o condor_q on the submitting node

o Subsequent parts of the submit script

o (maybe also parts of the central node)

Private networks

Similar "jams" occur if part of your pool (or flock of pools) is on a network that is unavailable to some of the other nodes.

How can we permit jobs from submit nodes that can access the private network to run on these nodes, whilst preventing Condor sending jobs from other submit nodes there?

How can we get round this?

1. Restrict the number of submit nodes

2. Automatically update the firewall files

3. Ensure everything is up-to-date

4. Permit pool to evolve whilst persuading Condor to “avoid” going to nodes where the job can’t run

Firewall Mirroring (1)

1. Each machine with a firewall declares the fact in its ClassAds:

    HAS_FIREWALL = TRUE

2. Also, which machines and/or subnets it permits to access its Condor ports (mirroring FW table settings):

    FW_ALLOWS_113 = TRUE
    FW_ALLOWS_rjavig6 = TRUE

3. Finally, it needs to export these settings:

    STARTD_EXPRS = HAS_FIREWALL, FW_ALLOWS_113, \
                   FW_ALLOWS_rjavig6
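Since these settings mirror the firewall table, they could be generated from the same allow list rather than typed by hand. The following is a minimal sketch of that idea, not part of the original talk: the function name `firewall_mirror_config` and the idea of passing the allow list as short labels are assumptions made for illustration.

```python
def firewall_mirror_config(allowed):
    """Return Condor config lines advertising what a firewall admits.

    `allowed` is a list of short labels agreed within the pool, e.g.
    "113" for a subnet or "rjavig6" for a host (hypothetical input format).
    """
    # One FW_ALLOWS_<label> attribute per permitted host or subnet.
    allow_attrs = [f"FW_ALLOWS_{label}" for label in allowed]

    lines = ["HAS_FIREWALL = TRUE"]
    lines += [f"{attr} = TRUE" for attr in allow_attrs]
    # Export every attribute so that it appears in the machine ClassAd.
    lines.append("STARTD_EXPRS = " + ", ".join(["HAS_FIREWALL"] + allow_attrs))
    return lines


if __name__ == "__main__":
    print("\n".join(firewall_mirror_config(["113", "rjavig6"])))
```

Note that this sketch writes a fresh STARTD_EXPRS line; a machine that already exports other attributes would need them merged into the same list.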

Firewall Mirroring (2)

To ensure that jobs can only go to resources they can reach:

1. Ensure that submit machines declare their subnet and hostname:

    MY_SUBNET = 113
    MY_HOST = condor

2. Use these values in the following expression, which is added to all REQUIREMENTS for jobs from this machine:

    APPEND_REQUIREMENTS = ( \
        (HAS_FIREWALL =!= TRUE) || \
        (FW_ALLOWS_$(MY_HOST) == TRUE) || \
        (FW_ALLOWS_$(MY_SUBNET) == TRUE) )
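To see how the appended requirement behaves, it may help to evaluate it by hand against a few machine ads. The sketch below approximates the expression in Python; it is an illustration only, not Condor's matchmaker. The function name `can_match` is hypothetical, a machine ad is modelled as a plain dict, and a missing key stands in for a ClassAd UNDEFINED value.

```python
def can_match(machine_ad, my_host, my_subnet):
    """Approximate the appended requirement for one machine ad."""
    # (HAS_FIREWALL =!= TRUE): satisfied unless the attribute is exactly
    # TRUE, so machines that advertise nothing count as unfirewalled.
    no_firewall = machine_ad.get("HAS_FIREWALL") is not True
    # (FW_ALLOWS_x == TRUE): only an explicit TRUE satisfies the clause;
    # an undefined attribute never does.
    host_ok = machine_ad.get(f"FW_ALLOWS_{my_host}") is True
    subnet_ok = machine_ad.get(f"FW_ALLOWS_{my_subnet}") is True
    return no_firewall or host_ok or subnet_ok


if __name__ == "__main__":
    open_node = {}
    firewalled = {
        "HAS_FIREWALL": True,
        "FW_ALLOWS_113": True,
        "FW_ALLOWS_rjavig6": True,
    }
    print(can_match(open_node, "condor", "113"))    # unfirewalled node
    print(can_match(firewalled, "condor", "113"))   # subnet 113 is admitted
    print(can_match(firewalled, "newnode", "114"))  # not yet in the firewall
```

The last case is the newly added submit node: until the firewalled machine advertises a matching FW_ALLOWS attribute, jobs from that node are simply not matched to it, rather than being matched and then jamming.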

And Private Networks?

Same solution can be used for private networks by pretending they have a firewall and declaring which other nodes have access to that network

A Trip to the Zoo (Viewing the Pool)

The CCLRC Compute Zoo

2x Windows XP Professional
2x Windows 2000 Professional
1x Windows NT 4.0 Workstation
7x SuSE Linux 9.0
2x SuSE Linux 8.0
1x SuSE Linux 9.1
5x White Box Enterprise Linux 3.0
1x Red Hat Enterprise Linux AS release 3.0
1x Red Hat Enterprise Linux WS release 3.0
3x Red Hat Linux 9
2x Red Hat Linux 8.0
2x Red Hat Linux 7.3
1x Mandrake Linux 10.1
1x Gentoo Linux 1.4

Viewing the Pool

http://tardis.dl.ac.uk/Condor/cgi-bin/CondorStatus.cgi

http://tardis.dl.ac.uk/Condor/cgi-bin/WiscStatus.cgi

Uses of a Zoo

“Build and Test”

The CCLRC pool was part of the UK Grid Engineering Task Force “Build and Test” project.

Software bundles were distributed to a variety of OS types around the flocked pool for building and testing.

This type of (flocked) pool relies on heterogeneity; small numbers of each type are all that are required.

http://polaris.ecs.soton.ac.uk:65000/

http://wiki.nesc.ac.uk/read/sfct?HomePage

Other non-HTC Uses

I want to ensure my code compiles without warnings and/or runs its basic tests:

o On as many OSs as possible

o With as many different compilers as possible

I want to perform a release build of my product for platform X, but I only have accounts on A, B and C.

I have several server-licensed products and many potential occasional users. How can these be made available to them more easily (within the bounds of the licence, of course)?
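The first two uses amount to sending the same build job to one machine of each platform type. One way this might be scripted is sketched below: a Python helper that writes a Condor submit description per platform, pinned by the OpSys and Arch machine attributes. The helper name `submit_description`, the script name `build_and_test.sh` and the output file names are hypothetical, and the platform values shown are examples only; use whatever your own pool advertises.

```python
def submit_description(opsys, arch, script="build_and_test.sh"):
    """Return a submit description that pins a build job to one platform.

    `script` is a hypothetical build-and-test wrapper supplied by the user.
    """
    tag = f"{opsys}_{arch}"
    return "\n".join([
        "universe = vanilla",
        f"executable = {script}",
        # Match only machines advertising this OS and architecture.
        f'requirements = (OpSys == "{opsys}") && (Arch == "{arch}")',
        f"output = build_{tag}.out",
        f"error = build_{tag}.err",
        f"log = build_{tag}.log",
        "queue",
    ])


if __name__ == "__main__":
    # Example platform values; substitute those seen in your own pool.
    for opsys, arch in [("LINUX", "INTEL")]:
        print(submit_description(opsys, arch))
        print()
```

Each description can then be saved to a file and passed to condor_submit, so that a single machine of each type is enough to cover that platform.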

What other uses are there for a Compute Zoo?
