Page 1: Condor Tutorial NCSA Alliance ‘98

Condor Tutorial, NCSA Alliance ‘98

Presented by:

The Condor Team, University of Wisconsin-Madison

Email: [email protected]

URL: http://www.cs.wisc.edu/condor

Page 2: Condor Tutorial NCSA Alliance ‘98

Condor Tutorial, NCSA Alliance '98, April 27th 1998

Welcome to the Condor Tutorial!

Introductions

What is Condor?
• A system for High Throughput Computing

Page 3: Condor Tutorial NCSA Alliance ‘98

The “Religion” behind High Throughput Computing

Key Concepts:

• High Throughput Computing (HTC)

• Distributively owned resources

Page 4: Condor Tutorial NCSA Alliance ‘98

Performance vs. Throughput

High Performance - Very large amounts of processing capacity over short time periods (FLOPS - Floating Point Operations Per Second)

High Throughput - Large amounts of processing capacity sustained over very long time periods (FLOPY - Floating Point Operations Per Year)

FLOPY ≠ 30758400 * FLOPS
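To make the arithmetic on this slide concrete: the constant 30758400 equals 60 * 60 * 24 * 356, a number of seconds. The sketch below multiplies a peak FLOPS rate by that many seconds and then scales it by the fraction of time the capacity is actually delivered. The `availability` parameter and the sample numbers are hypothetical and only illustrate why sustained yearly throughput is not simply peak rate times elapsed seconds.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def flopy(peak_flops, days, availability):
    """Floating point operations actually delivered over `days` days.

    `availability` (0.0 to 1.0) is a hypothetical factor for the share of
    time the machine is really working for you; it is not from the slides.
    """
    return peak_flops * SECONDS_PER_DAY * days * availability


# The constant printed on the slide, expressed as seconds.
slide_constant = SECONDS_PER_DAY * 356

ideal = flopy(1e9, 356, 1.0)      # machine never idle, never down
realistic = flopy(1e9, 356, 0.6)  # hypothetical 60% delivered capacity

print(slide_constant)
print(ideal, realistic)
```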

Page 5: Condor Tutorial NCSA Alliance ‘98

Distributed Ownership

Due to the dramatic decrease in the cost-performance ratio of hardware, powerful computing resources are owned today by individuals, groups, departments, …
• Huge increase in the aggregate processing capacity owned by the organization
• Much smaller increase in the capacity accessible by a single person

Page 6: Condor Tutorial NCSA Alliance ‘98

The Challenge and Motivation behind Condor

Turn large collections of existing distributively owned (and perhaps non-dedicated) computing resources into effective High Throughput Computing Environments

Minimize Wait while Idle

Page 7: Condor Tutorial NCSA Alliance ‘98

Road Block: Sociology

Make owners (& system administrators) happy.
• Give owners full control over
– when and by whom private resources are used for HTC
– impact of HTC on private Quality of Service
– membership and information on HTC related activities
• No changes to existing software, and make it easy
– to install, configure, monitor, and maintain

Happy owners → more resources → higher throughput

Page 8: Condor Tutorial NCSA Alliance ‘98

Road Block: Robustness

To be effective, an HTC environment must run as a 24-7-365 operation.
• Customers count on it
• Debugging and fault isolation may be a very time-consuming process
• In a large distributed system, everything that might go wrong will go wrong.

Robust system → less down time → higher throughput

Page 9: Condor Tutorial NCSA Alliance ‘98

Road Block: Portability

To be effective, the HTC software must run on and support the latest and greatest hardware and software.
• Owners select hardware and software according to their needs and tradeoffs
• Customers expect it to be there.
• Application developers expect only few (if any) changes to their applications.

Portability → more platforms → higher throughput

Page 10: Condor Tutorial NCSA Alliance ‘98

Condor’s unique mechanisms for HTC

Matchmaking - enables requests for services and offers to provide services to find each other.

Checkpointing - enables preemptive resume scheduling (go ahead and use it as long as it is available!).

Remote I/O - enables remote (from execution site) access to local (at submission site) data.

Page 11: Condor Tutorial NCSA Alliance ‘98

Condor Viewpoints

Owner
• Creates resource offers

User
• Creates resource requests

Administrator
• Drinks Coffee
• Manages the pool-wide configuration
• Could also be the Owner

Page 12: Condor Tutorial NCSA Alliance ‘98

Condor Agents

Condor Resource Agent
• condor_startd daemon
• allows a machine to execute Condor jobs
• enforces owner policy

Condor User Agent
• condor_schedd daemon
• allows a machine to submit jobs to a pool

Page 13: Condor Tutorial NCSA Alliance ‘98

The Tutorial Installation

[Diagram labels: Alliance ‘98 Pool, Central Manager, startd, Your Workstation, schedd]

Page 14: Condor Tutorial NCSA Alliance ‘98

The Tutorial Installation

[Diagram labels: Alliance ‘98 Pool, UW-Madison Pool, Central Manager (twice), schedd (twice), Your Workstation, startd]

Page 15: Condor Tutorial NCSA Alliance ‘98

Hands-on: Example #1

Joining the UW-Madison CS Condor Pool as a Submit-only node

Page 16: Condor Tutorial NCSA Alliance ‘98

Overview of Submitting a Job to Condor

Create a Submit-Description File

Run condor_compile to relink your program with the Condor Libraries, if Condor’s Checkpointing or Remote I/O support is desired

Run condor_submit
• sends your request to the User Agent (condor_schedd)

Page 17: Condor Tutorial NCSA Alliance ‘98

Condor System Structure

Page 18: Condor Tutorial NCSA Alliance ‘98

Hands-on: Example #2

Submit Jobs to Condor

Page 19: Condor Tutorial NCSA Alliance ‘98

Condor Universes

A Universe specifies a Condor runtime environment:
• STANDARD
– Supports Checkpointing
– Supports Remote System Calls
– Has some limitations…
• VANILLA
– Any Unix executable (shell scripts, etc.)
– No Condor Checkpointing or Remote I/O

Page 20: Condor Tutorial NCSA Alliance ‘98

Hands-on: Example #3

Tour of User Tools/Commands

Page 21: Condor Tutorial NCSA Alliance ‘98

User Priorities in Condor

Each active user in the pool has a user priority

Viewed or changed with condor_userprio

Like golf: the lower, the better

A given user’s share of available machines is inversely related to the ratio between user priorities.
• Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
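The Fred and Joe example can be worked through in a few lines. This is a minimal sketch, assuming each user's share is exactly proportional to the inverse of their priority value; the function name `machine_shares` and the pool size of 30 machines are made up for illustration and are not part of Condor.

```python
def machine_shares(priorities, total_machines):
    """Split machines among users in inverse proportion to priority value.

    Lower priority values are better, so the weight of a user is 1/priority.
    """
    weights = {user: 1.0 / value for user, value in priorities.items()}
    total_weight = sum(weights.values())
    return {
        user: total_machines * weight / total_weight
        for user, weight in weights.items()
    }


# Priorities from the slide; the pool size is a hypothetical number.
shares = machine_shares({"fred": 10, "joe": 20}, total_machines=30)
print(shares)
```

With these inputs Fred's share comes out twice as large as Joe's, matching the slide.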

Page 22: Condor Tutorial NCSA Alliance ‘98

User Priorities in Condor, cont.

Condor continuously adjusts user priorities over time
• machines allocated > priority, priority worsens
• machines allocated < priority, priority improves

Priority Preemption
• Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…)
• Starvation is prevented
• Priority “thrashing” is prevented

Page 23: Condor Tutorial NCSA Alliance ‘98

Parallel Jobs in Condor

Condor can run parallel applications (written to the popular PVM message passing library)

Page 24: Condor Tutorial NCSA Alliance ‘98

Master-Worker Paradigm

Condor-PVM is designed to run PVM applications which follow the master-worker paradigm.

Master
• has a pool of work, sends pieces of work to the workers, manages the work and the workers

Worker
• gets a piece of work, does the computation, sends the result back
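The master-worker pattern described above can be sketched without PVM at all. The example below uses Python threads and queues from the standard library as a stand-in for PVM tasks and messages; squaring a number plays the role of "the computation". None of these names come from Condor-PVM.

```python
import queue
import threading


def worker(tasks, results):
    """Worker: get a piece of work, compute, send the result back."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel from the master: no more work
            break
        results.put((item, item * item))


def master(work_items, num_workers=3):
    """Master: hold the pool of work, hand it out, collect the results."""
    work_items = list(work_items)
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for item in work_items:
        tasks.put(item)
    for _ in threads:  # one sentinel per worker
        tasks.put(None)
    for thread in threads:
        thread.join()
    return dict(results.get() for _ in work_items)


print(master(range(5)))
```

A real Condor-PVM master would additionally have to cope with workers disappearing when their machines leave the pool; this sketch assumes every worker survives.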

Page 25: Condor Tutorial NCSA Alliance ‘98

What does Condor-PVM do?

Condor acts as the PVM resource manager.

All pvm_addhost requests get re-mapped to Condor.
• Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines.

When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.

Page 26: Condor Tutorial NCSA Alliance ‘98

How to compile and submit Condor-PVM jobs

Binary Compatible
• Compile and link with PVM library just as normal PVM applications. No need to link with Condor.

Submit

In the submit file set:

universe = PVM
machine_count = <min>..<max>

Page 27: Condor Tutorial NCSA Alliance ‘98

Classified Advertisements

ClassAds
• Language for expressing attributes
• Semantics for evaluating them

Intuitively, a ClassAd is a set of named expressions
• Each named expression is an attribute

Expressions are similar to C …
• Constants, attribute references, operators

Page 28: Condor Tutorial NCSA Alliance ‘98

Classified Advertisements: Example

MyType = "Machine"
TargetType = "Job"
Name = "froth.cs.wisc.edu"
StartdIpAddr = "<128.105.73.44:33846>"
Arch = "INTEL"
OpSys = "SOLARIS251"
VirtualMemory = 225312
Disk = 35957
KFlops = 21058
Mips = 103
LoadAvg = 0.011719
KeyboardIdle = 12
Cpus = 1
Memory = 128
Requirements = LoadAvg <= 0.300000 && KeyboardIdle > 15 * 60
Rank = 0

Page 29: Condor Tutorial NCSA Alliance ‘98

Classified Advertisements: Matching

ClassAds are always considered in pairs

Does ClassAd A match ClassAd B (and vice versa)?

Page 30: Condor Tutorial NCSA Alliance ‘98

Classified Advertisements: Examples

ClassAd A
MyType = "Apartment"
TargetType = "ApartmentRenter"
SquareArea = 3500
RentOffer = 1000
HeatIncluded = False
OnBusLine = True
Rank = UnderGrad==False + TARGET.RentOffer
Requirements = MY.RentOffer - TARGET.RentOffer < 150

ClassAd B
MyType = "ApartmentRenter"
TargetType = "Apartment"
UnderGrad = False
RentOffer = 900
Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
Requirements = OnBusLine && SquareArea > 2700
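To see how the two ads above are evaluated against each other, here is a rough Python model. It is not the ClassAd language or Condor's matchmaker: each ad is a plain dict, `Requirements` and `Rank` are written as functions of (my ad, target ad), and the helper `lookup` assumes that an unqualified attribute is looked up in the ad's own attributes first and in the other ad second.

```python
def lookup(name, my, target):
    """Assumed resolution rule: own ad first, then the other ad."""
    return my[name] if name in my else target[name]


apartment = {  # ClassAd A
    "MyType": "Apartment", "TargetType": "ApartmentRenter",
    "SquareArea": 3500, "RentOffer": 1000,
    "HeatIncluded": False, "OnBusLine": True,
    "Rank": lambda my, t: int(not lookup("UnderGrad", my, t)) + t["RentOffer"],
    "Requirements": lambda my, t: my["RentOffer"] - t["RentOffer"] < 150,
}

renter = {  # ClassAd B
    "MyType": "ApartmentRenter", "TargetType": "Apartment",
    "UnderGrad": False, "RentOffer": 900,
    "Rank": lambda my, t: 1 / (t["RentOffer"] + 100.0)
    + 50 * lookup("HeatIncluded", my, t),
    "Requirements": lambda my, t: lookup("OnBusLine", my, t)
    and lookup("SquareArea", my, t) > 2700,
}


def matches(a, b):
    """Two ads match when types line up and BOTH Requirements hold."""
    types_ok = a["TargetType"] == b["MyType"] and b["TargetType"] == a["MyType"]
    return types_ok and a["Requirements"](a, b) and b["Requirements"](b, a)


print(matches(apartment, renter))
print(apartment["Rank"](apartment, renter))
print(renter["Rank"](renter, apartment))
```

Under these assumptions the ads match: the rent gap is 100, which is below 150, and the apartment is on a bus line with more than 2700 square units of area. Rank then expresses how much each side likes this particular match.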

Page 31: Condor Tutorial NCSA Alliance ‘98

ClassAds in the Condor System

ClassAds allow Condor to be a general system
• Constraints and ranks on matches expressed by entities themselves
• Only priority logic integrated into Manager

All principal entities in the Condor system are represented by ClassAds
• Machines, Jobs, Submitters

Page 32: Condor Tutorial NCSA Alliance ‘98

ClassAds in Condor: Requirements and Rank (Example)

Friend = Owner == "tannenba" || Owner == "wright"

ResearchGroup = Owner == "jbasney" || Owner == "raman"

Trusted = Owner != "rival" && Owner != "riffraff"

Requirements = Trusted && ( ResearchGroup || LoadAvg < 0.3 && KeyboardIdle > 15*60 )

Rank = Friend + ResearchGroup*10
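The policy above can be traced by hand with a small Python translation. This is only a sketch: `evaluate_policy` is a made-up helper, `owner` stands for the Owner attribute of the candidate job, and `load_avg` and `keyboard_idle` stand for the machine's LoadAvg and KeyboardIdle (in seconds).

```python
def evaluate_policy(owner, load_avg, keyboard_idle):
    """Return (Requirements, Rank) for one candidate job on this machine."""
    friend = owner in ("tannenba", "wright")
    research_group = owner in ("jbasney", "raman")
    trusted = owner not in ("rival", "riffraff")
    # && binds tighter than ||, so the idle test is grouped explicitly.
    requirements = trusted and (
        research_group or (load_avg < 0.3 and keyboard_idle > 15 * 60)
    )
    rank = int(friend) + int(research_group) * 10
    return requirements, rank


# Research group members may run even on a busy machine.
print(evaluate_policy("raman", load_avg=2.0, keyboard_idle=0))
# Friends are accepted only when the machine is idle.
print(evaluate_policy("tannenba", load_avg=0.1, keyboard_idle=1000))
# Untrusted owners are never accepted.
print(evaluate_policy("rival", load_avg=0.0, keyboard_idle=10000))
```

Rank only orders acceptable jobs: a research group job (rank 10) is preferred over a friend's job (rank 1), which is preferred over anyone else's (rank 0).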

Page 33: Condor Tutorial NCSA Alliance ‘98

Hands-on: Example #4

Submit Jobs with ClassAd Constraints

Page 34: Condor Tutorial NCSA Alliance ‘98

Resource Owner’s Viewpoint

In Condor, the owner of the resource (machine owner) can dictate the terms and conditions under which that resource can be used

How? Configure the Resource Agent’s Policy (condor_startd configuration)

Page 35: Condor Tutorial NCSA Alliance ‘98

Resource Agent Configuration Expressions

START expression
• When TRUE, Condor can start a job
– True = Unclaimed State
– False = Owner State

SUSPEND expression
• When TRUE, Condor suspends any job running on this machine

CONTINUE expression
• When TRUE, will continue a suspended job

Page 36: Condor Tutorial NCSA Alliance ‘98

Resource Agent Configuration Expressions, cont.

VACATE expression
• When TRUE, kick the job off of the machine (via a Checkpoint if possible)

KILL expression
• When TRUE, kill the job immediately
– No Checkpoint
– On UNIX: a “kill -9”

Page 37: Condor Tutorial NCSA Alliance ‘98

Resource Agent Configuration Expressions, cont.

[Flowchart labels: START, WANT SUSPEND, SUSPEND, VACATE, WANT VACATE, KILL, with True/False branches]

Page 38: Condor Tutorial NCSA Alliance ‘98

Resource Agent Configuration Expressions, cont.

Default Setup

WANT_VACATE : True
WANT_SUSPEND : True
START : Keyboard_Idle && CPU_Idle
SUSPEND : Keyboard_Busy || CPU_Busy
CONTINUE : Keyboard and CPU idle again
VACATE : If Suspended > 10 minutes
KILL : If spent > 10 minutes in VACATE state
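The default setup reads naturally as a small state machine. The sketch below is a simplified model, assuming WANT_SUSPEND and WANT_VACATE are both true as in the defaults above. The state names ("idle", "running", "suspended", "vacating", "killed") and the function `next_state` are illustrative only and are not the condor_startd's real states or API.

```python
TEN_MINUTES = 10


def next_state(state, keyboard_idle, cpu_idle, minutes_in_state):
    """Return the next job state under the default policy on the slide."""
    machine_idle = keyboard_idle and cpu_idle
    if state == "idle":  # no job yet; START decides whether one may begin
        return "running" if machine_idle else "idle"
    if state == "running":  # SUSPEND: keyboard or CPU became busy
        return "running" if machine_idle else "suspended"
    if state == "suspended":
        if machine_idle:  # CONTINUE: keyboard and CPU idle again
            return "running"
        if minutes_in_state > TEN_MINUTES:  # VACATE: suspended too long
            return "vacating"
        return "suspended"
    if state == "vacating":  # KILL: vacating (checkpointing) took too long
        return "killed" if minutes_in_state > TEN_MINUTES else "vacating"
    return state


# The owner comes back, stays for a while, and the job is eventually vacated.
print(next_state("running", keyboard_idle=False, cpu_idle=True, minutes_in_state=0))
print(next_state("suspended", keyboard_idle=False, cpu_idle=True, minutes_in_state=11))
```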

Page 39: Condor Tutorial NCSA Alliance ‘98

Hands-on: Example #5

UW-Madison CS Pool Startd Policy

Page 40: Condor Tutorial NCSA Alliance ‘98

Condor Administrator Features

The condor_master is the administrator’s best friend
• Watches/restarts other daemons
• Sends email if it notices suspicious problems
• Runs condor_preen
• Provides administrator remote control

Page 41: Condor Tutorial NCSA Alliance ‘98

Condor Administrator Commands

• condor_off [ hostname … ]
– Down entire pool: condor_off `cat machines-file`
• condor_on
• condor_restart
• condor_reconfig (“on-the-fly” reconfiguration)
• condor_vacate

These commands could be used by the Owner as well, if desired

Page 42: Condor Tutorial NCSA Alliance ‘98

Condor Host-based Access Control

HOSTALLOW and HOSTDENY settings grant machines (subnets, domains) different access levels:
• READ access
• WRITE access
• ADMINISTRATOR access
• OWNER access

Page 43: Condor Tutorial NCSA Alliance ‘98

Example: Simple Host-based Access Control

HOSTDENY_READ = *.mil

HOSTALLOW_WRITE = *.ncsa.uiuc.edu

HOSTDENY_WRITE = ppp*.ncsa.uiuc.edu, 172.44.*

HOSTALLOW_ADMINISTRATOR = bigcheese.ncsa.uiuc.edu

HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
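One way to reason about the settings above is to model the check in Python. This is a simplified sketch, not Condor's implementation. It assumes that a match in the deny list always wins, that a defined allow list admits only what it matches, and that a level with no allow list admits everyone not denied. Wildcards are handled with shell-style matching from the standard library, and the OWNER line is omitted because it relies on macro expansion.

```python
from fnmatch import fnmatchcase

CONFIG = {
    "HOSTDENY_READ": ["*.mil"],
    "HOSTALLOW_WRITE": ["*.ncsa.uiuc.edu"],
    "HOSTDENY_WRITE": ["ppp*.ncsa.uiuc.edu", "172.44.*"],
    "HOSTALLOW_ADMINISTRATOR": ["bigcheese.ncsa.uiuc.edu"],
}


def is_allowed(level, hostname, ip, config=CONFIG):
    """Decide access for one machine, identified by hostname and IP."""

    def listed(key):
        patterns = config.get(key, [])
        return any(
            fnmatchcase(name, pattern)
            for pattern in patterns
            for name in (hostname, ip)
        )

    if listed("HOSTDENY_" + level):  # assumed: deny takes precedence
        return False
    allow_key = "HOSTALLOW_" + level
    if allow_key in config:  # assumed: allow list is exclusive
        return listed(allow_key)
    return True


# The hostnames and addresses below are made-up test inputs.
print(is_allowed("WRITE", "node1.ncsa.uiuc.edu", "141.142.1.1"))
print(is_allowed("WRITE", "ppp12.ncsa.uiuc.edu", "141.142.1.2"))
print(is_allowed("READ", "base.army.mil", "10.0.0.1"))
```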

Page 44: Condor Tutorial NCSA Alliance ‘98

Configuration File Hierarchy

condor_config
• Pool-wide default
• Condor pool administrator’s requirements

condor_config.local
• Overrides for a specific machine
• Reflects Owner’s requirements

condor_config.root
• System Administrator requirements
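The layering can be pictured as successive dictionary updates. This is a minimal sketch, assuming the files are applied in the order listed above and that a later file overrides an earlier one for any setting both define. The setting values are invented for illustration, and real configuration files also support macros, which this ignores.

```python
def effective_config(*layers):
    """Merge configuration layers; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical contents of the three files.
pool_wide = {"START": "KeyboardIdle > 15 * 60", "WANT_SUSPEND": "True"}
local = {"START": "False"}  # this owner does not want jobs right now
root = {"HOSTALLOW_ADMINISTRATOR": "bigcheese.ncsa.uiuc.edu"}

config = effective_config(pool_wide, local, root)
print(config)
```

Here the machine-specific START replaces the pool-wide default, while WANT_SUSPEND is inherited unchanged.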

Page 45: Condor Tutorial NCSA Alliance ‘98

Future Directions

Condor for Windows NT

SMP support

More parallel job support
• Checkpoint parallel jobs
• MPI, MPI-2

Flocking

Page 46: Condor Tutorial NCSA Alliance ‘98

Obtaining Condor

Condor can be downloaded from the Condor web site at:
http://www.cs.wisc.edu/condor

Complete Users and Administrators manual available
http://www.cs.wisc.edu/condor/manual

Contracted Support is available

Questions? Email: [email protected]

Page 47: Condor Tutorial NCSA Alliance ‘98

Thank You!!

Thank you for your interest!

The Condor Team:

Miron Livny

Marvin Solomon

Todd Tannenbaum

Derek Wright

Bin Song

Rajesh Raman

Tom Stanis

Jim Basney

Adiel Yoaz