Condor Tutorial
NCSA Alliance '98
Presented by:
The Condor Team
University of Wisconsin-Madison
Email: [email protected]
URL: http://www.cs.wisc.edu/condor
Condor Tutorial, NCSA Alliance '98, April 27th 1998
Welcome to the Condor Tutorial!
Introductions
What is Condor?
• A system for High Throughput Computing
The “Religion” behind High Throughput Computing
Key Concepts:
• High Throughput Computing (HTC)
• Distributively owned resources
Performance vs. Throughput
High Performance - Very large amounts of processing capacity over short time periods (FLOPS - Floating Point Operations Per Second)
High Throughput - Large amounts of processing capacity sustained over very long time periods (FLOPY - Floating Point Operations Per Year)
FLOPY ≠ 30758400*FLOPS
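A rough arithmetic sketch of why a year of throughput is not just a peak rate multiplied by a number of seconds. The factor 30758400 is taken from the slide; the peak rate, availability, and efficiency figures are hypothetical values chosen only for illustration.

```python
# Factor quoted on the slide (seconds over which a peak rate would be multiplied).
SLIDE_FACTOR = 30_758_400


def naive_flopy(peak_flops):
    """Yearly total if the machine ran at its peak rate the whole time."""
    return SLIDE_FACTOR * peak_flops


def sustained_flopy(peak_flops, availability, efficiency):
    """Yearly total when the machine is only available part of the time
    and the application reaches only part of the peak rate.
    Both fractions are illustrative assumptions, not measured values."""
    return SLIDE_FACTOR * peak_flops * availability * efficiency


peak = 1e9  # hypothetical peak rate in FLOPS
print(naive_flopy(peak))
print(sustained_flopy(peak, availability=0.6, efficiency=0.5))
```

Under these assumed fractions the sustained figure is well under a third of the naive one, which is the gap High Throughput Computing tries to close.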
Distributed Ownership
Due to the dramatic decrease in the cost-performance ratio of hardware, powerful computing resources are owned today by individuals, groups, departments, …
• Huge increase in the aggregate processing capacity owned by the organization
• Much smaller increase in the capacity accessible by a single person
The Challenge and Motivation behind Condor
Turn large collections of existing distributively owned (and perhaps non-dedicated) computing resources into effective High Throughput Computing Environments
Minimize Wait while Idle
Road Block: Sociology
Make owners (& system administrators) happy.
• Give owners full control over
– when and by whom private resources are used for HTC
– the impact of HTC on private Quality of Service
– membership and information on HTC related activities
• Require no changes to existing software, and make it easy to install, configure, monitor, and maintain
Happy owners → more resources → higher throughput
Road Block: Robustness
To be effective, an HTC environment must run as a 24-7-365 operation.
• Customers count on it
• Debugging and fault isolation may be a very time-consuming process
• In a large distributed system, everything that might go wrong will go wrong.
Robust system → less down time → higher throughput
Road Block: Portability
To be effective, the HTC software must run on and support the latest and greatest hardware and software.
• Owners select hardware and software according to their needs and tradeoffs
• Customers expect it to be there.
• Application developers expect only a few (if any) changes to their applications.
Portability → more platforms → higher throughput
Condor’s unique mechanisms for HTC
Matchmaking - enables requests for services and offers to provide services to find each other.
Checkpointing - enables preemptive resume scheduling (go ahead and use it as long as it is available!).
Remote I/O - enables remote (from execution site) access to local (at submission site) data.
Condor Viewpoints
Owner
• Creates resource offers
User
• Creates resource requests
Administrator
• Drinks Coffee
• Manages the pool-wide configuration
• Could also be the Owner
Condor Agents
Condor Resource Agent
• condor_startd daemon
• allows a machine to execute Condor jobs
• enforces owner policy
Condor User Agent
• condor_schedd daemon
• allows a machine to submit jobs to a pool
The Tutorial Installation
[Diagram labels: Your Workstation, schedd, startd, Central Manager, Alliance '98 Pool]
The Tutorial Installation
[Diagram labels: Your Workstation, schedd (two), startd, Central Manager (two), Alliance '98 Pool, UW-Madison Pool]
Hands-on: Example #1
Joining the UW-Madison CS Condor Pool as a Submit-only node
Overview of Submitting a Job to Condor
Create a Submit-Description File
Run condor_compile to relink your program with the Condor Libraries, if Condor’s Checkpointing or Remote I/O support is desired
Run condor_submit
• sends your request to the User Agent (condor_schedd)
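A minimal sketch of the first step, building the text of a submit-description file. The helper name, the executable name, and the output, error, and log file names are hypothetical; the exact set of commands a given Condor version accepts should be checked against its manual.

```python
def submit_description(executable, universe="standard", count=1):
    """Return the text of a small submit-description file.

    The file names used for output, error, and log are placeholders
    chosen for this illustration.
    """
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


# "my_program" is a hypothetical executable already relinked with condor_compile.
print(submit_description("my_program"))
```

The resulting text would be saved to a file and that file name passed to condor_submit.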
Condor System Structure
Hands-on: Example #2
Submit Jobs to Condor
Condor Universes
A Universe specifies a Condor runtime environment:
• STANDARD
– Supports Checkpointing
– Supports Remote System Calls
– Has some limitations…
• VANILLA
– Any Unix executable (shell scripts, etc.)
– No Condor Checkpointing or Remote I/O
Hands-on: Example #3
Tour of User Tools/Commands
User Priorities in Condor
Each active user in the pool has a user priority
Viewed or changed with condor_userprio
Like golf: the lower, the better
A given user’s share of available machines is inversely related to the ratio between user priorities.
• Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
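The Fred and Joe example can be worked through with a short sketch, assuming each user's share is proportional to the inverse of their priority value. The function name and the pool size of 30 machines are hypothetical, and this is not Condor's actual negotiation code.

```python
def machine_shares(priorities, machines):
    """Split `machines` among users in inverse proportion to priority value."""
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total = sum(weights.values())
    return {user: machines * w / total for user, w in weights.items()}


shares = machine_shares({"fred": 10, "joe": 20}, machines=30)
for user, share in shares.items():
    print(user, round(share, 2))
```

With a lower (better) priority value, Fred ends up with twice Joe's share, matching the slide.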
User Priorities in Condor, cont.
Condor continuously adjusts user priorities over time
• machines allocated > priority, priority worsens
• machines allocated < priority, priority improves
Priority Preemption
• Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…)
• Starvation is prevented
• Priority “thrashing” is prevented
Parallel Jobs in Condor
Condor can run parallel applications (written to the popular PVM message-passing library)
Master-Worker Paradigm
Condor-PVM is designed to run PVM applications which follow the master-worker paradigm.
Master
• has a pool of work, sends pieces of work to the workers, manages the work and the workers
Worker
• gets a piece of work, does the computation, sends the result back
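The master-worker pattern itself can be sketched on a single machine with Python threads and queues standing in for PVM tasks and messages. This illustrates the paradigm only; it does not use PVM or Condor, and the function names and the squaring workload are hypothetical.

```python
import queue
import threading


def worker(tasks, results):
    """Take pieces of work until told to stop, and send each result back."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel from the master: no more work
            break
        results.put((item, item * item))  # the "computation" is a placeholder


def master(work, n_workers=3):
    """Hand out work to the workers and collect their results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for item in work:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    for t in threads:
        t.join()
    collected = {}
    while not results.empty():
        key, value = results.get()
        collected[key] = value
    return collected


print(master(range(5)))
```

Because workers pull work from a shared pool, losing or gaining a worker changes only how fast the pool drains, which is why the paradigm suits non-dedicated machines.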
What does Condor-PVM do?
Condor acts as the PVM resource manager.
All pvm_addhost requests get re-mapped to Condor.
• Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines.
When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.
How to compile and submit Condor-PVM jobs
Binary Compatible
• Compile and link with the PVM library just as for normal PVM applications. No need to link with Condor.
Submit
In the submit file set:
universe = PVM
machine_count = <min>..<max>
Classified Advertisements
ClassAds
• Language for expressing attributes
• Semantics for evaluating them
Intuitively, a ClassAd is a set of named expressions
• Each named expression is an attribute
Expressions are similar to C …
• Constants, attribute references, operators
Classified Advertisements: Example
MyType = "Machine"
TargetType = "Job"
Name = "froth.cs.wisc.edu"
StartdIpAddr = "<128.105.73.44:33846>"
Arch = "INTEL"
OpSys = "SOLARIS251"
VirtualMemory = 225312
Disk = 35957
KFlops = 21058
Mips = 103
LoadAvg = 0.011719
KeyboardIdle = 12
Cpus = 1
Memory = 128
Requirements = LoadAvg <= 0.300000 && KeyboardIdle > 15 * 60
Rank = 0
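To make "a set of named expressions" concrete, here is a small sketch that models part of the machine ad above as a Python dictionary, with constants stored as values and expressions stored as functions of the ad. The dictionary representation and the evaluate helper are assumptions made for illustration; they are not how Condor implements ClassAds.

```python
machine_ad = {
    "MyType": "Machine",
    "Name": "froth.cs.wisc.edu",
    "Arch": "INTEL",
    "OpSys": "SOLARIS251",
    "LoadAvg": 0.011719,
    "KeyboardIdle": 12,  # seconds since the last keyboard activity
    "Memory": 128,
    # Expressions refer to other attributes of the same ad.
    "Requirements": lambda ad: ad["LoadAvg"] <= 0.3 and ad["KeyboardIdle"] > 15 * 60,
    "Rank": lambda ad: 0,
}


def evaluate(ad, name):
    """Return an attribute's value, evaluating it if it is an expression."""
    value = ad[name]
    return value(ad) if callable(value) else value


print(evaluate(machine_ad, "Arch"))
print(evaluate(machine_ad, "Requirements"))
```

With KeyboardIdle at 12 seconds, the Requirements expression evaluates to false: the load is low, but the keyboard has not yet been idle for 15 minutes.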
Classified Advertisements: Matching
ClassAds are always considered in pairs
Does ClassAd A match ClassAd B (and vice versa)?
Classified Advertisements: Examples
ClassAd A
MyType = "Apartment"
TargetType = "ApartmentRenter"
SquareArea = 3500
RentOffer = 1000
HeatIncluded = False
OnBusLine = True
Rank = UnderGrad==False + TARGET.RentOffer
Requirements = MY.RentOffer - TARGET.RentOffer < 150
ClassAd B
MyType = "ApartmentRenter"
TargetType = "Apartment"
UnderGrad = False
RentOffer = 900
Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
Requirements = OnBusLine && SquareArea > 2700
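The two-way match between ClassAd A and ClassAd B can be traced with a sketch in which each ad is a dictionary and each expression is a function of (my, target). It assumes an unqualified name is looked up in the ad's own attributes first and then in the other ad, and it reads the Rank in ad A as (UnderGrad == False) + TARGET.RentOffer. Both are assumptions about the ClassAd semantics, and the lookup and match helpers are hypothetical.

```python
def lookup(name, my, target):
    """Resolve an unqualified attribute: own ad first, then the other ad."""
    return my[name] if name in my else target[name]


ad_a = {
    "MyType": "Apartment", "TargetType": "ApartmentRenter",
    "SquareArea": 3500, "RentOffer": 1000,
    "HeatIncluded": False, "OnBusLine": True,
    "Rank": lambda my, t: (lookup("UnderGrad", my, t) == False) + t["RentOffer"],
    "Requirements": lambda my, t: my["RentOffer"] - t["RentOffer"] < 150,
}

ad_b = {
    "MyType": "ApartmentRenter", "TargetType": "Apartment",
    "UnderGrad": False, "RentOffer": 900,
    "Rank": lambda my, t: 1 / (t["RentOffer"] + 100.0) + 50 * lookup("HeatIncluded", my, t),
    "Requirements": lambda my, t: lookup("OnBusLine", my, t) and lookup("SquareArea", my, t) > 2700,
}


def match(a, b):
    """Two ads match only if each one's Requirements accepts the other."""
    return bool(a["Requirements"](a, b) and b["Requirements"](b, a))


print(match(ad_a, ad_b))
print(ad_a["Rank"](ad_a, ad_b), ad_b["Rank"](ad_b, ad_a))
```

Here the rent gap is 100, which is under 150, and the apartment is on a bus line with more than 2700 square feet, so both Requirements hold and the pair matches. Rank then expresses how much each side likes this particular match.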
ClassAds in the Condor System
ClassAds allow Condor to be a general system
• Constraints and ranks on matches expressed by entities themselves
• Only priority logic integrated into Manager
All principal entities in the Condor system are represented by ClassAds
• Machines, Jobs, Submitters
ClassAds in Condor: Requirements and Rank (Example)
Friend = Owner == "tannenba" || Owner == "wright"
ResearchGroup = Owner == "jbasney" || Owner == "raman"
Trusted = Owner != "rival" && Owner != "riffraff"
Requirements = Trusted && ( ResearchGroup || LoadAvg < 0.3 && KeyboardIdle > 15*60 )
Rank = Friend + ResearchGroup*10
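A sketch of how this owner policy evaluates for a few job owners, written as plain Python functions. The function names and the sample load and idle values are hypothetical; the user names and thresholds are the ones in the example above, and && is taken to bind more tightly than ||, as in C.

```python
def flags(owner):
    """Compute the helper attributes Friend, ResearchGroup, and Trusted."""
    friend = owner in ("tannenba", "wright")
    research_group = owner in ("jbasney", "raman")
    trusted = owner not in ("rival", "riffraff")
    return friend, research_group, trusted


def requirements(owner, load_avg, keyboard_idle):
    """True if this machine is willing to run a job from `owner`."""
    _, research_group, trusted = flags(owner)
    return trusted and (research_group or (load_avg < 0.3 and keyboard_idle > 15 * 60))


def rank(owner):
    """Higher rank means this machine prefers the job more."""
    friend, research_group, _ = flags(owner)
    return friend + research_group * 10


for owner in ("raman", "wright", "rival"):
    print(owner, requirements(owner, load_avg=2.0, keyboard_idle=0), rank(owner))
```

Research group members are accepted even on a busy machine and get the highest rank, friends must wait for an idle machine, and untrusted users are never accepted.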
Hands-on: Example #4
Submit Jobs with ClassAd Constraints
Resource Owner’s Viewpoint
In Condor, the owner of the resource (machine owner) can dictate the terms and conditions under which that resource can be used
How? Configure the Resource Agent’s Policy (condor_startd configuration)
Resource Agent Configuration Expressions
START expression
• When TRUE, Condor can start a job
– True = Unclaimed State
– False = Owner State
SUSPEND expression
• When TRUE, Condor suspends any job running on this machine
CONTINUE expression
• When TRUE, will continue a suspended job
Resource Agent Configuration Expressions, cont.
VACATE expression
• When TRUE, kick the job off of the machine (via a Checkpoint if possible)
KILL expression
• When TRUE, kill the job immediately
– No Checkpoint
– On UNIX: a “kill -9”
Resource Agent Configuration Expressions, cont.
[Flowchart: START, WANT SUSPEND, SUSPEND, VACATE, WANT VACATE, and KILL expressions connected by True/False transitions]
Resource Agent Configuration Expressions, cont.
Default Setup
WANT_VACATE : True
WANT_SUSPEND : True
START : Keyboard_Idle && CPU_Idle
SUSPEND : Keyboard_Busy || CPU_Busy
CONTINUE : Keyboard and CPU idle again
VACATE : If Suspended > 10 minutes
KILL : If spent > 10 minutes in VACATE state
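One plausible reading of the default setup, sketched as a small state-transition function. The state names, the Machine fields, and the exact ordering of the checks are assumptions made for illustration; the real condor_startd evaluates configuration expressions and may order its transitions differently.

```python
from dataclasses import dataclass

WANT_SUSPEND = True
WANT_VACATE = True


@dataclass
class Machine:
    keyboard_idle: bool
    cpu_idle: bool
    minutes_suspended: float = 0.0
    minutes_vacating: float = 0.0


def next_state(state, m):
    """Return the next job state under the default policy on the slide."""
    busy = not (m.keyboard_idle and m.cpu_idle)
    if state == "idle":  # START: keyboard and CPU both idle
        return "idle" if busy else "running"
    if state == "running":  # SUSPEND: keyboard or CPU busy
        if not busy:
            return "running"
        if WANT_SUSPEND:
            return "suspended"
        return "vacating" if WANT_VACATE else "killed"
    if state == "suspended":
        if not busy:  # CONTINUE: keyboard and CPU idle again
            return "running"
        if m.minutes_suspended > 10:  # VACATE: suspended for over 10 minutes
            return "vacating" if WANT_VACATE else "killed"
        return "suspended"
    if state == "vacating":  # KILL: over 10 minutes spent vacating
        return "killed" if m.minutes_vacating > 10 else "vacating"
    return state


print(next_state("running", Machine(keyboard_idle=False, cpu_idle=True)))
```

The owner's return first only suspends the job, and eviction happens only if the owner stays for more than ten minutes.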
Hands-on: Example #5
UW-Madison CS Pool Startd Policy
Condor Administrator Features
The condor_master is the administrator’s best friend
• Watches/restarts other daemons
• Sends Email if it notices suspicious problems
• Runs condor_preen
• Provides administrator remote control
Condor Administrator Commands
Administrator Commands
• condor_off [ hostname … ]
– Down entire pool: condor_off `cat machines-file`
• condor_on
• condor_restart
• condor_reconfig (“on-the-fly” reconfiguration)
• condor_vacate
These commands could be used by the Owner as well, if desired
Condor Host-based Access Control
HOST_ALLOW and HOST_DENY to grant machines (subnets, domains) different access levels:
• READ access
• WRITE access
• ADMINISTRATOR access
• OWNER access
Example: Simple Host-based Access Control
HOSTDENY_READ = *.mil
HOSTALLOW_WRITE = *.ncsa.uiuc.edu
HOSTDENY_WRITE = ppp*.ncsa.uiuc.edu, 172.44.*
HOSTALLOW_ADMINISTRATOR = bigcheese.ncsa.uiuc.edu
HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
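A sketch of how allow and deny lists like these could be evaluated for a connecting host. It assumes that a deny match always wins, that a missing allow list means "allow everyone not denied", and that patterns are simple shell-style wildcards; these rules and the is_allowed helper are assumptions for illustration rather than a description of Condor's own code.

```python
from fnmatch import fnmatchcase

# The READ, WRITE, and ADMINISTRATOR settings from the example above.
POLICY = {
    "READ": {"deny": ["*.mil"]},
    "WRITE": {
        "allow": ["*.ncsa.uiuc.edu"],
        "deny": ["ppp*.ncsa.uiuc.edu", "172.44.*"],
    },
    "ADMINISTRATOR": {"allow": ["bigcheese.ncsa.uiuc.edu"]},
}


def is_allowed(level, host):
    """Check a host name or IP string against one access level."""
    rules = POLICY.get(level, {})
    if any(fnmatchcase(host, pattern) for pattern in rules.get("deny", [])):
        return False  # assumed: deny takes precedence over allow
    allow = rules.get("allow")
    if allow is None:
        return True  # assumed: no allow list means everyone else is allowed
    return any(fnmatchcase(host, pattern) for pattern in allow)


for host in ("bigcheese.ncsa.uiuc.edu", "ppp12.ncsa.uiuc.edu", "host.army.mil"):
    print(host, is_allowed("READ", host), is_allowed("WRITE", host))
```

The dial-up hosts show why both lists are useful: they fall inside the allowed domain but are carved out again by the deny pattern.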
Configuration File Hierarchy
condor_config
• Pool-wide default
• Condor pool administrator’s requirements
condor_config.local
• Overrides for a specific machine
• Reflects Owner’s requirements
condor_config.root
• System Administrator requirements
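The layering can be pictured as successive dictionary updates, where a setting in a later file replaces the same setting from an earlier one. The order used here (pool-wide, then local, then root) and the sample values are assumptions for illustration; the setting names are borrowed from earlier slides.

```python
def effective_config(*layers):
    """Merge configuration layers; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical contents of the three files.
condor_config = {
    "START": "Keyboard_Idle && CPU_Idle",
    "HOSTALLOW_WRITE": "*.ncsa.uiuc.edu",
}
condor_config_local = {"START": "False"}  # this owner opts the machine out
condor_config_root = {"HOSTALLOW_ADMINISTRATOR": "bigcheese.ncsa.uiuc.edu"}

config = effective_config(condor_config, condor_config_local, condor_config_root)
for name, value in sorted(config.items()):
    print(name, "=", value)
```

The pool-wide file supplies defaults for every machine, while the per-machine file changes only the settings its owner cares about.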
Future Directions
Condor for Windows NT
SMP support
More parallel job support
• Checkpoint parallel jobs
• MPI, MPI-2
Flocking
…
Obtaining Condor
Condor can be downloaded from the Condor web site at:
http://www.cs.wisc.edu/condor
Complete Users and Administrators manual available
http://www.cs.wisc.edu/condor/manual
Contracted Support is available
Questions? Email :
Thank You!
Thank you for your interest!
The Condor Team:
Miron Livny
Marvin Solomon
Todd Tannenbaum
Derek Wright
Bin Song
Rajesh Raman
Tom Stanis
Jim Basney
Adiel Yoaz