Current methods for negotiating firewalls for the Condor® system

Bruce Beckles (University of Cambridge Computing Service)
Se-Chang Son (University of Wisconsin)
John Kewley (CCLRC Daresbury Laboratory)
What is Condor?

• A specialised, cross-platform, distributed batch scheduling system
• Often used for utilising idle CPU cycles on workstations
• Distributed systems architecture:
   Different components run on different machines
   Can provide greater resilience and improve performance…
   …at the expense of simplicity (particularly simplicity of its use of the network)
Main Condor machine roles

• Central Manager: Monitors all Condor nodes and “matches” jobs to execute nodes
• Submit nodes: Submit jobs to the pool
• Execute nodes: Execute jobs
• Checkpoint server (optional): Stores checkpoints of jobs (for supported job types)

Machines may have more than one role, and there may be multiple machines with each of the above roles (except that there can only ever be one active central manager).
Diagrammatic overview

[Diagram: Central Manager, Submit Node, Execute Node and Checkpoint Server, each running Condor daemons]

• The Central Manager’s Condor daemons normally listen on ports 9614 and 9618
• The Execute Node tells the Central Manager about itself; the Central Manager tells it when to accept a job from the Submit Node
• The Submit Node tells the Central Manager about the job; the Central Manager tells it to which Execute Node it should send the job
• The Submit Node starts the job on the Execute Node; the Execute Node sends results back to the Submit Node
• On the Execute Node, the user’s job consists of the user’s executable code linked with the Condor libraries; the Condor daemons spawn the job and signal it when to abort, suspend, or checkpoint
• For some jobs, system calls are performed as remote procedure calls back to the Submit Node
• The Checkpoint Server’s Condor daemons listen on ports 5651–5654; the Checkpoint Server advertises itself to the Central Manager
• For some jobs, checkpoints are written to the Checkpoint Server, and the status of the Checkpoint Server is checked
Who? How?

• Machine communication:
   Which machine talks to which
   Protocol(s) used
(Does not include the high-availability daemon (Condor 6.7.6 and later))
Firewalls: Basic problems

• Pattern of network communication:
   “Many-to-many”
   Often bidirectional
• Port usage:
   Large range of dynamic ports
   Checkpoint server ports not configurable
• Protocols used: TCP and UDP
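The large dynamic port range can be tamed somewhat: Condor’s configuration macros LOWPORT and HIGHPORT confine dynamically allocated ports to a known range, which a firewall exception can then cover along with the well-known daemon ports. A sketch (the range, subnet and iptables rules below are illustrative choices, not a recommendation):

```
# condor_config fragment: confine Condor's dynamically allocated
# ports to a known range (values illustrative)
LOWPORT  = 9600
HIGHPORT = 9700

# Matching netfilter exceptions on a pool machine, limited to the
# pool's subnet (192.0.2.0/24 is an example placeholder); note this
# range also covers the well-known ports 9614 and 9618
iptables -A INPUT -p tcp -s 192.0.2.0/24 --dport 9600:9700 -j ACCEPT
iptables -A INPUT -p udp -s 192.0.2.0/24 --dport 9600:9700 -j ACCEPT
```

This does not help with the checkpoint server, whose ports are not configurable.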
Firewalls: Other problems

• Administrative overhead: A large pool may mean many exceptions
• Personal firewalls: Like having a different firewall for each machine(!)
• Condor does not handle certain network connectivity failures gracefully
• Inadequate/inaccurate documentation
• Bugs in Condor:
   Didn’t always set SO_KEEPALIVE (now fixed)
   Machines “disappearing” from pool (although machine still has network connectivity)
   Problems with Windows Firewall (now resolved?)
Solutions: Identified requirements

• Respect the security boundary
• Reduce administrative overhead
• Minimal impact on firewall performance
• NAT/firewall traversal
• Allow incremental implementation
• Scalability
• Robustness (in the face of network problems)
• Fail gracefully
• Integrated into Condor’s security framework
• Logging
• Documentation
Types of solution

• Mitigation (“avoidance”): Mitigating the effects of firewalls
• Altering the pattern of network communication: Reducing it from “many-to-many” to “one-to-many”, “few-to-many”, etc.
• NAT/firewall traversal: Traversing the security boundary
Current solutions

• CCLRC’s “Firewall Mirroring” (FM)
• Using centralised submit nodes (CS)
• Remote job submission/Condor-C (C-C)
• Generic Connection Brokering (GCB)
• Dynamic Port Forwarding (DPF)
“Firewall Mirroring”

• Developed by John Kewley
• Ensures that jobs are never given to execute nodes that cannot run them because of network connectivity issues (e.g. personal firewalls)
• Achieved by duplicating the firewall configuration in the machine’s ClassAd and then modifying the job’s requirements appropriately
• Works well with personal firewalls
• Some administrative overhead
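A sketch of the ClassAd mechanism involved (the attribute name below is hypothetical, not FM’s actual one):

```
# In the execute node's local Condor configuration: advertise whether
# the machine's personal firewall accepts the inbound connections a
# job needs (OPEN_INBOUND is an illustrative attribute name)
OPEN_INBOUND = True
STARTD_EXPRS = $(STARTD_EXPRS), OPEN_INBOUND

# In the job's submit file: only match machines whose firewall
# will let the job's traffic through
requirements = (OPEN_INBOUND == True)
```

FM automates keeping the advertised attributes in step with the actual firewall configuration; the administrative overhead lies in maintaining that mapping.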
Centralised submit nodes

• Reduces the pattern of network communication (“few-to-many” or better)
• Lowers administrative overhead
• Can have minimal impact on firewall performance…
• …but may impact the performance of the Condor pool
• Ideal for centrally managed “campus grid” scenarios
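With only one or a few submit nodes, the firewall exceptions on execute nodes can be narrowed to those hosts rather than opened to the whole network. For example (hostname and rule illustrative):

```
# On each execute node: accept Condor traffic only from the single
# centralised submit node (submit.example.org is a placeholder)
iptables -A INPUT -p tcp -s submit.example.org -j ACCEPT
```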
Remote job submission/Condor-C

• Remote job submission: The submit node submits the job to a different submit node, which then submits the job to the Condor pool
   Poorly documented
   Doesn’t scale well
   Security implications
• Condor-C: New feature as of Condor 6.7.3
   Moves the job submission queue of one submit node to another submit node
   “scales gracefully when compared with Condor’s flocking mechanism”
   Maintains only a single network connection between the two submit nodes
• Can be used to reduce the pattern of network communication
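A Condor-C job is submitted via the grid universe; the sketch below follows the style of later Condor manuals (the exact submit-file syntax changed across 6.7.x releases, and the hostnames are placeholders):

```
# Condor-C submit file (sketch): hand the job to the schedd on a
# remote submit node, which queues it in the remote pool
universe      = grid
grid_resource = condor schedd.remote.example.org cm.remote.example.org
executable    = my_job
output        = my_job.out
log           = my_job.log
queue
```

Only the connection between the two schedds crosses the firewall, which is what reduces the communication pattern.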
Generic Connection Brokering

• NAT/firewall traversal technique
• Developed by Se-Chang Son
• Transparent to the application
• Can reverse the direction of a network connection…
• …or relay network packets between two machines that could not otherwise communicate
• Some scalability issues
• Not yet part of any official Condor release
• Not yet integrated into Condor’s security framework
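In pre-release Condor documentation, GCB is configured via the NET_REMAP family of macros; since GCB is not yet in any official release, the fragment below is a sketch whose details may differ from what eventually ships:

```
# condor_config fragment (sketch; GCB is pre-release, names may change)
NET_REMAP_ENABLE  = True
NET_REMAP_SERVICE = GCB
NET_REMAP_INAGENT = 192.0.2.10   # address of the GCB broker (example)
```

The broker then either reverses the connection (the firewalled machine connects out) or relays packets between the two endpoints.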
Dynamic Port Forwarding

• NAT/firewall traversal technique
• Developed by Se-Chang Son
• Add-on to the firewall: Currently only supports Linux netfilter-based firewalls
• The application asks DPF to open a hole in the firewall
• DPF closes the hole when the connection is finished
• Highly scalable
• Not yet part of any official Condor release
• Not yet integrated into Condor’s security framework
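The “hole” is the kind of per-connection netfilter rule a DPF-style agent might install on request and remove again afterwards (addresses and port are illustrative; this is not DPF’s actual implementation):

```
# Open: accept this one expected connection
iptables -I INPUT -p tcp -s 192.0.2.20 --dport 40000 -j ACCEPT

# Close: delete the same rule once the connection has finished
iptables -D INPUT -p tcp -s 192.0.2.20 --dport 40000 -j ACCEPT
```

Because each hole is narrow and short-lived, the approach scales well and keeps the firewall’s standing rule set small.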
Solutions v. Requirements

• See paper for notes and explanations
Conclusion

• No “perfect” solution (none meets all the requirements)
• Careful design of the Condor pool can help
• Many solutions are still experimental / not yet generally available
• Se-Chang is working on further technical solutions not discussed here
• Some issues are best addressed within Condor (e.g. failing gracefully on loss of network connectivity)
• Further development of Condor is required to properly address many of these issues