Current methods for negotiating firewalls for the Condor® system

Bruce Beckles (University of Cambridge Computing Service)
Se-Chang Son (University of Wisconsin)
John Kewley (CCLRC Daresbury Laboratory)
What is Condor?

• A specialised, cross-platform, distributed batch scheduling system
• Often used for utilising idle CPU cycles on workstations
• Distributed systems architecture:
   Different components run on different machines
   Can provide greater resilience and improve performance…
   …at the expense of simplicity (particularly simplicity of its use of the network)
Main Condor machine roles

• Central Manager: Monitors all Condor nodes and “matches” jobs to execute nodes
• Submit nodes: Submit jobs to the pool
• Execute nodes: Execute jobs
• Checkpoint server (optional): Stores checkpoints of jobs (for supported job types)

Machines may have more than one role, and there may be multiple machines with each of the above roles (except that there can only ever be one active central manager).
Diagrammatic overview

[Diagram: Central Manager, Submit Node, Execute Node and Checkpoint Server, each running Condor daemons]

• The Central Manager’s Condor daemons normally listen on ports 9614 and 9618
• The Execute Node tells the Central Manager about itself; the Central Manager tells it when to accept a job from the Submit Node
• The Submit Node tells the Central Manager about the job; the Central Manager tells it to which Execute Node it should send the job
• The Submit Node starts the job on the Execute Node; the Execute Node sends results back to the Submit Node
• On the Execute Node, the user’s job consists of the user’s executable code linked with the Condor libraries; the Condor daemons spawn the job and signal it when to abort, suspend, or checkpoint
• For some jobs, system calls are performed as remote procedure calls back to the Submit Node
• The Checkpoint Server’s Condor daemons listen on ports 5651–5654; the Checkpoint Server advertises itself to the Central Manager
• For some jobs, checkpoints are written to the Checkpoint Server, and the status of the Checkpoint Server is checked
Who? How?

• Machine communication:
   Which machine talks to which
   Protocol(s) used
(Does not include the high-availability daemon (Condor 6.7.6 and later))
Firewalls: Basic problems

• Pattern of network communication:
   “Many-to-many”
   Often bidirectional
• Port usage:
   Large range of dynamic ports
   Checkpoint server ports not configurable
• Protocols used: TCP and UDP
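The large dynamic port range can be tamed somewhat: Condor’s configuration macros LOWPORT and HIGHPORT confine dynamically allocated ports to a known range, which a firewall exception can then cover along with the well-known daemon ports. A sketch (the range, subnet and iptables rules below are illustrative choices, not a recommendation):

```
# condor_config fragment: confine Condor's dynamically allocated
# ports to a known range (values illustrative)
LOWPORT  = 9600
HIGHPORT = 9700

# Matching netfilter exceptions on a pool machine, limited to the
# pool's subnet (192.0.2.0/24 is an example placeholder); note this
# range also covers the well-known ports 9614 and 9618
iptables -A INPUT -p tcp -s 192.0.2.0/24 --dport 9600:9700 -j ACCEPT
iptables -A INPUT -p udp -s 192.0.2.0/24 --dport 9600:9700 -j ACCEPT
```

This does not help with the checkpoint server, whose ports are not configurable.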
Firewalls: Other problems

• Administrative overhead: A large pool may mean many exceptions
• Personal firewalls: Like having a different firewall for each machine(!)
• Condor does not handle certain network connectivity failures gracefully
• Inadequate/inaccurate documentation
• Bugs in Condor:
   Didn’t always set SO_KEEPALIVE (now fixed)
   Machines “disappearing” from pool (although machine still has network connectivity)
   Problems with Windows Firewall (now resolved?)
Solutions: Identified requirements

• Respect the security boundary
• Reduce administrative overhead
• Minimal impact on firewall performance
• NAT/firewall traversal
• Allow incremental implementation
• Scalability
• Robustness (in the face of network problems)
• Fail gracefully
• Integrated into Condor’s security framework
• Logging
• Documentation
Types of solution

• Mitigation (“avoidance”): Mitigating the effects of firewalls
• Altering the pattern of network communication: Reducing it from “many-to-many” to “one-to-many”, “few-to-many”, etc.
• NAT/firewall traversal: Traversing the security boundary
Current solutions

• CCLRC’s “Firewall Mirroring” (FM)
• Using centralised submit nodes (CS)
• Remote job submission/Condor-C (C-C)
• Generic Connection Brokering (GCB)
• Dynamic Port Forwarding (DPF)
“Firewall Mirroring”

• Developed by John Kewley
• Ensures that jobs are never given to execute nodes that cannot run them because of network connectivity issues (e.g. personal firewalls)
• Achieved by duplicating the firewall configuration in the machine’s ClassAd and then modifying the job’s requirements appropriately
• Works well with personal firewalls
• Some administrative overhead
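A sketch of the ClassAd mechanism involved (the attribute name below is hypothetical, not FM’s actual one):

```
# In the execute node's local Condor configuration: advertise whether
# the machine's personal firewall accepts the inbound connections a
# job needs (OPEN_INBOUND is an illustrative attribute name)
OPEN_INBOUND = True
STARTD_EXPRS = $(STARTD_EXPRS), OPEN_INBOUND

# In the job's submit file: only match machines whose firewall
# will let the job's traffic through
requirements = (OPEN_INBOUND == True)
```

FM automates keeping the advertised attributes in step with the actual firewall configuration; the administrative overhead lies in maintaining that mapping.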
Centralised submit nodes

• Reduces the pattern of network communication (“few-to-many” or better)
• Lowers administrative overhead
• Can have minimal impact on firewall performance…
• …but may impact the performance of the Condor pool
• Ideal for centrally managed “campus grid” scenarios
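With only one or a few submit nodes, the firewall exceptions on execute nodes can be narrowed to those hosts rather than opened to the whole network. For example (hostname and rule illustrative):

```
# On each execute node: accept Condor traffic only from the single
# centralised submit node (submit.example.org is a placeholder)
iptables -A INPUT -p tcp -s submit.example.org -j ACCEPT
```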
Remote job submission/Condor-C

• Remote job submission: The submit node submits the job to a different submit node, which then submits the job to the Condor pool
   Poorly documented
   Doesn’t scale well
   Security implications
• Condor-C: New feature as of Condor 6.7.3
   Moves the job submission queue of one submit node to another submit node
   “scales gracefully when compared with Condor’s flocking mechanism”
   Maintains only a single network connection between the two submit nodes
• Can be used to reduce the pattern of network communication
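A Condor-C job is submitted via the grid universe; the sketch below follows the style of later Condor manuals (the exact submit-file syntax changed across 6.7.x releases, and the hostnames are placeholders):

```
# Condor-C submit file (sketch): hand the job to the schedd on a
# remote submit node, which queues it in the remote pool
universe      = grid
grid_resource = condor schedd.remote.example.org cm.remote.example.org
executable    = my_job
output        = my_job.out
log           = my_job.log
queue
```

Only the connection between the two schedds crosses the firewall, which is what reduces the communication pattern.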
Generic Connection Brokering

• NAT/firewall traversal technique
• Developed by Se-Chang Son
• Transparent to the application
• Can reverse the direction of a network connection…
• …or relay network packets between two machines that could not otherwise communicate
• Some scalability issues
• Not yet part of any official Condor release
• Not yet integrated into Condor’s security framework
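In pre-release Condor documentation, GCB is configured via the NET_REMAP family of macros; since GCB is not yet in any official release, the fragment below is a sketch whose details may differ from what eventually ships:

```
# condor_config fragment (sketch; GCB is pre-release, names may change)
NET_REMAP_ENABLE  = True
NET_REMAP_SERVICE = GCB
NET_REMAP_INAGENT = 192.0.2.10   # address of the GCB broker (example)
```

The broker then either reverses the connection (the firewalled machine connects out) or relays packets between the two endpoints.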
Dynamic Port Forwarding

• NAT/firewall traversal technique
• Developed by Se-Chang Son
• Add-on to the firewall: Currently only supports Linux netfilter-based firewalls
• The application asks DPF to open a hole in the firewall
• DPF closes the hole when the connection is finished
• Highly scalable
• Not yet part of any official Condor release
• Not yet integrated into Condor’s security framework
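The “hole” is the kind of per-connection netfilter rule a DPF-style agent might install on request and remove again afterwards (addresses and port are illustrative; this is not DPF’s actual implementation):

```
# Open: accept this one expected connection
iptables -I INPUT -p tcp -s 192.0.2.20 --dport 40000 -j ACCEPT

# Close: delete the same rule once the connection has finished
iptables -D INPUT -p tcp -s 192.0.2.20 --dport 40000 -j ACCEPT
```

Because each hole is narrow and short-lived, the approach scales well and keeps the firewall’s standing rule set small.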
Solutions v. Requirements

• See paper for notes and explanations
Conclusion

• No “perfect” solution (none meets all the requirements)
• Careful design of the Condor pool can help
• Many solutions are still experimental / not yet generally available
• Se-Chang is working on further technical solutions not discussed here
• Some issues are best addressed within Condor (e.g. failing gracefully on loss of network connectivity)
• Further development of Condor is required to properly address many of these issues