Overview of Wisconsin Campus Grid
Dan Bradley
Technology
[Diagram: HTCondor submit machines flock jobs between Condor pools, crossing a firewall and NAT with shared_port and CCB]
• Open one port and use shared_port on the submit machine.
• If execute nodes are behind NAT but have outgoing network access, use CCB (see the configuration sketch below).
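A minimal configuration sketch of the shared_port, flocking, and CCB setup described above; the pool names and port number are placeholders rather than the actual UW settings.

# condor_config on a submit machine: flock jobs to other campus pools and
# route all daemon traffic through the single port opened in the firewall
# (the remote central managers must also list this machine in FLOCK_FROM)
FLOCK_TO = cm.pool-a.example.wisc.edu, cm.pool-b.example.wisc.edu
USE_SHARED_PORT = TRUE
SHARED_PORT_PORT = 9618

# condor_config on execute nodes behind NAT (outbound connections only):
# register with a CCB broker so connections back to the node can be brokered
CCB_ADDRESS = $(COLLECTOR_HOST)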
• pools: 5
• submit nodes: 50
• user groups: 106
• execute nodes: 1,600
• cores: 10,000
Example submit file:
executable = a.out
RequestMemory = 1000
output = stdout
error = stderr
queue 1000
Accessing Files
• No campus-wide shared FS
• HTCondor file transfer for most cases (see the sketch after this list):
– Send software + input files to the job
– Grind, grind, …
– Send output files back to the submit node
• Some other cases:
– AFS: works on most of campus, but not across OSG
– httpd + SQUID(s): when transfer from the submit node doesn't scale
– CVMFS: read-only HTTP FS (see talk tomorrow)
– HDFS: big datasets on lots of disks
– Xrootd: good for access from anywhere (used on top of HDFS and local FS)
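A minimal submit-file sketch of the "send software + input, run, send output back" pattern above; the executable and file names are placeholders.

executable = run_analysis.sh
# ship the software bundle and input data to the execute node
transfer_input_files = software.tar.gz, input.dat
should_transfer_files = YES
# copy whatever the job writes in its sandbox back on exit
when_to_transfer_output = ON_EXIT
output = stdout
error = stderr
queue 1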
Managing Workflows
• A simple submit file works for many users
– We provide an example job wrapper script to help download and set up common software packages: MATLAB, python, R
• DAGMan is used by many others
– Common pattern (see the sketch after this list):
• User drops files into a directory structure
• Script generates a DAG from that
• Rinse, lather, repeat
• Some application portals are also used
– e.g. NEOS Online Optimization Service
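A minimal sketch of the kind of DAG such a script might generate; the node names, submit files, and variables are placeholders.

# work.dag — split the input, analyze the pieces in parallel, then merge
JOB  split  split.sub
JOB  run_a  analyze.sub
JOB  run_b  analyze.sub
JOB  merge  merge.sub
VARS run_a  input="part_a.dat"
VARS run_b  input="part_b.dat"
PARENT split CHILD run_a run_b
PARENT run_a run_b CHILD merge

Submitting it with condor_submit_dag work.dag lets DAGMan enforce the ordering and write a rescue DAG if some nodes fail.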
Overflowing to OSG
• glideinWMS
– We run a glideinWMS “frontend”
– Uses OSG glidein factories
– Appears to users as just another pool to flock to
• But jobs must opt in: +WantGlidein = True (see the sketch after this list)
[Chart: million hours used]
• We customize glideins to make them look more like other nodes on campus:
– publish OS version, glibc version, CVMFS availability
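A hedged sketch of the opt-in from a user's point of view; the requirements expression assumes the glideins advertise attributes such as OpSysAndVer and HAS_CVMFS, which is a common convention, but the exact attribute names here are illustrative.

executable = a.out
# opt in to overflowing onto glideinWMS glideins on OSG
+WantGlidein = True
# steer jobs using the attributes the customized glideins publish
# (attribute names are illustrative, not necessarily the ones used at UW)
requirements = (OpSysAndVer == "SL6") && (HAS_CVMFS =?= True)
output = stdout
error = stderr
queue 1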
A Clinical Health Application
• Tyler Churchill: modeling cochlear implants to improve signal processing.
• Used OSG + campus resources to run simulations that include important acoustic temporal fine structure, which is typically ignored due to difficulty.
“We can't do much about sound resolution given hardware limitations, but we can improve the integrated software. OSG and distributed high-throughput computing are helping us rapidly produce results that directly benefit CI wearers.”
Engaging Users
• Meet with individuals (PI + techs)
– Diagram workflow
– How much input, output, memory, time?
– Suitable for exporting to OSG?
– Where will the output go?
– What software is needed? Licenses?
• Tech support as needed
• Periodic reviews
Training Users
• Workshops on campus
– New users can learn about HTCondor, OSG, etc.
– Existing groups can send new students
– Show examples of what others have done
• Classes
– Scripting for scientific users: python, perl, submitting batch jobs, DAGMan
User Resources
• Many bring only their (big) brains
– Use central or local department submit nodes
– Use only modest scratch space
• Some have their own submit node
– Can attach their own storage
– Control user access
– Install system software packages
Submitting Big
• Kick-started the work with a big run in EC2; now continuing on campus.
• Building a database to quickly classify stem cells and identify important genes active in cell states useful for clinical applications.
Victor Ruotti, winner of Cycle Computing’s Big Science Challenge
Users with Clusters
• Three flavors:
– condominium
• User provides cash, we do the rest
– neighborhood association
• User provides space, power, cooling, machines
• Configuration is standardized
– sister cities
• Independent pools that people want to share
• e.g. student computer labs
Laboratory for Molecular and Computational Genomics
• Cluster integrated into campus grid
• Combined resources can map data representing the equivalent of one human genome in 90 minutes.
• Tackling challenging cases such as the important maize genome, which is difficult for traditional sequence assembly approaches.
• Using a whole-genome single-molecule optical mapping technique.
Reaching Further
[Chart: Research Groups by Discipline]