workflows: from development to production thursday morning, 10:00 am lauren michael research...
Post on 31-Dec-2015
212 views
Embed Size (px)
TRANSCRIPT
Workflows: from Development to ProductionThursday morning, 10:00 amLauren Michael Research Computing FacilitatorUniversity of Wisconsin - Madison
2012 OSG User School
Engineering a Good WorkflowDraw out the general workflowDefine details (test pieces with HTCondor)Define the computing workflowdivide or consolidate processesoff-load file transfers and consider file transfer timesidentify steps to be automated or checkedBuild it piece-by-piece; test and optimizeScale-up: data and computing resourcesWhat more can you automate or error check?(And document, document, document)
*
2012 OSG User School
process 99(filter output)From This . . .*(special transfer)file prep and split(POST-RETRY)process 0(filter output)combine, transform results
(POST-RETRY). . . 1 GB RAM 2 GB Disk 1.5 hours100 MB RAM500 MB Disk 40 min(each)300 MB RAM 1 GB Disk 15 min(PRE)(POST-RETRY)(POST-RETRY)
2012 OSG User School
End Up with This*DATANobel
2012 OSG User School
Scaling and OptimizationYour small DAG runs (once)! Now what?Need to make it run full-scaleNeed to make it run everywhereNeed to make it run every timeNeed to make it run unattendedNeed to make it run when someone else tries
*
2012 OSG User School
Scaling Up Things to Think AboutMore jobs:50-MB files may be fine for 10 or 100 jobs, but not for 1000 jobs. Why?More difficult to identify errorsLarger files:execute RAM and disk spacematching time vs. execute timedata transfer time vs. computation timeScale-up gradually*
2012 OSG User School
Data Scaling SolutionsFile manipulationscut into pieces, filtercompression/decompression
Listen to Dereks methods (Yesterday):SandboxCachingPrestagingStorage Element (SE) horsepower*
2012 OSG User School
Make It Run EverywhereWhat does an OSG machine have?Assume the worst: nothingBring as much as possible with you:Wont that slow me down?Bring:ExecutableEnvironmentParameters (using parameter files)Random numbers (generate before-submission)*
2012 OSG User School
Scaling Out to OSG:Rules of ThumbCPU (single-threaded)Run for 5 minutes and 2 hoursUpper limit somewhat softDiskKeep scratch working space < 20 GBIntermediate needs (/tmp?)Submit disk: think quota and I/O transfer Batch or Break Up JobsUse squid caching where appropriate*
2012 OSG User School
The expanding onionLaptop (1 machine)You control everything!Local cluster (1000 cores)You can ask an admin nicelyCampus (5000 cores)It better be important/generalizableOSG (50,000 cores)Good luck finding the pool admins*
2012 OSG User School
Bringing It With You: MATLABWhats the problem with MATLAB?license limitationsWhats the solution?compiling
Similar measures for other interpreter languages (R, Python, etc)
*
2012 OSG User School
How to bring MATLAB alongPurchase & install MATLAB compilerRun compiler as follows:$ mcc -m -R -singleCompThread -R -nodisplay -R -nojvm nocache foo.m
3) This creates run_foo.sh (et. al.)4) Create tarball of the runtime$ cd /usr/local/mathworks-R2009bSP1 ; cd ..$ tar cvzf ~/m.tgz mathworks-R2009bSP1
*
2012 OSG User School
More MATLAB5) Edit the run_foo.shtar xzf m.tgz mkdir cache chmod 0777 cache export MCR_CACHE_ROOT=`pwd`/cache
Make a submit file:universe = vanilla executable = run_foo.sh arguments = ./mathworks-R2009bSP1 should_transfer_files = yeswhen_to_transfer_output = on_exit transfer_input_files = m.tgz, fooqueue*
2012 OSG User School
Make It Work EverytimeWhat could possibly go wrong?EvictionFile corruptionPerformance surprisesNetworkPagingDiskMaybe even a bug in your code*
2012 OSG User School
Error Checks Are EssentialIf you dont check, it will happen
Check expected file existence (transfer or creation) with a loopbetter yet, check rough file sizeCheck expected file contentsspecific line text, # of columns, etc.Advanced: Error-check with wrapper-POST combo*
2012 OSG User School
What to do if a check failsUnderstand something about failure
Use DAG RETRY
Let rescue dag continueworkflow-specific*
2012 OSG User School
Performance SurprisesOne bad node can ruin your whole day
Black Hole machinesAvoidance tricks and their auto-clustering costsGLIDEIN whitelist/blacklist if a site is somehow bad. (But talk to the GOC first)REALLY slow machinesUse PERIODIC_HOLD / RELEASE
*
2012 OSG User School
Make It Work Unattended The ultimate goal!
Need to automate:Data collectionData cleansingSubmission (condor cron)Analysis and verificationLaTeX and paper submission *
2012 OSG User School
Make Science Work Unattended? *Well, maybe not, but a scientist can dream
2012 OSG User School
Make It Run(-able) for Someone ElseIf others cant reproduce your work, it isnt real science!Work hard to make this happen.Its their throughput, too.
Only ~10% of published cancer research is reproducible!
Another argument for automation*
2012 OSG User School
Documentation at Multiple LevelsIn job files: comment linessubmit files, wrapper scripts, executables
In README filesdescribe file purposesdefine overall workflow, justifications
In a Document draw the workflow!*
2012 OSG User School
Make It Run Faster? Maybe.Throughput, throughput, throughputResource reductions (match more slots!)Wall-time reductionsif significant per workflowWhy not per job?Think in orders of magnitude:Say you have 1000 hour-long jobs that are matched at a rate of 1 per minute
Waste the computers time, not yours.*
2012 OSG User School
Maybe Not Worth ItMaybe Not Worth It:Rewriting your code in a different languageTargeting faster machinesTargeting machines that will run longer jobsOthers?
Waste the computers time, not yours.
*
2012 OSG User School
Testing, Testing, 1-2-3
ALWAYS test a subset after making changesHow big of a change?Scale up gradually
Avoid making problems for others (and for yourself)
*
2012 OSG User School
If this were a test20 points for finishing at all10 points for the right answer1 point for every error check1 point per documentation line
Out of 100 points? 200 points?
*finishingrighterrorchecks&documen-tation
2012 OSG User School
Questions?Feel free to contact me:lmichael@wisc.eduNow: Break10:30-10:45amNext:10:45am-12:15pm: Exercises 6.2, 6.312:15pm: Lunch1:15-2:30pm: Principles of High-Throughput Computing
*