Post on 31-Dec-2015

  • Workflows: from Development to Production
    Thursday morning, 10:00 am
    Lauren Michael, Research Computing Facilitator
    University of Wisconsin - Madison

    2012 OSG User School

    Engineering a Good Workflow
      - Draw out the general workflow
      - Define details (test pieces with HTCondor)
      - Define the computing workflow
        - divide or consolidate processes
        - off-load file transfers and consider file transfer times
        - identify steps to be automated or checked
      - Build it piece-by-piece; test and optimize
      - Scale up: data and computing resources
      - What more can you automate or error check?
      - (And document, document, document)

    From This . . .
      [Figure: workflow diagram. Nodes: file prep and split; process 0 . . . process 99 (filter output); combine, transform results. Annotations on the nodes: (special transfer), (PRE), (POST-RETRY). Resource labels: 1 GB RAM, 2 GB disk, 1.5 hours; 100 MB RAM, 500 MB disk, 40 min (each); 300 MB RAM, 1 GB disk, 15 min.]
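
A workflow of this shape (one prep node, 100 process nodes, one combine node, each with a POST script and a retry) could be written out as a DAGMan input file. The sketch below is only one way to do it. The submit file names (prep.sub, process.sub, combine.sub), the script names (stage_inputs.sh, check_output.sh), the retry count, and the choice of which node carries the PRE script are all assumptions made for illustration, not details from the slides.

```shell
#!/bin/sh
# Sketch only: prints a DAGMan input file for prep -> N process jobs -> combine.
# prep.sub, process.sub, combine.sub, stage_inputs.sh and check_output.sh are
# hypothetical names; the retry count is an arbitrary choice.

make_dag() {
    n_jobs=$1
    retries=$2

    # Declare every node before any PARENT/CHILD line refers to it.
    echo "JOB prep prep.sub"
    echo "JOB combine combine.sub"
    i=0
    while [ "$i" -lt "$n_jobs" ]; do
        echo "JOB process$i process.sub"
        echo "VARS process$i index=\"$i\""
        i=$((i + 1))
    done

    # A POST script that exits non-zero marks its node as failed,
    # and RETRY then reruns that node.
    echo "SCRIPT PRE combine stage_inputs.sh"
    for node in prep combine; do
        echo "SCRIPT POST $node check_output.sh $node"
        echo "RETRY $node $retries"
    done

    # Per-process checks, retries, and dependencies.
    i=0
    while [ "$i" -lt "$n_jobs" ]; do
        echo "SCRIPT POST process$i check_output.sh process$i"
        echo "RETRY process$i $retries"
        echo "PARENT prep CHILD process$i"
        echo "PARENT process$i CHILD combine"
        i=$((i + 1))
    done
}

# Usage: make_dag 100 3 > workflow.dag
```

Generating the file from a loop rather than typing it by hand keeps the 100 process nodes identical, so a change to the retry count or the check script is made in one place.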

    End Up with This
      [Figure: DATA to Nobel]

    Scaling and Optimization
      Your small DAG runs (once)! Now what?
      - Need to make it run full-scale
      - Need to make it run everywhere
      - Need to make it run every time
      - Need to make it run unattended
      - Need to make it run when someone else tries

    Scaling Up: Things to Think About
      - More jobs:
        - 50-MB files may be fine for 10 or 100 jobs, but not for 1000 jobs. Why?
        - More difficult to identify errors
      - Larger files:
        - execute RAM and disk space
        - matching time vs. execute time
        - data transfer time vs. computation time
      - Scale up gradually
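
One likely answer to the "Why?" above is aggregate transfer load: every job pulls its own copy of the file through the submit node. A quick back-of-the-envelope sketch, assuming one 50-MB input per job and ignoring output files (the function name is made up for this example):

```shell
#!/bin/sh
# Sketch: total input data moved if every job transfers one file of the
# same size. total_transfer_mb is a hypothetical helper name.

total_transfer_mb() {
    n_jobs=$1
    file_mb=$2
    echo $(( n_jobs * file_mb ))
}

for n in 10 100 1000; do
    echo "$n jobs x 50 MB = $(total_transfer_mb "$n" 50) MB"
done
```

At 1000 jobs that is 50,000 MB, roughly 50 GB, leaving a single submit machine, which is where caching or prestaging starts to matter.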

    Data Scaling Solutions
      - File manipulations
        - cut into pieces, filter
        - compression/decompression
      - Listen to Derek's methods (Yesterday):
        - Sandbox
        - Caching
        - Prestaging
        - Storage Element (SE) horsepower

    Make It Run Everywhere
      - What does an OSG machine have?
        - Assume the worst: nothing
      - Bring as much as possible with you:
        - Won't that slow me down?
      - Bring:
        - Executable
        - Environment
        - Parameters (using parameter files)
        - Random numbers (generate before submission)
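
Parameter files and pre-generated random numbers can be prepared together on the submit side. The following is a minimal sketch, assuming one small text file per job; the file naming (params_N.txt), the key names, and the use of /dev/urandom as the seed source are choices made for this example only.

```shell
#!/bin/sh
# Sketch: write one parameter file per job, each with a seed drawn on the
# submit machine so that execute machines never have to generate one.
# File layout and key names are hypothetical.

make_param_files() {
    outdir=$1
    n_jobs=$2
    mkdir -p "$outdir"
    i=0
    while [ "$i" -lt "$n_jobs" ]; do
        # Read 4 random bytes and print them as one unsigned integer.
        seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
        printf 'job_index = %s\nseed = %s\n' "$i" "$seed" \
            > "$outdir/params_$i.txt"
        i=$((i + 1))
    done
}

# Usage: make_param_files params 100
```

Because the seeds are saved in files, a job that is retried or rerun by someone else uses the same random stream as the original run.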

    Scaling Out to OSG: Rules of Thumb
      - CPU (single-threaded)
        - Run for between 5 minutes and 2 hours
        - Upper limit somewhat soft
      - Disk
        - Keep scratch working space < 20 GB
        - Intermediate needs (/tmp?)
        - Submit disk: think quota and I/O transfer
      - Batch or break up jobs
      - Use squid caching where appropriate
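
For the "batch" half of "batch or break up", the arithmetic is simple enough to script. This sketch assumes every task takes about the same time; the helper names and the 60-minute target are illustrative, not recommendations from the slides beyond the 5-minute to 2-hour window.

```shell
#!/bin/sh
# Sketch: how many short tasks to bundle into one job to reach a target
# job length, and how many jobs that yields. Helper names are hypothetical.

tasks_per_job() {
    task_seconds=$1
    target_minutes=$2
    echo $(( target_minutes * 60 / task_seconds ))
}

jobs_needed() {
    total_tasks=$1
    per_job=$2
    # Integer ceiling division.
    echo $(( (total_tasks + per_job - 1) / per_job ))
}

per_job=$(tasks_per_job 30 60)
echo "tasks per job: $per_job"
echo "jobs for 10000 tasks: $(jobs_needed 10000 "$per_job")"
```

With 30-second tasks and a one-hour target, this bundles 120 tasks per job, so 10,000 tasks become 84 jobs instead of 10,000 very short ones.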

    The expanding onion
      - Laptop (1 machine)
        - You control everything!
      - Local cluster (1000 cores)
        - You can ask an admin nicely
      - Campus (5000 cores)
        - It better be important/generalizable
      - OSG (50,000 cores)
        - Good luck finding the pool admins

    Bringing It With You: MATLAB
      - What's the problem with MATLAB?
        - license limitations
      - What's the solution?
        - compiling
      - Similar measures for other interpreted languages (R, Python, etc.)

    How to bring MATLAB along
      1) Purchase & install MATLAB compiler
      2) Run compiler as follows:
           $ mcc -m -R -singleCompThread -R -nodisplay -R -nojvm -nocache foo.m
      3) This creates run_foo.sh (et al.)
      4) Create tarball of the runtime:
           $ cd /usr/local/mathworks-R2009bSP1 ; cd ..
           $ tar cvzf ~/m.tgz mathworks-R2009bSP1

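    More MATLAB
      5) Edit run_foo.sh to add, before the program is launched:
           # unpack the runtime that was transferred with the job
           tar xzf m.tgz
           # give the runtime a writable cache inside the job's scratch directory
           mkdir cache
           chmod 0777 cache
           export MCR_CACHE_ROOT=`pwd`/cache
      6) Make a submit file:
           universe = vanilla
           executable = run_foo.sh
           arguments = ./mathworks-R2009bSP1
           should_transfer_files = yes
           when_to_transfer_output = on_exit
           transfer_input_files = m.tgz, foo
           queue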

    Make It Work Every Time
      What could possibly go wrong?
      - Eviction
      - File corruption
      - Performance surprises
        - Network
        - Paging
        - Disk
      - Maybe even a bug in your code

    Error Checks Are Essential
      If you don't check, it will happen
      - Check expected file existence (transfer or creation) with a loop
        - better yet, check rough file size
      - Check expected file contents
        - specific line text, # of columns, etc.
      - Advanced: Error-check with wrapper-POST combo
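
The three checks above (existence, rough size, contents) fit in one small shell function that a wrapper or a DAG POST script could call. This is a sketch under assumptions: the function name, the distinct return codes, and the whitespace-separated column check are choices for this example, and real thresholds depend on the workflow.

```shell
#!/bin/sh
# Sketch: check that an output file exists, is at least a minimum size,
# and has the expected number of whitespace-separated columns on every line.
# check_output and its return codes (1, 2, 3) are hypothetical.

check_output() {
    file=$1
    min_bytes=$2
    n_cols=$3

    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }

    size=$(wc -c < "$file" | tr -d ' ')
    [ "$size" -ge "$min_bytes" ] || {
        echo "too small ($size bytes): $file" >&2
        return 2
    }

    # awk exits non-zero if any line has the wrong field count.
    awk -v n="$n_cols" 'NF != n { bad = 1 } END { exit bad + 0 }' "$file" || {
        echo "unexpected column count: $file" >&2
        return 3
    }
    return 0
}

# Example loop over expected outputs (file names are hypothetical):
# for i in 0 1 2; do check_output "out_$i.txt" 1000 5 || exit 1; done
```

Used as a POST script, a non-zero exit marks the node as failed, which is what lets DAG RETRY act on a bad output file rather than only on a crashed job.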

    What to do if a check fails
      - Understand something about failure
      - Use DAG RETRY
      - Let rescue dag continue
        - workflow-specific

    Performance Surprises
      One bad node can ruin your whole day
      - Black Hole machines
        - Avoidance tricks and their auto-clustering costs
        - GLIDEIN whitelist/blacklist if a site is somehow bad. (But talk to the GOC first)
      - REALLY slow machines
        - Use PERIODIC_HOLD / RELEASE
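
For the slow-machine case, the hold/release pair might look like the submit-file lines printed by the sketch below. The thresholds (two hours, five starts) are arbitrary, and the expressions are an assumption about how the attributes JobStatus, EnteredCurrentStatus, HoldReasonCode and NumJobStarts are commonly combined, so check them against the HTCondor version in use before relying on them.

```shell
#!/bin/sh
# Sketch: print submit-file lines that put a job on hold once it has been
# running longer than a limit, then release it so it can match again.
# Assumptions: JobStatus == 2 means "running" and HoldReasonCode == 3 means
# "held by periodic_hold"; the limits passed in are arbitrary examples.

hold_release_lines() {
    max_seconds=$(( $1 * 60 ))
    max_starts=$2
    cat <<EOF
periodic_hold = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > $max_seconds)
periodic_release = (HoldReasonCode == 3) && (NumJobStarts < $max_starts)
EOF
}

# Hold after 120 minutes of running; give up after 5 starts.
hold_release_lines 120 5
```

Capping the number of starts keeps a job that is slow everywhere (for example, because of a bug) from cycling forever.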

    Make It Work Unattended
      The ultimate goal!
      Need to automate:
      - Data collection
      - Data cleansing
      - Submission (condor cron)
      - Analysis and verification
      - LaTeX and paper submission

    Make Science Work Unattended?
      Well, maybe not, but a scientist can dream

    Make It Run(-able) for Someone Else
      - If others can't reproduce your work, it isn't real science!
        - Work hard to make this happen.
        - It's their throughput, too.
      - Only ~10% of published cancer research is reproducible!
      - Another argument for automation

    Documentation at Multiple Levels
      - In job files: comment lines
        - submit files, wrapper scripts, executables
      - In README files
        - describe file purposes
        - define overall workflow, justifications
      - In a document
        - draw the workflow!

    Make It Run Faster? Maybe.
      - Throughput, throughput, throughput
        - Resource reductions (match more slots!)
        - Wall-time reductions
          - if significant per workflow
          - Why not per job?
      - Think in orders of magnitude:
        - Say you have 1000 hour-long jobs that are matched at a rate of 1 per minute
      Waste the computer's time, not yours.
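
The 1000-job example can be worked through with a few lines of arithmetic. This sketch assumes there are always enough free slots, so the only delay is the matching rate: the last job starts after all the others have been matched and then runs for one job length. The function name is made up for this example.

```shell
#!/bin/sh
# Sketch: time for a whole batch to finish when jobs start one at a time
# at a fixed matching interval and slots are never the bottleneck.
# makespan_minutes is a hypothetical helper name.

makespan_minutes() {
    n_jobs=$1
    match_interval=$2
    runtime=$3
    echo $(( (n_jobs - 1) * match_interval + runtime ))
}

echo "60-minute jobs: $(makespan_minutes 1000 1 60) minutes"
echo "30-minute jobs: $(makespan_minutes 1000 1 30) minutes"
```

Under these assumptions the workflow takes 1059 minutes with hour-long jobs and 1029 minutes if each job is made twice as fast: halving the per-job run time shortens the whole workflow by under 3%, which is why per-job speedups are often not worth the effort.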

    Maybe Not Worth It
      - Rewriting your code in a different language
      - Targeting faster machines
      - Targeting machines that will run longer jobs
      - Others?
      Waste the computer's time, not yours.

    Testing, Testing, 1-2-3
      - ALWAYS test a subset after making changes
        - How big of a change?
      - Scale up gradually
      - Avoid making problems for others (and for yourself)

    If this were a test
      - 20 points for finishing at all
      - 10 points for the right answer
      - 1 point for every error check
      - 1 point per documentation line
      Out of 100 points? 200 points?
      [Figure: chart with segments labeled "finishing", "right", and "error checks & documentation"]

    Questions?
      Feel free to contact me: lmichael@wisc.edu
      Now: Break, 10:30-10:45am
      Next:
      - 10:45am-12:15pm: Exercises 6.2, 6.3
      - 12:15pm: Lunch
      - 1:15-2:30pm: Principles of High-Throughput Computing
