Introducing Gardner

Upload: lytuyen
Posted on 30-Apr-2018

Introducing Gardner

Center for Research Informatics

• Established in 2011 to support BSD research

• Mission:
  – To provide informatics resources and services to the BSD, to participate in clinical and biomedical research of the highest scientific merit, and to support and promote research and education in the field of informatics

Resources and Services

• Clinical data for research
• Bioinformatics data analysis
• Computing infrastructure
  – Storage
  – HPC
  – Virtual Servers
• Research data management tools
• Custom-built applications
• Educational opportunities

http://cri.uchicago.edu

CRI Infrastructure Team

• Director
  – Thorbjorn Axelsson
• High Performance Computing
  – Mike Jarsulic
  – Tony Aburaad
• Virtual Servers
  – Andy Brook
  – Sneha Jha
• Storage
  – Olumide Kehinde
• Utility Infielder
  – Dan Sullivan

About Me

• Lived in Pittsburgh for about 32 years
• Attended the University of Pittsburgh (at Johnstown)
• Bettis Atomic Power Laboratory (2004-2012)
  – Scientific Programmer (Thermal/Hydraulic Design)
  – Analyst - USS Gerald R. Ford
  – High Performance Computing
• University of Chicago (2012 – present)

USS Gerald R. Ford

A Few Weeks Ago!!!

“Gerald R. Ford USS, what a place… it really feels like a place.”

About Tony

• Masters student in computer science at UChicago
  – Completing coursework in machine learning, distributed computing, and iOS
• Spent last summer at the Computation Institute working on a caching tool for the Open Science Grid
• Has been helping with Gardner at the CRI since November
• Dislikes mimes

CRI HPC Clusters, September 2012

• Prudential Data Center
  – BRDF Cluster
  – IBI Cluster
  – IBIBMEM
• Kenwood Data Center
  – BIO Cluster

Tarbell

• Purchased for the CRI in 2012 by the previous staff
• Dell cluster utilizing AMD Bulldozer processors
• Infiniband QDR
• 110 TB scratch space
• Why named Tarbell?

Who was Harlan Tarbell?

• Born in Delavan, IL
• Grew up in Groveland, IL
• Magician
• Doctor of Naprapathy
• Futurist

Themes:
• Beginner mistakes
• Predicting the future
• Quackery

Beginner Mistakes

• Scratch space
  – Set up poorly, to the point where the system would become unstable
  – Utilized only 60 TB of space initially
  – Hardware had low RAM (24 GB per node)
• Login node
  – Only one (fixed)

Predicting the Future

• Compute nodes
  – Only one tier of memory (fixed)
• Infiniband
  – Expecting QDR to stick around forever
  – Poor strategy for future clusters

AMD Bulldozer

• Did not live up to expectations
• Shared floating point unit
• Lawsuit

Quackery

Tarbell Metrics

• Since December 2013
  – 234 users
  – Total user jobs: 4.6 million
  – Total CPU hours: 18.29 million
  – Average queue hours: 2.94 hours
  – Average job efficiency: 65%
  – Average wall clock accuracy: 11%

Who was Martin Gardner?

• Graduate of the University of Chicago
• Yeoman on the USS Pope during WWII
• Amateur magician
• Mathematical Games
• Skepticism
• Literature
• Art

Mathematical Games

• Flexagons
• Polyominoes
• Game of Life
• Newcomb’s Paradox
• Mandelbrot's Fractals
• Penrose Tiling
• Public Key Cryptography
• Best bet for simpletons paradox

Mathematical Games

Skepticism

• One of the original founders of CSICOP
• Critic of:
  – Lysenkoism
  – Homeopathy
  – Chiropractic
  – Naturopathy
  – Orgone Chambers
  – Dianetics

Literature and Art

Also of Interest…

What is HPC?

Node Count Comparison

Node Type               Tarbell   Gardner
Standard Compute Nodes  34        88
Mid-Tier Compute Nodes  0         28
Large Memory Nodes      2         4
GPU Nodes               0         5
Xeon Phi Nodes          0         1
Interactive Nodes       2         2 (eventually 4)
Remote Viz Nodes        0         Possibly 2

Core Count Comparison

Node Type               Tarbell   Gardner
Standard Compute Nodes  2176      2464
Mid-Tier Compute Nodes  0         784
Large Memory Nodes      80        112
GPU Nodes               0         140
Xeon Phi Nodes          0         28
TOTAL                   2256      3528

Standard Node Comparison

Attribute               Tarbell           Gardner
Processor               AMD Opteron 6274  Intel Haswell E5-2683 v3
Clock Speed             2.2 GHz           2.0 GHz
Processors per Node     4                 2
Cores per Processor     16                14
Instructions per Cycle  8 (or 4)          16
RAM per Core            4 GB              4.5 GB

Mid-Tier Compute Nodes

Attribute               Gardner
Processor               Intel Haswell E5-2683 v3
Clock Speed             2.0 GHz
Processors per Node     2
Cores per Processor     14
Instructions per Cycle  16
RAM per Core            16 GB

Large Memory Node Comparison

Attribute               Tarbell                 Gardner
Processor               Intel Westmere E7-4860  Intel Haswell E5-2683 v3
Clock Speed             2.27 GHz                2.0 GHz
Processors per Node     4                       2
Cores per Processor     10                      14
Instructions per Cycle  8                       16
RAM per Core            25.6 GB                 45.7 GB

GPGPU Nodes

CPU Attribute           Gardner
Processor               Intel Haswell E5-2683 v3
Clock Speed             2.0 GHz
Processors per Node     2
Cores per Processor     14
Instructions per Cycle  16
RAM per Core            8 GB
Accelerator             Nvidia Tesla K80
GPU                     Tesla GK210 (x2)
Cores per GPU           2496
RAM per Accelerator     24 GB

Xeon Phi Nodes

CPU Attribute           Gardner
Processor               Intel Haswell E5-2683 v3
Clock Speed             2.0 GHz
Processors per Node     2
Cores per Processor     14
Instructions per Cycle  16
RAM per Core            8 GB
Accelerator             Intel Xeon Phi 5110P (x2)
Cores per Accelerator   60
RAM per Accelerator     8 GB

Scratch Space Comparison

Attribute               Tarbell               Gardner
Processor               Intel Westmere E5620  Intel Haswell E5-2623 v3
Clock Speed             2.4 GHz               3.0 GHz
Processors per Node     2                     2
Cores per Processor     4                     4
Instructions per Cycle  8                     16
RAM per Node            24 GB                 64 GB
Cache Pool              N/A                   200 GB
Usable Space            110 TB                350 TB
Interconnect Bandwidth  40 Gb/s               56 Gb/s

Benchmarking

Attribute                         Tarbell      Gardner
Theoretical Performance           44.2 TFLOPs  112.8 TFLOPs
Actual Performance                21.2 TFLOPs  97 TFLOPs
GPU Theoretical Performance       N/A          14.5 TFLOPs
GPU Actual Performance            N/A          11.4 TFLOPs
Xeon Phi Theoretical Performance  N/A          2 TFLOPs
Xeon Phi Actual Performance       N/A          1.7 TFLOPs

FLOPs = Nodes * (Number of Cores / Node) * Frequency * (Operations per Cycle)
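As a sanity check, the formula above can be coded up directly. Applying it to Gardner's standard compute partition alone (88 nodes, 2 x 14 Haswell cores per node at 2.0 GHz, 16 operations per cycle, all from the tables above) gives roughly 78.8 TFLOPs; the 112.8 TFLOPs theoretical figure presumably aggregates the remaining CPU partitions as well.

```shell
# FLOPs = nodes * cores/node * frequency * operations/cycle.
# With frequency in GHz, (cores * GHz * ops) is in GFLOPs, so divide
# by 1000 to get TFLOPs. Partition numbers are from the tables above.
peak_tflops() {
    # $1 nodes, $2 cores per node, $3 clock in GHz, $4 ops per cycle
    awk -v n="$1" -v c="$2" -v g="$3" -v o="$4" \
        'BEGIN { printf "%.3f\n", n * c * g * o / 1000 }'
}

peak_tflops 88 28 2.0 16   # Gardner standard partition, ≈ 78.8 TFLOPs
```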

Software

• Compilers
  – Intel
  – PGI
  – GNU
  – Java 7 and 8
  – DLang
• MPI
  – OpenMPI
  – MPICH
  – Intel MPI
• Software Environment
  – Lmod
• Scheduler
  – Moab 9.1
• Resource Manager
  – Torque 6.1

What is Going to Happen To?

• Tarbell
  – Decommissioned: 3/31/17
• LMEM-CRI
  – Decommissioned
• Stats
  – Repurposed
  – X-enabled login nodes for the cluster
  – Commercial software: SAS, Stata, MATLAB, etc.
• Galaxy
  – Decommissioned with Tarbell

Obtaining an Account

• Prerequisite: BSD account
• Sign up for an account
  – http://cri.uchicago.edu
  – Early Access
• Email address for job output
• Emergency phone number
• Software requests
• Level of experience
• Collaborator accounts

Being a Good HPC Citizen

1. Do not run analysis on the login nodes!
2. Cite the cluster and the software used in your publications.
3. Try to be accurate with your resource requests.
4. Allow the CRI to install open source software for you.
5. If you are going to run an analysis that is much larger than normal, let us know in advance.

Being a Good HPC Citizen

6. Provide feedback.
7. Clean up your scratch storage.
8. If using a script to submit, sleep for a few seconds in between each submission.
9. Be sure to release memory in your scripts.
10. If you have a question, don't hesitate to ask us.
11. If you notice a problem, report it.
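The batch-submission advice above (rule 8) can be sketched as a throttled loop. The input directory and the analysis.pbs job script are hypothetical names for this sketch, and the wrapper falls back to printing the command when no scheduler is present:

```shell
#!/bin/bash
# Throttled batch submission: one qsub per input file, with a pause
# between submissions so the scheduler is not flooded.
# INPUT_DIR and analysis.pbs are illustrative placeholders.
shopt -s nullglob

submit() {
    # Use qsub when it exists; otherwise print the command (dry run).
    if command -v qsub >/dev/null 2>&1; then
        qsub "$@"
    else
        echo "dry run: qsub $*"
    fi
}

for input in "${INPUT_DIR:-/scratch/$USER/inputs}"/*.dat; do
    submit -v INPUT="$input" analysis.pbs
    sleep 2   # a few seconds between each submission
done
```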

Citations

• The continued growth and support of the CRI's HPC program is dependent on demonstrable value.
• Citing the cluster allows us to justify purchasing faster clusters with more capacity in the future.
• Sample citation:
  – This work utilized the computational resources of the Center for Research Informatics' Gardner HPC cluster at the University of Chicago (http://cri.uchicago.edu).
• Make sure you cite the software used as well!

Software Installation

• Software requests can be submitted via the Resource Request forms at http://cri.uchicago.edu
• Advantages to allowing the CRI to install open source software:
  – Other users can utilize it
  – Avoids a support nightmare
  – Portability
• Disadvantages
  – It may take a few days (let us know the priority)

How to Get Support

• Call the CRI Help Desk
  – 773-834-8475
• Email [email protected] to submit a ticket, or use the Request Forms on the CRI website
• Meet with Mike at our Peck office (N161)
  – Tuesday and Thursday afternoons
  – Schedule an appointment
• User group meetings
  – Once a month at Peck

Examples

• Get an account
  – Resource Request Form
• Have software installed
  – Resource Request Form
• Job extension
  – Email [email protected]
  – CC: Mike ([email protected])
• Major problem on the cluster
  – Call the Help Desk
  – Email [email protected]

Logging In

• On Campus
  – ssh to gardner.cri.uchicago.edu
• Off Campus
  – VPN
    • CVPN (CNET account required)
    • BSD VPN
  – ssh to gardner.cri.uchicago.edu

Storage

• Home directories (/home/<userid>)
  – Permanent, private, quota'd, not backed up
  – 1 Gb/s
• Lab shares (/group/<lab_name>)
  – Permanent, shared, quota'd, backed up
  – 1 Gb/s
• Scratch space (/scratch/<userid>)
  – Purged, private, not quota'd, not backed up
  – 56 Gb/s
  – Purged every 6 months (to start)

Software Environment

• Tarbell -> Environment Modules
  – Flat module system
  – Modules written in TCL
  – Last update: December 2012
• Gardner -> Lmod
  – Hierarchical module system
  – Modules written in Lua
  – Last update: August 2016

Lmod Basics

• See which modules are available to be loaded
  – module avail
• Load packages
  – module load <package1> <package2>
• See which packages are loaded
  – module list
• Unload a package
  – module unload <package>

Scheduling Jobs (Defaults)

• Maximum amount of walltime
  – 14 days
• Maximum number of processors
  – 500 concurrent
• Maximum number of jobs
  – 500 concurrent
• Maximum amount of memory
  – 2 TB

Job Scheduling (Queues)

• Route
  – Default queue (non-executable)
• Express
  – 1 node; 1 proc; <= 4 GB RAM; <= 6 hours
• Standard
  – Multi-node; multi-proc; <= 8 GB RAM

Job Scheduling (Queues)

• Mid
  – Multi-node; multi-proc; > 8 GB RAM; <= 24 GB RAM
• High
  – Multi-node; multi-proc; > 24 GB RAM

Torque Client Commands

• Submit a job
  – qsub <scriptname>
• Delete a job
  – qdel <jobid>
• Job status
  – qstat
• Extended job status
  – qstat -f

Torque Directives

• Specify a job name
  – #PBS -N <JobName>
• Specify nodes and cores
  – #PBS -l nodes=x:ppn=y
• Specify the wall clock time limit
  – #PBS -l walltime=[dd:[hh:[mm:]]]ss
• Specify the memory limit
  – #PBS -l mem=<x>gb

Torque Directives

• Specify the shell to execute the script
  – #PBS -S <path_to_shell>
• Specify the STDOUT location
  – #PBS -o <path>
• Specify the STDERR location
  – #PBS -e <path>
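Putting the directives from both slides together, a minimal job script might look like the sketch below. The job name, resource numbers, module, and program are illustrative placeholders, not a prescribed CRI template:

```shell
#!/bin/bash
#PBS -N example_job              # job name
#PBS -l nodes=1:ppn=4           # 1 node, 4 cores
#PBS -l walltime=02:00:00       # 2 hours of wall clock time
#PBS -l mem=8gb                 # 8 GB of memory
#PBS -S /bin/bash               # shell used to execute the script
#PBS -o example_job.out         # STDOUT location
#PBS -e example_job.err         # STDERR location

cd "$PBS_O_WORKDIR"             # Torque sets this to the submission directory
module load gcc                 # illustrative: load whatever your job needs
./my_analysis input.dat         # hypothetical program and input
```

Submit the script with qsub <scriptname> and monitor it with qstat, as described under Torque Client Commands.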

qsub Arguments

• Run an interactive job
  – qsub -I
• Submit a job and immediately hold it
  – qsub -h <jobscript>

Volume of a Molecule

Other Possible New Features

• Web Portal (w/ Templates)
• Remote Visualization
• Data Staging
• NUMA-controlled jobs
• Improved checkpointing