HPC for Mechanical ANSYS
TRANSCRIPT
-
High Performance Computing for Mechanical Simulations using ANSYS
Jeff Beisheim, ANSYS, Inc.
-
HPC Defined
High Performance Computing (HPC) at ANSYS:
An ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.
It is a hardware and software initiative!
-
Need for Speed
Impact product design / Enable large models / Allow parametric studies
Modal / Nonlinear / Multiphysics / Dynamics
Assemblies / CAD to mesh / Capture fidelity
-
A History of HPC Performance
1980s: Vector processing on mainframes
1990: Shared Memory Multiprocessing (SMP) available
1994: Iterative PCG solver introduced for large analyses
1999-2000: 64-bit large memory addressing
2004: 1st company to solve 100M structural DOF
2005-2007: Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
2007-2009: Optimized for multicore processors; teraflop performance at 512 cores
2010: GPU acceleration (single GPU; SMP)
2012: GPU acceleration (multiple GPUs; DMP)
-
HPC Revolution
Recent advancements have revolutionized the computational speed available on the desktop:
Multicore processors: every core is really an independent processor
Large amounts of RAM and SSDs
GPUs
-
Parallel Processing - Hardware
2 types of memory systems:
Shared memory parallel (SMP): single box, workstation/server
Distributed memory parallel (DMP): multiple boxes, cluster
-
Parallel Processing - Software
2 types of parallel processing for Mechanical APDL:
Shared memory parallel (-np > 1): first available in v4.3; can only be used on a single machine
Distributed memory parallel (-dis -np > 1): first available in v6.0 with the DDS solver; can be used on a single machine or cluster
GPU acceleration (-acc): first available in v13.0 using NVIDIA GPUs; supports using either a single GPU or multiple GPUs; can be used on a single machine or cluster
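As an illustrative sketch of how these modes are requested at launch (the executable name and input/output file names here are hypothetical, and exact option spellings should be checked against your release's documentation):

```
! Shared memory parallel (SMP): 4 cores on a single machine
ansys145 -np 4 -b -i model.dat -o model.out

! Distributed memory parallel (DMP): 8 cores, single machine or cluster
ansys145 -dis -np 8 -b -i model.dat -o model.out

! DMP combined with GPU acceleration
ansys145 -dis -np 8 -acc nvidia -b -i model.dat -o model.out
```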
-
Distributed ANSYS Design Requirements
No limitation in simulation capability:
Must support all features
Continually working to add more functionality with each release
Reproducible and consistent results:
Same answers achieved using 1 core or 100 cores
Same quality checks and testing are done as with the SMP version
Uses the same code base as the SMP version of ANSYS
Support all major platforms:
Most widely used processors, operating systems, and interconnects
Supports the same platforms that the SMP version supports
Uses the latest versions of MPI software, which support the latest interconnects
-
Distributed ANSYS Design
Distributed steps (-dis -np N):
At the start of the first load step, decompose the FEA model into N pieces (domains)
Each domain goes to a different core to be solved
The solution is not independent!!
Lots of communication is required to achieve the solution
Lots of synchronization is required to keep all processes together
Each process writes its own set of files (file0*, file1*, file2*, ..., file[N-1]*)
Results are automatically combined at the end of the solution; facilitates postprocessing in /POST1, /POST26, or Workbench
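For instance, a hypothetical -dis -np 4 run would leave one set of files per domain in the working directory (the specific file extensions shown are illustrative; only the file0*...file[N-1]* naming pattern comes from the slide above):

```
file0.esav  file0.full  ...   ! domain 0 (master process)
file1.esav  file1.full  ...   ! domain 1
file2.esav  file2.full  ...   ! domain 2
file3.esav  file3.full  ...   ! domain 3
file.rst                      ! combined results, written at the end of solution
```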
-
Distributed ANSYS Capabilities
A wide variety of features & analysis capabilities are supported:
Static linear or nonlinear analyses
Buckling analyses
Modal analyses
Harmonic response analyses using the FULL method
Transient response analyses using the FULL method
Single-field structural and thermal analyses
Low-frequency electromagnetic analyses
High-frequency electromagnetic analyses
Coupled-field analyses
All widely used element types and materials
Superelements (use pass)
NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, Linear Perturbation
Multiframe restarts
Cyclic symmetry analyses
User Programmable Features (UPFs)
-
Distributed ANSYS Equation Solvers
Sparse direct solver (default):
Supports SMP, DMP, and GPU acceleration
Can handle all analysis types and options
Foundation for Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers
PCG iterative solver:
Supports SMP, DMP, and GPU acceleration
Symmetric, real-valued matrices only (i.e., static/full transient)
Foundation for the PCG Lanczos eigensolver
JCG/ICCG iterative solvers:
Support SMP only
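In an input file, the solver is selected with the EQSLV command; a minimal sketch (the PCG tolerance shown is just an example value):

```
/SOLU
EQSLV,SPARSE        ! sparse direct solver (default); SMP, DMP, GPU
!EQSLV,PCG,1.0E-8   ! PCG iterative solver; symmetric static/full transient
!EQSLV,JCG          ! JCG iterative solver; SMP only
SOLVE
```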
-
Distributed ANSYS Eigensolvers
Block Lanczos eigensolver (including QR damp):
Supports SMP and GPU acceleration
PCG Lanczos eigensolver:
Supports SMP, DMP, and GPU acceleration
Great for large models (>5M DOF) with relatively few modes
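The eigensolver choice follows the same command pattern via MODOPT; a minimal modal-analysis sketch (the mode count of 10 is arbitrary):

```
/SOLU
ANTYPE,MODAL
MODOPT,LANB,10       ! Block Lanczos; SMP and GPU acceleration
!MODOPT,LANPCG,10    ! PCG Lanczos; SMP, DMP, and GPU; large models, few modes
SOLVE
```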
-
Distributed ANSYS Benefits
Better architecture:
More computations performed in parallel, so faster solution time
Better speedups than SMP: can achieve >10x on 16 cores (try getting that with SMP!)
Can be used for jobs running on 1000+ CPU cores
Can take advantage of resources on multiple machines:
Memory usage and bandwidth scale
Disk (I/O) usage scales
A whole new class of problems can be solved!
-
Distributed ANSYS Performance
Need fast interconnects to feed fast processors. Two main characteristics for each interconnect: latency and bandwidth. Distributed ANSYS is highly bandwidth bound.

+--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+
Release: 14.5            Build: UP20120802       Platform: LINUX x64
Date Run: 08/09/2012     Time: 23:07
Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

Total number of cores available     : 32
Number of physical cores available  : 32
Number of cores requested           : 4 (Distributed Memory Parallel)
MPI Type: INTELMPI

Core   Machine Name   Working Directory
----------------------------------------------------
  0    hpclnxsmc00    /data1/ansyswork
  1    hpclnxsmc00    /data1/ansyswork
  2    hpclnxsmc01    /data1/ansyswork
  3    hpclnxsmc01    /data1/ansyswork

Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds

Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)
-
Distributed ANSYS Performance
Need fast interconnects to feed fast processors
[Chart: Interconnect Performance, rating (runs/day) vs. core count (8, 16, 32, 64, 128 cores), Gigabit Ethernet vs. DDR InfiniBand]
Turbine model: 2.1 million DOF, SOLID187 elements, nonlinear static analysis, sparse solver (DMP), Linux cluster (8 cores per node)
-
Distributed ANSYS Performance
Need fast hard drives to feed fast processors; check the bandwidth specs
ANSYS Mechanical can be highly I/O bandwidth bound: the sparse solver in the out-of-core memory mode does lots of I/O
Distributed ANSYS can be highly I/O latency bound: seek time to read/write each set of files causes overhead
Consider SSDs: high bandwidth and extremely low seek times
Consider RAID configurations: RAID 0 for speed; RAID 1, 5 for redundancy; RAID 10 for speed and redundancy
-
Distributed ANSYS Performance
Need fast hard drives to feed fast processors
[Chart: Hard Drive Performance, rating (runs/day) vs. core count (1, 2, 4, 8 cores), HDD vs. SSD]
8 million DOF, linear static analysis, sparse solver (DMP), Dell T5500 workstation (12 Intel Xeon X5675 cores, 48 GB RAM, single 7,200 rpm HDD, single SSD, Win7)
Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, p. 37, 2010.
-
Distributed ANSYS Performance
Avoid waiting for I/O to complete! Check to see if a job is I/O bound or compute bound by checking the output file for CPU and Elapsed times.
When Elapsed time >> main thread CPU time, the job is I/O bound: consider adding more RAM or a faster hard drive configuration.
When Elapsed time ≈ main thread CPU time, the job is compute bound: consider moving the simulation to a machine with faster processors, using Distributed ANSYS (DMP) instead of SMP, or running on more cores or possibly using GPU(s).

Total CPU time for main thread       :      167.8 seconds
. . . . . .
Elapsed Time (sec) =      388.000        Date = 08/21/2012
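Applying the check to the timings shown in this sample output gives a simple worked example:

```
Elapsed time / main-thread CPU time = 388.0 / 167.8 ≈ 2.3
```

Since elapsed time is more than double the main-thread CPU time, this job is I/O bound; more RAM or faster storage would help more than faster processors.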
-
Distributed ANSYS Performance
All runs with sparse solver.
Hardware 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node.
Hardware 12.1+13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node.
ANSYS 12.0 to 14.0 runs with DDR InfiniBand interconnect.
ANSYS 14.0 creep runs with NROPT,CRPL + DDOPT,METIS.

Releases compared: ANSYS 11.0 | 12.0 | 12.1 | 13.0 SP2 | 14.0

Thermal (full model), 3M DOF
  Time:  4 hours | 4 hours | 4 hours | 4 hours | 1 hour | 0.8 hour
  Cores: 8 | 8 | 8 | 8 | 8+1 GPU | 32

Thermomechanical simulation (full model), 7.8M DOF
  Time:       ~5.5 days | 34.3 hours | 12.5 hours | 9.9 hours | 7.5 hours
  Iterations: 163 | 164 | 195 | 195 | 195
  Cores:      8 | 20 | 64 | 64 | 128

Interpolation of boundary conditions
  Time:       37 hours | 37 hours | 37 hours | 0.2 hour | 0.2 hour
  Load steps: 16 | 16 | 16 | improved algorithm | 16

Submodel: creep strain analysis, 5.5M DOF
  Time:       ~5.5 days | 38.5 hours | 8.5 hours | 6.1 hours | 5.9 hours | 4.2 hours
  Iterations: 492 | 492 | 492 | 488 | 498 | 498
  Cores:      18 | 16 | 76 | 128 | 64+8 GPU | 256

Total time: 2 weeks | 5 days | 2 days | 1 day | 0.5 day

Results courtesy of MicroConsult Engineering, GmbH
-
Distributed ANSYS Performance
Minimum time to solution is more important than scaling
[Chart: Solution Scalability, speedup (0 to 25) vs. number of cores (0 to 64)]
Turbine model: 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node)
-
Distributed ANSYS Performance
Minimum time to solution is more important than scaling
[Chart: Solution Scalability, solution elapsed time vs. number of cores (0 to 64); elapsed time drops from 11 hrs 48 mins to 1 hr 20 mins to 30 mins as cores increase]
Turbine model: 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node)
-
GPU Accelerator Capability
Graphics processing units (GPUs):
Widely used for gaming and graphics rendering
Recently been made available as general-purpose accelerators
Support for double-precision computations
Performance exceeding the latest multicore CPUs
So how can ANSYS make use of this new technology to reduce the overall time to solution?
-
GPU Accelerator Capability
Accelerate sparse direct solver (SMP & DMP):
GPU is used to factor many dense frontal matrices
Decision is made automatically on when to send data to the GPU
Frontal matrix too small: too much overhead, stays on CPU
Frontal matrix too large: exceeds GPU memory, only partially accelerated
Accelerate PCG/JCG iterative solvers (SMP & DMP):
GPU is only used for the sparse matrix-vector multiply (SpMV kernel)
Decision is made automatically on when to send data to the GPU
Model too small: too much overhead, stays on CPU
Model too large: exceeds GPU memory, only partially accelerated
-
GPU Accelerator Capability
Supported hardware:
Currently support NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards
Next-generation NVIDIA Tesla cards (Kepler) should work with R14.5
Installing a GPU requires the following:
Larger power supply (single card needs ~250 W)
Open 2x form factor PCIe x16 2.0 (or 3.0) slot
Supported platforms:
Windows and Linux 64-bit platforms only
Does not include the Linux Itanium (IA-64) platform
-
GPU Accelerator Capability
Targeted hardware:

                            Tesla C2075 | Tesla M2090 | Quadro 6000 | Quadro K5000 | Tesla K10 | Tesla K20
Power (W):                  225 | 250 | 225 | 122 | 250 | 250
Memory:                     6 GB | 6 GB | 6 GB | 4 GB | 8 GB | 6 to 24 GB
Memory bandwidth (GB/s):    144 | 177.4 | 144 | 173 | 320 | 288
Peak speed SP/DP (GFlops):  1030/515 | 1331/665 | 1030/515 | 2290/95 | 4577/190 | 5184/1728

These NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.
-
GPU Accelerator Capability
GPUs can offer significantly faster time to solution
[Chart: GPU Performance, relative speedup: 1.0x at 2 cores (no GPU), 2.6x at 8 cores (no GPU), 3.8x at 8 cores (1 GPU)]
6.5 million DOF, linear static analysis, sparse solver (DMP); 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7
-
GPU Accelerator Capability
GPUs can offer significantly faster time to solution
[Chart: GPU Performance, relative speedup: 1.0x at 2 cores (no GPU), 2.7x at 8 cores (1 GPU), 5.2x at 16 cores (4 GPUs)]
11.8 million DOF, linear static analysis, PCG solver (DMP); 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7
-
GPU Accelerator Capability
Supports the majority of ANSYS users:
Covers both sparse direct and PCG iterative solvers
Only a few minor limitations
Ease of use:
Requires at least one supported GPU card to be installed
Requires at least one HPC Pack license
No rebuild, no additional installation steps
Performance:
~10-25% reduction in time to solution when using 8 CPU cores
Should never slow down your simulation!
-
How will you use all of this computing power?
Design optimization studies
Higher fidelity: full assemblies, more nonlinear
-
HPC Licensing
ANSYS HPC Packs enable high-fidelity insight. Each simulation consumes one or more packs. Parallel capability increases quickly with added packs.
A single solution for all physics and any level of fidelity. Flexibility as your HPC resources grow: reallocate packs as resources allow.

Packs per simulation:       1        2        3         4         5
Parallel enabled (cores):   8        32       128       512       2048
                            1 GPU+   4 GPU+   16 GPU+   64 GPU+   256 GPU+
-
HPC Parametric Pack Licensing
Scalable, like ANSYS HPC Packs. Enhances the customer's ability to include many design points as part of a single study.
Ensures sound product decision making.
Amplifies the complete workflow: design points can include execution of multiple products (pre, solve, HPC, post).
Packaged to encourage adoption of the path to robust design!

Number of HPC Parametric Pack licenses:         1   2   3    4    5
Number of simultaneous design points enabled:   4   8   16   32   64
-
HPC Revolution
The right combination of algorithms and hardware leads to maximum efficiency:
SMP vs. DMP
HDD vs. SSDs
Interconnects / Clusters
GPUs
-
HPC Revolution
Every computer today is a parallel computer.
Every simulation in ANSYS can benefit from parallel processing.