HPC for Mechanical ANSYS
TRANSCRIPT
-
High Performance Computing for Mechanical Simulations using ANSYS
Jeff Beisheim, ANSYS, Inc.
-
HPC Defined
High Performance Computing (HPC) at ANSYS:
An ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.
It is a hardware and software initiative!
-
Need for Speed
Impact product design / Enable large models / Allow parametric studies
Modal / Nonlinear / Multiphysics / Dynamics
Assemblies / CAD to mesh / Capture fidelity
-
A History of HPC Performance
1980s: Vector processing on mainframes
1990: Shared Memory Multiprocessing (SMP) available
1994: Iterative PCG solver introduced for large analyses
1999-2000: 64-bit large memory addressing
2004: 1st company to solve 100M structural DOF
2005-2007: Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
2007-2009: Optimized for multicore processors; teraflop performance at 512 cores
2010: GPU acceleration (single GPU; SMP)
2012: GPU acceleration (multiple GPUs; DMP)
-
HPC Revolution
Recent advancements have revolutionized the computational speed available on the desktop:
Multicore processors: every core is really an independent processor
Large amounts of RAM and SSDs
GPUs
-
Parallel Processing - Hardware
2 types of memory systems:
Shared memory parallel (SMP): single box, workstation/server
Distributed memory parallel (DMP): multiple boxes, cluster
-
Parallel Processing - Software
2 types of parallel processing for Mechanical APDL:
Shared memory parallel (-np > 1): first available in v4.3; can only be used on a single machine
Distributed memory parallel (-dis -np > 1): first available in v6.0 with the DDS solver; can be used on a single machine or cluster
GPU acceleration (-acc): first available in v13.0 using NVIDIA GPUs; supports using either a single GPU or multiple GPUs; can be used on a single machine or cluster
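As an illustrative sketch of how these modes are requested at launch (the executable name and input/output file names here are hypothetical, and exact option spellings should be checked against your release's documentation):

```
! Shared memory parallel (SMP): 4 cores on a single machine
ansys145 -np 4 -b -i model.dat -o model.out

! Distributed memory parallel (DMP): 8 cores, single machine or cluster
ansys145 -dis -np 8 -b -i model.dat -o model.out

! DMP combined with GPU acceleration
ansys145 -dis -np 8 -acc nvidia -b -i model.dat -o model.out
```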
-
Distributed ANSYS Design Requirements
No limitation in simulation capability:
Must support all features
Continually working to add more functionality with each release
Reproducible and consistent results:
Same answers achieved using 1 core or 100 cores
Same quality checks and testing are done as with the SMP version
Uses the same code base as the SMP version of ANSYS
Support all major platforms:
Most widely used processors, operating systems, and interconnects
Supports the same platforms that the SMP version supports
Uses the latest versions of MPI software, which support the latest interconnects
-
Distributed ANSYS Design
Distributed steps (-dis -np N):
At the start of the first load step, decompose the FEA model into N pieces (domains)
Each domain goes to a different core to be solved
The solution is not independent!!
Lots of communication is required to achieve the solution
Lots of synchronization is required to keep all processes together
Each process writes its own set of files (file0*, file1*, file2*, ..., file[N-1]*)
Results are automatically combined at the end of the solution; facilitates postprocessing in /POST1, /POST26, or Workbench
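For instance, a hypothetical -dis -np 4 run would leave one set of files per domain in the working directory (the specific file extensions shown are illustrative; only the file0*...file[N-1]* naming pattern comes from the slide above):

```
file0.esav  file0.full  ...   ! domain 0 (master process)
file1.esav  file1.full  ...   ! domain 1
file2.esav  file2.full  ...   ! domain 2
file3.esav  file3.full  ...   ! domain 3
file.rst                      ! combined results, written at the end of solution
```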
-
Distributed ANSYS Capabilities
A wide variety of features & analysis capabilities are supported:
Static linear or nonlinear analyses
Buckling analyses
Modal analyses
Harmonic response analyses using the FULL method
Transient response analyses using the FULL method
Single-field structural and thermal analyses
Low-frequency electromagnetic analyses
High-frequency electromagnetic analyses
Coupled-field analyses
All widely used element types and materials
Superelements (use pass)
NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, Linear Perturbation
Multiframe restarts
Cyclic symmetry analyses
User Programmable Features (UPFs)
-
Distributed ANSYS Equation Solvers
Sparse direct solver (default):
Supports SMP, DMP, and GPU acceleration
Can handle all analysis types and options
Foundation for Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers
PCG iterative solver:
Supports SMP, DMP, and GPU acceleration
Symmetric, real-valued matrices only (i.e., static/full transient)
Foundation for the PCG Lanczos eigensolver
JCG/ICCG iterative solvers:
Support SMP only
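In an input file, the solver is selected with the EQSLV command; a minimal sketch (the PCG tolerance shown is just an example value):

```
/SOLU
EQSLV,SPARSE        ! sparse direct solver (default); SMP, DMP, GPU
!EQSLV,PCG,1.0E-8   ! PCG iterative solver; symmetric static/full transient
!EQSLV,JCG          ! JCG iterative solver; SMP only
SOLVE
```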
-
Distributed ANSYS Eigensolvers
Block Lanczos eigensolver (including QR damp):
Supports SMP and GPU acceleration
PCG Lanczos eigensolver:
Supports SMP, DMP, and GPU acceleration
Great for large models (>5M DOF) with relatively few modes
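The eigensolver choice follows the same command pattern via MODOPT; a minimal modal-analysis sketch (the mode count of 10 is arbitrary):

```
/SOLU
ANTYPE,MODAL
MODOPT,LANB,10       ! Block Lanczos; SMP and GPU acceleration
!MODOPT,LANPCG,10    ! PCG Lanczos; SMP, DMP, and GPU; large models, few modes
SOLVE
```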
-
Distributed ANSYS Benefits
Better architecture:
More computations performed in parallel, so faster solution time
Better speedups than SMP: can achieve >10x on 16 cores (try getting that with SMP!)
Can be used for jobs running on 1000+ CPU cores
Can take advantage of resources on multiple machines:
Memory usage and bandwidth scale
Disk (I/O) usage scales
A whole new class of problems can be solved!
-
Distributed ANSYS Performance
Need fast interconnects to feed fast processors. Two main characteristics for each interconnect: latency and bandwidth. Distributed ANSYS is highly bandwidth bound.

+--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+
Release: 14.5            Build: UP20120802       Platform: LINUX x64
Date Run: 08/09/2012     Time: 23:07
Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

Total number of cores available     : 32
Number of physical cores available  : 32
Number of cores requested           : 4 (Distributed Memory Parallel)
MPI Type: INTELMPI

Core   Machine Name   Working Directory
----------------------------------------------------
  0    hpclnxsmc00    /data1/ansyswork
  1    hpclnxsmc00    /data1/ansyswork
  2    hpclnxsmc01    /data1/ansyswork
  3    hpclnxsmc01    /data1/ansyswork

Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds

Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)
-
Distributed ANSYS Performance
Need fast interconnects to feed fast processors
[Chart: Interconnect Performance, rating (runs/day) vs. core count (8, 16, 32, 64, 128 cores), Gigabit Ethernet vs. DDR InfiniBand]
Turbine model: 2.1 million DOF, SOLID187 elements, nonlinear static analysis, sparse solver (DMP), Linux cluster (8 cores per node)
-
Distributed ANSYS Performance
Need fast hard drives to feed fast processors; check the bandwidth specs
ANSYS Mechanical can be highly I/O bandwidth bound: the sparse solver in the out-of-core memory mode does lots of I/O
Distributed ANSYS can be highly I/O latency bound: seek time to read/write each set of files causes overhead
Consider SSDs: high bandwidth and extremely low seek times
Consider RAID configurations: RAID 0 for speed; RAID 1, 5 for redundancy; RAID 10 for speed and redundancy
-
Distributed ANSYS Performance
Need fast hard drives to feed fast processors
[Chart: Hard Drive Performance, rating (runs/day) vs. core count (1, 2, 4, 8 cores), HDD vs. SSD]
8 million DOF, linear static analysis, sparse solver (DMP), Dell T5500 workstation (12 Intel Xeon X5675 cores, 48 GB RAM, single 7,200 rpm HDD, single SSD, Win7)
Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, p. 37, 2010.
-
Distributed ANSYS Performance
Avoid waiting for I/O to complete! Check to see if a job is I/O bound or compute bound by checking the output file for CPU and Elapsed times.
When Elapsed time >> main thread CPU time, the job is I/O bound: consider adding more RAM or a faster hard drive configuration.
When Elapsed time ≈ main thread CPU time, the job is compute bound: consider moving the simulation to a machine with faster processors, using Distributed ANSYS (DMP) instead of SMP, or running on more cores or possibly using GPU(s).

Total CPU time for main thread       :      167.8 seconds
. . . . . .
Elapsed Time (sec) =      388.000        Date = 08/21/2012
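Applying the check to the timings shown in this sample output gives a simple worked example:

```
Elapsed time / main-thread CPU time = 388.0 / 167.8 ≈ 2.3
```

Since elapsed time is more than double the main-thread CPU time, this job is I/O bound; more RAM or faster storage would help more than faster processors.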
-
Distributed ANSYS Performance
All runs with sparse solver.
Hardware 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node.
Hardware 12.1+13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node.
ANSYS 12.0 to 14.0 runs with DDR InfiniBand interconnect.
ANSYS 14.0 creep runs with NROPT,CRPL + DDOPT,METIS.

Releases compared: ANSYS 11.0 | 12.0 | 12.1 | 13.0 SP2 | 14.0

Thermal (full model), 3M DOF
  Time:  4 hours | 4 hours | 4 hours | 4 hours | 1 hour | 0.8 hour
  Cores: 8 | 8 | 8 | 8 | 8+1 GPU | 32

Thermomechanical simulation (full model), 7.8M DOF
  Time:       ~5.5 days | 34.3 hours | 12.5 hours | 9.9 hours | 7.5 hours
  Iterations: 163 | 164 | 195 | 195 | 195
  Cores:      8 | 20 | 64 | 64 | 128

Interpolation of boundary conditions
  Time:       37 hours | 37 hours | 37 hours | 0.2 hour | 0.2 hour
  Load steps: 16 | 16 | 16 | improved algorithm | 16

Submodel: creep strain analysis, 5.5M DOF
  Time:       ~5.5 days | 38.5 hours | 8.5 hours | 6.1 hours | 5.9 hours | 4.2 hours
  Iterations: 492 | 492 | 492 | 488 | 498 | 498
  Cores:      18 | 16 | 76 | 128 | 64+8 GPU | 256

Total time: 2 weeks | 5 days | 2 days | 1 day | 0.5 day

Results courtesy of MicroConsult Engineering, GmbH
-
Distributed ANSYS Performance
Minimum time to solution is more important than scaling
[Chart: Solution Scalability, speedup (0 to 25) vs. number of cores (0 to 64)]
Turbine model: 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node)
-
Distributed ANSYS Performance
Minimum time to solution is more important than scaling
[Chart: Solution Scalability, solution elapsed time vs. number of cores (0 to 64); elapsed time drops from 11 hrs 48 mins to 1 hr 20 mins to 30 mins as cores increase]
Turbine model: 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node)
-
GPU Accelerator Capability
Graphics processing units (GPUs):
Widely used for gaming and graphics rendering
Recently been made available as general-purpose accelerators
Support for double-precision computations
Performance exceeding the latest multicore CPUs
So how can ANSYS make use of this new technology to reduce the overall time to solution?
-
GPU Accelerator Capability
Accelerate sparse direct solver (SMP & DMP):
GPU is used to factor many dense frontal matrices
Decision is made automatically on when to send data to the GPU
Frontal matrix too small: too much overhead, stays on CPU
Frontal matrix too large: exceeds GPU memory, only partially accelerated
Accelerate PCG/JCG iterative solvers (SMP & DMP):
GPU is only used for the sparse matrix-vector multiply (SpMV kernel)
Decision is made automatically on when to send data to the GPU
Model too small: too much overhead, stays on CPU
Model too large: exceeds GPU memory, only partially accelerated
-
GPU Accelerator Capability
Supported hardware:
Currently support NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards
Next-generation NVIDIA Tesla cards (Kepler) should work with R14.5
Installing a GPU requires the following:
Larger power supply (single card needs ~250 W)
Open 2x form factor PCIe x16 2.0 (or 3.0) slot
Supported platforms:
Windows and Linux 64-bit platforms only
Does not include the Linux Itanium (IA-64) platform
-
GPU Accelerator Capability
Targeted hardware:

                            Tesla C2075 | Tesla M2090 | Quadro 6000 | Quadro K5000 | Tesla K10 | Tesla K20
Power (W):                  225 | 250 | 225 | 122 | 250 | 250
Memory:                     6 GB | 6 GB | 6 GB | 4 GB | 8 GB | 6 to 24 GB
Memory bandwidth (GB/s):    144 | 177.4 | 144 | 173 | 320 | 288
Peak speed SP/DP (GFlops):  1030/515 | 1331/665 | 1030/515 | 2290/95 | 4577/190 | 5184/1728

These NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.
-
GPU Accelerator Capability
GPUs can offer significantly faster time to solution
[Chart: GPU Performance, relative speedup: 1.0x at 2 cores (no GPU), 2.6x at 8 cores (no GPU), 3.8x at 8 cores (1 GPU)]
6.5 million DOF, linear static analysis, sparse solver (DMP); 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7
-
GPU Accelerator Capability
GPUs can offer significantly faster time to solution
[Chart: GPU Performance, relative speedup: 1.0x at 2 cores (no GPU), 2.7x at 8 cores (1 GPU), 5.2x at 16 cores (4 GPUs)]
11.8 million DOF, linear static analysis, PCG solver (DMP); 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7
-
GPU Accelerator Capability
Supports the majority of ANSYS users:
Covers both sparse direct and PCG iterative solvers
Only a few minor limitations
Ease of use:
Requires at least one supported GPU card to be installed
Requires at least one HPC Pack license
No rebuild, no additional installation steps
Performance:
~10-25% reduction in time to solution when using 8 CPU cores
Should never slow down your simulation!
-
How will you use all of this computing power?
Design optimization studies
Higher fidelity: full assemblies, more nonlinear
-
HPC Licensing
ANSYS HPC Packs enable high-fidelity insight. Each simulation consumes one or more packs. Parallel capability increases quickly with added packs.
A single solution for all physics and any level of fidelity. Flexibility as your HPC resources grow: reallocate packs as resources allow.

Packs per simulation:       1        2        3         4         5
Parallel enabled (cores):   8        32       128       512       2048
                            1 GPU+   4 GPU+   16 GPU+   64 GPU+   256 GPU+
-
HPC Parametric Pack Licensing
Scalable, like ANSYS HPC Packs. Enhances the customer's ability to include many design points as part of a single study.
Ensures sound product decision making.
Amplifies the complete workflow: design points can include execution of multiple products (pre, solve, HPC, post).
Packaged to encourage adoption of the path to robust design!

Number of HPC Parametric Pack licenses:         1   2   3    4    5
Number of simultaneous design points enabled:   4   8   16   32   64
-
HPC Revolution
The right combination of algorithms and hardware leads to maximum efficiency:
SMP vs. DMP
HDD vs. SSDs
Interconnects / Clusters
GPUs
-
HPC Revolution
Every computer today is a parallel computer.
Every simulation in ANSYS can benefit from parallel processing.