Photos placed in horizontal position with even amount
of white spacebetween photos
and header
Photos placed in horizontal position
with even amount of white space
between photos and header
Sandia National Laboratories is a multi mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
RoshanQuadros,BrianCarnes,MadisonBrewer,andByronHanks
DOECentersofExcellencePerformancePortabilityMeetingAug22-24,2017Denver
ScalingPost-MeshingOperationsonNextGenerationPlatforms
SAND No: SAND2017-8825 PE
Parallel decomposition,refinement, smoothing, & projection
CUBIT
Percept
CUBITandPerceptcombinedworkflowforgeneratinglargemeshes
Initial Mesh Generation: advanced meshing
algorithms for Tri, Quad, Tet, and Hex mesh
generation
CUBITprovidesextensivecapabilitiesforpreparinggeometryandgeneratinganinitialmeshhttps://cubit.sandia.gov
Post-MeshingOperations:Refinement->Projection->Smoothing
Mesh refinement workflow:• Generate refined meshes in
memory from an existing mesh
• Project new boundary nodes onto geometry
• Smooth interior mesh nodes to improve mesh quality
Supported mesh types:• block-structured• unstructured• hybrid
Post-MeshingOperator#1:Projection
GeometryKernel:OpenNURBS
§ OpenSourcefromwww.rhino3d.com
§ Lightweight&easytoPort
§ Queryoperationsarethreadsafe
§ Supportsvariouscurves&surfacedefinitions
ParallelKernel:ProjectPointsonaNURBSSurface
ProgrammingModel:MPI+Kokkos (OpenMP)Threelevelsofparallelismisrequired:
1) DistributedmemoryparallelismviaMPI
2) SharedmemorythreadlevelparallelismontheMICdeviceusingKokkoswithOpenMP runtime
3) VectorizationforVectorProcessingUnit(VPU)
Hardware: Trinity testbed containing 72 core KNL
Image courtesy of http://www.hotchips.org
ProgrammingModel:MPI+Kokkos (OpenMP){ // MPI distributes data to n processes
ON_Surface *surface;ON_3dPoint *buff_p;MPI_Comm_size( MPI_COMM_WORLD, &numtasks);…if ( rank == 0 ){
for( int r=1; r < numtasks; r++){…ierr = MPI_Send ( p_start, num_pnts*3, MPI_DOUBLE, r, Tag, MPI_COMM_WORLD);
}} else{
ierr = MPI_Recv ( buff_p, num_pnts*3, MPI_DOUBLE, MPI_ANY_SOURCE, Tag, MPI_COMM_WORLD, &status );
}projection_method( surface, buff_p, num_pnts );
}void projection_method( const ON_Surface *surface, ON_3dPoint *buff_p, const int num_pnts ){
…// Kokkos handles thread level parallelismKokkos::parallel_for ( num_pnts, KOKKOS_LAMBDA(const int i ){
...// OpenNURBS API for projecting a point on a surface double u, v;surface->GetClosestPoint( buff_p[i], &u, &v );ON_3dPoint projected_pt = surface->PointAt( u, v );
});}
PointProjectionScalingResultsonKNL
933.071108.85
10
20
40
80
160
320
640
1 10 100 1000
Time(sec)
Threadsperrank
mpi_1 mpi_2 mpi_4 mpi_8 mpi_16 mpi_32 mpi_64
Speedup (MPI 32 ranks) = 28X Speedup (MPI + Kokkos) = 77X
ThreadAffinityonKNL
1
10
100
1000
1 10 100 1000Threads per KNL
Tim
e (s
ec)
Blue - ScatterRed - Compact
Post-meshingOperator#2:RefinementStructured grid refinement was a relatively simple process§ Removed pointers and references to main memory objects§ Replaced structured grid data structures with Kokkos views§ Algorithm :
§ Allocate new mesh (view)§ For each block (in serial)
§ For each element in old mesh (in parallel)§ Interpolate coordinates for new mesh§ Transfer existing node coordinates
ScalabilityofRefinement
Sequence of meshes• 0.5M, 4M, 33M elements• multiple blocks (12)
Compare• serial• GPU• threading (OpenMP)
Better scalability with increasing problem size
Blocks were refined sequentially
Post-MeshingOperator#3:SmoothingSmoothingstructuredgridswasmorecomplexprocess§ computeglobalqualitymetricandgradient§ nonlinearCGoptimizationwithlinesearch§ communicationbetweenstructuredblocks
(gradients)Example:smoothingalargecubewithinitialrandomlyperturbednodes
Untangling: invalid elements Smoothing: metric/gradientUntangling: metric/gradient
§ Abstractmeshinterfaceforbothstructuredandunstructured§ highcostofkernelcalls§ sub-optimalinterfacetostructuredgrid§ functionsnotsafeforthreadsorGPU
§ Suggestionforabstractinterfaces:§ designedwithsharedmemory(Kokkos)fromstart§ otherwise,optforspecializedinterfaces.
SmoothingPitfall:AbstractInterfaces
StructuredMesh
UnstructuredMesh
Smoother<MeshBase>• dependsonmany
virtualfunctionsinMeshType class
MeshBase
§ Totalmetric(longdouble)wassumofindividualmetrics(double)§ ProblematicforGPUbuildsasCUDAonlyusesupto64bits(double
precision)forfloatingpointrepresentation.§ Causesillegalmemoryaccess:
§ cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered
§ CertainSTLclassesandfunctionsareproblematiconGPUs§ array,unorderedmap,vector,set,…
§ array=>Kokkos::Array§ unorderedmap =>Kokkos::UnorderedMap
§ std::maxwasrewrittenlocally
SmoothingPitfall:HardwareDifferences
§ MemorylayoutinitiallycausedverypoorperformanceontheGPU§ Smoothingtestcase:perturbedcubefollowedbymeshsmoothing
SmoothingPitfall:MemoryLayout
FutureWork§ Projection:
• StudyhighwatermarkofmemoryusagefordifferentcombinationofMPIranksandThreadsperrank
§ UnstructuredRefinement:• Morecomplicatedalgorithmthanstructured• Example:determinenumberofnewnodes
• useKokkosmaptostoreneedednodes• valuesstoredforeverymeshedge/face• mapalsousedtointerpolatenewcoordinates
§ Smoothing:• Investigateothersmoothingalgorithms(ellipticsmoother)
ThankYou