System Software for Armv8-A with SVE
Yutaka Ishikawa, Leader of FLAGSHIP2020 Project
RIKEN Center for Computational Science
9:00–9:25, 14th of January, 2019
Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China
Background: Flagship2020
2019/1/14
• Missions
  • Building the Japanese national flagship supercomputer, post-K, and
  • Developing a wide range of HPC applications, running on post-K, in order to solve social and scientific issues in Japan
• Project organization
  • Post-K computer development
    • RIKEN AICS is in charge of development
    • Fujitsu is the vendor partner
    • International collaborations: DOE, CEA, JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
  • Applications
    • The government selected 9 social & scientific priority issues and 4 exploratory issues, together with their R&D organizations.
RIKEN Center for Computational Science
Target Applications
Program: Brief description
① GENESIS: MD for proteins
② Genomon: Genome processing (genome alignment)
③ GAMERA: Earthquake simulator (FEM in unstructured & structured grid)
④ NICAM+LETKF: Weather prediction system using big data (structured grid stencil & ensemble Kalman filter)
⑤ NTChem: Molecular electronic structure calculation
⑥ FFB: Large Eddy Simulation (unstructured grid)
⑦ RSDFT: An ab-initio program (density functional theory)
⑧ Adventure: Computational mechanics system for large-scale analysis and design (unstructured grid)
⑨ CCS-QCD: Lattice QCD simulation (structured grid Monte Carlo)
Courtesy of FUJITSU LIMITED
Background: Post-K CPU A64FX
Architecture: Armv8.2-A SVE (512-bit SIMD)
Core: 48 cores for compute and 2/4 for OS activities; DP: 2.7+ TF, SP: 5.4+ TF, HP: 10.8 TF
Cache: L1D: 64 KiB, 4-way, 230 GB/s (load), 115 GB/s (store); L2: 8 MiB, 16-way, 115 GB/s (load), 57 GB/s (store)
Memory: HBM2 32 GiB, 1024 GB/s
Interconnect: TofuD (28 Gbps x 2 lanes x 10 ports)
I/O: PCIe Gen3 x 16 lanes
Technology: 7 nm FinFET
Performance: Stream triad: 830+ GB/s; Dgemm: 2.5+ TF (90+% efficiency)
Ref.: Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.
CMG: CPU Memory Group; NOC: Network On Chip
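The ratios implied by the figures on this slide can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope calculation only: it treats the quoted "+" values as plain numbers, the function names are illustrative, and the TofuD figure is the raw product of the quoted link parameters, ignoring any encoding or protocol overhead.

```c
#include <stdio.h>

/* Number of SIMD lanes: vector width divided by element width (both in bits). */
unsigned simd_lanes(unsigned vector_bits, unsigned element_bits)
{
    return vector_bits / element_bits;
}

/* Ratio of a measured (or quoted) value to its peak, e.g. DGEMM TF / peak TF. */
double fraction_of_peak(double measured, double peak)
{
    return measured / peak;
}

/* Machine balance in bytes per flop: memory GB/s divided by peak GFLOP/s. */
double bytes_per_flop(double mem_gb_per_s, double peak_gflops)
{
    return mem_gb_per_s / peak_gflops;
}

/* Raw aggregate link rate in Gbps: per-lane rate x lanes x ports. */
double raw_link_gbps(double gbps_per_lane, unsigned lanes, unsigned ports)
{
    return gbps_per_lane * lanes * ports;
}

/* Print the ratios for the values quoted on the slide. */
void print_a64fx_ratios(void)
{
    printf("DP/SP/HP lanes per 512-bit vector: %u/%u/%u\n",
           simd_lanes(512, 64), simd_lanes(512, 32), simd_lanes(512, 16));
    printf("DGEMM fraction of DP peak: %.3f\n", fraction_of_peak(2.5, 2.7));
    printf("Stream triad fraction of HBM2 bandwidth: %.3f\n",
           fraction_of_peak(830.0, 1024.0));
    printf("Bytes per DP flop: %.3f\n", bytes_per_flop(1024.0, 2700.0));
    printf("TofuD raw link rate: %.0f Gbps\n", raw_link_gbps(28.0, 2, 10));
}
```

The 8/16/32 lane counts match the 1:2:4 ratio of the quoted DP, SP and HP peaks, and the DGEMM ratio of roughly 0.93 is consistent with the "90+% efficiency" stated on the slide.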
Background: An Overview of Post-K Hardware
● Compute Node, Compute + I/O Node connected by 6D mesh/torus Interconnect
● 3-level hierarchical storage system
● 1st Layer
● Cache for global file system
● Temporary file systems
- Local file system for compute node
- Shared file system for a job
● 2nd Layer
● Lustre-based global file system
● 3rd Layer
● Storage for archive
An Overview of System Software Stack
Ease of use is one of our KPIs (Key Performance Indicators)
Providing a wide range of applications/tools/libraries/compilers
[Figure: system software stack]
● Linux Distribution Eco-System
● Parallel Programming Environments: XMP, FDPS, …
● Fortran, C/C++, OpenMP, Java, …
● Math libraries
● Tuning and Debugging Tools
● Batch Job System
● Hierarchical File System
● Parallel File System
● Application-oriented File I/O
● Communication: MPI, Low Level Communication
● File I/O for Hierarchical Storage: LLIO
● Process/Thread: PIP
● Multi-Kernel System: Linux and light-weight kernel (McKernel)
● Armv8 + SVE
Post-K Programming Environment
● Programming Languages and Compilers provided by Fujitsu
  ● Fortran2008 & Fortran2018 subset
  ● C11 & GNU and Clang extensions
  ● C++14 & C++17 subset and GNU and Clang extensions
  ● OpenMP 4.5 & OpenMP 5.0 subset
  ● Java
  GCC, LLVM, and Arm compiler will also be available
● Parallel Programming Language & Domain Specific Library provided by RIKEN
  ● XcalableMP
  ● FDPS (Framework for Developing Particle Simulator)
● Process/Thread Library provided by RIKEN
  ● PiP (Process in Process)
● Script Languages provided by Linux distributor
  ● E.g., Python+NumPy, SciPy
● Communication Libraries
  ● MPI 3.1 & MPI 4.0 subset
    - Open MPI base (Fujitsu), MPICH (RIKEN)
  ● Low-level Communication Libraries
    - uTofu (Fujitsu), LLC (RIKEN)
● File I/O Libraries provided by RIKEN
  ● pnetCDF, DTF, FTAR
● Math Libraries
  ● BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
  ● EigenEXA, Batched BLAS (RIKEN)
● Programming Tools provided by Fujitsu
  ● Profiler, Debugger, GUI
Scalable also runs on Oakforest-PACS, which is operated by the University of Tsukuba and the University of Tokyo.
Open Source Management Tools
● EasyBuild
  ● Used at CEA
  ● RIKEN is evaluating it. As an example, CAFFE, a deep learning tool, has been ported to an Arm machine using EasyBuild
  ● CAFFE consists of several open-source packages:
    - boost, blas, cmake, gflags, google (glog, googletest, snappy, leveldb, protobuf), lmdb, opencv
● Spack
  ● Used in the ECP project
  ● RIKEN is also evaluating Spack.
IHK/McKernel developed at RIKEN
● Partition resources (CPU cores, memory)
  ● Full Linux kernel on some cores
    - System daemons and in-situ non-HPC applications
    - Device drivers
  ● Light-weight kernel (LWK), McKernel, on other cores
    - HPC applications
● IHK: Linux kernel module
  ● Allows dynamic partitioning of node resources: CPU cores, physical memory, …
  ● Enables management of LWKs (assign resources, load, boot, destroy, etc.)
  ● Provides inter-kernel communication, messaging and notification
● McKernel: Light-weight kernel
  ● Is designed for HPC: noiseless and simple
  ● Implements only performance-sensitive system calls, e.g., process and memory management; the rest are offloaded to Linux
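The split between system calls handled in the light-weight kernel and those offloaded to Linux can be pictured as a simple lookup. The following is a minimal sketch under stated assumptions, not McKernel's actual implementation: the call identifiers, the `dispatch` function, and the particular choice of which calls stay local are all hypothetical and only follow the slide's "process and memory management" examples.

```c
#include <stdbool.h>

/* Illustrative call classes; this is not McKernel's real system call table. */
enum call_id {
    CALL_MMAP, CALL_MUNMAP, CALL_CLONE, CALL_FUTEX,   /* memory, process/thread */
    CALL_OPEN, CALL_READ, CALL_WRITE, CALL_IOCTL,     /* file and device I/O    */
    CALL_COUNT
};

enum handler { HANDLE_IN_LWK, OFFLOAD_TO_LINUX };

/* Assumed split: memory and process/thread management stay in the LWK.
 * Entries not listed default to false, i.e. they are offloaded. */
static const bool handled_locally[CALL_COUNT] = {
    [CALL_MMAP]  = true,
    [CALL_MUNMAP] = true,
    [CALL_CLONE] = true,
    [CALL_FUTEX] = true,
};

/* Decide where a call is served; unknown calls are offloaded to Linux. */
enum handler dispatch(int id)
{
    if (id < 0 || id >= CALL_COUNT)
        return OFFLOAD_TO_LINUX;
    return handled_locally[id] ? HANDLE_IN_LWK : OFFLOAD_TO_LINUX;
}
```

Defaulting unknown calls to the offload path mirrors the idea that the LWK stays small while Linux provides everything else.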
[Figure: cores of one node split into two partitions. The Linux partition runs system daemons and an in-situ non-HPC application on top of the full kernel (memory, general scheduler, complex memory management, TCP stack, device drivers, VFS, file system drivers, interrupt handling, …). The McKernel partition runs HPC applications on a thin LWK (very simple memory management, process/thread management). Both expose the Linux API (glibc, /sys/, /proc/).]
• IHK/McKernel runs on
  • Intel Xeon and Xeon Phi
  • Fujitsu FX10 and FX100 (experiments)
• IHK: Interface for Heterogeneous Kernels
• Executes the same Linux binary without any recompilation
How to deploy IHK/McKernel
• The Linux kernel with the IHK kernel module is resident
  – Daemons for the job scheduler, etc. run on Linux
• McKernel is dynamically reloaded (rebooted) by IHK for each application
• No hardware reboot
[Figure: timeline on one node. App A, requiring LWK-without-scheduler, is invoked and finishes; App B, requiring LWK-with-scheduler, is invoked and finishes; App C, using full Linux capability, is invoked and finishes.]
miniFE (CORAL benchmark suite)
● Conjugate gradient, strong scaling
● Up to 3.5X improvement (Linux falls over...)
Oakforest-PACS supercomputer, 25 PF in peak, at JCAHPC organized by U. of Tsukuba and U. of Tokyo
Results using the same binary
Balazs Gerofi, Rolf Riesen, Robert W. Wisniewski and Yutaka Ishikawa: “Toward Full Specialization of the HPC System Software Stack: Reconciling Application Containers and Lightweight Multi-kernels”, International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2017
Support of Software Development/Porting for Post-K
[Figure: schedule from CY2017 to CY2021. Project phases: Design and Implementation; Manufacturing; Installation and Tuning; Operation. Support items: Specification; Optimization Guidebook; RIKEN Performance Evaluation Environment; Early Access Program. Labels: Armv8-A + SVE overview; detailed hardware info.; publishing incrementally; performance estimation tool using FX100; RIKEN Simulator.]
• CY2018 Q2: Optimization guidebook is incrementally published
• CY2020 Q2: Early access program starts
• CY2021 Q1/Q2: General operation starts
Contribution to Arm HPC (Armv8-A SVE) Ecosystem
Concluding Remarks
https://postk-web.r-ccs.riken.jp/faq.html
MPI Communication implemented using Tofu2 and TofuD
● Tofu2 and TofuD offloading mechanism
● Send commands (PUT, GET, NOP) are posted to a command queue, and the Tofu network interface processes the posted commands.
● Tofu2 has two packet processing modes: Normal Mode and Session Mode. In Session Mode, a special register called the Scheduling Pointer plays an important role.
● Scheduling Pointer: Commands enqueued in the command queue are processed until the entry pointed to by the Scheduling Pointer is reached. The Scheduling Pointer is updated by a packet sent by a remote node.
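The Scheduling Pointer behaviour described above can be illustrated with a toy model. This is a sketch under stated assumptions rather than the Tofu2 hardware interface: the structure, field names and functions are hypothetical, the queue is a fixed-size array rather than a ring, and "processing" a command only advances a counter.

```c
#include <stddef.h>

enum cmd_type { CMD_PUT, CMD_GET, CMD_NOP };

#define QUEUE_CAPACITY 16

/* Toy model of a command queue in Session Mode (all names are hypothetical). */
struct cmd_queue {
    enum cmd_type entries[QUEUE_CAPACITY];
    size_t tail;       /* next free slot, advanced when a command is posted */
    size_t head;       /* next command the network interface will process  */
    size_t sched_ptr;  /* processing stops when head reaches this entry     */
};

/* Post a command; returns 0 on success, -1 if the queue is full. */
int post_command(struct cmd_queue *q, enum cmd_type type)
{
    if (q->tail >= QUEUE_CAPACITY)
        return -1;
    q->entries[q->tail++] = type;
    return 0;
}

/* A packet from a remote node updates the Scheduling Pointer. */
void on_remote_packet(struct cmd_queue *q, size_t new_sched_ptr)
{
    q->sched_ptr = new_sched_ptr;
}

/* Process posted commands up to, but not including, the entry the
 * Scheduling Pointer refers to. Returns how many were processed. */
size_t process_commands(struct cmd_queue *q)
{
    size_t processed = 0;
    while (q->head < q->tail && q->head < q->sched_ptr) {
        q->head++;     /* a real interface would issue the PUT/GET here */
        processed++;
    }
    return processed;
}
```

In this model the local CPU only posts commands in advance; progress is then driven by remote packets moving the pointer, which is the property that lets a collective advance without CPU involvement.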
Evaluation: Latency
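/* Create the persistent neighborhood all-to-all once (MPI 4.0 argument order) */
MPI_Neighbor_alltoall_init(sbuf, count, MPI_DOUBLE, rbuf, count, MPI_DOUBLE, comm, MPI_INFO_NULL, &req);
for (i = 0; …….) {
    /* Computation */
    MPI_Start(&req);        /* start the offloaded communication */
    /* Computation */
    MPI_Wait(&req, &stat);  /* complete this iteration's communication */
}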
[Figure: latency [us] versus message size [bytes] for two versions: Tofu2 Offload (direct transfers between user buffers, completely asynchronous progression) and Persistent pt2pt. (≒ non-blocking pt2pt.)]
• The offload version is faster.
• Unlike the point-to-point version, the offload version does not need CPU cycles for communication progress. Thus the offload version achieves overlap of computation and communication.
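Why overlap matters can be seen from an idealized timing model. The sketch below is an assumption for illustration, not a model taken from the measurements: it supposes that without offload the CPU serializes computation and communication, and that with full offload the two proceed concurrently. The function names are hypothetical.

```c
/* Time per iteration when the CPU must drive communication itself:
 * computation and communication are serialized. */
double step_time_serialized(double compute_us, double comm_us)
{
    return compute_us + comm_us;
}

/* Time per iteration when communication is fully offloaded and overlaps
 * with computation: the longer of the two phases dominates. */
double step_time_overlapped(double compute_us, double comm_us)
{
    return compute_us > comm_us ? compute_us : comm_us;
}
```

For example, with made-up values of 100 us of computation and 40 us of communication per iteration, the serialized model gives 140 us and the overlapped model 100 us; real gains depend on how completely the network interface can progress the transfer on its own.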
• Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Offloaded MPI persistent collectives using persistent generalized request interface,” Proceedings of the 24th European MPI Users' Group Meeting (EuroMPI2017), ACM, 2017.
• Yoshiyuki Morie, Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Prototyping of Offloaded Persistent Broadcast on Tofu2 Interconnect,” SC17, 2017 (poster)
• Yoshiyuki Morie, Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Evaluation of Intra Node of Persistent Collective Communication using NIC Offloading,” SWOPP'18, HPC165, 2018. (In Japanese)
OSS Survey (9 priority issues developers)
● Application
  ● MODYLAS, USQCD, OpenFOAM
● Library
  ● Numpy, Scipy, pysam, FFTW, LAPACK95, lapack, blas, Metis, ParMetis, HDF5, NetCDF, NetCDF-fortran, PnetCDF, scalasca, SCOTCH, Zoltan, openmpi1.8, openmpi1.10, mpich2-1.4.1, boost, FFTE, PETSc/SLEPc, Elemental, BWA, Star, Blat, TopHat, TopHat2, MapSplice2, MPDyn2, ELPA, Trilinos, Eigen3, mesa, MesaGLUT, libxml2, C-LIME, EigenExa
● Tool/Visualization tool
  ● git, git-flow, gnuplot, Paraview, VisIT, ImageMagick, svn, Samtools, bedtools, Biobambam, Picard, GMT, GrADS, HDF-EOS, wgrib, GRIB API, Climate Data Operators
● Build tool
  ● cmake, gnu Autotools, automake, autoconf, gcc, gfortran, C++, libtools
● Shell script / Programming language / Script language
  ● python2, python3, perl5, R, Ruby2, zsh, ksh, NCAR Command Language
OSS Survey (K computer users)
● Application
  ● ABINIT-MP, AkaiKKR, bedtools, Biobambam, BWA, CUBE, ERmod, fdps, FFV-C, FrontFlow/Red, FrontISTR, GAMES, GENESIS, gromacs, GROMACS, HIVE, LAMMPS, MapSplice2, MODYLAS, NEURON, octa, OpenFOAM, PBVR, Picard, PIMD, quantum ESPRESSO, rDock, Samtools, SCALE, Star, TopHat, TopHat 2, WHEEL, xTAPP
● Library
  ● FFTW, matplotlib (python), beautiful soup (python), metis, ParMETIS, NetCDF4, HDF5, NuSDAS1.3, octa, fdps, Zoltan, cgns, Polylib, libsim
● Visualization tool
  ● gnuplot, PBVR, VTK, OSMesa
● Tool
  ● GNU utils, zlib, anaconda (python), itk, PAPI, PMlib, Szip, zip, TextParser, fpzip
● Build tool
  ● make, autoconf, cmake
● Shell script / Programming language / Script language
  ● bash, curl, python, ruby
● ISV
  ● ABAQUS, Advance, AMBER, Ansys fluent, Gaussian, FLUENT, Scryu/Tetra, LS-DYNA, VPS solver (PAM-CRASH), Helyx, HEETAH, iconCFD, LaBS, JMAG, MIZUHO, NuFD, VASP, VSOP