BDEC2 Poznan Japanese HPC Infrastructure Update
(on behalf of Satoshi Matsuoka, Director, Riken R-CCS)
Presenter: Masaaki Kondo, Riken R-CCS / Univ. of Tokyo


Post-K: The Game Changer (2020)

1. Heritage of the K computer: high performance in simulation via extensive co-design
• High performance: up to 100x the performance of K in real applications
• Multitudes of scientific breakthroughs via the Post-K application programs
• Simultaneous high performance and ease of programming

2. New technology innovations of Post-K
• High performance, especially via high memory bandwidth: performance boost by "factors" over mainstream CPUs in many HPC & Society 5.0 apps via bandwidth & vector acceleration
• Very green, e.g. extreme power efficiency: ultra power-efficient design & various power-control knobs
• Arm global ecosystem & SVE contribution: top CPU in the Arm ecosystem of 21 billion chips/year; SVE co-designed, with the world's first implementation by Fujitsu
• High performance on Society 5.0 apps incl. AI: architectural features for high performance on Society 5.0 apps based on Big Data, AI/ML, CAE/EDA, Blockchain security, etc.

ARM: massive ecosystem from embedded to HPC. Global leadership not just in the machine & apps, but as cutting-edge IT. The technology is not limited to Post-K, but feeds into societal IT infrastructures, e.g. clouds.
Arm64fx & Post-K (to be renamed)
• Fujitsu-Riken design: A64fx, ARMv8.2 (SVE), 48/52-core CPU
• HPC optimized: extremely high-bandwidth on-package memory (1 TByte/s), on-die Tofu-D network bandwidth (~400 Gbps), high SVE FLOPS (~3 Teraflops), various AI support (FP16, INT8, etc.)
• General-purpose CPU: Linux, Windows (Word), other SCs/clouds
• Extremely power efficient: >10x power/perf efficiency on a CFD benchmark over current mainstream x86 CPUs
• Largest and fastest supercomputer ever built circa 2020: >150,000 nodes (superseding LLNL Sequoia), >150 PetaByte/s memory bandwidth, Tofu-D 6D torus network with 60 Petabps injection bandwidth (10x global IDC traffic), 25~30 PB NVMe L1 storage, ~10,000-endpoint 100 Gbps I/O network into Lustre
• The first 'exascale' machine (not exa 64-bit flops, but in application performance)
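As a sanity check on the interconnect figures above: >150,000 nodes, each with ~400 Gbps of Tofu-D injection bandwidth, gives roughly 150,000 x 400 Gbps ≈ 60 Petabps of aggregate injection bandwidth, as stated. For a hedged illustration of how an application would address a torus network such as Tofu-D, the sketch below uses MPI's standard Cartesian-topology routines; the 3-D shape is an illustrative assumption (Tofu-D itself is a 6-D mesh/torus, and production jobs use the vendor's rank-mapping tools rather than this generic approach).

```c
/* Minimal sketch: mapping MPI ranks onto a periodic Cartesian (torus)
 * topology and finding nearest neighbours, as an application would on a
 * torus network such as Tofu-D.  The 3-D shape here is illustrative;
 * Tofu-D itself is a 6-D mesh/torus with vendor-specific rank mapping. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[3] = {0, 0, 0};          /* let MPI factorise nprocs */
    MPI_Dims_create(nprocs, 3, dims);

    int periods[3] = {1, 1, 1};       /* periodic in every dim => torus */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &cart);

    int rank, coords[3];
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);

    /* Nearest neighbours along dimension 0; halo exchanges would use these. */
    int left, right;
    MPI_Cart_shift(cart, 0, 1, &left, &right);

    printf("rank %d at (%d,%d,%d): -x neighbour %d, +x neighbour %d\n",
           rank, coords[0], coords[1], coords[2], left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```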
Post-K A64fx Processor
A many-core ARM CPU:
• 48 compute cores + 2 or 4 assistant (OS) cores
• Brand-new core design with near Xeon-class integer performance per core
• ARMv8 64-bit ARM ecosystem
• Tofu-D + PCIe 3 external connections

...but also an accelerated, GPU-like processor:
• SVE 512-bit vector extensions (ARM & Fujitsu): integer (1, 2, 4, 8 bytes) + float (16, 32, 64 bits)
• Cache + scratchpad-like local memory (sector cache)
• HBM2 on-package memory with massive memory bandwidth (Bytes/DPF ~0.4)
• Streaming memory access, strided access, scatter/gather, etc.
• Intra-chip barrier synchronization and other memory-enhancing features

GPU-like high performance in HPC, AI/Big Data, autonomous driving...
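Because SVE is vector-length agnostic, the same binary runs at A64FX's 512-bit width or at any other implementation's width. A minimal sketch using the Arm C Language Extensions (ACLE) intrinsics, assuming an SVE-enabled compiler (e.g. -march=armv8.2-a+sve); the DAXPY kernel is an illustrative example, not code from the slides:

```c
/* Vector-length-agnostic DAXPY (y = a*x + y) using Arm SVE intrinsics.
 * The same code runs with 512-bit vectors on A64FX or any other SVE
 * width; svcntd() reports how many 64-bit lanes the hardware provides. */
#include <arm_sve.h>
#include <stddef.h>

void daxpy_sve(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i += svcntd()) {
        /* Predicate masks off lanes that run past the end of the array. */
        svbool_t pg = svwhilelt_b64((uint64_t)i, (uint64_t)n);
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_m(pg, vy, vx, a);   /* vy += vx * a */
        svst1_f64(pg, &y[i], vy);
    }
}
```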
A64FX: Spec Summary
• ISA (base, extension): Armv8.2-A, SVE
• Process technology: 7 nm
• Peak DP performance: >2.7 TFLOPS (>90% @ DGEMM)
• SIMD width: 512-bit
• # of cores: 48 compute + 4 assistant (4 groups of 12 compute cores + 1 assistant core)
• Memory capacity: 32 GiB (HBM2 x 4)
• Memory peak bandwidth: 1024 GB/s (>80% @ STREAM Triad)
• PCIe: Gen3, 16 lanes
• High-speed interconnect: TofuD integrated (on-die Tofu interface and PCIe controller)
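The ">80% @ STREAM Triad" figure refers to the standard Triad kernel a[i] = b[i] + q*c[i]. The spec is also self-consistent with the Bytes/DPF ~0.4 quoted earlier: 1024 GB/s / 2700 GFLOP/s ≈ 0.38 bytes per double-precision flop. A minimal Triad-style bandwidth probe in C with OpenMP (a sketch for illustration, not the official STREAM benchmark):

```c
/* Minimal STREAM-Triad-style bandwidth probe: a[i] = b[i] + q*c[i].
 * Counts 3 * 8 bytes of traffic per element (two loads + one store).
 * A sketch for illustration, not the official STREAM benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1UL << 27)   /* 128M doubles per array = 1 GiB each */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double q = 3.0;

    #pragma omp parallel for
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; a[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];
    double t = omp_get_wtime() - t0;

    printf("Triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```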
Preliminary performance evaluation results
Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx
Overview of Post-K System & Storage
Three-level hierarchical storage:
• 1st layer: GFS cache + temp FS (25~30 PB NVMe)
• 2nd layer: Lustre-based GFS (a few hundred PB HDD)
• 3rd layer: off-site cloud storage

Full machine spec:
• >150,000 nodes, ~8 million high-performance Armv8.2 cores
• >150 PB/s memory bandwidth
• Tofu-D network: 10x global IDC traffic @ 60 Pbps
• ~10,000 I/O fabric endpoints
• >400 racks, ~40 MW machine + IDC
• PUE ~1.1 with high-pressure DLC
• NRE pays off: equivalent to ~15-30 million state-of-the-art competing CPU cores for HPC workloads (both dense and sparse problems)
Preparing the 40+MW Facility (actual photo)
What is HPCI?
World's top-class computing resources, open to the worldwide HPC communities.
The operation of the K computer will stop in August 2019.
HPCI Computational Resources (sites / machines)
HPCI Tier 2 Systems Roadmap (as of Nov. 2018)
[Roadmap chart, fiscal years 2016-2027, one row per site; entries recovered from the flattened chart and regrouped by site:]
• Hokkaido: HITACHI SR16000/M1 (172 TF, 22 TB); Cloud System BS2000 (44 TF, 14 TB); Data Science Cloud / Storage HA8000 / WOS7000 (10 TF, 1.96 PB) → 3.96 PF (UCC + CFL/M) 0.9 MW + 0.16 PF cloud 0.1 MW → 35 PF (UCC + CFL-M) 2 MW
• Tohoku: → 3.2 PB/s, 15~25 Pflop/s, 1.0-1.5 MW (CFL-M) → 25.6 PB/s, 50-100 Pflop/s (TPF), 1.5-2.0 MW
• Tsukuba: HA-PACS (1,166 TF); PPX1 (62 TF), PPX2 (62 TF) → Cygnus 3+ PF (TPF) 0.4 MW → PACS-XI 100 PF (TPF)
• Tokyo: Hitachi SR16K/M1 (54.9 TF, 10.9 TiB, 28.7 TB/s); Reedbush-U/H 1.92 PF (FAC) 0.7 MW (Reedbush-U until 2020.6); Reedbush-L 1.4 PF (FAC) 0.2 MW; Oakforest-PACS (OFP) 25 PF (UCC + TPF) 3.2 MW → BDEC 60+ PF (FAC) 3.5-4.5 MW; Oakbridge-II 4+ PF, 1.0 MW → 200+ PF (FAC) 6.5-8.0 MW
• Tokyo Tech: TSUBAME 2.5 (5.7 PF, 110+ TB, 1160 TB/s) 1.4 MW; TSUBAME 2.6 (5 PF) → TSUBAME 3.0 (12.15 PF, 1.66 PB/s) 1.4 MW (total) → TSUBAME 4.0 (~100 PF, ~10 PB/s, ~2.0 MW)
• Nagoya: Fujitsu FX100 (2.9 PF, 81 TiB); Fujitsu CX400 (774 TF, 71 TiB) + (542 TF, 71 TiB), 2 MW in total → 20+ PF (FAC/UCC + CFL-M) up to 3 MW → 100+ PF (FAC/UCC + CFL-M) up to 3 MW
• Kyoto: Cray XE6 + GB8K XC30; Cray XC30 (584 TF) 1.33 MW → 20-40+ PF (FAC/TPF + UCC) 1.5 MW → 80-150+ PF (FAC/TPF + UCC) 2 MW
• Osaka: NEC SX-ACE (423 TF); NEC Express5800 (22.4 TF); OCTOPUS 1.463 PF (UCC)
• Kyushu: HA8000 (712 TF, 242 TB); SR16000 (8.2 TF, 6 TB); FX10 (90.8 TFLOPS) → Fujitsu PRIMERGY CX subsystem A + B, 10.4 PF (UCC/TPF) 2.7 MW → 100+ PF (FAC/TPF + UCC/TPF) 3 MW
• Other recovered entries: UV2000 (98 TF, 128 TiB) 0.3 MW → 2 PF, 0.3 MW; 100~200 PF, 100-200 PB/s (CFL-D) ~4 MW
Supercomputer roadmap in ITC/U.Tokyo
FY2011-FY2025: two big systems, 6-year cycle.

ITC/U.Tokyo now operating 2 (or 4) systems, with 2,000+ users (1,000 from outside U.Tokyo)
Reedbush (HPE, Intel BDW + NVIDIA P100 (Pascal)): Integrated Supercomputer System for Data Analyses & Scientific Simulations
• Jul. 2016 - Jun. 2020 (RB-U); through Mar. 2021 (RB-H/L)
• Reedbush-U: CPU only, 420 nodes, 508 TF (Jul. 2016)
• Reedbush-H: 120 nodes, 2 GPUs/node, 1.42 PF (Mar. 2017)
• Reedbush-L: 64 nodes, 4 GPUs/node, 1.43 PF (Oct. 2017)
• 502.2 kW (w/o cooling; air-cooled)

Oakforest-PACS (OFP) (Fujitsu, Intel Xeon Phi (KNL)): JCAHPC, the Joint Center for Advanced HPC (U.Tsukuba & U.Tokyo)
• 8,208 nodes, 25 PF peak, 13.55 PF (HPL)
• TOP500 #14 (#2 in Japan), HPCG #9 (#3) (Nov. 2018)
• Omni-Path Architecture, full bisection fat-tree
• DDN IME (burst buffer), Lustre 26 PB; IO500 #1 (June 2018), #4 (Nov. 2018)
• 4.24 MW (incl. cooling), 3.44 MW (w/o cooling)
Oakbridge-CX (OBCX): Fujitsu awarded
• Intel Xeon Scalable Processors (Cascade Lake-SP): Platinum 8280 (28 cores, 2.7 GHz) x 2 sockets
• Overview: 1,368 nodes, 6.61 PF peak; aggregated memory bandwidth 385.1 TB/s; total HPL performance 4.2+ PF
• Fast cache: Intel SSDs on 128 nodes, 1.6 TB/node, 3.20/1.32 GB/s/node for R/W; for staging, checkpointing, and data-intensive applications; 16 of these nodes can directly access external resources (servers, storage, sensor networks, etc.)
• Network: Intel Omni-Path, 100 Gbps, full bisection
• Storage: DDN EXAScaler (Lustre), 12.4 PB, 193.9 GB/s
• Power consumption: 950.5 kVA
• Operation starts July 1, 2019
Tokyo Tech TSUBAME Supercomputing History
World's leading supercomputers: x100,000 speedup in 17 years, developing world-leading use of massively parallel, many-core technology.
• 2002 "TSUBAME0": 1.3 Teraflops, 800 cores
• 2006 TSUBAME1.0: No.1 Asia, No.7 World; 10,000 cores
• 2010 TSUBAME2.0: 2.4 Petaflops; World No.1 production Green; ACM Gordon Bell Prize
• 2013 TSUBAME2.5: 418 GPUs upgraded; 5.7 Petaflops, No.2 in Japan; AI flops 17.1 Petaflops
• 2013 TSUBAME-KFC: TSUBAME3 prototype; oil immersion cooling; Green World No.1
• 2015: AI prototype upgrade (TSUBAME-KFC/DL)
• 2017 TSUBAME3.0: 12.1 Petaflops (AI flops 47.2 Petaflops), >10 million cores; Green World No.1; HPC and Big Data / AI convergence

Key enabling technologies: general-purpose CPUs & many-core processors (GPUs), advanced optical networks, non-volatile memory, efficient power control and cooling.
ABCI: AI Supercomputer to Serve Academic & Industry AI Research in Japan, hosted by AIRC-AIST (2018)
• 4,352 Volta GPUs + 2,176 Skylake CPUs
• 0.55 AI-Exaflops, 37 Petaflops (DP)
• ~2 Petabytes NVMe, 20 Petabytes HDD
• DNN-training-optimized InfiniBand EDR
• New IDC: single floor, inexpensive & fast build, hard concrete floor with 2 tons/m2 weight tolerance
• Racks: 144 max; ABCI uses 43 racks for compute & storage
• Power capacity: 3.25 MW max; ABCI 2.3 MW max
• Water-air hybrid warm-water "free" cooling: 70 kW/rack, PUE < 1.1; total 3.2 MW min. (summer)
• 32 Celsius free cooling even at ~39 Celsius external temperature with high humidity
• ABCI IDC built in 7 months; operation Aug. 2018

Commoditizing TSUBAME3 supercomputer technologies to the cloud (17 AI-Petaflops, 70 kW/rack, free cooling)
Training ImageNet in Minutes
Rio Yokota, Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Hiroki Naganuma, Shun Iwase, Kaku Linsho, Satoshi Matsuoka (Tokyo Institute of Technology / Riken) + Akira Naruse (NVIDIA)

  Group                            #GPU    Time
  Facebook                          512    30 min
  Preferred Networks               1024    15 min
  UC Berkeley                      2048    14 min
  Tencent                          2048    6.6 min
  Sony (ABCI)                     ~3000    3.7 min
  Google (TPU/GCC)                 1024    2.2 min
  Fujitsu Lab+ (ABCI)              2048    75 sec
  TokyoTech/NVIDIA/Riken (ABCI)    4096    ??
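All of these large-batch results rest on synchronous data parallelism: every GPU holds a full model replica, computes gradients on its own shard of the global batch, and the gradients are summed across workers each step. A hedged skeleton of that step in C with MPI; compute_local_gradients() is a hypothetical stand-in for a framework's forward/backward pass, and production runs use NCCL-style fused, communication-overlapped reductions rather than one flat MPI_Allreduce:

```c
/* Skeleton of one synchronous data-parallel SGD step: every rank holds
 * a full copy of the model, computes gradients on its local batch shard,
 * then averages gradients across all ranks before updating weights. */
#include <mpi.h>

#define NPARAMS (1 << 20)

/* Hypothetical stand-in: in a real framework this is the fused
 * forward/backward pass over the local mini-batch shard. */
static void compute_local_gradients(double *grad, int nparams) {
    for (int i = 0; i < nparams; i++)
        grad[i] = 0.001 * (i % 7);   /* dummy gradient values */
}

void sgd_step(double *weights, double *grad, double lr, MPI_Comm comm) {
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    compute_local_gradients(grad, NPARAMS);

    /* Sum gradients from all ranks; in-place avoids a second buffer. */
    MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_DOUBLE, MPI_SUM, comm);

    for (int i = 0; i < NPARAMS; i++)
        weights[i] -= lr * (grad[i] / nprocs);   /* average, then update */
}
```

The communication volume per step is one model-sized allreduce, which is why injection bandwidth and network topology (the Tofu-D numbers on the earlier slides) govern scalability at thousands of workers.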
Post-K processor: high-performance FP16 & INT8, high memory bandwidth for convolution, built-in scalable Tofu network → unprecedented DL scalability

High-performance DNN convolution: low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM; see the baseline sketch below)

High-performance, ultra-scalable network for massive scaling of model & data parallelism
Unprecedented scalability of data/model parallelism for the Post-K supercomputer
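For reference, the baseline that the FFT, Winograd, and GEMM (im2col) formulations are compared against is direct convolution; a minimal single-channel sketch (illustrative, not Fujitsu's implementation):

```c
/* Baseline direct 2-D convolution (valid padding, stride 1) for one
 * input/output channel pair.  GEMM (im2col), Winograd and FFT variants
 * compute the same result with fewer or better-structured operations;
 * which wins depends on kernel size, channel count and memory bandwidth. */
void conv2d_direct(const float *in, int ih, int iw,
                   const float *k,  int kh, int kw,
                   float *out /* (ih-kh+1) x (iw-kw+1) */) {
    int oh = ih - kh + 1, ow = iw - kw + 1;
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++) {
            float acc = 0.0f;
            for (int r = 0; r < kh; r++)
                for (int c = 0; c < kw; c++)
                    acc += in[(y + r) * iw + (x + c)] * k[r * kw + c];
            out[y * ow + x] = acc;
        }
}
```

Winograd F(2x2, 3x3), for instance, cuts the multiplications per output of a 3x3 kernel from 9 to 4 at the cost of extra additions and transforms; sustaining that trade is exactly where low-precision ALUs and high memory bandwidth pay off.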
Large AI Infrastructures in Japan
[Table: training peak performance of large AI infrastructures in Japan. Recovered entries: NVIDIA P100 x 2160; NVIDIA P100 x 512; NVIDIA P100 x 400; ~210 PF (FP16); ???? ????]