TRANSCRIPT
Feasibility Study on Advanced and Efficient Latency Core-based Architecture for Future HPCI R&D
Yutaka Ishikawa (U. of Tokyo), Yuji Saeki (Hitachi), Masamichi Takagi (NEC), Hisanobu Tomari (U. of Tokyo)
• Takahiro Katagiri, Kengo Nakajima, Satoshi Ohshima, Hideyuki Jitsumoto, Shinji Todo, Junichi Iwata, Kazuyuki Uchida, Hiroyasu Hasumi (University of Tokyo)
• Kei Hiraki, Hiroshi Nakamura, Reiji Suda (University of Tokyo) • Mutsumi Aoyagi, Koji Inoue, Yuichi Inadomi (Kyushu University) • Naoki Shinjo, Toshiyuki Shimizu, Akira Asato, Shinji Sumimoto (Fujitsu) • Tsuneo Iida, Masaaki Shimizu, Takashi Yonemura (Hitachi) • Yuichi Nakamura (NEC)
2013/3/19
Target System, Approach, and Organization
Target Application and Current Evaluation Results
System Software Stack
◦ Overview
◦ Basic Evaluation of McKernel, by Yuji Saeki (Hitachi)
◦ MPI and Low-Level Communication, by Masamichi Takagi (NEC)
Long-range Study on Future Architectures, by Hisanobu Tomari (U. of Tokyo)
Concluding Remarks
Features of Target System (post-petascale machine): deployment around 2018
◦ Power consumption up to 30 MW
◦ 2000 m2 floor space
Approach:
◦ Material and climate sciences are the first target applications
◦ Approach from the evolution of the K computer architecture
◦ The system software stack is designed for both the proposed and commodity-based machines
Organization
◦ PI: Yutaka Ishikawa, U. of Tokyo: system software stack; performance prediction and tuning
◦ Co-PI: Kei Hiraki, U. of Tokyo: architecture evaluation, compiler, and low-power technologies
◦ Co-PI: Mutsumi Aoyagi, Kyushu U.: network evaluation environment
◦ Co-PI: Naoki Shinjo, Fujitsu: processor, node, and interconnect architecture, and system software stack
◦ Co-PI: Tsuneo Iida, Hitachi: storage architecture and system software stack
◦ Co-PI: Yuichi Nakamura, NEC: system software stack
The system software stack (MPI, parallel file I/O, PGAS, batch job scheduler, debugging and tuning tools) serves both the next-generation general-purpose supercomputer and commodity-based supercomputers, with applications on top.
Tightly coupled design of architecture by architects, software developers, and application developers.
(Figure: co-design cycle, 1 cycle per 2 months, linking application code with performance-counter instrumentation, performance parameters and prediction tools for performance prediction, architecture design (processor, node, network, and storage), and evaluation and tuning of applications on the Fujitsu FX10.)
ALPS (Algorithms and Libraries for Physics Simulations)
◦ Provides high-end simulation codes for strongly correlated quantum mechanical systems
◦ Requirements: total memory 10-100 PB; low-latency, high-radix network
RSDFT (Real-Space Density Functional Theory)
◦ A DFT (Density Functional Theory) code with real-space discretized wave functions and densities, for molecular dynamics simulations using the Car-Parrinello type approach
◦ Requirements: total memory 1 PB, 1 EFLOPS (B/F = 0.1+)
NICAM (Nonhydrostatic ICosahedral Atmospheric Model)
◦ A Global Cloud Resolving Model (GCRM)
◦ Requirements: total memory 1 PB, memory bandwidth 300 PB/sec, 100 PFLOPS (B/F = 3)
COCO (CCSR Ocean Component Model)
◦ Ocean general circulation model developed at the Center for Climate System Research (CCSR), the University of Tokyo
◦ Requirements: total memory 320 TB, memory bandwidth 150 PB/sec, 50 PFLOPS (B/F = 3)
In FY2013, mini-apps developed by the FS application team are applied.
(Figure: system software stack overview.)
Target machines: commodity machine, proposed machine, etc., over InfiniBand, Tofu, RoCE, and BGQ Fabric, with many-core nodes.
Heterogeneous operating system: Linux and a lightweight micro kernel (McKernel), with hierarchical memory management, process/thread management, parallel process spawn, low-level communication, and power management.
Upper layers: programming languages/models; MPI and other communication; parallel file I/O; batch job system; parallel file system and hierarchical file system; math libraries; tuning and debugging tools; real-time/big-data visualization; energy consumption model and energy control model.
(Yellow in the original figure marks the components of main concern.)
(The same stack figure is shown again, marking the components targeted for international collaboration.)
Linux only
◦ Single Linux kernel on all cores
◦ Multiple Linux kernels on compute nodes
Lightweight micro kernel (LMK) + Linux
◦ Single LMK on compute cores + Linux on OS cores
◦ Multiple LMKs on compute nodes + Linux on OS cores
◦ Single LMK on each compute node
Compute Node Kernel only, with Linux server
◦ Single LMK on all cores
◦ Multiple LMKs on all cores
(Figure: the corresponding kernel layouts, combinations of Linux kernels and LMKs across cores and nodes, for each configuration.)
◦ Based on the reference architectures, the following possible configurations are considered and evaluated using a KNC (Knights Corner, Xeon Phi) cluster
(Figure: the evaluated host + KNC configurations, connected by PCI-Express with an InfiniBand network card; the host runs a Linux kernel, while the KNC runs, depending on the configuration, multiple McKernels, Linux plus multiple McKernels, Linux plus a single McKernel, Linux only, multiple Linux kernels, or a single McKernel.)
IHK (Interface for Heterogeneous Kernels)
◦ Provides the interface between the Linux kernel and micro kernels
◦ Provides general-purpose communication and data transfer mechanisms
McKernel
◦ Lightweight micro kernel
(Figures: the software organization for a bootable many-core and for a non-bootable many-core: McKernel with a cokernel over IHK on the many-core, IKC channels to the Linux kernel with its device driver, helper threads, and Executer, and MPI, PGAS, and OpenMP applications over the Linux API (glibc, pthread).)
US-JP Collaboration on System Software
Linux Kernel + Loadable LWK
◦ The LWK is dynamically reloaded for each application
   E.g., LWK-A for application A is loaded during A's execution; LWK-B for application B is loaded during B's execution
◦ The Linux API is provided in the LWK
(Figure: the Linux kernel is resident; when application A requiring LWK-A is invoked, LWK-A is loaded onto the compute cores and removed when A finishes; likewise LWK-B for application B and LWK-C for application C.)
Features implemented and being tested
◦ glibc and pthread
   Thread and memory management
   File I/O, delegated to Linux on the host
   Memory map and dynamic link libraries
◦ Process launcher on the host
◦ Direct communication with InfiniBand
◦ MPI library (not fully) running on Xeon Phi
◦ OpenMP environment with the Intel compiler
Features being developed and planned
◦ Hierarchical memory management
◦ PVAS, supporting the PGAS model
◦ Direct SSD
◦ Single OS kernel image for partitioned multiple lightweight kernels
Basic Evaluation of McKernel, by Yuji Saeki (Hitachi)
System calls on Heterogeneous OS
(Two compositions are considered: Attached and Built-in.)
■ Delegating system calls on McKernel (the lightweight kernel) to the Linux kernel to minimize cache pollution. ・When delegating a system call whose arguments point to data buffers, the data is transferred from McKernel to Linux.
ssize_t write(int fd , const void *buf , size_t count );
ssize_t read(int fd, void *buf, size_t count);
int open(const char *pathname, int flags, mode_t mode); ・・・ more than 300 Linux primitives
Delegation Mechanism on McKernel
(Figure: a write() issued on McKernel running on the Xeon Phi™ is forwarded, together with its arguments and data, over PCI-Express to the Linux kernel on the host, which executes the call and returns the result.)
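As a rough illustration of the delegation path above (not the actual IHK/McKernel code; the structure and the ihk_ikc_* names are assumptions made for this sketch), the McKernel-side trap handler only needs to pack the syscall number and its arguments and forward them to the host:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical IHK channel primitives; the real interface differs. */
extern int ihk_ikc_send(const void *buf, size_t len);
extern int ihk_ikc_recv(void *buf, size_t len);

struct syscall_request {
    uint64_t number;    /* e.g. the write() syscall number       */
    uint64_t args[6];   /* fd, buf, count, ... forwarded as-is   */
};

struct syscall_response {
    int64_t ret;        /* return value or -errno from host Linux */
};

/* Forward one system call from McKernel to the Linux side and wait. */
long delegate_syscall(uint64_t number, const uint64_t args[6])
{
    struct syscall_request req = { .number = number };
    struct syscall_response res;

    for (int i = 0; i < 6; i++)
        req.args[i] = args[i];

    ihk_ikc_send(&req, sizeof(req));   /* request to the host-side proxy  */
    ihk_ikc_recv(&res, sizeof(res));   /* block until the host replies    */
    return res.ret;
}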
Data transfer between Linux and McKernel
■ Memory copy
   The data structure depends on the API specification of each system call.
   → An individual implementation of system call delegation is required on both the Linux and McKernel sides.
■ Memory map to the same virtual address on Linux and McKernel
   → The system call can be invoked on the Linux side with the same argument values, without an individual implementation: write(fd, bbb, cnt) on McKernel becomes write(fd, bbb, cnt) on Linux.
   ※ Overlap of virtual addresses between Linux and McKernel must be avoided.
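A minimal user-level sketch of the memory-map approach (illustrative only: the device name /dev/mcos-mem and the physical-offset handling are assumptions, and page alignment of buf and cnt is glossed over). The Linux-side proxy maps the McKernel buffer at exactly the same virtual address before issuing the call with unchanged arguments:

#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t delegated_write(int fd, const void *buf, size_t cnt, off_t phys_off)
{
    int mem = open("/dev/mcos-mem", O_RDWR);   /* hypothetical device exposing McKernel memory */
    if (mem < 0)
        return -1;

    /* Map the McKernel pages at the same virtual address 'buf'. */
    void *p = mmap((void *)buf, cnt, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, mem, phys_off);
    if (p == MAP_FAILED) {
        close(mem);
        return -1;
    }

    ssize_t ret = write(fd, buf, cnt);         /* same argument values as on McKernel */

    munmap(p, cnt);
    close(mem);
    return ret;
}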
■ Pages used by a process on McKernel are mapped to the same position in the virtual memory space on Linux.
① The six arguments of the system call are forwarded from McKernel to Linux.
② mcexec (the process on Linux that launches a.out) invokes syscall(args).
③ An access to a virtual address used on McKernel causes a page fault on Linux → the handler gets the page table entries from McKernel → memory map.
⇒ It is not necessary to implement each system call individually.
※ mcexec is prevented from using the same pages as a.out → the Position Independent Executable is located at the bottom of the virtual address space.
Semi-automatic Delegation Mechanism
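The page-fault path in step ③ could be sketched as follows (mckernel_lookup_pte and map_phys_page are placeholders for the delegation-driver helpers, and the PTE bit layout is simplified); the point is that one generic fault handler replaces per-syscall argument marshalling:

#include <stdint.h>

typedef uint64_t pte_t;

/* Hypothetical helpers provided by the delegation driver. */
extern pte_t mckernel_lookup_pte(uint64_t mck_vaddr);
extern int   map_phys_page(uint64_t vaddr, uint64_t paddr, int prot);

/* Called when mcexec faults on an address owned by the McKernel process. */
int mcexec_fault_handler(uint64_t fault_vaddr)
{
    uint64_t page = fault_vaddr & ~0xFFFULL;     /* 4 KiB page base          */
    pte_t pte = mckernel_lookup_pte(page);       /* ask McKernel for its PTE */
    if (!pte)
        return -1;                               /* not mapped on McKernel   */

    uint64_t paddr = pte & ~0xFFFULL;            /* physical frame address   */
    int prot = (int)(pte & 0x7);                 /* present/write/user bits  */

    /* Install the same physical page at the same virtual address on Linux. */
    return map_phys_page(page, paddr, prot);
}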
(Figure: attached-mic write system call performance, 2013-Feb-14, kncc16; throughput (bytes/sec) versus size (bytes) for Xeon local, Phi memcpy, and Phi mmap.)
・Delegation to host Linux
・Delegated write is much slower than local write on host Linux … not using DMA engine on Xeon Phi™
Delegation throughput using mmap and memcpy shows the same tendency. mmap: 23.5 MB/sec memcpy: 24.8 MB/sec
Write throughput (Attached composition)
Host Linux local: 4434 MB/s
Experimental machine: Intel Xeon E5-2670 host, Intel preproduction Xeon Phi
・Delegation using memcpy is 2.5 times faster than mmap (due to page faults).
・Delegated read throughput is better than write, thanks to PCIe write-combining.
Read throughput (Attached composition)
(Figure: attached-mic read system call performance, 2013-Feb-15, kncc16; throughput (bytes/sec) versus size (bytes) for Xeon local, Phi memcpy, and Phi mmap.)
Read throughput reusing PTEs (Attached composition)
・When host Linux can reuse the PTEs (i.e., the test program is executed repeatedly), the throughput of mmap comes close to that of memcpy.
(Figure: attached-mic read system call performance, 2013-Feb-15, kncc16; throughput (bytes/sec) versus size (bytes) for Xeon local, Phi memcpy, and Phi mmap.)
(Figure: builtin-mic write system call performance, 2013-Feb-20, kncc14; throughput (bytes/sec) versus size (bytes) for MIC Linux local, McKernel memcpy, and McKernel mmap.)
・Delegation to the Linux kernel on Xeon Phi™
・Delegated write is about 2 times slower than local write under Linux on Xeon Phi™
・Delegation throughput using mmap is about 90% of that of memcpy, due to page faults
Write throughput (Built-in composition)
・When Linux on Xeon Phi™ can reuse the PTEs (i.e., the test program is executed repeatedly), the throughput of mmap comes close to that of local Linux on Xeon Phi™.
Write throughput reusing PTEs (Built-in composition)
(Figure: builtin-mic write system call performance, 2013-Feb-20, kncc14; throughput (bytes/sec) versus size (bytes) for MIC Linux local, McKernel memcpy, and McKernel mmap.)
(Figure: builtin-mic read system call performance, 2013-Feb-20, kncc14; throughput (bytes/sec) versus size (bytes) for MIC Linux local, McKernel memcpy, and McKernel mmap.)
・Delegation throughput using mmap is about 90% of that of memcpy.
・This is similar to the case of write().
Read throughput (Built-in composition)
・The semi-automatic delegation mechanism minimizes the cache pollution caused by system calls without implementing each system call individually, by mapping the pages used by a process on McKernel to the same position in the virtual memory space on Linux.
・The throughput of read/write system calls delegated to Linux using mmap is affected by the overhead of retrieving and inserting McKernel's page table entries.
   → To be improved by retrieving and inserting a set of PTEs at once.
Summary: basic evaluation of McKernel
MPI and Low-Level Communication, by Masamichi Takagi (NEC)
(Figure: the low-level communication library sits on IB Verbs over InfiniBand, Tofu, and DRAM, alongside process management; it exposes an API for building blocks and an API for new protocols, and MPICH sits on top with an optimized design for the network and a new protocol.)
Communication building blocks for next-generation supercomputers
• Enable optimized implementations for new network hardware
   e.g., one can implement bootstrap information exchange
   e.g., one can implement two-sided communication using the RDMA building block (see the sketch below)
• Provide an API for new protocols
   e.g., low-latency communication by scheduling multiple communications
A prototype design of the low-level communication library (LLC) is needed for its co-design.
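As a sketch of the two-sided-over-RDMA idea (the llc_rdma_write call and the eager-slot protocol below are assumptions made for illustration, not the project's actual LLC API): the sender deposits the payload into a pre-registered slot on the receiver with one-sided writes, finishing with a sequence-number flag that the receiver polls.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 4096

struct eager_slot {                     /* pre-registered on the receiver */
    char     payload[SLOT_SIZE];
    uint32_t len;
    volatile uint64_t seq;              /* written last: arrival flag */
};

/* Hypothetical one-sided building block: RDMA-write len bytes to a remote address. */
extern int llc_rdma_write(int dest_rank, uint64_t remote_addr,
                          const void *src, size_t len);

/* Sender: payload first, then length, then the flag; writes on one connection
 * are delivered in order, so the flag lands after the data. */
int llc_send(int dest, uint64_t remote_slot, const void *buf,
             uint32_t len, uint64_t seq)
{
    llc_rdma_write(dest, remote_slot + offsetof(struct eager_slot, payload), buf, len);
    llc_rdma_write(dest, remote_slot + offsetof(struct eager_slot, len), &len, sizeof(len));
    return llc_rdma_write(dest, remote_slot + offsetof(struct eager_slot, seq), &seq, sizeof(seq));
}

/* Receiver: poll the slot until the expected sequence number appears. */
uint32_t llc_recv(struct eager_slot *slot, uint64_t expect, void *buf)
{
    while (slot->seq != expect)
        ;                               /* spin until the flag is written */
    memcpy(buf, slot->payload, slot->len);
    return slot->len;
}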
Steps:
1. Design and implement a new network module using Xeon with InfiniBand
2. Port the process management functionality to Xeon Phi
3. Port the network module to Xeon Phi with InfiniBand
(Figure: MPICH structure: the MPI interface over the ADI interface with the Pamid and CH3 devices; CH3's Channel2 interface with Sock and Nemesis; Nemesis netmods including SCIF, TCP, Tofu, and Phi with IB; and the PMI interface connecting the UI executable with proxy executables.)
Prototyping by porting MPICH to Phi with IB
Machine
◦ Intel Xeon E5-2670, 2.601 GHz
◦ Mellanox ConnectX-3, 56 Gb/s
MPI libraries compared
◦ MVAPICH 1.9b
◦ MPICH 3.0.1 with our network module
Program
◦ MPI_Isend and MPI_Irecv ping-pong (see the sketch below)
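A minimal version of the ping-pong pattern used here (the iteration count and message size are illustrative, not the exact benchmark; run with exactly two ranks):

#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    static char sbuf[1 << 21], rbuf[1 << 21];   /* up to 2 MB messages     */
    int rank, size = 8;                         /* message size in bytes   */
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {                        /* send, then wait for the echo */
            MPI_Isend(sbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Irecv(rbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else {                                /* receive, then echo back */
            MPI_Irecv(rbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Isend(sbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }
    if (rank == 0)
        printf("%d bytes: %.2f usec round trip\n", size,
               (MPI_Wtime() - t0) / ITERS * 1e6);
    MPI_Finalize();
    return 0;
}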
Performance of the prototype is comparable to the state of the art.
Round trip latency
(Figure: round-trip latency (usec) versus message size (bytes) for MVAPICH and MPICH with the new netmod, shown for small (4-512 bytes), medium (16 KB-256 KB), and large (2 MB) messages.)
Comparable to MVAPICH
Metrics
• Operations per second
Methodology
Machines
◦ Intel Xeon E5-2670, 2.601 GHz, 8-core, 2-socket, 4 nodes
◦ Mellanox ConnectX-3, 56 Gb/s
MPI libraries compared
◦ MVAPICH 1.9b
◦ MPICH 3.0.1 with our netmod
Program
◦ NAS Parallel Benchmarks 3.3.1 MZ
◦ bt-mz, lu-mz, sp-mz
◦ OpenMP (8-thread) and MPI (4-process) hybrid parallelization
Comparable to MVAPICH
(Figure: operations per second (Mops/sec) for the bt-mz, lu-mz, and sp-mz benchmark programs, MPICH with the new netmod versus MVAPICH.)
Performance of the prototype is comparable to the state of the art.
Long-range Study on Future Architectures, by Hisanobu Tomari (U. of Tokyo)
Performance of integer instructions
− Power consumption associated with performance
On-chip network
− Latency between hardware threads
Integer instructions are:
− ALU (add, sub, ...) - used for address calculation
− Branch
− Load/store
Observed trend using Dhrystone
− Dhrystone results are close to CINT2006
− Runs on a wider range of hardware configurations
System power consumption
− CPU, memory
− PSU loss, fans
Measured between the AC outlet and the computer
Performance
◦ Integer performance of modern HPC processors is not that high
◦ The latest ARM processors would outperform them
Improvements are mainly driven by the increase in performance
Extrapolating data from computers after 1995:
◦ 2018: ~3000 VAX MIPS/Watt
◦ Translates to 0.33 nJ/instruction
If we assume 1 EFlop/s at 10 MW in 2018:
◦ Translates to 0.01 nJ/Flop
How are we going to fill this gap? (See the arithmetic below.)
◦ We must eliminate load/store and address calculation from the kernel loop - is that even possible?
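As a quick check of the two numbers above (simple unit arithmetic, not from the slides): 1 W / (3000 × 10^6 instructions/s) ≈ 0.33 nJ per instruction, while 10 MW / 10^18 Flop/s = 0.01 nJ per Flop, so the extrapolated energy per operation is roughly a factor of 30 too high.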
(Figure: per-socket NPB performance, OpenMP, compiler optimization only; performance of the OpenMP class A benchmarks bt, cg, ep, ft, is, lu, mg, sp, and ua relative to the Fujitsu SPARC64 IXfx, comparing SPARC64 IXfx (Fujitsu), Xeon E5-2687W (Intel), and BGQ.)
All three systems have similar theoretical FP performance
Due to integer performance, FP application performance suffers
Life is too short for optimizing every application for every supercomputer architecture
Advanced dynamic instruction scheduling is beneficial for HPC
Measure ping-pong latency between threads
◦ Repeat compare-and-swap
◦ Bind threads to the same/different physical core or socket
The benchmark is available at https://github.com/tomari/shmlatency (see the sketch below)
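A stripped-down illustration of the measurement (the real benchmark at the URL above differs; thread pinning with pthread_setaffinity_np and warm-up are omitted here). Two threads flip a shared flag back and forth with compare-and-swap, and the average time per iteration approximates one round trip:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Atomic int flag = 0;

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        int expect = 1;
        while (!atomic_compare_exchange_weak(&flag, &expect, 0))
            expect = 1;                    /* wait for the ping, then answer */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec t0, t1;

    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        int expect = 0;
        while (!atomic_compare_exchange_weak(&flag, &expect, 1))
            expect = 0;                    /* send a ping, wait for the pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %.1f ns\n", ns / ITERS);
    return 0;
}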
(Figure: thread-to-thread latency (ns) for same-core, same-die, and inter-socket thread placements, on Opteron 6282SE, SPARC T4, Xeon X5570, and POWER7.)
(Figure: latency (ns) across 1 to 16 hardware threads, comparing POWER7 and SandyBridge-E.)
(Figure: inter-processor latency (ns) on the Sun Enterprise 3000 and SGI Origin 2000, together with a sketch of the Origin 2000 topology of nodes (N) and routers (R).)
Large variations in on-chip latency
Latency is critical in synchronization
◦ More apparent when the thread count is higher
Inter-chip latency has been at the same level for 15+ years
Can we decrease latency?
We need more integer performance
◦ thereby increasing performance per watt
◦ key to the `easy-to-use' supercomputer
Latency is currently being studied
◦ Can we decrease latency for synchronizing a larger number of threads?
◦ What impact will that have on the exascale?
(Figure: roadmap over CY2013-2022: basic components, then advanced components and tuning, alongside the PostT2K machine.)
Activity at U. of Tokyo and U. of Tsukuba (besides this Feasibility Study)
• U. of Tokyo and U. of Tsukuba will install the so-called PostT2K machine, whose boundary conditions are up to 3 MW of electricity and up to 1000 m2 of space.
• A part of the system software stack shown in this presentation will be deployed. The machine will be used for production runs. This software stack will be extended towards a post-petascale machine.