TRANSCRIPT
Feasibility Study on Advanced and Efficient Latency Core-based Architecture for Future HPCI R&D
Yutaka Ishikawa (U. of Tokyo), Yuji Saeki (Hitachi), Masamichi Takagi (NEC), Hisanobu Tomari (U. of Tokyo)
• Takahiro Katagiri, Kengo Nakajima, Satoshi Ohshima, Hideyuki Jitsumoto, Shinji Todo, Junichi Iwata, Kazuyuki Uchida, Hiroyasu Hasumi (University of Tokyo)
• Kei Hiraki, Hiroshi Nakamura, Reiji Suda (University of Tokyo) • Mutsumi Aoyagi, Koji Inoue, Yuichi Inadomi (Kyushu University) • Naoki Shinjo, Toshiyuki Shimizu, Akira Asato, Shinji Sumimoto (Fujitsu) • Tsuneo Iida, Masaaki Shimizu, Takashi Yonemura (Hitachi) • Yuichi Nakamura (NEC)
2013/3/19
Target System, Approach, and Organization
Target Application and Current Evaluation Results
System Software Stack
◦ Overview
◦ Basic Evaluation of McKernel, by Yuji Saeki (Hitachi)
◦ MPI and Low-Level Communication, by Masamichi Takagi (NEC)
Long-range Study on Future Architectures, by Hisanobu Tomari (U. of Tokyo)
Concluding Remarks
Features of Target System (post-petascale machine): deployment around 2018
◦ Power consumption up to 30 MW
◦ 2000 m2 floor space
Approach:
◦ Material and climate sciences are the first target applications
◦ Approach from the evolution of the K computer architecture
◦ The system software stack is designed for both the proposed and commodity-based machines
Organization
◦ PI: Yutaka Ishikawa, U. of Tokyo: system software stack; performance prediction and tuning
◦ Co-PI: Kei Hiraki, U. of Tokyo: architecture evaluation, compiler, and low-power technologies
◦ Co-PI: Mutsumi Aoyagi, Kyushu U.: network evaluation environment
◦ Co-PI: Naoki Shinjo, Fujitsu: processor, node, and interconnect architecture, and system software stack
◦ Co-PI: Tsuneo Iida, Hitachi: storage architecture and system software stack
◦ Co-PI: Yuichi Nakamura, NEC: system software stack
The system software stack (MPI, parallel file I/O, PGAS, batch job scheduler, debugging and tuning tools) serves both the next-generation general-purpose supercomputer and commodity-based supercomputers, with applications on top.
Tightly coupled design of architecture by architects, software developers, and application developers.
(Figure: co-design cycle, 1 cycle per 2 months, linking application code with performance-counter instrumentation, performance parameters and prediction tools for performance prediction, architecture design (processor, node, network, and storage), and evaluation and tuning of applications on the Fujitsu FX10.)
ALPS (Algorithms and Libraries for Physics Simulations)
◦ Provides high-end simulation codes for strongly correlated quantum mechanical systems
◦ Requirements: total memory 10-100 PB; low-latency, high-radix network
RSDFT (Real-Space Density Functional Theory)
◦ A DFT (Density Functional Theory) code with real-space discretized wave functions and densities, for molecular dynamics simulations using the Car-Parrinello type approach
◦ Requirements: total memory 1 PB, 1 EFLOPS (B/F = 0.1+)
NICAM (Nonhydrostatic ICosahedral Atmospheric Model)
◦ A Global Cloud Resolving Model (GCRM)
◦ Requirements: total memory 1 PB, memory bandwidth 300 PB/sec, 100 PFLOPS (B/F = 3)
COCO (CCSR Ocean Component Model)
◦ Ocean general circulation model developed at the Center for Climate System Research (CCSR), the University of Tokyo
◦ Requirements: total memory 320 TB, memory bandwidth 150 PB/sec, 50 PFLOPS (B/F = 3)
In FY2013, mini-apps developed by the FS application team are applied.
(Figure: system software stack overview.)
Target machines: commodity machine, proposed machine, etc., over InfiniBand, Tofu, RoCE, and BGQ Fabric, with many-core nodes.
Heterogeneous operating system: Linux and a lightweight micro kernel (McKernel), with hierarchical memory management, process/thread management, parallel process spawn, low-level communication, and power management.
Upper layers: programming languages/models; MPI and other communication; parallel file I/O; batch job system; parallel file system and hierarchical file system; math libraries; tuning and debugging tools; real-time/big-data visualization; energy consumption model and energy control model.
(Yellow in the original figure marks the components of main concern.)
(The same stack figure is shown again, marking the components targeted for international collaboration.)
Linux only
◦ Single Linux kernel on all cores
◦ Multiple Linux kernels on compute nodes
Lightweight micro kernel (LMK) + Linux
◦ Single LMK on compute cores + Linux on OS cores
◦ Multiple LMKs on compute nodes + Linux on OS cores
◦ Single LMK on each compute node
Compute Node Kernel only, with Linux server
◦ Single LMK on all cores
◦ Multiple LMKs on all cores
(Figure: the corresponding kernel layouts, combinations of Linux kernels and LMKs across cores and nodes, for each configuration.)
◦ Based on the reference architectures, the following possible configurations are considered and evaluated using a KNC (Knights Corner, Xeon Phi) cluster
(Figure: the evaluated host + KNC configurations, connected by PCI-Express with an InfiniBand network card; the host runs a Linux kernel, while the KNC runs, depending on the configuration, multiple McKernels, Linux plus multiple McKernels, Linux plus a single McKernel, Linux only, multiple Linux kernels, or a single McKernel.)
IHK (Interface for Heterogeneous Kernels)
◦ Provides the interface between the Linux kernel and micro kernels
◦ Provides general-purpose communication and data transfer mechanisms
McKernel
◦ Lightweight micro kernel
(Figures: the software organization for a bootable many-core and for a non-bootable many-core: McKernel with a cokernel over IHK on the many-core, IKC channels to the Linux kernel with its device driver, helper threads, and Executer, and MPI, PGAS, and OpenMP applications over the Linux API (glibc, pthread).)
US-JP Collaboration on System Software
Linux Kernel + Loadable LWK
◦ The LWK is dynamically reloaded for each application
   E.g., LWK-A for application A is loaded during A's execution; LWK-B for application B is loaded during B's execution
◦ The Linux API is provided in the LWK
(Figure: the Linux kernel is resident; when application A requiring LWK-A is invoked, LWK-A is loaded onto the compute cores and removed when A finishes; likewise LWK-B for application B and LWK-C for application C.)
Features implemented and being tested
◦ glibc and pthread
   Thread and memory management
   File I/O, delegated to Linux on the host
   Memory map and dynamic link libraries
◦ Process launcher on the host
◦ Direct communication with InfiniBand
◦ MPI library (not fully) running on Xeon Phi
◦ OpenMP environment with the Intel compiler
Features being developed and planned
◦ Hierarchical memory management
◦ PVAS, supporting the PGAS model
◦ Direct SSD
◦ Single OS kernel image for partitioned multiple lightweight kernels
Basic Evaluation of McKernel, by Yuji Saeki (Hitachi)
System calls on Heterogeneous OS
(Two compositions are considered: Attached and Built-in.)
■ Delegating system calls on McKernel (the lightweight kernel) to the Linux kernel to minimize cache pollution. ・When delegating a system call whose arguments point to data buffers, the data is transferred from McKernel to Linux.
ssize_t write(int fd , const void *buf , size_t count );
ssize_t read(int fd, void *buf, size_t count);
int open(const char *pathname, int flags, mode_t mode); ・・・ more than 300 Linux primitives
Delegation Mechanism on McKernel
(Figure: a write() issued on McKernel running on the Xeon Phi™ is forwarded, together with its arguments and data, over PCI-Express to the Linux kernel on the host, which executes the call and returns the result.)
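As a rough illustration of the delegation path above (not the actual IHK/McKernel code; the structure and the ihk_ikc_* names are assumptions made for this sketch), the McKernel-side trap handler only needs to pack the syscall number and its arguments and forward them to the host:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical IHK channel primitives; the real interface differs. */
extern int ihk_ikc_send(const void *buf, size_t len);
extern int ihk_ikc_recv(void *buf, size_t len);

struct syscall_request {
    uint64_t number;    /* e.g. the write() syscall number       */
    uint64_t args[6];   /* fd, buf, count, ... forwarded as-is   */
};

struct syscall_response {
    int64_t ret;        /* return value or -errno from host Linux */
};

/* Forward one system call from McKernel to the Linux side and wait. */
long delegate_syscall(uint64_t number, const uint64_t args[6])
{
    struct syscall_request req = { .number = number };
    struct syscall_response res;

    for (int i = 0; i < 6; i++)
        req.args[i] = args[i];

    ihk_ikc_send(&req, sizeof(req));   /* request to the host-side proxy  */
    ihk_ikc_recv(&res, sizeof(res));   /* block until the host replies    */
    return res.ret;
}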
Data transfer between Linux and McKernel
■ Memory copy
   The data structure depends on the API specification of each system call.
   → An individual implementation of system call delegation is required on both the Linux and McKernel sides.
■ Memory map to the same virtual address on Linux and McKernel
   → The system call can be invoked on the Linux side with the same argument values, without an individual implementation: write(fd, bbb, cnt) on McKernel becomes write(fd, bbb, cnt) on Linux.
   ※ Overlap of virtual addresses between Linux and McKernel must be avoided.
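A minimal user-level sketch of the memory-map approach (illustrative only: the device name /dev/mcos-mem and the physical-offset handling are assumptions, and page alignment of buf and cnt is glossed over). The Linux-side proxy maps the McKernel buffer at exactly the same virtual address before issuing the call with unchanged arguments:

#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t delegated_write(int fd, const void *buf, size_t cnt, off_t phys_off)
{
    int mem = open("/dev/mcos-mem", O_RDWR);   /* hypothetical device exposing McKernel memory */
    if (mem < 0)
        return -1;

    /* Map the McKernel pages at the same virtual address 'buf'. */
    void *p = mmap((void *)buf, cnt, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, mem, phys_off);
    if (p == MAP_FAILED) {
        close(mem);
        return -1;
    }

    ssize_t ret = write(fd, buf, cnt);         /* same argument values as on McKernel */

    munmap(p, cnt);
    close(mem);
    return ret;
}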
■ Pages used by a process on McKernel are mapped to the same position in the virtual memory space on Linux.
① The six arguments of the system call are forwarded from McKernel to Linux.
② mcexec (the process on Linux that launches a.out) invokes syscall(args).
③ An access to a virtual address used on McKernel causes a page fault on Linux → the handler gets the page table entries from McKernel → memory map.
⇒ It is not necessary to implement each system call individually.
※ mcexec is prevented from using the same pages as a.out → the Position Independent Executable is located at the bottom of the virtual address space.
Semi-automatic Delegation Mechanism
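The page-fault path in step ③ could be sketched as follows (mckernel_lookup_pte and map_phys_page are placeholders for the delegation-driver helpers, and the PTE bit layout is simplified); the point is that one generic fault handler replaces per-syscall argument marshalling:

#include <stdint.h>

typedef uint64_t pte_t;

/* Hypothetical helpers provided by the delegation driver. */
extern pte_t mckernel_lookup_pte(uint64_t mck_vaddr);
extern int   map_phys_page(uint64_t vaddr, uint64_t paddr, int prot);

/* Called when mcexec faults on an address owned by the McKernel process. */
int mcexec_fault_handler(uint64_t fault_vaddr)
{
    uint64_t page = fault_vaddr & ~0xFFFULL;     /* 4 KiB page base          */
    pte_t pte = mckernel_lookup_pte(page);       /* ask McKernel for its PTE */
    if (!pte)
        return -1;                               /* not mapped on McKernel   */

    uint64_t paddr = pte & ~0xFFFULL;            /* physical frame address   */
    int prot = (int)(pte & 0x7);                 /* present/write/user bits  */

    /* Install the same physical page at the same virtual address on Linux. */
    return map_phys_page(page, paddr, prot);
}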
(Figure: attached-mic write system call performance, 2013-Feb-14, kncc16; throughput (bytes/sec) versus size (bytes) for Xeon local, Phi memcpy, and Phi mmap.)
・Delegation to host Linux
・Delegated write is much slower than local write on host Linux … not using DMA engine on Xeon Phi™
Delegation throughput using mmap and memcpy shows the same tendency. mmap: 23.5 MB/sec memcpy: 24.8 MB/sec
Write throughput (Attached composition)
Host Linux local: 4434 MB/s
Experimental machine: Intel Xeon E5-2670 host, Intel preproduction Xeon Phi
・Delegation using memcpy is 2.5 times faster than mmap (due to page faults).
・Delegated read throughput is better than write, thanks to PCIe write-combining.
Read throughput (Attached composition)
(Figure: attached-mic read system call performance, 2013-Feb-15, kncc16; throughput (bytes/sec) versus size (bytes) for Xeon local, Phi memcpy, and Phi mmap.)
Read throughput reusing PTEs (Attached composition)
・When host Linux can reuse the PTEs (i.e., the test program is executed repeatedly), the throughput of mmap comes close to that of memcpy.
(Figure: attached-mic read system call performance, 2013-Feb-15, kncc16; throughput (bytes/sec) versus size (bytes) for Xeon local, Phi memcpy, and Phi mmap.)
(Figure: builtin-mic write system call performance, 2013-Feb-20, kncc14; throughput (bytes/sec) versus size (bytes) for MIC Linux local, McKernel memcpy, and McKernel mmap.)
・Delegation to the Linux kernel on Xeon Phi™
・Delegated write is about 2 times slower than local write under Linux on Xeon Phi™
・Delegation throughput using mmap is about 90% of that of memcpy, due to page faults
Write throughput (Built-in composition)
・When Linux on Xeon Phi™ can reuse the PTEs (i.e., the test program is executed repeatedly), the throughput of mmap comes close to that of local Linux on Xeon Phi™.
Write throughput reusing PTEs (Built-in composition)
(Figure: builtin-mic write system call performance, 2013-Feb-20, kncc14; throughput (bytes/sec) versus size (bytes) for MIC Linux local, McKernel memcpy, and McKernel mmap.)
(Figure: builtin-mic read system call performance, 2013-Feb-20, kncc14; throughput (bytes/sec) versus size (bytes) for MIC Linux local, McKernel memcpy, and McKernel mmap.)
・Delegation throughput using mmap is about 90% of that of memcpy.
・This is similar to the case of write().
Read throughput (Built-in composition)
・The semi-automatic delegation mechanism minimizes the cache pollution caused by system calls without implementing each system call individually, by mapping the pages used by a process on McKernel to the same position in the virtual memory space on Linux.
・The throughput of read/write system calls delegated to Linux using mmap is affected by the overhead of retrieving and inserting McKernel's page table entries.
   → To be improved by retrieving and inserting a set of PTEs at once.
Summary: basic evaluation of McKernel
MPI and Low-Level Communication, by Masamichi Takagi (NEC)
(Figure: the low-level communication library sits on IB Verbs over InfiniBand, Tofu, and DRAM, alongside process management; it exposes an API for building blocks and an API for new protocols, and MPICH sits on top with an optimized design for the network and a new protocol.)
Communication building blocks for next-generation supercomputers
• Enable optimized implementations for new network hardware
   e.g., one can implement bootstrap information exchange
   e.g., one can implement two-sided communication using the RDMA building block (see the sketch below)
• Provide an API for new protocols
   e.g., low-latency communication by scheduling multiple communications
A prototype design of the low-level communication library (LLC) is needed for its co-design.
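As a sketch of the two-sided-over-RDMA idea (the llc_rdma_write call and the eager-slot protocol below are assumptions made for illustration, not the project's actual LLC API): the sender deposits the payload into a pre-registered slot on the receiver with one-sided writes, finishing with a sequence-number flag that the receiver polls.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 4096

struct eager_slot {                     /* pre-registered on the receiver */
    char     payload[SLOT_SIZE];
    uint32_t len;
    volatile uint64_t seq;              /* written last: arrival flag */
};

/* Hypothetical one-sided building block: RDMA-write len bytes to a remote address. */
extern int llc_rdma_write(int dest_rank, uint64_t remote_addr,
                          const void *src, size_t len);

/* Sender: payload first, then length, then the flag; writes on one connection
 * are delivered in order, so the flag lands after the data. */
int llc_send(int dest, uint64_t remote_slot, const void *buf,
             uint32_t len, uint64_t seq)
{
    llc_rdma_write(dest, remote_slot + offsetof(struct eager_slot, payload), buf, len);
    llc_rdma_write(dest, remote_slot + offsetof(struct eager_slot, len), &len, sizeof(len));
    return llc_rdma_write(dest, remote_slot + offsetof(struct eager_slot, seq), &seq, sizeof(seq));
}

/* Receiver: poll the slot until the expected sequence number appears. */
uint32_t llc_recv(struct eager_slot *slot, uint64_t expect, void *buf)
{
    while (slot->seq != expect)
        ;                               /* spin until the flag is written */
    memcpy(buf, slot->payload, slot->len);
    return slot->len;
}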
Steps:
1. Design and implement a new network module using Xeon with InfiniBand
2. Port the process management functionality to Xeon Phi
3. Port the network module to Xeon Phi with InfiniBand
(Figure: MPICH structure: the MPI interface over the ADI interface with the Pamid and CH3 devices; CH3's Channel2 interface with Sock and Nemesis; Nemesis netmods including SCIF, TCP, Tofu, and Phi with IB; and the PMI interface connecting the UI executable with proxy executables.)
Prototyping by porting MPICH to Phi with IB
Machine
◦ Intel Xeon E5-2670, 2.601 GHz
◦ Mellanox ConnectX-3, 56 Gb/s
MPI libraries compared
◦ MVAPICH 1.9b
◦ MPICH 3.0.1 with our network module
Program
◦ MPI_Isend and MPI_Irecv ping-pong (see the sketch below)
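A minimal version of the ping-pong pattern used here (the iteration count and message size are illustrative, not the exact benchmark; run with exactly two ranks):

#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    static char sbuf[1 << 21], rbuf[1 << 21];   /* up to 2 MB messages     */
    int rank, size = 8;                         /* message size in bytes   */
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {                        /* send, then wait for the echo */
            MPI_Isend(sbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Irecv(rbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else {                                /* receive, then echo back */
            MPI_Irecv(rbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Isend(sbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }
    if (rank == 0)
        printf("%d bytes: %.2f usec round trip\n", size,
               (MPI_Wtime() - t0) / ITERS * 1e6);
    MPI_Finalize();
    return 0;
}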
Performance of the prototype is comparable to the state of the art.
Round trip latency
(Figure: round-trip latency (usec) versus message size (bytes) for MVAPICH and MPICH with the new netmod, shown for small (4-512 bytes), medium (16 KB-256 KB), and large (2 MB) messages.)
Comparable to MVAPICH
Metrics
• Operations per second
Methodology
Machines
◦ Intel Xeon E5-2670, 2.601 GHz, 8-core, 2-socket, 4 nodes
◦ Mellanox ConnectX-3, 56 Gb/s
MPI libraries compared
◦ MVAPICH 1.9b
◦ MPICH 3.0.1 with our netmod
Program
◦ NAS Parallel Benchmarks 3.3.1 MZ
◦ bt-mz, lu-mz, sp-mz
◦ OpenMP (8-thread) and MPI (4-process) hybrid parallelization
Comparable to MVAPICH
(Figure: operations per second (Mops/sec) for the bt-mz, lu-mz, and sp-mz benchmark programs, MPICH with the new netmod versus MVAPICH.)
Performance of the prototype is comparable to the state of the art.
Long-range Study on Future Architectures, by Hisanobu Tomari (U. of Tokyo)
Performance of integer instructions
− Power consumption associated with performance
On-chip network
− Latency between hardware threads
Integer instructions are:
− ALU (add, sub, ...) - used for address calculation
− Branch
− Load/store
Observed trend using Dhrystone
− Dhrystone results are close to CINT2006
− Runs on a wider range of hardware configurations
System power consumption
− CPU, memory
− PSU loss, fans
Measured between the AC outlet and the computer
Performance
◦ Integer performance of modern HPC processors is not that high
◦ The latest ARM processors would outperform them
Improvements are mainly driven by the increase in performance
Extrapolating data from computers after 1995:
◦ 2018: ~3000 VAX MIPS/Watt
◦ Translates to 0.33 nJ/instruction
If we assume 1 EFlop/s at 10 MW in 2018:
◦ Translates to 0.01 nJ/Flop
How are we going to fill this gap? (See the arithmetic below.)
◦ We must eliminate load/store and address calculation from the kernel loop - is that even possible?
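As a quick check of the two numbers above (simple unit arithmetic, not from the slides): 1 W / (3000 × 10^6 instructions/s) ≈ 0.33 nJ per instruction, while 10 MW / 10^18 Flop/s = 0.01 nJ per Flop, so the extrapolated energy per operation is roughly a factor of 30 too high.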
(Figure: per-socket NPB performance, OpenMP, compiler optimization only; performance of the OpenMP class A benchmarks bt, cg, ep, ft, is, lu, mg, sp, and ua relative to the Fujitsu SPARC64 IXfx, comparing SPARC64 IXfx (Fujitsu), Xeon E5-2687W (Intel), and BGQ.)
All three systems have similar theoretical FP performance
Due to integer performance, FP application performance suffers
Life is too short for optimizing every application for every supercomputer architecture
Advanced dynamic instruction scheduling is beneficial for HPC
Measure ping-pong latency between threads
◦ Repeat compare-and-swap
◦ Bind threads to the same/different physical core or socket
The benchmark is available at https://github.com/tomari/shmlatency (see the sketch below)
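A stripped-down illustration of the measurement (the real benchmark at the URL above differs; thread pinning with pthread_setaffinity_np and warm-up are omitted here). Two threads flip a shared flag back and forth with compare-and-swap, and the average time per iteration approximates one round trip:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Atomic int flag = 0;

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        int expect = 1;
        while (!atomic_compare_exchange_weak(&flag, &expect, 0))
            expect = 1;                    /* wait for the ping, then answer */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec t0, t1;

    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        int expect = 0;
        while (!atomic_compare_exchange_weak(&flag, &expect, 1))
            expect = 0;                    /* send a ping, wait for the pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %.1f ns\n", ns / ITERS);
    return 0;
}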
(Figure: thread-to-thread latency (ns) for same-core, same-die, and inter-socket thread placements, on Opteron 6282SE, SPARC T4, Xeon X5570, and POWER7.)
(Figure: latency (ns) across 1 to 16 hardware threads, comparing POWER7 and SandyBridge-E.)
(Figure: inter-processor latency (ns) on the Sun Enterprise 3000 and SGI Origin 2000, together with a sketch of the Origin 2000 topology of nodes (N) and routers (R).)
Large variations in on-chip latency
Latency is critical in synchronization
◦ More apparent when the thread count is higher
Inter-chip latency has been at the same level for 15+ years
Can we decrease latency?
We need more integer performance
◦ thereby increasing performance per watt
◦ key to the `easy-to-use' supercomputer
Latency is currently being studied
◦ Can we decrease latency for synchronizing a larger number of threads?
◦ What impact will that have on the exascale?
(Figure: roadmap over CY2013-2022: basic components, then advanced components and tuning, alongside the PostT2K machine.)
Activity at U. of Tokyo and U. of Tsukuba (besides this Feasibility Study)
• U. of Tokyo and U. of Tsukuba will install the so-called PostT2K machine, whose boundary conditions are up to 3 MW of electricity and up to 1000 m2 of space.
• A part of the system software stack shown in this presentation will be deployed. The machine will be used for production runs. This software stack will be extended towards a post-petascale machine.