update on hpc efforts in nudt - exascale– high performance scalable interconnection – balance...

15
National University of Defense Technology Tianhe 天河 Update on HPC Efforts in NUDT Yutong Lu School of computer science National University of Defense Technology

Upload: others

Post on 17-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河

Update on HPC Efforts in NUDT

Yutong  Lu  

School  of  computer  science  

 National  University  of  Defense  Technology  

Page 2: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河 Outline

l Roadmap

l Hardware

l Software

l Application

Page 3: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河 Overview of TH Roadmap

10P 100P 1000P 1P

2014

2015

2020

year

Performance

2009

2010 NUDT Changsha, NSCC-CS NSCC-TJ Tianjin

Page 4: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河 Roadmap l  12-FiveY project ( -2015)

–  System l  100P l  funding

–  MOST (863) –  Local government

–  Software(year 2011) l  NSFC 37 million for basic algorithm and computable model l  MOST(863) 80 million for Domain applications

l  13-FiveY project ( -2020) –  ~ 1 E –  Funding ?

Page 5: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河 Highlights of future TH systems

l Heterogeneous parallel architecture

l Multiple-dimension interconnection network

l Hierarchy I/O storage system

l Autonomic fault tolerant management

l Domain specific programming framework

l Adaptive power aware computing

Page 6: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河

l Multiple-dimension interconnection network –  Support more than 100,000 nodes –  Bandwidth will be improved a lot –  Enhance collective communication

System Level

Rack Level

Blade Level

Hardware

Page 7: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河

l Inter-chip optical connection –  Optical interface between

processors

–  Optical switch between CPUs is under research

Power

High-speed interconne

ction chip

High-speed interconne

ction chip

Parallel optical coupled transfer

High-speed electric interface

Mem

ory

Mem

ory

cpu7 cpu8

cpu1

cpu2

cpu3 cpu4

cpu5

cpu6

Optical switch array

Detector

Modulator

Laser

Hardware

Page 8: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河

Large-scale Hybrid Tiered File System Architecture

l  Scalability to achieve >1TB/s I/O bandwidth by leveraging spatial locality l  Usability by federating multi-level storage into unified name space l  Flexibility by key components re-configuration for application optimization l  Applicability for supercomputers and clusters with hybrid infrastructure

Software

Page 9: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河 Resilience computing Framework

l  Capability to support post-petaflops and Exascale computing

l  Collaboration with whole system software stack

l  Coherent fault detection l  Coordinate fault tolerant decision l  Cooperation of multiple fault

recovery mechanics l  Combination of proactive and

reactive strategies l  Customizable fault detection,

prediction and recovery approaches l  Support various parallel models

Software

Page 10: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河

Data Dependency

extract

Data Structures

Promote

Communications

Load Balancing support

Fault tolerance

Parallel Computing

Models form

separate Models Stencils

Algorithms

Special Library

Models Stencils

Algorithms

Common

Computers

Parallel  programming  Framework  

Software

Application code

Page 11: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河

Data Dependency

extract

Data Structures

Promote

Communications

Load Balancing support

Fault tolerance

Parallel Computing

Models form

separate Models Stencils

Algorithms

Special Library

Models Stencils

Algorithms

Common

Computers

Parallel  programming  Framework  

Software

Application code

Hide the complexity of programming(mil cores/hybrid) Integrate the efficient parallel fast numerical algorithms Provide efficient data structures and solver libraries Support software engineering for code reusing USERS

Page 12: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河

Data Dependency

extract

Data Structures

Promote

Communications

Load Balancing support

Fault tolerance

Parallel Computing

Models form

separate Models Stencils

Algorithms

Special Library

Models Stencils

Algorithms

Common

Computers

Parallel  programming  Framework  

Software

Application code

Optimization interface Heterogeneous issues Communication organization Load balance Fault tolerance System software developers

Page 13: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河 Applications

l Five priority areas

–  Climate & Environment

–  Bio-medicine

–  New energy

–  Equipment engineering

–  Animation

Page 14: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河 Summary

l Towards Next generation of TH system –  Heterogeneous architecture

–  New enabling technology

–  High performance scalable interconnection

–  Balance the computing and data accessing

–  Feasible fault tolerant mechanics

–  Usable domain-specific programming framework

–  Selected priority application areas

Page 15: Update on HPC Efforts in NUDT - Exascale– High performance scalable interconnection – Balance the computing and data accessing – Feasible fault tolerant mechanics – Usable

National University of Defense Technology

Tianhe

天河

Thank you