TRANSCRIPT
ETP4HPC SRA-3 Kick-off meeting IBM IOT, Munich, March 20th 2017 Peter Bauer & Erwin Laure 4CoE
Centres of Excellence view on:
“What are the important research goals for the next 5 years”
EoCoE - Energy oriented Centre of Excellence for computer applications
BioExcel - Centre of Excellence for Computational Biomolecular Research
NOMAD - The Novel Materials Discovery Laboratory
MaX - Materials design at the eXascale
ESiWACE - Excellence in SImulation of Weather and Climate in Europe
E-CAM - An e-infrastructure for software, training and consultancy in simulation and modelling
POP - Performance Optimisation and Productivity
COEGSS - Center of Excellence for Global Systems Science
CompBioMed - A Centre of Excellence in Computational Biomedicine
1st Generation of CoE
• Collaborate with PRACE on:
  – best practices
  – software environment
  – organizational requirements for science domains
  – training
  – PRACE HLST
• Use PRACE resources for benchmarking etc.
• Use centre competence for co-design and exascaling (all tiers)
• Help communities access PRACE resources
• Collaborate with FET research projects to:
  – harden and integrate results to production quality
  – provide real-world test cases
• Provide for EsDs:
  – requirements
  – software optimized for targeted EsDs (also important at smaller scales)
  – co-design
Ecosystem: cPPP
Goals of CoE
Ensure European competitiveness in the application of HPC to address scientific, industrial or societal challenges:
• Excellence in research, development and service;
• User-driven, with the application users and owners playing a decisive role in governance;
• Integrated: encompassing not only HPC software but also relevant aspects of hardware, algorithms, data, workflows, connectivity, security, etc., covering the full scientific/industrial workflow and addressing full-scale, widely used applications with proven impact, not just application kernels or mini-apps;
• Multidisciplinary: with teams that combine application domain and user expertise with HPC system, software and algorithm expertise;
• Distributed, with a possible central hub, federating capabilities around Europe, exploiting available competences, and ensuring synergies with national/local programmes;
• Showing a clear path towards Exascale, covering both computing and extreme data;
• Committed to a culture of collaboration and sharing of best practice among CoEs on cross-cutting topics, such as co-design and Exascale technologies, software sustainability, and potential business models;
• Committed to collaborate with the upcoming Extreme Scale Demonstrators, providing them with a high-performance and scalable application base;
• Actively contributing to economic and societal benefit, by facilitating high-impact research and collaborating with industry and SMEs.
Present CoE
Vertical (domain) CoEs, which provide services to well-defined communities:
• Software development, support & consultancy, training
• Frontier developments towards exascale: requirements in Material Sciences, Weather & Climate or Bio may be very different

Horizontal (transversal) CoEs, which cover topics of importance to several vertical CoEs:
• Software development, support & consultancy, training
• Performance analysis: potentially complex interaction matrix
Future CoE
Vertical (domain) CoEs, which provide services to well-defined communities:
• Current themes
  + Engineering?
  + Fundamental physics?
  + Environmental sciences (adaptation to environmental change)?
Cover the entire value chain (from science to industry)
Compute and data (from exaflop to exabyte)

Horizontal (transversal) CoEs, which cover topics of importance to several vertical CoEs:
• Current themes
  + High-performance data analytics?
  + Software sustainability?
  + Mathematics and algorithms?
Enhance culture of collaboration (CoE matrix)
Future CoE: Considerations
• Potential merging of existing CoEs (cf. Flagships):
  – a smaller number of domains that have shown a clear path to exascale/co-design, or a larger variety?
  – may provide more efficient cross-community benefit towards exascale, but effectiveness may suffer (weak links affect a wider area)
  – providing support for merged communities will be challenging
• Importance of Exascale:
  – addressing key science challenges is most important (Exaflop)
  – technological developments (e.g. co-design, EsD top 3 in 2022) run away from applications
  – only few CoEs have really big data tasks (e.g. ESiWACE)
• Stimulation of SMEs:
  – smaller dedicated efforts may be more effective (e.g. SHAPE, Fortissimo)
  – unified software development vs. specialized SME niches
• Sustainability:
  – assimilating cutting-edge FET developments
  – the business model needs a simple/realistic approach (and public funding for some time to come)

Manage expectations of what CoEs will be able to achieve & sustain, given the available funding
CoE in SRA-2
• Section 8 EsD introduction, page 67: “These EsDs should provide platforms deployed by HPC centres and used by CoEs for their production of new and relevant applications.”
• Section 8.2, Proposal of ETP4HPC for EsD Calls: “This and other hardware characteristics (energy efficiency, I/O bandwidth, resiliency, etc.) will be detailed in the 2017 release of the SRA, also taking into account results from the FETHPC projects and requirements from the CoEs.”
• Section 9.3, Centres of Excellence for Computing Applications: “The Centres of Excellence in Computing Applications (CoEs) form one of the three pillars of the European HPC Ecosystem and represent the European Application expertise […]. ETP4HPC will be working to include the CoEs in the processes of the contractual Public-Private Partnership for HPC and synchronise their efforts with those of the other two pillars of the European HPC Ecosystem (i.e. ETP4HPC and the FETHPC projects, PRACE).”
Target for addressing key science challenges in weather & climate prediction: Global 1-km Earth system simulations @ ~1 year / day rate
Example: ESiWACE Science Challenge
[Figure: global model output at 10 km vs 1 km resolution]
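The stated target can be sanity-checked with a back-of-envelope estimate. All parameter values below (vertical levels, time step, per-point cost) are illustrative assumptions, not ESiWACE figures:

```python
# Rough compute estimate for a global 1 km Earth system simulation run at
# 1 simulated year per wall-clock day. All parameters are assumptions.

EARTH_SURFACE_KM2 = 5.1e8        # approximate surface area of the Earth
VERTICAL_LEVELS = 100            # assumed number of model levels
FLOPS_PER_POINT_STEP = 1e4       # assumed cost per grid point per time step
TIME_STEP_S = 5.0                # assumed stable time step at 1 km resolution
SECONDS_PER_DAY = 86400

grid_points = EARTH_SURFACE_KM2 * VERTICAL_LEVELS         # ~5e10 points
steps_per_sim_year = 365 * SECONDS_PER_DAY / TIME_STEP_S  # ~6.3e6 steps
total_flop = grid_points * steps_per_sim_year * FLOPS_PER_POINT_STEP

# 1 simulated year must complete in one wall-clock day:
sustained_flops = total_flop / SECONDS_PER_DAY
print(f"sustained rate: {sustained_flops:.1e} flop/s")
```

Under these assumptions the sustained requirement lands in the tens of petaflop/s; since real applications typically sustain only a few percent of peak, this points towards exascale-class machines for the 1 km target.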
Example: NOMAD Science and Data Handling Challenges
Discovering interpretable patterns and correlations in this data will:
• create knowledge,
• advance materials science,
• identify new scientific phenomena, and
• support industrial applications.
[Figure: schematic map of the NOMAD Archive, plotting # geometries per composition (1–10,000) against # compositions (1–1 million) over two descriptors A and B, with regions marked for transparent metals, photovoltaics, superconductors and thermal-barrier materials]
The NOMAD challenge: build a map and fill the existing white spots.
NOMAD supports all important codes in computational materials science. The code-independent Archive contains data from many million calculations (billions of CPU hours).
Data is the raw material of the 21st century.
Agent-Based Modelling for Synthetic Information System Simulation
• Higher scalability: extreme communication overheads
• Big data outputs: need for enhanced HPDA capabilities
Example: CoeGSS Science Challenge
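A toy agent-based model illustrates both points: every infected agent touches other agents each step (cross-process messages once the population is partitioned), and every step emits state that accumulates into large outputs. This is an illustrative sketch, not the CoeGSS software stack; all parameters are invented:

```python
# Toy agent-based epidemic model on a synthetic population.
# Contacts between agents are the source of communication overheads at
# scale; the per-step history is the source of big data outputs.
import random

def simulate(n_agents=10_000, steps=20, contacts_per_agent=5,
             p_infect=0.05, seed=42):
    rng = random.Random(seed)
    infected = [False] * n_agents
    infected[0] = True                # one initially infected agent
    history = []                      # per-step output: the HPDA part
    for _ in range(steps):
        new_infected = infected[:]
        for a in range(n_agents):
            if infected[a]:
                # each contact would be a cross-process message when the
                # population is distributed over many nodes
                for _ in range(contacts_per_agent):
                    b = rng.randrange(n_agents)
                    if rng.random() < p_infect:
                        new_infected[b] = True
        infected = new_infected
        history.append(sum(infected))
    return history

if __name__ == "__main__":
    h = simulate()
    print(h[-1], "infected after", len(h), "steps")
```

With millions of agents and realistic contact networks, the contact loop dominates communication and the history dominates I/O, which is exactly the scalability/HPDA tension on this slide.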
Molecular Dynamics on the exascale
• Understanding proteins and drugs
• A 1 μs simulation: 10 exaflop
• Many structural transitions: many simulations needed
• Study effect of several bound drugs
• Study effect of mutations
• All this multiplies to >> zettaflop
• Question: how far can we parallelize?
Example: ion channel in a nerve cell. Opens and closes during signalling. Affected by e.g. alcohol and drugs. 200,000 atoms.
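The flop counts on this slide can be checked with simple arithmetic. The per-atom step cost and the ensemble sizes below are illustrative assumptions:

```python
# Consistency check of the MD flop counts: one long simulation reaches
# ~10 exaflop, and ensembles multiply that towards zettaflop scale.

ATOMS = 200_000                   # ion channel system from the slide
TIMESTEP_FS = 2.0                 # typical MD integration step
SIM_TIME_US = 1.0                 # 1 microsecond trajectory
FLOPS_PER_ATOM_STEP = 1e5         # assumed, incl. long-range electrostatics

steps = SIM_TIME_US * 1e9 / TIMESTEP_FS              # 5e8 steps
flop_per_sim = steps * ATOMS * FLOPS_PER_ATOM_STEP   # ~1e19 = 10 exaflop

# Ensembles multiply: transitions x drug variants x mutations (assumed sizes)
ensemble = 1_000 * 100 * 10
total = flop_per_sim * ensemble                      # far beyond 1 zettaflop
print(f"per simulation: {flop_per_sim:.0e} flop; campaign: {total:.0e} flop")
```

The point of the last bullet follows directly: the individual simulations are independent, so the campaign parallelizes trivially across them, but within one trajectory the time steps are sequential, which limits strong scaling.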
CoE and SRA-3
Domain CoE: What are the key scientific challenges and what does it take to achieve them?
BioExcel CompBioMed MaX ESiWACE E-CAM
HPC System Architecture and Components
• Large-width vector units, low-latency networks, high-bandwidth and large memory; fast CPU<->accelerator transfer rates; heterogeneous acceleration; floating-point
• Hybrid systems, GPUs, high-bandwidth memory
• Scientific challenges: accurate computations requiring more powerful HPC systems; heterogeneous systems designed to optimize end-to-end workflows
• High-bandwidth memory, networks, NVRAM
• High memory bandwidth, large RAM
System Software and Management
• Dynamic (task) scheduling, support for workflows
• Dynamic scheduling, urgent computing
• Virtual machine model supported; dynamic scheduling, compilers
• Cross-compiling, archiving tools
Programming Environment
• Standardization, portability, task parallelism, fast code driven by Python interfaces
• Portability, ease of scale-out, MPI extensions
• Support for new MPI and OpenMP standards, plus new development tools like Python
• Fast standardization, DSLs
• Interactive testing, OpenCL support, sustainable support of standards
Energy and Resiliency
• Distributed computing techniques to handle resiliency/fault tolerance
• Reduced cost, improved fault tolerance
• Energy-optimized workflows: HPC systems including energy monitoring and profiling
• Fault handling, reduced precision
• Energy-aware algorithms
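Several Programming Environment entries above ask for task parallelism driven from Python interfaces. A minimal sketch of that pattern, with an invented placeholder kernel and illustrative chunking, could look like this:

```python
# Minimal dynamic task parallelism from a Python driver.
# The kernel here is a placeholder; a real code would call into a
# C/Fortran extension that releases the GIL (or use process workers).
from concurrent.futures import ThreadPoolExecutor, as_completed

def kernel(chunk):
    # stand-in for an expensive compute kernel
    return sum(x * x for x in chunk)

def run(data, workers=4, chunk_size=1000):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(kernel, c) for c in chunks]
        for f in as_completed(futures):   # consume results as they finish
            total += f.result()
    return total

if __name__ == "__main__":
    print(run(list(range(10_000))))       # same result as the serial sum
```

The scheduling here is dynamic: results are consumed in completion order rather than submission order, which is the property the "dynamic (task) scheduling" entries aim at across load-imbalanced work.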
CoE and SRA-3
Domain CoE: What are the key scientific challenges and what does it take to achieve them?
POP CoEGSS NOMAD EoCoE
HPC System Architecture and Components
• Throughput-oriented devices (vectors), memory architectures and how to use them, architectural support for runtimes, mechanisms to monitor progress and notify runtimes in case of resource preemption
• Convergence between HPC & HPDA, NVRAM, fast networks
• For extreme-scale simulations: powerful compute nodes, high-bandwidth memory, low-latency networks. For data analytics: convergence between HPC & HPDA
• Accelerators with shared memory, FPGAs, processor-in-memory, NVRAM, large memory
System Software and Management
• Dynamic, interactive use of available resources; tight, bidirectional communication/cooperation between job schedulers and runtimes
• Dynamically scaling jobs, integration of HPC & HPDA, visualization (in situ), data analytics (in situ)
• Containers, scalable databases, interactive access to subsets of resources, remote visualization, synchronization of big data sets between different computer centres in different countries and continents
• Complex, light and flexible workflows (Python-based notebooks may be a lighter alternative to heavyweight solutions); easy access to large HPC resources; runtimes; heterogeneous job handling for coupled codes/libraries
Programming Environment
• Programming-model and runtime support for malleability, asynchronous/out-of-order task execution, hiding heterogeneity and tolerating latency and variability, powerful performance analytics in tools, tools for task dependencies and memory access patterns, shifting programmers' mindset from a bottom-up, latency-dominated to a throughput-oriented mentality
• Well-defined standards (and tools that implement the standards) for agent-based modelling, compilers, debuggers
• MPI, OpenMP, GPU programming, Python, tools for profiling and optimization of scalable databases
• Task programming; automatic optimization (e.g. BOAST); domain-specific languages to test new algorithms more easily; performance analysis tools; automated, continuous performance monitoring of production runs
Energy and Resiliency
• Better integration between algorithm-based fault-detection techniques and mechanisms in the infrastructure to recover from detected errors
• Not important for CoeGSS
• Energy-aware system operation, algorithms and workflows
• Fault tolerance; in situ analysis
CoE and SRA-3 (cont’d)
Domain CoE: What are the key scientific challenges and what does it take to achieve them?
BioExcel CompBioMed MaX ESiWACE E-CAM
Balance Compute, I/O and Storage Performance
Post-processing on the fly, Data-focused workflows, handling lots of small files in bioinformatics
Post-processing on the fly, easy transfer between storage tiers
High-throughput materials-science workloads quickly become memory- and I/O-bound: systems with high IOPS and post-POSIX data objects are required.
Post-processing on the fly, multi-tier software
Fast access of archive data, multi-platform workflows, multi-threaded applications for hybrid production/analysis applications
Big Data and HPC usage Models
Proximity of data generation and analysis/visualization resources, workflows, machine learning for analyzing simulation data, high-throughput sampling
Analytics of simulation outputs, visualisation
Workflows, intelligent data analytics
Recomputing, data analytics
Fast access to data bases, data mining
Mathematics and algorithms for extreme scale HPC systems
Multi-scale algorithms, task-parallel algorithms, Electrostatics solvers, ensemble sampling & clustering theory, ensemble simulations
Novel time stepping algorithms, automated implementation of multiscale computing patterns
New algorithms avoiding synchronous (unnecessary) data dependencies and exploiting the irreducible data-dependency tree (nesting), to improve concurrency and locality
Disruptive numerical methods (discretization), data placement
Memory/cache aware algorithms, asynchronous algorithms, efficient handling of long range/collective correlations
CoE and SRA-3 (cont’d)
Domain CoE: What are the key scientific challenges and what does it take to achieve them?
POP CoEGSS NOMAD EoCoE
Balance Compute, I/O and Storage Performance
Integration of asynchronous I/O interface in programming model, better integration of programming models/languages and persistent storage
Converged systems, Live data analytics, Strong data movement capabilities
Data-centric workflows, high I/O bandwidth, handling huge number of files
In situ data analysis coupled to a dedicated I/O system; high-bandwidth, large-capacity storage
Big Data and HPC usage Models
Better integration between programming model and storage interface, more dynamic, interactive supercomputing practices, make users/programmers aware of cost/benefit of each individual data and computation for better resource/storage scheduling
HPDA Platform support, Algorithms / Models for efficient HPDA
High-performance data analytics of simulation outputs and derived data, faster analysis of complex and big data structures (map-reduce, faceted search), fast and efficient data mining. Significant advances in hardware and software for handling big data more efficiently
Data-analysis techniques from the Big Data world (robust solutions with fault tolerance); reproducibility and ease of calculations based on notebooks; high-throughput computing
Mathematics and algorithms for extreme scale HPC systems
Algorithm complexity (computation/communication), asynchrony and variability tolerance, algorithm-based fault detection
Algorithms for efficient data analytics
Novel scalable algorithms for extreme scale simulations and for extreme scale data analytics
Parallel dynamics (molecular or climate); multiscale modelling with better coupling between codes, e.g. QM/MM; linear algebra; parallel-in-time algorithms
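The parallel-in-time entry can be illustrated with a minimal parareal sketch for a scalar ODE y' = λy. The propagators and parameters are illustrative; the fine solves within one iteration are independent and could run concurrently across time slices, though here they run serially for clarity:

```python
# Minimal parareal iteration: a cheap coarse propagator corrects a set of
# expensive fine propagators that are independent across time slices.
import math

LAM, T, N = -1.0, 1.0, 10           # problem y' = LAM*y on [0, T], N slices
DT = T / N

def coarse(y):                      # one explicit Euler step per slice
    return y * (1.0 + LAM * DT)

def fine(y, substeps=100):          # many Euler substeps per slice
    h = DT / substeps
    for _ in range(substeps):
        y = y * (1.0 + LAM * h)
    return y

def parareal(iterations=5):
    U = [1.0]                       # initial serial coarse sweep
    for n in range(N):
        U.append(coarse(U[n]))
    for _ in range(iterations):
        F = [fine(U[n]) for n in range(N)]      # parallelizable across slices
        G_old = [coarse(U[n]) for n in range(N)]
        newU = [1.0]                # serial correction sweep (cheap)
        for n in range(N):
            newU.append(coarse(newU[n]) + F[n] - G_old[n])
        U = newU
    return U

U = parareal()
print(abs(U[-1] - math.exp(LAM * T)))   # error vs exact solution exp(-1)
```

After a few iterations the result matches the serial fine solver, but each iteration's fine work is spread over N slices, which is exactly the extra concurrency parallel-in-time methods trade against the per-slice dependency of time stepping.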
Top 3 Challenges – VERY EARLY DRAFT
• HPC System Architecture and Components
  – Efficient use of memory and I/O hierarchies (links to Balance Compute, I/O and Storage Performance)
  – Efficient interaction between “fat” and “thin” (GPU) cores
• System Software and Management
  – Software standards (C++17 and Fortran 2015 in particular, but also OpenMP 4.5, MPI 3.1, OpenCL 2.2, ...)
• Programming Environment
  – (Dynamic) environments for task parallelism
CoE and SRA-3/EsD
Assumption: EsDs will be precursors of European exascale HPC facilities, hosting novel technologies and software stacks.
EsDs are FLOP-focused; CoE application areas may not be! Extreme-scale data analytics is not covered yet!
If the key role of CoEs is to provide the wider (domain) user community with expertise, software etc.:
1. How is the transition from ‘novel’ to ‘applicable’ to be managed? What is the role of FET projects in this context? (There is also no certainty of sustainability of FET projects per domain.)
2. How are key applications, to be run on EsDs in operational mode, adapted to novel architectures, stacks, programming models etc.? By whom? With what funding?
3. If future CoEs become wider (less domain-specific):
   a. how can the above be realized?
   b. will the horizontal CoEs do the job?