Copyright 2011 FUJITSU LIMITED
Advanced Software for the Supercomputer PRIMEHPC FX10
Copyright 2011 FUJITSU LIMITED
System Configuration of PRIMEHPC FX10
File
management
nodes
Job
management
nodes
Control
nodes
System
integration
node
Login
nodes
User Management nodes
Administrator
Global file system (Data storage area)
Local file system (Temporary area occupied by jobs)
Compute nodes
6D mesh/torus Interconnect
IO network (IB), management network (GbE)
• Login
• Compilation
• Job submission
• Data transfer to/from global
file system
• Data communication for
system job operations
management
• System operations management
• Job operations management
1
Copyright 2011 FUJITSU LIMITED
System Software Stack
User/ISV Applications
High-performance file system
Lustre-based distributed file system High scalability IO bandwidth guarantee High reliability & availability
HPC Portal / System Management Portal
PRIMEHPC FX10
File system, operations management Application development environment
System operations management
System configuration management System control System monitoring System installation & operation
Job operations management
Job manager Job scheduler Resource management Parallel execution environment
VISIMPACTTM
Shared L2 cache on a chip Hardware intra-processor
synchronization
Compilers
Support Tools
MPI Library Scalability of High-Func. Barrier Comm.
IDE Profiler & Tuning tools Interactive debugger
Hybrid parallel programming Sector cache support SIMD / Register file extensions
Linux-based enhanced Operating System Enhanced hardware support System noise reduction Error detection / Low power
2
Copyright 2011 FUJITSU LIMITED
OS (Linux-based enhanced Operating System)
Easy existing application porting
POSIX API: Linux kernel 2.6.x, glibc 2.x
High performance / High scalability
Enhanced hardware support
CPU registers, Large memory page, High speed interconnect
Reduce system noise in highly parallel program
Inter-node OS scheduling
High availability / low power consumption
Hardware error detection / isolation
memory patrol, io driver enhance.
CPU suspend during system idle state.
Node 1
Node 2
Node 3
Node 4
Synchronous daemon scheduling
Application running
idle wait Daemon services = System noise
Node 1
Node 2
Node 3
Node 4
Application running
Idle → CPU suspend Job running
3
Copyright 2011 FUJITSU LIMITED
System Software Stack
User/ISV Applications
High-performance file system
Lustre-based distributed file system High scalability IO bandwidth guarantee High reliability & availability
HPC Portal / System Management Portal
PRIMEHPC FX10
Linux-based enhanced Operating System
File system, operations management Application development environment
System operations management
System configuration management System control System monitoring System installation & operation
Job operations management
Job manager Job scheduler Resource management Parallel execution environment
VISIMPACTTM
Shared L2 cache on a chip Hardware intra-processor
synchronization
Compilers
Support Tools
MPI Library Scalability of High-Func. Barrier Comm.
IDE Profiler & Tuning tools Interactive debugger
Hybrid parallel programming Sector cache support SIMD / Register file extensions
Enhanced hardware support System noise reduction Error detection / Low power
4
System Operations Management
Copyright 2011 FUJITSU LIMITED
Hierarchical structure for
efficient system operation and
adaptability to large-scale
systems
The load is distributed by using
the job management sub node.
Easy to operate with a single
system image
The system is efficiently
operated by dividing a logical
resource partition named
“Resource Unit”.
Job management node Job operations
management
Easy to operate
with single
system image
System administrator
Compute nodes
Compute nodes
IO node IO node
・・・
・・・
Compute nodes
Compute nodes
IO node IO node
・・・
・・・
Tofu interconnect
- Power control - Hardware monitoring - Software service monitoring
による効率的な運用
Cluster Control node
Job manage sub node Job manage sub node
Nodegroup#1 Nodegroup#2
Hierarchical
structure for
efficient
operation
Resource Unit #1 Resource Unit#2 階層化による効率的な
Logical Resource Partition 5
Copyright 2011 FUJITSU LIMITED
High Availability System
The important nodes have redundancy
Control node
Job management node
Job management sub node
File servers
For example : right figure
Continuing job execution even if the job
management node is in failed status
The job data always synchronizes between
active node and stand-by node.
Alternatively to stand-by node if active node is
down.
Job management nodes
JOBs
active stand-by data
sync user
failure
active
6
Copyright 2011 FUJITSU LIMITED
Job Operations Environment
Efficient resource usage
Flexible job scheduling based on prioritized resource assignment
Interconnect topology-aware resource assignment
Backfill scheduling for keeping the resources busy
Asynchronous file staging
High availability
Avoids assigning faulty resources to jobs
disconnects faulty nodes from job operations
Management nodes with failover support
Time
Job A Job B
Job C Running job
Job C
T0
T1
Job A Job B
Job C
Running job Job C
Now t1 t2 t3
Backfilling
disabled
Backfilling
enabled
7
Copyright 2011 FUJITSU LIMITED
Resource Assignment
Interconnect topology-aware resource assignment
Treats 12 compute nodes as one interconnect unit
Assigns cubic-shaped interconnect unit(s) to a job
Using adjacent interconnect unit(s) is suitable for contiguous communication,
and also avoids interfering with other jobs.
Optimizes the alignment of resources
Rotating the cubic-shaped interconnect units This improves total system
utilization by rotating the cubic shaped interconnect units.
In-use unoccupied
x
z y 8 6
6 6 8
8 8 4
4 6 4 4
8
Copyright 2011 FUJITSU LIMITED
System Software Stack
User/ISV Applications
High-performance file system
Lustre-based distributed file system High scalability IO bandwidth guarantee High reliability & availability
HPC Portal / System Management Portal
PRIMEHPC FX10
Linux-based enhanced Operating System
File system, operations management Application development environment
System operations management
System configuration management System control System monitoring System installation & operation
Job operations management
Job manager Job scheduler Resource management Parallel execution environment
VISIMPACTTM
Shared L2 cache on a chip Hardware intra-processor
synchronization
Compilers
Support Tools
MPI Library Scalability of High-Func. Barrier Comm.
IDE Profiler & Tuning tools Interactive debugger
Hybrid parallel programming Sector cache support SIMD / Register file extensions
Enhanced hardware support System noise reduction Error detection / Low power
9 9
Copyright 2011 FUJITSU LIMITED
High Scalability
Achieved high-scalable IO performance with multiple OSSes.
Parallel IO (MPI-IO)
Shared File
Compute
Node
Compute
Node
Compute
Node
Compute
Node
OSS OSS OSS OSS
File
Compute
Node
Compute
Node
Compute
Node
Compute
Node
Single Stream IO
File File File
OSS OSS OSS OSS
Adapted various IO model
Add Server&Storage
Number of servers
Thro
ughput
OSS: Object storage server
File
Compute
Node
Compute
Node
Compute
Node
Compute
Node
Master IO
File File File
OSS OSS OSS OSS
10
Copyright 2011 FUJITSU LIMITED
Fair Share QoS: Sharing IO bandwidth with all users.
Login Node File Servers
User A
User B
IO Bandwidth
IO Bandwidth Guarantee
Without Fair Share QoS With Fair Share QoS
Not Fair
User A
User B
Fair
Best Effort QoS: Utilize all IO bandwidth exhaustively. Occupied by one client Shared by all clients
Client(s) A
Client(s) B
Client(s)
File Servers
11 11
Copyright 2011 FUJITSU LIMITED
High Reliability and High Availability
Avoiding single point of failure by redundant hardware and failover
mechanism.
OSS (Active)
OSS (Active)
RAID RAID
MDS OSS
RAID
IB SW IB SW Network path
Disk path
Dual Server
RAID
Failover
MDS (Active)
MDS (Standby)
Failover
Compute Node
File Management
Server
Monitoring
& Managing
Software
12
System Software Stack
User/ISV Applications
High-performance file system
Lustre-based distributed file system High scalability IO bandwidth guarantee High reliability & availability
HPC Portal / System Management Portal
PRIMEHPC FX10
Linux-based enhanced Operating System
File system, operations management Application development environment
System operations management
System configuration management System control System monitoring System installation & operation
Job operations management
Job manager Job scheduler Resource management Parallel execution environment
Compilers
MPI Library Scalability of High-Func. Barrier Comm.
Hybrid parallel programming Sector cache support SIMD / Register file extensions
Enhanced hardware support System noise reduction Error detection / Low power
Support Tools IDE Profiler & Tuning tools Interactive debugger
VISIMPACTTM
Shared L2 cache on a chip Hardware intra-processor
synchronization
Copyright 2011 FUJITSU LIMITED 13
VISIMPACTTM
(Virtual Single Processor by Integrated Multi-core Parallel Architecture)
Mechanism that treats multiple cores as one high-speed CPU Easy and efficient execution of inter-
core thread parallel processing with a
multi-core CPU
Supports the realization of a highly-
efficient Hybrid model (Automatic
parallelization + MPI)
Large-capacity shared L2 cache
memory decrease in the influence of
false sharing
Inter-core hardware barrier facilities 6-10
times faster than conventional software
barrier
CPU technologies
Memory
CPU
Core
L2$
Process
Core
L2$
Process
Memory
CPU
Core
Core
L2$
Process Inter-core
thread parallel
processing
Core • • •
Barrier synchronization
Thread 1
Hardware barrier synchronization: 10
times faster than conventional system
Tim
e
• • •
• • •
Core
Thread 2
Core
Thread N
Copyright 2011 FUJITSU LIMITED 14
System Software Stack
User/ISV Applications
High-performance file system
Lustre-based distributed file system High scalability IO bandwidth guarantee High reliability & availability
HPC Portal / System Management Portal
PRIMEHPC FX10
Linux-based enhanced Operating System
File system, operations management Application development environment
System operations management
System configuration management System control System monitoring System installation & operation
Job operations management
Job manager Job scheduler Resource management Parallel execution environment
VISIMPACTTM
Shared L2 cache on a chip Hardware intra-processor
synchronization
Enhanced hardware support System noise reduction Error detection / Low power
Support Tools
Compilers
MPI Library Scalability of High-Func. Barrier Comm.
Hybrid parallel programming Sector cache support SIMD / Register file extensions
IDE Profiler & Tuning tools Interactive debugger
Copyright 2011 FUJITSU LIMITED 15
Programming Model for High Scalability
Hybrid parallelism by VISIMPACT
and MPI library
VISIMPACT
•Automated multi-thread parallelization
•High performance thread barrier used
Inter-core hardware barrier facility
MPI library
•High performance collective
communications used Tofu barrier
facility
Scalability of Himeno benchmark(XL size)
0
2
4
6
8
10
12
14
16
18
20
1,000 10,000 100,000Number of cores
Performance
ratio
Hybrid + Tofu barrier
Hybrid
Flat MPI
Quotation from K computer performance data
Himeno benchmark detail (65536 Core)
0
0.2
0.4
0.6
0.8
1
1.2
Hybrid + Tofu
barrier
Hybrid Flat MPI
Time Ratio
Collective communication
Neighbor Communication and Calculation
Quotation from K computer performance data
Copyright 2011 FUJITSU LIMITED 16
Customized MPI Library for High Scalability
Point-to-Point communication
•Use a special type of low-latency path
that bypasses the software layer
•The transfer method optimization
according to the data length, process
location and number of hops
Collective communication
•High performance Barrier, Allreduce,
Bcast and Reduce used Tofu barrier
facility
•Scalable Bcast, Allgather, Allgatherv,
Allreduce and Alltoall algorithm
optimized for Tofu network Copyright 2011 FUJITSU LIMITED
Barrier / Allreduce Performance
0
10
20
30
40
50
60
70
80
90
100
0 1,000 2,000 3,000 4,000 5,000 6,000 7,000 8,000 9,000 10,000
Number of Nodes (48x6xZ)E
lap
se
d T
ime
(m
icro
se
co
nd
s)
Barrier (Tofu Barrier) Allreduce (Tofu Barrier)
Barrier (software) Allreduce (software)
Quotation from K computer performance data
17
Compiler Optimization for High Performance
Instruction-level parallelism with SIMD instructions
Improvement of computing efficiency used Expanded registers
Improvement of cache efficiency used Sector cache
0
0.2
0.4
0.6
0.8
1
1.2
Memory wait Cache missesOperation wait Instructions committed
NPB3.3 LU Execution time comparison (relative values)
FX1 PRIMEHPC FX10
Faster
Efficient use of
Expanded registers
reduces Operation wait
0
0.2
0.4
0.6
0.8
1
1.2
Memory wait Cache missesOperation wait Instructions committed
NPB3.3 MG Execution time comparison (relative values)
FX1 PRIMEHPC FX10
Faster
SIMD implementation
reduces Instructions
committed
Copyright 2011 FUJITSU LIMITED 18
Application Tuning Cycle and Tools
Execution
Job
Information
Open Source
Tools
Overall
Tuning
RMATT
Tofu-PA
MPI Tuning
CPU Tuning
PAPI
Vampir-trace
Profiler
Profiler
Vampir-trace
FX10 Specific
Tools
Profiler snapshot
Copyright 2011 FUJITSU LIMITED 19
Copyright 2011 FUJITSU LIMITED 20