
Best practices – Hermit PRACE Spring School 2012-05-16

Stefan Andersson

[email protected]

Agenda

• HLRS

• The hardware parts of the Cray XE6

Node (processor, interconnect)

Packaging

• XE6 Software

CLE (Cray Linux Environment) ESM, CCM

CCE (Cray Compiler Environment)

Other Compiler Environments (shorter)

• How to submit a job


Leading Edge HPC infrastructure in Germany

HLRS is one of the three national supercomputing centers in Germany responsible for engineering and industrial simulation

The national supercomputing centers are working together in the Gauss Centre for Supercomputing (GCS)

GCS is the means to contribute to the Partnership for Advanced Computing in Europe (PRACE)

The BMBF project petaGCS is the source for 50% of funding for investment and operation

The remaining 50% are provided by the Ministry of Science, Research and the Arts Baden-Württemberg

Hermit Phase1 Step1 (2011)

System Design Overview


[Figure: system design overview. Components shown: Hermit Phase1 Step1 (2011); Hermit Phase1 Step 1b GPGPU Add-on; Hermit Phase1 Step 2 (2013); External Login Server; Remote Visualization Server; Pre- & Postprocessing Server; Fast Local Storage; Parallel Filesystem (Lustre); HLRS wide shared NAS Home Space; Other HLRS Server; Storage and Archive]

Phase 1 Step 1 Overview


Configuration:

Peak Performance ~ 1PF

38 racks with 96 nodes each

96 service nodes and 3552 compute nodes

Each compute node has 2 sockets, AMD Interlagos @ 2.3GHz with 16 cores each, leading to 113,664 cores

Nodes with 32GB and 64GB memory reflecting different user needs

2.7PB storage capacity @ ~ 150GB/s IO bandwidth

External Access Nodes, Pre- & Postprocessing Nodes, Remote Visualization Nodes

~2MW maximal power consumption

Support for ISV codes, depending on the application, under ESM (Extreme Scalability Mode, „native“) or CCM (Cluster Compatibility Mode)
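The configuration figures can be cross-checked with a little arithmetic. The following is a minimal sketch in Python (not a language used in the deck). It assumes 4 double-precision floating-point operations per core per cycle, a value that is not stated on this slide but reproduces the 295 Gflops per node quoted later in the deck. All variable names are illustrative.

```python
# Cross-check of the Phase 1 Step 1 configuration figures.
RACKS = 38
NODES_PER_RACK = 96
SERVICE_NODES = 96
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 16
CLOCK_GHZ = 2.3
# Assumption (not on the slide): 4 double-precision flops per core per cycle.
FLOPS_PER_CORE_PER_CYCLE = 4

total_nodes = RACKS * NODES_PER_RACK
compute_nodes = total_nodes - SERVICE_NODES
cores = compute_nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
# Gflops summed over all cores, converted to Pflops.
peak_pflops = cores * CLOCK_GHZ * FLOPS_PER_CORE_PER_CYCLE / 1e6

print(f"nodes={total_nodes} compute={compute_nodes} cores={cores}")
print(f"peak ~ {peak_pflops:.2f} PF")
```

Under that assumption the node, core and peak figures on the slide are mutually consistent.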

CRAY <-> HLRS Collaboration

CRAY and HLRS have set up a Cray development center in Stuttgart with on-site staff for production & joint research

Work closely with the users to port and optimize for the big installation Step 1 in Q3/2011

Joint development and definition of the details of Phase 1 Step 2 based on

Results of joint optimization and scaling efforts on Phase 1 Step 1

Experiences with accelerators on Step 1b

Target is a tailored Step 2 system for HLRS’ industrial and academic users

HLRS-CRAY Collaboration

<2011

Cray XE6 Phase1 Step0: Testsystem, 2010

Cray XE6 “Hermit1”: Phase1 Step1, ~1PF Peak, Q3/2011

Update of “Hermit1” with 32 Nodes CRAY XK6: 2012

Cray Cascade “Hermit2”: Phase1 Step2, ~4-5PF Peak, 2013

A glimpse on Phase 1 Step 2 – Hermit 2


Step 2 will run in parallel to Step 1 realizing an integrated system

Goal is to maintain similar software stack for both installation steps

Expected architectural changes

Next generation interconnect

Newest generation of CPUs

Partially relying on accelerators

Updated storage infrastructure

Additional external servers

Significantly increased sustained application performance

Scheduled for Autumn 2013

Overall peak performance of complete Phase 1 will be >5PF

Specification is driven by agreed sustained application performance and not peak performance, though

Differentiation Strategy

Unique resources in terms of size and architecture (tier-0 systems)

High level of expertise in emerging technologies / Technology Watch (e.g. WP9 prototypes)

Consultancy in Porting/Optimisation

Layer integration

Unification of access models, application procedures across the tier-x

Co-Development

Collaboration approaches aiming to shorten time to market in Software, Hardware, Solutions/Architecture

Training and Consulting

Lower the barrier to exploit available resources

Support selection of most appropriate system for the problem size (tier-x)

Training of trainers

Three year EU-funded collaborative project, 13 partners, €12 million costs, €8.5 million funding

Collaborative Research into Exascale Systemware, Tools and Applications

Project coordinator: EPCC at The University of Edinburgh

CRESTA has a very strong focus on exascale software challenges

Uses a co-design model of applications with exascale potential interacting with systemware and tools activities

The hardware partner is Cray

Applications represent broad spectrum from science and engineering (OpenFOAM!)

CRESTA will compare and contrast incremental and disruptive solutions to Exascale challenges


• Towards EXascale ApplicaTions (TEXT)

• EU-funded CP & CSA in FP7-Infrastructures-2010-2

• Partners: 4 HPC Centers, 4 Universities, 1 Industrial

• Centered around StarSS programming model from BSC:

#pragma css task input(v1, v2, len) output(v3)

void vadd (float *v1, float *v2, float *v3, int len)

{ ... }

• Project Goal: Apply StarSS to a set of applications

• Hybrid Parallelization using StarSS and MPI for:

  • BEST / LBC Lattice Boltzmann codes (Fortran)

  • LS1-Mardyn MD code (C++)
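The `input(...)` and `output(...)` clauses in the pragma are what allow a task-based runtime to work out which task invocations depend on each other. The following Python sketch only illustrates that idea. It is not the StarSS runtime or its API, it tracks only read-after-write dependencies, and the task and data names are hypothetical.

```python
def build_dependencies(tasks):
    """Derive read-after-write edges from declared inputs and outputs.

    tasks: list of (name, inputs, outputs) tuples in program order.
    Returns a set of (producer, consumer) pairs.
    """
    last_writer = {}  # data name -> task that most recently wrote it
    edges = set()
    for name, inputs, outputs in tasks:
        for data in inputs:
            if data in last_writer:
                edges.add((last_writer[data], name))
        for data in outputs:
            last_writer[data] = name
    return edges

# Hypothetical sequence of vadd-like task invocations.
tasks = [
    ("vadd_1", ["a", "b"], ["c"]),  # c = a + b
    ("vadd_2", ["d", "e"], ["f"]),  # f = d + e, independent of vadd_1
    ("vadd_3", ["c", "f"], ["g"]),  # g = c + f, needs both earlier results
]

edges = build_dependencies(tasks)
for producer, consumer in sorted(edges):
    print(f"{producer} -> {consumer}")
```

In this toy example `vadd_1` and `vadd_2` share no data and could run concurrently, while `vadd_3` has to wait for both.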


The Cray XE6 node

We are concentrating on the XE6 installed at HLRS, which is called ‘hermit1’

There are other XE6 models using different processors and different interconnect topologies, which we don’t cover in this presentation

We start by introducing the node parts (processors used, interconnect, …) and show how they are packaged

Processor

The new Opteron 6200 Series (Interlagos)

Hermit uses the Model 6276 (2.3 GHz)

● Interlagos is composed of a number of Bulldozer core “modules”

● A core module has shared and dedicated components

● There are two independent integer cores and a shared, 256-bit FP resource

● A single Integer Core can make use of the entire FP resource with 256-bit AVX instructions

● This architecture is very flexible, and can be applied effectively to a variety of workloads and problems

● DL1 is 16 KB, L2 is 2 MB and L3 is 8MB

Interlagos Processor Architecture

[Figure: one Bulldozer module. Fetch, decode, the FP scheduler with two 128-bit FMAC units and the L2 cache are shared at the module level; each of the two integer cores (Int Core 0, Int Core 1) has its own integer scheduler, four pipelines and an L1 DCache as dedicated components; the L3 cache and NB are shared at the chip level]

May 16, 2012 Slide 13 Cray Proprietary 13

Why share components

By letting the cores share parts, the performance of a processor can be increased while keeping its complexity (number of transistors) lower than by simply adding more cores.

The following table shows this gain by comparing different strategies. In this example we are going from 1 to 2 ‘instances’ (1 to 2 cores, 1 to 2 integer cores, …). The numbers quoted were found in an ht4u article:

http://ht4u.net/reviews/2011/amd_bulldozer_fx_prozessoren/index8.php

                                         Multi Core   Multi Threading   Hyper Thread
Hardware overhead                        2x           ~ 1.2x            < 1.05x
Performance gain                         Max 2x       Max 1.8x          Max 1.25x
Performance gain vs. hardware overhead   1            1.5               1.2
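The last row of the table is simply the maximum performance gain divided by the hardware overhead. A minimal Python sketch of that division follows. It treats the approximate values "~ 1.2x" and "< 1.05x" as exact, which is an assumption made only to reproduce the rounded figures.

```python
# (hardware overhead, maximum performance gain) when going from 1 to 2 instances.
# "~ 1.2x" and "< 1.05x" are treated as exact values for this illustration.
strategies = {
    "Multi Core": (2.0, 2.0),
    "Multi Threading": (1.2, 1.8),
    "Hyper Thread": (1.05, 1.25),
}

# Gain per unit of added hardware, rounded to one decimal as in the table.
ratios = {
    name: round(gain / overhead, 1)
    for name, (overhead, gain) in strategies.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

By this measure, sharing components at the module level ("Multi Threading" in the table) gives the most performance per unit of added hardware.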



● An Orochi die consists of 4 Bulldozer modules

● An 8MB Level 3 Cache and memory controller is shared among the 4 modules

● Two Orochi die make up a single Interlagos processor

● The HLRS machine runs at 2.3 GHz

● Cores can run at faster clock speeds depending on the workload running on the part

[Figure: Orochi die with shared Level 3 cache, integrated memory controller and integrated northbridge controller]

[Figure: XE6 node with four Orochi dies. Each die has four Bulldozer modules (two cores with their L1 caches and a shared L2 per module) and an 8 MB L3 cache; the node has eight DDR3 channels; the dies are linked by HT3, with an HT3 link to Gemini]

XE6 Node Details – 32-core Interlagos

● 2 Multi-Chip Modules, 4 Opteron Dies: ~300 Gflops

● 8 Channels of DDR3 Bandwidth to 8 DIMMs: ~105 GB/s

● 32 Computational Cores, 32 MB of L3 cache

● Dies are fully connected with HT3
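These node figures also give a rough idea of the node's memory balance. The sketch below, in Python, derives bytes per flop and bandwidth per core from the numbers on this slide. It assumes a 2.3 GHz clock (from the earlier processor slides) and 4 double-precision flops per core per cycle, a value that is not stated in the deck but matches the "~300 Gflops" figure. The names are illustrative.

```python
CORES = 32
CLOCK_GHZ = 2.3                 # from the processor slides
FLOPS_PER_CORE_PER_CYCLE = 4    # assumption, not stated on the slide
MEM_BW_GBS = 105.0              # "~105 GB/s" from the slide
DDR3_CHANNELS = 8

peak_gflops = CORES * CLOCK_GHZ * FLOPS_PER_CORE_PER_CYCLE
bytes_per_flop = MEM_BW_GBS / peak_gflops
bw_per_channel = MEM_BW_GBS / DDR3_CHANNELS
bw_per_core = MEM_BW_GBS / CORES

print(f"peak            : {peak_gflops:.1f} Gflops")
print(f"bytes per flop  : {bytes_per_flop:.2f}")
print(f"GB/s per channel: {bw_per_channel:.2f}")
print(f"GB/s per core   : {bw_per_core:.2f}")
```

A balance well below one byte per flop is a reminder that memory-bound codes will not approach the peak figure.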


Gemini

The Cray interconnect

Cray Gemini ASIC (application-specific integrated circuit)

Supports 2 Nodes per ASIC

3D Torus network

XT5/XT6 systems field upgradable

Scales to over 100,000 network endpoints

Link Level Reliability and Adaptive Routing

Advanced Resiliency Features

Advanced features

MPI – millions of messages / second

One-sided MPI

UPC, Coarray FORTRAN, Shmem, Global Arrays

Atomic memory operations

[Figure: Gemini block diagram. Two NICs (NIC 0 and NIC 1), each with a HyperTransport 3 interface, attached through the Netlink block to the 48-port YARC router]

Gemini NIC Design

● Fast memory access (FMA)

   ● Mechanism for most MPI transfers, involves processor

   ● Supports tens of millions of MPI requests per second

● Block transfer engine (BTE)

   ● Supports asynchronous block transfers between local and remote memory, in either direction

   ● For large MPI transfers that happen in the background

● Hardware pipeline maximizes issue rate

● HyperTransport 3 host interface

● Hardware translation of user ranks and addresses

● AMO cache

● Network bandwidth dynamically shared between NICs

[Figure: Gemini NIC block diagram. Blocks shown include the HT3 cave, FMA, BTE, CQ, NPT, RMT, AMO, RAT, NAT, ORB, HARB, TARB, SSID, CLM, LB ring, LM, NL and the router tiles, connected by HT and network request/response paths]

• Cray MPI uses MPICH2 distribution from Argonne

CH3 device Nemesis: multi-method device with a highly optimized shared memory sub-method

• MPI device for Gemini based on

User level Gemini Network Interface (uGNI)

Distributed Memory Applications (DMAPP) library

• FMA (Fast Memory Access)

In general used for small transfers

FMA transfers are lower latency

• BTE (Block Transfer Engine)

BTE transfers take longer to start but can transfer large amounts of data without CPU involvement

• AMOs provide a fast synchronization method for collectives

AMO=Atomic Memory Operations

Gemini Software


Gemini PGAS Features

● PGAS = Partitioned Global Address Space

● Globally addressable memory provides efficient support for

   ● UPC, Co-array FORTRAN, SHMEM

● Pipelined global loads and stores

   ● Allows for fast execution of irregular communication patterns

● Atomic memory operations

   ● Provides fast synchronization method for one-sided communication

● Cray DMAPP application interface

   ● Cray Programming Environment targets this directly

   ● Available for 3rd party tools (check docs.cray.com for the API)


Gemini Performance Highlights

● MPI latency of 1.4 µsec

   ● 3X improvement on SeaStar

● MPI message rate of 9M/sec

   ● 20X improvement on SeaStar

● Injection bandwidths in excess of 6 GB/sec

   ● 3X improvement on SeaStar

● Cray SHMEM put rate of 25M/sec

● Scattered/indexed put rates of 60-90M/sec


● Low latencies are maintained across the whole system with cores sending non-local messages (HPCC natural+random ring)

[Figure: MPI Latency at Scale. Latency (microseconds, 0 to 10) versus number of processes (0 to 50,000), for natural and random rings on Nehalem + IB, Westmere + IB, a small Cray XE6 and a large Cray XE6]


● Gemini MPI bandwidth exceeds 5 GB/sec

[Figure: MPI Bandwidth. MPI bandwidth (MBytes/sec, 0 to 7,000) versus message size (bytes), for a single message and for multiple messages]


Each Gemini supports 2 XE6 Compute Nodes

• Built around the Gemini Interconnect

• Each Gemini ASIC provides 2 NICs enabling it to connect 2 dual-socket nodes

[Figure: a Gemini with its two attached nodes and its links in the X, Y and Z directions of the torus]

The Cray XE6 packaging

Compute blade on the XE6

• Configuration

4 compute nodes per compute blade

Each compute node has 2 Opteron sockets

Each socket hosts a Magny-Cours or Interlagos MCM for a total of up to 128 compute cores per blade

32 DDR3 Memory DIMMs + 32 DDR3 Memory channels

2 Gemini ASICs

L0 Blade management processor

• Runs Cray Linux Environment (CLE)

Linux-based operating system

designed to run large, complex applications and scale efficiently to hundreds of thousands of processor cores


Cray XE6 Compute Blade (4 nodes)


[Figure: Cray XE6 compute blade with Node 0 to Node 3, the Gemini mezzanine card, AMD Opterons with heatsinks, memory DIMMs, voltage regulators and the connector into the backplane; arrows show the message flow from Node 0 to Node 2 and from Node 1 to Node 0]

Service nodes on the XE

• Overview

• Run full Linux (SUSE SLES 11)

• 4 nodes per service blade

• Boot node

• first XE6 node to be booted: boots all other components

• IO nodes

• Run Lustre processes (OST, MDT)

• SDB node

• hosts MySQL database

• processors, allocation, accounting, PBS information

• Login nodes

• User login and code preparation activities: compile, launch

• Partition allocation: ALPS (Application Level Placement Scheduler)


XK6 Nodes

Cray's GPU nodes.

HLRS currently has a small system with 16 XK6 nodes (Tesla); the current plan is to grow it to 32 nodes (Kepler) by the end of the year

Currently the nodes and the software are being tested in a TDS (test and development system). Expect the nodes to join hermit in the June/July timeframe

Cray XK6 Node

[Figure: Cray XK6 node attached to a Gemini on the X, Y, Z torus]

High radix YARC router with adaptive routing, 168 GB/sec capacity

10 12X Gemini channels (each Gemini acts like two nodes on the 3D torus)

XE6 Node Characteristics

Number of Cores            32 (Interlagos)
Peak Performance           2 x IL-16 (2.3 GHz): 295 Gflops
Memory Size                32 GB per node or 64 GB per node
Memory Bandwidth (Peak)    104.5 GB/sec

XK6 Compute Node Characteristics

Host Processor             AMD Series 6200 (Interlagos)
Host Processor Perf.       147 Gflops
Tesla X2090 Cores          448
Tesla X2090 Perf.          600+ Gflops
Host Memory                16 or 32GB, 1633 MHz DDR3
Tesla X2090 Memory         6GB GDDR5 capacity, 170 GB/sec

Gemini High Speed Interconnect

Upgradeable to KEPLER many-core processor

XK6 Compute Blade

Cray XK6 Compute Blade

+

NVIDIA Tesla X2090

+

Cray Gemini Interconnect


Cray XK6 supercomputer

• Nvidia Fermi 2090 GPU: 20% better performance than 2070

compute: 448 → 512 cores; 1.15 → 1.30 GHz clock

memory: 6GB; 150 → 178 GB/s bandwidth

Upgradable to Kepler in 2012

• AMD Series 6200 Interlagos CPU (16 cores)

• Cray Gemini interconnect: high bandwidth/low latency scalability

• Fully integrated/optimised/supported hardware and full software stack (including libraries)

Also supports Cray Cluster Compatibility Mode for ISV applications

• Fully blendable with Cray XE6 product line

• Fully upgradeable from Cray XT/XE systems

"Accelerating the Way to Better Science"

HPCwire readers: “Top 5 New Products or Technologies to Watch”

HPCwire editors: “Best HPC Interconnect Product or Technology”

HPCwire readers: “Best HPC Server Product or Technology”

Cray hybrids in future Top500

ORNL Titan: 200 cabinets of Cray XK6

NCSA Blue Waters: 235 cabinets of Cray XE6 + 30 cabinets of Cray XK6


Cray Vision for Accelerated Computing

• Most important hurdle for widespread adoption of accelerated computing is programming difficulty

Need a single programming model that is portable across machine types, and also forward scalable in time

Portable expression of heterogeneity and multi-level parallelism

Programming model and optimization should not be significantly different for “accelerated” nodes and multi-core x86 processors

Allow users to maintain a single code base

• Cray’s approach to easy-to-use accelerator programming is to provide a tightly coupled high level programming environment with compilers, libraries, and tools that will hide the complexity of the system

Focus on integration and differentiation

Target ease of use with extended functionality and increased automation

• Ease of use is possible with

Compiler making it feasible for users to write applications in Fortran, C, C++ (OpenACC)

Tools to help users port and optimize for accelerators

Auto-tuned scientific libraries

XE6 Cabinets

• An XE6 cabinet contains 3 cages (aka chassis)

• A cage contains 8 blades

• A compute blade contains

8 sockets

2 Gemini interconnects

Memory

L0 controller

VRMs

No moving parts

• One blower at the bottom
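The packaging hierarchy above multiplies out to the per-cabinet figures used elsewhere in the deck. A minimal Python sketch follows. The 2 sockets per node and 16 cores per Interlagos socket are taken from the earlier node slides rather than from this one, and the variable names are illustrative.

```python
CAGES_PER_CABINET = 3
BLADES_PER_CAGE = 8
SOCKETS_PER_BLADE = 8
GEMINIS_PER_BLADE = 2
SOCKETS_PER_NODE = 2    # from the node slides
CORES_PER_SOCKET = 16   # Interlagos, from the node slides

nodes_per_blade = SOCKETS_PER_BLADE // SOCKETS_PER_NODE
blades_per_cabinet = CAGES_PER_CABINET * BLADES_PER_CAGE
nodes_per_cabinet = blades_per_cabinet * nodes_per_blade
cores_per_blade = SOCKETS_PER_BLADE * CORES_PER_SOCKET
cores_per_cabinet = blades_per_cabinet * cores_per_blade
geminis_per_cabinet = blades_per_cabinet * GEMINIS_PER_BLADE

print(f"nodes per blade    : {nodes_per_blade}")
print(f"nodes per cabinet  : {nodes_per_cabinet}")
print(f"cores per blade    : {cores_per_blade}")
print(f"cores per cabinet  : {cores_per_cabinet}")
print(f"Geminis per cabinet: {geminis_per_cabinet}")
```

The 96 nodes per cabinet agree with the "38 racks with 96 nodes each" of the Phase 1 Step 1 overview, and the 128 cores per blade with the compute blade slide.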

XE6 configuration details


Cray XE6 Packaging

[Figure: Cray XE6 packaging. The cabinet enclosure holds 3 chassis, each with 8 blades (compute or XIO), above the blower, together with the PDUs; XIO blades have cables to the GigE backbone and to IB or FC disk; the 3D torus interconnect runs from the backplane and overhead to the next row]

Topology example: 16 cabinets, 8 x 6 x 16

Each chassis has a 1x2x8 topology; HLRS has a 19x6x16 topology
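The topology dimensions can be related to the packaging figures. The Python sketch below assumes that each torus coordinate corresponds to one Gemini ASIC (the counts support this reading, but the slide does not say so), that each Gemini serves 2 nodes, and that HLRS has 38 cabinets as stated in the Phase 1 Step 1 overview. The hop estimate uses the general property that the longest shortest path around a ring of size d with wraparound is d // 2.

```python
from math import prod

NODES_PER_GEMINI = 2
GEMINIS_PER_CABINET = 3 * 8 * 2  # 3 chassis x 8 blades x 2 Geminis per blade

def geminis(dims):
    """Number of torus positions for the given (X, Y, Z) dimensions."""
    return prod(dims)

def max_hops(dims):
    """Largest shortest-path hop count between two positions in a torus."""
    return sum(d // 2 for d in dims)

chassis = (1, 2, 8)
sixteen_cabinets = (8, 6, 16)
hlrs = (19, 6, 16)

for label, dims in [("chassis", chassis),
                    ("16 cabinets", sixteen_cabinets),
                    ("HLRS", hlrs)]:
    print(f"{label}: {geminis(dims)} Geminis, "
          f"{geminis(dims) * NODES_PER_GEMINI} nodes, "
          f"at most {max_hops(dims)} hops")
```

With these assumptions the 19x6x16 torus has 1824 positions, which matches 38 cabinets with 48 Geminis each and therefore 3648 nodes.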


External Services

• esFS

Provides globally shared data between multiple systems: Cray XE systems and others

Provides access to other file systems

DVS is used to project Panasas or StorNext to the compute nodes

• esLogin

Increased availability of data and system services to users

An enhanced user environment: larger memory, swap space, and more horsepower

Dell 905, 4 socket, quad-core and 128 GB of memory

• esDM

More options for data management and data protection

Why External Services for Cray Systems

To address customer requirements:

More flexible user access

More options for data management, data protection

Leverage commodity components in customer-specific implementations

Provide faster access to new devices and technologies

Repeatable solutions that remain open to custom configuration

Enable each solution to be used, scaled, and configured independently

esFS esLogin esDM


Scalable Software Architecture: CLE

Microkernel on Compute nodes, full featured Linux on Service nodes.

Service PEs specialize by function

Software Architecture eliminates OS “Jitter”

Software Architecture enables reproducible run times

[Figure: Service Partition (specialized Linux nodes) and Compute Partition]

Trimming OS – Standard Linux Server

Linux Kernel

Portmap

sshd

slpd

nscd

resmgrd

powersaved

cupsd

kdm

cron mingetty(s)

qmgr master

pickup

ndbd

…

init

klogd


Linux on a Diet – CNL

Linux Kernel

ALPS client

syslogd

Lustre Client init

klogd


FTQ Plot of Stock SuSE (most daemons removed)

[Figure: FTQ plot. Count (27550 to 28350) versus time (0 to 3 seconds)]

FTQ plot of CNL

[Figure: FTQ plot. Count (27550 to 28350) versus time (0 to 3 seconds)]
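The two plots come from an FTQ measurement, commonly expanded as Fixed Time Quantum: the benchmark counts how much work completes in each of a series of equally long time slices, so any slice in which the operating system interrupts the application shows up as a lower count. The following Python sketch illustrates the principle only. It is not the benchmark used for the plots, and the quantum length and number of quanta are arbitrary assumptions.

```python
import time

def ftq(num_quanta=50, quantum_s=0.001):
    """Count the units of work completed in each fixed time quantum."""
    counts = []
    start = time.perf_counter()
    for i in range(num_quanta):
        # Quantum boundaries lie on a fixed grid relative to the start time.
        deadline = start + (i + 1) * quantum_s
        count = 0
        while time.perf_counter() < deadline:
            count += 1  # one "unit of work"
        counts.append(count)
    return counts

counts = ftq()
peak = max(counts)
spread = (peak - min(counts)) / peak if peak else 0.0
print(f"quanta={len(counts)} min={min(counts)} max={peak} spread={spread:.1%}")
```

On a quiet system the counts are nearly constant. Daemons and other OS activity appear as downward spikes, which is the "jitter" that the stripped-down compute node kernel is meant to avoid.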

Cray Software Ecosystem

CrayPAT Cray Apprentice

Cray Scientific Libraries

DVS


CLE4, an adaptive Linux OS designed specifically for HPC

ESM – Extreme Scalability Mode

• No compromise scalability

• Low-Noise Kernel for scalability

• Native Comm. & Optimized MPI

• Application-specific performance tuning and scaling

CCM – Cluster Compatibility Mode

• No compromise compatibility

• Fully standard x86/Linux

• Standardized Communication Layer

• Out-of-the-box ISV Installation

• ISV applications simply install and run

Cluster Compatibility Mode: Overview

• Provides the runtime environment on compute nodes expected by ISV applications

• Associated with specific batch queues

• Dynamically allocates and configures compute nodes at job start

Nodes are not permanently dedicated to CCM

Any compute node can be used

Allocated like any other batch job (on demand)

• MPI and Third party MPI runs over InfiniBand or TCP/IP over HSN

• Supports standard services: ssh, rsh, nscd, ldap

• Complete root file system on the compute nodes

built on top of the Dynamic Shared Libraries (DSL) environment

Under CCM, everything the application can “see” is identical to a standard Linux

cluster: Linux OS, x86 processor, and MPI


CCM IAA

Look just like InfiniBand to third-party MPIs.

Emulate IB characteristics not in the spec.

Be invisible to the user.

Be invisible in performance profiles.


Cray XE I/O architecture

• All I/O is offloaded to service nodes

• Lustre

High performance parallel I/O file system

Direct data transfer between compute nodes and files

• DVS

Virtualization service

Allows compute nodes to access NFS mounted on service node

Applications must execute on file systems mounted on compute nodes

• No local disks

• /tmp is a MEMORY file system, on each login node


Scaling Shared Libraries with DVS

[Figure: diskless compute nodes 0 to N, each mounting /dvs, connected over the Cray interconnect to DVS server node 0, which provides the NFS shared libraries]

Requests for shared libraries (.so files) are routed through DVS Servers

Provides similar functionality as NFS, but scales to 1000s of compute nodes

Central point of administration for shared libraries

DVS Servers can be “re-purposed” compute nodes


DSL : Dynamic shared libraries


• Benefit: root file system environment available to applications

• Shared root from SIO nodes will be available on compute nodes

• Standard libraries / tools will be in the standard places

• Able to deliver customer-provided root file system to compute nodes

• Programming environment supports static and dynamic linking

• Performance impact negligible, due to scalable implementation

The Cray Programming Environment Overview

The Cray Programming Environment Vision

• It is the role of the Programming Environment to close the gap between observed performance and peak performance

Help users achieve highest possible performance from the hardware

• The Cray Programming Environment is addressing the issues of scale and complexity of high end HPC systems with:

Increased automation

Ease of use

Hiding the system complexity

Extended functionality

Focus on scalability

Improved Reliability

Strong academic collaborations

Close interaction with users

For feedback targeting functionality enhancements


Cray Programming Environment Distribution

Focus on Differentiation and Productivity

Programming Languages: Fortran, C, C++, Chapel, Python

Compilers: Cray Compiling Environment (CCE), PGI, GNU

Programming models

Distributed Memory (Cray MPT): MPI, SHMEM

PGAS & Global View: UPC (CCE), CAF (CCE), Chapel

Shared Memory: OpenMP 3.0 (PGI, CCE)

I/O Libraries: NetCDF, HDF5

Optimized Scientific Libraries: LAPACK, ScaLAPACK, BLAS (libgoto), Iterative Refinement Toolkit, Cray Adaptive FFTs (CRAFFT), FFTW, Cray PETSc (with CASK), Cray Trilinos (with CASK)

Tools

Environment setup: Modules

Debuggers: DDT, gdb

Debugging Support Tools: Fast Track Debugger (CCE w/ DDT), Abnormal Termination Processing, STAT, Cray Comparative Debugger#

Performance Analysis: CrayPat, Cray Apprentice2

Legend: Cray developed; Licensed ISV SW; 3rd party packaging; Cray added value to 3rd party; #: Under development

CCE : The Cray Compilation Environment

• Cray technology focused on scientific applications

Takes advantage of automatic vectorization

Takes advantage of automatic shared memory parallelization

• Standard conforming languages and programming models

Fortran 2003 standard compliant with F2008 features already available

C++98/2003 compliant

OpenMP 3.0 compliant, working on OpenMP 3.1 and OpenMP 4.0

• OpenMP and automatic multithreading fully integrated

Share the same runtime and resource pool

Aggressive loop restructuring and scalar optimization done in the presence of OpenMP

Consistent interface for managing OpenMP and automatic multithreading

• PGAS languages (UPC & Fortran Coarrays) fully optimized and integrated into the compiler

UPC 1.2 and Fortran 2008 coarray support

No preprocessor involved

Target the network appropriately

Cray MPI & Cray SHMEM

• MPI

Implementation based on MPICH2 from ANL

Optimized Remote Memory Access (one-sided) fully supported including passive RMA

Full MPI-2 support with the exception of Dynamic process management (MPI_Comm_spawn)

MPI3 Forum active participant

• Cray SHMEM

Fully optimized Cray SHMEM library supported

Cray XE implementation close to the T3E model

• From performance measurement to performance analysis

• Assist the user with application performance analysis and optimization

Help user identify important and meaningful information from potentially massive data sets

Help user identify problem areas instead of just reporting data

Bring optimization knowledge to a wider set of users

• Focus on ease of use and intuitive user interfaces

Automatic program instrumentation

Automatic analysis

• Target scalability issues in all areas of tool development

Cray Performance Analysis Tools


• Systems with hundreds of thousands of threads of execution need a new debugging paradigm

Innovative techniques for productivity and scalability

Scalable Solutions based on MRNet from University of Wisconsin

STAT - Stack Trace Analysis Tool

Scalable generation of a single, merged, stack backtrace tree

• running at 216K back-end processes

ATP - Abnormal Termination Processing

Scalable analysis of a sick application, delivering a STAT tree and a minimal, comprehensive, core file set.

Fast Track Debugging

Debugging optimized applications

Added to Allinea's DDT 2.6 (June 2010)

Comparative debugging

A data-centric paradigm instead of the traditional control-centric paradigm

Collaboration with Monash University and University of Wisconsin for scalability

Support for traditional debugging mechanisms

TotalView, DDT, and gdb (TotalView is not on Hermit)
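The idea behind STAT's "single, merged, stack backtrace tree" can be illustrated with a small prefix tree: processes whose call stacks share the same leading frames are grouped under the same branch, so that thousands of backtraces collapse into a few distinct paths. The Python sketch below shows only this merging step. It is not STAT's implementation or interface, and the ranks and function names are made up.

```python
def merge_stacks(traces):
    """Merge per-rank call stacks (outermost frame first) into a prefix tree.

    traces: dict mapping rank -> list of function names.
    Returns nested dicts: {frame: {"ranks": [...], "children": {...}}}.
    """
    tree = {}
    for rank, frames in sorted(traces.items()):
        level = tree
        for frame in frames:
            node = level.setdefault(frame, {"ranks": [], "children": {}})
            node["ranks"].append(rank)
            level = node["children"]
    return tree

def render(tree, depth=0):
    """Return indented text lines, one per tree node, with rank counts."""
    lines = []
    for frame, node in tree.items():
        lines.append("  " * depth + f"{frame} [{len(node['ranks'])} ranks]")
        lines.extend(render(node["children"], depth + 1))
    return lines

# Hypothetical backtraces from four ranks of a hung application.
traces = {
    0: ["main", "solve", "MPI_Allreduce"],
    1: ["main", "solve", "MPI_Allreduce"],
    2: ["main", "solve", "compute"],
    3: ["main", "io", "write"],
}

tree = merge_stacks(traces)
print("\n".join(render(tree)))
```

In a view like this, the ranks that never reached the collective (here ranks 2 and 3) stand out immediately, which is usually the first question when an application hangs.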

Debuggers on Cray Systems


Scientific libraries – functional view

FFT: FFTW, CRAFFT

Sparse: Trilinos, PETSc, CASK

Dense: BLAS, LAPACK, ScaLAPACK, IRT

Scientific libraries – package view

FFTW: fftw-2.1.5, fftw

PETSc: petsc-, Petsc-complex, CASK (petsc)

Trilinos: Trilinos 10.8.3.0, CASK (trilinos)

LibSci: BLAS, LAPACK, ScaLAPACK, IRT, CRAFFT

How to access and use the software

• The Cray XE system uses modules in the user environment to support multiple software versions and to create integrated software packages

As new versions of the supported software and associated man pages become available, they are added automatically to the Programming Environment, while earlier versions are retained to support legacy applications

You can use the default version of an application, or you can choose another version by using Modules system commands

Environment Setup


• How can we get appropriate Compiler, Tools, and Libraries?

The modules tool is used to handle different versions of packages

e.g.: module load compiler_v1

e.g.: module swap compiler_v1 compiler_v2

e.g.: module load perftools

• Taking care of changing of PATH, MANPATH, LM_LICENSE_FILE, ... environment

Modules also provide a simple mechanism for updating certain environment variables, such as PATH, MANPATH, and LD_LIBRARY_PATH

In general, you should make use of the modules system rather than embedding specific directory paths into your startup files, makefiles, and scripts.

• It is also easy to setup your own modules for your own software
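To make the environment-variable handling more concrete, the Python sketch below mimics what a modulefile's `prepend-path` line does to a colon-separated variable such as PATH. It is an illustration only, not the Modules implementation. The function name is hypothetical, dropping an already-present entry is a design choice of this sketch, and the example directories are the ones shown in the perftools modulefile later in this deck.

```python
SEP = ":"  # path separator on the login and compute nodes (POSIX)

def prepend_path(env, var, directory):
    """Return a copy of env with directory at the front of the list in var.

    An existing occurrence of directory is dropped first, so repeated
    loads do not make the variable grow.
    """
    parts = [p for p in env.get(var, "").split(SEP) if p and p != directory]
    new_env = dict(env)
    new_env[var] = SEP.join([directory] + parts)
    return new_env

env = {"PATH": "/usr/bin:/bin"}
env = prepend_path(env, "PATH", "/opt/cray/perftools/5.3.0/bin")
env = prepend_path(env, "MANPATH", "/opt/cray/perftools/5.3.0/man")

print(env["PATH"])
print(env["MANPATH"])
```

Because such changes are recorded in the modulefile, unloading or swapping a module can undo them cleanly, which is the practical reason to prefer modules over hard-coded paths in startup files.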

The module tool on the Cray XE


The PrgEnv-X module

• PrgEnv-X is the 'basic' module for all XE6 users

X = cray, pgi, gnu, intel [pathscale]

• With PrgEnv you decide which compiler you want to use, and all needed modules (math libs, MPI, …) are loaded automatically

• Modules not loaded by default can be loaded at any time, e.g. perftools for performance analysis

module list

eslogin002:~> module list

Currently Loaded Modulefiles:

1) modules/3.2.6.6 13) xe-sysroot/4.0.36

2) xtpe-network-gemini 14) rca/1.0.0-2.0400.30002.5.75.gem

3) xtpe-interlagos 15) xt-asyncpe/5.07

4) cce/8.0.2 16) atp/1.4.2

5) acml/4.4.0 17) PrgEnv-cray/4.0.36

6) xt-libsci/11.0.05 18) xt-mpich2/5.4.3

7) udreg/2.3.1-1.0400.3911.5.13.gem 19) eswrap/1.0.9

8) ugni/2.3-1.0400.4127.5.20.gem 20) torque/2.5.9

9) pmi/3.0.0-1.0000.8661.28.2807.gem 21) moab/6.1.5.s1992

10) dmapp/3.2.1-1.0400.3965.10.63.gem 22) system/ws_tools

11) gni-headers/2.1-1.0400.4156.6.1.gem 23) system/hlrs-defaults

12) xpmem/0.1-2.0400.30792.5.6.gem

PrgEnv-cray is the default on Hermit


@eslogin002:~> module show xtpe-interlagos

-------------------------------------------------------------------

/opt/cray/xt-asyncpe/default/modulefiles/xtpe-interlagos:

conflict xtpe-barcelona

conflict xtpe-quadcore

conflict xtpe-shanghai

conflict xtpe-istanbul

conflict xtpe-interlagos-cu

conflict xtpe-mc8

conflict xtpe-mc12

conflict xtpe-xeon

prepend-path PE_PRODUCT_LIST XTPE_INTERLAGOS

setenv XTPE_INTERLAGOS_ENABLED ON

setenv CRAY_CPU_TARGET interlagos

setenv INTEL_PRE_COMPILE_OPTS -msse3

setenv PATHSCALE_PRE_COMPILE_OPTS -march=barcelona

-------------------------------------------------------------------

What is xtpe-interlagos?

I should build for the right compute-node

architecture.

It’d probably be a really bad idea to load two architectures at once.

Oh yeah, let’s link in the tuned math libraries for this architecture too.
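The `conflict` lines are what prevent two CPU-target modules from being loaded at once. The Python sketch below illustrates that check using the conflict list shown in the modulefile above. It is a simplified model, not how the Modules tool is implemented, and the `load` function is hypothetical.

```python
# Conflicts declared by the xtpe-interlagos modulefile shown above.
CONFLICTS = {
    "xtpe-interlagos": {
        "xtpe-barcelona", "xtpe-quadcore", "xtpe-shanghai", "xtpe-istanbul",
        "xtpe-interlagos-cu", "xtpe-mc8", "xtpe-mc12", "xtpe-xeon",
    },
}

def load(loaded, module):
    """Return the new list of loaded modules, or raise on a declared conflict."""
    clashes = CONFLICTS.get(module, set()) & set(loaded)
    if clashes:
        raise RuntimeError(
            f"{module} conflicts with loaded module(s): {sorted(clashes)}"
        )
    return loaded + [module]

print(load(["PrgEnv-cray"], "xtpe-interlagos"))

try:
    load(["PrgEnv-cray", "xtpe-mc12"], "xtpe-interlagos")
except RuntimeError as err:
    print("refused:", err)
```

With the real tool, the usual way around such a conflict is `module swap`, which unloads the old target and loads the new one in a single step.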


Useful module commands

• Which modules are loaded?

module list

• Load software

module load perftools

• Change programming environment

module swap PrgEnv-cray PrgEnv-gnu

• Change software version

module swap cce cce/7.4.4

• Check which versions are available

module avail cce


Which Software Versions Are Available?

hpcnicho@eslogin002:~> module avail perftools

--------------------------- /opt/cray/modulefiles --------------------------

perftools/5.2.0 perftools/5.2.3 perftools/5.3.0(default)

hpcnicho@eslogin002:~> module avail cce

---------------------------- /opt/modulefiles --------------------------------

cce/7.3.3 cce/7.4.2 cce/8.0.0 cce/8.0.0.137

cce/8.0.2(default) cce/7.3.4 cce/7.4.4 cce/8.0.0.129

cce/8.0.1
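When scripting around the module system it can be handy to extract the available versions and the default from a listing like the one above. The Python sketch below parses text in the format shown on this slide. The function name is hypothetical, and the format is assumed to be exactly as shown (whitespace-separated `name/version` tokens with an optional `(default)` suffix).

```python
import re

def parse_avail(text, name):
    """Extract versions of module `name` and the default from avail-style text."""
    pattern = re.compile(re.escape(name) + r"/([^()\s]+)(\(default\))?")
    versions, default = [], None
    for token in text.split():
        match = pattern.fullmatch(token)
        if match:
            versions.append(match.group(1))
            if match.group(2):
                default = match.group(1)
    return versions, default

# Listing copied from the slide.
listing = """
cce/7.3.3 cce/7.4.2 cce/8.0.0 cce/8.0.0.137
cce/8.0.2(default) cce/7.3.4 cce/7.4.4 cce/8.0.0.129
cce/8.0.1
"""

versions, default = parse_avail(listing, "cce")
# Sort numerically by version component to find the newest release.
newest = max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))

print("versions:", versions)
print("default :", default)
print("newest  :", newest)
```

For this listing the default and the newest version coincide, but that is a site decision and not something the module system guarantees.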


What Happens When I Load a Module?

hpcnicho@eslogin002:~> module show perftools

-------------------------------------------------------------------

/opt/cray/modulefiles/perftools/5.3.0:

setenv PERFTOOLS_VERSION 5.3.0

conflict x2-craypat

conflict craypat

conflict xt-craypat

conflict apprentice2

module load rca

setenv CHPL_CG_CPP_LINES 1

setenv PDGCS_LLVM_DISABLE_FP_ELIM 1

setenv PAT_REPORT_PRUNE_NAME _cray$mt_start_,__cray_hwpc_,f_cray_hwpc_,cstart,__pat_,pat_region_,PAT_,OMP.slave_loop,slave_entry,_new_slave_entry,__libc_start_main,_start,__start,start_thread,__wrap_,UPC_ADIO_,_upc_,upc_,__caf_,__pgas_

module-whatis Perftools - the Performance Tools module sets up environments for CrayPat, Apprentice2 and PAPI

prepend-path PATH /opt/cray/perftools/5.3.0/bin

prepend-path MANPATH /opt/cray/perftools/5.3.0/man

setenv CRAYPAT_LICENSE_FILE /opt/cray/perftools/craypat.lic

prepend-path CRAYLMD_LICENSE_FILE /opt/cray/perftools/craypat.lic

setenv CRAYPAT_ROOT /opt/cray/perftools/5.3.0

setenv CRAYPAT_INCLUDE_OPTS $($CRAYPAT_ROOT/sbin/pat-opts INCLUDE)

setenv CRAYPAT_PRE_LINK_OPTS $($CRAYPAT_ROOT/sbin/pat-opts PRE_LINK)

setenv CRAYPAT_POST_LINK_OPTS $($CRAYPAT_ROOT/sbin/pat-opts POST_LINK)

setenv CRAYPAT_PRE_COMPILE_OPTS $($CRAYPAT_ROOT/sbin/pat-opts PRE_COMPILE)

setenv CRAYPAT_POST_COMPILE_OPTS $($CRAYPAT_ROOT/sbin/pat-opts POST_COMPILE)

setenv CRAYPAT_ROOT_FOR_EVAL /opt/cray/perftools/$PERFTOOLS_VERSION

module load papi/4.2.0

setenv APP2_STATE 5.3.0

setenv JH_HELPSET /opt/cray/perftools/5.3.0/help/app2help.jar

setenv JH_VIEWER /opt/cray/perftools/5.3.0/help/jh2_0_05/demos/bin/hsviewer.jar

prepend-path CRAY_LD_LIBRARY_PATH /opt/cray/perftools/5.3.0/lib

append-path CLASSPATH /opt/cray/perftools/5.3.0/help/jh2_0_05/javahelp

append-path PE_PRODUCT_LIST PERFTOOLS

append-path PE_PRODUCT_LIST CRAYPAT

-------------------------------------------------------------------
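The prepend-path and append-path lines in the output above are the part of a module that changes search paths. As a rough illustration only, the sketch below mimics their effect with two hypothetical shell helpers (`prepend_path` and `append_path` are not part of the module system) and applies them to scratch variables, so the real PATH is untouched.

```shell
#!/bin/sh
# Hypothetical stand-ins for the modulefile commands "prepend-path" and
# "append-path". They only illustrate the effect on a colon-separated variable.

prepend_path() {
  # $1 = variable name, $2 = entry to put in front
  eval "_pp_cur=\${$1:-}"
  if [ -z "$_pp_cur" ]; then
    eval "$1=\$2"
  else
    eval "$1=\$2:\$_pp_cur"
  fi
  export "$1"
}

append_path() {
  # $1 = variable name, $2 = entry to add at the end
  eval "_pp_cur=\${$1:-}"
  if [ -z "$_pp_cur" ]; then
    eval "$1=\$2"
  else
    eval "$1=\$_pp_cur:\$2"
  fi
  export "$1"
}

# Demo on scratch variables (DEMO_PATH and DEMO_LIST are made-up names).
DEMO_PATH=/usr/bin
prepend_path DEMO_PATH /opt/cray/perftools/5.3.0/bin
echo "$DEMO_PATH"    # /opt/cray/perftools/5.3.0/bin:/usr/bin

DEMO_LIST=""
append_path DEMO_LIST PERFTOOLS
append_path DEMO_LIST CRAYPAT
echo "$DEMO_LIST"    # PERFTOOLS:CRAYPAT
```

Because the perftools directory is prepended, its binaries are found before anything else on the path.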

Release Notes

hpcnicho@eslogin002:~> module help cce/8.0.2

----------- Module Specific Help for 'cce/8.0.2' ------------------

The modulefile, cce, defines the system paths and environment

variables needed to run the Cray Compile Environment.

Type "module avail cce" to see if other versions of this product

are available on this system. Use "module switch" to change versions.

Cray Compiling Environment 8.0.2 (CCE 8.0.2)

============================================

Purpose:

--------

The CCE 8.0.2 update provides bugfixes to the CCE 8.0.1 release for Cray XE

systems.

Bugs fixed in 8.0.2 are:

779483 Runtime error with Cray Fortran compiler cce/7.4.4

780053 Illegal folding of optional argument test into a merge

780346 Internal compiler error with crayftn when enabling full debugging

779573 Fortran function pointer issue

Note:

-----

Support for CCE on Cray XT systems will continue to be provided with

updates to the CCE 7.4 release. The CCE 8.0 release branch is

supported on the Cray XE and XK systems only.

Dependencies:

-------------

The CCE 8.0.2 release is supported on Cray XE systems that run on the Cray

Linux Environment (CLE) operating system, version 3.1 and later and on the

Cray XK systems that run the Cray Linux Environment 4.0 UP01 and later.


Release Notes (cont.)

CCE 8.0.2 requires that gcc/4.4.4 be installed. GCC 4.4.4 does not need to

be a default GNU environment.

Cray Performance Measurement and Analysis Tools dependency:

- cce/8.0.0 or later compiles using -h profile_generate require

perftools/5.3.0 in order to provide loop work estimates.

- perftools/5.3.0 is required to support the PGAS (UPC, CAF) runtime

library changes made in CCE 8.0

The Cray Compiling Environment 8.0.2 update requires the following supporting

asynchronous software products:

Cray Compiler Drivers (xt-asyncpe) 5.04 or later

GNU GCC 4.4.4 must be installed, but is not required to be the default GCC

PMI 2.1.4 or later

Cray Scientific Libraries (LibSci) 11.0.00 or later

The Cray Compiling Environment 8.0.2 update requires the following minimum

version if these products are used:

PETSc 3.1.05 or later

hdf5-netcdf 1.8 (HDF5 1.85 and netcdf 4.1.1)

MPT 5.2.3 or later

acml 4.4.0 or later. To use acml 5.0, gcc 4.6.1 must be installed.

Cray Performance Measurement and Analysis Tools 5.3.0


• You use compiler driver commands to launch all Cray XE compilers (ftn, cc, and CC)

• The syntax for the compiler driver is:

cc | CC | ftn [Cray_options | PGI_options | GNU_options] files [-lhugetlbfs]

• For example, to use any Fortran compiler (CCE, PGI, GNU) to compile prog1.f90

Use this command: % ftn prog1.f90

• The compiler drivers check which PrgEnv-X module is loaded to decide which compiler to call
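Since the drivers depend on the loaded PrgEnv-X module, it can help to check which one is active before compiling. The sketch below is only an illustration: it assumes the module system keeps the loaded modules in the colon-separated variable LOADEDMODULES, and the helper name `current_prgenv` and the version numbers in the usage line are made up.

```shell
#!/bin/sh
# Print the PrgEnv-* module found in a colon-separated module list.
# Without an argument the list is taken from $LOADEDMODULES (assumption:
# the module command maintains that variable on the login node).
current_prgenv() (
  list=${1:-${LOADEDMODULES:-}}
  echo "$list" | tr ':' '\n' | sed -n 's|^\(PrgEnv-[^/]*\).*|\1|p' | head -n 1
)

# Usage with a made-up module list:
current_prgenv "modules/3.2.6.6:PrgEnv-cray/4.0.36:cce/8.0.2"
```

With the list shown, the function prints PrgEnv-cray, so ftn, cc and CC would call the Cray compilers.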

Using the Compiler Driver Commands


The Cray Compilation Environment

This is the default on Hermit

CCE Technology Sources

[Figure: block diagram of the compiler. Inputs: Fortran Source, C and C++ Source. Stages: Fortran Front End, C & C++ Front End, Interprocedural Analysis, Optimization and Parallelization, X86 Code Generator, Cray XK Code Generator. Output: Object File.]

C and C++ Front End supplied by Edison Design Group, with Cray-developed code for extensions and interface support

X86 Code Generation from LLVM, with additional Cray-developed optimizations and interface support

Cray Inc. Compiler Technology

PTX Code Generation derived from the Cray X2 code generator

Fortran 2003 plus portions of 2008 (CAF), OpenMP, and Cray-specific programming support

Aggressive inlining and interprocedural optimization, including cross-file

Automatic vectorization and SMP; automatic restructuring for memory usage; OpenMP, UPC and CAF expansion and optimization; heterogeneous target data transfer, parallelization, and optimization; scalar and vector optimization

• Compliance with ANSI/ISO Fortran 2003

• Fortran 2008 (full compliance targeted for 2012)

Fortran 2008 coarrays

Submodules

Block construct

Contiguous attribute

ALLOCATE enhancements (MOLD=, shape from SOURCE/MOLD)

Intrinsic assignment for polymorphic variables

Most of the new intrinsic functions

ISO_Fortran_Env module enhancements

• Compliance with ANSI/ISO C99 and ANSI/ISO C++ 2003 (except the export keyword for templates)

Support for Kernighan & Ritchie C

• C/C++ enhancements/changes

Updated to GCC version 4.4.4 compatibility

C++ supports the ISO 1998 Standard Template Library (STL) headers

Upgraded the C and C++ front end to EDG Version 4.1:

With this update CCE can better handle modern C++ applications

Periodic synchronization with the latest sources and bug fixes

Better support for non-standard GNU language extensions

The new EDG C and C++ front end more strictly enforces the standards

• UPC 1.2 support

CCE Main Features

• AMD Interlagos support, including AVX, FMA, and XOP instructions

• X86/NVIDIA compiler and library development (ongoing “beta” release)

• Support for MPI 2.2

• Full OpenMP 3.0 support

Automatic multithreading integrated with OpenMP

Atomic construct extensions

taskyield construct

firstprivate clause accepts intent(in) and constant objects

• Support for hybrid programming using MPI across node and OpenMP within the node

• Support for IEEE floating-point arithmetic and IEEE file formats

• Cray performance tools and debugger support

• Program Library

• CCE 8.0 was released in December 2011

The full release overview can be found at: http://docs.cray.com/books/S-5212-74/


CCE Main Features (cont.)






• C-based UPC and Fortran Coarray are PGAS language extensions, not stand-alone languages

• A subset of Fortran coarray collectives was added for CCE

They are not yet part of the official language, but they are too useful to be delayed

• Significant improvements were made to the automatic use of blocked network transfers, including:

Automatic conversion of multiple single-word accesses into blocked accesses

Improved capabilities for pattern matching to hand-optimized library routines, including messages stating what might be inhibiting the conversion

• UPC and Fortran coarrays support up to 2,147,483,647 threads within a single application

We actually did hit the previous limit of 65,535!

UPC and Fortran Coarray Features


• The Program Library (PL) feature allows the user to specify a repository of compiler information for an application build

This repository provides the framework for future productivity features such as:

Whole-program static error detection

Incremental recompilation

Support for Reveal, the future Cray interactive whole-program performance analysis and tuning assistant

• Two command line options control the Program Library functionality

-h pl=<PL_path> specifies the repository

ftn -hpl=./PL.1 tells the compiler to update the Program Library "./PL.1" if it exists, or to create it if it does not.

<PL_path> should specify a single location to be used for the entire application build. If a makefile changes directories during a build, an absolute path might be necessary.

-h wp enables whole-program mode

Whole-Program Compilation


• Whole-program mode (-hwp) requires a program library (-hpl=<PL_path>); both options must be specified on all compilation command lines as well as on the link line.

The compiler front end is invoked for the compilation (-c) command lines.

The compiler back end (inliner, optimizer, code generator) is invoked for all source files when the link line is processed.

While -hwp might have a negative effect on overall compile time due to increased inlining, it usually just shifts the compile time: -c compilations become quite fast and the time spent in the link step increases.

Setting the environment variable NPROC to a number greater than 1 instructs the compiler to invoke NPROC back-end processes concurrently. The back-end invocations are independent of each other, and setting NPROC to a level that is appropriate for the build host can improve compile time.

• Whole-program mode (-hwp) allows the inliner to see all inline candidates in the application.

This option makes cross-file inlining automatic

Removes the need for -h ipafrom=

Inlining heuristics are still controlled by -h ipaN / -O ipaN

Whole-Program Compilation (cont)


• Use default optimization levels

It's the equivalent of most other compilers' -O3 or -fast

It is also our most thoroughly tested configuration

• Use -O3,fp3 (or -O3 -hfp3, or some variation)

-O3 only gives you slightly more than -O2

We also test this thoroughly

-hfp3 gives you a lot more floating point optimization, esp. 32-bit

• If an application is intolerant of floating point reassociation, try a lower -hfp number: try -hfp1 first, and -hfp0 only if absolutely necessary

Might be needed for tests that require strict IEEE conformance

Or applications that have 'validated' results from a different compiler

Interlagos FMA usage is aggressive at -hfp2 and -hfp3, limited at -hfp1, and disabled at -hfp0

• Do not use -Oipa5, -Oaggress, and so on; higher numbers are not always correlated with better performance
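The recommendations above can be condensed into a small lookup. The sketch below is an illustration, not a Cray tool: the function name `cce_opt_flags` and the keywords default, fast, careful and strict are made up, and only the options themselves come from the slide.

```shell
#!/bin/sh
# Map a (hypothetical) tolerance keyword to the CCE options suggested above.
cce_opt_flags() {
  case $1 in
    default) echo "" ;;            # default levels: most thoroughly tested
    fast)    echo "-O3 -hfp3" ;;   # more floating point optimization
    careful) echo "-hfp1" ;;       # first step for reassociation-sensitive codes
    strict)  echo "-hfp0" ;;       # only if absolutely necessary; no FMA
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Usage: print the compile line for a made-up source file.
echo "ftn $(cce_opt_flags fast) -c prog1.f90"
```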

Recommended CCE Compilation Options


• We recommend using -O3 -hfp3 if the application runs cleanly with these options

• -hfp3 primarily improves 32-bit floating point performance on the X86

• A partial list of what happens at -hfp3 is:

Use of fast 32-bit inline division, reciprocal, square root, and reciprocal square root algorithms (with some loss of precision)

Use of a fast 32-bit inline complex absolute value algorithm

Starting with CCE 8.0, more aggressive reassociation (pre-8.0 -hfp2 behavior)

Various assumptions about floating point trap safety

Somewhat more aggressive about NaN assumptions

Assumes standard-compliant Fortran exponentiation (x**y)

What Exactly Does –hfp3 Do?


• Overall Options

-ra creates a listing file with optimization info

-rm produces a source listing with loopmark information

• Preprocessor Options

-eZ runs the preprocessor on Fortran files

-F enables macro expansion throughout the source file

• Optimisation Options

-O2 optimal flags [ enabled by default ]

-O3 aggressive optimization

-O ipa<n> inlining, n=0-5

Cray compiler flags


• Language Options

-f free process Fortran source using freeform

-s real64 treat REAL variables as 64-bit

-s integer64 treat INTEGER variables as 64-bit

• Parallelization Options

-O omp Recognize OpenMP directives [default ]

-O thread<n> n=0-3, aggressive parallelization, default n=2

Cray compiler flags

=> man crayftn http://docs.cray.com/cgi-bin/craydoc.cgi?mode=View;id=S-3901-71;idx=books_search;this_sort=;q=3901;type=books;title=Cray%20Fortran%20Reference%20Manual









• OpenMP is ON by default

Optimizations controlled by -Othread#

To shut off use -Othread0 or -xomp or -hnoomp

• Autothreading is NOT on by default;

-hautothread to turn on

Modernized version of Cray X1 streaming capability

Interacts with OpenMP directives

• If you do not want to use OpenMP and have OMP directives in the code, make sure to shut off OpenMP at compile time

OpenMP


• Cray compiler supports a full and growing set of directives and pragmas

!dir$ concurrent

!dir$ ivdep

!dir$ interchange

!dir$ unroll

!dir$ loop_info [max_trips] [cache_na]

!dir$ blockable

... Many more

man directives

man loop_info

CCE Directives


Loopmark/Compiler Feedback

• ftn -rm … or cc -hlist=m …

• The compiler generates a filename.lst file containing an annotated listing of your source code, with letters indicating important optimizations

Loopmark: Compiler Feedback

%%% L o o p m a r k   L e g e n d %%%

Primary Loop Type       Modifiers
------- ---- ----       ---------
                        a - vector atomic memory operation
A - Pattern matched     b - blocked
C - Collapsed           f - fused
D - Deleted             i - interchanged
E - Cloned              m - streamed but not partitioned
I - Inlined             p - conditional, partial and/or computed
M - Multithreaded       r - unrolled
P - Parallel/Tasked     s - shortloop
V - Vectorized          t - array syntax temp used
W - Unwound             w - unwound

• ftn -rm … or cc -hlist=m …


Example: Cray loopmark messages for Resid

29. b-------<     do i3=2,n3-1
30. b b-----<       do i2=2,n2-1
31. b b Vr--<         do i1=1,n1
32. b b Vr              u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr     *               + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr              u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr     *               + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr-->         enddo
37. b b Vr--<         do i1=2,n1-1
38. b b Vr              r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr     *                    - a(0) * u(i1,i2,i3)
40. b b Vr     *                    - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr     *                    - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr-->         enddo
43. b b----->       enddo
44. b------->     enddo

Example: Cray loopmark messages for Resid (cont)

ftn-6289 ftn: VECTOR File = resid.f, Line = 29
A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.

ftn-6049 ftn: SCALAR File = resid.f, Line = 29
A loop starting at line 29 was blocked with block size 4.

ftn-6289 ftn: VECTOR File = resid.f, Line = 30
A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.

ftn-6049 ftn: SCALAR File = resid.f, Line = 30
A loop starting at line 30 was blocked with block size 4.

ftn-6005 ftn: SCALAR File = resid.f, Line = 31
A loop starting at line 31 was unrolled 4 times.

ftn-6204 ftn: VECTOR File = resid.f, Line = 31
A loop starting at line 31 was vectorized.

ftn-6005 ftn: SCALAR File = resid.f, Line = 37
A loop starting at line 37 was unrolled 4 times.

ftn-6204 ftn: VECTOR File = resid.f, Line = 37
A loop starting at line 37 was vectorized.
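For larger sources it can be handy to count such messages rather than read them one by one. The sketch below is a rough illustration: the helper `loopmark_summary` is made up, and it assumes the messages are available in a text file (for example the listing file) worded exactly as in the example above.

```shell
#!/bin/sh
# Count "was vectorized" and "was not vectorized" messages in a text file.
# The two patterns do not overlap, because "was not vectorized" does not
# contain the string "was vectorized".
loopmark_summary() (
  file=$1
  yes=$(grep -c 'was vectorized' "$file")
  no=$(grep -c 'was not vectorized' "$file")
  echo "vectorized=$yes not_vectorized=$no"
)

# Usage (resid.lst is an assumed file name):
#   loopmark_summary resid.lst
```

For the eight messages shown above, the summary would report two vectorized loops (lines 31 and 37) and two loops that were not vectorized (lines 29 and 30).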


• The cc(1), CC(1), and ftn(1) man pages contain information about the compiler driver commands

• The pgcc(1), pgCC(1), and pgf95(1) man pages contain descriptions of the PGI compiler command options

• The craycc(1), crayCC(1), and crayftn(1) man pages contain descriptions of the Cray compiler command options

• The gcc(1), g++(1), and gfortran(1) man pages contain descriptions of the GNU compiler command options

• To verify that you are using the correct version of a compiler, use:

the -V option on a cc, CC, or ftn command with PGI and CCE

the --version option on a cc, CC, or ftn command with GNU

Compiler man Pages


• One rounding for the FMA as a whole, rather than two (one for multiply and one for addition)

• That sounds like a minor difference, but these differences can accumulate

• In our internal testing, we manually approved most of the differences after examining them and deciding that the FMA-based results were within an acceptable range

• Actual applications – at least some of them – appear to be less forgiving

• There is no hardware way to obtain the exact same result between FMAs and individual multiplications and additions

… but the performance difference means we really do need to use them

• Some level of FMA control is provided by CCE –hfp options

-hfp0: No FMA generation (but also disables a lot of other stuff)

-hfp1: Generate FMAs, but not across user parentheses

-hfp2,3: Aggressive FMA generation
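A small worked example may help to see how a single rounding changes a result. The numbers below are chosen for illustration and are not from the slides; fl(·) denotes rounding to IEEE double precision (53 significant bits), which is an assumption about the data type in use.

```latex
\begin{align*}
a &= 1 + 2^{-30}, \qquad b = 1 - 2^{-30}, \qquad c = -1 \\
ab &= 1 - 2^{-60} && \text{(exact product)} \\
\mathrm{fl}(ab) &= 1 && \text{(rounded to 53 significant bits)} \\
\mathrm{fl}\bigl(\mathrm{fl}(ab) + c\bigr) &= 0 && \text{(separate multiply and add: two roundings)} \\
\mathrm{fl}(ab + c) &= -2^{-60} && \text{(FMA: one rounding)}
\end{align*}
```

Both answers are correctly rounded for the operation that was actually performed, yet one is zero and the other is not, which is the kind of difference that can accumulate in an application.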


Impact of Fused Multiply-Add (FMA) on Application Results

Feature                      PGI                  Cray
Listing                      -Mlist               -ra
Diagnostic                   -Minfo -Mneginfo     (produced by -ra)
Free format                  -Mfree               -f free
Preprocessing                -Mpreprocess         -eZ -F
Suggested Optimization       -fast                (default)
Aggressive Optimization      -Mipa=fast,inline    -O3,fp3
Variables size               -r8 -i8              -s real64 -s integer64
Byte swap                    -byteswapio          -h byteswapio
OpenMP recognition           -mp=nonuma           (default)
Automatic parallelization    -Mconcur             -h autothread

Cray and PGI compiler flags


• GNU (PrgEnv-gnu)

Suggested options: -O3 -ffast-math -funroll-loops

Compiler feedback: -ftree-vectorizer-verbose=2

OpenMP: -fopenmp

Man pages: gcc, gfortran, g++

• Intel (PrgEnv-intel)

Suggested options: -O3

Aggressive options: -ffast-math -funroll-loops -msse3 -ftree-vectorize

OpenMP: -openmp=on

Careful: an extra control thread is spawned, which causes issues when pinning threads to cores. Try aprun -cc [none|numa_node] instead of -cc cpu

Man pages: ifort, icc

Other programming environments


• Compiling on a Linux service node

• Generating an executable for a CLE compute node

• Do not use pgf90, pgcc, gcc, g++, ..., unless you want a Linux executable for the service node

Use ftn, cc, or CC instead

Cross Compiling Environment


Running an application on the Cray XE6

• ALPS : Application Level Placement Scheduler

• aprun is the ALPS application launcher

It must be used to run applications on the XE compute nodes

If aprun is not used, the application is launched on the MOM node (and will most likely fail)

The aprun man page contains several useful examples

There are at least 3 important parameters to control:

The total number of PEs: -n

The number of PEs per node: -N

The number of OpenMP threads: -d (more precisely: the 'stride' between 2 PEs in a node)

Running an application on the Cray XE ALPS + aprun


Some Definitions

• ALPS is always used for scheduling a job on the compute nodes. It does not care about the programming model you used. So we need a few general ‘definitions’ :

PE: Processing Element. Basically a Unix 'process'; it can be an MPI task, a CAF image, a UPC thread, …

Numa_node: The cores and memory on a node with 'flat' memory access; basically one of the 4 dies on the Opteron and the directly attached memory.

Thread: A thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different PEs do not share these resources. Most likely you will use OpenMP threads.


• Assuming an XE6 IL16 system (32 cores per node)

• Pure MPI application, using all the available cores in a node

$ aprun -n 32 ./a.out

• Pure MPI application, using only 1 core per node

32 MPI tasks, 32 nodes with 32*32 cores allocated

Can be done to increase the available memory for the MPI tasks

$ aprun -N 1 -n 32 -d 32 ./a.out (we'll talk about the need for the -d 32 later)

• Hybrid MPI/OpenMP application, 4 MPI ranks per node

32 MPI tasks, 8 OpenMP threads each

Need to set OMP_NUM_THREADS

$ export OMP_NUM_THREADS=8

$ aprun -n 32 -N 4 -d $OMP_NUM_THREADS ./a.out
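The node counts behind these examples follow from -n and -N alone. The sketch below, with made-up helper names `aprun_nodes` and `aprun_cores`, does the arithmetic under the same assumption as the slide (32 cores per node).

```shell
#!/bin/sh
# Nodes needed for $1 PEs with $2 PEs per node (rounded up).
aprun_nodes() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Cores allocated for the same job; $3 = cores per node (default 32).
aprun_cores() {
  echo $(( $(aprun_nodes "$1" "$2") * ${3:-32} ))
}

aprun_nodes 32 32   # pure MPI, full node:        1 node
aprun_nodes 32 1    # pure MPI, 1 PE per node:    32 nodes
aprun_cores 32 1    # cores allocated for that:   1024
aprun_nodes 32 4    # hybrid, 4 ranks per node:   8 nodes
```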

Running an application on the Cray XE6 some basic examples


• CNL can dynamically distribute work by allowing PEs and threads to migrate from one CPU to another within a node

• In some cases, moving PEs or threads from CPU to CPU increases cache and translation lookaside buffer (TLB) misses and therefore reduces performance

• CPU affinity options let you bind a PE or thread to a particular CPU or a subset of CPUs on a node

• aprun CPU affinity option (see man aprun)

Default setting: -cc cpu

PEs are bound to a specific core, depending on the -d setting

Binding PEs to a specific numa node: -cc numa_node

PEs are not bound to a specific core but cannot 'leave' their numa_node

No binding: -cc none

Own binding: -cc 0,4,3,2,1,16,18,31,9,…

aprun CPU Affinity control


• Cray XE6 systems use dual-socket compute nodes with 4 dies

Each die (8 cores) is considered a NUMA-node

• Remote-NUMA-node memory references can adversely affect performance. Even if your PEs and threads are bound to a specific numa_node, the memory used does not have to be 'local'

• aprun memory affinity options (see man aprun)

Suggested setting is -ss

A PE can only allocate the memory local to its assigned NUMA node. If this is not possible, your application will crash.
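When writing your own -cc lists it is easy to spread a PE over more than one die by accident. The sketch below checks this under a loudly stated assumption that is not from the slides: cores are numbered 0 to 31 and each NUMA node holds 8 consecutive core numbers. Both helper names are made up.

```shell
#!/bin/sh
# NUMA node of a core, ASSUMING 8 consecutively numbered cores per die.
numa_node_of_core() {
  echo $(( $1 / 8 ))
}

# Distinct NUMA nodes touched by a comma-separated core list.
numa_nodes_of_list() (
  echo "$1" | tr ',' '\n' | while read -r core; do
    numa_node_of_core "$core"
  done | sort -n -u | tr '\n' ' ' | sed 's/ $//'
)

numa_nodes_of_list "0,2,4,6"   # all cores on NUMA node 0
numa_nodes_of_list "6,8"       # spans NUMA nodes 0 and 1
```

A core list that reports more than one NUMA node is a candidate for remote memory references, even with -ss.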

Memory affinity control


Running an application on the Cray XT - MPMD

• aprun supports MPMD – Multiple Program Multiple Data

• Launching several executables in the same MPI_COMM_WORLD

$ aprun -n 128 exe1 : -n 64 exe2 : -n 64 exe3

• Notice: each executable needs a dedicated node; exe1 and exe2 cannot share a node.

Example: the following command needs 3 nodes

$ aprun -n 1 exe1 : -n 1 exe2 : -n 1 exe3

• Use a script to start several serial jobs on a node:

$ aprun -a xt -n 1 -cc none -d32 script.sh

$ cat script.sh

./exe1 &

./exe2 &

./exe3 &

wait


● In this mode, an MPI task is pinned to each integer core

● Implications

● Each core has exclusive access to an integer scheduler, integer pipelines and L1 Dcache

● The 256-bit FP unit and the L2 Cache are shared between the two cores

● 256-bit AVX instructions are dynamically executed as two 128-bit instructions if the 2nd FP unit is busy

● When to use

● Code is highly scalable to a large number of MPI ranks

● Code can run with a 1 GB per core memory footprint (or 2 GB on a 64 GB node)

● Code is not well vectorized

How to use the Interlagos 1/3: 1 MPI Rank on Each Integer Core Mode

[Figure: Interlagos compute unit (core pair) with Shared L2 Cache, Fetch, Decode, FP Scheduler, two 128-bit FMAC units, two Int Schedulers with their integer pipelines, and two L1 DCaches. Legend: MPI Task 0, MPI Task 1, Shared Components.]


● In this mode, only one integer core is used per core pair

● Implications

● This core has exclusive access to the 256-bit FP unit and is capable of 8 FP results per clock cycle

● The core has twice the memory capacity and memory bandwidth in this mode

● The L2 cache is effectively twice as large

● The peak of the chip is not reduced

● When to use

● Code is highly vectorized and makes use of AVX instructions

● Code needs more memory per MPI rank
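The memory argument can be turned into a quick calculation of how many ranks fit on a node and which -d value spreads them over the cores. This is a simplified sketch: the helper names are made up, memory is counted in whole GB, and the memory used by the operating system is ignored.

```shell
#!/bin/sh
# Ranks per node that fit, given node memory and memory needed per rank
# (both in GB). $3 = cores per node (default 32).
ranks_per_node() (
  node_gb=$1
  need_gb=$2
  cores=${3:-32}
  fit=$(( node_gb / need_gb ))
  [ "$fit" -gt "$cores" ] && fit=$cores
  echo "$fit"
)

# Matching -d value so that the ranks are spread over all cores of the node.
aprun_depth() {
  echo $(( ${2:-32} / $1 ))
}

ranks_per_node 32 1    # 1 GB per rank on a 32 GB node: 32 ranks
ranks_per_node 32 2    # 2 GB per rank on a 32 GB node: 16 ranks
aprun_depth 16         # 16 ranks per node: -d 2, one rank per core pair
```

With 16 ranks per node and -d 2, each rank gets a core pair to itself, which corresponds to the mode described above.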

How to use the Interlagos 2/3: Wide AVX mode

[Figure: Interlagos compute unit (core pair) with Shared L2 Cache, Fetch, Decode, FP Scheduler, two 128-bit FMAC units, two Int Schedulers with their integer pipelines, and two L1 DCaches. Legend: Idle Components, Active Components.]


● In this mode, an MPI task is pinned to a core pair

● OpenMP is used to run a thread on each integer core

● Implications

● Each OpenMP thread has exclusive access to an integer scheduler, integer pipelines and L1 Dcache

● The 256-bit FP unit and the L2 Cache are shared between the two threads

● 256-bit AVX instructions are dynamically executed as two 128-bit instructions if the 2nd FP unit is busy

● When to use

● Code needs a large amount of memory per MPI rank

● Code has OpenMP parallelism exposed in each MPI rank

How to use the Interlagos 3/3: 2-way OpenMP Mode

[Figure: Interlagos compute unit (core pair) with Shared L2 Cache, Fetch, Decode, FP Scheduler, two 128-bit FMAC units, two Int Schedulers with their integer pipelines, and two L1 DCaches. Legend: OpenMP Thread 0, OpenMP Thread 1, Shared Components.]


Aprun: cpu_lists for each PE

• CLE was updated to give more flexibility in the placement of threads and processing elements. This is ideal for processor architectures whose cores share resources that they may have to wait for.

Separating cpu_lists by colons (:) allows the user to specify the cores used by each processing element and its child processes or threads, which gives finer control over placement.

Here is an example with 4 PEs and 3 threads each:

aprun -n 4 -N 4 -cc 1,3,5:7,9,11:13,15,17:19,21,23

• Note: This feature will be modified in CLE 4.0.UP03; however, this option will still be valid.
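Typing long cpu_lists by hand is error-prone, so a generator can help. The sketch below uses a made-up helper, `cpu_lists`, that builds the colon-separated argument for -cc from the number of PEs, the threads per PE, the first core and the stride between cores. It reproduces the list from the example above.

```shell
#!/bin/sh
# Build a -cc argument: $1 = PEs, $2 = threads per PE,
# $3 = first core, $4 = stride between consecutive cores.
cpu_lists() (
  pes=$1
  threads=$2
  core=$3
  stride=$4
  out=""
  pe=0
  while [ "$pe" -lt "$pes" ]; do
    [ "$pe" -gt 0 ] && out="$out:"       # colon between PEs
    t=0
    while [ "$t" -lt "$threads" ]; do
      [ "$t" -gt 0 ] && out="$out,"      # comma between cores of one PE
      out="$out$core"
      core=$((core + stride))
      t=$((t + 1))
    done
    pe=$((pe + 1))
  done
  echo "$out"
)

cpu_lists 4 3 1 2   # 1,3,5:7,9,11:13,15,17:19,21,23
```

The result can be passed directly, for example as -cc "$(cpu_lists 4 3 1 2)".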


Running a batch application with Torque

• The number of required nodes and cores is determined by the parameters specified in the job header

#PBS -l mppwidth=256

#PBS -l mppnppn=4

This example uses 256/4=64 nodes

• The job is submitted by the qsub command

• At the end of the execution, output and error files are returned to the submission directory

• PBS environment variable: $PBS_O_WORKDIR

Set to the directory from which the job has been submitted

The job itself starts in $HOME by default

• man qsub for env. variables


Other Torque options

• #PBS -N job_name

the job name is used to determine the name of job output and error files

• #PBS -l walltime=hh:mm:ss

Maximum job elapsed time

should be indicated whenever possible: this allows Torque to determine the best scheduling strategy

• #PBS -j oe

job error and output files are merged in a single file

• #PBS -q queue

request execution on a specific queue


Torque and aprun

Torque                  aprun             Meaning
-lmppwidth=$PE          -n $PE            Number of PEs to start
-lmppdepth=$threads     -d $threads       Threads per PE
-lmppnppn=$N            -N $N             PEs per node
<none>                  -S $S             PEs per numa_node
-lmem=$size             -m $size[h|hs]    Per-PE required memory
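To keep the Torque header and the aprun line consistent, the aprun options can be derived from the same three numbers. The helper below, `aprun_opts`, is a made-up illustration of the mapping in the table; it covers only -n, -N and -d.

```shell
#!/bin/sh
# Build aprun options from Torque resource values.
# $1 = mppwidth, $2 = mppnppn (optional), $3 = mppdepth (optional)
aprun_opts() (
  opts="-n $1"
  [ -n "${2:-}" ] && opts="$opts -N $2"
  [ -n "${3:-}" ] && opts="$opts -d $3"
  echo "$opts"
)

aprun_opts 128 8 4   # -n 128 -N 8 -d 4
aprun_opts 512       # -n 512
```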


• -B will provide aprun with the Torque settings for -n, -N, -d and -m

aprun -B ./a.out

• Using -S can produce problems if you are not asking for a full node.

If possible, ALPS will only give you access to part of a node if the Torque settings allow this. The following will fail:

#PBS -lmppwidth=4 (not asking for a full node)

aprun -n4 -S1 … (trying to run on every die)

• The solution is to ask for a full node, even if aprun doesn't use it

Core specialization

• System ‘noise’ on compute nodes may significantly degrade scalability for some applications

• Core Specialization can mitigate this problem

1 core per node will be dedicated for system work (service core)

As many system interrupts as possible will be forced to execute on the service core

The application will not run on the service core

• Use aprun -r to get core specialization

$ aprun -r 1 -n 100 a.out

• apcount is provided to compute the total number of cores required

$ qsub -l mppwidth=$(apcount -r 1 1024 16) job

aprun -n 1024 -r 1 a.out


Core Specialization and MPI progress

Typical HPC application threads tend to run hot, i.e. they don’t typically make calls that result in yielding of the core on which they are scheduled

Because of this, MPI progress threads need to have at least one core of a compute unit available per node for efficient handling of interrupts received from Gemini

Core Specialization provides a convenient way to partition cores on a node between hot application threads, and cool system service daemon threads as well as MPI progress threads.

MPI Asynchronous Progress – enabling

export MPICH_NEMESIS_ASYNC_PROGRESS=1

export MPICH_MAX_THREAD_SAFETY=multiple

aprun -r 1 …


Running a batch application with Torque

• The number of required nodes can be specified in the job header

• The job is submitted by the qsub command

• At the end of the execution, output and error files are returned to the submission directory

• Environment variables are passed to the job with #PBS -V

• The job starts in the home directory. $PBS_O_WORKDIR contains the directory from which the job has been submitted

Hybrid MPI + OpenMP

#!/bin/bash

#PBS -N hybrid

#PBS -lwalltime=00:10:00

#PBS -lmppwidth=128

#PBS -lmppnppn=8

#PBS -lmppdepth=4

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=4

aprun -n128 -d4 -N8 a.out
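Before submitting a hybrid job like the one above, it is worth checking that the requested layout fits on a node. The sketch below uses two made-up helpers and assumes 32 cores per node unless told otherwise.

```shell
#!/bin/sh
# Succeeds if $1 PEs per node with $2 threads each fit on one node.
# $3 = cores per node (default 32).
fits_node() {
  [ $(( $1 * $2 )) -le "${3:-32}" ]
}

# Nodes used by a job: $1 = mppwidth, $2 = mppnppn (rounded up).
job_nodes() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# The script above: 8 PEs per node, 4 threads each, 128 PEs in total.
if fits_node 8 4; then
  echo "layout fits, job uses $(job_nodes 128 8) nodes"
else
  echo "layout oversubscribes the node" >&2
fi
```

For the job above, 8 x 4 = 32 threads fill a 32-core node exactly, and 128 PEs at 8 per node occupy 16 nodes.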


Starting an interactive session with Torque

• An interactive job can be started by the -I argument

That is <capital-i>

• Example: allocate 64 cores and export the environment variables to the job (-V)

$ qsub -I -V -lmppwidth=64 -lmppnppn=32

• This will give you a new prompt in your shell from which you can use aprun directly. Note that you are running on a MOM node (shared resource) if not using aprun


Watching a launched job on the Cray XE

• xtnodestat

Shows XE nodes allocation and aprun processes

Both interactive and batch jobs

• apstat

Shows aprun processes status

apstat overview

apstat -a [apid] info about all the applications or a specific one

apstat -n info about the status of the nodes

• Batch qstat command

shows batch jobs


Accounting at HLRS

• Accounting is done by examining the Torque log files and is based on the unix group id a user belongs to

Normally users do not have to do anything

• If a user is involved in several projects, the correct one has to be selected by setting the group id in the batch script:

#PBS -W group_list=<group name>


Lustre filesystem at HLRS

• In order to use Lustre at HLRS, you have to create a "workspace"

• HLRS provides a tool suite to create and manage the workspace

• To allocate a workspace : ws_allocate <name> <duration>

• To list your workspaces : ws_list

• After <duration>, the workspace is deleted. You can extend the <duration> 3 times.

• https://wickie.hlrs.de/platforms/index.php/Workspace_mechanism


Starting 512 MPI tasks (PEs)

#PBS -N MPIjob

#PBS -l mppwidth=512

#PBS -l mppnppn=32

#PBS -l walltime=01:00:00

#PBS -j oe

cd $PBS_O_WORKDIR

export MPICH_ENV_DISPLAY=1

export MALLOC_MMAP_MAX_=0

export MALLOC_TRIM_THRESHOLD_=-1

aprun -n 512 -cc cpu -ss ./a.out


Starting a hybrid job single node, 8 MPI tasks, each with 4 threads

#PBS -N hybrid

#PBS -l mppwidth=8

#PBS -l mppnppn=8

#PBS -l mppdepth=4


#PBS -j oe

cd $PBS_O_WORKDIR





export OMP_NUM_THREADS=4

aprun -n8 -N8 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out


Starting an MPMD job on a non-default project ID using 1 master, 16 workers, each with 8 threads

#PBS -N hybrid

# Note: 5 nodes * 32 cores = 160 cores

#PBS -l mppwidth=160

#PBS -l mppnppn=32


#PBS -j oe

#PBS -W group_list=My_Project

cd $PBS_O_WORKDIR





export OMP_NUM_THREADS=8

id # Unix command 'id', to check group id

aprun -n1 -d32 -N1 ./master.exe : \
-n 16 -N4 -d $OMP_NUM_THREADS -cc cpu -ss ./worker.exe


Starting an MPI job on two nodes using only every second integer core

#PBS -N hybrid

#PBS -l mppwidth=32

#PBS -l mppnppn=16

#PBS -l mppdepth=2


#PBS -j oe

cd $PBS_O_WORKDIR


aprun -n32 -N16 -d 2 -cc cpu -ss ./a.out


Starting a hybrid job on two nodes using only every second integer core

#PBS -N hybrid

#PBS -l mppwidth=32

#PBS -l mppnppn=16

#PBS -l mppdepth=2


#PBS -j oe

cd $PBS_O_WORKDIR



export OMP_NUM_THREADS=2

aprun -n32 -N16 -d $OMP_NUM_THREADS \
-cc 0,2:4,6:8,10:12,14:16,18:20,22:24,26:28,30 -ss ./a.out


• HLRS wiki

https://wickie.hlrs.de/platforms/index.php/Cray_XE6

• Cray docs site

http://docs.cray.com

• Starting point for Cray XE info

http://docs.cray.com/cgi-bin/craydoc.cgi?mode=SiteMap;f=xe_sitemap

• Twitter ?!?

http://twitter.com/craydocs

Documentation


End
