HPC Technologies for Weather and Climate Simulations
David Barkai, Ph.D., HPC Computational Architect
High Performance Computing, Digital Enterprise Group, Intel Corporation
13th Workshop on the Use of High Performance Computing in Meteorology
Shinfield Park, Reading, UK, Nov. 3-7, 2008
What we’ll talk about
• The Big Picture
• Nehalem is coming..
• NWS on Clusters
The Big Picture..
Intel HPC Vision
A world in which Intel based supercomputers enable the major
breakthroughs in science, medicine & engineering…
From exploration to production
Mission:
• Create and maintain a technology leadership position for Intel at the highest end
of computing – drive the path to TeraScale processors & ExaScale systems
• Drive a valuable technology pipeline for Intel’s volume business
• Grow the use of HPC across all segments from the office to the datacenter
Inertia For The Insatiable Demand For Performance
Moore’s Law:
• Transistor Density doubles every two years
• More than 40 years old
• 37 years of “free” performance gains
• Qualitative change a few years ago: Multicore
• High performance requires multithreading
• Intel Xeon 7400 series just announced with 6 cores
Large scale deployments consistently outpacing Moore’s Law
In Search Of (Even) More Performance
Architecture Evolution: A Collision Course?
[Diagram: throughput performance vs. programmability, from fixed function through partially programmable to fully programmable, spanning multi-threading, multi-core, and many-core designs]
CPU:
• Evolving toward throughput computing
• Motivated by energy-efficient performance
GPU:
• Evolving toward general-purpose computing
• Motivated by higher-quality graphics and GP-GPU usages
Intel’s Terascale* Research Program
• High Bandwidth I/O & Communications
• Stacked, Shared Memory
• Scalable Multi-core Architectures
• Parallel Programming Tools & Techniques
• Thread-Aware Execution Environment
Model-Based Applications:
• Virtual Environments
• Financial Modeling
• Media Search & Manipulation
• Web Mining Bots
• Educational Simulation
* TeraFLOP Processors
The Path To ExaScale Systems
[Chart: performance growth from 1 GF to 100 TF over 2004–2020, with ~2 TF, ~10 TF, and ~50 TF waypoints]
…and it’s not just about CPU performance…
Conceptual roadmap – not for planning purposes
What’s coming for Servers by Intel
Nehalem Core: Recap
[Block diagram: instruction fetch & L1 cache; branch prediction, instruction decode & microcode; out-of-order scheduling & retirement; execution units; memory ordering & execution; paging; L1 data cache; L2 cache & interrupt servicing]
Enhancements:
• Additional Caching Hierarchy
• New SSE4.2 Instructions
• Deeper Buffers
• Faster Virtualization
• Simultaneous Multi-Threading
• Better Branch Prediction
• Improved Lock Support
• Improved Loop Streaming
Nehalem Based System Architecture
Benefits:
• More application performance
• Improved energy efficiency
• Improved Virtualization Technology
[Diagram: two Nehalem sockets linked by QPI to each other and to the I/O Hub (PCI Express* Gen 1, 2), with DMI to the ICH]
Functional system demonstrated at IDF, Sept 2007
Extending Today’s Leadership:
• Production Q4’08
• Volume ramp Q1’09
Key Technologies:
• New 45nm Intel® Microarchitecture
• New Intel® QuickPath Interconnect (QPI) – up to 25.6 GB/sec bandwidth per link
• Integrated Memory Controller
• Next Generation Memory (DDR3)
• PCI Express Gen 2
All future products, dates, and figures are preliminary and are subject to change without notice.
QuickPath Interconnect
• Nehalem introduces the new QuickPath Interconnect (QPI)
• High-bandwidth, low-latency point-to-point interconnect
• Up to 6.4 GT/sec initially
  – 6.4 GT/sec -> 12.8 GB/sec per direction
  – Full duplex -> 25.6 GB/sec per link
  – Future implementations at even higher speeds
• Highly scalable for systems with a varying number of sockets
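As a quick sanity check on these figures, a minimal sketch of the arithmetic (the 2 bytes of payload per transfer is implied by the GT/sec-to-GB/sec conversion above):

```c
#include <stdio.h>

int main(void) {
    double transfers_per_sec = 6.4;   /* GT/sec */
    double bytes_per_transfer = 2.0;  /* implied by 6.4 GT/sec -> 12.8 GB/sec */
    double one_way = transfers_per_sec * bytes_per_transfer;  /* GB/sec per direction */
    double per_link = 2.0 * one_way;                          /* full duplex */
    printf("%.1f GB/sec per direction, %.1f GB/sec per link\n",
           one_way, per_link);
    return 0;
}
```

This prints 12.8 GB/sec per direction and 25.6 GB/sec per link, matching the bullets above.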
[Diagrams: a 2-socket Nehalem-EP system — two CPUs, each with local memory, linked by QPI to each other and to the IOH — and a 4-socket system with QPI links among the four CPUs and to the IOH]
Core/Uncore Modularity
Optimal price / performance / energy efficiency for server, desktop and mobile products
Differentiation in the “Uncore”:
• # QPI links
• # memory channels
• Size of cache
• # cores
• Power management
• Type of memory
• Integrated graphics
[Diagram: 2008–2009 servers & desktops — several cores plus an uncore containing the L3 cache, integrated memory controller (IMC) to DRAM, QPI links, and power & clock logic]
QPI: Intel® QuickPath Interconnect
Nehalem Core: common from mobile to server
Nehalem Uncore: differentiates the product segments
Characterizing NWS on Clusters
• WRF
• POP
• CAM
• HOMME
By Intel’s NWS application engineering team:
Roman Dubtsov, Alexander Semenov, Dmitry Shkurko, Alex Kosenkov, and Mike Greenfield
Characterization methodology overview
• The characterization should answer these questions:
  – How scalable is the application?
  – Is the application bandwidth limited?
  – How strong is the MPI and I/O impact?
  – How will it work on other platforms?
• Tools used for characterization:
  – VTune for FSB utilization measurement
  – IOP* (AFT-based tool) for I/O impact evaluation
  – IPM (http://ipm-hpc.sf.net) for MPI-related measurements
• Methodology:
  – Profiles are collected for different process-pinning configurations that incrementally increase the resources available to the benchmark.
  – At each transition, performance improvements are noted and profiles compared (see the sketch below).
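As a minimal sketch of how such transition speedups might be tabulated (the walltimes here are made-up numbers for illustration, not measured data):

```c
#include <stdio.h>

/* Walltimes (seconds) for one workload under each pinning mode,
 * ordered so that each step doubles one per-process resource.
 * These values are hypothetical. */
static const char *modes[] = {"normal", "2x interconnect", "2x FSB", "2x L2"};
static const double walltime[] = {100.0, 97.0, 78.0, 74.0};

int main(void) {
    /* Sensitivity to a resource = speedup gained when that resource
     * is doubled by moving to the next pinning mode. */
    for (int i = 0; i < 3; ++i) {
        double speedup = walltime[i] / walltime[i + 1] - 1.0;
        printf("%s -> %s: %.1f%%\n", modes[i], modes[i + 1], 100.0 * speedup);
    }
    return 0;
}
```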
Components for Pinning Setup
[Diagram: a two-socket compute node — four L2$ (two per socket), two FSBs into the chipset, memory attached to the chipset, and the fabric (IB HCA) on PCI-X]
Process Pinning Setup
Endeavor compute node: 2 sockets with quad-core CPUs; each CPU has 2 core pairs sharing an L2$.
[Diagrams: four pinning configurations of the 8-core node, showing which cores are active on each L2$, each FSB, and the interconnect]
• 8 active cores, “normal” mode
• 4 active cores, “2x interconnect” mode – MPI sensitivity assessment
• 4 active cores, “2x FSB” mode – FSB sensitivity assessment
• 4 active cores, “2x L2” mode – L2 sensitivity assessment
Changes in performance from a transition between one pinning configuration and another indicate sensitivity to the respective resource; a pinning sketch follows.
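For reference, a minimal sketch of how a process can pin itself to one core on Linux (the core ID in a real run would be chosen according to the mode diagrams above; MPI launchers also expose this through their own pinning controls):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process to a single logical CPU. */
static void pin_to_core(int core_id) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    /* E.g., in "2x L2" mode each active core would own a whole L2$;
     * the specific ID here is only an illustration. */
    pin_to_core(0);
    printf("pinned to core 0\n");
    return 0;
}
```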
WRF3.0/CONUS12: Performance
[Charts: mean step time (seconds) and pinning-mode sensitivity (speedup from transitions between modes) vs. # of cores (16–128), for baseline and optimized configurations, covering the normal, 2x interconnect, 2x FSB, and 2x L2 modes]
Clear dependency on FSB bandwidth. Optimized configurations show
somewhat lower sensitivity to FSB and L2$. Not sensitive to interconnect.
WRF3.0/CONUS12: Memory
[Charts: L2 misses per instruction and FSB bandwidth utilization (relative to Triad) vs. # of cores (16–128), for baseline and optimized configurations across the four pinning modes]
Optimized configuration shows better FSB and L2$ utilization.
WRF2.2/IVAN: Hybrid helps..
• The configuration for N cores is either N MPI processes, or N/2 MPI processes with 2 OpenMP threads each (a minimal hybrid skeleton follows the chart)
• This breakdown was computed using Intel® Trace Collector, Intel® Thread Checker, and the profiling OpenMP library from the Intel® Fortran/C Compilers
• “OpenMP time” is time spent in OpenMP regions regardless of the number of threads
• There are some improvements in the serial parts
[Chart: WRF2.2/IVAN (6h simulation), MPI+OpenMP computation breakdown — OpenMP time, MPI time, and serial non-MPI time (seconds) for 1-thread vs. 2-thread configurations at 64, 128, and 256 cores]
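For readers unfamiliar with the hybrid setup, a minimal MPI+OpenMP skeleton of the kind of configuration described above (a generic illustration, not WRF code):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Request thread support, since OpenMP threads coexist with MPI. */
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each of the N/2 ranks runs 2 OpenMP threads over its subdomain. */
    #pragma omp parallel num_threads(2)
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Build with an MPI compiler wrapper and OpenMP enabled, e.g. mpicc -fopenmp.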
Even better for WRF CONUS 2.5km
[Chart: WRFV3/CONUS2.5km simulation speed vs. # of cores (256–1856) on HTN, baseline vs. optimized]
POP/x1 characterization: Harpertown
Assessment summary
The baseline configuration is clearly FSB limited at lower core counts, as there is a significant speedup from the 2x-interconnect to the 2x-FSB runs. At higher core counts, additional interconnect BW and L2 begin to give benefit. Speedups from 2x interconnect BW are low.
[Charts: POP 2.0.1/x1 on HTN — speedup from each pinning-mode transition, and walltime with its MPI and I/O components (seconds), vs. # of MPI processes (8–128), for optimized and baseline configurations]
POP/x1 characterization: Harpertown
Memory subsystem impact assessment
The workload is sensitive to cache size (its working set is comparable to the cache size). On average the FSB is not completely saturated: the “2x FSB” configuration consumes about 0.5 of Triad bandwidth (STREAM Triad; see the sketch below). The optimized version shows consistent behavior.
[Charts: POP 2.0.1/x1 on HTN — L2 misses per 1000 instructions, and BW utilization as a fraction of the theoretical single-process/thread maximum (with the 8-thread Triad level for reference), vs. # of MPI processes (8–128), optimized and baseline]
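The Triad reference above is the STREAM Triad kernel, the usual yardstick for achievable memory bandwidth. A minimal sketch (array length arbitrary; real STREAM times the loop and repeats it):

```c
#include <stdio.h>

#define N (1 << 24)  /* large enough to exceed the caches */

static double a[N], b[N], c[N];

int main(void) {
    const double scalar = 3.0;
    for (int i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; }

    /* STREAM Triad: three 8-byte accesses per iteration
     * (two reads, one write), so bytes moved = 24 * N. */
    for (int i = 0; i < N; ++i)
        a[i] = b[i] + scalar * c[i];

    printf("a[0] = %f\n", a[0]);  /* keep the loop from being optimized away */
    return 0;
}
```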
CAM performance
• Drops in the curve are explained by suboptimal decomposition at certain points
• The same decompositions and tuning options were used for both configurations
• Based on 3- or 6-day forecasts
• Speed is calculated from integration time (the “stepon” timer); see the sketch after the chart
[Chart: simulation speed with full load balancing — simulated years per day vs. core count (log-log, 10–10,000) for a cluster based on dual-core Intel® Xeon® 5100 series processors and a cluster based on quad-core Intel® Xeon® 5400 series processors, peaking around 19 simulated years per day]
CAM, D-Grid workload, scales to 2,000 cores on Infiniband cluster
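“Simulated years per day” is the standard climate-model throughput metric; a minimal sketch of the computation (the sample numbers are hypothetical, not taken from the chart):

```c
#include <stdio.h>

/* Throughput = simulated time / wallclock time, expressed
 * as simulated years per wallclock day. */
static double simulated_years_per_day(double sim_days, double wall_seconds) {
    double wall_days = wall_seconds / 86400.0;
    return (sim_days / 365.0) / wall_days;
}

int main(void) {
    /* Hypothetical: a 6-day forecast integrated in 120 s of walltime. */
    printf("%.1f simulated years/day\n", simulated_years_per_day(6.0, 120.0));
    return 0;
}
```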
HOMME 30km characterization: Scalability assessment
Ideal scalability at small core counts, and still good at larger core counts. The RDSSM MPI device improves scalability and walltime by up to 9% compared to RDMA. Dependence on interconnect bandwidth and MPI is moderate (a sketch of the scalability metric follows the charts).
[Charts: HOMME 30 km on HTN — walltime (seconds) at 128–1536 cores, and relative scalability at 256–1536 cores, for the RDSSM and RDMA MPI devices]
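The scalability values plotted near 1.0 are presumably parallel efficiency relative to the smallest run; a minimal sketch under that assumption (the walltimes below are hypothetical):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical walltimes (seconds) at each core count. */
    const int cores[] = {128, 256, 512, 1024};
    const double t[] = {4000.0, 2050.0, 1080.0, 600.0};
    const int n = 4;

    /* Efficiency vs. the 128-core run: perfect scaling keeps
     * cores * walltime constant, giving an efficiency of 1.0. */
    for (int i = 0; i < n; ++i) {
        double eff = (double)cores[0] * t[0] / ((double)cores[i] * t[i]);
        printf("%5d cores: efficiency %.2f\n", cores[i], eff);
    }
    return 0;
}
```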
In Closing..
It’s an exciting time to be in HPC
HPC demand is high and growing faster than the general server market
We are beginning to gain a quantified understanding of the opportunities in deploying large clusters for NWS
Backup