HPC Technologies for Weather and Climate Simulations
David Barkai, Ph.D., HPC Computational Architect
High Performance Computing, Digital Enterprise Group, Intel Corporation
13th Workshop on the Use of High Performance Computing in Meteorology
Shinfield Park, Reading, UK, Nov. 3-7, 2008
What we’ll talk about
• The Big Picture
• Nehalem is coming..
• NWS on Clusters
The Big Picture..
Intel HPC Vision
A world in which Intel based supercomputers enable the major
breakthroughs in science, medicine & engineering…
From exploration to production
Mission:
• Create and maintain a technology leadership position for Intel at the highest end
of computing – drive the path to TeraScale processors & ExaScale systems
• Drive a valuable technology pipeline for Intel’s volume business
• Grow the use of HPC across all segments from the office to the datacenter
Inertia For The Insatiable Demand For Performance
Moore’s Law:
• Transistor Density doubles every two years
• More than 40 years old
• 37 years of “free” performance gains
• Qualitative change a few years ago: Multicore
• High performance requires multithreading
• Intel Xeon 7400 series just announced with 6 cores
Large scale deployments consistently outpacing Moore’s Law
In Search Of (Even) More Performance
Architecture Evolution: A Collision Course?
[Diagram: throughput performance vs. programmability, from fixed function through partially programmable to fully programmable, spanning multi-threading, multi-core, and many-core designs]
CPU:
• Evolving toward throughput computing
• Motivated by energy-efficient performance
GPU:
• Evolving toward general-purpose computing
• Motivated by higher-quality graphics and GP-GPU usages
Intel’s Terascale* Research Program
• High Bandwidth I/O & Communications
• Stacked, Shared Memory
• Scalable Multi-core Architectures
• Parallel Programming Tools & Techniques
• Thread-Aware Execution Environment
Model-Based Applications:
• Virtual Environments
• Financial Modeling
• Media Search & Manipulation
• Web Mining Bots
• Educational Simulation
* TeraFLOP Processors
The Path To ExaScale Systems
[Chart: performance growth from 1 GF to 100 TF over 2004–2020, with ~2 TF, ~10 TF, and ~50 TF waypoints]
…and it’s not just about CPU performance…
Conceptual roadmap – not for planning purposes
What’s coming for Servers by Intel
Nehalem Core: Recap
[Block diagram: instruction fetch & L1 cache; branch prediction, instruction decode & microcode; out-of-order scheduling & retirement; execution units; memory ordering & execution; paging; L1 data cache; L2 cache & interrupt servicing]
Enhancements:
• Additional Caching Hierarchy
• New SSE4.2 Instructions
• Deeper Buffers
• Faster Virtualization
• Simultaneous Multi-Threading
• Better Branch Prediction
• Improved Lock Support
• Improved Loop Streaming
Nehalem Based System Architecture
Benefits:
• More application performance
• Improved energy efficiency
• Improved Virtualization Technology
[Diagram: two Nehalem sockets linked by QPI to each other and to the I/O Hub (PCI Express* Gen 1, 2), with DMI to the ICH]
Functional system demonstrated at IDF, Sept 2007
Extending Today’s Leadership:
• Production Q4’08
• Volume ramp Q1’09
Key Technologies:
• New 45nm Intel® Microarchitecture
• New Intel® QuickPath Interconnect (QPI) – up to 25.6 GB/sec bandwidth per link
• Integrated Memory Controller
• Next Generation Memory (DDR3)
• PCI Express Gen 2
All future products, dates, and figures are preliminary and are subject to change without notice.
QuickPath Interconnect
• Nehalem introduces the new QuickPath Interconnect (QPI)
• High-bandwidth, low-latency point-to-point interconnect
• Up to 6.4 GT/sec initially
  – 6.4 GT/sec -> 12.8 GB/sec per direction
  – Full duplex -> 25.6 GB/sec per link
  – Future implementations at even higher speeds
• Highly scalable for systems with a varying number of sockets
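As a quick sanity check on these figures, a minimal sketch of the arithmetic (the 2 bytes of payload per transfer is implied by the GT/sec-to-GB/sec conversion above):

```c
#include <stdio.h>

int main(void) {
    double transfers_per_sec = 6.4;   /* GT/sec */
    double bytes_per_transfer = 2.0;  /* implied by 6.4 GT/sec -> 12.8 GB/sec */
    double one_way = transfers_per_sec * bytes_per_transfer;  /* GB/sec per direction */
    double per_link = 2.0 * one_way;                          /* full duplex */
    printf("%.1f GB/sec per direction, %.1f GB/sec per link\n",
           one_way, per_link);
    return 0;
}
```

This prints 12.8 GB/sec per direction and 25.6 GB/sec per link, matching the bullets above.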
[Diagrams: a 2-socket Nehalem-EP system — two CPUs, each with local memory, linked by QPI to each other and to the IOH — and a 4-socket system with QPI links among the four CPUs and to the IOH]
Core/Uncore Modularity
Optimal price / performance / energy efficiency for server, desktop and mobile products
Differentiation in the “Uncore”:
• # QPI links
• # memory channels
• Size of cache
• # cores
• Power management
• Type of memory
• Integrated graphics
[Diagram: 2008–2009 servers & desktops — several cores plus an uncore containing the L3 cache, integrated memory controller (IMC) to DRAM, QPI links, and power & clock logic]
QPI: Intel® QuickPath Interconnect
Nehalem Core: common from mobile to server
Nehalem Uncore: differentiates the product segments
Characterizing NWS on Clusters
• WRF
• POP
• CAM
• HOMME
By Intel’s NWS application engineering team:
Roman Dubtsov, Alexander Semenov, Dmitry Shkurko, Alex Kosenkov, and Mike Greenfield
Characterization methodology overview
• The characterization should answer these questions:
  – How scalable is the application?
  – Is the application bandwidth limited?
  – How strong is the MPI and I/O impact?
  – How will it work on other platforms?
• Tools used for characterization:
  – VTune for FSB utilization measurement
  – IOP* (AFT-based tool) for I/O impact evaluation
  – IPM (http://ipm-hpc.sf.net) for MPI-related measurements
• Methodology:
  – Profiles are collected for different process-pinning configurations that incrementally increase the resources available to the benchmark.
  – At each transition, performance improvements are noted and profiles compared (see the sketch below).
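As a minimal sketch of how such transition speedups might be tabulated (the walltimes here are made-up numbers for illustration, not measured data):

```c
#include <stdio.h>

/* Walltimes (seconds) for one workload under each pinning mode,
 * ordered so that each step doubles one per-process resource.
 * These values are hypothetical. */
static const char *modes[] = {"normal", "2x interconnect", "2x FSB", "2x L2"};
static const double walltime[] = {100.0, 97.0, 78.0, 74.0};

int main(void) {
    /* Sensitivity to a resource = speedup gained when that resource
     * is doubled by moving to the next pinning mode. */
    for (int i = 0; i < 3; ++i) {
        double speedup = walltime[i] / walltime[i + 1] - 1.0;
        printf("%s -> %s: %.1f%%\n", modes[i], modes[i + 1], 100.0 * speedup);
    }
    return 0;
}
```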
Components for Pinning Setup
[Diagram: a two-socket compute node — four L2$ (two per socket), two FSBs into the chipset, memory attached to the chipset, and the fabric (IB HCA) on PCI-X]
Process Pinning Setup
Endeavor compute node: 2 sockets with quad-core CPUs; each CPU has 2 core pairs sharing an L2$.
[Diagrams: four pinning configurations of the 8-core node, showing which cores are active on each L2$, each FSB, and the interconnect]
• 8 active cores, “normal” mode
• 4 active cores, “2x interconnect” mode – MPI sensitivity assessment
• 4 active cores, “2x FSB” mode – FSB sensitivity assessment
• 4 active cores, “2x L2” mode – L2 sensitivity assessment
Changes in performance from a transition between one pinning configuration and another indicate sensitivity to the respective resource; a pinning sketch follows.
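For reference, a minimal sketch of how a process can pin itself to one core on Linux (the core ID in a real run would be chosen according to the mode diagrams above; MPI launchers also expose this through their own pinning controls):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process to a single logical CPU. */
static void pin_to_core(int core_id) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    /* E.g., in "2x L2" mode each active core would own a whole L2$;
     * the specific ID here is only an illustration. */
    pin_to_core(0);
    printf("pinned to core 0\n");
    return 0;
}
```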
WRF3.0/CONUS12: Performance
[Charts: mean step time (seconds) and pinning-mode sensitivity (speedup from transitions between modes) vs. # of cores (16–128), for baseline and optimized configurations, covering the normal, 2x interconnect, 2x FSB, and 2x L2 modes]
Clear dependency on FSB bandwidth. Optimized configurations show
somewhat lower sensitivity to FSB and L2$. Not sensitive to interconnect.
WRF3.0/CONUS12: Memory
[Charts: L2 misses per instruction and FSB bandwidth utilization (relative to Triad) vs. # of cores (16–128), for baseline and optimized configurations across the four pinning modes]
Optimized configuration shows better FSB and L2$ utilization.
WRF2.2/IVAN: Hybrid helps..
• The configuration for N cores is either N MPI processes, or N/2 MPI processes with 2 OpenMP threads each (a minimal hybrid skeleton follows the chart)
• This breakdown was computed using Intel® Trace Collector, Intel® Thread Checker, and the profiling OpenMP library from the Intel® Fortran/C Compilers
• “OpenMP time” is time spent in OpenMP regions regardless of the number of threads
• There are some improvements in the serial parts
[Chart: WRF2.2/IVAN (6h simulation), MPI+OpenMP computation breakdown — OpenMP time, MPI time, and serial non-MPI time (seconds) for 1-thread vs. 2-thread configurations at 64, 128, and 256 cores]
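For readers unfamiliar with the hybrid setup, a minimal MPI+OpenMP skeleton of the kind of configuration described above (a generic illustration, not WRF code):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Request thread support, since OpenMP threads coexist with MPI. */
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each of the N/2 ranks runs 2 OpenMP threads over its subdomain. */
    #pragma omp parallel num_threads(2)
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Build with an MPI compiler wrapper and OpenMP enabled, e.g. mpicc -fopenmp.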
Even better for WRF CONUS 2.5km
[Chart: WRFV3/CONUS2.5km simulation speed vs. # of cores (256–1856) on HTN, baseline vs. optimized]
POP/x1 characterization: Harpertown
Assessment summary
The baseline configuration is clearly FSB limited at lower core counts, as there is a significant speedup from the 2x-interconnect to the 2x-FSB runs. At higher core counts, additional interconnect BW and L2 begin to give benefit. Speedups from 2x interconnect BW are low.
[Charts: POP 2.0.1/x1 on HTN — speedup from each pinning-mode transition, and walltime with its MPI and I/O components (seconds), vs. # of MPI processes (8–128), for optimized and baseline configurations]
POP/x1 characterization: Harpertown
Memory subsystem impact assessment
The workload is sensitive to cache size (its working set is comparable to the cache size). On average the FSB is not completely saturated: the “2x FSB” configuration consumes about 0.5 of Triad bandwidth (STREAM Triad; see the sketch below). The optimized version shows consistent behavior.
[Charts: POP 2.0.1/x1 on HTN — L2 misses per 1000 instructions, and BW utilization as a fraction of the theoretical single-process/thread maximum (with the 8-thread Triad level for reference), vs. # of MPI processes (8–128), optimized and baseline]
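The Triad reference above is the STREAM Triad kernel, the usual yardstick for achievable memory bandwidth. A minimal sketch (array length arbitrary; real STREAM times the loop and repeats it):

```c
#include <stdio.h>

#define N (1 << 24)  /* large enough to exceed the caches */

static double a[N], b[N], c[N];

int main(void) {
    const double scalar = 3.0;
    for (int i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; }

    /* STREAM Triad: three 8-byte accesses per iteration
     * (two reads, one write), so bytes moved = 24 * N. */
    for (int i = 0; i < N; ++i)
        a[i] = b[i] + scalar * c[i];

    printf("a[0] = %f\n", a[0]);  /* keep the loop from being optimized away */
    return 0;
}
```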
CAM performance
• Drops in the curve are explained by suboptimal decomposition at certain points
• The same decompositions and tuning options were used for both configurations
• Based on 3- or 6-day forecasts
• Speed is calculated from integration time (the “stepon” timer); see the sketch after the chart
[Chart: simulation speed with full load balancing — simulated years per day vs. core count (log-log, 10–10,000) for a cluster based on dual-core Intel® Xeon® 5100 series processors and a cluster based on quad-core Intel® Xeon® 5400 series processors, peaking around 19 simulated years per day]
CAM, D-Grid workload, scales to 2,000 cores on Infiniband cluster
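“Simulated years per day” is the standard climate-model throughput metric; a minimal sketch of the computation (the sample numbers are hypothetical, not taken from the chart):

```c
#include <stdio.h>

/* Throughput = simulated time / wallclock time, expressed
 * as simulated years per wallclock day. */
static double simulated_years_per_day(double sim_days, double wall_seconds) {
    double wall_days = wall_seconds / 86400.0;
    return (sim_days / 365.0) / wall_days;
}

int main(void) {
    /* Hypothetical: a 6-day forecast integrated in 120 s of walltime. */
    printf("%.1f simulated years/day\n", simulated_years_per_day(6.0, 120.0));
    return 0;
}
```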
HOMME 30km characterization: Scalability assessment
Ideal scalability at small core counts, and still good at larger core counts. The RDSSM MPI device improves scalability and walltime by up to 9% compared to RDMA. Dependence on interconnect bandwidth and MPI is moderate (a sketch of the scalability metric follows the charts).
[Charts: HOMME 30 km on HTN — walltime (seconds) at 128–1536 cores, and relative scalability at 256–1536 cores, for the RDSSM and RDMA MPI devices]
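The scalability values plotted near 1.0 are presumably parallel efficiency relative to the smallest run; a minimal sketch under that assumption (the walltimes below are hypothetical):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical walltimes (seconds) at each core count. */
    const int cores[] = {128, 256, 512, 1024};
    const double t[] = {4000.0, 2050.0, 1080.0, 600.0};
    const int n = 4;

    /* Efficiency vs. the 128-core run: perfect scaling keeps
     * cores * walltime constant, giving an efficiency of 1.0. */
    for (int i = 0; i < n; ++i) {
        double eff = (double)cores[0] * t[0] / ((double)cores[i] * t[i]);
        printf("%5d cores: efficiency %.2f\n", cores[i], eff);
    }
    return 0;
}
```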
In Closing..
It’s an exciting time to be in HPC
HPC demand is high and growing faster than the general server market
We are beginning to gain a quantified understanding of the opportunities in deploying large clusters for NWS
Backup