Weather Research and Forecasting (WRF)
Performance Benchmark and Profiling
June 2015
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: Intel, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The following was done to provide best practices
– WRF performance overview
– Understanding WRF communication patterns
– Ways to increase WRF productivity
– MPI libraries comparisons
• For more info please refer to
– http://www.dell.com
– http://www.intel.com
– http://www.mellanox.com
– http://wrf-model.org
Weather Research and Forecasting (WRF)
• The Weather Research and Forecasting (WRF) Model
– Numerical weather prediction system
– Designed for operational forecasting and atmospheric research
• WRF developed by
– National Center for Atmospheric Research (NCAR)
– The National Centers for Environmental Prediction (NCEP)
– Forecast Systems Laboratory (FSL)
– Air Force Weather Agency (AFWA)
– Naval Research Laboratory
– Oklahoma University
– Federal Aviation Administration (FAA)
WRF Usage
• The WRF model includes
– Real-data and idealized simulations
– Various lateral boundary condition options
– Full physics options
– Non-hydrostatic and hydrostatic
– One-way, two-way nesting and moving nest
– Applications ranging from meters to thousands of kilometers
Objectives
• The presented research was done to provide best practices
– WRF performance benchmarking
• CPU performance comparison
• MPI library performance comparison
• Interconnect performance comparison
• System generations comparison
• The presented results will demonstrate
– The scalability of the compute environment/application
– Considerations for higher productivity and efficiency
Test Cluster Configuration
• Dell PowerEdge R730 32-node (896-core) “Thor” cluster
– Dual-Socket 14-Core Intel E5-2697v3 @ 2.60 GHz CPUs (Power Management in BIOS set to Maximum Performance)
– Memory: 64GB DDR4 2133 MHz; Memory Snoop Mode in BIOS set to Home Snoop, Turbo enabled
– OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW stack
– Hard Drives: 2x 1TB 7.2K RPM SATA 2.5” in RAID 1
• Mellanox ConnectX-4 100Gb/s EDR InfiniBand Adapters
• Mellanox Switch-IB SB7700 36-port 100Gb/s EDR InfiniBand Switch
• Mellanox Connect-IB FDR InfiniBand Adapter, Mellanox ConnectX-3 40GbE Ethernet VPI Adapters
• Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet Switch
• MPI: Mellanox HPC-X v1.2.0-326, Intel MPI 5.0.3
• Compilers: Intel Composer XE 2015.3.187 (WRF built with the “SNB with AVX mods” configure option, using the ifort and icc compilers)
• Application: WRF 3.6.1, Libraries: NetCDF 4.1.3
• Benchmarks: CONUS-12km - 48-hour, 12km resolution case over the Continental US from October 24, 2001
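A minimal sketch of how a WRF 3.6.1 build with the components above might look. The install path and the exact configure menu entry are assumptions for illustration; the “SNB with AVX mods” option number varies by platform and WRF version.

```shell
# Hypothetical build sketch for WRF 3.6.1 with Intel compilers and NetCDF 4.1.3.
# NETCDF prefix and directory layout are assumed, not taken from the study.
export NETCDF=/opt/netcdf-4.1.3   # NetCDF install prefix (assumed path)
export CC=icc
export FC=ifort

cd WRFV3
./configure          # interactively select the Intel dmpar option,
                     # e.g. "Xeon SNB with AVX mods", as used in this study
./compile em_real    # builds wrf.exe and real.exe for real-data cases
                     # (the CONUS-12km benchmark is a real-data case)
```

The dmpar (distributed-memory parallel) build is the one launched under MPI in the following slides.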
PowerEdge R730: Massive flexibility for data-intensive operations
• Performance and efficiency
– Intelligent hardware-driven systems management
with extensive power management features
– Innovative tools including automation for
parts replacement and lifecycle manageability
– Broad choice of networking technologies from GigE to IB
– Built in redundancy with hot plug and swappable PSU, HDDs and fans
• Benefits
– Designed for performance workloads
• from big data analytics, distributed storage or distributed computing where local storage is key,
to classic HPC and large-scale hosting environments
• High performance scale-out compute and low cost dense storage in one package
• Hardware Capabilities
– Flexible compute platform with dense storage capacity
• 2S/2U server, 6 PCIe slots
– Large memory footprint (Up to 768GB / 24 DIMMs)
– High I/O performance and optional storage configurations
• HDD options: 12 x 3.5” - or - 24 x 2.5” + 2 x 2.5” HDDs in rear of server
• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch
WRF Performance – Network Interconnects
• InfiniBand is the only tested interconnect that delivers scalable performance
– EDR IB provides higher performance than 1GbE, 10GbE or 40GbE
– Ethernet stops scaling beyond 2 nodes
– InfiniBand demonstrates continuous performance gain at scale
[Chart: WRF performance by interconnect, 28 MPI processes/node, higher is better; annotations: 28x, 28x, 51x]
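The performance figures in charts like the one above are typically derived from WRF's own per-timestep timing, which standard WRF runs print to the rsl.error.0000 log. A hedged sketch of extracting the average cost per step (log line format as in stock WRF; the file name is the conventional rank-0 log):

```shell
# Compute the mean elapsed seconds per WRF timestep from a run's rsl log.
# WRF prints lines of the form:
#   Timing for main: time 2001-10-24_12:00:30 on domain 1: 2.27150 elapsed seconds.
# The elapsed-seconds value is the third-to-last field on each line.
grep "Timing for main" rsl.error.0000 \
  | awk '{ sum += $(NF-2); n++ } END { if (n) printf "%.3f s/step over %d steps\n", sum/n, n }'
```

Lower average seconds per step means higher simulation speed; inverting this number is one common way to produce the "higher is better" curves shown here.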
WRF Performance – EDR vs FDR InfiniBand
• EDR InfiniBand delivers superior scalability in application performance
– As the number of nodes scales, the performance gap in favor of EDR IB widens
• Performance advantage of EDR InfiniBand increases for larger core counts
– EDR InfiniBand provides 28% higher performance than FDR InfiniBand at 32 nodes (896 cores)
[Chart: EDR vs FDR InfiniBand, 28 MPI processes/node, higher is better; annotations: 15%, 28%]
WRF Performance – System Generations
• Thor cluster (based on Intel E5-2697v3 - Haswell) outperforms prior generations
– Up to 34-40% higher performance than the Jupiter cluster (based on Intel Ivy Bridge CPUs)
• System components used:
– Jupiter: 2-socket 10-core Intel Ivy Bridge CPUs, 1600MHz DIMMs, Connect-IB FDR IB
– Thor: 2-socket 14-core Intel E5-2697v3 @ 2.60GHz CPUs, 2133MHz DIMMs, ConnectX-4 EDR IB
[Chart: Thor vs Jupiter system generations, higher is better; annotations: 34%, 40%]
WRF Performance – Cores Per Node
• Running more CPU cores provides higher performance
– ~39% higher productivity with 28PPN compared to 20PPN
– ~13% higher productivity with 28PPN compared to 24PPN
– Higher demand on memory bandwidth and network might limit performance as more cores are used
[Chart: cores per node, CPU @ 2.6GHz, higher is better; annotations: 13%, 39%]
WRF Performance – MPI Libraries
• HPC-X delivers slightly higher scalability performance than Intel MPI at 32 nodes (896 cores)
– HPC-X delivers 8% higher productivity than Intel MPI at 32 nodes (896 cores); the gap grows as the node count increases
– Flags for HPC-X: -mca btl_sm_use_knem 1 -x MXM_SHM_KCOPY_MODE=knem -bind-to core -map-by node
– Flags for Intel MPI: -genv I_MPI_PIN on -genv I_MPI_PIN_PROCESSOR_LIST all -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_SCALABLE_PROGRESS 1
[Chart: MPI library comparison, 28 MPI processes/node, higher is better]
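Hypothetical launch lines combining the flags above into full commands. The executable name, hostfile, and rank count are assumptions for illustration; the flags themselves are those reported in this study.

```shell
# HPC-X (Open MPI based): 32 nodes x 28 processes per node = 896 ranks.
# KNEM-based shared-memory copies, ranks bound to cores, laid out round-robin by node.
mpirun -np 896 -hostfile hosts \
       -mca btl_sm_use_knem 1 -x MXM_SHM_KCOPY_MODE=knem \
       -bind-to core -map-by node ./wrf.exe

# Intel MPI: same process layout, pinning enabled on all logical processors,
# shared memory within a node and DAPL across nodes, with scalable progress.
mpirun -np 896 -hostfile hosts \
       -genv I_MPI_PIN on -genv I_MPI_PIN_PROCESSOR_LIST all \
       -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_SCALABLE_PROGRESS 1 ./wrf.exe
```

Both command lines pin ranks to cores; process placement and shared-memory copy strategy are the main tuning levers compared here.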
WRF Profiling – Time Spent by MPI Calls
• Majority of the MPI time is spent on MPI_Bcast
– MPI_Bcast: 62% of runtime at 32 nodes (896 cores)
– MPI_Scatter: 8%, MPI_Wait: 2% of runtime
– MPI_Wait time reflects waiting for pending non-blocking sends and receives to complete
WRF Profiling – MPI Message Sizes
• Majority of data transfer messages are medium-sized, except for:
– MPI_Bcast: large concentration (60% of MPI wall time) at small sizes (e.g. 4 bytes)
– MPI_Scatter: some concentration (~6% of wall time) at 16KB buffers
– MPI_Wait: large concentration of messages from 1 byte to under 256 bytes
WRF Profiling – MPI Data Transfer
• As the cluster grows, less data is transferred between MPI processes
– Maximum per-rank data transferred decreases from 523MB (8 nodes) to 263MB (16 nodes)
– Majority of communications are between neighboring ranks
– Nonblocking (point to point) data transfers are shown in the graph
– Collective data communications are small compared to non-blocking communications
[Charts: 2 nodes and 32 nodes]
WRF Summary
• Using the best system components can greatly improve WRF scalability performance
– Compute: Intel Haswell cluster outperforms system architecture of previous generations
• Haswell cluster outperforms Ivy Bridge cluster by 34% at 32 nodes (896 CPU cores)
– Compute: Running more CPU cores provides higher performance
• ~39% higher productivity with 28PPN compared to 20PPN, and ~13% compared to 24PPN
– Network: EDR InfiniBand and HPC-X MPI library deliver superior scalability in application performance
• EDR IB delivers 28 times higher performance than 10GbE/40GbE, and 41 times higher than 1GbE, at 32 nodes (896 CPU cores)
• EDR InfiniBand delivers 28% higher performance than FDR InfiniBand at 32 nodes (896 cores)
• Mellanox HPC-X provides an 8% performance benefit at scale
– WRF has high dependency on both network latency and throughput
• EDR InfiniBand delivers higher network throughput, which allows it to outperform FDR InfiniBand at high node counts
• The MPI profile shows a large concentration of medium-sized messages, which benefit from EDR InfiniBand's throughput
– MPI profile shows the majority of data transfer messages are medium-sized, except for:
• MPI_Bcast: large concentration at small sizes (e.g. 4 bytes)
• MPI_Wait: large concentration of messages from 1 byte to under 256 bytes
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and
completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein