Abaqus/Explicit IO Profiling - HPC Advisory Council IO Performance

  • Abaqus/Explicit IO Profiling

    March 2010


    Note

    • The following research was performed under the HPC Advisory Council activities

    – Participating vendors: AMD, Dell, SIMULIA, Mellanox

    – Compute resource - HPC Advisory Council Cluster Center

    • The participating members would like to thank SIMULIA for their support and guidelines

    • For more info please refer to

    – www.mellanox.com, www.dell.com/hpc, www.amd.com

    – http://www.simulia.com



    SIMULIA Abaqus

    • ABAQUS offers a suite of engineering design analysis software products, including tools for:

    – Nonlinear finite element analysis (FEA)

    – Advanced linear and dynamics application problems

    • ABAQUS/Standard provides general-purpose FEA that includes a broad range of analysis capabilities

    • ABAQUS/Explicit provides nonlinear, transient, dynamic analysis of solids and structures using explicit time integration


    Objectives

    • The presented research was performed to provide best practices and IO profiling information for Abaqus/Explicit

    – Determination of application IO requirements

    – Testing of application on NFS IO subsystem

    • Provide recommendations on Storage systems for Abaqus/Explicit


    Test Cluster Configuration

    • Dell™ PowerEdge™ SC 1435 24-node cluster

    • Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs

    • Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs

    • Mellanox® InfiniBand DDR Switch

    • Memory: 16GB DDR2 800MHz per node

    • OS: RHEL5U3, OFED 1.4.1 InfiniBand SW stack

    • MPI: HP-MPI 2.3

    • Application: Abaqus 6.9 EF1

    • Single SCSI hard drive in master node using NFS over GigE connection

    • Benchmark Workload

    – Abaqus/Explicit Server Benchmarks: E5 benchmark


    Mellanox InfiniBand Solutions

    • Industry Standard – Hardware, software, cabling, management

    – Designed for clustering and storage interconnect

    • Performance – 40Gb/s node-to-node

    – 120Gb/s switch-to-switch

    – 1us application latency

    – Most aggressive roadmap in the industry

    • Reliable with congestion management

    • Efficient

    – RDMA and Transport Offload

    – Kernel bypass

    – CPU focuses on application processing

    • Scalable for Petascale computing & beyond

    • End-to-end quality of service

    • Virtualization acceleration

    • I/O consolidation, including storage

    [Chart: “The InfiniBand Performance Gap is Increasing” – InfiniBand delivers the lowest latency; InfiniBand bandwidth (20Gb/s and 40Gb/s node-to-node, 120Gb/s switch-to-switch, up to 80Gb/s 4X and 240Gb/s 12X) vs. Fibre Channel and Ethernet]


    • Performance
    – Quad-Core
      • Enhanced CPU IPC
      • 4x 512K L2 cache
      • 6MB L3 cache
    – Direct Connect Architecture
      • HyperTransport™ Technology
      • Up to 24 GB/s peak per processor
    – Floating Point
      • 128-bit FPU per core
      • 4 FLOPS/clk peak per core
    – Integrated Memory Controller
      • Up to 12.8 GB/s
      • DDR2-800 MHz or DDR2-667 MHz

    • Scalability
    – 48-bit Physical Addressing

    • Compatibility
    – Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processor

    [Diagram: Quad-Core AMD Opteron™ processor block diagram – four cores with an integrated dual-channel registered DDR2 memory controller and 8 GB/s HyperTransport links to PCI-E® bridges and the I/O hub (USB, PCI)]


    Dell PowerEdge Servers helping Simplify IT

    • System Structure and Sizing Guidelines – 24-node cluster built with Dell PowerEdge™ SC 1435 Servers

    – Servers optimized for High Performance Computing environments

    – Building Block Foundations for best price/performance and performance/watt

    • Dell HPC Solutions – Scalable Architectures for High Performance and Productivity

    – Dell's comprehensive HPC services help manage the lifecycle requirements.

    – Integrated, Tested and Validated Architectures

    • Workload Modeling – Optimized System Size, Configuration and Workloads

    – Test-bed Benchmarks

    – ISV Applications Characterization

    – Best Practices & Usage Analysis


    Dell PowerEdge™ Server Advantage

    • Dell™ PowerEdge™ servers incorporate AMD Opteron™ and Mellanox ConnectX InfiniBand to provide leading edge performance and reliability

    • Building Block Foundations for best price/performance and performance/watt

    • Investment protection and energy efficiency

    • Longer term server investment value

    • Faster DDR2-800 memory

    • Enhanced AMD PowerNow!

    • Independent Dynamic Core Technology

    • AMD CoolCore™ and Smart Fetch Technology

    • Mellanox InfiniBand end-to-end for highest networking performance


    Introduction to Profiling


    Abaqus/Explicit Benchmark Results

    • Input Dataset: E5 – Blast loaded plate

    • Master node has a single hard drive – SAS drive

    • Exported using NFS over GigE – Full bi-sectional bandwidth

    • No special NFS options used on server or clients – Default options used

    – For example, on server: /application *(rw,sync,no_root_squash)

    • Profile was done on 16 cores – Each node has 8 cores – Two nodes total connected via InfiniBand

    • Analysis is done using strace_analyzer (clusterbuffer.wetpaint.com) – GPL application
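The per-process analysis that strace_analyzer performs can be approximated with a short script. The following is a minimal, hypothetical sketch of that kind of syscall counting (it is not the actual GPL tool); the assumed input is `strace -f -ttt`-style trace lines, and the sample lines are made up for illustration:

```python
import re
from collections import Counter

# Hypothetical sketch, NOT the actual strace_analyzer tool.
# Assumes trace lines in `strace -f -ttt` style:
#   "<pid> <timestamp> <syscall>(args...) = <ret>"
LINE_RE = re.compile(r"^(\d+)\s+[\d.]+\s+(\w+)\(")

IO_CALLS = {"read", "write", "lseek", "open", "close",
            "fstat", "stat", "access", "fcntl", "unlink"}

def count_syscalls(lines):
    """Return {pid: Counter mapping IO syscall name -> call count}."""
    per_pid = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) in IO_CALLS:
            per_pid.setdefault(int(m.group(1)), Counter())[m.group(2)] += 1
    return per_pid

# Illustrative trace lines (made up, not from the benchmark run)
sample = [
    '12419 1268000000.10 write(3, "data", 1834) = 1834',
    '12419 1268000000.11 lseek(3, 0, SEEK_CUR) = 1834',
    '12420 1268000000.12 read(4, "x", 512) = 512',
]
stats = count_syscalls(sample)
print(stats[12419]["write"])  # 1
```

Counting per-PID rather than globally matters here because, as the later slides show, one rank does essentially all of the IO.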


    Abaqus/Explicit IO Profiling

    • The goal of IO profiling is to examine: – How the application performs IO

    • How many processes do IO?
    • How much writing? How much reading?
    • Sizes of syscalls?
    • Number of lseek() calls? (head thrashing)

    – How the profile results can translate into IO requirements (i.e. design)

    – For applications with available source code, IO profiling can guide changes to the application for better performance
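The “how much writing, and at what syscall sizes” questions above reduce to simple aggregation once per-call byte counts are extracted from a trace. A minimal sketch (the function name and sample records are hypothetical):

```python
# Sketch: derive total write volume and average bytes per write() call
# from (syscall, byte_count) records extracted from a trace.
def write_profile(records):
    writes = [n for call, n in records if call == "write"]
    total = sum(writes)
    avg = total / len(writes) if writes else 0.0
    return total, avg

# Made-up records for illustration
records = [("write", 2048), ("write", 1620), ("lseek", 0), ("write", 1832)]
total, avg = write_profile(records)
print(total, round(avg))  # 5500 1833
```

The average write size is what determines whether a storage system is stressed on bandwidth or on IOPS, which is exactly the distinction the summary slide draws.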


    Executive Summary


    Abaqus/Explicit - Summary

    • This particular case of Abaqus/Explicit spends little time on IO relative to total run time – 0.5% of the time is spent doing IO when tested with NFS/GigE

    • Only one process (rank-0 process) does all of the IO for the application – Very suitable for NFS

    • Most of the IO is write (130MB) – Very small writes (1.8KB per syscall)

    • IOPS can be fairly important – Partly because of the large number of lseek() operations

    • Recommendations:
    – NFS is likely to be a good option even for larger problem sizes
    – A single hard drive provided plenty of performance for the test case

    – For larger test cases, more drives may be needed for better IOPS performance
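The summary figures above are internally consistent: ~130MB written through the 74,328 write() calls recorded for the rank-0 process in the command-count table works out to roughly 1.8KB per syscall. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the summary figures
total_written = 130 * 2**20   # ~130 MB written (summary slide)
write_calls = 74328           # write() count for the rank-0 process
per_call = total_written / write_calls
print(round(per_call))        # 1834 bytes, i.e. ~1.8 KB per write()
```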


    Details


    Abaqus/Explicit – Run Times

    Process ID   Total Run Time (secs)   IO Time (secs)   % of Time for IO
    12419        424.719                 0.7889           0.185%
    12420        425.292                 0.0991           0.023%
    12421        425.373                 0.1222           0.028%
    12422        425.433                 0.1155           0.027%
    12423        424.517                 0.1023           0.024%
    12424        425.291                 0.1250           0.029%
    12425        425.331                 0.1293           0.030%
    12427        425.291                 0.1355           0.032%
    14297        418.912                 2.2275           0.532%
    14298        418.827                 2.3856           0.570%
    14299        425.785                 2.6418           0.620%
    14300        425.112                 2.3226           0.546%
    14301        418.706                 2.1369           0.510%
    14302        424.769                 2.1317           0.502%
    14303        419.200                 2.0271           0.486%
    14304        418.868                 1.8862           0.450%


    Abaqus/Explicit – Command Count

    Process ID  access  lseek   fcntl  stat  unlink  open  close  fstat  read   mkdir  getdents  write
    12419       14      19,689  31     136   1       287   297    148    6,349  69     8         74,328
    12420       5       3,150   5      78    0       241   243    103    1,584  0      8         1,870
    12421       5       3,159   5      78    0       241   243    104    1,581  0      8         1,882
    12422       5       3,155   5      78    0       241   243    104    1,579  0      8         1,880
    12423       5       3,159   5      78    0       241   243    104    1,581  0      8         1,882
    12424       5       3,149   5      78    0       241   243    104    1,577  0      8         1,876
    12425       5       3,147   5      78    0       241   243    104    1,581  0      8         1,870
    12427       5       3,149   5      78    0       241   243    104    1,583  0      8         1,870
    14297       5       3,149   5      78    0       246   248    104    1,588  0      8         1,870
    14298       5       3,147   5      78    0       246   248    104    1,586  0      8         1,870
    14299       5       3,157   5      78    0       246   248    104    1,586  0      8         1,880
    14300       5       3,160   5      78    0       246   248    104    1,587  0      8         1,882
    14301       5       3,156   5      78    0       246   248    104    1,585  0      8         1,880
    14302       5       3,163   5      78    0       246   248    104    1,588  0      8         1,884
    14303       5       3,149   5      78    0       246   248    104    1,588  0      8         1,870
    14304       5       3,147   5      78    0       246   248    104    1,586  0      8         1,870

    • Process 12419 is the rank-0 process – access, fcntl, stat, unlink, open, close, fstat, read, and write counts are much larger than for the other processes

    • The table above shows the number of times each IO system function was called
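Given per-process counts like those above, the rank-0 process can be picked out programmatically as the write() outlier. A sketch using a few illustrative rows from the table:

```python
# Identify the rank-0 (IO-performing) process as the write() outlier.
# Counts below are a subset of the command-count table.
counts = {
    12419: {"write": 74328, "read": 6349, "lseek": 19689},
    12420: {"write": 1870,  "read": 1584, "lseek": 3150},
    12421: {"write": 1882,  "read": 1581, "lseek": 3159},
}
rank0 = max(counts, key=lambda pid: counts[pid]["write"])
print(rank0)  # 12419
```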


    Abaqus/Explicit – Command Count

    • open()/close() counts don’t match because of sockets – Socket creation looks like an open() call, but sockets don’t show a matching close()

    • open() is also used for .so libraries – Shared libraries are opened and read, so they appear as IO

    • Fairly easy to identify the rank-0 process

    • Rank-0 (process 12419):
    – Roughly 40 times more write() calls than the other processes (74,328 vs. ~1,870)
