Abaqus/Explicit IO Profiling
March 2010
2
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, Dell, SIMULIA, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank SIMULIA for their support and guidelines
• For more info please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com
– http://www.simulia.com
3
SIMULIA Abaqus
• ABAQUS offers a suite of engineering design analysis software products, including tools for:
– Nonlinear finite element analysis (FEA)
– Advanced linear and dynamics application problems
• ABAQUS/Standard provides general-purpose FEA that includes a broad range of analysis capabilities
• ABAQUS/Explicit provides nonlinear, transient, dynamic analysis of solids and structures using explicit time integration
4
Objectives
• The presented research was done to provide best practices and IO profiling information for Abaqus/Explicit
– Determination of the application's IO requirements
– Testing of the application on an NFS IO subsystem
• Provide recommendations on storage systems for Abaqus/Explicit
5
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U3, OFED 1.4.1 InfiniBand SW stack
• MPI: HP-MPI 2.3
• Application: Abaqus 6.9 EF1
• Single SCSI hard drive in the master node, shared using NFS over a GigE connection
• Benchmark Workload
– Abaqus/Explicit Server Benchmarks: E5 benchmark
6
Mellanox InfiniBand Solutions
• Industry Standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
– RDMA and Transport Offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation, including storage
[Chart: InfiniBand bandwidth roadmap versus Fibre Channel and Ethernet – 20Gb/s, 40Gb/s, 80Gb/s (4X) and 60Gb/s, 120Gb/s, 240Gb/s (12X). InfiniBand delivers the lowest latency; the InfiniBand performance gap is increasing.]
7
Quad-Core AMD Opteron™ Processor
• Performance
– Quad-Core
• Enhanced CPU IPC
• 4x 512KB L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating Point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core
– Integrated Memory Controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit Physical Addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors
[Diagram: Quad-Core AMD Opteron™ Processor with Direct Connect Architecture – dual-channel registered DDR2 memory and 8 GB/s HyperTransport links to the PCI-E® bridges and I/O hub (USB, PCI)]
8
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 Servers
– Servers optimized for High Performance Computing environments
– Building Block Foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable Architectures for High Performance and Productivity
– Dell's comprehensive HPC services help manage lifecycle requirements
– Integrated, Tested and Validated Architectures
• Workload Modeling
– Optimized System Size, Configuration and Workloads
– Test-bed Benchmarks
– ISV Applications Characterization
– Best Practices & Usage Analysis
9
Dell PowerEdge™ Server Advantage
• Dell™ PowerEdge™ servers incorporate AMD Opteron™ and Mellanox ConnectX InfiniBand to provide leading edge performance and reliability
• Building Block Foundations for best price/performance and performance/watt
• Investment protection and energy efficiency
• Longer-term server investment value
• Faster DDR2-800 memory
• Enhanced AMD PowerNow!
• Independent Dynamic Core Technology
• AMD CoolCore™ and Smart Fetch Technology
• Mellanox InfiniBand end-to-end for highest networking performance
10
Introduction to Profiling
11
Abaqus/Explicit Benchmark Results
• Input Dataset: E5 – Blast loaded plate
• Master node has a single hard drive
– SAS drive
• Exported using NFS over GigE
– Full bi-sectional bandwidth
• No special NFS options used on server or clients
– Default options used
– For example, on the server: /application *(rw,sync,no_root_squash)
• Profile was done on 16 cores
– Each node has 8 cores
– Two nodes total, connected via InfiniBand
• Analysis is done using strace_analyzer (clusterbuffer.wetpaint.com)
– GPL application
12
Abaqus/Explicit IO Profiling
• The goal of IO profiling is to examine (see the sketch below):
– How the application performs IO
• How many processes do IO?
• How much writing? How much reading?
• Sizes of syscalls?
• Number of lseek() calls? (head thrashing)
– How the profile results translate into IO requirements (i.e. design)
– For applications with source code, IO profiling can be used to change the application for better performance
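As an illustration of how such a profile could be gathered, the following is a minimal Python sketch, assuming plain per-process strace logs with syscall timings (e.g. produced with strace -ff -T -o <prefix>). It is not the strace_analyzer tool used in this study; the log format handling and file naming here are assumptions.

import re
import sys
from collections import defaultdict

# Matches strace -T result lines such as:  write(7, "..."..., 1834) = 1834 <0.000045>
LINE = re.compile(r'^(\w+)\((.*)\)\s*=\s*(-?\d+).*<([\d.]+)>')

def summarize(path):
    counts = defaultdict(int)     # number of calls per syscall
    seconds = defaultdict(float)  # cumulative time per syscall
    nbytes = defaultdict(int)     # bytes moved by read()/write()
    with open(path) as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue          # skip unfinished/resumed lines, signals, pointer returns
            name, _args, ret, secs = m.groups()
            counts[name] += 1
            seconds[name] += float(secs)
            if name in ("read", "write") and int(ret) > 0:
                nbytes[name] += int(ret)
    return counts, seconds, nbytes

if __name__ == "__main__":
    counts, seconds, nbytes = summarize(sys.argv[1])   # one per-process strace log
    for name in sorted(counts, key=seconds.get, reverse=True):
        print(f"{name:10s} calls={counts[name]:8d} time={seconds[name]:9.4f}s bytes={nbytes[name]}")

Run against one per-process log, this prints call counts, cumulative syscall time and bytes transferred, which is enough to answer the questions listed above.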
13
Executive Summary
14
Abaqus/Explicit - Summary
• This particular case of Abaqus/Explicit does little IO relative to the total run time
– About 0.5% of the time is spent doing IO when tested with NFS/GigE
• Only one process (the rank-0 process) does all of the IO for the application
– Very suitable for NFS
• Most of the IO is write (130MB)
– Very small writes (~1.8KB per syscall; see the check below)
• IOPS can be fairly important
– Partly because of the large number of lseek() operations
• Recommendations:
• NFS is likely to be a good option even for larger problem sizes
• A single hard drive provided plenty of performance for the test case
– For larger test cases, more drives may be needed for better IOPS performance
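The ~1.8KB figure is consistent with the counts reported later in this deck; a quick back-of-the-envelope check (the 130MB total is the approximate value stated above, not an exact byte count):

total_write_bytes = 130 * 1024 * 1024   # ~130MB written in total, almost all by rank-0
write_calls = 74_328                    # write() count for process 12419 (command-count table)
print(total_write_bytes / write_calls)  # ~1834 bytes, i.e. roughly 1.8KB per write() syscall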
15
Details
16
Abaqus/Explicit – Run Times
Process ID  Total Run Time (secs)  IO Time (secs)  % of Time for IO
12419 424.719 0.7889 0.185%
12420 425.292 0.0991 0.023%
12421 425.373 0.1222 0.028%
12422 425.433 0.1155 0.027%
12423 424.517 0.1023 0.024%
12424 425.291 0.1250 0.029%
12425 425.331 0.1293 0.030%
12427 425.291 0.1355 0.032%
14297 418.912 2.2275 0.532%
14298 418.827 2.3856 0.570%
14299 425.785 2.6418 0.620%
14300 425.112 2.3226 0.546%
14301 418.706 2.1369 0.510%
14302 424.769 2.1317 0.502%
14303 419.200 2.0271 0.486%
14304 418.868 1.8862 0.450%
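The last column is simply IO time divided by total run time; for example, for process 12420 (values taken from the table above):

io_time, run_time = 0.0991, 425.292        # seconds, process 12420
print(f"{100 * io_time / run_time:.3f}%")  # prints 0.023%, matching the table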
17
Abaqus/Explicit – Command Count
• Number of times an IO system function is called:

Process ID access lseek fcntl stat unlink open close fstat read mkdir getdents write
12419 14 19,689 31 136 1 287 297 148 6,349 69 8 74,328
12420 5 3,150 5 78 0 241 243 103 1,584 0 8 1,870
12421 5 3,159 5 78 0 241 243 104 1,581 0 8 1,882
12422 5 3,155 5 78 0 241 243 104 1,579 0 8 1,880
12423 5 3,159 5 78 0 241 243 104 1,581 0 8 1,882
12424 5 3,149 5 78 0 241 243 104 1,577 0 8 1,876
12425 5 3,147 5 78 0 241 243 104 1,581 0 8 1,870
12427 5 3,149 5 78 0 241 243 104 1,583 0 8 1,870
14297 5 3,149 5 78 0 246 248 104 1,588 0 8 1,870
14298 5 3,147 5 78 0 246 248 104 1,586 0 8 1,870
14299 5 3,157 5 78 0 246 248 104 1,586 0 8 1,880
14300 5 3,160 5 78 0 246 248 104 1,587 0 8 1,882
14301 5 3,156 5 78 0 246 248 104 1,585 0 8 1,880
14302 5 3,163 5 78 0 246 248 104 1,588 0 8 1,884
14303 5 3,149 5 78 0 246 248 104 1,588 0 8 1,870
14304 5 3,147 5 78 0 246 248 104 1,586 0 8 1,870
• Process 12419 is the rank-0 process
– Its access, fcntl, stat, unlink, open, close, fstat, read and write counts are much larger than those of the other processes
18
Abaqus/Explicit – Command Count
• Open and close counts don't match because of sockets
– Socket creation looks like an open()
– Sockets open() but don't close()
• open() is also used for .so libraries
– .so libraries are opened and read, so they look like IO
• Fairly easy to identify the rank-0 process
• Rank-0 (Process 12419):
– About 40 times more write() calls than the other processes (74,328 vs. ~1,870)
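As a tiny illustration of how the rank-0 process stands out, using a few of the write() counts from the command-count table above (only a subset of the processes is shown):

write_counts = {12419: 74_328, 12420: 1_870, 12421: 1_882}  # pid -> write() count
rank0 = max(write_counts, key=write_counts.get)
print(rank0)  # 12419 – the process performing essentially all of the application's file IO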