Abaqus/Explicit IO Profiling
March 2010
2
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, Dell, SIMULIA, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank SIMULIA for their support and guidelines
• For more info please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com
– http://www.simulia.com
3
SIMULIA Abaqus
• ABAQUS offers a suite of engineering design analysis software products, including tools for:
– Nonlinear finite element analysis (FEA)
– Advanced linear and dynamics application problems
• ABAQUS/Standard provides general-purpose FEA that includes a broad range of analysis capabilities
• ABAQUS/Explicit provides nonlinear, transient, dynamic analysis of solids and structures using explicit time integration
4
Objectives
• The presented research was done to provide best practices and IO profiling information for Abaqus/Explicit
– Determination of the application's IO requirements
– Testing of the application on an NFS IO subsystem
• Provide recommendations on storage systems for Abaqus/Explicit
5
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U3, OFED 1.4.1 InfiniBand SW stack
• MPI: HP-MPI 2.3
• Application: Abaqus 6.9 EF1
• Single SCSI hard drive in the master node, shared using NFS over a GigE connection
• Benchmark Workload
– Abaqus/Explicit Server Benchmarks: E5 benchmark
6
Mellanox InfiniBand Solutions
• Industry Standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
– RDMA and Transport Offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation, including storage
[Chart: InfiniBand bandwidth roadmap versus Fibre Channel and Ethernet – 20Gb/s, 40Gb/s, 80Gb/s (4X) and 60Gb/s, 120Gb/s, 240Gb/s (12X). InfiniBand delivers the lowest latency; the InfiniBand performance gap is increasing.]
7
Quad-Core AMD Opteron™ Processor
• Performance
– Quad-Core
• Enhanced CPU IPC
• 4x 512KB L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating Point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core
– Integrated Memory Controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit Physical Addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors
[Diagram: Quad-Core AMD Opteron™ Processor with Direct Connect Architecture – dual-channel registered DDR2 memory and 8 GB/s HyperTransport links to the PCI-E® bridges and I/O hub (USB, PCI)]
8
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 Servers
– Servers optimized for High Performance Computing environments
– Building Block Foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable Architectures for High Performance and Productivity
– Dell's comprehensive HPC services help manage lifecycle requirements
– Integrated, Tested and Validated Architectures
• Workload Modeling
– Optimized System Size, Configuration and Workloads
– Test-bed Benchmarks
– ISV Applications Characterization
– Best Practices & Usage Analysis
9
Dell PowerEdge™ Server Advantage
• Dell™ PowerEdge™ servers incorporate AMD Opteron™ and Mellanox ConnectX InfiniBand to provide leading edge performance and reliability
• Building Block Foundations for best price/performance and performance/watt
• Investment protection and energy efficiency
• Longer-term server investment value
• Faster DDR2-800 memory
• Enhanced AMD PowerNow!
• Independent Dynamic Core Technology
• AMD CoolCore™ and Smart Fetch Technology
• Mellanox InfiniBand end-to-end for highest networking performance
10
Introduction to Profiling
11
Abaqus/Explicit Benchmark Results
• Input Dataset: E5 – Blast loaded plate
• Master node has a single hard drive
– SAS drive
• Exported using NFS over GigE
– Full bi-sectional bandwidth
• No special NFS options used on server or clients
– Default options used
– For example, on the server: /application *(rw,sync,no_root_squash)
• Profile was done on 16 cores
– Each node has 8 cores
– Two nodes total, connected via InfiniBand
• Analysis is done using strace_analyzer (clusterbuffer.wetpaint.com)
– GPL application
12
Abaqus/Explicit IO Profiling
• The goal of IO profiling is to examine (see the sketch below):
– How the application performs IO
• How many processes do IO?
• How much writing? How much reading?
• Sizes of syscalls?
• Number of lseek() calls? (head thrashing)
– How the profile results translate into IO requirements (i.e. design)
– For applications with source code, IO profiling can be used to change the application for better performance
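As an illustration of how such a profile could be gathered, the following is a minimal Python sketch, assuming plain per-process strace logs with syscall timings (e.g. produced with strace -ff -T -o <prefix>). It is not the strace_analyzer tool used in this study; the log format handling and file naming here are assumptions.

import re
import sys
from collections import defaultdict

# Matches strace -T result lines such as:  write(7, "..."..., 1834) = 1834 <0.000045>
LINE = re.compile(r'^(\w+)\((.*)\)\s*=\s*(-?\d+).*<([\d.]+)>')

def summarize(path):
    counts = defaultdict(int)     # number of calls per syscall
    seconds = defaultdict(float)  # cumulative time per syscall
    nbytes = defaultdict(int)     # bytes moved by read()/write()
    with open(path) as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue          # skip unfinished/resumed lines, signals, pointer returns
            name, _args, ret, secs = m.groups()
            counts[name] += 1
            seconds[name] += float(secs)
            if name in ("read", "write") and int(ret) > 0:
                nbytes[name] += int(ret)
    return counts, seconds, nbytes

if __name__ == "__main__":
    counts, seconds, nbytes = summarize(sys.argv[1])   # one per-process strace log
    for name in sorted(counts, key=seconds.get, reverse=True):
        print(f"{name:10s} calls={counts[name]:8d} time={seconds[name]:9.4f}s bytes={nbytes[name]}")

Run against one per-process log, this prints call counts, cumulative syscall time and bytes transferred, which is enough to answer the questions listed above.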
13
Executive Summary
14
Abaqus/Explicit - Summary
• This particular case of Abaqus/Explicit does little IO relative to the total run time
– About 0.5% of the time is spent doing IO when tested with NFS/GigE
• Only one process (the rank-0 process) does all of the IO for the application
– Very suitable for NFS
• Most of the IO is write (130MB)
– Very small writes (~1.8KB per syscall; see the check below)
• IOPS can be fairly important
– Partly because of the large number of lseek() operations
• Recommendations:
• NFS is likely to be a good option even for larger problem sizes
• A single hard drive provided plenty of performance for the test case
– For larger test cases, more drives may be needed for better IOPS performance
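The ~1.8KB figure is consistent with the counts reported later in this deck; a quick back-of-the-envelope check (the 130MB total is the approximate value stated above, not an exact byte count):

total_write_bytes = 130 * 1024 * 1024   # ~130MB written in total, almost all by rank-0
write_calls = 74_328                    # write() count for process 12419 (command-count table)
print(total_write_bytes / write_calls)  # ~1834 bytes, i.e. roughly 1.8KB per write() syscall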
15
Details
16
Abaqus/Explicit – Run Times
Process ID  Total Run Time (secs)  IO Time (secs)  % of Time for IO
12419 424.719 0.7889 0.185%
12420 425.292 0.0991 0.023%
12421 425.373 0.1222 0.028%
12422 425.433 0.1155 0.027%
12423 424.517 0.1023 0.024%
12424 425.291 0.1250 0.029%
12425 425.331 0.1293 0.030%
12427 425.291 0.1355 0.032%
14297 418.912 2.2275 0.532%
14298 418.827 2.3856 0.570%
14299 425.785 2.6418 0.620%
14300 425.112 2.3226 0.546%
14301 418.706 2.1369 0.510%
14302 424.769 2.1317 0.502%
14303 419.200 2.0271 0.486%
14304 418.868 1.8862 0.450%
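The last column is simply IO time divided by total run time; for example, for process 12420 (values taken from the table above):

io_time, run_time = 0.0991, 425.292        # seconds, process 12420
print(f"{100 * io_time / run_time:.3f}%")  # prints 0.023%, matching the table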
17
Abaqus/Explicit – Command Count
• Number of times an IO system function is called:

Process ID access lseek fcntl stat unlink open close fstat read mkdir getdents write
12419 14 19,689 31 136 1 287 297 148 6,349 69 8 74,328
12420 5 3,150 5 78 0 241 243 103 1,584 0 8 1,870
12421 5 3,159 5 78 0 241 243 104 1,581 0 8 1,882
12422 5 3,155 5 78 0 241 243 104 1,579 0 8 1,880
12423 5 3,159 5 78 0 241 243 104 1,581 0 8 1,882
12424 5 3,149 5 78 0 241 243 104 1,577 0 8 1,876
12425 5 3,147 5 78 0 241 243 104 1,581 0 8 1,870
12427 5 3,149 5 78 0 241 243 104 1,583 0 8 1,870
14297 5 3,149 5 78 0 246 248 104 1,588 0 8 1,870
14298 5 3,147 5 78 0 246 248 104 1,586 0 8 1,870
14299 5 3,157 5 78 0 246 248 104 1,586 0 8 1,880
14300 5 3,160 5 78 0 246 248 104 1,587 0 8 1,882
14301 5 3,156 5 78 0 246 248 104 1,585 0 8 1,880
14302 5 3,163 5 78 0 246 248 104 1,588 0 8 1,884
14303 5 3,149 5 78 0 246 248 104 1,588 0 8 1,870
14304 5 3,147 5 78 0 246 248 104 1,586 0 8 1,870
• Process 12419 is the rank-0 process
– Its access, fcntl, stat, unlink, open, close, fstat, read and write counts are much larger than those of the other processes
18
Abaqus/Explicit – Command Count
• Open and close counts don't match because of sockets
– Socket creation looks like an open()
– Sockets open() but don't close()
• open() is also used for .so libraries
– .so libraries are opened and read, so they look like IO
• Fairly easy to identify the rank-0 process
• Rank-0 (Process 12419):
– About 40 times more write() calls than the other processes (74,328 vs. ~1,870)
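As a tiny illustration of how the rank-0 process stands out, using a few of the write() counts from the command-count table above (only a subset of the processes is shown):

write_counts = {12419: 74_328, 12420: 1_870, 12421: 1_882}  # pid -> write() count
rank0 = max(write_counts, key=write_counts.get)
print(rank0)  # 12419 – the process performing essentially all of the application's file IO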