
WHITE PAPER

QLogic TrueScale™ DDR IB Adapter Provides Scalable, Best-In-Class Performance

QLogic’s DDR Adapters Outperform Mellanox®

QLogic’s Message Rate 340% Better and Scalable Latency Up to 33% Superior

Executive Summary

Solving today’s most challenging computational problems requires more powerful, cost-effective, and power-efficient systems. As clusters and the number of processors per cluster grow to address problems of increasing complexity, the communication needs of the applications also increase. Consequently, interconnect performance is crucial for application scaling. Satisfying the high-performance requirements of Inter-Processor Communications (IPC) requires an interconnect that:

• Efficiently processes a variety of message patterns
• Leverages the benefits of multi-core processors
• Scales with the size of the fabric
• Minimizes power requirements

QLogic Host Channel Adapters (HCAs) have been architected with these design goals in mind to provide significantly better scaling performance than any other InfiniBand™ (IB) architecture. As a result, a measurable and sustainable difference in application performance can be realized when deploying the TrueScale IB architecture.

QLogic has performed a series of head-to-head performance benchmarks showing the I/O performance and scalability advantages of its 7200 Series of Dual Data Rate (DDR) IB adapters over Mellanox ConnectX™ adapters. The findings in this paper demonstrate that QLogic TrueScale adapters are the best choice for High Performance Computing (HPC) applications.

Key Findings

The QLogic 7200 Series DDR InfiniBand adapters offer better message rate and scalable latency performance than Mellanox’s ConnectX adapters. The test results described in this paper suggest that:

• Message rate performance is over 340 percent better than ConnectX
• Scalable latency is up to 33 percent superior to ConnectX
• TrueScale bandwidth performance is anywhere from 120 to 70 percent better at 128- and 1024-byte message sizes, respectively
• HPC customers can reap the benefits of TrueScale adapters, which significantly outperform Mellanox DDR adapters as the size of the cluster increases


Results

The most accurate way to establish the best interconnect option for a given application is to install and run the application on a variety of fabrics to determine the best performing option. However, given the costs associated with this approach, the use of industry standard benchmarks is a more pragmatic means of evaluating an interconnect.

For applications with heavy messaging requirements, message rate performance is a good indicator of how well an interconnect will be able to support the needs of an application. Another factor to consider is how well the interconnect maintains its performance as the system is scaled. The High Performance Computing Challenge (HPCC) scalable latency and scalable message rate benchmarks are strong indicators of how well the interconnect will support an application at scale.

Architecturally, ConnectX is designed to offload more of the burden of communication processing from the CPU to the adapter. This design can provide benefits in CPU utilization, especially when using single- or dual-core compute nodes. However, given the availability of multiple cores in today’s compute nodes, this approach is no longer optimal. As more cores are added to a node, the communications burden on a single adapter increases significantly. This results in an increased dependency on the adapter’s capabilities for scalable “system” performance. Consequently, scalability anomalies can begin to appear when the number of cores in a compute node increases to four or five.

Primarily due to the offload capability of ConnectX, Mellanox’s adapters require significantly more power to operate, as much as 50 percent more than TrueScale adapters. The additional wattage drawn by the compute nodes is also reflected in the higher cooling costs needed to control the ambient temperature in the data center.

TrueScale architecture is designed to support highly scaled applications with high message rate and ultra-low scalable latency performance. In both “scale-up” (multi-core environments) and “scale-out” (large node count) clusters, the efficient message processing capabilities of the adapter enable more effective use of the available compute resources, resulting in application performance benefits as the number of cores per node and the number of nodes in a cluster increase.

Microbenchmarks

Table 1 summarizes QLogic’s findings in scalable benchmark performance between ConnectX and TrueScale IB adapters.

Message Rate

As seen in Table 1, at eight processes per node (ppn), TrueScale message rate performance is over three times that of ConnectX.

OSU’s Multiple Bandwidth/Message Rate benchmark (osu_mbw_mr) was run on two servers connected by a 1 m cable (no switch), each with two 3.0 GHz Intel® Harpertown E5472 quad-core CPUs, 16 GB RAM, and RHEL 5. ConnectX runs used OFED 1.3 and MVAPICH 1.0.0 (default options); TrueScale runs used InfiniPath® 2.2/OFED 1.3 and QLogic MPI (default options).
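To make concrete what a message-rate microbenchmark measures, the following is a minimal MPI sketch in the spirit of osu_mbw_mr (it is not the OSU source): sender ranks keep a window of non-blocking 1-byte sends in flight to paired receiver ranks, and the aggregate message rate is derived from the elapsed time. The window size, iteration count, and pairing scheme are illustrative assumptions.

/*
 * Illustrative message-rate sketch (not osu_mbw_mr itself).
 * Run with an even number of ranks: rank i in the first half pairs
 * with rank i + size/2 in the second half (senders / receivers).
 */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE   1      /* 1-byte messages, as in the tests cited above */
#define WINDOW     64     /* non-blocking operations kept in flight       */
#define ITERATIONS 10000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;

    static char sbuf[WINDOW][MSG_SIZE], rbuf[WINDOW][MSG_SIZE];
    char ack = 0;
    MPI_Request reqs[WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERATIONS; i++) {
        if (rank < half) {                       /* sender side */
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(sbuf[w], MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            /* wait for the receiver's acknowledgement before the next window */
            MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                                 /* receiver side */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(rbuf[w], MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0) {
        double msgs = (double)half * ITERATIONS * WINDOW;
        printf("aggregate message rate: %.2f million messages/s\n",
               msgs / elapsed / 1e6);
    }

    MPI_Finalize();
    return 0;
}

Launched across two nodes with an even rank count (for example, mpirun -np 16 against a hypothetical binary built from this sketch), this approximates the "8 ppn" configuration referenced in Table 1.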

As multi-core systems become increasingly prevalent, the cluster interconnect must be able to accommodate more processes per compute node. The TrueScale architecture was designed with this trend in mind, enabling users to take maximum advantage of all the cores in their compute nodes. This is accomplished through high message rate and superior inter- and intra-node communication capabilities.

Table 1. Summary of QLogic’s Message Rate and Scalable Latency Advantage Over Mellanox

Message Rate (non-coalesced), OSU Message Rate @ 8 ppn:
  Mellanox MHGH28 | MHGH29: 4.5 | 5.5 million messages/s
  QLogic QLE7240 | QLE7280: 19 | 26 million messages/s
  QLogic advantage: over 340%

Scalable Latency, HPCC Random Ring Latency @ 128 cores:
  Mellanox MHGH28 | MHGH29: 4.4 | 8.9 µs
  QLogic QLE7240 | QLE7280: 1.3 | 1.1 µs
  QLogic advantage: up to 33%


Figure 1 illustrates the ability of TrueScale to make effective use of multi-core nodes.1 Note that ConnectX does not scale as the processes per node increase. With TrueScale, more application work is accomplished as the node size increases.

Figure 1. TrueScale Multi-core Advantage in Message Rate Performance

Scalable Latency

In terms of scalable latency performance, at 128 cores, QLogic’s MPI latency ranges from 13 percent to 33 percent of Mellanox’s ConnectX.

All scalable latency results are from the HPC Challenge web site (http://icl.cs.utk.edu/hpcc/hpcc_results_all.cgi) and use the Random Ring Latency benchmark. ConnectX Gen1 results are from the 2008-05-15 submission by Intel using 128 cores of the Intel Endeavour cluster with Xeon® E5462 CPUs (2.8 GHz); ConnectX Gen2 results are from the 2008-05-09 submission by TU Dresden using 128 cores of the SGI® Altix® ICE 8200EX cluster with Xeon X5472 CPUs (3.0 GHz). QLogic QLE7240 results are from QLogic’s 2008-08-05 submission using 128 cores of the Darwin Cluster with Xeon 5160 CPUs (3.0 GHz); QLogic QLE7280 results are from QLogic’s 2008-08-01 submission using 128 cores of the QLogic Benchmark Cluster with Xeon E5472 CPUs (3.0 GHz).
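As a rough illustration of what a ring-latency benchmark measures, the sketch below times a small message traveling one hop around a ring of MPI ranks. It is a simplified stand-in, not the HPCC code: it uses the natural rank order instead of the random permutation HPCC applies, and the message size and iteration count are arbitrary choices.

/*
 * Illustrative ring-latency sketch in the spirit of the HPCC Random Ring
 * Latency benchmark (natural rank order, not a random permutation).
 * Reports the worst-case average time for an 8-byte message to move one hop.
 */
#include <mpi.h>
#include <stdio.h>

#define ITERATIONS 10000
#define MSG_SIZE   8

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    char sbuf[MSG_SIZE] = {0}, rbuf[MSG_SIZE];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* Each iteration: send to the right neighbor, receive from the left. */
    for (int i = 0; i < ITERATIONS; i++)
        MPI_Sendrecv(sbuf, MSG_SIZE, MPI_CHAR, right, 0,
                     rbuf, MSG_SIZE, MPI_CHAR, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double local = (MPI_Wtime() - t0) / ITERATIONS;   /* seconds per hop */
    double worst;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("worst-case per-hop latency: %.2f microseconds\n", worst * 1e6);

    MPI_Finalize();
    return 0;
}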

Figure 2 shows that TrueScale adapters maintain consistent latency performance as more cores are added to a node.2 Consequently, more of the compute power can be used for application workload rather than waiting for the adapter to process messages.

1 These are the results of the OSU multiple bandwidth/message rate (osu_mbw_mr) test. The test used a 1-byte message size when run on two nodes, each with two 3.0 GHz Intel Xeon E5472 quad-core CPUs. The test used QLogic MPI 2.2 for QLE7280 adapters and MVAPICH 1.0.0 and OFED 1.3 on Gen2 ConnectX DDR adapters.

2 These are the results of the OSU multiple latency (osu_multi_lat) test of QLE7240 and Gen1 ConnectX HCAs at a 128-byte message size when run on two nodes, each with two 2.33 GHz Intel Xeon E5410 quad-core CPUs.

Figure 2. TrueScale Multi-core Advantage in Latency Performance

When measuring latency with a realistic 128-byte message size, the latency performance of ConnectX drops off at about four to five cores per node. Under the same conditions, TrueScale provides consistent and predictable levels of performance.

Application Performance

SPEC MPI2007

There are more sophisticated benchmarks, such as SPEC MPI2007, which measure performance at a system level over a variety of applications. This benchmark suite includes 13 different codes and emphasizes areas of performance that are most relevant to MPI applications running on large-scale systems. The quantity and performance of the microprocessors, memory architecture, interconnect, compiler, and shared file system are all evaluated.

In August 2008, QLogic ran the SPECmpiM_base2007 benchmark on a TrueScale-enabled cluster that yielded the best overall performance at 96 and 128 cores.3 This result represents third-party validation of the scalable performance capabilities of the architecture over a variety of application types. It compared favorably not only to other commodity x86-based compute clusters, but also to platforms from large system vendors.

Halo Test

The halo test from Argonne National Laboratory’s mpptest benchmark suite simulates communication patterns in layered ocean models.

3 Details of the submission and results can be found at: http://www.spec.org/mpi2007/results/res2008q3/


Unlike many of the point-to-point microbenchmarks that measure peak bandwidth, this benchmark measures throughput performance over a variety of message sizes. As seen in Figure 3, TrueScale outperforms Mellanox across the entire range of message sizes.4

Figure 3. TrueScale Bandwidth Performance on Halo Benchmark

Application requirements vary in terms of message sizes and patterns, so performance over a variety of message sizes is a better predictor of performance than peak measurements. At four processes per node, TrueScale bandwidth performance is anywhere from 120 to 70 percent better at 128- and 1024-byte message sizes, respectively.

4 The benchmark is the halo test from Argonne National Laboratory’s mpptest; in particular, the 2D halo psendrecv test at 4 processes per node on 8 nodes, each with two 2.6 GHz AMD® Opteron™ 2218 CPUs, 8 GB of DDR2-667 memory, and an NVIDIA® MCP55 PCIe chipset, for a total of 32 MPI ranks. QLogic MPI 2.2 was used for TrueScale adapters and MVAPICH 0.9.9 for ConnectX.
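For readers unfamiliar with the communication pattern being timed, the following is a minimal sketch of a 2D halo exchange, the kind of nearest-neighbor step the mpptest halo test repeats: each rank on a 2D process grid trades a small boundary message with its four neighbors. The grid construction, 128-byte halo size, and iteration count are illustrative assumptions and do not reproduce the mpptest implementation.

/*
 * Illustrative 2D halo-exchange sketch: each rank on a periodic 2D process
 * grid exchanges a small boundary message with its four neighbors using
 * MPI_Sendrecv, and the average time per exchange is reported.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define HALO_BYTES 128
#define ITERATIONS 1000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Build a 2D periodic process grid and find the four neighbors. */
    int dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Dims_create(size, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    int up, down, left, right;
    MPI_Cart_shift(cart, 0, 1, &up, &down);
    MPI_Cart_shift(cart, 1, 1, &left, &right);

    char sbuf[HALO_BYTES], rbuf[HALO_BYTES];
    memset(sbuf, 0, sizeof sbuf);

    MPI_Barrier(cart);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERATIONS; i++) {
        /* One halo step: shift data down/up and right/left across the grid. */
        MPI_Sendrecv(sbuf, HALO_BYTES, MPI_CHAR, down,  0,
                     rbuf, HALO_BYTES, MPI_CHAR, up,    0, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sbuf, HALO_BYTES, MPI_CHAR, up,    1,
                     rbuf, HALO_BYTES, MPI_CHAR, down,  1, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sbuf, HALO_BYTES, MPI_CHAR, right, 2,
                     rbuf, HALO_BYTES, MPI_CHAR, left,  2, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sbuf, HALO_BYTES, MPI_CHAR, left,  3,
                     rbuf, HALO_BYTES, MPI_CHAR, right, 3, cart, MPI_STATUS_IGNORE);
    }

    double per_exchange = (MPI_Wtime() - t0) / ITERATIONS;
    if (rank == 0)
        printf("time per halo exchange: %.2f microseconds\n", per_exchange * 1e6);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}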

Summary and Conclusion

TrueScale is architecturally designed to take advantage of two significant trends in high performance computing clusters: the prevalence of multi-core processors in compute nodes and the need to deploy increasingly large clusters to tackle more complex computational problems.

The benefits of the TrueScale architecture can be demonstrated in a variety of industry standard benchmarks that measure the scalable performance characteristics of the interconnect. More importantly, the advantages can be realized through improved application performance and a reduced time-to-solution at about half the power of ConnectX.


Corporate Headquarters QLogic Corporation 26650 Aliso Viejo Parkway Aliso Viejo, CA 92656 949.389.6000 www.qlogic.com

Europe Headquarters QLogic (UK) LTD. Quatro House Lyon Way, Frimley Camberley Surrey, GU16 7ER UK +44 (0) 1276 804 670

© 2008 QLogic Corporation. Specifications are subject to change without notice. All rights reserved worldwide. QLogic, the QLogic logo, InfiniPath, and TrueScale are trademarks or registered trademarks of QLogic Corporation. Mellanox and ConnectX are trademarks or registered trademarks of Mellanox Technologies, Inc. InfiniBand is a trademark and service mark of the InfiniBand Trade Association. Intel and Xeon are registered trademarks of Intel Corporation. SGI and Altix are registered trademarks of Silicon Graphics, Inc., in the United States and/or other countries worldwide. AMD and Opteron are trademarks or registered trademarks of Advanced Micro Devices, Inc. NVIDIA is a registered trademark of NVIDIA Corporation in the United States and other countries. All other brand and product names are trademarks or registered trademarks of their respective owners. Information supplied by QLogic Corporation is believed to be accurate and reliable. QLogic Corporation assumes no responsibility for any errors in this brochure. QLogic Corporation reserves the right, without notice, to make changes in product design or specifications.


Disclaimer

Reasonable efforts have been made to ensure the validity and accuracy of these performance tests. QLogic Corporation is not liable for any error in this published white paper or the results thereof. Variation in results may be a result of changes in configuration or in the environment. QLogic specifically disclaims any warranty, expressed or implied, relating to the test results and their accuracy, analysis, completeness or quality.