TR-2015-01 Comparison of OpenCL performance on different platforms using VexCL and Blaze Hammad Mazhar Dan Negrut January 28, 2015

TR-2015-01

Comparison of OpenCL performance on different platforms using

VexCL and Blaze

Hammad Mazhar

Dan Negrut

January 28, 2015

Abstract

This technical report provides performance numbers for several benchmark problems running on several different hardware platforms. The goal of this report is twofold. First, it helps us better understand how the performance of OpenCL changes on different platforms. Second, it provides an OpenCL-OpenMP comparison for a sparse matrix-vector multiplication operation. The VexCL library is used for the OpenCL portion of this comparison and the Blaze C++ library is used for the OpenMP portion.


Contents

1 Blaze

2 VexCL

3 Hardware Platforms
3.1 AMD Kaveri
3.2 Intel Haswell
3.3 Intel Haswell-E
3.4 Intel Xeon
3.5 AMD Opteron

4 Benchmark Results

5 Chrono SpMV Benchmark

1 Blaze

Blaze [1] is an open source header-only library for performing linear algebra operations using dense and sparse data structures. Blaze was designed so that mathematical expressions can be written intuitively, with the library transparently handling type conversion and optimization. By default, Blaze uses OpenMP for parallelism, but it can be configured to use C++11 threads or Boost threads, or to execute serially. Additionally, Blaze provides bindings for generic BLAS libraries, such as ATLAS, which are used transparently for certain linear algebra operations such as matrix-matrix multiplication.

In this context, Blaze was used to compare the performance of OpenCL to OpenMP on platforms that supported it.
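The idea that an expression can be written intuitively while the library decides how to evaluate it can be illustrated with a small runtime analogue. The Python sketch below is not Blaze's implementation (Blaze resolves expressions at compile time with C++ templates); the class names `Expr`, `Vec`, `Sum`, and `Scaled` are hypothetical and exist only for this illustration. Arithmetic operators build an expression tree, and the assignment evaluates it in a single pass without intermediate vectors.

```python
class Expr:
    """Base class: arithmetic builds an expression tree instead of computing."""

    def __add__(self, other):
        return Sum(self, other)

    def __mul__(self, scalar):
        return Scaled(self, scalar)

    # Allows `2.0 * x` as well as `x * 2.0`.
    __rmul__ = __mul__


class Vec(Expr):
    """A concrete vector holding its data in a Python list."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def at(self, i):
        return self.data[i]

    def assign(self, expr):
        # One fused loop over the whole expression; no temporary vectors.
        self.data = [expr.at(i) for i in range(len(expr))]
        return self


class Sum(Expr):
    """Lazy element-wise sum of two expressions of equal length."""

    def __init__(self, left, right):
        if len(left) != len(right):
            raise ValueError("operands must have the same length")
        self.left, self.right = left, right

    def __len__(self):
        return len(self.left)

    def at(self, i):
        return self.left.at(i) + self.right.at(i)


class Scaled(Expr):
    """Lazy product of an expression with a scalar."""

    def __init__(self, expr, scalar):
        self.expr, self.scalar = expr, scalar

    def __len__(self):
        return len(self.expr)

    def at(self, i):
        return self.scalar * self.expr.at(i)


x = Vec([1.0, 2.0, 3.0])
y = Vec([10.0, 20.0, 30.0])
z = Vec([0.0, 0.0, 0.0]).assign(2.0 * x + y)
```

Here `2.0 * x + y` creates only small tree nodes; the element values are computed when `assign` walks the tree.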

2 VexCL

VexCL [2] is an open source header-only expression template library for both OpenCL and CUDA. Similar to Thrust [3], its purpose is to reduce the boilerplate code required to develop applications for GPUs and other accelerators. VexCL provides many different functions for reduction, linear algebra, sorting, and similar operations. In this technical report, the synthetic benchmark results are produced by the VexCL benchmark example [4]. For the real-world examples, VexCL's SpMV function is used.

For all tests, the latest version of VexCL from the GitHub repository [5] was used.

3 Hardware Platforms

Thirteen different hardware setups were used for benchmarking. This section describes each setup.

3.1 AMD Kaveri

This CPU is based on AMD's new Kaveri architecture. The die features 4 x86-64 cores based on AMD's Steamroller architecture and a Radeon R7 class GPU with 8 compute units. Specifically, the GPU performance of the 7850K matches that of the Radeon HD 7750. The GPU does not have dedicated memory and instead relies on the system memory. The upside is that the total memory available to the GPU is equal to the system RAM; the downside is that system memory is generally much slower than dedicated on-board GPU memory.

Model              AMD A10-7850K
Architecture       Steamroller
Clock (Turbo)      3.7 GHz (4.0 GHz)
Cores              4
Threads            4
L1 Cache           2x96 KB 3-way Instruction
                   4x16 KB 4-way Data
L2 Cache           2x2 MB 16-way
L3 Cache           None
Memory             16 GB
Memory Interface   Dual Channel DDR3
OS                 Arch Linux
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             AMD APP

Accelerator

Type               AMD Radeon R7 series
Compute Units      8
Cores              512
Clock              720 MHz

Reference: [6, 7]

3.2 Intel Haswell

Model              i7-4770K
Architecture       Haswell
Clock (Turbo)      3.5 GHz (3.9 GHz)
Cores              4
Threads            8
L1 Cache           4x32 KB 8-way Instruction
                   4x32 KB 8-way Data
L2 Cache           4x256 KB 8-way
L3 Cache           8 MB 16-way
Memory             32 GB
Memory Interface   Dual Channel DDR3
OS                 Arch Linux
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             Intel(R) OpenCL

Accelerator 1

Model              NVIDIA GTX 680
Architecture       GK104 - Kepler
Compute Units      8
Cores              1536
Clock (Boost)      1006 MHz (1058 MHz)
Memory             2 GB 256-bit GDDR5

Accelerator 2

Model              NVIDIA K20c
Architecture       GK110 - Kepler
Compute Units      13
Cores              2496
Clock              706 MHz
Memory             5 GB 320-bit GDDR5

Reference: [8–11]

3.3 Intel Haswell-E

Model              i7-5960X
Architecture       Haswell-E
CPU Clock (Turbo)  3.0 GHz (3.5 GHz)
CPU Cores          8
CPU Threads        16
L1 Cache           8x32 KB Instruction
                   8x32 KB Data
L2 Cache           8x256 KB
L3 Cache           20 MB
Memory             32 GB
Memory Interface   Quad Channel DDR4
OS                 Arch Linux
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             Intel(R) OpenCL

Accelerator

Model              NVIDIA GTX 770
Architecture       GK104 - Kepler
Compute Units      8
Cores              1536
Clock (Boost)      1046 MHz (1085 MHz)
Memory             2 GB 256-bit GDDR5

Reference: [12,13]

3.4 Intel Xeon

Model              E5-2690 V2
Architecture       Ivy Bridge-EP
Sockets            2
CPU Clock (Turbo)  3.0 GHz (3.6 GHz)
CPU Cores          10
CPU Threads        20
L1 Cache           10x32 KB 8-way Instruction
                   10x32 KB 8-way Data
L2 Cache           10x256 KB 8-way
L3 Cache           25 MB 20-way
Memory             64 GB
Memory Interface   Quad Channel DDR3
OS                 Arch Linux
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             Intel(R) OpenCL

Accelerator 1,2,3

Model              NVIDIA K20x
Architecture       GK110 - Kepler
Compute Units      15
Cores              2688
Clock              732 MHz
Memory             6 GB 384-bit GDDR5

Accelerator 4

Model              Intel Xeon Phi 5110P
Architecture       Knights Corner
Cores              60
Threads            240
L1 Cache           60x32 KB 8-way Instruction
                   60x32 KB 8-way Data
L2 Cache           60x512 KB 8-way
Clock              1053 MHz
Memory             8 GB GDDR5

Reference: [14–16]

3.5 AMD Opteron

Model              6274
Architecture       Bulldozer
Sockets            4
Clock (Turbo)      2.2 GHz (3.1 GHz)
Cores              16
Threads            16
L1 Cache           8x64 KB 2-way Instruction
                   16x16 KB 4-way Data
L2 Cache           8x2 MB 16-way
L3 Cache           2x8 MB up to 64-way
Memory             128 GB
Memory Interface   Quad Channel DDR3
OS                 CentOS 6
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             AMD APP

Reference: [17]

4 Benchmark Results

Using the VexCL library, a benchmark was performed on each platform. Several different tests were used to gauge performance, including sort, reduce, and scan operations, along with vector-vector operations such as add and matrix-vector operations such as SpMV.
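As a rough illustration of the operations named above, the following Python sketch gives serial reference versions of reduce, scan, and SAXPY, plus a helper that converts an operation count and a runtime into GFLOPs. The function names and the counting convention (2 floating-point operations per element for SAXPY, 2 per stored nonzero for SpMV) are assumptions made for this illustration; they are not taken from the VexCL benchmark source.

```python
from itertools import accumulate


def reduce_sum(x):
    """Reduce: combine all elements into a single sum."""
    return sum(x)


def inclusive_scan(x):
    """Scan: running (prefix) sums; element i is the sum of x[0..i]."""
    return list(accumulate(x))


def saxpy(a, x, y):
    """SAXPY: element-wise a * x + y for vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_flops(n):
    """Assumed convention: one multiply and one add per element."""
    return 2 * n


def spmv_flops(nnz):
    """Assumed convention: one multiply and one add per stored nonzero."""
    return 2 * nnz


def gflops(flop_count, seconds):
    """Convert an operation count and a measured runtime into GFLOPs."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flop_count / seconds / 1e9
```

Under these assumptions, a SAXPY over one billion elements that completes in one second corresponds to `gflops(saxpy_flops(10**9), 1.0)`, that is, 2 GFLOPs.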

[Figure: Sort and Scan performance per device in Keys/sec (axis ticks 10^6 to 10^9). Devices: A10-7850K, i7-4770K, Radeon R7, E5-2690 V2, Opteron 6274, MIC, i7-5960X, 3xK20X, 2xK20X, GTX680, GTX770, K20C, K20X.]

[Figure: Reduce, SAXPY, and SpMV performance per device in GFLOPs (axis 0 to 40), for the same devices.]

5 Chrono SpMV Benchmark

Along with the synthetic benchmarks performed using the VexCL library, actual matrices from a simulation were used to gauge real-world performance. The simulation setup consisted of a kinematically driven vehicle that fords a river composed of one million rigid, frictionless spheres.

Specifically, 8 different sets of matrices are compared on each platform. The problem being solved is $D^T M^{-1} D x$, which is split into two matrix-vector multiplications: first $temp = M^{-1} D x$ and then $Result = D^T temp$. Note that $M$ is a diagonal matrix and $x$ is a vector. The figures below show the simulation output, the Jacobian matrix $D$, and the FLOP rate results for the computation.
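The two-step product described above can be sketched in plain Python. This is a minimal serial illustration, not the VexCL or Blaze code used for the measurements. It assumes that $D$ is stored in compressed sparse row (CSR) form and that the diagonal of $M$ is stored as a plain list; the function names are hypothetical. Because $M$ is diagonal, applying $M^{-1}$ reduces to an element-wise division. The report may instead have formed $M^{-1} D$ as a single matrix beforehand, which the source does not specify.

```python
def csr_matvec(n_rows, row_ptr, col_idx, values, x):
    """y = D x for a CSR matrix D with n_rows rows."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def csr_transpose_matvec(n_cols, row_ptr, col_idx, values, t):
    """y = D^T t, computed from the CSR storage of D without forming D^T."""
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += values[k] * t[i]
    return y


def apply_dt_minv_d(n_rows, n_cols, row_ptr, col_idx, values, m_diag, x):
    """Result = D^T (M^{-1} (D x)), with M given by its diagonal m_diag."""
    dx = csr_matvec(n_rows, row_ptr, col_idx, values, x)
    # M is diagonal, so M^{-1} acts as an element-wise division.
    temp = [v / m for v, m in zip(dx, m_diag)]
    return csr_transpose_matvec(n_cols, row_ptr, col_idx, values, temp)


# Small made-up example: D = [[1, 0], [2, 3]], M = diag(2, 4), x = [1, 1].
row_ptr = [0, 1, 3]
col_idx = [0, 0, 1]
values = [1.0, 2.0, 3.0]
result = apply_dt_minv_d(2, 2, row_ptr, col_idx, values, [2.0, 4.0], [1.0, 1.0])
```

For this example, $D x = [1, 5]$, $temp = [0.5, 1.25]$, and the result is $[3.0, 3.75]$.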

[Figure panels: SpMV performance in GFLOPs per device at T = 3.0 s, T = 4.0 s, and T = 5.0 s. Devices: A10-7850K, i7-4770K, Radeon R7, E5-2690 V2, Opteron 6274, MIC, i7-5960X, 3xK20X, 2xK20X, GTX680, GTX770, K20C, K20X.]

Fig. 8 and Fig. 9 provide the same data as above in a different format: data is grouped by device name, with each bar representing one of the 7 tests that were performed. The results demonstrate how different sparsity patterns affect the speed at which the SpMV operation is performed. Fig. 10 and Fig. 11 show the results using the Blaze C++ library along with the speedup of VexCL vs. Blaze. In most cases Blaze was slightly faster than VexCL.

[Figure panels: SpMV performance in GFLOPs per device at T = 6.0 s, T = 7.0 s, T = 8.0 s, and T = 9.0 s.]

[Plot "SpMV CPU": GFLOPs (axis 0 to 4) for A10-7850K, i7-4770K, E5-2690 V2, Opteron 6274, and i7-5960X, with one bar per test from T = 3.0 s to T = 9.0 s.]

Figure 8: Combined plots for the CPUs using VexCL

[Plot "SpMV Accelerators": GFLOPs (axis 0 to 18) for Radeon R7, MIC, 3xK20X, 2xK20X, GTX680, GTX770, K20C, and K20X, with one bar per test from T = 3.0 s to T = 9.0 s.]

Figure 9: Combined plots for the accelerators using VexCL

[Plot "SpMV Blaze": GFLOPs (axis 0 to 4) for A10-7850K, i7-4770K, E5-2690 V2, Opteron 6274, and i7-5960X, with one bar per test from T = 3.0 s to T = 9.0 s.]

Figure 10: Combined plots for the CPUs using Blaze

[Plot "SpMV Speedup VexCL vs Blaze": axis 0 to 1.4, for A10-7850K, i7-4770K, E5-2690 V2, Opteron 6274, and i7-5960X, with one bar per test from T = 3.0 s to T = 9.0 s.]

Figure 11: Speedup for VexCL compared to Blaze for different matrices. A speedup of less than one means that VexCL was slower than Blaze.
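The speedup convention in the Figure 11 caption is consistent with a ratio of FLOP rates, which can be written as a one-line helper. The function name and the numbers in the example are hypothetical and are not measurements from this report.

```python
def speedup(vexcl_gflops, blaze_gflops):
    """Ratio of FLOP rates; a value below 1 means VexCL was slower than Blaze."""
    if blaze_gflops <= 0:
        raise ValueError("blaze_gflops must be positive")
    return vexcl_gflops / blaze_gflops


# Hypothetical rates chosen only to illustrate the convention.
example = speedup(1.5, 2.0)
```

With these made-up rates the ratio is 0.75, which would read as VexCL running at three quarters of the Blaze rate.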


References

[1] K. Iglberger, G. Hager, J. Treibig, and U. Rude. High performance smart expression template math libraries. In High Performance Computing and Simulation (HPCS), 2012 International Conference on, pages 367–373, July 2012.

[2] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. Programming CUDA and OpenCL: A case study using modern C++ libraries. CoRR, abs/1212.6326, 2012.

[3] J. Hoberock and N. Bell. Thrust: C++ template library for CUDA. http://thrust.github.com/.

[4] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. benchmark.cpp. https://github.com/ddemidov/vexcl/blob/master/examples/benchmark.cpp.

[5] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. VexCL. https://github.com/ddemidov/vexcl.

[6] AMD. AMD A-series APU processors. http://www.amd.com/en-us/products/processors/desktop/a-series-apu.

[7] CPU-World. AMD A10-series A10-7850K. http://www.cpu-world.com/CPUs/Bulldozer/AMD-A10-Series%20A10-7850K.html.

[8] CPU-World. Intel Core i7-4770K. http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-4770K.html.

[9] techpowerup. Nvidia GeForce GTX 680. http://www.techpowerup.com/gpudb/342/geforce-gtx-680.html.

[10] Nvidia. Tesla K20 GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v07.pdf.

[11] techpowerup. Nvidia Tesla K20c. http://www.techpowerup.com/gpudb/564/tesla-k20c.html.

[12] CPU-World. Intel Core i7-5960X. http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-5960X%20Extreme%20Edition.html.

[13] Nvidia. GeForce GTX 700 series. http://www.nvidia.com/gtx-700-graphics-cards/gtx-770/.

[14] CPU-World. Intel Xeon E5-2690 v2. http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2690%20v2.html.

[15] Nvidia. Tesla K20X GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20X-BD-06397-001-v07.pdf.

[16] CPU-World. Intel Xeon Phi 5110P. http://www.cpu-world.com/CPUs/Xeon_Phi/Intel-Xeon%20Phi%205110P.html.

[17] CPU-World. AMD Opteron 6274. http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206274%20OS6274WKTGGGU.html.