TR-2015-01
Comparison of OpenCL performance on different platforms using
VexCL and Blaze
Hammad Mazhar
Dan Negrut
January 28, 2015
Abstract
This technical report provides performance numbers for several benchmark problems running on several different hardware platforms. The goal of this report is twofold. First, it helps us better understand how the performance of OpenCL changes on different platforms. Second, it provides an OpenCL-OpenMP comparison for a sparse matrix-vector multiplication operation. The VexCL library is used for the OpenCL portion of this comparison and the Blaze C++ library is used for the OpenMP portion.
Contents
1 Blaze
2 VexCL
3 Hardware Platforms
  3.1 AMD Kaveri
  3.2 Intel Haswell
  3.3 Intel Haswell-E
  3.4 Intel Xeon
  3.5 AMD Opteron
4 Benchmark Results
5 Chrono SpMV Benchmark
1 Blaze
Blaze [1] is an open source header-only library for performing linear algebra operations using dense and sparse data structures. Blaze was designed so that mathematical expressions can be written intuitively, with the library transparently handling type conversion and optimization. By default, Blaze uses OpenMP for parallelism, but it can be configured to use C++11 threads or Boost threads, or to execute serially. Additionally, bindings are available for generic BLAS libraries, such as ATLAS, which are used transparently for certain linear algebra operations such as matrix-matrix multiplication.
In this report, Blaze was used to compare the performance of OpenCL to that of OpenMP on the platforms that support both.
2 VexCL
VexCL [2] is an open source header-only expression template library for both OpenCL and CUDA. Similar to Thrust [3], its purpose is to reduce the boilerplate code required to develop applications for GPUs and other accelerators. VexCL provides many different functions for reduction, linear algebra, sorting, and similar operations. In this technical report, the synthetic benchmark results are provided by the VexCL benchmark example [4]. For the real-world examples, VexCL's SpMV function is used.
For all tests, the latest version of VexCL from the GitHub repository [5] was used.
3 Hardware Platforms
Thirteen different hardware setups were used for benchmarking; this section describes each setup.
3.1 AMD Kaveri
This CPU is based on AMD's new Kaveri architecture. The die features four x86-64 cores based on AMD's Steamroller architecture and an 8-core Radeon R7 class GPU. Specifically, the performance of the GPU in the 7850K matches that of the Radeon HD 7750. The GPU does not have dedicated memory and instead relies on the system memory. The upside is that the total memory available to the GPU is equal to the system RAM; the downside is that system memory is generally much slower than dedicated on-board GPU memory.
Model              AMD A10-7850K
Architecture       Steamroller
Clock (Turbo)      3.7 GHz (4.0 GHz)
Cores              4
Threads            4
L1 Cache           2x96 KB 3-way Instruction
                   4x16 KB 4-way Data
L2 Cache           2x2 MB 16-way
L3 Cache           None
Memory             16 GB
Memory Interface   Dual Channel DDR3
OS                 Arch Linux
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             AMD APP

Accelerator
Type               AMD Radeon R7 series
Compute Units      8
Cores              512
Clock              720 MHz
Reference: [6, 7]
3.2 Intel Haswell
Model              i7-4770K
Architecture       Haswell
Clock (Turbo)      3.5 GHz (3.9 GHz)
Cores              4
Threads            8
L1 Cache           4x32 KB 8-way Instruction
                   4x32 KB 8-way Data
L2 Cache           4x256 KB 8-way
L3 Cache           8 MB 16-way
Memory             32 GB
Memory Interface   Dual Channel DDR3
OS                 Arch Linux
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             Intel(R) OpenCL

Accelerator 1
Model              NVIDIA GTX 680
Architecture       GK104 - Kepler
Compute Units      8
Cores              1536
Clock (Boost)      1006 MHz (1058 MHz)
Memory             2 GB 256-bit GDDR5

Accelerator 2
Model              NVIDIA K20c
Architecture       GK110 - Kepler
Compute Units      13
Cores              2496
Clock              706 MHz
Memory             5 GB 320-bit GDDR5
Reference: [8–11]
3.3 Intel Haswell-E
Model              i7-5960X
Architecture       Haswell-E
CPU Clock (Turbo)  3.0 GHz (3.5 GHz)
CPU Cores          8
CPU Threads        16
L1 Cache           8x32 KB Instruction
                   8x32 KB Data
L2 Cache           8x256 KB
L3 Cache           20 MB
Memory             32 GB
Memory Interface   Quad Channel DDR4
OS                 Arch Linux
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             Intel(R) OpenCL

Accelerator
Model              NVIDIA GTX 770
Architecture       GK104 - Kepler
Compute Units      8
Cores              1536
Clock (Boost)      1046 MHz (1085 MHz)
Memory             2 GB 256-bit GDDR5
Reference: [12,13]
3.4 Intel Xeon
Model              E5-2690 V2
Architecture       Ivy Bridge-EP
Sockets            2
CPU Clock (Turbo)  3.0 GHz (3.6 GHz)
CPU Cores          10
CPU Threads        20
L1 Cache           10x32 KB 8-way Instruction
                   10x32 KB 8-way Data
L2 Cache           10x256 KB 8-way
L3 Cache           25 MB 20-way
Memory             64 GB
Memory Interface   Quad Channel DDR3
OS                 Arch Linux
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             Intel(R) OpenCL

Accelerator 1, 2, 3
Model              NVIDIA K20x
Architecture       GK110 - Kepler
Compute Units      15
Cores              2688
Clock              732 MHz
Memory             6 GB 384-bit GDDR5

Accelerator 4
Model              Intel Xeon Phi 5110P
Architecture       Knights Corner
Cores              60
Threads            240
L1 Cache           60x32 KB 8-way Instruction
                   60x32 KB 8-way Data
L2 Cache           60x512 KB 8-way
Clock              1053 MHz
Memory             8 GB GDDR5
Reference: [14–16]
3.5 AMD Opteron
Model              6274
Architecture       Bulldozer
Sockets            4
Clock (Turbo)      2.2 GHz (3.1 GHz)
Cores              16
Threads            16
L1 Cache           8x64 KB 2-way Instruction
                   16x16 KB 4-way Data
L2 Cache           8x2 MB 16-way
L3 Cache           2x8 MB up to 64-way
Memory             128 GB
Memory Interface   Quad Channel DDR3
OS                 CentOS 6
Compiler           GCC 4.9.2
Compiler Flags     -O3
OpenCL             AMD APP
Reference: [17]
4 Benchmark Results
Using the VexCL library, a benchmark was performed on each platform. Several different tests were used to gauge performance, including sort, reduce, and scan operations, along with vector-vector operations such as add and matrix-vector operations such as SpMV.
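As a reference for what each synthetic test computes, the following is a minimal serial C++ sketch of the four element-wise and collective operations, written with the standard library only. It is not the VexCL benchmark code [4]: the function names `saxpy`, `reduce_sum`, `prefix_sum`, and `sorted_keys` are hypothetical, the choice of `double` and `int` element types is an assumption, and the scan is assumed to be inclusive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// SAXPY: y <- a * x + y (two floating point operations per element).
// Element type double is an assumption made for this sketch.
void saxpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Reduce: sum of all elements.
double reduce_sum(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Scan: inclusive prefix sum (assumed variant), out[i] = v[0] + ... + v[i].
std::vector<int> prefix_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Sort: keys in ascending order; throughput is reported in keys per second.
std::vector<int> sorted_keys(std::vector<int> keys) {
    std::sort(keys.begin(), keys.end());
    return keys;
}
```

On an accelerator each of these maps to a parallel kernel; the serial versions only pin down the expected results against which a parallel implementation could be checked.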
[Figure: Sort and Scan throughput in keys/sec (logarithmic axis, 10^6 to 10^9) for each device: A10-7850K, i7-4770K, Radeon R7, E5-2690 V2, Opteron 6274, MIC, i7-5960X, 3xK20X, 2xK20X, GTX680, GTX770, K20C, K20X.]
[Figure: Reduce, SAXPY, and SpMV performance in GFLOPs (axis 0 to 40) for each device: A10-7850K, i7-4770K, Radeon R7, E5-2690 V2, Opteron 6274, MIC, i7-5960X, 3xK20X, 2xK20X, GTX680, GTX770, K20C, K20X.]
5 Chrono SpMV Benchmark
Along with the synthetic benchmarks performed using the VexCL library, actual matrices from a simulation were used to gauge real-world performance. The simulation setup consisted of a kinematically driven vehicle that fords a river composed of one million rigid, frictionless spheres.
Specifically, seven different sets of matrices are compared on each platform, one for each simulation time from T = 3.0 s to T = 9.0 s. The problem being solved is $D^T M^{-1} D x$, which is split into two matrix-vector multiplications: first $\mathrm{temp} = M^{-1} D x$ and then $\mathrm{Result} = D^T \mathrm{temp}$. Note that $M$ is a diagonal matrix and $x$ is a vector. The figures below show the simulation output, the Jacobian matrix $D$, and the FLOP rate for the computation.
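To make the two-step evaluation concrete, the following is a minimal serial C++ sketch using only the standard library. It is not the VexCL or Blaze code used for the measurements: the `CsrMatrix` type and the functions `spmv`, `spmv_transposed`, and `apply_dt_minv_d` are hypothetical names, the CSR storage format is an assumption, and $M$ is assumed to be stored as a vector of its diagonal entries so that applying $M^{-1}$ reduces to an element-wise division.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CSR (compressed sparse row) matrix; not a VexCL or Blaze type.
struct CsrMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> values;        // size nnz
};

// y = A * x
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    assert(x.size() == A.cols);
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            sum += A.values[k] * x[A.col_idx[k]];
        }
        y[i] = sum;
    }
    return y;
}

// y = A^T * x, computed by scattering each row of A without forming A^T.
std::vector<double> spmv_transposed(const CsrMatrix& A, const std::vector<double>& x) {
    assert(x.size() == A.rows);
    std::vector<double> y(A.cols, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i) {
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            y[A.col_idx[k]] += A.values[k] * x[i];
        }
    }
    return y;
}

// Result = D^T (M^{-1} (D x)), with M given by its diagonal entries m_diag.
std::vector<double> apply_dt_minv_d(const CsrMatrix& D,
                                    const std::vector<double>& m_diag,
                                    const std::vector<double>& x) {
    assert(m_diag.size() == D.rows);
    std::vector<double> temp = spmv(D, x);   // temp = D x
    for (std::size_t i = 0; i < temp.size(); ++i) {
        temp[i] /= m_diag[i];                // temp = M^{-1} D x
    }
    return spmv_transposed(D, temp);         // Result = D^T temp
}
```

For example, with the 3x2 matrix D = [[1, 2], [0, 3], [4, 0]], diagonal entries (2, 1, 4), and x = (1, 1), the intermediate vector is temp = (1.5, 3, 1) and the result is (5.5, 12). An implementation tuned for throughput might instead store $D^T$ (or the product $M^{-1} D$) explicitly, so that both steps are plain row-wise SpMV operations.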
[Figures: SpMV performance in GFLOPs at T = 3.0 s, T = 4.0 s, and T = 5.0 s for each device: A10-7850K, i7-4770K, Radeon R7, E5-2690 V2, Opteron 6274, MIC, i7-5960X, 3xK20X, 2xK20X, GTX680, GTX770, K20C, K20X.]
Fig. 8 and Fig. 9 provide the same data as above in a different format: the data is grouped by device name, with each bar representing one of the 7 tests that were performed. The results demonstrate how different sparsity patterns affect the speed of the SpMV operation. Fig. 10 and Fig. 11 show the results using the Blaze C++ library, along with the speedup of VexCL vs Blaze. In most cases Blaze was slightly faster than VexCL.
[Figures: SpMV performance in GFLOPs per device at T = 6.0 s, T = 7.0 s, T = 8.0 s, and T = 9.0 s.]
[Bar chart "SpMV CPU": GFLOPs (axis 0 to 4) for A10-7850K, i7-4770K, E5-2690 V2, Opteron 6274, and i7-5960X, with one bar per test from T = 3.0 s to T = 9.0 s.]
Figure 8: Combined plots for the CPUs using VexCL
[Bar chart "SpMV Accelerators": GFLOPs (axis 0 to 18) for Radeon R7, MIC, 3xK20X, 2xK20X, GTX680, GTX770, K20C, and K20X, with one bar per test from T = 3.0 s to T = 9.0 s.]
Figure 9: Combined plots for the accelerators using VexCL
[Bar chart "SpMV Blaze": GFLOPs (axis 0 to 4) for A10-7850K, i7-4770K, E5-2690 V2, Opteron 6274, and i7-5960X, with one bar per test from T = 3.0 s to T = 9.0 s.]
Figure 10: Combined plots for the CPUs using Blaze
[Bar chart "SpMV Speedup VexCL vs Blaze": axis 0 to 1.4 for A10-7850K, i7-4770K, E5-2690 V2, Opteron 6274, and i7-5960X, with one bar per test from T = 3.0 s to T = 9.0 s.]
Figure 11: Speedup for VexCL compared to Blaze for different matrices. A speedup of less than one means that VexCL was slower than Blaze.
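The report does not state how the FLOP rate or the speedup was computed. As a sketch of one common convention, the C++ snippet below counts two floating point operations (one multiply and one add) per stored nonzero for a single SpMV and takes the speedup as the ratio of the two rates. The function names `spmv_gflops` and `speedup` and the flop-counting convention are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// GFLOP rate of one SpMV, assuming 2 flops (multiply + add) per stored nonzero.
double spmv_gflops(std::size_t nnz, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * static_cast<double>(nnz) / seconds / 1e9;
}

// Speedup of VexCL over Blaze as a ratio of rates; a value below one
// means that VexCL was slower than Blaze.
double speedup(double vexcl_gflops, double blaze_gflops) {
    assert(blaze_gflops > 0.0);
    return vexcl_gflops / blaze_gflops;
}
```

Under this convention, a VexCL rate of 3 GFLOPs against a Blaze rate of 4 GFLOPs gives a speedup of 0.75, which would appear in Figure 11 as a bar below one.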
References
[1] K. Iglberger, G. Hager, J. Treibig, and U. Rüde. High performance smart expression template math libraries. In High Performance Computing and Simulation (HPCS), 2012 International Conference on, pages 367–373, July 2012.
[2] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. Programming CUDA and OpenCL: A case study using modern C++ libraries. CoRR, abs/1212.6326, 2012.
[3] J. Hoberock and N. Bell. Thrust: C++ template library for CUDA. http://thrust.github.com/.
[4] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. benchmark.cpp. https://github.com/ddemidov/vexcl/blob/master/examples/benchmark.cpp.
[5] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. VexCL. https://github.com/ddemidov/vexcl.
[6] AMD. AMD A-Series APU processors. http://www.amd.com/en-us/products/processors/desktop/a-series-apu.
[7] CPU-World. AMD A10-Series A10-7850K. http://www.cpu-world.com/CPUs/Bulldozer/AMD-A10-Series%20A10-7850K.html.
[8] CPU-World. Intel Core i7-4770K. http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-4770K.html.
[9] TechPowerUp. NVIDIA GeForce GTX 680. http://www.techpowerup.com/gpudb/342/geforce-gtx-680.html.
[10] NVIDIA. Tesla K20 GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v07.pdf.
[11] TechPowerUp. NVIDIA Tesla K20c. http://www.techpowerup.com/gpudb/564/tesla-k20c.html.
[12] CPU-World. Intel Core i7-5960X. http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-5960X%20Extreme%20Edition.html.
[13] NVIDIA. GeForce GTX 700 series. http://www.nvidia.com/gtx-700-graphics-cards/gtx-770/.
[14] CPU-World. Intel Xeon E5-2690 v2. http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2690%20v2.html.
[15] NVIDIA. Tesla K20X GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20X-BD-06397-001-v07.pdf.
[16] CPU-World. Intel Xeon Phi 5110P. http://www.cpu-world.com/CPUs/Xeon_Phi/Intel-Xeon%20Phi%205110P.html.
[17] CPU-World. AMD Opteron 6274. http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206274%20OS6274WKTGGGU.html.