APPENDIX A
Tesla Specifications
Floating-point performance
Tesla products                             C870      C1060     C2050     C2070     M2090     K10       K20       K20X
Compute capability                         1.0       1.3       2.0       2.0       2.0       3.0       3.5       3.5
Number of multiprocessors                  16        30        14        14        16        2 × 8     13        14
Core clock (GHz)                           1.35      1.296     1.15      1.15      1.3       0.745     0.706     0.732
Single-precision cores per multiprocessor  8         8         32        32        32        192       192       192
Total single-precision cores               128       240       448       448       512       2 × 1536  2496      2688
Single-precision GFlops (multiply + add)   346       622       1030      1030      1331      2 × 2289  3524      3935
Double-precision cores per multiprocessor  –         1         16*       16*       16*       8         64        64
Total double-precision cores               –         30        224*      224*      256*      2 × 64    832       896
Double-precision GFlops (multiply + add)   –         78        515*      515*      665*      2 × 95    1175      1312

* GeForce GPUs have fewer double-precision units.
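The GFlops rows in the table above appear to follow from the core counts and clocks: total cores times core clock times two floating-point operations per cycle (one multiply plus one add). The Python sketch below checks that relationship. The function name `peak_gflops` and the `SPECS` dictionary are illustrative, not part of the source. The K10 entry is per GPU, and the C870 is given zero double-precision cores because the table lists none.

```python
# Sketch: peak GFlops = cores x core clock (GHz) x 2 (one multiply + one add per cycle).
# Values are copied from the table above; the K10 entry is for one of its two GPUs.
SPECS = {
    # name: (core clock in GHz, total single-precision cores, total double-precision cores)
    "C870":  (1.35,  128,  0),    # no double-precision units listed
    "C1060": (1.296, 240,  30),
    "C2050": (1.15,  448,  224),
    "C2070": (1.15,  448,  224),
    "M2090": (1.3,   512,  256),
    "K10":   (0.745, 1536, 64),   # per GPU; the board carries two
    "K20":   (0.706, 2496, 832),
    "K20X":  (0.732, 2688, 896),
}


def peak_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Theoretical peak in GFlops, assuming every core issues a multiply-add each cycle."""
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    for name, (clock, sp_cores, dp_cores) in SPECS.items():
        sp = peak_gflops(sp_cores, clock)
        dp = peak_gflops(dp_cores, clock)
        print(f"{name:6s} single {sp:7.1f}  double {dp:7.1f}")
```

The computed values agree with the tabulated ones to within about one GFlops; the small differences come from rounding in the printed table.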
CUDA Fortran for Scientists and Engineers. http://dx.doi.org/10.1016/B978-0-12-416970-8.00016-X © 2014 Elsevier Inc. All rights reserved.
Memory
Device Memory (DRAM)
Tesla products                                 C870      C1060     C2050     C2070     M2090     K10       K20       K20X
Compute capability                             1.0       1.3       2.0       2.0       2.0       3.0       3.5       3.5
Total global memory (GB)                       1.5       4         3*        6*        6*        2 × 4*    5*        6*
Constant memory (KB)                           64        64        64        64        64        64        64        64
Memory clock (MHz)                             800       800       1,500     1,566     1,848     2,500     2,600     2,600
Bus width (bits)                               384       512       384       384       384       2 × 256   320       384
Theoretical peak bandwidth (GB/s)              76.8      102.4     144*      150.3*    177.4*    2 × 160*  208*      249.6*

On-Chip Memory
Compute capability                             1.0    1.3    2.0         3.0                3.5
32-bit registers per multiprocessor            8 K    16 K   32 K        64 K               64 K
Maximum registers per thread                   127    127    63          63                 255
Shared memory per multiprocessor               16 K   16 K   48 K/16 K   48 K/32 K/16 K     48 K/32 K/16 K
L1 cache per multiprocessor                    –      –      16 K/48 K   16 K/32 K/48 K**   16 K/32 K/48 K**
Constant memory cache per multiprocessor (KB)  8      8      8           8                  8

* With ECC enabled, the available global memory and peak bandwidth will be less than the numbers listed.
** For the K10, K20, and K20X GPUs, the L1 cache is used for local memory only.
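The theoretical peak bandwidth figures above can be reproduced from the memory clock and bus width. The Python sketch below does so under two assumptions that the table does not state: the memory is double data rate (two transfers per clock), and 1 GB/s means 10^9 bytes per second. The function name `peak_bandwidth_gbs` is illustrative.

```python
# Sketch: peak bandwidth = memory clock x transfers per clock x bus width in bytes.
# Assumes double-data-rate memory (2 transfers per clock) and 1 GB = 10**9 bytes.

def peak_bandwidth_gbs(memory_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (ECC disabled)."""
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


if __name__ == "__main__":
    # (memory clock in MHz, bus width in bits), copied from the table above;
    # the K10 entry is for one of its two GPUs.
    boards = {
        "C870": (800, 384), "C1060": (800, 512), "C2050": (1500, 384),
        "C2070": (1566, 384), "M2090": (1848, 384), "K10": (2500, 256),
        "K20": (2600, 320), "K20X": (2600, 384),
    }
    for name, (clock, width) in boards.items():
        print(f"{name:6s} {peak_bandwidth_gbs(clock, width):6.1f} GB/s")
```

With these assumptions the results match every bandwidth entry in the table to the printed precision, for example 800 MHz on a 384-bit bus gives 76.8 GB/s for the C870.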
Execution configuration limits
Compute capability                          1.0                1.3                2.0                    3.0                         3.5
Tesla products                              C870               C1060              C2050, C2070, M2090,   K10                         K20, K20X
                                                                                  M2050, M2070
Maximum thread blocks per multiprocessor    8                  8                  8                      16                          16
Maximum threads per thread block            512                512                1024                   1024                        1024
Maximum threads (warps) per multiprocessor  768 (24)           1024 (32)          1536 (48)              2048 (64)                   2048 (64)
Maximum grid dimensions                     65536 × 65536 × 1  65536 × 65536 × 1  65536 × 65536 × 65536  2147483647 × 65536 × 65536  2147483647 × 65536 × 65536
Maximum block dimensions                    512 × 512 × 64     512 × 512 × 64     1024 × 1024 × 64       1024 × 1024 × 64            1024 × 1024 × 64
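As a rough illustration of how the per-multiprocessor limits above interact, the Python sketch below estimates how many thread blocks of a given size can be resident on one multiprocessor. It assumes a warp size of 32 threads, which matches the thread and warp counts in the table. It deliberately ignores register and shared-memory usage, so the result is an upper bound, not an occupancy calculation. The names `LIMITS` and `resident_blocks` are illustrative.

```python
# Sketch: upper bound on resident thread blocks per multiprocessor,
# using only the block-count and thread-count limits from the table above.
WARP_SIZE = 32  # assumed; consistent with e.g. 768 threads = 24 warps

LIMITS = {
    # compute capability: (max blocks per MP, max threads per block, max threads per MP)
    "1.0": (8, 512, 768),
    "1.3": (8, 512, 1024),
    "2.0": (8, 1024, 1536),
    "3.0": (16, 1024, 2048),
    "3.5": (16, 1024, 2048),
}


def resident_blocks(compute_capability, threads_per_block):
    """Blocks that fit on one multiprocessor, ignoring registers and shared memory."""
    max_blocks, max_threads_per_block, max_threads_per_mp = LIMITS[compute_capability]
    if not 1 <= threads_per_block <= max_threads_per_block:
        raise ValueError("threads_per_block exceeds the limit for this compute capability")
    # Threads are scheduled in whole warps, so round the block size up to a warp multiple.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    max_warps_per_mp = max_threads_per_mp // WARP_SIZE
    return min(max_blocks, max_warps_per_mp // warps_per_block)


if __name__ == "__main__":
    for cc in LIMITS:
        print(cc, [resident_blocks(cc, n) for n in (64, 128, 256, 512)])
```

For example, with 256 threads per block a compute capability 2.0 multiprocessor is limited by its 48 warps to 6 resident blocks, while with 64 threads per block the limit of 8 blocks per multiprocessor applies first.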