CUDA Fortran for Scientists and Engineers || Tesla Specifications



APPENDIX A

Tesla Specifications

Floating-point performance

| Tesla product | C870 | C1060 | C2050 | C2070 | M2090 | K10 | K20 | K20X |
|---|---|---|---|---|---|---|---|---|
| Compute capability | 1.0 | 1.3 | 2.0 | 2.0 | 2.0 | 3.0 | 3.5 | 3.5 |
| Number of multiprocessors | 16 | 30 | 14 | 14 | 16 | 2 × 8 | 13 | 14 |
| Core clock (GHz) | 1.35 | 1.296 | 1.15 | 1.15 | 1.3 | 0.745 | 0.706 | 0.732 |
| Single-precision cores per multiprocessor | 8 | 8 | 32 | 32 | 32 | 192 | 192 | 192 |
| Total single-precision cores | 128 | 240 | 448 | 448 | 512 | 2 × 1536 | 2496 | 2688 |
| Single-precision GFlops (multiply + add) | 346 | 622 | 1030 | 1030 | 1331 | 2 × 2289 | 3524 | 3935 |
| Double-precision cores per multiprocessor | – | 1 | 16* | 16* | 16* | 8 | 64 | 64 |
| Total double-precision cores | – | 30 | 224* | 224* | 256* | 2 × 64 | 832 | 896 |
| Double-precision GFlops (multiply + add) | – | 78 | 515* | 515* | 665* | 2 × 95 | 1175 | 1312 |

*GeForce GPUs have fewer double-precision units.
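The GFlops rows above appear to follow from the core counts and clocks: each core is credited with one multiply and one add per cycle, so peak GFlops = 2 × cores × clock (GHz). The following Python sketch checks that reading against the table. The formula is an assumption inferred from the "(multiply + add)" label, and the names `SPECS` and `peak_gflops` are illustrative only. For the dual-GPU K10 the figures are per GPU.

```python
# Sketch: reproduce the peak GFlops rows from cores and clock.
# Assumption: peak GFlops = 2 * cores * clock_GHz (one multiply + one add
# per core per cycle). Values are copied from the table above.

# product: (total SP cores, total DP cores, core clock in GHz)
# The C870 row lists no double-precision cores, entered here as 0.
SPECS = {
    "C870":  (128,  0,   1.35),
    "C1060": (240,  30,  1.296),
    "C2050": (448,  224, 1.15),
    "C2070": (448,  224, 1.15),
    "M2090": (512,  256, 1.3),
    "K10":   (1536, 64,  0.745),   # per GPU; the board carries two
    "K20":   (2496, 832, 0.706),
    "K20X":  (2688, 896, 0.732),
}


def peak_gflops(cores, clock_ghz):
    """Peak GFlops assuming 2 floating-point operations per core per cycle."""
    return 2 * cores * clock_ghz


for name, (sp_cores, dp_cores, clock) in SPECS.items():
    sp = peak_gflops(sp_cores, clock)
    dp = peak_gflops(dp_cores, clock)
    print(f"{name:6s} SP {sp:7.1f} GFlops   DP {dp:7.1f} GFlops")
```

Under this assumption the computed values agree with the listed ones to within 1 GFlop; the small differences come from rounding in the table.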

CUDA Fortran for Scientists and Engineers. http://dx.doi.org/10.1016/B978-0-12-416970-8.00016-X © 2014 Elsevier Inc. All rights reserved.


Memory

Device Memory (DRAM)

| Tesla product | C870 | C1060 | C2050 | C2070 | M2090 | K10 | K20 | K20X |
|---|---|---|---|---|---|---|---|---|
| Compute capability | 1.0 | 1.3 | 2.0 | 2.0 | 2.0 | 3.0 | 3.5 | 3.5 |
| Total global memory (GB) | 1.5 | 4 | 3* | 6* | 6* | 2 × 4* | 5* | 6* |
| Constant memory (KB) | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| Memory clock (MHz) | 800 | 800 | 1,500 | 1,566 | 1,848 | 2,500 | 2,600 | 2,600 |
| Bus width (bits) | 384 | 512 | 384 | 384 | 384 | 2 × 256 | 320 | 384 |
| Theoretical peak bandwidth (GB/s) | 76.8 | 102.4 | 144* | 150.3* | 177.4* | 2 × 160* | 208* | 249.6* |

On-Chip Memory

| Compute capability | 1.0 | 1.3 | 2.0 | 3.0 | 3.5 |
|---|---|---|---|---|---|
| 32-bit registers per multiprocessor | 8 K | 16 K | 32 K | 64 K | 64 K |
| Maximum registers per thread | 127 | 127 | 63 | 63 | 255 |
| Shared memory per multiprocessor | 16 K | 16 K | 48 K/16 K | 48 K/32 K/16 K | 48 K/32 K/16 K |
| L1 cache per multiprocessor | – | – | 16 K/48 K | 16 K/32 K/48 K** | 16 K/32 K/48 K** |
| Constant memory cache per multiprocessor (KB) | 8 | 8 | 8 | 8 | 8 |

*With ECC enabled the available global memory and peak bandwidth will be less than the numbers listed.

**For the K10, K20, and K20X GPUs, the L1 cache is used for local memory only.
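The theoretical peak bandwidth row can be recovered from the memory clock and bus width. The Python sketch below assumes double-data-rate memory (two transfers per clock), which is an assumption on top of the table, though it reproduces every listed value. The function name `peak_bandwidth_gbs` is illustrative, and the K10 entry is per GPU.

```python
# Sketch: theoretical peak bandwidth from memory clock and bus width.
# Assumption: double data rate, i.e. two transfers per memory clock cycle.

def peak_bandwidth_gbs(mem_clock_mhz, bus_width_bits):
    """Return the peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    # MHz * bytes gives MB/s; the factor 2 is the assumed double data rate.
    mb_per_s = 2 * mem_clock_mhz * bytes_per_transfer
    return mb_per_s / 1000


# product: (memory clock in MHz, bus width in bits), copied from the table
MEMORY = {
    "C870":  (800,  384),
    "C1060": (800,  512),
    "C2050": (1500, 384),
    "C2070": (1566, 384),
    "M2090": (1848, 384),
    "K10":   (2500, 256),   # per GPU
    "K20":   (2600, 320),
    "K20X":  (2600, 384),
}

for name, (clock, width) in MEMORY.items():
    print(f"{name:6s} {peak_bandwidth_gbs(clock, width):6.1f} GB/s")
```

As the first footnote says, these are upper bounds: with ECC enabled the achievable figure is lower.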


Execution configuration limits

| Compute capability | 1.0 | 1.3 | 2.0 | 3.0 | 3.5 |
|---|---|---|---|---|---|
| Tesla products | C870 | C1060 | C2050, C2070, M2050, M2070, M2090 | K10 | K20, K20X |
| Maximum thread blocks per multiprocessor | 8 | 8 | 8 | 16 | 16 |
| Maximum threads per thread block | 512 | 512 | 1024 | 1024 | 1024 |
| Maximum threads (warps) per multiprocessor | 768 (24) | 1024 (32) | 1536 (48) | 2048 (64) | 2048 (64) |
| Maximum grid dimensions | 65536 × 65536 × 1 | 65536 × 65536 × 1 | 65536 × 65536 × 65536 | 2147483647 × 65536 × 65536 | 2147483647 × 65536 × 65536 |
| Maximum block dimensions | 512 × 512 × 64 | 512 × 512 × 64 | 1024 × 1024 × 64 | 1024 × 1024 × 64 | 1024 × 1024 × 64 |
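These limits can be combined to estimate how many thread blocks fit on one multiprocessor for a given block size. The Python sketch below is a simplified estimate, not a full occupancy calculation: it considers only the block and thread (warp) limits from the table and ignores register and shared memory usage, which can reduce the count further. The names `LIMITS`, `resident_blocks`, and `occupancy` are illustrative, and the warp size of 32 is inferred from the threads (warps) row.

```python
# Sketch: resident thread blocks per multiprocessor from the limits above.
# Assumption: only the block and thread limits apply; register and shared
# memory limits are ignored in this simplified estimate.

WARP_SIZE = 32  # consistent with the table, e.g. 768 threads = 24 warps

# compute capability: (max blocks per multiprocessor,
#                      max threads per block,
#                      max threads per multiprocessor)
LIMITS = {
    "1.0": (8,  512,  768),
    "1.3": (8,  512,  1024),
    "2.0": (8,  1024, 1536),
    "3.0": (16, 1024, 2048),
    "3.5": (16, 1024, 2048),
}


def warps_per_block(threads_per_block):
    """Threads are scheduled in whole warps, so round up."""
    return -(-threads_per_block // WARP_SIZE)


def resident_blocks(cc, threads_per_block):
    """Blocks that fit on one multiprocessor under the two table limits."""
    max_blocks, max_block_threads, max_threads = LIMITS[cc]
    if not 1 <= threads_per_block <= max_block_threads:
        raise ValueError("block size outside the limit for this device")
    max_warps = max_threads // WARP_SIZE
    return min(max_blocks, max_warps // warps_per_block(threads_per_block))


def occupancy(cc, threads_per_block):
    """Resident warps divided by the maximum warps per multiprocessor."""
    max_warps = LIMITS[cc][2] // WARP_SIZE
    resident = resident_blocks(cc, threads_per_block)
    return resident * warps_per_block(threads_per_block) / max_warps


for cc in LIMITS:
    for tpb in (64, 256, 512):
        print(cc, tpb, resident_blocks(cc, tpb), round(occupancy(cc, tpb), 2))
```

For example, under these assumptions a block of 64 threads on a compute capability 2.0 device is capped by the 8-block limit, which leaves only 16 of 48 warps resident, while a block of 256 threads fills all 48.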