general purpose processing using embedded gpus: a study of … · 2016. 6. 1. · zürcher...
Zürcher Fachhochschule 1
General purpose processing using embedded GPUs: A study of latency and its variation
Matthias Rosenthal and Amin Mazloumian, May 2016
Agenda
• General Purpose GPU Computing
• Embedded CPU/GPU versus CPU/FPGA
• CPU–GPU Data Transfer
  – Unified Virtual Addressing (DMA)
  – Memory mapped (Zero Copy)
• Latency Results
• Kernel-Loop Solution avoiding GPU Kernel launch
GPU Computing
Originally used for 3D game rendering, GPUs are heavily used in:
• High Performance Computing
• Financial modeling
• Robotics
• Gas and oil exploration
• Cutting-edge scientific research
→ What about embedded systems?
CPU vs. GPU
[http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.html]
SP: Single Precision, DP: Double Precision
CPU vs. GPU
• CPUs: Huge cache, optimized for several threads: Sequential instructions
• GPUs: 100+ simple cores for huge parallelization: Intensive parallelization
Discrete vs. Integrated GPU
[Figure: two block diagrams, "Discrete GPU" and "Integrated GPU", each showing a CPU, a GPU, and Cache]
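CPU/GPU Computing vs. CPU/FPGA
[Table comparing CPU/GPU with CPU/FPGA (CPU/GPU/DSP/FPGA)]
Criteria: Flexibility & Maintenance, Power Consumption, Development Cost, Latency, Latency variation
Cell values as extracted: High, High, Nanoseconds, Microseconds, High, Low, Mid, Low
Latency variation: ? (CPU/GPU), No variation (CPU/FPGA)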
Example: Nvidia TK1
- GPU: 192 CUDA cores
- CPU: ARM A-15 quad-core
- Video decode: Full-HD 60 Hz
- Video encode: Full-HD 30 Hz
- Networking: 1 Gbit Ethernet
GPU Programming: CUDA
[https://code.msdn.microsoft.com/vstudio/NVIDIA-GPU-Architecture-45c11e6d]
• Linux compilation model
• Additional libraries
• Standard CUDA program
Nvidia TK1
[GPU Performance Analysis, Nvidia (2012)]
[Figure labels: 64 KByte configurable L1/SMEM/RO; 128 KByte L2; "192 Cores" shown three times, one of them marked TK1]
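Data Transfer on TK1
[Diagram: input video/audio/data enters the TK1 and output video/audio/data leaves it. Inside the TK1 are the CPU with its CPU cache, the GPU with its GPU cache, and a DRAM holding the Input and Output buffers. The path between CPU and GPU is marked "?".]
2 options for data transfer to the GPU in CUDA:
• Unified Virtual Addressing (GPU DMA Transfer)
• Memory mapped (Zero Copy)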
CUDA Data Transfer
Method 1: Unified Virtual Addressing (with CPU-GPU DMA)
• Allocation in GPU memory
• Local access for first GPU
• No direct CPU access
• DMA transfer CPU <-> GPU (cudaMemcpy)
[Diagram: CPU, GPU, GPU]
CUDA Data Transfer
GPU processing, Unified Virtual Addressing (DMA):
Step 1: Copy data to GPU memory
Step 2: Process data in GPU using 1000s of threads
Step 3: Copy results back to host memory
CUDA Data Transfer
// Step 0: allocate memory
cudaMalloc( &dev_a, size );
cudaMalloc( &dev_b, size );
cudaMalloc( &dev_c, size );
// Step 1: copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice ); // GPU-DMA
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice ); // GPU-DMA
// Step 2: launch add() kernel on GPU
add <<< N, M >>>( dev_a, dev_b, dev_c );
// Step 3: copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
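The copy-based workflow above can be expanded into a complete program. The following is a minimal sketch, not the authors' benchmark code: the body of `add`, the `CHECK` macro, the helper `run_dma_add`, the input values, and the launch configuration are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

// Assumed kernel body: element-wise c[i] = a[i] + b[i].
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Runs the copy-based workflow; returns the number of wrong results.
int run_dma_add(int n) {
    size_t size = n * sizeof(float);
    float* a = (float*)std::malloc(size);
    float* b = (float*)std::malloc(size);
    float* c = (float*)std::malloc(size);
    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    // Step 0: allocate device memory
    float *dev_a, *dev_b, *dev_c;
    CHECK(cudaMalloc(&dev_a, size));
    CHECK(cudaMalloc(&dev_b, size));
    CHECK(cudaMalloc(&dev_c, size));

    // Step 1: copy inputs to the device
    CHECK(cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice));

    // Step 2: launch the kernel with enough blocks to cover n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(dev_a, dev_b, dev_c, n);
    CHECK(cudaGetLastError());

    // Step 3: copy the result back (this call waits for the kernel)
    CHECK(cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost));

    int errors = 0;
    for (int i = 0; i < n; i++) {
        if (c[i] != 3.0f * i) errors++;
    }

    CHECK(cudaFree(dev_a)); CHECK(cudaFree(dev_b)); CHECK(cudaFree(dev_c));
    std::free(a); std::free(b); std::free(c);
    return errors;
}

int main() {
    int errors = run_dma_add(1 << 16);
    std::printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Checking `cudaGetLastError()` right after the launch catches configuration errors that would otherwise surface only at the next synchronizing call.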
CUDA Data Transfer
Method 2: Memory mapped (Zero Copy)
• Allocation in CPU memory
• Local access for CPU
• Memory mapped for GPUs
[Diagram: CPU, GPU, GPU]
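CUDA Data Transfer
GPU processing, Memory Mapped (Zero Copy):
Step 1: Copy data to GPU memory (not needed: the GPU reads the host buffer directly)
Step 2: Process data in GPU using 1000s of threads
Step 3: Copy results back to host memory (not needed: the GPU writes the host buffer directly)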
Typical GPU workflow: Memory-mapped
// Step 0: allocate memory (cudaMallocHost replaces cudaMalloc)
cudaMallocHost( &dev_a, size );
cudaMallocHost( &dev_b, size );
cudaMallocHost( &dev_c, size );
// Step 1: copy inputs to device: not needed, the CPU writes the inputs directly into dev_a and dev_b
// Step 2: launch add() kernel on GPU
// (without unified addressing, obtain the device pointers with cudaHostGetDevicePointer() first)
add <<< N, M >>>( dev_a, dev_b, dev_c );
cudaDeviceSynchronize(); // wait for the kernel before the CPU reads dev_c
// Step 3: copy device result back to host: not needed, the CPU reads the result directly from dev_c
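For comparison, here is a minimal zero-copy sketch. It assumes explicitly mapped pinned memory (`cudaHostAlloc` with `cudaHostAllocMapped` plus `cudaHostGetDevicePointer`), which also works on platforms without unified addressing. The slides use `cudaMallocHost`, so this is a variation, not the authors' code. The helper `run_zero_copy_add`, the `CHECK` macro, and the kernel body are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

// Assumed kernel body: element-wise c[i] = a[i] + b[i].
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Runs the zero-copy workflow; returns the number of wrong results.
int run_zero_copy_add(int n) {
    size_t size = n * sizeof(float);

    // Allow mapping of pinned host memory into the GPU address space.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Step 0: allocate pinned, mapped host memory (no device buffers)
    float *a, *b, *c;
    CHECK(cudaHostAlloc((void**)&a, size, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&b, size, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&c, size, cudaHostAllocMapped));

    // The CPU fills the inputs in place; no cudaMemcpy is issued.
    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    // Device-side addresses of the same physical buffers
    float *dev_a, *dev_b, *dev_c;
    CHECK(cudaHostGetDevicePointer((void**)&dev_a, a, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dev_b, b, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dev_c, c, 0));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(dev_a, dev_b, dev_c, n);
    CHECK(cudaGetLastError());

    // The kernel runs asynchronously: wait before the CPU reads c.
    CHECK(cudaDeviceSynchronize());

    int errors = 0;
    for (int i = 0; i < n; i++) {
        if (c[i] != 3.0f * i) errors++;
    }

    CHECK(cudaFreeHost(a)); CHECK(cudaFreeHost(b)); CHECK(cudaFreeHost(c));
    return errors;
}

int main() {
    int errors = run_zero_copy_add(1 << 16);
    std::printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

The explicit synchronization takes over the role that the blocking `cudaMemcpy` plays in the copy-based version: without it, the CPU could read `c` before the kernel has finished.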
DMA vs. Memory-mapped
[Figure: plot comparing DMA (cudaMemcpy) with Memory-mapped (Zero Copy); annotation: "Factor 2"]
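GPU Latency Variation: output = input
// Launched as identity<<<1,1>>>, so this must be a __global__ kernel, not a __device__ function
__global__ void identity(float* input, float* output, int numElem) {
  for (int index = 0; index < numElem; index++) {
    output[index] = input[index];
  }
}
[Figure: distribution of launch latency for input size = 25, with marks at 90% and 0.01% of launches]
Tested on Linux kernel with PREEMPT_RT / Full Preempt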
GPU Latency Variation
- There is a huge variation in processing time.
- For 100 bytes of data (25 float values) per thread:
  - 90% of the launches take less than 40 microseconds.
  - 0.01% of the launches take around 500 microseconds.
- Slow launches drop the update rate from 25 kHz (1 / 40 µs) to 2 kHz (1 / 500 µs).
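A distribution like this can be collected with a simple host-side timing loop. The slides do not state how the authors measured, so the sketch below is an assumption: it times each `identity<<<1,1>>>` launch from just before the launch until `cudaDeviceSynchronize()` returns. The helper `measure_latency_us` and the percentile indexing are hypothetical.

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same kernel as on the slide: copy the input to the output in one thread.
__global__ void identity(const float* input, float* output, int numElem) {
    for (int index = 0; index < numElem; index++) {
        output[index] = input[index];
    }
}

// Times `launches` kernel launches (launch + completion) and returns the
// samples in microseconds, sorted in ascending order.
std::vector<double> measure_latency_us(int numElem, int launches) {
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, numElem * sizeof(float));
    cudaMalloc(&out, numElem * sizeof(float));
    cudaMemset(in, 0, numElem * sizeof(float));

    // One untimed launch so that one-time setup cost is excluded.
    identity<<<1, 1>>>(in, out, numElem);
    cudaDeviceSynchronize();

    std::vector<double> samples;
    samples.reserve(launches);
    for (int i = 0; i < launches; i++) {
        auto t0 = std::chrono::steady_clock::now();
        identity<<<1, 1>>>(in, out, numElem);
        cudaDeviceSynchronize();  // wait until the kernel has finished
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }

    cudaFree(in);
    cudaFree(out);
    std::sort(samples.begin(), samples.end());
    return samples;
}

int main() {
    std::vector<double> s = measure_latency_us(25, 10000);
    size_t p90 = (size_t)(0.90 * (s.size() - 1));
    std::printf("median %.1f us, 90%% %.1f us, max %.1f us\n",
                s[s.size() / 2], s[p90], s.back());
    return 0;
}
```

The maximum matters more than the mean here: a single slow launch determines the update rate that a real-time system can guarantee.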
GPU Latency Variation
[Figure: launch latency of identity<<<1,1>>> on the Jetson TK1 with RT kernel for input sizes 25, 250, 2500, and 25000; the 90% mark is indicated]
Our Solution for Latency Variation
GPU, kernel loop:
while (true) {
  poll_CPU_flag();
  output_data = fct(input_data);
}
CPU:
...
wait_for_input_in_DRAM();
flag_to_GPU();
...
[Diagram: TK1 with CPU and CPU cache, GPU and GPU cache, and a DRAM holding the Input and Output buffers]
• Implement kernel loops in GPU cores
• Memory mapped (zero copy) data access
• Each GPU kernel loop produces output from its input data (memory-mapped)
• The number of GPU cores limits the number of kernel loops
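The kernel-loop idea can be sketched as a kernel that is launched once and then polls a flag in mapped host memory. The slide gives only pseudocode, so the details here are assumptions: the `Mailbox` layout, the request/response counters, the use of doubling as `fct`, and the helper `run_kernel_loop` are hypothetical. The sketch also assumes a platform such as Linux, where the launched kernel starts running without further host calls.

```cuda
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

const int kElems = 25;  // 25 floats = 100 bytes, as in the measurements

// Hypothetical shared buffer, placed in mapped (zero-copy) host memory.
struct Mailbox {
    int request;    // CPU: number of the newest input, or -1 to stop
    int response;   // GPU: number of the last input it has processed
    float input[kElems];
    float output[kElems];
};

// Launched once; processes inputs until the CPU requests shutdown.
__global__ void kernel_loop(volatile Mailbox* mb) {
    int handled = 0;
    while (true) {
        int req = mb->request;          // poll the CPU flag
        if (req < 0) break;             // shutdown requested
        if (req == handled) continue;   // no new input yet
        for (int i = 0; i < kElems; i++) {
            mb->output[i] = 2.0f * mb->input[i];  // assumed fct: doubling
        }
        __threadfence_system();         // publish output before the flag
        handled = req;
        mb->response = handled;
    }
}

// Sends `rounds` inputs through the kernel loop; returns wrong results.
int run_kernel_loop(int rounds) {
    cudaSetDeviceFlags(cudaDeviceMapHost);
    Mailbox* raw = nullptr;
    if (cudaHostAlloc((void**)&raw, sizeof(Mailbox),
                      cudaHostAllocMapped) != cudaSuccess) return -1;
    volatile Mailbox* mb = raw;
    mb->request = 0;
    mb->response = 0;

    Mailbox* dev_mb = nullptr;
    cudaHostGetDevicePointer((void**)&dev_mb, raw, 0);
    kernel_loop<<<1, 1>>>(dev_mb);      // the only kernel launch
    cudaStreamQuery(0);                 // non-blocking; submits the launch

    int errors = 0;
    for (int round = 1; round <= rounds; round++) {
        for (int i = 0; i < kElems; i++) mb->input[i] = (float)(round + i);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        mb->request = round;                  // flag_to_GPU()
        while (mb->response != round) { }     // wait for the result
        std::atomic_thread_fence(std::memory_order_seq_cst);
        for (int i = 0; i < kElems; i++) {
            if (mb->output[i] != 2.0f * (round + i)) errors++;
        }
    }

    mb->request = -1;                   // ask the kernel loop to exit
    cudaDeviceSynchronize();
    cudaFreeHost(raw);
    return errors;
}

int main() {
    int errors = run_kernel_loop(1000);
    std::printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Because the kernel never exits between inputs, the per-input cost is a flag write and a poll rather than a kernel launch, which is the step where the latency variation was observed. The trade-off is that the polling thread occupies GPU resources permanently.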
SoCs with GPU as Industrial Modules
Modules shown: Nvidia TK1 Module, Snapdragon 820 Module, Allwinner A80 Module, Nvidia TX1 Module, Nvidia TK1 Module
Sources: Nvidia, Avionic Design, Toradex, Intrinsyc, Theobroma Systems
SoCs with GPU as Industrial Modules
Mobile Processor
Android TV Video Conferencing
Lecture recording streaming Medical Imaging
Driving Assistance Source: Google / PMK
Conclusion
- Our results confirm that for small data chunks, memory-mapped transfer is more efficient
- We observe a huge but rare variation in GPU processing time
- The variation dramatically reduces the update rate, by an order of magnitude
- Our solution is to implement GPU kernel loops and memory-mapped transfer