general purpose processing using embedded gpus: a study of … · 2016. 6. 1. · zürcher...

Zürcher Fachhochschule 1

General purpose processing using embedded GPUs: A study of latency and its variation

Matthias Rosenthal and Amin Mazloumian, May 2016


Agenda

•  General Purpose GPU Computing
•  Embedded CPU/GPU versus CPU/FPGA
•  CPU–GPU Data Transfer
   –  Unified Virtual Addressing (DMA)
   –  Memory mapped (Zero Copy)
•  Latency Results
•  Kernel-Loop Solution avoiding GPU Kernel launch


GPU Computing

Originally used for 3D game rendering, GPUs are now heavily used in

•  High Performance Computing
•  Financial modeling
•  Robotics
•  Gas and oil exploration
•  Cutting-edge scientific research

→ What about embedded systems?


CPU vs. GPU

[Figure: CPU vs. GPU performance. SP = Single Precision, DP = Double Precision. Source: http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.html]


CPU vs. GPU

•  CPUs: Huge cache, optimized for several threads: Sequential instructions
•  GPUs: 100+ simple cores for huge parallelization: Intensive parallelization


Discrete vs Integrated GPU

[Figure: block diagrams of a discrete GPU and an integrated GPU, each showing CPU, GPU and cache.]


CPU/GPU Computing vs. CPU/FPGA

Criterion            CPU/GPU         CPU/FPGA
Latency              Microseconds    Nanoseconds
Latency variation    ?               No variation

Flexibility & Maintenance, Power Consumption and Development Cost are each rated High, Mid or Low for the two platforms.

(CPU/GPU/DSP/FPGA)


Example: Nvidia TK1

-  GPU: 192 CUDA cores
-  CPU: ARM A15 quad-core
-  Video decode: Full HD, 60 Hz
-  Video encode: Full HD, 30 Hz
-  Networking: 1 Gbit Ethernet


GPU Programming: CUDA

[Figure: Linux compilation model for a standard CUDA program with additional libraries. Source: https://code.msdn.microsoft.com/vstudio/NVIDIA-GPU-Architecture-45c11e6d]


Nvidia TK1

[Figure: GPU memory hierarchy with blocks of 192 cores, 64 KByte configurable L1/SMEM/RO memory and a 128 KByte L2; the TK1 is marked on the diagram. Source: GPU Performance Analysis, Nvidia (2012)]


Data Transfer on TK1

[Figure: TK1 with CPU and GPU, each with its own cache, sharing one DRAM that holds the input and output buffers. Input: video/audio/data. Output: video/audio/data.]

2 options for data transfer to GPU in CUDA:
•  Unified Virtual Addressing (GPU DMA transfer)
•  Memory mapped (Zero Copy)


CUDA Data Transfer

Method 1: Unified Virtual Addressing (with CPU-GPU DMA)

•  Allocation in GPU memory
•  Local access for first GPU
•  No direct CPU access
•  DMA transfer CPU <-> GPU (cudaMemcpy)

[Figure: one CPU and two GPUs.]


CUDA Data Transfer

GPU processing, Unified Virtual Addressing (DMA):
Step 1: Copy data to GPU memory
Step 2: Process data in GPU using 1000s of threads
Step 3: Copy results back to host memory


CUDA Data Transfer

// Step 0: allocate memory
cudaMalloc( &dev_a, size );
cudaMalloc( &dev_b, size );
cudaMalloc( &dev_c, size );
// Step 1: copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice ); // GPU-DMA
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice ); // GPU-DMA
// Step 2: launch add() kernel on GPU
add <<< N, M >>>( dev_a, dev_b, dev_c );
// Step 3: copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );


CUDA Data Transfer

Method 2: Memory mapped (Zero Copy)

•  Allocation in CPU memory
•  Local access for CPU
•  Memory mapped for GPUs

[Figure: one CPU and two GPUs.]


CUDA Data Transfer

GPU processing, Memory Mapped (Zero Copy):
Step 1: Copy data to GPU memory (not needed: the GPU reads the host buffers directly)
Step 2: Process data in GPU using 1000s of threads
Step 3: Copy results back to host memory (not needed: the GPU writes the host buffer directly)

Typical GPU workflow: Memory-mapped

// Step 0: allocate memory with cudaMallocHost instead of cudaMalloc
// (assumes unified virtual addressing, so the host pointer is also valid on the device)
cudaMallocHost( &dev_a, size );
cudaMallocHost( &dev_b, size );
cudaMallocHost( &dev_c, size );
// Step 1: no copy to the device; the CPU writes the inputs directly into dev_a and dev_b
// Step 2: launch add() kernel on GPU
add <<< N, M >>>( dev_a, dev_b, dev_c );
// Step 3: no copy back; after synchronizing, the CPU reads the result from dev_c
cudaDeviceSynchronize();
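The memory-mapped workflow can be sketched as a complete function. This is a minimal illustration, not the authors' code: it assumes the explicit mapped-allocation path (`cudaHostAlloc` with `cudaHostAllocMapped` plus `cudaHostGetDevicePointer`), which also works where host and device pointers differ. The function name `zero_copy_add`, the test data and the launch configuration are hypothetical.

```cuda
#include <cuda_runtime.h>

// Element-wise add; each thread handles one element.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: computes c = a + b through mapped (zero-copy) host
// memory and returns true if every element of the result is correct.
bool zero_copy_add(int n) {
    // Older devices need this flag before the CUDA context is created.
    // A failure on an already active context is ignored and cleared here.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    (void)cudaGetLastError();

    size_t size = n * sizeof(float);
    float *a = 0, *b = 0, *c = 0;        // host pointers
    float *d_a = 0, *d_b = 0, *d_c = 0;  // device aliases of the same memory

    // Step 0: allocate pinned host memory that is mapped into the GPU.
    if (cudaHostAlloc((void **)&a, size, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void **)&b, size, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void **)&c, size, cudaHostAllocMapped) != cudaSuccess)
        return false;
    cudaHostGetDevicePointer((void **)&d_a, a, 0);
    cudaHostGetDevicePointer((void **)&d_b, b, 0);
    cudaHostGetDevicePointer((void **)&d_c, c, 0);

    // Step 1: no cudaMemcpy; the CPU fills the buffers in place.
    for (int i = 0; i < n; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    // Step 2: launch the kernel on the device aliases.
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Step 3: no copy back; the result is readable after synchronizing.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; i++)
        if (c[i] != 3.0f * (float)i) ok = false;

    cudaFreeHost(a);
    cudaFreeHost(b);
    cudaFreeHost(c);
    return ok;
}
```

Calling `zero_copy_add(1024)` from a host `main` exercises the whole path. Compared with the DMA version, the only CUDA calls left around the kernel launch are the allocation and the synchronization.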


DMA vs. Memory-mapped

[Figure: DMA (cudaMemcpy) compared with memory-mapped (Zero Copy) transfer; the difference is annotated as "Factor 2".]


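GPU Latency Variation: output = input

// Kernel launched from the host as identity<<<1,1>>>
__global__ void identity(float *input, float *output, int numElem) {
    for (int index = 0; index < numElem; index++) {
        output[index] = input[index];
    }
}

[Figure: launch latency for input size = 25, with markers at 90% and 0.01% of the launches.]

Tested on Linux kernel with PREEMPT_RT / Full Preempt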


GPU Latency Variation

-  There is a huge variation in processing time.
-  For 100 bytes of data (25 float values) per thread:
   -  90% of the launches take less than 40 microseconds.
   -  0.01% of the launches take around 500 microseconds.
-  Slow launches drop the update rate from 25 kHz to 2 kHz.
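Launch-latency figures like these can be collected with a small timing harness. The sketch below is an assumption about how such a measurement could be set up, not the authors' benchmark: it times repeated `identity<<<1,1>>>` launches with a host clock and reads off percentiles. The helper names `measure_launch_latency_us`, `percentile` and `update_rate_khz` are hypothetical, and the code needs C++11 (`nvcc -std=c++11` or later).

```cuda
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

__global__ void identity(const float *input, float *output, int numElem) {
    for (int index = 0; index < numElem; index++)
        output[index] = input[index];
}

// Times `runs` launches of identity<<<1,1>>> (launch plus completion) and
// returns the latencies in microseconds, sorted in ascending order.
std::vector<double> measure_launch_latency_us(int numElem, int runs) {
    float *in = 0, *out = 0;
    cudaMalloc((void **)&in, numElem * sizeof(float));
    cudaMalloc((void **)&out, numElem * sizeof(float));
    cudaMemset(in, 0, numElem * sizeof(float));

    std::vector<double> us;
    us.reserve(runs);
    for (int r = 0; r < runs; r++) {
        auto t0 = std::chrono::steady_clock::now();
        identity<<<1, 1>>>(in, out, numElem);
        cudaDeviceSynchronize();  // wait until the kernel has finished
        auto t1 = std::chrono::steady_clock::now();
        us.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    cudaFree(in);
    cudaFree(out);
    std::sort(us.begin(), us.end());
    return us;
}

// Nearest-rank percentile of a non-empty, ascending-sorted vector, p in (0, 100].
double percentile(const std::vector<double> &sorted, double p) {
    std::size_t rank = (std::size_t)std::ceil(p * sorted.size() / 100.0);
    if (rank < 1) rank = 1;
    if (rank > sorted.size()) rank = sorted.size();
    return sorted[rank - 1];
}

// Update rate in kHz if one result is produced every `latency_us` microseconds.
double update_rate_khz(double latency_us) {
    return 1000.0 / latency_us;
}
```

With this helper, `update_rate_khz(40.0)` gives 25 and `update_rate_khz(500.0)` gives 2, which is the drop from 25 kHz to 2 kHz quoted above. Observing events that occur in only 0.01% of launches requires well over 10,000 runs.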


GPU Latency Variation

[Figure: latency of identity<<<1,1>>> on the Jetson TK1 with RT kernel for input sizes 25, 250, 2500 and 25000, with the 90% level marked.]


Our Solution for Latency Variation

GPU (kernel loop):
while (true) {
    poll_CPU_flag();
    output_data = fct(input_data);
}

CPU:
...
wait_for_input_in_DRAM();
flag_to_GPU();
...

[Figure: TK1 with CPU and GPU, each with its own cache, sharing the input and output buffers in DRAM.]

•  Implement kernel loops in GPU cores
•  Memory mapped (zero copy) data access
•  Each GPU kernel loop produces output from its input data (memory-mapped)
•  The number of GPU cores limits the number of kernel loops
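The kernel-loop idea can be sketched as a persistent kernel that is launched once and then polls a flag in mapped host memory. This is an illustrative sketch under several assumptions, not the authors' implementation: the flag protocol (0 = idle, 1 = input ready, 2 = output ready, -1 = terminate), the placeholder `fct` (doubling the input) and the name `run_kernel_loop` are all hypothetical. It needs C++11, and it assumes a platform where a running kernel observes CPU writes to mapped memory. Systems with a display watchdog or batched kernel submission may stall or abort such a kernel.

```cuda
#include <atomic>
#include <cuda_runtime.h>

// Persistent kernel: a single thread polls the flag instead of being relaunched.
__global__ void kernel_loop(volatile int *flag, const volatile float *in,
                            volatile float *out, int n) {
    while (true) {
        int f = *flag;                 // poll_CPU_flag()
        if (f == -1) return;           // terminate request from the CPU
        if (f != 1) continue;          // no new input yet
        for (int i = 0; i < n; i++)    // fct(): placeholder computation
            out[i] = 2.0f * in[i];
        __threadfence_system();        // publish the outputs before the flag
        *flag = 2;                     // signal "output ready" to the CPU
    }
}

// Hypothetical driver: feeds `rounds` inputs of n floats through the loop
// and returns true if every output is correct.
bool run_kernel_loop(int n, int rounds) {
    cudaSetDeviceFlags(cudaDeviceMapHost);
    (void)cudaGetLastError();          // ignore "context already active"

    int *flag = nullptr, *d_flag = nullptr;
    float *in = nullptr, *out = nullptr, *d_in = nullptr, *d_out = nullptr;
    if (cudaHostAlloc((void **)&flag, sizeof(int), cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void **)&in, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void **)&out, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess)
        return false;
    cudaHostGetDevicePointer((void **)&d_flag, flag, 0);
    cudaHostGetDevicePointer((void **)&d_in, in, 0);
    cudaHostGetDevicePointer((void **)&d_out, out, 0);

    *flag = 0;
    kernel_loop<<<1, 1>>>(d_flag, d_in, d_out, n);   // the only kernel launch
    if (cudaGetLastError() != cudaSuccess) return false;

    volatile int *vflag = flag;
    bool ok = true;
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < n; i++)                  // wait_for_input_in_DRAM()
            in[i] = (float)(r + i);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        *vflag = 1;                                  // flag_to_GPU()
        while (*vflag != 2) { }                      // busy-wait for the result
        std::atomic_thread_fence(std::memory_order_seq_cst);
        for (int i = 0; i < n; i++)
            if (out[i] != 2.0f * (float)(r + i)) ok = false;
    }

    *vflag = -1;                                     // stop the kernel loop
    cudaDeviceSynchronize();
    cudaFreeHost(flag);
    cudaFreeHost(in);
    cudaFreeHost(out);
    return ok;
}
```

The design trades one GPU thread and one busy-waiting CPU core for the removal of the per-update kernel launch, which is where the rare slow cases were observed. A production version would add a timeout to the CPU-side wait so that a failed kernel cannot hang the host.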

SoCs with GPU as Industrial Modules

•  Nvidia TK1 Module
•  Nvidia TX1 Module
•  Snapdragon 820 Module
•  Allwinner A80 Module

Sources: Nvidia, Avionic Design, Toradex, Intrinsyc, Theobroma Systems

SoCs with GPU as Industrial Modules

Mobile Processor

•  Android TV
•  Video Conferencing
•  Lecture recording streaming
•  Medical Imaging
•  Driving Assistance

Source: Google / PMK


Conclusion

-  Our results confirm that for small data chunks memory-mapped transfers are more efficient than DMA transfers
-  We observe a huge but rare variation in GPU processing time
-  The variation dramatically reduces the update rate, by an order of magnitude
-  Our solution is to implement GPU kernel loops and memory-mapped transfer
