Hyperspectral video using CUDA
Robert Dunn, PhD Candidate
on-demand.gputechconf.com/gtc/2013/presentations/s3030...title...
Outline
Hyperspectral Imaging
Our Research
Problems
Results
Future Work
Imaging Spectroscopy
[Figure: spectrum plot, horizontal axis 0 to 250, vertical axis 0 to 1]
Linear Mixing Model
$x = \sum_{i=1}^{M} a_i s_i + w = Sa + w$
x: Pixel Spectrum
s_i (columns of S): Endmembers
a_i: Abundances
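The linear mixing model can be illustrated with a small forward simulation: a pixel spectrum is built as an abundance-weighted sum of endmember spectra plus a noise term. This is a minimal sketch, not the presenter's code. The function name `mix_pixel`, the Gaussian form of the noise w, and the two example endmembers are all assumptions made for illustration.

```python
import random

def mix_pixel(endmembers, abundances, noise_sigma=0.0, rng=None):
    """Return x = sum_i a_i * s_i + w for one pixel.

    endmembers:  list of M spectra, each a list of band values (the s_i).
    abundances:  list of M weights (the a_i).
    noise_sigma: standard deviation of the noise w (assumed Gaussian here).
    """
    rng = rng or random.Random(0)
    n_bands = len(endmembers[0])
    pixel = []
    for band in range(n_bands):
        value = sum(a * s[band] for a, s in zip(abundances, endmembers))
        if noise_sigma > 0.0:
            value += rng.gauss(0.0, noise_sigma)
        pixel.append(value)
    return pixel

# Two hypothetical endmembers over three bands (illustrative numbers only).
endmembers = [[0.25, 0.5, 0.75], [0.75, 0.5, 0.25]]
noiseless = mix_pixel(endmembers, [0.5, 0.5])
```

With equal abundances and no noise, every band of `noiseless` is the average of the two endmember values in that band.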
Dimension Reduction
Principal Components Analysis
1. Covariance Matrix
2. Eigendecomposition
3. Transformation
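The three steps listed above can be sketched in plain Python on a tiny data set. This is an illustrative CPU reference under stated assumptions, not the CUDA implementation from the talk: the eigendecomposition uses cyclic Jacobi rotations (chosen here because it is short and works on symmetric matrices), and the names `covariance_matrix`, `jacobi_eigh` and `pca` are hypothetical.

```python
import math

def covariance_matrix(pixels):
    """Step 1: sample covariance between bands (one pixel per row)."""
    n, bands = len(pixels), len(pixels[0])
    mean = [sum(p[b] for p in pixels) / n for b in range(bands)]
    centred = [[p[b] - mean[b] for b in range(bands)] for p in pixels]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1)
            for j in range(bands)] for i in range(bands)]
    return cov, centred

def jacobi_eigh(matrix, max_sweeps=50, tol=1e-24):
    """Step 2: eigenpairs of a symmetric matrix by cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q of A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors as columns of V
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    eigenvalues = [a[i][i] for i in order]
    eigenvectors = [[v[k][i] for k in range(n)] for i in order]  # one per row
    return eigenvalues, eigenvectors

def pca(pixels, n_components):
    cov, centred = covariance_matrix(pixels)
    eigenvalues, eigenvectors = jacobi_eigh(cov)
    basis = eigenvectors[:n_components]
    # Step 3: project each centred pixel onto the leading eigenvectors.
    scores = [[sum(x * e for x, e in zip(c, vec)) for vec in basis]
              for c in centred]
    return eigenvalues, scores

# Four two-band pixels lying on a line, so one component carries all variance.
pixels = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
eigenvalues, scores = pca(pixels, n_components=1)
```

For this perfectly correlated example the second eigenvalue is zero up to rounding, which is the situation in which dropping components loses nothing.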
[Figure: two scatter plots (axis ranges -3 to 3 and 0 to 10); axis labels "1 Intensity" and "2 Intensity"]
N-FINDR
[Figure legend: Data, Simplex 1, Simplex 2, Simplex 3]
Endmember Determination
Determinant
LU Decomposition
Semi-Parallelisable Search
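N-FINDR searches for the pixels that span the simplex of largest volume, and that volume is proportional to the determinant of the candidate endmember matrix augmented with a row of ones. The sketch below computes the determinant through LU-style elimination with partial pivoting and runs a simplified replacement search. It assumes the data have already been reduced to M-1 dimensions for M endmembers, and it starts from the first M pixels, which is an arbitrary choice made here. The function names are hypothetical.

```python
import math

def lu_determinant(matrix):
    """Determinant via LU-style elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det

def simplex_volume(vertices):
    """Volume of the simplex spanned by M points in M-1 dimensions."""
    m = len(vertices)
    augmented = [[1.0] * m] + [[v[d] for v in vertices] for d in range(m - 1)]
    return abs(lu_determinant(augmented)) / math.factorial(m - 1)

def nfindr(pixels, n_endmembers, max_passes=10):
    """Simplified search: swap in any pixel that enlarges the simplex."""
    chosen = list(range(n_endmembers))  # assumption: start from the first pixels
    best = simplex_volume([pixels[i] for i in chosen])
    for _ in range(max_passes):
        improved = False
        for slot in range(n_endmembers):
            for candidate in range(len(pixels)):
                trial = chosen[:]
                trial[slot] = candidate
                volume = simplex_volume([pixels[i] for i in trial])
                if volume > best:
                    best, chosen, improved = volume, trial, True
        if not improved:
            break
    return sorted(chosen), best

# Three interior points followed by the three corners of a unit right triangle.
pixels = [[0.2, 0.2], [0.3, 0.3], [0.5, 0.2],
          [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
chosen, volume = nfindr(pixels, 3)
```

The volume evaluations for different candidate pixels within one slot are independent of each other, while the slot-by-slot updates are sequential, which is one way to read "semi-parallelisable".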
Abundance Mapping
$\hat{a} = \arg\min_{a} \lVert x - Sa \rVert^2$
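The minimisation above has a closed-form solution through the normal equations, $(S^T S)\,a = S^T x$, when no constraints are placed on the abundances. The sketch below solves that small system by Gaussian elimination. It is an unconstrained estimate only; whether the talk applied non-negativity or sum-to-one constraints is not stated, and the function names and example values are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * a = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("endmembers are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):  # back substitution
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def estimate_abundances(endmembers, pixel):
    """Unconstrained least squares: minimise ||x - S a||^2."""
    # Gram matrix S^T S, with each endmember spectrum as a column of S.
    gram = [[sum(si * sj for si, sj in zip(s_i, s_j)) for s_j in endmembers]
            for s_i in endmembers]
    # Right-hand side S^T x.
    projection = [sum(s * x for s, x in zip(s_i, pixel)) for s_i in endmembers]
    return solve_linear(gram, projection)

# Hypothetical endmembers and a pixel mixed as 0.25 * s_1 + 0.75 * s_2.
endmembers = [[0.25, 0.5, 0.75], [0.75, 0.5, 0.25]]
pixel = [0.625, 0.5, 0.375]
abundances = estimate_abundances(endmembers, pixel)
```

Because the example pixel is an exact mixture, the recovered abundances match the mixing weights up to rounding. The Gram matrix depends only on the endmembers, so it can be formed once and reused for every pixel in a frame.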
Current Research
Real-time terrestrial HSI
◦ Forensics
◦ Manufacturing
◦ Military
CUDA Simulations
◦ Visualisation Problems
◦ Algorithm Optimisation
Problem - syrk
Syrk performance with ‘fat matrix’, p=200,000
$Y = XX^T$, $y_{ij} = \sum_{n=1}^{p} x_{in} x_{jn}$
[Figure: "cublasSsyrk Error" (vertical axis 2.50E-05 to 3.70E-05) and "cublasSsyrk Utilisation" (vertical axis up to 100.00%, logarithmic), each plotted against 2, 4, 8, ..., 256]
Problem - syrk
Sub-Block algorithm, p=200,000
$y_{ij} = \sum_{n=1}^{k} x_{in} x_{jn} + \sum_{n=k+1}^{2k} x_{in} x_{jn} + \cdots$
[Figure: "Sub-Block Algorithm Error" (vertical axis 0.00E+00 to 1.00E-07) and "Sub-Block vs cublasSsyrk" (vertical axis "Speed Improvement", 0.1 to 100, logarithmic), each plotted against 2, 4, 8, ..., 256]
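The arithmetic behind the sub-block idea can be sketched on the CPU: the long sum over the p columns of a fat matrix is split into partial sums over blocks of k columns, and the partial products are accumulated into Y. This is only a reference for the arithmetic, assuming the reading of the formula given above. It says nothing about how the blocks were mapped to CUDA threads in the talk, and the function names are hypothetical.

```python
def syrk_direct(x):
    """Reference Y = X X^T with one long sum per output element."""
    rows, p = len(x), len(x[0])
    return [[sum(x[i][n] * x[j][n] for n in range(p)) for j in range(rows)]
            for i in range(rows)]

def syrk_sub_block(x, k):
    """Accumulate Y = X X^T from partial products over column blocks of width k."""
    rows, p = len(x), len(x[0])
    y = [[0.0] * rows for _ in range(rows)]
    for start in range(0, p, k):
        stop = min(start + k, p)  # the last block may be shorter than k
        for i in range(rows):
            for j in range(i, rows):  # upper triangle only; Y is symmetric
                y[i][j] += sum(x[i][n] * x[j][n] for n in range(start, stop))
    for i in range(rows):
        for j in range(i):
            y[i][j] = y[j][i]  # mirror into the lower triangle
    return y

# A small 'fat' matrix: 2 rows, 5 columns, processed in blocks of 2 columns.
x = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [0.5, 0.0, -1.0, 2.0, 1.0]]
blocked = syrk_sub_block(x, 2)
```

Each column block contributes an independent partial product, so blocks can be computed concurrently and summed afterwards. Shorter sums also change how rounding error accumulates in single precision, which may relate to the error plots on the slide.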
Problem – Small matrix
GPU inefficient for small matrices.
Host-Device transfers inefficient.
◦ Synchronisation
Batch processing not suitable.
Problem – MAGMA/CULA
Advanced Linear Algebra
◦ EVD, LU, QR etc
Hybrid CPU/GPU
◦ Designed for large matrix
Streaming and Synchronisation
‘Assembling Blocks’ vs ‘Complete Kernel’
Problem - Streaming
[Diagram: two panels, each labelled CPU and GPU]
Problem - Overheads
[Diagram labels: CPU, Single Frame]
Problem – Testing
CUDA + C | MATLAB
Fast Performance | Poor Performance
Unproven Code | Proven Code
Developer Intensive | Simple Development
[Diagram: CUDA HSI, MEX, MATLAB, Monitoring Thread, Named Pipe]
Experimental Setup
Nvidia Tesla C2070 ‘Fermi’
Intel Xeon W3520 @ 2.67 GHz
12 GB System RAM
Simulated Data (Moffett Field)
512x512x190 Frame
Results
Future CUDA Optimisations
Kepler Architecture (GK110)
◦ Dynamic Parallelism
◦ Hyper-Q
“Complete Kernels”
Portable Platforms / Tegra
Streaming Variants (EVD, QR, LS, etc)
Conclusion
Working ‘real-time’ HSI video chain
Further optimisation required
Visualisation challenges
GK110 will help