-
A GPU-Accelerated 3D Kinematic Modeling Platform for Behavioral Neuroscience
John Long, PhD
Buzsáki Laboratory
Neuroscience Institute
New York University Langone Medical Center
03.17.2015
-
A little about me…
György Buzsáki
Jose Carmena
-
…and my previous work.
Venkatraman et al. 2009
Long and Carmena 2013
Long and Carmena 2011
Koralek et al. 2012
-
Why am I at a GPU conference?
• As a neuroscientist, I find small form factor, massively parallel computing machines intriguing.
• For ease of interface and visualization, I often program in Matlab or Python, where I run into agonizing computational bottlenecks; that is what led me to GPUs.
• I’ve had a fair amount of success applying GPU computing to my scientific work.
-
What I have in store for you…
• An introduction to the work I do in behavioral neuroscience that led me to GPU computing.
• A detailed description of one of the CUDA programs I have implemented in the context of my research.
• Throughout, I’ll mention a workflow I’ve found useful for porting CUDA code into Matlab and Python.
-
Who reads the maps in the brain?
Iurilli et al. 2012
Geisler, Sirota, Zugaro, Robbe, Buzsáki, PNAS 2007
Hippocampal “place” cells (O’Keefe and Nadel, 1978; O’Keefe and Recce, 1993)
Sensory receptive fields (Hubel and Wiesel 1959)
-
The State of the Art in Behavioral Neuroscience
More and more neural data!
-
The State of the Art in Behavioral Neuroscience
The behaving rat…
-
Advances in Motion Capture
Corazza et al. 2006
-
Environment Construction
[Figure: environment layout, positions labeled 1 to 6]
-
Multiple Camera Synchronization
Lines to cameras
Line to Amplipex system
-
Image Segmentation
-
Camera Calibration
3D to 2D perspective transformation
Svoboda et al. 2005
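The 3D to 2D perspective transformation named on this slide can be illustrated with a plain pinhole camera model. This is a minimal sketch, not the calibration procedure of Svoboda et al.; the function name `project_point` and every parameter value below are assumptions chosen for illustration, and lens distortion is ignored.

```python
def project_point(point_3d, focal_length, principal_point, rotation, translation):
    """Project a 3D world point to 2D pixel coordinates with a pinhole model."""
    # World -> camera coordinates: X_cam = R * X_world + t
    x_cam = [
        sum(rotation[r][c] * point_3d[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective divide, then scale by focal length and shift to the principal point
    u = focal_length * x_cam[0] / x_cam[2] + principal_point[0]
    v = focal_length * x_cam[1] / x_cam[2] + principal_point[1]
    return (u, v)


# Hypothetical camera: identity pose, 800 px focal length, 640x480 image center
IDENTITY_3X3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([0.1, -0.2, 2.0], 800.0, (320.0, 240.0),
                      IDENTITY_3X3, [0.0, 0.0, 0.0])
print(pixel)
```

Calibration amounts to estimating the rotation, translation, focal length, and principal point for each camera so that this mapping matches the recorded images.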
-
Visual Hull Construction
Visual Hull Algorithm modified from Forbes et al. 2006
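The general idea behind a visual hull is that a point in space belongs to the hull only if it projects inside the animal's silhouette in every camera. The toy sketch below shows that idea as voxel carving. It is not the modified Forbes et al. algorithm used in the talk; the orthographic "cameras", the 3x3x3 grid, and the silhouettes are all hypothetical.

```python
def carve_visual_hull(voxels, cameras):
    """Keep the voxels whose projection lies inside every camera's silhouette."""
    hull = []
    for voxel in voxels:
        # cameras is a list of (projection function, set of silhouette pixels)
        if all(project(voxel) in silhouette for project, silhouette in cameras):
            hull.append(voxel)
    return hull


# Toy setup: a 3x3x3 voxel grid seen by two orthographic views (assumed).
voxels = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
top_view = (lambda v: (v[0], v[1]), {(1, 1), (1, 2)})            # looks along z
side_view = (lambda v: (v[1], v[2]), {(1, 0), (2, 0), (1, 1)})   # looks along x

hull = carve_visual_hull(voxels, [top_view, side_view])
print(hull)
```

With real data each projection function would be the calibrated perspective projection of one camera, and each silhouette would come from image segmentation.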
-
Kinematic Model Design
-
Kinematic Model Manipulation
Murray et al. 1994
-
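Kinematic Model Fitting
Generate Candidate Poses
Score Each Pose
Compute Cost Components
Update Posterior Estimate
[Figure: model point $M_i$ with normal $n_i$, data point $D_j$ with normal $n_j$, and the distance $d_{ij}$ between them]
$d_{ij} = \lVert M_i - D_j \rVert_2$
$\alpha_{ij} = n_i \cdot n_j$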
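The two cost components on this slide, the distance between a model point and a data point and the dot product of their normals, can be computed as in the sketch below. The function name `cost_components` is hypothetical. The trailing 2 in the slide's norm is read here as the Euclidean norm; if it was meant as a squared norm, drop the square root. How the talk combines the two components into a pose score is not stated, so that step is left out.

```python
import math


def cost_components(model_point, model_normal, data_point, data_normal):
    """Return (d_ij, alpha_ij) for one model/data point pair."""
    # d_ij: Euclidean distance between model point M_i and data point D_j
    diff = [m - d for m, d in zip(model_point, data_point)]
    d_ij = math.sqrt(sum(c * c for c in diff))
    # alpha_ij: dot product of the two surface normals (1.0 when unit normals align)
    alpha_ij = sum(a * b for a, b in zip(model_normal, data_normal))
    return d_ij, alpha_ij


d_ij, alpha_ij = cost_components(
    model_point=[0.0, 0.0, 0.0], model_normal=[0.0, 0.0, 1.0],
    data_point=[3.0, 4.0, 0.0], data_normal=[0.0, 0.0, 1.0],
)
print(d_ij, alpha_ij)
```

Every (i, j) pair is independent of the others, which is one reason pose scoring suits a GPU.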
-
Open Chain Kinematics
P1 = t1a * t1b * t1c * t1d * t1e * P1;
N1 = t1a * t1b * t1c * t1d * t1e * N1;
N1 = N1-P1;
P4 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * P4;
N4 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * N4;
N4 = N4-P4;
P5 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * P5;
N5 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * N5;
N5 = N5-P5;
P7 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * t6a * t6b * t6c * t7a * t7b * t7c * P7;
N7 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * t6a * t6b * t6c * t7a * t7b * t7c * N7;
N7 = N7-P7;
“A mathematical introduction to robotic manipulation” by Murray, Li, and Sastry 1994
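In the expressions above, each segment's transform is the product of its parent's transforms followed by its own, so the leading product (t1a through t1e) can be computed once and reused. The sketch below shows that pattern on a two-segment toy chain. It reads P as a point and N as the tip of a normal anchored at P, which is my interpretation of the slide; the helper names and the joint parameterization are hypothetical.

```python
import math

IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def apply(transform, point):
    """Apply a 4x4 homogeneous transform to a homogeneous 4-vector."""
    return [sum(transform[r][k] * point[k] for k in range(4)) for r in range(4)]


def joint(theta=0.0, tx=0.0):
    """Hypothetical joint: rotation by theta about z plus translation tx along x."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, tx], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def extend(prefix, transforms):
    """Right-multiply a cached prefix product by further joint transforms."""
    for t in transforms:
        prefix = matmul(prefix, t)
    return prefix


def place(prefix, point, normal_tip):
    """Transform a point and its normal tip, then recover the normal as N - P."""
    p = apply(prefix, point)
    n = apply(prefix, normal_tip)
    return p, [n[i] - p[i] for i in range(4)]


t1a, t1b, t4a = joint(theta=math.pi / 2), joint(tx=1.0), joint(tx=2.0)
prefix1 = extend(IDENTITY, [t1a, t1b])   # shared by every downstream segment
prefix4 = extend(prefix1, [t4a])         # reuses prefix1 instead of recomputing it

P1, N1 = place(prefix1, [0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0])
P4, N4 = place(prefix4, [0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0])
print(P1, N1, P4, N4)
```

Caching the prefix products is the "shared computational blocks" observation in code form: deeper segments only pay for their own joints.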
-
Open Chain Kinematics: On the GPU
P1 = t1a * t1b * t1c * t1d * t1e * P1;
N1 = t1a * t1b * t1c * t1d * t1e * N1;
N1 = N1-P1;
P4 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * P4;
N4 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * N4;
N4 = N4-P4;
P5 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * P5;
N5 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * N5;
N5 = N5-P5;
P7 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * t6a * t6b * t6c * t7a * t7b * t7c * P7;
N7 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * t6a * t6b * t6c * t7a * t7b * t7c * N7;
N7 = N7-P7;
Exposing Parallelism
//Thread parameters
unsigned int hWID = threadIdx.x / halfWarpSz;
unsigned int hWoff = threadIdx.x % halfWarpSz;
unsigned int x = hWoff % DimXY;
unsigned int y = hWoff / DimXY;
//MATRIX REDUCTION: across temporary variables over twists
float sum[2];
//1st reduction from 16, 4x4 matrices to 8, 4x4 matrices
if(hWID < 8) {
    sum[0] = 0.0f;
    #pragma unroll
    for(int k = 0; k < 4; k++) {
        sum[0] += Stwists[4*(2*hWID) + y][k] * Stwists[4*(2*hWID+1) + k][x];
    }
    Transtmp0[4*hWID + y][x] = sum[0];
}
__syncthreads();
• All 4x4 transformation matrices ti can be computed in parallel.
• There are many shared computational blocks.
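The kernel excerpt multiplies neighboring 4x4 matrices in pairs, taking 16 matrices down to 8. Repeating that step gives a tree reduction, which relies only on matrix multiplication being associative. The sketch below emulates the pattern serially in Python to show that it matches the left-to-right product. The function names and the integer test matrices are hypothetical, and the matrix count is assumed to be a power of two.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def tree_reduce(matrices):
    """Multiply an ordered list of 4x4 matrices by pairing neighbors level by level."""
    level = list(matrices)
    passes = 0
    while len(level) > 1:
        # Each pair product is independent of the others, so on a GPU a
        # separate group of threads could compute every pair at the same time.
        level = [matmul4(level[2 * i], level[2 * i + 1])
                 for i in range(len(level) // 2)]
        passes += 1
    return level[0], passes


def sequential_product(matrices):
    """Reference result: multiply left to right, one matrix at a time."""
    total = matrices[0]
    for m in matrices[1:]:
        total = matmul4(total, m)
    return total


def make_matrix(i):
    """Hypothetical integer test matrix, so the comparison below is exact."""
    m = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
    m[0][1] = 1
    m[1][3] = i
    return m


matrices = [make_matrix(i) for i in range(16)]
tree_result, passes = tree_reduce(matrices)
print(passes, tree_result == sequential_product(matrices))
```

Sixteen matrices need four passes instead of fifteen dependent multiplications; in the CUDA version each pass ends with a `__syncthreads()` barrier, as in the excerpt.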
-
Open Chain Kinematics: On the GPU: Results
• 22.5x speedup relative to single Matlab process
• 14.6x speedup relative to parallel Matlab process (6 CPUs)
• Qualitative speedup allowed for parameter tuning, resulting in an average 50% reduction in per-frame model fit error, i.e. better model fits!
• Promising approach to open chain kinematic
CUDA ported into Matlab via Mex
[Figure: Compute Time Comparison. Per frame compute time (seconds) vs. frame number. single Matlab: mean = 12.6 sec; parfor Matlab: mean = 8.2 sec; CUDA in Matlab: mean = 0.55 sec]
[Figure: Model Fit Comparison. Per frame model fit error (a.u.) vs. frame number. Legend: errors prior to tuning, errors after tuning]
CUDA where you need it
• Qualitative speedups mean more efficient science.
• Work where you need to and let user friendly languages like Matlab and Python do the rest.
-
Putting it all together
-
Promising Directions
Berman et al. 2014
-
Wavelet Analysis
[Figure: 1st, 2nd, and 3rd principal components vs. time (seconds)]
Kinematic Modeling
Behavioral Classification
Parameterize Dynamics
Cluster Embedded Dynamics
t-SNE
Map Dynamics
Label Clusters
rearing, forward gaze, tight scan
-
Conclusion
• Fitting kinematic models to 3D visual hull data is greatly accelerated by GPUs.
• The framework I’ve presented can be applied to open chain kinematics models in general.
• In science, Big Data too often means sitting around waiting to find out you need to run your analysis again. GPUs are a game changer.
• Interfaces like mex (Matlab) and ctypes (Python) allow you to tackle the hard parts and be lazy about the easy parts.
– You can incrementally deal with bottlenecks of decreasing priority.
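As a small illustration of the ctypes route, the sketch below calls a compiled C function from Python. It uses `hypot` from the system C math library as a stand-in for a compiled CUDA wrapper, which is an assumption made so the example runs without a GPU. It also assumes a POSIX system, where `ctypes.CDLL(None)` falls back to symbols already loaded in the process if the library lookup returns nothing.

```python
import ctypes
import ctypes.util

# Locate the C math library (stand-in for your own compiled shared library).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double hypot(double x, double y)
libm.hypot.argtypes = [ctypes.c_double, ctypes.c_double]
libm.hypot.restype = ctypes.c_double

result = libm.hypot(3.0, 4.0)
print(result)
```

The same three steps (load the library, declare argument and return types, call) apply to a shared library that wraps a CUDA kernel launch, so only the bottleneck has to leave Python.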
-
Acknowledgements
György Buzsáki
Antal Berenyi
Andres Grosmark
The entire Buzsáki lab
Thank you!