
A GPU-Accelerated 3D Kinematic Modeling Platform for Behavioral Neuroscience

John Long, PhD

Buzsáki Laboratory

Neuroscience Institute

New York University Langone Medical Center

03.17.2015

A little about me…

György Buzsáki

Jose Carmena

…and my previous work.

Venkatraman et al. 2009

Long and Carmena 2013

Long and Carmena 2011

Koralek et al. 2012

Why am I at a GPU conference?

• As a neuroscientist, I find small-form-factor, massively parallel computing machines intriguing.

• For ease of interface and visualization, I often program in Matlab or Python, and I suffer agonizing computational bottlenecks, which have led me to GPUs.

• I’ve had a fair amount of success applying GPU computing to my scientific work.

What I have in store for you…

• An introduction to the work I do in behavioral neuroscience that led me to GPU computing.

• A detailed description of one of the CUDA programs I have implemented in the context of my research.

• Throughout, I’ll mention a workflow I’ve found useful for porting CUDA code into Matlab and Python.

Who reads the maps in the brain?

Lurilli et al. 2012

Geisler, Sirota, Zugaro, Robbe, Buzsáki, PNAS 2007

Hippocampal “place” cells (O’Keefe and Nadel, 1978; O’Keefe and Recce, 1993)

Sensory receptive fields (Hubel and Wiesel 1959)

The State of the Art in Behavioral Neuroscience

More and more neural data!

The behaving rat…

The State of the Art in Behavioral Neuroscience

Advances in Motion Capture

Corazza et al. 2006

Environment Construction

[Figure: environment layout with positions numbered 1 to 6; labels: "Lines to cameras", "Line to Amplipex system"]

Multiple Camera Synchronization

Image Segmentation

Svoboda et al. 2005

3D to 2D perspective transformation

Camera Calibration
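
As a rough illustration of the 3D to 2D perspective transformation that calibration provides, the sketch below applies a pinhole model: a 3x4 projection matrix maps a homogeneous 3D point to image coordinates after division by depth. The `project` helper and the matrix values are illustrative assumptions, not calibration results from this rig.

```python
def project(P, X):
    """Project a 3D point X = (x, y, z) to pixels with a 3x4 camera matrix P."""
    Xh = (X[0], X[1], X[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(p * x for p, x in zip(row, Xh)) for row in P)
    if w == 0.0:
        raise ValueError("point lies in the camera's principal plane")
    return (u / w, v / w)  # perspective division

# Hypothetical camera at the origin looking down +z:
# focal length 500 px, principal point (320, 240).
P_example = [
    [500.0, 0.0, 320.0, 0.0],
    [0.0, 500.0, 240.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

pixel = project(P_example, (0.1, -0.2, 2.0))
```

With one such matrix per calibrated camera, the same 3D point can be mapped into every view.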

Visual Hull Construction

Visual Hull Algorithm modified from Forbes et al. 2006

Kinematic Model Design

Kinematic Model Manipulation

Murray et al. 1994

Generate Candidate Poses

Score Each Pose

[Figure: points M_i and D_j with normals n_i and n_j, separated by distance d_ij]

$$d_{ij} = \lVert M_i - D_j \rVert_2$$

$$\alpha_{ij} = n_i \cdot n_j$$
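
The two quantities above can be sketched in a few lines. This is a minimal version that assumes M holds points on the kinematic model and D holds points from the measured data, each with a unit normal; that reading, and the function name `cost_components`, are assumptions. How the components are combined into a single score is not shown here.

```python
import math

def cost_components(M, n_model, D, n_data):
    """Return (d, alpha) for every model/data pair.

    d[i][j]     = Euclidean distance between M[i] and D[j]
    alpha[i][j] = dot product of normals n_model[i] and n_data[j]
    """
    d = [[math.dist(Mi, Dj) for Dj in D] for Mi in M]
    alpha = [[sum(a * b for a, b in zip(ni, nj)) for nj in n_data]
             for ni in n_model]
    return d, alpha

# Toy example: one model point, two data points.
M = [(0.0, 0.0, 0.0)]
n_model = [(0.0, 0.0, 1.0)]
D = [(3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
n_data = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]

d, alpha = cost_components(M, n_model, D, n_data)
```

Every (i, j) entry is independent of the others, which is what makes this step a natural fit for one GPU thread per pair.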

Compute Cost Components

Update Posterior Estimate

Kinematic Model Fitting

Open Chain Kinematics

P1 = t1a * t1b * t1c * t1d * t1e * P1;

N1 = t1a * t1b * t1c * t1d * t1e * N1;

N1 = N1-P1;

P4 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * P4;

N4 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * N4;

N4 = N4-P4;

P5 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * P5;

N5 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * N5;

N5 = N5-P5;

P7 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * t6a * t6b * t6c * t7a * t7b * t7c * P7;

N7 = t1a * t1b * t1c * t1d * t1e * t4a * t4b * t4c * t5a * t5b * t5c * t6a * t6b * t6c * t7a * t7b * t7c * N7;

N7 = N7-P7;

“A mathematical introduction to robotic manipulation” by Murray, Li, and Sastry 1994
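
The pattern in the expressions above can be sketched as follows: multiply the chain of 4x4 transforms in order, apply the product to a point P and to the tip of its normal N, then subtract to recover the normal direction. The `rot_z` and `translate` helpers are simple stand-ins for the twist-derived transforms (t1a, t1b, …), and all names here are assumptions for illustration.

```python
import math
from functools import reduce

def matmul(A, B):
    """Product of two 4x4 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply(T, p):
    """Apply a 4x4 transform to a homogeneous 4-vector."""
    return [sum(T[i][k] * p[k] for k in range(4)) for i in range(4)]

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def translate(dx, dy, dz):
    return [[1, 0, 0, dx], [0, 1, 0, dy], [0, 0, 1, dz], [0, 0, 0, 1]]

def transform_joint(chain, P, N):
    """Mirror of: P = t_a * t_b * ... * P;  N = t_a * t_b * ... * N;  N = N - P."""
    T = reduce(matmul, chain)  # ordered product of the chain
    P_new = apply(T, P)
    N_tip = apply(T, N)
    return P_new, [a - b for a, b in zip(N_tip, P_new)]

# Toy chain: rotate 90 degrees about z, then shift by one unit along x.
chain = [translate(1.0, 0.0, 0.0), rot_z(math.pi / 2)]
P1 = [1.0, 0.0, 0.0, 1.0]  # point on the model
N1 = [1.0, 0.0, 1.0, 1.0]  # tip of a unit normal along z, starting at P1

P1_new, N1_dir = transform_joint(chain, P1, N1)
```

Joints further down the chain reuse the leading factors of the product, so in practice the shared prefix would be computed once and reused.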

Open Chain Kinematics: On the GPU


Exposing Parallelism

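//Thread parameters: each half warp handles one 4x4 matrix, one thread per element
//(halfWarpSz and DimXY are defined elsewhere; values of 16 and 4 would match this layout)
unsigned int hWID = threadIdx.x / halfWarpSz;   //which half warp this thread belongs to
unsigned int hWoff = threadIdx.x % halfWarpSz;  //offset of the thread within its half warp
unsigned int x = hWoff % DimXY;                 //column of the output element
unsigned int y = hWoff / DimXY;                 //row of the output element

//MATRIX REDUCTION: across temporary variables over twists
float sum[2];

//1st reduction from 16, 4x4 matrices to 8, 4x4 matrices
if(hWID < 8) {
    sum[0] = 0.0f;
    #pragma unroll
    for(int k = 0; k < 4; k++) {
        //row y of matrix 2*hWID times column x of matrix 2*hWID+1
        sum[0] += Stwists[4*(2*hWID) + y][k] * Stwists[4*(2*hWID+1) + k][x];
    }
    Transtmp0[4*hWID + y][x] = sum[0];
}
__syncthreads();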

• All 4x4 transformation matrices ti can be computed in parallel.

• There are many shared computational blocks.
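
The reduction pattern in the kernel can be mimicked on the CPU to check the logic. In this sketch an ordered list of 4x4 matrices is halved on each pass (16 to 8 to 4 to 2 to 1) by multiplying neighbouring pairs, and `thread_coords` reproduces the thread-index arithmetic. The default sizes of 16 and 4 and all function names are assumptions chosen for illustration.

```python
from functools import reduce

def matmul4(A, B):
    """Product of two 4x4 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

IDENTITY = [[1 if r == c else 0 for c in range(4)] for r in range(4)]

def pairwise_reduce(mats):
    """Ordered matrix product computed by repeatedly multiplying neighbours."""
    mats = list(mats)
    while len(mats) > 1:
        if len(mats) % 2:            # pad odd counts; identity keeps the product
            mats.append(IDENTITY)
        # Each pair product is independent, so one pass maps to one parallel step.
        mats = [matmul4(mats[2 * h], mats[2 * h + 1])
                for h in range(len(mats) // 2)]
    return mats[0]

def thread_coords(tid, half_warp=16, dim=4):
    """Return (hWID, x, y) for a thread index, as in the kernel's thread parameters."""
    off = tid % half_warp
    return tid // half_warp, off % dim, off // dim

# 16 small integer matrices so the comparison below is exact.
mats = [[[1 if r == c else (m + r + c) % 2 for c in range(4)]
         for r in range(4)] for m in range(16)]

tree_product = pairwise_reduce(mats)
sequential_product = reduce(matmul4, mats)
```

Because matrix multiplication is associative, the tree-ordered product equals the left-to-right product. The same function could be applied once to a shared prefix such as t1a through t1e and the result reused by every joint further down the chain.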

Open Chain Kinematics: On the GPU: Results

• 22.5x speedup relative to a single Matlab process

• 14.6x speedup relative to a parallel Matlab process (6 CPUs)

• The qualitative speedup made parameter tuning practical, resulting in an average 50% reduction in per-frame model fit error, i.e., better model fits!

• Promising approach to open chain kinematic

CUDA ported into Matlab via Mex

[Figure: Compute Time Comparison. Per-frame compute time (seconds) versus frame number. Legend: single Matlab, mean = 12.6 sec; parfor Matlab, mean = 8.2 sec; CUDA in Matlab, mean = 0.55 sec]

[Figure: Model Fit Comparison. Per-frame model fit error (a.u.) versus frame number. Legend: errors prior to tuning; errors after tuning]

CUDA where you need it

• Qualitative speedups mean more efficient science.

• Work where you need to and let user-friendly languages like Matlab and Python do the rest.

Putting it all together

Promising Directions

Berman et al. 2014

Wavelet Analysis

[Figure: 1st, 2nd, and 3rd principal components versus time (seconds)]

Kinematic Modeling

Behavioral Classification

Parameterize Dynamics

Cluster Embedded Dynamics

t-SNE

Map Dynamics

Label Clusters: rearing, forward gaze, tight scan

Conclusion

• Fitting kinematic models to 3D visual hull data is greatly accelerated by GPUs.

• The framework I’ve presented can be applied to open chain kinematic models in general.

• In science, Big Data too often means sitting around waiting to find out you need to run your analysis again. GPUs are a game changer.

• Interfaces like mex (Matlab) and ctypes (Python) allow you to tackle the hard parts and be lazy about the easy parts.

– You can incrementally deal with bottlenecks of decreasing priority.
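
On the Python side, the ctypes pattern looks roughly like the sketch below. So that it runs without a compiled library, a Python callback stands in for the native function. The C signature, the name `score_poses`, and the library path in the comments are all hypothetical.

```python
import ctypes

FloatPtr = ctypes.POINTER(ctypes.c_float)

# C signature assumed for illustration:
#   void score_poses(const float *poses, int n, float *scores);
ScorePosesFn = ctypes.CFUNCTYPE(None, FloatPtr, ctypes.c_int, FloatPtr)

def _python_stand_in(poses, n, scores):
    """Placeholder for the compiled routine: writes the square of each input."""
    for i in range(n):
        scores[i] = poses[i] * poses[i]

# With a real compiled library one would instead write something like:
#   lib = ctypes.CDLL("./libkinematics.so")      # hypothetical path
#   score_poses = lib.score_poses
#   score_poses.argtypes = [FloatPtr, ctypes.c_int, FloatPtr]
#   score_poses.restype = None
score_poses = ScorePosesFn(_python_stand_in)

def run(values):
    """Marshal a Python list into C float buffers, call the routine, copy back."""
    n = len(values)
    poses = (ctypes.c_float * n)(*values)   # input buffer
    scores = (ctypes.c_float * n)()         # output buffer, zero-initialized
    score_poses(poses, n, scores)
    return list(scores)

result = run([1.0, 2.0, 3.0])
```

Only the buffer marshalling lives in Python. The heavy computation stays behind a plain C entry point, which is what lets bottlenecks be ported one at a time.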

Acknowledgements

György Buzsáki

Antal Berenyi

Andres Grosmark

The entire Buzsáki lab

Thank you!

Kinematic Model Design
