Life and Medical Biology Data Accelerator (Lambda)

Institute of Computing Technology, Chinese Academy of Sciences


Guangming Tan




Biological Imaging Data Challenge

GAP: O(years)!

Moritz Helmstaedter, Cellular-resolution connectomics: challenges of dense neural circuit reconstruction, Nature Methods, 10(6), 2013

High-Throughput Image Data Analysis is Required!

Higher Resolution



High Spatiotemporal Resolution Two-Photon Microscope Imaging System

•  In vivo
•  High Dimension

Peking University



Event Detection at the Cellular Level

Elementary Events of Calcium Signals
Calcium sparks and transients (Cheng, H., Science 1993)

Visualization of Reactive Oxygen Species (ROS)
Superoxide flash (Cheng, H., Cell 2008)

Dendrite Calcium Imaging (Zhuang Zhou, Xiaowei Can)
Animal's dynamic neural signals

[Figure panels; scale bars: 5 mm, 5 µm, 2 µm]



Life and Medical Biology Data Accelerator (Lambda, λ)

•  PostgreSQL

•  Bio-Format Data

•  Domain-Specific Accelerator

•  Auto-tuning library Engine

•  Built-in modules

•  Customizable framework Pipeline




λ-Image Software/Hardware Stack

•  High-dimension & multi-mode biological image data system
•  Data analysis pipeline for massive biological images
•  Accelerating data-intensive algorithms for biological image analysis

Application domains:
•  Cardiovasology: mouse embryo heart image cell lineage
•  Brain: mouse brain cell Ca2+ spark detection
•  Endocrinology: islet forming in the pancreas and in vivo imaging

Stack layers:
•  Biological Data Analysis Pipeline (cell event detection, segmentation)
•  Biological Data Analysis Algorithm Toolkit (machine learning, stencil, denoising, deconvolution)
•  Database Accelerator
•  MPI, Spark, CUDA, OpenCL
•  RDMA



High-throughput Image Processing Algorithm

fMRI, ssTEM, sBEM, LSFM

O(N*P^3)

Unbiased Analysis of Events

Current Computing Systems (Software/Hardware): O(Years)

Interactive

High Performance Computing Platform: O(Minutes)

High accuracy

Machine Learning



Parallelization with an In-memory Computing Model

Pipeline stages:
•  Preprocess: Raw Data → 3D Deconvolution → Intensity Normalization → Subtract Background → Preprocessed Data
•  Registration: match Left Side and Right Side (Mutual Information, Powell)
•  Fusion: Left Side and Right Side → Wavelet Decomposition → Activity Measure → Fusion Decomposition → Fused Data
•  Segmentation: Fused Data → Planarity Enhancement → Tensor Voting → 3D Watershed → Labelmap Image Data

Execution on Spark:
•  Map: Image L1, R1, L2, R2, ... → Preprocess → Preprocessed Image L1, R1, L2, R2, ... → Registration → Registered Image L1, R1, L2, R2, ... → Fusion → Fused Image1, Fused Image2, ...
•  Reduce: Merge Image Stack → Fused Image Stack → Segmentation → Final Result for Visualization Process
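The map and reduce roles in the diagram above can be illustrated with a minimal sketch in plain Python. The stage functions here (`preprocess`, `register`, `fuse`) are hypothetical stand-ins that only tag their input, not the actual Lambda modules, and Python's built-in `map` and `functools.reduce` stand in for the Spark operations.

```python
from functools import reduce

# Hypothetical stand-ins for the real stages: each one only tags its input.
def preprocess(image):
    return f"pre({image})"

def register(image):
    return f"reg({image})"

def fuse(left, right):
    return f"fused({left},{right})"

def process_time_point(pair):
    """Map step: preprocess, register, and fuse one left/right image pair."""
    left, right = pair
    return fuse(register(preprocess(left)), register(preprocess(right)))

def merge_stack(stack, fused_image):
    """Reduce step: append one fused image to the growing image stack."""
    return stack + [fused_image]

# Each (left, right) pair is independent, so the map step can run in parallel.
pairs = [("L1", "R1"), ("L2", "R2")]
fused_images = map(process_time_point, pairs)
fused_stack = reduce(merge_stack, fused_images, [])
print(fused_stack)
```

Because every image pair is processed independently, the map step is the part that a cluster framework can distribute; only the merge into a stack needs the results together.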



GPU Acceleration of Algorithm Modules

[Bar chart: CPU vs. GPU for Deconvolution, Median Filter, Objectness Filter, and Iterative Closing; axis from 0 to 40]



Image Processing & Analysis Pipeline

•  Deconvolution: RL/Sparse; 2D and 3D iterative deconvolution.
•  Registration: Mutual Information. Mutual information is derived from information theory, and its application to image registration has been proposed in different forms.
•  Fusion: wavelet based ✔ (use global five-level wavelet decomposition)
•  Segmentation: watershed segmentation, labelmap selection, particle analysis
•  Machine Learning (event detection or pattern analysis): mouse brain cell Ca2+ spike detection analysis

Stage: Segmentation, Registration, Deconvolution, Fusion
GPU: ✔ ✔ ✔ ✔ ✔
Spark: ✔ ✔ ✔
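As an illustration of mutual information as a registration similarity measure, the following is a minimal sketch using a joint intensity histogram. The function name, the bin count, and the toy pixel lists are assumptions made for this example; the slides do not specify how the Lambda registration module computes it.

```python
import math
from collections import Counter

def mutual_information(image_a, image_b, bins=8, max_value=256):
    """Mutual information (in bits) of two equally sized lists of intensities."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    width = max_value / bins
    a_bins = [min(int(v / width), bins - 1) for v in image_a]
    b_bins = [min(int(v / width), bins - 1) for v in image_b]
    n = len(a_bins)
    joint = Counter(zip(a_bins, b_bins))   # joint histogram
    marginal_a = Counter(a_bins)
    marginal_b = Counter(b_bins)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(count * n / (marginal_a[a] * marginal_b[b]))
    return mi

# Toy 1D "images": a block pattern, and the same pattern shifted by two pixels.
reference = [0, 0, 0, 0, 200, 200, 200, 200] * 2
shifted = reference[2:] + reference[:2]

aligned_score = mutual_information(reference, reference)
misaligned_score = mutual_information(reference, shifted)
print(aligned_score, misaligned_score)
```

A registration loop would search over candidate transforms for the one that maximizes this score; the slides name Powell's method as the optimizer, which is not reproduced here.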



Deconvolution of Pancreas Islet Images

2 DAYS

4 GPUs (K20)

Terabyte EM Images

Preprocessing

e: image, p: psf
/* o: captured image; PSF: p in the frequency domain */
for N iterations
    /* apply imaging model to estimate */
    E = gpu_fft(e, batch)
    B = gpu_multiply(E, PSF, batch)
    b = gpu_ifft(B, batch)
    /* captured image divided by blurred estimate */
    r = gpu_divide(o, b, batch)
    /* calculate correction vector */
    R = gpu_fft(r, batch)
    C = gpu_multiply(R, PSF, batch)
    c = gpu_ifft(C, batch)
    /* apply correction vector */
    e = gpu_multiply(e, c, batch)
end
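The iteration above follows the shape of Richardson-Lucy (RL) deconvolution. The sketch below is a CPU-only, 1D illustration of the same four steps, assuming a naive O(n^2) DFT in place of the GPU FFT and circular boundary handling. All function names are hypothetical. One deliberate difference: the correction step multiplies by the conjugate of the PSF spectrum (the mirrored PSF), which coincides with the pseudocode's use of PSF in both steps when the PSF is symmetric.

```python
import cmath

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for a GPU FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t, v in enumerate(values))
        result.append(total / n if inverse else total)
    return result

def convolve(signal, kernel_spectrum):
    """Circular convolution in the transform domain: ifft(fft(signal) * K)."""
    spectrum = dft(signal)
    product = [s * k for s, k in zip(spectrum, kernel_spectrum)]
    return [v.real for v in dft(product, inverse=True)]

def richardson_lucy(observed, psf, iterations=50, eps=1e-12):
    psf_spectrum = dft(psf)
    # Spectrum of the mirrored PSF (complex conjugate for a real-valued PSF).
    mirrored_spectrum = [v.conjugate() for v in psf_spectrum]
    estimate = list(observed)
    for _ in range(iterations):
        # Apply the imaging model to the estimate.
        blurred = convolve(estimate, psf_spectrum)
        # Captured image divided by the blurred estimate.
        ratio = [o / max(b, eps) for o, b in zip(observed, blurred)]
        # Calculate and apply the correction vector.
        correction = convolve(ratio, mirrored_spectrum)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate

# Toy data: a bright spike on a flat background, blurred by a 3-tap PSF.
truth = [1.0] * 16
truth[5] = 10.0
psf = [0.5, 0.25] + [0.0] * 13 + [0.25]   # centered at index 0, sums to 1
observed = convolve(truth, dft(psf))
restored = richardson_lucy(observed, psf)
print(round(observed[5], 3), round(restored[5], 3))
```

With a normalized PSF the update preserves total intensity while sharpening the spike, which is the behavior the per-iteration FFT, multiply, and inverse FFT calls in the pseudocode are batching on the GPU.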

deconvolution

GPU

batch

name                     pixel (XYZ)   #iters  size   Fiji (Java)  GPU  speedup
beta.tif                 1024x2048x51  100     408MB  60m          22s  163
glucose_sequential2.tif  512x512x400   50      200MB  30m          10s  180
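The speedup column can be checked from the two timing columns. This short sketch, with a hypothetical helper name, divides the Fiji time by the GPU time after converting minutes to seconds.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Ratio of baseline running time to accelerated running time."""
    return cpu_seconds / gpu_seconds

# (Fiji time in seconds, GPU time in seconds), taken from the table.
runs = {
    "beta.tif": (60 * 60, 22),
    "glucose_sequential2.tif": (30 * 60, 10),
}
for name, (fiji_seconds, gpu_seconds) in runs.items():
    print(f"{name}: {speedup(fiji_seconds, gpu_seconds):.1f}x")
```

The first ratio is about 163.6, which the table reports truncated to 163; the second is exactly 180.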

4.7 YEARS



Extracting Cells from Mouse Embryo Images

Light-sheet microscope images

1.5 DAYS

Fast, two-sided, 3D, dual-color imaging

[Diagram labels: culture, excitation, detection, detection, reconstruction; Time1, Time2, Time3, ...]

•  200 time points
•  2x500 images
•  2048x2048 pixels
•  4GB*2*200 = 1.6TB
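The data volume estimate can be reproduced with a few lines of arithmetic. The sketch assumes 2 bytes per pixel (16-bit images), which the slide does not state but which is consistent with roughly 4GB per 500-image side.

```python
BYTES_PER_PIXEL = 2            # assumption: 16-bit pixels, not stated on the slide
images_per_side = 500
pixels_per_image = 2048 * 2048
sides = 2
time_points = 200

bytes_per_side = images_per_side * pixels_per_image * BYTES_PER_PIXEL
total_bytes = bytes_per_side * sides * time_points

print(f"per side and time point: {bytes_per_side / 1e9:.2f} GB")
print(f"total: {total_bytes / 1e12:.2f} TB")
```

Under that assumption one side is about 4.19 GB per time point and the full series about 1.68 TB; the slide rounds these to 4GB and 1.6TB.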



Blitz: High Performance Machine Learning Toolkit

Algorithm Interface

Layer Operation

Virtual Backend

NVIDIA DIGITS (Customized)

Sugon Xmachine

Pipelining Parallelism

Model Parallelism

Data Parallelism

RDMA

Classification Clustering SVM K-means

Accelerator: Programming Hardware

Distributed Parallelism

Operator Language (Linear Algebra / Tensor Primitives)

Automatic Performance Tuning

Performance Interface

Communication Avoiding

KNN

vectorization

multithread

DNN CNN

Dimensionality PCA



Convolutional Nets 2012 (AlexNet)

Hardware Environment
•  CPU: Dual Intel(R) Xeon(R) CPU E5-2680 v3, 28
•  CPU Memory: 128GB
•  GPU: Tesla K20
•  GPU Memory: 6GB

Running time
•  1 batch: blitz 125ms, caffe 196ms
•  1 epoch (batch size=128): blitz 1310s, caffe 1960s

13-layer architecture

Layer  Type             Maps and neurons          Kernel size
0      Input            1 map of 224*224 neurons
1      Convolution      64 maps of 55*55 neurons  11*11
2      Pooling          64 maps of 27*27 neurons  3*3
...
9      Fully-connected  4096 neurons              1*1
10     Dropout          4096 neurons              1*1
...
12     Dropout          4096 neurons              1*1

Input 3@224*224 → Convolutions → Conv1 64@55*55 → Pooling → Pool1 64@27*27 → Convolutions → Conv2 192@27*27 → Pooling → Pool2 192@13*13
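The feature-map sizes in the architecture (224 → 55 → 27 → 27 → 13) follow from the usual output-size formula for convolution and pooling layers. The slide lists only the kernel sizes for the first two layers, so the strides, the paddings, and the 5*5 kernel for Conv2 below are assumptions taken from the commonly used AlexNet configuration, not from the slide.

```python
def output_size(input_size, kernel, stride, padding=0):
    """Spatial size of a convolution or pooling output along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1

# Strides and paddings are assumed (common AlexNet settings), not from the slide.
conv1 = output_size(224, kernel=11, stride=4, padding=2)
pool1 = output_size(conv1, kernel=3, stride=2)
conv2 = output_size(pool1, kernel=5, stride=1, padding=2)  # 5*5 kernel assumed
pool2 = output_size(conv2, kernel=3, stride=2)

print(conv1, pool1, conv2, pool2)
```

Under these assumptions the computed sizes match the Conv1, Pool1, Conv2, and Pool2 dimensions shown in the architecture.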



Flash Detection

E. coli, time series, 512x512x(100 frames).

A: Intensity increases rapidly
B: Intensity declines obviously
C: Averaged intensities change continuously

A nonstandard flash is not found by either the expert or the threshold-based method



Automated Flash Detection based on Blitz

Stage labels: Input, Pre-processing, Feature selection, Model train, Model prediction, Result collection, Skew elimination

Workflow: Stack of fluorescence images → Image registration → Cell segmentation → Intensity average → Data over-fitting check (if model accuracy is close to 100%) → Feature addition → Data skew elimination → Test set upper bound estimation → Data ratio adjustment → Cross validation (if model accuracy is close to upper bound) → Model generation → Event detection → Events output

F value: F = 2 × precision × recall / (precision + recall), where
precision = (no. returned flashes) / (no. returned peaks)
recall = (no. returned flashes) / (no. all the flashes)
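As a worked example of the F value defined above, the sketch below computes precision, recall, and F from three counts. The function name and the example counts are made up for illustration.

```python
def f_value(returned_flashes, returned_peaks, all_flashes):
    """F = 2 * precision * recall / (precision + recall)."""
    if returned_peaks == 0 or all_flashes == 0:
        return 0.0
    precision = returned_flashes / returned_peaks
    recall = returned_flashes / all_flashes
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 10 peaks returned, 8 of them true flashes, 16 flashes in total.
score = f_value(returned_flashes=8, returned_peaks=10, all_flashes=16)
print(f"F = {score:.3f}")
```

Here precision is 0.8 and recall is 0.5, so F is about 0.615; a detector has to score well on both to get a high F value.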

•  Use cross validation to find parameters for training a model that achieves better accuracy and F value.

•  Use MPI + CUDA parallelization to reduce training time

•  9 features: 6 local, 3 global
•  Local: (amplitude, width, slope)*(left, right)
•  Global: average intensity of trace, distance to the (last, next) peak
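The 9-feature vector described above can be sketched for a single candidate peak in an intensity trace. The slides name the features but not their exact definitions, so everything below is an assumption made for illustration: each side of the peak is followed downhill until the trace stops decreasing, amplitude is the drop over that run, width is its length in frames, and slope is amplitude divided by width.

```python
def side_features(trace, peak, step):
    """Follow the trace downhill from the peak (step=-1 left, step=+1 right)
    and return (amplitude, width, slope) for that side. Assumed definitions."""
    i = peak
    while 0 <= i + step < len(trace) and trace[i + step] < trace[i]:
        i += step
    amplitude = trace[peak] - trace[i]
    width = abs(i - peak)
    slope = amplitude / width if width else 0.0
    return amplitude, width, slope

def peak_features(trace, peak, previous_peak, next_peak):
    """6 local features (left and right) followed by 3 global features."""
    left = side_features(trace, peak, -1)
    right = side_features(trace, peak, +1)
    global_features = (
        sum(trace) / len(trace),   # average intensity of the trace
        peak - previous_peak,      # distance to the last peak
        next_peak - peak,          # distance to the next peak
    )
    return left + right + global_features

# Hypothetical trace with a candidate peak at frame 4.
trace = [1.0, 1.0, 2.0, 5.0, 9.0, 6.0, 3.0, 3.0, 4.0, 2.0]
features = peak_features(trace, peak=4, previous_peak=0, next_peak=8)
print(features)
```

A classifier trained on such vectors can then label each peak as a flash or not, which is where the cross validation and F value come in.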



Membrane Segmentation based on Blitz

Group                Rand error [·10^-3]  Warping error [·10^-6]  Pixel error [·10^-3]  Training Time
Simple Thresholding  445                  15522                   222
NIPS 2012            48                   434                     60                    7 Days (Four GPUs)
Our approach         116                  2865                    95                    2 Days (One GPU)

Deep Neural Network

Brain Heart



Conclusion

•  Develop a Spark-based parallelization framework for high-throughput image analysis pipelines

•  Optimizations on GPU
    -  Core algorithms in image processing (3x-10x)
    -  SGEMM in deep learning (30%)

•  Achieve significant speedups for image processing
    -  Years → days



Thanks!
