(life(and medical(biology data accelerator (lambda)
TRANSCRIPT
Institute of Computing Technology, Chinese Academy of Sciences
1
Guangming Tan
: Life and Medical Biology Data Accelerator
(Lambda)
Ins>tute of Compu>ng Technology, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences
2
Biological Imaging Data Challenge
GAP: O(years) !
Moritz Helmstaedter, Cellular resolu>on connectomics: challenges of dense neural circuit reconstruc>on, Nature Method, 10(6), 2013
High-Throughput Image Data Analysis is Required!
Higher Resolu>on
Institute of Computing Technology, Chinese Academy of Sciences
3
High Spa>otemporal Resolu>on Two-‐Photon Microscope Imaging System
• In vivo • High Dimension
Peking University
Institute of Computing Technology, Chinese Academy of Sciences
4
Event Detec>on at Cellular Level
Cheng, H Science 1993 Calcium Spark Sparks and Transients
Cheng, H Cell 2008
Superoxide Flash
5 mm
Superoxide Flash
Elementary Events of Calcium Signals
Visualiza>on of Reac>ve oxygen species (ROS)
5µm
(Zhuang Zhou, Xiaowei Can)
Dendrite Calcium Imaging
Animal’s dynamic neural signals
2µm
Institute of Computing Technology, Chinese Academy of Sciences
5
Life and Medical Biology Data Accelerator (Lambda, λ)
• PostgreSQL
• Bio-Format Data
• Domain-Specific Accelerator
• Auto-tuning library Engine
• Built-in modules
• Customizable framework Pipeline
lambda
Institute of Computing Technology, Chinese Academy of Sciences
6
λ-‐Image SoYware/Hardware Stack
¢ High-‐dimension & mul>-‐mode biological image data system ¢ Data analysis pipeline for massive biological image ¢ Accelera>ng data-‐intensive algorithms for biological image analysis
machine learning stencil denoising
Biological Data Analysis Pipeline (cell event detection, segmentation)
Biological Data Analysis Algorithm Toolkit
deconvolution
Database Accelerator
MPI Spark CUDA OpenCL
RDMA
Mouse embryo heart image cell lineage
Mice brain cell Ca2+ spark detec>on
Islet forming in pancrea>c and imaging in vivo
Cardiovasology
Brain
Endocrinology
Institute of Computing Technology, Chinese Academy of Sciences
7
High-‐throughput Image Processing Algorithm
fMRI ssTEM sBEM LSFM
O(N*P3)
Unbiased Analysis of Events
Current Compu>ng Systems(SoYware/Hardware):O(Years)
Interac>ve
High Performance Compu>ng Pla_orm: O(Minutes)
High accuracy
Machine Learning
Institute of Computing Technology, Chinese Academy of Sciences
8
Paralleliza>on with in-‐memory Compu>ng Model
Raw Data
Raw Data
3D Deconvolution
Intensity Normalization
Subtract Background
Preprocessed Data
Left Side
Match
Powell
Mutual Information
Left Side
Right Side
Right Side
Left Side
Wavelet Decomposition
Activity Measure
Fusion Decomposition
Fused Data
Right Side
Fused Data
Planarity Enhancement
Tensor Voting
3D Watershed
Labelmap Image Data
Image R1
Preprocess
Registration
Fusion
Image L1
Preprocess
PreprocessedImage L1
Registration
Registered Image L1
Image L2
Preprocess
Registration
Image R2
Preprocess
Registration
Fused Image1
Merge Image Stack
... ...
Fusion
Fused Image2 ...
Fused Image Stack
Segmetation
Final Result For Visualization Process
Map
Reduce
PreprocessedImage L2
PreprocessedImage R1
PreprocessedImage R2
Registered Image L2
Registered Image R1
Registered Image R2
Spark
Institute of Computing Technology, Chinese Academy of Sciences
9
GPU Accelera>on of Algorithm Modules
0
5
10
15
20
25
30
35
40
DeconvoluJon Median Filter Objectness Filter IteraJve Closing
CPU GPU
Institute of Computing Technology, Chinese Academy of Sciences
10
Image Processing & Analysis Pipeline
RL/Sparse
Mutual Information
Machine Learning (Event detection or Pattern Analysis)
Watershed segmentation Labelmap selection Particle analysis
Mutual Information is derived from Information Theory and its application to image registration has been proposed in different forms
2D and 3D iterative deconvolution.
Machine Learning
Mice Brain Cell Ca2+ Spike Detection Analysis
Segm
enta>o
n Re
gistra>o
n De
convolu>
on
Fusion
GPU
✔ ✔
✔
✔ ✔
Spark
✔
✔
✔
Fusion
Wavelet based ✔
use global five-level wavelet decomposition
Institute of Computing Technology, Chinese Academy of Sciences
11
Deconvolu>on of Pancreas Islet Images
2 DAYS
4 GPUs (K20)
Terabyte EM Images
Preprocessing
e:image p:psf for N iterations /*apple imaging model to estimate*/ E=gpu_fft(e, batch) B=gpu_multiply(E, PSF, batch) b=gpu_ifft(B, batch) /*captured image divided by blurred estimate*/ r=gpu_divide(o, b, batch) /*calculate correction vector*/ R=gpu_fft(r, batch) C=gpu_multiply(R, PSF, batch) c=gpu_ifft(C, batch) /*apply correction vector*/ e=gpu_multiply(e,c, batch) end
deconvolution
GPU
batch
name pixel(XYZ) #iters
size Fiji (JAVA)
GPU speedup
beta.tif 1024x2048x51 100 408MB 60m 22s 163
glucose_sequential2.tif
512x512x400 50 200MB 30m 10s 180
4.7YEARS
Institute of Computing Technology, Chinese Academy of Sciences
12
Extrac>ng Cells from Mouse Embryos Images
light-‐sheet microscopes images
1.5 DAYS
culture Fast, two-‐side, 3D, duel-‐color imaging
excitaJon
detecJon
detecJon ReconstrucJon
Time1 Time2 Time3
………
200 Time points 2x500 images 2048x2048 pixels 4GB*2*200 = 1.6TB
Institute of Computing Technology, Chinese Academy of Sciences
13
Blitz:High Performance Machine Learning Toolkit
Algorithm Interface
Layer Opera>on
Virtual Backend
NVIDIA DIGITS (Customized)
Sugon Xmachine
Pipelining Parallelism
Model Parallelism
Data Parallelism
RDMA
Classifica>on Clustering SVM K-‐means
Accelerator: Programming Hardware
Distributed Parallelism
Operator Language (Linear Algebra / Tensor Primi>ves)
Automa>c Performance Tuning
Performance Interface
Communica>on Avoiding
KNN
vectoriza>on
mul>thread
DNN CNN
Dimensionality PCA
Institute of Computing Technology, Chinese Academy of Sciences
14
Convolu>onal Nets 2012 (AlexNet)
Hardware Environment CPU: Dual Intel(R) Xeon(R) CPU E5-‐2680 v3, 28 CPU-‐Memory: 128GB GPU: Tesla K20 GPU-‐Memory:6GB
blitz 1310s
caffe 1960s
blitz 125ms caffe 196ms
13-‐layer architecture Layer Type Maps and neurons Kernel size
0 Input 1 map of 224*224 neurons
1 ConvoluJon 64 maps of 55*55 neurons 11*11
2 Pooling 64 maps of 27*27 neurons 3*3
3 ConvoluJon 192 maps of 27*27 neurons 5*5
4 Pooling 192 maps of 13*13 neurons 3*3
5 ConvoluJon 384 maps of 13*13 neurons 3*3
6 ConvoluJon 256 maps of 13*13 neurons 3*3
7 ConvoluJon 256 maps of 13*13 neurons 3*3
8 Pooling 256 maps of 6*6 neurons 3*3
9 Fully-‐connected 4096 neurons 1*1
10 Dropout 4096 neurons 1*1
11 Fully-‐connected 4096 neurons 1*1
12 Dropout 4096 neurons 1*1
13 Fully-‐connected 1000 neurons 1*1
Input 3@224*224
Conv1 64@55*55
Convolu>ons Pooling
Pool1 64@27*27
Convolu>ons
Conv2 192@27*27 Pool2 192@13*13
Pooling
1 batch size running >me
batch size=128 1 epoch running >me
Institute of Computing Technology, Chinese Academy of Sciences
15
Flash Detec>on
A B C
E.Coli, Jme series, 512X512X(100 frames).
Intensity increases rapidly Intensity declines obviously averaged intensi>es change con>nuously
A nonstandard flash is not found by either expert or threshold-‐based method
Institute of Computing Technology, Chinese Academy of Sciences
16
Automated Flash Detec>on based on Blitz
Stack of fluorescence
images
Image registration
Cell segmentation
Intensity average
Data over-fitting check
If model accuracy is close to 100%
Feature addition
Data skew elimination
Test set upper bound estimation
Data ratio adjustmentCross validation
If model accuracy is close to upper bound
Model generation
Event detection Events output
Input Pre-processing
Feature selection
Model train
Model prediction
Resultcollection
Skewelimination
F value: F=2×precision×recall/(precision+recall), where precision =(no. returned flashes)/(no. returned peaks) recall =(no. returned flashes)/(no. all the flashes).
¢ Use cross valida>on to find parameters to train a model which can get beper accuracy and F value.
¢ Use MPI + CUDA paralleliza>on to reduce training >me
• 9 features: 6 local, 3 global • (amplitude, width, slope)*(left, right) • Average intensity of trace, distance to
the (last, next) peak
Institute of Computing Technology, Chinese Academy of Sciences
17
Membrane Segmenta>on based on Blitz
Group Rand error [·∙10 −3 ]
Warping error [·∙10 −6]
Pixel error [·∙10 −3 ] Training Time
Simple Thresholding 445 15522 222
NIPS 2012 48 434 60 7 Days (Four GPUs) Our approach 116 2865 95 2 Days (One GPUs)
Deep Neural Network
Brain Heart
Institute of Computing Technology, Chinese Academy of Sciences
18
Conclusion
¢ Develop a Spark-‐based paralleliza>on framework for high throughput image analysis pipelines
¢ Op>miza>ons on GPU § Core algorithms in image processing (3x-‐10x) § SGEMM in deep learning ( 30%)
¢ Achieve significant speedups for image processing § Years à days
Institute of Computing Technology, Chinese Academy of Sciences
19
Thanks!