
Parallelization of Large-Scale Image Processing Workflows to Unravel Neuronal Network


Kristina Holton, Lingsheng Dong M.D. M.S., Research Computing, Harvard Medical School


Orchestra HPC

• Wiki page: https://wiki.med.harvard.edu/Orchestra

• Tech specs:
  – 476 compute nodes
  – 5,128 cores in total
  – 10 GigE interconnect
  – 37.7 TB RAM
• Debian Linux
• LSF scheduler
• 20 PB total storage
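Jobs on Orchestra are submitted through the LSF scheduler. A minimal submission sketch (the queue name, resource requests, and command below are illustrative placeholders, not Orchestra-specific settings):

  #!/bin/bash
  # Submit one job to LSF: 4 cores, a 12-hour wall-clock limit, and a log file
  # named after the LSF job ID. The "short" queue and the command are hypothetical.
  bsub -q short -n 4 -W 12:00 -o logs/%J.out "process_section.sh section_001"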


Research Computing Consultants

• Meet with HMS user community to discuss projects and computational needs

• Consult on analyses from statistics and algorithms to implementation

• Develop scripts and pipelines to address users’ needs

• Provide outreach and education


Use Case

• Investigator: Wei-Chung Allen Lee, from R. Clay Reid’s lab in HMS Neurobiology

• Drosophila: map olfactory neurons
• Custom serial EM (electron microscopy) images: process, reconstruct, and trace the 3D image
• 154 TB of raw data stitched to 54 TB
• Develop a highly efficient, parallelized pipeline with error checking and logging


Key Question in Neuroscience

• How is information processed in neural circuits?
• A neuron’s function is fundamentally dependent on how it is connected within its network
• Reconstruction of neural networks enables analysis of network connectivity

Previous Work: Mouse Visual Cortex

• A mouse is shown a bar in differing orientations; visual cortex neurons are preferentially stimulated
• Map the stimulated neurons to their neural network: structure and function (inhibitory vs. excitatory)
• First use in vivo two-photon calcium imaging to identify each neuron’s preference, then EM to trace the network

[Figure: functional characterization of neurons before reconstruction. Panel b: schematic representation of diverse input to inhibitory interneurons (colored white). Panel c: in vivo two-photon fluorescence image of the 3D anatomical volume (red: blood vessels or astrocytes; green: OGB somata or YFP apical dendrites), separated to expose the functionally imaged plane. Scale bar, 100 μm.]

Nature. 2011 Mar 10; 471(7337): 177–182.

Anatomical 3D Reconstruction

[Figure: anatomical 3D reconstruction, with excitatory and inhibitory neurons labeled.]

Nature. 2011 Mar 10; 471(7337): 177–182.


Original Workflow

• Highly serial: steps are carried out in strict sequential order
• Manual: each step has to be submitted separately by hand, which is time-consuming (illustrated below)
• Prone to systemic failure (a failed step causes downstream jobs to fail)
• Debugging failed steps is very difficult
• Record keeping is confusing (a shared Google doc)
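In practice, the original workflow looked roughly like the sketch below (an illustration only; the script names and queue are hypothetical): each step is submitted by hand, the user waits for it to finish, checks the output, and records the result in the Google doc before submitting the next step.

  # Step 1: submit by hand, then wait and check the output manually
  bsub -q short "linksections.sh section_list.txt"
  # ...inspect the results, note them in the Google doc...

  # Step 2: only after step 1 looks good, submit the next step, one section at a time
  bsub -q short "montage.sh section_001"
  # ...and so on by hand, for 20+ steps across hundreds of sections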

Original Image Processing Workflow

Given a list with hundreds of sections, do:

• linksections: create links for the raw data
• montage: connect neighboring sections together
• genmasks: generate a mask file for each section
• selectframes: find the edges of each section
• ...there are 20+ additional steps

Improved workflow

• Highly parallel: most steps are carried out in parallel
• Automatic: all steps are queued together by a single command
• Fault tolerance: if any step fails, the script automatically kills its downstream jobs instead of letting them error out (see the sketch after this list)
• Check points: the workflow can be resumed from any step
• Email notification makes trouble-shooting friendlier
• A MySQL database makes record keeping and check points easy
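A minimal sketch of the fault-tolerance idea (not the authors' actual tool; job names, scripts, queue, and email address are hypothetical). A failed upstream job can never satisfy its done() dependency, so the downstream job is killed explicitly instead of being left pending, and the user is notified by email:

  #!/bin/bash
  # Submit a step and its dependent step for one section (all names hypothetical).
  bsub -J montage_s001 -q short "montage.sh section_001"
  bsub -J selectframes_s001 -q short -w 'done(montage_s001)' "selectframes.sh section_001"

  # Later: if the montage job exited with an error, kill its downstream job
  # rather than leaving it pending forever, and email the user.
  if bjobs -a -J montage_s001 2>/dev/null | grep -q "EXIT"; then
      bkill -J selectframes_s001
      echo "montage failed for section_001" | mail -s "pipeline failure" user@example.edu
  fi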

[Workflow diagram: given a list with hundreds of sections, linksections runs once, then separate montage and selectframes jobs run in parallel for each section.]

Parallelize as much as we can

• The linksections step only creates links for the raw data files, which is very fast to run
• The montage step can be parallelized across sections, so we submit one job per section
• The selectframes step can likewise be parallelized across sections, one job per section
• And so on for the remaining steps (see the sketch below)
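A minimal sketch of the per-section fan-out under LSF (the script names, queue, and section list file are hypothetical, not the authors' pipeline):

  #!/bin/bash
  # Run linksections once, then submit one montage and one selectframes job per
  # section; every montage job waits only on linksections, so sections run in parallel.
  bsub -J linksections -q short "linksections.sh section_list.txt"

  while read -r section; do
      bsub -J "montage_${section}"      -q short -w 'done(linksections)'       "montage.sh ${section}"
      bsub -J "selectframes_${section}" -q short -w "done(montage_${section})" "selectframes.sh ${section}"
  done < section_list.txt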

Automation?

• A list of steps in a Google doc
• Every step is performed by hand
• Manual labor is required to perform and QC each step
• Failed steps are manually re-run

We need a tool for an automatic workflow.

For each section in a section list, do:

• Do step 1 as a job and log it in the database
• Do step 2 as a job and log it in the database, if step 1 completed successfully
• Do step 3 as a job and log it in the database, if step 2 completed successfully
• Do step 4 as a job and log it in the database, if step 3 completed successfully

General idea

For each section in a section list, do:

#@1,0, linksections #Step1, doesn’t depend on anything, stepNamesubmit job for step1 to computer cluster

#@2,1,montage #Step2, depends on step1, stepName submit job for step2 to computer cluster

#@3,2, selectframes #Step3, depends on step2, stepName submit job for step3 to computer cluster

….

#@10,6.7, patch_cmaps #Step10, depends step6 and step7, stepNamesubmit job for step3 to computer cluster

…..
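As an illustration of how a "#@stepID,dependencies,stepName" tag could be turned into a cluster submission with the right dependency condition, here is a minimal bash sketch; the parsing, the stepN job-naming scheme, and the run_*.sh wrappers are assumptions, not the authors' implementation:

  #!/bin/bash
  # Parse one "#@stepID,dependencies,stepName" tag; dependencies such as "6.7"
  # mean "steps 6 and 7", and "0" means "no dependency".
  tag="#@10,6.7,patch_cmaps"

  fields=${tag#'#@'}
  step=$(echo "$fields" | cut -d, -f1)    # 10
  deps=$(echo "$fields" | cut -d, -f2)    # 6.7
  name=$(echo "$fields" | cut -d, -f3)    # patch_cmaps

  if [ "$deps" = "0" ]; then
      bsub -J "step${step}" "run_${name}.sh"
  else
      # Build an LSF dependency expression such as: done(step6) && done(step7)
      cond=""
      for d in $(echo "$deps" | tr '.' ' '); do
          cond="${cond:+${cond} && }done(step${d})"
      done
      bsub -J "step${step}" -w "$cond" "run_${name}.sh"
  fi

Applying this kind of translation to every tagged step is what allows all jobs to be queued together by a single command.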

Automation!


Trouble-shooting is super easy

[Workflow diagrams (shown twice): for each of hundreds of sections, the per-section montage and selectframes jobs fan out after linksections; a failed step is easy to pinpoint in the diagram.]

Log everything in MySQL

[Workflow diagram: the same per-section steps (linksections, montage, genmasks, selectframes, ...), with every job logged in MySQL; a sketch of the logging idea follows.]
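A minimal sketch of how a wrapper script might record each step's status in MySQL from the shell (the database, table, and column names are hypothetical, not the authors' schema):

  #!/bin/bash
  # Record that a step has been submitted for a section (all names are placeholders).
  mysql workflow_db -e "INSERT INTO job_log (section, step, status)
                        VALUES ('section_001', 'montage', 'SUBMITTED');"

  # After the LSF job finishes successfully, mark the step as DONE so a re-run
  # can resume from the last completed step (the check-point idea).
  mysql workflow_db -e "UPDATE job_log SET status = 'DONE'
                        WHERE section = 'section_001' AND step = 'montage';"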

Easy to re-run


#!/bin/sh
for i in `ls -d folder*`; do
  cd $i
  for j in `ls -d sample*`; do
    cd $j
    for l in `ls -f *.txt`; do
      #step1
      cp $l $l.copy1
    done
    #step2
    cat *.copy1 > $j.copy1.copy2
    cd ..
  done
  #step3
  cat */*copy2 > $i.copy3
  cd ..
done

Input files:

folder1/sample1: c1.s1.f1.txt  c1.s1.f2.txt
folder1/sample2: c1.s2.f1.txt  c1.s2.f2.txt
folder2/sample1: c2.s1.f1.txt  c2.s1.f2.txt
folder2/sample2: c2.s2.f1.txt  c2.s2.f2.txt

Generalized Applications

#!/bin/sh
#loop,i
for i in `ls -d folder*`; do
  cd $i
  #loop,j
  for j in `ls -d sample*`; do
    cd $j
    #loop,l
    for l in `ls -f *.txt`; do
      #@1,0,copy1
      cp $l $l.copy1
    done
    #@2,1,copy2
    cat *.copy1 > $j.copy1.copy2
    cd ..
  done
  #@3,2,copy3
  cat */*copy2 > $i.copy3
  cd ..
done


#!/bin/sh
# Each job is named with -J so that later steps can wait on it with -w 'done(...)';
# the commands are quoted so their output redirection happens inside the LSF job.
for i in `ls -d folder*`; do
  cd $i
  for j in `ls -d sample*`; do
    cd $j
    #loop,l
    for l in `ls -f *.txt`; do
      #@1,0,copy1
      bsub -q mini -J copy1 "cp $l $l.copy1"
    done
    #@2,1,copy2
    bsub -q mini -J copy2 -w 'done("copy1")' "cat *.copy1 > $j.copy1.copy2"
    cd ..
  done
  #@3,2,copy3
  bsub -q mini -w 'done("copy2")' "cat */*copy2 > $i.copy3"
  cd ..
done


Movie Time

1) Rotating rendering of reconstructed cortical neurons and postsynaptic partners (Movie 1)

3) Manual reconstruction demonstration, generating (2) above (Movie 2)

Research Computing: https://rc.hms.harvard.edu