2014 nyu-bio-talk
DESCRIPTION
Talk on sequence analysis at NYU CGSB
TRANSCRIPT
![Page 1: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/1.jpg)
SCALABLE APPROACHES
TO EXPLORING
MICROBIAL DIVERSITY
C. Titus Brown
Asst Professor, MMG / CSE; Michigan State University
1/15: Population Health & Reproduction, VetMed, UC Davis
Talk slides on slideshare.net/c.titus.brown
![Page 2: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/2.jpg)
Funding and motivation:
![Page 3: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/3.jpg)
The central question of my lab --
How can we most effectively use computation to extract
information from large sequence data sets, for the purpose
of better understanding non- and semi-model organisms?
Focus on environmental microbes, marine animals,
& agricultural and veterinary animals.
![Page 4: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/4.jpg)
Biology is becoming data rich – and a
rising tide lifts all boats!
http://susieinfrance.blogspot.com/2010/06/rising-tide-lifts-all-boats.html
![Page 5: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/5.jpg)
…but sometimes the tide comes in a bit
fast.
![Page 6: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/6.jpg)
Our foil for today:
Investigating soil microbial communities
Life on earth depends on soil microbes, but:
• 95% or more of soil microbes cannot be cultured in lab.
• Very little transport in soil and sediment =>
slow mixing rates.
• Estimates of immense diversity:
• Billions of microbial cells per gram of soil.
• Million+ microbial species per gram of soil (Gans et al., 2005)
• One observed lower bound for genomic sequence complexity =>
26 Gbp (Amazon Rain Forest Microbial Observatory)
![Page 7: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/7.jpg)
N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
“By 'soil' we understand (Vil'yams, 1931) a loose surface
layer of earth capable of yielding plant crops. In the physical
sense the soil represents a complex disperse system
consisting of three phases: solid, liquid, and gaseous.”
Microbes live in & on:
• Surfaces of aggregate particles;
• Pores within microaggregates;
![Page 8: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/8.jpg)
Specific questions to address:
• Role of soil microbes in nutrient cycling?
• How does agricultural soil differ from native soil?
• How do soil microbial communities respond to climate
perturbation?
• Genome-level questions:
• What kind of strain-level heterogeneity is present in the population?
• What are the phage and viral populations & dynamics thereof?
• What species are where, and how much is shared between
different geographical locations?
![Page 9: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/9.jpg)
Must use culture-independent and
metagenomic approaches
• Many reasons why you can’t or don’t want to culture:
Cross-feeding, niche specificity, dormancy, etc.
• If you want to get at underlying function, 16S analysis
alone is not sufficient.
Single-cell sequencing & shotgun metagenomics are two
common ways to investigate complex microbial communities.
![Page 10: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/10.jpg)
Shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.
Wikipedia: Environmental shotgun sequencing.png
“Sequence it all and let the bioinformaticians sort it out”
![Page 11: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/11.jpg)
Computational reconstruction of
(meta)genomic content.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
![Page 12: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/12.jpg)
Points:
• Lots of fragments needed! (Deep sampling.)
• Having read and understood some books will help quite a bit.
(Reference genomes.)
• Rare books will be harder to reconstruct than common books.
• Errors in the OCR process matter quite a bit. (Sequencing error.)
• The more different specialized libraries you sample, the more
likely you are to discover valid correlations between topics and
books. (We don’t understand most microbial function.)
• A categorization system would be an invaluable but not
infallible guide to book topics. (Phylogeny can guide
interpretation.)
• Understanding the language would help you validate &
understand the books.
![Page 13: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/13.jpg)
Great Prairie Grand Challenge -- SAMPLING LOCATIONS
2008
![Page 14: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/14.jpg)
A “Grand Challenge” dataset (DOE/JGI)
[Bar chart: basepairs of sequencing (Gbp), y-axis 0 to 600, split by platform (GAII, HiSeq), for eight samples: Iowa, Continuous corn; Iowa, Native Prairie; Kansas, Cultivated corn; Kansas, Native Prairie; Wisconsin, Continuous corn; Wisconsin, Native Prairie; Wisconsin, Restored Prairie; Wisconsin, Switchgrass.]
Total: 1,846 Gbp soil metagenome
Reference values marked on the chart:
• MetaHIT (Qin et al., 2011), 578 Gbp
• Rumen (Hess et al., 2011), 268 Gbp
• Rumen K-mer Filtered, 111 Gbp
• NCBI nr database, 37 Gbp
![Page 15: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/15.jpg)
![Page 16: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/16.jpg)
My algorithm research: 3 methods.
1. Adaptation of a suite of probabilistic data structures for
representing set membership and counting (Bloom filters
and CountMin Sketch). (Zhang et al., PLoS One, 2014.)
2. An online streaming approach to lossy compression of
sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.)
3. Compressible de Bruijn graph representation for
assembly. (Pell et al., PNAS, 2012.)
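As a rough illustration of method 1, here is a minimal CountMin Sketch used to count k-mers approximately. This is only a sketch under stated assumptions: the class name, the `width`/`depth` defaults, and the SHA-1-based hashing are illustrative choices, not the implementation described in Zhang et al.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: `depth` hash rows, each a fixed-width array."""

    def __init__(self, width=1000, depth=4):
        # width and depth are illustrative defaults, not tuned values
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One slot per row; the row index salts the hash (an assumption
        # made for this example, not a claim about any real library).
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, slot in self._slots(item):
            self.tables[row][slot] += 1

    def count(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # is an upper-bounded estimate that never undercounts.
        return min(self.tables[row][slot] for row, slot in self._slots(item))


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


sketch = CountMinSketch()
for kmer in kmers("ATGGCGTATGGCG", k=5):
    sketch.add(kmer)

print(sketch.count("ATGGC"))  # estimate is at least the true count
```

Memory is fixed by `width * depth` regardless of how many distinct k-mers are seen; the price is that counts may be overestimated when k-mers collide.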
![Page 17: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/17.jpg)
Method #2 - Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1).
To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
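A minimal sketch of the idea, following the published description of digital normalization (keep a read only while the median count of its k-mers is still below a coverage cutoff). The function names, the tiny `k`, the `cutoff` value, and the toy reads are all assumptions for illustration; an exact `Counter` stands in for the probabilistic counting structure used at scale.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Single pass: keep a read only if its median k-mer count is below cutoff."""
    counts = Counter()  # exact counts here; illustrative stand-in only
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Toy data: an abundant sequence "A" (10 copies) and a rare one "B" (1 copy).
reads = ["ACGTTGCA"] * 10 + ["GGATCCTA"]
kept = digital_normalize(reads, k=4, cutoff=2)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The abundant read is retained only until its estimated coverage reaches the cutoff, while the rare read is kept, so the excess copies of A are discarded without a second pass over the data.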
![Page 18: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/18.jpg)
Digital normalization
![Page 19: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/19.jpg)
Digital normalization
![Page 20: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/20.jpg)
Digital normalization
![Page 21: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/21.jpg)
Digital normalization
![Page 22: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/22.jpg)
Digital normalization
![Page 23: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/23.jpg)
Digital normalization
![Page 24: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/24.jpg)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembling Iowa prairie and Iowa corn:
| Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding |
|---|---|---|---|
| 2.5 billion | 4.5 million | 19% | 5.3 million |
| 3.5 billion | 5.9 million | 22% | 6.8 million |
Adina Howe
![Page 25: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/25.jpg)
Resulting contigs are all low coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
Howe et al., 2014
![Page 26: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/26.jpg)
Iowa prairie & corn DNA abundances are very even.
[Figure panels: Corn; Prairie]
Howe et al., 2014
![Page 27: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/27.jpg)
Assembly is a good idea:
Howe et al., 2014
![Page 28: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/28.jpg)
Analyses of metabolic potential begin to illuminate differences.
Howe et al., 2014
![Page 29: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/29.jpg)
We see little strain variation in sample.
[Plot: top two allele frequencies (y-axis) vs. position within contig (x-axis).]
Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.
Can measure by read mapping.
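One way such a measurement could look once reads are mapped: for each contig position, take the column of aligned bases and compute the frequencies of the two most common alleles. This is a hedged sketch; the function names, the `min_minor_freq` threshold, and the definition of "polymorphism rate" used here are assumptions, since the slide does not spell them out. Columns are assumed to be non-empty strings of bases.

```python
from collections import Counter


def top_two_allele_freqs(column):
    """Return the frequencies of the two most common bases in one pileup column."""
    counts = Counter(column).most_common(2)
    total = len(column)
    freqs = [n / total for _, n in counts]
    while len(freqs) < 2:
        freqs.append(0.0)  # monomorphic column: no second allele
    return freqs[0], freqs[1]


def polymorphism_rate(columns, min_minor_freq=0.05):
    """Fraction of positions whose second allele reaches min_minor_freq.

    Hypothetical definition chosen for this example.
    """
    variable = sum(
        1 for col in columns if top_two_allele_freqs(col)[1] >= min_minor_freq
    )
    return variable / len(columns)


# Toy pileup: four contig positions, ten mapped reads each.
columns = ["AAAAAAAAAA", "AAAAAAAAAG", "CCCCCTTTTT", "GGGGGGGGGG"]
for col in columns:
    print(col, top_two_allele_freqs(col))
print(polymorphism_rate(columns, min_minor_freq=0.2))
```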
![Page 30: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/30.jpg)
Biogeography: Iowa sample overlap?
Corn and prairie content graphs have 51% nucleotide
overlap.
[Figure labels: Corn; Prairie]
Suggests that at greater depth, samples may have similar
genomic content.
![Page 31: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/31.jpg)
Biogeography of genomic DNA in soil
How much genomic richness is shared
between different sites?
Qingpeng Zhang
![Page 32: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/32.jpg)
So, for soil:
• We really do need more data;
• But at least now we can assemble what we already have.
• Estimate required sequencing depth at 50 Tbp;
• Now also have 2-8 Tbp from Amazon Rain Forest
Microbial Observatory.
• …still not saturated coverage, but getting closer.
Iowa soil work has been published:
Howe et al., 2014, PNAS.
![Page 33: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/33.jpg)
So, for soil:
Note! There are now much faster assembly approaches…!
See: Megahit, http://arxiv.org/abs/1409.7208
(Technology marches on!)
![Page 34: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/34.jpg)
So, for soil:
• We really do need more data;
• But at least now we can assemble what we already have.
• Estimate required sequencing depth at 50 Tbp;
• Now also have 2-8 Tbp from Amazon Rain Forest
Microbial Observatory.
• …still not saturated coverage, but getting closer.
But the diginorm approach also turns out to be widely
useful.
![Page 35: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/35.jpg)
Digital normalization is popular…
Estimated ~1000 users of our software.
Diginorm algorithm now included in Trinity
software from Broad Institute (~10,000 users)
Illumina TruSeq long-read technology now
incorporates our approach (~100,000 users)
![Page 36: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/36.jpg)
The data problem: Looking forward 5
years…
Navin et al., 2011
![Page 37: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/37.jpg)
Some basic math:
• 1000 single cells from a tumor…
• …sequenced to 40x haploid coverage with Illumina…
• …yields 120 Gbp per cell…
• …or 120 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling will require 2,000 CPU weeks…
• …so, given ~2,000 computers, can do this all in one
month.
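The arithmetic above, spelled out. The ~3 Gbp haploid human genome size is the figure used earlier in the talk; the assumptions that variant calling parallelizes perfectly across machines and that it starts only after sequencing finishes are labeled in the comments.

```python
GENOME_GBP = 3   # haploid human genome, ~3 billion bp
cells = 1000
coverage = 40

gbp_per_cell = coverage * GENOME_GBP      # Gbp of sequence per cell
total_tbp = cells * gbp_per_cell / 1000   # Tbp across all cells

sequencing_weeks = 3                      # HiSeq X10 estimate from the slide
cpu_weeks = 2000
computers = 2000
# Assumption: perfect parallelism, one CPU-week of work per machine per week.
calling_weeks = cpu_weeks / computers
# Assumption: analysis starts only after sequencing is finished.
total_weeks = sequencing_weeks + calling_weeks

print(f"{gbp_per_cell} Gbp per cell, {total_tbp:.0f} Tbp total")
print(f"{total_weeks:.0f} weeks end to end")
```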
![Page 38: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/38.jpg)
Similar math applies:
• Pathogen detection in blood;
• Environmental sequencing;
• Sequencing rare DNA from circulating blood.
• Two issues:
• Volume of data & compute
infrastructure;
• Latency for clinical applications.
![Page 39: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/39.jpg)
We face an infinite data problem.
• For all intents and purposes
• For example, Illumina estimates that 228,000 human
genomes will be resequenced this year, primarily by
researchers; this is only going to grow.
• Similar stories across all of biology (although #s lower :)
![Page 40: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/40.jpg)
Current analysis approaches are multipass,
e.g. variant calling:
Data → Mapping → Sorting → Calling → Answer
On infinite data, you really only want to look at the data once…
![Page 41: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/41.jpg)
Streaming algorithms can be very efficient
Data → 1-pass → Answer
See also eXpress, Roberts et al., 2013.
![Page 42: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/42.jpg)
Some key points --
• Digital normalization is streaming.
• Digital normalization is computationally efficient (lower
memory than other approaches; parallelizable/multicore;
single-pass)
• Currently, primarily used for prefiltering for assembly, but
relies on underlying abstraction (de Bruijn graph) that is
also used in variant calling.
![Page 43: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/43.jpg)
Digital normalization
![Page 44: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/44.jpg)
Digital normalization
![Page 45: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/45.jpg)
Digital normalization
![Page 46: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/46.jpg)
Digital normalization
![Page 47: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/47.jpg)
Digital normalization
![Page 48: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/48.jpg)
![Page 49: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/49.jpg)
Error correction as the solution for our ills
Current work: error correction (??)
Errors in sequencing data are at the root of many
problems:
• Assembly needs 100x less memory in the absence of errors.
• Mapping is computationally trivial when there are no
errors.
• Variant calling and genotyping become simple, as does
species detection.
![Page 50: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/50.jpg)
We can error correct high-coverage shotgun data
with k-mer spectra:
Chaisson et al., 2009
Erroneous k-mers
True k-mers
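A toy version of the k-mer spectrum idea: at high coverage, true k-mers are seen many times, while k-mers created by a sequencing error are seen rarely, so an abundance threshold separates the two. The function names, the threshold, and the toy reads are assumptions for illustration; this is not the method of Chaisson et al. as published.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_abundance(counts, threshold):
    """Partition k-mers into likely-true (>= threshold) and likely-erroneous."""
    trusted = {km for km, n in counts.items() if n >= threshold}
    suspect = set(counts) - trusted
    return trusted, suspect


# Toy data: 20 error-free copies of a read plus one copy with a T->A error.
reads = ["ACGTTGCA"] * 20 + ["ACGTAGCA"]
counts = kmer_spectrum(reads, k=4)
trusted, suspect = split_by_abundance(counts, threshold=5)  # hypothetical cutoff

print("trusted:", sorted(trusted))
print("suspect:", sorted(suspect))
```

A corrector would then try to edit each read so that all of its k-mers fall in the trusted set; that step is omitted here.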
![Page 51: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/51.jpg)
Streaming error correction on E. coli data
1% error rate, 100x coverage.
Michael Crusoe, Jordan Fish, Jason Pell
| | TP (corrected) | FP (mistakes) | TN (OK) | FN (missed) |
|---|---|---|---|---|
| Error correction | 3,494,631 | 3,865 | 460,601,171 | 5,533 |
(Early days…)
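To read the numbers above, the usual summary rates can be computed directly from the four counts. The interpretation in the comments (TP = errors corrected, FN = errors missed, FP = correct bases wrongly changed, TN = correct bases left alone) is inferred from the slide's labels and is an assumption.

```python
# Counts from the slide; the meaning of each is assumed from its label.
tp, fp, tn, fn = 3_494_631, 3_865, 460_601_171, 5_533

total_bases = tp + fp + tn + fn
sensitivity = tp / (tp + fn)   # share of real errors that were corrected
precision = tp / (tp + fp)     # share of corrections that were right
specificity = tn / (tn + fp)   # share of correct bases left untouched
observed_error_rate = (tp + fn) / total_bases

print(f"bases considered:    {total_bases:,}")
print(f"sensitivity:         {sensitivity:.4f}")
print(f"precision:           {precision:.4f}")
print(f"specificity:         {specificity:.6f}")
print(f"observed error rate: {observed_error_rate:.4%}")
```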
![Page 52: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/52.jpg)
![Page 53: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/53.jpg)
![Page 54: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/54.jpg)
Single pass, reference free, tunable, streaming
online variant calling.
Error correction variant calling
![Page 55: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/55.jpg)
Streaming with reads…
[Diagram: a stream of reads (Sequence..., Sequence..., ....) feeds the Graph, which yields Variants.]
![Page 56: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/56.jpg)
Analysis is done after sequencing.
Sequencing → Analysis
![Page 57: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/57.jpg)
Streaming with bases
[Diagram: each read arrives as k bases..., then base k+1, k+2, ...; the bases feed the Graph, which yields Variants.]
![Page 58: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/58.jpg)
Integrate sequencing and analysis
Sequencing
Analysis
Are we done yet?
![Page 59: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/59.jpg)
What does the future hold?
• More emphasis on training and infrastructure.
• Data integration!
• Identifying the function of unknown genes…
![Page 60: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/60.jpg)
Summer NGS workshop (2010-2017)
![Page 61: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/61.jpg)
The infrastructure challenge
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic, metabolomic,
…?)
We currently have no good way of querying,
exploring, investigating, or mining these data sets,
especially across multiple locations.
![Page 62: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/62.jpg)
Distributed graph database server
[Architecture diagram components:]
• Web interface + API
• Compute server (Galaxy? Arvados?)
• Data/Info
• Raw data sets
• Graph query layer
• Public servers; "Walled garden" server; Private server
• Upload/submit (NCBI, KBase)
• Import (MG-RAST, SRA, EBI)
![Page 63: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/63.jpg)
Data integration?
Once you have all the data, what do you do?
"Business as usual simply cannot work."
Looking at millions to billions of genomes.
(David Haussler, 2014)
![Page 64: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/64.jpg)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
My charge: We don’t know what most genes do.
| Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding |
|---|---|---|---|
| 2.5 billion | 4.5 million | 19% | 5.3 million |
| 3.5 billion | 5.9 million | 22% | 6.8 million |
Howe et al., 2014; PMID 24632729
![Page 65: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/65.jpg)
Data Intensive Biology
Opportunities & challenges; how can we best support the
biology?
"I have traveled the length and breadth of this
country and talked with the best people, and I can
assure you that data processing is a fad that won't
last out the year." --The editor in charge of business
books for Prentice Hall, 1957
![Page 66: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/66.jpg)
Thanks!
Key points:
• Facing nigh-infinite data situation;
• The first stages of sequence analysis, assembly and variant
calling, are computationally intensive (but we’re hoping to fix
that);
• Training in data intensive biology is critical to the future of
biology.
• Data sharing and data integration infrastructure is also critical.
![Page 67: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/67.jpg)
Graph alignment can detect read saturation
![Page 68: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/68.jpg)
Proposal: distributed graph database server
[Architecture diagram components:]
• Web interface + API
• Compute server (Galaxy? Arvados?)
• Data/Info
• Raw data sets
• Graph query layer
• Public servers; "Walled garden" server; Private server
• Upload/submit (NCBI, KBase)
• Import (MG-RAST, SRA, EBI)
![Page 69: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/69.jpg)
![Page 70: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/70.jpg)
![Page 71: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/71.jpg)
![Page 72: 2014 nyu-bio-talk](https://reader034.vdocuments.mx/reader034/viewer/2022052622/55944cf81a28ab526f8b466a/html5/thumbnails/72.jpg)
Graph queries
across public & walled-garden data sets:
[Diagram nodes: assembled sequence; nitrite reductase; ppaZ; raw sequence. Edge labels: SIMILARITY TO; ALSO CONTAINS.]
See Lee, Alekseyenko, Brown, paper in SciPy 2009:
the “pygr” project.