isb
DESCRIPTION
ISB. Ravi Pandya | Bill Bolosky, Microsoft, June 28, 2012. Genomics project. Collaboration with UC Berkeley AMP Lab: Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, … Long term: Cancer genomics
TRANSCRIPT
![Page 1: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/1.jpg)
ISB
Ravi Pandya | Bill Bolosky
Microsoft, June 28, 2012
![Page 2: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/2.jpg)
Genomics project
Collaboration with UC Berkeley AMP Lab
  Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, …
Long term: Cancer genomics
  David Haussler (UC Santa Cruz): Cancer Genomics Hub / Cancer Genome Atlas (TCGA)
  500 TB (growing to 20 PB) of tumor/normal genomes at San Diego Supercomputer Center
Near term: Genome sequencing pipeline
  Motivated by Archon Genomics X-Prize (September 2013)
  100 samples of DNA from centenarians (>105 years old)
  Sequence with best coverage, accuracy, and cost in 1 month
  Goal: 98% coverage, 99.9999% accuracy, $1000/genome
  Current tools (GATK, CLC) are not sufficient to meet the goal
![Page 3: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/3.jpg)
Genomics pipeline
Fast, accurate, scalable
  Apply state-of-the-art computer science to the sequencing problem
  Machine learning, distributed systems, high-performance computing
  Open source for Windows+Linux | Windows Azure cloud service
SNAP (available now)
  Fast aligner using hash-based index of entire genome
  10-40x faster than BWA
FLASH (in progress)
  Comprehensive probabilistic model
  Reference-based alignment + targeted de novo contig assembly + scaffold assembly
![Page 4: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/4.jpg)
Genomics pipeline
[Pipeline diagram. SNAP: aligned reads, unaligned reads. FLASH: hash clustering, de novo assembly, optimization, scaffold assembly. Final step: call SNPs, indels, SVs.]
![Page 5: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/5.jpg)
SNAP
Reference genome: CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA...GTTTAGCTCAAAGAG...
Seeds: AGCTCAAA, GAAAGAA
Read sequence: CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAG
Hash index of seed {locations}
1. Lookup seeds
2. Map locations
3. Score matches
~15 core-hours for 30x coverage
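The three steps above (look up seeds, map locations, score matches) can be sketched in a few lines of Python. This is a toy illustration, not SNAP's implementation: the function names, the non-overlapping seed choice, the Hamming-distance score, and the `max_mismatches` cutoff are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(reference, seed_len):
    """Hash index: seed string -> list of reference positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def align(read, reference, index, seed_len, max_mismatches=3):
    """Return (start, mismatches) for the best candidate location, or None."""
    candidates = set()
    # 1. Look up seeds: non-overlapping seeds taken from the read (an assumption).
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        # 2. Map locations: each seed hit implies a start position for the read.
        for hit in index.get(seed, ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # 3. Score matches: Hamming distance stands in for a real alignment score.
    best = None
    for start in sorted(candidates):
        d = hamming(read, reference[start:start + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (start, d)
    return best

# Toy reference taken from the first part of the slide's example sequence.
reference = "CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA"
index = build_index(reference, seed_len=8)
read = "GCTGCAGCACTCTTTA"  # reference[12:28] with one substituted base
print(align(read, reference, index, seed_len=8))
```

Because only the first seed of the read is intact, the sketch also shows why several seeds per read are useful: one exact seed hit is enough to propose a location, and the scoring step then tolerates the mismatch elsewhere in the read.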
![Page 6: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/6.jpg)
FLASH
[Diagram: likelihood optimization loop. Labels: candidate assembly; coverage; pair distance; alignment; depth; separation; overlap; likelihood; optimize. Inputs: SNAP aligner, genomic prior knowledge, machine learning models. Also labeled: sparse matrices, SNAP.]
![Page 7: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/7.jpg)
Read alignment
[Figure: two sparse matrices over the candidate assembly, each 1B reads by 3B bp genome (per strand). The first shows entries of 1; the second shows entries 0.9, 0.6, 0.7, 0.2, 0.8.]
RGS = Read-Genome-Strand candidate assembly
LRG = Likelihood of Read-Genome alignment
SNAP alignment
Sequencing error
Mutation frequency
Variant databases
![Page 8: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/8.jpg)
Coverage distribution
[Figure: coverage along the assembly strands of the 3B bp genome. Coverage values shown: 22 24 29 35 34. Likelihood values shown: 0.1 0.12 0.14 0.12 0.1.]
GSC = Genome-Strand Coverage
LC = Likelihood of Coverage
Assembly RGS
Sequencer characteristics
Alignment data
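As a rough illustration of turning per-position coverage into a likelihood, the sketch below scores observed depths against an expected depth. The slide does not say which distribution FLASH uses; the Poisson model, the function names, and the expected depth of 30 are assumptions chosen only to make the idea concrete, and the numbers it produces are not meant to reproduce the likelihood values on the slide.

```python
import math

def poisson_log_pmf(k, mean):
    """log P(K = k) for a Poisson distribution with the given mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def coverage_log_likelihood(coverage, expected_depth):
    """Sum of per-position log-likelihoods, treating positions as independent."""
    return sum(poisson_log_pmf(c, expected_depth) for c in coverage)

# Coverage values from the slide's figure; expected depth is a made-up parameter.
coverage = [22, 24, 29, 35, 34]
for depth in (10, 30, 60):
    print(depth, round(coverage_log_likelihood(coverage, depth), 2))
```

Under this assumed model, a candidate assembly whose implied depth is far from what the sequencer would plausibly produce gets a much lower score than one near the observed depths.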
![Page 9: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/9.jpg)
Hash clustering
Cluster unaligned reads with overlapping bases
Starting point for assembling contigs
CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC
GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA
CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA
[Figure annotation: 1 1 2 3]
1. Count seeds
2. Bucket reads by seed
3. Connect overlapping reads
4. Cluster connected components
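The four steps above can be sketched as follows. This is a minimal illustration rather than the FLASH code: sharing a seed is used as a stand-in for "overlapping", and the function name, `seed_len`, and `min_count` are hypothetical parameters. Connected components are found with a small union-find.

```python
from collections import defaultdict

def cluster_reads(reads, seed_len, min_count=2):
    """Group reads that share seeds; returns sorted lists of read indices."""
    def seeds(read):
        return (read[i:i + seed_len] for i in range(len(read) - seed_len + 1))

    # 1. Count seeds across all reads.
    counts = defaultdict(int)
    for read in reads:
        for seed in seeds(read):
            counts[seed] += 1

    # 2. Bucket reads by seed, keeping only seeds seen at least min_count times.
    buckets = defaultdict(set)
    for rid, read in enumerate(reads):
        for seed in seeds(read):
            if counts[seed] >= min_count:
                buckets[seed].add(rid)

    # 3. Connect reads that fall in the same bucket (union-find with path halving).
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        ordered = sorted(members)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])

    # 4. Cluster connected components.
    clusters = defaultdict(list)
    for rid in range(len(reads)):
        clusters[find(rid)].append(rid)
    return sorted(clusters.values())

# The three example reads from the slide: the first two overlap, the third does not.
reads = [
    "CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC",
    "GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA",
    "CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA",
]
print(cluster_reads(reads, seed_len=12))
```

Each resulting cluster is a set of reads that plausibly come from the same region, which is why it can serve as a starting point for assembling a contig.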
![Page 10: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/10.jpg)
Targeted de novo assembly
[Diagram: likelihood optimization loop applied to a contig “genome”. Labels: calc hash clusters; contig “genome”; candidate assembly; coverage; pair distance; alignment; depth; separation; overlap; likelihood; optimize; infer; update. Inputs: genomic prior knowledge, machine learning models.]
![Page 11: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/11.jpg)
Scaffold assembly
Maximum likelihood model
Optimized reference contigs + de novo unaligned contigs
Explore space of possible arrangements into a sample genome
Optimize P(observed reads | candidate genome)
= sequencing error + coverage depth + pair distance
Incremental calculation using sparse matrix model
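One way to read "incremental calculation using sparse matrix model" is that the total log-likelihood is a sum over sparse entries, so changing one read's placement only requires subtracting its old term and adding its new one. The sketch below illustrates that idea under stated assumptions: the class and method names are hypothetical, a plain dict stands in for a sparse matrix, and only the alignment term is modeled (coverage depth and pair distance would be further summands).

```python
import math

class SparseLikelihood:
    """Toy model: each read sits at one genome position in the candidate."""

    def __init__(self, likelihoods):
        # likelihoods: {(read_id, position): probability in (0, 1]}, sparse.
        self.likelihoods = likelihoods
        self.placement = {}   # read_id -> current position
        self.total = 0.0      # running log-likelihood of the candidate

    def place(self, read_id, position):
        """Place or move a read, updating the total by the change only."""
        old = self.placement.get(read_id)
        if old is not None:
            self.total -= math.log(self.likelihoods[(read_id, old)])
        self.total += math.log(self.likelihoods[(read_id, position)])
        self.placement[read_id] = position

    def recompute(self):
        """Full recalculation, used here only to check the incremental total."""
        return sum(math.log(self.likelihoods[(r, p)])
                   for r, p in self.placement.items())

# Example entries; the probabilities are illustrative numbers only.
model = SparseLikelihood({(0, 100): 0.9, (0, 5000): 0.2,
                          (1, 130): 0.6, (2, 160): 0.7, (2, 7000): 0.8})
model.place(0, 5000)
model.place(1, 130)
model.place(2, 160)
before = model.total
model.place(0, 100)   # move read 0 to a better-supported location
print(before, model.total, model.recompute())
```

An optimizer exploring candidate arrangements can then evaluate each small change at a cost proportional to the number of entries touched rather than the size of the whole genome.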
![Page 12: ISB](https://reader036.vdocuments.mx/reader036/viewer/2022082817/56812b3d550346895d8f516c/html5/thumbnails/12.jpg)
Next steps? …
SNAP
  Apply to more datasets / platforms / organisms
  Validate accuracy / coverage
FLASH
  Use Kaviar for population priors
  Different approaches to assembly / structural variation
Biology
  What interesting research could this enable: scale, speed, accuracy, analysis?