isb

ISB
Ravi Pandya | Bill Bolosky
Microsoft
June 28 2012

Genomics project

Collaboration with UC Berkeley AMP Lab
Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, …

Long term: Cancer genomics
David Haussler (UC Santa Cruz): Cancer Genomics Hub / Cancer Genome Atlas (TCGA)
500 TB (growing to 20 PB) of tumor/normal genomes at San Diego Supercomputer Center

Near term: Genome sequencing pipeline
Motivated by Archon Genomics X-Prize (September 2013)
100 samples of DNA from centenarians (>105 years old)
Sequence with best coverage, accuracy, and cost in 1 month
Goal: 98% coverage, 99.9999% accuracy, $1000/genome
Current tools (GATK, CLC) are not sufficient to meet the goal

Genomics pipeline

Fast, accurate, scalable
Apply state-of-the-art computer science to the sequencing problem

Machine learning, distributed systems, high-performance computing

Open source for Windows+Linux | Windows Azure cloud service

SNAP (available now)
Fast aligner using hash-based index of entire genome

10-40x faster than BWA

FLASH (in progress)
Comprehensive probabilistic model

Reference-based alignment + targeted de novo contig assembly + scaffold assembly

Genomics pipeline

[Pipeline diagram. Stages: aligned reads and unaligned reads; hash clustering; de novo assembly; optimization; scaffold assembly; call SNPs, indels, SVs. Stage groups: SNAP, FLASH.]

SNAP

CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA...GTTTAGCTCAAAGAG...

Reference genome

AGCTCAAA GAAAGAA

CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAG

Read sequence

Hash index of seed {locations}

1. Look up seeds
2. Map locations
3. Score matches

~15 core-hours for 30x coverage
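The three steps on this slide can be illustrated with a small Python sketch. This is a toy under stated assumptions, not SNAP's actual implementation: the seed length of 8, the use of non-overlapping seeds, the Hamming-distance scoring, and the names `build_index` and `align` are all hypothetical choices made for illustration. The toy reference is the leading portion of the reference sequence shown on the slide.

```python
from collections import defaultdict

SEED_LEN = 8  # assumed toy seed length, not a value taken from SNAP


def build_index(reference, seed_len=SEED_LEN):
    """Hash index: map every seed (substring of seed_len bases) to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching bases between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align(read, reference, index, seed_len=SEED_LEN):
    """Return (best_start, mismatches), or None if no seed of the read hits the index."""
    candidates = set()
    # 1. Look up seeds: take non-overlapping seeds from the read.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        # 2. Map locations: a hit at reference position pos for a seed at
        #    read offset `offset` implies the read starts at pos - offset.
        for pos in index.get(read[offset:offset + seed_len], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # 3. Score matches: keep the candidate with the fewest mismatches.
    best = None
    for start in sorted(candidates):
        score = hamming(read, reference[start:start + len(read)])
        if best is None or score < best[1]:
            best = (start, score)
    return best


reference = "CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA"
index = build_index(reference)
exact_hit = align("GCTGCAGCACGCTTTAACCG", reference, index)     # read copied from the reference
one_mismatch = align("GCTGCAGCACTCTTTAACCG", reference, index)  # same read with one base changed
```

Because a read with a sequencing error still contains at least one error-free seed in most cases, the index lookup finds the right candidate location and the scoring step absorbs the mismatch.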

FLASH

[Diagram labels: candidate assembly; coverage; pair distance; alignment; depth; separation; overlap; likelihood; optimize. Inputs shown: SNAP aligner, genomic prior knowledge, machine learning models.]

Sparse Matrices

[Diagram of sparse matrices with axes 1B reads, 3B bp genome, and strands. Other labels: SNAP, read alignment, candidate assembly, assembly, coverage.]

RGS = Read-Genome-Strand candidate assembly (entries shown: 1)

LRG = Likelihood of Read-Genome alignment (entries shown: 0.9, 0.6, 0.7, 0.2, 0.8)

GSC = Genome-Strand Coverage (entries shown: 22, 24, 29, 35, 34)

LC = Likelihood of Coverage (entries shown: 0.1, 0.12, 0.14, 0.12, 0.1)

Inputs listed on the slide: SNAP alignment, sequencing error, mutation frequency, variant databases, coverage distribution, sequencer characteristics, alignment data
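One way to picture these matrices is to store only their nonzero entries. The Python sketch below is an assumption-laden illustration, not the FLASH data structure: it assumes a fixed read length, treats RGS as a set of (read, position, strand) placements, treats LRG as a dictionary keyed by (read, position), and derives GSC by counting the reads that cover each position. The function names and the toy values are hypothetical.

```python
import math
from collections import Counter

READ_LEN = 4  # assumed fixed read length for the toy example


def genome_strand_coverage(rgs, read_len=READ_LEN):
    """Derive GSC from RGS: depth at each (position, strand) covered by a placed read."""
    gsc = Counter()
    for _read, pos, strand in rgs:
        for p in range(pos, pos + read_len):
            gsc[(p, strand)] += 1
    return gsc


def alignment_log_likelihood(rgs, lrg):
    """Sum the log of the LRG entry for every placement selected in RGS."""
    return sum(math.log(lrg[(read, pos)]) for read, pos, _strand in rgs)


# Nonzero entries only: (read id, genome position, strand).
rgs = {(0, 0, "+"), (1, 2, "+"), (2, 1, "-")}
# Likelihood of each read aligning at that position (toy values).
lrg = {(0, 0): 0.9, (1, 2): 0.6, (2, 1): 0.7}

gsc = genome_strand_coverage(rgs)
total = alignment_log_likelihood(rgs, lrg)
```

With this layout, memory grows with the number of placed reads rather than with the product of read count and genome length, which is the point of using a sparse representation at the scale the slide mentions.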

Hash clustering

Cluster unaligned reads with overlapping bases
Starting point for assembling contigs

CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC

GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA

CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA

1. Count seeds
2. Bucket reads by seed
3. Connect overlapping reads
4. Cluster connected components
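The four steps above can be sketched in Python as follows. This is a minimal illustration, not the FLASH implementation: it assumes that two reads "overlap" when they share at least one exact seed, uses a seed length of 12, and finds connected components with a small union-find. The names `seeds` and `cluster_reads` are hypothetical. The input is the three reads shown on this slide.

```python
from collections import Counter, defaultdict


def seeds(read, seed_len):
    """All distinct substrings of length seed_len in a read."""
    return {read[i:i + seed_len] for i in range(len(read) - seed_len + 1)}


def cluster_reads(reads, seed_len=12):
    """Group read indices into clusters of reads connected by shared seeds."""
    # 1. Count seeds: in how many reads does each seed occur?
    counts = Counter(s for read in reads for s in seeds(read, seed_len))

    # 2. Bucket reads by seed, keeping only seeds shared by two or more reads.
    buckets = defaultdict(list)
    for i, read in enumerate(reads):
        for s in seeds(read, seed_len):
            if counts[s] >= 2:
                buckets[s].append(i)

    # 3. Connect overlapping reads: union all reads that fall in the same bucket.
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    # 4. Cluster connected components.
    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


slide_reads = [
    "CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC",
    "GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA",
    "CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA",
]
clusters = cluster_reads(slide_reads)
```

The first two reads are the same sequence shifted by four bases, so they share many seeds and land in one cluster; the third read shares no 12-base seed with them and forms its own cluster.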

Targeted de novo assembly

[Diagram labels: contig “genome”; calc hash clusters; candidate assembly; coverage; pair distance; alignment; depth; separation; overlap; likelihood; optimize; infer; update. Inputs shown: genomic prior knowledge, machine learning models.]

Scaffold assembly

Maximum likelihood model
Optimized reference contigs + de novo unaligned contigs

Explore space of possible arrangements into a sample genome

Optimize P(observed reads | candidate genome)

= sequencing error + coverage depth + pair distance

Incremental calculation using sparse matrix model
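As a rough illustration of incremental calculation, the Python sketch below keeps the objective as a running sum of log-probability terms stored sparsely, so that changing one part of a candidate arrangement adjusts the total by a difference instead of recomputing everything. Reading the "+" on the slide as a sum of log-likelihood terms is an assumption, as are the class name `SparseLogLikelihood`, the key scheme, and the toy probabilities; none of this is taken from the FLASH code.

```python
import math


class SparseLogLikelihood:
    """Running total of log-probability terms, stored only for keys that are present."""

    def __init__(self):
        self.terms = {}   # key -> log-probability
        self.total = 0.0  # sum of all stored terms

    def set(self, key, prob):
        """Insert or replace one term (prob must be > 0); update total by the delta only."""
        new = math.log(prob)
        self.total += new - self.terms.get(key, 0.0)
        self.terms[key] = new

    def remove(self, key):
        """Drop one term, if present, and subtract it from the total."""
        self.total -= self.terms.pop(key, 0.0)


model = SparseLogLikelihood()
# Hypothetical keys for the three kinds of terms named on the slide.
model.set(("error", "read0"), 0.9)      # sequencing error term for one read
model.set(("coverage", 1000), 0.12)     # coverage depth term at one position
model.set(("pair", "read0/read1"), 0.5)  # pair distance term for one read pair

# Moving a contig changes only the affected terms; the rest are untouched.
model.set(("pair", "read0/read1"), 0.8)
model.remove(("coverage", 1000))
```

Each candidate arrangement explored by the optimizer then costs time proportional to the number of terms it changes, not to the total number of reads.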

Next steps? …

SNAP
Apply to more datasets / platforms / organisms

Validate accuracy / coverage

FLASH
Use Kaviar for population priors

Different approaches to assembly / structural variation

Biology
What interesting research could this enable: scale, speed, accuracy, analysis?
