Statistical Methods for Next Generation Sequencing http://www.biostat.jhsph.edu/~khansen/enar2012.html Zhijin Wu Brown University Kasper Hansen, Rafael A Irizarry Johns Hopkins University

Page 1: Statistical Methods for Next Generation Sequencingkhansen/LecIntro1.pdf · Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010 Source: Whiteford

Statistical Methods for Next Generation Sequencing

http://www.biostat.jhsph.edu/~khansen/enar2012.html

Zhijin Wu, Brown University

Kasper Hansen, Rafael A Irizarry, Johns Hopkins University

Page 2

Outline

• Introduction to NGS

• SNP calling and genotyping

• RNA-sequencing

• Hands-on exercise

Page 3

Introduction to Next Generation Sequencing

Rafael A. Irizarry

http://rafalab.org

Many slides courtesy of: Héctor Corrada Bravo and Ben Langmead

Page 4

Remember this?

D. melanogaster, Science, 2000; H. sapiens, Nature, 2000; M. musculus, Nature, 2002 and Science, 2000

Back then: millions of clones (about a thousand bps each) in 9 months for billions of dollars

Today: billions of short reads (35-100 bps) in a week for thousands of dollars

Claim: Assemble a genome in weeks for less than $100,000

Page 5

Start with DNA (millions of copies)

Page 6

Break it

Page 7

Put in sequencer

Page 8

Sequence first 35-400 bps: call them “reads”

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGTGTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGGCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTTCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCGTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTGCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC

Page 9

Platforms

Page 10

Illumina/Solexa

• Eight lanes

• ~160M short reads (~50-70 bp) per lane

Figure 1 (from Illumina's Technology Spotlight: Illumina Sequencing): Illumina Genome Analyzer flow cell. Several samples can be loaded onto the eight-lane flow cell for simultaneous analysis on the Illumina Genome Analyzer.

Illumina Sequencing Technology: The Genome Analyzer generates several billion bases of high-quality sequence per run at less than 1% of the cost of capillary-based methods. An expansive scale of research unimaginable with other technology platforms is now possible.

Page 11

Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010

Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009

Each read: name, sequence, quality scores

x 100s of millions
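Each read carries a name, a sequence, and per-base quality scores. A minimal sketch of unpacking one such record follows. It assumes the common four-line FASTQ layout with Phred+33 quality encoding; neither is stated on the slide, and the record itself is made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into name, sequence, and Phred scores."""
    name = lines[0][1:]          # drop the leading '@'
    seq = lines[1]
    # Assumed Phred+33 encoding: quality = ASCII code - 33
    quals = [ord(ch) - 33 for ch in lines[3]]
    return name, seq, quals


def error_probability(q):
    """A Phred quality Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical record: name, sequence, separator, quality string
record = ["@read_1", "GTTGAGGC", "+", "IIIIII5+"]
name, seq, quals = parse_fastq_record(record)
probs = [error_probability(q) for q in quals]
```

With hundreds of millions of reads, records like this would normally be streamed from a file one at a time rather than held in memory.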

Page 12

Not just Assembly

• Resequencing

• SNP discovery and genotyping

• Variant discovery and quantification

• TF binding sites: ChIP-Seq

• Gene expression: RNA-Seq

• Measuring methylation

Page 13

Not just Assembly

Page 14

1000 Genomes Project

Genotyping

Page 15

Human Epigenome Project

Methylation

Page 16

What to do with all these sequences?

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGTGTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGGCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTTCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCGTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTGCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC

Page 17

Most apps: Start by matching to reference

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

Page 18

Variant detection

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Align (each read against the Reference):

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate:

GTCGCAGTATCTGTCT GTCGCAGTATCTGTNN TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TATATCGCAGTATCTT TATATCGCAGTATCTG NATATCGCAGTATNTG CCCTATATCGCAGTAT ACACCCTATGTCGCA ACACCCTATCTCGCA ACACCCTATGTCGCA GA-CACCCTATGTCGC CCGGA-CACCCTATAT CCGGA-CACCCTATAT GCCGGA-CACCCTATG

Statistics:

Call: HET A, G

p-value: 0.0023

“Coverage”

“Pileup” or “Coverage plot”

“Depth of coverage” = 14
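The aggregate step can be sketched as counting, at one reference position, the bases contributed by every read that covers it. The reads and 0-based start offsets below are hypothetical, reads are assumed to align without gaps, and depth here counts only called bases (no-calls marked N are skipped).

```python
from collections import Counter


def pileup_column(aligned_reads, pos):
    """Count the bases observed at reference position `pos` (0-based).

    aligned_reads: list of (start, sequence) pairs for ungapped alignments.
    """
    counts = Counter()
    for start, seq in aligned_reads:
        if start <= pos < start + len(seq):
            base = seq[pos - start]
            if base != "N":      # skip no-calls
                counts[base] += 1
    return counts


# Hypothetical reads over a reference ACGTACGGTT; two of them carry an A at position 3
reads = [(0, "ACGTAC"), (1, "CGAACG"), (2, "GTACGG"), (3, "AACGGT")]
column = pileup_column(reads, 3)
depth = sum(column.values())
```

A genotype call such as the HET call on this slide would then come from a statistical model applied to the counts in `column`; that model is not shown here.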

Page 19

RNA-seq differential expression

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Sample A, reads aggregated over Gene 1:

GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG CCCTATATCGCAGTAT AGCACCCTATGTCGCA AGCACCCTATATCGCA AGCACCCTATGTCGCA GAGCACCCTATGTCGC CCGGAGCACCCTATAT CCGGAGCACCCTATAT GCCGGAGCACCCTATG

Sample B, reads aggregated over Gene 1:

TGTCGCAGTATCTGTC AGCACCCTATGTCGCA GCCGGAGCACCCTATG

Align (the same example alignments are shown for each sample):

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate

Statistics

Gene 1 differentially expressed?: YES

p-value: 0.0012
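One simple way to turn two read counts into a p-value is an exact binomial test. The slide does not say which test produced its p-value, so the sketch below is a stand-in. It assumes both samples were sequenced to the same depth, and uses 18 and 3, the numbers of reads drawn over Gene 1 for the two samples on this slide.

```python
from math import comb


def binomial_two_sided_p(count_a, count_b):
    """Two-sided exact binomial p-value for the null hypothesis that a read
    over the gene is equally likely to come from either sample."""
    n = count_a + count_b
    k = min(count_a, count_b)
    # Probability of a split at least as lopsided as the one observed, one tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = binomial_two_sided_p(18, 3)
```

Real RNA-seq analyses also need replicates and a normalization for sequencing depth, which this sketch ignores.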

Page 20

GATTCCTGCCTCATCC

ChIP-seq

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Align:

GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate (reads stacked along the Reference):

GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT CCCTATATCGCAGTAT AGCACCCTATGTCGCA AGCACCCTATATCGCA AGCACCCTATGTCGCA GAGCACCCTATGTCGC CCGGAGCACCCTATAT CCGGAGCACCCTATAT GCCGGAGCACCCTATG

TATGCACGCGATAGCA GATAGCATTGCGAGAC

Statistics:

Binding occurs here

p-value: 0.0023
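A rough sketch of the "statistics" step for ChIP-seq: count read starts in fixed windows and ask how surprising each count is under a Poisson background. The window size, the read start positions, and the choice of a Poisson model are all illustrative assumptions, not the method behind the p-value on this slide.

```python
from math import exp


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = exp(-lam)             # P(X = 0)
    cdf = 0.0
    for i in range(k):           # accumulate P(X = 0) ... P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    return 1.0 - cdf


def window_counts(read_starts, window, genome_length):
    """Number of read starts falling in each consecutive window."""
    counts = [0] * ((genome_length + window - 1) // window)
    for s in read_starts:
        counts[s // window] += 1
    return counts


# Hypothetical read start positions on a 100-bp reference
starts = [3, 5, 40, 41, 42, 43, 44, 45, 47, 90]
counts = window_counts(starts, 20, 100)
background = sum(counts) / len(counts)       # mean reads per window
p_values = [poisson_upper_tail(c, background) for c in counts]
```

The background rate here is estimated from all windows, including the enriched one, which makes the test conservative; a control sample would be the more usual source.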

Page 21

Matching Revisited

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

Page 22

Matching 10,000,000 32-bp reads

• BLAST takes more than 6 months

• BLAT takes 2 months

• MAQ takes a day and a half

• Bowtie takes 17 minutes
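Tools such as Bowtie index the reference once, so that each read is looked up rather than searched for by scanning. Bowtie uses a Burrows-Wheeler (FM) index. The sketch below swaps in a much simpler hash table of k-mers, only to illustrate the index-then-look-up idea for exact matches. The function names and the choice of k = 8 are hypothetical; the reference is the start of the mitochondrial sequence used elsewhere in these slides.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_exact(read, reference, index, k):
    """Return reference positions where the whole read matches exactly."""
    hits = []
    # Look up the read's first k-mer, then verify the rest of the read
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGG"
index = build_kmer_index(reference, 8)
hits = find_exact("CCACTCACGGGAGCTC", reference, index, 8)
```

Building the index costs one pass over the reference; after that, each read costs a dictionary lookup plus a short verification, independent of reference length. Handling mismatches and gaps needs more machinery than this.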

Page 23

Matching

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

Page 24

Mapping

Take a read:

CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC

And a reference sequence:>MT dna:chromosome chromosome:GRCh37:MT:1:16569:1GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCAAACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAACTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAAACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGATACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAACACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGGAGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATACCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAGACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAGAAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAGAGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTCAAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGTCGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAGAAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTAGCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAATTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCTACCTAA
GAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGATAGAATCTTAGTTCAACTTTAAATTTGCCCACAGAACCCTCTAAATCCCCTTGTAAATTTAACTGTTAGTCCAAAGAGGAACAGCTCTTTGGACACTAGGAAAAAACCTTGTAGAGAGAGTAAAAAATTTAACACCCATAGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCTCAACACCCACTACCTAAAAAATCCCAAACATATAACTGAACTCCTCACACCCAATTGGACCAATCTATCACCCTATAGAAGAACTAATGTTAGTATAAGTAACATGAAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTAAAACACTGAACTGACAATTAACAGCCCAATATCTACAATCAACCAACAAGTCATTATTACCCTCACTGTCAACCCAACACAGGCATGCTCATAAGGAAAGGTTAAAAAAAGTAAAAGGAACTCGGCAAATCTTACCCCGCCTGTTTACCAAAAACATCACCTCTAGCATCACCAGTATTAGAGGCACCGCCTGCCCAGTGACACATGTTTAACGGCCGCGGTACCCTAACCGTGCAAAGGTAGCATAATCACTTGTTCCTTAAATAGGGACCTGTATGAATGGCTCCACGAGGGTTCAGCTGTCTCTTACTTTTAACCAGTGAAATTGACCTGCCCGTGAAGAGGCGGGCATAACACAGCAAGACGAGAAGACCCTATGGAGCTTTAATTTATTAATGCAAACAGTACCTAACAAACCCACAGGTCCTAAACTACCAAACCTGCATTAAAAATTTCGGTTGGGGCGACCTCGGAGCAGAACCCAACCTCCGAGCAGTACATGCTAAGACTTCACCAGTCAAAGCGAACTACTATACTCAATTGATCCAATAACTTGACCAACGGAACAAGTTACCCTAGGGATAACAGCGCAATCCTATTCTAGAGTCCATATCAACAATAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCGATGGTGCAGCCGCTATTAAAGGTTCGTTTGTTCAACGATTAAAGTCCTACGTGATCTGAGTTCAGACCGGAGTAATCCAGGTCGGTTTCTATCTACNTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTACTTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTATTATACCCACACCCACCCAAGAACAGGGTTTGTTAAGATGGC

How do we determine the read’s point of origin with respect to the reference?

Answer: sequence similarity

Hypothesis 1:

Read       CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC
           ||||||  ||||   ||||  |||||||||     |||| |||||
Reference  CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA

Hypothesis 2:

Read       CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
           |||||||||||| ||||||||||||||||||||| ||||| |
Reference  CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC

Which hypothesis is better?

Say hypothesis 2 is correct. Why are there still mismatches and gaps?
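One way to make "sequence similarity" concrete is to count matches, mismatches, and gap columns for each hypothesis. The sketch below does this for the two alignments on this slide. Treating an N in the read as a mismatch and weighting every event equally are simplifying assumptions.

```python
def alignment_summary(read_aln, ref_aln):
    """Count matches, mismatches, and gap columns in a gapped alignment."""
    assert len(read_aln) == len(ref_aln)
    matches = mismatches = gaps = 0
    for r, g in zip(read_aln, ref_aln):
        if r == "-" or g == "-":
            gaps += 1
        elif r == g:
            matches += 1
        else:
            mismatches += 1      # an N in the read counts as a mismatch here
    return matches, mismatches, gaps


# Hypothesis 1 and Hypothesis 2, copied from the slide
h1 = alignment_summary(
    "CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC",
    "CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA",
)
h2 = alignment_summary(
    "CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC",
    "CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC",
)
```

Hypothesis 2 has more matches and fewer mismatches and gaps, so it is the better candidate under this crude count. Aligners typically use weighted scores, often informed by base qualities, rather than raw counts.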

Page 25

More on variants and base-calling

Page 26

SNPs

GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT TTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

Page 27

SNPs

TCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTA TCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTTTG GTACTCGTCGCTGCGTTGAGGCTTGCGTTTTTGGT TGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTA GCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTAC CGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCT GCGTTGAGGCTTGCGTTTATGGTACGCTGGATTTT GTTGAGGCTTGCGTTTTTGGTACGCTGGACTTTGT GTTGAGGCTTGCGTTTATGGTACGCTGGGCTTTTT GAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGG CTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC TTGCGTTTATGGTACGCTGGACTTTGTAGGATACC GCGTTTATGGTACGCTGGACTTTGTAGGATACCCT CGTTTATGGTACGCTGGACTTTGTAGGATACCCTC GTTTATGGTACGCTGGACTTTGTAGGATACCCTCG TATGGTACGCTGGACTTTGTAGGATACCCTCGCTT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT ATGGTACGCTGGACTTTGTAGGATACCCTCGCTTT CTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTAGGATACCCTCGCTTTC

Page 28

All Reads

[Figure: nucleotide composition (0.0 to 1.0) against sequencing cycle (0 to 30); legend: A, T.]

Page 29

1000 Genomes Data

[Figure: histograms of number of calls against sequencing cycle (1 to 35) for sample NA19238. Column labels: SNPs in dbSNP, Novel SNPs. Row labels: All data; Filtered: snpq>=20, nreads<=360.

Panel titles, first set: UCSC Loci, n=42691; Non−UCSC Loci, n=4986; Hapmap Loci, n=19067; Non−Hapmap Loci, n=28610; 1kgenomes Loci, n=36333; Non−1kgenomes Loci, n=11344.

Panel titles, second set: UCSC Loci, n=48864; Non−UCSC Loci, n=15005; Hapmap Loci, n=20069; Non−Hapmap Loci, n=43800; 1kgenomes Loci, n=38306; Non−1kgenomes Loci, n=25563.]

Here we aggregate reads and record the cycle at which the variant appears.
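Recording the cycle at which a variant appears can be sketched as follows: for each read covering a variant position, the cycle is the 1-based offset of that position within the read. The reads are hypothetical, and the sketch assumes forward-strand, ungapped alignments; for reverse-strand reads the cycle would be counted from the other end.

```python
from collections import Counter


def variant_cycles(aligned_reads, pos, ref_base):
    """Histogram of the cycle (1-based position within the read) at which a
    non-reference base was observed at reference position `pos` (0-based).

    aligned_reads: (start, sequence) pairs; forward strand, ungapped.
    """
    cycles = Counter()
    for start, seq in aligned_reads:
        offset = pos - start
        if 0 <= offset < len(seq) and seq[offset] not in (ref_base, "N"):
            cycles[offset + 1] += 1
    return cycles


# Hypothetical reads; the reference base at position 3 is T
reads = [(0, "ACGTAC"), (1, "CGAACG"), (2, "GTACGG"), (3, "AACGGT")]
hist = variant_cycles(reads, 3, "T")
```

If variant calls pile up at particular cycles rather than spreading evenly, that points to a technical artifact rather than biology, which is the kind of pattern these histograms are meant to expose.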

Page 30

1000 Genomes Data

[Figure repeated from the previous slide: number of calls against sequencing cycle (1 to 35) for sample NA19238, SNPs in dbSNP and novel SNPs, for all data and for filtered data (snpq>=20, nreads<=360).]

Page 31

What is causing this?

Page 32

Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010

Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009

Each read: name, sequence, quality scores

x 100s of millions

(slide courtesy of Ben Langmead)

Page 33

Before Reads There were Intensities

Page 34

We Want to See This

Color coded by call made: A, C, G, T

Page 35

But See This

Color coded by call made: A, C, G, T

Four channel fluorescence intensity, cycle 1

Page 36

Gets Worse for higher cycles

Color coded by call made: A, C, G, T

Page 37

Error Rate and Reported Quality

[Figure, two panels, each plotted against sequencing cycle (0 to 60) with one series per substitution type (A>C, T>C, A>G, T>A, C>T, C>A, G>A, T>G, G>C, G>T, A>T, C>G): estimated error probabilities (0.0 to 0.4) and mismatch proportions (0.000 to 0.015).]

Page 38

Remember This?

[Figure: nucleotide composition (0.0 to 1.0) against sequencing cycle (0 to 30); legend: A, T.]

Page 39

Bias Explained

[Figure: scatter plot with axes 0.5(A+T) and A−T; points are marked as cycle < 20 or cycle ≥ 20.]

Page 40

Base Calling

1) Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008)

2) Erlich et al. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods (2008)

3) Kao et al. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res (2009)

4) Corrada Bravo and Irizarry. Model-Based Quality Assessment and Base-Calling for Second-Generation Sequencing Data. Biometrics (2009)

5) Cokus et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature (2009)

Page 41

Intensity Model

[Figure: intensity (axis ticks 12.5 to 14.0) against cycle.]

Page 42

Intensity Model

$u_{ijc}$: log intensity, read $i$, cycle $j$, channel $c$

$\Delta_{ijc}$: indicators of nucleotide identity, read $i$, pos. $j$

$$\Delta_{ijc} = \begin{cases} 1 & \text{if } c \text{ is the nucleotide in read } i \text{ position } j \\ 0 & \text{otherwise} \end{cases}$$

$$u_{ijc} = \Delta_{ijc}\left(\mu_{cj\alpha} + x_j^T \alpha_i + \epsilon^{\alpha}_{ijc}\right) + (1 - \Delta_{ijc})\left(\mu_{cj\beta} + x_j^T \beta_i + \epsilon^{\beta}_{ijc}\right)$$

Page 43

Intensity Model

$u_{ijc}$: log intensity, read $i$, cycle $j$, channel $c$

$x_j^T \alpha_i$, $x_j^T \beta_i$: read-specific linear models

$$u_{ijc} = \Delta_{ijc}\left(\mu_{cj\alpha} + x_j^T \alpha_i + \epsilon^{\alpha}_{ijc}\right) + (1 - \Delta_{ijc})\left(\mu_{cj\beta} + x_j^T \beta_i + \epsilon^{\beta}_{ijc}\right)$$

Page 44

Intensity Model

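$u_{ijc}$: log intensity, read $i$, cycle $j$, channel $c$

$\epsilon^{\alpha}_{ijc}$, $\epsilon^{\beta}_{ijc}$: measurement error

$$u_{ijc} = \Delta_{ijc}\left(\mu_{cj\alpha} + x_j^T \alpha_i + \epsilon^{\alpha}_{ijc}\right) + (1 - \Delta_{ijc})\left(\mu_{cj\beta} + x_j^T \beta_i + \epsilon^{\beta}_{ijc}\right)$$

$$\epsilon^{\alpha}_{ijc} \sim N(0, \sigma^2_{\alpha i}), \qquad \epsilon^{\beta}_{ijc} \sim N(0, \sigma^2_{\beta i})$$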
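To get a feel for the intensity model on these slides, one can simulate log intensities for a single read and then call the brightest channel at each cycle. This is a sketch under stated assumptions, not the authors' fitting procedure: it takes x_j = (1, j) so each read has its own intercept and slope over cycles, holds the mu terms constant across channels and cycles, and uses made-up parameter values.

```python
import random

CHANNELS = "ACGT"


def simulate_log_intensities(read, mu_alpha, mu_beta, alpha_i, beta_i,
                             sigma_alpha, sigma_beta, rng):
    """Draw u[j][c] for one read i.

    Assumptions (not from the slides): x_j = (1, j); mu_alpha and mu_beta
    do not vary with channel or cycle.
    """
    u = []
    for j, base in enumerate(read, start=1):
        row = {}
        for c in CHANNELS:
            if c == base:    # Delta_ijc = 1: channel of the true nucleotide
                mean = mu_alpha + alpha_i[0] + alpha_i[1] * j
                row[c] = mean + rng.gauss(0.0, sigma_alpha)
            else:            # Delta_ijc = 0: the other three channels
                mean = mu_beta + beta_i[0] + beta_i[1] * j
                row[c] = mean + rng.gauss(0.0, sigma_beta)
        u.append(row)
    return u


def naive_base_calls(u):
    """Call the channel with the highest log intensity at each cycle."""
    return "".join(max(row, key=row.get) for row in u)


rng = random.Random(0)
read = "GTTGAGGCTTGC"
# Hypothetical parameters: signal decays with cycle, background rises slowly
u = simulate_log_intensities(read, 14.0, 12.0, (0.0, -0.05), (0.0, 0.02),
                             0.1, 0.1, rng)
calls = naive_base_calls(u)
```

With these parameters the signal and background stay well separated, so the naive calls recover the read. Shrinking the gap or raising the noise shows how calls degrade at later cycles.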

Page 45

Read & Cycle Effects

[Figure, multiple panels: h(intensity) (11 to 14) against cycle (0 to 30).]

Page 46

Base Identity Probability Profiles

[Figure: probability (0 to 1) against position (1 to 35).]

Page 47

Before And After

[Figure, two panels, Solexa (Default) and Srfim (Statistical Approach): nucleotide composition (0.0 to 1.0) against sequencing cycle (0 to 30); legend: A, T.]

Page 48

The End