Genome Assembly With Next Generation Sequencers 3 May, 2011 Jongsun Park Personal Genomics Institute


Page 1: New Genome Assembly With Next Generation Sequencers · 2011. 5. 6. · Next Generation Sequencing (NGS) Technologies - Next (or Current) generation sequencing technologies have accelerated

Genome Assembly With Next Generation Sequencers

3 May, 2011

Jongsun Park

Personal Genomics Institute

Page 2

Table of Contents

1 Central Dogma and –Omics Studies

2 History of Sequencing Technologies

3 Genome Assembly Processes With NGS Sequences

4 Current Status of Plant Genomes

Page 3

Central Dogma and -Omics Studies

Page 4

Central Dogma in Molecular Biology and Bioinformatics

DNA (Genomics) → RNA (Transcriptomics) → Proteins (Proteomics)

Advances of Biotechnologies

Bioinformatics

Page 5

Further –Omics Studies With Bioinformatics

Genomics Population Genomics

Phylogenomics

Transcriptomics

Gene Expression

Regulatory Network

Non-coding RNAs

Epigenomics

Proteomics Metabolomics

Pathway Analysis

3D structure

Page 6

Position of Sequences in Central Dogma!

DNA: AGCTACGTGAGAGACGTACTGTAC…

RNA: AGCUACGUGAGAGACGUACUGUAC…

Protein: MASTWTSWAMTCCAAMST…

Target for “Sequencing”: DNA and RNA sequences

Predicted from sequences: protein sequences

Page 7

History of Sequencing Technologies

Page 8

Classical Method: Sanger Sequencing

- Using dideoxynucleotides (ddNTPs), DNA synthesis is terminated at random positions in four different reaction tubes, one for each ddNTP.

Tube 1: DNA template + primer + ddATP: new DNA strands ending in A

Tube 2: DNA template + primer + ddTTP: new DNA strands ending in T

Tube 3: DNA template + primer + ddGTP: new DNA strands ending in G

Tube 4: DNA template + primer + ddCTP: new DNA strands ending in C

http://en.wikipedia.org/wiki/File:Sequencing.jpg
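The chain-termination idea can be illustrated with a small Python sketch. It is an idealized model, not a simulation of the chemistry: it assumes that every position of the new strand yields exactly one terminated fragment, and the function name `sanger_read` is hypothetical.

```python
def sanger_read(new_strand: str) -> str:
    """Rebuild a strand from the fragment lengths seen in four ddNTP reactions."""
    # One reaction per dideoxynucleotide: each stores the lengths of the
    # fragments that were terminated by that base.
    reactions = {base: [] for base in "ATGC"}
    for position, base in enumerate(new_strand, start=1):
        reactions[base].append(position)

    # Reading the gel: order all fragments by length; the reaction a fragment
    # came from tells us which base sits at that position.
    fragments = sorted(
        (length, base)
        for base, lengths in reactions.items()
        for length in lengths
    )
    return "".join(base for _, base in fragments)


print(sanger_read("GCAAAACACTTA"))  # GCAAAACACTTA
```

Sorting by fragment length plays the role of electrophoresis, which separates fragments by size.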

Page 9

Automated Method: Sanger Sequencing

- Using four fluorescent dyes, a scanner can read sequences directly.

- With 96 or 384 capillaries, a machine can read 96 or 384 different samples at one time.

DNA template + primer + ddATP with dye: new DNA strands ending in A

DNA template + primer + ddTTP with dye: new DNA strands ending in T

DNA template + primer + ddGTP with dye: new DNA strands ending in G

DNA template + primer + ddCTP with dye: new DNA strands ending in C

http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg

Page 10

Capacity of Automated Sanger Sequencing

- Per capillary, 700 bases (with a Quality Value higher than 20) can be read in one hour.

- Per machine, an ABI3730 with a 384-capillary kit can produce 700 * 384 * 24 = 6,451,200 bp per day, without considering the sample preparation process.

https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=catNavigate2&catID=600533&tab=Overview

- To obtain 1x coverage of the human genome with one such machine, it would take 465 days (3,000,000,000 bp / 6,451,200 bp = 465.029..).

- For de novo assembly, 6x to 10x coverage is usually required: at 10x, it would take about 4,650 days to obtain enough sequence for a genome project.
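The arithmetic on this slide can be checked with a short Python sketch. It assumes a human genome of 3,000,000,000 bp, which is the size that gives the quoted 465 days, and ignores sample preparation as the slide does; the constant and function names are illustrative.

```python
BASES_PER_CAPILLARY_PER_HOUR = 700
CAPILLARIES = 384
HOURS_PER_DAY = 24
HUMAN_GENOME_BP = 3_000_000_000  # assumption: roughly 3 Gb


def bases_per_day() -> int:
    """Daily output of one machine running around the clock."""
    return BASES_PER_CAPILLARY_PER_HOUR * CAPILLARIES * HOURS_PER_DAY


def days_for_coverage(coverage: float, genome_bp: int = HUMAN_GENOME_BP) -> float:
    """Days one machine needs to produce coverage x genome_bp bases."""
    return coverage * genome_bp / bases_per_day()


print(f"{bases_per_day():,} bp per day")  # 6,451,200 bp per day
for coverage in (1, 6, 10):
    print(f"{coverage}x: {days_for_coverage(coverage):,.0f} days")
```

Under these assumptions the 6x to 10x range corresponds to roughly 2,790 to 4,650 days on a single machine.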

Page 11

Next Generation Sequencing (NGS) Technologies

- Next (or current) generation sequencing technologies have accelerated genome sequencing projects and have broadened the range of applications of genome sequences.

Solexa; Illumina

SOLiD; ABI

GS-Titanium; Roche 454

SMRT; Pacific Biosciences

Helicos; Helicos BioSciences

Page 12

NGS: 454 Technology

Page 13

NGS: Solexa Technology

Page 14

NGS: SOLiD Technology (1)

Page 15

NGS: SOLiD Technology (2)

Page 16

NGS: SOLiD Technology (3)

Page 17

Capacities of Next Generation Sequencers

ABI 3730; ABI: 384 x 700 bp = 268,800 bp = 269 Kb (per one reaction / 1 hr)

GS-Titanium; Roche 454: 950,000 x 450 bp = 427,500,000 bp = 427.5 Mb (per one reaction / 2-3 days)

Solexa GA2; Illumina: 35,000,000 x 7 x (151 x 2) bp = 73,990,000,000 bp = 74.0 Gb (per one reaction / 12 days)

SOLiD 4; ABI: 1,400,000,000 x 75 bp (50+25) = 105,000,000,000 bp = 105.0 Gb (per one reaction / 11 days)

HiSeq2000; Illumina: 70,000,000 x 14 x (101 x 2) bp = 197,960,000,000 bp = 198.0 Gb (per one reaction / 8 days)
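As a rough comparison, the per-run yields can be recomputed directly from the read counts and read lengths listed on this slide. The Python sketch below also derives an approximate yield per day; the 2.5-day run time for GS-Titanium is an assumption (the midpoint of the stated 2-3 days), and the dictionary layout is illustrative only.

```python
# (reads or capillaries per run, bases per read, days per run), from the slide
PLATFORMS = {
    "ABI 3730": (384, 700, 1 / 24),
    "GS-Titanium (Roche 454)": (950_000, 450, 2.5),  # assumption: midpoint of 2-3 days
    "Solexa GA2 (Illumina)": (35_000_000 * 7, 151 * 2, 12),
    "SOLiD 4 (ABI)": (1_400_000_000, 75, 11),
    "HiSeq2000 (Illumina)": (70_000_000 * 14, 101 * 2, 8),
}


def yield_per_run(reads: int, read_length: int) -> int:
    """Total bases produced in one run."""
    return reads * read_length


for name, (reads, read_length, days) in PLATFORMS.items():
    total = yield_per_run(reads, read_length)
    print(f"{name}: {total:,} bp per run, about {total / days:,.0f} bp per day")
```

Normalizing by run time shows that the gap between capillary sequencing and the NGS platforms is smaller per day than per run, but still several orders of magnitude.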

Page 18

Single Read and Mate-Pair (Paired-end) Sequences

- Single read is the simplest method: each sample (cluster or bead) is read in one direction only.

- The mate-pair (or paired-end) method reads short sequences from both ends of each fragment, similar to BAC-end, fosmid-end, and cosmid-end sequences.

- Mate-pair (paired-end) reads are useful for building larger sequences that contain gaps (this is called scaffolding).

Page 19

Summary of Costs For NGS Technologies

The sequence explosion, Nature, 464, 670-671 (2010)

Page 20

Pros and Cons of NGS Technologies

Pros

- Large number of reads per run

- Extremely low cost for generating a huge amount of sequence

- Diverse applications, not only for genomics but also for transcriptomics, small RNAs, epigenomics, etc., with small modifications of protocols

Cons

- Short read length (36 bp to 151 bp, or 450 bp)

- Different types of sequencing quality across platforms (GS, Solexa, and SOLiD)

- Large sequence data sets are difficult to handle with ordinary programs, including those for de novo assembly

- Impossible to obtain a small amount of sequence at low cost

Page 21

Genome Assembly Processes With NGS Sequences

Page 22

Whole-Genome Shotgun Sequencing Strategy

- An assembly process is essential for a genome project because the read length of each sequence is less than 1 kb.

- For Sanger sequences, assembly was conducted with several popular programs, such as phrap and PCAP3.

Genome Assembly

Scaffolding

Assembled genome

Page 23

Genome Assembly Process

- We can perform genome assembly manually!

23 sequences must each be compared with every other sequence!

23C2 = 23*22 / 2 = 253 comparisons!
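The count of comparisons can be reproduced with a few lines of Python. This sketch only counts pairs; the read names are placeholders, and `pairwise_comparisons` is a hypothetical helper name.

```python
from itertools import combinations


def pairwise_comparisons(n_reads: int) -> int:
    """Number of unordered read pairs, nC2 = n * (n - 1) / 2."""
    return n_reads * (n_reads - 1) // 2


reads = [f"read_{i}" for i in range(1, 24)]  # 23 placeholder read names
pairs = list(combinations(reads, 2))         # every pair that would be compared

print(len(pairs))                # 253
print(pairwise_comparisons(23))  # 253
```

Because the count grows with the square of the number of reads, doubling the reads roughly quadruples the comparisons, which is why all-against-all comparison stops being practical for large read sets.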

Page 24

How to Find Overlapped Sequences?

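- Using a dynamic programming algorithm, we can write a program that finds similar sequences.

- The complexity of this algorithm is O(n^3) in its general form with arbitrary gap penalties; with a simple linear gap penalty, two sequences of length n can be aligned in O(n^2).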

http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
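A minimal Python sketch of dynamic programming for finding similar regions is given here. It is a Smith-Waterman style local alignment score with a linear gap penalty, which fills a table of size len(a) x len(b); the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not values from the slides.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of an alignment ending at a[i-1] and b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# Two reads that share the 7-base stretch ACACTTA
print(local_alignment_score("GCAAAACACTTA", "ACACTTATTCGT"))  # 7
```

Each cell depends only on its three neighbours, so the table is filled once, row by row, instead of enumerating every possible alignment.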

Page 25

Classification of Alignments (1)

Considering whole sequences for alignment?

Yes: Global alignment

No: Local alignment

Page 26

Classification of Alignments (2)

How many sequences are considered for alignment?

2 sequences: Pair-wise alignment

More than 2 sequences: Multiple alignment

Page 27

Famous Bioinformatics Tools for Alignments

- Global alignment: ClustalW, T-Coffee, and MUSCLE

- Local alignment: FASTA and BLAST (Basic Local Alignment Search Tool), the latter provided by NCBI.

- Pair-wise alignment: BLAST and FASTA

- Multiple sequence alignment: ClustalW, T-Coffee, MUSCLE, etc.

Page 28

Example of Genome Assembly: Vitis vinifera

Pair-wise comparison of 6,200,000 reads

6,200,000C2 = 19,219,996,900,000 comparisons

Page 29

Genome Assemblers

http://www.phrap.org/

Page 30

Short-read Sequence Assembly (1)

- Short-read sequences generated by NGS machines cause several problems for well-established genome assemblers.

- Too many reads: comparing them all against each other requires impractically large computational power.

- Reads that are too short do not give reliable overlaps for building long contigs.

- To reduce the computational power required, a new algorithm was needed.

- Short reads require another strategy to build reliable contig sequences.

- Handling this many sequences also causes several technical problems, such as limits on physical memory, hard disk space, and computing power.

Page 31

Short-read Sequence Assembly (2)

563,466,202 reads: 563,466,202C2 = 158,747,080,116,419,301 comparisons?!

Page 32

De Bruijn Graph: Alternative Method For Alignment

Page 33

De Bruijn Graph Algorithm For Alignment (1)

- This algorithm is used to find overlapping short-read sequences quickly.

- This algorithm consists of three parts:

i) Generating k-mer sequences

ii) Constructing the de Bruijn graph

iii) Resolving the graph to generate sequences
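The first two parts can be sketched in Python as follows. This is a simplified illustration, not the implementation of any particular assembler: it uses k-mers as graph nodes and links each k-mer to the one that follows it in a read, whereas many descriptions of de Bruijn graphs use (k-1)-mers as nodes and k-mers as edges. The function names are hypothetical.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """Part i: all overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Part ii: link each k-mer to the k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        words = kmers(read, k)
        for current, following in zip(words, words[1:]):
            graph[current].add(following)
    return dict(graph)


graph = build_graph(["GCAAAACACTTA", "ACACTTATTCGT"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Identical k-mers from different reads collapse into the same node, so overlaps between reads appear as shared paths without any read-against-read comparison.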

Page 34

De Bruijn Graph Algorithm For Alignment (2)

K = 3

Read 1: GCAAAACACTTA…

k-mers of read 1 (graph nodes 1-1 to 1-10): GCA, CAA, AAA, AAA, AAC, ACA, CAC, ACT, CTT, TTA

Page 35

De Bruijn Graph Algorithm For Alignment (3)

K = 3

Read 1: GCAAAACACTTA… (graph nodes 1-1 to 1-10)

Read 2: ACACTTATTCGT

k-mers of read 2 (graph nodes 2-1 to 2-10): ACA, CAC, ACT, CTT, TTA, TAT, ATT, TTC, TCG, CGT

The first five k-mers of read 2 (ACA, CAC, ACT, CTT, TTA) are the same as the last five k-mers of read 1; TAT, ATT, TTC, TCG, and CGT are new.

Page 36

De Bruijn Graph Algorithm For Alignment (4)

Graph path: nodes 1-1 to 1-10 (read 1), extended by the new k-mers of read 2 (nodes 2-1 to 2-10): TAT, ATT, TTC, TCG, CGT

Resolved sequence: GCAAAACACTTATTCGT
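The two-read example from these slides can be replayed with a small Python sketch. It is deliberately simplified: instead of traversing a full graph, it joins two reads when the end of the first read's k-mer path matches the start of the second read's path, then spells out the merged path. Real assemblers must also handle repeats, errors, and branching, which this sketch ignores; `merge_reads` is a hypothetical name.

```python
def kmers(read: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def merge_reads(first: str, second: str, k: int) -> str:
    """Join two reads whose k-mer paths overlap (end of first = start of second)."""
    path_a, path_b = kmers(first, k), kmers(second, k)
    # Try the longest shared run of k-mers first.
    for shared in range(min(len(path_a), len(path_b)), 0, -1):
        if path_a[-shared:] == path_b[:shared]:
            merged = path_a + path_b[shared:]
            # Spell the path: the first k-mer, then the last base of each later k-mer.
            return merged[0] + "".join(word[-1] for word in merged[1:])
    raise ValueError("reads share no k-mer overlap")


print(merge_reads("GCAAAACACTTA", "ACACTTATTCGT", k=3))  # GCAAAACACTTATTCGT
```

Here the shared run is the five k-mers ACA, CAC, ACT, CTT, TTA, so only the five remaining k-mers of the second read extend the path.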

Page 37

Genome Assemblers For Short Read Sequences

Page 38

Examples of Plant Genome de novo Assembly

5,937,915,739 bp

# of contigs: 950
Total length: 976,089 bp
Maximum length: 12,606 bp
Average length: 1,027.46 bp
N50 length: 3,061 bp

Species | Lithocarpus hancei | Ficus altissima | Ficus altissima | Ficus altissima | Ficus tinctoria | Ficus tinctoria
# of contigs | 462,868 | 355,052 | 132,590 | 247,376 | 337,777 | 476,937
Total length | 112,614,098 | 87,502,701 | 33,293,636 | 61,369,608 | 87,427,716 | 116,554,688
Maximum length | 1,748 | 1,090 | 1,688 | 1,334 | 1,274 | 1,578
Average length | 243.30 | 246.45 | 251.10 | 248.08 | 258.83 | 244.38
N50 length | 237 | 239 | 245 | 241 | 248 | 3061
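For readers unfamiliar with the N50 statistic used in these tables, a minimal Python sketch follows. It uses the common definition (the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches half of the total assembly length); the contig lengths in the example are hypothetical.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


example = [12, 9, 7, 5, 4, 3]  # hypothetical contig lengths, total 40
print(n50(example))            # 9, because 12 + 9 = 21 is at least half of 40
```

By construction the N50 is always one of the contig lengths, so it can never exceed the maximum contig length of the same assembly.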

Page 39

Giant Panda Genome Project

Page 40

Sordaria macrospora Genome Project

Page 41

Large Scale Genome Projects: 1000 Human Genomes

http://www.1000genomes.org/page.php

http://en.wikipedia.org/wiki/File:Genetic_Variation.jpg

Page 42

Large Scale Genome Projects: BGI 1000 Genomes

http://www.ldl.genomics.cn/page/index.jsp

Page 43

Large Scale Genome Projects: Genome 10K Project

http://genome10k.org/

Page 44

Current Status of Plant Genome Projects

Page 45

Species name | Method | Size (Mb) | # of contigs | # of transcripts
Arabidopsis lyrata | WGS | 206.67 | 695 | 32,670
Medicago truncatula | BAC, WGS | 278.69 | 9 | 38,334
Selaginella moellendorffii | WGS | 212.76 | 768 | 22,285
Lycopersicon esculentum | WGS, BAC | 794.60 | 7,409 | 49,389
Solanum phureja | WGS | 702.58 | 57,681 | 110,512
Ricinus communis | WGS | 362.47 | 28,518 | 38,613
Mimulus guttatus | WGS | 416.66 | 11,243 | 47,442
Manihot esculenta | WGS | 321.73 | 2,216 | 27,501
Phoenix dactylifera | WGS | 284.68 | 234,704 | -
Prunus persica | WGS | 227.25 | 202 | 28,689
Oryza glaberrima | WGS | 316.08 | 5,309 | -

11 Unpublished Higher Plant Genomes

All pictures are from Wikipedia.

Page 46

Species name | Journal | Method | Size (Mb) | # of contigs | # of transcripts
Arabidopsis thaliana | Nature, 2000 | BAC +α | 119.19 | 5 | 32,615
Oryza sativa japonica | Science, 2002; Nature, 2005 | BAC +α | 372.08 | 12 | 66,710
Oryza sativa indica | Science, 2002; PLoS Biology, 2005 | WGS | 426.32 | 10,267 | 49,710
Oryza sativa japonica (Syngenta) | PLoS Biology, 2005 | WGS | 391.14 | 7,777 | 45,824
Populus trichocarpa | Science, 2006 | WGS | 485.51 | 22,012 | 45,555
Vitis vinifera | Nature, 2007 | WGS, Complete | 497.51 | 35 | 30,434
Carica papaya | Nature, 2008 | WGS | 369.69 | 17,677 | 28,589
Lotus japonicus | DNA Research, 2008 | WGS | 323.24 | 110,945 | 26,700
Sorghum bicolor | Nature, 2009 | WGS | 738.54 | 3,304 | 36,338
Zea mays | Science, 2009 | BAC, WGS | 2,061.02 | 11 | 53,764
Cucumis sativus | Nature Genetics, 2009 | WGS | 243.57 | 47,488 | 26,682
Glycine max | Nature, 2010 | WGS | 996.90 | 4,262 | 62,199
Brachypodium distachyon | Nature, 2010 | WGS | 273.27 | 197 | 32,255
Malus x domestica | Nature Genetics, 2010 | WGS | 871.56 | 144,621 | 57,386*

14 Published Higher Plant Genomes from 12 Species

All pictures are from Wikipedia.

Page 47

Species name | Journal | Method | Size (Mb) | # of contigs | # of transcripts
Chlamydomonas reinhardtii | Science, 2007 | WGS | 112.31 | 88 | 16,709
Micromonas pusilla CCMP1545 | Science, 2009 | WGS | 22.04 | 27 | 10,547
Micromonas sp. RCC299 | Science, 2009 | WGS | 20.99 | 17 | 10,108
Ostreococcus lucimarinus CCE9901 | PNAS, 2007 | WGS | 13.2 | 21 | 7,488
Ostreococcus sp. RCC809 | Not published yet | WGS | 13.41 | 22 | 7,492
Ostreococcus tauri | PNAS, 2007 | WGS | 12.58 | 118 | 7,725
Coccomyxa sp. C169 | Not published yet | WGS | 48.95 | 45 | 9,629

7 Unicellular Plant Genomes

All pictures are from Wikipedia.

Page 48

Distribution of Genome Sizes of 32 Plants

[Bar chart: genome size (Mb) per genome, y-axis 0 to 2,500 Mb; unicellular plants are marked]

Genome sizes (Mb): 12.6, 13.2, 13.4, 21.0, 22.0, 49.0, 112.3, 119.7, 206.7, 212.8, 225.9, 227.3, 273.3, 284.7, 307.5, 316.1, 321.7, 323.2, 362.5, 369.7, 372.3, 391.1, 416.7, 426.3, 485.5, 497.5, 702.6, 738.5, 781.8, 871.6, 996.9, 2061.0

Page 49

Distribution of Number of Transcripts of 29 Plants

[Bar chart: # of transcripts per genome, y-axis 0 to 120,000; unicellular plants are marked]

# of transcripts: 7,488; 7,492; 7,725; 9,629; 10,108; 10,547; 16,709; 22,285; 26,682; 26,700; 27,501; 28,589; 28,689; 30,434; 32,255; 32,670; 33,410; 36,338; 38,613; 45,555; 45,824; 47,442; 49,483; 49,710; 53,423; 53,764; 62,199; 67,393; 110,512

Page 50

Relationship between Genome Size and Transcripts

[Bar chart: # of transcripts / genome size (Mb) per genome, y-axis 0 to 700; unicellular plants are marked]

# of transcripts per Mb: 26.1, 49.2, 61.2, 62.4, 63.3, 77.3, 82.6, 85.5, 93.8, 104.7, 106.5, 113.9, 116.6, 117.2, 118.0, 118.1, 126.2, 148.8, 157.3, 158.1, 173.7, 181.0, 196.7, 279.2, 478.5, 481.6, 558.7, 567.3, 614.1
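The ratio plotted on this slide is the number of transcripts divided by the genome size in Mb. As an illustration, the Python sketch below recomputes it for the seven unicellular plant genomes, using the sizes and transcript counts from the unicellular genome table earlier in the deck; the variable and function names are illustrative.

```python
# (genome size in Mb, number of transcripts), from the unicellular genome table
UNICELLULAR = {
    "Chlamydomonas reinhardtii": (112.31, 16_709),
    "Micromonas pusilla CCMP1545": (22.04, 10_547),
    "Micromonas sp. RCC299": (20.99, 10_108),
    "Ostreococcus lucimarinus CCE9901": (13.2, 7_488),
    "Ostreococcus sp. RCC809": (13.41, 7_492),
    "Ostreococcus tauri": (12.58, 7_725),
    "Coccomyxa sp. C169": (48.95, 9_629),
}


def transcripts_per_mb(size_mb: float, transcripts: int) -> float:
    """Transcript density: number of transcripts per megabase of genome."""
    return transcripts / size_mb


for species, (size_mb, transcripts) in UNICELLULAR.items():
    print(f"{species}: {transcripts_per_mb(size_mb, transcripts):.1f} transcripts per Mb")
```

The small Ostreococcus and Micromonas genomes come out at roughly 480 to 610 transcripts per Mb, which corresponds to the five highest values listed for the chart.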

Page 51

Comparisons with Genomes in Other Kingdoms

[Bar chart: genome size (Mb) by group, y-axis -500 to 3,500 Mb]

Groups: Bacteria (5 species), Oomycota (6 species), Eukaryota (23 species), Fungi (256 species), Metazoa (90 species), Viridiplantae (32 species)

Values shown (Mb): 5.0, 102.9, 23.8, 24.1, 1587.1, 392.2

Page 52

Cucumber Genome Sequences

Page 53

Thank you for your attention!