6. Ulrike Schoeck - GATC Biotech

GATC Biotech confidential VII © 2007-2011 Eagle Genomics Symposium "Provisioning bioinformatics - are we prepared?" Ulrike Schoeck GATC Biotech April 5th, 2011

Upload: eagle-genomics-ltd

Post on 11-May-2015

DESCRIPTION

Provisioning bioinformatics - are we prepared?

TRANSCRIPT

Page 1: 6. Ulrike Schoeck- GATC Biotech

GATC Biotech confidential VII © 2007-2011

Eagle Genomics Symposium

"Provisioning bioinformatics - are we prepared?"

Ulrike Schoeck

GATC Biotech

April 5th, 2011

Page 2: 6. Ulrike Schoeck- GATC Biotech

I. Introduction to GATC Biotech, providing sequencing service

II. Presentation of in-house sequencing technologies

III. Bioinformatics - definition and history

IV. Evolution of sequencing

V. Sequence analysis - what do we have to face every day?

VI. Sequencing applications - what is possible?

VII. Conclusions - are we prepared?

Agenda

Page 3: 6. Ulrike Schoeck- GATC Biotech

GATC Biotech - where we are

Page 4: 6. Ulrike Schoeck- GATC Biotech

GATC Biotech

• leading European commercial sequencing service provider

• over 20 years of experience and know-how

• ISO-certified since 1997

• 100% privately owned, self-financed & independent

• more than 125 employees in 5 subsidiaries, 22 sales offices

• 3-shift sequencing labs in Germany (Konstanz & Duesseldorf) and UK

• over 10,000 customers all over the world (industry & academia)

• Illumina Certified Service Provider

Complete and integrated sequencing & bioinformatic solutions:

from single sample to ultra high throughput

Page 5: 6. Ulrike Schoeck- GATC Biotech

Sequencing technologies in house

• Applied Biosystems ABI 3730xl (since 1996)

• Roche / 454 Genome Sequencer FLX (since 2006)

• Illumina / Solexa HiSeq 2000 (since 2006)

• Pacific Biosciences PacBio RS (May 2011)

Page 6: 6. Ulrike Schoeck- GATC Biotech

Sequencing capacity

[Figure: yearly sequencing capacity in Tb (axis 0 to 70) from "till 2006" to July 2011, with series for Applied Biosystems ABI 3730xl, GS FLX, GA, HiSeq and PacBio RS]

Page 7: 6. Ulrike Schoeck- GATC Biotech

System comparison

system           GS FLX                             HiSeq 2000                        PacBio RS
available since  2005 (GS 20 by 454 Life Sciences)  2006 (Genome Analyzer by Solexa)  2010
device           PicoTiterPlate w/ wells            flowcells w/ channels             SMRT cells w/ zero-mode waveguides
library          DNA fragmentation, adapter ligation (all systems)
amplification    emulsion PCR                       bridge PCR                        none
sequencing       sequencing by synthesis:           sequencing by synthesis:          sequencing by synthesis:
                 pyrosequencing                     cyclic reversible termination     single molecule, real-time

Page 8: 6. Ulrike Schoeck- GATC Biotech

Comparison

                         GS FLX                         HiSeq 2000                     PacBio RS
Read length              Ø 400 bases                    50 or 100 bases                > 1,000 bases
Mate pairs / paired end  averaging 140-200+ bases,      2 x 50 or 2 x 100 bases,       strobe reads
                         insert sizes ~ 3 kb & higher   insert sizes 300 b, ~ 3 kb
# of reads / run         > 1 million                    > 800,000,000 (single reads)   75,000 ZMWs / SMRT cell
                                                        > 1,600,000,000 (paired end)
base integration         same bases in one cycle        base after base,               base after base,
                         (homopolymers)                 cycle by cycle                 continuously
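
As a rough illustration of what these figures mean for data volume, the sketch below multiplies reads per run by read length. The inputs are the lower bounds quoted in the comparison; the function name `yield_gb` is only illustrative, and real yields vary from run to run.

```python
def yield_gb(reads_per_run: int, read_length: int) -> float:
    """Approximate bases per run in gigabases (1 Gb = 1e9 bases)."""
    return reads_per_run * read_length / 1e9


# (reads per run, read length in bases), taken from the comparison slide.
platforms = {
    "GS FLX": (1_000_000, 400),
    "HiSeq 2000 (paired end, 100 bases)": (1_600_000_000, 100),
}

for name, (reads, length) in platforms.items():
    print(f"{name}: roughly {yield_gb(reads, length):g} Gb per run")
```

Under these assumptions a single HiSeq 2000 run yields several hundred times the data of a GS FLX run, which is the scale the analysis infrastructure has to absorb.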

Page 9: 6. Ulrike Schoeck- GATC Biotech

Definition:

• Science explaining biology by using information technologies (computational biology)

• Providing algorithms, databases, user interfaces and statistical applications for specifying potential scientific significance

Bioinformatics

Page 10: 6. Ulrike Schoeck- GATC Biotech

Main object:

• Presentation of macromolecules as linear chains of defined components or as sequences of symbols

• Main application in bioinformatics: comparison of sequences for detecting homology (function, structure)

GCGTCCTCGGGCTTGGCGA

ACTGGGCGGCGGCGGTGGC

GGGCAGCAGCATGGGGGCG

GCA...

Bioinformatics
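
To make "comparison of sequences" concrete, here is a minimal sketch of a global alignment score in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not taken from the talk; production tools use more elaborate scoring schemes and heuristics.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b (dynamic programming)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4
print(global_alignment_score("ACGT", "AGT"))   # 2
```

A higher score means the two sequences can be brought into register with fewer mismatches and gaps, which is the kind of evidence a homology search weighs.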

Page 12: 6. Ulrike Schoeck- GATC Biotech

• Manual analysis of sequence homologies using standard word processing programmes

• Lasting change in molecular biology through the introduction of efficient computer algorithms

  • Sequence alignment

  • Phylogenetics

  • Pattern matching

  • Web-based database searches

  • …

History

Page 13: 6. Ulrike Schoeck- GATC Biotech

• Sanger sequencing (1 read per sequencing run)

• Roche (1,000,000 reads per sequencing run)

• Illumina (1,600,000,000 reads per sequencing run)

Evolution of Sequencing

Page 14: 6. Ulrike Schoeck- GATC Biotech

[Figure: yearly sequencing capacity in Tb (axis 0 to 70) from "till 2006" to July 2011]

Evolution of Sequencing

Page 15: 6. Ulrike Schoeck- GATC Biotech

• Massive amounts of sequence data produced by next-generation sequencing technologies

• Advantages

  • Applications, applications, applications…

  • Runtime

  • Costs

• Challenges

  • Data analysis and interpretation

  • Hardware infrastructure

  • Data storage

  • Software development

  • Error rates

Sequence analysis - today

Page 16: 6. Ulrike Schoeck- GATC Biotech

Example: de novo sequencing

@HWI-ST143_0345:7:1:1200:2150#CGATGT/1
TTCTTCTGATGCCGGCATCCCTGCTTGCAGGTGTGAAG
+
HHHHHHHHHHHHHHHHFHHHHHHHHHHHHHF=FBF:CF
@HWI-ST143_0345:7:1:1310:2072#CGATGT/1
CGTTTCTAAAGCACCCACTATGGATGNNCAGCAGGACA
+
GFFGEFEFFFGDGGCEBEE=EEEEE9##55++;(1@A:
...

Sequence analysis - today
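
The reads shown on this slide are in FASTQ format: four lines per read (header, bases, separator, per-base quality string). The sketch below parses such records and decodes the quality characters. It assumes a Phred+33 offset, which is a guess based on the characters that occur in these quality strings, and it reads the header fields as instrument/run, lane, tile, x, y, index and read number. Neither assumption is stated in the slides, and the function names are illustrative.

```python
from typing import Iterator, NamedTuple

FASTQ = """\
@HWI-ST143_0345:7:1:1200:2150#CGATGT/1
TTCTTCTGATGCCGGCATCCCTGCTTGCAGGTGTGAAG
+
HHHHHHHHHHHHHHHHFHHHHHHHHHHHHHF=FBF:CF
@HWI-ST143_0345:7:1:1310:2072#CGATGT/1
CGTTTCTAAAGCACCCACTATGGATGNNCAGCAGGACA
+
GFFGEFEFFFGDGGCEBEE=EEEEE9##55++;(1@A:
"""


class Read(NamedTuple):
    name: str
    seq: str
    quals: list  # one integer quality score per base


def parse_fastq(text: str, offset: int = 33) -> Iterator[Read]:
    """Yield reads from 4-line FASTQ records (offset 33 is an assumption)."""
    lines = [ln for ln in text.splitlines() if ln]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield Read(header[1:], seq, [ord(c) - offset for c in qual])


def parse_header(name: str) -> dict:
    """Split a header of the assumed form run:lane:tile:x:y#index/read."""
    coords, _, tail = name.partition("#")
    run, lane, tile, x, y = coords.split(":")
    index, _, read_no = tail.partition("/")
    return {"run": run, "lane": int(lane), "tile": int(tile),
            "x": int(x), "y": int(y), "index": index, "read": int(read_no)}


for read in parse_fastq(FASTQ):
    info = parse_header(read.name)
    print(info["lane"], info["index"], len(read.seq),
          "N count:", read.seq.count("N"), "min quality:", min(read.quals))
```

Under the Phred+33 assumption, the two uncalled bases (N) in the second read carry the lowest quality value in that read, which is why quality filtering usually precedes any downstream analysis.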

Page 17: 6. Ulrike Schoeck- GATC Biotech

Bioinformatics:

• Assembly

• Scaffolding

• Annotation

• Finishing

Sequence analysis - today

Page 18: 6. Ulrike Schoeck- GATC Biotech

Example: Quantitative transcriptomics

Sequence analysis - today

Page 19: 6. Ulrike Schoeck- GATC Biotech

Bioinformatics:

• Alignment

• Quantification

• Comparison

[Workflow diagram]

data → Preanalysis → cleaned sequences

Preanalysis: short quality reads, low complexity regions, cDNA adapters, sequencing primers

option 1: clustering → cluster representatives → BLAST analysis → cluster hits

option 2: assembly → de novo contigs → BLAST analysis → contig hits

Steps shown: Clustering, Assembly, Assembly validation, BLAST analysis

Sequence analysis - today
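
The preanalysis step appears to list what is filtered out before clustering or assembly. A minimal sketch of such a read filter follows. The adapter sequence, the length cutoff and the low-complexity rule (share of the most frequent base) are hypothetical placeholders chosen for illustration; the slides do not specify the actual criteria or tools used.

```python
from collections import Counter
from typing import Optional

# Made-up adapter sequence, used only as a placeholder.
ADAPTER = "GATCGATCGATC"


def clean_read(seq: str, adapter: str = ADAPTER, min_len: int = 20,
               max_base_fraction: float = 0.8) -> Optional[str]:
    """Return the cleaned read, or None if the read should be discarded."""
    # Trim everything from the first adapter occurrence onwards.
    pos = seq.find(adapter)
    if pos != -1:
        seq = seq[:pos]
    # Discard reads that are too short (or empty) after trimming.
    if len(seq) < max(min_len, 1):
        return None
    # Crude low-complexity check: a single base dominates the read.
    top_count = Counter(seq).most_common(1)[0][1]
    if top_count / len(seq) > max_base_fraction:
        return None
    return seq


raw_reads = [
    "ACGTTGCAAGCTTGACCATGGTCA" + ADAPTER + "TTTT",  # adapter gets trimmed
    "ACGTACGT",                                      # too short
    "A" * 30,                                        # low complexity
]
cleaned = [r for r in map(clean_read, raw_reads) if r is not None]
print(cleaned)
```

Only the first read survives in this toy input; the survivors correspond to the "cleaned sequences" that feed clustering or assembly.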

Page 20: 6. Ulrike Schoeck- GATC Biotech

Query           QLength  %HitLength  HitLength  %Identity  e-value    GeneID     GeneLength
GD3X8YD02G3UTK  473      105.07      497        88         1.00E-138  59783566   778
GD3X8YD02HAS2D  504      103.57      522        92         0          112201467  867
GD3X8YD01EQVR9  438      103.42      453        89         1.00E-129  82985781   904
GD3X8YD02F9LP1  372      103.23      384        92         1.00E-140  194673237  7376
GD3X8YD01BBAL3  435      103.22      449        91         1.00E-165  112362035  3276
GD3X8YD02IRTW2  413      103.15      426        87         1.00E-104  56145323   770
GD3X8YD01C9534  418      103.11      431        93         1.00E-167  157279321  3170
GD3X8YD01EJJ53  461      103.04      475        89         1.00E-137  73976208   2891

Cluster ID      Length (bp)  %HitLength  e-value    UniGene ID  Gene Length  Contig ID    Length (bp)  Hit Start  Hit End
GD3X8YD02G3UTK  473          105.07      1.00E-138  59783566    778          contig20575  2718         300        770
GD3X8YD02HAS2D  504          103.57      0          112201467   867          contig01324  2816         1230       1804
GD3X8YD01EQVR9  438          103.42      1.00E-129  82985781    904          contig01325  2825         67         513
GD3X8YD02F9LP1  372          103.23      1.00E-140  194673237   7376         contig01323  2903         2005       2375
GD3X8YD01BBAL3  435          103.22      1.00E-165  112362035   3276         contig01321  2980         400        830
GD3X8YD02IRTW2  413          103.15      1.00E-104  56145323    770          contig01320  2977         2300       2710
GD3X8YD01C9534  418          103.11      1.00E-167  157279321   3170         contig01318  2894         34         460
GD3X8YD01EJJ53  461          103.04      1.00E-137  73976208    2891         contig01315  2971         56         510
GD3X8YD02F06H3  427          103.04      0          194673243   2026         contig01314  2968         124        552
GD3X8YD02JQKXX  463          103.02      1.00E-180  146186547   4283         contig01322  2808         2287       2750
GD3X8YD01C2RA0  438          102.97      0          167693932   567          contig01319  2886         1456       1890
GD3X8YD01BMILD  439          102.96      1.00E-130  112362035   3276         contig01317  2960         1098       1530
GD3X8YD02IA76G  478          102.93      1.00E-155  160333384   4068         contig01316  2963         1004       1484
GD3X8YD02IJUER  443          102.93      0          114451341   864          contig05489  2657         95         1438
GD3X8YD01AUA6W  484          102.89      1.00E-119  59782054    869          contig22645  3358         3009       3486
GD3X8YD01CKSI4  451          102.88      1.00E-140  74268339    2212         contig20113  2734         1765       2109
GD3X8YD02JPD23  492          102.85      1.00E-167  151556820   3665         contig11912  2558         53         547
GD3X8YD02GG9F3  390          102.82      1.00E-105  219283151   3782         contig01371  2299         1453       1843

Sequence analysis - today
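
In the BLAST result table with the Query and HitLength columns, the %HitLength value appears to be the hit length expressed as a percentage of the query length (for example 497 / 473 gives about 105.07). This reading is inferred from the numbers and is not stated on the slide. A short check over those rows, with an illustrative helper name:

```python
def pct_hit_length(query_length: int, hit_length: int) -> float:
    """Hit length as a percentage of query length, rounded to two decimals."""
    return round(100 * hit_length / query_length, 2)


# (query, QLength, HitLength, reported %HitLength), copied from the table.
rows = [
    ("GD3X8YD02G3UTK", 473, 497, 105.07),
    ("GD3X8YD02HAS2D", 504, 522, 103.57),
    ("GD3X8YD01EQVR9", 438, 453, 103.42),
    ("GD3X8YD02F9LP1", 372, 384, 103.23),
    ("GD3X8YD01BBAL3", 435, 449, 103.22),
    ("GD3X8YD02IRTW2", 413, 426, 103.15),
    ("GD3X8YD01C9534", 418, 431, 103.11),
    ("GD3X8YD01EJJ53", 461, 475, 103.04),
]

for query, q_len, hit_len, reported in rows:
    computed = pct_hit_length(q_len, hit_len)
    print(f"{query}: computed {computed:.2f}, reported {reported:.2f}")
```

Values above 100 then simply mean that the aligned region on the database sequence is slightly longer than the query, for instance because the alignment contains gaps.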

Page 21: 6. Ulrike Schoeck- GATC Biotech

DNA

• single reads in tubes & plates (PCR, plasmids)

• whole (meta)genome de novo sequencing

• whole genome re-sequencing

• targeted re-sequencing (enrichment, amplicons, exons)

• methylome / epigenome studies

• ChIP-Seq

RNA

• eukaryotic / prokaryotic cDNA de novo sequencing

• eukaryotic / prokaryotic cDNA re-sequencing (3’ UTR / 5’ UTR)

• small RNA / microRNA

Sequencing applications

Page 22: 6. Ulrike Schoeck- GATC Biotech

• Massive amounts of sequence data produced by next-generation sequencing technologies

• Advantages

  • Applications, applications, applications…

  • Turnover

  • Costs

• Challenges

  • Data analysis and interpretation

  • Hardware infrastructure

  • Data storage

  • Software development

  • Error rates

Sequence analysis - today

Page 23: 6. Ulrike Schoeck- GATC Biotech

Conclusion

Provisioning bioinformatics: Are we prepared?

Advancements in sequencing technologies (data quantity and application complexity)

and

Advancements in information technologies (hardware and software)

GAP

SOLUTION: Cloud computing, GPU usage, software development, parallelization...

Page 24: 6. Ulrike Schoeck- GATC Biotech

Thanks for your kind attention.

Open questions?

www.gatc-biotech.com