Mining hidden information from your 454 data using modular and database oriented methods

Joachim De Schrijver

Overview

Short introduction on 454 sequencing
Variant Identification Pipeline
Possibilities of a DB oriented pipeline
Examples
  ◦ Coverage
  ◦ Improving PCR
  ◦ Fast Q assessment
  ◦ Homopolymers

Introduction (i)

Roche/454 GS-FLX sequencing:
  ◦ Pyrosequencing
  ◦ ± 400,000 reads/run
  ◦ Average length: 200-250 bp
Applications:
  ◦ Resequencing: variant identification
  ◦ De novo (genome) sequencing: assembly of new regions, plasmids or entire genomes
Standard software:
  ◦ Variants: Amplicon Variant Analyzer (AVA)
  ◦ Assembly: standard 454 assembler

Introduction (ii)

Standard software
  ◦ + Easy to use
  ◦ + Reproducible results on similar datasets
  ◦ + GUI (graphical user interface)
  ◦ - No answer for ‘non-standard’ questions
      Methylation experiments
      Different types of experiments grouped together
      …
  ◦ - What about ‘hidden’ information?
      Homopolymer error rates
      Quality score ~ length of sequenced read
      ‘Multirun’ information
      …

Variant Identification Pipeline (i)

Modular and database oriented pipeline
Modular:
  ◦ Efficient planning
  ◦ Scalable
Database (DB):
  ◦ No loss of data
  ◦ Grouping several runs together

Variant Identification Pipeline (ii)

Basic idea: data is processed and stored in the DB. Results (reports) are calculated ‘on the fly’ from the DB data.
  ◦ Fast & efficient
  ◦ Calculations only happen once
  ◦ Everybody can access the database without risk of data modification
  ◦ Reporting is independent of the data processing

Paper: De Schrijver et al. 2009. Analysing 454 sequences with a modular and database oriented Variant Identification Pipeline
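The ‘process once, report on the fly’ idea can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `reads` table, its columns and both function names are assumptions made for the example, not the schema of the actual pipeline.

```python
import sqlite3

# Hypothetical minimal schema: one row per processed read.
SCHEMA = """
CREATE TABLE reads (
    read_id   TEXT PRIMARY KEY,
    run       TEXT,
    lane      INTEGER,
    mid       TEXT,
    amplicon  TEXT,      -- NULL when the read could not be mapped
    length    INTEGER
);
"""

def load_reads(conn, rows):
    """Processing step: runs once and stores its output in the DB."""
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()

def mapped_fraction_per_run(conn):
    """Reporting step: computed on the fly from the stored data, read-only."""
    query = """
        SELECT run, AVG(amplicon IS NOT NULL) AS mapped_fraction
        FROM reads GROUP BY run ORDER BY run
    """
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Toy rows, invented for the example.
load_reads(conn, [
    ("r1", "runA", 1, "MID1", "amp1", 240),
    ("r2", "runA", 1, "MID2", None, 95),
    ("r3", "runB", 2, "MID1", "amp2", 231),
])
report = mapped_fraction_per_run(conn)
```

A new report then only needs a new query; the stored reads are never reprocessed or modified.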

Possibilities of a DB oriented pipeline

VIP was originally developed for variant identification
Now being used in:
  ◦ Amplicon resequencing
  ◦ De novo shotgun
  ◦ Methylation
  ◦ ~ Solexa experiments
‘Hidden’ data can be extracted using intelligent querying strategies
Results per lane/multiplex MID/run…

Example: Detailed coverage

Coverage can be calculated per
  ◦ Lane
  ◦ MID
  ◦ Amplicon
  ◦ Base position
Assessment of errors (PCR dropouts vs. human errors)
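A minimal sketch of coverage per lane, MID, amplicon and base position, again with sqlite3. The `mapped_reads` table, its column names and the 1-based inclusive coordinates are assumptions for the example.

```python
import sqlite3

# Whitelist of grouping columns; column names cannot be bound as SQL parameters.
GROUPABLE = {"lane", "mid", "amplicon"}

def coverage_by(conn, column):
    """Number of mapped reads per lane, MID or amplicon."""
    if column not in GROUPABLE:
        raise ValueError(f"cannot group by {column!r}")
    query = (
        f"SELECT {column}, COUNT(*) FROM mapped_reads "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return dict(conn.execute(query).fetchall())

def coverage_per_position(conn, amplicon, amplicon_length):
    """Per-base coverage of one amplicon (1-based, inclusive coordinates)."""
    depth = [0] * amplicon_length
    rows = conn.execute(
        "SELECT start_pos, end_pos FROM mapped_reads WHERE amplicon = ?",
        (amplicon,),
    )
    for start, end in rows:
        for pos in range(start, end + 1):
            depth[pos - 1] += 1
    return depth

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapped_reads (lane INTEGER, mid TEXT, amplicon TEXT, "
    "start_pos INTEGER, end_pos INTEGER)"
)
# Toy alignments, invented for the example.
conn.executemany(
    "INSERT INTO mapped_reads VALUES (?, ?, ?, ?, ?)",
    [
        (1, "MID1", "amp1", 1, 5),
        (1, "MID2", "amp1", 3, 8),
        (2, "MID1", "amp2", 1, 4),
    ],
)
per_mid = coverage_by(conn, "mid")
per_base = coverage_per_position(conn, "amp1", 8)
```

An amplicon with zero reads across every MID would point to a PCR dropout, while a single empty MID or lane is more likely a handling error.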

[Figure: MID frequency (unmapped); x-axis: MID 1-12, y-axis: frequency, 0-14%]

Example: Improving PCR

Amplicon resequencing experiment
Goal: variant identification
Length distributions
  ◦ Mapped
  ◦ Unmapped
  ◦ ‘Short’ mapped
Additional length separation + improved PCR
Result: improved efficiency
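The three length distributions could be tabulated along these lines. The 100 bp cutoff for ‘short’ mapped reads and the 50 bp bin size are assumed values, not figures from the talk, and the reads are toy data.

```python
from collections import Counter

def classify(length, mapped, short_cutoff=100):
    """Assign a read to one of the three categories.

    short_cutoff is an assumed threshold, not a value from the talk.
    """
    if not mapped:
        return "unmapped"
    return "short mapped" if length < short_cutoff else "mapped"

def length_distributions(reads, bin_size=50, short_cutoff=100):
    """Histogram of read lengths per category, binned by bin_size bp."""
    hist = {"mapped": Counter(), "unmapped": Counter(), "short mapped": Counter()}
    for length, mapped in reads:
        category = classify(length, mapped, short_cutoff)
        bin_start = (length // bin_size) * bin_size
        hist[category][bin_start] += 1
    return hist

# Toy reads as (length in bp, mapped?) pairs.
reads = [(240, True), (225, True), (60, True), (45, False), (70, False)]
hist = length_distributions(reads)
```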

Example: Homopolymers

Can the length of a homopolymer be assessed using the Q score?
Yes, when the homopolymer length is < 6 bp
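One way to explore the relation between Q score and homopolymer length is to collect the mean Q of every homopolymer run and group the results by run length. The sketch below assumes reads are available as (sequence, list of Q values) pairs; the function names and the toy reads are made up for the example and say nothing about real error rates.

```python
from itertools import groupby
from collections import defaultdict

def homopolymer_runs(sequence, qualities, min_length=2):
    """Yield (base, run_length, mean_Q) for each homopolymer run in a read."""
    pos = 0
    for base, group in groupby(sequence):
        run_length = len(list(group))
        if run_length >= min_length:
            run_q = qualities[pos:pos + run_length]
            yield base, run_length, sum(run_q) / run_length
        pos += run_length

def mean_q_by_run_length(reads, min_length=2):
    """Average the per-run mean Q over all runs of the same length."""
    collected = defaultdict(list)
    for sequence, qualities in reads:
        for _, run_length, mean_q in homopolymer_runs(sequence, qualities, min_length):
            collected[run_length].append(mean_q)
    return {n: sum(v) / len(v) for n, v in sorted(collected.items())}

# Toy reads, invented for the example.
reads = [
    ("ACCGTTTA", [30, 28, 26, 30, 24, 20, 16, 30]),
    ("GGAAAC", [32, 28, 27, 21, 18, 30]),
]
summary = mean_q_by_run_length(reads)
```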

Example: Q assessment

Fast assessment of the quality of a run

[Figure: two plots of Q value ~ position (x-axis: position, y-axis: Q value), one labelled ‘Lab work OK’ and one labelled ‘Errors in lab work’]
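A Q value ~ position profile of this kind can be computed as the mean Q at each read position. A minimal sketch, assuming each read's Q values are available as a list; the function name and the toy input are illustrative.

```python
def mean_q_per_position(quality_lists):
    """Mean Q value at each read position, over the reads that reach it."""
    totals, counts = [], []
    for qualities in quality_lists:
        for i, q in enumerate(qualities):
            if i == len(totals):
                # First read that extends to this position.
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Toy Q values for three reads of different lengths.
profile = mean_q_per_position([[40, 30, 20], [30, 30], [20]])
```

Plotting `profile` for each run gives a quick visual check of run quality.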

Acknowledgements

Biobix – Ugent: Wim Van Criekinge, Tim De Meyer, Geert Trooskens, Tom Vandekerkhove, Leander Van Neste, Gerben Mensschaert
CMG – UZ Gent: Jo Vandesompele, Jan Hellemans, Filip Pattyn, Steve Lefever, Kim Deleeneer, Jean-Pierre Renard
NXT-GNT: Paul Coucke, Sofie Bekaert, Filip Van Nieuwerburgh, Dieter Deforce, Wim Van Criekinge, Jo Vandesompele

Questions?

[email protected]
