hard assembly Jan Pačes Institute of Molecular Genetics AS CR


Post on 04-Jan-2016


Page 1: Hard assembly Jan Pačes Institute of Molecular Genetics AS CR

hard assembly

Jan Pačes

Institute of Molecular Genetics AS CR

Page 2:

problems

genomes: high GC content; repetitions (short, with low information content, and long); polymorphism; "unreadable" sequences, "weird" structures

technologies: nonrandom libraries; wrong sizes; erroneous or chimeric reads

Page 3:

sequencing technologies

ABI (Sanger)

454 (pyrosequencing)

Solexa (reversible terminator)

SOLiD (2-base ligation)

PacBio (SMRT)

Page 4:

example of errors in one technology

http://chevreux.org/mira_ex_454sanger.html

Page 5:

Aird et al. Genome Biology 2011

high GC regions are underrepresented

Page 6:

Aird et al. Genome Biology 2011

protocol optimization for high GC content
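
A minimal sketch of how GC-rich windows could be located in a sequence, for example to check whether poorly covered regions coincide with high GC content. The function names, the 100 bp window, the 50 bp step, and the 0.7 threshold are illustrative assumptions; none of them come from the slides or from Aird et al.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Windows made only of N or other ambiguity codes get 0.0.
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int = 100, step: int = 50):
    """Yield (start, gc_fraction) for sliding windows along seq.

    A sequence shorter than `window` yields one window covering all of it.
    """
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, gc_fraction(seq[start:start + window])


def high_gc_windows(seq: str, window: int = 100, step: int = 50,
                    threshold: float = 0.7):
    """Return the windows whose GC fraction reaches the (assumed) threshold."""
    return [(start, gc) for start, gc in windowed_gc(seq, window, step)
            if gc >= threshold]
```

Comparing the start positions returned by `high_gc_windows` with a per-base coverage track is one simple way to see whether coverage dips follow GC content.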

Page 7:

repetitions

scaffold

repetition

Page 8:

repetitions

Page 9:

repetition recognition

MIRA http://sourceforge.net/projects/mira-assembler/

MaSuRCA http://www.genome.umd.edu/masurca.html

SPAdes http://bioinf.spbau.ru/spades

RepeatMasker http://www.repeatmasker.org/

RepeatModeler (RECON and RepeatScout) http://www.repeatmasker.org/RepeatModeler.html

position aware assemblers

Page 10:

k-mer distribution

Page 11:

k-mer analysis

JELLYFISH - fast, parallel k-mer counting for DNA http://www.cbcb.umd.edu/software/jellyfish/

Quake - a package to correct substitution sequencing errors in experiments with deep coverage http://www.cbcb.umd.edu/software/quake/

khmer - trims off likely erroneous k-mers https://khmer-protocols.readthedocs.org/en/v0.8.2/
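
On real data the tools above are the ones to use. The sketch below only illustrates what a k-mer count and a k-mer distribution (multiplicity histogram) are. In such a histogram, k-mers seen only once or twice are usually sequencing errors, the main peak sits near the k-mer coverage, and k-mers at multiples of the peak tend to come from repetitions. The function names are hypothetical, and counting each k-mer together with its reverse complement is an assumption made here, not something stated on the slides.

```python
from collections import Counter

# Translation table for reverse-complementing DNA.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k: int) -> Counter:
    """Count canonical k-mers over an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or other ambiguity codes.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts: Counter) -> Counter:
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(counts.values())
```

A pure-Python counter like this is only practical for small inputs; that is the reason dedicated counters exist.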

Page 12:

repetitions

scaffold

repetition

Page 13:

filling gaps

GapCloser (part of SOAPdenovo) http://soap.genomics.org.cn/soapdenovo.html

GapFiller (part of SSPACE) http://www.baseclear.com/lab-products/bioinformatics-tools/gapfiller/

GapFiller http://sourceforge.net/projects/gapfiller/

Page 14:

454 multiplicates
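
Assuming "multiplicates" here means artificially duplicated 454 reads (reads that start at the same position and share the same leading bases), one common heuristic is to collapse reads with an identical prefix and keep the longest one. The sketch below shows that heuristic only; the function name and the 20-base prefix are hypothetical choices, and the slides do not say which method or tool was actually used.

```python
def collapse_prefix_duplicates(reads, prefix_len: int = 20):
    """Keep one read (the longest) per identical leading prefix.

    Reads shorter than `prefix_len` are keyed by their whole sequence.
    The output follows the order in which each prefix was first seen.
    """
    best = {}
    for read in reads:
        key = read[:prefix_len].upper()
        # Keep the first read for a new prefix, or replace it with a longer one.
        if key not in best or len(read) > len(best[key]):
            best[key] = read
    return list(best.values())
```

Exact-prefix matching will miss duplicates that differ by a sequencing error in the prefix, so this is a lower bound on the number of duplicates removed.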

Page 15:

contig coverage by large libraries

Page 16:

Illumina paired-end (PE) and mate-pair libraries

Page 17:

highly polymorphic genomes

scaffold

two copies of polymorphic contigs

Page 18:

polymorphic assembly workflow

normal assembly

condensing alternative contigs

mapping to identify SNPs

"repair" reads

second "polymorphic" assembly

http://www.fishbrowser.org/software/L_RNA_scaffolder

Page 19:

Page 20:

G-quadruplex
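
A minimal sketch for flagging potential G-quadruplex-forming regions, using the widely cited sequence pattern of four runs of at least three G separated by loops of 1 to 7 bases. A matching run of C marks the same motif on the opposite strand. The function name and the returned tuple layout are illustrative assumptions, and the slides do not say how such regions were identified.

```python
import re

# Four G-runs (>= 3 bases each) separated by loops of 1-7 bases.
G4_PLUS = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")
# The same motif on the opposite strand shows up as C-runs.
G4_MINUS = re.compile(r"(?:C{3,}[ACGT]{1,7}){3}C{3,}")


def find_g4_motifs(seq: str):
    """Return sorted (start, end, strand) tuples for non-overlapping motif matches."""
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", G4_PLUS), ("-", G4_MINUS)):
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)
```

A sequence match only indicates that a quadruplex could form; it does not show that one forms in vivo or that it is what blocks sequencing in a given region.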

Page 21:

AGCGACCCCCCCCCACCACCGCCACCACCACCTCTGCCATTGGCCGCCGCCGCCCCCCCCCCATTAAACCCCCCCACCCCCCCCCGCGCTGCCCCCTCCCCGGTGG

Chicken p53 – coverage from RNAseq data

Coverage > 13,000X

Page 22:

CCCGCCCACCCCCACCCCCACCCGCACCCCCCACTCTCCCACCCCCACCCCCTTTTCTCCCACCCCCTCTTCTCCCACCCCCTTTTCCCCCCCTTCCTCCCCCCACTCCGCCCCCCCCCCGCCCCCTCCCCCCCCCCAGGTGAGGACCCT

Chicken erythropoietin (EPO) – coverage from RNAseq data

Coverage > 500X from RNAseq

(*EPO locus was not completed even with 1000X coverage of genomic Illumina data!)

Page 23:

chicken missing genes

Page 24:

that’s it, thank you

many thanks also to:

Daniel Elleder, Tomáš Hron, Michal Kolář, Hynek Strnad