hard assembly
Jan Pačes
Institute of Molecular Genetics AS CR
problems
genomes: high GC content; repetitions (short, with low informational content, and long); polymorphism; "unreadable" sequences and "weird" structures
technologies: nonrandom libraries; wrong sizes; erroneous or chimeric reads
sequencing technologies
ABI (Sanger)
454 (pyrosequencing)
Solexa (reversible terminator)
SOLiD (2-base ligation)
PacBio (SMRT)
example of errors in one technology
http://chevreux.org/mira_ex_454sanger.html
Aird et al. Genome Biology 2011
high GC regions are underrepresented
Aird et al. Genome Biology 2011
protocol optimization for high GC content
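To see which parts of a reference or draft assembly are likely to be underrepresented, GC content can be computed in sliding windows. The following is a minimal sketch; the function names, the window and step sizes, and the 0.65 threshold are illustrative assumptions, not values from Aird et al.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases (e.g. all N) report 0.0
    return gc / total if total else 0.0


def windowed_gc(seq, window=100, step=50):
    """Yield (start, gc_fraction) for windows along the sequence."""
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, gc_fraction(seq[start:start + window])


def high_gc_windows(seq, window=100, step=50, threshold=0.65):
    """Return windows whose GC fraction is at or above the (assumed) threshold."""
    return [(s, g) for s, g in windowed_gc(seq, window, step) if g >= threshold]
```

Windows returned by `high_gc_windows` are candidates for checking read depth; the threshold should be tuned to the genome in question.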
repetitions
[figure: scaffold, repetition]
repetitions recognition
MIRA http://sourceforge.net/projects/mira-assembler/
MaSuRCA http://www.genome.umd.edu/masurca.html
SPAdes http://bioinf.spbau.ru/spades
RepeatMasker http://www.repeatmasker.org/
RepeatModeler (RECON and RepeatScout) http://www.repeatmasker.org/RepeatModeler.html
position aware assemblers
k-mer distribution
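A k-mer distribution (spectrum) counts how many distinct k-mers occur once, twice, and so on. A minimal pure-Python sketch follows, assuming reads are plain strings; dedicated counters such as Jellyfish are what one would use on real data. The function names are illustrative.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(reads, k):
    """Count canonical k-mers, skipping any k-mer with an ambiguous base."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _BASES:
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(counts.values())
```

In a spectrum from deep-coverage data, the low-multiplicity end is typically dominated by sequencing errors, and k-mers far above the main peak typically come from repetitions.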
k-mer analysis
JELLYFISH - Fast, Parallel k-mer Counting for DNA http://www.cbcb.umd.edu/software/jellyfish/
Quake is a package to correct substitution sequencing errors in experiments with deep coverage http://www.cbcb.umd.edu/software/quake/
KHMER - trim off likely erroneous k-mers https://khmer-protocols.readthedocs.org/en/v0.8.2/
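The idea of trimming at likely erroneous k-mers can be sketched as follows: count k-mers over all reads, then cut each read just before its first low-abundance k-mer. This is a simplified illustration, not the actual khmer implementation; it counts forward-strand k-mers only, and `min_count=2` is an assumed cutoff.

```python
from collections import Counter


def count_forward_kmers(reads, k):
    """Count k-mers on the forward strand only (a simplification)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trim_at_rare_kmer(read, counts, k, min_count=2):
    """Truncate the read so that it contains no k-mer seen fewer than min_count times.

    The read is cut so that the last base of the first rare k-mer is dropped.
    If the very first k-mer is rare, only a stub of k - 1 bases remains.
    """
    for i in range(len(read) - k + 1):
        if counts.get(read[i:i + k], 0) < min_count:
            return read[:i + k - 1]
    return read
```

For example, with three copies of `ACGTACGT` and one read `ACGTACGA`, the final k-mer `ACGA` (k = 4) is seen once, so the last base of that read is trimmed.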
repetitions
[figure: scaffold, repetition]
filling gaps
GapCloser (part of SOAPdenovo) http://soap.genomics.org.cn/soapdenovo.html
GapFiller (part of SSPACE) http://www.baseclear.com/lab-products/bioinformatics-tools/gapfiller/
GapFiller http://sourceforge.net/projects/gapfiller/
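Before and after running a gap-filling tool, it helps to measure how much of a scaffold is still gap. The sketch below assumes gaps are encoded as runs of N, which is the usual convention in scaffold FASTA files; the function names are illustrative and not part of any of the tools listed above.

```python
import re


def find_gaps(scaffold, min_len=1):
    """Return (start, end, length) for each run of N; 0-based, end exclusive."""
    gaps = []
    for m in re.finditer(r"[Nn]+", scaffold):
        length = m.end() - m.start()
        if length >= min_len:
            gaps.append((m.start(), m.end(), length))
    return gaps


def gap_summary(scaffold):
    """Summarize the number of gaps and the fraction of gap bases in one scaffold."""
    gaps = find_gaps(scaffold)
    gap_bases = sum(length for _, _, length in gaps)
    return {
        "gaps": len(gaps),
        "gap_bases": gap_bases,
        "gap_fraction": gap_bases / len(scaffold) if scaffold else 0.0,
    }
```

Comparing `gap_summary` on the same scaffold before and after gap filling gives a quick check of how many gaps were actually closed.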
454 multiplicates
contig coverage by large libraries
Illumina PE and mate-pair libraries
highly polymorphic genomes
[figure: scaffold, two copies of polymorphic contigs]
polymorphic assembly workflow
normal assembly
condensing alternative contigs
mapping to identify SNPs
"repair" reads
second "polymorphic" assembly
http://www.fishbrowser.org/software/L_RNA_scaffolder
G-quadruplex
AGCGACCCCCCCCCACCACCGCCACCACCACCTCTGCCATTGGCCGCCGCCGCCCCCCCCCCATTAAACCCCCCCACCCCCCCCCGCGCTGCCCCCTCCCCGGTGG
Chicken p53 – coverage from RNAseq data
Coverage > 13,000X
CCCGCCCACCCCCACCCCCACCCGCACCCCCCACTCTCCCACCCCCACCCCCTTTTCTCCCACCCCCTCTTCTCCCACCCCCTTTTCCCCCCCTTCCTCCCCCCACTCCGCCCCCCCCCCGCCCCCTCCCCCCCCCCAGGTGAGGACCCT
Chicken erythropoietin (EPO)– coverage from RNAseq data
Coverage > 500X from RNAseq
(*EPO locus could not be completed even with 1000X-coverage genomic Illumina data!)
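Candidate G-quadruplex regions can be flagged with a simple pattern search. The sketch below uses the commonly cited motif of four runs of at least three G separated by loops of 1 to 7 bases; this pattern choice is an assumption, not something stated on the slides. Because the sequences shown above appear C-rich, the search also looks for the complementary C-run pattern, which corresponds to a G-rich motif on the opposite strand.

```python
import re

# Four G-runs (>= 3 G each) separated by loops of 1-7 bases (assumed motif definition)
G4_FORWARD = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")
# Same motif on the complementary strand, seen as C-runs on the given strand
G4_REVERSE = re.compile(r"(?:C{3,}[ACGT]{1,7}){3}C{3,}")


def find_g4_motifs(seq):
    """Return sorted (start, end, strand) for non-overlapping motif matches.

    Coordinates are 0-based with end exclusive, on the given strand.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", G4_FORWARD), ("-", G4_REVERSE)):
        for m in pattern.finditer(seq):
            hits.append((m.start(), m.end(), strand))
    return sorted(hits)
```

A pattern match is only a prediction that a region could form such a structure; it does not by itself explain missing coverage.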
chicken missing genes
that’s it, thank you
many thanks also to:
Daniel Elleder, Tomáš Hron, Michal Kolář, Hynek Strnad