what should bioinformatics do for evodevo?
DESCRIPTION
Presented at Euro Evo Devo 2014 in Vienna
TRANSCRIPT
Insights into the evolution and development of planarian regeneration from the genome of the flatworm Girardia tigrina
SUJAI KUMAR
2014-07-24 VIENNA EURO EVODEVO
WHAT SHOULD BIOINFORMATICS DO FOR EVODEVO?
EVODEVO
SUJAI KUMAR
"Winkel triple projection SW" by Strebe - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Winkel_triple_projection_SW.jpg
Cartoonist and mathematics teacher in New Delhi
SUJAI KUMAR
Finding patterns in sequences: TIMSS 1999 video study
MS in Educational Psychology at the University of Illinois
SUJAI KUMAR
Self-organising systems research in New Delhi
SUJAI KUMAR
Sequenced four nematode genomes for PhD in Blaxter Lab, Edinburgh
SUJAI KUMAR
Planarian regeneration genomics in Aboobaker Lab, Oxford
Outline of this talk
1. Regeneration, planarian flatworms, and Girardia tigrina
2. Creating G tigrina genomic resources
3. Using these resources to understand regeneration
4. What should bioinformatics do for EvoDevo
1. Regeneration, planarian flatworms, and Girardia tigrina
Bely and Nyberg, 2010 DOI:10.1016/j.tree.2009.08.005
1. Regeneration, planarian flatworms, and Girardia tigrina
Kao, 2014. PhD Thesis “Transcriptome assembly and analysis of the freshwater planarian Schmidtea mediterranea”
Platyhelminthes
Cestoda
Monogenea
Trematoda
Rhabditophora
Turbellaria
Tricladida
Macrostomorpha
Lecithoepitheliata
Rhabdocoela
[Tree figure: taxa are marked with T and G labels on the original slide]
Girardia tigrina (G): aboobakerlab.com/genomes
Schmidtea mediterranea (G): smedgd.neuro.utah.edu
Polycladida
1. Regeneration,planarian flatworms,and Girardia tigrina
• What we know already
• Some genes and pathways that are essential for whole-body regeneration (WBR)
• Some transcription expression profiles
• No transgenics in any planarian
2. Creating G tigrina genomic resources
Sequencing > Assembly > Annotation > Delivery
Illumina HiSeq: Workhorse
Short paired reads
~$£€ 1,000 / 100 MegaBase
Mate pairs essential
PacBio: expensive
High quality fly genome
~$£€ 10,000 / 100 MegaBase
Nanopore – not a game changer just yet
2. Creating G tigrina genomic resources
Sequencing > Assembly > Annotation > Delivery
• Quality Control
• Raw data QC: FastQC
• Preliminary assembly: Blobology
• Separate components: contaminants / endosymbionts / mitochondrial
• Assess insert sizes: bad mate pair libraries confound scaffolding
[Figure: Taxon-annotated GC-coverage (TAGC) plots, a.k.a. “Blobology”. Each point is a contig from a preliminary assembly. Panels: Caenorhabditis sp. 5 and Girardia tigrina. Axes: GC content, read coverage.]
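The two quantities plotted in a TAGC plot can be sketched in a few lines. This is a minimal illustration, not the Blobology pipeline itself: the toy FASTA text and the `coverage` values are made up, and in practice coverage would come from mapping reads back to the preliminary assembly.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (contig_id, sequence) pairs from a FASTA-formatted file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig id.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_content(seq):
    """Fraction of G and C among unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


# Hypothetical toy data: two contigs and made-up mean read coverage values.
fasta = StringIO(">contig1\nATGCGC\nGG\n>contig2\nATATATNN\n")
coverage = {"contig1": 42.0, "contig2": 310.5}

# One (contig, GC fraction, coverage) point per contig, ready for a scatter plot.
points = [(cid, gc_content(seq), coverage[cid]) for cid, seq in parse_fasta(fasta)]
```

Contigs from a contaminant or endosymbiont tend to form their own cluster in this GC-versus-coverage space, which is what makes the plot useful for separating components.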
• Generate many assemblies
• ABySS, CLC, MaSuRCA, SGA, SPAdes, ALLPATHS-LG
• Evaluate assemblies
• FRCbam, REAPR, CGAL
• CEGMA, alignments to known sequences
• Freeze and release
2. Creating G tigrina genomic resources
Sequencing > Assembly > Annotation > Delivery
• NOT a great assembly
• But it was GoodEnough™
• Next version with long-insert mate pairs
• Diploid, but high heterozygosity
Raw read data: ~500M short read pairs, 160 Gbases
Consolidating near-identical contigs

Assembly version          nGt.0.3        nGt.0.5
Total span (Gbases)       1.898          1.500
Num contigs               581,558        422,617
Span contigs >10kb (bp)   541,653,308    536,575,093
Num contigs >10kb         29,050         27,495
N50 (bp)                  5,751          6,827
CEGMA                     45%            56%
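For readers unfamiliar with the contiguity metrics in this table, here is a minimal sketch of how N50 and the span of contigs above a length cutoff are commonly computed. The contig lengths are toy values, not the G tigrina assembly, and `span_over` is a helper name invented for this example.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the total span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def span_over(lengths, min_len):
    """Total bases in contigs strictly longer than min_len."""
    return sum(length for length in lengths if length > min_len)


# Toy contig lengths (hypothetical), total span 54.
lengths = [10, 9, 8, 7, 6, 5, 4, 3, 2]
toy_n50 = n50(lengths)            # 10 + 9 + 8 = 27 reaches half of 54, so N50 is 8
toy_span = span_over(lengths, 7)  # 10 + 9 + 8 = 27
```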
2. Creating G tigrina genomic resources
Sequencing > Assembly > Annotation > Delivery
• Gene prediction
• RNA-seq
• Predictors: Augustus, SNAP, GeneMark
• Consolidators: MAKER, EVM, ENSEMBL genebuild
• Evaluate: use Annotation Edit Distance (AED) as a metric
• Functional annotation
• InterProScan, Trinotate, Blast2GO
• Community annotation
• WebApollo, Community Annotation Portal
Annotation version                        nGt.0.5.1
Num of genes                              39,119
Num of genes with AED>0.5                 35,061
Mean aa length                            268
Num of genes with InterPro annotations    22,747
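Annotation Edit Distance is usually defined as 1 minus the average of nucleotide-level sensitivity and specificity between a gene model and its supporting evidence, so 0 means perfect agreement and 1 means no overlap. The sketch below illustrates that definition on toy exon coordinates; it is not how MAKER computes the value internally, and the function names and half-open `(start, end)` interval convention are assumptions made for this example.

```python
def covered_bases(exons):
    """Set of base positions covered by half-open (start, end) exon intervals."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end))
    return bases


def aed(prediction_exons, evidence_exons):
    """AED = 1 - (sensitivity + specificity) / 2 at the nucleotide level."""
    pred = covered_bases(prediction_exons)
    evid = covered_bases(evidence_exons)
    if not pred or not evid:
        return 1.0  # nothing to compare against: treat as no support
    overlap = len(pred & evid)
    sensitivity = overlap / len(evid)   # fraction of evidence bases predicted
    specificity = overlap / len(pred)   # fraction of predicted bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


# Toy example (hypothetical coordinates): the prediction extends 50 bases
# beyond the evidence at the 5' end of the first exon.
prediction = [(0, 100), (200, 300)]
evidence = [(50, 100), (200, 300)]
score = aed(prediction, evidence)  # sensitivity 1.0, specificity 0.75
```

Using sets of positions keeps the sketch short; for real gene models an interval-based overlap calculation would be more memory-efficient.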
2. Creating G tigrina genomic resources
Sequencing > Assembly > Annotation > Delivery
• Genome Browser
• Blast server
• Bulk data downloads
• Interface
• Badger, Tripal, InterMine, Ensembl
3. Using these resources to understand regeneration
• Individual genes and pathways
• Transgenics
• Protein ortholog analysis
• 4 triclads, 1 other platyhelminth, 2 ecdysozoa, 4 deuterostomes
• 14k out of 40k G tigrina proteins in strict ortholog clusters
• ~8000 triclad-specific clusters
• ~800 triclad-specific clusters with all 4 species represented
• Cis-regulatory analysis
• Neoblast specific regulatory regions
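The two cluster counts in the ortholog analysis correspond to two simple filters over a clustering result. The sketch below shows those filters on made-up data: the cluster contents, protein ids, and the species labels `triclad_3`, `triclad_4`, and `deuterostome_1` are hypothetical placeholders, since the slides do not name the other species, and the actual clustering tool's output format is not assumed.

```python
# Hypothetical set of the four triclad species labels used in the clustering.
TRICLADS = {"G_tigrina", "S_mediterranea", "triclad_3", "triclad_4"}

# Toy clustering result: cluster id -> list of (species, protein id) members.
clusters = {
    "c1": [("G_tigrina", "p1"), ("S_mediterranea", "p2"),
           ("triclad_3", "p3"), ("triclad_4", "p4")],
    "c2": [("G_tigrina", "p5"), ("S_mediterranea", "p6")],
    "c3": [("G_tigrina", "p7"), ("deuterostome_1", "p8")],
}


def species_in(members):
    """Set of species represented in one cluster."""
    return {species for species, _ in members}


# Triclad-specific: every member comes from a triclad species.
triclad_specific = {
    cid for cid, members in clusters.items()
    if species_in(members) <= TRICLADS
}

# Stricter subset: all four triclad species are represented.
all_four = {
    cid for cid in triclad_specific
    if species_in(clusters[cid]) == TRICLADS
}
```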
4. What should bioinformatics do for EvoDevo
• What should I do for an experimental EvoDevo lab
• Visual > Text
• View additional information in place
• Plot everything vs everything
• Create gene models visually
• Routine analyses should not require a bioinformatician
• Clear explanations of how a resource was created
• Not too many versions
• Minimum standards
4. What should bioinformatics do for EvoDevo
• What should the bioinformatics community do for me as an EvoDevo bioinformatician
• Best practice documentation for analyses
• Easy to install tools
• Minimum standards for assembly, metadata, annotation, and delivery
• Grants for coordination, tools, resources
Summary
• Please use the resources at aboobakerlab.com/genomes
• Tell us what other resources you’d like to see as standard
• Fund technology development and training
Acknowledgements
• AboobakerLab.com
• Aziz Aboobaker
• Natalia Pouchkina-Stantcheva
• Damian Kao
• Yuliana Mihaylova
• Aphrodite Zhao
• Blaxter Lab (nematodes.org)
• Ben Elsworth (Badger)
• Sequencing
• Edinburgh Genomics
• Funding
• BBSRC
• BSDB / Company of Biologists travel grant