maker: an easy to use genome annotation...
TRANSCRIPT
MAKER: An easy to usegenome annotation pipeline
Carson HoltYandell Lab
Department of Human GeneticsUniversity of Utah
Introduction to GenomeAnnotation
• What annotations are• Importance of genome annotations• Effect of next generation sequencing
technologies on the annotation process
What Are Annotations?• Annotations are descriptions of features of the
genome– Structural: exons, introns, UTRs, splice forms etc.– Functional: metabolism, hydrolase, expressed in the
mitochondria, etc.
• Annotations should include evidence trail– Assists in quality control of genome annotations
• Examples of evidence supporting a structuralannotation:– Ab initio gene predictions– ESTs– Protein homology
Why should I care aboutgenome annotations?
Background
SUCCESS>Smg5MEVTFSSGGSSNASSECAIDGGTNRCRGLEPNNGTCILSQEVKDLYRSLYTASKQLDDAKRNVQSVGQLFQHEIEEKRSLLVQLCKQIIFKDYQSVGKKVREVMWRRGYYEFIAFV
Why should I care aboutgenome annotations?
Background
SUCCESS>Smg5MEVTFSSGGSSNASSECAIDGGTNRCRGLEPNNGTCILSQEVKDLYRSLYTASKQLDDAKRNVQSVGQLFQHEIEEKRSLLVQLCKQIIFKDYQSVGKKVREVMWRRGYYEFIAFV
Incorrect annotations poison everyexperiment that uses them!!
Advances in Technology Promise to Make WholeGenome Sequencing “Routine” for Even Small Labs
Advances in annotation technology have not keptpace with genome sequencing, and annotation is
rapidly becoming a major bottleneck affecting moderngenomics research.
• As of October 2009, 222 eukaryotic genomes were fully sequencedyet unpublished.
• Currently there are over ~900 eukaryotic genome projects underway,assuming 10,000 genes per genome, that’s 9,000,000 newannotations.
• There is a limit to how much data can be managed, maintained, andupdated by a single organization.
• Small research groups affected disproportionately by difficultiesrelated to genome annotation.
• .GOLD: Genomes OnLine Database. 2009.
• MAKER is an easy-to-use annotationpipeline designed to help smallerresearch groups convert the mountainof genomic data provided by nextgeneration sequencing technologiesinto a usable resource.
MAKER Overview
• What does MAKER do?• What sets MAKER apart from other
tools (ab initio gene predictors, etc.)?
MAKER
Lewis, S.E. et al. Apollo: a sequence annotation editor. Genome Biology 3, research0082.1 - 0082.14 (2002). Stein, L.D. et al. The Generic Genome Browser: A Building Block for a Model Organism System Database. Genome Res. 12, 1599-1610 (2002).
User Requirements: Can be run by a single individual with little bioinformatics experience
System Requirements: Can run on laptop or desktop computers running Linux or Mac OS X
Program Output: Output is compatible with popular GMOD annotation tools like Apollo and GBrowse
Availability: Free open source application (for academic use)
• The easy-to-use annotation pipeline.
MAKER identifies repeats, aligns ESTs and proteins to a genome, producesab-initio gene predictions, automatically synthesizes these data into geneannotations, and produces evidence-based quality values for downstream annotationmanagement
Other Features
MPI Support
• Message Passing Interface(MPI) is a communicationprotocol for computerclusters which essentiallyallows multiple computers toact like a single powerfulmachine.
MPI Maker
Gene-predictionsComputational evidence
Gene annotationgene prediction ≠ gene annotation
What sets MAKER apart from othertool (i.e. ab initio gene predictors)?
Model versus Emerging genomes
Model genomes:
• Classic experimental systems
• Much prior knowledge about genome
• Large community
• Big $
Examples: D. melanogaster, C. elegans, human, etc
Model versus Emerging genomes
Emerging genomes:• New experimental systems
– Genome will be the central resource for work in these systems
• Little prior knowledge about genome– Usually no genetics
• Small communities
• Less $
Examples: flatworms, oomycetes, the cone snail, etc.
MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. (2008)Cantarel B L, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M Genome Res 18(1) 188-196
Comparison of gene models produced by state-of-the artalgorithms against a REFERENCE genome
*nGASP - the nematode genome annotation assessment project Avril Coghlan , Tristan J Fiedler , Sheldon J McKay ,Paul Flicek , Todd W Harris , Darin Blasiar , The nGASP Consortium and Lincoln D SteinBMC Bioinformatics 2008, 9:549doi:10.1186/1471-2105-9-549
With enough training data, ab-initio gene predictorscan match or even out-perform annotation pipelines*
Ab initio gene predictors don’t do nearlyso well on emerging genomes*
7% contain a domain35% contain a domain
Average of sevenREFERENCE proteomes
S. mediterranea SNAPab-initio gene predictions
29% contain a domain
MAKER S. mediterraneaannotations
*MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. (2008)Cantarel B L, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M Genome Res 18(1) 188-196
Benefits of MAKER
• Provides gene models as well as an evidence trailcorrelations for quality control and manual curation
• Provides a mechanism to train and retrain ab initiogene predictors for even better performance.
• Output can be loaded into a GMOD compatibledatabase for annotation distribution
• Annotations can be automatically updated by newevidence by simply passing existing annotationsets back into the pipeline
What is Happening InsideMAKER
• RepeatMasking• Ab Initio Gene Prediction• EST and Protein Evidence Alignment• Polishing Evidence Alignments• Integrating Evidence to Synthesize
Final Annotations
Current evidence
Current Assembly
Annotating the Genome – Apollo View
Current evidence
Current Assembly
Identify and Mask Repetitive Elements
Current evidence
Current Assembly
Identify and Mask Repetitive Elements
• RepeatMasker– RepBase– Species specific library
• RepeatRunner– MAKER internal protein library
Current evidence
Current Assembly
Identify and Mask Repetitive Elements
Current evidence
Current Assembly
Ab initio Predictions
Generate Ab Initio Gene Predictions
Current evidence
Current Assembly
Ab initio Predictions
Generate Ab Initio Gene Predictions
• MAKER currently supports:– SNAP– Augustus– GeneMark– FGENESH
• Remember to supply HMM’s for each
Current evidence
Current Assembly
Ab initio Predictions
Generate Ab Initio Gene Predictions
Current evidence
Current Assembly
Ab initio Predictions
Align EST and Protein Evidence
EST TBLASTX
EST BLASTNProtein BLASTX
Current evidence
Current Assembly
Ab initio Predictions
Align EST and Protein Evidence
EST TBLASTX
EST BLASTNProtein BLASTX
• Identify regions being activelytranscribed (i.e. EST data)
• Identify region with homology to aknown protein
Current evidence
Current Assembly
Ab initio Predictions
Align EST and Protein Evidence
EST TBLASTX
EST BLASTNProtein BLASTX
Polish BLAST Alignments with Exonerate
Current evidence
Current Assembly
Ab initio Predictions
Polished proteinPolished EST
Polish BLAST Alignments with Exonerate
Current evidence
Current Assembly
Ab initio Predictions
Polished proteinPolished EST
• All base pairs must aligns in order.
• No HSP overlap is permitted
• Aligns HSPs correctly with respectto splice sites.
Polish BLAST Alignments with Exonerate
Current evidence
Current Assembly
Ab initio Predictions
Polished proteinPolished EST
Current evidence
Current Assembly
Ab initio PredictionsHint-based SNAP Hint-based FgenesH
Pass Gene Finders Evidence-based ‘hints’
Current evidence
Current Assembly
Ab initio PredictionsHint-based SNAP Hint-based FgenesH
*
*Quantitative Measures for the Management and Comparison of Annotated Genomes Karen Eilbeck , Barry Moore , Carson Holt and Mark Yandell BMC Bioinformatics 200910:67doi:10.1186/1471-2105-10-67
Identify Gene Model Most Consistent with Evidence*
Current evidence
Current Assembly
Ab initio Predictions*
Revise it further if necessary; Create New Annotation
Compute Support for Each Portion of Gene Model
Using MAKER
MAKER Web Annotation Service
MAKER Web Annotation Service
http://www.yandell-lab.org
De novo Annotation of aNewly Sequenced Genome
• You are involved in a genome projectfor an emerging model organism.
• You have no pre-existing gene models.• What you do have:
– ESTs– Proteins from other species available from
public databases
Go to Web
GFF3 pass-through: How touse external evidence
• You have an existing annotation set.• You want to update the evidence and
allow the annotation to change to reflectthe new evidence.
What if I have mRNA-seq data?
RNA-seq is fundamentally changing the field of genome annotation
for both model and emerging genomes
RNA-seq may soon make gene prediction(mostly) a thing of the past
• Still need to de-convolute reads & evidence (for now)• Still need to archive and distribute annotations• Still need to manage genome and its annotations
How to use RNA-seq data inMAKER
• Use BowTie and TopHat to produce,aligns reads into expression “islands”and “junctions”
• Pass data through as EST evidence viaGFF3 pass-through.
Go to Web
Another issue: legacy annotations
• Many are no longer maintained by original creators
• In some cases more than one group has annotated the same genome,using very different procedures, even different assemblies
• The communities associated with those genomes are going to wantmRNA-seq data
• Many investigators have their own genome-scale data and would like aprivate set of annotations that reflect these data
• There will be a need to revise, merge, evaluate, and verify legacyannotation sets in light of RNA-seq and other data
Legacy Annotation Set 1 Legacy Annotation Set 2 Legacy Annotation Set n
new data
• Identify legacy annotation most consistent with new data• Automatically revise it in light of new data• If no existing annotation, create new one
current assembly
Merging and Revising Legacy Annotation Sets
Current evidence
Current Assembly
Legacy Annotations
Align Evidence and Legacy Annotations to Current Assembly
Current evidence
Current Assembly
Legacy AnnotationsHint-based SNAP Hint-based FgenesH
Pass Gene Finders Evidence-based ‘hints’
Current evidence
Current Assembly
Legacy AnnotationsHint-based SNAP Hint-based FgenesH
*
Identify Gene Model Most Consistent with Evidence*
*Quantitative Measures for the Management and Comparison of Annotated Genomes Karen Eilbeck , Barry Moore , Carson Holt and Mark Yandell BMC Bioinformatics 200910:67doi:10.1186/1471-2105-10-67
Go to Web
Working with Chado• maker2chado [OPTION] <database_name> <gff3file1> <gff3file2> ...• maker2chado [OPTION] -d <datastore_index> <database_name>
This script takes MAKER produced GFF3 files and dumpsthem into a CHADO database. You must set the databaseup first according to CHADO installation instructions.CHADO provides its own methods for loading GFF3, but thisscript makes it easier for MAKER specific data. You caneither provide the datastore index file produced by MAKERto the script or add the GFF3 files as command linearguments.
Working with JBrowse
• maker2jbrowse [OPTION] <gff3file1> <gff3file2> ...• maker2jbrowse [OPTION] -d <datastore_index>
This script takes MAKER produced GFF3 files anddumps them into JBrowse for you using pre-configured JSON tracks.