Sample GEP Annotation Report

Last Update: 12/28/2019

GEP Annotation Report

Student name: Wilson Leung
Student email: [email protected]
Faculty advisor: Sarah C.R. Elgin
College/university: Washington University in St. Louis

Project details
Project name: contig10
Project species: D. biarmipes
Date of submission: 12/28/2019
Size of project in base pairs: 43,013
Number of genes in project: 3
Does this report cover all of the genes or is it a partial report? Partial report
If this is a partial report, please indicate the region of the project covered by this report: From base 25,000 to base 28,000

Instructions for project with no genes
If you believe that the project does not contain any genes, please provide the following evidence to support your conclusion:

1. Perform an NCBI BLASTX search of the entire contig sequence against the “non-redundant protein sequences (nr)” database. For any significant (E-value < 1e-5) hits to known genes in the nr database, explain why they do not correspond to real genes in the project.

2. For each Genscan prediction, perform an NCBI BLASTP search of the predicted amino acid sequence against the nr protein database using the strategy described above.

3. Examine the gene expression tracks (e.g., RNA-Seq) for evidence of transcribed regions that do not correspond to alignments to known D. melanogaster proteins. Perform an NCBI BLASTX search against the nr protein database using these genomic regions to determine if they show sequence similarity to known or predicted proteins in the nr database.

Note: For each gene described in this annotation report, you should also prepare the corresponding GFF, transcript, and peptide sequence files as part of your submission.
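The E-value cutoff in the no-genes instructions can also be applied to a saved hit table rather than read off the results page by eye. The following is a minimal sketch, assuming the BLAST results were exported in the standard 12-column tabular layout (BLAST+ `-outfmt 6`), where column 2 is the subject ID and column 11 is the E-value. The function name `significant_hits` is hypothetical and not part of any GEP tool.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # significance threshold quoted in the instructions


def significant_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (subject_id, e_value) pairs with an E-value below the cutoff.

    Assumption: the input is BLAST tabular output (-outfmt 6), so the
    subject ID is column 2 and the E-value is column 11.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines (present in -outfmt 7 exports).
        if not row or row[0].startswith("#"):
            continue
        e_value = float(row[10])
        if e_value < cutoff:
            hits.append((row[1], e_value))
    return hits
```

Every hit this returns would still need the written explanation the instructions ask for; the filter only identifies which hits require one.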




Complete the following Gene Report Form for each gene in your project. Copy and paste the sections below to create as many copies as needed within this report. Be sure to create enough Isoform Report Forms within your Gene Report Form for all isoforms.

Gene report form
Gene name (e.g., D. biarmipes eyeless): D. biarmipes CG31997
Gene symbol (e.g., dbia_ey): dbia_CG31997
Approximate location in project (from 5’ end to 3’ end): 25673-27471
Number of isoforms in D. melanogaster: 2
Number of isoforms in this project: 2

Complete the following table for all the isoforms in this project:

Name(s) of unique isoform(s) based on coding sequence    List of isoforms with identical coding sequences
CG31997-PB                                               CG31997-PA

Names of the isoforms with unique coding sequences in D. melanogaster that are absent in this species: NA

Consensus sequence errors report form
Complete this section if you have identified errors in the project consensus sequence that affect the annotation of the gene described above. All of the coordinates reported in this section should be relative to the coordinates of the original project sequence.
Location(s) in the project sequence with consensus errors: NA

Note: For isoforms with identical coding sequence, you only need to complete the Isoform Report Form for one of these isoforms (i.e., using the name of the isoform listed in the left column of the table above). However, you should generate GFF, transcript, and peptide sequence files for ALL isoforms, irrespective of whether they have identical coding sequences as other isoforms.


Isoform report form
Complete this report form for each unique isoform listed in the table above. Copy and paste this form to create as many copies of this Isoform Report Form as needed.
Gene-isoform name (e.g., dbia_ey-PA): dbia_CG31997-PB
Names of the isoforms with identical coding sequences as this isoform: dbia_CG31997-PA
Is the 5’ end of this isoform missing from the end of the project? No
If so, how many exons are missing from the 5’ end:
Is the 3’ end of this isoform missing from the end of the project? No
If so, how many exons are missing from the 3’ end:

1. Gene Model Checker checklist
Enter the coordinates of your final gene model for this isoform into the Gene Model Checker and paste a screenshot of the checklist results into the box below:

Note: For projects with consensus sequence errors, report the exon coordinates relative to the original project sequence. Include the VCF file you have generated above when you submit the gene model to the Gene Model Checker. The Gene Model Checker will use this VCF file to automatically revise the submitted exon coordinates.


2. View the gene model on the Genome Browser
Use the custom track feature from the Gene Model Checker to capture a screenshot of your gene model shown on the Genome Browser for your project. Zoom in so that only this isoform is in the screenshot. (See page 12 of the Gene Model Checker user guide on how to do this; you can find the guide under “Help” → “Documentations” → “Web Framework” on the GEP website at http://gep.wustl.edu.) Include the following evidence tracks in the screenshot if they are available:

1. A sequence alignment track (D. mel Proteins or Other RefSeq)
2. At least one gene prediction track (e.g., Genscan)
3. At least one RNA-Seq track (e.g., RNA-Seq Alignment Summary)
4. A comparative genomics track (e.g., Conservation, D. mel. Net Alignment)

Paste a screenshot of your gene model as shown on the GEP UCSC Genome Browser into the box below:

Low-frequency RNA-Seq exon junctions not annotated: The evidence from the RNA-Seq TopHat tracks and Multiz alignments suggests that there might be additional isoforms because of alternative splicing at the 5' end of this gene (red arrows in the screenshot above). However, because most of the TopHat junctions are supported by fewer than 10 reads, there is insufficient evidence to postulate the presence of multiple novel isoforms in D. biarmipes compared to D. melanogaster.
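The read-support argument above can be made explicit by partitioning junction records around the 10-read threshold. This is a minimal sketch, assuming the junctions are available as a BED file in which the fifth column (score) holds the number of reads spanning each junction, as in TopHat's junctions.bed output. The function name and the idea of exporting the track to a file are assumptions for illustration, not part of the GEP workflow.

```python
MIN_READS = 10  # support threshold used in the paragraph above


def split_junctions_by_support(bed_lines, min_reads=MIN_READS):
    """Partition BED junction records into (supported, low_support) lists.

    Assumption: BED column 5 (score) is the number of reads spanning the
    junction. Each record is returned as (chrom, start, end, reads).
    """
    supported, low_support = [], []
    for line in bed_lines:
        # Skip blank lines, track headers, and comments.
        if not line.strip() or line.startswith(("track", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        record = (fields[0], int(fields[1]), int(fields[2]), int(fields[4]))
        if record[3] >= min_reads:
            supported.append(record)
        else:
            low_support.append(record)
    return supported, low_support
```

The threshold is a parameter because the appropriate cutoff depends on sequencing depth; the value of 10 is simply the one used in this report.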


Extra CDS predicted by the SNAP gene predictor: SNAP predicted a CDS at 26,502-26,584 (blue arrow in the screenshot above) between the first and second CDS's of CG31997. The RNA-Seq Alignment Summary track shows that the area surrounding this predicted CDS has low (<20 reads) RNA-Seq read coverage, and that the predicted CDS is adjacent to a hAT DNA transposon fragment (see screenshot below).

An NCBI BLASTX search of the genomic region surrounding the SNAP CDS prediction (contig10:26400-26700) against the nr database did not detect any significant (E-value < 1e-5) sequence similarity to known proteins in the nr database (see screenshot below).
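Extracting a query region such as contig10:26400-26700 for a BLAST search involves converting browser-style coordinates into string indices. The sketch below assumes the coordinates are 1-based and inclusive, as they are displayed in the UCSC Genome Browser position box. The helper name `extract_region` is hypothetical.

```python
def extract_region(sequence, start, end):
    """Return the bases from start to end of a sequence.

    Assumption: start and end are 1-based, inclusive coordinates, so
    the region 26400-26700 contains 301 bases.
    """
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("region lies outside the sequence")
    # Python slices are 0-based and half-open, hence start - 1.
    return sequence[start - 1:end]
```

With the 43,013 bp project sequence loaded as a string, `extract_region(contig, 26400, 26700)` would give the 301-base query used for the search described above.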


An NCBI BLASTN search of this region against the nt database detected five significant matches to predicted mRNAs in Drosophila suzukii (see screenshot below).

The E-values for these D. suzukii matches range from 4e-10 to 3e-06, and they correspond to three different predicted genes (LOC108013970, LOC108011950, and LOC108014610). All of these matches are RefSeq predictions that have not been confirmed experimentally. There are no significant matches to RefSeq records that are supported by experimental evidence and no significant matches to mRNAs in species other than D. suzukii.

Collectively, while we cannot reject the possibility that this region of contig10 contains an untranslated region of a nearby gene, there is insufficient evidence to postulate a novel isoform of CG31997 in D. biarmipes compared to D. melanogaster. Given the proximity of this feature to the hAT DNA transposon and the multiple matches to predicted transcripts in D. suzukii, an alternative explanation is that the feature is part of a transposon that is found in both D. biarmipes and D. suzukii. Hence we have omitted this predicted CDS from our annotation of the CG31997 ortholog in D. biarmipes.


3. Alignment between the submitted model and the D. melanogaster ortholog
Show an alignment between the protein sequence for your gene model and the protein sequence from the putative D. melanogaster ortholog. You can either use the protein alignment generated by the Gene Model Checker (available through the “View protein alignment” link under the “Dot Plot” tab) or you can generate a new alignment using the “Align two or more sequences” feature (bl2seq) at the NCBI BLAST website. Paste a screenshot of the protein alignment into the box below:

4. Dot plot between the submitted model and the D. melanogaster ortholog
Paste a screenshot of the dot plot of your submitted model against the putative D. melanogaster ortholog (generated by the Gene Model Checker) into the box below. Provide an explanation for any anomalies on the dot plot (e.g., large gaps, regions with no sequence similarity).


The dot plot shows that the last two CDS's of CG31997-PB are highly conserved between the proposed D. biarmipes gene model and the D. melanogaster ortholog. Examination of the protein alignment at the end of the second and third CDS's indicates that the amino acids have similar chemical properties even though they are not identical. In addition, the lengths of these two CDS's are the same between D. biarmipes and D. melanogaster.

The dot plot shows that the beginning of the first CDS of CG31997-PB is only weakly conserved between D. biarmipes and D. melanogaster. In addition, the dot plot shows that the first CDS of the D. biarmipes gene model is longer than the orthologous CDS in D. melanogaster. The protein alignment shows that there are 8 additional amino acids within the first CDS in the proposed D. biarmipes gene model compared to D. melanogaster. Examination of this region in the GEP UCSC Genome Browser shows that there is only one methionine in frame +2 that could serve as the start codon for CG31997-PB (see screenshot below). The expansion of this CDS is consistent with the BLASTX alignment, the N-SCAN gene prediction, and the available RNA-Seq data. Consequently, our annotation has expanded the size of this CDS (1_10755_0) in order to retain this isoform in D. biarmipes.
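The search for candidate start codons in a given reading frame can be reproduced outside the browser. The sketch below lists the ATG codons that fall in one forward reading frame of a sequence, assuming that frame +1 codons begin at bases 1, 4, 7, and so on of the project sequence, and frame +2 codons at bases 2, 5, 8, and so on. The function name and the optional search window are hypothetical, and the sketch does not check for intervening stop codons.

```python
def in_frame_start_codons(sequence, frame, start=1, end=None):
    """List 1-based positions of ATG codons in a forward reading frame.

    Assumption: frame is 1, 2, or 3, numbered from the first base of the
    sequence, and start/end are 1-based, inclusive search bounds.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    end = len(sequence) if end is None else end
    seq = sequence.upper()
    # First codon start at or after `start` that lies in the requested frame.
    first = start + ((frame - 1) - (start - 1)) % 3
    # A codon starting at pos occupies pos..pos+2, so pos must be <= end - 2.
    return [
        pos
        for pos in range(first, end - 1, 3)
        if seq[pos - 1:pos + 2] == "ATG"
    ]
```

Restricting `start` and `end` to the region upstream of the first conserved CDS block would show whether a single in-frame methionine is available, which is the observation the report relies on.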

Note: Large vertical and horizontal gaps near exon boundaries in the dot plot often indicate that an incorrect splice site might have been picked. Please re-examine these regions and provide a justification as to why you have selected this particular set of donor and acceptor sites.
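When re-examining the donor and acceptor sites mentioned in the note above, a quick check is to read the two intron bases on either side of each proposed boundary. The sketch below assumes a plus-strand gene model whose CDS coordinates are 1-based and inclusive, and it treats only GT donors with AG acceptors as canonical. The function name is hypothetical, and any coordinates passed to it are the annotator's own; none are taken from this report.

```python
def check_splice_sites(sequence, exons):
    """Report the donor/acceptor dinucleotides flanking each intron.

    Assumption: exons is a list of (start, end) pairs in 1-based,
    inclusive coordinates, sorted by position on the plus strand.
    """
    seq = sequence.upper()
    report = []
    for (_, exon_end), (next_start, _) in zip(exons, exons[1:]):
        donor = seq[exon_end:exon_end + 2]             # first two intron bases
        acceptor = seq[next_start - 3:next_start - 1]  # last two intron bases
        report.append({
            "intron": (exon_end + 1, next_start - 1),
            "donor": donor,
            "acceptor": acceptor,
            "canonical": donor == "GT" and acceptor == "AG",
        })
    return report
```

A non-canonical result is not necessarily an error, since GC donors do occur, but it marks a boundary that needs the written justification the note asks for.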