BCHM 6280 2017 NCBI & Ensembl Tutorial

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Web resources:
NCBI database: http://www.ncbi.nlm.nih.gov/
Ensembl database: http://useast.ensembl.org/index.html
UCSC Genome browser: http://genome.ucsc.edu/
Exercise 1 homepage: http://biochem.slu.edu/bchm628/exercise1.html

Goals: Learn how to efficiently navigate the NCBI, EBI-Ensembl, and UCSC Genome browsers to find information on specific genes.

NOTE: RefSeq refers to records that have been reviewed by the NCBI curation staff. The RefSeq database is a precursor to the Gene database and is available as a Limits option in the protein and nucleotide databases. Curated RefSeq records have the nomenclature NM_#### for mRNA and NP_#### for protein records. Other designations are described in the PDF file RefseqNomenclature.pdf, available from the Exercise 1 homepage.

Conduct text based searches of NCBI and Ensembl

a) Search the NCBI Gene database using the query term: “p53 AND human”. The AND tells it to search for both p53 and human in every field.

b) Change the search query to: “p53 AND human[Organism]” or use the Advanced option to create the same query. This tells the search algorithm that you are searching specifically for the species human in the Organism field of the database.

c) Search the Ensembl database for the human gene encoding p53. Change the dropdown menu to human, type “p53” in the search box and click GO.

The first thing you should note is that there are many matches to the query “p53.” There are several reasons for this:

1. You are searching every field and not just the gene name.
2. You are not using the official HGNC (HUGO Gene Nomenclature Committee) gene name, and there are several different aliases for this gene.
3. The p53 protein interacts with >100 other proteins, so there is a lot of literature that mentions this protein, and thus the name will appear in the records of many other genes.

So how do you get around this? You can try searching for different aliases. You can look through the first few records and see if you can determine what the official gene symbol is. You can search the literature for other aliases. In this case, from your search of the NCBI Gene database in either a) or b), the top hit is the gene with the symbol TP53, which is the correct symbol. Read through the summary and you’ll note that the official gene name is Tumor Protein p53 and that it takes part in numerous cellular processes related to gene regulation. You should also note that p53 is one of the listed aliases.
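The two queries from steps a) and b) can also be written as URLs for NCBI's E-utilities ESearch service, which makes the effect of the [Organism] field tag easy to see. The sketch below only builds the URLs and does not contact NCBI. The function name gene_search_url and the retmax value are illustrative choices, not part of the tutorial.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(term, organism=None, retmax=20):
    """Build an ESearch URL for the NCBI Gene database.

    If `organism` is given, it is restricted to the Organism field,
    mirroring the query in step b). This helper is illustrative only.
    """
    query = term if organism is None else f"{term} AND {organism}[Organism]"
    params = {"db": "gene", "term": query, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Step a): both words are searched in every field.
url_all_fields = gene_search_url("p53 AND human")
# Step b): "human" must match the Organism field.
url_organism = gene_search_url("p53", organism="human")

print(url_all_fields)
print(url_organism)
```

Pasting either URL into a browser should return the matching Gene IDs, so the two result counts can be compared directly.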





Search the Ensembl human genome with the query “p53”. How many results? Now restrict the results to Genes, which should reduce the list to ~443 records. However, I did not find TP53 within the first few pages. Change the search to “TP53” restricted to human and Genes, and it should come up as the top record.

Central to this course is dealing with lists of genes. For this reason, we will use the official gene symbols and specific database IDs. If you have to find the official gene symbol for more than about 10 genes, you will quickly see the value of using gene identifiers that are universally recognized. You will also learn to value literature that references genes by their official symbols. Unfortunately, this is not a universal practice.

Finding transcript information about a specific gene using NCBI & Ensembl

Human genes are complex and often have several transcript isoforms. The curation of gene models to identify all possible and expressed transcripts uses several experimental techniques, including tissue-specific RNAseq, which provides direct support for expression of exons.

The curation of genes at NCBI uses a single pipeline and collects the curated genomic, transcript and protein sequences into the RefSeq database. The nomenclature identifies those sequences that are considered Reference: NG_ (genomic), NM_ (mRNA) and NP_ (protein). There is a PDF on the Exercise 1 homepage that describes all of the RefSeq nomenclature. Note that some records are listed as XM or XP, which indicates predicted transcripts or proteins with little or no experimental evidence for them.
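When working with a list of accessions, the prefix rules above can be applied automatically. The following is a minimal sketch. The function name classify_refseq is hypothetical, and the table covers only the prefixes this tutorial mentions. Reading XR_ as a predicted non-coding RNA follows NCBI's RefSeq conventions rather than the text above.

```python
# Prefix meanings for the RefSeq accessions mentioned in this tutorial.
# Other prefixes exist (see RefseqNomenclature.pdf) and are reported
# as "unknown" by this sketch.
REFSEQ_PREFIXES = {
    "NG": ("genomic", "curated"),
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted model"),
    "XR": ("non-coding RNA", "predicted model"),
    "XP": ("protein", "predicted model"),
}


def classify_refseq(accession):
    """Return (molecule type, evidence level) for a RefSeq-style accession."""
    prefix, separator, _rest = accession.strip().partition("_")
    if not separator:
        # No underscore, so this is not a RefSeq-style accession.
        return ("unknown", "unknown")
    return REFSEQ_PREFIXES.get(prefix.upper(), ("unknown", "unknown"))


for accession in ["NM_000546", "NP_000537", "XM_123456", "ENST00000269305"]:
    print(accession, classify_refseq(accession))
```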

Ensembl has two gene curation pipelines (VEGA & HAVANA), and when the two pipelines are combined, the annotation is known as GENCODE. On the gene-specific pages, the transcripts are identified by whether they are protein coding or not. There is also a visual for splice variants that matches the known domains in the gene with the different transcripts. Ensembl also makes it easy to export an Excel-compatible transcript table and usually identifies which of its transcripts have a corresponding RefSeq transcript match.

a) Within the NCBI gene record for the TP53 gene there are 2 sections that provide transcript/protein information: “Genomic regions, transcripts and products” and “NCBI Reference Set”.

Export a PDF from the Genomic regions section. Here, genes are color coded (green for protein coding, blue for non-coding). It also lists gene models (XR or XM). RefSeq transcripts/proteins starting with X represent computational models without experimental verification. An example is provided on the Exercise 1 homepage.

b) Within the Ensembl gene record for TP53, find the transcript table. Here you can export the entire table in CSV format and then import it into Excel. An example is provided on the Exercise 1 homepage.
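An exported transcript table can also be summarized with a short script instead of Excel. The sketch below assumes column names ("Transcript ID", "Biotype", "RefSeq Match") and uses placeholder rows. Both are hypothetical, so check the header row of your own export and adjust the names accordingly.

```python
import csv
import io
from collections import Counter

# Stand-in for an exported transcript table. The column names and rows are
# hypothetical placeholders; a real Ensembl export may differ.
SAMPLE_CSV = """Transcript ID,Name,Biotype,RefSeq Match
ENST_PLACEHOLDER_1,GENE-201,Protein coding,NM_PLACEHOLDER_1
ENST_PLACEHOLDER_2,GENE-202,Protein coding,-
ENST_PLACEHOLDER_3,GENE-203,Retained intron,-
"""


def summarize_transcripts(handle):
    """Count transcripts per biotype and list those with a RefSeq match."""
    rows = list(csv.DictReader(handle))
    biotypes = Counter(row["Biotype"] for row in rows)
    # Treat an empty cell or a lone dash as "no RefSeq match".
    with_refseq = [
        row["Transcript ID"]
        for row in rows
        if row["RefSeq Match"].strip() not in ("", "-")
    ]
    return biotypes, with_refseq


biotype_counts, refseq_matched = summarize_transcripts(io.StringIO(SAMPLE_CSV))
print(dict(biotype_counts))
print(refseq_matched)
```

To run it on a real export, replace io.StringIO(SAMPLE_CSV) with a file opened as open(path, newline="").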

NOTE: The Ensembl site generally makes it easier to deal with lists of genes (both importing and exporting). The NCBI site has better cross-database functionality and is better integrated with the literature.

You should note several things about these transcript searches:


1. TP53 has a large number of transcript isoforms. Not all human genes have this many, but if you want to conduct a whole genome expression experiment, one consideration is whether to analyze the data on a gene (~25,000) or transcript (~160,000) level.

2. The transcript variants differ between Ensembl and NCBI, though Ensembl lists those that are common to both sites.

3. Ensembl makes it easy to distinguish between transcripts that are protein coding or not, and also between transcripts with good experimental evidence versus computationally predicted transcripts.
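To make the gene-level versus transcript-level choice in point 1 concrete, the sketch below collapses transcript-level counts into gene-level totals by summing. It also computes the average number of transcripts per gene implied by the approximate figures above. The gene names and counts are made-up placeholders, and summing is only one simple way to aggregate.

```python
from collections import defaultdict

# Hypothetical transcript-level read counts:
# (gene symbol, transcript label, count).
TRANSCRIPT_COUNTS = [
    ("GENE_A", "GENE_A-201", 120),
    ("GENE_A", "GENE_A-202", 30),
    ("GENE_B", "GENE_B-201", 75),
]


def collapse_to_genes(records):
    """Sum the counts of all transcripts that belong to the same gene."""
    totals = defaultdict(int)
    for gene, _transcript, count in records:
        totals[gene] += count
    return dict(totals)


gene_counts = collapse_to_genes(TRANSCRIPT_COUNTS)
print(gene_counts)

# Approximate figures quoted in the tutorial.
average_transcripts_per_gene = 160_000 / 25_000
print(average_transcripts_per_gene)
```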

Exploring the genomic context of genes using Ensembl and UCSC Genome browser

The genomic context means where on the genome the gene is located. That is:

• Which chromosome
• Where on that chromosome
• What strand
• What genes are upstream/downstream

Genome browsers offer a way to visualize data that can be placed on a chromosome. These data are included as additional tracks of information (from a few to hundreds depending on the genome) and include such data as:

• Location of repetitive sequences
• Level of homology to other genomes
• SNP or variants within the genome of interest
• TF binding sites

The data behind a genome browser is enormous and can be quite complex to sort through. This amount of data can also be slow to load. Spend some time turning tracks on and off and following links or pop-ups that explain the different data sources. We will use both the UCSC and Ensembl genome browsers for this exercise. Both allow you to export images of the browser window and offer links to download sequence data.

Ensembl genome browser

To access the Ensembl genome browser, click on the Location tab (which should have the title Location: 17:7,661,779-7,687,550). This indicates that the gene is located on Chromosome 17 between the coordinates 7,661,779 and 7,687,550. The first section shows a schematic of the chromosome with a red box around the coordinates of the gene (Fig. 1). If you click on the Assembly Exceptions link, you can turn off that track and are left with just the box highlighting the gene.
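A location string such as 17:7,661,779-7,687,550 is easy to split into chromosome, start and end for use in a script. The sketch below is one way to do it. The names Region and parse_location are illustrative, and the length assumes 1-based inclusive coordinates.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int

    @property
    def length(self):
        # Both endpoints are counted (1-based, inclusive coordinates assumed).
        return self.end - self.start + 1


# Accepts "17:7,661,779-7,687,550" as well as "chr17:7661779-7687550".
LOCATION_PATTERN = re.compile(r"^(?:chr)?([\w.]+):([\d,]+)-([\d,]+)$")


def parse_location(text):
    """Parse a browser location string into a Region."""
    match = LOCATION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized location string: {text!r}")
    chromosome, start, end = match.groups()
    return Region(chromosome, int(start.replace(",", "")), int(end.replace(",", "")))


tp53 = parse_location("17:7,661,779-7,687,550")
print(tp53.chromosome, tp53.start, tp53.end, tp53.length)
```

For the TP53 location this gives a length of 25,772 bases, which matches the roughly 25 Kb region shown in the browser.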

Figure 1: Chromosome ideogram of chr17 with the region for TP53 shown as a red box


Scroll down to the next section and you’ll see the chromosome region in more detail, with the TP53 gene in the middle. This gives you an idea of the genomic context of the gene of interest.

Scroll down to the next section and this will display the 25 Kb region that encompasses the largest transcript isoform of the gene. You can see all the different splice variants. They are color coded by experimental support and whether they are protein coding or not. Click on one of the transcripts and it will open a pop-up window with additional details about that transcript. You can right-click on the links within the pop-up window to open the link in a new tab or window. Click on the X to close the window.

Scroll down further and you will see additional tracks of information, such as SNP locations, associated phenotypes and %GC. These tracks can be expanded and turned on and off. It can take a while for the changes to be implemented, depending on how long a chromosomal region you are working with and how much data is in the track.

If you scroll back to the top of this section, you can zoom in or out. Sometimes tracks won’t expand because you are viewing a section large enough that there would be too much information to display. If you tried expanding a track and nothing happened, try zooming in such that you are displaying <10 Kb of sequence. That will usually allow any track to be expanded. Figure 2 shows a portion of the TP53 transcript with an expanded track of SNPs.

Figure 2: Part of the TP53 transcript variants with expanded SNPs below.


Using the UCSC Genome browser

Below the headers is a dark blue bar with the link Genomes. Mouse over it and select human genome GRCh38/hg38. Alternatively, click the link and it will open a search window with the latest human assembly as the default option. Type TP53 into the search text box and it will list many possible matches. Select the second one, which corresponds to tumor protein p53 (from HGNC TP53). This should open a window that looks something like Fig. 3.
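Instead of searching by name, the UCSC browser can be opened directly at a set of coordinates by passing them in the URL. The sketch below builds such a link for the hg38 assembly using the TP53 coordinates reported in the Ensembl section. The function name ucsc_url is illustrative, and the code only constructs the URL without contacting UCSC.

```python
from urllib.parse import urlencode

# Entry point of the UCSC Genome Browser track display.
UCSC_TRACKS_URL = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ucsc_url(chromosome, start, end, assembly="hg38"):
    """Build a UCSC Genome Browser link for a chromosomal region."""
    # UCSC chromosome names carry a "chr" prefix (chr17 rather than 17).
    name = chromosome if chromosome.startswith("chr") else "chr" + chromosome
    params = {"db": assembly, "position": f"{name}:{start}-{end}"}
    return UCSC_TRACKS_URL + "?" + urlencode(params)


url = ucsc_url("17", 7661779, 7687550)
print(url)
```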

The gene size and the coordinates of where this gene falls on Chr17 should be very similar, if not identical, to the coordinates listed in the Ensembl browser. Scroll down through the graphics. Clicking on the graphic or on the name of a track will pop open a window with information about the track. Click on any single transcript to see details about the transcript. A FEW of the questions you can ask with a genome browser include (depending on the genome and available track information):

1) What genes are located near it or may share promoters?
2) What SNPs are found in my gene and are they located in introns, promoters or exons?
3) What strand is my gene encoded on?
4) What regulatory elements are located within or near my gene?
5) What clinical variants are associated with my gene?
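Question 2 can also be answered in a script once exon coordinates have been downloaded from a browser. The sketch below labels a position as exon, intron or outside the gene. The gene and exon coordinates are made-up placeholders, and promoters are not modeled because their boundaries depend on the annotation you use.

```python
def classify_position(position, gene_start, gene_end, exons):
    """Label a position relative to one gene's exon structure.

    `exons` is a list of (start, end) pairs using inclusive coordinates.
    """
    if not gene_start <= position <= gene_end:
        return "outside gene"
    for exon_start, exon_end in exons:
        if exon_start <= position <= exon_end:
            return "exon"
    # Inside the gene but not inside any exon.
    return "intron"


# Hypothetical gene spanning 1000-5000 with three exons.
GENE_START, GENE_END = 1000, 5000
EXONS = [(1000, 1200), (3000, 3300), (4800, 5000)]

for snp_position in [1100, 2000, 4900, 6000]:
    print(snp_position, classify_position(snp_position, GENE_START, GENE_END, EXONS))
```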

Spend some time exploring the tracks and looking up what they represent and how the data is presented. You may find some of the information pertinent to your research project.

Figure 3: UCSC view of TP53