Tools in Bioinformatics: Genome Browsers. Retrieving genomic information. Previous lesson(s):...

Post on 18-Jan-2016

  • Tools in Bioinformatics: Genome Browsers

    Copyright OpenHelix. No use or reproduction without express written consent

  • Retrieving genomic information

    Previous lesson(s): annotation-based perspective of search/data

    Today: genomic-based perspective: look at all the data through the prism of a specific chromosome location

    Next: sequence-based searches

  • Genome browsers

    NCBI Map Viewer: http://www.ncbi.nih.gov/mapview

    Ensembl: http://www.ensembl.org/

    UCSC Genome Browser: http://genome.ucsc.edu/

  • Important note to slide users: To maintain the color schemes/cues and the animations, if you import these slides into other slide sets, please click the checkbox in the PowerPoint Insert window that maintains slide format. Otherwise important information may be lost.

    [Screenshots: Mac users; PC users]

  • The UCSC Genome Browser: Introduction

    Version 16a_0209

    Materials prepared by Mary Mangan, Ph.D., www.openhelix.com

    Updated: Q1 2009

  • UCSC Genome Browser Agenda

    UCSC Genome Browser: http://genome.ucsc.edu

    Agenda: Introduction and Credits; Basic Searches; Understanding Displays; Get Details or Sequences; Sequence Searches (BLAT); Summary; Exercises

    Introduction and Credits

  • Organization of Genomic Data

    Genome backbone: base position number, sequence

    Links out to more data

  • A Sample of the UCSC Genome Browser

    official sequence

  • UCSC Genome Browser Credits

    Led by David Haussler and Jim Kent

    Dozens of staff and students bring you this software and data

    Development team: http://genome.ucsc.edu/staff.html

  • The UCSC Homepage: http://genome.ucsc.edu

    General information

    Specific information: new features, current status, etc.

  • Genome Browser Gateway: start page, basic search

    Use this Gateway to search by:

    Gene names, symbols, IDs

    Chromosome number: chr7, or region: chr11:1038475-1075482

    Keywords: kinase, receptor

    See lower part of page for help with format
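    The chromosome and region formats accepted by the Gateway (chr7, or chr11:1038475-1075482) can be told apart from gene names and keywords with a small parser. The sketch below is only an illustration of the format, not the browser's own parsing code; the names Position and parse_position are hypothetical, and the pattern assumes chromosome names always begin with "chr".

```python
import re
from typing import NamedTuple, Optional


class Position(NamedTuple):
    chrom: str
    start: Optional[int]  # None when only a chromosome was given
    end: Optional[int]


# Accepts "chr7" or "chr11:1038475-1075482"; commas inside numbers are tolerated.
_POSITION_RE = re.compile(r"^(chr[0-9A-Za-z_]+)(?::([\d,]+)-([\d,]+))?$")


def parse_position(text: str) -> Optional[Position]:
    """Return a Position for chromosome-style queries, or None for anything else."""
    match = _POSITION_RE.match(text.strip())
    if match is None:
        return None  # probably a gene symbol, ID, or keyword
    chrom, start, end = match.groups()
    if start is None:
        return Position(chrom, None, None)
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i > end_i:
        raise ValueError(f"start {start_i} is greater than end {end_i}")
    return Position(chrom, start_i, end_i)
```

    A query such as "kinase" returns None here, so it would be treated as a keyword. A keyword that itself happens to begin with "chr" would be misread by this simple pattern.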

  • The Genome Browser Gateway

    Make your Gateway choices:

    Select Clade

    Select genome = species: search 1 species at a time

    Assembly: the official backbone DNA sequence

    Position: location in the genome to examine

    Image width: how many pixels in display window; 5000 max

    Configure: make fonts bigger + other choices

  • The Genome Browser Gateway: sample search for Human TP53

    Sample search: human, March 2006 assembly, tp53

    Select from results list

    ID search may go right to a viewer page, if unique

  • Overview of the Whole Genome Browser Page (mature release)

    Groups of data (Tracks)

  • Different Species, Different Tracks, Same Software

    Species may have different data tracks

    Layout, software, functions the same

  • Sample Genome Viewer Image, TP53 Region

    base position

  • Visual Cues on the Genome Browser

  • Options for Changing Images: Upper Section

    Change your view or location with controls at the top

    Use base to get right down to the nucleotides

    Configure: to change font, window size, more

    Next item, next exon navigation assistance can be turned on

    [Labels: Specify a position; Fonts, window, next item, more; Walk left or right; Zoom in; Zoom out; Click to zoom 3x and re-center]

  • Annotation Track Display Options

    Some data is ON or OFF by default

    Menu links to info about the tracks: content, methods

    You change the view with pulldown menus

    After making changes, REFRESH to enforce the change

    Links to info and/or filters

  • Annotation Track Options Defined

    Hide: removes a track from view

  • Mid-page Options to Change Settings

    You control the views

    Use pulldown menus

    Configure options page

  • Cookies and Sessions

    Your browser remembers where you were (cookies)

    To clear your cart or parameters, click default tracks or reset

    Save your setup as sessions and store/share them

  • Click Any Viewer Object for Details

    Example: click your mouse anywhere on the TP53 line

  • Click Annotation Track Item for Details Pages

    Not all genes have this much detail.

    Different annotation tracks carry different data.

  • Get DNA, with Extended Case/Color Options

    Use the DNA link at the top

    Plain or Extended options

    Change colors, fonts, etc.

  • Get Sequence from Details Pages

    Click a track, go to Sequence section of details page

  • Accessing the BLAT Tool

    BLAT = BLAST-like Alignment Tool

    Rapid searches by INDEXING the entire genome

    Works best with high similarity matches

    See documentation and publication for details: Kent, WJ. Genome Res. 2002. 12:656
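    The idea of searching rapidly by indexing the genome can be illustrated with a toy k-mer index. This is a heavily simplified sketch and not BLAT itself: it assumes exact k-mer seeds only, indexes non-overlapping k-mers of a tiny in-memory "genome", and skips the extension and alignment stages entirely. The function names build_index and find_seeds are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def find_seeds(query: str, index: Dict[str, List[int]], k: int) -> List[Tuple[int, int]]:
    """Return (query_offset, genome_offset) pairs for every exact k-mer hit."""
    seeds = []
    for q in range(len(query) - k + 1):  # overlapping k-mers of the query
        for g in index.get(query[q:q + k], []):
            seeds.append((q, g))
    return seeds


# Toy data, made up for illustration only.
genome = "ACGTACGGTTCAGGCA"
index = build_index(genome, k=4)
print(find_seeds("GGTTCAG", index, k=4))
```

    The index is built once and reused for every query, which is why lookups are fast. Because only exact k-mer hits seed a match in this sketch, it also shows why such a search works best when query and genome are highly similar.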

  • BLAT Tool Overview

    Sample sequences: www.openhelix.com/sampleseqs.html

    submit

  • BLAT Results with Hyperlinks

    Results with demo sequences, settings default; sort = Query, Score

    Score is a count of matches: higher number, better match

    Click browser to go to Genome Browser image location (next slide)

    Click details to see the alignment to genomic sequence (2nd slide)

    [Labels: sorting; go to browser/viewer; go to alignment detail]

  • BLAT Results: Browser

    From browser click in BLAT results

    A new line with Your Sequence from BLAT Search appears!

    Base position = full menu and zoomed in enough to see amino acids in 3 frame translation

  • BLAT Results, Alignment Details

  • Introduction Summary: UCSC Genome Browser

    Visual cues and genomic context

    Many ways to alter your views

    Access to deeper data

    Access and use sequence data

  • Hands-on Session for Introduction

    Exercises provided as document

    We will walk through them together

    2 styles: questions only, and step-by-step

    When we have finished the formal exercises, we can help you to investigate issues that you want to understand for your research

  • Notice: The materials and slides offered are for non-commercial use only. Reproduction, distribution and/or use for commercial purposes is strictly prohibited.

    Copyright 2009, OpenHelix, LLC

    http://www.openhelix.com/ucsc

    Welcome to an OpenHelix tutorial.

    Welcome to the introductory tutorial on the UCSC Genome Browser.

    The University of California at Santa Cruz Genome Browser resource contains the reference (or official) public DNA sequences and working draft assemblies for human and a large collection of other genomes. There are a number of tools within this site that will provide access to the sequences themselves, and many other useful genome features to add context to the genomic information. Researchers can use this site to find genes and gene predictions, expression information, SNPs and variations, cross-species comparative data, and more.

    Our goal in this tutorial is to help you search, retrieve, and display the data that you want, which is relevant to your research. In this introduction we will provide an overview of the organization, graphical cues, and basic features of the searches and displays. We will explore the range of tools available for several types of searches. The materials in this presentation were prepared by Dr. Mary Mangan of OpenHelix, with guidance from the UCSC Genome Bioinformatics team.

    Separate tutorials about the UCSC resources address the more advanced topics of the Table Browser and Custom Tracks, the Gene Sorter, and a variety of other tools associated with the browser. After you have mastered this introductory material, be sure to check those out.

    The agenda for this tutorial is shown here.

    We will begin with an introduction and credits, and will move on to explore the UCSC Genome Browser resource with some basic searches; we will perform a basic text search, and examine the results and displays in some detail. Then we will use sequence data to perform a search, which uses the BLAT tool. I will summarize this material.

    Finally, you will have the opportunity to explore sample exercises on the UCSC Genome Browser site, to reinforce the concepts developed in this tutorial.

    Let's begin with the introduction and credits.

    A great deal of information has come to us from the official Human Genome Project, and from the official projects for many other species as well. But other data has come from individual laboratories doing traditional benchwork; some has come from the literature; and some of the data has come from new large-scale technologies that have arisen in the last few years, such as microarray data for gene expression detection.

    So there are tremendous volumes of data available, and many places to try to find it. The UCSC Genome Browser is a great resource because it organizes this material in one place. It uses the backbone of the genome (the official backbone sequence of the Human Genome Project, sometimes nicknamed the Golden Path) and combines this data with all kinds of other useful and important biological information, such as chromosome banding patterns, known genes, gene predictions, expression data, comparative genomics and evolutionary conservation, SNPs and other variations, and so on.

    As I illustrate in this diagram, the data is organized along the official genomic sequence backbone. The other data types are referred to as Annotation Tracks and are aligned on the genomic backbone framework. These tracks provide additional information about any given genomic region of interest.

    All of this data is aligned in one place so you can quickly find new information, and context, about regions important for your work. In addition, all the data links out to other databases, web sites, and literature so you can go as deep as you want into any specific topic in which you may be interested.

    On the previous slide I diagrammed the UCSC Genome Browser representation of the genome and the annotation data. Briefly, I want to show you a sample of the kind of data we will examine as it actually looks in the Genome Viewer. Here you see a portion of the genome viewer, with the base positions (the official genome sequence) at the top, and the many layers of data (annotation tracks) organized in that region. If you click any of the features in this data, you will be presented with even more detail about the items you see. The detail pages themselves link out to more resources, too. Shown here are some examples of Gene Details, cross-species alignment data, and SNPs.

    So much data, so well organized, is right at your fingertips now, thanks to the UCSC Genome Bioinformatics Group team.

    We would like to start out by giving full credit to the developers of the tools right away. The leaders of the project are David Haussler and Jim Kent, but of course there is no way to build a tool like this by yourself. There is a terrific team involved in developing and maintaining this data and software, and they are continually adding new and useful features and species, and performing quality control checks. If you are interested in what kind of jobs are associated with projects like the UCSC Genome Browser, take a look at the list of folks involved here on the credits page.

    Of course, the UCSC folks themselves credit many others for their help with the creation of these tools, from the funding sources who pay the bills to the many other data providers out there. The software is developed and the data is maintained by the UCSC Genome Browser team. The data comes from a wide range of sources: from researchers all over the world, who have contributed to the Human Genome Project and many other species data projects. Many other data types have been added by groups beyond the official sequencing projects as well. The UCSC Group is also the official Data Coordination Center repository for the data coming from the ENCODE (Encyclopedia of DNA Elements) project.

    OpenHelix is a separate company that provides training on bioinformatics resources. UCSC sponsors us to provide the freely available training materials offered here.

    One final note: this software is free to use for academic, non-profit, and personal use, and if you are in a company setting, you can use the website freely. However, if you want to have a copy of the software and database installed behind a company firewall, you will need to obtain a license. Please contact the UCSC folks if you have any questions about licensing the software. There is a link to license information from the Genome Browser homepage.

    [end of Introduction and Credits] That completes our introduction and credits.

    [beginning of Basic Searches] In this section we explore the basic text searches.

    Shown here is the homepage for the UCSC Genome Bioinformatics site. When you first arrive, you will see a page that looks like this. At the top there is a section that contains general information about the site. Next, there is a specific section for news: new species, new features, software or data changes, and the current state of the data that is available. This information is worth a quick check when you visit the site, in case there have been changes since the last time you visited.

    But the real substance of the site, the data and tools, is accessible in a couple of ways from this page. There are navigation bars at the top and left side which will permit you to access all of the available features. You will begin your experience at the UCSC Genome Browser by navigating from these blue areas.

    To actually get in and start performing basic searches in the database, there are several options: you can search by text, such as gene name, gene symbol, keywords, ID, and so on. To do this we will use the Genomes or the Genome Browser link. You may notice that many of the features are accessible from both navigation bars. I have circled the Genome Browser link, which you would click to access the search page, which is known as the Gateway.

    Shown here is a portion of the Genome Browser Gateway page. By default the search is set to Human when you first arrive, but we will see that you can change the species later.

    We will begin to talk about searching using the text search feature from this Genome Browser Gateway page.

    You can do a text search for information such as gene names, chromosome number, chromosome region, your favorite gene or marker identification number (ID), GenBank submitter name, and more. You can use a keyword to find records. Examples of the kinds of searches you could do are shown on the lower part of this page: see the request items, and the expected responses from the genome browser. Remember that you can check this section for helpful reminders of the correct query format when doing your own searches later on.

    We are going to go a little deeper into your search options from this gateway. We'll take each option and explore what you can expect from a given search.

    Here we are going to focus on the options that you have to search a genome using the Gateway page. This screen shot isolates that part of the page for us so we can focus on the specific options that are available to you.

    The first option is clade, and then the second is the genome, or species, choice. At one time all of the species were in a single list, but there are so many species now that they have been re-organized into these 2 menus. You will search one species at a time in the Genome Browser. Use the pulldown menus to select and highlight the species name that you want to use in your search.

    Next, you have to choose an assembly. Assembly refers to the official backbone genomic sequence (sometimes referred to as the reference sequence or reference genome) that is used to create the framework on which to hang all the other data. The reference sequence comes from the official groups who release genome sequence data. UCSC obtains the official assembly, and then generates the annotation tracks for that genome. The source of that assembly and any version number from the sequencing group is indicated in that species' gateway information section. The date UCSC obtains that sequence is what we see in the Assembly menu. Usually you will want the most current assembly, but sometimes you may want to look back at older data, and you can see that it is still available for a while. Even older data is still available in the UCSC archives if you need it. Archives are accessible from a link on the homepage left navigation menu.

    Position or search term is the 4th option. This is where you put the symbol, keyword, or ID information for the location you want to examine in the genome.

    Image width is the number of pixels used to make the genome viewer window you will see. You can set this from around 300 up to 5000 pixels if your monitor is that big. But it will require you to scroll around some.
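    The Gateway choices (assembly, position, image width) can also be thought of as parameters of a link to the browser. The sketch below assembles such a link under several assumptions that should be checked against the UCSC documentation before use: that the viewer is served from cgi-bin/hgTracks, that the parameters are named db, position, and pix, and that hg18 is the database name for the March 2006 human assembly. The function name browser_url is hypothetical, and the 300 to 5000 limits simply echo the approximate range mentioned above.

```python
from urllib.parse import urlencode

MIN_WIDTH, MAX_WIDTH = 300, 5000  # approximate limits mentioned in the tutorial


def browser_url(db: str, position: str, width: int = 800) -> str:
    """Build a Genome Browser link; the width is clamped to the allowed range."""
    width = max(MIN_WIDTH, min(MAX_WIDTH, width))
    # Parameter names db, position, and pix are assumptions; verify before relying on them.
    query = urlencode({"db": db, "position": position, "pix": width})
    return f"http://genome.ucsc.edu/cgi-bin/hgTracks?{query}"


# Example: the region format shown on the Gateway slide, with an oversized width.
print(browser_url("hg18", "chr11:1038475-1075482", width=9000))
```

    Clamping rather than rejecting an out-of-range width is a design choice for this sketch; raising an error would be equally reasonable.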

    The last thing I'll point out here is the button for configuring tracks and displays. You can make changes here to the display, such as the font sizes and feature appearance, but later I'll show you a couple of other places you can access this as well. If you are finding that the text on the viewer is too small, or the arrowhead features are difficult to view, configure their size and alter other aspects of the viewer here.

    Now that we have examined the search options, let's perform a sample search of this database.

    The search that I'll be demonstrating uses the HUMAN genome, the March 2006 assembly. If you are seeing these slides at a time when there is a later assembly, things might look slightly different.

    For this example, I'm going to use the human TP53 gene. This is an important and medically relevant gene that has been implicated in some cancers. It is a well characterized gene for our example.

    Once you have made the appropriate selections among the options and added your position or search text, you would click the submit button and wait for your results, which we see below.

    Here I show a part of the results page for the text search for TP53. That text appears in a number of different records, so you have to select the one you want from this results page. Sometimes you can go directly to the browser; if you use a specific accession number, that might happen. However, with text searches you will often have to select from the records. Usually I choose a record that appears to be the correct gene symbol or name. And if there appear to be multiple entries that are likely to be splice variants, I may select the longest of them (as indicated by the nucleotide range at the end of the link). We simply have to choose one to move to that genomic region; as you will see, the other versions of that gene will be visible on the viewer when we get there.
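    The rule of thumb of picking the longest of several likely splice variants can be written down in a few lines. This is only a sketch of that selection rule; the record labels and coordinates below are made up for illustration and are not real TP53 search results, and the name longest_record is hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str, int, int]  # (label, chromosome, start, end)


def longest_record(records: List[Record]) -> Record:
    """Pick the record spanning the most bases (coordinates treated as inclusive)."""
    if not records:
        raise ValueError("no records to choose from")
    return max(records, key=lambda r: r[3] - r[2] + 1)


# Hypothetical results list; labels and coordinates are invented for this example.
hits = [
    ("variant_a", "chr17", 1000, 4000),
    ("variant_b", "chr17", 1000, 9000),
    ("variant_c", "chr17", 2500, 9000),
]
print(longest_record(hits)[0])
```

    Any of the records would take you to the same neighborhood; choosing the longest simply makes it more likely that the initial window covers all the variants.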

    For my example here, I will choose the link that says uc002gij.2, tumor protein p53, near the middle of the TP53 entries. Click that link to go to the TP53 position in the genome: we will go to chromosome 17, in the nucleotide range that is shown.

    [end of Basic Searches] That completes our introduction to the basic text search.

    [beginning of Understanding Displays] In this section we explore the results of searches as they appear in the Genome Viewer, to understand the layouts, displays, and controls for the viewer.

    Shown here is an overview of the page that results from clicking the link in our results list. I use this slide to illustrate the major organizational concepts of the Genome Browser.

    At the top of the page you will see the Genome Viewer section. Here you will see the diagrammatic representation of the genome and annotation track features in this region. Soon we will examine this data and the visual cues in more detail.

    At the bottom of the page you will see the controls that you can use to turn the data in the viewer on or off. The data is organized into GROUPS for quickly finding data of interest. These are groups of similar data, such as Mapping and Sequencing Tracks, Genes and Gene Prediction tracks, and so on. Each GROUP contains the individual TRACKS, or the rows of annotation. A group at the bottom of the page corresponds to a section in the viewer. Here I illustrate that the data from the Mapping and Sequencing tracks group is displayed in the uppermost part of the viewer. Next, the Genes and Gene Prediction tracks are located in the next section down in the viewer. In the viewer the separate GROUPs are indicated by the color change along the left side of the image area, from gray to blue. Understanding this Group and Track organization will also help you to understand the Table Browser functions we'll discuss later.

    Other data types in human include Phenotype and Disease tracks, mRNA and EST data, Expression data (such as microarray data sets), Regulation (including data such as Transcription Factor Binding sites), Comparative genomics data with many species comparisons and individual species comparisons, Variation and Repeats with SNPs and copy number variation and more. At the bottom a special group of tracks of ENCODE or Encyclopedia of DNA Elements project data is provided. This special project data will be described in another tutorial section.

    This is a Genome Browser page at a mature stage of this assembly. You can see that there are many track and image controls down at the bottom of the page. At the very beginning of a release there is only a core set of tracks; not all of the tracks are available. Over time these will be added to the browser, so the actual track options you see will accumulate. Tracks take time to create, both within UCSC and from other contributors all over the world. So, on the first day of a new release the SNPs may not be there. However, they will appear over time.

    A key point to make here: the official reference sequence that forms the framework for this assembly will remain frozen over the course of time. However, the data in the annotation tracks may change. It may be updated periodically; for example, new data for ESTs and mRNAs is downloaded from GenBank every week. New data types may be added, or tracks may be updated, at any time. So although the official sequence remains the same, the annotation track data may change.

    Another point to make at this time is that the UCSC Genome Browser has genome browsers for dozens of different species. Here are a few of the images of these different species. As you can see from a quick look, for each species the interface and display are very similar, and the way the software works will be similar as well.

    Although we are focusing on the human genome in our slides today, you should know that all these species share the software functionality that we will be talking about.

    However, different species will have different annotation tracks. Just because you see a certain track in the human browser, it does not mean that the same track will be available in Fugu, for example. Similarly, there may be data in yeast that will not be available in the human genome browser.

    At this time, let's focus on the viewer section of the Genome Browser. This is the default view, after our search for TP53. I want to quickly orient you to the things that you are seeing when you look at the default setup of the genome viewer. One of the first things to notice is that we can see that we are in the position of the genome that we expected by looking at the label on the side of the UCSC gene track, which indicates the TP53 gene location, which I have highlighted in RED. Notice that one of the TP53 symbols is highlighted black: that is the specific one that we clicked from our results list to arrive here, and the one that supplied the coordinates for our current view.

    At the very top of the image there is a track called Base Position, which I have been calling the genome backbone. This is the actual base of every single nucleotide of the reference sequence. As you can see, we are on chromosome 17 around base number 7 million something-something-something. The viewer displays numbers unless you are zoomed all the way in to base, and then you would see the individual nucleotide letters A, T, G and C themselves.

    As you look down the viewer, you will see that many different data types are represented: UCSC genes, Mammalian Gene Collection clones, mRNAs, ESTs, evolutionary relationships compared across many species or as individual species, SNPs and repeats. This is just the default view, though; other data types are available for you to display.

    Immediately from the viewer, you can see that you have a lot of information and context about the TP53 region. Let's talk a little bit more about the display of the features in the viewer.

    Various data objects will be represented differently in the Genome Browser. For some objects, there are just single locations, or very short stretches of sequence. For example, STS (sequence tagged sites), or SNPs (simple nucleotide polymorphisms), are indicated by vertical tick marks. Sometimes if there are several close together they may look like a broader bar, but essentially these are indicating a single small location.

    For the UCSC Genes track, there are several cues provided. Coding region exons are the tallest boxes. Half-size boxes indicate exons that comprise the 5' and 3' untranslated regions, or UTRs. Further, you can tell the direction of transcription of this coding unit if you look at the little arrowheads on the intron sections, which point to the left or to the right. In the example diagram I have here, the arrowheads point to the left, indicating that this gene is transcribed from the 5' UTR on the right side to the 3' UTR on the left.

    For some tracks, colors have important meaning. For example, in the UCSC Genes track, the color BLACK indicates that there is a PDB (Protein Data Bank) structure entry for this transcript. Shades of blue indicate its status, which may be reviewed or provisional, for example. You should check the documentation for the specific color codes for different tracks. Another track that has important color codes is the SNPs track, where the SNPs can be colored to represent different characteristics of the SNP.

    Some data types are represented by a histogram. For example, some of the Comparative Genomics data in the track called Conservation is displayed as bars of varying height; tall bars indicate an increased likelihood of an evolutionary relationship in that region. This kind of track is sometimes called a "wiggle" track.

    Another visual indication of the sequence relationships can be seen in the single-species comparisons. Boxes indicate aligning regions, and lines indicate gaps. Single lines are simple gaps that represent likely insertions or deletions. Double lines represent more complex situations that could have a range of causes. More details on the possibilities can be found in the description of the display conventions in the browser documentation. Zooming and clicking on the display will bring you more information about the specific sequences involved.

    The different tracks will have different colors, shapes, etc. If you have a question about a specific representation, you should check the documentation for an explanation of the significance. Understanding these representations will help you to quickly grasp many of the features in any genomic region.

    *In addition to the view of the genome that you see when you first arrive, you have the option to make lots of changes to the area of your view. Here I show the upper section of the Genome Viewer page, with several controls for adjusting your view of the genome.

    You can use the move buttons with the arrowhead indicators to walk left or right along the chromosome in this area. You can take big steps (with the triple arrowhead), medium steps, or little steps (with the single arrowhead). These can be very handy if you are interested in what's going on near your search region.

    You can magnify the image area using the zoom in buttons, and as you can see you can zoom in a little bit, or up to 10-fold! Or you can choose "base" to zoom all the way down to the nucleotide level right away. Similarly, you can zoom out with a different set of buttons.

    Alternatively, you can indicate a specific genome coordinate position in the POSITION box. For example, if we wanted to see more of the possible promoter or downstream regions, we could subtract 1000 from the 5' side, and add 1000 to the 3' side, and get all of that extra sequence in our view. In addition, you can use this box just like the search box on the Gateway page: you can use it to search for text items if you enter text and click jump.
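    The coordinate arithmetic described here can be sketched in a few lines of Python. This is only an illustration and not part of the browser: the helper name `pad_position` and the example coordinates are made up for this note, and the sketch simply widens the window by subtracting from the left-hand coordinate and adding to the right-hand one.

```python
def pad_position(position, flank=1000):
    """Widen a 'chrom:start-end' position string by `flank` bases per side.

    Hypothetical helper for illustration only; the browser just needs the
    resulting string typed into the position box.
    """
    chrom, span = position.split(":")
    # The browser shows commas in coordinates, so strip them before parsing.
    start_text, end_text = span.replace(",", "").split("-")
    start = max(1, int(start_text) - flank)  # never go below base 1
    end = int(end_text) + flank              # no chromosome-length check here
    return f"{chrom}:{start}-{end}"


# Example coordinates are illustrative, not taken from the slides.
print(pad_position("chr17:7,512,445-7,531,642"))
```

    The returned string can be pasted into the position box as-is. The sketch does not know the chromosome length, so a very large flank near the end of a chromosome would need an extra check.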

    Another handy feature is the automatic zoom and re-center action. If you click your mouse on the Base Position track at the very top, the browser will automatically re-center the image where you clicked, and zoom in 3-fold.
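    To make the re-center and zoom arithmetic concrete, here is a rough Python sketch. It assumes the new window is one third of the old width and is centered on the clicked base; the exact rounding the browser uses is not stated in this tutorial, so the function `recenter_and_zoom` is a hypothetical approximation.

```python
def recenter_and_zoom(start, end, clicked_base, factor=3):
    """Return a (start, end) window centered on `clicked_base`.

    The new window is `factor` times narrower than the old one.
    Rounding behavior is an assumption made for this sketch.
    """
    width = end - start + 1
    new_width = max(1, width // factor)
    # Center the window on the click, but never start before base 1.
    new_start = max(1, clicked_base - new_width // 2)
    new_end = new_start + new_width - 1
    return new_start, new_end


# A 3,000-base window, clicked in the middle, becomes a 1,000-base window.
print(recenter_and_zoom(1000, 3999, 2500))
```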

    Finally, you could change the way your viewer looks with the configure button. From this button, you will access a page that gives you some choices about how this page should look, including changing the font and graphics sizes. This is the same configure option we saw from the Gateway search page. One thing I didn't mention before is that you can also activate the "next item" or "next exon" navigation option from the configure page. This will offer arrows that help you to jump to the next gene, or other next item, depending on your track of interest.

    Those are the controls at the upper part of the page. Mostly they move you along the genome horizontally or change the nucleotide position, affecting the entire viewing area. In the next few slides we'll talk about controlling the individual annotation tracks down below on the Genome Viewer page with the track controls, which alter the types of data displayed in your viewer.

    *At the bottom of all the Genome Viewer pages are the controls for the data, the annotation tracks. This slide shows just a part of that section.

    In this slide I have focused on just one category area: Mapping and Sequencing Tracks. However, the pulldown menu definitions are the same for all of the annotation tracks.

    The first important point is this: when you arrive at a fresh Genome Browser, some tracks are ON by default, and others are HIDDEN by default. For example, note that the display menu option for Base Position says "dense". And see also that the display menu option for Chromosome Band says "hide" and is grey in color. So, when you first arrive at the genome browser you are being shown only the default set of items which are already turned on.

    Some of the annotation track names are pretty clear: UCSC Genes, or Human ESTs for example. Other names may seem a little bit less apparent. If you aren't sure what type of data the track contains, all you need to do is click the hyperlink above the menu. Those links will present a page of information about the data in that track: the description of the data, the source of the data, any filters that might be available for that data, and possibly publications about the data if they are available. There are so many data types, and new ones are being added all the time. Yet it is easy to learn about the details of these annotation tracks from these links.

    Once you find the data types you want to see or hide, you can use the pulldown menus here to turn any individual annotation track ON or OFF. There are several options for data visibility here, and I'll define those in the next slide.

    Right in the center of the page there are some handy buttons that I will describe in more detail later. For now, you need to know that if you make any changes to the menus, you have to click a refresh button to apply those changes and actually see them in the viewer.

    *Here I will illustrate the different appearances of the menu selections, using the Human ESTs (expressed sequence tags) track as an example. I show the same region of human chromosome 17 as our TP53 gene, in the Human ESTs section of the viewer, using the different menu options:

    Hide: completely removes the data from your image.

    Dense: all items become collapsed into a single line; it fuses all the rows of data into one line. In this case it means that you can see where there is EST coverage, but you don't know anything about individual ESTs in this view.

    Squish: each item is on a line, but the graphics are only 50% of their regular height. Here you can see more information about individual ESTs.

    Pack: each item is separate, but efficiently stacked like sardines. However, they are full-height diagrams, which makes it different from squish. Here you can see the GenBank accession numbers for the ESTs, which may be useful.

    Full: each item is on its own separate line, all the way down the browser viewer, up to a certain number of rows. If you have more than a couple of hundred items here the browser can become overloaded, and it will automatically revert to the more efficient Pack view.

    To choose any of these options, just highlight it in the pulldown menu. To make the changes appear, you must click the refresh button that appears in several locations on the genome browser page.

    Let's return to a few of the other page button options now.

    *The final features I wanted to mention about controlling the Genome Viewer image are illustrated in this slide. This is a screen shot of the area around the middle of the Genome Browser page.

    First, let me draw your attention to the control buttons. The default tracks button will get you back to the default settings; it is like an escape hatch if you made a lot of changes on the image and want to start over. The hide all button is nice if you want to set up a specific display with only those annotation tracks that you want; it will let you start to build a nice customized view for yourself with only those things you care about.

    We will talk about the custom tracks button in another tutorial.

    Configure is a button we have seen before, up above on this page, and also on the Gateway page: this button gives you access to a big web page that will let you make all sorts of changes to the viewer. You will be able to change the font and graphic size here; you can also change the window width (in pixels again) from this page. You can make broad changes to all the track menus, which are all together and grouped on this page for quick access to entire sections.

    There may be times that you would rather see your region of interest in the reverse orientation. Clicking the reverse button will quickly accomplish that.

    We have also seen the refresh button before. You have to click this button to apply any of the changes you made to those pulldown menus in the annotation tracks. The changes in the pulldown menus are NOT made automatically; you have to click this button.

    I hope this provides some guidance on the many ways that you can control the Genome Browser viewer to visualize the data that is important for your research.

    *One thing that is important to know about changes you have made to the viewer: the browser remembers your changes until you clear them. A cookie is stored on your browser that remembers where you were looking in the genome, and whether you made changes to those menus.

    As we have discussed, there are a number of changes you can make: to the position, the track displays, and even the filter options (which we really didn't cover here, but are covered in the exercises). These parameters are all saved on the computer you are using. This may be great: you may always want to look at the data the same way. Or, as you move from one tool at this site to another, you carry your position with you. But that may not be great if you have forgotten that you filtered out something, or turned off a track. And if you use a shared computer in the lab or a library, you don't know if someone made some changes since you used the browser last.

    The UCSC team refers to these settings as being stored in your "cart". There are a couple of ways to clear out your cart: you can choose the default tracks button from the Viewer controls to reset the viewer to default settings, or you can use the link that says "Click here to reset" on the Gateway starting page, which wipes out any cart choices.

    If you ever find that your genome browser isn't behaving quite like you expect, try to clear your cart and start again.

    Another handy feature is the Session option. If you have a configuration that you want to store and return to examine later, or if there is some region you want to point people to specifically, you can save your view as a session. At the top of a viewer page there is a link for Session where you can accomplish this. You will need a login for the Genome Browser Wiki system, but once you have that you will see how easy it is to save views, segments, track configurations, and so on. You can save multiple sessions, and they can be uniquely named. You will get a URL that you can share if you like, or a session can be private.

    *[end of Understanding Displays] That completes our examination of the Genome Viewer display features.

    [beginning of Get Details or Sequences] In this section we go deeper than the display to find details about the items we see, and to obtain the actual sequences.

    *We have spent a great deal of time on the Genome Viewer image, which offers a great deal of visual information about the genome data context and annotation tracks. But there is much more data available to you still.

    Here I've just shown the small area of the annotation track image that has been our focus, the upper section in the TP53 region, with our TP53 likely splice variants. You will remember that the one with the black highlight around the gene symbol is the one we selected in our original search. And the black color of that line indicates that this entry corresponds to an entry in the PDB, or Protein Data Bank.

    We want to know more information about that item specifically. To learn more, all you need to do is put your mouse on that line and click that item.

    When you do so, a new web page will open. Here I show just the upper section of the TP53 gene description page for this item. You will find many important details about the object that you clicked just one page down from the viewer.

    The point is that one click away, on any item in the Genome Viewer, there is a LOT more information available to you. Let's look at an entire sample page.

    *As I showed on the previous slide, one level down there are description or information pages that contain a great deal of additional information about that gene (or predicted gene, or SNP, or other item) in the viewer. I'm going to show just one sample here of the detailed information on the human TP53 UCSC Gene page. But the other types of data also have lots of additional information one layer down.

    This page is actually quite huge, and I know that you won't be able to see all the details right now. But later you should go and see for yourself. There is extensive information about this gene, and links to many other resources as well. Practically one-stop shopping for known genes!

    One thing to know: not all genes will have this level of detail, and not every species will have all this information. I have specifically chosen a well-known gene for our example. Some genes won't have protein structures, some won't have pathway information, some won't have microarray data. But if the data is available, it will be available to you on these detail pages.

    Other pages will carry different types of data, of course. I attached here a small part of a SNP detail page: position, sequence, validation status, function, and so on. Different data types will have different details pages. You only have to click on any item in the viewer to get to these details pages.

    *So far, we have seen visual cues, and lots of text-based data. But one Frequently Asked Question that people have at this point is "where is the sequence data?" I want to spend a couple of slides on that topic so that you will know that you can get to the sequence-level data. From the viewer, there are two handy ways to get the sequence information.

    First, from your TP53 viewer section, you could simply click the DNA link in the blue navigation bar at the top of the page. The link will bring you to a new "Get DNA in Window" web page, shown in the center. As you can see, the position you were looking at in the viewer is carried here, and is specified in the position box. This takes whatever you were examining in your viewer window. On this page you have several options to format the sequence:

    You can tweak the output by adding some bases upstream or downstream.

    You can get the sequence in upper or lower case.

    You can mask repeated, low-complexity regions.

    Or you can get the reverse strand.

    You could just click the get DNA button to get the sequence in a new web page; the output will be in FASTA format.
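    For readers who want to reproduce a couple of these formatting options on a sequence they have already saved, here is a minimal Python sketch. It is not the browser's own code: the function names and the 50-letter line width are assumptions chosen for illustration. It shows how a reverse strand is derived (reverse the sequence and complement each base, keeping upper or lower case) and how a FASTA record is laid out (a `>` header line followed by wrapped sequence lines).

```python
# Translation table that complements bases while preserving letter case,
# so lower-case (for example, repeat-masked) regions stay lower-case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the opposite strand, read 5' to 3'."""
    return sequence.translate(_COMPLEMENT)[::-1]


def format_fasta(name, sequence, width=50):
    """Lay out one FASTA record; `width` is an assumed line length."""
    lines = [f">{name}"]
    for offset in range(0, len(sequence), width):
        lines.append(sequence[offset:offset + width])
    return "\n".join(lines) + "\n"


# Hypothetical record name and a toy sequence, for illustration only.
print(format_fasta("example_region", reverse_complement("ATGCaattGGCC"), width=6))
```

    Characters outside A, C, G and T (such as N) pass through the complement step unchanged in this sketch.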

    The second button option offers even more ways to customize the output DNA sequence.

    If you click the extended case/color options button, you'll get a new page that lets you change the case of individual items, change their colors, underline specific features, and so on. The choices that you will see in the list are based on the tracks actively shown in the Genome Viewer window you were looking at. If that's too much, go back and turn off some tracks to make it easier to view.

    This is a distinctive way to look at your sequence of interest, and the output can be copied to text documents for later review. As you can see in a sample output, different features are distinguished by color, case, or underlines.

    The two options that I just described deal with getting the whole region of DNA from your viewer. But you have another option: you can get just the sequence you want from an annotation track item. That's what we'll look at in the next slide.

    *In this second example of how to get sequence data, I'm showing a screen shot of the TP53 annotation track in the UCSC Genes section. As before, we would click on the item to get to the TP53 details page. From the details pages you can get the specific sequence for that item.

    Here I'm showing a part of that details page; there is a box for the sequence section. You can scroll down the details page to find the sequence section. Here you will find links to the genomic sequence, the mRNA sequence, and the protein sequence. You can use these links to get each specific sequence, with additional options if you choose the genomic sequence, which is great for promoter studies, intron studies, and so on.

    So, the sequence of the items in your viewer is just a couple of clicks away, using either the DNA link at the top to get the whole window, or the links from the information pages to obtain sequence for specific items. You can also download lists of sequences or more complex queries from the Table Browser, but that is beyond the scope of our introduction here. Please see the advanced tutorial for more details on that topic.

    *[end of Get Details or Sequences] That completes our examination of the access to details and sequence information from the Genome Viewer.

    [beginning of Sequence Searching] In this section, we will examine the way to search the UCSC Genome Browser starting with sequence data.

    *In the UCSC Genome Browser, the tool you will use for sequence searching is called BLAT. Many of you will be familiar with the alignment tool called BLAST or BLAST2, which stands for Basic Local Alignment Search Tool. If you have used the NCBI databases, and searched for similar sequences, you have probably used BLAST.

    But BLAT is different: it is the BLAST-Like Alignment Tool. It searches the database slightly differently than BLAST. BLAT requires an index of the sequences in the database, something like the index in the back of a biochemistry textbook. The BLAT index consists of all the possible unique 11-oligomer sequences (11-mers) in the genome (or 4-mers for protein sequences). Just as you can quickly scan a book index to find the correct word, BLAT scans the index for matching 11-mers, and then builds the rest of the match out from there. It is a very fast way to search the sequences. BLAST does it the other way: it indexes your query and then runs that smaller index over everything. That's the essential difference in the algorithm.
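    The index-then-scan idea can be illustrated with a toy Python sketch. This is a teaching simplification and not BLAT's actual implementation: it uses a short word size (k = 4 instead of 11) so the example fits on one screen, it indexes non-overlapping words of the database sequence (my reading of how the BLAT paper describes the index, so treat that as an assumption), and it stops at reporting seed hits rather than building out full alignments. The function names are hypothetical.

```python
from collections import defaultdict


def build_index(genome, k):
    """Index the start positions of non-overlapping k-mers in `genome`."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def find_seeds(query, index, k):
    """Slide over every k-mer of the query and look it up in the index.

    Returns (query_offset, genome_offset) pairs; a real tool would extend
    each seed into a longer alignment.
    """
    seeds = []
    for q_start in range(len(query) - k + 1):
        for g_start in index.get(query[q_start:q_start + k], []):
            seeds.append((q_start, g_start))
    return seeds


# Toy sequences, for illustration only.
genome = "ACGTTGCAACGTGGGG"
index = build_index(genome, k=4)
print(find_seeds("TTGCAA", index, k=4))
```

    Because the index is built once and reused, each query only costs a handful of dictionary lookups, which is the intuition behind the book-index analogy.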

    But the outcome will still be a pair of sequences that are lined up with each other so you can compare the matches.

    BLAT works best with sequences with high identity, and greater than 21 bases long, but don't let that scare you; you can find more distant matches as well. Directly from the UCSC documentation: On DNA queries, BLAT is designed to quickly find sequences with 95% or greater similarity of length 40 bases or more. On protein queries, BLAT rapidly locates genomic sequences with 80% or greater similarity of length 20 amino acids or more. In general, gene family members that arose within the last 350 million years can generally be detected.

    For many people it will be enough to know that there is a means of searching for your region of interest in the database by starting with a sequence! For the more casual BLAT user, check out the Help and Frequently Asked Questions documentation at the UCSC web site for a little more detail about the way BLAT works, without tremendous amounts of mathematical equations. For the more mathematically inclined folks, you can see the publication by Jim Kent that describes BLAT in more detail.

    So now we know a little bit about the BLAT tool. How do we get to it? Let's start at the UCSC Genome Browser homepage. As for most UCSC tools, you can use the navigation bars at the top or at the side of the UCSC home page. Select a link called BLAT to get started.

    [not read in recording] BLAST (original paper): Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=2231712&dopt=Abstract

    BLAT (original paper): W. James Kent (2002) BLAT - The BLAST-Like Alignment Tool, Genome Res 12:4 656-664.

    *Shown here is the interface for BLAT. We will work our way down this page.

    As you can see along the top there are a few parameters you can change, some choices you have to make. First, you must choose one species to search. You search one species at a time with this tool. Then you choose an assembly, which we have seen before in the basic search section.

    Next, you may let the BLAT tool guess whether you have entered nucleotides or amino acids, or you can tell it which one you are using. The default is BLAT's guess, which has always determined the sequence composition correctly in my experience.

    Sort output, on the default settings here, will list the best-scoring matches first. Output type specifies whether you want the output to be in the browser form, or in files you can use later. Hyperlink is the default, which displays in the browser, and that's what I'll be using for this example. The other type, PSL (or "pat space") output styles, is useful for people who want a differently structured, text-based output that can be used for a variety of purposes.

    There is a large text box where you can paste your sequence or sequences. You can paste one or more sequences, but there are limits to how much BLAT you can do, as it is a large burden on the servers. You can submit up to 25,000 bases or 10,000 amino acids, up to a total of 25 sequences. If you need to do more BLAT, UCSC asks that you download it and run a local copy. Instructions for this can be found in the documentation.

    I have displayed two partial mRNA sequences here that I will use in my example. These are in the common FASTA format, which you have to use if you are going to use multiple sequences.
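    Before pasting a batch of sequences, it can be handy to check them against the limits mentioned above. The Python sketch below reads FASTA-formatted text and compares it with the numbers quoted in this tutorial (25,000 bases or 10,000 amino acids, and 25 sequences). The function names are hypothetical, and applying the letter limit to the combined length of all sequences is an assumption of this sketch; check the current UCSC documentation for the limits actually enforced.

```python
# Limits as quoted in this tutorial; the live site may differ.
MAX_BASES = 25_000
MAX_AMINO_ACIDS = 10_000
MAX_SEQUENCES = 25


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_blat_limits(records, is_protein=False):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    limit = MAX_AMINO_ACIDS if is_protein else MAX_BASES
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences exceeds {MAX_SEQUENCES}")
    total = sum(len(sequence) for _, sequence in records)
    if total > limit:
        problems.append(f"{total} letters exceeds {limit}")
    return problems


# Toy input with made-up record names.
records = parse_fasta(">query1\nACGTAC\nGGTT\n>query2\nTTGACA\n")
print(records, check_blat_limits(records))
```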

    There is also an option to upload your sequence (or sequences), if you keep a file of them.

    Finally, you click submit to send your query to the database.

    There is a special button: the "I'm Feeling Lucky" button. If you click that, just like in Google, you will be taken to the position of your best match right away, in the Genome Viewer. But I'll be demonstrating the plain old submit button right now.

    *Here we see the results of a BLAT search against the human genome, using the sample human mRNA sequences I showed.

    As you can see, we have sorted the list by the query and score. You can see we have a really high-scoring match up at the top. After that, the matches appear to be weaker, probably covering pretty small regions.
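    If you save BLAT results as text and want to reproduce this ordering yourself, sorting by query name and then by descending score takes one line of Python. The rows and field names below are invented for illustration; they are not real BLAT output.

```python
def sort_hits(hits):
    """Order hits by query name, then best score first within each query."""
    return sorted(hits, key=lambda hit: (hit["query"], -hit["score"]))


# Made-up example rows (hypothetical names and scores).
hits = [
    {"query": "seqB", "score": 40, "chrom": "chr2"},
    {"query": "seqA", "score": 22, "chrom": "chr9"},
    {"query": "seqA", "score": 950, "chrom": "chr4"},
]
for hit in sort_hits(hits):
    print(hit["query"], hit["score"], hit["chrom"])
```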

    Now, you'll remember that we asked for hyperlinked results in our setup. You can see that there are two columns of links for us. One says browser, one says details.

    The first thing that I will do is demonstrate a click of the browser link for the matches. This will link me to the position of this match in the Genome Viewer. I will show a sample of that on the next slide.

    Later we will click on the details link for the best match. That will give us a new page with sequence information, as you'll see a couple of slides from now.

    *When you link from the BLAT results to the BROWSER, you get a special track appearing in the Viewer!

    Just down from the top there is a new line on the browser; it says "Your Sequence from Blat Search". And the name of my query sequence is listed over on the left.

    If you look at the UCSC Genes, or RefSeq Genes, you can see that we have matched the CXCL5 gene, which is what I would have expected from the BLAT query. On the known genes, because we are zoomed in to a small region, you can see the methionines indicated in green. Also, note the direction of this gene: it is on the negative strand, and therefore runs from right to left in this case. But beware: the 3-frame translation at the top is running the other way. If you want to compare the methionines or other amino acids, you have to flip the frame translation. To flip the sequence and see the opposite strand, you must click the tiny arrow on the upper left of the viewer, or use the reverse button in the middle of the page.
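    Why the translation has to be flipped for a negative-strand gene can be seen in a small Python sketch. It uses the standard genetic code and translates a toy sequence in three reading frames, first as displayed and then after taking the reverse complement. The function names and the toy sequence are illustrative assumptions; this is not how the browser itself renders the track.

```python
# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def three_frames(dna, strand="+"):
    """Translate in three frames on the given strand ('+' or '-')."""
    if strand == "-":
        dna = reverse_complement(dna)
    return [translate(dna[offset:]) for offset in range(3)]


# Toy sequence: the start codon ATG only appears on the minus strand.
print(three_frames("GGCCAT", "+"))
print(three_frames("GGCCAT", "-"))
```

    In the toy example the methionine (M) only shows up when the minus strand is translated, which mirrors why the displayed translation must be reversed before comparing it with a negative-strand gene.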

    So, we have used a sequence as a starting point to search the genome. We get to see the location of our match directly on the Genome Browser by clicking the browser link from our BLAT results.

    So BLAT is another good place to start searching for your genes of interest in the UCSC Genome Browser tool.

    One special tip here: when you are zoomed in enough on any genomic sequence, you can see the amino acids in the display if you have turned on the Base Position menu to full. Zoom in more to see the amino acid single-letter codes right on the sequence.

    *Here I show the outcome if you clicked the details link from the BLAT results page. I know it's impossible to see the whole alignment page clearly; even with my short query sequence this is a large web page.

    You can see the page is divided into several parts. The top part shows the query sequence you put in (in this case our human CXCL5 mRNA sequence).

    The middle part of the page shows the match of your sequence (in blue capital letters) to the genomic sequence. This gives you a quick look at the possible exon/intron structure if you have used an mRNA sequence as I have. It's a nice way to see which parts are the likely exons in an mRNA, with the introns in black text.

    The bottom part shows you the actual nucleotide-for-nucleotide matches; this may be more like the BLAST results you are used to seeing. I magnified the top of the side-by-side alignment so you can see where my query sequence on the top (starting with number 001) lines up with the genomic sequence. You can judge the quality of the match yourself in this section.

    Although I have shown nucleotide sequence in the example, you can BLAT with a protein sequence and see where the protein sequence matches in the genomic framework as well. If you had started with a protein sequence, your amino acid sequence would be displayed with the corresponding genomic nucleotide sequence.

    So you can start to search the UCSC Genome Browser data with a sequence, and view the results in either the Genome Viewer or at the level of alignment detail shown here.

    *[end of Sequence Searches] That completes our look at sequence searching in the UCSC Genome Browser.

    [beginning of Summary] In this section we will summarize this tutorial.

    *The UCSC Genome Browser is a dynamic and effective tool for accessing and understanding genomic regions of interest for many species' genomes. Many data types useful to biomedical researchers are available for searching and displaying in this resource. The views can be adjusted and controlled by the user to show the data in the most helpful way.

    Pages with detailed information are available for any type of feature you see in the browser.

    You can access sequence information in a variety of ways from the browser interface, and use sequence data to align genomic sequence and locate regions in the viewer.

    There are other ways to access the information in the UCSC Genome Browser. Additional tutorials on more advanced ways to use the UCSC Genome Browser tools, and on other data types, are also available. Once you have mastered the introductory materials, be sure to check those out as well.

    *[end of Summary] That completes our summary.

    [beginning of Exercises] In this section we will explore exercises that reinforce concepts developed in this tutorial.

    *Hands-on session for basic text and sequence searches.

    [Not included in audio for movie]

    The exercises that match this presentation can be found on the UCSC Genome Browser OpenHelix tutorial homepage, or in your folders if you are at a live OpenHelix training.

    We walk through them together in live OpenHelix workshop sessions.

    *The materials and slides offered are for non-commercial use only. Reproduction, distribution and/or use for commercial purposes is strictly prohibited.

    Copyright 2009, OpenHelix, LLC.