Bioinformatics Lecture 4, BCH 550, Arjumand Warsy. Retrieving DNA Sequences


Post on 16-Dec-2015






  • Slide 1
  • Bioinformatics Lecture 4 BCH 550 Arjumand Warsy
  • Slide 2
  • Retrieving DNA Sequences
  • Slide 3
  • Introduction. Protein sequences are simple, with a narrow range of sizes (about 300 amino acids long, plus or minus 200, except for a few giant ones), clearly defined boundaries, and specific functional attributes. Furthermore, the proteins of microbes and of higher eukaryotes (animals and plants) have roughly the same properties. The corresponding gene (DNA) sequences become more varied and complex in higher animals. Gene sizes in humans may vary from a few thousand bp to several hundred thousand bp, and not all DNA codes for protein. Several types of DNA sequence are involved in defining a gene: (1) regulatory regions (usually preceding the coding region); (2) untranslated regions that precede and follow the coding region; (3) the protein-coding region. In eukaryotes (yeast, plants, animals), the protein-coding region is divided into a variable number of exons interspersed with introns. As a consequence, working with DNA sequences is always trickier than working with protein sequences.
  • Slide 4
  • Going from protein sequences to DNA sequences. In databases, the correspondence between protein and DNA sequences is not one-to-one. Many different, even non-overlapping, DNA sequences can be linked to the same protein or gene name: (1) the primary transcript, generated by copying the DNA sequence of a gene from beginning to end (exons plus introns); (2) the mature transcript (the mRNA), generated from the primary transcript by discarding the introns; (3) the strict protein-coding region (the open reading frame, or ORF); (4) numerous types of partial sequences.
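The relationship between the primary transcript, the mature transcript, and the ORF can be illustrated with a minimal Python sketch. The gene sequence and exon coordinates below are made up for illustration, and the helper names `splice` and `first_orf` are hypothetical; the ORF search is deliberately naive (first ATG, same frame, first stop codon).

```python
# Toy gene on the coding strand (invented for illustration):
# exon 1 = GGCATGGCT, intron = GTAAGTCCAG, exon 2 = AAATAAGGC
GENE = "GGCATGGCTGTAAGTCCAGAAATAAGGC"
EXONS = [(1, 9), (20, 28)]  # 1-based, inclusive coordinates (assumed)

STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(gene, exons):
    """Join the exon segments, discarding everything in between (the introns)."""
    return "".join(gene[start - 1:end] for start, end in exons)


def first_orf(mrna):
    """Return the stretch from the first ATG to the first in-frame stop codon.

    Returns an empty string if there is no ATG or no in-frame stop after it.
    """
    start = mrna.find("ATG")
    if start == -1:
        return ""
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return ""


primary_transcript = GENE                    # exons + introns
mature_transcript = splice(GENE, EXONS)      # introns discarded
orf = first_orf(mature_transcript)           # strict protein-coding region

print(len(primary_transcript), len(mature_transcript), len(orf))
```

Three different sequences of three different lengths all belong to the same toy gene, which is why a database search by gene or protein name can return several non-identical DNA entries.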
  • Slide 5
  • Given a protein sequence, how can the DNA sequence encompassing its coding region be retrieved? Retrieving the DNA sequence relevant to a protein.
  • Slide 6
  • What is GenBank? GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2008 Jan;36(Database issue):D25-30). As of August 2009, there are approximately 106,533,156,756 bases in 108,431,692 sequence records in the traditional GenBank divisions and 148,165,117,763 bases in 48,443,067 sequence records in the WGS division. The complete release notes for the current version of GenBank are available on the NCBI ftp site. A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.
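To get a feel for the August 2009 figures quoted on this slide, the short Python calculation below derives the combined totals and the average record length in each division. It uses only the four numbers from the slide; the variable and function names are illustrative.

```python
# Figures quoted on the slide (GenBank, August 2009)
traditional_bases = 106_533_156_756
traditional_records = 108_431_692
wgs_bases = 148_165_117_763
wgs_records = 48_443_067


def mean_record_length(bases, records):
    """Average number of bases per sequence record."""
    return bases / records


total_bases = traditional_bases + wgs_bases
total_records = traditional_records + wgs_records

print(f"total bases:   {total_bases:,}")
print(f"total records: {total_records:,}")
print(f"traditional mean length: {mean_record_length(traditional_bases, traditional_records):.0f} bases")
print(f"WGS mean length:         {mean_record_length(wgs_bases, wgs_records):.0f} bases")
```

The traditional divisions average a little under 1,000 bases per record, while WGS records average roughly 3,000 bases, about three times as long.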
  • Slide 7
  • Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2008 Jan;36(Database issue):D25-30. Epub 2007 Dec 11. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA. PMID: 18073190 [PubMed - indexed for MEDLINE]. PMCID: PMC2238942. Abstract: GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage.
  • Slide 8
  • How to retrieve the nucleotide sequence
      1. Point the browser to
      2. To access the E. coli dUTPase entry quickly, enter the accession number (P06968) in the Search window at the top of the page, then click the Search button.
      3. Scroll down to GeneID and click the gene's ID number.
  • Slide 9
  • P06968
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Using GenBank for retrieving a nucleotide sequence. Search Nucleotide for X01714. A GenBank entry consists of four parts. The locus name (ECDUT), an arbitrary identifier, is followed by a short definition line and a unique accession number (X01714). The Reference section lists the article(s) relevant to the sequence determination. The Features section lists the definitions and exact ranges of the multiple types of elements that have been recognized in the sequence. The Sequence section rounds out the GenBank entry: the nucleotides are listed between the ORIGIN keyword and the final // that signals the very end of the entry. Numbering is provided to help relate the location of the dUTPase ORF (343-798) to the actual nucleotide sequence.
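The layout described on this slide can be read with a few lines of Python. The sketch below is a simplified illustration, not a complete GenBank parser: the record is an invented 30-base toy entry (not the real X01714 sequence), and the function names `parse_genbank` and `subsequence` are hypothetical. It picks up the locus name and accession number, collects the nucleotides between ORIGIN and //, and uses the 1-based numbering to cut out a feature range.

```python
# Invented toy record that mimics the GenBank flat-file layout.
TOY_RECORD = """\
LOCUS       TOYDUT                    30 bp    DNA
DEFINITION  Made-up record for illustration only.
ACCESSION   TOY00001
FEATURES             Location/Qualifiers
     CDS             7..18
ORIGIN
        1 ggcggcatgg ctaaataagg cggcggcggc
//
"""


def parse_genbank(text):
    """Extract locus name, accession number, and sequence from one entry."""
    record = {"locus": None, "accession": None, "sequence": ""}
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end of the entry
            break
        if in_origin:
            # Keep only the letters; this drops position numbers and spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    record["sequence"] = "".join(chunks).upper()
    return record


def subsequence(sequence, start, end):
    """Cut out a feature given 1-based, inclusive GenBank coordinates."""
    return sequence[start - 1:end]


entry = parse_genbank(TOY_RECORD)
toy_cds = subsequence(entry["sequence"], 7, 18)

# The same arithmetic applied to the dUTPase ORF range quoted on the slide:
dutpase_orf_length = 798 - 343 + 1   # inclusive range, so add 1
print(entry["locus"], entry["accession"], toy_cds, dutpase_orf_length)
```

Because GenBank ranges are inclusive at both ends, the range 343-798 spans 456 nucleotides, which is 152 codons.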
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Search for the nucleotide sequence of the human (Homo sapiens) G-6-PD gene
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • How to save?
      1. Scroll back to the top of the page for the ECDUT/X01714 entry. Refer to Figure 2-20 for what your screen should look like.
      2. Choose FASTA from the Display drop-down menu, as shown in Figure 2-23.
      3. Transform the content of this window into plain text by choosing Text from the drop-down menu located on the far right of the menu bar.
      4. Save the FASTA sequence by using the following protocol:
          a. In the Edit menu of your Web browser, click Select All and then click Copy.
          b. Open a default Word document and, in the Edit menu of Word, click Paste. Then select a Courier font (8 or 10).
          c. Finally, save your document as dUTPaseDNA.txt by choosing the Save as type option "text only (*.txt)".
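The manual copy-and-paste steps above can also be approximated in a script. The sketch below is one possible approach, not part of the lecture: it assumes NCBI's E-utilities `efetch` endpoint for building a FASTA download address, and the helper names (`efetch_fasta_url`, `format_fasta`, `save_fasta`) are hypothetical. The code only builds the address and writes a plain-text FASTA file; it does not contact the network.

```python
from pathlib import Path
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities efetch.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build an address that requests a nucleotide record as plain-text FASTA."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def format_fasta(header, sequence, width=70):
    """Return a FASTA record: one '>' definition line, then wrapped sequence."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def save_fasta(path, header, sequence, width=70):
    """Write the record as plain text, the scripted analogue of 'text only'."""
    Path(path).write_text(format_fasta(header, sequence, width),
                          encoding="ascii")


print(efetch_fasta_url("X01714"))
# Example with a made-up sequence (not the real dUTPase DNA):
# save_fasta("dUTPaseDNA.txt", "toy example", "ATGGCTAAATAA")
```

Saving as plain text matters for the same reason as in step 4c: downstream tools expect a bare definition line and sequence, with no word-processor formatting.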
  • Slide 27
  • Slide 28
  • Slide 29
  • Using BLAST to Compare My Protein Sequence to Other Protein Sequences
  • Slide 30
  • BLAST. BLAST (short for Basic Local Alignment Search Tool) is a sequence-comparison tool that tells us which known proteins have a sequence similar to our own. This information can be used for a variety of purposes, including the prediction of protein function, 3-D structure, and domain organization, and the identification of homologues (similar proteins) in other organisms.
  • Slide 31
  • How to use BLAST
      1. Point your favorite Internet browser to the BLAST home page. The page, probably the most frequented bioinformatics Web page in the world, appears.
      2. Click the Protein-Protein BLAST (blastp) link in the top right. A Query screen appears. At this point, you need a FASTA-formatted protein sequence.
      3. Open the file that contains your dUTPase FASTA-formatted protein sequence. This is the file that you created on your PC by following the steps in the "Retrieving a list of related protein sequences" section earlier in this chapter; alternatively, retrieve the sequence again.
      4. Using your browser's Edit menu, copy and paste ONE of the protein sequences (with its definition line) into the BLAST Search window.
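Step 4 asks for exactly one FASTA record, definition line included. A small Python sketch can pull a single record out of a file that holds several and check that it looks like protein before it is pasted. The two sequences below are invented, and the helper names `read_fasta` and `looks_like_protein` are hypothetical.

```python
# Invented multi-record FASTA text standing in for a saved file.
FASTA_TEXT = """\
>toy_protein_1 made-up example
MKKIDVKILDPRVGKEF
PLPTYA
>toy_protein_2 made-up example
MSLKIHNSRIGTEFP
"""

# The 20 standard amino-acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text):
    """Split FASTA text into a list of (definition line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_protein(sequence):
    """True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence.upper()) <= AMINO_ACIDS


records = read_fasta(FASTA_TEXT)
header, sequence = records[0]            # pick ONE record for the query
if looks_like_protein(sequence):
    print(">" + header)
    print(sequence)
```

The printed two lines are what would go into the query box: the definition line starting with `>` and the sequence beneath it.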
  • Slide 32
  • Old appearance
  • Slide 33
  • Slide 34
  • Getting the sequence again
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • 100% identity
  • Slide 41
  • 62% identity
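Assuming the percentages on these two slides refer to the percent identity that BLAST reports for each alignment, the Python sketch below shows how such a figure can be computed from a pair of aligned sequences: identical positions divided by the alignment length, gaps included. The aligned strings are invented and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same residue.

    Both strings must already be aligned (same length, '-' marks a gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# A sequence compared with itself, and with a made-up diverged relative.
self_hit = percent_identity("MKKIDVKILD", "MKKIDVKILD")
relative = percent_identity("MKKIDVKILD", "MKRID-KLLD")
print(self_hit, relative)
```

A query always matches its own database entry perfectly, which is why the top hit of a search with a known protein is typically the 100% hit, followed by related proteins at lower percentages.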
  • Slide 42