principles of bioinformatics - albanyberg/bio540/bio540 lectures/lecture_4.pdf · 2011-08-29 ·...
PRINCIPLES OF BIOINFORMATICS
BIO540/STA569/CSI660, Fall 2010
Lecture 4 (Sep‐15‐2010)
Bioinformatics Databases
Igor Kuznetsov
Department of Epidemiology & Biostatistics
Cancer Research Center
University at Albany
Reading: Zvelebil & Baum, Chapter 3.
Genomics ‐ the first of many ‘omics’ disciplines
A genome is the entire genetic material (DNA) of an individual organism.
Genomics is a new scientific discipline that studies the genomes of various organisms.
Genomics includes efforts to determine the entire DNA sequence of an organism’s genome (“sequencing the genome”), mapping of individual genes within the genome, and studies of interactions between genes within the genome.
There are many other ‘omics’ disciplines: proteomics, metabolomics, etc.
Human Genome Trivia
• # of chromosomes: 46.
• # of base pairs: ~ 3 billion.
• Total length of stretched DNA: 2 meters.
• # of protein coding genes: ~ 25,000.
• # of proteins: ~ 50,000.
• # of RNA‐coding genes: ~ 6,000.
• Human genome is the only genome that was sequenced by its own species.
How much biologically meaningful information is encoded in the human genome?
• We know the function for about 2% of the nucleotides in the human genome (that is, 6*10^7 out of 3*10^9 nucleotides).
• We know very little about the remaining 98% of so‐called “intergenic regions”.
Raw genomic DNA sequence
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGAGATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGCTTGGTGC...
[The slide shows a full page of raw, unannotated sequence that continues in the same way.]
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Yet another step in genomic data analysis ‐ Structural genomics
• Structural genomics is the determination of three‐dimensional structures of novel proteins.
A typical output of a structural genomics experiment: a protein 3D structure composed of individual atoms (colored spheres).
Growth of the Protein Databank (PDB)
Proteome
• The proteome is the entire set of proteins expressed by a genome, cell, tissue or organism.
• More specifically, it is the set of expressed proteins in a given type of cells or an organism at a given time under defined conditions.
• Proteome defines the PROTEins expressed by the genOME.
Proteomics
• Proteomics is a recent scientific discipline that aims to study all the proteins expressed by a genome at a given moment.
• Proteomics involves the identification of all the proteins in the body and the determination of their role in physiological and pathophysiological functions.
Proteomics vs. Genomics
• ~25,000 genes in the human genome, but genes aren’t functional end‐products.
• The functional end products: ∼50,000 proteins (many genes encode multiple proteins).
• For a living organism, the genome is mostly static, while the proteome is highly dynamic: proteins are continuously produced, modified, and degraded.
• Proteins can be modified by post‐translational modification in response to the physiological state (stress, drug treatment, disease, etc.).
Completely sequenced genomes (as of Feb. 2010)
Informatics challenges
• Various ‘omics’ projects produce ever larger and more diverse datasets.
• These datasets need to be organized, stored, and analyzed.
• This requires adequate information technology (IT) infrastructure and support capable of handling and analyzing these datasets.
• IT support for molecular biology research is provided by Bioinformatics.
• Databases are the backbone of bioinformatics.
Database Systems for Bioinformatics
• A database is a repository of information that has a specific structure that enables the user to enter and extract the data. Database structure consists of files or tables, each containing numerous records and fields.
• The two most popular types of bioinformatics databases are:
– Flat file databases
– Relational databases (RDBMS)
Flat file databases
The simplest form of a database, where data, such as nucleotide or amino acid sequences, are stored as one large text file or a collection of text files. These databases are called “flat” because they are flat like a sheet of paper.
FASTA file format ‐ the most primitive bioinformatics flat file format
>gi|45387601|ref|NP_991149.1| prion protein [Danio rerio]
MHSKFKLFSFLNCLLLLAVLLPVAQSRRGGGFGRGGGRGGGWGGSSSGRAGWGAAGGHHRAPPVHTGHMG
HIGHTGHTGHTGSSGHGVGKVAGAAAAGALGGMLVGHGLSSMGRPGYGYGYGGYGGHGYGYGHGYGHGHG
HGGHGGHSGDHNETDADYYLDGAASGHAYSCVTVFGLMMSFLIGHFLS
>gi|684|emb|CAA39368.1| prion protein [Bos taurus]
MVKSHIGSWILVLFVAMWSDVGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQ
PHGGGWGQPHGGGWGQPHGGGWGQPHGGGGWGQGGTHGQWNKPSKPKTNMKHVAGAAAAGAVVGGLGGYM
LGSAMSRPLIHFGSDYEDRYYRENMHRYPNQVYYRPVDQYSNQNNFVHDCVNITVKEHTVTTTTKGENFT
ETDIKMMERVVEQMCITQYQRESQAYYQRGASVILFSSPPVILLISFLIFLIVG
>gi|147907216|ref|NP_001082180.1| prion protein [Xenopus laevis]
MPQSLWTCLVLISLICTLTVSSKKSGGGKSKTGGWNTGSNRNPNYPGGYPGNTGGSWGQQPYNPSGYNKQ
WKPPKSKTNMKSVAIGAAAGAIGGYMLGNAVGRMSYQFNNPMESRYYNDYYNQMPNRVYRPMYRGEEYVS
EDRFVRDCYNMSVTEYIIKPTEGKNNSELNQLDTTVKSQIIREMCITEYRRGSGFKVLSNPWLILTITLF
VYFVIE
>gi|2330626|emb|CAA04236.1| Prion protein [Ovis aries]
MVKSHIGSWILVLFVAMWSDVGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQ
PHGGGWGQPHGGGWGQPHGGGGWGQGGSHSQWNKPSKPKTNMKHVAGAAAAGAVVGGLGGYMLGSAMSRP
LIHFGNDYEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNITVKQHTVTTTTKGENFTETDIKIME
QVVEQMCITQYQRESQAYYQRGASVILFSSPPVILLISFLIFLIVG
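Because a FASTA file is just header lines starting with ">" followed by sequence lines, it can be read with a few lines of code. The following is a minimal sketch, not a reference implementation: the function name parse_fasta and the EXAMPLE string are illustrative assumptions, and the two sequences are truncated copies of records from the slide.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


# Illustrative input: headers from the slide, sequences truncated.
EXAMPLE = """\
>gi|45387601|ref|NP_991149.1| prion protein [Danio rerio]
MHSKFKLFSFLNCLLLLAVLLPVAQSRRGGGFGRGGGRGGGWGGSSSGRAGWGAAGGHHRAPPVHTGHMG
HIGHTGHTGHTGSSGHGVGKVAGAAAAGALGGMLVGHGLSSMGRPGYGYGYGGYGGHGYGYGHGYGHGHG
>gi|2330626|emb|CAA04236.1| Prion protein [Ovis aries]
MVKSHIGSWILVLFVAMWSDVGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQ
"""

records = list(parse_fasta(EXAMPLE))
for header, sequence in records:
    accession = header.split("|")[3]  # fourth "|"-separated field of the header
    print(accession, len(sequence))
```

The sketch also shows why flat files are hard to search: finding one record means reading the text from the top.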
GenBank flat file format
Fields
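In a GenBank flat file, each header field starts with an uppercase keyword in the first column (LOCUS, DEFINITION, ACCESSION, and so on), and indented lines continue the field above them. The sketch below illustrates that layout under stated assumptions: the function parse_genbank_header and the RECORD text are hypothetical and made up for illustration, not a real GenBank entry. Sub-keywords such as ORGANISM are not handled separately; the sketch would fold them into their parent field.

```python
from typing import Dict, Optional


def parse_genbank_header(text: str) -> Dict[str, str]:
    """Collect the top-level header fields of a single GenBank record."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break  # the header ends where the feature table or sequence begins
        if line and not line[0].isspace():
            # A keyword in the first column opens a new field.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None and line.strip():
            # An indented line continues the current field.
            fields[current] += " " + line.strip()
    return fields


# Hypothetical record, written only to show the field layout.
RECORD = """\
LOCUS       EXAMPLE01                 12 bp    DNA     linear   UNK
DEFINITION  Hypothetical record used only to illustrate
            the field layout.
ACCESSION   EXAMPLE01
SOURCE      synthetic construct
FEATURES             Location/Qualifiers
ORIGIN
        1 tatagccgag ct
//
"""

fields = parse_genbank_header(RECORD)
for keyword, value in fields.items():
    print(f"{keyword}: {value}")
```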
Limitations of flat file databases
• Hard to integrate.
• Hard to search.
• Hard to update (e.g., need to download the entire database once in a while).
• In general, hard to handle efficiently.
• Mostly used to distribute data.
A better solution: Relational Database
• A relational database stores all its data within multiple tables.
• A table is a set of rows and columns.
• Each table is linked to other tables by a shared field called a key.
• Uses Structured Query Language (SQL) to access, retrieve, and update data.
Two tables from a relational database
[Figure: two tables linked through a shared key field.]
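The idea of two tables linked by a key can be tried out with a short sketch. It uses Python's built-in sqlite3 module as a stand-in for a full database server. The table and column names (organism, protein, organism_id) are assumptions made for illustration; the accessions and species names are taken from the FASTA example earlier in the lecture.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE protein (
        accession   TEXT PRIMARY KEY,
        description TEXT NOT NULL,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
    );
    """
)

conn.executemany(
    "INSERT INTO organism VALUES (?, ?)",
    [(1, "Danio rerio"), (2, "Bos taurus")],
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [("NP_991149.1", "prion protein", 1), ("CAA39368.1", "prion protein", 2)],
)

# The shared key organism_id links each protein row to its organism row.
rows = conn.execute(
    """
    SELECT p.accession, o.name
    FROM protein AS p
    JOIN organism AS o ON o.organism_id = p.organism_id
    ORDER BY p.accession
    """
).fetchall()
print(rows)
```

Storing the organism name once and referring to it by key is what makes the data easy to update: renaming a species means changing a single row.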
• MySQL is the most popular open source relational database.
• Free for most users.
• Integrates well with PHP, Perl and other scripting languages.
• Works well with Linux and Apache.
• Supports many platforms including Windows, UNIX/Linux, Mac OS X, etc.
• Many bioinformatics applications, big and small, use MySQL.
Primary vs. Derivative (secondary) databases
Primary Database
• Original submissions by experimentalists
• Database staff organize but don’t add additional information
Derivative (secondary) Database
• Curated by human experts
• Computationally derived
• Combination of the above two
Primary vs. Derivative (secondary) databases
[Figure: sequences (e.g., TATAGCCG...) flow from labs and sequencing centers into GenBank, which is updated ONLY by submitters. Derivative databases (RefSeq, Genome Assembly, UniGene) are built by curators and algorithms and are updated continuously.]
Data quality issues
• Since primary databases are formed from all user submissions, the amount of garbage data can be significant.
• Some high‐throughput sequencing experiments can get ~1% of all bases wrong.
• The quality of primary database subsets decreases in the following order:
Manually curated ‐> Automatically curated ‐> Not curated
A general scheme of an on‐line database
[Figure: a local computer sends a request to a remote computer, where some computer program(s) run on the stored data; the program output is returned to the local computer as output.]
Two major bioinformatics mega‐portals
• USA ‐ NCBI (The National Center for Biotechnology Information). The home of the GenBank sequence database.
http://www.ncbi.nih.gov/
• Europe ‐ EBI (The European Bioinformatics Institute). The home of the UniProt sequence database.
http://www.ebi.ac.uk/
NCBI: National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Conduct research in computational biology
– Develop bioinformatics software tools
– Disseminate biomedical information
NCBI Databases and Services
• GenBank ‐ largest primary sequence database
• Free public access to biomedical literature
– PubMed – free article abstracts search
– PubMed Central – full‐text article access
• Entrez ‐ integrated molecular and literature databases
• BLAST – fast sequence similarity search service
• VAST ‐ structure similarity searches
• Software and Databases for download
• Many other services and databases…
GenBank
http://www.ncbi.nlm.nih.gov/genbank/
• Three ways to search GenBank:
– Search GenBank for sequence identifiers and annotations with Entrez.
– Search GenBank sequences using BLAST (Basic Local Alignment Search Tool).
– Search, link, and download sequences using NCBI e‐utilities (a set of software programs).
• The Reference Sequence (RefSeq) database is a curated collection of DNA, RNA, and protein sequences built by NCBI. Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
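The e‐utilities route can be sketched in a few lines. This is an illustration under assumptions rather than an official client: the helper names efetch_url and fetch_record are hypothetical, and the base address and the db, id, rettype, and retmode parameters are those of the EFetch utility as currently documented by NCBI. The accession NP_991149.1 comes from the FASTA example earlier in the lecture.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities (assumed current at the time of writing).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, accession: str, rettype: str = "fasta") -> str:
    """Build an EFetch request URL for one record."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def fetch_record(db: str, accession: str, rettype: str = "fasta") -> str:
    """Download one record as text (requires network access)."""
    with urlopen(efetch_url(db, accession, rettype)) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; fetch_record() is defined but not called.
url = efetch_url("protein", "NP_991149.1")
print(url)
```

Calling fetch_record("protein", "NP_991149.1") would download the record over the network; NCBI asks automated clients to keep request rates low.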
EBI: European Bioinformatics Institute
http://www.ebi.ac.uk
• The structure of EBI services is similar to that of NCBI. The core databases reflect the methods used by biologists to collect information on how cells and organisms work:
‐ DNA/RNA/protein sequences
‐ Protein structure
‐ Whole genomes
‐ Gene expression experiments
‐ Literature databases
‐ Software databases
• UniProt – second largest primary sequence database. It consists of several components, each optimized for different uses:
‐ UniProtKB/Swiss‐Prot is manually annotated and reviewed.
‐ UniProtKB/TrEMBL is automatically annotated and is not reviewed.
• The sequences and information in UniProt are accessible via text search, BLAST search, and FTP.
PDB ‐ the primary database of experimental structures
Protein Databank: http://www.rcsb.org
Covers: all proteins for which there are published x‐ray and NMR structures (plus some theoretical predictions).
Provides:
• Structures
• Headers (authors, publications, experimental data)
• Sequences
• Links to structural classification information
• Lots of other tools
Sources of completely sequenced genomes
• ENSEMBL consortium: http://www.ensemblgenomes.org
• UCSC genome browser: http://genome.ucsc.edu
• TIGR, prokaryotic genomes: http://www.tigr.org
Databases of protein‐protein interactions (PPIs)
• Protein‐protein interactions (PPIs) affect all processes in the cell. Examples: replication, transcription, translation, signal transduction, etc.
• Yeast – 6,000 proteins, 3 PPIs per protein, 18,000 PPIs.
• Over 100,000 medically‐relevant PPIs in the human cell.
• Databases of experimentally determined PPIs become increasingly important for reconstructing biological pathways and modeling cellular processes.
• A comprehensive list of databases of PPI: http://mips.gsf.de/proj/ppi/
The yeast interactome
Visualization of PPI networks
• Visualization of PPI is a popular application of scientific visualization techniques. PPI networks are represented as graphs.
• This task is not straightforward because of the density of the graphs.
• http://www.cytoscape.org/
• Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data.
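Representing a PPI network as a graph means treating proteins as nodes and interactions as undirected edges. The sketch below shows one common way to hold such a graph in memory, an adjacency map. The protein names P1 to P5 and the interaction list are hypothetical and chosen only for illustration; build_network is likewise an assumed helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_network(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn a list of interacting pairs into an undirected adjacency map."""
    network: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)  # an interaction is symmetric,
        network[b].add(a)  # so it is stored in both directions
    return dict(network)


# Hypothetical interactions between five made-up proteins.
INTERACTIONS = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P1", "P4"),
    ("P2", "P3"),
    ("P4", "P5"),
]

network = build_network(INTERACTIONS)

# Degree = number of interaction partners of each protein.
degrees = {protein: len(partners) for protein, partners in network.items()}
hub = max(degrees, key=degrees.get)  # the most connected protein

# Each edge appears in two adjacency sets, hence the division by two.
n_edges = sum(degrees.values()) // 2

print(degrees)
print("hub:", hub, "edges:", n_edges)
```

With thousands of proteins, the same structure becomes dense enough that drawing it legibly is the hard part, which is where tools such as Cytoscape come in.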
The Gene Ontology (GO) database
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
• GO provides a way to capture and represent biological knowledge in a standardized database framework.
• GO is a controlled vocabulary that can be applied to all organisms. It is used to describe gene products ‐ proteins and RNA ‐ in any organism.
• GO includes:
1. A vocabulary of terms (names for concepts)
2. Definitions
3. Defined logical relationships to each other
GO is structured as Directed Acyclic Graphs (DAG)
GO terms are nodes in the graph
[Figure: example DAG with the terms cell, membrane, chloroplast, mitochondrial membrane, and chloroplast membrane, connected by is‐a and part‐of edges.]
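A directed acyclic graph differs from a simple tree in that a term may have more than one parent. The sketch below illustrates this with the five terms from the slide. The exact edges are an assumption, since the drawing itself is not reproduced here: "chloroplast membrane" is given two parents to show the point, and the names PARENTS and ancestors are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# child term -> list of (relationship, parent term); edges are assumed.
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "cell": [],
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}


def ancestors(term: str, parents: Dict[str, List[Tuple[str, str]]]) -> Set[str]:
    """Return every term reachable by following parent edges upward."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for _relationship, parent in parents[stack.pop()]:
            if parent not in seen:  # a term reached by two paths is visited once
                seen.add(parent)
                stack.append(parent)
    return seen


for term in ("mitochondrial membrane", "chloroplast membrane"):
    print(term, "->", sorted(ancestors(term, PARENTS)))
```

Walking up to all ancestors is how an annotation to a specific term is also counted for its more general parent terms.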
GO: Three ontologies
Three questions about a gene product:
• What does it do? Molecular Function
• What processes is it involved in? Biological Process
• Where does it act? Cellular Component
Molecular Function ‐ activities or “jobs” of a gene product
insulin binding
insulin receptor activity
Biological Process ‐ a commonly recognized series of biological events
Transcription is a biological process
Cellular Component: where a gene product acts