CS177 Lecture 8 Bioinformatics Databases (and genetic diseases) Tom Madej 10.31.05

Page 1: CS177 Lecture 8 Bioinformatics Databases  (and genetic diseases)

CS177 Lecture 8 Bioinformatics Databases

(and genetic diseases)

Tom Madej 10.31.05


Lecture overview

• Very brief and fast overview of on-line databases.

• Formulating queries in Entrez.

• Molecular biology of diseases, including an extensive example involving a lot of linking between a number of Entrez databases.


Bioinformatics Resources

• Reference: Chapter 3 in Sequence – Evolution – Function, E.V. Koonin and M.Y. Galperin, Kluwer Academic 2003.

• Available on the NCBI Bookshelf.


Sequence Databases

• GenBank, EMBL, DDBJ; archival (International Nucleotide Sequence Database Collaboration); sequences have a common accession

• SWISS-PROT: curated, non-redundant, entries hyperlinked (e.g. to PubMed); TrEMBL: entries not yet ready for SWISS-PROT

• Motifs: PROSITE, BLOCKS, PRINTS

• Domains: Pfam, SMART, ProDOM, COGs (NCBI)

• Motifs/domains: InterPro, CDD (NCBI)


More databases…

• Structure: PDB/RCSB, MMDB (NCBI), SCOP, CATH, FSSP

• Organism-specific: e.g. E. coli, B. subtilis, Synechocystis sp. (bacteria); yeast (unicellular eukaryote); Arabidopsis, C. elegans (WormBase), fruit fly, human

• COGs: clusters of orthologous groups; KEGG: biochemical pathways; BIND: protein-protein interactions; ENZYME; LIGAND: enzymes and their substrates

• PubChem (NCBI) chemical substances


PubChem (new)


The (ever expanding) Entrez System

[Figure: Entrez database nodes: PopSet, Structure, PubMed, Books, 3D Domains, Taxonomy, GEO/GDS, UniGene, Nucleotide, Protein, Genome, OMIM, CDD/CDART, Journals, SNP, UniSTS, PubMed Central, Gene, HomoloGene, NLM Catalog, and PubChem (BioAssays, Compounds, Substances).]


Links Between and Within Nodes

[Figure: nodes for Genomes, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, and 3-D Structures; link labels include Word weight, VAST, BLAST, Phylogeny, and Computational.]


PubMed: Computation of Related Articles

The neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths.

The value of a term is dependent on Global and Local types of information:

G - the number of different documents in the database that contain the term;

L - the number of times the term occurs in a particular document.


Global and local weights

• The global weight of a term is greater for the less frequent terms. A term that occurs in most of the documents tells one very little about any one document.

• The local weight of a term is the measure of its importance in a particular document. Generally, the more frequent a term is within a document, the more important it is in representing the content of that document.


How we define similar documents

• The similarity between two documents is computed by adding up the weights (local wt1 × local wt2 × global wt) of all of the terms the two documents have in common. All results are ranked, and the most similar documents become Related Articles.
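The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not PubMed's actual implementation: the slides do not give the weighting formulas, so the global weight log(N/G) and the local weight 1 + log(L) used here are assumptions, the document-length adjustment is left out, and the three toy documents are made up.

```python
import math
from collections import Counter

def term_counts(text):
    """L: how many times each term occurs in one document."""
    return Counter(text.lower().split())

def global_weights(docs):
    """Assumed global weight log(N / G); G = number of documents containing the term."""
    n = len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    return {term: math.log(n / g) for term, g in doc_freq.items()}

def local_weight(count):
    """Assumed local weight: grows with the in-document count L."""
    return 1.0 + math.log(count)

def similarity(d1, d2, gw):
    """Sum of local wt1 x local wt2 x global wt over the shared terms."""
    shared = d1.keys() & d2.keys()
    return sum(local_weight(d1[t]) * local_weight(d2[t]) * gw[t] for t in shared)

# Toy documents (hypothetical text, for illustration only).
texts = [
    "hemochromatosis HFE mutation iron overload",
    "HFE mutation C282Y hemochromatosis patients",
    "p53 mutation tumor suppressor",
]
docs = [term_counts(t) for t in texts]
gw = global_weights(docs)

def related(i):
    """Indices of the other documents, most similar first."""
    others = [j for j in range(len(docs)) if j != i]
    return sorted(others, key=lambda j: similarity(docs[i], docs[j], gw), reverse=True)

print(related(0))
```

Note that "mutation" occurs in all three toy documents, so its assumed global weight is zero and it contributes nothing to any score, which is the behaviour the global weight is meant to capture.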


Entrez database queries

• The databases are indexed by different sets of terms.

• You can get to a particular DB by selecting it and then entering a “null” query.

• The “Preview/Index” tab displays the index terms and can be used to formulate a query (if you can’t remember the syntax for the index).

• “Limits” can be used e.g. to select publications in a specified time range.

• “Details” shows the interpretation of the query.


Exercises!

• How many protein structures are there that include DNA and are from bacteria? “bacteria [orgn] AND 1:100 [DNAChainCount]”

• In PubMed, how many articles are from the journal Science and have “Alzheimer” in the title or abstract, and “amyloid beta” anywhere? How many since the year 2000?

• Notice that the results are not 100% accurate!

• In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands and are from the mouse? “0:2 [HelixCount] AND 8:10 [StrandCount] AND mouse [orgn]”
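Queries like the ones in these exercises are just strings of fielded terms joined with AND, so they can be assembled programmatically. The sketch below builds the two quoted queries and, as an assumption about how one might run them outside the web page, wraps one in a URL for the NCBI E-utilities `esearch` endpoint. The helper names are hypothetical, and the PubMed field tags `[journal]`, `[tiab]`, and `[dp]` are one possible way to phrase the second exercise, not the answer given in the lecture. No request is sent.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities search service (the URL is only built, never fetched).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def field(term, tag):
    """Attach an index field tag to a term, e.g. bacteria[orgn]."""
    return f"{term}[{tag}]"

def all_of(*clauses):
    """Require every clause to match."""
    return " AND ".join(clauses)

def esearch_url(db, query):
    """Build (but do not send) a search URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})

# Queries quoted on the slide.
structure_q = all_of(field("bacteria", "orgn"), field("1:100", "DNAChainCount"))
domains_q = all_of(field("0:2", "HelixCount"),
                   field("8:10", "StrandCount"),
                   field("mouse", "orgn"))

# One assumed phrasing of the PubMed exercise.
pubmed_q = all_of(field('"Science"', "journal"),
                  field("Alzheimer", "tiab"),
                  '"amyloid beta"')
pubmed_since_2000 = all_of(pubmed_q, field("2000:3000", "dp"))

for q in (structure_q, domains_q, pubmed_q, pubmed_since_2000):
    print(q)
print(esearch_url("structure", structure_q))
```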


Investigating genetic diseases

• Now we will see examples of how bioinformatics databases can be used to investigate genetic diseases.


Gene variants that can affect protein function

• Mutation to a stop codon; truncates the protein product!

• Insertion/deletion of multiple bases; changes the sequence of amino acid residues.

• Single point change could alter folding properties of the protein.

• Single point change could affect the active site of the protein.

• Single point change could affect an interaction site with another molecule.
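To make the first and last three cases concrete, a single-base change can be classified by translating the codon before and after the change. The sketch below uses the standard genetic code; the function name and the example codons are chosen for illustration and are not from the lecture.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))

def classify_point_change(codon, position, new_base):
    """Classify a single-base substitution in a sense codon (position is 0, 1, or 2)."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "synonymous"   # same residue, protein sequence unchanged
    if after == "*":
        return "nonsense"     # new stop codon, truncated product
    return "missense"         # different residue at this position

print(classify_point_change("TGG", 2, "A"))  # Trp codon -> TGA
print(classify_point_change("GAG", 1, "T"))  # Glu codon -> GTG (Val)
print(classify_point_change("GAA", 2, "G"))  # Glu codon -> GAG (still Glu)
```

Whether a missense change affects folding, the active site, or an interaction site cannot be read off the codon table; that is what the structure-based exercises later in the lecture address.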


Lodish et al. Molecular Cell Biology, W.H. Freeman 2000


Sickle cell anemia

• The first “molecular disease”, i.e. the first genetic disease with a known molecular basis.

• The most common variant is caused by a Glu6Val mutation in the hemoglobin β-chain (HbS). However, there are hundreds of other mutations that can cause this (OMIM lists 524 variants!).

• This mutation causes the hemoglobin to polymerize; in turn, the red blood cells form sickle shapes and clump together under low oxygen conditions or high hemoglobin concentrations.

• Confers some resistance to malaria, by inhibiting parasite growth.


NHLBI web site


Exercise!

• Find an appropriate Hemoglobin structure and view it in Cn3D.

• Check the position of the Glu6Val mutation.


P53 tumor suppressor protein

• Li-Fraumeni syndrome: having only one functional copy of p53 predisposes to cancer.

• Mutations in p53 are found in most tumor types.

• p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein, cdk2. This prevents the cell from progressing through the cell cycle.


G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003) 21 217-228.


Exercise!

• Use Cn3D to investigate the binding of p53 to DNA.

• Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this).


Important note!

• Most diseases (e.g. cancer) are complex and involve multiple factors (not just a single malfunctioning protein!).


Investigating a genetic disease…

• The following EST comes from a hemochromatosis patient; your task is to identify the gene and the specific mutation causing the illness, and to explain why the protein is not functioning properly.

• The sequence:

TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG
TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA
ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT
GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA
TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG
GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC
TGGATCAGCCCCTCATTGTGATCTGGG


ESTs

• Expressed Sequence Tags; useful for discovering genes, obtaining data on gene expression/regulation, and in genome mapping.

• Short nucleotide sequences (200-500 bases or so) derived from mRNA expressed in cells.

• The introns from the genes will already be spliced out.

• mRNA is unstable, however, and so it is “reverse transcribed” into cDNA.
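The relationship between an mRNA and its cDNA can be shown with a short sketch: the first cDNA strand is the reverse complement of the mRNA written in DNA letters, and the second strand carries the same sequence as the mRNA with T in place of U. The function names and the six-base example are hypothetical.

```python
# Pairing rules for copying RNA into DNA.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, written 5'->3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))

def second_strand_cdna(mrna):
    """DNA strand with the same sequence as the mRNA (T replaces U)."""
    return mrna.upper().replace("U", "T")

mrna = "AUGGCC"  # toy example, not a real transcript
print(first_strand_cdna(mrna))
print(second_strand_cdna(mrna))
```

This is why EST records, like the one in the exercise, are written with T rather than U.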


Hemochromatosis 2

• BLAST the example EST vs. the human genome (could take a few minutes).
- Which chromosome is hit?
- What is the contig that is hit (reference assembly)?
- Is the EST identical to the genomic sequence?
- Take note of the coords of the difference.

• Click on “Genome View”.

• Select the map element at the bottom corresponding to the contig.
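Noting the coordinates of a difference amounts to walking the aligned query and subject side by side. The sketch below assumes an ungapped, same-strand alignment; the EST fragment is taken from the exercise sequence, while the genomic fragment and its start coordinate (5000) are assumptions made up for illustration. The real values come from the BLAST report.

```python
def mismatch_coords(query, subject, query_start=1, subject_start=1):
    """List (query_pos, subject_pos, query_base, subject_base) for each difference.

    Assumes both strings cover the same ungapped aligned region.
    """
    if len(query) != len(subject):
        raise ValueError("aligned segments must have equal length")
    return [
        (query_start + i, subject_start + i, q, s)
        for i, (q, s) in enumerate(zip(query, subject))
        if q != s
    ]

est_fragment = "AGATATACGTACCAGGTGGAG"      # from the EST in the exercise
genomic_fragment = "AGATATACGTGCCAGGTGGAG"  # assumed reference, differs at one base

print(mismatch_coords(est_fragment, genomic_fragment, subject_start=5000))
```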


Hemochromatosis 3

• What gene is hit? Zoom in on the BLAST hit a few times.

• Display the entire gene sequence via “dl” and “Display”.

• Copy and save the genomic sequence.

• Record the coords for the start of the genomic sequence.


Hemochromatosis 4

• Add the UniGene map to the view (if it is not already there). Click on the UniGene link Hs.233325.

• Note: Expression profile presents data for the expression level of the gene in various tissues.

• How many mRNAs and ESTs are there for the HFE gene?

• Take note of the mRNA accession NM_000410.


Hemochromatosis 5

• Go to “spidey”: http://www.ncbi.nlm.nih.gov/spidey/

• To determine the intron/exon structure, paste the HFE gene sequence into the upper box, and enter the HFE mRNA accession NM_000410 in the lower box.

• Click “Align”.


Hemochromatosis 6

• How many exons are there?

• Which exon codes the residue that is changed in the original EST? (You have to do a little arithmetic!)

• Record some of the protein sequence around the changed residue: EQRYTCQVEHPG
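The "little arithmetic" can be sketched as follows: given the exon intervals reported by the alignment, find which exon contains a genomic position and convert that position to an mRNA position and a residue number. All coordinates below are hypothetical placeholders, not the real HFE values, and the sketch assumes a forward-strand gene with 1-based inclusive intervals.

```python
def locate_in_exons(pos, exons):
    """Return (exon_number, mrna_position) for a genomic position, or None if intronic.

    exons: (start, end) genomic intervals, 1-based inclusive, in transcript order.
    """
    spliced_length = 0
    for number, (start, end) in enumerate(exons, start=1):
        if start <= pos <= end:
            return number, spliced_length + (pos - start + 1)
        spliced_length += end - start + 1
    return None

def residue_number(mrna_pos, cds_start):
    """1-based residue whose codon contains mrna_pos; cds_start is the A of the ATG."""
    return (mrna_pos - cds_start) // 3 + 1

# Hypothetical exon table and coding start, for illustration only.
exons = [(101, 200), (501, 650), (901, 1000)]
exon, mrna_pos = locate_in_exons(520, exons)
print(exon, mrna_pos, residue_number(mrna_pos, cds_start=31))
print(locate_in_exons(300, exons))
```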


Hemochromatosis 7

• From the Map Viewer page click on the HFE gene link.

• How many HFE transcripts are there? Which is the longest isoform?

• Follow “Links” to “Protein” and then to the report for NP_000401.

• Determine the residue number that corresponds to the mutation.


RNA splicing and isoforms


Hemochromatosis 8

• What effect does the mutation in the original EST have on the protein? (Look at the table for the Genetic Code.)

• Go back to the Gene Report; read the summary and take note of the GeneRIF bibliography; notice the ‘C282Y’ entries.

• Now go to “Links” and then to “GeneView in dbSNP” to see a list of known SNPs.
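The effect of the change can also be checked by translating the relevant stretch of the EST and comparing it with the protein sequence recorded earlier (EQRYTCQVEHPG). The sketch below uses the standard genetic code; the 36-base window and its reading frame are an assumption, picked so that the translation lines up with the recorded peptide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))

def translate(dna):
    """Translate complete codons from the start of the string."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Window cut from the exercise EST (frame chosen by hand).
est_window = "GAGCAGAGATATACGTACCAGGTGGAGCACCCAGGC"
reference_peptide = "EQRYTCQVEHPG"  # protein sequence recorded in the exercise

est_peptide = translate(est_window)
changes = [
    (i + 1, ref, est)
    for i, (ref, est) in enumerate(zip(reference_peptide, est_peptide))
    if ref != est
]
print(est_peptide)
print(changes)
```

The single difference is a Cys in the reference replaced by a Tyr in the EST (codon TAC), consistent with the ‘C282Y’ entries mentioned above.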


Hemochromatosis 9

• In the SNP list note that the one you want is currently shown.

• Select “view rs in gene region” and then click on “view rs” (actually, this is the default view).

• How many nonsynonymous substitutions do you see?

• Do you see the one we are particularly interested in?


Digression: SNPs

• Single Nucleotide Polymorphisms.

• A single base change that can occur in a person’s DNA.

• On average SNPs occur about 1% of the time; most are outside of protein coding regions.

• Some SNPs may cause a disease; some may be associated with a disease; others may affect predisposition to a disease; others may be simple genetic variation.

• dbSNP archives SNPs and other variations such as small-scale deletion/insertion polymorphisms (DIPs), etc.


Hemochromatosis 10

• Back to the Gene Report, click on “Links” and go to “OMIM” (can also get there via the Map Viewer).

• In the OMIM entry you can read a bit; also click on “View List” for Allelic Variants, where you can see the mutation again.


Hemochromatosis 11

• From the Gene Report again follow “Links” to “Protein” and scroll down to NP_000401.

• Click on “Domains” and then “Show Details”.

• What is the Conserved Domain in the region of interest?

• Follow the link to the CD.

• Click on “View 3D Structure”.


Hemochromatosis 12

• Look for residue position 282 in the query sequence.

• Highlight that column.

• Is the Cys282 conserved in the family?

• The C282Y mutation therefore likely has the effect of …


Aligning a sequence on a structure with Cn3D (example)

• Example: Use structure 1n3eA, align sequence for 1m5xA.

• In the Sequence/Alignment Viewer window select the menu item “Imports/Show Imports”.

• In the Import Viewer window select the menu item “Edit/Import Sequences”.

• In the Select Chain dialogue box select 1N3E A and click OK.

• In the Select Import Source dialogue box select “Network via GI/Accession” and click OK.

• In the Import Identifier dialogue box enter the accession 31615545 and click OK. The new sequence will appear.

• Select “Algorithms/BLAST single” and use the cursor to click anywhere on the 1m5xA sequence to align it using BLAST.


Aligning a sequence on a structure with Cn3D (example cont.)

• Select the menu item “Alignments/Merge All” to make the new alignment appear in the Sequence/Alignment Viewer window.

• The alignment should now appear in the Sequence/Alignment Viewer window; aligned residues will be red.

• Close the Import Viewer window and, if desired, pick another color style for the alignment (e.g. identity).

• You can do this with multiple sequences; especially useful if there is no CD for the structure.


PDB


PDB File: Header

HEADER ISOMERASE/DNA 01-MAR-00 1EJ9
TITLE CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: DNA TOPOISOMERASE I;
COMPND 3 CHAIN: A;
COMPND 4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765;
COMPND 5 EC: 5.99.1.2;
COMPND 6 ENGINEERED: YES;
COMPND 7 MUTATION: YES;
COMPND 8 MOL_ID: 2;
COMPND 9 MOLECULE: DNA (5'-
COMPND 10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP*
COMPND 11 TP*TP*TP*T)-3');
COMPND 12 CHAIN: C;
COMPND 13 ENGINEERED: YES;
COMPND 14 MOL_ID: 3;
COMPND 15 MOLECULE: DNA (5'-
COMPND 16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP*
COMPND 17 TP*TP*TP*T)-3');
COMPND 18 CHAIN: D;
COMPND 19 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM;
SOURCE 4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS;
SOURCE 5 MOL_ID: 2;
SOURCE 6 SYNTHETIC: YES;
SOURCE 7 MOL_ID: 3;
SOURCE 8 SYNTHETIC: YES
KEYWDS PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN

REMARK 1
REMARK 2
REMARK 2 RESOLUTION. 2.60 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : X-PLOR 3.1
REMARK 3 AUTHORS : BRUNGER
…
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK 290 ...
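PDB records are fixed-column text, so the HEADER line can be read by slicing. The sketch below relies on the column layout of the legacy PDB format (classification in columns 11-50, deposition date in 51-59, ID code in 63-66). Because the transcript has lost the original spacing, the example line is rebuilt with padding, which is an assumption about how the record looks in the real file.

```python
def parse_header(line):
    """Split a PDB HEADER record into its fixed-column fields."""
    if not line.startswith("HEADER"):
        raise ValueError("not a HEADER record")
    return {
        "classification": line[10:50].strip(),   # columns 11-50
        "deposition_date": line[50:59].strip(),  # columns 51-59
        "id_code": line[62:66].strip(),          # columns 63-66
    }

# HEADER record for 1EJ9, rebuilt with fixed-width padding.
header_line = (
    "HEADER".ljust(10)
    + "ISOMERASE/DNA".ljust(40)
    + "01-MAR-00"
    + " " * 3
    + "1EJ9"
)

print(parse_header(header_line))
```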


From Coordinates to Models

1EJ9: Human topoisomerase I


Building the Structure Summary

[Figure: structure summary page; labeled links: Taxonomy, PubMed, Protein, 3D Domains, Domains, Nucleotide.]


Indexing into MMDB

Structure

• Import only experimentally determined structures

• Convert to ASN.1

• Verify sequences

Add chemical bonds:

inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } ,

Add secondary structure:

id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } ,

• Create “backbone” model (Cα, P only)

• Create single-conformer model


Structure Indexing

• Entrez: MMDB-ID, MMDB entry date, EC number, Organism

• PDB: Accession, Release date, Class, Source, Description, Comment

• Ligands: PDB code, PDB name, PDB description

• Literature: Article title, Author, Journal, Publication date

• Experimental: Method, Resolution

• Counters: Ligand types, Modified amino acids, Modified nucleotides, Modified ribonucleotides, Protein chains, DNA chains, RNA chains

topoisomerase AND 2[dnachaincount] AND human[organism]


Creating Sequence Records

One record per chain:

• 1EJ9A: Protein

• 1EJ9C: Nucleotide

• 1EJ9D: Nucleotide


Annotating Secondary Structure

1EJ9: Human topoisomerase I

α-Helices

β-strands

coils/loops


Creating 3D Domains

3D Domain 0: 1EJ9A0 = entire polypeptide


Creating 3D Domains

3D Domains: 1EJ9A1, 1EJ9A2, 1EJ9A3, 1EJ9A4, 1EJ9A5

< 3 Secondary Structure Elements


3D Domain Indexing

• Entrez: SDI, MMDB-ID, Accession, MMDB entry date, Organism, Domain number, Cumulative number

• PDB: Accession, Release date, Class, Source, Description, Comment

• Literature: Article title, Author, Publication date

• Counters: Modified amino acids, α-Helices, β-Strands, Residues, Molecular weight

REMEMBER: 3D Domain 0 is the entire polypeptide chain!

Find all viral four-helix bundles:

4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism]