Module 1 (Structure Databases)
Module 1
Databases, Structures
Local Patterns
Data, Data Collection, Database
Contents:
• Molecular structures
• Spectra
• Patent information
• Molecular properties
• Scientific literature
• References
• Supplier information
• Prices
• …
Implementation:
• “Flat file”
• Local database
• Web access
• …
Database =
management component +
storage component for persistent data
that serve a specific purpose.
Definition of a Database
“Molecule databases”
Source file
Filtering
Library file
Raw data
Index file
Data 1 Data 2 Data 3
Application software
User interface
User
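The layout above (raw data plus an index file that the application software consults) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any particular molecule database: the record layout (an `ID` line first, `//` as terminator), the entry names `MOL_A` and `MOL_B`, and the functions `build_index` and `fetch` are all hypothetical.

```python
import io

def build_index(handle):
    """Map each entry ID to the byte offset where the entry starts."""
    handle.seek(0)
    index = {}
    offset = handle.tell()
    line = handle.readline()
    while line:
        if line.startswith(b"ID"):
            # Assumed layout: "ID   <name>" opens every entry.
            index[line.split()[1].decode()] = offset
        offset = handle.tell()
        line = handle.readline()
    return index

def fetch(handle, index, entry_id):
    """Jump straight to one entry and read it up to the '//' terminator."""
    handle.seek(index[entry_id])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line.decode().rstrip("\n"))
        if line.startswith(b"//"):
            break
    return lines

# Hypothetical two-entry flat file held in memory instead of on disk.
flat_file = io.BytesIO(b"ID   MOL_A\nDE   first\n//\nID   MOL_B\nDE   second\n//\n")
index = build_index(flat_file)
print(fetch(flat_file, index, "MOL_B"))
```

The index is what lets the application answer a query without scanning the whole raw-data file; in a real system it would be written to its own file next to the data.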
• How do we recognize a “tree”?
• Which “trees” are most similar to each other?
What is this?
Molecular Similarity
Typical applications of the “similarity concept”
• Similarity searching in databases
• Pattern recognition in molecular structures
• Similarity searching in virtual compound libraries
• Data clustering / classification
• Single compound design, de novo design
• Compound library design
• “Diversity” analysis of compound collections
• SAR modeling & prediction
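Similarity searching needs a number that says how alike two objects are. The slides do not name a measure; one widely used choice, swapped in here purely as an illustration, is the Tanimoto (Jaccard) coefficient on sets of features. The feature names and the tiny library below are hypothetical.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Tanimoto (Jaccard) coefficient: shared features / all features."""
    union = features_a | features_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(features_a & features_b) / len(union)

def rank_by_similarity(query: set, library: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, tanimoto(query, feats)) for name, feats in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical feature sets standing in for molecular descriptors.
library = {
    "compound_1": {"ring", "amide", "hydroxyl"},
    "compound_2": {"ring", "thiol"},
    "compound_3": {"amide", "hydroxyl", "carboxyl", "ring"},
}
print(rank_by_similarity({"ring", "amide", "hydroxyl"}, library))
```

The same ranking function serves database searching and virtual screening alike; only the feature sets change.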
Analogy-based Feature Selection
“Find function-determining features of macromolecular receptors and their small molecule effectors”
Receptors: A1, A2 | B1, B2, B3 | C1, C2
Ligands: a1, a2 | b1, b2, b3 | c1, c2
Applications
• Drug Discovery
• Chemical Biology
• Functional Genomics
• Similarity Searching & Virtual Screening
• Identification of targets & ligands
• Design of compound libraries
Target Identification → Target Validation → Hit Identification → Lead Identification → Lead Optimization → Preclinical Development → Drug
Cheminformatics
Bioinformatics
The Early Drug Discovery Process
Primary Sequence Databases
MIPS
OWL
…
UniProt combines SwissProt, TrEMBL, PIR
UniProtKB/TrEMBL Release 33.7: 3 189 332 entries
→ http://www.ebi.ac.uk/Databases/index.html
→ http://pir.georgetown.edu/pirwww/
→ http://pir.georgetown.edu/pirwww/dbinfo/
Database    Version          No. of Sequences
TrEMBL      37.3 (10/2007)   4,935,209 (02/05: 1,589,670)
EMBL        92 (09/2007)     105,696,243 (12/04: 46,105,397)
SwissProt   54.3 (10/2007)   285,335 (02/05: 168,297)
Genome → mRNA → cDNA → EST
Genome
Coding part (H. sapiens ~ 1%)
Eukaryotic gene with intron/exon structure: E1 I1 E2 I2 E3 I3 E4
Splicing
mRNA: 5’-UTR E1 E2 E3 E4 3’-UTR
Reverse Transcription
5’-EST, 3’-EST (most common)
~7 × 10^6 (70%) ESTs in GenBank!
EST: C. Venter, 1990s
From Raw Data to Sequences
I) cDNA sequence fragments (ESTs)
II) Fragment matching (clustering) (>40 bp; >95% identity)
III) Contig assembly
IV) “Contig” (contiguous clone map)
V) DNA complement
VI) ORF (open reading frame) prediction: six-frame translation (three reading frames on each strand, read 5’ → 3’)
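Steps V and VI can be sketched as follows: build the reverse complement of the contig, then translate all three reading frames on each strand. This is a minimal sketch using the standard genetic code; the function names and the short example sequence are assumptions for illustration, and real ORF prediction additionally looks for start codons and minimum lengths.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Step V: the complementary strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna: str) -> dict:
    """Step VI: three frames on the given strand (+) and three on its complement (-)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

An open reading frame is then any stretch in one of the six translations that runs without a `*`.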
Sequence “mature” in a database
Unannotated → Preliminary → Unreviewed → Standard
New sequence
DB-Entry
Some Numbers
Organism                    Genome Size (bp)   Genes
Epstein-Barr virus          0.172 × 10^6       80
Escherichia coli            4.6 × 10^6         4406
Saccharomyces cerevisiae    12.1 × 10^6        5885
Drosophila melanogaster     180 × 10^6         13601
Homo sapiens                3200 × 10^6        ~ 25000
Most human genes are “hypothetical”, “unclassified”, “unknown”
ID - Identification
AC - Accession number(s)
DT - Date
DE - Description
GN - Gene name(s)
OS - Organism species
OG - Organelle
OC - Organism classification
RN - Reference number
RP - Reference position
RC - Reference comments
RX - Reference cross-references
RA - Reference authors
RL - Reference location
UniProt_SwissProt Line Types
CC - Comments or notes
DR - Database cross-references
KW - Keywords
FT - Feature table data
SQ - Sequence header
- (blanks) sequence data
// - Termination line
A SwissProt Entry
ID LEP_ECOLI STANDARD; PRT; 324 AA.
AC P00803; P78098;
DT 21-JUL-1986 (REL. 01, CREATED)
DT 01-NOV-1997 (REL. 35, LAST SEQUENCE UPDATE)
DT 01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
GN LEPB.
OS ESCHERICHIA COLI.
OC PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC ENTEROBACTERIACEAE.
RN [1]
RP SEQUENCE FROM N.A.
RX MEDLINE; 84008229.
RA WOLFE P.B., WICKNER W., GOODMAN J.M.;
RL J. BIOL. CHEM. 258:12073-12080(1983).
CC -!- CATALYTIC ACTIVITY: CLEAVAGE OF N-TERMINAL LEADER SEQUENCES FROM
CC SECRETED AND PERIPLASMIC PROTEINS PRECURSOR.
CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. INNER MEMBRANE.
CC -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S26; ALSO KNOWN AS TYPE
CC I LEADER PEPTIDASE FAMILY.
DR EMBL; K00426; G146600; -.
DR PIR; A00998; ZPECS.
DR PROSITE; PS00501; SPASE_I_1; 1.
KW INNER MEMBRANE; TRANSMEMBRANE; HYDROLASE; PROTEASE.
FT MOD_RES 1 1 BLOCKED.
FT TRANSMEM 4 22
FT DOMAIN 23 58 CYTOPLASMIC.
FT TRANSMEM 59 77
FT DOMAIN 78 324 PERIPLASMIC.
FT ACT_SITE 91 91
FT ACT_SITE 146 146
FT MUTAGEN 62 62 E->V: INDIFFERENT.
LEP_ECOLI Length: 324 January 7, 1999 14:23 Type: P Check: 8977 ..
1 MANMFALILV IATLVTGILW CVDKFFFAPK RRERQAAAQA AAGDSLDKAT ..//
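Because every line of the entry starts with a two-letter line code, the format is easy to take apart by program. The sketch below groups lines by code and turns the FT lines into tuples; it assumes the simplified layout shown on the slide (code, whitespace, content; integer positions in FT lines) and the function names are hypothetical. Real entries have continuation lines and uncertain positions that this does not handle.

```python
from collections import defaultdict

def parse_swissprot_entry(text: str) -> dict:
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break  # termination line ends the entry
        code, _, content = line.partition(" ")
        if len(code) == 2 and content.strip():
            fields[code].append(content.strip())
    return dict(fields)

def feature_table(fields: dict) -> list:
    """Turn FT lines into (key, start, end, description) tuples."""
    features = []
    for line in fields.get("FT", []):
        parts = line.split(None, 3)
        key, start, end = parts[0], int(parts[1]), int(parts[2])
        description = parts[3] if len(parts) > 3 else ""
        features.append((key, start, end, description))
    return features

# A few lines copied from the LEP_ECOLI entry shown above.
ENTRY = """ID LEP_ECOLI STANDARD; PRT; 324 AA.
AC P00803; P78098;
GN LEPB.
FT TRANSMEM 4 22
FT DOMAIN 23 58 CYTOPLASMIC.
FT ACT_SITE 91 91
//
"""
fields = parse_swissprot_entry(ENTRY)
print(fields["AC"])
print(feature_table(fields))
```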
SwissProt Feature Table
The feature table may indicate regions that
• perform or affect function
• interact with other molecules
• affect replication
• are involved in recombination
• are a repeat unit
• have secondary or tertiary structure
• are revised or corrected
→ DB searching
→ links between databases
A Ala Alanine.
C Cys Cysteine.
D Asp Aspartic acid.
E Glu Glutamic acid.
F Phe Phenylalanine.
G Gly Glycine.
H His Histidine.
I Ile Isoleucine.
K Lys Lysine.
L Leu Leucine.
M Met Methionine.
N Asn Asparagine.
P Pro Proline.
Q Gln Glutamine.
R Arg Arginine.
S Ser Serine.
T Thr Threonine.
V Val Valine.
W Trp Tryptophan.
Y Tyr Tyrosine.
B Asx Aspartic acid or Asparagine.
Z Glx Glutamine or Glutamic acid.
X Xaa Any amino acid.
Amino acid codes
Levels of Pattern Conservation
Active site
3D protein structure
Protein fold / domains
2D protein structure
1D protein structure(amino acid sequence)
mRNA sequence
DNA sequence
Predictive Conserved Patterns
Alignment studies
[Figure: structural formulas of the amino acid side chains]
A C D E F
G H I K L
M N P Q R
S T V W Y
The 20 standard
L-amino acids
[Figure: peptide backbone with side chains R1, R2, R3]
Peptide backbone: N → C
Stereochemistry of Amino Acids: Fischer Projection
[Figure: Fischer projections of an L- and a D-amino acid, COOH at the top and R at the bottom; H2N on the left in the L form, NH2 on the right in the D form]
Compounds with one or more chirality centers can be drawn as Fischer projections (Emil Fischer):
• The main carbon chain is arranged vertically.
• The C atom with the highest oxidation state is placed at the top.
• Vertical bonds point backward; horizontal bonds come forward out of the plane of the paper.
→ proteinogenic amino acids
Selenocysteine and pyrrolysine are encoded by codons that normally terminate protein synthesis; these codons must be redefined by a recoding process so that these amino acids can be incorporated into proteins.
Selenocysteine, Sec (UGA stop codon)
→ http://www.biophys.uni-duesseldorf.de/~wilm/doc/ls_2003_01_secis_pp4.pdf
Pyrrolysine, Pyl (UAG stop codon)
[Figure: structural formulas of selenocysteine and pyrrolysine]
The 21st and 22nd proteinogenic amino acids
• white regions are disallowed except for glycine
3-10
Tutorial: http://www.cryst.bbk.ac.uk/PPS2/course/
The Peptide Bond
Ramachandran Plot
trans
Peptide notation: N → C
Right-handed α-Helix
i
i + 4
i + 8
5.4 Å pitch
• 3.6 residues in a turn (36 residues = 10 turns)
The alpha-Helix
3-10 Helix
• 3 residues in a turn
• 10 atoms in the ring formed by a hydrogen bond
Helical Structures
Beta strand conformation
7 Å pitch
The beta-Strand & beta-Sheet
Antiparallel beta-Sheet
C-terminus
Beta-Sheets
Flavodoxin (PDB: 1AG9)
Type I Type II
• difference between type I and II: orientation of the peptide bond between i+1 and i+2
• account for approx. 50% of all turns
G
Gly: no hindrance with C=O of (i+1)
Reverse Turns (“Beta-Turns”)
• generally occur at the surface of the protein
• Hydrogen bond between residues i and i+3 (Cα distance < 7 Å)
• nucleation centers during protein folding?
Beta-Hairpin Turns
Type I’ Type II’
Residue 2: always Gly Residue 1: always Gly
• Beta-hairpin turns occur between two antiparallel beta-strands
= Supersecondary Structure
VDLLKN
Local Conformations are Context-Dependent
→ identical sequence, different 3D structure
→ too short for homology assessment!
Global and local sequence features determine
protein structure and function
Amino acid sequence Structural model
PDB: 4RNT
ACDYTCGSNCYSSSDVSTAQAAGYQL
HEDGETVGSNSYPHKYNNYEGFDFSV
SSPYYEWPILSSGDVYSGGSPGADRV
VFNENNQLAGVITHTGASGNNFVECT
Ribonuclease T1 from Aspergillus oryzae
A guanyl-specific hydrolase
Bioinformatics: Searching for Homologues
Homolog: similar protein with a common ancestral sequence
• may have similar function or structure
• structural homology
• functional homology
• homology ≠ similarity!
• no “% homology”!
Ortholog: homologous proteins in different species
Paralog: homologous proteins in the same species
Secondary Databases (Patterns & Motifs)
Database    Primary Source    Stored Information
PROSITE     SwissProt         Regular expressions (patterns)
Profiles    SwissProt         Weighted matrices (profiles)
PRINTS      OWL (SwissProt)   Aligned motifs (fingerprints)
BLOCKS      PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY    BLOCKS/PRINTS     Fuzzy regular expressions (patterns)
Pfam        SwissProt         Hidden Markov Models (HMM)
…
Databases Integrating Genetic, Molecular, or Metabolic Data
Amaze              Biochemical pathways                www.ebi.ac.uk/research/pfmp/
Ecocyc / Metacyc   Metabolic pathways                  http://biocyc.org
KEGG               Metabolic pathways                  www.genome.ad.jp/kegg/
TransPath          Signal transduction pathways        http://transpath.gbf.de/
BIND               Protein interaction and complexes   www.bind.ca/
GeneNet            Gene networks                       http://wwwmgs.bionet.nsc.ru/mgs/systems/genenet/
CSNDB              Cell-signaling networks             http://geo.nihs.go.jp/csndb/
Information Retrieval Systems
• SRS – Sequence Retrieval System (at EBI, UK): http://www.srs.ebi.ac.uk
• Entrez (at NCBI, USA): http://www.ncbi.nlm.nih.gov/Entrez
Homework: practice!
• Amino acids – structures and codes: http://bioinf.man.ac.uk/aacids/amino_acid.htm
[Figure: Venn diagram grouping the 20 amino acids into the overlapping sets small, tiny, proline, polar, charged, negative, positive, aromatic, hydrophobic, aliphatic]
Amino Acid Classification: A Venn-Diagram
http://cti.itc.virginia.edu/~cmg/Demo/wheel/wheelApp.html
Sliding-Window: The Helical-Wheel Plot
Alpha-helix: 3.6 residues per turn (100 degrees / residue)
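The wheel positions follow directly from the 100 degrees per residue given above. A minimal sketch, with a hypothetical function name and the short peptide VDLLKN from an earlier slide used only as sample input:

```python
DEGREES_PER_RESIDUE = 100  # 360 degrees / 3.6 residues per turn

def helical_wheel(sequence: str) -> list:
    """Return (position, residue, angle) with the angle in degrees on the wheel."""
    return [(i + 1, aa, (i * DEGREES_PER_RESIDUE) % 360) for i, aa in enumerate(sequence)]

for position, residue, angle in helical_wheel("VDLLKN"):
    print(position, residue, angle)
```

Residues whose angles fall close together sit on the same face of the helix, which is how a helical-wheel plot reveals an amphipathic helix. After 18 residues (5 turns) the angle returns to its starting value.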
→ Transmembrane helices of rhodopsin (→ PDB)
Hydrophobicity plot of human rhodopsin (AC P08100 at ExPASy), ExPASy service ProtScale; window size = 9; Kyte & Doolittle hydrophobicity scale
Sliding-Window: The Hydrophobicity Plot
Detect potential transmembrane segments
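The plot can be reproduced in outline by averaging Kyte & Doolittle hydropathy values over a sliding window, here with the window size of 9 quoted above. This is a sketch, not the ProtScale implementation: the function names are hypothetical, and the cutoff of 1.6 used to flag candidate windows is an assumption that should be tuned to the window size.

```python
# Kyte & Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence: str, window: int = 9) -> list:
    """Mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

def candidate_windows(profile: list, threshold: float = 1.6) -> list:
    """Start indices (0-based) of windows above the assumed cutoff."""
    return [i for i, score in enumerate(profile) if score > threshold]

# First 50 residues of E. coli leader peptidase, as printed in the entry above.
lep_start = "MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSLDKAT"
profile = hydropathy_profile(lep_start)
print(candidate_windows(profile))
```

Runs of consecutive high-scoring windows are the stretches worth inspecting as potential transmembrane segments.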
Sliding-Window: Secondary Structure
• based on analyzing frequency of amino acids in different secondary structures
• A, E, L, and M strong predictors of alpha helices
• P and G are predictors of a break in a helix
• Table of predictive values created for alpha helices, beta sheets, and loops
• Structure with greatest overall prediction value used to determine the structure (80% majority, α+β window size = 5, turn: 4 residues)
• GOR method improves upon the Chou-Fasman method:
• Assumes that the amino acids surrounding the central amino acid influence the secondary structure the central amino acid is likely to adopt
→ Scoring matrices
Chou-Fasman method
SW_LEP_BACAM .......... ......MTEE Q..KPTSEKS VKRKSNTYWE WGKAIIIAVA
SW_LEPP_BACSU .......... .......... ....MTKEKV FKKKS.SILE WGKAIVIAVI
SW_LEP_ECOLI FAPKRRERQA AAQAAAGDSL D..KATLKKV APKPG..WLE TGASVFPVLA
SW_LEP_SALTY FAPKRRARQA AAQTASGDAL D..NATLNKV APKPG..WLE TGASVFPVLA
SW_LEP_PSEFL FAPRRRSAIA SYQGSVSQP. D..AVVIEKL NKEPL..LVE YGKSFFPVLF
SW_LEPC_BACCL .......... .......... ....MTKQKE KRGRR..... WPWFVA..VC
SW_LEP_HAEIN VLPKRHRQVA RAEQRSGKT. ...LSEEEKA KIEPISEASE FLSSLFPVLA
SW_LEP_MYCTU AGQVFDAAPF DAAPDADSEG DSKAAKTDEP RPAKRSTLRE FAVLAVIAVV
SW_LEP_BACAM LALLIRHFLF EPYLVEGSSM YPTLH..... DGERLFVN.. ..........
SW_LEPP_BACSU LALLIRNFLF EPYVVEGKSM DPTLV..... DSERLFVN.. ..........
SW_LEP_ECOLI IVLIVRSFIY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_SALTY IVLIVRSFLY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_PSEFL IVLVLRSFLV EPFQIPSGSM KPTLD..... VGDFILVNKF SYGIRLPVID
SW_LEPC_BACCL VVATLRLFVF SNYVVEGKSM MPTLE..... SGNLLIVN.. ..........
SW_LEP_HAEIN VVFLVRSFLF EPFQIPSGSM ESTLR..... VGDFLVVNKY AYGVKDPIFQ
SW_LEP_MYCTU LYYVMLTFVA RPYLIPSESM EPTLHGCSTC VGDRIMVD.. ..........
Example of a multiple sequence alignment (ClustalW)
“Block”
[S,G]-x-S-M-x-[P,S] “Pattern”
→ Regular expression matching
-Consensus pattern: [GS]-x-S-M-x-[PS]-[AT]-[LF]
[S is an active site residue]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: 16.
-Consensus pattern: K-R-[LIVMSTA](2)-G-x-[PG]-G-[DE]-x-[LIVM]-x-[LIVMFY]
[K is an active site residue]
-Sequences known to belong to this class detected by the pattern: ALL SPases I
from prokaryotes as well as yeast IMP1, but not IMP2.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Consensus pattern: [LIVMFYW](2)-x(2)-G-D-[NH]-x(3)-[SND]-x(2)-[SG]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: 10.
Searching for Consensus Patterns in PROSITE
Query: E.coli leader peptidase
Spase_I_1 (G,S)xSMx(P,S)(A,T)(L,F)
(S)xSMx(P)(T)(L)
89: PFQIP SGSMMPTL LIGDF
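A PROSITE pattern maps almost one-to-one onto a regular expression, so the match above can be reproduced with Python's `re` module. The converter below is a minimal sketch with a hypothetical function name; it covers only the syntax elements used in these patterns (`-` separators, `x` wildcards, `[...]` sets, `(n)` repeats, plus `{...}` exclusions and `<`/`>` anchors) and is not the scanner PROSITE itself uses. The test sequence is the E. coli fragment from the alignment shown earlier.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Rewrite a PROSITE pattern as a Python regular expression."""
    regex = pattern.strip().rstrip(".")
    regex = regex.replace("-", "")                       # element separator
    regex = regex.replace("{", "[^").replace("}", "]")   # {AB}: anything but A or B
    regex = regex.replace("(", "{").replace(")", "}")    # x(2) -> .{2}
    regex = regex.replace("x", ".")                      # x: any amino acid
    regex = regex.replace("<", "^").replace(">", "$")    # N- / C-terminal anchors
    return regex

spase_i_1 = prosite_to_regex("[GS]-x-S-M-x-[PS]-[AT]-[LF]")
fragment = "EPFQIPSGSMMPTLLIGDFILVEKF"  # from SW_LEP_ECOLI in the alignment
match = re.search(spase_i_1, fragment)
print(spase_i_1, match.group(), match.start())
```

Note that `match.start()` is 0-based and relative to the fragment, not the residue number in the full-length protein.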
[Figure: bar chart of amino acid frequency in % (axis 0 to 16) for each of the 20 amino acids (A C D E F G H I K L M N P Q R S T V W Y), comparing SwissProt V 40.30, the archaebacterium Thermoplasma volcanium, E. coli K-12, P. falciparum, and Homo sapiens]
Amino Acid Composition
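A composition profile of this kind is a simple count over all sequences of a proteome or database. A minimal sketch, assuming the sequences are already available as plain one-letter strings; the function name is hypothetical and the sample input is the first line of the LEP_ECOLI sequence shown earlier, not a whole proteome.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences) -> dict:
    """Percentage of each of the 20 standard amino acids over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ambiguity codes such as B, Z, X are ignored in this sketch.
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (100.0 * counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}

sample = ["MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSLDKAT"]
for aa, percent in composition(sample).items():
    print(aa, round(percent, 1))
```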
Protein Targeting Signals
mature protein
e.g. secreted proteins, mitochondrial matrix proteins, chloroplast stromal proteins
e.g. mitochondrial IMS proteins, apicoplast proteins
Known exceptions:
Signal peptidase
e.g. some mitochondrial proteins
some peroxisomal proteins
( ) SKL
http://www.rockefeller.edu/pubinfo/proteintarget.html