Module 1 (Structure Databases)

Databases, Structures, Local Patterns

Data, Data Collection, Database

Contents

• Molecular structures
• Spectra
• Patent information
• Molecular properties
• Scientific literature
• References
• Supplier information
• Prices
• …

Implementation

• “Flat file”
• Local database
• Web (WWW) access
• …

Definition of a Database

Database = management component + storage component for persistent data that serve a specific purpose.

“Molecule Databases”

[Figure: components of a molecule database: raw data, source file, filtering, library file, index file (Data 1, Data 2, Data 3), application software, user interface, user]

What is this?

• How do we recognize a “tree”?
• Which “trees” are most similar to each other?

Molecular Similarity

Typical applications of the “similarity concept”

• Similarity searching in databases
• Pattern recognition in molecular structures
• Similarity searching in virtual compound libraries
• Data clustering / classification
• Single compound design, de novo design
• Compound library design
• “Diversity” analysis of compound collections
• SAR modeling & prediction

Analogy-based Feature Selection

“Find function-determining features of macromolecular receptors and their small molecule effectors”

[Figure: receptors A1, A2, B1, B2, B3, C1, C2 and their ligands a1, a2, b1, b2, b3, c1, c2]

Applications

• Drug Discovery

• Chemical Biology

• Functional Genomics

• Similarity Searching & Virtual Screening

• Identification of targets & ligands

• Design of compound libraries

The Early Drug Discovery Process

Target identification → target validation → hit identification → lead identification → lead optimization → preclinical development

[Figure labels: Drug, Bioinformatics, Cheminformatics]

Primary Sequence Databases

Database     Version           No. of sequences
SwissProt    54.3 (10/2007)    285,335 (02/05: 168,297)
EMBL         92 (09/2007)      105,696,243 (12/04: 46,105,397)
TrEMBL       37.3 (10/2007)    4,935,209 (02/05: 1,589,670)
MIPS
OWL
…

UniProt combines SwissProt, TrEMBL, PIR
UniProtKB/TrEMBL Release 33.7: 3 189 332 entries

→ http://www.ebi.ac.uk/Databases/index.html
→ http://pir.georgetown.edu/pirwww/
→ http://pir.georgetown.edu/pirwww/dbinfo/

Genome → mRNA → cDNA → EST

Genome: coding part (H. sapiens) ~ 1%

Eukaryotic gene with intron/exon structure: E1 I1 E2 I2 E3 I3 E4

Splicing → mRNA: 5’-UTR E1 E2 E3 E4 3’-UTR

Reverse transcription → cDNA, sequenced as 5’-EST or 3’-EST (most common)

~7 x 10^6 (70%) EST in GenBank!

EST: C. Venter, 1990s

From Raw Data to Sequences

I) cDNA sequence fragments (ESTs)
II) Fragment matching (clustering) (>40 bp; >95% identity)
III) Contig assembly
IV) “Contig” (contiguous clone map)
V) DNA complement
VI) ORF (open reading frame) prediction: six-frame translation
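Steps V and VI can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular database: it assumes the standard genetic code, an unambiguous DNA alphabet (A, C, G, T), and the helper names `reverse_complement`, `six_frame_translation`, and `longest_orf`, which are chosen here for illustration. The example DNA string is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Step V: complement every base and read the strand in reverse."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Step VI: three reading frames on each strand."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_orf(protein: str) -> str:
    """Longest stop-free stretch that starts with Met (simplified ORF)."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


frames = six_frame_translation("ATGGCCAACATGTTTGCCTGA")  # made-up example
for name, protein in frames.items():
    print(name, protein, longest_orf(protein))
```

Real ORF finders additionally apply a minimum length and handle ambiguous bases; both are left out to keep the sketch short.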

Sequences “mature” in a database

Unannotated → Preliminary → Unreviewed → Standard

New sequence

DB entry

Some Numbers

Organism                    Genome size (bp)    Genes
Epstein-Barr virus          0.172 x 10^6        80
Escherichia coli            4.6 x 10^6          4406
Saccharomyces cerevisiae    12.1 x 10^6         5885
Drosophila melanogaster     180 x 10^6          13601
Homo sapiens                3200 x 10^6         ~25000

Most human genes are “hypothetical”, “unclassified”, or “unknown”.

UniProt/SwissProt Line Types

ID - Identification
AC - Accession number(s)
DT - Date
DE - Description
GN - Gene name(s)
OS - Organism species
OG - Organelle
OC - Organism classification
RN - Reference number
RP - Reference position
RC - Reference comments
RX - Reference cross-references
RA - Reference authors
RL - Reference location
CC - Comments or notes
DR - Database cross-references
KW - Keywords
FT - Feature table data
SQ - Sequence header
(blanks) - Sequence data
// - Termination line

A SwissProt Entry

ID LEP_ECOLI STANDARD; PRT; 324 AA.

AC P00803; P78098;

DT 21-JUL-1986 (REL. 01, CREATED)

DT 01-NOV-1997 (REL. 35, LAST SEQUENCE UPDATE)

DT 01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)

DE SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).

GN LEPB.

OS ESCHERICHIA COLI.

OC PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;

OC ENTEROBACTERIACEAE.

RN [1]

RP SEQUENCE FROM N.A.

RX MEDLINE; 84008229.

RA WOLFE P.B., WICKNER W., GOODMAN J.M.;

RL J. BIOL. CHEM. 258:12073-12080(1983).

CC -!- CATALYTIC ACTIVITY: CLEAVAGE OF N-TERMINAL LEADER SEQUENCES FROM

CC SECRETED AND PERIPLASMIC PROTEINS PRECURSOR.

CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. INNER MEMBRANE.

CC -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S26; ALSO KNOWN AS TYPE

CC I LEADER PEPTIDASE FAMILY.

DR EMBL; K00426; G146600; -.

DR PIR; A00998; ZPECS.

DR PROSITE; PS00501; SPASE_I_1; 1.

KW INNER MEMBRANE; TRANSMEMBRANE; HYDROLASE; PROTEASE.

FT MOD_RES 1 1 BLOCKED.

FT TRANSMEM 4 22

FT DOMAIN 23 58 CYTOPLASMIC.

FT TRANSMEM 59 77

FT DOMAIN 78 324 PERIPLASMIC.

FT ACT_SITE 91 91

FT ACT_SITE 146 146

FT MUTAGEN 62 62 E->V: INDIFFERENT.

LEP_ECOLI Length: 324 January 7, 1999 14:23 Type: P Check: 8977 ..

1 MANMFALILV IATLVTGILW CVDKFFFAPK RRERQAAAQA AAGDSLDKAT ..//
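Because every line of a flat-file entry starts with a two-letter line code, an entry can be split into fields with very little code. The sketch below is an illustration only: it assumes the old fixed layout shown in the entry above (line code in columns 1 and 2, feature lines as key, start, end, optional note), and the function names `parse_swissprot`, `accession_numbers`, and `features` are hypothetical. The sample text is an excerpt of the LEP_ECOLI entry.

```python
from collections import defaultdict

ENTRY = """\
ID   LEP_ECOLI      STANDARD;      PRT;   324 AA.
AC   P00803; P78098;
DE   SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
OS   ESCHERICHIA COLI.
KW   INNER MEMBRANE; TRANSMEMBRANE; HYDROLASE; PROTEASE.
FT   TRANSMEM      4     22
FT   DOMAIN       78    324       PERIPLASMIC.
FT   ACT_SITE     91     91
//
"""


def parse_swissprot(text: str) -> dict:
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):  # termination line ends the entry
            break
        fields[line[:2]].append(line[2:].strip())
    return dict(fields)


def accession_numbers(fields: dict) -> list:
    """Split the AC line(s) into individual accession numbers."""
    joined = " ".join(fields.get("AC", []))
    return [acc.strip() for acc in joined.split(";") if acc.strip()]


def features(fields: dict) -> list:
    """Return (key, start, end, note) tuples from the FT lines."""
    table = []
    for line in fields.get("FT", []):
        parts = line.split(None, 3)
        note = parts[3] if len(parts) > 3 else ""
        table.append((parts[0], int(parts[1]), int(parts[2]), note))
    return table


fields = parse_swissprot(ENTRY)
print(accession_numbers(fields))
for feature in features(fields):
    print(feature)
```

Current UniProt releases use a different feature-table layout, so this sketch applies to entries formatted like the one shown here.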

SwissProt Feature Table

The feature table may indicate regions that

• perform or affect function
• interact with other molecules
• affect replication
• are involved in recombination
• are a repeat unit
• have secondary or tertiary structure
• are revised or corrected

→ DB searching
→ links between databases

Amino acid codes

A  Ala  Alanine
C  Cys  Cysteine
D  Asp  Aspartic acid
E  Glu  Glutamic acid
F  Phe  Phenylalanine
G  Gly  Glycine
H  His  Histidine
I  Ile  Isoleucine
K  Lys  Lysine
L  Leu  Leucine
M  Met  Methionine
N  Asn  Asparagine
P  Pro  Proline
Q  Gln  Glutamine
R  Arg  Arginine
S  Ser  Serine
T  Thr  Threonine
V  Val  Valine
W  Trp  Tryptophan
Y  Tyr  Tyrosine
B  Asx  Aspartic acid or asparagine
Z  Glx  Glutamine or glutamic acid
X  Xaa  Any amino acid

Levels of Pattern Conservation

Active site

3D protein structure

Protein fold / domains

2D protein structure

1D protein structure(amino acid sequence)

mRNA sequence

DNA sequence

Predictive conserved patterns

Alignment studies

[Figure: structures of the 20 standard L-amino acids: A C D E F, G H I K L, M N P Q R, S T V W Y]

[Figure: peptide backbone with side chains R1, R2, R3, written N → C]

Stereochemistry of Amino Acids: Fischer Projection

[Figure: Fischer projections of the L form (COOH at the top, H2N on the left, R at the bottom) and the D form (COOH at the top, NH2 on the right, R at the bottom); L → proteinogenic amino acids]

Compounds with one or more chirality centers can be drawn in the Fischer projection (Emil Fischer):
• The main carbon chain is arranged vertically.
• The carbon atom with the highest oxidation state is written at the top.
• Vertical bonds point backward; horizontal bonds come forward out of the plane of the paper.

The 21st and 22nd proteinogenic amino acids

Selenocysteine and pyrrolysine are encoded by codons that normally terminate protein synthesis. These codons must be redefined by a recoding process before the two amino acids can be incorporated into proteins.

Selenocysteine, Sec (UGA stop codon)

→ http://www.biophys.uni-duesseldorf.de/~wilm/doc/ls_2003_01_secis_pp4.pdf

Pyrrolysine, Pyl (UAG stop codon)

The Peptide Bond

Peptide notation: N → C

[Figure: trans peptide bond]

Ramachandran Plot

• white regions are disallowed except for glycine

[Figure label: 3-10]

Tutorial: http://www.cryst.bbk.ac.uk/PPS2/course/

Helical Structures

The alpha-Helix

Right-handed α-helix
• 3.6 residues per turn (36 residues = 10 turns)
• 5.4 Å pitch

[Figure labels: residues i, i + 4, i + 8]

3-10 Helix

• 3 residues per turn
• 10 atoms in the ring formed by a hydrogen bond
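The helix parameters above fix the geometry per residue. As a worked check (the symbols $p$ for pitch, $n$ for residues per turn, $d$ for rise per residue, and $\theta$ for rotation per residue are introduced here only for illustration):

```latex
\begin{align*}
d_{\alpha} &= \frac{p}{n} = \frac{5.4\ \text{\AA}}{3.6} = 1.5\ \text{\AA per residue} \\
\theta_{\alpha} &= \frac{360^\circ}{3.6} = 100^\circ \text{ per residue} \\
\theta_{3_{10}} &= \frac{360^\circ}{3} = 120^\circ \text{ per residue}
\end{align*}
```

A 36-residue α-helix therefore spans 10 turns, or about 54 Å along the helix axis.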

Beta strand conformation

7 Å pitch

The beta-Strand & beta-Sheet

Antiparallel beta-Sheet

C-terminus

Beta-Sheets

Flavodoxin (PDB: 1AG9)

Reverse Turns (“Beta-Turns”)

• generally occur at the surface of the protein
• hydrogen bond between residues i and i+3 (Cα distance < 7 Å)
• nucleation centers during protein folding?

Type I and Type II

• difference between type I and II: orientation of the peptide bond between i+1 and i+2
• account for approx. 50% of all turns
• Gly (G): no hindrance with C=O of (i+1)

Beta-Hairpin Turns

Type I’: residue 2 always Gly

Type II’: residue 1 always Gly

• Beta-hairpin turns occur between two antiparallel beta-strands

= Supersecondary Structure

Local Conformations are Context-Dependent

Example: VDLLKN

→ identical sequence, different 3D structure
→ too short for homology assessment!

Global and local sequence features determine protein structure and function

Amino acid sequence Structural model

PDB: 4RNT

ACDYTCGSNCYSSSDVSTAQAAGYQL

HEDGETVGSNSYPHKYNNYEGFDFSV

SSPYYEWPILSSGDVYSGGSPGADRV

VFNENNQLAGVITHTGASGNNFVECT

Ribonuclease T1 from Aspergillus oryzae
A guanyl-specific hydrolase

Bioinformatics: Searching for Homologues

Homolog: similar protein with a common ancestral sequence

• may have similar function or structure
• structural homology
• functional homology
• homology ≠ similarity!
• no “% homology”!

Ortholog: homologous proteins in different species

Paralog: homologous proteins in the same species

Secondary Databases (Patterns & Motifs)

Database    Primary source    Stored information
PROSITE     SwissProt         Regular expressions (patterns)
Profiles    SwissProt         Weighted matrices (profiles)
PRINTS      OWL (SwissProt)   Aligned motifs (fingerprints)
BLOCKS      PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY    BLOCKS/PRINTS     Fuzzy regular expressions (patterns)
Pfam        SwissProt         Hidden Markov models (HMM)
…

Databases integrating Genetic, Molecular, or Metabolic Data

• Amaze: biochemical pathways (www.ebi.ac.uk/research/pfmp/)
• Ecocyc / Metacyc: metabolic pathways (http://biocyc.org)
• KEGG: metabolic pathways (www.genome.ad.jp/kegg/)
• TransPath: signal transduction pathways (http://transpath.gbf.de/)
• BIND: protein interactions and complexes (www.bind.ca/)
• GeneNet: gene networks (http://wwwmgs.bionet.nsc.ru/mgs/systems/genenet/)
• CSNDB: cell-signaling networks (http://geo.nihs.go.jp/csndb/)

Information Retrieval Systems

• SRS: Sequence Retrieval System (at EBI, UK), http://www.srs.ebi.ac.uk

• Entrez (at NCBI, USA), http://www.ncbi.nlm.nih.gov/Entrez

Homework: practice!

• Amino acids: structures and codes, http://bioinf.man.ac.uk/aacids/amino_acid.htm

Amino Acid Classification: A Venn Diagram

[Figure: Venn diagram grouping the amino acids by the properties tiny, small, proline, polar, charged (positive, negative), aromatic, aliphatic, and hydrophobic]

Sliding-Window: The Helical-Wheel Plot

http://cti.itc.virginia.edu/~cmg/Demo/wheel/wheelApp.html

Alpha-helix: 3.6 residues per turn (100 degrees / residue)

→ Transmembrane helices of rhodopsin (→ PDB)
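The helical-wheel projection follows directly from the 100 degrees per residue stated above: each residue is placed on a circle at a multiple of 100 degrees. The short sketch below computes these angles; the function name `helical_wheel_angles` is chosen here for illustration, and an ideal α-helix is assumed.

```python
def helical_wheel_angles(sequence: str, degrees_per_residue: float = 100.0) -> list:
    """Return (position, residue, angle) with the angle in [0, 360) degrees.

    Position is 1-based; the first residue is placed at 0 degrees.
    """
    return [
        (index + 1, residue, (index * degrees_per_residue) % 360.0)
        for index, residue in enumerate(sequence)
    ]


for position, residue, angle in helical_wheel_angles("VDLLKN"):
    print(f"{position:2d} {residue} {angle:6.1f}")
```

With 100 degrees per residue the pattern repeats after 18 residues (5 full turns), which is why helical wheels are usually drawn for windows of up to 18 residues.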

Sliding-Window: The Hydrophobicity Plot

Detect potential transmembrane segments

[Figure: hydrophobicity plot of human rhodopsin (AC P08100 at ExPASy), ExPASy service ProtScale; window size = 9; Kyte & Doolittle hydrophobicity scale]
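A sliding-window hydrophobicity profile can be reproduced in a few lines. The sketch below is not the ProtScale implementation: it assumes an unweighted mean over the window, uses the published Kyte & Doolittle values, and the function name `hydropathy_profile` is hypothetical. The test sequence is the N-terminal stretch of LEP_ECOLI from the SwissProt entry shown earlier, whose feature table annotates a transmembrane segment at residues 4 to 22.

```python
# Kyte & Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 9) -> list:
    """Return (center position, mean hydropathy) for every full window.

    Positions are 1-based and refer to the central residue of the window.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence]
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        mean = sum(scores[start:start + window]) / window
        profile.append((start + half + 1, mean))
    return profile


LEP_ECOLI_START = "MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSLDKAT"
profile = hydropathy_profile(LEP_ECOLI_START, window=9)
center, score = max(profile, key=lambda item: item[1])
print(f"most hydrophobic window is centered at residue {center}")
```

Longer windows (for example 19 residues) are commonly used when the goal is specifically to find membrane-spanning helices; the window of 9 here follows the figure.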

Sliding-Window: Secondary Structure

Chou-Fasman method

• based on analyzing the frequency of amino acids in different secondary structures
• A, E, L, and M are strong predictors of alpha helices
• P and G are predictors of a break in a helix
• a table of predictive values is created for alpha helices, beta sheets, and loops
• the structure with the greatest overall prediction value is used to determine the structure (80% majority, α+β window size = 5, turn: 4 residues)

GOR method

• improves upon the Chou-Fasman method
• assumes that the amino acids surrounding the central amino acid influence the secondary structure that the central amino acid is likely to adopt

→ Scoring matrices

SW_LEP_BACAM .......... ......MTEE Q..KPTSEKS VKRKSNTYWE WGKAIIIAVA

SW_LEPP_BACSU .......... .......... ....MTKEKV FKKKS.SILE WGKAIVIAVI

SW_LEP_ECOLI FAPKRRERQA AAQAAAGDSL D..KATLKKV APKPG..WLE TGASVFPVLA

SW_LEP_SALTY FAPKRRARQA AAQTASGDAL D..NATLNKV APKPG..WLE TGASVFPVLA

SW_LEP_PSEFL FAPRRRSAIA SYQGSVSQP. D..AVVIEKL NKEPL..LVE YGKSFFPVLF

SW_LEPC_BACCL .......... .......... ....MTKQKE KRGRR..... WPWFVA..VC

SW_LEP_HAEIN VLPKRHRQVA RAEQRSGKT. ...LSEEEKA KIEPISEASE FLSSLFPVLA

SW_LEP_MYCTU AGQVFDAAPF DAAPDADSEG DSKAAKTDEP RPAKRSTLRE FAVLAVIAVV

SW_LEP_BACAM LALLIRHFLF EPYLVEGSSM YPTLH..... DGERLFVN.. ..........

SW_LEPP_BACSU LALLIRNFLF EPYVVEGKSM DPTLV..... DSERLFVN.. ..........

SW_LEP_ECOLI IVLIVRSFIY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ

SW_LEP_SALTY IVLIVRSFLY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ

SW_LEP_PSEFL IVLVLRSFLV EPFQIPSGSM KPTLD..... VGDFILVNKF SYGIRLPVID

SW_LEPC_BACCL VVATLRLFVF SNYVVEGKSM MPTLE..... SGNLLIVN.. ..........

SW_LEP_HAEIN VVFLVRSFLF EPFQIPSGSM ESTLR..... VGDFLVVNKY AYGVKDPIFQ

SW_LEP_MYCTU LYYVMLTFVA RPYLIPSESM EPTLHGCSTC VGDRIMVD.. ..........

Example of a multiple sequence alignment (ClustalW)
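What can be measured from an alignment like this is identity (or similarity), never “% homology”. As a minimal sketch, the function below computes the percent identity between two aligned rows; it assumes that "." and "-" mark gaps, that columns containing a gap in either row are skipped, and that blanks are only layout. The function name `percent_identity` is chosen here for illustration, and other definitions of the denominator are also in use.

```python
GAP_CHARS = ".-"


def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of identical residues over all gap-free aligned columns."""
    a = row_a.replace(" ", "")
    b = row_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        compared += 1
        identical += x == y
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared


# First alignment block of the E. coli and S. typhimurium rows shown above.
ECOLI = "FAPKRRERQA AAQAAAGDSL D..KATLKKV APKPG..WLE TGASVFPVLA"
SALTY = "FAPKRRARQA AAQTASGDAL D..NATLNKV APKPG..WLE TGASVFPVLA"
print(f"{percent_identity(ECOLI, SALTY):.1f} % identity")
```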

“Block”

[S,G]-x-S-M-x-[P,S] “Pattern”

→ Regular expression matching

-Consensus pattern: [GS]-x-S-M-x-[PS]-[AT]-[LF]

[S is an active site residue]

-Sequences known to belong to this class detected by the pattern: ALL.

-Other sequence(s) detected in SWISS-PROT: 16.

-Consensus pattern: K-R-[LIVMSTA](2)-G-x-[PG]-G-[DE]-x-[LIVM]-x-[LIVMFY]

[K is an active site residue]

-Sequences known to belong to this class detected by the pattern: ALL SPases I from prokaryotes as well as yeast IMP1, but not IMP2.

-Other sequence(s) detected in SWISS-PROT: NONE.

-Consensus pattern: [LIVMFYW](2)-x(2)-G-D-[NH]-x(3)-[SND]-x(2)-[SG]

-Sequences known to belong to this class detected by the pattern: ALL.

-Other sequence(s) detected in SWISS-PROT: 10.

Searching for Consensus Patterns in PROSITE

Query: E.coli leader peptidase

Spase_I_1 (G,S)xSMx(P,S)(A,T)(L,F)

(S)xSMx(P)(T)(L)

89: PFQIP SGSMMPTL LIGDF
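PROSITE consensus patterns map almost one-to-one onto regular expressions, so a search like the one above can be sketched with Python's `re` module. The converter below is a simplified illustration: it handles only the syntax elements used in the patterns shown here plus `{..}` exclusions and `<` / `>` anchors, and the function name `prosite_to_regex` is hypothetical. The query string is the sequence fragment from the hit line above.

```python
import re

# One PROSITE element: optional "<", a residue specification,
# an optional repeat "(n)" or "(n,m)", and an optional ">".
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE pattern such as '[GS]-x-S-M' to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = match.groups()
        if core == "x":                 # x = any amino acid
            core = "."
        elif core.startswith("{"):      # {AB} = any residue except A or B
            core = "[^" + core[1:-1] + "]"
        parts.append(
            ("^" if start else "")
            + core
            + ("{" + repeat + "}" if repeat else "")
            + ("$" if end else "")
        )
    return "".join(parts)


regex = prosite_to_regex("[GS]-x-S-M-x-[PS]-[AT]-[LF]")
hit = re.search(regex, "PFQIPSGSMMPTLLIGDF")
print(regex, hit.group() if hit else None)
```

Note that `re.search` reports only the first, non-overlapping match; scanning a whole database for all occurrences would need a loop over start positions or a lookahead.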

Amino Acid Composition

[Figure: bar chart of amino acid frequencies (%, axis 0 to 16) for A C D E F G H I K L M N P Q R S T V W Y in SwissProt V 40.30, an archaebacterium (Thermoplasma volcanium), E. coli K-12, P. falciparum, and Homo sapiens]
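A composition like this is simply the relative frequency of each residue type over a set of sequences. The sketch below shows the computation; the function name `composition` is chosen for illustration, residues outside the 20 standard codes (for example B, Z, X) are ignored by assumption, and the input is a single toy fragment rather than a proteome.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences) -> dict:
    """Percentage of each standard amino acid over all given sequences."""
    counts = Counter()
    for sequence in sequences:
        counts.update(aa for aa in sequence.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Toy input: the N-terminal fragment of LEP_ECOLI from the entry shown earlier.
percentages = composition(["MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSLDKAT"])
for aa, percent in percentages.items():
    print(f"{aa} {percent:5.1f}")
```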

Protein Targeting Signals

mature protein

e.g. secreted proteins, mitochondrial matrix proteins, chloroplast stromal proteins

e.g. mitochondrial IMS proteins, apicoplast proteins

Known exceptions:

Signal peptidase

e.g. some mitochondrial proteins

some peroxisomal proteins

( ) SKL

http://www.rockefeller.edu/pubinfo/proteintarget.html
