
Characterization of Secondary Structure of Proteins using Different Vocabularies

Madhavi K. Ganapathiraju
Language Technologies Institute

Advisors: Raj Reddy, Judith Klein-Seetharaman, Roni Rosenfeld

2nd Biological Language Modeling Workshop
Carnegie Mellon University
May 13-14, 2003


Presentation overview

• Classification of Protein Segments by their Secondary Structure types

• Document Processing Techniques

• Choice of Vocabulary in Protein Sequences

• Application of Latent Semantic Analysis

• Results

• Discussion

Secondary Structure of Protein

Sample Protein: MEPAPSAGAELQPPLFANASDAYPSACPSAGANASGPPGARSASSLALAIAITALYSAVCAVGLLGNVLVMFGIVRYTKMKTATNIYIFNLALADALATSTLPFQSA…

Application of Text Processing

An analogy between language and proteins:

Language: Letters → Words → Sentences (letter counts in languages; word counts in documents)
Proteins: Residues → Secondary Structure → Proteins → Genomes

Can unigrams distinguish Secondary Structure Elements from one another?

Unigrams for Document Classification

• Word-Document matrix
  – represents documents in terms of their word unigrams

[Table: example word-document matrix. Rows are words (clouds, cell, drawing, dry, gene, graph, …, weather); columns are Doc-1 through Doc-4; each cell holds the count of that word in that document.]

This is a "bag-of-words" model, since the position of words in the document is not taken into account. A sketch of the construction follows.
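As a concrete illustration, here is a minimal Python sketch of building such a word-document count matrix; the two example documents are hypothetical stand-ins, not taken from the slides.

    # Build a word-document count matrix (bag-of-words).
    from collections import Counter
    import numpy as np

    docs = [
        "clouds dry weather clouds graph",
        "cell gene cell cell drawing",
    ]
    vocab = sorted({w for d in docs for w in d.split()})

    # Rows are words, columns are documents; cell (i, j) holds the
    # count of word i in document j.
    counts = np.zeros((len(vocab), len(docs)), dtype=int)
    for j, doc in enumerate(docs):
        c = Counter(doc.split())
        for i, w in enumerate(vocab):
            counts[i, j] = c[w]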

Word-Document Matrix

    1 0 2 0 0 0 0 0 0
    1 0 0 0 0 0 0 0 0
    1 3 2 1 0 0 0 0 0
    0 2 1 0 0 0 1 0 1
    1 3 0 0 0 1 0 0 0
    0 2 0 0 1 0 0 1 0

Document Vectors

Each column of the word-document matrix is a document vector:

    Doc-1 = (1, 1, 1, 0, 1, 0)
    Doc-2 = (0, 0, 3, 2, 3, 2)
    Doc-3 = (2, 0, 2, 1, 0, 0)
    …
    Doc-N = (0, 0, 0, 1, 0, 0)

Document Comparison

• Documents can be compared to one another via the dot product of their document vectors.

Example: Doc-3 .* Doc-2 = (2, 0, 2, 1, 0, 0) .* (0, 0, 3, 2, 3, 2) = (0, 0, 6, 2, 0, 0); summing the element-wise products gives the dot product, 8.

• Formal modeling of documents is presented in the next few slides…
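The same comparison in a few lines of Python, using the Doc-3 and Doc-2 vectors from the matrix above; numpy's * is the element-wise product shown on the slide, and summing it gives the dot product.

    import numpy as np

    doc3 = np.array([2, 0, 2, 1, 0, 0])
    doc2 = np.array([0, 0, 3, 2, 3, 2])

    elementwise = doc3 * doc2        # (0, 0, 6, 2, 0, 0), as on the slide
    similarity = elementwise.sum()   # dot product = 8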

Vector Space Model construction

• Document vectors in the word-document matrix are normalized
  – by word counts in the entire document collection
  – by document lengths
• This gives a Vector Space Model (VSM) of the set of documents
• Equations for normalization follow…

Word count normalization

Each cell of the word-document matrix is normalized as

    w_ij = (1 - ε_i) · c_ij / n_j

where c_ij is the count of word i in document j (word count in document), n_j is the length of document j, and (1 - ε_i) is a weight that depends on the word's counts in the corpus: ε_i is the normalized entropy of word i across the collection, following Bellegarda [3],

    ε_i = -(1 / log N) Σ_j (c_ij / t_i) log(c_ij / t_i)

t_i is the total number of times word i occurs in the corpus, and N is the number of documents.
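A sketch of this normalization, assuming the entropy-based weighting of Bellegarda [3]; the exact weighting used in the slides is not shown, so treat the details as illustrative.

    import numpy as np

    def normalize(counts):
        """counts: words x documents matrix of raw counts."""
        counts = counts.astype(float)
        n_docs = counts.shape[1]
        t = counts.sum(axis=1, keepdims=True)      # t_i: corpus count of word i
        n = counts.sum(axis=0, keepdims=True)      # n_j: length of document j
        p = np.where(counts > 0, counts / t, 1.0)  # p=1 makes the log term vanish
        entropy = -(counts / t * np.log(p)).sum(axis=1)
        eps = entropy / np.log(n_docs)             # normalized entropy of word i
        return (1.0 - eps)[:, None] * counts / n   # w_ij = (1 - eps_i) * c_ij / n_j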

[Figure: a word-document matrix of raw integer counts alongside the corresponding normalized word-document matrix, in which each cell has become a fractional weight.]

Document vectors after normalization

[Figure: the columns of the normalized word-document matrix, taken as the normalized document vectors.]

Use of Vector Space Model

• A query document is also represented as a vector
• It is normalized by corpus word counts
• Documents related to the query document are identified by measuring the similarity of the document vectors to the query document vector (a sketch follows)
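A minimal sketch of the retrieval step. The slides say only "similarity", so the use of cosine similarity here is an assumption, and most_similar is a hypothetical helper name.

    import numpy as np

    def most_similar(doc_vectors, query_vector, n=3):
        """doc_vectors: words x documents; query_vector: words."""
        q = query_vector / np.linalg.norm(query_vector)
        d = doc_vectors / np.linalg.norm(doc_vectors, axis=0, keepdims=True)
        sims = d.T @ q                      # cosine similarity to each document
        return np.argsort(sims)[::-1][:n]   # indices of the n most similar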

Application to Protein Secondary Structure Prediction

Protein Secondary Structure

• Dictionary of Protein Secondary Structure (DSSP): annotation of each residue with its structure, based on hydrogen-bonding patterns and geometrical constraints
• 7 DSSP labels for PSS:
  – H, G (helix types)
  – B, E (strand types)
  – S, I, T (coil types)

Example

Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT

Key to DSSP labels: T, S, I, _: Coil; E, B: Strand; H, G: Helix
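This key is simply a 7-to-3 lookup; a minimal Python version:

    # 7 DSSP labels -> 3 structure classes, from the key above.
    DSSP_TO_CLASS = {
        "H": "Helix", "G": "Helix",
        "E": "Strand", "B": "Strand",
        "T": "Coil", "S": "Coil", "I": "Coil", "_": "Coil",
    }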

Reference Model

• Proteins are segmented into structural segments
• A normalized word-document matrix is constructed from the structural segments

Example

Structural segments are obtained from the given sequence:

Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
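A sketch of this segmentation step: group consecutive residues whose DSSP labels map to the same 3-class structure type. The function name, and the reuse of the DSSP_TO_CLASS table from the earlier sketch, are illustrative assumptions.

    from itertools import groupby

    def segment(residues, dssp):
        """Split a protein into runs of one 3-class structure type."""
        segments, i = [], 0
        for label, run in groupby(DSSP_TO_CLASS[c] for c in dssp):
            n = len(list(run))
            segments.append((label, residues[i:i + n]))
            i += n
        return segments

    residues = "PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH"
    dssp     = "____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT"
    # segment(residues, dssp)[:2] == [('Coil', 'PKPPVKFN'), ('Strand', 'RRIFLLNTQNVI')]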

Example

Unigrams in the structural segments:

[Table: amino-acid × structural-segment count matrix. Rows are the amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y); columns are Seg-1 through Seg-9; each cell holds the count of that amino acid in that segment.]

Amino-acid Structural-Segment Matrix

• This matrix is the analogue of the word-document matrix: its columns are the document vectors (one per structural segment) and its rows are the word vectors (one per amino acid).

Query Vector

• A query segment is likewise represented as a column vector of its amino-acid counts.

Data Set used for PSSP

• JPred data
  – 513 protein sequences in all
  – <25% homology between sequences
  – Residues & corresponding DSSP annotations are given
• We used
  – 50 sequences for model construction (training)
  – 30 sequences for testing

Classification

• Proteins from the test set
  – are segmented into structural elements, called "query segments"
  – segment vectors are constructed for them
• For each query segment
  – the 'n' most similar reference segment vectors are retrieved
  – the query segment is assigned the same structure as the majority of the retrieved segments* (a sketch follows)

*k-nearest neighbour classification
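A sketch of this k-nearest-neighbour step, reusing the hypothetical most_similar helper from the earlier VSM sketch; majority voting is taken over the labels of the retrieved reference segments.

    from collections import Counter

    def classify(query_vec, ref_vectors, ref_labels, n=3):
        """Assign the majority structure type of the n nearest references."""
        idx = most_similar(ref_vectors, query_vec, n=n)
        votes = Counter(ref_labels[i] for i in idx)
        return votes.most_common(1)[0][0]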

Structure type assignment to Query Vector

[Figure: a query vector is compared for similarity against every vector in the reference model, and the 3 most similar reference vectors are retrieved. Key: Helix, Strand, Coil.]

Majority voting over the 3 most similar reference vectors gives Coil; hence the structure type assigned to the query vector is Coil.

Choice of Vocabulary in Protein Sequences

• Amino acids; but amino acids are
  – not all distinct
  – their similarity is primarily due to chemical composition
• So,
  – represent protein segments in terms of "types" of amino acids
  – or in terms of "chemical composition"

Representation in terms of "types" of AA

• Classify based on electronic properties (a mapping sketch follows):
  – e- donors: D, E, A, P
  – weak e- donors: I, L, V
  – ambivalent: G, H, S, W
  – weak e- acceptors: T, M, F, Q, Y
  – e- acceptors: K, R, N
  – C (by itself, another group)
• Use chemical groups
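The grouping above as a lookup table, so that each amino-acid letter can be replaced by its type symbol before the matrix is built; the group names are illustrative.

    # Amino acid -> electronic-property type, from the grouping above.
    AA_TYPE = {}
    for group, letters in {
        "donor": "DEAP", "weak_donor": "ILV", "ambivalent": "GHSW",
        "weak_acceptor": "TMFQY", "acceptor": "KRN", "cysteine": "C",
    }.items():
        for aa in letters:
            AA_TYPE[aa] = group

    typed = [AA_TYPE[aa] for aa in "PKPPVKFN"]   # e.g. the first coil segment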


Representation using Chemical Groups

Results of Classification with "AA" as words

         |           Precision             |             Recall
         | Helix  Sheet  Coil  Micro  Macro | Helix  Sheet  Coil  Micro  Macro
AA Train | 97.8   56.7   91.4  82.7   81.9  | 99.6   87.6   65.9  77.5   84.3
AA Test  | 42.7   30.1   83.3  62     52    | 65.8   67.3   20    40.6   51

(Train: leave-one-out testing of reference vectors; Test: unseen query segments)

Results with "chemical groups" as words

         |           Precision             |             Recall
         | Helix  Sheet  Coil  Micro  Macro | Helix  Sheet  Coil  Micro  Macro
CW Train | 96.7   58.9   92.2  83.6   82.6  | 99.6   88.3   68.4  79     85.4
CW Test  | 60     50     90    67.2   66.7  | 60     60     80    65.9   66.4

• The VSM is built using both reference segments and test segments
  – structure labels of reference segments are known
  – structure labels of query segments are unknown

Modification to Word-Document Matrix

• Latent Semantic Analysis
• The word-document matrix is transformed by Singular Value Decomposition (a sketch follows)
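A minimal sketch of this transformation as a rank-k truncated SVD with numpy; the rank k used in the slides is not stated, so it is left as a parameter.

    import numpy as np

    def lsa_project(matrix, k):
        """Rank-k truncated SVD of the normalized word-document matrix."""
        U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
        # Documents projected into the k-dimensional latent space:
        return np.diag(s[:k]) @ Vt[:k, :]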

Results with "AA" as words, using LSA

         |           Precision             |             Recall
         | Helix  Sheet  Coil  Micro  Macro | Helix  Sheet  Coil  Micro  Macro
AA Train | 97     60     91.6  83.7   82.8  | 99.2   87     70.4  79.7   85.5
AA Test  | 40     50     80    63.6   66.1  | 70     50     80    63.3   66.8

Results with "types of AA" as words, using LSA

         |           Precision             |             Recall
         | Helix  Sheet  Coil  Micro  Macro | Helix  Sheet  Coil  Micro  Macro
AA Train | 82.7   53.3   75.6  70.6   70.6  | 96.2   81.4   23.5  67     67
AA Test  | 90     70     30    60.5   60.5  | 70     50     70    63.5   63.5

Results with "chemical groups" as words, using LSA

         |           Precision             |             Recall
         | Helix  Sheet  Coil  Micro  Macro | Helix  Sheet  Coil  Micro  Macro
CW Train | 99.6   66.2   82.7  82.6   80.9  | 99.6   89     54.2  81     80.9
CW Test  | 80     50     50    55.7   59.7  | 40     40     80    64.4   55.1

LSA results for Different Vocabularies

Amino acids (LSA):
         |           Precision             |             Recall
         | Helix  Sheet  Coil  Micro  Macro | Helix  Sheet  Coil  Micro  Macro
AA Train | 97     60     91.6  83.7   82.8  | 99.2   87     70.4  79.7   85.5
AA Test  | 40     50     80    63.6   66.1  | 70     50     80    63.3   66.8

Types of amino acid (LSA):
         |        Precision         |          Recall
         | Helix  Sheet  Coil  Micro | Helix  Sheet  Coil  Micro
AA Train | 82.7   53.3   75.6  70.6  | 96.2   81.4   23.5  67
AA Test  | 90     70     30    60.5  | 70     50     70    63.5

Chemical groups (LSA):
         |           Precision             |             Recall
         | Helix  Sheet  Coil  Micro  Macro | Helix  Sheet  Coil  Micro  Macro
CW Train | 99.6   66.2   82.7  82.6   80.9  | 99.6   89     54.2  81     80.9
CW Test  | 80     50     50    55.7   59.7  | 40     40     80    64.4   55.1

Model construction using all data

Word-document matrices are constructed using both reference and query documents together. This gives better models, both for normalization and for construction of the latent semantic model.

                    |             Precision              |               Recall
                    | Helix  Strand  Other  Micro  Macro | Helix  Strand  Other  Micro  Macro
VSM  AA Train       | 98.5   56.2    92.3   83.2   82.3  | 98.9   89.4    64.9   77.1   84.4
VSM  AA Test        | 61.3   39.6    78     64.8   59.6  | 48     65      59.6   58.9   57.5
LSA  CW Train       | 99.2   66.7    88.6   84.8   84.8  | 99.6   93.2    53     82     82
LSA  CW Test        | 50     45      83     67.1   59.4  | 83.6   50.4    60.6   62     64.8
VSM  CW Train       | 99.2   63.2    79.8   80.7   80.7  | 99.6   87.1    49.2   78.6   78.6
VSM  CW Test        | 49.8   45.2    81.5   66.2   58.8  | 84.2   41.6    61.7   60.3   62.5
LSA  AA-types Train | 85.2   55.2    78.4   72.9         | 96.2   83      28.8   69.3
LSA  AA-types Test  | 74.4   47      67.1   62.7         | 82.2   67.1    30.9   60.8
VSM  AA-types Train | 77.1   57      81.7   72           | 95.5   80.3    28.8   68.1
VSM  AA-types Test  | 72.5   48.4    77.4   66.1         | 84.9   71.7    27     61.1

(AA = amino acids; CW = chemical groups; AA-types = amino-acid types.)

Applications

• Complement other methods for protein structure prediction
  – segmentation approaches
• Classification of proteins as all-alpha, all-beta, alpha+beta or alpha/beta types
• Automatically assigning new proteins to SCOP families

References

1. Kabsch, W. and Sander, C., "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features", Biopolymers, 1983, 22(12): 2577-2637.
2. Dwyer, D.S., "Electronic properties of the amino acid side chains contribute to the structural preferences in protein folding", J Biomol Struct Dyn, 2001, 18(6): 881-892.
3. Bellegarda, J., "Exploiting Latent Semantic Information in Statistical Language Modeling", Proceedings of the IEEE, 2000, 88(8).

Thank you!

Use of SVD

• The representation of training and test segments is very similar to that in the VSM
• Structure type assignment goes through the same process, except that it is done with the LSA matrices

Classification of Query Document

• A query document is also represented as a vector
• It is normalized by corpus word counts
• Documents related to the query are identified by measuring the similarity of the document vectors to the query document vector
• The query document is assigned the same structure as that of the documents retrieved by the similarity measure
• Majority voting*

*k-nearest neighbour classification

Notes…

• Results described are per-segment
• The normalized word-document matrix does not preserve document lengths
  – hence "per residue" accuracies of structure assignments cannot be computed