
Algorithms for Molecular Biology (BioMed Central)

Research, Open Access

Grammatical-Restrained Hidden Conditional Random Fields for Bioinformatics Applications

Piero Fariselli*, Castrense Savojardo, Pier Luigi Martelli and Rita Casadio

Address: Biocomputing Group, University of Bologna, via Irnerio 42, 40126 Bologna, Italy

Email: Piero Fariselli* - [email protected]; Castrense Savojardo - [email protected]; Pier Luigi Martelli - [email protected]; Rita Casadio - [email protected]

* Corresponding author

Abstract

Background: Discriminative models are designed to naturally address classification tasks. However, some applications require the inclusion of grammar rules, and in these cases generative models, such as Hidden Markov Models (HMMs) and Stochastic Grammars, are routinely applied.

Results: We introduce Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs) as an extension of Hidden Conditional Random Fields (HCRFs). GRHCRFs, while preserving the discriminative character of HCRFs, can assign labels in agreement with the production rules of a defined grammar. The main novelty of GRHCRFs is the possibility of including prior knowledge of the problem in HCRFs by means of a defined grammar. Our current implementation allows regular grammar rules. We test our GRHCRF on a typical biosequence labeling problem: the prediction of the topology of Prokaryotic outer-membrane proteins.

Conclusion: We show that on a typical biosequence labeling problem the GRHCRF performs better than CRF models of the same complexity, indicating that GRHCRFs can be useful tools for biosequence analysis applications.

Availability: GRHCRF software is available under the GPLv3 licence at http://www.biocomp.unibo.it/~savojard/biocrf-0.9.tar.gz.

Published: 22 October 2009; Received: 12 June 2009; Accepted: 22 October 2009

Algorithms for Molecular Biology 2009, 4:13 doi:10.1186/1748-7188-4-13

This article is available from: http://www.almob.org/content/4/1/13

© 2009 Fariselli et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Sequence labeling is a general task addressed in many different scientific fields, including Bioinformatics and Computational Linguistics [1-3]. Recently, Conditional Random Fields (CRFs) have been introduced as a new promising framework to solve sequence labeling problems [4]. CRFs offer several advantages over Hidden Markov Models (HMMs), including the ability to relax the strong independence assumptions made in HMMs [4]. CRFs have been successfully applied in biosequence analysis and structural predictions [5-11]. However, several problems of sequence analysis can be successfully addressed only by designing a grammar that enforces meaningful results. For instance, in gene prediction tasks exons must be linked in such a way that the donor and acceptor junctions define regions whose length is a multiple of three (according to the genetic code), and in protein structure prediction, helical segments shorter than 4 residues should be considered meaningless, since this is the shortest allowed length for a protein helical motif [1,2]. In this kind of problem the training sets generally consist of pairs of observed and label sequences, and very often the number of the different labels representing the experimental evidence is small compared to the grammar requirements and to the length distribution of the segments for the different labels. A direct mapping of one label to one state then results in poor predictive performance, and HMMs trained for these applications routinely separate labels from state names. The separation of state names and labels makes it possible to model a huge number of concurring paths compatible with the grammar and with the experimental labels, without increasing the time and space computational complexity [1].

In analogy with the HMM approach, in this paper we develop a discriminative model that incorporates regular-grammar production rules, with the aim of integrating the different capabilities of generative and discriminative models. In order to model labels and states disjointly, the regular grammar has to be included in the structure of a Hidden Conditional Random Field (HCRF) [12-14]. Previously, McCallum et al. [13] introduced a special HCRF that exploits a specific automaton to align sequences.

The model introduced here, the Grammatical-Restrained Hidden Conditional Random Field (GRHCRF), separates the states from the labels and restricts the accepted predictions to those allowed by a predefined grammar. In this way, prior knowledge of the problem at hand that may not be captured directly from the training associations can be cast into the model, ensuring that only meaningful solutions are provided.

In principle, CRFs can directly model the same grammar as a GRHCRF. However, given the fully observable nature of CRFs [12], the observed sequences must be re-labelled to obtain a bijection between states and labels. This implies that one specific and unique state path must be selected for each observed sequence. On the contrary, GRHCRFs, which allow the separation between labels and states, can count an arbitrarily large number of different state paths corresponding to the same experimentally observed labels at the same time. In order to fully exploit this path degeneracy in the prediction phase, the decoding algorithm must take into account all possible paths, and the posterior-Viterbi algorithm (instead of the Viterbi) should be adopted [15].

In this paper we define the model as an extension of a HCRF, we provide the basic inference equations and we introduce a new decoding algorithm for CRF models. We then compare the new GRHCRF with CRFs of the same complexity on a Bioinformatics task whose solution must comply with a given grammar: the prediction of the topological models of Prokaryotic outer membrane proteins. We show that on this task the GRHCRF performance is higher than those achieved by CRF and HMM models of the same complexity.

Methods

In what follows, x is the random variable over the data sequences to be labeled, y is the random variable over the corresponding label sequences and s is the random variable over the hidden states. We use a superscript index when we deal with multiple sequences. The problem that we want to model is then described by the observed sequences x^(i), by the labels y^(i) and by the underlying grammar G, which is specified by its production rules with respect to the set of hidden states. Although it is possible to imagine more complex models, in what follows we restrict each state to have only one possible associated label. Thus we define a function that maps each hidden state to a given label as:

\[
\Lambda(s) = y
\]

The difference between the CRF and the GRHCRF (or HCRF) models can be seen in Figure 1, where their graphical structures are presented. GRHCRF and HCRF are indistinguishable from their graphical structure representation, since it depicts only the conditional dependence among the random variables. Since the number of states |{s}| is always greater than the number of possible labels |{y}|, the GRHCRFs (HCRFs) have more expressive power than the corresponding CRFs.

We further restrict our model to linear HCRFs, so that the computational complexity of the inference algorithms remains linear with respect to the sequence length. This choice implies that the embedded grammar will be regular. Our implementation and tests are based on first-order HCRFs with explicit transition functions t_k(s_{j-1}, s_j, x) and state functions g_k(s_j, x) unrolled over each sequence position j.

However, for the sake of clarity, in the following we use the compact notation:

\[
\sum_k \lambda_k f_k(s_{j-1}, s_j, \mathbf{x}) = \sum_l \lambda_l t_l(s_{j-1}, s_j, \mathbf{x}) + \sum_n \mu_n g_n(s_j, \mathbf{x})
\]

where f_k(s_{j-1}, s_j, x) can be either a transition feature function t_l(s_{j-1}, s_j, x) or a state feature function g_n(s_j, x). Following the usual notation [16], we extend the local functions to include the hidden states as:

\[
\psi_j(\mathbf{s}, \mathbf{y}, \mathbf{x}) = \exp\Big( \sum_k \lambda_k f_k(s_{j-1}, s_j, \mathbf{x}) \Big) \cdot \Gamma(s_{j-1}, s_j) \cdot \Omega(s_j, y_j) \tag{1}
\]


and we set the two constraints as:

\[
\Gamma(s, s') = \begin{cases} 1 & \text{if } (s, s') \in G \\ 0 & \text{otherwise} \end{cases}
\qquad
\Omega(s, y) = \begin{cases} 1 & \text{if } \Lambda(s) = y \\ 0 & \text{otherwise} \end{cases}
\]

With this choice the local function ψ_j(s, y, x) becomes zero whenever the labeling (Ω(s_j, y_j)) or the grammar production rules (Γ(s, s')) are not allowed; in turn, this sets the corresponding probabilities to zero. As in the case of the HCRF, for the whole sequence we define Ψ(s, y, x) = Π_j ψ_j(s, y, x), so that any state path violating a production rule of G, or disagreeing with the observed labeling, contributes zero to the sums below. The normalization factors (or partition functions) can be obtained by summing over all the possible sequences of hidden states (or latent variables):

\[
Z(\mathbf{y}, \mathbf{x}) = \sum_{\mathbf{s}} \Psi(\mathbf{s}, \mathbf{y}, \mathbf{x}) = \sum_{\mathbf{s}} \prod_j \psi_j(\mathbf{s}, \mathbf{y}, \mathbf{x})
\]

or by summing over all the possible sequences of labels and hidden states:

\[
Z(\mathbf{x}) = \sum_{\mathbf{y}} \sum_{\mathbf{s}} \Psi(\mathbf{s}, \mathbf{y}, \mathbf{x}) = \sum_{\mathbf{y}} Z(\mathbf{y}, \mathbf{x})
\]

Using the normalization factors, the joint probability of a label sequence y and a hidden state sequence s given an observation sequence x is:

\[
p(\mathbf{y}, \mathbf{s} \mid \mathbf{x}) = \frac{\Psi(\mathbf{s}, \mathbf{y}, \mathbf{x})}{Z(\mathbf{x})}
\]

The probability of a hidden state sequence given a label sequence and an observation sequence is:

\[
p(\mathbf{s} \mid \mathbf{y}, \mathbf{x}) = \frac{p(\mathbf{y}, \mathbf{s} \mid \mathbf{x})}{p(\mathbf{y} \mid \mathbf{x})} = \frac{\Psi(\mathbf{s}, \mathbf{y}, \mathbf{x})}{Z(\mathbf{y}, \mathbf{x})}
\]

Finally, the probability of a label sequence given an observation sequence is obtained by marginalizing the joint probability over the hidden states, p(y | x) = Σ_s p(y, s | x), which gives:

\[
p(\mathbf{y} \mid \mathbf{x}) = \frac{Z(\mathbf{y}, \mathbf{x})}{Z(\mathbf{x})}
\]

Parameter estimation

The model parameters Θ can be obtained by maximizing the log-likelihood of the data:

\[
L(\Theta) = \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \Theta)
= \sum_{i=1}^{N} \log \frac{Z(\mathbf{y}^{(i)}, \mathbf{x}^{(i)})}{Z(\mathbf{x}^{(i)})}
= \sum_{i=1}^{N} \log Z(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) - \sum_{i=1}^{N} \log Z(\mathbf{x}^{(i)})
\]

where the different sequences are assumed to be independent and identically distributed random variables.

Figure 1. Graphical structure of a linear CRF (left) and a linear GRHCRF/HCRF (right). (In the CRF the observation X is connected directly to the labels Y1 ... Yn; in the GRHCRF/HCRF the hidden states S1 ... Sn mediate between X and the labels Y1 ... Yn.)
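The mapping Λ and the indicator functions Γ and Ω translate directly into code. The following C++ fragment is a minimal sketch of these structural ingredients, assuming integer-coded states and labels; all names (GrammarModel, label_of, allowed) are illustrative and are not taken from the authors' biocrf implementation.

```cpp
#include <vector>

// Minimal sketch of the GRHCRF structural ingredients (illustrative names).
struct GrammarModel {
    int n_states;                             // |{s}|, including BEGIN and END
    std::vector<int> label_of;                // Lambda: maps each state to its label
    std::vector<std::vector<bool>> allowed;   // allowed[s'][s] = true iff (s', s) in G

    // Gamma(s', s): 1 if the transition s' -> s is a production rule of G.
    bool gamma(int s_prev, int s) const { return allowed[s_prev][s]; }

    // Omega(s, y): 1 if state s is compatible with observed label y,
    // i.e. Lambda(s) = y.
    bool omega(int s, int y) const { return label_of[s] == y; }
};
```

Because several states may share the same label, omega typically accepts many states per position, which is exactly the path degeneracy the model exploits.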

Taking the first derivative of the objective function with respect to the parameter λ_k we obtain:

\[
\frac{\partial L(\Theta)}{\partial \lambda_k} =
\underbrace{\sum_{i=1}^{N} \frac{\partial}{\partial \lambda_k} \log Z(\mathbf{y}^{(i)}, \mathbf{x}^{(i)})}_{C}
- \underbrace{\sum_{i=1}^{N} \frac{\partial}{\partial \lambda_k} \log Z(\mathbf{x}^{(i)})}_{F}
\]

where, in analogy with Boltzmann machines and HMMs for labelled sequences [17], C and F can be seen as clamped and free phases. After simple computations we can rewrite the derivative as:

\[
\frac{\partial L(\Theta)}{\partial \lambda_k} = E_{p(\mathbf{s} \mid \mathbf{y}, \mathbf{x})}[f_k] - E_{p(\mathbf{s}, \mathbf{y} \mid \mathbf{x})}[f_k]
\]

where E_{p(s|y,x)}[f_k] and E_{p(s,y|x)}[f_k] are the expected values of the feature function f_k computed in the clamped and free phases, respectively. Differently from the standard CRF, both expectations have to be computed using the forward and backward algorithms, and these algorithms must take the grammar restraints into consideration.

To avoid overfitting, we regularize the objective function using a Gaussian prior, so that the function to maximize takes the form:

\[
L(\Theta) = \sum_{i=1}^{N} \log Z(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) - \sum_{i=1}^{N} \log Z(\mathbf{x}^{(i)}) - \sum_k \frac{\lambda_k^2}{2\sigma^2}
\]

and the corresponding gradient is:

\[
\frac{\partial L(\Theta)}{\partial \lambda_k} = E_{p(\mathbf{s} \mid \mathbf{y}, \mathbf{x})}[f_k] - E_{p(\mathbf{s}, \mathbf{y} \mid \mathbf{x})}[f_k] - \frac{\lambda_k}{\sigma^2}
\]

Alternatively, the Expectation Maximization procedure can be adopted [16].
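For illustration, once the two expectation vectors are available, one regularized update reduces to a single vector operation. The sketch below assumes plain gradient ascent with a learning rate eta (hypothetical names); in practice quasi-Newton optimizers such as L-BFGS are commonly used for CRF training.

```cpp
#include <vector>

// One gradient-ascent step on the Gaussian-regularized log-likelihood.
// E_clamped[k] = E_{p(s|y,x)}[f_k], E_free[k] = E_{p(s,y|x)}[f_k].
void gradient_step(std::vector<double>& lambda,
                   const std::vector<double>& E_clamped,
                   const std::vector<double>& E_free,
                   double sigma2, double eta) {
    for (size_t k = 0; k < lambda.size(); ++k)
        lambda[k] += eta * (E_clamped[k] - E_free[k] - lambda[k] / sigma2);
}
```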

Computing the expectations

The partition functions and the expectations can be computed with dynamic programming by defining the so-called forward and backward algorithms [1,2,4]. For the clamped phase the forward algorithm is:

\[
\begin{aligned}
\alpha_0(\mathrm{BEGIN} \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) &= 1 \\
\alpha_0(s \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) &= 0, \quad \forall s \in S \setminus \{\mathrm{BEGIN}\} \\
\alpha_j(s \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) &= \sum_{s' \in S} \alpha_{j-1}(s' \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) \cdot M_C(s', s, \mathbf{x}^{(i)}, j)
\end{aligned}
\]

where the clamped-phase matrix M_C takes into account both the grammar constraint Γ(s', s) and the currently given labeling y^(i):

\[
M_C(s', s, \mathbf{x}^{(i)}, j) = \exp\Big( \sum_k \lambda_k f_k(s_{j-1}{=}s', s_j{=}s, \mathbf{x}^{(i)}) \Big) \cdot \Gamma(s', s) \cdot \Omega(s, y_j^{(i)})
\]

The forward algorithm for the free phase is computed as:

\[
\begin{aligned}
\alpha_0(\mathrm{BEGIN} \mid \mathbf{x}^{(i)}) &= 1 \\
\alpha_0(s \mid \mathbf{x}^{(i)}) &= 0, \quad \forall s \in S \setminus \{\mathrm{BEGIN}\} \\
\alpha_j(s \mid \mathbf{x}^{(i)}) &= \sum_{s' \in S} \alpha_{j-1}(s' \mid \mathbf{x}^{(i)}) \cdot M_F(s', s, \mathbf{x}^{(i)}, j)
\end{aligned}
\]

where the free-phase matrix M_F is defined as:

\[
M_F(s', s, \mathbf{x}^{(i)}, j) = \exp\Big( \sum_k \lambda_k f_k(s_{j-1}{=}s', s_j{=}s, \mathbf{x}^{(i)}) \Big) \cdot \Gamma(s', s)
\]

It should be noted that in the free phase the algorithm also has to take into account the grammar production rules Γ(s', s): only the paths that are in agreement with the grammar are counted. Analogously, the backward algorithm can be computed for the clamped phase as:

\[
\begin{aligned}
\beta_{L^{(i)}+1}(\mathrm{END} \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) &= 1 \\
\beta_{L^{(i)}+1}(s \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) &= 0, \quad \forall s \in S \setminus \{\mathrm{END}\} \\
\beta_j(s \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) &= \sum_{s' \in S} \beta_{j+1}(s' \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) \cdot M_C(s, s', \mathbf{x}^{(i)}, j+1)
\end{aligned}
\]

where L^(i) is the length of the i-th protein. For the free phase we have:

\[
\begin{aligned}
\beta_{L^{(i)}+1}(\mathrm{END} \mid \mathbf{x}^{(i)}) &= 1 \\
\beta_{L^{(i)}+1}(s \mid \mathbf{x}^{(i)}) &= 0, \quad \forall s \in S \setminus \{\mathrm{END}\} \\
\beta_j(s \mid \mathbf{x}^{(i)}) &= \sum_{s' \in S} \beta_{j+1}(s' \mid \mathbf{x}^{(i)}) \cdot M_F(s, s', \mathbf{x}^{(i)}, j+1)
\end{aligned}
\]
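A possible rendering of the free-phase forward recursion, reusing the GrammarModel sketch given earlier, is shown below. The function weights(s', s, j) stands for Σ_k λ_k f_k(s_{j-1}=s', s_j=s, x) and is assumed given; the clamped phase differs only in that each matrix element also carries the factor Ω(s, y_j). Names are illustrative, not the authors' code.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// Free-phase forward recursion over grammar-allowed paths, returning Z(x).
// A real implementation would work in log space or rescale alpha at each
// position to avoid numerical underflow on long sequences.
double forward_partition(const GrammarModel& gm,
                         const std::function<double(int, int, int)>& weights,
                         int L, int BEGIN, int END) {
    std::vector<double> alpha(gm.n_states, 0.0), next(gm.n_states, 0.0);
    alpha[BEGIN] = 1.0;                       // alpha_0(BEGIN | x) = 1
    for (int j = 1; j <= L + 1; ++j) {        // position L+1 enters END
        std::fill(next.begin(), next.end(), 0.0);
        for (int s = 0; s < gm.n_states; ++s)
            for (int sp = 0; sp < gm.n_states; ++sp)
                if (gm.gamma(sp, s))          // Gamma(s', s): grammar filter
                    next[s] += alpha[sp] * std::exp(weights(sp, s, j));
        alpha.swap(next);
    }
    return alpha[END];                        // Z(x) = alpha_{L+1}(END | x)
}
```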


The expectations of the feature functions, E_{p(s|y,x)}[f_k] and E_{p(s,y|x)}[f_k], are computed as:

\[
E_{p(\mathbf{s} \mid \mathbf{y}, \mathbf{x})}[f_k] = \sum_{i=1}^{N} \sum_{j=1}^{L^{(i)}+1} \sum_{s', s \in S} f_k(s_{j-1}{=}s', s_j{=}s, \mathbf{x}^{(i)})
\, \frac{\alpha_{j-1}(s' \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)}) \, M_C(s', s, \mathbf{x}^{(i)}, j) \, \beta_j(s \mid \mathbf{y}^{(i)}, \mathbf{x}^{(i)})}{Z(\mathbf{y}^{(i)}, \mathbf{x}^{(i)})}
\]

\[
E_{p(\mathbf{s}, \mathbf{y} \mid \mathbf{x})}[f_k] = \sum_{i=1}^{N} \sum_{j=1}^{L^{(i)}+1} \sum_{s', s \in S} f_k(s_{j-1}{=}s', s_j{=}s, \mathbf{x}^{(i)})
\, \frac{\alpha_{j-1}(s' \mid \mathbf{x}^{(i)}) \, M_F(s', s, \mathbf{x}^{(i)}, j) \, \beta_j(s \mid \mathbf{x}^{(i)})}{Z(\mathbf{x}^{(i)})}
\]

The partition functions can be computed using either the forward or the backward algorithm as:

\[
Z(\mathbf{y}, \mathbf{x}) = \alpha_{L+1}(\mathrm{END} \mid \mathbf{y}, \mathbf{x}) = \beta_0(\mathrm{BEGIN} \mid \mathbf{y}, \mathbf{x})
\qquad
Z(\mathbf{x}) = \alpha_{L+1}(\mathrm{END} \mid \mathbf{x}) = \beta_0(\mathrm{BEGIN} \mid \mathbf{x})
\]

where for simplicity we dropped the sequence superscript (i).

Decoding

Decoding is the task of assigning labels y to an unknown observation sequence x. The Viterbi algorithm is routinely applied for decoding CRFs, since it finds the most probable state path of an observation sequence given a CRF model [4]. The Viterbi algorithm is particularly effective when there is a single, highly probable path; when several paths compete (i.e., have similar probabilities), posterior decoding may perform significantly better. However, the state path selected by posterior decoding may not be allowed by the grammar. A simple solution to this problem is provided by the posterior-Viterbi decoding that was previously introduced for HMMs [15]. Posterior-Viterbi exploits the posterior probabilities and at the same time preserves the grammatical constraints. The algorithm consists of three steps:

• for each position j and state s ∈ S, compute the posterior probability p(s_j = s | x);

• find the allowed state path s* = argmax_s Π_j p(s_j | x), where the maximization runs only over the state paths allowed by the grammar;

• assign to x the label sequence y such that y_j = Λ(s*_j) for each position j.

The first step can be accomplished using the forward-backward algorithm as described for the free phase of parameter estimation. In order to find the best allowed state path, a Viterbi search is performed over the posterior probabilities. In what follows, ρ_j(s | x) is the score of the most probable allowed path of length j ending in state s, and π_j(s) is a traceback pointer. The algorithm can be described as follows:

1. Initialization:

\[
\rho_0(\mathrm{BEGIN} \mid \mathbf{x}) = 1 \qquad \rho_0(s \mid \mathbf{x}) = 0, \quad \forall s \in S \setminus \{\mathrm{BEGIN}\}
\]

2. Recursion:

\[
\rho_j(s \mid \mathbf{x}) = \max_{s'} \big[ \rho_{j-1}(s' \mid \mathbf{x}) \cdot \Gamma(s', s) \big] \cdot p(s_j{=}s \mid \mathbf{x})
\qquad
\pi_j(s) = \operatorname{argmax}_{s'} \, \rho_{j-1}(s' \mid \mathbf{x}) \cdot \Gamma(s', s)
\]

3. Termination and traceback:

\[
s^{*}_{n+1} = \mathrm{END} \qquad s^{*}_{0} = \mathrm{BEGIN} \qquad s^{*}_{j} = \pi_{j+1}(s^{*}_{j+1}) \quad \text{for } j = n, \ldots, 1
\]

The labels are assigned to the observed sequence according to the state path s*. It is also possible to consider a slightly modified version of the algorithm where, for each position, the posterior probability of the labels is considered, and all the states sharing the same label are assigned the same posterior probability. The rationale behind this is to consider the aggregate probability of all the state paths corresponding to the same sequence of labels, so as to improve the overall per-label accuracy. In many applications this variant of the algorithm may perform better.
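A sketch of the posterior-Viterbi search over precomputed posteriors, again reusing the GrammarModel sketch, might look as follows. It assumes that at least one grammar-allowed path exists and that posterior[j-1][s] holds p(s_j = s | x) from the free-phase forward-backward pass; all names are illustrative.

```cpp
#include <vector>

// Returns the grammar-allowed state path s*_1 .. s*_L maximizing the product
// of posterior probabilities; labels follow by mapping through Lambda.
std::vector<int> posterior_viterbi(const GrammarModel& gm,
                                   const std::vector<std::vector<double>>& posterior,
                                   int BEGIN, int END) {
    const int L = static_cast<int>(posterior.size());  // positions 1..L
    const int S = gm.n_states;
    std::vector<std::vector<double>> rho(L + 1, std::vector<double>(S, 0.0));
    std::vector<std::vector<int>> pi(L + 1, std::vector<int>(S, -1));
    rho[0][BEGIN] = 1.0;                       // rho_0(BEGIN | x) = 1
    for (int j = 1; j <= L; ++j)
        for (int s = 0; s < S; ++s) {
            double best = -1.0; int arg = -1;  // best grammar-allowed predecessor
            for (int sp = 0; sp < S; ++sp)
                if (gm.gamma(sp, s) && rho[j - 1][sp] > best) {
                    best = rho[j - 1][sp]; arg = sp;
                }
            if (arg >= 0) {
                rho[j][s] = best * posterior[j - 1][s];  // times p(s_j = s | x)
                pi[j][s] = arg;
            }
        }
    // Termination: pick the best state that may legally enter END; trace back.
    double best = -1.0; int last = -1;
    for (int sp = 0; sp < S; ++sp)
        if (gm.gamma(sp, END) && rho[L][sp] > best) { best = rho[L][sp]; last = sp; }
    std::vector<int> path(L);
    path[L - 1] = last;                        // s*_L (an allowed path is assumed)
    for (int j = L; j >= 2; --j)
        path[j - 2] = pi[j][path[j - 1]];      // s*_{j-1} = pi_j(s*_j)
    return path;
}
```

The final labels are then obtained position by position as y_j = Λ(s*_j), i.e. label_of[path[j-1]] in the sketch.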

Implementation

We implemented the GRHCRF as a linear HCRF in C++. Our GRHCRF can deal with sequences of symbols as well as with sequence profiles. A sequence profile of a protein p is a matrix X whose rows represent the sequence positions and whose columns are the 20 possible amino acids. Each element X[i][a] of the sequence profile represents the frequency of residue type a in the aligned position i. The profile positions are normalized so that Σ_a X[i][a] = 1 (for each i).

In order to take into account the information carried by the neighboring residues, we define a symmetric sliding window of length w centered on the i-th residue. With this choice the state feature functions are defined as:

\[
\sum_n \mu_n g_n(s_j, \mathbf{x}) = \sum_s \sum_{a \in A} \sum_{k=-\lfloor w/2 \rfloor}^{\lfloor w/2 \rfloor} \mu_{(s,a,k)} \, g_{(s,a,k)}(s_j, \mathbf{x}, j+k)
\]


where s runs over all the possible states, a runs over the different observed symbols of the alphabet A (in our case the 20 residue types) and k runs over the neighboring positions (from -⌊w/2⌋ to ⌊w/2⌋), so that i = j + k is the absolute window position.

When dealing with single sequences, the state functions are simply products of Kronecker deltas:

\[
g_{(s,a,k)}(s_j, \mathbf{x}, i) = \delta(s, s_j)\,\delta(a, x_i)
\]

while in the case of sequence profiles the state features are real-valued and assume the profile scores:

\[
g_{(s,a,k)}(s_j, \mathbf{x}, i) = \delta(s, s_j)\,X[i][a]
\]
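In code, the two flavours of state feature reduce to a pair of small functions. The sketch below assumes integer-coded residues and states, with the window offset already resolved to the absolute position i = j + k; the names are illustrative, not taken from the biocrf sources.

```cpp
#include <vector>

// Single-sequence feature: g_{(s,a,k)}(s_j, x, i) = delta(s, s_j) * delta(a, x_i).
double g_single(int s, int a, int s_j, const std::vector<int>& x, int i) {
    if (i < 0 || i >= static_cast<int>(x.size())) return 0.0;  // window off the chain
    return (s == s_j && a == x[i]) ? 1.0 : 0.0;
}

// Profile feature: g_{(s,a,k)}(s_j, x, i) = delta(s, s_j) * X[i][a],
// where X[i][a] is the normalized frequency of residue a at position i.
double g_profile(int s, int a, int s_j,
                 const std::vector<std::vector<double>>& X, int i) {
    if (i < 0 || i >= static_cast<int>(X.size())) return 0.0;
    return (s == s_j) ? X[i][a] : 0.0;
}
```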

Measures of performance

To evaluate the accuracy we define the classical label-based indices, such as:

\[
Q_2 = p / N
\]

where p and N are the total number of correct predictions and the total number of examples, respectively. The Matthews correlation coefficient C for a given class s is defined as:

\[
C(s) = \frac{p(s)\,n(s) - u(s)\,o(s)}{\sqrt{(p(s)+u(s))\,(p(s)+o(s))\,(n(s)+u(s))\,(n(s)+o(s))}}
\]

where p(s) and n(s) are respectively the true positive and true negative predictions for class s, while o(s) and u(s) are the numbers of false positives and false negatives with respect to that class. The sensitivity (coverage, Sn) for each class s is defined as:

\[
Sn(s) = \frac{p(s)}{p(s) + u(s)}
\]

The specificity (accuracy, Sp) is the probability of correct predictions and is defined as:

\[
Sp(s) = \frac{p(s)}{p(s) + o(s)}
\]

However, these measures cannot discriminate between similar and dissimilar segment distributions and do not provide any clue about the number of proteins that are correctly predicted. For this reason we introduce a protein-based index, the Protein OVerlap (POV) measure. We consider a protein prediction to be correct only if the number of predicted and observed transmembrane segments (in the structurally resolved proteins, see the Outer-membrane protein data set section) is the same and if all corresponding pairs have a minimum segment overlap. POV is a binary measure (0 or 1) and for a given protein sequence s it is defined as:

\[
POV(s) = \begin{cases} 1 & \text{if } N_p^s = N_o^s \text{ and } |p_i \cap o_i| \geq \theta, \ \forall i \in [1, N_o^s] \\ 0 & \text{otherwise} \end{cases}
\]

where N_p^s and N_o^s are the numbers of predicted and observed segments, while p_i and o_i are the i-th predicted and observed segments, respectively. The threshold θ is defined as the mean of the half lengths of the two segments:

\[
\theta = \left( L_p/2 + L_o/2 \right) / 2
\]

where L_p (= |p_i|) and L_o (= |o_i|) are the lengths of the predicted and observed segments, respectively. For a set of proteins, the average of all the POVs over the total number of proteins N is:

\[
\overline{POV} = \frac{1}{N} \sum_{s=1}^{N} POV(s)
\]

To evaluate the average standard deviation of our predictions, we performed a bootstrapping procedure with 100 runs over 60% of the predicted data sets.
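The per-protein POV index is straightforward to compute once predicted and observed segments are available as residue intervals. The following sketch (illustrative names; segments assumed sorted along the sequence so that corresponding pairs share the same index) returns the binary value defined above.

```cpp
#include <algorithm>
#include <vector>

struct Segment { int beg, end; };             // residue interval [beg, end]

// POV: 1 if the numbers of predicted and observed transmembrane segments match
// and every corresponding pair overlaps by at least theta = (Lp/2 + Lo/2)/2.
int pov(const std::vector<Segment>& pred, const std::vector<Segment>& obs) {
    if (pred.size() != obs.size()) return 0;
    for (size_t i = 0; i < pred.size(); ++i) {
        const int Lp = pred[i].end - pred[i].beg + 1;
        const int Lo = obs[i].end - obs[i].beg + 1;
        const double theta = (Lp / 2.0 + Lo / 2.0) / 2.0;
        const int overlap = std::min(pred[i].end, obs[i].end) -
                            std::max(pred[i].beg, obs[i].beg) + 1;
        if (overlap < theta) return 0;        // |p_i  intersect  o_i| below threshold
    }
    return 1;
}
```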

Results and Discussion

Problem definition

The prediction of the topology of the outer membrane proteins of Prokaryotic organisms is a challenging task that has been addressed several times, given its biological relevance [18-20]. The problem can be defined as follows: given a protein sequence that is known to be inserted in the outer membrane of a Prokaryotic cell, we want to predict the number and the location, with respect to the membrane plane, of the membrane-spanning segments. From experimental results we know that the outer membrane of Prokaryotes imposes some constraints on the topological models, such as:

• both the C- and the N-terminus of the protein chain lie on the periplasmic side of the cell (inside), which implies that the number of spanning segments is even;

• membrane-spanning segments have a minimal segment length (≥ 3 residues);

• the transmembrane-segment lengths are distributed according to a probability density distribution that can be experimentally determined and must be taken into account.

For the reasons listed above, the best performing predictors described in the literature are based on HMMs; among them, the best performing single method in the task of topology prediction is HMM-B2TMR [18] (see Table 1 in [20]).

Outer-membrane protein data set

The training set consists of 38 high-resolution experimentally determined outer-membrane proteins of Prokaryotes, in which the sequence identity between each pair is less than 40%. We then generated 19 subsets for the cross-validation experiments, such that there is no sequence identity greater than 25% and no functional similarity between two elements belonging to disjoint sets. The annotation consists of three different labels that correspond to: inner loop (i), outer loop (o) and transmembrane (t). This assignment was obtained using the DSSP program [21] by selecting the β-strands that span the outer membrane. The dataset with the annotations and the cross-validation sets is available with the program at http://www.biocomp.unibo.it/~savojard/biocrf-0.9.tar.gz.

For each protein in the dataset, a profile based on a multiple sequence alignment was created using the PSI-BLAST program on the non-redundant dataset of sequences (uniref90, as described at http://www.uniprot.org/help/uniref). PSI-BLAST runs were performed using a fixed number of cycles set to 3 and an e-value of 0.001.

Prediction of the topology of Prokaryotic outer membrane proteins

The topology of outer-membrane proteins in Prokaryotes can be described by assigning each residue to one of three types: inner loop (i), transmembrane β-strand (t) and outer loop (o). These three types are defined according to the experimental evidence and are the terminal symbols of the grammar. The chemico-physical and geometrical characteristics of the three types of segments, as deduced from the structures available in the PDB, suggest how to build a grammar (or the corresponding automaton) for the prediction of the topology. We performed our experiments using the automaton depicted in Figure 2, which was previously introduced to model our HMM-B2TMR [18] (this automaton is substantially similar to all the other HMMs used for this task [19,20]). It is essentially based on three different types of states. The states of the automaton are the non-terminal symbols of the regular grammar and the arrows represent the allowed transitions (or production rules). The states represented with squares describe the transmembrane strands, while the states shown with circles represent the loops (Figure 2). A statistics computed on the non-redundant database of outer membrane proteins presently available indicates that the length of the strands in the training set ranges from 3 to 22 residues (with an average length of 12 residues). In Prokaryotic outer membrane proteins the inner loops are generally shorter than the outer loops. Furthermore, both the N-terminus and the C-terminus of all the proteins lie on the inner side of the membrane [18]. These constraints are modelled by means of the allowed transitions between the states.

The automaton described in Figure 2 assigns labels to observed sequences that can be obtained using different state paths. This ambiguity leads to an ensemble of paths that must be taken into account during the likelihood maximization, by summing up all the possible trajectories compliant with the experimentally assigned labels (see the Methods section).
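To illustrate how such constraints become production rules, the fragment below (a hypothetical, simplified excerpt, not the authors' actual automaton) chains three transmembrane states so that no path can leave the membrane before the minimal strand length of 3 residues is reached. Since all three states map to the label t through Λ, many state paths remain compatible with a single labeling, which is exactly the ambiguity discussed above.

```cpp
#include <vector>

// Illustrative fragment: enforce a minimum beta-strand length of 3 residues by
// chaining dedicated transmembrane states t1 -> t2 -> t3 with no earlier exit.
// A self-loop on t3 allows strands up to the observed maximum length.
void add_min_strand_length(std::vector<std::vector<bool>>& allowed,
                           int t1, int t2, int t3, int inner, int outer) {
    allowed[t1][t2] = true;      // 1st -> 2nd strand residue
    allowed[t2][t3] = true;      // 2nd -> 3rd strand residue
    allowed[t3][t3] = true;      // extend the strand beyond 3 residues
    allowed[t3][inner] = true;   // leave the membrane toward the periplasmic side
    allowed[t3][outer] = true;   // or toward the outer side
}
```

A full automaton would additionally distinguish inner from outer loops and handle the two crossing directions; the point here is only how length constraints are expressed as transitions.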

Table 1: Prediction of the topology of the Prokaryotic outer membrane proteins.

| Method | POV | Q2 | C(t) | Sn(t) | Sp(t) |
|---|---|---|---|---|---|
| CRF-1 (Vit) | 0.26 ± 0.05 | 0.72 ± 0.01 | 0.47 ± 0.02 | 0.59 ± 0.01 | 0.80 ± 0.01 |
| CRF-1 (Pvit) | 0.39 ± 0.05 | 0.77 ± 0.01 | 0.54 ± 0.02 | 0.71 ± 0.01 | 0.80 ± 0.01 |
| CRF-2 (Vit) | 0.34 ± 0.05 | 0.76 ± 0.01 | 0.52 ± 0.03 | 0.63 ± 0.02 | 0.82 ± 0.02 |
| CRF-2 (Pvit) | 0.47 ± 0.05 | 0.80 ± 0.01 | 0.60 ± 0.03 | 0.74 ± 0.02 | 0.82 ± 0.02 |
| CRF-3 (Vit) | 0.29 ± 0.04 | 0.72 ± 0.01 | 0.45 ± 0.02 | 0.60 ± 0.02 | 0.79 ± 0.01 |
| CRF-3 (Pvit) | 0.45 ± 0.04 | 0.76 ± 0.01 | 0.52 ± 0.02 | 0.70 ± 0.02 | 0.79 ± 0.01 |
| GRHCRF | 0.66 ± 0.04 | 0.85 ± 0.01 | 0.70 ± 0.03 | 0.83 ± 0.01 | 0.84 ± 0.01 |
| HMM-B2TMR | 0.58 ± 0.04 | 0.80 ± 0.01 | 0.62 ± 0.02 | 0.82 ± 0.02 | 0.83 ± 0.01 |

C(t), Sn(t) and Sp(t) are reported for the transmembrane segments (t). Vit = Viterbi decoding; Pvit = posterior-Viterbi decoding. For GRHCRF and HMM-B2TMR we used the posterior-Viterbi decoding. Models are detailed in the text. Scoring indices are described in the Measures of performance section.


However, this ambiguity does not permit the adoption of the automaton of Figure 2 for CRF learning, since training a CRF requires a bijective mapping between states and labels. On the contrary, with the automaton of Figure 2 several different state paths (in theory a factorial number) can be obtained that are in agreement with both the automaton and the experimental labels.

For this reason, and for the sake of comparison, we designed three other automata (Figure 3a, b and c) that have the same number of states but are non-ambiguous in terms of state mapping. Starting from the experimentally derived labels, three different sets of re-labelled sequences can then be derived to train the CRFs (here referred to as CRF1, CRF2 and CRF3).

All the compared methods take sequence profiles as input and are benchmarked as shown in Table 1. In the case of the non-ambiguous automata of the CRFs, we tested both the Viterbi and the posterior-Viterbi algorithms, since, given the Viterbi-like learning of the CRFs, it is not predictable a priori which of the two decodings performs better on this particular task. From Table 1 it is clear that assigning the labels according to the posterior-Viterbi always leads to better performance than with the Viterbi (see CRF in Table 1). This indicates that also in other tasks where CRFs are applied, the posterior-Viterbi algorithm described here can increase the overall decoding accuracy. Furthermore, the fact that both HMM-B2TMR and GRHCRF perform better than the other methods implies that in tasks where the observed labels may hide a more complex structure, as in the case of the prediction of the topology of Prokaryotic outer membrane proteins, it is advantageous to exploit the ambiguity by taking multiple concurring paths into consideration at the same time, both during training and decoding (see the Methods section). Considering that the underlying grammar is the same, the discriminative GRHCRF outperforms the generative model (HMM-B2TMR). This indicates that the GRHCRF can substitute the HMM-based models when label prediction is the major issue. In order to assess the confidence level of our results, we computed pairwise t-tests between the GRHCRF and the other methods. From the t-test results reported in Table 2, it is evident that the measures of performance shown in Table 1 can be considered significant with a confidence level greater than 80% (see the most relevant index, POV).

Conclusion

In this paper we presented a new class of conditional random fields that assigns labels in agreement with the production rules of a defined regular grammar. The main novelty of the GRHCRF is thus the introduction of an explicit regular grammar that encodes the prior knowledge of the problem at hand, eliminating the need to relabel the observed sequences. The GRHCRF predictions satisfy the grammar production rules by construction, so that only meaningful solutions are provided. In [13], an automaton was included to restrain the solutions of an HCRF; however, in that case it was hard-coded in the model in order to train a finite-state string edit distance. On the contrary, GRHCRFs are designed to provide solutions in agreement with defined regular grammars that are provided as a further input to the model. To the best of our knowledge, this is the first time that this is described. In principle the grammar may be very complex; however, to maintain the tractability of the inference algorithms, we restrict our implementation to regular grammars.

Figure 2. Automaton structure designed for the prediction of the topology of the outer-membrane proteins in Prokaryotes with GRHCRFs and HMMs. (The automaton connects Begin and End through inner side (i), transmembrane (t) and outer side (o) states.)


Figure 3. Three different non-ambiguous automata derived from the one depicted in Figure 2. These automata are designed to have a bijective mapping between the states and the labels (after the corresponding re-labeling of the sequences). In the text they are referred to as CRF1 (a), CRF2 (b) and CRF3 (c). (Each panel connects Begin and End through inner side (i), transmembrane (t) and outer side (o) states.)


Extensions to context-free grammars can be designed by modifying the inference algorithms at the expense of the computational complexity of the final models. Since the Grammatical-Restrained HCRF can be seen as an extension of the linear HCRF [13,14], the GRHCRF is also related to models that deal with latent variables, such as Dynamic CRFs [22].

In this paper we also tested the GRHCRFs on a real biological problem that requires grammatical constraints: the prediction of the topology of Prokaryotic outer-membrane proteins. On this biosequence analysis problem we show that GRHCRFs perform similarly to or better than the corresponding CRFs and HMMs, indicating that GRHCRFs can be profitably applied when a discriminative problem requires grammatical constraints.

Finally, we also presented the posterior-Viterbi decoding algorithm for CRFs, which was previously designed for HMMs and which can be of general interest and application, since in many cases posterior-Viterbi can perform significantly better than the classical Viterbi algorithm.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PF and CS formalized the GRHCRF model. CS wrote the GRHCRF code. CS and PF performed the experiments. PF, PLM and RC defined the problem and provided the data. CS, PF, PLM and RC authored the manuscript.

Acknowledgements

We thank MIUR for the PNR 2003 project (FIRB art. 8) termed LIBI, Laboratorio Internazionale di BioInformatica, delivered to R. Casadio. This work was also supported by the Biosapiens Network of Excellence project (a grant of the European Union's VI Framework Programme).

References

1. Durbin R: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, reprint edition; 1999.

2. Baldi P, Brunak S: Bioinformatics: The Machine Learning Approach. 2nd edition. MIT Press; 2001.

3. Manning C, Schütze H: Foundations of Statistical Natural Language Processing. MIT Press; 1999.

4. Lafferty J, McCallum A, Pereira F: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML 2001:282-289.

5. Liu Y, Carbonell J, Weigele P, Gopalakrishnan V: Protein fold recognition using segmentation conditional random fields (SCRFs). Journal of Computational Biology 2006, 13(2):394-406.

6. Sato K, Sakakibara Y: RNA secondary structural alignment with conditional random fields. Bioinformatics 2005, 21(2):237-242.

7. Wang L, Sauer UH: OnD-CRF: predicting order and disorder in proteins using conditional random fields. Bioinformatics 2008, 24(11):1401-1402.

8. Li CT, Yuan Y, Wilson R: An unsupervised conditional random fields approach for clustering gene expression time series. Bioinformatics 2008, 24(21):2467-2473.

9. Li MH, Lin L, Wang XL, Liu T: Protein-protein interaction site prediction based on conditional random fields. Bioinformatics 2007, 23(5):597-604.

10. Dang TH, Van Leemput K, Verschoren A, Laukens K: Prediction of kinase-specific phosphorylation sites using conditional random fields. Bioinformatics 2008, 24(24):2857-2864.

11. Xia X, Zhang S, Su Y, Sun Z: MICAlign: a sequence-to-structure alignment tool integrating multiple sources of information in conditional random fields. Bioinformatics 2009, 25(11):1433-1434.

12. Wang S, Quattoni A, Morency L, Demirdjian D: Hidden Conditional Random Fields for Gesture Recognition. CVPR 2006, II:1521-1527.

13. McCallum A, Bellare K, Pereira F: A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI-05). Arlington, Virginia: AUAI Press; 2005.

14. Quattoni A, Collins M, Darrell T: Conditional Random Fields for Object Recognition. In Advances in Neural Information Processing Systems 17. Edited by Saul LK, Weiss Y, Bottou L. Cambridge, MA: MIT Press; 2005:1097-1104.

15. Fariselli P, Martelli PL, Casadio R: A new decoding algorithm for hidden Markov models improves the prediction of the topology of all-beta membrane proteins. BMC Bioinformatics 2005, 6(Suppl 4):S12.

16. Sutton C, McCallum A: An Introduction to Conditional Random Fields for Relational Learning. MIT Press; 2006.

17. Krogh A: Hidden Markov Models for Labeled Sequences. In Proceedings of the 12th IAPR International Conference on Pattern Recognition (ICPR'94). IEEE Computer Society Press; 1994:140-144.

18. Martelli PL, Fariselli P, Krogh A, Casadio R: A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 2002, 18(Suppl 1):S46-S53.

19. Bigelow H, Petrey D, Liu J, Przybylski D, Rost B: Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res 2004, 32:2566-2577.

20. Bagos P, Liakopoulos T, Hamodrakas S: Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics 2005, 6:7.

21. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577-2637.

22. Sutton C, McCallum A, Rohanimanesh K: Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. J Mach Learn Res 2007, 8:693-723.

Table 2: Confidence level of the results reported in Table 1.

| Methods | POV | Q2 | C(t) | Sn(t) | Sp(t) |
|---|---|---|---|---|---|
| GRHCRF vs CRF-1 | 98.0% | 99.5% | 99.5% | 99.8% | 99.5% |
| GRHCRF vs CRF-2 | 96.0% | 99.5% | 99.5% | 99.5% | 99.5% |
| GRHCRF vs CRF-3 | 96.0% | 99.5% | 99.5% | 99.5% | 99.5% |
| GRHCRF vs HMM-B2TMR | 80.0% | 96.0% | 99.0% | 98.0% | 99.5% |

The confidence level on the significance of the differences was computed with a t-test. For all methods we consider the best results of Table 1.
