
A Bayesian Model for Recursive Recovery of Syntactic Dependencies

Virginia Savova ([email protected])
Department of Cognitive Science, 3300 N. Charles St
Johns Hopkins University, Baltimore, MD 21218 USA

Leonid Peshkin ([email protected])
Department of Systems Biology, 200 Longwood Ave

Harvard University, Boston, MA 02115 USA

Abstract

Bayesian inference on graphical models can account for a variety of psychological data in non-linguistic domains. Recent proposals have touted its biological plausibility, which raises the question of to what extent it may capture the learning and use of grammar. We propose a way to structure the parsing task in order to make it amenable to local classification methods. This allows us to build a Dynamic Bayesian Network which uncovers the syntactic dependency structure of English sentences. Experiments with a small test corpus demonstrate that the model successfully learns from labeled data. We discuss what this approach may tell us about the way syntax may be encoded in the brain and about the modularity of the language faculty.

Keywords

Parsing, Finite-state models, Bayesian networks, Biological plausibility, Dependency Grammars, Grammar induction, Cognitive modeling.

Introduction

Bayesian graphical models have become an important explanatory strategy in cognitive science ([Knill and Richards, 1996], [Kording and Wolpert, 2004], [Stocker and Simoncelli, 2005]). Recent work strongly supports their biological plausibility in general and that of dynamic Bayesian models in particular [Rao, 2005]. Dynamic models are geared towards prediction and classification of sequences. As such, they are naturally suitable for language modeling and have already been applied to tasks like speech recognition [Livescu et al., 2003] and part-of-speech tagging [Peshkin et al., 2003]. However, grammar learning and parsing with such models generally appear out of reach because of their Markovian character.

Markov models restrict possible dependencies to a bounded, local context. At one extreme, the context is confined to the symbol occupying the current position in the sequence (order-0 or unigram models). In more relaxed versions, the context may include a fixed number of positions before the current symbol (order-k), typically no more than two (trigram models). The restricted space of possible dependencies allows transition probabilities to be inferred from the data and stored in a look-up table with relatively little technical sophistication.

Not surprisingly, however, the restricted space of representable dependencies is also the main disadvantage of Markov models in syntax-related tasks like parsing. Syntactic dependencies in natural language are unboundedly non-local, in the sense that no fixed amount of context is guaranteed to contain the members of a given constituent. For example, consider the sentences in examples (1-3). In the first sentence, the subject king and the verb bought are adjacent to one another. Thus, the dependency between them would be captured by a bigram (order-1) model. However, the same model would be unable to represent the dependency in the second example, because the subject and verb are separated by two words. To capture this dependency, we need a 2nd-order Markov model (trigram). Similarly, the 2nd-order model would prove inadequate for the third example, where the subject and verb are separated by four words.

(1) The king bought a camel.

(2) The king of Prussia bought a camel.

(3) The king of some strange country bought a camel.

Our solution to this problem relies on representing sentences with non-local dependencies like (2, 3) as derived from their local-dependency variants, akin to (1). This intuition is based on the formal notion that a string with non-local dependencies is obtained from a dependency tree via a recursive linearization procedure. The string obtained at each step of the linearization procedure contains new local dependencies, which push apart local dependencies from previous levels. This way of conceptualizing the linearization of syntactic structure allows us to use a Dynamic Bayesian Network despite its Markov properties. We construct a DBN parser which decides only on local attachments. We then call the parser recursively to uncover the underlying dependency tree. Our results show that the model captures grammatical knowledge for all levels of the derivation. The biological plausibility and remarkable compactness of the learned representation may suggest that parsing in the brain is accomplished in a similar manner.

Dependency grammar

Tree-based linguistic representations of natural language syntax treat non-local dependencies as local in the two-dimensional tree structure, of which the string is a one-dimensional projection. The dependency grammar representation of (1) captures the dependency between the subject, the object and the verb, and the dependency between the determiners and their respective nouns (Figure 1).

[Figure 1: Dependency structure of example (2). The root bought heads king and camel; the and of depend on king, Prussia on of, and a on camel.]

More formally, a dependency grammar consists of a lexicon of terminal symbols (words), and an inventory of dependency relations specifying inter-lexical requirements. A string is generated by a dependency grammar if and only if:

• Every word but one (ROOT) is dependent on another word.

• No word is dependent on itself either directly or indirectly.

• No word is dependent on more than one word.

• Dependencies do not cross.

In a dependency tree, each word is the mother of its dependents, otherwise known as their head. To linearize the dependency tree in Figure 1 into a string, we introduce the dependents recursively next to their heads:

Step I: bought
Step II: king bought camel
Step III: The king of bought a camel
Step IV: The king of Prussia bought a camel.
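To make the procedure concrete, here is a minimal sketch of this recursive linearization for the tree in Figure 1. The hard-coded tree, the function name linearize, and the left/right side annotations are our own illustrative choices, not part of the paper's implementation.

```python
# Sketch of the linearization procedure: start from the root and, at each
# step, insert every newly added word's dependents directly next to it.

def linearize(root, tree):
    """Yield the string at each linearization step.

    `tree` maps a head to its dependents as (word, side) pairs in surface order.
    """
    level, frontier = [root], {root}
    yield " ".join(level)
    while frontier:
        new_level, new_frontier = [], set()
        for word in level:
            deps = tree.get(word, []) if word in frontier else []
            lefts = [d for d, side in deps if side == "left"]
            rights = [d for d, side in deps if side == "right"]
            new_level += lefts + [word] + rights
            new_frontier.update(lefts + rights)
        level, frontier = new_level, new_frontier
        if frontier:
            yield " ".join(level)

# Dependency tree of example (2), as in Figure 1
TREE = {
    "bought": [("king", "left"), ("camel", "right")],
    "king": [("The", "left"), ("of", "right")],
    "of": [("Prussia", "right")],
    "camel": [("a", "left")],
}

for i, step in enumerate(linearize("bought", TREE), start=1):
    print(f"Step {i}: {step}")
# Step 1: bought
# Step 2: king bought camel
# Step 3: The king of bought a camel
# Step 4: The king of Prussia bought a camel
```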

Recursive parsing as local classification

Parsing in the dependency grammar framework is the task of uncovering the dependency tree given the sentence. Suppose that instead of searching for a complete parse given a complete sentence, we restricted our task to compressing the string up the linearization path. Note that linearization is essentially dependency parsing in reverse. In other words, we can uncover the dependency structure by labeling the local head-dependent relationships at the bottom linearization level (i.e. the sentence) and erasing from the string the words whose heads are already found. We recursively process the output until the root level. Thus, if as a first step in parsing (2) we pick the head of Prussia to be the preposition of, we can compress the string to a form virtually equivalent to linearization Step III. Picking king as the head of the preposition leads us to compress the string further, to the equivalent of Step II. To compress the string, we must simply identify which words in the string occupy a position adjacent to their heads.

The attractive feature of this representation is that the parsing decisions taken at each step are local. Hence, parsing can be converted into a local classification task. The task is to choose the best sequence of labels denoting local dependency relationships (links). At each position, we choose between setting the link to left, right, or none, where left/right means the word is dependent on its left/right neighbor. none means the search for this word's head should be postponed until later stages of compression. The output of the classifier is a labeled string, which can be compressed by removing linked dependents. It is fed through recursively, until the string is compressed to the ROOT.
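A minimal sketch of this parse-and-compress loop is given below. It assumes some local classifier label_links(tokens) that returns one of "left", "right", or "none" per position (for example, the DBN described in the next section); the function names are ours.

```python
# Sketch of recursive parsing as local classification: label each position,
# remove words whose head is adjacent, and repeat on the compressed string.

def compress(tokens, labels):
    """Remove words linked to an adjacent head; return the rest plus new arcs.

    Assumes the classifier never labels the first word "left" or the last
    word "right".
    """
    kept, arcs = [], []
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab == "left":
            arcs.append((tokens[i - 1], tok))   # (head, dependent)
        elif lab == "right":
            arcs.append((tokens[i + 1], tok))
        else:                                   # "none": head not adjacent yet
            kept.append(tok)
    return kept, arcs

def parse(tokens, label_links):
    """Recursively label and compress until only the ROOT remains.

    Termination relies on at least one label per pass being left/right,
    which the control variable of the DBN (described later) enforces.
    """
    arcs = []
    while len(tokens) > 1:
        labels = label_links(tokens)            # purely local decisions
        tokens, new_arcs = compress(tokens, labels)
        arcs += new_arcs
    return arcs                                 # the recovered dependency arcs
```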

The Dynamic Bayesian Network classifier

The first step towards building the classifier is coming up with a feature representation. We will briefly motivate the choice of feature set with linguistic arguments. It is easy to determine that the linking pattern of a word depends on its part of speech (PoS) and the part of speech of its neighbor. For example, English determiners only link to the right, and adverbs link almost exclusively to verbs. However, the parts of speech alone are not sufficient to determine linking behavior. In some cases, the identity of the adjacent word is required: bought accepts links from nouns to the right, while slept does not.

Another decisive factor is how many dependents the current word has acquired so far. Since once the current word is linked it will become unavailable as a future linking target to other words, it is important to ascertain that its valency has already been satisfied. Valency refers to the minimal number of dependents a word actively seeks to license. In English and other SVO languages, a word has particular requirements with respect to the number of left and right dependents. Thus, in our feature representation, valency is indirectly captured by two variables, which reflect the number of dependents that have already been linked to the current word from either side: left and right composite (comp). The comp variables affect not only the linking behavior of the current token, but that of its neighbor as well. If the word has already received many dependents from one side, the probability of accepting yet another one becomes smaller, since its valency is already satisfied.

Finally, the current label depends on the labels of its neighbors, because if the previous label is right, then the current label cannot be left, and if the next label is left, the current label cannot be right. Thus, our full feature representation consists of the word and its PoS tag, the words and PoS tags of its neighbors, the two valencies of the current word, the right valency of its left neighbor and the left valency of its right neighbor, as well as the neighboring links.

The Word and Next Word feature vocabularies contain the 2500 most frequent words in the data. An additional value was allocated for all remaining out-of-vocabulary words. The PoS and Next PoS vocabularies contain 36 of the original 45 Penn Treebank tags, after all punctuation PoS tags were removed. The left and right comp features had three values: none, one and many.
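As a concrete illustration, the sketch below assembles this feature representation for one string position. The function and field names are ours; the neighboring link labels are not extracted here because they are hidden variables handled by the DBN itself.

```python
# Sketch of the per-position feature representation: word and PoS of the
# current token and its neighbours, plus the three-valued comp counters.

def comp_value(n):
    """Collapse a dependent count into the three comp values used in the paper."""
    return "none" if n == 0 else "one" if n == 1 else "many"

def features(i, words, tags, left_count, right_count, word_id):
    """Feature dict for position i; word_id maps rare words to an OOV value."""
    last = len(words) - 1
    return {
        "word": word_id(words[i]),
        "pos": tags[i],
        "prev_word": word_id(words[i - 1]) if i > 0 else None,
        "prev_pos": tags[i - 1] if i > 0 else None,
        "next_word": word_id(words[i + 1]) if i < last else None,
        "next_pos": tags[i + 1] if i < last else None,
        "left_comp": comp_value(left_count[i]),     # dependents linked from the left
        "right_comp": comp_value(right_count[i]),   # dependents linked from the right
        "prev_right_comp": comp_value(right_count[i - 1]) if i > 0 else None,
        "next_left_comp": comp_value(left_count[i + 1]) if i < last else None,
    }
```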

This feature representation is used as the basis of the Dynamic Bayesian Network (DBN). After we briefly introduce the essential aspects of DBNs, we will expand on the structure of the network for parsing. For more information on the general class of models, we refer the reader to a recent dissertation [Murphy, 2002] for an excellent survey.

General notes on DBNs

A DBN is a Bayesian network unwrapped in "time" (i.e. over a sequence), such that it can represent dependencies between variables at adjacent positions. More formally, a DBN consists of two models $B_0$ and $B_+$, where $B_0$ defines the initial distribution over the variables at position 0 by specifying:

• a set of variables $X_1, \ldots, X_n$;

• a directed acyclic graph over the variables;

• for each variable $X_i$, a table specifying the conditional probability of $X_i$ given its parents in the graph, $\Pr(X_i \mid \mathrm{Par}\{X_i\})$.

The joint probability distribution over the initial state is:

$$\Pr(X_1, \ldots, X_n) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{Par}\{X_i\}).$$

The transition model $B_+$ specifies the conditional probability distribution (CPD) over the state at time $t$ given the state at time $t-1$. $B_+$ consists of:

• a directed acyclic graph over the variables $X_1, \ldots, X_n$ and their predecessors $X_1^-, \ldots, X_n^-$, which are the roots of this graph;

• conditional probability tables $\Pr(X_i \mid \mathrm{Par}\{X_i\})$ for all $X_i$ (but not $X_i^-$).

The transition probability distribution is:

$$\Pr(X_1, \ldots, X_n \mid X_1^-, \ldots, X_n^-) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{Par}\{X_i\}).$$

Together, $B_0$ and $B_+$ define a probability distribution over the realizations of a system through time, which justifies calling these BNs "dynamic". In our setting, the word's index in a sentence corresponds to time, while realizations of the system correspond to correctly tagged English sentences. Probabilistic reasoning about such a system constitutes inference.
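For concreteness, the two-slice factorization above can be sketched as follows. The data layout (CPTs as dicts keyed by tuples of parent values, with each parent tagged as living in the current or the previous slice) is our own illustrative choice, not that of any particular toolkit.

```python
# Toy sketch of the two-slice DBN factorization: the log-probability of a
# sequence is the B0 term for the first slice plus one B+ term per transition.
import math

def slice_log_prob(model, cur, prev=None):
    """Sum of log Pr(X_i | Par{X_i}) over the variables of one slice.

    `model` maps each variable to (parents, table), where a parent is a pair
    ("cur", name) or ("prev", name) and `table` maps parent-value tuples to
    distributions over the variable's values.
    """
    total = 0.0
    for var, (parents, table) in model.items():
        key = tuple((prev if where == "prev" else cur)[name] for where, name in parents)
        total += math.log(table[key][cur[var]])
    return total

def sequence_log_prob(slices, b0, bplus):
    """log Pr of a full assignment to every slice under the pair (B0, B+)."""
    logp = slice_log_prob(b0, slices[0])
    for prev, cur in zip(slices, slices[1:]):
        logp += slice_log_prob(bplus, cur, prev)
    return logp
```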

Standard inference algorithms for DBNs are similar to those for HMMs. Note that, while the kind of DBN we consider could be converted into an equivalent HMM, that would render the inference intractable due to a huge resulting state space. In a DBN, some of the variables will typically be observed, while others will be hidden. The typical inference task is to determine the probability distribution over the states of a hidden variable over time, given time-series data of the observed variables. This is usually accomplished using the forward-backward algorithm. Alternatively, we might obtain the most likely sequence of hidden variables using the Viterbi algorithm. These two kinds of inference yield the resulting link tags.

[Figure 2: The parsing DBN. Each slice (slice K, slice K+1) contains the variables Word, PoS, LeftComp, RightComp, Link, and Control.]

Learning the parameters of a DBN from data is generally accomplished using the EM algorithm. However, in our model, learning is equivalent to collecting statistics over co-occurrences of feature values and link labels. This is implemented in gawk scripts and takes minutes on a large corpus. While in large DBNs exact inference algorithms are intractable and are replaced by a variety of approximate methods, the number of hidden state variables in our model is small enough to allow exact algorithms to work. For the inference we use the standard algorithms, as implemented in a recently released toolkit [Bilmes and Zweig, 2002].
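The counting step can be sketched as below; the paper's implementation uses gawk scripts, so this Python version and its names are purely illustrative.

```python
# Sketch of parameter estimation by counting co-occurrences: each CPT
# Pr(child | parents) is a maximum-likelihood table built from the labeled
# positions of the training sentences at every linearization level.
from collections import Counter, defaultdict

def estimate_cpt(rows, child, parents):
    """Build Pr(child | parents) from co-occurrence counts.

    `rows` is an iterable of dicts giving every variable's value at one
    position (word, pos, link, comp counters, ...).
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[p] for p in parents)][row[child]] += 1
    return {key: {value: n / sum(c.values()) for value, n in c.items()}
            for key, c in counts.items()}

# e.g. the table for the PoS node, whose parent within a slice is Link:
# pos_cpt = estimate_cpt(training_rows, child="pos", parents=("link",))
```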

Structure of the DBN parser

Each slice of our DBN parser is a representation of the joint probability distribution of word, pos, left/right comp, and the hidden variable link (Figure 2). In our model, the link determines the value of all variables and they are independent of one another. Of course, this is not truly the case, but among those variables link is the only unobserved one, hence modeling all other dependencies is inconsequential. In addition to the intra-slice dependencies, we model dependencies between the current, previous and next position. The link variable influences all aforementioned variables in neighboring slices. Finally, we introduce a control variable which deterministically ensures that at least one link in the sequence will be set to something other than none. This forces the parser to truly compress the string at each recursive parsing step.
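Ignoring the inter-slice edges and the control variable, the slice-internal part of this factorization behaves like a naive-Bayes classifier over the link values, which the following simplified sketch illustrates (CPTs in the format of the counting sketch above; names are ours):

```python
# Simplified, slice-local sketch: within a slice, Link is the only hidden
# variable and each observed variable is conditionally independent given it.
import math

LINK_VALUES = ("left", "right", "none")

def link_posterior(obs, prior, cpts):
    """Pr(link | observed slice variables), normalized over the three values.

    obs   : dict of observed values, e.g. {"word": ..., "pos": ..., "left_comp": ...}
    prior : dict mapping each link value to Pr(link)
    cpts  : dict mapping each observed variable to its table Pr(variable | link)
    """
    log_scores = {}
    for link in LINK_VALUES:
        logp = math.log(prior[link])
        for var, value in obs.items():
            logp += math.log(cpts[var][(link,)].get(value, 1e-9))  # floor for unseen pairs
        log_scores[link] = logp
    z = math.log(sum(math.exp(s) for s in log_scores.values()))
    return {link: math.exp(s - z) for link, s in log_scores.items()}
```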

Experiments and results

For the results presented here we used the WSJ10 corpus [Klein and Manning, 2004]. It is a subset of the WSJ Penn Treebank [Marcus et al., 1993], consisting of all sentences shorter than eleven words, with punctuation removed (the dot in our figures stands for an abstract ROOT symbol). Eliminating the punctuation was done to simulate parsing in the oral modality. The dependency annotation was obtained through automatic conversion of the original treebank annotation. The relatively short sentences make this corpus a good approximation to casual speech and limit the effects of misattachments due to the conversion. Note that the parser is in principle capable of handling longer sentences.

Encoding

The corpus was encoded in our feature representation as follows. For each sentence, a number of feature files were produced containing the feature representation of the sentence at each linearization level. The encoding of an actual sentence-structure pair from our corpus (Figure 3) is illustrated in Figures 4 to 7.

[Figure 3: Dependency structure of the sentence "her immediate predecessor suffered a nervous breakdown ."]

At the lowest level, no word has any discovered dependents, hence the comp values are zero everywhere. All links of words whose heads are not adjacent are labeled none (0).

[Figure 4: First layer representation, listing the Word, PoS, Link, LeftComp and RightComp values for every position.]

At the next level, words whose labels were left or right are removed from the structure and the comp counters for their head are incremented.

[Figure 5: Second layer representation.]

The same procedure produces the subsequent levels (Figures 6 and 7).

[Figure 6: Third layer representation.]

[Figure 7: Top layer representation.]

Testing

The corpus was split randomly 9:1 into training and testing sections. In training mode, the DBN was given all levels with the correct labels. It was trained directly on the annotations, with no additional smoothing. The result achieved was 79% correct link attachment for directed dependencies, and 82% for undirected. We compare the results to the two baselines given for this corpus by [Klein and Manning, 2004] in Table 1.

Table 1: DBN results against baseline.

Model                 Accuracy (Dir)   Accuracy (Undir)
DBN                        79               82
Random                     30               46
Adjacent heuristic         34               57

More detailed results for our model are shown in Table 2. The results unequivocally surpass the random baseline and the best available heuristic, which amounts to linking every word to its right neighbor. This suggests our model has learned at least some of the non-trivial dependencies which govern the choice of link structure. The minimal difference between the vocabulary and out-of-vocabulary scores implies that the network can recover the syntactic properties of an unknown word in context. The fact that the root accuracy is higher than the non-root accuracy allows us to conclude that the network correctly learns to postpone decisions about the root word in all cases, and about its dependent in most cases.

Table 2: Detailed results for the DBN.

Measure               Accuracy
Root dependency           83
Non-root dependency       78
Out-of-Vocabulary         75
Sentence                  36


Discussion

Our results show that combining a DBN model with recursive application is a reasonable parsing strategy. This opens the door to the hypothesis that Bayesian inference is a possible mechanism for parsing in the brain, despite the Markovian properties of the corresponding dynamic models. The high ROOT accuracy suggests that the model has captured some fundamental principles defining the local dependency structure at all levels of the derivation. We take this result as evidence that graphical models with Markov properties are capable of handling unbounded non-local dependencies through recursive calls on their own output. The implications of this finding transcend Bayesian graphical models and speak to the general issue of how relevant other biologically plausible Markov models can be to language processing and learning. For example, Elman networks have been criticized for their a priori limitation in handling unbounded dependencies [Frank et al., 2005]. It is possible that such models may be adapted to discover locality in the hierarchical structure through recursive application.

One exciting implication of this hypothesis is the domain-generality of Bayesian inference and learning mechanisms. Previous work has proposed that these mechanisms are involved in visual perception ([Knill and Richards, 1996], [Kersten and Yuille, 2003]), motor control [Kording and Wolpert, 2004], and attention modulation [Yu and Dayan, 2005]. [Kersten and Yuille, 2003] propose a Bayesian graphical model of object detection which relies on estimating hidden variables such as relative depth and 3-D structure from the observables they influence (shadow displacement, 2-D projection). [Kording and Wolpert, 2004] suggest that subjects in a sensory-motor experiment internally represent both the statistical distribution of the task and their sensory uncertainty, combining them in a manner consistent with a performance-optimizing Bayesian process. In our work, the hidden links are estimated from the observable words and PoS tags, along with a prior label distribution.

The parallelism in the proposed cognitive strategies for all these different modalities may shed light on the issue of whether and how modular the language faculty is. The modularity hypothesis states that the cognitive mechanisms underlying linguistic competence are specific to language. If Bayesian inference proves to be a plausible uniting principle behind visual, motor and linguistic abilities, this hypothesis is seriously undermined. At the same time, it is important to note that the generality of the mechanism does not necessarily negate the modularity of language completely. The feature representation which our model used already encodes language-specific knowledge. Further research is needed to determine whether the feature representation and the structure of the network can be induced through structure-learning algorithms.

Our approach is particularly appealing in light of recent work suggesting that Bayesian-type inference is biologically plausible. [Rao, 2005] shows that recurrent networks of noisy integrate-and-fire neurons can perform approximate Bayesian inference for dynamic and hierarchical graphical models. According to him, the membrane potential dynamics of neurons corresponds to approximate belief propagation in the log domain, and the spiking probability of a neuron approximates the posterior probability of the preferred state encoded by the neuron, given past inputs. This seems to suggest that our parsing model can be implemented in a neural circuit. Furthermore, since the same DBN is used to uncover local dependencies throughout all levels of the derivation, such an implementation would address Humboldt's characterization of language as a system that makes "infinite use of finite means" at the neurophysiological level. The same neural apparatus could be used to recursively uncover the dependency structure of a sentence level by level.

Another implication of our work is that the nature of the processing architecture may constrain the kind of grammar human languages permit. If indeed parsing is accomplished through recursive processing of the output of previous stages, some types of long-distance dependencies would be impossible to detect. In particular, if the material intervening between a head-dependent pair (H, D) is not a constituent whose own head depends on either H or D, our model would not be able to uncover it, because H and D will not be adjacent at any point in the derivation. In other words, this parser is incapable of handling strictly context-sensitive languages. To the extent that such dependencies exist, they are fairly limited [Shieber, 1985]. Such cases will need to be resolved through some reordering in pre-processing, possibly based on case marking.

Future work

One deficiency of our model is that decisions at lower levels cannot be reversed in the interest of more optimal choices at higher levels. There are, however, important reasons why this might be necessary. For example, a prepositional phrase subcategorized for by the verb may be mistakenly attached to a preceding noun phrase, leaving the verb with a missing dependent (4).

(4) The king put *[the camel in the trunk].

In the future, we hope to address this problem through a form of beam search: retaining the k-best parses at each level and choosing among them based on what happens at the next level.
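A minimal sketch of this beam-search extension follows. It assumes a hypothetical scorer k_best_labelings(tokens, k) returning up to k (labels, log-score) pairs per level, and reuses the compress() helper sketched earlier; all names are ours.

```python
# Sketch of beam search across linearization levels: keep the k best partial
# parses instead of committing to a single labeling at each level.
import heapq

def beam_parse(tokens, k_best_labelings, k=5):
    beam = [(0.0, tokens, [])]              # (cumulative log-score, string, arcs)
    finished = []
    while beam:
        candidates = []
        for score, toks, arcs in beam:
            if len(toks) <= 1:              # fully compressed to the ROOT
                finished.append((score, arcs))
                continue
            for labels, step_score in k_best_labelings(toks, k):
                new_toks, new_arcs = compress(toks, labels)
                candidates.append((score + step_score, new_toks, arcs + new_arcs))
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return max(finished, key=lambda f: f[0])[1]   # arcs of the best complete parse
```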

Another important issue that we need to address is the total loss of information about the dependents that have been linked to a word at previous levels. Some well-known cases pose a problem for this aspect of our model. For example, the sentences in (5) and (6) are structurally distinct solely because the complement of the prepositional phrase in the second sentence is an instrument appropriate for seeing.

(5) The king saw [the camel with two humps].

(6) The king saw *[the camel with a telescope].


In our current model, once the complement is linked to the preposition, the two sentences will become identical, and one of them will be assigned the wrong structure. This concern can be addressed by introducing new variables which keep track not only of the number of linked dependents but also of their semantic category (e.g. instrument, animate, etc.).

A natural way to extend our model in a different direction is to combine it with the Bayesian PoS tagger developed in [Peshkin et al., 2003]. Allowing the model to infer PoS tags and structure simultaneously will be a significantly better approximation to the parsing task humans are faced with. Last but not least, we would like to implement semi-supervised learning. One way to do this would involve starting off with a small labeled set of sentences at all parsing depths, followed by presenting unparsed whole sentences. The parses suggested by the model would in turn be used for learning in a bootstrap fashion.

Conclusion

In our closing remarks, we would like to emphasize several aspects of our parsing model which make it interesting from the perspective of cognitive science. First, it belongs to a class of models which have been used recently to capture cognitive mechanisms in non-linguistic domains. Second, it naturally utilizes the overwhelming "disguised locality" of natural language syntax; in other words, it benefits from the fact that string-non-local dependencies are tree-local. Third, it is biologically plausible, because it has been shown to be implementable in a neural circuit. And finally, it takes seriously the question of how a finite amount of brain hardware is capable of encoding structures of unbounded depth. While there is much room for improvement, we believe these qualities make it an important step on the difficult road toward understanding how the mind emerges from the brain.

References

[Bilmes and Zweig, 2002] Bilmes, J. and Zweig, G. (2002). The graphical models toolkit: An open source software system for speech and time-series processing. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, June 2002, Orlando, Florida.

[Frank et al., 2005] Frank, R., Mathis, D., and Badecker, W. (2005). The acquisition of anaphora by simple recurrent networks. Unpublished manuscript.

[Kersten and Yuille, 2003] Kersten, D. and Yuille, A. (2003). Bayesian models of object perception. Cognitive Neuroscience, 13:1–9.

[Klein and Manning, 2004] Klein, D. and Manning, C. (2004). Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the ACL.

[Knill and Richards, 1996] Knill, D. C. and Richards, W., editors (1996). Perception as Bayesian Inference. Cambridge University Press.

[Kording and Wolpert, 2004] Kording, K. and Wolpert, D. (2004). Bayesian integration in sensorimotor learning. Nature, 427:244–247.

[Livescu et al., 2003] Livescu, K., Glass, J., and Bilmes, J. (2003). Hidden feature models for speech recognition using dynamic Bayesian networks. In 8th European Conference on Speech Communication and Technology (Eurospeech).

[Marcus et al., 1993] Marcus, M., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19.

[Murphy, 2002] Murphy, K. (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California at Berkeley.

[Peshkin et al., 2003] Peshkin, L., Pfeffer, A., and Savova, V. (2003). Bayesian nets for syntactic categorization of novel words. In Proceedings of the NAACL.

[Rao, 2005] Rao, R. P. N. (2005). Hierarchical Bayesian inference in networks of spiking neurons. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA.

[Shieber, 1985] Shieber, S. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333–343.

[Stocker and Simoncelli, 2005] Stocker, A. and Simoncelli, E. (2005). Constraining a Bayesian model of human visual speed perception. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA.

[Yu and Dayan, 2005] Yu, A. J. and Dayan, P. (2005). Inference, attention, and decision in a Bayesian neural architecture. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA.
