TRANSCRIPT
Noisy-channel theory of sentence comprehension
Emily Morgan
LSA 2019 Summer Institute
UC Davis
Rational analysis
• Background assumption: the cognitive agent is optimized via evolution and learning to solve everyday tasks effectively
1. Specify a formal model of the problem to be solved and the agent's goals
   A. Make as few assumptions about computational limitations as possible.
2. Derive optimal behavior given the problem and goals
3. Compare optimal behavior to agent behavior
4. If predictions are off, revise assumptions, and iterate
(Anderson, 1990, 1991)
Rational analysis: Sentence processing
1. Specify a formal model of the problem to be solved and the agent's goals
   Given a sentence, recover a probability distribution over trees
   A. Make as few assumptions about computational limitations as possible.
   Did not assume any memory limitations.
2. Derive optimal behavior given the problem and goals
   Derived surprisal theory
3. Compare optimal behavior to agent behavior
   Correctly predicted many reading time results
4. If predictions are off, revise assumptions, and iterate
   But let's look at a case where the predictions are off…
(Anderson, 1990, 1991)
An incremental inference puzzle for surprisal
• Try to understand this sentence:
  (a) The coach smiled at the player tossed the frisbee.
  …and contrast this with:
  (b) The coach smiled at the player thrown the frisbee.
  (c) The coach smiled at the player who was thrown the frisbee.
  (d) The coach smiled at the player who was tossed the frisbee.
• Readers boggle at "tossed" in (a), but not in (b-d)
(Tabor et al., 2004, JML)
[Figure: reading times by region, showing an RT spike at "tossed" in (a)]
Why is tossed/thrown interesting?
• In classic garden paths, you are led astray by an initially plausible but ultimately incorrect analysis
• In local coherence effects (LCEs), you have already seen the correct main verb smiled, so the main-verb interpretation of tossed "should not" be plausible
• But we get led astray anyway!
• It appears that the parser is failing to make rational/optimal use of its previous input
Rational analysis: Sentence processing
1. Specify a formal model of the problem to be solved and the agent's goals
   Given a sentence, recover a probability distribution over trees
   A. Make as few assumptions about computational limitations as possible.
   Did not assume any memory limitations.
2. Derive optimal behavior given the problem and goals
   Derived surprisal theory
3. Compare optimal behavior to agent behavior
   Correctly predicted many reading time results
4. If predictions are off, revise assumptions, and iterate
   But let's look at a case where the predictions are off…
(Anderson, 1990, 1991)
Uncertain input in language comprehension
• Previous models of sentence processing made a simplifying assumption:
  • Input is clean and perfectly formed
  • No uncertainty about the input is admitted
• Intuitively this seems patently wrong…
  • We sometimes misread things
  • We can also proofread
Uncertain input in language comprehension
• Uncertain input/noisy-channel hypothesis: comprehenders account for possible noise in the input
• This leads to questions:
1. What behavioral evidence do we have for an uncertain-input/noisy-channel theory of sentence comprehension?
2. What might a model of sentence comprehension under uncertain input look like?
3. What further predictions might such a model make?
Uncertain input in language comprehension
• How could uncertain input explain local coherence effects?
• Consider the sentences:
  1. The coach smiled at the player tossed a frisbee.
  2. The coach smiled as the player tossed a frisbee.
  3. The coach smiled and the player tossed a frisbee.
• The comprehender might think it's more likely that the word at is wrong than that the speaker really meant #1.
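This comparison can be sketched as a toy Bayesian computation. All probabilities below are invented for illustration; only the qualitative pattern matters, namely that a grammatical near-neighbor can outweigh the literal percept.

```python
# Toy noisy-channel comparison: which intended word best explains perceiving
# "at" in "The coach smiled at the player tossed a frisbee"?
# P(intended | perceived) ∝ P(intended) * P(perceived | intended)

# Hypothetical prior probabilities of each intended continuation, given that
# "tossed" follows as a main verb (invented values):
prior = {"at": 0.001,   # "...at the player tossed..." needs a rare reduced relative
         "as": 0.45,    # "...as the player tossed..." is fully grammatical
         "and": 0.40}   # "...and the player tossed..." is fully grammatical

# Hypothetical noise model: probability of *perceiving* "at" given each
# intended word (short function words are somewhat confusable):
likelihood = {"at": 0.95, "as": 0.02, "and": 0.01}

unnorm = {w: prior[w] * likelihood[w] for w in prior}
Z = sum(unnorm.values())
posterior = {w: unnorm[w] / Z for w in unnorm}
```

Under these invented numbers the posterior favors the speaker having meant "as", even though "at" was perceived: the grammatical prior overwhelms the small perceptual-error probability.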
Experimental design
• In a free-reading eye-tracking study, Levy et al. (2009) crossed at/toward with tossed/thrown:
  The coach smiled at the player tossed the frisbee
  The coach smiled at the player thrown the frisbee
  The coach smiled toward the player tossed the frisbee
  The coach smiled toward the player thrown the frisbee
• Prediction: an interaction between preposition & ambiguity in some subset of:
  • Early-measure (first-pass) RTs at the critical region tossed/thrown
  • First-pass regressions out of the critical region
  • Go-past time for the critical region
  • Regressions into at/toward
Experimental results
[Figure: first-pass RT, regressions out, go-past RT, go-past regressions, and comprehension accuracy by condition]
The coach smiled at the player tossed…??
Today's questions
1. What behavioral evidence do we have for an uncertain-input/noisy-channel theory of sentence comprehension?
   • Local coherence effects
2. How can we model sentence comprehension under uncertain input?
3. What further predictions might such a model make?
Standard probabilistic sentence processing
• Notation: T = tree, w = word sequence, I = noisy input
• Standard probabilistic sentence processing: given a sentence w, infer a distribution over trees, P(T|w)

A noisy-channel model
• If (as experimenters) we know the true sentence w* presented to the participant, but not the perceived input I, what does the participant believe the intended sequence w is?
  P(w|w*) ∝ P_C(w) × Q(w, w*),  where  Q(w, w*) = Σ_I P_T(I|w*) P_C(I|w)
• Here P_T is the true noise model, P_C(I|w) is the comprehender's noise model, P_C(w) is the comprehender's prior probability of w, and Q(w, w*) is a similarity function ("kernel")
(Levy, 2008, EMNLP)
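The posterior over intended sentences can be sketched by summing over possible noisy inputs. Everything here, the candidate strings, the two-element input space, and every probability, is invented for illustration; only the structure P(w|w*) ∝ P_C(w) · Σ_I P_T(I|w*) P_C(I|w) follows the model described above.

```python
# Sketch of the noisy-channel posterior over intended strings w, given the
# true sentence w* shown to the participant.

candidates = ["as the player", "at the player"]      # hypothesized intended strings w
prior = {"as the player": 0.6, "at the player": 0.4} # P_C(w), invented

# P_T(I | w*): true noise model over perceptual inputs I (invented; "a_" marks
# a degraded percept whose second letter was not clearly seen)
p_true = {"a_ the player": 0.3, "at the player": 0.7}

# P_C(I | w): comprehender's model of how intended w yields input I (invented)
p_comp = {
    ("a_ the player", "as the player"): 0.30,
    ("a_ the player", "at the player"): 0.30,
    ("at the player", "as the player"): 0.05,
    ("at the player", "at the player"): 0.65,
}

def posterior(candidates, prior, p_true, p_comp):
    unnorm = {}
    for w in candidates:
        kernel = sum(p_true[I] * p_comp[(I, w)] for I in p_true)  # Q(w, w*)
        unnorm[w] = prior[w] * kernel
    Z = sum(unnorm.values())
    return {w: v / Z for w, v in unnorm.items()}

post = posterior(candidates, prior, p_true, p_comp)
```

With these particular numbers the faithful reading "at the player" dominates; shifting the prior or the noise model shifts the posterior accordingly.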
Representing noisy input
• How can we represent the type of noisy input generated by a word sequence?
• Finite-state automata (FSAs)
  • A type of grammar that generates strings
  • Equivalently, it accepts/rejects strings
  • The FSA on the slide accepts a, ab, abb, abbb, abbbb, etc.
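The FSA described above can be written out directly. The state names and transition table are a minimal sketch of an automaton accepting a, ab, abb, …:

```python
# Minimal FSA matching the slide's example: it accepts "a" followed by any
# number of "b"s. State 1 is the single accepting state.
FSA = {
    "start": 0,
    "accept": {1},
    "delta": {(0, "a"): 1,   # the first symbol must be "a"
              (1, "b"): 1},  # then zero or more "b"s (self-loop)
}

def accepts(fsa, string):
    state = fsa["start"]
    for sym in string:
        if (state, sym) not in fsa["delta"]:
            return False          # no transition: the string is rejected
        state = fsa["delta"][(state, sym)]
    return state in fsa["accept"]
```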
Weighted/probabilistic FSAs (pFSAs)
• Every transition has a probability associated with it
  • Here represented as a log probability, i.e. surprisal (each arc on the slide is labeled with its input symbol and log-probability)
• The total surprisal of a string is the sum of the transition surprisals (plus the surprisal of the final state, if there are multiple final states)
• Equivalently, the total probability is the product of the transition probabilities
(Mohri, 1997)
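A minimal sketch of pFSA string scoring, under invented transition probabilities, showing the sum-of-surprisals / product-of-probabilities equivalence:

```python
import math

# Each transition maps (state, symbol) to (next state, probability).
# All states, symbols, and probabilities are invented for illustration.
delta = {
    (0, "a"): (1, 0.9),
    (0, "b"): (1, 0.1),
    (1, "a"): (2, 0.4),
    (1, "b"): (2, 0.6),
}
final = {2: 1.0}   # probability of stopping in each final state

def string_surprisal(string):
    state, total = 0, 0.0
    for sym in string:
        state, p = delta[(state, sym)]
        total += -math.log2(p)                  # sum of transition surprisals...
    return total + -math.log2(final[state])     # ...plus the final-state surprisal

def string_prob(string):
    return 2 ** -string_surprisal(string)       # equivalently, a product of probabilities
```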
Combining grammar & uncertain input
• Bayes' Rule says that the evidence and the prior should be combined (multiplied)
• For probabilistic grammars, this combination is the formal operation of weighted intersection:
  grammar + input = BELIEF
• Grammar affects beliefs about the future
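The weighted-intersection idea can be illustrated in miniature as pointwise multiplication of two distributions followed by renormalization. The vocabulary and all probabilities are invented:

```python
# Combining the grammar's expectation about word 1 (the prior) with the noisy
# percept's evidence, by pointwise multiplication and renormalization.
grammar_prior = {"b": 0.7, "c": 0.2, "f": 0.1}   # what the grammar expects (invented)
input_evidence = {"b": 0.3, "c": 0.6, "f": 0.1}  # what the percept suggests (invented)

belief = {w: grammar_prior[w] * input_evidence[w] for w in grammar_prior}
Z = sum(belief.values())
belief = {w: p / Z for p_w, p in belief.items() for w in [p_w]}
```

Full weighted intersection operates over automata rather than single symbols, but the arithmetic at each position is exactly this multiply-and-renormalize step.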
Revising beliefs about the past
• When we're uncertain about the future, grammar + partial input can affect beliefs about what will happen
• With uncertainty about the past, grammar + future input can affect beliefs about what has already happened
[Figure: distributions over the input ({b,c} {?} vs. {b,c} {f,e}) under the grammar alone, after word 1, and after words 1 + 2]
Flexibility of pFSAs
• Probabilistic FSAs can also let us represent inputs of variable length
  • ε-transitions allow for the possibility of generating fewer than two input symbols
  • Loops allow for the possibility of generating more than two input symbols
• The pFSA on the slide gives probability to infinitely many strings, but the most likely are {a,b}{a,b}
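A sketch of such a variable-length pFSA: two "slots" that can each be skipped via an ε-transition, followed by a loop that can emit extra symbols. The topology and all probabilities are invented, chosen so that two-symbol strings come out most probable while every length gets some probability:

```python
# Transitions: state -> list of (symbol, next state, probability);
# symbol None marks an ε-transition (emits nothing).
trans = {
    0: [("a", 1, 0.4), ("b", 1, 0.4), (None, 1, 0.2)],  # slot 1, or skip it
    1: [("a", 2, 0.4), ("b", 2, 0.4), (None, 2, 0.2)],  # slot 2, or skip it
    2: [("a", 2, 0.05), ("b", 2, 0.05)],                # loop: extra symbols
}
final = {2: 0.9}   # probability of stopping once in state 2

def string_prob(string, state=0):
    """Total probability of generating `string` from `state`, summed over all
    derivations (the recursion handles both ε-transitions and the loop)."""
    total = final.get(state, 0.0) if not string else 0.0
    for sym, nxt, p in trans.get(state, []):
        if sym is None:
            total += p * string_prob(string, nxt)
        elif string and string[0] == sym:
            total += p * string_prob(string[1:], nxt)
    return total
```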
The noisy-channel model (FINAL)
• For Q(w, w*): a weighted FSA based on Levenshtein distance between words (KLD)
• Result of the KLD applied to w* = a cat sat:
  Cost(a cat sat) = 0
  Cost(sat a sat cat) = 822
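The Levenshtein distance underlying the kernel can be computed with the standard dynamic program. This sketch counts unweighted word-level edits, whereas the actual KLD weights edits probabilistically inside a weighted FSA:

```python
# Word-level Levenshtein distance: the minimum number of insertions,
# deletions, and substitutions needed to turn one word sequence into another.
def levenshtein(seq1, seq2):
    m, n = len(seq1), len(seq2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                    # delete everything in seq1
    for j in range(n + 1):
        d[0][j] = j                    # insert everything in seq2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if seq1[i - 1] == seq2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]
```

For example, `levenshtein("a cat sat".split(), "a cat sat".split())` is 0, mirroring the zero-cost identity case above.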
Incremental inference under uncertain input
• Near-neighbors make the "incorrect" analysis "correct":
  The coach smiled at the player tossed the frisbee
  (near-neighbor edits shown in the figure: at → as? or and?; who? or that? inserted before tossed)
• Any of these changes makes tossed a main verb!
• Hypothesis: the boggle at "tossed" arises because the comprehender wonders whether she might have seen one of these near-neighbors
The core of the intuition
• Grammar & input come together to determine two possible "paths" through the partial sentence:
  the coach smiled… → at (likely) → …the player… → tossed/thrown
  the coach smiled… → as/and (unlikely) → …the player… → tossed/thrown
  (line thickness ≈ probability)
• tossed is more likely to happen along the bottom path
  • This creates a large shift in belief in the tossed condition
• thrown is very unlikely to happen along the bottom path
  • As a result, there is no corresponding shift in belief
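The two-path intuition can be quantified with a toy calculation. All numbers are invented; the point is only the asymmetry between tossed and thrown:

```python
# Belief in the top ("at") vs. bottom ("as"/"and") path before the critical
# word, and the updated belief in the bottom path after seeing it.
p_top = 0.95      # prior belief in the "at" path (the percept said "at")
p_bottom = 0.05   # residual belief in the "as"/"and" paths

# Invented next-word probabilities along each path: "tossed" as a main verb is
# likely on the bottom path but needs a rare reduced relative on top; "thrown"
# is fine on top and nearly impossible on the bottom.
p_next = {"tossed": {"top": 0.001, "bottom": 0.2},
          "thrown": {"top": 0.01,  "bottom": 0.0001}}

def updated_belief_in_bottom(word):
    top = p_top * p_next[word]["top"]
    bottom = p_bottom * p_next[word]["bottom"]
    return bottom / (top + bottom)
```

Under these numbers, seeing tossed flips most of the belief onto the bottom path (a large belief shift), while thrown leaves the top path dominant (no shift).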
Ingredients for the model
• Q(w, w*) comes from the KLD (with minor changes)
• P_C(w) comes from a probabilistic grammar (this time a probabilistic finite-state grammar, i.e. a grammar which can be represented as a pFSA)
• We need one more ingredient:
  • a quantified signal of the alarm induced by word wi about changes in beliefs about the past
Quantifying alarm about the past
• Relative entropy (KL divergence) is a natural metric of change in a probability distribution (Levy, 2008; Itti & Baldi, 2005)
• Our distribution of interest is over the previous words in the sentence
  • Because we're allowing uncertain input, there is a probability distribution over what each previous word may have been
  • Call this distribution Pi(w[0,j)): the distribution over strings up to but excluding word j, conditioned on words 0 through i
• The change induced by word wi is the error identification signal EISi, the divergence of the new distribution from the old:
  EISi = D( Pi(w[0,j)) || Pi−1(w[0,j)) )
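The KL divergence in the definition above can be computed directly. The two toy distributions over word 1 are invented for illustration:

```python
import math

# Error identification signal as the KL divergence between the new and old
# beliefs about an earlier word.
def kl_divergence(p_new, p_old):
    """D(p_new || p_old) in bits, over a shared support."""
    return sum(p * math.log2(p / p_old[x])
               for x, p in p_new.items() if p > 0)

old = {"b": 0.5, "c": 0.5}   # belief about word 1 before word i (invented)
new = {"b": 0.8, "c": 0.2}   # belief about word 1 after word i (invented)
eis = kl_divergence(new, old)
```

If a later word leaves beliefs about the past unchanged, the EIS is exactly zero; any revision yields a positive signal.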
Error identification signal: example
• Measuring change in beliefs about the past:
[Figure: beliefs about word 1 ({b, c}) before vs. after word 2 ({f, e}); the change gives EIS2 = 0.14]
Results on local-coherence sentences
• Locally coherent: The coach smiled at the player tossed the frisbee
• Locally incoherent: The coach smiled at the player thrown the frisbee
• The EIS is greater for the variant humans boggle more on
  (all sentences of Tabor et al. 2004 with lexical coverage in the model)
Experimental data
• Does the model make the correct predictions for the experimental data of Levy et al. (2009)?
  The coach smiled at the player tossed the frisbee
  The coach smiled at the player thrown the frisbee
  The coach smiled toward the player tossed the frisbee
  The coach smiled toward the player thrown the frisbee
Model predictions
[Figure: model predictions for the four conditions: at…tossed, at…thrown, toward…tossed, toward…thrown]
(The coach smiled at/toward the player tossed/thrown the frisbee)
Rational analysis
1. Specify a formal model of the problem to be solved and the agent's goals
   A. Make as few assumptions about computational limitations as possible.
2. Derive optimal behavior given the problem and goals
3. Compare optimal behavior to agent behavior
4. If predictions are off, revise assumptions, and iterate
• Initially we assumed the input was noiseless
• But we made incorrect predictions about LCEs (we predicted they shouldn't cause difficulty)
• Revising our assumptions to include uncertain input, the theory now correctly predicts LCEs
• What novel predictions does our new theory make?
Today's questions
1. What behavioral evidence do we have for an uncertain-input/noisy-channel theory of sentence comprehension?
   • Local coherence effects (among others)
2. How can we model sentence comprehension under uncertain input?
   • One possibility is to use probabilistic finite-state automata
3. What further predictions might such a model make?
Prediction 2: hallucinated garden paths
• Try reading the sentence below:
  While the clouds crackled, above the glider soared a magnificent eagle.
• There's a garden-path clause in this sentence…
• …but it's interrupted by a comma.
• Readers are ordinarily very good at using commas to guide syntactic analysis:
  While the man hunted, the deer ran into the woods
  While Mary was mending the sock fell off her lap
• "With a comma after mending there would be no syntactic garden path left to be studied." (Fodor, 2002)
• We'll see that the story is slightly more complicated.
(Levy, 2010)
Prediction 2: hallucinated garden paths
While the clouds crackled, above the glider soared a magnificent eagle.
• This sentence consists of an initial intransitive subordinate clause…
• …and then a main clause with locative inversion (cf. a magnificent eagle soared above the glider)
• Crucially, the main clause's initial PP would make a great dependent of the subordinate verb…
• …but that analysis would require the comma to be ignored.
• Inferences through …glider should thus involve a tradeoff between perceptual input and prior expectations
• Inferences as probabilistic paths through the sentence:
  While the clouds crackled… → , (likely) → …above the glider… (unlikely) → soared
  While the clouds crackled… → ø (unlikely) → …above the glider… (likely) → soared
  • Perceptual cost of ignoring the comma
  • Unlikeliness of a main-clause continuation after the comma
  • Likeliness of a postverbal continuation without the comma
• These inferences together make soared very surprising!
• Two properties come together to create the "hallucinated garden path":
  1. A subordinate clause into which the main clause's inverted phrase would fit well
  2. A main clause with locative inversion
• Experimental design: cross (1) and (2)
  While the clouds crackled, above the glider soared a magnificent eagle.
  While the clouds crackled, the glider soared above a magnificent eagle.
  While the clouds crackled in the distance, above the glider soared a magnificent eagle.
  While the clouds crackled in the distance, the glider soared above a magnificent eagle.
• The phrase in the distance fulfills a thematic role for crackled similar to the one above the glider would fill
  • This should reduce the hallucinated garden-path effect
• We predict an interaction in reading times at soared
Prediction 2: Hallucinated garden paths
• Methodology: word-by-word self-paced reading
  • Readers aren't allowed to backtrack
  • So the comma is visually gone by the time the inverted main clause appears
  • A simple test of whether beliefs about previous input can be revised
[Moving-window display: ----- While ----- the ----- clouds ----- crackled, ----- above ----- the ----- glider ----- soared -----]
Model predictions
  While the clouds crackled, above the glider soared a magnificent eagle.
  While the clouds crackled in the distance, above the glider soared a magnificent eagle.
  While the clouds crackled, the glider soared above a magnificent eagle.
  While the clouds crackled in the distance, the glider soared above a magnificent eagle.
Results: whole-sentence reading times
• Processing boggle occurs exactly where predicted
Hallucinated garden-path summary
• The at/toward study showed that comprehenders note the possibility of alternative strings and act on it
• This study showed that comprehenders can actually devote resources to grammatical analyses inconsistent with the surface string
Hallucinated garden paths cont'd
• Sure, but punctuation's weird stuff. What about real words?
  I know that the desert trains could resupply the camp.
• There is a bias against the N N interpretation (at least sometimes)
(Frazier & Rayner, 1987; Macdonald, 1993)
Hallucinated GPs with words
• Bergen et al. (2012) used a bias against NN and toward NV to test for garden-path hallucinations involving wordform change:
  The intern chauffeur for the governor hoped for more interesting work. [NN, "dense" neighborhood: could be "intern chauffeured"]
  The intern chauffeured for the governor but hoped for more interesting work. [NV, "dense" neighborhood]
  The inexperienced chauffeur for the governor hoped for more interesting work. [NN, "sparse" neighborhood: could NOT be "inexperienced chauffeured"]
  Some interns chauffeured for the governor but hoped for more interesting work. [NV, "sparse" neighborhood]
(Bergen, Levy, & Gibson, 2012)
Results
• RT spike at the disambiguating region for the NN Dense condition
(Bergen, Levy, & Gibson, 2012)
Today's questions
1. What behavioral evidence do we have for an uncertain-input/noisy-channel theory of sentence comprehension?
   • Local coherence effects (among others)
2. How can we model sentence comprehension under uncertain input?
   • One possibility is to use probabilistic finite-state automata
3. What further predictions might such a model make?
   • Hallucinated garden paths
4. What is the structure of the noise model?
   • What types of noise operations (e.g. inserting words, deleting words, substituting words) do comprehenders think are more/less likely?
Structure of the noise model
• Gibson et al. (2013) hypotheses:
  • Short words, particularly function words, are more likely to be confusable (e.g. at vs. toward)
  • Prior probabilities should pull interpretations towards semantically plausible sentences
• Considering just insertions and deletions…
  • Fewer insertions/deletions are more likely than more insertions/deletions
  • Comprehenders should infer the original more easily if the change (to get from the intended message to the perceived message) involves a deletion rather than an insertion
    • It's easy for a speaker to accidentally delete a word
    • To accidentally insert a word, a speaker must not only decide to insert a word but also generate the specific word that gets inserted
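These hypotheses can be expressed as a toy scoring function over edit sequences. The parameter values and vocabulary size are invented, chosen only so that deletions outrank insertions and fewer edits outrank more:

```python
# Sketch of a Gibson-style noise model over edits. Each extra edit multiplies
# the probability down, and a deletion is cheaper than an insertion because an
# inserted word must also be *chosen* from the vocabulary.
P_DELETE = 0.01        # probability of accidentally deleting a word (invented)
P_INSERT_SLOT = 0.01   # probability of accidentally inserting *some* word (invented)
VOCAB_SIZE = 1000      # the inserted word is one specific item out of many (invented)

def noise_prob(n_insertions, n_deletions):
    return ((P_INSERT_SLOT / VOCAB_SIZE) ** n_insertions
            * P_DELETE ** n_deletions)
```

Under this scoring, a one-deletion corruption is far more probable than a one-insertion corruption, so an implausible sentence reachable by deletion from a plausible alternative invites noisy-channel "correction" more readily.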
Structure of the noise model
• Consider the following alternation:

  Sentence                          Plausibility   Insertions   Deletions
  The cook baked Lucy a cake.       Plausible      0            1
  The cook baked Lucy for a cake.   Implausible    1            0
  The cook baked a cake for Lucy.   Plausible      1            0
  The cook baked a cake Lucy.       Implausible    0            1
Structure of the noise model
• Consider the following implausible sentences:

  Sentence                                      Construction                            Edits
  The girl was kicked by the ball.              passive                                 2I
  The ball kicked the girl.                     active                                  2D
  The tax law benefited from the businessman.   intransitive                            1I
  The businessman benefited the tax law.        transitive                              1D
  The cook baked Lucy for a cake.               Prepositional Object (PO) benefactive   1I
  The cook baked a cake Lucy.                   Double Object (DO) benefactive          1D

• Sentences reachable from a plausible alternative by deletion (xD) are often "corrected" to the plausible interpretation, inconsistent with the literal meaning
• Sentences requiring an insertion (xI) are consistently given the literal interpretation
Noisy-channel inference results
• Confirmed predictions:
  • Fewer edits are more likely than more edits
  • Deletions are more likely than insertions
(Poppels & Levy 2015, replication of Gibson et al., 2013)
Exchanges in the noise model?
• Consider: This is a problem that I need to talk about Joe with.
• Anecdotally, people don't even notice the problem in this sentence
• Such an exchange is extraordinarily unlikely under the Gibson et al. noise model, because that model considers only insertions and deletions
• But it is reasonably likely if word exchanges are admitted
• What would you predict about comprehenders' noisy-channel inferences?
(Poppels & Levy 2015)
Noisy-channel inference results
• Confirmed predictions:
  • Comprehenders make noisy-channel inferences consistent with expecting exchanges in the noise model
(Poppels & Levy 2015)
Today's questions
1. What behavioral evidence do we have for an uncertain-input/noisy-channel theory of sentence comprehension?
   • Local coherence effects (among others)
2. How can we model sentence comprehension under uncertain input?
   • One possibility is to use probabilistic finite-state automata
3. What further predictions might such a model make?
   • Hallucinated garden paths
4. What is the structure of the noise model?
   • Deletions are more likely than insertions
   • Exchanges are also expected