Systematicity in sentence processing by recurrent neural networks

Stefan Frank
Nijmegen Institute for Cognition and Information
Radboud University Nijmegen
The Netherlands

“Please make it heavy on computers and AI and light on the psycho stuff”

(Konstantopoulos, personal communication, December 23, 2005)

Systematicity in language

Imagine you meet someone who only knows two sentences of English:

Could you please tell me where the toilet is?
I can’t find my hotel.

So (s)he does not know:

Could you please tell me where my hotel is?
I can’t find the toilet.

This person has no knowledge of English but simply memorized some lines from a phrase book.


Human language behavior is (more or less) systematic: if you know some sentences, you know many. Sentences are not atomic but made up of words. Likewise, words can be made up of morphemes.

(e.g., un + clear = unclear, un + stable = unstable, …)

It seems like language results from applying a set of rules (grammar, morphology) to symbols (words, morphemes).


The Classical symbol system hypothesis: the mind contains word-like symbols that are manipulated by structure-sensitive processes (Fodor & Pylyshyn, 1988). E.g., for dealing with language:

– boy and girl are nouns (N)
– loves and sees are verbs (V)
– N V N is a possible sentence structure

This hypothesis explains the systematicity found in language: If you know the N V N structure, you know all N V N sentences (boy sees girl, girl loves boy, boy sees boy, …)
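The claim that one structure licenses many sentences can be made concrete with a small sketch. It assumes only the two nouns and two verbs listed above; the function name `nvn_sentences` is made up for this illustration.

```python
from itertools import product

# Toy lexicon taken from the example above.
NOUNS = ["boy", "girl"]
VERBS = ["loves", "sees"]

def nvn_sentences(nouns, verbs):
    """Enumerate every sentence licensed by the N V N structure."""
    return [" ".join(words) for words in product(nouns, verbs, nouns)]

sentences = nvn_sentences(NOUNS, VERBS)
print(len(sentences))  # 2 nouns * 2 verbs * 2 nouns = 8 sentences
```

Knowing the structure and the word categories is enough to produce all eight sentences, including ones never encountered before.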

Some issues for the Classical theory

Lack of systematic behavior: Why are people often so unsystematic in practice?

The boy plays. OK
The boy who the girl likes plays. OK
The boy who the girl who the man sees likes plays. OK?
The athlete who the coach who the sponsor hired trained won. OK!


Lack of systematicity in language: Why are there exceptions to rules?

help + full = helpful
help + less = helpless
meaning + full = meaningful
meaning + less = meaningless
beauty + full = beautiful
beauty + less = ugly


Development: How do children learn the rules from what they hear?

The Classical theory has answers to these questions, but no explanations.

Connectionism

The “state of mind” is represented as a pattern of activity over a large number of simple, quantitative (i.e., non-logical) processing units (“neurons”).

These units are connected by weighted links, forming a (neural) network through which activation moves around.

The connection weights are adjusted to the network’s input and task.

The network develops its own internal representation of the input.

It should generalize to new (test) inputs.

Connectionism and the Classical issues

Lack of systematic behavior: Systematicity is built on top of an unsystematic architecture.

Lack of systematicity in language: “Beautiless” is expected statistically but never occurs, so the network learns it doesn’t exist.

Development: The network adapts to its input.

But can neural networks explain systematicity, or even behave systematically?

Connectionism and systematicity

Fodor & Pylyshyn (1988): Neural networks cannot be systematic. They only learn to associate examples rather than becoming sensitive to structure.

Systematicity: knowing X → knowing Y.
Generalization: training on X → learning Y.
So, systematicity equals generalization (Hadley, 1994).

Demonstrations of connectionist systematicity
– require many training examples but only use few tests
– are not robust: oversensitive to training details
– only display weak systematicity: words occur in the same ‘syntactic positions’ in training and test sentences

Simple Recurrent Networks Elman (1990)

[Figure: network diagram with input layer, hidden layer, and output layer]

Feedforward networks have long-term memory (LTM) but no short-term memory (STM). So how to process sequential input, like the words of a sentence?

A common SRN task is next-word prediction: the words of a sentence form the input sequence. After each word, the output should be the next word.
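As a rough illustration of how an SRN handles a word sequence, the sketch below runs an untrained network over a sentence, feeding the previous hidden state back in at every step. The toy vocabulary, layer sizes, random weights, tanh units and softmax output are all assumptions made for this example; they are not the settings of any model discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["boy", "girl", "loves", "sees", "who", "."]  # toy vocabulary (assumed)
w, n = len(VOCAB), 10                                  # layer sizes (assumed)

# Untrained random weights; training would adjust all three matrices.
W_in = rng.normal(0.0, 0.5, (n, w))    # input  -> hidden
W_rec = rng.normal(0.0, 0.5, (n, n))   # hidden -> hidden (previous time step)
W_out = rng.normal(0.0, 0.5, (w, n))   # hidden -> output

def one_hot(word):
    v = np.zeros(w)
    v[VOCAB.index(word)] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_predict(words):
    """Return one next-word distribution per input word."""
    h = np.zeros(n)  # short-term memory: the hidden state carried across words
    outputs = []
    for word in words:
        h = np.tanh(W_in @ one_hot(word) + W_rec @ h)
        outputs.append(softmax(W_out @ h))
    return outputs

preds = srn_predict(["boy", "sees", "girl"])
```

Because the hidden state is carried forward, the output after "girl" depends on the words that preceded it, which a feedforward network cannot achieve.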

SRNs and systematicity Van der Velde et al. (2004)

An SRN processed a minilanguage with 18 words (boy, girl, loves, sees, who, “.”, …) and 3 sentence types:

– N V N . (boy sees girl.)
– N V N who V N . (boy sees girl who loves boy.)
– N who N V V N . (boy who girl sees loves boy.)

Nouns and verbs were divided into four groups, each with two nouns and two verbs.
In training sentences, nouns and verbs came from the same group: < 0.44% of possible sentences were used for training.
In test sentences, nouns and verbs came from different groups.
Note: weak systematicity only.
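The 0.44% bound can be checked with a short count. The sketch assumes that every combination of nouns and verbs yields a valid sentence of each type, and that the same-group sentences form the pool from which training sentences are drawn; the helper name `count` is made up for this illustration.

```python
N_GROUPS = 4
NOUNS_PER_GROUP = 2
VERBS_PER_GROUP = 2

# Content-word slots per sentence type ("who" and "." are fixed words).
SENTENCE_TYPES = {
    "N V N .": ["N", "V", "N"],
    "N V N who V N .": ["N", "V", "N", "V", "N"],
    "N who N V V N .": ["N", "N", "V", "V", "N"],
}

def count(slots, n_nouns, n_verbs):
    """Number of ways to fill the slots with the given numbers of nouns and verbs."""
    total = 1
    for slot in slots:
        total *= n_nouns if slot == "N" else n_verbs
    return total

all_nouns = N_GROUPS * NOUNS_PER_GROUP   # 8 nouns in the whole lexicon
all_verbs = N_GROUPS * VERBS_PER_GROUP   # 8 verbs in the whole lexicon

possible = sum(count(s, all_nouns, all_verbs) for s in SENTENCE_TYPES.values())
same_group = sum(
    N_GROUPS * count(s, NOUNS_PER_GROUP, VERBS_PER_GROUP)
    for s in SENTENCE_TYPES.values()
)

print(possible, same_group, f"{100 * same_group / possible:.3f}%")
```

Under these assumptions there are 66048 possible sentences and 288 same-group sentences, about 0.436%, in line with the stated bound.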

SRNs and systematicity Van der Velde et al. (2004)

SRNs “fail” on test sentences, so
– They do not generalize to structurally similar sentences
– They cannot learn systematic behavior from a small training set
– They do not form good models of human language behavior

But
– what does it mean to “fail”? Maybe the network was better than completely non-systematic?
– was the size of the network appropriate?
   larger network → more STM → better processing?
   smaller network → less LTM → better generalization?
– was the language complex enough? With more different words there is more reason to abstract to syntactic types (nouns, verbs)

SRNs and systematicity: replication of Van der Velde et al. (2004)

What if a network does not generalize at all? When given a new sentence, it can only use the last word because combining words requires generalization.

This hypothetical, unsystematic network serves as the baseline for rating SRN performance.
– Performance +1: network never makes ungrammatical predictions
– Performance 0: network does not generalize at all, but gives the best possible output based on the last word
– Performance –1: network only makes ungrammatical predictions.

Positive performance indicates systematicity
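The slides do not give the formula behind this scale. One way to obtain the three anchor points is a piecewise-linear rescaling, sketched below. Here `a` is assumed to be the share of output activation that falls on grammatical next words, and `a_base` the same quantity for the last-word-only baseline; both names and the exact scaling are assumptions for illustration.

```python
def performance(a, a_base):
    """Rescale grammatical output share `a` against the baseline share `a_base`.

    Returns +1 when a == 1, 0 when a == a_base, and -1 when a == 0.
    """
    if not 0.0 < a_base < 1.0:
        raise ValueError("baseline share must lie strictly between 0 and 1")
    if a >= a_base:
        # Above baseline: fraction of the remaining headroom that is gained.
        return (a - a_base) / (1.0 - a_base)
    # Below baseline: fraction of the baseline share that is lost.
    return (a - a_base) / a_base

print(performance(1.0, 0.6), performance(0.6, 0.6), performance(0.0, 0.6))
```

With this scaling, any score above zero means the network does better than a system that only uses the last word.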

Network architecture

[Figure: network architecture]
input layer: w = 18 units (one for each word)
recurrent hidden layer: n = 20 units
hidden layer: 10 units
output layer: w = 18 units (one for each word)

SRN Results

Positive performance at each word of each test sentence type, so there is some systematicity.

SRN Results: effect of recurrent layer size

Larger networks (n = 40) do better, but very large ones (n = 100) overfit.

[Figure panels: N V N; N V N who V N; N who N V V N]

SRN performance and memory

SRNs do show systematicity to some extent. But their performance is limited:

− small n → limited processing capacity (STM)
− large n → large LTM → overfitting

How to combine large STM with small LTM?

Echo State Networks Jaeger (2003)

Keep the connections to and within the recurrent layer fixed at random values. The recurrent layer becomes a “dynamical reservoir”: a non-specific STM for the input sequence. Some constraints on the dynamical reservoir:

− large enough
− sparsely connected (here: 15%)
− weight matrix has spectral radius < 1
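A reservoir meeting these constraints could be built roughly as follows. The 15% connectivity comes from the slide; the uniform weight distribution, the target spectral radius of 0.9, and the function name `make_reservoir` are assumptions for illustration.

```python
import numpy as np

def make_reservoir(n, density=0.15, spectral_radius=0.9, seed=0):
    """Random sparse recurrent weight matrix, rescaled to a given spectral radius."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-1.0, 1.0, (n, n))
    mask = rng.random((n, n)) < density           # keep about 15% of the connections
    W = weights * mask
    radius = np.max(np.abs(np.linalg.eigvals(W)))  # largest absolute eigenvalue
    if radius == 0.0:
        raise ValueError("degenerate reservoir; try another seed or a larger n")
    return W * (spectral_radius / radius)

W = make_reservoir(100)
print(np.max(np.abs(np.linalg.eigvals(W))))
```

These weights stay fixed after construction; only the connections after the reservoir are trained.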

LTM capacity:
− In SRNs: O(n²)
− In ESNs: O(n)

So, can ESNs combine large STM with small LTM?
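The difference in LTM capacity can be seen by counting trainable connection weights. The sketch assumes the layer layout input → recurrent → hidden → output with w = 18 and 10 hidden units, ignores bias weights, and treats connections to and within the recurrent layer as untrained in an ESN; the function name is made up for this example.

```python
def trainable_weights(n, w=18, hidden=10, train_recurrent=True):
    """Count trainable connection weights for a recurrent layer of n units."""
    after_reservoir = n * hidden + hidden * w      # recurrent -> hidden -> output
    if not train_recurrent:
        return after_reservoir                     # ESN: grows linearly in n
    return w * n + n * n + after_reservoir         # SRN: the n * n term dominates

for n in (20, 100, 1530):
    srn = trainable_weights(n)
    esn = trainable_weights(n, train_recurrent=False)
    print(n, srn, esn)
```

As n grows, the SRN count grows quadratically while the ESN count grows only linearly, so an ESN can enlarge its reservoir (STM) without a matching explosion in trainable weights (LTM).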

Network architecture

[Figure: network architecture, with a legend marking trained and untrained connections]
input layer: w = 18 units
recurrent hidden layer: n = 20 units
hidden layer: 10 units
output layer: w = 18 units

The STM remains untrained, but the network does develop internal representations.

ESN Results

Positive performance at each word of each test sentence type, so there is some systematicity, but less than in an SRN of the same size.

ESN Results: effect of recurrent layer size

Bigger is better: no overfitting even when n = 1530!


ESN Results: effect of lexicon size (n = 100)

Note: with larger w, a smaller percentage of possible sentences is used for training.


Strong systematicity

30 words (boy(s), girl(s), like(s), see(s), who, …)
Many sentence types:

– N V N . (girl sees boys.)
– N V N who V N . (girl sees boys who like boy.)
– N who N V V N . (girl who boy sees likes boy.)
– N who V N V N who N V . (girls who like boys see boys who girl likes.)

Unlimited recursion (girls see boy who sees boy who sees man who …)

Number agreement between nouns and verbs

Strong systematicity

In training sentences: females as grammatical subjects, males as grammatical objects (girl sees boy)

In test sentences: vice versa (boy sees girl)

Positive performance on all words of four test sentence types:

– N who V N V N . (boy who likes girls sees woman.)
– N V N who V N . (boy likes girls who see woman.)
– N who N V V N . (boys who man likes see girl.)
– N V N who N V . (boys like girl who man sees.)

Conclusions

ESNs can display both weak and strong systematicity
Even with few training sentences and many test sentences
By doing less training, the network can learn more:

– Training fewer connections gives better results
– Training a smaller part of possible sentences gives better results

Can connectionism explain systematicity?
− No, because neural networks do not need to be systematic
− Yes, because they need to adapt to systematicity in the training input.

The source of systematicity is not the cognitive system, but the external world.
