
Cognitive plausibility in learning algorithms
With application to natural language processing

Arvi Tavast, PhD
Qlaara Labs, UT, TLU
Tallinn, 10 May 2016


Motivation
Why cognitive plausibility?

Objective: best product vs best research
Model the brain
End-to-end learning from raw unlabelled data
Grounded cognition
Cognitive computing, neuromorphic computing

Feedback loop: using the model to better understand the object to be modelled


Outline
Heretical view on language - established learning model - application to NLP

1 Introduction

2 Understanding humans
    Understanding human communication
    Understanding human learning
    Rescorla-Wagner learning model

3 Results

4 Application
    Naive Discriminative Learning


My background
Mainly in linguistics

1993 TUT computer systems

1989-2004 IT translation

2000-2006 Microsoft MILS

2002 UT MA linguistics

2008 UT PhD linguistics

2015 Uni Tübingen postdoc, quantitative linguistics


Understanding human communication
How do we explain the observation that verbal communication sometimes works?

The channel metaphor

Speaking is like sending things by train, selecting suitable wagons (words) for each thing (thought)

Hearing is like decoding the message

⇒ meanings are properties of words

Communication as uncertainty reduction

Speaking is like sending blueprints for building things, which the receiver will have to follow (subject to their abilities, available materials, etc.)

Hearing is like using hints to reduce our uncertainty about the message

⇒ meanings are properties of people


Understanding human communication
When can the channel metaphor work?

Encoding of a message must contain a set of discriminable states that is greater than or equal to the number of discriminable states in the to-be-encoded message

or:

Encoding thoughts with words can only work if the number of possible thoughts is smaller than or equal to the number of possible words

This is the case only in restricted domains (e.g. weather forecasts)

Compare: reconstructing a document based on its hash sum
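
A minimal illustration of this point in R (a sketch, assuming the digest package is installed; the hash is truncated to one byte purely to make the collisions visible):

library(digest)

# 1000 distinct documents, but only 256 possible one-byte codes:
# by the pigeonhole principle, some documents must share a code
docs <- sprintf("document number %d", 1:1000)
codes <- substr(sapply(docs, digest, algo = "md5"), 1, 2)
any(duplicated(codes))  # TRUE: the code alone cannot recover the document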


Understanding human learning
Compositional vs discriminative

Possible ways of conceptualising biological learning

Compositional model: we start as an empty page, adding knowledge like articles in an encyclopedia

Discriminative model: we start by perceiving a single object (the world) and gradually learn to discriminate between its parts

If discriminative:

Human language models cannot be constant across time or subjects


The Rescorla-Wagner learning model
Language acquisition can be described as creating a statistical relationship

The Rescorla-Wagner model: how do we learn that cue Cj means outcome O?

if we see that Cj ⇒ O, the relationship is strengthened (less so if other cues are present)

if we see that Cj ⇒ ¬O, the relationship is weakened (more so if other cues are present)

(if we see that ¬Cj ⇒ O, the relationship is weakened)
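
A minimal sketch of this update rule in R (standard formulation, with illustrative parameter values; cues absent on a trial are left unchanged here):

# One Rescorla-Wagner update: all cues present on a trial share a
# single error term, so co-present cues compete for the same credit
rw_update <- function(V, cues, outcome_present, alpha = 0.1, lambda = 1) {
  target <- if (outcome_present) lambda else 0  # lambda for O, 0 for ¬O
  error <- target - sum(V[cues])                # shared prediction error
  V[cues] <- V[cues] + alpha * error            # same change for every present cue
  V
}

V <- c(Cj = 0, Ck = 0)
V <- rw_update(V, c("Cj", "Ck"), TRUE)  # Cj and Ck => O: both strengthened, but they split the credit
V <- rw_update(V, "Cj", FALSE)          # Cj => ¬O: Cj alone is weakened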


Feature-label-order effect
Creating the relationship between word and concept is only possible in one direction

Feature-label-order effect

If concept ⇒ word, the relationship is strengthened

If word ⇒ concept, the relationship is not strengthened

Number of objects in the world ≫ number of words in language

Abstraction inevitably and irreversibly discards information

Recovering a meaning from a word is necessarily underspecified

Ramscar, M., Yarlett, D., Dye, M., Denny, K., and Thorpe, K. (2010). The effects of feature-label-order and their implications for symbolic learning. Cognitive Science, 34(6), 909–957.


Aging and cognitive decline
Why do our verbal abilities seem to fail around the age of 65?

Ramscar, M., Hendrix, P., Shaoul, C., Milin, P., and Baayen, H. (2014). The myth of cognitive decline: Non-linear dynamics of lifelong learning. Topics in Cognitive Science, 6(1), 5–42.


Morphology
Implicit morphology (without morphemes)

[Figure: a graph of letter trigram cues (#mA, ki#, #tA, tA#, #mt, mtA, tAk, Aki, itA, #mi, mit, At#, mAt, #m@, @tA, m@t, #m::t, m::tA, ###) connected by association weights ranging from about 0.1 to 0.59]
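
Trigram cues like the ones in the figure can be produced with orthoCoding from the ndl package (a minimal sketch; the word is illustrative, and # marks word boundaries):

library(ndl)
orthoCoding("mata", grams = 3)  # expected cue string: "#ma_mat_ata_ta#"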


Naive Discriminative Learning
The R package: installation and basic usage

ndl: https://cran.r-project.org/web/packages/ndl/index.html

ndl2 (+ incremental learning): contact the authors

wm <- estimateWeights(events)  # Danks equilibria

wm <- learnWeights(events)     # incremental, ndl2 only
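
A minimal end-to-end sketch with the CRAN package (the toy events data frame mirrors the frequency table on the next slide; ndl expects the columns Cues, Outcomes and Frequency, with multi-part cues joined by underscores):

install.packages("ndl")  # once, from CRAN
library(ndl)

# Learning events: lexome-plus-feature cues, word-form outcomes
events <- data.frame(
  Cues      = c("aadress_S_SG_N", "aadress_S_PL_P", "aadress_S_SG_AD"),
  Outcomes  = c("aadress", "aadresse", "aadressil"),
  Frequency = c(1, 1, 4),
  stringsAsFactors = FALSE
)

wm <- estimateWeights(events)  # cue x outcome weight matrix (Danks equilibria)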


Naive Discriminative Learning
Input data for Danks estimation: frequencies

Outcomes     Cues               Frequency
aadress      aadress S SG N     1
aadresse     aadress S PL P     1
aadressil    aadress S SG AD    4
aadressile   aadress S SG ALL   1
aasisid      aasima V SID       1
aasta        aasta S SG G       2
aasta        aasta S SG N       1
aastane      aastane A SG N     48


Naive Discriminative Learning
Input data for incremental learning: single events

Outcomes     Cues               Frequency
aadress      aadress S SG N     1
aadresse     aadress S PL P     1
aadressil    aadress S SG AD    1
aadressil    aadress S SG AD    1
aadressil    aadress S SG AD    1
aadressil    aadress S SG AD    1
aadressile   aadress S SG ALL   1
aasisid      aasima V SID       1
aasta        aasta S SG G       1
aasta        aasta S SG G       1
aasta        aasta S SG N       1
aastane      aastane A SG N     1
aastane      aastane A SG N     1
aastane      aastane A SG N     1
...
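
For incremental learning, the frequency table can be expanded into single events in plain R (a sketch, assuming the events data frame from the earlier example):

# Repeat each row Frequency times, then set every frequency to 1
single_events <- events[rep(seq_len(nrow(events)), events$Frequency), ]
single_events$Frequency <- 1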


Naive Discriminative Learning
Output: weight matrix, cues × outcomes

Cues                 Outcomes        Application
letter ngrams        words           reading
character features   words           reading
words                lexomes         POS tagging
lexomes              letter ngrams   morphological synthesis
contexts             words           distributional semantics
audio signal         words           speech recognition
words                audio signal    speech synthesis


Naive Discriminative Learning
About the weight matrix

What we can look at:

Similarity of outcome vectors

Similarity of cue vectors

MAD (median absolute deviation) of outcome vector

Competing cues
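
A sketch of these inspections, assuming the wm matrix from the earlier estimateWeights example (cues as rows, outcomes as columns):

# Cosine similarity between two vectors
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

cosine(wm[, "aadress"], wm[, "aadresse"])  # similarity of two outcome vectors (columns)
cosine(wm["SG", ], wm["PL", ])             # similarity of two cue vectors (rows)
mad(wm[, "aadressil"])                     # spread of cue support for one outcome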


Naive Discriminative Learning
About the weight matrix

Other properties:

No dimensionality reduction (we have experimented with matrices up to 200k × 100k)

Danks equations are subject to R's 2^32 element limit in the matrix pseudoinverse (a dense 200k × 100k matrix already has 2 × 10^10 cells)

Slow (weeks on ca 16 cores and 200 GB of RAM)

Task performance is below word2vec and similar models, but comparable


Some NLP tools
How to get started quickly with NLP

Python: NLTK, EstNLTK, Gensim (incl. word2vec), DISSECT

Java: GATE (also web), Stanford NLP, Deeplearning4j (incl. word2vec)

C: word2vec

R: ndl


Language understanding
What's missing from full language understanding

Training material

Interannotator agreement is less than perfect

Corpus is heterogeneous

This is not a methodological flaw

Communicative intent and self-awareness

If cues are lexomes (= what the speaker wanted to say), the system must want something.


Thanks for listening
Contacts and recommended reading

[email protected]

Easy reading: blog.qlaara.com

Recommended reading:
Harald Baayen: www.sfs.uni-tuebingen.de/hbaayen/
Michael Ramscar: https://michaelramscar.wordpress.com/