University of Birmingham

A learning perspective on the emergence of abstractions
Milin, Petar; Tucker, Benjamin V.; Divjak, Dagmar

Citation for published version (Harvard):
Milin, P, Tucker, BV & Divjak, D 2020, 'A learning perspective on the emergence of abstractions: the curious case of phonemes', arXiv.

Link to publication on Research at Birmingham portal
PREVIEW OF PAPER CURRENTLY UNDER REVIEW.
PLEASE CITE AS UNPUBLISHED MANUSCRIPT.
A learning perspective on the emergence of abstractions: the curious case of phonemes

Petar Milin^a, Benjamin V. Tucker^b, and Dagmar Divjak^{a,c}

^a University of Birmingham, Department of Modern Languages, Birmingham, UK; ^b University of Alberta, Department of Linguistics, Edmonton, Alberta, Canada; ^c University of Birmingham, Department of English Language and Linguistics, Birmingham, UK.
Abstract
In the present paper we use a range of modeling techniques to investigate whether an abstract phone could emerge from exposure to speech sounds. We test two opposing principles regarding the development of language knowledge in linguistically untrained language users: Memory-Based Learning (MBL) and Error-Correction Learning (ECL). A process of generalization underlies the abstractions linguists operate with, and we probed whether MBL and ECL could give rise to a type of language knowledge that resembles linguistic abstractions. Each model was presented with a significant amount of pre-processed speech produced by one speaker. We assessed the consistency or stability of what the models have learned and their ability to give rise to abstract categories. The two types of models fared differently on these tests. We show that ECL models can learn abstractions and that at least part of the phone inventory can be reliably identified from the input.

KEYWORDS: Error-Correction Learning; Memory-Based Learning; Widrow-Hoff rule; Temporal Difference rule; abstraction; phoneme; English

Corresponding author: Petar Milin. Email: [email protected]
1. Introduction

Many theories of language presuppose the existence of abstractions that aim to organize the extremely rich and varied experiences language users have. They are thought to be either innate (generative theories) or to arise from experience (emergentist, usage-based approaches). Several mechanisms by which abstractions could form have been identified and include, but are not limited to, not attending to certain information streams to manage the influx of information, (passively or actively) forgetting some dimensions of experience, and grouping similar experiences together. Importantly, these mechanisms are not necessarily mutually exclusive. In this paper, we consider another mechanism and investigate whether learning can account for the emergence of linguistic abstractions from exposure to the ambient language. Can learning to ignore uninformative cues lead to a form of abstracted knowledge?
To investigate this possibility, we present a case study on the sounds of English in which we computationally model whether an abstract phone could emerge from exposure to speech sounds using three cognitively plausible learning rules: Memory-Based Learning (MBL), Widrow-Hoff (WH) and Temporal Difference learning (TD; a direct generalization of WH). The latter two computational models learn to unlearn irrelevant dimensions of experience via the cue competition inherent in the experience. They focus on those dimensions that help deal with the environment on a 'good enough' basis. What remains would then represent an abstraction: it is less detailed (or more schematic) than the input exemplars and more parsimonious because, as we learn, we discard or ignore what is non-predictive, i.e., not helpful for improving performance (by error-correction), and we retain only the relevant 'core'. Importantly, however, that abstraction is not a given and static memory trace, but rather an ever-evolving, information-rich residue of the experience (cf., Love, Medin, & Gureckis, 2004; Nosofsky, 1986; Ramscar & Port, 2019).
1.1. Much ado about abstractions

If there is one thing that linguistic theories have in common, it is that they resort to abstractions for describing language. Some of these abstractions are so commonly agreed upon that they are taught to children in elementary school (think of parts of speech like noun, verb or adjective), while others are specific to a particular theory (such as constructions). Depending on the theory, these abstractions are either innate and present at birth, as is the case in generative frameworks, or they have emerged from exposure to the ambient language, as is assumed in usage-based approaches (see Ambridge & Lieven, 2011, for an overview of these theories and the predictions they make regarding language acquisition or development).

This implies that abstractions play a different role in the two types of theories. In earlier versions of generative linguistic theories, abstractions were underlying representations: they are part of the deep structure which harbors the abstract form of a sound, word or sentence before any transformational rules have been applied. In this sense, abstractions give rise to usage. In usage-based approaches, abstractions are posited at the other end: abstractions are built up gradually, as experience with language accumulates and commonalities between instances are detected. In this framework, abstractions follow from usage. Note that in both approaches abstractions co-exist with actual usage: it is not uncommon for usage-based approaches to assume redundant encoding in which abstractions are stored alongside exemplars (Goldberg, 2006; Langacker, 1987).
This practice of describing language using abstractions has led to the assumption that language knowledge and use must involve knowledge of the relevant abstractions (but see Ambridge, 2020). The challenge facing those who assume the existence of abstractions is one of proposing a hypothesized construct or mechanism that would give rise to these abstractions. For nativists this means explaining how these abstractions could be present at birth, and for emergentists this means explaining how such abstractions can emerge from exposure to usage. Ideally this needs to be done in sufficient detail for the account to be implemented computationally and/or tested empirically.
1.2. The taming of the shrew (the smallest of language units)

Researchers have long searched for the smallest meaningful unit in language (Sapir, 1921; Trubetzkoy, 1969). The main goal of the search for an abstraction was to discover a basic building block on which linguistic theories can build (much like the atom, electron and Higgs boson, and subsequent research in particle physics; Gillies, 2018). In the study of sound, over the years, many different levels of abstraction have been proposed, such as the mora, demisyllable, phoneme, syllable, or phonetic features (Goldinger & Azuma, 2003; Marslen-Wilson & Warren, 1994; Savin & Bever, 1970). These are all assumed to have mental correlates. The most commonly proposed and assumed building block or unit of abstraction is the phoneme or syllable (Goldinger & Azuma, 2003). Generally, phonemes are defined as abstractions of actual speech sounds that ignore or 'abstract away' details of the speech signal that do not contribute to differentiating meaning. Syllables as a basic unit are usually considered combinations of speech sounds that carry a rhythmic weight and are used to differentiate meaning.
In this study, we will focus on phoneme-like units. The phonemic perception and representation of speech is arguably one of the best-established constructs of linguistic theory (for a review see Donegan, 2015, pp. 39-40). Our choice may appear rather unfortunate, since phonemes have come under fire more than once in recent history. Port (2010, p. 53) claims: 'Phones and phonemes, though they come very readily to our conscious awareness of speech, are not valid empirical phenomena suitable as the data basis for linguistics since they do not have physical definitions.' Work by Goldinger and Azuma (2003) investigated an Adaptive Resonance Theory approach to the emergence of perceptual 'units' or self-organizing dynamic states. It argues that the search for a fundamental unit is misguided and advocates an exemplar representation. A recurring claim in the argument against the cognitive reality of phonemes is the richness of memories of experiences, including experiences of sounds (compare with Port, 2010). Although we do not deny the validity of the findings reported in, e.g., Goldinger and Azuma (2003), Port (2007), Port and Leary (2005), and Savin and Bever (1970), we do not accept them as evidence that there cannot possibly be any cognitive reality to phoneme-like units. Instead, we consider these findings a mere indication that phoneme-like units are not necessarily the whole story. Compare here the findings of Buchwald and Miozzo (2011), who found that two systems need to be recognized to account for errors in spoken language production, i.e., a processing stage with context-independent representations and one with context-dependent representations.
As mentioned earlier, the theoretical need for a basic building block in linguistics stems from a desire to simplify the input for a learner or listener. This is true of much of the early speech error work, which provides strong evidence for the need for such a representation (e.g., Fromkin, 1971). The lack-of-invariance problem from speech perception research illustrates this need nicely. Simply put, the acoustic signal is highly complex and variable, and there is no simple mapping from the acoustic signal to some sort of phonetic structure (Appelbaum, 1996; Shankweiler & Fowler, 2015). As a result, what is thought of as a single phoneme can be represented by multiple variably weighted acoustic cues and can be produced in many different ways (e.g., Mielke, Baker, & Archangeli, 2016). While the lack-of-invariance problem has advanced research in the field of speech perception, and has not only increased our understanding of speech production and perception but has also been influential in early attempts at speech synthesis, not everyone agrees that it is as big a problem as previously thought (e.g., Appelbaum, 1996; Goldinger & Azuma, 2003). Even so, the search continues for a basic building block which linguistic theories can use as a concept on which to build additional linguistic structure.
It is also worth noting that the main motivation for our choice is not conceptual but methodological. This is to say that, although we are not opposed to the existence of phoneme-like units, neither do we believe that phonemes are the only abstract form cognitive representations of sound can take. Abstractions are likely language-dependent and could range from what might traditionally be referred to as moras, allophones, phonemes, demi-syllables, syllables, words and sequences of words. A productive theoretical proposition should not commit a priori to particular constructs or abstractions. In that respect, such units cannot be compulsory, and the nature of these units could vary across languages or even individuals. Furthermore, such units could lack support in a cognitive sense and yet still be didactically useful (hypothetical) constructs. Finally, we do not consider these abstractions or units to be static either, as we address the question of how they unfold over time.
The main goal of this study, hence, is to explore whether abstract units can emerge from input, using cognitively realistic learning algorithms. For the remainder of the article we will distinguish between phonemes and phones, and use phones to refer to the segments we are working with, to reiterate the fact that we are using phoneme-like units out of convenience and not out of any theoretical leaning.
1.3. To learn or not to learn

For the purposes of this study, we will zoom in on a specific subset of linguistic theories and algorithmic options. Linguistically, we will assume that universal sound inventories are not available at birth but are learned using general cognitive mechanisms. Psychologically, we will contrast models that assume veridical encoding of the input with those that allow or even encourage information 'sifting'. Translated into the terms of the relevant machine learning approaches, this opposition would pit Memory-Based Learning (MBL) against Error-Correction Learning (ECL; for a more in-depth discussion of the differences between the two approaches see also Milin, Divjak, Dimitrijević, & Baayen, 2016).
To researchers interested in linguistics, and cognitive linguistics in particular, Memory-Based Learning (MBL) appeals because it explains linguistic knowledge in terms of categorization and analogical extension. Moreover, exemplar accounts are particularly well established in phonetics and phonology (e.g., Bod, 1998; Hay & Bresnan, 2006; Johnson, 1997; Pierrehumbert, 2001), yielding a situation in which words or phrases are stored with rich phonetic detail. However, this opinion, although widespread, does not seem to mesh with the empirical findings from studies on echoic sensory memory, which represents the short-term storage of auditory input (cf., Jääskeläinen, Hautamäki, Näätänen, & Ilmoniemi, 1999): sensory memory has been found to be very short-lived, and hence it remains unclear how that rich detail would make it into long-term memory (under normal, non-traumatic circumstances).
Error-Correction Learning (ECL) appears conceptually much better suited for testing the emergentist perspective on language knowledge, as it takes a holistic (i.e., molar) approach to knowledge, and language knowledge in particular. This knowledge evolves gradually and continuously from the input, which is conceived of as a multidimensional space of 'discernibles' or contrasts. In other words, rather than retaining the exemplars as they are, in all their unparsimonious richness, learners may disregard non-predictive aspects of the input, keeping only what is 'useful' in an adaptive sense.

In this respect, perhaps, MBL and ECL could be seen as opposing principles: in the extreme, an exemplar approach does not require anything beyond storage, while an error-correction approach is a stockpile of perpetually changing, must-be contrasts.
We can assume that the smallest common denominator of the two approaches is the ability to learn from the environment for the sake of ever better adaptation. Rather simplified, a system (be it a computer, animal or human) learns about its environment by processing input information about the environment and about itself in relation to that environment. Gradually, that system becomes more knowledgeable about its environment and, thus, better adapted to it. It follows that, computationally, learning assumes stimulation from the environment to which the system reacts by modulating its parameters (also known as free parameters) for operating in and on that environment (cf., Mendel & McLaren, n.d.). Ultimately, the system responds to its environment in a new and more adapted way. From this computational perspective, what makes these two learning processes unique is the specific way in which parameter change happens. Generally speaking, MBL stores or memorizes data directly, while ECL filters or discriminates that data to retain what is necessary and sufficient.
Most (if not all) MBL approaches store past experiences as correct input-output exemplars: the input is a (vector) representation of the environment and the system, whereas the output assumes an action or an entity, a desired outcome (see Haykin, 1999). Memory is organized in neighborhoods of similar or proximal inputs, which presumably elicit similar actions/entities. Thus, a memory-based algorithm consists of (a) a criterion for defining the neighborhoods (similarity, proximity) and (b) a learning principle that guides assignment of a new exemplar to the right neighborhood. One simple yet efficient example of such a principle is that of Nearest Neighbors, where a particular type of distance (say, the Euclidean distance) is minimal in a given group of (near) neighbors. It has been shown that this simple principle straightforwardly ensures the smallest probability of error over all possible decisions for a new exemplar (as discussed by Cover & Hart, 1967). The principle appears simple, but it is also efficient. Moreover, it appeals to linguists, and cognitive linguists in particular, as it equates learning with categorization or classification of newly encountered items into neighborhoods of 'alikes': new items (exemplars) are processed on the basis of their similarity to the existing neighborhoods, which resembles the process of analogical extension. And for cognitive linguists, categorization is one of the foundational, general cognitive abilities that facilitate learning from input (Taylor, 1995).
ECL runs on very different principles, which filter the value (weight) of the input information (signal) to facilitate error-free prediction of some desired outcome. The process unfolds in a step-wise manner as new input information becomes available. The weighting or evaluation of the current success is done relative to a mismatch between the current prediction and the true or desired outcome. In other words, the stimulation to modulate the operating parameters comes from the difference between the desired outcome and the current prediction of it on the basis of available input data. That difference induces a rather small disturbance in the input weights which, effectively, evaluates the input data itself to bring its prediction closer to the true desired outcome. The learning process continues as new data arrive. Gradually, the difference between the prediction and the true outcome becomes smaller and, consequently, the changes in the input weights also diminish over time.

Thus, comparatively, in MBL the input data remain unaffected and are stored paired with the outcome, while in ECL that input is continuously (re)evaluated and filtered to achieve 'faithful' prediction of the outcome. In a sense, for ECL, input data matter only inasmuch as they help minimize prediction error. Otherwise, over time, some input may end up having a zero weight, which means that it is effectively ignored. We will return to these points in more detail in Section 2.2.
Coincidentally, both MBL and ECL are currently being pushed to their (almost paradoxical) extremes. On the one hand, Ambridge (2020, https://osf.io/5acgy/) denounces any stored linguistic abstractions (while, strangely, acknowledging other types of abstractions). On the other hand, Ramscar (2019) denies language units any ontological (and, consequently, epistemological) relevance by disputing mappings of any substance (and, hence, theoretical importance) between form units and semantic units: units become 'dis-unit-ed' due to a lack of one-to-one mappings between an (allegedly fixed) form and an (allegedly fixed) meaning.
1.4. This study: pursuing that which is learnable

In this paper, we contrast a model that stores the auditory signal in memory with one that learns to discriminate phones in the auditory signal. Modeling whether an abstract phone could emerge from exposure to speech sounds by applying the principle of error-correction learning implies dimensionality reduction. Our approach crucially relies on the concept of unlearning, whereby the strength of the links between any cues that do not co-occur with a certain outcome is reduced (see Ramscar, Dye, & McCauley, 2013, for discussion). The principle is such that the model will gradually try to unlearn irrelevant variations in the speech input, thereby discarding dimensions of the original experience and retaining more abstract and parsimonious 'memories'. This strategy is the radical opposite of a model that presupposes that we store our experiences veridically, i.e., in their full richness and idiosyncrasy. On such an account, learning equals encoding, storing and organizing that storage to facilitate retrieval.
To achieve this, we will provide our models with a common input in automatic speech recognition, Mel Frequency Cepstral Coefficients (henceforth MFCCs), derived from the acoustic signal, and with phones as possible outcomes. We will assign our models the task of learning to link the two. This implies that, in a very real sense, our modeling is biased: our model will learn phone-like units because those are the outputs that we provide. If we provided other types of units, other types of outputs might be learned. Yet, this does not diminish our finding that phones can be successfully learned from input, using an error-correction learning approach.
Of course, because we use a simple algorithm, performance differences with state-of-the-art classifiers should be expected. Yet, what we trade in, in terms of performance, we hope to gain in terms of cognitive reality and, at the same time, interpretability of the findings. In other words, rather than maximizing performance, we will minimize two critical 'investments': in cultivating the input, and in algorithmic sophistication (and computational opaqueness). Instead, we remain as naïve as possible and engage with MFCCs directly as input, and we use one of the simplest yet surprisingly efficient computational models of learning. Finally, recall also that we do not claim that phonemic variation would be something, let alone everything, that listeners identify. At this point, we are testing whether the phones that are standardly recognized can be learned from input using cognitively realistic learning algorithms.
As said, learnability is our minimal requirement and our operationalization of cognitive reality. On a usage-based, emergentist approach to language knowledge, abstractions must be learnable from the input listeners are exposed to if these abstractions are to lay claim to cognitive reality (Divjak, 2015a; Milin et al., 2016). Whereas earlier work explored whether abstract labels can be learned and whether they map onto distributional patterns in the input (cf. Divjak & Caldwell-Harris, 2015), more recent work has brought these strands together by modeling directly how selected abstract labels would be learned from input using a cognitively realistic learning algorithm (cf. Milin, Divjak, & Baayen, 2017).
With this in mind, our study does not contradict recent developments within the discrimination-learning framework to language. The Linear Discriminative Learner (LDL), in particular, sets itself the more ambitious goal of refuting the need for hierarchies of linguistic constructs (cf., Baayen, Chuang, Shafaei-Bajestan, & Blevins, 2019; Chuang & Baayen, n.d.; Shafaei-Bajestan, Moradipour-Tari, & Baayen, 2020). The present study, however, asks a more fundamental question relating to the learnability of the simplest such linguistic constructs: phones. Yet, in that, we take a radical approach and model the emergence straightforwardly from continuous, numeric input representations (i.e., MFCCs), rather than from pre-processed discrete units (i.e., the FBSF of Arnold, Tomaschek, Sering, Lopez, & Baayen, 2017), as used in the related studies by Nixon and Tomaschek (2020) and Shafaei-Bajestan et al. (2020).
An error-correction approach to speech sound acquisition has recently been proposed. Nixon and Tomaschek (2020) trained a Naïve Discriminative Learning model (NDL), relying on the Rescorla-Wagner learning rule, to form expectations about a subset of phonemes from the surrounding speech signal. For this study, data from a German spontaneous speech corpus was used (Arnold & Tomaschek, 2016). The continuous acoustic features were discretized into 104 spectral components per 25 ms time-window, which were then used as both inputs and outputs. A learning event consisted of the two preceding cues (104 elements each), a target (104 elements) and one following cue (also 104 elements). Learning proceeded in an iterative fashion over a moving time-window. The trained model was tested by trying to replicate an identification task on vowel and fricative continua, and was able to generate reassuring discrimination curves for both continua. In other words, Nixon and Tomaschek (2020) investigated sounds that are discriminated based on cues from the expected spectral frequency ranges for vowels and fricatives.
While the present study is similar to the work by Nixon and Tomaschek (2020), both in spirit and in substance, it also differs in several major ways. First, instead of predicting learned activations, we used activations as predictors of the correct phone. Second, the present study uses the entire phone inventory of English and does not focus on a subset. Third, we use real-valued input from MFCCs directly, rather than input transformed into discrete categories. That is why we rely on the Widrow-Hoff learning rule, noting that it is considered to be identical to the Rescorla-Wagner rule (as R. Rescorla, 2008, himself pointed out) but accepts numeric input (see Milin, Madabushi, Croucher, & Divjak, 2020, for further details). Fourth, our approach, specifically, is grounded in previous empirical work within the NDL framework (cf., Milin, Feldman, Ramscar, Hendrix, & Baayen, 2017), and uses two learning measures: directly proportional activation from the input MFCCs, and inversely proportional diversity of the same input MFCCs. The former indicates the support and the latter the competition that an outcome (in this particular case a phone) receives from the input. Finally, Nixon and Tomaschek (2020) focus on the predictivity of the resulting model, i.e., on predicting responses in discrimination and categorization tasks, and less on the creation of abstract units. The authors do, however, discuss interesting initial findings that begin to replicate categorical perception results from speech perception.
Our learning-based approach is also reminiscent of the mechanism-driven approach proposed in Schatz, Feldman, Goldwater, Cao, and Dupoux (2019). Their findings suggest that distributional learning might be the right approach to explain how infants become attuned to the sounds of their native language. The proposed learning algorithm did not try to achieve any biological or cognitive realism but focused purely on the power of generative machine learning techniques (see Jebara, 2012, for details on this family of machine learning models, and the distinction between discriminative and generative learning model families). Thus, they selected an engineering favorite which is based on a Gaussian Mixture Model (GMM) classifier. The continuous sound signal was sampled in regular time-slices to serve as input data to generate a joint probability density function which is a mixture of (potentially many) Gaussian distributions from the given samples. The focus of the study by Schatz et al. (2019) was, to reiterate, not on the plausibility of a particular (generative) learning mechanism, but rather on the realism of the input data and its potential to be learned. This justifies their choice of a powerful generative learning model.

As we will explain in Section 2, we ran our simulations on much simpler but biologically and cognitively plausible learning algorithms. In other words, Schatz et al. (2019) concluded that plausible or realistic data can generate a learnable signal (i.e., in terms of the joint probability distribution, given a problem space), and we now make the crucial step of simultaneously testing (a) plausible learning mechanisms against (b) realistic data.
2. Methods

One of the challenges of working with speech is that it is continuous. In this section we describe the speech materials and how they were pre-processed before we move on to presenting the details of the modeling procedures. Recall that our decision to model phones does not reflect any theoretical bias on our part: we believe that similar results would be obtained with any number of other possible levels of representation.
2.1. Speech Material

The speech items used for modeling stem from the stimuli in the Massive Auditory Lexical Decision (MALD) database (Tucker et al., 2019). This database contains 26,000 word stimuli and 9,500 pseudo-words produced by a single male speaker from Western Canada; together, this amounts to approximately 5 hours of speech. The forced alignment was completed using the Penn Forced Aligner (Yuan & Liberman, 2008). The forced-aligned files provide time-aligned phone-level segmentation of the speech signal, often used in phonetic research and for training automatic speech recognition tools. All phone-level segmentation is indicated using the ARPAbet transcription system (Shoup, 1980) for English, which we use throughout this paper. The speech stimuli, along with forced-aligned TextGrids, are available for use in research (http://mald.artsrn.ualberta.ca).
In order to prepare the files for modeling, we windowed and extracted MFCCs for all of the words using 25 ms long overlapping windows in 10 ms steps over the entire acoustic signal. By way of example, a 100 ms phone would be divided into 9 overlapping windows, with some padding at the end to allow the windows to fit. MFCCs are summary features of the acoustic signal commonly used in automatic speech recognition. MFCCs are created by applying a Mel transformation, which is a non-linear transform meant to approximate human hearing, to the frequency components of the spectral properties of the signal. The mel frequencies are then submitted to a discrete cosine transform, so that, in essence, we are taking the spectrum of a spectrum, known as the cepstrum. Following standard practice in automatic speech recognition, we used the first 13 coefficients in the cepstrum, calculated the first derivative (delta) for the next 13 coefficients, and the second derivative (delta-delta) for the final set of coefficients (Holmes, 2001). This process results in each window or trial having 39 features; thus for each word we created an MFCC matrix of the 39 coefficients and the corresponding phone label.
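The windowing arithmetic and the 13-to-39 feature stacking can be sketched as follows. This is a minimal numpy illustration, not the authors' extraction pipeline (which presumably used a standard MFCC toolkit); the function names and the simple `np.gradient` deltas are our own stand-ins for whatever delta formula the original software applied.

```python
import numpy as np

def frame_signal(x, sr=16000, win_ms=25, step_ms=10):
    """Slice a 1-D signal into overlapping windows, zero-padding the tail
    so the last window fits (cf. a 100 ms phone -> 9 windows)."""
    win = int(sr * win_ms / 1000)
    step = int(sr * step_ms / 1000)
    n_frames = 1 + int(np.ceil((len(x) - win) / step)) if len(x) > win else 1
    pad = (n_frames - 1) * step + win - len(x)
    x = np.pad(x, (0, max(pad, 0)))
    return np.stack([x[i * step : i * step + win] for i in range(n_frames)])

def add_deltas(mfcc):
    """Stack 13 MFCCs with first (delta) and second (delta-delta)
    differences along the time axis, yielding 39 features per frame."""
    delta = np.gradient(mfcc, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([mfcc, delta, delta2])
```

For a 100 ms phone at 16 kHz (1,600 samples), `frame_signal` yields 9 windows of 400 samples each, matching the example in the text.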
2.2. Algorithms

As stated in the Introduction, MBL and ECL can be positioned as representing opposing principles in learning and adaptation, with the former relying on storage of otherwise unaltered input-outcome experiences, and the latter using filtering of the input signal to generate a minimally erroneous prediction of the desired or true outcome. These conceptual differences are typically implemented technically in the following way: the MBL algorithm must formalize the organization of stored exemplars in appropriate neighborhoods and then, accordingly, the assignment of new inputs to the right neighborhood. Conversely, the ECL algorithm would typically use iterative or gradual weighting of the input to minimize the mismatch between the predicted and targeted outcome.
We applied MBL by 'populating' neighborhoods with exemplars that are corralled by the smallest Euclidean distance, and then assigning new input to the neighborhood of the k = 7 least distant (i.e., nearest) neighbors. This is, arguably, a rather typical MBL implementation, as Euclidean distance is among the most commonly used distance metrics, while k-Nearest Neighbors (kNN) is a straightforward yet efficient principle for storing new exemplars (as discussed previously by Cover & Hart, 1967; Haykin, 1999, etc.). The neighborhood size was set to 7, following findings reported in Milin, Keuleers, and Filipović-Đurđević (2011): in their study, the number of neighbors was systematically manipulated, and k = 7 was suggested for optimal performance.
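A minimal sketch of this kNN decision rule (Euclidean distance, k = 7, majority vote); the function name and the tie-breaking behavior of the vote are our own choices, not taken from the paper.

```python
import numpy as np
from collections import Counter

def knn_predict(memory_X, memory_y, query, k=7):
    """Classify a query vector by majority vote among its k nearest
    (Euclidean-distance) neighbors in the stored exemplar memory."""
    dists = np.linalg.norm(memory_X - query, axis=1)  # distance to each stored exemplar
    nearest = np.argsort(dists)[:k]                   # indices of the k least distant
    votes = Counter(memory_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```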
The two ECL learning models, Widrow-Hoff (WH) and Temporal Difference (TD), were chosen as two versions of a closely related idea: WH (also known as the Delta rule or the Least Mean Square rule; see Milin, Madabushi, et al., 2020, and references therein) represents the simplest error-correction rule, while TD generalizes to a rule that accounts for future rewards, too. More precisely, at time step t, learning assumes small changes to the input weights w_t (small to obey the Principle of Minimal Disturbance; cf., Milin, Madabushi, et al., 2020; Widrow & Hoff, 1960; Widrow & Lehr, 1990) to improve prediction in the next time step t+1. This is achieved by re-weighting the prediction error (which is the difference between the targeted outcome and its current prediction) with the current input cues and a learning rate (a free parameter): λ(o_t − w_t c_t)c_t, where λ represents the learning rate, c_t is the vector of input cues, w_t c_t is the prediction and o_t is the outcome. The full WH update of the weights w in the next time step t+1 is then:

w_{t+1} = w_t + λ(o_t − w_t c_t)c_t

For TD, however, the prediction (w_t c_t) is additionally corrected by a future prediction, as the temporal difference to that current prediction: w_t c_t − γ w_t c_{t+1}, where γ is a discount factor, the additional free parameter that weights the importance of the future, and w_t c_{t+1} is the prediction given the next input c_{t+1}. The update rule for TD thus becomes:

w_{t+1} = w_t + λ(o_t − [w_t c_t − γ w_t c_{t+1}])c_t
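The two update rules translate directly into code. The sketch below treats a single outcome unit with numpy; the paper's actual weights form a 39 × 40 matrix, i.e., one such unit per phone outcome, and λ = 0.0001 follows the setting reported in Section 2.3.

```python
import numpy as np

def wh_update(w, c, o, lam=0.0001):
    """Widrow-Hoff: w_{t+1} = w_t + lam * (o_t - w_t.c_t) * c_t."""
    return w + lam * (o - w @ c) * c

def td_update(w, c, c_next, o, lam=0.0001, gamma=0.5):
    """Temporal Difference: the current prediction w_t.c_t is corrected by
    the discounted next-step prediction gamma * w_t.c_{t+1} before the
    error is computed."""
    return w + lam * (o - (w @ c - gamma * (w @ c_next))) * c
```

Note that with γ = 0 the TD rule reduces exactly to WH, which is the sense in which TD is a direct generalization.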
2.3. Training
Weran several computational simulationsof learning thephonesofEnglishdirectly
fromtheMFCCinput.Thisconstitutesa40-wayclassificationtask,predicting39English
phonesandacategoryofsilence.
The full learning sample consisted of 2,073,999 trials that were split into training and test trials (90% and 10%, respectively). Each trial represents one 25 ms window extracted from the speech signal and matched with the correct phone label. The first training regime used the training data trials as they are: at each trial, input cues were represented as an MFCC row-vector, and the outcome was the corresponding 1-hot encoded (i.e., present) phone. The second training regime used the same 90% of training data to create N = 100 normally distributed data points (Gaussian), with mean and +/− 2.58 SE confidence intervals (to the true, population mean), one for each of the 40 phone categories, given the input MFCCs. The Raw input consisted of 1,866,599 trials (90% of the full dataset), while the Gaussian data contained 4,000 trials (100 normally distributed data points for each of the 39 phones and silence).
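Our reading of this generation procedure can be sketched as follows (Python; the function name, the per-dimension independence assumption, and the use of z · SE as the spread of the normal distribution are our interpretation, not the authors' exact code):

```python
import numpy as np

def gaussian_sample(mfcc_rows, n=100, z=2.58, rng=None):
    """Generate n synthetic data points for one phone category: draw
    each MFCC dimension from a normal distribution centred on the
    sample mean, with spread given by the z * SE confidence bound."""
    rng = np.random.default_rng(rng)
    mu = mfcc_rows.mean(axis=0)
    se = mfcc_rows.std(axis=0, ddof=1) / np.sqrt(len(mfcc_rows))
    return rng.normal(mu, z * se, size=(n, mfcc_rows.shape[1]))

# demo: with identical training rows, all generated points sit at the mean
synthetic = gaussian_sample(np.ones((50, 3)), rng=0)
```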
The two datasets – Raw vs. Gaussian input – were treated identically during training. For the two ECL algorithms the learning rate was set to λ = 0.0001, to safeguard from overflow, given the considerable range of MFCC values (min = −90.41, max = 68.73, range = 159.14). This learning rate was kept constant throughout all simulation runs. Additionally, when training with the Temporal Difference model, the decay parameter of future reward expectation was set to the commonly used value of γ = 0.5. Upon completion, each training yielded a weight matrix, with the MFCCs as input cues in rows and the phone outcomes in columns (39 × 40 matrix).
The training resulted in six weight matrices, given the two types of inputs (Gaussian and Raw) and the three learning models (Memory-Based, MBL; Widrow-Hoff, WH; and Temporal Difference, TD). These matrices were used to predict phones in the test dataset (N = 207,399, or 10% of all trials). Raw MFCCs were fed to the algorithms as input and were weighted by the weights from the six respective matrices, learned under the distinctive learning regimes. We used weighted input sums, also called total or net input, as the current activation for each of the 40 phone outcomes, and the absolute length (1-norm) of the weighted sum indicated the diversity or the competition among the possible outcome phones (for details and prior use, see Milin, Feldman, et al., 2017). The prediction of the system was the phone with the highest ratio of activation over diversity, or support over the competition.
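This decision rule can be sketched as follows (Python; computing the diversity per outcome column is our reading of the '1-norm of the weighted sum', not the authors' exact implementation):

```python
import numpy as np

def predict_phone(W, c):
    """Pick the outcome with the highest support over competition:
    activation (net input) divided by diversity (1-norm of the
    cue-weighted contributions to that outcome)."""
    weighted = W * c[:, None]                 # each weight scaled by its cue
    activation = weighted.sum(axis=0)         # net input per outcome
    diversity = np.abs(weighted).sum(axis=0)  # competition per outcome
    support = np.divide(activation, diversity,
                        out=np.full_like(activation, -np.inf),
                        where=diversity > 0)
    return int(np.argmax(support))

# toy demo: outcome 0 gets consistent support, outcome 1 mixed support
winner = predict_phone(np.array([[1.0, -1.0], [1.0, 1.0]]),
                       np.array([1.0, 1.0]))
```

The ratio rewards outcomes whose evidence all points the same way, rather than outcomes with large but conflicting weights.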
In addition to the ECLs we also trained an MBL with k-Nearest Neighbors, using Euclidean distances and k = 7 neighbors. The memory resource was built from corresponding input-outcome data pairs: NRaw = 1,866,599 pairs for Raw and NGaussian = 4,000 pairs for Gaussian data. Next, MFCC input cues were used to find the 7 nearest neighbors in the test data, and the outcome of a majority vote was considered as the predicted phone. That prediction was then weighted by the system's confidence (i.e., the probability of that majority vote against all other votes). For all MBL simulations we made use of the class package (Venables & Ripley, 2002) in the R software environment (R Core Team, 2020).
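The authors used the class package in R; the same idea in outline (a Python sketch, our own implementation):

```python
from collections import Counter

import numpy as np

def mbl_predict(train_X, train_y, x, k=7):
    """k-Nearest-Neighbor prediction with a confidence-weighted
    majority vote: Euclidean distance, k neighbors, confidence
    expressed as the majority's share of the k votes."""
    dists = np.linalg.norm(train_X - x, axis=1)
    neighbors = train_y[np.argsort(dists)[:k]]
    label, votes = Counter(neighbors).most_common(1)[0]
    return label, votes / k  # predicted phone and its confidence

# toy demo: the three nearest stored exemplars all carry label "a"
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array(["a", "a", "a", "b", "b"])
label, conf = mbl_predict(X, y, np.array([0.05]), k=3)
```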
3. Results and Discussion

The following four subsections present independent but related pieces of evidence for assessing the learnability of abstract categories, given MFCC input. Whether or not phones are learnable can be inferred from success in predicting new and unseen data from the same speaker (Section 3.1) and, more challengingly, from another speaker (Section 3.2). At the more general level, questions such as reliability and generalizability are of crucial importance. We will, hence, explore how consistent or reliable the success in predicting phones is across multiple random samples of the data (Section 3.3). Finally, it is important to understand whether what is learned generalizes vertically: do we see, and if so to what extent, an emergence of plausible phone clusters, resembling traditional classes, directly from the learned weights (Section 3.4)?
3.1. More of the same: predicting new input from the same speaker

We can consider our simulations as, essentially, a 3 × 2 experimental design, with as factors Learning Model (Memory-Based, Widrow-Hoff, and Temporal Difference) and Type of Input data (Raw vs. Gaussian). The success rates our three algorithms achieve in predicting unseen data, given Raw or Gaussian input, are summarized in Table 1.
Table 1. Success rates of predicting the correct phone for three different learning models (Widrow-Hoff, Temporal Difference, and Memory-Based Learning) using two different types of input data (Raw or Gaussian). The leftmost column represents the probability of a given phone in the Raw training sample.

phone     Sampling      Raw                      Gaussian
          Probability   MBL     WH     TD        MBL     WH     TD
silence   8.03          54.26   41.58  52.33     58.19   34.58  34.51
aa        2.12          42.51   0.63   4.20      38.86   17.32  16.55
ae        2.64          52.82   19.78  30.02     51.66   29.02  31.26
ah        5.49          26.31   11.94  11.65     24.12   32.21  30.03
ao        1.02          38.24   5.38   8.33      23.13   19.15  19.34
aw        0.79          27.88   0.64   2.09      13.18   11.62  15.38
ay        2.58          41.03   0.00   10.14     28.35   23.83  26.13
b         1.22          46.46   11.24  7.50      16.96   23.42  24.92
ch        0.86          24.95   11.11  0.00      13.61   8.68   8.44
d         2.43          41.85   3.24   2.57      17.55   15.63  15.12
dh        0.07          17.65   0.00   0.00      1.39    0.66   0.74
eh        2.36          26.36   0.00   1.55      24.46   21.46  18.08
er        2.83          37.78   19.23  24.03     32.91   24.96  24.61
ey        2.26          38.61   0.56   4.98      29.36   18.92  16.19
f         1.66          42.94   15.34  20.31     32.49   18.71  14.64
g         0.84          51.01   5.67   8.00      14.41   5.94   5.44
hh        0.49          44.83   6.07   4.12      11.22   7.24   8.17
ih        3.97          28.00   12.34  9.90      25.00   35.26  31.93
iy        4.41          64.00   0.99   13.17     64.75   45.84  39.40
jh        0.76          39.87   2.58   2.83      17.65   8.47   9.21
k         4.03          66.56   22.09  27.87     41.05   21.16  21.41
l         5.75          65.68   23.02  25.87     73.81   47.24  45.38
m         2.75          68.34   26.17  26.91     49.61   43.30  37.75
n         5.79          64.07   27.46  40.95     69.66   59.99  58.47
ng        1.84          64.54   19.72  59.12     45.05   30.56  33.15
ow        1.62          44.91   1.63   7.79      42.60   27.20  25.21
oy        0.15          13.35   0.00   0.00      3.84    2.18   1.90
p         2.04          44.08   26.51  42.66     11.82   22.08  22.56
r         4.32          44.34   51.43  49.82     52.37   36.57  32.48
s         9.57          60.89   2.30   50.17     66.07   53.27  55.51
sh        1.86          52.89   11.93  0.00      49.32   19.27  19.15
t         5.00          33.68   21.68  22.53     19.64   20.52  20.59
th        0.26          15.07   0.00   0.00      3.52    1.68   1.67
uh        0.17          7.44    0.00   0.00      0.86    0.74   0.86
uw        0.94          53.29   1.65   4.49      40.65   16.70  9.01
v         1.00          35.75   9.97   3.75      14.69   15.46  15.17
w         0.81          52.31   0.36   1.08      18.41   14.66  13.01
y         0.39          28.63   4.00   5.42      7.11    2.18   2.01
z         4.78          45.08   17.26  19.08     41.43   29.14  29.70
zh        0.08          27.46   0.23   0.35      3.69    1.16   1.16
Overall, the results reveal above-chance performance against both the naive or 'flat' probability of any given phone occurring by chance, which stands at 2.5% assuming an equiprobable distribution of the 40 outcome instances, and the probability of the phone occurring in the sample, which is, essentially, its relative frequency in the training sample. The error-correction learning rules (WH and TD) performed below the sampling probability only in a few cases, and only when using raw training data. Out of 40 instances, Widrow-Hoff underpredicted 12 phones, while Temporal Difference underpredicted 7 phones. MBL, on the other hand, always performed above the baselines set by chance and sampling probability.
The strength of the correlation between the probability of the phone in the test sample, on the one hand, and the success rates of the three learning models, on the other hand, adds an interesting layer of information for assessing the overall efficiency of the chosen learning models. To bring this out, we made use of Kendall's tau-b, a Bayesian non-parametric alternative to Pearson's product-moment correlation coefficient. Table 2 contains these correlation coefficients and reveals two interesting insights.

Table 2. Bayesian Kendall tau-b between the probability of a phone in the test sample and the combination of Learning Model (MBL, WH, TD) and Type of Data (Raw and Gaussian).

                    Raw                     Gaussian
                    MBL    WH     TD        MBL    WH     TD
Test-sample Prob.   0.349  0.467  0.576     0.638  0.753  0.723
Firstly, despite the fact that, for all three learning algorithms, the Gaussian dataset did not reflect the sampling probabilities of the respective phonemes (100 generated data points per phone; N = 4,000 total training sample size), the correlation with the phones' probabilities in the test sample increased considerably (importantly, the correlation between probabilities in the training and the test sample was near-perfect: r = 0.99998). In other words, even though the number of exemplars, i.e., the extent of exposure, was kept constant across all phones, the algorithms' successes resembled the likelihood of encountering the phones in the Raw samples, both for training and testing. Generally speaking, this then represents indirect evidence of learning rather than of simply passing through the equiprobable expectations (given that each outcome had a 2.5% chance to occur).
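For reference, the classical (non-Bayesian) tau-b point estimate behind correlations of this kind can be computed as follows (Python sketch; the paper reports Bayesian estimates, which this does not reproduce):

```python
import numpy as np

def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, normalized
    with a correction for ties in both variables."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    conc = disc = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                ties_x += 1; ties_y += 1
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    n0 = n * (n - 1) / 2
    return (conc - disc) / np.sqrt((n0 - ties_x) * (n0 - ties_y))
```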
Secondly, these correlation coefficients are considerably higher for Gaussian data than for their Raw counterparts. Likewise, the success rates for the two ECL algorithms increase. However, the opposite holds for the MBL algorithm: MBL shows a higher correlation for Gaussian data but combines this with a lower success rate when that same Gaussian data is used as stored exemplars.
Taken together, these two observations indicate that the ECL algorithms can do more with less. In addition, they are sensitive to the order of exposition, or the trial order effect, in the case of Raw data: the Raw data was fed to the learning machines in a particular order of time-overlapping MFCC-phone pairs, while the Gaussian data was computer-generated and randomized. The incremental nature of ECL makes many models from this family deliberately sensitive to the order of presentation in learning.

For example, the Rescorla-Wagner (1972) rule, the discrete data counterpart of the Widrow and Hoff (1960) rule, is purportedly susceptible to order. This is, in fact, the core reason why it can simulate the famous blocking effect ("Predictability, surprise, attention, and conditioning", n.d.). Conversely, MBL simply stores all data in as-is fashion and is, consequently, blissfully unaware of any particular order in which the data comes in and, for that matter, does not assign any particular value to the order (assuming unlimited storage).
To examine whether there is an interaction between Learning Model and Type of Input data we applied Bayesian Quantile Mixed Effect Modeling in the R software environment (R Core Team, 2020) using the brms package (Bürkner, 2018). Bayesian Quantile Modeling was chosen as a particularly robust and distribution-free alternative to Linear Modeling. We examined the predicted values of the response distribution and their (Bayesian) Credible Intervals (CI) at the median point: Quantile = 0.5 (Schmidtke, Matsuki, & Kuperman, 2017; Tomaschek, Tucker, Fasiolo, & Baayen, 2018, provide more details about criteria for choosing the criterion quantile-point). The final model is summarized in the following formula:

SuccessRate ∼ LearningModel ∗ TypeOfInput +
(1 | phone)

The dependent variable was the success rate, and the two fixed effect factors were learning model (3 levels: MBL, WH, TD) and type of input data (2 levels: Raw, Gaussian). Finally, 40 phone categories represented the random effect factor. The model summary is presented in Figure 1.
Figure 1. Conditional effect of Learning Model and Type of Input on Success Rates with respective 95% Credible Intervals. The black dashed line represents the flat chance level at 2.5%.
If we compare MBL and ECL (jointly WH and TD), first, we see that MBL shows a better overall performance (MBL > ECL: Estimate = 11.24; Low 90% CI = 8.56; High 90% CI = 13.95; Evidence Ratio = Inf). That difference, however, attenuates when the Gaussian sample is used (MBL > WH: Estimate = 6.49; Low 90% CI = 0.49; High 90% CI = 12.32; Evidence Ratio = 0.04; MBL > TD: Estimate = 5.88; Low 90% CI = 0.04; High 90% CI = 11.69; Evidence Ratio = 0.05). Thus, we can conclude that MBL performs better with more data (Raw), which is to be expected as this type of dataset provides a very large pool of memorized exemplars. The two ECLs, on the other hand, show improved performance with Gaussian data. The improvement is greater for WH than for TD, but that difference is only weakly supported (Estimate = 2.42; Low 90% CI = −1.36; High 90% CI = 6.21; Evidence Ratio = 5.86).
Interestingly, less is more for ECL models: less data gives an overall better performance. Furthermore, with Gaussian data, a 'cleverer' learning model (i.e., Temporal Difference) does not yield better results. It seems that more data leads to overfit (for a detailed discussion, see also Milin, Madabushi, et al., 2020). Although it does not cancel out the possibility of overfitting the data, our interpretation of this finding is that the effect is, most likely, yet another consequence of ECL models' sensitivity to the order of trials. As explained above, the Raw data was presented in a particular order, and inasmuch as the ECL models are sensitive to this effect (Milin, Madabushi, et al., 2020), the MBL model is immune to it. Further to this, the advantage that TD shows in comparison to WH with Raw data stems from the fact that an immediate reward strategy (as implemented in WH) performs worse than a strategy which considers future rewards when the signal is weak, that is, when the data is noisy (for an interesting discussion in the domain of applied robotics see Whelan, Prescott, & Vasilaki, 2020). When that noise is purely Gaussian, however, information about the future does not yield much advantage.

Generally speaking, with Gaussian data, the differences between the success rates of the three learning models become insignificant: while MBL becomes less successful, both ECL models benefit. Larger samples with more Gaussian generated data, however, do not improve the performance of ECL further. In fact, larger samples of N = 1,000; 10,000; and 1,866,600 (the last to match the size of the Raw data sample) consistently worsened their prediction success (respective averages for Widrow-Hoff learning are: MN=100 = 21.70%; MN=1K = 19.49%; MN=10K = 14.75%; MN=1.8M = 12.28%).
3.2. Spicing things up: predicting input from a different speaker
The downward trend of success exhibited by WH learning that is observed as the size of the Gaussian generated sample increases provides indirect support for the assumption that ECL learning rules may indeed be susceptible to overfitting as the data becomes abundant. This sheds light on yet another aspect of the opposition between the two learning principles, MBL and ECL: what 'pleases' the former harms the latter, and vice versa. This difference begs the question of whether fidelity to the data leads to a dead-end once the task becomes even more demanding; that is when, for example, the aim is to predict a different speaker.
For this we use what was learned in the previous step: directly stored pairs of input MFCCs and outcome phones for MBL (NRaw = 1,866,599 stored exemplars) and the resulting WH and TD weight matrices (using the same NRaw = 1,866,599 learning trials), to predict phones from MFCCs that were generated by a new female speaker. The dataset consisted of N = 1,699,643 test trials from the production of about 27,000 words. To test the ECL models we used the two weight matrices from training on Raw input, one with WH and another with TD. Then, raw MFCCs from the new speaker were weighted with weights from the two corresponding matrices and then summed to generate the activation for each of the 40 phone outcomes (i.e., the net input), together with their diversity (the absolute length or 1-norm) (cf. Milin, Feldman, et al., 2017). As before, the predicted phone was the one with the strongest support over the competition (the ratio of activation over diversity). To test the MBL model, new MFCC input was 'placed' among the 7 Nearest Neighbors to induce the majority vote, weighted by the system's confidence in that vote (i.e., the size or probability of that majority). The phone winning the majority vote was the predicted phone.
The results from this set of simulations are unanimous: the task of predicting a
differentspeakeristoodifficultandallthreelearningalgorithmsperformedatchance
level.Theestimatedperformance(successrates)wasgivenbythefollowingBayesian
QuantileMixedEffectModel:
SuccessRate∼LearningModel+
(1|phone)
Theresultingestimatesforthethreelearningmodelsare:MBLEstimate=2.28(Low
95%CI=1.29;High95%CI=3.36),TDEstimate=1.86(Low95%CI=0.97;High95%
24
CI=2.93),WHEstimate=1.69(Low95%CI=0.83;High95%CI=2.79).Inotherwords,
allthreelearningalgorithmsfailedtogeneralizetoanewspeaker.
In an attempt to better understand the results we generated confusion matrices for each model (see supplementary material). The confusion matrices for the first speaker show that for the MBL model the confusion matrix mirrored what we might expect from a phonetic/perceptual perspective. For example, vowels were most confusable with vowels and /s/ was confusable with /z/. This does not carry over for the TD and WH models, where the confusions seem more random. In the confusion matrices for the new speaker the results are much closer to chance for most of the phones. As in the confusion matrices with the original speaker, the MBL model mirrored what might be expected from a perceptual perspective. Generally, across all the confusion matrices we find a bias for silence in the responses. This confusion is probably at least in part caused by the fact that part of the segment for voiceless stops and affricates consists of silence.
3.3. Assessing the consistency of learning
The different and task-dependent profiles of ECL and MBL models pose questions
regarding the reliability of different learning approaches. In otherwords, does their
performance remain stable across independent learning sessions (i.e., simulated
individual learners’ experiences) or is their success or failure rather a matter of
serendipity?Inprinciple,learningisadomaingeneralandhigh-levelcognitivecapacity
(c.f.,Milin,Nenadi´c,&Ramscar,2020)and,hence,itoughttoshowadesirablelevelof
consistencyorresiliencetorandomvariationsand,atthesametime,allowadesirable
level of systematicdifferences across individuals. Failure to achieve consistency and
predictability across different sets of learning events and/or learners (individuals),
woulddisqualifyalearningmodelasaplausiblecandidate-mechanism.
Given the fact that a two-year-old child's receptive vocabulary consists of approximately 300 different words (Hamilton, Plunkett, & Schafer, 2000), we sampled without replacement 5 random samples of 300 words each. These 300-word samples mapped to a somewhat different number of input MFCC and output phone pairs. As before, such input-output pairs were used in the simulations. First, each of the 5 samples of MFCC-phone pairs (i.e., learning events) was replicated 1,000 times to simulate solid entrenchment. Next, to keep these learning events as realistic as possible and, at the same time, to prevent overfitting, we added Gaussian noise (with a small deviation) to 50% randomly chosen MFCCs: this reflects minor variations in pronunciation across trials. All five random samples were used to train our three main learning algorithms, both ECLs (Widrow-Hoff and Temporal Difference) and MBL (using Euclidean distances to find the 7 Nearest Neighbors), in five independent learning sessions.
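The construction of one such simulated learning session might look like this (Python sketch; noise_sd is an assumed value, since the paper specifies only a 'small deviation'):

```python
import numpy as np

def build_session(mfccs, phones, reps=1000, noise_sd=0.1, rng=None):
    """Replicate one sample's MFCC-phone pairs to simulate
    entrenchment, adding small Gaussian noise to a random 50% of the
    replicated MFCC rows to mimic pronunciation variability."""
    rng = np.random.default_rng(rng)
    X = np.tile(mfccs, (reps, 1))
    y = np.tile(phones, reps)
    mask = rng.random(len(X)) < 0.5              # 50% of rows get noise
    X[mask] += rng.normal(0, noise_sd, size=(mask.sum(), X.shape[1]))
    return X, y

# toy demo: 3 learning events replicated 10 times
X, y = build_session(np.zeros((3, 2)), np.array([0, 1, 2]), reps=10, rng=0)
```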
In parallel, we generated 5 test samples. Each test sample consisted of 200 words in total, half of which were 'known' words, sampled from the corresponding training sample, while the other half of words were 'new', not previously encountered. As before, WH and TD learning yielded learning weight matrices, while MBL simply stored all training exemplars as unchanged input-output pairs. Then, the input MFCCs from the test samples were used to predict phone outcomes, as before: (a) for WH and TD the predicted phone was the one with the strongest support against the competition (i.e., the highest ratio of activation over diversity); (b) MBL calculated Euclidean distances to retrieve k = 7 nearest neighbors, and to determine their majority vote, weighted by the probability of that vote against all other votes.
This simulation provided success rates for each of the three learning algorithms, separately for each of the five individual learning sessions (generated samples), and for each of the forty phones. Again, we made use of Bayesian Quantile Mixed Effect Modeling as implemented in the brms package (Bürkner, 2018) in R, and evaluated the model at the median point. This model, too, used the success rate as dependent variable, and the learning model as fixed effect factor (with 3 levels: MBL, WH, TD). In addition to this, it contained two random effects: 40 phones and 5 individual learning sessions:

SuccessRate ∼ LearningModel +
(1 | phone) +
(1 | Session)
Consistent with our previous findings, MBL showed a significantly better performance (MBL > ECL: Estimate = 45.49; Low 90% CI = 42.04; High 90% CI = 48.76; Evidence Ratio = Inf). The TD model achieved a marginally better success rate than the WH model (TD > WH: Estimate = 2.00; Low 90% CI = −0.61; High 90% CI = 4.65; Evidence Ratio = 8.83). Strikingly, however, closer inspection of the models' performance revealed that in 2 out of 5 sessions MBL's average success was in fact equal to zero. Across the five sessions, the median success rates for MBL were: 72.51, 0.67, 68.00, 71.47, 0.00. Thus, we can conclude that in 40% of cases our simulated 'individuals' did not learn anything at all, while in the other 60% they were luckily successful.
Thus, our follow-up analysis used Median Absolute Deviations (MAD), a robust alternative measure of dispersion, as the target dependent variable. The MAD was calculated for each phone across the five learning sessions. The resulting dataset was submitted to the same Bayesian Quantile Mixed Effect Modeling to test the following model (note that Sessions entered into the calculation of MADs, which is why they could not appear as a random effect):

MAD ∼ LearningModel +
(1 | phone)
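The MAD itself is straightforward (Python sketch, unscaled); applied to MBL's five session medians reported in Section 3.3 (72.51, 0.67, 68.00, 71.47, 0.00), it quantifies the cross-session instability:

```python
import numpy as np

def mad(values):
    """Median Absolute Deviation: the median of absolute deviations
    from the median; a robust alternative to the standard deviation."""
    v = np.asarray(values, dtype=float)
    return float(np.median(np.abs(v - np.median(v))))

# MBL's five session medians from Section 3.3
mbl_spread = mad([72.51, 0.67, 68.00, 71.47, 0.00])
```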
The results confirmed the previously observed significantly higher inconsistency, as indicated by higher MAD, of MBL vs. ECL (MBL > ECL: Estimate = 5.03; Low 90% CI = 2.37; High 90% CI = 7.96; Evidence Ratio = 1332.33). The two ECL models, however, did not differ significantly among themselves (TD > WH: Estimate = −0.97; Low 90% CI = −3.68; High 90% CI = 1.83; Evidence Ratio = 2.55).
Taken together, these results cast doubt on MBL as a viable and plausible mechanism for learning phones from auditory input. On average, MBL performs rather well, but it fails the reliability test and, consequently, cannot be considered a plausible mechanism for learning phones from raw input: too many learners would not learn anything at all. Conversely, while ECL models show poorer average performance, they display higher resilience and consistency across individual learning sessions.
3.4. Abstracting coherent groups of phones
From the results presented so far we can be cautiously optimistic about what can be learned from the MFCC input signal. This seems reasonable considering this is the input for most automatic speech recognition systems. There is some systematicity (for details see Divjak, Milin, Ez-zizi, Józefowski, & Adam, 2020; Milin, Nenadić, & Ramscar, 2020), and that appears to be picked up by learning models. This is, of course, relative to the task at hand, which in this case was predicting a phone from new, raw input MFCCs (the test dataset).
The last challenge for our learning models is to examine whether the learned weights facilitate the meaningful clustering of phones into traditionally recognized groups. We have assessed the potential of the three learning models to generalize horizontally, i.e., to predict new data from the same speaker and to predict new data from a new speaker. Since traditional classes of phones are well established in the literature, the question is whether our learning models also generalize vertically. That is, do they yield emerging abstractions of plausible groups of phones which should, at least to a notable extent, match and mirror the existing phonetic groupings of sounds?
To facilitate deliberations on the models' ability to make vertical generalizations, we applied hierarchical cluster analysis using the pvclust package (Suzuki, Terada, & Shimodaira, 2019) in the R software environment (R Core Team, 2020) to the weight matrices. We focus here on the results for the most and the least successful learning simulations, i.e., WH and MBL using Raw data, and discuss their differences in generating 'natural' groups of phones. We also add the results for TD learning on Raw data, for completeness. The three cluster analyses do not only present the extremes given their prediction successes, but are also the most interesting. Further cluster analyses can be run directly on the weight matrices provided.

All cluster analyses presented here used Euclidean distances and Ward's agglomerative clustering method, with 1,000 bootstrap runs. The results are summarized in Figures 2, 3, and 4. The R package pvclust allows assessing the uncertainty in hierarchical cluster analysis. Using bootstrap resampling, pvclust makes it possible to calculate p-values for each cluster in the hierarchical clustering. The p-value of a cluster ranges between 0 and 1, and indicates how strongly the cluster is supported by data. The package pvclust provides two types of p-values: the AU (Approximately Unbiased) p-value (in red) and the BP (Bootstrap Probability) value (in green). The AU p-value, which relies on multiscale bootstrap resampling, is considered a better approximation to an unbiased p-value than the BP value, which is computed by normal bootstrap resampling. The grey values indicate the order of clustering, with smaller values signaling clusters that were formed earlier in the process, and hence contain elements that are considered more similar. The first formed cluster in Figure 2, for example, contains /dh/ and /y/, while the second one contains /oy/ and /uh/. In a third step, the first cluster containing /dh/ and /y/ is joined with /th/.
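The dendrogram structure (though not pvclust's bootstrap AU/BP p-values) can be reproduced with standard tools; a Python sketch with assumed toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_phones(weights, phone_labels, n_groups):
    """Ward agglomerative clustering of phones, each phone represented
    by its column of learned weights; Euclidean distance is scipy's
    default for the 'ward' method."""
    Z = linkage(weights.T, method="ward")
    groups = fcluster(Z, t=n_groups, criterion="maxclust")
    return dict(zip(phone_labels, groups))

# toy weight matrix: two well-separated pairs of hypothetical 'phones'
toy = cluster_phones(np.array([[0.0, 0.1, 9.0, 9.1],
                               [0.0, 0.2, 9.0, 9.2]]),
                     ["p1", "p2", "p3", "p4"], n_groups=2)
```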
Figure 2. Dendrogram of phone clustering using Widrow-Hoff weights from training on Raw data.
The Widrow-Hoff learning weights appear to yield rather intuitive phone groups (Figure 2). Specifically, the left-hand side of the dendrogram reveals small and relatively meaningful clusters of phones. On the right-hand side of the same panel, however, there is a poorly differentiated lump consisting of the 20 remaining phones that do not yet appear to have been learned sufficiently well to form meaningful (i.e., expected) groupings. Nevertheless, some interesting patterns emerge. For example, the voiceless stops cluster together with silence in a later step; considering that a good portion of the voiceless stop is likely silence, this clustering seems fairly logical. The voiceless alveolar fricative /s/ occurs largely by itself, and then there is another cluster containing the voiced labial stop /b/, the voiced alveolar stop and lateral /d, l/, and the voiceless and voiced labio-dental fricatives /f, v/. This cluster is largely made up of voiced sounds articulated in the front of the vocal tract. The third cluster is largely made up of vowels, a sonorant, and the voiceless palato-alveolar fricative /sh/. This cluster seems to contain most of the vowels along with /r/ and /er/, which also have a fairly open vocal tract during articulation. However, the /sh/ is a surprise, and at this point, any possible explanations as to why /sh/ is in this cluster would be guesswork. The last cluster contains the remaining 20 phones, which seem to be grouped in a mixture of the remaining sounds with no apparent pattern. Note in passing that the WH dendrogram also passes the litmus test of phone recognition, i.e., the distinction between /l/ and /r/. In Figure 2 the /l/ and the /r/ individually belong to clusters that are estimated to replicate on different data.
The results for clustering TD weights are less likely to replicate. Few clusters achieve an AU value of 95 or higher. However, the clustering is highly similar to the WH dendrogram. There is a later-formed cluster containing the voiceless stops /k, p, t/ on the left-hand side, next to a cluster linking /b, d, m, f, v, g, hh, w/, which includes the voiced stops and some of the bilabial and labio-dental consonants. A third reliable cluster, on the right-hand side, unites 13 phones. These phones are largely made up of affricates, though there are a few vowels and a nasal in this group as well. The fourth reliable cluster is so high-level that it links 35 out of the 40 phones together. Note that /s/ and silence are not included in this cluster. Here we also note that the distinction between /l/ and /r/ is less reliable than in the WH analysis.
Figure 4 tells a very different story, or rather: no story at all. The clusters make little sense from a phonetic perspective. Sounds do not seem to cluster together as would be expected, and no alternative pattern becomes apparent either. The p-values for the clusters in Figures 2 and 4 reveal some striking differences. Despite some quibbles with the WH phone groups (Figure 2), discussed above, many of the clusters in this solution are rather likely to replicate, and this applies in particular to mid- and high-level clusters (formed later, cf. the grey values) on the left-hand side of the diagram. The clusters in the MBL solution, however, show the opposite tendency, where only one cluster (consisting of /oy/ and /uh/) receives an AU p-value equal to or higher than .95. In particular, higher-order clusters fare poorly in this respect, with all but one receiving no chance at all of replicating in a different dataset. Note, also, that the values on the vertical axis are very different for WH/TD and MBL: MBL clusters start forming at value 0.5, by which point the WH and TD cluster solutions have already converged. The chance-like behavior of MBL in clustering what is learned is in line with other results presented here, and in particular with the highly inconsistent and serendipitous nature of 'learning' exhibited by this particular model.
Figure 3. Dendrogram of phone clustering using Temporal Difference weights from training on Raw data.
In summarizing the results of the cluster analysis, two points appear to be of particular interest. First, as pointed out before, learning 'success' does not exist in a vacuum or as a universal notion, but is relative to performance on the task at hand. In that respect, and in the context of the task of predicting new input or indicating higher-order relationships between phones, we can conclude that MBL performs better on predicting unheard input (particularly when the set of exemplars is sufficiently large), but worse on generating meaningful natural groups of phones. As Nathan (2007) concluded: whatever the role of exemplars might be, they do not make phonemic representations superfluous.

The WH rule shows exactly the opposite performance, and is better in generating meaningful natural groups of phones but worse in predicting unheard input. Second, the meaningfulness of the obtained phone groups varies from those that are highly plausible to some that are not sufficiently learned but make a 'primordial soup' of diverse phones in a rather unprincipled way. Whether humans experience the same trajectory in the hypothesized acquisition of such phone groups is, however, beyond the scope of this study.
Figure 4. Dendrogram of phone clustering using Memory-Based probabilities from training on Raw data.
4. General discussion

We set out to test two opposing principles regarding the development of language knowledge in linguistically untrained language users: Memory-Based Learning (MBL) and Error-Correction Learning (ECL). MBL relies on a large pool of veridically stored experiences and a mechanism to match new ones with already stored exemplars, focusing on efficiency in storage and retrieval. ECL filters the input data in such a way that the discrepancy between the system's prediction of the outcome (as made on the basis of available input) and the actual outcome is minimized. Through filtering, those dimensions of the experience are removed that are not useful for reducing the prediction error. What gets stored, thus, changes continuously as the cycles of prediction and error-correction alternate. Such an 'evolving unit' cannot be a veridical reflection of an experience, but contains only the most useful essence, distilled from many exposure-events (or usage-events viz. Langacker, 1991).
A process of generalization underlies the abstractions linguists operate with, and we probed whether MBL and/or ECL could give rise to a type of generalized language knowledge that resembles linguistic abstractions. Each model was presented with a significant amount of speech (approximately 5 hours) produced by one speaker, created from the MALD stimuli (Tucker et al., 2019). The speech signals were transformed into MFCC representations and then paired with the 1-hot encoded outcome phone.

Against such data we pitted three learning rules: a memory-greedy one (MBL), a short-sighted one (WH), and one that explicitly takes into account future data (TD). These three rules, again, map onto the principles of Memory-Based vs. Error-Driven Learning. We put both types of learning algorithms (MBL vs. WH and TD) through four tests that probe the quality of their learning by assessing the models' ability to predict data that is very similar to the data they were trained on (same speaker, different words) as well as data that is rather different from the data they were trained on (different speaker, same/different words). We also assessed the consistency or stability of what the models have learned and their ability to give rise to abstract categories. As expected, both types of models fare differently with regard to these tests.
4.1. The learnability of abstractions
While MBL outperforms ECL in predicting new input from the same speaker, this is only the case for Raw data; with Gaussian data, the learning models do not differ significantly in their performance. Interestingly, for ECL less is more: training WH and TD on fewer data points with Gaussian noise yields better prediction results than training WH on Raw and big data. This indicates that WH may in fact overfit in the case of larger datasets, essentially overcommitting to the processed input-outcome data. Another plausible interpretation of these results is that ECLs, when learning from Raw data, are affected by the order of trials. Yet, neither an order effect nor overfitting to input data is necessarily unnatural or negative, as they represent known phenomena in, e.g., language acquisition more generally (see, for example, Arnon & Ramscar, 2012; Ramscar & Yarlett, 2007) as well as in L1 and L2 learning of phonetic input specifically.
Infants, at least in Western cultures, typically have one primary caregiver, which makes it likely that they, too, 'overfit their mental models' to the input of that particular caregiver. There is evidence from learning in other sensory domains, i.e., vision, that casts this approach in a positive light: infants use their limited world to become experts at a few things first, such as their caregiver's face(s). The same, very few faces are in the infant's visual field frequently and persistently, and these extended, repeated exposures have been argued to form the starting point for all visual learning (Jayaraman & Smith, 2019).
L2 learners are also known to overfit their input data, and this tendency has been found to hinder their progress (for a review, see Barriuso & Hayes-Harb, 2018). To remedy the problem, researchers have experimented with high versus low variability phonetic training. High variability conditions include input with multiple talkers and contexts, and are hypothesized to support the formation of robust sound categories and boost the generalizability to new talkers. While it has been shown that adult L2 learners benefit from such training (for the initial report, see Lively, Logan, & Pisoni, 1993), child L2 learners do not appear to benefit more from high variability training than from low variability training (Barriuso & Hayes-Harb, 2018; Brekelmans, Evans, & Wonnacott, 2020; Giannakopoulou, Brown, Clayards, & Wonnacott, 2017).
However, while overfitting may be an advantage in some learning situations, it may
cause challenges in others. The use of a single speaker may also introduce a challenge to
the model, not dissimilar to the long-known speaker normalization problem in speech
perception (e.g., Ladefoged & Broadbent, 1957). Speaker normalization refers to the
process by which a listener adapts to sounds (typically vowels) produced by different
speakers, and denotes the line of research in speech perception that investigates how
listeners achieve this adaptation. Normalization is needed because different speakers
have individual characteristics, and the acoustic characteristics of a vowel produced by
a male speaker may be similar to the acoustic characteristics of an entirely different
vowel produced by a female speaker (e.g., Barreda & Nearey, 2012). In our modeling,
the learner has not been given an opportunity to acquire knowledge about other
speakers, and as a result it may not have learned to create categories that are
speaker-independent.
At some point, however, learners are exposed to input produced by another speaker
than their primary caregiver or teacher. We simulated this situation by exposing our
models, trained on input from one speaker, to input produced by a novel speaker. This
time, unfortunately, all learning models performed at chance level, showing that they
were unable to generalize what they had learned to a new speaker. This did not come as
a complete surprise, however, as the challenge was deliberately extreme, not allowing
for any prior experience with that speaker at all. Incrementally adding input from
different speakers into the mix will be informative about how long and to what extent
the model can or must keep on learning to be able to generalize to new, previously
unheard speakers.
The idea of individual differences in early learning was addressed in a systematic
fashion. Somewhat simplistically, we simulated 5 child-like learners that each
experienced plenty of minimally different examples of a limited number of words,
spoken by one and the same speaker. When tested on seen and unseen words, MBL
appeared to outperform ECL, but closer inspection of the model revealed that this
advantage applied to a non-existent average only: 2 out of the 5 learners who took an
MBL approach had learned nothing at all. This, then, casts doubt on the cognitive
plausibility of this particular learning approach, which performs desirably at the
population level but fails many individuals. In contrast to MBL, both ECL
models showed lower success on average, yet they are the more reliable choice across
individuals.
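The 'non-existent average' is easy to see numerically; the accuracies below are invented for illustration and are not the scores from our simulations.

```python
# Hypothetical accuracies for 5 simulated learners: 3 learn well, 2 learn nothing.
accuracies = [0.9, 0.85, 0.8, 0.0, 0.0]

mean_acc = sum(accuracies) / len(accuracies)   # population-level summary
n_failed = sum(a == 0.0 for a in accuracies)   # individual-level reality

print(round(mean_acc, 2))  # 0.51 -- looks like modest overall success...
print(n_failed)            # 2   -- ...yet 2 of 5 learners acquired nothing
```

A population mean above chance is thus perfectly compatible with a learning mechanism that fails outright for a substantial subset of individuals, which is why per-learner inspection matters.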
In a last step, we tested to what extent weights (from the two ECLs) and probabilities
(from MBL), which we understand as summative proxies of learning, facilitate the
meaningful clustering of phones into traditionally acknowledged groups.
Even though ECL was trained on its least favorite type of data (Raw), it performed
reasonably well: it gave rise to a dendrogram that allocated at least half of the phone
inventory to groups that resemble their traditional phonetic classification. Furthermore,
the groups it identified were estimated to have a high chance of being detected in new,
independent datasets. MBL, on the other hand, performed exceptionally poorly, even
though it was trained on its favorite type of data (again, Raw). The absence of any
recognizable systematicity in the MBL-based clustering, and the estimated lack of
replicability of the clusters, is striking indeed.
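The clustering step can be illustrated with a minimal Python analogue of agglomerative (Ward) clustering over per-phone weight vectors; the vectors below are invented, and the sketch omits the bootstrap-based cluster probability estimates (cf. the R package pvclust in the reference list) that underpin the replicability assessment.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical learned representations: one weight vector per phone.
# Rows are phones; columns are cue weights (invented for illustration).
rng = np.random.default_rng(1)
vowel_like = rng.normal(loc=1.0, scale=0.2, size=(4, 8))
stop_like = rng.normal(loc=-1.0, scale=0.2, size=(4, 8))
weights = np.vstack([vowel_like, stop_like])

# Agglomerative clustering (Ward linkage) builds the dendrogram; cutting it
# at two clusters recovers the two underlying groups of phones.
Z = linkage(weights, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

assert len(set(labels[:4])) == 1 and len(set(labels[4:])) == 1
assert labels[0] != labels[4]
```

If the learned weights for phones of the same traditional class lie close together in cue space, the dendrogram will group them; the degree to which it does so is what our clustering test quantifies.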
4.2. Implications for a theoretical discussion of abstractions
Generalization is one of the most basic attributes of the broad notion of learning. To
quote one of the pioneers of Artificial Neural Networks (ANNs), intelligent Signal
Processing and Machine Learning, the point here is that "if generalization is not needed,
we can simply store the associations in a look-up table, and will have little need for a
neural network", where the "inability to realize all functions is in a sense a strength [...]
because it [...] improves its ability to generalize" to new and unseen patterns (Widrow
& Lehr, 1990, p. 1422). At the very least, our findings show that we should not throw the
abstractions-baby out with the theoretical bathwater. Instead, the way in which we
approach the discussion of abstractions should change. Up until now, the amount of
theoretical speculation about the existence of abstractions has been inversely proportional
to the amount of empirical work devoted to providing evidence for or against
the existence of such abstractions. This is specifically true for attempts to model how
such abstractions might emerge from exposure to input.
For example, Ramscar and Port (2016) discuss the impossibility of a discrete
inventory of phonemes and claim that "none of the traditional discrete units of language
that are thought to be the compositional parts of language – phones or phonemes,
syllables, morphemes, and words – can be identified unambiguously enough to serve as
the theoretical bedrock on which to build a successful theory of spoken language
composition and interpretation. Surely, if these units were essential components of
speech perception, they would not be so uncertainly identified" (p. 63). Our results show
that there is more to it than that: the findings are not unanimous, but they speak for much more
than no-thing and much less than every-thing. We have shown that at least part of the
phone inventory can be reliably identified from input by the algorithms applied here – ECL
and MBL. At the same time, not all phones appear to be equally learnable from exposure,
at least not from exposure to one speaker. For any balanced diet, the proof is in the
pudding, but neither in the know-it-all chef nor in the list of ingredients.
Maybe this should not come as a surprise: after all, many if not most abstractions
came into existence as handy theoretical or descriptive shortcuts, not as cognitively
realistic reflections of (individual) users' language knowledge (see, for discussion,
Divjak, 2015b; Milin et al., 2016). Cognitively speaking, the phoneme inventory that is
traditionally assumed for English is possibly either too specific or too coarse in places,
in particular when dealing with the idiosyncrasies of one speaker. Schatz et al. (2019)
likewise reported finding a different type of unit, as the ones their model learned were
too brief and too variable acoustically to correspond to the traditional phonetic
categories. They, too, took the view that such a result should not automatically be taken
to challenge the learning mechanism, but rather to challenge what is expected to be
learned, e.g., phonetic categories. Or maybe we should consider that not all phones
appear to be equally learnable from input alone, and that production data should be provided
to the model; emergentist approaches allow production constraints to affect perception
(Donegan, 2015, p. 45). Perhaps, also, the idiosyncrasies of our one speaker show that
more diverse training is necessary to learn other groups of phone-like units.
Shortcomings aside, we have shown that ECL learning methods can learn
abstractions. Continuous-time learning, as represented by TD, does not necessarily
improve learning as measured by prediction accuracy: looking into the future helps only
if the data is rather noisy, i.e., the signal is weak. Furthermore, the argument for or
against the relevance of units and abstractions is, as we said, void if not pitted against a
particular task that may or may not require such units to be discerned and/or
categorized (this point was demonstrated by successful models of categorization, such
as Love et al., 2004; Nosofsky, 1986). In this sense, ECL offers the more versatile type of
learning in that it can handle tasks at the level of exemplars and of abstractions, as
demonstrated by its reasonable performance across the four tasks. It also confirms
claims that adults seem to disregard differences that are not phonemic rather than
actually losing the ability to perceive them (Donegan, 2015, p. 43).
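The sense in which TD 'looks into the future' can be sketched in one update rule: its error term bootstraps from the discounted prediction for the next state, rather than from the current outcome alone. The states, rewards, and parameters below are illustrative assumptions, not our simulation settings.

```python
import numpy as np

def td0_update(V, s, s_next, reward, alpha=0.1, gamma=0.9):
    """TD(0) step: the learning target includes the discounted prediction
    for the *next* state, so current estimates are corrected toward the future."""
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V

# Toy chain: states 0 -> 1 -> 2, with reward only on reaching state 2.
V = np.zeros(3)
for _ in range(200):
    V = td0_update(V, 0, 1, reward=0.0)
    V = td0_update(V, 1, 2, reward=1.0)

# Credit propagates backwards: the early state inherits discounted value
# (about gamma * V[1]) even though it never receives a reward directly.
assert V[1] > 0.9 and V[0] > 0.8
```

When the immediate signal is reliable, this look-ahead adds little over a one-step delta rule; when the signal is weak or noisy, borrowing from future predictions can stabilize learning, consistent with the pattern we observed.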
The empirical exploration of whether abstractions are learnable should precede any
decisions about which theoretical framework to promote. The Zeitgeist, however, seems
to favor a reductionism that leads to rather extreme positions on the role of exemplars
and the status of (abstract) units (cf., Ambridge, 2020; Ramscar, 2019). Yet, as we know
from the past, reductionism of any kind provokes rather than inspires (recall, for
example, the Behaviorist take on language, and the Generativist reaction to it, see
Chomsky, 1959; Skinner, 1957; and for a later re-evaluation of these contributions, see,
among others, MacCorquodale, 1970; Palmer, 2006; Swiggers, 1995), where both
provocateur and provokee lose in the long run. Such theoretical deliberations only lead
to 'tunnel vision' on what constitutes evidence and how supporting data ought to be
collected. After all, we should know better: "all models are wrong" (Box, 1976, p. 792),
and those that occupy extreme positions often go hand-in-hand with simplifying and
even simplistic viewpoints, as their potential usefulness gets spent on promotion and
debasement rather than on productive (self-)refutability.
5. Acknowledgments
This work was funded by a Leverhulme Trust award (RL-2016-001) which funded PM
and DD, and by the Social Sciences and Humanities Research Council of Canada
(4352014-0678) which funded BVT. We are grateful to Ben Ambridge for a fascinating
seminar which sparked our interest in this topic, and to members of the Out of our Minds
team and the Alberta Phonetics Laboratory for sharing their thoughts. We thank
Danielle Matthews, Gerardo Ortega and Eleni Vasilaki for useful discussions and/or
pointers to the literature.
References
Ambridge, B. (2020). Against stored abstractions: A radical exemplar model of language acquisition. First Language, 40(5-6), 509–559. https://doi.org/10.1177/0142723719869731

Ambridge, B., & Lieven, E. V. M. (2011). Child language acquisition: Contrasting theoretical approaches. Cambridge: Cambridge University Press.

Appelbaum, I. (1996). The lack of invariance problem and the goal of speech perception. In Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP '96 (Vol. 3, pp. 1541–1544).

Arnold, D., & Tomaschek, F. (2016). The Karl Eberhards corpus of spontaneously spoken southern German in dialogues – audio and articulatory recordings. In C. Draxler & F. Kleber (Eds.), Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum (pp. 9–11). Munich: Ludwig-Maximilians-University. https://epub.ub.uni-muenchen.de/29405/

Arnold, D., Tomaschek, F., Sering, K., Lopez, F., & Baayen, R. H. (2017). Words from spontaneous conversational speech can be recognized with human-like accuracy by an error-driven learning algorithm that discriminates between meanings straight from smart acoustic features, bypassing the phoneme as recognition unit. PLOS ONE, 12(4), e0174623.

Arnon, I., & Ramscar, M. (2012). Granularity and the acquisition of grammatical gender: How order-of-acquisition affects what gets learned. Cognition, 122(3), 292–305.

Baayen, R. H., Chuang, Y.-Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019.

Barreda, S., & Nearey, T. M. (2012). The association between speaker-dependent formant space estimates and perceived vowel quality. Canadian Acoustics, 40(3), 12–13.

Barriuso, T. A., & Hayes-Harb, R. (2018). High variability phonetic training as a bridge from research to practice. CATESOL Journal, 30(1), 177–194. https://eric.ed.gov/?id=EJ1174231
Bod, R. (1998). Beyond grammar: An experience-based theory of language. Stanford, CA: CSLI Publications.

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.

Brekelmans, G., Evans, B., & Wonnacott, E. (2020). No evidence of a high variability benefit in phonetic vowel training for children. London, UK: Zenodo. https://zenodo.org/record/3732383

Buchwald, A., & Miozzo, M. (2011). Finding levels of abstraction in speech production: Evidence from sound-production impairment. Psychological Science, 22(9), 1113–1119. https://doi.org/10.1177/0956797611417723

Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411.

Chomsky, N. (1959). A review of B. F. Skinner's Verbal Behavior. Language, 35(1), 26–58.

Chuang, Y.-Y., & Baayen, R. H. (n.d.). Discriminative learning and the lexicon: NDL and LDL. To appear.

Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.

Divjak, D. (2015a). Exploring the grammar of perception: A case study using data from Russian. Functions of Language, 22(1), 44–68.

Divjak, D. (2015b). Four challenges for usage-based linguistics. Change of Paradigms: New Paradoxes. Recontextualizing Language and Linguistics, 31, 297–309. De Gruyter.

Divjak, D., & Caldwell-Harris, C. (2015). Frequency and entrenchment. In D. Divjak & E. Dabrowska (Eds.), Handbook of Cognitive Linguistics (Vol. 39). De Gruyter.

Divjak, D., Milin, P., Ez-zizi, A., Józefowski, J., & Adam, C. (2020). What is learned from exposure: An error-driven approach to productivity in language. Language, Cognition and Neuroscience, 1–24.

Donegan, P. (2015). The emergence of phonological representation. In The Handbook of Language Emergence (pp. 33–52). John Wiley & Sons. https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118346136.ch1
Fromkin, V. A. (1971). The non-anomalous nature of anomalous utterances. Language, 47(1), 27–52. http://www.jstor.org/stable/412187

Giannakopoulou, A., Brown, H., Clayards, M., & Wonnacott, E. (2017). High or low? Comparing high- and low-variability phonetic training in adult and child second language learners. PeerJ, 5, e3209. https://peerj.com/articles/3209

Gillies, J. (2018). CERN and the Higgs Boson: The global quest for the building blocks of reality. Icon Books.

Goldberg, A. E. (2006). Constructions at Work: The Nature of Generalization in Language. Oxford University Press.

Goldinger, S. D., & Azuma, T. (2003). Puzzle-solving science: The quixotic quest for units in speech perception. Journal of Phonetics, 31(3), 305–320.

Hamilton, A., Plunkett, K., & Schafer, G. (2000). Infant vocabulary development assessed with a British communicative development inventory. Journal of Child Language, 27(3), 689–705.

Hay, J., & Bresnan, J. (2006). Spoken syntax: The phonetics of giving a hand in New Zealand English. The Linguistic Review, 23(3), 321–349.

Haykin, S. (1999). Neural networks: A comprehensive foundation. New Jersey: Prentice Hall.

Holmes, W. (2001). Speech Synthesis and Recognition. CRC Press.

Jääskeläinen, I. P., Hautamäki, M., Näätänen, R., & Ilmoniemi, R. J. (1999). Temporal span of human echoic memory and mismatch negativity: Revisited. NeuroReport, 10(6), 1305–1308.

Jayaraman, S., & Smith, L. B. (2019). Faces in early visual environments are persistent not just frequent. Vision Research, 157, 213–221.

Jebara, T. (2012). Machine learning: Discriminative and generative (Vol. 755). Springer Science & Business Media.
Johnson, K. (1997). The auditory/perceptual basis for speech segmentation. Ohio State University Working Papers in Linguistics, 50, 101–113.

Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. The Journal of the Acoustical Society of America, 29(1), 98–104.

Langacker, R. W. (1987). Foundations of Cognitive Grammar: Theoretical Prerequisites. Stanford University Press.

Langacker, R. W. (1991). Foundations of Cognitive Grammar. Vol. 2, Descriptive Application. Stanford: Stanford University Press.

Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to identify English /r/ and /l/. II: The role of phonetic environment and talker variability in learning new perceptual categories. The Journal of the Acoustical Society of America, 94(3), 1242–1255.

Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of category learning. Psychological Review, 111(2), 309–332.

MacCorquodale, K. (1970). On Chomsky's review of Skinner's Verbal Behavior. Journal of the Experimental Analysis of Behavior, 13(1), 83.

Marslen-Wilson, W., & Warren, P. (1994). Levels of perceptual representation and process in lexical access: Words, phonemes, and features. Psychological Review, 101(4), 653–675.

Mendel, J., & McLaren, R. (n.d.). Reinforcement-learning control and pattern recognition systems. In J. Mendel & K. Fu (Eds.), Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications.

Mielke, J., Baker, A., & Archangeli, D. (2016). Individual-level contact limits phonological complexity: Evidence from bunched and retroflex //. Language, 92(1), 101–140.
Milin, P., Divjak, D., & Baayen, R. H. (2017). A learning perspective on individual differences in skilled reading: Exploring and exploiting orthographic and semantic discrimination cues. Journal of Experimental Psychology: Learning, Memory, and Cognition.

Milin, P., Divjak, D., Dimitrijević, S., & Baayen, R. H. (2016). Towards cognitively plausible data science in language research. Cognitive Linguistics.

Milin, P., Feldman, L. B., Ramscar, M., Hendrix, P., & Baayen, R. H. (2017). Discrimination in lexical decision. PLoS ONE, 12(2), e0171935.

Milin, P., Keuleers, E., & Filipović-Durdević, D. (2011). Allomorphic responses in Serbian pseudo-nouns as a result of analogical learning. Acta Linguistica Hungarica, 58(1), 65–84.

Milin, P., Madabushi, H. T., Croucher, M., & Divjak, D. (2020). Keeping it simple: Implementation and performance of the proto-principle of adaptation and learning in the language sciences. arXiv preprint arXiv:2003.03813.

Milin, P., Nenadić, F., & Ramscar, M. (2020). Approaching text genre: How contextualized experience shapes task-specific performance. Scientific Study of Literature, 10(1), 3–34.

Nathan, G. S. (2007). Phonology. In D. Geeraerts & H. Cuyckens (Eds.), The Oxford Handbook of Cognitive Linguistics. Oxford University Press.

Nixon, J., & Tomaschek, F. (2020). Learning from the acoustic signal: Error-driven learning of low-level acoustics discriminates vowel and consonant pairs. In Proceedings of the 42nd Annual Conference of the Cognitive Science Society (Vol. 42, pp. 585–591).

Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1), 39–57.

Palmer, D. C. (2006). On Chomsky's appraisal of Skinner's Verbal Behavior: A half century of misunderstanding. The Behavior Analyst, 29(2), 253–267.

Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee & P. Hopper (Eds.), Frequency and the Emergence of Linguistic Structure. Philadelphia, PA: John Benjamins.

Port, R. F. (2007). How are words stored in memory? Beyond phones and phonemes. New Ideas in Psychology, 25(2), 143–170.
Port, R. F. (2010). Language as a social institution: Why phonemes and words do not live in the brain. Ecological Psychology, 22(4), 304–326. https://doi.org/10.1080/10407413.2010.517122

Port, R. F., & Leary, A. P. (2005). Against formal phonology. Language, 81, 927–964.

Predictability, surprise, attention, and conditioning. (n.d.).

R Core Team. (2020). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. https://www.R-project.org/

Ramscar, M. (2019). Source codes in human communication (preprint). PsyArXiv. https://osf.io/e3hps

Ramscar, M., Dye, M., & McCauley, S. M. (2013). Error and expectation in language learning: The curious absence of "mouses" in adult speech. Language, 89(4), 760–793. https://www.jstor.org/stable/24671957

Ramscar, M., & Port, R. (2016). How spoken languages work in the absence of an inventory of discrete units. Language Sciences, 53, 58–74.

Ramscar, M., & Port, R. (2019). Categorization (without categories). Cognitive Linguistics Foundations of Language, 87.

Ramscar, M., & Yarlett, D. (2007). Linguistic self-correction in the absence of feedback: A new approach to the logical problem of language acquisition. Cognitive Science, 31(6), 927–960.

Rescorla, R. (2008). Rescorla-Wagner model. Scholarpedia, 3(3), 2237.

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory, 64–99.

Sapir, E. (1921). Language: An Introduction to the Study of Speech. New York: Harcourt Brace.

Savin, H., & Bever, T. (1970). The nonperceptual reality of the phoneme. Journal of Verbal Learning and Verbal Behavior, 9(3), 295–302.
Schatz, T., Feldman, N., Goldwater, S., Cao, X. N., & Dupoux, E. (2019). Early phonetic learning without phonetic categories – Insights from large-scale simulations on realistic input (Tech. Rep.). PsyArXiv. https://psyarxiv.com/fc4wh/

Schmidtke, D., Matsuki, K., & Kuperman, V. (2017). Surviving blind decomposition: A distributional analysis of the time-course of complex word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43(11), 1793.

Shafaei-Bajestan, E., Moradipour-Tari, M., & Baayen, R. H. (2020). LDL-AURIS: Error-driven learning in modeling spoken word recognition.

Shankweiler, D., & Fowler, C. A. (2015). Seeking a reading machine for the blind and discovering the speech code. History of Psychology, 18(1), 78–99.

Shoup, J. E. (1980). Phonological aspects of speech recognition. Trends in Speech Recognition, 125–138. Prentice Hall.

Skinner, B. F. (1957). Verbal behavior. New York: Appleton-Century-Crofts.

Suzuki, R., Terada, Y., & Shimodaira, H. (2019). pvclust: Hierarchical clustering with p-values via multiscale bootstrap resampling [Computer software manual]. R package version 2.2-0. https://CRAN.R-project.org/package=pvclust

Swiggers, P. (1995). How Chomsky skinned Quine, or what 'verbal behavior' can do. Language Sciences, 17(1), 1–18.

Taylor, J. R. (1995). Linguistic Categorization: Prototypes in Linguistic Theory. Clarendon Press.

Tomaschek, F., Tucker, B. V., Fasiolo, M., & Baayen, R. H. (2018). Practice makes perfect: The consequences of lexical proficiency for articulation, 4(s2).

Trubetzkoy, N. S. (1969). Principles of Phonology. University of California Press.

Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadić, F., & Sims, M. (2019). The Massive Auditory Lexical Decision (MALD) database. Behavior Research Methods, 51(3), 1187–1204. https://doi.org/10.3758/s13428-018-1056-1

Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4
Whelan, M. T., Prescott, T. J., & Vasilaki, E. (2020). A robotic model of hippocampal reverse replay for reinforcement learning. To appear in arXiv.

Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits (pp. 96–104).

Widrow, B., & Lehr, M. A. (1990). 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415–1442.

Yuan, J., & Liberman, M. (2008). Speaker identification on the SCOTUS corpus. Proceedings of Acoustics '08, 5687–5690.