University of Birmingham

A learning perspective on the emergence of abstractions
Milin, Petar; Tucker, Benjamin V.; Divjak, Dagmar

Citation for published version (Harvard):
Milin, P, Tucker, BV & Divjak, D 2020, 'A learning perspective on the emergence of abstractions: the curious case of phonemes', arXiv.

Link to publication on Research at Birmingham portal
PREVIEW OF PAPER CURRENTLY UNDER REVIEW.
PLEASE CITE AS UNPUBLISHED MANUSCRIPT.
A learning perspective on the emergence of abstractions: the curious case of phonemes

Petar Milin^a, Benjamin V. Tucker^b, and Dagmar Divjak^{a,c}

^a University of Birmingham, Department of Modern Languages, Birmingham, UK; ^b University of Alberta, Department of Linguistics, Edmonton, Alberta, Canada; ^c University of Birmingham, Department of English Language and Linguistics, Birmingham, UK.
Abstract
In the present paper we use a range of modeling techniques to investigate whether an abstract phone could emerge from exposure to speech sounds. We test two opposing principles regarding the development of language knowledge in linguistically untrained language users: Memory-Based Learning (MBL) and Error-Correction Learning (ECL). A process of generalization underlies the abstractions linguists operate with, and we probed whether MBL and ECL could give rise to a type of language knowledge that resembles linguistic abstractions. Each model was presented with a significant amount of pre-processed speech produced by one speaker. We assessed the consistency or stability of what the models have learned and their ability to give rise to abstract categories. The two types of models fared differently on these tests. We show that ECL models can learn abstractions and that at least part of the phone inventory can be reliably identified from the input.

KEYWORDS: Error-Correction Learning; Memory-Based Learning; Widrow-Hoff rule; Temporal Difference rule; abstraction; phoneme; English

Corresponding author: Petar Milin. Email: [email protected]
1. Introduction

Many theories of language presuppose the existence of abstractions that aim to organize the extremely rich and varied experiences language users have. They are thought to be either innate (generative theories) or to arise from experience (emergentist, usage-based approaches). Several mechanisms by which abstractions could form have been identified and include, but are not limited to, not attending to certain information streams to manage the influx of information, (passively or actively) forgetting some dimensions of experience, and grouping similar experiences together. Importantly, these mechanisms are not necessarily mutually exclusive. In this paper, we consider another mechanism and investigate whether learning can account for the emergence of linguistic abstractions from exposure to the ambient language. Can learning to ignore uninformative cues lead to a form of abstracted knowledge?
To investigate this possibility, we present a case study on the sounds of English in which we computationally model whether an abstract phone could emerge from exposure to speech sounds using three cognitively plausible learning rules: Memory-Based Learning (MBL), Widrow-Hoff (WH) and Temporal Difference learning (TD; a direct generalization of WH). The latter two computational models learn to unlearn irrelevant dimensions of experience via the cue competition inherent in the experience. They focus on those dimensions that help deal with the environment on a 'good enough' basis. What remains would then represent an abstraction: it is less detailed (or more schematic) than the input exemplars and more parsimonious because, as we learn, we discard or ignore what is non-predictive, i.e., not helpful for improving performance (by error-correction), and we retain only the relevant 'core'. Importantly, however, that abstraction is not a given and static memory trace, but rather an ever-evolving, information-rich residue of the experience (cf., Love, Medin, & Gureckis, 2004; Nosofsky, 1986; Ramscar & Port, 2019).
1.1. Much ado about abstractions

If there is one thing that linguistic theories have in common, it is that they resort to abstractions for describing language. Some of these abstractions are so commonly agreed upon that they are taught to children in elementary school (think of parts of speech like noun, verb or adjective), while others are specific to a particular theory (such as constructions). Depending on the theory, these abstractions are either innate and present at birth, as is the case in generative frameworks, or they have emerged from exposure to the ambient language, as is assumed in usage-based approaches (see Ambridge & Lieven, 2011, for an overview of these theories and the predictions they make regarding language acquisition or development).

This implies that abstractions play a different role in the two types of theories. In earlier versions of generative linguistic theories, abstractions were underlying representations: they are part of the deep structure which harbors the abstract form of a sound, word or sentence before any transformational rules have been applied. In this sense, abstractions give rise to usage. In usage-based approaches, abstractions are posited at the other end: abstractions are built up gradually, as experience with language accumulates and commonalities between instances are detected. In this framework, abstractions follow from usage. Note that in both approaches abstractions co-exist with actual usage: it is not uncommon for usage-based approaches to assume redundant encoding in which abstractions are stored alongside exemplars (Goldberg, 2006; Langacker, 1987).
This practice of describing language using abstractions has led to the assumption that language knowledge and use must involve knowledge of the relevant abstractions (but see Ambridge, 2020). The challenge facing those who assume the existence of abstractions is one of proposing a hypothesized construct or mechanism that would give rise to these abstractions. For nativists this means explaining how these abstractions could be present at birth, and for emergentists this means explaining how such abstractions can emerge from exposure to usage. Ideally this needs to be done in sufficient detail for the account to be implemented computationally and/or tested empirically.
1.2. The taming of the shrew (the smallest of language units)

Researchers have long searched for the smallest meaningful unit in language (Sapir, 1921; Trubetzkoy, 1969). The main goal of the search for an abstraction was to discover a basic building block on which linguistic theories can build (much like the atom, electron and Higgs boson, and subsequent research in particle physics; Gillies, 2018). In the study of sound, over the years, many different levels of abstraction have been proposed, such as the mora, demisyllable, phoneme, syllable, or phonetic features (Goldinger & Azuma, 2003; Marslen-Wilson & Warren, 1994; Savin & Bever, 1970). These are all assumed to have mental correlates. The most commonly proposed and assumed building block or unit of abstraction is the phoneme or syllable (Goldinger & Azuma, 2003). Generally, phonemes are defined as abstractions of actual speech sounds that ignore or 'abstract away' details of the speech signal that do not contribute to differentiating meaning. Syllables as a basic unit are usually considered combinations of speech sounds that carry a rhythmic weight and are used to differentiate meaning.
In this study, we will focus on phoneme-like units. The phonemic perception and representation of speech is arguably one of the best-established constructs of linguistic theory (for a review see Donegan, 2015, pp. 39-40). Our choice may appear rather unfortunate, since phonemes have come under fire more than once in recent history. Port (2010, p. 53) claims: 'Phones and phonemes, though they come very readily to our conscious awareness of speech, are not valid empirical phenomena suitable as the data basis for linguistics since they do not have physical definitions.' Work by Goldinger and Azuma (2003) investigated an Adaptive Resonance Theory approach to the emergence of perceptual 'units' or self-organizing dynamic states. It argues that the search for a fundamental unit is misguided and advocates an exemplar representation. A recurring claim in the argument against the cognitive reality of phonemes is the richness of memories of experiences, including experiences of sounds (compare with Port, 2010). Although we do not deny the validity of the findings reported in, e.g., Goldinger and Azuma (2003), Port (2007), Port and Leary (2005), and Savin and Bever (1970), we do not accept them as evidence that there cannot possibly be any cognitive reality to phoneme-like units. Instead, we consider these findings a mere indication that phoneme-like units are not necessarily the whole story. Compare here the findings of Buchwald and Miozzo (2011), who found that two systems need to be recognized to account for errors in spoken language production, i.e., a processing stage with context-independent representations and one with context-dependent representations.
As mentioned earlier, the theoretical need for a basic building block in linguistics stems from a desire to simplify the input for a learner or listener. This is true of much of the early speech error work, which provides strong evidence for the need for such a representation (e.g., Fromkin, 1971). The lack-of-invariance problem from speech perception research illustrates this need nicely. Simply put, the acoustic signal is highly complex and variable, and there is no simple mapping from the acoustic signal to some sort of phonetic structure (Appelbaum, 1996; Shankweiler & Fowler, 2015). As a result, what is thought of as a single phoneme can be represented by multiple variably weighted acoustic cues and can be produced in many different ways (e.g., Mielke, Baker, & Archangeli, 2016). While the lack-of-invariance problem has advanced research in the field of speech perception, and has not only increased our understanding of speech production and perception but has also been influential in early attempts at speech synthesis, not everyone agrees that it is as big a problem as previously thought (e.g., Appelbaum, 1996; Goldinger & Azuma, 2003). Even so, the search continues for a basic building block which linguistic theories can use as a concept on which to build additional linguistic structure.
It is also worth noting that the main motivation for our choice is not conceptual but methodological. This is to say that, although we are not opposed to the existence of phoneme-like units, neither do we believe that phonemes are the only abstract form cognitive representations of sound can take. Abstractions are likely language-dependent and could range from what might traditionally be referred to as moras, allophones, phonemes, demi-syllables, syllables, words and sequences of words. A productive theoretical proposition should not commit a priori to particular constructs or abstractions. In that respect, such units cannot be compulsory, and the nature of these units could vary across languages or even individuals. Furthermore, such units could lack support in a cognitive sense and yet still be didactically useful (hypothetical) constructs. Finally, we do not consider these abstractions or units to be static either, as we address the question of how they unfold over time.
The main goal of this study, hence, is to explore whether abstract units can emerge from input, using cognitively realistic learning algorithms. For the remainder of the article we will distinguish between phonemes and phones, and use phones to refer to the segments we are working with, to reiterate the fact that we are using phoneme-like units out of convenience and not out of any theoretical leaning.
1.3. To learn or not to learn

For the purposes of this study, we will zoom in on a specific subset of linguistic theories and algorithmic options. Linguistically, we will assume that universal sound inventories are not available at birth but are learned using general cognitive mechanisms. Psychologically, we will contrast models that assume veridical encoding of the input with those that allow or even encourage information 'sifting'. Translated into the terms of the relevant machine learning approaches, this opposition would pit Memory-Based Learning (MBL) against Error-Correction Learning (ECL; for a more in-depth discussion of the differences between the two approaches see also Milin, Divjak, Dimitrijević, & Baayen, 2016).
To researchers interested in linguistics, and cognitive linguistics in particular, Memory-Based Learning (MBL) appeals because it explains linguistic knowledge in terms of categorization and analogical extension. Moreover, exemplar accounts are particularly well established in phonetics and phonology (e.g., Bod, 1998; Hay & Bresnan, 2006; Johnson, 1997; Pierrehumbert, 2001), yielding a situation in which words or phrases are stored with rich phonetic detail. However, this opinion, although widespread, does not seem to mesh with the empirical findings from studies on echoic sensory memory, which represents the short-term storage of auditory input (cf., Jääskeläinen, Hautamäki, Näätänen, & Ilmoniemi, 1999): sensory memory has been found to be very short-lived, and hence it remains unclear how that rich detail would make it into long-term memory (under normal, non-traumatic circumstances).
Error-Correction Learning (ECL) appears conceptually much better suited for testing the emergentist perspective on language knowledge, as it takes a holistic (i.e., molar) approach to knowledge, and language knowledge in particular. This knowledge evolves gradually and continuously from the input, which is conceived of as a multidimensional space of 'discernibles' or contrasts. In other words, rather than retaining the exemplars as they are, in all their unparsimonious richness, learners may disregard non-predictive aspects of the input, keeping only what is 'useful' in an adaptive sense.

In this respect, perhaps, MBL and ECL could be seen as opposing principles: in the extreme, an exemplar approach does not require anything beyond storage, while an error-correction approach is a stockpile of perpetually changing, must-be contrasts.
We can assume that the smallest common denominator of the two approaches is the ability to learn from the environment for the sake of ever better adaptation. Rather simplified, a system (be it a computer, animal or human) learns about its environment by processing input information about the environment and about itself in relation to that environment. Gradually, that system becomes more knowledgeable about its environment and, thus, better adapted to it. It follows that, computationally, learning assumes stimulation from the environment to which the system reacts by modulating its parameters (also known as free parameters) for operating in and on that environment (cf., Mendel & McLaren, n.d.). Ultimately, the system responds to its environment in a new and more adapted way. From this computational perspective, what makes these two learning processes unique is the specific way in which parameter change happens. Generally speaking, MBL stores or memorizes data directly, while ECL filters or discriminates that data to retain what is necessary and sufficient.
Most (if not all) MBL approaches store past experiences as correct input-output exemplars: the input is a (vector) representation of the environment and the system, whereas the output assumes an action or an entity, a desired outcome (see Haykin, 1999). Memory is organized in neighborhoods of similar or proximal inputs, which presumably elicit similar actions/entities. Thus, a memory-based algorithm consists of (a) a criterion for defining the neighborhoods (similarity, proximity) and (b) a learning principle that guides assignment of a new exemplar to the right neighborhood. One simple yet efficient example of such a principle is that of Nearest Neighbors, where a particular type of distance (say, the Euclidean distance) is minimal in a given group of (near) neighbors. It has been shown that this simple principle straightforwardly ensures the smallest probability of error over all possible decisions for a new exemplar (as discussed by Cover & Hart, 1967). The principle appears simple, but it is also efficient. Moreover, it appeals to linguists, and cognitive linguists in particular, as it equates learning with categorization or classification of newly encountered items into neighborhoods of 'alikes': new items (exemplars) are processed on the basis of their similarity to the existing neighborhoods, which resembles the process of analogical extension. And for cognitive linguists, categorization is one of the foundational, general cognitive abilities that facilitate learning from input (Taylor, 1995).
ECL runs on very different principles, which filter the value (weight) of the input information (signal) to facilitate error-free prediction of some desired outcome. The process unfolds in a step-wise manner as new input information becomes available. The weighting or evaluation of the current success is done relative to a mismatch between the current prediction and the true or desired outcome. In other words, the stimulation to modulate the operating parameters comes from the difference between the desired outcome and the current prediction of it on the basis of available input data. That difference induces a rather small disturbance in the input weights which, effectively, evaluates the input data itself to bring its prediction closer to the true desired outcome. The learning process continues as new data arrive. Gradually, the difference between the prediction and the true outcome becomes smaller and, consequently, the changes in the input weights also diminish over time.

Thus, comparatively, in MBL the input data remain unaffected and are stored paired with the outcome, while in ECL that input is continuously (re)evaluated and filtered to achieve 'faithful' prediction of the outcome. In a sense, for ECL, input data matter only inasmuch as they help minimize prediction error. Otherwise, over time, some input may end up having a zero weight, which means that it is effectively ignored. We will return to these points in more detail in Section 2.2.
Coincidentally, both MBL and ECL are currently being pushed to their (almost paradoxical) extremes. On the one hand, Ambridge (2020, https://osf.io/5acgy/) denounces any stored linguistic abstractions (while, strangely, acknowledging other types of abstractions). On the other hand, Ramscar (2019) denies language units any ontological (and, consequently, epistemological) relevance by disputing mappings of any substance (and, hence, theoretical importance) between form units and semantic units: units become 'dis-unit-ed' due to a lack of one-to-one mappings between an (allegedly fixed) form and an (allegedly fixed) meaning.
1.4. This study: pursuing that which is learnable

In this paper, we contrast a model that stores the auditory signal in memory with one that learns to discriminate phones in the auditory signal. Modeling whether an abstract phone could emerge from exposure to speech sounds by applying the principle of error-correction learning implies dimensionality reduction. Our approach crucially relies on the concept of unlearning, whereby the strength of the links between any cues that do not co-occur with a certain outcome is reduced (see Ramscar, Dye, & McCauley, 2013, for discussion). The principle is such that the model will gradually try to unlearn irrelevant variations in the speech input, thereby discarding dimensions of the original experience and retaining more abstract and parsimonious 'memories'. This strategy is the radical opposite of a model that presupposes that we store our experiences veridically, i.e., in their full richness and idiosyncrasy. On such an account, learning equals encoding, storing and organizing that storage to facilitate retrieval.
To achieve this, we will provide our models with a common input in automatic speech recognition, Mel Frequency Cepstral Coefficients (henceforth MFCCs), derived from the acoustic signal, and with phones as possible outcomes. We will assign our models the task of learning to link the two. This implies that, in a very real sense, our modeling is biased: our model will learn phone-like units because those are the outputs that we provide. If we provided other types of units, other types of outputs might be learned. Yet, this does not diminish our finding that phones can be successfully learned from input, using an error-correction learning approach.
Of course, because we use a simple algorithm, performance differences with state-of-the-art classifiers should be expected. Yet, what we trade in, in terms of performance, we hope to gain in terms of cognitive reality and, at the same time, interpretability of the findings. In other words, rather than maximizing performance, we will minimize two critical 'investments': in cultivating the input, and in algorithmic sophistication (and computational opaqueness). Instead, we remain as naïve as possible and engage with MFCCs directly as input, and we use one of the simplest yet surprisingly efficient computational models of learning. Finally, recall also that we do not claim that phonemic variation would be something, let alone everything, that listeners identify. At this point, we are testing whether the phones that are standardly recognized can be learned from input using cognitively realistic learning algorithms.
As said, learnability is our minimal requirement and our operationalization of cognitive reality. On a usage-based, emergentist approach to language knowledge, abstractions must be learnable from the input listeners are exposed to if these abstractions are to lay claim to cognitive reality (Divjak, 2015a; Milin et al., 2016). Whereas earlier work explored whether abstract labels can be learned and whether they map onto distributional patterns in the input (cf. Divjak & Caldwell-Harris, 2015), more recent work has brought these strands together by modeling directly how selected abstract labels would be learned from input using a cognitively realistic learning algorithm (cf. Milin, Divjak, & Baayen, 2017).
With this in mind, our study does not contradict recent developments within the discrimination-learning framework to language. The Linear Discriminative Learner (LDL), in particular, sets itself the more ambitious goal of refuting the need for hierarchies of linguistic constructs (cf., Baayen, Chuang, Shafaei-Bajestan, & Blevins, 2019; Chuang & Baayen, n.d.; Shafaei-Bajestan, Moradipour-Tari, & Baayen, 2020). The present study, however, asks a more fundamental question relating to the learnability of the simplest such linguistic constructs: phones. Yet, in that, we take a radical approach and model the emergence straightforwardly from continuous, numeric input representations (i.e., MFCCs), rather than from pre-processed discrete units (i.e., the FBSF of Arnold, Tomaschek, Sering, Lopez, & Baayen, 2017), as used in the related studies by Nixon and Tomaschek (2020) and Shafaei-Bajestan et al. (2020).
An error-correction approach to speech sound acquisition has recently been proposed. Nixon and Tomaschek (2020) trained a Naïve Discriminative Learning model (NDL), relying on the Rescorla-Wagner learning rule, to form expectations about a subset of phonemes from the surrounding speech signal. For this study, data from a German spontaneous speech corpus was used (Arnold & Tomaschek, 2016). The continuous acoustic features were discretized into 104 spectral components per 25 ms time-window, which were then used as both inputs and outputs. A learning event consisted of the two preceding cues (104 elements each), a target (104 elements) and one following cue (also 104 elements). Learning proceeded in an iterative fashion over a moving time-window. The trained model was tested by trying to replicate an identification task on vowel and fricative continua, and was able to generate reassuring discrimination curves for both continua. In other words, Nixon and Tomaschek (2020) investigated sounds that are discriminated based on cues from the expected spectral frequency ranges for vowels and fricatives.
While the present study is similar to the work by Nixon and Tomaschek (2020), both in spirit and in substance, it also differs in several major ways. First, instead of predicting learned activations, we used activations as predictors of the correct phone. Second, the present study uses the entire phone inventory of English and does not focus on a subset. Third, we use real-valued input from MFCCs directly, rather than input transformed into discrete categories. That is why we rely on the Widrow-Hoff learning rule, noting that it is considered to be identical to the Rescorla-Wagner rule (as R. Rescorla, 2008, himself pointed out) but accepts numeric input (see Milin, Madabushi, Croucher, & Divjak, 2020, for further details). Fourth, our approach, specifically, is grounded in previous empirical work within the NDL framework (cf., Milin, Feldman, Ramscar, Hendrix, & Baayen, 2017), and uses two learning measures: directly proportional activation from the input MFCCs, and inversely proportional diversity of the same input MFCCs. The former indicates the support and the latter the competition that an outcome (in this particular case a phone) receives from the input. Finally, Nixon and Tomaschek (2020) focus on the predictivity of the resulting model, i.e., on predicting responses in discrimination and categorization tasks, and less on the creation of abstract units. The authors do, however, discuss interesting initial findings that begin to replicate categorical perception results from speech perception.
Our learning-based approach is also reminiscent of the mechanism-driven approach proposed in Schatz, Feldman, Goldwater, Cao, and Dupoux (2019). Their findings suggest that distributional learning might be the right approach to explain how infants become attuned to the sounds of their native language. The proposed learning algorithm did not try to achieve any biological or cognitive realism but focused purely on the power of generative machine learning techniques (see Jebara, 2012, for details on this family of machine learning models, and the distinction between discriminative and generative learning model families). Thus, they selected an engineering favorite which is based on a Gaussian Mixture Model (GMM) classifier. The continuous sound signal was sampled in regular time-slices to serve as input data to generate a joint probability density function which is a mixture of (potentially many) Gaussian distributions from the given samples. The focus of the study by Schatz et al. (2019) was, to reiterate, not on the plausibility of a particular (generative) learning mechanism, but rather on the realism of the input data and its potential to be learned. This justifies their choice of a powerful generative learning model.

As we will explain in Section 2, we ran our simulations on much simpler but biologically and cognitively plausible learning algorithms. In other words, Schatz et al. (2019) concluded that plausible or realistic data can generate a learnable signal (i.e., in terms of the joint probability distribution, given a problem space), and we now make the crucial step of simultaneously testing (a) plausible learning mechanisms against (b) realistic data.
2. Methods

One of the challenges of working with speech is that it is continuous. In this section we describe the speech materials and how they were pre-processed before we move on to presenting the details of the modeling procedures. Recall that our decision to model phones does not reflect any theoretical bias on our part: we believe that similar results would be obtained with any number of other possible levels of representation.
2.1. Speech Material

The speech items used for modeling stem from the stimuli in the Massive Auditory Lexical Decision (MALD) database (Tucker et al., 2019). This database contains 26,000 word stimuli and 9,500 pseudo-words produced by a single male speaker from Western Canada; together, this amounts to approximately 5 hours of speech. The forced alignment was completed using the Penn Forced Aligner (Yuan & Liberman, 2008). The forced-aligned files provide time-aligned phone-level segmentation of the speech signal, often used in phonetic research and for training automatic speech recognition tools. All phone-level segmentation is indicated using the ARPAbet transcription system (Shoup, 1980) for English, which we use throughout this paper. The speech stimuli, along with forced-aligned TextGrids, are available for use in research (http://mald.artsrn.ualberta.ca).
In order to prepare the files for modeling, we windowed and extracted MFCCs for all of the words using 25 ms long overlapping windows in 10 ms steps over the entire acoustic signal. By way of example, a 100 ms phone would be divided into 9 overlapping windows, with some padding at the end to allow the windows to fit. MFCCs are summary features of the acoustic signal commonly used in automatic speech recognition. MFCCs are created by applying a Mel transformation, which is a non-linear transform meant to approximate human hearing, to the frequency components of the spectral properties of the signal. The mel frequencies are then submitted to a discrete cosine transform, so that, in essence, we are taking the spectrum of a spectrum, known as the cepstrum. Following standard practice in automatic speech recognition, we used the first 13 coefficients in the cepstrum, calculated the first derivative (delta) for the next 13 coefficients, and the second derivative (delta-delta) for the final set of coefficients (Holmes, 2001). This process results in each window or trial having 39 features; thus for each word we created an MFCC matrix of the 39 coefficients and the corresponding phone label.
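The windowing arithmetic and the 13-to-39 feature stacking can be sketched as follows. This is a minimal numpy illustration, not the authors' extraction pipeline (which presumably used a standard MFCC toolkit); the function names and the simple `np.gradient` deltas are our own stand-ins for whatever delta formula the original software applied.

```python
import numpy as np

def frame_signal(x, sr=16000, win_ms=25, step_ms=10):
    """Slice a 1-D signal into overlapping windows, zero-padding the tail
    so the last window fits (cf. a 100 ms phone -> 9 windows)."""
    win = int(sr * win_ms / 1000)
    step = int(sr * step_ms / 1000)
    n_frames = 1 + int(np.ceil((len(x) - win) / step)) if len(x) > win else 1
    pad = (n_frames - 1) * step + win - len(x)
    x = np.pad(x, (0, max(pad, 0)))
    return np.stack([x[i * step : i * step + win] for i in range(n_frames)])

def add_deltas(mfcc):
    """Stack 13 MFCCs with first (delta) and second (delta-delta)
    differences along the time axis, yielding 39 features per frame."""
    delta = np.gradient(mfcc, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([mfcc, delta, delta2])
```

For a 100 ms phone at 16 kHz (1,600 samples), `frame_signal` yields 9 windows of 400 samples each, matching the example in the text.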
2.2. Algorithms

As stated in the Introduction, MBL and ECL can be positioned as representing opposing principles in learning and adaptation, with the former relying on storage of otherwise unaltered input-outcome experiences, and the latter using filtering of the input signal to generate a minimally erroneous prediction of the desired or true outcome. These conceptual differences are typically implemented technically in the following way: the MBL algorithm must formalize the organization of stored exemplars in appropriate neighborhoods and then, accordingly, the assignment of new inputs to the right neighborhood. Conversely, the ECL algorithm would typically use iterative or gradual weighting of the input to minimize the mismatch between the predicted and targeted outcome.
We applied MBL by 'populating' neighborhoods with exemplars that are corralled by the smallest Euclidean distance, and then assigning new input to the neighborhood of the k = 7 least distant (i.e., nearest) neighbors. This is, arguably, a rather typical MBL implementation, as Euclidean distance is among the most commonly used distance metrics, while k-Nearest Neighbors (kNN) is a straightforward yet efficient principle for storing new exemplars (as discussed previously by Cover & Hart, 1967; Haykin, 1999, etc.). The neighborhood size was set to 7, following findings reported in Milin, Keuleers, and Filipović-Đurđević (2011): in their study, the number of neighbors was systematically manipulated, and k = 7 was suggested for optimal performance.
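A minimal sketch of this kNN decision rule (Euclidean distance, k = 7, majority vote); the function name and the tie-breaking behavior of the vote are our own choices, not taken from the paper.

```python
import numpy as np
from collections import Counter

def knn_predict(memory_X, memory_y, query, k=7):
    """Classify a query vector by majority vote among its k nearest
    (Euclidean-distance) neighbors in the stored exemplar memory."""
    dists = np.linalg.norm(memory_X - query, axis=1)  # distance to each stored exemplar
    nearest = np.argsort(dists)[:k]                   # indices of the k least distant
    votes = Counter(memory_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```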
The two ECL learning models, Widrow-Hoff (WH) and Temporal Difference (TD), were chosen as two versions of a closely related idea: WH (also known as the Delta rule or the Least Mean Square rule; see Milin, Madabushi, et al., 2020, and references therein) represents the simplest error-correction rule, while TD generalizes to a rule that accounts for future rewards, too. More precisely, at time step t, learning assumes small changes to the input weights w_t (small to obey the Principle of Minimal Disturbance; cf., Milin, Madabushi, et al., 2020; Widrow & Hoff, 1960; Widrow & Lehr, 1990) to improve prediction in the next time step t+1. This is achieved by re-weighting the prediction error (which is the difference between the targeted outcome and its current prediction) with the current input cues and a learning rate (a free parameter): λ(o_t − w_t c_t)c_t, where λ represents the learning rate, c_t is the vector of input cues, w_t c_t is the prediction and o_t is the outcome. The full WH update of the weights w in the next time step t+1 is then:

w_{t+1} = w_t + λ(o_t − w_t c_t)c_t

For TD, however, the prediction (w_t c_t) is additionally corrected by a future prediction, as the temporal difference to that current prediction: w_t c_t − γ w_t c_{t+1}, where γ is a discount factor, the additional free parameter that weights the importance of the future, and w_t c_{t+1} is the prediction given the next input c_{t+1}. The update rule for TD thus becomes:

w_{t+1} = w_t + λ(o_t − [w_t c_t − γ w_t c_{t+1}])c_t
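The two update rules translate directly into code. The sketch below treats a single outcome unit with numpy; the paper's actual weights form a 39 × 40 matrix, i.e., one such unit per phone outcome, and λ = 0.0001 follows the setting reported in Section 2.3.

```python
import numpy as np

def wh_update(w, c, o, lam=0.0001):
    """Widrow-Hoff: w_{t+1} = w_t + lam * (o_t - w_t.c_t) * c_t."""
    return w + lam * (o - w @ c) * c

def td_update(w, c, c_next, o, lam=0.0001, gamma=0.5):
    """Temporal Difference: the current prediction w_t.c_t is corrected by
    the discounted next-step prediction gamma * w_t.c_{t+1} before the
    error is computed."""
    return w + lam * (o - (w @ c - gamma * (w @ c_next))) * c
```

Note that with γ = 0 the TD rule reduces exactly to WH, which is the sense in which TD is a direct generalization.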
2.3. Training
Weran several computational simulationsof learning thephonesofEnglishdirectly
fromtheMFCCinput.Thisconstitutesa40-wayclassificationtask,predicting39English
phonesandacategoryofsilence.
The full learning sample consisted of 2,073,999 trials that were split into training and test trials (90% and 10%, respectively). Each trial represents one 25 ms window extracted from the speech signal and matched with the correct phone label. The first training regime used the training data trials as they are: at each trial, input cues were represented as an MFCC row-vector, and the outcome was the corresponding 1-hot encoded (i.e., present) phone. The second training regime used the same 90% of training data to create N = 100 normally distributed data points (Gaussian), with mean and +/− 2.58 SE confidence intervals (to the true, population mean), one for each of the 40 phone categories, given the input MFCCs. The Raw input consisted of 1,866,599 trials (90% of the full dataset), while the Gaussian data contained 4,000 trials (100 normally distributed data points for each of the 39 phones and silence).
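Our reading of this generation procedure can be sketched as follows (Python; the function name, the per-dimension independence assumption, and the use of z · SE as the spread of the normal distribution are our interpretation, not the authors' exact code):

```python
import numpy as np

def gaussian_sample(mfcc_rows, n=100, z=2.58, rng=None):
    """Generate n synthetic data points for one phone category: draw
    each MFCC dimension from a normal distribution centred on the
    sample mean, with spread given by the z * SE confidence bound."""
    rng = np.random.default_rng(rng)
    mu = mfcc_rows.mean(axis=0)
    se = mfcc_rows.std(axis=0, ddof=1) / np.sqrt(len(mfcc_rows))
    return rng.normal(mu, z * se, size=(n, mfcc_rows.shape[1]))

# demo: with identical training rows, all generated points sit at the mean
synthetic = gaussian_sample(np.ones((50, 3)), rng=0)
```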
The two datasets – Raw vs. Gaussian input – were treated identically during training. For the two ECL algorithms the learning rate was set to λ = 0.0001, to safeguard from overflow, given the considerable range of MFCC values (min = −90.41, max = 68.73, range = 159.14). This learning rate was kept constant throughout all simulation runs. Additionally, when training with the Temporal Difference model, the decay parameter of future reward expectation was set to the commonly used value of γ = 0.5. Upon completion, each training yielded a weight matrix, with the MFCCs as input cues in rows and the phone outcomes in columns (39 × 40 matrix).
The training resulted in six weight matrices, given the two types of inputs (Gaussian and Raw) and the three learning models (Memory-Based, MBL; Widrow-Hoff, WH; and Temporal Difference, TD). These matrices were used to predict phones in the test dataset (N = 207,399, or 10% of all trials). Raw MFCCs were fed to the algorithms as input and were weighted by the weights from the six respective matrices, learned under the distinctive learning regimes. We used weighted input sums, also called total or net input, as the current activation for each of the 40 phone outcomes, and the absolute length (1-norm) of the weighted sum indicated the diversity or the competition among the possible outcome phones (for details and prior use, see Milin, Feldman, et al., 2017). The prediction of the system was the phone with the highest ratio of activation over diversity, or support over the competition.
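This decision rule can be sketched as follows (Python; computing the diversity per outcome column is our reading of the '1-norm of the weighted sum', not the authors' exact implementation):

```python
import numpy as np

def predict_phone(W, c):
    """Pick the outcome with the highest support over competition:
    activation (net input) divided by diversity (1-norm of the
    cue-weighted contributions to that outcome)."""
    weighted = W * c[:, None]                 # each weight scaled by its cue
    activation = weighted.sum(axis=0)         # net input per outcome
    diversity = np.abs(weighted).sum(axis=0)  # competition per outcome
    support = np.divide(activation, diversity,
                        out=np.full_like(activation, -np.inf),
                        where=diversity > 0)
    return int(np.argmax(support))

# toy demo: outcome 0 gets consistent support, outcome 1 mixed support
winner = predict_phone(np.array([[1.0, -1.0], [1.0, 1.0]]),
                       np.array([1.0, 1.0]))
```

The ratio rewards outcomes whose evidence all points the same way, rather than outcomes with large but conflicting weights.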
In addition to the ECLs we also trained an MBL with k-Nearest Neighbors, using Euclidean distances and k = 7 neighbors. The memory resource was built from corresponding input-outcome data pairs: NRaw = 1,866,599 pairs for Raw and NGaussian = 4,000 pairs for Gaussian data. Next, MFCC input cues were used to find the 7 nearest neighbors in the test data, and the outcome of a majority vote was considered as the predicted phone. That prediction was then weighted by the system's confidence (i.e., the probability of that majority vote against all other votes). For all MBL simulations we made use of the class package (Venables & Ripley, 2002) in the R software environment (R Core Team, 2020).
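The authors used the class package in R; the same idea in outline (a Python sketch, our own implementation):

```python
from collections import Counter

import numpy as np

def mbl_predict(train_X, train_y, x, k=7):
    """k-Nearest-Neighbor prediction with a confidence-weighted
    majority vote: Euclidean distance, k neighbors, confidence
    expressed as the majority's share of the k votes."""
    dists = np.linalg.norm(train_X - x, axis=1)
    neighbors = train_y[np.argsort(dists)[:k]]
    label, votes = Counter(neighbors).most_common(1)[0]
    return label, votes / k  # predicted phone and its confidence

# toy demo: the three nearest stored exemplars all carry label "a"
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array(["a", "a", "a", "b", "b"])
label, conf = mbl_predict(X, y, np.array([0.05]), k=3)
```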
3. Results and Discussion

The following four subsections present independent but related pieces of evidence for assessing the learnability of abstract categories, given MFCC input. Whether or not phones are learnable can be inferred from success in predicting new and unseen data from the same speaker (Section 3.1) and, more challengingly, from another speaker (Section 3.2). At the more general level, questions such as reliability and generalizability are of crucial importance. We will, hence, explore how consistent or reliable the success in predicting phones is across multiple random samples of the data (Section 3.3). Finally, it is important to understand whether what is learned generalizes vertically: do we see, and if so to what extent, an emergence of plausible phone clusters, resembling traditional classes, directly from the learned weights (Section 3.4)?
3.1. More of the same: predicting new input from the same speaker

We can consider our simulations as, essentially, a 3 × 2 experimental design, with as factors Learning Model (Memory-Based, Widrow-Hoff, and Temporal Difference) and Type of Input data (Raw vs. Gaussian). The success rates our three algorithms achieve in predicting unseen data, given Raw or Gaussian input, are summarized in Table 1.
Table 1. Success rates of predicting the correct phone for three different learning models (Widrow-Hoff, Temporal Difference, and Memory-Based Learning) using two different types of input data (Raw or Gaussian). The leftmost column represents the probability of a given phone in the Raw training sample.

phone     Sampling      Raw                      Gaussian
          Probability   MBL     WH     TD        MBL     WH     TD
silence   8.03          54.26   41.58  52.33     58.19   34.58  34.51
aa        2.12          42.51   0.63   4.20      38.86   17.32  16.55
ae        2.64          52.82   19.78  30.02     51.66   29.02  31.26
ah        5.49          26.31   11.94  11.65     24.12   32.21  30.03
ao        1.02          38.24   5.38   8.33      23.13   19.15  19.34
aw        0.79          27.88   0.64   2.09      13.18   11.62  15.38
ay        2.58          41.03   0.00   10.14     28.35   23.83  26.13
b         1.22          46.46   11.24  7.50      16.96   23.42  24.92
ch        0.86          24.95   11.11  0.00      13.61   8.68   8.44
d         2.43          41.85   3.24   2.57      17.55   15.63  15.12
dh        0.07          17.65   0.00   0.00      1.39    0.66   0.74
eh        2.36          26.36   0.00   1.55      24.46   21.46  18.08
er        2.83          37.78   19.23  24.03     32.91   24.96  24.61
ey        2.26          38.61   0.56   4.98      29.36   18.92  16.19
f         1.66          42.94   15.34  20.31     32.49   18.71  14.64
g         0.84          51.01   5.67   8.00      14.41   5.94   5.44
hh        0.49          44.83   6.07   4.12      11.22   7.24   8.17
ih        3.97          28.00   12.34  9.90      25.00   35.26  31.93
iy        4.41          64.00   0.99   13.17     64.75   45.84  39.40
jh        0.76          39.87   2.58   2.83      17.65   8.47   9.21
k         4.03          66.56   22.09  27.87     41.05   21.16  21.41
l         5.75          65.68   23.02  25.87     73.81   47.24  45.38
m         2.75          68.34   26.17  26.91     49.61   43.30  37.75
n         5.79          64.07   27.46  40.95     69.66   59.99  58.47
ng        1.84          64.54   19.72  59.12     45.05   30.56  33.15
ow        1.62          44.91   1.63   7.79      42.60   27.20  25.21
oy        0.15          13.35   0.00   0.00      3.84    2.18   1.90
p         2.04          44.08   26.51  42.66     11.82   22.08  22.56
r         4.32          44.34   51.43  49.82     52.37   36.57  32.48
s         9.57          60.89   2.30   50.17     66.07   53.27  55.51
sh        1.86          52.89   11.93  0.00      49.32   19.27  19.15
t         5.00          33.68   21.68  22.53     19.64   20.52  20.59
th        0.26          15.07   0.00   0.00      3.52    1.68   1.67
uh        0.17          7.44    0.00   0.00      0.86    0.74   0.86
uw        0.94          53.29   1.65   4.49      40.65   16.70  9.01
v         1.00          35.75   9.97   3.75      14.69   15.46  15.17
w         0.81          52.31   0.36   1.08      18.41   14.66  13.01
y         0.39          28.63   4.00   5.42      7.11    2.18   2.01
z         4.78          45.08   17.26  19.08     41.43   29.14  29.70
zh        0.08          27.46   0.23   0.35      3.69    1.16   1.16
Overall, the results reveal above-chance performance against both the naive or 'flat' probability of any given phone occurring by chance, which stands at 2.5% assuming an equiprobable distribution of the 40 outcome instances, and the probability of the phone occurring in the sample, which is, essentially, its relative frequency in the training sample. The error-correction learning rules (WH and TD) performed below the sampling probability only in a few cases, and only when using raw training data. Out of 40 instances, Widrow-Hoff underpredicted 12 phones, while Temporal Difference underpredicted 7 phones. MBL, on the other hand, always performed above the baselines set by chance and sampling probability.
The strength of the correlation between the probability of the phone in the test sample, on the one hand, and the success rates of the three learning models, on the other hand, adds an interesting layer of information for assessing the overall efficiency of the chosen learning models. To bring this out, we made use of Kendall's tau-b, a Bayesian non-parametric alternative to Pearson's product-moment correlation coefficient. Table 2 contains these correlation coefficients and reveals two interesting insights.

Table 2. Bayesian Kendall tau-b between the probability of a phone in the test sample and the combination of Learning Model (MBL, WH, TD) and Type of Data (Raw and Gaussian).

                    Raw                     Gaussian
                    MBL    WH     TD        MBL    WH     TD
Test-sample Prob.   0.349  0.467  0.576     0.638  0.753  0.723
Firstly, despite the fact that, for all three learning algorithms, the Gaussian dataset did not reflect the sampling probabilities of the respective phonemes (100 generated data points per phone; N = 4,000 total training sample size), the correlation with the phones' probabilities in the test sample increased considerably (importantly, the correlation between probabilities in the training and the test sample was near-perfect: r = 0.99998). In other words, even though the number of exemplars, i.e., the extent of exposure, was kept constant across all phones, the algorithms' successes resembled the likelihood of encountering the phones in the Raw samples, both for training and testing. Generally speaking, this then represents indirect evidence of learning rather than of simply passing through the equiprobable expectations (given that each outcome had a 2.5% chance to occur).
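For reference, the classical (non-Bayesian) tau-b point estimate behind correlations of this kind can be computed as follows (Python sketch; the paper reports Bayesian estimates, which this does not reproduce):

```python
import numpy as np

def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, normalized
    with a correction for ties in both variables."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    conc = disc = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                ties_x += 1; ties_y += 1
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    n0 = n * (n - 1) / 2
    return (conc - disc) / np.sqrt((n0 - ties_x) * (n0 - ties_y))
```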
Secondly, these correlation coefficients are considerably higher for Gaussian data than for their Raw counterparts. Likewise, the success rates for the two ECL algorithms increase. However, the opposite holds for the MBL algorithm: MBL shows a higher correlation for Gaussian data but combines this with a lower success rate when that same Gaussian data is used as stored exemplars.
Taken together, these two observations indicate that the ECL algorithms can do more with less. In addition, they are sensitive to the order of exposition, or the trial order effect, in the case of Raw data: the Raw data was fed to the learning machines in a particular order of time-overlapping MFCC-phone pairs, while the Gaussian data was computer-generated and randomized. The incremental nature of ECL makes many models from this family deliberately sensitive to the order of presentation in learning.

For example, the Rescorla-Wagner (1972) rule, the discrete data counterpart of the Widrow and Hoff (1960) rule, is purportedly susceptible to order. This is, in fact, the core reason why it can simulate the famous blocking effect ("Predictability, surprise, attention, and conditioning", n.d.). Conversely, MBL simply stores all data in as-is fashion and is, consequently, blissfully unaware of any particular order in which the data comes in and, for that matter, does not assign any particular value to the order (assuming unlimited storage).
To examine whether there is an interaction between Learning Model and Type of Input data we applied Bayesian Quantile Mixed Effect Modeling in the R software environment (R Core Team, 2020) using the brms package (Bürkner, 2018). Bayesian Quantile Modeling was chosen as a particularly robust and distribution-free alternative to Linear Modeling. We examined the predicted values of the response distribution and their (Bayesian) Credible Intervals (CI) at the median point: Quantile = 0.5 (Schmidtke, Matsuki, & Kuperman, 2017; Tomaschek, Tucker, Fasiolo, & Baayen, 2018, provide more details about criteria for choosing the criterion quantile-point). The final model is summarized in the following formula:

SuccessRate ∼ LearningModel ∗ TypeOfInput +
(1 | phone)

The dependent variable was the success rate, and the two fixed effect factors were learning model (3 levels: MBL, WH, TD) and type of input data (2 levels: Raw, Gaussian). Finally, 40 phone categories represented the random effect factor. The model summary is presented in Figure 1.
Figure 1. Conditional effect of Learning Model and Type of Input on Success Rates with respective 95% Credible Intervals. The black dashed line represents the flat chance level at 2.5%.
If we compare MBL and ECL (jointly WH and TD), first, we see that MBL shows a better overall performance (MBL > ECL: Estimate = 11.24; Low 90% CI = 8.56; High 90% CI = 13.95; Evidence Ratio = Inf). That difference, however, attenuates when the Gaussian sample is used (MBL > WH: Estimate = 6.49; Low 90% CI = 0.49; High 90% CI = 12.32; Evidence Ratio = 0.04; MBL > TD: Estimate = 5.88; Low 90% CI = 0.04; High 90% CI = 11.69; Evidence Ratio = 0.05). Thus, we can conclude that MBL performs better with more data (Raw), which is to be expected as this type of dataset provides a very large pool of memorized exemplars. The two ECLs, on the other hand, show improved performance with Gaussian data. The improvement is greater for WH than for TD, but that difference is only weakly supported (Estimate = 2.42; Low 90% CI = −1.36; High 90% CI = 6.21; Evidence Ratio = 5.86).
Interestingly, less is more for ECL models: less data gives an overall better performance. Furthermore, with Gaussian data, a 'cleverer' learning model (i.e., Temporal Difference) does not yield better results. It seems that more data leads to overfit (for a detailed discussion, see also Milin, Madabushi, et al., 2020). Although it does not cancel out the possibility of overfitting the data, our interpretation of this finding is that the effect is, most likely, yet another consequence of ECL models' sensitivity to the order of trials. As explained above, the Raw data was presented in a particular order, and inasmuch as the ECL models are sensitive to this effect (Milin, Madabushi, et al., 2020), the MBL model is immune to it. Further to this, the advantage that TD shows in comparison to WH with Raw data stems from the fact that an immediate reward strategy (as implemented in WH) performs worse than a strategy which considers future rewards when the signal is weak, that is, when the data is noisy (for an interesting discussion in the domain of applied robotics see Whelan, Prescott, & Vasilaki, 2020). When that noise is purely Gaussian, however, information about the future does not yield much advantage.

Generally speaking, with Gaussian data, the differences between the success rates of the three learning models become insignificant: while MBL becomes less successful, both ECL models benefit. Larger samples with more Gaussian generated data, however, do not improve the performance of ECL further. In fact, larger samples of N = 1,000; 10,000; and 1,866,600 (the last to match the size of the Raw data sample) consistently worsened their prediction success (respective averages for Widrow-Hoff learning are: MN=100 = 21.70%; MN=1K = 19.49%; MN=10K = 14.75%; MN=1.8M = 12.28%).
3.2. Spicing things up: predicting input from a different speaker
The downward trend of success exhibited by WH learning that is observed as the size of the Gaussian generated sample increases provides indirect support for the assumption that ECL learning rules may indeed be susceptible to overfitting as the data becomes abundant. This sheds light on yet another aspect of the opposition between the two learning principles, MBL and ECL: what 'pleases' the former harms the latter, and vice versa. This difference begs the question of whether fidelity to the data leads to a dead-end once the task becomes even more demanding; that is when, for example, the aim is to predict a different speaker.
For this we use what was learned in the previous step: directly stored pairs of input MFCCs and outcome phones for MBL (NRaw = 1,866,599 stored exemplars) and the resulting WH and TD weight matrices (using the same NRaw = 1,866,599 learning trials), to predict phones from MFCCs that were generated by a new female speaker. The dataset consisted of N = 1,699,643 test trials from the production of about 27,000 words. To test the ECL models we used the two weight matrices from training on Raw input, one with WH and another with TD. Then, raw MFCCs from the new speaker were weighted with weights from the two corresponding matrices and then summed to generate the activation for each of the 40 phone outcomes (i.e., the net input), together with their diversity (the absolute length or 1-norm) (cf. Milin, Feldman, et al., 2017). As before, the predicted phone was the one with the strongest support over the competition (the ratio of activation over diversity). To test the MBL model, new MFCC input was 'placed' among the 7 Nearest Neighbors to induce the majority vote, weighted by the system's confidence in that vote (i.e., the size or probability of that majority). The phone winning the majority vote was the predicted phone.
The results from this set of simulations are unanimous: the task of predicting a
differentspeakeristoodifficultandallthreelearningalgorithmsperformedatchance
level.Theestimatedperformance(successrates)wasgivenbythefollowingBayesian
QuantileMixedEffectModel:
SuccessRate∼LearningModel+
(1|phone)
Theresultingestimatesforthethreelearningmodelsare:MBLEstimate=2.28(Low
95%CI=1.29;High95%CI=3.36),TDEstimate=1.86(Low95%CI=0.97;High95%
24
CI=2.93),WHEstimate=1.69(Low95%CI=0.83;High95%CI=2.79).Inotherwords,
allthreelearningalgorithmsfailedtogeneralizetoanewspeaker.
In an attempt to better understand the results we generated confusion matrices for each model (see supplementary material). The confusion matrices for the first speaker show that for the MBL model the confusion matrix mirrored what we might expect from a phonetic/perceptual perspective. For example, vowels were most confusable with vowels and /s/ was confusable with /z/. This does not carry over for the TD and WH models, where the confusions seem more random. In the confusion matrices for the new speaker the results are much closer to chance for most of the phones. As in the confusion matrices with the original speaker, the MBL model mirrored what might be expected from a perceptual perspective. Generally, across all the confusion matrices we find a bias for silence in the responses. This confusion is probably at least in part caused by the fact that part of the segment for voiceless stops and affricates consists of silence.
3.3. Assessing the consistency of learning
The different and task-dependent profiles of ECL and MBL models pose questions
regarding the reliability of different learning approaches. In otherwords, does their
performance remain stable across independent learning sessions (i.e., simulated
individual learners’ experiences) or is their success or failure rather a matter of
serendipity?Inprinciple,learningisadomaingeneralandhigh-levelcognitivecapacity
(c.f.,Milin,Nenadi´c,&Ramscar,2020)and,hence,itoughttoshowadesirablelevelof
consistencyorresiliencetorandomvariationsand,atthesametime,allowadesirable
level of systematicdifferences across individuals. Failure to achieve consistency and
predictability across different sets of learning events and/or learners (individuals),
woulddisqualifyalearningmodelasaplausiblecandidate-mechanism.
Given the fact that a two-year-old child's receptive vocabulary consists of approximately 300 different words (Hamilton, Plunkett, & Schafer, 2000), we sampled without replacement 5 random samples of 300 words each. These 300-word samples mapped to a somewhat different number of input MFCC and output phone pairs. As before, such input-output pairs were used in the simulations. First, each of the 5 samples of MFCC-phone pairs (i.e., learning events) was replicated 1,000 times to simulate solid entrenchment. Next, to keep these learning events as realistic as possible and, at the same time, to prevent overfitting, we added Gaussian noise (with a small deviation) to 50% randomly chosen MFCCs: this reflects minor variations in pronunciation across trials. All five random samples were used to train our three main learning algorithms, both ECLs (Widrow-Hoff and Temporal Difference) and MBL (using Euclidean distances to find the 7 Nearest Neighbors), in five independent learning sessions.
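The construction of one such simulated learning session might look like this (Python sketch; noise_sd is an assumed value, since the paper specifies only a 'small deviation'):

```python
import numpy as np

def build_session(mfccs, phones, reps=1000, noise_sd=0.1, rng=None):
    """Replicate one sample's MFCC-phone pairs to simulate
    entrenchment, adding small Gaussian noise to a random 50% of the
    replicated MFCC rows to mimic pronunciation variability."""
    rng = np.random.default_rng(rng)
    X = np.tile(mfccs, (reps, 1))
    y = np.tile(phones, reps)
    mask = rng.random(len(X)) < 0.5              # 50% of rows get noise
    X[mask] += rng.normal(0, noise_sd, size=(mask.sum(), X.shape[1]))
    return X, y

# toy demo: 3 learning events replicated 10 times
X, y = build_session(np.zeros((3, 2)), np.array([0, 1, 2]), reps=10, rng=0)
```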
In parallel, we generated 5 test samples. Each test sample consisted of 200 words in total, half of which were 'known' words, sampled from the corresponding training sample, while the other half of words were 'new', not previously encountered. As before, WH and TD learning yielded learning weight matrices, while MBL simply stored all training exemplars as unchanged input-output pairs. Then, the input MFCCs from the test samples were used to predict phone outcomes, as before: (a) for WH and TD the predicted phone was the one with the strongest support against the competition (i.e., the highest ratio of activation over diversity); (b) MBL calculated Euclidean distances to retrieve k = 7 nearest neighbors, and to determine their majority vote, weighted by the probability of that vote against all other votes.
This simulation provided success rates for each of the three learning algorithms, separately for each of the five individual learning sessions (generated samples), and for each of the forty phones. Again, we made use of Bayesian Quantile Mixed Effect Modeling as implemented in the brms package (Bürkner, 2018) in R, and evaluated the model at the median point. This model, too, used the success rate as dependent variable, and the learning model as fixed effect factor (with 3 levels: MBL, WH, TD). In addition to this, it contained two random effects: 40 phones and 5 individual learning sessions:

SuccessRate ∼ LearningModel +
(1 | phone) +
(1 | Session)
Consistent with our previous findings, MBL showed a significantly better performance (MBL > ECL: Estimate = 45.49; Low 90% CI = 42.04; High 90% CI = 48.76; Evidence Ratio = Inf). The TD model achieved a marginally better success rate than the WH model (TD > WH: Estimate = 2.00; Low 90% CI = −0.61; High 90% CI = 4.65; Evidence Ratio = 8.83). Strikingly, however, closer inspection of the models' performance revealed that in 2 out of 5 sessions MBL's average success was in fact equal to zero. Across the five sessions, the median success rates for MBL were: 72.51, 0.67, 68.00, 71.47, 0.00. Thus, we can conclude that in 40% of cases our simulated 'individuals' did not learn anything at all, while in the other 60% they were luckily successful.
Thus, our follow-up analysis used Median Absolute Deviations (MAD), a robust alternative measure of dispersion, as the target dependent variable. The MAD was calculated for each phone across the five learning sessions. The resulting dataset was submitted to the same Bayesian Quantile Mixed Effect Modeling to test the following model (note that Sessions entered into the calculation of MADs, which is why they could not appear as a random effect):

MAD ∼ LearningModel +
(1 | phone)
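The MAD itself is straightforward (Python sketch, unscaled); applied to MBL's five session medians reported in Section 3.3 (72.51, 0.67, 68.00, 71.47, 0.00), it quantifies the cross-session instability:

```python
import numpy as np

def mad(values):
    """Median Absolute Deviation: the median of absolute deviations
    from the median; a robust alternative to the standard deviation."""
    v = np.asarray(values, dtype=float)
    return float(np.median(np.abs(v - np.median(v))))

# MBL's five session medians from Section 3.3
mbl_spread = mad([72.51, 0.67, 68.00, 71.47, 0.00])
```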
The results confirmed the previously observed significantly higher inconsistency, as indicated by higher MAD, of MBL vs. ECL (MBL > ECL: Estimate = 5.03; Low 90% CI = 2.37; High 90% CI = 7.96; Evidence Ratio = 1332.33). The two ECL models, however, did not differ significantly among themselves (TD > WH: Estimate = −0.97; Low 90% CI = −3.68; High 90% CI = 1.83; Evidence Ratio = 2.55).
Taken together, these results cast doubt on MBL as a viable and plausible mechanism for learning phones from auditory input. On average, MBL performs rather well, but it fails the reliability test and, consequently, cannot be considered a plausible mechanism for learning phones from raw input: too many learners would not learn anything at all. Conversely, while ECL models show poorer average performance, they display higher resilience and consistency across individual learning sessions.
3.4. Abstracting coherent groups of phones
From the results presented so far we can be cautiously optimistic about what can be learned from the MFCC input signal. This seems reasonable considering this is the input for most automatic speech recognition systems. There is some systematicity (for details see Divjak, Milin, Ez-zizi, Józefowski, & Adam, 2020; Milin, Nenadić, & Ramscar, 2020), and that appears to be picked up by learning models. This is, of course, relative to the task at hand, which in this case was predicting a phone from new, raw input MFCCs (the test dataset).
The last challenge for our learning models is to examine whether the learned weights facilitate the meaningful clustering of phones into traditionally recognized groups. We have assessed the potential of the three learning models to generalize horizontally, i.e., to predict new data from the same speaker and to predict new data from a new speaker. Since traditional classes of phones are well established in the literature, the question is whether our learning models also generalize vertically. That is, do they yield emerging abstractions of plausible groups of phones which should, at least to a notable extent, match and mirror the existing phonetic groupings of sounds?
To facilitate deliberations on the models' ability to make vertical generalizations, we applied hierarchical cluster analysis using the pvclust package (Suzuki, Terada, & Shimodaira, 2019) in the R software environment (R Core Team, 2020) to the weight matrices. We focus here on the results for the most and the least successful learning simulations, i.e., WH and MBL using Raw data, and discuss their differences in generating 'natural' groups of phones. We also add the results for TD learning on Raw data, for completeness. The three cluster analyses do not only present the extremes given their prediction successes, but are also the most interesting. Further cluster analyses can be run directly on the weight matrices provided.

All cluster analyses presented here used Euclidean distances and Ward's agglomerative clustering method, with 1,000 bootstrap runs. The results are summarized in Figures 2, 3, and 4. The R package pvclust allows assessing the uncertainty in hierarchical cluster analysis. Using bootstrap resampling, pvclust makes it possible to calculate p-values for each cluster in the hierarchical clustering. The p-value of a cluster ranges between 0 and 1, and indicates how strongly the cluster is supported by data. The package pvclust provides two types of p-values: the AU (Approximately Unbiased) p-value (in red) and the BP (Bootstrap Probability) value (in green). The AU p-value, which relies on multiscale bootstrap resampling, is considered a better approximation to an unbiased p-value than the BP value, which is computed by normal bootstrap resampling. The grey values indicate the order of clustering, with smaller values signaling clusters that were formed earlier in the process, and hence contain elements that are considered more similar. The first formed cluster in Figure 2, for example, contains /dh/ and /y/, while the second one contains /oy/ and /uh/. In a third step, the first cluster containing /dh/ and /y/ is joined with /th/.
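The dendrogram structure (though not pvclust's bootstrap AU/BP p-values) can be reproduced with standard tools; a Python sketch with assumed toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_phones(weights, phone_labels, n_groups):
    """Ward agglomerative clustering of phones, each phone represented
    by its column of learned weights; Euclidean distance is scipy's
    default for the 'ward' method."""
    Z = linkage(weights.T, method="ward")
    groups = fcluster(Z, t=n_groups, criterion="maxclust")
    return dict(zip(phone_labels, groups))

# toy weight matrix: two well-separated pairs of hypothetical 'phones'
toy = cluster_phones(np.array([[0.0, 0.1, 9.0, 9.1],
                               [0.0, 0.2, 9.0, 9.2]]),
                     ["p1", "p2", "p3", "p4"], n_groups=2)
```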
Figure 2. Dendrogram of phone clustering using Widrow-Hoff weights from training on Raw data.
The Widrow-Hoff learning weights appear to yield rather intuitive phone groups (Figure 2). Specifically, the left-hand side of the dendrogram reveals small and relatively meaningful clusters of phones. On the right-hand side of the same panel, however, there is a poorly differentiated lump consisting of the 20 remaining phones that do not yet appear to have been learned sufficiently well to form meaningful (i.e., expected) groupings. Nevertheless, some interesting patterns emerge. For example, the voiceless stops cluster together with silence in a later step; considering that a good portion of the voiceless stop is likely silence, this clustering seems fairly logical. The voiceless alveolar fricative /s/ occurs largely by itself, and then there is another cluster containing the voiced labial stop /b/, the voiced alveolar stop and lateral /d, l/, and the voiceless and voiced labio-dental fricatives /f, v/. This cluster is largely made up of voiced sounds articulated in the front of the vocal tract. The third cluster is largely made up of vowels, a sonorant, and the voiceless palato-alveolar fricative /sh/. This cluster seems to contain most of the vowels along with /r/ and /er/, which also have a fairly open vocal tract during articulation. However, the /sh/ is a surprise, and at this point, any possible explanations as to why /sh/ is in this cluster would be guesswork. The last cluster contains the remaining 20 phones, which seem to be grouped in a mixture of the remaining sounds with no apparent pattern. Note in passing that the WH dendrogram also passes the litmus test of phone recognition, i.e., the distinction between /l/ and /r/. In Figure 2 the /l/ and the /r/ individually belong to clusters that are estimated to replicate on different data.
The results for clustering TD weights are less likely to replicate. Few clusters achieve an AU value of 95 or higher. However, the clustering is highly similar to the WH dendrogram. There is a later-formed cluster containing the voiceless stops /k, p, t/ on the left-hand side, next to a cluster linking /b, d, m, f, v, g, hh, w/, which includes the voiced stops and some of the bilabial and labio-dental consonants. A third reliable cluster, on the right-hand side, unites 13 phones. These phones are largely made up of affricates, though there are a few vowels and a nasal in this group as well. The fourth reliable cluster is so high-level that it links 35 out of the 40 phones together. Note that /s/ and silence are not included in this cluster. Here we also note that the distinction between /l/ and /r/ is less reliable than in the WH analysis.
Figure 4 tells a very different story, or rather: no story at all. The clusters make little sense from a phonetic perspective. Sounds do not seem to cluster together as would be expected, and no alternative pattern becomes apparent either. The p-values for the clusters in Figures 2 and 4 reveal some striking differences. Despite some quibbles with the WH phone groups (Figure 2), discussed above, many of the clusters in this solution are rather likely to replicate, and this applies in particular to mid- and high-level clusters (formed later, cf. the grey values) on the left-hand side of the diagram. The clusters in the MBL solution, however, show the opposite tendency, where only one cluster (consisting of /oy/ and /uh/) receives an AU p-value equal to or higher than .95. In particular, higher-order clusters fare poorly in this respect, with all but one receiving no chance at all of replicating in a different dataset. Note, also, that the values on the vertical axis are very different for WH/TD and MBL: MBL clusters start forming at value 0.5, by which point the WH and TD cluster solutions have already converged. The chance-like behavior of MBL in clustering what is learned is in line with other results presented here, and in particular with the highly inconsistent and serendipitous nature of 'learning' exhibited by this particular model.
Figure 3. Dendrogram of phone clustering using Temporal Difference weights from training on Raw data.
In summarizing the results of the cluster analysis, two points appear to be of particular interest. First, as pointed out before, learning 'success' does not exist in a vacuum or as a universal notion, but is relative to performance on the task at hand. In that respect, and in the context of the task of predicting new input or indicating higher-order relationships between phones, we can conclude that MBL performs better on predicting unheard input (particularly when the set of exemplars is sufficiently large), but worse on generating meaningful natural groups of phones. As Nathan (2007) concluded: whatever the role of exemplars might be, they do not make phonemic representations superfluous.

The WH rule shows exactly the opposite performance, and is better in generating meaningful natural groups of phones but worse in predicting unheard input. Second, the meaningfulness of the obtained phone groups varies from those that are highly plausible to some that are not sufficiently learned but make a 'primordial soup' of diverse phones in a rather unprincipled way. Whether humans experience the same trajectory in the hypothesized acquisition of such phone groups is, however, beyond the scope of this study.
Figure 4. Dendrogram of phone clustering using Memory-Based probabilities from training on Raw data.
4. General discussion

We set out to test two opposing principles regarding the development of language knowledge in linguistically untrained language users: Memory-Based Learning (MBL) and Error-Correction Learning (ECL). MBL relies on a large pool of veridically stored experiences and a mechanism to match new ones with already stored exemplars, focusing on efficiency in storage and retrieval. ECL filters the input data in such a way that the discrepancy between the system's prediction of the outcome (as made on the basis of available input) and the actual outcome is minimized. Through filtering, those dimensions of the experience are removed that are not useful for reducing the prediction error. What gets stored, thus, changes continuously as the cycles of prediction and error-correction alternate. Such an 'evolving unit' cannot be a veridical reflection of an experience, but contains only the most useful essence, distilled from many exposure-events (or usage-events viz. Langacker, 1991).
A process of generalization underlies the abstractions linguists operate with, and we probed whether MBL and/or ECL could give rise to a type of generalized language knowledge that resembles linguistic abstractions. Each model was presented with a significant amount of speech (approximately 5 hours) produced by one speaker, created from the MALD stimuli (Tucker et al., 2019). The speech signals were transformed into MFCC representations and then paired with the 1-hot encoded outcome phone.

Against such data we pitted three learning rules: a memory-greedy one (MBL), a short-sighted one (WH), and one that explicitly takes into account future data (TD). These three rules, again, map onto the principles of Memory-Based vs. Error-Driven Learning. We put both types of learning algorithms (MBL vs. WH and TD) through four tests that probe the quality of their learning by assessing the models' ability to predict data that is very similar to the data they were trained on (same speaker, different words) as well as data that is rather different from the data they were trained on (different speaker, same/different words). We also assessed the consistency or stability of what the models have learned and their ability to give rise to abstract categories. As expected, both types of models fare differently with regard to these tests.
4.1. The learnability of abstractions
While MBL outperforms ECL in predicting new input from the same speaker, this is only the case for Raw data; with Gaussian data, the learning models do not differ significantly in their performance. Interestingly, for ECL less is more: training WH and TD on fewer data points with Gaussian noise yields better prediction results than training WH on Raw and big data. This indicates that WH may in fact overfit in the case of larger datasets, essentially overcommitting to the processed input-outcome data. Another plausible interpretation of these results is that ECLs, when learning from Raw data, are affected by the order of trials. Yet, neither an order effect nor overfitting to input data is necessarily unnatural or negative, as they represent known phenomena in, e.g., language acquisition more generally (see, for example, Arnon & Ramscar, 2012; Ramscar & Yarlett, 2007) as well as in L1 and L2 learning of phonetic input specifically.
Infants, at least in Western cultures, typically have one primary caregiver, which makes it likely that they, too, 'overfit their mental models' to the input of that particular caregiver. There is evidence from learning in other sensory domains, i.e., vision, that casts this approach in a positive light: infants use their limited world to become experts at a few things first, such as their caregiver's face(s). The same, very few faces are in the infant's visual field frequently and persistently, and these extended, repeated exposures have been argued to form the starting point for all visual learning (Jayaraman & Smith, 2019).
L2 learners are also known to overfit their input data, and this tendency has been found to hinder their progress (for a review, see Barriuso & Hayes-Harb, 2018). To remedy the problem, researchers have experimented with high versus low variability phonetic training. High variability conditions include input with multiple talkers and contexts, and are hypothesized to support the formation of robust sound categories and boost the generalizability to new talkers. While it has been shown that adult L2 learners benefit from such training (for the initial report, see Lively, Logan, & Pisoni, 1993), child L2 learners do not appear to benefit more from high variability training than from low variability training (Barriuso & Hayes-Harb, 2018; Brekelmans, Evans, & Wonnacott, 2020; Giannakopoulou, Brown, Clayards, & Wonnacott, 2017).
However, while overfitting may be an advantage in some learning situations, it may
cause challenges in others. The use of a single speaker may also introduce a challenge to
the model, not dissimilar to the long-known speaker normalization problem in speech
perception (e.g., Ladefoged & Broadbent, 1957). Speaker normalization refers to the
process by which a listener adapts to sounds (typically vowels) produced by different
speakers, and denotes the line of research in speech perception that investigates how
listeners achieve this adaptation. Normalization is needed because different speakers
have individual characteristics, and the acoustic characteristics of a vowel produced by
a male speaker may be similar to the acoustic characteristics of an entirely different
vowel produced by a female speaker (e.g., Barreda & Nearey, 2012). In our modeling,
the learner has not been given an opportunity to acquire knowledge about other
speakers, and as a result it may not have learned to create categories that are
speaker-independent.
At some point, however, learners are exposed to input produced by another speaker
than their primary caregiver or teacher. We simulated this situation by exposing our
models, trained on input from one speaker, to input produced by a novel speaker. This
time, unfortunately, all learning models performed at chance level, showing that they
were unable to generalize what they had learned to a new speaker. This did not come as
a complete surprise, however, as the challenge was deliberately extreme, not allowing
for any prior experience with that speaker at all. Incrementally adding input from
different speakers into the mix will be informative about how long and to what extent
the model can or must keep on learning to be able to generalize to new, previously
unheard speakers.
The idea of individual differences in early learning was addressed in a systematic
fashion. Somewhat simplistically, we simulated 5 child-like learners that each
experienced plenty of minimally different examples of a limited number of words,
spoken by one and the same speaker. When tested on seen and unseen words, MBL
appeared to outperform ECL, but closer inspection of the model revealed that this
advantage applied to a non-existent average only: 2 out of the 5 learners who took an
MBL approach had learned nothing at all. This, then, casts doubt on the cognitive
plausibility of this particular learning approach, which performs desirably at the
population level but fails many individuals. In contrast to MBL, both ECL
models showed lower success on average, yet they are the more reliable choice across
individuals.
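The 'non-existent average' is easy to see numerically; the accuracies below are invented for illustration and are not the scores from our simulations.

```python
# Hypothetical accuracies for 5 simulated learners: 3 learn well, 2 learn nothing.
accuracies = [0.9, 0.85, 0.8, 0.0, 0.0]

mean_acc = sum(accuracies) / len(accuracies)   # population-level summary
n_failed = sum(a == 0.0 for a in accuracies)   # individual-level reality

print(round(mean_acc, 2))  # 0.51 -- looks like modest overall success...
print(n_failed)            # 2   -- ...yet 2 of 5 learners acquired nothing
```

A population mean above chance is thus perfectly compatible with a learning mechanism that fails outright for a substantial subset of individuals, which is why per-learner inspection matters.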
In a last step, we tested to what extent weights (from the two ECLs) and probabilities
(from MBL), which we understand as summative proxies of learning, facilitate the
meaningful clustering of phones into traditionally acknowledged groups.
Even though ECL was trained on its least favorite type of data (Raw), it performed
reasonably well: it gave rise to a dendrogram that allocated at least half of the phone
inventory to groups that resemble their traditional phonetic classification. Furthermore,
the groups it identified were estimated to have a high chance of being detected in new,
independent datasets. MBL, on the other hand, performed exceptionally poorly, even
though it was trained on its favorite type of data (again, Raw). The absence of any
recognizable systematicity in the MBL-based clustering, and the estimated lack of
replicability of the clusters, is striking indeed.
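The clustering step can be illustrated with a minimal Python analogue of agglomerative (Ward) clustering over per-phone weight vectors; the vectors below are invented, and the sketch omits the bootstrap-based cluster probability estimates (cf. the R package pvclust in the reference list) that underpin the replicability assessment.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical learned representations: one weight vector per phone.
# Rows are phones; columns are cue weights (invented for illustration).
rng = np.random.default_rng(1)
vowel_like = rng.normal(loc=1.0, scale=0.2, size=(4, 8))
stop_like = rng.normal(loc=-1.0, scale=0.2, size=(4, 8))
weights = np.vstack([vowel_like, stop_like])

# Agglomerative clustering (Ward linkage) builds the dendrogram; cutting it
# at two clusters recovers the two underlying groups of phones.
Z = linkage(weights, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

assert len(set(labels[:4])) == 1 and len(set(labels[4:])) == 1
assert labels[0] != labels[4]
```

If the learned weights for phones of the same traditional class lie close together in cue space, the dendrogram will group them; the degree to which it does so is what our clustering test quantifies.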
4.2. Implications for a theoretical discussion of abstractions
Generalization is one of the most basic attributes of the broad notion of learning. To
quote one of the pioneers of Artificial Neural Networks (ANNs), intelligent Signal
Processing and Machine Learning, the point here is that "if generalization is not needed,
we can simply store the associations in a look-up table, and will have little need for a
neural network", where the "inability to realize all functions is in a sense a strength [...]
because it [...] improves its ability to generalize" to new and unseen patterns (Widrow
& Lehr, 1990, p. 1422). At the very least, our findings show that we should not throw the
abstractions-baby out with the theoretical bathwater. Instead, the way in which we
approach the discussion of abstractions should change. Up until now, the amount of
theoretical speculation about the existence of abstractions has been inversely proportional
to the amount of empirical work devoted to providing evidence for or against
the existence of such abstractions. This is specifically true for attempts to model how
such abstractions might emerge from exposure to input.
For example, Ramscar and Port (2016) discuss the impossibility of a discrete
inventory of phonemes and claim that "none of the traditional discrete units of language
that are thought to be the compositional parts of language – phones or phonemes,
syllables, morphemes, and words – can be identified unambiguously enough to serve as
the theoretical bedrock on which to build a successful theory of spoken language
composition and interpretation. Surely, if these units were essential components of
speech perception, they would not be so uncertainly identified" (p. 63). Our results show
that there is more to it than that: the findings are not unanimous, but they speak for much more
than no-thing and much less than every-thing. We have shown that at least part of the
phone inventory can be reliably identified from input by the algorithms applied here – ECL
and MBL. At the same time, not all phones appear to be equally learnable from exposure,
at least not from exposure to one speaker. For any balanced diet, the proof is in the
pudding, but neither in the know-it-all chef nor in the list of ingredients.
Maybe this should not come as a surprise: after all, many if not most abstractions
came into existence as handy theoretical or descriptive shortcuts, not as cognitively
realistic reflections of (individual) users' language knowledge (see, for discussion,
Divjak, 2015b; Milin et al., 2016). Cognitively speaking, the phoneme inventory that is
traditionally assumed for English is possibly either too specific or too coarse in places,
in particular when dealing with the idiosyncrasies of one speaker. Schatz et al. (2019)
likewise reported finding a different type of unit, as the ones their model learned were
too brief and too variable acoustically to correspond to the traditional phonetic
categories. They, too, took the view that such a result should not automatically be taken
to challenge the learning mechanism, but rather to challenge what is expected to be
learned, e.g., phonetic categories. Or maybe we should consider that not all phones
appear to be equally learnable from input alone, and that production data should be provided
to the model; emergentist approaches allow production constraints to affect perception
(Donegan, 2015, p. 45). Perhaps, also, the idiosyncrasies of our one speaker show that
more diverse training is necessary to learn other groups of phone-like units.
Shortcomings aside, we have shown that ECL learning methods can learn
abstractions. Continuous-time learning, as represented by TD, does not necessarily
improve learning as measured by prediction accuracy: looking into the future helps only
if the data is rather noisy, i.e., the signal is weak. Furthermore, the argument for or
against the relevance of units and abstractions is, as we said, void if not pitted against a
particular task that may or may not require such units to be discerned and/or
categorized (this point was demonstrated by successful models of categorization, such
as Love et al., 2004; Nosofsky, 1986). In this sense, ECL offers the more versatile type of
learning in that it can handle tasks at the level of exemplars and of abstractions, as
demonstrated by its reasonable performance across the four tasks. It also confirms
claims that adults seem to disregard differences that are not phonemic rather than
actually losing the ability to perceive them (Donegan, 2015, p. 43).
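The sense in which TD 'looks into the future' can be sketched in one update rule: its error term bootstraps from the discounted prediction for the next state, rather than from the current outcome alone. The states, rewards, and parameters below are illustrative assumptions, not our simulation settings.

```python
import numpy as np

def td0_update(V, s, s_next, reward, alpha=0.1, gamma=0.9):
    """TD(0) step: the learning target includes the discounted prediction
    for the *next* state, so current estimates are corrected toward the future."""
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V

# Toy chain: states 0 -> 1 -> 2, with reward only on reaching state 2.
V = np.zeros(3)
for _ in range(200):
    V = td0_update(V, 0, 1, reward=0.0)
    V = td0_update(V, 1, 2, reward=1.0)

# Credit propagates backwards: the early state inherits discounted value
# (about gamma * V[1]) even though it never receives a reward directly.
assert V[1] > 0.9 and V[0] > 0.8
```

When the immediate signal is reliable, this look-ahead adds little over a one-step delta rule; when the signal is weak or noisy, borrowing from future predictions can stabilize learning, consistent with the pattern we observed.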
The empirical exploration of whether abstractions are learnable should precede any
decisions about which theoretical framework to promote. The Zeitgeist, however, seems
to favor a reductionism that leads to rather extreme positions on the role of exemplars
and the status of (abstract) units (cf., Ambridge, 2020; Ramscar, 2019). Yet, as we know
from the past, reductionism of any kind provokes rather than inspires (recall, for
example, the Behaviorist take on language, and the Generativist reaction to it, see
Chomsky, 1959; Skinner, 1957; and for a later re-evaluation of these contributions, see,
among others, MacCorquodale, 1970; Palmer, 2006; Swiggers, 1995), where both
provocateur and provokee lose in the long run. Such theoretical deliberations only lead
to 'tunnel vision' on what constitutes evidence and how supporting data ought to be
collected. After all, we should know better: "all models are wrong" (Box, 1976, p. 792),
and those that occupy extreme positions often go hand-in-hand with simplifying and
even simplistic viewpoints, as their potential usefulness gets spent on promotion and
debasement rather than on productive (self-)refutability.
5. Acknowledgments
This work was funded by a Leverhulme Trust award (RL-2016-001) which funded PM
and DD, and by the Social Sciences and Humanities Research Council of Canada
(4352014-0678) which funded BVT. We are grateful to Ben Ambridge for a fascinating
seminar which sparked our interest in this topic, and to members of the Out of our Minds
team and the Alberta Phonetics Laboratory for sharing their thoughts. We thank
Danielle Matthews, Gerardo Ortega and Eleni Vasilaki for useful discussions and/or
pointers to the literature.
References
Ambridge, B. (2020). Against stored abstractions: A radical exemplar model of language acquisition. First Language, 40(5-6), 509–559. https://doi.org/10.1177/0142723719869731

Ambridge, B., & Lieven, E. V. M. (2011). Child language acquisition: Contrasting theoretical approaches. Cambridge: Cambridge University Press.

Appelbaum, I. (1996). The lack of invariance problem and the goal of speech perception. In Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP '96 (Vol. 3, pp. 1541–1544).

Arnold, D., & Tomaschek, F. (2016). The Karl Eberhards corpus of spontaneously spoken southern German in dialogues – audio and articulatory recordings. In C. Draxler & F. Kleber (Eds.), Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum (pp. 9–11). Munich: Ludwig-Maximilians-University. https://epub.ub.uni-muenchen.de/29405/

Arnold, D., Tomaschek, F., Sering, K., Lopez, F., & Baayen, R. H. (2017). Words from spontaneous conversational speech can be recognized with human-like accuracy by an error-driven learning algorithm that discriminates between meanings straight from smart acoustic features, bypassing the phoneme as recognition unit. PLOS ONE, 12(4), e0174623.

Arnon, I., & Ramscar, M. (2012). Granularity and the acquisition of grammatical gender: How order-of-acquisition affects what gets learned. Cognition, 122(3), 292–305.

Baayen, R. H., Chuang, Y.-Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019.

Barreda, S., & Nearey, T. M. (2012). The association between speaker-dependent formant space estimates and perceived vowel quality. Canadian Acoustics, 40(3), 12–13.

Barriuso, T. A., & Hayes-Harb, R. (2018). High variability phonetic training as a bridge from research to practice. CATESOL Journal, 30(1), 177–194. https://eric.ed.gov/?id=EJ1174231
Bod, R. (1998). Beyond grammar: An experience-based theory of language. Stanford, CA: CSLI Publications.

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.

Brekelmans, G., Evans, B., & Wonnacott, E. (2020). No evidence of a high variability benefit in phonetic vowel training for children. London, UK: Zenodo. https://zenodo.org/record/3732383

Buchwald, A., & Miozzo, M. (2011). Finding levels of abstraction in speech production: Evidence from sound-production impairment. Psychological Science, 22(9), 1113–1119. https://doi.org/10.1177/0956797611417723

Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411.

Chomsky, N. (1959). A review of B. F. Skinner's Verbal Behavior. Language, 35(1), 26–58.

Chuang, Y.-Y., & Baayen, R. H. (n.d.). Discriminative learning and the lexicon: NDL and LDL. To appear.

Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.

Divjak, D. (2015a). Exploring the grammar of perception: A case study using data from Russian. Functions of Language, 22(1), 44–68.

Divjak, D. (2015b). Four challenges for usage-based linguistics. Change of Paradigms: New Paradoxes. Recontextualizing Language and Linguistics, 31, 297–309. De Gruyter.

Divjak, D., & Caldwell-Harris, C. (2015). Frequency and entrenchment. In D. Divjak & E. Dabrowska (Eds.), Handbook of Cognitive Linguistics (Vol. 39). De Gruyter.

Divjak, D., Milin, P., Ez-zizi, A., Józefowski, J., & Adam, C. (2020). What is learned from exposure: An error-driven approach to productivity in language. Language, Cognition and Neuroscience, 1–24.

Donegan, P. (2015). The emergence of phonological representation. In The Handbook of Language Emergence (pp. 33–52). John Wiley & Sons. https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118346136.ch1
Fromkin, V. A. (1971). The non-anomalous nature of anomalous utterances. Language, 47(1), 27–52. http://www.jstor.org/stable/412187

Giannakopoulou, A., Brown, H., Clayards, M., & Wonnacott, E. (2017). High or low? Comparing high- and low-variability phonetic training in adult and child second language learners. PeerJ, 5, e3209. https://peerj.com/articles/3209

Gillies, J. (2018). CERN and the Higgs Boson: The global quest for the building blocks of reality. Icon Books.

Goldberg, A. E. (2006). Constructions at Work: The Nature of Generalization in Language. Oxford University Press.

Goldinger, S. D., & Azuma, T. (2003). Puzzle-solving science: The quixotic quest for units in speech perception. Journal of Phonetics, 31(3), 305–320.

Hamilton, A., Plunkett, K., & Schafer, G. (2000). Infant vocabulary development assessed with a British communicative development inventory. Journal of Child Language, 27(3), 689–705.

Hay, J., & Bresnan, J. (2006). Spoken syntax: The phonetics of giving a hand in New Zealand English. The Linguistic Review, 23(3), 321–349.

Haykin, S. (1999). Neural networks: A comprehensive foundation. New Jersey: Prentice Hall.

Holmes, W. (2001). Speech Synthesis and Recognition. CRC Press.

Jääskeläinen, I. P., Hautamäki, M., Näätänen, R., & Ilmoniemi, R. J. (1999). Temporal span of human echoic memory and mismatch negativity: Revisited. NeuroReport, 10(6), 1305–1308.

Jayaraman, S., & Smith, L. B. (2019). Faces in early visual environments are persistent not just frequent. Vision Research, 157, 213–221.

Jebara, T. (2012). Machine learning: Discriminative and generative (Vol. 755). Springer Science & Business Media.
Johnson, K. (1997). The auditory/perceptual basis for speech segmentation. Ohio State University Working Papers in Linguistics, 50, 101–113.

Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. The Journal of the Acoustical Society of America, 29(1), 98–104.

Langacker, R. W. (1987). Foundations of Cognitive Grammar: Theoretical Prerequisites. Stanford University Press.

Langacker, R. W. (1991). Foundations of Cognitive Grammar. Vol. 2, Descriptive Application. Stanford: Stanford University Press.

Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to identify English /r/ and /l/. II: The role of phonetic environment and talker variability in learning new perceptual categories. The Journal of the Acoustical Society of America, 94(3), 1242–1255.

Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of category learning. Psychological Review, 111(2), 309–332.

MacCorquodale, K. (1970). On Chomsky's review of Skinner's Verbal Behavior. Journal of the Experimental Analysis of Behavior, 13(1), 83.

Marslen-Wilson, W., & Warren, P. (1994). Levels of perceptual representation and process in lexical access: Words, phonemes, and features. Psychological Review, 101(4), 653–675.

Mendel, J., & McLaren, R. (n.d.). Reinforcement-learning control and pattern recognition systems. In J. Mendel & K. Fu (Eds.), Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications.

Mielke, J., Baker, A., & Archangeli, D. (2016). Individual-level contact limits phonological complexity: Evidence from bunched and retroflex //. Language, 92(1), 101–140.
Milin, P., Divjak, D., & Baayen, R. H. (2017). A learning perspective on individual differences in skilled reading: Exploring and exploiting orthographic and semantic discrimination cues. Journal of Experimental Psychology: Learning, Memory, and Cognition.

Milin, P., Divjak, D., Dimitrijević, S., & Baayen, R. H. (2016). Towards cognitively plausible data science in language research. Cognitive Linguistics.

Milin, P., Feldman, L. B., Ramscar, M., Hendrix, P., & Baayen, R. H. (2017). Discrimination in lexical decision. PLoS ONE, 12(2), e0171935.

Milin, P., Keuleers, E., & Filipović-Durdević, D. (2011). Allomorphic responses in Serbian pseudo-nouns as a result of analogical learning. Acta Linguistica Hungarica, 58(1), 65–84.

Milin, P., Madabushi, H. T., Croucher, M., & Divjak, D. (2020). Keeping it simple: Implementation and performance of the proto-principle of adaptation and learning in the language sciences. arXiv preprint arXiv:2003.03813.

Milin, P., Nenadić, F., & Ramscar, M. (2020). Approaching text genre: How contextualized experience shapes task-specific performance. Scientific Study of Literature, 10(1), 3–34.

Nathan, G. S. (2007). Phonology. In D. Geeraerts & H. Cuyckens (Eds.), The Oxford Handbook of Cognitive Linguistics. Oxford University Press.

Nixon, J., & Tomaschek, F. (2020). Learning from the acoustic signal: Error-driven learning of low-level acoustics discriminates vowel and consonant pairs. In Proceedings of the 42nd Annual Conference of the Cognitive Science Society (Vol. 42, pp. 585–591).

Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1), 39–57.

Palmer, D. C. (2006). On Chomsky's appraisal of Skinner's Verbal Behavior: A half century of misunderstanding. The Behavior Analyst, 29(2), 253–267.

Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee & P. Hopper (Eds.), Frequency and the Emergence of Linguistic Structure. Philadelphia, PA: John Benjamins.

Port, R. F. (2007). How are words stored in memory? Beyond phones and phonemes. New Ideas in Psychology, 25(2), 143–170.
Port, R. F. (2010). Language as a social institution: Why phonemes and words do not live in the brain. Ecological Psychology, 22(4), 304–326. https://doi.org/10.1080/10407413.2010.517122

Port, R. F., & Leary, A. P. (2005). Against formal phonology. Language, 81, 927–964.

Predictability, surprise, attention, and conditioning. (n.d.).

R Core Team. (2020). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. https://www.R-project.org/

Ramscar, M. (2019). Source codes in human communication (preprint). PsyArXiv. https://osf.io/e3hps

Ramscar, M., Dye, M., & McCauley, S. M. (2013). Error and expectation in language learning: The curious absence of "mouses" in adult speech. Language, 89(4), 760–793. https://www.jstor.org/stable/24671957

Ramscar, M., & Port, R. (2016). How spoken languages work in the absence of an inventory of discrete units. Language Sciences, 53, 58–74.

Ramscar, M., & Port, R. (2019). Categorization (without categories). Cognitive Linguistics Foundations of Language, 87.

Ramscar, M., & Yarlett, D. (2007). Linguistic self-correction in the absence of feedback: A new approach to the logical problem of language acquisition. Cognitive Science, 31(6), 927–960.

Rescorla, R. (2008). Rescorla-Wagner model. Scholarpedia, 3(3), 2237.

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory, 64–99.

Sapir, E. (1921). Language: An Introduction to the Study of Speech. New York: Harcourt Brace.

Savin, H., & Bever, T. (1970). The nonperceptual reality of the phoneme. Journal of Verbal Learning and Verbal Behavior, 9(3), 295–302.
Schatz, T., Feldman, N., Goldwater, S., Cao, X. N., & Dupoux, E. (2019). Early phonetic learning without phonetic categories – Insights from large-scale simulations on realistic input (Tech. Rep.). PsyArXiv. https://psyarxiv.com/fc4wh/

Schmidtke, D., Matsuki, K., & Kuperman, V. (2017). Surviving blind decomposition: A distributional analysis of the time-course of complex word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43(11), 1793.

Shafaei-Bajestan, E., Moradipour-Tari, M., & Baayen, R. H. (2020). LDL-AURIS: Error-driven learning in modeling spoken word recognition.

Shankweiler, D., & Fowler, C. A. (2015). Seeking a reading machine for the blind and discovering the speech code. History of Psychology, 18(1), 78–99.

Shoup, J. E. (1980). Phonological aspects of speech recognition. Trends in Speech Recognition, 125–138. Prentice Hall.

Skinner, B. F. (1957). Verbal behavior. New York: Appleton-Century-Crofts.

Suzuki, R., Terada, Y., & Shimodaira, H. (2019). pvclust: Hierarchical clustering with p-values via multiscale bootstrap resampling [Computer software manual]. R package version 2.2-0. https://CRAN.R-project.org/package=pvclust

Swiggers, P. (1995). How Chomsky skinned Quine, or what 'verbal behavior' can do. Language Sciences, 17(1), 1–18.

Taylor, J. R. (1995). Linguistic Categorization: Prototypes in Linguistic Theory. Clarendon Press.

Tomaschek, F., Tucker, B. V., Fasiolo, M., & Baayen, R. H. (2018). Practice makes perfect: The consequences of lexical proficiency for articulation, 4(s2).

Trubetzkoy, N. S. (1969). Principles of Phonology. University of California Press.

Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadić, F., & Sims, M. (2019). The Massive Auditory Lexical Decision (MALD) database. Behavior Research Methods, 51(3), 1187–1204. https://doi.org/10.3758/s13428-018-1056-1

Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4
Whelan, M. T., Prescott, T. J., & Vasilaki, E. (2020). A robotic model of hippocampal reverse replay for reinforcement learning. To appear in arXiv.

Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits (pp. 96–104).

Widrow, B., & Lehr, M. A. (1990). 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415–1442.

Yuan, J., & Liberman, M. (2008). Speaker identification on the SCOTUS corpus. Proceedings of Acoustics '08, 5687–5690.