
Words, more words… and statistics

To segment words, the brain could be using statistical methods

May 19, 2016

Picking out single words in a flow of speech is no easy task and, according to linguists, to succeed in doing it the brain might use statistical methods. A group of SISSA scientists has applied a statistics-based method for word segmentation and measured its efficacy on natural language, in 9 different languages, to discover that linguistic rhythm plays an important role. The study has just been published in the journal Developmental Science.
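The statistic behind this method, transitional probability (TP), and the two thresholding rules described further on in this release can be illustrated with a toy example. What follows is a minimal sketch in Python, not the researchers' implementation: the four-word lexicon (including the invented word "gola"), the function names, the threshold of 0.4 and the reading of "locally lowest" as "strictly lower than both neighbouring TP values" are all assumptions made for illustration.

```python
from collections import Counter

def transitional_probabilities(utterances):
    """Estimate TP(b | a) = count(a followed by b) / count(a followed by anything)."""
    pair_counts, first_counts = Counter(), Counter()
    for syllables in utterances:
        for a, b in zip(syllables, syllables[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def _split(syllables, is_boundary):
    """Cut the syllable stream before syllable i+1 whenever is_boundary(i) is true."""
    if not syllables:
        return []
    words, current = [], [syllables[0]]
    for i, syllable in enumerate(syllables[1:]):
        if is_boundary(i):
            words.append(current)
            current = []
        current.append(syllable)
    words.append(current)
    return words

def segment_absolute(syllables, tp, threshold):
    """Absolute thresholding: boundary wherever TP drops below a fixed value."""
    values = [tp.get(pair, 0.0) for pair in zip(syllables, syllables[1:])]
    return _split(syllables, lambda i: values[i] < threshold)

def segment_relative(syllables, tp):
    """Relative thresholding (assumed reading): boundary at strict local TP minima."""
    values = [tp.get(pair, 0.0) for pair in zip(syllables, syllables[1:])]
    def is_local_minimum(i):
        left = values[i - 1] if i > 0 else float("inf")
        right = values[i + 1] if i + 1 < len(values) else float("inf")
        return values[i] < left and values[i] < right
    return _split(syllables, is_local_minimum)

# Hypothetical toy lexicon: "TA" is always followed by "DA", while "BU" is
# followed half of the time by "DI" and half of the time by "FI".
LEXICON = {
    "tada": ["TA", "DA"], "budi": ["BU", "DI"],
    "bufi": ["BU", "FI"], "gola": ["GO", "LA"],
}
TWO_WORD_UTTERANCES = [
    ("tada", "tada"), ("tada", "budi"), ("tada", "gola"),
    ("budi", "tada"), ("budi", "bufi"), ("budi", "gola"),
    ("bufi", "tada"), ("bufi", "budi"), ("bufi", "gola"),
    ("gola", "tada"), ("gola", "bufi"), ("gola", "gola"),
]
corpus = [LEXICON[a] + LEXICON[b] for a, b in TWO_WORD_UTTERANCES]
tp = transitional_probabilities(corpus)

stream = ["TA", "DA", "BU", "DI", "GO", "LA", "BU", "FI"]
print(tp[("TA", "DA")], tp[("BU", "DI")], tp[("BU", "FI")])
print(segment_absolute(stream, tp, threshold=0.4))
print(segment_relative(stream, tp))
```

In this toy corpus the within-word transitions have a TP of 1 or 0.5, while transitions across word boundaries have a TP of about 0.33, so both rules recover the same four words. Raising the fixed threshold above 0.5 would make the absolute rule also cut "BU" from "DI", which hints at why a fixed threshold may suit some languages better than others.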

Have you ever racked your brains trying to make out even a single word of an uninterrupted flow of speech in a language you hardly know at all? It is naïve to think that in speech there is even the smallest of pauses between one word and the next (like the space we conventionally insert between words in writing): in actual fact, speech is almost always a continuous stream of sound. However, when we listen to our native language, word “segmentation” is an effortless process. What are, linguists wonder, the automatic cognitive mechanisms underlying this skill? Clearly, knowledge of the vocabulary helps: memory of the sound of the single words helps us to pick them out. However, many linguists argue, there are also automatic, subconscious “low-level” mechanisms that help us even when we do not recognise the words or when, as in the case of very young children, our knowledge of the language is still only rudimentary. These mechanisms, they think, rely on the statistical analysis of the frequency (estimated based on past experience) of the syllables in each language.

One indicator that could contribute to segmentation processes is “transitional probability” (TP), which provides an estimate of the likelihood of two syllables co-occurring in the same word, based on the frequency with which they are found associated in a given language. In practice, if every time I hear the syllable “TA” it is invariably followed by the syllable “DA”, then the transitional probability for “DA”, given “TA”, is 1 (the highest). If, on the other hand, whenever I hear the syllable “BU” it is followed half of the time by the syllable “DI” and half of the time by “FI”, then the transitional probability of “DI” (and “FI”), given “BU”, is 0.5, and so forth. The cognitive system could be implicitly computing this value by relying on linguistic memory, from which it would derive the frequencies.

The study conducted by Amanda Saksida, research scientist at the International School for Advanced Studies (SISSA) in Trieste, with the collaboration of Alan Langus, SISSA research fellow, under the supervision of SISSA professor Marina Nespor, used TP to segment natural language, by using two different approaches.

Based on rhythm

Saksida’s study is based on work with corpora, that is, bodies of texts specifically collected for linguistic analysis. In the case at hand, the corpora consisted of transcriptions of the “linguistic sound environment” that infants are exposed to. “We wanted to have an example of the type of linguistic environment in which a child’s language develops”, explained Saksida. “We wondered whether a low-level mechanism such as transitional probability worked with real-life language cues, which are very different from the artificial cues normally used in the laboratory, which are more schematic and free of sources of ‘noise’. Furthermore, the question was whether the same low-level cue is equally efficient in different languages”.

Saksida and colleagues used corpora of no less than 9 different languages, and to each they applied two different TP-based models. First they calculated the TP values for each point of the language flow for all of the corpora, and then they “segmented” the flow using two different methods. The first was based on absolute thresholding: a certain fixed reference TP value was established below which a boundary was identified. The second method was based on relative thresholding: the boundaries corresponded to the points where the TP function was locally lowest.

In all cases, Saksida and colleagues found that transitional probability was an effective tool for segmentation (49% to 86% of words identified correctly) irrespective of the segmentation algorithm used, which confirms TP efficacy. Of note, while both models proved to be quite efficient, when one model was particularly successful with one language, the alternative model always performed significantly worse. “This cross-linguistic difference suggests that each model is better suited than the other for certain languages and vice versa. We therefore conducted further analyses to understand what linguistic features correlated with the better performance of one model over the other”, explains Saksida.

The crucial dimension proved to be linguistic rhythm. “We can divide European languages into two large groups based on rhythm: stress-timed and syllable-timed”. Stress-timed languages have fewer vowels and shorter words, and include English, Slovenian and German. Syllable-timed languages contain more vowels and longer words on average, and include Italian, Spanish and Finnish. The third rhythmic group of languages does not exist in Europe and is based on “morae” (a part of the syllable), such as Japanese. This group is known as “mora-timed” and contains even more vowels than syllable-timed languages.

The absolute threshold model proved to work best on stress-timed languages, whereas relative thresholding was better for the mora-timed ones. “It’s therefore possible that the cognitive system learns to use the segmentation algorithm that is best suited to one’s native language, and that this leads to difficulties segmenting languages belonging to another rhythmic category. Experimental studies will clearly be necessary to test this hypothesis. We know from the scientific literature that immediately after birth infants already use rhythmic information, and we think that the strategies used to choose the most appropriate segmentation could be one of the areas in which information about rhythm is most useful”. The study is in fact unable to say whether the cognitive system (of both adults and children) really uses this type of strategy. “Our study clearly confirms that this strategy works across a wide range of languages”, concludes Saksida. “It will now serve as a guide for laboratory experiments.”

USEFUL LINKS:

• Original paper: http://goo.gl/cOk5VD

IMAGES:

• Credits: Jev55 (Flickr: https://goo.gl/yVVdJ3)

Contact:

Press office: [email protected]

Tel: (+39) 040 3787644 | (+39) 366-3677586
via Bonomea, 265
34136 Trieste

More information about SISSA: www.sissa.it
