
Enhancing the Perception of Speech Indexical Properties of Cochlear Implants through Sensory Substitution
Austin Butts


(Verbal Intro)

Thank everyone for attending

Share research on how sensory substitution might provide aid to cochlear implant patients

Background

History of Cochlear Implants

Hearing Loss as a Health Issue
- 53 million people with severe (61+ dB) or greater loss (Mathers et al. 2000)
- 14 million with profound (81+ dB) loss (Mathers et al. 2000)
- About $300,000 lifetime cost per person with severe loss in the US (Mohr et al. 2000)

What are Cochlear Implants (CIs)?
- "The most successful neural prosthesis" (Zeng et al. 2008)
- Restore hearing by converting sounds into electrical pulses
- Stimulate the nerves of the inner ear

Milestones and Trends in CIs (Zeng et al. 2008)
- Preliminary work: 1950s-70s
- First single-channel electrode approved by the FDA in 1984
- Conversational speech supported beginning in the 1990s

1.3: lifetime net cost (opportunity cost) of having severe hearing loss, skewed higher for pediatric hearing loss (associated with special education)

Shortcomings of CIs

Indexical Properties
- Qualities of the speaker, not linguistic in nature
- Ex: identifying speakers, discriminating voice gender
- Also leads to problems in group contexts

Others
- Telephone conversations: high variability of usage; problems are more common with unfamiliar talkers and topics
- Speech intonation/prosody: e.g., statement vs. question
- Expense and risk: $40k-$100k, invasive surgery (ASHA 2015, AAO-HNS 2015)

Variability in Outcomes

1.1: Group membership, speaker characteristics, changes in the state of the speaker

1.3+: the literature has shown that indexical and linguistic properties often rely on similar cues; to convince someone who doesn't care about anything beyond linguistic properties, point them to that

3: Average statistics are usually discussed; "supporting conversational speech" fails to capture poorer performers

Possible Solution: Sensory Substitution

What is Sensory Substitution?
- Convert information from one sense to another
- Accomplished by a machine interface
- Non-invasive; application of the system is external

Prior Work
- TVSS: camera to height map of a pin array (Bach-y-Rita 2004)
- Tongue electrotactile: camera to electrical stimulation of the tongue (Bach-y-Rita 2004)
- vOICe: image, treated as an audio spectrogram, into sound output (Auvray et al. 2007)
- Other examples: tactile-tactile, balance aids (Kaczmarek et al. 1991)

Discussions on Sensory Substitution

General Considerations (Lenay et al. 2003)
- Issues in fundamental usefulness and widespread use
- Replace "sensation" with "perception"
- Require time to form associations
- Actually substitution is addition: physical areas still retain old functions and contextual perceptions
- Typical formulation leaves out the crucial role of motor integration (Bach-y-Rita and Kercel 2003, Bach-y-Rita 2004); more obvious in visual systems

Applications for Cochlear Implants
- Same* perception framework as traditional applications
- Psychophysics theory implies increased sensitivity for multimodal systems
- Also applicable for 'addition'
- Sensorimotor regimes are more contentious
- Motor components are not crucial to linguistic and indexical speech perception
- Space is not represented topographically in the human auditory system (Kandel et al. 2013)

Lenay formed a valuable, more comprehensive resource for critiquing aspects of sensory substitution; will apply some of these points to SS for CIs

1.1: Don't see people walking around with these devices

2.1: Is it providing new information or reinforcing weak information? Applying psychophysics theory makes the question dubious

Discussions on Sensory Substitution (cont.)

Aspects of Multisensory Integration
- Clearly present in this application
- Requires integrating cues from audition and another modality to arrive at a single abstraction
- Speech is natively multisensory
- Information is not specific to a single mode; often present in multiple channels at least in part (Rosenblum 2005, McGurk and MacDonald 1976, Hopyan-Misakyan et al. 2009)

Neural Mechanisms
- Deaf vs. NH: plasticity to tactile cues (Levänen et al. 1998, Auer et al. 2007)
- Small amounts of activation to vibrotactile cues in NH (Schürmann et al. 2006)
- Lesion studies: voice quality and ID/familiarity (Van Lancker 1982, Kreiman 1997; Belin et al. 2000)
- Synesthesia following stroke: audio eliciting tactile sensations (Ro et al. 2007, Beauchamp and Ro 2008, Naumer and van den Bosch 2009)

1.2: Why utilizing S.S. seems like a prime candidate; lip-reading (esp. in hearing loss), the McGurk effect, emotional cues in facial expressions and tone of voice

2: blitz through some neural mechanisms that are noteworthy in integrating auditory and tactile information

2.1: HL people have higher levels of activation in auditory cortex to vibrotactile cues; suspected it might have to do with high-powered hearing aids

2.2: Found smaller amounts of activation in the secondary/belt areas of the auditory cortex for NH

2.3: Localization of some indexical properties; either temporal lobe for discrimination, right parietal lobe for identification; a region of the upper bank of the superior temporal sulcus is associated specifically with voices (in what way?)

2.4: stroke specifically in the right ventrolateral nucleus (VLN) of the thalamus; loss of tactile sensations on the left side; certain sounds cause tingling sensations, esp. in the arm and hand; proposed mechanism: unmasking of latent connections between areas of cortex

Prior Work in Auditory Sensory Substitution

Direct Spectral Mapping
- Principle: energy of spectral bands
- Mimics the tonotopic nature of cochleae/CIs
- (De Filippo and Scott 1978, Sparks et al. 1979, Wada et al. 1996, Galvin et al. 1999)
- (A sketch of this principle follows.)
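To make the principle concrete, here is a minimal sketch of direct spectral mapping, assuming a hypothetical array with one tactor per band; the band edges and the 40 dB compression range are illustrative choices, not parameters from the cited systems.

```python
import numpy as np

BAND_EDGES_HZ = [100, 200, 400, 800, 1600, 3200, 6400]  # 6 illustrative bands

def band_energies_to_tactors(frame, sr):
    """Map one audio frame's spectral band energies to tactor levels in [0, 1]."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # One energy per band, low to high: the tonotopic layout of the array
    energies = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                         for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:])])
    db = 10 * np.log10(energies + 1e-12)
    # Compress a ~40 dB dynamic range into normalized tactor drive levels
    return np.clip((db - db.max() + 40) / 40, 0, 1)
```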

Fundamental and Formant Isolation
- Principle: source/filter properties
- Source is the fundamental frequency/glottal pulse rate
- Formants are the peak frequencies of the filter
- (Rothenberg and Molitor 1979, Boothroyd 1986, Franklin et al. 1991)
- (A sketch of this principle follows.)
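A hedged sketch of the second principle: estimating the filter's peak frequencies (formants) from the roots of an LPC polynomial. It assumes librosa for the LPC fit; the model order and the 90 Hz floor are illustrative, and real systems would add pre-emphasis and voicing checks.

```python
import numpy as np
import librosa

def rough_formants(frame, sr, order=12):
    """Crude formant estimates (Hz) for one voiced frame via LPC root angles."""
    a = librosa.lpc(frame.astype(float), order=order)
    # Keep one root per conjugate pair; its angle gives the resonance frequency
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return [f for f in freqs if f > 90]  # drop near-DC roots; first few ~ formants
```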

Contemporary Work
- VEST: conference demos, but no journal publications to date (Eagleman 2014, Novich and Eagleman 2014, Eagleman 2015)

Historically, tactile aids were explored and widely considered for the deaf before the rise of the CI; in a review of the literature, two different approaches are clear (possibly reflecting different design philosophies)

Also note spatial utilization is common, but not necessary

1: low frequencies map to one location, stepping sequentially up to a location for high frequencies

2: reduce the picture of spectral energy to the underlying frequencies of the source and the filter peaks; phonemes or sounds like vowels have different peak frequencies

Prior Work in Auditory Sensory Substitution (cont.)

Discussion
- Tactile aids confer some information (Osberger et al. 1991)
- Hardly any tests are merely at chance level
- Only comparable to single-electrode CIs
- Multichannel technology clearly has performance advantages after initial development (Pickett and McFarland 1985, Osberger et al. 1991)
- Prospects for multiple aids (i.e. with CIs) in linguistic applications appear dim
- More difficult to demonstrate effectiveness when the baseline is already above chance
- The literature largely ignores indexical properties

Approach

Receiver and Target Domains
- Can it be done?
- Auditory-tactile cue mapping
- Continuum-of-dimensions approach, contrasting with pattern-based (Tan et al. 2003)
- Do not have that knowledge for the particular results the user wants
- Infeasible for the end goal due to sheer number (speaker ID)
- Information extraction might be a prerequisite to clarify salient cues

3.3: frame it as a design specification: the so-called Dunbar's number (pop culture notoriety); target: the number of people for whom you can recall socially relevant details, about 150

4: in line with the philosophy of formant extraction

Auditory Aspects

Cues in Speech and Hearing Science
- Rhythm, speaking rate, breathiness, nasality, pitch and intonation, formants, and dynamic articulatory cues (Sambur 1975, Cleary et al. 2005, Vongpaisal et al. 2010)

Methods (Kreiman et al. 2005)
- Manipulate samples to select or drop certain cues
- Additional (abstract) frameworks: Factor Analysis (FA) and Multidimensional Scaling (MDS)

Cues in Computer Science
- Mathematical approach based on the signal, not concerned with the speech apparatus
- Features: cepstral coefficients (Furui 1981, Gowdy and Tufekci 2000, Zheng et al. 2001)
- Frameworks: Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM), Artificial Neural Network (ANN), i-vectors

1.1: dynamic articulatory cues are timing nuances in phoneme production; comment on transparency

1.2: How do we know this, what's the evidence?

2: change gears; why? cues explored in automated speaker recognition

2.2: briefly explain cepstral coefficients: split the spectrum into overlapping bands, define features via the discrete cosine transform, moving from the overall distribution of spectral energy to finer and finer details (see the sketch below)
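A minimal sketch of that pipeline, assuming a mono signal x at sample rate sr; the frame sizes, 26 bands, and 13 coefficients are common illustrative defaults, not values from the cited papers.

```python
import numpy as np
from scipy.fft import dct

def cepstral_coeffs(x, sr, n_bands=26, n_coeffs=13,
                    frame_len=0.025, hop=0.010):
    """Frame -> power spectrum -> mel filterbank -> log -> DCT."""
    n, h = int(sr * frame_len), int(sr * hop)
    frames = np.stack([x[i:i + n] * np.hamming(n)
                       for i in range(0, len(x) - n, h)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # (frames, n//2 + 1)

    # Overlapping triangular bands, equally spaced on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    hz = 700 * (10 ** (np.linspace(0, mel(sr / 2), n_bands + 2) / 2595) - 1)
    edges = np.floor(hz * n / sr).astype(int)            # FFT bin of each edge
    fbank = np.zeros((n_bands, power.shape[1]))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        if mid > lo:
            fbank[b, lo:mid] = np.linspace(0, 1, mid - lo, endpoint=False)
        if hi > mid:
            fbank[b, mid:hi] = np.linspace(1, 0, hi - mid, endpoint=False)

    # Log band energies, then DCT: low coefficients capture the overall
    # spectral shape, higher ones finer and finer detail
    loge = np.log(power @ fbank.T + 1e-10)
    return dct(loge, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```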

2.3: capture that these frameworks are about machine learning for patterns in these features

Auditory Aspects (cont.)

Reconciling the Two Approaches
- Both provide accurate descriptions; utility depends on the application
- Abstraction of features weighed against complexity of algorithms
- Sensory substitution systems with human-machine interfaces need to be mindful of both
- What is salient and what can be categorized

Tactile Aspects

Neural and Psychophysical Descriptions
- Skin described as having four types of tactile receptors (Johansson and Vallbo 1979, Johansson and Vallbo 1980)
- Each defined by the size of its receptive field and the speed of neural adaptation (Johansson and Vallbo 1983)
- Inquiry here has more to do with perceptual dimensions
- Contributions of specific receptors are complex (Gescheider et al. 2002, Sherrick et al. 1990)
- Can't be elicited specifically in any device implementation

1.2: I - small, II - large; rapid or slow adapting

Tactile Aspects (cont.)

Potential Dimensions
- Static materials: roughness, hardness, possibly compressional elasticity (Hollins et al. 1993)
- Vibrotactile: mechanical pitch and loudness, temporal envelope (Melara and Day 1992, Park and Choi 2011)
- Specific systems: spatial location, frequency, intensity, waveform, duration, rhythm, and roughness or temporal envelope (Jones et al. 2009, Brown et al. 2006, Cholewiak et al. 2001)
- Difficult to infer the discriminability of two different arbitrary stimuli within the current framework

Linking Dimensions of Modalities
- Assign each auditory dimension to a dimension of the device
- Non-trivial for more than one dimension
- Potential issues (Kreiman et al. 2005): findings specific to experimental conditions; individual variations in salience

Putting the messiness aside, once we find dimensions to utilize, the approach specifies...

Experimental Framework

Experiments 1 & 2: Trial and Full Study
- Test fundamental frequency mapped to the height spatial dimension
- Task: identify the gender of the speaker

Experiment 3: Computational
- Simulate identification of the speaker using a linear discrimination procedure

First two: empirical experiments

Voice Gender Discrimination Task

Initial Work: Experiment 1

Methods

Equipment
- Chair
- Vibrotactile (haptic) array
- Arduino controller
- Computer (interface)

Chair with Vibrotactile Array

Emphasize this was produced for a prior study in visual-tactile sensory substitution

2: might use vibrotactile and haptics interchangeably for this application, but v.t. is more specific

[Figure: schematic of device design dimensions (units in mm) and frequency ranges (Hz)]

Methods

Stimuli Processing

Audio
- Speech sentences from the TIMIT database (Fisher et al. 1986)
- CI simulations: process TIMIT files through an 8-channel noise vocoder (AngelSim) (Emily Shannon Fu Foundation 2014)

Vibrotactile
- Fundamental frequency (Praat) converted to patterns
- Large (500 ms) and small (50 ms) windows displayed on the left and right respectively (see the mapping sketch below)
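A hedged sketch of the F0-to-height mapping just described: average a Praat-style F0 track over a window and activate the tactor row whose band contains it. The row count, F0 range, and log spacing are assumptions for illustration, not the actual device parameters.

```python
import numpy as np

N_ROWS = 8                    # hypothetical number of tactor rows
F0_LO, F0_HI = 75.0, 300.0    # hypothetical mapped F0 range (Hz)

def f0_to_rows(f0_track, window_ms, frame_ms=10.0):
    """Return one tactor row index (or None for silence) per window.

    f0_track: F0 estimates (Hz) at a fixed frame rate, 0 for unvoiced frames
    window_ms: analysis window (e.g. 500 ms or 50 ms as on the slide)
    """
    step = int(window_ms / frame_ms)
    rows = []
    for i in range(0, len(f0_track) - step + 1, step):
        voiced = f0_track[i:i + step]
        voiced = voiced[voiced > 0]
        if voiced.size == 0:
            rows.append(None)  # unvoiced window: no tactor active
            continue
        f0 = np.median(voiced)
        # Log-spaced bands: equal steps in pitch rather than in Hz
        frac = (np.log(f0) - np.log(F0_LO)) / (np.log(F0_HI) - np.log(F0_LO))
        rows.append(int(np.clip(frac * N_ROWS, 0, N_ROWS - 1)))
    return rows
```

One column of the array would then be driven by the 500 ms windows and the other by the 50 ms windows, per the slide.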

Methods

[Figure: F0 ranges of representative male and female speakers]

Methods

Session Files
- 3 blocks of 52 trials each
- Balanced speaker gender within blocks
- Conditions: normal audio, CI simulation, CI sim + haptics
- Order of blocks originally randomized and counterbalanced among subjects
- Not completely balanced due to ending the experiment before completion
- No feedback on any of the correct responses

Methods

Participants
- 12 participants recruited (10 M and 2 F)
- Compensated for a 1-hour max. session

Session Procedure
- Load spreadsheet (CSV) with file directories in the specified order
- Web interface
- Play segment, prompt the user: "Please select the gender of the speaker; Male or Female"
- Pause and prompt to continue
- Likert scale survey at the end

Analysis Methodology

Factors
- Order of stimulus: A
- Subject: B(A)
- Type of stimulus: C

Response Variables
- Raw data: correct trials and time to respond
- Final metrics: accuracy, response time, bias (Donaldson 1992)

These factors will appear in abbreviated form; I'll try my best to recount what they are in plain English

Analysis Methodology

Transformation Techniques
- Principle: stabilize variance
- Accuracy: arcsine-square-root (Vollset 1993), derived from variance as a function of the mean
- Response time: inverse, derived from the Box-Cox test (Montgomery 2012)
- Bias: none; difficult to determine

Transformation for Accuracy (see the sketch below)
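A minimal sketch of the two transforms named above; the only assumptions are that accuracy is a proportion in [0, 1] and response time is positive.

```python
import numpy as np

def transform_accuracy(p):
    """Arcsine-square-root transform; stabilizes binomial variance."""
    return np.arcsin(np.sqrt(p))

def transform_response_time(t):
    """Inverse transform per the Box-Cox result: this yields response
    *speed*, so faster responses get larger values."""
    return 1.0 / t
```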

Confidence Interval

4: interpret bias results in ANOVA with caution

6: Confidence interval: highlight the magic term at the end (what we want for approximating the standard error)

Analysis Methodology

ANOVA F-Tests
- Model: restricted nested mixed effects
- F-tests for each factor reflect the model (one plausible written-out form follows)
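One plausible written-out form of that model, assumed from the factor list above (the thesis may differ in detail), with A (order) and C (stimulus type) fixed and subjects B random and nested in A:

$$ Y_{ijkl} = \mu + A_i + B_{j(i)} + C_k + (AC)_{ik} + (BC)_{j(i)k} + \varepsilon_{(ijk)l} $$

$$ B_{j(i)} \sim \mathcal{N}(0, \sigma_B^2), \qquad (BC)_{j(i)k} \sim \mathcal{N}(0, \sigma_{BC}^2) $$

Under the restricted model, A is tested against B(A), while C and AC are tested against B(A)C.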

Post-hoc Tests
- Single fixed factors: Tukey's test
- Contrasts (interactions): Scheffé's method for comparing all contrasts
- Correlations of subject-wise factors: underlying prediction
- Sign test when dealing with potentially non-normal data

Results

Accuracy
- Significant: B(A) [subject], C [stim type]
- Normal audio at ceiling, above CI and CI+haptics
- No difference between CI and CI+haptics
- AC [stim order*type]: would involve learning the CI sim between blocks and absolute order learning; no significant and meaningful post-hoc results


[Figure: accuracy by condition. Top: raw accuracy data; bottom: transformed]

Results

Response Time
- AC [stim order*type]: significant F statistic, but again no meaningful post-hoc results
- Significant: B(A) [subject], B(A)C [subject*type]
- No significant correlations between B(A)C and B(A)

Bias
- Overall bias towards male
- No factors are significant

Both figs: response speed as opposed to time (faster has larger value)29Results

Which of the following interpretations best accounts for the two modes seen in the data?
- By-Subject Model
- By-Training Model

Contours are arbitrary; the shapes you see don't matter, it's relative density values

If it looks like it's hard to tell a difference between the models, that's because it is

Discussion

Performance Statistics
- Lack of fundamental difference between modalities; it appears that the chair does not contribute meaningful information
- Subjects vary in overall performance, and within modalities for response time

Why Utilizing the Chair Is Difficult
- Speculation: lack of instruction for most participants, combined with two data streams and no direct feedback

Bias in Answer Choice
- Might have made flawed associations

Voice Gender Discrimination Task

Full Design: Experiment 2

Methods

Equipment
- Comparable to the first experiment
- Different laptop

Stimuli Processing
- In addition to CI simulation segments, also made a matching set with AMR file compression applied before simulation, to mimic a phone network
- Separated the two streams of vibrotactile patterns (time window sizes) into two different sets

Methods

[Figure: F0 ranges of representative male and female speakers]

Methods

Session Files
- 3 blocks of 80 trials each (16 specific training segments, 64 normal)
- Balanced speaker gender within blocks
- Conditions: CI simulation, haptics alone, CI sim + haptics
- Order of blocks fully randomized and counterbalanced among subjects
- Feedback on training segments

Participants
- 18 different participants recruited (10 M and 8 F), compensated for a max. 1-hour session
- All informed of the mapping

Session Procedure
- Similar session
- Choice layout randomized
- Training segments now have the correct answers displayed afterwards on the continue screen

Analysis Methodology

Code | Factor
A    | Order of stimuli
B(A) | Subject
C    | Type of stimulus
D    | Type of auditory stimulus
E    | Type of haptic stimulus
H    | Block halves

Factors and Attributes
- Add two within-block factors: type of audio stimulus (D) and type of haptic stimulus (E)
- Consider which half of the block a trial falls in: H
- Consider duration and distance from center F0 of files in a separate analysis

Response Variables
- Accuracy
- Response time
- Bias: choice and layout
- Also consider the Likert scores

Get back to the center "distance" of a file in the relevant analysis section

Analysis Methodology

Transform Techniques
- Same techniques as Experiment 1

ANOVA Stages
- Not all the factors can be crossed with the others: nonsense combinations
- Separate ANOVAs are completed, each with all of its factors crossed

ANOVA F-values and Tests
- Model: restricted nested mixed effects; a different variety now, with additional factors and invalid terms

Post-hoc Tests
- Same kinds of tests
- Linear models for fitting an outcome based on predictors

Comment on ANOVA stages?

Results

Accuracy
- Fixed: C [stim type]; combined stimuli have a greater effect than either modality alone; haptics trend higher than CI (not sig.)
- Random: B(A), B(A)C, B(A)D [subject alone, subject*type, subject*audio]; no correlations within or between random factors found
- Note: D [audio] and A [order] are marginal; D tends towards compression having a negative effect


Results

Response Time
- Fixed: C [stim type]; haptics alone slower than both CI and combined; no significant difference between CI and combined
- Fixed: AD and ADE [order*within]; nothing significant found relating to block order
- Random: B(A), B(A)C [subject and subject*type]; no significant correlations within or between random factors

Results

Biases
- No significant bias for speaker gender (choice) or L/R (layout)

File Parameters on Accuracy
- Regressed against (i) the duration of the segment and (ii) cross-modal distance from the center
- Coefficients for both variables are significant
- Distance has the larger effect
- (A fitting sketch follows; the resulting coefficients are in the table below.)
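A hedged sketch of fitting that regression with statsmodels; the DataFrame and its column names (accuracy, distance, duration) are hypothetical stand-ins for the per-file data.

```python
import pandas as pd
import statsmodels.api as sm

def fit_accuracy_model(files: pd.DataFrame):
    """files: one row per segment, with hypothetical columns
    'accuracy', 'distance', and 'duration'."""
    X = sm.add_constant(files[['distance', 'duration']])
    model = sm.OLS(files['accuracy'], X).fit()
    return model  # model.summary() reports estimate, SE, t, and p as below
```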

Linear Model for Accuracy vs. File Parameters

Source      | Estimate | St. Error | t-statistic | p-value
(Intercept) | 0.94088  | 0.041263  | 22.802      | 1.0843E-61
Distance    | 0.26559  | 0.02958   | 8.9789      | 8.6729E-17
Duration    | 0.021475 | 0.0087065 | 2.4665      | 0.014352

2: Here, explain distance: how far the speakers are from the overall center (absolute value)

2 (cont): the unit is scaled to relative spacing; smaller values are more likely to be confused based on the fundamental frequency

Results

ANOVA on Likert Scores
- Only the effect of C is significant
- Haptics and combined conditions perceptually easier than CI alone
- But not significant between themselves (trend)

Predict Likert from Performance
- Neither predictor significant
- Both trended as expected, and the effect for accuracy was marginal

Linear Model for Likert vs. Objective Performance

Source      | Estimate | St. Error | t-statistic | p-value
(Intercept) | 8.3753   | 2.2305    | 3.7549      | 0.00044586
Accuracy    | -3.0754  | 1.5776    | -1.9494     | 0.056754
RespTime    | -1.2722  | 1.8331    | -0.69402    | 0.49082

Results

Splitting Trials in Half
- No effects on accuracy
- Effects of H, B(A)H, and B(A)CH on response time

Multimodal Enhancement
- Alternative model: subjects just utilize the one modality which works for them (no multisensory regime)
- Is it typical for subjects to utilize both modalities?
- Method 1: accuracy for a single modality above chance, and combined above that single modality (two ways to step); both ways of stepping are significant
- Method 2: accuracy for combined above both the CI-alone and haptics-alone scores; marginal, not quite significant
- (A sketch of the Method 1 logic follows.)
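A sketch of the Method 1 logic per subject; the specific choice of a one-sided binomial test against chance and a two-sample proportions z-test for the step up is my illustration, not necessarily the thesis's exact statistics.

```python
from scipy.stats import binomtest
from statsmodels.stats.proportion import proportions_ztest

def stepping_test(k_single, n_single, k_combined, n_combined, chance=0.5):
    """Step 1: single modality above chance; Step 2: combined above single.

    k_* / n_* are correct-trial counts and trial counts per condition.
    Call once with CI-alone counts and once with haptics-alone counts:
    the "two ways to step" on the slide.
    """
    p_above_chance = binomtest(k_single, n_single, chance,
                               alternative='greater').pvalue
    _, p_step_up = proportions_ztest([k_combined, k_single],
                                     [n_combined, n_single],
                                     alternative='larger')
    return p_above_chance, p_step_up
```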

Discussion

Performance Metrics
- Type of stimulus (C) has significant effects, with interesting interplay
- Combined modalities result in higher accuracy than CI alone without having to sacrifice reaction time (as happens with haptics alone)
- Variability of random factors is a constant theme; not all subjects react to stimuli the same way
- D alone is just barely not significant in accuracy, but we do see variability in how subjects react to compressed audio (B(A)D)
- Fixed factors related to learning, when they show up as significant in the ANOVA, are not significant for meaningful contrasts
- Fail to show any correlations in random factors; need further demonstration to confirm strong independent relations
- Overarching point: the main difference between the experiments is not the kind of information, but the way it is presented; training and keeping it simple

Discussion

Biases
- The experimental setup appears to fix the issue with bias

File Parameters on Accuracy
- Longer files help with accuracy, but not nearly as much as having distinct stimuli (large cross-modal distance)

Likert Scores
- See similar trends across C (different significant post-hoc results)
- More difficult to show significance in how much influence accuracy and response time have on the scores

1: Nice to get rid of it, but it leaves open why it was present/went away in the first place

Discussion

Splitting Trials in Half
- Overall increase in speed, and also significant variation between and within subjects

Multimodal Enhancement
- Contentious how much typical multimodal usage can be confirmed
- The second method may be more susceptible to error
- Indicative, but requires further testing with this as the primary hypothesis
- Can still show the existence of some multimodal subjects and average effects
- Note the same improvements were not demonstrated in accuracy

Simulating Speaker Identification

Experiment 3

Motivation
- Want to see how Mel-frequency Cepstral Coefficient (MFCC) features correspond to the fundamental frequency
- Male and female speakers are separated well by a hyperplane in MFCC feature space (129 out of 130 in both groups)

[Figure: linear discriminant representation for classifying voice gender]

1: Recall cepstral coefficients; Mel-frequency just describes the band delineations in the first phase

Motivation
- Broad categorization and correlation
- Suspect for within-group comparisons
- Much lower variance explained

Linear Models for Hyperplane Distance to Mean log F0

Model        | Slope Estimate | Slope Std. Error | t-statistic | p-value     | Adjusted R²
Both Genders | 1.6882         | 0.04354          | 38.773      | 1.3038E-109 | 0.853
Male Only    | 0.54248        | 0.12418          | 4.3686      | 2.5581E-05  | 0.123
Female Only  | 1.5207         | 0.12636          | 12.035      | 8.8334E-23  | 0.527

Methods
- Goal: see if a linear classifier can succeed in identifying a speaker based on the mean MFCC vectors of speech segments
- Reduce the dimensions of the MFCCs to maximize the variance between means (presumed device operation)
- (A trial sketch follows; the parameter grid is on the next slide.)
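A hedged sketch of one simulated trial under stated assumptions: feats_si and feats_sx are hypothetical dicts mapping speaker IDs to lists of mean-MFCC vectors for SI (training) and SX (test) material, and scikit-learn's LDA stands in for the "linear discrimination procedure". Note that LDA caps the reduced space at (speakers - 1) dimensions, one simplification relative to the full 1-12 dimension grid.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def run_trial(rng, feats_si, feats_sx, n_speakers, n_dims):
    """One simulated identification trial; returns True on a correct ID."""
    speakers = rng.choice(sorted(feats_si), size=n_speakers, replace=False)
    X = np.vstack([v for s in speakers for v in feats_si[s]])
    y = np.array([s for s in speakers for _ in feats_si[s]])

    # LDA projection maximizes between-class (between-mean) variance in the
    # reduced space: the presumed device operation
    lda = LinearDiscriminantAnalysis(
        n_components=min(n_dims, n_speakers - 1)).fit(X, y)
    Z = lda.transform(X)
    centroids = {s: Z[y == s].mean(axis=0) for s in speakers}

    # Classify a random test segment by the nearest training centroid
    target = rng.choice(speakers)
    test = feats_sx[target][rng.integers(len(feats_sx[target]))]
    z = lda.transform(test[None, :])[0]
    guess = min(centroids, key=lambda s: np.linalg.norm(z - centroids[s]))
    return guess == target
```

Accuracy for a parameter combination is then the mean of run_trial over the 1000 trials described below.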

Parameters
- Number of speakers: 2, 3, 5, 7, 10, 15, 20; drawn from an n of 260
- Dimensionality of space: integers 1-12
- Duration: up to 5 seconds in 0.25-second increments
- 1000 trials per parameter combination (max error +/- 3%)
- Train with the 3 SI segments for each speaker
- Random selections of SX sentences until the required duration is reached for testing

Analysis and Results

Duration Plots
- Varying number of dimensions (speakers = 10): an increase in dimensions makes accuracy rise
- Varying number of speakers (dimensions = 3): increasing speakers makes accuracy fall
- Plateaus quickly over duration

Lighter corresponds to a higher number

Analysis and Results

Accuracy (Raw)

Number of Dimensions | Number of Speakers | Estimate | 95% CI
1                    | 2                  | 0.7980   | 0.7720 - 0.8217
1                    | 20                 | 0.0800   | 0.0647 - 0.0985
12                   | 2                  | 0.9460   | 0.9302 - 0.9584
12                   | 20                 | 0.5840   | 0.5532 - 0.6142

[Figure: scaled performance level]

Discussion
- Best performance with full dimensional representations
- Reduction leads to substantial problems, especially for moderate to large numbers of speakers
- Some information is conveyed, but not passable for a usable implementation
- A different mathematical approach is needed

2: not even in the same order of magnitude as the desired number of distinguishable features

Conclusions

Main Themes
- Sensory substitution devices can support perception of indexical qualities of speech, even in subjects already aided by simulations of CIs
- Mapping and procedure make all the difference
- Theme of variation among subjects
- Existence and possible prevalence of utilizing information in a multimodal fashion
- Sophisticated models needed to convey speaker ID in reduced dimensions

Limitations
- CI simulation: really no true substitute for real patients
- Scores observed were not too different
- Stepping through to familiarize subjects with the vocoder (Fu et al. 2004, Fu et al. 2005, Gonzalez and Oliver 2005)
- Needed for a more rigorous procedure to acclimate subjects

1: esp. considering vocoders, as with the other literature, are less validated for indexical properties than linguistic ones

Future Directions

Device Components
- Robustness of conclusions to different actuators and implementations
- Microphone/sensorimotor integration

Mapping Algorithms
- Test against a categorical approach
- Different mathematical framework and possibly different features

User Study Tasks
- Logistics of building a speaker ID experiment (database and procedure)
- Validate the task itself in normal-hearing people
- Simultaneous task (intelligibility)

3.3: tested under an isolated task; does not capture cognitive load to a full extent

Thank You

End of Presentation