
Page 1: Liverpool University

Liverpool University

Page 2: Liverpool University

The Department

• Centre for Cognitive Neuroscience, Department of Psychology, Liverpool University

• Overall Aim: Understanding Human Information Processing

Page 3: Liverpool University

Expertise

• Auditory Scene Analysis (ASA)
– Perception experiments
– Modelling

• Speech Perception

• Audio-Visual Integration
– Models of AV information fusion
– Applying these models to ASA

Page 4: Liverpool University

Work at Liverpool

Task 1.3: Active/passive speech perception

Research Question: Do human listeners actively predict the time course of background noise to aid speech recognition?

Current state: Perceptual evidence for ‘predictive scene analysis’; Elvira Perez will explain all.

Planned work: Database of environmental noise to test computational models.

Page 5: Liverpool University

Work at Liverpool

Task 1.4: Envelope Information & Binaural Processing

Research Question: What features do listeners use to track a target speaker in the presence of competing signals? Patti Adank (Aug. 03 – July 04)

Current state: Tested the hypothesis that ‘jitter’ is a stream segregation or stream formation cue; tech report finalised in July 04.

Page 6: Liverpool University

Work at Liverpool

Task 2.2: Reliability of auditory cues in multi-cue scenarios

Research Question: How are cues perceptually integrated? Combination of experimentation and modelling.

Current state: Experimental data and models on audio-visual motion signal integration (non-HOARSE).

Ongoing work: MLE models for speech feature integration. Elvira.

Planned work: Collaboration with Patras (John Worley) on location and pitch segregation cue integration.

Page 7: Liverpool University

Work at Liverpool

Task 4.1: Informing speech recognition

Research Question: How to apply data derived from perception experiments to machine learning?

Current state: Just starting to ‘predict’ environmental noises (using Aurora noises). Recording database of natural scenes for analysis and modelling (with Sheffield).

Page 8: Liverpool University

… over to Elvira

Page 9: Liverpool University

Environmental Noise

• Two-pronged approach
– Elvira: is there perceptual evidence for active noise modelling in listeners?
– Georg (+ Sheffield): noise modelling based on a database

Page 10: Liverpool University

Baseline Data

• Typical noise databases are not very representative
– Size severely limited (e.g. Aurora)
– Unrealistic scenarios (fighter jets, foundries)

• Database of environmental noise
– Transport noises: A320-200, ICE, Saab 9-3, …
– Social places: departure lounges, hotel lobby, pub
– Private journeys: urban walk, country walk
– Buildings: offices, corridors
– …

• Aim is to have about 10-20 mins of representative data for typical situations.

Page 11: Liverpool University

Recordings

• Soundman OKMII binaural microphones

• Sony D3 DAT recorder

• 48kHz stereo recordings

• Digital transfer to PC


Page 12: Liverpool University

Analysis

• Previous work
– Auditory filterbank (linear, mel-scale, 32 ch.)
– Linear prediction (see the sketch after this list)
• Within channels
• Across channels

• Planned work
– Auditory filterbank
– Non-linear prediction using neural networks
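A minimal sketch of this analysis chain, assuming a crude band-pass approximation to the 32-channel mel-scale filterbank and the textbook autocorrelation method for within-channel linear prediction; filter shapes, orders and rates here are illustrative, not the original implementation:

# Within-channel linear prediction on a mel-spaced filterbank (sketch).
import numpy as np
from scipy.signal import butter, lfilter

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def filterbank_envelopes(x, fs, n_ch=32):
    """Split the signal into n_ch mel-spaced bands; return crude envelopes."""
    top_mel = 2595.0 * np.log10(1.0 + 0.8 * (fs / 2) / 700.0)
    centres = mel_to_hz(np.linspace(100.0, top_mel, n_ch))
    envs = []
    for fc in centres:
        lo, hi = fc / 1.2, min(fc * 1.2, 0.99 * fs / 2)
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        envs.append(np.abs(lfilter(b, a, x)))   # rectified band signal
    return np.array(envs)

def lpc(x, order=8):
    """Linear prediction coefficients via the autocorrelation method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

fs = 48000                              # matches the 48 kHz recordings
x = np.random.randn(fs)                 # stand-in for one second of a scene
envs = filterbank_envelopes(x, fs)
within = [lpc(e[::480]) for e in envs]  # per-channel envelope predictors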

Page 13: Liverpool University

Using Envelope Information for ASA (Patti Adank)

• Background
– Brungart & Darwin (resource allocation task?)
– Two simultaneous sentences: track one

• Segregation benefits from
– Pitch differences
– Speaker differences

• Key question: operational definition of speaker characteristics

Page 14: Liverpool University

Speaker characteristics

• Vocal tract shape
– Difficult to quantify / computationally extract

• Speaking style (intonation, stress, accent…)
– Difficult to extract measures for very short segments

• Voice characteristics
– F0 – of course…
– Shimmer (amplitude modulation)
– Jitter (roughness – random GCI variation)
– Breathiness (open quotient during voiced speech)

• All relatively easy to extract computationally (see the sketch below)
• All relatively easy to control in speech re-synthesis
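As a concrete illustration of ‘easy to extract’, a sketch using the standard local jitter and shimmer definitions; the per-cycle periods and amplitudes are assumed to come from an external pitch tracker (e.g. Praat), and the values below are simulated:

# Local jitter and shimmer from per-cycle measurements (sketch).
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive glottal periods,
    divided by the mean period (random GCI variation)."""
    p = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def local_shimmer(amplitudes):
    """The same measure applied to per-cycle peak amplitudes
    (amplitude modulation)."""
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)

# Illustrative values: a 100 Hz voice with ~1% timing perturbation.
rng = np.random.default_rng(0)
periods = 0.010 + rng.normal(0.0, 0.0001, size=50)
amps = 1.0 + rng.normal(0.0, 0.02, size=50)
print(f"jitter  = {local_jitter(periods):.2%}")
print(f"shimmer = {local_shimmer(amps):.2%}")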

Page 15: Liverpool University

No. 1 choice: Jitter

– Dan Ellis
• Computational model: segregation by glottal closure instant
• Model groups coincident energy in an auditory filterbank

– Could ‘jitter’ be useful for segregation?

Page 16: Liverpool University

Jitter as a primary segregation cue

• Double-vowel experiment:
– 5 synthetic vowels (Assmann & Summerfield)
– Synthesized with a range of (see the synthesis sketch after this list)
• 5 pitch levels
• 5 jitter levels

• Results
– Pitch difference aids segregation
– Jitter difference does not
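One way such stimuli could be produced, sketched as a plain source-filter synthesis: glottal pulses are placed at 1/F0 intervals with a random perturbation set by the jitter level, then passed through second-order formant resonators. This illustrates the manipulation only; the formant values and the Gaussian jitter model are assumptions, not the original synthesis code:

# Source-filter vowel with a controllable jitter level (sketch).
import numpy as np
from scipy.signal import lfilter

def resonator(f, bw, fs):
    """Second-order all-pole resonator coefficients."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f / fs
    return [1.0], [1.0, -2 * r * np.cos(theta), r * r]

def jittered_vowel(f0, formants, jitter, dur, fs=16000):
    n = int(dur * fs)
    src = np.zeros(n)
    t = 0.0
    rng = np.random.default_rng(1)
    while t < dur:
        src[int(t * fs)] = 1.0                 # glottal pulse
        t += (1.0 / f0) * (1.0 + rng.normal(0.0, jitter))
    out = src
    for f, bw in formants:                     # cascade formant filters
        b, a = resonator(f, bw, fs)
        out = lfilter(b, a, out)
    return out

# e.g. an /a/-like vowel at 100 Hz with a 2% jitter level
x = jittered_vowel(100.0, [(700, 90), (1100, 110), (2500, 170)], 0.02, 0.5)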

Page 17: Liverpool University

[Figure: identification performance (mean ± 1 SE percent, 45-70) as a function of % jitter: baseline (0%), 0.5%, 1%, 2%, 4%]

Page 18: Liverpool University

Jitter analogous to location cues

• Location cues not primary segregation cues– Segregate on pitch first, then– Use location cues for stream formation

• Experiment– Brungard, Darwin (e.g. 2001) Task,

E.g. “Ready Tiger go to White One now”,And “Ready Arrow go to Red Four now”, but

– Speech resynthesized using Praat• Same speaker, different sentences

– Jitter does not aid stream formation

Page 19: Liverpool University

[Figure: % correct colour/number combination (mean ± 1 SE, 20-70) as a function of % jitter: 0%, 3%, 6%, 9%, 12%, 15%]

Page 20: Liverpool University

Informing Speech Recognition

• Jitter is not the no. 1 candidate for informing speech recognition…

Page 21: Liverpool University

Task 2.2: Reliability of auditory cues in multi-cue scenarios

• Ernst & Banks (Nature 2002)
– Maximum likelihood estimation is a good model for visual/somatosensory cue integration

– Adapted this for AV integration: mouse-catching experiment: MLE is a good model (Hofbauer et al., JEP:HPP 2004)

– Want to look at speech cue integration in collaboration with Sheffield (see the MLE sketch below)
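A minimal sketch of the Ernst & Banks MLE rule these studies build on: each cue's estimate is weighted by its inverse variance, which both pulls the combined estimate toward the more reliable cue and reduces the combined variance. The numbers are invented for illustration:

# Maximum-likelihood cue combination (Ernst & Banks style, sketch).
import numpy as np

def mle_combine(estimates, variances):
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    w = (1.0 / var) / np.sum(1.0 / var)       # weights sum to 1
    combined = np.sum(w * est)                # weighted estimate
    combined_var = 1.0 / np.sum(1.0 / var)   # always <= min(var)
    return combined, combined_var, w

# e.g. an auditory and a visual position estimate for the 'mouse'
pos, var, w = mle_combine([10.0, 12.0], [4.0, 1.0])
print(pos, var, w)   # the more reliable visual cue dominates (weight 0.8)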

Page 22: Liverpool University

Hypothesis

• If listeners organise formants by continuity, then
– the /o/ should lead to /m/, while
– the /e/ should lead to /n/, with the second formant of the nasal remaining unassigned

• If proximity is a cue then there should be a changeover at around 1400 Hz

[Stimulus schematic: formant frequencies at 375, 800, 2000 and 2700 Hz; vowel to 100 ms, nasal to 200 ms]

Page 23: Liverpool University

Formants as a representation?

If sequential grouping of formants explains the perceptual change from /m/ to /n/ for high vowel F2s, then transitions should ‘undo’ this change.

[Diagram: formant frequency (F) vs. time]

Page 24: Liverpool University

Transitions in /vm/ syllables

• Synthetic /v-m/ segments as before, but 0, 2.5, 5, 10, 20ms formant transitions

• 7 fluent German speakers, 200 trials each

• Experimental results fit the prediction

[Figure: proportion heard as /em/ (0.0-1.0) as a function of formant transition duration (0-20 ms)]

Page 25: Liverpool University

Transitions ??

• ‘Formant transitions’ of 5 ms have an effect

• Synthetic speech was synthesized at 100 Hz

• Formant transitions: half a glottal period?? (at 100 Hz one period is 10 ms, so 5 ms is half a period)
– Confirmed that the transition has to coincide with the energetic part of the glottal period

• Do subjects use a ‘transition’ or just energy in the appropriate band (1-2 kHz)?

Page 26: Liverpool University

Formant transitions?

• Take /em/ stimulus without transitions (heard as /en/)
• Add a chirp in place of the F2 transition (0, 5, 10, 20, 40 ms); see the generation sketch below

– Down chirp is an FM sinusoid 2 kHz → 1 kHz
– Control is an FM sinusoid 1 kHz → 2 kHz

[Stimulus schematic: formant frequency (Hz) vs. time; vowel (to 100 ms) and nasal (to 200 ms) portions at 375, 800, 2000 and 2700 Hz, with up- and down-chirps replacing the F2 transition]
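A sketch of how the replacement chirps could be generated, using scipy's standard linear-sweep function; the sampling rate, ramps and levels are assumptions rather than the original stimulus code:

# FM-sinusoid chirps: 2 kHz -> 1 kHz (down) and 1 kHz -> 2 kHz (control).
import numpy as np
from scipy.signal import chirp

def make_chirp(dur_ms, fs=16000, down=True):
    t = np.arange(int(fs * dur_ms / 1000)) / fs
    f0, f1 = (2000.0, 1000.0) if down else (1000.0, 2000.0)
    y = chirp(t, f0=f0, t1=t[-1], f1=f1, method="linear")
    ramp = np.minimum(1.0, np.minimum(t, t[-1] - t) / 0.002)  # 2 ms on/off ramps
    return y * ramp

# one chirp per non-zero duration from the slide (0 ms = no-chirp condition)
chirps = {d: make_chirp(d) for d in (5, 10, 20, 40)}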

Page 27: Liverpool University

Model prediction

• ASA: the chirp should be segregated
– Listeners should hear ‘vowel-nasal’ plus chirp
– Listeners should find it difficult to report the ‘time of chirp’

Page 28: Liverpool University

Down Chirp

• 7 listeners, 200 trials each.

• Result:
– The chirp is perceived by listeners
– and integrated into the percept: /en/ is heard as /em/

[Figure: p(/m/) (0.0-1.0) as a function of DOWN chirp duration (0-40 ms)]

Page 29: Liverpool University

What does it all mean

• Subjects
– Hear /em/ when the chirp is added (any chirp!)
– Hear the chirp as a separate sound
– Can identify the direction of the chirp

• Chirps are able to replace the formant
– Spectral and fine time structure are different
– Up direction is inconsistent with the expected F2

Page 30: Liverpool University

Multiresolution scene analysis

• Speech recognition does not require detail

• Scene analysis does…

Page 31: Liverpool University

MLE framework

• Propose to test an MLE model for ASA cue integration

• Cue integration as a weighted sum of component probabilities

[Diagram: formant frequency (F) vs. time, annotated “ASA says: ignore this bit”]

Page 32: Liverpool University

Hypothetical Example

[Three schematic spectrograms, frequency (F) vs. time, each pairing a transition cue with the formant structure:]

– Labial transition: p(m) = 0.8, weight 0.7; formant structure: p(m) = 0.7, weight 0.3 → /m/
– Velar transition: p(n) = 0.8, weight 0.7; formant structure: p(m) = 0.7, weight 0.3 → /n/
– Unknown transition: p(n/m) = 0, weight 0.0; formant structure: p(m) = 0.7, weight 1.0 → /m/

(The weighted sum is worked through in the sketch below.)
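The same example as straight arithmetic. Note that treading the second number in each pair as the cue weight is a reading of the garbled slide (the original symbol was lost), not something the source states explicitly:

# Weighted-sum cue combination with the slide's hypothetical numbers.
def combine(cues):
    """cues: list of (p, weight) pairs; returns the weighted sum."""
    return sum(p * w for p, w in cues)

# labial transition p(m)=0.8 (w=0.7) + formant structure p(m)=0.7 (w=0.3)
print(combine([(0.8, 0.7), (0.7, 0.3)]))   # 0.77 -> hear /m/
# velar transition p(n)=0.8, i.e. p(m)=0.2 (w=0.7) + formants p(m)=0.7 (w=0.3)
print(combine([(0.2, 0.7), (0.7, 0.3)]))   # 0.35 -> hear /n/
# unknown transition carries no weight; the formant structure alone decides
print(combine([(0.0, 0.0), (0.7, 1.0)]))   # 0.70 -> hear /m/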

Page 33: Liverpool University

MLE experiment (Elvira)

[Three schematic spectrograms, frequency (F) vs. time, outlining the planned stimuli]

Page 34: Liverpool University

Taking it further (back)

• Transition cues

• Prior probability: high for speech, low for non-speech

• Localisation cues: weight is low

Page 35: Liverpool University

What does it all mean

• Duplex perception is
– Nothing special
– Entirely consistent with a probabilistic scene analysis viewpoint

• Could imagine a fairly high-impact publication on this topic

• Training activity on ‘Data fusion’?

Page 36: Liverpool University

Where to go from here

• Would like to collaborate on principled testing of these (and related) ideas
– Sheffield ?? IDIAP ??

• Is this any different from missing data recognition?
– Bochum ??

• Want to ‘warm up’ duplex perception?
– Most useful: a hands-on modeller

Page 37: Liverpool University

EEG / MEG Study

• We argue that
– Scene analysis informs speech perception
• Therefore we would expect non-speech signals to be processed/evaluated before speech is recognised
• EEG / MEG data should show
– Differential processing of speech / non-speech signals
– Perhaps an effect of the chirps on the latency of the speech-driven auditory evoked potential (field)

• We have
– A really neat stimulus
– The /em/-/en/ signals can be listened to as speech or as non-speech signals
– The non-speech component changes the speech identity

Page 38: Liverpool University

(very!) Preliminary data

• Four conditions
– /em/ with 20 ms formant transitions
– /em/ with no formant transitions (/en/ percept)
– /em/ no formant transition + 20 ms up chirp (/em/)
– /em/ no formant transition + 20 ms down chirp (/em/)

• Two tasks
– Identify the /em/s
– Identify signals containing chirps

• 16-channel EEG recordings, 200 stimuli each (see the averaging sketch below)
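A sketch of the condition-wise averaging behind the evoked-potential figures that follow: cut an epoch around each stimulus onset, baseline-correct on the pre-stimulus interval, and average. Array shapes follow the slide (16 channels, 200 stimuli per condition); the sampling rate, timing and data are simulated assumptions:

# Epoch extraction and averaging for one condition (sketch).
import numpy as np

fs = 250                                   # assumed sampling rate (Hz)
pre, post = int(0.1 * fs), int(0.6 * fs)   # -100 ms to +600 ms window
rng = np.random.default_rng(2)
eeg = rng.normal(size=(16, 120 * fs))      # 16-channel recording (simulated)
onsets = np.arange(200) * int(0.5 * fs) + fs   # 200 stimulus onsets (samples)

epochs = np.stack([eeg[:, o - pre : o + post] for o in onsets])
baseline = epochs[:, :, :pre].mean(axis=2, keepdims=True)
evoked = (epochs - baseline).mean(axis=0)  # (channels, samples) average ERP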

Page 39: Liverpool University

Predictions

• If ‘speech is special’ then we should see significant task-dependent differences

• May also see significant differences between stimuli leading to the same percept
– The effect of the chirp might delay speech recognition?

– Here we go:

Page 40: Liverpool University

[Figure: evoked responses at T7/T8 (left and right hemisphere), -100 to 600 ms; dotted = non-speech task, solid = speech task]

Page 41: Liverpool University

[Figure: evoked responses at TP7/TP8 (left and right hemisphere), -100 to 600 ms; dotted = non-speech task, solid = speech task]

Page 42: Liverpool University

[Figure: evoked responses at F1/F2, -100 to 600 ms; dotted = non-speech task, solid = speech task]

Page 43: Liverpool University

[Figure: evoked responses at O1/O2 (control), -100 to 600 ms]

No evidence for differences in early (sensory) processing.

Page 44: Liverpool University

EEG Conclusions

• (very!) preliminary data look very promising

• Need to get more subjects
• Refine the paradigm (sequence currently too fast)
– Would an MMN study be appropriate?

• Would like to
– Look at source localisation (MEG Helsinki, fMRI Liverpool)
– Get more channels (MEG Helsinki)