speech processing lab uchechukwu ofoegbu, dissertation defense 10/24/2015 1 model formation and...
TRANSCRIPT
Speech Processing Lab Uchechukwu Ofoegbu, Dissertation Defense
04/20/23
Model Formation and Classification Techniques for Conversation-based Speaker Discrimination
Uchechukwu O. Ofoegbu
Advisor: Robert Yantorno, Ph.D.
Committee Members: Brian Butz, Ph.D.; Dennis Silage, Ph.D.; Iyad Obeid, Ph.D.
Acknowledgement
• The audience, for being a part of this
• ECE faculty and staff, for your great support
• Members and Friends of the Speech Lab, for your valuable contributions
• The Air Force Research Labs, for financially supporting most of this research work
• My committee members, for your time and commitment to my research
• Dr. Y, the best advisor one could hope for
• My family, for being there
Presentation Outline
• Introduction
– Challenges of Conversational Data
– General Applications of Research
– Novelty of Research
• Evaluation Databases
– HTIMIT
– SWITCHBOARD
– New Conversations Database
• Modeling Speakers
– Traditional Speaker Modeling
– Proposed Method
– Features Used
– Distance Used
• Application Systems
– Unsupervised Speaker Indexing
– Speaker Count
– Generalized Speaker Indexing
• Fusion of Distance Measures
– Optimized T Distance
– Decision-Based Combination
– Weighted Decision-Based Combination
• Summary
• Further Research
Introduction
Challenges of Conversational Data
• No a priori information available from participating speakers
– Training is impossible
• No a priori knowledge of change points
• Speakers alternate very rapidly
– Limited amounts of data for single speaker representations
• Distortion
– Channel noise, co-channel data
Proposed Solutions
1. Selective creation of data models
2. Distance-Based Model Comparison
3. Development of application-specific system
Novelty of this Research
1. Selective creation of data models
2. Distance-Based Model Comparison
3. Development of application-specific system
Applications
• Monitoring criminal conversations
• Forensics
• Automated Customer Services
• Storage/Search/Retrieval of Audio Data
• Military Activities
• Conference calls
Databases
• Standard Speaker Discrimination Databases
– HTIMIT
– Switchboard
• Temple Conversations Database (TCD)
Modeling Speakers
Traditional Speaker Modeling
• Examples
– Gaussian Mixture Models
– Hidden Markov Models
– Neural Networks
– Prosody-Based Models
• Disadvantages
– Require large amounts of data
– Sometimes require a training procedure
– Relatively complex
Conversational Data Modeling
• Current Method
– Equal segmentation of data
– Indiscriminate use of data
• Problems
– Change points unknown
– Not all speech is useful
– Poor performance
Proposed Speaker Modeling
[Diagram: the speech signal is shown as a sequence of S, V, and U blocks. Only the V (voiced) blocks are kept and grouped into SEGMENT 1 … SEGMENT M. Each segment passes through FEATURE COMPUTATION and then MEAN AND COVARIANCE MATRIX COMPUTATION, producing MODEL 1 … MODEL M.]
Proposed Speaker Modeling
• Why voiced only?
– Same speech class compared
– Contains the most information
• What’s the appropriate number of phonemes?
– Large enough to sufficiently represent speakers
– Small enough to avoid speaker overlap
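A minimal sketch of the model formation step, assuming the voiced phonemes have already been detected and converted to cepstral feature vectors. The function name `form_models`, the random placeholder features, and the choice to drop leftover phonemes are illustrative assumptions, not details from the presentation. N = 5 follows the value marked on the Best ‘N’ Estimation slide.

```python
import numpy as np

def form_models(voiced_phonemes, n_per_model=5):
    """Group consecutive voiced phonemes into segments and model each one.

    voiced_phonemes: list of (frames x dims) feature arrays, one per phoneme.
    Returns one (mean, covariance, frame_count) tuple per segment.
    """
    models = []
    last_start = len(voiced_phonemes) - n_per_model
    for start in range(0, last_start + 1, n_per_model):
        # Stack the frames of N consecutive voiced phonemes into one segment.
        segment = np.vstack(voiced_phonemes[start:start + n_per_model])
        mean = segment.mean(axis=0)
        cov = np.cov(segment, rowvar=False)
        models.append((mean, cov, segment.shape[0]))
    return models

# Placeholder data: 12 "phonemes", each with 8 frames of 3-dimensional features.
rng = np.random.default_rng(0)
phonemes = [rng.normal(size=(8, 3)) for _ in range(12)]
models = form_models(phonemes, n_per_model=5)
```

With 12 phonemes and N = 5 this yields two models. The two leftover phonemes are discarded here, which is only one possible policy.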
Features Considered
• Linear Predictive Cepstral Coefficients
– Model the vocal tract
• Mel-Scale Frequency Cepstral Coefficients
– Model the human auditory system
Distance Measurements
Same speaker distances
Different speaker distances
Distances Used
• Mahalanobis Distance
• Hotelling’s T-Square Statistics
• Kullback-Leibler Distance
• Bhattacharyya Distance
• Levene’s Test
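As an illustration of the first two measures, the sketch below compares two mean-and-covariance models. It assumes the textbook two-sample forms with a pooled covariance matrix. The presentation does not state which exact formulation was used, so the function names and the pooling choice are assumptions.

```python
import numpy as np

def mahalanobis_between_models(mean1, cov1, n1, mean2, cov2, n2):
    """Mahalanobis distance between two model means (pooled covariance)."""
    pooled = ((n1 - 1) * cov1 + (n2 - 1) * cov2) / (n1 + n2 - 2)
    diff = mean1 - mean2
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))

def hotelling_t_square(mean1, cov1, n1, mean2, cov2, n2):
    """Two-sample Hotelling T-square: the squared Mahalanobis distance
    scaled by a factor that depends on the two data lengths."""
    d = mahalanobis_between_models(mean1, cov1, n1, mean2, cov2, n2)
    return (n1 * n2) / (n1 + n2) * d ** 2

# Toy models: identity covariances, means 5 units apart, 10 frames each.
m1, m2 = np.array([0.0, 0.0]), np.array([3.0, 4.0])
eye = np.eye(2)
d_mahal = mahalanobis_between_models(m1, eye, 10, m2, eye, 10)
t_square = hotelling_t_square(m1, eye, 10, m2, eye, 10)
```

The only difference between the two values is the length-dependent scale factor, so T-square grows as more frames support each model.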
Analysis of Cepstral Features
• Mahalanobis Distance
[Figure: Mahalanobis Distance Comparisons. Two panels, LPCC-based Distances and MFCC-based Distances, each plotting probability of occurrence against distance for intra-speaker and inter-speaker pairs.]
Best Number of Phonemes?
[Figure: Speaker Differentiation with Respect to Data Size. Mahalanobis distance plotted against number of phonemes for same-speaker and different-speaker pairs. Features used: LPCC.]
Application Systems
Unsupervised Speaker Indexing
• The Restrained-Relative Minimum Distance (RRMD) Approach
$$\begin{bmatrix} 0 & D_{1,2} & D_{1,3} & \cdots \\ D_{2,1} & 0 & D_{2,3} & \cdots \\ D_{3,1} & D_{3,2} & 0 & \cdots \\ \vdots & & & \end{bmatrix}$$
[Diagram: pairwise distance matrix between models, labeled REFERENCE MODELS.]
Unsupervised Speaker Indexing
• The Restrained-Relative Minimum Distance (RRMD) Approach
[Flowchart: Observe distance to Reference 1 and Reference 2 and take the Min. Distance. Same Speaker? The Restraining Condition is checked first: Passed leads to Same Speaker, Failed leads to the Relative Distance Condition. For the Relative Distance Condition: Passed leads to Same Speaker, Failed leads to Unusable Data.]
RRMD Approach
• Restraining Condition
– Distance Likelihood Ratio
$$\mathrm{DLR} = \frac{f(x \mid \mu_1, \sigma_1)}{f(x \mid \mu_2, \sigma_2)}$$
– DLR > 1 ⇒ Same Speaker
– DLR < 1 ⇒ Check Relative Distance Condition
RRMD Approach
• Relative Distance Condition
– Relative Distance: $D_{rel} = d_{max} - d_{min}$
– $D_{rel}$ > threshold ⇒ Same Speaker
[Diagram: a model's distances to Reference 1 and Reference 2, labeled $d_{min}$ and $d_{max}$.]
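The two RRMD conditions can be sketched as a small decision function. This is an illustrative reading of the slides, not the dissertation's implementation: the normal densities, the parameter values in `SAME` and `DIFF`, the threshold, and the choice to evaluate the DLR at the minimum distance are all assumptions.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (assumed form of f)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def distance_likelihood_ratio(x, same_params, diff_params):
    """Likelihood of distance x under the same-speaker distribution
    divided by its likelihood under the different-speaker distribution."""
    return normal_pdf(x, *same_params) / normal_pdf(x, *diff_params)

def rrmd_decision(d_ref1, d_ref2, same_params, diff_params, rel_threshold):
    """Return 1 or 2 (the matched reference), or None for unusable data."""
    d_min, d_max = sorted((d_ref1, d_ref2))
    nearest = 1 if d_ref1 <= d_ref2 else 2
    # Restraining condition: DLR > 1 means same speaker as the nearest reference.
    if distance_likelihood_ratio(d_min, same_params, diff_params) > 1:
        return nearest
    # Relative distance condition: D_rel = d_max - d_min must exceed a threshold.
    if d_max - d_min > rel_threshold:
        return nearest
    return None

# Hypothetical (mean, standard deviation) pairs for the two distance distributions.
SAME, DIFF = (1.0, 0.3), (2.5, 0.5)
clear_match = rrmd_decision(1.0, 2.6, SAME, DIFF, 0.4)
relative_match = rrmd_decision(2.6, 1.9, SAME, DIFF, 0.4)
unusable = rrmd_decision(2.4, 2.5, SAME, DIFF, 0.4)
```

In this toy run the first model passes the restraining condition, the second is rescued by the relative distance condition, and the third is set aside as unusable.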
Experiments and Results
• Experiments
– HTIMIT used for obtaining likelihood ratio parameters
• 1000 same speaker and 1000 different speaker utterances computed
– 100 conversations from Switchboard database used for evaluation
Indexing Results - Mahalanobis
[Figure: two panels (LPCC and MFCC) for the Mahalanobis distance, titled Classification Error with Respect to Relative Distance Threshold and Classification Error with Respect to Mean-Difference Threshold. Each plots percent error against threshold with curves for indexing error and undecided error. The first panel also marks the equal error point.]
Indexing Results – T-Square
[Figure: two panels (LPCC and MFCC) for the T-Square statistic, titled Classification Error with Respect to Mean-Difference Threshold and Classification Error with Respect to Relative Distance Threshold. Each plots error against threshold with curves for indexing error and undecided error.]
Indexing Results - Bhattacharyya
[Figure: two panels (LPCC and MFCC) for the Bhattacharyya distance, both titled Classification Error with Respect to Mean-Difference Threshold. Each plots error against threshold with curves for indexing error and undecided error. The first panel also marks the equal error point.]
Indexing Results - Summary
• Mahalanobis distance yielded best results
• LPCCs outperformed MFCCs
Speaker Count System
• The Residual Ratio Algorithm (RRA)
• Process is repeated K-1 times for counting up to K speakers
[Flowchart: Reference Model Selected Randomly → DLR-based Model Comparison → Reference Model Selected Randomly → … → DLR-based Model Comparison → Reference Model Selected Randomly. If too little data is removed, another model is selected.]
Speaker Count
[Figure: ARR Probability Distributions. Three plots of probability against ARR, showing the distributions for 1 and 2 speakers, for 1 to 3 speakers, and for 1 to 4 speakers.]
Added Residual Ratio (ARR):
• Is the sum of the residual ratios in all elimination stages
• Should be higher for greater number of speakers
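The elimination loop behind the ARR can be sketched as follows. The slides do not define the residual ratio itself, so this sketch assumes it is the fraction of models left after a stage relative to the models present before that stage. The `is_match` callable stands in for the DLR-based comparison, and the "too little data removed" re-selection step is omitted. All names are hypothetical.

```python
import numpy as np

def added_residual_ratio(distances, is_match, max_speakers=4, seed=0):
    """Sum the residual ratios over up to K-1 elimination stages.

    distances: symmetric (M x M) array of pairwise model distances.
    is_match: callable deciding whether a distance means "same speaker"
              (a stand-in for the DLR-based comparison).
    """
    rng = np.random.default_rng(seed)
    remaining = list(range(len(distances)))
    arr = 0.0
    for _ in range(max_speakers - 1):
        if not remaining:
            break
        # Pick a reference model at random from the models still present.
        ref = remaining[rng.integers(len(remaining))]
        before = len(remaining)
        # Remove the reference and every model that matches it.
        remaining = [m for m in remaining
                     if m != ref and not is_match(distances[ref][m])]
        # Assumed definition: residual ratio = models left / models before.
        arr += len(remaining) / before
    return arr

def toy_distances(labels):
    """Distance 0.5 within a speaker and 3.0 across speakers (toy values)."""
    labels = np.asarray(labels)
    return np.where(labels[:, None] == labels[None, :], 0.5, 3.0)

arr_one = added_residual_ratio(toy_distances([0] * 6), lambda d: d < 1.0)
arr_two = added_residual_ratio(toy_distances([0, 0, 0, 1, 1, 1]), lambda d: d < 1.0)
```

Under this assumed definition the one-speaker toy case gives a lower ARR than the two-speaker case, which agrees with the slide's statement that the ARR should be higher for a greater number of speakers.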
Experiments and Results
• Experiments
– 4000 conversations generated from HTIMIT
– All 40 conversations from new database used
Speaker Count Results - HTIMIT
[Figure: RRA Speaker Count Accuracy. Two bar charts (LPCC and MFCC) of percent correct for each accuracy method (1 or 2+; 1, 2 or 3+; 1, 2, 3 or 4) using the Bhattacharyya, T-Square, and Mahalanobis distances.]
Speaker Count Results - HTIMIT
[Figure: Speaker Count Accuracy. Two bar charts (LPCC and MFCC) of percent correct for each accuracy method (1 or 2+; 1, 2 or 3+; 1, 2, 3 or 4) using Mahalanobis, OTD, DBC, and WDBC.]
Speaker Count Results – TCD
[Figure: RRA Speaker Count Accuracy. Two bar charts (LPCC and MFCC) of percent correct for each accuracy method (1 or 2+; 1, 2 or 3+; 1, 2, 3 or 4) using the Bhattacharyya, T-Square, and Mahalanobis distances.]
Speaker Count Results – TCD
[Figure: Speaker Count Accuracy. Two bar charts (LPCC and MFCC) of percent correct for each accuracy method (1 or 2+; 1, 2 or 3+; 1, 2, 3 or 4) using T-Square, OTD, DBC, and WDBC.]
Cross Evaluation
• HTIMIT – LPCCs with the WDBC
• TCD – MFCCs with the T-Square
[Figure: Speaker Count Accuracy - Cross Evaluation. Percent correct for each accuracy method (1 or 2+; 1, 2 or 3+; 1, 2, 3 or 4) on HTIMIT and TCD.]
Speaker Counting-Indexing
• The Residual Ratio speaker count algorithm is applied
• Test models are associated with their matching reference models
• Each unmatched model is assigned to the reference from which it has the minimum distance.
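The last step of the counting-indexing procedure can be sketched as a nearest-reference assignment. The input layout (a models-by-references distance array plus a list of already matched labels) and the function name are assumptions made for illustration.

```python
import numpy as np

def assign_unmatched(dist_to_refs, matched):
    """Label every model with a reference index.

    dist_to_refs: (models x references) array of distances.
    matched: per-model reference index from the speaker count stage,
             or None when the model matched no reference.
    """
    labels = []
    for row, ref in zip(dist_to_refs, matched):
        # Keep an existing match; otherwise fall back to the closest reference.
        labels.append(ref if ref is not None else int(np.argmin(row)))
    return labels

# Toy example: the second model was left unmatched by the count stage.
dists = np.array([[0.2, 3.0], [2.5, 2.0], [3.1, 0.4]])
labels = assign_unmatched(dists, [0, None, 1])
```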
Speaker Counting/Indexing Results
[Figure: Speaker Indexing Accuracy. Bar chart of percent accuracy for Bhattacharyya, T-Square, Mahalanobis, OTD, DBC, and WDBC. Solid bars: HTIMIT; patterned bars: TCD.]
Fusion of Distance Measures
Correlation Analysis
[Figure: Draftsman’s Display of Distances (LPCC). Scatter-plot matrix of the Levene, Bhattacharyya, KL, T-Squared, and Mahalanobis distances for intra-speaker and inter-speaker pairs.]
“Best Distance”
• Optimal Criteria for Fusion of Distances
– Maximize inter-speaker variation
– Minimize intra-speaker variation
– Maximize T-test value between inter-class distance distributions
Decision Level Fusion
D1 => match
D2 => no match
D3 => match
D4 => match
Match = ¾
No Match = ¼
Final Decision = Match
Weighted Decision Level Fusion
$$w_i = \frac{T_i}{\sum_i T_i}$$
$T_i$ = T-value corresponding to each distance
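Both decision-level schemes can be sketched in a few lines. The plain vote mirrors the D1 to D4 example from the Decision Level Fusion slide. The weighted vote normalizes each distance's T-value by the sum of all T-values. The 0.5 decision boundary, the tie handling, and the sample T-values are assumptions.

```python
def majority_vote(decisions):
    """decisions: one boolean per distance measure (True = match)."""
    return sum(decisions) / len(decisions) > 0.5

def weighted_vote(decisions, t_values):
    """Weight each decision by its distance's T-value, normalized to sum to 1."""
    total = sum(t_values)
    weights = [t / total for t in t_values]
    match_weight = sum(w for w, d in zip(weights, decisions) if d)
    # Ties (exactly 0.5) are treated as "no match" in this sketch.
    return match_weight > 0.5

# D1 = match, D2 = no match, D3 = match, D4 = match.
votes = [True, False, True, True]
plain = majority_vote(votes)
# Hypothetical T-values in which the dissenting distance is the most discriminative.
weighted = weighted_vote(votes, [1.0, 10.0, 1.0, 1.0])
```

The toy T-values are chosen so that the single dissenting distance outweighs the other three, which shows how the weighted scheme can overturn a plain majority.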
Summary
Research Goal
• To differentiate between speakers in a conversation
– To determine the number of speakers present
– To determine who is speaking when
• To overcome the following challenges
– No a priori information
– Limited data size
– No knowledge of change points
– Co-channel speech
Summary of Accomplishments
• Novel model formation technique
• Three novel approaches for conversation-based speaker differentiation
• Distance combination techniques to enhance performance
Observations
• Mahalanobis Distance, LPCCs optimal for standard databases
• T-Square Distance, MFCCs optimal for new database
• Weighted voting combination is the most efficient fusion technique
Conclusion
• The developed system yields about 6% EER, whereas state-of-the-art speaker indexing systems yield about a 10% error rate.
• Methods for discrimination between speakers (speaker count or indexing) in CONVERSATIONS with more than two speakers have been introduced.
Further Research
• Investigation of prosodic speaker discrimination features
• Improving model formation technique by determining speaker change-points a priori
• Exploring the use of individual phonemes to form models
Further Research, cont’d
• Investigating the use of unvoiced speech, cautiously, in the formation of models
• Speech enhancement techniques to handle distorted data
• Implementation of other fusion techniques such as KL measure of divergence
Publications
• U. Ofoegbu, A. Iyer, R. Yantorno and S. Wenndt, “Unsupervised Indexing of Noisy Conversations with Short Speaker Utterances”, IEEE Aerospace Conference, March 2007.
• U. Ofoegbu, A. Iyer, R. Yantorno, “Detection of a Third Speaker in Telephone Conversations”, ICSLP, INTERSPEECH 2006.
• U. Ofoegbu, A. Iyer, R. Yantorno, “A Simple Approach to Unsupervised Speaker Indexing”, IEEE ISPACS, 2006.
• U. Ofoegbu, A. Iyer, R. Yantorno, “A Speaker Count System for Telephone Conversations”, IEEE ISPACS, 2006.
ACKNOWLEDGEMENT
To the greatest teacher in the world, and the one who has made the most impact in my life, Dr. Robert E. Yantorno.
To my best friend and the love of my life, Dr. Jude C. Abanulo
To Dr. Brian Butz, Dr. John Helferty, Dr. Saroj Biswas and Dr. Henry Sendaula
To my dissertation committee members, Dr. Iyad Obeid and Dr. Dennis Silage, and to Dr. Rena Krakow.
To my friend, Ananth Iyer
To Abdoul Fall, Joe Fitschgrund, Angela Linse and Ralph Oyini; and to the members of the Speech Processing Lab and the faculty of the electrical engineering department
To engineering administrators, Tamika Butler, Carol Dahlberg, Yvette Gibson and Cheryl Sharp, and to Louise, day time janitress for the engineering building
To the Temple students who volunteered as participants in the New Conversations Database
To Temple
To the Air Force Research Labs at Rome – Financial supporters of most of the research
To my parents, Ugo & Joseph Ofoegbu; my siblings, Amaka & Humphrey Onyendi, Nene, Obinna and Chibuzor Ofoegbu; and my grandmother, Cordelia Osuji
To God
Thank you.
or give suggestions
Advisor: Robert Yantorno, Ph.D.
Committee Members: Brian Butz, Ph.D.; Dennis Silage, Ph.D.; Iyad Obeid, Ph.D.; Brett Smolenski, Ph.D.
Cepstral Analysis
[Diagram: Frequency Analysis of Speech.
STFT of Speech = Excitation Component (fast varying harmonics) × Vocal Tract Component (slowly varying formants)
Log of STFT = Log of Excitation + Log of Vocal Tract Component
IDFT of Log of STFT = Vocal tract + Excitation]
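The multiply, log, add chain on this slide can be checked numerically. The sketch below uses a single FFT over a whole toy signal rather than a short-time transform, and random signals as stand-ins for the excitation and vocal tract components. Both simplifications are assumptions made to keep the example short.

```python
import numpy as np

def real_cepstrum(x):
    """IDFT of the log magnitude spectrum of x."""
    spectrum = np.fft.fft(x)
    return np.fft.ifft(np.log(np.abs(spectrum))).real

rng = np.random.default_rng(0)
excitation = rng.normal(size=64)
vocal_tract = rng.normal(size=64)

# Multiplying the spectra corresponds to (circular) convolution in time.
speech = np.fft.ifft(np.fft.fft(excitation) * np.fft.fft(vocal_tract)).real

# In the cepstral domain the two components add.
is_additive = np.allclose(
    real_cepstrum(speech),
    real_cepstrum(excitation) + real_cepstrum(vocal_tract),
)
```

Because the two components add in the cepstral domain, the slowly varying vocal tract part and the fast varying excitation part can be separated there, which is what the last row of the diagram depicts.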
Cepstral Features
• Linear Predictive Cepstral Coefficients
– Obtained recursively from LPC coefficients
Let LPC vector = [a0 a1 a2 … ap] and
LPCC vector = [c0 c1 c2 … cp … cn-1]
$$c_0 = \ln E^2$$
$$c_m = a_m + \frac{1}{m}\sum_{k=1}^{m-1} (m-k)\, a_k\, c_{m-k}, \quad 1 \le m \le p$$
$$c_m = \frac{1}{m}\sum_{k=1}^{p} (m-k)\, a_k\, c_{m-k}, \quad p < m < n$$
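A minimal sketch of the LPC-to-LPCC recursion, using the standard textbook form for an all-pole model H(z) = G / (1 − Σ a_k z^−k). The sign convention, the `gain` argument used for c0, and the function name are assumptions. The presentation's own equations are too damaged in the transcript to confirm every detail.

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps, gain=1.0):
    """Convert predictor coefficients a_1..a_p into n_ceps cepstral coefficients.

    a: coefficients a_1..a_p of H(z) = G / (1 - sum_k a_k z^-k), without a_0.
    """
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps)
    c[0] = np.log(gain ** 2)
    for m in range(1, n_ceps):
        acc = 0.0
        # The sum runs over k = 1..m-1 while m <= p, and over k = 1..p afterwards.
        for k in range(1, min(m - 1, p) + 1):
            acc += (m - k) * a[k - 1] * c[m - k]
        c[m] = acc / m
        if m <= p:
            c[m] += a[m - 1]
    return c

# Single-pole check: for H(z) = 1 / (1 - 0.5 z^-1) the cepstrum is 0.5**m / m.
lpcc = lpc_to_lpcc([0.5], n_ceps=4)
```

The single-pole case has the closed form 0.5^m / m for m ≥ 1, which gives a quick sanity check on the recursion.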
Conversational Data Modeling
• Current Method
– Equal Segmentation of Data
– Indiscriminate use of data
• Problems
– Change points unknown
– Not all speech is useful
“Best Distance”
• Intra-speaker and inter-speaker distance lengths are always equal, therefore:
$$T = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}$$
P = sum of the covariance matrices of the two classes.
Q = the square of the distance between the mean vectors of the two classes.
λ1 = maximum eigenvalue obtained by solving the generalized eigenvalue problem:
$$Q\,a_1 = \lambda_1 P\,a_1$$
$$a_1 = k\,P^{-1}(\mu_1 - \mu_2)$$
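One way to read this slide is as the classical Fisher linear discriminant applied to vectors of distance measures: find the weighting of the distances that maximizes the T-value between the intra-speaker and inter-speaker classes. The sketch below implements that reading. Whether the dissertation computes its optimized distance exactly this way is an assumption, and the synthetic data and function names are illustrative only.

```python
import numpy as np

def optimized_direction(intra, inter):
    """Weights proportional to P^-1 (mean difference), with P the sum of
    the two class covariance matrices (Fisher discriminant direction).

    intra, inter: (samples x measures) arrays of distance vectors.
    """
    mean_diff = inter.mean(axis=0) - intra.mean(axis=0)
    p = np.cov(intra, rowvar=False) + np.cov(inter, rowvar=False)
    a = np.linalg.solve(p, mean_diff)
    return a / np.linalg.norm(a)

def t_value(x1, x2):
    """T-value between two equal-length samples, up to a constant factor."""
    return abs(x1.mean() - x2.mean()) / np.sqrt(x1.var(ddof=1) + x2.var(ddof=1))

# Synthetic distance vectors for two measures, 200 pairs per class.
rng = np.random.default_rng(0)
intra = rng.normal(loc=[1.0, 2.0], scale=[0.5, 1.0], size=(200, 2))
inter = rng.normal(loc=[2.0, 3.5], scale=[0.6, 1.2], size=(200, 2))

a = optimized_direction(intra, inter)
fused_t = t_value(intra @ a, inter @ a)
single_t = [t_value(intra[:, i], inter[:, i]) for i in range(2)]
```

On the data used to fit it, the fused distance can never separate the classes worse than any single measure, because each single measure is a special case of the weighting.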
“Best Distance”
[Figure: scatter plot of Distance Measure 2 against Distance Measure 1.]
RRMD Approach
• Relative Distance Condition
[Figure: Distribution of T-Square Statistics for N = 5. Probability plotted against T-Square statistics for intra-speaker and inter-speaker pairs, with $D_{rel}$ marked.]
Modeling Analysis
N = 20 – 4 seconds of voiced speech
[Figure: Distribution of Mahalanobis Distance - Utterance Based. Probability plotted against distance value for same-speaker and different-speaker pairs. A second panel, Mahalanobis Distance Comparisons, plots probability of occurrence against distance for SSDU and DSDU.]
Modeling Analysis
N = 5 – 1 second of voiced speech
[Figure: Distributions of Mahalanobis Distance - Segment Based. Probability plotted against distance value for same-speaker and different-speaker pairs. A second panel, Mahalanobis Distance Comparisons, plots probability of occurrence against distance for SSDU and DSDU.]
Distance Measures
• Mahalanobis Distance
– Measures the separation between the means of both classes
• Hotelling’s T-Square Statistics
– Measures the separation between the means of both classes and takes into consideration the data lengths
• Kullback-Leibler Distance
– Measures the separation between the distributions of both classes
• Bhattacharyya Distance
– Derived from measuring the classification error between both classes
• Levene’s Test
– Measures absolute deviation from the center of the class distribution
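For the two distribution-based measures, the closed forms for Gaussian models are shown below as a sketch. It assumes each model is a single Gaussian (mean and covariance) and uses the symmetric form of the Kullback-Leibler divergence, namely the sum of both directions. The presentation does not say which variant was used, so both choices are assumptions.

```python
import numpy as np

def bhattacharyya_distance(mean1, cov1, mean2, cov2):
    """Bhattacharyya distance between two Gaussian models."""
    cov = (cov1 + cov2) / 2
    diff = mean1 - mean2
    mean_term = diff @ np.linalg.solve(cov, diff) / 8
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return float(mean_term + 0.5 * (logdet - 0.5 * (logdet1 + logdet2)))

def symmetric_kl_distance(mean1, cov1, mean2, cov2):
    """KL(1 || 2) + KL(2 || 1) for two Gaussian models."""
    diff = mean1 - mean2
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    dims = len(mean1)
    trace_term = np.trace(inv2 @ cov1) + np.trace(inv1 @ cov2) - 2 * dims
    mean_term = diff @ (inv1 + inv2) @ diff
    return float(0.5 * (trace_term + mean_term))

# Toy models: identity covariances, means 2 units apart.
m1, m2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
eye = np.eye(2)
d_bhatt = bhattacharyya_distance(m1, eye, m2, eye)
d_kl = symmetric_kl_distance(m1, eye, m2, eye)
```

Unlike the Mahalanobis distance, both measures also respond to differences between the two covariance matrices, not only to the separation of the means.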
Speaker Recognition
[Block diagram: Reference Speech → Feature Extraction → Model Building; Test Speech → Feature Extraction → Comparison → Recognition Decision → System Output.]
• Speaker Identification
– Who is this speaker?
• Speaker Verification
– Is he who he claims to be?
Speaker Segmentation
• Broadcast News/Conference Data
• Conversational Data
[Figure: waveform plots of amplitude against time (seconds).]
Procedural Set-up
• 384-Speaker database used
• Average Utterance Length = 5 seconds
Intra-speaker distance computations
[Diagram: Utterances from Speaker A → Randomly Select 2 Utterances (Utterance 1, Utterance 2) → Window Data → Compute Feature → Compute Distance]
Inter-speaker distance computations
[Diagram: Speaker A and Speaker B → Randomly Select Utterance (one per speaker) → Window Data → Compute Feature → Compute Distance]
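The two procedures above can be sketched as follows. This is an illustrative outline, not the set-up used in the dissertation: the feature (mean absolute amplitude per frame), the distance (difference of mean feature values), the frame sizes, and all function names are placeholders chosen so the example runs without any speech data.

```python
import random

def window(signal, size, hop):
    """Split a sequence of samples into fixed-size frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def toy_feature(frame):
    """Placeholder feature: mean absolute amplitude of the frame."""
    return sum(abs(s) for s in frame) / len(frame)

def toy_distance(f1, f2):
    """Placeholder distance: absolute difference of the mean feature values."""
    return abs(sum(f1) / len(f1) - sum(f2) / len(f2))

def utterance_features(utt, size=4, hop=2):
    """Window the utterance and compute one feature per frame."""
    return [toy_feature(fr) for fr in window(utt, size, hop)]

def intra_distance(db, rng):
    """Two different utterances from one randomly chosen speaker."""
    speaker = rng.choice(sorted(db))
    u1, u2 = rng.sample(db[speaker], 2)
    return toy_distance(utterance_features(u1), utterance_features(u2))

def inter_distance(db, rng):
    """One utterance from each of two different randomly chosen speakers."""
    a, b = rng.sample(sorted(db), 2)
    return toy_distance(utterance_features(rng.choice(db[a])),
                        utterance_features(rng.choice(db[b])))

# Toy database: speaker name -> list of utterances (lists of samples).
db = {"A": [[0.10] * 8, [0.12] * 8], "B": [[0.50] * 8, [0.55] * 8]}
rng = random.Random(0)
intra = [intra_distance(db, rng) for _ in range(20)]
inter = [inter_distance(db, rng) for _ in range(20)]
print(max(intra), min(inter))
```

Repeating both draws many times gives the intra-speaker and inter-speaker distance distributions that the later slides compare.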
Best ‘N’ Estimation
• 245 conversations from SWITCHBOARD used
• Results shown for T-Square distance
[Figure: Average Indexing Accuracy With Respect to Number of Voiced Phonemes Per Model. x-axis: N = Number of Voiced Phonemes (2 to 20); y-axis: Accuracy (70 to 84); annotation: N = 5]
RRA Examples – 2 Speakers
[Figure: How the Residual Ratio Algorithm Works for Two-Speaker Conversations. Three panels versus Time (Seconds), 0 to 60 s: Initial Speech, Round 2 Residual, Round 3 Residual; segments labeled Speaker 1 and Speaker 2]
RRA Examples – 3 Speakers
[Figure: How the Residual Ratio Algorithm Works for Three-Speaker Conversations. Three panels versus Time (Seconds), 0 to 100 s: Initial Speech, Round 2 Residual, Round 3 Residual; segments labeled Speaker 1, Speaker 2, and Speaker 3]
Comparison
[Figure: Residual Ratio after 2nd round of RRA, shown for the TWO-SPEAKER RESIDUAL and the THREE-SPEAKER RESIDUAL; Speaker 2 labeled in both panels]
Effects of Fusion
LPCCs
[Figure: Distance Feature - LPCC. Probability distributions of the intra and inter distances, one panel per measure; the value in each panel title is listed below]
• Tmax - 44.3369
• Mahalanobis - 44.0652
• TSquared - 35.2111
• KL - 22.7672
• Bhattacharyya - 33.7449
• Levene - 13.4432
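The number attached to each measure appears to be a t-test value describing how far apart the intra and inter distance distributions are (a later slide labels its vertical axis "T-Test Value"). The slides do not say which form of the test was used, so the sketch below assumes Welch's two-sample t-statistic; the function name and the toy distances are illustrative only.

```python
from statistics import mean, variance

def welch_t(intra, inter):
    """Welch's two-sample t-statistic between two lists of distances."""
    se = (variance(intra) / len(intra) + variance(inter) / len(inter)) ** 0.5
    return (mean(inter) - mean(intra)) / se

# Toy distances only; a larger value means better-separated distributions.
intra = [1.0, 1.2, 0.9, 1.1]
inter = [2.0, 2.3, 1.9, 2.2]
print(welch_t(intra, inter))
```

Under this reading, a distance measure with a higher value separates same-speaker pairs from different-speaker pairs more cleanly.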
Effects of Fusion
LPCCs
[Figure: Increase in Inter-Class Separation as Number of Distances is Increased, Features - LPCC. x-axis: Number of Distances (1 to 5); y-axis: T-Test Value (44.05 to 44.35); points labeled Mahalanobis, Levene, Bhattacharyya, KL, TSquared]
Effects of Fusion
MFCCs
[Figure: Distance Feature - MFCC. Probability distributions of the intra and inter distances, one panel per measure; the value in each panel title is listed below]
• 39.4524 (first panel; no measure name survives in the title)
• Mahalanobis - 38.2733
• TSquared - 32.5542
• KL - 11.0738
• Bhattacharyya - 23.4276
• Levene - 16.8735
Effects of Fusion
MFCCs
[Figure: Increase in Inter-Class Separation as Number of Distances is Increased, Features - MFCC. x-axis: Number of Distances (1 to 5); y-axis: T-Test Value (38.2 to 39.6); points labeled Mahalanobis, Levene, KL, TSquared, Bhattacharyya]
Best Feature Size
[Figure: Analyses of the Effects of Increasing the Size of the Feature Set (Voiced - Segment Based - LPCC). x-axis: Number of Coefficients (3 to 30); y-axis: T-Statistics (0 to 25); legend: Mahalanobis, T-Square, KL, Bhattacharyya, Levene]
Best Feature Size
[Figure: Analyses of the Effects of Increasing the Size of the Feature Set (Voiced - Segment Based - MFCC). x-axis: Number of Coefficients (3 to 30); y-axis: T-Statistics (0 to 30); legend: Mahalanobis, T-Square, KL, Bhattacharyya, Levene]
Correlation Analysis
[Figure: Draftsman’s Display of Distances (MFCC). Pairwise scatter plots of the Mahalanobis, TSquared, KL, Bhattacharyya, and Levene distances; legend: Intra, Inter]
Draftsman’s Display - MFCC
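A draftsman's display is a matrix of pairwise scatter plots. A numeric companion to it is the pairwise correlation coefficient between the distance measures, sketched below with the Pearson coefficient. The slides show only the plots, so the choice of coefficient, the function names, and the toy values are assumptions for illustration.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def correlation_matrix(scores):
    """Map every ordered pair of measure names to their correlation."""
    names = list(scores)
    return {(p, q): pearson(scores[p], scores[q]) for p in names for q in names}

# Toy values only: each list holds one measure evaluated on the same four pairs.
scores = {
    "Mahalanobis": [1.5, 2.0, 2.5, 3.0],
    "TSquared": [200.0, 400.0, 600.0, 800.0],
    "KL": [900.0, 100.0, 500.0, 300.0],
}
for pair, r in correlation_matrix(scores).items():
    print(pair, round(r, 3))
```

Two measures that are strongly correlated carry largely the same information, so combining them would be expected to add little.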
Advisor: Robert Yantorno, Ph.D.
Committee Members:
Brian Butz, Ph.D.
Dennis Silage, Ph.D.
Iyad Obeid, Ph.D.
Brett Smolenski, Ph.D.