Speaker Diarization
A Thesis Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Technology in
Computer Science & Engineering
By:
Avinash Kumar Pandey
2006CS50213
Under The Guidance of:
Prof. K. K. Biswas
Department of Computer Science
IIT Delhi
Email: [email protected]
Department of Computer Science
Indian Institute of Technology Delhi
Certificate
This is to certify that the thesis titled "Speaker Diarization" being submitted by
Avinash Kumar Pandey, Entry Number: 2006CS50213, in partial fulfillment of the
requirements for the award of the degree of Master of Technology in Computer
Science & Engineering, Department of Computer Science, Indian Institute of
Technology Delhi, is a bona fide record of the work carried out by him under my
supervision. The matter submitted in this dissertation has not been submitted for the
award of any other degree anywhere unless explicitly referenced.
Signed: ______________________
Prof. K.K. Biswas
Department of Computer Science
Indian Institute Of Technology, Delhi
New Delhi-110016 India
Date: ______________________
Abstract
In this document we describe a speaker diarization system of the internal
segmentation type. It performs text independent speaker identification using
Gaussian mixture models and the MFCC feature vector set, aided by a Gaussian
mixture model based speech activity detection system.
A general speaker diarization, or speaker detection and tracking, system
consists of four modules: speech activity detection, speaker segmentation, speaker
clustering and speaker identification. In an internal segmentation type diarization
system the functions of the three modules of segmentation, clustering and
identification are discharged by the same speaker identification module: segmentation
is done after identifying which speaker a particular audio segment belongs to, so
clustering happens simultaneously.
Speaker identification systems themselves come in various types: a system can impose
certain limitations on the text the speaker must utter to get identified, or it may be
completely limit-free. The latter systems are called text independent speaker
identification systems. Different speaker identification systems work on different
types of feature vectors; text independent systems work on lower level
glottal features of the speaker.
The feature vectors thus obtained can be modeled using different statistical models;
we have experimented mainly with two: vector quantization and Gaussian
mixture models.
Acknowledgements
I would like to acknowledge the guidance of Prof. K. K. Biswas (Department of
Computer Science, IIT Delhi), which was the corner-stone of this project;
without it this project would never have been possible. Thank you for your
wonderful support. I would also like to express my gratitude towards Prof. S. K.
Gupta and Prof. Saroj Kaushik for their guidance throughout the development of
the project. I gratefully acknowledge my debt to the other people who
assisted in the project in different ways.
Signed: ______________________
Avinash Kumar Pandey
2006CS50213
Indian Institute Of Technology, Delhi
New Delhi-110016 India
Date: ______________________
Contents
Certificate
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Definition
  1.3 History
    1.3.1 Rich transcription framework
  1.4 Applications
  1.5 Outline of the Work
  1.6 Chapter Outline of the Thesis
2 Speaker Diarization
  2.1 Introduction
  2.2 Speech activity detection
  2.3 Speaker segmentation
    2.3.1 Segmentation using silence
    2.3.2 Segmentation using divergence measures
    2.3.3 Segmentation using frame level audio classification
    2.3.4 Segmentation using direction of arrival
  2.4 Speaker clustering
  2.5 Speaker identification
3 Speech Activity Detection and Implementation
  3.1 Introduction
  3.2 Gaussian Mixture Models
  3.3 Our algorithm for SAD
    3.3.1 Observation
  3.4 Experiments
    3.4.1 Fan noise
    3.4.2 Silence
    3.4.3 Fast paced speech
    3.4.4 Moderately paced speech
    3.4.5 White noise
  3.5 Advantages and Drawbacks
4 Implementation of Text Independent Speaker Identification
  4.1 Introduction
  4.2 Speech Parameterization: Feature Vectors
    4.2.1 MFCC
  4.3 Statistical Modeling
    4.3.1 Vector Quantization
    4.3.2 Gaussian Mixture Models
  4.4 Bayesian Information Criterion
5 Experimental Results
6 Conclusion
Bibliography
List of Figures
Figure 1: Segmentation of an audio clip
Figure 2: Gaussian mixture models
Figure 3: Fan noise
Figure 4: Silence
Figure 5: Fast paced speech
Figure 6: Moderately paced speech
Figure 7: White noise
Figure 8: Training phase of a speaker identification system
Figure 9: Testing phase of a speaker identification system
Figure 10: General schematic to calculate PLP or MF-PLP features
Figure 11: Different modules in computation of cepstral features
Figure 12: Vector quantization
List of Tables
Table 1: Speech activity detection experiment results
Table 2: Speaker diarization experiment results
Chapter 1
Introduction
In this chapter we discuss where the problem of speaker diarization originated, which
problem domains it finds application in and how much work has already been done
in this area; we also briefly outline the several chapters of this thesis.
1.1 Motivation
Recording of speech, for several purposes, has long been in practice. There are many
reasons to record one's voice: educational purposes, archival purposes and
conserving a memory through the vicissitudes of time. Recording is more automatic
than scribbling down the dialogue, and it is often cheaper than video as well.
Across the globe, in countless archives, there exists a huge amount of audio data. We
organize our data, as in database tables, through certain keys; for audio clips, as such,
we have no key. The idea is to devise an organization for these databases in order to
make them easy to handle. One possible index for audio data is speaker
identity; this thought is the key motivating idea for speaker diarization.
1.2 Definition
In a given audio clip, the task of speaker diarization essentially addresses the question
of "who spoke when". The problem involves labeling a given audio file entirely with
speaker beginning and end times si and ei for all homogeneous single speaker segments.
If there are portions that correspond to non-speech, those have to be explicitly labeled
too. For example, a sample output for a one minute audio file could be: 0-3
seconds non-speech, 3-25 seconds Speaker 1, 25-37 seconds music, 37-60 seconds Speaker 2.
1.3 History
Historically, the problem was first formulated thus at the National Institute of Standards
and Technology (NIST) Rich Transcription framework, better known as the RT Framework,
convention of 1999. Since then, till 2007, several conventions were held on this problem.
Mainly two types of diarization problems were undertaken:
1) Broadcast news speaker diarization
2) Meeting or conference room speaker diarization
In the broadcast news framework, the recording setup is single modal. That is to say,
there is only one recording device, in which all the speakers take turns to speak.
The apparatus is simple, but the accuracy of the diarization system is
hampered on this account.
In the meeting room domain, audio clips are recorded across multiple distant
microphones whose locations are not disclosed. The diarization results from these
different clips are then combined to enhance the accuracy of the overall
diarization engine.
One aspect of multimodal scenarios leads to enhanced diarization results:
the TDOA parameter, the time delay of arrival. With different recording
devices, the distance of the speaker from each microphone is bound to change with a
speaker change, because two speakers will most probably not occupy the same physical
position. This information, the time difference across different recording devices, is a
very strong tool for speaker segmentation, that is, identifying speaker turn points.
The problem we have undertaken to solve is similar to the broadcast news
scenario, where different speakers take turns to speak into a single recording device.
The current state of the art in the speaker diarization regime has moved beyond audio.
The idea is to record not only audio, but also take visual cues from the speakers and
audiences to determine speaker status and change. Its impact has been conclusively shown
in "Multimodal Speaker Diarization of Real-World Meetings Using Compressed-Domain
Video Features", a paper by Friedland and Hung, 2010.
1.3.1 NIST Rich Transcription Framework
There is a set of related problems undertaken under the NIST Rich Transcription
framework, namely:
1) Large vocabulary continuous speech recognition (LVCSR)
2) Speaker diarization
3) Speech activity detection (SAD)
Let us now briefly discuss each of these problems and the success that has been
achieved in solving them.
Large Vocabulary Continuous Speech Recognition
This is the name for the most common and important problem in speech processing.
LVCSR is simply a technical name for the most general speech recognition system. Amateur
speech recognition engines place certain restrictions on the vocabulary, for
example that the speaker could speak only out of so many words already recognized by
the engine, or that the speaker should speak at a certain pace, which generally meant a
pace slower than the normal pace of speaking. LVCSR was meant to overcome all these
restrictions.
Speaker Diarization
The problem statement of speaker diarization has already been introduced; it will be
taken up in fair detail, both theoretical and implementational, as it is the subject of
this thesis.
Speech Activity Detection
Speech activity detection is a sub-problem in most speech processing applications,
and speaker diarization is no different. We too develop an algorithm for speech
activity detection. In chapter 3, we discuss our algorithm in detail: its experimental
performance, its advantages and its drawbacks.
1.4 Applications
Speaker diarization provides for multifarious applications in diverse domains. Some of
them follow.
1) Once an audio archive has been indexed by speaker identity, a user can
quickly browse the archive looking only at the speakers of his own
interest, rather than manually searching for the speaker he is interested in through
the entire file.
2) Speaker diarization also plays a vital role in automatic speech recognition. If we
do not know the speaker identity and are trying to convert speech to text, the
phonetic models we apply are rather generic; but once we know who exactly the
speaker is, we can migrate to a speaker specific phonetic model, which performs
better. The literature reports about a 10% improvement in automatic
speech recognition when the identity of the speaker is known in advance.
3) If one decides to resort to manual transcription in order to avoid the inherent
difficulties and inaccuracies of automatic speech recognition, even then a diary of
which speaker started when will come in handy.
1.5 Outline of the Work
We created a system which is capable of producing a diary of a given audio clip,
provided it has in its database training samples of all the speakers present in the
clip. Our implementation fundamentally consists of three main modules: a speech
activity detection module, a Bayesian Information Criterion module and a text
independent speaker identification module. The SAD module, given a speech segment,
decides whether it is speech or non-speech. The non-speech categories could be Gaussian
noise, background noise, music or, most commonly, silence. The SAD module
differentiates speech from these kinds of signals.
After this, the Bayesian Information Criterion module narrows down on a
given audio segment till it becomes reasonably assured that this particular segment
belongs to a single speaker.
Then, in the end, we have the core speaker identification engine based on MFCC feature
vectors. It is of the text independent type: it does not depend on the text uttered by the
speaker and can identify a speaker no matter what he speaks. This means our audio clip
does not have to contain a fixed set of words; our speakers can talk about anything and
we will still be able to produce a diary.
1.6 Chapter Outline of the Thesis
In the remaining chapters we first develop the idea of speaker diarization and then
discuss the details associated with the different modules chapter by chapter. In
chapter 2 we discuss, one by one, the theoretical ideas behind, the advances made in
and the techniques popular in each of the areas of speaker segmentation, speaker
clustering and speaker identification. In chapter 3 we discuss our implementation of the
speech activity detection algorithm; in chapter 4 we discuss our implementation and the
general theory of text independent speaker identification, along with a brief treatment
of the Bayesian Information Criterion. In chapter 5 we furnish results about the
performance of our speaker diarization system and in chapter 6 we conclude the thesis.
Chapter 2
Speaker Diarization
2.1 Introduction
Speaker diarization, as Anguera and Wooters contended in "Robust Speaker
Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System", can, for the
twin purposes of abstraction and modularity, be divided into four stages:
1) Speech activity detection
2) Speaker segmentation
3) Speaker clustering
4) Speaker identification
2.2 Speech activity detection
Speech activity detection is the first step in almost every speech processing
application, the primary reason being the highly computationally intensive nature of the
processing methods: one wouldn't want to waste one's resources computing on parts of
the audio clip that are not speech. Suppose there is a meeting room in our computer
science department, and a far field microphone installed in the room is always alert,
that is, it always keeps recording; we do not want to set things up every time we
enter the meeting room for discussions. After a certain period of time, say 24 hours,
we take the output of the recorder and want to filter out the portions that are
non-speech. They could be any kind of noise, music or the sounds of people passing by
in the gallery. This task is accomplished by the speech activity detection module. The
problem of speech activity detection also goes by the name of voice activity detection.
There are various possible ways to determine whether an incoming clip is speech or not.
One of them is cepstral analysis, as investigated by Haigh and Mason in "A
Voice Activity Detector Based on Cepstral Analysis", 1993.
Later in this thesis we will discuss in profound detail what we mean by cepstral
features; for now it is sufficient to know these are values extracted from a stationary
portion of the audio clip, perhaps a 10 second window.
The idea proposed by Haigh and Mason is to detect the speech end-points, that is, the
points where a speech portion begins and ends, by cepstral analysis. This approach is
strictly based on static explicit modeling of speech and non-speech: we have to train
a binary classifier to differentiate speech from non-speech based on feature vectors
extracted from the clip, which the authors suggested should be cepstral features. As the
reader will see later in the thesis, these are the same feature vectors we use for our
text independent speaker identification engine.
The model used could be anything from a Gaussian mixture model to a support
vector machine.
This is one way to tell speech from non-speech. The earliest algorithms for voice
activity detection used two common speech parameters to decide whether there is voice
in a speech frame:
1) Short term energy
2) Zero crossing rate
Short term energy in an audio frame can be read off the log energy coefficient
in the MFCC feature vector set; it is the 0th coefficient of that set.
The zero-crossing rate is the rate of sign changes along a signal, i.e., the rate at which
the signal changes from positive to negative or back. This feature has been used heavily
in both speech recognition and music information retrieval and is defined formally as

$\mathrm{zcr} = \frac{1}{T-1}\sum_{t=1}^{T-1}\mathbb{1}\{s_t\,s_{t-1} < 0\}$

where $s$ is a signal of length $T$ and the indicator function $\mathbb{1}\{A\}$ is 1 if its
argument $A$ is true and 0 otherwise.
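As a minimal illustration of these two parameters, the following sketch (Python with NumPy assumed; the 16 kHz sampling rate and the common 20 ms frame / 10 ms hop framing are our own assumptions, not values fixed by the thesis) computes both quantities per frame.

    import numpy as np

    def frame_signal(s, frame_len=320, hop=160):
        # Split a 1-D signal into overlapping frames (20 ms / 10 ms at 16 kHz).
        n = 1 + max(0, (len(s) - frame_len) // hop)
        return np.stack([s[i * hop: i * hop + frame_len] for i in range(n)])

    def short_term_log_energy(frames):
        # Log of the summed squared samples per frame (cf. the MFCC 0th coefficient).
        return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

    def zero_crossing_rate(frames):
        # Fraction of adjacent sample pairs whose signs differ.
        signs = np.signbit(frames)
        return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)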
But, as can be understood, the use of these two parameters can still lead us to error.
There can be cases where there is high short term energy in a frame that is still not
speech, for example Gaussian noise with a probability distribution such that the
intensity peaks every few seconds.
The same can be said for zero crossing rates: though uncommon, there can be noises
where the change from positive to negative happens very frequently and so deceives our
system. To overcome these difficulties we have come up with a new algorithm for speech
activity detection, which is described in detail in chapter 3.
2.3 Speaker segmentation
Given an audio clip, speaker segmentation is the task of finding the speaker turn points.
The segmenter is supposed to divide the audio clip into non-overlapping segments such
that each segment contains speech from only one speaker. Non-speech segments, we
assume, have already been filtered out by our speech activity detection module. The idea
is better illustrated through the following figure.
Figure 1: Speaker segmentation of an audio clip
The figure shown above is the amplitude graph of an audio clip in which the number of
speakers exceeds one. We begin with one speaker putting forth his point; in between he
is interrupted by another speaker and there is a region of overlapping speech from which
the second speaker takes over. This overlapping speech region is an instance of a turn
point. There is a little abuse of the term "point" here, as we are calling a whole
overlap region a point. We have to identify the positions in the audio clip where the
shifting over of speakers takes place; the stretches between them, from now on, will be
called homogeneous speaker segments.
There are various ways to go about solving the problem of speaker segmentation:
• Segmentation using silence.
• Segmentation using divergence measures.
• Segmentation (and clustering) by performing frame level audio classification.
• Segmentation (and clustering) using a HMM decoder.
• Segmentation (and clustering) using direction of arrival.
The last three methods are unified segmentation and clustering methods. They are also
called internal segmentation methods at times.
The first two methods are called external segmentation methods, because the
ascertaining of identity follows the merging of segments into clusters, as opposed to internal
segmentation methods, where for every frame you first find out who the speaker is; there
you have essentially found the cluster without bothering about the segment at all.
We will now discuss each of these methods briefly.
2.3.1 Segmentation Using Silence
Segmentation using silence is a common sense method based on the
assumption that whenever a speaker change happens there must be a portion of silence
in-between. This, however, cannot be said to hold true in all environments. For example,
in a parliament, speaker changes almost inevitably happen by one speaker forcing his
entry into another speaker's speech: a speaker change did happen, but there was no
intervening silence. Hence we run the risk of losing some speaker turn points. Besides,
this method is plagued by another difficulty. If we observe
the amplitude graph of any speech file closely we will notice that when a speaker speaks,
he doesn't just keep speaking all the time; he stops frequently, or the tonality of his
voice drops to silence, so even a continuous speaker segment contains many
intermediate points of silence. What this means for our purpose is that while we miss
some true speaker turn points, we generate too many unnecessary segments as well.
The greater the number of segments, the greater the difficulty in clustering them. So we
can see why segmenting speech using silence is not such a good idea, mainly
for two reasons:
1) It misses some true speaker turn points with overlapping speech.
2) It generates a much higher number of segments, i.e., false positives.
2.3.2 Segmentation Using Divergence Measures
Delacourt and Wellekens showed in "DISTBIC: A speaker-based segmentation for audio
data indexing" that using divergence measures for speaker segmentation can be useful.
They used the Bayesian Information Criterion as the divergence measure.
Segmentation using divergence measures is the state of the art. Let us first discuss
what a divergence measure is and which divergence measures one can use.
A divergence measure is fundamentally a tool to determine how similar or dissimilar
two things are; in our case the two things could be two successive audio frames or two
successive windows of audio frames. Two famous divergence measures are the
Kullback-Leibler divergence and the Bayesian Information Criterion. We have used the
Bayesian Information Criterion in our implementation, so we discuss it in detail in
chapter 4. Right now we will discuss the Kullback-Leibler divergence.
In probability theory and information theory, the Kullback-Leibler divergence (also
information divergence, information gain, relative entropy, or KLIC) is a non-
symmetric measure of the difference between two probability distributions P and Q. KL
measures the expected number of extra bits required to code samples from P when using
a code based on Q, rather than a code based on P. Typically P represents the "true"
distribution of data, observations, or a precisely calculated theoretical distribution,
while Q represents a theory, model, description, or approximation of P.
Although it is often intuited as a distance metric, the KL divergence is not a true
metric; for example, it is not symmetric: the KL from P to Q is generally not the same
as the KL from Q to P.
For probability distributions P and Q of a discrete random variable, their KL
divergence is defined to be

$D_{\mathrm{KL}}(P\|Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$

In words, it is the average of the logarithmic difference between the probabilities P and
Q, where the average is taken using the probabilities P. The KL divergence is only
defined if P and Q both sum to 1 and if Q(i) > 0 for every i such that P(i) > 0. If the
quantity 0 log 0 appears in the formula, it is interpreted as zero.
Now that we know what a divergence measure is, we can proceed with our discussion of
segmentation using divergence measures. We consider two windows and calculate
their similarity or dissimilarity index with our chosen divergence measure. If the
dissimilarity is above a particular threshold, determined by empirical experimentation,
we call the point between them a speaker turn point, as it flanks the audio with two
windows which are different from each other.
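The window comparison just described can be sketched as follows: fit a single Gaussian to each of two adjacent windows of feature vectors and compare them with the closed-form Gaussian KL divergence, symmetrized. The window length and threshold here are illustrative placeholders to be tuned empirically, not values taken from the thesis.

    import numpy as np

    def gauss_kl(mu0, cov0, mu1, cov1):
        # Closed-form KL divergence between two multivariate Gaussians.
        d = len(mu0)
        inv1 = np.linalg.inv(cov1)
        diff = mu1 - mu0
        return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                      + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

    def turn_points(features, win=100, threshold=20.0):
        # Slide two adjacent windows over the (frames x dims) feature sequence
        # and flag a turn wherever the symmetric divergence exceeds the threshold.
        d = features.shape[1]
        reg = 1e-6 * np.eye(d)              # guards against singular covariances
        turns = []
        for t in range(win, len(features) - win):
            left, right = features[t - win:t], features[t:t + win]
            muL, covL = left.mean(axis=0), np.cov(left.T) + reg
            muR, covR = right.mean(axis=0), np.cov(right.T) + reg
            div = gauss_kl(muL, covL, muR, covR) + gauss_kl(muR, covR, muL, covL)
            if div > threshold:
                turns.append(t)
        return turns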
2.3.3 Segmentation using Frame level audio classification
This is an example of an internal segmentation strategy. Frame level audio
classification means that for every frame of the audio we determine what kind of audio
data it is: is it speech, is it music, and if it is speech, which speaker in our database
does it come from? This is more or less the strategy we follow in our implementation,
except that instead of looking at individual frames we look at a window of frames
which has been determined to contain a single speaker's speech. During our
experimentation with various statistical models we observed that looking at individual
frames rarely leads to the right answers; we have to look at a certain length of the
audio, look at a collection of frames and see which speaker the majority of points
belong to. The
criterion for belonging may be proximity to a particular codebook code vector or a
log-likelihood computation, as in Gaussian mixture models.
2.3.4 Segmentation using Direction of Arrival (DOA)
In a multimodal corpus, where we have multiple distant microphones recording the
speech, two parameters assume overwhelming significance in speaker diarization and
speaker identification: direction of arrival and time delay of arrival. The process is
discussed at length in "Real-time monitoring of participants' interaction in a meeting
using audio-visual sensors" by Busso and Narayanan, 2008.
The technique is called acoustic source localization and has been used widely in RADAR
and SONAR. The speech is recorded in a smart room containing a microphone array,
which is used for acoustic source localization. The approach is based on
TDOA, the time delay of arrival at the various microphones: the geometric inference of
the source location is calculated from the TDOA. First, pair-wise delays are estimated
between all the microphones. These delays are subsequently projected as angles into a
single axis system.
2.4 Speaker clustering
Speaker clustering is the next step after speaker segmentation. Let us rewind a little.
We were first given an audio clip from which, using the speech activity detection
module, we filtered out the non-speech parts. Next, we used our speaker segmentation
tool to generate homogeneous speaker segments. Now, as one can observe, the segments
belonging to a single speaker can be strewn across the clip; we have to label them as
coming from a single source. So we proceed to speaker clustering. We build a similarity
matrix over the segments in the segment list, using a distance
metric to calculate the distance between segments. The segments which match best
are merged, and incrementally the identities of the clusters change. If we already
know the number of speakers present, we can choose to keep as many clusters as the
number of speakers. Otherwise, we choose a stopping criterion, decided
through experimentation. When that stopping criterion is met, we stop merging
segments, and the number of clusters present at that point is taken to be the number of
speakers in our audio clip. A sketch of this merging loop is given after the following
list of approaches.
Popular approaches to speaker clustering:
• Clustering using vector quantization.
• Clustering using iterative model training and classification.
• Clustering in a hierarchical manner using divergence measures.
• Clustering and segmentation using a HMM decoder.
• Clustering and segmentation using direction of arrival.
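A minimal sketch of the bottom-up merging loop described above, assuming Python with NumPy; `distance` stands for whichever divergence measure is chosen, and the stopping threshold is an empirical assumption.

    import numpy as np

    def cluster_segments(segments, distance, stop_threshold):
        # Greedy agglomerative clustering: repeatedly merge the two closest
        # clusters until the smallest remaining distance exceeds the
        # stopping criterion. Each cluster is a (frames x dims) matrix.
        clusters = list(segments)
        while len(clusters) > 1:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = distance(clusters[i], clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            d, i, j = best
            if d > stop_threshold:          # stopping criterion met
                break
            merged = np.vstack([clusters[i], clusters[j]])
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
        return clusters                     # one cluster per inferred speaker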
2.5 Speaker Identification
The last step is speaker identification. First the training part:
1) We divided every second of the training clip into approximately 173 frames and
for each of these frames we calculated MFCC features.
2) Each MFCC feature vector consisted of 13 dimensions.
3) We used the set of these feature vectors to generate a codebook for every speaker
using the concept of vector quantization.
4) The codebook of a speaker consisted of the 16 feature vectors which best modeled
the set of obtained feature vectors for the clip.
Now we move on to the testing part:
1) We again divided the test audio clip into frames, 173 per second, and computed
the feature vectors per frame.
2) We matched each feature vector thus computed with the individual codebooks.
3) The codebook with the maximum matches is declared to be the one corresponding
to the identity of our test speaker. A sketch of this training and matching follows.
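A minimal sketch of the procedure just listed, assuming Python with NumPy and scikit-learn; the k-means routine stands in for whatever codebook training was actually used, while the 16 code vectors and 13-dimensional MFCC rows follow the description above.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_codebook(train_features, size=16):
        # The 16 k-means centroids of a speaker's MFCC vectors serve as
        # that speaker's codebook.
        km = KMeans(n_clusters=size, n_init=10, random_state=0).fit(train_features)
        return km.cluster_centers_

    def identify(test_features, codebooks):
        # For every test frame, find the distance to the nearest code vector in
        # each speaker's codebook (Euclidean metric); the speaker winning the
        # most frames is declared the identity.
        names = list(codebooks)
        per_frame_best = []
        for name in names:
            cb = codebooks[name]                                     # (16, dims)
            d = np.linalg.norm(test_features[:, None, :] - cb[None, :, :], axis=2)
            per_frame_best.append(d.min(axis=1))                     # (frames,)
        winners = np.argmin(np.stack(per_frame_best), axis=0)        # best per frame
        return names[np.bincount(winners, minlength=len(names)).argmax()]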
Chapter 3
Speech Activity Detection and
Implementation
3.1 Introduction
Speech activity detection involves separating speech from:
1) Silence
2) Background noise in different ambiences
3) White Gaussian noise
4) Music
5) Crowd noise
The majority of algorithms used for speech activity detection fall into two categories:
1. The noise level is estimated after looking at the entire file, and anything over and
above a particular decibel level is called speech.
2. According to the ambience, speech and non-speech training sets are taken, models
are trained on them, and these models are then used for further classification.
Both kinds of algorithms suffer from serious drawbacks. The algorithms in
the first category fail to discriminate speech from non-speech when the noise is
variable, since they assume the same noise function throughout the audio. The
algorithms in the second category fail to perform in an unfamiliar environment, and
they need a lot of training data.
The approach we propose here overcomes both these hurdles, though it has drawbacks of
its own. We want a speech detection algorithm which:
• Does not require any training at all.
• Is, nevertheless, able to grasp the difference between speech and non-speech in a
dynamically changing ambience.
Since we use Gaussian mixture models in our approach, a brief review of what Gaussian
mixture models really are is called for.
3.2 Gaussian Mixture Models
Many times our data distribution can't be accurately modeled by a
single multivariate distribution; the sample point set might come from two different
Gaussian distributions. In that case, rather than modeling the dataset with a single
multivariate distribution, it is better to model the data as a mixture of two Gaussians,
where each Gaussian accounts for a certain fraction of the point set, called the
mixing proportion of that Gaussian in the Gaussian mixture model. The mean and
covariance matrix of each Gaussian can be completely independent, and we can
constrain them as the case might be; in our case we let them be completely general.
Figure 2: Gaussian Mixture Models
The diagram above shows a case where two individual Gaussians are present with
different means and covariances.
3.3 Our algorithm for Speech activity detection
The algorithm consists of three simple steps:
• Divide the entire audio clip into fixed sized intervals of 15 seconds.
• Extract MFCC features from each of these segments.
• Cluster the feature vectors obtained for each interval individually, using a
Gaussian mixture model with two components.
3.3.1 Observation
In the case of silence, repeated beats, fan noise or white Gaussian noise, the
preponderance of points belongs to one of the clusters: the mixing proportion remains
highly skewed. For speech frames, the mixing proportion stays even. This attribute of
speech frames becomes the basis of our classification strategy.
We want to narrow down the number of segments which could be speech. It is a negative
classification strategy: whatever is clustered evenly remains a candidate for speech. We
can filter out silence, instrument noises, repeated beats and white noise. This
effectively narrows down our search space by a significant factor. A sketch of this
filter follows.
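A minimal sketch of this negative classification strategy, assuming Python with scikit-learn; the 0.8 skew threshold is our own illustrative assumption, to be tuned empirically.

    from sklearn.mixture import GaussianMixture

    def speech_candidates(mfcc_per_segment, skew_threshold=0.8):
        # For each 15 s segment's (frames x dims) MFCC matrix, fit a
        # 2-component GMM; keep the segment as a speech candidate only if
        # the mixing proportions are roughly even.
        candidates = []
        for idx, feats in enumerate(mfcc_per_segment):
            gmm = GaussianMixture(n_components=2, covariance_type='full',
                                  max_iter=100).fit(feats)
            if gmm.weights_.max() < skew_threshold:   # even mixing -> possibly speech
                candidates.append(idx)
        return candidates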
3.4 Experiments
3.4.1 Experiment 1: Fan noise
[Figure 3: Fan noise]
3.4.2 Experiment 2: Silence
[Figure 4: Silence]
3.4.3 Experiment 3: Fast paced speech
[Figure 5: Fast paced speech]
3.4.4 Experiment 4: Moderately paced speech
[Figure 6: Moderately paced speech]
18
3.4.5 Experiment 5: White noise
Figure 7: White noise
• Software generated white noise.
• Might appear in our audio from a malfunction in the recording device or from
network channel noise.
• The BIC value of the clustering always comes out negative.
• The algorithm often fails to converge for two components within 100 iterations.
• When it does converge, the mixing proportion is highly skewed.
Summary of experimental results
[Table 1: Speech activity detection experiment results]
3.5 Advantages and Drawbacks
Advantages
• Does not require any training at all, so we can apply it even in environments
where training based approaches would have failed for lack of training data.
• Since it treats every segment independently, it can adjust itself dynamically to
changing ambience properties.
Drawbacks
• The algorithm is computationally intensive: we have to calculate the feature
vectors, cluster them and find the BIC for every single segment in the file.
• It works well for relatively long periods of silence, but is not as robust for
shorter segments.
Chapter 4
Implementation of Text Independent
Speaker Identification
4.1 Introduction
Our approach to handling the three modules other than speech activity detection
can be called internal segmentation and clustering.
We do not first segment and then cluster the audio clip. Instead, we take a portion of
the clip and try to identify which speaker it belongs to; if there is confusion, that
is, if the statistics revealed by our testing algorithms are not conclusive, we narrow
it down further. We thus determine a minimum time span within which a speaker change
is not happening, and within it we establish the identity of the speaker. We can also
take this time to be 1 second and then determine individually, for every 1 second
segment, which speaker it belongs to.
For doing so, we have two tools at our disposal:
1) A text independent speaker identification engine, and
2) The Bayesian Information Criterion (BIC).
Speaker identification systems are primarily of two types, text independent and text
dependent. The names are fairly self-explanatory: text independent speaker
identification systems work for any test utterance, while text dependent speaker
identification systems work only with a fixed test utterance. The two systems classify
their training data based on widely different sets of speech parameters, resulting in
different accuracies under similar circumstances. Accordingly, they find applications
in different problem domains.
An abstract layout of a speaker identification/verification engine, consisting of
various modules and stages, is shown in the following figures:
Figure 8 shows the training phase, and Figure 9 the testing phase.
The training phase, assuming a most general speaker identification/verification system,
consists of two fundamental modules: a speech parameterization module and a
statistical modeling module.
The raw speech data received is first processed by the speech parameterization module
to extract some useful characteristic information. These information parameters
are collected, usually, at different points in the time domain and the frequency
domain, to mark the variations in the speech that are characteristic of a particular
speaker.
The speech parameters thus obtained are then fitted to a statistical model of choice
using the statistical modeling module, in order to calculate the defining parameters
of the model for that particular speaker, for example the mean and variance in a
single Gaussian model. The choice of statistical model can vary; we have
experimented with two models, codebooks (vector quantization) and Gaussian mixture
models. The state of the art systems in speaker identification employ Gaussian mixture
models with great success.
The testing phase is preceded by the collection of training samples of all speakers in
our universe; those training samples are converted into corresponding speaker models,
and if the statistical modeling demands so, a universal model is formed from the
training samples. Now, when we are given a test utterance, we calculate the speech
parameters using the speech parameterization module and then use a decision scoring
module to decide which of the available speaker models these parameters best match.
The choice of decision scoring module depends on the statistical model we
use. When we use vector quantization, it is simply the number of points corresponding
to each codebook: the codebook which has the maximum points closest to the obtained
parameter set is our output identity, the distance metric being the simple Euclidean
distance. When we use Gaussian mixture models as our statistical modeling tool,
log-likelihood becomes the decision score: we calculate the log-likelihood of each
feature vector obtained from the test utterance and see which speaker model it best
corresponds to.
So, our discussion so far has established that there are fundamentally three variables
in a speaker identification system:
1) Speech parameterization
2) Statistical modeling
3) Decision scoring
We will take them all one by one now, first giving the theoretical details, and
then the experimental results.
Figure 8: Different modules in the training phase of a speaker identification system.
[Diagram: input speech -> speech parameterization module -> speech parameters ->
statistical modeling -> speaker model]
Figure 9: Different modules in the testing phase of a speaker identification system.
[Diagram: speech data from a given speaker -> speech parameterization module ->
speech parameters -> decision scoring against the speaker models from the database ->
identity]
4.2 Speech parameterization: Feature Vectors
The speech parameterization module calculates or extracts useful speech parameters
from a raw audio clip. The popular term for the parameters thus obtained is feature
vectors. The most widely used feature vectors come from a particular class called
cepstral features. We will discuss briefly what exactly we mean by cepstral features
and then give the specifics of the feature vector set we are using.
Cepstral features based on filter-banks
The entire process of calculating filter-bank based cepstral features is shown
schematically, module by module, in Figures 10 and 11.
Figure 10: General schematic for the calculation of PLP or MF-PLP features.
[Diagram: input -> pre-emphasis -> windowing -> FFT -> filter-bank -> 20*log ->
cepstral transform]
Figure 11: Different modules employed in the calculation of cepstral features (MFCC).
Now we will discuss each of these modules and its relevance to our work one by one.
1) Pre-emphasis: Emphasis is laid on a special section of the speech signal,
namely the higher frequency range of the spectrum. It is believed that the nature of
speech production attenuates the higher frequencies, inducing a need to pre-emphasize
the signal to make up for the loss in the production process. In the cases we studied,
hardly any benefit accrued from pre-emphasis, so we have done away with this module.
2) Windowing: This is a crucial phase in the calculation of feature vectors. We
make the so-called stationarity assumption, which means that if we consider a window
of the speech signal small enough, there won't be any variation in the values of the
feature vectors across that small window.
So we select a window beginning at the start of the speech signal, in our
case of 20 ms; we then shift the starting position of the moving window by 10 ms
and consider the next window of length 20 ms, which means every two
consecutive windows have 10 ms in common. The choice of window is again dependent on
experimental evidence; we went with the triangular window. Other options could have
been the Hamming or Hanning windows.
3) FFT: The next step is simply calculating the fast Fourier transform of the
windowed frame obtained after windowing and possibly pre-emphasis.
4) Filter-bank: The spectrum obtained after applying the FFT still contains a lot of
unnecessary details and fluctuations, things we are not interested in. So, in order to
obtain the features we are really interested in, we multiply the spectrum with a
filter-bank. A filter-bank is nothing but a collection of band pass frequency filters.
In essence we filter out all the unnecessary information and keep only the frequencies
that concern us. The knowledge of these particular frequencies comes from our
knowledge of the process of speech production. The
spectral feature set MFCC receives its name from its filter-bank, which is called
the mel scale frequency filter-bank. The mel scale is an auditory scale
similar to the frequency scale of the human ear.
5) Discrete cosine transform: An additional transform, which in generic
terms we have called the cepstral transform, is applied; in our case it is the discrete
cosine transform, which, applied to the result of the filter-bank operations, yields
the final cepstral feature vectors that are of interest to us.
Two other important features are the log energy and the delta of the log energy. MFCC
is a 13 dimensional feature set whose first coefficient is the log energy; we
incorporated the difference of the log energy as well in our feature set, which
resulted in significantly improved recognition rates. In effect we have incorporated
the deltas corresponding to all 13 feature vector dimensions, as sketched below.
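A small sketch of how such deltas can be computed from a (frames x 13) MFCC matrix, using a simple symmetric difference (one of several common delta formulas, not necessarily the exact one used in this work):

    import numpy as np

    def deltas(features):
        # First-order differentials, approximated by the symmetric
        # difference of the neighbouring frames.
        padded = np.pad(features, ((1, 1), (0, 0)), mode='edge')
        return (padded[2:] - padded[:-2]) / 2.0

    # mfcc: (frames, 13)  ->  full feature set with deltas: (frames, 26)
    # full = np.hstack([mfcc, deltas(mfcc)])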
There are a number of popular feature vector sets that can be extracted from an audio
clip; different feature vector sets capture different properties of the clip. The most
widely used ones are:
1) MFCC, mel frequency cepstral coefficients
2) RASTA-PLP
3) LPC, linear predictive coding
The feature vector set which we are using is MFCC together with its delta set. The
delta set essentially means the differentials of the feature vectors, that is, it
captures how the values of the MFCC feature vectors vary over time. This information,
as it turns out, is also vital to characterizing a speaker.
4.2.1 MFCC
The mel-frequency cepstrum (MFC) is a representation of the short-term power
spectrum of a sound, based on a linear cosine transform of a log power spectrum on a
nonlinear mel scale of frequency.
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up
an MFC. They are derived from a type of cepstral representation of the audio clip (a
nonlinear "spectrum of a spectrum"). The difference between the cepstrum and the
mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on
the mel scale, which approximates the human auditory system's response more closely
than the linearly spaced frequency bands used in the normal cepstrum. This frequency
warping can allow for a better representation of sound.
MFCCs are derived as follows:
1. Take the Fourier transform of a windowed excerpt of the signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using
triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a
signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
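For illustration, these steps are implemented by standard libraries; a minimal sketch with librosa follows. The file name is hypothetical, and the 16 kHz rate with 20 ms/10 ms windows are our assumptions, not the exact settings of this thesis.

    import numpy as np
    import librosa

    # Hypothetical file; 13 coefficients as used in this thesis.
    y, sr = librosa.load("speaker01_train.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=320, hop_length=160)   # 20 ms / 10 ms
    mfcc = mfcc.T                                  # (frames, 13), one row per frame
    delta = librosa.feature.delta(mfcc, axis=0)    # differentials over time
    features = np.hstack([mfcc, delta])            # (frames, 26)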
4.3 Statistical Modeling
Now that we have our 23 dimensional feature vectors, 173 of them for every second of
the audio clip, we need to fit this data to a statistical model of choice. We want a
concise representation for making sense of this data. The two most widely used
statistical models are:
1) Vector quantization, or codebook quantization, and
2) Gaussian mixture models.
We will discuss the theoretical details of both one by one and then detail the
experimental results obtained with each.
4.3.1 Vector Quantization
The algorithm is deemed to have converged when the assignments no longer change.
Voronoi Diagrams
Without going into the details, a Voronoi diagram consists of Voronoi cells. Given a
set of points, the whole feature space is decomposed into a number of sections or
cells, one corresponding to each point. The points lying in the Voronoi cell of a point
are those which are closer to that particular point than to any other point in the
given set.
Figure 12: Vector quantization. Image source: http://www.data-compression.com/vq-2D.gif
4.3.2 Gaussian Mixture Models
At times it happens that the representation provided by vector quantization (codebook
quantization) is not adequate for modeling the variations in the tonalities of a
particular speaker, so it seems like a good idea to allow multiple underlying
representations with different probabilities to model a particular speaker. This can be
achieved handsomely by Gaussian mixture models, which are the state of the art for
speaker recognition.
In statistics, a mixture model is a probabilistic model for representing the presence
of sub-populations within an overall population, without requiring that an observed
data set identify the sub-population to which an individual observation belongs.
Formally, a mixture model corresponds to the mixture distribution that represents the
probability distribution of observations in the overall population.
However, while problems associated with "mixture distributions" relate to deriving the
properties of the overall population from those of the sub-populations, "mixture
models" are used to make statistical inferences about the properties of the
sub-populations given only observations on the pooled population, without
sub-population identity information.
Some ways of implementing mixture models involve steps that do attribute postulated
sub-population identities to individual observations (or weights towards such
sub-populations), in which case these can be regarded as types of unsupervised learning
or clustering procedures. However, not all inference procedures involve such steps.
The structure of a general mixture model can be understood as follows: a typical
finite-dimensional mixture model is a hierarchical model consisting of the following
components:
• N random variables corresponding to observations, each assumed to be distributed
according to a mixture of K components, with each component belonging to the
same parametric family of distributions but with different parameters.
• N corresponding random latent variables specifying the identity of the mixture
component of each observation, each distributed according to a K-dimensional
categorical distribution.
• A set of K mixture weights, each of which is a probability (a real number between
0 and 1), all of which sum to 1.
• A set of K parameters, each specifying the parameters of the corresponding
mixture component. In many cases, each "parameter" is actually a set of
parameters. For example, observations distributed according to a mixture of
one-dimensional Gaussian distributions will have a mean and variance for each
component, while observations distributed according to a mixture of V-dimensional
categorical distributions (e.g., when each observation is a word from a vocabulary
of size V) will have a vector of V probabilities, collectively summing to 1.
The general mixture model can then easily be converted into a Gaussian mixture model
by taking each component to be Gaussian, with a mean and covariance as its parameters.
Parameter Estimation: Expectation Maximization
We have used the expectation maximization (EM) algorithm for parameter estimation, or
mixture decomposition, in Gaussian mixture models.
The expectation step
With initial guesses for the parameters of our mixture model, the "partial membership"
of each data point in each constituent distribution is computed by calculating
expectation
values for the membership variables of each data point. That is, for each data point
x_j and distribution Y_i, the membership value y_{i,j} is:

$y_{i,j} = \frac{a_i\,\mathcal{N}(x_j;\,\mu_i,\Sigma_i)}{\sum_{k=1}^{K} a_k\,\mathcal{N}(x_j;\,\mu_k,\Sigma_k)}$
The maximization step
With expectation values in hand for group membership, plug-in estimates are recomputed
for the distribution parameters.
The mixing coefficients a_i are the means of the membership values over the N data
points:

$a_i = \frac{1}{N}\sum_{j=1}^{N} y_{i,j}$

The component model parameters \theta_i are also calculated by expectation maximization
using the data points x_j, weighted by the membership values. For example, if \theta_i
is a mean \mu_i:

$\mu_i = \frac{\sum_{j} y_{i,j}\,x_j}{\sum_{j} y_{i,j}}$

With new estimates for a_i and the \theta_i's, the expectation step is repeated to
recompute new membership values. The entire procedure is repeated until the model
parameters converge.
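The whole loop in code, as a minimal sketch assuming Python with NumPy and SciPy; the full covariances carry a small regularizer for numerical stability, and this is an illustration rather than the thesis's exact implementation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        # Plain EM for a K-component Gaussian mixture on data X (N x d).
        rng = np.random.default_rng(seed)
        N, d = X.shape
        a = np.full(K, 1.0 / K)                        # mixing coefficients
        mu = X[rng.choice(N, size=K, replace=False)]   # initial means
        cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
        for _ in range(n_iter):
            # E-step: membership y[i, j] of point x_j in component i.
            y = np.stack([a[i] * multivariate_normal.pdf(X, mu[i], cov[i])
                          for i in range(K)])
            y /= y.sum(axis=0, keepdims=True)
            # M-step: re-estimate parameters from the weighted points.
            Nk = y.sum(axis=1)
            a = Nk / N
            mu = (y @ X) / Nk[:, None]
            for i in range(K):
                diff = X - mu[i]
                cov[i] = (y[i, :, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        return a, mu, cov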
4.4 Bayesian Information Criterion (BIC)
When going through a window of the given audio clip, it becomes necessary to
determine whether the window belongs to one speaker or whether there exists a
segmentation point in between; this we can achieve using the Bayesian Information
Criterion.
The metric is an indicator of the acoustic dissimilarity between two sub-windows.
The question we fundamentally ask is whether the window is better modeled by one
speaker alone or whether more speakers give it a better representation; models with a
higher number of independent parameters are penalized using a weight lambda.
Given that a model M is described by the statistical distribution theta, the BIC for a
window W can be defined as

$\mathrm{BIC}(M) = \log L(X_W;\,\theta) - \lambda\,\frac{D}{2}\,\ln N_W$

1) X_W is the series of audio feature vectors captured in the window W, and N_W is
their number.
2) D is the number of independent parameters present in theta.
3) The second term is the penalty term and penalizes a model for its complexity.
4) The weight lambda can be adjusted.
5) The model with the higher BIC value is to be chosen.
BIC in Segmentation

$\Delta\mathrm{BIC}(t) = \mathrm{BIC}(M_1) - \mathrm{BIC}(M_0)$

• Two models, M0 and M1, are defined.
• Model M0 represents the scenario where the candidate time t is not a turn point, so
the left and right sub-windows belong to a common distribution theta(W).
• Model M1 represents the scenario where t is a turn point; then the left and right
sub-windows belong to different distributions theta(L) and theta(R).
• It is assumed that the feature vectors follow Gaussian distributions, in which case

$\Delta\mathrm{BIC}(t) = \frac{N_W}{2}\ln|\Sigma_W| - \frac{N_L}{2}\ln|\Sigma_L| - \frac{N_R}{2}\ln|\Sigma_R| - \lambda\,\frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\ln N_W$

where d is the feature dimension and a positive value favours a turn point at t. A
computational sketch of this quantity follows.
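A minimal sketch for a candidate turn point inside a window of feature vectors (Python with NumPy assumed; the small regularizer, our own addition, guards against singular covariances):

    import numpy as np

    def delta_bic(window, t, lam=1.0):
        # Delta-BIC for splitting a (frames x dims) window at frame t, with
        # single full-covariance Gaussians for the whole window W and the two
        # halves L and R. A positive value favours a speaker turn at t.
        N, d = window.shape
        left, right = window[:t], window[t:]

        def logdet(X):
            return np.log(np.linalg.det(np.cov(X.T) + 1e-6 * np.eye(d)))

        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
        return (0.5 * N * logdet(window)
                - 0.5 * len(left) * logdet(left)
                - 0.5 * len(right) * logdet(right)
                - lam * penalty)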
Chapter 5
Experimental Results
5.1 Datasets and Objective of the Experiments
In the previous three chapters we discussed how we went about implementing our
speaker diarization system.
The primary objective of the experiments was to see whether the performance of our
speaker diarization system eroded as the number of speakers in a conversation
increased, all other parameters being kept the same. To this end, we recorded speech
samples from six different speakers, who were asked to read excerpts from Friedrich
Wilhelm Nietzsche's Also sprach Zarathustra. Five training samples of 30 to 45 seconds
each were taken from each speaker; then ten testing samples were also taken. From this
data, conversations were synthesized: we take a testing sample from one of the
speakers, insert some silence before or after it, then stuff in speech samples from
other speakers, and keep doing so until satisfied with the length and other attributes
of the conversation clip.
The results of our experiments with two to six speakers are summarized in the
following table.
Table 2: Speaker diarization experiment results

#Speakers | Codebook accuracy | GMM accuracy
2         | 85.4%             | 91.3%
3         | 81.3%             | 86.8%
4         | 74.5%             | 78%
5         | 73.3%             | 76.5%
6         | 71.2%             | 74%
Chapter 6
Conclusion
The experimental results clearly demonstrate that Gaussian mixture models are much
more robust than vector quantization (codebook quantization) when it comes to
maintaining performance in the wake of an increasing number of speakers in a
conversation. This can be attributed to the fact that a speaker's glottal cords behave
differently under different conditions, such as uttering different categories of
phonetic sounds, so we need a model which accounts for the sub-populations within the
feature point population generated by a particular speaker, a mixture model kind of
representation, which is adequately provided by Gaussian mixture models.
There are different patterns or sub-populations within the dataset of feature vectors
generated by a speaker, but these are not many. In our mixture model the number of
components has to equal the number of such sub-populations within a speaker's
feature vector set to optimize performance.
Moreover, it can be seen that despite using Gaussian mixture models, the state of the
art in speaker identification, the performance of our diarization system is not very
encouraging. Hence it becomes imperative to incorporate two other factors in a
diarization system:
1) Make the corpus multimodal and use direction of arrival and time delay of
arrival to enhance performance.
2) Incorporate visual cues, as audio alone has not given significantly encouraging
results.
Bibliography
Ajmera, J., & Wooters,C.2003. A Robust speaker clustering algorithm. In :ASRU2003-
8th IEEEAutomatic Speech Recognition andUnderstanding workshop.
Ajmera,J.. McCowan,I. & Bourlard,H.2004. Robust speaker change detection.IEEE
Signal processing Letters., 11(8).
Akita,Yuva and Kawahara, Tatsuya, 2003. Unsupervised Speaker Indexing Using Anchor
models and Automatic Transcription of Discussions In : The interspeech2003 -8th
European Conference on Speech Communication andTechnology.
Anguera, X.,Wooters, C,, Pardo, J. & Hernando, J.2007. Automatic Weighting for the
Combination of TDOA and Acoustic features in Speaker diarization for Meetings In :
IEEE International Conference on Acoustics, Speech and SignalProcessing 2007.
Anguera,X.,Wooters,C.,& Hernando,J.2005a. Speaker diarization for multi party
meetings using Acoustic fusion. In : ASRU2005 -9th
IEEEAutomatic Speech Recognition
andUnderstanding workshop.
Anguera, Xavier. 2005. XBIC : Real time Cross probability measure for speaker
segmentation. Tech. Rept. ICSI
Anguera, Xavier.2006b. Robust Speaker Diarization for Meetings. Ph.D thesis,
Universitat Politecnica de Catalunya
Anguera, Xavier, Wooters, Chuck, Peskin, Barbara, & Aguilo, Matu. 2005b. Robust
Speaker Segmentation for Meetings : The ICSI-SRI Spring 2005 Diarization system. In :
Rich transcription 2005 Spring MeetingRecognition Evaluation Workshop.
Edinburgh,UK: Springer LNCS 3869.
Anguera, Xavier, Wooters, Chuck, Hernando, Javier.2006 a. Friends and Enemies : A
Novel Initialization for Speaker Diarization. In :Interspeech2006 ICSLP-9th
International conference on Spoken Language Processing.
Anguera, Xavier, Wooters, Chuck & Pardo, Jos M.2006b. Robust speaker Diarization for
Meetings : ICSI RT06S Meetings Evaluation System In :Rich Transcription 2006 SpringMeetingRecognition Evaluation Workshop. Bethsda, MD,USA: Springer LNCS 4299.
-
8/6/2019 Speaker Diarization
46/47
38
Barras, Clause, Zhu, Xuan, Meignier, Sylvain & Gauvain , Jean-Luc.2004 . Improving
Speaker Diarization. In : Rich Transcription 2004 Fall Workshop.
Barras. Michael, Bimbot, Fedric, Ben, Mathieu & Gravier, Guillaume 2004 Multistage
Speaker Diarization of Broadcast News. In:IEEETransactions on Audio, Speech,And
Language Processing, 14(5).
Bester, Michael, Bimbot, Frdric, Ben, Mathieu, & Gravier, Guillaume. 2004. Speaker
Diarization Using bottom up clustering based on a Parameter-Derived Distance between
adapted GMMs.
In :ICSLP2004- 8th
International Conference on Spoken Language Processing.
Bimbot, Frdric, & Mathan, Luc, 1993. Text Free speaker Recognition using an
arithmetic-harmonic sphericity measure. In : Eurospeech93 -3rd
European Conference on
Speech Communication and Technology.
Black, A.,& Schultz,T.2006. Speaker clustering for Multilingual Synthesis. In : ISCA
Tutorial andResearch Workshop on Multilingual Speech and Language Processing
Bonastre, J.-F., Delacourt,P.,Fredouille,C.,Merlin,T.,&Wellelens,C.2000. A speaker
tracking system based on speaker turn detection for NIST Evaluation. In: IEEE
International Conference on Acoustics, Speech and SignalProcessing 2000.
Burges, CJ.C1998. A Tutorial on Support Vector Machines for Pattern Recognition.
Data mining andKnowledge Discovery,2(2).
Cettolo, Mauro.2000. Segmentation, Classification, and Clustering of an ItalianBroadcast news Corpus. In : 6
thRIAO 2000- Content Based Multimedia Information
Access.
Chen. Jingdong, Benesty, Jacob& Huang, Yiteng.2006. Time delay estimation in room
acoustic environments : An Overview. EURASIPJournal on Applied SignalProcessing.
Chen,Scott, Shaobing, & Gopalkrishnan,P.S.1998b. Speaker, Environment And Channel
Change Detection and Clustering Via the Bayesian Information Criterion. In DARPA
Speech Recognition Workshop 1998.
Cheng.E.Lukasiak, J., Burnett,I.S.,& Stirling, D.2005. Using Spatial cues for meeting
Speech Segmentation. In :IEEE ICME05- International Conference on Multimedia and
Erpo 2005.
Cohen,A.& Lapidus,V.1995. Unsupervised Text Independent Speaker classification. In :
18th
Convention of Electrical and Electronics Engineers in Israel.