Speaker Diarization
A Thesis Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Technology in
Computer Science & Engineering
By:
Avinash Kumar Pandey
2006CS50213
Under The Guidance of:
Prof. K. K. Biswas
Department of Computer Science
IIT Delhi
Email: [email protected]
Department of Computer Science
Indian Institute of Technology Delhi
Certificate
This is to certify that the thesis titled "Speaker Diarization" being submitted by
Avinash Kumar Pandey, Entry Number: 2006CS50213, in partial fulfillment of the
requirements for the award of the degree of Master of Technology in Computer
Science & Engineering, Department of Computer Science, Indian Institute of
Technology Delhi, is a bona fide record of the work carried out by him under my
supervision. The matter submitted in this dissertation has not been submitted for the
award of any other degree anywhere unless explicitly referenced.
Signed: ______________________
Prof. K.K. Biswas
Department of Computer Science
Indian Institute Of Technology, Delhi
New Delhi-110016 India
Date: ______________________
Abstract
In this document we describe a speaker diarization system of the internal
segmentation type. It performs text independent speaker identification using
Gaussian mixture models and the MFCC feature vector set, aided by a Gaussian
mixture model based speech activity detection system.
A general speaker diarization, or speaker detection and tracking, system
consists of four modules: speech activity detection, speaker segmentation, speaker
clustering and speaker identification. In an internal segmentation type diarization
system the functions of the three modules of segmentation, clustering and
identification are discharged by the same speaker identification module: segmentation
is done after identifying which speaker a particular audio segment belongs to, so
clustering happens simultaneously.
Speaker identification systems themselves come in various types: a system can impose
certain limitations on the text the speaker must utter to get identified, or it may be
completely limit-free. The latter systems are called text independent speaker
identification systems. Different speaker identification systems work on different
types of feature vectors; text independent systems work on lower level
glottal features of the speaker.
The feature vectors thus obtained can be modeled using different statistical models;
we have experimented mainly with two: vector quantization and Gaussian
mixture models.
Acknowledgements
I would like to acknowledge the guidance of Prof. K. K. Biswas (Department of
Computer Science, IIT Delhi), which was the corner-stone of this project;
without it this project would never have been possible. Thank you for your
wonderful support. I would also like to express my gratitude towards Prof. S. K.
Gupta and Prof. Saroj Kaushik for their guidance throughout the development of
the project. I gratefully acknowledge my debt to the other people who
assisted in the project in different ways.
Signed: ______________________
Avinash Kumar Pandey
2006CS50213
Indian Institute Of Technology, Delhi
New Delhi-110016 India
Date: ______________________
Contents
Certificate
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Definition
  1.3 History
    1.3.1 Rich transcription framework
  1.4 Applications
  1.5 Outline of the Work
  1.6 Chapter Outline of the Thesis
2 Speaker Diarization
  2.1 Introduction
  2.2 Speech activity detection
  2.3 Speaker segmentation
    2.3.1 Segmentation using silence
    2.3.2 Segmentation using divergence measures
    2.3.3 Segmentation using frame level audio classification
    2.3.4 Segmentation using direction of arrival
  2.4 Speaker clustering
  2.5 Speaker identification
3 Speech Activity Detection and Implementation
  3.1 Introduction
  3.2 Gaussian Mixture Models
  3.3 Our algorithm for SAD
    3.3.1 Observation
  3.4 Experiments
    3.4.1 Fan noise
    3.4.2 Silence
    3.4.3 Fast paced speech
    3.4.4 Moderately paced speech
    3.4.5 White noise
  3.5 Advantages and Drawbacks
4 Implementation of Text Independent Speaker Identification
  4.1 Introduction
  4.2 Speech Parameterization: Feature Vectors
    4.2.1 MFCC
  4.3 Statistical Modeling
    4.3.1 Vector Quantization
    4.3.2 Gaussian Mixture Models
  4.4 Bayesian Information Criterion
5 Experimental Results
6 Conclusion
Bibliography
List of Figures
Figure 1: Segmentation of an audio clip
Figure 2: Gaussian mixture models
Figure 3: Fan noise
Figure 4: Silence
Figure 5: Fast paced speech
Figure 6: Moderately paced speech
Figure 7: White noise
Figure 8: Training phase of a speaker identification system
Figure 9: Testing phase of a speaker identification system
Figure 10: General schematic to calculate PLP or MF-PLP features
Figure 11: Different modules in computation of cepstral features
Figure 12: Vector quantization
List of Tables
Table 1: Speech activity detection experiment results
Table 2: Speaker diarization experiment results
Chapter 1
Introduction
In this chapter we discuss where the problem of speaker diarization originated, which
problem domains it finds application in and how much work has already been done
in this area; we also briefly outline the several chapters of this thesis.
1.1 Motivation
Recording of speech, for several purposes, has long been in practice. There are many
reasons to record one's voice: educational purposes, archival purposes and
conserving a memory through the vicissitudes of time. Recording is more automatic
than scribbling down the dialogue, and it is often cheaper than video as well.
Across the globe, in countless archives, there exists a huge amount of audio data. We
organize our data, as in database tables, through certain keys; for audio clips, as such,
we have no key. The idea is to devise an organization for these databases in order to
make them easy to handle. One possible index for audio data is speaker
identity; this thought is the key motivating idea for speaker diarization.
1.2 Definition
In a given audio clip, the task of speaker diarization essentially addresses the question
of "who spoke when". The problem involves labeling a given audio file entirely with
speaker beginning and end times si and ei for all homogeneous single speaker segments.
If there are portions that correspond to non-speech, those have to be explicitly labeled
too. For example, a sample output for a one minute audio file could be: 0-3
seconds non-speech, 3-25 seconds Speaker 1, 25-37 seconds music, 37-60 seconds Speaker 2.
1.3 History
Historically, the problem was first formulated thus at the National Institute of Standards
and Technology (NIST) Rich Transcription framework, better known as the RT Framework,
convention of 1999. Since then, till 2007, several conventions were held on this problem.
Mainly two types of diarization problems were undertaken:
1) Broadcast news speaker diarization
2) Meeting or conference room speaker diarization
In the broadcast news framework, the recording setup is single modal. That is to say,
there is only one recording device, in which all the speakers take turns to speak.
The apparatus is simple, but the accuracy of the diarization system is
hampered on this account.
In the meeting room domain, audio clips are recorded across multiple distant
microphones whose locations are not disclosed. The diarization results from these
different clips are then combined to enhance the accuracy of the overall
diarization engine.
One aspect of multimodal scenarios leads to enhanced diarization results:
the TDOA parameter, the time delay of arrival. With different recording
devices, the distance of the speaker from each microphone is bound to change with a
speaker change, because two speakers will most probably not occupy the same physical
position. This information, the time difference across different recording devices, is a
very strong tool for speaker segmentation, that is, identifying speaker turn points.
The problem we have undertaken to solve is similar to the broadcast news
scenario, where different speakers take turns to speak into a single recording device.
The current state of the art in the speaker diarization regime has moved beyond audio.
The idea is to record not only audio, but also take visual cues from the speakers and
audiences to determine speaker status and change. Its impact has been conclusively shown
in "Multimodal Speaker Diarization of Real-World Meetings Using Compressed-Domain
Video Features", a paper by Friedland and Hung, 2010.
1.3.1 NIST Rich Transcription Framework
There is a set of related problems undertaken under the NIST Rich Transcription
framework, namely:
1) Large vocabulary continuous speech recognition (LVCSR)
2) Speaker diarization
3) Speech activity detection (SAD)
Let us now briefly discuss each of these problems and the success that has been
achieved in solving them.
Large Vocabulary Continuous Speech Recognition
This is the name for the most common and important problem in speech processing.
LVCSR is simply a technical name for the most general speech recognition system. Amateur
speech recognition engines place certain restrictions on the vocabulary, for
example that the speaker could speak only out of so many words already recognized by
the engine, or that the speaker should speak at a certain pace, which generally meant a
pace slower than the normal pace of speaking. LVCSR was meant to overcome all these
restrictions.
Speaker Diarization
The problem statement of speaker diarization has already been introduced; it will be
taken up in fair detail, both theoretical and implementational, as it is the subject of
this thesis.
Speech Activity Detection
Speech activity detection is a sub-problem in most speech processing applications,
and speaker diarization is no different. We too develop an algorithm for speech
activity detection. In chapter 3, we discuss our algorithm in detail: its experimental
performance, its advantages and its drawbacks.
1.4 Applications
Speaker diarization provides for multifarious applications in diverse domains. Some of
them follow.
1) Once an audio archive has been indexed by speaker identity, a user can
quickly browse the archive looking only at the speakers of his own
interest, rather than manually searching for the speaker he is interested in through
the entire file.
2) Speaker diarization also plays a vital role in automatic speech recognition. If we
do not know the speaker identity and are trying to convert speech to text, the
phonetic models we apply are rather generic; but once we know who exactly the
speaker is, we can migrate to a speaker specific phonetic model, which performs
better. The literature reports about a 10% improvement in automatic
speech recognition when the identity of the speaker is known in advance.
3) If one decides to resort to manual transcription in order to avoid the inherent
difficulties and inaccuracies of automatic speech recognition, even then a diary of
which speaker started when will come in handy.
1.5 Outline of the Work
We created a system which is capable of producing a diary of a given audio clip,
provided it has in its database training samples of all the speakers present in the
clip. Our implementation fundamentally consists of three main modules: a speech
activity detection module, a Bayesian Information Criterion module and a text
independent speaker identification module. The SAD module, given a speech segment,
decides whether it is speech or non-speech. The non-speech categories could be Gaussian
noise, background noise, music or, most commonly, silence. The SAD module
differentiates speech from these kinds of signals.
After this, the Bayesian Information Criterion module narrows down on a
given audio segment till it becomes reasonably assured that this particular segment
belongs to a single speaker.
Then, in the end, we have the core speaker identification engine based on MFCC feature
vectors. It is of the text independent type: it does not depend on the text uttered by the
speaker and can identify a speaker no matter what he speaks. This means our audio clip
does not have to contain a fixed set of words; our speakers can talk about anything and
we will still be able to produce a diary.
1.6 Chapter Outline of the Thesis
In the remaining chapters we first develop the idea of speaker diarization and then
discuss the details associated with the different modules chapter by chapter. In
chapter 2 we discuss, one by one, the theoretical ideas behind, the advances made in
and the techniques popular in each of the areas of speaker segmentation, speaker
clustering and speaker identification. In chapter 3 we discuss our implementation of the
speech activity detection algorithm; in chapter 4 we discuss our implementation and the
general theory of text independent speaker identification, along with a brief treatment
of the Bayesian Information Criterion. In chapter 5 we furnish results about the
performance of our speaker diarization system and in chapter 6 we conclude the thesis.
Chapter 2
Speaker Diarization
2.1 Introduction
Speaker diarization, as Anguera and Wooters contended in "Robust Speaker
Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System", can, for the
twin purposes of abstraction and modularity, be divided into four stages:
1) Speech activity detection
2) Speaker segmentation
3) Speaker clustering
4) Speaker identification
2.2 Speech activity detection
Speech activity detection is the first step in almost every speech processing
application, the primary reason being the highly computationally intensive nature of the
processing methods: one wouldn't want to waste one's resources computing on parts of
the audio clip that are not speech. Suppose there is a meeting room in our computer
science department, and a far field microphone installed in the room is always alert,
that is, it always keeps recording; we do not want to set things up every time we
enter the meeting room for discussions. After a certain period of time, say 24 hours,
we take the output of the recorder and want to filter out the portions that are
non-speech. They could be any kind of noise, music or the sounds of people passing by
in the gallery. This task is accomplished by the speech activity detection module. The
problem of speech activity detection also goes by the name of voice activity detection.
There are various possible ways to determine whether an incoming clip is speech or not.
One of them is cepstral analysis, as investigated by Haigh and Mason in "A
Voice Activity Detector Based on Cepstral Analysis", 1993.
Later in this thesis we will discuss in profound detail what we mean by cepstral
features; for now it is sufficient to know these are values extracted from a stationary
portion of the audio clip, perhaps a 10 second window.
The idea proposed by Haigh and Mason is to detect the speech end-points, that is, the
points where a speech portion begins and ends, by cepstral analysis. This approach is
strictly based on static explicit modeling of speech and non-speech: we have to train
a binary classifier to differentiate speech from non-speech based on feature vectors
extracted from the clip, which the authors suggested should be cepstral features. As the
reader will see later in the thesis, these are the same feature vectors we use for our
text independent speaker identification engine.
The model used could be anything from a Gaussian mixture model to a support
vector machine.
This is one way to tell speech from non-speech. The earliest algorithms for voice
activity detection used two common speech parameters to decide whether there is voice
in a speech frame:
1) Short term energy
2) Zero crossing rate
Short term energy in an audio frame can be read off the log energy coefficient
in the MFCC feature vector set; it is the 0th coefficient of that set.
The zero-crossing rate is the rate of sign changes along a signal, i.e., the rate at which
the signal changes from positive to negative or back. This feature has been used heavily
in both speech recognition and music information retrieval and is defined formally as

$\mathrm{zcr} = \frac{1}{T-1}\sum_{t=1}^{T-1}\mathbb{1}\{s_t\,s_{t-1} < 0\}$

where $s$ is a signal of length $T$ and the indicator function $\mathbb{1}\{A\}$ is 1 if its
argument $A$ is true and 0 otherwise.
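As a minimal illustration of these two parameters, the following sketch (Python with NumPy assumed; the 16 kHz sampling rate and the common 20 ms frame / 10 ms hop framing are our own assumptions, not values fixed by the thesis) computes both quantities per frame.

    import numpy as np

    def frame_signal(s, frame_len=320, hop=160):
        # Split a 1-D signal into overlapping frames (20 ms / 10 ms at 16 kHz).
        n = 1 + max(0, (len(s) - frame_len) // hop)
        return np.stack([s[i * hop: i * hop + frame_len] for i in range(n)])

    def short_term_log_energy(frames):
        # Log of the summed squared samples per frame (cf. the MFCC 0th coefficient).
        return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

    def zero_crossing_rate(frames):
        # Fraction of adjacent sample pairs whose signs differ.
        signs = np.signbit(frames)
        return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)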
But, as can be understood, the use of these two parameters can still lead us to error.
There can be cases where there is high short term energy in a frame that is still not
speech, for example Gaussian noise with a probability distribution such that the
intensity peaks every few seconds.
The same can be said for zero crossing rates: though uncommon, there can be noises
where the change from positive to negative happens very frequently and so deceives our
system. To overcome these difficulties we have come up with a new algorithm for speech
activity detection, which is described in detail in chapter 3.
2.3 Speaker segmentation
Given an audio clip, speaker segmentation is the task of finding the speaker turn points.
The segmenter is supposed to divide the audio clip into non-overlapping segments such
that each segment contains speech from only one speaker. Non-speech segments, we
assume, have already been filtered out by our speech activity detection module. The idea
is better illustrated through the following figure.
Figure 1: Speaker segmentation of an audio clip
The figure shown above is the amplitude graph of an audio clip in which the number of
speakers exceeds one. We begin with one speaker putting forth his point; in between he
is interrupted by another speaker and there is a region of overlapping speech from which
the second speaker takes over. This overlapping speech region is an instance of a turn
point. There is a little abuse of the term "point" here, as we are calling a whole
overlap region a point. We have to identify the positions in the audio clip where the
shifting over of speakers takes place; the stretches between them, from now on, will be
called homogeneous speaker segments.
There are various ways to go about solving the problem of speaker segmentation:
• Segmentation using silence.
• Segmentation using divergence measures.
• Segmentation (and clustering) by performing frame level audio classification.
• Segmentation (and clustering) using a HMM decoder.
• Segmentation (and clustering) using direction of arrival.
The last three methods are unified segmentation and clustering methods. They are also
called internal segmentation methods at times.
The first two methods are called external segmentation methods, because the
ascertaining of identity follows the merging of segments into clusters, as opposed to internal
segmentation methods, where for every frame you first find out who the speaker is; there
you have essentially found the cluster without bothering about the segment at all.
We will now discuss each of these methods briefly.
2.3.1 Segmentation Using Silence
Segmentation using silence is a common sense method based on the
assumption that whenever a speaker change happens there must be a portion of silence
in-between. This, however, cannot be said to hold true in all environments. For example,
in a parliament, speaker changes almost inevitably happen by one speaker forcing his
entry into another speaker's speech: a speaker change did happen, but there was no
intervening silence. Hence we run the risk of losing some speaker turn points. Besides,
this method is plagued by another difficulty. If we observe
the amplitude graph of any speech file closely we will notice that when a speaker speaks,
he doesn't just keep speaking all the time; he stops frequently, or the tonality of his
voice drops to silence, so even a continuous speaker segment contains many
intermediate points of silence. What this means for our purpose is that while we miss
some true speaker turn points, we generate too many unnecessary segments as well.
The greater the number of segments, the greater the difficulty in clustering them. So we
can see why segmenting speech using silence is not such a good idea, mainly
for two reasons:
1) It misses some true speaker turn points with overlapping speech.
2) It generates a much higher number of segments, i.e., false positives.
2.3.2 Segmentation Using Divergence Measures
Delacourt and Wellekens showed in "DISTBIC: A speaker-based segmentation for audio
data indexing" that using divergence measures for speaker segmentation can be useful.
They used the Bayesian Information Criterion as the divergence measure.
Segmentation using divergence measures is the state of the art. Let us first discuss
what a divergence measure is and which divergence measures one can use.
A divergence measure is fundamentally a tool to determine how similar or dissimilar
two things are; in our case the two things could be two successive audio frames or two
successive windows of audio frames. Two famous divergence measures are the
Kullback-Leibler divergence and the Bayesian Information Criterion. We have used the
Bayesian Information Criterion in our implementation, so we discuss it in detail in
chapter 4. Right now we will discuss the Kullback-Leibler divergence.
In probability theory and information theory, the Kullback-Leibler divergence (also
information divergence, information gain, relative entropy, or KLIC) is a non-
symmetric measure of the difference between two probability distributions P and Q. KL
measures the expected number of extra bits required to code samples from P when using
a code based on Q, rather than a code based on P. Typically P represents the "true"
distribution of data, observations, or a precisely calculated theoretical distribution,
while Q represents a theory, model, description, or approximation of P.
Although it is often intuited as a distance metric, the KL divergence is not a true
metric; for example, it is not symmetric: the KL from P to Q is generally not the same
as the KL from Q to P.
For probability distributions P and Q of a discrete random variable, their KL
divergence is defined to be

$D_{\mathrm{KL}}(P\|Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$

In words, it is the average of the logarithmic difference between the probabilities P and
Q, where the average is taken using the probabilities P. The KL divergence is only
defined if P and Q both sum to 1 and if Q(i) > 0 for every i such that P(i) > 0. If the
quantity 0 log 0 appears in the formula, it is interpreted as zero.
Now that we know what a divergence measure is, we can proceed with our discussion of
segmentation using divergence measures. We consider two windows and calculate
their similarity or dissimilarity index with our chosen divergence measure. If the
dissimilarity is above a particular threshold, determined by empirical experimentation,
we call the point between them a speaker turn point, as it flanks the audio with two
windows which are different from each other.
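The window comparison just described can be sketched as follows: fit a single Gaussian to each of two adjacent windows of feature vectors and compare them with the closed-form Gaussian KL divergence, symmetrized. The window length and threshold here are illustrative placeholders to be tuned empirically, not values taken from the thesis.

    import numpy as np

    def gauss_kl(mu0, cov0, mu1, cov1):
        # Closed-form KL divergence between two multivariate Gaussians.
        d = len(mu0)
        inv1 = np.linalg.inv(cov1)
        diff = mu1 - mu0
        return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                      + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

    def turn_points(features, win=100, threshold=20.0):
        # Slide two adjacent windows over the (frames x dims) feature sequence
        # and flag a turn wherever the symmetric divergence exceeds the threshold.
        d = features.shape[1]
        reg = 1e-6 * np.eye(d)              # guards against singular covariances
        turns = []
        for t in range(win, len(features) - win):
            left, right = features[t - win:t], features[t:t + win]
            muL, covL = left.mean(axis=0), np.cov(left.T) + reg
            muR, covR = right.mean(axis=0), np.cov(right.T) + reg
            div = gauss_kl(muL, covL, muR, covR) + gauss_kl(muR, covR, muL, covL)
            if div > threshold:
                turns.append(t)
        return turns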
2.3.3 Segmentation using Frame level audio classification
This is an example of an internal segmentation strategy. Frame level audio
classification means that for every frame of the audio we determine what kind of audio
data it is: is it speech, is it music, and if it is speech, which speaker in our database
does it come from? This is more or less the strategy we follow in our implementation,
except that instead of looking at individual frames we look at a window of frames
which has been determined to contain a single speaker's speech. During our
experimentation with various statistical models we observed that looking at individual
frames rarely leads to the right answers; we have to look at a certain length of the
audio, look at a collection of frames and see which speaker the majority of points
belong to. The
criterion for belonging may be proximity to a particular codebook code vector or a
log-likelihood computation, as in Gaussian mixture models.
2.3.4 Segmentation using Direction of Arrival (DOA)
In a multimodal corpus, where we have multiple distant microphones recording the
speech, two parameters assume overwhelming significance in speaker diarization and
speaker identification: direction of arrival and time delay of arrival. The process is
discussed at length in "Real-time monitoring of participants' interaction in a meeting
using audio-visual sensors" by Busso and Narayanan, 2008.
The technique is called acoustic source localization and has been used widely in RADAR
and SONAR. The speech is recorded in a smart room containing a microphone array,
which is used for acoustic source localization. The approach is based on
TDOA, the time delay of arrival at the various microphones: the geometric inference of
the source location is calculated from the TDOA. First, pair-wise delays are estimated
between all the microphones. These delays are subsequently projected as angles into a
single axis system.
2.4 Speaker clustering
Speaker clustering is the next step after speaker segmentation. Let us rewind a little.
We were first given an audio clip from which, using the speech activity detection
module, we filtered out the non-speech parts. Next, we used our speaker segmentation
tool to generate homogeneous speaker segments. Now, as one can observe, the segments
belonging to a single speaker can be strewn across the clip; we have to label them as
coming from a single source. So we proceed to speaker clustering. We build a similarity
matrix over the segments in the segment list, using a distance
metric to calculate the distance between segments. The segments which match best
are merged, and incrementally the identities of the clusters change. If we already
know the number of speakers present, we can choose to keep as many clusters as the
number of speakers. Otherwise, we choose a stopping criterion, decided
through experimentation. When that stopping criterion is met, we stop merging
segments, and the number of clusters present at that point is taken to be the number of
speakers in our audio clip. A sketch of this merging loop is given after the following
list of approaches.
Popular approaches to speaker clustering:
• Clustering using vector quantization.
• Clustering using iterative model training and classification.
• Clustering in a hierarchical manner using divergence measures.
• Clustering and segmentation using a HMM decoder.
• Clustering and segmentation using direction of arrival.
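A minimal sketch of the bottom-up merging loop described above, assuming Python with NumPy; `distance` stands for whichever divergence measure is chosen, and the stopping threshold is an empirical assumption.

    import numpy as np

    def cluster_segments(segments, distance, stop_threshold):
        # Greedy agglomerative clustering: repeatedly merge the two closest
        # clusters until the smallest remaining distance exceeds the
        # stopping criterion. Each cluster is a (frames x dims) matrix.
        clusters = list(segments)
        while len(clusters) > 1:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = distance(clusters[i], clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            d, i, j = best
            if d > stop_threshold:          # stopping criterion met
                break
            merged = np.vstack([clusters[i], clusters[j]])
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
        return clusters                     # one cluster per inferred speaker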
2.5 Speaker Identification
The last step is speaker identification. First the training part:
1) We divided every second of the training clip into approximately 173 frames and
for each of these frames we calculated MFCC features.
2) Each MFCC feature vector consisted of 13 dimensions.
3) We used the set of these feature vectors to generate a codebook for every speaker
using the concept of vector quantization.
4) The codebook of a speaker consisted of the 16 feature vectors which best modeled
the set of obtained feature vectors for the clip.
Now we move on to the testing part:
1) We again divided the test audio clip into frames, 173 per second, and computed
the feature vectors per frame.
2) We matched each feature vector thus computed with the individual codebooks.
3) The codebook with the maximum matches is declared to be the one corresponding
to the identity of our test speaker. A sketch of this training and matching follows.
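A minimal sketch of the procedure just listed, assuming Python with NumPy and scikit-learn; the k-means routine stands in for whatever codebook training was actually used, while the 16 code vectors and 13-dimensional MFCC rows follow the description above.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_codebook(train_features, size=16):
        # The 16 k-means centroids of a speaker's MFCC vectors serve as
        # that speaker's codebook.
        km = KMeans(n_clusters=size, n_init=10, random_state=0).fit(train_features)
        return km.cluster_centers_

    def identify(test_features, codebooks):
        # For every test frame, find the distance to the nearest code vector in
        # each speaker's codebook (Euclidean metric); the speaker winning the
        # most frames is declared the identity.
        names = list(codebooks)
        per_frame_best = []
        for name in names:
            cb = codebooks[name]                                     # (16, dims)
            d = np.linalg.norm(test_features[:, None, :] - cb[None, :, :], axis=2)
            per_frame_best.append(d.min(axis=1))                     # (frames,)
        winners = np.argmin(np.stack(per_frame_best), axis=0)        # best per frame
        return names[np.bincount(winners, minlength=len(names)).argmax()]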
Chapter 3
Speech Activity Detection and
Implementation
3.1 Introduction
Speech activity detection involves separating speech from:
1) Silence
2) Background noise in different ambiences
3) White Gaussian noise
4) Music
5) Crowd noise
The majority of algorithms used for speech activity detection fall into two categories:
1. The noise level is estimated after looking at the entire file, and anything over and
above a particular decibel level is called speech.
2. According to the ambience, speech and non-speech training sets are taken, models
are trained on them, and these models are then used for further classification.
Both kinds of algorithms suffer from serious drawbacks. The algorithms in
the first category fail to discriminate speech from non-speech when the noise is
variable, since they assume the same noise function throughout the audio. The
algorithms in the second category fail to perform in an unfamiliar environment, and
they need a lot of training data.
The approach we propose here overcomes both these hurdles, though it has drawbacks of
its own. We want a speech detection algorithm which:
• Does not require any training at all.
• Is, nevertheless, able to grasp the difference between speech and non-speech in a
dynamically changing ambience.
Since we use Gaussian mixture models in our approach, a brief review of what Gaussian
mixture models really are is called for.
3.2 Gaussian Mixture Models
Many times our data distribution can't be accurately modeled by a
single multivariate distribution; the sample point set might come from two different
Gaussian distributions. In that case, rather than modeling the dataset with a single
multivariate distribution, it is better to model the data as a mixture of two Gaussians,
where each Gaussian accounts for a certain fraction of the point set, called the
mixing proportion of that Gaussian in the Gaussian mixture model. The mean and
covariance matrix of each Gaussian can be completely independent, and we can
constrain them as the case might be; in our case we let them be completely general.
Figure 2: Gaussian Mixture Models
The diagram above shows a case where two individual Gaussians are present with
different means and covariances.
3.3 Our algorithm for Speech activity detection
The algorithm consists of three simple steps:
• Divide the entire audio clip into fixed sized intervals of 15 seconds.
• Extract MFCC features from each of these segments.
• Cluster the feature vectors obtained for each interval individually, using a
Gaussian mixture model with two components.
3.3.1 Observation
In the case of silence, repeated beats, fan noise or white Gaussian noise, the
preponderance of points belongs to one of the clusters: the mixing proportion remains
highly skewed. For speech frames, the mixing proportion stays even. This attribute of
speech frames becomes the basis of our classification strategy.
We want to narrow down the number of segments which could be speech. It is a negative
classification strategy: whatever is clustered evenly remains a candidate for speech. We
can filter out silence, instrument noises, repeated beats and white noise. This
effectively narrows down our search space by a significant factor. A sketch of this
filter follows.
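A minimal sketch of this negative classification strategy, assuming Python with scikit-learn; the 0.8 skew threshold is our own illustrative assumption, to be tuned empirically.

    from sklearn.mixture import GaussianMixture

    def speech_candidates(mfcc_per_segment, skew_threshold=0.8):
        # For each 15 s segment's (frames x dims) MFCC matrix, fit a
        # 2-component GMM; keep the segment as a speech candidate only if
        # the mixing proportions are roughly even.
        candidates = []
        for idx, feats in enumerate(mfcc_per_segment):
            gmm = GaussianMixture(n_components=2, covariance_type='full',
                                  max_iter=100).fit(feats)
            if gmm.weights_.max() < skew_threshold:   # even mixing -> possibly speech
                candidates.append(idx)
        return candidates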
3.4 Experiments
3.4.1 Experiment 1: Fan noise
[Figure 3: Fan noise]
3.4.2 Experiment 2: Silence
[Figure 4: Silence]
3.4.3 Experiment 3: Fast paced speech
[Figure 5: Fast paced speech]
3.4.4 Experiment 4: Moderately paced speech
[Figure 6: Moderately paced speech]
18
3.4.5 Experiment 5: White noise
Figure 7: White noise
• Software generated white noise.
• Might appear in our audio from a malfunction in the recording device or from
network channel noise.
• The BIC value of the clustering always comes out negative.
• The algorithm often fails to converge for two components within 100 iterations.
• When it does converge, the mixing proportion is highly skewed.
Summary of experimental results
[Table 1: Speech activity detection experiment results]
3.5 Advantages and Drawbacks
Advantages
• Does not require any training at all, so we can apply it even in environments
where training based approaches would have failed for lack of training data.
• Since it treats every segment independently, it can adjust itself dynamically to
changing ambience properties.
Drawbacks
• The algorithm is computationally intensive: we have to calculate the feature
vectors, cluster them and find the BIC for every single segment in the file.
• It works well for relatively long periods of silence, but is not as robust for
shorter segments.
Chapter 4
Implementation of Text Independent
Speaker Identification
4.1 Introduction
Our approach to handling the three modules other than speech activity detection
can be called internal segmentation and clustering.
We do not first segment and then cluster the audio clip. Instead, we take a portion of
the clip and try to identify which speaker it belongs to; if there is confusion, that
is, if the statistics revealed by our testing algorithms are not conclusive, we narrow
it down further. We thus determine a minimum time span within which a speaker change
is not happening, and within it we establish the identity of the speaker. We can also
take this time to be 1 second and then determine individually, for every 1 second
segment, which speaker it belongs to.
For doing so, we have two tools at our disposal:
1) A text independent speaker identification engine, and
2) The Bayesian Information Criterion (BIC).
Speaker identification systems are primarily of two types, text independent and text
dependent. The names are fairly self-explanatory: text independent speaker
identification systems work for any test utterance, while text dependent speaker
identification systems work only with a fixed test utterance. The two systems classify
their training data based on widely different sets of speech parameters, resulting in
different accuracies under similar circumstances. Accordingly, they find applications
in different problem domains.
An abstract layout of a speaker identification/verification engine, consisting of
various modules and stages, is shown in the following figures:
Figure 8 shows the training phase, and Figure 9 the testing phase.
The training phase, assuming a most general speaker identification/verification system,
consists of two fundamental modules: a speech parameterization module and a
statistical modeling module.
The raw speech data received is first processed by the speech parameterization module
to extract some useful characteristic information. These information parameters
are collected, usually, at different points in the time domain and the frequency
domain, to mark the variations in the speech that are characteristic of a particular
speaker.
The speech parameters thus obtained are then fitted to a statistical model of choice
using the statistical modeling module, in order to calculate the defining parameters
of the model for that particular speaker, for example the mean and variance in a
single Gaussian model. The choice of statistical model can vary; we have
experimented with two models, codebooks (vector quantization) and Gaussian mixture
models. The state of the art systems in speaker identification employ Gaussian mixture
models with great success.
The testing phase is preceded by the collection of training samples of all speakers in
our universe; those training samples are converted into corresponding speaker models,
and if the statistical modeling demands so, a universal model is formed from the
training samples. Now, when we are given a test utterance, we calculate the speech
parameters using the speech parameterization module and then use a decision scoring
module to decide which of the available speaker models these parameters best match.
The choice of decision scoring module depends on the statistical model we
use. When we use vector quantization, it is simply the number of points corresponding
to each codebook: the codebook which has the maximum points closest to the obtained
parameter set is our output identity, the distance metric being the simple Euclidean
distance. When we use Gaussian mixture models as our statistical modeling tool,
log-likelihood becomes the decision score: we calculate the log-likelihood of each
feature vector obtained from the test utterance and see which speaker model it best
corresponds to.
So, our discussion so far has established that there are fundamentally three variables
in a speaker identification system:
1) Speech parameterization
2) Statistical modeling
3) Decision scoring
We will take them all one by one now, first giving the theoretical details, and
then the experimental results.
Figure 8: Different modules in the training phase of a speaker identification system.
[Diagram: input speech -> speech parameterization module -> speech parameters ->
statistical modeling -> speaker model]
Figure 9: Different modules in the testing phase of a speaker identification system.
[Diagram: speech data from a given speaker -> speech parameterization module ->
speech parameters -> decision scoring against the speaker models from the database ->
identity]
4.2 Speech parameterization: Feature Vectors
The speech parameterization module calculates or extracts useful speech parameters
from a raw audio clip. The popular term for the parameters thus obtained is feature
vectors. The most widely used feature vectors come from a particular class called
cepstral features. We will discuss briefly what exactly we mean by cepstral features
and then give the specifics of the feature vector set we are using.
Cepstral features based on filter-banks
The entire process of calculating filter-bank based cepstral features is shown
schematically, module by module, in Figures 10 and 11.
Figure 10: General schematic for the calculation of PLP or MF-PLP features.
[Diagram: input -> pre-emphasis -> windowing -> FFT -> filter-bank -> 20*log ->
cepstral transform]
Figure 11: Different modules employed in the calculation of cepstral features (MFCC).
Now we will discuss each of these modules and its relevance to our work one by one.
1) Pre-emphasis: Emphasis is laid on a special section of the speech signal,
namely the higher frequency range of the spectrum. It is believed that the nature of
speech production attenuates the higher frequencies, inducing a need to pre-emphasize
the signal to make up for the loss in the production process. In the cases we studied,
hardly any benefit accrued from pre-emphasis, so we have done away with this module.
2) Windowing: This is a crucial phase in the calculation of feature vectors. We
make the so-called stationarity assumption, which means that if we consider a window
of the speech signal small enough, there won't be any variation in the values of the
feature vectors across that small window.
So we select a window beginning at the start of the speech signal, in our
case of 20 ms; we then shift the starting position of the moving window by 10 ms
and consider the next window of length 20 ms, which means every two
consecutive windows have 10 ms in common. The choice of window is again dependent on
experimental evidence; we went with the triangular window. Other options could have
been the Hamming or Hanning windows.
3) FFT: The next step is simply calculating the fast Fourier transform of the
windowed frame obtained after windowing and possibly pre-emphasis.
4) Filter-bank: The spectrum obtained after applying the FFT still contains a lot of
unnecessary details and fluctuations, things we are not interested in. So, in order to
obtain the features we are really interested in, we multiply the spectrum with a
filter-bank. A filter-bank is nothing but a collection of band pass frequency filters.
In essence we filter out all the unnecessary information and keep only the frequencies
that concern us. The knowledge of these particular frequencies comes from our
knowledge of the process of speech production. The
spectral feature set MFCC receives its name from its filter-bank, which is called
the mel scale frequency filter-bank. The mel scale is an auditory scale
similar to the frequency scale of the human ear.
5) Discrete cosine transform: An additional transform, which in generic
terms we have called the cepstral transform, is applied; in our case it is the discrete
cosine transform, which, applied to the result of the filter-bank operations, yields
the final cepstral feature vectors that are of interest to us.
Two other important features are the log energy and the delta of the log energy. MFCC
is a 13 dimensional feature set whose first coefficient is the log energy; we
incorporated the difference of the log energy as well in our feature set, which
resulted in significantly improved recognition rates. In effect we have incorporated
the deltas corresponding to all 13 feature vector dimensions, as sketched below.
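A small sketch of how such deltas can be computed from a (frames x 13) MFCC matrix, using a simple symmetric difference (one of several common delta formulas, not necessarily the exact one used in this work):

    import numpy as np

    def deltas(features):
        # First-order differentials, approximated by the symmetric
        # difference of the neighbouring frames.
        padded = np.pad(features, ((1, 1), (0, 0)), mode='edge')
        return (padded[2:] - padded[:-2]) / 2.0

    # mfcc: (frames, 13)  ->  full feature set with deltas: (frames, 26)
    # full = np.hstack([mfcc, deltas(mfcc)])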
There are a number of popular feature vector sets that can be extracted from an audio
clip; different feature vector sets capture different properties of the clip. The most
widely used ones are:
1) MFCC, mel frequency cepstral coefficients
2) RASTA-PLP
3) LPC, linear predictive coding
The feature vector set which we are using is MFCC together with its delta set. The
delta set essentially means the differentials of the feature vectors, that is, it
captures how the values of the MFCC feature vectors vary over time. This information,
as it turns out, is also vital to characterizing a speaker.
4.2.1 MFCC
The mel-frequency cepstrum (MFC) is a representation of the short-term power
spectrum of a sound, based on a linear cosine transform of a log power spectrum on a
nonlinear mel scale of frequency.
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up
an MFC. They are derived from a type of cepstral representation of the audio clip (a
nonlinear "spectrum of a spectrum"). The difference between the cepstrum and the
mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on
the mel scale, which approximates the human auditory system's response more closely
than the linearly spaced frequency bands used in the normal cepstrum. This frequency
warping can allow for a better representation of sound.
MFCCs are derived as follows:
1. Take the Fourier transform of a windowed excerpt of the signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using
triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a
signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
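For illustration, these steps are implemented by standard libraries; a minimal sketch with librosa follows. The file name is hypothetical, and the 16 kHz rate with 20 ms/10 ms windows are our assumptions, not the exact settings of this thesis.

    import numpy as np
    import librosa

    # Hypothetical file; 13 coefficients as used in this thesis.
    y, sr = librosa.load("speaker01_train.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=320, hop_length=160)   # 20 ms / 10 ms
    mfcc = mfcc.T                                  # (frames, 13), one row per frame
    delta = librosa.feature.delta(mfcc, axis=0)    # differentials over time
    features = np.hstack([mfcc, delta])            # (frames, 26)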
4.3 Statistical Modeling
Now that we have our 23 dimensional feature vectors, 173 of them for every second of
the audio clip, we need to fit this data to a statistical model of choice. We want a
concise representation for making sense of this data. The two most widely used
statistical models are:
1) Vector quantization, or codebook quantization, and
2) Gaussian mixture models.
We will discuss the theoretical details of both one by one and then detail the
experimental results obtained with each.
4.3.1 Vector Quantization
The algorithm is deemed to have converged when the assignments no longer change.
Voronoi Diagrams
Without going into the details, a Voronoi diagram consists of Voronoi cells. Given a
set of points, the whole feature space is decomposed into a number of sections or
cells, one corresponding to each point. The points lying in the Voronoi cell of a point
are those which are closer to that particular point than to any other point in the
given set.
Figure 12: Vector quantization. Image source: http://www.data-compression.com/vq-2D.gif
4.3.2 Gaussian Mixture Models
At times it happens that the representation provided by vector quantization (codebook
quantization) is not adequate for modeling the variations in the tonalities of a
particular speaker, so it seems like a good idea to allow multiple underlying
representations with different probabilities to model a particular speaker. This can be
achieved handsomely by Gaussian mixture models, which are the state of the art for
speaker recognition.
In statistics, a mixture model is a probabilistic model for representing the presence
of sub-populations within an overall population, without requiring that an observed
data set identify the sub-population to which an individual observation belongs.
Formally, a mixture model corresponds to the mixture distribution that represents the
probability distribution of observations in the overall population.
However, while problems associated with "mixture distributions" relate to deriving the
properties of the overall population from those of the sub-populations, "mixture
models" are used to make statistical inferences about the properties of the
sub-populations given only observations on the pooled population, without
sub-population identity information.
Some ways of implementing mixture models involve steps that do attribute postulated
sub-population identities to individual observations (or weights towards such
sub-populations), in which case these can be regarded as types of unsupervised learning
or clustering procedures. However, not all inference procedures involve such steps.
The structure of a general mixture model can be understood as follows: a typical
finite-dimensional mixture model is a hierarchical model consisting of the following
components:
• N random variables corresponding to observations, each assumed to be distributed
according to a mixture of K components, with each component belonging to the
same parametric family of distributions but with different parameters.
• N corresponding random latent variables specifying the identity of the mixture
component of each observation, each distributed according to a K-dimensional
categorical distribution.
• A set of K mixture weights, each of which is a probability (a real number between
0 and 1), all of which sum to 1.
• A set of K parameters, each specifying the parameters of the corresponding
mixture component. In many cases, each "parameter" is actually a set of
parameters. For example, observations distributed according to a mixture of
one-dimensional Gaussian distributions will have a mean and variance for each
component, while observations distributed according to a mixture of V-dimensional
categorical distributions (e.g., when each observation is a word from a vocabulary
of size V) will have a vector of V probabilities, collectively summing to 1.
The general mixture model can then easily be converted into a Gaussian mixture model
by taking each component to be Gaussian, with a mean and covariance as its parameters.
Parameter Estimation: Expectation Maximization
We have used the expectation maximization (EM) algorithm for parameter estimation, or
mixture decomposition, in Gaussian mixture models.
The expectation step
With initial guesses for the parameters of our mixture model, the "partial membership"
of each data point in each constituent distribution is computed by calculating
expectation
values for the membership variables of each data point. That is, for each data point
x_j and distribution Y_i, the membership value y_{i,j} is:

$y_{i,j} = \frac{a_i\,\mathcal{N}(x_j;\,\mu_i,\Sigma_i)}{\sum_{k=1}^{K} a_k\,\mathcal{N}(x_j;\,\mu_k,\Sigma_k)}$
The maximization step
With expectation values in hand for group membership, plug-in estimates are recomputed
for the distribution parameters.
The mixing coefficients a_i are the means of the membership values over the N data
points:

$a_i = \frac{1}{N}\sum_{j=1}^{N} y_{i,j}$

The component model parameters \theta_i are also calculated by expectation maximization
using the data points x_j, weighted by the membership values. For example, if \theta_i
is a mean \mu_i:

$\mu_i = \frac{\sum_{j} y_{i,j}\,x_j}{\sum_{j} y_{i,j}}$

With new estimates for a_i and the \theta_i's, the expectation step is repeated to
recompute new membership values. The entire procedure is repeated until the model
parameters converge.
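The whole loop in code, as a minimal sketch assuming Python with NumPy and SciPy; the full covariances carry a small regularizer for numerical stability, and this is an illustration rather than the thesis's exact implementation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        # Plain EM for a K-component Gaussian mixture on data X (N x d).
        rng = np.random.default_rng(seed)
        N, d = X.shape
        a = np.full(K, 1.0 / K)                        # mixing coefficients
        mu = X[rng.choice(N, size=K, replace=False)]   # initial means
        cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
        for _ in range(n_iter):
            # E-step: membership y[i, j] of point x_j in component i.
            y = np.stack([a[i] * multivariate_normal.pdf(X, mu[i], cov[i])
                          for i in range(K)])
            y /= y.sum(axis=0, keepdims=True)
            # M-step: re-estimate parameters from the weighted points.
            Nk = y.sum(axis=1)
            a = Nk / N
            mu = (y @ X) / Nk[:, None]
            for i in range(K):
                diff = X - mu[i]
                cov[i] = (y[i, :, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        return a, mu, cov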
4.4 Bayesian Information Criterion (BIC)
When going through a window of the given audio clip, it becomes necessary to
determine whether the window belongs to one speaker or whether there exists a
segmentation point in between; this we can achieve using the Bayesian Information
Criterion.
The metric is an indicator of the acoustic dissimilarity between two sub-windows.
The question we fundamentally ask is whether the window is better modeled by one
speaker alone or whether more speakers give it a better representation; models with a
higher number of independent parameters are penalized using a weight lambda.
Given that a model M is described by the statistical distribution theta, the BIC for a
window W can be defined as

$\mathrm{BIC}(M) = \log L(X_W;\,\theta) - \lambda\,\frac{D}{2}\,\ln N_W$

1) X_W is the series of audio feature vectors captured in the window W, and N_W is
their number.
2) D is the number of independent parameters present in theta.
3) The second term is the penalty term and penalizes a model for its complexity.
4) The weight lambda can be adjusted.
5) The model with the higher BIC value is to be chosen.
BIC in Segmentation

$\Delta\mathrm{BIC}(t) = \mathrm{BIC}(M_1) - \mathrm{BIC}(M_0)$

• Two models, M0 and M1, are defined.
• Model M0 represents the scenario where the candidate time t is not a turn point, so
the left and right sub-windows belong to a common distribution theta(W).
• Model M1 represents the scenario where t is a turn point; then the left and right
sub-windows belong to different distributions theta(L) and theta(R).
• It is assumed that the feature vectors follow Gaussian distributions, in which case

$\Delta\mathrm{BIC}(t) = \frac{N_W}{2}\ln|\Sigma_W| - \frac{N_L}{2}\ln|\Sigma_L| - \frac{N_R}{2}\ln|\Sigma_R| - \lambda\,\frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\ln N_W$

where d is the feature dimension and a positive value favours a turn point at t. A
computational sketch of this quantity follows.
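A minimal sketch for a candidate turn point inside a window of feature vectors (Python with NumPy assumed; the small regularizer, our own addition, guards against singular covariances):

    import numpy as np

    def delta_bic(window, t, lam=1.0):
        # Delta-BIC for splitting a (frames x dims) window at frame t, with
        # single full-covariance Gaussians for the whole window W and the two
        # halves L and R. A positive value favours a speaker turn at t.
        N, d = window.shape
        left, right = window[:t], window[t:]

        def logdet(X):
            return np.log(np.linalg.det(np.cov(X.T) + 1e-6 * np.eye(d)))

        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
        return (0.5 * N * logdet(window)
                - 0.5 * len(left) * logdet(left)
                - 0.5 * len(right) * logdet(right)
                - lam * penalty)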
Chapter 5
Experimental Results
5.1 Datasets and Objective of the Experiments
In the previous three chapters we discussed how we went about implementing our
speaker diarization system.
The primary objective of the experiments was to see whether the performance of our
speaker diarization system eroded as the number of speakers in a conversation
increased, all other parameters being kept the same. To this end, we recorded speech
samples from six different speakers, who were asked to read excerpts from Friedrich
Wilhelm Nietzsche's Also sprach Zarathustra. Five training samples of 30 to 45 seconds
each were taken from each speaker; then ten testing samples were also taken. From this
data, conversations were synthesized: we take a testing sample from one of the
speakers, insert some silence before or after it, then stuff in speech samples from
other speakers, and keep doing so until satisfied with the length and other attributes
of the conversation clip.
The results of our experiments with two to six speakers are summarized in the
following table.
Table 2: Speaker diarization experiment results

#Speakers | Codebook accuracy | GMM accuracy
2         | 85.4%             | 91.3%
3         | 81.3%             | 86.8%
4         | 74.5%             | 78%
5         | 73.3%             | 76.5%
6         | 71.2%             | 74%
Chapter 6
Conclusion
The experimental results clearly demonstrate that Gaussian mixture models are much
more robust than vector quantization (codebook quantization) when it comes to
maintaining performance in the wake of an increasing number of speakers in a
conversation. This can be attributed to the fact that a speaker's glottal cords behave
differently under different conditions, such as uttering different categories of
phonetic sounds, so we need a model which accounts for the sub-populations within the
feature point population generated by a particular speaker, a mixture model kind of
representation, which is adequately provided by Gaussian mixture models.
There are different patterns or sub-populations within the dataset of feature vectors
generated by a speaker, but these are not many. In our mixture model the number of
components has to equal the number of such sub-populations within a speaker's
feature vector set to optimize performance.
Moreover, it can be seen that despite using Gaussian mixture models, the state of the
art in speaker identification, the performance of our diarization system is not very
encouraging. Hence it becomes imperative to incorporate two other factors in a
diarization system:
1) Make the corpus multimodal and use direction of arrival and time delay of
arrival to enhance performance.
2) Incorporate visual cues, as audio alone has not given significantly encouraging
results.
Bibliography
Ajmera, J., & Wooters,C.2003. A Robust speaker clustering algorithm. In :ASRU2003-
8th IEEEAutomatic Speech Recognition andUnderstanding workshop.
Ajmera,J.. McCowan,I. & Bourlard,H.2004. Robust speaker change detection.IEEE
Signal processing Letters., 11(8).
Akita,Yuva and Kawahara, Tatsuya, 2003. Unsupervised Speaker Indexing Using Anchor
models and Automatic Transcription of Discussions In : The interspeech2003 -8th
European Conference on Speech Communication andTechnology.
Anguera, X.,Wooters, C,, Pardo, J. & Hernando, J.2007. Automatic Weighting for the
Combination of TDOA and Acoustic features in Speaker diarization for Meetings In :
IEEE International Conference on Acoustics, Speech and SignalProcessing 2007.
Anguera,X.,Wooters,C.,& Hernando,J.2005a. Speaker diarization for multi party
meetings using Acoustic fusion. In : ASRU2005 -9th
IEEEAutomatic Speech Recognition
andUnderstanding workshop.
Anguera, Xavier. 2005. XBIC : Real time Cross probability measure for speaker
segmentation. Tech. Rept. ICSI
Anguera, Xavier.2006b. Robust Speaker Diarization for Meetings. Ph.D thesis,
Universitat Politecnica de Catalunya
Anguera, Xavier, Wooters, Chuck, Peskin, Barbara, & Aguilo, Matu. 2005b. Robust
Speaker Segmentation for Meetings : The ICSI-SRI Spring 2005 Diarization system. In :
Rich transcription 2005 Spring MeetingRecognition Evaluation Workshop.
Edinburgh,UK: Springer LNCS 3869.
Anguera, Xavier, Wooters, Chuck, Hernando, Javier.2006 a. Friends and Enemies : A
Novel Initialization for Speaker Diarization. In :Interspeech2006 ICSLP-9th
International conference on Spoken Language Processing.
Anguera, Xavier, Wooters, Chuck & Pardo, Jos M.2006b. Robust speaker Diarization for
Meetings : ICSI RT06S Meetings Evaluation System In :Rich Transcription 2006 SpringMeetingRecognition Evaluation Workshop. Bethsda, MD,USA: Springer LNCS 4299.
-
8/6/2019 Speaker Diarization
46/47
38
Barras, Clause, Zhu, Xuan, Meignier, Sylvain & Gauvain , Jean-Luc.2004 . Improving
Speaker Diarization. In : Rich Transcription 2004 Fall Workshop.
Barras. Michael, Bimbot, Fedric, Ben, Mathieu & Gravier, Guillaume 2004 Multistage
Speaker Diarization of Broadcast News. In:IEEETransactions on Audio, Speech,And
Language Processing, 14(5).
Bester, Michael, Bimbot, Frdric, Ben, Mathieu, & Gravier, Guillaume. 2004. Speaker
Diarization Using bottom up clustering based on a Parameter-Derived Distance between
adapted GMMs.
In :ICSLP2004- 8th
International Conference on Spoken Language Processing.
Bimbot, Frdric, & Mathan, Luc, 1993. Text Free speaker Recognition using an
arithmetic-harmonic sphericity measure. In : Eurospeech93 -3rd
European Conference on
Speech Communication and Technology.
Black, A.,& Schultz,T.2006. Speaker clustering for Multilingual Synthesis. In : ISCA
Tutorial andResearch Workshop on Multilingual Speech and Language Processing
Bonastre, J.-F., Delacourt,P.,Fredouille,C.,Merlin,T.,&Wellelens,C.2000. A speaker
tracking system based on speaker turn detection for NIST Evaluation. In: IEEE
International Conference on Acoustics, Speech and SignalProcessing 2000.
Burges, CJ.C1998. A Tutorial on Support Vector Machines for Pattern Recognition.
Data mining andKnowledge Discovery,2(2).
Cettolo, Mauro.2000. Segmentation, Classification, and Clustering of an ItalianBroadcast news Corpus. In : 6
thRIAO 2000- Content Based Multimedia Information
Access.
Chen. Jingdong, Benesty, Jacob& Huang, Yiteng.2006. Time delay estimation in room
acoustic environments : An Overview. EURASIPJournal on Applied SignalProcessing.
Chen,Scott, Shaobing, & Gopalkrishnan,P.S.1998b. Speaker, Environment And Channel
Change Detection and Clustering Via the Bayesian Information Criterion. In DARPA
Speech Recognition Workshop 1998.
Cheng.E.Lukasiak, J., Burnett,I.S.,& Stirling, D.2005. Using Spatial cues for meeting
Speech Segmentation. In :IEEE ICME05- International Conference on Multimedia and
Erpo 2005.
Cohen,A.& Lapidus,V.1995. Unsupervised Text Independent Speaker classification. In :
18th
Convention of Electrical and Electronics Engineers in Israel.