
AGE AND GENDER ESTIMATION FROM AUDIO FEATURES USING DISCRIMINANT ANALYSIS AND NN FRAMEWORK

A thesis submitted in partial fulfilment of the requirements for the award of the degree of

M.Tech.
in
COMMUNICATION SYSTEMS

By

PUJARI SUJAY GIRISH

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY
TIRUCHIRAPPALLI – 620 015.

MAY 2011

BONAFIDE CERTIFICATE

This is to certify that the project titled "AGE AND GENDER ESTIMATION FROM AUDIO FEATURES USING DISCRIMINANT ANALYSIS AND NN FRAMEWORK" is a bonafide record of the work done by PUJARI SUJAY GIRISH (208109013) in partial fulfilment of the requirements for the award of the degree of Master of Technology in Communication Systems of the NATIONAL INSTITUTE OF TECHNOLOGY, TIRUCHIRAPPALLI, during the year 2010-2011.

S. DEIVALAKSHMI
Guide                                Head of the Department

Project Viva-voce held on _____________________________

Internal Examiner                    External Examiner

ABSTRACT

In the field of speech processing, applications such as interactive voice response (IVR) systems and artificial intelligence need to replicate human behaviour; one such behaviour is the human auditory ability to perceive the sex and approximate age of a speaker from voice alone.

Using selected features extracted from an unknown speaker's voice, the proposed automated system estimates the age group and gender of that person. For classification, 7 classes are defined: young male and female, adult male and female, senior male and female, and child.

The system consists of two main parts: feature extraction from real-time samples captured through a microphone, followed by two-stage feature classification. Pitch, MFCC and delta-MFCC features are extracted, and a combination of canonical discriminant analysis and an NN framework is applied for classification.

For the experiments, the required stimuli databases were collected from 192 different speakers.

Keywords: Speech Processing, Age, Gender, Discriminant Analysis, Neural Network.

ACKNOWLEDGEMENTS

I take this opportunity to express my sincere thanks and deep sense of gratitude to my project guide Mrs S. Deivalakshmi, Assistant Professor, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, for her guidance and kind co-operation.

With immense pleasure, I record my profound gratitude and indebtedness to Prof. Sanjay Patil, Department of Electronics and Telecommunication, Maharashtra Academy of Engineering, Pune University, for his needful suggestions and guidance.

I would like to express my sincere thanks to Prof. P. Somaskandan, Professor and Head of the Department, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, for providing all departmental facilities needed for the successful completion of this project.

I express my deep sense of gratitude to Dr S. Raghavan, Professor, Department of ECE, and Mr M. Bhaskar, Associate Professor, Department of ECE, for giving me the much-needed lab facilities; I would also like to thank them for their motivation and support.

My special thanks to Anil, Jamuna, Nithyananth, Senkathir, Kishore and Pardu for their encouragement and invaluable help in collecting the audio database.

I would like to thank all the teaching staff, my classmates and the computer support group staff for their sincere help.

Last but not least, I dedicate this work to my parents and my family.

Sujay Pujari
May 2011

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
CHAPTER 1 INTRODUCTION
  1.1 Objectives and Approach
  1.2 Database Collection
  1.3 Study Outline
CHAPTER 2 LITERATURE REVIEW
CHAPTER 3 FEATURE EXTRACTION
  3.1 Pitch
  3.2 MFCC (Mel Frequency Cepstral Coefficients)
  3.3 Windowing
CHAPTER 4 FEATURE CLASSIFICATION
  4.1 First Stage with Discriminant Analysis
  4.2 Second Stage with NN Frameworks
CHAPTER 5 RESULTS AND DISCUSSION
  5.1 Unknown Stimulus Results
  5.2 Classification Results, Stage One
  5.3 Classification Results, Stage Two
CHAPTER 6 CONCLUSION AND FURTHER WORK
REFERENCES

LIST OF FIGURES

1.1 Proposed System
1.2 Snapshot of recording with Wave Surfer
3.1 An example input sinusoidal signal
3.2 Autocorrelation of a given input frame
3.3 Number of samples between two maxima
3.4 MFCC feature extraction steps
3.5 Mel frequency vs. frequency
3.6 Mel filter bank
3.7 Hamming window
3.8 Overlapped frames followed by windowing function
3.9 Reconstructed waveform (above) after windowing and original wave (below)
3.10 Cross-correlation between original signal and reconstructed one
3.11 Welch method: all periodograms and their average
4.1 Abstract flow of proposed classification stages
4.2 Classification based on discriminant analysis followed by decision based on Euclidean distance for stage-one classification
4.3 Equivalent decision C1 of NN framework based on outputs of 3 neural networks
4.4 Euclidean distance method for decision in stage 2
4.5 Neural network structure for NNA, NNB and NNC
4.6 Classification algorithm in stage 2
4.7 Matlab nprtool for NN implementation
5.1 Average pitch for females of all 4 classes from database 1
5.2 Average pitch for males of all 4 classes from database 1
5.3 Waveform of one of the records from database 1
5.4 Pitch track for waveform shown in fig. 5.3
5.5 Pitch track, unknown stimulus
5.6 13 MFCC coefficients, unknown stimulus
5.7 12 dMFCC coefficients, unknown stimulus
5.8 11 ddMFCC coefficients, unknown stimulus
5.9 Feature vector of 37 x 1 for unknown stimulus
5.10 Discriminant score plot for all 3 groups
5.11 NN1 framework, NNA network
5.12 NN1 framework, NNB network
5.13 NN1 framework, NNC network
5.14 NN2 framework, NNA network
5.15 NN2 framework, NNB network
5.16 NN2 framework, NNC network
5.17 Stage 1 + NN2 framework, all females
5.18 Stage 1 + NN1 framework, all males
5.19 Overall classification result
6.1 Comparison chart for successful estimation of class

LIST OF TABLES

1.1 Classification groups
5.1 Neural network outputs, unknown stimulus
5.2 Canonical discriminant function coefficients
5.3 Functions at group centroids
5.4 Classification result of stage 1 with database 1

ABBREVIATIONS

CDF   Canonical Discriminant Function
DA    Discriminant Analysis
DS    Discriminant Score
MFCC  Mel Frequency Cepstral Coefficients
NN    Neural Network
YM    Young Male
YF    Young Female
AM    Adult Male
AF    Adult Female
SM    Senior Male
SF    Senior Female
DCT   Discrete Cosine Transform
PSD   Power Spectral Density


CHAPTER-1

INTRODUCTION

Automatic speech recognition (ASR) based algorithms are widely deployed for customer care and service applications. ASR research is currently moving from mere "speech-to-text" (STT) systems towards "rich transcription" (RT) systems, which annotate recognized text with non-verbal information such as speaker identity and emotional state. In interactive voice response systems, this approach is already being used to identify dialogs involving angry customers, which can then be analyzed with the goal of automatically identifying problematic dialogs, transferring unsatisfied customers to an agent, and other purposes.

Also, the first adaptive dialogs are now appearing, particularly in systems exposed to inhomogeneous user groups. These can adapt the degree of automation, order of presentation, waiting-queue music, or other properties to properties of the caller such as age or gender. As an example, it would be possible to offer different advertisements to children and adults in the waiting queue. In non-personalized services, speaker classification will be based on the caller's speech data. While classifier performance is only one factor influencing the utility of the above approach in an IVR system, it is certainly a major factor.

The proposed algorithm for automatic age and gender estimation helps in the same regard: it classifies a speaker's voice into one of the defined classes, thereby predicting the speaker's gender and approximate age group.

1.1 Objectives and Approach

The ultimate aim of the proposed system is to predict the age group and gender of a speaker from a stimulus of any length, in real time. Such systems mainly consist of two stages: feature extraction and selection, followed by classification based on the extracted features. To find features that give distinct values for different classes, a database is needed, so one of the first tasks was to collect this audio database, followed by feature extraction and classification.

For feature extraction, features such as pitch, MFCC and delta-MFCC coefficients are worked out. In the following classification stage two different methods are adopted, namely CDA and the NN framework. The neural networks are trained with the help of all 290 stimuli present in the database, and these trained networks are then used for real-time classification.

Fig. 1.1 Proposed System
[Block diagram: human voice audio input → feature set extraction → classification stage 1 based on discriminant analysis → classification stage 2 based on NN frameworks → output class: child (1), young female (2), adult female (3), senior female (4), young male (5), adult male (6), senior male (7)]

1.2 Database Collection

For classification purposes, data was collected for the following 8 groups:

I. Child, boy (age < 15)
II. Child, girl (age < 15)
III. Young men (age < 30)
IV. Young women (age < 30)
V. Adult men (age < 55)
VI. Adult women (age < 55)
VII. Senior citizens, male (age > 55)
VIII. Senior citizens, female (age > 55)

To begin with, 2 stimuli were collected from each of 105 speakers, namely:

1) "HAPPY BIRTHDAY"
2) The speaker's name, spoken in his or her mother tongue.

For example,
<My Name is Sujay> or
<Maz nav ... ahe.> (Marathi) or
<En peyar ...> (Tamil) or
<Naa peru ...> (Telugu),

with the specifications:

a) Sampling rate Fs = 8000 samples/sec
b) Bits per sample = 16
c) Mono channel

By the end of the experiments on these 2 stimuli it was quite clear that distinguishing between groups I and II was practically impossible. So the classification groups were finally fixed at 7 classes by merging groups I and II into a single group known as child, and the classification groups given in Table 1.1 were adopted.

Table 1.1 Classification Groups

Group no | Symbol | Category
1        | C      | Child (age < 18)
2        | YF     | Young Female (age < 30)
3        | AF     | Adult Female (age < 55)
4        | SF     | Senior Female (age > 55)
5        | YM     | Young Male (age < 30)
6        | AM     | Adult Male (age < 55)
7        | SM     | Senior Male (age > 55)

3) After this, a stimulus of "OM" was adopted, with the condition that it be extended to more than 10 seconds in a single breath, like "OOOOOOOOOOOMMMMMMMMMMMmmmmmmmm".

There are 87 such sample recordings with the following specifications:

A. Sampling rate Fs = 16000 samples/sec
B. Bits per sample = 16
C. Mono channel

Fig. 1.2 Snapshot of recording with Wave Surfer

For recording and editing with the desired specifications, the open-source tool "Wave Surfer 1.8.8p3" was used.

These 3 databases are referred to as Database 1, Database 2 and Database 3, in which all files are stored in the form "Name_Age.wav".
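Since every file follows this naming convention, a recording's specification can be checked when loading it. The following is a minimal sketch of such a check; the file name is hypothetical, and `audioread`/`audioinfo` are the modern replacements for the `wavread` reader of that era:

```matlab
% Hedged sketch: read one database file and verify the recording spec
fname = 'Sujay_23.wav';                 % hypothetical "Name_Age.wav" file
[x, fs] = audioread(fname);             % samples and sampling rate
info = audioinfo(fname);
assert(fs == 16000);                    % Database 3 spec (8000 for Database 1)
assert(info.BitsPerSample == 16 && info.NumChannels == 1);
```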

1.3 Study Outline

This thesis is organized as follows. Chapter 2 reviews the literature and background of the algorithms adopted for the estimation of age and gender. The materials and methods used in this study are discussed in Chapters 3 and 4: Chapter 3 deals with feature extraction and Chapter 4 with feature classification. Chapter 5 provides the results and discussion, and Chapter 6 concludes the thesis with future directions.

CHAPTER-2

LITERATURE REVIEW

In this chapter, the important literature used to implement the proposed algorithm is reviewed.

Minematsu, N. et al. (1993), in "Automatic estimation of one's age with his/her speech based upon acoustic modelling techniques of speakers", proposed a technique to identify subjectively elderly speakers with prosodic features such as MFCC-based speech rate.

William R. Klecka (1980), in "Discriminant Analysis", presents a lucid and simple introduction to the family of related statistical procedures known as discriminant analysis. He introduces the canonical discriminant function (CDF), derives the canonical discriminant function coefficients, provides a spatial interpretation of them, and gives a clear discussion of the interpretation of CDFs, covering both unstandardized and standardized coefficients.

The SPSS ver. 14 manual on algorithms, titled "Discriminant", explains all the steps involved in classification based on CDF coefficients.

Braun, A. et al. (1999), in "Estimating speaker age across languages", conducted an analysis showing the correlation between calendar age and perceived age with the help of Italian and Dutch stimuli, and further concluded that male and female listeners can safely be combined.

Cerrato, L. et al. (2000), in "Subjective age estimation of telephonic voices", carried out a statistical analysis showing that listeners are capable of assigning a general chronological age category to a voice without seeing or knowing the speaker, and that they are able to distinguish between male and female voices transmitted over a telephone line.

Krauss, R. M. et al. (2002), in "Inferring speakers' physical attributes from their voices", examined listeners' ability to make accurate inferences about speakers from the non-linguistic content of their speech.

Shafran, I. et al. (2003), in "Voice signatures", explores the problem of extracting a voice signature from a speaker's voice, and found standard mel-warped cepstral features, speaking rate and shimmer to be useful.

Rabiner, L. et al. (1976), in "A comparative performance study of several pitch detection algorithms", discusses the main pitch detection algorithms. According to this work, pitch can be as low as 40 Hz (for a very low-pitched male voice) or as high as 600 Hz (for a very high-pitched female or child's voice).

Rabiner, L. et al. (1977), in "On the use of autocorrelation analysis for pitch detection", explains a pitch detection technique based on short-time autocorrelation analysis.

McLeod, P. and Wyvill, G. (2005), in "A smarter way to find pitch", found that existing pitch algorithms that work in the Fourier domain suffer from spectral leakage, and suggested windowing as a remedy.

Metze, F. et al. (2007), in "Comparison of four approaches to age and gender recognition for telephone application", compares different approaches to age and gender classification on telephone speech with small and large utterance lengths.

Welch, P. D. (1967), in "The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms", gives the use of the FFT for PSD estimation.

Moller, M. (1993), in "A scaled conjugate algorithm for fast supervised learning", introduces SCG, whose performance is benchmarked against back propagation. It is fully automated, includes no critical user-dependent parameters, and avoids a time-consuming line search.

Huang, X. et al. (2001), in "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development", Prentice Hall, describes prosodic phenomena such as pitch along with the available algorithms.

Childers, D. et al. (1977), in "The Cepstrum: A guide to processing", gives pragmatic details of cepstrum concepts.

Spiegl, W. et al. (2009), in "Analysing Features for Automatic Age Estimation on Cross-Sectional Data", developed an acoustic feature set for the estimation of a person's age from a recorded speech signal, and demonstrated that age can be effectively estimated using a feature vector of prosodic, spectral and cepstral features.

CHAPTER-3

FEATURE EXTRACTION

In this chapter the feature extraction algorithms are explained. In this work 2 types of features were found most suitable, namely pitch and MFCC coefficients.

3.1 PITCH

Pitch represents the perceived fundamental frequency of a sound, and it may be quantified as a frequency in cycles per second (hertz); however, pitch is not a purely objective physical property but a subjective psychoacoustic attribute of sound.

According to Huang, X. [ref 10], prosody is a complex weave of physical and phonetic effects that is employed to express attitude, assumptions, and attention as a parallel channel in our daily speech communication.

The semantic content of a spoken or written message is referred to as its denotation, while the emotional and attentional effects intended by the speaker or inferred by a listener are part of the message's connotation. Prosody has an important supporting role in guiding a listener's recovery of the basic message (denotation) and a starring role in signalling connotation, or the speaker's attitude toward the message, toward the listener(s), and toward the whole communication event.

From the listener's point of view, prosody consists of the systematic perception and recovery of a speaker's intentions based on:

I. Pauses: to indicate phrases and to avoid running out of air.
II. Pitch: rate of vocal-fold cycling (fundamental frequency) as a function of time.
III. Rate/relative duration: phoneme durations, timing, and rhythm.
IV. Loudness: relative amplitude/volume.

Pitch is the most expressive of the prosodic phenomena. As we speak, we systematically vary our fundamental frequency to express our feelings about what we are saying, or to direct the listener's attention to especially important aspects of our spoken message.

3.1.1 PITCH DETECTION

According to Naotoshi Seo [ref 15], pitch can be detected in the following ways:

a. Autocorrelation method
b. Cepstrum method
c. Harmonic product spectrum method (HPS)
d. Linear predictive coding (LPC)

In this work the first method is adopted. In order to calculate pitch, at least two peaks must lie within the block over which pitch is measured. This can be guaranteed by making the block size greater than 3 wavelengths of the lowest possible frequency, which is known as the pitch floor. The minimum number of samples required per frame is therefore

N_min = 3 * Fs / pitch floor   (samples/frame)

According to this method, to get the pitch we take the autocorrelation of the signal for the given block or frame. The sample distance K between the first sample and the second-highest peak then gives the fundamental frequency, where Fs is the sampling frequency:

Pitch = Fs / K   (Hz)

Fig. 3.1: An example input sinusoidal signal.

Fig. 3.2: Autocorrelation of the given input frame

Fig. 3.3: Number of samples between 2 maxima (second peak at sample X = 41, Y = 7980)

For example, as shown in fig. 3.3,

K = (41 - 1) = 40

For Fs = 8000,

Pitch = 8000/40 = 200 Hz
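A minimal sketch of this autocorrelation estimator follows; the test tone and the peak-search range are my own choices, since the thesis's exact code is not shown:

```matlab
% Hedged sketch: autocorrelation pitch estimation for one frame
fs      = 8000;                          % Database 1 sampling rate
x       = sin(2*pi*200*(0:1599)'/fs);    % 200 Hz test tone as a stand-in
floorHz = 40;                            % pitch floor (ref. 2: pitch >= 40 Hz)
Nmin    = ceil(3*fs/floorHz);            % minimum samples per frame = 600

r = xcorr(x(1:Nmin));                    % autocorrelation of the frame
r = r(Nmin:end);                         % keep lags 0, 1, ..., Nmin-1

lags   = round(fs/600):round(fs/floorHz);   % plausible lags for 40-600 Hz
[~, i] = max(r(lags + 1));               % second-highest peak (lag 0 excluded)
K      = lags(i);                        % samples between the two maxima
pitch  = fs/K                            % here K = 40, so pitch = 200 Hz
```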

3.2 MFCC (Mel Frequency Cepstral Coefficients)

MFCCs are used because they are popular and efficient to compute. They incorporate a perceptual mel frequency scale and separate the source from the filter, and the IDFT (implemented as a DCT) decorrelates the features, which in turn improves class separability.

Fig. 3.4 MFCC feature extraction steps

3.2.1 MEL SCALE

Human hearing is not equally sensitive to all frequency bands; it is less sensitive at higher frequencies, roughly above 1000 Hz. In other words, human perception of frequency is non-linear: the mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz.
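One common analytic form of this mapping (an assumption here, since the thesis does not print the formula it used) is

mel(f) = 2595 · log10(1 + f / 700)

which maps 1000 Hz to roughly 1000 mel, while 8000 Hz maps to only about 2840 mel.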

Fig. 3.5 Mel frequency vs. frequency

For this work 13 mel filter banks are used, as shown in fig. 3.6, which in turn give 13 MFCC coefficients.
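A minimal sketch of the fig. 3.4 pipeline for a single frame follows; the triangular filters, the edge placement and the 2595·log10 mel mapping are standard-textbook assumptions rather than the thesis's exact implementation:

```matlab
% Hedged sketch: 13 MFCCs for one frame, following the Fig. 3.4 pipeline
fs   = 16000;                          % Database 3 sampling rate
N    = 1024;                           % frame length used in this work
x    = randn(N, 1);                    % stand-in for one speech frame
nMel = 13;                             % 13 filters -> 13 MFCCs

P = abs(fft(x .* hamming(N))).^2;      % power spectrum of windowed frame
P = P(1:N/2 + 1);                      % keep non-negative frequencies

hz2mel = @(f) 2595*log10(1 + f/700);   % assumed mel mapping
mel2hz = @(m) 700*(10.^(m/2595) - 1);
edges  = mel2hz(linspace(0, hz2mel(fs/2), nMel + 2));  % filter edges (Hz)
bin    = floor(edges/fs*N) + 1;        % corresponding FFT bin indices

fbank = zeros(nMel, N/2 + 1);          % triangular mel filter bank
for k = 1:nMel
    l = bin(k); c = bin(k+1); r = bin(k+2);
    fbank(k, l:c) = linspace(0, 1, c - l + 1);  % rising slope
    fbank(k, c:r) = linspace(1, 0, r - c + 1);  % falling slope
end

mfcc = dct(log(fbank*P + eps));        % log energies -> DCT -> 13 MFCCs
```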


Fig. 3.6 Mel filter bank

3.2.2 LOG ENERGY

The logarithm compresses the dynamic range of values:

o The human response to signal level is logarithmic; humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes.
o It makes the frequency estimates less sensitive to slight variations in the input (e.g. power variation due to the speaker's mouth moving closer to the microphone).
o Phase information is not helpful in speech, so only the log magnitude is kept.

3.2.3 CEPSTRUM

According to Childers, D. [ref 4], the cepstrum is nothing but the spectrum of a spectrum; its independent variable is known as quefrency. The cepstrum requires Fourier analysis, but since we are going from frequency space back to time we actually apply the inverse DFT. Since the log power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT).

3.2.4 DELTA MFCC and DOUBLE DELTA MFCC

These are the variations in the MFCCs and the variations in those variations, obtained here by differencing adjacent coefficients: from 13 MFCC coefficients we get 12 delta-MFCC and 11 delta-delta-MFCC coefficients.
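Since the deltas are formed by differencing adjacent coefficients (13 → 12 → 11), a minimal sketch of assembling the 37 x 1 feature vector of fig. 5.9 is shown below; the stand-in values replace the outputs of the earlier steps:

```matlab
% Hedged sketch: 37 x 1 feature vector {pitch, 13 MFCC, 12 dMFCC, 11 ddMFCC}
meanPitch = 106.4;                           % example mean pitch in Hz (Sec. 5.1)
mfcc      = randn(13, 1);                    % stand-in for the 13 MFCCs of Sec. 3.2
dmfcc     = diff(mfcc);                      % 12 delta-MFCC coefficients
ddmfcc    = diff(dmfcc);                     % 11 delta-delta-MFCC coefficients
features  = [meanPitch; mfcc; dmfcc; ddmfcc];   % 1 + 13 + 12 + 11 = 37 elements
```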

3.3 Windowing

Instead of processing the whole recorded audio signal at once, windowing with overlapping is used, which limits the buffer length to 1024 samples plus the previous frame's results (such as its PSD). Methods like Welch's can then be used to find the periodogram of non-stationary signals.

Fig. 3.7 Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n/N)

In this work, windowing is used for pitch estimation (the pitch track) and for PSD estimation when extracting the MFCC coefficients. A Hamming window of length 1024 with 50% overlap is adopted.
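A minimal sketch of this framing step follows; the `buffer` helper from the Signal Processing Toolbox is my choice, as the thesis does not name one:

```matlab
% Hedged sketch: 1024-sample frames with 50% overlap, Hamming-windowed
fs = 16000;
x  = randn(4*fs, 1);                          % stand-in for a recorded stimulus
N  = 1024;                                    % frame length
F  = buffer(x, N, N/2, 'nodelay');            % columns are overlapped frames
Fw = F .* repmat(hamming(N), 1, size(F, 2));  % window each frame
```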

Fig. 3.8 Overlapped frames followed by windowing function

Fig. 3.9: Reconstructed waveform (above) after windowing and the original wave (below)

Fig. 3.10 Cross-correlation between the original signal and the reconstructed one (peak at X = 2560, Y = 15.53)

3.3.1 PSD ESTIMATION – Welch's Method

Welch's method for estimating power spectra is carried out by dividing the time signal into successive overlapping blocks, forming the periodogram of each block, and averaging.

Denote the mth windowed, zero-padded frame from the signal x by

x_m(n) = w(n) · x(n + mR),   n = 0, 1, …, N − 1

where R is the window hop size, and let K denote the number of available frames. The periodogram of the mth block is then given by

P_m(ω_k) = (1/N) |Σ_{n=0}^{N−1} x_m(n) e^{−j2πnk/N}|²

and the Welch estimate of the PSD is given by

Ŝ_x(ω_k) = (1/K) Σ_{m=0}^{K−1} P_m(ω_k)

In this work a Hamming window of N = 1024 with 50% overlap is used.
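A minimal sketch using the Signal Processing Toolbox's `pwelch` with the stated parameters; the test tone, placed near the 993 Hz peak visible in fig. 3.11, is my own stand-in:

```matlab
% Hedged sketch: Welch PSD estimate, Hamming window N = 1024, 50% overlap
fs = 16000;
t  = (0:4*fs - 1)'/fs;
x  = sin(2*pi*993*t) + 0.1*randn(size(t));   % tone near the fig. 3.11 peak
[Pxx, f] = pwelch(x, hamming(1024), 512, 1024, fs);
plot(f, Pxx); xlabel('Hz'); ylabel('PSD (Watt/Hz)');
```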

Fig. 3.11: Welch method, all periodograms and their average
[Two plots, Watt/Hz vs. Hz: the periodograms of all blocks, and the average periodogram with its peak at about X = 993 Hz, Y = 0.0001179 Watt/Hz]

CHAPTER-4

FEATURE CLASSIFICATION

In this chapter a combination of two classification algorithms is proposed as a two-stage classification. There are 7 classes to classify: C, YF, AF, SF, YM, AM and SM. It was found that with the help of a statistical classification technique, canonical discriminant analysis, the speaker can be predicted as male, female or child; that is the first stage of classification. Then, to decide between young, adult and senior within the male and female groups, the NN framework is used, which forms the second stage.

Fig 4.1 Abstract flow of proposed classification stages
[Block diagram: feature set → classification based on discriminant analysis → MALE (groups 5 to 7) into NN Framework 1 → YM (5), AM (6), SM (7); FEMALE (groups 2 to 4) into NN Framework 2 → YF (2), AF (3), SF (4); or directly C (1), child]

4.1 First Stage with Discriminant Analysis

Feature classification stage 1 is done using discriminant analysis (DA). In this method 2 canonical discriminant functions are determined from the extracted features. Only two features, namely pitch and delta-delta MFCC (10), are used as the input vector for this stage. For training, 39 female, 27 child and 37 male stimuli from Database 1 were used.

After extracting the features from the feature set for all training cases, the unstandardized coefficients along with the group centroids were determined for the 2 functions. The discriminant score can then be determined for an unknown feature set, and classification is done based on the Euclidean distance rule.

4.1.1 Steps for Canonical Discriminant Analysis

The selected 103 samples are referred to as the training database. Using the SPSS package the canonical discriminant functions can be found; with 3 classes this ends up with 2 functions. For that, 3 feature matrices of size (no. of samples from that group x 2 feature values) are given as input, along with a (total samples x 1) vector depicting the truth value of each class.

After following the steps explained in Klecka [ref 5], the following information is obtained for each function:

1. Unstandardized coefficients D
2. Constant D0
3. Function values at group centroids

The canonical discriminant function can then be determined as

f = D0 + XD

where X is the (1 x P) feature vector for the given stimulus.

4.1.2 Classification based on CDF

After substituting the value X_input into the obtained CDF we get f_input; this value is the discriminant score (DS) for the given input feature vector. With 2 functions we get 2 different values of f_input; therefore

F1 = [f_input1  f_input2]

There are 3 group centroid values of size [2 x 1]; the group whose centroid has the minimum Euclidean distance from F1 is selected as the classified group.
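A minimal sketch of this decision rule, using the trained values as printed in Tables 5.2 and 5.3:

```matlab
% Hedged sketch: stage-1 classification from X_input = [pitch, ddMFCC(10)]
D  = [0.027 -0.006;                     % unstandardized coefficients (Table 5.2)
      0.779  2.380];
D0 = [-5.939 0.926];                    % constants
centroids = [ 2.529 -0.315;             % group 0: child  (Table 5.3)
              0.417  0.388;             % group 1: female
             -2.284 -0.179];            % group 2: male
labels = {'Child', 'Female', 'Male'};

Xin = [106.4090 0.1201];                % unknown stimulus of Sec. 5.1
F1  = Xin*D + D0;                       % the two discriminant scores
[~, g] = min(sum((centroids - repmat(F1, 3, 1)).^2, 2));  % nearest centroid
disp(labels{g})                         % prints 'Male' for this example
```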

Fig. 4.2 Classification based on discriminant analysis followed by decision based on Euclidean distance for stage-one classification
[Diagram: X_input = [pitch, ddMFCC(10)] → 2 trained functions (unstandardized coefficients D, constant D0, function values at group centroids) → F1 → nearest centroid among C1, C2, C3 decides class M, F or C]

4.2 Second Stage with NN Frameworks

The NN framework applicable to the male or female speaker is now applied. Both NN frameworks use the same algorithm, shown in fig. 4.6: the [37 x 1] feature vector is applied simultaneously to 3 neural networks. The output pair obtained from each network is treated as the coordinates of a point in 3-D space, with the third coordinate taken as zero. From these 3 position vectors P1, P2 and P3, the centroid coordinates C1 are obtained. The points [1 0 0], [0 1 0] and [0 0 1] are the target values for the 3 subclasses of male and female; the minimum distance between the centroid C1 and these three points decides the selection of one of the three classes.

These 3 neural networks are trained networks obtained by considering 2 classes as target output at a time, so 3C2 = 3 neural networks are required.

Fig. 4.3 Equivalent output C1 of the NN framework based on the outputs of 3 neural networks
[Diagram: points P1, P2, P3 and their centroid C1, relative to the targets [1 0 0], [0 1 0] and [0 0 1]]

Fig. 4.4 Euclidean distance method for the decision in classification stage 2
[Diagram: distances L1, L2, L3 from the centroid C1 to [1 0 0], [0 1 0] and [0 0 1]]

Fig. 4.5 Neural network structure for NNA, NNB and NNC
[Diagram: 37-element input layer → 40-element hidden layer (weights W1) → 2-element output layer (weights W2), giving outputs op1 and op2]

Then,

From NNA, P1 = [op1 op2 0]
From NNB, P2 = [0 op1 op2]
From NNC, P3 = [op1 0 op2]

C1 = centroid of (P1, P2, P3)

L1 = dist(C1, [1 0 0]),  L2 = dist(C1, [0 1 0]),  L3 = dist(C1, [0 0 1])

The following method is applicable for both the NN1 and NN2 frameworks:

NNA is the neural network trained with only Young and Adult (male or female) stimuli.
NNB is the neural network trained with only Senior and Adult (male or female) stimuli.
NNC is the neural network trained with only Young and Senior (male or female) stimuli.
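A minimal sketch of this decision rule; `nnA`, `nnB` and `nnC` stand for the three trained networks (hypothetical names) and `x` for the 37 x 1 feature vector:

```matlab
% Hedged sketch: stage-2 decision from the three pairwise networks
oA = nnA(x);  oB = nnB(x);  oC = nnC(x);   % each returns [op1; op2]
P1 = [oA(1) oA(2) 0];                      % embed the pairs in 3-D space
P2 = [0 oB(1) oB(2)];
P3 = [oC(1) 0 oC(2)];
C1 = (P1 + P2 + P3)/3;                     % centroid of the three points
T  = eye(3);                               % targets [1 0 0], [0 1 0], [0 0 1]
[~, class] = min(sum((T - repmat(C1, 3, 1)).^2, 2));   % smallest L decides
```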

Fig. 4.6 Classification algorithm in stage 2
[Flow: X_input [37 x 1] feature vector {pitch, 13 MFCC, 12 dMFCC, 11 ddMFCC} → NNA, NNB, NNC → P1, P2, P3 → centroid C1 → distances L1, L2, L3 → the smallest L decides class Y, A or S]

4.2.2 Neural Network Implementation

For the neural network implementation, Matlab's Neural Network Pattern Recognition tool is used, which can be invoked with the "nprtool" command. This tool uses scaled conjugate gradient back-propagation, as explained in ref. [7], via the "trainscg" training function. The speciality of this SCG method is that it can train any network as long as its weight, net input and transfer functions have derivative functions. The algorithm is based on conjugate directions and does not perform a line search at each iteration.

Training stops when any one of the following occurs:

1. The maximum number of epochs is reached.
2. The maximum amount of time is reached.
3. Performance is minimised to the goal.
4. The performance gradient falls below the minimum gradient.
5. Validation performance has increased more than the maximum fail count.

The number of neurons in the hidden layer is taken as 40, and all 3 databases combined are used for training. In the tool itself the percentage of samples for training, validation and testing can be specified; here the ratio 70%, 15% and 15% is used. For neuron modelling the tool uses the hyperbolic tangent sigmoid transfer function.

After satisfactory training, i.e. a good classification rate, the network can be saved in memory and invoked at any time during testing.
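The thesis configured this through the nprtool GUI; the following is a minimal scripted sketch of what I take to be the equivalent setup (the data matrices are stand-ins):

```matlab
% Hedged sketch: pattern-recognition network, 40 hidden neurons, trainscg
net = patternnet(40, 'trainscg');        % tansig hidden layer by default
net.divideParam.trainRatio = 0.70;       % 70% training samples
net.divideParam.valRatio   = 0.15;       % 15% validation samples
net.divideParam.testRatio  = 0.15;       % 15% testing samples

X = rand(37, 100);                       % stand-in: 37 x M feature vectors
T = [ones(1,50) zeros(1,50);             % stand-in: 2 x M one-hot targets
     zeros(1,50) ones(1,50)];
net = train(net, X, T);
save('trainedNNA.mat', 'net');           % hypothetical file; reload for testing
```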


Fig. 4.7 Matlab nprtool for NN implementation


CHAPTER-5

RESULTS AND DISCUSSION

From Database 1 the pitch feature was extracted. It was found that groups F1 and M1 do not show any distinct features and can safely be combined into a single child class. At the same time, using pitch one can clearly distinguish between children and men, but between children and women pitch alone was found not to be completely reliable.

o Children: ≤ 15 years, male (M1) and female (F1)
o Young people: 15-30 years, male (M2) and female (F2)
o Adults: 30-55 years, male (M3) and female (F3)
o Seniors: ≥ 55 years, male (M4) and female (F4)

Fig. 5.1 Average pitch for Females of all 4 classes from database 1

Fig. 5.2 Average pitch for Males of all 4 classes from database 1

Fig. 5.3 Waveform of one of the records from database 1, "Happy birthday"

Fig. 5.4 Pitch track for the waveform shown in fig. 5.3 (Hz vs. frame number)

While plotting the pitch track, i.e. the pitch contour, for databases 1 and 2, it showed dramatic variations in pitch within a stimulus. This was followed by the collection of database 3, in which the pitch contour stays near the average pitch value at all times.

With the help of database 3 one more fact came to light: males under 12 and females under 18 show distinct results compared to the others. It was observed that for boys there is a change in pitch after the age of 12 years, whereas for girls this happens at 18 years. This was the deciding factor for fixing the classification groups as given in Table 1.1, according to which a child is any human less than 18 years of age.

In the following section, the results and calculations of the algorithm are illustrated with one example stimulus.


5.1 Unknown Stimulus Results

Fig. 5.5 Pitch track, unknown stimulus

Fig. 5.6: 13 MFCC coefficients, unknown stimulus


Fig. 5.7: 12 dMFCC coefficients, unknown stimulus

Fig. 5.8: 11 ddMFCC coefficients, unknown stimulus


Fig. 5.9 Feature vector of 37 x 1 for the unknown stimulus: [mean(pitch), 13 MFCC, 12 dMFCC, 11 ddMFCC]

F_input = [pitch = 106.4090, ddMFCC(10) = 0.1201]

Discriminant scores:

DS1 = [106.4090 0.1201] * [0.027 0.779]^T - 5.939 = -2.9325
DS2 = [106.4090 0.1201] * [-0.006 2.3797]^T + 0.9263 = 0.5349

F1 = [-2.93 0.53]

Centroid c0 (Child)  = [2.5286 -0.3146]
Centroid c1 (Female) = [0.4167 0.3881]
Centroid c2 (Male)   = [-2.2844 -0.1795]

Here the distance between c2 and F1 is smaller than the other distances, and c2 belongs to the male group.

Classification stage 1 result: Male

Now, in classification stage 2, the vector goes through the NN1 framework, passing again through the NNA, NNB and NNC networks.

Table 5.1 Neural network outputs, unknown stimulus

Network | Op1    | Op2
NNA     | 0.7367 | 0.1617
NNB     | 0.9899 | 0.0201
NNC     | 0.9628 | 0.0194

Therefore,

P1 = [0.7367 0.1617 0]
P2 = [0 0.9628 0.0194]
P3 = [0.9899 0 0.0201]

C1 = [0.5755 0.3748 0.0132]   (centroid)

L1 = 0.5664,  L2 = 0.8499,  L3 = 1.2023

L1 is the smallest, which again means class 1, the Young group, so the final classification is Male-Young, i.e. group 5, YM. The output of the algorithm in the Matlab environment is:

--------Group no-------------
Child     = 1
Female<30 = 2
Female<55 = 3
Female>55 = 4
Male<30   = 5
Male<55   = 6
Male>55   = 7
-----------------------------
-----And answer is-----------
group = 5
-----------------------------

Group 5 is YM, so this was a true-positive result.

5.2 Classification Results, Stage One – Canonical Discriminant Analysis

Classification results of stage one, using database 1 as the training database, are given below.

Table 5.2 Canonical Discriminant Function Coefficients (unstandardized)

           | Function 1 | Function 2
pitch      | .027       | -.006
ddmfcc10   | .779       | 2.380
(Constant) | -5.939     | .926

Table 5.3 Functions at Group Centroids (unstandardized canonical discriminant functions evaluated at group means)

group | Function 1 | Function 2
.00   | 2.529      | -.315
1.00  | .417       | .388
2.00  | -2.284     | -.179

Fig. 5.10 Discriminant score plot for all 3 groups

Table 5.4 Classification results of stage 1 with database 1

                              Predicted Group Membership
                      group | .00  | 1.00 | 2.00 | Total
Original        Count  .00  | 22   | 5    | 0    | 27
                       1.00 | 4    | 31   | 4    | 39
                       2.00 | 0    | 1    | 36   | 37
                %      .00  | 81.5 | 18.5 | .0   | 100.0
                       1.00 | 10.3 | 79.5 | 10.3 | 100.0
                       2.00 | .0   | 2.7  | 97.3 | 100.0
Cross-validated Count  .00  | 21   | 6    | 0    | 27
                       1.00 | 4    | 31   | 4    | 39
                       2.00 | 0    | 1    | 36   | 37
                %      .00  | 77.8 | 22.2 | .0   | 100.0
                       1.00 | 10.3 | 79.5 | 10.3 | 100.0
                       2.00 | .0   | 2.7  | 97.3 | 100.0

a. Cross-validation is done only for those cases in the analysis. In cross-validation, each case is classified by the functions derived from all cases other than that case.
b. 86.4% of original grouped cases correctly classified.
c. 85.4% of cross-validated grouped cases correctly classified.

5.3 Classification Results (Confusion Matrices), Stage Two – Neural Networks

Fig 5.11 NN1 framework, NNA network (YM and AM categories)

Fig 5.12 NN1 framework, NNB network (SM and AM categories)

Fig. 5.13 NN1 framework, NNC network (YM and SM categories)

Fig. 5.14 NN2 framework, NNA network (YF and AF categories)

Fig. 5.15 NN2 framework, NNB network (SF and AF categories)

Fig. 5.16 NN2 framework, NNC network (YF and SF categories)

Fig. 5.17 Stage 1 + NN2 framework, all females

Fig. 5.18 Stage 1 + NN1 framework, all males

Fig 5.19 Overall classification result with the whole of database 3 as testing samples

CHAPTER-6

CONCLUSION AND FUTURE WORK

The proposed automatic age and gender estimation system is implemented with the help of Matlab toolboxes. Figure 6.1 compares the classification rates obtained by applying database 3 for testing at the end of the second (last) classification stage.

It is found that the male categories have good classification rates overall. Except for AF, the results are quite satisfactory, including the overall classification rate of 69.4%.

Fig. 6.1 Comparison chart for successful estimation of class (classification rate %, 0 to 100, for Child, YF, AF, SF, YM, AM and SM)

As part of further work, the neural networks should be retrained whenever the true class of a user is known but the system reports a different class. Moreover, there is a need not only to collect more stimuli but also to explore more features.

REFERENCES

1. Welch, P. D. (1967); "The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms", IEEE Trans. on Audio and Electroacoustics, Volume AU-15, pages 70-73.
2. Rabiner, L. et al. (1976); "A comparative performance study of several pitch detection algorithms", IEEE Trans. Acoustics, Speech and Signal Processing, Volume 24, Issue 5, pages 399-418.
3. Rabiner, L. et al. (1977); "On the use of autocorrelation analysis for pitch detection", IEEE Trans. Acoustics, Speech and Signal Processing, Volume 25, Issue 1, pages 24-33.
4. Childers, D. et al. (1977); "The Cepstrum: A guide to processing", Proc. IEEE, Volume 65, Issue 10, pages 1428-1443.
5. William R. Klecka (1980); "Discriminant Analysis", Sage University Paper.
6. Minematsu, N. et al. (1993); "Automatic estimation of one's age with his/her speech based upon acoustic modelling techniques of speakers", ICASSP-93, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.
7. Moller, M. (1993); "A scaled conjugate algorithm for fast supervised learning", Neural Networks, Volume 6(4), pages 523-533.
8. Braun, A. et al. (1999); "Estimating speaker age across languages", The International Congress of Phonetic Sciences, ICPhS99.
9. Cerrato, L. et al. (2000); "Subjective age estimation of telephonic voices", Speech Communication, Volume 31, Issue 2-3 (June 2000), Elsevier.
10. Huang, X. et al. (2001); "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development", Prentice Hall.
11. Krauss, R. M. et al. (2002); "Inferring speakers' physical attributes from their voices", Journal of Experimental Social Psychology, 38, pages 618-625.
12. Shafran, I. et al. (2003); "Voice signatures", Proc. IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2003.
13. McLeod, P. and Wyvill, G. (2005); "A smarter way to find pitch", Proc. International Computer Music Conference, Barcelona, July 2005, pages 300-303.
14. Metze, F. et al. (2007); "Comparison of four approaches to age and gender recognition for telephone application", ICASSP.
15. Naotoshi Seo (2008); "ENEE632 Project 4 Part I: Pitch Detection", ECE Dept., University of Maryland.
16. Spiegl, W. et al. (2009); "Analysing Features for Automatic Age Estimation on Cross-Sectional Data", 10th Annual Conference of the International Speech Communication Association, Brighton, pages 1-4.
17. SPSS ver. 14 manual on algorithms, titled "Discriminant".