Recent work on Language Identification
Pietro Laface
POLITECNICO di TORINO
Brno 28-06-2009 Pietro LAFACE
Team
POLITECNICO di TORINO
Pietro Laface Professor
Fabio Castaldo Post-doc
Sandro Cumani PhD student
Ivano Dalmasso Thesis Student
LOQUENDO
Claudio Vair Senior Researcher
Daniele Colibro Researcher
Emanuele Dalmasso Post-doc
Our technology progress
1. Inter-speaker compensation in feature space, for GLDS/SVM models and GMMs (ICASSP 2007)
2. SVM using GMM supervectors (GMM-SVM), introduced by MIT-LL for speaker recognition
3. Fast discriminative training of GMMs: an alternative to MMIE, exploiting the GMM-SVM separation hyperplanes (cf. MIT-LL discriminative GMMs)
4. Language factors
GMM super‑vectors
Appending the mean vectors of all the Gaussians into a single vector, we get a supervector:

\mu = [\,\mu_{11} \cdots \mu_{1p}\;\; \mu_{21} \cdots \mu_{2p}\;\; \cdots\;\; \mu_{N1} \cdots \mu_{Np}\,]^T

(N Gaussians with p-dimensional means)

We use GMM supervectors:
- without normalization, for inter-speaker/channel variation compensation
- with Kullback-Leibler normalization, for training GMM-SVM models and training discriminative GMMs
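As an illustration, a minimal sketch (toy shapes and numbers, not the actual Loquendo implementation) of stacking GMM means into a supervector, with the diagonal-covariance KL normalization that scales each mean by the square root of its mixture weight over its standard deviation:

```python
import numpy as np

def supervector(means, weights=None, stds=None, kl_normalize=False):
    """Stack GMM means (N x p) into a single supervector of length N*p.

    With kl_normalize=True each mean is scaled by sqrt(weight)/std,
    the usual diagonal-covariance Kullback-Leibler normalization.
    """
    means = np.asarray(means, dtype=float)
    if kl_normalize:
        w = np.asarray(weights, dtype=float)[:, None]   # (N, 1)
        s = np.asarray(stds, dtype=float)               # (N, p) std deviations
        means = np.sqrt(w) * means / s
    return means.reshape(-1)

# toy example: 2 Gaussians with 3-dimensional means
means = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
sv = supervector(means)                       # plain supervector
svn = supervector(means, weights=[0.25, 0.75],
                  stds=[[1, 1, 1], [2, 2, 2]], kl_normalize=True)
```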
Using a UBM in LID
1. The frame-based inter-speaker variation compensation approach estimates the compensation factors using the UBM
2. In the GMM-SVM approach all language GMMs share the weights and variances of the UBM
3. The UBM is used for fast selection of Gaussians
Speaker/channel compensation in feature space
U is a low-rank matrix (estimated offline) mapping the speaker/channel factor subspace into the supervector domain.
x(i) is a low-dimensional vector, estimated using the UBM, holding the speaker/channel factors for the current utterance i.
The compensation is applied frame by frame:

\hat{o}_t = o_t - \sum_m \gamma_m(t)\, U_m\, x(i)

where \gamma_m(t) is the occupation probability of the m-th Gaussian at frame t, and U_m is the block of U corresponding to Gaussian m.
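A minimal sketch of this frame-level compensation (hypothetical dimensions; the posteriors and factors would come from the UBM and the factor estimator):

```python
import numpy as np

def compensate_frames(O, Gamma, U, x):
    """Remove the speaker/channel component from each frame.

    O:     (T, d) observation frames
    Gamma: (T, N) occupation probabilities gamma_m(t) from the UBM
    U:     (N, d, k) per-Gaussian blocks of the low-rank subspace matrix
    x:     (k,) speaker/channel factors for the current utterance
    """
    shift = U @ x                 # (N, d): per-Gaussian shift U_m x
    return O - Gamma @ shift      # weighted by the occupation probabilities

T, d, N, k = 4, 3, 2, 2
rng = np.random.default_rng(0)
O = rng.standard_normal((T, d))
Gamma = np.full((T, N), 0.5)      # dummy posteriors summing to 1 per frame
U = rng.standard_normal((N, d, k))
x = rng.standard_normal(k)
O_hat = compensate_frames(O, Gamma, U, x)
```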
Estimating the U matrix
- Speaker recognition: estimating U with a large set of differences between models generated from different utterances of the same speaker compensates the distortions due to inter-session variability.
- Language recognition: estimating U with a large set of differences between models generated from utterances of different speakers of the same language compensates the distortions due to inter-speaker/channel variability within the same language.
GMM-SVM weakness
GMM-SVM models perform very well on rather long test utterances, but it is difficult to estimate a robust GMM from a short test utterance.
Idea: exploit the discriminative information given by the GMM-SVM hyperplanes for fast estimation of discriminative GMMs.
SVM discriminative directions
[Figure: separation hyperplanes with normal vectors w_1, w_2, w_3]

w \cdot x + b = 0

w: normal vector to the class-separation hyperplane
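As a self-contained illustration (toy data standing in for KL-space supervectors, and a tiny hinge-loss trainer instead of a production SVM package), the normal vector w of a linear SVM can be obtained like this:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Tiny linear SVM via full-batch subgradient descent on the hinge loss.

    y must be in {-1, +1}. Returns (w, b), with w normal to the
    separating hyperplane w . x + b = 0.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # margin violations
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy 2-class problem standing in for two languages in the KL space
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.3, (20, 2)), rng.normal(1, 0.3, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```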
GMM discriminative training
Shift each Gaussian of a language model along its discriminative direction, given by the vector normal to the class‑separation hyperplane in the KL space
[Figure: an utterance GMM and a language GMM shown both in feature space and in the KL space, with per-Gaussian discriminative directions w_k]

\hat{\mu}_k = \mu_k + \alpha_k\, w_k

where the shift computed in the KL space is mapped back to the feature space through the inverse KL normalization.
Experiments with 2048 GMMs
Pooled EER (%) of discriminative 2048 GMMs and GMM-SVM on the NIST LRE tasks. In parentheses, the average of the EERs of each language.

NIST LRE | 3s | 10s | 30s
1996 | 11.71 (13.71) | 3.62 (4.92) | 1.01 (1.37)
2003 | 13.56 (14.40) | 5.50 (6.02) | 1.42 (1.64)
2005 | 16.94 (17.85) | 9.73 (11.07) | 4.67 (5.81)

For comparison, 256-Gaussian MMI (Brno University, 2006 IEEE Odyssey), LRE 2005: 17.1 | 8.6 | 4.6
Pushed GMMs (MIT-LL)
Language Factors
Eigenvoice modeling, and the use of speaker factors as input features to SVMs, has recently been shown to give good results for speaker recognition compared to the standard GMM-SVM approach (Dehak et al., ICASSP 2009).
Analogy
Estimate an eigen-language space, and use the language factors as input features to SVM classifiers (Castaldo et al. submitted to Interspeech 2009).
s = \mu_{UBM} + V\,y   (s: utterance supervector, \mu_{UBM}: UBM supervector, y: language factors)
Language Factors: advantages
Language factors are low-dimensional vectors
Training and evaluating SVMs with different kernels is easy and fast: it only requires dot products of normalized language factors
Using a very large number of training examples is feasible
Small models give good performance
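A minimal sketch of the idea (hypothetical dimensions; a real system estimates y by MAP point estimation rather than plain projection): extract low-dimensional language factors from a supervector and length-normalize them for dot-product kernels.

```python
import numpy as np

rng = np.random.default_rng(2)
D, R = 100, 5                  # supervector dim, number of eigen-languages
m = rng.standard_normal(D)     # UBM mean supervector
V = np.linalg.qr(rng.standard_normal((D, R)))[0]   # orthonormal subspace

def language_factors(s):
    """Project a supervector onto the eigen-language space: s = m + V y."""
    y = V.T @ (s - m)
    return y / np.linalg.norm(y)   # length-normalize for dot-product kernels

# an utterance lying exactly on the first eigen-language direction
s = m + V @ np.array([3.0, 0.0, 0.0, 0.0, 0.0])
y = language_factors(s)
```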
Toward an eigen-language space
After compensating the nuisances of a GMM adapted from the UBM using a single utterance, residual information about the channel and the speaker remains.
However, most of the undesired variation is removed, as demonstrated by the improvements obtained with this technique.
Speaker compensated eigenvoices
First approach
Estimating the principal directions of the GMM supervectors of all the training segments before inter-speaker nuisance compensation would produce a set of language independent, “universal” eigenvoices.
After nuisance removal, however, the speaker contribution to the principal components is reduced to the benefit of language discrimination.
Eigen-language space
Second approach
Computing the differences between the GMM supervectors obtained from utterances of a polyglot speaker would compensate the speaker characteristics and enhance the acoustic components of one language with respect to the others.
Since we do not have labeled databases including polyglot speakers, we instead compute and collect the differences between GMM supervectors (already compensated in the feature domain) produced by utterances of speakers of two different languages, irrespective of speaker identity.
Eigen-language space
The number of such differences would grow with the square of the number of training utterances.
Instead, we perform Principal Component Analysis on the set of differences between the supervectors of each language and the average supervector of every other language.
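A small sketch of this construction (toy 4-dimensional "supervectors" and three synthetic languages; the real systems work on full GMM supervectors):

```python
import numpy as np

def eigen_language_space(svs_by_lang, n_components=2):
    """PCA on differences between each language's supervectors and the
    average supervector of every other language.

    svs_by_lang: list of (n_i, D) arrays, one per language.
    Returns (n_components, D) principal directions (unit rows).
    """
    means = [S.mean(axis=0) for S in svs_by_lang]
    diffs = []
    for i, S in enumerate(svs_by_lang):
        for j, mu in enumerate(means):
            if i != j:
                diffs.append(S - mu)      # differences vs other-language means
    X = np.vstack(diffs)
    X = X - X.mean(axis=0)
    # principal directions = top right-singular vectors of the centered data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_components]

rng = np.random.default_rng(3)
langs = [rng.normal(loc, 0.1, (30, 4))
         for loc in ([0, 0, 0, 0], [2, 0, 0, 0], [0, 2, 0, 0])]
E = eigen_language_space(langs, n_components=2)
```

Since the toy languages differ only along the first two dimensions, the recovered directions concentrate there.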
Training corpora
The same corpora used for the LRE07 evaluation:
- All data of the 12 languages in the CallFriend corpus
- Half of the NIST LRE07 development corpus
- Half of the OHSU corpus provided by NIST for LRE05
- The Russian through Switched Telephone Network corpus (automatic segmentation)
LRE07 30s closed set test
The language-factor system's minDCF is always better and more stable.
Pushed GMMs (MIT-LL)
Pushed eigen-language GMMs
The same pushing approach is applied to obtain discriminative GMMs from the language factors. The pushed positive and negative models are weighted sums over the training supervectors with positive and negative SVM weights \alpha_i:

g_{positive} = \frac{\sum_{i|\alpha_i>0} \alpha_i\, g_i}{\sum_{i|\alpha_i>0} \alpha_i} \qquad g_{negative} = \frac{\sum_{i|\alpha_i<0} \alpha_i\, g_i}{\sum_{i|\alpha_i<0} \alpha_i}

For the eigen-language models, the utterance supervectors are reconstructed from the language factors:

g_i = \mu_{UBM} + V\, y_i
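The weighted sums above can be sketched as follows (toy 2-dimensional supervectors and hand-picked dual weights, purely for illustration):

```python
import numpy as np

def pushed_means(G, alpha):
    """Pushed positive/negative supervectors from signed SVM dual weights.

    G:     (n, D) training supervectors
    alpha: (n,) signed dual weights (alpha_i > 0 for the target language,
           alpha_i < 0 for the competing ones)
    """
    pos, neg = alpha > 0, alpha < 0
    g_pos = (alpha[pos, None] * G[pos]).sum(0) / alpha[pos].sum()
    g_neg = (alpha[neg, None] * G[neg]).sum(0) / alpha[neg].sum()
    return g_pos, g_neg

G = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
alpha = np.array([0.5, 0.5, -0.25, -0.75])
g_pos, g_neg = pushed_means(G, alpha)
```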
Min DCFs and (%EER)
Models | 30s | 10s | 3s
GMM-SVM (KL kernel) | 0.029 (3.43) | 0.085 (9.12) | 0.201 (21.3)
GMM-SVM (identity kernel) | 0.031 (3.72) | 0.087 (9.51) | 0.200 (21.0)
LF-SVM (KL kernel) | 0.026 (3.13) | 0.083 (9.02) | 0.186 (20.4)
LF-SVM (identity kernel) | 0.026 (3.11) | 0.083 (9.13) | 0.187 (20.4)
Discriminative GMMs | 0.021 (2.56) | 0.069 (7.49) | 0.174 (18.45)
LF-discriminative GMMs (KL kernel) | 0.025 (2.97) | 0.084 (9.04) | 0.186 (19.9)
LF-discriminative GMMs (identity kernel) | 0.025 (3.05) | 0.084 (9.05) | 0.186 (20.0)
Loquendo-Polito LRE09 System
Acoustic sub-systems (on acoustic features): SVM-GMMs, pushed GMMs, and MMIE GMMs.
Phonetic sub-systems: parallel streams, each consisting of a phonetic transcriber, n-gram counts, and a TFLLR SVM, feeding model training.
Phone transcribers
12 phone transcribers: French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English.
The statistics of the n-gram phone occurrences are collected from the best decoded string of each conversation segment.
ASR recognizer with a phone-loop grammar with diphone transition constraints.
Phone transcribers
10 phone transcribers: Catalan, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English.
The statistics of the n-gram phone occurrences are collected from the expected counts of a lattice for each conversation segment.
ANN models, same phone-loop grammar, different engine.
Multigrams
Two different TFLLR kernels: trigrams, and pruned multigrams.
Multigrams can provide useful information about the language by capturing "word parts" within the phone string sequences.
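A toy sketch of the TFLLR scaling (illustrative two-character vocabulary; the real systems use phone n-grams from the transcribers): relative n-gram frequencies are divided by the square root of the background n-gram probabilities, so that frequent n-grams do not swamp rare but informative ones.

```python
import numpy as np
from collections import Counter

def tfllr_features(ngram_counts, background_probs, vocab):
    """TFLLR-scaled n-gram features: relative frequency of each n-gram
    divided by the square root of its background probability."""
    total = sum(ngram_counts.values())
    return np.array([(ngram_counts.get(g, 0) / total)
                     / np.sqrt(background_probs[g])
                     for g in vocab])

vocab = ["ab", "ba", "aa"]
background = {"ab": 0.5, "ba": 0.25, "aa": 0.25}   # training-set statistics
counts = Counter({"ab": 2, "ba": 1, "aa": 1})       # one utterance
x = tfllr_features(counts, background, vocab)
```

A linear kernel on these vectors is then the TFLLR kernel between utterances.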
Scoring
The total number of models used for scoring an unknown segment is 34:
- 11 channel-dependent models (11 x 2 = 22 models)
- 12 single-channel models (2 telephone-only and 10 broadcast-only)
For the MMIE GMMs there are 23 x 2 = 46 models (channel-independent, but gender-dependent M/F).
Calibration and fusion
Each sub-system (pushed GMMs, MMIE GMMs, 1-best 3-gram SVMs, 1-best n-gram SVMs, lattice n-gram SVMs) produces a vector of scores per segment (34 scores; 46 for the MMIE GMMs). Each vector is processed by a Gaussian back-end, taking the max of the channel-dependent scores, and the back-end outputs (34 scores each) are fused by multi-class FoCal. The fused log-likelihood ratios are passed to lre_detection.
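A minimal numpy sketch of a Gaussian back-end (synthetic two-class scores; a real back-end would be trained on held-out development data): one mean per class, a shared covariance, and log-likelihoods up to a shared constant, which is sufficient for comparing classes and for fusion.

```python
import numpy as np

def gaussian_backend(scores, labels):
    """Fit a Gaussian back-end: one mean per class, shared covariance.

    Returns a function mapping a score vector to per-class log-likelihoods
    (up to a constant shared by all classes).
    """
    classes = np.unique(labels)
    means = np.array([scores[labels == c].mean(0) for c in classes])
    centered = np.vstack([scores[labels == c] - means[i]
                          for i, c in enumerate(classes)])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(scores.shape[1])
    prec = np.linalg.inv(cov)

    def loglik(x):
        d = x - means                             # (n_classes, dim)
        return -0.5 * np.einsum('cd,de,ce->c', d, prec, d)
    return loglik

rng = np.random.default_rng(4)
scores = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
labels = np.array([0] * 50 + [1] * 50)
loglik = gaussian_backend(scores, labels)
ll = loglik(np.array([3.0, 3.0, 3.0]))   # should favor class 1
```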
Language pair recognition
For the language-pair evaluation only the back-ends have been re-trained, keeping unchanged the models of all the sub-systems.
Telephone development corpora
• CALLFRIEND - Conversations split into slices of 150s
• NIST 2003 and NIST 2005
• LRE07 development corpus
• Cantonese and Portuguese data in the 22 Language OGI corpus
• RuSTeN -The Russian through Switched Telephone Network corpus
“Broadcast” development corpora
Incrementally created to include, as far as possible, the within-language variability due to channel, gender and speaker differences.
The development data, further split into training, calibration and test subsets, should cover this variability.
Problems with LRE09 dev data
- Often segments from the same speaker
- Scarcity of segments for some languages after filtering same-speaker segments
- Genders are not balanced
- Excluding French, the segments of each language are either telephone or broadcast
- No audited data available for Hindi, Russian, Spanish and Urdu on VOA3; only automatic segmentation was provided
- No segmentation was provided in the first release of the development data for Cantonese, Korean, Mandarin, and Vietnamese
- For these 8 missing languages only the language hypotheses provided by BUT were available for the VOA2 data
Additional “audited” data
For the 8 languages lacking broadcast data, segments were generated by accessing the VOA site and retrieving the original MP3 files.
Goal: collect ~300 broadcast segments per language, processed to detect narrowband fragments.
The candidates were checked to eliminate segments including music, bad channel distortions, and fragments of other languages.
Development data for bootstrap models
Telephone and audited/checked broadcast data were split into Training (50%), Development (25%) and Test (25%) sets, distributing the segments so that same-speaker segments fall in the same set.
A set of acoustic (pushed GMM) bootstrap models has been trained on these data.
Additional not-audited data from VOA3
Preliminary tests with the bootstrap models indicated the need for additional data, selected from VOA3 so as to include new speakers in the train, calibration and test sets, assuming that the file labels correctly identify the corresponding language.
Speaker selection
Performed by means of a speaker recognizer.
The audited segments are processed before the others.
A new speaker model is added to the current set of speaker models whenever the best recognition score obtained for a segment is below a threshold.
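The selection logic can be sketched as a greedy open-set labeling loop (the scoring and enrollment functions are placeholders; a real system would use the speaker recognizer's models):

```python
def select_speakers(segments, score_fn, enroll_fn, threshold):
    """Greedy open-set speaker labeling.

    segments:  iterable of segments (any representation)
    score_fn:  (segment, model) -> similarity score
    enroll_fn: segment -> new speaker model
    A new speaker model is enrolled whenever the best score against the
    current models falls below the threshold.
    """
    models = []
    assignment = []
    for seg in segments:
        scores = [score_fn(seg, m) for m in models]
        if not scores or max(scores) < threshold:
            models.append(enroll_fn(seg))
            assignment.append(len(models) - 1)    # new speaker
        else:
            assignment.append(max(range(len(scores)),
                                  key=scores.__getitem__))
    return models, assignment

# toy 1-D "segments": similarity is negative distance
segs = [0.0, 0.1, 5.0, 5.05, 0.05]
models, assign = select_speakers(segs,
                                 lambda s, m: -abs(s - m),
                                 lambda s: s,
                                 threshold=-1.0)
```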
Enriching the training set
Additional non-audited data from VOA2: language recognition was performed using a system combining the acoustic bootstrap models and a phonetic system. A segment was selected only if the 1-best language hypothesis of our system:
- had an associated score greater than a given (rather high) threshold
- matched the 1-best hypothesis provided by the BUT system
Total number of segments for this evaluation
Set | voa3_A | voa2_A | ftp_C | voa3_S | voa2_S | ftp_S
Train | 529 | 116 | 316 | 1955 | 590 | 66
Extended train | 114 | 22 | 65 | 2483 | 574 | 151
Development | 396 | 85 | 329 | 1866 | 449 | 45

Suffixes: A = audited, C = checked, S = automatic segmentation
ftp: ftp://8475.ftp.storage.akadns.net/mp3/voa/
Hausa: Decision Cost Function (DCF plot)
Hindi: Decision Cost Function (DCF plot)
Results on the development set
Average minDCF x 100 on 30s test segments:

Test on | Pushed GMMs | MMIE GMMs | 3-grams | Multigrams | Lattice | Fusion
Broadcast & telephone | 1.48 | 1.70 | 1.09 | 1.12 | 1.06 | 0.86
Broadcast subset | 1.54 | 1.69 | 1.24 | 1.26 | 1.14 | 0.91
Telephone subset | 2.00 | 2.51 | 1.45 | 1.49 | 1.42 | 1.21
Korean: score cumulative distributions for the four train/test channel conditions (b-b, t-t, t-b, b-t; b = broadcast, t = telephone).