Recent work on Language Identification
Pietro Laface
POLITECNICO di TORINO
Brno 28-06-2009 Pietro LAFACE
Team
POLITECNICO di TORINO
Pietro Laface Professor
Fabio Castaldo Post-doc
Sandro Cumani PhD student
Ivano Dalmasso Thesis Student
LOQUENDO
Claudio Vair Senior Researcher
Daniele Colibro Researcher
Emanuele Dalmasso Post-doc
Our technology progress
1. Inter-speaker compensation in feature space, for GLDS/SVM models and GMMs (ICASSP 2007)
2. SVM using GMM supervectors (GMM-SVM), introduced by MIT-LL for speaker recognition
3. Fast discriminative training of GMMs: an alternative to MMIE, exploiting the GMM-SVM separation hyperplanes (cf. MIT-LL discriminative GMMs)
4. Language factors
GMM super‑vectors
Appending the mean vectors of all the Gaussians into a single vector, we get a supervector:

\mu = [\,\mu_{11} \cdots \mu_{1p}\;\; \mu_{21} \cdots \mu_{2p}\;\; \cdots\;\; \mu_{N1} \cdots \mu_{Np}\,]^T

(N Gaussians with p-dimensional means)

We use GMM supervectors:
- without normalization, for inter-speaker/channel variation compensation
- with Kullback-Leibler normalization, for training GMM-SVM models and training discriminative GMMs
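As an illustration, a minimal sketch (toy shapes and numbers, not the actual Loquendo implementation) of stacking GMM means into a supervector, with the diagonal-covariance KL normalization that scales each mean by the square root of its mixture weight over its standard deviation:

```python
import numpy as np

def supervector(means, weights=None, stds=None, kl_normalize=False):
    """Stack GMM means (N x p) into a single supervector of length N*p.

    With kl_normalize=True each mean is scaled by sqrt(weight)/std,
    the usual diagonal-covariance Kullback-Leibler normalization.
    """
    means = np.asarray(means, dtype=float)
    if kl_normalize:
        w = np.asarray(weights, dtype=float)[:, None]   # (N, 1)
        s = np.asarray(stds, dtype=float)               # (N, p) std deviations
        means = np.sqrt(w) * means / s
    return means.reshape(-1)

# toy example: 2 Gaussians with 3-dimensional means
means = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
sv = supervector(means)                       # plain supervector
svn = supervector(means, weights=[0.25, 0.75],
                  stds=[[1, 1, 1], [2, 2, 2]], kl_normalize=True)
```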
Using a UBM in LID
1. The frame-based inter-speaker variation compensation approach estimates the compensation factors using the UBM
2. In the GMM-SVM approach all language GMMs share the weights and variances of the UBM
3. The UBM is used for fast selection of Gaussians
Speaker/channel compensation in feature space
U is a low-rank matrix (estimated offline) mapping the speaker/channel factor subspace into the supervector domain.
x(i) is a low-dimensional vector, estimated using the UBM, holding the speaker/channel factors for the current utterance i.
The compensation is applied frame by frame:

\hat{o}_t = o_t - \sum_m \gamma_m(t)\, U_m\, x(i)

where \gamma_m(t) is the occupation probability of the m-th Gaussian at frame t, and U_m is the block of U corresponding to Gaussian m.
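A minimal sketch of this frame-level compensation (hypothetical dimensions; the posteriors and factors would come from the UBM and the factor estimator):

```python
import numpy as np

def compensate_frames(O, Gamma, U, x):
    """Remove the speaker/channel component from each frame.

    O:     (T, d) observation frames
    Gamma: (T, N) occupation probabilities gamma_m(t) from the UBM
    U:     (N, d, k) per-Gaussian blocks of the low-rank subspace matrix
    x:     (k,) speaker/channel factors for the current utterance
    """
    shift = U @ x                 # (N, d): per-Gaussian shift U_m x
    return O - Gamma @ shift      # weighted by the occupation probabilities

T, d, N, k = 4, 3, 2, 2
rng = np.random.default_rng(0)
O = rng.standard_normal((T, d))
Gamma = np.full((T, N), 0.5)      # dummy posteriors summing to 1 per frame
U = rng.standard_normal((N, d, k))
x = rng.standard_normal(k)
O_hat = compensate_frames(O, Gamma, U, x)
```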
Estimating the U matrix
- Speaker recognition: estimating U with a large set of differences between models generated from different utterances of the same speaker compensates the distortions due to inter-session variability.
- Language recognition: estimating U with a large set of differences between models generated from utterances of different speakers of the same language compensates the distortions due to inter-speaker/channel variability within the same language.
GMM-SVM weakness
GMM-SVM models perform very well on rather long test utterances, but it is difficult to estimate a robust GMM from a short test utterance.
Idea: exploit the discriminative information given by the GMM-SVM hyperplanes for fast estimation of discriminative GMMs.
SVM discriminative directions
[Figure: separation hyperplanes with normal vectors w_1, w_2, w_3]

w \cdot x + b = 0

w: normal vector to the class-separation hyperplane
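As a self-contained illustration (toy data standing in for KL-space supervectors, and a tiny hinge-loss trainer instead of a production SVM package), the normal vector w of a linear SVM can be obtained like this:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Tiny linear SVM via full-batch subgradient descent on the hinge loss.

    y must be in {-1, +1}. Returns (w, b), with w normal to the
    separating hyperplane w . x + b = 0.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # margin violations
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy 2-class problem standing in for two languages in the KL space
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.3, (20, 2)), rng.normal(1, 0.3, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```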
GMM discriminative training
Shift each Gaussian of a language model along its discriminative direction, given by the vector normal to the class‑separation hyperplane in the KL space
[Figure: an utterance GMM and a language GMM shown both in feature space and in the KL space, with per-Gaussian discriminative directions w_k]

\hat{\mu}_k = \mu_k + \alpha_k\, w_k

where the shift computed in the KL space is mapped back to the feature space through the inverse KL normalization.
Experiments with 2048 GMMs
Pooled EER (%) of discriminative 2048 GMMs and GMM-SVM on the NIST LRE tasks. In parentheses, the average of the EERs of each language.

NIST LRE | 3s | 10s | 30s
1996 | 11.71 (13.71) | 3.62 (4.92) | 1.01 (1.37)
2003 | 13.56 (14.40) | 5.50 (6.02) | 1.42 (1.64)
2005 | 16.94 (17.85) | 9.73 (11.07) | 4.67 (5.81)

For comparison, 256-Gaussian MMI (Brno University, 2006 IEEE Odyssey), LRE 2005: 17.1 | 8.6 | 4.6
Pushed GMMs (MIT-LL)
Language Factors
Eigenvoice modeling, and the use of speaker factors as input features to SVMs, has recently been shown to give good results for speaker recognition compared to the standard GMM-SVM approach (Dehak et al., ICASSP 2009).
Analogy
Estimate an eigen-language space, and use the language factors as input features to SVM classifiers (Castaldo et al. submitted to Interspeech 2009).
s = \mu_{UBM} + V\,y   (s: utterance supervector, \mu_{UBM}: UBM supervector, y: language factors)
Language Factors: advantages
Language factors are low-dimensional vectors
Training and evaluating SVMs with different kernels is easy and fast: it only requires dot products of normalized language factors
Using a very large number of training examples is feasible
Small models give good performance
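A minimal sketch of the idea (hypothetical dimensions; a real system estimates y by MAP point estimation rather than plain projection): extract low-dimensional language factors from a supervector and length-normalize them for dot-product kernels.

```python
import numpy as np

rng = np.random.default_rng(2)
D, R = 100, 5                  # supervector dim, number of eigen-languages
m = rng.standard_normal(D)     # UBM mean supervector
V = np.linalg.qr(rng.standard_normal((D, R)))[0]   # orthonormal subspace

def language_factors(s):
    """Project a supervector onto the eigen-language space: s = m + V y."""
    y = V.T @ (s - m)
    return y / np.linalg.norm(y)   # length-normalize for dot-product kernels

# an utterance lying exactly on the first eigen-language direction
s = m + V @ np.array([3.0, 0.0, 0.0, 0.0, 0.0])
y = language_factors(s)
```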
Toward an eigen-language space
After compensating the nuisances of a GMM adapted from the UBM using a single utterance, residual information about the channel and the speaker remains.
However, most of the undesired variation is removed, as demonstrated by the improvements obtained with this technique.
Speaker compensated eigenvoices
First approach
Estimating the principal directions of the GMM supervectors of all the training segments before inter-speaker nuisance compensation would produce a set of language independent, “universal” eigenvoices.
After nuisance removal, however, the speaker contribution to the principal components is reduced to the benefit of language discrimination.
Eigen-language space
Second approach
Computing the differences between the GMM supervectors obtained from utterances of a polyglot speaker would compensate the speaker characteristics and enhance the acoustic components of one language with respect to the others.
Since we do not have labeled databases including polyglot speakers, we instead compute and collect the differences between GMM supervectors (already compensated in the feature domain) produced by utterances of speakers of two different languages, irrespective of speaker identity.
Eigen-language space
The number of such differences would grow with the square of the number of training utterances.
Instead, we perform Principal Component Analysis on the set of differences between the supervectors of each language and the average supervector of every other language.
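A small sketch of this construction (toy 4-dimensional "supervectors" and three synthetic languages; the real systems work on full GMM supervectors):

```python
import numpy as np

def eigen_language_space(svs_by_lang, n_components=2):
    """PCA on differences between each language's supervectors and the
    average supervector of every other language.

    svs_by_lang: list of (n_i, D) arrays, one per language.
    Returns (n_components, D) principal directions (unit rows).
    """
    means = [S.mean(axis=0) for S in svs_by_lang]
    diffs = []
    for i, S in enumerate(svs_by_lang):
        for j, mu in enumerate(means):
            if i != j:
                diffs.append(S - mu)      # differences vs other-language means
    X = np.vstack(diffs)
    X = X - X.mean(axis=0)
    # principal directions = top right-singular vectors of the centered data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_components]

rng = np.random.default_rng(3)
langs = [rng.normal(loc, 0.1, (30, 4))
         for loc in ([0, 0, 0, 0], [2, 0, 0, 0], [0, 2, 0, 0])]
E = eigen_language_space(langs, n_components=2)
```

Since the toy languages differ only along the first two dimensions, the recovered directions concentrate there.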
Training corpora
The same corpora used for the LRE07 evaluation:
- All data of the 12 languages in the CallFriend corpus
- Half of the NIST LRE07 development corpus
- Half of the OHSU corpus provided by NIST for LRE05
- The Russian through Switched Telephone Network corpus (automatic segmentation)
LRE07 30s closed set test
The language-factor system's minDCF is always better and more stable.
Pushed GMMs (MIT-LL)
Pushed eigen-language GMMs
The same pushing approach is applied to obtain discriminative GMMs from the language factors. The pushed positive and negative models are weighted sums over the training supervectors with positive and negative SVM weights \alpha_i:

g_{positive} = \frac{\sum_{i|\alpha_i>0} \alpha_i\, g_i}{\sum_{i|\alpha_i>0} \alpha_i} \qquad g_{negative} = \frac{\sum_{i|\alpha_i<0} \alpha_i\, g_i}{\sum_{i|\alpha_i<0} \alpha_i}

For the eigen-language models, the utterance supervectors are reconstructed from the language factors:

g_i = \mu_{UBM} + V\, y_i
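The weighted sums above can be sketched as follows (toy 2-dimensional supervectors and hand-picked dual weights, purely for illustration):

```python
import numpy as np

def pushed_means(G, alpha):
    """Pushed positive/negative supervectors from signed SVM dual weights.

    G:     (n, D) training supervectors
    alpha: (n,) signed dual weights (alpha_i > 0 for the target language,
           alpha_i < 0 for the competing ones)
    """
    pos, neg = alpha > 0, alpha < 0
    g_pos = (alpha[pos, None] * G[pos]).sum(0) / alpha[pos].sum()
    g_neg = (alpha[neg, None] * G[neg]).sum(0) / alpha[neg].sum()
    return g_pos, g_neg

G = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
alpha = np.array([0.5, 0.5, -0.25, -0.75])
g_pos, g_neg = pushed_means(G, alpha)
```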
Min DCFs and (%EER)
Models | 30s | 10s | 3s
GMM-SVM (KL kernel) | 0.029 (3.43) | 0.085 (9.12) | 0.201 (21.3)
GMM-SVM (identity kernel) | 0.031 (3.72) | 0.087 (9.51) | 0.200 (21.0)
LF-SVM (KL kernel) | 0.026 (3.13) | 0.083 (9.02) | 0.186 (20.4)
LF-SVM (identity kernel) | 0.026 (3.11) | 0.083 (9.13) | 0.187 (20.4)
Discriminative GMMs | 0.021 (2.56) | 0.069 (7.49) | 0.174 (18.45)
LF-discriminative GMMs (KL kernel) | 0.025 (2.97) | 0.084 (9.04) | 0.186 (19.9)
LF-discriminative GMMs (identity kernel) | 0.025 (3.05) | 0.084 (9.05) | 0.186 (20.0)
Loquendo-Polito LRE09 System
Acoustic sub-systems (on acoustic features): SVM-GMMs, pushed GMMs, and MMIE GMMs.
Phonetic sub-systems: parallel streams, each consisting of a phonetic transcriber, n-gram counts, and a TFLLR SVM, feeding model training.
Phone transcribers
12 phone transcribers: French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English.
The statistics of the n-gram phone occurrences are collected from the best decoded string of each conversation segment.
ASR recognizer with a phone-loop grammar with diphone transition constraints.
Phone transcribers
10 phone transcribers: Catalan, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English.
The statistics of the n-gram phone occurrences are collected from the expected counts of a lattice for each conversation segment.
ANN models, same phone-loop grammar, different engine.
Multigrams
Two different TFLLR kernels: trigrams, and pruned multigrams.
Multigrams can provide useful information about the language by capturing "word parts" within the phone string sequences.
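A toy sketch of the TFLLR scaling (illustrative two-character vocabulary; the real systems use phone n-grams from the transcribers): relative n-gram frequencies are divided by the square root of the background n-gram probabilities, so that frequent n-grams do not swamp rare but informative ones.

```python
import numpy as np
from collections import Counter

def tfllr_features(ngram_counts, background_probs, vocab):
    """TFLLR-scaled n-gram features: relative frequency of each n-gram
    divided by the square root of its background probability."""
    total = sum(ngram_counts.values())
    return np.array([(ngram_counts.get(g, 0) / total)
                     / np.sqrt(background_probs[g])
                     for g in vocab])

vocab = ["ab", "ba", "aa"]
background = {"ab": 0.5, "ba": 0.25, "aa": 0.25}   # training-set statistics
counts = Counter({"ab": 2, "ba": 1, "aa": 1})       # one utterance
x = tfllr_features(counts, background, vocab)
```

A linear kernel on these vectors is then the TFLLR kernel between utterances.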
Scoring
The total number of models used for scoring an unknown segment is 34:
- 11 channel-dependent models (11 x 2 = 22 models)
- 12 single-channel models (2 telephone-only and 10 broadcast-only)
For the MMIE GMMs there are 23 x 2 = 46 models (channel-independent, but gender-dependent M/F).
Calibration and fusion
Each sub-system (pushed GMMs, MMIE GMMs, 1-best 3-gram SVMs, 1-best n-gram SVMs, lattice n-gram SVMs) produces a vector of scores per segment (34 scores; 46 for the MMIE GMMs). Each vector is processed by a Gaussian back-end, taking the max of the channel-dependent scores, and the back-end outputs (34 scores each) are fused by multi-class FoCal. The fused log-likelihood ratios are passed to lre_detection.
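A minimal numpy sketch of a Gaussian back-end (synthetic two-class scores; a real back-end would be trained on held-out development data): one mean per class, a shared covariance, and log-likelihoods up to a shared constant, which is sufficient for comparing classes and for fusion.

```python
import numpy as np

def gaussian_backend(scores, labels):
    """Fit a Gaussian back-end: one mean per class, shared covariance.

    Returns a function mapping a score vector to per-class log-likelihoods
    (up to a constant shared by all classes).
    """
    classes = np.unique(labels)
    means = np.array([scores[labels == c].mean(0) for c in classes])
    centered = np.vstack([scores[labels == c] - means[i]
                          for i, c in enumerate(classes)])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(scores.shape[1])
    prec = np.linalg.inv(cov)

    def loglik(x):
        d = x - means                             # (n_classes, dim)
        return -0.5 * np.einsum('cd,de,ce->c', d, prec, d)
    return loglik

rng = np.random.default_rng(4)
scores = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
labels = np.array([0] * 50 + [1] * 50)
loglik = gaussian_backend(scores, labels)
ll = loglik(np.array([3.0, 3.0, 3.0]))   # should favor class 1
```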
Language pair recognition
For the language-pair evaluation only the back-ends have been re-trained, keeping unchanged the models of all the sub-systems.
Telephone development corpora
• CALLFRIEND - Conversations split into slices of 150s
• NIST 2003 and NIST 2005
• LRE07 development corpus
• Cantonese and Portuguese data in the 22 Language OGI corpus
• RuSTeN -The Russian through Switched Telephone Network corpus
“Broadcast” development corpora
Incrementally created to include, as far as possible, the within-language variability due to channel, gender and speaker differences.
The development data, further split into training, calibration and test subsets, should cover this variability.
Problems with LRE09 dev data
- Often segments from the same speaker
- Scarcity of segments for some languages after filtering same-speaker segments
- Genders are not balanced
- Excluding French, the segments of each language are either telephone or broadcast
- No audited data available for Hindi, Russian, Spanish and Urdu on VOA3; only automatic segmentation was provided
- No segmentation was provided in the first release of the development data for Cantonese, Korean, Mandarin, and Vietnamese
- For these 8 missing languages only the language hypotheses provided by BUT were available for the VOA2 data
Additional “audited” data
For the 8 languages lacking broadcast data, segments were generated by accessing the VOA site and retrieving the original MP3 files.
Goal: collect ~300 broadcast segments per language, processed to detect narrowband fragments.
The candidates were checked to eliminate segments including music, bad channel distortions, and fragments of other languages.
Development data for bootstrap models
Telephone and audited/checked broadcast data were split into Training (50%), Development (25%) and Test (25%) sets, distributing the segments so that same-speaker segments fall in the same set.
A set of acoustic (pushed GMM) bootstrap models has been trained on these data.
Additional not-audited data from VOA3
Preliminary tests with the bootstrap models indicated the need for additional data, selected from VOA3 so as to include new speakers in the train, calibration and test sets, assuming that the file labels correctly identify the corresponding language.
Speaker selection
Performed by means of a speaker recognizer.
The audited segments are processed before the others.
A new speaker model is added to the current set of speaker models whenever the best recognition score obtained for a segment is below a threshold.
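The selection logic can be sketched as a greedy open-set labeling loop (the scoring and enrollment functions are placeholders; a real system would use the speaker recognizer's models):

```python
def select_speakers(segments, score_fn, enroll_fn, threshold):
    """Greedy open-set speaker labeling.

    segments:  iterable of segments (any representation)
    score_fn:  (segment, model) -> similarity score
    enroll_fn: segment -> new speaker model
    A new speaker model is enrolled whenever the best score against the
    current models falls below the threshold.
    """
    models = []
    assignment = []
    for seg in segments:
        scores = [score_fn(seg, m) for m in models]
        if not scores or max(scores) < threshold:
            models.append(enroll_fn(seg))
            assignment.append(len(models) - 1)    # new speaker
        else:
            assignment.append(max(range(len(scores)),
                                  key=scores.__getitem__))
    return models, assignment

# toy 1-D "segments": similarity is negative distance
segs = [0.0, 0.1, 5.0, 5.05, 0.05]
models, assign = select_speakers(segs,
                                 lambda s, m: -abs(s - m),
                                 lambda s: s,
                                 threshold=-1.0)
```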
Enriching the training set
Additional non-audited data from VOA2: language recognition was performed using a system combining the acoustic bootstrap models and a phonetic system. A segment was selected only if the 1-best language hypothesis of our system:
- had an associated score greater than a given (rather high) threshold
- matched the 1-best hypothesis provided by the BUT system
Total number of segments for this evaluation
Set | voa3_A | voa2_A | ftp_C | voa3_S | voa2_S | ftp_S
Train | 529 | 116 | 316 | 1955 | 590 | 66
Extended train | 114 | 22 | 65 | 2483 | 574 | 151
Development | 396 | 85 | 329 | 1866 | 449 | 45

Suffixes: A = audited, C = checked, S = automatic segmentation
ftp: ftp://8475.ftp.storage.akadns.net/mp3/voa/
Hausa: Decision Cost Function (DCF plot)
Hindi: Decision Cost Function (DCF plot)
Results on the development set
Average minDCF x 100 on 30s test segments:

Test on | Pushed GMMs | MMIE GMMs | 3-grams | Multigrams | Lattice | Fusion
Broadcast & telephone | 1.48 | 1.70 | 1.09 | 1.12 | 1.06 | 0.86
Broadcast subset | 1.54 | 1.69 | 1.24 | 1.26 | 1.14 | 0.91
Telephone subset | 2.00 | 2.51 | 1.45 | 1.49 | 1.42 | 1.21
Korean: score cumulative distributions for the four train/test channel conditions (b-b, t-t, t-b, b-t; b = broadcast, t = telephone).