University of Joensuu, Dept. of Computer Science, P.O. Box 111, FIN-80101 Joensuu, Tel. +358 13 251 7959, fax +358 13 251 7955, www.cs.joensuu.fi
Automatic Speaker Recognition for Series 60 Mobile Devices
University of Joensuu, Department of Computer Science
Specom’2004, Sep 20, 2004
Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, and Pasi Fränti
Background
• Project in National FENIX programme
– New Methods and Applications in Speech Technology
• 7 research institutes
• Project partners: NRC, Lingsoft, National Bureau of Investigation, etc.
• Joensuu: Speaker Recognition
• http://cs.joensuu.fi/pages/pums
Research Group
Pasi Fränti, Professor
Juhani Saastamoinen, Project manager
Evgeny Karpov, Project researcher
Ville Hautamäki, Project researcher
Tomi Kinnunen, Researcher
Ismo Kärkkäinen, Clustering algorithms
PUMS project
Application Scenarios
Speaker Recognition
• Speaker Verification: Is this Bob’s voice? (speech + claim; a rejected claim: Imposter!)
• Speaker Identification: Whose voice is this?
Project Goal
Port speaker recognition to Series 60 mobile phone
Symbian Phones
• Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– ARM processor
– No floating-point unit!!!
[Figure: Symbian phones: Series 80, Series 60, UIQ]
Symbian OS
• Defined by Symbian consortium
• Based on EPOC
• Operating system for mobile phones
– Real-time system
– Long uptime required
• Multitasking, multithreading
Problems of Porting
• Usual considerations when porting to a phone
– GUI event-driven programming
– Platform-specific programming model
– Real-time system, exceptions
• Application-specific porting problems
– Number crunching without floating-point unit!!!
– Signal processing numerically challenging
Identification System
[Figure: block diagram of the identification system]
• Speech audio → Signal Processing / Feature Extraction → feature vectors
• Speaker Modelling: create speaker profile
• Speaker Recognition: classify input speech based on existing profiles → Decision
• Speaker Profile Database
– Add speaker profiles during training
– Read and use all profiles during recognition
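The recognition path of the diagram can be sketched as follows. This is a minimal illustration, not the project's code: it assumes each speaker profile is a VQ codebook (the slides mention VQ/GLA further on) and that the decision picks the profile with the smallest average quantization distortion. The function names and the toy profiles are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def average_distortion(feature_vectors, codebook):
    """Mean distance from each feature vector to its nearest code vector."""
    total = 0.0
    for vec in feature_vectors:
        total += min(squared_distance(vec, code) for code in codebook)
    return total / len(feature_vectors)


def identify(feature_vectors, profiles):
    """Return the name of the profile that matches the input best."""
    return min(
        profiles,
        key=lambda name: average_distortion(feature_vectors, profiles[name]),
    )


# Toy two-dimensional "profiles" (hypothetical data, for illustration only).
profiles = {
    "alice": [[0.0, 0.0], [1.0, 1.0]],
    "bob": [[10.0, 10.0], [11.0, 11.0]],
}
unknown = [[0.1, 0.2], [0.9, 1.1]]
print(identify(unknown, profiles))  # prints "alice"
```

Training would add one codebook per speaker to `profiles`; recognition reads all of them, as the diagram states.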
MFCC Signal Processing
[Figure: MFCC pipeline: Digital speech signal frame → Pre-emphasis → Time windowing → DFT → Abs → Filter bank → Log → DCT → Feature vector]
• Pre-emphasis coefficient 0.97, Hamming window, 30 triangular mel filters, base-2 logarithm, output 12 MFCCs
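The first two stages of the pipeline can be sketched in floating point as below. Only the pre-emphasis coefficient 0.97 and the choice of a Hamming window come from the slide; the function names are hypothetical, and the textbook Hamming formula is assumed because the slide does not spell it out.

```python
import math


def pre_emphasis(frame, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    return [frame[0]] + [
        frame[n] - coeff * frame[n - 1] for n in range(1, len(frame))
    ]


def hamming_window(length):
    """Textbook Hamming window; requires length >= 2."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def window_frame(frame, coeff=0.97):
    """Pre-emphasize a frame and multiply it by the window, sample by sample."""
    emphasized = pre_emphasis(frame, coeff)
    window = hamming_window(len(frame))
    return [s * w for s, w in zip(emphasized, window)]


frame = [1.0, 1.0, 1.0, 1.0, 1.0]  # toy constant frame
print(window_frame(frame))
```

The output of `window_frame` would feed the DFT stage; the remaining stages are not sketched here.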
Fixed-Point Implementation
• Numerical analysis needed for fixed-point arithmetic implementation
• Truncation and re-scaling to avoid overflows in the converted algorithm
• Minimize information loss caused by computation in fixed-point arithmetic
– Minimize relative error
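A minimal sketch of what truncation and re-scaling mean for a single multiplication. It assumes a Q15 format (15 fractional bits), which is an illustrative choice and not stated on the slide; all names are hypothetical.

```python
FRAC_BITS = 15  # assumed Q15 format: real value = integer / 2**15


def to_fixed(x, frac_bits=FRAC_BITS):
    """Convert a real number to its fixed-point integer representation."""
    return int(round(x * (1 << frac_bits)))


def to_float(q, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a real number."""
    return q / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed-point numbers, then re-scale by dropping low bits."""
    return (a * b) >> frac_bits


def relative_error(approx, exact):
    return abs(approx - exact) / abs(exact)


x, y = 0.3, 0.007
product = to_float(fixed_mul(to_fixed(x), to_fixed(y)))
print(product, relative_error(product, x * y))
```

Small operands such as `y` keep only a few significant bits after scaling, which is why the relative error, rather than the absolute error, is the quantity to minimize.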
FFT, Fixed-Point
• Frequency spectrum of speech
– Biggest source of numerical error
– Butterflies have multiplications
– Layers repeat truncation errors
• Fixed number of bits per element
– 32, native integer size in many systems
• Reference implementation: FFTGEN
– http://www.jjj.de/fft/fftgen.tgz
FFTGEN (16/16)
• Multiplication: the result of a 32 x 32-bit multiplication must fit in 32 bits, so the inputs are truncated
• FFTGEN: Truncate inputs to 16/16 bits
[Figure: FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer) = 32-bit multiplication result, split into 16 used bits and 16 crop-off bits. The used bits form (part of) the FFT layer output, a 16-bit integer. Crop-off for next layer: 16 bits!]
Info Preserving FFT (22/10)
• Approximate DFT operator F with G
• Accept a larger operator error ||F-G|| in order to preserve more signal information
– Minimize maximum relative error in scaled sine values with respect to scale; 980 good for FFT sizes up to 1024
– Truncate multiplication inputs to 22/10 bits (signal/op)
[Figure: FFT layer input (32-bit integer, 22 bits used) x FFT twiddle factor (16-bit integer, 10 bits used) = 32-bit multiplication result, split into 22 used bits and 10 crop-off bits. The used bits form (part of) the FFT layer output, a 32-bit integer. Crop-off for next layer: 10 bits]
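The scale criterion can be sketched as below: for a candidate integer scale, quantize the scaled sine values used as twiddle factors and take the largest relative error. Rounding to the nearest integer and restricting the check to the nonzero sines of the first quadrant are assumptions of this sketch; the function names are hypothetical.

```python
import math


def max_relative_error(scale, fft_size=1024):
    """Largest relative error of round(scale * sin) over the twiddle angles.

    Angles 2*pi*k/fft_size for k = 1 .. fft_size/4 are checked. The twiddle
    angles of smaller power-of-two FFT sizes are a subset of these.
    """
    worst = 0.0
    for k in range(1, fft_size // 4 + 1):
        exact = math.sin(2 * math.pi * k / fft_size)
        approx = round(scale * exact) / scale
        worst = max(worst, abs(approx - exact) / exact)
    return worst


def best_scale(candidates, fft_size=1024):
    """Pick the candidate scale with the smallest maximum relative error."""
    return min(candidates, key=lambda s: max_relative_error(s, fft_size))


print(max_relative_error(980), max_relative_error(1024))
print(best_scale([980, 1024]))  # prints 980
```

The smallest sine dominates the error: with scale 980 it lands very close to an integer, while with the power-of-two scale 1024 it does not.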
FFT Spectrum, Fixed-Point
[Figure: scatter plots for the original TIMIT signal and the TIMIT signal x 4; panels: 16/16 abs values, 22/10 abs values]
• x-axis: fixed-point FFT element abs. values
• y-axis: correct FFT element abs. values
Scale of Error in Proposed FFT
[Figure: histograms of log10 of relative error in FFT elements; panels: 16/16, 22/10]
Log10 of relative error in FFT elements
                      16/16     22/10
average              -0.775    -2.118
standard deviation    0.797     0.590
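To illustrate the metric on a single multiplication (not to reproduce the figures above), the sketch below multiplies a small signal value by a twiddle factor under both bit splits and reports log10 of the relative error. Power-of-two scaling of both operands, truncation toward zero, and all names are simplifying assumptions of this sketch.

```python
import math


def quantize(x, bits):
    """Map a real x in (-1, 1) to a signed integer with `bits` bits (1 sign bit)."""
    return int(x * (1 << (bits - 1)))


def split_mul(signal, twiddle, signal_bits, twiddle_bits):
    """Multiply with truncated operands, then crop the product for the next layer."""
    s = quantize(signal, signal_bits)
    w = quantize(twiddle, twiddle_bits)
    product = s * w
    assert -(1 << 31) <= product < (1 << 31)  # must fit in a 32-bit integer
    cropped = product >> (twiddle_bits - 1)   # drop the twiddle's fractional bits
    return cropped / (1 << (signal_bits - 1))


def log10_relative_error(approx, exact):
    err = abs(approx - exact) / abs(exact)
    return math.log10(err) if err > 0 else float("-inf")


signal = 0.001                   # small-amplitude sample (hypothetical)
twiddle = math.cos(math.pi / 4)  # a typical twiddle factor
exact = signal * twiddle
err_16_16 = log10_relative_error(split_mul(signal, twiddle, 16, 16), exact)
err_22_10 = log10_relative_error(split_mul(signal, twiddle, 22, 10), exact)
print(err_16_16, err_22_10)
```

For this sample the 22/10 split gives the smaller error, because the signal operand keeps more significant bits even though the twiddle factor keeps fewer.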
Magnitude Spectrum, Fixed-Point
• Compute complex absolute values using maximum coordinate and coordinate ratio
• Suppose |x| > |y| for z = x + i y, then
$|z| = \sqrt{x^2 + y^2} = |x| \sqrt{1 + (y/x)^2}$
• Denote the squared ratio (y/x)^2 by t
• Approximate the square root $\sqrt{1 + t}$ by a polynomial P(t)
• Constant time algorithm (vs. Newton)
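A minimal integer-only sketch of this scheme. The quadratic P(t) used here interpolates sqrt(1 + t) at t = 0, 0.5 and 1; these coefficients, the Q15 format, and the function name are illustrative assumptions and not the polynomial used by the authors.

```python
Q = 15        # assumed Q15 format for the ratio and the coefficients
C0 = 32768    # 1.0 in Q15
C1 = 15885    # about  0.484766 in Q15
C2 = -2312    # about -0.070552 in Q15


def fixed_magnitude(re, im):
    """Approximate |re + i*im| for integer inputs using integer arithmetic only."""
    big, small = max(abs(re), abs(im)), min(abs(re), abs(im))
    if big == 0:
        return 0
    ratio = (small << Q) // big      # small/big in Q15, within [0, 1]
    t = (ratio * ratio) >> Q         # squared ratio in Q15
    # Horner evaluation of P(t) = C0 + t * (C1 + t * C2)
    p = C1 + ((t * C2) >> Q)
    p = C0 + ((t * p) >> Q)
    return (big * p) >> Q            # |z| is approximately big * P(t)


print(fixed_magnitude(3000, 4000))   # close to 5000
```

The work per element is fixed (one division, a few multiplications and shifts), unlike a Newton iteration whose step count depends on the stopping rule.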
Logarithm, Fixed-Point
• Use base 2 instead of base 10
– The change of base corresponds to multiplying the output by a constant
• Standard technique:
– Return problem to interval [1,2)
– Use linear interpolation from values stored in a look-up table
– 8 bits used for indexing the look-up table values
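The standard technique can be sketched as follows, assuming a Q16 output format and a 257-entry table (256 intervals plus the end point). The table is built with floating point here for brevity; on a device without a floating-point unit it would presumably be precomputed. All names are hypothetical.

```python
import math

FRAC_BITS = 16   # assumed Q16 result: log2(x) = result / 2**16
INDEX_BITS = 8   # 8 bits index the look-up table, as on the slide
REM_BITS = FRAC_BITS - INDEX_BITS

# log2(1 + i/256) in Q16 for i = 0 .. 256
LOG2_TABLE = [
    round(math.log2(1 + i / (1 << INDEX_BITS)) * (1 << FRAC_BITS))
    for i in range((1 << INDEX_BITS) + 1)
]


def fixed_log2(x):
    """Return log2 of a positive integer x as a Q16 fixed-point integer."""
    if x <= 0:
        raise ValueError("x must be positive")
    exponent = x.bit_length() - 1            # integer part of log2(x)
    # Normalize the mantissa to [1, 2) with 16 fractional bits.
    if exponent >= FRAC_BITS:
        mantissa = x >> (exponent - FRAC_BITS)
    else:
        mantissa = x << (FRAC_BITS - exponent)
    frac = mantissa - (1 << FRAC_BITS)       # fractional part in [0, 1), Q16
    index = frac >> REM_BITS                 # top 8 bits select the interval
    rem = frac & ((1 << REM_BITS) - 1)       # low bits interpolate inside it
    lo, hi = LOG2_TABLE[index], LOG2_TABLE[index + 1]
    return (exponent << FRAC_BITS) + lo + (((hi - lo) * rem) >> REM_BITS)


print(fixed_log2(1000) / (1 << FRAC_BITS), math.log2(1000))
```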
Rest of System, Fixed-Point
• No improvement needed in VQ/GLA
• Should apply a similar technique as with FFT to other signal processing
– Pre-emphasis, utilize full 32 bits
– Time windowing, use fewer bits in windowing function
– Filter bank (FB), use fewer bits in frequency responses
– DCT, use fewer bits for the cosines
Effect of Signal Processing
• TIMIT data sets, varying number of speakers (N)
• For each N repeat (6x, 5x, 2x) train/recognize cycles (eliminate GLA initial solution randomness)
• FFTGEN: FFT with 16/16 multiplication
• Fixed-point: use proposed 22/10 FFT
• Mixed: floating-point DSP, fixed-point GLA/VQ

                  N=10 (6x)   N=20 (5x)   N=100 (2x)
FFTGEN              93.3%       68.0%       59.5%
Fixed-point         98.3%       95.0%       82.5%
Mixed              100.0%      100.0%      100.0%
Floating-point     100.0%      100.0%      100.0%
Effect of Signal Quality
• GSM/PC data: 16 aligned dual recordings
• All computations in floating-point arith.
• Signal recorded with laptop and PC mic gives average recognition rate 100%
• Signal recorded with Nokia 3660 results in average recognition rate 84.9%

                 13/16   14/16   15/16   16/16
Symbian audio      1       3       3      10
PC audio           0       0       0      17
Conclusion
• Speaker identification was ported to Symbian Series 60 mobile phone
• 22/10 bit usage in multiplication proposed instead of “standard” 16/16
• Experiments indicate that recognition accuracy improves from 68% to 95% (TIMIT, N=20 speakers)