
Page 1: Tanja Schultz Carnegie Mellon University Cairo, Egypt, May-21 2001 Data Recording, Transcription, and Speech Recognition for Egypt

Tanja Schultz

Carnegie Mellon University

Cairo, Egypt, May-21 2001

Data Recording, Transcription, and Speech Recognition for Egypt

Page 2:

Outline

Requirements for Speech Recognition

Data Requirements: audio data, pronunciation dictionary, text corpus data

Recording of Audio data

Transcription of Audio data

Initialization of an Egypt Speech Recognition Engine

Multilingual Speech Recognition

Rapid Adaptation to new Languages

Page 3:

Part 1

Requirements for Speech Recognition

Data Requirements: audio data, pronunciation dictionary, text corpus data

Recording of Audio data

Transcription of Audio data

Thanks to Celine Morel and Susanne Burger

Page 4:

Speech Recognition

[Figure: the spoken input "hello" passes through three stages: speech input and preprocessing, decoding/search over candidate hypotheses (Hello, Hale Bob, Hallo, ...), and postprocessing and synthesis (TTS).]

Page 5:

Fundamental Equation of SR

P(W|x) = [ P(x|W) * P(W) ] / P(x)

[Figure: the spoken input "hello" and the three knowledge sources behind the equation.]

Acoustic Model: A-b, A-m, A-e

Pronunciation:
  I    AI
  am   AE M
  you  J U
  are  A R
  we   V E

Language Model: "I am", "you are", "we are", ...
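As a minimal sketch of how the equation is used in decoding: P(x) is the same for every candidate word sequence W, so the recognizer can pick the W that maximizes P(x|W) * P(W). The candidate list reuses the hypotheses from page 4; the probability values and the function name `decode` are invented for illustration and do not come from any real model.

```python
# Hypothetical scores for one utterance x; all numbers are made up for illustration.
acoustic_likelihood = {"hello": 0.020, "hale bob": 0.025, "hallo": 0.018}  # P(x|W)
language_prior = {"hello": 0.60, "hale bob": 0.01, "hallo": 0.39}          # P(W)


def decode(acoustic, prior):
    """Return the hypothesis maximizing P(x|W) * P(W) and all posteriors P(W|x)."""
    joint = {w: acoustic[w] * prior[w] for w in acoustic}
    evidence = sum(joint.values())  # P(x), restricted to the listed candidates
    posterior = {w: p / evidence for w, p in joint.items()}
    best = max(joint, key=joint.get)
    return best, posterior


best, posterior = decode(acoustic_likelihood, language_prior)
print(best, round(posterior[best], 3))
```

In this toy setting "hale bob" has the highest acoustic score, but the language model prior makes "hello" the winning hypothesis, which is the role the slide assigns to P(W).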

Page 6:

SR: Data Requirements

Audio Data, Phoneme Set

Pronunciation Dictionary

Text Data

[Figure: the same three components as on page 5: acoustic model (A-b, A-m, A-e), pronunciation (am AE M, are A R, I AI, you J U, we V E), and language model ("I am", "you are", "we are", ...).]

Page 7:

Audio Data

For training and testing the SR engine, a large amount of high-quality data in the target language should be collected.

• What kind of data are needed: scenario and task
• How to collect these data: recording setup, preparation of information
• Quality of data: sampling rate, resolution
• Amount of data: number of dialogs and speakers
• Transcription of audio data

Page 8:

What kind of Audio Data

C-Star Scenario: Travel arrangement

(planning a vacation trip, booking a hotel room, ...)

Scenario is realistic and attractive to the people

Dialog between two people: One Agent: Travel assistant

One Client: Traveler, pretends to visit a specific site

Speakers get instructions about what task they have to accomplish but not HOW to do that

Role playing setup

Page 9:

How to collect Audio Data

Recording setup
• The dialog partners can NOT see each other, i.e. no face-to-face contact (in preparation for telephone and web applications)
• No non-verbal communication
• Spontaneous speech (noise effects, disfluencies, ... may occur)
• No push-to-talk; try to avoid crosstalk
• Balanced dialogs

Dialog structure, task
• Greetings and formalities between dialog partners
• Client gives information such as number of persons traveling, date of travel (arrival/departure), interests
• Client asks questions about means of transportation (train, flight), hotel or apartment modalities, visits to sights or cultural events
• Agent provides information according to the client's questions

Page 10:

Prepare Information for Client and Agent

• A: Hotel list (3-4 hotels per dialog)
• A: Transportation list (3-4 flights, train, bus schedules)
• A: List of 3-4 cultural events per dialog
• C: Information about the specific task:
  • who is traveling (e.g. client travels with partner + two kids)
  • when is s/he traveling (e.g. 2 weeks vacation trip in July)
  • where (e.g. trip to Pennsylvania, US)
  • how (e.g. direct flight to Pittsburgh, rental car)
  • what are the places of interest (CMU - Pittsburgh, Liberty Bell in Philadelphia, ...)
• Date and time of recording might be faked
• Dialog takes place at recording place
• Example sheets: Celine Morel

Page 11:

Quality and Quantity of Audio Data

Quality of data
• High-quality clean speech
• Close-speaking microphone, like Sennheiser H-420
• 16 kHz sampling rate, 16 bit resolution

Amount of data
• Minimum of 10 hours of spoken speech
• Average length of dialogs: 10 - 20 minutes
• 10 hours = 30 - 60 dialogs

Number of speakers
• As many speakers as possible (speaker independent AM)
• 30 - 60 dialogs = maximum of 120 different speakers
• Split up the speakers/dialogs into three disjoint subsets: training set, development test set, evaluation test set
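The three-way split can be sketched as follows. This is only an illustration: the function name `split_speakers`, the speaker IDs, and the 10% fractions for the development and evaluation sets are assumptions, since the slides do not state how large each subset should be. The split is done per speaker so that no speaker appears in more than one subset; because every dialog has two speakers, in practice the two partners of a dialog would have to be assigned together.

```python
import random


def split_speakers(speaker_ids, dev_fraction=0.1, eval_fraction=0.1, seed=0):
    """Partition speakers into disjoint training, development, and evaluation sets."""
    speakers = sorted(set(speaker_ids))
    random.Random(seed).shuffle(speakers)  # fixed seed keeps the split reproducible
    n_dev = max(1, round(len(speakers) * dev_fraction))
    n_eval = max(1, round(len(speakers) * eval_fraction))
    dev = set(speakers[:n_dev])
    evaluation = set(speakers[n_dev:n_dev + n_eval])
    train = set(speakers[n_dev + n_eval:])
    return train, dev, evaluation


# Hypothetical IDs for the maximum of 120 speakers mentioned on the slide.
speakers = [f"spk{i:03d}" for i in range(120)]
train, dev, evaluation = split_speakers(speakers)
print(len(train), len(dev), len(evaluation))
```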

Page 12:

Recording Tool: Total Recorder

• http://www.highcriteria.com/download/totrec.exe
• Registration fee: $11.95
• IBM compatible PC, soundcard (e.g. Soundblaster)
• Close-speaking microphone (e.g. Sennheiser H-420)
• Win95, Win98, Win2000, WinNT

[Figure: sound board, sound board driver, Total Recorder]

Page 13:

Transcription of Audio Data

• For training the SR engine, the spoken data must be transcribed manually
• Very time consuming (10-20 times real time)
• The more accurately the data are transcribed, the more valuable they are
• Since we do have the pronunciations, only word-based transcriptions are needed
• Transcription conventions from Susanne Burger: download from http://www.cs.cmu.edu/~tanja (describes the notation)
• Transcription tool: transEdit (Burger & Meier)

Page 14:

Transliteration conventions

Example:
tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja and +/cos/+ contains one restart

• Parsability: one turn per line, Tanja_0001
• Consistency
• Filter programs
• Tagging of proper names: ~Tanja
• Tagging of numbers
• Special noise markers: +uhm+
• No capitalization at the beginning of turns
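A filter program for such transcripts might look like the sketch below. It relies only on what the example line shows: a turn ID followed by a colon, tokens wrapped in `+...+` as markers, and `~` as the proper-name tag. The full convention is defined in the document by Susanne Burger and may contain more cases; the function name `parse_turn` and the decision to treat every `+...+` token (including `+/cos/+`) as a non-word marker are assumptions.

```python
import re

# Turn ID such as "tanja_0001", then a colon, then the transcribed text.
TURN_RE = re.compile(r"^(?P<turn_id>\w+_\d+):\s*(?P<text>.*)$")
# Tokens such as +uhm+, +pause+, +/cos/+ are treated as markers, not words.
MARKER_RE = re.compile(r"^\+.*\+$")


def parse_turn(line):
    """Split one transcript line into turn ID, words, markers, and proper names."""
    match = TURN_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid turn line: {line!r}")
    words, markers, names = [], [], []
    for token in match.group("text").split():
        if MARKER_RE.match(token):
            markers.append(token)
        elif token.startswith("~"):
            names.append(token[1:])   # remember the tagged proper name
            words.append(token[1:])   # and keep it as a word without the tag
        else:
            words.append(token)
    return match.group("turn_id"), words, markers, names


example = ("tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja and "
           "+/cos/+ contains one restart")
turn_id, words, markers, names = parse_turn(example)
print(turn_id, words, markers, names)
```

Keeping one turn per line is what makes this kind of line-by-line processing possible.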

Page 15:

Pronunciation Dictionary

• For each word seen in the training set, a pronunciation of this word has to be defined in terms of the phoneme set
• Define an appropriate phoneme set: the atomic sounds of the language
• Describe each word to be recognized in terms of this phoneme set
• Example in English:
  I    AI
  you  J U
• Strong grapheme-to-phoneme relation in Egypt/Arabic IF the vocalization is transcribed (romanized transcription)
• Grapheme-to-phoneme tool for Standard Arabic (collected in Tunisia and Palestine) already developed at CMU (master student Jamal Abu-Alwan)
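The two requirements on the dictionary (every training word has an entry, and every entry uses only symbols from the phoneme set) can be checked mechanically. The sketch below uses the English example entries from page 5; the function name `check_dictionary` and the data layout are assumptions made for illustration.

```python
def check_dictionary(dictionary, phoneme_set, training_words):
    """Return training words without an entry and entries using unknown phonemes."""
    missing = sorted(w for w in set(training_words) if w not in dictionary)
    invalid = {
        word: [p for p in phones if p not in phoneme_set]
        for word, phones in dictionary.items()
        if any(p not in phoneme_set for p in phones)
    }
    return missing, invalid


# Entries taken from the English example on page 5; "we" is left out on purpose.
phoneme_set = {"AI", "J", "U", "A", "R", "AE", "M", "V", "E"}
dictionary = {"I": ["AI"], "you": ["J", "U"], "are": ["A", "R"], "am": ["AE", "M"]}
training_words = "I am you are we are".split()

missing, invalid = check_dictionary(dictionary, phoneme_set, training_words)
print(missing, invalid)
```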

Page 16:

Phoneme Set (e.g. Standard Arabic)

Phon. Symbol   Trans.   Name            Arabic Symbol
SD             Sd       saad            ص
DD             Dd       daad            ض
TT             Tt       tta             ط
DS             D~       tha             ظ
E3             3        ain             ع
GH             Gh       gin             غ
F              F        fa              ف
Q              Q        qaaf            ق
K              K        kaaf            ك
L              L        lam             ل
M              M        mim             م
N              N        noon            ن
W              W        waw             و
Y              Y        yaa             ي
a              a        fatha           َ
u              u        damma           ُ
i              i        kasra           ِ
an             an       tanwin fatha    ً
un             un       tanwin damma    ٌ
in             in       tanwin kasra    ٍ
E              E        hamza           ء
AA             A~       wasla           آ
AE             Ae       hamza           أ, إ
O              O        hamza           ؤ
I              I        hamza           ئ
A              A        alif            ا
U              U        alif maksura    ى
B              B        ba              ب
TE             Te       ta marbuta      ة
T              T        ta              ت
TH             Th       sa              ث
J              J        jeem            ج
H7             7        ha              ح
H              H        ha              ه
KH             Kh       khaf            خ
D              D        daal            د
DH             Dh       thal            ذ
R              R        ra              ر
Z              Z        za              ز
S              S        seen            س
SH             Sh       sha             ش

Page 17:

Text Data

• For training the language model we need a huge corpus of text data from the same domain
• The language model helps guide the search
• Compute probabilities of words, word pairs, and word triples
• Millions of words are needed to calculate these probabilities
• Text corpus should be as close as possible to the given domain
• Writing systems must be the same
• Other text might be useful as background information
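Counting words, word pairs, and word triples can be sketched in a few lines. The example below is an assumption-laden illustration: the three-sentence corpus is the toy example from page 5, the sentence boundary symbols `<s>` and `</s>` are a common convention rather than something the slides specify, and the probability is a plain relative frequency without smoothing, which is exactly why millions of words are needed before such estimates become reliable.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams (as tuples) over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def bigram_probability(sentences, history, word):
    """Relative-frequency estimate P(word | history) = c(history, word) / c(history, *)."""
    bigrams = ngram_counts(sentences, 2)
    history_total = sum(c for (h, _), c in bigrams.items() if h == history)
    return bigrams[(history, word)] / history_total if history_total else 0.0


corpus = ["i am", "you are", "we are"]
unigrams = ngram_counts(corpus, 1)
trigrams = ngram_counts(corpus, 3)
print(unigrams[("are",)], bigram_probability(corpus, "<s>", "i"))
```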

Page 18:

Computer Requirements

Data collection
• IBM compatible PC
• High quality soundcard like Soundblaster
• Close-speaking microphone like Sennheiser H-420
• Operating system Win95
• Large hard disk: 16000 x 2 bytes per sec = approx. 30 kBytes/sec = 2 MB/min = 120 MB/hr = 1.2 GigaBytes for 10 hr of spoken speech

Speech Recognition
• Fast processor, as fast as possible
• RAM 512 MB
• Additional 2-4 GigaBytes for temporary files during training and testing

Translation
• Donna, Lori?
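The hard disk estimate on this page can be reproduced exactly. The sketch assumes uncompressed single-channel audio at 16 kHz with 2 bytes (16 bit) per sample, as implied by "16000 x 2 bytes per sec"; the names are illustrative. The exact values are slightly below the rounded figures on the slide.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling rate
BYTES_PER_SAMPLE = 2      # 16 bit resolution


def storage_bytes(hours, channels=1):
    """Uncompressed audio size in bytes for the given recording duration."""
    return int(hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)


per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE   # 32,000 bytes, about 30 kBytes
per_minute = per_second * 60                     # 1,920,000 bytes, about 2 MB
per_hour = storage_bytes(1)                      # 115,200,000 bytes, about 120 MB
ten_hours = storage_bytes(10)                    # 1,152,000,000 bytes, about 1.2 GB
print(per_second, per_minute, per_hour, ten_hours)
```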

Page 19:

Discussion

Speech recognizer in Egypt or Standard Arabic language?

Egypt
• Spoken (used) language more interesting for a human-to-human speech-to-speech translation system?
• Standardized pronunciation?
• Large text resources available in Egypt?
• Parser output follows Standard Arabic vocalization?
• Use Egypt CallHome data and pronunciation dictionaries (LDC)?

Standard Arabic
• Useful to a larger community?
• Canonical pronunciation?
• Preliminary speech recognizer and data already available at CMU
• Larger text resources available?

Do we want monolingual dialogs (agent & client) or multilingual recordings?

Page 20:

Part 2

Initialization of an Egypt Speech Recognition Engine

Multilingual Speech Recognition

Rapid Adaptation to new Languages

Page 21:

Initialization of Egypt SR Engine

Rapid initialization of an Egypt/Arabic speech recognizer?

• Pronunciation dictionary: grapheme-to-phoneme tool available if vocalization and romanization are provided by trl
• Language model: text corpora if vocalized. Apply Egypt parser for vocalization?
• Acoustic models: initialization or adaptation according to our fast adaptation approach PDTS (Polyphone Decision Tree Specialization)

Page 22:

GlobalPhone

Multilingual database
• Widespread languages
• Native speakers
• Uniformity
• Broad domain
• Huge text resources: Internet newspapers

Total sum of resources
• 15 languages so far
• 300 hours speech data
• 1400 native speakers

Languages: Arabic, Ch-Mandarin, Ch-Shanghai, English, French, German, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish

Page 23:

Speech Recognition in Multiple Languages

Goal: Speech recognition in many different languages

Problem: Only little or no training data available (costs, time)

[Figure: inputs (pronunciation rules, text data, sound system, speech data (10 hours)) and the components AM, Lex, LM, with Portuguese example entries. Lex: ela /e/l/a/, eu /e/u/, sou /s/u/. LM: "eu sou", "você é", "ela é".]

Page 24:

Speech Recognition in Multiple Languages

[Figure: same diagram as on page 23: inputs (pronunciation rules, text data, sound system, speech data) and the components AM, Lex, LM with the Portuguese example entries.]

Page 25:

Multilingual Acoustic Modeling

Step 1:
• Combine acoustic models
• Share data across languages

Page 26:

Multilingual Acoustic Modeling

Sound production is human, not language specific: International Phonetic Alphabet (IPA)

1) Universal sound inventory based on IPA: 485 sounds are reduced to 162 IPA-sound classes

2) Each sound class is represented by one “phoneme” which is trained through data sharing across languages

• m, n, s, l occur in all languages
• p, b, t, d, k, g, f and i, u, e, a, o occur in almost all languages
• No sharing of triphthongs and palatal consonants
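The idea of merging language-specific sounds into IPA classes and pooling their training data can be sketched as below. Everything concrete here is hypothetical: the tiny inventory, the language-specific phoneme names, and the feature vectors are invented, and the real system works on 485 sounds and 162 classes rather than three.

```python
from collections import defaultdict

# Hypothetical toy inventory: (language, language-specific phoneme) -> IPA symbol.
IPA_OF = {
    ("German", "m"): "m", ("Portuguese", "m"): "m", ("Turkish", "m"): "m",
    ("German", "sch"): "ʃ", ("Portuguese", "x"): "ʃ",
    ("German", "ue"): "y", ("Turkish", "ue"): "y",
}


def build_sound_classes(ipa_of):
    """Map each IPA class to the set of languages that contribute data to it."""
    classes = defaultdict(set)
    for (language, _phoneme), ipa in ipa_of.items():
        classes[ipa].add(language)
    return dict(classes)


def pool_training_data(samples, ipa_of):
    """Group training samples by IPA class so one model is trained per class."""
    pooled = defaultdict(list)
    for language, phoneme, features in samples:
        pooled[ipa_of[(language, phoneme)]].append(features)
    return dict(pooled)


classes = build_sound_classes(IPA_OF)
pooled = pool_training_data(
    [("German", "m", [0.1]), ("Turkish", "m", [0.2])], IPA_OF
)
print(len(IPA_OF), "sounds reduced to", len(classes), "classes")
```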

Page 27:

Rapid Language Adaptation

Step 2:
• Use multilingual (ML) acoustic models, borrow data
• Adapt ML acoustic models to target language

[Figure: same diagram as on page 23 (AM, Lex, LM with the Portuguese example entries).]

Page 28:

Rapid Language Adaptation

Model mapping to the target language

1) Map the multilingual phonemes to Portuguese ones based on the IPA-scheme

2) Copy the corresponding acoustic models in order to initialize Portuguese models

Problem: Contexts are language specific; how can context-dependent models be applied to a new target language?

Solution: Adaptation of multilingual contexts to the target language based on limited training data
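Steps 1 and 2 of the model mapping can be sketched as a lookup followed by a copy. The sketch covers only this initialization; the adaptation of context-dependent models (PDTS) is not shown. The model representation (a dict with a mean vector), the Portuguese phoneme names, and the function name `initialize_target_models` are all assumptions made for illustration.

```python
import copy


def initialize_target_models(multilingual_models, target_to_ipa):
    """Copy the multilingual model of the matching IPA class for each target phoneme."""
    initialized, unmapped = {}, []
    for target_phoneme, ipa in target_to_ipa.items():
        if ipa in multilingual_models:
            # Deep copy so later adaptation does not alter the multilingual model.
            initialized[target_phoneme] = copy.deepcopy(multilingual_models[ipa])
        else:
            unmapped.append(target_phoneme)
    return initialized, unmapped


# Hypothetical multilingual models, keyed by IPA class.
ml_models = {s: {"mean": [0.0, 0.0]} for s in ["e", "l", "a", "u", "s"]}
# Hypothetical Portuguese phoneme names mapped to IPA; the palatal has no shared model.
portuguese_to_ipa = {"PO_e": "e", "PO_l": "l", "PO_a": "a", "PO_lh": "ʎ"}

po_models, unmapped = initialize_target_models(ml_models, portuguese_to_ipa)
po_models["PO_e"]["mean"][0] = 1.0   # adapting the copy leaves the source untouched
print(sorted(po_models), unmapped)
```

Target phonemes without a multilingual counterpart are reported rather than silently dropped, since they would need to be trained from the limited target-language data.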

Page 29:

Language Adaptation Experiments

[Figure: bar chart of word error rate [%] (axis from 0 to 100). Bar values in order: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19. X-axis labels in order: 0, 0:15, 0:15, 0:25, 0:25, 0:25, 1:30, 16:30. Legend: Ø Tree, ML-Tree, Po-Tree, PDTS.]

Page 30:

Summary

• Multilingual database suitable for MLVCSR
  • Covers the most widespread languages
  • Language dependent recognition in 10 languages
• Language independent acoustic modeling
  • Global phoneme set that covers 10 languages
  • Data sharing through multilingual models
• Language adaptive speech recognition
  • Limited amount of language specific data

Create speech engines in new target languages using only limited data, save time and money

Page 31:

Selected Publications

• Tanja Schultz and Alex Waibel: Language Independent and Language Adaptive Acoustic Modeling. In: Speech Communication, to appear 2001.
• Alex Waibel, Petra Geutner, Laura Mayfield-Tomokiyo, Tanja Schultz, and Monika Woszczyna: Multilinguality in Speech and Spoken Language Systems. In: Proceedings of the IEEE, Special Issue on Spoken Language Processing, Volume 88(8), pp. 1297-1313, August 2000.
• Tanja Schultz and Alex Waibel: Polyphone Decision Tree Specialization for Language Adaptation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2000), Istanbul, Turkey, June 2000.

Download from http://www.cs.cmu.edu/~tanja