Tanja Schultz, Carnegie Mellon University. Cairo, Egypt, May 21, 2001. Data Recording, Transcription, and Speech Recognition for Egypt

Post on 18-Dec-2015

TRANSCRIPT

  • Slide 1
  • Tanja Schultz, Carnegie Mellon University. Cairo, Egypt, May 21, 2001. Data Recording, Transcription, and Speech Recognition for Egypt
  • Slide 2
  • Outline
      • Requirements for Speech Recognition
          • Data Requirements: Audio data, Pronunciation dictionary, Text corpus data
          • Recording of Audio data
          • Transcription of Audio data
      • Initialization of an Egyptian Speech Recognition Engine
          • Multilingual Speech Recognition
          • Rapid Adaptation to new Languages
  • Slide 3
  • Part 1: Requirements for Speech Recognition
      • Data Requirements: Audio data, Pronunciation dictionary, Text corpus data
      • Recording of Audio data
      • Transcription of Audio data
      • Thanks to Celine Morel and Susanne Burger
  • Slide 4
  • Speech Recognition
      • Processing chain: Speech Input - Preprocessing - Decoding / Search - Postprocessing - Synthesis (TTS) - Output
      • Example: input "hello"; candidates Hello, Hale Bob, Hallo, ...; output "hello"
  • Slide 5
  • Fundamental Equation of SR: P(W|x) = [ P(x|W) * P(W) ] / P(x)
      • Acoustic Model: A-b, A-m, A-e
      • Pronunciation: Am = AE M, Are = A R, I = AI, you = J U, we = VE
      • Language Model: I am, you are, we are, ...
      • Output: hello
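The fundamental equation is Bayes' rule applied to a word sequence W and an acoustic observation x. Because P(x) does not depend on W, the search can drop it and maximize the product of the acoustic-model score P(x|W) and the language-model score P(W). The decision rule below is the standard textbook form, written with the usual conditional bar; the symbol \hat{W} for the recognized word sequence is notation introduced here, not taken from the slide.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid x)
        \;=\; \arg\max_{W} \frac{P(x \mid W)\, P(W)}{P(x)}
        \;=\; \arg\max_{W} P(x \mid W)\, P(W)
```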
  • Slide 6
  • SR: Data Requirements
      • Audio Data: Acoustic Model (A-b, A-m, A-e)
      • Phoneme Set, Pronunciation Dictionary: Pronunciation (Am = AE M, Are = A R, I = AI, you = J U, we = VE)
      • Text Data: Language Model (I am, you are, we are, ...)
  • Slide 7
  • Audio Data: For training and testing the SR engine, a large amount of high-quality data in the target language should be collected
      • What kind of data is needed
          • Scenario and Task
          • How to collect these data, Recording setup
          • Preparation of Information
      • Quality of data
          • Sampling rate, resolution
      • Amount of data
          • Number of dialogs and speakers
      • Transcription of Audio Data
  • Slide 8
  • What kind of Audio Data: C-Star Scenario: Travel arrangement (planning a vacation trip, booking a hotel room, ...)
      • Scenario is realistic and attractive to the people
      • Dialog between two people:
          • One Agent: Travel assistant
          • One Client: Traveler, pretends to visit a specific site
      • Speakers get instructions about what task they have to accomplish but not HOW to do that
      • Role playing setup
  • Slide 9
  • How to collect Audio Data
      • Recording setup
          • The dialog partners can NOT see each other, i.e. no face-to-face (in preparation for telephone and web applications)
          • No non-verbal communication
          • Spontaneous speech (noise effects, disfluencies, ... may occur)
          • No push-to-talk, try to avoid crosstalk
          • Balanced dialogs
      • Dialog structure, Task
          • Greetings and formalities between dialog partners
          • Client gives information like number of persons traveling, date of travel (arrival/departure), interests
          • Client asks questions about means of transportation (train, flight), hotel or apartment modalities, visits to sights or cultural events
          • Agent provides information according to the client's questions
  • Slide 10
  • Prepare Information for Client and Agent
      • Agent: Hotel list (3-4 hotels per dialog)
      • Agent: Transportation list (3-4 flights, train, bus schedules)
      • Agent: List of 3-4 cultural events per dialog
      • Client: information about the specific task:
          • who is traveling (e.g. client travels with partner + two kids)
          • when is s/he traveling (e.g. 2 weeks vacation trip in July)
          • where (e.g. trip to Pennsylvania, US)
          • how (e.g. direct flight to Pittsburgh, rental car)
          • what are the places of interest (CMU - Pittsburgh, Liberty Bell in Philadelphia, ...)
      • Date and time of recording might be faked
      • Dialog takes place at recording place
      • Example sheets: Celine Morel
  • Slide 11
  • Quality and Quantity of Audio Data
      • Quality of data
          • High quality clean speech: close-speaking microphone, like Sennheiser H-420
          • 16 kHz sampling rate, 16 bit resolution
      • Amount of data
          • Minimum of 10 hours of spoken speech
          • Average length of dialogs 10 - 20 minutes
          • 10 hours = 30 - 60 dialogs
      • Number of speakers
          • as many speakers as possible (speaker independent AM)
          • 30 - 60 dialogs = maximum of 120 different speakers
          • Split up the speakers/dialogs into three disjoint subsets: training set, development test set, evaluation test set
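The dialog count and the three-way speaker split on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the 80/10/10 split ratio are assumptions made here, since the slide asks for three disjoint subsets but gives no proportions.

```python
import random


def dialogs_needed(total_hours, dialog_minutes):
    """Number of dialogs of a given average length needed to fill total_hours."""
    return round(total_hours * 60 / dialog_minutes)


def split_speakers(speakers, train_pct=80, dev_pct=10, seed=0):
    """Split speakers into disjoint training, development and evaluation sets.

    The 80/10/10 default is an assumption, not a value from the slides.
    """
    shuffled = list(speakers)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    evaluation = shuffled[n_train + n_dev:]
    return train, dev, evaluation


if __name__ == "__main__":
    print(dialogs_needed(10, 20), dialogs_needed(10, 10))
    # Hypothetical speaker IDs, two speakers per dialog for 60 dialogs.
    speakers = [f"spk{i:03d}" for i in range(120)]
    train, dev, evaluation = split_speakers(speakers)
    print(len(train), len(dev), len(evaluation))
```

Splitting by speaker rather than by utterance keeps every speaker in exactly one subset, which is what a speaker-independent acoustic model needs for a fair test.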
  • Slide 12
  • Recording Tool: Total Recorder, http://www.highcriteria.com/download/totrec.exe
      • Registration fee: $11.95
      • IBM compatible PC, soundcard (e.g. Soundblaster)
      • Close-speaking microphone (e.g. Sennheiser H-420)
      • Win95, Win98, Win2000, WinNT
      • Diagram: Soundboard - Soundboard Driver - Total Recorder
  • Slide 13
  • Transcription of Audio Data: For training the SR engine we need to transcribe the spoken data manually
      • Very time consuming (10-20 times real time)
      • The more accurately the data are transcribed, the more valuable they are
      • Since we do have the pronunciations, only word-based transcriptions are needed
      • Transcription convention from Susanne Burger
          • download from http://www.cs.cmu.edu/~tanja
          • Describes notation
      • Transcription tool: transEdit (Burger & Meier)
  • Slide 14
  • Transliteration conventions
      • Example: tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja and +/cos/+ contains one restart
      • Parsability - one turn per line: Tanja_0001
      • Consistency
      • Filter programs
          • tagging of proper names: ~Tanja
          • tagging of numbers
          • special noise markers: +uhm+
          • no capitalization at the beginning of turns
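The "one turn per line" rule is what makes the transcripts easy to process with filter programs. A minimal sketch of such a filter in Python follows. It is inferred only from the single example line on the slide, not from the full convention document: the reading of `+/.../+` as a restart fragment, the category names, and the function names are all assumptions.

```python
import re

# One turn per line: "<turn_id>: <transcribed words and markers>"
TURN_RE = re.compile(r"^(?P<turn_id>\w+):\s*(?P<body>.*)$")

EXAMPLE = ("tanja_0001: this sentence +uhm+ was spoken +pause+ "
           "by ~Tanja and +/cos/+ contains one restart")


def classify(token):
    """Assign a token to a category based on its marker characters."""
    if len(token) > 4 and token.startswith("+/") and token.endswith("/+"):
        return "restart"        # assumed: aborted fragment, e.g. +/cos/+
    if len(token) > 2 and token.startswith("+") and token.endswith("+"):
        return "noise"          # e.g. +uhm+, +pause+
    if len(token) > 1 and token.startswith("~"):
        return "proper_name"    # e.g. ~Tanja
    return "word"


def parse_turn(line):
    """Return (turn_id, [(token, category), ...]) for one transcript line."""
    match = TURN_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid turn line: {line!r}")
    tokens = match.group("body").split()
    return match.group("turn_id"), [(tok, classify(tok)) for tok in tokens]


def words_only(line):
    """Drop noise and restart markers; keep proper names without the tilde."""
    _, tagged = parse_turn(line)
    words = []
    for token, kind in tagged:
        if kind == "word":
            words.append(token)
        elif kind == "proper_name":
            words.append(token[1:])
    return words


if __name__ == "__main__":
    print(parse_turn(EXAMPLE)[0])
    print(" ".join(words_only(EXAMPLE)))
```

A filter like `words_only` is the kind of step that would turn transcripts into plain word sequences for language-model training, while the tagged form stays available for acoustic training.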
  • Slide 15
  • Pronunciation Dictionary: For each word seen in the training set, a pronunciation of this word has to be defined in terms of the phoneme set
      • Define an appropriate phoneme set: atomic sounds of the language
      • Describe each word to be recognized in terms of this phoneme set
      • Example in English: I = AI, you = J U
      • Strong Grapheme-to-Phoneme relation in Egyptian/Arabic IF the vocalization is transcribed, romanized transcription
      • Grapheme-to-Phoneme tool for Standard Arabic (collected in Tunisia and Palestine) already developed at CMU (master's student Jamal Abu-Alwan)
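A pronunciation dictionary of this kind is simply a mapping from words to phoneme sequences. The Python sketch below uses only the English example entries that appear on the slides (with "VE" kept as printed for "we"); the function name and the handling of missing words are assumptions for illustration.

```python
# Example entries as printed on the slides; the phoneme symbols are the
# slides' own, and "VE" is kept exactly as it appears there.
LEXICON = {
    "i": ["AI"],
    "you": ["J", "U"],
    "am": ["AE", "M"],
    "are": ["A", "R"],
    "we": ["VE"],
}


def to_phonemes(words, lexicon):
    """Expand a word sequence into phonemes and report uncovered words."""
    phonemes, missing = [], []
    for word in words:
        pronunciation = lexicon.get(word.lower())
        if pronunciation is None:
            missing.append(word)   # word needs a dictionary entry first
        else:
            phonemes.extend(pronunciation)
    return phonemes, missing


if __name__ == "__main__":
    print(to_phonemes("you are".split(), LEXICON))
    print(to_phonemes("I am hello".split(), LEXICON))
```

Reporting the missing words is useful in practice because every word in the training transcripts must have an entry before acoustic training can start.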
  • Slide 16
  • Phoneme Set (i.e. Standard Arabic)
  • Slide 17
  • Text Data: For training the language model we need a huge corpus of text data from the same domain
      • The language model helps guide the search
      • Compute probabilities of words, word pairs and word triples
      • Millions of words are needed to calculate these probabilities
      • Text corpus should be as close as possible to the given domain
      • Writing systems must be the same
      • Other text might be useful as background information
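The probabilities of words, word pairs and word triples come from counting n-grams in the text corpus. The following Python sketch shows the counting and an unsmoothed relative-frequency estimate for word pairs on a toy corpus built from the slides' own example phrases. The sentence boundary tokens `<s>` and `</s>` and all function names are assumptions; a real language model would also need smoothing and far more text.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, max_n=3):
    """Count 1-grams up to max_n-grams, padding each sentence with boundaries."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


def bigram_probability(counts, history, word):
    """Relative-frequency estimate of P(word | history), without smoothing."""
    history_count = counts[1][(history,)]
    if history_count == 0:
        return 0.0
    return counts[2][(history, word)] / history_count


# Toy corpus taken from the example phrases on the slides.
CORPUS = ["I am", "you are", "we are"]
COUNTS = count_ngrams(CORPUS)

if __name__ == "__main__":
    print(COUNTS[1][("are",)])
    print(COUNTS[3][("<s>", "we", "are")])
    print(bigram_probability(COUNTS, "you", "are"))
```

With only three sentences almost every estimate is 0 or 1, which illustrates why millions of words are needed before such counts become reliable.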
  • Slide 18
  • Computer Requirements
      • Data collection
          • IBM compatible PC
          • High quality soundcard like Soundblaster
          • Close-speaking microphone like Sennheiser H-420
          • Operating system Win95
          • Large hard disk: 16000 x 2 bytes per sec ≈ 30 kBytes/sec ≈ 2 MB/min ≈ 120 MB/hr ≈ 1.2 GigaBytes for 10 hr of spoken speech
      • Speech Recognition
          • Fast processor - as fast as possible
          • RAM 512 MB
          • Additional 2-4 GigaBytes for temporary files during training and testing
      • Translation
          • Donna, Lori?
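The hard disk estimate follows directly from the recording format (16 kHz sampling rate, 16 bit = 2 bytes per sample). A short Python check of the arithmetic is given below; the function name and the single-channel default are assumptions, since the slides do not say how many channels are recorded.

```python
def storage_bytes(hours, sample_rate_hz=16000, bytes_per_sample=2, channels=1):
    """Uncompressed PCM storage in bytes for a recording of the given length.

    channels=1 is an assumption; the slides do not state the channel count.
    """
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


if __name__ == "__main__":
    per_second = 16000 * 2                 # 32,000 bytes/sec, about 30 kBytes/sec
    per_minute = per_second * 60           # 1,920,000 bytes, about 2 MB/min
    per_hour = storage_bytes(1)            # 115,200,000 bytes, about 120 MB/hr
    ten_hours = storage_bytes(10)          # 1,152,000,000 bytes, about 1.2 GB
    print(per_second, per_minute, per_hour, ten_hours)
```

The exact figures are slightly below the rounded values on the slide, so the 1.2 GigaBytes estimate for 10 hours leaves a small margin.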
  • Slide 19
  • Discussion
      • Speech recognizer for Egyptian or Standard Arabic?
      • Egyptian
          • Is the spoken (actually used) language more interesting for a human-to-human speech-to-speech translation system?
          • Standardized pronunciation?
          • Large text resources available in Egyptian?
          • Does the parser output follow Standard Arabic vocalization?
          • Use Egyptian CallHome data and pronunciation dictionaries (LDC)?
      • Standard Arabic
          • Useful to a larger community?
          • Canonical pronunciation?
          • Preliminary speech recognizer and data already available at CMU
          • Larger text resources available?
      • Do we want monolingual dialogs (agent & client) or multilingual recordings?
  • Slide 20
  • Part 2: Initialization of an Egyptian Speech Recognition Engine
      • Multilingual Speech Recognition
      • Rapid Adaptation to new Languages
  • Slide 21
  • Initialization of Egyptian SR Engine
      • Rapid initialization of an Egyptian/Arabic speech recognizer?
      • Pronunciation dictionary: Grapheme-to-Phoneme tool available if vocalization, romanization is provided by trl
      • Language model: text corpora if vocalized
      • Apply Egyptian parser for vocalization?
      • Acoustic models: Initialization or Adaptation according to our fast adaptation approach PDTS
  • Slide 22
  • GlobalPhone Multilingual Database
      • Widespread languages
      • Native Speakers
      • Uniformity
      • Broad domain
      • Huge text resources (Internet, Newspapers)
      • Total sum of resources
          • 15 languages so far
          • 300 hours speech data
          • 1400 native speakers
      • Languages: Arabic, Ch-Mandarin, Ch-Shanghai, English, French, German, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish
  • Slide 23
  • Speech Recognition in Multiple Languages
      • Goal: Speech recognition in many different languages
      • Problem: Only few or no training data available (costs, time)
      • Sound system, Speech data (10 hours): AM
      • Pronunciation rules: Lex (ela /e/l/a/, eu /e/u/, sou /s/u/)
      • Text data: LM (eu sou, você, ela)
  • Slide 24
  • Speech Recognition in Multiple Languages
      • Sound system, Speech data: AM
      • Pronunciation rules: Lex (ela /e/l/a/, eu /e/u/, sou /s/u/)
      • Text data: LM (eu sou, você, ela)
  • Slide 25
  • Multilingual Acoustic Modeling
      • Step 1: Combine acoustic models
      • Share data across languages
  • Slide 26
  • Multilingual Acoustic Modeling: Sound production is human, not language specific: International Phonetic Alphabet (IPA)
      • 1) Universal sound inventory based on IPA: 485 sounds are reduced to 162 IPA-sound classes
      • 2) Each