unlocking audio/video content with speech recognition behrooz chitsaz director, ip strategy...

Unlocking Audio/Video Content with Speech Recognition

Behrooz Chitsaz, Director, IP Strategy, Microsoft Research
Frank Seide, Lead Researcher, Microsoft Research
Kit Thambiratnam, Researcher, Microsoft Research

Microsoft Research: Multimedia Research
  - Speech
  - Search
  - Video summarization
  - Semantic extraction
  - Face identification
  - Object recognition
  - Visual search
  - 3D Modeling

Speech Applications
  Speech as interface:
    Mobile access, Search, Automation, PC application, Web service, Text input, Dictation
  Speech as 1st class content:
    Indexing, Search, Metadata extraction, Advertising, Transcription, Meeting notes, Closed caption, Voic, Translation, Translating phone

Searching Media Today
  - meta-data
      surrounding & anchor text, URL
      top-N lists, collaborative filtering
      editorial meta-data
  - file content itself
      keyword search in audio track using speech recognition

Demo

Speech recognition: speech recognition in a nutshell
  Stages: Spectral Analysis, then Matching (Decoding): time alignment, most likely hypothesis
  Input: observations $o_1..o_T$
  Output: recognized word sequence $(w_1..w_N)\hat{}$, for example "Hello World"
  Decision rule:
    $\hat{W} = \arg\max_{(w_1..w_N)} \; p(o_1..o_T \mid w_1..w_N) \, P(w_1..w_N)$
  Knowledge sources:
    Acoustic Models: $p(o_1..o_T \mid \text{phoneme})$
    Dictionary: $P(\text{phonemes} \mid w)$
    Grammar (Language Model): $P(w_1..w_N)$

Speech recognition
  Acoustic Models: $p(o_1..o_T \mid \text{phoneme})$
  Dictionary: $P(\text{phonemes} \mid w)$
  Grammar (Language Model): $P(w_1..w_N)$
  Speech recordings + full manual transcripts
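The decision rule on the "in a nutshell" slide can be illustrated with a minimal sketch. It assumes the candidate word sequences and their scores are already available; the field names `acoustic_logp` and `lm_logp`, the function `decode`, and all probability values are hypothetical and chosen only for illustration. A real recognizer searches a very large hypothesis space rather than a short list.

```python
import math

def decode(hypotheses):
    """Pick the word sequence W that maximizes p(O|W) * P(W).

    Scores are summed in the log domain, which is equivalent to
    multiplying the probabilities but avoids numeric underflow.
    """
    best = max(hypotheses, key=lambda h: h["acoustic_logp"] + h["lm_logp"])
    return best["words"]

# Hypothetical candidates for one utterance; all numbers are invented.
candidates = [
    {"words": ("hello", "world"),
     "acoustic_logp": math.log(0.020), "lm_logp": math.log(0.0100)},
    {"words": ("hollow", "word"),
     "acoustic_logp": math.log(0.025), "lm_logp": math.log(0.0001)},
    {"words": ("yellow", "world"),
     "acoustic_logp": math.log(0.010), "lm_logp": math.log(0.0020)},
]

result = decode(candidates)
print(result)
```

In this toy data "hollow word" has the best acoustic score, but the language model prior makes "hello world" the overall winner, which is the interaction the product of the two terms is meant to capture.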
Speech recognition
  Dictionary: $P(\text{phonemes} \mid w)$, example entries:
    microscope   m:s ay:n k:n r:n ax:n s:n k:n ow:n p:e
    microsecond  m:s ay:n k:n r:n ax:n s:n eh:n k:n ax:n n:n d:e
    microsecond  m:s ay:n k:n r:n ow:n s:n eh:n k:n ax:n n:n d:e
    microsoft    m:s ay:n k:n r:n ax:n s:n ao:n f:n t:e
    microsoft    m:s ay:n k:n r:n ow:n s:n ao:n f:n t:e

Speech recognition
  Grammar (Language Model): $P(w_1..w_N)$, example word sequences:
    this is a
    this is about
    this is absolutely
    this is accomplished
    this is actually
    is a barnyard
    is a barometer
    is a baseball
    is a baseless
    is a baseline

Speech recognition: Challenges
  - Speaker accent
  - Background noise
  - Reverberation
  - Vocabulary
  - Language

lattice-based indexing
  Example: "into this bank account"
  expected benefits from indexing lattices:
    - alternative recognition candidates -> recall++
    - confidence scores -> precision++
    - (time information -> user experience)

Vocabulary Adaptation
  [Diagram. Labels: Speech, Recognizer, Word statistics, Metadata, NP extraction, Web query builder, Queries, Bing Search, Docs; Base Dict and Base LM pass through Adapt Dictionary and Adapt Language Model to give Adapted Dict and Adapted LM. Note on slide: from NLC group]

Architectural decisions
  Components: SQL Server(s), Web server(s), Media server(s)
  1. Submit audio/video to index
  2. Get back AIB
  3. Import AIB in SQL
  4. Search/Retrieve results (video, RSS feed)

Azure integration
  Cloud computing made simple
  Windows Azure + PowerShell = Cloud computing at your fingertips

Demo: media content submission

Microsoft Research
  Tell us if you are interested
  Visit us:

Thank you! Questions?
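The expected benefits listed on the lattice-based indexing slide can be sketched as a small inverted index. This is only an illustration under stated assumptions: each lattice is simplified to a flat list of `(word, start, end, posterior)` arcs, the functions `build_index` and `search` are hypothetical, and the posterior values are invented. It is not the AIB format or the indexing method used in the talk.

```python
from collections import defaultdict

def build_index(lattices):
    """Map each word to every lattice arc where it occurs.

    Indexing all arcs, not only the single best path, keeps the
    alternative recognition candidates searchable (better recall).
    """
    index = defaultdict(list)
    for doc_id, arcs in lattices.items():
        for word, start, end, posterior in arcs:
            index[word].append((doc_id, start, end, posterior))
    return index

def search(index, word, min_confidence=0.0):
    """Return hits for a word, most confident first.

    The threshold trades recall for precision; start and end times
    let a player jump straight to the matching position.
    """
    hits = [h for h in index.get(word, []) if h[3] >= min_confidence]
    return sorted(hits, key=lambda h: h[3], reverse=True)

# Hypothetical lattice for the utterance "into this bank account".
lattices = {
    "clip1": [
        ("into", 0.0, 0.3, 0.9),
        ("this", 0.3, 0.5, 0.8),
        ("bank", 0.5, 0.9, 0.6),
        ("bang", 0.5, 0.9, 0.3),      # competing candidate, same time span
        ("account", 0.9, 1.5, 0.7),
        ("a", 0.9, 1.0, 0.2),         # competing path: "a count"
        ("count", 1.0, 1.5, 0.2),
    ],
}

index = build_index(lattices)
all_hits = search(index, "bang")
confident_hits = search(index, "bang", min_confidence=0.5)
print(all_hits, confident_hits)
```

With no threshold the low-confidence alternative "bang" is still found, while raising `min_confidence` filters it out; a real system would tune that threshold and would also need to handle phrase queries across adjacent arcs.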
