speech retrieval
![Page 1: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/1.jpg)
Speech Retrieval
-By Sarang Rakhecha
![Page 2: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/2.jpg)
Agenda
• What is Speech Retrieval?
• Areas for Speech Retrieval
• Stress and Emotion Classification
• Speaker Diarization
• Speaker Diarization: Working
• Multilingual Audio Analysis
• Audio Fingerprinting
• Conclusion
![Page 3: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/3.jpg)
What is Speech Retrieval?
Speech retrieval is the retrieval of information from speech or other audio, such as music, in a number of different ways.
A speech retrieval system facilitates content-based retrieval of speech documents, i.e. audio recordings containing spoken text.
The speech retrieval process receives queries from users and, for every query, ranks the speech documents in decreasing order of the probability that they are relevant to the query.
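The ranking step can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes each speech document has already been transcribed to text by a speech recognizer, and it swaps in a smoothed unigram query-likelihood model as the relevance estimate. The function name `rank_documents`, the `smoothing` parameter and the sample transcripts are all hypothetical.

```python
import math
from collections import Counter


def rank_documents(query, transcripts, smoothing=0.5):
    """Rank transcripts by the smoothed unigram likelihood of the query.

    `transcripts` maps a document id to its transcript text (assumed to be
    the output of a speech recognizer). `smoothing` is the weight given to
    the collection-wide word distribution (Jelinek-Mercer smoothing).
    """
    doc_counts = {doc_id: Counter(text.lower().split())
                  for doc_id, text in transcripts.items()}
    collection = Counter()
    for counts in doc_counts.values():
        collection.update(counts)
    collection_size = sum(collection.values())

    scores = {}
    for doc_id, counts in doc_counts.items():
        doc_size = sum(counts.values())
        log_prob = 0.0
        for word in query.lower().split():
            p_doc = counts[word] / doc_size if doc_size else 0.0
            p_coll = collection[word] / collection_size if collection_size else 0.0
            p = (1 - smoothing) * p_doc + smoothing * p_coll
            # Words unseen in the whole collection get a tiny floor value
            # so that the logarithm stays defined.
            log_prob += math.log(p if p > 0 else 1e-12)
        scores[doc_id] = log_prob
    # Highest log-probability first, i.e. decreasing estimated relevance.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical transcripts of three recordings.
transcripts = {
    "call_01": "i would like to check my account balance please",
    "call_02": "the weather forecast says rain tomorrow",
    "call_03": "please transfer the balance to my savings account",
}
ranking = rank_documents("account balance", transcripts)
```

Here `ranking` lists the document ids from most to least likely to be relevant. Recognition errors in the transcripts would lower the quality of such a ranking, which this sketch does not try to model.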
![Page 4: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/4.jpg)
Areas for Speech Retrieval
Information can be extracted from a speech signal in a number of different ways, and thus there are several well-established speech signal analysis research fields.
1. Event detection: Divides an audio stream into segments according to audio types (silence, male speech, female speech, noise, etc.).
2. Stress and emotion classification: Attempts to discern stress level or an emotion label for a given speech signal.
3. Speaker diarization: Divides speech audio into segments according to distinct speakers; answers the question “who spoke when?”
4. Speaker recognition: Identifies a particular speaker in an audio signal; answers the question “what is the identity of the person speaking?”
5. Speech recognition: Identifies what is being communicated; answers the question “what is the person saying?”
6. Multilingual audio analysis: Multilingual speech recognition and automatic language identification; answers the question “what language is being spoken?”
7. Acoustic fingerprinting: Creates a condensed digital summary of a piece of audio that can be used to identify the audio sample or quickly locate similar items in an audio database.
![Page 5: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/5.jpg)
Stress and Emotion Classification
• Analyze a speech signal to determine the stress, strain or emotion in the speaker's voice.
• Emotion classification has been studied with images as well as text.
• Two data sets:
• The first is the Speech Under Simulated and Actual Stress (SUSAS) data set. It covers eleven speaking styles recorded under actual and simulated stress conditions.
• The second is the “How May I Help You” data set, gathered from a long series of automated help center calls made to AT&T. Each call in that corpus is labeled with one of six possible emotions.
![Page 6: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/6.jpg)
Speaker Diarization
• The process of partitioning an input audio stream into homogeneous segments, one speaker per segment.
• Answers the question: “who spoke when?”
• This is a combination of speaker segmentation and speaker clustering.
Speaker Segmentation: Finds the points in the audio stream where the speaker changes.
Speaker Clustering: Groups together speech segments based on the characteristics of the speaker.
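The clustering half can be illustrated with a small sketch. It assumes each speech segment has already been reduced to one feature vector (hypothetical per-segment averages), and it uses simple agglomerative merging on Euclidean distance between cluster centroids as a stand-in for the statistical distances that diarization systems typically use. The name `cluster_segments` and the `threshold` value are illustrative only.

```python
import math


def cluster_segments(segment_features, threshold):
    """Agglomeratively merge segments whose feature centroids are close.

    `segment_features` is a list of equal-length feature vectors, one per
    speech segment. Merging stops when the two closest clusters are
    farther apart than `threshold`.
    Returns a list of clusters, each a sorted list of segment indices.
    """
    clusters = [[i] for i in range(len(segment_features))]

    def centroid(cluster):
        dims = len(segment_features[0])
        return [sum(segment_features[i][d] for i in cluster) / len(cluster)
                for d in range(dims)]

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        if dist > threshold:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Four hypothetical segments: 0 and 2 sound alike, as do 1 and 3.
features = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.9], [5.1, 4.8]]
speakers = cluster_segments(features, threshold=1.0)
```

Each resulting cluster stands for one presumed speaker. The number of speakers is not given in advance; it falls out of the stopping threshold.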
![Page 7: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/7.jpg)
Speaker Diarization: Working
• Speech Detection: Finds the regions of the input that contain speech.
• Change Detection: Finds the change points between speakers by computing a distance metric between adjacent windows.
• Gender/Bandwidth Classification: The segments are grouped by the speaker's gender or by the bandwidth of the audio.
• Clustering: Segments from the same speaker are grouped into one cluster.
• Cluster re-combination: The clusters are tested again and merged where appropriate, to improve accuracy.
• Re-segmentation: Adds short segments to increase robustness and refines the segment boundaries.
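The change detection stage can be sketched as below. This is a simplified illustration under stated assumptions: each frame is represented by a single hypothetical feature value, and the distance between adjacent windows is just the absolute difference of their means, a stand-in for the model-based distances used in practice. The function name `detect_changes` and its parameters are not from the source.

```python
from statistics import mean


def detect_changes(frames, window, threshold):
    """Return frame indices where adjacent windows differ strongly.

    `frames` is a list of per-frame feature values (one hypothetical
    feature per frame, for brevity). At each position the mean of the
    `window` frames before it is compared with the mean of the `window`
    frames after it; local maxima of that distance above `threshold`
    are reported as change points.
    """
    distances = {}
    for i in range(window, len(frames) - window + 1):
        left = mean(frames[i - window:i])
        right = mean(frames[i:i + window])
        distances[i] = abs(left - right)

    changes = []
    for i, dist in distances.items():
        prev_dist = distances.get(i - 1, 0.0)
        next_dist = distances.get(i + 1, 0.0)
        # Keep only peaks of the distance curve above the threshold.
        if dist > threshold and dist >= prev_dist and dist > next_dist:
            changes.append(i)
    return changes


# Three synthetic "speakers" of ten frames each.
frames = [1.0] * 10 + [4.0] * 10 + [2.0] * 10
change_points = detect_changes(frames, window=4, threshold=1.0)
```

The detected indices would then delimit the segments passed on to the classification and clustering stages.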
![Page 8: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/8.jpg)
Multilingual Audio Analysis
• Identifies the language spoken by the speaker.
• Several computer science fields contribute to this research area.
• Language identification is achieved by two basic methods:
1. By extracting and then analyzing subtitles and metadata.
2. By extracting linguistic features, since metadata is often not available.
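The first method can be illustrated with a toy sketch that guesses the language of subtitle text by counting stop-word hits. The stop-word lists are tiny and purely illustrative, the function name `guess_language` is hypothetical, and a real system would rely on far richer models; the sketch only shows the idea of analyzing text that accompanies the audio.

```python
# Tiny, illustrative stop-word lists (assumed, far from complete).
STOP_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "german": {"der", "die", "und", "ist", "das", "nicht", "ich", "zu"},
    "spanish": {"el", "la", "y", "es", "de", "que", "en", "los"},
}


def guess_language(subtitle_text):
    """Guess the language of subtitle text by counting stop-word hits.

    Returns the language with the most hits, or None when no stop word
    from any list occurs in the text.
    """
    words = subtitle_text.lower().split()
    hits = {lang: sum(word in vocab for word in words)
            for lang, vocab in STOP_WORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


language = guess_language("das ist nicht der Weg zu mir")
```

When no subtitles or metadata exist, this approach has nothing to work with, which is why the second method extracts features from the speech itself.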
![Page 9: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/9.jpg)
Audio Fingerprinting
• A condensed digital summary, generated from an audio signal, that can be used to identify an audio sample.
• Prepare database: For the given audio clips, extract a “fingerprint” from each clip and store it in a database. The clip itself is not modified.
• Run: Compare fingerprints computed from an audio stream against the database of stored fingerprints to identify what is playing.
• It identifies items accurately, irrespective of the audio stream's compression and distortion.
• It is widely used to identify songs and video files.
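The two phases above can be sketched as follows. This is a minimal illustration, not a production fingerprinting scheme: it assumes a fingerprint made of one bit per pair of consecutive frames (1 when frame energy rises), which is a simplified stand-in for the spectral features real systems use. The functions `fingerprint`, `identify` and `tone`, and the synthetic clips, are all hypothetical.

```python
import math


def fingerprint(samples, frame_size=64):
    """Condense a signal into bits: 1 where frame energy rises, else 0."""
    energies = [sum(s * s for s in samples[i:i + frame_size])
                for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [1 if later > earlier else 0
            for earlier, later in zip(energies, energies[1:])]


def identify(query_print, database):
    """Return the stored clip name with the smallest Hamming distance."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(database, key=lambda name: hamming(query_print, database[name]))


def tone(wobble, n=2048):
    """Synthetic test clip: a 440 Hz tone with a slow loudness wobble."""
    return [math.sin(2 * math.pi * 440 * t / 8000)
            * (1 + 0.5 * math.sin(2 * math.pi * wobble * t / n))
            for t in range(n)]


# Prepare database: one fingerprint per clip; the clips stay unchanged.
clip_a, clip_b = tone(wobble=3), tone(wobble=5)
database = {"clip_a": fingerprint(clip_a), "clip_b": fingerprint(clip_b)}

# Run: a quieter copy of clip_a stands in for a distorted stream.
quiet_copy = [0.3 * s for s in clip_a]
match = identify(fingerprint(quiet_copy), database)
```

Because the bits depend only on whether energy rises or falls between frames, a change in overall volume leaves the fingerprint essentially unchanged. Robustness to compression or noise would need sturdier features than this sketch provides.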
![Page 10: Speech Retrieval](https://reader036.vdocuments.mx/reader036/viewer/2022082721/587738071a28ab342e8b4f6b/html5/thumbnails/10.jpg)
Conclusion
• Speech retrieval is information retrieval with a focus on speech data.
• Presented the current state of the art of speech information retrieval by discussing major research areas.
• Retrieval focuses on finding the exact relevant passages.
• Segmentation is important.
• Speech analysis applications can be used in tandem with other multimedia analysis systems to achieve a multimodal analysis approach to a common problem, e.g., event detection in surveillance video.