Speech Retrieval

By Sarang Rakhecha

Agenda

• What is Speech Retrieval?

• Areas for Speech Retrieval

• Stress and Emotion Classification

• Speaker Diarization

• Speaker Diarization: Working

• Multilingual Audio Analysis

• Audio Fingerprinting

• Conclusion

What is Speech Retrieval?

Speech retrieval is the retrieval of information from speech or other audio (including music), which can be done in a number of different ways.

A speech retrieval system facilitates content-based retrieval of speech documents, i.e. audio recordings containing spoken text.

The speech retrieval process receives queries from users and, for every query, ranks the speech documents in decreasing order of the probability that each is relevant to the query.
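
The ranking step can be sketched as follows. This is a minimal illustration, not a method taken from the slides: it assumes each speech document has already been transcribed to text, and it uses a unigram query-likelihood score with add-one smoothing as a stand-in for the relevance probability. The function name `rank_documents` and the sample transcripts are hypothetical.

```python
import math
from collections import Counter

def rank_documents(query, transcripts):
    """Rank transcripts by a smoothed unigram query-likelihood score.

    Assumption: the score log P(query | document) stands in for the
    probability that the document is relevant to the query.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in transcripts.items()}
    query_words = query.lower().split()
    vocabulary = {word for words in tokenized.values() for word in words}
    vocabulary.update(query_words)

    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        log_prob = 0.0
        for word in query_words:
            # Add-one smoothing keeps an unseen query word from zeroing the score.
            log_prob += math.log((counts[word] + 1) / (len(words) + len(vocabulary)))
        scores[doc_id] = log_prob

    # Highest score first, i.e. decreasing order of estimated relevance.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical transcripts of three speech documents.
transcripts = {
    "call_01": "the weather forecast for tomorrow is rain",
    "call_02": "please reset my account password",
    "call_03": "rain and wind are expected over the weekend",
}
ranking = rank_documents("rain forecast", transcripts)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

The document that contains both query words ranks first, the one that contains only one of them ranks second, and the unrelated one ranks last.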

Areas for Speech Retrieval

Information can be extracted from a speech signal in a number of different ways, and thus there are several well-established speech signal analysis research fields.

1. Event detection: Divides an audio stream into segments according to audio types (silence, male speech, female speech, noise, etc.).

2. Stress and emotion classification: Attempts to discern stress level or an emotion label for a given speech signal.

3. Speaker diarization: Divides speech audio into segments according to distinct speakers; answers the question “who spoke when?”

4. Speaker recognition: Identifies a particular speaker in an audio signal; answers the question “what is the identity of the person speaking?”

5. Speech recognition: Identifies what is being communicated; answers the question “what is the person saying?”

6. Multilingual audio analysis: Multilingual speech recognition and automatic language identification; answers the question “what language is being spoken?”

7. Acoustic fingerprinting: Creates a condensed digital summary of a piece of audio that can be used to identify the audio sample or quickly locate similar items in an audio database.

Stress and Emotion Classification

• Analyzes a speech signal to determine the stress, strain, or emotion in the speaker's voice.

• Emotion classification has also been studied with images and with text.

• Two data sets:

1. The Speech Under Simulated and Actual Stress (SUSAS) data set, which contains a large set of eleven speaking styles under actual and simulated conditions.

2. The “How May I Help You” data set, gathered from a long series of automated help center calls made to AT&T. Each call in that corpus is labeled with one of six possible emotions.

Speaker Diarization

• The process of partitioning an input audio stream into homogeneous segments according to distinct speakers.

• Answers the question “who spoke when?”

• It is a combination of speaker segmentation and speaker clustering.

Speaker Segmentation: Aims at finding the speaker change points in an audio stream.

Speaker Clustering: Groups together speech segments on the basis of the characteristics of the speaker.

Speaker Diarization: Working

• Speech Detection: Finds the audio/speech regions in the given input.

• Change Detection: Finds the change points between speakers by calculating a distance metric between adjacent windows.

• Gender/Bandwidth Classification: The segments are grouped by gender or bandwidth.

• Clustering: Segments that belong to the same speaker are clustered together.

• Cluster Re-combination: The clustering is re-tested to improve accuracy and performance.

• Re-segmentation: Refines the segment boundaries and adds short segments to increase robustness.
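
The change detection stage above can be sketched as follows. This is only an illustration under stated assumptions: each frame is assumed to be a short feature vector already extracted from the audio, the Euclidean distance between window means stands in for whatever distance metric a real system uses, and the names `detect_change_points`, `window`, and `threshold` are hypothetical.

```python
import math

def window_mean(frames):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dims)]

def detect_change_points(frames, window=4, threshold=1.0):
    """Return frame indices where two adjacent windows differ the most.

    Assumption: Euclidean distance between window means is the metric;
    a candidate must exceed `threshold` and be a local maximum.
    """
    distances = {}
    for i in range(window, len(frames) - window + 1):
        left = window_mean(frames[i - window:i])   # window ending just before i
        right = window_mean(frames[i:i + window])  # window starting at i
        distances[i] = math.dist(left, right)

    change_points = []
    for i, distance in distances.items():
        is_peak = (distance >= distances.get(i - 1, 0.0)
                   and distance > distances.get(i + 1, 0.0))
        if distance > threshold and is_peak:
            change_points.append(i)
    return change_points

# Synthetic features: "speaker A" for 8 frames, then "speaker B" for 8 frames.
frames = [[0.0, 0.0]] * 8 + [[3.0, 3.0]] * 8
print(detect_change_points(frames))
```

On this synthetic input the only reported change point is frame 8, where the features switch from one value to the other. The segments between change points would then be passed on to the clustering stages.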

Multilingual Audio Analysis

• Identifies the language spoken by the speaker.

• Several computer science fields contribute to this research area.

• This is achieved by two basic methods:

1. Extracting and analyzing subtitles and metadata.

2. Extracting linguistic features, since most of the time metadata is not present.
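
For the first method, one possible way to analyze subtitle text is to compare its character trigram profile with reference profiles for each candidate language. The sketch below is an assumption for illustration, not a technique named in the slides; the tiny reference sentences and the names `trigram_profile` and `identify_language` are hypothetical, and a usable system would need far larger reference texts.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count the character trigrams of a text (lower-cased, space-padded)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify_language(text, reference_profiles):
    """Return the language whose reference profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(reference_profiles,
               key=lambda language: cosine(profile, reference_profiles[language]))

# Hypothetical, deliberately tiny reference texts.
reference_profiles = {
    "english": trigram_profile(
        "the quick brown fox jumps over the lazy dog and the cat sleeps in the house"),
    "german": trigram_profile(
        "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus"),
}

print(identify_language("the dog is in the house", reference_profiles))
print(identify_language("der hund ist im haus", reference_profiles))
```

When no subtitles or metadata are available, the same comparison would have to be run on features extracted from the audio itself, which is the second method.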

Audio Fingerprinting

• A condensed digital summary, generated from an audio signal, that can be used to identify an audio sample.

• Prepare database: For each given audio clip, extract a “fingerprint” and store it in a database. The clip itself is not changed.

• Run: Compare traces, computed from an audio stream, against the database of stored fingerprints to identify what is playing.

• It identifies items accurately, irrespective of the audio stream's compression and distortion.

• It is widely used to identify songs and video files.
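
The "prepare database" and "run" steps can be sketched as follows. This is a toy illustration under stated assumptions, not a production fingerprinting scheme: the fingerprint is assumed to be the set of pairs of dominant frequency bins in consecutive frames, the frame size `FRAME` is an arbitrary choice, and the clips are synthetic tones. It only demonstrates tolerance to a level change and to querying with an excerpt, not to compression.

```python
import cmath
import math

FRAME = 64  # samples per analysis frame (assumed value)

def dominant_bin(frame):
    """Index of the strongest DFT bin in one frame (naive DFT, skipping DC)."""
    best_bin, best_magnitude = 0, -1.0
    for k in range(1, len(frame) // 2):
        coefficient = sum(x * cmath.exp(-2j * math.pi * k * n / len(frame))
                          for n, x in enumerate(frame))
        if abs(coefficient) > best_magnitude:
            best_bin, best_magnitude = k, abs(coefficient)
    return best_bin

def fingerprint(samples):
    """Condensed summary of a clip: pairs of dominant bins in consecutive frames."""
    bins = [dominant_bin(samples[i:i + FRAME])
            for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return set(zip(bins, bins[1:]))

def build_database(clips):
    """Prepare database: store one fingerprint per clip; the clips are untouched."""
    return {name: fingerprint(samples) for name, samples in clips.items()}

def identify(samples, database):
    """Run: return the stored clip sharing the most fingerprint pairs with the query."""
    query = fingerprint(samples)
    scores = {name: len(query & stored) for name, stored in database.items()}
    return max(scores, key=scores.get)

def tone_sequence(bins):
    """Synthetic clip: one pure tone per frame, at the given DFT bins."""
    return [math.sin(2 * math.pi * k * n / FRAME) for k in bins for n in range(FRAME)]

clips = {
    "song_a": tone_sequence([3, 7, 5, 9, 4, 8]),
    "song_b": tone_sequence([10, 2, 12, 6, 11, 13]),
}
database = build_database(clips)

quieter = [0.5 * x for x in clips["song_a"]]  # simulated level change
excerpt = clips["song_b"][2 * FRAME:]         # query with only part of the clip
print(identify(quieter, database), identify(excerpt, database))
```

Because only the position of the strongest bin is kept, scaling the signal does not change the fingerprint, and an excerpt still shares several pairs with the stored clip.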

Conclusion

• Within information retrieval, the focus here is on speech data (speech retrieval).

• Presented the current state of the art of speech information retrieval by discussing the major research areas.

• Retrieval focuses on the exact relevant passages.

• Segmentation is important.

• Speech analysis applications can be used in tandem with other multimedia analysis systems to achieve a multimodal analysis approach to a common problem, e.g., event detection in surveillance video.

References

1. Ryan P. Hafen and Michael J. Henry, “Speech information retrieval: a review”, published 26 July 2012.

2. https://en.wikipedia.org/wiki/Speaker_recognition

3. http://www.cs.tut.fi/kurssit/SGN-4010/ikkunointi_en.pdf

4. https://en.wikipedia.org/wiki/Speaker_diarisation

5. https://en.wikipedia.org/wiki/Speech_recognition

6. Ciprian Chelba (Google Inc.), Timothy J. Hazen (MIT Lincoln Laboratory), Bhuvana Ramabhadran (IBM TJ Watson Research Center), and Murat Saraçlar (Boğaziçi University), “Speech Retrieval”.

7. https://en.wikipedia.org/wiki/Acoustic_fingerprint
