
Voice Based Retrieval System

Prajna Bhandary

CMSC - 676 UMBC

May 4, 2020

Abstract

Voice-based systems are likely to take over text-based systems in the near future. Although we use a text-based approach for network access today, almost all roles of text can be accomplished by voice. A voice-based system is one that accepts voice as a form of input for a query and/or returns its output in voice format. This report summarizes some of the models required to develop a voice-based system. It also includes some examples of system prototypes that have already been researched.

There has been an idea circulating of using a telephone to access information, so that a computer with a proper internet connection is no longer a necessity. To achieve that, however, we have to dig deeper into speech-recognition-based information retrieval systems.

Chapter 1

Introduction

This chapter introduces the types of voice-based systems and their requirements. Voice-based retrieval systems are gaining popularity among researchers in machine learning, artificial intelligence, and neural networks. Apart from being an excellent platform for helping visually challenged individuals, they also advance the state of the technology itself.

1.1 Tasks of Voice Based Systems

There are three different tasks for a voice-based system.

1.1.1 Using text queries to retrieve spoken documents

This type is also referred to as Spoken Document Retrieval and has been considered and

researched for a long time. For example, in the last decade in the TREC (Text REtrieval

Conference) Spoken Document Retrieval track, very good retrieval performance based on ASR

one-best results for the spoken documents was obtained as compared to that on human

reference transcripts, although using relatively long queries and relatively long target documents

[5]. It was then realized that considering much shorter queries and much shorter spoken

segments with much poorer recognition accuracies should be a more realistic scenario [2].

1.1.2 Using spoken queries to retrieve text documents

This has usually been referred to as Voice Search [6]. The information to be retrieved is usually an existing text database, such as those used in directory assistance applications; lexical variations and the like exist, but primarily without recognition uncertainty on the document side.

1.1.3 Using spoken queries to retrieve spoken documents

In this case, speech recognition uncertainty exists on both sides, the queries and the documents, so this is naturally a more difficult task. This type of system has gained a lot of attention in recent times, and various technologies are being applied to it. In one example effort, the task was treated as a query-by-example problem [20]. In another, the lattices of the query and the documents were aligned and compared using a graphical model. As other examples, people have also tried to directly match the query and the content at the signal level [7]. There have been advancements in this case, but a lot of work still needs to be done here.

Resources
  Text-based: Rich resources: huge quantities of text documents are available over the internet, and the quantity continues to increase exponentially due to convenient access.
  Voice-based: Spoken/multimedia content is the new trend and can be realized even sooner given mature technologies.

Accuracy
  Text-based: Retrieval accuracy is acceptable to users; results are properly ranked and filtered.
  Voice-based: Problems with speech recognition errors, especially for spontaneous speech under adverse environments.

User-system interaction
  Text-based: Retrieved documents are easily summarised on-screen, thus easily scanned and selected by the user; the user may easily select query terms suggested for the next retrieval iteration in an interactive process.
  Voice-based: Spoken/multimedia documents are not easily summarised on-screen, thus difficult to scan and select; lacks efficient user-system interaction.

Table 1. Comparison between text-based and voice-based systems [2]

1.2 Comparison between text-based and voice-based systems

Table 1 compares voice-based and text-based information retrieval in terms of resources, accuracy, and user-system interaction.

First, consider resources: rich collections of text documents are available, and they continue to grow. Voice-based systems, on the other hand, are at a disadvantage, as not many spoken documents are available yet, though there has been a significant increase in spoken documents and in methods for accepting spoken queries.

Second, considering the accuracy with which documents are retrieved for a given query, a text-based system produces acceptable results that are properly ranked and filtered. Voice-based systems, on the other hand, suffer from speech recognition errors. The problem persists especially for spontaneous speech under adverse environments.

Third, in the case of user-system interaction, spoken/multimedia documents are not easily summarised on screen, which makes them difficult to scan and select; efficient user-system interaction is also lacking. Comparatively, text-based systems retrieve documents that are easily summarised on-screen and thus easily scanned and selected by the user. Users can easily select the query they wish, and the system can suggest query terms for the next retrieval iteration in an interactive process, which is very challenging for voice-based systems.

1.3 Methodology for voice-based systems

A voice-based system of the type that accepts voice as input and produces voice as output usually involves the following steps.

1.3.1 Speech-to-Text Conversion

The input is in speech format. A number of speech recognition techniques were used in earlier work, such as linear time-scaled word-template matching, dynamic time-warped word-template matching (find the phonemes, assemble them into words, and assemble the words into sentences), and hidden Markov models (HMMs). Of the available techniques, HMMs currently yield the best performance, owing to their computational speed and accuracy [3].
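As an illustration of the dynamic time-warped template matching mentioned above, the following toy sketch compares an input feature sequence against stored word templates. The one-dimensional "features" are invented for illustration; real recognizers operate on MFCC feature vectors.

```python
import math

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the template
                                 cost[i][j - 1],      # stretch the input
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

def recognize(query, templates):
    """Return the template word whose sequence is DTW-closest to the query."""
    return min(templates, key=lambda w: dtw_distance(query, templates[w]))

# Hypothetical one-dimensional "feature" templates for two words
templates = {"yes": [1.0, 3.0, 3.0, 1.0], "no": [2.0, 2.0, 0.5]}
print(recognize([1.1, 2.9, 1.2], templates))  # closer to the "yes" template
```

The warping lets a slowly or quickly spoken query still align with a fixed-rate template, which is exactly why this method outperformed linear time-scaling.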

1.3.2 Pattern Matching

Pattern matching matches the query words against the available database to find the document most relevant to the query. Usually, the query is divided into keywords, and the keywords are matched against the database of documents. The best-matched document is then retrieved as the output.
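A minimal sketch of this keyword-matching step, assuming the query and documents have already been reduced to plain text. Scoring here is simple keyword overlap; the document ids and texts are invented.

```python
def tokenize(text):
    return text.lower().split()

def best_match(query, documents):
    """Return the id of the document sharing the most keywords with the query."""
    q = set(tokenize(query))
    return max(documents, key=lambda doc_id: len(q & set(tokenize(documents[doc_id]))))

# Hypothetical mini-database of text documents
docs = {
    "d1": "weather forecast for baltimore today",
    "d2": "history of the maryland rail system",
}
print(best_match("what is the weather today", docs))  # d1 shares two keywords
```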

1.3.3 Text-to-Speech Conversion

The document thus fetched is in text format and needs to be converted to speech format. Text-to-speech (TTS) synthesis takes place in several steps. The TTS system takes text as input, which it first analyzes and then transforms into a phonetic description. In a further step it generates the prosody. From the information now available, it can produce a speech signal [1].

Chapter 2

Survey of Relevant Work

This section surveys relevant research on building a voice-based system. It first discusses techniques that can be used to overcome the difficulties described in Chapter 1, and then discusses systems that researchers have built to overcome these challenges.

2.1 Retrieval Accuracy for Voice based Retrieval systems

If we could recognise the spoken query with 100% accuracy, the relevant document retrieved would also be 100% accurate. Unfortunately, this is never the case: wherever speech is involved, recognition errors are inevitable, and they are neither predictable nor controllable. Many approaches have been considered to handle recognition errors here. Good examples include the use of confusion matrices or fuzzy matching techniques to tolerate recognition errors to a certain extent [8], the use of lattices rather than the 1-best output to consider multiple recognition hypotheses and thus include more correct results, and the use of subword units to handle out-of-vocabulary (OOV) words to some degree. Two major approaches discussed in the literature are as follows:

2.1.1 Lattice-based Approaches

If all utterances in the spoken segments are represented as lattices with multiple alternatives rather than the 1-best output, the probability that the correct words are included and considered is certainly higher. However, many more noisy words are also included, which causes some trouble, although they can be discriminated against using scores such as posterior probabilities, while some important words (e.g., OOV words) may still be missing. [2]

2.1.1.1 Position-Specific Posterior Lattices (PSPL)

The basic idea of PSPL is to calculate the posterior probability prob of a word W at a specific

position pos (actually the sequence ordering in a path including the word W) in a lattice for a

spoken segment d as a tuple (W, d, pos, prob). Such information is actually hidden in the lattice

L of d since in each path of L we clearly know the position (or sequence ordering) of each word.

Since it is very likely that more than one path includes the same word in the same position, we

need to aggregate over all possible paths in a lattice that include a given word at a given

position.[2]
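The aggregation step can be sketched as follows, assuming the lattice has already been expanded into its paths with their posterior probabilities (real systems aggregate directly on the lattice with a forward-backward pass rather than enumerating paths, and the example hypotheses are invented):

```python
from collections import defaultdict

def pspl(paths):
    """Aggregate (word, position) posteriors over all lattice paths.

    paths: list of (word_sequence, posterior_probability) pairs.
    Returns a dict mapping (W, pos) to prob, i.e. the (W, d, pos, prob)
    tuples described above, with the segment d fixed.
    """
    table = defaultdict(float)
    for words, prob in paths:
        for pos, w in enumerate(words):
            table[(w, pos)] += prob  # sum over every path placing w at pos
    return dict(table)

# Two recognition hypotheses for one spoken segment, posteriors summing to 1
paths = [(["voice", "search"], 0.7), (["void", "search"], 0.3)]
table = pspl(paths)
print(table[("search", 1)])  # both paths agree on "search" at position 1
```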

2.1.1.2 Confusion Network (CN)

This approach was proposed earlier to cluster the word arcs in a lattice into several strictly linear

clusters of word alternatives, referred to as the Confusion Network (CN) [9]. In each cluster,

posterior probabilities for the word alternatives are also obtained. The original goal of CN was

focused on the WER minimization for ASR, since it was shown that this structure gives better

expected word accuracy [9]. In the retrieval task here, however, we consider CN as a compact

structure representing the original lattice, giving us the proximity information of each word arc.

Figure 2. (a) The ASR lattice, (b) all paths in (a)

PSPL locates a word in a segment according to the position (or sequence ordering) of the word in a path. CN, on the other hand, clusters several words in a segment according to similar time spans and word pronunciations, as shown in Figure 3.

Figure 3. (d) the constructed PSPL structure, (e) the constructed CN structure, where W1, W2, … are words, and W1:p1 denotes W1 with its posterior probability p1 in a specific cluster [2]

2.1.2 Subword Units

Word/Subword-based lattice information was converted into a weighted finite state machine

(WFSM) in an earlier work [10]. The query word/subword sequence was then located in the

WFSM using exact-matching. A two-stage approach was used in another work [11]: audio

documents were first selected by approximated term frequencies, and then a detailed lattice

search was performed to determine the exact locations.[2]

The detailed study of subword units is a topic of its own and requires more research than can be included in this report.

2.2 User-Interaction for Voice Based Retrieval System

For voice-based information retrieval, no interactive or dialogue scenario comparable to that of text-based information retrieval exists yet. Unlike written documents with well-structured paragraphs and titles, multimedia and spoken documents are both very difficult to browse, since they are just audio/video signals. [2] There are many approaches described, but this report focuses on only three of them.

2.2.1 Multi-modal dialogues

In this concept, for a query given by the user, the retrieval system produces a topic hierarchy constructed from the retrieved spoken documents to be searched [2]. Every node in the topic hierarchy represents a cluster of retrieved documents and is labeled by a topic (or a keyword). The user's query is expanded, and the user chooses which subquery best fits the output they want. This is called a multi-modal dialogue process because the system responds with a hierarchy of sub-questions that try to locate the desired document, as shown in the figure.

Figure 4. The multi-modal dialogue scenario for convenient user-system interaction.[2]

2.2.2 Semantic Analysis of Spoken Document

In PLSA, a set of latent topic variables Tk, k = 1, 2, …, K, is defined to characterize the "term-document" co-occurrence relationships, as shown in Figure 5. Notice from the figure that documents are not linked to terms directly; the link passes through P(tj|Tk), the probability that the term tj is used in the latent topic Tk, and P(Tk|di), the likelihood that document di addresses the latent topic Tk. The PLSA model can be optimized with an EM algorithm by maximizing a carefully defined likelihood function [13].

Figure 5. Graphical representation of the Probabilistic Latent Semantic Analysis (PLSA) model.[2]
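The two-step relation in the figure can be sketched numerically. All probabilities below are invented; in practice both P(tj|Tk) and P(Tk|di) come out of EM training.

```python
# Toy PLSA mixture with K = 2 latent topics: the term-document relation is
# P(tj | di) = sum_k P(tj | Tk) * P(Tk | di).
P_t_given_T = {
    "ball": [0.6, 0.1],   # P(term | T1), P(term | T2)
    "vote": [0.1, 0.7],
    "game": [0.3, 0.2],
}
P_T_given_d = [0.9, 0.1]  # P(Tk | di) for one document di

def p_term(term):
    return sum(pt * pd for pt, pd in zip(P_t_given_T[term], P_T_given_d))

print(p_term("ball"))  # 0.6*0.9 + 0.1*0.1, roughly 0.55
```

Because the document leans heavily on topic T1, terms characteristic of T1 ("ball") get a much higher probability than terms of T2 ("vote").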

2.2.3 Key Term Extraction from Spoken Document

Key terms have long been used to identify the semantic content of documents. [2] The only difference here is that the key terms need to be extracted automatically from spoken documents, which are dynamically generated and updated from time to time. In fact, key terms have also been found useful in constructing retrieval models. One example parameter is explained below [12].

2.2.3.1 Latent Topic Significance

The latent topic significance score of a term tj with respect to a topic Tk, S_tj(Tk), is defined as:

S_tj(Tk) = [ Σ_i n(tj, di) · P(Tk | di) ] / [ Σ_i n(tj, di) · (1 − P(Tk | di)) ]

where n(tj, di) is the occurrence count of the term tj in a document di, and P(Tk|di) is obtained from a PLSA model trained on a large corpus. In the equation above, the term frequency of tj in document di, n(tj, di), is weighted by a ratio reflecting how strongly document di focuses on topic Tk, since the denominator of the ratio is the probability that document di addresses all topics other than Tk. After summation over all documents, a higher S_tj(Tk) implies that the term tj has a higher frequency in the latent topic Tk than in other latent topics and is thus more important in the latent topic Tk.
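Assuming the ratio-of-sums form described above, the score can be computed as in this sketch; the counts and topic posteriors are invented for illustration.

```python
# Latent topic significance: occurrence counts weighted by how strongly
# each document addresses topic k versus all other topics.
def lts(term_counts, p_topic_given_doc, k):
    """term_counts[i] is n(tj, di); p_topic_given_doc[i][k] is P(Tk | di)."""
    num = sum(n * p[k] for n, p in zip(term_counts, p_topic_given_doc))
    den = sum(n * (1 - p[k]) for n, p in zip(term_counts, p_topic_given_doc))
    return num / den if den else float("inf")

counts = [3, 0, 1]                            # n(tj, di) for three documents
p_T_d = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]  # invented PLSA posteriors
print(lts(counts, p_T_d, 0))  # the term leans strongly toward topic 0
```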

Chapter 3

Compare and Contrast Relevant Work

There are many systems proposed to overcome the challenges discussed in the previous chapter. This section discusses one recent system that was proposed and that has proven more efficient than other existing systems.

3.1 Voice-based system using bag of words

This section describes a proposed model for a voice-based system that could overcome the challenges mentioned in the sections above. Many models have been proposed in various papers; this one was the most efficient among them.

The IR model follows these steps:

● Speech-based request (input)

● Creating a bag of words (BoW)

● Pattern matching

● Text-to-speech reply (output)

3.1.1 Speech-based Request (input)

The input is in the form of voice, which needs to be converted into text. The conversion of speech to text can be done in multiple ways, but there are many challenges in doing so, foremost of which is recognising speech. The hidden Markov model (HMM) has traditionally been used: the recorded signal is compared with the original signal, with MFCC used for feature extraction [1], and the result is saved as a text document. This type of pattern matching is adopted only in the few cases where the questions are pre-entered. [3]

In this model, fuzzy logic is used to match speech across different accents. For example, the word "Vector" has different pronunciations, so the fuzzy logic represents every word by a fuzzy set. Since this is too specific to fit a generic model of speech recognition, we can use a more general model that fuzzifies phonemes. This model is applied to spoken sentences: one fuzzy set is based on accents, a second on speeds of pronunciation, and a third on emphasis.
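The cited papers build fuzzy sets over phonemes; as a loose stand-in that conveys the same tolerant-matching idea, the sketch below uses Python's difflib to map an accent-distorted recognized word back to the closest vocabulary entry. The vocabulary and cutoff are invented for illustration.

```python
import difflib

VOCAB = ["vector", "victor", "detector"]

def fuzzy_match(heard, vocab=VOCAB, cutoff=0.6):
    """Return the vocabulary word closest to the recognized token,
    or None if nothing is similar enough."""
    matches = difflib.get_close_matches(heard, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_match("vektor"))  # an accented/garbled "vector" still matches
```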

3.1.2 Creating a Bag of Words

A bag-of-words is a representation of text that describes the occurrence of words within a

document. It involves two things:

A. A vocabulary of known words.

B. A measure of the presence of known words.

The steps followed while using a bag of words are:

1. Collect Data: This step includes collecting all the documents that are necessary for the

system we are building.

2. Create Vocabulary: The documents are tokenized into meaningful words, which form the vocabulary.

3. Create Document Vector: The tokenized words are then vectorised to create document

vectors.

4. Managing Vocabulary: The words can be grouped together if they are similar to a certain

topic.

5. Scoring Words: The words are scored based on their occurrences in a document.

6. TF-IDF: The term frequency and the inverse document frequency are calculated.
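The steps above can be sketched end-to-end on a toy corpus (step 4, vocabulary management, is skipped for brevity, and the three documents are invented):

```python
import math
from collections import Counter

# Step 1: a toy document collection
docs = ["voice based retrieval", "text based retrieval system", "voice search"]

# Steps 2-3: tokenize, build the vocabulary, and count words per document
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for toks in tokenized for w in toks))
counts = [Counter(toks) for toks in tokenized]

# Steps 5-6: score each word by term frequency times inverse document frequency
def tf_idf(word, i):
    tf = counts[i][word] / len(tokenized[i])
    df = sum(1 for c in counts if word in c)  # documents containing the word
    return tf * math.log(len(docs) / df)

# Document vectors over the vocabulary
vectors = [[tf_idf(w, i) for w in vocab] for i in range(len(docs))]
print(vocab)
```

Words that occur in every document get an IDF of zero, while rarer words score higher, which is what makes the resulting document vectors useful for matching.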

3.1.3 Pattern Matching

The Boyer-Moore (BM) algorithm is used, which positions the pattern over the leftmost characters in the text and attempts to match it from right to left. If no mismatch occurs, the pattern is found; otherwise, the algorithm computes a shift, an amount by which the pattern is moved to the right before a new matching is undertaken. The shift is computed using two heuristics: [1]

A. Match heuristic

Shift so that all characters previously matched still match, and

B. Occurrence heuristic

Shift so that the text character that caused the mismatch is aligned with its rightmost occurrence in the pattern.
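A sketch of the algorithm using only the occurrence (bad-character) heuristic; the full Boyer-Moore algorithm adds the match (good-suffix) heuristic, omitted here for brevity.

```python
def boyer_moore_search(text, pattern):
    """Return the index of the first match of pattern in text, or -1.
    Uses right-to-left comparison plus the occurrence-heuristic shift."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    last = {c: i for i, c in enumerate(pattern)}  # rightmost index of each char
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1                       # compare from right to left
        if j < 0:
            return s                     # full match found at offset s
        bad = text[s + j]                # character that caused the mismatch
        s += max(1, j - last.get(bad, -1))  # align its rightmost occurrence
    return -1

print(boyer_moore_search("voice based retrieval", "based"))  # offset 6
```

Because a mismatch on a character absent from the pattern shifts by the whole pattern length, BM often skips large chunks of text, which is why it suits matching query keywords against a large document store.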

3.1.4 Text-to-Speech Reply

After getting the text, the system must analyze it and then transform it into a phonetic description. An NLP module and a digital signal processing (DSP) module are used for this purpose; the DSP module transforms the symbolic information it receives into audible speech as follows:

A. Text analysis:

First, the text is segmented into tokens. The token-to-word conversion creates the orthographic form of the token. For example, "Mr" becomes "mister", and a number like "2" is transformed into "two". There are some rules that need to be considered here, as not all words are pronounced the same way.
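The token-to-word conversion can be sketched as a table lookup. The abbreviation and number tables below are tiny invented stand-ins for the much larger rule sets real systems use, and the output is lowercased.

```python
# Toy token-to-word conversion for the text-analysis step.
ABBREVIATIONS = {"mr": "mister", "dr": "doctor"}
NUMBERS = {"2": "two", "4": "four"}

def normalize(token):
    t = token.lower().rstrip(".")
    return ABBREVIATIONS.get(t, NUMBERS.get(t, t))

print(" ".join(normalize(t) for t in "Mr Smith has 2 dogs".split()))
```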

There is also the problem of applying pronunciation rules. After text analysis is completed, pronunciation rules can be applied. Silent letters in a word (as in "caught") or letters mapping to several phonemes (as in "maximum") need to be handled based on those silent letters and phoneme characters. Two solutions have been proposed for this problem:

1. Dictionary-based solution: a dictionary is used in which all possible word forms are stored.

2. Rule-based solution: rules are generated from the phonological knowledge of dictionaries; only words whose pronunciation is exceptional are included.

The two approaches differ significantly in the size of their dictionaries. The dictionary-based solution's dictionary is many times larger than the rule-based solution's dictionary of exceptions. However, dictionary-based solutions can be more exact than rule-based solutions if a large enough phonetic dictionary is available.

Chapter 4

Conclusions

The different approaches described in Chapter 2, Section 2.1, had varied results with respect to index size, as shown in Figure 6.

Figure 6. The tradeoff between MAP and index size for the different approaches considered [2].

There has been a significant rise in proposed voice-based retrieval systems that solve the challenges discussed in Chapter 2.

The proposed model described in Chapter 3 produced a comparison of voice inputs processed against voice replies, as shown in Figure 7.

Figure 7. Number of Voice Input vs. Number of Voice Output.

Figure 8. Number of Query vs. Index Matching Accuracy.

Figure 9. Number of Query vs. Pattern Matching Accuracy.

The number of indices matched and the number of patterns matched are calculated and shown in Figure 8 and Figure 9, respectively. The number of query-index matches increases proportionally with the amount of query data and the accents involved. The number of pattern matches varies up and down depending on the patterns matched and the data available on the DS.

References

[1] R. Uma, B. Latha, "An efficient voice based information retrieval using bag of words based indexing", International Journal of Engineering & Technology.

[2] Lin-shan Lee and Yi-cheng Pan, "Voice-based Information Retrieval: how far are we from the text-based information retrieval?", IEEE, 2009.

[3] Kiruthika M, Priyadarsini S, Rishwana Roshan K, Shifana Parvin V.M, Dr. G. Umamaheshwari, "Voice Based Information Retrieval System", International Journal of Innovative Research in Science, Engineering and Technology.

[4] Personal Voice Based Information Retrieval System, patent.

[5] http://trec.nist.gov/

[6] Y. Wang, D. Yu, Y.-C. Ju, A. Acero, "An Introduction to Voice Search", IEEE Signal Processing Magazine, May 2008, pp. 29-38.

[7] T. K. Chia, K. C. Sim, H. Li, H. T. Ng, "A Lattice-based Approach to Query-by-Example Spoken Term Retrieval", SIGIR 2008, pp. 363-370.

[8] J. Mamou, B. Ramabhadran, "Phonetic Query Expansion for Spoken Document Retrieval", Interspeech 2008, pp. 2106-2109.

[9] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks", Computer Speech and Language, vol. 14, no. 4, pp. 373-400, Oct. 2000.

[10] M. Saraclar and R. Sproat, "Lattice-based search for spoken utterance retrieval", HLT 2004.

[11] P. Yu, K. J. Chen, C. Y. Ma, and F. Seide, "Vocabulary-independent indexing of spontaneous speech", IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 635-643, 2005.

[12] Sheng-Yi Kong and Lin-shan Lee, "Improved Spoken Document Summarization Using Probabilistic Latent Semantic Analysis (PLSA)", International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006, pp. I-941-944.

[13] T. Hofmann, "Probabilistic latent semantic analysis", Uncertainty in Artificial Intelligence, 1999.

[14] Ondrej Chum, James Philbin, Josef Sivic, Michael Isard, and Andrew Zisserman, "Total recall: Automatic query expansion with a generative feature model for object retrieval", ICCV, pp. 1-8, 2007.

[15] Hervé Jégou, Matthijs Douze, and Cordelia Schmid, "Improving bag-of-features for large scale image search", International Journal of Computer Vision, 87(3):316-336, 2010.

[16] Sachin Lakra et al., "Application of fuzzy mathematics to speech-to-text conversion by elimination of paralinguistic content", arXiv preprint arXiv:1209.4535, 2012.

[17] Florian Kleber, Markus Diem, and Robert Sablatnig, "Form classification and retrieval using bag of words with shape features of line structures", IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2013.

[18] Rupali S. Chavan, Ganesh S. Sable, "An Overview of Speech Recognition Using HMM", IJCSMC, vol. 2, issue 6, June 2013.