

White paper on Machine Learning (Speech to Text Conversion using Android Platform)

Group 10
Apurva Mittal (20141009)
Ketan Gyanchandani (20141028)
Riya Giri (20141058)
Sanjeev Kumar (20141063)
Saurabh Ojha (20141064)
Vikash Kumar (20141072)


Introduction:

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. The process of machine learning is similar to that of data mining: both systems search through data to look for patterns. However, instead of extracting data for human comprehension, machine learning uses that data to improve the program's own understanding. Machine learning programs detect patterns in data and adjust program actions accordingly. For example, Facebook's News Feed changes according to the user's personal interactions with other users: if a user frequently tags a friend in photos, writes on his wall, or "likes" his links, the News Feed will show more of that friend's activity due to presumed closeness.

Essentially, it is a method of teaching computers to make and improve predictions or behaviours based on some data. What is this "data"? Well, that depends entirely on the problem. It could be readings from a robot's sensors as it learns to walk, or the correct output of a program for certain input. Another way to think about machine learning is that it is "pattern recognition" - the act of teaching a program to react to or recognize patterns.

Speech recognition (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or simply "speech to text" (STT).

Background:

For the past several decades, designers have processed speech for a wide variety of applications ranging from mobile communications to automatic reading machines. Speech has not been used much in the field of electronics and computers due to the complexity and variety of speech signals and sounds. However, with modern processes, algorithms, and methods we can process speech signals easily and recognize the text.


Speech recognition is usually processed in middleware; the results are transmitted to the user applications.

Speech recognition on the Android platform is done via the Internet, by connecting to Google's servers. Speech recognition for the voice input uses algorithms based on hidden Markov models (HMM) and an N-gram language model; this is currently the most successful and most flexible approach to speech recognition. This application is adapted to input messages in English.

A hidden Markov model (HMM) is a statistical Markov model in which the system being modelled is assumed to be a Markov process with unobserved (hidden) states.

Markov Model:

In simple Markov models the state is directly visible, or known, to the observer, and therefore the state transition probabilities are the only parameters. Let us take a four-state Markov model of the weather. Suppose that once on any day (e.g. in the morning), the weather is observed to be in one of the following states, with the state transition probabilities as shown in the figure:

• State 1: cloudy

• State 2: sunny

• State 3: rainy

• State 4: windy


Fig. Markov Model for weather

Now, from the above figure, the pattern of weather over a period of time can easily be predicted, since the initial state is known and the probabilities of occurrence of the various states are known. For example, the probability of getting the sequence "sunny, rainy, sunny, cloudy, cloudy" (states 2, 3, 2, 1, 1) is given by:

P(O | A, π) = π2 · a23 · a32 · a21 · a11

where

A = | a11 a12 a13 a14 |
    | a21 a22 a23 a24 |
    | a31 a32 a33 a34 |
    | a41 a42 a43 a44 |

is the state transition probability matrix, and

π = [ π1  π2  π3  π4 ]

is the initial state probability vector.

Thus we can see that in a simple Markov model, the probability of a sequence of events can easily be computed once the initial state probabilities and the transition probabilities are known.
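As an illustration, the following is a minimal Java sketch of how such a sequence probability could be computed; the transition and initial probabilities are made-up placeholder values, since the figure's actual numbers are not reproduced in the text.

// Minimal sketch of the four-state weather Markov chain described above.
// The probabilities below are illustrative placeholders, not the values
// from the figure. States are zero-indexed: 0 = cloudy, 1 = sunny,
// 2 = rainy, 3 = windy.
public class WeatherMarkovChain {

    static final double[][] A = {           // state transition probabilities
        {0.40, 0.30, 0.20, 0.10},           // from cloudy
        {0.30, 0.40, 0.20, 0.10},           // from sunny
        {0.30, 0.20, 0.40, 0.10},           // from rainy
        {0.25, 0.25, 0.25, 0.25}            // from windy
    };
    static final double[] PI = {0.25, 0.25, 0.25, 0.25}; // initial state probabilities

    // P(O | A, pi) = pi[o0] * a[o0][o1] * a[o1][o2] * ... for a visible state sequence.
    static double sequenceProbability(int[] states) {
        double p = PI[states[0]];
        for (int t = 1; t < states.length; t++) {
            p *= A[states[t - 1]][states[t]];
        }
        return p;
    }

    public static void main(String[] args) {
        int[] observed = {1, 2, 1, 0, 0};   // sunny, rainy, sunny, cloudy, cloudy
        System.out.printf("P(O | A, pi) = %.6f%n", sequenceProbability(observed));
    }
}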

Hidden Markov Model:

The Markov model used in the speech-to-text model is known as the Hidden Markov Model. The model is called "hidden" because the state is not directly visible. Even though the state is not directly visible, the output, which depends on the state, is visible. Each state has a probability distribution over the possible outputs, so the sequence of outputs generated by an HMM gives some information about the sequence of states. It is the state sequence through which the model passes that is hidden, not the parameters of the model; even if the model parameters are known exactly, the model is still 'hidden'.

Hidden Markov models have application in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, and bioinformatics.

We can have a better understanding of the HMM model by looking at the Urn & Ball Model.

Urn & Ball Model: Hidden Markov Model

Let's consider that there are N large glass urns in a room, and that each urn contains a definite number of coloured balls. Let's consider that the set of N urns contains balls of six colours (red, orange, black, green, blue, and purple).

The person in the room chooses an urn and randomly draws a ball from that urn. He then puts the ball on a conveyor belt, where the observer can observe the sequence of balls but not the sequence of urns from which they were drawn. The person in the room has some procedure for choosing urns: the choice of the urn for the n-th ball depends only upon a random number and the choice of the urn for the (n − 1)-th ball. The choice of urn does not directly depend on the urns chosen before the single previous urn; therefore, this is a Markov process. The Markov process itself cannot be observed, and only the sequence of labelled balls can be observed; thus the process is called a hidden Markov process.

Fig: Urn & Ball Model

Although the observer does not know the sequence in which the urns were chosen, he knows the probability of drawing each colour of ball from each urn. From these probabilities, together with the probabilities governing the choice of urns, the observer can calculate how likely it is that a particular ball was drawn from a particular urn, and hence which sequence of urns most likely produced an observed sequence of colours.
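To make this concrete, the sketch below applies the standard forward algorithm to a tiny urn-and-ball HMM in Java; the two urns, three colours, and all probabilities are illustrative assumptions rather than values taken from this paper.

// Minimal sketch of the forward algorithm for an urn-and-ball HMM with two
// urns (hidden states) and three ball colours (observable outputs). All
// probabilities are made-up illustrative values.
public class UrnBallHmm {

    static final double[][] A = {   // urn-to-urn transition probabilities
        {0.7, 0.3},
        {0.4, 0.6}
    };
    static final double[][] B = {   // P(colour | urn); columns: 0 = red, 1 = green, 2 = blue
        {0.5, 0.3, 0.2},
        {0.1, 0.4, 0.5}
    };
    static final double[] PI = {0.6, 0.4}; // initial urn probabilities

    // Probability of an observed colour sequence, summed over all hidden urn sequences.
    static double observationProbability(int[] obs) {
        int n = PI.length;
        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) {
            alpha[i] = PI[i] * B[i][obs[0]];          // initialisation
        }
        for (int t = 1; t < obs.length; t++) {        // induction
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    sum += alpha[i] * A[i][j];
                }
                next[j] = sum * B[j][obs[t]];
            }
            alpha = next;
        }
        double total = 0.0;                           // termination
        for (double a : alpha) {
            total += a;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] colours = {0, 2, 1};                    // red, blue, green
        System.out.printf("P(observations) = %.6f%n", observationProbability(colours));
    }
}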

N-gram language model: As we can easily recall being told by our high school grammar teacher, not every random combination of words forms a grammatically acceptable sentence:

• Colourless green ideas sleep furiously
• Furiously sleep ideas green colourless
• Ideas furiously colourless sleep green

The sentence Colourless green ideas sleep furiously (made famous by the linguist Noam Chomsky), for instance, is grammatically perfectly acceptable, but of course entirely nonsensical. If you compare this sentence to the other two sentences, this grammaticality becomes evident. The sentence Furiously sleep ideas green colourless is grammatically unacceptable, and so is Ideas furiously colourless sleep green: these sentences do not play by the rules of the English language. In other words, the fact that languages have rules constrains the way in which words can be combined into an acceptable sentence.

Language plays by rules, whereas computers work with rules. Inferring a set of rules gives us a language model: a model that describes how a language, say English, works and behaves. The rules by which a language plays are very complex, and no full set of rules to describe a language has ever been proposed. There are simpler ways to obtain a language model, namely by exploiting the observation that words do not combine in a random order; that is, we can learn a lot from a word and its neighbours. Language models that exploit the ordering of words are called n-gram language models, in which n represents any integer greater than zero.

N-gram models can be imagined as placing a small window over a sentence or a text, in which only n words are visible at the same time. The simplest n-gram model is therefore a so-called unigram model. This is a model in which we only look at one word at a time. The sentence Colourless green ideas sleep furiously, for instance, contains five unigrams: “colourless”, “green”, “ideas”, “sleep”, and “furiously”. Of course, this is not very informative, as these are just the words that form the sentence. In fact, N-grams start to become interesting when n is two (a bigram) or greater.

We can easily modify our definition of bigrams to extract n-grams of a specified length: rather than always taking two elements, we make the number of items to take an argument to the function. When used for language modelling, independence assumptions are made so that each word depends only on the last n-1 words. This Markov model is used as an approximation of the true underlying language. The assumption is important because it massively simplifies the problem of learning the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together. In a simple n-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.), can be described as following a categorical distribution (often imprecisely called a "multinomial distribution"). Basically, an n-gram model predicts the likelihood of the next item. From training data, one can derive the probability distribution for the next letter given a history of size n, for example a = 0.4, b = 0.0004, c = 0, where the probabilities of all possible "next letters" sum to 1.0.
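As a concrete illustration of the counting described above, the following sketch builds a tiny bigram model in Java from a toy corpus; the corpus sentences and class name are purely illustrative.

// Minimal sketch of a bigram (n = 2) language model: adjacent word pairs are
// counted and turned into conditional probabilities P(next | previous).
import java.util.HashMap;
import java.util.Map;

public class BigramModel {

    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Count every adjacent word pair in a training sentence.
    void train(String sentence) {
        String[] words = sentence.toLowerCase().split("\\s+");
        for (int i = 0; i < words.length - 1; i++) {
            counts.computeIfAbsent(words[i], w -> new HashMap<>())
                  .merge(words[i + 1], 1, Integer::sum);
        }
    }

    // P(next | previous) = count(previous, next) / count(previous, anything)
    double probability(String previous, String next) {
        Map<String, Integer> following = counts.get(previous.toLowerCase());
        if (following == null) {
            return 0.0;
        }
        int total = 0;
        for (int c : following.values()) {
            total += c;
        }
        return following.getOrDefault(next.toLowerCase(), 0) / (double) total;
    }

    public static void main(String[] args) {
        BigramModel model = new BigramModel();
        model.train("colourless green ideas sleep furiously");
        model.train("green ideas never sleep");
        System.out.println("P(ideas | green) = " + model.probability("green", "ideas")); // 1.0
        System.out.println("P(sleep | ideas) = " + model.probability("ideas", "sleep")); // 0.5
    }
}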

Problems: The following problems are faced in today's world:

Hands-free computing:

Today's generation prefers speaking over writing any day. This may be due to many reasons: time scarcity, the efficiency of multitasking, and a hassle-free division of tasks. They have so many things to do and very little time. There is a need for an interface that they can connect to and interact with to make their daily tasks easy. With the help of this application, anyone can give voice input and the voice will be converted into text instantly, without any extra effort.

Education and daily life:

Today's tech-savvy youngsters want to get their hands on anything and everything. They have indulged themselves in so many things that they cannot bear spending time writing their projects and assignments, and children are loaded with so many activities that their health is degrading day by day. This application will help them with their assignments: they will only have to dictate their lines, and the application will provide them with the written documents. Moreover, they need to learn new languages so that they can connect with the outer world, and to work on their pronunciation skills too. With its multiple-language option, this application can help people learn different languages without the help of a tutor and without going to the particular region.

In day-to-day life, where text messages are an integral part of our lives, this application will help everyone type text messages on the go while doing other tasks. For example, anyone can simply speak an urgent message to be sent while driving.

Blindness and education:

Among people there are some who are unable to write, either because of blindness (complete or partial) or for other reasons. Students who are physically disabled or suffer from repetitive strain injury or other injuries to the upper extremities have to worry about handwriting, typing, or working with a scribe on school assignments. They all need an interface that listens to them, does their task, and helps them connect to the outer world instantly.

For those people who cannot read or write, it is very difficult to use the texting applications on phones. This application will help them in such situations by providing the proper platform.

Solution:

Speech recognition will be very helpful to such people. They will be able to take notes of anything and everything and send messages across distances in an instant. Students who are blind or have very low vision can benefit from using the technology to convey words and then hear the application recite them, as well as use a computer by commanding it with their voice instead of having to look at the screen and keyboard. For language learning, speech recognition can be useful for learning a second language: it can teach proper pronunciation, in addition to helping a person develop fluency in their speaking skills. Today's generation prefers speaking over writing any day; for such tech-savvy youngsters, speech to text is the best way to promote learning and the sharing of information, and it will help them take notes on the go.

Android platform as a way out of this problem:

Android is a software environment for mobile devices that includes an operating system, middleware and key applications.

The Android operating system (OS) architecture is divided into five layers. The application layer of the Android OS is visible to the end user and consists of user applications. The application layer includes the basic applications which come with the operating system and the applications which the user subsequently installs. All applications are written in the Java programming language. The application framework is an extensible set of software components used by all applications in the operating system. The next layer contains the libraries, written in the C and C++ programming languages, which the OS accesses via the framework. The Dalvik Virtual Machine (DVM) forms the main part of the execution environment; the virtual machine is used to run the core libraries, which are written in the Java programming language.

Fig: Android Architecture

Unlike Java's virtual machine, which is stack-based, the DVM is register-based and is intended for mobile devices. The last architecture layer of the Android operating system is a kernel based on the Linux OS, which serves as a hardware abstraction layer. The main reasons for its use are memory and process management, its security model, its networking system, and its constant development. There are four basic components used in the construction of applications: the activity, the intent, the service, and the content provider. An activity is the main element of every application; a simplified description defines it as a window that users see on their mobile device. An application can have one or more activities, and the main activity is the one used at startup. The transition between activities is carried out by having the currently running activity launch a new activity. Each activity is implemented as a separate component by inheriting from the Activity class. During the execution of an application, activities are added to a stack; the currently running activity is at the top of the stack.
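As a rough illustration of the activity and intent concepts above (not the project's actual code), the following sketch shows one activity launching a second, hypothetical activity with an explicit Intent, which pushes the new activity onto the activity stack.

// Minimal sketch of an activity transition. Both classes would also have to
// be declared in AndroidManifest.xml; the extra key and text are placeholders.
import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;

public class ExampleActivity extends Activity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // setContentView(...) would normally load this activity's layout here.

        // Launch a second activity with an explicit intent carrying an extra.
        Intent intent = new Intent(this, SecondActivity.class);
        intent.putExtra("greeting", "Hello from ExampleActivity");
        startActivity(intent);
    }
}

class SecondActivity extends Activity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // This activity is now on top of the activity stack.
        String greeting = getIntent().getStringExtra("greeting");
    }
}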

An intent is a message used to start activities, services, or broadcast receivers. An intent can contain the name of the component to run, the action to execute, the address of stored data needed by the component, and the component type. A service is a component that runs in the background to perform long-running operations or to perform work for remote processes. One service can be linked to multiple applications, and the service executes until its connections with all of those applications are closed. A content provider manages a shared set of application data. Data can be stored in the file system, an SQLite database, the web, or any other persistent storage location the application can access [1]. Through the content provider, other applications can query or even modify the data (if the content provider allows it).
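The following sketch shows, assuming an ordinary Android activity with the READ_CONTACTS permission declared in AndroidManifest.xml, how an application might query shared data through a content provider, here the standard contacts provider.

// Minimal sketch of reading data exposed by a content provider.
import android.app.Activity;
import android.database.Cursor;
import android.os.Bundle;
import android.provider.ContactsContract;

public class ContactListActivity extends Activity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        // Query the contacts provider for display names (requires READ_CONTACTS).
        Cursor cursor = getContentResolver().query(
                ContactsContract.Contacts.CONTENT_URI,                  // provider URI
                new String[] {ContactsContract.Contacts.DISPLAY_NAME},  // columns
                null, null, null);                                      // no selection, default order

        if (cursor != null) {
            while (cursor.moveToNext()) {
                String name = cursor.getString(0);
                // Use the contact name, e.g. add it to a list adapter.
            }
            cursor.close();
        }
    }
}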

Speech Recognition:

Speech recognition for this application is done on Google's servers, using the HMM and n-gram algorithms. The system can be divided into several blocks: feature extraction, an acoustic model database built from the training data, a dictionary, a language model, and the speech recognition algorithm.

The input audio waveform from a microphone is converted into a sequence of fixed-size acoustic vectors Y(1:T) = y1, ..., yT in a process called feature extraction. The decoder then attempts to find the sequence of words w(1:L) = w1, ..., wL which is most likely to have generated Y, i.e. the decoder tries to find ŵ = argmax_w P(w | Y). However, since P(w | Y) is difficult to model directly, Bayes' rule is used to transform it into the equivalent problem of finding:

ŵ = argmax_w p(Y | w) P(w)

The likelihood p(Y | w) is determined by an acoustic model and the prior P(w) is determined by a language model.

The basic unit of sound represented by the acoustic model is the phone. For example, the word “bat” is composed of three phones /b/ /ae/ /t/. About 40 such phones are required for English.

For any given w, the corresponding acoustic model is synthesized by concatenating phone models to make words, as defined by a pronunciation dictionary. The parameters of these phone models are estimated from training data consisting of speech waveforms and their orthographic transcriptions. The language model is typically an N-gram model in which the probability of each word is conditioned only on its N-1 predecessors. The N-gram parameters are estimated by counting N-tuples in appropriate text corpora (collections of text). The decoder operates by searching through all possible word sequences, using pruning to remove unlikely hypotheses and thereby keep the search tractable. When the end of the utterance is reached, the most likely word sequence is output. Alternatively, modern decoders can generate lattices containing a compact representation of the most likely hypotheses.
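The decoding rule ŵ = argmax_w p(Y | w) P(w) can be illustrated with a toy example in which each candidate word sequence is scored by adding its log acoustic likelihood and its log language-model prior; the hypotheses and scores below are made-up illustrative values.

// Toy sketch of Bayes-rule decoding: combine acoustic and language-model
// scores in the log domain and keep the best-scoring hypothesis.
import java.util.HashMap;
import java.util.Map;

public class ToyDecoder {

    public static void main(String[] args) {
        // log p(Y | w): how well each hypothesis matches the audio features.
        Map<String, Double> acousticLogLikelihood = new HashMap<>();
        acousticLogLikelihood.put("recognize speech", -12.1);
        acousticLogLikelihood.put("wreck a nice beach", -11.8);

        // log P(w): how plausible each hypothesis is as an English phrase.
        Map<String, Double> languageModelLogPrior = new HashMap<>();
        languageModelLogPrior.put("recognize speech", -4.0);
        languageModelLogPrior.put("wreck a nice beach", -9.5);

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String hypothesis : acousticLogLikelihood.keySet()) {
            // log [ p(Y|w) P(w) ] = log p(Y|w) + log P(w)
            double score = acousticLogLikelihood.get(hypothesis)
                         + languageModelLogPrior.get(hypothesis);
            if (score > bestScore) {
                bestScore = score;
                best = hypothesis;
            }
        }
        System.out.println("Decoded: " + best);   // "recognize speech"
    }
}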

MAIN PARTS OF THE PROJECT:

A. Voice Recognition Activity class: The Voice Recognition Activity is the startup activity, defined as the launcher in the AndroidManifest.xml file. This is where most of the initialization goes in order to programmatically interact with widgets in the user interface. In this activity there is also a check of whether the mobile phone on which the application is installed has speech recognition capability. If the mobile device does not have one of the many Google applications which integrate speech recognition, further work of this Voice SMS application is disabled and the message "Recognizer not present" is shown on the screen. The recognition process is done through one of Google's speech recognition applications. If a recognition activity is present, the user can start speech recognition by pressing the button, thus launching startActivityForResult(Intent intent, int requestCode). The application uses startActivityForResult() to broadcast an intent that requests voice recognition, including an extra parameter that specifies one of two language models.

Fig: Enables search after clicking image button


Fig: Processes and gives text output
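The sketch below shows how the recognition request described in the Voice Recognition Activity section might look in code; the request code and prompt string are illustrative choices, while the RecognizerIntent constants and callbacks are the standard Android speech API.

// Minimal sketch of launching Google's speech recognizer and reading back
// the recognized text; this is a simplified stand-in, not the project's
// actual class.
import android.app.Activity;
import android.content.Intent;
import android.speech.RecognizerIntent;
import java.util.ArrayList;

public class VoiceRecognitionSketch extends Activity {

    private static final int VOICE_RECOGNITION_REQUEST = 1001; // arbitrary request code

    // Called when the user presses the speak button.
    private void startVoiceRecognition() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        // Free-form language model suits dictation; WEB_SEARCH is the other option.
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                        RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak your message");
        startActivityForResult(intent, VOICE_RECOGNITION_REQUEST);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == VOICE_RECOGNITION_REQUEST && resultCode == RESULT_OK) {
            // The recognizer returns candidate transcriptions, best first.
            ArrayList<String> matches =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            if (matches != null && !matches.isEmpty()) {
                String recognizedText = matches.get(0);
                // Display recognizedText in the message field, e.g. an EditText.
            }
        }
        super.onActivityResult(requestCode, resultCode, data);
    }
}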

B. SMS class: This class acts as the interface for the SMS-sending activity. The text is entered in the space for writing messages and displayed on the screen. When the Send SMS button is clicked, the application checks whether the message and the recipient's number have been entered before sending the message. When the cursor is positioned in the field for the recipient's number, the contacts button's visibility attribute is changed from the default gone to visible; pressing the button allows the user to pick contact numbers. After the desired contact is selected, the message can be sent.


Fig: Interface for sending SMS
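A minimal sketch of the SMS-sending step, assuming the SEND_SMS permission is declared in AndroidManifest.xml, could use Android's SmsManager as follows; the class and method names here are illustrative rather than the project's actual code.

// Minimal sketch of sending the recognized text as an SMS.
import android.telephony.SmsManager;

public class SmsSender {

    // Sends a plain text message to the given recipient number.
    public static void sendTextMessage(String recipientNumber, String message) {
        SmsManager smsManager = SmsManager.getDefault();
        // Null arguments: default SMS centre, no sent/delivered PendingIntents.
        smsManager.sendTextMessage(recipientNumber, null, message, null, null);
    }
}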

C. XML files: The application consists of two different interfaces. The screen shown when the user runs the application is defined in voice_recognition.xml. The linear arrangement of elements allows widgets to be added one below another. Width and height are defined with the fill_parent attribute, which makes them equal to the parent (in this case the screen). The second interface, defined in the sms.xml file, is displayed when the user chooses one of the offered messages. AndroidManifest.xml handles the installation and launching of the application on the mobile device.

Economic feasibility:

As far as the economic feasibility of this project is concerned, we can say that it would be very cost-efficient for any company to incorporate this application into their project. The following features of this application show its economic feasibility:

• The operating system used for development is free of cost, and so is the Eclipse IDE used as the interface for application development.
• Free use and adaptation of the operating system by manufacturers of mobile devices.
• Equal access to resources for the basic core applications and additional applications.
• Optimized use of memory and automatic control of the applications being executed.
• Quick and easy development of applications using the development tools and a rich database of software libraries.
• High quality of audiovisual content; it is possible to use vector graphics and most audio and video formats.
• Ability to test applications on most computing platforms, including Windows and Linux, thus saving time and money.

Conclusion:

A speech recognizer's effectiveness depends on its synthesizing rate and pronunciation quality. Generally, STT software uses only one type of language model. Using only one type of algorithm does serve the purpose of converting speech to text, but it often ends up with poorer quality and a lower synthesizing rate. Our application attempts to interpolate between the two by combining two models, namely the hidden Markov model and the n-gram model, which are the most successful algorithms so far for STT software.

Future developments:

The existing speech-to-text conversion software available on the net converts speech entered in a particular language into that language's text only. Moreover, the few applications that do provide a multiple-language translation feature often end up creating a mess for users by mixing up words and failing to give the desired output.

With the development of the software and hardware capabilities of mobile devices, there is an increased need for device-specific content, which has resulted in market changes. We look forward to developing this software further so that speech can be entered in multiple languages and converted into multilingual text effectively, which could create a foundation for everyday use of this technology worldwide. The user shall first be given the option to choose the language he wishes to speak, which shall be matched against the existing dictionary database, and he will then be asked to choose the language into which the data should be converted. The speech synthesizer shall convert the input into the required language and display the desired output. We will focus on the various languages spoken in India, thus making the application one of its kind.
