
Question Answering System Using the Approach

Of NLP

P. John Paul, Sibi Amaran, K. Sree Kumar and U.M. Prakash

Department of CSE, SRM University, Kattankulathur, India

[email protected]

Abstract: The ability of a machine to infer knowledge from a user document can be examined through its ability to answer questions asked about that document. The proposed question answering system takes a new deterministic approach to co-reference resolution that combines the global information and precise features of modern machine learning models with the transparency and modularity of deterministic, rule-based models. Further, our aim is to make use of global information through an entity-centric model that encourages the sharing of features across all mentions that point to the same real-world entity. The problem statement is: given a text document and a question, the model must find the possible answers in the document. For this we develop a model for documents in which the information is factual and written in simple English, like short stories, and in which the suitable answer to a question is a single phrase. The work of the model can be categorized into six basic tasks: pronoun resolution, dependency extraction, lemmatization of graph entities and relationships, semantic graph generation, query graph generation and, finally, graph search for answers.

Keywords: Entity-centric model, Pronoun Resolution, Dependency Extraction, Semantic Graph Generation

International Journal of Pure and Applied Mathematics, Volume 117, No. 7, 2017, 445-458. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). URL: http://www.ijpam.eu (Special Issue)

1. Introduction

Machine Learning

Machine learning concerns a computer's ability to learn. It involves creating algorithms that enable a computer to learn from the data provided and, on that basis, make accurate predictions. In today's world there are many settings where designing an explicit algorithm for the desired output is not feasible, so it is necessary to make the machine learn to find the output itself. Real-life examples of machine learning include speech recognition in Google Now, where deep learning networks reduced the error rate by over 20%, making the technology seem magical, and the recommendation engines used by Amazon, Netflix, Flipkart and many others.

Machine learning is classified into three main categories:

Supervised Learning: Supervised learning is where you have input variables (x) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output. The goal is to approximate the mapping function so well that, given new input data (x), you can predict the output variable (Y) for that data.
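As a minimal sketch of this idea, the mapping from x to Y can be approximated from labelled examples. The example below assumes the mapping is linear and fits it by ordinary least squares; this is purely illustrative and not part of the paper's system.

```python
# Supervised learning sketch: approximate Y = f(x) from labelled data.
# Here f is assumed linear and fitted by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx  # (slope, intercept)

# Training data generated by the rule Y = 2x + 1 (unknown to the learner).
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Predict Y for new input data x using the learned mapping."""
    return slope * x + intercept
```

The learner recovers the underlying rule well enough to predict outputs for inputs it has never seen, which is exactly the goal stated above.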

Unsupervised Learning: It does not use sample input-output pairs; instead, all similar vectors are grouped together, so that each group characterises what its members look like and any new vector can be assigned to the group it belongs to.

Reinforcement Learning: In supervised learning the programmer gives sample inputs and the corresponding outputs. In some instances this information may not be sufficient: the programmer may only be able to say, for example, that an output is "50 percent" correct. The information available is a critique rather than an exact answer. When learning is based on such evaluative information, it is called reinforcement learning.

Natural Language Processing

Computers understand only the binary language of 0s and 1s, while humans understand each other using natural language. Natural language processing (NLP) is therefore used to let machines process human natural language. The major NLP tasks are:

i. Part of Speech Tagging: Each word in a sentence is tagged with its part of speech, such as noun, verb, adjective, pronoun or adverb. For example, for the sentence "Shweta is a girl, she is beautiful and likes singing." the POS tagging would be: Noun - Shweta, girl; Pronoun - she; Adjective - beautiful; Verb - singing.
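The idea can be sketched with a toy dictionary-based tagger. This is an assumption made for illustration only; real taggers (such as the Stanford CoreNLP tagger used later in this paper) learn tags statistically from annotated corpora.

```python
# Toy dictionary-based POS tagger; a minimal sketch of the idea only.
LEXICON = {
    "shweta": "NOUN", "girl": "NOUN",
    "she": "PRON",
    "beautiful": "ADJ",
    "is": "VERB", "likes": "VERB", "singing": "VERB",
    "a": "DET", "and": "CONJ",
}

def pos_tag(sentence):
    """Return (word, tag) pairs; unknown words default to NOUN."""
    words = sentence.lower().replace(",", "").replace(".", "").split()
    return [(w, LEXICON.get(w, "NOUN")) for w in words]

tags = pos_tag("Shweta is a girl, she is beautiful and likes singing.")
```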

ii. Named Entity Recognition: It classifies each word as to whether it is the name of a person, location, company, animal, time, etc. For example, for the sentence "Shweta was born in Pune." the NER output would be: Person - Shweta; Location - Pune.
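A gazetteer lookup is the simplest sketch of this task. The name lists below are assumed for illustration; production NER systems use statistical sequence models rather than fixed lists.

```python
# Toy gazetteer-based NER sketch; real systems learn entity types
# from annotated data instead of using fixed name lists.
PERSONS = {"Shweta", "John", "Ram"}
LOCATIONS = {"Pune", "Kattankulathur"}

def ner(sentence):
    """Map each recognised token to its entity type."""
    ents = {}
    for token in sentence.replace(".", "").split():
        if token in PERSONS:
            ents[token] = "PERSON"
        elif token in LOCATIONS:
            ents[token] = "LOCATION"
    return ents

ents = ner("Shweta was born in Pune.")
```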

iii. Parsing: Parsing is the process of breaking a complex sentence down into smaller units that can be analysed both syntactically and semantically and grouped together based on POS.

iv. Co-reference resolution: All the expressions referring to the same entity are found using co-reference resolution. It is important for tasks such as summarization, question answering and information retrieval.

Question Answering System

A Question Answering System uses information retrieval, natural language processing and machine learning to automatically answer questions asked by a user in natural language. The general steps in building any QAS are:

Pre-processing the text and question: POS tagging, annotating the document, named entity recognition, lemmatization.

Making relations between the different entities.

Finding the answer based on the relations between entities.

In our proposed model, we first pre-process the text document and the question: POS tagging, named entity recognition, lemmatization and annotation. After that we perform pronoun resolution, which applies ten steps of co-reference sieves. A dependency graph is then built between entities for the document as well as for the question. Finally, based on the dependency graphs, the answer is extracted.
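The stages above can be sketched end to end. Every stage here is a deliberately simplified stand-in (an assumption for illustration), not the paper's actual implementation, which uses Stanford CoreNLP components for each step.

```python
# End-to-end pipeline sketch: preprocess -> pronoun resolution ->
# dependency extraction -> graph comparison. All stages are toy
# stand-ins for the real NLP-toolkit components.

def preprocess(text):
    """Tokenise and normalise (stands in for POS tagging, NER, etc.)."""
    return text.lower().replace(".", "").replace("?", "").split()

def resolve_pronouns(tokens):
    """Crude stand-in for the ten co-reference sieves: point every
    third-person pronoun at the first entity mentioned."""
    first = tokens[0]
    return [first if t in {"he", "she", "it"} else t for t in tokens]

def triples(tokens):
    """Stand-in for dependency extraction: adjacent-word pairs."""
    return set(zip(tokens, tokens[1:]))

def answer_overlap(document, question):
    """Overlap between the document and question graphs."""
    d = triples(resolve_pronouns(preprocess(document)))
    q = triples(resolve_pronouns(preprocess(question)))
    return d & q

common = answer_overlap("Ram is a boy. He likes to run.",
                        "Who likes to run?")
```

The shared edges between the two toy graphs point at the sentence fragment containing the answer, which mirrors the graph-matching step of the real system.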

2. Related Works

Ahlam Ansari [1] used deep cases along with a conventional Artificial Neural Network to extract the answer to a question. Given a text document, the sentences in it are divided into knowledge units, and each word in a sentence is assigned deep cases. This research tried to imitate the information-recall mechanism of the human brain: the brain first processes (understands) the document and then tries to find the answer to the question asked, connecting related pieces of information through links. In a similar manner, here the related words are linked through deep cases, which are assigned to each word based on POS tagging and entity recognition. The benefit of this approach is that it can go beyond the data provided to the system and identify relations among individuals, whereas a plain artificial neural network is limited to the data provided to it. Figure 1 shows the types of deep cases with their descriptions.


Figure 1: Type of deep cases

LE Juan et al., in their paper "Answer Extraction Based on Merging Score Strategy of Hot Terms" [2], proposed a Question Answering System (QAS) in which the highest score is given to the expected answer based on an effective answer-scoring strategy. For example, if the question is "Who was the first man to reach the moon?", the answer must be a person's name. The proposed model first extracts all entities tagged as names from the document, yielding a list of candidate answers (Yuri Gagarin, Neil Armstrong, Rakesh Sharma, Kalpana Chawla). The candidate answers are then scored and ranked; the expected answer (here, Neil Armstrong) receives the highest score and rank and is thus chosen as the answer to the question asked.

Our model is based on "Deterministic Coreference Resolution Based on Entity-Centric, Precision-Ranked Rules" by Heeyoung Lee et al. [3]. All the expressions referring to the same entity are found using co-reference resolution, which plays a crucial role in tasks such as summarization, question answering and information retrieval. In our model, during pronoun resolution, we use the coreference resolution algorithm proposed in that work. Figure 2 shows the algorithm for coreference resolution.

Figure 2: Algorithm for coreference resolution


This stage involves mention detection and co-reference resolution. In mention detection, nominal and pronominal mentions are identified using a high-recall algorithm: it selects all occurrences of noun phrases (NPs), pronouns and named entities, and then filters out non-mentions, which include pleonastic it, i-within-i constructs, numeric entities, partitives, etc. After mention detection, co-reference resolution applies ten independent co-reference models (or "sieves") in succession (Figure 2 shows the ten sieves).

Let us take an example sentence to which we will apply the ten co-reference sieves. In each step the affected mention is marked in bold; a superscript indicates the entity id and a subscript the mention id.

Noun phrases (NPs), pronouns and named-entity mentions are selected in this phase. In the given sentence the noun phrases John, musician, a new song, a girl and the song are given mention ids and entity ids, as are the pronouns he, it, my and her.

We first look at the pronominal mentions inside the quotation and match them to the corresponding speaker. Here the pronominal mentions are it and my. [My] is matched with John (who is the speaker); [it] cannot refer to a person, so it cannot refer to the speaker. Giving higher priority to the speaker sieve governs the linking of girl and my, because anaphoric candidates are generally preferred over cataphoric candidates.


Here, for the mention under consideration, an antecedent is sought whose string is exactly the same as the mention under consideration. For the mention [John], the antecedent with exactly the same string is the earlier mention [John], so these two mentions are merged into one entity.

This sieve works like exact string matching but with looser constraints on the match. In our example there are no such mentions, so nothing changes.

Here predicate nominatives are matched. Our sentences have two predicate-nominative pairs: [John] and [musician], and [It] and [my favorite].

Mentions having the same head words are linked.


Based on gender, animacy and number, the pronouns are linked to their antecedents.

To allow new features to be discovered for the corresponding entity and shared between its mentions, we remove the links that were obtained through the predicate-nominative pattern.
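The cascade structure described above can be sketched as a list of sieve functions applied in precision order, each allowed to merge mention clusters. The two sieves below are greatly simplified stand-ins, not the actual rules of Lee et al. [3].

```python
# Precision-ordered sieve cascade sketch, in the spirit of Lee et al.
# Each sieve takes a list of mention clusters and may merge some of
# them; sieves run in succession from most to least precise.

def exact_string_match(clusters):
    """Merge clusters whose representative mentions are identical."""
    merged = {}
    for mentions in clusters:
        key = mentions[0].lower()
        merged.setdefault(key, []).extend(mentions)
    return list(merged.values())

def pronoun_sieve(clusters):
    """Attach third-person pronouns to the first non-pronoun cluster
    (a crude stand-in for gender/animacy/number agreement)."""
    pronouns = {"he", "she", "it"}
    out, pron = [], []
    for mentions in clusters:
        (pron if mentions[0].lower() in pronouns else out).append(mentions)
    if out:
        for mentions in pron:
            out[0].extend(mentions)
    return out

SIEVES = [exact_string_match, pronoun_sieve]  # ordered by precision

def coref(mentions):
    clusters = [[m] for m in mentions]  # start: one cluster per mention
    for sieve in SIEVES:
        clusters = sieve(clusters)
    return clusters

clusters = coref(["John", "he", "John", "musician"])
```

Running the cascade merges the two "John" mentions and the pronoun into one entity while leaving "musician" alone; in the real algorithm the predicate-nominative sieve would additionally link John and musician.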

Marie-Catherine de Marneffe and Christopher D. Manning wrote the Stanford typed dependencies manual [4]. Its purpose is to describe the grammatical relationships between the words of a parsed sentence in a simple form that can be easily understood by people who do not have deep linguistic knowledge but want to extract textual relations effectively. It is quite different from the phrase-structure representation that long prevailed in computational linguistics; instead, it represents all sentence relationships uniformly as typed dependencies. Example sentence:


Figure 3: Graphical representation of the Stanford Dependencies for the above

sentence

Figure 3 shows the graphical representation of the Stanford Dependencies for the above sentence. The representation consists of around 50 grammatical relations. All of these dependencies are binary relations, i.e., relations between a governor (also referred to as the head) and a dependent. One example of such a relation is the adjectival complement:
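Typed dependencies are conveniently handled as (relation, governor, dependent) triples. The triples below are hand-written for the assumed example sentence "She looks very beautiful"; a real system would obtain them from a dependency parser.

```python
# Stanford-style typed dependencies as (relation, governor, dependent)
# triples for the assumed sentence "She looks very beautiful".
deps = [
    ("nsubj", "looks", "She"),        # nominal subject
    ("acomp", "looks", "beautiful"),  # adjectival complement
    ("advmod", "beautiful", "very"),  # adverbial modifier
]

def dependents_of(word, triples):
    """All (relation, dependent) pairs governed by `word`."""
    return [(rel, dep) for rel, gov, dep in triples if gov == word]
```

Because every relation is binary, with the governor as head, any sentence reduces to a flat list of such triples, which is exactly what makes the representation easy to query without linguistic expertise.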


3. System Architecture and Implementation

Figure 4: System Architecture

Figure 4 depicts the system architecture of the question answering system. The problem statement is: given a text document and a question, find the possible answer in that document. We assume that

1. We are developing the model for documents in which the information is factual and written in simple English, like short stories (the tales of Panchatantra).

2. The suitable answer to the question is a single phrase.

The text document in Figure 4 refers to the stories, and the question to the questions asked. The stories and questions are passed through the pre-processing phase, where various natural language analysis tools work on the data. The analysed data is passed to the graph generation phase, where, based on the dependencies between words found during pre-processing, a graph is generated for the stories as well as for the question. Finally the graphs are matched and the best possible answer is searched for.

Pre Processing:

The pre-processing step includes three main steps: pronoun resolution, dependency extraction, and lemmatization of graph entities and relationships.

Pronoun Resolution:

This stage involves mention detection and co-reference resolution. In mention detection, nominal and pronominal mentions are identified using a high-recall algorithm: it selects all occurrences of noun phrases (NPs), pronouns and named entities, and then filters out non-mentions, which include pleonastic it, i-within-i constructs, numeric entities, partitives, etc. After mention detection, co-reference resolution applies ten independent co-reference models (or "sieves") in succession.


Dependency Extraction:

Dependency Extraction provides a simple description of the grammatical relationships in each sentence of the document; these grammatical relationships can be used by people without linguistic expertise who want to extract textual relations effectively. There are more than 50 grammatical relations. Dependency extraction is done for both the query and the textual story. Figure 5 shows an example of dependency extraction on a simple sentence.


Figure 5: Dependency extraction on a simple sentence

Lemmatization :

After dependency extraction, we change each word into its root form; for example, served - nsubj - Ram is changed into serve - nsubj - Ram. We also check the POS tag of each word: if it is a pronoun, we replace it with its representative mention.
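The step above can be sketched as a small normalisation pass over triples. The lemma table and coreference map below are assumed toy inputs; the paper's system obtains both from Stanford CoreNLP.

```python
# Lemmatize dependency triples and substitute pronouns with their
# representative mentions. Lemma table and coref map are toy inputs.
LEMMAS = {"served": "serve", "likes": "like", "ran": "run"}
REPRESENTATIVE = {"he": "Ram", "she": "Sita"}  # from pronoun resolution

def normalise(triple):
    """Map (head, relation, dependent) to its normalised form."""
    head, rel, dep = triple
    head = LEMMAS.get(head, head)                 # word -> root form
    dep = REPRESENTATIVE.get(dep.lower(), dep)    # pronoun -> mention
    return (head, rel, dep)

t = normalise(("served", "nsubj", "he"))
```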

Graph generation: Semantic graph generation

We take all the triples generated and store them in a graph database (Neo4j). Since we assume that the context of an object does not change within the document, there is only one node corresponding to a particular object. Example: in the text "Ram is a boy. Ram likes to run." both instances of "Ram" correspond to the same node.
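The node de-duplication described above can be sketched with a plain in-memory structure standing in for Neo4j (an assumption for illustration; the real system issues database queries instead).

```python
# Semantic graph generation sketch with node de-duplication: every
# occurrence of the same object maps to one node, as assumed above.
class SemanticGraph:
    def __init__(self):
        self.nodes = set()
        self.edges = set()  # (subject, relation, object) triples

    def add_triple(self, subj, rel, obj):
        self.nodes.update({subj, obj})  # sets merge duplicate nodes
        self.edges.add((subj, rel, obj))

g = SemanticGraph()
g.add_triple("Ram", "is", "boy")
g.add_triple("Ram", "like", "run")  # reuses the existing "Ram" node
```

In Neo4j the same effect would typically be achieved with a merge-style insert so that repeated mentions of "Ram" resolve to one node.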

Query graph generation

In a similar way, a graph is made for the query asked.

Graph search for answers

The graphs generated for the question and for the story are compared: based on the overlap between the two graphs' nodes, each sentence is given an individual priority, the sentences are arranged in priority order, and the answer is selected from the sentence with the highest priority.
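This scoring-and-ranking step can be sketched by treating each sentence's tokens as its graph nodes and scoring by overlap with the question's nodes. The tokenisation is deliberately naive; the real system compares the Neo4j graphs built earlier.

```python
# Graph-search sketch: score each story sentence by how many
# question-graph nodes it shares, then answer from the top sentence.
def nodes(sentence):
    """Naive node set: lowercase tokens of the sentence."""
    return set(sentence.lower().replace(".", "").replace("?", "").split())

def rank_sentences(story_sentences, question):
    """Return (priority, sentence) pairs, highest priority first."""
    q = nodes(question)
    scored = [(len(nodes(s) & q), s) for s in story_sentences]
    return sorted(scored, reverse=True)

story = ["Ram is a boy", "Ram likes to run", "Sita sings well"]
ranked = rank_sentences(story, "What does Ram like to do?")
best = ranked[0][1]  # sentence with the highest priority
```

The sentence sharing the most nodes with the question wins, and the answer phrase is then extracted from it, mirroring the priority ordering described above.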


4. Results of Implementation

Figure 6: Knowledge graph generated from the story

Figure 7: GUI asking for the question and the story file to be uploaded. Based on this, the query is answered and a knowledge graph similar to Figure 6 is generated.

Figure 8: The question asked and the final result obtained. The line below shows the sentence from the story that received the highest priority and from which the answer was extracted.


Figure 9: Console output showing every line of the story with its priority, and the final answer obtained from the line with the highest priority.

5. Conclusion

The simplicity of this method makes it well suited to multilingual QA. Many tools required by sophisticated QA systems (named entity taggers, parsers, ontologies, etc.) are language specific and require significant effort to adapt to a new language; a simpler pipeline reduces that adaptation effort.

References

[1] Ahlam Ansari, Moonish Maknojia, Altamash Shaikh, "Intelligent Question Answering System based on Artificial Neural Network", 2nd IEEE International Conference on Engineering and Technology (ICETECH), 17th & 18th March 2016, Coimbatore, TN, India.

[2] LE Juan, ZHANG Chunxia, NIU Zhendong, "Answer Extraction Based on Merging Score Strategy of Hot Terms", Chinese Journal of Electronics, Vol. 25, No. 4, July 2016.

[3] Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, "Deterministic Coreference Resolution Based on Entity-Centric, Precision-Ranked Rules", Computational Linguistics (MIT Press), Vol. 39, No. 4, December 2013, pp. 885-916.

[4] Marie-Catherine de Marneffe, Christopher D. Manning, "Stanford typed dependencies manual", September 2008; revised for the Stanford Parser v. 3.7.0 in September 2016.

[5] https://stanfordnlp.github.io/CoreNLP/index.html

[6] https://jena.apache.org/documentation/query/index.html


[7] https://nlp.stanford.edu/projects/coref.shtml

[8] https://stanfordnlp.github.io/CoreNLP/coref.html

[9] http://en.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_of_place_names

[10] https://nlp.stanford.edu/software/tregex.shtml

[11] http://nlp.stanford.edu/software/dcoref.shtml

[12] https://github.com/Sciss/ws4j
