
A review on “Answering Relationship Queries on the Web”

Bhushan Pendharkar

ASU ID 993934582

Problem statement

Existing search engines excel at keyword matching and document ranking, but they cannot answer relationship queries.

The paper focuses on finding the relationship between two entities given as queries: it retrieves the top-ranked Web pages for each query and matches them to form a list of Web page pairs.

Connecting terms are used to determine the relationship and to rank the Web page pairs.

Given two entities E1 and E2, a Web search engine displays top pages that do not show any relationship between E1 and E2.

The paper attempts to overcome this shortcoming of current search engines by providing a system and interface for relationship queries.

The proposed system depends on the Google search engine.

Solution Proposed

The proposed system accepts two entities as queries through its interface.

The top-ranked pages for each entity, E1 and E2, are retrieved separately from a search engine such as Google.

These pages (documents) are preprocessed: HTML tags are eliminated, words are stemmed (Porter stemmer), stop words are removed, and irrelevant words are eliminated (noise removal).

A term weight is calculated for each common term ‘t’ that shows a relationship between P1 and P2 (P1 is a result of query E1, P2 of E2).

Connecting terms are the terms with the higher term weights.

Cosine similarity (Okapi method) is used to calculate the similarity between P1 and P2, with P1 and P2 taking the places of ‘document’ and ‘query’ respectively.

The Web page pairs are sorted in descending order of similarity (or weights) and displayed along with the connecting terms for each pair.
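The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function names (preprocess, tfidf_vectors, cosine, rank_pairs), the tiny stop-word list, and the plain tf * log(1 + N/df) weight are all assumptions made here, the Okapi weighting is replaced by that simpler weight, and stemming is left out for brevity (a stemmer would slot into preprocess). The pages are passed in as HTML strings, so no search engine is called.

```python
import math
import re
from collections import Counter
from itertools import product

# Tiny illustrative stop-word list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "to", "for", "by"}

def preprocess(html):
    """Strip HTML tags, lowercase, tokenize on letters, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", html)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Turn each token list into a {term: weight} dict, weight = tf * log(1 + N/df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_pairs(pages_e1, pages_e2, top_terms=3):
    """Score every (P1, P2) pair and return (similarity, i, j, connecting_terms)
    tuples sorted by descending similarity."""
    docs1 = [preprocess(p) for p in pages_e1]
    docs2 = [preprocess(p) for p in pages_e2]
    vecs = tfidf_vectors(docs1 + docs2)
    v1, v2 = vecs[:len(docs1)], vecs[len(docs1):]
    ranked = []
    for (i, u), (j, v) in product(enumerate(v1), enumerate(v2)):
        # Connecting terms: shared terms, highest combined weight first.
        shared = sorted(set(u) & set(v), key=lambda t: u[t] * v[t], reverse=True)
        ranked.append((cosine(u, v), i, j, shared[:top_terms]))
    ranked.sort(key=lambda r: r[0], reverse=True)
    return ranked
```

Calling rank_pairs with the page texts retrieved for E1 and for E2 returns one tuple per page pair, with the shared high-weight terms playing the role of connecting terms.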

Criticism of the solution

Assumption: the top-ranked pages for E1 and the top-ranked pages for E2 do not contain any relationship between E1 and E2. No ground truth is provided for this, and the opposite may well be true.

The overview of the relationship between entities E1 and E2 is given as a random term ‘Ec’, and no explanation of ‘Ec’ is provided.

The system does little processing of its own and depends heavily on Google results. If the Google results are not perfect or correct (which is rare), the system fails. The paper explicitly mentions that the results change if the Google results vary.

The standard Porter stemmer is used, and it is imperfect: for example, “ignition” is stemmed to “ignit” and “Monday” to “Mondai”.

The paper concludes with an unnecessary explanation of how the results are affected when the steps of the proposed approach are eliminated one at a time, although all steps are necessary for the proper implementation of the system.
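The two stemming examples in the criticism above can be traced to two specific Porter rules. The sketch below implements only those two rules (step 1c, which rewrites a final y to i when the stem contains a vowel, and the ION case of step 4), not the full stemmer; the function names are hypothetical and input is assumed to be lowercase.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's m: the number of vowel-run -> consonant-run transitions in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def rule_1c(word):
    """Step 1c: (*v*) Y -> I."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word

def rule_4_ion(word):
    """Step 4, ION case only: (m > 1 and stem ends in S or T) ION -> ''."""
    if word.endswith("ion"):
        stem = word[:-3]
        if measure(stem) > 1 and stem.endswith(("s", "t")):
            return stem
    return word

print(rule_1c("monday"))       # mondai
print(rule_4_ion("ignition"))  # ignit
```

Both outputs are non-words, which is the point of the criticism. Since a stemmer only needs to map related word forms to the same string, whether this actually hurts the matching of connecting terms is a separate question.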

Relevance to IRM

The paper has significant relevance to the topics taught in the course.

The crux of the paper is the similarity calculation between Web page pairs (P1, P2), for which cosine similarity is used.

The concept of TF-IDF is used to determine the term weights for terms present in the documents P1 and P2.

Stemming is used to obtain root words.

Ranking is done on the basis of the similarity values of the Web page pairs.
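For reference, the textbook forms of the two course concepts named above are given below. The symbols are assumptions introduced here: tf is the frequency of term t in page P, N is the number of pages in the collection, and df is the number of pages containing t. The paper's Okapi-based weights may differ in detail from this plain TF-IDF form.

```latex
w_{t,P} = \mathrm{tf}_{t,P} \cdot \log\frac{N}{\mathrm{df}_t}
\qquad
\mathrm{sim}(P_1, P_2) =
  \frac{\sum_{t} w_{t,P_1}\, w_{t,P_2}}
       {\sqrt{\sum_{t} w_{t,P_1}^{2}}\;\sqrt{\sum_{t} w_{t,P_2}^{2}}}
```

Only terms common to P1 and P2 contribute to the numerator, which is why the highest-weighted shared terms act as connecting terms.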


