Answering Relationship Queries on the Web. Gang Luo, Chunqiang Tang and Ying-li Tian, IBM T.J. Watson Research Center. WWW 2007


Page 1: Answering Relationship Queries on the Web

Answering Relationship Queries on the Web

Gang Luo, Chunqiang Tang and Ying-li Tian

IBM T.J. Watson Research Center

WWW 2007

Page 2: Answering Relationship Queries on the Web

Motivation

• Relationship between people

– Dr. John Robert Schrieffer:
• A professor at Florida State University
• Nobel Prize laureate in physics
• He is invited to a party.

– Mr. Glenn Klausman:
• Florida attorney practicing personal injury law
• Glenn plans to attend the party.
• He would like to chat with Dr. Schrieffer.

• A relationship query (RQ) asks for the relationships between two or more entities.


Page 3: Answering Relationship Queries on the Web

Challenge

• Web pages are unstructured documents.

• Web pages contain a large amount of “noise” (i.e., irrelevant information).

• How to capture potential connecting terms between E1 and E2?

• How to compute term weights based on the characteristics of the two Web page sets?


Page 4: Answering Relationship Queries on the Web

Answering Relationship Query

• Searchers may not be able to find desired relationships between E1 and E2

• (1) Retrieved pages do not contain any desired relationship

• (2) No Web page mentions both E1 and E2 and their relationship

• (3) Web pages may either (a) mention some relationships, or (b) just happen to incidentally mention both E1 and E2.

• (4) No desired relationship exists between E1 and E2


Page 5: Answering Relationship Queries on the Web

Answering RQ – User Interface 1/2


Page 6: Answering Relationship Queries on the Web

Answering RQ – User Interface 2/2

Page 7: Answering Relationship Queries on the Web

Answering RQ - Step 1

• Step 1: Obtaining Web pages

– Entity Ei (i = 1, 2) is used as a query keyword to retrieve the URLs of the top Mi Web pages.

– For each URL, the corresponding Web page is retrieved from the Web.

– M1=M2=50

Page 8: Answering Relationship Queries on the Web

Answering RQ - Step 2

• Step 2: Document pre-processing

– Operation 1: All HTML comments, JavaScript code, tags, and non-alphabetic characters are removed.

– Operation 2: Stemming (Porter stemmer).

– Operation 3: Stopwords are removed using the SMART stopword list.

– Operation 4: Connecting terms are obtained by windowing around the query keywords in Ki.

• W: window size (25~35)

• They assume that the most useful information is typically centered around query keywords and use windowing to obtain this information.

• Operation 4 attempts to strike a balance between noise reduction and omission of useful information.

• Ki: the set of keywords of entity Ei
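The windowing in Operation 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `window_terms` is hypothetical, W is assumed to mean the number of tokens kept on each side of a keyword, and the keywords themselves are assumed to stay in the output. The slides do not specify any of these details.

```python
def window_terms(tokens, keywords, w=30):
    """Keep only the tokens that lie within w positions of a query keyword.

    tokens   -- pre-processed (stemmed, stopword-free) tokens of one Web page
    keywords -- the keyword set Ki of entity Ei
    w        -- window size; treated here as a per-side radius (an assumption)
    """
    keep = set()
    for pos, tok in enumerate(tokens):
        if tok in keywords:
            lo = max(0, pos - w)
            hi = min(len(tokens), pos + w + 1)
            keep.update(range(lo, hi))
    # Preserve the original token order.
    return [tokens[i] for i in sorted(keep)]


if __name__ == "__main__":
    page = "a b c nobel d e f g".split()
    print(window_terms(page, {"nobel"}, w=2))
```

A page that never mentions a keyword contributes no terms under this sketch, which is one way the windowing trades noise reduction against lost information.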

Page 9: Answering Relationship Queries on the Web

Answering RQ - Step 3 (1/3)

• Step 3: Computing similarity values

– For each pair of Web pages (P1, P2), where P1 ∈ S1 and P2 ∈ S2, they compute a similarity value.

– For each connecting term t that appears in both P1 and P2, they compute a term weight.

• The weight reflects the likelihood that t captures the relationship between E1 and E2.

• The Okapi formula is used to compute both the term weights and the similarity values of Web page pairs.

Page 10: Answering Relationship Queries on the Web

Answering RQ - Step 3 (2/3)

• Okapi formula - Q: query, S: document set

– For each term t in the vocabulary and each document D ∈ S, Okapi uses the following formulas:

qtf: t’s frequency in Q
N: total number of documents in S
df: the number of documents in S that contain t
dl: length of D in bytes
avdl: the average document length in bytes
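As a point of reference, the widely used Okapi BM25 form built from these symbols can be sketched as follows. This is an assumption about the general shape of the formulas, not a transcription of the slide: the constants `k1`, `b`, and `k3` and their default values are conventional choices, and the helper names are hypothetical.

```python
import math
from collections import Counter


def byte_length(terms):
    """Length of a tokenized document in bytes (dl on the slide)."""
    return sum(len(t.encode("utf-8")) for t in terms)


def okapi_score(query, doc, docs, k1=1.2, b=0.75, k3=1000.0):
    """Score document `doc` (a member of `docs`) against `query`.

    Uses the common BM25 decomposition into a tf weight, an idf weight,
    and a qtf weight; k1, b, and k3 are assumed tuning constants.
    """
    n = len(docs)                                   # N
    avdl = sum(byte_length(d) for d in docs) / n    # avdl
    dl = byte_length(doc)                           # dl
    tf = Counter(doc)
    qtf = Counter(query)
    score = 0.0
    for t, q in qtf.items():
        if t not in tf:
            continue
        df = sum(1 for d in docs if t in d)         # df
        w_tf = (k1 + 1) * tf[t] / (k1 * ((1 - b) + b * dl / avdl) + tf[t])
        w_idf = math.log((n - df + 0.5) / (df + 0.5))
        w_qtf = (k3 + 1) * q / (k3 + q)
        score += w_tf * w_idf * w_qtf
    return score


if __name__ == "__main__":
    docs = [["nobel", "physics"], ["law", "florida"], ["florida", "attorney"]]
    print(okapi_score(["nobel"], docs[0], docs))
```

Note that the idf weight in this form turns negative for a term that appears in more than half of the documents, so very common terms lower the score rather than raise it.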

Page 11: Answering Relationship Queries on the Web

Answering RQ - Step 3 (3/3)

• For RQ

– There are two Web page sets rather than one document set and one query.

– The idea is to replace (D,Q) with (P1S, P2S)

– The method reuses equation f1, drops f3, and changes f2, f4, and f5 into f2’, f4’, and f5’

– Top C potential connecting terms: C = 20~30
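One way to picture the adapted computation is the Python sketch below. It is hedged throughout: the slides state only that per-set statistics (Ni, avdli, dfi) are used, that the larger of the two idf weights is taken, and that only the top C terms count. Multiplying the two tf weights, ranking terms by their weight within each pair, and the constants `k1` and `b` are all assumptions, and every function name is hypothetical.

```python
import math
from collections import Counter


def byte_length(terms):
    return sum(len(t.encode("utf-8")) for t in terms)


def set_stats(pages):
    """Global statistics (N_i, avdl_i, df_i) for one Web page set S_i."""
    n = len(pages)
    avdl = sum(byte_length(p) for p in pages) / n
    df = Counter(t for p in pages for t in set(p))
    return n, avdl, df


def tf_weight(tf, dl, avdl, k1=1.2, b=0.75):
    return (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)


def idf_weight(n, df):
    return math.log((n - df + 0.5) / (df + 0.5))


def pair_similarity(p1, p2, stats1, stats2, top_c=25):
    """Similarity of a page pair from the top_c heaviest shared terms."""
    n1, avdl1, df1 = stats1
    n2, avdl2, df2 = stats2
    tf1, tf2 = Counter(p1), Counter(p2)
    dl1, dl2 = byte_length(p1), byte_length(p2)
    weights = []
    for t in tf1.keys() & tf2.keys():       # terms appearing in both pages
        w_idf = max(idf_weight(n1, df1[t]), idf_weight(n2, df2[t]))
        # Combining the two tf weights by multiplication is an assumption.
        weights.append(tf_weight(tf1[t], dl1, avdl1)
                       * tf_weight(tf2[t], dl2, avdl2) * w_idf)
    return sum(sorted(weights, reverse=True)[:top_c])


if __name__ == "__main__":
    s1 = [["schrieffer", "nobel", "physics"],
          ["schrieffer", "florida", "state"],
          ["schrieffer", "superconductivity"]]
    s2 = [["klausman", "attorney", "florida"],
          ["klausman", "injury", "law"],
          ["klausman", "orlando"]]
    print(pair_similarity(s1[1], s2[0], set_stats(s1), set_stats(s2)))
```

In the toy data, the only pair with a positive similarity is the one sharing the term "florida", which is the kind of connecting term the method is meant to surface.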

Page 12: Answering Relationship Queries on the Web

Answering RQ - Step 4

• Step 4: Sorting Web page pairs

– All the Web page pairs are sorted in descending order of their similarity values.

– The top ten Web page pairs are returned to the searcher in the first result page.
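Step 4 amounts to scoring every (P1, P2) combination and keeping the best ten. A minimal Python sketch follows; `rank_pairs` is a hypothetical name, and `similarity` stands in for whatever pair-scoring function Step 3 provides. The 1-based indices mirror the (P1,i, P2,j) notation used in the examples.

```python
from itertools import product


def rank_pairs(s1, s2, similarity, top_k=10):
    """Return the top_k ((i, j), score) entries, highest score first.

    i and j are 1-based positions of the pages within S1 and S2.
    """
    scored = [
        ((i, j), similarity(p1, p2))
        for (i, p1), (j, p2) in product(enumerate(s1, 1), enumerate(s2, 1))
    ]
    # sorted() is stable, so tied pairs keep their enumeration order.
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy similarity: number of shared terms (an illustration only).
    overlap = lambda a, b: len(set(a) & set(b))
    s1 = [["a", "b"], ["c"]]
    s2 = [["a", "b"], ["c", "d"]]
    print(rank_pairs(s1, s2, overlap, top_k=2))
```

With M1 = M2 = 50 this scores 2,500 pairs per query, which is small enough to sort in full.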

Page 13: Answering Relationship Queries on the Web

Experimental Results - Example 1

• Scenario I: Relationship between People

– Example 1 (Nobel Example)

• (P1,1, P2,28)

Page 14: Answering Relationship Queries on the Web

Experimental Results - Example 2

– Example 2 (Lomet Example)

• Suppose Arthur will attend a conference and he notices that David will attend the same conference.

• Assume that Arthur does not know David and would like to chat with him.

• (P1,48, P2,5)

Page 15: Answering Relationship Queries on the Web

Experimental Results - Example 3(1/3)

• Scenario II: Relationship between Places

– Example 3 (Yorktown Example)

• The first Web page pair: (P1,31, P2,21)

Page 16: Answering Relationship Queries on the Web

Experimental Results - Example 3(2/3)

– Example 3 (Yorktown Example)

• The second Web page pair: (P1,46, P2,19)

Page 17: Answering Relationship Queries on the Web

Experimental Results - Example 3(3/3)

– Example 3 (Yorktown Example)

Page 18: Answering Relationship Queries on the Web

Experimental Results - Example 4

– Example 4 (Hartlepool Example)

• E1: Hartlepool. E2: Three Gorges.

• The fourth Web page pair: (P1,17, P2,49)

Page 19: Answering Relationship Queries on the Web

Experimental Results - Example 5

• Scenario III: Relationship between Companies

– Example 5 (Bank Example)

• E1: St. Petersburg Real Estate Holding Co. E2: Union Bank of Switzerland scandal.

• The first Web page pair: (P1,15, P2,45)

Page 20: Answering Relationship Queries on the Web

Experimental Results - Example 6

• Scenario IV: Relationship between Institutes

– Example 6 (CMU Example)

• Anton graduated from the Computer Science Department of CMU.

• He is currently a researcher at Microsoft Research (MSR).

• He will go back to CMU to recruit new employees for MSR.

• E1: Microsoft Research. E2: Carnegie Mellon University computer science.

• (P1,39, P2,11)

Page 21: Answering Relationship Queries on the Web

Experimental Results - Example 7

• Scenario V: Relationship between Document Sets

– Example 7 (Paper Example)

• Cathy is a manager at a research lab.

Page 22: Answering Relationship Queries on the Web

Experimental Results -Sensitivity Analysis of Parameter Values (1/3)

• The score is defined as the sum of reciprocal ranks of relevant Web page pairs in the returned top ten Web page pairs.

– For example, if the first, second, and eighth of the returned top ten Web page pairs are relevant, the score is 1 + 1/2 + 1/8 = 1.625.

• Relevant Web page pairs contain desired relationships between the two entities and are manually identified.

• 30 examples
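The score defined above is simple to compute once the ranks of the relevant pairs are known. The Python sketch below uses a hypothetical function name and assumes ranks are 1-based and that ranks beyond the top ten are ignored.

```python
def reciprocal_rank_score(relevant_ranks, top_k=10):
    """Sum of 1/rank over the relevant pairs found within the top_k results."""
    return sum(1.0 / r for r in relevant_ranks if 1 <= r <= top_k)


if __name__ == "__main__":
    # The slide's example: relevant pairs at ranks 1, 2, and 8.
    print(reciprocal_rank_score([1, 2, 8]))  # 1.625
```

Under this definition the best possible score, with all ten returned pairs relevant, is 1 + 1/2 + ... + 1/10, roughly 2.93.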

Page 23: Answering Relationship Queries on the Web

Experimental Results -Sensitivity Analysis of Parameter Values (2/3)

Page 24: Answering Relationship Queries on the Web

Experimental Results -Sensitivity Analysis of Parameter Values (3/3)

– Tech1: Use windowing in document pre-processing (Operation 4 in Step 2).

– Tech2: Use max(W’idf,1, W’idf,2) in the term weighting formula (Step 3).

– Tech3: Only consider the top C potential connecting terms in computing the similarity value of a Web page pair (Step 3).

– Tech4: For each i (i = 1, 2), compute a set of global statistics (Ni, avdli, dfi) on the Web page set Si (Step 3).

Page 25: Answering Relationship Queries on the Web

Conclusion

• They believe that they are among the first to study the problem of answering relationship queries on the Web.

• To effectively filter out the large amount of noise in the Web pages without losing much useful information, they:

– Apply windowing around query keywords

– Compute term weights based on the characteristics of the two Web page sets

– Only use the top potential connecting terms to compute the similarity values of Web page pairs