Finding the Right Facts in the Crowd: Factoid Question
Answering over Social Media
J. Bian, Y. Liu, E. Agichtein, and H. Zha. ACM WWW, 2008
Introduction
Question Answering (QA)
• A form of information retrieval where the user’s information need is specified as a natural language question
• The desired result is a self-contained answer, not a list of documents
Community Question Answering (CQA)
• Communities organized around QA, such as Yahoo! Answers and Naver
• Archive millions of questions and hundreds of millions of answers
• A potentially more effective alternative to web search, since CQA connects users directly to others who are willing to share information
• Users receive direct responses and thus do not have to browse search engine results to locate their answers
Challenges
Searching for existing answers is crucial to avoid duplication and to save users time and effort
However, existing search engines are not designed to answer queries that require deep semantic understanding
• Example. Consider the query “When is the hurricane season in the Caribbean?”. With Yahoo! search, users still need to click into web pages to find the information
Challenges
• Example (cont.). Yahoo! Answers provides one brief, high-quality answer
Challenges
A large portion of CQA content reflects personal, unsubstantiated opinions of users, which are not useful as factual information
To retrieve correct factual answers to a question, it is necessary to determine both the relevance and the quality of candidate answers
• Explicit feedback from users, in the form of “best answer” selections or “thumbs up/down” ratings, can provide a strong indicator of the quality of an answer
• However, how to integrate explicit user feedback and relevance into a single ranking remains an open question
Proposed Solution
A ranking framework that takes advantage of user interaction information to retrieve high-quality, relevant content in social media
Learning Ranking Functions
Problem definition of QA retrieval
• Given a user query Q, a set of QA pairs is ordered according to their relevance to Q by learning a ranking function for triples of the form (qr_k, qst_i, ans_j^i)
• where qr_k is the k-th query in a set of queries, qst_i is the i-th question in a CQA system, and ans_j^i is the j-th answer to qst_i
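As a minimal illustration of this setup (the names Triple, rank, featurize, and h below are ours, not the paper’s), each candidate can be modeled as a query-question-answer triple that a learned function scores:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triple:
    query: str     # qr_k: the k-th query
    question: str  # qst_i: the i-th CQA question
    answer: str    # ans_j^i: the j-th answer to qst_i

def rank(triples: List[Triple],
         h: Callable[[list], float],
         featurize: Callable[[Triple], list]) -> List[Triple]:
    # Order candidates by the learned ranking function h over feature vectors
    return sorted(triples, key=lambda t: h(featurize(t)), reverse=True)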
User Interactions in CQA
Yahoo! Answers supports effective search of archived questions and answers, and allows its users to
• Ask questions (“Asker”)
• Answer questions (“Answerer”)
• Evaluate the system (“Evaluator”), by voting for answers of other users, marking interesting questions, and reporting abusive behavior
Features
Each query-question-answer triple is represented by
• Textual features, i.e., textual similarity between the query, question, and answer
• Statistical features, i.e., independent features of the query, question, and answer
• Social features, i.e., user interaction activities and community-based features that can approximate the users’ expertise in the QA community
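A hedged sketch of what a combined feature vector might look like; the concrete features are listed in the paper, and the names below (including the stats dictionary of interaction counts) are placeholders:

def featurize(query: str, question: str, answer: str, stats: dict) -> list:
    q, qst, ans = (set(s.lower().split()) for s in (query, question, answer))
    jaccard = lambda a, b: len(a & b) / max(len(a | b), 1)
    return [
        # Textual: similarity between query, question, and answer
        jaccard(q, qst), jaccard(q, ans),
        # Statistical: independent properties of question and answer
        len(qst), len(ans),
        # Social: community / user-interaction signals
        stats.get("answerer_best_answers", 0),
        stats.get("plus_votes", 0) - stats.get("minus_votes", 0),
    ]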
Preference Data Extraction
“User evaluation data” are extracted as a set of preference data, which can be used for ranking answers
For each query qr, consider two existing answers ans1 and ans2 under the same question qst
• Assume ans1 has p1 plus votes and m1 minus votes out of n1 impressions, whereas ans2 has p2 plus votes and m2 minus votes out of n2 impressions
• To determine whether ans1 is preferred over ans2, in terms of their relevance to qst, plus votes are assumed to follow a binomial distribution
Binomial Distribution
A binomial experiment (i.e., a sequence of Bernoulli trials) is a statistical experiment that has the following properties:
• The experiment consists of N repeated trials
• Each trial can result in just two possible outcomes, i.e., success or failure
• The probability of success, denoted by p, is the same on every trial; the probability of failure is 1 − p
• The trials are independent, i.e., the outcome of one trial does not affect the outcome of other trials
In a binomial experiment that (i) consists of N trials, (ii) results in x successes, and (iii) has success probability p on each individual trial, the binomial probability is
B(x; N, p) = C(N, x) p^x (1 − p)^(N−x)
where C(N, x) = N! / (x! (N − x)!) is the binomial coefficient, read as “N choose x”
Binomial Distribution
Example. On a 10-question multiple-choice test with 4 options per question, the probability of getting exactly 5 answers correct by guessing is
B(5; 10, 0.25) = C(10, 5) (0.25)^5 (0.75)^5 ≈ 5.8%
where p = 0.25, 1 − p = 0.75, x = 5, N = 10
Thus, somebody who guesses all 10 answers on a multiple-choice test with 4 options has about a 5.8% chance of getting exactly 5 correct
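As a quick sanity check, the example can be reproduced with Python’s standard library (binom_pmf is an illustrative helper, not from the paper):

from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    # B(x; N, p) = C(N, x) * p^x * (1 - p)^(N - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(binom_pmf(5, 10, 0.25))  # ~0.0584, i.e., about 5.8%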
Preference Data Extraction
To determine whether the pair (ans1, ans2) is significant, i.e., whether there are enough votes to compare the pair, the likelihood ratio test is applied
• If λ > threshold, then the pair is significant
To determine the preference within a significant pair, if
p1 / (p1 + m1 + s) > p2 / (p2 + m2 + s)
where s is a positive constant (from the binomial model of plus votes), then ans1 is preferred over ans2, denoted ans1 ≻ ans2; otherwise, ans2 is preferred over ans1, denoted ans2 ≻ ans1
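A minimal sketch of the preference rule reconstructed above, assuming s is the positive smoothing constant; the likelihood ratio significance test is omitted, and the function name is ours:

def prefer(p1: int, m1: int, p2: int, m2: int, s: float = 1.0) -> bool:
    # True if ans1 ≻ ans2 under the smoothed plus-vote ratio,
    # i.e., p1 / (p1 + m1 + s) > p2 / (p2 + m2 + s).
    # Apply only to pairs that passed the significance test.
    return p1 / (p1 + m1 + s) > p2 / (p2 + m2 + s)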
Preference Data Extraction
For two query-question-answer items with the same query, i.e., (qr, qst1, ans1) and (qr, qst2, ans2), let their feature vectors be X and Y
• If ans1 has a higher labeled grade than ans2, then the preference X ≻ Y is included
• If, on the other hand, ans2 has a higher labeled grade than ans1, then the preference Y ≻ X is included
Suppose the set of available preferences is
S = {⟨x_i, y_i⟩ : x_i ≻ y_i, i = 1, …, N}
where for each ⟨x, y⟩ ∈ S, x and y denote the feature vectors of two query-question-answer triples with the same query, and x ≻ y means that x is preferred over y, i.e., x should be ranked higher than y
Learning Ranking from Preference Data
The problem of learning ranking functions is cast as the problem of computing a ranking function h that matches the set of preferences, i.e.,
h(x_i) ≥ h(y_i) if x_i ≻ y_i, for i = 1, …, N
R(h) is the objective function (a squared hinge loss) that measures the risk of a given ranking function h, where x_i ≻ y_i is a contradicting pair w.r.t. h if h(x_i) < h(y_i):
R(h) = (1/2) Σ_{i=1}^{N} (max{0, h(y_i) − h(x_i)})²
where h is drawn from a function class H, chosen to be linear combinations of regression trees
The minimization problem min_{h ∈ H} R(h) is solved using functional gradient descent, an algorithm based on gradient boosting
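A direct transcription of this objective as a Python sketch (h is any scoring function; pairs holds (x_i, y_i) with x_i ≻ y_i):

def pairwise_squared_hinge(h, pairs) -> float:
    # Only contradicting pairs, i.e., those with h(x) < h(y), contribute
    return 0.5 * sum(max(0.0, h(y) - h(x)) ** 2 for x, y in pairs)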
Learning Ranking from Preference Data
Learning the ranking function h using gradient boosting (GBRank)
• An algorithm that optimizes a cost function over function space by iteratively choosing a (ranking) function that steps along the negative functional gradient
• The base learner is a decision tree; the number of boosting iterations is determined by cross-validation
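A compact GBRank-style sketch, assuming scikit-learn regression trees as base learners; it illustrates the functional gradient idea and is not the paper’s exact algorithm or hyperparameter choice:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrank(pairs, n_iters: int = 50, shrinkage: float = 0.1, tau: float = 1.0):
    # pairs: list of (x, y) NumPy feature vectors with x preferred over y
    trees = []

    def h(v):
        return sum(shrinkage * t.predict(v.reshape(1, -1))[0] for t in trees)

    for _ in range(n_iters):
        X, targets = [], []
        for x, y in pairs:
            hx, hy = h(x), h(y)
            if hx < hy + tau:  # pair is contradicted (within margin tau)
                # Fit the next tree to push h(x) up and h(y) down,
                # i.e., a step along the negative functional gradient
                X += [x, y]
                targets += [hy + tau, hx - tau]
        if not X:
            break  # all preferences satisfied
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(np.array(X), np.array(targets))
        trees.append(tree)
    return h  # the learned ranking function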
Experimental Setup
Datasets
• 1,250 factoid questions from the TREC QA benchmark data
• QA collection dataset:
• Submit each query Q to Yahoo! Answers and extract up to 10 top-ranked related questions
• Retrieve as many answers to Q as available
• Total number of <query, question, answer> tuples: 89,642, of which 17,711 are relevant and 71,931 non-relevant
Evaluation Metrics: MRR, P@K, and MAP
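Minimal per-query sketches of the three metrics, assuming rels is the binary relevance list of the ranked answers (MAP and MRR average these values over all queries):

def precision_at_k(rels, k: int) -> float:
    return sum(rels[:k]) / k

def average_precision(rels) -> float:
    hits, score = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / i  # precision at each relevant position
    return score / hits if hits else 0.0

def reciprocal_rank(rels) -> float:
    # 1/rank of the first relevant answer, 0 if none is relevant
    return next((1.0 / i for i, r in enumerate(rels, start=1) if r), 0.0)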
Ranking Methods Compared
• Baseline_Yahoo (answers ordered by posting date)
• Baseline_Votes (answers ordered by Positive_Votes − Negative_Votes)
• GBRanking (answers ordered by the proposed ranking function with community/social features)
Experimental Results
Ranking Methods Compared
For each TREC query, there is a list of Yahoo! Answers questions (YQa, YQb, …), and for each question there are multiple answers (YQa1, YQa2, …)
Experimental Results
Ranking Methods
MRR_MAX: Calculate the MRR value of each Yahoo! Answers question and choose the highest MRR value as the TREC query’s MRR
• Simulates an “intelligent” user who always selects the most relevant retrieved Yahoo! question first
MRR_STRICT: Same as MRR_MAX, but take the average of the MRR values as the TREC query’s MRR
• Simulates a user who blindly follows the Yahoo! Answers ranking and its corresponding ordered answers
MRR_RR (Round Robin): Use YQa’s 1st answer as the TREC query’s 1st answer, YQb’s 1st answer as the TREC query’s 2nd answer, and so on
• Simulates a “jumpy” user who trusts the first answers
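Sketches of the three aggregation schemes, reusing reciprocal_rank from the metrics sketch above; question_rels is assumed to hold one ranked relevance list per retrieved Yahoo! question:

def mrr_max(question_rels) -> float:
    # Best single question, as an "intelligent" user would pick
    return max(reciprocal_rank(r) for r in question_rels)

def mrr_strict(question_rels) -> float:
    # Average over the retrieved questions in their given order
    return sum(reciprocal_rank(r) for r in question_rels) / len(question_rels)

def mrr_round_robin(question_rels) -> float:
    # Interleave: 1st answer of each question, then 2nd of each, ...
    merged = [r[i]
              for i in range(max(map(len, question_rels)))
              for r in question_rels if i < len(r)]
    return reciprocal_rank(merged)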
Experimental Results
Ranking Methods Compared
MRR_MAX performs better than the other two aggregation schemes for both baselines, Baseline_Yahoo and Baseline_Votes
GBrank performs better still, achieving a relative gain of 18% over MRR_MAX
Experimental Results
Learning Ranking Function
Using 10-fold cross-validation on 400 of the 1,250 TREC queries
Experimental Results
Robustness to Noisy Labels
Use 50 manually labeled queries and randomly select 350 TREC queries with related questions and answers
Results show that a nearly optimal model is generated even when trained on noisy relevance labels
Experimental Results
Study on Feature Set
P@K when the ranking function is learned with each feature category removed in turn
Experimental Results
Study on Feature Set
Users’ evaluations play a very important role in learning the ranking function