Finding the Right Facts in the Crowd: Factoid Question
Answering over Social Media
J. Bian, Y. Liu, E. Agichtein, and H. Zha. ACM WWW, 2008
Introduction
Question Answering (QA)
• A form of information retrieval where the user’s information need is specified as a natural language question
• The desired result is a self-contained answer, not a list of documents
Community Question Answering (CQA)
• Communities organized around QA, such as Yahoo! Answers and Naver
• Archive millions of questions and hundreds of millions of answers
• A potentially more effective alternative to web search, since CQA connects users directly to others who are willing to share information
• Users receive direct responses and thus do not have to browse search engine results to locate their answers
Challenges
Searching for existing answers is crucial to avoid duplication and to save users time and effort
However, existing search engines are not designed to answer queries that require deep semantic understanding
• Example. Consider the query “When is the hurricane season in the Caribbean?”. With Yahoo! search, users still need to click into web pages to find the information
Challenges
• Example (cont.). Yahoo! Answers provides one brief, high-quality answer
Challenges
A large portion of CQA content reflects personal, unsubstantiated opinions of users, which are not useful as factual information
To retrieve correct factual answers to a question, it is necessary to determine both the relevance and the quality of candidate answers
• Explicit feedback from users, in the form of “best answer” selections or “thumbs up/down” ratings, can provide a strong indicator of the quality of an answer
• However, how to integrate explicit user feedback and relevance into a single ranking remains an open question
Proposed Solution
A ranking framework that takes advantage of user interaction information to retrieve high-quality, relevant content in social media
Learning Ranking Functions
Problem definition of QA retrieval
• Given a user query Q, a set of QA pairs is ordered according to their relevance to Q by learning a ranking function for triples of the form (qr_k, qst_i, ans_j^i)
• where qr_k is the k-th query in a set of queries, qst_i is the i-th question in a CQA system, and ans_j^i is the j-th answer to qst_i
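As a minimal illustration of this setup (the names Triple, rank, featurize, and h below are ours, not the paper’s), each candidate can be modeled as a query-question-answer triple that a learned function scores:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triple:
    query: str     # qr_k: the k-th query
    question: str  # qst_i: the i-th CQA question
    answer: str    # ans_j^i: the j-th answer to qst_i

def rank(triples: List[Triple],
         h: Callable[[list], float],
         featurize: Callable[[Triple], list]) -> List[Triple]:
    # Order candidates by the learned ranking function h over feature vectors
    return sorted(triples, key=lambda t: h(featurize(t)), reverse=True)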
User Interactions in CQA
Yahoo! Answers supports effective search of archived questions and answers, and allows its users to
• Ask questions (“Asker”)
• Answer questions (“Answerer”)
• Evaluate the system (“Evaluator”), by voting for answers of other users, marking interesting questions, and reporting abusive behavior
Features
Each query-question-answer triple is represented by
• Textual features, i.e., textual similarity between the query, question, and answer
• Statistical features, i.e., independent features of the query, question, and answer
• Social features, i.e., user interaction activities and community-based features that can approximate the users’ expertise in the QA community
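A hedged sketch of what a combined feature vector might look like; the concrete features are listed in the paper, and the names below (including the stats dictionary of interaction counts) are placeholders:

def featurize(query: str, question: str, answer: str, stats: dict) -> list:
    q, qst, ans = (set(s.lower().split()) for s in (query, question, answer))
    jaccard = lambda a, b: len(a & b) / max(len(a | b), 1)
    return [
        # Textual: similarity between query, question, and answer
        jaccard(q, qst), jaccard(q, ans),
        # Statistical: independent properties of question and answer
        len(qst), len(ans),
        # Social: community / user-interaction signals
        stats.get("answerer_best_answers", 0),
        stats.get("plus_votes", 0) - stats.get("minus_votes", 0),
    ]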
Preference Data Extraction
“User evaluation data” are extracted as a set of preference data, which can be used for ranking answers
For each query qr, consider two existing answers ans1 and ans2 under the same question qst
• Assume ans1 has p1 plus votes and m1 minus votes out of n1 impressions, whereas ans2 has p2 plus votes and m2 minus votes out of n2 impressions
• To determine whether ans1 is preferred over ans2, in terms of their relevance to qst, plus votes are assumed to follow a binomial distribution
Binomial Distribution
A binomial experiment (i.e., a sequence of Bernoulli trials) is a statistical experiment that has the following properties:
• The experiment consists of N repeated trials
• Each trial can result in just two possible outcomes, i.e., success or failure
• The probability of success, denoted by p, is the same on every trial; the probability of failure is 1 − p
• The trials are independent, i.e., the outcome of one trial does not affect the outcome of other trials
In a binomial experiment that (i) consists of N trials, (ii) results in x successes, and (iii) has success probability p on each individual trial, the binomial probability is
B(x; N, p) = C(N, x) p^x (1 − p)^(N−x)
where C(N, x) = N! / (x! (N − x)!) is the binomial coefficient, read as “N choose x”
Binomial Distribution
Example. On a 10-question multiple-choice test with 4 options per question, the probability of getting exactly 5 answers correct by guessing is
B(5; 10, 0.25) = C(10, 5) (0.25)^5 (0.75)^5 ≈ 5.8%
where p = 0.25, 1 − p = 0.75, x = 5, N = 10
Thus, somebody who guesses all 10 answers on a multiple-choice test with 4 options has about a 5.8% chance of getting exactly 5 correct
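As a quick sanity check, the example can be reproduced with Python’s standard library (binom_pmf is an illustrative helper, not from the paper):

from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    # B(x; N, p) = C(N, x) * p^x * (1 - p)^(N - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(binom_pmf(5, 10, 0.25))  # ~0.0584, i.e., about 5.8%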
Preference Data Extraction
To determine whether the pair (ans1, ans2) is significant, i.e., whether there are enough votes to compare the pair, the likelihood ratio test is applied
• If λ > threshold, then the pair is significant
To determine the preference within a significant pair, if
p1 / (p1 + m1 + s) > p2 / (p2 + m2 + s)
where s is a positive constant (from the binomial model of plus votes), then ans1 is preferred over ans2, denoted ans1 ≻ ans2; otherwise, ans2 is preferred over ans1, denoted ans2 ≻ ans1
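A minimal sketch of the preference rule reconstructed above, assuming s is the positive smoothing constant; the likelihood ratio significance test is omitted, and the function name is ours:

def prefer(p1: int, m1: int, p2: int, m2: int, s: float = 1.0) -> bool:
    # True if ans1 ≻ ans2 under the smoothed plus-vote ratio,
    # i.e., p1 / (p1 + m1 + s) > p2 / (p2 + m2 + s).
    # Apply only to pairs that passed the significance test.
    return p1 / (p1 + m1 + s) > p2 / (p2 + m2 + s)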
Preference Data Extraction
For two query-question-answer items with the same query, i.e., (qr, qst1, ans1) and (qr, qst2, ans2), let their feature vectors be X and Y
• If ans1 has a higher labeled grade than ans2, then the preference X ≻ Y is included
• If, on the other hand, ans2 has a higher labeled grade than ans1, then the preference Y ≻ X is included
Suppose the set of available preferences is
S = {⟨x_i, y_i⟩ : x_i ≻ y_i, i = 1, …, N}
where for each ⟨x, y⟩ ∈ S, x and y denote the feature vectors of two query-question-answer triples with the same query, and x ≻ y means that x is preferred over y, i.e., x should be ranked higher than y
Learning Ranking from Preference Data
The problem of learning ranking functions is cast as the problem of computing a ranking function h that matches the set of preferences, i.e.,
h(x_i) ≥ h(y_i) if x_i ≻ y_i, for i = 1, …, N
R(h) is the objective function (a squared hinge loss) that measures the risk of a given ranking function h, where x_i ≻ y_i is a contradicting pair w.r.t. h if h(x_i) < h(y_i):
R(h) = (1/2) Σ_{i=1}^{N} (max{0, h(y_i) − h(x_i)})²
where h is drawn from a function class H, chosen to be linear combinations of regression trees
The minimization problem min_{h ∈ H} R(h) is solved using functional gradient descent, an algorithm based on gradient boosting
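A direct transcription of this objective as a Python sketch (h is any scoring function; pairs holds (x_i, y_i) with x_i ≻ y_i):

def pairwise_squared_hinge(h, pairs) -> float:
    # Only contradicting pairs, i.e., those with h(x) < h(y), contribute
    return 0.5 * sum(max(0.0, h(y) - h(x)) ** 2 for x, y in pairs)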
Learning Ranking from Preference Data
Learning the ranking function h using gradient boosting (GBRank)
• An algorithm that optimizes a cost function over function space by iteratively choosing a (ranking) function that steps along the negative functional gradient
• The base learner is a decision tree; the number of boosting iterations is determined by cross-validation
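A compact GBRank-style sketch, assuming scikit-learn regression trees as base learners; it illustrates the functional gradient idea and is not the paper’s exact algorithm or hyperparameter choice:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrank(pairs, n_iters: int = 50, shrinkage: float = 0.1, tau: float = 1.0):
    # pairs: list of (x, y) NumPy feature vectors with x preferred over y
    trees = []

    def h(v):
        return sum(shrinkage * t.predict(v.reshape(1, -1))[0] for t in trees)

    for _ in range(n_iters):
        X, targets = [], []
        for x, y in pairs:
            hx, hy = h(x), h(y)
            if hx < hy + tau:  # pair is contradicted (within margin tau)
                # Fit the next tree to push h(x) up and h(y) down,
                # i.e., a step along the negative functional gradient
                X += [x, y]
                targets += [hy + tau, hx - tau]
        if not X:
            break  # all preferences satisfied
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(np.array(X), np.array(targets))
        trees.append(tree)
    return h  # the learned ranking function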
Experimental Setup
Datasets
• 1,250 factoid questions from the TREC QA benchmark data
• QA collection dataset:
• Submit each query Q to Yahoo! Answers and extract up to 10 top-ranked related questions
• Retrieve as many answers to Q as available
• Total number of <query, question, answer> tuples: 89,642, of which 17,711 are relevant and 71,931 non-relevant
Evaluation Metrics: MRR, P@K, and MAP
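Minimal per-query sketches of the three metrics, assuming rels is the binary relevance list of the ranked answers (MAP and MRR average these values over all queries):

def precision_at_k(rels, k: int) -> float:
    return sum(rels[:k]) / k

def average_precision(rels) -> float:
    hits, score = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / i  # precision at each relevant position
    return score / hits if hits else 0.0

def reciprocal_rank(rels) -> float:
    # 1/rank of the first relevant answer, 0 if none is relevant
    return next((1.0 / i for i, r in enumerate(rels, start=1) if r), 0.0)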
Ranking Methods Compared
• Baseline_Yahoo (answers ordered by posting date)
• Baseline_Votes (answers ordered by Positive_Votes − Negative_Votes)
• GBRanking (answers ordered by the proposed ranking function with community/social features)
Experimental Results
Ranking Methods Compared
For each TREC query, there is a list of Yahoo! Answers questions (YQa, YQb, …), and for each question there are multiple answers (YQa1, YQa2, …)
Experimental Results
Ranking Methods
MRR_MAX: Calculate the MRR value of each Yahoo! Answers question and choose the highest MRR value as the TREC query’s MRR
• Simulates an “intelligent” user who always selects the most relevant retrieved Yahoo! question first
MRR_STRICT: Same as MRR_MAX, but take the average of the MRR values as the TREC query’s MRR
• Simulates a user who blindly follows the Yahoo! Answers ranking and its corresponding ordered answers
MRR_RR (Round Robin): Use YQa’s 1st answer as the TREC query’s 1st answer, YQb’s 1st answer as the TREC query’s 2nd answer, and so on
• Simulates a “jumpy” user who trusts the first answers
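Sketches of the three aggregation schemes, reusing reciprocal_rank from the metrics sketch above; question_rels is assumed to hold one ranked relevance list per retrieved Yahoo! question:

def mrr_max(question_rels) -> float:
    # Best single question, as an "intelligent" user would pick
    return max(reciprocal_rank(r) for r in question_rels)

def mrr_strict(question_rels) -> float:
    # Average over the retrieved questions in their given order
    return sum(reciprocal_rank(r) for r in question_rels) / len(question_rels)

def mrr_round_robin(question_rels) -> float:
    # Interleave: 1st answer of each question, then 2nd of each, ...
    merged = [r[i]
              for i in range(max(map(len, question_rels)))
              for r in question_rels if i < len(r)]
    return reciprocal_rank(merged)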
Experimental Results
Ranking Methods Compared
MRR_MAX performs better than the other two aggregation schemes for both baselines, Baseline_Yahoo and Baseline_Votes
GBrank performs better still, achieving a relative gain of 18% over MRR_MAX
Experimental Results
Learning Ranking Function
Using 10-fold cross-validation on 400 of the 1,250 TREC queries
Experimental Results
Robustness to Noisy Labels
Use 50 manually labeled queries and randomly select 350 TREC queries with related questions and answers
Results show that a nearly optimal model is generated even when trained on noisy relevance labels
Experimental Results
Study on Feature Set
P@K when the ranking function is learned with each feature category removed in turn
Experimental Results
Study on Feature Set
Users’ evaluations play a very important role in learning the ranking function