Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Ilaria Bordino, Yelena Mejova, and Mounia Lalmas (Yahoo Labs)

ACM International Conference on Information and Knowledge Management (CIKM 2013)

October 29th, 2013

Entity Search

We build an entity-driven serendipitous search system based on enriched entity networks extracted from Wikipedia and Yahoo! Answers.

Serendipity: finding something good or useful while not specifically looking for it. Serendipitous search systems provide relevant and interesting results.

Why/when do penguins wear sweaters?

WHAT
1. What connections between entities do web community knowledge portals offer?

WHY
2. How do they contribute to an interesting, serendipitous browsing experience?

Yahoo! Answers vs Wikipedia

Yahoo! Answers: community-driven question & answer portal
• 67M questions & 262M answers
• 2 years [2010/2011]
• English-language
• minimally curated
• opinions, gossip, personal info
• variety of points of view

Wikipedia: community-driven encyclopedia
• 3,795,865 articles
• from end of December 2011
• English Wikipedia
• curated
• high-quality knowledge
• variety of niche topics

Entity & Relationship Extraction

Entity: any concept having a Wikipedia page

Use an internal tool to

(1) identify surface forms,

(2) resolve to Wikipedia entities,

(3) rank entities using aboutness score;

W. Zhao, J. Jiang, J. Weng, J. He, E.P. Lim, H. Yan, and X. Li. Comparing twitter and traditional media using topic models. ECIR 2011.
D. Paranjpe. Learning document aboutness from implicit user feedback and document structure. CIKM 2009.

Relationship: cosine similarity of tf/idf vectors (each entity's vector is built from the concatenation of the documents in which the entity appears)
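The relationship definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace tokenization, raw term counts, and a plain logarithmic idf are choices made here, not details given in the slides, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(entity_docs):
    """Build one tf-idf vector per entity from its concatenated documents."""
    # Term counts over the concatenation of each entity's documents
    # (ASSUMPTION: lowercased whitespace tokens).
    term_counts = {
        entity: Counter(" ".join(docs).lower().split())
        for entity, docs in entity_docs.items()
    }
    n = len(term_counts)
    # Document frequency: number of entity pseudo-documents containing a term.
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    return {
        entity: {t: tf * math.log(n / doc_freq[t]) for t, tf in counts.items()}
        for entity, counts in term_counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: entity -> documents mentioning it.
entity_docs = {
    "Penguin": ["penguins wear sweaters after an oil spill",
                "penguin rescue volunteers knit sweaters"],
    "Oil spill": ["an oil spill harms penguins and seabirds"],
    "Netflix": ["netflix streams movies", "movies on blu-ray disc"],
}
vectors = tfidf_vectors(entity_docs)
sim_related = cosine(vectors["Penguin"], vectors["Oil spill"])
sim_unrelated = cosine(vectors["Penguin"], vectors["Netflix"])
```

In an entity network built this way, each nonzero similarity would become a weighted edge between two entities; entities sharing no terms with any other entity would end up isolated.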

Dataset Features

Sentiment
› using SentiStrength, compute positive & negative scores
› compute attitude and sentimentality [Kucuktunc’12]
› entity-level scores

Quality
› Flesch Reading Ease score

[Figure panels: Attitude (Polarity), Sentimentality (Strength), Readability]

Topical Category
› Yahoo Content Taxonomy
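The sentiment and quality features can be illustrated with a short sketch. The Flesch Reading Ease formula is the standard one. The attitude and sentimentality definitions, however, are an assumption made here for illustration (difference and shifted sum of the two SentiStrength strengths, each taken as a magnitude from 1 to 5); the slides only cite [Kucuktunc’12] and do not spell them out. All function names are hypothetical.

```python
def attitude(pos, neg):
    """Polarity of a text.

    ASSUMPTION: pos and neg are SentiStrength strengths given as
    magnitudes in 1..5, and attitude is their difference.
    """
    return pos - neg


def sentimentality(pos, neg):
    """Strength of sentiment, regardless of direction.

    ASSUMPTION: sum of the two magnitudes, shifted by 2 so that a
    neutral text (pos = 1, neg = 1) scores 0.
    """
    return pos + neg - 2


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease score; higher means easier to read."""
    return (206.835
            - 1.015 * (n_words / n_sentences)
            - 84.6 * (n_syllables / n_words))


# A text with fairly strong positive and mild negative sentiment.
att = attitude(4, 2)
sent = sentimentality(4, 2)

# A passage of 100 words, 5 sentences, and 150 syllables.
score = flesch_reading_ease(100, 5, 150)
```

How per-document scores are aggregated into the entity-level scores mentioned above is not specified in the slides, so the sketch stops at the document level.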


Dataset          # Nodes     # Edges       # Isolated
Yahoo! Answers   896,799     112,595,138   69,856
Wikipedia        1,754,069   237,058,218   82,381

[Figure panels: Wikipedia; Yahoo! Answers]

Retrieval

                Wikipedia   Yahoo! Answers   Combined
Precision @ 5   0.668       0.724            0.744
MAP             0.716       0.762            0.782
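For readers unfamiliar with the two measures in the table above, here is a minimal sketch of how Precision @ 5 and MAP are typically computed from ranked relevance labels. It assumes the labels have already been reduced to binary relevant/non-relevant judgments (the slides do not say how the three labels per pair were combined), and it normalizes average precision by the number of relevant results retrieved, which is a simplification. The label lists are made up.

```python
def precision_at_k(relevance, k=5):
    """Fraction of the top-k results that are relevant.

    relevance: list of 0/1 labels in ranked order.
    """
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of precision values at the ranks of the relevant results.

    SIMPLIFICATION: normalized by the relevant results actually
    retrieved, not by all relevant entities in the collection.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """Average of per-query average precision."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Hypothetical binary labels for the top 5 results of two queries.
runs = {
    "Steve Jobs": [1, 1, 0, 1, 1],
    "Egypt": [1, 0, 1, 0, 0],
}
p5 = {query: precision_at_k(labels) for query, labels in runs.items()}
map_score = mean_average_precision(list(runs.values()))
```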

Queries: Justin Bieber, Nicki Minaj, Katy Perry, Shakira, Eminem, Lady Gaga, Jose Mourinho, Selena Gomez, Kim Kardashian, Miley Cyrus, Robert Pattinson, Adele (singer), Steve Jobs, Osama bin Laden, Ron Paul, Twitter, Facebook, Netflix, IPad, IPhone, Touchpad, Kindle, Olympic Games, Cricket, FIFA, Tennis, Mount Everest, Eiffel Tower, Oxford Street, Nürburgring, Haiti, Chile, Libya, Egypt, Middle East, Earthquake, Oil spill, Tsunami, Subprime mortgage crisis, Bailout, Terrorism, Asperger syndrome, McDonald's, Vitamin D, Appendicitis, Cholera, Influenza, Pertussis, Vaccine, Childbirth

3 labels per query-result pair

Query: Steve Jobs
Yahoo! Answers: Jon Rubinstein, Timothy Cook, Kane Kramer, Steve Wozniak, Jerry York
Wikipedia: System 7, PowerPC G4, SuperDrive, Power Macintosh, Power Computing Corp.

Annotator agreement (overlap): 0.85

Average overlap in top 5 results: 12%

Algorithm: Lazy Random walk with restart
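A lazy random walk with restart can be sketched as a simple power iteration over the weighted entity graph. This is an illustration under assumptions, not the authors' implementation: the restart probability, the laziness (self-loop) probability, the iteration count, and the toy graph are all hypothetical values chosen here.

```python
def lazy_rwr(graph, source, restart=0.15, laziness=0.5, iterations=100):
    """Score nodes by a lazy random walk with restart from `source`.

    graph: dict mapping node -> {neighbor: edge weight}; every neighbor
    must also appear as a key.
    restart, laziness, iterations: ASSUMED values, not from the slides.
    """
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            if mass == 0.0:
                continue
            # With probability `restart`, jump back to the query entity.
            nxt[source] += restart * mass
            walk = (1.0 - restart) * mass
            total = sum(graph[node].values())
            if total == 0:
                # Isolated node: the walker can only stay in place.
                nxt[node] += walk
                continue
            # Lazy step: stay put with probability `laziness` ...
            nxt[node] += laziness * walk
            # ... otherwise move to a neighbor in proportion to edge weight.
            for neighbor, weight in graph[node].items():
                nxt[neighbor] += (1.0 - laziness) * walk * weight / total
        scores = nxt
    return scores


# Hypothetical toy entity network with similarity weights on the edges.
graph = {
    "Oil spill": {"Penguin": 0.6, "Tsunami": 0.2},
    "Penguin": {"Oil spill": 0.6, "Sweater": 0.5},
    "Sweater": {"Penguin": 0.5},
    "Tsunami": {"Oil spill": 0.2},
}
scores = lazy_rwr(graph, "Oil spill")
ranking = sorted((n for n in graph if n != "Oil spill"),
                 key=scores.get, reverse=True)
```

Because the walk keeps returning to the query entity, nodes close to it in the graph accumulate the most probability mass, while the self-loop slows diffusion so that multi-hop neighbors can still outrank weakly connected direct neighbors.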

Serendipity

“making fortunate discoveries by accident”

M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. IRecSys 2010.

Serendipity = unexpectedness + relevance
• “Expected” result baselines from web search

Serendipity = interestingness + relevance
• Result interestingness given the query
• Personal interest in result

P. Andre, J. Teevan, and S. T. Dumais. From x-rays to silly putty via uranus: Serendipity and its role in web search. SIGCHI 2009.

Baselines
Top: 5 entities that occur most frequently in the top 5 search results provided by Bing and Google
Top –WP: same as above, but excluding the Wikipedia page from the set of results
Rel: top 5 entities in the related query suggestions provided by Bing and Google
Rel + Top: union of Top and Rel

Baseline    Data   General       High Read.
Top         WP     0.63 (0.58)   0.56 (0.53)
            YA     0.69 (0.63)   0.71 (0.65)
            Comb   0.70 (0.61)   0.68 (0.61)
Top –WP     WP     0.63 (0.58)   0.56 (0.54)
            YA     0.70 (0.64)   0.71 (0.66)
            Comb   0.71 (0.64)   0.68 (0.63)
Rel         WP     0.64 (0.61)   0.57 (0.56)
            YA     0.70 (0.65)   0.71 (0.66)
            Comb   0.72 (0.67)   0.69 (0.65)
Rel + Top   WP     0.61 (0.54)   0.55 (0.51)
            YA     0.68 (0.57)   0.69 (0.59)
            Comb   0.68 (0.55)   0.66 (0.56)

| relevant & unexpected | / | unexpected |: number of serendipitous results out of all of the unexpected results retrieved

| relevant & unexpected | / | retrieved |: serendipitous out of all retrieved
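The two ratios above translate directly into code. In the sketch below, a result counts as unexpected when it is absent from the baseline ("expected") set. The retrieved list reuses the Yahoo! Answers results shown earlier for Steve Jobs, while the relevance labels and the expected set are invented for illustration.

```python
def serendipity_scores(retrieved, relevant, expected):
    """Return (serendipitous / unexpected, serendipitous / retrieved).

    retrieved: ranked results of the system
    relevant:  set of results judged relevant
    expected:  set of results returned by a web-search baseline
    """
    retrieved = list(retrieved)
    unexpected = [r for r in retrieved if r not in expected]
    serendipitous = [r for r in unexpected if r in relevant]
    over_unexpected = len(serendipitous) / len(unexpected) if unexpected else 0.0
    over_retrieved = len(serendipitous) / len(retrieved) if retrieved else 0.0
    return over_unexpected, over_retrieved


retrieved = ["Jon Rubinstein", "Timothy Cook", "Kane Kramer",
             "Steve Wozniak", "Jerry York"]
# HYPOTHETICAL labels and baseline, for illustration only.
relevant = {"Jon Rubinstein", "Timothy Cook", "Kane Kramer", "Steve Wozniak"}
expected = {"Steve Wozniak", "Apple Inc."}

over_unexpected, over_retrieved = serendipity_scores(retrieved, relevant, expected)
```

The second ratio can never exceed the first, since the unexpected results are a subset of the retrieved ones.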


User-perceived Quality


1. Which result is more relevant to the query?

2. If someone is interested in the query, would they also be interested in these results?

3. Even if you are not interested in the query, are these results interesting to you personally?

4. Would you learn anything new about the query?

Interestingness

Labelers provide pairwise comparisons between results; these are combined into a reference ranking, and each result ranking is compared to this optimal ranking (Kendall’s tau-b).
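The evaluation procedure above can be sketched as follows. Counting pairwise wins to form the reference ranking is an assumption made here (the slides do not say how the comparisons are combined); the tau-b computation itself follows the standard definition with tie correction. Item names and comparisons are placeholders.

```python
import math
from collections import Counter
from itertools import combinations


def reference_scores(comparisons):
    """Turn (winner, loser) judgments into per-item win counts.

    ASSUMPTION: the reference ranking orders items by number of wins.
    """
    wins = Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0  # ensure items that never win still appear
    return dict(wins)


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither factor
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


# Placeholder results in the order the system ranked them (best first).
system_order = ["A", "B", "C", "D"]
comparisons = [("A", "B"), ("A", "C"), ("A", "D"),
               ("B", "D"), ("C", "B"), ("C", "D")]

ref = reference_scores(comparisons)
system_scores = [len(system_order) - i for i in range(len(system_order))]
ref_scores = [ref[item] for item in system_order]
tau = kendall_tau_b(system_scores, ref_scores)
```

A tau-b of 1 would mean the system ordering matches the reference exactly; here one swapped pair (B and C) lowers the value.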


Agreement: Relevance (83%), Query interest (81%), Personal interest (76%), Learning something new (81%)

                         Query              Result                  Data
Interesting > Relevant   Oil Spill          Sweaters for Penguins   WP
                         Robert Pattinson   Water for Elephants     WP
                         Egypt              Ptolemaic Kingdom       WP & YA
Relevant > Interesting   Egypt              Cairo Conference        WP
                         Netflix            Blu-ray Disc            YA

J. Arguello, F. Diaz, J. Callan, and B. Carterette. A methodology for evaluating aggregated search results. ECIR 2011.

Question                                            Data   General   +Topic
Which result is more relevant to the query?         WP     0.162     0.194
                                                    YA     0.336     0.374
                                                    Comb   0.201     0.222
If someone is interested in the query, would they   WP     0.162     0.176
also be interested in the result?                   YA     0.312     0.343
                                                    Comb   0.184     0.222
Even if you are not interested in the query, is     WP     0.139     0.144
the result interesting to you personally?           YA     0.324     0.359
                                                    Comb   0.168     0.198
Would you learn anything new about the query from   WP     0.167     0.164
this result?                                        YA     0.307     0.346
                                                    Comb   0.184     0.203

Similarity (Kendall’s tau-b) between result sets and reference ranking

• Topical category constraint: promote results of the same topic as the query entity
• Sentiment and readability constraints hurt performance

What did we learn?

1. What connections between entities do web community knowledge portals offer?

2. How do they contribute to an interesting, serendipitous browsing experience?

[Slide graphics: "≠ ANSWERS", "> ANSWERS"]


Thank you!!!!!
