Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Ilaria Bordino, Yelena Mejova, and Mounia Lalmas (Yahoo Labs). ACM International Conference on Information and Knowledge Management (CIKM 2013), October 29th, 2013

Upload: mounia-lalmas

Post on 10-May-2015


DESCRIPTION

In many cases, when browsing the Web users are searching for specific information or answers to concrete questions. Sometimes, though, users find unexpected, yet interesting and useful results, and are encouraged to explore further. What makes a result serendipitous? We propose to answer this question by exploring the potential of entities extracted from two sources of user-generated content - Wikipedia, a user-curated online encyclopedia, and Yahoo! Answers, a more unconstrained question/answering forum - in promoting serendipitous search. In this work, the content of each data source is represented as an entity network, which is further enriched with metadata about sentiment, writing quality, and topical category. We devise an algorithm based on lazy random walk with restart to retrieve entity recommendations from the networks. We show that our method provides novel results from both datasets, compared to standard web search engines. However, unlike previous research, we find that choosing highly emotional entities does not increase user interest for many categories of entities, suggesting a more complex relationship between topic matter and the desirable metadata attributes in serendipitous search.

TRANSCRIPT

Page 1: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Ilaria Bordino, Yelena Mejova, and Mounia Lalmas (Yahoo Labs)

ACM International Conference on Information and Knowledge Management (CIKM 2013)

October 29th, 2013

Page 2: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Entity Search

We build an entity-driven serendipitous search system based on enriched entity networks extracted from Wikipedia and Yahoo! Answers.

Serendipity: finding something good or useful while not specifically looking for it. Serendipitous search systems provide relevant and interesting results.

Why/when do penguins wear sweaters?

Page 3: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

WHAT

1. What connections between entities do web community knowledge portals offer?

WHY

2. How do they contribute to an interesting, serendipitous browsing experience?

Page 4: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Yahoo Answers vs Wikipedia

Yahoo! Answers
• community-driven question & answer portal
• 67M questions & 262M answers
• 2 years [2010/2011]
• English-language
• minimally curated
• opinions, gossip, personal info
• variety of points of view

Wikipedia
• community-driven encyclopedia
• 3,795,865 articles
• from end of December 2011
• English Wikipedia
• curated
• high-quality knowledge
• variety of niche topics

Page 5: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Entity & Relationship Extraction

Entity: any concept having a Wikipedia page

Use an internal tool to

(1) identify surface forms,

(2) resolve to Wikipedia entities,

(3) rank entities using aboutness score;

W. Zhao, J. Jiang, J. Weng, J. He, E.P. Lim, H. Yan, and X. Li. Comparing twitter and traditional media using topic models. ECIR 2011.

D. Paranjpe. Learning document aboutness from implicit user feedback and document structure. CIKM 2009.

Relationship: Cosine similarity of tf/idf vectors (concatenation of documents where entity appears)
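
The relationship definition above can be sketched in a few lines of Python. This is a minimal illustration only: the tokenization, the raw-count tf, the log idf weighting and the toy documents are all assumptions, since the slide does not specify them.

```python
import math
from collections import Counter


def tfidf_vectors(entity_docs):
    """Build one tf-idf vector per entity.

    entity_docs maps an entity name to the token list of its "document",
    i.e. the concatenation of all documents in which the entity appears.
    """
    n = len(entity_docs)
    df = Counter()
    for tokens in entity_docs.values():
        df.update(set(tokens))  # document frequency: count each term once per entity
    vectors = {}
    for entity, tokens in entity_docs.items():
        tf = Counter(tokens)
        # Assumed weighting: raw term count times log(N / df).
        vectors[entity] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, hypothetical entity documents (not taken from the datasets).
docs = {
    "Penguin": "oil spill penguin sweater rescue".split(),
    "Oil spill": "oil spill tanker penguin".split(),
    "Kindle": "ebook reader tablet".split(),
}
vecs = tfidf_vectors(docs)
edge_weight = cosine(vecs["Penguin"], vecs["Oil spill"])
```

Here `edge_weight` would be the weight of the edge between the two entities in the network; entities that share no terms get similarity 0 and therefore no edge.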

Page 6: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Dataset Features

Sentiment
› using SentiStrength, compute positive & negative scores
› compute attitude and sentimentality [Kucuktunc’12]
› entity-level scores

Quality
› Flesch Reading Ease score

Topical Category
› Yahoo Content Taxonomy

Features: Attitude (Polarity), Sentimentality (Strength), Readability

Dataset           # Nodes      # Edges        # Isolated
Yahoo! Answers    896,799      112,595,138    69,856
Wikipedia         1,754,069    237,058,218    82,381
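
The slide names the features but not their formulas. The sketch below is an assumption on two counts: it takes attitude and sentimentality to be the sum and the spread of SentiStrength's positive (1 to 5) and negative (-1 to -5) scores, which is how I recall them from the cited Kucuktunc et al. work, and it leaves word, sentence and syllable counting to the caller. Only the Flesch Reading Ease formula itself is the standard published one.

```python
def attitude(pos, neg):
    """Polarity: sum of the positive (1..5) and negative (-1..-5) scores.

    Assumed definition; the sign says whether the text leans positive or negative.
    """
    return pos + neg


def sentimentality(pos, neg):
    """Strength: how far both scores are from neutral (1, -1).

    Assumed definition; the -2 offset makes a neutral text score 0.
    """
    return pos - neg - 2


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease; higher values mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


# Hypothetical inputs for illustration.
strong_positive = (attitude(4, -2), sentimentality(4, -2))
neutral = (attitude(1, -1), sentimentality(1, -1))
readability = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
```

How per-document scores are aggregated into the entity-level scores mentioned on the slide is not stated, so no aggregation is shown.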

Page 7: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

[Figure: one panel each for Wikipedia and Yahoo! Answers]

Page 8: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Retrieval

                 Wikipedia    Yahoo! Answers    Combined
Precision @ 5    0.668        0.724             0.744
MAP              0.716        0.762             0.782
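
For reference, the two measures in the table can be computed as follows. This is a sketch under assumptions: relevance is taken as binary per result, and average precision is normalized by the number of relevant results retrieved, because the slides do not say how the labels were aggregated or which normalization was used.

```python
def precision_at_k(relevance, k=5):
    """Fraction of the top-k results that are relevant (1) rather than not (0)."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of precision values taken at each rank holding a relevant result."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Hypothetical relevance judgments for two queries, in ranked order.
run_a = [1, 0, 1, 1, 0]
run_b = [1, 1, 1, 1, 1]
p5 = precision_at_k(run_a)
map_score = mean_average_precision([run_a, run_b])
```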

Justin Bieber, Nicki Minaj, Katy Perry, Shakira, Eminem, Lady Gaga, Jose Mourinho, Selena Gomez, Kim Kardashian, Miley Cyrus, Robert Pattinson, Adele (singer), Steve Jobs, Osama bin Laden, Ron Paul, Twitter, Facebook, Netflix, IPad, IPhone, Touchpad, Kindle, Olympic Games, Cricket, FIFA, Tennis, Mount Everest, Eiffel Tower, Oxford Street, Nürburgring, Haiti, Chile, Libya, Egypt, Middle East, Earthquake, Oil spill, Tsunami, Subprime mortgage crisis, Bailout, Terrorism, Asperger syndrome, McDonald's, Vitamin D, Appendicitis, Cholera, Influenza, Pertussis, Vaccine, Childbirth

3 labels per query-result pair

Query: Steve Jobs
Yahoo! Answers: Jon Rubinstein, Timothy Cook, Kane Kramer, Steve Wozniak, Jerry York
Wikipedia: System 7, PowerPC G4, SuperDrive, Power Macintosh, Power Computing Corp.

Annotator agreement (overlap): 0.85

Average overlap in top 5 results: 12%

Algorithm: Lazy Random walk with restart
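
The slide only names the algorithm, so the following is a generic sketch of a lazy random walk with restart, not the authors' implementation. The restart and laziness probabilities, the power-iteration stopping rule, and the toy graph are assumptions, and the sentiment, readability and topic constraints are not modeled.

```python
def lazy_rwr(graph, source, restart=0.15, lazy=0.5, iters=200, tol=1e-12):
    """Score nodes by a lazy random walk with restart from `source`.

    graph maps each node to a dict {neighbor: edge weight}; every neighbor
    must also be a key of graph. At each step the walker jumps back to the
    source with probability `restart`; otherwise it stays where it is with
    probability `lazy` or moves to a neighbor in proportion to edge weight.
    """
    p = {v: 0.0 for v in graph}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        for u, mass in p.items():
            if mass == 0.0:
                continue
            nxt[source] += restart * mass
            walk = (1.0 - restart) * mass
            total = sum(graph[u].values())
            if total == 0:
                nxt[u] += walk  # isolated node: the walker can only stay
                continue
            nxt[u] += lazy * walk
            for v, w in graph[u].items():
                nxt[v] += (1.0 - lazy) * walk * w / total
        delta = sum(abs(nxt[v] - p[v]) for v in graph)
        p = nxt
        if delta < tol:
            break
    return p


def recommend(graph, source, k=5):
    """Top-k nodes by walk score, excluding the query entity itself."""
    scores = lazy_rwr(graph, source)
    ranked = sorted((v for v in graph if v != source), key=lambda v: -scores[v])
    return ranked[:k]


# Toy, hypothetical entity network with similarity weights on the edges.
g = {
    "Oil spill": {"Penguin": 0.6, "Tanker": 0.4},
    "Penguin": {"Oil spill": 0.6, "Sweater": 0.9},
    "Sweater": {"Penguin": 0.9},
    "Tanker": {"Oil spill": 0.4},
    "Kindle": {},
}
scores = lazy_rwr(g, "Oil spill")
top = recommend(g, "Oil spill", k=3)
```

The restart keeps the scores anchored to the query entity, while the walk lets weight reach entities two or more hops away, which is where less obvious recommendations would come from.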

Page 9: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Serendipity

“making fortunate discoveries by accident”

M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. IRecSys 2010.

Serendipity = unexpectedness + relevance
› “Expected” result baselines from web search

Serendipity = interestingness + relevance
› Result interestingness given the query
› Personal interest in result

P. Andre, J. Teevan, and S. T. Dumais. From x-rays to silly putty via uranus: Serendipity and its role in web search. SIGCHI 2009.

Page 10: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Baselines
Top: 5 entities that occur most frequently in the top 5 search results provided by Bing and Google
Top –WP: same as above, but excluding the Wikipedia page from the set of results
Rel: top 5 entities in the related query suggestions provided by Bing and Google
Rel + Top: union of Top and Rel

Baseline     Data    General        High Read.
Top          WP      0.63 (0.58)    0.56 (0.53)
Top          YA      0.69 (0.63)    0.71 (0.65)
Top          Comb    0.70 (0.61)    0.68 (0.61)
Top –WP      WP      0.63 (0.58)    0.56 (0.54)
Top –WP      YA      0.70 (0.64)    0.71 (0.66)
Top –WP      Comb    0.71 (0.64)    0.68 (0.63)
Rel          WP      0.64 (0.61)    0.57 (0.56)
Rel          YA      0.70 (0.65)    0.71 (0.66)
Rel          Comb    0.72 (0.67)    0.69 (0.65)
Rel + Top    WP      0.61 (0.54)    0.55 (0.51)
Rel + Top    YA      0.68 (0.57)    0.69 (0.59)
Rel + Top    Comb    0.68 (0.55)    0.66 (0.56)

Measures
| relevant & unexpected | / | unexpected |: number of serendipitous results out of all of the unexpected results retrieved
| relevant & unexpected | / | retrieved |: serendipitous out of all retrieved
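
The two ratios defined on this slide can be written directly as set operations. In this sketch a result counts as unexpected when it is absent from the baseline's "expected" set; the relevance judgments and the baseline set in the example are hypothetical.

```python
def serendipity_scores(retrieved, relevant, expected):
    """Return (serendipitous / unexpected, serendipitous / retrieved).

    retrieved: ranked list of results returned for a query
    relevant:  set of results judged relevant
    expected:  set of results a baseline (e.g. web search) already returns
    """
    unexpected = [e for e in retrieved if e not in expected]
    serendipitous = [e for e in unexpected if e in relevant]
    over_unexpected = len(serendipitous) / len(unexpected) if unexpected else 0.0
    over_retrieved = len(serendipitous) / len(retrieved) if retrieved else 0.0
    return over_unexpected, over_retrieved


# Results taken from the Steve Jobs example; the judgments below are made up.
retrieved = ["Jon Rubinstein", "Timothy Cook", "Kane Kramer", "Steve Wozniak", "Jerry York"]
relevant = {"Jon Rubinstein", "Timothy Cook", "Steve Wozniak"}  # hypothetical labels
expected = {"Steve Wozniak"}  # hypothetical baseline output
ratios = serendipity_scores(retrieved, relevant, expected)
```

Because the unexpected results are a subset of the retrieved ones, the second ratio can never exceed the first.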

Page 11: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

User-perceived Quality


1. Which result is more relevant to the query?

2. If someone is interested in the query, would they also be interested in these results?

3. Even if you are not interested in the query, are these results interesting to you personally?

4. Would you learn anything new about the query?

Page 12: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Interestingness

Labelers provide pairwise comparisons between results; combine into a reference ranking and compare result ranking to optimal (Kendall’s tau-b)

Agreement: Relevance (83%), Query interest (81%), Personal interest (76%), Learning something new (81%)

Interesting > Relevant
Oil Spill → Sweaters for Penguins (WP)
Robert Pattinson → Water for Elephants (WP)
Egypt → Ptolemaic Kingdom (WP & YA)

Relevant > Interesting
Egypt → Cairo Conference (WP)
Netflix → Blu-ray Disc (YA)

J. Arguello, F. Diaz, J. Callan, and B. Carterette. A methodology for evaluating aggregated search results. ECIR 2011.
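
Kendall's tau-b, the rank correlation used for the comparison above, can be computed from two lists of ranks as in the sketch below. It uses the standard tau-b definition with tie correction; how the pairwise labels are combined into the reference ranking is not described on the slide, so the ranks are simply taken as given, and the example values are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of ranks or scores."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in x (counted in y too if dy is also 0)
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical ranks: position of each result in the system and reference rankings.
system_rank = [1, 2, 3, 4, 5]
reference_rank = [1, 3, 2, 4, 5]
tau = kendall_tau_b(system_rank, reference_rank)
```

A value of 1 means the system ordering matches the reference exactly, -1 means it is reversed, and values near 0 mean the two orderings are unrelated.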

Page 13: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

Similarity (Kendall’s tau-b) between result sets and reference ranking

Data    General    +Topic

Which result is more relevant to the query?
WP      0.162      0.194
YA      0.336      0.374
Comb    0.201      0.222

If someone is interested in the query, would they also be interested in the result?
WP      0.162      0.176
YA      0.312      0.343
Comb    0.184      0.222

Even if you are not interested in the query, is the result interesting to you personally?
WP      0.139      0.144
YA      0.324      0.359
Comb    0.168      0.198

Would you learn anything new about the query from this result?
WP      0.167      0.164
YA      0.307      0.346
Comb    0.184      0.203

Topical category constraint: promote results of same topic as query entity

Sentiment and Readability constraints hurt performance

Page 14: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content

What did we learn?

1. What connections between entities do web community knowledge portals offer?

≠ANSWERS

2. How do they contribute to an interesting, serendipitous browsing experience?

>ANSWERS

Page 15: Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content


Thank you!!!!!