google tech talk: reconsidering relevance

54
© 2009 Endeca Technologies, Inc. All rights reserved. Reconsidering Relevance Daniel Tunkelang Chief Scientist, Endeca

Upload: daniel-tunkelang

Post on 09-May-2015

46.266 views

Category:

Technology


0 download

DESCRIPTION

Reconsidering RelevanceWe've become complacent about relevance. The overwhelming success of web search engines has lulled even information retrieval (IR) researchers to expect only incremental improvements in relevance in the near future. And beyond web search, there are still broad search problems where relevance still feels hopelessly like the pre-Google web.But even some of the most basic IR questions about relevance are unresolved. We take for granted the very idea that a computer can determine which documents are relevant to a person's needs. And we still rely on two-word queries (on average) to communicate a user's information need. But this approach is a contrivance; in reality, we need to think of information-seeking as a problem of optimizing the communication between people and machines.We can do better. In fact, there are a variety of ongoing efforts to do so, often under the banners of "interactive information retrieval", "exploratory search", and "human computer information retrieval". In this talk, I'll discuss these initiatives and how they are helping to move "relevance" beyond today's outdated assumptions.About the SpeakerDaniel Tunkelang is co-founder and Chief Scientist at Endeca, a leading provider of enterprise information access solutions. He leads Endeca's efforts to develop features and capabilities that emphasize user interaction. Daniel has spearheaded the annual Workshops on Human Computer Information Retrieval (HCIR) and is organizing the Industry Track for SIGIR '09. Daniel also publishes The Noisy Channel, a widely read and cited blog that focuses on how people interact with information.Daniel holds undergraduate degrees in mathematics and computer science from the Massachusetts Institute of Technology, with a minor in psychology. He completed a PhD at Carnegie Mellon University for his work on information visualization. His work previous to Endeca includes stints at the IBM T. J. Watson Research Center and AT&T Labs,

TRANSCRIPT

Page 1: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.

Reconsidering Relevance

Daniel TunkelangChief Scientist, Endeca

Page 2: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.2

howdy!

• 1988 – 1992

• 1993 – 1998

• 1999 -

Page 3: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.3

overview

what is relevance?

what’s wrong with relevance?

what are the alternatives?

Page 4: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.4

but first let’s set the stage

Page 5: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.5

iconic businesses of the 20th and 21st centuries

I’m Feeling Lucky

Page 6: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.6

process and scale orchestration

Page 7: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.7

but there’s a dark side

Page 8: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.8

users are satisfied

Page 9: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.9

an interesting contrast

“Search on the internet is solved. I always find what I need.

But why not in the enterprise?

Seems like a solution waiting to happen.”

- a Fortune 500 CTO

Page 10: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.10

the real questions

• What is “search on the internet” and why is it perceived a solved problem?

• What is “search in the enterprise” and why is it perceived as an unsolved problem?

• And what does this have to do with relevance?

Page 11: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.11

easy vs. hard search problems

• easywhere to buy Ender in Exile?

• hardgood novel to read on the beach?

• easyproof that sorting has n log n lower bound?

• hardalgorithm to sort partially ordered set, given a constant-time comparator?

Page 12: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.12

what is relevance?

what’s wrong with relevance?

what are the alternatives?

Page 13: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.13

defining relevance

Relevance is defined as a measure of information conveyed by a document relative to a query.

It is shown that the relationship between the document and the query, though necessary, is not sufficient to determine relevance.

William Goffman, On relevance as a measure, 1964.

Page 14: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.14

we need more definitions

Page 15: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.15

let’s work top-down

• information retrieval (IR) =

study of retrieval of information (not data) from collection of written documents

retrieved documents aim at satisfying user information need

Page 16: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.16

IR assumes information needs

• user information need =

natural language declaration of informational need of user

• query =

expression of user information need in input language provided by information system

Page 17: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.17

relevance drives IR modeling

• modeling =

studies algorithms used for ranking documents according to system assigned likelihood of relevance

• model =

a set of premises and an algorithm for ranking documents with regard to a user query

Page 18: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.18

a relevance-centric approach

information Need query select from results

rank using IR model

USER:

SYSTEM:tf-idf PageRank

Page 19: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.19

what is relevance?

what’s wrong with relevance?

what are the alternatives?

Page 20: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.20

our first communication problem

information need query

• 2 words?• natural language?• telepathy?

Page 21: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.21

and the game of telephone continues

query rank using IR model

• cumulative error• relevance is subjective• what Goffman said

Page 22: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.22

and hopefully users feel lucky

rank using IR model

• selection bias• inefficient channel• backup plan?

select from results

Page 23: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.23

queries are misinterpreted

Results 1-10 out of about 344,000,000 for ir

Page 24: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.24

ranked lists are inefficient

Page 25: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.25

assumptions of relevance-centric approach

• self-awareness

• self-expression

• model knows best

• answer is a document

• one-shot query

Page 26: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.26

can we do better?

Page 27: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.27

what is relevance?

what’s wrong with relevance?

what are the alternatives?

Page 28: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.28

human-computer information retrieval

• don’t just guess the user’s intent– optimize communication

• increase user responsibility and control– require and reward human intellectual effort

“Toward Human-Computer Information Retrieval”

Gary Marchionini

Page 29: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.29

human computer information retrieval

Page 30: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.30

a concrete use case

• Colleague:

Hey Daniel! You should check out what this guy Steve Pollitt’s been researching. Sounds right up your alley.

• Daniel:

Sure thing, I’ll look into it.

Page 31: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.31

google him!

Page 32: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.32

google scholar him?

Page 33: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.33

rexa him?

Page 34: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.34

getting better

Page 35: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.35

hcir-inspired interface

Page 36: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.36

tags provide summarization and guidance

Page 37: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.37

my information need evolves as i learn

Page 38: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.38

hcir – implementing the vision

Page 39: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.39

scatter/gather: a search for “star”

Page 40: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.40

faceted search

Page 41: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.41

practical considerations

• which facets to show

• which facet values to show

• when to suggest faceted refinement

• how to automate faceted classification

Page 42: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.42

showing the right facets: microwaves

Page 43: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.43

showing the right facets: ceiling fans

Page 44: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.44

query-driven clarification before refinement

Matching Categories include:

Appliances > Small Appliances > Irons & Steamers

Appliances > Small Appliances > Microwaves & Steamers

Bath > Sauna & Spas > Steamers

Kitchen > Bakeware & Cookware > Cookware >Open Stock Pots > Double Boilers & Steamers

Kitchen > Small Appliances > Steamers

Page 45: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.45

results-driven clarification before refinement

Search: storage

Page 46: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.46

crowd-sourcing to tag documents

Page 47: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.47

recall

precision

hcir cheats the precision / recall trade-off

Page 48: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.48

set retrieval 2.0

• set retrieval that responds to queries with– overview of the user's current context– organized set of options for exploration

• contextual summaries of document sets– optimize system’s communication with user

• query refinement options– optimize user’s communication with system

Page 49: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.49

hcir using set retrieval 2.0

emphasize set summaries over ranked lists

establish a dialog between the user and the data

enable exploration and discovery

Page 50: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.50

think outside the (search) box

• relevance-centric search solves many use cases

• but not some of the most valuable ones

• support interaction, exploration

• human-computer information retrieval

Page 51: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.51

one more thing

Page 52: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.52

“Google's mission is to organize the

world's information and make it

universally accessible and useful.”

Page 53: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.53

organizer or referee?

Page 54: Google Tech Talk: Reconsidering Relevance

© 2009 Endeca Technologies, Inc. All rights reserved.54

thank you

communication 1.0email: [email protected]

communication 2.0blog: http://thenoisychannel.com

twitter: http://twitter.com/dtunkelang