
Page 1:

Evaluation of (Search) Results

How do we know if our results are any good? Evaluating a search engine

Benchmarks Precision and recall

Results summaries: Making our good results usable to a user

Page 2:

Architecture of a Search Engine

[Figure: architecture of a search engine. The Web is crawled by a web spider; an indexer builds the indexes (plus separate ad indexes); a search interface serves results to the user. Illustrated with a Google results page for the query "miele", showing organic results (miele.com, miele.co.uk, miele.de, miele.at) and sponsored links.]

Page 3:

Benchmarks for a search engine

Index building: how fast does it index?
Number of documents per hour (for a given average document size)

Search speed: how fast does it search?
Latency as a function of index size

Expressiveness: how expressive is its query language?
Ability to express complex information needs
Speed on complex queries

Page 4:

Measures for a search engine

All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise

But what is the most important measure? User happiness.

What is it? Speed of response and size of index are factors,
but blindingly fast, useless answers won't make a user happy.

We need a way of quantifying user happiness.

Page 5:

Who is the user we are trying to make happy?

Web search engine: users find what they want and return to the engine.
Can measure the rate of return users.

eCommerce site: users find what they want and make a purchase.
Is it the end user, or the eCommerce site, whose happiness we measure?
Measure time to purchase, or the fraction of searchers who become buyers?

Enterprise (company/government/academic): increased "user productivity".
How much time do users save when looking for information?
Many other criteria, e.g. breadth of access, secure access, etc.

Page 6:

Happiness: elusive to measure

The commonest proxy: relevance of search results.
But how do you measure relevance?
We will detail a methodology, then examine its issues.

Measuring relevance requires three elements:
A benchmark document collection
A benchmark suite of queries
A binary assessment of either Relevant or Irrelevant for each query-document pair

Some assessments use more than binary (graded) judgments, but binary is the standard.
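As a minimal sketch of these three elements (the documents, query, and IDs below are made up for illustration), the judgments can be stored as a binary mapping over query-document pairs:

```python
# Hypothetical toy benchmark; none of these IDs or texts come from a real test collection.
docs = {
    "d1": "Wellesley College offers the following computer applications to students ...",
    "d2": "Recipes and cooking tips for the home kitchen ...",
}
queries = {
    "q1": "computer applications available wellesley college",
}
# Binary relevance judgments for (query, doc) pairs: True = Relevant, False = Irrelevant.
qrels = {
    ("q1", "d1"): True,
    ("q1", "d2"): False,
}
```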

Page 7:

Evaluating an IR system

Note: the information need is translated into a query

Relevance is assessed relative to the information need, not the query.

E.g., information need: "I'm looking for information on what types of computer applications are available at Wellesley College."

Query: computer applications available wellesley college

You evaluate whether the documents found address the information need, not merely whether they contain the query keywords.

Page 8:

Standard relevance benchmarks

TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.

Reuters and other benchmark document collections are also used.

"Retrieval tasks" are specified, sometimes as queries.

Human experts mark, for each query, each document as Relevant or Irrelevant
(or at least each document in the subset that some system returned for that query).

Page 9:

Unranked retrieval evaluation: Precision and Recall

Precision: the fraction of retrieved docs that are relevant = P(relevant | retrieved)
Recall: the fraction of relevant docs that are retrieved = P(retrieved | relevant)

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)

                 Relevant             Not Relevant
Retrieved        tp (true positive)   fp (false positive)
Not Retrieved    fn (false negative)  tn (true negative)
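A minimal sketch of these definitions in Python; the counts below are illustrative, not from any real run:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from a single query's contingency counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts: 8 relevant docs retrieved, 2 junk docs retrieved, 4 relevant docs missed.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"P = {p:.2f}, R = {r:.2f}")   # P = 0.80, R = 0.67
```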

Page 10:

Accuracy

Given a query, an engine classifies each doc as "Relevant" or "Irrelevant".

Accuracy of an engine: the fraction of these classifications that is correct.

Accuracy = (tp + tn) / (tp + fp + fn + tn)

Can’t we just use Accuracy?

Page 11:

Why not just use accuracy?

How to build a 99.9999% accurate search engine on a low budget….

People using an IR system want to find something and have a certain tolerance for junk.

[Mock search box: "Search for: ______", returning "0 matching results found."]
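To make the pitfall concrete, here is a small illustration with made-up numbers: on a heavily skewed collection, an "engine" that retrieves nothing at all scores almost perfect accuracy while having zero recall.

```python
# Made-up collection: 1,000,000 docs, of which only 10 are relevant to the query.
total_docs = 1_000_000
relevant = 10

# A degenerate engine that retrieves nothing: tp = fp = 0.
tp, fp = 0, 0
fn = relevant - tp               # all 10 relevant docs are missed
tn = total_docs - relevant - fp  # 999,990 irrelevant docs correctly "not retrieved"

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.3%}, recall = {recall:.0%}")   # accuracy = 99.999%, recall = 0%
```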

Page 12:

Precision vs Recall

You can get high recall (but low precision) by retrieving all docs for all queries!

Recall can only go up as the number of docs retrieved increases.

But as either the number of docs retrieved or recall increases, precision goes down (in a good system).

This is a fact with strong empirical confirmation.
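As an illustration of this monotonicity (the ranked list and counts below are made up), computing precision and recall at each cutoff k shows recall never decreasing while precision tends to fall:

```python
# Made-up ranked result list: True marks a relevant doc, False an irrelevant one.
ranked_relevance = [True, True, False, True, False, False, True, False]
total_relevant = 5   # assume 5 relevant docs exist in the whole collection

hits = 0
for k, rel in enumerate(ranked_relevance, start=1):
    hits += rel
    print(f"k={k}: P@k={hits / k:.2f}, R@k={hits / total_relevant:.2f}")
# R@k never decreases with k; P@k drifts downward as more non-relevant docs are retrieved.
```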

Page 13:

Difficulties in using precision/recall

We should average over large corpus/query ensembles

We need human relevance assessments, and people aren't reliable assessors.

Assessments have to be binary; what about nuanced assessments?

Results are heavily skewed by corpus/authorship, and may not translate from one domain to another.

Page 14:

A combined measure: F

The combined measure that assesses this tradeoff is the F measure, a weighted harmonic mean of precision and recall:

F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α

People usually use the balanced F1 measure, i.e., with β = 1 (equivalently α = ½).

The harmonic mean is a conservative average. See C. J. van Rijsbergen, Information Retrieval.
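A minimal Python sketch of this formula; β = 1 gives the balanced F1, and the example values are illustrative:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (van Rijsbergen's F measure)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# A high-precision, low-recall result: the harmonic mean punishes the imbalance.
print(f"F1 = {f_measure(0.9, 0.3):.2f}")            # 0.45
print(f"arithmetic mean = {(0.9 + 0.3) / 2:.2f}")   # 0.60 -- F1 is the more conservative average
```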

Page 15:

F1 and other averages

[Figure: F1 compared with other ways of averaging precision and recall, with recall fixed at 70%.]

Page 16:

Result Summaries

Having ranked the documents matching a query, we wish to present a results list.
Most commonly, this is the document title plus a short summary.
The title is typically automatically extracted from document metadata.
What about the summaries?

Page 17:

Summaries

Two basic kinds: static and dynamic.

A static summary of a document is always the same, regardless of the query that hit the doc.
Dynamic summaries are query-dependent: they attempt to explain why the document was retrieved for the query at hand.

Page 18:

Static summaries

In typical systems, the static summary is a subset of the document

Simplest heuristic: the first 50 (or so) words of the document (see the sketch below)

Summary cached at indexing time

More sophisticated: extract from each document a set of “key” sentences

Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences.

Most sophisticated: NLP used to synthesize a summary

Seldom used in IR; cf. text summarization work
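A minimal sketch of the "first N words" static summary mentioned above, as it might be computed once at indexing time (function and parameter names are illustrative):

```python
def static_summary(text: str, max_words: int = 50) -> str:
    """Return the first max_words words of a document as its static summary."""
    words = text.split()
    summary = " ".join(words[:max_words])
    return summary + (" ..." if len(words) > max_words else "")

# Computed at indexing time and cached alongside the document's index entry.
doc_text = "Welcome to Miele, the home of the very best appliances and kitchens in the world. " * 10
print(static_summary(doc_text, max_words=20))
```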

Page 19:

Dynamic summaries

Present one or more “windows” within the document that contain several of the query terms

“KWIC” snippets: Keyword in Context presentation

Generated in conjunction with scoring:
If the query is found as a phrase, show the/some occurrences of the phrase in the doc.
If not, show windows within the doc that contain multiple query terms.

The summary itself gives the entire content of the window – all terms, not only the query terms – how?
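A minimal sketch of a keyword-in-context window extractor (a simplification: real systems work from positional indexes and score candidate windows; names and the window size here are illustrative):

```python
import re

def kwic_snippet(text: str, query_terms: list[str], window: int = 10) -> str:
    """Return a window of words around the first query-term hit, keeping all terms in the span."""
    words = text.split()
    normalized = [re.sub(r"\W+", "", w).lower() for w in words]
    terms = {t.lower() for t in query_terms}
    for i, token in enumerate(normalized):
        if token in terms:
            start, end = max(0, i - window // 2), i + window // 2 + 1
            return "... " + " ".join(words[start:end]) + " ..."
    return " ".join(words[:window]) + " ..."   # no hit: fall back to a leading window

print(kwic_snippet(
    "Welcome to Miele, the home of the very best appliances and kitchens in the world.",
    ["appliances", "kitchens"],
))
```

Note that the returned window keeps every term in the span, not only the query terms, which is exactly the question raised above.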

Page 20:

Dynamic summaries

Producing good dynamic summaries is tricky:
The real estate for the summary is normally small and fixed.
We want a short item, so we show as many KWIC matches as possible, and perhaps other things like the title.
We want snippets to be long enough to be useful.
We want linguistically well-formed snippets: users prefer snippets that contain complete phrases.
We want snippets that are maximally informative about the document.

But users really like snippets, even if they complicate IR system design