Measuring the quality of web search engines


TRANSCRIPT

Page 1: Measuring the quality of web search engines

Measuring the quality of web search engines
Prof. Dr. Dirk Lewandowski
University of Applied Sciences Hamburg
[email protected]

Tartu University, 14 September 2009

Page 2: Measuring the quality of web search engines

1 | Dirk Lewandowski

Agenda

Introduction

A few words about user behaviour

Standard retrieval effectiveness tests vs. “Universal Search”

Selected results: Results descriptions, navigational queries

Towards an integrated test framework

Conclusions

Page 3: Measuring the quality of web search engines

2 | Dirk Lewandowski

Agenda

Introduction

A few words about user behaviour

Standard retrieval effectiveness tests vs. “Universal Search”

Selected results: Results descriptions, navigational queries

Towards an integrated test framework

Conclusions

Page 4: Measuring the quality of web search engines

3 | Dirk Lewandowski

Search engine market: Germany 2009

(Webhits, 2009)

Page 5: Measuring the quality of web search engines

4 | Dirk Lewandowski

Search engine market: Estonia 2007

(Global Search Report 2007)

Page 6: Measuring the quality of web search engines

5 | Dirk Lewandowski

Why measure the quality of web search engines?

• Search engines are the main access point to web content.

• One player is dominating the worldwide market.

• Open questions
  – How good are search engines’ results?
  – Do we need alternatives to the “big three” (“big two”? “big one”?)?
  – How good are alternative search engines at delivering an alternative view on web content?
  – How good must a new search engine be to compete?

Page 7: Measuring the quality of web search engines

6 | Dirk Lewandowski

A framework for measuring search engine quality

• Index quality
  – Size of database, coverage of the web
  – Coverage of certain areas (countries, languages)
  – Index overlap
  – Index freshness

• Quality of the results
  – Retrieval effectiveness
  – User satisfaction
  – Results overlap

• Quality of the search features
  – Features offered
  – Operational reliability

• Search engine usability and user guidance

(Lewandowski & Höchstötter, 2007)

Page 8: Measuring the quality of web search engines

7 | Dirk Lewandowski

A framework for measuring search engine quality

• Index quality
  – Size of database, coverage of the web
  – Coverage of certain areas (countries, languages)
  – Index overlap
  – Index freshness

• Quality of the results
  – Retrieval effectiveness
  – User satisfaction
  – Results overlap

• Quality of the search features
  – Features offered
  – Operational reliability

• Search engine usability and user guidance

(Lewandowski & Höchstötter, 2007)

Page 9: Measuring the quality of web search engines

8 | Dirk Lewandowski

Agenda

Introduction

A few words about user behaviour

Standard retrieval effectiveness tests vs. “Universal Search”

Selected results: Results descriptions, navigational queries

Towards an integrated test framework

Conclusions

Page 10: Measuring the quality of web search engines

9 | Dirk Lewandowski

Users use relatively few cognitive resources in web searching.

• Queries
  – Average length: 1.7 words (German-language queries; English-language queries are slightly longer)
  – Approx. 50 percent of queries consist of just one word

• Search engine results pages (SERPs)
  – 80 percent of users view no more than the first results page (10 results)
  – Users normally view only the first few results (“above the fold”)
  – Users view no more than five results per session
  – Session length is less than 15 minutes

• Users are usually satisfied with the results given.
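To make figures like these reproducible on one's own data, here is a minimal Python sketch for computing such query statistics from a log. The file name and the one-query-per-line format are assumptions for illustration only.

```python
# Minimal sketch: descriptive statistics over a query log.
# Assumes a hypothetical plain-text file with one query per line.
from statistics import mean

def query_log_stats(path):
    with open(path, encoding="utf-8") as f:
        queries = [line.strip() for line in f if line.strip()]
    lengths = [len(q.split()) for q in queries]
    return {
        "queries": len(queries),
        "avg_length_words": round(mean(lengths), 2),
        "share_one_word": round(sum(1 for n in lengths if n == 1) / len(lengths), 2),
    }

if __name__ == "__main__":
    print(query_log_stats("queries.txt"))  # e.g. {'queries': ..., 'avg_length_words': 1.7, ...}
```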

Page 11: Measuring the quality of web search engines

10 | Dirk Lewandowski

Results selection (top11 results)

(Granka et al. 2004)

Page 12: Measuring the quality of web search engines

11 | Dirk Lewandowski

Agenda

Introduction

A few words about user behaviour

Standard retrieval effectiveness tests vs. “Universal Search”

Selected results: Results descriptions, navigational queries

Towards an integrated test framework

Conclusions

Page 13: Measuring the quality of web search engines

12 | Dirk Lewandowski

Standard design for retrieval effectiveness tests

• Select (at least 50) queries (from log files, from user studies, etc.)
• Select some (major) search engines
• Consider the top results (use a cut-off)
• Anonymise search engines, randomise result positions
• Let users judge the results

• Calculate precision scores
  – the ratio of relevant results to all results retrieved up to the corresponding position

• Calculate/assume recall scores
  – the ratio of relevant results shown by a certain search engine to all relevant results within the database
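To make the two measures concrete, a minimal Python sketch of precision at a cut-off and a pooled ("relative") recall. Treating the union of relevant results found across all engines as the recall denominator is an assumption often made in web evaluations, since the true number of relevant documents on the web is unknown.

```python
# Minimal sketch: precision at cut-off k and pooled ("relative") recall.
# judgments: list of booleans, True = result at that rank judged relevant.

def precision_at_k(judgments, k):
    top = judgments[:k]
    return sum(top) / len(top) if top else 0.0

def relative_recall(judgments, total_relevant_in_pool):
    # Recall against all relevant documents found across all engines (the pool).
    if total_relevant_in_pool == 0:
        return 0.0
    return sum(judgments) / total_relevant_in_pool

# Example: top-5 judgments for one query and one engine (invented values).
judged = [True, True, False, True, False]
print(precision_at_k(judged, 5))    # 0.6
print(relative_recall(judged, 10))  # 0.3
```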

Page 14: Measuring the quality of web search engines

13 | Dirk Lewandowski

Recall-Precision-Graph (top20 results)

(Lewandowski 2008)

Page 15: Measuring the quality of web search engines

14 | Dirk Lewandowski

Standard design for retrieval effectiveness tests

• Problematic assumptions
  – Model of the “dedicated searcher” (willing to select one result after the other and go through an extensive list of results)
  – User wants both high precision and high recall

• These studies do not consider
  – how many documents a user is willing to view / how many are sufficient for answering the query
  – how popular the queries used in the evaluation are
  – graded relevance judgements (relevance scales)
  – different relevance judgements by different jurors
  – different query types
  – results descriptions
  – users’ typical results selection behaviour
  – visibility of different elements in the results lists (through their presentation)
  – users’ preference for a certain search engine
  – diversity of the results set / the top results
  – ...

Page 16: Measuring the quality of web search engines

15 | Dirk Lewandowski

• Results selection simple

Page 17: Measuring the quality of web search engines

16 | Dirk Lewandowski

Universal Search


Page 18: Measuring the quality of web search engines

17 | Dirk Lewandowski

Universal Search

[Annotated screenshot of a universal search results page: news results, ads, organic results, organic results (contd.), image results, video results]

Page 19: Measuring the quality of web search engines

18 | Dirk Lewandowski

Agenda

Introduction

A few words about user behaviour

Standard retrieval effectiveness tests vs. “Universal Search”

Selected results: Results descriptions, navigational queries

Towards an integrated test framework

Conclusions

Page 20: Measuring the quality of web search engines

19 | Dirk Lewandowski

Results descriptions

META Description

Yahoo Directory

Open Directory

Page 21: Measuring the quality of web search engines

20 | Dirk Lewandowski

Results descriptions: keywords in context (KWIC)

Page 22: Measuring the quality of web search engines

21 | Dirk Lewandowski

• Results selection simple

Page 23: Measuring the quality of web search engines

22 | Dirk Lewandowski

• Results selection with descriptions

Page 24: Measuring the quality of web search engines

23 | Dirk Lewandowski

Ratio of relevant results vs. relevant descriptions (top20 results)

Page 25: Measuring the quality of web search engines

24 | Dirk Lewandowski

Recall-precision graph (top20 descriptions)

Page 26: Measuring the quality of web search engines

25 | Dirk Lewandowski

Precision of descriptions vs. precision of results (Google)

Page 27: Measuring the quality of web search engines

26 | Dirk Lewandowski

Recall-precision graph (top 20, DRprec = relevant descriptions leading to relevant results)
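A minimal sketch of how the three measures compared on these slides could be computed, assuming paired judgments (description relevant?, result relevant?) per ranked result; the exact operationalisation used in the study may differ.

```python
# Minimal sketch: precision of descriptions, precision of results, and DRprec
# (share of results whose description AND landing page are both relevant),
# given per-rank pairs of judgments (description_relevant, result_relevant).

def description_precision(pairs, k):
    top = pairs[:k]
    return sum(d for d, _ in top) / len(top)

def result_precision(pairs, k):
    top = pairs[:k]
    return sum(r for _, r in top) / len(top)

def dr_precision(pairs, k):
    top = pairs[:k]
    return sum(1 for d, r in top if d and r) / len(top)

# Invented example judgments for the top 4 results of one query.
pairs = [(True, True), (True, False), (False, True), (True, True)]
for f in (description_precision, result_precision, dr_precision):
    print(f.__name__, f(pairs, 4))
```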

Page 28: Measuring the quality of web search engines

27 | Dirk Lewandowski

Search engines deal with different query types.

Query types (Broder, 2002):

• Informational
  – Looking for information on a certain topic
  – User wants to view a few relevant pages

• Navigational
  – Looking for a (known) homepage
  – User wants to navigate to this homepage; only one relevant result

• Transactional
  – Looking for a website to complete a transaction
  – One or more relevant results
  – The transaction can be purchasing a product, downloading a file, etc.

Page 29: Measuring the quality of web search engines

28 | Dirk Lewandowski

Search engines deal with different query types.

Query types (Broder, 2002):

• Informational
  – Looking for information on a certain topic
  – User wants to view a few relevant pages

• Navigational
  – Looking for a (known) homepage
  – User wants to navigate to this homepage; only one relevant result

• Transactional
  – Looking for a website to complete a transaction
  – One or more relevant results
  – The transaction can be purchasing a product, downloading a file, etc.
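Broder's categories are usually assigned by human judges. As a purely illustrative sketch, a crude keyword heuristic in Python; the patterns below are assumptions, not part of Broder's or the project's method.

```python
# Purely illustrative heuristic; in practice query types are assigned by
# human judges. These patterns are assumptions, not an established method.
import re

NAVIGATIONAL_HINTS = re.compile(r"www\.|\.(com|de|ee|org)\b", re.IGNORECASE)
TRANSACTIONAL_HINTS = re.compile(r"\b(download|buy|kaufen|order|booking|login)\b", re.IGNORECASE)

def guess_query_type(query: str) -> str:
    if NAVIGATIONAL_HINTS.search(query):
        return "navigational"
    if TRANSACTIONAL_HINTS.search(query):
        return "transactional"
    return "informational"

for q in ("www.ut.ee", "firefox download", "causes of climate change"):
    print(q, "->", guess_query_type(q))
```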

Page 30: Measuring the quality of web search engines

29 | Dirk Lewandowski

Percentage of unanswered queries (“navigational fail”)

(Lewandowski 2009)

Page 31: Measuring the quality of web search engines

30 | Dirk Lewandowski

Successfully answered queries at results position n

(Lewandowski 2009)

Page 32: Measuring the quality of web search engines

31 | Dirk Lewandowski

Results for navigational vs. informational queries

• Studies should consider informational as well as navigational queries.

• Queries should be weighted according to their frequency (a minimal sketch of such weighting follows below).

• When more than 40% of queries are navigational, new search engines should put significant effort into answering these queries sufficiently.
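The frequency weighting suggested above could look like the following Python sketch; the queries, scores, and frequencies are invented for illustration.

```python
# Minimal sketch: weight per-query scores by query frequency, so that
# frequent (often navigational) queries count more than rare ones.
# All queries, scores, and frequencies below are invented for illustration.

def frequency_weighted_mean(per_query):
    total = sum(freq for _, freq in per_query.values())
    return sum(score * freq for score, freq in per_query.values()) / total

per_query = {
    "facebook": (1.0, 120_000),               # navigational, answered at rank 1
    "tartu university": (1.0, 8_000),          # navigational
    "search engine evaluation": (0.6, 150),    # informational, precision-style score
}
print(round(frequency_weighted_mean(per_query), 3))
```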

Page 33: Measuring the quality of web search engines

32 | Dirk Lewandowski

Agenda

Introduction

A few words about user behaviour

Standard retrieval effectiveness tests vs. “Universal Search”

Selected results: Results descriptions, navigational queries

Towards an integrated test framework

Conclusions

Page 34: Measuring the quality of web search engines

33 | Dirk Lewandowski

Addressing major problems with retrieval effectiveness tests

• We use both navigational and informational queries.
  – There is no suitable framework for transactional queries, though.

• We use query frequency data from the T-Online database.
  – The database consists of approx. 400 million queries from 2007 onwards.
  – We can use time series analysis.

• We classify queries according to query type and topic.
  – We did a study on query classification based on 50,000 queries from T-Online log files to gain a better understanding of user intents. Data collection was “crowdsourced” to Humangrid GmbH.

Page 35: Measuring the quality of web search engines

34 | Dirk Lewandowski

Addressing major problems with retrieval effectiveness tests

• We consider all elements on the first results page.
  – Organic results, ads, shortcuts
  – We will use clickthrough data from T-Online to measure the “importance” of certain results.

• Each result will be judged by several jurors.
  – Juror groups: students, professors, retired persons, librarians, school children, other.
  – Additional judgements by “general users” are collected in cooperation with Humangrid GmbH.

• Results will be graded on a relevance scale (see the sketch after this list).
  – Both results and descriptions will be judged.

• We will classify all organic results according to
  – document type (e.g., encyclopaedia, blog, forum, news)
  – date
  – degree of commercial intent
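As a rough sketch of how graded judgments from several jurors might be aggregated: the 0–4 scale and the simple averaging below are assumptions made for illustration; the project's actual scale and aggregation are not specified here.

```python
# Minimal sketch: combining graded relevance judgments from several jurors,
# assuming a 0-4 scale (an assumption, not the project's documented scale).
from statistics import mean

def pooled_grade(juror_grades):
    # Average the grades given by all jurors for one result.
    return mean(juror_grades)

def graded_precision_at_k(results, k, max_grade=4):
    # Normalised average grade over the top-k results of one engine.
    top = results[:k]
    return mean(pooled_grade(g) for g in top) / max_grade

# Each inner list holds the grades of different jurors for one ranked result.
results = [[4, 3, 4], [2, 3, 2], [0, 1, 0], [3, 3, 4]]
print(round(graded_precision_at_k(results, 4), 3))
```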

Page 36: Measuring the quality of web search engines

35 | Dirk Lewandowski

Addressing major problems with retrieval effectiveness tests

• We will count ads on results pages.
  – Do search engines prefer pages carrying ads from the engine’s own ad system?

• We will ask users additional questions.
  – Users will judge the results set of each individual search engine as a whole.
  – Users will rank the search engines based on their result sets.
  – Users will indicate where they would have stopped viewing further results.
  – Users will provide their own relevance-ranked list by card-sorting the complete results set from all search engines.

• We will use printed screenshots of the results.
  – This makes the study “mobile”.
  – Especially important when considering certain user groups (e.g., elderly people).

Page 37: Measuring the quality of web search engines

36 | Dirk Lewandowski

State of current work

• First wave of data collection starting in October.

• Proposal for additional project funding sent to the DFG (German Research Foundation).

• Project on user intents from search queries is nearing completion.

• Continuing collaboration with Deutsche Telekom, T-Online.

Page 38: Measuring the quality of web search engines

37 | Dirk Lewandowski

Agenda

Introduction

A few words about user behaviour

Standard retrieval effectiveness tests vs. “Universal Search”

Selected results: Results descriptions, navigational queries

Towards an integrated test framework

Conclusions

Page 39: Measuring the quality of web search engines

38 | Dirk Lewandowski

Conclusion

• Measuring search engine quality is a complex task.

• Retrieval effectiveness is a major aspect of SE quality evaluation.

• Established evaluation frameworks are not sufficient for the web context.

Page 40: Measuring the quality of web search engines

Thank you for your attention.

Prof. Dr. Dirk Lewandowski

Hamburg University of Applied Sciences
Department Information
Berliner Tor 5
D-20099 Hamburg
Germany

www.bui.haw-hamburg.de/lewandowski.html
E-Mail: [email protected]