
1

Automatic Identification of User Goals in Web Search

Uichin Lee, Zhenyu Liu, Junghoo Cho

Computer Science Department, UCLA

{uclee, vicliu, cho}@cs.ucla.edu

2

Motivation

• Users have different goals for Web search

– Reach the homepage of an organization (e.g., UCLA)

– Learn about a topic (e.g., simulated annealing)

– Download online music, etc.

• Can we identify the user goal for a Web search automatically?

– For example, to improve and customize search results based on the identified user goal

3

Two high-level user-goals

• Navigational query

– Reach a Web site the user already has in mind (e.g., “UCLA Library”)

• Informational query

– Visit multiple sites to learn about a particular topic (e.g., “Simulated Annealing”)

• Based on [Broder02, Rose&Levinson04]

– Navigational and informational are common in both studies

4

Exploiting identified user goals

• Tailored weighting/ranking mechanism

– Navigational queries

• Emphasize anchor texts [Craswell01, Kang03], URL path [Westerveld01]

– Informational queries

• Emphasize page content [Kang03], IR techniques (query expansion, relevance feedback, pseudo-relevance feedback, etc.)

• Tailored result presentation

– Informational queries

• Clustered search results [Etzioni99, Zeng04, Kummamuru04]

• Targeted ads / answers

5

Outline

• Are query goals predictable?

– Human-subject study

• How can we predict user goals automatically?

– Anchor-link distribution

– User-click distribution

• How effective are our features?

– Experimental evaluation

6

Are query goals “predictable”?

• Search engines “see” only a few keywords

– No explicit indication of goals by users

– Can we predict the user goal simply from the keywords?

• Human subject study

– 50 most popular Google queries from UCLA CS

– 28 participants (grad students) from UCLA CS

– Ask subjects to indicate the likely goal of each query if they had issued it

• Do most subjects agree on a particular goal?

7

Human subject study results

• i(q) – the fraction of participants that judge query q as informational

– e.g., i(q) = 0.038 for “UCLA Library”

[Figure: histogram of the number of queries (y-axis, 0 to 9) per i(q) bin (x-axis, ten bins from [0, 0.1) to [0.9, 1]); annotation: “Queries with a predictable goal”]
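The definition of i(q) and the ten-bin histogram can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `informational_fraction` and `histogram_bin`, the label strings, and the judgments for "query A" to "query C" are all hypothetical and are not data from the study.

```python
from collections import Counter


def informational_fraction(judgments):
    """Return i(q): the fraction of judgments that label the query informational."""
    if not judgments:
        raise ValueError("need at least one judgment")
    informational = sum(1 for label in judgments if label == "informational")
    return informational / len(judgments)


def histogram_bin(i_q, n_bins=10):
    """Map i(q) in [0, 1] to a bin index; the last bin is closed, i.e. [0.9, 1]."""
    return min(int(i_q * n_bins), n_bins - 1)


# Hypothetical judgments (assumption, for illustration only).
judgments = {
    "query A": ["navigational"] * 9 + ["informational"] * 1,
    "query B": ["informational"] * 10,
    "query C": ["navigational"] * 5 + ["informational"] * 5,
}

i_values = {q: informational_fraction(labels) for q, labels in judgments.items()}
bin_counts = Counter(histogram_bin(v) for v in i_values.values())
```

Queries whose i(q) falls in the outermost bins are the ones most subjects agree on; queries near 0.5 are the ambiguous ones.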

8




[Figure: histogram of the number of queries per i(q) bin, with a region annotated “ambiguous queries”]

• Ambiguous queries: 43.5% software names, 30.4% person names

9


[Figure: histogram of the number of queries (y-axis, 0 to 9) per i(q) bin (x-axis, ten bins from [0, 0.1) to [0.9, 1])]

• After removing software and person-name queries

10

Human subject study: summary

• Majority of queries have predictable goals

• Interestingly, most ambiguous queries tend to be on a certain set of topics

– Topic-based ambiguity detection may be possible

– Treat ambiguous queries differently from others

11

Outline

• Are query goals predictable?

– Human-subject study

• How can we predict user goals automatically?

• How effective are our features?

– Experimental evaluation

12

How to predict user goal?

• “UCLA Library” vs. “Simulated Annealing”

– Navigational vs. informational

– Semantic analysis necessary?

• Our idea: use information provided implicitly by Web users

– Web-link structure

– User-click behavior

13

Web-link structure

• Anchor-link distribution to quantify the link structure

[Diagram: links with the anchor text “UCLA Library” pointing to three destinations: www.library.ucla.edu, www.ucla.edu/library.html, and repositories.cdlib.org/uclalib/]

14

Web-link structure

• Anchor-link distribution to quantify the link structure

[Figure: anchor-link distribution for the query “UCLA Library”; x-axis: anchor link rank (1 to 10); y-axis: frequency for each link destination (0 to 1); destinations shown: www.library.ucla.edu, www.ucla.edu/library.html, repositories.cdlib.org/uclalib/]
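One way to picture an anchor-link distribution is as the share of matching anchors that point to each destination, sorted so that the most-linked destination has rank 1. The sketch below assumes the input is a list of (anchor text, destination URL) pairs and that an anchor matches the query when the two strings are equal ignoring case; the slides do not state the matching rule, so that is an assumption, as are the function name `anchor_link_distribution` and the link counts.

```python
from collections import Counter


def anchor_link_distribution(links, query, top_k=10):
    """Return [(destination, frequency)] for anchors matching the query.

    links: iterable of (anchor_text, destination_url) pairs.
    Frequencies are shares of all matching anchors, most frequent first,
    truncated to the top_k destinations.
    """
    wanted = query.strip().lower()
    counts = Counter(
        dest for text, dest in links if text.strip().lower() == wanted
    )
    total = sum(counts.values())
    if total == 0:
        return []  # no anchor statistics for this query
    return [(dest, n / total) for dest, n in counts.most_common(top_k)]


# Hypothetical link counts (assumption); the URLs are the ones on the slide.
links = (
    [("UCLA Library", "www.library.ucla.edu")] * 8
    + [("UCLA Library", "www.ucla.edu/library.html")]
    + [("UCLA Library", "repositories.cdlib.org/uclalib/")]
    + [("Simulated Annealing", "example.org/sa")]  # ignored: different anchor text
)

dist = anchor_link_distribution(links, "UCLA Library")
```

With these made-up counts the rank-1 destination receives most of the links, which is the peaked shape the slides associate with a navigational query.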

15

Anchor-link distribution for sample queries

[Figure: two panels, Navigational (“UCLA Library”) and Informational (“Simulated Annealing”); x-axis: anchor link rank (1 to 10); y-axis: frequency of each link destination (0 to 1)]

16

User-click behavior

• Click distribution to quantify past user-click behavior

[Figure: click distribution for the navigational query “UCLA Library”; x-axis: answer rank (1 to 10); y-axis: click frequency (0 to 1)]
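A click distribution of this kind can be tabulated from a log of which answer ranks were clicked for a query. The sketch below is an illustration under assumptions: the log format (a flat list of 1-based clicked ranks), the function name `click_distribution`, and the sample clicks are hypothetical, and the slides do not say how clicks beyond the first ten results are handled (here they are ignored).

```python
from collections import Counter


def click_distribution(clicked_ranks, max_rank=10):
    """Return a list f where f[r - 1] is the share of clicks on answer rank r.

    clicked_ranks: 1-based answer ranks clicked for a single query.
    Clicks outside 1..max_rank are ignored (assumption).
    """
    counts = Counter(r for r in clicked_ranks if 1 <= r <= max_rank)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * max_rank  # no click statistics for this query
    return [counts[r] / total for r in range(1, max_rank + 1)]


# Hypothetical click log (assumption): nine clicks on result 1, one on result 2.
navigational_like = click_distribution([1] * 9 + [2])
```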

17

User-click behavior (cont’d)

[Figure: two panels, Navigational (“UCLA Library”) and Informational (“Simulated Annealing”); x-axis: answer rank (1 to 10); y-axis: click frequency (0 to 1)]

18

Capturing the “shape” of distributions

• Possible numeric features for f(x)

– Mean – $\mu$

– Median

– Skewness – $\int (x - \mu)^3 f(x)\,dx / \sigma^3$

• How “asymmetric” f(x) is

– Kurtosis – $\int (x - \mu)^4 f(x)\,dx / \sigma^4$

• How “peaked” f(x) is

• Single linear regression

– Median is the most effective measurement for both anchor-link distribution and click distribution
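For a distribution over ranks, the integrals above become sums. The sketch below computes the four features for a list of per-rank frequencies. Several choices are assumptions, since the slides do not specify them: ranks are numbered from 1, the median is taken as the smallest rank at which the cumulative frequency reaches one half, and kurtosis is the plain fourth standardized moment (no subtraction of 3). The thresholds reported for the median may rely on a different convention. The function name `shape_features` is hypothetical.

```python
import math


def shape_features(freqs):
    """Mean, median, skewness and kurtosis of a distribution over ranks 1..n.

    freqs[k] is the (possibly unnormalized) frequency at rank k + 1.
    """
    total = sum(freqs)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    ranks = range(1, len(freqs) + 1)
    p = [f / total for f in freqs]

    mean = sum(r * w for r, w in zip(ranks, p))
    variance = sum((r - mean) ** 2 * w for r, w in zip(ranks, p))
    sigma = math.sqrt(variance)

    # Median (assumed convention): smallest rank whose cumulative mass >= 1/2.
    cumulative = 0
    median = len(freqs)
    for r, f in zip(ranks, freqs):
        cumulative += f
        if 2 * cumulative >= total:
            median = r
            break

    if sigma == 0:
        skewness = kurtosis = float("nan")  # undefined for a single-point distribution
    else:
        skewness = sum((r - mean) ** 3 * w for r, w in zip(ranks, p)) / sigma ** 3
        kurtosis = sum((r - mean) ** 4 * w for r, w in zip(ranks, p)) / sigma ** 4

    return {"mean": mean, "median": median, "skewness": skewness, "kurtosis": kurtosis}


# Hypothetical distributions (assumption): one peaked at rank 1, one flat.
peaked = shape_features([9, 1, 0, 0, 0, 0, 0, 0, 0, 0])
flat = shape_features([1] * 10)
```

The peaked distribution has a small median and a positive skewness, while the flat one has a median in the middle of the rank range and a skewness near zero.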

19

Evaluation of features

• Based on 30 queries from the human subject study

– Except software and person-name queries

– Each query is associated with a distinct user goal

• Anchor-link distribution for each query

– Based on 60M pages crawled from the Web

• Click distribution for each query

– Based on Google-result click behavior from UCLA CS during April 2004 - September 2004

20

Goal-prediction graph (synthetic)

[Figure: hypothetical plot of an effective feature; x-axis: i(q) (0.0 to 1.0, navigational to informational); y-axis: an individual feature]

21

Prediction graph: median of anchor-link dist.

[Figure: median of anchor-link distribution (y-axis, 0.0 to 4.0) versus i(q) (x-axis, 0.0 to 1.0, navigational to informational), with a threshold line at 1.0]

• Navigational iff median < threshold (= 1.0)

– Navigational queries: the vast majority of links point to the #1 anchor destination

• Prediction accuracy: 80.0%
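The single-feature rule is a one-sided threshold test, and prediction accuracy is the share of queries on which it agrees with the human label. A minimal sketch follows; the function names `predict_goal` and `accuracy` and the four example queries are hypothetical, and the feature values are made up rather than taken from the study.

```python
def predict_goal(feature_value, threshold=1.0):
    """Navigational iff the feature (e.g. a median) is below the threshold."""
    return "navigational" if feature_value < threshold else "informational"


def accuracy(examples, threshold=1.0):
    """Share of (feature_value, true_goal) pairs that the rule predicts correctly."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(predict_goal(v, threshold) == goal for v, goal in examples)
    return correct / len(examples)


# Hypothetical (median, human-labeled goal) pairs, for illustration only.
examples = [
    (0.5, "navigational"),
    (1.6, "navigational"),   # misclassified by the rule
    (2.4, "informational"),
    (3.1, "informational"),
]

acc = accuracy(examples)
```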

22

Prediction graph: combining the two features

• Linear combination with equal weights: navigational iff the median of click dist. + the median of anchor-link dist. < the sum of the two thresholds (= 2.0)

• Prediction accuracy: 90%

[Figure: median of click distribution + median of anchor-link distribution (y-axis, 0.0 to 7.0) versus i(q) (x-axis, 0.0 to 1.0, navigational to informational), with a threshold line at 2.0]
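The combined rule adds the two medians with equal weights and compares the sum against 2.0. The sketch below shows the rule and a case where it disagrees with the anchor-link feature used alone; the function name `predict_goal_combined`, its parameter names, and the input values are hypothetical.

```python
def predict_goal_combined(click_median, anchor_median, threshold_sum=2.0,
                          weights=(1.0, 1.0)):
    """Navigational iff the weighted sum of the two medians is below the threshold.

    The default weights (1, 1) correspond to the equal-weight combination.
    """
    w_click, w_anchor = weights
    score = w_click * click_median + w_anchor * anchor_median
    return "navigational" if score < threshold_sum else "informational"


# Hypothetical medians (assumption): the anchor-link median alone is not below
# 1.0, but the combined score 0.6 + 1.2 = 1.8 is below 2.0.
combined = predict_goal_combined(click_median=0.6, anchor_median=1.2)
anchor_only = "navigational" if 1.2 < 1.0 else "informational"
```

Here a clearly navigational click pattern offsets a borderline anchor-link median, which is one way combining the features can correct a single-feature error.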

23

Comparison with previous work

• Three features in [Kang and Kim 03]

(1) Anchor usage rate

(2) Query term distribution

(3) Term-dependence

[Figure: query term distribution (y-axis, -2.0 to 8.0) versus i(q) (x-axis, 0.0 to 1.0, navigational to informational)]

• Result

– Could not reproduce reported results

– Three features not very effective

24

Summary

• Two effective features for goal identification

– Anchor-link distribution (Web-link structure) and click distribution (user-click behavior)

– Achieved an overall accuracy of 90% on a benchmark query set

• More details in the paper

25

Future work

• Evaluate on a larger and less biased query set

• Handle queries with insufficient anchor/click statistics

– Learn patterns from queries whose goals are clear

• Predict search intentions on a finer granularity

– Informational queries can be further classified, e.g., directed, undirected, advice, list, etc. [Rose04]

– Analyze the contents of Web pages that users have clicked/viewed

– Linguistic methods

26

Thank you

• Any questions?

27

Questionnaire design

• 1st version: direct classification by subjects

– Navigational vs. informational

– Some confusion

• “Alan Kay”: home page + other pages

• “Have a site in mind?” vs “plan to visit one site?”

• 2nd version:

1. Have a site in mind. Intend to visit only that site

2. Have a site in mind. But willing to visit others

3. Have no site in mind. Willing to visit anything relevant
