Experiments on Query Expansion for Internet Yellow Page Services Using Log Mining

Summarized by Dongmin Shin

Presented by Dongmin Shin

User Log Analysis Team

IDS Lab., SNU

2008.08.14.

28th VLDB Conference (2002)

Yusuke Ohura

Katsumi Takahashi

Iko Pramudiono

Masaru Kitsuregawa

Institute of Industrial Science, University of Tokyo

NTT Information Sharing Platform Laboratories

Copyright 2008 by CEBT

Index

Introduction

Internet Yellow Page Service and its Problems

Log Analysis of iTOWNPAGE

Query Expansion Using Web Log Mining

Implementation and Evaluation

Conclusion

Center for E-Business Technology

Introduction

Rapid progress in storage capacity and processor performance

Makes it feasible to analyze the huge log data left on Web servers

But still...

– No technical report on mining huge log data is publicly available

This paper reports...

Results of log data mining and query expansion experiments on a huge commercial Web service

– iTOWNPAGE

An online Japanese telephone directory system (Yellow Page service)

Internet Yellow Page Service and its Problems

iTOWNPAGE

Internet version of TOWNPAGE

Internet Yellow Page Service and its Problems

Problems Found Through Statistical Analysis

Log data

– Access log on iTOWNPAGE from 1st February to 30th June 2000

– 450 million lines, 200GB

1st issue

– Sessions with multiple categories

27.2% of search sessions with category as their variable input are multiple-category sessions

75.2% of them used non-sibling categories, which do not share a parent in the category hierarchy iTOWNPAGE provides

2nd issue

– Cases where users cannot get any results for their search requests

Overview of Search Requests on iTOWNPAGE

Log Analysis of iTOWNPAGE

Session

The sequence of requests from a user

Two consecutive requests within a 30-minute interval are regarded as part of the same session

Session vector $s$

i-th session vector $s_i$
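The 30-minute session rule above can be sketched as follows. This is only an illustration: the `(user_id, timestamp, category)` tuple layout, the function name, and the way users are identified are assumptions, since the slides do not describe the log format.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # interval stated on the slide


def split_into_sessions(requests):
    """Group (user_id, timestamp, category) tuples into per-user sessions.

    A request joins the user's current session when it arrives within
    SESSION_GAP of that user's previous request; otherwise it starts a
    new session. The tuple layout is a hypothetical log format.
    """
    sessions = []
    open_session = {}  # user_id -> session list currently being extended
    last_seen = {}     # user_id -> timestamp of that user's previous request
    for user_id, timestamp, category in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(user_id)
        if previous is None or timestamp - previous > SESSION_GAP:
            open_session[user_id] = []
            sessions.append(open_session[user_id])
        open_session[user_id].append(category)
        last_seen[user_id] = timestamp
    return sessions


# Made-up example log: two users, one of whom pauses for more than 30 minutes.
example_log = [
    ("u1", datetime(2000, 2, 1, 10, 0), "Hotels"),
    ("u1", datetime(2000, 2, 1, 10, 20), "Restaurants"),
    ("u1", datetime(2000, 2, 1, 11, 30), "Taxis"),
    ("u2", datetime(2000, 2, 1, 10, 5), "Hotels"),
]
example_sessions = split_into_sessions(example_log)
```

Sessions are returned in the order in which they were opened, so the third request of `u1` ends up in a session of its own.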

Log Analysis of iTOWNPAGE

K-means Clustering Algorithm for clustering sessions

The number of clusters cannot be predicted in advance

Improve the algorithm so that it dynamically decides the number of clusters to be generated

Improved K-means algorithm

The 1st input vector $s_1$ becomes the centroid vector $c_1$ of the first cluster C1

– $s_1$ becomes a member of the cluster C1

For each successive input vector $s_i$,

– Similarity with the existing clusters C1 … Ck is calculated with the similarity formula

If the similarity is below the threshold, a new cluster is generated

If not, the input vector becomes a member of the cluster with the highest similarity

The centroid vector is recalculated with the centroid formula

The process is executed iteratively until it converges
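A minimal sketch of one pass of the threshold-based assignment described above. The similarity and centroid formulas are not reproduced in this transcript, so cosine similarity and the plain mean of the member vectors are used here purely as assumptions; the function names are hypothetical, and the outer "repeat until convergence" loop is left out because its details are not given.

```python
import math


def cosine(u, v):
    """Cosine similarity; an assumed stand-in for the slide's formula."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Component-wise mean; an assumed stand-in for the centroid formula."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def threshold_clustering(vectors, threshold):
    """Single pass: the number of clusters grows as the data requires."""
    centroids, members, assignment = [], [], []
    for vector in vectors:
        scores = [cosine(vector, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            # No existing cluster is similar enough: open a new one.
            centroids.append(list(vector))
            members.append([vector])
            best = len(centroids) - 1
        else:
            # Join the most similar cluster and update its centroid.
            members[best].append(vector)
            centroids[best] = mean(members[best])
        assignment.append(best)
    return assignment, centroids
```

With a threshold of 0.8, the vectors `[1, 0]` and `[0.9, 0.1]` share a cluster, while `[0, 1]` opens a second one. The result depends on the input order, which is a known property of this kind of incremental scheme.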

K-means Clustering Algorithm

Overview

An algorithm to cluster n objects, based on their attributes, into k partitions

It assumes that the object attributes form a vector space

The objective is to minimize the total intra-cluster variance

Algorithm

– Partition the input points into k initial sets, either at random or using some heuristic

– Calculate the centroid (mean point) of each set

– Construct a new partition by associating each point with the closest centroid

– Then recalculate the centroids for the new clusters, and repeat these two steps alternately until convergence
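The four steps above can be sketched in plain Python. The function name, the `initial_assignment` parameter (standing in for "some heuristic"), and the use of squared Euclidean distance as the closeness measure are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, initial_assignment=None, max_iterations=100, seed=0):
    """Standard k-means: alternate centroid update and reassignment."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    if initial_assignment is None:
        # Random initial partition: deal shuffled points round-robin so
        # that every one of the k sets starts non-empty.
        order = list(range(len(points)))
        random.Random(seed).shuffle(order)
        assignment = [0] * len(points)
        for position, index in enumerate(order):
            assignment[index] = position % k
    else:
        # Caller-supplied partition; every set is assumed to be non-empty.
        assignment = list(initial_assignment)
    centroids = [None] * k
    for _ in range(max_iterations):
        # Centroid (mean point) of each set.
        for j in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == j]
            if cluster:  # an emptied cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(cluster) for col in zip(*cluster)]
        # New partition: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
    return assignment, centroids
```

Unlike the improved variant described on the previous slide, `k` has to be chosen before the run.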

Log Analysis of iTOWNPAGE

Only categories whose session count exceeds the fraction THcat of the total sessions in that cluster are displayed

Many non-sibling categories in the category hierarchy appear in the clusters

We can infer that search sessions with the same input, such as “Hotels”, are performed under various demands and contexts

Clustering web access logs is effective for understanding user behavior
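The THcat display rule mentioned above can be illustrated with a short sketch. The representation of a cluster as a list of sessions (each a list of category names) and the function name are assumptions; the slides do not give a value for THcat.

```python
from collections import Counter


def categories_to_display(cluster_sessions, th_cat):
    """Return {category: share of sessions} for shares above th_cat."""
    total = len(cluster_sessions)
    counts = Counter(
        category
        for session in cluster_sessions
        for category in set(session)  # count a session once per category
    )
    return {
        category: count / total
        for category, count in counts.items()
        if count / total > th_cat
    }


# Made-up cluster of four sessions.
example_cluster = [
    ["Hotels", "Restaurants"],
    ["Hotels"],
    ["Hotels", "Taxis"],
    ["Hotels", "Restaurants"],
]
shown = categories_to_display(example_cluster, th_cat=0.3)
```

Here "Taxis" appears in only one of four sessions and is therefore filtered out at a threshold of 0.3.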

Query Expansion Using Web Log Mining

Motivation

Many requests end with no result

– Best solution: recommend another address

Possible only when coordinate information for addresses is available

– Another solution: recommend categories

Requires a measure of similarity between categories

Can be extracted by clustering the user access log

Many sessions consist of non-sibling categories

– Propose another expansion method that recommends categories which are not similar but are related to the input category

Query Expansion Using Web Log Mining

Strategies for Query Expansion

Intra-Category Recommendation

– Selects sibling categories that appear in major clusters of CATinput

1. Find the clusters that have CATinput as a member, in order of the appearance ratio of CATinput

2. Choose the sibling category with the highest count from each cluster until the number of sibling categories reaches MAXsibl

Inter-Category Recommendation

1. Select non-sibling categories that appear in major clusters of CATinput

2. Choose the non-sibling category of CATinput with the highest count from each cluster, up to MAXnon-sibl, in the same way as the “Intra-Category” step
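The two strategies above can be sketched together, since they differ only in whether siblings or non-siblings are picked. Everything about the data layout is an assumption: each cluster is a dict with a total `sessions` count and per-category `counts`, and `parent` maps every category to its parent in the hierarchy. Skipping a category that an earlier cluster already contributed is also an assumption, as the slides do not say how repeats are handled.

```python
def expand(cat_input, clusters, parent, max_sibl, max_non_sibl):
    """Return (sibling, non-sibling) recommendations for cat_input."""
    # Step 1: clusters containing cat_input, highest appearance ratio first.
    ranked = sorted(
        (c for c in clusters if c["counts"].get(cat_input, 0) > 0),
        key=lambda c: c["counts"][cat_input] / c["sessions"],
        reverse=True,
    )
    siblings, non_siblings = [], []
    for cluster in ranked:
        # Other categories of this cluster, most frequent first.
        candidates = sorted(
            (cat for cat in cluster["counts"] if cat != cat_input),
            key=lambda cat: cluster["counts"][cat],
            reverse=True,
        )
        # Step 2: take at most one category of each kind per cluster.
        for want_sibling, chosen, limit in (
            (True, siblings, max_sibl),
            (False, non_siblings, max_non_sibl),
        ):
            if len(chosen) >= limit:
                continue
            for cat in candidates:
                is_sibling = parent[cat] == parent[cat_input]
                if is_sibling == want_sibling and cat not in chosen:
                    chosen.append(cat)
                    break
    return siblings, non_siblings


# Made-up hierarchy and clusters for illustration only.
example_parent = {
    "Hotels": "Lodging", "Inns": "Lodging", "Hostels": "Lodging",
    "Restaurants": "Food", "Taxis": "Transport",
}
example_clusters = [
    {"sessions": 20, "counts": {"Hotels": 10, "Hostels": 4, "Taxis": 9}},
    {"sessions": 10, "counts": {"Hotels": 8, "Inns": 5, "Restaurants": 6}},
    {"sessions": 5, "counts": {"Restaurants": 5}},
]
example_expansion = expand("Hotels", example_clusters, example_parent, 2, 2)
```

In the example, the cluster where "Hotels" appears in 8 of 10 sessions is consulted first, so "Inns" and "Restaurants" are recommended ahead of "Hostels" and "Taxis".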

Implementation and Evaluation

Use separate log data to test the expansion method

– 1st July to 20th July 2000

First, the test data is converted into sessions

– “Category A -> Category B -> Category C”

Transition relations are extracted from the sessions

– “Category A -> Category B”, “Category B -> Category C”

Features

N: # of test relations

S: # of successful expansions after the expansion test

Ci: # of expanded categories displayed for the i-th test request
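The extraction of transition relations from a test session, as in the "Category A -> Category B -> Category C" example above, amounts to pairing each category with its successor. A minimal sketch, with a hypothetical function name and a session represented as a list of category names:

```python
def transition_relations(session):
    """Pair every category in a session with the one requested next."""
    return list(zip(session, session[1:]))


example_relations = transition_relations(
    ["Category A", "Category B", "Category C"]
)
```

A session with a single category yields no relations, so it contributes nothing to N.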

Conclusion

Reports experimental results of mining access logs from a huge commercial site

Proposes a query expansion method based on clustering of user requests

Enhances the K-means clustering algorithm

Two-step expansion method

Recommendation of similar categories

Recommendation of related categories, even though they are not similar in the category hierarchy

Summary

Pros

It uses real data from a commercial website

Simple and useful

Cons

Nothing special

– Clustering user sessions

Two-step expansion method?
