Experiments on Query Expansion for Internet Yellow Page Services Using Log Mining
Summarized by Dongmin Shin
Presented by Dongmin Shin
User Log Analysis Team
IDS Lab., SNU
2008.08.14.
28th VLDB Conference (2002)
Yusuke Ohura
Katsumi Takahashi
Iko Pramudiono
Masaru Kitsuregawa
Institute of Industrial Science, University of Tokyo
NTT Information Sharing Platform Laboratories
Copyright 2008 by CEBT
Index
Introduction
Internet Yellow Page Service and its Problems
Log Analysis of iTOWNPAGE
Query Expansion Using Web Log Mining
Implementation and Evaluation
Conclusion
Center for E-Business Technology
Introduction
Rapid progress in storage capacity and processor performance
Makes it feasible to analyze the huge log data left on Web servers
But still...
– No technical report on mining huge log data was publicly available
This paper reports...
Results of log data mining and query expansion experiments on a huge commercial Web service
– iTOWNPAGE
An online Japanese telephone directory system (Yellow Page service)
Internet Yellow Page Service and its Problems
iTOWNPAGE
Internet version of TOWNPAGE
Internet Yellow Page Service and its Problems
Problems Found Through Statistical Analysis
Log data
– Access log on iTOWNPAGE from 1st February to 30th June 2000
– 450 million lines, 200GB
1st issue
– Sessions that involve multiple categories
27.2% of the search sessions that take a category as input are multiple-category sessions
75.2% of them used non-sibling categories, i.e. categories that do not share a parent in the category hierarchy iTOWNPAGE provides
2nd issue
– Search requests for which users cannot get any results
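The two ratios quoted for the 1st issue could be computed from sessionized data roughly as follows. This is only a sketch: the `parent` mapping, the sample categories, and the rule that a session counts as "non-sibling" when at least one pair of its categories has different parents are assumptions for illustration, not details taken from the paper.

```python
from itertools import combinations

def session_stats(sessions, parent):
    """Return (share of multi-category sessions,
               share of those that contain a non-sibling pair)."""
    # A session is "multi-category" if it names more than one distinct category.
    multi = [set(s) for s in sessions if len(set(s)) > 1]
    # Assumed rule: non-sibling if any two categories have different parents.
    non_sibling = [
        cats for cats in multi
        if any(parent[a] != parent[b] for a, b in combinations(sorted(cats), 2))
    ]
    multi_share = len(multi) / len(sessions) if sessions else 0.0
    non_sibling_share = len(non_sibling) / len(multi) if multi else 0.0
    return multi_share, non_sibling_share

# Hypothetical two-level hierarchy: category -> parent category.
PARENT = {"Hotels": "Lodging", "Inns": "Lodging", "Rental Cars": "Transport"}
SESSIONS = [
    ["Hotels"],
    ["Hotels", "Inns"],          # siblings
    ["Hotels", "Rental Cars"],   # non-siblings
    ["Inns", "Inns"],            # repeated category, still single-category
]
multi_share, non_sibling_share = session_stats(SESSIONS, PARENT)
print(multi_share, non_sibling_share)
```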
Overview of Search Requests on iTOWNPAGE
Log Analysis of iTOWNPAGE
Session
The sequence of requests from a user
Two consecutive requests within a 30-minute interval are regarded as belonging to the same session
Session vector
– $s_i$: the i-th session vector
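The 30-minute session rule above can be sketched as follows. The record layout `(user_id, timestamp, category)` and the function name `split_sessions` are assumptions made for this example; how iTOWNPAGE actually identifies a user is not stated here.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def split_sessions(requests):
    """Group (user_id, timestamp, category) records into per-user sessions.

    Two consecutive requests of the same user stay in one session
    when they are at most SESSION_GAP apart.
    """
    sessions = []
    last_seen = {}  # user_id -> (timestamp of last request, session index)
    for user, ts, category in sorted(requests, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is not None and ts - prev[0] <= SESSION_GAP:
            idx = prev[1]
            sessions[idx].append(category)
        else:
            sessions.append([category])
            idx = len(sessions) - 1
        last_seen[user] = (ts, idx)
    return sessions

t0 = datetime(2000, 2, 1, 9, 0)
LOG = [
    ("u1", t0, "Hotels"),
    ("u1", t0 + timedelta(minutes=10), "Rental Cars"),
    ("u1", t0 + timedelta(minutes=50), "Restaurants"),  # 40 min gap: new session
    ("u2", t0 + timedelta(minutes=5), "Hotels"),
]
sessions = split_sessions(LOG)
print(sessions)
```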
Log Analysis of iTOWNPAGE
K-means clustering algorithm for clustering sessions
The number of clusters cannot be predicted in advance
The algorithm is improved so that it dynamically decides the number of clusters to be generated
Improved K-means algorithm
The 1st input vector $s_1$ becomes the centroid vector $c_1$ of the first cluster C1
– $s_1$ becomes a member of cluster C1
For each successive input vector $s_i$,
– Its similarity with the existing clusters C1 … Ck is calculated with the similarity formula
If the similarity is below the threshold, a new cluster is generated
If not, the input vector becomes a member of the cluster with the highest similarity
The centroid vector is recalculated with the centroid formula
The process is executed iteratively until it converges
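A minimal sketch of a threshold-driven clustering loop in the spirit of the steps above. Several things here are assumptions rather than the paper's definitions: cosine similarity stands in for the similarity formula, the centroid is taken as the mean of the member vectors, and centroids are recomputed once per pass instead of after every single insertion.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def threshold_kmeans(vectors, threshold, max_iter=20):
    """Cluster vectors without fixing the number of clusters in advance."""
    centroids = []  # grows whenever a vector fits no existing cluster
    labels = []
    for _ in range(max_iter):
        new_labels = []
        for v in vectors:
            sims = [cosine(v, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__, default=None)
            if best is None or sims[best] < threshold:
                centroids.append(list(v))      # v seeds a new cluster
                best = len(centroids) - 1
            new_labels.append(best)
        # Recompute each centroid as the mean of its members;
        # a cluster left without members keeps its previous centroid.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, new_labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break                              # assignments are stable
        labels = new_labels
    return labels, centroids

VECTORS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = threshold_kmeans(VECTORS, threshold=0.8)
print(labels, len(centroids))
```

Lowering `threshold` merges more sessions into fewer clusters; raising it produces more, tighter clusters.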
K-means Clustering Algorithm
Overview
An algorithm to cluster n objects, based on their attributes, into k partitions
It assumes that the object attributes form a vector space
Its objective is to minimize the total intra-cluster variance
Algorithm
– Partition the input points into k initial sets, either at random or using some heuristic
– Calculate the centroid (mean point) of each set
– Construct a new partition by associating each point with the closest centroid
– Recalculate the centroids for the new clusters, and repeat these two steps alternately until convergence
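The standard algorithm described above can be sketched in plain Python. Squared Euclidean distance, the shuffled round-robin initial partition, and the `seed` parameter are choices made for this example only.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Step 1: random initial partition (round-robin over a shuffled order,
    # so no initial set is empty).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: centroid (mean point) of each set; a set that became
        # empty keeps its previous centroid.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 3: associate each point with the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged: the partition no longer changes
        labels = new_labels
    return labels, centroids

POINTS = [[0, 0], [1, 0], [10, 10], [11, 12]]
km_labels, km_centroids = kmeans(POINTS, k=2)
print(km_labels)
```

Unlike the threshold-based variant, `k` has to be supplied by the caller, which is the limitation the improved algorithm addresses.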
Log Analysis of iTOWNPAGE
Only categories whose number of sessions is more than THcat of the total sessions for that cluster are displayed
Many non-sibling categories in the category hierarchy appear in the same clusters
We can infer that search sessions with the same input, such as “Hotels”, are performed under various demands and contexts
Clustering Web access logs is effective for understanding user behavior
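The THcat display rule can be illustrated with a short filter. Treating THcat as a ratio between 0 and 1, and the sample counts, are assumptions for this sketch.

```python
def major_categories(cluster_counts, total_sessions, th_cat):
    """Keep categories whose session share in the cluster exceeds th_cat."""
    return {
        cat: n for cat, n in cluster_counts.items()
        if n / total_sessions > th_cat
    }

# Hypothetical cluster: category -> number of sessions that contain it.
COUNTS = {"Hotels": 80, "Rental Cars": 30, "Florists": 2}
shown = major_categories(COUNTS, total_sessions=100, th_cat=0.1)
print(shown)
```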
Query Expansion Using Web Log Mining
Motivation
Many requests end with no result
– Best solution: recommend another address
Possible only when coordinate information for addresses is available
– Another solution: recommend categories
Requires a similarity measure between categories
It can be extracted by clustering the user access log
Many sessions consist of non-sibling categories
– Propose another expansion method that recommends categories which are not similar but are related to the input category
Query Expansion Using Web Log Mining
Strategies for Query Expansion
Intra-Category Recommendation
– Selects sibling categories that appear in major clusters of CATinput
1. Find the clusters that have CATinput as a member, ordered by the appearance ratio of CATinput
2. Choose the sibling category with the highest count from each cluster until the number of sibling categories reaches MAXsibl
Inter-Category Recommendation
1. Selects non-sibling categories that appear in major clusters of CATinput
2. Choose the non-sibling category of CATinput with the highest count from each cluster, up to MAXnon-sibl, in the same way as in the “Intra-Category” step
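Both strategies can be sketched with one helper that differs only in whether it accepts siblings or non-siblings. The cluster representation (a dict with `sessions` and per-category `counts`), the `parent` mapping, taking the appearance ratio as count divided by sessions in descending order, skipping categories that were already chosen, and making a single pass over the clusters are all assumptions of this sketch.

```python
def recommend(cat_input, clusters, parent, max_sibl, max_non_sibl):
    """Return (sibling recommendations, non-sibling recommendations)."""
    # Clusters that contain cat_input, highest appearance ratio first.
    containing = [c for c in clusters if c["counts"].get(cat_input, 0) > 0]
    containing.sort(
        key=lambda c: c["counts"][cat_input] / c["sessions"], reverse=True
    )

    def pick(want_sibling, limit):
        chosen = []
        for cluster in containing:
            if len(chosen) >= limit:
                break
            candidates = [
                (n, cat) for cat, n in cluster["counts"].items()
                if cat != cat_input
                and cat not in chosen
                and (parent[cat] == parent[cat_input]) == want_sibling
            ]
            if candidates:
                # Highest count wins; ties fall back to the category name.
                chosen.append(max(candidates)[1])
        return chosen

    return pick(True, max_sibl), pick(False, max_non_sibl)

# Hypothetical hierarchy and clusters.
PARENT = {
    "Hotels": "Lodging", "Inns": "Lodging", "Hostels": "Lodging",
    "Rental Cars": "Transport", "Restaurants": "Food",
}
CLUSTERS = [
    {"sessions": 100, "counts": {"Hotels": 90, "Inns": 40, "Rental Cars": 25}},
    {"sessions": 50,
     "counts": {"Hotels": 20, "Hostels": 10, "Restaurants": 30, "Inns": 5}},
    {"sessions": 30, "counts": {"Restaurants": 30}},
]
siblings, non_siblings = recommend("Hotels", CLUSTERS, PARENT, 2, 2)
print(siblings, non_siblings)
```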
Implementation and Evaluation
A separate set of log data is used to test the expansion method
– 1st July to 20th July 2000
First, the test data is converted into sessions
– “Category A -> Category B -> Category C”
Transition relations are extracted from the sessions
– “Category A -> Category B”, “Category B -> Category C”
Features
– N : # of test relations
– S : # of successful expansions after the expansion test
– Ci : # of expanded categories displayed for the i-th test request
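The extraction of transition relations and the tallying of N, S and Ci can be sketched as below. Counting an expansion as successful when the next category of the transition is among the expanded categories is an assumption of this example, and `expand` is a hypothetical stand-in for the expansion method under test.

```python
def transitions(session):
    """["A", "B", "C"] -> [("A", "B"), ("B", "C")]"""
    return list(zip(session, session[1:]))

def evaluate(test_sessions, expand):
    """Return (N, S, [C_1, ..., C_N]) over all transition relations."""
    n = s = 0
    displayed = []
    for session in test_sessions:
        for src, dst in transitions(session):
            expanded = expand(src)
            n += 1
            displayed.append(len(expanded))   # C_i for this test request
            if dst in expanded:               # assumed success criterion
                s += 1
    return n, s, displayed

# Hypothetical expansion table used in place of the real method.
TABLE = {"A": ["B"], "B": ["C", "D"]}
TEST_SESSIONS = [["A", "B", "C"], ["A", "C"]]
n, s, displayed = evaluate(TEST_SESSIONS, lambda cat: TABLE.get(cat, []))
print(n, s, displayed)
```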
Conclusion
Reported experimental results of mining the access log of a huge commercial site
Proposed a query expansion method based on clustering of user requests
Enhanced the K-means clustering algorithm
Two-step expansion method
Recommendation of similar categories
Recommendation of related categories, even though they are not similar in the category hierarchy