A NEW TOPIC: QUERIES WITH GEO-INFORMATION
WEB&MOBILE GROUP
Zheng Huo
SIX TOPICS RELATED
• Spatial pattern mining (Xiangmei Hu)
  – Mining Interesting Locations and Travel Sequences from GPS Trajectories [WWW09]
  – WhereNext: a Location Predictor on Trajectory Pattern Mining [SIGKDD09]
  – Migration Motif: A Spatial-Temporal Pattern Mining Approach for Financial Markets [SIGKDD09]
• Social network (Ruxia Ma)
• Opinion (Jing Zhao)
  – Rated Aspect Summarization of Short Comments [WWW09]
  – How Opinions are Received by Online Communities: A Case Study on Amazon.com Helpfulness Votes [WWW09]
  – OpinionMiner: A Machine Learning System for Web Opinion Mining and Extraction [SIGKDD09]
• Geo + query intention (Zheng Huo)
  – Discovering Users' Specific Geo Intention in Web Search [WWW09]
  – A Probabilistic Topic-Based Ranking Framework for Location-Sensitive Domain Information Retrieval [SIGIR09]
  – Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects [VLDB09]
  – Keyword Search in Spatial Databases: Towards Searching by Document [ICDE09]
• Geographic + image
• kNN applications
OUTLINE
• Background
• Overview
• Methods
  – SDIR
  – GIU method
  – Top-k
  – Others
• Conclusions & future work
BACKGROUND
• Many web queries contain geo info
  – About 30% of queries may have geo intent; about half of them have explicit geo info.
• Examples are queries like “Italian restaurant”, “Car dealer”, “L.A hotel”
• About 13% of queries have a place name
  – 84% of them have explicit city info.
  – 2.6% have state info.
  – 13.4% have country info.
• Can be used in many fields, such as
  – Recommendation systems
  – Improving users’ search experience
  – Advertisement matching
BACKGROUND(cont’)
• Why can’t traditional methods solve this problem perfectly?
[Diagram: Q(location, terms) → score of “textual relevance” and score of “spatial relevance” → hybrid score → ranking]
1. The two scores are combined with a linear function, which is not the best method.
2. Spatial relevance is computed through Euclidean distance, which is not suitable for all cases.
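The traditional pipeline criticized on this slide can be sketched as follows. This is a minimal illustration, not code from any of the cited papers: the weight `alpha`, the normalization by `max_dist`, and the candidate data are all assumptions made for the example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def spatial_relevance(query_loc, doc_loc, max_dist):
    """Map distance to [0, 1]; closer documents score higher (assumed normalization)."""
    d = euclidean_distance(query_loc, doc_loc)
    return max(0.0, 1.0 - d / max_dist)

def hybrid_score(text_score, spatial_score, alpha=0.5):
    """Linear combination of textual and spatial relevance (alpha is an assumed weight)."""
    return alpha * text_score + (1.0 - alpha) * spatial_score

def rank(query_loc, candidates, max_dist=10.0, alpha=0.5):
    """candidates: list of (name, location, text_score); returns names, best first."""
    scored = [
        (hybrid_score(text, spatial_relevance(query_loc, loc, max_dist), alpha), name)
        for name, loc, text in candidates
    ]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical candidates: one is close but weakly on topic, the other far but on topic.
candidates = [
    ("near_but_off_topic", (0.0, 1.0), 0.2),
    ("far_but_on_topic", (0.0, 6.0), 0.9),
]
print(rank((0.0, 0.0), candidates, alpha=0.5))
print(rank((0.0, 0.0), candidates, alpha=0.2))
```

The ranking flips when `alpha` changes, which illustrates why a single hand-tuned linear weight is fragile.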
OVERVIEW
[Diagram: taxonomy of queries with geo-information]
• Queries with geo-information: explicit geo-information, implicit geo-information
• Methods shown: SDIR method, GIU methods, Top-k query, spatial query
• Intent categories: local info, neighbor info, specific region, other…
• Example queries in the diagram:
  – “Beijing Hotels”, “Paris toggery”
  – “Italian Restaurant”, “Dentist”
  – “Car dealer”, “Real estate”
  – “State Maps”, “Hotels”
• Other labels: local geo-info, neighborhood geo-info
A TOPIC-BASED METHOD: SDIR
• An example
A Probabilistic Topic-based Ranking Framework for Location-sensitive Domain Information Retrieval, in SIGIR’09
q1: “Los Angeles basketball game”
q2: “Houston basketball game”
q3: “Boston basketball game”
A piece of news: an NBA match review of the game between the L.A. Lakers and the Rockets (from Houston), in which some other teams such as the Boston Celtics are mentioned briefly.
[Diagram: the queries are sent to a search engine or IR system, which ranks web pages and documents]
SDIR(cont’)
• Problem definition
  – DEFINITION 1. A spatial query is expressed as $q = (q_S, q_T)$, in which $q_S$ represents the geographical condition implied by $q$ and $q_T$ represents the search terms that exclude location names.
  – DEFINITION 2. When evaluated against spatial queries, a document can be viewed as $d = (d_S, d_T)$, in which $d_S$ is the list of location names found in $d$ and $d_T$ represents the document text.
  – We can define the ranking function as:
$$F(q, d) = F(q_T, q_S, d_T, d_S)$$
Assuming that spatial relevance and textual relevance are independent, we can write it as
$$F(q, d) = F_T(q_T, d_T) \oplus F_S(q_S, d_S)$$
SDIR(cont’)
• Framework of SDIR (Spatial-related Domain Information Retrieval)
  – Topic layer: sits between the query layer and the document layer, and consists of topics.
  – Topic: a generalized abstraction of document contents. Each NBA team is a topic.
  – Topic center: a location which the topic is about. For the team Rockets, Houston is the topic center.
  – Q-T relevance $\phi(q, t)$: evaluates the relevance between a query and a topic.
  – D-T relevance $\psi(d, t)$: evaluates the relevance between a document and a topic.
SDIR(cont’)
• Some formulas
$$F(q, d) = F_T(q_T, d_T) \oplus F_S(q_S, d_S)$$
$$F(q, d) = \sum_j \phi(q, t_j)\, \psi(d, t_j)\, \omega_{t_j}(q, d)$$
With $\phi(q, t) = p(t \mid q)$ and $\psi(d, t) = p(d \mid t)$:
$$F(q, d) = \sum_j p(d \mid t_j)\, p(t_j \mid q)\, \omega_{t_j}(q, d)$$
Applying Bayesian theory:
$$F(q, d) \propto \sum_j \frac{p(t_j \mid q_S)\, p(t_j \mid q_T)\, p(t_j \mid d_S)\, p(t_j \mid d_T)\, \omega_{t_j}(q, d)}{p(t_j)}$$
Annotations on the last formula:
  – Obtained from the topic model.
  – Can be directly obtained from the training set.
  – On $\omega_{t_j}(q, d)$:
    1. It works directly between the query and the document.
    2. Popular IR metrics can be used here, such as tf-idf and the cosine function.
    3. Here, the authors used an extended version of the tf-idf method.
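The topic-sum form of the ranking function can be illustrated with a small sketch. All probabilities below are invented for illustration, and $\omega_{t_j}(q, d)$ is simply set to 1 for every topic; neither comes from the paper.

```python
def sdir_score(phi, psi, omega):
    """F(q, d) = sum_j phi(q, t_j) * psi(d, t_j) * omega_{t_j}(q, d).

    phi, psi, omega are dicts keyed by topic name.
    """
    return sum(phi[t] * psi[t] * omega[t] for t in phi)

# Hypothetical D-T relevance for the news item from the earlier example:
# mostly about the Lakers and Rockets, with a brief mention of the Celtics.
psi_news = {"Lakers": 0.5, "Rockets": 0.4, "Celtics": 0.1}

# Hypothetical Q-T relevance for two of the example queries.
phi_la = {"Lakers": 0.7, "Rockets": 0.2, "Celtics": 0.1}       # "Los Angeles basketball game"
phi_boston = {"Lakers": 0.1, "Rockets": 0.1, "Celtics": 0.8}   # "Boston basketball game"

# Assumed: the per-topic query-document weight is 1 for every topic.
omega = {t: 1.0 for t in psi_news}

score_la = sdir_score(phi_la, psi_news, omega)
score_boston = sdir_score(phi_boston, psi_news, omega)
print(score_la, score_boston)
```

Under these assumed numbers the news item scores higher for the Los Angeles query than for the Boston query, which matches the intuition that a brief mention of the Celtics should count for little.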
SDIR(cont’)
• How to learn the topic model?
  1. Determine which domain you are focused on
     – The method is domain-based; the authors trained a model for the domain “NBA basketball games”. This domain is location-related because most fans are interested in local teams.
  2. Data collection
     – Topic documents: data crawled from well-supported web sites, including the NBA official site, ESPN, and Yahoo! Sports
     – Fans: at least 10,000 geo-records for each team
  3. Find a suitable distribution model
     – Use a GP classifier as the model:
       1. It returns probabilistic results for class labels, which matches the ranking purpose well.
       2. GP is non-parametric and does not place prior assumptions.
       3. GP is a kernel machine, which is highly flexible and configurable.
SDIR(cont’)
• Overall procedure
[Diagram: the query (q) and the document (d) are matched through lookup tables to obtain $\phi(q, t_j)$, $\omega_{t_j}(q, d)$, and $\psi(d, t_j)$, which are combined into $F(q, d)$. Diagram labels: $q_S$, $q_T$, LTS, LTT, Inverted Index.]
Geographical Influence Lookup Table (LTS): divides the entire geo-area into small grids of the same size.
grid1   p(t1|g1)   p(t2|g1)   ……
grid2   p(t1|g2)   p(t2|g2)   ……
……      ……         ……         ……
Term-Topic Lookup Table (LTT): for example, given m topics.
w1      p(t1|w1)   p(t2|w1)   ……
w2      p(t1|w2)   p(t2|w2)   ……
……      ……         ……         ……
Per-document table:
d1      P(t1|d1)   P(t2|d1)   ……
d2      P(t1|d2)   P(t2|d2)   ……
……      ……         ……         ……
SDIR(cont’)
• Implementation
  – Data set: take the NBA topic as an example
  – Training set: documents crawled from ESPN/NBA team pages are labeled with the corresponding teams. At least 10,000 records for each team.
  – Geo-grid: cut the entire US main territory into smaller square grids, each of which is 0.2°×0.2°
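A minimal sketch of how a location might be mapped to a 0.2°×0.2° grid cell and used as a key into a geographical lookup table. The indexing scheme (floor of latitude and longitude divided by the cell size), the function names, and the probabilities are assumptions for illustration; the slides do not specify how cells are numbered.

```python
import math

CELL_DEG = 0.2  # cell size stated on the slide: 0.2° x 0.2°

def grid_cell(lat, lon, cell_deg=CELL_DEG):
    """Return an integer (row, col) cell index for a latitude/longitude pair."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

# Hypothetical lookup table: cell -> {topic: p(topic | cell)}.
lts = {
    grid_cell(29.76, -95.37): {"Rockets": 0.8, "Lakers": 0.1},  # a cell around Houston
}

def topic_probs(lat, lon):
    """Look up the topic distribution of the cell containing (lat, lon)."""
    return lts.get(grid_cell(lat, lon), {})

print(grid_cell(29.76, -95.37))
print(topic_probs(29.70, -95.30))   # nearby point, falls in the same cell
print(topic_probs(34.05, -118.24))  # no entry for this cell in the toy table
```

Because every cell has the same size, the lookup is a constant-time dictionary access once the cell index is computed.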
SDIR(cont’)
[Figure: 2-team distribution, Celtics (+1) vs. Bulls (-1)]
[Figure: 5-team distribution for Celtics, Bulls, Rockets, Lakers, Suns]
SDIR(cont’)
• Location: simulate a user from 4 locations
• Query: “MVP” (implicit geo-info)
• Euclidean distance is not suitable here: people from Pittsburgh prefer Boston to Cleveland, although Cleveland is much nearer.
SDIR(cont’)
• Pros and cons
  – High ranking quality on queries with geo-information.
  – Suitable for both explicit and implicit geo queries.
  – BUT it is domain-based; each topic model must be trained separately.
  – Each topic must have only one center; the method can’t deal with multiple centers in one topic.
GIU METHOD
• Overview of the system
Discovering Users’ Specific Geo Intention in Web Search, WWW’09
GIU METHOD(cont’)
• Classifier 1: detect implicit geo intent
  – Use the WOE tool
  – Example: Qc = San Francisco
    Qnc           Frequency
    Pizza         200
    Cheap hotel   150
    49ers         125
    Zoo           100
  – For each city $C_k$, build a bigram language model
  – $Q = w_1 \cdots w_n$, where the $w_i$ are the strings that compose the query
  – The probability of each word is conditioned on the identity of the previous word
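A per-city bigram language model of this kind can be sketched as below, trained on the San Francisco example counts from the slide. The start marker `<s>`, the add-one smoothing, and the class and method names are assumptions for illustration; the slides do not say how unseen bigrams are handled.

```python
from collections import Counter, defaultdict

class BigramModel:
    """Bigram language model for one city, with add-k smoothing (assumed)."""

    def __init__(self, smoothing=1.0):
        self.bigrams = defaultdict(Counter)  # previous word -> Counter of next words
        self.vocab = set()
        self.smoothing = smoothing

    def train(self, queries_with_freq):
        """queries_with_freq: iterable of (non-city query part, frequency)."""
        for query, freq in queries_with_freq:
            words = ["<s>"] + query.lower().split()
            self.vocab.update(words)
            for prev, cur in zip(words, words[1:]):
                self.bigrams[prev][cur] += freq

    def prob(self, query):
        """P(Q | city) as a product of smoothed bigram probabilities."""
        words = ["<s>"] + query.lower().split()
        v = len(self.vocab) + 1  # one extra slot for unseen words
        p = 1.0
        for prev, cur in zip(words, words[1:]):
            following = self.bigrams.get(prev, Counter())
            total = sum(following.values())
            p *= (following[cur] + self.smoothing) / (total + self.smoothing * v)
        return p

# Counts for Qc = San Francisco, taken from the slide.
san_francisco = BigramModel()
san_francisco.train([("Pizza", 200), ("Cheap hotel", 150), ("49ers", 125), ("Zoo", 100)])

print(san_francisco.prob("pizza"))
print(san_francisco.prob("cheap hotel"))
print(san_francisco.prob("ski resort"))  # unseen query still gets a small probability
```

Smoothing matters here because most queries seen at prediction time will contain at least one bigram that never co-occurred with a given city in the log.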
GIU METHOD(cont’)
• City language model
  – Calculate the posterior probability
  – Annotations on the formula: “Uniform distribution”; “Obtained from last formula”
  – Attention! The city language model is built. From now on, when we refer to a city, we mean a city in the city language model, not the geographic one. A high probability means that the query is related to this city, not that the query was issued from that city.
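The formulas on these two slides did not survive extraction. A plausible reconstruction, assuming the standard bigram factorization and Bayes' rule with the uniform prior mentioned in the annotations, is given below; the start marker $w_0$ and the index $j$ over all cities are assumed notation, not taken from the source.

```latex
P(Q \mid C_k) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}, C_k)

P(C_k \mid Q)
  = \frac{P(Q \mid C_k)\, P(C_k)}{\sum_j P(Q \mid C_j)\, P(C_j)}
  = \frac{P(Q \mid C_k)}{\sum_j P(Q \mid C_j)}
  \quad \text{(when } P(C_k) \text{ is uniform)}
```

With a uniform prior, ranking cities by posterior is the same as ranking them by the query likelihood under each city's language model.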
GIU METHOD(cont’)
• Overall data description
  – Three learning tasks
    • Classifier I: detecting implicit geo queries
    • Classifier II: discriminating different localization capabilities of geo queries: local geo intent, neighbor region geo intent, etc.
    • City language models: predicting geo entities related to a query
GIU METHOD(cont’)
• Implementation
  – Use real-world web search logs from Yahoo!
  – Training subset I
    • Randomly sample 20,000 implicit geo queries and 20,000 non-geo queries
    • All the explicit geo queries in the training set are used to generate the city language model (CLM)
GIU METHOD(cont’)
• Generating labels
  – Step 1: get the clicked URL (domain name, DN) for each query
  – Step 2: identify queries in DN+
  – Step 3: identify queries in DN-
  – Step 4: take the non-location parts of the positive samples as the final implicit geo intent queries
  – Randomly sample 20,000 implicit geo queries and 20,000 non-geo queries to train the classifiers. There are 67 DNs in DN+ and 64 DNs in DN-.
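The four labeling steps can be sketched as follows. The domain names, the city list, and the helper names are hypothetical; the actual DN+ and DN- sets and the location parser used by the authors are not given in the slides.

```python
from urllib.parse import urlparse

# Hypothetical domain lists standing in for the DN+ and DN- sets.
DN_POS = {"localeats.example"}
DN_NEG = {"encyclopedia.example"}
# Hypothetical city list standing in for a real location parser.
KNOWN_CITIES = ("san francisco", "boston")

def clicked_domain(url):
    """Step 1: reduce a clicked URL to its domain name."""
    return urlparse(url).netloc.lower()

def strip_location(query):
    """Step 4: keep only the non-location part of a query."""
    q = query.lower()
    for city in KNOWN_CITIES:
        q = q.replace(city, " ")
    return " ".join(q.split())

def label_queries(log):
    """log: iterable of (query, clicked_url). Returns (positives, negatives)."""
    positives, negatives = set(), set()
    for query, url in log:
        dn = clicked_domain(url)
        if dn in DN_POS:                      # Step 2: query clicked a DN+ domain
            rest = strip_location(query)
            if rest:
                positives.add(rest)
        elif dn in DN_NEG:                    # Step 3: query clicked a DN- domain
            negatives.add(query.lower())
    return positives, negatives

log = [
    ("San Francisco pizza", "http://localeats.example/sf"),
    ("photosynthesis", "http://encyclopedia.example/p"),
    ("weather", "http://other.example/"),
]
positives, negatives = label_queries(log)
print(positives, negatives)
```

Queries whose clicks land on neither list are left unlabeled in this sketch.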
GIU METHOD(cont’)
• Evaluate the classifiers
GIU METHOD(cont’)
• Evaluating Classifier II
[Diagram: implicit geo queries → Classifier II → discriminate LG, NG, RG]
The result of the classification forms training subset II.
[Figure legend: low-dimensional features; all features]
GIU METHOD(cont’)
• Training models evaluation
  – The training data is training subset II
  – The classifiers classify the queries generated at the city level. The result of this step forms training subset III / testing subset III.
GIU METHOD(cont’)
• Location-specific query discovery
  – A threshold ta is tuned with training subset III
GIU METHOD(cont’)
• Conclusions of GIU method
  – WOE tool
  – Detect the implicit geo intent using the probability of the co-occurrence of a city and a query. The CLM is generated here.
  – Discriminate LG, NG and RG geo intention; predict the location of the entity in Q
GIU METHOD(cont’)
• Pros and cons
  – Can be used for both explicit and implicit geo queries.
  – Compared to the topic-based method, the GIU method is more flexible and useful.
  – BUT a query-log-based method is constrained.
  – The classifiers are not improved, and the performance is not quite good.
TOP-K
• Introduction
Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects, VLDB’09
[Figure: a query Q and objects O1 to O6; local geo info]
Questions:
• How to represent location proximity and text relevancy?
• What kind of index can combine both location proximity and text relevancy?
TOP-K
• A simple example
TOP-K
• Hybrid index
[Figure: an IR-tree; objects and bounding rectangles]
TOP-K
• IR-tree algorithm
[Figure: contents of the priority queue during the IR-tree search, front of the queue first. Entries shown: R7; R5, 0.05119; R6, 0.269; R1, 0.238; R2, 0.1048; O1, 0.238; O2, 0.512; O3, 0.481; O4, 0.517; O8, 0.686]
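The queue-driven search suggested by the figure can be sketched as a generic best-first traversal. This is not the paper's algorithm verbatim: it assumes that lower scores are better, that each inner node carries a lower bound on the scores of everything beneath it, and it uses a small invented tree rather than the one on the slide.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    score: float                 # inner node: lower bound for its subtree; object: exact score
    is_object: bool = False
    children: list = field(default_factory=list)

def top_k(root, k):
    """Return the k best (name, score) pairs, best first, assuming lower is better."""
    heap = [(root.score, 0, root)]   # (score, tie-breaker, node)
    counter = 1
    results = []
    while heap and len(results) < k:
        score, _, node = heapq.heappop(heap)
        if node.is_object:
            # No remaining entry can beat this object, since bounds never overestimate.
            results.append((node.name, score))
            continue
        for child in node.children:
            heapq.heappush(heap, (child.score, counter, child))
            counter += 1
    return results

# Hypothetical tree: two inner nodes under a root, two objects each.
tree = Node("root", 0.0, children=[
    Node("A", 0.1, children=[Node("o1", 0.3, True), Node("o2", 0.5, True)]),
    Node("B", 0.2, children=[Node("o3", 0.25, True), Node("o4", 0.9, True)]),
])
print(top_k(tree, 2))
```

The search stops as soon as k objects have been popped, so subtrees whose bound is worse than the k-th result are never opened.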
TOP-K
• DIR-tree
  – Bounding rectangles focused only on location proximity
TOP-K
• DIR-tree(cont’)
[Figure: DIR-tree vs. IR-tree for a top-2 query]
TOP-K
• Conclusions
  – Proposed a new indexing framework for location-aware top-k text retrieval.
  – The framework integrates the inverted file for text retrieval and the R-tree for spatial proximity querying in a novel manner.
  – BUT it is only used for users to search local geo-information.
CONCLUSIONS & FUTURE WORK
• Research on discovering users’ implicit geo intention has been hot in recent years.
  – Some existing methods are based on models trained on large data sets, which are hard to adjust and apply to other domains.
  – If it is local geo information, it comes down to the question of kNN.
• Besides training methods, is there another way to model users’ implicit geo intention?