A NEW TOPIC: QUERIES WITH GEO-INFORMATION

WEB&MOBILE GROUP

Zheng Huo

SIX TOPICS RELATED

• Spatial pattern mining (Xiangmei Hu)
– Mining Interesting Locations and Travel Sequences from GPS Trajectories [WWW09]
– WhereNext: a Location Predictor on Trajectory Pattern Mining [SIGKDD09]
– Migration Motif: A Spatial-Temporal Pattern Mining Approach for Financial Markets [SIGKDD09]

• Social network (Ruxia Ma)

• Opinion (Jing Zhao)
– Rated Aspect Summarization of Short Comments [WWW09]
– How Opinions are Received by Online Communities: A Case Study on Amazon.com Helpfulness Votes [WWW09]
– OpinionMiner: A Machine Learning System for Web Opinion Mining and Extraction [SIGKDD09]

• Geo + query intention (Zheng Huo)
– Discovering Users' Specific Geo Intention in Web Search [WWW09]
– A Probabilistic Topic-Based Ranking Framework for Location-Sensitive Domain Information Retrieval [SIGIR09]
– Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects [VLDB09]
– Keyword Search in Spatial Databases: Towards Searching by Document [ICDE09]

• Geographic + image

• kNN applications

OUTLINE

• Background

• Overview

• Methods
– SDIR
– GIU method
– Top-k
– Others

• Conclusions & future work

BACKGROUND

• Many web queries contain geo info
– About 30% of queries may have geo intent; about half of them have explicit geo info.

• For example, queries like “Italian restaurant”, “Car dealer”, “L.A. hotel”

• About 13% of queries have a place name
– 84% of them have explicit city info.
– 2.6% have state info.
– 13.4% have country info.

• Can be used in many fields, such as
– Recommendation systems
– Improving users’ search experience
– Advertisement matching

BACKGROUND(cont’)

• Why can’t traditional methods solve this problem well?

Q(Location, terms) → scores of “textual relevance” + scores of “spatial relevance” → hybrid score → ranking

1. The two scores are combined with a linear function, which is not the best method.

2. Spatial relevance is computed through “Euclidean distance”, which is not suitable for all cases.
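To make the criticized baseline concrete, here is a minimal sketch of a linear hybrid score. The weight `alpha`, the cut-off `max_dist`, the distance-to-score mapping and the two sample documents are all assumptions chosen for illustration; none of them come from the papers surveyed.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def hybrid_score(text_score, query_loc, doc_loc, alpha=0.5, max_dist=10.0):
    """Linear mix of textual relevance and a distance-based spatial relevance.

    alpha and max_dist are hypothetical tuning knobs, not values from any paper.
    """
    dist = euclidean(query_loc, doc_loc)
    spatial_score = max(0.0, 1.0 - dist / max_dist)  # closer means higher, floored at 0
    return alpha * text_score + (1.0 - alpha) * spatial_score

# Made-up documents: (name, textual relevance, location)
docs = [
    ("far_but_relevant", 0.9, (8.0, 0.0)),
    ("near_but_weak", 0.4, (1.0, 0.0)),
]
user_loc = (0.0, 0.0)
ranked = sorted(docs, key=lambda d: hybrid_score(d[1], user_loc, d[2]), reverse=True)
```

With these numbers the nearby but weakly matching document outranks the strongly matching one, which shows how sensitive the ranking is to the choice of weight and distance function.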


OVERVIEW

[Figure: taxonomy of queries with geo-information. Recoverable labels are listed below.]

• Queries with geo-information
– Explicit geo-information
– Implicit geo-information

• Methods: SDIR method, GIU methods, Top-k query, Spatial query

• Intent types: Local info, Neighbor info, Specific region, Other

• Example queries
– “Beijing Hotels”, “Paris toggery”
– “Italian Restaurant”, “Dentist”
– “Car dealer”, “Real estate”
– “State Maps”, “Hotels”

• Local geo-info, Neighborhood geo-info


A TOPIC-BASED METHOD: SDIR

• An example

A Probabilistic Topic-based Ranking Framework for Location-sensitive Domain Information Retrieval, in SIGIR’09

q1: “Los Angeles basketball game”

q2: “Houston basketball game”

q3: “Boston basketball game”

A piece of news: an NBA match review of the game between the L.A. Lakers and the Rockets (from Houston), in which some other teams, such as the Boston Celtics, are mentioned briefly.

[Figure: the queries and a collection of web pages and documents, connected to a search engine or IR system.]

SDIR(cont’)

• Problem definition
– DEFINITION 1. A spatial query is expressed as $q = (q_S, q_T)$, in which $q_S$ represents the geographical condition implied by $q$ and $q_T$ represents the search terms, excluding location names.
– DEFINITION 2. When evaluated against spatial queries, a document can be viewed as $d = (d_S, d_T)$, in which $d_S$ is the list of location names found in $d$ and $d_T$ represents the document text.
– We can define the ranking function as:

$F(q, d) = F(q_T, q_S, d_T, d_S)$

Assuming that spatial relevance and textual relevance are independent, we can write it as

$F(q, d) = F_T(q_T, d_T) \oplus F_S(q_S, d_S)$

SDIR(cont’)

• Framework of SDIR (Spatial-related Domain Information Retrieval)

Topic layer: sits between the query layer and the document layer, and consists of topics

Topic: a generalized abstraction of document contents. Each NBA team is a topic.

Topic center: the location that the topic is about. For the team Rockets, Houston is the topic center.

Q-T relevance ϕ(q, t): evaluates the relevance between a query and a topic

D-T relevance ψ(d, t): evaluates the relevance between a document and a topic

SDIR(cont’)

• Some formulas

$$F(q, d) = F_T(q_T, d_T) \oplus F_S(q_S, d_S)$$

$$F(q, d) = \sum_j \phi(q, t_j)\,\psi(d, t_j)\,\omega_{t_j}(q, d)$$

With $\phi(q, t) = p(t \mid q)$ and $\psi(d, t) = p(d \mid t)$:

$$F(q, d) = \sum_j p(d \mid t_j)\,p(t_j \mid q)\,\omega_{t_j}(q, d)$$

Applying Bayes' theorem:

$$F(q, d) \propto \sum_j \frac{p(t_j \mid q_S)\,p(t_j \mid q_T)\,p(t_j \mid d_S)\,p(t_j \mid d_T)\,\omega_{t_j}(q, d)}{p(t_j)}$$

Slide callouts:

• Obtained from the topic model

• 1. It works directly between the query and the document. 2. Popular IR metrics can be used here, such as tf-idf and the cosine function. 3. Here, the authors used an extended version of the tf-idf method.

• Can be directly obtained from the training set
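The final proportional form of F(q, d) can be evaluated with a few dictionary lookups. The sketch below is only an illustration: the function name `sdir_score`, the three topics and every probability value are made up, and ω is fixed to 1 to keep the arithmetic visible.

```python
def sdir_score(p_t_qs, p_t_qt, p_t_ds, p_t_dt, p_t, omega):
    """Sum over topics t of p(t|qS) p(t|qT) p(t|dS) p(t|dT) omega_t / p(t).

    Each argument maps a topic name to a number for one (query, document) pair.
    """
    total = 0.0
    for t in p_t:
        total += p_t_qs[t] * p_t_qt[t] * p_t_ds[t] * p_t_dt[t] * omega[t] / p_t[t]
    return total

topics = ["Lakers", "Rockets", "Celtics"]
prior = {t: 1.0 / 3.0 for t in topics}     # p(t), hypothetical
generic = {t: 1.0 / 3.0 for t in topics}   # p(t|qT) for "basketball game", hypothetical
omega = {t: 1.0 for t in topics}           # direct query-document weight, fixed to 1 here

# Hypothetical document: a Lakers vs. Rockets review that mentions the Celtics briefly.
doc_s = {"Lakers": 0.45, "Rockets": 0.45, "Celtics": 0.10}  # p(t|dS)
doc_t = {"Lakers": 0.45, "Rockets": 0.45, "Celtics": 0.10}  # p(t|dT)

# Hypothetical p(t|qS) for the location parts "Houston" and "Boston".
houston = {"Lakers": 0.05, "Rockets": 0.90, "Celtics": 0.05}
boston = {"Lakers": 0.05, "Rockets": 0.05, "Celtics": 0.90}

score_houston = sdir_score(houston, generic, doc_s, doc_t, prior, omega)
score_boston = sdir_score(boston, generic, doc_s, doc_t, prior, omega)
```

Under these made-up numbers the review scores higher for the Houston query than for the Boston query, matching the intuition of the example slide: a brief mention of the Celtics should not make the article a top result for Boston.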

SDIR(cont’)

• How to learn the topic model?

• Determine which domain you are focused on
– This method is domain-based: the authors trained a model for the domain “NBA basketball games”. The domain is location-related because most fans are interested in local teams.

• Data collection
– Topic documents: crawled from well-supported web sites, including the NBA official site, ESPN, and Yahoo! Sports
– Fans: at least 10,000 geo-records for each team

• Find a suitable distribution model

• Use a GP classifier to build the model
– It returns probabilistic results for class labels, which matches the ranking purpose well.
– GP is non-parametric and does not place prior assumptions
– GP is a kernel machine, which is highly flexible and configurable

SDIR(cont’)

• Overall procedure

[Figure: overall data flow. Labels: Query(q), qS, qT, LTS, LTT, Inverted Index, Document(d), ϕ(q, tj), ψ(d, tj), ωtj(q, d), F(q, d).]

LTS, the Geographical Influence Lookup Table: LTS divides the entire geographic area into small grids of equal size.

grid1   p(t1|g1)   p(t2|g1)   ……
grid2   p(t1|g2)   p(t2|g2)   ……
……      ……         ……         ……

LTT, the Term-Topic Lookup Table (for example, given m topics):

w1      p(t1|w1)   p(t2|w1)   ……
w2      p(t1|w2)   p(t2|w2)   ……
……      ……         ……         ……

Per-document topic probabilities:

d1      P(t1|d1)   P(t2|d1)   ……
d2      P(t1|d2)   P(t2|d2)   ……
……      ……         ……         ……

SDIR(cont’)

• Implementation

– Data set: take the NBA topic as an example

– Training set: documents crawled from ESPN/NBA team pages are labeled with the corresponding teams. At least 10,000 records for each team.

– Geo-grid: cut the entire US main territory into smaller square grids, each of which is 0.2°×0.2°
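A 0.2°×0.2° grid turns a coordinate into a lookup key with two floor divisions. The following is a minimal sketch: the origin (24°N, 125°W) is an assumed south-west corner for the grid, and the `lts` entry holds made-up probabilities in the spirit of the Geographical Influence Lookup Table.

```python
import math

CELL_DEG = 0.2  # cell size from the slide: 0.2 degrees on each side

def grid_cell(lat, lon, origin_lat=24.0, origin_lon=-125.0, cell=CELL_DEG):
    """Map a latitude/longitude pair to a (row, col) grid index.

    origin_lat and origin_lon are hypothetical; any fixed corner would do.
    """
    row = math.floor((lat - origin_lat) / cell)
    col = math.floor((lon - origin_lon) / cell)
    return row, col

# Hypothetical lookup-table entry: cell -> p(topic | cell)
lts = {
    (28, 148): {"Rockets": 0.80, "Lakers": 0.12, "Celtics": 0.08},
}

cell = grid_cell(29.76, -95.37)   # a point in Houston
topic_probs = lts.get(cell, {})   # empty dict when the cell has no entry
```

Every point that falls in the same cell shares one table row, so the cell size trades lookup-table size against geographic resolution.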


SDIR(cont’)

[Figure: 2-team distribution, Celtics (+1) vs. Bulls (-1); 5-team distribution, Celtics, Bulls, Rockets, Lakers, Suns.]

SDIR(cont’)

Location: simulate a user from 4 locations
Query: “MVP” (implicit geo-info)

Euclidean distance is not suitable for this: people from Pittsburgh prefer Boston to Cleveland, although Cleveland is much nearer.

SDIR(cont’)

• Pros and cons
– High ranking quality on queries with geo-information.
– Suitable for explicit and implicit geo queries.
– BUT it is domain-based: each topic model must be trained separately.
– Each topic must have exactly one center; it can’t deal with multiple centers in one topic.



GIU METHOD

• Overview of the system

Discovering Users’ Specific Geo Intention in Web Search, WWW’09

GIU METHOD(cont’)

• Classifier 1: detect implicit geo intent

Use WOE tool

Qc: San Francisco

Qnc            Frequency
Pizza          200
Cheap hotel    150
49ers          125
Zoo            100

For each city Ck, build a bigram language model

Q = w1 · · · wn

where the wi are the strings that compose the query. The probability of each word is conditioned on the identity of the previous word.
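A per-city bigram model can be sketched in a few lines. This is an illustration only: the class name `BigramCityModel`, the `<s>` start token and the add-one smoothing are assumptions, not details taken from the GIU paper. The training counts reuse the San Francisco table above.

```python
import math
from collections import defaultdict

class BigramCityModel:
    """Bigram language model over the queries observed together with one city."""

    def __init__(self):
        self.bigram = defaultdict(int)    # (previous word, word) -> count
        self.context = defaultdict(int)   # previous word -> count
        self.vocab = set()

    def add(self, query, freq=1):
        words = ["<s>"] + query.lower().split()   # "<s>" marks the query start (assumption)
        for prev, cur in zip(words, words[1:]):
            self.bigram[(prev, cur)] += freq
            self.context[prev] += freq
            self.vocab.add(cur)

    def log_prob(self, query):
        """log P(Q | city) as a sum of log P(w_i | w_{i-1}), with add-one smoothing."""
        words = ["<s>"] + query.lower().split()
        v = len(self.vocab) + 1                   # one extra slot for unseen words
        lp = 0.0
        for prev, cur in zip(words, words[1:]):
            num = self.bigram.get((prev, cur), 0) + 1
            den = self.context.get(prev, 0) + v
            lp += math.log(num / den)
        return lp

sf = BigramCityModel()
for q, f in [("Pizza", 200), ("Cheap hotel", 150), ("49ers", 125), ("Zoo", 100)]:
    sf.add(q, f)

seen = sf.log_prob("cheap hotel")
unseen = sf.log_prob("hotel cheap")
```

The query that was observed with the city gets a much higher log-probability than the same words in an unseen order, which is the effect of conditioning each word on its predecessor.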

GIU METHOD(cont’)

• City language model

– Calculate the posterior probability

Callouts on the formula: “Uniform distribution”; “Obtained from the last formula”

Attention! The city language model is now built. From now on, when we refer to a city, we mean a city in the city language model, not the geographic one. A high probability means that the query is related to this city, not that the query was issued from that city.
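The posterior formula itself did not survive on the slide. Assuming it is the usual Bayes rule, P(Ck|Q) = P(Q|Ck) P(Ck) / Σj P(Q|Cj) P(Cj), a uniform prior P(Ck) cancels and the posterior is just the normalized likelihood. The sketch below follows that assumption; the function name and the two likelihood values are made up.

```python
import math

def city_posterior(log_likelihoods):
    """Turn log P(Q | C_k) per city into P(C_k | Q) under a uniform prior.

    The uniform prior cancels in Bayes' rule, so only the likelihoods are normalized.
    Subtracting the maximum first keeps exp() numerically stable.
    """
    m = max(log_likelihoods.values())
    weights = {c: math.exp(lp - m) for c, lp in log_likelihoods.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}

# Hypothetical likelihoods of one query under two city language models
post = city_posterior({
    "San Francisco": math.log(0.02),
    "Boston": math.log(0.005),
})
```

Here the query is four times as likely under the San Francisco model, so that city receives 80% of the posterior mass.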

GIU METHOD(cont’)

• Overall data description
– Three learning tasks
  • Classifier I: Detecting implicit geo queries
  • Classifier II: Discriminating different localization capabilities of geo queries: local geo intent, neighbor region geo intent, etc.
  • City language models: Predicting geo entities related to a query


GIU METHOD(cont’)

• Implementation
– Use real-world web search logs from Yahoo!
– Training subset I
  • Randomly sample 20,000 implicit geo queries and 20,000 non-geo queries
  • All the explicit geo queries in the training set are used to generate the city language model (CLM)


GIU METHOD(cont’)

• Generating labels

Step 1: Get the clicked URL (domain name) for each query

Step 2: Identify queries in DN+

Step 3: Identify queries in DN-

Step 4: Take the non-location parts of the positive samples as the final implicit geo intent queries

Randomly sample 20,000 implicit geo queries and 20,000 non-geo queries to train the classifiers. There are 67 DNs in DN+ and 64 DNs in DN-.

GIU METHOD(cont’)

• Evaluate the classifiers


GIU METHOD(cont’)

• Evaluating Classifier II

Implicit geo queries → Classifier II → discriminate LG, NG, RG

Feature sets: low-dimensional features; all features

The result of the classification forms training subset II.

GIU METHOD(cont’)

• Training models evaluation

– The training data is training subset II

The classifiers classify the queries generated at the city level. The result of this step forms training subset III / testing subset III.

GIU METHOD(cont’)

• Location-specific query discovery


A threshold

To tune ta with training subset III

GIU METHOD(cont’)

• Conclusions of GIU method

WOE tool

Detect the implicit geo intent, using the probability of the co-occurrence of a city and a query. The CLM is generated here.

Discriminate LG, NG and RG geo intention, and predict the location of the entity in Q

GIU METHOD(cont’)

• Pros and cons
– Can be used for both explicit and implicit geo queries.
– Compared to the topic-based method, the GIU method is more flexible and useful.
– BUT a query-log-based method is constrained
– The classifiers are not improved, and the performance is not very good.



TOP-K

• Introduction

Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects, VLDB’09

[Figure: a query Q and objects O1 to O6; local geo info.]

Questions:

• How to represent location proximity and text relevancy?

• What kind of index can combine both location proximity and text relevancy?

TOP-K


• A simple example


TOP-K

• Hybrid index

[Figure: an IR-tree, with objects and bounding rectangles.]

TOP-K

• IR-tree algorithm

[Figure: trace of the search queue, starting from the front. Entries shown (node, score): R7; R5, 0.05119; R6, 0.269; R1, 0.238; R2, 0.1048; O1, 0.238; O2, 0.512; O3, 0.481; O4, 0.517; O8, 0.686.]
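The queue trace suggests a best-first traversal: always expand the entry with the smallest score, and report objects as they reach the front. The sketch below illustrates that idea with a plain priority queue. It is not the paper's implementation: the `Node` class is hypothetical, the parent/child arrangement of R1, R2, R5, R6 and R7 is an assumption read off the trace, R6's children (O5, O6) and their scores are made up, and each inner node's score is assumed to be a lower bound on the scores in its subtree.

```python
import heapq
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    score: float                 # object: ranking score; inner node: lower bound for its subtree
    is_object: bool = False
    children: List["Node"] = field(default_factory=list)

def top_k(root, k):
    """Best-first search: pop the smallest score, expand inner nodes, collect objects."""
    heap = [(root.score, 0, root)]
    counter = 1                  # tie-breaker so Node instances are never compared
    results = []
    while heap and len(results) < k:
        score, _, node = heapq.heappop(heap)
        if node.is_object:
            results.append((node.name, score))
            continue
        for child in node.children:
            heapq.heappush(heap, (child.score, counter, child))
            counter += 1
    return results

def obj(name, score):
    return Node(name, score, is_object=True)

# Assumed tree layout; scores for R* and O1, O2, O3, O4, O8 are taken from the slide trace.
r1 = Node("R1", 0.238, children=[obj("O1", 0.238), obj("O2", 0.512)])
r2 = Node("R2", 0.1048, children=[obj("O3", 0.481), obj("O4", 0.517), obj("O8", 0.686)])
r5 = Node("R5", 0.05119, children=[r1, r2])
r6 = Node("R6", 0.269, children=[obj("O5", 0.9), obj("O6", 0.95)])  # made-up children
root = Node("R7", 0.0, children=[r5, r6])

best_two = top_k(root, 2)
```

Because inner-node scores are lower bounds, the search can stop as soon as k objects have been popped; subtrees whose bound exceeds the k-th result are never expanded.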

TOP-K

• DIR-tree

Bounding rectangles focused only on location proximity

TOP-K

• DIR-tree(cont’)

[Figure: DIR-tree and IR-tree compared on a top-2 query.]

TOP-K

• Conclusions
– Proposed a new indexing framework for location-aware top-k text retrieval.
– The framework integrates the inverted file for text retrieval and the R-tree for spatial proximity querying in a novel manner.
– BUT it only serves users searching for local geo-information.


CONCLUSIONS & FUTURE WORK

• Discovering users’ implicit geo intention has been an active research topic in recent years.
– Some existing methods train models on large data sets, which makes them hard to adjust and to apply to other domains.
– For local geo information, the problem comes down to kNN search.

• Apart from training-based methods, are there other ways to model users’ implicit geo intention?

Thanks! Q&A