location prediction


Upload: knoesis-center-wright-state-university

Post on 13-Jul-2015


Page 1: Location prediction

Knowledge Enabled Location Prediction of Twitter

Users

Master’s Thesis

Revathy Krishnamurthy

Committee

Amit P. Sheth (Advisor)

Krishnaprasad Thirunarayan

Derek Doran

Collaborator

Pavan Kapanipathi

1

Page 2: Location prediction

Background Knowledge can improve a machine’s ability to interpret text

BUCKEYE STATE

2

Page 3: Location prediction

BACKGROUND KNOWLEDGE

3

Page 4: Location prediction

Geographic footprint of a Twitter user

4

Page 5: Location prediction

News Recommender Systems

Beavercreek preschool to open in 2015

By Sharon D. Boykin

A $5.1 million preschool in Beavercreek City Schools district will help accommodate a growing student population and reduce overcrowding, according to school officials.

Ohio’s health exchange to include more competition

By Randy Tucker

It was just a year ago that the insurance industry fretted over potential losses from the new insurance market created by the Affordable Care Act.

Recommended for you

WHY IS LOCATION IMPORTANT?

• Targeted advertising

• Opinion Analysis

• Disaster Response

• Location Based Services

Other applications

5

Page 7: Location prediction

Geo-tagged Tweets Profile Information

LOCATION PUBLISHED BY USER

• Less than 4% of tweets contain geo-spatial tags

• Location field in profile is either empty or contains invalid information such as “Justin Bieber’s heart”

7

Page 8: Location prediction

INFERRING LOCATION OF A TWITTER USER

Network based: Friends, Followees, Followers

Content based:

Just drove around Golden Gate Park two times trying to get in

Cleveland Browns confuse me. When I give up on them, they actually show up to play.

8

Page 9: Location prediction

NETWORK BASED APPROACHES

Friends, Followers, Followees

Depends on the friends and followers of a user whose locations are known

9

Page 10: Location prediction

CONTENT BASED APPROACHES

Geographic location of a user influences the contents of their tweets

Just drove around Golden Gate Park two times trying to get in

Cleveland Browns confuse me. When I give up on them, they actually show up to play.

• Supervised Approaches
  • Probabilistic Models – (Cheng, Caverlee, and Lee, 2010)
  • Cascading Topic Models – (Eisenstein, Connor, Smith, and Xing, 2010)
  • Gaussian Mixture Model – (Chang, Lee, Eltaher, and Lee, 2012)
  • Language Models – (Doran, Gokhale, and Dagnino, 2014)
  • Ensemble of Statistical and Heuristic Classifiers – (Mahmud, Nichols, and Drews, 2014)

10

Page 11: Location prediction

Content-based approach

APPROACHES TO LOCATE A TWITTER USER

Reference: Cheng, Caverlee, and Lee, 2010

11

Page 13: Location prediction

PROBLEM STATEMENT

13

Predict the location of a Twitter user based on their tweets, by exploiting Wikipedia to create a location-specific knowledge base

Page 14: Location prediction

• Knowledge-enabled approach to predict the location of Twitter users based on the contents of their tweets without using any training dataset of geo-tagged tweets

• Creation of a location-specific knowledge base extracted from Wikipedia by introducing the concept of Local Entities

• Evaluation of the approach on a publicly available dataset with 55% accuracy and 429 miles of Average Error Distance

CONTRIBUTIONS

14

Page 15: Location prediction

KNOWLEDGE-BASE ENABLED APPROACH

San Francisco: Golden Gate Bridge, San Francisco 49ers, San Francisco Chronicle …

Entity                     Count
Golden Gate Bridge         4
San Francisco 49ers        2
San Francisco Chronicle    1

Top-k predictions: San Francisco, Oakland, Palo Alto

15

Page 16: Location prediction

KNOWLEDGE-BASE ENABLED APPROACH

[Figure: architecture of the approach, with three components]

• KNOWLEDGE BASE GENERATOR: Internal Links Extraction over city-1, city-2, …, city-k yields LocalEntity-1, LocalEntity-2, …, LocalEntity-n, which become Weighted Local Entities

• USER PROFILE GENERATOR: Entity Recognition and Scoring, with Annotated Tweets

• LOCATION PREDICTION: Location Predictor, which outputs ranked cities for the user

16

Page 17: Location prediction

SAN FRANCISCO NEW YORK CITY

HOUSTON

LOCAL ENTITIES

17

Page 18: Location prediction

• Collaborative encyclopedia

• As of 2014, English Wikipedia has 4.6 million articles, 18 billion page views and 500 million unique visitors per month.

• Category Structure
  • Used for document clustering, tweet classification, personalization systems etc.
  • At Kno.e.sis, used in applications such as
    • Doozer (Thomas, Mehra, Brooks, and Sheth, 2008)
    • BLOOMS (Jain, Hitzler, Sheth, Verma, and Yeh, 2010)
    • Hierarchical Interest Graph (Kapanipathi, Jain, Venkataramani, and Sheth, 2014)

• Link Structure
  • Used for word sense disambiguation, semantic relatedness between terms etc.

WIKIPEDIA

18

Page 19: Location prediction

LINK STRUCTURE OF WIKIPEDIA

19

Page 21: Location prediction

“In general, links should be created to relevant connections to the subject of another article that will help readers understand the article more fully. This can include people, events, and topics that already have an article or that clearly deserve one, so long as the link is relevant to the article in question.”

Source: http://en.wikipedia.org/wiki/Help:Link#Wikilinks

LINK STRUCTURE OF WIKIPEDIA

21

Page 22: Location prediction

LOCAL ENTITIES

• We consider the internal links of a location's Wikipedia page as the Local Entities of the city

• While a city page does not contain a link to itself, we also use the city itself as a local entity

Local Entities of San Francisco

22
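The definition above can be illustrated with a minimal sketch that pulls internal links out of raw wikitext. The regular expression, the namespace filter and the sample text are assumptions made for this illustration; the slides do not say how the links were actually extracted.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the target page title is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Namespaces that do not denote article entities (an assumed filter).
SKIPPED_PREFIXES = ("File:", "Image:", "Category:", "Template:", "Help:")


def local_entities(city_title, wikitext):
    """Return the local entities of a city: its internal links plus the city itself."""
    entities = {city_title}  # a city page does not link to itself, so add it explicitly
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if target and not target.startswith(SKIPPED_PREFIXES):
            entities.add(target)
    return entities


# Hypothetical wikitext fragment, not the real San Francisco article.
sample = (
    "San Francisco is known for the [[Golden Gate Bridge]], "
    "[[Alcatraz Island|Alcatraz]] and the [[San Francisco 49ers#History|49ers]]. "
    "[[File:SF skyline.jpg|thumb]]"
)
entities = local_entities("San Francisco", sample)
print(sorted(entities))
```

Keeping the result as a set means a page that links to the same article several times still contributes that entity only once.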

Page 23: Location prediction

LOCAL ENTITIES

San Francisco, California – 717 local entities

Fairborn, Ohio – 110 local entities

23

Page 24: Location prediction

ARE ALL ENTITIES EQUALLY LOCAL?

24

Page 25: Location prediction

ARE ALL ENTITIES EQUALLY LOCAL?

25

San Francisco Chronicle, San Francisco Examiner, SF Weekly

MSNBC, CNN, BBC, Al Jazeera America

Page 26: Location prediction

• Pointwise Mutual Information (PMI) – standard measure of association between two variables

• Assumption: the higher the localness of an entity with respect to a city, the higher the statistical dependence between them

• Computed as:

$$\mathrm{PMI}(c, e) = \log_2 \frac{P(c, e)}{P(c)\,P(e)}$$

Association-based Measure

LOCALNESS MEASURE OF ENTITIES

26
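A small sketch of the PMI computation follows. The slides do not state how the probabilities are estimated, so the page-count estimates used here (pages mentioning the city, pages mentioning the entity, pages mentioning both) and all of the counts are assumptions for illustration only.

```python
import math


def pmi(pages_with_c, pages_with_e, pages_with_both, total_pages):
    """PMI(c, e) = log2(P(c, e) / (P(c) * P(e))), with probabilities estimated from page counts."""
    if pages_with_both == 0:
        return float("-inf")  # the city and the entity never co-occur
    p_c = pages_with_c / total_pages
    p_e = pages_with_e / total_pages
    p_ce = pages_with_both / total_pages
    return math.log2(p_ce / (p_c * p_e))


# Hypothetical counts: a rarely mentioned entity of a small city ...
rare = pmi(pages_with_c=8, pages_with_e=4, pages_with_both=4, total_pages=1024)
# ... versus a frequently mentioned entity of a large city.
common = pmi(pages_with_c=512, pages_with_e=256, pages_with_both=128, total_pages=1024)
print(rare, common)
```

With these made-up counts the rare pair scores far higher than the common one, which is the kind of sensitivity to low occurrence counts that an unnormalized PMI exhibits.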

Page 27: Location prediction

Graph-based Measure

LOCALNESS MEASURE OF ENTITIES

27

The Boston Red Sox, a founding member of the American League of Major League Baseball in 1901 ...

Boston Red Sox

The Boston Red Sox are an American professional baseball team based in Boston, Massachusetts ... They are members of American League (AL).

Boston

American League

Page 28: Location prediction

LOCALNESS MEASURE OF ENTITIES

28

Directed Graph of Local Entities of Boston

Page 29: Location prediction

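• Betweenness Centrality (BC) – measures the importance of a node relative to the rest of the nodes in the graph

• A high BC score of a vertex in a graph indicates that it lies on a considerable fraction of the shortest paths connecting other vertices

• Computed as:

$$C_B(c, e) = \sum_{e_i \neq e \neq e_j} \frac{\sigma_{e_i e_j}(e)}{\sigma_{e_i e_j}}$$

where $\sigma_{e_i e_j}$ is the number of shortest paths from $e_i$ to $e_j$ and $\sigma_{e_i e_j}(e)$ is the number of those paths that pass through $e$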

Graph-based Measure

LOCALNESS MEASURE OF ENTITIES

29
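The betweenness centrality formula can be sketched in plain Python using Brandes' algorithm on a directed, unweighted graph. The choice of algorithm, the unnormalized scores and the four-node toy graph are assumptions for illustration; the slides do not say which implementation or normalization was used.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness on a directed, unweighted graph given as {node: [successors]}."""
    nodes = set(graph) | {w for targets in graph.values() for w in targets}
    score = {v: 0.0 for v in nodes}
    for source in nodes:
        # Breadth-first search that counts shortest paths from source.
        sigma = {v: 0 for v in nodes}    # number of shortest paths source -> v
        dist = {v: -1 for v in nodes}    # -1 marks "not reached yet"
        preds = {v: [] for v in nodes}   # predecessors on shortest paths
        sigma[source], dist[source] = 1, 0
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


# Hypothetical link graph, loosely modelled on the Boston example.
toy_graph = {
    "Boston": ["Boston Red Sox"],
    "Boston Red Sox": ["Fenway Park", "American League"],
    "Fenway Park": [],
    "American League": [],
}
scores = betweenness_centrality(toy_graph)
print(scores)
```

In this toy graph every shortest path from Boston to Fenway Park or to American League runs through Boston Red Sox, so that node receives the only non-zero score.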

Page 30: Location prediction

LOCALNESS MEASURE OF ENTITIES

30

Directed Graph of Local Entities of Boston

Boston Red Sox: 0.004540

American League: 0.000046

Page 31: Location prediction

Alcatraz Island, Treasure Island, Alameda Island, Financial District, Market Street, Fisherman’s Wharf, San Francisco 49ers, Cow Hollow, Silicon Valley, South Beach, …

Suspension Bridge, Hyde Street Pier, Irving Morrow, Angelo Rossi, Art Deco, Charles Alton Ellis, Bethlehem Steel, Half Way to Hell Club, International Orange

San Francisco Bay, Golden Gate, San Francisco Chronicle, U.S. Route 101, Marin County, Sausalito, Bay Area

Semantic Overlap Measure

LOCALNESS MEASURE OF ENTITIES

31

Page 32: Location prediction

• Measures the relatedness between concepts with the intuition that related concepts are connected to similar entities

• Jaccard Index: Overlap between two sets

$$\mathrm{jaccard}(c, e) = \frac{|O(c) \cap O(e)|}{|O(c) \cup O(e)|}$$

Semantic Overlap Measure

LOCALNESS MEASURE OF ENTITIES

32
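As a quick illustration of the Jaccard Index, the sketch below treats O(c) and O(e) as Python sets of the entities linked from the city page and from the entity page. The two sets are hypothetical and far smaller than real Wikipedia link sets.

```python
def jaccard(links_c, links_e):
    """Jaccard Index |O(c) & O(e)| / |O(c) | O(e)| over two sets of linked entities."""
    union = links_c | links_e
    if not union:
        return 0.0  # both pages have no links; treat the overlap as zero
    return len(links_c & links_e) / len(union)


# Hypothetical link sets for a city page and one of its local entities.
city_links = {
    "Golden Gate Bridge", "San Francisco Bay", "Marin County",
    "Alcatraz Island", "Market Street", "Silicon Valley",
}
entity_links = {"San Francisco Bay", "Marin County", "Suspension Bridge", "Art Deco"}

overlap = jaccard(city_links, entity_links)
print(overlap)
```

Here two entities are shared and the union has eight members, so the index is 2 / 8.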

Page 33: Location prediction

• Tversky Index: Asymmetric similarity measure between two sets

$$ti(c, e) = \frac{|O(c) \cap O(e)|}{|O(c) \cap O(e)| + \alpha\,|O(c) - O(e)| + \beta\,|O(e) - O(c)|}$$

• We choose α = 0 and β = 1

• For every entity in the page of a local entity not found in the page of the city, penalize the local entity

Semantic Overlap Measure

LOCALNESS MEASURE OF ENTITIES

33
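With α = 0 and β = 1 the denominator reduces to |O(c) ∩ O(e)| + |O(e) − O(c)| = |O(e)|, so the score is the fraction of the entity's links that also appear on the city page. A minimal sketch, using hypothetical link sets:

```python
def tversky(links_c, links_e, alpha=0.0, beta=1.0):
    """Tversky Index of a city's link set O(c) and an entity's link set O(e)."""
    common = len(links_c & links_e)
    denominator = (
        common
        + alpha * len(links_c - links_e)   # links only on the city page
        + beta * len(links_e - links_c)    # links only on the entity page
    )
    return common / denominator if denominator else 0.0


# Hypothetical link sets: a city, a small neighbourhood-like entity, a broad entity.
city = {"Market Street", "Castro Street", "Golden Gate Bridge", "Alcatraz Island"}
small_entity = {"Market Street", "Castro Street"}
broad_entity = {"Golden Gate Bridge", "Los Angeles", "Sacramento", "Sierra Nevada"}

small_score = tversky(city, small_entity)
broad_score = tversky(city, broad_entity)
print(small_score, broad_score)
```

The small entity is not penalized for having fewer links than the city, while the broad entity loses score for every link that the city page does not share.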

Page 34: Location prediction

KNOWLEDGE-BASE OF LOCAL ENTITIES

Local Entities of San Francisco (Localness measure: Tversky Index)

34

Page 35: Location prediction

KNOWLEDGE-BASE ENABLED APPROACH

[Figure: architecture diagram with the KNOWLEDGE BASE GENERATOR, USER PROFILE GENERATOR and LOCATION PREDICTION components]

35

Page 36: Location prediction

Step 1: Entity Linking

Just drove around Golden Gate Park trying to get in.

CREATION OF USER PROFILE

We use Zemanta for Entity Linking

36

Page 37: Location prediction

CREATION OF USER PROFILE

Step 1: Entity Linking

Just drove around Golden Gate Park trying to get in.

We use Zemanta for Entity Linking

Step 2: Entity Scoring

Entity                     Count
Golden Gate Bridge         4
San Francisco 49ers        2
San Francisco Chronicle    1

User Profile for user $u$ defined as:

$$P_u = \{(e, s) \mid e \in W,\ s \in \mathbb{R}\}$$

37
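The two steps can be sketched as follows. The deck uses Zemanta for entity linking; the dictionary lookup below is only a hypothetical stand-in for such a service, and using the raw mention count as the score s is an assumption based on the Entity/Count table.

```python
from collections import Counter

# Hypothetical surface-form dictionary standing in for an entity linking service.
SURFACE_FORMS = {
    "golden gate park": "Golden Gate Park",
    "cleveland browns": "Cleveland Browns",
    "49ers": "San Francisco 49ers",
}


def link_entities(tweet):
    """Step 1: return the entities whose surface form occurs in the tweet."""
    text = tweet.lower()
    return [entity for form, entity in SURFACE_FORMS.items() if form in text]


def build_profile(tweets):
    """Step 2: score each entity by the number of tweets in which it was found."""
    profile = Counter()
    for tweet in tweets:
        profile.update(link_entities(tweet))
    return dict(profile)


tweets = [
    "Just drove around Golden Gate Park two times trying to get in",
    "Golden Gate Park is packed again",
    "Cleveland Browns confuse me. When I give up on them, they actually show up to play.",
]
profile = build_profile(tweets)
print(profile)
```

The resulting dictionary plays the role of P_u: each key is an entity e and each value is its score s.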

Page 38: Location prediction

KNOWLEDGE-BASE ENABLED APPROACH

[Figure: architecture diagram with the KNOWLEDGE BASE GENERATOR, USER PROFILE GENERATOR and LOCATION PREDICTION components]

38

Page 39: Location prediction

LOCATION PREDICTION

• Compute an aggregate score for each city whose local entities are found in a user’s tweets

$$\mathrm{locScore}(c, u) = \sum_{j=1}^{|I_{cu}|} \mathrm{locl}(c, e_j) \times s_{e_j}$$

where $I_{cu}$ is the set of local entities of city $c$ found in the tweets of user $u$, $e_j \in I_{cu}$, $\mathrm{locl}(c, e_j)$ is the localness score of entity $e_j$ with respect to city $c$, and $s_{e_j}$ is the score of $e_j$ in the user profile

• Rank cities by $\mathrm{locScore}(c, u)$ in descending order to predict the top-k locations of a user

39
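The scoring and ranking step can be sketched as below. The knowledge base layout (city mapped to entity mapped to localness), the function names and all numbers are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def loc_score(city_entities, profile):
    """Sum localness(c, e) * s_e over the city's local entities found in the user profile."""
    return sum(
        localness * profile[entity]
        for entity, localness in city_entities.items()
        if entity in profile
    )


def predict_top_k(knowledge_base, profile, k=3):
    """Rank cities with a non-zero score in descending order and keep the top k."""
    scored = ((city, loc_score(entities, profile)) for city, entities in knowledge_base.items())
    ranked = sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical knowledge base: city -> {local entity: localness score}.
knowledge_base = {
    "San Francisco": {"Nob Hill": 0.5, "Golden Gate Park": 0.25},
    "Detroit": {"Detroit Red Wings": 0.125},
    "Houston": {"Space Center Houston": 0.5},
}
# Hypothetical user profile: entity -> score s (here, a mention count).
profile = {"Nob Hill": 3, "Golden Gate Park": 1, "Detroit Red Wings": 4}

predictions = predict_top_k(knowledge_base, profile, k=3)
print(predictions)
```

San Francisco scores 0.5 × 3 + 0.25 × 1 = 1.75 and Detroit scores 0.125 × 4 = 0.5; Houston has no matching entity and is left out of the ranking.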

Page 40: Location prediction

LOCATION PREDICTION

User Profile

San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1), Fox Oakland Theatre (2), Berkley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1), The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1), Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1), …

Knowledgebase

City             Local Entity                           Localness
San Francisco    Nob Hill                               0.48214
San Francisco    SF Weekly                              0.1875
San Francisco    Golden Gate Park                       0.16783
San Francisco    San Francisco International Airport    0.06818
Oakland, CA      Fox Oakland Theatre                    0.09375
Oakland, CA      SF Bay Area                            0.12972
Oakland, CA      Green Day                              0.02066
Detroit, MI      Detroit Historical Museum              0.4838
Detroit, MI      General Motors                         0.05538
Detroit, MI      Detroit Red Wings                      0.0232
Palo Alto, CA    PARC (company)                         0.03726
Palo Alto, CA    Google                                 0.04678
Palo Alto, CA    Facebook                               0.05810

40

Page 41: Location prediction

LOCATION PREDICTION

User Profile entities grouped by city, with the resulting location score:

• San Francisco (14.5531): San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1)

• Oakland, CA (10.7584): Fox Oakland Theatre (2), Berkley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1)

• Detroit, MI (8.0600): The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1)

• Palo Alto, CA (6.9175): Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1)

41

Page 42: Location prediction

IMPLEMENTATION

Knowledge base

• All cities of the United States with population > 5000 as published in the census estimates of 2012

• 4,661 cities and 500,714 local entities

Baseline

• Considers all local entities to be equally local to the city

• Location prediction based only on frequency of entities

42
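A frequency-only baseline of this kind might look like the sketch below, where every local entity counts equally and a city's score is the total number of mentions of its local entities. The function name, the knowledge base and the profile are hypothetical.

```python
def baseline_rank(knowledge_base, profile):
    """Rank cities by the total mention count of their local entities (no localness weights)."""
    scores = {}
    for city, entities in knowledge_base.items():
        total = sum(count for entity, count in profile.items() if entity in entities)
        if total > 0:
            scores[city] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical knowledge base: city -> set of local entities (unweighted).
unweighted_kb = {
    "San Francisco": {"San Francisco Bay Area", "Nob Hill"},
    "Oakland": {"San Francisco Bay Area", "Fox Oakland Theatre"},
}
# Hypothetical user profile: entity -> mention count.
mentions = {"San Francisco Bay Area": 5, "Nob Hill": 1, "Fox Oakland Theatre": 2}

baseline = baseline_rank(unweighted_kb, mentions)
print(baseline)
```

Because the shared entity "San Francisco Bay Area" counts fully for both cities, the ranking is decided by raw frequency alone, which is the behaviour the localness measures are meant to improve on.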

Page 43: Location prediction

• Published by Cheng, Caverlee, and Lee, 2010.

• Contains 5119 active users from the continental United States with approximately 1000 tweets per user.

• User’s location listed in the form of latitude and longitude.

Test Dataset

EVALUATION

43

Page 44: Location prediction

• Error Distance

$$\mathrm{ErrorDist}(u) = \mathrm{distance}(loc_{act}(u),\ loc_{est}(u))$$

Distance between the actual location of the user and the estimated location

• Average Error Distance

$$\mathrm{AED}(U) = \frac{\sum_{u \in U} \mathrm{ErrorDist}(u)}{|U|}$$

Average of the error distance of all users in the test dataset

• Accuracy

$$\mathrm{ACC}(U) = \frac{|\{u \mid u \in U \land \mathrm{ErrorDist}(u) \le 100\}|}{|U|}$$

Percentage of users predicted within 100 miles of their actual location

Evaluation Metrics

EVALUATION

44
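The three metrics can be sketched as below. The slides do not specify the distance function, so the great-circle (haversine) distance and the Earth radius constant are assumptions; the coordinates are approximate city centres used only as sample data.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def error_distance(actual, estimated):
    """Great-circle distance in miles between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*actual, *estimated))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def average_error_distance(pairs):
    """AED: mean error distance over (actual, estimated) location pairs."""
    return sum(error_distance(actual, estimated) for actual, estimated in pairs) / len(pairs)


def accuracy(pairs, threshold_miles=100.0):
    """ACC: percentage of users whose error distance is within the threshold."""
    hits = sum(1 for actual, estimated in pairs if error_distance(actual, estimated) <= threshold_miles)
    return 100.0 * hits / len(pairs)


# Hypothetical users: (actual location, estimated location), approximate coordinates.
pairs = [
    ((37.7749, -122.4194), (37.8044, -122.2712)),  # San Francisco predicted as Oakland
    ((39.7589, -84.1916), (42.3314, -83.0458)),    # Dayton predicted as Detroit
]
print(average_error_distance(pairs), accuracy(pairs))
```

The first prediction lands within a few miles and counts as accurate; the second is well beyond 100 miles, so the accuracy over these two users is 50%.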

Page 45: Location prediction

Location Prediction Results

EVALUATION

Localness Measure    ACC (%)    AED (in miles)    ACC@2    ACC@3    ACC@5
Baseline             25.21      632.56            38.01    42.78    47.95
PMI                  38.48      599.40            49.85    56.06    64.15
BC                   47.91      478.14            57.39    62.18    66.98
Jaccard Index        53.21      433.62            67.41    73.56    78.84
Tversky Index        54.48      429.00            68.72    74.68    79.99

45

Page 46: Location prediction

EVALUATION


• PMI is not normalized, hence it is sensitive to the count of occurrences of local entities in the Wikipedia corpus
  • E.g. PMI of local entities of Glen Rock, New Jersey is higher than those of San Francisco

46

Page 47: Location prediction

EVALUATION


• BC does a good job of assigning low scores to common entities
  • E.g. community college, National Weather Service, start up company etc.

• Fails for entities with some relevance to the city but no distinguishing factor
  • E.g. IBM with respect to Endicott, New York

47

Page 48: Location prediction

LOCALNESS MEASURE OF ENTITIES

48

Page 49: Location prediction

EVALUATION


• Jaccard Index underperforms for local entities with fewer entities than the city
  • E.g. Eureka Valley and California with respect to San Francisco.

49

Page 50: Location prediction

EVALUATION

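[Figure: overlap between the local entities of San Francisco and those of California, and between San Francisco and those of Eureka Valley; the overlap values shown are 0.03005 and 0.07092]

50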

Page 51: Location prediction

EVALUATION


• Tversky Index is the best performing localness measure

• Overcomes the disadvantage of Jaccard Index
  • For example: we are able to assign higher localness to Eureka Valley (0.7096) than California (0.1270) with respect to San Francisco

51

Page 52: Location prediction

Top-k Accuracy

EVALUATION

52

Page 53: Location prediction

Top-k Average Error Distance

EVALUATION

53

Page 54: Location prediction

Distribution of all users in the dataset

Distribution of accurately predicted users

Distribution of users

54

Page 55: Location prediction

Comparison with Existing Approaches

EVALUATION

Method                                ACC (%)    AED (in miles)
Cheng, Caverlee, and Lee, 2010        51.00      535.56
Chang, Lee, Eltaher, and Lee, 2012    49.9       509.3
Wikipedia based Approach              54.48      429.00

55

Page 56: Location prediction

Impact of Local Entities

EVALUATION

56

Page 57: Location prediction

Top 100 Cities

EVALUATION

• 2172 users from the dataset are from the top-100 most populated cities of the United States

• 60% of users predicted within 100 miles of their actual location

• 54% of users predicted exactly at the city level

57

Page 58: Location prediction

CONCLUSION

• Presented a crowd-sourced, knowledge-based approach that does not require geo-tagged tweets as a training dataset to predict the location of a user

• Introduced the concept of Local Entities and preprocessed the Wikipedia Hyperlink Graph to extract local entities for each city

• Investigated relatedness measures to establish the degree of association between a local entity and a city

• Evaluated the proposed approach against a benchmark dataset published by Cheng et al. For 5119 users, we are able to predict the location of 55% of users within 100 miles with an average error distance of 429 miles

58

Page 59: Location prediction

FUTURE WORK

• Compute the confidence score of the prediction based on top-k cities and count of local entities in tweets

• Investigate other localness measures for scoring local entities

• Consider semantic types and categories of local entities, and weight the contribution based on types

• Explore other knowledge bases such as Wikitravel and GeoNames

59

Page 60: Location prediction

ACKNOWLEDGEMENTS

THANK YOU!

Amit P. Sheth Krishnaprasad Thirunarayan

Derek Doran

60