Geographic knowledge discovery (PhD Theme) by Roberto Zagal




Page 1: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Geographical Knowledge Discovery applied to the Social Perception of Pollution in Mexico City

Roberto Zagal, Instituto Politecnico Nacional, ESCOM-IPN
Felix Mata, Instituto Politecnico Nacional, UPIITA-IPN
Christophe Claramunt, Naval Academy Research Institute

Page 2: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Introduction (1)

• Traditionally, pollution data has been produced by institutions, government and vendors
• But now… pollution data is produced by individuals, too

Page 3: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Introduction (2)

Information about pollution is expressed in different ways by:

• Government
• News media
• People in social networks

Page 4: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Introduction (3)

But… what about the certainty of this information?

Page 5: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Introduction (4)

What about... inconsistency?

Id  Type               Description
1   Tweet, Newspaper1  The index of IMECAS is 135 #CDMX
2   Tweet, Newspaper2  @ the #contamination of air is 127 IMECAS #CDMX #bad #news

Page 6: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Related work

• The social data problem has been addressed through:
1. KDD and social mining
2. Formal publications (news media) that guide the classification of the interests of social media users [1]
3. Opinion mining and topic modeling [2]
• But not using GKD with an approach of crossing data layers

Page 7: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Goal

Know how to:

Discover the certainty level of information by crossing geographic and social information

Page 8: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Solution proposed:

GKD Framework for Air Pollution Data

Phase 1

Phase 2

Phase 3

Page 9: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Data extraction: Sample tweet (Phase 1)

Id  Type               Description
1   Tweet, Newspaper1  The index of IMECAS is 135 #CDMX
2   Tweet, Newspaper2  @ the #contamination of air is 127 IMECAS #CDMX #bad #news

We consider tweets from accounts that periodically report air pollution data
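As a rough illustration of this extraction step, the sketch below pulls the reported IMECA value and the hashtags out of a tweet's text. The slides do not describe how extraction is implemented, so the regular expressions and the function name `extract_reading` are assumptions made for this example only.

```python
import re

# Assumed patterns (not from the slides): a 1-3 digit number either
# following or preceding the word IMECA/IMECAS.
IMECA_PATTERN = re.compile(
    r"(?:IMECAS?\D{0,20}?(\d{1,3})|(\d{1,3})\s*IMECAS?)", re.IGNORECASE
)
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_reading(text):
    """Return (IMECA value or None, list of hashtags) for one tweet."""
    match = IMECA_PATTERN.search(text)
    value = int(match.group(1) or match.group(2)) if match else None
    hashtags = HASHTAG_PATTERN.findall(text)
    return value, hashtags


# The two sample tweets from the slide.
print(extract_reading("The index of IMECAS is 135 #CDMX"))
print(extract_reading("@ the #contamination of air is 127 IMECAS #CDMX #bad #news"))
```

Both word orders in the sample tweets ("IMECAS is 135" and "127 IMECAS") are covered; a real collector would need more patterns than these two.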

Page 10: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Data extraction: Domain Detection (Phase 1)

Id  Type               Description
2   Tweet, Newspaper2  @ #contamination air is 127 IMECAS #CDMX #bad #news

The post is related to a pollution topic
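One simple way to picture domain detection is a keyword check. This is only a sketch: the slides do not say which method is used, and the term list `POLLUTION_TERMS` and the function `is_pollution_post` are hypothetical names built from words that appear in the sample tweets.

```python
import re

# Hypothetical keyword list, taken from terms seen in the sample tweets.
POLLUTION_TERMS = {"imeca", "imecas", "contamination", "contaminación", "pollution"}


def is_pollution_post(text):
    """Return True if any token of the post is a known pollution term."""
    tokens = re.findall(r"\w+", text.lower())
    return any(token in POLLUTION_TERMS for token in tokens)


print(is_pollution_post("@ #contamination air is 127 IMECAS #CDMX #bad #news"))
print(is_pollution_post("Traffic is slow this morning"))
```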

Page 11: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Preprocessing (Phase 2)

• Emotion detection [3]
• Location extraction

Id  Type               Description
2   Tweet, Newspaper2  @ #contamination air is 127 IMECAS #CDMX #bad #news
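The two preprocessing steps can be sketched with lookup tables applied to the hashtags. This is an assumption-heavy toy: the emotion detection in [3] uses a semantic approach that is not reproduced here, and `EMOTION_LEXICON`, `GAZETTEER` and `preprocess` are hypothetical names for this illustration.

```python
import re

# Hypothetical lookup tables; a real system would use far richer resources.
EMOTION_LEXICON = {"bad": "negative", "good": "positive"}
GAZETTEER = {"cdmx": "Mexico City", "taxqueña": "Taxqueña"}


def preprocess(text):
    """Tag a tweet with a coarse emotion and any locations named in hashtags."""
    hashtags = [tag.lower() for tag in re.findall(r"#(\w+)", text)]
    emotions = [EMOTION_LEXICON[t] for t in hashtags if t in EMOTION_LEXICON]
    locations = [GAZETTEER[t] for t in hashtags if t in GAZETTEER]
    return {
        "emotion": emotions[0] if emotions else "neutral",
        "locations": locations,
    }


print(preprocess("@ #contamination air is 127 IMECAS #CDMX #bad #news"))
```

Note that #CDMX only yields a city-level location, so a finer location is still missing at this stage.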

Page 12: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Classification C5 algorithm (Phase 3)

• If we detect which category each set of data belongs to:
• Health and Pollution, Transport and Pollution

Then we can select which data sources should be crossed with the tweet in order to discover knowledge

Id  Description                                           Category
2   @ #contamination air is 127 IMECAS #CDMX #bad #news   Health and pollution
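Assuming "C5" refers to the C5.0 decision tree learner, the sketch below shows the entropy-based idea behind how such trees choose a split. It is not C5.0 itself: it uses plain information gain on a hypothetical four-post training set with hand-assigned categories, and the names `entropy`, `information_gain` and `best_split` are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on presence of a keyword."""
    remainder = 0.0
    for present in (True, False):
        subset = [lab for row, lab in zip(rows, labels) if (feature in row) == present]
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Keyword with the highest information gain."""
    features = sorted(set().union(*rows))
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: keyword sets per post and their categories.
rows = [
    {"imecas", "bad", "asthma"},
    {"imecas", "hospital"},
    {"imecas", "traffic", "bus"},
    {"traffic", "metro"},
]
labels = [
    "Health and pollution",
    "Health and pollution",
    "Transport and pollution",
    "Transport and pollution",
]

print(best_split(rows, labels))
```

On this toy data the keyword "traffic" separates the two categories perfectly, so it is chosen as the first split.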

Page 13: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Crossing data (Phase 4)

• Example 1:
• Are tweets 1 and 2 inconsistent?

Id  Type               Description
1   Tweet, Newspaper1  The index of IMECAS is 135 #CDMX
2   Tweet, Newspaper2  @ the #contamination of air is 127 IMECAS #CDMX

Which one is correct?

Page 14: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Crossing data (Phase 4)

How do we know which tweet is correct? Answer:

The tweet was classified in the domain Health and pollution (in Phase 3). The official data from health reports and pollution reports are therefore selected to be crossed with the tweet (in Phase 4).

28/10/16

Page 15: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Crossing data (Phase 4)

• Data are crossed on different attributes; the date and hour of publication are taken from the tweet

• When these are crossed with the date and hour from official air quality reports, a match is found
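The crossing step can be pictured as a join on date, hour and reported value. In this sketch the station names, hours and IMECA values mirror the deck's own example tweets, while the date, the `OFFICIAL_REPORTS` structure and the function `cross_with_official` are placeholders assumed for illustration; the actual format of the official reports is not given in the slides.

```python
from datetime import datetime

# Hypothetical official records: (station, report time, IMECA value).
# The date is a placeholder; only the hours and values follow the deck.
OFFICIAL_REPORTS = [
    ("Taxqueña", datetime(2016, 1, 1, 10), 135),
    ("Indios Verdes", datetime(2016, 1, 1, 15), 127),
]


def cross_with_official(tweet_time, tweet_value, reports=OFFICIAL_REPORTS):
    """Return stations whose report matches the tweet's date, hour and value."""
    hour = tweet_time.replace(minute=0, second=0, microsecond=0)
    return [
        station
        for station, when, value in reports
        if when == hour and value == tweet_value
    ]


print(cross_with_official(datetime(2016, 1, 1, 10, 5), 135))
print(cross_with_official(datetime(2016, 1, 1, 15, 40), 127))
```

A match both confirms the reported value and supplies the location that the tweet itself did not state.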


Page 16: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Crossing data (Phase 4)

We discovered that both tweets are correct but refer to different locations (the location is not included in the original tweet)

Id  Type               Description
1   Tweet, Newspaper1  The index of IMECAS is in 135 #CDMX #Taxqueña 10:00 hours
2   Tweet, Newspaper2  The #contaminación of air is in 127 IMECAS #CDMX #Indios Verdes 15:00 hours

Knowledge Discovered!

Page 17: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Other preliminary results

• Following the same approach

• Knowledge discovered: which topics are discussed in each region

Topic                 Geographic               Period
Health                South, West              March-June
Transport             North, East              January-December
Policy and programs   Center                   January-December
Pollution             Surrounding Mexico City  January-June
Public roads          Surrounding Mexico City  January-December
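A summary of this kind could be produced by counting classified posts per region. The sketch below assumes each post has already been assigned a region and a topic by the earlier phases; the sample `posts` list and the function `topics_by_region` are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict


def topics_by_region(posts):
    """Map each region to its most frequent topic."""
    counts = defaultdict(Counter)
    for region, topic in posts:
        counts[region][topic] += 1
    return {region: c.most_common(1)[0][0] for region, c in counts.items()}


# Hypothetical (region, topic) pairs for already classified posts.
posts = [
    ("South", "Health"),
    ("South", "Health"),
    ("South", "Transport"),
    ("North", "Transport"),
]

print(topics_by_region(posts))
```

Adding the month of publication as a third grouping key would give the period column in the same way.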

Page 18: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Conclusions and future work

• Integrating the geographical and temporal dimensions allows us to discover data correlations; this knowledge can increase the certainty of some information in social networks.

• The main contribution is the domain discovery and classification of information, a key element of new approaches for discovering geographic information.

Page 19: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Conclusions and future work

• Future work

• Use of clustering or deep learning approaches to improve the classification process

• Location detection is a hard problem. Other machine learning methods for social media could be tested [4, 5]

• How can we improve geographic knowledge discovery when there are no explicit links between traditional data sources and social sources?

Page 20: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

Many Thanks!

Questions?

Roberto Zagal [email protected]

IPN, México


Page 21: Geographic knowledge discovery (PhD Theme) by Roberto Zagal

References

[1] Jonghyun Han, Hyunju Lee, Characterizing the interests of social media users: Refinement of a topic model for incorporating heterogeneous media, Information Sciences, Volumes 358–359, 1 September 2016, Pages 112-128, ISSN 0020-0255.

[2] Schubert, E., Weiler, M., & Kriegel, H. P. (2014, August). Signitrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 871-880). ACM.

[3] Carlos Acevedo Miranda, Ricardo Clorio Rodriguez, Roberto Zagal Flores, and Consuelo V. Garcia Mendoza. Web architecture for analysis of feelings in Facebook with semantic approach (in Spanish). Research in Computing Science 75 (2014), pp. 59–69. http://www.rcs.cic.ipn.mx/rcs/2014_75/

[4] Ting Hua, Liang Zhao, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. 2016. How events unfold: spatiotemporal mining in social media. SIGSPATIAL Special 7, 3 (January 2016), 19-25. DOI=http://dx.doi.org/10.1145/2876480.2876485

[5] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860. ACM, 2010.
