geographic knowledge discovery (phd theme) by roberto zagal

Geographical Knowledge Discovery applied to the Social Perception of Pollution in Mexico City

Roberto Zagal, Instituto Politecnico Nacional, ESCOM-IPN

Felix Mata, Instituto Politecnico Nacional, UPIITA-IPN

Christophe Claramunt, Naval Academy Research Institute


Introduction (1)

• Traditionally, pollution data has been produced by institutions, government and vendors

• But now… pollution data is produced by people, too


Information about pollution is expressed in different ways by:

• Government

• News media

• People in social networks


Introduction (2)

Introduction (3)

But… what about the certainty of this information?

Introduction (4) What about ... inconsistency?

Id | Type | Description
1 | Tweet, Newspaper1 | The index of IMECAS is 135 #CDMX
2 | Tweet, Newspaper2 | @ the #contamination of air is 127 IMECAS #CDMX #bad #new

Related work

• The social data problem has been addressed through:

1. KDD and social mining

2. Formal publications (news media) that guide the classification of the interests of social media users [1]

3. Opinion mining and topic modeling [2]

But not by using GKD with an approach of crossing data layers


Goal

Know how to:

Discover the certainty level of information by crossing geographic and social information


Solution proposed:

GKD Framework for Air Pollution Data

Phase 1

Phase 2

Phase 3

Data extraction: Sample tweet (Phase 1)



Id | Type | Description
1 | Tweet, Newspaper1 | The index of IMECAS is 135 #CDMX
2 | Tweet, Newspaper2 | @ the #contamination of air is 127 IMECAS #CDMX #bad #news

We consider tweets from accounts that periodically report air pollution data

Data extraction: Domain Detection (Phase 1)



Newspaper2: @ #contamination air is 127 IMECAS #CDMX #bad #new

The post is related to a pollution topic
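
A minimal sketch of how such a domain check could work, assuming a plain keyword and hashtag lookup. The vocabulary `POLLUTION_TERMS` and the function `is_pollution_related` are hypothetical names introduced here for illustration; the slides do not say which technique or terms were actually used.

```python
import re

# Hypothetical vocabulary (assumption): the slides do not list the real terms.
POLLUTION_TERMS = {"imecas", "imeca", "contamination", "pollution", "ozone", "smog"}


def tokens(text: str) -> set:
    """Lowercase the text and return its words; '#', '@' and digits are dropped."""
    return set(re.findall(r"[a-záéíóúñ]+", text.lower()))


def is_pollution_related(text: str) -> bool:
    """Return True when the post shares at least one word with the vocabulary."""
    return bool(tokens(text) & POLLUTION_TERMS)


tweet = "@ #contamination air is 127 IMECAS #CDMX #bad #new"
print(is_pollution_related(tweet))
```

A lookup like this is only a first filter; posts that pass it would still go through the later phases.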

Preprocessing (Phase 2)

• Emotion detection [3]

• Location extraction



Newspaper2: @ #contamination air is 127 IMECAS #CDMX #bad #new
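
As an illustration of the preprocessing step, the sketch below pulls the reported value, the hashtags and a coarse location out of the sample tweet with regular expressions. The function names and the one-entry gazetteer `KNOWN_PLACES` are assumptions made for this example, not part of the framework described in the slides; emotion detection [3] is not covered.

```python
import re
from typing import List, Optional

# Hypothetical gazetteer (assumption): maps a lowercase hashtag to a place name.
KNOWN_PLACES = {"cdmx": "Mexico City"}


def extract_imeca(text: str) -> Optional[int]:
    """Return the first standalone integer, taken here as the IMECA reading."""
    match = re.search(r"(?<![#@\w])\d+(?!\w)", text)
    return int(match.group()) if match else None


def extract_hashtags(text: str) -> List[str]:
    """Return the hashtags in order of appearance, without the leading '#'."""
    return re.findall(r"#(\w+)", text)


def extract_location(text: str) -> Optional[str]:
    """Return the first hashtag found in the gazetteer, or None."""
    for tag in extract_hashtags(text):
        place = KNOWN_PLACES.get(tag.lower())
        if place is not None:
            return place
    return None


tweet = "@ #contamination air is 127 IMECAS #CDMX #bad #new"
print(extract_imeca(tweet), extract_hashtags(tweet), extract_location(tweet))
```

In this sketch the tweet itself only yields a city-level location; finer locations have to come from another data source.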

• If we detect which category each set of data belongs to:

• Health and Pollution, Transport and Pollution

Then we can select which data sources should be crossed with the tweet in order to discover knowledge


Classification C5 algorithm (Phase 3)

Id | Description | Category
2 | @ #contamination air is 127 IMECAS #CDMX #bad #new | Health and pollution
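
C5 belongs to the family of decision-tree learners that pick splits with entropy-based measures. The sketch below is not C5 itself: it only computes the information gain of a single keyword test on a tiny labeled set, to show the kind of criterion such a tree uses to choose a split. The four training examples in `SAMPLES` and the function names are hypothetical and were written for this illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(samples, keyword):
    """Entropy reduction obtained by splitting on 'keyword occurs in the text'.

    samples is a list of (text, label) pairs; the test is a lowercase
    substring match, which is a simplification.
    """
    labels = [label for _, label in samples]
    with_kw = [label for text, label in samples if keyword in text.lower()]
    without_kw = [label for text, label in samples if keyword not in text.lower()]
    remainder = sum(
        len(part) / len(samples) * entropy(part)
        for part in (with_kw, without_kw)
        if part
    )
    return entropy(labels) - remainder


# Hypothetical labeled posts (assumption): invented for this example only.
SAMPLES = [
    ("#contamination air is 127 IMECAS #CDMX", "Health and pollution"),
    ("IMECAS index is 135, avoid outdoor exercise", "Health and pollution"),
    ("No-drive day applies because of air pollution", "Transport and pollution"),
    ("Heavy traffic increases pollution on the ring road", "Transport and pollution"),
]

for kw in ("imecas", "air"):
    print(kw, information_gain(SAMPLES, kw))
```

On this toy set the keyword "imecas" separates the two categories completely, while "air" appears once in each category and so carries no information about the class.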

Crossing data (Phase 4)

• Example 1:

• Inconsistencies in tweet 1 and 2?



Id | Type | Description
1 | Tweet, Newspaper1 | The index of IMECAS is 135 #CDMX
2 | Tweet, Newspaper2 | @ the #contamination of air is 127 IMECAS #CDMX

Which one is correct?

How can we know which tweet is correct? Answer:

The tweet was classified in the domain Health and pollution (in Phase 3). The official data from health reports and pollution reports are then selected to be crossed with the tweet (in Phase 4).

28/10/16



• Data are crossed on several attributes; the date and hour of publication are taken from the tweet

• When these are crossed with the date and hour of the official air quality reports, a match is found
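
The crossing step described above can be sketched as a lookup of the tweet's publication date and hour in a list of official readings. Everything below is an assumption made for illustration: the `Report` structure, the `cross` function, the date, the third reading, and the extra check on the IMECA value (added here so that two stations reporting in the same hour can be told apart).

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Report(NamedTuple):
    station: str
    timestamp: datetime  # date and hour of the official reading
    imeca: int


# Hypothetical official readings (assumption): the date and the third row are
# invented; only the station, hour and value pairings echo the example tweets.
OFFICIAL_REPORTS = [
    Report("Taxqueña", datetime(2016, 10, 27, 10), 135),
    Report("Indios Verdes", datetime(2016, 10, 27, 15), 127),
    Report("Taxqueña", datetime(2016, 10, 27, 15), 98),
]


def cross(tweet_time: datetime, tweet_imeca: int) -> Optional[Report]:
    """Return the official report with the same date, hour and IMECA value."""
    hour = tweet_time.replace(minute=0, second=0, microsecond=0)
    for report in OFFICIAL_REPORTS:
        if report.timestamp == hour and report.imeca == tweet_imeca:
            return report
    return None


match = cross(datetime(2016, 10, 27, 10, 5), 135)
print(match.station if match else "no match")
```

A match both confirms the value reported in the tweet and supplies the station, that is, the location missing from the tweet text.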


We discovered that both tweets are correct but refer to different locations (the location is not included in the original tweets)


Id | Type | Description | Location | Time
1 | Tweet, Newspaper1 | The index of IMECAS is 135 #CDMX | #Taxqueña | 10:00 hours
2 | Tweet, Newspaper2 | The #contamination of air is 127 IMECAS #CDMX | #Indios Verdes | 15:00 hours

Knowledge Discovered!


Other preliminary results

• Following the same approach

• Knowledge discovered: which topics are discussed in each region


Topic | Geographic region | Period
Health | South, West | March-June
Transport | North, East | January-December
Policy and programs | Center | January-December
Pollution | Surrounding Mexico City | January-June
Public roads | Surrounding Mexico City | January-December

Conclusions and future work

• Integrating the geographical and temporal dimensions allows us to discover data correlations; this knowledge can increase the certainty of some information in social networks.

• The main contribution is the domain discovery and classification of information, a key element of new approaches for discovering geographic information.


Conclusions and future work

• Future work

• Use clustering or deep learning approaches to improve the classification process

• Location detection is a hard problem. Other machine learning methods for social media could be tested [4, 5]

• How can we improve geographic knowledge discovery when there are no explicit links between traditional data sources and social sources?


Many Thanks!

Questions?

Roberto Zagal [email protected]

IPN, México


References

[1] Jonghyun Han, Hyunju Lee, Characterizing the interests of social media users: Refinement of a topic model for incorporating heterogeneous media, Information Sciences, Volumes 358–359, 1 September 2016, Pages 112-128, ISSN 0020-0255.

[2] Schubert, E., Weiler, M., & Kriegel, H. P. (2014, August). Signitrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 871-880). ACM.

[3] Carlos Acevedo Miranda, Ricardo Clorio Rodriguez, Roberto Zagal Flores, and Consuelo V. Garcia Mendoza. Web architecture for analysis of feelings in Facebook with semantic approach (in Spanish). Research in Computing Science 75 (2014), pp. 59–69. http://www.rcs.cic.ipn.mx/rcs/2014_75/

[4] Ting Hua, Liang Zhao, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. 2016. How events unfold: spatiotemporal mining in social media. SIGSPATIAL Special 7, 3 (January 2016), 19-25. DOI=http://dx.doi.org/10.1145/2876480.2876485

[5] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860. ACM, 2010.
