geographic knowledge discovery (phd theme) by roberto zagal
Geographical Knowledge Discovery applied to the Social Perception of Pollution in Mexico City
Roberto Zagal, Instituto Politecnico Nacional, ESCOM-IPN
Felix Mata, Instituto Politecnico Nacional, UPIITA-IPN
Christophe Claramunt, Naval Academy Research Institute
Introduction (1)
• Traditionally, pollution data has been produced by institutions, government and vendors
• But now pollution data is produced by individuals, too
Introduction (2)
Information about pollution is expressed in different ways by:
• Government
• News media
• People in social networks
Introduction (3)
But what about the certainty of this information?
Introduction (4)
What about ... inconsistency?
Id  Type              Description
1   Tweet Newspaper1  The index of IMECAS is 135 #CDMX
2   Tweet Newspaper2  @ the #contamination of air is 127 IMECAS #CDMX #bad #new
Related work
• The social data problem has been addressed by:
1. KDD and social mining
2. Formal publications (news media) guiding the classification of the interests of social media users [1]
3. Opinion mining and topic modeling [2]
• But not by using GKD with an approach of crossing data layers
Goal
Know how to:
Discover the certainty level of information by crossing geographic and social information
Solution proposed:
GKD Framework for Air Pollution Data
Phase 1
Phase 2
Phase 3
Data extraction: Sample tweet (Phase 1)
Id  Type              Description
1   Tweet Newspaper1  The index of IMECAS is 135 #CDMX
2   Tweet Newspaper2  @ the #contamination of air is 127 IMECAS #CDMX #bad #news
We consider tweets from accounts that periodically report air pollution data
Data extraction: Domain Detection (Phase 1)
Id  Type              Description
2   Tweet Newspaper2  @ #contamination air is 127 IMECAS #CDMX #bad #new
The post is related to a pollution topic
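The slides do not say how the pollution domain is detected. As a minimal sketch, assuming a simple keyword match, the check could look like the code below. The keyword list and the function names are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical keyword list; the slides do not specify the actual vocabulary.
POLLUTION_TERMS = {
    "contamination", "contaminación", "pollution",
    "imeca", "imecas", "ozone", "smog",
}

def tokenize(text):
    """Lowercase the text and return its word tokens ('#' and '@' are dropped)."""
    return re.findall(r"\w+", text.lower())

def is_pollution_related(text):
    """Return True if any token of the post is a known pollution keyword."""
    return any(token in POLLUTION_TERMS for token in tokenize(text))

post = "@ #contamination air is 127 IMECAS #CDMX #bad #new"
print(is_pollution_related(post))
```

A keyword match is only a baseline. It would miss posts that describe pollution without using any of the listed terms.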
Preprocessing (Phase 2)
• Emotion detection [3]
• Location extraction
Id  Type              Description
2   Tweet Newspaper2  @ #contamination air is 127 IMECAS #CDMX #bad #new
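As an illustration of the preprocessing step, the sketch below pulls the reported IMECA value and candidate locations out of a tweet. It assumes that locations appear as hashtags and are resolved with a small gazetteer; the gazetteer, the regular expression and the function names are hypothetical. Emotion detection [3] is not covered here.

```python
import re

# Hypothetical gazetteer that maps lowercased hashtags to place names.
GAZETTEER = {"cdmx": "Mexico City", "taxqueña": "Taxqueña"}

def extract_hashtags(text):
    """Return the hashtags of a post, lowercased and without the '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]

def extract_locations(text):
    """Return the place names of all hashtags found in the gazetteer."""
    return [GAZETTEER[tag] for tag in extract_hashtags(text) if tag in GAZETTEER]

def extract_imeca(text):
    """Return the integer written next to the word IMECA(S), or None."""
    match = re.search(r"(\d+)\s*IMECAS?|IMECAS?\D*?(\d+)", text, flags=re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1) or match.group(2))

post = "@ #contamination air is 127 IMECAS #CDMX #bad #new"
print(extract_imeca(post), extract_locations(post))  # prints: 127 ['Mexico City']
```

In this sample the only location hashtag is #CDMX, which is too coarse to identify a monitoring station. This is why the later crossing step is needed.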
• If we detect which category each set of data belongs to:
• Health and Pollution, Transport and Pollution
Then we can select which data sources should be crossed with the tweet in order to discover knowledge.
Classification: C5 algorithm (Phase 3)
Id  Description                                         Category
2   @ #contamination air is 127 IMECAS #CDMX #bad #new  Health and pollution
Crossing data (Phase 4)
• Example 1:
• Inconsistencies in tweets 1 and 2?
Id  Type              Description
1   Tweet Newspaper1  The index of IMECAS is 135 #CDMX
2   Tweet Newspaper2  @ the #contamination of air is 127 IMECAS #CDMX
Which one is correct?
How do we know which tweet is correct? Answer:
The tweet was classified in the domain Health and pollution (in Phase 3). Then the official data from health reports and pollution reports are selected to be crossed with the tweet (in Phase 4).
28/10/16
Crossing data (Phase 4)
• Data are crossed considering different attributes: the date and hour of publication are taken from the tweet
• When these are crossed with the date and hour from the official air quality reports, a match is found
We discovered that both tweets are correct but refer to different locations (the location is not included in the original tweet)
1   Tweet Newspaper1  The index of IMECAS is in 135 #CDMX #Taxqueña 10:00 hours
2   Tweet Newspaper2  The #contaminación of air is in 127 IMECAS #CDMX #Indios Verdes 15:00 hours
Knowledge Discovered!
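The crossing step described above can be sketched as a join between each tweet and the official reports on publication hour and IMECA value. This is a minimal sketch, not the framework's actual implementation: the record schema, the calendar date, the publication minutes and the third report row are invented for illustration. Only the two station, hour and value pairs come from the example.

```python
from datetime import datetime

# Hypothetical official air-quality records. The first two rows mirror the
# example in the slides; the date, the third row and the schema are invented.
OFFICIAL_REPORTS = [
    {"station": "Taxqueña", "time": datetime(2016, 1, 15, 10), "imeca": 135},
    {"station": "Indios Verdes", "time": datetime(2016, 1, 15, 15), "imeca": 127},
    {"station": "Indios Verdes", "time": datetime(2016, 1, 15, 10), "imeca": 98},
]

def cross_with_official(tweet, reports):
    """Return the stations whose official reading matches the tweet's hour and value."""
    # Truncate the publication time to the hour (assumes hourly official reports).
    hour = tweet["time"].replace(minute=0, second=0, microsecond=0)
    return [
        report["station"]
        for report in reports
        if report["time"] == hour and report["imeca"] == tweet["imeca"]
    ]

# Tweets after preprocessing: publication time plus the extracted IMECA value.
tweet_1 = {"id": 1, "time": datetime(2016, 1, 15, 10, 4), "imeca": 135}
tweet_2 = {"id": 2, "time": datetime(2016, 1, 15, 15, 20), "imeca": 127}

for tweet in (tweet_1, tweet_2):
    print(tweet["id"], cross_with_official(tweet, OFFICIAL_REPORTS))
# prints:
# 1 ['Taxqueña']
# 2 ['Indios Verdes']
```

An empty result would mean that no official reading supports the tweet, which is one possible way to flag low certainty.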
Other preliminary results
• Following the same approach
• Knowledge discovered: which topics are discussed in each region
Topic                Geographic               Period
Health               South, West              March-June
Transport            North, East              January-December
Policy and programs  Center                   January-December
Pollution            Surrounding Mexico City  January-June
Public roads         Surrounding Mexico City  January-December
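A summary like the table above could be produced by grouping classified and geolocated posts by topic and region. The sketch below assumes each post has already been reduced to a (topic, region, month) record. The sample records are invented toy data, not the study's results.

```python
from collections import defaultdict

def topics_by_region(records):
    """Group (topic, region, month) records into {(topic, region): sorted months}."""
    grouped = defaultdict(set)
    for topic, region, month in records:
        grouped[(topic, region)].add(month)
    return {key: sorted(months) for key, months in grouped.items()}

# Invented toy records for illustration only (month given as a number, 1 to 12).
records = [
    ("Health", "South", 3),
    ("Health", "South", 6),
    ("Health", "West", 4),
    ("Transport", "North", 1),
]

summary = topics_by_region(records)
print(summary[("Health", "South")])  # prints: [3, 6]
```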
Conclusions and future work
• The integration of the geographical and temporal dimensions allows us to discover data correlations; this knowledge can increase the certainty of some information in social networks.
• The main contribution is the domain discovery and classification of information, which is a key element of new approaches for discovering geographic information.
Conclusions and future work
• Future work
• Use of clustering or deep learning approaches to improve the classification process
• Location detection is a hard problem. Other machine learning methods for social media can be tested [4, 5]
• How can we improve geographic knowledge discovery when there are no explicit links between traditional data sources and social sources?
References
[1] Jonghyun Han, Hyunju Lee, Characterizing the interests of social media users: Refinement of a topic model for incorporating heterogeneous media, Information Sciences, Volumes 358–359, 1 September 2016, Pages 112-128, ISSN 0020-0255.
[2] Schubert, E., Weiler, M., & Kriegel, H. P. (2014, August). SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 871-880). ACM.
[3] Carlos Acevedo Miranda, Ricardo Clorio Rodriguez, Roberto Zagal Flores, and Consuelo V. Garcia Mendoza. Web architecture for analysis of feelings in Facebook with semantic approach (in Spanish). Research in Computing Science 75 (2014), pp. 59–69 (received 2014-06-22; accepted 2014-07-21). http://www.rcs.cic.ipn.mx/rcs/2014_75/
[4] Ting Hua, Liang Zhao, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. 2016. How events unfold: spatiotemporal mining in social media. SIGSPATIAL Special 7, 3 (January 2016), 19-25. DOI=http://dx.doi.org/10.1145/2876480.2876485
[5] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860. ACM, 2010.