Potholes and Bad Road Conditions- Mining
Twitter to Extract Information on Killer Roads
Swati Agarwal1, Nitish Mittal2, Ashish Sureka3
1 BITS Pilani, Goa Campus, India
2 Noon.com, Dubai, UAE
3 Ashoka University, India
Research Track 5
The ACM India Joint International Conference on CoDS & COMAD
January 13, 2018
Goa, India
Motivation
• The Indian Government and public agencies increasingly adopt social media to
reach out to the people
• Citizens use Twitter to post complaints and report incidents to the
concerned authorities (citizensourcing)
• Complaints on killer roads contain the information about road irregularities and
other issues causing high risks and discomfort to the citizens.
• To develop a system that can automatically identify complaint reports and
overcome the challenge of manual inspection.
• To extract important information from tweets such as the exact geographical
location which can be used to locate the fault
Examples- Bad Road Complaints
Related Work
• Mining online communications from Twitter to automatically determine road
hazards by building language models. [Kumar et al 2014]
• Mining tweet texts to extract information on highways and arterial roads in
Pittsburgh and Philadelphia Metropolitan Areas. [Gu et al 2016]
• Analyzing real-time traffic related data for incident management. [Fu et al
2015] [D’Andrea et al 2015] [Mai et al 2013] [Schulz et al 2013]
• Identifying traffic information on Twitter by mining traffic congestion,
incidents and weather information. [Wanichayapong et al 2011]
• Examining the use of Twitter for raising public awareness during heavy
snow and public riots. [Panagiotopoulos et al. 2016]
Related Work- Steps Taken by Indian Government
• TwitterSeva
• A program to address public complaints on one shared portal.
• Uniting 200 social media handles
• Sub-units
• #DOTSeva, #BSNLSeva, #MTNLSeva, #PostalSeva (Telecommunication
and stakeholder departments)
• #MociSeva (Ministry of Commerce and Industry)
• Cybercell
• Union Minister of Women and Child Development
• Online complaints about harassment of women
• Online trolls and abuse against women
Novel Research Contributions
• First study on mining citizens’ complaints and reports on killer roads posted to
official Twitter handles of the Government.
• We investigate the efficacy of spatial, contextual and linguistic features for
identifying useful information from complaint posts.
• We build a text-analysis based model to enrich spatial features (geographical
location metadata) in a tweet that can be used to discover insights from less
informative reports.
• To conduct our experiments, we create the first database of citizens’ complaints
on killer roads and highways and make the dataset publicly available so that our
results can be used for benchmarking, comparison and further extension.
Proposed Research Framework
[Figure: framework pipeline]
• Data Collection: Raw Tweets collected via the REST API
• Micropost Enrichment [1]: #JointHashtags, Spelling Errors, Slang and Abbreviations, @username; output is Enriched Tweets
• Non-Complaint Tweets [1]: remaining tweets are Unknown
• Features Extraction: Tweet Metadata, Named Entity Recognizer, N-gram modeling, Knowledgebase Graph; output is a Features Vector Model
• Classification: One-Class/Multi-Class Classification; output is Complaint Tweets
• Problem Specific Identification: output is Complaint Reports
• Empirical Analysis and Visualization
[1] Mittal, N., Agarwal, S. and Sureka, A. (2016, December). Got a Complaint? Keep
Calm and Tweet it!. In Proceedings of the 12th International Conference on Advanced
Data Mining and Applications (ADMA). Springer.
Experimental Dataset Collection
• Extracting all the tweets posted to the public agencies’ accounts
• Twitter Search API to extract the tweets consisting of @username in the tweets
• Duration: 8 weeks (18 July 2016 to 13 September 2016)
• @MORTHIndia- Ministry of Road Transport and Highways
• @nitin_gadkari- Union Minister of MORTHIndia
Total #tweets 81,304
Original 17,511
Retweet 52,701
Replies 11,092
English 13,368
Unique 13,208
Sampled 3,302
Users 2,604
Features Extraction
• Location- to know the region (or city) from which the complaint is
reported
• Landmark- the pinpoint location (or landmark) of the problem
• Problem- to know the type of issue faced by the citizens so that the
complaint can be forwarded to the concerned department
@MORTHIndia @nitin_gadkari
1. Location
ED2- Tweets filtered by AISP classifier
“Mahatma Gandhi Marg”, “Vasant Vihar Colony”, “PVR Saket”, and “Nehru Place”
2. Problem Identification
Dysfunctional Facilities- street light, traffic light, drainage
Risky and Hazardous- pothole, street light, traffic light, hawkers, breaker, sign
board, construction, intersection, diversion, animal, turn, crash, barrier
Poor Condition- road, highway, traffic, flyover, underpass, bypass, chowk,
expressway
Indiscipline and Carelessness- barrier, repair, sign board, pothole, parking,
police
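The keyword groups above can be sketched as a simple lookup. This is a minimal illustration that assumes plain whole-word matching (with an optional plural "s"); the function name `problem_categories` is hypothetical, and the slides do not specify the actual matching procedure, which elsewhere is said to draw on ConceptNet.

```python
import re

# Keyword groups copied from the slide; the matching logic below is an assumption.
PROBLEM_KEYWORDS = {
    "Dysfunctional Facilities": ["street light", "traffic light", "drainage"],
    "Risky and Hazardous": [
        "pothole", "street light", "traffic light", "hawkers", "breaker",
        "sign board", "construction", "intersection", "diversion", "animal",
        "turn", "crash", "barrier",
    ],
    "Poor Condition": [
        "road", "highway", "traffic", "flyover", "underpass", "bypass",
        "chowk", "expressway",
    ],
    "Indiscipline and Carelessness": [
        "barrier", "repair", "sign board", "pothole", "parking", "police",
    ],
}


def problem_categories(text):
    """Return every category that has at least one keyword in the text."""
    lowered = text.lower()
    matched = []
    for category, keywords in PROBLEM_KEYWORDS.items():
        for keyword in keywords:
            # Whole-word match, tolerating a trailing plural "s".
            if re.search(r"\b" + re.escape(keyword) + r"s?\b", lowered):
                matched.append(category)
                break
    return matched


print(problem_categories("Big potholes on Delhi roads"))
```

Because several keywords appear in more than one group, a single tweet can map to multiple categories. Fused spellings such as "streetlights" would not match "street light" in this sketch.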
2. Problem Identification
Classification
[Figure: the features vector of Unknown Tweets (ED2) is passed to a Rule-based Classifier, which outputs Useful Tweet (ED3), Nearly-Useful Tweet (ED4), or Irrelevant Tweet]
• Irrelevant Tweets (IRT)
• off-topic complaints
• Useful Tweets (UT)
• complete reports
• Nearly-Useful Tweets (N-UT)
• some information is missing
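One way to picture the three categories is a small rule over the extracted components. The rule below is an assumption made for illustration (a report with no identified problem is irrelevant; a report with a problem, region, and landmark is useful; anything in between is nearly useful); the slides do not give the exact rules, and `classify_report` is a hypothetical name.

```python
def classify_report(problem, region, landmark):
    """Assumed rule: label a tweet by which components were extracted from it."""
    if not problem:
        # No road-related issue identified: treat as off-topic.
        return "irrelevant"
    if region and landmark:
        # Problem plus full location: a complete report.
        return "useful"
    # Problem identified but some location information is missing.
    return "nearly-useful"


print(classify_report("pothole", "Delhi", "Nehru Place"))
print(classify_report("pothole", "Delhi", None))
print(classify_report(None, "Delhi", "Nehru Place"))
```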
Enrichment of N-UT?
• Problem,
• Region,
• Landmark/Pin-point location
[Figure: the user's profile and the complaint tweet posted by the user are combined with OpenStreetMap to produce the output]
Characterization of Complaint Reports
Distribution of Distinct Geographical
Locations Identified in the Complaint
Reports in our Experimental Dataset
Experimental Results
Tweet Type #Tweets
Irrelevant 734
Nearly-Useful (NC) 1668
Nearly-Useful (C) 50
Useful 170
Appreciation 166
News 417
Information Sharing 97
Overall accuracy: 67%
Limitations
• API dependency
• Presence of multilingual scripts and text in a tweet
Takeaways
• We propose an approach for identifying complaints about road
irregularities and bad road conditions.
• We propose using the ConceptNet lexical knowledge base, NER,
and geocoding location APIs to extract important components of a killer road
complaint.
• Based on the available components in the tweets, we classify them into three
categories: useful, nearly-useful, and irrelevant tweets.
• Our results show that the proposed approach classifies these complaint reports
with an accuracy of 67% and a recall of 65%.
• The largest number of complaints concern risky and accident-prone
roads, while most of them are due to the poor condition of amenities.
• Complaints are reported from all over India, with the most complaints
coming from Maharashtra, New Delhi, Uttar Pradesh, and Bihar.
Future Directions
• Integration with other domains
• Information Visualization
• Metaphor analysis
• Modality and Dimensions
• Multimedia content
• Cross-platform analysis
• Novel Applications
• Corruption barometer
• Query vs. complaints
References
1. Agarwal, S., Mittal, N., Sureka, A.: Syntactic enhancement of killer road complaint tweets posted on twitter, mendeley data, v1, doi:
10.17632/dm6s252524.1 (2017)
2. D’Andrea, E., et al.: Real-time detection of traffic from twitter stream analysis. IEEE Transactions on Intelligent Transportation Systems
16(4), 2269–2283 (2015)
3. Gu, Y., Qian, Z.S., Chen, F.: From twitter to detector: Real-time traffic incident detection using social media data. Transportation
Research Part C: Emerging Technologies 67 (2016)
4. Havasi, C., Speer, R., Alonso, J.: Conceptnet 3: a flexible, multilingual semantic network for common sense knowledge. In: Recent
advances in natural language processing. pp. 27–29. Citeseer (2007)
5. Kumar, A., Jiang, M., Fang, Y.: Where not to go?: detecting road hazards using twitter. In: Proceedings of the 37th international ACM
SIGIR conference on Research & development in information retrieval. pp. 1223–1226. ACM (2014)
6. Mai, E., Hranac, R.: Twitter interactions as a data source for transportation incidents. In: Proc. Transportation Research Board 92nd
Ann. Meeting (2013)
7. Panagiotopoulos, P., Barnett, J., Bigdeli, A.Z., Sams, S.: Social media in emergency management: Twitter as a tool for communicating
risks to the public. Technological Forecasting and Social Change 111, 86–96 (2016)
8. Wanichayapong, N.: Social-based traffic information extraction and classification. In: ITS Telecommunications (ITST), 11th
International Conference on. IEEE (2011)
9. Agarwal, S., Mittal, N., Sureka, A.: Enhanced dataset of citizen centric complaints and grievances on twitter, mendeley data, v1
http://dx.doi.org/10.17632/w2cp7h53s5.1 (2016)
10. Amari, S., Wu, S.: Improving support vector machine classifiers by modifying kernel functions. Neural Networks 12(6), 783 – 789
(1999)
11. Anderson, M., Lewis, K., Dedehayir, O.: Diffusion of innovation in the public sector: Twitter adoption by municipal police
departments in the us. In: Portland International Conference on Management of Engineering and Technology, 2015
12. Edwards, A., Kool, D.: Webcare in Public Services: Deliver Better with Less?, pp. 151–166. Springer International Publishing, Cham
(2015)
13. Frias-Martinez, V., Sae-Tang, A., Frias-Martinez, E.: To call, or to tweet? understanding 3-1-1 citizen complaint behaviors. In: ASE
BigData/SocialCom/CyberSecurity Conference (2014)
14. Heverin, T., Zach, L.: Twitter for city police department information sharing. Proceedings of the American Society for Information
Science and Technology (2010)
15. Khan, G.F., Swar, B., Lee, S.K.: Social media risks and benefits a public sector perspective. Social Science Computer Review 32(5),
606–627 (2014)
16. Loukis, E., Charalabidis, Y., Androutsopoulou, A.: Evaluating a Passive Social Media Citizensourcing Innovation, pp. 305–320.
Springer International Publishing
17. Meijer, A.J., Torenvlied, R.: Social media and the new organization of government communications an empirical analysis of twitter
usage by the dutch police. The American Review of Public Administration p. 0275074014551381 (2014)
Thank you!
Questions?
BACK-UP SLIDES
Open Source Social Media Data
Introduction
Twitter
Tweet
• Content and blogger
• #likes, #retweets
• URL, hashtag, media
User
• @username, profile name, bio information, location, and external URLs
• Activity feeds, followers and followings
• Tweets posted by the blogger
YouTube
Video
• Title, description, uploader, and related videos
• #likes, #dislikes, and #comments received on the video
• Content and user of the comments and replies posted on the video
User
• Title and description of the channel
• #subscribers and #subscriptions of a channel
• Featured channel, contact and subscriptions (optional)
Tumblr
Post
• Content, type, blogger and tags of a post
• #notes, #likes, #dislikes and comments received on a post
• Who liked or reblogged a post
Blogger
• Title and description of the blogger
• Activity feeds of the blogger
• Followers, following and likes of a blogger (optional)
OSSMInt Applications for Government
Motivation
image source: http://blog.marketwired.com/2014/12/23/best-2014-social-media-marketing-posts/
Radical Users and Communities
Public Citizens’ Complaints and Grievances
Religious Conflicts
Online Recruitment in Radical Groups
Women and Child Online Abuse & Harassment
Civil Protest and Unrest Forecast
Detecting Fake News on OSM
Extremism and Hate Promotion
Secret Message Detection
[Figure: annotated Twitter interface. Labels: Profile Picture; Activity Feeds; Interests; Writing a new tweet (Multimedia attachment, Enable location, Maximum length, Poll); New and Trending Topics (Change the Region); Tweet Metadata (Likes, Reply, Retweet); External Entity (Blog URL); Twitter Blog Recommendations; Blogs followed by Visitor; Direction Mention; Username; Profile Name; Verified; User Bio; Location; External Links]
Introduction
Mining Public Complaints and Grievances on Twitter
Ministry of the Railway (@railminindia & @sureshpprabhu)
State Police (@delhipolice & @mumbaipolice)
Ministry of Road and Transport (@morthindia & @nitin_gadkari)
State Traffic Police (@dtptraffic)
Income Tax Department (@incometaxindia)
Ministry of External Affairs (@sushmaswaraj)
Department of children & youth affairs press office account (@DCYAPress)
Department of health (@roinnslainte)
Prime Minister Theresa May’s office (@Number10gov)
National government (@GOVUK)
Citizensourcing
• Disseminating information
• Collecting complaints and grievances from citizens
Case Study 1- Mining Public Complaints and Grievances
• General issues faced by citizens that require immediate action by
the concerned authorities.
• Examples:
• Corruption, illegal transactions from banks
• No-refund on ticket cancellation, delay of train, poor facilities on train/station
• Traffic violations
• Crimes
• In 1st week of January 2017, 55 trains delayed in North India due to fog
• Heavy rainfall in Delhi-NCR caused massive traffic jam for hours
Motivation
Motivation
Traffic Violation
PAN card delayed
Illegal Transaction
Poor facilities on train
Case Study 1- Mining Public Complaints and Grievances
Case Study 2- Mining Twitter to Extract information on Killer Roads
• Bad Road Conditions
• Road irregularities, roughness, potholes, bumps, patchy surfaces and poorly
designed speed breakers
• Accident-Prone Roads
• Blind or improper curves, temporary diversions, traffic on roads under repair.
• Dysfunctional Amenities
• Poor or dim street lights, encroachments on roads by shops and hawkers, and
broken road signboards
Motivation
Case Study 2- Mining Twitter to Extract information on Killer Roads
Motivation
Dataset Characterization
Case Study 1
Case Study 2- Mining Twitter to Extract information on Killer Roads
Motivation
Road Irregularities
Dysfunctional Amenities
Accident Prone Roads
Under-construction Road
Padding Space Correction (Micropost Enrichment Algorithm)
• Add a space after all special characters
• Comma, semicolon, question mark, period, exclamation mark
• Trim all extra whitespace
@ nitin_gadkari Pls re tar our village road,
a farmer died y’day while jumping
pothole, road condition is horrible, cant
even walk, pls help
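The padding space correction step can be sketched with two regular expressions. This is a minimal illustration that assumes the five punctuation marks named on the slide; the function name `pad_punctuation` is hypothetical, and runs of punctuation are deliberately kept together here, which is a design choice of this sketch rather than something the slides state.

```python
import re


def pad_punctuation(text):
    """Insert a space after , ; ? . ! when text follows directly, then trim whitespace."""
    # Pad only when the next character is neither whitespace nor more punctuation,
    # so runs such as "!!!" are kept together.
    padded = re.sub(r"([,;?.!])(?=[^\s,;?.!])", r"\1 ", text)
    # Collapse repeated whitespace and strip the ends.
    return re.sub(r"\s+", " ", padded).strip()


print(pad_punctuation("road condition is horrible,cant even walk.pls   help"))
```

A rule this simple would also split URLs and decimal numbers at the period, so a real pipeline would likely need to protect those tokens first.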
Non-Complaint Tweets (AISP)
Appreciation Posts
Information Sharing
Promotional Tweets
1. Frequent N-grams
• Grouped triplets of word n-grams frequently occurring in
complaints
• Addresses the limitations of the keyword-spotting method
• Use of the WordNet lexical resource to map similar terms
Case Study 1
2. Closed-Domain Keywords
• Terms occurring in a complaint that are specific to a department or
public agency
• N-grams- to identify the type of the complaint
Case Study 1
3. Events and Substances
• Concepts present in the tweets- not identified as the entities
• Extracted using IBM Watson Concept and Relationship Extraction API
• Uses various lexical resources and knowledge graphs as the backend
databases (YAGO, DBpedia, WordNet, and Freebase)
Case Study 1
4. Location
• Location reported in the complaints
• Identify people, location, and organization
• Verify the location by applying geocoding on the entities
• M. B. Road
• Gandhi Nagar
• AIIMS metro station
Case Study 1
5. Media Presence
Case Study 1
2. Problem Identification
Case Study 2
Big potholes on Delhi roads. It’s risky to drive at nights.
Streetlights on highway not working
Dim streetlights on highway.
Highway is full of dark.
there is complete blackout on highway.
1. Location- Landmark
Case Study 2
[Figure: landmark types] private, administrative, locality, canal, residential, hospital, hamlet, aerodrome, station, trunk, commercial, construction, neighborhood, common, bank, motorway junction, suburb, kindergarten, river, bus stop, bus station, school, restaurant, unclassified, motorway, hotel, secondary, peak, primary, water, public building, bicycle parking, service, house, stadium, railway
1. Location- Region
Case Study 2
[Figure: region types] Village, Town, City, District, Territory
Enrichment of N-UT?
Case Study 2
• Problem,
• Region,
• Landmark/Pin-point location
[Figure: Case Study 2 framework]
• Sampled Data (ED1) with features F1 .. Fi
• Tweet Pre-Processing: Sentence Segmentation, Hashtag Expansion, Spell Error Correction, Slang Treatment, @username mention treatment
• AISP Tweet Classification: Appreciation, Information Sharing, Promotion (AISP); remaining tweets are Unknown (ED2)
• Feature Extraction on ED2: Geographical Location, Issue/Incident
• Classification: Irrelevant, Incomplete Report (ED3), Useful Tweets (ED4)
• Spatial Feature Enrichment
• Gaining Insights and Knowledge from Killer Road Complaints
2. Problem Identification
Case Study 2
Experimental Results
Case Study 1
[Figure: bar chart of Precision (axis 0 to 1) per Experimental Dataset (@dtpTraffic, @RailMinIndia, @IncomeTaxIndia) for five classifiers: Linear, Polynomial, RBF, Cascaded Ensemble, Parallel Ensemble. Bar labels in extraction order: 0.76, 0.75, 0.47, 0.56, 0.76, 0.43, 0.42, 0.28, 0.33, 0.43, 0.62, 0.61, 0.36, 0.43, 0.83]
Experimental Results
Case Study 1
Dynamics of Religious Conflicts on Social
Media
[Figure: framework for the religious-conflict study on Tumblr]
• Data collection: Seed Tags, Tumblr Search API; Experimental Dataset of Posts and Blogs
• Tumblr Metadata: URL, Tag, Time, Notes, Type, Blogger
• Post Types: Photo, Audio, Video, Link, Answer, Chat, Quote, Text
• Notes Metadata: NoteBy, Type (Like/Reblog); Blogger Relations (Like/Reblog)
• Blogger Metadata: #Posts, Title, Description
• Language Identifier and Translator: Text Data split into English Language Data and Non-English; Contextual and Non-Text Metadata
• Topic Identification: LIWC, Named Entities, Taxonomy, Concepts
• Word Count Features (WCF) and Named Entities Features (NEF)
• NER on Posts and User: <People, Date>, <Organization>, <Place, Time>, <Age, Location>; PNEF, UNEF
• Multi-Label Taxonomy: posts tagged with all religions mentioned in the textual content (Religion, Unknown)
• Features: LIWC, PTopic, Feature 1 .. Feature N
Term Obfuscation Detection in Secret Message Communication
MEAN AVERAGE CONCEPTUAL SIMILARITY (MACS)
WORD OBFUSCATION CLASSIFIER
Early forecast of civil protests
Learning Models:
Named Entity Recognition:
Online Radicalization Detection
• Hate promoting content identification
• Radicalized users and communities detection
• YouTube
• Tumblr
• Intent-based radicalized and racist posts identification
• Tumblr
Sensitive Topic Content
Users and Hidden Community
Government Bodies
Monitoring OSM Social Media
Expressing their Opinions
Promoting Certain Ideology and Beliefs
Discussion on a variety of sensitive topics (religion and race), presence
of hate promoting, radicalized and extremist content
Shallow NLP Techniques
Deep NLP and Semantics
Users Uploading Extremist Posts and Forming Communities
Graph Traversal and Topical/Focused Crawler
N-gram Feature Vectors Model
Personality Traits, Writing Cues, Word Semantics, Emotions
One-Class KNN and SVM
Random Forest, Naïve Bayes, and
Decision Tree
Friends, Subscribers, Featured channels
Notes, Likes, Re-blogs
Shark Search, Best First Search
Random Walk
Linguistic and Contextual Metadata Analysis Links and User Relationship Analysis
Requirement of Monitoring Social Media
OSSMInt Applications (Identification and
Detection)
Purpose of Author for uploading such content
High-Level Approach
Social Media Platforms and Data Sources
Discriminatory Features (Extracted and Derived)
Content and Network Analysis (Data Mining
and Machine Learning)
Future Directions
• Novel Applications
• Corruption barometer
• Intent vs. impact of hate promoting posts
• Recruitment analysis in online radicalized communities
• Fake news detection
• NLP systems for security informatics
Micropost Enrichment Algorithm
Mining Public Complaints and Grievances on Twitter
• Consisting of data pre-processing and text enrichment techniques
• Addressing the problem of noisy data in tweets
[Figure: Raw Tweets → Micropost Enhancement and Enrichment (Sentence Segmentation, Hashtag Expansion, Padding Space Correction, Spell Error Correction, Acronym & Slang Treatment, @Username Expansion) → Enriched Tweets]
Sentence Segmentation (Micropost Enrichment Algorithm)
• Find ill hashtags in the tweet (ignore the hashtag present in URL)
• Removal of filler terms
• Umm, hmm, err
• Trim consecutive special characters occurring together
• ????, !!!!!, …..
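The two clean-up rules above (dropping filler terms and trimming repeated special characters) can be sketched as follows. This is an illustrative sketch: the filler list contains only the three terms named on the slide, the function name `clean_fillers_and_repeats` is hypothetical, and collapsing a repeated mark to a single occurrence is an assumed reading of "trim".

```python
import re

# Filler terms from the slide, allowing stretched spellings such as "ummm".
FILLERS = re.compile(r"\b(?:umm+|hmm+|err+)\b,?\s*", re.IGNORECASE)
# Two or more identical punctuation marks in a row.
REPEATED = re.compile(r"([?!.,;])\1+")


def clean_fillers_and_repeats(text):
    """Remove filler terms and collapse runs of the same punctuation mark."""
    without_fillers = FILLERS.sub("", text)
    collapsed = REPEATED.sub(r"\1", without_fillers)
    return re.sub(r"\s+", " ", collapsed).strip()


print(clean_fillers_and_repeats("umm, why is this road closed????"))
```

Collapsing "..." to "." discards the ellipsis, which may or may not be wanted; the slides do not say how that case is handled.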
Joint Hashtag Expansion (Micropost Enrichment Algorithm)
1. Common Separator (’-’ and ’_’)
• #no-strong-action-by-police = no strong action by police
• #black_money = black money
2. Uppercase Letters
• #MarchForDemocracy = March For Democracy
• #CharchaOnRWH = Charcha On RWH
• #FANTomorrow = FAN Tomorrow
3. Alphanumeric String
• #TheriJoins100crClub = Theri Joins 100 cr Club
• #NH8 = NH 8
4. Longest subsequence
• #seriousissue = serious issue
• #havesomesenseofchecking = have some sense of checking
• #swachhbharat #swaranshatabdi #modisarkar
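The four expansion rules can be sketched in a single function. This is a hedged illustration, not the authors' implementation: the regular expression for case and digit boundaries, the helper names `segment` and `expand_hashtag`, and the reading of "longest subsequence" as longest-prefix dictionary matching with backtracking are all assumptions. The `vocabulary` argument stands in for whatever word list the real system uses.

```python
import re


def segment(text, vocabulary):
    """Split text into vocabulary words, trying the longest prefix first."""
    if not text:
        return []
    for end in range(len(text), 0, -1):
        prefix = text[:end]
        if prefix in vocabulary:
            rest = segment(text[end:], vocabulary)
            if rest is not None:
                return [prefix] + rest
    return None  # no complete segmentation exists


def expand_hashtag(tag, vocabulary=frozenset()):
    """Expand a joint hashtag using separator, case, digit, then dictionary rules."""
    body = tag.lstrip("#")
    # Rule 1: common separators.
    if "-" in body or "_" in body:
        return " ".join(part for part in re.split(r"[-_]+", body) if part)
    # Rules 2 and 3: uppercase runs, capitalised words, lowercase words, digit runs.
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)
    if len(tokens) > 1:
        return " ".join(tokens)
    # Rule 4: dictionary-based split of an all-lowercase tag.
    words = segment(body.lower(), vocabulary)
    return " ".join(words) if words else body


print(expand_hashtag("#FANTomorrow"))
print(expand_hashtag("#seriousissue", {"serious", "issue"}))
```

Tags whose words are missing from the vocabulary, which may be the case for the transliterated examples on the slide, are returned unchanged by this sketch.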
Spell Error Correction (Micropost Enrichment Algorithm)
• Given sentence S = ’plnty of ppl waiting frm past hour’,
• set of 3-grams = [n1: ’plnty of ppl’, n2: ’of ppl waiting’, n3: ’ppl waiting frm’, n4:
’waiting frm past’, n5: ’frm past hour’]
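The 3-gram set in this example can be reproduced with a short sliding-window helper. The function name `word_ngrams` is hypothetical, and the sketch covers only the n-gram generation step; the slide does not describe how the n-grams are then used to correct spelling.

```python
def word_ngrams(sentence, n=3):
    """Return all contiguous word n-grams of a sentence, in order."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


for index, gram in enumerate(word_ngrams("plnty of ppl waiting frm past hour"), start=1):
    print(f"n{index}: {gram}")
```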
Acronym and Slang Treatment (Micropost Enrichment Algorithm)
• Domain-specific slang (fetched from frequent n-grams, n = 1, 2, 3, 4)
• NH (National Highway), Toll, RTO (Regional Transport Office), RC (Vehicle Registration
Certificate), KM (kilometers), STN (station), Acc (accident) and RD (Road)
• Numeric
• 2- to, two, too?
• 4- for, four?
• Others (General slang or abbreviations)
• B4- before, plz- please, idk- i don’t know, c- see
• NoSlang online dictionary
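A dictionary lookup is one simple way to apply the expansions listed above. The sketch below is an assumption-laden illustration: it uses only the entries shown on the slide, matches whole whitespace-separated tokens case-insensitively, and leaves the ambiguous numeric tokens ("2", "4") untouched because the slide itself marks them as open questions. The name `expand_slang` is hypothetical, and the NoSlang dictionary is not queried here.

```python
# Entries copied from the slide; a real system would load a much larger dictionary.
DOMAIN_SLANG = {
    "nh": "National Highway",
    "rto": "Regional Transport Office",
    "rc": "Vehicle Registration Certificate",
    "km": "kilometers",
    "stn": "station",
    "acc": "accident",
    "rd": "Road",
}
GENERAL_SLANG = {
    "b4": "before",
    "plz": "please",
    "idk": "i don't know",
    "c": "see",
}


def expand_slang(text):
    """Replace each token found in the slang dictionaries with its expansion."""
    # Domain entries take precedence over general ones on a key clash.
    lookup = {**GENERAL_SLANG, **DOMAIN_SLANG}
    return " ".join(lookup.get(token.lower(), token) for token in text.split())


print(expand_slang("plz repair NH 8 b4 monsoon"))
```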
Username Expansion (Micropost Enrichment Algorithm)
• Expanding @username mentioned directly in the tweet
• Extracting profile name using Twitter REST API
Proposed Research Framework
Mining Public Complaints and Grievances on Twitter
[Figure: framework pipeline: Data Collection (REST API; Raw Tweets), Micropost Enrichment (#JointHashtags, Spelling Errors, Slang and Abbreviations, @username; Enriched Tweets), Non-Complaint Tweets, Features Extraction (Tweet Metadata, Named Entity Recognizer, N-gram modeling, Knowledgebase Graph), Features Vector Model, One-Class/Multi-Class Classification (Complaint Tweets, Unknown), Problem Specific Identification (Complaint Reports), Empirical Analysis and Visualization]
Non-Complaint Tweets (AISP)
Appreciation Posts
• Keyword matching: Tweet → Core NLP → lemmatized bag-of-words → similarity computation against an exhaustive set of appreciation keywords → match?
• Tone analysis: Tweet → Tone Analyzer (IBM Watson API) → sentence-level tone (Anger, Sadness, Fear, Joy, Anxiety) → if Joy score > threshold, then appreciation post
Non-Complaint Tweets
Non-Complaint Tweets (AISP)
Information Sharing
• Tweets consisting of an external URL
• The URL is not an image or video
• Quote tweets
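The information-sharing rule can be sketched as a small predicate. This is a minimal illustration under stated assumptions: media links are recognised only by a short, hypothetical list of file extensions, the inputs (`urls`, `is_quote`) are assumed to have been extracted from the tweet beforehand, and the name `is_information_sharing` is not from the slides.

```python
from urllib.parse import urlparse

# Assumed list of media file extensions; the slides do not define one.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4")


def is_information_sharing(urls, is_quote=False):
    """True for quote tweets or tweets with at least one non-media external URL."""
    if is_quote:
        return True
    for url in urls:
        path = urlparse(url).path.lower()
        if not path.endswith(MEDIA_EXTENSIONS):
            return True
    return False


print(is_information_sharing(["https://example.com/news/road-report"]))
print(is_information_sharing(["https://example.com/photo.jpg"]))
```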
Non-Complaint Tweets
Non-Complaint Tweets (AISP)
Promotional Tweets
• Tweets posted by other accounts or handles of the same
department or public agency
• Tweets posted by a verified account
Non-Complaint Tweets