potholes and bad road conditions- mining twitter to

Post on 02-Jan-2022

4 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Potholes and Bad Road Conditions- Mining

Twitter to Extract Information on Killer Roads

Swati Agarwal1, Nitish Mittal2, Ashish Sureka3

1 BITS Pilani, Goa Campus, India2 Noon.com, Dubai, UAE

3 Ashoka University, India

Research Track 5

The ACM India Joint International Conference on CoDS & COMAD

January 13, 2018

Goa, India

Motivation

2

• Increasing trend of adaptation of social media by Indian Government and public

agencies to reach out to the people

• public citizens use Twitter to post their complaints and report the incidents to the

concerned authorities- Citizensourcing

• Complaints on killer roads contain the information about road irregularities and

other issues causing high risks and discomfort to the citizens.

• To develop a system that can automatically identify complaint reports and

overcome the challenge of manual inspection.

• To extract important information from tweets such as the exact geographical

location which can be used to locate the fault

Examples- Bad Road Complaints

Related Work

• Mining online communications from Twitter to automatically determine road

hazards by building language models. [Kumar et al 2014]

• Mining tweet texts to extract information on highways and arterial roads in

Pittsburg and Philadelphia Metropolitan Areas. [Gu et al 2016]

• Analyzing real-time traffic related data for incident management. [Fu et al

2015] [Eleonora 2015] [Mai et al 2013] [Schulz et al 2013]

• Identifying traffic information on Twitter by mining traffic congestion,

incidents and weather information. [Napong et al 2011]

• Examining the use of Twitter for making public awareness during heavy

snow and public riots. [Panagiotopoulos et al. 2016]

4

Related Work- Steps Taken by Indian Government

• TwitterSeva

• A program to address public complaints on one shared portal.

• Uniting 200 social media handle

• Sub-units

• #DOTSeva, #BSNLSeva, #MTNLSeva, PostalSeva (Telecommunication

and stakeholder departments)

• #MociSeva (Ministry of Commerce and Industry)

• Cybercell

• Union minister of Women and Child Development

• Online complaints against women harassment

• Online trolls and abuse against women

5

Novel Research Contributions

• First study on mining citizen's complaints and reports on killer roads posted on

official Twitter handle of Government.

• We investigate the efficacy of spatial, contextual and linguistic features for

identifying useful information from complaint posts.

• We build a text-analysis based model to enrich spatial features (geographical

location metadata) in a tweet that can be used to discover insights from less

informative reports.

• In order to conduct our experiments for this research, we create the very first

database of citizens’ complaints on killer roads and highways and make our

dataset publicly available for the research community so that our results can be

used for benchmark, comparison and further extension.

6

Enriched

Tweets

Raw

Tweets

Complaint

Reports

Proposed Research Framework

#JointHashtags, Spelling

Errors, Slang and

Abbreviations, @username

REST

API

Data Collection Micropost Enrichment [1] Non-Complaint Tweets [1]

Features Extraction

Tweet Metadata, Named

Entity Recognizer, N-

gram modeling,

Knowledgebase Graph

Complaint Tweets

Classification

Unknown

Features

Vector ModelOne-Class/ Multi-Class

Classification

Empirical Analysis and

Visualization

Problem Specific

Identification

7

[1] Mittal, N., Agarwal, S. and Sureka, A. (2016, December). Got a Complaint? Keep

Calm and Tweet it!. In Proceedings of the 12th International Conference on Advanced

Data Mining and Applications (ADMA). Springer.

Experimental Dataset Collection

• Extracting all the tweets posted to the public agencies’ accounts

• Twitter Search API to extract the tweets consisting of @username in the tweets

• Duration: 8 weeks (18 July 2016 to 13 September 2016)

• @MORTHIndia- Ministry of Road, Transports, and Highways

• @nitin_gadkari- Union Minster of MORTHIndia

8

Total #tweets 81,304

Original 17,511

Retweet 52,701

Replies 11,092

English 13, 368

Unique 13,208

Sampled 3,302

Users 2,604

Features Extraction

9

• Location- to know the region (or city) from which the complaint is

reported

• Landmark- the pinpoint location (or landmark) of the problem

• Problem- to know the type of issue faced by the citizens so that the

complaint can be forwarded to the concerned department

@MORTHIndia @nitin_gadkari

1. Location

10

ED2- Tweets filtered by AISP classifier

“Mahatma Gandhi Marg”, “Vasant Vihar Colony”, “PVR Saket”, and “Nehru Place”

2. Problem Identification

11

Dysfunctional Facilities- street light, traffic light, drainage

Risky and Hazardous- pothole, street light, traffic light, hawkers, breaker, sign

board, construction, intersection, diversion, animal, turn, crash, barrier

Poor Condition- road, highway, traffic, flyover, underpass, bypass, chowk,

expressway

Indiscipline and Carelessness- barrier, repair, sign board, pothole, parking,

police

2. Problem Identification

12

Classification

13

Rule-based

Classifier

Features Vector

ED2

Unknown Tweets

Useful Tweet (ED3)

Irrelevant Tweet

Nearly- Useful Tweet (ED4)

• Irrelevant Tweets (IRT)

• off-topic complaints

• Useful Tweets (UT)

• complete reports

• Nearly-Useful Tweets (N-UT)

• some information is missing

Enrichment of N-UT?

14

• Problem,

• Region,

• Landmark/Pin-point location

USER’S PROFILE COMPLAINT TWEET POSTED BY THE USER

OPENSTREETMAP

OUTPUT

Characterization of Complaint Reports

15

Distribution of Distinct Geographical

Locations Identified in the Complaint

Reports in our Experimental Dataset

Experimental Results

16

Tweet Type #Tweets

Irrelevant 734

Nearly-Useful (NC) 1668

Nearly-Useful (C) 50

Useful 170

Appreciation 166

News 417

Information Sharing 97

Overall accuracy:

67%

Limitations

• API dependency

• Presence of multilingual scripts and text in a tweet

17

Takeaways

• We propose an approach for identification of complaints reported in road

irregularities and bad road conditions.

• We propose to use the application of ConceptNet lexical Knowledgebase, NER,

and geocoding location APIs to extract important components of a killer road

complaint.

• Based on the available components in the tweets, we classify them into three

categories: useful, nearly-useful, and irrelevant tweets.

• Our results shows that the proposed approach classifies these complaint reports

with an accuracy of 67% and a recall of 65%.

• Maximum number of complaints are reported about the risky and accident prone

roads while most of them are due to the poor condition of amenities.

• Complaints are reported from all over the India while maximum complaints are

reported from Maharashtra, New Delhi, Uttar Pradesh, and Bihar states.

18

Future Directions

• Integration with other domains

• Information Visualization

• Metaphor analysis

• Modality and Dimensions

• Multimedia content

• Cross-platform analysis

• Novel Applications

• Corruption barometer

• Query vs. complaints

19

References

20

1. Agarwal, S., Mittal, N., Sureka, A.: Syntactic enhancement of killer road complaint tweets posted on twitter, mendeley data, v1, doi:

10.17632/dm6s252524.1 (2017)

2. D’Andrea, E.e.a.: Real-time detection of traffic from twitter stream analysis. IEEE Transactions on Intelligent Transportation Systems

16(4), 2269–2283 (2015)

3. Gu, Y., Qian, Z.S., Chen, F.: From twitter to detector: Real-time traffic incident detection using social media data. Transportation

Research Part C: Emerging Technologies 67 (2016)

4. Havasi, C., Speer, R., Alonso, J.: Conceptnet 3: a flexible, multilingual semantic network for common sense knowledge. In: Recent

advances in natural language processing. pp. 27–29. Citeseer (2007)

5. Kumar, A., Jiang, M., Fang, Y.: Where not to go?: detecting road hazards using twitter. In: Proceedings of the 37th international ACM

SIGIR conference on Research & development in information retrieval. pp. 1223–1226. ACM (2014)

6. Mai, E., Hranac, R.: Twitter interactions as a data source for transportation incidents. In: Proc. Transportation Research Board 92nd

Ann. Meeting (2013)

7. Panagiotopoulos, P., Barnett, J., Bigdeli, A.Z., Sams, S.: Social media in emergency management: Twitter as a tool for communicating

risks to the public. Technological Forecasting and Social Change 111, 86–96 (2016)

8. Wanichayapong, N.: Social-based traffic information extraction and classification. In: ITS Telecommunications (ITST), 11th

International Conference on. IEEE (2011)

References

21

9. Agarwal, S., Mittal, N., Sureka, A.: Enhanced dataset of citizen centric complaints and grievances on twitter, mendeley data, v1

http://dx.doi.org/10.17632/w2cp7h53s5.1 (2016)

10. Amari, S., Wu, S.: Improving support vector machine classifiers by modifying kernel functions. Neural Networks 12(6), 783 – 789

(1999)

11. Anderson, M., Lewis, K., Dedehayir, O.: Diffusion of innovation in the public sector: Twitter adoption by municipal police

departments in the us. In: Portland International Conference on Management of Engineering and Technology, 2015

12. Edwards, A., Kool, D.: Webcare in Public Services: Deliver Better with Less?, pp. 151–166. Springer International Publishing, Cham

(2015)

13. Frias-Martinez, V., Sae-Tang, A., Frias-Martinez, E.: To call, or to tweet? understanding 3-1-1 citizen complaint behaviors. In: ASE

Big- Data/SocialCom/CyberSecurity Conference (2014)

14. Heverin, T., Zach, L.: Twitter for city police department information sharing. Pro- ceedings of the American Society for Information

Science and Technology (2010)

15. Khan, G.F., Swar, B., Lee, S.K.: Social media risks and benefits a public sector perspective. Social Science Computer Review 32(5),

606–627 (2014)

16. Loukis, E., Charalabidis, Y., Androutsopoulou, A.: Evaluating a Passive Social Media Citizensourcing Innovation, pp. 305–320.

Springer International Publishing

17. Meijer, A.J., Torenvlied, R.: Social media and the new organization of government communications an empirical analysis of twitter

usage by the dutch police. The American Review of Public Administration p. 0275074014551381 (2014)

We are hiring full-time and visiting faculty in Computer Science

Our focus is on: AI, Networks, Systems,Software Science, and Theory (but get in touch anyway if you are doing exciting work in some other area of Computer Science)

Write to us: csis.office@goa.bits-pilani.ac.in

Tel: +91 832 2580 107

Thank you!

23

Questions?

24

BACK-UP SLIDES

25

Open Source Social Media Data

Introduction

Twitter

Content and Blogger

#likes #retweets

URL, hashtag, media

Tweet

User@username, profile name, bio information, location, and external URLs

Activity feeds, followers and followings

Tweets posted by the blogger

YouTube

Title, description, uploader, and related videos

#likes, #dislikes, and #comments received on the video

content and user of the comments and replies posted on the video

Video

UserTitle and description of the channel

#subscribers and #subscriptions of a channel

featured channel, contact and subscriptions (optional)

Tumblr

Content, type, blogger and tags of a post

#notes, #likes, #dislikes and comments received on a post

who liked or reblogged a post

Post

BloggerTitle and description of the blogger

Activity feeds of the blogger

followers, following and likes of a blogger (optional)

26

OSSMInt Applications for Government

Motivation

image source: http://blog.marketwired.com/2014/12/23/best-2014-social-media-marketing-posts/

Radical Users and

Communities

Public Citizens’ Complaints and

Grievances

Religious

Conflicts

Online Recruitment in Radical

Groups

Women and Child

Online Abuse &

Harassment

Civil Protest and

Unrest Forecast

Detecting Fake

News on OSM

Extremism and

Hate Promotion

Secret Message

Detection

27

Twitter

28

Profile Picture

Activity Feeds Interests Writing a new tweet

Multimedia attachment

Enable location

Maximum length Poll

New and Trending Topics

Change the

Region

Tweet Metadata

Likes, Reply, Retweet

External Entity

(Blog URL)

Twitter Blog

Recommendations

Blogs followed

by Visitor

Direction Mention

Username

Profile Name

Verified

User Bio

Location

External Links

IntroductionMining Public Complaints

and Grievances on Twitter

Ministry of the Railway (@railminindia & @sureshpprabhu)

State Police (@delhipolice & @mumbaipolice)

Ministry of Road and Transport (@morthindia & @nitin_gadkari)

State Traffic Police (@dtptraffic)

Income Tax Department (@incometaxindia)

Ministry of External Affairs (@sushmaswaraj)

Department of children & youth affairs press

office account (@DCYAPress)

Department of health (@roinnslainte)

Prime Minister Theresa May’s office

(@Number10gov)

national government (@GOVUK)

Citizensourcing

disseminating information

collecting complaints and grievances from citizens

29

Case Study 1- Mining Public Complaints and Grievances

• General issues that are faced by public citizens and seek for immediate actions by

the concerned authorities.

• Examples:

• Corruption, illegal transactions from banks

• No-refund on ticket cancellation, delay of train, poor facilities on train/station

• Traffic violations

• Crimes

• In 1st week of January 2017, 55 trains delayed in North India due to fog

• Heavy rainfall in Delhi-NCR caused massive traffic jam for hours

Motivation

30

Motivation

Traffic Violation

PAN card delayed

Illegal Transaction

Poor facilities on train

Case Study 1- Mining Public Complaints and Grievances

31

Case Study 2- Mining Twitter to Extract information on Killer Roads

• Bad Road Conditions

• Road Irregularities, Roughness, potholes, bumps, patchy surface and poorly

designed speed breakers

• Accidents Prone Roads

• Blind or improper curves, temporary diversions, traffic on under repair roads.

• Dysfunctional Amenities

• Poor street lights or dim lights, encroachments on roads by shops, hawkers and

broken road signboards

Motivation

32

Case Study 2- Mining Twitter to Extract information on Killer Roads

Motivation

33

Dataset Characterization

Case Study

1

34

Case Study 2- Mining Twitter to Extract information on Killer Roads

Motivation

Road Irregularities Dysfunctional Amenities

Accident Prone RoadsUnder-construction Road

35

Padding Space CorrectionMicropost Enrichment

Algorithm

• Add a space after all special characters

• Comma, semi colon, question mark, period, exclamatory mark

• Trim all extra whitespaces

36

@ nitin_gadkari Pls re tar our village road,

a farmer died y’day while jumping

pothole, road condition is horrible, cant

even walk, pls help

Non-Complaint Tweets (AISP)

37

Appreciation Posts

Information Sharing

Promotional Tweets

1. Frequent N-grams

38

• Grouped triplets of word n-grams frequently occurring in

complains

• Addresses the challenge of keyword spotting method

• Use of WordNet lexical resource to map similar terms

Case Study

1

2. Closed-Domain Keywords

39

• Terms occurring in a complaint specific to a department of

public agency

• N-grams- to identify the type of the complaint

Case Study

1

3. Events and Substances

40

• Concepts present in the tweets- not identified as the entities

• Extracted using IBM Watson Concept and Relationship Extraction API

• Uses various lexical resources and knowledgegraphs as the backend

databases (Yago, Dbpedia, WordNet, and Freebase)

Case Study

1

4. Location

41

• Location reported in the complaints

• Identify people, location, and organization

• Verify the location by applying geocoding on the entities

• M. B. Road

• Gandhi Nagar

• AIIMS metro station

Case Study

1

5. Media Presence

42

Case Study

1

2. Problem Identification

43

Case Study

2 Big potholes on Delhi roads. It’s risky to drive at nights.

Streetlights on highway not working

Dim streetlights on highway.

Highway is full of dark.

there is complete blackout on highway.

1. Location- Landmark

44

Case Study

2 private

administrative

locality

canal

residential

hospital

hamlet

aerodrome

station

trunk

commercial

construction

neighborhood

common

bank

motorway junction

suburb

kindergarten

river

bus stop

bus station

school

restaurant

unclassified

motorway

hotel

secondary

peak

primary

water

public building

bicycle parking

service

house

stadium

railway

1. Location- Region

45

Case Study

2 Village

Town

City

District

Territory

Enrichment of N-UT?

46

Case Study

2• Problem,

• Region,

• Landmark/Pin-point location

Sampled Data (ED1)

F1 .. .. Fi-1 Fi

Hashtag Expansion

Spell Error Correction

Information Sharing

Promotion Appreciation

AISP Tweet Classif cation

AISP

Unknown (ED2)

1Tweet Pre-Processing

Sentence Segmentation

Slang Treatment

@username mention treatment

Sampled Data (ED1)

F1 .. .. Fi-1 Fi

Tweets Pre-processing

AISP Tweet Classif cation

AISP

Unknown (ED2)

ED2

Geographical Location

Issue/Incident

Feature Extraction

Classif cation

Irrelevant

Incomplete Report (ED3)

Useful Tweets (ED4)

ED4 Spatial Feature Enrichment Gaining Insights and

Knowledge from Killer Road Complaints

ED3 Useful Tweets

2. Problem Identification

47

Case Study

2

Experimental Results

48

Case Study

1

@dtpTraffic @RailMinIndia @IncomeTaxIndia0

0.2

0.4

0.6

0.8

1

Experimental Dataset

Pre

cisi

on

Linear

Polynomial

RBF

Cascaded Ensemble

Parallel Ensemble

0.7

6

0.7

5

0.4

7

0.5

6

0.7

6

0.4

3

0.4

2

0.2

8

0.3

3

0.4

3

0.6

2

0.6

1

0.3

6

0.4

3

0.8

3

Experimental Results

49

Case Study

1

Experimental Results

50

Case Study

1

Dynamics of Religious Conflicts on Social

Media

51

Notes

Type

Blogger

Text

Link

Chat

Photo

Audio

Video

Quote

Answer

Photo,Audio,Video,Link,Answer,Chat,Quote,Text

NoteBy,Type(Like/Reblog)

Tumblr Metadata

URL

Tag

Time

Notes

Type

Blogger

Seed

Tags

TumblrMetadata

Translator Language Identif er Posts and

Blogs

Contextual and Non-Text Metadata Text Data

English Language Data Non-

English

Description Blogger Metadata

Experimental Dataset

#Posts,Title,DescripFon

Tumblr Search API

Blogger Relations (Like/Reblog)

Posts Metadata

Notes Metadata

Blogger Metadata

Post Types

Tumblr Metadata

Topic Identif cation

Topic Identif cation

LIWC

Named Entities

Taxonomy

Word Count Features (WCF) Named Entities Features (NEF)

Religion

Unknown

Concepts

Multi-Label

Taxonomy Posts tagged with all religions mentioned in the textual content

LIWC PTopic

Feature 1 Feature 2

..

..

.. Feature N

Posts

User

NER

NER

<People, Date> <Organization> <Place, Time>

<Age, Location>

PNEF

UNEF

Experimental Dataset

Post Metadata Post Metadata

Post and Blogger Metadata

Term Obfuscation Detection in Secret Message Communication

52

MEAN AVERAGE CONCEPTUAL SIMILARITY (MACS)

WORD OBFUSCATION CLASSIFIER

Early forecast of civil protests

53

Learning(Models:(

Named(En2ty(Recogni2on:(• • •

Online Radicalization Detection

54

• Hate promoting content identification

• Twitter

• Radicalized users and communities detection

• YouTube

• Tumblr

• Intent-based radicalized and racist posts identification

• Tumblr

55

Sensitive Topic Content

Users and Hidden Community

Government Bodies

Monitoring OSM Social Media

Expressing their Opinions

Promoting Certain Ideology and Beliefs

Discussion on a variety of sensitive topics (religion and race), presence

of hate promoting, radicalized and extremist content

Shallow NLP Techniques

Deep NLP and Semantics

Users Uploading Extremist Posts and Forming Communities

Graph Traversal and Topical/Focused Crawler

N-gram Feature Vectors Model

Personality Traits, Writing Cues, Word Semantics, Emotions

One-Class KNN and SVM

Random Forest, Naïve Bayes, and

Decision Tree

Friends, Subscribers, Featured channels

Notes, Likes, Re-blogs

Shark Search, Best First Search

Random Walk

Linguistic and Contextual Metadata Analysis Links and User Relationship Analysis

Requirement of Monitoring Social Media

OSSMInt Applications (Identification and

Detection)

Purpose of Author for uploading such content

High-Level Approach

Social Media Platforms and Data Sources

Discriminatory Features (Extracted and Derived)

Content and Network Analysis (Data Mining

and Machine Learning)

Future Directions

• Novel Applications

• Corruption barometer

• Intent vs. impact of hate promoting posts

• Recruitment analysis in online radicalized communities

• Fake news detection

• NLP systems for security informatics

56

Micropost Enrichment AlgorithmMining Public Complaints and

Grievances on Twitter

• Consisting of data pre-processing and text enrichment techniques

• Addressing the problem of noisy data in tweets

Raw Tweets

Sentence Segmentation

Micropost Enhancement and Enrichment

Enriched Tweets

Hashtag Expansion

Padding Space Correction

Spell Error Correction

Acronym & Slang

Treatment

@Username Expansion

57

Sentence SegmentationMicropost Enrichment

Algorithm

• Find ill hashtags in the tweet (ignore the hashtag present in URL)

• Removal of filling terms

• Umm, hmm, err,

• Trim consecutive special characters occurring together

• ????, !!!!!, …..

58

Raw Tweets

Sentence Segmentation

Micropost Enhancement and Enrichment

Enriched Tweets

Hashtag Expansion

Padding Space Correction

Spell Error Correction

Acronym & Slang

Treatment

@Username Expansion

Joint Hashtag ExpansionMicropost Enrichment

Algorithm

1. Common Separator (’-’ and ’-’)

• #no-strong-action-by-police = no strong

action by police

• #black_money = black money

1. Uppercase Letters

• #MarchForDemocracy = March For

Democracy

• #CharchaOnRWH = Charcha On RWH

• #FANTomorrow = FAN Tomorrow

59

3. Alphanumeric String

• #TheriJoins100crClub = Theri Joins 100 cr Club

• #NH8= NH 8

4. Longest subsequence

• #seriousissue = serious issue

• #havesomesenseofchecking = have some sense of

checking

• #swachhbharat #swaranshatabdi

#modisarkar

Spell Error CorrectionMicropost Enrichment

Algorithm

• Given Sentence S= ”plnty of ppl waiting frm past hour”,

• set of 3-grams= [n1: ’plnty of ppl’, n2: ‘of ppl waiting’, n3: ’ppl waiting frm’, n4:

’waiting frm past’, n5: ‘frm past hour’]

60

Acronym and Slang TreatmentMicropost Enrichment

Algorithm

• Domain Specific Slangs (fetched from frequent n-grams, n=1, 2, 3, 4)

• NH (National Highway), Toll, RTO (Regional Transport Office), RC (Vehicle Registration

Certificate), KM (kilometers), STN (station), Acc (accident) and RD (Road)

• Numeric

• 2- to, two, too?

• 4- for, four?

• Others (General slang or abbreviations)

• B4- before, plz- please, idk- i don’t know, c- see

• NoSlang online dictionary

61

Username ExpansionMicropost Enrichment

Algorithm

62

• Expanding @username mentioned directly in the tweet

• Extracting profile name using Twitter REST API

Proposed Research FrameworkMining Public Complaints and

Grievances on Twitter

Enriched

Tweets

Non-Complaint Tweets

Raw

Tweets

#JointHashtags, Spelling

Errors, Slang and

Abbreviations, @username

Micropost Enrichment

REST

API

Data Collection

Complaint

Reports

Features Extraction

Tweet Metadata, Named

Entity Recognizer, N-

gram modeling,

Knowledgebase Graph

Complaint Tweets

Classification

Unknown

Features

Vector ModelOne-Class/ Multi-Class

Classification

Empirical Analysis and

Visualization

Problem Specific

Identification

63

Non-Complaint Tweets (AISP)

64

Appreciation Posts

Tweet

Bag-of-

words

Core NLP

Exhaustive set of

Appreciation keywords

Similarity

Computation

Lemmatized

Match?

Tweet

Tone Analyzer

IBM Watson APIJoy Score?

> Threshold then

appreciation post

Sentence Level

Tone

Anger

Sadness

Fear

Joy

Anxiety

Non-Complaint Tweets

Non-Complaint Tweets (AISP)

65

Information Sharing

• Tweets consisting of an external URL

• URL not an image, video

• Quote tweets

Non-Complaint Tweets

Non-Complaint Tweets (AISP)

66

Promotional Tweets

• Tweets posted by other accounts or handle of same

department or public agency

• Tweet posted by a verified account

Non-Complaint Tweets

top related