Real-time Natural Language Processing for Crowdsourced Road Traffic Alerts
C.D. Athuraliya, M.K.H. Gunasekara, Srinath Perera, Sriskandarajah Suhothayan
http://bit.ly/1NwBXTv
Overview
● Introduction
● Background
● Solution & Methodology
● Results & Conclusion
Introduction
● The success of modern enterprises and businesses relies heavily
on how they process massive amounts of data
● “Drowning in data yet starving for knowledge”
● With the emergence of social media, the public has gained the
potential to generate massive amounts of data
● But we still struggle to extract useful information from this
data
Introduction
● Road traffic has become a major issue, mainly in developing
countries
● Directly affects a country’s economy and development through
wasted resources – fuel, time
● Technology-based solutions have proven successful in a
number of cases
● This study focused on one such solution that emerged through the
use of social media
● Twitter – Popular for dynamic content publishing
○ Users publish on different topics such as current affairs, news, politics
and personal interests via 140-character messages called tweets
Background
● Road.lk – A website that provides localized traffic alerts from a
Twitter feed
● Experiencing road traffic or have information on road traffic? Tweet
about it!
● All users who follow @road_lk receive traffic alerts in near real time
● Identified as a potential source to extract information on road traffic
in real time
● Reliability is maintained by a high number of publishers
Background – @road_lk Feed
Background
● The potential is significant for a country like Sri Lanka, due to the
unavailability of high-tech traffic monitoring systems
● Several limitations:
○ Connectivity requirement
○ No proper alert mechanism other than the Twitter feed or the
road.lk website
● Notable limitation – Users use natural language to post traffic
updates
● A fixed format would make processing tweets more straightforward, but
it would reduce the flexibility of sharing updates
Solution & Methodology
● A prototype solution was implemented by combining NLP and CEP
tools
● Accommodates three use cases:
○ Real-time road traffic feed and geo location map
○ Traffic search within an area
○ Traffic alert subscription
● Developed an architecture for these use cases
● Multiple tools were utilized to retrieve, process and present
information
Solution & Methodology – Architecture
Solution & Methodology – Feed
● Feed Retrieval – Access Twitter via its API
● Existing feed for model training dataset generation
○ REST API, Twitter4J
● Real-time feed stream for alert generation
○ Streaming API, WSO2 Enterprise Service Bus Twitter
connector
Solution & Methodology – NLP
● @road_lk Twitter feed
○ Reliable data source to generate real-time traffic alerts
○ Constrained by natural language representation
● Transform this data into a machine-readable representation so that
the full potential of this source can be used for a better solution
● Proposed an NLP model to address this problem
● Extracted two entities from each tweet – location and traffic level
● Before extracting these two entities:
○ A tweet needed to be classified – Traffic alert or not?
○ Cleaning, preprocessing
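The cleaning step can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the deck does not list the actual cleaning rules, so the patterns below (retweet prefix, URLs, mentions, hashtag markers) and the function name clean_tweet are assumptions.

```python
import re

# Hypothetical cleaning rules; the deck does not specify the real ones.
_RETWEET = re.compile(r"^RT\s+", re.IGNORECASE)
_URL = re.compile(r"https?://\S+")
_MENTION = re.compile(r"@\w+:?")
_HASHTAG = re.compile(r"#(\w+)")
_SPACES = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip Twitter-specific tokens so only the message text remains."""
    text = _RETWEET.sub("", text.strip())
    text = _URL.sub(" ", text)
    text = _MENTION.sub(" ", text)
    text = _HASHTAG.sub(r"\1", text)  # keep the hashtag word, drop '#'
    return _SPACES.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_tweet("RT @road_lk: Heavy traffic near #Borella http://t.co/abc"))
```

Keeping the hashtag word rather than dropping it is a deliberate choice here, since users may tag the location itself.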
Solution & Methodology – NLP
● NLP tasks required for classification and extraction:
○ Tweet categorization
○ Location extraction
○ Traffic level extraction
● First task – Document categorization task
● Latter two – Named entity recognition (NER) tasks
● Apache OpenNLP toolkit was used
● Custom tokenizer for street names and city names
● Traffic level NER task – A predefined set of words was selected for
tagging
● Had to consider factors – Spelling mistakes, informal language,
abbreviations
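To illustrate the idea of tagging traffic level from a predefined word set while tolerating spelling mistakes, here is a small Python sketch. It is not the OpenNLP NER model used in the study; it swaps in plain dictionary lookup with fuzzy matching from the standard difflib module. The vocabulary, the level names, and the 0.8 cutoff are all hypothetical, since the deck does not list them.

```python
import difflib
import re
from typing import Optional

# Hypothetical vocabulary mapping surface words to a traffic level.
TRAFFIC_LEVELS = {
    "heavy": "heavy",
    "slow": "moderate",
    "moderate": "moderate",
    "light": "light",
    "clear": "light",
}


def find_traffic_level(text: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the level of the first token that closely matches the vocabulary."""
    for token in re.findall(r"[a-z]+", text.lower()):
        match = difflib.get_close_matches(
            token, list(TRAFFIC_LEVELS), n=1, cutoff=cutoff
        )
        if match:
            return TRAFFIC_LEVELS[match[0]]
    return None


if __name__ == "__main__":
    print(find_traffic_level("Hevy traffic at Borella"))
```

Fuzzy matching of this kind can over-match unrelated words that happen to be spelled similarly, which is one reason a trained model with context may be preferable.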
Solution & Methodology – CEP
● Another important requirement for this data source – the Twitter
feed must be processed in real time
● Our approach was complex event processing (CEP)
● CEP is a field concerned with processing data from multiple sources
in real time
● Used WSO2 Complex Event Processor as the CEP tool to analyse
and process Twitter feed input stream
● Siddhi Query Language (SiddhiQL) is at the core of WSO2 CEP
● Designed to process event streams and identify complex event
occurrences
Solution & Methodology – Siddhi Queries
from classifiedStream#transform.nlp:getEntities(convertedText,4,true,"/_system/governance/en-location.bin")
select *
insert into templocationStream;

from classifiedStream#transform.nlp:getEntities(convertedText,1,false,"/_system/governance/en-trafficlevel.bin")
select *
insert into temptrafficlevelStream;

from S1=classifiedStream, S2=temptrafficlevelStream, S3=templocationStream
select S1.createdAt as time, S2.nameElement1 as trafficLevel,
       S3.nameElement1 as location1, S3.nameElement2 as location2,
       S3.nameElement3 as location3, S3.nameElement4 as location4
insert into locationsStream;

from uiFeedStream#window.time(120 min) as trafficFeed join SearchEventStream as request
  on (trafficFeed.latitude < request.latitude + 0.018 and trafficFeed.latitude > request.latitude - 0.018 and
      trafficFeed.longitude < request.longitude + 0.027 and trafficFeed.longitude > request.longitude - 0.027)
select trafficFeed.formattedAddress, trafficFeed.latitude, trafficFeed.longitude, trafficFeed.level, trafficFeed.time
insert into searchResult;
Solution & Methodology – CEP
● Siddhi queries define how to process and combine existing event
streams to create new event streams
● SiddhiQL was extended with custom extensions for:
○ Tweet categorization
○ Named entity recognition
○ Geocoding
● The geocoding extension converts locations into geographic
coordinates
● The search functionality used a time-based Siddhi window
○ To retrieve traffic alerts in a nearby geographic area within a
predefined time period
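The logic of the search query (the join between uiFeedStream and SearchEventStream on the Siddhi Queries slide) can be mirrored outside a CEP engine. The Python sketch below is only an illustration of that logic: the 120-minute window and the 0.018 / 0.027 degree offsets come from the query, while the TrafficEvent class, the search_nearby function, and the use of the event's own timestamp (rather than arrival time in the engine) are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

LAT_DELTA = 0.018                # degrees, from the query's join condition
LON_DELTA = 0.027                # degrees, from the query's join condition
WINDOW = timedelta(minutes=120)  # mirrors window.time(120 min)


@dataclass
class TrafficEvent:
    formatted_address: str
    latitude: float
    longitude: float
    level: str
    time: datetime


def search_nearby(feed: List[TrafficEvent], lat: float, lon: float,
                  now: datetime) -> List[TrafficEvent]:
    """Return events inside the time window and the bounding box around (lat, lon)."""
    return [
        e for e in feed
        if timedelta(0) <= now - e.time <= WINDOW
        and abs(e.latitude - lat) < LAT_DELTA
        and abs(e.longitude - lon) < LON_DELTA
    ]


if __name__ == "__main__":
    now = datetime(2015, 1, 1, 12, 0)
    # Illustrative coordinates only, not data from the study.
    feed = [TrafficEvent("A", 6.93, 79.86, "heavy", now - timedelta(minutes=30))]
    print(search_nearby(feed, 6.9271, 79.8612, now))
```

The bounding box is a cheap approximation of "nearby": it avoids a distance computation per event at the cost of treating the search area as a rectangle rather than a circle.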
Results & Conclusion
● Implemented a web based interface to demonstrate the
functionalities
● Users can interact with this interface to try out the use cases
● Accuracy of the NLP models was measured through the OpenNLP
evaluation APIs
● A solution to extract useful information from a crowdsourced social
networking service, using a combined NLP/CEP approach
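The deck does not state which accuracy measures were reported. For entity extraction, precision, recall and F-measure over predicted versus reference entity spans are the usual choice, so the following Python sketch shows how those are computed. The Span representation and the function name are hypothetical, and the sample spans in the usage lines are toy values, not results from the study.

```python
from typing import Set, Tuple

# (start token index, end token index, entity type); a hypothetical representation.
Span = Tuple[int, int, str]


def precision_recall_f1(predicted: Set[Span],
                        reference: Set[Span]) -> Tuple[float, float, float]:
    """Compute exact-match precision, recall and F1 over entity spans."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    predicted = {(0, 1, "location"), (3, 4, "location")}
    reference = {(0, 1, "location"), (5, 6, "location")}
    print(precision_recall_f1(predicted, reference))
```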
Results & Conclusion – Web UI
Results & Conclusion
● Results demonstrate the potential of such a model to tackle a
real-time natural language processing task
● This model can be extended to other real-time unstructured
data streams
● Transforming human-readable data into a machine-readable format
enables deeper processing of data to generate useful information and
insights
○ Trend analysis
○ Pattern detection and prediction
Thank you.