Lexical Normalisation of Twitter Data
TRANSCRIPT

[Slide 1]
Lexical Normalisation of Twitter Data
Bilal Ahmed, The University of Melbourne, Australia
Science and Information Conference 2015, July 28-30, 2015 | London, UK
[Slide 2]
Rise of the Internet

[Chart: global internet users, axis 1B-3B]
World population 2015: 7.3 billion
Internet users 2015: 3.15 billion (over 40%)
[Slide 3]
Knowledge is being created and shared at an unprecedented rate
Social Networking
Social is now the #1 use of the internet, with 94% using it to learn, 78% to share knowledge and 49% to engage experts.
With 2 billion social connections and more than 3 billion expressions per day, it’s fuelling the emergence of a knowledge economy.
[Slide 4]
In social media, every 60 seconds:
• YouTube users upload 72 hours of new video content.
• Apple users download nearly 50,000 apps.
• Email users send over 200 million messages.
• Amazon generates over $80,000 in online sales.
• Facebook users share nearly 2.5 million pieces of content.
• Twitter users tweet nearly 300,000 times.
• Instagram users post nearly 220,000 new photos.
[Slide 5]
Twitter
Twitter is the largest archive of public human thought that has ever existed.
Many other social platforms are private by default, which makes it difficult or impossible to analyse their data. Twitter, by contrast, is a public platform by default.
Twitter has been described by some as a real-time focus group for anything you want to explore.
[Slide 6]
@bilal_ahmed01 #London #ManchesterUnited
[Slide 7]
What is a Tweet?

Visible textual data: 140 characters.
Over 50 metadata elements:
• Language
• Profile
• Followers
• Location (geo tags)
• Device

Users broadcast their thoughts and opinions for all to see.
[Slide 8]
The 140-Character Limit

4U → for you
mos → most
fone → phone
yrs → years
neva → never
srsly → seriously
[Slide 9]
Problem

c u 2morw!!! → See you tomorrow!
[Slide 10]
Processing

Lexical Analysis
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens, i.e. meaningful character strings.

Lexical Normalisation
Lexical normalisation is the process of transforming tokens into a canonical form consistent with the dictionary and grammar. These tokens include words that are misspelt or, in the case of Twitter, intentionally shortened (elisions) due to the character limit.
[Slide 11]
Step 1 – Tokenize (Lexical Analysis)

Raw tweet: c u 2morw!!!
Tokens: [c] [u] [2morw] [!!!]
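The tokenisation step can be sketched as a single regular-expression pass. This is a hypothetical minimal version for illustration; the talk does not show the author's actual tokenizer.

```python
import re

# Minimal tokenizer sketch: split a raw tweet into word tokens and
# runs of punctuation/symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+|[^A-Za-z0-9'\s]+")

def tokenize(tweet):
    """Convert a sequence of characters into a sequence of tokens."""
    return TOKEN_RE.findall(tweet)

print(tokenize("c u 2morw!!!"))  # ['c', 'u', '2morw', '!!!']
```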
[Slide 12]
Step 2 – Classifying Tokens

Each token is classified into one of three categories:
• In Vocabulary (IV): Watch, Either, Say, Mean, Call, Episode
• Out of Vocabulary (OOV): u, ur, c
• Non-Candidates (NO): @ # ! $ :
[Slide 13]
In Vocabulary (IV) Tokens

Watch, Either, Say, Mean, Call, Episode

Tokens are matched against a lexicon of 115,326 words to identify "in vocabulary" words.
[Slide 14]
Non-Candidate (NO) Tokens

@ # ! $ :

These are parsed using regular expressions to identify special characters, punctuation and Twitter-specific symbols.
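The classification described on slides 12-14 can be combined into one sketch. The lexicon and regular expression below are illustrative stand-ins: the real system matches against a 115,326-word lexicon.

```python
import re

# Tiny stand-in for the 115,326-word lexicon used in the talk.
LEXICON = {"watch", "either", "say", "mean", "call", "episode", "you", "see"}

# Non-candidates: @mentions, #hashtags, and runs of punctuation/symbols.
NON_CANDIDATE_RE = re.compile(r"^(?:[@#]\w+|[^\w\s]+)$")

def classify(token):
    """Label a token IV (in vocabulary), NO (non-candidate) or OOV."""
    if NON_CANDIDATE_RE.match(token):
        return "NO"
    if token.lower() in LEXICON:
        return "IV"
    return "OOV"

print([(t, classify(t)) for t in ["Watch", "u", "#London", "!!!", "2morw"]])
```

Only tokens labelled OOV are passed on to the normalisation pipeline.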
[Slide 15]
Lexical Normalisation
Out of Vocabulary (OOV) Tokens

OOV: u, ur, c → IV: you, your, see

Each OOV token is transformed into a canonical form consistent with the dictionary and grammar.
[Slide 16]
Lexical Normalisation
An OOV token is passed through a pipeline of techniques:
• String matching: Levenshtein distance
• Phonetic matching: Refined Soundex
• Peter Norvig's algorithm
• 5-gram context matching
[Slide 17]
Step 1 – Levenshtein Distance
Once a candidate has been identified for normalisation, edit distance (Levenshtein distance) is applied first, to find matches from the dictionary words.utf-8.txt (645,288 words) that are within an edit distance of 2 (inclusive) of the query. The results are stored in an array, a[0]…a[7] in the slide's example. We refer to this set as the "First Set of Matches based on Edit Distance", since it contains approximate matches based on their textual similarity to the query.
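This step can be sketched as follows; the word list here is a toy stand-in for words.utf-8.txt.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

# Toy stand-in for the 645,288-word words.utf-8.txt.
WORDS = ["never", "fever", "lever", "nevada", "nova"]

def first_set(query, max_dist=2):
    """The 'First Set of Matches based on Edit Distance' (a[...])."""
    return [w for w in WORDS if levenshtein(query, w) <= max_dist]

print(first_set("neva"))  # ['never', 'nevada', 'nova']
```

A linear scan over 645,288 words is affordable for a demonstration, though a production system would typically use a trie or BK-tree to prune the search.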
[Slide 18]
Step 2 – Refined Soundex
Refined Soundex is used to further analyse the words gathered in the first set (the edit-distance matches from Step 1, as described in Section 4.1). The words in the array are filtered on their phonetic similarity to the query, yielding a smaller array of phonetic matches, b[0]…b[4] in the slide's example (Figure 2).
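The phonetic filter can be sketched as below. The talk does not give its exact Refined Soundex implementation; the letter-to-digit mapping here follows the variant implemented in Apache Commons Codec.

```python
# Refined Soundex letter codes for A-Z (Apache Commons Codec mapping).
MAPPING = "01360240043788015936020505"

def refined_soundex(word):
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out, last = letters[0], None
    for c in letters:
        code = MAPPING[ord(c) - ord("A")]
        if code != last:            # collapse consecutive duplicate codes
            out += code
        last = code
    return out

def phonetic_matches(query, candidates):
    """Filter the edit-distance set a[...] down to the b[...] array."""
    q = refined_soundex(query)
    return [w for w in candidates if refined_soundex(w) == q]

print(refined_soundex("tomorrow"))                         # 'T608090'
print(phonetic_matches("tomoro", ["tomorrow", "tomato"]))  # ['tomorrow']
```

Unlike classic Soundex, Refined Soundex keeps vowel positions and does not truncate to four characters, which makes it a finer-grained filter.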
[Slide 19]
Step 3 – Peter Norvig's Algorithm
The algorithm generates all possible terms within an edit distance of 2 of the query term (covering deletes, transposes, replaces and inserts) and looks them up in a dictionary built from a corpus of about one million words (big.txt), returning a single best correction.
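In outline, Norvig's corrector picks the most frequent known word within edit distance 2 of the query. This condensed sketch trains on a toy corpus instead of big.txt:

```python
from collections import Counter

# Toy corpus standing in for big.txt (~1 million words).
WORD_FREQS = Counter("see you tomorrow see you soon tomorrow never comes".split())

def edits1(word):
    """All strings one delete, transpose, replace or insert away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORD_FREQS}

def correction(word):
    """Most frequent known word within edit distance <= 2 of the query."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORD_FREQS[w])

print(correction("tomorow"))  # 'tomorrow'
print(correction("yuo"))      # 'you'
```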
[Slide 20]
Step 4 – Compare Results
The correction returned by Peter Norvig's algorithm is then compared (Figure 4) with the phonetically matched words derived in Step 2 (edit distance + Refined Soundex).
[Slide 22]
Step 5 – 5-Gram Context Matching
If more than one phonetic match is found, i.e. if the Refined Soundex technique (Section 4.2) returns more than one phonetic match, a 5-gram context-matching technique is applied. Each phonetic match is used as the query, together with the words surrounding the token in the tweet (previous word, query, next word), and looked up in a corpus of one million 5-grams (W5_.txt).
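The tie-break can be sketched as counting, for each remaining phonetic match, how often it occurs between the tweet's previous and next word in the 5-gram corpus. A toy list of (count, 5-gram) pairs stands in for W5_.txt here:

```python
# Toy 5-gram corpus: (count, five words) pairs standing in for W5_.txt.
FIVE_GRAMS = [
    (42, ("i", "will", "see", "you", "tomorrow")),
    (7,  ("i", "will", "sue", "you", "tomorrow")),
]

def context_score(candidate, prev_word, next_word):
    """Total count of 5-grams containing 'prev candidate next' in sequence."""
    total = 0
    for count, gram in FIVE_GRAMS:
        for i in range(len(gram) - 2):
            if gram[i:i + 3] == (prev_word, candidate, next_word):
                total += count
    return total

def best_match(candidates, prev_word, next_word):
    """Pick the phonetic match that best fits the tweet's context."""
    return max(candidates, key=lambda c: context_score(c, prev_word, next_word))

print(best_match(["see", "sue"], "will", "you"))  # 'see'
```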
[Slide 24]
Conclusion
• Normalising tokens with high accuracy can be quite a challenge, given the number of possible variations for a given token.
• This is further compounded by the ever-increasing and evolving elisions and acronyms frequently used in social media tools such as Twitter.
• It is important to take into consideration the various normalisation techniques that are available and to pick the ones that best suit the purpose.
• A blend of techniques, such as edit distance with Soundex or Refined Soundex, usually results in better accuracy than their standalone application. Techniques based on context, such as Peter Norvig's algorithm, increase the accuracy of normalisation further.
• Similarly, N-gram matching, although exhaustive, can be optimised to produce accurate results based on the context.
[Slide 25]
Contact Bilal Ahmed
+61 432 020 777
twitter.com/bilal_ahmed01
facebook.com/mbilalimpossible
www.mbilalimpossible.com
au.linkedin.com/in/bilalahmed5