NLP for Social Media: Language Identification II and Text Normalization
Pawan Goyal
CSE, IITKGP
August 3-4, 2016
Pawan Goyal (IIT Kharagpur) NLP for Social Media: Language Identification II and Text NormalizationAugust 3-4, 2016 1 / 48
LI: Supervised Approaches
Input
A document d
A fixed set of classes C = {c1, c2, ..., cn}
A training set of m hand-labeled documents (d1, c1), ..., (dm, cm)
Output
A learned classifier γ : d → c
Supervised Machine Learning
Bayes’ rule for documents and classes
For a document d and a class c
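$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$
Naïve Bayes Classifier
$$\begin{aligned}
c_{MAP} &= \arg\max_{c \in C} P(c \mid d) \\
&= \arg\max_{c \in C} P(d \mid c)\,P(c) \\
&= \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
\end{aligned}$$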
Naïve Bayes classification assumptions
$$P(x_1, x_2, \ldots, x_n \mid c)$$
Bag of words assumption
Assume that the position of a word in the document doesn’t matter
Conditional Independence
Assume the feature probabilities $P(x_i \mid c_j)$ are independent given the class $c_j$.
$$P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)$$
$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$
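As a rough illustration of the Naïve Bayes decision rule, the sketch below scores each class and returns the best one. It sums log-probabilities instead of multiplying probabilities, a common trick against numerical underflow that the slide does not mention. The helper name `classify`, the two language labels, and all probability values are made-up assumptions for illustration only.

```python
import math

def classify(features, priors, likelihoods):
    """Return argmax_c [ log P(c) + sum_x log P(x|c) ]."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for x in features:
            # Conditional independence: each feature contributes its own term.
            score += math.log(likelihoods[c][x])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy parameters (hypothetical numbers, not taken from the lecture).
priors = {"en": 0.5, "hi": 0.5}
likelihoods = {
    "en": {"the": 0.6, "ke": 0.1, "hai": 0.3},
    "hi": {"the": 0.1, "ke": 0.5, "hai": 0.4},
}

print(classify(["ke", "hai"], priors, likelihoods))
```

Because of the bag of words assumption, the order of the features passed to `classify` does not change the result.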
Learning the model parameters
Maximum Likelihood Estimate
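$$\hat{P}(c_j) = \frac{\text{doc-count}(C = c_j)}{N_{doc}}$$
$$\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$$
Problem with MLE
Suppose in the training data, we haven’t seen one of the words (say pure) in a given language.
$$\hat{P}(\text{pure} \mid \text{Hindi}) = 0$$
Because the classifier multiplies the per-feature probabilities, this single zero drives the whole score for Hindi to zero whenever pure occurs in the document, whatever the other evidence:
$$c_{NB} = \arg\max_{c} \hat{P}(c) \prod_{x \in X} \hat{P}(x \mid c)$$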
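The zero-count problem can be reproduced with a few lines of counting. This is a minimal sketch of the maximum likelihood estimates, assuming a tiny hand-made training set; the function name `mle_estimates` and the romanized toy documents are hypothetical and not from the lecture.

```python
from collections import Counter

def mle_estimates(labeled_docs):
    """Estimate P(c) and P(w|c) by relative frequency (no smoothing)."""
    doc_counts = Counter()
    word_counts = {}
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)

    n_docs = sum(doc_counts.values())
    priors = {c: doc_counts[c] / n_docs for c in doc_counts}

    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        likelihoods[c] = {w: counts[w] / total for w in counts}
    return priors, likelihoods

# Hypothetical training data: (tokens, language label).
train = [
    (["pure", "gold", "ring"], "English"),
    (["yeh", "sona", "hai"], "Hindi"),
    (["sona", "accha", "hai"], "Hindi"),
]
priors, likelihoods = mle_estimates(train)

# "pure" never occurs in a Hindi document, so its MLE probability is zero.
p_pure_hindi = likelihoods["Hindi"].get("pure", 0.0)
print(p_pure_hindi)
```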
Laplace (add-1) smoothing
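$$\hat{P}(w_i \mid c) = \frac{\text{count}(w_i, c) + 1}{\sum_{w \in V} \left(\text{count}(w, c) + 1\right)} = \frac{\text{count}(w_i, c) + 1}{\left(\sum_{w \in V} \text{count}(w, c)\right) + |V|}$$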
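A small sketch of the add-1 estimate, assuming a toy vocabulary and toy Hindi counts; the function name `laplace_likelihood` and the data are illustrative assumptions. It shows that an unseen word now gets a small non-zero probability while the estimates over the vocabulary still sum to one.

```python
from collections import Counter

def laplace_likelihood(word, class_counts, vocabulary):
    """Add-1 smoothed P(word | class) over a fixed vocabulary."""
    total = sum(class_counts[w] for w in vocabulary)
    return (class_counts[word] + 1) / (total + len(vocabulary))

# Hypothetical vocabulary and per-class word counts.
vocabulary = {"pure", "gold", "ring", "yeh", "sona", "accha", "hai"}
hindi_counts = Counter(["yeh", "sona", "hai", "sona", "accha", "hai"])

p_pure = laplace_likelihood("pure", hindi_counts, vocabulary)  # unseen word
p_sona = laplace_likelihood("sona", hindi_counts, vocabulary)  # seen twice
print(p_pure, p_sona)
```

With 6 Hindi tokens and |V| = 7, the denominator is 13, so the unseen word gets 1/13 and a word seen twice gets 3/13.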
A worked out example
A worked out example: No smoothing
A worked out example: Smoothing
Character n-gram based Approach
Input: A word w (e.g., kshiprata)
Features: character n-grams (n=2 to 5)
Classifier: Naïve Bayes, Max-Ent, SVMs
Prob (kshiprata is Sanskrit) ≫ Prob (kshiprata is English)
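The feature extraction step can be sketched as follows. This is a minimal version that assumes plain overlapping substrings with no word-boundary padding (the slide does not say whether boundary markers are used); `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word, n_min=2, n_max=5):
    """Return all character n-grams of `word` for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # A word of length L has L - n + 1 n-grams (none if n > L).
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

grams = char_ngrams("kshiprata")
print(len(grams), grams[:5])
```

Each n-gram can then be treated as one feature x in the Naïve Bayes product, or fed to a Max-Ent or SVM classifier.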
LangID Tools
Using langid.py
https://github.com/saffsd/langid.py
Supports 97 languages
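A minimal usage sketch, assuming the package has been installed (for example with `pip install langid`). It relies on `langid.classify`, which returns a (language code, score) pair, and on `langid.set_languages`, which restricts the candidate languages. The wrapper name `identify` and the import guard are additions for illustration and are not part of the library.

```python
try:
    import langid  # third-party package, not in the standard library
except ImportError:
    langid = None  # lets the sketch load even when langid is not installed

def identify(text, allowed=None):
    """Return (language_code, score) from langid.py, or None if unavailable."""
    if langid is None:
        return None
    if allowed is not None:
        # Restricts the candidate set; this changes module-level state.
        langid.set_languages(allowed)
    return langid.classify(text)

print(identify("This is a short English sentence."))
```

Restricting the candidate set to the languages actually expected can help on short, noisy social media text, at the cost of never predicting anything outside that set.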
Word-level Language Labeling
Modeling as a Sequence Prediction Problem
Given X: X1 = Modi, X2 = ke, ...
Output: Y = Y1 (label for X1), Y2 (label for X2), ...
Such that p(Y|X) is maximized
Conditional Random Fields: Modelling the Conditional Distribution
Conditional Random Fields: Feature Functions
Feature Functions
Conditional Random Fields: Distribution
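The formula on this slide did not survive extraction. For orientation only, the textbook form of a linear-chain CRF is given below; the symbols $\lambda_k$ (feature weights), $f_k$ (feature functions), $T$ (sequence length) and $Z(X)$ (normalizer) are standard notation assumed here and may differ from what the lecture used.

```latex
p(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(Y_{t-1}, Y_t, X, t) \right),
\qquad
Z(X) = \sum_{Y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(Y'_{t-1}, Y'_t, X, t) \right)
```

The predicted label sequence is the Y that maximizes p(Y|X); since Z(X) does not depend on Y, it can be ignored at prediction time.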
Features for word level Language Identification
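The slide's feature list was an image and is not recoverable here. As a hedged illustration of what per-token features for word-level labeling might look like, the sketch below builds a feature dictionary from the token itself, its neighbours, and its character n-grams. Every feature name, the n-gram range, and the example sentence are assumptions, not the lecture's actual feature set.

```python
def word_features(tokens, i, n_min=2, n_max=3):
    """Build a feature dict for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        "lower": lower,
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Context words, with sentence-boundary placeholders.
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
    # Character n-grams of the lowercased token as binary features.
    for n in range(n_min, n_max + 1):
        for j in range(len(lower) - n + 1):
            feats["ng=" + lower[j:j + n]] = True
    return feats

tokens = ["Modi", "ke", "speech"]  # "speech" is an invented continuation
feats = word_features(tokens, 1)
print(feats)
```

One such dictionary per token, collected over a sentence, is the kind of input that sequence labelers such as CRF toolkits typically consume.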
Lexical Normalization
Characteristics of Text in Social Media
Social media text contains varying levels of “noise” (lexical, syntactic and otherwise), e.g.
Why is Social Media Text “Bad”?
Eisenstein [2013] identified the following possible contributing factors to “badness” in social media text:
Lack of literacy? no
Length restrictions? not primarily
Text input method-driven? to some degree, yes
Pragmatics (mimicking prosodic effects etc. in speech)? yeeees
Eisenstein, What to do about bad language on the internet, NAACL-HLT, 2013
What can be done about it?
Lexical normalization
Translate expressions into their canonical form
Issues
What are the candidate tokens for normalization?
To what degree do we allow normalization?
What is the canonical form of a given expression? (e.g., aint)
Is lexical normalization always appropriate? (e.g., bro)
Task Definition
One standard definition:
relative to some standard tokenization
consider only OOV tokens as candidates for normalization
allow only one-to-one word substitutions
Assumptions/corollaries of the task definition:
not possible to normalize in-vocabulary tokens, e.g., their
not possible to normalize multiword tokens, e.g., ttyl
ignore Twitter-specific entities, e.g., @obama, #mandela, bit.ly/1iRqm
assume a unique correct “norm” for each token
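A minimal sketch of how this task definition narrows the set of tokens to normalize, assuming whitespace tokenization and a tiny hand-made vocabulary (VOCAB) in place of a real dictionary. The function name, the regular expression and the example tweet are illustrative assumptions, not part of the original slides.

```python
import re

# Toy in-vocabulary word list; a real system would load a full dictionary.
VOCAB = {"see", "you", "their", "house", "tomorrow", "at"}

# Mentions, hashtags and URL-like tokens are ignored by the task definition.
TWITTER_ENTITY = re.compile(r"^(@\w+|#\w+|https?://\S+|\w+\.\w+/\S+)$")


def normalization_candidates(tokens):
    """Return the tokens that the task definition allows us to normalize."""
    candidates = []
    for tok in tokens:
        if TWITTER_ENTITY.match(tok):
            continue  # Twitter-specific entity: left untouched
        if tok.lower() in VOCAB:
            continue  # in-vocabulary, so never a candidate (even a misused "their")
        candidates.append(tok)
    return candidates


tokens = "c you at their house 2morrow @obama #mandela bit.ly/1iRqm".split()
print(normalization_candidates(tokens))  # ['c', '2morrow']
```

Only the OOV tokens c and 2morrow survive. The in-vocabulary token their is skipped even if it is misused, which is exactly the limitation listed among the assumptions.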
Spelling Errors
Understanding unintentional spelling errors
Edit Distance
What about spelling errors in Social Media?
The case of ‘Tomorrow’
Patterns or Compression Operators
Phonetic substitution (phoneme): psycho → syco, then → den
Phonetic substitution (syllable): today → 2day, see → c
Deletion of vowels: message → mssg, about → abt
Deletion of repeated characters: tomorrow → tomorow
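Two of these operators are simple enough to write down directly. The sketch below is illustrative only: it assumes lowercase input and keeps a word-initial vowel, which is what the about → abt example suggests. The function names are not from the slides.

```python
import re

VOWELS = "aeiou"


def delete_vowels(word):
    """Drop every vowel except a word-initial one (about -> abt)."""
    if not word:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch not in VOWELS)


def delete_repeated_chars(word):
    """Collapse each run of identical characters to one (tomorrow -> tomorow)."""
    return re.sub(r"(.)\1+", r"\1", word)


for w in ("message", "about"):
    print(w, "->", delete_vowels(w))
print("tomorrow ->", delete_repeated_chars("tomorrow"))
```

The phonetic substitutions are harder to capture with string rules alone, since they depend on pronunciation rather than spelling.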
Patterns or Compression Operators
Truncation (deletion of tails): introduction → intro, evaluation → eval
Common abbreviations: Kharagpur → kgp, text back → tb
Informal pronunciation: going to → gonna
Emphasis by repetition: funny → fuuuunnnnnyyyyyy
Successive Application of Operators
because → cause (informal usage)
cause → cauz (phonetic substitution)
cauz → cuz (vowel deletion)
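The chain can be replayed as a sequence of string rewrites. This is a toy sketch: each step is hand-written for this one example and is an assumption about how the operator might be expressed, not a general rule.

```python
# Each step is a hand-written rewrite for this one example, not a general rule.
STEPS = [
    ("informal usage", lambda w: w[2:] if w.startswith("be") else w),
    ("phonetic substitution", lambda w: w[:-2] + "z" if w.endswith("se") else w),
    ("vowel deletion", lambda w: w.replace("a", "", 1)),
]


def apply_steps(word, steps):
    """Apply the operators in order and record every intermediate form."""
    forms = [word]
    for _name, op in steps:
        forms.append(op(forms[-1]))
    return forms


print(" -> ".join(apply_steps("because", STEPS)))  # because -> cause -> cauz -> cuz
```

Since operators compose in this way, one standard word can give rise to many surface variants. A normalizer has to undo the whole chain, not a single edit.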
Categorisation of non-standard words in English Twitter
Most non-standard words in sampled tweets are morphophonemic variations.
Token-based Approach (Han and Baldwin, 2011)
Pre-processing
Filter out any Twitter-specific tokens (user mentions, hashtags, RT, etc.) and URLs
Identify all OOV words relative to a standard spelling dictionary (aspell)
For OOV words, shorten any repetitions of 3+ letters to 2 letters
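The three pre-processing steps can be sketched as follows. The small DICTIONARY set is a hypothetical stand-in for the aspell dictionary, and the regular expressions and example tweet (including its mention and URL) are made up for illustration.

```python
import re

# Hypothetical stand-in for the aspell dictionary named in the slide.
DICTIONARY = {"so", "funny", "i", "am", "looking", "for", "you"}

TWITTER_TOKEN = re.compile(r"^(@\w+|#\w+|RT|https?://\S+)$")
REPEATS = re.compile(r"([A-Za-z])\1{2,}")  # the same letter three or more times


def preprocess(tweet):
    """Return (token, is_oov) pairs after filtering and repetition shortening."""
    result = []
    for tok in tweet.split():
        if TWITTER_TOKEN.match(tok):
            continue  # mentions, hashtags, RT markers and URLs are dropped
        is_oov = tok.lower() not in DICTIONARY
        if is_oov:
            tok = REPEATS.sub(r"\1\1", tok)  # fuuuunnnnnyyyyyy -> fuunnyy
        result.append((tok, is_oov))
    return result


print(preprocess("RT @user so fuuuunnnnnyyyyyy #lol http://t.co/x"))
# [('so', False), ('fuunnyy', True)]
```

Shortening to two letters rather than one keeps legitimate double letters intact. The remaining distance to the standard form (fuunnyy to funny) is left to candidate generation.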
Candidate Generation
Candidates are generated via edit distance over letters (Tc) and over phonemes (Tp).
This makes it possible to generate “earthquake” for words such as earthquick.
Candidates with Tc ≤ 2 ∨ Tp ≤ 1 were taken, then further filtered by frequency to keep the top 10% of candidates.
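A minimal sketch of the character-level half of this test (Tc ≤ 2), using plain Levenshtein distance over a toy lexicon. The phoneme-level test (Tp) would need a phonetic encoding of each word and is not implemented here. The lexicon and function names are assumptions for illustration.

```python
def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute; cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_candidates(word, lexicon, max_dist=2):
    """Lexicon entries within max_dist character edits of word (the Tc test)."""
    return sorted(w for w in lexicon if levenshtein(word, w) <= max_dist)


LEXICON = {"tomorrow", "tomato", "earthquake", "borrow"}  # toy lexicon
print(char_candidates("tomorow", LEXICON))      # ['tomorrow']
print(levenshtein("earthquick", "earthquake"))  # 3
```

The second result shows why the phoneme condition matters. At the character level earthquick is three edits away from earthquake, so Tc ≤ 2 alone would miss it.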
Detection of Ill-formed words
Detection based on candidate context fitness:
Correct words should fit better with context than substitution candidates
Incorrect words should fit worse than substitution candidates
Basic Idea: Use Dependencies from corpus data
An SVM classifier is trained based on dependencies, to indicate candidate context fitness.
Feature Representation using Dependencies
Build a dependency bank from existing corpora
Represent each dependency tuple as a word pair + positional index
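One possible reading of "word pair + positional index" is a triple of head word, dependent word and the signed offset between them. The sketch below assumes that reading. The arcs are written by hand in place of real parser output, and the sentence is a made-up example.

```python
from collections import Counter


def dependency_features(tokens, arcs):
    """Turn (head_index, dependent_index) arcs into (head, dependent, offset)."""
    return [(tokens[h], tokens[d], d - h) for h, d in arcs]


# Hand-written arcs standing in for parser output on a clean corpus sentence.
tokens = ["one", "obvious", "difference", "is", "the", "way", "they", "look"]
arcs = [(7, 5), (2, 1)]  # look -> way, difference -> obvious

# The "dependency bank" is then a count of how often each tuple was observed.
bank = Counter(dependency_features(tokens, arcs))
print(bank[("look", "way", -2)])  # 1
```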
SVM Training Data Generation
Use dependency bank directly as positive features
Automatically generate negative dependency features by replacing the target word with highly-ranked confusion candidates
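A minimal sketch of this data generation step, assuming dependency tuples of the form (word, context word, offset) and a hypothetical confusion_sets mapping from a word to its highly-ranked confusion candidates. Which word in the tuple counts as the target is an assumption here.

```python
def make_training_examples(positive_tuples, confusion_sets):
    """Label bank tuples +1 and confusion-substituted copies -1."""
    examples = [(t, 1) for t in positive_tuples]
    seen = set(positive_tuples)
    for target, context, offset in positive_tuples:
        for wrong in confusion_sets.get(target, []):
            negative = (wrong, context, offset)
            if negative not in seen:  # never contradict an attested dependency
                examples.append((negative, -1))
                seen.add(negative)
    return examples


positives = [("look", "way", -2)]
confusions = {"look": ["lock", "hook"]}  # hypothetical confusion candidates
for example in make_training_examples(positives, confusions):
    print(example)
```

Generating negatives this way needs no manual annotation. The classifier learns to separate attested word-context pairs from plausible-looking but unattested ones.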
Detecting ill-formed words
OOV words with candidates fitting the context (i.e., positive classification outputs) are probably ill-formed words
Threshold = 1 → lookin is considered to be an ill-formed word
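The thresholding rule can be sketched as follows. The `classify` callable stands in for the trained SVM (here a simple membership test on a toy dependency bank), and treating the threshold as "at least this many positive outputs" is an assumption about how the comparison is made.

```python
def is_ill_formed(candidates, context, classify, threshold=1):
    """Flag an OOV word when enough (candidate, context) features are classified positive.

    candidates: correction candidates for the OOV word
    context:    list of (context_word, offset) pairs around the OOV word
    classify:   stand-in for the trained classifier, returns True for a positive output
    """
    positives = sum(
        1
        for cand in candidates
        for ctx_word, offset in context
        if classify((cand, ctx_word, offset))
    )
    return positives >= threshold

# Toy stand-in classifier: positive iff the tuple is in a small dependency bank.
toy_bank = {("looking", "way", 3)}
flag_t1 = is_ill_formed(["looking", "locking"], [("way", 3)], toy_bank.__contains__, threshold=1)
flag_t2 = is_ill_formed(["looking", "locking"], [("way", 3)], toy_bank.__contains__, threshold=2)
```

With one positive output, the toy call for "lookin" is flagged at threshold 1 but not at threshold 2, which shows how a higher threshold trades recall for precision.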
![Page 103: NLP for Social Media: Language Identification II and Text ...cse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social2.pdf · Pawan Goyal (IIT Kharagpur) NLP for Social Media: Language Identification](https://reader034.vdocuments.mx/reader034/viewer/2022042910/5f3f530c08c6fa7cca3b505c/html5/thumbnails/103.jpg)
Normalization Candidate Selection
For each ill-formed word and its possible correction candidates, the following features are considered for normalization:
Word Similarity: letter and phoneme edit distance (ED); prefix, suffix, and longest common subsequence
Context Support: trigram language model score; dependency score (weighted dependency count, derived from the detection step)
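The letter-level word similarity features can be computed with standard dynamic programming, as in this minimal sketch. Phoneme edit distance and the context-support scores are left out because they need a phonetic encoder, a language model, and the dependency bank; the function names and the feature dictionary layout are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance over letters (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def common_prefix_length(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def similarity_features(word, cand):
    """Letter-based similarity features for an (ill-formed word, candidate) pair."""
    return {
        "edit_distance": edit_distance(word, cand),
        "lcs": lcs_length(word, cand),
        "prefix": common_prefix_length(word, cand),
        # The common suffix is the common prefix of the reversed strings.
        "suffix": common_prefix_length(word[::-1], cand[::-1]),
    }

features = similarity_features("lookin", "looking")
```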
![Page 105: NLP for Social Media: Language Identification II and Text ...cse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social2.pdf · Pawan Goyal (IIT Kharagpur) NLP for Social Media: Language Identification](https://reader034.vdocuments.mx/reader034/viewer/2022042910/5f3f530c08c6fa7cca3b505c/html5/thumbnails/105.jpg)
Type-based approach
Observation: The longer the ill-formed word, the more likely there is a unique normalization candidate
Approach: Construct a dictionary of (lexical variant, standard form) pairs for longer word types (character length ≥ 4) of moderate frequency (≥ 16)
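The length and frequency filter can be sketched as below. Applying both thresholds to out-of-vocabulary types, and the function and parameter names, are assumptions made for illustration; the slide only gives the two cut-off values.

```python
from collections import Counter

def candidate_types(tokens, vocabulary, min_length=4, min_count=16):
    """Select OOV word types that are long enough and frequent enough."""
    counts = Counter(t for t in tokens if t not in vocabulary)
    return {
        word
        for word, count in counts.items()
        if len(word) >= min_length and count >= min_count
    }

# Toy token stream: "u" is too short, "2morow" too rare, "tomorrow" is in-vocabulary.
tokens = ["tmrw"] * 16 + ["u"] * 50 + ["2morow"] * 3 + ["tomorrow"] * 20
selected = candidate_types(tokens, vocabulary={"tomorrow"})
```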
![Page 108: NLP for Social Media: Language Identification II and Text ...cse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social2.pdf · Pawan Goyal (IIT Kharagpur) NLP for Social Media: Language Identification](https://reader034.vdocuments.mx/reader034/viewer/2022042910/5f3f530c08c6fa7cca3b505c/html5/thumbnails/108.jpg)
Type-based Approach (Han et al., 2012)
Construct the dictionary based on distributional similarity + string similarity
Input: Tokenised English tweets
Extract (OOV, IV) pairs based on distributional similarity
Re-rank the extracted pairs by string similarity
Output: A list of (OOV, IV) pairs ordered by string similarity; select the top-n pairs for inclusion in the normalisation lexicon.
![Page 111: NLP for Social Media: Language Identification II and Text ...cse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social2.pdf · Pawan Goyal (IIT Kharagpur) NLP for Social Media: Language Identification](https://reader034.vdocuments.mx/reader034/viewer/2022042910/5f3f530c08c6fa7cca3b505c/html5/thumbnails/111.jpg)
An Example
![Page 112: NLP for Social Media: Language Identification II and Text ...cse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social2.pdf · Pawan Goyal (IIT Kharagpur) NLP for Social Media: Language Identification](https://reader034.vdocuments.mx/reader034/viewer/2022042910/5f3f530c08c6fa7cca3b505c/html5/thumbnails/112.jpg)
Context Modelling
Components/parameters of the method:
context window size: ±1, ±2, ±3
context word sensitivity: bag-of-words vs. positional indexing
context word representation: unigram, bigram or trigram
context word filtering: all tokens vs. only dictionary words
context similarity: KL divergence, Jensen-Shannon divergence, cosine similarity, Euclidean distance
Tune parameters relative to (OOV, IV) pair development data
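A few of these parameters can be made concrete with the following minimal sketch. It covers only unigram context words, a configurable window, the bag-of-words vs. positional choice, and Jensen-Shannon divergence (base 2) as the similarity measure. The function names and toy sentences are hypothetical, and which setting works best is exactly what the tuning step is meant to decide.

```python
import math
from collections import Counter

def context_counts(sentences, target, window=2, positional=True):
    """Count context words within +/- window tokens of every occurrence of target."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Positional indexing keeps the signed offset; bag-of-words drops it.
                    key = (tokens[j], j - i) if positional else tokens[j]
                    counts[key] += 1
    return counts

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two count distributions."""
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    if p_total == 0 or q_total == 0:
        raise ValueError("both words need at least one observed context")
    div = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts[key] / p_total
        q = q_counts[key] / q_total
        m = (p + q) / 2
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div

sentences = [
    ["see", "you", "tmrw", "night"],
    ["see", "you", "tomorrow", "night"],
    ["i", "like", "pizza", "a", "lot"],
]
close = js_divergence(context_counts(sentences, "tmrw"), context_counts(sentences, "tomorrow"))
far = js_divergence(context_counts(sentences, "tmrw"), context_counts(sentences, "pizza"))
```

Lower divergence means more similar contexts, so "tmrw" pairs with "tomorrow" rather than "pizza" in this toy data. Unlike KL divergence, the Jensen-Shannon variant is symmetric and stays finite when one word has contexts the other lacks.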
![Page 114: NLP for Social Media: Language Identification II and Text ...cse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social2.pdf · Pawan Goyal (IIT Kharagpur) NLP for Social Media: Language Identification](https://reader034.vdocuments.mx/reader034/viewer/2022042910/5f3f530c08c6fa7cca3b505c/html5/thumbnails/114.jpg)
Rerank pairs by string similarity
(OOV,IV) pairs derived by distributional similarity:
Get the top-ranked pairs as lexicon entries:
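The re-ranking step can be sketched as below. The string similarity measure here is `difflib.SequenceMatcher.ratio()` from the Python standard library, chosen only as a convenient stand-in; the slides do not say which measure is used, and the candidate pairs are made up for illustration.

```python
from difflib import SequenceMatcher

def rerank_pairs(pairs, top_n):
    """Order (OOV, IV) pairs by string similarity and keep the top_n."""
    ranked = sorted(
        pairs,
        key=lambda pair: SequenceMatcher(None, pair[0], pair[1]).ratio(),
        reverse=True,
    )
    return ranked[:top_n]

# Hypothetical pairs, as if produced by the distributional-similarity step.
pairs = [
    ("tmrw", "tomorrow"),
    ("tmrw", "tonight"),
    ("4eva", "forever"),
    ("lookin", "looking"),
]
lexicon_entries = rerank_pairs(pairs, top_n=2)
```

Pairs that share contexts but not spelling, such as ("tmrw", "tonight"), fall to the bottom of the ranking, which is the point of combining the two similarity signals.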
![Page 116: NLP for Social Media: Language Identification II and Text ...cse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social2.pdf · Pawan Goyal (IIT Kharagpur) NLP for Social Media: Language Identification](https://reader034.vdocuments.mx/reader034/viewer/2022042910/5f3f530c08c6fa7cca3b505c/html5/thumbnails/116.jpg)
Main References
Han, Bo, and Timothy Baldwin. “Lexical normalisation of short text messages: Makn sens a #twitter.” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 2011.
Han, Bo, Paul Cook, and Timothy Baldwin. “Automatically constructing a normalisation dictionary for microblogs.” Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.