nlp for social media: pos tagging, sentiment...

34
NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE, IITKGP August 05, 2016 Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 1 / 23

Upload: others

Post on 12-Mar-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

NLP for Social Media: POS Tagging, Sentiment Analysis

Pawan Goyal

CSE, IITKGP

August 05, 2016

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 1 / 23

Page 2: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Part-of-Speech (POS) Tagging

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 2 / 23

Page 3: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Penn Treebank POS Tags

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 3 / 23

Page 4: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Part-of-Speech (POS) Tagging

Words often have more than one POS

POS tagging problem is to determine the POS tag for a particular instance of aword.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 4 / 23

Page 5: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Part-of-Speech (POS) Tagging

Words often have more than one POS

POS tagging problem is to determine the POS tag for a particular instance of aword.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 4 / 23

Page 6: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Relevant papers and tool

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das,Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, JeffreyFlanigan, and Noah A. Smith. Part-of-Speech Tagging for Twitter:Annotation, Features, and Experiments. In Proceedings of ACL 2011.

Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, NathanSchneider and Noah A. Smith. Improved Part-of-Speech Tagging forOnline Conversational Text with Word Clusters. In Proceedings of NAACL2013.

Parts-of-Speech tagget for twitter -http://www.cs.cmu.edu/~ark/TweetNLP/

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 5 / 23

Page 7: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

POS Tagset for Twitter

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 6 / 23

Page 8: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Twitter-specific Tags

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 7 / 23

Page 9: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

The case of Hashtags

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 8 / 23

Page 10: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Combined forms

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 9 / 23

Page 11: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Tagging Scheme

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 10 / 23

Page 12: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Tagging Scheme

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 11 / 23

Page 13: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Tagging Scheme

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 12 / 23

Page 14: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Page 15: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Page 16: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Page 17: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Page 18: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Page 19: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Related Problems

Entity RecognitionNamed Entity: Names of people, places, organization

Date and time

Can you model it as a sequence labeling problem?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 14 / 23

Page 20: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Related Problems

Entity RecognitionNamed Entity: Names of people, places, organization

Date and time

Can you model it as a sequence labeling problem?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 14 / 23

Page 21: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Sentiment Analysis

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 15 / 23

Page 22: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Sentiment Analysis: Relevant Paper

Pak, Alexander, and Patrick Paroubek. “Twitter as a Corpus for SentimentAnalysis and Opinion Mining.” LREC. Vol. 10. 2010.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 16 / 23

Page 23: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Sentiment Analysis: Examples

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 17 / 23

Page 24: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Data Creation

How to do that without manual labeling?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23

Page 25: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Data Creation

How to do that without manual labeling?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23

Page 26: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Data Creation

How to do that without manual labeling?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23

Page 27: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

POS tag Distribution: Subjective vs. Objective

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 19 / 23

Page 28: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

POS tag Distribution: Positive vs. Negative

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 20 / 23

Page 29: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Classification Model

Naïve Bayes Model: Main featuresPOS tags, Word n-grams

Using negations in Word n-grams

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 21 / 23

Page 30: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Classification Model

Naïve Bayes Model: Main featuresPOS tags, Word n-grams

Using negations in Word n-grams

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 21 / 23

Page 31: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Disregarding common n-grams

Using entropy and salience

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23

Page 32: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Disregarding common n-grams

Using entropy and salience

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23

Page 33: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

Disregarding common n-grams

Using entropy and salience

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23

Page 34: NLP for Social Media: POS Tagging, Sentiment Analysiscse.iitkgp.ac.in/~pawang/courses/SC16/nlp_social3.pdf · NLP for Social Media: POS Tagging, Sentiment Analysis Pawan Goyal CSE,

N-grams with high salience and low entropy

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 23 / 23