nlp for social media: pos tagging, sentiment...

Post on 12-Mar-2020

10 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

NLP for Social Media: POS Tagging, Sentiment Analysis

Pawan Goyal

CSE, IITKGP

August 05, 2016

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 1 / 23

Part-of-Speech (POS) Tagging

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 2 / 23

Penn Treebank POS Tags

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 3 / 23

Part-of-Speech (POS) Tagging

Words often have more than one POS

POS tagging problem is to determine the POS tag for a particular instance of aword.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 4 / 23

Part-of-Speech (POS) Tagging

Words often have more than one POS

POS tagging problem is to determine the POS tag for a particular instance of aword.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 4 / 23

Relevant papers and tool

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das,Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, JeffreyFlanigan, and Noah A. Smith. Part-of-Speech Tagging for Twitter:Annotation, Features, and Experiments. In Proceedings of ACL 2011.

Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, NathanSchneider and Noah A. Smith. Improved Part-of-Speech Tagging forOnline Conversational Text with Word Clusters. In Proceedings of NAACL2013.

Parts-of-Speech tagget for twitter -http://www.cs.cmu.edu/~ark/TweetNLP/

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 5 / 23

POS Tagset for Twitter

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 6 / 23

Twitter-specific Tags

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 7 / 23

The case of Hashtags

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 8 / 23

Combined forms

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 9 / 23

Tagging Scheme

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 10 / 23

Tagging Scheme

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 11 / 23

Tagging Scheme

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 12 / 23

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Building the POS tagger

CRF model was used. Some insighful features:

Twitter orthography: Features for several regular expression-style rulesthat detect at-mentions, hashtags, URLs etc.

Frequently-capitalized tokens: Compiled gazetteers of tokens whichare frequently capitalized. Features would be - membership in the top Nfrequently capitalized items.

Traditional tag dictionary: Features for all coarse-grained tags thateach word occurs with in the Penn Treebank.

Phonetic normalization: Metaphone algorithm to create a coarsephonetic normalization of words to simpler keys.

Distributional similarity: Distributional features from the successor andpredecessor probabilities for the 10,000 most common terms.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 13 / 23

Related Problems

Entity RecognitionNamed Entity: Names of people, places, organization

Date and time

Can you model it as a sequence labeling problem?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 14 / 23

Related Problems

Entity RecognitionNamed Entity: Names of people, places, organization

Date and time

Can you model it as a sequence labeling problem?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 14 / 23

Sentiment Analysis

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 15 / 23

Sentiment Analysis: Relevant Paper

Pak, Alexander, and Patrick Paroubek. “Twitter as a Corpus for SentimentAnalysis and Opinion Mining.” LREC. Vol. 10. 2010.

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 16 / 23

Sentiment Analysis: Examples

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 17 / 23

Data Creation

How to do that without manual labeling?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23

Data Creation

How to do that without manual labeling?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23

Data Creation

How to do that without manual labeling?

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 18 / 23

POS tag Distribution: Subjective vs. Objective

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 19 / 23

POS tag Distribution: Positive vs. Negative

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 20 / 23

Classification Model

Naïve Bayes Model: Main featuresPOS tags, Word n-grams

Using negations in Word n-grams

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 21 / 23

Classification Model

Naïve Bayes Model: Main featuresPOS tags, Word n-grams

Using negations in Word n-grams

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 21 / 23

Disregarding common n-grams

Using entropy and salience

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23

Disregarding common n-grams

Using entropy and salience

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23

Disregarding common n-grams

Using entropy and salience

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 22 / 23

N-grams with high salience and low entropy

Pawan Goyal (IIT Kharagpur) NLP for Social Media: POS Tagging, Sentiment Analysis August 05, 2016 23 / 23

top related