02 Naive Bayes Classifier and Sentiment Analysis
Post on 18-Jul-2015
Overview
• User-generated content on the Internet has risen exponentially over the last decade.
• Nearly all our decision-making is social.
• Sentiment analysis: the computational study of how opinions, attitudes, emotions, and perspectives are expressed in language.
• It provides a rich set of tools and techniques for extracting this evaluative, subjective information from large datasets and summarizing it.
Applications
• Understanding customer feedback
• Understanding public opinion
– "We analyze several surveys on consumer confidence and political opinion over the 2008 to 2009 period, and find they correlate to sentiment word frequencies in contemporaneous Twitter messages." (O'Connor, Balasubramanyan, Routledge, and Smith 2010)
• Media studies
– Sentiment analysis is being used to study media coverage and media bias.
• Realistic text-to-speech
– Text-to-speech technology is increasingly reliable and sophisticated, but it continues to stumble when it comes to predicting where and how the affect will change.
• Financial prediction
– The number of emotional words on Twitter could be used to predict daily moves in the Dow Jones Industrial Average. A change in emotions expressed online would be followed between two and six days later by a move in the index, the researchers said, and this information let them predict its movements with 87.6 percent accuracy.
– Which movies will open big
Monitoring real-world events
• [Figure] The vertical line marks the timing of the announcement that Gaddafi had been killed.
The emotions of a nation
• [Figure] Facebook's Gross National Happiness interface. Holidays register large happiness spikes. The happiness dips in January correspond roughly with the earthquake in Haiti (Jan 12) and its most serious aftershock (Jan 20).
Brand analysis, reputation management
• [Figure] Twitter discussion following Netflix's announcement that it would be increasing its subscription prices. The feedback on its Facebook page has a similar tenor and volume. The impact on its stock price has been dramatic.
A pervasive, challenging phenomenon
• Part of the challenge is that sentiment information is blended and multidimensional.
• We try to approximate it as a classification problem, but this is often inappropriate.
• Emotionally blended 'confessions' from the Experience Project:
I have a crush on my boss! *blush* eeek *back to work*
I fell over in the playground this afternoon, in front of all the parents and kids. I fell backwards over a cone. It was pretty funny, but now my neck hurts like crazy.
When I left my workplace today I kinda felt like I had to pee. But I was in rush to go home and figured it would be ok. So I have to take the bus and then walk for about another 15 minutes to get home. By the time I got on to the bus I knew I really had to pee. I was sitting down and had my legs tightly crossed and pretty much in agony scared of peeing myself in front of all the people in the bus. [...]
I really hate being shy ... I just want to be able to talk to someone about anything and everything and be myself.. That's all I’ve ever wanted.
Problems
• Most robust, fast computational models treat documents as bags of words. In some areas of sentiment, this is basically appropriate, because some sentiment words and expressions are not directly influenced by what is around them:
– That was fun :)
– That was miserable :(
• In other cases, where the sentiment words are influenced by what is around them, a bag of words might nonetheless do the job:
– I stubbed my damn toe
– What's with these friggin QR codes?
• However, the normal situation is for sentiment words to be influenced in complicated ways by the material around them:
– It was wonderful.
– He knows if it is wonderful.
– This "wonderful" movie turned out to be boring.
Syntactic structures
The sentiment word wonderful is influenced by the operators that take scope over it, which alter or weaken speaker/author commitment.
Non-literal language
• The effects of non-literal language are extremely challenging.
• Even humans are apt to get confused about the intended meaning of some expressions of this form.
• We can hope to approximate their effects with sentiment analysis systems, but they are still likely to be a leading cause of errors.
– Irony and sarcasm: Oh, this is just great!; as clear as mud
– Implicit comparison: He did a good job for a linguist.
– Understatement (litotes): He's not exactly brilliant.
– Hyperbole: No one goes there anymore.
Text preparation: Tokenization
• Tokenizing (splitting a string into its desired constituent parts) is fundamental to all NLP tasks.
• There is no single right way to do tokenization; the right algorithm depends on the application.
• Whitespace tokenizer: simply downcases the string and splits the text on any sequence of whitespace, tab, or newline characters.
• Treebank-style: the tokenization used by the Penn Treebank and many other important large-scale corpora for NLP. Thus, it is a de facto standard. This alone makes it worth considering, since it can facilitate the use of other tools.
• Sentiment-aware tokenizer:
– Emoticons are extremely common in many forms of social media, and they are reliable carriers of sentiment.
– Twitter includes topic and user mark-up that is useful for advanced sentiment modeling. Your tokenizer should capture this mark-up if you are processing Twitter data.
• User name: @+[\w_]+
• Hashtag: \#+[\w_]+[\w\'_\-]*[\w_]+
– Informative HTML tags: basic HTML mark-up like the strong, b, em, and i tags can be indicators of sentiment.
– Masked curses (****, s***t)
– Capitalization, lengthening, additional punctuation
Text preparation: Stemming
• Stemming is a method for collapsing distinct word forms.
• This could help reduce the vocabulary size, thereby sharpening one's results, especially for small data sets.
• The Porter stemmer is one of the earliest and best-known stemming algorithms. It works by heuristically identifying word suffixes (endings) and stripping them off, with some regularization of the endings.
• The Porter stemmer often collapses sentiment distinctions, by mapping two words with different sentiment into the same stemmed form.
• Suggestion: test its effect and then decide. For Twitter data it may be ineffective or even harmful, and for sentiment analysis generally, running a stemmer is costly in terms of resources and prediction accuracy.
Positiv        Negativ    Stemmed
captivation    captive    captiv
common         commoner   common
defend         defendant  defend
defense        defensive  defens
dependability  dependent  depend
dependable     dependent  depend
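To see how stemming collapses sentiment pairs like those in the table, here is a toy suffix-stripping stemmer. It is a deliberately simplified sketch, not the real Porter algorithm; the suffix list is my own illustrative assumption.

```python
# Toy suffix-stripping stemmer: a deliberately simplified sketch, NOT the
# real Porter algorithm, just enough to show how stemming can collapse
# sentiment distinctions.
SUFFIXES = ["ability", "able", "ant", "ent", "ive", "er"]  # longest first

def toy_stem(word):
    for suffix in SUFFIXES:
        # require a stem of at least 3 characters to avoid over-stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Positive "dependable" and negative "dependent" collapse to one form,
# so the sentiment contrast is lost after stemming:
for pos, neg in [("dependable", "dependent"), ("defend", "defendant")]:
    print(pos, neg, "->", toy_stem(pos), toy_stem(neg))
```

The real Porter stemmer has several ordered rule steps and a "measure" condition on stems, but it loses these sentiment contrasts in the same way.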
Negation marking
• Append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark.
• Example: No one enjoys it. -> no one_NEG enjoys_NEG it_NEG.
Definition: Negation
A negation is any word matching the following regular expression:
(?:^(?:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)$)|n't
Definition: Clause-level punctuation
A clause-level punctuation mark is any word matching the following regular expression:
^[.:;!?]$
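The rule above translates directly into code. This sketch uses the two regular expressions exactly as given, applied to an already-tokenized sentence:

```python
import re

# Negation marking as defined above: every token between a negation and
# clause-level punctuation receives a _NEG suffix.
NEGATION_RE = re.compile(
    r"(?:^(?:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|"
    r"cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)$)"
    r"|n't"
)
CLAUSE_PUNCT_RE = re.compile(r"^[.:;!?]$")

def mark_negation(tokens):
    in_scope = False   # are we between a negation and clause punctuation?
    out = []
    for tok in tokens:
        if CLAUSE_PUNCT_RE.match(tok):
            in_scope = False       # punctuation closes the negation scope
            out.append(tok)
        elif in_scope:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if NEGATION_RE.search(tok):
                in_scope = True    # scope opens AFTER the negation word
    return out

print(mark_negation(["no", "one", "enjoys", "it", "."]))
```

This reproduces the slide's example: "no one_NEG enjoys_NEG it_NEG .". The effect is that "enjoys" and "enjoys_NEG" become distinct vocabulary items with separately learned sentiment.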
Part-of-speech tagging
• There are many cases in which a sentiment contrast exists between words that have the same string representation but different parts of speech.

Word    Tag1  Polarity  Tag2  Polarity
arrest  jj    Positiv   vb    Negativ
board   jj    Positiv   rb    Negativ
deal    intj  Negativ   nn    Positiv
even    jj    Positiv   vb    Negativ
even    rb    Positiv   vb    Negativ
fine    jj    Positiv   nn    Negativ
fine    jj    Positiv   vb    Negativ
fine    nn    Negativ   rb    Positiv
fine    rb    Positiv   vb    Negativ

JJ = adjective, RB = adverb, INTJ = interjection, VB = verb in base form, NN = noun in singular or mass
Categorization/Classification
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of classes: C = {c1, c2, …, cJ}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a classification function whose domain is X and whose range is C.
• We want to know how to build classification functions ("classifiers").
Recall a few probability basics
• For events a and b:

P(a, b) = P(a | b) P(b) = P(b | a) P(a)

P(b) = Σ_{x ∈ {a, ¬a}} P(b | x) P(x)

• Bayes' Rule:

P(a | b) = P(b | a) P(a) / P(b)

where P(a | b) is the posterior and P(a) is the prior.
• Odds:

O(a) = P(a) / P(¬a) = P(a) / (1 − P(a))
Naive Bayes Classifiers
Task: Classify a new instance D, described by a tuple of attribute values D = ⟨x1, x2, …, xn⟩, into one of the classes cj ∈ C.

c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)

MAP = Maximum A Posteriori Probability
Naïve Bayes Classifier: Naïve Bayes Assumption
• P(cj)
– Can be estimated from the frequency of classes in the training examples.
• P(x1, x2, …, xn | cj)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption:
• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
[Figure: a Naïve Bayes network with class node Flu and feature nodes X1-X5: fever, sinus, cough, runny nose, muscle-ache.]
The Naïve Bayes Classifier
• Conditional Independence Assumption: Features (term presence) are independent of each other given the class:

P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)

• This model is appropriate for binary variables
– Multivariate Bernoulli model
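A small back-of-the-envelope calculation shows why the assumption matters for tractability. For n binary features, a full joint table per class would need on the order of 2^n entries, while the factored Naive Bayes form needs only one parameter per feature per class (the exact counts below follow from those table sizes, not from the slides):

```python
# Parameter counts: full joint conditional vs. Naive Bayes factorization,
# for n binary features and |C| classes.
n_features = 5
n_classes = 2

# Full joint P(x1..xn | c): a distribution over 2^n outcomes per class,
# i.e. 2^n - 1 free parameters per class.
full_joint = (2 ** n_features - 1) * n_classes

# Naive Bayes: one free parameter P(xi = 1 | c) per feature per class.
naive_bayes = n_features * n_classes

print(full_joint, naive_bayes)
```

With only five features the gap is already 62 vs. 10 parameters; with a text vocabulary of thousands of terms the full joint is hopeless while the factored model remains learnable.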
Learning the Model
• First attempt: maximum likelihood estimates; simply use the frequencies in the data:

P̂(cj) = N(C = cj) / N

P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

[Figure: a Naïve Bayes network with class node C and feature nodes X1-X6.]

• What if we have seen no training cases where the patient had no flu and muscle aches?

P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0

• Zero probabilities cannot be conditioned away, no matter the other evidence:

c = argmax_c P̂(c) · ∏i P̂(xi | c)
Problem with Maximum Likelihood: Smoothing to Avoid Overfitting
• Laplace ("add-one") smoothing:

P̂(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + k)

where k is the number of values of Xi.
• Somewhat more subtle version (m-estimate):

P̂(xi,k | cj) = (N(Xi = xi,k, C = cj) + m · pi,k) / (N(C = cj) + m)

where pi,k is the overall fraction in the data where Xi = xi,k, and m controls the extent of "smoothing".
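The add-one estimate is easy to verify on a tiny count table. The numbers below are made-up illustrative counts for the flu example, not data from the slides:

```python
# Laplace ("add-one") smoothing vs. maximum likelihood on a tiny made-up
# count table: values of X5 = muscle-ache among "no flu" training cases.
counts = {"t": 0, "f": 8}    # we never saw muscle-ache without flu
k = len(counts)              # number of values of X5

n_total = sum(counts.values())
mle = {v: c / n_total for v, c in counts.items()}
laplace = {v: (c + 1) / (n_total + k) for v, c in counts.items()}

print(mle["t"], laplace["t"])
```

Maximum likelihood gives the fatal P̂(X5 = t | nf) = 0, while the smoothed estimate gives a small but nonzero 1/10, so one unseen feature value can no longer veto the entire class.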
Naïve Bayes: Learning Algorithm
• From the training corpus, extract Vocabulary
• Calculate the required P(cj) and P(xk | cj) terms:
For each cj in C do
– docsj ← subset of documents for which the target class is cj
– P(cj) = |docsj| / |total # documents|
– Textj ← single document containing all docsj
– n ← total number of word occurrences in Textj
– for each word xk in Vocabulary:
• nk ← number of occurrences of xk in Textj
• P(xk | cj) = (nk + α) / (n + α · |Vocabulary|)
• α is a "tuning/smoothing" factor that can be set to 1
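The learning algorithm above can be sketched compactly (multinomial model, α = 1), with classification done in log space to avoid floating-point underflow. The toy training documents are my own illustrative assumption:

```python
import math
from collections import Counter

# Toy training corpus (illustrative assumption, not slide data).
train = [
    ("that was fun".split(), "pos"),
    ("what a wonderful movie".split(), "pos"),
    ("that was miserable".split(), "neg"),
    ("this movie turned out to be boring".split(), "neg"),
]

vocab = {w for doc, _ in train for w in doc}
classes = {c for _, c in train}
alpha = 1

prior, cond = {}, {}
for c in classes:
    docs_c = [doc for doc, label in train if label == c]
    prior[c] = len(docs_c) / len(train)          # P(cj)
    text_c = Counter(w for doc in docs_c for w in doc)   # Text_j
    n = sum(text_c.values())                     # total word occurrences
    cond[c] = {w: (text_c[w] + alpha) / (n + alpha * len(vocab))
               for w in vocab}                   # P(xk | cj)

def classify(doc):
    # argmax over log P(c) + sum of log P(xk | c); unknown words ignored
    return max(classes, key=lambda c: math.log(prior[c]) +
               sum(math.log(cond[c][w]) for w in doc if w in vocab))

print(classify("wonderful fun".split()))
```

Summing logs instead of multiplying raw probabilities is the standard trick: for a long document the product of hundreds of small P(xk | c) values would underflow to 0.0, while the log-sum stays well within floating-point range.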
OVERALL APPROACH
• General idea: Bayesian (robustness due to qualitative approach)
– multinomial (counts of terms)
– multivariate Bernoulli (counts of documents)
• Simplifying assumptions for computational tractability
– conditional independence; positional independence
• Dealing with idiosyncrasies of training data
– smoothing technique
– avoiding underflow using logs
• Improving efficiency (noise reduction)
– feature selection
Exercise
• Vocabulary = all words over all documents in the Training Set
• Calculate the number of documents in the positive class, and in the negative class
• Find the total number of documents
• Calculate:

P(cpos) = |docspos| / |total # documents|

P(cneg) = |docsneg| / |total # documents|

• Textpos ← single document containing all docspos
• npos = total number of word occurrences in Textpos
• for each word xk in Vocabulary
– nk ← number of occurrences of xk in Textpos
– Compute: P(xk | cpos) = (nk + 1) / (npos + |Vocabulary|)
• Write map-reduce pseudo-code to compute the above steps.
• Implement in Python using supporting modules that are given for: Tokenization, Stop word removal, and Stemming. (Duration: 90 mins)
• Team presentation and discussion (Duration: 45 mins)
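As a starting point for the map-reduce step, here is one possible shape for the counting phase only, written as a single-machine Python simulation. The mapper/reducer split and the `__doc__` counter key are my own illustrative choices; real map-reduce frameworks differ in their APIs:

```python
from collections import defaultdict

# Single-machine simulation of the map-reduce word-count phase of the
# exercise. Mapper emits ((word, class), 1) pairs; reducer sums them.
def mapper(doc_class, text):
    yield (("__doc__", doc_class), 1)        # per-class document counter
    for word in text.split():
        yield ((word, doc_class), 1)

def reducer(key, values):
    return key, sum(values)                  # sum partial counts per key

def run(docs):
    # "shuffle": group mapper output by key, then reduce each group
    groups = defaultdict(list)
    for doc_class, text in docs:
        for key, value in mapper(doc_class, text):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

counts = run([("pos", "fun fun movie"), ("neg", "boring movie")])
print(counts)
```

From these (word, class) counts and the per-class document counters, the P(cpos), P(cneg), and smoothed P(xk | cpos) formulas above can be computed in a final pass.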