TRANSCRIPT
Text Classification
+
Machine Learning Review
CS 287
Contents
Text Classification
Preliminaries: Machine Learning for NLP
Features and Preprocessing
Output
Classification
Linear Models
Linear Model 1: Naive Bayes
Earn a Degree based on your Life Experience
Obtain a Bachelor’s, Master’s, MBA, or PhD based on your
present knowledge and life experience.
No required tests, classes, or books. Confidentiality assured.
Join our fully recognized Degree Program.
Are you a truly qualified professional in your field but lack the
appropriate, recognized documentation to achieve your goals?
Or are you venturing into a new field and need a boost to get
your foot in the door so you can prove your capabilities?
Call us for information that can change your life and help you
to achieve your goals!!!
CALL NOW TO RECEIVE YOUR DIPLOMA WITHIN 30
DAYS
Sentiment
Good Sentences
I A thoughtful, provocative, insistently humanizing film.
I Occasionally melodramatic, it’s also extremely effective.
I Guaranteed to move anyone who ever shook, rattled, or rolled.
Bad Sentences
I A sentimental mess that never rings true.
I This 100-minute movie only has about 25 minutes of decent
material.
I Here, common sense flies out the window, along with the hail of
bullets, none of which ever seem to hit Sascha.
Multiclass Sentiment
I ★★★★
I visited The Abbey on several occasions on a visit to Cambridge
and found it to be a solid, reliable and friendly place for a meal.
I ★★
However, the food leaves something to be desired. A very obvious
menu and average execution
I ★★★★★
Fun, friendly neighborhood bar. Good drinks, good food, not too
pricey. Great atmosphere!
Text Categorization
I Straightforward setup.
I Lots of practical applications:
I Spam Filtering
I Sentiment
I Text Categorization
I e-discovery
I Twitter Mining
I Author Identification
I . . .
I Introduces machine learning notation.
I However, a relatively solved problem these days.
Preliminary Notation
I b, m; bold letters for vectors.
I B, M; bold capital letters for matrices.
I B, M; script-case for sets.
I B, M; capital letters for random variables.
I b_i, x_i; lower case for scalars or indexing into vectors.
I δ(i); one-hot vector at position i, e.g. δ(2) = [0; 1; 0; . . . ]
I 1(x = y); indicator: 1 if x = y, otherwise 0
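This notation can be made concrete with a small numpy sketch (the function names here are mine, not the slides'):

```python
import numpy as np

def one_hot(i, d):
    """delta(i): length-d one-hot vector with a 1 at position i (1-indexed, as in the slides)."""
    v = np.zeros(d)
    v[i - 1] = 1.0
    return v

def indicator(x, y):
    """1(x = y): 1 if x equals y, otherwise 0."""
    return 1.0 if x == y else 0.0

one_hot(2, 4)   # [0, 1, 0, 0], i.e. delta(2)
```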
Text Classification
1. Extract pertinent information from the sentence.
2. Use this to construct an input representation.
3. Classify this vector into an output class.
Input Representation:
I Conversion from text into a mathematical representation?
I Main focus of this class, representation of language
I Point in coming lectures: sparse vs. dense representations
Sparse Features (Notation from YG)
I F; a discrete set of feature values.
I f_1 ∈ F, . . . , f_k ∈ F; active features for the input.
For a given sentence, let f_1, . . . , f_k be the relevant features.
Typically k ≪ |F|.
I Sparse representation of the input defined as,
x = ∑_{i=1}^k δ(f_i)
I x ∈ R^{1×d_in}; input representation
Features 1: Sparse Bag-of-Words Features
Representation is counts of input words,
I F; the vocabulary of the language.
I x = ∑_i δ(f_i)
Example: Movie review input,
A sentimental mess
x = δ(word:A) + δ(word:sentimental) + δ(word:mess)
x⊤ = [1; . . . ; 0; 0] + [0; . . . ; 0; 1] + [0; . . . ; 1; 0] = [1; . . . ; 1; 1]
(rows indexed by word:A, . . . , word:mess, word:sentimental)
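The bag-of-words construction above can be sketched in numpy with a toy three-word vocabulary ordered as on the slide (the helper name `bow` is illustrative):

```python
import numpy as np

# Toy vocabulary, ordered as in the slide's vector example.
vocab = {"word:A": 0, "word:mess": 1, "word:sentimental": 2}

def bow(features, vocab):
    """x = sum_i delta(f_i): add a one-hot vector for each active feature."""
    x = np.zeros(len(vocab))
    for f in features:
        x[vocab[f]] += 1.0
    return x

x = bow(["word:A", "word:sentimental", "word:mess"], vocab)   # [1, 1, 1]
```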
Features 2: Sparse Word Properties
Representation can use specific aspects of text.
I F; spelling, all-capitals, trigger words, etc.
I x = ∑_i δ(f_i)
Example: Spam Email
Your diploma puts a UUNIVERSITY JOB PLACEMENT COUNSELOR
at your disposal.
x = δ(misspelling) + δ(allcapital) + δ(trigger:diploma) + . . .
x⊤ = [1; . . . ; 0; 0] + [0; . . . ; 1; 0] + [0; . . . ; 0; 1] = [1; . . . ; 1; 1]
(rows indexed by misspelling, . . . , capital, word:diploma)
Text Classification: Output Representation
1. Extract pertinent information from the sentence.
2. Use this to construct an input representation.
3. Classify this vector into an output class.
Output Representation:
I How do we encode the output classes?
I We will use a one-hot output encoding.
I In future lectures, efficiency of output encoding.
Output Class Notation
I C = {1, . . . , d_out}; possible output classes
I c ∈ C; always one true output class
I y = δ(c) ∈ R^{1×d_out}; true one-hot output representation
Output Form: Binary Classification
Examples: spam/not-spam, good review/bad review, relevant/irrelevant
document, many others.
I d_out = 2; two possible classes
I In our notation,
bad: c = 1, y = [1 0]   vs.   good: c = 2, y = [0 1]
I Can also use a single-output sign representation with d_out = 1
Output Form: Multiclass Classification
Examples: Yelp stars, etc.
I d_out = 5; for example
I In our notation, one star, two stars, . . .
★: c = 1, y = [1 0 0 0 0]   vs.   ★★: c = 2, y = [0 1 0 0 0]
. . .
Examples: Word Prediction (Unit 3)
I d_out > 100,000;
I In our notation, C is the vocabulary and each c is a word.
the: c = 1, y = [1 0 0 0 . . . 0]   vs.   dog: c = 2, y = [0 1 0 0 . . . 0]
. . .
Evaluation
I Consider evaluating accuracy on outputs y1, . . . , yn.
I Given predictions c_1, . . . , c_n we measure accuracy as,
(1/n) ∑_{i=1}^n 1(δ(c_i) = y_i)
I Simplest of several different metrics we will explore in the class.
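The accuracy metric above is a one-liner in numpy; this minimal sketch (names mine) compares predicted class indices against one-hot true outputs:

```python
import numpy as np

def accuracy(preds, y):
    """Mean of 1(delta(c_i) = y_i): fraction of predictions matching the
    one-hot true outputs. preds: predicted class indices; y: n x d_out one-hots."""
    true_classes = np.argmax(y, axis=1)
    return float(np.mean(preds == true_classes))

y = np.array([[1, 0], [0, 1], [0, 1]])   # true classes: 0, 1, 1
preds = np.array([0, 1, 0])              # two of three correct
accuracy(preds, y)                       # 2/3
```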
Supervised Machine Learning
Let,
I (x_1, y_1), . . . , (x_n, y_n); training data
I x_i ∈ R^{1×d_in}; input representations
I y_i ∈ R^{1×d_out}; true output representations (one-hot vectors)
Goal: Learn a classifier from input to output classes.
Note:
I x_i is an input vector; x_{i,j} is an element of that vector, or just x_j when a single input is clear from context.
I Practically, store the design matrix X ∈ R^{n×d_in} and the output classes.
Experimental Setup
I Data is split into three parts: training, validation, and test.
I Experiments are all run on training and validation; test is the final output.
I For assignments, you get the full training and validation data, but only the inputs for test.
For very small text classification data sets,
I Use K-fold cross-validation.
1. Split into K folds (equal splits).
2. For each fold, train on the other K−1 folds, test on the current fold.
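The two-step K-fold procedure can be sketched as index splitting in numpy (the function name is illustrative):

```python
import numpy as np

def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds; for each fold,
    yield (train, test): train = the other k-1 folds, test = the current fold."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(kfold_indices(10, 5))   # 5 folds of 2 examples each
```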
Linear Models for Classification
Linear model,
y = f(xW + b)
I W ∈ R^{d_in×d_out}, b ∈ R^{1×d_out}; model parameters
I f : R^{d_out} → R^{d_out}; activation function
I Sometimes z = xW + b is informally called the "score" vector.
I Note z and y are not one-hot.
Class prediction,
c = argmax_{i∈C} y_i = argmax_{i∈C} (xW + b)_i
Interpreting Linear Models
Parameters give scores to possible outputs,
I W_{f,i} is the score for sparse feature f under class i
I b_i is a prior score for class i
I y_i is the total score for class i
I c is the highest scoring class under the linear model.
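The linear prediction rule is compact in numpy; this minimal sketch uses random placeholder parameters rather than anything trained:

```python
import numpy as np

# Random placeholder parameters, standing in for a trained model.
d_in, d_out = 4, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))    # W in R^{d_in x d_out}
b = rng.normal(size=(1, d_out))       # b in R^{1 x d_out}

x = np.array([[1.0, 0.0, 1.0, 0.0]])  # sparse input as a row vector
z = x @ W + b                         # the "score" vector z = xW + b
c = int(np.argmax(z))                 # class prediction c = argmax_i z_i
```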
Example:
I For a single feature, the scores are
[β_1, β_2] = δ(word:dreadful) W,
and we expect β_1 > β_2 (assuming class 2 is "good").
Probabilistic Linear Models
Can estimate a linear model probabilistically,
I Let output be a random variable Y , with sample space C.
I Let the representation be a random vector X.
I (Simplified frequentist representation)
I Interested in estimating parameters θ,
P(Y |X ; θ)
Informally we use p(y = δ(c)|x) for P(Y = c |X = x).
Generative Model. Joint Log-Likelihood as Loss
I (x1, y1), . . . , (xn, yn); supervised data
I Select parameters to maximize likelihood of training data.
L(θ) = −∑_{i=1}^n log p(x_i, y_i; θ)
For linear models θ = (W,b)
I Do this by minimizing negative log-likelihood (NLL).
argmin_θ L(θ)
Multinomial Naive Bayes
Reminder, joint probability chain rule,
p(x, y) = p(x|y)p(y)
For sparse features with observed classes we can write this as,
p(x, y) = p(x_{f_1} = 1, . . . , x_{f_k} = 1 | y = δ(c)) p(y = δ(c))
= [∏_{i=1}^k p(x_{f_i} = 1 | x_{f_1} = 1, . . . , x_{f_{i−1}} = 1, y = δ(c))] p(y = δ(c))
≈ [∏_{i=1}^k p(x_{f_i} = 1 | y = δ(c))] p(y = δ(c))
The first equality is the chain rule; the approximation is the naive independence assumption.
Estimating Multinomial Distributions
Let S be a random variable with sample space S and we have
observations s1, . . . , sn,
I P(S = s; θ) = Cat(s; θ); parameterized as a multinomial
distribution.
I Minimizing the NLL for a multinomial (the MLE) has a closed form,
P(S = s; θ) = Cat(s; θ) = (∑_{i=1}^n 1(s_i = s)) / n
I Exercise: Derive this by minimizing L.
I Also called categorical or multinoulli (in Murphy).
Multinomial Naive Bayes
I Both p(y) and p(x|y) are parameterized as multinomials.
I Fit the first as,
p(y = δ(c)) = (∑_{i=1}^n 1(y_i = c)) / n
I Fit the second using a count matrix F,
I Let
F_{f,c} = ∑_{i=1}^n 1(y_i = c) 1(x_{i,f} = 1) for all c ∈ C, f ∈ F
I Then,
p(x_f = 1 | y = δ(c)) = F_{f,c} / ∑_{f′∈F} F_{f′,c}
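The closed-form multinomial estimates above can be sketched in numpy on a tiny made-up dataset (function name and data are illustrative):

```python
import numpy as np

def fit_multinomial_nb(X, y):
    """X: n x |F| binary feature matrix; y: length-n class indices.
    Returns p(y = c) and p(x_f = 1 | y = c) from the closed-form counts."""
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])     # p(y = c)
    F = np.stack([X[y == c].sum(axis=0) for c in classes])   # counts F_{c,f}
    cond = F / F.sum(axis=1, keepdims=True)                  # normalize over features
    return prior, cond

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
y = np.array([0, 0, 1])
prior, cond = fit_multinomial_nb(X, y)   # prior = [2/3, 1/3]
```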
Alternative: Multivariate Bernoulli Naive Bayes
I p(y) is a multinomial as above, and p(x_f | y) is a Bernoulli for each feature.
I Fit the class as categorical,
p(y = δ(c)) = (∑_{i=1}^n 1(y_i = c)) / n
I Fit the features using the count matrix F,
I Let
F_{f,c} = ∑_{i=1}^n 1(y_i = c) 1(x_{i,f} = 1) for all c ∈ C, f ∈ F
I Then,
p(x_f = 1 | y = δ(c)) = F_{f,c} / ∑_{i=1}^n 1(y_i = c)
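In code the Bernoulli variant differs from the multinomial fit only in the denominator: each conditional becomes the fraction of class-c examples with feature f on. A minimal sketch on the same toy data:

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Bernoulli naive Bayes fit: p(x_f = 1 | y = c) is the per-class mean of
    each binary feature column; the prior is fit as before."""
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])
    cond = np.stack([X[y == c].mean(axis=0) for c in classes])  # per-feature Bernoulli
    return prior, cond

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
y = np.array([0, 0, 1])
prior, cond = fit_bernoulli_nb(X, y)   # cond[0] = [1.0, 0.5, 0.5]
```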
Getting a Conditional Distribution
I Generative models give estimates of P(X, Y), but we want P(Y | X).
I Bayes' rule,
p(y | x) = p(x | y) p(y) / p(x)
I In log-space,
log p(y | x) = log p(x | y) + log p(y) − log p(x)
Prediction with Naive Bayes
I For prediction, the last term is constant in c, so
argmax_c log p(y = δ(c) | x) = argmax_c [log p(x | y = δ(c)) + log p(y = δ(c))]
I Can write this as a linear model,
W_{f,c} = log p(x_f = 1 | y = δ(c)) for all c ∈ C, f ∈ F
b_c = log p(y = δ(c)) for all c ∈ C
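Packing the log-probabilities into W and b turns naive Bayes prediction into the earlier argmax rule. A minimal sketch, where the prior and conditional tables are made-up numbers standing in for estimated parameters:

```python
import numpy as np

# Made-up naive Bayes parameters, standing in for fitted estimates.
prior = np.array([0.5, 0.5])          # p(y = c)
cond = np.array([[0.6, 0.3, 0.1],     # p(x_f = 1 | y = c), one row per class
                 [0.2, 0.2, 0.6]])

W = np.log(cond).T         # W_{f,c} = log p(x_f = 1 | y = c)
b = np.log(prior)[None]    # b_c = log p(y = c)

x = np.array([[1.0, 0.0, 1.0]])       # active features: f_1 and f_3
c = int(np.argmax(x @ W + b))         # prediction as a linear model
```

Here class 2 wins (index 1), since 0.2 · 0.6 > 0.6 · 0.1 for the two active features.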
Getting a Conditional Distribution
I What if we want conditional probabilities?
p(y | x) = p(x | y) p(y) / p(x) = p(x | y) p(y) / ∑_{c′∈C} p(x, y = δ(c′))
I The denominator is acquired by renormalizing,
p(y | x) ∝ p(x | y) p(y)
Practical Aspects: Calculating Log-Sum-Exp
I Because of numerical issues, calculate in log-space,
f(z) = log p(y = δ(c) | x) = z_c − log ∑_{c′∈C} exp(z_{c′})
where for naive Bayes
z = xW + b
I However, the term log ∑_{c′∈C} exp(z_{c′}) is hard to calculate directly.
I Instead compute
log ∑_{c′∈C} exp(z_{c′} − M) + M
where M = max_{c′∈C} z_{c′}
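The max-shift trick above is a few lines of numpy; the sketch below shows why the naive version fails for large scores:

```python
import numpy as np

def log_sum_exp(z):
    """Stable log sum_{c'} exp(z_{c'}): shift by M = max_{c'} z_{c'} so the
    exponentials cannot overflow."""
    M = np.max(z)
    return M + np.log(np.sum(np.exp(z - M)))

z = np.array([1000.0, 1000.0])
log_sum_exp(z)   # 1000 + log 2, whereas np.log(np.sum(np.exp(z))) overflows
```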
Digression: Zipf’s Law
Laplace Smoothing
Method for handling the long tail of words by distributing mass,
I Add a value of α to each element in the sample space before
normalization.
θ_s = (α + ∑_{i=1}^n 1(s_i = s)) / (α|S| + n)
I (Similar to Dirichlet prior in a Bayesian interpretation.)
For naive Bayes, add α to every count before normalizing:
F = α + F
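The smoothed multinomial estimate can be sketched directly from the formula above (function name illustrative); note the zero-count item still receives probability mass:

```python
import numpy as np

def smoothed_multinomial(counts, alpha=1.0):
    """theta_s = (alpha + count_s) / (alpha * |S| + n)."""
    counts = np.asarray(counts, dtype=float)
    return (alpha + counts) / (alpha * counts.size + counts.sum())

counts = np.array([3.0, 1.0, 0.0])          # an unseen item with zero count
theta = smoothed_multinomial(counts, 1.0)   # [4/7, 2/7, 1/7]: no zero probabilities
```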
Naive Bayes In Practice
I Very fast to train.
I Relatively interpretable.
I Performs quite well on small datasets.
(RT-S [movie review], CR [customer reviews], MPQA [opinion polarity], SUBJ [subjectivity])