Text Classification + Machine Learning Review
CS 287

Page 1:

Text Classification + Machine Learning Review

CS 287

Page 2:

Contents

Text Classification

Preliminaries: Machine Learning for NLP

Features and Preprocessing

Output

Classification

Linear Models

Linear Model 1: Naive Bayes

Page 3:

Earn a Degree based on your Life Experience

Obtain a Bachelor’s, Master’s, MBA, or PhD based on your present knowledge and life experience. No required tests, classes, or books. Confidentiality assured. Join our fully recognized Degree Program.

Are you a truly qualified professional in your field but lack the appropriate, recognized documentation to achieve your goals? Or are you venturing into a new field and need a boost to get your foot in the door so you can prove your capabilities?

Call us for information that can change your life and help you to achieve your goals!!!

CALL NOW TO RECEIVE YOUR DIPLOMA WITHIN 30 DAYS

Page 4:

Sentiment

Good Sentences

- A thoughtful, provocative, insistently humanizing film.
- Occasionally melodramatic, it’s also extremely effective.
- Guaranteed to move anyone who ever shook, rattled, or rolled.

Bad Sentences

- A sentimental mess that never rings true.
- This 100-minute movie only has about 25 minutes of decent material.
- Here, common sense flies out the window, along with the hail of bullets, none of which ever seem to hit Sascha.

Page 5:

Multiclass Sentiment

- ★★★★
  I visited The Abbey on several occasions on a visit to Cambridge and found it to be a solid, reliable and friendly place for a meal.
- ★★
  However, the food leaves something to be desired. A very obvious menu and average execution.
- ★★★★★
  Fun, friendly neighborhood bar. Good drinks, good food, not too pricey. Great atmosphere!

Page 6:

Text Categorization

- Straightforward setup.
- Lots of practical applications:
  - Spam Filtering
  - Sentiment
  - Text Categorization
  - e-discovery
  - Twitter Mining
  - Author Identification
  - . . .
- Introduces machine learning notation.
- However, a relatively solved problem these days.

Page 7:

Contents

Text Classification

Preliminaries: Machine Learning for NLP

Features and Preprocessing

Output

Classification

Linear Models

Linear Model 1: Naive Bayes

Page 8:

Preliminary Notation

- b, m; bold letters for vectors.
- B, M; bold capital letters for matrices.
- B, M; script-case letters for sets.
- B, M; capital letters for random variables.
- b_i, x_i; lower case for scalars or indexing into vectors.
- δ(i); one-hot vector at position i (sketched in code below),

  δ(2) = [0; 1; 0; . . .]

- 1(x = y); indicator that is 1 if x = y, otherwise 0.
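Below is a minimal numpy sketch of the δ(·) and 1(·) notation (0-indexed, whereas the slides index from 1; the helper names are illustrative):

import numpy as np

def one_hot(i, d):
    """delta(i): length-d vector with a 1 at position i (0-indexed here)."""
    v = np.zeros(d)
    v[i] = 1.0
    return v

def indicator(x, y):
    """1(x = y): 1 if x equals y, otherwise 0."""
    return 1.0 if x == y else 0.0

print(one_hot(1, 4))        # [0. 1. 0. 0.], i.e. delta(2) in the slides' 1-indexed notation
print(indicator("a", "a"))  # 1.0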

Page 9:

Text Classification

1. Extract pertinent information from the sentence.

2. Use this to construct an input representation.

3. Classify this vector into an output class.

Input Representation:

- Conversion from text into a mathematical representation?
- Main focus of this class: the representation of language.
- Point in coming lectures: sparse vs. dense representations.


Page 11:

Sparse Features (Notation from YG)

- F; a discrete set of feature values.
- f_1 ∈ F, . . . , f_k ∈ F; active features for the input.
  For a given sentence, let f_1, . . . , f_k be the relevant features. Typically k << |F|.
- Sparse representation of the input defined as (see the sketch below),

  x = ∑_{i=1}^{k} δ(f_i)

- x ∈ R^{1×d_in}; input representation
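A tiny numpy sketch of this sum of one-hot feature vectors; the feature names and index map are illustrative, not from the slides:

import numpy as np

# Hypothetical feature set F as an index map; f_1, ..., f_k are the active features.
feature_index = {"word:A": 0, "word:sentimental": 1, "word:mess": 2}
active = ["word:A", "word:sentimental", "word:mess"]

d_in = len(feature_index)
# x = sum_{i=1}^{k} delta(f_i): one row of the identity matrix per active feature.
x = sum(np.eye(d_in)[feature_index[f]] for f in active)
print(x)  # [1. 1. 1.]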

Page 12:

Features 1: Sparse Bag-of-Words Features

Representation is counts of input words,

- F; the vocabulary of the language.
- x = ∑_i δ(f_i)

Example: Movie review input,

  A sentimental mess

x = δ(word:A) + δ(word:sentimental) + δ(word:mess)

x^T = [1; . . . ; 0; 0] + [0; . . . ; 0; 1] + [0; . . . ; 1; 0] = [1; . . . ; 1; 1]

with rows indexed by the vocabulary: word:A, . . . , word:mess, word:sentimental.
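Since k << |F| in practice, the count vector is usually stored sparsely. A minimal sketch, keeping only the nonzero entries (tokenization by whitespace is an assumption):

from collections import Counter

def sparse_bag_of_words(text):
    """Store only the nonzero entries of x as a {feature: count} map."""
    return Counter("word:" + tok for tok in text.split())

x = sparse_bag_of_words("A sentimental mess")
print(x)  # Counter({'word:A': 1, 'word:sentimental': 1, 'word:mess': 1})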

Page 13:

Features 2: Sparse Word Properties

Representation can use specific aspects of the text.

- F; spelling, all-capitals, trigger words, etc.
- x = ∑_i δ(f_i)

Example: Spam Email

  Your diploma puts a UUNIVERSITY JOB PLACEMENT COUNSELOR at your disposal.

x = δ(misspelling) + δ(allcapital) + δ(trigger:diploma) + . . .

x^T = [1; . . . ; 0; 0] + [0; . . . ; 1; 0] + [0; . . . ; 0; 1] = [1; . . . ; 1; 1]

with rows indexed by the features: misspelling, . . . , all-capital, trigger:diploma.
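A sketch of how such word-property features might be fired; the all-capitals and trigger-word heuristics here are illustrative assumptions, and a real misspelling feature would need a spell checker:

def word_property_features(text, triggers=("diploma", "degree")):
    """Fire sparse word-property features; the heuristics here are illustrative only."""
    feats = set()
    for tok in text.split():
        if tok.isupper() and len(tok) > 1:
            feats.add("allcapital")
        if tok.lower().strip(".") in triggers:
            feats.add("trigger:" + tok.lower().strip("."))
    # A real system might also run a spell checker to fire a "misspelling" feature.
    return feats

print(word_property_features(
    "Your diploma puts a UUNIVERSITY JOB PLACEMENT COUNSELOR at your disposal."))
# e.g. {'allcapital', 'trigger:diploma'}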

Page 14:

Text Classification: Output Representation

1. Extract pertinent information from the sentence.

2. Use this to construct an input representation.

3. Classify this vector into an output class.

Output Representation:

- How do we encode the output classes?
- We will use a one-hot output encoding.
- In future lectures: efficiency of the output encoding.

Page 15:

Output Class Notation

- C = {1, . . . , d_out}; possible output classes
- c ∈ C; always one true output class
- y = δ(c) ∈ R^{1×d_out}; true one-hot output representation

Page 16:

Output Form: Binary Classification

Examples: spam/not-spam, good review/bad review, relevant/irrelevant document, many others.

- d_out = 2; two possible classes
- In our notation,

  bad: c = 1, y = [1 0]   vs.   good: c = 2, y = [0 1]

- Can also use a single output sign representation with d_out = 1

Page 17:

Output Form: Multiclass Classification

Examples: Yelp stars, etc.

- d_out = 5; for example
- In our notation, one star, two stars, . . .

  ★: c = 1, y = [1 0 0 0 0]   vs.   ★★: c = 2, y = [0 1 0 0 0]   . . .

Examples: Word Prediction (Unit 3)

- d_out > 100,000
- In our notation, C is the vocabulary and each c is a word.

  the: c = 1, y = [1 0 0 0 . . . 0]   vs.   dog: c = 2, y = [0 1 0 0 . . . 0]   . . .

Page 18:

Evaluation

- Consider evaluating accuracy on outputs y_1, . . . , y_n.
- Given predictions c_1, . . . , c_n we measure accuracy as (computed in the sketch below),

  (1/n) ∑_{i=1}^{n} 1(δ(c_i) = y_i)

- Simplest of several different metrics we will explore in the class.
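A small numpy sketch of this accuracy computation (0-indexed classes, toy data):

import numpy as np

def accuracy(predicted_classes, Y):
    """Fraction of examples where the predicted class matches the one-hot row of Y."""
    true_classes = Y.argmax(axis=1)
    return float(np.mean(predicted_classes == true_classes))

# Toy example: three examples, d_out = 2, classes 0-indexed (illustrative data).
Y = np.array([[1, 0], [0, 1], [0, 1]])
print(accuracy(np.array([0, 1, 0]), Y))  # 0.666...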

Page 19:

Contents

Text Classification

Preliminaries: Machine Learning for NLP

Features and Preprocessing

Output

Classification

Linear Models

Linear Model 1: Naive Bayes

Page 20:

Supervised Machine Learning

Let,

- (x_1, y_1), . . . , (x_n, y_n); training data
- x_i ∈ R^{1×d_in}; input representations
- y_i ∈ R^{1×d_out}; true output representations (one-hot vectors)

Goal: Learn a classifier from inputs to output classes.

Note:

- x_i is an input vector, x_{i,j} is an element of that vector, or just x_j when there is a clear single input.
- Practically, store the design matrix X ∈ R^{n×d_in} and the output classes.

Page 21:

Experimental Setup

- Data is split into three parts: training, validation, and test.
- Experiments are all run on training and validation; test is for the final evaluation.
- For assignments, you get the full training and validation data, and only the inputs for test.

For very small text classification data sets,

- Use K-fold cross-validation (sketched below):
  1. Split into K folds (equal splits).
  2. For each fold, train on the other K-1 folds, test on the current fold.
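A minimal sketch of the K-fold split; randomizing the fold assignment is an assumption, not something the slide specifies:

import numpy as np

def kfold_indices(n, K, seed=0):
    """Yield (train_idx, test_idx) index pairs for K roughly equal folds of n examples."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        yield train_idx, test_idx

for train_idx, test_idx in kfold_indices(n=10, K=5):
    print(len(train_idx), len(test_idx))  # 8 2 for every fold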

Page 22:

Linear Models for Classification

Linear model,

  y = f(xW + b)

- W ∈ R^{d_in×d_out}, b ∈ R^{1×d_out}; model parameters
- f : R^{d_out} → R^{d_out}; activation function
- Sometimes z = xW + b is informally called the "score" vector.
- Note z and y are not one-hot.

Class prediction (see the sketch below),

  c = arg max_{i∈C} y_i = arg max_{i∈C} (xW + b)_i
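A small numpy sketch of this score-and-argmax prediction (toy parameter values, 0-indexed classes):

import numpy as np

def predict(x, W, b):
    """Linear model class prediction: c = argmax_i (xW + b)_i (0-indexed)."""
    z = x @ W + b          # score vector z
    return int(np.argmax(z))

# Toy shapes: d_in = 3 features, d_out = 2 classes (illustrative numbers).
W = np.array([[ 1.0, -1.0],
              [-2.0,  2.0],
              [ 0.5,  0.0]])
b = np.array([0.0, 0.1])
x = np.array([1.0, 0.0, 1.0])   # sparse feature vector with two active features
print(predict(x, W, b))          # 0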

Page 23:

Interpreting Linear Models

Parameters give scores to possible outputs,

- W_{f,i} is the score for sparse feature f under class i
- b_i is a prior score for class i
- y_i is the total score for class i
- c is the highest scoring class under the linear model.

Example:

- For a single feature's scores,

  [β_1, β_2] = δ(word:dreadful) W

  Expect β_1 > β_2 (assuming class 2 is "good").

Page 24:

Probabilistic Linear Models

Can estimate a linear model probabilistically,

- Let the output be a random variable Y, with sample space C.
- Let the representation be a random vector X.
- (Simplified frequentist representation)
- Interested in estimating parameters θ,

  P(Y | X; θ)

Informally we use p(y = δ(c) | x) for P(Y = c | X = x).

Page 25:

Generative Model: Joint Log-Likelihood as Loss

- (x_1, y_1), . . . , (x_n, y_n); supervised data
- Select parameters to maximize the likelihood of the training data,

  L(θ) = −∑_{i=1}^{n} log p(x_i, y_i; θ)

  For linear models, θ = (W, b).

- Do this by minimizing the negative log-likelihood (NLL),

  arg min_θ L(θ)

Page 26:

Multinomial Naive Bayes

Reminder, joint probability chain rule,

  p(x, y) = p(x | y) p(y)

For sparse features with observed classes, we can write this as,

  p(x, y) = p(x_{f_1} = 1, . . . , x_{f_k} = 1 | y = δ(c)) p(y = δ(c))
          = ∏_{i=1}^{k} p(x_{f_i} = 1 | x_{f_1} = 1, . . . , x_{f_{i-1}} = 1, y = δ(c)) p(y = δ(c))
          ≈ ∏_{i=1}^{k} p(x_{f_i} = 1 | y) p(y)

The first equality is by the chain rule; the approximation is the (naive) independence assumption.


Page 28:

Estimating Multinomial Distributions

Let S be a random variable with sample space S, and suppose we have observations s_1, . . . , s_n,

- P(S = s; θ) = Cat(s; θ); parameterized as a multinomial distribution.
- Minimizing the NLL for a multinomial (the MLE) has a closed form (see the sketch below),

  P(S = s; θ) = Cat(s; θ) = (1/n) ∑_{i=1}^{n} 1(s_i = s)

- Exercise: Derive this by minimizing L.
- Also called categorical or multinoulli (in Murphy).
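A one-function sketch of this closed-form estimate, counting and normalizing by n:

from collections import Counter

def multinomial_mle(observations):
    """Closed-form MLE: theta_s = count(s) / n for each value s in the sample space."""
    counts = Counter(observations)
    n = len(observations)
    return {s: c / n for s, c in counts.items()}

print(multinomial_mle(["a", "b", "a", "a"]))  # {'a': 0.75, 'b': 0.25}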


Page 30:

Multinomial Naive Bayes

- Both p(y) and p(x | y) are parameterized as multinomials.
- Fit the first as,

  p(y = δ(c)) = (1/n) ∑_{i=1}^{n} 1(y_i = c)

- Fit the second using a count matrix F (see the sketch below). Let

  F_{f,c} = ∑_{i=1}^{n} 1(y_i = c) 1(x_{i,f} = 1)   for all c ∈ C, f ∈ F

  Then,

  p(x_f = 1 | y = δ(c)) = F_{f,c} / ∑_{f'∈F} F_{f',c}
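A count-based sketch of this fitting procedure; the toy data is illustrative, and X here holds feature counts, which reduces to the slide's indicators when features are 0/1:

import numpy as np

def fit_multinomial_nb(X, y, d_out):
    """Fit p(y) and p(x_f = 1 | y) by counting; no smoothing (see Laplace smoothing later)."""
    n, d_in = X.shape
    prior = np.zeros(d_out)
    F = np.zeros((d_in, d_out))                # count matrix F_{f,c}
    for c in range(d_out):
        mask = (y == c)
        prior[c] = mask.sum() / n              # p(y = c)
        F[:, c] = X[mask].sum(axis=0)          # feature counts for class c
    cond = F / F.sum(axis=0, keepdims=True)    # p(x_f = 1 | y = c), column-normalized
    return prior, cond

# Toy data: 4 documents, 3 features, 2 classes (illustrative numbers only).
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
y = np.array([0, 0, 1, 1])
prior, cond = fit_multinomial_nb(X, y, d_out=2)
print(prior)   # [0.5 0.5]
print(cond)    # columns sum to 1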


Page 32:

Alternative: Multivariate Bernoulli Naive Bayes

- p(y) is multinomial as above, and p(x_f | y) is Bernoulli for each feature.
- Fit the class distribution as categorical,

  p(y = δ(c)) = (1/n) ∑_{i=1}^{n} 1(y_i = c)

- Fit the features using the count matrix F. Let

  F_{f,c} = ∑_{i=1}^{n} 1(y_i = c) 1(x_{i,f} = 1)   for all c ∈ C, f ∈ F

  Then (sketched below),

  p(x_f = 1 | y = δ(c)) = F_{f,c} / ∑_{i=1}^{n} 1(y_i = c)
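The only change from the multinomial sketch above is the denominator: normalize by the number of class-c examples rather than by the total feature count for the class. A minimal sketch with a hypothetical count matrix:

import numpy as np

def bernoulli_conditionals(F, class_counts):
    """p(x_f = 1 | y = c) = F_{f,c} / (# examples of class c), per feature and class."""
    return F / class_counts[np.newaxis, :]

F = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])  # hypothetical count matrix F_{f,c}
class_counts = np.array([2.0, 2.0])                 # number of training examples per class
print(bernoulli_conditionals(F, class_counts))      # each entry is a Bernoulli parameter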


Page 34:

Getting a Conditional Distribution

- Generative models estimate P(X, Y); we want P(Y | X).
- Bayes' rule,

  p(y | x) = p(x | y) p(y) / p(x)

- In log-space,

  log p(y | x) = log p(x | y) + log p(y) − log p(x)

Page 35:

Prediction with Naive Bayes

- For prediction, the last term is constant, so

  arg max_c log p(y = δ(c) | x) = arg max_c [ log p(x | y = δ(c)) + log p(y = δ(c)) ]

- Can write this as a linear model (see the sketch below),

  W_{f,c} = log p(x_f = 1 | y = δ(c))   for all c ∈ C, f ∈ F
  b_c = log p(y = δ(c))   for all c ∈ C
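A sketch of this reduction to a linear model; it assumes no zero counts (otherwise the logs diverge, which is what the Laplace smoothing below addresses):

import numpy as np

def nb_to_linear(prior, cond):
    """Naive Bayes as a linear model: W_{f,c} = log p(x_f = 1 | y = c), b_c = log p(y = c)."""
    return np.log(cond), np.log(prior)   # assumes no zero probabilities (see smoothing below)

def nb_predict(x, W, b):
    return int(np.argmax(x @ W + b))

# With the hypothetical prior/cond from the earlier fitting sketch this would look like:
#   W, b = nb_to_linear(prior, cond)
#   nb_predict(np.array([1.0, 1.0, 0.0]), W, b)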

Page 36:

Getting a Conditional Distribution

- What if we want conditional probabilities?

  p(y | x) = p(x | y) p(y) / p(x) = p(x | y) p(y) / ∑_{c'∈C} p(x, y = δ(c'))

- The denominator is obtained by renormalizing,

  p(y | x) ∝ p(x | y) p(y)

Page 37:

Practical Aspects: Calculating Log-Sum-Exp

- Because of numerical issues, calculate in log-space,

  f(z)_c = log p(y = δ(c) | x) = z_c − log ∑_{c'∈C} exp(z_{c'})

  where for naive Bayes

  z = xW + b

- However, the term

  log ∑_{c'∈C} exp(z_{c'})

  is hard to calculate directly (the exponentials can overflow).

- Instead compute (see the sketch below)

  log ∑_{c'∈C} exp(z_{c'} − M) + M,   where M = max_{c'∈C} z_{c'}
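A small numpy sketch of the max-shift trick; this is the standard log-sum-exp stabilization, not code from the slides:

import numpy as np

def log_softmax(z):
    """Numerically stable log p(y = c | x): z_c - logsumexp(z), via the max-shift trick."""
    M = np.max(z)
    logsumexp = np.log(np.sum(np.exp(z - M))) + M
    return z - logsumexp

z = np.array([1000.0, 1001.0, 999.0])   # naive exp(z) would overflow here
print(log_softmax(z))                    # finite log-probabilities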


Page 39:

Digression: Zipf’s Law

Page 40:

Laplace Smoothing

A method for handling the long tail of words by redistributing mass,

- Add a value α to each element in the sample space before normalization,

  θ_s = (α + ∑_{i=1}^{n} 1(s_i = s)) / (α|S| + n)

- (Similar to a Dirichlet prior in a Bayesian interpretation.)

For naive Bayes, smooth the count matrix (see the sketch below):

  F ← α + F
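A minimal sketch of add-α smoothing applied to a hypothetical count matrix:

import numpy as np

def laplace_smooth(F, alpha=1.0):
    """Add alpha to every count before column normalization."""
    F = F + alpha
    return F / F.sum(axis=0, keepdims=True)   # smoothed p(x_f = 1 | y = c)

F = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])  # hypothetical counts F_{f,c}
print(laplace_smooth(F, alpha=1.0))                  # no zero probabilities remain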


Page 42:

Naive Bayes In Practice

- Very fast to train.
- Relatively interpretable.
- Performs quite well on small datasets (RT-S [movie review], CR [customer reports], MPQA [opinion polarity], SUBJ [subjectivity]).