Web Content Mining
By Y. Pitarch and J. Moreno
Institut de Recherche en Informatique, 2017-02-02
Text Mining and Analytics
● Text mining ~ Text Analytics
● Turn text data into high-quality information or actionable knowledge.
– Minimizes human effort (in consuming text data)
– Supplies knowledge for optimal decision making
● Related to text retrieval, which is an essential component in any text mining system
– Text retrieval can be a preprocessor for text mining
– Text retrieval is needed for knowledge provenance
Text vs. Non-Text data
● Humans as subjective “sensors”

[Diagram] A physical sensor senses the real world and reports data:
  Weather → Thermometer → 3°C, 15°F, etc.
  Locations → Geo sensor → 41°N and 120°W
  Networks → Network sensor → 0100010001100...
A human “sensor” perceives the real world and expresses it as text:
  Real World → Perceive → Human → Express → Text
The General Problem of Data Mining
[Diagram] Sensor 1 … Sensor k (including humans) observe the real world and produce non-text data (numerical, categorical, relational, video, …) and text data. General data mining software handles the non-text data (e.g., video mining), while text mining software turns the text data into actionable knowledge.

The General Problem of Text Mining

[Diagram] Real World → Text Data → Text Mining Software → Actionable Knowledge
Landscape of Text Mining and Analytics

[Diagram] Real World → Perceive → Observed World → Express (in English) → Text Data (a perspective on the world). Text mining tasks fall into four families:
1. Mining knowledge about language
2. Mining the content of text data
3. Mining knowledge about the observer
4. Inferring other real-world variables (predictive analytics)

These families map onto the course topics: natural language processing & text representation, word association mining & analysis, topic mining & analysis, opinion mining & sentiment analysis, and text-based prediction.
Basics of Text Mining
● Collections
– Documents are compressed
– Uncommon formats
– Sometimes they just don’t exist
● Documents
– A lot of preprocessing (encoding, cleaning, splitting, etc.)
– Granularity level (whole document, paragraph, sentence, etc.)
● Words
– Stemming, upper/lower case, frequent (stopwords) and infrequent words, special characters, dates, prices, names, emails, etc.
● General
– Language: the main tasks have been addressed for English (though not with 100% performance), but many other languages differ greatly from English, and some are completely without resources (regional languages)
– Lots of formulas and concepts
– The task is the main factor
– Tools: Python (NLTK), Java (OpenNLP), C, etc.
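As an illustration of these preprocessing steps, here is a minimal pure-Python sketch; the tiny stopword list and the crude suffix-stripping “stemmer” are toy stand-ins for NLTK’s stopword corpus and PorterStemmer:

```python
import re

# Toy stopword list; real pipelines use nltk.corpus.stopwords instead.
STOPWORDS = {"the", "a", "an", "is", "on", "of", "and", "to"}

def stem(word):
    # Crude illustration of stemming: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, split into word tokens, drop stopwords, stem the rest.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("A dog is chasing a boy on the playground"))
# ['dog', 'chas', 'boy', 'playground'] -- note how crude stemming mangles "chasing"
```

The mangled stem “chas” shows why preprocessing choices matter: the right stemmer (or none at all) depends on the task.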
Landscape of Text Mining and Analytics
(Focus: natural language processing & text representation)
Basic Concepts in NLP

Example sentence: “A dog is chasing a boy on the playground”

● Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
● Syntactic analysis (parsing): noun phrases, a complex verb, and a prepositional phrase combine into verb phrases and finally a sentence
● Semantic analysis: Dog(d1), Boy(b1), Playground(p1), Chasing(d1,b1,p1)
● Inference: adding the rule Scared(x) if Chasing(_,x,_) yields Scared(b1)
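The lexical-analysis step can be illustrated with a toy lexicon-based tagger covering just this sentence; real taggers (e.g., nltk.pos_tag) learn tags statistically from corpora:

```python
# Toy lexicon mapping each word of the example sentence to its POS tag.
LEXICON = {
    "a": "Det", "the": "Det", "dog": "Noun", "boy": "Noun",
    "playground": "Noun", "is": "Aux", "chasing": "Verb", "on": "Prep",
}

def tag(sentence):
    # Look each token up in the lexicon; unknown words get "?".
    return [(w, LEXICON.get(w.lower(), "?")) for w in sentence.split()]

print(tag("A dog is chasing a boy on the playground"))
# [('A', 'Det'), ('dog', 'Noun'), ('is', 'Aux'), ('chasing', 'Verb'), ...]
```

The "?" fallback hints at the real difficulty: out-of-vocabulary and ambiguous words are exactly where statistical tagging is needed.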
NLP is difficult!
● Natural language is designed to make human communication efficient. As a result,
– We omit a lot of common sense knowledge, which we assume the hearer/reader possesses.
– We keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve.
● This makes EVERY step in NLP hard
– Ambiguity is a killer!
– Common-sense reasoning is required.
Examples of Challenges
● Word-level ambiguity:
– “design” can be a noun or a verb (ambiguous POS)
– “root” has multiple meanings (ambiguous sense)
● Syntactic ambiguity:
– “natural language processing” (modification)
– “A man saw a boy with a telescope” (PP attachment)
– Anaphora resolution: “John persuaded Bill to buy a TV for himself” (himself = John or Bill?)
– Presupposition: “He has quit smoking” implies that he smoked before.
What we can’t do
● 100% accurate POS tagging
– ”He turned off the highway” vs “He turned off the fan”
● General complete parsing
– “A man saw a boy with a telescope”
● Precise deep semantic analysis
– Will we ever be able to precisely define meaning of “own” in “John owns a restaurant”?
Robust and general NLP tends to be shallow while deep understanding doesn’t scale up.
Take away message
● NLP is the foundation for text mining
● Computers are far from being able to understand natural language
– Deep NLP requires common sense knowledge and inferences, thus only working for very limited domains
– Shallow NLP based on statistical methods can be done at large scale and is thus more broadly applicable
● In practice: statistical NLP as the basis, while humans provide help as needed
Levels of representation for “A dog is chasing a boy on the playground”:

● String of characters: the raw text
● Sequence of words: A | dog | is | chasing | a | boy | on | the | playground
● POS tags: Det Noun Aux Verb Det Noun Prep Det Noun
● Syntactic structures: noun phrases, verb phrases, prepositional phrase, sentence
● Entities and relations: dog → Animal, boy → Person, playground → Location
● Logic predicates: Dog(d1), Boy(b1), Playground(p1), Chasing(d1,b1,p1)
Text Representation and Enabled Analysis

● String (generality ******): string processing → compression
● Words (****): word relation analysis, topic analysis, sentiment analysis → thesaurus discovery; topic- and opinion-related applications
● Syntactic structures (***): syntactic graph analysis → stylistic analysis, structure-based feature extraction
● Entities & relations (**): knowledge graph analysis, information network analysis → discovery of knowledge and opinions about specific entities
● Logic predicates (*): integrative analysis of scattered knowledge, logic inference → knowledge assistants for biologists
Take away message
● The text representation determines what kinds of mining algorithms can be applied
● Multiple ways of representing text are possible
– String, words, syntactic structures, entity-relation graphs, predicates, etc.
– Can/should be combined in real applications
● This course focuses mainly on word-based representation
– General and robust: applicable to any natural language
– No/little manual effort
– “Surprisingly” powerful for many applications (not all!)
– Can be combined with more sophisticated representations
● Tools
– Python NLTK
Landscape of Text Mining and Analytics
(Focus: word association mining & analysis)
Basic word relations
● Paradigmatic : A & B have paradigmatic relation if they can be substituted for each other (i.e., A & B are in the same class)
– e.g., “cat” and “dog”; “Monday” and “Tuesday”
● Syntagmatic: A & B have syntagmatic relation if they can be combined with each other (i.e., A & B are related semantically)
– e.g., “cat” and “sit”; “car” and “drive”
● These two basic and complementary relations can be generalized to describe relations between any items in a language
Why mine word associations?
● They are useful for improving accuracy of many NLP tasks
– POS tagging, parsing, entity recognition, acronym expansion
– Grammar learning
● They are directly useful for many applications in text retrieval and mining
– Text retrieval (e.g., use word associations to suggest a variation of a query)
– Automatic construction of a topic map for browsing: words as nodes and associations as edges
– Compare and summarize opinions (e.g., what words are most strongly associated with “battery” in positive and negative reviews about iPhone7, respectively?)
Mining Word Associations: Intuitions

My cat eats fish on Saturday
His cat eats turkey on Tuesday
My dog eats meat on Sunday
His dog eats turkey on Tuesday

● Cat: “My __ eats fish on Saturday”; “His __ eats turkey on Tuesday”
● Dog: “My __ eats meat on Sunday”; “His __ eats turkey on Tuesday”

Contexts can be split into left context, right context, and general context.

● Paradigmatic: How similar are context(“cat”) and context(“dog”)? How similar are context(“cat”) and context(“computer”)?
● Syntagmatic: How helpful is the occurrence of “eats” for predicting the occurrence of “meat”? How helpful is it for predicting the occurrence of “text”?
Mining Word Associations: General Ideas
● Paradigmatic
– Represent each word by its context
– Compute context similarity
– Words with high context similarity likely have paradigmatic relation
● Syntagmatic
– Count how many times two words occur together in a context (e.g., sentence or paragraph)
– Compare their co-occurrences with their individual occurrences
– Words with high co-occurrences but relatively low individual occurrences likely have syntagmatic relation
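The paradigmatic idea can be sketched on the toy “cat/dog” corpus above: represent each word by the bag of words around it, then compare those bags with cosine similarity (the code itself is illustrative):

```python
from collections import Counter
import math

# Toy corpus from the intuition slide.
corpus = [
    "my cat eats fish on saturday",
    "his cat eats turkey on tuesday",
    "my dog eats meat on sunday",
    "his dog eats turkey on tuesday",
]

def context(word):
    # Bag of all words co-occurring with `word` in the same sentence.
    ctx = Counter()
    for sent in corpus:
        tokens = sent.split()
        if word in tokens:
            ctx.update(t for t in tokens if t != word)
    return ctx

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

print(cosine(context("cat"), context("dog")))   # high: they share contexts
print(cosine(context("cat"), context("fish")))  # lower
```

High context similarity for “cat” and “dog” is exactly the signal that they stand in a paradigmatic relation.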
Distributional semantics

● Comparing two words:
– Look at all context words for word 1
– Look at all context words for word 2
– How similar are those two context collections in their entirety?
● Compare the distributional representations of the two words

How can we compare two context collections in their entirety?

Count how often “apple” occurs close to other words in a large text collection (corpus):

eat  fall ripe slice peel tree throw fruit pie bite crab
794  244  47   221   208  160  145   156   109 104  88

Interpret the counts as coordinates: every context word becomes a dimension (e.g., an “eat” axis and a “fall” axis, with “apple” as a point in that space).
Do the same for “orange”:

eat  fall ripe slice peel tree throw fruit pie bite crab
265  22   25   62    220  64   74    111   4   4    8
Then visualize both count tables as vectors in the same space: “apple” and “orange” become points along dimensions such as “eat” and “fall”, and similarity between two words corresponds to proximity in that space.
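Using the count tables above, the cosine of the angle between the “apple” and “orange” vectors can be computed directly:

```python
import math

# Count vectors from the tables above; the dimensions are the context words
# eat, fall, ripe, slice, peel, tree, throw, fruit, pie, bite, crab.
apple  = [794, 244, 47, 221, 208, 160, 145, 156, 109, 104, 88]
orange = [265, 22, 25, 62, 220, 64, 74, 111, 4, 4, 8]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(apple, orange), 3))  # → 0.881: the two fruits are close in context space
```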
Where can we find texts to use for making a distributional model?

● Text in electronic form!
● Newspaper articles
● Project Gutenberg: older books available for free
● Wikipedia
● Text collections prepared for language analysis:
– Balanced corpora
– WaC: scrape all web pages in a particular domain
– ELRA and LDC hold corpus collections (for example, large amounts of newswire reports)
● Google n-grams, Google Books
What do we mean by “similarity” of vectors?

● Euclidean distance as a dissimilarity measure
● Problem with Euclidean distance: very sensitive to word frequency! A rare word such as “Braeburn” ends up far from “apple” simply because all of its counts are small.
● Use the angle between vectors instead of point distance to get around word frequency issues
Some counts for “letter” in “Pride and Prejudice”. What do you notice?

the to of and a her she his is was in that
102 75 72 56 52 50 41 36 35 34 34 33
had i from you as this mr for not on be he
32 28 28 25 23 23 22 21 21 20 18 17
but elizabeth with him which by when jane
17 17 16 16 16 15 14 12

All the most frequent co-occurring words are function words.
Some words are more informative than others

● Function words co-occur frequently with all words
– That makes them less informative
● They have much higher co-occurrence counts than content words
– They can “drown out” more informative contexts
● Frequency
– Just selecting the most frequently occurring bigrams is a weak baseline
– A simple POS filter drastically improves the results. Keeping only word pairs with POS patterns such as “Adj Noun” yields results such as “Prime Minister”
Simple frequency-based measures: max_{x,y} p(x, y) and max_{x,y} log(p(x, y) + 1)
Some Statistical Measures

● Pointwise Mutual Information (information theory)
– The PMI tells us the amount of information that the occurrence of one word provides about the occurrence of the other word:

  PMI(x, y) = log2( p(x, y) / (p(x) p(y)) )

● Pearson’s chi-square test (hypothesis testing)
– The essence of the test is to compare the observed frequencies with the frequencies expected under independence in a contingency table
– It tries to refute the null hypothesis H0 : p(x, y) = p(x) p(y). The higher the score, the more confidently H0 can be rejected
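The PMI formula translates directly into code; the segment counts below are made up purely for illustration:

```python
import math

def pmi(n_xy, n_x, n_y, n):
    # n: total number of text segments; n_x, n_y: segments containing x / y;
    # n_xy: segments containing both.  PMI = log2( p(x,y) / (p(x) p(y)) ).
    p_xy = n_xy / n
    p_x, p_y = n_x / n, n_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: "eats" in 4 of 10 segments, "meat" in 2, both in 2.
print(pmi(2, 4, 2, 10))  # positive: the words attract each other

# When p(x, y) = p(x) p(y) (independence), PMI is exactly 0.
print(pmi(2, 4, 5, 10))
```

A positive PMI signals a syntagmatic association; zero means independence, and negative values mean the words repel each other.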
Other Measures
● There exist many association measures, and many more have been proposed since 2006 (or are not included here)
Take away message
● You know the formulas... so you can code them yourself!
● Depending on the task, preprocessing can be a critical phase (e.g., “orange” vs. “oranges”)
● Different granularity levels can be used: document, paragraph, sentence, etc.
● Python
– nltk.metrics.association module
● How do we know what is better?
– Evaluation!!! Many datasets allow evaluating the performance of existing algorithms (e.g., SemEval datasets, semantic and syntactic similarity datasets)
Landscape of Text Mining and Analytics
(Focus: topic mining & analysis)
Topic Mining and Analysis: Motivation

● Topic ≈ main idea discussed in text data
– Theme/subject of a discussion or conversation
– Different granularities (e.g., topic of a sentence, an article, etc.)
● Many applications require discovery of topics in text
– What are Twitter users talking about today?
– What are the current research topics in data mining? How are they different from those 5 years ago?
– What do people like about the iPhone 7? What do they dislike?
– What were the major topics debated in the 2016 presidential election?
Task of Topic Mining and Analysis

[Diagram] Text data (Doc 1, Doc 2, …, Doc N) mapped to Topic 1, Topic 2, …, Topic k
● Task 1: Discover the k topics
● Task 2: Figure out which documents cover which topics
[Diagram] When topics are discovered as groups of words (w1,w3,w5; w8,w17; w2,w4,w9; …), this is an unsupervised problem (clustering).
[Diagram] When topics are predefined classes (Class 1, Class 2, …, Class k), this is a supervised problem (classification).
Topic mining and Analysis
● Group of documents (unsupervised)
– Clustering
● Group of documents but into classes (supervised)
– Classification
● ? Group of words – dimensionality reduction (we will see these in the unsupervised part)
– NMF, LDA, PLSA, etc.
Clustering

● Clustering documents is a way to discover related documents and topics
● It is the process of grouping a set of objects into classes of similar objects
– Documents within a cluster should be similar
– Documents from different clusters should be dissimilar
● Many existing algorithms
– K-means, E.M., etc.
● A representation for each document is required (not always)
– Term-document or document-term matrix
– Raw frequency is frequently used, but many other weighting models are available (tf-idf). Their use depends on the chosen algorithm
Issues for clustering

● Representation for clustering
– Document representation
– Vector space? Normalization?
– Centroids aren’t length-normalized
– Need a notion of similarity/distance
● How many clusters?
– Fixed a priori?
– Completely data-driven?
● Avoid “trivial” clusters: too large or too small
Term-document matrix (tf-idf weights)

           Anthony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
anthony    5.25               3.18           0.0          0.0     0.0      0.35
brutus     1.21               6.10           0.0          1.0     0.0      0.0
caesar     8.59               2.54           0.0          1.51    0.25     0.0
calpurnia  0.0                1.54           0.0          0.0     0.0      0.0
cleopatra  2.85               0.0            0.0          0.0     0.0      0.0
mercy      1.51               0.0            1.90         0.12    5.25     0.88
worser     1.37               0.0            0.11         4.15    0.25     1.95

The values used in these matrices are known as the weighting scheme. They alter the obtained results (clustering or classification). Often the raw frequency is used, but other options are also available.

Why is frequency not the best option? Using frequencies, we obtain the vector {73, 157, 227, 10, 0, 0, 0} to represent the Julius Caesar document. Document size strongly affects similarity values!!!
Binary matrix

           Anthony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
anthony    1                  1              0            0       0        1
brutus     1                  1              0            1       0        0
caesar     1                  1              0            1       1        1
calpurnia  0                  1              0            0       0        0
cleopatra  1                  0              0            0       0        0
mercy      1                  0              1            1       1        1
worser     1                  0              1            1       1        0

Other common weighting schemes include TF-IDF.
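Both weighting schemes can be built from raw text in a few lines; this sketch uses a toy two-document corpus (in practice, scikit-learn’s CountVectorizer and TfidfVectorizer do this at scale):

```python
from collections import Counter

# Toy corpus: document name -> preprocessed text.
docs = {
    "Julius Caesar": "brutus and caesar noble brutus",
    "Hamlet": "words words words",
}

def term_doc_matrix(docs, binary=False):
    # Rows are documents, columns follow the sorted vocabulary.
    vocab = sorted({w for text in docs.values() for w in text.split()})
    matrix = {}
    for name, text in docs.items():
        counts = Counter(text.split())
        matrix[name] = [min(counts[w], 1) if binary else counts[w] for w in vocab]
    return vocab, matrix

vocab, tf = term_doc_matrix(docs)
_, bin_m = term_doc_matrix(docs, binary=True)
print(vocab)                   # ['and', 'brutus', 'caesar', 'noble', 'words']
print(tf["Julius Caesar"])     # [1, 2, 1, 1, 0]  raw frequencies
print(bin_m["Julius Caesar"])  # [1, 1, 1, 1, 0]  presence/absence
```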
Documents as vectors

● So we have a |V|-dimensional vector space
● Terms are the axes of the space
● Documents are points or vectors in this space
● Very high-dimensional: millions of dimensions when applied to big collections
● These are very sparse vectors: most entries are zero
Notion of similarity/distance

● Ideal: semantic similarity
● Practical: term-statistical similarity
– Cosine similarity
– Docs as vectors
– For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs
– It is easier to talk about Euclidean distance, but real implementations use cosine similarity
Hard vs. soft clustering

● Hard clustering: each document belongs to exactly one cluster
– More common and easier to do
● Soft clustering: a document can belong to more than one cluster
– Makes more sense for applications like creating browsable hierarchies
– You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
– You can only do that with a soft clustering approach
What Is a Good Clustering?

● Internal criterion: a good clustering will produce high-quality clusters in which:
– the intra-class (that is, intra-cluster) similarity is high
– the inter-class similarity is low
– the measured quality of a clustering depends on both the document representation and the similarity measure used
● External criterion: the quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes
– Assessable with gold-standard data
Clustering algorithms

● K-means
● Latent Semantic Analysis (LSA)
● Non-negative matrix factorization
● Latent Dirichlet allocation
● Many others exist, but these are the main ones used in text clustering
K-means

● It is a hard-clustering algorithm
● Assumes documents are real-valued vectors
● Clusters based on centroids (a.k.a. the center of gravity or mean) of the points in a cluster c:

  μ(c) = (1/|c|) Σ_{x ∈ c} x

● Reassignment of instances to clusters is based on the distance to the current cluster centroids (or one can equivalently phrase it in terms of similarities)
K-Means Algorithm

● Select K random docs {s1, s2, …, sK} (or points) as seeds
● Until the clustering converges (or another stopping criterion is met):
– For each doc di:
  ● Assign di to the cluster cj such that dist(di, sj) is minimal
– Then, update the seeds to the centroid of each cluster:
  ● For each cluster cj: sj = μ(cj)
[Figure] Iterations: seeding → assign → update → … → final partition

Notes: the default in scikit-learn uses the k-means++ initialization. When to stop? Based on a fixed number of iterations, or on an optimization criterion.
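The seeding/assign/update loop above can be sketched in plain Python; toy 2-D points stand in for document vectors, and scikit-learn’s KMeans remains the practical choice:

```python
import math
import random

def kmeans(points, k, seed=0):
    # Seeding: pick k random points as initial centroids.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    while True:
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when assignments no longer change.
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

# Two tight groups of points; K-means should separate them.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, labels = kmeans(points, k=2)
print(labels)  # the two tight groups end up in different clusters
```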
Take away message

● Easy to implement (already implemented in many frameworks, tools, libraries, etc.)
● Initialization is an important issue (use k-means++)
● Distance is also important!!
● Each cluster is considered a topic of the collection. However, its terms are unknown
● It has been extended in many ways (soft clustering, labels, similarity, etc.)
● How to introduce semantic similarity?
Latent Dirichlet Allocation

● Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents
● Among these algorithms, LDA, a technique based on Bayesian modeling, is the most commonly used nowadays
● Topic models can be applied to massive collections of documents to automatically organize, understand, search, and summarize large electronic archives
● Especially relevant in today’s “Big Data” environment
Motivation for topic models

LDA basics

● Each topic is a distribution over words; each document is a mixture of corpus-wide topics; and each word is drawn from one of those topics
● The goal is to infer the hidden variables, i.e., to compute their distribution conditioned on the documents:

  p(topics, proportions, assignments | documents)
LDA model: estimating the model

● Notation (plate diagram)
– Nodes are random variables; edges indicate dependence
– Shaded nodes are observed
– Plates indicate replicated variables
● From a collection of documents, infer
– Per-word topic assignment z_{d,n}
– Per-document topic proportions θ_d
– Per-corpus topic distributions β_k
● Approximate posterior inference algorithms
– Mean-field variational methods
– Expectation propagation
– Collapsed Gibbs sampling
– Collapsed variational inference
– Online variational inference
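One of the listed inference algorithms, collapsed Gibbs sampling, can be sketched on a toy corpus. This is a didactic miniature, not a practical implementation; alpha and beta are the usual Dirichlet hyperparameters:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})       # vocabulary size
    ndk = [[0] * k for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # topic totals
    z = []                                      # z[d][n]: topic of word n in doc d
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    # Gibbs sweeps: resample each word's topic given all the others.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # p(topic j | everything else), up to a constant.
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[d][n] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk  # per-document topic counts

docs = [["apple", "fruit", "pie"], ["apple", "fruit"],
        ["goal", "match", "team"], ["match", "team"]]
doc_topics = lda_gibbs(docs, k=2)
print(doc_topics)  # per-document topic counts
```

On this corpus the food documents and the sports documents tend to concentrate on different topics; production work would use scikit-learn’s LatentDirichletAllocation or gensim instead.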
Take away message

● LDA is a topic model based on the probabilistic version of LSI
● The estimation of the parameters in the model is performed based on the observed data and the prior hyperparameters
● Clustering can be performed using LDA by choosing the most representative topic for each document and fixing the desired number of topics as the number of clusters
More about clustering

● Many algorithms are available for clustering; however, many of them are not suitable for text
● Classical algorithms usually perform well, but state-of-the-art algorithms can perform much better if they are well parametrized (no free lunch)
● Clustering is a hard task, and few algorithms scale well to huge text collections. If hierarchies are needed, bottom-up or top-down strategies can be applied (combined with the presented algorithms)
● There exist problems in which labels are needed!!! LDA is a good solution, but other simpler algorithms can do the job (STC, Lingo, etc.)
● As in any other data mining task, the best clustering algorithm is determined by the problem, not by choosing the most popular one
How to evaluate clustering?

● Evaluation is a hard task
● Best scenario: an annotated collection is available
– Pairs of documents are used to evaluate the partitions, by asking whether they belong to the same partition in the annotated data and in the obtained partition (for any clustering algorithm)
● Extreme cases are hard to evaluate
– All documents belong to just one cluster
– All documents belong to individual clusters

Comparison of clustering evaluation metrics
Take away message

● Evaluate if you have annotated data
– Metric selection is an important issue
– Try to understand the addressed problem to select a metric adapted to it
● Tools also implement evaluation metrics
– Clustering performance evaluation module in scikit-learn
– Clusteval in R
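The pair-counting evaluation described above (do two documents land in the same group in both the gold partition and the obtained partition?) is the Rand index; a minimal sketch:

```python
from itertools import combinations

def rand_index(gold, pred):
    # For every pair of documents, check whether gold labels and cluster
    # labels agree on the "same group or not" question.
    agree = sum(
        (gold[i] == gold[j]) == (pred[i] == pred[j])
        for i, j in combinations(range(len(gold)), 2)
    )
    return agree / (len(gold) * (len(gold) - 1) / 2)

gold = ["sports", "sports", "politics", "politics"]
pred = [0, 0, 1, 1]   # perfect clustering: label names may differ from gold
print(rand_index(gold, pred))   # 1.0
pred2 = [0, 1, 0, 1]  # every gold pair is split or wrongly merged
print(rand_index(gold, pred2))  # ≈ 0.33
```

scikit-learn ships chance-corrected variants of this idea (e.g., adjusted_rand_score), which behave better on the extreme cases mentioned above.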
Why is preprocessing important?
Prime Minister Theresa May has jetted out to Davos for crunch talks with top bankers following a landmark speech this week in which she set out her vision for Brexit.
May will meet chief executives from banks such as Goldman Sachs and JP Morgan a day after two other bosses – HSBC’s Stuart Gulliver and UBS investment chief Andrea Orcel – confirmed that between them they would shift up to 2,000 jobs out of London when Britain leaves the EU.
Brexit will be top of the agenda when May hosts a roundtable including Goldman’s Lloyd Blankfein and JP Morgan’s Jamie Dimon, both of whom also landed in Switzerland yesterday. However, City experts do not believe there will be an exodus of banking jobs from London, despite the HSBC and UBS plans.
“It’s all positioning, it’s all playing to the gallery,” added Simon French, chief economist at Panmure Gordon.
French said other banks are unlikely to make similar announcements – partly because they risk prompting staff to switch to rivals that guarantee permanent jobs in London.
POS tagging yields NNP, MD, VB, or MD, MD, VB if the text is lowercased (e.g., “May” is tagged as a proper noun, but “may” as a modal verb). Which representation and which granularity should we use?
Landscape of Text Mining and Analytics
(Focus: topic mining & analysis)
Task of Topic Mining and Analysis

[Diagram] Text data (Doc 1, Doc 2, …, Doc N) mapped to predefined classes (Class 1, Class 2, …, Class k): a supervised problem (classification).
Topic mining and Analysis
● Group of documents (unsupervised)
– Clustering
● Group of documents but into classes (supervised)
– Classification or categorization
● ? Group of words – dimensionality reduction (we will see these in the unsupervised part)
– NMF, LDA, PLSA, etc.
Text classification or categorization
● Given the following
– A set of predefined categories, possibly forming a hierarchy
– Often, a training set of labeled text objects
● Task: classify a text object into one or more of the categories
Examples of text categorization
● Text objects can vary (e.g., documents, passages, or collections of text)
● Categories can also vary
– Internal categories that characterize a text object (e.g., topical categories, sentiment categories)
– External categories that characterize an entity associated with the object (e.g., author attribution or any other meaningful categories associated with text data)
● Some examples of applications
– News categorization, literature article categorization (e.g., MeSH annotations)
– Spam email detection/filtering
– Sentiment categorization of product reviews or tweets
– Automatic email sorting/routing
– Author attribution
Variants of problem formulation

● Binary categorization: only two categories
– Retrieval (relevant or not)
– Spam filtering (spam or not)
– Opinion (negative or positive)
● K-category categorization: more than two categories
– Topic categorization (sports, science, travel, business, etc.)
– Email routing (folder 1, folder 2, etc.)
● Hierarchical categorization: categories form a hierarchy
● Joint categorization: multiple related categorization tasks done in a joint manner
Why text categorization?
● To enrich text representation (more understanding of text)
– Text can now be represented in multiple levels (keywords+categories)
– The assigned semantic categories can be directly or indirectly useful for an application
– Semantic categories facilitate aggregation of text content (e.g., aggregating all positive /negative opinions about a product)
● To infer properties of entities associated with text data (discovery of knowledge about the world)
– As long as an entity can be associated with text data, we can always use the text data to help categorize the associated entity
– E.g., discovery of non-native speakers of a language; prediction of party affiliation based on a political speech
Categorization methods: manual
● Determine the category based on rules that are carefully designed to reflect the domain knowledge about the categorization problem
● Works well when
– The categories are very well defined
– Categories are easily distinguished based on surface features in text (e.g., special vocabulary is known to only occur in a particular category)
– Sufficient domain knowledge is available to suggest many effective rules
● Problems
– Labor intensive -> doesn't scale up well
– Can't handle uncertainty in rules; rules may be inconsistent -> not robust
● Both problems can be solved/alleviated by using machine learning
Categorization methods: automatic
● Use human experts to
– Annotate data sets with category labels -> training data
– Provide a set of features to represent each text object that can potentially provide a "clue" about the category
– Use machine learning to learn "soft rules" for separating different categories
– Figure out which features are most useful for separating different categories
– Optimally combine the features to minimize the errors of categorization on the training data
– The trained classifier can then be applied to a new text object to predict the most likely category (that a human expert would assign to it)
Machine learning for text categorization

● General setup: learn a classifier f: X → Y
– Input: X = all objects; output: Y = all categories
– Learn a classifier function f: X → Y such that f(x) = y ∈ Y gives the correct category for x ∈ X (correct is based on the training data)
● All methods
– Rely on discriminative features of text objects to distinguish categories
– Combine multiple features in a weighted manner
– Adjust the weights on features to minimize errors on the training data
● Different methods tend to vary in
– Their way of measuring the errors on the training data (they may optimize a different objective/loss/cost function)
– Their way of combining features (e.g., linear vs. non-linear)
Generative vs Discriminative Classifiers
● Generative classifiers (learn what the data "looks" like in each category)
– Attempt to model p(X,Y)=p(Y)P(X|Y) and compute p(Y|X) based on p(X|Y) and p(Y) by using the Bayes rule
– Objective function is likelihood, thus indirectly measuring training error
– E.g., Naive Bayes
● Discriminative classifiers (learn what features separate categories)
– Attempt to model p(Y|X) directly
– Objective function directly measures errors of categorization on training data
– E.g., logistic regression, Support vector machines (SVM), k-Nearest Neighbors (kNN)
Text Categorization with Naïve Bayes

● Consider each category independently as a class c (for the multiple-class setting)
– Example d: a document
– Feature w: a word or term
– Classify as c if score(c) > θ
  ● Typically a specifically tuned threshold for each class, due to the inaccuracy of the probabilistic estimate of P(d|c) with the given training statistics and the independence assumption...
  ● ... but a biased probability estimate for c may still correlate well with the classification decision

  score(c) = log( P(c|d) / P(~c|d) ) = Σ_{w ∈ d} log( P(w|c) / P(w|~c) ) + log( P(c) / P(~c) )
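A minimal scorer following the log-odds formula above, with add-one smoothing on a toy review corpus (the documents and the threshold θ = 0 are illustrative):

```python
import math

# Toy training data for class c = "positive" vs ~c = "negative".
pos_docs = [["great", "battery", "great"], ["great", "screen"]]
neg_docs = [["bad", "battery"], ["bad", "bad", "screen"]]
vocab = {w for d in pos_docs + neg_docs for w in d}

def word_logodds(word):
    # log P(w|c)/P(w|~c) with add-one (Laplace) smoothing.
    p_wc = (sum(d.count(word) for d in pos_docs) + 1) / (sum(map(len, pos_docs)) + len(vocab))
    p_wn = (sum(d.count(word) for d in neg_docs) + 1) / (sum(map(len, neg_docs)) + len(vocab))
    return math.log(p_wc / p_wn)

prior = math.log(len(pos_docs) / len(neg_docs))  # log P(c)/P(~c)

def score(doc):
    # score(c) = sum of per-word log-odds plus the class prior log-odds.
    return sum(word_logodds(w) for w in doc) + prior

print(score(["great", "battery"]) > 0)  # True: classified positive
print(score(["bad", "screen"]) > 0)     # False: classified negative
```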
Nearest-Neighbor Learning Algorithm

● Learning is just storing the representations of the training examples in a data set D
● Testing instance x:
– Compute the similarity between x and all examples in D
– Assign x the category of the most similar examples in D
● Does not explicitly compute a generalization or category prototypes (i.e., no “modeling”)
● Also called:
– Case-based
– Memory-based
– Lazy learning
K Nearest Neighbor for Text

● Training:
– For each training example <x, c(x)> ∈ D, compute the corresponding TF-IDF vector dx for document x
● Test instance y:
– Compute the TF-IDF vector d for document y
– For each <x, c(x)> ∈ D, let sx = cos(d, dx)
– Sort the examples x in D by decreasing value of sx
– Let N be the first k examples in D (the most similar neighbors)
– Return the majority class of the examples in N

Simple but powerful in very large collections!
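The procedure above, sketched with raw term-frequency vectors and cosine similarity on a toy training set (a real system would use TF-IDF, as the slide says):

```python
import math
from collections import Counter

# Toy labeled training set: (document text, class).
train = [
    ("the team won the match", "sports"),
    ("a great match and a great goal", "sports"),
    ("parliament passed a budget", "politics"),
    ("budget vote in parliament", "politics"),
]

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn(text, k=3):
    # Score every training doc against the query, keep the k most similar,
    # and return the majority class among those neighbors.
    query = Counter(text.split())
    sims = sorted(
        ((cosine(query, Counter(doc.split())), label) for doc, label in train),
        reverse=True,
    )
    top = [label for _, label in sims[:k]]
    return Counter(top).most_common(1)[0][0]

print(knn("who won the match"))  # sports
```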
KNN discussion

● No feature selection necessary
● No training necessary
● Scales well with a large number of classes/documents
– No need to train n classifiers for n classes
● Classes can influence each other
– Small changes to one class can have a ripple effect
● Done naively, very expensive at test time
Support Vector Machine (SVM)

● SVMs maximize the margin around the separating hyperplane
– A.k.a. large-margin classifiers
● The decision function is fully specified by a subset of the training samples, the support vectors
● Solving SVMs is a quadratic programming problem
● Seen by many as the most successful current text classification method

[Figure] The support vectors define the maximum-margin hyperplane; other separating hyperplanes give a narrower margin.
Non-linear SVMs: Feature spaces

● General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → φ(x)
Parameters trick?

● Over-fitting is a typical problem
Evaluation

● Recall: fraction of docs in class i classified correctly
● Precision: fraction of docs assigned class i that are actually about class i
● Accuracy (1 - error rate): fraction of docs classified correctly
● Other metrics exist, but again, the problem defines the appropriate metrics
Cross validation example

● Split the data into 5 samples
● Fit a model to the training samples and use the held-out test sample to calculate a CV metric
● Repeat the process for the next sample, until every sample has been used once as the test set
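The 5-fold split can be sketched without libraries (in practice, scikit-learn’s cross_val_score wraps this loop):

```python
def kfold_indices(n, k=5):
    # Partition indices 0..n-1 into k disjoint folds; each fold serves as the
    # test set once, while the remaining folds form the training set.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

n = 10
for train, test in kfold_indices(n, k=5):
    # Here one would fit the classifier on `train` and score it on `test`;
    # the CV metric is the average of the 5 per-fold scores.
    assert set(train) | set(test) == set(range(n))
    assert not set(train) & set(test)
print("every sample used for testing exactly once")
```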
Examples within the landscape of Text Mining and Analytics
http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
Corpus (Wikipedia)
https://sites.google.com/site/rmyeid/projects/polyglot
Landscape of Text Mining and Analytics
[Diagram: the real world is perceived (through a perspective) as an observed world, which is expressed (in English) as text data; the tasks map onto this pipeline: natural language processing & text representation, word association mining & analysis, topic mining & analysis, opinion mining & sentiment analysis, and text-based prediction]
import nltk.collocations
import collections

# Read the whole corpus as a single string (file "full.txt")
f = open("full.txt")
documents = " ".join([line for line in f.readlines()])
f.close()

# Collect bigram statistics, ignoring stopwords and very short tokens
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(documents.split())
ignored_words = nltk.corpus.stopwords.words('english')
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)

# Top-10 bigrams under different association measures
finder.nbest(bgm.raw_freq, 10)
finder.nbest(bgm.pmi, 10)
finder.nbest(bgm.likelihood_ratio, 10)
finder.nbest(bgm.chi_sq, 10)
finder.nbest(bgm.dice, 10)
finder.nbest(bgm.fisher, 10)
finder.nbest(bgm.jaccard, 10)
finder.nbest(bgm.mi_like, 10)
finder.nbest(bgm.poisson_stirling, 10)
finder.nbest(bgm.student_t, 10)

# All bigrams with their PMI scores
scored = finder.score_ngrams(bgm.pmi)
Landscape of Text Mining and Analytics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.cluster import KMeans

# Read the corpus, one document per line (file "full.txt.head")
f = open("full.txt.head")
documents = [line for line in f.readlines()]
f.close()

# Raw term counts for LDA, TF-IDF weights for NMF and k-means
tf_v = CountVectorizer(max_df=0.95, min_df=2, max_features=100000, stop_words='english')
tfidf_v = TfidfVectorizer(max_df=0.95, min_df=2, max_features=100000, stop_words='english')

tf = tf_v.fit_transform(documents)
tfidf = tfidf_v.fit_transform(documents)

# Three ways to mine 10 topics/clusters from the same collection
# (n_topics and alpha were renamed n_components and alpha_W in newer scikit-learn)
nmf = NMF(n_components=10, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
km = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1).fit(tfidf)
lda = LatentDirichletAllocation(n_topics=10, max_iter=5, learning_method='online',
                                learning_offset=50., random_state=0).fit(tf)
Landscape of Text Mining and Analytics
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
import random

# Labeled corpus: NLTK movie reviews, pos -> 1, neg -> 0
from nltk.corpus import movie_reviews
docs = [(list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]
random.shuffle(docs)
X, y = [" ".join(w[0]) for w in docs], [1 if w[1] == 'pos' else 0 for w in docs]

tfidf_v = TfidfVectorizer(max_df=0.95, min_df=2, max_features=100000, stop_words='english')
tfidf = tfidf_v.fit_transform(X)

clf = SVC()
clf.fit(tfidf, y)

# transform() expects a list of documents, not .split(),
# which would treat every single word as its own document
review = "One of my all time favorites. Shawshank Redemption is a very moving story about hope and the power of friendship. The cast is first rate with everyone giving a great performance. Tim Robbins and Morgan Freeman carry the movie, but Bob Gunton and Clancy Brown are perfect as the Warden Norton and prison guard captain Hadley respectively. And James Whitmore's portrail of an elderly inmate Brooks is moving. The screenplay gives almost every actor at least one or more memorable lines through out the film. As well as a very surprising twist near the end that almost knocked me out of my chair. If you have not seen this movie rent it or better yet buy it. As I bet you'll want to see this one more than once."
tfidf_test = tfidf_v.transform([review])
print(clf.predict(tfidf_test))