Web Content Mining
By Y. Pitarch and J. Moreno
Institut de Recherche en Informatique, 2017-02-02
Text Mining and Analytics
● Text mining ~ Text Analytics
● Turn text data into high-quality information or actionable knowledge.
– Minimizes human effort (in consuming text data)
– Supplies knowledge for optimal decision making
● Related to text retrieval, which is an essential component in any text mining system
– Text retrieval can be a preprocessor for text mining
– Text retrieval is needed for knowledge provenance
Text vs. Non-Text data
● Humans as subjective “sensors”

[Diagram] A physical sensor senses the real world and reports data:
  Weather → Thermometer → 3°C, 15°F, etc.
  Locations → Geo sensor → 41°N and 120°W
  Networks → Network sensor → 0100010001100...
A human “sensor” perceives the real world and expresses it as text:
  Real World → Perceive → Human → Express → Text
The General Problem of Data Mining
[Diagram] Sensor 1 … Sensor k (including humans) observe the real world and produce non-text data (numerical, categorical, relational, video, …) and text data. General data mining software handles the non-text data (e.g., video mining), while text mining software turns the text data into actionable knowledge.

The General Problem of Text Mining

[Diagram] Real World → Text Data → Text Mining Software → Actionable Knowledge
Landscape of Text Mining and Analytics

[Diagram] Real World → Perceive → Observed World → Express (in English) → Text Data (a perspective on the world). Text mining tasks fall into four families:
1. Mining knowledge about language
2. Mining the content of text data
3. Mining knowledge about the observer
4. Inferring other real-world variables (predictive analytics)

These families map onto the course topics: natural language processing & text representation, word association mining & analysis, topic mining & analysis, opinion mining & sentiment analysis, and text-based prediction.
Basics of Text Mining
● Collections
– Documents are compressed
– Uncommon formats
– Sometimes they just don’t exist
● Documents
– A lot of preprocessing (encoding, cleaning, splitting, etc.)
– Granularity level (whole document, paragraph, sentence, etc.)
● Words
– Stemming, upper/lower case, frequent (stopwords) and infrequent words, special characters, dates, prices, names, emails, etc.
● General
– Language: the main tasks have been addressed for English (though not with 100% performance), but many other languages differ greatly from English, and some are completely without resources (regional languages)
– Lots of formulas and concepts
– The task is the main factor
– Tools: Python (NLTK), Java (OpenNLP), C, etc.
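As an illustration of these preprocessing steps, here is a minimal pure-Python sketch; the tiny stopword list and the crude suffix-stripping “stemmer” are toy stand-ins for NLTK’s stopword corpus and PorterStemmer:

```python
import re

# Toy stopword list; real pipelines use nltk.corpus.stopwords instead.
STOPWORDS = {"the", "a", "an", "is", "on", "of", "and", "to"}

def stem(word):
    # Crude illustration of stemming: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, split into word tokens, drop stopwords, stem the rest.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("A dog is chasing a boy on the playground"))
# ['dog', 'chas', 'boy', 'playground'] -- note how crude stemming mangles "chasing"
```

The mangled stem “chas” shows why preprocessing choices matter: the right stemmer (or none at all) depends on the task.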
Landscape of Text Mining and Analytics
(Focus: natural language processing & text representation)
Basic Concepts in NLP

Example sentence: “A dog is chasing a boy on the playground”

● Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
● Syntactic analysis (parsing): noun phrases, a complex verb, and a prepositional phrase combine into verb phrases and finally a sentence
● Semantic analysis: Dog(d1), Boy(b1), Playground(p1), Chasing(d1,b1,p1)
● Inference: adding the rule Scared(x) if Chasing(_,x,_) yields Scared(b1)
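The lexical-analysis step can be illustrated with a toy lexicon-based tagger covering just this sentence; real taggers (e.g., nltk.pos_tag) learn tags statistically from corpora:

```python
# Toy lexicon mapping each word of the example sentence to its POS tag.
LEXICON = {
    "a": "Det", "the": "Det", "dog": "Noun", "boy": "Noun",
    "playground": "Noun", "is": "Aux", "chasing": "Verb", "on": "Prep",
}

def tag(sentence):
    # Look each token up in the lexicon; unknown words get "?".
    return [(w, LEXICON.get(w.lower(), "?")) for w in sentence.split()]

print(tag("A dog is chasing a boy on the playground"))
# [('A', 'Det'), ('dog', 'Noun'), ('is', 'Aux'), ('chasing', 'Verb'), ...]
```

The "?" fallback hints at the real difficulty: out-of-vocabulary and ambiguous words are exactly where statistical tagging is needed.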
NLP is difficult!
● Natural language is designed to make human communication efficient. As a result,
– We omit a lot of common sense knowledge, which we assume the hearer/reader possesses.
– We keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve.
● This makes EVERY step in NLP hard
– Ambiguity is a killer!
– Common-sense reasoning is required.
Examples of Challenges
● Word-level ambiguity:
– “design” can be a noun or a verb (ambiguous POS)
– “root” has multiple meanings (ambiguous sense)
● Syntactic ambiguity:
– “natural language processing” (modification)
– “A man saw a boy with a telescope” (PP attachment)
– Anaphora resolution: “John persuaded Bill to buy a TV for himself” (himself = John or Bill?)
– Presupposition: “He has quit smoking” implies that he smoked before.
What we can’t do
● 100% accurate POS tagging
– ”He turned off the highway” vs “He turned off the fan”
● General complete parsing
– “A man saw a boy with a telescope”
● Precise deep semantic analysis
– Will we ever be able to precisely define meaning of “own” in “John owns a restaurant”?
Robust and general NLP tends to be shallow while deep understanding doesn’t scale up.
Take away message
● NLP is the foundation for text mining
● Computers are far from being able to understand natural language
– Deep NLP requires common sense knowledge and inferences, thus only working for very limited domains
– Shallow NLP based on statistical methods can be done at large scale and is thus more broadly applicable
● In practice: statistical NLP as the basis, while humans provide help as needed
Levels of representation for “A dog is chasing a boy on the playground”:

● String of characters: the raw text
● Sequence of words: A | dog | is | chasing | a | boy | on | the | playground
● POS tags: Det Noun Aux Verb Det Noun Prep Det Noun
● Syntactic structures: noun phrases, verb phrases, prepositional phrase, sentence
● Entities and relations: dog → Animal, boy → Person, playground → Location
● Logic predicates: Dog(d1), Boy(b1), Playground(p1), Chasing(d1,b1,p1)
Text Representation and Enabled Analysis

● String (generality ******): string processing → compression
● Words (****): word relation analysis, topic analysis, sentiment analysis → thesaurus discovery; topic- and opinion-related applications
● Syntactic structures (***): syntactic graph analysis → stylistic analysis, structure-based feature extraction
● Entities & relations (**): knowledge graph analysis, information network analysis → discovery of knowledge and opinions about specific entities
● Logic predicates (*): integrative analysis of scattered knowledge, logic inference → knowledge assistants for biologists
Take away message
● The text representation determines what kinds of mining algorithms can be applied
● Multiple ways of representing text are possible
– String, words, syntactic structures, entity-relation graphs, predicates, etc.
– Can/should be combined in real applications
● This course focuses mainly on word-based representation
– General and robust: applicable to any natural language
– No/little manual effort
– “Surprisingly” powerful for many applications (not all!)
– Can be combined with more sophisticated representations
● Tools
– Python NLTK
Landscape of Text Mining and Analytics
(Focus: word association mining & analysis)
Basic word relations
● Paradigmatic : A & B have paradigmatic relation if they can be substituted for each other (i.e., A & B are in the same class)
– e.g., “cat” and “dog”; “Monday” and “Tuesday”
● Syntagmatic: A & B have syntagmatic relation if they can be combined with each other (i.e., A & B are related semantically)
– e.g., “cat” and “sit”; “car” and “drive”
● These two basic and complementary relations can be generalized to describe relations between any items in a language
Why mine word associations?
● They are useful for improving accuracy of many NLP tasks
– POS tagging, parsing, entity recognition, acronym expansion
– Grammar learning
● They are directly useful for many applications in text retrieval and mining
– Text retrieval (e.g., use word associations to suggest a variation of a query)
– Automatic construction of a topic map for browsing: words as nodes and associations as edges
– Compare and summarize opinions (e.g., what words are most strongly associated with “battery” in positive and negative reviews about iPhone7, respectively?)
Mining Word Associations: Intuitions

My cat eats fish on Saturday
His cat eats turkey on Tuesday
My dog eats meat on Sunday
His dog eats turkey on Tuesday

● Cat: “My __ eats fish on Saturday”; “His __ eats turkey on Tuesday”
● Dog: “My __ eats meat on Sunday”; “His __ eats turkey on Tuesday”

Contexts can be split into left context, right context, and general context.

● Paradigmatic: How similar are context(“cat”) and context(“dog”)? How similar are context(“cat”) and context(“computer”)?
● Syntagmatic: How helpful is the occurrence of “eats” for predicting the occurrence of “meat”? How helpful is it for predicting the occurrence of “text”?
Mining Word Associations: General Ideas
● Paradigmatic
– Represent each word by its context
– Compute context similarity
– Words with high context similarity likely have paradigmatic relation
● Syntagmatic
– Count how many times two words occur together in a context (e.g., sentence or paragraph)
– Compare their co-occurrences with their individual occurrences
– Words with high co-occurrences but relatively low individual occurrences likely have syntagmatic relation
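The paradigmatic idea can be sketched on the toy “cat/dog” corpus above: represent each word by the bag of words around it, then compare those bags with cosine similarity (the code itself is illustrative):

```python
from collections import Counter
import math

# Toy corpus from the intuition slide.
corpus = [
    "my cat eats fish on saturday",
    "his cat eats turkey on tuesday",
    "my dog eats meat on sunday",
    "his dog eats turkey on tuesday",
]

def context(word):
    # Bag of all words co-occurring with `word` in the same sentence.
    ctx = Counter()
    for sent in corpus:
        tokens = sent.split()
        if word in tokens:
            ctx.update(t for t in tokens if t != word)
    return ctx

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

print(cosine(context("cat"), context("dog")))   # high: they share contexts
print(cosine(context("cat"), context("fish")))  # lower
```

High context similarity for “cat” and “dog” is exactly the signal that they stand in a paradigmatic relation.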
Distributional semantics

● Comparing two words:
– Look at all context words for word 1
– Look at all context words for word 2
– How similar are those two context collections in their entirety?
● Compare the distributional representations of the two words

How can we compare two context collections in their entirety?

Count how often “apple” occurs close to other words in a large text collection (corpus):

eat  fall ripe slice peel tree throw fruit pie bite crab
794  244  47   221   208  160  145   156   109 104  88

Interpret the counts as coordinates: every context word becomes a dimension (e.g., an “eat” axis and a “fall” axis, with “apple” as a point in that space).
Do the same for “orange”:

eat  fall ripe slice peel tree throw fruit pie bite crab
265  22   25   62    220  64   74    111   4   4    8
Then visualize both count tables as vectors in the same space: “apple” and “orange” become points along dimensions such as “eat” and “fall”, and similarity between two words corresponds to proximity in that space.
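Using the count tables above, the cosine of the angle between the “apple” and “orange” vectors can be computed directly:

```python
import math

# Count vectors from the tables above; the dimensions are the context words
# eat, fall, ripe, slice, peel, tree, throw, fruit, pie, bite, crab.
apple  = [794, 244, 47, 221, 208, 160, 145, 156, 109, 104, 88]
orange = [265, 22, 25, 62, 220, 64, 74, 111, 4, 4, 8]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(apple, orange), 3))  # → 0.881: the two fruits are close in context space
```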
Where can we find texts to use for making a distributional model?

● Text in electronic form!
● Newspaper articles
● Project Gutenberg: older books available for free
● Wikipedia
● Text collections prepared for language analysis:
– Balanced corpora
– WaC: scrape all web pages in a particular domain
– ELRA and LDC hold corpus collections (for example, large amounts of newswire reports)
● Google n-grams, Google Books
What do we mean by “similarity” of vectors?

● Euclidean distance as a dissimilarity measure
● Problem with Euclidean distance: very sensitive to word frequency! A rare word such as “Braeburn” ends up far from “apple” simply because all of its counts are small.
● Use the angle between vectors instead of point distance to get around word frequency issues
Some counts for “letter” in “Pride and Prejudice”. What do you notice?

the to of and a her she his is was in that
102 75 72 56 52 50 41 36 35 34 34 33
had i from you as this mr for not on be he
32 28 28 25 23 23 22 21 21 20 18 17
but elizabeth with him which by when jane
17 17 16 16 16 15 14 12

All the most frequent co-occurring words are function words.
Some words are more informative than others

● Function words co-occur frequently with all words
– That makes them less informative
● They have much higher co-occurrence counts than content words
– They can “drown out” more informative contexts
● Frequency
– Just selecting the most frequently occurring bigrams is a weak baseline
– A simple POS filter drastically improves the results. Keeping only word pairs with POS patterns such as “Adj Noun” yields results such as “Prime Minister”
Simple frequency-based measures: max_{x,y} p(x, y) and max_{x,y} log(p(x, y) + 1)
Some Statistical Measures

● Pointwise Mutual Information (information theory)
– The PMI tells us the amount of information that the occurrence of one word provides about the occurrence of the other word:

  PMI(x, y) = log2( p(x, y) / (p(x) p(y)) )

● Pearson’s chi-square test (hypothesis testing)
– The essence of the test is to compare the observed frequencies with the frequencies expected under independence in a contingency table
– It tries to refute the null hypothesis H0 : p(x, y) = p(x) p(y). The higher the score, the more confidently H0 can be rejected
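The PMI formula translates directly into code; the segment counts below are made up purely for illustration:

```python
import math

def pmi(n_xy, n_x, n_y, n):
    # n: total number of text segments; n_x, n_y: segments containing x / y;
    # n_xy: segments containing both.  PMI = log2( p(x,y) / (p(x) p(y)) ).
    p_xy = n_xy / n
    p_x, p_y = n_x / n, n_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: "eats" in 4 of 10 segments, "meat" in 2, both in 2.
print(pmi(2, 4, 2, 10))  # positive: the words attract each other

# When p(x, y) = p(x) p(y) (independence), PMI is exactly 0.
print(pmi(2, 4, 5, 10))
```

A positive PMI signals a syntagmatic association; zero means independence, and negative values mean the words repel each other.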
Other Measures
● There exist many association measures, and many more have been proposed since 2006 (or are not included here)
Take away message
● You know the formulas... so you can code them yourself!
● Depending on the task, preprocessing can be a critical phase (e.g., “orange” vs. “oranges”)
● Different granularity levels can be used: document, paragraph, sentence, etc.
● Python
– nltk.metrics.association module
● How do we know what is better?
– Evaluation!!! Many datasets allow evaluating the performance of existing algorithms (e.g., SemEval datasets, semantic and syntactic similarity datasets)
Landscape of Text Mining and Analytics
(Focus: topic mining & analysis)
Topic Mining and Analysis: Motivation

● Topic ≈ main idea discussed in text data
– Theme/subject of a discussion or conversation
– Different granularities (e.g., topic of a sentence, an article, etc.)
● Many applications require discovery of topics in text
– What are Twitter users talking about today?
– What are the current research topics in data mining? How are they different from those 5 years ago?
– What do people like about the iPhone 7? What do they dislike?
– What were the major topics debated in the 2016 presidential election?
Task of Topic Mining and Analysis

[Diagram] Text data (Doc 1, Doc 2, …, Doc N) mapped to Topic 1, Topic 2, …, Topic k
● Task 1: Discover the k topics
● Task 2: Figure out which documents cover which topics
[Diagram] When topics are discovered as groups of words (w1,w3,w5; w8,w17; w2,w4,w9; …), this is an unsupervised problem (clustering).
[Diagram] When topics are predefined classes (Class 1, Class 2, …, Class k), this is a supervised problem (classification).
Topic mining and Analysis
● Group of documents (unsupervised)
– Clustering
● Group of documents but into classes (supervised)
– Classification
● ? Group of words – dimensionality reduction (we will see these in the unsupervised part)
– NMF, LDA, PLSA, etc.
Clustering

● Clustering documents is a way to discover related documents and topics
● It is the process of grouping a set of objects into classes of similar objects
– Documents within a cluster should be similar
– Documents from different clusters should be dissimilar
● Many existing algorithms
– K-means, E.M., etc.
● A representation for each document is required (not always)
– Term-document or document-term matrix
– Raw frequency is frequently used, but many other weighting models are available (tf-idf). Their use depends on the chosen algorithm
Issues for clustering

● Representation for clustering
– Document representation
– Vector space? Normalization?
– Centroids aren’t length-normalized
– Need a notion of similarity/distance
● How many clusters?
– Fixed a priori?
– Completely data-driven?
● Avoid “trivial” clusters: too large or too small
Term-document matrix (tf-idf weights)

           Anthony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
anthony    5.25               3.18           0.0          0.0     0.0      0.35
brutus     1.21               6.10           0.0          1.0     0.0      0.0
caesar     8.59               2.54           0.0          1.51    0.25     0.0
calpurnia  0.0                1.54           0.0          0.0     0.0      0.0
cleopatra  2.85               0.0            0.0          0.0     0.0      0.0
mercy      1.51               0.0            1.90         0.12    5.25     0.88
worser     1.37               0.0            0.11         4.15    0.25     1.95

The values used in these matrices are known as the weighting scheme. They alter the obtained results (clustering or classification). Often the raw frequency is used, but other options are also available.

Why is frequency not the best option? Using frequencies, we obtain the vector {73, 157, 227, 10, 0, 0, 0} to represent the Julius Caesar document. Document size strongly affects similarity values!!!
Binary matrix

           Anthony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
anthony    1                  1              0            0       0        1
brutus     1                  1              0            1       0        0
caesar     1                  1              0            1       1        1
calpurnia  0                  1              0            0       0        0
cleopatra  1                  0              0            0       0        0
mercy      1                  0              1            1       1        1
worser     1                  0              1            1       1        0

Other common weighting schemes include TF-IDF.
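Both weighting schemes can be built from raw text in a few lines; this sketch uses a toy two-document corpus (in practice, scikit-learn’s CountVectorizer and TfidfVectorizer do this at scale):

```python
from collections import Counter

# Toy corpus: document name -> preprocessed text.
docs = {
    "Julius Caesar": "brutus and caesar noble brutus",
    "Hamlet": "words words words",
}

def term_doc_matrix(docs, binary=False):
    # Rows are documents, columns follow the sorted vocabulary.
    vocab = sorted({w for text in docs.values() for w in text.split()})
    matrix = {}
    for name, text in docs.items():
        counts = Counter(text.split())
        matrix[name] = [min(counts[w], 1) if binary else counts[w] for w in vocab]
    return vocab, matrix

vocab, tf = term_doc_matrix(docs)
_, bin_m = term_doc_matrix(docs, binary=True)
print(vocab)                   # ['and', 'brutus', 'caesar', 'noble', 'words']
print(tf["Julius Caesar"])     # [1, 2, 1, 1, 0]  raw frequencies
print(bin_m["Julius Caesar"])  # [1, 1, 1, 1, 0]  presence/absence
```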
Documents as vectors

● So we have a |V|-dimensional vector space
● Terms are the axes of the space
● Documents are points or vectors in this space
● Very high-dimensional: millions of dimensions when applied to big collections
● These are very sparse vectors: most entries are zero
Notion of similarity/distance

● Ideal: semantic similarity
● Practical: term-statistical similarity
– Cosine similarity
– Docs as vectors
– For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs
– It is easier to talk about Euclidean distance, but real implementations use cosine similarity
Hard vs. soft clustering

● Hard clustering: each document belongs to exactly one cluster
– More common and easier to do
● Soft clustering: a document can belong to more than one cluster
– Makes more sense for applications like creating browsable hierarchies
– You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
– You can only do that with a soft clustering approach
What Is a Good Clustering?

● Internal criterion: a good clustering will produce high-quality clusters in which:
– the intra-class (that is, intra-cluster) similarity is high
– the inter-class similarity is low
– the measured quality of a clustering depends on both the document representation and the similarity measure used
● External criterion: the quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes
– Assessable with gold-standard data
Clustering algorithms

● K-means
● Latent Semantic Analysis (LSA)
● Non-negative matrix factorization
● Latent Dirichlet allocation
● Many others exist, but these are the main ones used in text clustering
K-means

● It is a hard-clustering algorithm
● Assumes documents are real-valued vectors
● Clusters based on centroids (a.k.a. the center of gravity or mean) of the points in a cluster c:

  μ(c) = (1/|c|) Σ_{x ∈ c} x

● Reassignment of instances to clusters is based on the distance to the current cluster centroids (or one can equivalently phrase it in terms of similarities)
K-Means Algorithm

● Select K random docs {s1, s2, …, sK} (or points) as seeds
● Until the clustering converges (or another stopping criterion is met):
– For each doc di:
  ● Assign di to the cluster cj such that dist(di, sj) is minimal
– Then, update the seeds to the centroid of each cluster:
  ● For each cluster cj: sj = μ(cj)
[Figure] Iterations: seeding → assign → update → … → final partition

Notes: the default in scikit-learn uses the k-means++ initialization. When to stop? Based on a fixed number of iterations, or on an optimization criterion.
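The seeding/assign/update loop above can be sketched in plain Python; toy 2-D points stand in for document vectors, and scikit-learn’s KMeans remains the practical choice:

```python
import math
import random

def kmeans(points, k, seed=0):
    # Seeding: pick k random points as initial centroids.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    while True:
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when assignments no longer change.
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

# Two tight groups of points; K-means should separate them.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, labels = kmeans(points, k=2)
print(labels)  # the two tight groups end up in different clusters
```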
Take away message

● Easy to implement (already implemented in many frameworks, tools, libraries, etc.)
● Initialization is an important issue (use k-means++)
● Distance is also important!!
● Each cluster is considered a topic of the collection. However, its terms are unknown
● It has been extended in many ways (soft clustering, labels, similarity, etc.)
● How to introduce semantic similarity?
Latent Dirichlet Allocation

● Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents
● Among these algorithms, LDA, a technique based on Bayesian modeling, is the most commonly used nowadays
● Topic models can be applied to massive collections of documents to automatically organize, understand, search, and summarize large electronic archives
● Especially relevant in today’s “Big Data” environment
Motivation for topic models

LDA basics

● Each topic is a distribution over words; each document is a mixture of corpus-wide topics; and each word is drawn from one of those topics
● The goal is to infer the hidden variables, i.e., to compute their distribution conditioned on the documents:

  p(topics, proportions, assignments | documents)
LDA model: estimating the model

● Notation (plate diagram)
– Nodes are random variables; edges indicate dependence
– Shaded nodes are observed
– Plates indicate replicated variables
● From a collection of documents, infer
– Per-word topic assignment z_{d,n}
– Per-document topic proportions θ_d
– Per-corpus topic distributions β_k
● Approximate posterior inference algorithms
– Mean-field variational methods
– Expectation propagation
– Collapsed Gibbs sampling
– Collapsed variational inference
– Online variational inference
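One of the listed inference algorithms, collapsed Gibbs sampling, can be sketched on a toy corpus. This is a didactic miniature, not a practical implementation; alpha and beta are the usual Dirichlet hyperparameters:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})       # vocabulary size
    ndk = [[0] * k for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # topic totals
    z = []                                      # z[d][n]: topic of word n in doc d
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    # Gibbs sweeps: resample each word's topic given all the others.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # p(topic j | everything else), up to a constant.
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[d][n] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk  # per-document topic counts

docs = [["apple", "fruit", "pie"], ["apple", "fruit"],
        ["goal", "match", "team"], ["match", "team"]]
doc_topics = lda_gibbs(docs, k=2)
print(doc_topics)  # per-document topic counts
```

On this corpus the food documents and the sports documents tend to concentrate on different topics; production work would use scikit-learn’s LatentDirichletAllocation or gensim instead.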
Take away message

● LDA is a topic model based on the probabilistic version of LSI
● The estimation of the parameters in the model is performed based on the observed data and the prior hyperparameters
● Clustering can be performed using LDA by choosing the most representative topic for each document and fixing the desired number of topics as the number of clusters
More about clustering

● Many algorithms are available for clustering; however, many of them are not suitable for text
● Classical algorithms usually perform well, but state-of-the-art algorithms can perform much better if they are well parametrized (no free lunch)
● Clustering is a hard task, and few algorithms scale well to huge text collections. If hierarchies are needed, bottom-up or top-down strategies can be applied (combined with the presented algorithms)
● There exist problems in which labels are needed!!! LDA is a good solution, but other simpler algorithms can do the job (STC, Lingo, etc.)
● As in any other data mining task, the best clustering algorithm is determined by the problem, not by choosing the most popular one
How to evaluate clustering?

● Evaluation is a hard task
● Best scenario: an annotated collection is available
– Pairs of documents are used to evaluate the partitions, by asking whether they belong to the same partition in the annotated data and in the obtained partition (for any clustering algorithm)
● Extreme cases are hard to evaluate
– All documents belong to just one cluster
– All documents belong to individual clusters

Comparison of clustering evaluation metrics
Take away message

● Evaluate if you have annotated data
– Metric selection is an important issue
– Try to understand the addressed problem to select a metric adapted to it
● Tools also implement evaluation metrics
– Clustering performance evaluation module in scikit-learn
– Clusteval in R
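The pair-counting evaluation described above (do two documents land in the same group in both the gold partition and the obtained partition?) is the Rand index; a minimal sketch:

```python
from itertools import combinations

def rand_index(gold, pred):
    # For every pair of documents, check whether gold labels and cluster
    # labels agree on the "same group or not" question.
    agree = sum(
        (gold[i] == gold[j]) == (pred[i] == pred[j])
        for i, j in combinations(range(len(gold)), 2)
    )
    return agree / (len(gold) * (len(gold) - 1) / 2)

gold = ["sports", "sports", "politics", "politics"]
pred = [0, 0, 1, 1]   # perfect clustering: label names may differ from gold
print(rand_index(gold, pred))   # 1.0
pred2 = [0, 1, 0, 1]  # every gold pair is split or wrongly merged
print(rand_index(gold, pred2))  # ≈ 0.33
```

scikit-learn ships chance-corrected variants of this idea (e.g., adjusted_rand_score), which behave better on the extreme cases mentioned above.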
Why is preprocessing important?
Prime Minister Theresa May has jetted out to Davos for crunch talks with top bankers following a landmark speech this week in which she set out her vision for Brexit.
May will meet chief executives from banks such as Goldman Sachs and JP Morgan a day after two other bosses – HSBC’s Stuart Gulliver and UBS investment chief Andrea Orcel – confirmed that between them they would shift up to 2,000 jobs out of London when Britain leaves the EU.
Brexit will be top of the agenda when May hosts a roundtable including Goldman’s Lloyd Blankfein and JP Morgan’s Jamie Dimon, both of whom also landed in Switzerland yesterday. However, City experts do not believe there will be an exodus of banking jobs from London, despite the HSBC and UBS plans.
“It’s all positioning, it’s all playing to the gallery,” added Simon French, chief economist at Panmure Gordon.
French said other banks are unlikely to make similar announcements – partly because they risk prompting staff to switch to rivals that guarantee permanent jobs in London.
POS tagging yields NNP, MD, VB, or MD, MD, VB if the text is lowercased (e.g., “May” is tagged as a proper noun, but “may” as a modal verb). Which representation and which granularity should we use?
Landscape of Text Mining and Analytics
(Focus: topic mining & analysis)
Task of Topic Mining and Analysis

[Diagram] Text data (Doc 1, Doc 2, …, Doc N) mapped to predefined classes (Class 1, Class 2, …, Class k): a supervised problem (classification).
Topic mining and Analysis
● Group of documents (unsupervised)
– Clustering
● Group of documents but into classes (supervised)
– Classification or categorization
● ? Group of words – dimensionality reduction (we will see these in the unsupervised part)
– NMF, LDA, PLSA, etc.
Text classification or categorization
● Given the following
– A set of predefined categories, possibly forming a hierarchy
– Often, a training set of labeled text objects
● Task: classify a text object into one or more of the categories
Examples of text categorization
● Text objects can vary (e.g., documents, passages, or collections of text)
● Categories can also vary
– Internal categories that characterize a text object (e.g., topical categories, sentiment categories)
– External categories that characterize an entity associated with the object (e.g., author attribution or any other meaningful categories associated with text data)
● Some examples of applications
– News categorization, literature article categorization (e.g., MeSH annotations)
– Spam email detection/filtering
– Sentiment categorization of product reviews or tweets
– Automatic email sorting/routing
– Author attribution
Variants of problem formulation

● Binary categorization: only two categories
– Retrieval (relevant or not)
– Spam filtering (spam or not)
– Opinion (negative or positive)
● K-category categorization: more than two categories
– Topic categorization (sports, science, travel, business, etc.)
– Email routing (folder 1, folder 2, etc.)
● Hierarchical categorization: categories form a hierarchy
● Joint categorization: multiple related categorization tasks done in a joint manner
Why text categorization?
● To enrich text representation (more understanding of text)
– Text can now be represented in multiple levels (keywords+categories)
– The assigned semantic categories can be directly or indirectly useful for an application
– Semantic categories facilitate aggregation of text content (e.g., aggregating all positive /negative opinions about a product)
● To infer properties of entities associated with text data (discovery of knowledge about the world)
– As long as an entity can be associated with text data, we can always use the text data to help categorize the associated entity
– E.g., discovery of non-native speakers of a language; prediction of party affiliation based on a political speech
Categorization methods: manual
● Determine the category based on rules that are carefully designed to reflect the domain knowledge about the categorization problem
● Works well when
– The categories are very well defined
– Categories are easily distinguished based on surface features in text (e.g., special vocabulary is known to only occur in a particular category)
– Sufficient domain knowledge is available to suggest many effective rules
● Problems
– Labor intensive -> doesn't scale up well
– Can't handle uncertainty in rules; rules may be inconsistent -> not robust
● Both problems can be solved/alleviated by using machine learning
Categorization methods: automatic
● Use human experts to
– Annotate data sets with category labels -> training data
– Provide a set of features to represent each text object that can potentially provide a "clue" about the category
– Use machine learning to learn "soft rules" for separating different categories
– Figure out which features are most useful for separating different categories
– Optimally combine the features to minimize the errors of categorization on the training data
– The trained classifier can then be applied to a new text object to predict the most likely category (that a human expert would assign to it)
Machine learning for text categorization

● General setup: learn a classifier f: X → Y
– Input: X = all objects; output: Y = all categories
– Learn a classifier function f: X → Y such that f(x) = y ∈ Y gives the correct category for x ∈ X (correct is based on the training data)
● All methods
– Rely on discriminative features of text objects to distinguish categories
– Combine multiple features in a weighted manner
– Adjust the weights on features to minimize errors on the training data
● Different methods tend to vary in
– Their way of measuring the errors on the training data (they may optimize a different objective/loss/cost function)
– Their way of combining features (e.g., linear vs. non-linear)
Generative vs Discriminative Classifiers
● Generative classifiers (learn what the data "looks" like in each category)
– Attempt to model p(X,Y)=p(Y)P(X|Y) and compute p(Y|X) based on p(X|Y) and p(Y) by using the Bayes rule
– Objective function is likelihood, thus indirectly measuring training error
– E.g., Naive Bayes
● Discriminative classifiers (learn what features separate categories)
– Attempt to model p(Y|X) directly
– Objective function directly measures errors of categorization on training data
– E.g., logistic regression, Support vector machines (SVM), k-Nearest Neighbors (kNN)
Text Categorization with Naïve Bayes

● Consider each category independently as a class c (for the multiple-class setting)
– Example d: a document
– Feature w: a word or term
– Classify as c if score(c) > θ
  ● Typically a specifically tuned threshold for each class, due to the inaccuracy of the probabilistic estimate of P(d|c) with the given training statistics and the independence assumption...
  ● ... but a biased probability estimate for c may still correlate well with the classification decision

  score(c) = log( P(c|d) / P(~c|d) ) = Σ_{w ∈ d} log( P(w|c) / P(w|~c) ) + log( P(c) / P(~c) )
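A minimal scorer following the log-odds formula above, with add-one smoothing on a toy review corpus (the documents and the threshold θ = 0 are illustrative):

```python
import math

# Toy training data for class c = "positive" vs ~c = "negative".
pos_docs = [["great", "battery", "great"], ["great", "screen"]]
neg_docs = [["bad", "battery"], ["bad", "bad", "screen"]]
vocab = {w for d in pos_docs + neg_docs for w in d}

def word_logodds(word):
    # log P(w|c)/P(w|~c) with add-one (Laplace) smoothing.
    p_wc = (sum(d.count(word) for d in pos_docs) + 1) / (sum(map(len, pos_docs)) + len(vocab))
    p_wn = (sum(d.count(word) for d in neg_docs) + 1) / (sum(map(len, neg_docs)) + len(vocab))
    return math.log(p_wc / p_wn)

prior = math.log(len(pos_docs) / len(neg_docs))  # log P(c)/P(~c)

def score(doc):
    # score(c) = sum of per-word log-odds plus the class prior log-odds.
    return sum(word_logodds(w) for w in doc) + prior

print(score(["great", "battery"]) > 0)  # True: classified positive
print(score(["bad", "screen"]) > 0)     # False: classified negative
```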
Nearest-Neighbor Learning Algorithm

● Learning is just storing the representations of the training examples in a data set D
● Testing instance x:
– Compute the similarity between x and all examples in D
– Assign x the category of the most similar examples in D
● Does not explicitly compute a generalization or category prototypes (i.e., no “modeling”)
● Also called:
– Case-based
– Memory-based
– Lazy learning
K Nearest Neighbor for Text

● Training:
– For each training example <x, c(x)> ∈ D, compute the corresponding TF-IDF vector dx for document x
● Test instance y:
– Compute the TF-IDF vector d for document y
– For each <x, c(x)> ∈ D, let sx = cos(d, dx)
– Sort the examples x in D by decreasing value of sx
– Let N be the first k examples in D (the most similar neighbors)
– Return the majority class of the examples in N

Simple but powerful in very large collections!
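The procedure above, sketched with raw term-frequency vectors and cosine similarity on a toy training set (a real system would use TF-IDF, as the slide says):

```python
import math
from collections import Counter

# Toy labeled training set: (document text, class).
train = [
    ("the team won the match", "sports"),
    ("a great match and a great goal", "sports"),
    ("parliament passed a budget", "politics"),
    ("budget vote in parliament", "politics"),
]

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn(text, k=3):
    # Score every training doc against the query, keep the k most similar,
    # and return the majority class among those neighbors.
    query = Counter(text.split())
    sims = sorted(
        ((cosine(query, Counter(doc.split())), label) for doc, label in train),
        reverse=True,
    )
    top = [label for _, label in sims[:k]]
    return Counter(top).most_common(1)[0][0]

print(knn("who won the match"))  # sports
```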
KNN discussion

● No feature selection necessary
● No training necessary
● Scales well with a large number of classes/documents
– No need to train n classifiers for n classes
● Classes can influence each other
– Small changes to one class can have a ripple effect
● Done naively, very expensive at test time
Support Vector Machine (SVM)

● SVMs maximize the margin around the separating hyperplane
– A.k.a. large-margin classifiers
● The decision function is fully specified by a subset of the training samples, the support vectors
● Solving SVMs is a quadratic programming problem
● Seen by many as the most successful current text classification method

[Figure] The support vectors define the maximum-margin hyperplane; other separating hyperplanes give a narrower margin.
Non-linear SVMs: Feature spaces

● General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  Φ: x → φ(x)
Parameters trick?

● Over-fitting is a typical problem
Evaluation

● Recall: fraction of docs in class i classified correctly
● Precision: fraction of docs assigned class i that are actually about class i
● Accuracy (1 - error rate): fraction of docs classified correctly
● Other metrics exist, but again, the problem defines the appropriate metrics
Cross validation example

● Split the data into 5 samples
● Fit a model to the training samples and use the held-out test sample to calculate a CV metric
● Repeat the process for the next sample, until every sample has been used once as the test set
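The 5-fold split can be sketched without libraries (in practice, scikit-learn’s cross_val_score wraps this loop):

```python
def kfold_indices(n, k=5):
    # Partition indices 0..n-1 into k disjoint folds; each fold serves as the
    # test set once, while the remaining folds form the training set.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

n = 10
for train, test in kfold_indices(n, k=5):
    # Here one would fit the classifier on `train` and score it on `test`;
    # the CV metric is the average of the 5 per-fold scores.
    assert set(train) | set(test) == set(range(n))
    assert not set(train) & set(test)
print("every sample used for testing exactly once")
```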
Examples within the landscape of Text Mining and Analytics
http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
Corpus (Wikipedia)
https://sites.google.com/site/rmyeid/projects/polyglot
Landscape of Text Mining and Analytics
[Diagram: the real world is perceived (through a perspective) as an observed world, which is expressed (in English) as text data; the tasks map onto this pipeline: natural language processing & text representation, word association mining & analysis, topic mining & analysis, opinion mining & sentiment analysis, and text-based prediction]
import nltk.collocations
import collections

# Read the whole corpus as a single string (file "full.txt")
f = open("full.txt")
documents = " ".join([line for line in f.readlines()])
f.close()

# Collect bigram statistics, ignoring stopwords and very short tokens
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(documents.split())
ignored_words = nltk.corpus.stopwords.words('english')
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)

# Top-10 bigrams under different association measures
finder.nbest(bgm.raw_freq, 10)
finder.nbest(bgm.pmi, 10)
finder.nbest(bgm.likelihood_ratio, 10)
finder.nbest(bgm.chi_sq, 10)
finder.nbest(bgm.dice, 10)
finder.nbest(bgm.fisher, 10)
finder.nbest(bgm.jaccard, 10)
finder.nbest(bgm.mi_like, 10)
finder.nbest(bgm.poisson_stirling, 10)
finder.nbest(bgm.student_t, 10)

# All bigrams with their PMI scores
scored = finder.score_ngrams(bgm.pmi)
Landscape of Text Mining and Analytics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.cluster import KMeans

# Read the corpus, one document per line (file "full.txt.head")
f = open("full.txt.head")
documents = [line for line in f.readlines()]
f.close()

# Raw term counts for LDA, TF-IDF weights for NMF and k-means
tf_v = CountVectorizer(max_df=0.95, min_df=2, max_features=100000, stop_words='english')
tfidf_v = TfidfVectorizer(max_df=0.95, min_df=2, max_features=100000, stop_words='english')

tf = tf_v.fit_transform(documents)
tfidf = tfidf_v.fit_transform(documents)

# Three ways to mine 10 topics/clusters from the same collection
# (n_topics and alpha were renamed n_components and alpha_W in newer scikit-learn)
nmf = NMF(n_components=10, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
km = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1).fit(tfidf)
lda = LatentDirichletAllocation(n_topics=10, max_iter=5, learning_method='online',
                                learning_offset=50., random_state=0).fit(tf)
Landscape of Text Mining and Analytics
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
import random

# Labeled corpus: NLTK movie reviews, pos -> 1, neg -> 0
from nltk.corpus import movie_reviews
docs = [(list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]
random.shuffle(docs)
X, y = [" ".join(w[0]) for w in docs], [1 if w[1] == 'pos' else 0 for w in docs]

tfidf_v = TfidfVectorizer(max_df=0.95, min_df=2, max_features=100000, stop_words='english')
tfidf = tfidf_v.fit_transform(X)

clf = SVC()
clf.fit(tfidf, y)

# transform() expects a list of documents, not .split(),
# which would treat every single word as its own document
review = "One of my all time favorites. Shawshank Redemption is a very moving story about hope and the power of friendship. The cast is first rate with everyone giving a great performance. Tim Robbins and Morgan Freeman carry the movie, but Bob Gunton and Clancy Brown are perfect as the Warden Norton and prison guard captain Hadley respectively. And James Whitmore's portrail of an elderly inmate Brooks is moving. The screenplay gives almost every actor at least one or more memorable lines through out the film. As well as a very surprising twist near the end that almost knocked me out of my chair. If you have not seen this movie rent it or better yet buy it. As I bet you'll want to see this one more than once."
tfidf_test = tfidf_v.transform([review])
print(clf.predict(tfidf_test))