Machine Learning with Applications in Categorization, Popularity and Sequence Labeling
(linear models, decision trees, ensemble methods, evaluation)
Dr. Nicolas Nicolov <[email protected]>
Goals
• Introduce important ML concepts
• Illustrate ML techniques through examples in:
  – Categorization
  – Popularity
  – Sequence labeling
(tutorial aims to be self-contained and to explain the notation)
Outline
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
EXAMPLES OF MACHINE LEARNING
Why?
– Get a flavor of the diversity of areas where ML is applied.
Sequence Labeling
George W. Bush discussed Iraq
PER    PER PER  _         GPE
<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
(PER = Person; GPE = Geo-Political Entity. Similar labeling arises in search query analysis.)
Spam
www.dietsthatwork.com
→ www . dietsthatwork . com    (tokenization)
→ www . diets that work . com  (further segmentation)
→ SPAM!                        (classification)
Tokenization
What!?I love the iphone:-)  →  What !? I love the iphone :-)
How difficult can that be? — 98.2% [Zhang et al. 2003]
NO TRESPASSING VIOLATORS WILL BE PROSECUTED
NL Parsing
Unlike my sluggish Chevy the Audi handles the winding mountain roads superbly
(figure: dependency arcs labeled CONTR, PREP, POSS, MOD, DET, SUBJ, DOBJ, MANR — the syntactic structure of the sentence)
State Transitions
(figure: parser configurations, each a stack λ and a buffer β, linked by transitions)
Actions: LEFTARC, RIGHTARC, NOARC, SHIFT.
Using ML to make the decision which action to take.
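The four actions can be sketched in Python. This is a simplified, hypothetical variant (real transition systems such as arc-eager add preconditions the slide does not show); a trained classifier would pick the action at each step:

```python
def shift(stack, buffer):
    """SHIFT: move the next input token from the buffer onto the stack."""
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs):
    """LEFTARC: the stack top becomes a dependent of the buffer front."""
    dependent = stack.pop()
    arcs.append((buffer[0], dependent))  # (head, dependent)

def right_arc(stack, buffer, arcs):
    """RIGHTARC: the buffer front becomes a dependent of the stack top."""
    dependent = buffer.pop(0)
    arcs.append((stack[-1], dependent))

def no_arc(stack):
    """NOARC: pop the stack without adding an arc."""
    stack.pop()
```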
Two Ladies in a Men’s Club
We serve men
serve —IndirectObject→ men:  We serve food to men.  We serve our community.
serve —DirectObject→ men:    We serve organic food.  We serve coffee to connoisseurs.
Audi is an automaker that makes luxury cars and SUVs. The company was born in Germany. It was established by August Horch in 1910. Horch had previously founded another company and his models were quite popular. Audi started with four cylinder models. By 1914, Horch's new cars were racing and winning. August Horch left the Audi company in 1920 to take a position as an industry representative for the German motor vehicle industry federation. Currently Audi is a subsidiary of the Volkswagen group and produces cars of outstanding quality.
Coreference
Parts of Objects (Meronymy)
[…] the interior seems upscale with leatherette upholstery that looks and feels better than the real cow hide found in more expensive vehicles, a dashboard accented by textured soft-touch materials, a woven mesh
headliner, and other materials that give the New Beetle’s interior a sense of quality. […] Finally, and a big plus in my book, both front seats were height adjustable, and the steering column tilted and telescoped for optimum comfort.
Sentiment Analysis
I love pineapple nearly as much as I hate bananas.
POSITIVE sentiment regarding topic pineapple.
(figure: timeline of Xbox posts classified as Positive, Negative or Neutral)
Chinese Sentiment
(figure: example sentence annotated with car aspects and sentiment categories)
Categorization
• High-level task:
  – Given a restaurant, what is its restaurant sub-category?
• Encoding entities with features
• Feature selection
• Linear models
(non-standard order: "Though this be madness, yet there is method in't.")
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
ENCODING OBJECTS WITH FEATURES
Why?
– ML algorithms are "generic"; most of them are cast as solutions around vector encodings of the domain objects. Regardless of the ML algorithm we will need to represent/encode the domain objects as feature vectors. How well we do this (the quality of features) directly impacts system performance.
Flat Object Encoding
A machine learning (training) instance/example/observation:

37   1 0 0 1 1 1 0 1 …

• 37 is the target class index (for asian). Can be a set; an object can belong to several classes.
• The remaining entries are feature values (binary in this example), e.g.:
  – Default feature: always on.
  – Name has "asian bistro"
  – Name has "restaurant"
  – Name has "ginger"
  – Description has "china"
  – Description has "indonesia"
  – URL has "french"
  – has FB page
• The number of features can be millions.
Structured Objects to Strings to Features
Structured object with fields f1…f6, where field f2:f4 contains the token sequence "a b c d e". From it we generate feature strings:
  uni-grams: "f2:f4>a", "f2:f4>b", "f2:f4>c", …
  bi-grams: "f2:f4>a_b", "f2:f4>b_c", "f2:f4>c_d", …
  tri-grams: "f2:f4>a_b_c", "f2:f4>b_c_d", …

| Feature string | Feature index |
|----------------|---------------|
| *DEFAULT*      | 0             |
| …              | …             |
| f2:f4>a        | 100           |
| f2:f4>b        | 101           |
| f2:f4>c        | 102           |
| …              | …             |
| f2:f4>a_b      | 105           |
| f2:f4>b_c      | 106           |
| f2:f4>c_d      | 107           |
| …              | …             |
| f2:f4>a_b_c    | 109           |
Read as field “f2:f4” contains feature “a”.
Table can be quite large.
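Such a table can be built incrementally as features are first seen. A minimal Python sketch (names are illustrative, not from the original system):

```python
feature_index = {"*DEFAULT*": 0}  # the default feature gets index 0

def index_of(feature):
    """Look up a feature string, assigning the next free index on first sight."""
    if feature not in feature_index:
        feature_index[feature] = len(feature_index)
    return feature_index[feature]

def encode(feature_strings):
    """Encode an instance as the indices of its triggered features."""
    return [index_of(f) for f in ["*DEFAULT*"] + feature_strings]
```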
Sliding Window (bi-grams)
SkyCity at the Space Needle
^ SkyCity at the Space Needle $    (add initial "^" and final "$" tokens)
A window of width two then slides over the padded sequence:
(^ SkyCity) (SkyCity at) (at the) (the Space) (Space Needle) (Needle $)
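The padded sliding window can be sketched as:

```python
def bigrams(text):
    """Pad with initial '^' and final '$' tokens, then slide a width-two window."""
    tokens = ["^"] + text.split() + ["$"]
    return [tokens[i] + "_" + tokens[i + 1] for i in range(len(tokens) - 1)]
```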
Example: Feature Templates

public static List<string> NGrams( string field )
{
    var features = new List<string>();
    string[] tokens = field.Split( spaceCharArr, System.StringSplitOptions.RemoveEmptyEntries );

    features.Add( string.Join( "", field.Split( SPLIT_CHARS ) ) ); // the entire field

    string unigram = string.Empty, bigram = "^", previous1 = "^", previous2 = "^", trigram;
    for (int i = 0; i < tokens.Length; i++)
    {
        unigram = tokens[ i ];
        features.Add( unigram );
        bigram = previous1 + "_" + unigram;       // initial bigram is "^_tokens[0]"
        features.Add( bigram );
        if ( i >= 1 )
        {
            trigram = previous2 + "_" + bigram;   // initial tri-gram is "^_tokens[0]_tokens[1]"
            features.Add( trigram );
        }
        previous2 = previous1;
        previous1 = unigram;
    }
    features.Add( unigram + "_$" );
    features.Add( bigram + "_$" );                // last trigram is "tokens[Length-2]_tokens[Length-1]_$"
    return features;
}

One could also pass the field name as an argument and prefix all features with it.
The Art of Feature Engineering:Disjunctive Features
• Useful feature = triggers often and with a particular class.
• Rarely occurring (but indicative of a class) features can be combined in a disjunction. This results in:
  – Need for less data to achieve good performance.
  – Final system performance (with all available data) is higher.
• How can we get insights about such features? Error analysis!

Regex ITALIAN_FOOD = new Regex(@"al dente|agnello|alfredo|antipasti|antipasto|arrabbiata|bistecca|bolognese|branzino|caprese|carbonara|carpaccio|cioppino|cozze|fettuccine|filetto|focaccia|frutti di mare|funghi|gnocchi|gorgonzola|insalata|lasagna|linguine|linguini|macaroni|minestrone|mozzarella|ossobuco|panini|panino|parmigiana|pasticcio|pecorino|penne|pepperoncini|pesce|pesto|piatti|piatto|piccata|polpo|pomodori|prosciutto|radicchio|ravioli|ricotta|rigatoni|risotto|saltimbocca|scallopini|scaloppini|spaghetti|tagliatelle|tiramisu|tortellini|vitello|vongole");

if (ITALIAN_FOOD.Match(entity.description).Success)   // triggering of the feature
    features.Add("Italian_Food_Matched_Description"); // it is up to us what we call the feature
Generic Nature of ML Systems
A human sees the original object; the computer "sees" only:

instance( class= 7, features=[0,300857,100739,200441,...])
instance( class=99, features=[0,201937,196121,345758,13,...])
instance( class=42, features=[0,99173,358387,1001,1,...])
...

• The default feature (index 0) always triggers.
• The listed indices are the (binary) features that trigger.
• The number of features that trigger for individual instances is often not the same.
Training Data
$$X=\begin{pmatrix} x_0^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_0^{(N)} & \cdots & x_d^{(N)} \end{pmatrix} \qquad y=\begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix}$$

Each row of X is an instance; the corresponding entry of y is its outcome.
Feature Selection
• Templates: powerful way to get lots of features (e.g., 20M for dependency parsing).
• We get too many features.
• Danger of overfitting: doing well on seen data but poorly on unseen data.
• Feature selection: automatic ways of finding discriminative features.
  – CountCutOff
  – TFxIDF
  – Mutual information
  – Information gain
  – Chi square (we will examine its implementation in detail)
Mutual Information
• Measure of relative entropy between distributions of two random variables.
• MI(f, C) = expected value of I(f, c) across all classes:

$$I(f,c)=\log\frac{P(f,c)}{P(f)\,P(c)}=\log\frac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$$

$$MI(f,C)=\sum_{c\in C}P(c)\,I(f,c)=\sum_{c\in C}\frac{n_c}{N_t}\log\frac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$$

• An alternative is to use:

$$I_{max}(f,C)=\max_{c\in C}I(f,c)=\max_{c\in C}\log\frac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$$
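These quantities translate directly from the counts n_{f,c}, n_f, n_c and N_t. A sketch (classes with zero co-occurrence counts are skipped rather than smoothed, which a production implementation should handle):

```python
import math

def pointwise_mi(n_fc, n_f, n_c, n_t):
    """I(f, c) = log( (n_fc/N_t) / ((n_f/N_t) * (n_c/N_t)) )."""
    return math.log((n_fc / n_t) / ((n_f / n_t) * (n_c / n_t)))

def mi(n_fc_by_class, n_f, n_c_by_class, n_t):
    """MI(f, C) = sum over classes of P(c) * I(f, c)."""
    return sum((n_c / n_t) * pointwise_mi(n_fc_by_class.get(c, 0), n_f, n_c, n_t)
               for c, n_c in n_c_by_class.items() if n_fc_by_class.get(c, 0) > 0)

def mi_max(n_fc_by_class, n_f, n_c_by_class, n_t):
    """I_max(f, C) = max over classes of I(f, c)."""
    return max(pointwise_mi(n_fc, n_f, n_c_by_class[c], n_t)
               for c, n_fc in n_fc_by_class.items() if n_fc > 0)
```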
Information Gain
Balances effects of feature triggering for an object with the effects of feature being absent for an object.
$$IG(f,C)=H(C)-H(C|f)-H(C|\neg f)$$
$$=-\sum_{c\in C}P(c)\log P(c)-\Big(-\sum_{c\in C}P(f,c)\log P(c|f)\Big)-\Big(-\sum_{c\in C}P(\neg f,c)\log P(c|\neg f)\Big)$$
$$=-\sum_{c\in C}\left(\frac{n_c}{N_t}\log\frac{n_c}{N_t}-\frac{n_{f,c}}{N_t}\log\frac{n_{f,c}}{n_f}-\frac{n_c-n_{f,c}}{N_t}\log\frac{n_c-n_{f,c}}{N_t-n_f}\right)$$
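The counts form above maps directly onto code. A sketch (zero-count terms are skipped, using the 0·log 0 = 0 convention):

```python
import math

def info_gain(n_fc_by_class, n_f, n_c_by_class, n_t):
    """IG(f, C) from raw counts, following the counts form of the formula."""
    ig = 0.0
    for c, n_c in n_c_by_class.items():
        n_fc = n_fc_by_class.get(c, 0)
        ig -= (n_c / n_t) * math.log(n_c / n_t)        # H(C) term
        if n_fc > 0:                                   # feature present
            ig += (n_fc / n_t) * math.log(n_fc / n_f)
        if n_c - n_fc > 0:                             # feature absent
            ig += ((n_c - n_fc) / n_t) * math.log((n_c - n_fc) / (n_t - n_f))
    return ig
```

An independent feature yields IG = 0; a feature that perfectly identifies one of two equally likely classes yields IG = H(C) = ln 2.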
Chi Square
Quantifies lack of independence between feature f and class c:

$$X^2(f,c)=\frac{N_t\,\big(P(f,c)P(\neg f,\neg c)-P(f,\neg c)P(\neg f,c)\big)^2}{P(f)\,P(\neg f)\,P(c)\,P(\neg c)}$$

$$=\frac{N_t\,\big(n_{f,c}(N_t-n_f-n_c+n_{f,c})-(n_f-n_{f,c})(n_c-n_{f,c})\big)^2}{n_c\,n_f\,(N_t-n_c)(N_t-n_f)}$$

float Chi2(int a, int b, int c, int d)
{
    float num = (float)a * d - (float)b * c;  // note: '^' in C# is XOR, not power
    return (a + b + c + d) * num * num
         / ( (float)(a + b) * (a + c) * (c + d) * (b + d) );
}

Calling: Chi2(n_{f,c}, n_c − n_{f,c}, n_f − n_{f,c}, N_t − n_c − n_f + n_{f,c}), so that a+b = n_c, a+c = n_f, c+d = N_t − n_c, b+d = N_t − n_f.
Exponent(Log) Trick
While the final output may not be big, intermediate results are. Solution: use $x=e^{\ln x}$.

float Chi2(int a, int b, int c, int d)
{
    float num = (float)a * d - (float)b * c;
    return (a + b + c + d) * num * num
         / ( (float)(a + b) * (a + c) * (c + d) * (b + d) );
}

float Chi2_v2(int a, int b, int c, int d)
{
    double total = (double)a + b + c + d;
    double n   = Math.Log(total);
    double num = 2.0 * Math.Log(Math.Abs(((double)a * d) - ((double)b * c)));
    double den = Math.Log((double)a + b) + Math.Log((double)a + c)
               + Math.Log((double)c + d) + Math.Log((double)b + d);
    return (float)Math.Exp(n + num - den);
}

$$\frac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}=e^{\ln\frac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}}$$
$$=e^{\ln\big((a+b+c+d)(ad-bc)^2\big)-\ln\big((a+b)(a+c)(c+d)(b+d)\big)}$$
$$=e^{\ln(a+b+c+d)+2\ln|ad-bc|-\ln(a+b)-\ln(a+c)-\ln(c+d)-\ln(b+d)}$$
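The same trick in Python, for comparison (assumes ad ≠ bc, since log 0 is undefined):

```python
import math

def chi2_direct(a, b, c, d):
    """Straightforward evaluation; intermediates can overflow fixed-width ints."""
    return (a + b + c + d) * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (c + d) * (b + d))

def chi2_log(a, b, c, d):
    """Identical value, but all intermediates live in log space."""
    log_num = math.log(a + b + c + d) + 2.0 * math.log(abs(a * d - b * c))
    log_den = math.log(a + b) + math.log(a + c) + math.log(c + d) + math.log(b + d)
    return math.exp(log_num - log_den)
```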
Chi Square: Score per Feature
• We know how to compute X²(f, c).
• Two options for an aggregate score across classes:
  – Weighted average:
$$X^2(f)=\sum_{c\in Classes}P(c)\,X^2(f,c)$$
  – Highest score among any class:
$$X^2(f)=\max_{c\in Classes}X^2(f,c)$$
Chi Square Feature Selection

int[] featureCounts = new int[ numFeatures ];
int numLabels = labelIndex.Count;
int[] classTotals = new int[ numLabels ];          // instances with that label
float[] classPriors = new float[ numLabels ];      // class priors: classTotals[label]/numInstances
int[,] counts = new int[ numLabels, numFeatures ]; // (label,feature) co-occurrence counts
int numInstances = instances.Count;

// ... do a pass over the data and collect the above counts ...

float[] weightedChiSquareScore = new float[ numFeatures ];
for (int f = 0; f < numFeatures; f++)  // f is a feature index
{
    float score = 0.0f;
    for (int labelIdx = 0; labelIdx < numLabels; labelIdx++)
    {
        int a = counts[ labelIdx, f ];
        int b = classTotals[ labelIdx ] - a;
        int c = featureCounts[ f ] - a;
        int d = numInstances - ( a + b + c );
        if (a >= MIN_SUPPORT && b >= MIN_SUPPORT)  // MIN_SUPPORT = 5
        {
            score += classPriors[ labelIdx ] * Chi2( a, b, c, d ); // weighted average across all classes
        }
    }
    weightedChiSquareScore[ f ] = score;
}
⇒ Summary: Encoding
• Object representation is crucial.
• Humans: good at suggesting features (templates).
• Computers: good at filtering (feature selection).
• Feature engineering: ensuring systems use the "right" features.
The system designer does not have to worry about which feature is more important or useful, and the job is left to the learning algorithm to assign appropriate weights to the corresponding features. The system designer’s job is to define a set of features that is large enough to represent most of the useful information, yet small enough to be manageable for the algorithms and the infrastructure.
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
MACHINE LEARNING: GENERAL FRAMEWORK
Machine Learning: Representation
Complex decision making: Object → Outcome
  Entity → Category
  Entity → Popularity
  Entity → IsChainElement

The object is encoded with features (think DB attributes / OO member fields of primitive types); d is the feature dimensionality. A classifier maps the input (independent variable) X to a prediction Y (response/dependent variable), which can be qualitative or quantitative (classification/regression):

$$X \to Y \qquad \vec{x}=(x_0,\ldots,x_d)\to Y$$

We may know the relation for certain values of x and y: (x⃗, y). In fact, we may know the relation for many x's and y's:
$$\{(\vec{x}^{(1)},y^{(1)}),\ldots,(\vec{x}^{(N)},y^{(N)})\}$$
Notation
$$\vec{x}^{(i)}=(x_0^{(i)},\ldots,x_j^{(i)},\ldots,x_d^{(i)})$$
• x⃗^(i) is the i-th instance.
• N is the total number of data items.
• The superscript (i) is not "to the power of"; hence, the parentheses.
• x_j^(i) is the j-th component of the feature vector x⃗^(i).
• We will often have x_0 be the default feature with value of 1.
TRAINING

Machine Learning pipeline:
• Offline: a training sub-system uses the training data to produce a model.
• Online system: an input object is encoded with features; the classifier applies the model and emits a prediction (response/dependent variable) as the final output.

The task X → Y is very complex; it is hard to construct a good f with f(X) = Y. We construct an approximation h to f, chosen from a hypothesis space H, using the training data
$$\{(\vec{x}^{(i)},y^{(i)})\}, \quad i=1,\ldots,N.$$
Classes of Learning Problems
• Classification: Assign a category to each item (Chinese | French | Indian | Italian | Japanese restaurant).
• Regression: Predict a real value for each item (stock/currency value, temperature).
• Ranking: Order items according to some criterion (web search results relevant to a user query).
• Clustering: Partition items into homogeneous groups (clustering twitter posts by topic).
• Dimensionality reduction: Transform an initial representation of items into a lower-dimensional representation while preserving some properties (preprocessing of digital images).
ML Terminology
• Examples: Items or instances used for learning or evaluation.
• Features: Set of attributes, represented as a vector, associated with an example.
• Labels: Values or categories assigned to examples. In classification the labels are categories; in regression the labels are real numbers.
• Target: The correct label for a training example. This is extra data that is needed for supervised learning.
• Output: Prediction label from an input set of features using a model of the machine learning algorithm.
• Training sample: Examples used to train a machine learning algorithm.
• Validation sample: Examples used to tune parameters of a learning algorithm.
• Model: Information that the machine learning algorithm stores after training. The model is used when predicting the output labels of new, unseen examples.
• Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage.
• Loss function: A function that measures the difference/loss between a predicted label and a true label. We will design the learning algorithms so that they minimize the error (cumulative loss across all training examples).
• Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels. The learning algorithm chooses one function among those in the hypothesis set to return after training. Usually we pick a class of functions (e.g., linear functions) parameterized by a set of free parameters (e.g., coefficients of the linear function) and pinpoint the final hypothesis by identifying the parameters that minimize the error.
• Model selection: Process for selecting the free parameters of the algorithm (actually of the function in the hypothesis set).
Classification
• Data: $\{(\vec{x}^{(1)},y^{(1)}),\ldots,(\vec{x}^{(N)},y^{(N)})\}$
• Binary classification: two possible outcomes, positive (+) and negative (−).
(figure: + and − instances in feature space separated by a decision boundary)
Yes, this is mysterious at this point.
Multi-Class Classification
• Outcomes: one of k classes.
• Common to use binary classification approaches:
  – One-Versus-All (OVA).
  – One-Versus-One (OVO).
One-Versus-All (OVA)
For each category in turn, create a binary classifier where an instance in the data belonging to the category is considered a positive example, all other examples are considered negative examples.
Given a new object, run all these binary classifiers and see which classifier has the “highest prediction”.
The scores from the different classifiers need to be calibrated!
$$\hat{y}=\underset{y\in Classes}{\arg\max}\ PredictScore_y(\vec{x})$$
One-Versus-One (OVO)
For each pair of classes, create a binary classifier on data labeled as either of the classes.
How many such classifiers? $\binom{k}{2}=\frac{k(k-1)}{2}$
Given a new instance, run all classifiers and predict the class with the maximum number of wins.
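Both strategies can be sketched in Python (the per-class scorers below are stand-ins for trained, calibrated binary classifiers):

```python
from math import comb

def num_ovo_classifiers(k):
    """One-Versus-One needs C(k, 2) = k(k-1)/2 binary classifiers."""
    return comb(k, 2)

def ova_predict(x, scorers):
    """One-Versus-All: return the class whose binary scorer fires highest."""
    return max(scorers, key=lambda label: scorers[label](x))
```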
Errors
"Nobody is perfect, but then again, who wants to be nobody."

Binary classifier, with outcomes encoded as y ∈ {0, 1} (this encoding makes |ŷ − y| a 0/1 penalty):

$$Error=\frac{1}{N}\cdot\sum_{i=1}^{N}\big|\hat{y}^{(i)}-y^{(i)}\big|$$

This counts the misclassified examples (a penalty score of 1 for every misclassified example), averaged across all instances. More generally:

$$Error=\frac{1}{N}\cdot\sum_{i=1}^{N}Loss\big(\hat{y}^{(i)},y^{(i)}\big)$$

Here $\hat{y}^{(i)}:=Predict(\vec{x}^{(i)})$ is the value predicted by the algorithm for input data point x⃗^(i), and Loss(ŷ^(i), y^(i)) is the point-wise error for data point i. The particular function
$$Loss(\hat{y},y)=|\hat{y}-y|$$
is called the "Zero-One Loss" (for simplicity we are skipping the indices).

Goal: minimize the Error. It is beneficial to have a differentiable loss function.
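With the 0/1 outcome encoding, the error is just the fraction of misclassified examples; a sketch:

```python
def zero_one_error(predicted, actual):
    """Average zero-one loss over N instances; |y_hat - y| is 0 or 1 per instance."""
    return sum(abs(p - y) for p, y in zip(predicted, actual)) / len(actual)
```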
Error: Function of the Parameters
$$\hat{y}^{(i)}:=Predict(\vec{x}^{(i)})=g(\vec{x}^{(i)},params)$$
is the value predicted by the algorithm for input data point x⃗^(i). The cumulative error across all instances is a function of the parameters:

$$Error(params)=\frac{1}{N}\cdot\sum_{i=1}^{N}Loss\big(\hat{y}^{(i)},y^{(i)}\big)=\frac{1}{N}\cdot\sum_{i=1}^{N}Loss\big(g(\vec{x}^{(i)},params),y^{(i)}\big)$$

1. When the x's and the y's are fixed we can compute (optimize) the params (training).
2. When the params are fixed we can compute ŷ given x⃗ (testing).
Evaluation
• Motivation:
  – Benchmark algorithms (which system is better).
  – Tuning parameters during training.
Evaluation Measures
• Generalization error: probability to misclassify an instance selected according to the distribution of the labeled instance space.
• Classification accuracy = 1 − generalization error.
• Training error: percentage of training examples that are misclassified. An optimistically biased estimate, especially if the inducer over-fits the (training) data.
• Empirical estimation of the generalization error:
  – Held-out method
  – Re-sampling:
    1. Random resampling
    2. Cross-validation
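Cross-validation can be sketched as follows (deterministic striped folds for illustration; in practice the data is shuffled first). Each fold serves once as held-out data, the rest as training data:

```python
def cross_validation_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```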
Precision, Recall and F-measure
General setup: let's consider binary classification over the space of all instances. The system identifies some instances as positive; some instances are positive in reality. This yields four groups:
• True positives: the system identified these as positive and got them correct.
• False positives: the system identified these as positive but got them wrong.
• True negatives: the system identified these as negative and got them correct.
• False negatives: the system identified these as negative but got them wrong.
Accuracy, Precision, Recall, and F-measure
Definitions, with TP (true positives), FP (false positives), FN (false negatives), TN (true negatives):

Precision: $p=\dfrac{TP}{TP+FP}$

Recall: $r=\dfrac{TP}{TP+FN}$

Accuracy: $acc=\dfrac{TP+TN}{TP+TN+FP+FN}$

F-measure (harmonic mean of precision and recall):
$$F=\frac{1}{\frac{1}{2}\left(\frac{1}{p}+\frac{1}{r}\right)}=\frac{2pr}{p+r}$$
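These definitions in code (a small hypothetical helper, guarding the zero denominators):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```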
Accuracy vs. Precision/Recall/F-measure
Accuracy can be misleading when evaluating a model on an imbalanced class distribution: when there are many more majority-class instances than minority-class instances, always predicting the majority class gives good accuracy.
Precision and recall (together) are better indicators.
As a single, aggregate number, F-measure favors the lower of the precision and recall.
Extreme Cases for Precision & Recall
• If very few (one, in the extreme) instances are correctly predicted as belonging to the class, precision is 100% (FP = 0) but recall is low (FN is high).
• If all instances are predicted as belonging to the class (some correctly, some not), recall is 100% (FN = 0) but precision is low (FP is high).
Precision can be traded for recall and vice versa.
Sensitivity & Specificity
Definitions, in terms of TP, FP, FN, TN:

Sensitivity (same as recall; aka true positive rate): $Sensitivity=\dfrac{TP}{TP+FN}$

Specificity (aka true negative rate): $Specificity=\dfrac{TN}{TN+FP}$

False positive rate: $FPR=\dfrac{FP}{FP+TN}$

False negative rate: $FNR=\dfrac{FN}{FN+TP}$

Misclassification rate: $MisclassificationRate=1-Acc=\dfrac{FP+FN}{TP+TN+FP+FN}$
![Page 56: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/56.jpg)
56
Venn Diagrams
John Venn (1880) “On the Diagrammatic and Mechanical Representation of Propositions and Reasonings”, Philosophical Magazine and Journal of Science, 5:10(59).
These visualization diagrams were introduced by John Venn:
What if there are three classes?
Four classes?
Six classes?
With more classes our visual intuitions help less and less.
A subtle point: These are just the actual/real classes without the system classes drawn on top!
![Page 57: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/57.jpg)
57
Confusion Matrix
Rows: actual classes A, B, C. Columns: predicted classes A, B, C.
Cell (A, A): number of instances in actual class A AND predicted as belonging to class A.
Cell (A, B): number of instances in actual class A BUT predicted as belonging to class B.
Row totals: total number of actual instances of each class.
Column totals: total number of instances predicted as each class.
Bottom-right corner: total number of instances.
Shows how the predictions for instances of an actual class are distributed across all classes. Here is the layout of a confusion matrix for three classes:
Counts on the diagonal are the true positives for each class. Counts off the diagonal are errors. Confusion matrices can handle many classes.
![Page 58: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/58.jpg)
58
Confusion Matrix:Accuracy, Precision and Recall
                 Predicted A   Predicted B   Predicted C   Total
Actual class A        50            80            70         200
Actual class B        40           140           120         300
Actual class C       120           220           160         500
Total                210           440           350        1000
Given a confusion matrix, it’s easy to compute accuracy, precision and recall:
Confusion matrices can, themselves, sometimes be confusing.
Accuracy    = (50 + 140 + 160) / 1000
Precision_A = 50 / (50 + 40 + 120)
Recall_A    = 50 / (50 + 80 + 70)
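These numbers can be checked in a few lines of Python (variable names are ours):

```python
# Rows: actual A, B, C; columns: predicted A, B, C (the matrix on this slide).
cm = [[50, 80, 70],
      [40, 140, 120],
      [120, 220, 160]]
total = sum(sum(row) for row in cm)                 # 1000
accuracy = sum(cm[i][i] for i in range(3)) / total  # (50+140+160)/1000
precision_A = cm[0][0] / sum(row[0] for row in cm)  # 50/(50+40+120)
recall_A = cm[0][0] / sum(cm[0])                    # 50/(50+80+70)
```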
![Page 59: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/59.jpg)
59
Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models
– Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees• Boosting
– AdaBoost• Ranking evaluation
– Kendall tau and Spearman’s coefficient• Sequence labeling
– Hidden Markov Models (HMMs)
![Page 60: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/60.jpg)
60
LINEAR MODELSWhy?– Linear models are a good way to learn about core ML concepts.
![Page 61: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/61.jpg)
61
Refresher: Vectors
[Figure: points and vectors in the plane; points are also vectors; the sum of two vectors v1 + v2.]
Equation of the line:  y = (1/3) x,  i.e.  3y = 1x,  or  (-1) x + 3 y = 0.
In coordinates x1, x2:  (-1) x1 + 3 x2 = 0.
Can be re-written in vector notation:
    (-1, 3) (x1, x2)^T = 0
    (w1, w2) (x1, x2)^T = 0

w = (w0, ..., wd)^T     (the superscript T denotes transpose)
![Page 62: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/62.jpg)
62
Refresher: Vectors (2)
Equation of the line:  y = (1/3) x,  i.e.  3y = 1x,  or  (-1) x + 3 y = 0.
In coordinates x1, x2:  (-1) x1 + 3 x2 = 0.
Can be re-written in vector notation:
    (-1, 3) (x1, x2)^T = 0
    (w1, w2) (x1, x2)^T = 0

The vector (-1, 3) is the normal vector of the line.
![Page 63: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/63.jpg)
63
Refresher: Dot Product
(w1, w2) · (x1, x2)^T = w1 x1 + w2 x2

w · x = |w| |x| cos γ

float DotProduct(float[] v1, float[] v2)
{
    float sum = 0.0f;
    for (int i = 0; i < v1.Length; i++)
        sum += v1[i] * v2[i];
    return sum;
}

The angle γ between w and x determines the sign of the dot product: w · x > 0, w · x = 0, or w · x < 0.
![Page 64: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/64.jpg)
64
Refresher: Pos/Neg Classes
[Figure: the separating line w · x = 0 in the (x1, x2) plane with its normal vector w; w · x > 0 on the + side and w · x < 0 on the − side.]
![Page 65: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/65.jpg)
65
sgn Function
In mathematics:
    sgn(s) = +1 if s > 0;  0 if s = 0;  -1 if s < 0

We will use:
    sgn(s) = +1 if s ≥ 0;  -1 if s < 0

(We purposefully avoid using x for the argument placeholder here; x is reserved for the feature vector.)

Informally drawn as a step function that jumps from -1 to +1 at 0.
![Page 66: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/66.jpg)
66
Two Linear Models
Perceptron:  g(x) = sign(w^T x)          Linear regression:  g(x) = w^T x

The features of an object have associated weights indicating their importance.

Signal:  s = w^T x = Σ_{i=0..d} w_i x_i

When w is known, the solution function is known; the set of possible w determines the hypothesis space.
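A minimal sketch of the two models in Python (function names are ours); both share the signal s = w^T x:

```python
def signal(w, x):
    """s = w^T x = sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_predict(w, x):
    """g(x) = sign(w^T x), with sign(0) taken as +1 as on the sgn slide."""
    return 1 if signal(w, x) >= 0 else -1

def linear_regression_predict(w, x):
    """g(x) = w^T x: the raw signal is the predicted quantity."""
    return signal(w, x)
```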
![Page 67: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/67.jpg)
67
Why “Regression”?Why is “regression” the term for quantitative output prediction?
“That same year [1875], [Francis] Galton decided to recruit some of his friends for an experiment with sweet peas. He distributed seeds among seven of them, asking them to plant the seeds and return the offspring. Galton measured the baby seeds and compared their diameters to those of their parents. He noticed a phenomenon that initially seems counter-intuitive: the large seeds tended to produce smaller offspring, and the small seeds tended to produce larger offspring. A decade later he analyzed data from his anthropometric laboratory and recognized the same pattern with human heights. After measuring 205 pairs of parents and their 928 adult children, he saw that exceptionally tall parents had kids who were generally shorter than they were, while exceptionally short parents had children who were generally taller than their parents.
After reflecting upon this, we can understand why it must be the case. If very tall parents always produced even taller children, and if very short parents always produced even shorter ones, we would by now have turned into a race of giants and midgets. Yet this hasn't happened. Human populations may be getting taller as a whole – due to better nutrition and public health – but the distribution of heights within the population is still contained.
Galton called this phenomenon ‘regression towards mediocrity in hereditary stature’. The concept is now more generally known as regression to the mean.”
[A.Bellos pp.375]
![Page 68: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/68.jpg)
68
On-Line (Sequential) Learning• On-line = process one example at a time.• Attractive for large scale problems.
Objective: minimize the cumulative loss  Σ_{t=1..T} Loss(ŷ(t), y(t)).

parameters := Initialize()
for t := 1 ... T:                         (iteration: epoch/time)
    (x(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x(t))
    Loss(ŷ(t), y(t)) := ...               (compute loss)
    parameters := Update(x(t), y(t), ŷ(t), Loss, parameters)
return parameters
![Page 69: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/69.jpg)
69
On-Line (Sequential) Learning (2)

Sometimes written out more explicitly:

parameters := Initialize()
for epoch := 1 ... E:                     (number of passes over the data)
    RandomizeData()
    for i := 1 ... N:                     (each data item)
        x(i) := ReceiveInstance()
        ŷ(i) := Predict(x(i))
        y(i) := ReceiveTrueLabel()
        if Loss(ŷ(i), y(i)) > 0:
            parameters := Update(x(i), y(i), ŷ(i), Loss, parameters)
return parameters
![Page 70: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/70.jpg)
70
Perceptron
• One of the earliest ML algorithms (Rosenblatt 1958).• On-line linear binary classification algorithm.• Determines a hyperplane (a line in 2-D, a plane in 3-D, …) separating the
points for the two classes.
[Figure: two scatter plots of + and − points. Linearly separable data: a straight line separates the two classes. Non-linearly separable data: no straight line can.]
![Page 71: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/71.jpg)
71
First: Perceptron Update Rule

    w_new = w_old + y(t) x(t)

Simplification: lines pass through the origin, in order to simplify the update rule.

Example (initially misclassified): x(t) = (2, 2), y(t) = +1.
Old boundary:  (-3) x + 1 y = 0,  i.e.  (-3, 1) (x, y)^T = 0.
Update:
    (w1, w2) = (-3, 1) + (+1)(2, 2) = (-1, 3)
New boundary:  (-1) x + 3 y = 0,  i.e.  y = (1/3) x.
The example is now correctly classified by the new separating boundary. It is not always the case that we can achieve this with one update.
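The numeric update on this slide can be checked directly:

```python
# Slide example: w_old = (-3, 1), misclassified example x = (2, 2) with y = +1.
w_old = (-3.0, 1.0)
x, y = (2.0, 2.0), 1
w_new = tuple(wi + y * xi for wi, xi in zip(w_old, x))      # the update rule
signal_before = sum(wi * xi for wi, xi in zip(w_old, x))    # negative: misclassified
signal_after = sum(wi * xi for wi, xi in zip(w_new, x))     # positive: now correct
```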
![Page 72: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/72.jpg)
72
On-Line (Sequential) Learning

parameters := Initialize()
for t := 1 ... T:
    (x(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x(t))
    Loss(ŷ(t), y(t)) := ...
    parameters := Update(x(t), y(t), ŷ(t), Loss, parameters)
return parameters
![Page 73: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/73.jpg)
73
Perceptron Learning Algorithm

Zero-one loss:  Loss(ŷ(t), y(t)) := (1/2) |ŷ(t) - y(t)|

w := (w0, ..., wd)^T = (0, ..., 0)^T ∈ R^{d+1}            (Initialize parameters)
for t := 1 ... T:                                          (iteration: epoch/time)
    (x(t), y(t)) := ReceiveInstance()
    ŷ(t) := sign(w^T · x(t)) ∈ {-1, +1}                    (Predict)
    Compute the zero-one loss.
    if ŷ(t) ≠ y(t):                                        (Update parameters)
        w := w + y(t) · x(t)
return w
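A compact Python sketch of the algorithm above (the function name, the data layout with a bias coordinate x0 = 1, and the epoch cap are our choices):

```python
def pla(data, max_epochs=100):
    """Perceptron learning algorithm; data is a list of (x, y) pairs with
    y in {-1, +1} and x including the bias coordinate x0 = 1."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if s >= 0 else -1) != y:        # zero-one loss is 1: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                         # separable data: converged
            break
    return w
```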
![Page 74: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/74.jpg)
74
Perceptron Learning Algorithm

w := (w0, ..., wd)^T = (0, ..., 0)^T ∈ R^{d+1}
for t := 1 ... T:            (iteration: epoch/time; N is the sample size;
                              the algorithm makes multiple passes over the data)
    (x(t), y(t)) := ReceiveInstance()
    ŷ(t) := sign(w^T · x(t)) ∈ {-1, +1}
    if ŷ(t) ≠ y(t):
        w := w + y(t) · x(t)
return w

(The superscript T denotes transpose here.)
![Page 75: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/75.jpg)
75
Perceptron Learning Algorithm (PLA)

Initialize weights: w := 0
while mis-classified examples exist:
    Select a mis-classified example (x(t), y(t)):  y(t) ≠ sign(w^T x(t))
    Update weights:  w := w + y(t) · x(t)
return w

A mis-classified example means: with the current weights w, y(t) ≠ sign(w^T x(t)).

1. A challenge: the algorithm will not terminate for non-linearly separable data (outliers, noise).
2. Unstable: it can jump from a good perceptron to a really bad one within one update.
3. More generally, it is attempting to minimize

    min_w (1/N) Σ_{t=1..N} [[ y(t) ≠ sign(w^T x(t)) ]]

which is NP-hard.
![Page 76: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/76.jpg)
76
Perceptron
If a point is classified incorrectly:
    sgn(y(t)) ≠ sgn(w_old^T · x(t))  ⇒  y(t) w_old^T · x(t) < 0

Weight update:  w_new = w_old + y(t) x(t)

y(t) w_new^T · x(t) = y(t) (w_old + y(t) x(t))^T · x(t)
                    = y(t) w_old^T · x(t) + (y(t))^2 ||x(t)||^2
                    = y(t) w_old^T · x(t) + ||x(t)||^2
                    > y(t) w_old^T · x(t)

Thus, the perceptron weight update pushes the signal in the “right direction”.
![Page 77: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/77.jpg)
77
Looks Simple – Does It Work?
Fact (a margin-based upper bound on updates):

    Number of updates by the perceptron algorithm ≤ r^2 / ρ^2

where, for training points x(1), ..., x(N) ∈ R^{d+1}, there exist r, ρ, and a separating vector v such that:
    r ≥ ||x(i)||                          (for all i)
    ρ ≤ y(i) (v · x(i)) / ||v||           (for all i)

The quantity ρ is known as the “normalized margin”.

Remarkable: the bound does not depend on the dimension of the feature space!
![Page 78: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/78.jpg)
78
Compact Model Representation
Use float instead of double.
Store only non-zero weights (and indices).
Store non-zero weights and the difference of indices (remember the last index where the weight was non-zero):

void Save( StreamWriter w, int labelIdx, float[] weights )
{
    w.Write( labelIdx );
    int previousIndex = 0;
    for (int i = 0; i < weights.Length; i++)
    {
        if (weights[ i ] != 0.0f)
        {
            // write the difference of indices, then the weight
            w.Write( " " + (i - previousIndex) + " " + weights[ i ] );
            previousIndex = i;
        }
    }
}
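The same delta-encoding idea, sketched in Python as an encode/decode round trip (function names are ours):

```python
def encode_sparse(weights):
    """Keep only non-zero weights, storing the delta to the previous
    non-zero index instead of the absolute index."""
    pairs, prev = [], 0
    for i, wi in enumerate(weights):
        if wi != 0.0:
            pairs.append((i - prev, wi))
            prev = i
    return pairs

def decode_sparse(pairs, length):
    """Invert encode_sparse by accumulating the index deltas."""
    weights, idx = [0.0] * length, 0
    for delta, wi in pairs:
        idx += delta
        weights[idx] = wi
    return weights
```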
![Page 79: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/79.jpg)
79
Linear Classification Solutions
A fixed choice of w defines the hyperplane and, thus, the solution to our (linear) task.

[Figure: the same + and − points with several different separating lines drawn; there are infinitely many solutions.]
![Page 80: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/80.jpg)
80
The Pocket Algorithm

A better perceptron algorithm: keep track of the error and keep the new weights only when they lower the error.

Initialize weights w(0);  bestErr := Double.MAX
for i := 0 ... T-1:
    Run PLA for one iteration and obtain new w(i+1).
    Compute the error (an expensive step: access to the entire data is needed!):

        Err(w(i+1)) = (1/N) Σ_{n=1..N} [[ sgn(w(i+1) x(n)) ≠ y(n) ]]

    if Err(w(i+1)) < bestErr:      (only keep the best weights if we lower the error!)
        bestErr := Err(w(i+1));  keep w(i+1) as the best weights
return the best weights
![Page 81: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/81.jpg)
81
Voted Perceptron• Training as in the usual perceptron algorithm (with some extra book-keeping).• Decision rule:

    ŷ = sgn( ( Σ_t c_t w(t) ) · x )

The coefficient c_t is proportional to the number of iterations that w(t) survives, i.e. the number of iterations between w(t) and w(t+1): the examples (x(i1), y(i1)), (x(i2), y(i2)), ..., (x(i_{c_t}), y(i_{c_t})) that w(t) predicts as ŷ(i1), ŷ(i2), ..., ŷ(i_{c_t}) before its next update.
![Page 82: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/82.jpg)
82
Dual Perceptron: Intuitions
[Figure: + and − points in the (x1, x2) plane with a separating line and its normal vector; y+ = +1 for the + class and y− = −1 for the − class. Intuition: the learned weight vector is a signed sum of training examples, w = Σ_j α_j y(j) x(j).]
![Page 83: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/83.jpg)
83
Dual Perceptron
α := (α1, ..., αN)^T = (0, ..., 0)^T ∈ R^N                (Initialize parameters)
for t := 1 ... T:             (iteration: epoch/time; the algorithm makes
                               multiple passes over the N training examples)
    (x(t), y(t)) := ReceiveInstance()
    ŷ(t) := sign( Σ_{j=1..N} α_j y(j) (x(j)^T · x(t)) ) ∈ {-1, +1}      (Predict)
    if ŷ(t) ≠ y(t):                                                     (Update)
        α_t := α_t + 1
return α

Decision rule:  ŷ := sign( Σ_{j=1..N} α_j y(j) (x(j)^T · x) )

α_j gives a notion of how difficult instance j is.
The kernel perceptron uses a kernel K(x(j), x) in place of the dot product x(j)^T · x.
![Page 84: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/84.jpg)
84
Exclusive OR (XOR) Function
Truth table: the output is 1 when exactly one of the inputs x1, x2 is 1, and 0 otherwise.

[Figure: the four input points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane, color-coded by the output.]

Challenge: the data is not linearly separable (no straight line can be drawn that separates the green from the blue points).
![Page 85: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/85.jpg)
85
Solution for the Exclusive OR (XOR)
We introduce another input dimension x3:

[Figure: the four XOR points lifted into (x1, x2, x3) space.]

Now the data is linearly separable.
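The slide leaves the new dimension unspecified; one common choice (an assumption here) is x3 = x1 · x2, under which a single threshold on a linear function, i.e. a plane in 3-D, separates the classes:

```python
# Assumed lift: x3 = x1 * x2. Then x1 + x2 - 2*x3 equals 1 exactly on the
# XOR-true points and 0 on the others, so one plane separates the classes.
def lifted_predict(x1, x2):
    x3 = x1 * x2                       # the added input dimension
    return 1 if x1 + x2 - 2 * x3 > 0.5 else 0

xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```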
![Page 86: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/86.jpg)
86
Winnow Algorithm

w := (1/(d+1), ..., 1/(d+1))^T                            (Initialize)
for t := 1 ... T:                                          (iteration: epoch)
    (x(t), y(t)) := ReceiveInstance()
    ŷ(t) := sgn(w^T · x(t))                                (Predict)
    if ŷ(t) ≠ y(t):                                        (multiplicative Update)
        Z := Σ_{i=0..d} w_i e^{y(t) x_i(t)}                (normalizing constant)
        for i := 0 ... d:
            w_i := w_i e^{y(t) x_i(t)} / Z
return w
87
Training, Test Error and Complexity
Test error
Training error
Model complexity
![Page 88: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/88.jpg)
88
Logistic Regression

g(x) = θ(w^T x)

Logistic function:  θ(s) = e^s / (1 + e^s);   note that 1 - θ(s) = θ(-s).

Labels: y ∈ {-1, +1}.
Target:  f(x) = P(y = +1 | x).
The data does not give the probability explicitly.
![Page 89: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/89.jpg)
89
Logistic Regression

How likely is it that we get output y when we have input x:

P(y | x) = f(x)       when y = +1
P(y | x) = 1 - f(x)   when y = -1

With the model g:
P(y | x) = g(x) = θ(w^T x) = θ(y w^T x)                          when y = +1
P(y | x) = 1 - g(x) = 1 - θ(w^T x) = θ(-w^T x) = θ(y w^T x)      when y = -1

Data likelihood:  L(w) = Π_{i=1..N} P(y(i) | x(i))

Which w maximizes the likelihood? Equivalently, which w minimizes the negative log-likelihood:

-l(w) = -(1/N) ln( Π_{i=1..N} P(y(i) | x(i)) )
      = (1/N) Σ_{i=1..N} ln( 1 / P(y(i) | x(i)) )
      = (1/N) Σ_{i=1..N} ln( 1 / θ(y(i) w^T x(i)) )
      = (1/N) Σ_{i=1..N} ln( 1 + e^{-y(i) w^T x(i)} )

Error:  E(w) = (1/N) Σ_{i=1..N} ln( 1 + e^{-y(i) w^T x(i)} )
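The error E(w) is easy to compute directly (a minimal Python sketch; the function name is ours):

```python
import math

def logistic_error(w, data):
    """E(w) = (1/N) * sum_i ln(1 + exp(-y_i * w^T x_i))."""
    total = 0.0
    for x, y in data:
        s = sum(wi * xi for wi, xi in zip(w, x))
        total += math.log(1.0 + math.exp(-y * s))
    return total / len(data)
```

At w = 0 every term is ln 2, the error of pure guessing; any weight vector that points the signal toward the labels lowers it.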
![Page 90: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/90.jpg)
90
Refresher

Derivative:  f'(x) = lim_{Δx→0} ( f(x + Δx) - f(x) ) / Δx

Example:
(x^2)' = lim_{Δx→0} ( (x + Δx)^2 - x^2 ) / Δx
       = lim_{Δx→0} ( x^2 + 2x Δx + (Δx)^2 - x^2 ) / Δx
       = lim_{Δx→0} ( 2x + Δx ) = 2x

(3x^2)' = 3 · 2 · x^{2-1} = 6x
(ln x)' = 1/x;  (e^x)' = e^x;  chain rule:  ( f(g) )' = f'(g) · g'

Partial derivative:  (∂/∂x) F(x, y);  e.g.  (∂/∂x) (x^2 + 2xy + y^2) = 2x + 2y

Partial derivative at a point:
(∂/∂w0) (w0^2 + 2 w0 w1 + w1^2) |_{w0=2, w1=3} = (2 w0 + 2 w1) |_{w0=2, w1=3} = 2·2 + 2·3

Gradient (derivatives with respect to each component):  [ ∂(·)/∂w0, ∂(·)/∂w1, ..., ∂(·)/∂wd ]
Gradient of the error:  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wd ]
This is a vector and we can compute it at a point.
![Page 91: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/91.jpg)
91
Hypothesis Space

The best w to use is the one which minimizes:
E(w) = (1/N) Σ_{i=1..N} ln( 1 + e^{-y(i) w^T x(i)} )

Different w give rise to different values for E(w).
E(w) over the weight space is the error surface; -∇E(w) points downhill on it.

[graph from T. Mitchell]
![Page 92: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/92.jpg)
92
Math Fact

The gradient of the error:
∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wd ]
(a vector in weight space) specifies the direction of the argument that leads to the steepest increase in the value of the error.
The negative of the gradient gives the direction of the steepest decrease.

[Figure: in the (w0, w1) plane, the best weights w(t) found up to iteration t, the negative gradient -∇E(w(t)), and the new best weights w(t+1) at iteration t+1 (see next slides).]
![Page 93: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/93.jpg)
93
Computing the Gradient

E(w) = (1/N) Σ_{i=1..N} ln( 1 + e^{-y(i) w^T x(i)} )

∇E(w) = ∇ (1/N) Σ_{i=1..N} ln( 1 + e^{-y(i) w^T x(i)} )
      = (1/N) Σ_{i=1..N} ∇ ln( 1 + e^{-y(i) w^T x(i)} )          (because the gradient is a linear operator)
      = (1/N) Σ_{i=1..N} [ 1 / (1 + e^{-y(i) w^T x(i)}) ] · e^{-y(i) w^T x(i)} · (-y(i) x(i))
      = (1/N) Σ_{i=1..N} [ 1 / (1 + 1/e^{y(i) w^T x(i)}) ] · [ 1 / e^{y(i) w^T x(i)} ] · (-y(i) x(i))
      = -(1/N) Σ_{i=1..N} y(i) x(i) / (1 + e^{y(i) w^T x(i)})

Facts used:  (ln u)' = u'/u;  (1 + e^z)' = e^z · z';  e^{-z} = 1/e^z.
![Page 94: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/94.jpg)
94
(Batch) Gradient Descent

General technique for minimizing a differentiable function like E(w).

Initialize weights: w := 0
repeat
    Compute gradient:  grad := -(1/N) Σ_{i=1..N} y(i) x(i) / (1 + e^{y(i) w^T x(i)})
    Update weights:    w := w - η · grad
until Stop: max number of iterations reached; or marginal error improvement; or the error itself is small.
return w

η is the learning rate.

If a random training example (x(i), y(i)) is selected and the gradient is computed on it alone, the algorithm is called SGD (Stochastic Gradient Descent).
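A minimal Python sketch of batch gradient descent on the logistic error, using the gradient formula above (the function names, toy step size, and iteration count are our choices):

```python
import math

def gradient(w, data):
    """grad E(w) = -(1/N) * sum_i y_i x_i / (1 + exp(y_i w^T x_i))."""
    n = len(data)
    g = [0.0] * len(w)
    for x, y in data:
        s = sum(wi * xi for wi, xi in zip(w, x))
        coef = -y / (n * (1.0 + math.exp(y * s)))
        for j, xj in enumerate(x):
            g[j] += coef * xj
    return g

def batch_gradient_descent(data, dim, eta=0.5, iterations=500):
    """Repeat w := w - eta * grad, starting from w = 0."""
    w = [0.0] * dim
    for _ in range(iterations):
        g = gradient(w, data)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w
```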
![Page 95: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/95.jpg)
95
Punch Line
With the best weights w computed using gradient descent, given an unknown input object encoded as a vector of features x, the output probability that the object is in the class is:

P(y = +1 | x; w) = e^{w^T x} / (1 + e^{w^T x})

Classification rule: the new object is in the class if P(y = +1 | x; w) > τ.

Predict y = +1 if P(y = +1 | x; w) > 0.5, or equivalently if w^T x > 0. The larger w^T x is, the larger P(y = +1 | x; w) will be, and so will our degree of confidence that y = +1. The prediction that y = +1 is very confident if w^T x ≫ 0; similarly, logistic regression makes a very confident decision that y = -1 if w^T x ≪ 0.
![Page 96: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/96.jpg)
96
Newton’s Method• An alternate way to minimize a function (like E(w)).• We need the derivative of the error (negative log-likelihood) and to find for which values
of the parameters the derivative is zero.• Let f be a function; we want to find u* such that f(u*) = 0.

Iterate:  u_{i+1} := u_i - f(u_i) / f'(u_i)

[Figure: the tangent to f at u_i crosses the axis at u_{i+1}; geometrically,
f(u_i) / (u_i - u_{i+1}) = tan γ = f'(u_i).]
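The iteration is a one-liner per step (a sketch; the function name is ours), shown here finding the root of f(u) = u^2 - 2:

```python
def newton_root(f, f_prime, u0, iterations=20):
    """Newton's method: u_{i+1} = u_i - f(u_i) / f'(u_i)."""
    u = u0
    for _ in range(iterations):
        u = u - f(u) / f_prime(u)
    return u
```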
![Page 97: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/97.jpg)
97
Newton-Raphson• Generalization of Newton’s method to the multidimensional case.• The parameters are a vector θ (we used the notation w).

One dimension:     θ := θ - l'(θ) / l''(θ)
Multidimensional:  θ := θ - H^{-1} · ∇l(θ)

where H is the Hessian matrix:  H_ij = ∂²l(θ) / (∂θ_i ∂θ_j)
![Page 98: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/98.jpg)
98
Robust Risk Minimization

Notation:
x — input vector
y ∈ {-1, +1} — label
(x(i), y(i)) — training examples
w — weight vector
b — bias
p(x) — continuous linear model (p(x) = w^T x + b)

Prediction rule:  ŷ(x) = +1 if p(x) ≥ 0;  -1 if p(x) < 0.

Classification error:  l(p(x), y) = 1 if p(x)·y ≤ 0;  0 if p(x)·y > 0.
![Page 99: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/99.jpg)
99
Robust Classification Loss

Parameter estimation:
(ŵ, b̂) = argmin_{w,b} (1/N) Σ_{i=1..N} loss( w^T · x(i) + b, y(i) )

Hinge loss:
g(p(x), y) = 1 - p(x)·y    if p(x)·y ≤ 1
           = 0             if p(x)·y > 1

Robust classification loss:
h(p(x), y) = -2 p(x)·y               if p(x)·y ≤ -1
           = (1/2) (p(x)·y - 1)^2    if p(x)·y ∈ [-1, 1]
           = 0                       if p(x)·y > 1
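Both piecewise losses sketched in Python (function names are ours); note that h is continuous where its pieces meet, at p(x)·y = ±1:

```python
def hinge_loss(p, y):
    """g(p(x), y): 1 - p*y when p*y <= 1, else 0."""
    m = p * y
    return 1.0 - m if m <= 1.0 else 0.0

def robust_loss(p, y):
    """h(p(x), y): linear for m <= -1, quadratic on [-1, 1], zero for m > 1."""
    m = p * y
    if m <= -1.0:
        return -2.0 * m
    if m <= 1.0:
        return 0.5 * (m - 1.0) ** 2
    return 0.0
```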
![Page 100: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/100.jpg)
100
Loss Functions: Comparison
![Page 101: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/101.jpg)
101
Confidence and Regularization

Confidence P(y = 1 | x):
P̂(x) = max( 0, min( 1, (ŵ · x + b̂ + 1) / 2 ) )

Parameter estimation:
(ŵ, b̂) = argmin_{w,b} (1/N) Σ_{i=1..N} h( w^T · x(i) + b, y(i) )

Regularization:  ||w||^2 + b^2 ≤ A

Unconstrained optimization (Lagrange multiplier; a smaller λ corresponds to a larger A):
(ŵ, b̂) = argmin_{w,b} [ (1/N) Σ_{i=1..N} h( w^T · x(i) + b, y(i) ) + (λ/2) ( ||w||^2 + b^2 ) ]
![Page 102: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/102.jpg)
102
Robust Risk Minimization

Input:  x(i) ∈ R^{d+1};  y(i) ∈ {-1, +1}
Initialization:  w := 0;  b := 0;  α := 0
for k := 1 ... K:              (number of passes over the data)
    for i := 1 ... N:          (go over the training data)
        p := y(i) (w^T · x(i) + b)
        d_i := max( min( 2c - α_i, η (c - α_i - c·p) ), -α_i )
        w := w + d_i y(i) x(i)
        b := b + d_i y(i)
        α_i := α_i + d_i
return (w, b)
![Page 103: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/103.jpg)
103
Learning Curve• Plots an evaluation metric
against the fraction of training data used (on the same test set!).
• Highest performance is bounded by human inter-annotator agreement (ITA).
• A leveling-off effect can guide us on how much data is needed.

[Figure: learning curve; x-axis: percentage of data used for each experiment (0%–100%); y-axis: evaluation number (0–100). The experiment with 50% of the training data yields an evaluation number of 70.]
![Page 104: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/104.jpg)
104
Summary
• Examples of ML• Categorization• Object encoding• Linear models:
– Perceptron– Winnow– Logistic Regression– RRM
• Engineering aspects of ML systems
![Page 105: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/105.jpg)
105PART II: POPULARITY
![Page 106: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/106.jpg)
106
Goal
• Quantify how popular an entity is.
Motivation:• Used in the new local search relevance metric.
![Page 107: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/107.jpg)
107
What is popularity?
• Use clicks on entity as proxy for popularity.
• Popularity score [0..1].• Goal: preserve relative
ranking between clicks vs. predicted popularity score.
![Page 108: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/108.jpg)
108
POPULARITY IN LOCAL SEARCH
![Page 109: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/109.jpg)
109
Popularity
• Output a popularity score (regression)• Ensemble methods• Tree base procedure (non-linear)• Boosting
![Page 110: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/110.jpg)
110
When is a Local Entity Popular?
• Definition:Visited by many people in the context of alternative choices.
• Is the popularity of restaurants the same as the popularity of movies, etc.?
• How to operationalize “visit”, “many”, “alternative choices”?– Initially we are using: popular means clicked more.
• Going forward we will use:– “visit” = click given an impression.– “choice” = density of entities in the same primary category.– “many” = fraction of clicks from impressions.
![Page 111: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/111.jpg)
111
Local Entity Popularity
Popularity = boosted click-through rate (CTR) for entity e:

Popularity(e) = CTR_e + (1 - CTR_e) · Density_e

where:
CTR(e) = Clicks(e) / Impressions(e)                         (CTR ∈ [0, 1])
Density_e = (2/π) · tan^{-1}( numBusinessesNear(e) )
numBusinessesNear(e) = number of entities in the same primary category as e within a radius

The model then will be regression: on the [0, 1] interval, the density boost (1 - CTR) · Density fills a fraction of the remaining headroom 1 - CTR above the raw CTR.
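Putting the formulas together in Python (function names are ours; the 2/π arctan scaling, read off the slide's garbled constant, maps the count into [0, 1)):

```python
import math

def density(num_businesses_near):
    """Density_e = (2/pi) * atan(numBusinessesNear(e)), in [0, 1)."""
    return (2.0 / math.pi) * math.atan(num_businesses_near)

def popularity(clicks, impressions, num_businesses_near):
    """Popularity(e) = CTR_e + (1 - CTR_e) * Density_e."""
    ctr = clicks / impressions
    return ctr + (1.0 - ctr) * density(num_businesses_near)
```

With no nearby competitors the score is just the CTR; as the density of alternatives grows, the score is boosted toward 1.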
![Page 112: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/112.jpg)
112
Not all Clicks are Born the Same
• Click in the context of a named query:– It can even be argued we are not satisfying the user's information needs (and they have to click further to find what they are looking for).
• Click in the context of a category query:– Much more significant (especially when alternative results
are present).
![Page 113: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/113.jpg)
113
Local Entity Popularity
• Popularity & 1st page, current ranker.
• Entities without URL.
• Newly created entities.
• Clicks vs. mouseovers.
• Scenario: 50 French restaurants; the best entity has 2k clicks. 2 Italian restaurants; the best entity has 2k clicks. The French entity is more popular because of the larger available choice.
![Page 114: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/114.jpg)
114
Entity Representation
[Figure: a machine learning (training) instance, i.e., a row of feature values (clicks for week −1, …, clicks for week −9, # ratings, aggregate rating, # reviews, has FB page, …) together with the target value.]
![Page 115: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/115.jpg)
115
POISSON REGRESSION
Why?
– We will practice the ML machinery on a different problem, re-iterating the concepts. Poisson regression is an example of a log-linear model, good for modeling counts (e.g., the number of visitors to a store in a certain time window).
![Page 116: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/116.jpg)
116
Setup
Training data: {(x⁽ⁱ⁾, y⁽ⁱ⁾)}, i = 1..N, where the y⁽ⁱ⁾ are counts (rather than arbitrary reals, as in regression problems).

Goal: come up with a system which, given a new observation x, can correctly predict the corresponding outcome y.

x: explanatory variables; y: response/outcome variable.

These counts in our scenario are the clicks on the web page. A good way to model counts of observations is the Poisson distribution.
![Page 117: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/117.jpg)
117
Poisson Distribution: Preliminaries
The Poisson distribution realistically describes the pattern of requests over time in many client-server situations. Examples are: incoming customers at a bank, calls into a company’s telephone exchange, requests for storage/retrieval services from a database server, and interrupts to a central processor. It also has higher-dimensional applications, such as the spatial distribution of defects on integrated circuit wafers and the volume distribution of contaminants in well water.

In such cases, the “events” (request arrivals or defect occurrences) are independent. Customers do not conspire to achieve some special pattern in their access to a bank teller; rather, they operate as independent agents. The manufacture of hard disks or integrated circuits introduces unavoidable defects because the process pushes the limits of geometric tolerances. Therefore, a perfectly functional process will still occasionally produce a defect, such as a small area on the disk surface where the magnetic material is not spread uniformly or a shorted transistor on an integrated circuit chip. These errors are independent in the sense that a defect at one point does not influence, for better or worse, the chance of a defect at another point.

Moreover, if the time interval or spatial area is small, the probability of an event is correspondingly small. This is a characterizing feature of a Poisson distribution: event probability decreases with the window of opportunity and is linear in the limit. A second characterizing feature, a negligible probability of two or more events in a small interval, is also present in the mentioned examples.
![Page 118: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/118.jpg)
118
Poisson Distribution: Formally
The Poisson distribution can be used to model situations in which the expected number of events scales with the length of the interval within which the events can occur. If λ is the expected number of events per unit interval, then the distribution of the number of events X within an interval of length t is:

p(X = k | λ) = e^{−λt} (λt)^k / k!

For a unit-length interval (t = 1): p(X = k | λ) = e^{−λ} λ^k / k!

Mean: E[X] = λt.  Variance: Var[X] = λt.
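As a quick numerical sanity check on the pmf, mean, and variance just stated (a minimal sketch; t defaults to a unit interval):

```python
import math

def poisson_pmf(k: int, lam: float, t: float = 1.0) -> float:
    """p(X = k | lambda) = e^{-lambda*t} * (lambda*t)^k / k!"""
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)
```

Summing k ∙ p(k) and k² ∙ p(k) over a long enough range recovers mean λ and variance λ.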
![Page 119: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/119.jpg)
119
Poisson Distribution: Mental Steps
First, we reserve x for the input, so the count variable becomes Y and we write (for a unit interval, t = 1):

p(Y = y | λ) = e^{−λ} λ^y / y!

The output is determined by a single scalar parameter λ. We will have λ depend on the input x in the following way:

μ = E[Y] = λ = e^{xᵀ∙β}        (this comes from the theory of Generalized Linear Models, GLMs)

ln(λ) = xᵀ∙β

The log of λ is a linear combination of the input features; hence the name log-linear model.

In contrast, a linear model λ = xᵀ∙β could potentially make λ negative, but λ is the mean of a count!

We used to write w for the parameters (when discussing logistic regression). Now we call the parameters β, and because in the training phase they are unknown, we write them as the second argument in the dot product to emphasize that they are the argument.
![Page 120: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/120.jpg)
120
Poisson Distribution
Data likelihood:

L(β) = ∏ᵢ₌₁ᴺ P(y⁽ⁱ⁾ | x⁽ⁱ⁾)

Log-likelihood:

ℓ(β) = ln ∏ᵢ₌₁ᴺ P(y⁽ⁱ⁾ | x⁽ⁱ⁾)
     = Σᵢ₌₁ᴺ ln P(y⁽ⁱ⁾ | x⁽ⁱ⁾)
     = Σᵢ₌₁ᴺ ln [ e^{−e^{x⁽ⁱ⁾β}} (e^{x⁽ⁱ⁾β})^{y⁽ⁱ⁾} / y⁽ⁱ⁾! ]
     = Σᵢ₌₁ᴺ [ ln e^{−e^{x⁽ⁱ⁾β}} + ln (e^{x⁽ⁱ⁾β})^{y⁽ⁱ⁾} − ln y⁽ⁱ⁾! ]
     = Σᵢ₌₁ᴺ [ −e^{x⁽ⁱ⁾β} + y⁽ⁱ⁾ x⁽ⁱ⁾β − ln y⁽ⁱ⁾! ]

using p(Y = y | λ) = e^{−λ} λ^y / y! with λ = e^{xᵀ∙β}.

Which β maximizes this?
![Page 121: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/121.jpg)
121
Maximizing the Log-Likelihood

ℓ(β) = Σᵢ₌₁ᴺ [ −e^{x⁽ⁱ⁾β} + y⁽ⁱ⁾ x⁽ⁱ⁾β − ln y⁽ⁱ⁾! ]      Which β maximizes this?

Set the gradient to zero: ∇ℓ(β) = 0

∇ℓ(β) = Σᵢ₌₁ᴺ [ −x⁽ⁱ⁾ e^{x⁽ⁱ⁾β} + y⁽ⁱ⁾ x⁽ⁱ⁾ ] = Σᵢ₌₁ᴺ (y⁽ⁱ⁾ − e^{x⁽ⁱ⁾β}) x⁽ⁱ⁾ = 0

Non-linear in β; does not have an analytical solution. (Solve numerically, e.g., by gradient ascent or Newton’s method.)
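Since there is no closed form, the standard remedy is numerical optimization. A minimal gradient-ascent sketch on the log-likelihood (the learning rate and iteration count are illustrative choices, not from the slides):

```python
import math

def poisson_loglik_grad(beta, X, Y):
    """Gradient of the Poisson log-likelihood: sum_i (y_i - exp(x_i . beta)) * x_i."""
    grad = [0.0] * len(beta)
    for x, y in zip(X, Y):
        mu = math.exp(sum(b * xj for b, xj in zip(beta, x)))
        for j in range(len(beta)):
            grad[j] += (y - mu) * x[j]
    return grad

def fit_poisson(X, Y, lr=0.02, iters=30000):
    """Gradient ascent; the log-likelihood is concave, so this converges
    to the (unique) maximum for a small enough step size."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        g = poisson_loglik_grad(beta, X, Y)
        beta = [b + lr * gj for b, gj in zip(beta, g)]
    return beta
```

With a constant feature x[0] = 1, beta[0] plays the role of an intercept.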
![Page 122: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/122.jpg)
122
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman’s coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
![Page 123: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/123.jpg)
123
DECISION TREES
Why?
– DTs are an influential development in ML. Combined in ensembles they provide very competitive performance. We will see ensemble techniques in the next part.
![Page 124: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/124.jpg)
124
Decision Trees
𝑥𝑖<𝑠1
𝑥 𝑗<𝑠2
Binary partitioning of the data during training (navigating to a leaf node during testing).

Selecting a dimension xᵢ and a split value s₁.

Prediction: training instances at a node are more homogeneous in terms of the output variable (more “pure”) compared to ancestor nodes.

Stopping when the instances are homogeneous or when only a small number of instances remain.

Training instances: color reflects the output variable (classification example).
𝑥 𝑗≥𝑠2
𝑥𝑖≥ 𝑠1
![Page 125: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/125.jpg)
125
Decision Tree: Example
Parents Visiting
Weather
Money
Cinema
CinemaShopping
Stay in
PoorRich
RainyWindySunny
NoYes
Play tennis
Attribute/feature/predicate
Value of the attribute
Predicted classes.
(classification example with categorical features)
Branching factor depends on the number of possible values for the attribute (as seen in the training set).
![Page 126: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/126.jpg)
126
Entropy (needed for describing how an attribute is selected)

Entropy(Instances) = − Σ_{c ∈ Classes} p_c ∙ log₂ p_c

Example: entropy values for two classes, varying the probability p of one class (the probability of the other class is 1 − p):

Entropy = −p₁ ∙ log₂ p₁ − p₂ ∙ log₂ p₂ = −p ∙ log₂ p − (1 − p) ∙ log₂(1 − p)
[Plot: entropy as a function of p on [0, 1]; it is 0 at p = 0 and p = 1 and peaks at 1 bit at p = 0.5.]
![Page 127: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/127.jpg)
127
Selecting an Attribute: Information Gain

Measure of the expected reduction in entropy:

Gain(S, a) = Entropy(S) − Σ_{v ∈ Values(a)} (|S_v| / |S|) ∙ Entropy(S_v)

(S: the instances; a: an attribute; S_v: the instances with value v for attribute a.)

Choose the attribute with the highest information gain (equivalently, the one that minimizes the weighted entropy after the split).

See Mitchell’97, p. 59 for an example.
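Both quantities can be sketched in a few lines (a minimal sketch; representing instances as dicts of categorical attributes is an illustrative choice):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c) over the class frequencies in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, a) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(
        len(sub) / n * entropy(sub) for sub in by_value.values())
```

Evaluating this for every candidate attribute and taking the argmax is exactly the selection step described above.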
![Page 128: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/128.jpg)
128
Splitting ‘Hairs’
?
If there are only a small number of instances do not split the node further (statistics are unreliable).
If there are no instances in the current node, inherit statistics (majority class) from parent node.
𝑎𝑡𝑡𝑟=𝑣𝑎𝑙1 𝑎𝑡𝑡𝑟=𝑣𝑎𝑙2 𝑎𝑡𝑡𝑟=𝑣𝑎𝑙3
If there is more training data, the tree can be “grown” bigger.
![Page 129: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/129.jpg)
129
ID3 Algorithm
ID3(Examples, Attributes, Target):
  root ← new node
  if all Examples are positive then label(root) ← +1; return root
  if all Examples are negative then label(root) ← −1; return root
  if Attributes = ∅ then label(root) ← most common class among Examples; return root
  a ← best attribute (highest information gain)
  foreach v: possible value of attribute a:
    Examples_v ← the examples that have value v for attribute a
    if Examples_v = ∅ then add a leaf labeled with the most common class among Examples
    else add the subtree ID3(Examples_v, Attributes ∖ {a}, Target)
  return root

x⁽ⁱ⁾ ∈ ℝ^{d+1}; y⁽ⁱ⁾ ∈ {−1, +1}
![Page 130: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/130.jpg)
130
Alternative Attribute Selection: Gain Ratio   [Quinlan 1986]

GainRatio(S, a) = Gain(S, a) / SplitInformation(S, a)

SplitInformation(S, a) = − Σ_{v ∈ Values(a)} (|S_v| / |S|) ∙ log₂(|S_v| / |S|)

(S: the instances; a: an attribute; S_v: the instances with value v for attribute a.)

Examples:

• An attribute with all-different values (n instances, n distinct values):
  SplitInformation(S, a) = − Σ_{v ∈ {1..n}} (1/n) ∙ log₂(1/n) = −n ∙ (1/n) ∙ log₂(n⁻¹) = log₂ n

• A boolean attribute splitting 2n instances into two halves of size n:
  SplitInformation(S, a) = − Σ_{v ∈ {0,1}} (n/2n) ∙ log₂(n/2n) = −2 ∙ (1/2) ∙ log₂(2⁻¹) = 1
![Page 131: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/131.jpg)
131
Alternative Attribute Selection: GINI Index   [Corrado Gini: Italian statistician]

Gini(S, y) = 1 − Σ_{v ∈ Values(y)} (|S_v| / |S|)²

(the target y is treated just like another attribute)

GiniGain(S, a) = Gini(S, y) − Σ_{v ∈ Values(a)} (|S_v| / |S|) ∙ Gini(S_v, y)

â = argmax_{a ∈ Attributes} GiniGain(S, a): the selected attribute is the one that maximizes the Gini gain.
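The Gini criterion can be sketched in the same style as information gain (a minimal sketch; the dict-based instance representation is an illustrative choice):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_c (|S_c|/|S|)^2 over the class frequencies in S."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(rows, labels, attr):
    """GiniGain(S, a) = Gini(S) - sum_v |S_v|/|S| * Gini(S_v)."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return gini(labels) - sum(
        len(sub) / n * gini(sub) for sub in by_value.values())
```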
![Page 132: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/132.jpg)
132
Space of Possible Decision Trees

Assume:
• Binary classifier;
• n binary attributes;
• height h.

[Figure: a full binary tree; the root has n attributes to choose from, its 2 children n − 1, the 2^i nodes at depth i have n − i, and the 2^h leaves carry 0/1 labels.]

Number of possible trees:  2^{2^h} ∙ [ Σ_{i=0}^{h} 2^i (n − i) ]
![Page 133: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/133.jpg)
133
Decision Trees and Rule Systems
A path from each leaf node to the root represents a conjunctive rule:
Cinema
CinemaShopping
Stay in
PoorRich
RainyWindySunny
NoYes
Play tennis
if (ParentsVisiting==No) & (Weather==Windy) & (Money==Poor) then Cinema.
Parents Visiting
Weather
Money
![Page 134: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/134.jpg)
134
Decision Trees
• Different training sample -> different resulting tree (different structure).
• Learning does (conditional) feature selection.
![Page 135: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/135.jpg)
135
Regression Trees
Like classification trees, but the prediction is a number (as suggested by “regression”).
1. How do we split?
2. When to stop?
𝑥𝑖<𝑠1
𝑥 𝑗<𝑠2
predictions (constants): c₁, c₂, c₃ ∈ ℝ
![Page 136: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/136.jpg)
136
Regression Trees: How to Split

Finding:
• Dimension j
• Split value s.

⟨j, s⟩ = argmin_{j, s} [ min_{c₁} Σ_{i: X⁽ⁱ⁾[j] < s} (Y⁽ⁱ⁾ − c₁)² + min_{c₂} Σ_{i: X⁽ⁱ⁾[j] ≥ s} (Y⁽ⁱ⁾ − c₂)² ]

where X⁽ⁱ⁾ = (…, X⁽ⁱ⁾[j], …) are the training inputs, i = 1..N, with responses Y⁽ⁱ⁾. The inner minimizers are the means: c₁ is the mean of the Y⁽ⁱ⁾ with X⁽ⁱ⁾[j] < s, and c₂ the mean of those with X⁽ⁱ⁾[j] ≥ s.
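The exhaustive search over ⟨j, s⟩ can be sketched directly (O(d ∙ n²) as written; using the observed feature values as candidate thresholds is an illustrative choice):

```python
def best_split(X, Y):
    """Find dimension j and threshold s minimizing the summed squared
    error around the per-side means c1 and c2."""
    best = (None, None, float("inf"))  # (j, s, score)
    for j in range(len(X[0])):
        for s in sorted({x[j] for x in X}):
            left = [y for x, y in zip(X, Y) if x[j] < s]
            right = [y for x, y in zip(X, Y) if x[j] >= s]
            if not left or not right:
                continue  # degenerate split: one side is empty
            c1 = sum(left) / len(left)    # inner minimizer: the mean
            c2 = sum(right) / len(right)
            score = (sum((y - c1) ** 2 for y in left)
                     + sum((y - c2) ** 2 for y in right))
            if score < best[2]:
                best = (j, s, score)
    return best
```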
![Page 137: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/137.jpg)
137
Regression Trees: PruningTree operation where a pre-terminal gets its two leaves collapsed:
𝑥𝑖<𝑠10
𝑥 𝑗<𝑠20
𝑐20 𝑐30
𝑥𝑖<𝑠10
𝑐 ′
![Page 138: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/138.jpg)
138
Regression Trees: How to Stop
1. Don’t stop.
2. Build a big tree.
3. Prune.
4. Evaluate sub-trees.
![Page 139: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/139.jpg)
139
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman’s coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
![Page 140: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/140.jpg)
140
BOOSTING
![Page 141: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/141.jpg)
141
ENSEMBLE
Ensemble Methods

INPUT (an object encoded with features)
↓
System | System | System | … | System   (classifiers)
↓
Output | Output | Output | … | Output   (predictions, the response/dependent variable)
↓
Final Output (majority voting / averaging)
![Page 142: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/142.jpg)
142
Where the Systems Come from
Sequential ensemble scheme:
System
System
System
…
Data
Data
Data
System Data
Inducing a classifier.
Identifying difficult examples (through weighting the examples).
![Page 143: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/143.jpg)
143
Contrast with Bagging
Non-sequential ensemble scheme:
System
System
System
Data
Data
Data
System Data
Inducing a classifier.
Sampling with replacement.
⋮
DATA
Dataᵢ are independent of each other (likewise for Systemᵢ).
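The sampling-with-replacement and voting steps of bagging can be sketched as follows (a minimal sketch; base learners are passed in as plain functions, an illustrative choice):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement: one bagging Data_i."""
    return [rng.choice(data) for _ in data]

def bagging_predict(classifiers, x):
    """Majority vote over the independently trained classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

Each classifier is trained on its own bootstrap sample; because the samples are drawn independently, the systems can be trained in parallel.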
![Page 144: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/144.jpg)
144
Base Procedure: Decision Tree
(Data → System)
𝑥𝑖<𝑠1
𝑥 𝑗<𝑠2
Binary partitioning of the data during training (navigating to a leaf node during testing).

Selecting a dimension xᵢ and a split value s₁.

Prediction: training instances at a node are more homogeneous in terms of the output variable (more “pure”) compared to ancestor nodes.

Stopping when the instances are homogeneous or when only a small number of instances remain.

Training instances: color reflects the output variable (classification example).
𝑥 𝑗≥𝑠2
𝑥𝑖≥ 𝑠1
![Page 145: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/145.jpg)
145
Ensemble Scheme

TRAINING DATA {(X⁽¹⁾, Y⁽¹⁾), …, (X⁽ᴺ⁾, Y⁽ᴺ⁾)}  →  G(X)

base procedure on the original data → G₁(X)
base procedure on weighted data → G₂(X)
⋮
base procedure on weighted data → G_M(X)

g(X) = Σ_{m=1}^{M} αₘ ∙ Gₘ(X)    final prediction (regression)

Small systems; they don’t need to be perfect.
Weights depend only on the previous iteration (memory-less).
N.B.: Data weights ≠ feature weights in linear models.
![Page 146: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/146.jpg)
146
AdaBoost (classification)

G₁(X) is trained on the original data; G₂(X), …, Gₘ(X), …, G_M(X) on reweighted data.

wᵢ⁽¹⁾ = 1/N: the weight associated with the i-th training example.

errₘ = ( Σ_{i misclassified} wᵢ⁽ᵐ⁾ ) / ( Σ_{j=1}^{N} w_j⁽ᵐ⁾ ): goodness of predictor Gₘ (the numerator sums over the misclassified examples i).

αₘ = log((1 − errₘ) / errₘ)

w̃ᵢ = wᵢ⁽ᵐ⁾ ∙ e^{αₘ} for each misclassified example i (other weights unchanged);
wᵢ⁽ᵐ⁺¹⁾ = w̃ᵢ / Σ_{j=1}^{N} w̃_j, where the denominator is a normalizing factor.

g(X) = sign( Σ_{m=1}^{M} αₘ ∙ Gₘ(X) ): final prediction.
![Page 147: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/147.jpg)
147
AdaBoost

Initialize weights: wᵢ⁽¹⁾ = 1/N for i = 1..N.

for m = 1..M:

  Create Gₘ using the weights w⁽ᵐ⁾.

  errₘ = Σ_{i=1}^{N} wᵢ⁽ᵐ⁾ ∙ ⟦Gₘ(X⁽ⁱ⁾) ≠ Y⁽ⁱ⁾⟧ / Σ_{j=1}^{N} w_j⁽ᵐ⁾

  αₘ = log((1 − errₘ) / errₘ)

  Weight update: w̃ᵢ = wᵢ⁽ᵐ⁾ ∙ e^{αₘ ⟦Gₘ(X⁽ⁱ⁾) ≠ Y⁽ⁱ⁾⟧};  wᵢ⁽ᵐ⁺¹⁾ = w̃ᵢ / Σ_{j=1}^{N} w̃_j (normalizing factor).

Final prediction: g(X) = sign( Σ_{m=1}^{M} αₘ ∙ Gₘ(X) ) = argmax_Y Σ_{m=1}^{M} αₘ ∙ ⟦Gₘ(X) = Y⟧
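The full loop can be sketched with decision stumps as the base procedure (a hedged sketch: the stump search, the early-exit conditions for err = 0 or err ≥ 0.5, and the data layout are illustrative choices, not from the slides):

```python
import math

def best_stump(X, Y, w):
    """Pick the threshold stump (feature j, threshold s, polarity) with the
    lowest weighted error under the current data weights w."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for s in {x[j] for x in X}:
            for pol in (1, -1):
                pred = [pol if x[j] < s else -pol for x in X]
                err = sum(wi for wi, p, y in zip(w, pred, Y) if p != y)
                if err < best_err:
                    best, best_err = (j, s, pol), err
    j, s, pol = best
    return lambda x: pol if x[j] < s else -pol

def adaboost(X, Y, M=10):
    """AdaBoost: reweight the examples after each round, combine stumps by alpha_m."""
    N = len(X)
    w = [1.0 / N] * N
    ensemble = []
    for _ in range(M):
        G = best_stump(X, Y, w)
        err = sum(wi for wi, x, y in zip(w, X, Y) if G(x) != y) / sum(w)
        if err == 0.0:            # perfect weak learner: use it and stop
            ensemble.append((1.0, G))
            break
        if err >= 0.5:            # no better than chance: stop
            break
        alpha = math.log((1.0 - err) / err)
        ensemble.append((alpha, G))
        # boost the weights of the misclassified examples, then renormalize
        w = [wi * (math.exp(alpha) if G(x) != y else 1.0)
             for wi, x, y in zip(w, X, Y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: 1 if sum(a * G(x) for a, G in ensemble) >= 0 else -1
```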
![Page 148: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/148.jpg)
148
Binary Classifier
• Constraint:– Must not have all zero clicks for current week, previous week and week before last
[shopping team uses stronger constraint: only instances with non-zero clicks for current week].
• Training: – 1.5M instances.– 0.5M instances (validation).
• Feature extraction:– 4.82mins (Cosmos job).
• Training time:– 2hrs 20mins.
• Testing:– 10k instances: 1sec.
![Page 149: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/149.jpg)
149
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman’s coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
![Page 150: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/150.jpg)
150
POPULARITY EVALUATION
How do we know we have a good popularity measure?
![Page 151: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/151.jpg)
151
Rank Correlation Metrics

• Input: two rankings R₁ and R₂.
• Requirements:
  −1 ≤ C(R₁, R₂) ≤ 1
  C(R₁, R₂) = 1: the two rankings are the same.
  C(R₁, R₂) = −1: the two rankings are the reverse of each other.
• The actual input is a set of objects with two rank scores (ties are possible).
![Page 152: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/152.jpg)
152
Kendall’s Tau Coefficient

Considers concordant/discordant pairs in the two rankings (each ranking w.r.t. the other).

Complexity: O(n²) when all pairs are examined directly; an O(n log n) variant based on counting inversions exists.
![Page 153: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/153.jpg)
153
What is a concordant pair?

[Figure: R₁ orders the objects a, b, c; R₂ orders them a, c, b.]

A pair (a, c) is concordant when R₁(a) − R₁(c) and R₂(a) − R₂(c) have the same sign.
![Page 154: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/154.jpg)
154
Kendall Tau: Example

R₁ = ⟨A, B, C, D⟩; R₂ = ⟨C, D, A, B⟩.

Pairs (discordant pairs in red): {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}; the discordant ones are {A,C}, {A,D}, {B,C}, {B,D}.

Observation: the total number of discordant ordered pairs (counting each ranking w.r.t. the other) is 2× the discordant pairs of one ranking w.r.t. the other, here 2 × 4 = 8.

τ = 1 − 2 ∙ DiscordantPairs(R₁, R₂) / (n(n − 1)) = 1 − (2 ∙ 8) / (4 ∙ (4 − 1)) = −1/3
![Page 155: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/155.jpg)
155
Spearman’s Coefficient

Considers ranking differences for the same object:

S(R₁, R₂) = 1 − 6 ∙ Σ_{j=1}^{n} (R₁(o_j) − R₂(o_j))² / (n(n² − 1))

using the bound 0 ≤ Σ_{j=1}^{n} (R₁(o_j) − R₂(o_j))² ≤ n(n² − 1)/3.

Example: R₁ = ⟨a, b, c⟩; R₂ = ⟨a, c, b⟩:

S(R₁, R₂) = 1 − 6 ∙ [(R₁(a) − R₂(a))² + (R₁(b) − R₂(b))² + (R₁(c) − R₂(c))²] / (3(3² − 1))
          = 1 − 6 ∙ [(1 − 1)² + (2 − 3)² + (3 − 2)²] / (3 ∙ 8) = 1/2

Complexity: O(n).
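Both coefficients can be sketched directly from their definitions (a minimal sketch; the rank arrays are assumed tie-free here, and ties are treated on the next slides):

```python
from itertools import combinations

def kendall_tau(r1, r2):
    """tau via concordant/discordant pairs; O(n^2) over all pairs.
    With no ties this equals 1 - 2*discordant / (n(n-1)/2)."""
    n = len(r1)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (r1[i] - r1[j]) * (r2[i] - r2[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def spearman(r1, r2):
    """S = 1 - 6 * sum_j d_j^2 / (n(n^2-1)); O(n)."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

On the slides’ examples: ranks ⟨1,2,3,4⟩ vs. ⟨3,4,1,2⟩ give τ = −1/3, and ⟨1,2,3⟩ vs. ⟨1,3,2⟩ give S = 1/2.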
![Page 156: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/156.jpg)
156
Rank Intuitions: Setup

[Figure: objects 1–10 ordered by their R₁ rank scores on the left and by their R₂ rank scores on the right; R₂ can be viewed as a scrambling of the order of R₁.]

The sequence ⟨3, 1, 4, 10, 5, 9, 2, 6, 8, 7⟩ is sufficient to encode the two rankings.
![Page 157: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/157.jpg)
157
Rank Intuitions: Pairs

[Figure: the −1 … 1 correlation scale; rankings in complete agreement sit at 1, rankings in complete disagreement at −1.]
![Page 158: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/158.jpg)
158
Rank Intuitions: Spearman

[Plot: Spearman’s coefficient on the −1 … 1 scale as a growing prefix p = 1, 2, 3, …, n of the ranking is scrambled; segment lengths represent R₁ rank scores.]
![Page 159: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/159.jpg)
159
Rank Intuitions: Kendall

[Plot: Kendall’s tau on the −1 … 1 scale as a growing prefix p = 1, 2, 3, …, n of the ranking is scrambled; segment lengths represent R₁ rank scores.]
![Page 160: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/160.jpg)
160
What about ties?

The position of an object within a set of objects with the same scores in the rankings affects the rank correlation.

[Figure: object o_j can be placed anywhere within a block of equally-scored objects in R₁ and R₂; the red positioning of o_j leads to a lower Spearman’s coefficient, the green one to a higher.]
![Page 161: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/161.jpg)
161
Ties

• Kendall: strict discordance:

sgn(score₁(a) − score₁(b)) ≠ sgn(score₂(a) − score₂(b))

• Spearman:
  – Can use per-entity upper and lower bounds.
  – Do as in the Olympics: objects with the same score get the same rank.
![Page 162: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/162.jpg)
162
Ties: Kendall Tau-B

http://en.wikipedia.org/wiki/Kendall_tau#Tau-b

τ_B = (n_c − n_d) / √( (n(n−1)/2 − n₁) ∙ (n(n−1)/2 − n₂) )

where:

n_c is the number of concordant pairs.
n_d is the number of discordant pairs.
n is the number of objects in the two rankings.
n₁ = Σᵢ tᵢ(tᵢ − 1)/2: the number of pairs among elements with ties in ranking R₁ (tᵢ is the size of the i-th tie group).
n₂ = Σⱼ uⱼ(uⱼ − 1)/2: the number of pairs among elements with ties in ranking R₂ (uⱼ is the size of the j-th tie group).
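A direct sketch of τ_B from score lists (scores rather than ranks, so ties are visible; O(n²) over all pairs):

```python
import math
from itertools import combinations
from collections import Counter

def kendall_tau_b(s1, s2):
    """Tau-b: ties are handled via the n1, n2 corrections in the denominator."""
    n = len(s1)
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        prod = (s1[i] - s1[j]) * (s2[i] - s2[j])
        if prod > 0:
            nc += 1
        elif prod < 0:
            nd += 1          # prod == 0 means a tie: neither concordant nor discordant
    n0 = n * (n - 1) / 2
    n1 = sum(t * (t - 1) / 2 for t in Counter(s1).values())
    n2 = sum(u * (u - 1) / 2 for u in Counter(s2).values())
    return (nc - nd) / math.sqrt((n0 - n1) * (n0 - n2))
```

With no ties, n₁ = n₂ = 0 and τ_B reduces to the plain τ.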
![Page 163: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/163.jpg)
163
Uses of Popularity
Popularity can be used to augment the gain in NDCG by linearly scaling it:

label | 1 (poor) | 2 (fair) | 3 (good) | 4 (excellent) | 5 (perfect)
gain  | −1       | 1        | 3        | 7             | 15

Gain + (popularity) ∙ Gain
![Page 164: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/164.jpg)
164
Next Steps
• How to determine the popularity of new entities.
  – Challenge: no historical data.
  – Usually there is an initial period of high popularity (e.g., a new restaurant is featured in the local paper, promotions, etc.).
• Good abandonment (no user clicks, but a good entity in terms of satisfying the user information needs, e.g., a phone number is shown).
  – Use the number of impressions for named queries.
![Page 165: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/165.jpg)
165
References
1. Yaser S. Abu-Mostafa, Malik Magdon-Ismail & Hsuan-Tien Lin (2012) Learning From Data. AMLBook.
2. Ethem Alpaydin (2009) Introduction to Machine Learning. 2nd edition. Adaptive Computation and Machine Learning series. MIT Press.
3. David Barber (2012) Bayesian Reasoning and Machine Learning. Cambridge University Press.
4. Ricardo Baeza-Yates & Berthier Ribeiro-Neto (2011) Modern Information Retrieval: The Concepts and Technology behind Search. 2nd edition. ACM Press Books.
5. Alex Bellos (2010) Alex's Adventures in Numberland. Bloomsbury: New York.
6. Ron Bekkerman, Mikhail Bilenko & John Langford (2011) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press.
7. Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Information Science and Statistics. Springer.
8. George Casella & Roger L. Berger (2001) Statistical Inference. 2nd edition. Duxbury Press.
9. Anirban DasGupta (2011) Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics. Springer.
10. Luc Devroye, László Györfi & Gábor Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Springer.
11. Richard O. Duda, Peter E. Hart & David G. Stork (2000) Pattern Classification. 2nd edition. Wiley-Interscience.
12. Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer Series in Statistics. Springer.
13. James L. Johnson (2008) Probability and Statistics for Computer Science. Wiley-Interscience.
14. Daphne Koller & Nir Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning series. MIT Press.
15. David J. C. MacKay (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press.
16. Zbigniew Michalewicz & David B. Fogel (2004) How to Solve It: Modern Heuristics. 2nd edition. Springer.
17. Tom M. Mitchell (1997) Machine Learning. McGraw-Hill.
18. Mehryar Mohri, Afshin Rostamizadeh & Ameet Talwalkar (2012) Foundations of Machine Learning. Adaptive Computation and Machine Learning series. MIT Press.
19. Lior Rokach (2010) Pattern Classification Using Ensemble Methods. World Scientific.
20. Gilbert Strang (1991) Calculus. Wellesley-Cambridge Press.
21. Larry Wasserman (2010) All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer.
22. Sholom M. Weiss, Nitin Indurkhya & Tong Zhang (2010) Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer.
![Page 166: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/166.jpg)
166
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman’s coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
![Page 167: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/167.jpg)
167
SEQUENCE LABELING:HIDDEN MARKOV MODELS (HMMs)
![Page 168: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/168.jpg)
168
Outline
• The guessing game• Tagging preliminaries• Hidden Markov Models• Trellis and the Viterbi algorithm• Implementation (Python)• Complexity of decoding• Parameter estimation and smoothing• Second order models
![Page 169: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/169.jpg)
169
The Guessing Game
• A cow and a duck write an email message together.
• Goal: figure out which word is written by which animal.
The cow/duck illustration of HMMs is due to Ralph Grishman (NYU).
![Page 170: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/170.jpg)
170
What’s the Big Deal?
• The vocabularies of the cow and the duck can overlap and it is not clear a priori who wrote a certain word!
![Page 171: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/171.jpg)
171
The Game (cont)
(Figure: the message "moo hello quack"; at first every word's author is unknown, then "moo" is revealed as COW's and "quack" as DUCK's, while "hello" remains ambiguous.)
![Page 172: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/172.jpg)
172
The Game (cont)
(Figure: the revealed answer – "moo"/COW, "hello"/COW, "quack"/DUCK.)
![Page 173: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/173.jpg)
173
What about the Rest of the Animals?
(Figure: a longer message "word1 … word5"; with more animals – ZEBRA, PIG, DUCK, COW, ANT – every one of the five words could have been written by any of them.)
![Page 174: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/174.jpg)
174
A Game for Adults
• Instead of guessing which animal is associated with each word, guess the corresponding POS tag of each word.
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
![Page 175: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/175.jpg)
175
POS Tags
"CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB", "#", "$", ".", ",", ":", "(", ")", "`", "``", "'", "''"
![Page 176: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/176.jpg)
176
Tagging Preliminaries
• We want the best sequence of tags for a sequence of words (a sentence).
• W — a sequence of words
• T — a sequence of tags
T̂ = argmax_T P(T | W)
![Page 177: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/177.jpg)
177
Bayes’ Theorem (1763)
P(T | W) = P(W | T) · P(T) / P(W)

posterior = likelihood × prior / marginal likelihood
Reverend Thomas Bayes — Presbyterian minister (1702-1761)
![Page 178: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/178.jpg)
178
Applying Bayes’ Theorem
• How do we approach P(T|W)?
• Use Bayes’ theorem.
• So what? Why is it better?
• Ignore the denominator (and the question):
argmax_T P(T | W) = argmax_T [ P(W | T) P(T) / P(W) ]
                  = argmax_T P(W | T) P(T)
![Page 179: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/179.jpg)
179
Tag Sequence Probability
How do we get the probability P(T) of a specific tag sequence T?
• Count the number of times the sequence occurs and divide by the number of sequences of that length — not likely!
  – Use the chain rule.
![Page 180: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/180.jpg)
180
P(T) is a product of the probability of the N-grams that make it up.

Make a Markov assumption: the current tag depends on the previous one only:
Chain Rule
P(T) = P(t_1, …, t_n)
     = P(t_1) P(t_2 | t_1) P(t_3 | t_1, t_2) … P(t_n | t_1, …, t_{n-1})   (history)

P(t_1, …, t_n) ≈ P(t_1) ∏_{i=2}^{n} P(t_i | t_{i-1})
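Under the bigram Markov assumption, P(T) reduces to a product that is easy to compute. A minimal sketch (the tag names and probabilities below are invented for illustration):

```python
# Sketch: probability of a tag sequence under the bigram Markov assumption.
# p_first and p_trans are toy values, not estimated from any corpus.
p_first = {'DT': 0.4, 'NN': 0.2}                  # P(t_1)
p_trans = {('DT', 'NN'): 0.5, ('NN', 'VB'): 0.3}  # P(t_i | t_{i-1})

def tag_sequence_prob(tags):
    """P(t_1) * product of P(t_i | t_{i-1}) for i = 2..n."""
    prob = p_first.get(tags[0], 0.0)
    for prev, cur in zip(tags, tags[1:]):
        prob *= p_trans.get((prev, cur), 0.0)
    return prob

print(tag_sequence_prob(['DT', 'NN', 'VB']))  # 0.4 * 0.5 * 0.3
```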
![Page 181: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/181.jpg)
181
• Use counts from a large hand-tagged corpus.
• For bi-grams, count all the t_{i-1} t_i pairs.
• Some counts are zero — we’ll use smoothing to address this issue later.
Transition Probabilities
P(t_i | t_{i-1}) = c(t_{i-1} t_i) / c(t_{i-1})
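A minimal sketch of estimating the transition probabilities by relative frequency; the toy tag sequences below are made up for illustration (real counts would come from a corpus such as Brown or the Penn Treebank):

```python
from collections import Counter

# Toy hand-tagged corpus (tag sequences only) - invented for illustration.
tag_sequences = [['DT', 'NN', 'VB'], ['DT', 'NN', 'NN'], ['DT', 'JJ', 'NN']]

unigram = Counter()
bigram = Counter()
for tags in tag_sequences:
    unigram.update(tags)                # c(t)
    bigram.update(zip(tags, tags[1:]))  # c(t_{i-1} t_i)

def p_trans(prev, cur):
    """Maximum-likelihood estimate P(cur | prev) = c(prev cur) / c(prev)."""
    return bigram[(prev, cur)] / unigram[prev]

print(p_trans('DT', 'NN'))  # c(DT NN) = 2, c(DT) = 3
```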
![Page 182: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/182.jpg)
182
What about P(W|T) ?
• At first it's odd — it is asking the probability of seeing “The white horse” given “Det Adj Noun”!
  – Collect up all the times you see that tag sequence and see how often “The white horse” shows up …
• Assume each word in the sequence depends only on its corresponding tag:
P(W | T) = ∏_{i=1}^{n} P(w_i | t_i)

(emission probabilities)
![Page 183: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/183.jpg)
183
Emission Probabilities
• What proportion of the time is the word w_i associated with the tag t_i (as opposed to another word):
P(w_i | t_i) = c(w_i, t_i) / c(t_i)
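The same relative-frequency idea gives the emission probabilities; a sketch with invented word/tag pairs:

```python
from collections import Counter

# Toy word/tag pairs - invented; a real model would count over a tagged corpus.
tagged_words = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VB'),
                ('the', 'DT'), ('mat', 'NN')]

tag_count = Counter(tag for _, tag in tagged_words)  # c(t)
pair_count = Counter(tagged_words)                   # c(w, t)

def p_emit(word, tag):
    """Maximum-likelihood estimate P(word | tag) = c(word, tag) / c(tag)."""
    return pair_count[(word, tag)] / tag_count[tag]

print(p_emit('the', 'DT'))  # c(the, DT) = 2, c(DT) = 2
print(p_emit('cat', 'NN'))  # c(cat, NN) = 1, c(NN) = 2
```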
![Page 184: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/184.jpg)
184
The “Standard” Model
argmax_T P(T | W) = argmax_T [ P(W | T) P(T) / P(W) ]
                  = argmax_T P(W | T) P(T)
                  = argmax_T ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
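Putting the two factors together, the “standard” model scores a candidate tag (state) sequence as a single product. A sketch, reusing the cow/duck transition and emission probabilities from the implementation slides later in this deck:

```python
# Transition and emission probabilities from the cow/duck example.
p = {'start': {'cow': 1.0},
     'cow':   {'cow': 0.5, 'duck': 0.3, 'end': 0.2},
     'duck':  {'duck': 0.5, 'cow': 0.3, 'end': 0.2}}
a = {'cow':  {'moo': 0.9, 'hello': 0.1, 'quack': 0.0, '$': 0.0},
     'duck': {'moo': 0.0, 'hello': 0.4, 'quack': 0.6, '$': 0.0},
     'end':  {'moo': 0.0, 'hello': 0.0, 'quack': 0.0, '$': 1.0}}

def score(words, states):
    """prod_i P(w_i | t_i) * P(t_i | t_{i-1}), starting from 'start'."""
    prob = 1.0
    prev = 'start'
    for w, s in zip(words, states):
        prob *= p[prev][s] * a[s][w]
        prev = s
    return prob

# One candidate labeling of "moo hello quack":
print(score(['moo', 'hello', 'quack', '$'], ['cow', 'duck', 'duck', 'end']))
```

Decoding (Viterbi, below) is just a way to find the sequence maximizing this score without enumerating all candidates.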
![Page 185: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/185.jpg)
185
Hidden Markov Models
• Stochastic process: a sequence X_1, X_2, … of random variables based on the same sample space Ω.
• Probabilities for the first observation:
• Next step given previous history:
P(X_1 = x_j) for each outcome x_j

P(X_{t+1} = x_{i_{t+1}} | X_1 = x_{i_1}, …, X_t = x_{i_t})
![Page 186: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/186.jpg)
186
• A Markov Chain is a stochastic process with the Markov property:
• Outcomes are called states.
• Probabilities for the next step – a weighted finite-state automaton.
Markov Chain
P(X_{t+1} = x_{i_{t+1}} | X_1 = x_{i_1}, …, X_t = x_{i_t}) = P(X_{t+1} = x_{i_{t+1}} | X_t = x_{i_t})
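A Markov chain can be run forward by repeatedly drawing the next state from the current state's outgoing distribution. A sketch using the cow/duck transitions from the slides (the helper `sample_chain` is illustrative, not part of the deck's code):

```python
import random

# Transition probabilities of the cow/duck chain from the slides.
p = {'start': {'cow': 1.0},
     'cow':   {'cow': 0.5, 'duck': 0.3, 'end': 0.2},
     'duck':  {'duck': 0.5, 'cow': 0.3, 'end': 0.2}}

def sample_chain(rng):
    """Sample one state sequence from 'start' until 'end' is reached."""
    state, path = 'start', []
    while state != 'end':
        nxt = rng.choices(list(p[state]), weights=list(p[state].values()))[0]
        path.append(nxt)
        state = nxt
    return path

print(sample_chain(random.Random(0)))
```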
![Page 187: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/187.jpg)
187
State Transitions w/ Probabilities
(Figure: a weighted automaton over states START, COW, DUCK, END with transitions: START→COW 1.0; COW→COW 0.5, COW→DUCK 0.3, COW→END 0.2; DUCK→DUCK 0.5, DUCK→COW 0.3, DUCK→END 0.2.)
![Page 188: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/188.jpg)
188
Markov Model
Markov chain where each state can output signals (like “Moore machines”):

(Figure: the same automaton, now with emissions – COW: moo 0.9, hello 0.1; DUCK: hello 0.4, quack 0.6; START emits ^ 1.0, END emits $ 1.0.)
![Page 189: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/189.jpg)
189
The Issue Was
• A given output symbol can potentially be emitted by more than one state — omnipresent ambiguity in natural language.
![Page 190: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/190.jpg)
190
Markov Model

Finite set of states:        S = {s_1, …, s_m}
Signal alphabet:             Σ = {σ_1, …, σ_k}
Transition matrix:           P = [p_ij] where p_ij = P(X_{t+1} = s_j | X_t = s_i)
Emission probabilities:      A = [a_ij] where a_ij = P(σ_j | X_t = s_i)
Initial probability vector:  v = [v_1, …, v_m] where v_j = P(X_1 = s_j)
![Page 191: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/191.jpg)
191
Graphical Model
(Figure: a chain of STATE (tag) nodes, each emitting an OUTPUT (word) node.)
![Page 192: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/192.jpg)
192
• A Markov Model for which it is not possible to observe the sequence of states.
• S: unknown — sequence of states (tags), S ∈ S*
• O: known — sequence of observations (words), O ∈ Σ*

Ŝ = argmax_S P(S | O)
![Page 193: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/193.jpg)
193
The State Space
(Figure: the cow/duck model unrolled over the observation sequence "moo hello quack" — at each position the decoder can be in COW or DUCK, with the transition and emission probabilities of the model above.)

More on how the probabilities come about (training) later.
![Page 194: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/194.jpg)
194
Optimal State Sequence:The Viterbi Algorithm
We define δ_t(i), the joint probability of the most likely state sequence from time 1 to time t ending in state s_i, together with the observed sequence O_{≤t} up to time t:

δ_t(i) = max_{S_{t-1}} P(S_t = s_i; O_{≤t})
       = max_{s_{i_1}, …, s_{i_{t-1}}} P(s_{i_1}, …, s_{i_{t-1}}, S_t = s_i; O_{≤t})
![Page 195: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/195.jpg)
195
Key Observation
The most likely partial derivation leading to state s_i at position t consists of:
– the most likely partial derivation leading to some state s_{i_{t-1}} at the previous position t−1,
– followed by the transition from s_{i_{t-1}} to s_i (and the emission at position t).
![Page 196: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/196.jpg)
196
Viterbi (cont)

Note: δ_1(i) = v_i a_{i k_1}, where v_i = P(X_1 = s_i) and a_{i k_1} = P(σ_{k_1} | s_i).

We will show that:

δ_{t+1}(j) = [ max_i δ_t(i) p_{ij} ] · a_{j k_{t+1}}
![Page 197: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/197.jpg)
197
Recurrence Equation

δ_{t+1}(j) = max_{S_t} P(S_{t+1} = s_j; O_{≤t+1})
           = max_i max_{S_{t-1}} P(S_t = s_i, S_{t+1} = s_j; O_{≤t}, σ_{k_{t+1}})
           = max_i max_{S_{t-1}} P(S_{t+1} = s_j; σ_{k_{t+1}} | S_t = s_i; O_{≤t}) · P(S_t = s_i; O_{≤t})
           = max_i [ P(s_j | s_i) P(σ_{k_{t+1}} | s_j) · max_{S_{t-1}} P(S_t = s_i; O_{≤t}) ]
           = [ max_i δ_t(i) p_{ij} ] · a_{j k_{t+1}}
![Page 198: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/198.jpg)
198
Back Pointers

• The predecessor of state s_j on the path corresponding to δ_{t+1}(j):

ψ_{t+1}(j) = argmax_{1 ≤ i ≤ m} [ δ_t(i) p_{ij} ]

• Optimal state sequence:

s*_n = s_k where k = argmax_{1 ≤ i ≤ m} δ_n(i)
s*_t = ψ_{t+1}(s*_{t+1})   for t = n−1, …, 1
![Page 199: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/199.jpg)
199
The Trellis

Viterbi scores δ for the observation sequence ^ moo hello quack $:

        t=0 (^)  t=1 (moo)  t=2 (hello)  t=3 (quack)  t=4 ($)
START   1        0          0            0            0
COW     0        0.9        0.045        0            0
DUCK    0        0          0.108        0.0324       0
END     0        0          0            0            0.00648

(At t=3 the COW→DUCK path scores only 0.0081, so the DUCK→DUCK path wins.)
![Page 200: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/200.jpg)
200
Implementation (Python)

observations = ['^', 'moo', 'hello', 'quack', '$']  # signal sequence
states = ['start', 'cow', 'duck', 'end']

# Transition probabilities - p[FromState][ToState] = probability
p = {'start': {'cow': 1.0},
     'cow':   {'cow': 0.5, 'duck': 0.3, 'end': 0.2},
     'duck':  {'duck': 0.5, 'cow': 0.3, 'end': 0.2}}

# Emission probabilities; special emission symbol '$' for 'end' state
a = {'cow':  {'moo': 0.9, 'hello': 0.1, 'quack': 0.0, '$': 0.0},
     'duck': {'moo': 0.0, 'hello': 0.4, 'quack': 0.6, '$': 0.0},
     'end':  {'moo': 0.0, 'hello': 0.0, 'quack': 0.0, '$': 1.0}}
![Page 201: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/201.jpg)
201
Implementation (Viterbi)

T = len(observations)
# Initializing viterbi table row by row; v[state][time] = prob
v = {}
for s in states:
    v[s] = [0.0] * T

# Initializing back pointers
backPointer = {}
for s in states:
    backPointer[s] = [""] * T

v['start'][0] = 1.0
for t in range(T - 1):            # t = 0..T-2; populate column t+1 in v
    for s in states[:-1]:         # 'end' state is never a source state
        # only consider the 'start' state at time 0
        if t == 0 and s != 'start':
            continue
        for s1 in p[s].keys():    # s1 is the next state
            newScore = v[s][t] * p[s][s1] * a[s1][observations[t + 1]]
            if v[s1][t + 1] == 0.0 or newScore > v[s1][t + 1]:
                v[s1][t + 1] = newScore
                backPointer[s1][t + 1] = s
![Page 202: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/202.jpg)
202
Implementation (Best Path)
# Now recover the optimal state sequence by following back pointers
state = 'end'
state_sequence = ['end']
for t in range(T - 1, 0, -1):
    state = backPointer[state][t]
    state_sequence = [state] + state_sequence

print("Observations....: ", observations)
print("Optimal sequence: ", state_sequence)
![Page 203: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/203.jpg)
203
Complexity of Decoding
• O(m²n) — linear in n (the length of the string)
• Initialization: O(mn)
• Back tracing: O(n)
• Next step: O(m²)

for current_state in s1..sm:   # at time t+1
    for prev_state in s1..sm:  # at time t
        compute value
        compare with best_so_far

• There are n next steps.
![Page 204: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/204.jpg)
204
Parameter Estimation for HMMs
• Need annotated training data (Brown corpus, Penn Treebank).
• Signal and state sequences are both known.
• Calculate observed relative frequencies.
• Complications — sparse data problem (need for smoothing).
• One can also train from raw data alone — the Baum-Welch (forward-backward) algorithm.
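One simple remedy for the zero counts mentioned above is add-one (Laplace) smoothing of the transition estimates; a minimal sketch (the counts and tagset below are invented for illustration):

```python
# Toy counts - invented for illustration, not from a real corpus.
bigram_count = {('DT', 'NN'): 2, ('NN', 'VB'): 1}
unigram_count = {'DT': 3, 'NN': 4, 'VB': 1}
tagset = ['DT', 'NN', 'VB']

def p_trans_smoothed(prev, cur):
    """Add-one smoothing: (c(prev cur) + 1) / (c(prev) + |tagset|).
    Every transition gets a small non-zero probability."""
    return (bigram_count.get((prev, cur), 0) + 1) / (unigram_count[prev] + len(tagset))

print(p_trans_smoothed('DT', 'NN'))  # (2 + 1) / (3 + 3)
print(p_trans_smoothed('VB', 'DT'))  # unseen bigram: (0 + 1) / (1 + 3)
```

In practice taggers such as TnT use more refined smoothing (e.g., linear interpolation of unigram/bigram/trigram estimates), but the principle is the same: no transition should receive probability zero.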
![Page 205: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/205.jpg)
205
Optimization
• Build a vocabulary of possible tags for words.
• Keep total counts for words.
• If a word occurs frequently (count > threshold), consider its tag set exhaustive.
• For frequent words, only consider their tag set (vs. all tags).
• For unknown words, don’t consider tags corresponding to closed-class words (e.g., DT).
![Page 206: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/206.jpg)
206
Applications Using HMMs
• POS tagging (as we have seen).• Chunking.• Named Entity Recognition (NER).• Speech recognition.
![Page 207: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/207.jpg)
207
Exercises
• Implement the training (parameter estimation).
• Use a dictionary of valid tags for known words to constrain which tags are considered for a word.
• Implement a second-order model.
• Implement the decoder in Ruby.
![Page 208: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/208.jpg)
208
Some POS Taggers
• Alias-I: http://www.alias-i.com/lingpipe
• AUTASYS: http://www.phon.ucl.ac.uk/home/alex/project/tagging/tagging.htm
• Brill Tagger: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
• CLAWS: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
• Connexor: http://www.connexor.com/software/tagger
• Edinburgh (LTG): http://www.ltg.ed.ac.uk/software/pos/index.html
• FLAT (Flexible Language Acquisition Tool): http://lanaconsult.com
• fnTBL: http://nlp.cs.jhu.edu/~rflorian/fntbl/index.html
• GATE: http://gate.ac.uk
• Infogistics: http://www.infogistics.com/posdemo.htm
• Qtag: http://www.english.bham.ac.uk/staff/omason/software/qtag.html
• SNoW: http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=POS
• Stanford: http://nlp.stanford.edu/software/tagger.shtml
• SVMTool: http://www.lsi.upc.edu/~nlp/SVMTool
• TNT: http://www.coli.uni-saarland.de/~thorsten/tnt
• Yamcha: http://chasen.org/~taku/software/yamcha/
![Page 209: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/209.jpg)
209
References
1. Brants, Thorsten. 2000. TnT – A Statistical Part-of-speech Tagger. 6th Applied NLP Conference (ANLP-2000), 224-231, Seattle, U.S.A.
2. Charniak, Eugene, Curtis Hendrickson, Neil Jacobson & Mike Perkowitz. 1993. Equations for part-of-speech tagging. 11th National Conference on Artificial Intelligence, 784-789. Menlo Park: AAAI Press/MIT.
3. Krenn, Brigitte & Christer Samuelsson. 1997. Statistical Methods in Computational Linguistics, ESSLLI Summer School Lecture Notes, 11-22 August, Aix-en-Provence, France.
4. Rabiner, Lawrence R. 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, 256-286.
5. Samuelsson, Christer. 2000. Extending N-gram tagging to word graphs, Recent Advances in Natural Language Processing II, ed. by Nicolas Nicolov & Ruslan Mitkov. Current Issues in Linguistic Theory (CILT), vol. 189, pp 3-20. John Benjamins: Amsterdam/Philadelphia.
6. Shin, Jung Ho, Young S. Han & Key-Sun Choi. 1997. A HMM part-of-speech tagger for Korean with word-phrasal relations. Recent Advances in Natural Language Processing, ed. by Nicolas Nicolov & Ruslan Mitkov. Current Issues in Linguistic Theory (CILT), vol. 136, pp 439-450. John Benjamins: Amsterdam/Philadelphia.
![Page 210: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/210.jpg)
210
Statistics Refresher
• Outcome: individual atomic result of a (non-deterministic) experiment.
• Event: a set of results.
• Probability: limit of target outcome over number of experiments (frequentist view) or degree of belief (Bayesian view).
• Normalization condition: probabilities for all outcomes sum to 1.
• Distribution: probabilities associated with each outcome.
• Random variable: mapping of the outcomes to real numbers.
• Joint distribution: conducting several (possibly related) experiments and observing the results; states the probability for a combination of values of several random variables.
• Marginal: finding the distribution of a random variable from a joint distribution.
• Conditional probability (Bayes’ rule): knowing the value of one variable constrains the distribution of another.
• Probability density functions: probability that a continuous variable is in a certain range.
• Probabilistic reasoning: introduce evidence (set certain variables) and compute probabilities of interest (conditioned on this evidence).
![Page 211: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/211.jpg)
211
Definitions
Expectation: μ = E[X] = ∑_{i=1}^{n} x_i · p(x_i) = ∫_{−∞}^{∞} x p(x) dx

Mode: x* = argmax_i p(x_i)

Variance: σ² = Var(X) = E[(X − μ)²] = E[X²] − μ²

Expectation of a function: E[f(X)] = ∑_{i=1}^{n} f(x_i) · p(x_i) = ∫_{−∞}^{∞} f(x) p(x) dx

n-th moment: E[Xⁿ] = ∑_{i=1}^{n} x_iⁿ · p(x_i)   (E[X] is the first moment)

Properties: E[aX + b] = aE[X] + b;  E[X + Y] = E[X] + E[Y];  Var[aX + b] = a²Var[X]
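The discrete definitions above can be checked numerically for a small distribution (the values and probabilities below are arbitrary):

```python
# Arbitrary small discrete distribution for checking the definitions.
xs = [1, 2, 3]
ps = [0.2, 0.5, 0.3]

mu = sum(x * p for x, p in zip(xs, ps))                # E[X]
second_moment = sum(x**2 * p for x, p in zip(xs, ps))  # E[X^2]
var = second_moment - mu**2                            # Var(X) = E[X^2] - mu^2
mode = xs[max(range(len(xs)), key=lambda i: ps[i])]    # argmax_i p(x_i)

print(mu, var, mode)
```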
![Page 212: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/212.jpg)
212
Intuitions about Scale
• Weight in grams if the Earth were to be a black hole.
• Age of the universe in seconds.
• Number of cells in the human body (100 trillion).
• Number of neurons in the human brain.
• Standard Blu-ray disc size, XL 4 layer (128 GB).
• One year in seconds.
• Items in the Library of Congress (largest in the world).
• Length of the Nile in meters (longest river).
![Page 213: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling](https://reader033.vdocuments.mx/reader033/viewer/2022061120/546cc672af7959ec228b4c8b/html5/thumbnails/213.jpg)
213
Acknowledgements
• Bran Boguraev
• Chris Brew
• Jinho Choi
• William Headden
• Jingjing Li
• Jason Kessler
• Mike Mozer
• Shumin Wu
• Tong Zhang
• Amir Padovitz
• Bruno Bozza
• Kent Cedola
• Max Galkin
• Manuel Reyes Gomez
• Matt Hurst
• John Langford
• Priyank Singh