
Machine Learning: Theory and Applications

Arnak S. Dalalyan
American University of Armenia, Yerevan
April 6, 2016

Short CV and contacts (arnak-dalalyan.fr)

1979  Born in Yerevan (tramvi park)
1997  Bachelor's degree in Mathematics, YSU
1999  Master's degree in Probability and Statistics, Paris 6 University
2001  PhD in Statistics, University of Le Mans (France)
2007  Habilitation in Statistics and Machine Learning, Paris 6 University
Now   • Professor of Statistics and Applied Mathematics, ENSAE ParisTech
      • Head of the Master in Data Science
      • Deputy chair of the Center for Data Science Paris-Saclay
      • Programme committee of COLT (Conference on Learning Theory) and NIPS (Neural Information Processing Systems); Associate Editor of EJS, JSPI, SISP and JSS.


What is Machine Learning?

The emphasis of machine learning is on devising automatic methods that perform a given task based on what was experienced in the past.

Here are several examples:

• optical character recognition: categorize images of handwritten characters by the letters represented
• topic spotting: categorize news articles (say) as to whether they are about politics, sports, entertainment, etc.
• medical diagnosis: diagnose a patient as a sufferer or non-sufferer of some disease
• customer segmentation: divide a customer base into groups of individuals that are similar in specific ways relevant to marketing (such as gender, interests, spending habits)
• fraud detection: identify credit card transactions (for instance) which may be fraudulent in nature

Typical tasks of machine learning

Supervised learning:

• Prediction: use historical data to predict future values
• Classification: identify to which of a set of classes a new observation belongs, on the basis of a training set of observations whose class membership is known
• Ranking: construct ranking models for information retrieval systems

Unsupervised learning:

• Outlier detection: detect the few observations that deviate so much from the others as to arouse suspicion that they were generated by a different mechanism
• Clustering: group a set of objects into homogeneous groups based on some similarity measure
• Dimensionality reduction: map the data to a lower-dimensional space in which uninformative variance is discarded


Clustering Methods in Unsupervised Learning


Outline

Introduction to clustering

K-means and partitioning around medoids
• Theoretical background
• R code
• An example

Gaussian mixtures and the EM algorithm
• Theoretical background
• R code
• An example

Hierarchical Clustering
• Theoretical background
• R code
• An example

Introduction to clustering
Main objective

The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.

[Figure: scatter plot of three well-separated clusters, Component 1 vs. Component 2; "These two components explain 100 % of the point variability."]

Introduction to clustering
The general diagram

training examples → clustering algorithm → clustering rule   (done by the Data Scientist)
new example → clustering rule → identifier of the cluster to which the example is assigned   (End User)

In general, there is no objective measure of quality for a clustering algorithm. The satisfaction of the end user is perhaps the best manner of evaluation.

K-means clustering
Notation

• We are given n examples, each represented as a p-vector with real-valued entries:
$$X_1,\dots,X_n \in \mathbb{R}^p, \qquad X_i = (x_{i1},\dots,x_{ip})^\top.$$
• We denote by $\|X_i - X_j\|$ the (Euclidean) distance between two examples:
$$\|X_i - X_j\|^2 = \sum_{\ell=1}^{p} (x_{i\ell} - x_{j\ell})^2.$$
• For a subset $G$ of $\{1,\dots,n\}$, we denote by $\bar{X}_G$ the average of the examples $X_i$ with $i \in G$, that is,
$$\bar{X}_G = \frac{1}{|G|} \sum_{i \in G} X_i.$$

K-means clustering
Definition of the method

• Main idea: a good clustering algorithm divides the sample into K groups such that the variance within each group is small.
• Mathematically speaking, this corresponds to solving, with respect to $G_1,\dots,G_K \subset \{1,\dots,n\}$ and $C_1,\dots,C_K \in \mathbb{R}^p$, the following optimisation problem:
$$\min_{G_1,\dots,G_K}\ \min_{C_1,\dots,C_K}\ \underbrace{\sum_{k=1}^{K} \sum_{i \in G_k} \|X_i - C_k\|^2}_{\Psi(G_{1:K},\,C_{1:K})},$$
where $(G_1,\dots,G_K)$ runs over all possible partitions of $\{1,\dots,n\}$.
• Important remark: the minimization should be done for a fixed value of K (the prescribed number of clusters); otherwise the solution is obvious (and meaningless), isn't it?
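For intuition, the cost of a given partition is easy to compute directly. Below is a minimal R sketch (Psi is a hypothetical helper, not from the slides; for a kmeans() fit the same quantity is returned as tot.withinss):

# Within-cluster sum of squares Psi(G, C) for a data matrix X,
# cluster assignments G (values in 1..K) and a K x p matrix of centers C
Psi <- function(X, G, C) {
  X <- as.matrix(X)
  sum(rowSums((X - C[G, , drop = FALSE])^2))  # row i is compared to its center C[G[i], ]
}

# e.g. Psi(iris[, 1:4], clus$cluster, clus$centers) matches clus$tot.withinss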

K-means clustering
How to minimize the cost?

• Cost function: $\Psi(G_{1:K}, C_{1:K}) = \sum_{k=1}^{K} \sum_{i \in G_k} \|X_i - C_k\|^2$ is hard to minimize. It is a combinatorial optimization problem which is known to be NP-hard.
• Local minimum: finding a local minimum is easy.
  Step 1  For fixed centers $C_{1:K}$, the minimizer of $\Psi(G_{1:K}, C_{1:K})$ w.r.t. $G_{1:K}$ is easily computed: $G_k$ contains all the points $X_i$ for which the closest point in the set $\{C_1,\dots,C_K\}$ is $C_k$.
  Step 2  For fixed groups $G_{1:K}$, the minimizer of $\Psi(G_{1:K}, C_{1:K})$ w.r.t. $C_{1:K}$ is easily computed: $C_k = \bar{X}_{G_k}$.
• Iterative algorithm: initialize at some $C_{1:K}$ and then alternate between the two aforementioned steps until convergence. A sketch of this alternation in R follows.
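To make the two alternating steps concrete, here is a minimal from-scratch sketch in R (my_kmeans is a hypothetical illustration under simplifying assumptions: empty clusters are not handled, and the base R function kmeans() is what one would use in practice):

# Lloyd-style alternating minimization for k-means (illustrative sketch)
my_kmeans <- function(X, K, max.iter = 100) {
  X <- as.matrix(X)
  C <- X[sample(nrow(X), K), , drop = FALSE]        # initialize C_{1:K} at random data points
  for (iter in 1:max.iter) {
    # Step 1: assign each point to its closest center
    D <- as.matrix(dist(rbind(C, X)))[-(1:K), 1:K]  # n x K matrix of point-to-center distances
    G <- apply(D, 1, which.min)
    # Step 2: move each center to the mean of its group
    C.new <- t(sapply(1:K, function(k) colMeans(X[G == k, , drop = FALSE])))
    if (all(abs(C.new - C) < 1e-8)) break           # stop when centers no longer move
    C <- C.new
  }
  list(cluster = G, centers = C)
}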

K-means clustering
Alternating minimization

An illustrative example:

[Figures: the ideal clustering of a two-group sample, followed by snapshots of the algorithm with centers C1 and C2 at initialization and at iterations 1 (a), 1 (b), 2 (a), 2 (b) and 3 (a).]

In general, a dozen iterations are sufficient for the algorithm to converge to a local minimum.

K-means clustering
Alternating minimization

Not all initializations are good:

[Figures: the same sample with an unfortunate initialization of the centers C1 and C2; the iterations 1 (a), 1 (b) and 2 (a) converge to a clustering that does not match the ideal one.]

Most implementations (and this is the case for the R function kmeans) compute the clusterings for several initializations and select the one having the minimal cost.

K-means clustering
The iris dataset

The Iris flower data set was introduced by Ronald Fisher in his 1936 paper. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.

K-means clustering
R code

Here is a toy example of clustering with R.

> Data <- iris[,1:4]
> clus <- kmeans(Data, 3)
> clus$size
[1] 38 50 62
> title <- "Iris data set"
> clusplot(Data, clus$cluster,
+          color=TRUE, shade=TRUE,
+          labels=4, lines=0,
+          main=title)

[Figure: clusplot "Iris data set", Component 1 vs. Component 2; "These two components explain 95.81 % of the point variability."]

K-means clustering
R code

Running the same code a second time gives a different partition, because kmeans starts from a random initialization.

> Data <- iris[,1:4]
> clus <- kmeans(Data, 3)
> clus$size
[1] 33 96 21
> title <- "Iris data set"
> clusplot(Data, clus$cluster,
+          color=TRUE, shade=TRUE,
+          labels=4, lines=0,
+          main=title)

[Figure: clusplot of the same data, now with clusters of sizes 33, 96 and 21; "These two components explain 95.81 % of the point variability."]

K-means clustering
R code

Pay attention to:
• use several initializations to reduce the randomness of the result
• normalize the columns of the data matrix

> Data <- scale(iris[,1:4])
> clus <- kmeans(Data, 3,
+          nstart=100)
> clus$size
[1] 53 47 50
> title <- "Iris data set"
> clusplot(Data, clus$cluster,
+          color=TRUE, shade=TRUE,
+          labels=4, lines=0,
+          main=title)

[Figure: clusplot of the scaled iris data; "These two components explain 95.81 % of the point variability."]

K-means clustering
Remarks

• Breaking ties. In the step of assigning data points to clusters, it may happen that two centers are at the same distance from a data point. Such a tie is usually broken at random, using a coin toss.
• K-medians. If the data is likely to contain outliers, it might be better to replace the squared Euclidean distance by the "manhattan" distance
$$\|X_i - X_j\|_1 = \sum_{\ell=1}^{p} |x_{i\ell} - x_{j\ell}|$$
and to minimize the cost function
$$\sum_{k=1}^{K} \sum_{i \in G_k} \|X_i - C_k\|_1.$$
This method is referred to as k-medians, since $C_k$ is necessarily the (coordinate-wise) median of the cluster $\{X_i : i \in G_k\}$.
• PAM. More generally, one can consider any measure of dissimilarity between data points. The resulting algorithm is called partitioning around medoids; an R sketch follows.
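As a sketch, partitioning around medoids with the Manhattan dissimilarity is available as pam in the R package cluster (the call below simply mirrors the kmeans examples; pam also accepts a precomputed dissimilarity matrix):

library(cluster)                       # provides pam (and clusplot)
Data <- scale(iris[,1:4])
clus <- pam(Data, 3, metric = "manhattan")
clus$medoids                           # medoids are actual data points
table(clus$clustering)                 # cluster sizes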

K-means clustering
Clustering CAC40 companies

Let us consider the problem of clustering the CAC40 companies according to their stock prices during the past 2 months.

• Each company is described by its daily stock returns (44 real values), the intraday variation (45 positive values) and the daily volume of exchanges (45 positive values). Data is downloaded from http://www.abcbourse.com/.
• We first determine the number of clusters using the function kmeansruns (from the R package fpc):

> data <- scale(CAC40, center =FALSE)
> clus <- kmeansruns(data, 3:10)
> k <- clus$bestk
> print(k)
[1] 3
> clus <- kmeans(data, k, nstart=100)
> clusplot(data, clus$cluster,
+          color=TRUE, shade=TRUE,
+          labels=4, lines=0,
+          main=title)

[Figure: clusplot "CAC40, 25 August − 24 Sept. 2014"; "These two components explain 50.5 % of the point variability."]

Clustering CAC40 companies
The result of k-means

Cluster 1 (size 23): Total, L'oreal, Lvmh, Schneider El, Veolia Env, GDF Suez, Alstom, EDF, Pernod Ric, Danone, ...
Cluster 2 (size 7): Credit Agr, Alcatel-Lucent, Societe Generale, Bnp Paribas, Renault, Orange, Arcelor Mittal
Cluster 3 (size 10): Accor, Bouygues, Lafarge, Michelin, Saint Gobain, Vinci, Valeo, Publicis Groupe, Technip, Gemalto


Expectation maximization for Gaussian mixtures

Gaussian Mixture (GM) Model based clustering
Theoretical background

The data points $X_1,\dots,X_n$ are assumed to be generated at random according to the following mechanism:

• n "labels" $Z_1,\dots,Z_n$ are independently generated according to a distribution $\pi$ on the set $\{1,\dots,K\}$ (if $Z_i = k$ then $X_i$ belongs to the kth cluster);
• for each $i = 1,\dots,n$, if $Z_i = k$, then the data point $X_i$ is drawn from a multivariate Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$.

Only $X_1,\dots,X_n$ are observed. The goal is to find labels $\hat{Z}_1,\dots,\hat{Z}_n$ such that the probability of the event $\{\hat{Z}_i = Z_i,\ \forall i\}$ is as high as possible.

Gaussian Mixture (GM) Model based clustering

From a mathematical point of view, $X_1,\dots,X_n$ are independent and have the density
$$p(x) = \sum_{k=1}^{K} \pi_k\, \varphi(x\,|\,\mu_k, \Sigma_k),$$
with $\varphi(x\,|\,\mu,\Sigma)$ being the Gaussian density
$$\varphi(x\,|\,\mu,\Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2} \big\|\Sigma^{-1/2}(x-\mu)\big\|^2\Big).$$

[Figures: a Gaussian density in 2D; the density of a GM in 2D (K = 2).]

Gaussian Mixture (GM) Model based clustering
From clustering to estimation

Since $X_1,\dots,X_n$ are independent and have the density
$$p(x) = \sum_{k=1}^{K} \pi_k\, \varphi(x\,|\,\mu_k, \Sigma_k),$$
if the parameters $\pi, (\mu_k)_k, (\Sigma_k)_k$ of the model were known, we would determine the cluster assigned to $x$ by maximizing w.r.t. $k$
$$\mathbb{P}(Z = k\,|\,X = x) = \frac{\pi_k\, \varphi(x\,|\,\mu_k, \Sigma_k)}{p(x)}.$$
Therefore, it is natural to first estimate $\pi, (\mu_k)_k, (\Sigma_k)_k$ by $\hat{\pi}, (\hat{\mu}_k)_k, (\hat{\Sigma}_k)_k$ and then to set
$$\hat{Z} = \arg\max_{k=1,\dots,K}\ \hat{\pi}_k\, \varphi(x\,|\,\hat{\mu}_k, \hat{\Sigma}_k)$$
(a small R sketch of this assignment rule is given below).

Problem: how to estimate these parameters?
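The assignment rule itself is short once the parameters are estimated. Here is a hedged R sketch (map_cluster is a hypothetical helper, not from the slides; it assumes the mvtnorm package for the multivariate Gaussian density, with mu.hat and Sigma.hat given as lists of length K):

library(mvtnorm)   # provides dmvnorm
map_cluster <- function(x, pi.hat, mu.hat, Sigma.hat) {
  K <- length(pi.hat)
  w <- sapply(1:K, function(k)
    pi.hat[k] * dmvnorm(x, mean = mu.hat[[k]], sigma = Sigma.hat[[k]]))
  list(posterior = w / sum(w),    # estimated P(Z = k | X = x), k = 1..K
       cluster   = which.max(w))  # arg max_k pi.hat_k * phi(x | mu.hat_k, Sigma.hat_k)
}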

Gaussian Mixture (GM) Model based clustering
EM algorithm: main idea

Estimation problem. Estimate the quantities $\pi, (\mu_k)_k, (\Sigma_k)_k$ based on the observation of n independent random variables with density $p(x) = \sum_{k=1}^{K} \pi_k\, \varphi(x\,|\,\mu_k, \Sigma_k)$.

Main difficulty: the maximum likelihood estimator
$$\big(\hat{\pi}, (\hat{\mu}_k)_k, (\hat{\Sigma}_k)_k\big)^{\mathrm{ML}} = \arg\max_{\pi,(\mu_k)_k,(\Sigma_k)_k}\ \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k\, \varphi(X_i\,|\,\mu_k, \Sigma_k)\Big)$$
is, in practice, impossible to compute.

Work-around: use the identity
$$\log\Big(\sum_{k=1}^{K} \pi_k a_k\Big) = \max_{\nu \in \mathbb{R}_+^K,\ \sum_k \nu_k = 1}\ \sum_{k=1}^{K} \big(\nu_k \log a_k - \nu_k \log(\nu_k/\pi_k)\big).$$
A quick numeric check of this identity is sketched below.
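The maximum on the right-hand side is attained at $\nu_k \propto \pi_k a_k$, which can be verified numerically. A minimal R sketch with arbitrary made-up values of pik and ak:

# Numeric check of the log-sum identity
pik <- c(0.2, 0.5, 0.3)                  # mixture weights (sum to 1)
ak  <- c(1.5, 0.4, 2.0)                  # arbitrary positive numbers
nu  <- pik * ak / sum(pik * ak)          # the maximizing nu
lhs <- log(sum(pik * ak))
rhs <- sum(nu * log(ak) - nu * log(nu / pik))
c(lhs = lhs, rhs = rhs)                  # the two values coincide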

Gaussian Mixture (GM) Model based clustering
EM algorithm: main idea

Let us denote $\Omega = \big(\pi, (\mu_k), (\Sigma_k)\big)$ and $\hat{\Omega}^{\mathrm{ML}} = \big(\hat{\pi}, (\hat{\mu}_k), (\hat{\Sigma}_k)\big)^{\mathrm{ML}}$.

Using the previous formula, we can write $\hat{\Omega}^{\mathrm{ML}}$ as follows:
$$\hat{\Omega}^{\mathrm{ML}} = \arg\max_{\Omega}\ \max_{\nu}\ \underbrace{\sum_{i=1}^{n} \sum_{k=1}^{K} \big(\nu_{ik} \log(\pi_k\, \varphi(X_i\,|\,\mu_k, \Sigma_k)) - \nu_{ik} \log \nu_{ik}\big)}_{G(\Omega,\,\nu)}. \qquad (1)$$

For a fixed $\Omega$, the maximum w.r.t. $\nu$ in (1) is explicitly computable.
For a fixed $\nu$, the maximum w.r.t. $\Omega$ in (1) is explicitly computable as well.

EM-algorithm
Initialize $\Omega$. Then, iteratively (until convergence):
- update $\nu$ by solving (1) with fixed $\Omega$;
- update $\Omega$ by solving (1) with fixed $\nu$.
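To ground the alternation, here is a minimal from-scratch EM sketch for a one-dimensional Gaussian mixture (em_gm1d is a hypothetical illustration under simplifying assumptions: a fixed number of iterations, no safeguards against degenerate components; Mclust, used below, is the production implementation):

# EM for a 1-D Gaussian mixture with K components (illustrative sketch)
em_gm1d <- function(x, K, n.iter = 100) {
  n    <- length(x)
  pi.k <- rep(1/K, K)               # initial weights
  mu   <- sample(x, K)              # initial means: random data points
  s2   <- rep(var(x), K)            # initial variances
  for (it in 1:n.iter) {
    # E step: responsibilities nu[i, k], the estimated P(Z_i = k | X_i)
    nu <- sapply(1:K, function(k) pi.k[k] * dnorm(x, mu[k], sqrt(s2[k])))
    nu <- nu / rowSums(nu)
    # M step: weighted updates of the parameters Omega
    Nk   <- colSums(nu)
    pi.k <- Nk / n
    mu   <- colSums(nu * x) / Nk
    s2   <- colSums(nu * outer(x, mu, "-")^2) / Nk
  }
  list(pi = pi.k, mu = mu, sigma2 = s2, nu = nu)
}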


Gaussian Mixture (GM) Model based clustering
EM algorithm: summary and remarks

The goal of the EM-algorithm is to approximate the maximum likelihood estimator.

The EM-algorithm solves a nonconvex optimization problem. Therefore, there is no guarantee that it finds the global optimum. Generally, it does not!

To apply the EM algorithm, the sample size n should be significantly larger than $p \times K + p^2$.

One can adapt the EM algorithm to other (non-Gaussian) distributions.

It is possible in the EM algorithm to assume that all the $\Sigma_k$'s are equal, or that they are all diagonal; a sketch of how to request this in R follows.

The auxiliary values $\nu_{ik}$ computed by the EM-algorithm estimate the probability that $X_i$ belongs to cluster k.
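In the mclust package such covariance constraints are selected through the modelNames argument (a hedged sketch; "EEE" forces a common covariance matrix across clusters, while "VVI" lets each cluster have its own diagonal one):

library(mclust)
fit.equal <- Mclust(iris[,1:4], G = 3, modelNames = "EEE")  # shared Sigma
fit.diag  <- Mclust(iris[,1:4], G = 3, modelNames = "VVI")  # diagonal Sigma_k
summary(fit.equal)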

Gaussian Mixture (GM) Model based clustering
EM algorithm: R code

The EM-algorithm is implemented in the R package mclust.

> data <- (iris[,1:4])
> irisC <- Mclust(data, G=3)
> clusters <- irisC$classification
> table(clusters)
clusters
 1  2  3
50 45 55
> clusplot(data, clusters,
+          color=TRUE, shade=TRUE,
+          labels=4, lines=0,
+          main="Iris data: EM")

[Figure: clusplot "Iris data: EM"; "These two components explain 95.81 % of the point variability."]

Gaussian Mixture (GM) Model based clustering
R code for the CAC 40 dataset

> data <- scale(CAC40, center =FALSE)
> CAC.cl <- Mclust(data)
> clusters <- CAC.cl$classification
> table(clusters)
clusters
 1  2  3  4  5
12 10  8  8  2
> clusplot(data, clusters,
+          color=TRUE, shade=TRUE,
+          labels=4, lines=0,
+          main="CAC40: EM")

[Figure: clusplot "CAC40: EM"; "These two components explain 51.85 % of the point variability."]

Clustering CAC40 companies
The result of EM

Cluster 1 (size 12): Safran, Air Liq, L'oreal, Accor, Bouygues, Pernod Ric, Lvmh, Essilor Intl, Cap Gemini, Alstom, EDF, Legrand SA
Cluster 2 (size 10): Credit Agr, Total, Axa, Vivendi, Alcatel-Luc., Soc. Gen., Bnp Paribas, Orange, GDF Suez, Arcelor Mit.
Cluster 3 (size 8): Solvay, Michelin, Kering, Unibail-Rod., Valeo, Publicis Gr., Technip, Gemalto
Cluster 4 (size 8): Carrefour, Sanofi, Danone, Schneider El, Veolia Env, Saint Gobain, Vinci, Airbus Group
Cluster 5 (size 2): Lafarge, Renault