
Machine Learning for Data Mining: Fuzzy Clustering

Andres Mendez-Vazquez

July 27, 2015


Outline

1. Fuzzy Clustering
   - History
   - Fuzzy C-Means Clustering
   - Using the Lagrange Multipliers
   - The Final Algorithm!!!
   - Pros and Cons of FCM

2. What can we do? Possibilistic Clustering
   - Introduction
   - Cost Function
   - Explanation


Some of the Fuzzy Clustering Models

- Fuzzy Clustering Model (Bezdek, 1981)
- Possibilistic Clustering Model (Krishnapuram and Keller, 1993)
- Fuzzy-Possibilistic Clustering Model (N. Pal, K. Pal, and Bezdek, 1997)



Fuzzy C-Means Clustering

The input: an unlabeled data set X = \{x_1, x_2, x_3, ..., x_N\}, with x_k \in \mathbb{R}^p.

The output: a partition S of X, represented as a C \times N matrix U, together with a set of cluster centers V = \{v_1, v_2, ..., v_C\} \subset \mathbb{R}^p.


What we want

Creation of the Cost Function

First, we can use a distance defined as:

\|x_k - v_i\| = \sqrt{(x_k - v_i)^T (x_k - v_i)} \quad (1)

the Euclidean distance from a point x_k to a centroid v_i. Note that other distances, such as the Mahalanobis distance, can also be taken into consideration.
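As a quick sketch of Equation (1) in NumPy (the function name is my own):

```python
import numpy as np

def euclidean(x, v):
    """||x - v|| = sqrt((x - v)^T (x - v)), Equation (1)."""
    d = x - v
    return float(np.sqrt(d @ d))
```

For example, `euclidean(np.array([3.0, 0.0]), np.array([0.0, 4.0]))` returns 5.0.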


Do you remember the cost function for K-means?

Finding a partition S that minimizes the following function:

\min_S \sum_{i=1}^{C} \sum_{k:\, x_k \in C_i} \|x_k - v_i\|^2 \quad (2)

where v_i = \frac{1}{N_i} \sum_{x_k \in C_i} x_k.

We can rewrite the previous equation as:

\min_S \sum_{k=1}^{N} \sum_{i=1}^{C} I(x_k \in C_i)\, \|x_k - v_i\|^2 \quad (3)
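A small sketch of the rewritten objective (3), with the indicator I(x_k ∈ C_i) encoded as a hard label per point (the function name and toy data are my own):

```python
import numpy as np

def kmeans_cost(X, V, labels):
    """Equation (3): sum over k, i of I(x_k in C_i) * ||x_k - v_i||^2,
    with the indicator given as one hard cluster label per point."""
    return float(sum(np.sum((x - V[l]) ** 2) for x, l in zip(X, labels)))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
V = np.array([[0.0, 1.0], [10.0, 0.0]])   # two centroids
labels = np.array([0, 0, 1])              # hard memberships
print(kmeans_cost(X, V, labels))          # 1 + 1 + 0 = 2.0
```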


In addition

Did you notice that the membership is always one or zero?

\min_S \sum_{k=1}^{N} \sum_{i=1}^{C} \overbrace{I(x_k \in C_i)}^{\text{Membership}} \|x_k - v_i\|^2 \quad (4)


Thus, we can rethink the membership using something "Fuzzy"

What if we modify the cost function to something like this:

\min_S \sum_{k=1}^{N} \sum_{i=1}^{C} \overbrace{\text{Fuzzy Value}}^{\text{Membership}} \|x_k - v_i\|^2 \quad (5)

This means we treat each cluster C_i as "fuzzy": we assume a fuzzy set for the cluster C_i with membership function

A_i : \mathbb{R}^p \to [0, 1] \quad (6)

which we can tune by raising it to a power m.


Under the following constraints

First:

A_i(x_k) \in [0, 1] \quad \forall i, k \quad (7)

Second:

0 < \sum_{k=1}^{N} A_i(x_k) < N \quad \forall i \quad (8)

Third:

\sum_{i=1}^{C} A_i(x_k) = 1 \quad \forall k \quad (9)


Final Cost Function

J_m(S) = \sum_{k=1}^{N} \sum_{i=1}^{C} [A_i(x_k)]^m \|x_k - v_i\|^2 \quad (10)

Under the constraints:

- A_i(x_k) \in [0, 1], for 1 \le k \le N and 1 \le i \le C.
- \sum_{i=1}^{C} A_i(x_k) = 1, for 1 \le k \le N.
- 0 < \sum_{k=1}^{N} A_i(x_k) < N, for 1 \le i \le C.
- m > 1.
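The cost (10) translates directly into code; a sketch assuming the memberships are stored as a C × N matrix U with U[i, k] = A_i(x_k) (the variable names are my own):

```python
import numpy as np

def fcm_cost(X, V, U, m=2.0):
    """Equation (10): J_m = sum over k, i of U[i, k]^m * ||x_k - v_i||^2."""
    # dist2[i, k] = ||x_k - v_i||^2, computed via broadcasting
    dist2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return float(((U ** m) * dist2).sum())
```

With crisp memberships (each column of U a one-hot vector), J_m reduces to the K-means cost (3).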



Using the Lagrange Multipliers

New cost function:

\bar{J}_m(S) = \sum_{k=1}^{N} \sum_{i=1}^{C} [A_i(x_k)]^m \|x_k - v_i\|^2 - \sum_{k=1}^{N} \lambda_k \left[ \sum_{i=1}^{C} A_i(x_k) - 1 \right] \quad (11)

Differentiate with respect to A_i(x_k):

\frac{\partial \bar{J}_m(S)}{\partial A_i(x_k)} = m [A_i(x_k)]^{m-1} \|x_k - v_i\|^2 - \lambda_k = 0 \quad (12)

Thus:

A_i(x_k) = \left[ \frac{\lambda_k}{m \|x_k - v_i\|^2} \right]^{\frac{1}{m-1}} \quad (13)


Using the Lagrange Multipliers

Sum over all i's and apply constraint (9):

\sum_{i=1}^{C} A_i(x_k) = \frac{\lambda_k^{\frac{1}{m-1}}}{m^{\frac{1}{m-1}}} \sum_{i=1}^{C} \frac{1}{\|x_k - v_i\|^{\frac{2}{m-1}}} = 1 \quad (14)

Thus:

\lambda_k = \frac{m}{\left[ \sum_{i=1}^{C} \frac{1}{\|x_k - v_i\|^{\frac{2}{m-1}}} \right]^{m-1}} \quad (15)

Plug back into Equation (12), using j instead of i in the sum:

\frac{m}{\left[ \sum_{j=1}^{C} \frac{1}{\|x_k - v_j\|^{\frac{2}{m-1}}} \right]^{m-1}} = m [A_i(x_k)]^{m-1} \|x_k - v_i\|^2 \quad (16)


Finally

We have that:

A_i(x_k) = \frac{1}{\sum_{j=1}^{C} \left\{ \frac{\|x_k - v_i\|^2}{\|x_k - v_j\|^2} \right\}^{\frac{1}{m-1}}} \quad (17)

In a similar way, we have:

v_i = \frac{\sum_{k=1}^{N} [A_i(x_k)]^m x_k}{\sum_{k=1}^{N} [A_i(x_k)]^m} \quad (18)
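The two closed-form updates (17) and (18) can be sketched in NumPy (array layout and names are my own; the constraint that the memberships of each point sum to 1 can be checked numerically). The degenerate case where a point coincides with a center is deliberately left out here; it is handled as Case II of the algorithm:

```python
import numpy as np

def update_memberships(X, V, m=2.0):
    """Equation (17). Assumes no point coincides exactly with a center."""
    dist2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)      # (C, N)
    # ratio[i, j, k] = (||x_k - v_i||^2 / ||x_k - v_j||^2)^(1/(m-1))
    ratio = (dist2[:, None, :] / dist2[None, :, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)                                  # (C, N)

def update_centers(X, U, m=2.0):
    """Equation (18): weighted mean of the points, weights U[i, k]^m."""
    W = U ** m
    return (W @ X) / W.sum(axis=1, keepdims=True)
```

A quick sanity check: the columns of the returned membership matrix sum to 1, as constraint (9) demands.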



Final Algorithm: Fuzzy c-means

1. Let t = 0. Select an initial fuzzy pseudo-partition.

2. Calculate the C cluster centers using

   v_i^{(t)} = \frac{\sum_{k=1}^{N} [A_i^{(t)}(x_k)]^m x_k}{\sum_{k=1}^{N} [A_i^{(t)}(x_k)]^m}

3. Update the membership function for each x_k:
   - Case I: if \|x_k - v_i^{(t)}\|^2 > 0 for all i \in \{1, 2, ..., C\}, then

     A_i^{(t+1)}(x_k) = \frac{1}{\sum_{j=1}^{C} \left\{ \frac{\|x_k - v_i^{(t)}\|^2}{\|x_k - v_j^{(t)}\|^2} \right\}^{\frac{1}{m-1}}}

   - Case II: if \|x_k - v_i^{(t)}\|^2 = 0 for some i \in I \subseteq \{1, 2, ..., C\}, then set A_i^{(t+1)}(x_k) to any nonnegative values such that \sum_{i \in I} A_i^{(t+1)}(x_k) = 1, and set A_i^{(t+1)}(x_k) = 0 for i \notin I.

4. If \left| S^{(t+1)} - S^{(t)} \right| = \max_{i,k} \left| A_i^{(t+1)}(x_k) - A_i^{(t)}(x_k) \right| \le \varepsilon, stop; otherwise increase t and go to step 2.
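The four steps above can be put together in a compact NumPy sketch (the initialization, names, and random seed are my own choices, and the zero-distance Case II is sidestepped with a small epsilon rather than handled exactly):

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Alternate the centroid update (step 2) and the membership
    update (step 3) until max |A^(t+1) - A^(t)| <= eps (step 4)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((C, N))
    U /= U.sum(axis=0)                       # initial fuzzy pseudo-partition
    for _ in range(max_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)            # step 2
        dist2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        dist2 = np.maximum(dist2, 1e-12)     # sidestep Case II (zero distance)
        ratio = (dist2[:, None, :] / dist2[None, :, :]) ** (1.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=1)                       # step 3, Case I
        if np.abs(U_new - U).max() <= eps:                    # step 4
            return U_new, V
        U = U_new
    return U, V
```

On two well-separated blobs, the columns of U approach one-hot indicators and the centers land near the blob means.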


Final Output

The matrix U: its elements are U_{ik} = A_i(x_k).

The centroids: V = \{v_1, v_2, ..., v_C\}.


Pros and Cons of Fuzzy C-Means

Advantages:
- Unsupervised
- Always converges

Disadvantages:
- Long computational time
- Sensitivity to the initial guess (speed, local minima)
- Sensitivity to noise: one expects a low (or even zero) membership degree for outliers (noisy points)


Outliers, Disadvantage of FCM

(Figure: membership values after running without outliers; the scale runs from 0 to 1.)

Outliers, Disadvantage of FCM

(Figure: membership values after adding an outlier; the scale runs from 0 to 1.)


Krishnapuram and Keller

Following Zadeh, they considered each class prototype as defining an elastic constraint.

What? t_i(x_k) gives the degree of compatibility of sample x_k with cluster C_i.

We do the following: we treat each C_i as a fuzzy set over the set of samples X = \{x_1, x_2, ..., x_N\}.


Here is the Catch!!!

We should not use the old membership constraint:

\sum_{i=1}^{C} A_i(x_k) = 1 \quad (19)

Because this is quite probabilistic... which is not what we want!!!

Thus, we only ask the membership, now written in the possibilistic notation t_i(x_k) (known as the typicality value), to lie in the interval [0, 1].


New Constraints

First:

t_i(x_k) \in [0, 1] \quad \forall i, k \quad (20)

Second:

0 < \sum_{k=1}^{N} t_i(x_k) < N \quad \forall i \quad (21)

Third:

\max_i t_i(x_k) > 0 \quad \forall k \quad (22)


We have the following cost function

Cost Function:

\sum_{k=1}^{N} \sum_{i=1}^{C} [t_i(x_k)]^m \|x_k - v_i\|^2 \quad (23)

Problem: unconstrained optimization of this first term leads to the trivial solution t_i(x_k) = 0 for all i, k.

Thus, we can introduce the following tendency:

t_i(x_k) \to 1 \quad (24)

Roughly, it means making the typicality values as large as possible.


We can try to control this tendency

By putting it all together in:

\sum_{k=1}^{N} (1 - t_i(x_k))^m \quad (25)

with m controlling the tendency t_i(x_k) \to 1.

We can also apply this tendency over all the clusters, using a suitable weight w_i > 0 per cluster:

\sum_{i=1}^{C} w_i \sum_{k=1}^{N} (1 - t_i(x_k))^m \quad (26)


Possibilistic C-Means Clustering (PCM)

The final Cost Function:

J_m(S) = \sum_{k=1}^{N} \sum_{i=1}^{C} [t_i(x_k)]^m \|x_k - v_i\|^2 + \sum_{i=1}^{C} w_i \sum_{k=1}^{N} (1 - t_i(x_k))^m \quad (27)

where:
- t_i(x_k) are typicality values.
- w_i are cluster weights.


Explanation

First Term:

\sum_{k=1}^{N} \sum_{i=1}^{C} [t_i(x_k)]^m \|x_k - v_i\|^2 \quad (28)

It demands that the distances from the feature vectors to the prototypes be as small as possible!!!

Second Term:

\sum_{i=1}^{C} w_i \sum_{k=1}^{N} (1 - t_i(x_k))^m \quad (29)

It forces the typicality values t_i(x_k) to be as large as possible.


Final Updating Equations

Typicality Values:

t_i(x_k) = \frac{1}{1 + \left( \frac{\|x_k - v_i\|^2}{w_i} \right)^{\frac{1}{m-1}}} \quad \forall i, k \quad (30)

Cluster Centers:

v_i = \frac{\sum_{k=1}^{N} [t_i(x_k)]^m x_k}{\sum_{k=1}^{N} [t_i(x_k)]^m} \quad (31)


Final Updating Equations

Weights:

w_i = M\, \frac{\sum_{k=1}^{N} [t_i(x_k)]^m \|x_k - v_i\|^2}{\sum_{k=1}^{N} [t_i(x_k)]^m} \quad (32)

with M > 0.
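A sketch of one PCM update pass using (30), (31), and (32) (the array names, toy data, and single-pass structure are my own; in practice these updates are iterated until convergence, like FCM):

```python
import numpy as np

def pcm_step(X, V, w, m=2.0):
    """One PCM pass: typicalities via Eq. (30), then centers via Eq. (31)."""
    dist2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)    # (C, N)
    T = 1.0 / (1.0 + (dist2 / w[:, None]) ** (1.0 / (m - 1.0)))   # Eq. (30)
    W = T ** m
    V_new = (W @ X) / W.sum(axis=1, keepdims=True)                # Eq. (31)
    return T, V_new

def update_weights(X, V, T, m=2.0, M=1.0):
    """Equation (32): per-cluster weighted mean squared distance, scaled by M."""
    dist2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    W = T ** m
    return M * (W * dist2).sum(axis=1) / W.sum(axis=1)
```

Note that, unlike FCM, the columns of T need not sum to 1: a point far from every v_i simply receives a small typicality in every cluster, which is exactly how PCM discounts outliers.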


Possibilistic can deal with outliers

(Figure: typicality values after running without outliers; the scale runs from 0 to 1.)

Possibilistic can deal with outliers

(Figure: typicality values after adding an outlier; the scale runs from 0 to 1.)

Pros and Cons of Possibilistic C-Means

Advantages:
- Can cluster noisy data samples.

Disadvantages:
- Very sensitive to good initialization.

In between!!!
- Coincident clusters may result, because the columns and rows of the typicality matrix are independent of each other. This could be advantageous (start with a large value of C and obtain fewer distinct clusters).


Nevertheless

There are more advanced clustering methods based on the possibilistic and fuzzy ideas: Pal, N.R., Pal, K., Keller, J.M., and Bezdek, J.C., "A Possibilistic Fuzzy c-Means Clustering Algorithm," IEEE Transactions on Fuzzy Systems, vol. 13, no. 4, pp. 517-530, Aug. 2005.