Mixture model clustering for mixed data with missing information

Page 1: Mixture model clustering for mixed data with missing information

Intelligent Database Systems Lab

國立雲林科技大學National Yunlin University of Science and Technology

Advisor : Dr. Hsu

Graduate : Yu Cheng Chen

Author: Lynette Hunt, Murray Jorgensen

Mixture model clustering for mixed data with missing information

Computational Statistics & Data Analysis, 2002

Page 2: Mixture model clustering for mixed data with missing information

Outline

Motivation

Objective

Introduction

The Mixture approach to Clustering Data

Application

Discussion

Personal Opinion

Page 3: Mixture model clustering for mixed data with missing information

Motivation

Missing observations are frequently seen in data sets. A specimen may be damaged, so that some measurements cannot be taken.

Expensive tests may only be administered to a random sub-sample of the items.

Page 4: Mixture model clustering for mixed data with missing information

Objective

We need to implement some technique for the case where the data to be clustered are incomplete.

The paper extends the mixture likelihood approach to analyse data with mixed categorical and continuous attributes where some of the data are missing at random.

Page 5: Mixture model clustering for mixed data with missing information

Introduction

Data are described as ‘missing at random’ when the probability that a variable is missing for a particular individual may depend on the values of the observed variables, but not on the value of the missing variable.

In other words, the mechanism that produces the missing data does not depend on the missing values themselves.
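The ‘missing at random’ condition can also be written as a single equation. The missingness indicator R_i (which attributes of individual i are missing) and the mechanism parameter φ are notation introduced here for illustration only; they do not appear on the slides.

```latex
\Pr(R_i \mid x_{obs,i},\, x_{miss,i};\, \phi) \;=\; \Pr(R_i \mid x_{obs,i};\, \phi)
```

Read left to right: once the observed attributes are known, the missing attributes carry no further information about which values are missing.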

Page 6: Mixture model clustering for mixed data with missing information

Introduction

Rubin (1976) showed that the process that causes the missing data can be ignored when making likelihood-based inferences about the parameters of the data if the data are ‘missing at random’.

The EM algorithm of Dempster et al. is a general iterative procedure for maximum likelihood estimation in incomplete data problems.

Little and Schluchter (1985) present maximum likelihood procedures using the EM algorithm for the general location model with missing data.

Page 7: Mixture model clustering for mixed data with missing information

The Mixture approach to Clustering Data

Suppose p attributes are measured on n individuals. Let x1,…,xn be the observed values of a random sample from a mixture of K populations in unknown proportions π1,…,πK.

Let the density of xi in the kth group be fk(xi; θk), where θk is the parameter vector for group k.

Let ψ=(θ’, π’)’, where π=(π1,…,πK)’ and θ=(θ1,…,θK)’. The mixture density is

$$f(x_i; \psi) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)$$
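The mixture density above can be illustrated with a minimal sketch. It assumes univariate normal components purely for illustration; the function names and the numerical values are hypothetical and do not come from the paper.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density, used here as an illustrative component f_k."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(x, proportions, params):
    """f(x; psi) = sum_k pi_k * f_k(x; theta_k)."""
    return sum(pi_k * normal_pdf(x, mean, sd)
               for pi_k, (mean, sd) in zip(proportions, params))

# Hypothetical two-group example: pi = (0.3, 0.7), theta_k = (mean, sd).
proportions = [0.3, 0.7]
params = [(0.0, 1.0), (4.0, 2.0)]
density_at_1 = mixture_density(1.0, proportions, params)
```

Each component is weighted by its mixing proportion, so the proportions are expected to be non-negative and sum to one.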

Page 8: Mixture model clustering for mixed data with missing information

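The Mixture approach to Clustering Data

In the EM algorithm of Dempster et al., the ‘missing’ data are the unobserved indicators of group membership.

Let zi=(zi1,…,ziK) be the vector of indicator variables, where

$$z_{ik} = \begin{cases} 1 & \text{if individual } i \in \text{group } k \\ 0 & \text{if individual } i \notin \text{group } k \end{cases}$$

The estimated posterior probability that individual i belongs to group k is

$$\hat{z}_{ik} = \mathrm{pr}(\text{individual } i \in \text{group } k \mid x_i; \hat{\psi}) = \frac{\hat{\pi}_k f_k(x_i; \hat{\theta}_k)}{\sum_{k'=1}^{K} \hat{\pi}_{k'} f_{k'}(x_i; \hat{\theta}_{k'})}$$

for k=1,…,K; and xi is assigned to group k if ẑik > ẑik’ for all k’ ≠ k.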
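As a small sketch of the assignment rule, the posterior membership probabilities for one individual can be computed from the mixing proportions and the component densities evaluated at xi. The function names and the numbers are hypothetical.

```python
def posterior_memberships(proportions, component_densities):
    """z_hat_k = pi_k f_k(x) / sum_k' pi_k' f_k'(x) for a single observation."""
    weighted = [p * f for p, f in zip(proportions, component_densities)]
    total = sum(weighted)
    return [w / total for w in weighted]

def assign_group(z_hat):
    """0-based index of the group with the largest posterior probability."""
    return max(range(len(z_hat)), key=lambda k: z_hat[k])

# Hypothetical values: equal proportions, f_1(x) = 0.2 and f_2(x) = 0.6.
z_hat = posterior_memberships([0.5, 0.5], [0.2, 0.6])
group = assign_group(z_hat)
```

The probabilities sum to one by construction, and the individual is allocated to the group whose probability is largest.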

Page 9: Mixture model clustering for mixed data with missing information

The Mixture approach to Clustering Data

The latent class model is a finite mixture model for data where each of the p attributes is discrete.

Suppose that the jth attribute can take on levels 1,…,Mj and let λkjm be the probability that, for individuals from group k, the jth attribute has level m. Then the density for individual i belonging to group k is

$$f_k(x_i; \theta_k) = \prod_{j=1}^{p} \lambda_{k j x_{ij}}$$
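The latent class density is a product of one level probability per attribute. A minimal sketch, with a hypothetical nested-list layout for the λ values of a single group:

```python
def latent_class_density(x, lam_k):
    """f_k(x; theta_k) = prod_j lambda_{k j x_j}; levels are coded 1..M_j.

    lam_k[j][m - 1] holds the probability that attribute j has level m
    in group k (a hypothetical storage layout chosen for this sketch).
    """
    density = 1.0
    for j, level in enumerate(x):
        density *= lam_k[j][level - 1]
    return density

# Hypothetical group with a 2-level and a 3-level attribute.
lam_k = [[0.2, 0.8], [0.5, 0.3, 0.2]]
value = latent_class_density([2, 1], lam_k)  # 0.8 * 0.5
```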

Page 10: Mixture model clustering for mixed data with missing information

Multimix

Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) proposed a general class of mixture models to include data having both continuous and categorical attributes.

The observational vector xi is partitioned such that

$$x_i = (x_{i1}' \mid \dots \mid x_{il}' \mid \dots \mid x_{iL}')'$$

If individual i belongs to group k, we can write

$$f_k(x_i) = \prod_{l=1}^{L} f_{kl}(x_{il})$$

Page 11: Mixture model clustering for mixed data with missing information

Multimix

Discrete distribution: xil is a one-dimensional discrete attribute taking values 1,…,Ml with probabilities λklm, m=1,…,Ml.

Multivariate Normal distribution: xil is a pl-dimensional vector with a Npl(μkl, Σkl) distribution.
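The product form over partition cells can be sketched as follows. For brevity the sketch assumes every normal cell is one-dimensional (pl = 1), whereas the model allows multivariate normal cells; the tagged-tuple representation of a cell is a hypothetical choice made for this illustration.

```python
import math

def block_density(x_l, block):
    """Density of one partition cell; `block` is a hypothetical tagged tuple."""
    if block[0] == "discrete":
        probs = block[1]                 # lambda_{klm} for m = 1..M_l
        return probs[x_l - 1]            # levels are coded 1..M_l
    if block[0] == "normal":
        _, mean, sd = block              # simplifying assumption: p_l = 1
        z = (x_l - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    raise ValueError("unknown block type: %r" % (block[0],))

def multimix_density(x_i, blocks_k):
    """f_k(x_i) = prod_l f_kl(x_il), assuming independence between cells."""
    density = 1.0
    for x_l, block in zip(x_i, blocks_k):
        density *= block_density(x_l, block)
    return density

# Hypothetical group with one discrete cell and one normal cell.
blocks_k = [("discrete", [0.2, 0.8]), ("normal", 0.0, 1.0)]
value = multimix_density([2, 0.0], blocks_k)  # 0.8 * N(0; 0, 1)
```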

Page 12: Mixture model clustering for mixed data with missing information

Graphical models

An alternative way of looking at these multivariate models is within the framework of graphical models.

The graph of a model contains vertices and edges.

There is a vertex corresponding to each variable.

Edges show the dependence structure: the absence of an edge between two vertices indicates conditional independence of the corresponding variables.

Latent class models for p variables are represented by a graph on p+1 vertices, corresponding to the p variables plus 1 categorical variable indicating the cluster.

Page 13: Mixture model clustering for mixed data with missing information

Missing data

We put forward a method for mixture model clustering based on the assumption that the data are missing at random.

We write the observation vector xi in the form (xobs,i, xmiss,i), where

xobs,i denotes the observed attributes for observation i

xmiss,i denotes the missing attributes for observation i

Page 14: Mixture model clustering for mixed data with missing information

Missing data

The E step of the EM algorithm requires the calculation of Q(ψ, ψ(t)) = E{LC(ψ) | xobs; ψ(t)}, the expectation of the complete-data log-likelihood conditional on the observed data and the current value of the parameters.

We calculate Q(ψ, ψ(t)) by replacing zik with

$$\hat{z}_{ik} = z_{ik}^{(t)} = E(z_{ik} \mid x_{obs,i}; \psi^{(t)}) = \frac{\pi_k^{(t)} f_k(x_{obs,i}; \theta_k^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)} f_{k'}(x_{obs,i}; \theta_{k'}^{(t)})}$$
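A minimal sketch of this E-step quantity follows. It assumes that every partition cell holds a single normally distributed attribute, so that the density of the observed part is simply the product over the observed attributes; missing values are marked with None. The function names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def observed_density(x, group):
    """f_k(x_obs; theta_k): product over observed attributes only.

    `group` is a list of (mean, sd) pairs, one per attribute; attributes
    recorded as None are treated as missing and contribute no factor.
    """
    density = 1.0
    for value, (mean, sd) in zip(x, group):
        if value is not None:
            density *= normal_pdf(value, mean, sd)
    return density

def e_step_memberships(x, proportions, groups):
    """z_hat_k = pi_k f_k(x_obs) / sum_k' pi_k' f_k'(x_obs)."""
    weighted = [pi_k * observed_density(x, g)
                for pi_k, g in zip(proportions, groups)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-group, two-attribute example; second attribute missing.
groups = [[(0.0, 1.0), (0.0, 1.0)], [(3.0, 1.0), (3.0, 1.0)]]
z_hat = e_step_memberships([0.2, None], [0.5, 0.5], groups)
```

When every attribute of an individual is missing, the memberships reduce to the current mixing proportions, which is a quick check on the implementation.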

Page 15: Mixture model clustering for mixed data with missing information

Missing data

The remaining calculations in the E step require the calculation of the expected value of the complete data sufficient statistics for each partition cell l.

Page 16: Mixture model clustering for mixed data with missing information

Missing data

For multivariate normal partition cells, Eliminating one cluster at a time

Calculate the between-cluster entropy based on remaining clusters

Page 17: Mixture model clustering for mixed data with missing information

Missing data

The sweep operator is useful in maximum likelihood estimation for multivariate missing data problems.

We form the augmented covariance matrix Al using the current estimates of the parameters for group k in cell l.

Page 18: Mixture model clustering for mixed data with missing information

Missing data

Sweeping on the elements of Al corresponding to the observed xij in cell l yields the conditional distribution of the missing xij’ given the observed xij in the cell.
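The sweep step can be illustrated with a short sketch. The slides do not show the layout of Al, so the sketch assumes the common convention in which the augmented matrix has -1 in the top-left corner, the mean vector in the first row and column, and the covariance matrix in the remaining block. The numbers are hypothetical.

```python
def sweep(g, k):
    """Sweep the symmetric matrix g (a list of lists) on pivot index k."""
    n = len(g)
    pivot = g[k][k]
    h = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for l in range(n):
            if j == k and l == k:
                h[j][l] = -1.0 / pivot
            elif j == k or l == k:
                h[j][l] = g[j][l] / pivot
            else:
                h[j][l] = g[j][l] - g[j][k] * g[k][l] / pivot
    return h

# Hypothetical bivariate normal cell: mean (1, 2), covariance [[4, 2], [2, 3]].
# Assumed layout: row/column 0 is the constant term, 1 and 2 the attributes.
augmented = [[-1.0, 1.0, 2.0],
             [1.0, 4.0, 2.0],
             [2.0, 2.0, 3.0]]

# Attribute 1 is observed and attribute 2 is missing: sweep on index 1.
swept = sweep(augmented, 1)
intercept = swept[0][2]       # 2 - 1 * 2 / 4
slope = swept[1][2]           # 2 / 4
residual_var = swept[2][2]    # 3 - 2 * 2 / 4

# Conditional mean of the missing attribute when the observed one equals 3.
conditional_mean = intercept + slope * 3.0
```

After the sweep, the column of the missing attribute holds the regression of that attribute on the observed one, and the diagonal entry holds the residual variance, which together give the conditional normal distribution.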

Page 19: Mixture model clustering for mixed data with missing information

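Missing data

The new parameter estimates θ(t+1) are computed from the expected complete-data sufficient statistics.

Mixing proportion:

$$\hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ik}^{(t)} \quad \text{for } k = 1,\dots,K$$

Discrete distribution parameters:

$$\hat{\lambda}_{klm}^{(t+1)} = \frac{1}{n \hat{\pi}_k^{(t+1)}} \sum_{i=1}^{n} \hat{z}_{ik}^{(t)} \delta_{ilm} \quad \text{for } k = 1,\dots,K;\ m = 1,\dots,M_l$$

where δilm indicates whether xil takes level m (its expected value is used when xil is missing).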
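The two updates can be sketched for a single discrete attribute. The function names and data are hypothetical, and the treatment of a missing level (using the current probability of that level in the group as the expected indicator) is an assumption of this sketch rather than something stated on the slides.

```python
def update_mixing_proportions(z_hat):
    """pi_k = (1/n) * sum_i z_hat[i][k]."""
    n = len(z_hat)
    n_groups = len(z_hat[0])
    return [sum(row[k] for row in z_hat) / n for k in range(n_groups)]

def update_discrete_probs(z_hat, levels, lam_old):
    """lambda_km = sum_i z_ik * delta_im / sum_i z_ik for one attribute.

    levels[i] is the observed level (1..M) of individual i, or None if
    missing; in that case delta_im is replaced by lam_old[k][m - 1]
    (an assumption made for this sketch).
    """
    lam_new = []
    for k in range(len(lam_old)):
        weight_k = sum(row[k] for row in z_hat)      # equals n * pi_k
        probs = []
        for m in range(1, len(lam_old[k]) + 1):
            total = 0.0
            for row, level in zip(z_hat, levels):
                if level is None:
                    delta = lam_old[k][m - 1]
                else:
                    delta = 1.0 if level == m else 0.0
                total += row[k] * delta
            probs.append(total / weight_k)
        lam_new.append(probs)
    return lam_new

# Hypothetical memberships for 4 individuals and 2 groups.
z_hat = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 0.0]]
lam_old = [[0.5, 0.5], [0.5, 0.5]]
pi_new = update_mixing_proportions(z_hat)
lam_new = update_discrete_probs(z_hat, [1, 2, 2, 1], lam_old)
```

Because the denominator is the sum of the memberships for the group, each updated row of probabilities sums to one whether or not some levels are missing.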

Page 20: Mixture model clustering for mixed data with missing information

Missing data

Multivariate Normal parameters:

Page 21: Mixture model clustering for mixed data with missing information

Application

Prostate cancer clinical trial data of Byar and Green (1980).

The data were obtained from a randomized clinical trial comparing 4 treatments for 506 patients with prostatic cancer.

There are 12 pre-trial covariates measured on each patient: 7 variables may be taken to be continuous, 4 to be discrete, and 1 variable (SG) is an index. We treat SG as a continuous variable.

Page 22: Mixture model clustering for mixed data with missing information

Application

1/3 individual have at least one of pre-trial covariates missing, giving a total of 62 missing values.

As only approximately 1% of the data are missing, further missing values were created artificially. Each attribute of each individual was assigned a random number generated on [0,1] and was recorded as missing according to missingness rates of .10, .15, .20, .25 and .30, respectively.
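One way to create artificial missing values at a given rate is sketched below. This is an illustration under the assumption that each value is deleted independently with the same probability; it is not claimed to reproduce the exact procedure used in the paper, and the function name and seed are hypothetical.

```python
import random

def make_missing(data, rate, seed=0):
    """Return a copy of `data` with each value set to None with probability `rate`.

    A uniform random number in [0, 1) is drawn for every attribute of every
    individual; the value is recorded as missing when the draw is below `rate`.
    """
    rng = random.Random(seed)
    return [[None if rng.random() < rate else value for value in row]
            for row in data]

# Hypothetical data set with the same shape as the application (506 x 12).
data = [[float(j) for j in range(12)] for _ in range(506)]
masked = make_missing(data, 0.30)
n_missing = sum(value is None for row in masked for value in row)
missing_fraction = n_missing / (506 * 12)
```

Repeating the call with rates .10, .15, .20, .25 and .30 gives one data set per missingness level.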

Page 23: Mixture model clustering for mixed data with missing information

Application

The data set reported in detail here had 1870 values recorded as missing.

The data are separated into two clusters.

We regard the data as a random sample from the distribution

Page 24: Mixture model clustering for mixed data with missing information

Application

Page 25: Mixture model clustering for mixed data with missing information

Discussion

The Multimix approach allows the clustering of mixed data containing both categorical and continuous variables.

The finite mixture model lends itself well to coping with missing values.

The approach implemented in this paper worked well for a mixed data set that had a very large amount of missing data.

Page 26: Mixture model clustering for mixed data with missing information

Personal Opinion