
2009 International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China, November 7-8, 2009

Learning Naive Bayes Classifiers with Incomplete Data

Cuiping Leng
School of Mathematics & Information, Shanghai Lixin University of Commerce, Shanghai, China
e-mail: [email protected]

Shuangcheng Wang
School of Mathematics & Information, Opening Economy & Trade Research Center, Shanghai Lixin University of Commerce, Shanghai, China
e-mail: [email protected]

Hui Wang
School of Information Engineering, The Central University for Nationalities, Beijing, China

Abstract—Naive Bayes classifiers are known for their high efficiency and good classification accuracy, and they have been widely used in many domains. However, these classifiers require complete data, while missing data are widespread in practice. Facing this situation, this paper builds a method for learning naive Bayes classifiers and performing classification with missing data. Compared with the common methods for dealing with missing data, this method is more efficient and reliable.

Keywords- incomplete data; naive Bayes classifier; iterative learning; Gibbs sampling

I. INTRODUCTION
Classification, an ability obtained by learning, is a basic and important capability for human beings. It has been regarded as a key research area in machine learning, pattern recognition, data mining and related fields. Many classifiers have been proposed and widely applied; each has its own advantages and disadvantages when used to solve different practical problems. The naive Bayes (NB) classifier [1,2] is one of the best-known classifiers owing to its high efficiency and strong generalization ability.

In practice, most real-life databases contain missing data for various reasons, and handling missing data effectively is an important factor in applying classifiers to practical problems. The existing methods mainly deal with discrete variables. The main approaches are: (1) delete the records with missing data; (2) treat the empty value as a new value [3]; (3) replace the empty value with the mode; (4) the EM algorithm [4]. With the first method, the information contained in the records with missing data cannot be fully used, which harms the classification result. With the second, superfluous values are introduced and the classifier's generalization ability is reduced, so more training examples are required. The third is simple and often works reasonably well, but it can also introduce noise. The last is widely applied; it performs a local greedy optimization of the distribution parameters, is sensitive to initial values, and tends to become trapped in local extrema. Moreover, the iterated parameters may converge to a boundary where the likelihood function has no extremum, resulting in spurious convergence.

To address these problems, we combine Gibbs sampling [5,6] with the star-shaped structure, so that repairing the missing data and learning the classifier parameters are interleaved and carried out jointly. When the iteration converges, a naive Bayes (NB) classifier learned from incomplete data is obtained. This method handles missing data reliably and improves the generalization ability of the classifier.

Let $X_1, \ldots, X_n$ and $C$ denote the attribute variables and the class variable, respectively. The attribute variables may be continuous or discrete; $x_1, \ldots, x_n$ and $c$ are their specific values. We assume that the example set (database) $D$ has $N$ records and that the data are generated randomly and obey the probability distribution $P$.

II. LEARNING NB CLASSIFIER WITH COMPLETE DATA
The NB classifier is composed of a structure and parameters. The structure is a simple, fixed star-like structure that does not need to be learned, so parameter estimation is the core of learning an NB classifier with complete data.

A. The Structure of NB Classifier
The naive Bayes classifier assumes that the attribute variables are conditionally independent given the class variable, so it has the star-like structure shown in Figure 1.

Figure 1. The structure of the naive Bayes classifier: the class node $C$ points to the attribute nodes $X_1, X_2, \ldots, X_n$.

B. The Parameter Estimation of NB Classifier
The estimation of the NB classifier includes the parameter estimation of the class (prior probability estimation) and conditional probability or density estimation.
1) Parameter estimation of the class: $p(c_k) = \hat{p}(c_k) = N_{c_k}/N$, where $N_{c_k}$ is the number of records in the $c_k$-th class.


2) Conditional probability or density estimation:
a) Conditional probability estimation: If $X_i$ is a discrete variable,
$$p(x_i \mid c_k) = \hat{p}(x_i \mid c_k) = \frac{N_{c_k}(x_i)}{N_{c_k}},$$
where $N_{c_k}(x_i)$ is the number of records with $X_i = x_i$ in the $c_k$-th class. If this number is 0 in some case, the estimate
$$\hat{p}(x_i \mid c_k) = \frac{N_{c_k}(x_i) + 1}{N_{c_k} + N_{x_i}}$$
is used instead, where $N_{x_i}$ is the number of possible values of the attribute variable $X_i$.

b) Conditional density estimation: If $X_i$ is a continuous variable, the normal density (a kernel density or other density functions can also be used) can be adopted as the conditional density,
$$p(x_i \mid c_k) = g(x_i; \mu_{c_k}, \sigma_{c_k}), \qquad (1)$$
where
$$g(x_i; \mu_{c_k}, \sigma_{c_k}) = \frac{1}{\sqrt{2\pi}\,\sigma_{c_k}} \exp\!\left( -\frac{(x_i - \mu_{c_k})^2}{2\sigma_{c_k}^2} \right),$$
$$\mu_{c_k} = \frac{x_{i1}(c_k) + \cdots + x_{iN_{c_k}}(c_k)}{N_{c_k}},$$
$$x_{ij}(c_k) = \begin{cases} x_{ij}, & x_{ij} \in \text{class } c_k \\ 0, & x_{ij} \notin \text{class } c_k, \end{cases} \qquad N_{c_k} = \sum_j \begin{cases} 1, & x_{ij} \in \text{class } c_k \\ 0, & x_{ij} \notin \text{class } c_k, \end{cases}$$
that is, $\mu_{c_k}$ is the sample mean of the $c_k$-th class. Likewise,
$$\sigma_{c_k}^2 = \frac{\big(x_{i1}(c_k) - \mu_{c_k}(c_k)\big)^2 + \cdots + \big(x_{iN_{c_k}}(c_k) - \mu_{c_k}(c_k)\big)^2}{N_{c_k}},$$
where $\mu_{c_k}(c_k)$ equals $\mu_{c_k}$ for records in class $c_k$ and 0 otherwise; that is, $\sigma_{c_k}^2$ is the sample variance of the $c_k$-th class.
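As a concrete illustration of the estimators above, the following Python sketch computes the class priors, Laplace-corrected conditional probabilities for discrete attributes, and class-conditional normal parameters for continuous attributes from a complete data set. The function and variable names are ours, not from the paper, and the Laplace correction is applied uniformly here rather than only when a count is zero.

    import math
    from collections import Counter, defaultdict

    def estimate_nb_parameters(records, discrete_attrs, continuous_attrs):
        """records: list of dicts {attr: value, 'class': label} with no missing values."""
        N = len(records)
        priors = {c: n / N for c, n in Counter(r['class'] for r in records).items()}

        cond_prob = defaultdict(dict)     # cond_prob[attr][(value, c)] = p(value | c)
        for attr in discrete_attrs:
            values = {r[attr] for r in records}
            for c in priors:
                in_class = [r[attr] for r in records if r['class'] == c]
                counts = Counter(in_class)
                for v in values:
                    # Laplace-corrected estimate (N_c(x) + 1) / (N_c + N_x)
                    cond_prob[attr][(v, c)] = (counts[v] + 1) / (len(in_class) + len(values))

        cond_normal = defaultdict(dict)   # cond_normal[attr][c] = (mean, std) within the class
        for attr in continuous_attrs:
            for c in priors:
                xs = [r[attr] for r in records if r['class'] == c]
                mu = sum(xs) / len(xs)
                var = sum((x - mu) ** 2 for x in xs) / len(xs)
                cond_normal[attr][c] = (mu, math.sqrt(var))

        return priors, cond_prob, cond_normal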

C. The Representation Form of NB Classifier and Classification Principle
According to the conditional independence assumption, the following equation is obtained from the Bayes formula:
$$p(c \mid x_1, \ldots, x_n) = \frac{p(c)\, p(x_1, \ldots, x_n \mid c)}{p(x_1, \ldots, x_n)} = \alpha\, p(c)\, p(x_1, \ldots, x_n \mid c) = \beta\, p(c) \prod_{i=1}^{n} p(x_i \mid c), \qquad (2)$$
where $p(x_i \mid c)$ is the conditional probability or conditional density, and $\alpha$ and $\beta$ are independent of $c$.
1) Classification principle of the NB classifier: $p(c_k) = \hat{p}(c_k) = N_{c_k}/N$, where $N_{c_k}$ is the number of records in the $c_k$-th class. The estimated values of $p(c), p(x_1 \mid c), \ldots, p(x_n \mid c)$ are obtained from the training set $D$. For given attribute values $x_1^0, \ldots, x_n^0$, the class assigned to $x_1^0, \ldots, x_n^0$ is the one that maximizes $p(c) \prod_{i=1}^{n} p(x_i \mid c)$ (and hence $p(c \mid x_1, \ldots, x_n)$).
2) The representation form of the NB classifier: The representation form of the classifier is
$$c(x_1, \ldots, x_n) = \arg\max_{c} \left\{ p(c) \prod_{i=1}^{n} p(x_i \mid c) \right\}.$$
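This decision rule can be sketched directly in code; to avoid numerical underflow the version below sums logarithms instead of multiplying probabilities, which does not change the arg max. It assumes the parameter structures produced by the estimation sketch given above, and is our illustration rather than the paper's implementation.

    import math

    def nb_classify(x, priors, cond_prob, cond_normal):
        """x: dict {attr: value}; returns the class maximizing p(c) * prod_i p(x_i | c)."""
        best_class, best_score = None, float('-inf')
        for c, p_c in priors.items():
            score = math.log(p_c)
            for attr, value in x.items():
                if attr in cond_prob:          # discrete attribute: table lookup
                    score += math.log(cond_prob[attr].get((value, c), 1e-12))
                else:                          # continuous attribute: log of the normal density
                    mu, sigma = cond_normal[attr][c]
                    score += -0.5 * ((value - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
            if score > best_score:
                best_class, best_score = c, score
        return best_class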

III. LEARNING NAIVE BAYES CLASSIFIERS WITH MISSING DATA

Learning the naive Bayes classifier with missing data is an iterative process. First, the missing values of discrete variables are initialized randomly and the missing values of continuous variables are initialized with the mean. Then, following the order of the records in the database, the missing data of every record are repaired in turn based on the star-like structure and Gibbs sampling. One iteration is completed after the missing data of all records have been revised, and the iteration is repeated until the termination condition is satisfied.
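A high-level sketch of this iteration, under our reading of the procedure, is given below. The callables estimate_parameters, revise_record and extract_missing_sequence are hypothetical placeholders for the steps detailed in Sections III.A and III.B, and data is assumed to be the already-initialized data set.

    def learn_nb_with_missing_data(data, estimate_parameters, revise_record,
                                   extract_missing_sequence, eta0, max_iter=100):
        """Iteratively repair missing values and re-estimate NB parameters (sketch only)."""
        theta = estimate_parameters(data)
        prev_seq = extract_missing_sequence(data)
        for _ in range(max_iter):
            for record in data:                      # repair the records in database order
                theta = revise_record(record, theta)
            curr_seq = extract_missing_sequence(data)
            changed = sum(1 for a, b in zip(prev_seq, curr_seq) if a != b)
            if changed / max(len(prev_seq), 1) < eta0:   # consistency test of Section III.B
                break
            prev_seq = curr_seq
        return theta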

Let $D^{(0)}$ denote the initialized data set, and let $\theta^{(0)}$ be the parameter vector obtained from $D^{(0)}$ and the star-like structure, which is denoted by $S$. It is assumed that a data-set series $D^{(k)}$ and a parameter-vector series $\theta^{(k)}$ are generated by the iteration.

A. Revising the missing data
It is assumed that $k$ iterations have been finished and that the joint distribution determined by $S$ and $\theta^{(k)}$ is
$$p(x_1^{(k)}, \ldots, x_n^{(k)}, c^{(k)} \mid \theta^{(k)}, S) = p(c^{(k)} \mid \theta^{(k)}, S)\, p(x_1^{(k)}, \ldots, x_n^{(k)} \mid c^{(k)}, \theta^{(k)}, S) = p(c^{(k)} \mid \theta^{(k)}, S) \prod_{i=1}^{n} p(x_i^{(k)} \mid c^{(k)}, \theta^{(k)}, S).$$
Following the order of the records in the database, and within each record taking the class variable first and then the attribute variables, we sample the variables with missing data and repair the uncertain values with the sampled values. Suppose the values of $X_i$ and $C$ in the $m$-th record to be repaired are $x_{im}$ and $c_m$, respectively, and that the revised values are $\hat{x}_{im}$ and $\hat{c}_m$. The possible values of attribute variable $X_i$ and class variable $C$ are $x_i^1, \ldots, x_i^{r_i}$ and $c^1, \ldots, c^{r_c}$, respectively.
Let $D^{(k)} = D^{(k)}_{(1,1)}$ denote the data set before the $k$-th iteration, and let $D^{(k)}_{(i,m)}$ be the latest data set before repairing $x_{im}$ in the $k$-th iteration. The data set after the $k$-th iteration is denoted by $D^{(k+1)} = D^{(k)}_{(1,N+1)}$.


1) The revision of class variable values: Normalization processing is applied to
$$p(c_m \mid D^{(k)}_{(i,m)}, S) \prod_{i=1}^{n} p(x_{im} \mid c_m, D^{(k)}_{(i,m)}, S).$$
Denote
$$w_c(h) = \frac{p(c^h \mid D^{(k)}_{(i,m)}, S) \prod_{i=1}^{n} p(x_{im} \mid c^h, D^{(k)}_{(i,m)}, S)}{\sum_{j=1}^{r_c} p(c^j \mid D^{(k)}_{(i,m)}, S) \prod_{i=1}^{n} p(x_{im} \mid c^j, D^{(k)}_{(i,m)}, S)}, \qquad h \in \{1, \ldots, r_c\}.$$
For a generated random number $\lambda$, we have
$$\hat{c}_m = \begin{cases} c^1, & 0 < \lambda \le w_c(1) \\ \;\cdots \\ c^j, & \sum_{h=1}^{j-1} w_c(h) < \lambda \le \sum_{h=1}^{j} w_c(h) \\ \;\cdots \\ c^{r_c}, & \lambda > \sum_{h=1}^{r_c - 1} w_c(h). \end{cases}$$
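In code, drawing $\hat{c}_m$ from the normalized weights $w_c(h)$ with a single uniform random number amounts to inverse-CDF sampling over the cumulative weights. The helper below is our own illustrative sketch, not part of the paper:

    import random

    def sample_from_weights(values, weights):
        """Pick values[j] once the cumulative weight first reaches a uniform draw."""
        lam = random.random()
        cumulative = 0.0
        for value, weight in zip(values, weights):
            cumulative += weight
            if lam <= cumulative:
                return value
        return values[-1]  # guard against floating-point round-off in the cumulative sum

    # e.g. c_hat = sample_from_weights(class_values, normalized_class_weights)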

2) The revision of attribute variable values: This includes the revision of discrete variables and of continuous variables.
a) The revision of values of a discrete variable $X_i$: Normalization processing is applied to $p(x_{im} \mid \hat{c}_m, D^{(k)}_{(i,m)}, S)$. Let
$$w_i(h) = \frac{p(x_i^h \mid \hat{c}_m, D^{(k)}_{(i,m)}, S)}{\sum_{j=1}^{r_i} p(x_i^j \mid \hat{c}_m, D^{(k)}_{(i,m)}, S)}, \qquad h \in \{1, \ldots, r_i\}.$$
For a generated random number $\varphi$, we have
$$\hat{x}_{im} = \begin{cases} x_i^1, & 0 < \varphi \le w_i(1) \\ \;\cdots \\ x_i^j, & \sum_{h=1}^{j-1} w_i(h) < \varphi \le \sum_{h=1}^{j} w_i(h) \\ \;\cdots \\ x_i^{r_i}, & \varphi > \sum_{h=1}^{r_i - 1} w_i(h). \end{cases}$$

b) The revision of values of a continuous variable $X_j$: First, two random numbers $\xi_1$ and $\xi_2$ are generated. Then $\hat{x}'_{jm} = (-2\ln\xi_1)^{1/2}\sin(2\pi\xi_2)$ obeys the $N(0,1)$ distribution. The value of $X_j$ is taken as $\hat{x}_{jm} = \sigma\,\hat{x}'_{jm} + \mu$, where $\mu$ and $\sigma$ are the current estimates of the conditional mean and standard deviation of $X_j$ given the revised class $\hat{c}_m$, so that $\hat{x}_{jm}$ obeys the $N(\mu, \sigma)$ distribution.
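This Box-Muller step can be written directly in code. The sketch below assumes mu and sigma are the current class-conditional mean and standard deviation estimates of the attribute being repaired; the function name is ours.

    import math
    import random

    def sample_normal_box_muller(mu, sigma):
        """Draw one value from N(mu, sigma^2) using the Box-Muller transform."""
        xi1 = 1.0 - random.random()   # shift into (0, 1] so that log(xi1) is defined
        xi2 = random.random()
        z = math.sqrt(-2.0 * math.log(xi1)) * math.sin(2.0 * math.pi * xi2)  # z ~ N(0, 1)
        return sigma * z + mu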

3) The revision of parameters: If $\hat{c}_m \neq c_m$, the corresponding parameters of the class variable need to be adjusted. If $\hat{x}_{im} \neq x_{im}$, the corresponding parameters of the attribute variables need to be adjusted.

a) Parameters modification of the class variable:
$$p(c_m \mid D^{(k)}_{(i,m)+1}, S) = p(c_m \mid D^{(k)}_{(i,m)}, S) - 1/N,$$
$$p(\hat{c}_m \mid D^{(k)}_{(i,m)+1}, S) = p(\hat{c}_m \mid D^{(k)}_{(i,m)}, S) + 1/N,$$
where $D^{(k)}_{(i,m)+1}$ denotes the data set immediately after this repair.

b) Parameters modification of discrete attributes:
$$p(x_{im} \mid c_m, D^{(k)}_{(i,m)+1}, S) = \frac{p(x_{im}, c_m \mid D^{(k)}_{(i,m)}, S) - 1/N}{p(c_m \mid D^{(k)}_{(i,m)}, S)},$$
$$p(\hat{x}_{im} \mid c_m, D^{(k)}_{(i,m)+1}, S) = \frac{p(\hat{x}_{im}, c_m \mid D^{(k)}_{(i,m)}, S) + 1/N}{p(c_m \mid D^{(k)}_{(i,m)}, S)}.$$
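In terms of counts, revising one discrete attribute value only moves one unit of count between two cells of the class-conditional table, since the class count $N_{c_m}$ itself does not change. The following small sketch uses our own data layout (counts[(x, c)] holding $N_c(x)$) and is an illustration, not the paper's code:

    def update_discrete_counts(counts, c_m, x_old, x_new):
        """Shift one count from (x_old, c_m) to (x_new, c_m) after a repair."""
        if x_old != x_new:
            counts[(x_old, c_m)] -= 1                               # p(x_old | c_m) drops by 1/N_{c_m}
            counts[(x_new, c_m)] = counts.get((x_new, c_m), 0) + 1  # p(x_new | c_m) rises by 1/N_{c_m}
        return counts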

c) Parameters modification of continuous attributes: The following formulas are obtained after mathematical derivation:
$$\hat{\mu}_{jm}^{(k)} = \mu_{jm}^{(k)} - \frac{x_{jm}(c_k) - \hat{x}_{jm}(c_k)}{N_{c_k}},$$
$$\hat{\sigma}_{jm}^{2(k)} = \sigma_{jm}^{2(k)} - \frac{\big(x_{jm}(c_k)\big)^2 - \big(\hat{x}_{jm}(c_k)\big)^2}{N_{c_k}} + \Big[\big(\mu_{jm}^{(k)}\big)^2 - \big(\hat{\mu}_{jm}^{(k)}\big)^2\Big];$$
that is, the class-conditional mean and variance of $X_j$ are updated incrementally when the value $x_{jm}$ is replaced by $\hat{x}_{jm}$.
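The same derivation in code: when one value of a continuous attribute within class $c_k$ changes from x_old to x_new, the class-conditional sample mean and variance can be updated without a full pass over the data. This is our restatement of the formulas above, with hypothetical names:

    def update_mean_variance(mu, var, n_c, x_old, x_new):
        """Update the class-conditional mean and variance (over the n_c class members)
        after replacing x_old by x_new."""
        mu_new = mu - (x_old - x_new) / n_c
        var_new = var - (x_old ** 2 - x_new ** 2) / n_c + (mu ** 2 - mu_new ** 2)
        return mu_new, var_new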

B. The Test for Stopping the Iteration of Data Set
A sequence is obtained by ordering the missing data, which is called the missing data sequence. The consistency test of the missing data sequences of two successive iterations is carried out to determine the termination of the iteration. It is assumed that the missing data sequences obtained in two successive iterations are $x_{i1}^{(k)}, x_{i2}^{(k)}, \ldots, x_{iM}^{(k)}$ and $x_{i1}^{(k+1)}, x_{i2}^{(k+1)}, \ldots, x_{iM}^{(k+1)}$, respectively. Define
$$sig\big(x_{ij}^{(k)}, x_{ij}^{(k+1)}\big) = \begin{cases} 0, & x_{ij}^{(k)} = x_{ij}^{(k+1)} \\ 1, & x_{ij}^{(k)} \neq x_{ij}^{(k+1)}, \end{cases} \qquad 1 \le j \le M.$$
For a given threshold value $\eta_0 > 0$, if
$$\frac{1}{M} \sum_{j=1}^{M} sig\big(x_{ij}^{(k)}, x_{ij}^{(k+1)}\big) < \eta_0,$$
the iteration is stopped.
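The stopping test simply checks whether the fraction of repaired missing entries that changed between two successive iterations falls below $\eta_0$; a direct transcription, with names of our choosing:

    def iteration_converged(prev_seq, curr_seq, eta0):
        """prev_seq, curr_seq: repaired missing-data values from two successive iterations."""
        changed = sum(1 for a, b in zip(prev_seq, curr_seq) if a != b)  # sum of sig(., .)
        return changed / len(prev_seq) < eta0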

IV. EXPERIMENT
From the UCI machine learning repository [7], we choose six classification data sets: hepatitis, iris, voting records, new thyroid, wdbc and wine. Taking $\eta_0 = 0.05M$ and a missing-data rate of 30%, hepatitis, iris and voting records are used for the data-set iteration convergence experiment, and all six data sets are used for the classification accuracy experiment. The experimental results are shown in Figures 2 and 3.

Figure 2. The iterative convergence of revising the data set. (Horizontal axis: the number of data revisions, 1-10; vertical axis: the consistency degree of the data sequences (%); curves: voting_records, iris, tic_tac_toe.)


Figure 3. The experiment on classification accuracy. (Horizontal axis: the percentage of missing data (%), 20-60; vertical axis: classification accuracy (%); curves: Gibbs sampling, EM algorithm, mode and average, deleting-record method.)

Figure 2 shows the iterative process on the three data sets; all of them converge within about five iterations. This indicates that the iteratively revised data sets converge quickly, so high efficiency can be obtained.

The six classification data sets are used in the missing-data experiments, and the conditional normal density is chosen for the continuous attribute variables. The average classification accuracies of the six data sets are then compared (each data set is split into three parts, of which two thirds are used as the training set and one third as the test set; the average of the three resulting classification accuracies is used as the overall accuracy).

Figure 3 shows that using Gibbs sampling to handle missing data is significantly better than the other methods. Moreover, the advantage of iteratively revising the data grows as the missing-data ratio increases, which shows that the iterative learning method is effective.

V. CONCLUSION
In this paper, an iterative method for learning naive Bayes classifiers with missing data is presented by combining the star-like structure with Gibbs sampling. In theory, the Gibbs sampling iteration converges to the global stationary distribution and can therefore avoid the local-optimum problem. The experimental results also show that this method is more reliable and practical for dealing with missing data than the other methods.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (60675036), the Leading Academic Discipline Project of Shanghai Municipal Education Commission (international trade, the fifth), the Innovation Program of Shanghai Municipal Education Commission (09zz202) and the Program of the Shanghai Financial Department (1138IA0005).

REFERENCES
[1] N. Friedman, D. Geiger and M. Goldszmidt, "Bayesian Network Classifiers," Machine Learning, 1997, 29, pp. 131-161.
[2] M. Ramoni, P. Sebastiani, "Robust Bayes Classifiers," Artificial Intelligence, 2001, 125(1-2), pp. 209-226.
[3] M. Ramoni, P. Sebastiani, "Robust learning with missing data," Machine Learning, 2001, 45(2), pp. 147-170.
[4] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statist. Soc., 1977(39), pp. 1-38.
[5] S. Geman, D. Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984, 6, pp. 721-742.
[6] S. S. Mao, J. L. Wang and X. L. Pu, Advanced Mathematical Statistics, 1st ed., Beijing: China Higher Education Press; Berlin: Springer-Verlag, 1998, pp. 401-459.
[7] S. L. Murphy, D. W. Aha, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html.
