

Cost-Sensitive Bayesian Network Algorithm

Introduction:

Machine learning algorithms are an increasingly important area of research and application in artificial intelligence and data mining. One of the most important is the Bayesian network, which has been widely used in real-world applications such as medical diagnosis, image recognition, fraud detection, and inference problems. In all of these applications, accuracy alone is not a sufficient evaluation measure, because each decision carries a cost. For example, in fraud detection, a cost is incurred whenever the classifier predicts a fraudulent case as non-fraudulent. In addition, fraud databases have an unbalanced class distribution, which is known to affect learning algorithms adversely. Therefore, this project develops a new algorithm that aims to minimize the costs associated with prediction, misclassification, imbalanced data, time, and tests.

In this work, we attempt to create a new cost-sensitive Bayesian network learning algorithm by adapting the standard Bayesian network algorithm, which focuses on accuracy only. There are several ways to make the algorithm cost-sensitive: changing the distribution of the data; changing the construction process, including adopting alternative measures that take cost into account; and using a genetic algorithm to learn the structure of the Bayesian network (BN). This work applies each of these approaches: amending the distribution, amending the formula, and using genetic algorithms. Finally, an empirical evaluation of the developed algorithms will be carried out on artificial data sets (e.g., diabetes data, lung cancer data, bank data, etc.).

Conclusion:

In real-world problems such as fraud detection, medical diagnosis, or any other decision problem, one class label in the data set (e.g., the fraud class) is often much rarer than the other and more expensive to misclassify, because the cost of failing to recognize instances of the rare class is high. However, most machine learning methods do not take cost into account. Such cost-insensitive algorithms can give poor results, because ignoring cost may produce a very weak model. In practice, misclassification (classification error) is a very common problem in real-world data mining when the class labels are imbalanced.

Eman Nashnush
[email protected]

University of Salford, Manchester, UK

Sponsor in Libya (Tripoli University)

Hypotheses/The problem

Methodology

Cost-insensitive vs. cost-sensitive (research problem)

A cost-insensitive classifier focuses on accuracy only (class label output).

A cost-sensitive classifier attempts to minimize the expected cost.
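The contrast between the two kinds of classifier can be sketched as follows. This is only a minimal illustration, not the algorithm developed in this work: the cost values, the class order, and the function names are assumptions made for the example. Given the class probabilities produced by any classifier, the cost-insensitive rule picks the most probable class, while the cost-sensitive rule picks the class with the lowest expected cost.

```python
# Hypothetical cost matrix: COST[predicted][actual].
# Class index 0 = nonfraud, 1 = fraud. The values are illustrative only.
COST = [
    [0.0, 50.0],  # predict nonfraud: missing a real fraud costs 50
    [1.0, 0.0],   # predict fraud: a false alarm costs 1
]


def accuracy_decision(posterior):
    """Cost-insensitive rule: return the index of the most probable class."""
    return max(range(len(posterior)), key=lambda c: posterior[c])


def expected_costs(posterior, cost=COST):
    """Expected cost of each possible prediction under the given posterior."""
    return [
        sum(cost[pred][actual] * posterior[actual] for actual in range(len(posterior)))
        for pred in range(len(cost))
    ]


def min_expected_cost_decision(posterior, cost=COST):
    """Cost-sensitive rule: return the prediction with the lowest expected cost."""
    costs = expected_costs(posterior, cost)
    return min(range(len(costs)), key=lambda c: costs[c])


# A transaction that is probably legitimate but carries a 10% fraud risk.
posterior = [0.9, 0.1]
print(accuracy_decision(posterior))           # most probable class
print(min_expected_cost_decision(posterior))  # cheapest class in expectation
```

With these assumed costs, the two rules disagree on the example transaction: predicting nonfraud has an expected cost of 5.0, while predicting fraud has an expected cost of 0.9, so the cost-sensitive rule flags the transaction even though fraud is the less probable class.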

[Figure: a learner (1. decision trees, 2. rules, 3. Naive Bayes, ...) builds a classifier from labelled training data, e.g. ($43.45, retail, 10040, ..., nonfraud) and ($246.70, weapon, 94583, ..., fraud). The classifier then assigns a class label from {fraud, nonfraud} to each unlabelled test transaction, e.g. ($99.99, pharmacy, 10027, ..., ?) and ($1.00, gas, 00234, ..., ?).]

The problems described above arise when classifying a data set. Three methods have therefore been proposed to tackle them and minimize the expected misclassification cost:

1. Amend the data distribution to reflect cost.

2. Amend the formula by modifying the statistical measures to include cost.

3. Use a genetic algorithm to evolve a 'fittest' Bayesian network.
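The first method in the list above can be illustrated with a small sketch. The poster does not state which sampling scheme is used, so the code below assumes one common option, cost-proportionate resampling with replacement; the function name, the per-class costs, and the toy data are all hypothetical. After resampling, any standard (cost-insensitive) learner can be trained on the new data.

```python
import random


def resample_by_cost(rows, labels, class_cost, seed=0):
    """Draw len(rows) instances with replacement, where each instance is
    picked with probability proportional to the misclassification cost of
    its class. Costly classes become more frequent in the resampled data.
    """
    rng = random.Random(seed)
    weights = [class_cost[label] for label in labels]
    picked = rng.choices(range(len(rows)), weights=weights, k=len(rows))
    return [rows[i] for i in picked], [labels[i] for i in picked]


# Toy data: 95 nonfraud and 5 fraud transactions (amounts only).
rows = [[float(i)] for i in range(100)]
labels = ["nonfraud"] * 95 + ["fraud"] * 5

# Assumed costs: missing a fraud is 19 times as costly as a false alarm,
# so each class has the same total weight (95 * 1 == 5 * 19).
class_cost = {"nonfraud": 1.0, "fraud": 19.0}

new_rows, new_labels = resample_by_cost(rows, labels, class_cost)
print(labels.count("fraud"), new_labels.count("fraud"))
```

The resampled set has the same size as the original, but the rare, costly class now makes up roughly half of it under these assumed costs, which pushes an accuracy-driven learner toward the decisions that a cost-sensitive learner would make.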

So far, I have investigated experimentally how changing the distribution of the data affects the performance and cost of a Bayesian classifier. I tested my approach, called “Cost-Sensitive Bayesian Network using Sampling”, on 24 data sets from the UCI repository. I compare the proposed algorithm with existing methods and with the original algorithm. The figure below shows the results of the cost-sensitive Bayesian network algorithm based on changing the distribution, alongside those of the original Bayesian network algorithm.

Results

So far, two new cost-sensitive Bayesian network algorithms have been developed and explored: one that uses a black-box (sampling) approach, and another that uses a transparent-box approach, which modifies the statistical measures by amending the selection measure to take account of costs.

The effect of our algorithms is evaluated and compared with other algorithms, such as MetaCost+J48, a standard decision tree (J48), and standard Bayesian networks.
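A comparison of this kind needs a measure of cost as well as accuracy. The sketch below shows one way to compute both; it is not the evaluation code used in this work, and the cost values, function names, and the two toy classifiers are assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_cost(y_true, y_pred, cost):
    """Mean misclassification cost per instance.

    `cost` maps (predicted, actual) pairs to a cost; pairs that are not
    listed (including correct predictions) cost nothing.
    """
    total = sum(cost.get((p, t), 0.0) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Toy test set: 5 fraud cases followed by 95 nonfraud cases.
y_true = ["fraud"] * 5 + ["nonfraud"] * 95

# Classifier A never predicts fraud.
always_nonfraud = ["nonfraud"] * 100

# Classifier B catches all 5 frauds but raises 10 false alarms.
cautious = ["fraud"] * 5 + ["fraud"] * 10 + ["nonfraud"] * 85

# Assumed costs: a missed fraud costs 50, a false alarm costs 1.
cost = {("nonfraud", "fraud"): 50.0, ("fraud", "nonfraud"): 1.0}

for name, y_pred in [("always nonfraud", always_nonfraud), ("cautious", cautious)]:
    print(name, accuracy(y_true, y_pred), average_cost(y_true, y_pred, cost))
```

On this toy data, the classifier that never predicts fraud is the more accurate of the two (0.95 against 0.90) but also far more costly (2.5 against 0.1 per instance), which is why accuracy on its own is a misleading basis for comparing cost-sensitive algorithms on imbalanced data.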