
Page 1: Machine Learning for Data Mining

Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier

Machine Learning for Data Mining

Dr. Dewan Md. Farid

Department of Computer Science & Engineering, United International University, Bangladesh

December 01, 2016

Dr. Dewan Md. Farid: Department of Computer Science & Engineering, United International University, Bangladesh

Page 2: Machine Learning for Data Mining

Big Data Project

Rule-based Classifier

Class Imbalanced Problem

Active Learning

Ensemble Clustering

Hybrid Classifier

Page 3: Machine Learning for Data Mining

Data Mining: What is Data Mining?

Data mining (DM) is also known as Knowledge Discovery from Data, or KDD for short, which turns a large collection of data into knowledge. DM is a multidisciplinary field including machine learning, artificial intelligence, pattern recognition, knowledge-based systems, high-performance computing, database technology and data visualisation.

1. Data mining is the process of analysing data from different perspectives and summarising it into useful information.

2. Data mining is the process of finding hidden information and patterns in a huge database.

3. Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.

Page 4: Machine Learning for Data Mining

Machine Learning

Machine learning (ML) provides the technical basis of data mining; it concerns the construction and study of systems that can learn from data.

1. Supervised learning/Classification - the supervision in the learning comes from the labelled instances.

2. Unsupervised learning/Clustering - the learning process is unsupervised since the instances are not class labelled.

3. Semi-supervised learning - uses both labelled and unlabelled instances when learning a model.

4. Active learning - lets users play an active role in the learning process. It asks a user (e.g., a domain expert) to label an instance, which may be from a set of unlabelled instances.

Page 5: Machine Learning for Data Mining

Learning Algorithms

- Decision Tree (DT) Induction

- Naïve Bayes (NB) Classifier

- NBTree Classifier

- RainForest and BOAT Classifiers

- k-Nearest Neighbour (kNN) Classifier

- Random Forest, Bagging and Boosting (AdaBoost)

- Support Vector Machines (SVM)

- k-Means Clustering

- Similarity-based Clustering

Page 6: Machine Learning for Data Mining

Mining Big Data

Mining big data is the process of extracting knowledge to uncover large hidden information from massive amounts of complex data or databases. Big data is defined by the three V's:

- Volume - the quantity of data.

- Variety - the category of data.

- Velocity - the speed of data in and out.

Some suggest throwing a few more V's into the mix:

- Vision - having a purpose/plan.

- Verification - ensuring that the data conforms to a set of specifications.

- Validation - checking that its purpose is fulfilled.

Page 7: Machine Learning for Data Mining

Big Data Project

1. BRiDGEIris - Brussels Big Data Platform for Sharing and Discovery in Clinical Genomics.

   - Hosted by IB2 (Interuniversity Institute of Bioinformatics in Brussels).

   - Funded by INNOVIRIS (Brussels Institute for Research and Innovation).

2. FWO research project G004414N "Machine Learning for Data Mining Applications in Cancer Genomics".

Page 8: Machine Learning for Data Mining

BRiDGEIris Project

The Brussels big data platform for sharing and discovery in clinical genomics project aims to answer the research challenges by:

1. Design and creation of a multi-site clinical/phenomic and genomic data warehouse.

2. Development of automated tools for extracting relevant information from genetic data.

3. Use of the designed tools to extract new knowledge and transfer it to the medical setting.

Page 9: Machine Learning for Data Mining

VUB AI Lab (CoMo)

The lab particularly focuses on designing and developing a strategy for information discovery on genomic and clinical big data by employing an optimal ensemble method. The goal is to evaluate ensemble predictive modelling techniques for:

1. Improving the prediction accuracy of variant identification/genomic variant classification.

2. Pathology classification tasks.

Developing new methods/algorithms to deal with the following issues:

- Multi-class classification

- High-dimensional data

- Class imbalanced data

- Big data

Page 10: Machine Learning for Data Mining

Brugada syndrome

Brugada syndrome (BrS), also known as sudden adult death syndrome (SADS), is a genetic disease. It increases the risk of sudden cardiac death (SCD) at a young age. Brugada syndrome is named after the Spanish cardiologists Pedro Brugada and Josep Brugada.

- BrS is detected by abnormal electrocardiogram (ECG) findings called a type 1 Brugada ECG pattern, which is much more common in men.

- BrS is a heart rhythm disorder.

- Sudden cardiac death (SCD) is caused when the heart doesn't pump effectively and not enough blood travels to the rest of the body.

The Exome datasets of 148 patients have been analysed for Brugada syndrome at UZ Brussels (Universitair Ziekenhuis Brussel) (www.uzbrussel.be/)

Page 11: Machine Learning for Data Mining

Knowledge Discovery from Genomic Data

Figure: The process of extracting knowledge from genomic data in data mining. (Flowchart: Exome 1, Exome 2, ..., Exome 148 → Data Preprocessing → Formatted Data → Feature Selection → Genomic Data Sets → Mining Algorithm (with Gene Panel) → Knowledge Discovery from Genomic Data.)

Page 12: Machine Learning for Data Mining

Genomic Data of BrS

Table: Classification of DNA variants for Brugada syndrome.

Class      Label
Class I    Nonpathogenic
Class II   VUS1 - Unlikely pathogenic
Class III  VUS2 - Unclear
Class IV   VUS3 - Likely pathogenic
Class V    Pathogenic

Page 13: Machine Learning for Data Mining

Gene Panel of BrS

Table: Gene panel of Brugada syndrome.

Chromosome  Name of Gene
Chr 1       KCND3
Chr 3       SCN5A, GPD1L, SLMAP, CAV3, SCN10A
Chr 4       ANK2
Chr 7       CACNA2D1, AKAP9, KCNH2
Chr 10      CACNAB2
Chr 11      KCNE3, SCN3B, SCN2B, KCNJ5, KCNQ1, SCN4B
Chr 12      CACNA1C, KCNJ8
Chr 15      HCN4
Chr 17      RANGRF, KCNJ2
Chr 19      SCN1B, TRPM4
Chr 20      SNTA1
Chr 21      KCNE1, KCNE2
Chr X       KCNE1L

Page 14: Machine Learning for Data Mining

Chromosomes

Figure: Chromosomes in 148 Exome Datasets. (Bar chart: number of variants per chromosome 1, 3, 4, 7, 11, 12, 15, 17, 19, 21 and X.)

Page 15: Machine Learning for Data Mining

Genomic Data

Figure: Genomic Data: 148 Exome Datasets. (Bar chart: number of variants per Exome dataset 1-148, showing annotated vcf file, gene panel and BrS variants.)

Page 16: Machine Learning for Data Mining

Rule-based Classifier

A rule-based classifier deals easily with complex classification problems. It has various advantages:

- As highly expressive as a DT

- Easy to interpret

- Easy to generate

- Can classify new instances rapidly

- Performance comparable to a DT

- New rules can be added to existing rules without disturbing the ones already there

- Rules can be executed in any order

Page 17: Machine Learning for Data Mining

Adaptive Rule-based Classifier

It combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules for multi-class classification of biological big data.

- Random subspace method (or attribute bagging) to avoid overfitting

- Boosting approach for classifying noisy instances

- Ensemble of decision trees to deal with class-imbalanced data

It uses two popular classification techniques: decision tree (DT) and k-nearest-neighbour (kNN) classifiers.

- DTs are used for evolving classification rules from the training data.

- kNN is used for analysing the misclassified instances and removing vagueness between the contradictory rules.

Page 18: Machine Learning for Data Mining

Random Subspace & Boosting Method

Random subspace is an ensemble classifier. It consists of several classifiers, each operating in a subspace of the original feature space, and outputs the class based on the outputs of these individual classifiers.

- It has been used for decision trees (random decision forests).

- It is an attractive choice for high-dimensional data.

Boosting is designed specifically for classification.

- It converts weak classifiers to strong ones.

- It is an iterative process.

- It uses voting to combine the outputs of the individual classifiers.
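The random subspace idea can be sketched in a few lines of Python. This is a minimal illustration, assuming a 1-NN base classifier and simple majority voting; the function names (`nn_predict`, `random_subspace_predict`) are illustrative, not from the original work:

```python
import random
from collections import Counter

def dist(a, b, feats):
    # Euclidean distance restricted to the chosen feature subspace
    return sum((a[f] - b[f]) ** 2 for f in feats) ** 0.5

def nn_predict(train, feats, x):
    # 1-NN base classifier operating in the given subspace
    return min(train, key=lambda t: dist(t[0], x, feats))[1]

def random_subspace_predict(train, x, n_classifiers=5, subspace=2, seed=0):
    # Each base classifier sees a random subset of the features;
    # the ensemble outputs the majority vote of their predictions.
    rng = random.Random(seed)
    n_feats = len(train[0][0])
    votes = [nn_predict(train, rng.sample(range(n_feats), subspace), x)
             for _ in range(n_classifiers)]
    return Counter(votes).most_common(1)[0][0]
```

Because every base classifier sees only a subset of the features, no single feature can dominate the ensemble, which is what makes the method attractive for high-dimensional data.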

Page 19: Machine Learning for Data Mining

Ensemble Classifier

Figure: An example of an ensemble classifier.

Page 20: Machine Learning for Data Mining

Decision Tree Induction

Decision tree (DT) induction is a top-down, recursive, divide-and-conquer algorithm for the multi-class classification task. The goal of DT induction is to iteratively partition the data into smaller subsets until all the subsets belong to a single class. It is easy to interpret and explain, and also requires little prior knowledge.

- Information Gain: ID3 (Iterative Dichotomiser) algorithm

- Gain Ratio: C4.5 algorithm

- Gini Index: CART algorithm

Page 21: Machine Learning for Data Mining

Algorithm 1 Decision Tree Induction

Input: D = {x1, ..., xi, ..., xN}
Output: A decision tree, DT.
Method:
1: DT = ∅;
2: find the root node with best splitting, Aj ∈ D;
3: DT = create the root node;
4: DT = add arc to root node for each split predicate and label;
5: for each arc do
6:   Dj created by applying splitting predicate to D;
7:   if stopping point reached for this path, then
8:     DT′ = create a leaf node and label it with cl;
9:   else
10:    DT′ = DTBuild(Dj);
11:  end if
12:  DT = add DT′ to arc;
13: end for
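The "best splitting" in step 2 depends on the chosen criterion. As a sketch, ID3's information gain can be computed directly from class entropy (the function names here are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(D) = -sum over classes of p_c * log2(p_c)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(D, A) = H(D) - sum_j |D_j|/|D| * H(D_j),
    # splitting the data on the attribute at index `attr`.
    splits = {}
    for row, y in zip(rows, labels):
        splits.setdefault(row[attr], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in splits.values())
    return entropy(labels) - remainder
```

C4.5's gain ratio divides this gain by the split information, and CART swaps entropy for the Gini index; the overall recursion stays the same.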

Page 22: Machine Learning for Data Mining

K-Nearest-Neighbour (kNN) Classifier

The k-nearest-neighbour (kNN) classifier is a simple classifier. It uses the distance measurement techniques widely used in pattern recognition. kNN finds the k instances, X = {x1, x2, ..., xk} ∈ Dtraining, that are closest to the test instance, xtest, and assigns to xtest the most frequent class label, cl, among the X. When a classification is to be made for a new instance, xnew, its distance to each instance in Dtraining must be determined. Only the k closest instances, X ∈ Dtraining, are considered further. "Closest" is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points, x1 = (x11, x12, ..., x1n) and x2 = (x21, x22, ..., x2n), is shown in Eq. (1):

dist(x1, x2) = sqrt( Σ_{i=1}^{n} (x1i − x2i)^2 )    (1)

Page 23: Machine Learning for Data Mining

Algorithm 2 k-Nearest-Neighbour classifier

Input: D = {x1, ..., xi, ..., xn}
Output: kNN classifier, kNN.
Method:
1: find X ∈ D that identifies the k nearest neighbours, regardless of class label, cl.
2: out of these instances, X = {x1, x2, ..., xk}, identify the number of instances, ki, that belong to class cl, l = 1, 2, ..., M. Obviously, Σ_i ki = k.
3: assign xtest to the class cl with the maximum number ki of instances.
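The three steps above can be sketched directly in Python (a minimal illustration; `knn_classify` and the tuple-based training-set layout are assumptions of this sketch):

```python
from collections import Counter

def euclidean(a, b):
    # Eq. (1): straight-line distance between two feature vectors
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def knn_classify(train, x_test, k=3):
    # Steps 1-3: find the k nearest training instances, count the
    # class labels among them, and return the majority class.
    nearest = sorted(train, key=lambda t: euclidean(t[0], x_test))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```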

Page 24: Machine Learning for Data Mining

Constructing Classification Rules

Extracting classification rules from DTs is an easy and well-known process. Rules are as highly expressive as a DT, so the performance of a rule-based classifier is comparable to that of a DT.

- A rule is generated for each leaf of the DT.

- Each path in the DT from the root node to a leaf node corresponds to a rule.

- The tree corresponds exactly to the classification rules.

DT vs. Rules
New rules can be added to an existing rule set without disturbing the ones already there, whereas adding to a tree structure may require reshaping the whole tree. Rules can be executed in any order.

Page 25: Machine Learning for Data Mining

Algorithm: Adaptive rule-based (ARB) classifier

- It considers a series of k iterations.

- Initially, an equal weight, 1/N, is assigned to each training instance.

- The weights of the training instances are adjusted according to how they are classified in every iteration.

- In each iteration, a sub-dataset Dj is created from the original training dataset D and the previous sub-dataset Dj−1 with maximum weighted instances. Only in the first iteration is the sampling-with-replacement technique used, to create the sub-dataset D1 from the original training data D.

- A tree DTj is built from the sub-dataset Dj with randomly selected features in each iteration.

- A rule is generated for each leaf node of DTj.

- Each path in DTj from the root to a leaf corresponds to a rule.

Page 26: Machine Learning for Data Mining

Algorithm 3 Adaptive rule-based classifier.

Input:
D = {x1, ..., xi, ..., xN}, training dataset;
k, number of iterations;
DT learning scheme;
Output: rule-set; // A set of classification rules.
Method:
1: rule-set = ∅;
2: for i = 1 to N do
3:   wi = 1/N; // initialising weights of each xi ∈ D.
4: end for
5: for j = 1 to k do
6:   if j == 1 then
7:     create Dj by sampling D with replacement;
8:   else
9:     create Dj from Dj−1 and D with maximum weighted X;
10:  end if
11:  build a tree, DTj ← Dj, by randomly selected features;
12:  compute error(DTj); // the error rate of DTj.
13:  if error(DTj) ≥ threshold-value then
14:    go back to step 6 and try again;
15:  else
16:    rules ← DTj; // extracting the rules from DTj.
17:  end if
18:  for each xi ∈ Dj that was correctly classified do
19:    multiply the weight of xi by error(DTj)/(1 − error(DTj)); // update weights.
20:  end for
21:  normalise the weight of each xi ∈ Dj;
22:  rule-set = rule-set ∪ rules;
23: end for
24: return rule-set;
25: create sub-dataset, Dmisclassified, with misclassified instances from Dj;
26: analyse Dmisclassified employing Algorithm 4.
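The AdaBoost-style reweighting in steps 18-21 can be sketched as follows. This is a minimal illustration, assuming `correct` is a list of booleans flagging which instances the current tree classified correctly (the function name is illustrative):

```python
def update_weights(weights, correct, error):
    # Multiply the weight of each correctly classified instance by
    # error/(1 - error), leave misclassified weights unchanged, then
    # normalise so the weights sum to 1 again. Misclassified instances
    # thus carry relatively more weight in the next iteration.
    factor = error / (1.0 - error)
    new = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]
```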

Page 27: Machine Learning for Data Mining

Error Rate Calculation

The error rate of DTj is calculated as the sum of the weights of the misclassified instances, as shown in Eq. (2), where err(xi) is the misclassification error of an instance xi: if an instance xi is misclassified, then err(xi) is one; otherwise (correctly classified), err(xi) is zero.

error(DTj) = Σ_{i=1}^{n} wi × err(xi)    (2)

If the error rate of DTj is less than the threshold-value, then rules are extracted from DTj.
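Eq. (2) amounts to summing the weights of the misclassified instances, e.g. (illustrative sketch; the function name is an assumption):

```python
def weighted_error(weights, predictions, labels):
    # error(DT_j) = sum_i w_i * err(x_i), where err(x_i) is 1 iff
    # the prediction for x_i disagrees with its true label.
    return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
```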

Page 28: Machine Learning for Data Mining

Mining Big Data with Rules

- Big data is so big (millions of instances) that we cannot process all the instances together at the same time.

- It is not possible to store all the data in main memory at once.

- We can create several smaller samples (or subsets) of the big data, each of which fits in main memory.

- Each subset of data is used to construct a set of rules, resulting in several sets of rules.

- The rule sets are then examined and merged to construct the final set of classification rules for the big data, since new rules can be added to existing rules and rules can be executed in any order.
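Because rules are order-independent and mergeable, the chunked scheme above reduces to a simple loop. A minimal sketch, where the `learn_rules` callback stands in for the rule-based classifier and the representation of a rule as a hashable tuple is an assumption of this sketch:

```python
def mine_rules_in_chunks(instances, chunk_size, learn_rules):
    # Learn a rule set per memory-sized chunk of the big dataset,
    # then merge (union) the per-chunk rule sets into the final set.
    final_rules = set()
    for start in range(0, len(instances), chunk_size):
        final_rules |= set(learn_rules(instances[start:start + chunk_size]))
    return final_rules
```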

Page 29: Machine Learning for Data Mining

Mining Big Data with Rules (con.)

Figure: Mining big data using the adaptive rule-based classifier. (Flowchart: Big Data → Sub-data 1, Sub-data 2, ..., Sub-data N → one Adaptive Rule-based Classifier per subset → Integrating Rules → Final Classification Rules.)

Page 30: Machine Learning for Data Mining

Reduced-Error Pruning

- Split the original data into two parts: (a) a growing set, and (b) a pruning set.

- Rules are generated using the growing set only, so important rules might be missed because some key instances were assigned to the pruning set.

- Part of a rule generated from the growing set is deleted, and the effect is evaluated by trying out the truncated rule on the pruning set and seeing whether it performs better than the original rule.

- If the new truncated rule performs better, then it is added to the rule set.

- This process continues for each rule and for each class.

- The overall best rules are established by evaluating the rules on the pruning set.
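A minimal sketch of the pruning loop, under the assumptions that a rule is a tuple of conditions, truncation drops the last condition, and `accuracy_on(rule, pruning_set)` is a user-supplied evaluator (all three are illustrative choices, not the exact published procedure):

```python
def reduced_error_prune(rules, accuracy_on, pruning_set):
    # For each rule, try the truncated version (last condition dropped);
    # keep whichever version scores better on the held-out pruning set.
    kept = []
    for rule in rules:
        truncated = rule[:-1]
        if truncated and accuracy_on(truncated, pruning_set) > accuracy_on(rule, pruning_set):
            kept.append(truncated)
        else:
            kept.append(rule)
    return kept
```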

Page 31: Machine Learning for Data Mining

Algorithm: Analysing Misclassified Instances

To check the classes of the misclassified instances, we use the kNN classifier with a feature selection and weighting approach.

- We apply DT induction for the feature selection and weighting approach.

- We build a tree from the misclassified instances.

- Each feature that is tested in the tree, Aj ∈ Dmisclassified, is assigned a weight 1/d, where d is the depth of the tree.

- Features that are not tested in the tree are not considered in the similarity measure of the kNN classifier.

- We apply the kNN classifier to classify each misclassified instance based on the weighted features.

- We update the class labels of the misclassified instances.

- We check for contradictory rules, if there are any.

Page 32: Machine Learning for Data Mining

Algorithm 4 Analysing misclassified instances

Input: D, original training data;
Dmisclassified, dataset with misclassified instances;
Output: A set of instances, X, with right class labels.
Method:
1: build a tree, DT, using Dmisclassified;
2: for each Aj ∈ Dmisclassified do
3:   if Aj is tested in DT then
4:     assign weight 1/d to Aj, where d is the depth of DT;
5:   else
6:     do not consider Aj for the similarity measure;
7:   end if
8: end for
9: for each xi ∈ Dmisclassified do
10:  find X ∈ D with the similarity of weighted A = {A1, ..., Aj, ..., An};
11:  find the most frequent class, cl, in X;
12:  assign xi ← cl;
13: end for

Page 33: Machine Learning for Data Mining

Performance Measurement

The classification accuracy:

accuracy = ( Σ_{i=1}^{|X|} assess(xi) ) / |X|,  xi ∈ X    (3)

If xi is correctly classified, then assess(xi) = 1; if xi is misclassified, then assess(xi) = 0.

precision = TP / (TP + FP)    (4)

recall = TP / (TP + FN)    (5)

F-score = (2 × precision × recall) / (precision + recall)    (6)
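Eqs. (3)-(6) can be computed directly from a list of predictions. A minimal one-class sketch (the function name and the single-positive-class framing are illustrative; the tables below report weighted averages over all classes):

```python
def scores(predictions, labels, positive):
    # Eqs. (3)-(6): accuracy over all instances; precision, recall and
    # F-score for the class treated as "positive".
    tp = sum(p == positive == y for p, y in zip(predictions, labels))
    fp = sum(p == positive != y for p, y in zip(predictions, labels))
    fn = sum(p != positive == y for p, y in zip(predictions, labels))
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score
```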

Page 34: Machine Learning for Data Mining

Experiments on Exome datasets

The performance of the proposed ARB classifier is compared against the RainForest, NB and kNN classifiers on the 148 Exome datasets. The ARB classifier correctly classifies 91% of gene variants for BrS using the training data. We considered five iterations of the proposed ARB classifier on each Exome dataset.

Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN andproposed ARB classifier using training data.

Algorithm       Accuracy (%)  Precision        Recall           F-score
                              (weighted avg.)  (weighted avg.)  (weighted avg.)
RainForest      83.33         0.76             0.83             0.79
NB              83.33         0.79             0.83             0.78
kNN             75            0.56             0.75             0.64
ARB classifier  91.66         0.95             0.91             0.92

Page 35: Machine Learning for Data Mining

Experiments on Exome datasets (con.)

The performance of the proposed ARB classifier is compared against the RainForest, NB and kNN classifiers using 10-fold cross-validation on the 148 Exome datasets.

Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and the proposed ARB classifier using 10-fold cross-validation.

Algorithm       Accuracy (%)  Precision        Recall           F-score
                              (weighted avg.)  (weighted avg.)  (weighted avg.)
RainForest      58.33         0.46             0.58             0.51
NB              58.33         0.63             0.58             0.6
kNN             50            0.33             0.5              0.4
ARB classifier  75            0.73             0.75             0.68

Page 36: Machine Learning for Data Mining

Experiments on Exome datasets (con.)

The performance of the proposed ARB classifier is compared against the RainForest, NB and kNN classifiers using unseen test variants of 45 Exome datasets, where 103 Exome datasets were used for training the models.

Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and the proposed ARB classifier using testing data.

Algorithm       Accuracy (%)  Precision        Recall           F-score
                              (weighted avg.)  (weighted avg.)  (weighted avg.)
RainForest      50            0.33             0.5              0.4
NB              50            0.25             0.5              0.62
kNN             50            0.25             0.5              0.33
ARB classifier  66.66         0.44             0.66             0.53

Page 37: Machine Learning for Data Mining

Benchmark Life Sciences Datasets

Table: 10 real benchmark life sciences datasets from the UCI (University of California, Irvine) machine learning repository.

No.  Datasets        Instances  No. of Att.  Att. Types  Classes
1    Appendicitis    106        7            Numeric     2
2    Breast cancer   286        9            Nominal     2
3    Contraceptive   1473       9            Numeric     3
4    Ecoli           336        7            Numeric     8
5    Heart           270        13           Numeric     2
6    Pima diabetes   768        8            Numeric     2
7    Iris            150        4            Numeric     3
8    Soybean         683        35           Nominal     19
9    Thyroid         215        5            Numeric     2
10   Yeast           1484       8            Numeric     10

Page 38: Machine Learning for Data Mining

Classification Accuracy

Table: The classification accuracy (%) of C4.5, kNN, naïve Bayes (NB) and the proposed adaptive rule-based classifier with 10-fold cross-validation.

Datasets        C4.5   kNN    NB     Proposed classifier
Appendicitis    85.84  86.79  85.84  87.73
Breast cancer   75.52  73.42  71.67  75.52
Contraceptive   50.98  49.76  48.13  50.1
Ecoli           79.76  83.03  78.86  83.92
Heart           77.40  78.88  83.7   83.7
Pima diabetes   73.82  73.17  76.3   75.65
Iris            96     95.33  96     95.33
Soybean         91.50  90.19  92.97  91.94
Thyroid         98.13  97.2   98.13  98.13
Yeast           56.73  56.94  57.88  61.99

Page 39: Machine Learning for Data Mining

Classification Accuracy (con.)

Figure: The comparison of classification accuracies among C4.5, kNN, naïve Bayes (NB) and the proposed adaptive rule-based classifier on the UCI benchmark life sciences datasets.

Page 40: Machine Learning for Data Mining

Accuracy with 20% noisy instances


Figure: The comparison of classification accuracies among C4.5, kNN, naïve Bayes (NB) and the proposed rule-based classifier on 20% noisy testing data.

Page 41: Machine Learning for Data Mining

Data Balancing Methods

Classification of multi-class imbalanced data is a difficult task, as real data sets are noisy, high dimensional and of small sample size, which results in overfitting and overlapping of classes.

- Traditional machine learning algorithms are very successful at classifying majority-class instances compared to minority-class instances.

- The conventional data balancing methods alter the original data distribution, so they might suffer from overfitting or drop some potential information.

We proposed a new method for dealing with multi-class imbalanced data based on clustering and selecting the most informative instances from the majority classes.

Page 42: Machine Learning for Data Mining

Classifying Imbalanced Data

Machine learning algorithms successfully classify majority-class instances, but misclassify the minority-class instances in many high-dimensional data sets. The following methods are used for class imbalance problems:

1. Sampling methods
   - Under-sampling
   - Over-sampling

2. Cost-sensitive learning methods (it is difficult to get the accurate misclassification cost)

3. Ensemble methods
   - Bagging
   - Boosting
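The two sampling strategies above can be sketched with a toy example (a minimal illustration using plain Python lists and the standard library only; the instance tuples and class names are invented for the demo):

```python
import random

def under_sample(majority, minority, seed=0):
    # Randomly discard majority-class instances until both classes match.
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def over_sample(majority, minority, seed=0):
    # Randomly duplicate minority-class instances until both classes match.
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Toy imbalanced data: 90 negative vs 10 positive instances.
majority = [("x%d" % i, "neg") for i in range(90)]
minority = [("y%d" % i, "pos") for i in range(10)]

balanced_u = under_sample(majority, minority)  # 20 instances, 10 per class
balanced_o = over_sample(majority, minority)   # 180 instances, 90 per class
```

Under-sampling shrinks the training set (risking information loss), while over-sampling enlarges it with duplicates (risking overfitting) — the trade-off noted on the previous slide.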

Page 43: Machine Learning for Data Mining

Proposed Data Balancing Method

- Initially, we cluster the majority-class instances into several clusters.

- Find the most informative instances in each cluster: those close to the centre of the cluster and to the border of the cluster.

- Then several data sets are created from these clusters of most informative instances by combining them with the instances of the minority classes.

- Every data set should have an almost equal number of minority- and majority-class instances.

- Finally, multiple classifiers are trained using these data sets, and a voting technique is used to classify existing/new instances.
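The balanced-set construction step above can be sketched as follows (a minimal sketch: the clustering itself and the base classifiers are omitted, the toy 2-D points are invented, and "informative" is simplified to the points nearest the cluster centre plus those farthest from it):

```python
import math

def informative(cluster, k_centre=2, k_border=2):
    """Informative instances: those nearest the cluster centre and those
    on its border (farthest from the centre)."""
    centre = [sum(coords) / len(cluster) for coords in zip(*cluster)]
    ranked = sorted(cluster, key=lambda x: math.dist(x, centre))
    return ranked[:k_centre] + ranked[-k_border:]

def build_balanced_sets(majority_clusters, minority):
    """One training set per majority-class cluster: the cluster's
    informative instances plus every minority-class instance."""
    return [informative(c) + list(minority) for c in majority_clusters]

# Toy 2-D data: two majority-class clusters and a small minority class.
cluster_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.1)]
cluster_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (6.0, 6.0), (5.9, 6.1)]
minority = [(2.5, 2.5), (2.6, 2.4), (2.4, 2.6), (2.5, 2.6)]

sets = build_balanced_sets([cluster_a, cluster_b], minority)
# Each set holds 4 majority and 4 minority instances: roughly balanced.
```

One classifier would then be trained per balanced set and their votes combined, as the final bullet describes.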

Page 44: Machine Learning for Data Mining

Proposed Data Balancing Method (con.)

[Diagram: the imbalanced data is split into majority- and minority-class instances; the majority instances are partitioned into clusters 1..N; informative instances are found in each cluster and combined with the minority instances to form balanced data sets 1..N; classifiers 1..N are trained on them and their votes are combined to predict new data instances.]

Figure: Proposed data balancing method.

Page 45: Machine Learning for Data Mining

Performance of Data Balancing Methods

The performance of the data balancing methods is measured using the area under the ROC (Receiver Operating Characteristic) curve (AUC) on 2143 variants of Brugada syndrome (BrS) from 148 Exome data sets.

Table: Average AUC values of 148 imbalanced Exome data sets for different imbalanced-data handling methods.

Algorithm              Average AUC value
Random Under-Sampling  0.8923
Random Over-Sampling   0.8673
Bagging                0.8915
Boosting               0.9136
Proposed Method        0.9317

Page 46: Machine Learning for Data Mining

Active Learning

Active learning achieves high accuracy while the number of instances needed to learn a concept can often be much lower than the number required in typical supervised learning.

- It interactively queries a user/expert for the class labels of unlabeled instances.

- The objective is to train a classifier using as few labeled instances as possible by selecting the most informative instances.

Let the data D contain both a set of labeled data, DL, and a set of unlabeled data, DU. Initially, a model M* is trained using DL. Then a querying function is used to select unlabeled instances XU ∈ DU and request a user to label them, XU → XL. Afterwards, XL is added to DL and M* is trained again. The process repeats until the user is satisfied.
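The loop described above can be sketched as a pool-based active learner (a minimal sketch: the nearest-centroid model, the margin-based query function and the simulated oracle are illustrative stand-ins, not the talk's exact components):

```python
import math

def train(labelled):
    """Toy stand-in for the model M*: one centroid per class."""
    groups = {}
    for x, y in labelled:
        groups.setdefault(y, []).append(x)
    return {y: tuple(sum(c) / len(pts) for c in zip(*pts))
            for y, pts in groups.items()}

def predict(model, x):
    return min(model, key=lambda y: math.dist(x, model[y]))

def query(model, pool):
    """Querying function: pick the unlabelled instance with the smallest
    margin between its two nearest class centroids (most uncertain)."""
    def margin(x):
        d = sorted(math.dist(x, c) for c in model.values())
        return d[1] - d[0] if len(d) > 1 else d[0]
    return min(pool, key=margin)

def active_learn(labelled, pool, oracle, budget):
    for _ in range(budget):
        model = train(labelled)           # train M* on D_L
        x = query(model, pool)            # select the most informative x_U
        pool.remove(x)
        labelled.append((x, oracle(x)))   # the user/oracle labels x_U -> x_L
    return train(labelled)                # retrain on the grown D_L

oracle = lambda x: "a" if x[0] < 0.5 else "b"   # simulated expert
labelled = [((0.0, 0.0), "a"), ((1.0, 1.0), "b")]
pool = [(0.1, 0.2), (0.9, 0.8), (0.45, 0.5), (0.55, 0.5)]
model = active_learn(labelled, pool, oracle, budget=2)
```

Note how the query function spends the labeling budget on the two borderline points rather than the easy ones: exactly the "few, most informative labels" idea.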

Page 47: Machine Learning for Data Mining

Active Learning (con.)

[Diagram: the data D is split into labeled data DL and unlabeled data DU; the ensemble model M* selects unlabeled instances XU, the user/oracle labels them as XL, and DL + XL is used to retrain M*.]

Figure: Active learning process.

Page 48: Machine Learning for Data Mining

Proposed Method

The naïve Bayes (NB) classifier and clustering are used to find the most informative instances for labeling as part of active learning. The unlabeled instances are selected for labeling using the following two strategies:

- Instances close to the centres of clusters and to the borders of clusters.

- Instances whose posterior probabilities are equal or very close.

Page 49: Machine Learning for Data Mining

Performance of Ensemble Methods

Adaptive boosting (the AdaBoost algorithm) is used with the NB classifier as the base classifier.

Table: The accuracy and F-score of ensemble methods on 2143 DNA variants of Brugada syndrome.

Algorithm                    Classification accuracy (%)  F-score (weighted avg.)
Random Forest                92.3                         0.93
Bagging                      87.5                         0.83
Boosting                     91.66                        0.9
AdaBoost with NB classifier  94.73                        0.93

Page 50: Machine Learning for Data Mining

Clustering of high-dimensional big data

- An ensemble clustering method with a feature selection and grouping approach.

- k-means clustering.

- Similarity-based clustering.

- Biclustering (on each cluster generated by the ensemble clustering, to find the sub-matrices).

- Unlabelled genomic data of Brugada syndrome (148 Exome data sets).

The proposed method selects the most relevant features in the data set and groups them into subsets of features to overcome the problems associated with the traditional clustering methods.

Page 51: Machine Learning for Data Mining

Clustering

Clustering is the process of grouping a set of instances into clusters (subsets or groups) so that instances within a cluster have high similarity to one another, but are very dissimilar to instances in other clusters. Let X be the unlabelled data set, that is,

X = {x1, x2, ..., xN}  (7)

X is partitioned into k clusters, C1, ..., Ck, so that the following conditions are met:

Ci ≠ ∅, i = 1, ..., k  (8)

∪_{i=1}^{k} Ci = X  (9)

Ci ∩ Cj = ∅, i ≠ j, i, j = 1, ..., k  (10)
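Conditions (8)-(10) say that a clustering is a partition of X. A small checker makes this concrete (a sketch assuming hashable, distinct instances; the function name is illustrative):

```python
def is_partition(clusters, X):
    """Check conditions (8)-(10): no empty cluster, the union covers X,
    and the clusters are pairwise disjoint."""
    flat = [x for c in clusters for x in c]
    return (all(len(c) > 0 for c in clusters)   # (8) no Ci is empty
            and set(flat) == set(X)             # (9) the union of Ci equals X
            and len(flat) == len(set(flat)))    # (10) the Ci are disjoint
```

For example, [[1, 2], [3]] partitions [1, 2, 3], while [[1, 2], [2, 3]] violates condition (10).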

Page 52: Machine Learning for Data Mining

Challenges

- Extracting patterns from the genomic big data.

- Genomic data is often too big and too messy.

- Genomic data is also high-dimensional, so traditional distance measures may be dominated by the noise in many dimensions.

- In genomic data, we need to find not only the clusters of instances (genes), but for each cluster a set of features (conditions).

Page 53: Machine Learning for Data Mining

k-Means

- It defines the mean value of the instances {xi1, xi2, ..., xiN} ∈ Ci.

- It randomly selects k instances, {xk1, xk2, ..., xkN} ∈ X, each of which initially represents a cluster centre.

- Each remaining instance xi ∈ X is assigned to a cluster.

- Similarity is measured based on the Euclidean distance between xi and Ci.

- It iteratively improves the within-cluster variation.

A high degree of similarity among instances within clusters is obtained, while a high degree of dissimilarity among instances in different clusters is achieved simultaneously. The cluster mean of Ci = {xi1, xi2, ..., xiN} is defined in equation 11.

Mean(Ci) = (1/N) ∑_{j=1}^{N} xij  (11)

Dr. Dewan Md. Farid: Department of Computer Science & Engineering, United International University, Bangladesh

Machine Learning for Data Mining

Page 54: Machine Learning for Data Mining

Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier

Algorithm 5 k-Means Clustering

Input: X = {x1, x2, ..., xN} // a set of unlabelled instances
       k // the number of clusters
Output: A set of k clusters.
Method:
1: arbitrarily choose k instances, {xk1, xk2, ..., xkN} ∈ X, as the initial k cluster centres;
2: repeat
3:   (re)assign each xi ∈ X to the cluster to which xi is the most similar, based on the mean value of the instances in the cluster;
4:   update the k means, that is, calculate the mean value of the instances for each cluster;
5: until no change
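Algorithm 5 can be sketched in Python. Two assumptions of this sketch: Euclidean distance as the similarity measure (as the previous slide states), and the first k instances as initial centres for reproducibility (the algorithm chooses them arbitrarily):

```python
import math

def k_means(X, k, max_iter=100):
    """Minimal k-means over a list of numeric tuples."""
    centres = list(X[:k])   # assumed: first k instances as initial centres
    for _ in range(max_iter):
        # Assignment step: each xi joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for x in X:
            j = min(range(k), key=lambda j: math.dist(x, centres[j]))
            clusters[j].append(x)
        # Update step: recompute each cluster mean (Eq. 11).
        new_centres = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                       else centres[j] for j, cl in enumerate(clusters)]
        if new_centres == centres:
            break            # "until no change"
        centres = new_centres
    return centres, clusters

X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, clusters = k_means(X, 2)
```

On this toy data the two well-separated groups are recovered after a few assign/update rounds.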

Page 55: Machine Learning for Data Mining

Similarity-Based Clustering (SCM)

- It is robust with respect to the initial number of clusters.

- It detects clusters of different volumes.

Let sim(xi, xl) be the similarity measure between instance xi and the lth cluster centre xl. The goal is to find xl to maximise the total similarity measure shown in Eq. 12.

Js(C) = ∑_{l=1}^{k} ∑_{i=1}^{N} f(sim(xi, xl))  (12)

where f(sim(xi, xl)) is a reasonable similarity measure and C = {C1, ..., Ck}. In general, SCM uses feature values to check the similarity between instances. However, any suitable distance measure can be used to check the similarity between the instances.

Dr. Dewan Md. Farid: Department of Computer Science & Engineering, United International University, Bangladesh

Machine Learning for Data Mining

Page 56: Machine Learning for Data Mining

Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier

Algorithm 6 Similarity-based Clustering

Input: X = {x1, x2, ..., xN} // a set of unlabelled instances
Output: A set of clusters, C = {C1, C2, ..., Ck}.
Method:
1: C = ∅;
2: k = 1;
3: Ck = {x1};
4: C = C ∪ Ck;
5: for i = 2 to N do
6:   for l = 1 to k do
7:     find the lth cluster centre xl ∈ Cl that maximises the similarity measure, sim(xi, xl);
8:   end for
9:   if sim(xi, xl) ≥ threshold value then
10:    Cl = Cl ∪ xi;
11:  else
12:    k = k + 1;
13:    Ck = {xi};
14:    C = C ∪ Ck;
15:  end if
16: end for
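Algorithm 6 can be sketched as follows. Two assumptions of this sketch: sim(a, b) = 1/(1 + Euclidean distance) as the similarity measure (the slide leaves it open), and the cluster mean as the cluster centre:

```python
import math

def scm(X, threshold):
    """Similarity-based clustering: grow clusters one instance at a time;
    open a new cluster when no centre is similar enough."""
    sim = lambda a, b: 1.0 / (1.0 + math.dist(a, b))   # assumed measure
    clusters = [[X[0]]]
    for x in X[1:]:
        # Find the cluster centre (here: the cluster mean) most similar to x.
        centres = [tuple(sum(c) / len(cl) for c in zip(*cl))
                   for cl in clusters]
        best = max(range(len(clusters)), key=lambda l: sim(x, centres[l]))
        if sim(x, centres[best]) >= threshold:
            clusters[best].append(x)
        else:
            clusters.append([x])   # not similar enough: new cluster
    return clusters

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (0.2, 0.1)]
clusters = scm(X, threshold=0.5)   # sim >= 0.5 means distance <= 1
```

Unlike k-means, the number of clusters is not fixed in advance; it is driven by the threshold, which matches the "robust to the cluster number" point above.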

Page 57: Machine Learning for Data Mining

Ensemble Clustering

Ensemble clustering is the process of integrating multiple clustering algorithms to form a single strong clustering approach that usually provides better clustering results. It generates a set of clusters from a given unlabelled data set and then combines the clusters into final clusters to improve the quality of the individual clusterings.

- No single cluster analysis method is optimal.

- Different clustering methods may produce different clusters, because they impose different structures on the data set.

- Ensemble clustering performs more effectively on high-dimensional complex data.

- It is a good alternative when facing cluster analysis problems.

Page 58: Machine Learning for Data Mining

Ensemble clustering (con.)

Generally, three strategies are applied in ensemble clustering:

1. Using different clustering algorithms on the same data set to create heterogeneous clusters.

2. Using different samples/subsets of the data with different clustering algorithms to produce component clusters.

3. Running the same clustering algorithm many times on the same data set with different parameters or initialisations to create homogeneous clusters.

The main goal of ensemble clustering is to integrate the component clusterings into one final clustering with a higher accuracy.

Page 59: Machine Learning for Data Mining

Ensemble clustering on genomic/biological data

Pattern extraction from genomic data applying ensemble clustering.

[Diagram: big biological data → data pre-processing → feature selection → feature grouping → ensemble clustering → biclustering → hidden patterns in data.]

Figure: The pattern extraction process from genomic/biological data.

Page 60: Machine Learning for Data Mining

Data Pre-processing

Data pre-processing transforms raw data into an understandable format and includes several techniques:

- Data cleaning is the process of dealing with missing values.

- Data integration merges data from multiple sources into a coherent data store, such as a data warehouse, or integrates metadata.

- Data transformation includes the following: (a) normalisation, (b) aggregation, (c) generalisation, and (d) feature construction.

- Data reduction obtains a reduced representation of the data set (eliminating redundant features/instances).

- Data discretisation reduces the number of values of a continuous feature by dividing its range into intervals.

Page 61: Machine Learning for Data Mining

Feature Selection

Feature selection is the process of selecting a subset of relevant features from the original features in the data. It is mainly used for three reasons:

- Simplification of models

- Shorter training times

- Reducing overfitting

In biological data, features may contain false correlations, and the information they add may already be contained in other features. In this work, we applied an unsupervised feature selection approach based on measuring similarities between features with the maximum information compression index. We quantified the information loss in feature selection with an entropy measure technique. After selecting the subset of features from the data, we grouped them into two groups: nominal and numeric features.
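The maximum information compression index of two numeric features is the smallest eigenvalue of their 2x2 covariance matrix: it is zero exactly when one feature is a linear function of the other, i.e. fully redundant. A minimal sketch (the feature vectors are invented; the talk's feature-grouping and entropy steps are omitted):

```python
import math

def mici(x, y):
    """Maximum information compression index: the smallest eigenvalue of
    the 2x2 covariance matrix of features x and y. Near zero means the
    two features are (almost) linearly dependent, i.e. redundant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    tr, det = vx + vy, vx * vy - cov * cov
    # Smallest root of lambda^2 - tr*lambda + det = 0.
    return (tr - math.sqrt(tr * tr - 4 * det)) / 2

f1 = [1.0, 2.0, 3.0, 4.0]
f2 = [2.0, 4.0, 6.0, 8.0]    # perfectly correlated with f1 -> MICI = 0
f3 = [1.0, -1.0, 1.0, -1.0]  # unrelated to f1 -> MICI clearly > 0
```

Feature pairs with a small index are candidates for dropping one of the pair during selection.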

Page 62: Machine Learning for Data Mining

Subspace Clustering

Subspace clustering finds subspace clusters in high-dimensional data. It can be classified into three groups:

1. Subspace search methods.

2. Correlation-based clustering methods.

3. Biclustering methods.

A subspace search method searches various subspaces for clusters (sets of instances that are similar to each other in a subspace) in the full space. It uses two kinds of strategies:

- Bottom-up approach: start from low-dimensional subspaces and search higher-dimensional subspaces.

- Top-down approach: start with the full space and search smaller subspaces recursively.

Page 63: Machine Learning for Data Mining

Algorithm 7 δ-Biclustering

Input: E, a data matrix, and δ ≥ 0, the maximum acceptable mean squared residue score.
Output: EIJ, a δ-bicluster, that is, a submatrix of E with row set I and column set J, with a score no larger than δ.
Initialisation: I and J are initialised to the instance and feature sets in the data, and EIJ = E.
Deletion phase:
1: compute eiJ for all i ∈ I, eIj for all j ∈ J, eIJ, and H(I, J);
2: if H(I, J) ≤ δ then
3:   return EIJ;
4: end if
5: find the rows i ∈ I with d(i) = (1/|J|) ∑_{j∈J} (eij − eiJ − eIj + eIJ)²;
6: find the columns j ∈ J with d(j) = (1/|I|) ∑_{i∈I} (eij − eiJ − eIj + eIJ)²;
7: remove the rows i ∈ I and columns j ∈ J with larger d;
Addition phase:
1: compute eiJ for all i, eIj for all j, eIJ, and H(I, J);
2: add the columns j ∉ J with (1/|I|) ∑_{i∈I} (eij − eiJ − eIj + eIJ)² ≤ H(I, J);
3: recompute eiJ, eIJ and H(I, J);
4: add the rows i ∉ I with (1/|J|) ∑_{j∈J} (eij − eiJ − eIj + eIJ)² ≤ H(I, J);
5: for each row i ∉ I do
6:   if (1/|J|) ∑_{j∈J} (eij − eiJ − eIj + eIJ)² ≤ H(I, J) then
7:     add the inverse of i;
8:   end if
9: end for
10: return EIJ;
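The quantity driving Algorithm 7 is the mean squared residue H(I, J), where eiJ, eIj and eIJ are the row, column and overall means of the submatrix. A minimal sketch of its computation (the two toy matrices are invented for illustration):

```python
def msr(E, I, J):
    """Mean squared residue H(I, J) of the submatrix of E with row set I
    and column set J; lower means a more coherent bicluster."""
    eiJ = {i: sum(E[i][j] for j in J) / len(J) for i in I}   # row means
    eIj = {j: sum(E[i][j] for i in I) / len(I) for j in J}   # column means
    eIJ = sum(E[i][j] for i in I for j in J) / (len(I) * len(J))
    return sum((E[i][j] - eiJ[i] - eIj[j] + eIJ) ** 2
               for i in I for j in J) / (len(I) * len(J))

E_additive = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]   # perfect bicluster: H = 0
E_noisy = [[1, 2], [2, 1]]                        # checkerboard: H = 0.25
```

A submatrix whose entries follow an additive row-plus-column pattern has residue 0, which is why rows/columns with large per-row or per-column residue d are deleted first.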

Page 64: Machine Learning for Data Mining

Clustering of BrS variants

Distribution of BrS variants in clusters using the proposed ensemble clustering.

Page 65: Machine Learning for Data Mining

Experimental Method

To test the performance of the clustering algorithms, we used an unsupervised evaluation method that computes the compactness (CP) of the clusters, shown in Eq. 13.

CP = (1/n) ∑_{l=1}^{k} n_l ( ∑_{xi,xj∈Cl} d(xi, xj) / (n_l(n_l − 1)/2) )  (13)

where d(xi, xj) is the distance between two instances in cluster Cl and n_l is the number of instances in Cl. The smaller the CP for a clustering result, the more compact and better the clustering result.
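Eq. 13 is the n_l-weighted average of each cluster's mean pairwise distance. A minimal sketch (Euclidean distance and skipping singleton clusters, which contribute no pairwise distance, are assumptions of this sketch; the toy clusters are invented):

```python
import math
from itertools import combinations

def compactness(clusters):
    """CP (Eq. 13): weighted average of each cluster's mean pairwise
    distance; smaller means more compact clusters."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        nl = len(c)
        if nl < 2:
            continue   # assumed: a singleton adds no pairwise distance
        pairs = nl * (nl - 1) / 2
        total += nl * sum(math.dist(a, b)
                          for a, b in combinations(c, 2)) / pairs
    return total / n

tight = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]   # mean pair distance 1 each
loose = [[(0, 0), (0, 4)], [(5, 5), (5, 9)]]   # mean pair distance 4 each
```

Here `compactness(tight)` is smaller than `compactness(loose)`, matching the "smaller CP is better" reading of the table that follows.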

Page 66: Machine Learning for Data Mining

Results

The proposed ensemble clustering is compared with the following clustering algorithms:

- SimpleKMeans (clustering using the k-means method)

- XMeans (an extension of k-means)

- DBScan (nearest-neighbor-based; automatically determines the number of clusters)

- MakeDensityBasedCluster (wraps a clusterer to make it return distribution and density)

Table: Comparison of clustering results on 148 Exome data sets of BrS.

Clustering Method        Compactness (CP)
SimpleKMeans             9.401
XMeans                   8.297
MakeDensityBasedCluster  7.483
DBScan                   6.351
Ensemble Clustering      5.647

Page 67: Machine Learning for Data Mining

Hybrid Decision Tree & Naïve Bayes Classifiers

The presence of noisy, contradictory instances in the training data causes learning models to suffer from overfitting and decreases classification accuracy.

- Hybrid Decision Tree (DT) classifier: a naïve Bayes (NB) classifier is used to remove the noisy, troublesome instances from the training data before the DT induction.

- Hybrid Naïve Bayes (NB) classifier: a DT is used to select a comparatively more important subset of features for the production of the naïve assumption of class conditional independence, as it is extremely computationally expensive for a naïve Bayes classifier to compute class conditional independence for high-dimensional data sets.
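The noise-filtering step of the hybrid DT can be sketched as follows (a minimal sketch: a simple categorical naïve Bayes with simplified add-one smoothing flags and removes misclassified training instances; the tree learner itself and the toy weather-style data are not from the talk):

```python
from collections import Counter, defaultdict

def nb_fit(data):
    """Count-based categorical naive Bayes model."""
    classes = Counter(y for _, y in data)
    cond = defaultdict(Counter)   # (attribute index, class) -> value counts
    for x, y in data:
        for a, v in enumerate(x):
            cond[(a, y)][v] += 1
    return classes, cond

def nb_predict(model, x):
    classes, cond = model
    n = sum(classes.values())
    def score(y):
        p = classes[y] / n
        for a, v in enumerate(x):
            # Simplified add-one smoothing (an assumption of this sketch).
            p *= (cond[(a, y)][v] + 1) / (classes[y] + len(classes))
        return p
    return max(classes, key=score)

def filter_noise(data):
    """Keep only the training instances the NB model classifies correctly."""
    model = nb_fit(data)
    return [(x, y) for x, y in data if nb_predict(model, x) == y]

data = ([(("sunny", "hot"), "no")] * 3
        + [(("rain", "mild"), "yes")] * 3
        + [(("sunny", "hot"), "yes")])   # one contradictory instance
kept = filter_noise(data)
```

The lone contradictory ("sunny", "hot") → "yes" instance is dropped, so the subsequent tree induction never tries to fit it.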

Page 68: Machine Learning for Data Mining

Algorithm 8 Decision Tree Induction

Input: D = {x1, x2, ..., xn} // training data set, D, which contains a set of training instances and their associated class labels
Output: T, a decision tree.
Method:
1: for each class, Ci ∈ D, do
2:   find the prior probabilities, P(Ci);
3: end for
4: for each attribute value, Aij ∈ D, do
5:   find the class conditional probabilities, P(Aij|Ci);
6: end for
7: for each training instance, xi ∈ D, do
8:   find the posterior probability, P(Ci|xi);
9:   if xi is misclassified, then
10:    remove xi from D;
11:  end if
12: end for
13: T = ∅;
14: determine the best splitting attribute;
15: T = create the root node and label it with the splitting attribute;
16: T = add an arc to the root node for each split predicate and label;
17: for each arc do
18:   D = data set created by applying the splitting predicate to D;
19:   if the stopping point is reached for this path, then
20:     T′ = create a leaf node and label it with an appropriate class;
21:   else
22:     T′ = DTBuild(D);
23:   end if
24:   T = add T′ to the arc;
25: end for

Page 69: Machine Learning for Data Mining

Algorithm 9 Naïve Bayes classifier

Input: D = {x1, x2, ..., xn} // training data
Output: A classification model.
Method:
1: T = ∅;
2: determine the best splitting attribute;
3: T = create the root node and label it with the splitting attribute;
4: T = add an arc to the root node for each split predicate and label;
5: for each arc do
6:   D = data set created by applying the splitting predicate to D;
7:   if the stopping point is reached for this path, then
8:     T′ = create a leaf node and label it with an appropriate class;
9:   else
10:    T′ = DTBuild(D);
11:  end if
12:  T = add T′ to the arc;
13: end for
14: for each attribute, Ai ∈ D, do
15:   if Ai is not tested in T, then
16:     Wi = 0;
17:   else
18:     let d be the minimum depth of Ai in T, and Wi = 1/√d;
19:   end if
20: end for
21: for each class, Ci ∈ D, do
22:   find the prior probabilities, P(Ci);
23: end for
24: for each attribute, Ai ∈ D with Wi ≠ 0, do
25:   for each attribute value, Aij ∈ Ai, do
26:     find the class conditional probabilities, P(Aij|Ci)^Wi;
27:   end for
28: end for
29: for each instance, xi ∈ D, do
30:   find the posterior probability, P(Ci|xi);
31: end for

Page 70: Machine Learning for Data Mining

Accuracy on Benchmark Datasets

Figure: Classification accuracy on 10 datasets with 10-fold cross validation.

Page 71: Machine Learning for Data Mining

Novel Class Instances

Figure: Instances with a fixed number of class labels (left) and instances of a novel class arriving in the data stream (right).

Page 72: Machine Learning for Data Mining

Novel Class Instances (con.)

Figure: Flow chart of classification and novel class detection.

Page 73: Machine Learning for Data Mining

Novel Class Instances (con.)

Page 74: Machine Learning for Data Mining

Novel Class Instances (con.)

Page 75: Machine Learning for Data Mining

*** THANK YOU ***
