data.mining.c.6(ii).classification and prediction


TRANSCRIPT

Page 1: Data.Mining.C.6(II).classification and prediction

1

Chapter 6 (Part II)

Alternative Classification Technologies

Chapter 6, Part II: Data Classification Techniques (第六章(第二部分) 数据分类技术)

Page 2: Data.Mining.C.6(II).classification and prediction

2

Chapter 6 (II) Alternative Classification Technologies

- Instance (示例 ) based Approach

- Ensemble (组合 ) Approach

- Co-training Approach

- Partially Supervised Approach

Page 3: Data.Mining.C.6(II).classification and prediction

3

Instance-Based ( 基于示例 ) Approach

[Figure: a set of stored cases, each with attributes Atr1 … AtrN and a class label (A, B, or C), and an unseen case with the same attributes but no label]

• Store the training records

• Use training records to predict the class label of unseen cases directly

Page 4: Data.Mining.C.6(II).classification and prediction

4

Instance-Based Method

• Typical approach: the k-nearest neighbor approach (kNN) k-邻近法

  – Instances are represented as points in a Euclidean space

  – Uses the k “closest” points (nearest neighbors) to perform classification

Page 5: Data.Mining.C.6(II).classification and prediction

5

Nearest Neighbor Classifiers

• Basic idea:

  – If it walks like a duck and quacks like a duck, then it’s probably a duck

[Figure: compute the distance (similarity) between the test record and the training records, then choose the k “nearest” (i.e., most similar) records]

Page 6: Data.Mining.C.6(II).classification and prediction

6

Nearest-Neighbor Classifiers require three things:

– The set of stored records

– Metric (度量) to compute distance between records

– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:

– Compute distance to other training records

– Identify k nearest neighbors

– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., majority vote)


Page 7: Data.Mining.C.6(II).classification and prediction

7

Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Page 8: Data.Mining.C.6(II).classification and prediction

8

Key to the kNN Approach

• Compute the “nearness” relationship between two points

  – a similarity (closeness) measure

• Determine the class from the nearest-neighbor list

  – take the majority vote of the class labels among the k nearest neighbors

Page 9: Data.Mining.C.6(II).classification and prediction

9

Distance-based Similarity Measure

• Distances are normally used to measure the similarity between

two data objects

• Euclidean distance(欧几里德距离 ):

– Properties

• d(i,j) ≥ 0

• d(i,i) = 0

• d(i,j) = d(j,i)

• d(i,j) ≤ d(i,k) + d(k,j)

The Euclidean distance between two objects i and j described by p attributes:

$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$
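As a quick check of the formula (a made-up pair of objects, not from the slides): with p = 2 attributes, x_i = (1, 2) and x_j = (4, 6),

$d(i,j) = \sqrt{|1-4|^2 + |2-6|^2} = \sqrt{9 + 16} = 5$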

Page 10: Data.Mining.C.6(II).classification and prediction

10

Boolean type 布尔型

• A contingency table for binary data (state values: 0 or 1):

                   Object j
                   1        0        sum
    Object i  1    a        b        a+b
              0    c        d        c+d
              sum  a+c      b+d      p

• Simple matching coefficient (简单匹配系数):

  $d(i,j) = \frac{b + c}{a + b + c + d}$
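As a small worked example (a hypothetical pair of binary objects, not from the slides): for i = (1, 0, 1, 1, 0) and j = (1, 1, 0, 1, 0) we get a = 2, b = 1, c = 1, d = 1, so

$d(i,j) = \frac{b + c}{a + b + c + d} = \frac{1 + 1}{5} = 0.4$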

Page 11: Data.Mining.C.6(II).classification and prediction

11

Distance-based Measure for the Categorical Type (标称型) of Data

• Categorical type

e.g., red, yellow, blue, green for the nominal variable color

• Method: Simple matching

– m: # of matches, p: total # of variables

$d(i,j) = \frac{p - m}{p}$
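For instance (a hypothetical case): if two objects are described by p = 4 categorical variables and agree on m = 2 of them, then

$d(i,j) = \frac{p - m}{p} = \frac{4 - 2}{4} = 0.5$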

Page 12: Data.Mining.C.6(II).classification and prediction

12

Distance-based Measure for Mixed Types (混合型) of Data

• An object (tuple) may contain all of the types mentioned above

• May use a weighted formula to combine their effects.

– Given p variables of different types, the combined dissimilarity is

  $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

  where the indicator $\delta_{ij}^{(f)} = 0$ if $x_{if} = x_{jf} = 0$ (asymmetric binary) or a value is missing, and $\delta_{ij}^{(f)} = 1$ otherwise

– f is boolean (布尔) or categorical (标称): $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$

– f is interval-valued: use the normalized distance $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$
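The weighted formula above can be sketched in a few lines of Python; the function name, the attribute-type labels, and the example values below are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of the mixed-type dissimilarity d(i, j) defined above.
# The attribute-type labels ("binary", "categorical", "interval") and all
# example values are illustrative, not from the slides.

def mixed_distance(xi, xj, attr_types, ranges):
    """xi, xj: attribute tuples; attr_types: one type label per attribute;
    ranges: (min, max) per interval-valued attribute, None otherwise."""
    num = den = 0.0
    for f, (a, b) in enumerate(zip(xi, xj)):
        # delta_ij^(f) = 0 if a value is missing, or both asymmetric-binary values are 0
        if a is None or b is None:
            continue
        if attr_types[f] == "binary" and a == 0 and b == 0:
            continue
        if attr_types[f] in ("binary", "categorical"):
            d_f = 0.0 if a == b else 1.0       # match / mismatch
        else:                                   # interval-valued: normalized distance
            lo, hi = ranges[f]
            d_f = abs(a - b) / (hi - lo) if hi > lo else 0.0
        num += d_f                              # delta = 1 for the kept attributes
        den += 1.0
    return num / den if den else 0.0

# Example: one binary, one categorical, and one interval attribute
print(mixed_distance((1, "red", 30), (0, "red", 50),
                     ("binary", "categorical", "interval"),
                     (None, None, (0, 100))))   # (1 + 0 + 0.2) / 3 ≈ 0.4
```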

Page 13: Data.Mining.C.6(II).classification and prediction

13

K-Nearest Neighbor Algorithm

Input: let k be the number of nearest neighbors and D be the set of training examples

For each test example z = (x', y') do

  Compute d(x', x), the distance between z and every example (x, y) ∈ D

  Select D_z ⊆ D, the set of the k closest training examples to z

  Classify by majority vote: $y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)$

End for

Page 14: Data.Mining.C.6(II).classification and prediction

14

Measures for Other Types of Data

Textual Data: Vector Space Representation

• A document is represented as a vector:

(W1, W2, …, Wn)

– Binary:

• Wi= 1 if the corresponding term i (often a word) is in the document

• Wi= 0 if the term i is not in the document

– TF (Term Frequency):

• Wi = tfi, where tfi is the number of times term i occurs in the document

Page 15: Data.Mining.C.6(II).classification and prediction

15

Similarity Measure for Textual Data

• Distance-based similarity measures

  – Problems: high dimensionality and data sparseness

Page 16: Data.Mining.C.6(II).classification and prediction

16

Other Similarity Measures

• The “closeness” between two documents is calculated as the correlation between the vectors that represent them, using measures such as the cosine of the angle between the two vectors.

Cosine measure (余弦计算方法):

$sim(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\|\,\|v_2\|}$, where $v_1 \cdot v_2 = \sum_{i=1}^{t} v_{1i} v_{2i}$ and $\|v_1\| = \sqrt{v_1 \cdot v_1}$
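To make the two representations concrete, here is a small Python sketch that builds term-frequency vectors for two toy documents and computes their cosine similarity as defined above; the tokenization and the example documents are illustrative assumptions.

```python
# TF vectors (W_i = tf_i) for two toy documents and their cosine similarity.
import math
from collections import Counter

def tf_vector(text, vocab):
    counts = Counter(text.lower().split())     # naive whitespace tokenization
    return [counts[t] for t in vocab]          # W_i = tf_i

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))                               # v1 . v2
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms if norms else 0.0

doc1, doc2 = "data mining mining course", "data mining lecture"
vocab = sorted(set(doc1.split()) | set(doc2.split()))
v1, v2 = tf_vector(doc1, vocab), tf_vector(doc2, vocab)
print(cosine(v1, v2))   # about 0.71 for these two toy documents
```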

Page 17: Data.Mining.C.6(II).classification and prediction

17

Discussion on the k-NN Algorithm

• k-NN classifiers are lazy learners (they “learn from your neighbors”)

  – They do not build models explicitly

  – Unlike eager learners such as decision tree induction

  – Classifying unknown records is relatively expensive

  – Robust to noisy data when averaging over the k nearest neighbors

Page 18: Data.Mining.C.6(II).classification and prediction

18

Chapter 6 (II) Alternative Classification Technologies

- Instance (示例 ) based Approach

- Ensemble (组合 ) Approach

- Co-training Approach

- Partially Supervised Approach

Page 19: Data.Mining.C.6(II).classification and prediction

19

Ensemble Methods

• Construct a set of classifiers from the training data

• Predict class label of previously unseen records by aggregating predictions made by multiple classifiers

Page 20: Data.Mining.C.6(II).classification and prediction

20

General Idea

[Figure: from the original training data D, Step 1 creates multiple data sets D1, D2, …, Dt-1, Dt; Step 2 builds multiple classifiers C1, C2, …, Ct-1, Ct; Step 3 combines them into a composite classifier C*]

Page 21: Data.Mining.C.6(II).classification and prediction

21

Examples of Ensemble Approaches

• How to generate an ensemble of classifiers?

– Bagging

– Boosting

Page 22: Data.Mining.C.6(II).classification and prediction

22

Bagging

• Sampling with replacement

• Build a classifier on each bootstrap sample set (自助样本集)

  Original Data:     1  2  3  4  5  6  7  8  9  10
  Bagging (Round 1): 7  8  10 8  2  5  10 10 5  9
  Bagging (Round 2): 1  4  9  1  2  3  2  7  3  2
  Bagging (Round 3): 1  8  5  10 5  5  9  6  3  7

Page 23: Data.Mining.C.6(II).classification and prediction

Bagging Algorithm

23

Let k be the number of bootstrap sample sets

For i = 1 to k do

  Create a bootstrap sample D_i of size N

  Train a (base) classifier C_i on D_i

End for

Combine by majority vote: $C^*(x) = \arg\max_{y} \sum_{i} I\big(C_i(x) = y\big)$
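A minimal Python sketch of this bagging loop (the base-learner interface is assumed to be sklearn-style fit/predict; all names are illustrative):

```python
# Bagging: k bootstrap samples, one base classifier per sample,
# majority vote C*(x) = argmax_y sum_i I(C_i(x) = y) at prediction time.
import random
from collections import Counter

def bagging_fit(X, y, base_learner, k=10):
    N = len(X)
    classifiers = []
    for _ in range(k):
        idx = [random.randrange(N) for _ in range(N)]      # sample with replacement
        Xi, yi = [X[j] for j in idx], [y[j] for j in idx]  # bootstrap sample D_i of size N
        clf = base_learner()
        clf.fit(Xi, yi)                                    # train base classifier C_i on D_i
        classifiers.append(clf)
    return classifiers

def bagging_predict(classifiers, x):
    votes = Counter(clf.predict([x])[0] for clf in classifiers)
    return votes.most_common(1)[0][0]

# Usage sketch (assuming scikit-learn is installed for the base classifier):
# from sklearn.tree import DecisionTreeClassifier
# models = bagging_fit(X_train, y_train, DecisionTreeClassifier, k=10)
# print(bagging_predict(models, X_test[0]))
```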

Page 24: Data.Mining.C.6(II).classification and prediction

24

Boosting

• An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records

– Initially, all N records are assigned equal weights

– Unlike bagging, weights may change at the end of each boosting round

Page 25: Data.Mining.C.6(II).classification and prediction

25

Boosting

• Records that are wrongly classified will have their weights increased

• Records that are classified correctly will have their weights decreased

  Original Data:      1  2  3  4  5  6  7  8  9  10
  Boosting (Round 1): 7  3  2  8  7  9  4  10 6  3
  Boosting (Round 2): 5  4  9  4  2  5  1  7  4  2
  Boosting (Round 3): 4  4  8  10 4  5  4  6  3  4

• Example 4 is hard to classify

• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds

Page 26: Data.Mining.C.6(II).classification and prediction

26

Boosting

[Figure: the process of generating classifiers C1, C2, …, Cm from training sets D1, D2, …, Dm; the examples a classifier gets wrong (F) are emphasized when forming the next training set, while correctly classified examples (T) are de-emphasized]

Page 27: Data.Mining.C.6(II).classification and prediction

Boosting

Problems:

(1)How to update the weights of the training examples?

(2)How to combine the predictions made by each base classifier?

27

Page 28: Data.Mining.C.6(II).classification and prediction

AdaBoosting Algorithm

Given (x_j, y_j): a set of N training examples (j = 1, …, N)

28

The error rate of a base classifier C_i:

$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \ne y_j\big)$

where I(p) = 1 if p is true, and 0 otherwise.

The importance of a classifier C_i:

$\alpha_i = \frac{1}{2} \ln \frac{1 - \epsilon_i}{\epsilon_i}$

Page 29: Data.Mining.C.6(II).classification and prediction

AdaBoosting Algorithm

29

The weight update mechanism (Equation):

$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\ e^{\alpha_j} & \text{if } C_j(x_i) \ne y_i \end{cases}$

where $w_i^{(j)}$ is the weight of example $(x_i, y_i)$ during the j-th boosting round, and $Z_j$ is the normalization factor chosen so that $\sum_i w_i^{(j+1)} = 1$.

Page 30: Data.Mining.C.6(II).classification and prediction

AdaBoosting Algorithm

30

Let k be the number of boosting rounds and D the set of all N examples

Initialize the weights for all N examples: $W = \{\, w_j = \tfrac{1}{N} \mid j = 1, \ldots, N \,\}$

For i = 1 to k do

  Create training set D_i by sampling from D according to W; train a base classifier C_i on D_i

  Apply C_i to all examples in the original set D

  Compute the error rate $\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \ne y_j\big)$ and the importance $\alpha_i = \frac{1}{2} \ln \frac{1 - \epsilon_i}{\epsilon_i}$

  Update the weight of each example according to the weight-update equation

End for

Final (combined) classifier:

$C^*(x) = \arg\max_{y} \sum_{j=1}^{k} \alpha_j \, I\big(C_j(x) = y\big)$
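The whole loop could be sketched in Python as follows; labels are assumed to be in {-1, +1}, the base learner is assumed to expose sklearn-style fit/predict, and (unlike the slide, which keeps the 1/N factor explicit) the weights here are kept normalized so that factor is absorbed. All names are illustrative.

```python
# AdaBoost sketch: resample according to the weights W, train C_i, compute
# the weighted error eps_i and importance alpha_i, then update the weights.
import math
import random

def adaboost_fit(X, y, base_learner, k=10):
    N = len(X)
    w = [1.0 / N] * N                              # initialize w_j = 1/N
    classifiers, alphas = [], []
    for _ in range(k):
        idx = random.choices(range(N), weights=w, k=N)   # sample D_i according to W
        Xi, yi = [X[j] for j in idx], [y[j] for j in idx]
        clf = base_learner()
        clf.fit(Xi, yi)
        pred = clf.predict(X)                      # apply C_i to the original set D
        eps = sum(w[j] for j in range(N) if pred[j] != y[j])   # weighted error rate
        if eps >= 0.5:
            break                                  # worse than random guessing: stop
        if eps == 0:
            classifiers.append(clf); alphas.append(10.0)       # "perfect" round
            break
        alpha = 0.5 * math.log((1 - eps) / eps)    # importance alpha_i
        # increase the weights of misclassified examples, decrease the others,
        # then renormalize (Z_j)
        w = [w[j] * math.exp(alpha if pred[j] != y[j] else -alpha) for j in range(N)]
        Z = sum(w)
        w = [wj / Z for wj in w]
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, x):
    # C*(x) = argmax_y sum_j alpha_j I(C_j(x) = y); for y in {-1, +1} this is
    # the sign of the alpha-weighted sum of predictions
    score = sum(a * clf.predict([x])[0] for clf, a in zip(classifiers, alphas))
    return 1 if score >= 0 else -1
```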

Page 31: Data.Mining.C.6(II).classification and prediction

31

Increasing Classifier Accuracy

Bagging and Boosting:

- general techniques for improving classifier accuracy

- combine a series of T learned classifiers C1, …, CT with the aim of creating an improved composite classifier C*

[Figure: the data are used to train classifiers C1, C2, …, CT; a new data sample is given to each of them and their votes are combined into a single class prediction]

Page 32: Data.Mining.C.6(II).classification and prediction

32

Chapter 6 (II) Alternative Classification Technologies

- Instance (示例 ) based Approach

- Ensemble (组合 ) Approach

- Co-training Approach

- Partially Supervised Approach

Page 33: Data.Mining.C.6(II).classification and prediction

33

Unlabeled Data

• One of the bottlenecks of classification is labeling a large set of examples (data records or text documents)

  – Often done manually

  – Time consuming

• Can we label only a small number of examples and make use of a large amount of unlabeled data for classification?

Page 34: Data.Mining.C.6(II).classification and prediction

34

Co-training Approach

• Blum and Mitchell (CMU, 1998)

– Two “independent” views: split the features into two sets.

– Train a classifier on each view.

– Each classifier labels data that can be used to train the other classifier, and vice versa

Page 35: Data.Mining.C.6(II).classification and prediction

Co-Training Approach

35

[Figure: the feature set X = (X1, X2) is split into subset X1 and subset X2; classification model one is trained on the X1 view of the labeled example set L, and classification model two on the X2 view; each model classifies unlabeled data, and the new labeled data set produced by one model is added to the example set used to train the other]

Page 36: Data.Mining.C.6(II).classification and prediction

36

Two Views

• The features can be split into two independent sets (views):

  – The instance space: $X = X_1 \times X_2$

  – Each example: $x = (x_1, x_2)$

• A pair of views x1, x2 satisfies view independence just in case:

Pr[X1 = x1 | X2 = x2, Y = y] = Pr[X1 = x1 | Y = y]

Pr[X2 = x2 | X1 = x1, Y = y] = Pr[X2 = x2 | Y = y]

Page 37: Data.Mining.C.6(II).classification and prediction

37

Co-training algorithm

[Figure: the Blum and Mitchell co-training algorithm; in each of k iterations, each view's classifier labels its p most confident positive and n most confident negative examples from a pool of u unlabeled examples. For instance, p=1, n=3, k=30, and u=75]
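A sketch of that loop in Python (classifier interface assumed to be sklearn-style fit/predict_proba with classes {0, 1}; the pool-replenishment detail of the original algorithm is only noted in a comment, and all helper names are illustrative):

```python
# Co-training sketch: two classifiers, one per feature view, each labeling
# its most confident unlabeled examples for the other classifier to use.
import random

def cotrain(X1_lab, X2_lab, y_lab, X1_unl, X2_unl, clf1, clf2,
            p=1, n=3, k=30, u=75):
    """X1_*/X2_* are the two feature views of the same examples;
    labels are binary {0, 1}; clf1/clf2 expose fit / predict_proba."""
    L1, L2, y = list(X1_lab), list(X2_lab), list(y_lab)
    remaining = list(range(len(X1_unl)))                 # still-unlabeled indices
    for _ in range(k):
        if not remaining:
            break
        clf1.fit(L1, y)
        clf2.fit(L2, y)
        pool = random.sample(remaining, min(u, len(remaining)))   # pool U'
        newly = []                                       # (index, predicted label)
        for clf, view in ((clf1, X1_unl), (clf2, X2_unl)):
            probs = [clf.predict_proba([view[i]])[0][1] for i in pool]
            ranked = sorted(zip(pool, probs), key=lambda t: t[1])
            newly += [(i, 0) for i, _ in ranked[:n]]     # n most confident negatives
            newly += [(i, 1) for i, _ in ranked[-p:]]    # p most confident positives
        for i, lab in dict(newly).items():               # de-duplicate indices
            L1.append(X1_unl[i]); L2.append(X2_unl[i]); y.append(lab)
            remaining.remove(i)
        # (Blum & Mitchell then replenish the pool U' from the unlabeled data.)
    return clf1, clf2
```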

Page 38: Data.Mining.C.6(II).classification and prediction

38

Co-training: Example

• Ca. 1,050 web pages from 4 CS departments

  – 25% of the pages used as test data

  – The remaining 75% of the pages used for training

• Labeled data: 3 positive and 9 negative examples

• Unlabeled data: the rest (ca. 770 pages)

• Manually labeled into a number of categories, e.g., “course home page”

• Two views:

  – View #1 (page-based): words in the page

  – View #2 (hyperlink-based): words in the hyperlinks pointing to the page

• Naïve Bayes classifiers

Page 39: Data.Mining.C.6(II).classification and prediction

39

Co-training: Experimental Results

• begin with 12 labeled web pages (course, etc)

• provide ca. 1,000 additional unlabeled web pages

• average error: traditional approach 11.1%;

• average error: co-training 5.0%

Page 40: Data.Mining.C.6(II).classification and prediction

40

Chapter 6 (II) Alternative Classification Technologies

- Instance (示例 ) based Approach

- Ensemble (组合 ) Approach

- Co-training Approach

- Partially Supervised Approach

Page 41: Data.Mining.C.6(II).classification and prediction

41

Learning from Positive & Unlabeled Data

• Positive examples: a set P of examples of a class

• Unlabeled set: a set U of unlabeled (mixed) examples, containing instances from P and also instances not from P (negative examples)

• Build a classifier: build a classifier to classify the examples in U and/or future (test) data

• Key feature of the problem: no labeled negative training data. We call this problem PU-learning.

Page 42: Data.Mining.C.6(II).classification and prediction

42

Positive and Unlabeled

[Figure: the positive class (e.g., Sports) versus the remaining, implicitly negative classes mixed into the unlabeled data: Politics, Culture, Computer Science, Education, Military Affairs]

Page 43: Data.Mining.C.6(II).classification and prediction

43

Direct Marketing

• A company has a database with details of its customers – positive examples – but no information on people who are not its customers, i.e., no negative examples

• It wants to find people who are similar to its customers for marketing

• It buys a database with details of other people – who among them may be potential customers?

Page 44: Data.Mining.C.6(II).classification and prediction

44

Novel 2-step strategy

• Step 1: Identifying a set of reliable negative documents from the unlabeled set.

• Step 2: Building a sequence of classifiers by iteratively

applying a classification algorithm and then selecting a good classifier.

Page 45: Data.Mining.C.6(II).classification and prediction

45

Two-Step Process

Page 46: Data.Mining.C.6(II).classification and prediction

46

Existing 2-step strategy

[Figure: Step 1 takes the positive set P and the unlabeled set U and extracts a set of reliable negatives RN from U, leaving Q = U – RN; Step 2 uses P, RN and Q to build the final classifier iteratively, or uses only P and RN to build a classifier]

Page 47: Data.Mining.C.6(II).classification and prediction

47

Step 1: The Spy technique

• Sample a certain % of the positive examples and put them into the unlabeled set to act as “spies”

• Run a classification algorithm (e.g., the Naïve Bayes approach) assuming all unlabeled examples are negative

  – we will then know the behavior of the actual positive examples in the unlabeled set through the “spies”

• We can then extract reliable negative examples from the unlabeled set more accurately
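A minimal sketch of this step in Python, assuming an sklearn-style classifier with fit/predict_proba and classes {0, 1}; the spy ratio, the threshold rule (minimum spy probability), and all names are illustrative choices, not prescribed by the slides:

```python
# Spy technique (Step 1): move some positives into U as "spies", train with
# U treated as negative, and take as reliable negatives (RN) the unlabeled
# documents whose P(positive) falls below what the spies receive.
import random

def spy_step(P, U, clf, spy_ratio=0.15):
    spy_idx = set(random.sample(range(len(P)), max(1, int(spy_ratio * len(P)))))
    S = [P[i] for i in spy_idx]                       # the spies
    P_rest = [P[i] for i in range(len(P)) if i not in spy_idx]
    X = P_rest + U + S                                # spies hidden inside "negative"
    y = [1] * len(P_rest) + [0] * (len(U) + len(S))
    clf.fit(X, y)
    threshold = min(clf.predict_proba(S)[:, 1])       # lowest P(positive) among spies
    u_probs = clf.predict_proba(U)[:, 1]
    RN = [doc for doc, pr in zip(U, u_probs) if pr < threshold]
    return RN                                         # reliable negative set
```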

Page 48: Data.Mining.C.6(II).classification and prediction

48

Step 2: Running a classification algorithm iteratively

– Iteratively use P, RN and Q until no document from Q can be classified as negative; RN and Q are updated in each iteration

Page 49: Data.Mining.C.6(II).classification and prediction

49

PU-Learning

• Heuristic (启发式) methods:

  – Step 1 tries to find some initial reliable negative examples from the unlabeled set

  – Step 2 tries to identify more and more negative examples iteratively

• The two steps together form an iterative strategy

Page 50: Data.Mining.C.6(II).classification and prediction

50