Rule-Based Classification
TRANSCRIPT
-
Rule-Based Classification
Lecture 21/10-09-09 (no class because of placements)
Lecture 22/11-09-09 (no class)
Lecture 23/12-09-09 (no class)
Lecture 24/14-09-09
-
Building Classification Rules
Direct Method: extract rules directly from data. Examples: RIPPER, CN2, Holte's 1R.
Indirect Method: extract rules from other classification models (e.g., decision trees, neural networks, etc.). Example: C4.5rules.
-
Direct Method: Sequential Covering Algorithm
Extracts rules directly from data.
When there are multiple classes, rules are extracted one class at a time.
The criterion for selecting which class to consider first depends on a number of factors, such as class prevalence.
-
Algorithm (Sequential Covering)
1: Let E be the training records and A the set of attribute-value pairs {(Aj, vj)}.
2: Let Yo be an ordered set of classes {y1, y2, y3, ..., yk}.
3: Let R = { } be the initial rule (decision) list.
4: for each class y ∈ Yo − {yk} do
5:   while stopping condition is not met do
6:     r ← Learn-One-Rule(E, A, y).
       Remove training records from E that are covered by r.
       Add r to the bottom of the rule list: R ← R ∨ r.
     end while
   end for
   Insert the default rule, { } → yk, at the bottom of the rule list R.
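To make the control flow concrete, here is a minimal Python sketch of the loop above; `learn_one_rule` and `covers` are hypothetical helpers standing in for the Learn-One-Rule function and rule matching, which the slides do not specify.

```python
def sequential_covering(records, classes, learn_one_rule, covers):
    """Greedy sequential covering: learn rules for one class at a time.

    records: list of (attrs, label) training examples (the set E)
    classes: ordered list of class labels Yo; the last one, yk, is the default
    learn_one_rule(records, y): returns a rule predicting y, or None if no
        acceptable rule can be grown (hypothetical helper)
    covers(rule, attrs): True if every conjunct of the rule matches attrs
    """
    rule_list = []                              # R = { }
    for y in classes[:-1]:                      # for each y in Yo - {yk}
        while True:                             # until stopping condition
            r = learn_one_rule(records, y)
            if r is None:
                break
            # Remove training records from E that are covered by r.
            records = [(a, l) for a, l in records if not covers(r, a)]
            rule_list.append((r, y))            # R <- R v r (append to bottom)
    rule_list.append(({}, classes[-1]))         # default rule { } -> yk
    return rule_list
```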
-
Sequential Covering Algorithm (in short, for your quick reference)
1. Start from an empty rule set.
2. Extract a rule using the Learn-One-Rule function.
3. Remove training records covered by the rule.
4. Repeat steps (2) and (3) until the stopping criterion is met.
-
Learn-One-Rule function
Objective: extract a rule that covers as many positive examples as possible and none or few negative examples in the training dataset.
Computationally expensive because of the exponential size of the search space.
Generates an initial rule and keeps growing (refining) it until the stopping criterion is met.
-
Example of Sequential Covering
[Figure, panels (i)-(ii): (i) the original data; (ii) step 1, the first rule is grown.]
-
Example of Sequential Covering
[Figure, panels (iii)-(iv): (iii) step 2, rule R1 covers part of the data; (iv) step 3, rules R1 and R2 cover the data.]
-
Aspects of Sequential Covering
Rule Growing Strategy
Instance Elimination
Rule Evaluation
Stopping Criterion
Rule Pruning
-
Rule Growing
Two common strategies:
1. General-to-specific
2. Specific-to-general
-
General-to-Specific
The initial rule r: { } → y has poor quality, as it covers all examples in the training set.
Conjuncts are subsequently added to improve the quality of the rule.
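A minimal Python sketch of general-to-specific growing, assuming categorical attributes and a Laplace-corrected accuracy as the quality measure (both are my assumptions; the slides leave the quality measure open until the Rule Evaluation section):

```python
def grow_rule(records, target, candidates):
    """Grow r: { } -> target by greedily adding the best conjunct.

    records: list of (attrs_dict, label) pairs
    candidates: list of (attribute, value) conjuncts that may be added
    """
    def quality(rule):
        covered = [l for a, l in records
                   if all(a.get(k) == v for k, v in rule.items())]
        pos = sum(1 for l in covered if l == target)
        return (pos + 1) / (len(covered) + 2)   # Laplace estimate, k = 2

    rule = {}                                   # start from the empty rule
    while True:
        best_q, best_rule = quality(rule), None
        for attr, val in candidates:
            if attr in rule:
                continue
            cand = dict(rule)
            cand[attr] = val
            if quality(cand) > best_q:          # keep the best refinement
                best_q, best_rule = quality(cand), cand
        if best_rule is None:                   # no conjunct improves the rule
            return rule
        rule = best_rule
```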
-
Specific-to-General
Refund=No, Status=Single, Income=85K (Class=Yes)
Refund=No, Status=Single, Income=90K (Class=Yes)
generalize (by dropping the Income conjunct) to:
Refund=No, Status=Single (Class=Yes)
(b) Specific-to-general
-
Specific-to-General
One of the positive examples is chosen randomly as the initial seed:
Body temp=warm-blooded, Skin cover=hair, Gives birth=yes, Aquatic creature=no, Aerial creature=no, Has legs=yes, Hibernates=no ⇒ Mammals
One of the conjuncts is then removed so that the rule can cover more positive examples:
Skin cover=hair, Gives birth=yes, Aquatic creature=no, Aerial creature=no, Has legs=yes, Hibernates=no ⇒ Mammals
Body temp=warm-blooded, Skin cover=hair, Gives birth=yes, Aquatic creature=no, Aerial creature=no, Has legs=yes ⇒ Mammals
-
Rule Evaluation
Suppose a training set contains 60 positive and 100 negative examples. Consider two rules:
R1: covers 50 positive examples and 5 negative examples
R2: covers 2 positive examples and 0 negative examples
The accuracy of R1 is 90.9% and that of R2 is 100%.
Still, R1 is better because of its coverage; the other measures below make this clear.
-
Rule Evaluation
Metrics:
Accuracy $= \frac{n_c}{n}$
Likelihood ratio statistic: $R = 2 \sum_{i=1}^{k} f_i \log_2(f_i / e_i)$
Laplace $= \frac{n_c + 1}{n + k}$
M-estimate $= \frac{n_c + kp}{n + k}$  OR  $\frac{n_c + mp}{n + m}$
where
n: number of instances covered by the rule
n_c: number of covered instances classified correctly by the rule
k: number of classes
p: prior probability of the class
f_i: observed frequency of class i among the covered instances; e_i: its expected frequency
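A small Python sketch of these measures (the function names are mine), applied to R1 and R2 from the previous slide; with k = 2 classes, the Laplace estimate favors R1 despite R2's perfect accuracy:

```python
def accuracy(nc, n):
    """nc: covered instances classified correctly; n: instances covered."""
    return nc / n

def laplace(nc, n, k):
    """Laplace-corrected accuracy with k classes."""
    return (nc + 1) / (n + k)

def m_estimate(nc, n, m, p):
    """m-estimate with prior probability p and equivalent sample size m."""
    return (nc + m * p) / (n + m)

# R1 covers 50 positive and 5 negative examples; R2 covers 2 positive, 0 negative.
print(accuracy(50, 55), accuracy(2, 2))        # 0.909..., 1.0
print(laplace(50, 55, 2), laplace(2, 2, 2))    # 0.894... > 0.75, so R1 wins
```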
-
FOIL's Information Gain (Rule Evaluation contd.)
r: A → + covers p_0 positive examples and n_0 negative examples.
Suppose we add a new conjunct B; the extended rule becomes
r': A ∧ B → +, covering p_1 positive examples and n_1 negative examples.
Then FOIL's information gain $= p_1 \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)$
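Transcribed directly into Python (the function name is mine):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending rule A (covering p0 positive,
    n0 negative examples) with conjunct B, giving A^B (p1 positive, n1 negative)."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))
```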
-
Q2. Consider two rules: R1: A → C, R2: A ∧ B → C.
Suppose R1 covers 350 positive examples and 150 negative examples, while R2 covers 300 positive examples and 50 negative examples. Compute FOIL's information gain for rule R2 with respect to R1.
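As a quick check of Q2 with the `foil_gain` sketch above (R1 plays the role of the original rule A → C, R2 the extended rule A ∧ B → C):

```python
print(foil_gain(p0=350, n0=150, p1=300, n1=50))
# 300 * (log2(300/350) - log2(350/500)) ≈ 87.65
```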
Q4. Page no. 317
-
Aspects of Sequential Covering Algorithm
Rule Growing Strategy
Rule Evaluation
Stopping Criterion
Rule Pruning
Instance Elimination
-
Stopping Criterion and Rule Pruning
Stopping criterion:
Compute the gain.
If the gain is not significant, discard the new rule.
Rule pruning:
Remove one of the conjuncts in the rule.
Compare the error rate on a validation set before and after pruning.
If the error improves, prune the conjunct.
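A minimal sketch of the pruning step, assuming a rule is a dict of conjuncts and `error_rate` is a hypothetical helper that measures the rule's error on the validation set:

```python
def prune_rule(rule, validation, error_rate):
    """Greedily drop conjuncts while the validation error improves.

    rule: dict of attribute -> value conjuncts
    error_rate(rule, validation): fraction of validation records the rule
        misclassifies (hypothetical helper)
    """
    improved = True
    while improved and rule:
        improved = False
        base = error_rate(rule, validation)
        for attr in list(rule):
            cand = {k: v for k, v in rule.items() if k != attr}
            if error_rate(cand, validation) < base:  # error improves: prune
                rule, improved = cand, True
                break
    return rule
```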
-
Aspects of Sequential Covering Algorithm
Rule Growing Strategy
Rule Evaluation
Stopping Criterion
Rule Pruning
Instance Elimination
-
Instance Elimination
Why do we need to eliminate instances? Otherwise, the next rule is identical to the previous rule.
[Figure: a two-dimensional scatter of class = + and class = − training instances, with regions R1 and R3 marking the coverage of successive rules.]
Why do we remove positive and negative instances?
Ensure that the next rule is different.
Prevent underestimating the accuracy of the rule.
-
Indirect Methods for Rule-Based Classifiers and Instance-Based Classifiers
Lecture 26/17-09-09
-
Indirect Methods (generating a rule set from a decision tree)
[Figure: decision tree with root P; P=No leads to a Q test, P=Yes leads to an R test, and R=Yes leads to a second Q test; the leaves carry the class labels in the rules below.]
Rule Set:
r1: (P=No, Q=No) ==> −
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> −
r5: (P=Yes, R=Yes, Q=Yes) ==> +
Consider r2, r3, and r5: the class label is always predicted as + when Q=Yes.
So we can simplify r2 to: (Q=Yes) ==> +
Simplified rules:
r2': (Q=Yes) ==> +
r3: (P=Yes) ∧ (R=No) ==> +
-
Classification Rules Extracted from a Decision Tree
C4.5rules:
(Give Birth=No, Live in Water=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Live in Water=No, Can Fly=No) → Reptiles
( ) → Amphibians
[Figure: decision tree — Give Birth? Yes → Mammals; No → Live in Water? Yes → Fishes, Sometimes → Amphibians, No → Can Fly? Yes → Birds, No → Reptiles.]
-
Advantages of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
-
Instance-Based Classifiers
-
Eager Learners vs. Lazy Learners
Eager learners:
Decision-tree and rule-based classifiers are examples of eager learners.
They are designed to learn a model that maps the input attributes to the class label as soon as training data becomes available.
Lazy learners:
They delay the process of modeling the training data until an unseen instance to be classified is provided.
Instance-based classifiers belong to this class.
They memorize the entire training data and perform classification only when a test instance's attributes are matched against it.
-
Instance-Based Classifiers
[Figure: a set of stored cases — a table with attributes Atr1 ... AtrN and class labels A, B, B, C, A, C, B — and an unseen case with attributes Atr1 ... AtrN.]
Store the training records.
Use the training records to predict the class label of unseen cases.
-
Instance-Based Classifiers: Examples
Rote-learner (classifier):
Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.
Its drawback is that some test records may not be classified at all because they don't match any instance in the training data.
Solution?
Nearest neighbor:
Uses the k closest points (nearest neighbors) to perform classification.
-
Nearest Neighbor Classifiers
Basic idea: the main justification of the nearest-neighbor classifier is captured by the following example:
If it walks like a duck, quacks like a duck, and looks like a duck, then it's probably a duck.
[Figure: training records and a test record — compute the distance from the test record to the training records, then choose the k nearest records.]
-
A nearest-neighbor classifier represents each instance as a data point in a d-dimensional space, where d is the number of attributes.
-
Nearest-Neighbor Classifiers
Requires three things:
The set of stored records
A distance metric to compute the distance between records
The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
Compute its distance to the other training records
Identify the k nearest neighbors
Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
[Figure: an unknown record plotted among the labeled training records.]
-
Definition of Nearest Neighbor
[Figure: panels (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor — circles around a test point X enclosing its 1, 2, and 3 closest training points.]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
-
Nearest Neighbor Classification
Compute the distance between two points, e.g., the Euclidean distance:
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
Determine the class from the nearest-neighbor list: take the majority vote of the class labels among the k nearest neighbors.
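In Python, the Euclidean distance is a one-liner (a minimal sketch):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
```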
-
Nearest Neighbor Classification
Choosing the value of k:
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighborhood may include points from other classes.
[Figure: k-nearest-neighbor classification with a large k.]
-
Nearest Neighbor Classification
Scaling issues:
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
Example:
the height of a person may vary from 1.5 m to 1.8 m
the weight of a person may vary from 90 lb to 300 lb
the income of a person may vary from $10K to $1M
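One common remedy (my choice; the slide only names the problem) is min-max scaling each attribute to [0, 1] before computing distances:

```python
def min_max_scale(rows):
    """Scale each column of a list of numeric vectors to [0, 1]."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

# Height (m), weight (lb), income ($): wildly different ranges.
data = [[1.5, 90, 10_000], [1.8, 300, 1_000_000], [1.6, 150, 50_000]]
print(min_max_scale(data))  # income no longer dominates the distance
```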
-
Nearest Neighbor Classification
k-NN classifiers are lazy learners:
They do not build models explicitly,
unlike eager learners such as decision-tree induction and rule-based systems.
Classifying unknown records is relatively expensive.
-
Algorithm (k-nearest neighbor classification)
1: Let k be the number of nearest neighbors and D the set of training examples.
2: for each test example z = (x', y') do
3:   Compute d(x', x), the distance between z and every example (x, y) ∈ D.
4:   Select Dz ⊆ D, the set of the k closest training examples to z.
5:   $y' = \arg\max_v \sum_{(x_i, y_i) \in D_z} I(v = y_i)$
6: end for
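Putting the algorithm together as a minimal, self-contained Python sketch (majority voting via `collections.Counter`; ties are broken arbitrarily):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, x_new, k, dist=euclidean):
    """Classify x_new by majority vote of its k nearest neighbors.

    train: list of (vector, label) pairs (the set D)
    """
    # Steps 3-4: distance to every training example; keep the k closest (Dz).
    neighbors = sorted(train, key=lambda xy: dist(xy[0], x_new))[:k]
    # Step 5: majority vote, argmax_v of sum I(v = y_i).
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Example usage:
train = [([1.0, 1.0], '+'), ([1.2, 0.8], '+'), ([0.9, 1.1], '+'),
         ([5.0, 5.0], '-'), ([4.8, 5.2], '-')]
print(knn_classify(train, [1.1, 1.0], k=3))   # '+'
```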
-
Once the nearest-neighbor list is obtained, the test sample is classified based on the majority class of its nearest neighbors.
Majority voting: $y' = \arg\max_v \sum_{(x_i, y_i) \in D_z} I(v = y_i)$
where v is a class label, y_i is the class label of one of the nearest neighbors, and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.
-
In the majority voting approach, every neighbor has the same impact on the classification. (Refer to the figure on slide 15.)
This makes the classification algorithm sensitive to the choice of k.
To reduce this impact of k, we assign a weight to each nearest neighbor x_i according to its distance: $w_i = 1 / d(x', x_i)^2$.
-
As a result of applying weights to the distances, training examples located far away from z have a weaker impact on the classification.
Using the distance-weighted voting scheme, the class label can be determined as
Distance-weighted voting: $y' = \arg\max_v \sum_{(x_i, y_i) \in D_z} w_i \times I(v = y_i)$
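The change from the majority-vote sketch is small: each neighbor votes with weight 1/d² instead of counting once (the epsilon guard against zero distance is my addition):

```python
from collections import defaultdict

def knn_classify_weighted(train, x_new, k, dist):
    """Distance-weighted k-NN: neighbor x_i votes with weight 1 / d(x', x_i)^2."""
    neighbors = sorted(train, key=lambda xy: dist(xy[0], x_new))[:k]
    votes = defaultdict(float)
    for x_i, y_i in neighbors:
        d = dist(x_i, x_new)
        votes[y_i] += 1.0 / (d * d + 1e-12)   # epsilon avoids division by zero
    return max(votes, key=votes.get)          # argmax_v sum w_i * I(v = y_i)
```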
-
Characteristics
1. Nearest-neighbor classification is a part of instance-based learning.
2. Lazy learners like nearest-neighbor classifiers do not need model building.
3. Nearest-neighbor classifiers make their predictions based on local information, whereas decision-tree and rule-based classifiers attempt to find a global model that fits the entire input space.
4. Appropriate proximity measures play a significant role in nearest-neighbor classifiers.