Multi-label Classification without Multi-label Cost – Multi-label Random Decision Tree Classifier
1. IBM Research – China  2. IBM T.J. Watson Research Center
Presenter: Xiatian Zhang [email protected]
Authors: Xiatian Zhang, Quan Yuan, Shiwan Zhao, Wei Fan, Wentao Zheng, Zhong Wang
Multi-label Classification
Classical Classification (Single-label Classification)
– The classes are exclusive: if an example belongs to one class, it cannot belong to the others
Multi-label Classification
– A picture, video, or article may belong to several compatible categories
– A gene can control several biological functions
[Example picture labels: Tree, Lake, Ice, Winter, Park]
Existing Multi-label Classification Methods
Grigorios Tsoumakas et al. [2007] summarize the existing methods for multi-label classification
Two Strategies
– Problem Transformation
– Transform the multi-label classification problem into single-label classification problems
– Algorithm Adaptation
– Adapt single-label classifiers to solve the multi-label classification problem
– With high complexity
Problem Transformation Approaches
Label Powerset (LP)
– Label Powerset considers each unique subset of labels that exists in the multi-label dataset as a single label (a small sketch follows the diagram below)
[Diagram: label subsets over L1, L2, L3 are mapped to new single labels]
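As a rough, hedged illustration of the LP idea (not code from the paper; the function name label_powerset_transform and the toy label sets are made up for this example):

```python
# Sketch of the Label Powerset (LP) transformation: each distinct label subset
# observed in the training data becomes one single "meta-class".
def label_powerset_transform(Y):
    """Y: list of label sets, e.g. [{'L1', 'L2'}, {'L3'}, {'L1', 'L2'}]."""
    subset_to_class = {}
    y_single = []
    for labels in Y:
        key = frozenset(labels)
        if key not in subset_to_class:
            subset_to_class[key] = len(subset_to_class)  # assign a new meta-class id
        y_single.append(subset_to_class[key])
    return y_single, subset_to_class

# Three multi-label examples collapse into two single-label classes:
y, mapping = label_powerset_transform([{'L1', 'L2'}, {'L3'}, {'L1', 'L2'}])
# y == [0, 1, 0]
```

Note how any subset that appears only a few times becomes a rare meta-class, which is exactly the weakness discussed later for large label sets.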
Binary Relevance (BR)
– Binary Relevance learns one binary classifier for each label (a sketch follows the diagram below)
[Diagram: labels L1–L4; for each label Li the examples are split into Li+ and Li-, and a separate classifier is trained (Classifier1, Classifier2, Classifier3, ...)]
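A minimal Binary Relevance sketch, assuming a scikit-learn-style base learner (the choice of LogisticRegression and the helper names are illustrative, not from the paper):

```python
# Sketch of Binary Relevance (BR): one independent binary classifier per label.
from sklearn.linear_model import LogisticRegression

def train_binary_relevance(X, Y, all_labels):
    """X: feature matrix; Y: list of label sets; all_labels: every possible label."""
    classifiers = {}
    for label in all_labels:
        y_binary = [1 if label in labels else 0 for labels in Y]   # Li+ vs Li-
        clf = LogisticRegression()
        clf.fit(X, y_binary)              # one full training pass per label
        classifiers[label] = clf
    return classifiers

def predict_binary_relevance(classifiers, x):
    return {label for label, clf in classifiers.items() if clf.predict([x])[0] == 1}
```

The loop over all_labels is why BR's training cost grows linearly with |L|, which motivates the label-free tree construction described later.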
Large Number of Labels Problem
Hundreds of labels or even more
– Text categorization
– Protein function classification
– Semantic annotation of multimedia
The Impact on Multi-label Classification Methods
– Label Powerset: the number of training examples for each particular label subset becomes much smaller
– Binary Relevance: the computational complexity grows linearly with the number of labels
– Algorithm Adaptation: even worse than Binary Relevance
HOMER for the Large Number of Labels Problem
HOMER (Hierarchy Of Multilabel classifERs) was developed by Grigorios Tsoumakas et al., 2008.
The HOMER algorithm constructs a Hierarchy Of Multilabel classifERs, each one dealing with a much smaller set of labels.
Our Method – Without Label Cost
Without Label Cost
– Training time is almost independent of the number of labels |L|
But with Reliable Quality
– The classification quality is comparable to mainstream methods over different data sets.
How do we achieve this?
Our Method – Without Label Cost cont.
Binary Relevance Method based on Random Decision Trees
Random Decision Tree [Fan et al., 2003]
– The training process does not depend on label information
– Random construction with very low cost
– Stable quality in many applications
Random Decision Tree – Tree Construction
At each node, an un-used feature is chosen randomly– A discrete feature is un-used if it has never been chosen
previously on a given decision path starting from the root to the current node.
– A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen
It stop when one of the following happens:– A node becomes too small (<= 4 examples).
– Or the total height of the tree exceeds some limits:
– Such as the total number of features.
The construction process is irrelevant with label information
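A minimal sketch of this construction, assuming numerically coded features; the names Node, build_random_tree, and MIN_NODE_SIZE are hypothetical and not the authors' implementation:

```python
# Random decision tree construction (after Fan et al., 2003): features and
# thresholds are picked at random, so labels are never consulted here.
import random

MIN_NODE_SIZE = 4                      # "a node becomes too small (<= 4 examples)"

class Node:
    def __init__(self):
        self.feature = None            # None marks a leaf
        self.threshold = None
        self.left = self.right = None
        self.counts = {}               # filled later by the statistics pass

def build_random_tree(X, feature_types, used_discrete=frozenset(), depth=0, max_depth=None):
    """X: list of feature vectors; feature_types[i] is 'discrete' or 'continuous'."""
    node = Node()
    if max_depth is None:
        max_depth = len(feature_types)             # e.g. the total number of features
    if len(X) <= MIN_NODE_SIZE or depth >= max_depth:
        return node                                # stop: leaf
    candidates = [i for i, t in enumerate(feature_types)
                  if t == 'continuous' or i not in used_discrete]
    if not candidates:
        return node
    f = random.choice(candidates)                  # random, label-free choice
    threshold = random.choice([x[f] for x in X])   # random split value
    left  = [x for x in X if x[f] <  threshold]
    right = [x for x in X if x[f] >= threshold]
    if not left or not right:
        return node                                # degenerate split: keep as leaf
    node.feature, node.threshold = f, threshold
    used = used_discrete | ({f} if feature_types[f] == 'discrete' else set())
    node.left  = build_random_tree(left,  feature_types, used, depth + 1, max_depth)
    node.right = build_random_tree(right, feature_types, used, depth + 1, max_depth)
    return node
```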
Random Decision Tree - Node Statistics
Classification and Probability Estimation:
– Each node of the tree keeps the number of examples belonging to each class.
The node statistics step costs little computational resource (a sketch follows the diagram below)
[Tree diagram: root test F1<0.5, children F2>0.7 and F3>0.3, example leaf counts +:200/-:10 and +:30/-:70]
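A hedged sketch of this statistics pass, reusing the hypothetical Node/build_random_tree structure from the previous sketch (single-label case first; the multi-label variant comes later):

```python
# Node statistics: route each training example down the already built tree and
# increment per-class counts at every node it passes; the structure is unchanged.
def collect_statistics(tree, X, y):
    """X: feature vectors; y: class labels ('+' or '-') aligned with X."""
    for x, label in zip(X, y):
        node = tree
        while node is not None:
            node.counts[label] = node.counts.get(label, 0) + 1
            if node.feature is None:               # reached a leaf
                break
            node = node.left if x[node.feature] < node.threshold else node.right
```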
Random Decision Tree - Classification
During classification, each tree outputs a posterior probability (sketched below):
P(+|x) = 30/100 = 0.3
[Same tree diagram as above: x reaches the leaf with counts +:30/-:70]
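Continuing the same hypothetical sketch, a single tree's posterior estimate for x comes straight from the counts stored at the leaf x falls into:

```python
# Per-tree classification: follow x to a leaf and read off P(+|x) from the
# stored counts, e.g. 30 positives out of 100 examples gives 0.3 as on the slide.
def tree_posterior(tree, x, positive='+'):
    node = tree
    while node.feature is not None:
        node = node.left if x[node.feature] < node.threshold else node.right
    total = sum(node.counts.values())
    return node.counts.get(positive, 0) / total if total else 0.5   # flat prior fallback
```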
Random Decision Tree - Ensemble
For an instance x, average the estimated probabilities from the individual trees and take the average as the predicted probability for x (a one-line sketch follows the diagram).
P(+|x) = 30/100 = 0.3, P'(+|x) = 30/50 = 0.6
(P(+|x) + P'(+|x)) / 2 = 0.45
[Two tree diagrams: tree 1 splits on F1<0.5, F2>0.7, F3>0.3 with leaf counts +:200/-:10 and +:30/-:70; tree 2 splits on F3>0.3, F2<0.6, F1>0.7 with leaf counts +:100/-:120 and +:30/-:20]
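The ensemble step then needs only one more line in this sketch:

```python
# Ensemble: average the per-tree posteriors, e.g. (0.3 + 0.6) / 2 = 0.45 as above.
def ensemble_posterior(trees, x, positive='+'):
    return sum(tree_posterior(t, x, positive) for t in trees) / len(trees)
```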
Multi-label Random Decision Tree
[Two tree diagrams: the same random trees as before, but each node now stores per-label counts such as L1+:30/L1-:70 and L2+:50/L2-:50]
P(L1+|x) = 30/100 = 0.3, P'(L1+|x) = 30/50 = 0.6
P(L2+|x) = 50/100 = 0.5, P'(L2+|x) = 20/100 = 0.2
(P(L1+|x) + P'(L1+|x)) / 2 = 0.45
(P(L2+|x) + P'(L2+|x)) / 2 = 0.35
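A hedged sketch of the multi-label extension in the same hypothetical setting: the trees themselves are unchanged, only the statistics pass and the read-out keep one counter per label, so adding labels adds counters rather than trees or training passes.

```python
# Multi-label node statistics: each node stores its total example count plus a
# counter for every label that actually reaches it (roughly t labels per node).
def collect_multilabel_statistics(tree, X, Y):
    """Y: list of label sets aligned with X."""
    for x, labels in zip(X, Y):
        node = tree
        while node is not None:
            node.total = getattr(node, 'total', 0) + 1
            for label in labels:                    # only the labels present on x
                node.counts[label] = node.counts.get(label, 0) + 1
            if node.feature is None:
                break
            node = node.left if x[node.feature] < node.threshold else node.right

def multilabel_posteriors(trees, x, all_labels):
    """Average P(label|x) over trees, e.g. (0.3 + 0.6) / 2 = 0.45 as on the slide."""
    probs = {label: 0.0 for label in all_labels}
    for tree in trees:
        node = tree
        while node.feature is not None:             # descend to the leaf for x
            node = node.left if x[node.feature] < node.threshold else node.right
        total = getattr(node, 'total', 0)
        for label in all_labels:
            probs[label] += node.counts.get(label, 0) / total if total else 0.0
    return {label: p / len(trees) for label, p in probs.items()}
```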
Why RDT Works?
Ensemble Learning View
– Our Analysis
– Other Explanations
Non-Parametric Estimation
Complexity of Multi-label Random Decision Tree
Training Complexity:
– m is the number of trees, and n is the number of instances
– t is the average number of labels on each leaf node, with t << n and t << |L|.
– It is independent of the number of labels |L|.
– Complexity of C4.5:
Vi is the number of values of the i-th attribute.
– Complexity of HOMER:
Test Complexity:
– q is the average depth of the tree branches
– It is also independent of the number of labels |L|
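The slide images with the actual formulas did not survive in this transcript; purely as an assumption-labeled illustration of how the stated symbols (m trees, n instances, q average depth, t labels per leaf) could combine for a method of this shape, one plausible form is

$$\text{training} \;\approx\; O\!\big(m \cdot n \cdot (q + t)\big), \qquad \text{testing per instance} \;\approx\; O\!\big(m \cdot (q + t)\big),$$

neither of which involves the number of labels |L|.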
Experiment – Metrics and Datasets
Quality Metrics:
Datasets:
Experiment - Quality
Experiment – Computational Cost
Experiment – Computational Cost cont.
Experiment – Computational Cost cont.
Future Work
Leverage the relationships among labels.
Apply ML-RDT to Recommendation
Parallelization and Streaming Implementation