Multi-label Classification without Multi-label Cost – Multi-label Random Decision Tree Classifier
1. IBM Research – China  2. IBM T.J. Watson Research Center
Presenter: Xiatian Zhang [email protected]
Authors: Xiatian Zhang, Quan Yuan, Shiwan Zhao, Wei Fan, Wentao Zheng, Zhong Wang
Multi-label Classification
Classical Classification (Single-label Classification)
– The classes are exclusive: if an example belongs to one class, it cannot belong to the others
Multi-label Classification
– A picture, video, or article may belong to several compatible categories
– A gene can control several biological functions
[Example picture labels: Tree, Lake, Ice, Winter, Park]
Existing Multi-label Classification Methods
Grigorios Tsoumakas et al. [2007] summarize the existing methods for multi-label classification
Two Strategies
– Problem Transformation
– Transform the multi-label classification problem into single-label classification problems
– Algorithm Adaptation
– Adapt single-label classifiers to solve the multi-label classification problem
– With high complexity
Problem Transformation Approaches
Label Powerset (LP)
– Label Powerset considers each unique subset of labels that exists in the multi-label dataset as a single label (a small sketch follows the diagram below)
[Diagram: label subsets over L1, L2, L3 are mapped to new single labels]
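As a rough, hedged illustration of the LP idea (not code from the paper; the function name label_powerset_transform and the toy label sets are made up for this example):

```python
# Sketch of the Label Powerset (LP) transformation: each distinct label subset
# observed in the training data becomes one single "meta-class".
def label_powerset_transform(Y):
    """Y: list of label sets, e.g. [{'L1', 'L2'}, {'L3'}, {'L1', 'L2'}]."""
    subset_to_class = {}
    y_single = []
    for labels in Y:
        key = frozenset(labels)
        if key not in subset_to_class:
            subset_to_class[key] = len(subset_to_class)  # assign a new meta-class id
        y_single.append(subset_to_class[key])
    return y_single, subset_to_class

# Three multi-label examples collapse into two single-label classes:
y, mapping = label_powerset_transform([{'L1', 'L2'}, {'L3'}, {'L1', 'L2'}])
# y == [0, 1, 0]
```

Note how any subset that appears only a few times becomes a rare meta-class, which is exactly the weakness discussed later for large label sets.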
Binary Relevance (BR)
– Binary Relevance learns one binary classifier for each label (a sketch follows the diagram below)
[Diagram: labels L1–L4; for each label Li the examples are split into Li+ and Li-, and a separate classifier is trained (Classifier1, Classifier2, Classifier3, ...)]
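A minimal Binary Relevance sketch, assuming a scikit-learn-style base learner (the choice of LogisticRegression and the helper names are illustrative, not from the paper):

```python
# Sketch of Binary Relevance (BR): one independent binary classifier per label.
from sklearn.linear_model import LogisticRegression

def train_binary_relevance(X, Y, all_labels):
    """X: feature matrix; Y: list of label sets; all_labels: every possible label."""
    classifiers = {}
    for label in all_labels:
        y_binary = [1 if label in labels else 0 for labels in Y]   # Li+ vs Li-
        clf = LogisticRegression()
        clf.fit(X, y_binary)              # one full training pass per label
        classifiers[label] = clf
    return classifiers

def predict_binary_relevance(classifiers, x):
    return {label for label, clf in classifiers.items() if clf.predict([x])[0] == 1}
```

The loop over all_labels is why BR's training cost grows linearly with |L|, which motivates the label-free tree construction described later.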
Large Number of Labels Problem
Hundreds of labels or even more
– Text categorization
– Protein function classification
– Semantic annotation of multimedia
The Impact on Multi-label Classification Methods
– Label Powerset: the number of training examples for each particular label subset becomes much smaller
– Binary Relevance: the computational complexity grows linearly with the number of labels
– Algorithm Adaptation: even worse than Binary Relevance
HOMER for the Large Number of Labels Problem
HOMER (Hierarchy Of Multilabel classifERs) was developed by Grigorios Tsoumakas et al., 2008.
The HOMER algorithm constructs a Hierarchy Of Multilabel classifERs, each one dealing with a much smaller set of labels.
Our Method – Without Label Cost
Without Label Cost
– Training time is almost independent of the number of labels |L|
But with Reliable Quality
– The classification quality is comparable to mainstream methods over different data sets.
How do we achieve this?
Our Method – Without Label Cost cont.
Binary Relevance Method based on Random Decision Trees
Random Decision Tree [Fan et al., 2003]
– The training process does not depend on label information
– Random construction with very low cost
– Stable quality in many applications
Random Decision Tree – Tree Construction
At each node, an un-used feature is chosen randomly– A discrete feature is un-used if it has never been chosen
previously on a given decision path starting from the root to the current node.
– A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen
It stop when one of the following happens:– A node becomes too small (<= 4 examples).
– Or the total height of the tree exceeds some limits:
– Such as the total number of features.
The construction process is irrelevant with label information
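A minimal sketch of this construction, assuming numerically coded features; the names Node, build_random_tree, and MIN_NODE_SIZE are hypothetical and not the authors' implementation:

```python
# Random decision tree construction (after Fan et al., 2003): features and
# thresholds are picked at random, so labels are never consulted here.
import random

MIN_NODE_SIZE = 4                      # "a node becomes too small (<= 4 examples)"

class Node:
    def __init__(self):
        self.feature = None            # None marks a leaf
        self.threshold = None
        self.left = self.right = None
        self.counts = {}               # filled later by the statistics pass

def build_random_tree(X, feature_types, used_discrete=frozenset(), depth=0, max_depth=None):
    """X: list of feature vectors; feature_types[i] is 'discrete' or 'continuous'."""
    node = Node()
    if max_depth is None:
        max_depth = len(feature_types)             # e.g. the total number of features
    if len(X) <= MIN_NODE_SIZE or depth >= max_depth:
        return node                                # stop: leaf
    candidates = [i for i, t in enumerate(feature_types)
                  if t == 'continuous' or i not in used_discrete]
    if not candidates:
        return node
    f = random.choice(candidates)                  # random, label-free choice
    threshold = random.choice([x[f] for x in X])   # random split value
    left  = [x for x in X if x[f] <  threshold]
    right = [x for x in X if x[f] >= threshold]
    if not left or not right:
        return node                                # degenerate split: keep as leaf
    node.feature, node.threshold = f, threshold
    used = used_discrete | ({f} if feature_types[f] == 'discrete' else set())
    node.left  = build_random_tree(left,  feature_types, used, depth + 1, max_depth)
    node.right = build_random_tree(right, feature_types, used, depth + 1, max_depth)
    return node
```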
Random Decision Tree - Node Statistics
Classification and Probability Estimation:
– Each node of the tree keeps the number of examples belonging to each class.
The node statistics step costs little computational resource (a sketch follows the diagram below)
[Tree diagram: root test F1<0.5, children F2>0.7 and F3>0.3, example leaf counts +:200/-:10 and +:30/-:70]
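A hedged sketch of this statistics pass, reusing the hypothetical Node/build_random_tree structure from the previous sketch (single-label case first; the multi-label variant comes later):

```python
# Node statistics: route each training example down the already built tree and
# increment per-class counts at every node it passes; the structure is unchanged.
def collect_statistics(tree, X, y):
    """X: feature vectors; y: class labels ('+' or '-') aligned with X."""
    for x, label in zip(X, y):
        node = tree
        while node is not None:
            node.counts[label] = node.counts.get(label, 0) + 1
            if node.feature is None:               # reached a leaf
                break
            node = node.left if x[node.feature] < node.threshold else node.right
```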
Random Decision Tree - Classification
During classification, each tree outputs a posterior probability (sketched below):
P(+|x) = 30/100 = 0.3
[Same tree diagram as above: x reaches the leaf with counts +:30/-:70]
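Continuing the same hypothetical sketch, a single tree's posterior estimate for x comes straight from the counts stored at the leaf x falls into:

```python
# Per-tree classification: follow x to a leaf and read off P(+|x) from the
# stored counts, e.g. 30 positives out of 100 examples gives 0.3 as on the slide.
def tree_posterior(tree, x, positive='+'):
    node = tree
    while node.feature is not None:
        node = node.left if x[node.feature] < node.threshold else node.right
    total = sum(node.counts.values())
    return node.counts.get(positive, 0) / total if total else 0.5   # flat prior fallback
```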
Random Decision Tree - Ensemble
For an instance x, average the estimated probabilities from the individual trees and take the average as the predicted probability for x (a one-line sketch follows the diagram).
P(+|x) = 30/100 = 0.3, P'(+|x) = 30/50 = 0.6
(P(+|x) + P'(+|x)) / 2 = 0.45
[Two tree diagrams: tree 1 splits on F1<0.5, F2>0.7, F3>0.3 with leaf counts +:200/-:10 and +:30/-:70; tree 2 splits on F3>0.3, F2<0.6, F1>0.7 with leaf counts +:100/-:120 and +:30/-:20]
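The ensemble step then needs only one more line in this sketch:

```python
# Ensemble: average the per-tree posteriors, e.g. (0.3 + 0.6) / 2 = 0.45 as above.
def ensemble_posterior(trees, x, positive='+'):
    return sum(tree_posterior(t, x, positive) for t in trees) / len(trees)
```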
Multi-label Random Decision Tree
[Two tree diagrams: the same random trees as before, but each node now stores per-label counts such as L1+:30/L1-:70 and L2+:50/L2-:50]
P(L1+|x) = 30/100 = 0.3, P'(L1+|x) = 30/50 = 0.6
P(L2+|x) = 50/100 = 0.5, P'(L2+|x) = 20/100 = 0.2
(P(L1+|x) + P'(L1+|x)) / 2 = 0.45
(P(L2+|x) + P'(L2+|x)) / 2 = 0.35
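A hedged sketch of the multi-label extension in the same hypothetical setting: the trees themselves are unchanged, only the statistics pass and the read-out keep one counter per label, so adding labels adds counters rather than trees or training passes.

```python
# Multi-label node statistics: each node stores its total example count plus a
# counter for every label that actually reaches it (roughly t labels per node).
def collect_multilabel_statistics(tree, X, Y):
    """Y: list of label sets aligned with X."""
    for x, labels in zip(X, Y):
        node = tree
        while node is not None:
            node.total = getattr(node, 'total', 0) + 1
            for label in labels:                    # only the labels present on x
                node.counts[label] = node.counts.get(label, 0) + 1
            if node.feature is None:
                break
            node = node.left if x[node.feature] < node.threshold else node.right

def multilabel_posteriors(trees, x, all_labels):
    """Average P(label|x) over trees, e.g. (0.3 + 0.6) / 2 = 0.45 as on the slide."""
    probs = {label: 0.0 for label in all_labels}
    for tree in trees:
        node = tree
        while node.feature is not None:             # descend to the leaf for x
            node = node.left if x[node.feature] < node.threshold else node.right
        total = getattr(node, 'total', 0)
        for label in all_labels:
            probs[label] += node.counts.get(label, 0) / total if total else 0.0
    return {label: p / len(trees) for label, p in probs.items()}
```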
Why RDT Works?
Ensemble Learning View
– Our Analysis
– Other Explanations
Non-Parametric Estimation
Complexity of Multi-label Random Decision Tree
Training Complexity:
– m is the number of trees, and n is the number of instances
– t is the average number of labels on each leaf node, with t << n and t << |L|.
– It is independent of the number of labels |L|.
– Complexity of C4.5:
Vi is the number of values of the i-th attribute.
– Complexity of HOMER:
Test Complexity:
– q is the average depth of the tree branches
– It is also independent of the number of labels |L|
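The slide images with the actual formulas did not survive in this transcript; purely as an assumption-labeled illustration of how the stated symbols (m trees, n instances, q average depth, t labels per leaf) could combine for a method of this shape, one plausible form is

$$\text{training} \;\approx\; O\!\big(m \cdot n \cdot (q + t)\big), \qquad \text{testing per instance} \;\approx\; O\!\big(m \cdot (q + t)\big),$$

neither of which involves the number of labels |L|.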
Experiment – Metrics and Datasets
Quality Metrics:
Datasets:
Experiment - Quality
Experiment – Computational Cost
Experiment – Computational Cost cont.
Experiment – Computational Cost cont.
Future Work
Leverage the relationships among labels.
Apply ML-RDT to Recommendation
Parallelization and Streaming Implementation