Experience with Simple Approaches
Wei Fan‡ Erheng Zhong† Sihong Xie† Yuzhao Huang† Kun Zhang$
Jing Peng# Jiangtao Ren†
‡IBM T. J. Watson Research Center  †Sun Yat-sen University
$Xavier University of Louisiana  #Montclair State University
RDT: Random Decision Tree (Fan et al. '03)
"Encoding data" in trees. At each node, an unused feature is chosen randomly.
A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
A continuous feature can be chosen multiple times on the same decision path, but with a different threshold value each time.
Stop when one of the following happens: a node becomes too small, all examples at a node belong to the same class, or the total height of the tree exceeds a limit.
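The construction rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the node layout and names are assumed, and only the height limit is enforced here, since the data-driven stopping tests (node too small, or all examples of one class) require training data to be attached.

```python
import random

def build_rdt(features, used=frozenset(), depth=0, max_depth=4):
    """Grow one random decision tree structure.

    features: {name: ("discrete", values) or ("continuous", (lo, hi))}
    used: discrete features already chosen on this root-to-node path.
    """
    # A discrete feature may appear at most once per path; a continuous
    # feature stays available (it will get a fresh random threshold).
    candidates = [f for f, (kind, _) in features.items()
                  if kind == "continuous" or f not in used]
    if depth >= max_depth or not candidates:
        return {"leaf": True}
    name = random.choice(candidates)          # feature chosen at random
    kind, domain = features[name]
    if kind == "discrete":
        children = {v: build_rdt(features, used | {name}, depth + 1, max_depth)
                    for v in domain}
        return {"leaf": False, "feature": name, "children": children}
    lo, hi = domain
    thr = random.uniform(lo, hi)              # a different threshold each time
    return {"leaf": False, "feature": name, "threshold": thr,
            "children": {"<": build_rdt(features, used, depth + 1, max_depth),
                         ">=": build_rdt(features, used, depth + 1, max_depth)}}

# The three features from the illustration that follows:
feats = {"B1": ("discrete", (0, 1)), "B2": ("discrete", (0, 1)),
         "B3": ("continuous", (0.0, 1.0))}
tree = build_rdt(feats)
```

On any path of the resulting tree, B1 and B2 occur at most once each, while B3 may recur with distinct thresholds.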
Illustration of RDT
[Figure: a random decision tree over features B1 ∈ {0,1}, B2 ∈ {0,1}, and B3 (continuous). B1 is chosen randomly at the root (test B1 == 0?); B2 is chosen randomly below it (B2 == 0?); B3 is chosen randomly with a random threshold of 0.3 (B3 < 0.3?) and again, deeper on the same path, with a different random threshold of 0.6 (B3 < 0.6?).]
Probabilistic view of decision trees (PETs)
[Figure: decision tree on the iris data. Root split Petal.Length < 2.45 isolates setosa (leaf counts setosa/versicolor/virginica = 50/0/0); the split Petal.Width < 1.75 then separates a versicolor leaf (0/49/5) from a virginica leaf (0/1/45).]
For an example x reaching the versicolor leaf (0/49/5):
P(setosa|x,θ) = 0
P(versicolor|x,θ) = 49/54
P(virginica|x,θ) = 5/54
Given an example x, a PET outputs P(y|x,θ) = N_{y,L} / N_L, the fraction of class-y training examples among the N_L examples at the leaf L that x reaches (e.g., C4.5, CART). These estimates serve as:
• confidences in the predicted labels
• note that the dependence of P(y|x,θ) on θ is non-trivial
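The versicolor leaf above makes the frequency estimate P(y|x,θ) = N_{y,L}/N_L concrete; a minimal sketch using those leaf counts:

```python
# Class counts at the versicolor leaf of the iris tree above
# (setosa / versicolor / virginica = 0 / 49 / 5).
counts = {"setosa": 0, "versicolor": 49, "virginica": 5}
n_leaf = sum(counts.values())                     # N_L = 54 examples at the leaf
probs = {y: c / n_leaf for y, c in counts.items()}
# probs["versicolor"] == 49/54, probs["virginica"] == 5/54, probs["setosa"] == 0
```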
Problems of probability estimation via conventional DTs
1. Probability estimates tend to approach the extremes of 1 and 0.
2. Additional inaccuracies result from the small number of examples at a leaf.
3. The same probability is assigned to the entire region of space defined by a given leaf.
C4.4 (Provost '03)
BC44 (Zhang '06)
RDT (Fan '03)
bRDT
"bRDT" is the averaging of RDT and BC44, where RDT is Random Decision Tree and BC44 is bagged C4.4.
RDT: pr(y|x)
BC44: pb(y|x)
bRDT: [pr(y|x) + pb(y|x)] / 2
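The averaging above is a direct class-by-class mean of the two posteriors; a small sketch (the class labels and numeric values are illustrative, not from the talk):

```python
def brdt(pr, pb):
    """Average the RDT and bagged-C4.4 posterior estimates class by class."""
    return {y: (pr[y] + pb[y]) / 2 for y in pr}

# Illustrative two-class posteriors for one example x:
pr = {"pos": 0.75, "neg": 0.25}   # RDT estimate pr(y|x)
pb = {"pos": 0.25, "neg": 0.75}   # bagged C4.4 estimate pb(y|x)
posterior = brdt(pr, pb)          # {"pos": 0.5, "neg": 0.5}
```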
Sampling strategy for Tasks 1 & 2
For station Z, negative instances are partitioned into "blocks" such that the size of each block is approximately 3 times that of the positives.
[Figure: the negative instances are partitioned into Block 1 through Block n, shown alongside the positive instances.]
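The partition can be sketched as follows; the function name and the instance counts are illustrative assumptions, not details from the talk:

```python
def make_blocks(negatives, positives, ratio=3):
    """Partition the negatives into blocks of size ~ratio * len(positives)."""
    block_size = ratio * len(positives)
    return [negatives[i:i + block_size]
            for i in range(0, len(negatives), block_size)]

neg = list(range(100))   # illustrative negative instance ids
pos = list(range(10))    # illustrative positive instance ids
blocks = make_blocks(neg, pos)
# 100 negatives with 10 positives -> blocks of 30, 30, 30, and a remainder of 10
```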
Tasks 1 & 2 – Results. For station V, rows 2 and 3 correspond to Tasks 1 and 2. The optimal classifiers of Tasks 1 and 2 are the same for stations W, X, Y, and Z, so there is only one row for each of these 4 stations.
[Figure: Task 1 – ROC curves for stations V, W, X, Y, Z]
[Figure: Task 2 – ROC curves for stations V, W, X, Y, Z]
Task 3 – Feature Expansion
X → (X, X², ln(X+1))
Example: three instances with only one feature; A and B are positive while C is negative: A(0.9), B(1.0), C(1.1).
Distance(A, B) = Distance(B, C): 0.01 vs. 0.01 (squared Euclidean distances)
After expansion: A(0.9, 0.81, 0.64), B(1.0, 1.0, 0.69), C(1.1, 1.21, 0.74)
Distance(A, B) < Distance(B, C): 0.049 vs. 0.056
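The numbers above can be verified with squared Euclidean distances; a small sketch (`expand` and `sqdist` are illustrative names):

```python
import math

def expand(x):
    """Map a single feature x to the expanded vector (x, x^2, ln(x+1))."""
    return (x, x * x, math.log(x + 1))

def sqdist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

A, B, C = 0.9, 1.0, 1.1
# Before expansion, B sits exactly between A and C: both gaps are 0.01.
# After expansion, B moves closer to the positive A than to the negative C.
d_ab = sqdist(expand(A), expand(B))   # ~0.049
d_bc = sqdist(expand(B), expand(C))   # ~0.056
```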
Task 3 – Result of test 3
Parameter-free