7 - Decision Trees Learning



    Decision Trees Learning


    Outline

    Decision tree representation

    ID3 learning algorithm

    Entropy, information gain

    Overfitting


    Decision Tree for PlayTennis

    Attributes and their values:

    Outlook: Sunny, Overcast, Rain

    Humidity: High, Normal

    Wind: Strong, Weak

    Temperature: Hot, Mild, Cool

    Target concept - Play Tennis: Yes, No

    Decision Tree for PlayTennis

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Decision Tree for PlayTennis

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast
     └─ Rain

    Each internal node tests an attribute

    Each branch corresponds to an attribute value

    Each leaf node assigns a classification

    Decision Tree for PlayTennis

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Example to classify:

    Outlook  Temperature  Humidity  Wind  PlayTennis
    Sunny    Hot          High      Weak  ?

    The tree sorts this example down the Sunny/High branch: PlayTennis = No.

    Decision Tree for Conjunction

    Outlook=Sunny ∧ Wind=Weak

    Outlook
     ├─ Sunny    → Wind
     │              ├─ Strong → No
     │              └─ Weak   → Yes
     ├─ Overcast → No
     └─ Rain     → No

    Decision Tree for Disjunction

    Outlook=Sunny ∨ Wind=Weak

    Outlook
     ├─ Sunny    → Yes
     ├─ Overcast → Wind
     │              ├─ Strong → No
     │              └─ Weak   → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Decision Tree for XOR

    Outlook=Sunny XOR Wind=Weak

    Outlook
     ├─ Sunny    → Wind
     │              ├─ Strong → Yes
     │              └─ Weak   → No
     ├─ Overcast → Wind
     │              ├─ Strong → No
     │              └─ Weak   → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Decision Tree

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Decision trees represent disjunctions of conjunctions:

    (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)

    When to consider Decision Trees

    Instances describable by attribute-value pairs
      e.g. Humidity: High, Normal

    Target function is discrete valued
      e.g. PlayTennis: Yes, No

    Disjunctive hypothesis may be required
      e.g. Outlook=Sunny ∨ Wind=Weak

    Possibly noisy training data

    Missing attribute values

    Examples:
      Medical diagnosis
      Credit risk analysis
      Object classification for robot manipulator (Tan 1993)

    Top-Down Induction of Decision Trees: ID3

    1. A ← the best decision attribute for the next node
    2. Assign A as the decision attribute for node
    3. For each value of A, create a new descendant of node
    4. Sort training examples to leaf nodes according to the attribute value of the branch
    5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
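    A minimal Python sketch of these steps, assuming examples are dicts of attribute values plus a "label" key, and that best_attribute implements the information-gain choice introduced on the following slides:

        from collections import Counter

        def id3(examples, attributes, best_attribute):
            """Build a decision tree as nested dicts: {attribute: {value: subtree-or-label}}."""
            labels = [ex["label"] for ex in examples]
            if len(set(labels)) == 1:                 # all examples perfectly classified -> leaf
                return labels[0]
            if not attributes:                        # no attributes left -> majority-class leaf
                return Counter(labels).most_common(1)[0][0]
            a = best_attribute(examples, attributes)  # step 1: pick the best decision attribute
            tree = {a: {}}                            # step 2: assign A to this node
            for v in {ex[a] for ex in examples}:      # step 3: one descendant per value of A
                subset = [ex for ex in examples if ex[a] == v]   # step 4: sort examples to branches
                tree[a][v] = id3(subset, [x for x in attributes if x != a], best_attribute)
            return tree                               # step 5: recurse over the new leaf nodes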

    Which Attribute is best?

    A1=?  on S = [29+,35-]
      True  → [21+, 5-]
      False → [8+, 30-]

    A2=?  on S = [29+,35-]
      True  → [18+, 33-]
      False → [11+, 2-]

    Entropy

    S is a sample of training examples

    p+ is the proportion of positive examples

    p- is the proportion of negative examples

    Entropy measures the impurity of S:

    Entropy(S) = -p+ log2 p+ - p- log2 p-

    Entropy

    Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)

    Information theory: the optimal-length code assigns -log2 p bits to a message having probability p.

    So the expected number of bits to encode (+ or -) of a random member of S is:

    -p+ log2 p+ - p- log2 p-

    (Note: 0 log 0 = 0)
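    A small Python helper matching this definition for the two-class case (using the 0 log 0 = 0 convention):

        import math

        def entropy(pos, neg):
            """Entropy of a sample containing `pos` positive and `neg` negative examples."""
            total = pos + neg
            result = 0.0
            for count in (pos, neg):
                p = count / total
                if p > 0:                  # convention: 0 * log2(0) = 0
                    result -= p * math.log2(p)
            return result

        print(round(entropy(9, 5), 2))     # 0.94, the PlayTennis sample used later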

    Information Gain

    Gain(S,A): expected reduction in entropy due to sorting S on attribute A

    Gain(S,A) = Entropy(S) - Σ_{v ∈ values(A)} |Sv|/|S| · Entropy(Sv)

    A1=?  on S = [29+,35-]
      True  → [21+, 5-]
      False → [8+, 30-]

    A2=?  on S = [29+,35-]
      True  → [18+, 33-]
      False → [11+, 2-]

    Entropy([29+,35-]) = -29/64 log2(29/64) - 35/64 log2(35/64) = 0.99
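    The same formula as a Python sketch, reusing the entropy helper above; the split is passed as a list of (positive, negative) counts, one per attribute value:

        def gain(pos, neg, branches):
            """Information gain of splitting a (pos, neg) sample into `branches`,
            a list of (pos_v, neg_v) counts, one per value of the attribute."""
            total = pos + neg
            remainder = sum((p + n) / total * entropy(p, n) for p, n in branches)
            return entropy(pos, neg) - remainder

        print(round(gain(29, 35, [(21, 5), (8, 30)]), 2))   # A1: 0.27
        print(round(gain(29, 35, [(18, 33), (11, 2)]), 2))  # A2: 0.12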

    Information Gain

    A1=?  on S = [29+,35-]
      True  → [21+, 5-]    Entropy([21+,5-])  = 0.71
      False → [8+, 30-]    Entropy([8+,30-])  = 0.74

    Gain(S,A1) = Entropy(S) - (26/64)·Entropy([21+,5-]) - (38/64)·Entropy([8+,30-]) = 0.27

    A2=?  on S = [29+,35-]
      True  → [18+, 33-]   Entropy([18+,33-]) = 0.94
      False → [11+, 2-]    Entropy([11+,2-])  = 0.62

    Gain(S,A2) = Entropy(S) - (51/64)·Entropy([18+,33-]) - (13/64)·Entropy([11+,2-]) = 0.12

    Training Examples

    Day  Outlook   Temp.  Humidity  Wind    PlayTennis
    D1   Sunny     Hot    High      Weak    No
    D2   Sunny     Hot    High      Strong  No
    D3   Overcast  Hot    High      Weak    Yes
    D4   Rain      Mild   High      Weak    Yes
    D5   Rain      Cool   Normal    Weak    Yes
    D6   Rain      Cool   Normal    Strong  No
    D7   Overcast  Cool   Normal    Weak    Yes
    D8   Sunny     Mild   High      Weak    No
    D9   Sunny     Cool   Normal    Weak    Yes
    D10  Rain      Mild   Normal    Strong  Yes
    D11  Sunny     Mild   Normal    Strong  Yes
    D12  Overcast  Mild   High      Strong  Yes
    D13  Overcast  Hot    Normal    Weak    Yes
    D14  Rain      Mild   High      Strong  No

    Selecting the Next Attribute

    S = [9+,5-], E = 0.940

    Humidity
      High   → [3+, 4-]   E = 0.985
      Normal → [6+, 1-]   E = 0.592

    Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151

    Wind
      Weak   → [6+, 2-]   E = 0.811
      Strong → [3+, 3-]   E = 1.0

    Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048

    Humidity provides greater information gain than Wind, w.r.t. the target classification.

    Selecting the Next Attribute

    S = [9+,5-], E = 0.940

    Outlook
      Sunny    → [2+, 3-]   E = 0.971
      Overcast → [4+, 0-]   E = 0.0
      Rain     → [3+, 2-]   E = 0.971

    Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247

    Selecting the Next Attribute

    S = [9+,5-], E = 0.940

    Temperature
      Hot  → [2+, 2-]   E = 1.0
      Mild → [4+, 2-]   E = 0.918
      Cool → [3+, 1-]   E = 0.811

    Gain(S, Temperature) = 0.940 - (4/14)·1.0 - (6/14)·0.918 - (4/14)·0.811 = 0.029

    Selecting the Next Attribute

    The information gain values for the 4 attributes are:

    Gain(S, Outlook)     = 0.247
    Gain(S, Humidity)    = 0.151
    Gain(S, Wind)        = 0.048
    Gain(S, Temperature) = 0.029

    where S denotes the collection of training examples

    Note: 0 log2 0 = 0
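    These four values can be reproduced directly from the training table; a small self-contained check in Python (the 0.152 vs 0.151 for Humidity is just rounding):

        import math
        from collections import Counter

        data = [  # rows D1..D14: (Outlook, Temperature, Humidity, Wind, PlayTennis)
            ("Sunny","Hot","High","Weak","No"),        ("Sunny","Hot","High","Strong","No"),
            ("Overcast","Hot","High","Weak","Yes"),    ("Rain","Mild","High","Weak","Yes"),
            ("Rain","Cool","Normal","Weak","Yes"),     ("Rain","Cool","Normal","Strong","No"),
            ("Overcast","Cool","Normal","Weak","Yes"), ("Sunny","Mild","High","Weak","No"),
            ("Sunny","Cool","Normal","Weak","Yes"),    ("Rain","Mild","Normal","Strong","Yes"),
            ("Sunny","Mild","Normal","Strong","Yes"),  ("Overcast","Mild","High","Strong","Yes"),
            ("Overcast","Hot","Normal","Weak","Yes"),  ("Rain","Mild","High","Strong","No"),
        ]

        def H(labels):
            """Entropy of a list of class labels."""
            n = len(labels)
            return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

        target = [row[-1] for row in data]
        for name, col in {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}.items():
            remainder = 0.0
            for v in {row[col] for row in data}:
                part = [row[-1] for row in data if row[col] == v]
                remainder += len(part) / len(data) * H(part)
            print(name, round(H(target) - remainder, 3))
        # Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048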

    ID3 Algorithm

    S = [D1,D2,...,D14]   [9+,5-]

    Outlook
     ├─ Sunny    → Ssunny = [D1,D2,D8,D9,D11]   [2+,3-]   ?  (test needed for this node)
     ├─ Overcast → [D3,D7,D12,D13]   [4+,0-]   Yes
     └─ Rain     → [D4,D5,D6,D10,D14]   [3+,2-]   ?  (test needed for this node)

    Gain(Ssunny, Humidity)    = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
    Gain(Ssunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
    Gain(Ssunny, Wind)        = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019

    ID3 Algorithm

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No   [D1,D2,D8]
     │              └─ Normal → Yes  [D9,D11]
     ├─ Overcast → Yes  [D3,D7,D12,D13]
     └─ Rain     → Wind
                    ├─ Strong → No   [D6,D14]
                    └─ Weak   → Yes  [D4,D5,D10]

    Hypothesis Space Search ID3

    [Figure: ID3 searches the space of decision trees from simple to complex, starting from the empty tree, then single-attribute trees (A1, A2, ...), then progressively larger refinements (e.g. A2 extended with A3 or A4), guided by the +/- labels of the training examples.]


    Hypothesis Space Search ID3

    Hypothesis space is complete!

    Target function surely in there

    Outputs a single hypothesis

    No backtracking on selected attributes (greedy search)

    Local minima (suboptimal splits)

    Statistically-based search choices

    Robust to noisy data

    Inductive bias (search bias)

    Prefer shorter trees over longer ones

    Place high information gain attributes close to the root

    Converting a Tree to Rules

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    R1: If (Outlook=Sunny) ∧ (Humidity=High)   Then PlayTennis=No
    R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
    R3: If (Outlook=Overcast)                  Then PlayTennis=Yes
    R4: If (Outlook=Rain) ∧ (Wind=Strong)      Then PlayTennis=No
    R5: If (Outlook=Rain) ∧ (Wind=Weak)        Then PlayTennis=Yes

    Continuous Valued Attributes

    Create a discrete attribute to test a continuous one

    Temperature = 24.5°C

    (Temperature > 20.0°C) ∈ {true, false}

    Where to set the threshold?

    Temperature   15°C  18°C  19°C  22°C  24°C  27°C
    PlayTennis    No    No    Yes   Yes   Yes   No

    (see the paper by [Fayyad, Irani 1993])
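    One common heuristic (the analysis in [Fayyad, Irani 1993] is more involved) is to consider candidate thresholds only at midpoints between adjacent examples whose class differs, then keep the threshold with the highest information gain; a small sketch:

        temps  = [15, 18, 19, 22, 24, 27]                    # sorted attribute values
        labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

        # Candidate thresholds: midpoints where the class label changes.
        candidates = [(temps[i] + temps[i + 1]) / 2
                      for i in range(len(temps) - 1)
                      if labels[i] != labels[i + 1]]
        print(candidates)   # [18.5, 25.5]; evaluate Gain(S, Temperature > t) for each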

    Attributes with many Values

    Problem: if an attribute has many values, maximizing information gain will select it.

    E.g.: imagine using Date = 12.7.1996 as an attribute; it perfectly splits the data into subsets of size 1.

    Use GainRatio instead of information gain as the criterion:

    GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)

    SplitInformation(S,A) = -Σ_{i=1..c} |Si|/|S| log2(|Si|/|S|)

    where Si is the subset of S for which attribute A has the value vi
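    Both quantities as small Python helpers, assuming the branch sizes |Si| are given as a list of counts:

        import math

        def split_information(sizes):
            """SplitInformation(S,A) for branch sizes |S1|, ..., |Sc|."""
            total = sum(sizes)
            return -sum(s / total * math.log2(s / total) for s in sizes if s > 0)

        def gain_ratio(gain_value, sizes):
            """GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)."""
            return gain_value / split_information(sizes)

        print(round(split_information([1] * 14), 2))   # 3.81: a Date-like attribute, 14 singleton subsets
        print(round(split_information([5, 4, 5]), 2))  # 1.58: e.g. Outlook's three values over 14 examples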

    Attributes with Cost

    Consider:

    Medical diagnosis: a blood test costs 1000 SEK

    Robotics: width_from_one_feet has a cost of 23 secs.

    How to learn a consistent tree with low expected cost?

    Replace Gain by:

    Gain²(S,A) / Cost(A)                               [Tan, Schlimmer 1990]

    (2^Gain(S,A) - 1) / (Cost(A)+1)^w,  with w ∈ [0,1]   [Nunez 1988]
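    The two replacement criteria written out as small Python functions (w is the Nunez trade-off weight):

        def tan_schlimmer(gain, cost):
            """Gain^2(S,A) / Cost(A)                     [Tan, Schlimmer 1990]"""
            return gain ** 2 / cost

        def nunez(gain, cost, w=0.5):
            """(2^Gain(S,A) - 1) / (Cost(A) + 1)^w,  w in [0, 1]   [Nunez 1988]"""
            return (2 ** gain - 1) / (cost + 1) ** w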

    Unknown Attribute Values

    What if some examples are missing values of A?

    Use the training example anyway, and sort it through the tree:

    If node n tests A, assign the most common value of A among the other examples sorted to node n

    Or assign the most common value of A among the other examples with the same target value

    Or assign probability pi to each possible value vi of A, and assign fraction pi of the example to each descendant in the tree

    Classify new examples in the same fashion
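    A sketch of the last option (fractional examples), assuming the value distribution is estimated from the other examples sorted to node n:

        from collections import Counter

        def fractional_values(example, attribute, examples_at_node):
            """(value, weight) pairs for an example missing `attribute`, using the
            value frequencies among the other examples sorted to this node."""
            counts = Counter(ex[attribute] for ex in examples_at_node
                             if ex.get(attribute) is not None)
            total = sum(counts.values())
            return [(v, c / total) for v, c in counts.items()]

        node_examples = [{"Wind": "Weak"}, {"Wind": "Weak"}, {"Wind": "Strong"}]
        print(fractional_values({"Wind": None}, "Wind", node_examples))
        # [('Weak', 0.666...), ('Strong', 0.333...)]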

    Occam's Razor

    Prefer shorter hypotheses

    Why prefer short hypotheses?

    Argument in favor:

    There are fewer short hypotheses than long hypotheses

    A short hypothesis that fits the data is unlikely to be a coincidence

    A long hypothesis that fits the data might be a coincidence

    Argument opposed:

    There are many ways to define small sets of hypotheses

    What is so special about small sets based on the size of a hypothesis?

    Overfitting

    Consider the error of hypothesis h over

    Training data: error_train(h)

    Entire distribution D of data: error_D(h)

    Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

    error_train(h) < error_train(h')  and  error_D(h) > error_D(h')


    Overfitting in Decision Tree Learning


    Avoid Overfitting

    How can we avoid overfitting?

    Stop growing when data split not

    statistically significant

    Grow full tree then post-prune

    Reduced-Error Pruning

    Split data into training and validation sets

    Do until further pruning is harmful:

    1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
    2. Greedily remove the one that most improves validation set accuracy

    Produces the smallest version of the most accurate subtree
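    A sketch of this loop over a nested-dict tree like the one built by the id3 sketch earlier; predict, accuracy, and the path bookkeeping are illustrative helpers (and the default label in predict is an arbitrary assumption), not a fixed API:

        from collections import Counter

        def predict(tree, example, default="Yes"):
            """Walk a nested-dict tree {attr: {value: subtree-or-label}} down to a leaf label."""
            while isinstance(tree, dict):
                attr = next(iter(tree))
                tree = tree[attr].get(example.get(attr), default)
            return tree

        def accuracy(tree, examples):
            return sum(predict(tree, ex) == ex["label"] for ex in examples) / len(examples)

        def internal_nodes(tree, path=()):
            """Yield the (attribute, value) path to every internal node of the tree."""
            if isinstance(tree, dict):
                attr = next(iter(tree))
                yield path
                for value, sub in tree[attr].items():
                    yield from internal_nodes(sub, path + ((attr, value),))

        def pruned_copy(tree, target, label, path=()):
            """Copy of `tree` with the internal node at `target` replaced by the leaf `label`."""
            if path == target:
                return label
            attr = next(iter(tree))
            return {attr: {v: (pruned_copy(sub, target, label, path + ((attr, v),))
                               if isinstance(sub, dict) else sub)
                           for v, sub in tree[attr].items()}}

        def reduced_error_prune(tree, training, validation):
            best = tree
            while True:
                base = accuracy(best, validation)
                candidates = []
                for path in internal_nodes(best):
                    # most common classification of the training instances that reach this node
                    reached = [ex for ex in training if all(ex.get(a) == v for a, v in path)]
                    label = Counter(ex["label"] for ex in (reached or training)).most_common(1)[0][0]
                    candidates.append(pruned_copy(best, path, label))
                if not candidates:
                    return best
                challenger = max(candidates, key=lambda t: accuracy(t, validation))
                if accuracy(challenger, validation) < base:
                    return best            # further pruning is harmful -> stop
                best = challenger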

    Reduced-Error Pruning

    Split data into training and validation sets.

    Pruning a decision node d consists of:

    1. removing the subtree rooted at d,
    2. making d a leaf node,
    3. assigning d the most common classification of the training instances associated with d.

    Do until further pruning is harmful:

    1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
    2. Greedily remove the one that most improves validation set accuracy.

    Outlook
     ├─ sunny    → Humidity
     │              ├─ high   → no
     │              └─ normal → yes
     ├─ overcast → yes
     └─ rainy    → Windy
                    ├─ false → yes
                    └─ true  → no

    Effect of Reduced-Error Pruning

    Rule Post-Pruning

    Infer the decision tree from the training set (allowing overfitting)

    Convert the tree into an equivalent set of rules

    Prune each rule by removing preconditions whose removal improves its estimated accuracy

    Sort the pruned rules by estimated accuracy and consider them in this order when classifying

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    If (Outlook = Sunny) ∧ (Humidity = High) Then (PlayTennis = No)

    Why convert the decision tree to rules before pruning?

    Allows distinguishing among the different contexts in which a decision node is used

    Removes the distinction between attribute tests near the root and those that occur near the leaves

    Enhances readability

    Evaluation

    Training accuracy
      How many training instances can be correctly classified based on the available data?
      It is high when the tree is deep/large, or when there is little conflict among the training instances.
      However, higher training accuracy does not mean good generalization.

    Testing accuracy
      Given a number of new instances, how many of them can we correctly classify?
      Cross validation


    Strengths

    can generate understandable rules

    perform classification without much computation

    can handle continuous and categorical variables

    provide a clear indication of which fields are most important

    for prediction or classification

    Weaknesses

    Not suitable for predicting continuous attributes.

    Perform poorly with many classes and small data sets.

    Computationally expensive to train:
      At each node, each candidate splitting field must be sorted before its best split can be found.
      In some algorithms, combinations of fields are used and a search must be made for optimal combining weights.
      Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared.

    Do not handle non-rectangular regions well.


    Cross-Validation

    Estimate the accuracy of a hypothesis induced by

    a supervised learning algorithm

    Predict the accuracy of a hypothesis over future

    unseen instances

    Select the optimal hypothesis from a given set of

    alternative hypotheses

    Pruning decision trees

    Model selection

    Feature selection

    Combining multiple classifiers (boosting)

    Holdout Method

    Partition the data set D = {(v1,y1), ..., (vn,yn)} into a training set Dt and a validation set Dh = D \ Dt

    acc_h = 1/h · Σ_{(vi,yi) ∈ Dh} H(I(Dt,vi), yi)     (h = |Dh|)

    I(Dt,vi): output of the hypothesis induced by learner I trained on data Dt, for instance vi

    H(i,j) = 1 if i = j and 0 otherwise

    Problems:
      makes insufficient use of the data
      training and validation set are correlated
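    A sketch of the holdout estimate, assuming induce(train) returns the induced hypothesis as a callable and examples are (v, y) pairs:

        import random

        def holdout_accuracy(data, induce, frac=0.7, seed=0):
            """Estimate acc_h: train on a random fraction of the data, test on the rest."""
            shuffled = list(data)
            random.Random(seed).shuffle(shuffled)
            cut = int(len(shuffled) * frac)
            train, valid = shuffled[:cut], shuffled[cut:]
            h = induce(train)                                   # hypothesis I(Dt, .)
            return sum(h(v) == y for v, y in valid) / len(valid)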

    Cross-Validation

    k-fold cross-validation splits the data set D into k mutually exclusive subsets D1, D2, ..., Dk

    Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di

    [Figure: the folds D1 D2 D3 D4, with a different fold Di held out for testing in each of the k runs]

    acc_cv = 1/n · Σ_{(vi,yi) ∈ D} H(I(D \ Di, vi), yi)
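    A matching sketch of k-fold cross-validation (folds taken by striding through a shuffled copy; induce as above):

        import random

        def cv_accuracy(data, induce, k=4, seed=0):
            """Estimate acc_cv with k-fold cross-validation."""
            shuffled = list(data)
            random.Random(seed).shuffle(shuffled)
            folds = [shuffled[i::k] for i in range(k)]          # k mutually exclusive subsets Di
            correct = 0
            for i in range(k):
                train = [ex for j in range(k) if j != i for ex in folds[j]]
                h = induce(train)                               # trained on D \ Di
                correct += sum(h(v) == y for v, y in folds[i])  # tested on Di
            return correct / len(shuffled)                      # averaged over all of D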

    Cross-Validation

    Uses all the data for training and testing

    Complete k-fold cross-validation splits the data set of size m in all (m choose m/k) possible ways (choosing m/k instances out of m)

    Leave-n-out cross-validation sets n instances aside for testing and uses the remaining ones for training (leave-one-out is equivalent to n-fold cross-validation)

    Leave-one-out is widely used

    In stratified cross-validation, the folds are stratified so that they contain approximately the same proportion of labels as the original data set

    Information Content (Self-Information)

    A source emits symbols {X1, X2, ..., Xn}, each with probability p(Xi)

    The information content of a compound event (Xi ∧ Xj ∧ ... ∧ Xk) is I(Xi ∧ Xj ∧ ... ∧ Xk)

    If p(X) = 1, then I(X) = 0;  if p(X) = 0, then I(X) = ∞

    If p(Xi) > p(Xj), then I(Xi) < I(Xj)

    If Xi, Xj, Xk are independent, then I(Xi ∧ Xj ∧ Xk) = I(Xi) + I(Xj) + I(Xk)

    I(X) = log2[ 1 / p(X) ] = -log2 p(X)   (bit)

    Example: X = (X1 X2 X3), with p(X1), p(X2), p(X3) given and the Xi independent; find I(X).

    I(X) = I(X1 X2 X3)
         = -log[ p(X1 X2 X3) ]
         = -log[ p(X1) p(X2) p(X3) ]          (Xi independent)
         = -log p(X1) - log p(X2) - log p(X3)
         = I(X1) + I(X2) + I(X3)

    Example: (X1 X2 X3 X4), with
    case 1:  p(X1) = 1/2, p(X2) = 1/4, p(X3) = 1/8, p(X4) = 1/8
    case 2:  p(X1) = 1/4, p(X2) = 1/4, p(X3) = 1/4, p(X4) = 1/4

    case 1:  I(X1 X2 X3 X4) = I(X1) + I(X2) + I(X3) + I(X4)
                            = -log(1/2) - log(1/4) - log(1/8) - log(1/8)
                            = 1 + 2 + 3 + 3 = 9 (bit)

    case 2:  I(X1 X2 X3 X4) = I(X1) + I(X2) + I(X3) + I(X4)
                            = -log(1/4) - log(1/4) - log(1/4) - log(1/4)
                            = 2 + 2 + 2 + 2 = 8 (bit)   (the minimum, Imin)

    Entropy H(X)

    H(X) = Σ_{i=1..n} p(Xi) I(Xi) = -Σ_{i=1..n} p(Xi) log2 p(Xi)

    Example: {X1, X2, X3, X4} with
    p(X1) = 1/2, p(X2) = 1/4, p(X3) = 1/8, p(X4) = 1/8

    H(X) = [ -(1/2)log(1/2) ] + [ -(1/4)log(1/4) ] + [ -(1/8)log(1/8) ] + [ -(1/8)log(1/8) ]
         = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3
         = 1/2 + 1/2 + 3/8 + 3/8
         = 1.75 (bit)

    (For comparison, Hmax = log2 4 = 2 bit, reached when all four symbols are equally likely.)
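    A quick Python check of these numbers:

        import math

        def I(p):                        # self-information in bits
            return -math.log2(p)

        def H(probs):                    # entropy in bits
            return sum(p * I(p) for p in probs)

        print(sum(I(p) for p in (1/2, 1/4, 1/8, 1/8)))   # 9.0  (case 1 above)
        print(sum(I(p) for p in (1/4, 1/4, 1/4, 1/4)))   # 8.0  (case 2 above)
        print(H((1/2, 1/4, 1/8, 1/8)))                   # 1.75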