

CS 5331, Rattikorn Hewett, Texas Tech University

Slide 1: Classification and Prediction

Slide 2: Outline

    n  Overview ¨ Tasks ¨ Evaluation ¨  Issues

    n  Classification techniques ¨ Decision Tree ¨ Bayesian-based ¨ Neural network ¨ Others

    n  Prediction techniques n  Summary

Slide 3: Tasks

    n  Classification & Prediction ¨  describe data with respect to a target class/output ¨  create, from a data set, a model that

    n  describes a target class in terms of features (or attributes or variables) of the class or system generating the data

    ¨  use the model to n  classify an unknown target class n  predict an unknown output

    n  Example Applications ¨  credit approval ¨  medical diagnosis ¨  treatment effectiveness analysis

    What about image recognition or customer profiling?

Slide 4: Classification vs. Prediction

    n  Comparisons: Classification Prediction Data labeled (by known target class) labeled (by known output) Model classifier/classification model predictive model Use of the model classify unknown label predict unknown output Typical unknown values categorical continuous/ordered

    Labeled Data Algorithm

    Model

    Data with unknown class label or output

    class label or predicted output


Slide 5: Three-step process

    n  Model Construction (Training/Learning) ¨  Build a model that describes training data labeled with known

    classes/outputs ¨  A model could be of various forms: rules, decision trees, formulae

    n  Model Evaluation (Testing) ¨  Use a labeled testing data set to estimate model accuracy

    n  The known label of test sample is compared with the classified result from the model

    n  Accuracy = % test cases correctly classified by the model

    n  Model Usage (Classification) ¨  Use the model with acceptable accuracy to classify or predict

    unknown

Slide 6: Example

Training data (Model Construction):

  inst  Credit history  Debt  Collateral  Income  Risk
   1    Bad             H     None        L       H
   2    U               H     None        M       H
   3    U               L     None        M       M
   4    U               L     None        L       H
   5    U               L     None        H       L
   6    U               L     Ok          H       L
   7    Bad             L     None        L       H
   8    Bad             L     Ok          H       M
   9    Good            L     None        H       L
  10    Good            H     Ok          H       L
  11    Good            H     None        L       H
  12    Good            H     None        M       M
  13    Good            H     None        H       L
  14    Bad             H     None        M       H

Classification Algorithm → Classifier (Model):
  1. If Income = L then Risk = H
  2. If Income = H and Credit history = U then Risk = L
  ...

Test data (Model Evaluation; Accuracy = ?):

  inst  Credit history  Debt  Collateral  Income  Risk
   1    Bad             H     Ok          L       H
   2    Good            L     None        M       H
   3    U               L     None        M       M
   4    U               M     None        L       H
   5    U               L     None        H       L

Unknown data (Model Usage):

  inst  Credit history  Debt  Collateral  Income  Risk
   1    U               H     Ok          L       ?
   2    Bad             L     None        M       ?
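The three-step process can be wired together as a short Python sketch of the example above. The fallback class for cases the two listed rules do not cover is an assumption (the slide elides the remaining rules with "..."), as are the column names used as dictionary keys.

```python
# Minimal sketch of the three-step process using the slide's first two rules.
# Cases covered by neither rule default to "M" -- an assumption, since the
# slide elides the remaining rules.

def classify(case):
    """Rule-based classifier (Model Construction output)."""
    if case["Income"] == "L":                                   # Rule 1
        return "H"
    if case["Income"] == "H" and case["CreditHistory"] == "U":  # Rule 2
        return "L"
    return "M"  # fallback for cases the listed rules do not cover

# Labeled test set from the Model Evaluation table.
test_set = [
    {"CreditHistory": "Bad",  "Debt": "H", "Collateral": "Ok",   "Income": "L", "Risk": "H"},
    {"CreditHistory": "Good", "Debt": "L", "Collateral": "None", "Income": "M", "Risk": "H"},
    {"CreditHistory": "U",    "Debt": "L", "Collateral": "None", "Income": "M", "Risk": "M"},
    {"CreditHistory": "U",    "Debt": "M", "Collateral": "None", "Income": "L", "Risk": "H"},
    {"CreditHistory": "U",    "Debt": "L", "Collateral": "None", "Income": "H", "Risk": "L"},
]

# Model Evaluation: accuracy = % of test cases classified correctly.
correct = sum(classify(c) == c["Risk"] for c in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")   # prints: accuracy = 80%
```

Only test case 2 (Income = M, no matching rule) is misclassified, so the answer to the slide's "Accuracy = ?" under these assumptions is 4/5.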

Slide 7: Machine learning context

[Diagram: Labeled Data is split into Training Data and Testing Data. Training: Training Data → Algorithm → Model (a classifier or predictive model). Testing: ask "Is the model sufficiently accurate?"; if no, return to training; if yes, apply the Model to data with an unknown class label or output to produce the class label or predicted output. Input: Data; Output: Model.]

Slide 8: Supervised vs. Unsupervised

    n  Supervised learning ¨  Class labels and # classes are known ¨  Labeled data: the training data (observations, measurements,

    features etc.) are accompanied by labels indicating the class of the observations

    ¨  Goal is to describe class/concept in terms of features/observations à classification/prediction

    n  Unsupervised learning ¨  Class labels are unknown and # classes may not be known ¨  Unlabeled data: the training data does not have class labels ¨  Goal is to establish the existence of classes or concepts or clusters

    in the data à clustering


Slide 9: Terms

    n  Machine learning aims to abstract regularities from a given data set ¨  Mining ~ learning ~ induction (reasoning from specifics to general)

    n  A training set consists of training instances ¨  The terms instances, tuples (or rows), examples, samples, and

    objects are used synonymously n  A testing set is independent of a training set

    ¨  If a training set is used for testing during learning process, the resulting model will overfit the training data à the model could incorporate anomalies in the training data that are not present in the overall sample population à model that may not represent the population (poor generalization)

Slide 10: Outline

    n  Overview ¨ Tasks ¨ Evaluation ¨  Issues

    n  Classification techniques ¨ Decision Tree ¨ Bayesian-based ¨ Neural network ¨ Others

    n  Prediction techniques n  Summary

Slide 11: Evaluation

    n  A resulting model (classifier) is tested against a testing data set to measure its ¨ Accuracy or ¨ Error rate = # misclassified test cases/#test cases in %

    n  Experimental methods ¨  Holdout method ¨  Random subsampling ¨  Cross validation ¨  Bootstrapping

Slide 12: Holdout method

    n  Randomly partition a data set into two independent sets: a training set and a test set (typically 2/3 and1/3) ¨  The training set is used to derive a classifier, whose accuracy is

    estimated using the test set

    n  Use with data set that has large number of samples

    Classifier (Model) Estimate

    Accuracy

    Data

    Training Data

    Testing Data
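The partitioning step can be sketched in a few lines of pure Python (the function name and the stand-in data are illustrative assumptions):

```python
import random

def holdout_split(data, seed=0):
    """Holdout method: randomly partition data into two independent sets,
    ~2/3 for training and ~1/3 for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 2 // 3          # integer 2/3 split point
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))                    # stand-in for 30 labeled samples
train, test = holdout_split(data)
print(len(train), len(test))              # prints: 20 10
```

The two halves are disjoint by construction, which is what makes the accuracy estimate on the test set independent of training.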


Slide 13: Random sub-sampling

    n  Repeat holdout method k times n  Overall accuracy = average of the accuracies in each

    iteration Training

    Data

    Classifier (Model) Estimate

    Accuracy

    Testing Data

    Data

    Training Data

    Classifier (Model) Estimate

    Accuracy

    Testing Data

    Data

    Average accuracy

    • • •

    • • •

What happens when the data requires supervised discretization (e.g., Fayyad & Irani's algorithm)?
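The repeat-and-average scheme can be sketched in pure Python. A deliberately trivial majority-class "learner" stands in for a real classification algorithm; the function names and toy data are illustrative assumptions:

```python
import random
from collections import Counter

def majority_class(train):
    """Trivial stand-in learner: always predict the most common training label."""
    return Counter(y for _, y in train).most_common(1)[0][0]

def holdout_accuracy(data, seed):
    """One holdout iteration: split ~2/3 / ~1/3, train, score on the test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 2 // 3
    train, test = shuffled[:cut], shuffled[cut:]
    predicted = majority_class(train)
    return sum(predicted == y for _, y in test) / len(test)

# Random sub-sampling: repeat the holdout method k times and average.
data = [(i, "pos" if i % 3 else "neg") for i in range(30)]  # toy labeled data
k = 10
overall = sum(holdout_accuracy(data, seed) for seed in range(k)) / k
print(f"average accuracy over {k} runs: {overall:.2f}")
```

Note that if the data needed supervised discretization, that step would have to be redone inside `holdout_accuracy` on each training split, not once on the full data set, to avoid leaking test information.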

Slide 14: Cross validation

    n  k-fold cross validation ¨  Divide a data set into k approximately equal subsets of samples ¨  Use k-1 subsets for training set and the rest for testing set ¨  Obtain a classifier/accuracy from each pair of training and testing

    sets (k possibilities)

    Train Data

    Classifier (Model)

    Estimate Accuracy

    Test Data

    Data

    Train Data

    Classifier (Model)

    Estimate Accuracy

    Test Data

    Average accuracy

    • • •

    • • •

    •• •
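The fold construction can be sketched as follows (pure Python; the generator name and toy data are illustrative assumptions):

```python
import random

def k_fold_splits(data, k=10, seed=0):
    """k-fold cross validation: divide data into k roughly equal folds and
    yield k (train, test) pairs, each fold serving once as the test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k interleaved folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(25))                            # stand-in for 25 samples
sizes = [(len(tr), len(te)) for tr, te in k_fold_splits(data, k=5)]
print(sizes)                                      # five (20, 5) pairs
```

Every sample appears in exactly one test fold, so each sample is used for testing exactly once and for training k-1 times.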

Slide 15: Cross validation & Bootstrapping

    n  Stratified cross validation – the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that in the initial data

    n  Stratified 10-fold Cross-validation is recommended when a sample set is of medium size

    n  When sample set size is very small à Bootstrapping ¨ Leave-one-out sample training set with replacement

Slide 16: Improving Accuracy

    n Use a committee of classifiers and combine the results of each classifier

    n Two basic techniques: ¨ Bagging ¨ Boosting


Slide 17: Bagging

    n  Generate independent training sets by random sampling with replacement from the original sample set

    n  Construct a classifier for each training set using the same classification algorithm

    n  To classify an unknown sample x, each classifier returns its class prediction, which counts as one vote

    n  The Bagged Classifier C* counts the votes and assigns x to the class with the “most” votes

Slide 18: Bagging

[Diagram: Training Data Set → Training Samples 1..n → Classifiers 1..n → Aggregated Classifier C* by majority of the votes; Test Data Set → C* → resulting classes]

Slide 19: Boosting

• Assign a weight to each training sample
• A series of classifiers is learned. After classifier Ct is learned:
  - Calculate the error and re-weight the examples based on the error
  - The weights are updated so that the subsequent classifier Ct+1 "pays more attention" to the misclassification errors made by Ct
• The final boosted classifier C* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy on the training set
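The loop above can be sketched in pure Python. The slide leaves the exact re-weighting rule abstract; the standard AdaBoost update is used here as one concrete instance, and all function names, the weak learner, and the toy data (labels -1/+1) are illustrative assumptions:

```python
import math

def boost(data, learner, rounds=3):
    """AdaBoost-style sketch of the slide's loop. `learner` is a function
    (data, weights) -> classifier; labels are assumed to be -1 or +1."""
    n = len(data)
    w = [1.0 / n] * n                    # equal initial sample weights
    ensemble = []                        # (alpha, classifier) pairs
    for _ in range(rounds):
        clf = learner(data, w)
        # error of C_t = total weight of the samples it misclassifies
        err = sum(wi for wi, (x, y) in zip(w, data) if clf(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # vote weight grows with accuracy
        # re-weight: boost misclassified samples so C_{t+1} focuses on them
        w = [wi * math.exp(alpha if clf(x) != y else -alpha)
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, clf))
    def c_star(x):
        # final boosted classifier C*: weighted vote of the ensemble
        s = sum(a * clf(x) for a, clf in ensemble)
        return 1 if s >= 0 else -1
    return c_star

def stump_learner(data, w):
    """Weak learner: best weighted threshold stump on 1-D inputs."""
    best = None
    for t in sorted({x for x, _ in data}):
        for sign in (1, -1):
            err = sum(wi for wi, (x, y) in zip(w, data)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x, t=t, s=sign: s if x > t else -s

data = [(0.0, -1), (1.0, -1), (2.0, -1), (3.0, 1), (4.0, 1), (5.0, 1)]
c_star = boost(data, stump_learner)
print([c_star(x) for x, _ in data])
```

Each round the misclassified samples gain weight, so the next classifier concentrates on them; the final vote weights (alpha) are a function of each classifier's weighted training accuracy, as the slide states.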

Slide 20: Boosting

[Diagram: Weighted Training Data Sets → Classifiers 1, 2, 3, ..., n → Aggregated Classifier C* by weighted votes; Test Data Set → C* → resulting classes]
