
  • Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble

    Dr. Hongwei Patrick Yang

    Educational Policy Studies & Evaluation

    College of Education

    University of Kentucky

    Lexington, KY

    Presented at the 2014 Modern Modeling Methods Conference

  • Overview

    The study demonstrates predictive data mining models under model ensemble in the context of analyzing large data sets

    Data mining is usually defined as the data-driven process of discovering meaningful hidden patterns in large amounts of data through automatic as well as manual means


  • Overview

    Many industries use data mining to address business problems such as bankruptcy prediction, risk management, and fraud detection

    Such applications typically employ predictive data mining models as learning machines, with a primary focus on making good predictions


  • Overview

    Among the many types of predictive data mining models are decision trees, neural networks, and (traditional) regression models:

    Decision tree: recursively identifies the most significant split of the predictors with respect to the outcome at each level

    Neural network: models nonlinear associations between the predictors and the outcome

    For each of these models/learning machines, the outcome can be either categorical or numerical; a minimal sketch of the three follows
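    For readers who want to try the three learning machines outside SAS Enterprise Miner, here is a minimal scikit-learn sketch; the synthetic data set and all hyperparameters are illustrative assumptions, not settings from the study.

```python
# A minimal sketch of the three learning machines, assuming scikit-learn;
# the synthetic data and hyperparameters are illustrative, not the study's.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with a categorical outcome (the same three model types
# exist for numerical outcomes: LinearRegression, MLPRegressor, etc.).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

models = {
    "regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "neural network": MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: training accuracy = {model.score(X, y):.3f}")
```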


  • Overview

    Meanwhile, model ensemble techniques have recently become popular thanks to growing computational power

    Bagging and boosting are two of the most popular ensemble techniques (sketched below)
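    To make the two techniques concrete, here is a minimal sketch, again in scikit-learn rather than the SAS Enterprise Miner nodes used in the study; the shallow-tree base learner and all settings are illustrative assumptions.

```python
# Bagging vs. boosting on the same shallow-tree base learner; the
# `estimator` keyword assumes scikit-learn >= 1.2 (older releases
# call it `base_estimator`).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
base = DecisionTreeClassifier(max_depth=3)

# Bagging: refit the base learner on bootstrap resamples, then vote/average.
bagged = BaggingClassifier(estimator=base, n_estimators=50, random_state=1)
# Boosting: fit base learners sequentially, upweighting misclassified cases.
boosted = AdaBoostClassifier(estimator=base, n_estimators=50, random_state=1)

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    print(name, "CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(3))
```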


  • Overview

    Model ensemble techniques are designed to create a model ensemble/committee containing multiple component/base models

    The component models' predictions are averaged or otherwise pooled to improve the stability and accuracy of the final prediction, as in the sketch below
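    The pooling idea itself needs no special machinery: fit several component models and average their predictions. The sketch below does this for a numerical outcome; the models and data are illustrative assumptions.

```python
# Committee pooling for a numerical outcome: fit several component models
# and average their predictions (models and data are illustrative).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=2)

committee = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=6),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000),
]
for m in committee:
    m.fit(X, y)

# The pooled (ensemble) prediction is the mean of the component predictions;
# for a categorical outcome one would average predicted probabilities instead.
pooled = np.mean([m.predict(X) for m in committee], axis=0)
print("ensemble average squared error:", np.mean((y - pooled) ** 2).round(3))
```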


  • Overview

    Model ensemble techniques can be incorporated into many types of predictive models/learning machines (tree, neural network, regression, etc.)

    Ensemble-based modeling can also be combined with common feature/subset selection procedures (genetic algorithm, stepwise method, all-possible-subsets, etc.)
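    As an illustration of combining the two steps, the pipeline below chains a subset-selection step with a bagged ensemble; recursive feature elimination (RFE) stands in for the stepwise/genetic/all-possible-subsets procedures named above, which scikit-learn does not provide directly.

```python
# Feature selection chained with an ensemble learner in one pipeline.
# RFE is a stand-in for the stepwise/genetic/all-possible-subsets
# procedures named above.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1500, n_features=25, n_informative=8,
                           random_state=3)

pipe = Pipeline([
    # Step 1: keep the 8 strongest predictors (subset selection).
    ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)),
    # Step 2: fit a bagged ensemble (default base learner: a decision tree).
    ("ensemble", BaggingClassifier(n_estimators=30, random_state=3)),
])
pipe.fit(X, y)
print("training accuracy:", round(pipe.score(X, y), 3))
```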


  • Numerical examples

    To demonstrate the effectiveness of predictive data mining in discovering meaningful information from large data, the study applies the three commonly used types of predictive models described above to two large-scale applications


  • Numerical examples

    To further improve the predictions from each type of model, model ensemble is implemented during the modeling process to pool the predictions of the individual component models

    For comparison, all models are also fitted without creating any model ensemble


  • Numerical examples

    In addition, each model is evaluated for goodness of fit and performance at the final stage using various fit statistics, including average squared error, ROC index, misclassification rate, Gini coefficient, and the Kolmogorov-Smirnov (K-S) statistic, as applicable

    The entire analysis is performed under SAS Enterprise Miner 7.1


  • Numerical examples

    Example one: Physicochemical properties of protein tertiary structure data

    A numerical outcome: 45,730 cases

    Example two: Bank marketing data

    A categorical outcome: 41,188 cases

    Both data sets are retrieved from the UC Irvine (UCI) Machine Learning Repository
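    A sketch of retrieving the two data sets with pandas follows; the URLs and archive paths reflect the UCI repository layout at the time of writing and should be treated as assumptions that may change.

```python
# Retrieving both data sets with pandas. The URLs and archive paths below
# are assumptions about the UCI repository layout and are not part of the
# original presentation; they may move.
import io
import urllib.request
import zipfile

import pandas as pd

# Example one: Physicochemical Properties of Protein Tertiary Structure
# (numerical outcome RMSD, 45,730 cases).
casp = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv")

# Example two: Bank Marketing (categorical outcome y, 41,188 cases),
# distributed as a semicolon-delimited CSV inside a zip archive.
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00222/bank-additional.zip")
with zipfile.ZipFile(io.BytesIO(urllib.request.urlopen(url).read())) as z:
    bank = pd.read_csv(z.open("bank-additional/bank-additional-full.csv"),
                       sep=";")

print(casp.shape, bank.shape)  # expected: (45730, 10) and (41188, 21)
```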


  • Example one: Numerical outcome


    Table 1. Comparison of Models based on Training Data under a Numerical Outcome.

    Model Description   Average Squared Error   Root Average Squared Error   Maximum Absolute Error
    EnRegTreeNN         21.338                  4.619                        15.000
    EnReg               22.874                  4.783                        14.818
    EnNN                23.122                  4.809                        16.556
    EnTree              25.193                  5.019                        16.131
    NN                  23.591                  4.857                        19.663
    Reg                 23.574                  4.855                        19.668
    Tree                24.103                  4.910                        17.412

  • Example one: Numerical outcome

    Ensemble models tend to be more effective in reducing errors, although this is not guaranteed

    Average squared error: Lower is better

    Root average squared error: Lower is better

    Maximum absolute error: Lower is better
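    For reference, the three fit statistics reported in Table 1 can be computed directly as below; y_true and y_pred are placeholders for the training outcome and a model's predictions.

```python
# The three Table 1 fit statistics, computed directly in NumPy.
import numpy as np

def fit_statistics(y_true, y_pred):
    err = y_true - y_pred
    ase = np.mean(err ** 2)         # average squared error
    rase = np.sqrt(ase)             # root average squared error
    max_abs = np.max(np.abs(err))   # maximum absolute error
    return ase, rase, max_abs

y_true = np.array([7.2, 3.1, 10.5, 0.4])   # made-up outcome values
y_pred = np.array([6.8, 4.0, 9.9, 1.1])    # made-up predictions
print(fit_statistics(y_true, y_pred))
```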


  • Example two: Categorical outcome

    Table 2. Comparison of Models based on Training Data under a Categorical Outcome.

    Model         Root Avg.   Misclass.  ROC    Gini    K-S    Bin-Based    Gain     Cumulative  Cumulative %
    Description   Sq. Error   Rate       Index  Coeff.  Stat.  Two-Way K-S           Lift        Captured Response
    EnRegTreeNN   0.237       0.078      0.947  0.894   0.780  0.772        504.305  6.043       60.541
    EnReg         0.241       0.081      0.935  0.871   0.719  0.717        455.744  5.557       55.676
    EnNN          0.252       0.086      0.919  0.838   0.682  0.681        428.767  5.288       52.973
    EnTree        0.270       0.101      0.801  0.602   0.579  0.576        395.325  4.953       49.623
    Tree          0.254       0.090      0.900  0.800   0.697  0.692        441.595  5.416       54.179
    NN            0.261       0.098      0.912  0.823   0.675  0.670        400.087  5.001       50.027
    Reg           0.261       0.097      0.912  0.823   0.668  0.666        408.710  5.087       50.889

    Note: K-S = Kolmogorov-Smirnov statistic; Bin-Based Two-Way K-S = bin-based two-way Kolmogorov-Smirnov statistic.


  • Example two: Categorical outcome

    Ensemble models typically show better discriminatory power than the individual models, as indicated by each criterion:

    Misclassification rate: Lower is better

    ROC index: Higher is better

    Gini coefficient: Higher is better

    K-S statistic: Higher is better

    Cumulative lift: Higher is better

    Cumulative percent captured response: Higher is better
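    A sketch of how the main Table 2 criteria can be computed from predicted event probabilities follows; the labels and probabilities are made up, and the identities used (Gini = 2 x ROC index - 1, the two-sample K-S statistic, lift as a ratio of event rates) are the standard ones.

```python
# The main Table 2 criteria from predicted event probabilities.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])            # observed events
p = np.array([.9, .2, .8, .6, .3, .1, .7, .4, .2, .5])  # predicted P(event)

misclass = np.mean((p >= 0.5) != y)            # misclassification rate
roc_index = roc_auc_score(y, p)                # ROC index (AUC)
gini = 2 * roc_index - 1                       # Gini coefficient
ks = ks_2samp(p[y == 1], p[y == 0]).statistic  # Kolmogorov-Smirnov statistic

# Cumulative lift at depth d: the event rate among the top d fraction of
# cases ranked by p, divided by the overall event rate.
d = 0.2
top = np.argsort(-p)[: int(d * len(p))]
lift = y[top].mean() / y.mean()

print(misclass, roc_index, gini, ks, lift)
```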


  • Conclusions

    The study presents some initial evidence for the effectiveness of model ensemble in improving the performance of an individual learning machine (model) of a given type

    The study needs to be supplemented with additional information on the use of (real) bagging and boosting in improving the performance of individual learning machines


  • Conclusions

    The study provides applied researchers with more options beyond traditional regression modeling when reliable predictions are needed in their research

    The study serves as the foundation for a future research topic that adds feature selection to predictive data mining modeling under model ensemble for analyzing very large data sets


  • References

    Ao, S. (2008). Data mining and applications in genomics. Berlin, Heidelberg, Germany: Springer Science+Business Media.

    Bache, K., & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

    Barutcuoglu, Z., & Alpaydin, E. (2003). A comparison of model aggregation methods for regression. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Artificial neural networks and neural information processing - ICANN/ICONIP 2003 (pp. 76-83). New York, NY: Springer.

    Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.

    Cerrito, P. B. (2006). Introduction to data mining: Using SAS Enterprise Miner. Cary, NC: SAS Institute Inc.

    Drucker, H. (1997). Improving regressors using boosting techniques. Proceedings of the 14th International Conference on Machine Learning, 107-115.

    Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121, 256-285.

    Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.

    Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.

    Hill, C. M., Malone, L. C., & Trocine, L. (2004). Data mining and traditional regression. In H. Bozdogan (Ed.), Statistical data mining and knowledge discovery (pp. 233-249). London, UK: Chapman and Hall/CRC.

    Larose, D. T. (2005). Discovering knowledge in data: An introduction to data mining. Hoboken, NJ: John Wiley & Sons, Inc.

    Liu, B., Cui, Q., Jiang, T., & Ma, S. (2004). A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics, 5, 136.

    Oza, N. C. (2005). Ensemble data mining methods. In J. Wang (Ed.), Encyclopedia of data warehousing and mining (pp. 448-453). Hershey, PA: Information Science Reference.

    Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.

    Schapire, R. E. (2002). The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, & B. Yu (Eds.), Nonlinear estimation and classification. New York, NY: Springer.
