csc550z: data mining for business intelligence - week 1: … · 2019-01-15 · example: mlr for...

37
CSC550Z: Data Mining for Business Intelligence Week 1: Introduction to Data Mining Tuan Tran, Ph.D. Assistant Professor College of Information and Computer Technology Sullivan University Email: [email protected] October 27, 2016 Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 1 / 37

Upload: others

Post on 10-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

CSC550Z: Data Mining for Business IntelligenceWeek 1: Introduction to Data Mining

Tuan Tran, Ph.D.

Assistant ProfessorCollege of Information and Computer Technology

Sullivan UniversityEmail: [email protected]

October 27, 2016

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 1 / 37

Page 2: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Objectives

Understand data mining concepts

Understand steps in a data mining project

Identify types of variables and understand datapre-processing

Know how to normalize data

Understand the issues of overfitting and how to fix it

Steps in constructing data mining model usingXLMiner

Know how to read and intepret output

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 2 / 37

Page 3: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Outline

1 Data Mining Concepts

2 The Process of Data Mining

3 Pre-processing Data

4 Data Mining Example

5 Conclusion

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 3 / 37

Page 4: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Why Mine Data?

Massive data is collected and warehousedWeb data, e-commercePurchase at grocery storesBank/credit card transactions

Computing system is cheaper and more powerful

Competitive pressure is strongProvide better customized services

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 4 / 37

Page 5: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

What is Data Mining?

There are many definitions

Non-trivial extraction of implicit, previously unknown andpotentially useful information from data

Exploration & analysis, by automatic or semi-automatic means, oflarge quantities of data in order to discover meaningful patterns

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 5 / 37

Page 6: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Core Ideas in Data Mining

Consisting of several aspects

Classification: Basic form ofdata analysis (purchase/nopurchase)

Prediction: Predict value of anumerical variable (amount ofpurchase)

Association Rules: What goeswith what

Data Reduction: Transforminto simpler data

Data Exploration: Review andexamine data

Visualization: Graphicalanalysis

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 6 / 37

Page 7: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Supervised Learning

Goal: Predict a single “target” or “outcome” variableTraining data: target value is knownNew data: score to new data where (target) value is not known

Methods: Classification and PredictionExample: Classify emails (spam/no-spam), predict house value, etc.

Source: www.astroml.org

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 7 / 37

Page 8: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Unsupervised Learning

Goal: Segment data into meaningful segments; detect patternsDraw inferences from datasets consisting of input data without labeledresponsesThere is no target (outcome) variable to predict or classify

Methods: Association rules, data reduction & exploration,visualization

Example: shopping basket, clustering, etc.

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 8 / 37

Page 9: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Supervised Learning: Classification

Goal: Predict categorical target (outcome) variable

Examples: Purchase/no purchase, fraud/no fraud, creditworthy/notcreditworthy

Each row is a case/instance/record (customer, tax return, applicant)Each column is a variable (attribute, feature)

Target types: Target variable is often binary (yes/no)

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 9 / 37

Page 10: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Supervised Learning: Prediction

Goal: Predict numerical target (outcome) variable

Examples: sales, revenue, performance

Each row is a case/instance/record (customer, tax return, applicant)Each column is a variable (attribute, feature)

Classification and prediction together constitute “predictive analytics”

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 10 / 37

Page 11: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Unsupervised Learning: Association Rules

Goal: Produce rules that define “what goes with what”

Examples: “If X was purchased, Y was also purchased”Rows are transactionsEach column is a variable (attribute, feature)

Used in recommender systems - “Our records show you bought X, youmay also like Y”

Also called “affinity analysis”

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 11 / 37

Page 12: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Unsupervised Learning: Association Rules

How Can We Qualtify It?

Dataset

Transactions Items

12345 ABC12346 AC12347 AD12348 BEF

Support

Itemset Support

A 75%B 50%C 50%A,C 50%

Parameters: support (50%) and confidence (e.g, 50%)

Rules: A → C

support (A,C) = 50%confidence (A → C) = support(A,C)/support(A)=66,6%

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 12 / 37

Page 13: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Unsupervised Learning: Data Reduction

Goal: Distillation of complex/large data into simpler/smaller data

Reducing the number of variables/columns (e.g., principalcomponents)

Reducing the number of records/rows (e.g., clustering)

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 13 / 37

Page 14: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Unsupervised Learning: Data Visualization

Goal: Graphs and plots of data for visualization

Histograms, boxplots, bar charts, scatterplots

Especially useful to examine relationships between pairs of variables

Graphical network of my social network links

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 14 / 37

Page 15: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Concepts

Unsupervised Learning: Data Exploration

Data sets are typically large, complex & messy (dirty)

Need to review the data to help refine the task

Use techniques of Reduction and Visualization

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 15 / 37

Page 16: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

The Process of Data Mining

Steps in Data Mining

Consisting of several steps

1 Understanding: Define purpose

2 Data Collection: Obtain data(random sampling)

3 Cleaning: Explore, pre-processdata

4 Data Reduction: Reduce thedata; if supervised DM, partition it

5 Identify Task: Specify task(classification, clustering, etc.)

6 Technique Selection: Choose thetechniques (CART, NN, etc.)

7 Parameter Tunning: Iterative and“tuning”

8 Evaluation: Assess & comparemodels

9 Deployment: Deploy best model

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 16 / 37

Page 17: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

The Process of Data Mining

Obtaining Data: SamplingWhy Sampling?

Massive dataset with correlation instancesSoftware limitation, e.g., XLMiner, e.g., limits the “training” partitionto 10,000 records

How to Sample?Use build-in function of software

What Is the Size of Sample?Data mining typically deals with huge datasetsProduce statistically-valid results

Source: http://labs.geog.uvic.ca/

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 17 / 37

Page 18: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

The Process of Data Mining

Rare event oversampling

Often the event of interest is rareExamples: response to mailing, fraud in taxes, ...

Sampling may yield too few “interesting” cases

to effectively train a modelA popular solution: oversample the rare casesto obtain a more balanced training set

Later, need to adjust results for the oversampling

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 18 / 37

Page 19: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

Types of Variables

Determine the types of pre-processing needed, and algorithms used

Main distinction: Categorical vs. numericNumeric

ContinuousInteger

CategoricalOrdered (low, medium, high) (ordinal)Unordered (male, female) (nominal)

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 19 / 37

Page 20: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

Variable handling

Numeric

Most algorithms in XLMiner can handle

numeric data

May occasionally need to “bin” into

categoriesCategorical

Naive Bayes can use as-is

In most other algorithms, must create

binary dummies (number of dummies =

number of categories - 1)Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 20 / 37

Page 21: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

Variable Handling

An outlier is an observation that is “extreme”, beingdistant from the rest of the data (definition of“distant” is deliberately vague)

Outliers can have disproportionate influence on models(a problem if it is spurious)

An important step in data pre-processing is detectingoutliers

Once detected, domain knowledge is required todetermine if it is an error, or truly extreme

In some contexts, finding outliers is the purpose of theDM exercise (airport security screening). This is called“anomaly detection”

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 21 / 37

Page 22: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

Handling Missing Data

Most algorithms will not process records with missingvalues. Default is to drop those records.Solution 1: Omission

If a small number of records have missing values, can omitthemIf many records are missing values on a small set of variables,can drop those variables (or use proxies)If many records have missing values, omission is not practical

Solution 2: ImputationReplace missing values with reasonable substitutesLets you keep the record and use the rest of its(non-missing) information

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 22 / 37

Page 23: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

Normalizing (Standardizing) Data

Used in some techniques when variables with thelargest scales would dominate and skew results

Puts all variables on same scaleNormalizing function: subtract mean and divide bystandard deviation (used in XLMiner)

Mean of n samples xi : µ =∑n

i=1 xi/n

Standard deviation σ =√∑n

i=1(xi − µ)2/nStandardized samples: xi = (xi − µ)/σ

Alternative function: scale to 0-1 by subtractingminimum and dividing by the range

Range r = xmax − xmin

Normalized samples xi = (xi − xmin)/rUseful when the data contain dummies and numeric

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 23 / 37

Page 24: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

The Problem of Overfitting

Statistical models can produce highly complex explanations ofrelationships between variablesThe “fit” may be excellentWhen used with new data, models of great complexity do not doso well100% fit - not useful for new data

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 24 / 37

Page 25: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

The Problem of Overfitting (cont.)

Causes:Too many predictorsA model with too many parametersTrying many different models

Consequence: Deployed model will not work as well asexpected with completely new data.

Source: http://pingax.com

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 25 / 37

Page 26: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

Partitioning the Data

Problem: How well will ourmodel perform with newdata?Solution: Separate datainto two parts

Training partition to developthe modelValidation partition toimplement the model andevaluate its performance on“new” data

Addresses the issue ofoverfitting

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 26 / 37

Page 27: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Pre-processing Data

Test Partition

When a model is developed on training data, it canoverfit the training data (hence need to assess onvalidation)

Assessing multiple models on same validation data canoverfit validation data

Some methods use the validation data to choose aparameter. This too can lead to overfitting thevalidation data

Solution: final selected model is applied to a testpartition to give unbiased estimate of its performanceon new data

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 27 / 37

Page 28: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Example: MLR for Predicting Boston Housing (XLMiner)

This example shows data mining steps for predictinghouse values in Boston areas

Use BostonHousing.xls datasetConstruct a multiple linear regression (MLR) model to predict housevalues based on house featuresIntepret output of the model

Sample of the dataset

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 28 / 37

Page 29: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Example: MLR for Predicting Boston Housing (XLMiner)(cont.)

Description of variables

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 29 / 37

Page 30: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Example: MLR for Predicting Boston Housing (XLMiner)(cont.)

Partitioning the data

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 30 / 37

Page 31: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Example: MLR for Predicting Boston Housing (XLMiner)(cont.)

Using XLMiner for Multiple Linear Regression

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 31 / 37

Page 32: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Example: MLR for Predicting Boston Housing (XLMiner)(cont.)

Specifying Output

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 32 / 37

Page 33: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Example: MLR for Predicting Boston Housing (XLMiner)(cont.)

Prediction of Training Data

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 33 / 37

Page 34: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Example: MLR for Predicting Boston Housing (XLMiner)(cont.)

Prediction of Validation Data

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 34 / 37

Page 35: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Example: MLR for Predicting Boston Housing (XLMiner)(cont.)

Summary of errors

Error = actual - predicted

RMS = Root-mean-squared error (Square root of averagesquared error)

In example, sizes of training and validation sets differ, so onlyRMS Error and Average Error are comparable

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 35 / 37

Page 36: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Data Mining Example

Using Excel and XLMiner for Data Mining

Excel is limited in data capacity

However, the training and validation of DM modelscan be handled within the modest limits of Excel andXLMiner

Models can then be used to score larger databases

XLMiner has functions for interacting with variousdatabases (taking samples from a database, andscoring a database from a developed model)

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 36 / 37

Page 37: CSC550Z: Data Mining for Business Intelligence - Week 1: … · 2019-01-15 · Example: MLR for Predicting Boston Housing (XLMiner) This example shows data mining steps for predicting

Conclusion

Summary

Data Mining consists of supervised methods(Classification & Prediction) and unsupervisedmethods (Association Rules, Data Reduction, DataExploration & Visualization)

Before algorithms can be applied, data must becharacterized and pre-processed

To evaluate performance and to avoid overfitting, datapartitioning is used

Data mining methods are usually applied to a samplefrom a large database, and then the best model isused to score the entire database

Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 37 / 37