
Feature Engineering, Model Evaluations

Giri Iyengar

Cornell University

[email protected]

Feb 5, 2018


Overview

1 ETL
2 Feature Engineering
    Handling Numerical Features
    Handling Categorical Features
3 Dealing with correlated features
4 Feature Selection
    Forward and Backward Selection
    Via Regularization
5 Evaluating Model Performance
    Handling rare classes


ETL: Preparing your data for modeling

- Frequently, the data that you collect is in long format
- For example, logging systems will write a timestamp, some entity id, and then a few columns
- In addition, you might need to join multiple inputs together
- Further, you may want to apply some transformations / aggregations to the individual columns
- All of these can be considered jointly as an ETL (Extract, Transform, Load) operation
- Frequently in Data Science we end up with ELT instead: the Transformation step is done closer to the modeling stage


        Month  Day  Feature  Value
1       5      1    Ozone    41
2       5      2    Ozone    36
3       5      5    Ozone    NA
...
610     9      28   Temp     75
611     9      29   Temp     76
612     9      30   Temp     68

Pivot: this is typical output found in a log file. You need to convert it into a usable format by pivoting the individual rows, e.g.:

Month  Day  Ozone  Solar.R  Wind  Temp
5      1    41     190      7.4   67
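As a sketch, the same pivot can be expressed in pandas; the frame below reproduces a few rows of the example above, and the column names are illustrative rather than specific to any logging system:

```python
import pandas as pd

# Long format: one (Month, Day, Feature, Value) record per row
long_df = pd.DataFrame({
    "Month":   [5, 5, 5, 5],
    "Day":     [1, 1, 1, 1],
    "Feature": ["Ozone", "Solar.R", "Wind", "Temp"],
    "Value":   [41.0, 190.0, 7.4, 67.0],
})

# Pivot each (Month, Day) group into one wide row with a column per Feature
wide_df = (long_df
           .pivot_table(index=["Month", "Day"], columns="Feature", values="Value")
           .reset_index())
print(wide_df)
```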


Feature Engineering

- In addition to pivoting, you might look at various aggregations / combinations:
  - How many times did this event occur?
  - In the past time window, at this location, how many times did the event occur?
  - How far does this event deviate from the mean?
- Usually task dependent – with deep subject matter expertise, you ask more meaningful questions


Numerical Features

- Count – how many times did something happen
- Amount – how much was it
- Continuous – both positive and negative values have meaning


Normalization techniques

- Standardize the numerical variables (see the sketch below)
- Min-max normalization
- sigmoid / tanh / log transformations
- Handling zeros distinctly – potentially important for count-based features
- Decorrelate / transform variables
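A minimal numpy sketch of these transformations, assuming a single numeric feature vector `x` (the names and toy values are illustrative):

```python
import numpy as np

x = np.array([2.0, 5.0, 0.0, 9.0, 4.0])   # a raw numeric feature

z = (x - x.mean()) / x.std()               # standardization: zero mean, unit variance
mm = (x - x.min()) / (x.max() - x.min())   # min-max normalization to [0, 1]
lg = np.log1p(x)                           # log transform; log1p keeps zeros finite
sg = 1.0 / (1.0 + np.exp(-z))              # sigmoid squashing of the standardized value
th = np.tanh(z)                            # tanh squashing to (-1, 1)
is_zero = (x == 0).astype(float)           # separate zero indicator for count features
```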


One Hot Encoding

- Transforms an attribute taking $d$ distinct values into a $(d-1)$-dimensional binary vector
- Simple, easy, and gives a great deal of interpretability of the attribute (a pandas sketch follows the table below)
- Great technique for a small number of distinct values
- Categorical attributes with a large number of distinct values become an issue (e.g. zipcode / MSA / items)

Label
Dog    1  0  0
Cat    0  1  0
Mouse  0  0  1
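For illustration, pandas' `get_dummies` reproduces the table above (columns come out in alphabetical order); `drop_first=True` gives the $(d-1)$-dimensional variant from the bullets:

```python
import pandas as pd

labels = pd.Series(["Dog", "Cat", "Mouse"], name="Label")

# Full one-hot encoding: one binary column per distinct value
print(pd.get_dummies(labels))

# (d - 1)-dimensional encoding: drop one reference category
# to avoid perfect collinearity among the indicator columns
print(pd.get_dummies(labels, drop_first=True))
```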


Target Rate Encoding

- Compute $P(c|d)$ empirically (a pandas sketch follows the tables below)
- Replace the attribute by a $(k-1)$-dimensional vector (one entry per target class)
- Statistics need to be robust enough (take care of distinct values with small counts)
- Handle missing values – those not seen in the training partition

Feature  Target
Up       1
Up       0
Down     1
Flat     0
Flat     1
Down     0
Up       1
Down     0

Feature  Encoding
Up       0.66
Down     0.33
Flat     0.50
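A minimal pandas sketch of the encoding above, assuming a binary target; the empirical $P(\text{Target}=1 \mid \text{Feature})$ is just the per-category mean of the target:

```python
import pandas as pd

df = pd.DataFrame({
    "Feature": ["Up", "Up", "Down", "Flat", "Flat", "Down", "Up", "Down"],
    "Target":  [1, 0, 1, 0, 1, 0, 1, 0],
})

# Empirical P(Target = 1 | Feature): mean of the binary target per category
encoding = df.groupby("Feature")["Target"].mean()
df["Feature_encoded"] = df["Feature"].map(encoding)
print(encoding)   # Down ~0.33, Flat 0.50, Up ~0.67, matching the table above
```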


TF-IDF and related techniques

- Use summary statistics to reduce the number of distinct values
- Use informative measures such as TF-IDF (https://en.wikipedia.org/wiki/Tf–idf); a scikit-learn sketch follows this list
- TF and IDF can both be computed in several ways:
  - TF: raw, boolean, log-scaled, augmented (to account for document-size variations)
  - IDF: raw, log-scaled, smoothed log-scaled, ...
- Entropy ($-\sum_i p_i \log p_i$) to select distinct values of importance
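As a sketch with a recent scikit-learn, whose defaults use one of the variants listed above (smoothed, log-scaled IDF with L2-normalized rows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # sparse matrix: one row per document
print(vec.get_feature_names_out())     # the distinct terms (the vocabulary)
print(X.toarray().round(2))            # TF-IDF weights per (document, term)
```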


Low Rank Factorizations

- Low-rank approximation is an optimization technique
- The cost is the closeness of fit between the original matrix and its low-rank approximation
- PCA / SVD
- Non-negative Matrix Factorization (NMF): $\Theta^T X \approx Y$. Use gradient descent, for instance (a scikit-learn sketch follows)
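A sketch with scikit-learn's NMF, which uses the equivalent (up to transposition) parameterization $X \approx WH$ with non-negative factors; the matrix sizes here are arbitrary:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 20))               # non-negative data matrix

# Factor X ~ W @ H with W, H >= 0 and a rank-5 inner dimension
nmf = NMF(n_components=5, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(X)                # (100, 5) low-dimensional representation
H = nmf.components_                     # (5, 20) non-negative basis
print(np.linalg.norm(X - W @ H))        # closeness of fit: the reconstruction cost
```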


LSH/Random Hashing

- When dimensionality is really large, oftentimes a random low-dimensional projection seems to work
- That is, take your really large-dimensional vector and project it down to a really small space randomly
- Collisions will occur, but they don't seem to matter. Interpretability is lost, of course
- LSH (Locality Sensitive Hashing) is a more principled approach that does the same (a sketch of a plain random projection follows)
- Several types of projections (Walsh-Hadamard matrix, random Gaussian matrix) have proven distance-preserving properties
- https://en.wikipedia.org/wiki/Locality-sensitive_hashing
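A minimal sketch of a random Gaussian projection with scikit-learn; the dimensions are arbitrary, and by the Johnson-Lindenstrauss lemma pairwise distances are approximately preserved:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10_000))        # very high-dimensional features

# Project down to 256 dimensions with a random Gaussian matrix
proj = GaussianRandomProjection(n_components=256, random_state=0)
X_low = proj.fit_transform(X)
print(X_low.shape)                       # (50, 256)
```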


PCA/SVD

- Useful technique to analyze the inter-relationships between variables
- Feature dimensionality reduction with minimum loss of information
- $y_i = \sum_{j=1}^{k} \beta_{ij} x_j$, i.e. $y_i = \beta_i^T X$ and $Y = \beta^T X$
- $\Sigma$ is the covariance matrix of $X$ and $\Sigma_Y$ is the covariance matrix of $Y$


PCA

- $\Sigma_Y = \beta^T \Sigma \beta$
- $\mathrm{var}(y_i) = \sum_{p=1}^{k} \sum_{m=1}^{k} \beta_{ip} \beta_{im} \sigma_{pm} = \beta_i^T \Sigma \beta_i$
- $\mathrm{covar}(y_i, y_j) = \sum_{p=1}^{k} \sum_{m=1}^{k} \beta_{ip} \beta_{jm} \sigma_{pm} = \beta_i^T \Sigma \beta_j$
- Spectral Decomposition Theorem: $\Sigma = \beta D \beta^T$, where $\beta$ contains the eigenvectors
- Easy to show that $\mathrm{var}(y_i) = \lambda_i$ and $\mathrm{covar}(y_i, y_j) = 0$
- $\mathrm{trace}(\Sigma) = \sum_{i=1}^{k} \sigma_i^2 = \sum_{i=1}^{k} \lambda_i$
- Total variance of the data is accounted for in the eigenvalues. Choosing the top $l \ll k$ eigenvalues and their corresponding eigenvectors gives you a distortion-minimizing low-dimensional projection (a numpy sketch follows)
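A numpy sketch of PCA exactly as derived above: eigendecompose the covariance matrix, keep the top-$l$ eigenvectors, and check the trace identity (data and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                     # center the data

Sigma = np.cov(X, rowvar=False)            # k x k covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

l = 2                                      # keep the top l << k components
Y = X @ eigvecs[:, :l]                     # low-dimensional projection

print(np.isclose(np.trace(Sigma), eigvals.sum()))  # trace(Sigma) = sum of eigenvalues
print(eigvals[:l] / eigvals.sum())                 # fraction of total variance retained
```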


Linear Discriminant Analysis

- PCA comes up with a projection that represents the data in a low-dimensional feature space
- It is optimized towards representation. We are interested in discrimination

Figure: Gaussian Data


PCA vs LDA: An Example

Figure: PCA vs LDA comparison


LDA

- Motivated from Bayes' Rule: $P(c|x) = \frac{P(x|c) P(c)}{\sum_d P(x|d) P(d)}$
- Assume Gaussian-distributed data
- Further assume that classes have identical covariance
- End up with $f_i = \mu_i \Sigma^{-1} x^T - \frac{1}{2} \mu_i \Sigma^{-1} \mu_i^T + \ln P(i)$, which is linear in $x$
- See http://people.revoledu.com/kardi/tutorial/LDA/ for a nice tutorial
- See http://www.ics.uci.edu/%7Ewelling/classnotes/papers_class/Fisher-LDA.pdf for a detailed explanation
- The most discriminant projection comes out as the eigenvectors of $\Sigma^{-1} \Sigma_b$ (a scikit-learn sketch follows)
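A scikit-learn sketch on synthetic data matching the assumptions above (two Gaussian classes with identical covariance); with two classes the discriminant projection is one-dimensional:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two Gaussian classes, identical covariance, different means
X0 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X1 = rng.normal(loc=[3.0, 1.0], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda.fit_transform(X, y)       # most discriminant 1-D projection
print(lda.score(X, y))                 # accuracy of the resulting linear boundary
```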


Feature Selection

- Not all features are important
- Oftentimes, removing features from the model improves its performance
- If there are $N$ features, there are $2^N$ subsets $\implies$ feature selection is a hard task!


Forward Feature Selection

- Build a model with one feature: for each feature in turn, build a model and retain the one with the strongest performance
- From the remaining features, keep adding one feature at a time (using the search method above)
- Stop if there is no significant improvement (see the sketch below)
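A greedy sketch of the procedure, assuming cross-validated $R^2$ as the performance measure and an illustrative improvement threshold:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

selected, remaining = [], list(range(X.shape[1]))
best_score = -float("inf")
while remaining:
    # Try each remaining feature and keep the one that helps most
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] - best_score < 1e-3:   # no significant improvement: stop
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("Selected feature indices:", selected)
```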


Backward Feature Selection

- Similar to the Forward Selection procedure
- Build a model with all features
- Remove the weakest feature
- Keep removing as long as there is no significant change


Regularization

- The more parameters a model has, the more flexible it is
- Too much flexibility leads to overfitting: the model learns the peculiarities of a particular data set and doesn't generalize well
- Solution: simplify the model. Everything must be made as simple as possible, but not simpler!
- We can add a penalty term to the objective function and do joint minimization: $\arg\min_\Theta J(\Theta) + \lambda \|\Theta\|$


L2 Regularization / Ridge Regression

$\arg\min_\Theta J(\Theta) + \lambda \sum_i \theta_i^2$

Figure: θ values as a function of λ

In the end, you get regularized feature weights. Some weights are 0 or close to 0; these features are not important to the model.
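A scikit-learn sketch on synthetic data, where `alpha` plays the role of $\lambda$, showing the weights shrinking toward 0 as the penalty grows:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 5)))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # 2 informative features

for lam in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    print(lam, ridge.coef_.round(3))   # weights shrink toward 0 as lambda grows
```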


L1 Regularization / LASSO

$\arg\min_\Theta J(\Theta) + \lambda \sum_i |\theta_i|$

Figure: θ values as a function of λ

http://statweb.stanford.edu/~owen/courses/305/Rudyregularization.pdf
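The same sketch with Lasso shows the key difference from Ridge: with an L1 penalty, many weights become exactly 0, which acts as built-in feature selection (data and `alpha` are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 10)))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(3))   # most weights are exactly 0: built-in feature selection
```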


Geometric View of Lasso and Ridge

Figure: Lasso and Ridge Regularization


Elastic Net Regularization

- If there is a set of highly correlated features, Lasso ends up selecting only one of them
- When we have a high-dimensionality, low-sample-count problem, Lasso saturates quickly
- Ridge doesn't achieve the sparsity achieved by Lasso
- $\arg\min_\Theta J(\Theta) + \lambda_1 \sum_i |\theta_i| + \lambda_2 \sum_i \theta_i^2$
- Combines the advantages of both Lasso and Ridge, especially for large-dimensionality problems (sketched below)
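A scikit-learn sketch with a deliberately correlated feature group; `l1_ratio` mixes the two penalties (1.0 is pure Lasso, 0.0 pure Ridge), and the data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Three nearly identical (highly correlated) copies of one signal
X = StandardScaler().fit_transform(
    np.hstack([base + rng.normal(scale=0.01, size=(100, 1)) for _ in range(3)]))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_.round(3))   # weight is shared across the correlated group
```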


Model Evaluation

- Accuracy: $\frac{\#\text{Correct}}{\#\text{Samples}}$
- Confusion matrix:

               Pred. +ve   Pred. -ve
  Actual +ve   TP          FN
  Actual -ve   FP          TN

- Precision: $\frac{TP}{TP + FP}$, Recall: $\frac{TP}{TP + FN}$ (see the scikit-learn sketch below)
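A scikit-learn sketch on a toy prediction vector; by convention, `confusion_matrix` orders rows and columns by label, so the negative class comes first:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # #Correct / #Samples = 0.75
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
```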


Receiver Operating Characteristic

- Precision and Recall reflect a particular setting of the decision threshold
- E.g. with a Logistic Regression score or a Naive Bayes negative log-likelihood, you can select any threshold you want
- The ROC curve shows the performance for every possible choice of this threshold. AUC – the normalized area under the ROC curve – is another single number you can use to compare models (see the sketch below)
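A sketch of sweeping the threshold with scikit-learn: score the test set with continuous probabilities, then let `roc_curve` evaluate every threshold at once (the data here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # continuous scores, not hard labels

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_te, scores))              # area under the curve: one number per model
```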


Handling rare class classification

- Often, the class of interest is very rare
  - E.g. detecting a rare disease
  - Identifying a purchaser among average internet browsers
- A default answer of "No" has a very high accuracy
- But that is not a very useful classifier


- Penalize different classes differently (see the sketch below)
- Subsample the background class / up-sample the foreground class
- Construct hierarchical / back-off classifiers (Russian doll model):
  - Browser / Site Visitor
  - Shopping Cart / Site Visitor
  - Purchase / Shopping Cart

Figure: Russian Doll Model
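A sketch of the first technique (per-class penalties) using scikit-learn's `class_weight`; the 2%-positive data is synthetic, and the recall gap illustrates why plain accuracy is misleading here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# ~2% positive class: a rare-class problem
X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# class_weight="balanced" penalizes mistakes on the rare class more heavily
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

print(recall_score(y_te, plain.predict(X_te)))     # rare-class recall, unweighted
print(recall_score(y_te, weighted.predict(X_te)))  # usually higher with class weights
```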
