Feature Engineering, Model Evaluations
Giri Iyengar
Cornell University
Feb 5, 2018
Overview

1 ETL
2 Feature Engineering: Handling Numerical Features; Handling Categorical Features
3 Dealing with correlated features
4 Feature Selection: Forward and Backward Selection; Via Regularization
5 Evaluating Model Performance: Handling rare classes
ETL: Preparing your data for modeling
Frequently, the data that you collect is in long format. For example, logging systems will put timestamps, some entity id, and then a few columns.
In addition, you might need to join multiple inputs together.
Further, you may want to apply some transformations / aggregations to the individual columns.
All of these can be considered jointly as an ETL operation.
Frequently in Data Science, we end up with ELT. That is, the Transformation step is done closer to the modeling stage.
ETL: Preparing your data for modeling
Typical output found in a log file (long format). You need to convert this into a usable format by pivoting the individual rows:

          Month  Day  Feature  Value
    1         5    1  Ozone       41
    2         5    2  Ozone       36
    3         5    5  Ozone       NA
    ...
    610       9   28  Temp        75
    611       9   29  Temp        76
    612       9   30  Temp        68

After pivoting:

    Month  Day  Ozone  Solar.R  Wind  Temp
        5    1     41      190   7.4    67
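A minimal pandas sketch of this pivot (the frame and its values are illustrative, mirroring the log-file layout above):

```python
import pandas as pd

# Long-format log data: one (Month, Day, Feature) measurement per row.
long_df = pd.DataFrame({
    "Month":   [5, 5, 5, 5, 5, 5, 5, 5],
    "Day":     [1, 1, 1, 1, 2, 2, 2, 2],
    "Feature": ["Ozone", "Solar.R", "Wind", "Temp"] * 2,
    "Value":   [41, 190, 7.4, 67, 36, 118, 8.0, 72],
})

# Pivot: one row per (Month, Day), one column per distinct Feature.
wide_df = (
    long_df.pivot_table(index=["Month", "Day"],
                        columns="Feature",
                        values="Value")
           .reset_index()
)
print(wide_df)
```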
Feature Engineering
In addition to pivoting, you might look at various aggregations / combinations (see the sketch below):
How many times did this event occur?
In the past time window, at this location, how many times did the event occur?
How much does this event deviate from the mean?
Usually task dependent – with deep subject matter expertise, you ask more meaningful questions.
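As a rough sketch of such aggregation features in pandas (the event log, column names, and window size are made up for illustration):

```python
import pandas as pd

# Hypothetical event log: timestamped events per location.
events = pd.DataFrame({
    "ts":       pd.to_datetime(["2018-01-01 09:00", "2018-01-01 09:30",
                                "2018-01-01 10:10", "2018-01-02 09:05"]),
    "location": ["A", "A", "B", "A"],
    "value":    [3.0, 5.0, 2.0, 9.0],
})

# How many times did this event occur (per location)?
counts = events.groupby("location").size()

# Occurrences per location within a trailing 1-hour window.
windowed = (events.set_index("ts")
                  .groupby("location")["value"]
                  .rolling("1h").count())

# How far does each event deviate from its location's mean (z-score)?
g = events.groupby("location")["value"]
events["z"] = (events["value"] - g.transform("mean")) / g.transform("std")
```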
Numerical Features
Count – how many times did something happen
Amount – how much was it
Continuous – both positive and negative values have meaning
Normalization techniques
Standardize the numerical variables
Min-max normalization
Sigmoid / tanh / log transformations
Handle zeros distinctly – potentially important for count-based features
Decorrelate / transform variables
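A short scikit-learn sketch of a few of these transformations, on an illustrative count feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[0.0], [1.0], [10.0], [100.0]])  # e.g. a count feature

standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmaxed     = MinMaxScaler().fit_transform(X)    # scaled into [0, 1]
logged       = np.log1p(X)                        # log(1 + x), keeps zeros at 0

# Handling zeros distinctly: add an explicit "is zero" indicator column.
with_flag = np.hstack([logged, (X == 0).astype(float)])
```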
One Hot Encoding
Transforms an attribute taking d distinct values into a d-dimensional binary indicator vector (or d − 1 dimensions if one value is kept as a reference level)
Simple, easy, and gives a great deal of interpretability of the attribute
Great technique for a small number of distinct values
A categorical attribute with a large number of distinct values becomes an issue (e.g. zipcode/MSA/items)

           Label
    Dog    1 0 0
    Cat    0 1 0
    Mouse  0 0 1
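A quick pandas sketch of both variants (full one-hot and reference-level coding):

```python
import pandas as pd

labels = pd.Series(["Dog", "Cat", "Mouse", "Dog"])

# Full one-hot encoding: one indicator column per distinct value.
onehot = pd.get_dummies(labels)

# Reference-level (dummy) coding: drop one column, leaving d - 1 indicators.
dummy = pd.get_dummies(labels, drop_first=True)
print(onehot)
```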
Target Rate Encoding
Compute $P(c|d)$ empirically for each distinct value $d$ and target class $c$
Replace the attribute by a $k - 1$-dimensional vector (one entry for each target class)
Statistics need to be robust enough (take care of distinct values with small counts)
Handle missing values – those not in the training partition

Example – raw data:

    Feature  Target
    Up       1
    Up       0
    Down     1
    Flat     0
    Flat     1
    Down     0
    Up       1
    Down     0

Resulting encoding (empirical target rate per value):

    Feature  Encoding
    Up       0.66
    Down     0.33
    Flat     0.5
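A minimal pandas sketch of target rate encoding on the example above, with one simple fallback for values unseen in training:

```python
import pandas as pd

df = pd.DataFrame({
    "feature": ["Up", "Up", "Down", "Flat", "Flat", "Down", "Up", "Down"],
    "target":  [1, 0, 1, 0, 1, 0, 1, 0],
})

# Empirical P(target=1 | feature value), learned on the training partition.
rates = df.groupby("feature")["target"].mean()

# Replace the attribute by its target rate; unseen values fall back
# to the global rate (one simple way to handle missing values).
global_rate = df["target"].mean()
df["feature_enc"] = df["feature"].map(rates).fillna(global_rate)
print(rates)  # Down 0.33, Flat 0.5, Up 0.66
```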
TF-IDF and related techniques
Use summary statistics to reduce the number of distinct values
Use informative measures such as TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
TF and IDF can both be computed in several ways
TF: raw, boolean, log-scaled, augmented (to account for document size variations)
IDF: raw, log-scaled, smoothed log-scaled, . . .
Use entropy ($-\sum_i p_i \log p_i$) to select distinct values of importance
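A small sketch using scikit-learn's TfidfVectorizer, which implements one particular choice of weightings (raw term counts for TF, smoothed log-scaled IDF, L2-normalized rows); the documents are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse (3 docs x vocabulary) matrix
print(vectorizer.get_feature_names_out())
```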
Low Rank Factorizations
Low-rank approximation is an optimization technique
Cost is the closeness of fit between the original matrix and its low-rank approximation
PCA/SVD
Non-negative Matrix Factorization (NMF): $\Theta^T X \approx Y$. Use Gradient Descent, for instance
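A sketch of both factorizations with scikit-learn, on a random non-negative matrix (sizes and rank chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD, NMF

rng = np.random.default_rng(0)
X = rng.random((100, 20))  # illustrative non-negative data matrix

# Rank-5 SVD approximation: minimizes Frobenius-norm reconstruction error.
svd = TruncatedSVD(n_components=5)
X_svd = svd.fit_transform(X)   # (100, 5) reduced representation

# NMF: X ~ W @ H with W, H >= 0, fit by iterative (gradient-style) updates.
nmf = NMF(n_components=5, max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_
reconstruction = W @ H         # low-rank approximation of X
```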
LSH/Random Hashing
When dimensionality is really large, oftentimes a random low-dimensional projection seems to work
That is, take your really large dimensional vector and project it down to a really small space randomly
Collisions will occur but they don't seem to matter. Interpretability is lost, of course
LSH (Locality Sensitive Hashing) is a more principled approach that does the same
Several types of projections (Walsh-Hadamard matrix, random Gaussian matrix) have proven distance-preserving properties
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
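A minimal random-projection sketch with scikit-learn (the dimensions are illustrative):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.random((50, 10_000))   # 50 samples in a very high-dimensional space

# Random Gaussian projection down to 100 dimensions; pairwise distances
# are approximately preserved (Johnson-Lindenstrauss lemma).
proj = GaussianRandomProjection(n_components=100, random_state=0)
X_low = proj.fit_transform(X)  # shape (50, 100)
```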
PCA/SVD
Useful technique to analyze the inter-relationships between variables
Feature dimensionality reduction with minimum loss of information
$y_i = \sum_{j=1}^{k} \beta_{ij} x_j$, i.e. $y_i = \beta_i^T X$ and $Y = \beta^T X$
$\Sigma$ is the covariance matrix of $X$, and $\Sigma_Y$ is the covariance matrix of $Y$
PCA
$\Sigma_Y = \beta^T \Sigma \beta$
$\mathrm{var}(y_i) = \sum_{p=1}^{k} \sum_{m=1}^{k} \beta_{ip} \beta_{im} \sigma_{pm} = \beta_i^T \Sigma \beta_i$
$\mathrm{covar}(y_i, y_j) = \sum_{p=1}^{k} \sum_{m=1}^{k} \beta_{ip} \beta_{jm} \sigma_{pm} = \beta_i^T \Sigma \beta_j$
Spectral Decomposition Theorem: $\Sigma = \beta D \beta^T$, where $\beta$ contains the eigenvectors
Easy to show that $\mathrm{var}(y_i) = \lambda_i$ and $\mathrm{covar}(y_i, y_j) = 0$
$\mathrm{trace}(\Sigma) = \sum_{i=1}^{k} \sigma_i^2 = \sum_{i=1}^{k} \lambda_i$
Total variance of the data is accounted for in the eigenvalues. Choosing the top $l \ll k$ eigenvalues and their corresponding eigenvectors gives you a distortion-minimizing low-dimensional projection
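A numpy sketch of PCA exactly as derived above – eigendecomposition of the covariance matrix, keeping the top $l$ eigenvectors (the Gaussian data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=500)

# Spectral decomposition of the covariance matrix.
Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

# Keep the top l eigenvectors; variance captured = sum of top eigenvalues.
l = 2
top = eigvecs[:, ::-1][:, :l]             # columns: leading eigenvectors
Y = (X - X.mean(axis=0)) @ top            # low-dimensional projection
explained = eigvals[::-1][:l].sum() / eigvals.sum()
```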
Linear Discriminant Analysis
PCA comes up with a projection that represents the data in a low-dimensional feature space
It is optimized towards representation. We are interested in discrimination

Figure: Gaussian Data
PCA vs LDA: An Example
Figure: PCA vs LDA comparison
LDA
Motivated from Bayes' Rule: $P(c|x) = \frac{P(x|c) P(c)}{\sum_d P(x|d) P(d)}$
Assume Gaussian distributed data
Further assume that classes have identical covariance
End up with: $f_i = \mu_i \Sigma^{-1} x^T - \frac{1}{2} \mu_i \Sigma^{-1} \mu_i^T + \ln(P(i))$, which is linear in $x$
See http://people.revoledu.com/kardi/tutorial/LDA/ for a nice tutorial
See http://www.ics.uci.edu/%7Ewelling/classnotes/papers_class/Fisher-LDA.pdf for a detailed explanation
The most discriminant projection comes out as the eigenvectors of $\Sigma^{-1} \Sigma_b$, where $\Sigma_b$ is the between-class covariance
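A short scikit-learn sketch contrasting the two projections on synthetic data matching the assumptions above (two Gaussian classes with identical covariance):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 200),
               rng.multivariate_normal([2, 0], cov, 200)])
y = np.array([0] * 200 + [1] * 200)

# PCA: best 1-D representation of the data, ignoring labels.
X_pca = PCA(n_components=1).fit_transform(X)

# LDA: best 1-D discrimination between the two classes.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
```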
Feature Selection
Not all features are important
Oftentimes, removing features from the model improves its performance
If there are $N$ features, there are $2^N$ subsets $\implies$ feature selection is a hard task!
Forward Feature Selection
Build a model with one feature
From all features, build models in turn and retain the model with the strongest performance
From the remaining features, keep adding one feature at a time (using the search method above)
Stop if there is no significant improvement (as sketched below)
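A rough sketch of the forward-selection loop, using cross-validated logistic regression as the (arbitrary) model and a fixed improvement threshold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def cv_score(cols):
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best = 0.0
while remaining:
    # Try adding each remaining feature; keep the strongest candidate.
    cand = max(remaining, key=lambda f: cv_score(selected + [f]))
    cand_score = cv_score(selected + [cand])
    if cand_score - best < 1e-3:   # no significant improvement: stop
        break
    selected.append(cand)
    remaining.remove(cand)
    best = cand_score
print(selected, best)
```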
Backward Feature Selection
Similar to the Forward Selection procedure
Build a model with all features
Remove the weakest feature
Keep removing as long as there is no significant change in performance
Regularization
The more parameters a model has, the more flexible it is
Too much flexibility leads to overfitting. The model learns the peculiarities of a particular data set and doesn't generalize well
Solution: simplify the model
Everything must be made as simple as possible, but not simpler!
We can add a penalty term to the objective function and do joint minimization:
$\arg\min_{\Theta} J(\Theta) + \lambda \|\Theta\|$
L2 Regularization / Ridge Regression
$\arg\min_{\Theta} J(\Theta) + \lambda \sum_i \theta_i^2$
Figure: θ values as a function of λ
In the end, you get regularized feature weights. Some weights are 0 or close to 0; these features are not important to the model.
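A small scikit-learn sketch showing the coefficients shrinking as the penalty grows (synthetic data; scikit-learn calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

# Coefficients shrink toward (but rarely exactly to) zero as lambda grows.
for lam in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    print(lam, np.round(ridge.coef_, 2))
```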
L1 Regularization / LASSO
$\arg\min_{\Theta} J(\Theta) + \lambda \sum_i |\theta_i|$
Figure: θ values as a function of λ
http://statweb.stanford.edu/~owen/courses/305/Rudyregularization.pdf
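The same setup with Lasso, showing that the L1 penalty drives coefficients exactly to zero (synthetic data as before):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

# Unlike Ridge, Lasso zeroes out uninformative coefficients entirely,
# performing feature selection as a side effect.
for lam in [0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=lam).fit(X, y)
    print(lam, np.sum(lasso.coef_ != 0), "non-zero coefficients")
```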
Geometric View of Lasso and Ridge
Figure: Lasso and Ridge Regularization
Elastic Net Regularization
If there is a set of highly correlated features, Lasso ends up selecting only one of them
On high-dimensionality, low-sample-count problems, Lasso saturates quickly
Ridge doesn't achieve the sparsity achieved by Lasso
$\arg\min_{\Theta} J(\Theta) + \lambda_1 \sum_i |\theta_i| + \lambda_2 \sum_i \theta_i^2$
Combines the advantages of both Lasso and Ridge, especially for large dimensionality problems
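A minimal scikit-learn sketch on a high-dimensionality, low-sample-count problem (scikit-learn parameterizes the two penalties via `alpha` and `l1_ratio` rather than separate $\lambda_1, \lambda_2$):

```python
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

# p >> n: 200 features, only 50 samples.
X, y = make_regression(n_samples=50, n_features=200,
                       n_informative=10, noise=5.0, random_state=0)

# alpha scales the overall penalty; l1_ratio mixes the L1 vs. L2 parts.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10_000).fit(X, y)
print((enet.coef_ != 0).sum(), "features kept of", X.shape[1])
```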
Model Evaluation
Accuracy: $\frac{\#\mathrm{Correct}}{\#\mathrm{Samples}}$

Confusion matrix:

                 Pred. +ve   Pred. -ve
    Actual +ve   TP          FN
    Actual -ve   FP          TN

Precision: $\frac{TP}{TP+FP}$, Recall: $\frac{TP}{TP+FN}$
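A quick scikit-learn check of these definitions on a toy prediction vector:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# sklearn's binary confusion matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)                   # 2 1 1 4
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
```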
Receiver Operating Characteristic
Precision and Recall reflect a particular setting of the decision threshold
E.g. Logistic Regression score / Naive Bayes negative log-likelihood – select any threshold you want
The ROC curve shows the performance for every possible choice of this threshold. AUC – the normalized area under the ROC curve – is another single number you can use to compare models
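A minimal sketch: score a held-out set with a logistic regression, then sweep the threshold via `roc_curve` (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# One (FPR, TPR) point per candidate threshold; AUC summarizes the curve.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```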
Handling rare class classification
Often, the class of interest is very rare
E.g. detecting a rare disease, or identifying a purchaser among average internet browsers
A default answer of "No" has a very high accuracy
But that is not a very useful classifier
Handling rare class classification
Penalize different classes differently (a minimal sketch follows below)
Subsample the background class / up-sample the foreground class
Construct hierarchical / back-off classifiers (Russian doll model):
Browser / Site Visitor
Shopping Cart / Site Visitor
Purchase / Shopping Cart

Figure: Russian Doll Model
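A minimal sketch of the first idea – penalizing classes differently – via scikit-learn's `class_weight` option (synthetic, heavily imbalanced data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Highly imbalanced problem: roughly 1% positive class.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

# "balanced" weights each class inversely proportional to its frequency,
# so mistakes on the rare class cost more.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
print((clf.predict(X) == 1).sum(), "predicted positives")
```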