Feature Engineering, Model Evaluations
Giri Iyengar
Cornell University
Feb 5, 2018
Overview

1 ETL
2 Feature Engineering: Handling Numerical Features; Handling Categorical Features
3 Dealing with correlated features
4 Feature Selection: Forward and Backward Selection; Via Regularization
5 Evaluating Model Performance: Handling rare classes
ETL: Preparing your data for modeling
Frequently, the data that you collect is in long format. For example, logging systems will put timestamps, some entity id, and then a few columns.
In addition, you might need to join multiple inputs together.
Further, you may want to apply some transformations / aggregations to the individual columns.
All of these can be considered jointly as an ETL operation.
Frequently in Data Science, we end up with ELT. That is, the Transformation step is done closer to the modeling stage.
ETL: Preparing your data for modeling
Typical output found in a log file (long format). You need to convert this into a usable format by pivoting the individual rows:

          Month  Day  Feature  Value
    1         5    1  Ozone       41
    2         5    2  Ozone       36
    3         5    5  Ozone       NA
    ...
    610       9   28  Temp        75
    611       9   29  Temp        76
    612       9   30  Temp        68

After pivoting:

    Month  Day  Ozone  Solar.R  Wind  Temp
        5    1     41      190   7.4    67
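A minimal pandas sketch of this pivot (the frame and its values are illustrative, mirroring the log-file layout above):

```python
import pandas as pd

# Long-format log data: one (Month, Day, Feature) measurement per row.
long_df = pd.DataFrame({
    "Month":   [5, 5, 5, 5, 5, 5, 5, 5],
    "Day":     [1, 1, 1, 1, 2, 2, 2, 2],
    "Feature": ["Ozone", "Solar.R", "Wind", "Temp"] * 2,
    "Value":   [41, 190, 7.4, 67, 36, 118, 8.0, 72],
})

# Pivot: one row per (Month, Day), one column per distinct Feature.
wide_df = (
    long_df.pivot_table(index=["Month", "Day"],
                        columns="Feature",
                        values="Value")
           .reset_index()
)
print(wide_df)
```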
Feature Engineering
In addition to pivoting, you might look at various aggregations / combinations (see the sketch below):
How many times did this event occur?
In the past time window, at this location, how many times did the event occur?
How much does this event deviate from the mean?
Usually task dependent – with deep subject matter expertise, you ask more meaningful questions.
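As a rough sketch of such aggregation features in pandas (the event log, column names, and window size are made up for illustration):

```python
import pandas as pd

# Hypothetical event log: timestamped events per location.
events = pd.DataFrame({
    "ts":       pd.to_datetime(["2018-01-01 09:00", "2018-01-01 09:30",
                                "2018-01-01 10:10", "2018-01-02 09:05"]),
    "location": ["A", "A", "B", "A"],
    "value":    [3.0, 5.0, 2.0, 9.0],
})

# How many times did this event occur (per location)?
counts = events.groupby("location").size()

# Occurrences per location within a trailing 1-hour window.
windowed = (events.set_index("ts")
                  .groupby("location")["value"]
                  .rolling("1h").count())

# How far does each event deviate from its location's mean (z-score)?
g = events.groupby("location")["value"]
events["z"] = (events["value"] - g.transform("mean")) / g.transform("std")
```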
Numerical Features
Count – how many times did something happen
Amount – how much was it
Continuous – both positive and negative values have meaning
Normalization techniques
Standardize the numerical variables
Min-max normalization
Sigmoid / tanh / log transformations
Handle zeros distinctly – potentially important for count-based features
Decorrelate / transform variables
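A short scikit-learn sketch of a few of these transformations, on an illustrative count feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[0.0], [1.0], [10.0], [100.0]])  # e.g. a count feature

standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmaxed     = MinMaxScaler().fit_transform(X)    # scaled into [0, 1]
logged       = np.log1p(X)                        # log(1 + x), keeps zeros at 0

# Handling zeros distinctly: add an explicit "is zero" indicator column.
with_flag = np.hstack([logged, (X == 0).astype(float)])
```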
One Hot Encoding
Transforms an attribute taking d distinct values into a d-dimensional binary indicator vector (or d − 1 dimensions if one value is kept as a reference level)
Simple, easy, and gives a great deal of interpretability of the attribute
Great technique for a small number of distinct values
A categorical attribute with a large number of distinct values becomes an issue (e.g. zipcode/MSA/items)

           Label
    Dog    1 0 0
    Cat    0 1 0
    Mouse  0 0 1
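A quick pandas sketch of both variants (full one-hot and reference-level coding):

```python
import pandas as pd

labels = pd.Series(["Dog", "Cat", "Mouse", "Dog"])

# Full one-hot encoding: one indicator column per distinct value.
onehot = pd.get_dummies(labels)

# Reference-level (dummy) coding: drop one column, leaving d - 1 indicators.
dummy = pd.get_dummies(labels, drop_first=True)
print(onehot)
```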
Target Rate Encoding
Compute $P(c|d)$ empirically for each distinct value $d$ and target class $c$
Replace the attribute by a $k - 1$-dimensional vector (one entry for each target class)
Statistics need to be robust enough (take care of distinct values with small counts)
Handle missing values – those not in the training partition

Example – raw data:

    Feature  Target
    Up       1
    Up       0
    Down     1
    Flat     0
    Flat     1
    Down     0
    Up       1
    Down     0

Resulting encoding (empirical target rate per value):

    Feature  Encoding
    Up       0.66
    Down     0.33
    Flat     0.5
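A minimal pandas sketch of target rate encoding on the example above, with one simple fallback for values unseen in training:

```python
import pandas as pd

df = pd.DataFrame({
    "feature": ["Up", "Up", "Down", "Flat", "Flat", "Down", "Up", "Down"],
    "target":  [1, 0, 1, 0, 1, 0, 1, 0],
})

# Empirical P(target=1 | feature value), learned on the training partition.
rates = df.groupby("feature")["target"].mean()

# Replace the attribute by its target rate; unseen values fall back
# to the global rate (one simple way to handle missing values).
global_rate = df["target"].mean()
df["feature_enc"] = df["feature"].map(rates).fillna(global_rate)
print(rates)  # Down 0.33, Flat 0.5, Up 0.66
```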
TF-IDF and related techniques
Use summary statistics to reduce the number of distinct values
Use informative measures such as TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
TF and IDF can both be computed in several ways
TF: raw, boolean, log-scaled, augmented (to account for document size variations)
IDF: raw, log-scaled, smoothed log-scaled, . . .
Use entropy ($-\sum_i p_i \log p_i$) to select distinct values of importance
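A small sketch using scikit-learn's TfidfVectorizer, which implements one particular choice of weightings (raw term counts for TF, smoothed log-scaled IDF, L2-normalized rows); the documents are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse (3 docs x vocabulary) matrix
print(vectorizer.get_feature_names_out())
```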
Low Rank Factorizations
Low-rank approximation is an optimization technique
Cost is the closeness of fit between the original matrix and its low-rank approximation
PCA/SVD
Non-negative Matrix Factorization (NMF): $\Theta^T X \approx Y$. Use Gradient Descent, for instance
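A sketch of both factorizations with scikit-learn, on a random non-negative matrix (sizes and rank chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD, NMF

rng = np.random.default_rng(0)
X = rng.random((100, 20))  # illustrative non-negative data matrix

# Rank-5 SVD approximation: minimizes Frobenius-norm reconstruction error.
svd = TruncatedSVD(n_components=5)
X_svd = svd.fit_transform(X)   # (100, 5) reduced representation

# NMF: X ~ W @ H with W, H >= 0, fit by iterative (gradient-style) updates.
nmf = NMF(n_components=5, max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_
reconstruction = W @ H         # low-rank approximation of X
```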
LSH/Random Hashing
When dimensionality is really large, oftentimes a random low-dimensional projection seems to work
That is, take your really large dimensional vector and project it down to a really small space randomly
Collisions will occur but they don't seem to matter. Interpretability is lost, of course
LSH (Locality Sensitive Hashing) is a more principled approach that does the same
Several types of projections (Walsh-Hadamard matrix, random Gaussian matrix) have proven distance-preserving properties
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
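A minimal random-projection sketch with scikit-learn (the dimensions are illustrative):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.random((50, 10_000))   # 50 samples in a very high-dimensional space

# Random Gaussian projection down to 100 dimensions; pairwise distances
# are approximately preserved (Johnson-Lindenstrauss lemma).
proj = GaussianRandomProjection(n_components=100, random_state=0)
X_low = proj.fit_transform(X)  # shape (50, 100)
```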
PCA/SVD
Useful technique to analyze the inter-relationships between variables
Feature dimensionality reduction with minimum loss of information
$y_i = \sum_{j=1}^{k} \beta_{ij} x_j$, i.e. $y_i = \beta_i^T X$ and $Y = \beta^T X$
$\Sigma$ is the covariance matrix of $X$, and $\Sigma_Y$ is the covariance matrix of $Y$
PCA
$\Sigma_Y = \beta^T \Sigma \beta$
$\mathrm{var}(y_i) = \sum_{p=1}^{k} \sum_{m=1}^{k} \beta_{ip} \beta_{im} \sigma_{pm} = \beta_i^T \Sigma \beta_i$
$\mathrm{covar}(y_i, y_j) = \sum_{p=1}^{k} \sum_{m=1}^{k} \beta_{ip} \beta_{jm} \sigma_{pm} = \beta_i^T \Sigma \beta_j$
Spectral Decomposition Theorem: $\Sigma = \beta D \beta^T$, where $\beta$ contains the eigenvectors
Easy to show that $\mathrm{var}(y_i) = \lambda_i$ and $\mathrm{covar}(y_i, y_j) = 0$
$\mathrm{trace}(\Sigma) = \sum_{i=1}^{k} \sigma_i^2 = \sum_{i=1}^{k} \lambda_i$
Total variance of the data is accounted for in the eigenvalues. Choosing the top $l \ll k$ eigenvalues and their corresponding eigenvectors gives you a distortion-minimizing low-dimensional projection
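A numpy sketch of PCA exactly as derived above – eigendecomposition of the covariance matrix, keeping the top $l$ eigenvectors (the Gaussian data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=500)

# Spectral decomposition of the covariance matrix.
Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

# Keep the top l eigenvectors; variance captured = sum of top eigenvalues.
l = 2
top = eigvecs[:, ::-1][:, :l]             # columns: leading eigenvectors
Y = (X - X.mean(axis=0)) @ top            # low-dimensional projection
explained = eigvals[::-1][:l].sum() / eigvals.sum()
```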
Linear Discriminant Analysis
PCA comes up with a projection that represents the data in a low-dimensional feature space
It is optimized towards representation. We are interested in discrimination

Figure: Gaussian Data
PCA vs LDA: An Example
Figure: PCA vs LDA comparison
LDA
Motivated from Bayes' Rule: $P(c|x) = \frac{P(x|c) P(c)}{\sum_d P(x|d) P(d)}$
Assume Gaussian distributed data
Further assume that classes have identical covariance
End up with: $f_i = \mu_i \Sigma^{-1} x^T - \frac{1}{2} \mu_i \Sigma^{-1} \mu_i^T + \ln(P(i))$, which is linear in $x$
See http://people.revoledu.com/kardi/tutorial/LDA/ for a nice tutorial
See http://www.ics.uci.edu/%7Ewelling/classnotes/papers_class/Fisher-LDA.pdf for a detailed explanation
The most discriminant projection comes out as the eigenvectors of $\Sigma^{-1} \Sigma_b$, where $\Sigma_b$ is the between-class covariance
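A short scikit-learn sketch contrasting the two projections on synthetic data matching the assumptions above (two Gaussian classes with identical covariance):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 200),
               rng.multivariate_normal([2, 0], cov, 200)])
y = np.array([0] * 200 + [1] * 200)

# PCA: best 1-D representation of the data, ignoring labels.
X_pca = PCA(n_components=1).fit_transform(X)

# LDA: best 1-D discrimination between the two classes.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
```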
Feature Selection
Not all features are important
Oftentimes, removing features from the model improves its performance
If there are $N$ features, there are $2^N$ subsets $\implies$ feature selection is a hard task!
Forward Feature Selection
Build a model with one feature
From all features, build models in turn and retain the model with the strongest performance
From the remaining features, keep adding one feature at a time (using the search method above)
Stop if there is no significant improvement (as sketched below)
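A rough sketch of the forward-selection loop, using cross-validated logistic regression as the (arbitrary) model and a fixed improvement threshold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def cv_score(cols):
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best = 0.0
while remaining:
    # Try adding each remaining feature; keep the strongest candidate.
    cand = max(remaining, key=lambda f: cv_score(selected + [f]))
    cand_score = cv_score(selected + [cand])
    if cand_score - best < 1e-3:   # no significant improvement: stop
        break
    selected.append(cand)
    remaining.remove(cand)
    best = cand_score
print(selected, best)
```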
Backward Feature Selection
Similar to the Forward Selection procedure
Build a model with all features
Remove the weakest feature
Keep removing as long as there is no significant change in performance
Regularization
The more parameters a model has, the more flexible it is
Too much flexibility leads to overfitting. The model learns the peculiarities of a particular data set and doesn't generalize well
Solution: simplify the model
Everything must be made as simple as possible, but not simpler!
We can add a penalty term to the objective function and do joint minimization:
$\arg\min_{\Theta} J(\Theta) + \lambda \|\Theta\|$
L2 Regularization / Ridge Regression
$\arg\min_{\Theta} J(\Theta) + \lambda \sum_i \theta_i^2$
Figure: θ values as a function of λ
In the end, you get regularized feature weights. Some weights are 0 or close to 0; these features are not important to the model.
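A small scikit-learn sketch showing the coefficients shrinking as the penalty grows (synthetic data; scikit-learn calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

# Coefficients shrink toward (but rarely exactly to) zero as lambda grows.
for lam in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    print(lam, np.round(ridge.coef_, 2))
```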
L1 Regularization / LASSO
$\arg\min_{\Theta} J(\Theta) + \lambda \sum_i |\theta_i|$
Figure: θ values as a function of λ
http://statweb.stanford.edu/~owen/courses/305/Rudyregularization.pdf
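The same setup with Lasso, showing that the L1 penalty drives coefficients exactly to zero (synthetic data as before):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

# Unlike Ridge, Lasso zeroes out uninformative coefficients entirely,
# performing feature selection as a side effect.
for lam in [0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=lam).fit(X, y)
    print(lam, np.sum(lasso.coef_ != 0), "non-zero coefficients")
```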
Geometric View of Lasso and Ridge
Figure: Lasso and Ridge Regularization
Elastic Net Regularization
If there is a set of highly correlated features, Lasso ends up selecting only one of them
On high-dimensionality, low-sample-count problems, Lasso saturates quickly
Ridge doesn't achieve the sparsity achieved by Lasso
$\arg\min_{\Theta} J(\Theta) + \lambda_1 \sum_i |\theta_i| + \lambda_2 \sum_i \theta_i^2$
Combines the advantages of both Lasso and Ridge, especially for large dimensionality problems
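A minimal scikit-learn sketch on a high-dimensionality, low-sample-count problem (scikit-learn parameterizes the two penalties via `alpha` and `l1_ratio` rather than separate $\lambda_1, \lambda_2$):

```python
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

# p >> n: 200 features, only 50 samples.
X, y = make_regression(n_samples=50, n_features=200,
                       n_informative=10, noise=5.0, random_state=0)

# alpha scales the overall penalty; l1_ratio mixes the L1 vs. L2 parts.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10_000).fit(X, y)
print((enet.coef_ != 0).sum(), "features kept of", X.shape[1])
```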
Model Evaluation
Accuracy: $\frac{\#\mathrm{Correct}}{\#\mathrm{Samples}}$

Confusion matrix:

                 Pred. +ve   Pred. -ve
    Actual +ve   TP          FN
    Actual -ve   FP          TN

Precision: $\frac{TP}{TP+FP}$, Recall: $\frac{TP}{TP+FN}$
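A quick scikit-learn check of these definitions on a toy prediction vector:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# sklearn's binary confusion matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)                   # 2 1 1 4
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
```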
Receiver Operating Characteristic
Precision and Recall reflect a particular setting of the decision threshold
E.g. Logistic Regression score / Naive Bayes negative log-likelihood – select any threshold you want
The ROC curve shows the performance for every possible choice of this threshold. AUC – the normalized area under the ROC curve – is another single number you can use to compare models
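A minimal sketch: score a held-out set with a logistic regression, then sweep the threshold via `roc_curve` (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# One (FPR, TPR) point per candidate threshold; AUC summarizes the curve.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```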
Handling rare class classification
Often, the class of interest is very rare
E.g. detecting a rare disease, or identifying a purchaser among average internet browsers
A default answer of "No" has a very high accuracy
But that is not a very useful classifier
Handling rare class classification
Penalize different classes differently (a minimal sketch follows below)
Subsample the background class / up-sample the foreground class
Construct hierarchical / back-off classifiers (Russian doll model):
Browser / Site Visitor
Shopping Cart / Site Visitor
Purchase / Shopping Cart

Figure: Russian Doll Model
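A minimal sketch of the first idea – penalizing classes differently – via scikit-learn's `class_weight` option (synthetic, heavily imbalanced data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Highly imbalanced problem: roughly 1% positive class.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

# "balanced" weights each class inversely proportional to its frequency,
# so mistakes on the rare class cost more.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
print((clf.predict(X) == 1).sum(), "predicted positives")
```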