Midterm Review
1-Intro
• Data Mining vs. Statistics
  – Predictive vs. experimental; hypothesis-driven vs. data-driven
• Different types of data
• Data Mining pitfalls
  – With lots of data you can find anything
• Data privacy and security
  – Good and bad examples
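The "with lots of data you can find anything" pitfall can be illustrated with a small simulation. This is only a sketch: the helper names (`pearson`, `best_spurious_correlation`) and the sizes (20 rows, 1000 candidate features) are assumptions chosen for illustration, not something from the slides. Both the target and every feature are pure noise, yet searching over many features tends to turn up one that looks strongly correlated.

```python
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def best_spurious_correlation(n_rows=20, n_features=1000, seed=0):
    """Largest |correlation| between a noise target and many noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]  # unrelated to target
        best = max(best, abs(pearson(feature, target)))
    return best

print(round(best_spurious_correlation(), 2))
```

None of these features has any real relationship to the target, so a large value here comes only from searching many candidates on few rows.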
2- EDA and Visualization
• Good visualization is good analysis
• Examples of vis
  – 1-d, 2-d, multivariate
  – Histograms, boxplots, scatterplots, density estimates, etc.
  – Overplotting with many points
  – Conditional plots (small multiples)
  – Good, bad examples
3- Data mining concepts
• Preparing data for analysis
  – How to deal with missing data?
  – What are good transformations?
  – How to deal with outliers?
• Data reduction
  – Reducing n: sampling, subsetting
  – Reducing p:
    • Principal components: finding projections that preserve variance
      – Scree plot shows how much variance is accounted for by each PC
    • MDS:
      – Needs a distance matrix
      – Minimizes a ‘stress function’
      – Mostly used for visualization and EDA
• In-sample vs. out-of-sample evaluation
  – In-sample: must penalize for complexity
  – Out-of-sample: use cross-validation to evaluate predictive performance
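The in-sample vs. out-of-sample point can be sketched with a deliberately complex model. The example below is an illustration under assumptions: the data (`y = x + noise`), the 1-nearest-neighbor regressor, and the helper names are all hypothetical. Because 1-NN memorizes its training set, its in-sample error is zero, while 5-fold cross-validation reports the error on points the model has not seen.

```python
import random

def knn_predict(train, x, k):
    """Average the y-values of the k training points whose x is closest."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, test, k):
    """Mean squared error on `test` of k-NN predictions built from `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def cross_val_mse(data, k, n_folds=5):
    """Average held-out MSE; fold f holds out every n_folds-th point."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        scores.append(mse(train, test, k))
    return sum(scores) / n_folds

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
data = [(x, x + rng.gauss(0, 1)) for x in xs]  # hypothetical data: y = x + noise

in_sample = mse(data, data, k=1)          # each point is its own nearest neighbor
out_of_sample = cross_val_mse(data, k=1)  # evaluated on held-out points only
print(in_sample, out_of_sample)
```

The zero in-sample error says nothing about predictive performance, which is why an in-sample criterion needs a complexity penalty and why the held-out estimate is the one to trust for prediction.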
3- Data mining concepts
• Complexity/Performance tradeoff
• Evaluating Classification models
  – Accuracy (how many did I get right): not the best choice
  – Precision/recall or Sensitivity/specificity tradeoff
  – Selecting different thresholds for ROC curve
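A small worked example may help show why accuracy alone can mislead and how thresholds generate ROC points. The scores and labels below are made up for illustration (one positive among ten cases), and the function names are hypothetical. Two thresholds give the same accuracy but very different recall.

```python
def confusion(scores, labels, threshold):
    """Return (TP, FP, TN, FN), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(scores, labels, threshold):
    """Accuracy, precision, recall (sensitivity), and specificity at a threshold."""
    tp, fp, tn, fn = confusion(scores, labels, threshold)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

def roc_points(scores, labels):
    """One (false positive rate, true positive rate) pair per distinct threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        m = metrics(scores, labels, threshold)
        points.append((1 - m["specificity"], m["recall"]))
    return points

# Hypothetical, imbalanced data: a single positive case out of ten.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.6, 0.7, 0.2, 0.1, 0.3, 0.1, 0.2, 0.4, 0.3, 0.1]

print(metrics(scores, labels, 0.9))  # predicts all negative
print(metrics(scores, labels, 0.5))  # catches the positive, one false alarm
print(roc_points(scores, labels))
```

At threshold 0.9 the classifier never predicts positive and still gets 90% accuracy with zero recall. At 0.5 the accuracy is unchanged but recall is 1.0. That difference is what the precision/recall and sensitivity/specificity views expose.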
4-Regression
• Linear regression
  – What is it, what are the assumptions, how do you check them
  – Model selection
    • Exhaustive or Greedy (forward/backward selection) search
• Extensions of Linear regression
  – Non-linear in parameters, linear in form
  – Generalized Linear Models
    • Logistic regression
    • Poisson regression
  – Shrinkage
    • Ridge regression
    • Lasso regression
    • Profile plots show the trace of parameter estimates
  – Principal component regression
  – Nonparametric models
    • Smoothing splines
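The shrinkage idea can be sketched for the simplest case: one predictor with an unpenalized intercept, where the ridge slope has the closed form sum(xy) / (sum(x²) + λ) on centered data. The data values and the name `ridge_slope` are assumptions for illustration. Evaluating the slope over a grid of penalties gives the kind of trace a profile plot displays.

```python
def ridge_slope(x, y, lam):
    """Ridge slope for one predictor: centered sum(xy) / (centered sum(x^2) + lam)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / (sxx + lam)

# Hypothetical data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# lam = 0 is ordinary least squares; larger penalties shrink the slope toward 0.
trace = [(lam, ridge_slope(x, y, lam)) for lam in (0, 1, 10, 100)]
for lam, slope in trace:
    print(lam, round(slope, 3))
```

With several predictors, ridge needs a matrix solve and lasso needs an iterative solver, so this one-predictor form is only meant to make the direction of the shrinkage concrete.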
5-Classification
• Categorical or binary response – ‘supervised’ learning
• LDA: fit a parametric model to each class
• Classification (decision) trees
  – Binary splits on any predictor X
  – Best split found algorithmically by Gini or entropy to maximize purity
  – Best size can be found via cross validation
  – Can be unstable
• K-Nearest Neighbors
  – Tradeoff of large/small k
• Probabilistic models
  – Bayes error rate: best possible error if model is correct
  – Naïve Bayes
    • Independence assumption on p(xi|c)
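How a tree picks a binary split by Gini can be sketched for a single numeric predictor. This is a minimal illustration, not a full tree: the function names and the toy data are hypothetical, and candidate thresholds are taken as midpoints between consecutive distinct values, which is one common convention.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Threshold on one predictor minimizing the weighted Gini of the two children."""
    best_threshold, best_score = None, float("inf")
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical data: class "a" at small x, class "b" at large x.
xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))
```

A full tree would repeat this search over every predictor and recurse into each child; swapping `gini` for an entropy function changes only the purity measure.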
6-Clustering
• No response variable – ‘unsupervised’ learning
• Needs distance measures
  – Euclidean, cosine, Jaccard, edit, ordinal and categorical
• K-means
  – Select initial solution
  – Classify points, then re-calculate means
• Hierarchical clustering
  – Solutions for all k from 1 to n
  – Dendrogram is an effective visualization
  – Different distance functions (links) will result in different clusterings
• Probabilistic
  – Mixture models fit using EM algorithm
  – Model based clustering
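The two K-means steps listed above (select an initial solution, then alternate between classifying points and re-calculating means) can be sketched as follows. The function signature, the squared Euclidean distance, and the toy points are assumptions for illustration. The initial means are passed in explicitly so the run is reproducible.

```python
def kmeans(points, initial_means, max_iter=100):
    """Alternate assignment and mean updates until the means stop changing."""
    means = list(initial_means)
    k = len(means)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Classify: assign each point to its nearest mean (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, means[i])),
            )
            clusters[nearest].append(p)
        # Re-calculate: each mean becomes the centroid of its cluster
        # (an empty cluster keeps its previous mean).
        new_means = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_means == means:
            break
        means = new_means
    return means, clusters

# Hypothetical data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
means, clusters = kmeans(points, initial_means=points[:2])
print(means)
```

The result depends on the initial solution: on this same data, starting from `(0, 1)` and `(1, 0)` stalls at a poor split that mixes the two groups, which is one reason K-means is often run from several starting points.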