Midterm Review

Upload: ralph-mccarthy

Post on 24-Dec-2015

1-Intro

• Data Mining vs. Statistics
  – Predictive vs. experimental; hypotheses vs. data-driven
• Different types of data
• Data Mining pitfalls
  – With lots of data you can find anything
• Data privacy and security
  – Good and bad examples
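The pitfall "with lots of data you can find anything" can be illustrated with a small simulation. This is only a sketch under assumed settings (20 rows, Gaussian noise, a fixed seed; the function names are hypothetical): every feature is pure noise, yet the strongest correlation with an equally random target grows as more features are searched.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_spurious_correlation(n_rows, n_features, seed=0):
    """Largest |correlation| between a pure-noise target and pure-noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    # Searching more noise features can only raise the best "finding".
    for p in (1, 10, 100, 1000):
        print(p, round(max_spurious_correlation(20, p), 2))
```

None of these correlations reflects real structure, which is why a pattern found by searching should be checked on data that was not used in the search.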

2- EDA and Visualization

• Good visualization is good analysis
• Examples of visualizations
  – 1-d, 2-d, multivariate
  – Histograms, boxplots, scatterplots, density estimates, etc.
  – Overplotting with many points
  – Conditional plots (small multiples)
  – Good, bad examples
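As a rough sketch of what two of the 1-d displays compute before anything is drawn: a histogram is a set of bin counts, and a boxplot is built from a five-number summary. The function names and the sample values below are hypothetical, and equal-width bins are an assumption (plotting libraries offer other binning rules).

```python
import statistics


def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[i] += 1
    return counts


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum (the boxplot numbers)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up sample with one large value
for count in histogram_counts(values, 4):
    print("#" * count)  # crude text histogram
print(five_number_summary(values))
```

The large value stretches the histogram range and shows up as a long upper whisker or an outlier point in a boxplot.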

3- Data mining concepts

• Preparing data for analysis
  – How to deal with missing data?
  – What are good transformations?
  – How to deal with outliers?
• Data reduction
  – Reducing n: sampling, subsetting
  – Reducing p:
    • Principal components: finding projections that preserve variance
      – Scree plot shows how much variance is accounted for by each PC
    • MDS:
      – Needs a distance matrix
      – Minimizes ‘stress function’
      – Mostly used for visualization and EDA
• In- vs. out-of-sample evaluation
  – In-sample: must penalize for complexity
  – Out-of-sample: use cross-validation to evaluate predictive performance
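The in-sample versus out-of-sample distinction can be sketched with k-fold cross-validation. The example below is an illustration on made-up data, not a method from the slides: the response is pure noise, a 1-nearest-neighbor model reproduces the training data perfectly, and cross-validation exposes that the fit does not generalize. The helper names (`fit_mean`, `fit_1nn`, `cv_mse`) are hypothetical.

```python
import random


def fit_mean(train):
    """Baseline model: always predict the training mean of y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_1nn(train):
    """Very flexible model: predict the y of the nearest training x."""
    return lambda x: min(train, key=lambda pt: abs(pt[0] - x))[1]


def mse(model, data):
    """Mean squared prediction error of model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def cv_mse(fit, data, k=5):
    """k-fold cross-validated MSE: each fold is predicted by a model fit on the rest."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        errors.append(mse(fit(train), held_out))
    return sum(errors) / k


rng = random.Random(1)
# Assumed data: y is noise around 0, so x carries no signal at all.
# The rows are already in random order, so the folds need no extra shuffling.
data = [(rng.random(), rng.gauss(0, 1)) for _ in range(100)]

for name, fit in (("mean", fit_mean), ("1-NN", fit_1nn)):
    print(name, "in-sample:", round(mse(fit(data), data), 3),
          "5-fold CV:", round(cv_mse(fit, data), 3))
```

The 1-NN in-sample error is exactly zero because every point is its own nearest neighbor. That is why in-sample comparisons need a complexity penalty, while the cross-validated error can be compared directly.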

3- Data mining concepts (continued)

• Complexity/performance tradeoff
• Evaluating classification models
  – Accuracy (how many did I get right): not the best choice, e.g. when classes are imbalanced
  – Precision/recall or sensitivity/specificity tradeoff
  – Varying the classification threshold traces out the ROC curve
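A minimal sketch of these evaluation measures, assuming binary labels (1 = positive) and a classifier that outputs a score, with made-up labels and scores. Function names are hypothetical; `roc_points` assumes both classes are present so the rates are defined.

```python
def confusion(y_true, scores, threshold):
    """Counts (tp, fp, tn, fn) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Common summaries of a confusion matrix (0.0 where a ratio is undefined)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,  # same as sensitivity
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def roc_points(y_true, scores):
    """(false positive rate, true positive rate) for each distinct threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion(y_true, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Made-up example: 2 positives among 10 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]

for threshold in (1.0, 0.5, 0.35):
    print(threshold, metrics(*confusion(y_true, scores, threshold)))
print(roc_points(y_true, scores))
```

At threshold 1.0 nothing is predicted positive, so accuracy is 0.8 while recall is 0: high accuracy alone says little when one class is rare.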

4-Regression

• Linear regression
  – What is it, what are the assumptions, how do you check them
  – Model selection
    • Exhaustive or greedy (forward/backward selection) search
• Extensions of linear regression
  – Non-linear in parameters, linear in form
  – Generalized Linear Models
    • Logistic regression
    • Poisson regression
  – Shrinkage
    • Ridge regression
    • Lasso regression
    • Profile plots show the trace of parameter estimates
  – Principal component regression
  – Nonparametric models
    • Smoothing splines
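Shrinkage is easiest to see with a single predictor, where both penalized estimates have closed forms. The sketch below assumes centered data and an unpenalized intercept, with ridge minimizing RSS + lam * b^2 and lasso minimizing RSS + lam * |b|. The data and function names are made up for illustration. Printing the estimates over a grid of lam values gives the numbers a profile (trace) plot would show.

```python
def centered_sums(x, y):
    """Return (Sxy, Sxx): centered cross-product and sum of squares."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy, sxx


def ridge_slope(x, y, lam):
    """Ridge estimate: Sxy / (Sxx + lam). Shrinks toward 0 but never reaches it."""
    sxy, sxx = centered_sums(x, y)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Lasso estimate: soft-thresholding of Sxy by lam / 2. Can be exactly 0."""
    sxy, sxx = centered_sums(x, y)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * max(abs(sxy) - lam / 2, 0.0) / sxx


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data, roughly y = 2x

# Trace of the two estimates as the penalty grows.
for lam in (0, 1, 10, 50, 100):
    print(lam, round(ridge_slope(x, y, lam), 3), round(lasso_slope(x, y, lam), 3))
```

With lam = 0 both reduce to the least-squares slope. As lam grows, the ridge estimate decays smoothly while the lasso estimate hits zero and stays there, which is why lasso also acts as variable selection.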

5-Classification

• Categorical or binary response: ‘supervised’ learning
• LDA: fit a parametric model to each class
• Classification (decision) trees
  – Binary splits on any predictor X
  – Best split found algorithmically by Gini or entropy to maximize purity
  – Best size can be found via cross-validation
  – Can be unstable
• K-Nearest Neighbors
  – Tradeoff of large/small k
• Probabilistic models
  – Bayes error rate: best possible error if model is correct
  – Naïve Bayes
    • Independence assumption on p(xi|c)
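How a tree picks a binary split can be sketched for one numeric predictor using Gini impurity. This is a simplified illustration with hypothetical function names and made-up data; real tree software repeats the search over every predictor and recurses on each child node.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(xs, labels):
    """Threshold on one predictor minimizing the weighted Gini of the two children."""
    pairs = sorted(zip(xs, labels))
    n = len(pairs)
    best = (None, gini(labels))  # (threshold, impurity); None means no split helps
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            # Place the threshold midway between the two neighboring x values.
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


xs = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(gini(labels))            # impurity before splitting
print(best_split(xs, labels))  # threshold and weighted impurity after splitting
```

Because the search is greedy and driven by the training sample, a small change in the data can move the chosen threshold or predictor, which is one source of the instability noted above.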

6-Clustering

• No response variable: ‘unsupervised’ learning
• Needs distance measures
  – Euclidean, cosine, Jaccard, edit, ordinal and categorical
• K-means
  – Select initial solution
  – Classify points, then re-calculate means
• Hierarchical clustering
  – Solutions for all k from 1 to n
  – Dendrogram is an effective visualization
  – Different distance functions (links) will result in different clusterings
• Probabilistic
  – Mixture models fit using EM algorithm
  – Model-based clustering
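The two K-means steps (classify points, then re-calculate means) can be sketched as follows. This is a minimal version of Lloyd's algorithm under assumed choices: squared Euclidean distance, initial centers drawn at random from the data, and a fixed seed. The function name and the sample points are hypothetical.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on tuples of floats; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial solution: k distinct data points
    assign = None
    for _ in range(n_iter):
        # Classify: assign each point to its nearest center (squared Euclidean).
        new_assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing: converged
        assign = new_assign
        # Re-calculate each center as the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assign = kmeans(points, 2)
print(centers)
print(assign)
```

On well-separated data like this the result does not depend on the starting centers. In general K-means only finds a local optimum, so it is common to run it from several initial solutions and keep the best.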
