slides session 1
![Page 1: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/1.jpg)
DATA MINING - 10 FEBRUARY 2004
Data Mining
Luc Dehaspe
K.U.L. Computer Science Department
-
Marc Van Hulle
K.U.L. Neurofysiologie Department
http://toledo.kuleuven.ac.be/
![Page 2: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/2.jpg)
DATA MINING - 10 FEBRUARY 2004 © LUC DEHASPE - 2004
Course overview
Data Mining
Session 1: Introduction
Session 2-3: Data warehousing/preparation
Session 4-6: Symbolic Data Mining techniques
Session 7: Application + Evaluation of Data Mining results
Session 8-14: Numeric Data Mining methods
• statistical techniques
• self-organizing techniques
(Hands-on) Exercise sessions
![Page 3: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/3.jpg)
Exercise session
Part 1 (L. Dehaspe): 2 × 2.5 h “paper-and-pencil” sessions
application of algorithms
Part 2 (M. Van Hulle): hands-on exercises
![Page 4: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/4.jpg)
Exam
Written exam, closed book
Part 1 (Sessions 1-7): 50%
Coverage
Questions RESTRICTED TO CONTENT OF SLIDES
Occasional pointers to additional material: I do not expect you to study this material
Questions
One main question: apply + understand algorithm (30%)
Two smaller questions: explain concept, compute model quality, … (2 × 10%)
Part 2 (Sessions 8-14): 50% (explained later by Marc Van Hulle)
![Page 5: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/5.jpg)
Working definition data mining
tools to search data for patterns and relationships that lead to better business decisions
“business”: commercial/scientific
![Page 6: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/6.jpg)
Overview
myths and facts
the Data Mining process
methods: visual, non-visual
![Page 7: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/7.jpg)
Myths and facts
New technology cycle
phase 1: hype (unrealistic expectations, “naive” users)
phase 2: frustration
phase 3: rejection
Alternative: realistic view on vital technology
![Page 8: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/8.jpg)
Myth 1: tabula rasa (virgin territory)
Data mining methods are fundamentally different from previous methods
Fact
Underlying ideas often decades old
neural networks: 1940
k-nearest neighbour: 1950
CART (regression trees): 1960
Novel
integrated applications to general “business” problems
more data, more computing power
non-academic users
![Page 9: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/9.jpg)
Take home lesson 1
Not: 1 optimal method
But: portfolio of tools, mixture of old and new
![Page 10: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/10.jpg)
Myth 2: manna from heaven
Data mining produces surprising results that will turn your “business” upside-down
without any input of domain expert knowledge
without any tuning of the technology
Fact
incremental changes rather than revolutionary ones
long-term competitive advantage
occasional breakthroughs (e.g. the link between aspirin and Reye's syndrome)
technology is an assistant to the domain expert
careful selection required of: goal, technology
![Page 11: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/11.jpg)
Take home lesson 2
Crucial combination of
“business” (application domain) expertise
data mining technology expertise
![Page 12: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/12.jpg)
Data Mining process model
Definition
Link with the scientific method
![Page 13: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/13.jpg)
The data mining process
process : iterative; learn to ask better questions
valid : patterns can be generalized to new data
novel and useful : offer a competitive advantage
understandable : contribute to insight in the domain
The non-trivial process of finding valid, novel, potentially useful, and ultimately understandable patterns in data
![Page 14: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/14.jpg)
Interrogating the database: Look-up queries
What is the average toxicity of cadmium chloride?
[Figure: databases of biological, clinical, and chemical data]
How many earthquakes occurred last year?
Which customers have car insurance?
How did HIV patient p123 react to AZT?
![Page 15: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/15.jpg)
Interrogating the database: Finding patterns
What is the relation between geological features and the occurrence of earthquakes?
[Figure: Data Mining applied to databases of biological, clinical, and chemical data]
What is the relation between in vitro activity and chemical structure?
What is the relation between the HIV patient’s therapy history and response to AZT?
What is the profile of returning customers?
![Page 16: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/16.jpg)
[Figure: eight numbered examples (1-8), labelled NON-ACTIVE and ACTIVE]
Science
![Page 17: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/17.jpg)
Science
collect data
build hypothesis
verify hypothesis
formulate theory
Tycho Brahe (1546-1601)
observational genius
collected data on Mars
Johannes Kepler (1571-1630)
mined Brahe’s data
discovered laws of planetary motion
The formation of hypotheses is the most mysterious of all the categories of scientific method. Where they come from, no one knows. A person is sitting somewhere, minding his own business, and suddenly - flash! - he understands something he didn’t understand before.
Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance
The actual discovery of such an explanatory hypothesis is a process of creation, in which imagination as well as knowledge is involved.
Irving Copi, Introduction to Logic, 1986
![Page 18: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/18.jpg)
Evolution of data generation
[Figure: data source, data, and data analyst, before 1950 versus after 2000]
Data Rich, Knowledge Poor
Everyone, even the most patient and thorough investigator, must pick and choose, deciding which facts to study and which to pass over.
Irving Copi, Introduction to Logic, 1986
![Page 19: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/19.jpg)
The scientific method
collect data
build hypothesis
verify hypothesis
formulate theory
Data Mining
Statistics - OLAP
care inspiration
Knowledge discovery in Databases
Data warehousing
![Page 20: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/20.jpg)
Data Mining Definition:
Extracting or “mining” knowledge from large amounts of data
CRISP-DM process model
![Page 21: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/21.jpg)
Data mining in industry
An in silico research assistant allowing researchers to
explore an integrated database
for a variety of research purposes (“business goals”)
using an optimal selection of data mining technologies
pattern
knowledge
![Page 22: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/22.jpg)
Data Mining process model CRISP-DM
![Page 23: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/23.jpg)
Business understanding
Which are the business goals?
Translation to data mining problem definition
Design of a plan to meet objectives
![Page 24: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/24.jpg)
Data understanding
First collection of data
Becoming familiar with the data
Judge data quality
Discovery of first insights and interesting subsets
![Page 25: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/25.jpg)
Data preparation
Extract final data set from original set
Selection of: tables, records, attributes
transformation
data cleaning
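The preparation steps above (selection, transformation, cleaning) can be sketched on a small in-memory table. This is only an illustration: the records and the attribute names (`customer_id`, `age`, `frame`, `cost`, `note`) are hypothetical and do not come from the course material.

```python
# Hypothetical raw records; attribute names and values are illustrative only.
raw_records = [
    {"customer_id": 1, "age": "34", "frame": "2-Door", "cost": "31000", "note": "x"},
    {"customer_id": 2, "age": "", "frame": "4-door", "cost": "18000", "note": "y"},
    {"customer_id": 3, "age": "61", "frame": "2-DOOR", "cost": "n/a", "note": "z"},
    {"customer_id": 4, "age": "45", "frame": "4-Door", "cost": "22000", "note": "w"},
]

SELECTED_ATTRIBUTES = ("age", "frame", "cost")


def prepare(records):
    """Select attributes, drop unusable records, and transform values."""
    prepared = []
    for record in records:
        # Selection of attributes: keep only the columns needed for modelling.
        row = {name: record[name] for name in SELECTED_ATTRIBUTES}
        # Data cleaning: skip records with missing or non-numeric values.
        if not (row["age"].isdigit() and row["cost"].isdigit()):
            continue
        # Transformation: convert types and normalise spelling and units.
        prepared.append({
            "age": int(row["age"]),
            "frame": row["frame"].lower(),
            "cost_k": int(row["cost"]) / 1000,
        })
    return prepared


final_data_set = prepare(raw_records)
```

Dropping incomplete records is only one possible cleaning policy; imputing missing values is another, and the choice depends on the modelling technique used afterwards.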
![Page 26: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/26.jpg)
Modelling
Selection of modelling techniques
calibrating parameters
regular backtracking to adapt data to technology
(some techniques discussed further on)
![Page 27: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/27.jpg)
Evaluation
Decide whether to use Data Mining results
Verification of all steps
Check whether business goals have been met
![Page 28: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/28.jpg)
Deployment
Organisation & presentation of new insights
variable complexity
deliver report
implement software that allows the process to be repeated
![Page 29: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/29.jpg)
Visual Data Mining methods
Pro
an image has a broader information bandwidth than text
(cf. a picture is worth a thousand words)
Con
problems with representation of > 3 dimensions
not effective in case of color blindness
interpretation gives more information on subject than on object
(stars, clouds, Hermann Rorschach test)
![Page 30: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/30.jpg)
Visual Data Mining methods: Error detection
![Page 31: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/31.jpg)
Visual Data Mining methods: Linkage analysis
![Page 32: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/32.jpg)
Visual Data Mining methods: Conditional probabilities
![Page 33: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/33.jpg)
Visual Data Mining methods: Landscapes
![Page 34: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/34.jpg)
Visual Data Mining methods: Scatter plots
![Page 35: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/35.jpg)
Non-visual data mining methods
Statistics - OLAP
descriptive: average, median, standard deviation, distribution
hypothesis testing: (observed differences)/(random variation)
discriminant analysis
predictive regression analysis: linear, non-linear
clustering
Neural networks
Decision trees and rules
Conceptual clustering
Association rules
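The hypothesis-testing ratio (observed differences)/(random variation) listed under Statistics can be illustrated with a two-sample t statistic. This is a minimal sketch that assumes Welch's form of the statistic, which the slides do not name; the two groups of measurements are hypothetical.

```python
import math
import statistics


def t_statistic(sample_a, sample_b):
    """Welch's t: observed difference in means divided by its standard error."""
    observed_difference = statistics.mean(sample_a) - statistics.mean(sample_b)
    random_variation = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    return observed_difference / random_variation


# Hypothetical measurements for two groups (illustrative values only).
group_a = [12.0, 14.0, 13.0, 15.0, 16.0]
group_b = [10.0, 11.0, 9.0, 12.0, 13.0]

# Descriptive statistics of one group, then the test statistic.
print(statistics.mean(group_a), statistics.median(group_a), statistics.stdev(group_a))
print(t_statistic(group_a, group_b))
```

A large absolute value means the observed difference is large relative to the variation expected by chance. Turning the statistic into a p-value additionally requires the t distribution, which is left out here.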
![Page 36: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/36.jpg)
(Non-)visual Data Mining methods: OLAP - Data cubes
[Figure: data cube with dimensions City (Leuven, NY, Tokyo, Casablanca, Rio), Date (1-7), and Product (Juice, Cola, Milk, Cream, Toothpaste, Soap, Pizza, Cheese); visible cell values: 10, 50, 35, 60, 20, 15, 70, 25]
Fact data: sales volume in $100
Online analytical processing
Classical statistical methods
+database technology
real-time calculations
powerful visualisation methods
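A data cube can be pictured as a fact table indexed by its dimensions, with OLAP operations summing over the dimensions that are not of interest. The sketch below shows such a roll-up in plain Python. The fact rows are hypothetical and are not read from the slide's cube, and `roll_up` is an illustrative helper, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table: (city, date, product, sales volume in $100).
facts = [
    ("Leuven", 1, "Juice", 10),
    ("Leuven", 1, "Cola", 50),
    ("Leuven", 2, "Juice", 35),
    ("Tokyo", 1, "Juice", 60),
    ("Tokyo", 2, "Cola", 20),
    ("Rio", 2, "Milk", 15),
]

# Position of each dimension inside a fact row.
DIMENSIONS = {"city": 0, "date": 1, "product": 2}


def roll_up(fact_rows, *keep):
    """Sum the sales volume over every dimension not listed in `keep`."""
    totals = defaultdict(int)
    for row in fact_rows:
        key = tuple(row[DIMENSIONS[name]] for name in keep)
        totals[key] += row[3]
    return dict(totals)


sales_by_city = roll_up(facts, "city")
sales_by_city_and_product = roll_up(facts, "city", "product")
grand_total = roll_up(facts)
```

Keeping fewer dimensions gives a coarser view (`roll_up(facts)` is the grand total); keeping more dimensions corresponds to drilling down.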
![Page 37: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/37.jpg)
Non-Visual Data Mining methods: Regression
![Page 38: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/38.jpg)
Non-visual Data Mining methods: Discriminant analysis
R.A. Fisher, 1936
discovers planes that separate classes
![Page 39: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/39.jpg)
Non-Visual Data Mining methods: Neural Networks
Represent functions with output a discrete value, a real value, or a vector
Neurobiological motivation
Network parameters tuned on the basis of input-output examples (backpropagation)
e.g. input from sensors
camera (face recognition)
microphone (speech recognition)
![Page 40: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/40.jpg)
Non-Visual Data Mining methods: Decision trees
![Page 41: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/41.jpg)
Non-Visual Data Mining methods: Decision trees
Attribute selection
information gain: “how well does an attribute distribute the data according to their target class”
maximal reduction of entropy
$\mathrm{Entropy} = -p_M \log_2 p_M - p_F \log_2 p_F$
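The entropy formula and the idea of information gain can be made concrete with a short sketch. It assumes the target class is the buyer's sex (M/F), as the subscripts in the formula suggest; the four example buyers and the attribute names `frame` and `color` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels, e.g. 'M' and 'F'."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(examples, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    before = entropy([example[target] for example in examples])
    # Group the target labels by the value of the candidate attribute.
    partitions = {}
    for example in examples:
        partitions.setdefault(example[attribute], []).append(example[target])
    # Weighted average entropy of the partitions.
    after = sum(
        len(part) / len(examples) * entropy(part) for part in partitions.values()
    )
    return before - after


# Hypothetical buyers (illustrative values only).
buyers = [
    {"frame": "2-Door", "color": "Red", "sex": "M"},
    {"frame": "2-Door", "color": "Blue", "sex": "M"},
    {"frame": "4-Door", "color": "Red", "sex": "F"},
    {"frame": "4-Door", "color": "Blue", "sex": "F"},
]

gain_frame = information_gain(buyers, "frame", "sex")
gain_color = information_gain(buyers, "color", "sex")
```

In this toy data `frame` separates the classes perfectly (gain of 1 bit) while `color` gives no reduction, so a tree learner that maximises information gain would split on `frame` first.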
![Page 42: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/42.jpg)
Non-Visual Data Mining methods: Decision rules
IF Frame = 2-Door AND Engine = V6 AND Age < 50 AND Cost > 30K AND Color = Red
THEN buyer is highly likely to be male
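A decision rule of this kind is simply a conjunction of tests on attribute values. The sketch below encodes the IF-part as a predicate; the dictionary keys and the two example buyers are hypothetical, and Cost is assumed to be expressed in currency units (30K = 30,000).

```python
def rule_predicts_male(buyer):
    """Return True when every condition in the IF-part of the rule holds."""
    return (
        buyer["frame"] == "2-Door"
        and buyer["engine"] == "V6"
        and buyer["age"] < 50
        and buyer["cost"] > 30_000
        and buyer["color"] == "Red"
    )


# Hypothetical buyers (illustrative values only).
buyer_a = {"frame": "2-Door", "engine": "V6", "age": 42, "cost": 35_000, "color": "Red"}
buyer_b = {"frame": "4-Door", "engine": "V6", "age": 42, "cost": 35_000, "color": "Red"}

print(rule_predicts_male(buyer_a))
print(rule_predicts_male(buyer_b))
```

When the rule does not fire, as for `buyer_b`, it makes no prediction at all; it does not claim that the buyer is female.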
![Page 43: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/43.jpg)
Non-Visual Data Mining methods: Clustering
Eisen et al., PNAS 1998
Cholesterol biosynthesis
Cell cycle
Early response
Signaling and angiogenesis
Wound healing
![Page 44: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/44.jpg)
Non-Visual Data Mining methods: Conceptual clustering
Groups examples and provides description of each group
(root) : all examples
A : Age < 20
B : Age = 20-40
b1 : Age = 20-40 and Frame = 2-Door
b2 : Age = 20-40 and Frame = 4-Door
C : Age = 40-60
D : Age > 60
d1 : Age > 60 and Frame = 2-Door
d2 : Age > 60 and Frame = 4-Door
[Figure: hierarchy of groups A, B, C, D, with b1 and b2 inside B and d1 and d2 inside D]
![Page 45: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/45.jpg)
Non-Visual Data Mining methods: Association rules
item sets
Wine and Pizza: 60 %
Wine, Pizza, Floppy, and Cheese: 40 %
association rule
IF Wine and Pizza THEN Floppy and Cheese
frequency: 40 %
accuracy: 40% / 60% ≈ 67%
IF-THEN rules show relationships
e.g. Which products are bought together?
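The frequency and accuracy of the rule above can be recomputed from a list of shopping baskets. This is a sketch under assumptions: the ten baskets are hypothetical and were chosen so that Wine and Pizza occur together in 60% of them and all four items in 40%.

```python
def frequency(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def accuracy(transactions, if_items, then_items):
    """frequency(IF-part and THEN-part) divided by frequency(IF-part)."""
    both = frequency(transactions, set(if_items) | set(then_items))
    return both / frequency(transactions, if_items)


# Ten hypothetical baskets (illustrative only).
baskets = (
    [{"Wine", "Pizza", "Floppy", "Cheese"}] * 4
    + [{"Wine", "Pizza"}] * 2
    + [{"Cheese"}, {"Floppy", "Pizza"}, {"Wine"}, {"Milk"}]
)

rule_frequency = frequency(baskets, {"Wine", "Pizza", "Floppy", "Cheese"})
rule_accuracy = accuracy(baskets, {"Wine", "Pizza"}, {"Floppy", "Cheese"})
```

In much of the association-rule literature these two measures are called support and confidence.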
![Page 46: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/46.jpg)
Evaluation: pitfalls - Post hoc ergo propter hoc
Everyone who drank Stella in the year 1743 is now dead.
Therefore, Stella is fatal.
![Page 47: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/47.jpg)
Evaluation: pitfalls - Correlation does not imply Causality
Palm size correlates with your life expectancy
The larger your palm, the shorter you will live, on average.
Women have smaller palms
and live 6 years longer on average
Why?
!actions inspired by data mining results!
![Page 48: slides session 1](https://reader033.vdocuments.mx/reader033/viewer/2022061206/5481c858b4af9f52498b459c/html5/thumbnails/48.jpg)
Evaluation: pitfalls - Hypothesis validation
descriptive statistics: 1 hypothesis
data mining: 1 hypothesis-SPACE
much higher probability of random relationships
validation on separate data set required
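The risk of searching a whole hypothesis space can be shown with a small simulation. In this sketch every "hypothesis" is a coin-flip attribute that is unrelated to the class label by construction; the sizes (2000 hypotheses, 20 training examples, 1000 hold-out examples) and the fixed seed are arbitrary assumptions made for illustration.

```python
import random


def agreement(predictions, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def best_random_hypothesis(n_hypotheses, n_train, n_holdout, seed=0):
    """Pick the coin-flip attribute that best fits the training labels.

    Returns (training accuracy, hold-out accuracy) of the selected attribute.
    """
    rng = random.Random(seed)
    n_total = n_train + n_holdout
    labels = [rng.random() < 0.5 for _ in range(n_total)]
    best_attribute, best_train_accuracy = None, -1.0
    for _ in range(n_hypotheses):
        # Each hypothesis predicts the label with a random, unrelated attribute.
        attribute = [rng.random() < 0.5 for _ in range(n_total)]
        train_accuracy = agreement(attribute[:n_train], labels[:n_train])
        if train_accuracy > best_train_accuracy:
            best_attribute, best_train_accuracy = attribute, train_accuracy
    # The separate data set plays no role in the selection above.
    holdout_accuracy = agreement(best_attribute[n_train:], labels[n_train:])
    return best_train_accuracy, holdout_accuracy


train_acc, holdout_acc = best_random_hypothesis(
    n_hypotheses=2000, n_train=20, n_holdout=1000
)
print(train_acc, holdout_acc)
```

The selected attribute looks strongly related to the label on the training examples purely because so many candidates were tried, while on the separate data set its accuracy falls back to roughly chance level. Testing a single pre-stated hypothesis does not suffer from this selection effect.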