Data Mining Techniques
Wojtek Kowalczyk
www.cs.vu.nl/~wojtek
www.cs.vu.nl/~wojtek/DataMine
www.cs.vu.nl/ci/DataMine/DIANA
2
Outline
• Organization of the course
• What is Data Mining?
• Course overview
• Data Mining Tasks
• Data Mining Cycle
• Data Mining Techniques
3
Objectives of the course
• Provide an overview of the most common algorithms
and techniques used in Data Mining (lectures)
• Provide extensive “hands-on” experience with
applying these techniques (practicum)
• Provide a survey of typical (and future)
applications of data mining
4
Organization of the course
• 12 lectures (1sp) + 3 assignments (3sp) (1sp=40hrs work)
• no exams; grades based on assignments (theory & practice)
• assignments on: 8.03, 12.04, 03.05
• deadlines: 3 weeks later: 5.04, 3.05, 24.05
• work in pairs(?); registration obligatory (before 1.03)
by e-mail to [email protected] Subject: DMT-registration
Body: Full name; e-mail address; student number; {AI|BWI|…}
Full name; e-mail address; student number; {AI|BWI|…}
5
Materials
• Slides, notes, assignments:
www.cs.vu.nl/~wojtek/DataMine
• Book: “Data Mining” by Ian H. Witten and Eibe Frank,
www.cs.waikato.ac.nz/~ml/weka/book.html
• Internet: www.kdnuggets.com
• Further readings from different perspectives:
- business aspects: Berry & Linoff
- theory: Hand, Mannila, Smyth;
Tan, Steinbach, Kumar
- latest: proceedings of KDD, PKDD, PAKDD, ML, ...
6
Origins of Data Mining
• Every day the world creates a few exabytes of data
1 exabyte = 1000 petabytes
1 petabyte = 1000 terabytes
1 terabyte = 1000 gigabytes
• Only 4% of the data is used for any purpose (IBM)
• If we could only do something useful with this data ...
➨ ... the field of DATA MINING is born
7
Sources of data
• satellites (images)
• business:
- banks
- telecom
- insurance
- retail
- airlines, …
• internet (only a few terabytes in the late 90’s)
• libraries (e.g., Library of Congress: 20 TB - 3 PB)
• law enforcement agencies (FBI fingerprints DB: 1 PB)
• bioinformatics
• RFID tags?
• homeland security?
8
Typical data mining applications
• fraud detection (credit cards, telecom, insurance, taxes, …)
• credit scoring and control (“to give or not to give?”)
• marketing (mailing selection, modeling churn/retention,
attrition, cross-selling, market basket analysis, etc)
• Customer Relationship Management (CRM)
• criminal investigations (text mining)
• ….
In Holland every citizen is “present” in
800-1000 databases !!!
9
What is Data Mining?
• Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data. (U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, KDD-96)
• Data mining is an area in the intersection of machine learning, statistics, and databases.
(M. Holsheimer, M. Kersten, H. Mannila and H. Toivonen)
• Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for a business advantage.
(SAS Institute)
10
Sorts of Data Mining Tasks
Predictive Data Mining (“supervised”):
• Classification
• Regression
• Time series
Knowledge Discovery (“unsupervised”):
• Deviation Detection
• Segmentation
• Clustering
• Association Rules
• Summarization
• Visualization
11
Examples
• Medical diagnosis: soft or hard contact lenses
• Credit application scoring: grant a loan or not?
• Fraud detection: is the transaction suspicious or not?
• Direct mailing: who should be offered a given product?
• CPU performance: how to configure computers?
• Remote sensing: determine water pollution from spectral images
• Load forecasting: predict future demand for electric power
• Intelligent ATMs: how much cash will be there tomorrow?
• identify groups of similar credit card users
• automatically organize incoming e-mails
• characterize interests of an Internet user
• etc.
12
Contact lenses: a classification task
Can I use contact lenses?
Possible output: none, soft, hard.
Decision based on:
- age
- spectacle prescription
- astigmatism
- tear production rate
13
Hypothetical Decision Table
age             prescription   astigmatism   tear p.r.   lenses
young           myope          no            reduced     none
young           myope          no            normal      soft
young           hypermetrope   yes           reduced     none
pre-presbyopic  myope          no            reduced     none
pre-presbyopic  hypermetrope   yes           normal      soft
pre-presbyopic  hypermetrope   yes           reduced     none
presbyopic      myope          no            normal      hard
presbyopic      myope          no            reduced     none
presbyopic      hypermetrope   yes           reduced     none
14
Classifiers: classification procedures
• A set of “if-then” rules
• A decision tree
• A Neural Network
• A formula (e.g. “scoring model”)
• A classification procedure
15
Figure 1.1 Rules for the contact lens data.
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
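A rule set like the one in Figure 1.1 can be turned into a classifier quite directly. The sketch below is only an illustration: the attribute names (age, prescription, astigmatic, tear_rate), the first-match evaluation order, and the function name recommend are assumptions made for this example, not part of the course material.

```python
# Each rule is (conditions, recommendation); conditions map attribute -> required value.
# The rules are transcribed from Figure 1.1; the attribute names are illustrative.
RULES = [
    ({"tear_rate": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommend(patient):
    """Return the recommendation of the first rule whose conditions all hold."""
    for conditions, recommendation in RULES:
        if all(patient.get(attr) == value for attr, value in conditions.items()):
            return recommendation
    return None  # no rule fired


patient = {
    "age": "young",
    "prescription": "myope",
    "astigmatic": "yes",
    "tear_rate": "normal",
}
print(recommend(patient))
```

Evaluating the rules in order and stopping at the first match is one possible reading of a rule set; here the choice is harmless because rules that can fire for the same patient agree on the recommendation.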
16
Figure 1.2 Decision tree for the contact lens data.
17
CPU performance: regression problem
Computer’s CPU performance (PRP) depends on a number of factors:
- cycle time (MYCT)
- main memory (MMIN, MMAX)
- cache (CACH)
- number of channels (CHMIN, CHMAX)
Problem: express PRP as a function of all these factors.
18
Figure 3.6(a) Models for the CPU performance data: linear regression.
PRP = - 56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
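Applying the linear model amounts to a weighted sum plus an intercept. The following sketch uses the coefficients quoted in Figure 3.6(a); the function name predict_prp and the example machine are hypothetical and serve only to show how a prediction would be computed.

```python
# Coefficients copied from the linear regression model in Figure 3.6(a).
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(machine):
    """Predicted performance: intercept plus the weighted sum of the attributes."""
    return INTERCEPT + sum(COEFFICIENTS[name] * machine[name] for name in COEFFICIENTS)


# Hypothetical machine configuration, chosen only for illustration.
example_machine = {
    "MYCT": 125, "MMIN": 256, "MMAX": 6000,
    "CACH": 256, "CHMIN": 16, "CHMAX": 128,
}
print(round(predict_prp(example_machine), 1))
```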
19
Figure 3.6(b) Models for the CPU performance data: regression tree.
20
Figure 3.6(c) Models for the CPU performance data: model tree.
21
Association Rules
A shop sells products a, b, …, z
Clients buy them in collections, e.g., {a, c}, {c, d, z}, …
Each set is called a “transaction” or an “item set”
What are the most frequent item sets?
What are the most significant “association rules”: e.g., {c, g} ==> {z}
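To make "most frequent item sets" concrete, here is a minimal brute-force sketch that counts every item set of up to two items. The transactions, the threshold min_count, and the function name frequent_item_sets are made up for illustration; this is plain enumeration, not one of the efficient algorithms used in practice.

```python
from collections import Counter
from itertools import combinations


def frequent_item_sets(transactions, min_count, max_size=2):
    """Count all item sets of up to max_size items and keep the frequent ones."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)  # fixed order so each set has one tuple form
        for size in range(1, max_size + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return {subset: n for subset, n in counts.items() if n >= min_count}


# Hypothetical shopping baskets.
transactions = [
    {"a", "c"},
    {"c", "d", "z"},
    {"c", "g", "z"},
    {"a", "c", "g", "z"},
]
print(frequent_item_sets(transactions, min_count=3))
```

Enumerating all subsets is only feasible for tiny examples, since the number of candidate sets grows exponentially with the number of products.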
22
Association Rules II
Rule significance is measured in terms of:
- support (percentage of transactions that match LHS)
- confidence (accuracy of the rule)
Problems:
• combinatorial explosion of item sets
• huge number of rules
• two conflicting performance measures (we want rules to have big support and high accuracy)
There are efficient algorithms for finding rules !!!
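Both measures can be computed by simple counting. The sketch below follows the definitions on this slide: support is the fraction of transactions that contain the LHS, and confidence is the fraction of those that also contain the RHS. Note that many texts instead define support over LHS and RHS together. The transactions and function names are illustrative assumptions.

```python
def support(transactions, lhs):
    """Fraction of transactions that contain every item of the LHS."""
    lhs = set(lhs)
    matching = sum(1 for t in transactions if lhs <= set(t))
    return matching / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions matching the LHS, the fraction that also contain the RHS."""
    lhs, rhs = set(lhs), set(rhs)
    covered = [t for t in transactions if lhs <= set(t)]
    if not covered:
        return 0.0  # the rule never applies
    correct = sum(1 for t in covered if rhs <= set(t))
    return correct / len(covered)


# Hypothetical transactions for the rule {c, g} ==> {z}.
baskets = [
    {"a", "c"},
    {"c", "d", "z"},
    {"c", "g", "z"},
    {"c", "g"},
    {"a", "c", "g", "z"},
]
print(support(baskets, {"c", "g"}))
print(confidence(baskets, {"c", "g"}, {"z"}))
```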
23
Interdependencies: Link Analysis
What influences what, and to what extent?
Bayesian networks: graphical models of knowledge
Networks constructed from data and knowledge !!!
Figure: Bayesian network with nodes s, a, x, d, r, h
s = smoker, x = sex, a = age, h = health, r = resistance, d = live/death
24
Putting similar things together: Clustering
Example: Credit card users might be clustered according to the way they use their cards:
• frequent/seldom usage
• domestic/foreign transactions
• high/low amounts of money
• transactions of specific type
• …
Then a separate fraud detection system may be developed for every group. Or various products might be offered…
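One common way to form such groups is k-means clustering. The sketch below is a minimal version of it, under several assumptions: each user is described by two made-up features (transactions per month and share of foreign transactions), the first k points are used as initial centres, and the features are not rescaled, which a real analysis would normally do.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: alternate between assigning points and moving centres."""
    centres = [tuple(p) for p in points[:k]]  # deterministic start: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centres.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centres.append(centres[c])  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return centres, assignment


# Hypothetical users: (transactions per month, share of foreign transactions).
users = [(2, 0.0), (40, 0.6), (3, 0.1), (45, 0.5), (1, 0.0), (38, 0.7)]
centres, labels = kmeans(users, k=2)
print(labels)
```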
25
Characteristics of the data:
• Huge quantities
• Redundancy
• Irrelevancy
• Bad quality:
- missing values
- incompleteness
- inconsistency
- errors
- outdated
- outliers
• High dimensionality
• Unstructured (e.g. textual)
26
Data Mining Cycle
• Problem understanding and formulating
• Identification of relevant data
• Data gathering
• Data cleaning
• Data preprocessing
• Model building
• Model analysis
• Model implementation
• Model maintenance
27
Accents
1) Algorithms & Techniques
2) Technical skills (AWK, Matlab, Weka)
3) Performance Challenge
4) Applications
5) Recent Developments (text mining, web mining, mining data streams, etc.)
28
Data Preprocessing
• exploratory data analysis
• discretization and grouping of values
• reduction of dimensionality
• feature extraction
• treatment of missing values and outliers
• sampling
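Two of the preprocessing steps listed above, treatment of missing values and discretization, can be sketched in a few lines. The choices made here (replacing missing values with the mean, equal-width bins) are just one simple option each, and the data and function names are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to the index of one of n_bins equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical "age" column with two missing entries.
ages = [23, None, 35, 61, 47, None, 29]
filled = impute_mean(ages)
bins = equal_width_bins(filled, n_bins=3)
print(filled)
print(bins)
```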
29
Model Building
• Rule Induction
• Decision Trees
• Bayesian Classifiers
• Regression Trees
• Association Rules
• Instance-based learning
• Clustering Algorithms
• Combining models: Bagging, Boosting, Stacking, etc.
30
To remember:
• There are various definitions of “Data Mining”
• Most common tasks of Data Mining are:
- Classification
- Regression / numerical prediction
- Discovery of Associations
- Clustering
• The road “from data to results” involves many steps
• The course covers 3 aspects of DM:
- data preprocessing
- model building
- model evaluation