data mining techniques

Data Mining Techniques Wojtek Kowalczyk www.cs.vu.nl/~wojtek www.cs.vu.nl/~wojtek/DataMine www.cs.vu.nl/ci/DataMine/DIANA [email protected] R4.50

Upload: tommy96

Post on 13-Jun-2015

Page 1: Data Mining Techniques

Data Mining Techniques

Wojtek Kowalczyk

www.cs.vu.nl/~wojtek

www.cs.vu.nl/~wojtek/DataMine

www.cs.vu.nl/ci/DataMine/DIANA

[email protected]

Page 2: Data Mining Techniques

Outline

• Organization of the course

• What is Data Mining?

• Course overview

• Data Mining Tasks

• Data Mining Cycle

• Data Mining Techniques

Page 3: Data Mining Techniques

Objectives of the course

• Provide an overview of the most common algorithms and techniques used in Data Mining (lectures)

• Provide extensive “hands-on” experience with applying these techniques (practicum)

• Provide a survey of typical (and future) applications of data mining

Page 4: Data Mining Techniques

Organization of the course

• 12 lectures (1sp) + 3 assignments (3sp) (1sp=40hrs work)

• no exams; grades based on assignments (theory & practice)

• assignments on: 8.03, 12.04, 03.05

• deadlines: 3 weeks later: 5.04, 3.05, 24.05

• work in pairs(?); registration obligatory (before 1.03) by e-mail to [email protected]

Subject: DMT-registration

Body: Full name; e-mail address; student number; {AI|BWI|…}

Page 5: Data Mining Techniques

Materials

• Slides, notes, assignments:

www.cs.vu.nl/~wojtek/DataMine

• Book: “Data Mining” by Ian H. Witten and Eibe Frank,

www.cs.waikato.ac.nz/~ml/weka/book.html

• Internet: www.kdnuggets.com

• Further readings from different perspectives:

- business aspects: Berry & Linoff

- theory: Hand, Mannila, Smyth; Tan, Steinbach, Kumar

- latest: proceedings of KDD, PKDD, PAKDD, ML, ...

Page 6: Data Mining Techniques

Origins of Data Mining

• Every day the world creates a few exabytes of data

1 exabyte = 1000 petabytes
1 petabyte = 1000 terabytes
1 terabyte = 1000 gigabytes

• Only 4% of the data is used for any purpose (IBM)

• If we could only do something useful with this data ...

➨ ... the field of DATA MINING is born

Page 7: Data Mining Techniques

Sources of data

• satellites (images)
• business: banks, telecom, insurance, retail, airlines, …
• internet (only a few terabytes in the late 1990s)
• libraries (e.g., Library of Congress: 20 TB - 3 PB)
• law enforcement agencies (FBI fingerprints DB: 1 PB)
• bioinformatics
• RFID tags? Homeland security?

Page 8: Data Mining Techniques

Typical data mining applications

• fraud detection (credit cards, telecom, insurance, taxes, …)

• credit scoring and control (“to give or not to give?”)

• marketing (mailing selection, modeling churn/retention, attrition, cross-selling, market basket analysis, etc.)

• Customer Relationship Management (CRM)

• criminal investigations (text mining)

• …

In Holland every citizen is “present” in 800-1000 databases!

Page 9: Data Mining Techniques

What is Data Mining ?

• Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data. (U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, KDD-96)

• Data mining is an area in the intersection of machine learning, statistics, and databases. (M. Holsheimer, M. Kersten, H. Mannila and H. Toivonen)

• Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for a business advantage. (SAS Institute)

Page 10: Data Mining Techniques

Sorts of Data Mining Tasks

Predictive Data Mining (“supervised”):
• Classification
• Regression
• Time series

Knowledge Discovery (“unsupervised”):
• Deviation Detection
• Segmentation
• Clustering
• Association Rules
• Summarization
• Visualization

Page 11: Data Mining Techniques

Examples

• Medical diagnosis: soft or hard contact lenses?
• Credit application scoring: grant a loan or not?
• Fraud detection: is the transaction suspicious or not?
• Direct mailing: who should be offered a given product?

• CPU performance: how to configure computers?
• Remote sensing: determine water pollution from spectral images
• Load forecasting: predict future demand for electric power
• Intelligent ATMs: how much cash will be there tomorrow?

• identify groups of similar credit card users
• automatically organize incoming e-mails
• characterize interests of an Internet user
• etc.

Page 12: Data Mining Techniques

Contact lenses: a classification task

Can I use contact lenses?

Possible output: none, soft, hard.

Decision based on:
- age
- spectacle prescription
- astigmatism
- tear production rate

Page 13: Data Mining Techniques

Hypothetical Decision Table

age             prescription   astigmatism   tear p.r.   lenses
young           myope          no            reduced     none
young           myope          no            normal      soft
young           hypermetrope   yes           reduced     none
pre-presbyopic  myope          no            reduced     none
pre-presbyopic  hypermetrope   yes           normal      soft
pre-presbyopic  hypermetrope   yes           reduced     none
presbyopic      myope          no            normal      hard
presbyopic      myope          no            reduced     none
presbyopic      hypermetrope   yes           reduced     none

Page 14: Data Mining Techniques

Classifiers: classification procedures

• A set of “if-then” rules

• A decision tree

• A Neural Network

• A formula (e.g. “scoring model”)

• A classification procedure

Page 15: Data Mining Techniques

Figure 1.1 Rules for the contact lens data.

If tear production rate = reduced then recommendation = none

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none

If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft

If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
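The rule set above can be read as a small classifier. The following is a minimal Python sketch, not part of the original slides: the dictionary keys (`age`, `prescription`, `astigmatic`, `tear`) and the function name `recommend` are illustrative assumptions, and the rules are tried in the order listed, returning the first match.

```python
# Each rule is (conditions, recommendation), transcribed from Figure 1.1.
# The attribute names used as keys are hypothetical short forms.
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommend(patient):
    """Return the recommendation of the first rule whose conditions all hold."""
    for conditions, recommendation in RULES:
        if all(patient.get(key) == value for key, value in conditions.items()):
            return recommendation
    return None  # no rule fired


patient = {"age": "young", "prescription": "hypermetrope",
           "astigmatic": "no", "tear": "normal"}
print(recommend(patient))
```

Because a rule only lists the attributes it tests, rules of different lengths fit in the same structure; a decision tree would instead fix the order in which attributes are examined.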

Page 16: Data Mining Techniques

Figure 1.2 Decision tree for the contact lens data.

Page 17: Data Mining Techniques

CPU performance: regression problem

Computer’s CPU performance (PRP) depends on a number of factors:

- cycle time (MYCT)
- main memory (MMIN, MMAX)
- cache (CACH)
- number of channels (CHMIN, CHMAX)

Problem: express PRP as a function of all these factors.

Page 18: Data Mining Techniques

Figure 3.6(a) Models for the CPU performance data: linear regression.

PRP = - 56.1
      + 0.049 MYCT
      + 0.015 MMIN
      + 0.006 MMAX
      + 0.630 CACH
      - 0.270 CHMIN
      + 1.46 CHMAX
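To make the linear model concrete, the sketch below evaluates it for one machine in Python. The coefficients are copied from the formula above; the function name `predict_prp` and the attribute values of the example machine are made up for illustration.

```python
# Intercept and coefficients of the linear regression model for PRP.
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(machine):
    """Weighted sum of the attribute values plus the intercept."""
    return INTERCEPT + sum(weight * machine[name]
                           for name, weight in COEFFICIENTS.items())


# Hypothetical machine description (values chosen only for illustration).
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict_prp(example), 3))
```

The regression tree and model tree of the following figures partition the attribute space first, then use a constant or a local linear formula of this kind in each leaf.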

Page 19: Data Mining Techniques

Figure 3.6(b) Models for the CPU performance data: regression tree.

Page 20: Data Mining Techniques

Figure 3.6(c) Models for the CPU performance data: model tree.

Page 21: Data Mining Techniques

Association Rules

A shop sells products a, b, …, z.
Clients buy them in collections, e.g., {a, c}, {c, d, z}, …
Each set is called a “transaction” or an “item set”.

What are the most frequent item sets?

What are the most significant “association rules”, e.g., {c, g} ==> {z}?

Page 22: Data Mining Techniques

Association Rules II

Rule significance is measured in terms of:
- support (percentage of transactions that match the LHS)
- confidence (accuracy of the rule: the fraction of transactions matching the LHS that also match the RHS)

Problems:
• combinatorial explosion of item sets
• huge number of rules
• two conflicting performance measures (we want rules to have big support and high accuracy)

There are efficient algorithms for finding rules !!!
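As an illustration of the two measures, here is a small Python sketch that computes them by direct counting for the rule {c, g} ==> {z}. The transactions are invented, the function names are hypothetical, and support follows the definition on this slide (transactions matching the LHS); many texts instead count transactions that contain both sides of the rule.

```python
def support(transactions, lhs):
    """Fraction of transactions that contain every item of the LHS."""
    matching = [t for t in transactions if lhs <= t]
    return len(matching) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions matching the LHS, the fraction also containing the RHS."""
    matching = [t for t in transactions if lhs <= t]
    if not matching:
        return 0.0
    return sum(1 for t in matching if rhs <= t) / len(matching)


# Made-up shopping baskets, each represented as a set of product names.
transactions = [
    {"a", "c"},
    {"c", "d", "z"},
    {"c", "g", "z"},
    {"a", "c", "g", "z"},
    {"c", "g"},
]

print(support(transactions, {"c", "g"}))            # 3 of 5 baskets contain c and g
print(confidence(transactions, {"c", "g"}, {"z"}))  # 2 of those 3 also contain z
```

Counting every candidate item set this way is exactly what leads to the combinatorial explosion mentioned above; the efficient algorithms avoid enumerating most candidates.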

Page 23: Data Mining Techniques

Interdependencies: Link Analysis

What influences what and to which extent?

Bayesian networks: graphical models of knowledge

Networks constructed from data and knowledge !!!

[Figure: Bayesian network over the nodes s, a, x, d, r, h]

s = smoker, x = sex, a = age, h = health, r = resistance, d = live/death

Page 24: Data Mining Techniques

Putting similar things together: Clustering

Example: Credit card users might be clustered according to the way they use their cards:

• frequent/seldom usage
• domestic/foreign transactions
• high/low amounts of money
• transactions of specific type
• …

Then a separate fraud detection system may be developed for every group. Or various products might be offered…
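The slide does not name a clustering algorithm. As one possible illustration, the Python sketch below runs a basic k-means procedure on invented card-usage data (transactions per month, average amount); the data, the starting centroids, and the function names are all assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        centroids = updated
    return labels, centroids


# Hypothetical users: (transactions per month, average amount spent).
users = [(2, 20), (3, 25), (2, 30), (40, 500), (45, 520), (42, 480)]
labels, centroids = k_means(users, centroids=[users[0], users[3]])
print(labels)
```

In practice the attributes would be rescaled first, since the amounts here dominate the distance simply because they are measured in larger units.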

Page 25: Data Mining Techniques

Characteristics of the data:

Huge quantities
Redundancy
Irrelevancy
Bad quality:
  • missing values
  • incompleteness
  • inconsistency
  • errors
  • outdated
  • outliers
High dimensionality
Unstructured (e.g. textual)

Page 26: Data Mining Techniques

Data Mining Cycle

• Problem understanding and formulation
• Identification of relevant data
• Data gathering
• Data cleaning
• Data preprocessing
• Model building
• Model analysis
• Model implementation
• Model maintenance

Page 27: Data Mining Techniques

Accents

1) Algorithms & Techniques
2) Technical skills (AWK, Matlab, Weka)
3) Performance Challenge
4) Applications
5) Recent Developments (text mining, web mining, mining data streams, etc.)

Page 28: Data Mining Techniques

Data Preprocessing

• exploratory data analysis
• discretization and grouping of values
• reduction of dimensionality
• feature extraction
• treatment of missing values and outliers
• sampling
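Two of the steps listed above, treatment of missing values and discretization, can be sketched in a few lines of Python. Mean imputation and equal-width binning are only one simple choice for each step, picked here for illustration; the function names and the sample values are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) by the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def discretize_equal_width(values, bins):
    """Map each value to the index of one of `bins` equally wide intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would land in bin index `bins`, so clamp it.
    return [min(int((v - low) / width), bins - 1) for v in values]


raw = [1.0, None, 3.0, 8.0]          # made-up attribute with one missing value
filled = impute_mean(raw)
print(filled)
print(discretize_equal_width(filled, bins=3))
```

Mean imputation hides how many values were missing and is sensitive to outliers, which is one reason the treatment of missing values and of outliers is usually considered together.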

Page 29: Data Mining Techniques

Model Building

• Rule Induction
• Decision Trees
• Bayesian Classifiers
• Regression Trees
• Association Rules
• Instance-based learning
• Clustering Algorithms
• Combining models: Bagging, Boosting, Stacking, etc.

Page 30: Data Mining Techniques

To remember:

• There are various definitions of “Data Mining”
• Most common tasks of Data Mining are:
  • Classification
  • Regression / numerical prediction
  • Discovery of Associations
  • Clustering
• The road “from data to results” involves many steps
• The course covers 3 aspects of DM:
  • data preprocessing
  • model building
  • model evaluation