Data Mining Techniques

Wojtek Kowalczyk

www.cs.vu.nl/~wojtek

www.cs.vu.nl/~wojtek/DataMine

www.cs.vu.nl/ci/DataMine/DIANA

[email protected]

2

Outline

• Organization of the course

• What is Data Mining?

• Course overview

• Data Mining Tasks

• Data Mining Cycle

• Data Mining Techniques

3

Objectives of the course

• Provide an overview of the most common algorithms and techniques used in Data Mining (lectures)

• Provide extensive “hands-on” experience with applying these techniques (practicum)

• Provide a survey of typical (and future) applications of data mining

4

Organization of the course

• 12 lectures (1sp) + 3 assignments (3sp) (1sp=40hrs work)

• no exams; grades based on assignments (theory & practice)

• assignments on: 8.03, 12.04, 03.05

• deadlines: 3 weeks later: 5.04, 3.05, 24.05

• work in pairs(?); registration obligatory (before 1.03)

by e-mail to [email protected]: DMT-registration

Body: Full name; e-mail address; student number; {AI|BWI|…}

Full name; e-mail address; student number; {AI|BWI|…}

5

Materials

• Slides, notes, assignments:

www.cs.vu.nl/~wojtek/DataMine

• Book: “Data Mining” by Ian H. Witten and Eibe Frank,

www.cs.waikato.ac.nz/~ml/weka/book.html

• Internet: www.kdnuggets.com

• Further readings from different perspectives:

- business aspects: Berry & Linoff

- theory: Hand, Mannila, Smyth;

Tan, Steinbach, Kumar

- latest: proceedings of KDD, PKDD, PAKDD, ML, ...

6

Origins of Data Mining

• Every day the world creates a few exabytes of data

1 exabyte = 1000 petabytes
1 petabyte = 1000 terabytes
1 terabyte = 1000 gigabytes

• Only 4% of the data is used for any purpose (IBM)

• If we could only do something useful with this data ...

➨ ... the field of DATA MINING is born

7

Sources of data

• satellites (images)
• business: banks, telecom, insurance, retail, airlines, …
• internet (only a few terabytes in the late 90’s)
• libraries (e.g., Library of Congress: 20 TB - 3 PB)
• law enforcement agencies (FBI fingerprints DB: 1 PB)
• bioinformatics
• RFID-tags?
• homeland security?

8

Typical data mining applications

• fraud detection (credit cards, telecom, insurance, taxes, …)

• credit scoring and control (“to give or not to give?”)

• marketing (mailing selection, modeling churn/retention, attrition, cross-selling, market basket analysis, etc.)

• Customer Relationship Management (CRM)

• criminal investigations (text mining)

• ….

In Holland every citizen is “present” in 800-1000 databases!!!

9

What is Data Mining?

• Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.
(U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, KDD-96)

• Data mining is an area in the intersection of machine learning, statistics, and databases.
(M. Holsheimer, M. Kersten, H. Mannila and H. Toivonen)

• Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for a business advantage.
(SAS Institute)

10

Sorts of Data Mining Tasks

Predictive Data Mining (“supervised”):
• Classification
• Regression
• Time series

Knowledge Discovery (“unsupervised”):
• Deviation Detection
• Segmentation
• Clustering
• Association Rules
• Summarization
• Visualization

11

Examples

• Medical diagnosis: soft or hard contact lenses
• Credit application scoring: grant a loan or not?
• Fraud detection: is the transaction suspicious or not?
• Direct mailing: who should be offered a given product?

• CPU performance: how to configure computers?
• Remote sensing: determine water pollution from spectral images
• Load forecasting: predict future demand for electric power
• Intelligent ATMs: how much cash will be there tomorrow?

• identify groups of similar credit card users
• automatically organize incoming e-mails
• characterize interests of an Internet user
• etc.

12

Contact lenses: a classification task

Can I use contact lenses?

Possible output: none, soft, hard.

Decision based on:
- age
- spectacle prescription
- astigmatism
- tear production rate

13

Hypothetical Decision Table

age             prescription   astigmatism   tear p.r.   lenses
young           myope          no            reduced     none
young           myope          no            normal      soft
young           hypermetrope   yes           reduced     none
pre-presbyopic  myope          no            reduced     none
pre-presbyopic  hypermetrope   yes           normal      soft
pre-presbyopic  hypermetrope   yes           reduced     none
presbyopic      myope          no            normal      hard
presbyopic      myope          no            reduced     none
presbyopic      hypermetrope   yes           reduced     none

14

Classifiers: classification procedures

• A set of “if-then” rules

• A decision tree

• A Neural Network

• A formula (e.g. “scoring model”)

• A classification procedure

15

Figure 1.1 Rules for the contact lens data.

If tear production rate = reduced then recommendation = none

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none

If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft

If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
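The rules of Figure 1.1 can be written down directly as code. The sketch below is only an illustration: the function name `recommend_lenses` and the string encoding of the attribute values are assumptions, and the rules are tried in the order listed (first match wins), which the slide itself does not prescribe.

```python
def recommend_lenses(age, prescription, astigmatic, tear_rate):
    """Apply the Figure 1.1 rules in order; the first matching rule decides."""
    if tear_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    if age == "presbyopic" and prescription == "myope" and astigmatic == "no":
        return "none"
    if prescription == "hypermetrope" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    if prescription == "myope" and astigmatic == "yes" and tear_rate == "normal":
        return "hard"
    if age == "young" and astigmatic == "yes" and tear_rate == "normal":
        return "hard"
    if age == "pre-presbyopic" and prescription == "hypermetrope" and astigmatic == "yes":
        return "none"
    if age == "presbyopic" and prescription == "hypermetrope" and astigmatic == "yes":
        return "none"
    return None  # no rule fired (only possible for values outside the encoding above)


print(recommend_lenses("young", "myope", "no", "normal"))
```

Because the first rule handles every case with reduced tear production, the remaining rules only ever decide cases with normal tear production.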

16

Figure 1.2 Decision tree for the contact lens data.

17

CPU performance: regression problem

Computer’s CPU performance (PRP) depends on a number of factors:

- cycle time (MYCT)
- main memory (MMIN, MMAX)
- cache (CACH)
- number of channels (CHMIN, CHMAX)

Problem: express PRP as a function of all these factors.

18

Figure 3.6(a) Models for the CPU performance data: linear regression.

PRP = - 56.1
      + 0.049 MYCT
      + 0.015 MMIN
      + 0.006 MMAX
      + 0.630 CACH
      - 0.270 CHMIN
      + 1.46 CHMAX
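As a small illustration of how such a linear model is used, the sketch below evaluates the formula of Figure 3.6(a) for one machine. The coefficients are taken from the slide; the function name `predict_prp` and the example attribute values are hypothetical and serve only to show the arithmetic.

```python
# Coefficients of the linear regression model from Figure 3.6(a).
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(machine):
    """Return the predicted PRP: intercept plus the weighted sum of attributes."""
    return INTERCEPT + sum(
        weight * machine[name] for name, weight in COEFFICIENTS.items()
    )


# Hypothetical machine description, not taken from the CPU performance data set.
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict_prp(example), 3))
```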

19

Figure 3.6(b) Models for the CPU performance data: regression tree.

20

Figure 3.6(c) Models for the CPU performance data: model tree.

21

Association Rules

A shop sells products a, b, …, z
Clients buy them in collections, e.g., {a, c}, {c, d, z}, …
Each set is called a “transaction” or an “item set”

What are the most frequent item sets?

What are the most significant “association rules”, e.g., {c, g} ==> {z}?
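To make the question about frequent item sets concrete, here is a minimal brute-force sketch: it counts every item set up to a given size in every transaction. This is not one of the efficient algorithms used in practice; the function name `frequent_item_sets`, the threshold `min_count`, and the toy transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_item_sets(transactions, min_count, max_size=2):
    """Count all item sets of size 1..max_size and keep the frequent ones."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return {subset: n for subset, n in counts.items() if n >= min_count}


# Toy transactions over the products a..z (made up for this example).
transactions = [{"a", "c"}, {"c", "d", "z"}, {"c", "g", "z"}, {"a", "c", "g", "z"}]
frequent = frequent_item_sets(transactions, min_count=2)
for item_set, count in sorted(frequent.items()):
    print(item_set, count)
```

The number of candidate subsets grows very quickly with `max_size`, which is why brute-force counting only works for tiny examples.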

22

Association Rules II

Rule significance is measured in terms of:
- support (percentage of transactions that match LHS)
- confidence (accuracy of the rule)

Problems:
• combinatorial explosion of item sets
• huge number of rules
• two conflicting performance measures (we want rules to have big support and high accuracy)

There are efficient algorithms for finding rules!!!
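The two measures can be computed with a few lines of code. The sketch below follows the definitions on this slide: support is the fraction of transactions that match the LHS, and confidence is the fraction of those that also contain the RHS. Note that many texts instead define support over LHS and RHS together. The function name `rule_stats` and the toy transactions are assumptions.

```python
def rule_stats(transactions, lhs, rhs):
    """Return (support, confidence) of the rule lhs ==> rhs.

    support    = share of transactions containing all LHS items (slide's definition)
    confidence = share of those transactions that also contain all RHS items
    """
    lhs, rhs = set(lhs), set(rhs)
    matches_lhs = [t for t in transactions if lhs <= set(t)]
    matches_both = [t for t in matches_lhs if rhs <= set(t)]
    support = len(matches_lhs) / len(transactions)
    confidence = len(matches_both) / len(matches_lhs) if matches_lhs else 0.0
    return support, confidence


# Toy transactions (made up for this example).
baskets = [{"c", "g", "z"}, {"c", "g"}, {"a", "c"}, {"c", "d", "g", "z"}]
support, confidence = rule_stats(baskets, lhs={"c", "g"}, rhs={"z"})
print(support, confidence)
```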

23

Interdependencies: Link Analysis

What influences what and to what extent?

Bayesian networks: graphical models of knowledge

Networks constructed from data and knowledge !!!

[Figure: Bayesian network over the nodes s, x, a, h, r, d]

s = smoker
x = sex
a = age
h = health
r = resistance
d = life/death

24

Putting similar things together: Clustering

Example: Credit card users might be clustered according to the way they use their cards:

• frequent/seldom usage
• domestic/foreign transactions
• high/low amounts of money
• transactions of specific type
• …

Then a separate fraud detection system may be developed for every group. Or various products might be offered…
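One common way to form such groups is k-means clustering; the slide does not name a specific algorithm, so this choice is an assumption. The minimal sketch below clusters hypothetical card users described by two made-up features (transactions per month, average amount); the function names and the fixed initial centroids are also illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old  # keep an empty cluster's centroid unchanged
            for members, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Hypothetical users: (transactions per month, average amount).
users = [(2, 20), (3, 25), (4, 22), (40, 300), (45, 280), (50, 310)]
labels, centers = k_means(users, centroids=[users[0], users[3]])
print(labels)
```

In practice the features would be rescaled first, since the attribute with the largest numeric range dominates the distance.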

25

Characteristics of the data:

Huge quantities
Redundancy
Irrelevancy
Bad quality:
  • missing values
  • incompleteness
  • inconsistency
  • errors
  • outdated
  • outliers
High dimensionality
Unstructured (e.g. textual)

26

Data Mining Cycle

• Problem understanding and formulating
• Identification of relevant data
• Data gathering
• Data cleaning
• Data preprocessing
• Model building
• Model analysis
• Model implementation
• Model maintenance

27

Accents

1) Algorithms & Techniques
2) Technical skills (AWK, Matlab, Weka)
3) Performance Challenge
4) Applications
5) Recent Developments (text mining, web mining, mining data streams, etc.)

28

Data Preprocessing

• exploratory data analysis
• discretization and grouping of values
• reduction of dimensionality
• feature extraction
• treatment of missing values and outliers
• sampling
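Two of these steps are easy to illustrate in a few lines. The sketch below shows one simple option for each: replacing missing values by the mean of the observed ones, and equal-width discretization. Both are only examples of the many possible treatments; the function names and the use of `None` for a missing value are assumptions.

```python
def impute_mean(values):
    """Replace missing values (None) by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using bins of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would fall just outside the last bin, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


print(impute_mean([1.0, None, 3.0]))
print(equal_width_bins([0, 5, 10], n_bins=2))
```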

29

Model Building

• Rule Induction
• Decision Trees
• Bayesian Classifiers
• Regression Trees
• Association Rules
• Instance-based learning
• Clustering Algorithms
• Combining models: Bagging, Boosting, Stacking, etc.

30

To remember:

• There are various definitions of “Data Mining”
• Most common tasks of Data Mining are:
    • Classification
    • Regression / numerical prediction
    • Discovery of Associations
    • Clustering
• The road “from data to results” involves many steps
• The course covers 3 aspects of DM:
    • data preprocessing
    • model building
    • model evaluation
