Data Mining: Staying Ahead in the Information Age A Tutorial in Data Mining, YOR11, Cambridge, 29 th March 2000. Robert Burbidge Computer Science, UCL, London, UK. [email protected]

Data Mining: Staying Ahead in the Information Age

A Tutorial in Data Mining, YOR11, Cambridge, 29th March 2000.

Robert Burbidge

Computer Science, UCL, London, UK.

[email protected]

http://www.cs.ucl.ac.uk/staff/r.burbidge

Definition

‘We are drowning in information, but starving for knowledge’

John Naisbitt

• Data Mining is the search for ‘nuggets’ of useful information

• Data Mining is an automated search for ‘interesting’ patterns in large databases

Overview

[Diagram: Data, Pre-Processing, Analysis and Business Solutions, together with Aims and Domain Knowledge]

Before We Begin ...

• Getting the Data

• Assessing Usefulness of the Data

• Noise in the Data

• Volume of Available Data

• Domain Knowledge and Expertise

Getting the Data

• Are the data easily available?
  – What format are the data in?
  – Are the data in a live database or a data warehouse?
  – Are the data online?

[Figure: raw data in assorted formats, organised into a data matrix of objects by variables]

Assessing Usefulness of the Data

• Are the available data relevant to the task at hand?
  – E.g. to predict ice-cream sales, information about the FTSE would (probably) not be useful

• Are there missing factors which are likely to be predictive?
  – E.g. temperature is likely to be predictive of ice-cream sales

Noise in the Data

• Are the data contaminated by noise?
  – E.g. experimental error, typing mistakes, corrupted storage media

• Can this be eliminated?
  – E.g. improved experimental set-up, data cleaning

• How seriously is this likely to affect the results?

Volume of Available Data

• Are there enough data ...
  – ... to learn a useful concept?
  – ... to give statistically significant results?

• Should more data be collected?
  – More examples
  – More information about the examples
  – Meta data

Domain Knowledge

• Domain knowledge can be incorporated into some techniques
  – To choose priors in Bayesian analysis
  – To encode invariances in the data
  – Expert systems

• Use of expertise can avoid blind search
  – Feature selection
  – Building a model

Résumé 1

• Before we begin we must
  – Obtain the data
  – Make sure it’s useful
  – Make sure there’s enough
  – Identify available expert knowledge

• This is all pretty obvious
  – If you don’t do this you’re headed for trouble

Pre-Processing

• Visualization

• Feature Selection

• Feature Extraction

• Feature Derivation

• Data Reduction

Visualization

• Histogram plots
  – Identify distributions

• Clustering
  – k-means
  – Kohonen nets
  – Relational
  – Hierarchical
  – Outlier detection
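As an illustration of the k-means entry above, here is a minimal sketch of the standard assign-and-update loop on one-dimensional data. The function name `kmeans_1d`, the starting centres and the example ages are all assumptions made for this sketch; they do not come from the tutorial.

```python
def kmeans_1d(values, centres, iterations=20):
    """Plain k-means on a list of numbers, starting from the given centres."""
    centres = list(centres)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its old centre).
        new_centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
        if new_centres == centres:
            break  # converged: no centre moved
        centres = new_centres
    return centres, clusters


# Hypothetical example data: customer ages forming two obvious groups.
ages = [21, 23, 25, 58, 61, 64]
centres, clusters = kmeans_1d(ages, centres=[20, 30])
```

The result depends on the starting centres, so in practice the procedure is usually run from several initialisations.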

Feature Selection

• Performance Measures
  – Filters
  – Wrappers

• Search Algorithms
  – Exhaustive
  – Branch-and-bound
  – Mathematical Programming
  – Stochastic

[Figure: feature selection reduces the objects-by-variables data matrix to a subset of the variables]
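A filter scores each variable on its own, without training the final model. The sketch below ranks variables by the absolute Pearson correlation with the target and keeps the top `k`. The correlation criterion, the function names and the ice-cream figures are assumptions chosen for illustration; a wrapper would instead score candidate subsets by training and evaluating a model on each.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_select(columns, target, k):
    """Keep the k variables most strongly correlated with the target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up data: which variable best predicts ice-cream sales?
columns = {
    "temperature": [12, 18, 25, 30, 22],
    "ftse": [6100, 6400, 6050, 6300, 6200],
    "weekend": [0, 1, 0, 1, 0],
}
sales = [110, 190, 250, 330, 215]
selected = filter_select(columns, sales, k=1)
```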

Feature Extraction

• Domain knowledge
  – E.g. edges in images

• Informative features
  – Kohonen nets
  – Principal components analysis

• Useful for visualization
  – Projecting data to two or three dimensions
  – Identifying the number of clusters
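To make the principal components entry concrete, the following sketch finds the first principal component by power iteration on the sample covariance matrix and projects the data onto it. Power iteration is just one simple way to obtain the leading eigenvector, chosen here to avoid external libraries; the function name and the sample points are assumptions.

```python
from math import sqrt


def first_principal_component(points, iterations=100):
    """Return the leading principal direction and the projected scores."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centred = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix of the centred data.
    cov = [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by cov and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centred]
    return v, scores


# Made-up points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
direction, scores = first_principal_component(points)
```

For these points the direction comes out close to the diagonal, and the scores give a one-dimensional view of the data that could be plotted directly.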

Feature Derivation

• Transforming continuous attributes to discrete attributes
  – Fuzzy or rough linguistic concepts
  – Binning

• Deriving numeric features
  – Products, ratios, differences, etc.
  – E.g. taking differences of start and finish times, taking ratios of price changes
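The two kinds of derivation above can be sketched in a few lines: binning a continuous age into bands, and deriving a duration and a price ratio. The bin edges, labels and record fields are hypothetical values picked for the example.

```python
def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is at or above the last edge


# Hypothetical bin edges: under 30, 30 to 54, 55 and over.
AGE_EDGES = [30, 55]
AGE_LABELS = ["young", "middle", "older"]


def derive(record):
    """Build derived features from one raw record."""
    return {
        "age_band": bin_value(record["age"], AGE_EDGES, AGE_LABELS),
        "duration": record["finish"] - record["start"],
        "price_ratio": record["price_today"] / record["price_yesterday"],
    }


record = {
    "age": 42,
    "start": 9.0,
    "finish": 17.5,
    "price_today": 105.0,
    "price_yesterday": 100.0,
}
features = derive(record)
```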

Data Reduction

• Large amounts of data require longer training times
  – Some data points are more relevant than others

• Reducing the modality of a variable
  – Makes solutions more easily interpretable

[Figure: Support Vector Machine]

Résumé 2

• Assess the data statistically

• Visualize the data

• Identify, extract or create useful features

• Reduce the size of the problem if necessary

Discovering Patterns and Rules

• Rule Induction

• Statistical Pattern Recognition

• Neural Networks

• Hybrid Systems

• Performance Analysis

Rule Induction

• Discover rules that describe the data
  – e.g. marketing – who buys what?
    • IF age > 55 AND income > 20 000 THEN holiday
    • IF age < 40 AND age > 20 THEN pension

• Easy to understand – identifies important features

• Can be fuzzified
    • IF age_low AND income_high THEN car_high
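The sketch below shows how rules of this form are evaluated once they have been found; it does not show the induction step itself. The crisp rules are taken from the slide. The fuzzy version is an assumption: the linear membership functions, their breakpoints and the use of `min` for AND are common textbook choices, not details given in the tutorial.

```python
def crisp_rules(age, income):
    """Apply the two crisp marketing rules and list the products that fire."""
    offers = []
    if age > 55 and income > 20_000:
        offers.append("holiday")
    if 20 < age < 40:
        offers.append("pension")
    return offers


def ramp(x, low, high):
    """Membership rising linearly from 0 at low to 1 at high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def fuzzy_car_rule(age, income):
    """Degree to which IF age_low AND income_high THEN car_high fires."""
    age_low = 1.0 - ramp(age, 25, 45)          # hypothetical breakpoints
    income_high = ramp(income, 20_000, 40_000)  # hypothetical breakpoints
    return min(age_low, income_high)            # min used as fuzzy AND
```

For example, `crisp_rules(60, 25_000)` fires only the holiday rule, while `fuzzy_car_rule(30, 35_000)` returns a degree between 0 and 1 rather than a yes or no.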

Statistical Pattern Recognition

• Model the underlying distribution
  – Classification
    • Bayesian solution is optimal
    • Gives confidence values
  – Regression
    • Identifies useful features
    • Robust techniques to handle noise

• Difficult in many practical applications
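A small sketch of how a Bayesian classifier yields confidence values: each class is given a prior and a class-conditional density, and Bayes' rule turns these into posterior probabilities. The one-dimensional Gaussian class models, the class names and all the numbers are assumptions for illustration only.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def posterior(x, classes):
    """Posterior probability of each class; classes maps label -> (prior, mean, std)."""
    joint = {
        label: prior * gaussian_pdf(x, mean, std)
        for label, (prior, mean, std) in classes.items()
    }
    total = sum(joint.values())
    return {label: p / total for label, p in joint.items()}


# Hypothetical model: buyers are older on average than non-buyers.
classes = {
    "buys": (0.5, 60.0, 10.0),
    "does_not_buy": (0.5, 40.0, 10.0),
}
probs = posterior(65.0, classes)
```

The difficulty noted on the slide shows up here as the need to know, or estimate well, the class densities themselves.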

Neural Networks

• Based on neuronal brain model

• Each neuron forms a weighted sum of its inputs

• Flexible learners

• Prone to over-fitting

• Messy optimization problem

[Diagram: inputs, hidden layer, output]
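The weighted-sum description above corresponds to the forward pass sketched here for a network with one hidden layer. The sigmoid activation, the layer sizes and the weights are assumptions; training, which is where the messy optimization problem arises, is not shown.

```python
from math import exp


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs, then a non-linearity."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(inputs, hidden_layer, output_neuron):
    """Pass the inputs through the hidden layer and then the output neuron."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron(hidden, w_out, b_out)


# Hypothetical weights for 2 inputs, 3 hidden neurons and 1 output.
hidden_layer = [([0.5, -0.2], 0.1), ([-0.3, 0.8], 0.0), ([0.9, 0.4], -0.5)]
output_neuron = ([1.0, -1.0, 0.5], 0.2)
prediction = forward([1.0, 0.5], hidden_layer, output_neuron)
```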

Hybrid Systems

• Combine techniques for increased functionality and accuracy
  – function replacing
    • neural network accurate but unreadable
    • combine with a decision tree
  – committee
    • multiple classifiers with different set-ups
    • aggregate with a decision tree

[Diagram: inputs feed NN1, NN2 and NN3, whose outputs are combined by a Decision Tree to give the output]

Performance Analysis

• Accuracy
  – error rate
  – discrimination
  – variable costs

• Readability

• Time
  – training
  – using

[Figure: ROC curve; Neyman-Pearson at 20%]
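The ROC curve in the figure can be computed by sweeping a threshold over the classifier scores. The sketch below also picks an operating point in the Neyman-Pearson spirit, maximising the true-positive rate subject to a cap on the false-positive rate. Reading the 20% in the caption as a false-positive cap is an assumption, as are the function names and the scores and labels.

```python
def roc_points(scores, labels):
    """Return (threshold, false-positive rate, true-positive rate) triples."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is called positive.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_under_fpr(points, max_fpr=0.2):
    """Highest true-positive rate among points within the false-positive cap."""
    allowed = [p for p in points if p[1] <= max_fpr]
    return max(allowed, key=lambda p: p[2]) if allowed else None


# Made-up classifier scores with true labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
operating_point = best_under_fpr(roc_points(scores, labels))
```

On this toy data the chosen threshold is 0.6, which detects four of the five positives while admitting one of the five negatives.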

Résumé 3

• Identify key criteria

• Assess data characteristics

• Choose an algorithm

• Set the parameters

• Try combining multiple techniques to improve results

• Assess statistical significance

Post-Processing

• Understanding

• Significance

• Implementation

Understanding

• What does it mean?
  – if easily understandable, does it make sense?
  – if numeric, how should it be interpreted?

• Which features were important?
  – sensitivity analysis
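One simple form of sensitivity analysis is to nudge each input in turn and record how much the model output changes. The sketch below does this with a finite difference. The model, its coefficients and the input values are hypothetical stand-ins for whatever model was actually trained.

```python
def sensitivity(model, inputs, delta=1e-3):
    """Approximate change in model output per unit change in each input."""
    base = model(inputs)
    result = {}
    for name, value in inputs.items():
        nudged = dict(inputs)
        nudged[name] = value + delta  # perturb one input, hold the rest fixed
        result[name] = (model(nudged) - base) / delta
    return result


def toy_model(x):
    """Hypothetical sales model: temperature matters far more than the FTSE."""
    return 3.0 * x["temperature"] + 0.01 * x["ftse"]


effects = sensitivity(toy_model, {"temperature": 20.0, "ftse": 6000.0})
```

Because the inputs are on different scales, the raw slopes are usually rescaled, for example by each input's standard deviation, before features are compared.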

Significance

• Are the results interesting?
  – are they new and unobvious?
    • e.g. IF age > 100 THEN NOT pension
  – are they relevant?

• What is the significance?
  – are further studies required?
    • with more data specific to the discovered pattern
  – change of business plan

Implementation

• How to convince the money men
  – solid results
  – clear and concise

• How to test your hypothesis
  – experimental design
  – controlled studies to eliminate sampling bias

Résumé 4

• Assess the usefulness of the results
  – Interpretability
  – Relevance to initial problem

• Identify the next step
  – Sales pitch
  – Further experiments
  – Field trials
  – Towards knowledge discovery

Example Applications at UCL

• Intelligent fraud detection with Fuzzy GAs (Lloyds TSB)

• Drug Design by SVMs (SmithKline Beecham and Glaxo-Wellcome)

• Consumer Profiling with Bayes Nets (Unilever)

• Process Control (AstraZeneca)

‘Data Snooping’ – A Warning

• Artefacts – ‘patterns’ that aren’t there

• Sampling bias

• Statistical tests may not show significance
  – this does not mean results aren’t significant

• The extremum of a collection of Gaussians is highly skewed – beware coincidence

• Data mining is a dangerous tool in the wrong hands
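The warning about the extremum of a collection of Gaussians can be checked with a short simulation. Each candidate pattern is scored with pure noise drawn from a standard normal distribution, and only the best score is kept. The function names, the number of candidates and the trial count are assumptions made for this sketch.

```python
import random


def best_of(n_models, rng):
    """Largest of n_models independent N(0, 1) draws, i.e. pure noise."""
    return max(rng.gauss(0.0, 1.0) for _ in range(n_models))


def mean_best(n_models, trials=2000, seed=0):
    """Average of the best noise score over many repeated searches."""
    rng = random.Random(seed)
    return sum(best_of(n_models, rng) for _ in range(trials)) / trials


single = mean_best(1)      # one candidate: averages out near zero
searched = mean_best(100)  # best of 100 candidates: well above zero
```

With one candidate the average score sits near zero, but the best of a hundred noise candidates typically lands around two and a half standard deviations above it, which would look significant if the search were ignored.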

Summary

• Get the right data

• Use domain knowledge

• Pre-process the data

• Discover patterns and rules
  – machine learning
  – statistics

• Analyze results – but be wary

Conclusions

With vast amounts of data available, it has become necessary to use automated techniques.

Advances in data processing, machine learning and statistics have made this possible.

Data mining is a necessary tool for business survival in the information age.

Internet Resources

• www.kdnuggets.com

• www.data-miners.com

• www.crisp-dm.org

• www.research.microsoft.com/profiles/fayyad

• www.cs.sfu.ca/~han

• etc...