

1. Data Mining (or KDD)

Let us find something interesting!

Definition := “Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)

– remote sensors on a satellite

– telescopes scanning the skies

– microarrays generating gene expression data

– scientific simulations generating terabytes of data

– GIS

Traditional techniques infeasible for raw data

Data mining may help scientists

– in classifying and segmenting data

– in hypothesis formation

Ch. Eick: Data Mining

2.1 Supervised Clustering

Applications of Supervised Clustering Include:

a. Learning Subclasses

b. Region Discovery in Spatial Datasets

c. Distance Function Learning

d. Data Set Compression (reduce size of dataset by using cluster representatives)

e. Adaptive Supervised Clustering

[Figure: three scatter-plot panels over Attribute1 and Attribute2: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering. Legend: class 1, class 2, unclassified object. Clusters are labeled A through L.]


Example: Finding Subclasses

[Figure: scatter plot over Attribute1 and Attribute2 of Ford and GMC examples, with the subclasses Ford Trucks, Ford SUV, Ford Vans, GMC Trucks, GMC Van, and GMC SUV marked.]


SC Algorithms Investigated

1. Representative-based Clustering Algorithms

   1. Supervised Partitioning Around Medoids (SPAM)

   2. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)

   3. Supervised Clustering using Evolutionary Computing (SCEC)

2. Agglomerative Hierarchical Supervised Clustering (AHSC)

3. Grid-Based Supervised Clustering (GRIDSC)

   1. Naïve approach

   2. Hierarchical Grid-based Clustering relying on data cubes

   3. Grid-based Clustering relying on density estimation techniques
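To make the representative-based idea concrete, here is a minimal Python sketch of steepest-descent hill climbing over sets of representatives, where each step inserts or deletes a single representative, wrapped in randomized restarts. It is only an illustration in the spirit of the SRIDHCR name, not the algorithm as investigated: the objective (class impurity plus a penalty on the number of clusters), the parameter `beta`, and all function names are assumptions.

```python
import math
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def assign(points, reps):
    """Map each representative (an index into points) to the indices of the points closest to it."""
    clusters = {r: [] for r in reps}
    for i, (x, _) in enumerate(points):
        nearest = min(reps, key=lambda r: sq_dist(x, points[r][0]))
        clusters[nearest].append(i)
    return clusters

def fitness(points, reps, beta=0.1):
    """Assumed objective, lower is better: impurity + beta * penalty on the cluster count."""
    n = len(points)
    n_classes = len({label for _, label in points})
    minority = 0
    for members in assign(points, reps).values():
        if members:
            counts = Counter(points[i][1] for i in members)
            minority += len(members) - max(counts.values())
    penalty = math.sqrt(max(len(reps) - n_classes, 0) / n)
    return minority / n + beta * penalty

def hill_climb(points, start_reps, beta=0.1):
    """Repeatedly move to the best neighbour that inserts or deletes one representative."""
    current = frozenset(start_reps)
    best = fitness(points, current, beta)
    while True:
        neighbours = [current | {i} for i in range(len(points)) if i not in current]
        if len(current) > 1:
            neighbours += [current - {i} for i in current]
        if not neighbours:
            return current, best
        score, candidate = min(((fitness(points, s, beta), s) for s in neighbours),
                               key=lambda t: t[0])
        if score >= best:  # no neighbour improves the objective: local optimum
            return current, best
        current, best = candidate, score

def supervised_cluster(points, restarts=5, seed=0):
    """Randomized restart: run hill climbing from several random starts, keep the best run."""
    rng = random.Random(seed)
    runs = []
    for _ in range(restarts):
        k = rng.randint(1, min(3, len(points)))
        runs.append(hill_climb(points, rng.sample(range(len(points)), k)))
    return min(runs, key=lambda run: run[1])
```

Here `points` is a list of `(features, class_label)` pairs. Restarts matter because a single run stops at the first set of representatives that no single insertion or deletion can improve.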


2.2 Spatial Data Mining (SPDM)

SPDM := the process of discovering interesting, useful, non-trivial patterns from (large) spatial datasets.

Spatial patterns

– Spatial outliers, discontinuities
  • bad traffic sensors on highways

– Location prediction models
  • model to identify the habitat of endangered species

– Spatial clusters
  • crime hot-spots, poverty clusters

– Co-location patterns
  • identify arsenic risk zones in Texas and determine if there is a correlation between the arsenic concentrations of the major Texas aquifers and cultural factors such as population, farm density, and the geology of the aquifers, etc.

Idea: Reuse the supervised clustering algorithms that already exist by running them with a different fitness function that corresponds to a particular measure of interestingness.
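As a rough illustration of swapping in a different fitness function, the Python sketch below scores a spatial clustering by how strongly each cluster's mean of some measured variable (say, arsenic concentration) deviates from the global mean. The reward definition, the `threshold` and `beta` parameters, and the function names are all assumptions made for this example; the slides do not specify a particular measure of interestingness.

```python
def interestingness(region_values, global_mean, threshold):
    """Assumed reward: deviation of the region mean from the global mean, beyond a threshold."""
    region_mean = sum(region_values) / len(region_values)
    return max(0.0, abs(region_mean - global_mean) - threshold)

def region_fitness(clusters, threshold=0.0, beta=1.1):
    """Score a clustering (higher is better).

    clusters: one list of measurements per spatial cluster.
    Each cluster contributes reward * size**beta, so beta > 1 favours larger regions.
    """
    all_values = [v for cluster in clusters for v in cluster]
    global_mean = sum(all_values) / len(all_values)
    return sum(interestingness(c, global_mean, threshold) * len(c) ** beta
               for c in clusters if c)

# Regions that separate high from low readings score higher than mixed regions.
separated = [[10.0, 10.0], [1.0, 1.0]]
mixed = [[10.0, 1.0], [10.0, 1.0]]
print(region_fitness(separated, beta=1.0), region_fitness(mixed, beta=1.0))
```

A clustering algorithm that searches for the best-scoring clustering would then be steered toward hot spots and cold spots of the chosen variable rather than toward class purity.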


Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets


2.3 Distance Function Learning

Example: How to Find Similar Patients?

Task: Construct a distance function that measures patient similarity

Motivation: Finding a “good” distance function is important for:

– Case-based reasoning

– Clustering

– Instance-based classification (e.g. nearest neighbor classifiers)

Our Approach: Learn distance functions based on training examples and user feedback


Motivating Example: How To Find Similar Patients?

The following relation is given (with 10000 tuples):

Patient(ssn, weight, height, cancer-sev, eye-color, age, …)

Attribute Domains

– ssn: 9 digits

– weight: between 30 and 650; μ_weight = 158, σ_weight = 24.20

– height: between 0.30 and 2.20 in meters; μ_height = 1.52, σ_height = 19.2

– cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor

– eye-color: {brown, blue, green, grey}

– age: between 3 and 100; μ_age = 45, σ_age = 13.2

Task: Define Patient Similarity

$$\Theta(o_i, o_j) = \frac{\sum_{f=1}^{p} w_f \cdot \theta_f(o_i, o_j)}{\sum_{f=1}^{p} w_f}$$
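One way to approach the patient similarity task is a weighted average of per-attribute distances. The Python sketch below is a hedged illustration: the per-attribute distance choices (range-normalized absolute difference for numeric attributes, scaled rank difference for cancer-sev, 0/1 mismatch for eye-color), the exclusion of ssn, and all names are assumptions, not definitions taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    weight: float
    height: float
    cancer_sev: int   # 1=minor ... 4=serious
    eye_color: str
    age: float

# Attribute ranges as listed in the attribute domains.
RANGES = {"weight": (30.0, 650.0), "height": (0.30, 2.20), "age": (3.0, 100.0)}

def numeric_dist(a, b, lo, hi):
    """Assumed numeric distance: absolute difference scaled to [0, 1] by the domain range."""
    return abs(a - b) / (hi - lo)

def attribute_distances(p, q):
    """Per-attribute distances, each in [0, 1]. ssn is an identifier and is left out."""
    return {
        "weight": numeric_dist(p.weight, q.weight, *RANGES["weight"]),
        "height": numeric_dist(p.height, q.height, *RANGES["height"]),
        "cancer_sev": abs(p.cancer_sev - q.cancer_sev) / 3,      # ordinal, 4 levels
        "eye_color": 0.0 if p.eye_color == q.eye_color else 1.0,  # nominal
        "age": numeric_dist(p.age, q.age, *RANGES["age"]),
    }

def patient_distance(p, q, weights):
    """Weighted average of the per-attribute distances."""
    d = attribute_distances(p, q)
    return sum(weights[f] * d[f] for f in weights) / sum(weights.values())

uniform = {"weight": 1, "height": 1, "cancer_sev": 1, "eye_color": 1, "age": 1}
a = Patient(80.0, 1.75, 2, "brown", 40.0)
b = Patient(80.0, 1.75, 2, "blue", 40.0)
print(patient_distance(a, b, uniform))
```

With this shape, learning a distance function reduces to choosing the weights, for example raising the weight of cancer-sev and lowering that of eye-color if similar patients are expected to have similar outcomes.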


Idea: Coevolving Clusters and Distance Functions

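[Figure: feedback loop. A distance function is used to cluster the dataset X; clustering evaluation q(X) measures the goodness of the distance function; a weight updating scheme / search strategy then adjusts the distance function. Insets show example clusterings of x and o objects under a “bad” distance function and under a “good” distance function.]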


Distance Function Learning Framework

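[Figure: framework diagram with two groups of interchangeable components.

Weight-Updating Scheme / Search Strategy: Randomized Hill Climbing, Adaptive Clustering, Inside/Outside Weight Updating, …

Distance Function Evaluation: K-Means, Supervised Clustering, NN-Classifier, …

Annotations on the combinations: Work by Karypis, [BECV05], Other Research, [ERBV04], Current Research [CHEN05].]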
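One combination from the framework can be sketched in Python: randomized hill climbing as the search strategy and a nearest-neighbour classifier as the evaluation. This is a minimal sketch under assumptions, not the procedure used in the cited work: the perturbation scheme, the leave-one-out 1-NN accuracy as the goodness measure, the acceptance of ties, and all names and defaults are illustrative, and attributes are assumed to be numeric and already scaled to comparable ranges.

```python
import random

def weighted_dist(a, b, w):
    """Weighted average of per-attribute absolute differences."""
    return sum(wf * abs(x - y) for wf, x, y in zip(w, a, b)) / sum(w)

def loo_nn_accuracy(X, y, w):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier under weights w."""
    correct = 0
    for i in range(len(X)):
        others = (k for k in range(len(X)) if k != i)
        nearest = min(others, key=lambda k: weighted_dist(X[i], X[k], w))
        correct += y[nearest] == y[i]
    return correct / len(X)

def learn_weights(X, y, steps=200, step_size=0.2, seed=0):
    """Randomized hill climbing: perturb the weights, keep the candidate if it is no worse."""
    rng = random.Random(seed)
    w = [1.0] * len(X[0])
    best = loo_nn_accuracy(X, y, w)
    for _ in range(steps):
        cand = [max(1e-6, wf * (1 + rng.uniform(-step_size, step_size))) for wf in w]
        score = loo_nn_accuracy(X, y, cand)
        if score >= best:  # ties are accepted so the search can drift across plateaus
            w, best = cand, score
    return w, best
```

Replacing `loo_nn_accuracy` with a clustering-based score would give another cell of the framework while leaving the search loop unchanged.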


2.4 Adaptive Data Mining

[Figure: architecture diagram. Components: Supervised Clustering Algorithm, Evaluation System, Adaptation System, Domain Expert, and (predefined) Fitness Functions q(X), …. Labeled flows: Inputs, Clustering, Quality, Changes, Past Experience, Feedback, Summary.]


2.5 Signatures of Data Sets

Input: a set of classified examples

Output: signatures in the dataset that characterize

1. how the examples of a class distribute (in relationship to the examples of the other classes) in the dataset

2. how many regions dominated by a single class exist in the dataset

3. which regions dominated by one class border regions dominated by another class

4. where the regions identified in steps 2 and 3 are located

5. what the density attractors (maxima of the density function) of the classes in the dataset are

Why are we creating those signatures?

– As a preprocessing step to develop smarter classifiers

– To understand why a particular data mining technique works well or does not work well for a particular dataset (meta learning)

Methods employed: density estimation techniques, supervised clustering, proximity graphs (e.g. Delaunay, Gabriel graphs), …
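To show how a proximity graph can feed such a signature, the Python sketch below builds a Gabriel graph by brute force and reports which pairs of classes are joined by an edge, a simple proxy for "which classes border each other". Two points are Gabriel neighbours when no third point lies in the closed disc whose diameter is the segment between them. Using cross-class edges as the bordering test, and the function names, are assumptions for this illustration; the cubic-time loop is only suitable for small datasets.

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def gabriel_edges(points):
    """Return index pairs (i, j) that are neighbours in the Gabriel graph.

    A third point k lies in the closed disc with diameter ij exactly when
    d(i,k)^2 + d(j,k)^2 <= d(i,j)^2, so an edge requires the strict inequality
    to hold for every other point.
    """
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        if all(sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) > d_ij
               for k in range(len(points)) if k not in (i, j)):
            edges.append((i, j))
    return edges

def bordering_class_pairs(points, labels):
    """Pairs of distinct classes connected by at least one Gabriel edge."""
    return {frozenset((labels[i], labels[j]))
            for i, j in gabriel_edges(points) if labels[i] != labels[j]}

pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(gabriel_edges(pts))
print(bordering_class_pairs(pts, ['a', 'a', 'b', 'b']))
```

Counting the cross-class edges per class pair, rather than only recording their presence, would give a rough measure of how long the shared border is.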


Example: Signatures of Data Sets

[Figure: three scatter-plot panels over Attribute1 and Attribute2: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering. Legend: class 1, class 2, unclassified object. Clusters are labeled A through L.]


Applications of Creating Signatures: Class Decomposition (see also [VAE03])

[Figure: three scatter-plot panels over Attribute 1 and Attribute 2.]


2.6 Research Christoph F. Eick 2005-2007

– Measures of Interestingness

– Clustering for Classification

– Supervised Clustering

– Editing / Data Set Compression

– Spatial Data Mining

– Adaptive Clustering

– Distance Function Learning

– Mining Data Streams

– Online Data Mining

– Mining Sensor Data

– Mining Semi-Structured Data

– Web Annotation

– File Prediction

– Evolutionary Computing

– Creating Signatures for Datasets


3. UH Data Mining and Machine Learning Group (UH-DMML)

Co-Directors: Christoph F. Eick and Ricardo Vilalta

Goal: Development of data analysis and data mining techniques and the application of these techniques to challenging problems in physics, geology, astronomy, environmental sciences, and medicine.

Topics investigated:

– Meta Learning

– Classification and Learning from Examples

– Clustering

– Distance Function Learning

– Using Reinforcement Learning for Data Mining

– Spatial Data Mining

– Knowledge Discovery