data mining

Page 1: Data Mining

10/1/2007 TCSS555A Isabelle Bichindaritz 1

Main Concepts of Data Mining

Introduction to Data Preprocessing

Page 2: Data Mining

Learning Objectives

• Study some examples of data mining systems

• Understand why data needs to be preprocessed

• Understand how to explore and describe the data (descriptive data summarization)

Page 3: Data Mining

Acknowledgements

Some of these slides are adapted from Jiawei Han and Micheline Kamber

Page 4: Data Mining

Learning Objectives

• Study some examples of data mining systems

• Understand why data needs to be preprocessed

• Understand how to explore and describe the data (descriptive data summarization)

Page 5: Data Mining

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, information science, and other disciplines]

Page 6: Data Mining

Data Mining: Classification Schemes

• General functionality

– Descriptive data mining

– Predictive data mining

• Different views, different classifications

– Kinds of databases to be mined

– Kinds of knowledge to be discovered

– Kinds of techniques utilized

– Kinds of applications adapted

Page 7: Data Mining

Major Issues in Data Mining (1)

• Mining methodology and user interaction

– Mining different kinds of knowledge in databases

– Interactive mining of knowledge at multiple levels of abstraction

– Incorporation of background knowledge

– Data mining query languages and ad-hoc data mining

– Expression and visualization of data mining results

– Handling noise and incomplete data

– Pattern evaluation: the interestingness problem

• Performance and scalability

– Efficiency and scalability of data mining algorithms

– Parallel, distributed and incremental mining methods

Page 8: Data Mining

Major Issues in Data Mining (2)

• Issues relating to the diversity of data types

– Handling relational and complex types of data

– Mining information from heterogeneous databases and global information systems (WWW)

• Issues related to applications and social impacts

– Application of discovered knowledge: domain-specific data mining tools, intelligent query answering, process control and decision making

– Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem

– Protection of data security, integrity, and privacy

Page 9: Data Mining

Main Concepts in Data Mining

• Data mining: discovering interesting patterns from large amounts of data

• A natural evolution of database technology, in great demand, with wide applications

• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

• Mining can be performed in a variety of information repositories

• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

• Classification of data mining systems

• Major issues in data mining

Page 10: Data Mining

Case-Based Reasoning

• Case-based reasoning (CBR)

– Problem-solving method from artificial intelligence (AI) that proposes to reuse previously solved and memorized problem situations, called cases

– Instance-based method from machine learning

– Can be used for classification/prediction tasks

Page 11: Data Mining

Case-Based Reasoning

[Figure: the CBR cycle. Steps: Interpretation, Retrieve, Reuse, Revise, Retain. Elements: user interface (problem, solution), new case, target case, retrieved case, solved case with its solution, tested case, case base of previous cases.]
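The retrieve, reuse and retain steps of the cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any system from the slides: cases are hypothetical (feature tuple, solution) pairs, similarity is plain Euclidean distance, reuse is a majority vote, and the revise step is left out.

```python
from collections import Counter
from math import dist


def retrieve(case_base, problem, k=1):
    """Return the k stored cases whose problem description is closest to `problem`."""
    return sorted(case_base, key=lambda case: dist(case[0], problem))[:k]


def reuse(retrieved_cases):
    """Propose the most common solution among the retrieved cases."""
    votes = Counter(solution for _, solution in retrieved_cases)
    return votes.most_common(1)[0][0]


def retain(case_base, problem, solution):
    """Store the newly solved case so it can be retrieved later."""
    case_base.append((problem, solution))


def solve(case_base, problem, k=1):
    """Retrieve, reuse, retain (a real system would revise/test before retaining)."""
    solution = reuse(retrieve(case_base, problem, k))
    retain(case_base, problem, solution)
    return solution


# Hypothetical case base: (features, solution) pairs.
case_base = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((5.0, 5.0), "B")]
print(solve(case_base, (1.2, 0.9)))
```

Retaining every solved case without testing it is a simplification; the cycle on the slide places a revise step between reuse and retain.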

Page 12: Data Mining

Fifth Workshop on Case-Based Reasoning in the Health Sciences

Isabelle Bichindaritz
University of Washington, Tacoma, Washington, USA

[email protected]

Stefania Montani
University of Piemonte Orientale, Italy

[email protected]

Page 13: Data Mining

Workshop Stats

• Papers accepted: 10 papers

• Attendees: 19 participants

• Good news !!!

Page 14: Data Mining

Workshop Goals

• Provide a forum for identifying important contributions and opportunities for research on the application of CBR to the Health Sciences

• Promote the systematic study of how to apply CBR to the Health Sciences

• Showcase applications of CBR in the Health Sciences

Page 15: Data Mining

A CBR Solution for Missing Medical Data

Olga Vorobieva and Rainer Schmidt

Institute for Medical Informatics and Biometry University of Rostock, Germany

Alexander Rumiantzev

Pavlov State Medical University, St. Petersburg, Russia

Page 16: Data Mining

Summary

• Application domain: dialysis medicine; effects of fitness on dialysis

• System context: ISOR, a CBR system that explains the exceptional cases – those for which fitness does not improve renal function

• Task / problem addressed: restoration of missing data

• Research hypothesis: case-based reasoning can be applied to restore missing data in a dataset/case base

• Main contribution: synergy between CBR and statistics (statistical modeling)

Page 18: Data Mining

A Case-Based Reasoning Approach to Dose Planning in Radiotherapy

Xueyan Song1, Sanja Petrovic1, and Santhanam Sundar2

1 Automated Scheduling, Optimisation and Planning Group, School of Computer Science, University of Nottingham, UK

2 Dept. of Oncology, City Hospital Campus, Nottingham University Hospitals NHS Trust, Nottingham, UK

Page 19: Data Mining

Summary

• Application domain: dose planning in radiotherapy for prostate cancer

• System context: trade-off between the benefit in terms of cancer control and the risk in terms of harmful side effects to neighboring tissues

• Task / problem addressed: planning problem – designing a radiotherapy dose plan

• Research hypothesis: case-based reasoning can be applied to propose dose plans

• Main contribution: fuzzy representation of attribute values and similarity measure; fusion of similar cases by Dempster-Shafer theory

Page 21: Data Mining

On-Line Domain Knowledge Management for Case-Based Medical Recommendation

Amélie Cordier1, Béatrice Fuchs1, Jean Lieber2, and Alain Mille1

1 LIRIS CNRS, UMR 5202, Université Lyon 1, INSA Lyon, Université Lyon 2, ECL
43, bd du 11 Novembre 1918, Villeurbanne Cedex, France
{Amelie.Cordier, Beatrice.Fuchs, Alain.Mille}@liris.cnrs.fr

2 LORIA (UMR 7503 CNRS–INRIA–Nancy Universities), BP 239, 54506 Vandoeuvre-lès-Nancy, France
[email protected]

Page 22: Data Mining

Summary

• Application domain: breast cancer treatment

• System context: Kasimir is a knowledge management and decision-support system in oncology focusing on case-based protocol treatment recommendations

• Task / problem addressed: planning problem – recommending a treatment plan based on a protocol

• Research hypotheses: conservative adaptation is recommended for adapting a protocol to a new case through case-based reasoning; new domain knowledge can be acquired by analysis of failures

• Main contribution: improvement of adaptation; method for learning from failures of the case-based reasoning

Page 24: Data Mining

Concepts for Novelty Detection and Handling based on Case-Based Reasoning

Petra Perner
Institute of Computer Vision and applied Computer Sciences, IBaI

Page 25: Data Mining

Summary

• Application domain: Hep-2 cell image interpretation

• System context: case-based image interpretation

• Task / problem addressed: classification problem – improve recognition of over 30 different nuclear and cytoplasmic patterns when patterns change over time or new patterns emerge

• Research hypothesis: case-based reasoning can be applied to the problem of novelty detection and also of concept drift

• Main contribution: novel application for CBR: detecting novelty, detecting concept drift

Page 27: Data Mining

Similarity of Medical Cases in Health Care Using Cosine Similarity and Ontology

Shahina Begum, Mobyen Uddin Ahmed, Peter Funk, Ning Xiong, Bo von Schéele

Mälardalen University, Department of Computer Science and Electronics
PO Box 883 SE-721 23, Västerås, Sweden
{firstname.lastname}@mdh.se

Page 28: Data Mining

Summary

• Application domain: any medical domain

• System context: electronic medical records

• Task / problem addressed: retrieval task – finding similar cases represented with structured and semi-structured data

• Research hypothesis: a hybrid similarity measure combining the cosine similarity measure, an ontology, and the nearest neighbor method permits successful retrieval of similar cases

• Main contribution: synergy between case-based reasoning and information retrieval
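The cosine similarity part of such a measure can be sketched as follows. This covers only the cosine component, assuming each case's free text has already been turned into a hypothetical term-frequency dictionary; the ontology and nearest neighbor parts of the hybrid measure are not shown, and the function name is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors (dicts)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical case descriptions, tokenized by whitespace.
case_1 = Counter("chronic stress high heart rate".split())
case_2 = Counter("acute stress elevated heart rate".split())
print(round(cosine_similarity(case_1, case_2), 2))
```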

Page 30: Data Mining

Towards Case-Based Reasoning for Diabetes Management

Cindy Marling1, Jay Shubrook2 and Frank Schwartz2

1 School of Electrical Engineering and Computer Science, Russ College of Engineering and Technology, Ohio University, Athens, Ohio 45701, USA
[email protected]

2 Appalachian Rural Health Institute, Diabetes and Endocrine Center, College of Osteopathic Medicine, Ohio University, Athens, Ohio 45701, USA
[email protected], [email protected]

Page 31: Data Mining

Summary

• Application domain: type I diabetes management

• System context: real-time monitoring of glucose level through insulin pump

• Task / problem addressed: treatment planning – adjusting insulin dosage

• Research hypothesis: case-based reasoning can adjust insulin dosage in real time; cases required for the future CBR system can be acquired through an online Web-based interface

• Main contribution: planning the development of a case-based reasoning system for automatic type I diabetes monitoring

Page 32: Data Mining

Hypothetico-Deductive Case-Based Reasoning

David McSherry

School of Computing and Information Engineering, University of Ulster, Northern Ireland

Page 33: Data Mining

Summary

• Application domain: contact lenses classification

• System context: conversational CBR

• Task / problem addressed: classification problem – recommending type of contact lenses

• Research hypothesis: a hypothetico-deductive CBR approach to test selection can minimize the number of tests required to confirm a hypothesis proposed by the system or user

• Main contribution: synergy between case-based reasoning and hypothetico-deductive reasoning; explanations in CBR

Page 35: Data Mining

Other Papers Summaries

• Case-based Reasoning for managing non-compliance with clinical guidelines, Stefania Montani, University of Piemonte Orientale, Alessandria, Italy

A CBR system able to

– Retrieve similar past episodes (cases) of non-compliance with guidelines, to be suggested to the physician

– Learn more general indications from ground non-compliance cases, adoptable for a formal guideline (GL) revision by an experts committee

• CBR for Temporal Abstractions Configuration in Haemodyalisis, Leonardi Giorgio, Bottrighi Alessio, Portinale Luigi, Montani Stefania, University of Piemonte Orientale, Alessandria, Italy

A CBR system able to choose the appropriate parameters for the configuration of temporal abstractions in the medical domain of haemodialysis

Page 36: Data Mining

Other Papers Summaries

• Prototypical Cases for Knowledge Maintenance in Biomedical CBR, Isabelle Bichindaritz, University of Washington, Tacoma, WA, USA

Prototypical cases have served various purposes in biomedical CBR systems, among them organizing and structuring the memory, guiding the retrieval as well as the reuse of cases, and bootstrapping a CBR system's memory when real cases are not available in sufficient quantity and/or quality. Knowledge maintenance is yet another role that these prototypical cases can play in biomedical CBR systems.

Page 37: Data Mining

Discussion

• Trends and issues

– Integration of CBR with electronic patient records and/or in clinical practice (Begum et al., Marling et al.)

– Importance of prototypical cases (Bichindaritz)

– Incompleteness / non-reliability of cases or CBR system knowledge (Vorobieva et al., Cordier et al., Bichindaritz)

– Novel domains of applications for CBR (Perner, Leonardi et al., Montani)

– Need for synergy with other AI methods (Song et al., McSherry)

Page 38: Data Mining

Discussion

• Pearls of wisdom

– Remember Occam’s razor – introducing complexity in CBR should be carefully justified

– Knowledge in medical cases / domain knowledge is often questionable – finding methods for dealing with this reality is essential for the development of CBR in biomedical domains

– CBR can be promoted as the methodology of choice for evidence gathering in evidence-based medicine

Page 39: Data Mining

Future Plans

• A second special issue on CBR in the Health Sciences, based on papers from this Fifth Workshop on CBR in the Health Sciences, is going to be published in Computational Intelligence.

• The Web site (version 1.beta) and mailing list for our research group are now live:
http://www.cbr-health.org
http://www.cbr-biomed.org

Page 42: Data Mining

Learning Objectives

• Study some examples of data mining systems

• Understand why data needs to be preprocessed

• Understand how to explore and describe the data (descriptive data summarization)

Page 43: Data Mining

Why Data Preprocessing?

• Data mining aims at discovering relationships and other forms of knowledge from data in the real world.

• Data map entities in the application domain to symbolic representation through a measurement function.

• Data in the real world is dirty

– incomplete: missing data, lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

– noisy: containing errors, such as measurement errors, or outliers

– inconsistent: containing discrepancies in codes or names

– distorted: sampling distortion

• No quality data, no quality mining results! (GIGO: garbage in, garbage out)

– Quality decisions must be based on quality data

– Data warehouse needs consistent integration of quality data

Page 44: Data Mining

Multi-Dimensional Measure of Data Quality

• Data quality is multidimensional:

– Accuracy

– Preciseness (= reliability)

– Completeness

– Consistency

– Timeliness

– Believability (= validity)

– Value added

– Interpretability

– Accessibility

• Broad categories:

– intrinsic, contextual, representational, and accessibility

Page 45: Data Mining

Major Tasks in Data Preprocessing

• Data cleaning

– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies and errors

• Data integration

– Integration of multiple databases, data cubes, or files

• Data transformation

– Normalization and aggregation

• Data reduction

– Obtains reduced representation in volume but produces the same or similar analytical results

• Data discretization

– Part of data reduction but with particular importance, especially for numerical data
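Three of these tasks can be illustrated on a single numeric attribute. The sketch below is a minimal example under assumptions: missing values are marked with None and filled with the attribute mean, normalization is min-max scaling, and discretization uses equal-width bins. These are common choices, not the only ones, and the function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None by the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: rescale values linearly to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def equal_width_bins(values, k):
    """Data discretization: map each value to one of k equal-width bins (0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls exactly on the upper edge; keep it in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [23, None, 35, 47, 59]  # hypothetical attribute with one missing value
cleaned = fill_missing_with_mean(ages)
print(cleaned, min_max_normalize(cleaned), equal_width_bins(cleaned, 3))
```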

Page 46: Data Mining

Forms of data preprocessing

Page 47: Data Mining

Learning Objectives

• Study some examples of data mining systems

• Understand why data needs to be preprocessed

• Understand how to explore and describe the data (descriptive data summarization)

Page 48: Data Mining

Mining Data Descriptive Characteristics

• Motivation

– To better understand the data: central tendency, variation and spread

• Data dispersion characteristics

– median, max, min, quantiles, outliers, variance, etc.

• Numerical dimensions correspond to sorted intervals

– Data dispersion: analyzed with multiple granularities of precision

– Boxplot or quantile analysis on sorted intervals

• Dispersion analysis on computed measures

– Folding measures into numerical dimensions

– Boxplot or quantile analysis on the transformed cube

Page 49: Data Mining

Measuring the Central Tendency

• Mean (algebraic measure) (sample vs. population):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$

– Weighted arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

– Trimmed mean: chopping extreme values

• Median: A holistic measure

– Middle value if odd number of values, or average of the middle two values otherwise

– Estimated by interpolation (for grouped data):

$$\text{median} = L_1 + \left(\frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

• Mode

– Value that occurs most frequently in the data

– Unimodal, bimodal, trimodal

– Empirical formula:

$$\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$$
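The measures on this slide can be computed with Python's standard library, as sketched below. The data values and helper names are illustrative assumptions. For the grouped-data median, the arguments follow the usual textbook reading of the interpolation formula: lower boundary of the median interval, total number of values, cumulative frequency of the intervals below the median interval, and the frequency and width of the median interval.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values off each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return mean(ordered[k:len(ordered) - k])


def grouped_median(lower, n_total, freq_below, freq_median, width):
    """Median of grouped data, estimated by interpolation within the median interval."""
    return lower + (n_total / 2 - freq_below) / freq_median * width


# Illustrative sample (for example, salaries in thousands).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(data), median(data), multimode(data))
print(trimmed_mean(data, 0.1))  # drops the single smallest and largest value
```

The sample is bimodal, so `multimode` returns both modes; the single extreme value pulls the mean above the median, which the trimmed mean partly corrects.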

Page 50: Data Mining

Symmetric vs. Skewed Data

• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three distribution curves labeled symmetric, positively skewed, and negatively skewed]

Page 51: Data Mining

Measuring the Dispersion of Data

• Quartiles, outliers and boxplots

– Quartiles: Q1 (25th percentile), Q3 (75th percentile)

– Inter-quartile range: IQR = Q3 – Q1

– Five number summary: min, Q1, M, Q3, max

– Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually

– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

• Variance and standard deviation (sample: s, population: σ)

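– Variance: (algebraic, scalable computation)

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

– Standard deviation s (or σ) is the square root of variance s² (or σ²)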
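A short sketch of these dispersion measures follows. It assumes the "inclusive" quartile method of Python's `statistics.quantiles` (other quartile conventions give slightly different Q1 and Q3), flags outliers with the 1.5 x IQR rule measured from the quartiles, and computes the sample variance from running sums, which is what makes the computation scalable in a single pass. The data and function names are illustrative.

```python
from statistics import quantiles, variance


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)


def iqr_outliers(data):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]


def sample_variance(data):
    """Sample variance from sum(x) and sum(x^2), computable in one pass."""
    n = len(data)
    total = sum(data)
    total_sq = sum(x * x for x in data)
    return (total_sq - total * total / n) / (n - 1)


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(data))
print(iqr_outliers(data))
print(sample_variance(data), variance(data))  # the two should agree
```

The running-sum form can lose precision when the values are large relative to their spread, which is one reason library routines such as `statistics.variance` use a more careful computation.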

Page 52: Data Mining

Boxplot Analysis

• Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum

• Boxplot

– Data is represented with a box

– The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR

– The median is marked by a line within the box

– Whiskers: two lines outside the box extend to Minimum and Maximum

Page 53: Data Mining

Visualization of Data Dispersion: 3-D Boxplots

Page 54: Data Mining

Properties of Normal Distribution Curve

• The normal (distribution) curve

– From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)

– From μ–2σ to μ+2σ: contains about 95% of the measurements

– From μ–3σ to μ+3σ: contains about 99.7% of the measurements
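These three percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2); the function name below is just illustrative.

```python
import math


def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
```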

Page 55: Data Mining

Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of the five-number summary

• Histogram: the x-axis shows values, the y-axis represents frequencies

• Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi

• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another

• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

• Loess (local regression) curve: add a smooth curve to a scatter plot to provide better perception of the pattern of dependence

Page 56: Data Mining

Histogram Analysis

• Graph displays of basic statistical class descriptions

– Frequency histograms

• A univariate graphical method

• Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

Page 57: Data Mining

Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation

– The same values for: min, Q1, median, Q3, max

• But they have rather different data distributions

Page 58: Data Mining

Quantile Plot

• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)

• Plots quantile information

– For data xi sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi
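The points of a quantile plot can be computed as below. The slide does not say how fi is obtained; the sketch assumes the common convention fi = (i − 0.5) / n for the i-th smallest of n values, and the function name is illustrative.

```python
def quantile_plot_points(data):
    """Return (f_i, x_i) pairs for the sorted data, with f_i = (i - 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


for f, x in quantile_plot_points([3, 1, 2, 4]):
    print(f, x)
```

Plotting x against f (for example with a plotting library) gives the quantile plot itself.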

Page 59: Data Mining

Quantile-Quantile (Q-Q) Plot

• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

• Allows the user to view whether there is a shift in going from one distribution to another

Page 60: Data Mining

Scatter plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc.

• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Page 61: Data Mining

Loess Curve

• Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence

• Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Page 62: Data Mining

Positively and Negatively Correlated Data

• The left half fragment is positively correlated

• The right half is negatively correlated
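The slide shows correlation only graphically. One common way to put a number on it, used here as an assumption rather than something the slide states, is the Pearson correlation coefficient: values near +1 indicate positive correlation, values near −1 negative correlation, and values near 0 no linear correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


print(pearson([1, 2, 3, 4], [2, 4, 5, 9]))   # rising together
print(pearson([1, 2, 3, 4], [9, 5, 4, 2]))   # one rises as the other falls
```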

Page 63: Data Mining

Not Correlated Data

Page 64: Data Mining

Data Visualization and Its Methods

• Why data visualization?

– Gain insight into an information space by mapping data onto graphical primitives

– Provide qualitative overview of large data sets

– Search for patterns, trends, structure, irregularities, relationships among data

– Help find interesting regions and suitable parameters for further quantitative analysis

– Provide a visual proof of computer representations derived

• Typical visualization methods:

– Geometric techniques

– Icon-based techniques

– Hierarchical techniques

Page 65: Data Mining

Direct Data Visualization

Ribbons with Twists Based on Vorticity

Page 66: Data Mining

Geometric Techniques

• Visualization of geometric transformations and projections of the data

• Methods

– Landscapes

– Projection pursuit technique

• Finding meaningful projections of multidimensional data

– Scatterplot matrices

– Prosection views

– Hyperslice

– Parallel coordinates

Page 67: Data Mining

Scatterplot Matrices

Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k² − k)/2 scatterplots]

Used by permission of M. Ward, Worcester Polytechnic Institute

Page 68: Data Mining

Landscapes

• Visualization of the data as perspective landscape

• The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data

[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]

Page 69: Data Mining

Parallel Coordinates

• n equidistant axes which are parallel to one of the screen axes and correspond to the attributes

• The axes are scaled to the [minimum, maximum] range of the corresponding attribute

• Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute

[Figure: parallel axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k]
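The mapping from data items to polygonal lines can be sketched as follows. Each attribute is scaled to [0, 1] over its own [minimum, maximum] range, and every data item becomes a list of (axis index, scaled position) vertices. The function name and the sample rows are illustrative assumptions, and no drawing is done here.

```python
def parallel_coordinates(rows):
    """Return one polyline per row: a list of (axis_index, position in [0, 1])."""
    columns = list(zip(*rows))
    ranges = [(min(col), max(col)) for col in columns]

    def scale(value, lo, hi):
        # A constant attribute has no range; place it mid-axis.
        return 0.5 if hi == lo else (value - lo) / (hi - lo)

    return [
        [(axis, scale(value, *ranges[axis])) for axis, value in enumerate(row)]
        for row in rows
    ]


rows = [(1, 10, 7), (3, 30, 7), (2, 20, 7)]  # three items, three attributes
for line in parallel_coordinates(rows):
    print(line)
```

Each returned vertex list can be handed to any line-drawing routine, with the axis index giving the horizontal position of the axis.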

Page 70: Data Mining

Parallel Coordinates of a Data Set

Page 71: Data Mining

Icon-based Techniques

• Visualization of the data values as features of icons

• Methods:

– Chernoff Faces

– Stick Figures

– Shape Coding

– Color Icons

– TileBars: the use of small icons representing the relevance feature vectors in document retrieval

Page 72: Data Mining

Chernoff Faces

• A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.

• The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)

• REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993

• Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html

Page 73: Data Mining

Stick Figures

[Figure: census data showing age, income, gender, education, etc. Used by permission of G. Grinstein, University of Massachusetts at Lowell]

Page 74: Data Mining

Hierarchical Techniques

• Visualization of the data using a hierarchical partitioning into subspaces.

• Methods

– Dimensional Stacking

– Worlds-within-Worlds

– Treemap

– Cone Trees

– InfoCube

Page 75: Data Mining

Dimensional Stacking

[Figure: nested axes labeled attribute 1, attribute 2, attribute 3, attribute 4]

• Partitioning of the n-dimensional attribute space in 2-D subspaces which are ‘stacked’ into each other

• Partitioning of the attribute value ranges into classes; the important attributes should be used on the outer levels

• Adequate for data with ordinal attributes of low cardinality

• But, difficult to display more than nine dimensions

• Important to map dimensions appropriately

Page 76: Data Mining

Dimensional Stacking

Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes

Used by permission of M. Ward, Worcester Polytechnic Institute

Page 77: Data Mining

Tree-Map

• Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values

• The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)
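The alternating partitioning can be sketched with a small recursive function, a layout often called slice-and-dice. The tree format (name, size, children) and the function name are assumptions made for this illustration; each level splits its rectangle along x or y in turn, in proportion to the children's sizes.

```python
def slice_and_dice(node, x, y, width, height, depth=0, out=None):
    """Return (name, x, y, width, height) rectangles for the leaves of `node`.

    `node` is a (name, size, children) tuple; children are nodes of the same form.
    """
    if out is None:
        out = []
    name, _size, children = node
    if not children:
        out.append((name, x, y, width, height))
        return out
    total = sum(child[1] for child in children)
    offset = 0.0  # fraction of the rectangle already handed out
    for child in children:
        share = child[1] / total
        if depth % 2 == 0:   # even levels: partition along the x-dimension
            slice_and_dice(child, x + offset * width, y, share * width, height, depth + 1, out)
        else:                # odd levels: partition along the y-dimension
            slice_and_dice(child, x, y + offset * height, width, share * height, depth + 1, out)
        offset += share
    return out


# Hypothetical file system: sizes of directories equal the sum of their files.
tree = ("root", 4, [("a.txt", 1, []), ("docs", 3, [("b.txt", 1, []), ("c.txt", 2, [])])])
for rect in slice_and_dice(tree, 0.0, 0.0, 100.0, 100.0):
    print(rect)
```

Each leaf receives an area proportional to its size, and together the leaves fill the whole screen rectangle.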

MSR Netscan Image

Page 78: Data Mining

Tree-Map of a File System (Shneiderman)