data mining

10/1/2007 TCSS555A Isabelle Bichindaritz 1

Main Concepts of Data Mining

Introduction to Data Preprocessing


Learning Objectives

• Study some examples of data mining systems

• Understand why to preprocess the data.

• Understand how to describe the data (descriptive data summarization)


Acknowledgements

Some of these slides are adapted from Jiawei Han and Micheline Kamber







Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]


Data Mining: Classification Schemes

• General functionality

– Descriptive data mining

– Predictive data mining

• Different views, different classifications

– Kinds of databases to be mined

– Kinds of knowledge to be discovered

– Kinds of techniques utilized

– Kinds of applications adapted


Major Issues in Data Mining (1)

• Mining methodology and user interaction

– Mining different kinds of knowledge in databases

– Interactive mining of knowledge at multiple levels of abstraction

– Incorporation of background knowledge

– Data mining query languages and ad-hoc data mining

– Expression and visualization of data mining results

– Handling noise and incomplete data

– Pattern evaluation: the interestingness problem

• Performance and scalability

– Efficiency and scalability of data mining algorithms

– Parallel, distributed and incremental mining methods


Major Issues in Data Mining (2)

• Issues relating to the diversity of data types

– Handling relational and complex types of data

– Mining information from heterogeneous databases and global information systems (WWW)

• Issues related to applications and social impacts

– Application of discovered knowledge

• Domain-specific data mining tools

• Intelligent query answering

• Process control and decision making

– Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem

– Protection of data security, integrity, and privacy


Main Concepts in Data Mining

• Data mining: discovering interesting patterns from large amounts of data

• A natural evolution of database technology, in great demand, with wide applications

• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

• Mining can be performed in a variety of information repositories

• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

• Classification of data mining systems

• Major issues in data mining


Case-Based Reasoning

• Case-based reasoning (CBR)

– Problem-solving method from artificial intelligence (AI) that proposes to reuse previously solved and memorized problem situations, called cases

– Instance-based method from machine learning

– Can be used for classification/prediction tasks


Case-Based Reasoning

[Figure: the CBR cycle (Interpretation, Retrieve, Reuse, Revise, Retain) linking the New Case / Target case, the Retrieved Case, the Solved Case (Solution), the Tested Case, and the CASE BASE of Previous Cases, with PROBLEM and SOLUTION exchanged through the USER INTERFACE]
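The cycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Case` class, the Euclidean distance used as a similarity measure, and the two example cases are hypothetical and do not come from any system described in these slides. Reuse simply copies the retrieved solution, and the Revise step is left as a comment.

```python
from dataclasses import dataclass
import math


@dataclass
class Case:
    features: tuple   # problem description (hypothetical numeric attributes)
    solution: str     # stored solution, here a class label


def retrieve(case_base, target):
    """Retrieve: return the stored case closest to the target problem.

    Euclidean distance is an assumed similarity measure for this sketch.
    """
    return min(case_base, key=lambda c: math.dist(c.features, target))


def solve(case_base, target):
    """Run one Retrieve / Reuse / Retain pass for a new problem."""
    nearest = retrieve(case_base, target)
    proposed = nearest.solution  # Reuse: copy the solution, no adaptation
    # Revise would happen here: a user or a test confirms or corrects it.
    case_base.append(Case(tuple(target), proposed))  # Retain the new case
    return proposed


case_base = [Case((1.0, 1.0), "low risk"), Case((8.0, 9.0), "high risk")]
result = solve(case_base, (7.5, 8.0))
print(result, len(case_base))
```

Because the solved case is retained, the case base grows with use, which is what makes CBR an instance-based learning method.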


Fifth Workshop on Case-Based Reasoning in the Health Sciences

Isabelle Bichindaritz, University of Washington, Tacoma, Washington, USA

[email protected]

Stefania Montani, University of Piemonte Orientale, Italy

[email protected]


Workshop Stats

• Papers accepted: 10 papers

• Attendees: 19 participants

• Good news !!!


Workshop Goals

• Provide a forum for identifying important contributions and opportunities for research on the application of CBR to the Health Sciences

• Promote the systematic study of how to apply CBR to the Health Sciences

• Showcase applications of CBR in the Health Sciences


A CBR Solution for Missing Medical Data

Olga Vorobieva and Rainer Schmidt

Institute for Medical Informatics and Biometry University of Rostock, Germany

Alexander Rumiantzev

Pavlov State Medical University, St.Petersburg, Russia


Summary

• Application domain: dialysis medicine, effects of fitness on dialysis

• System context: ISOR, a CBR system that explains the exceptional cases – those for which fitness does not improve renal function

• Task / problem addressed: restoration of missing data

• Research hypothesis: case-based reasoning can be applied to restore missing data in a dataset/case base

• Main contribution: synergy between CBR and statistics (statistical modeling)


A Case-Based Reasoning Approach to Dose Planning in Radiotherapy

Xueyan Song (1), Sanja Petrovic (1), and Santhanam Sundar (2)

(1) Automated Scheduling, Optimisation and Planning Group, School of Computer Science, University of Nottingham, UK

(2) Dept. of Oncology, City Hospital Campus, Nottingham University Hospitals NHS Trust, Nottingham, UK



Summary

• Application domain: dose planning in radiotherapy for prostate cancer

• System context: trade-off between the benefit in terms of cancer control and the risk in terms of harmful side effects to neighboring tissues

• Task / problem addressed: planning problem – designing a radiotherapy dose plan

• Research hypothesis: case-based reasoning can be applied to propose dose plans

• Main contribution: fuzzy representation of attribute values and similarity measure; fusion of similar cases by Dempster-Shafer theory


On-Line Domain Knowledge Management for Case-Based Medical Recommendation

Amélie Cordier (1), Béatrice Fuchs (1), Jean Lieber (2), and Alain Mille (1)

(1) LIRIS CNRS, UMR 5202, Université Lyon 1, INSA Lyon, Université Lyon 2, ECL, 43, bd du 11 Novembre 1918, Villeurbanne Cedex, France, {Amelie.Cordier, Beatrice.Fuchs, Alain.Mille}@liris.cnrs.fr

(2) LORIA (UMR 7503 CNRS–INRIA–Nancy Universities), BP 239, 54506 Vandoeuvre-lès-Nancy, France

[email protected]



Summary

• Application domain: breast cancer treatment

• System context: Kasimir is a knowledge management and decision-support system in oncology focusing on case-based protocol treatment recommendations

• Task / problem addressed: planning problem – recommending a treatment plan based on a protocol

• Research hypotheses: conservative adaptation is recommended for adapting a protocol to a new case through case-based reasoning; new domain knowledge can be acquired by analysis of failures

• Main contribution: improvement of adaptation; method for learning from failures of the case-based reasoning


Concepts for Novelty Detection and Handling based on Case-Based Reasoning

Petra Perner, Institute of Computer Vision and applied Computer Sciences, IBaI



Summary

• Application domain: Hep-2 cell image interpretation

• System context: case-based image interpretation

• Task / problem addressed: classification problem – improve recognition of over 30 different nuclear and cytoplasmic patterns when patterns change over time or new patterns emerge

• Research hypothesis: case-based reasoning can be applied to the problem of novelty detection and also of concept drift

• Main contribution: novel application for CBR: detecting novelty, detecting concept drift


Similarity of Medical Cases in Health Care Using Cosine Similarity and Ontology

Shahina Begum, Mobyen Uddin Ahmed, Peter Funk, Ning Xiong, Bo von Schéele

Mälardalen University, Department of Computer Science and Electronics

PO Box 883 SE-721 23, Västerås, Sweden, {firstname.lastname}@mdh.se



Summary

• Application domain: any medical domain

• System context: electronic medical records

• Task / problem addressed: retrieval task – finding similar cases represented with structured and semi-structured data

• Research hypothesis: a hybrid similarity measure combining the cosine similarity measure, an ontology, and the nearest neighbor method permits successful retrieval of similar cases

• Main contribution: synergy between case-based reasoning and information retrieval


Towards Case-Based Reasoning for Diabetes Management

Cindy Marling (1), Jay Shubrook (2) and Frank Schwartz (2)

(1) School of Electrical Engineering and Computer Science, Russ College of Engineering and Technology, Ohio University, Athens, Ohio 45701, USA

[email protected]

(2) Appalachian Rural Health Institute, Diabetes and Endocrine Center, College of Osteopathic Medicine, Ohio University, Athens, Ohio 45701, USA

[email protected], [email protected]



Summary

• Application domain: type I diabetes management

• System context: real-time monitoring of glucose level through insulin pump

• Task / problem addressed: treatment planning – adjusting insulin dosage

• Research hypothesis: case-based reasoning can adjust insulin dosage in real time; cases required for the future CBR system can be acquired through an online Web-based interface

• Main contribution: planning the development of a case-based reasoning system for automatic type I diabetes monitoring


Hypothetico-Deductive Case-Based Reasoning

David McSherry

School of Computing and Information Engineering, University of Ulster, Northern Ireland



Summary

• Application domain: contact lenses classification

• System context: conversational CBR

• Task / problem addressed: classification problem – recommending type of contact lenses

• Research hypothesis: a hypothetico-deductive CBR approach to test selection can minimize the number of tests required to confirm a hypothesis proposed by the system or user

• Main contribution: synergy between case-based reasoning and hypothetico-deductive reasoning; explanations in CBR


Other Papers Summaries

• Case-based Reasoning for managing non-compliance with clinical guidelines, Stefania Montani, University of Piemonte Orientale, Alessandria, Italy. A CBR system able to

– Retrieve similar past episodes (cases) of non-compliance to guidelines, to be suggested to the physician

– Learn more general indications from ground non-compliance cases, adoptable for a formal guideline (GL) revision by an expert committee

• CBR for Temporal Abstractions Configuration in Haemodyalisis, Leonardi Giorgio, Bottrighi Alessio, Portinale Luigi, Montani Stefania, University of Piemonte Orientale, Alessandria, Italy. A CBR system able to choose the appropriate parameters for the configuration of temporal abstractions in the medical domain of haemodialysis


Other Papers Summaries

• Prototypical Cases for Knowledge Maintenance in Biomedical CBR, Isabelle Bichindaritz, University of Washington, Tacoma, WA, USA. Prototypical cases have served various purposes in biomedical CBR systems, among which to organize and structure the memory, to guide the retrieval as well as the reuse of cases, and to bootstrap a CBR system memory when real cases are not available in sufficient quantity and/or quality. Knowledge maintenance is yet another role that these prototypical cases can play in biomedical CBR systems


Discussion

• Trends and issues

– Integration of CBR with electronic patient records and/or in clinical practice (Begum et al., Marling et al.)

– Importance of prototypical cases (Bichindaritz)

– Incompleteness / non-reliability of cases or CBR system knowledge (Vorobieva et al., Cordier et al., Bichindaritz)

– Novel domains of applications for CBR (Perner, Leonardi et al., Montani)

– Need for synergy with other AI methods (Song et al., McSherry)


Discussion

• Pearls of wisdom

– Remember Occam’s razor – introducing complexity in CBR should be carefully justified

– Knowledge in medical cases / domain knowledge is often questionable – finding methods for dealing with this reality is essential for the development of CBR in biomedical domains

– CBR can be promoted as the methodology of choice for evidence gathering in evidence-based medicine


Future Plans

• A second special issue on CBR in the Health Sciences, based on papers from this Fifth Workshop on CBR in the Health Sciences, is going to be published in Computational Intelligence.

• The Web site (version 1.beta) and mailing list for our research group are now live:

http://www.cbr-health.org/

http://www.cbr-biomed.org/







Why Data Preprocessing?

• Data mining aims at discovering relationships and other forms of knowledge from data in the real world.

• Data map entities in the application domain to symbolic representation through a measurement function.

• Data in the real world is dirty

– incomplete: missing data, lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

– noisy: containing errors, such as measurement errors, or outliers

– inconsistent: containing discrepancies in codes or names

– distorted: sampling distortion

• No quality data, no quality mining results! (GIGO)

– Quality decisions must be based on quality data

– Data warehouse needs consistent integration of quality data


Multi-Dimensional Measure of Data Quality

• Data quality is multidimensional:

– Accuracy

– Preciseness (=reliability)

– Completeness

– Consistency

– Timeliness

– Believability (=validity)

– Value added

– Interpretability

– Accessibility

• Broad categories:

– intrinsic, contextual, representational, and accessibility.


Major Tasks in Data Preprocessing

• Data cleaning

– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies and errors

• Data integration

– Integration of multiple databases, data cubes, or files

• Data transformation

– Normalization and aggregation

• Data reduction

– Obtains reduced representation in volume but produces the same or similar analytical results

• Data discretization

– Part of data reduction but with particular importance, especially for numerical data
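Three of these tasks can be illustrated on a single numeric attribute. The sketch below is a minimal example under stated assumptions: the `ages` values are made up, mean imputation is only one of several ways to fill in missing values, and equal-width binning is only one discretization scheme.

```python
import statistics

ages = [23, None, 35, 41, None, 58]  # hypothetical attribute with missing values

# Data cleaning: fill in missing values with the mean of the observed values
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)
cleaned = [a if a is not None else mean_age for a in ages]

# Data transformation: min-max normalization onto [0, 1]
lo, hi = min(cleaned), max(cleaned)
normalized = [(a - lo) / (hi - lo) for a in cleaned]


# Data discretization: equal-width bins numbered 0 .. bins-1
def discretize(value, lo, hi, bins=3):
    width = (hi - lo) / bins
    # The maximum value would fall just past the last bin, so clamp it.
    return min(int((value - lo) / width), bins - 1)


labels = [discretize(a, lo, hi) for a in cleaned]
print(cleaned)
print(normalized)
print(labels)
```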


Forms of data preprocessing







Mining Data Descriptive Characteristics

• Motivation

– To better understand the data: central tendency, variation and spread

• Data dispersion characteristics

– median, max, min, quantiles, outliers, variance, etc.

• Numerical dimensions correspond to sorted intervals

– Data dispersion: analyzed with multiple granularities of precision

– Boxplot or quantile analysis on sorted intervals

• Dispersion analysis on computed measures

– Folding measures into numerical dimensions

– Boxplot or quantile analysis on the transformed cube


Measuring the Central Tendency

• Mean (algebraic measure) (sample vs. population):

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)

– Weighted arithmetic mean:

$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

– Trimmed mean: chopping extreme values

• Median: A holistic measure

– Middle value if odd number of values, or average of the middle two values otherwise

– Estimated by interpolation (for grouped data):

$\text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$

• Mode

– Value that occurs most frequently in the data

– Unimodal, bimodal, trimodal

– Empirical formula:

$\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
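The measures on this slide can be computed with Python's standard library. The sketch below is illustrative only: the sample values and weights are hypothetical, and the helper names are not from the slides. In `grouped_median`, `l1` is the lower boundary of the interval containing the median, `n` the total number of values, `freq_below` the summed frequency of all lower intervals, `freq_median` the frequency of the median interval, and `width` that interval's width.

```python
import statistics


def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def trimmed_mean(xs, k):
    """Mean after chopping the k smallest and the k largest values."""
    kept = sorted(xs)[k:len(xs) - k]
    return sum(kept) / len(kept)


def grouped_median(l1, n, freq_below, freq_median, width):
    """Median of grouped data, estimated by interpolation."""
    return l1 + (n / 2 - freq_below) / freq_median * width


values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # hypothetical sample

print("mean:", statistics.mean(values))
print("median:", statistics.median(values))
print("modes:", statistics.multimode(values))  # two modes: bimodal
print("trimmed mean:", trimmed_mean(values, 1))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
print("grouped median:", grouped_median(20, 100, 30, 40, 10))
```

The single large value 110 pulls the mean above the median, which is why the trimmed mean and the median are often preferred for skewed data.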


Symmetric vs. Skewed Data

• Median, mean and mode of symmetric, positively skewed, and negatively skewed data

[Figure: three distribution curves labeled symmetric, positively skewed, and negatively skewed]


Measuring the Dispersion of Data

• Quartiles, outliers and boxplots

– Quartiles: Q1 (25th percentile), Q3 (75th percentile)

– Inter-quartile range: IQR = Q3 – Q1

– Five number summary: min, Q1, M, Q3, max

– Boxplot: ends of the box are the quartiles, median is marked, whiskers extend from the box, and outliers are plotted individually

– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

• Variance and standard deviation (sample: s, population: σ)

– Variance: (algebraic, scalable computation)

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$

$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$

– Standard deviation s (or σ) is the square root of variance s² (or σ²)
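A short Python sketch of these dispersion measures follows. The data values are hypothetical, and the `inclusive` method of `statistics.quantiles` is just one of several quartile conventions, so other tools may report slightly different Q1 and Q3 for the same data. The last lines check that the definition of the sample variance and its one-pass (scalable) form agree.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical values with one extreme

# Quartiles, IQR and five-number summary
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_number = (min(data), q1, median, q3, max(data))

# Usual outlier rule: more than 1.5 x IQR below Q1 or above Q3
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Sample variance, by definition and by the sum-of-squares form
n = len(data)
mean = sum(data) / n
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)
s2_one_pass = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(five_number, iqr, outliers)
print(s2_definition, s2_one_pass, statistics.variance(data))
```

The one-pass form only needs the running totals of x and x², which is what makes the variance computable incrementally on large data sets.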


Boxplot Analysis

• Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum

• Boxplot

– Data is represented with a box

– The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR

– The median is marked by a line within the box

– Whiskers: two lines outside the box extend to Minimum and Maximum


Visualization of Data Dispersion: 3-D Boxplots


Properties of Normal Distribution Curve

• The normal (distribution) curve (μ: mean, σ: standard deviation)

– From μ–σ to μ+σ: contains about 68% of the measurements

– From μ–2σ to μ+2σ: contains about 95% of the measurements

– From μ–3σ to μ+3σ: contains about 99.7% of the measurements
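These percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2); the short Python sketch below evaluates it for k = 1, 2, 3. The function name `coverage` is chosen here for illustration.

```python
import math


def coverage(k):
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```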


Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary

• Histogram: x-axis represents values, y-axis represents frequencies

• Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi

• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another

• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

• Loess (local regression) curve: add a smooth curve to a scatter plot to provide better perception of the pattern of dependence


Histogram Analysis

• Graph displays of basic statistical class descriptions

– Frequency histograms

• A univariate graphical method

• Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
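A frequency histogram only needs a count per class. The sketch below uses equal-width classes and prints one bar of `#` characters per class; the price values, the starting point 40 and the class width 20 are hypothetical choices for illustration.

```python
from collections import Counter


def histogram(values, low, width):
    """Count values per equal-width class.

    Class k covers the interval [low + k*width, low + (k+1)*width).
    """
    return Counter(int((v - low) // width) for v in values)


prices = [42, 47, 55, 58, 61, 63, 64, 72, 78, 95]  # hypothetical unit prices
counts = histogram(prices, low=40, width=20)

for k in sorted(counts):
    start = 40 + k * 20
    print(f"[{start}, {start + 20}): {'#' * counts[k]}")
```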


Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation

– The same values for: min, Q1, median, Q3, max

• But they have rather different data distributions


Quantile Plot

• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)

• Plots quantile information

– For data xi sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi
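The points of a quantile plot can be computed directly from the sorted data. The sketch below assumes the common convention fi = (i − 0.5)/n for the i-th smallest of n values; the slide does not state which convention it uses, and the sample values are hypothetical.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumes f_i = (i - 0.5) / n for the i-th smallest value (1-based).
    """
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


points = quantile_points([74, 40, 47, 75, 43])  # hypothetical unit prices
for f, x in points:
    print(f"{f:.2f} -> {x}")
```

Plotting x against f then shows every value, so gaps and unusual occurrences remain visible.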


Quantile-Quantile (Q-Q) Plot

• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

• Allows the user to view whether there is a shift in going from one distribution to another
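The pairs plotted in a q-q plot can be sketched as follows. This is an assumption-laden illustration: the two samples (`branch_1`, `branch_2`) are hypothetical, only the quartiles are paired, and the `inclusive` quantile method is one of several conventions. Points lying off the line y = x indicate a shift between the distributions.

```python
import statistics


def qq_pairs(sample_a, sample_b, n=4):
    """Pair the corresponding quantiles of two univariate samples."""
    qa = statistics.quantiles(sample_a, n=n, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))


branch_1 = [40, 50, 60, 70, 80]  # hypothetical unit prices at one branch
branch_2 = [45, 55, 65, 75, 85]  # hypothetical unit prices at another
pairs = qq_pairs(branch_1, branch_2)
shift = [b - a for a, b in pairs]
print(pairs)
print(shift)
```

Here every quantile of the second sample is higher by the same amount, so the points would lie on a line parallel to y = x.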


Scatter plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc.

• Each pair of values is treated as a pair of coordinates and plotted as points in the plane


Loess Curve

• Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence

• Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression


Positively and Negatively Correlated Data

• The left half of the figure shows positively correlated data

• The right half shows negatively correlated data


Not Correlated Data


Data Visualization and Its Methods

• Why data visualization?

– Gain insight into an information space by mapping data onto graphical primitives

– Provide qualitative overview of large data sets

– Search for patterns, trends, structure, irregularities, relationships among data

– Help find interesting regions and suitable parameters for further quantitative analysis

– Provide a visual proof of computer representations derived

• Typical visualization methods:

– Geometric techniques

– Icon-based techniques

– Hierarchical techniques


Direct Data Visualization

[Figure: Ribbons with Twists Based on Vorticity]


Geometric Techniques

• Visualization of geometric transformations and projections of the data

• Methods

– Landscapes

– Projection pursuit technique

• Finding meaningful projections of multidimensional data

– Scatterplot matrices

– Prosection views

– Hyperslice

– Parallel coordinates


Scatterplot Matrices

Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k² − k)/2 distinct scatterplots]

Used by permission of M. Ward, Worcester Polytechnic Institute


Landscapes

• Visualization of the data as perspective landscape

• The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data

[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]

Parallel Coordinates

[Figure: parallel vertical axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k]

• n equidistant axes which are parallel to one of the screen axes and correspond to the attributes

• The axes are scaled to the [minimum, maximum] range of the corresponding attribute

• Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
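The mapping from a data item to its polygonal line can be sketched in Python. The example below is a minimal illustration with hypothetical three-attribute items; it assumes every attribute has distinct minimum and maximum values, and it returns positions between 0 (attribute minimum) and 1 (attribute maximum) rather than drawing anything.

```python
def polyline(item, minima, maxima):
    """Return (axis_index, position) points for one data item.

    Position is 0.0 at the attribute's minimum and 1.0 at its maximum.
    Assumes maximum > minimum for every attribute.
    """
    return [
        (axis, (value - lo) / (hi - lo))
        for axis, (value, lo, hi) in enumerate(zip(item, minima, maxima))
    ]


# Hypothetical items with three attributes each
data = [(5.1, 3.5, 1.4), (7.0, 3.2, 4.7), (6.3, 3.3, 6.0)]
minima = [min(column) for column in zip(*data)]
maxima = [max(column) for column in zip(*data)]
lines = [polyline(item, minima, maxima) for item in data]

for line in lines:
    print(line)
```

Connecting the points of one item from axis to axis gives its polygonal line; items with similar values produce lines that run close together.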


Parallel Coordinates of a Data Set


Icon-based Techniques

• Visualization of the data values as features of icons

• Methods:

– Chernoff Faces

– Stick Figures

– Shape Coding

– Color Icons

– TileBars: the use of small icons representing the relevance feature vectors in document retrieval


Chernoff Faces

• A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.

• The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)

• REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993

• Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/ChernoffFace.html



Stick Figures

[Figure: census data showing age, income, gender, education, etc. Used by permission of G. Grinstein, University of Massachusetts at Lowell]


Hierarchical Techniques

• Visualization of the data using a hierarchical partitioning into subspaces.

• Methods

– Dimensional Stacking

– Worlds-within-Worlds

– Treemap

– Cone Trees

– InfoCube


Dimensional Stacking

[Figure: nested axes labeled attribute 1, attribute 2, attribute 3, attribute 4]

• Partitioning of the n-dimensional attribute space in 2-D subspaces which are ‘stacked’ into each other

• Partitioning of the attribute value ranges into classes; the important attributes should be used on the outer levels

• Adequate for data with ordinal attributes of low cardinality

• But, difficult to display more than nine dimensions

• Important to map dimensions appropriately


Dimensional Stacking

[Figure: visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes. Used by permission of M. Ward, Worcester Polytechnic Institute]


Tree-Map

• Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values

• The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)

[Figure: MSR Netscan image]
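The alternating partitioning can be sketched as a small recursive layout function. This is an illustration under assumptions, not the layout of any particular tool: each node is a hypothetical `(name, size, children)` tuple, a child's share of its parent's rectangle is proportional to its size, and the split direction flips at each level of the hierarchy.

```python
def treemap(node, x, y, w, h, split_x=True, out=None):
    """Lay out a hierarchy as nested rectangles.

    `node` is (name, size, children). Returns {name: (x, y, w, h)}.
    The x- and y-dimension are partitioned alternately, level by level.
    """
    if out is None:
        out = {}
    name, size, children = node
    out[name] = (x, y, w, h)
    offset = 0.0  # fraction of the parent already used by earlier children
    for child in children:
        share = child[1] / size
        if split_x:
            treemap(child, x + offset * w, y, w * share, h, False, out)
        else:
            treemap(child, x, y + offset * h, w, h * share, True, out)
        offset += share
    return out


# Hypothetical hierarchy: sizes could be file sizes or class counts
tree = ("root", 100, [("a", 50, [("a1", 30, []), ("a2", 20, [])]),
                      ("b", 50, [])])
layout = treemap(tree, 0, 0, 100, 100)
for name, rect in layout.items():
    print(name, rect)
```

In this example the root is split along x into `a` and `b`, and `a` is then split along y into `a1` and `a2`, so the areas stay proportional to the sizes.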


Tree-Map of a File System (Shneiderman)
