
What is data mining?

Włodzisław Duch

Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland

http://www.phys.uni.torun.pl/~duch

ISEP Porto, 8-12 July 2002

What is it about?

• Data used to be precious! Now it is overwhelming ...
• In many areas of science, business and commerce people are drowning in data.
• Ex: astronomy super-telescope – data mining in existing databases.
• Database technology allows storing and retrieving large amounts of data of any kind.
• There is knowledge hidden in data.
• Data analysis requires intelligence.

Ancient history

• 1960: first databases, collections of data.
• 1970: RDBMS, relational data model most popular today, large centralized systems.
• 1980: application-oriented data models, specialized for scientific, geographic, engineering data, time series, text, object-oriented models, distributed databases.
• 1990: multimedia and Web databases, data warehousing (subject-oriented DB for decision support), and on-line analytical processing (OLAP), deduction and verification of hypothetical patterns.
• Data mining: first conference in 1989, book 1996, discover something useful!

Data Mining History

• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro and W. Frawley 1991)
• 1991-1994 Workshops on KDD
• 1996 Advances in Knowledge Discovery and Data Mining (Fayyad et al.)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
• 1997 Journal of Data Mining and Knowledge Discovery
• 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
• Many conferences on data mining: PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.

References, papers

KDD WWW Resources:
http://www.kdd.org
http://www.kdnuggets.com
http://www.the-data-mine.com
http://www.acm.org/sigkdd/

ResearchIndex: http://citeseer.nj.nec.com/cs

AI & ML aspects
http://www.phys.uni.torun.pl/kmk

NN & Statistics
http://www.phys.uni.torun.pl/kmk

Comparison of results on many datasets:
http://www.phys.uni.torun.pl/kmk

Data Mining and statistics

• Statisticians deal with data: what’s new in DM?
• Many DM methods have roots in statistics.
• Statistics used to deal with small, controlled experiments, while DM deals with large, messy collections of data.
• Statistics is based on analytical probabilistic models, DM is based on algorithms that find patterns in data.
• Many DM algorithms came from other sources and slowly get some statistical justification.
• Key factor for DM is the computer cost/performance.
• Sometimes DM is more art than science …

Types of Data

• Statistical data – clean, numerical, controlled experiments, vector space model.
• Relational data – marketing, finances.
• Textual data – Web, NLP, search.
• Complex structures – chemistry, economics.
• Sequence data – bioinformatics.
• Multimedia data – images, video.
• Signals – dynamic data, biosignals.
• AI data – logical problems, games, behavior …

What is DM?

• Discovering interesting patterns, finding useful summaries of large databases.
• DM is more than database technology and On-Line Analytical Processing (OLAP) tools.
• DM is more than statistical analysis, although it includes classification, association, clustering, outlier and trend analysis, decision rules, prototype cases, multidimensional visualization etc. Understanding of data has not been an explicit goal of statistics, which focuses on predictive data models.

DM applications

• Many applications, but spectacular new knowledge is rarely discovered.

Some examples:

– “Diapers and beer” correlation: place them close together and put potato chips in between.
– Mining astronomical catalogs (Skycat, Sloan Sky Survey): new subtype of stars has been discovered!
– Bioinformatics: more precise characterization of some diseases, many discoveries to be made?
– Credit card fraud detection (HNC company).
– Discounts of air/hotel for frequent travelers.

Important issues in data mining.

• Use of statistical and CI methods for KDD.
• What makes an interesting pattern?
• Handling uncertainty in the data.
• Handling noise, outliers and missing or unknown data.
• Finding linguistic variables, discretization of continuous data, presentation and evaluation of knowledge.
• Knowledge representation for structural data, heterogeneous information, textual databases & NLP.
• Performance, scalability, distributed data, incremental or “on-line” processing.
• Best form of explanation depends on the application.

DM dangers

• If there are too many conclusions to draw, some inferences will be true by chance because the data samples are too small (Bonferroni’s theorem).

Example 1: J. B. Rhine (Duke Univ.) ESP tests.
1 person in 1000 correctly guessed the color (red or black) of 10 cards: is this evidence for ESP?
Retesting of these people gave average results.
Rhine’s conclusion: telling people that they have ESP interferes with their ability …

Example 2: in a random sequence of length N formed from m letters, all possible subsequences of length log_m N are found => Bible code!
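
The arithmetic behind Example 1 can be checked directly. This is a minimal sketch, assuming every guess is an independent 50/50 choice; the pool of 1000 people is taken from the slide, while the variable names are illustrative only.

```python
# Chance of guessing the color (red/black) of 10 cards in a row by luck alone,
# assuming independent guesses with probability 1/2 each (illustrative assumption).
p_all_correct = 0.5 ** 10              # = 1/1024

# Pool size matching the "1 person in 1000" figure quoted on the slide.
n_people = 1000

# Expected number of people who pass the test purely by chance.
expected_lucky = n_people * p_all_correct

# Probability that at least one person in the pool passes by chance.
p_at_least_one = 1 - (1 - p_all_correct) ** n_people

print(f"P(10 correct by chance)       = {p_all_correct:.6f}")
print(f"Expected lucky guessers/{n_people} = {expected_lucky:.2f}")
print(f"P(at least one lucky guesser) = {p_at_least_one:.2f}")
```

Under these assumptions about one perfect score per thousand people is expected with no ESP at all, which is why retesting gives average results.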

Data Mining process

Knowledge discovery in databases (KDD): a search process for understandable and useful patterns in data.

[Diagram of the process, with boxes labeled: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation. One part of the diagram is marked “most effort”.]

Stages of DM process

• Data gathering, data warehousing, Web crawling.
• Preparation of the data: cleaning, removing outliers and impossible values, removing wrong records, finding missing data.
• Exploratory data analysis: visualization of different aspects of data.
• Finding relevant features for questions that are asked, preparing data structures for predictive methods, converting symbolic values to numerical representation.
• Pattern extraction, discovery, rules, prototypes.
• Evaluation of knowledge gained, finding useful patterns, consultation with experts.

Multidimensional Data Cuboids

• Data warehouses use multidimensional data model.
• Projections (views) of data on different dimensions (attributes) form “data cuboids”.
• In DB warehousing literature:
  base cuboid: original data, N-Dim.
  apex cuboid: 0-D cuboid, highest-level summary;
  data cube: lattice of cuboids.
• Ex: Sales data cube, viewed in multiple dimensions
  – Dimension tables, ex. item (item_name, brand, type), or time (day, week, month, quarter, year)
  – Fact tables, measures (such as cost), and keys to each of the related dimension tables
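
Projecting a fact table onto a subset of dimensions can be sketched in a few lines. The rows, the dimension values and the helper `cuboid` below are hypothetical and serve only to illustrate the idea; a real warehouse would do this with its own OLAP machinery.

```python
from collections import defaultdict

# Hypothetical fact table: keys to the dimensions plus one measure (cost).
FACTS = [
    {"time": "Q1", "item": "tv",    "location": "Porto", "cost": 400.0},
    {"time": "Q1", "item": "radio", "location": "Porto", "cost": 50.0},
    {"time": "Q2", "item": "tv",    "location": "Torun", "cost": 420.0},
    {"time": "Q2", "item": "tv",    "location": "Porto", "cost": 410.0},
]

def cuboid(facts, dims, measure="cost"):
    """Sum the measure over all dimensions that are not listed in dims."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)

by_time_item = cuboid(FACTS, ("time", "item"))   # a 2-D cuboid
by_item = cuboid(FACTS, ("item",))               # a 1-D cuboid
apex = cuboid(FACTS, ())                         # the 0-D (apex) cuboid
print(by_time_item, by_item, apex, sep="\n")
```

The apex cuboid has a single cell holding the grand total, the highest-level summary.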

Data Cube: A Lattice of Cuboids

0-D (apex) cuboid: none
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
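
The lattice for the four dimensions on this slide can be enumerated mechanically, since every subset of the dimensions defines one cuboid. A small sketch (the function name `lattice` is an illustrative choice):

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")

def lattice(dimensions):
    """Return {k: list of k-D cuboids}, each cuboid a tuple of dimension names."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

cube = lattice(DIMENSIONS)
for k, cuboids in cube.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

total = sum(len(c) for c in cube.values())
print("total cuboids:", total)   # 2**4 = 16 for four dimensions
```

With n dimensions there are 2^n cuboids, so materializing the whole data cube quickly becomes expensive as dimensions are added.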

Forms of useful knowledge

AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever.

But ... knowledge accessible to humans is in:
• symbols,
• similarity to prototypes,
• images, visual representations.

What type of explanation is satisfactory?
Interesting question for cognitive scientists.
Different answers in different fields.

Forms of knowledge

Types of explanation:
• exemplar-based: prototypes and similarity;
• logic-based: symbols and rules;
• visualization-based: exploratory data analysis, maps, diagrams, relations ...

• Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do.
• Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, neurofuzzy systems do.
• Logical rules are the highest form of summarization of knowledge.

Computational Intelligence

[Diagram. Center: Computational Intelligence, Data => Knowledge, Artificial Intelligence. Surrounding fields: expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, neural networks, soft computing.]

CI methods for data mining

• Provide non-parametric (“universal”), predictive models of data.
• Classify new data to pre-defined categories, supporting diagnosis & prognosis.
• Discover new categories, clusters, patterns.
• Discover interesting associations, correlations.
• Allow the data to be understood, creating fuzzy or crisp logical rules, or prototypes.
• Help to visualize multi-dimensional relationships among data samples.

Association rules

• Classification rules: X => C(X)
• Association rules: looking for correlation between components of X, i.e. probability $p(X_i \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$.
• “Market basket” problem: many items selected from an available pool to a basket; what are the correlations?
• Only frequent items are interesting: itemsets with high support, i.e. appearing together in many baskets.

Search for rules above a support threshold, e.g. > 1%.
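
Support counting for the market basket problem can be illustrated on a toy example. The baskets, item names and the 40% threshold below are made up for illustration (real thresholds are closer to the 1% quoted above), and the brute-force pair counting is only a sketch, not an efficient frequent-itemset algorithm.

```python
from itertools import combinations
from collections import Counter

# Hypothetical baskets; item names are invented for this example.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
]

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs whose support >= min_support."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(BASKETS, min_support=0.4)
print(pairs)

# Estimate of the conditional probability p(beer | diapers):
# support of the pair divided by support of "diapers" alone.
support_diapers = sum("diapers" in b for b in BASKETS) / len(BASKETS)
p_beer_given_diapers = pairs[("beer", "diapers")] / support_diapers
print(p_beer_given_diapers)
```

Only the pair (beer, diapers) passes the threshold here; all other pairs occur in a single basket.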

Association rules - related

• Related problems to market basket:
  correlation between documents – high for plagiarism;
  phrases in documents – high for semantically related documents.
• Causal relations matter, although may be difficult to determine: lower the price of diapers, keep high beer price, or try the reverse – what will happen?
• More general approach: Bayesian belief networks, causal networks, graphical models.

Clustering

• Given points in multidimensional space, divide them into groups that are “similar”.
• Ex: if an epidemic breaks out, look for location of cases on the map (cholera in London).
  Documents in the space of words cluster according to their topics.
• How to measure similarity?
• Hierarchical approaches: start from single cases, join them forming clusters; ex: dendrogram.
  Centroid approaches: assume a few centers and adapt their position; ex: k-means, LVQ, SOM.
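
The centroid idea ("assume a few centers and adapt their position") is easiest to see in k-means. The following is a bare-bones sketch using Euclidean distance as the similarity measure (an assumption; the slide leaves the measure open) and hypothetical 2D points with hand-picked starting centers.

```python
import math

def kmeans(points, centers, n_iter=20):
    """Plain k-means: assign each point to its nearest center, then move
    every center to the mean of the points assigned to it."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(x) / len(members)
                                         for x in zip(*members)))
            else:
                new_centers.append(center)   # keep the center of an empty cluster
        if new_centers == centers:           # positions stopped changing
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centers, clusters = kmeans(points, centers=[(0, 0), (5, 5)])
print(centers)
print(clusters)
```

The result depends on the initial centers and on the number of centers, both of which have to be chosen by the user.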

Neural networks

• Inspired by neurobiology: simple elements cooperate changing internal parameters.
• Large field, dozens of different models, over 500 papers on NN in medicine each year.
• Supervised networks: heteroassociative mapping X=>Y, symptoms => diseases, universal approximators.
• Unsupervised networks: clusterization, competitive learning, autoassociation.
• Reinforcement learning: modeling behavior, playing games, sequential data.

Unsupervised NN example

Clustering and visualization of the quality of life index (UN data) by SOM map.

Poor classification, inaccurate visualization. Poor classification, inaccurate visualization.

Real and artificial neurons

[Figure: a biological neuron (labels: dendrites, synapses, axon) next to an artificial network (labels: nodes – artificial neurons, synapses (weights), signals).]

Neural network for MI diagnosis

[Figure: network diagram. Inputs: Sex, Age, Smoking, ECG: ST Elevation, Pain Intensity, Pain Duration. Connections labeled input weights and output weights. Output: Myocardial Infarction ~ p(MI|X), shown as 0.7.]

MI network function

Training: setting the values of weights and thresholds, efficient algorithms exist.

Effect: non-linear regression function

$$F_{MI}(X) = \sigma\left(\sum_{i=1}^{5} W^{o}_{i}\, \sigma\left(\sum_{k=1}^{6} W^{i}_{ik} X_k\right)\right)$$

with $W^{i}$ the input weights, $W^{o}$ the output weights and $\sigma$ the sigmoidal transfer function of the nodes.

Such networks are universal approximators: they may learn any mapping X => Y
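
The network function on this slide, with 6 inputs and 5 hidden nodes, can be sketched as a plain forward pass. The weights and the input vector below are hypothetical, untrained values, thresholds are omitted for brevity, and the logistic sigmoid is an assumed choice of transfer function.

```python
import math

def sigmoid(a):
    """Logistic transfer function (assumed here; other sigmoids are possible)."""
    return 1.0 / (1.0 + math.exp(-a))

def mlp_output(x, w_in, w_out):
    """Two-layer network:
    sigmoid(sum_i w_out[i] * sigmoid(sum_k w_in[i][k] * x[k]))."""
    hidden = [sigmoid(sum(w * xk for w, xk in zip(row, x))) for row in w_in]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)))

# Hypothetical, untrained weights: 5 hidden nodes with 6 input weights each.
W_IN = [[0.1 * (i + 1) * (-1) ** k for k in range(6)] for i in range(5)]
W_OUT = [0.5, -0.3, 0.8, -0.1, 0.2]

# Hypothetical, already rescaled inputs in the order:
# sex, age, smoking, ST elevation, pain intensity, pain duration.
x = [1.0, 0.65, 1.0, 0.0, 0.5, 0.3]

p_mi = mlp_output(x, W_IN, W_OUT)
print(p_mi)   # a value in (0, 1), read as an estimate of p(MI|X)
```

Training would adjust `W_IN` and `W_OUT` so that the output matches known diagnoses; that step is not shown.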

Knowledge from networks

Simplify networks: force most weights to 0, quantize remaining parameters, be constructive!

• Regularization: mathematical technique improving predictive abilities of the network.
• Result: MLP2LN neural networks that are equivalent to logical rules.

MLP2LN

Converts MLP neural networks into a network performing logical operations (LN).

[Diagram of the network, with labels: Input layer; Aggregation: better features; Linguistic units: windows, filters; Rule units: threshold logic; Output: one node per class.]

Learning dynamics

Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.

Neurofuzzy systems

Feature Space Mapping (FSM) neurofuzzy system.
Neural adaptation, estimation of probability density distribution (PDF) using single hidden layer network (RBF-like) with nodes realizing separable functions:

$$G(X; P) = \prod_{i=1}^{N} G_i(X_i; P_i)$$

Fuzzy: $\mu(x) \in \{0, 1\}$ (no/yes) replaced by a degree $\mu(x) \in [0, 1]$. Triangular, trapezoidal, Gaussian ... MF.

[Figure: membership functions (MF) in many dimensions.]
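
A separable node function is simply a product of one-dimensional membership functions, one per feature. The sketch below uses Gaussian membership functions; the parameterization by `center` and `width` is an assumption for illustration and is not taken from the FSM implementation.

```python
import math

def gaussian_mf(x, center, width):
    """One-dimensional Gaussian membership function with values in (0, 1]."""
    return math.exp(-((x - center) / width) ** 2)

def separable_node(x, centers, widths):
    """Product of one-dimensional membership functions, one per feature."""
    value = 1.0
    for xi, c, w in zip(x, centers, widths):
        value *= gaussian_mf(xi, c, w)
    return value

# Hypothetical 2D node centered at (1, 2) with unit widths.
at_center = separable_node([1.0, 2.0], centers=[1.0, 2.0], widths=[1.0, 1.0])
off_center = separable_node([2.0, 2.0], centers=[1.0, 2.0], widths=[1.0, 1.0])
print(at_center, off_center)
```

Because the function factorizes, each factor can be read as a fuzzy condition on a single feature, which is what makes the node interpretable as a fuzzy rule.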

GhostMiner Philosophy

GhostMiner, data mining tools from our lab.
http://www.fqspl.com.pl/ghostminer/

• There is no free lunch – provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery/model building and evaluating, organizing it into projects.
• Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer.

Heterogeneous systems

Homogeneous systems: one type of “building blocks”, same type of decision borders.
Ex: neural networks, SVMs, decision trees, kNNs ….

Committees combine many models together, but lead to complex models that are difficult to understand.

• Discovering simplest class structures, its inductive bias, requires heterogeneous adaptive systems (HAS).
• Ockham’s razor: simpler systems are better.
• HAS examples:
  • NN with many types of neuron transfer functions.
  • k-NN with different distance functions.
  • DT with different types of test criteria.

Wine data example

Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars.
Task: recognize the source of wine sample.
13 quantities measured, continuous features:

• alcohol content
• ash content
• magnesium content
• flavanoids content
• proanthocyanins phenols content
• OD280/OD315 of diluted wines
• malic acid content
• alkalinity of ash
• total phenols content
• nonanthocyanins phenols content
• color intensity
• hue
• proline.

Exploration and visualization

General info about the data

Exploration: data

Inspect the data

Exploration: data statistics

Distribution of feature values

Proline has very large values, the data should be standardized before further processing.

Exploration: data standardized

Standardized data: unit standard deviation, about 2/3 of all data should fall within [mean-std, mean+std]

Other options: normalize to fit in [-1,+1], or normalize rejecting some extreme values.
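
Both options are one-liners in practice. A sketch with hypothetical proline-like values (the numbers are invented, not taken from the Wine data), using the population standard deviation as an assumed convention:

```python
import statistics

def standardize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def rescale(values, low=-1.0, high=1.0):
    """Linearly map values onto the interval [low, high]."""
    vmin, vmax = min(values), max(values)
    return [low + (high - low) * (v - vmin) / (vmax - vmin) for v in values]

# Hypothetical feature with values in the hundreds, like proline.
proline = [520.0, 680.0, 735.0, 1065.0, 1480.0]

z = standardize(proline)
r = rescale(proline)
print([round(v, 2) for v in z])
print([round(v, 2) for v in r])
```

Rescaling to [-1, +1] is sensitive to a single extreme value, which is why rejecting some extremes first is listed as a further option.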

Exploration: 1D histograms

Distribution of feature values in classes

Some features are more useful than the others.

Exploration: 1D/3D histograms

Distribution of feature values in classes, 3D

Exploration: 2D projections

Projections (cuboids) on selected 2D

Projections on selected 2D

Visualize data

Relations in more than 3D are hard to imagine.

SOM mappings: popular for visualization, but rather inaccurate, no measure of distortions.

Measure of topographical distortions: map all $X_i$ points from $R^n$ to $x_i$ points in $R^m$, $m < n$, and ask:
How well are $R_{ij} = D(X_i, X_j)$ distances reproduced by distances $r_{ij} = d(x_i, x_j)$?

Use m = 2 for visualization, use higher m for dimensionality reduction.

Visualize data: MDS

Multidimensional scaling: invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994) … Minimize a measure of topographical distortions by moving the x coordinates.

$$S_1(x) = \frac{1}{\sum_{i>j} R_{ij}^2} \sum_{i>j} \left(R_{ij} - r_{ij}(x)\right)^2 \quad \text{(MDS)}$$

$$S_2(x) = \frac{1}{\sum_{i>j} R_{ij}} \sum_{i>j} \frac{\left(R_{ij} - r_{ij}(x)\right)^2}{R_{ij}} \quad \text{(Sammon)}$$

$$S_3(x) = \frac{1}{\sum_{i>j} R_{ij}} \sum_{i>j} \left(1 - \frac{r_{ij}(x)}{R_{ij}}\right)^2 \quad \text{(MDS, more local)}$$
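
A distortion measure of this kind is straightforward to evaluate for a given map. The sketch below computes the simplest variant, the sum of squared differences between original and mapped distances normalized by the sum of squared original distances. Euclidean distance is assumed for both spaces, the points are hypothetical, and only the evaluation is shown, not the minimization over the mapped coordinates.

```python
import math
from itertools import combinations

def stress(X, x):
    """Sum of squared differences between original distances R_ij and mapped
    distances r_ij, normalized by the sum of squared original distances."""
    num = den = 0.0
    for i, j in combinations(range(len(X)), 2):
        R = math.dist(X[i], X[j])   # distance in the original space
        r = math.dist(x[i], x[j])   # distance in the low-dimensional map
        num += (R - r) ** 2
        den += R ** 2
    return num / den

# Hypothetical 3D points and a naive 2D map that drops the third coordinate.
X = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
flat = [(a, b) for a, b, _ in X]

s = stress(X, flat)
print(s)
```

A value of 0 means all distances are reproduced exactly; an MDS procedure would move the mapped points to push this value down.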

Visualize data: Wine

The green outlier can be identified easily.

3 clusters are clearly distinguished, 2D is fine.

Page 45: What is data mining? Włodzisław Duch Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland duch ISEP Porto,

Decision treesDecision trees

Test single attribute, find good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.

4 attributes used, 10 errors, 168 correct, 94.4% correct.

Simplest things first: use decision tree to find logical rules.
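The single-attribute test can be sketched as a search over candidate thresholds. This is only an illustration: it scores a split by the number of misclassified cases when each side predicts its majority class, which is an assumption made here for simplicity; real tree algorithms use other criteria, and `best_split` is a hypothetical name.

```python
def best_split(values, labels):
    """Return (threshold, errors) for the test `value < threshold` that
    misclassifies the fewest cases when each side predicts its majority class."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_errors = None, len(pairs) + 1
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate two equal values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        errors = sum(
            len(side) - max(side.count(c) for c in set(side))
            for side in (left, right)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors
```

Applying the same search to every attribute and keeping the best one gives the root test of a univariate tree; repeating it on each side grows the tree.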

Decision borders

Univariate trees: test the value of a single attribute x < a.

Result: feature space is divided into cuboids.

Multivariate trees: test on combinations of attributes, hyperplanes.

Wine data: univariate decision tree borders for proline and flavanoids.

Logical rules

Crisp logic rules: for continuous x use linguistic variables (predicate functions).

$s_k(x) \equiv \text{True}\,[X_k \le x \le X'_k]$, for example:

small(x) = True{x | x < 1}

medium(x) = True{x | x ∈ [1,2]}

large(x) = True{x | x > 2}

Linguistic variables are used in crisp (propositional, Boolean) logic rules:

IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
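As a small illustration, the three example predicates can be written as Boolean functions and combined into a crisp rule. The function `is_brownie` and its height threshold are hypothetical, introduced only to mirror the Brownie rule.

```python
def small(x):
    """True for x < 1."""
    return x < 1


def medium(x):
    """True for x in the closed interval [1, 2]."""
    return 1 <= x <= 2


def large(x):
    """True for x > 2."""
    return x > 2


def is_brownie(height, has_hat, has_beard):
    """Crisp rule: IF small-height AND has-hat AND has-beard THEN Brownie.

    Reusing `small` for height is an assumption made for this sketch.
    """
    return small(height) and has_hat and has_beard
```

Every value of x makes exactly one of the three predicates true, so they partition the axis into three intervals.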

Crisp logic decisions

Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1.

Step functions are used for partitioning of the feature space.

Very simple hyper-rectangular decision borders.

Severe limitation on the expressive power of crisp logical rules!

Logical rules - advantages

Logical rules, if simple enough, are preferable.

• Rules may expose limitations of black box solutions.

• Only relevant features are used in rules.

• Rules may sometimes be more accurate than NN and other CI methods.

• Overfitting is easy to control; rules usually have a small number of parameters.

• Rules forever !?

A logical rule about logical rules is:

IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.

Logical rules - limitations

Logical rules are preferred but ...

• Only one class is predicted, p(Ci|X,M) = 0 or 1; such a black-and-white picture may be inappropriate in many applications.

• Discontinuous cost functions allow only non-gradient optimization.

• Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.

• Reliable crisp rules may reject some cases as unclassified.

• Interpretation of crisp rules may be misleading.

• Fuzzy rules are not so comprehensible.

Rules - choices

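Confusion matrix with a rejection column r (rows: true class; columns: predicted class):

true \ predicted | + | − | r | total

+ | $p_{++}$ | $p_{+-}$ | $p_{+r}$ | $p_{+}$

− | $p_{-+}$ | $p_{--}$ | $p_{-r}$ | $p_{-}$

Accuracy (overall): $A(M) = p_{++} + p_{--}$

Error rate: $L(M) = p_{+-} + p_{-+}$

Rejection rate: $R(M) = p_{+r} + p_{-r} = 1 - L(M) - A(M)$

Sensitivity: $S_{+}(M) = p_{+|+} = p_{++}/p_{+}$

Specificity: $S_{-}(M) = p_{-|-} = p_{--}/p_{-}$

$p_{++}$ is a hit; $p_{-+}$ is a false alarm; $p_{+-}$ is a miss.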

Simplicity vs. accuracy. Confidence vs. rejection rate.
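The quantities on this slide can be computed directly from a table of joint probabilities. The sketch below assumes the first index is the true class and the second is the predicted class (or `r` for rejected cases); the example numbers are made up for illustration.

```python
def rule_metrics(p):
    """Compute A, L, R, S+ and S- from joint probabilities p[(true, predicted)].

    `predicted` is '+', '-' or 'r' (rejected, left unclassified).
    """
    p_plus = sum(p[("+", k)] for k in "+-r")    # a priori probability of class +
    p_minus = sum(p[("-", k)] for k in "+-r")   # a priori probability of class -
    return {
        "accuracy": p[("+", "+")] + p[("-", "-")],
        "error": p[("+", "-")] + p[("-", "+")],
        "rejection": p[("+", "r")] + p[("-", "r")],
        "sensitivity": p[("+", "+")] / p_plus,
        "specificity": p[("-", "-")] / p_minus,
    }


# Hypothetical joint probabilities; they sum to 1.
example = {
    ("+", "+"): 0.40, ("+", "-"): 0.05, ("+", "r"): 0.05,
    ("-", "+"): 0.10, ("-", "-"): 0.35, ("-", "r"): 0.05,
}
metrics = rule_metrics(example)
```

Accuracy, error rate and rejection rate always add up to 1, which is a quick sanity check on any such table.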


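Rules – error functions

The overall accuracy is equal to a combination of sensitivity and specificity weighted by the a priori probabilities:

$A(M) = p_{+} S_{+}(M) + p_{-} S_{-}(M)$

Optimization of rules for the C+ class; large $\gamma$ means no errors but a high rejection rate:

$E(M;\gamma) = \gamma L(M) - A(M) = \gamma (p_{+-} + p_{-+}) - (p_{++} + p_{--})$

$\min_M E(M;\gamma) \Leftrightarrow \min_M \{(1+\gamma) L(M) + R(M)\}$

Optimization with different costs of errors:

$\min_M E(M;\alpha) = \min_M \{p_{+-} + \alpha p_{-+}\} = \min_M \{p_{+}(1 - S_{+}(M)) - p_{+r}(M) + \alpha [p_{-}(1 - S_{-}(M)) - p_{-r}(M)]\}$

ROC (Receiver Operating Characteristic) curve: $p_{++}(p_{-+})$, the hit rate as a function of the false alarm rate.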
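To illustrate the trade-off, the sketch below evaluates the cost "weight times error rate minus accuracy" for two hypothetical rule sets and picks the cheaper one. The model names and numbers are invented for illustration; only the form of the cost comes from the slide.

```python
def rule_cost(accuracy, error, gamma):
    """Cost = gamma * error rate - accuracy; lower is better."""
    return gamma * error - accuracy


def pick_model(models, gamma):
    """Return the name of the model with the lowest cost for this gamma."""
    return min(models, key=lambda name: rule_cost(*models[name], gamma))


# Hypothetical (accuracy, error) pairs; 1 - accuracy - error is the rejection rate.
models = {
    "covers everything": (0.90, 0.10),  # no rejections, more errors
    "cautious": (0.80, 0.02),           # rejects 18% of cases, few errors
}
```

With `gamma = 0` only accuracy matters and the first rule set wins; with `gamma = 5` errors dominate the cost and the cautious rule set, with its higher rejection rate, is preferred.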

Wine example – SSV rules

Decision trees provide rules of different complexity.

Simplest tree: 5 nodes, corresponding to 3 rules;

25 errors, mostly Class2/3 wines mixed.

Wine – SSV 5 rules

Lower pruning leads to more complex tree.

7 nodes, corresponding to 5 rules;

10 errors, mostly Class2/3 wines mixed.

Wine – SSV optimal rules

Various solutions may be found, depending on the search: 5 rules with 12 premises, making 6 errors, 6 rules with 16 premises and 3 errors, 8 rules, 25 premises, and 1 error.

if OD280/OD315 > 2.505 ∧ proline > 726.5 ∧ color > 3.435 then class 1

if OD280/OD315 > 2.505 ∧ proline > 726.5 ∧ color < 3.435 then class 2

if OD280/OD315 < 2.505 ∧ hue > 0.875 ∧ malic-acid < 2.82 then class 2

if OD280/OD315 > 2.505 ∧ proline < 726.5 then class 2

if OD280/OD315 < 2.505 ∧ hue < 0.875 then class 3

if OD280/OD315 < 2.505 ∧ hue > 0.875 ∧ malic-acid > 2.82 then class 3

What is the optimal complexity of rules? Use crossvalidation to estimate generalization.
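The six rules listed on this slide (16 premises in total) translate directly into code. This is a sketch: the function and argument names are hypothetical, and returning `None` when a value falls exactly on a threshold is a choice made here, since the rules use strict inequalities and leave those cases uncovered.

```python
def classify_wine(od280_od315, proline, color, hue, malic_acid):
    """Apply the six crisp rules in order; return the class or None if no rule fires."""
    rules = [
        (od280_od315 > 2.505 and proline > 726.5 and color > 3.435, 1),
        (od280_od315 > 2.505 and proline > 726.5 and color < 3.435, 2),
        (od280_od315 < 2.505 and hue > 0.875 and malic_acid < 2.82, 2),
        (od280_od315 > 2.505 and proline < 726.5, 2),
        (od280_od315 < 2.505 and hue < 0.875, 3),
        (od280_od315 < 2.505 and hue > 0.875 and malic_acid > 2.82, 3),
    ]
    for fires, wine_class in rules:
        if fires:
            return wine_class
    return None  # value exactly on a threshold: not covered by any rule
```

Because the premises are mutually exclusive, at most one rule fires for any input, so the order of the list does not affect the result.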

Wine – FSM rules

Complexity of rules depends on desired accuracy.

Use rectangular functions for crisp rules. Optimal accuracy may be evaluated using crossvalidation.

FSM discovers simpler rules, for example:

if proline > 929.5 then class 1 (48 cases, 45 correct, 2 recovered by other rules).

if color < 3.79285 then class 2 (63 cases, 60 correct)

SSV: hierarchical rules.

FSM: density estimation with feature selection.

Examples of interesting knowledge discovered!

The most famous example of knowledge discovered by data mining: correlation between beer, milk and diapers.

Other examples: 2 subtypes of galactic spectra forced astrophysicists to reconsider star evolutionary processes.

Several examples of knowledge found by us in medical and other datasets follow.


Mushrooms

The Mushroom Guide: no simple rule for mushrooms; no rule like ‘leaflets three, let it be’ for poison oak and ivy.

8124 cases, 51.8% are edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or $2^{118} \approx 3 \cdot 10^{35}$ possible input vectors.

Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy.

Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow.

Safe rule for edible mushrooms: odor = (almond ∨ anise ∨ none) ∧ spore-print-color = ¬green

48 errors, 99.41% correct

This is why animals have such a good sense of smell! What does it tell us about odor receptors?


Mushrooms rules

To eat or not to eat, this is the question! Not any more ...

A mushroom is poisonous if:

R1) odor = ¬(almond ∨ anise ∨ none); 120 errors, 98.52%

R2) spore-print-color = green; 48 errors, 99.41%

R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring = ¬brown; 8 errors, 99.90%

R4) habitat = leaves ∧ cap-color = white; no errors!

R1 + R2 are quite stable, found even with 10% of data; R3 and R4 may be replaced by other rules, e.g.:

R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)

R'4) gill-size = narrow ∧ population = clustered

Only 5 of 22 attributes used! Simplest possible rules? 100% in CV tests - structure of this data is completely clear.
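Rules R1 to R4 can be read as a disjunction: a mushroom is flagged as poisonous as soon as any one of them fires. The sketch below assumes each mushroom is a dictionary with attribute values spelled out as full words; the function name and that data layout are assumptions made for illustration.

```python
def is_poisonous(mushroom):
    """Return True if any of the rules R1-R4 fires for this mushroom."""
    r1 = mushroom["odor"] not in {"almond", "anise", "none"}
    r2 = mushroom["spore-print-color"] == "green"
    r3 = (
        mushroom["odor"] == "none"
        and mushroom["stalk-surface-below-ring"] == "scaly"
        and mushroom["stalk-color-above-ring"] != "brown"
    )
    r4 = mushroom["habitat"] == "leaves" and mushroom["cap-color"] == "white"
    return r1 or r2 or r3 or r4


# Hypothetical example record.
sample = {
    "odor": "none",
    "spore-print-color": "brown",
    "stalk-surface-below-ring": "smooth",
    "stalk-color-above-ring": "white",
    "habitat": "woods",
    "cap-color": "brown",
}
```

Applying the rules cumulatively reproduces the progression on the slide: R1 alone already catches most poisonous cases, and each further rule removes some of the remaining errors.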

Recurrence of breast cancer

Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia.

286 cases, 201 no recurrence (70.3%), 85 recurrence cases (29.7%)

no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes

9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quad, radiation.

Rules for breast cancer

Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia.

Many systems used, 65-78% accuracy reported.

Single rule: IF (nodes-involved ∉ [0,2] ∧ degree-malignant = 3) THEN recurrence, ELSE no-recurrence

76.2% accuracy, only trivial knowledge in the data: Highly malignant breast cancer involving many nodes is likely to strike back.
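The single rule is easy to apply to records in the format of the sample line quoted on the earlier slide. In this sketch the field order (class, age, menopause, tumor-size, nodes involved, node-caps, degree-malignant, breast, breast quad, radiation) and the function name are assumptions; the rule is read as "more than two nodes involved and degree of malignancy 3".

```python
def predict_recurrence(nodes_involved, degree_malignant):
    """Recurrence if nodes-involved is outside the 0-2 bin and degree-malignant is 3."""
    if nodes_involved != "0-2" and degree_malignant == 3:
        return "recurrence"
    return "no-recurrence"


# Sample record from the data description; field positions are assumed.
record = "no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes"
fields = [field.strip() for field in record.split(",")]
prediction = predict_recurrence(fields[4], int(fields[6]))
```

For this record the rule predicts no recurrence, in agreement with the class label stored in the first field.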

Recurrence - comparison.

Method | 10xCV accuracy (%)

MLP2LN 1 rule | 76.2

SSV DT stable rules | 75.7 ± 1.0

k-NN, k=10, Canberra | 74.1 ± 1.2

MLP+backprop. | 73.5 ± 9.4 (Zarndt)

CART DT | 71.4 ± 5.0 (Zarndt)

FSM, Gaussian nodes | 71.7 ± 6.8

Naive Bayes | 69.3 ± 10.0 (Zarndt)

Other decision trees | < 70.0

Breast cancer diagnosis.

Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.

699 cases, 9 features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses

Tasks: distinguish benign from malignant cases.

Breast cancer rules.

Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.

Simplest rule from MLP2LN, large regularization:

If uniformity of cell size 3 Then benign Else malignant

Sensitivity = 0.97, Specificity = 0.85

More complex NN solutions, from 10CV estimate:

Sensitivity = 0.98, Specificity = 0.94

Breast cancer comparison.

Method | 10xCV accuracy (%)

k-NN, k=3, Manh | 97.0 ± 2.1 (GM)

FSM, neurofuzzy | 96.9 ± 1.4 (GM)

Fisher LDA | 96.8

MLP+backprop. | 96.7 (Ster, Dobnikar)

LVQ | 96.6 (Ster, Dobnikar)

IncNet (neural) | 96.4 ± 2.1 (GM)

Naive Bayes | 96.4

SSV DT, 3 crisp rules | 96.0 ± 2.9 (GM)

LDA (linear discriminant) | 96.0

Various decision trees | 93.5-95.6


Melanoma skin cancer

Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.

Four types of melanoma: benign, blue, suspicious, or malignant.

250 cases, with almost equal class distribution.

Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5).

TDS (Total Dermatoscopy Score) - single index.

Goal: hardware scanner for preliminary diagnosis.


Melanoma rules

R1: IF TDS ≤ 4.85 AND C-BLUE IS absent THEN MELANOMA IS Benign-nevus

R2: IF TDS ≤ 4.85 AND C-BLUE IS present THEN MELANOMA IS Blue-nevus

R3: IF TDS > 5.45 THEN MELANOMA IS Malignant

R4: IF TDS > 4.85 AND TDS < 5.45 THEN MELANOMA IS Suspicious

5 errors (98.0%) on the training set; 0 errors (100%) on the test set.

Feature aggregation is important! Without TDS 15 rules are needed.
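The four melanoma rules R1 to R4 can be sketched as a small function of the TDS index and the presence of blue color. The function and argument names are hypothetical. Note that a TDS of exactly 5.45 satisfies none of the four rules as written, so the sketch returns `None` for it rather than guessing a class.

```python
def classify_melanoma(tds, c_blue_present):
    """Apply rules R1-R4; return the diagnosis or None if no rule covers the case."""
    if tds <= 4.85:
        # R1 and R2 differ only in the C-BLUE attribute.
        return "Blue-nevus" if c_blue_present else "Benign-nevus"
    if tds > 5.45:
        return "Malignant"      # R3
    if 4.85 < tds < 5.45:
        return "Suspicious"     # R4
    return None                 # tds == 5.45 is not covered by any rule
```

Only two inputs are needed because TDS already aggregates the individual asymmetry, border, color and diversity attributes into a single score.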


Melanoma results

Method | Rules | Training % | Test %

MLP2LN, crisp rules | 4 | 98.0 all | 100

SSV Tree, crisp rules | 4 | 97.5±0.3 | 100

FSM, rectangular f. | 7 | 95.5±1.0 | 100

knn + prototype selection | 13 | 97.5±0.0 | 100

FSM, Gaussian f. | 15 | 93.7±1.0 | 95±3.6

knn k=1, Manh, 2 features | -- | 97.4±0.3 | 100

LERS, rough rules | 21 | -- | 96.2


Summary

Data mining is a large field; only a few issues have been mentioned here.

DM involves many steps; here only those related to pattern recognition were stressed, but in practice scalability and efficiency issues may be most important.

Neural networks are still used mostly for building predictive data models, but they may also provide a simplified description in the form of rules.

Rules are not the only form of data understanding. Rules may be a beginning for a practical application.

Some interesting knowledge has been discovered.

Challenges

• Discovery of theories rather than data models

• Integration with image/signal analysis

• Integration with reasoning in complex domains

• Combining expert systems with neural networks

Fully automatic universal data analysis systems: press the button and wait for the truth …

We are slowly getting there.

More & more computational intelligence tools (including our own) are available.


Disclaimer

A few slides/figures were taken from various presentations found on the Internet; unfortunately I cannot identify the original authors at the moment, since these slides went through different iterations.

I have to apologize for that.