
1

An Introduction to Data Mining

Hosein Rostani Alireza Zohdi

Report 1 for “Advanced Database” course

Supervisor: Dr. Masoud Rahgozar

December 2007

2

Outline
- Why data mining?
- Data mining applications
- Data mining functionalities
  - Concept description
  - Association analysis
  - Outlier analysis
  - Evolution analysis
  - Classification
  - Clustering

3

Why data mining?
- Motivation:
  - Wide availability of huge amounts of data
  - Need for turning data into useful information & knowledge
- Data mining:
  - Extracting or “mining” knowledge from large amounts of data
    - Knowledge: useful patterns
  - Semiautomatic process
    - Focus on automatic aspects

4

Data mining applications
- Prediction. Examples:
  - Credit risk
  - Customers switching to competitors
  - Fraudulent phone calling card usage
- Associations. Examples:
  - Related books to buy
  - Related accessories to suggest, e.g. for a camera
  - Causation discovery, e.g. in medicine
- Clusters. Example:
  - Clusters of disease

5

Data mining functionalities
- Concept description
  - Characterization & discrimination
- Association analysis
- Outlier analysis
- Evolution analysis
- Classification and prediction
- Clustering

6

Concept description
- Description of concepts
  - Summarized, concise & precise
- Ways:
  - Data characterization
    - Summarizing the data of the target class in general terms
  - Data discrimination
    - Comparison of the target class with the contrasting class(es)
- Examples of output forms:
  - Pie charts, bar charts, curves & multidimensional tables
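As an illustration of characterization and discrimination, here is a minimal Python sketch. The customer records, the attribute names, and the `summarize` helper are hypothetical and do not come from the slides; the sketch only shows one plausible way to summarize a target class and compare it with a contrasting class.

```python
from statistics import mean

# Hypothetical customer records; attributes and values are invented for illustration.
customers = [
    {"segment": "frequent", "age": 34, "spend": 220.0},
    {"segment": "frequent", "age": 41, "spend": 310.0},
    {"segment": "frequent", "age": 29, "spend": 270.0},
    {"segment": "occasional", "age": 52, "spend": 60.0},
    {"segment": "occasional", "age": 47, "spend": 80.0},
]


def summarize(records):
    """Characterization: describe a class in general terms (count and averages)."""
    return {
        "count": len(records),
        "avg_age": mean(r["age"] for r in records),
        "avg_spend": mean(r["spend"] for r in records),
    }


# Target class and contrasting class (the choice of classes is an assumption).
target = [r for r in customers if r["segment"] == "frequent"]
contrast = [r for r in customers if r["segment"] == "occasional"]

target_summary = summarize(target)
contrast_summary = summarize(contrast)

# Discrimination: compare the target class with the contrasting class.
differences = {
    key: target_summary[key] - contrast_summary[key]
    for key in ("avg_age", "avg_spend")
}

print(target_summary)
print(contrast_summary)
print(differences)
```

The two summaries could feed a bar chart or a small table, which are among the output forms listed on the slide.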

7

Association analysis
- Mining frequent patterns
  - For discovery of interesting associations within data
- Kinds of frequent patterns:
  - Frequent itemset
    - Set of items that frequently appear together, e.g. milk and bread
  - Frequent subsequence
    - E.g. pattern of customers’ purchases: first a PC, then a digital camera & then a memory card
  - Frequent substructure
    - Structural forms such as graphs, trees, or lattices
- Support and confidence
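The slide names support and confidence without defining them. Under the usual definitions, the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A -> B is support(A and B) divided by support(A). The Python sketch below computes both; the transactions and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and B) / support(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


s = support({"milk", "bread"}, transactions)
c = confidence({"milk"}, {"bread"}, transactions)
print(f"support(milk, bread) = {s:.2f}")
print(f"confidence(milk -> bread) = {c:.2f}")
```

In this toy data, milk and bread occur together in 3 of 5 transactions, and 3 of the 4 transactions containing milk also contain bread.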

8

Outlier analysis
- Outliers:
  - Data objects that do not comply with the general behavior of the data
- Approaches to outliers
  - Discard as noise or exceptions
  - Keep for applications such as fraud detection
    - Example: detecting fraudulent usage of credit cards
- Ways:
  - Using statistical tests
  - Using distance measures
  - Using deviation-based methods
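A simple instance of the statistical approach is a z-score test: flag any value that lies more than a chosen number of standard deviations from the mean. The sketch below is one possible version in Python; the threshold of 2.0, the function name, and the transaction amounts are assumptions made for illustration, not values from the slides.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds `threshold` in absolute value.

    The threshold is an assumed convention; suitable values depend on the data.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical credit card transaction amounts.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 400]
flagged = zscore_outliers(amounts)
print(flagged)
```

Whether a flagged value is discarded as noise or kept as a fraud candidate is the application-level decision described on the slide. Note that in small samples a single extreme value inflates the standard deviation, so the z-score test can miss outliers that a distance-based method would catch.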

9

Evolution analysis
- Description and modeling of trends
  - For objects with changing behavior over time
- Ways:
  - Applying other data mining tasks to time-related data
    - Association analysis, classification, prediction, clustering & …
  - Distinct ways
    - Time-series data analysis
    - Sequence or periodicity pattern matching
    - Similarity-based data analysis
- Example: stock market: predicting future trends in prices
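As a small illustration of time-series analysis, the Python sketch below smooths a price series with a simple moving average and reads off the overall direction. The slides do not name a specific technique, so the moving average, the window size, and the prices are all assumptions; real trend analysis would use considerably richer models.

```python
def moving_average(series, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical daily closing prices.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 15.0]
smoothed = moving_average(prices, window=3)

# Crude trend indicator: compare the last smoothed value with the first.
trend = "up" if smoothed[-1] > smoothed[0] else "down or flat"
print(smoothed)
print(trend)
```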

10

Classification and prediction
- Classification:
  - Process of finding a model that distinguishes data classes
  - Purpose: using the model to predict the class of new objects
- Deriving the model:
  - Based on the analysis of a set of training data
    - Data objects with known class labels
- Example:
  - In a credit card company
    - Classification of customers based on their payment history
    - Prediction of a new customer’s creditworthiness

11

Classification
- A two-step process for classification:
  - First: learning or training step
    - Building the classifier by analyzing or learning from training data
  - Second: classifying step
    - Using the classifier for classification
- Accuracy of a classifier (on a given test set)
  - Percentage of test set tuples correctly classified by the classifier
- Classification methods:
  - Decision tree, Naïve Bayesian classification, neural network, k-nearest neighbor classification, …
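The two-step process and the accuracy measure can be sketched with k-nearest neighbor, one of the methods listed above. The Python code below is a minimal illustration: the customer tuples, the feature meanings, and the choice of k = 3 are hypothetical, and the "training" step for k-NN simply stores the labeled tuples.

```python
from collections import Counter
from math import dist

# Hypothetical labeled tuples: ((income in thousands, late payments), class label).
training_set = [
    ((20.0, 5.0), "risky"),
    ((25.0, 4.0), "risky"),
    ((30.0, 6.0), "risky"),
    ((60.0, 0.0), "safe"),
    ((70.0, 1.0), "safe"),
    ((80.0, 0.0), "safe"),
]
test_set = [
    ((22.0, 5.0), "risky"),
    ((75.0, 1.0), "safe"),
    ((65.0, 0.0), "safe"),
    ((28.0, 3.0), "risky"),
]


def train(training_set):
    """Step 1 (learning): for k-NN the model is just the stored training tuples."""
    return list(training_set)


def classify(model, features, k=3):
    """Step 2 (classifying): majority vote among the k nearest training tuples."""
    nearest = sorted(model, key=lambda item: dist(item[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(model, test_set, k=3):
    """Percentage of test tuples whose predicted class matches the known class."""
    correct = sum(
        1 for features, label in test_set
        if classify(model, features, k) == label
    )
    return 100.0 * correct / len(test_set)


model = train(training_set)
acc = accuracy(model, test_set)
print(f"accuracy: {acc:.1f}%")
```

The test set is kept separate from the training set so that the accuracy estimate is not inflated by tuples the classifier has already seen.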

12

Decision tree
- Decision tree induction:
  - Learning of decision trees from class-labeled training tuples
- Decision tree: a flowchart-like tree structure
  - Internal nodes: tests on attributes
  - Branches: outcomes of the test
  - Leaves: class labels
- Usage in classification:
  - Prediction by tracing a path from the root to a leaf node
    - Testing attribute values of a new tuple against the decision tree
  - Easily converting a decision tree to classification rules
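Tracing a path from the root to a leaf, and converting a tree to rules, can be sketched in a few lines of Python. The tree below is hand-built and hypothetical (it is not induced from data and is not taken from the slides); the nested-dictionary representation and the function names are likewise assumptions made for illustration.

```python
# Hypothetical hand-built tree.
# Internal node: {"attribute": name, "branches": {outcome: subtree}}.
# Leaf: a plain string holding the class label.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        "middle_aged": "yes",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "yes", "excellent": "no"},
        },
    },
}


def predict(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node


def to_rules(node, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN class = {node}"]
    rules = []
    for outcome, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + ((node["attribute"], outcome),)))
    return rules


prediction = predict(tree, {"age": "youth", "student": "yes"})
rules = to_rules(tree)
print(prediction)
for rule in rules:
    print(rule)
```

Each rule corresponds to exactly one leaf, which is why the conversion from tree to rules is straightforward.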

13

Decision tree example: Does a customer buy a computer?

14

Bayesian classification
- Bayesian classification
  - Predicting the probability that a new tuple belongs to a particular class
  - High accuracy and speed in large databases
- Based on Bayes’ theorem
  - Conditional probability
- Naïve Bayesian classifier
  - Assumption: class conditional independence
  - Good for simplifying computations
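The class conditional independence assumption means the likelihood of a tuple given a class is taken to be the product of per-attribute likelihoods, so the posterior is proportional to P(C) times the product of P(x_i | C). The Python sketch below applies this to categorical attributes. The training tuples and names are hypothetical, and the add-one (Laplace) smoothing is an addition not mentioned on the slides, included so that an unseen attribute value does not force a probability of zero.

```python
from collections import Counter, defaultdict

# Hypothetical training tuples: ({attribute: value}, class label).
training = [
    ({"student": "yes", "credit": "fair"}, "buys"),
    ({"student": "yes", "credit": "excellent"}, "buys"),
    ({"student": "no", "credit": "fair"}, "buys"),
    ({"student": "no", "credit": "excellent"}, "does_not_buy"),
    ({"student": "no", "credit": "fair"}, "does_not_buy"),
]


def fit(training):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in training)
    value_counts = defaultdict(Counter)  # (label, attribute) -> Counter of values
    domains = defaultdict(set)           # attribute -> set of observed values
    for features, label in training:
        for attribute, value in features.items():
            value_counts[(label, attribute)][value] += 1
            domains[attribute].add(value)
    return class_counts, value_counts, domains


def posterior(model, features):
    """Return P(class | features) for every class, under the naive assumption."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(C)
        for attribute, value in features.items():
            # Add-one (Laplace) smoothing: an assumption added for this sketch.
            numerator = value_counts[(label, attribute)][value] + 1
            denominator = count + len(domains[attribute])
            score *= numerator / denominator  # P(x_i | C)
        scores[label] = score
    evidence = sum(scores.values())  # normalizing constant P(X)
    return {label: score / evidence for label, score in scores.items()}


model = fit(training)
probs = posterior(model, {"student": "yes", "credit": "fair"})
prediction = max(probs, key=probs.get)
print(probs)
print(prediction)
```

Because each attribute is handled independently, the model only needs one count table per attribute and class, which is the computational simplification the slide refers to.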

15

Clustering
- The process of grouping a set of physical or abstract objects into classes of similar objects
- Generating class labels for objects currently without labels
- Clustering based on this principle:
  - Maximizing the intraclass similarity and
  - Minimizing the interclass similarity
- Clustering also for facilitating taxonomy formation
  - Hierarchical organization of observations
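One widely used algorithm that follows this principle is k-means, which the slides do not name; it is used here only as an example. The Python sketch below treats Euclidean distance as the (dis)similarity measure, and the points, the initial centroids, and the iteration cap are assumptions made for illustration.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute
    each centroid as the mean of its cluster, until nothing changes."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid for an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabeled 2-D objects.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Initial centroids are picked by hand here so that the run is deterministic.
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

The cluster index assigned to each point plays the role of the generated class label. Results of k-means generally depend on the initial centroids, so practical use usually involves several restarts.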

16

An example: clustering customers in a restaurant

[Figure labels: Restaurant database, Preprocessing, Object View for Clustering, Clustering, A Set of Similar Object Clusters, Summarization. Example cluster summaries: Young at midnight, White Collar for Dinner, Retired for Lunch]

17

Steps of database clustering
1. Define object-view
2. Select relevant attributes
3. Generate suitable input format for the clustering tool
4. Define similarity measure
5. Select parameter settings for the chosen clustering algorithm
6. Run clustering algorithm
7. Characterize the computed clusters

18

Challenge: database clustering
- Data collections are in many different formats
  - Flat files
  - Relational databases
  - Object-oriented databases
- Flat file format:
  - The simplest and most frequently used format in the traditional data analysis area
- Databases are more complex than flat files

19

Challenge: database clustering (cont.)
- Challenge: changing clustering algorithms to become more directly applicable to real-world databases
- Issues related to databases:
  - Different types of objects in the database
  - Relationships between objects: 1:1, 1:n & n:m
  - Complexity in the definition of object similarity
    - Due to the presence of bags of values for an object
  - Difficulty in selecting an appropriate similarity measure
    - Due to the presence of different types for attributes of objects
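To make the two similarity issues concrete, the Python sketch below combines a per-type similarity for each attribute: a range-normalized difference for a numeric attribute, exact match for a categorical one, and a multiset Jaccard similarity for a bag of values. This is only one plausible construction, not the measure used in the cited work; the attribute names, the age range of 60, the equal weighting, and the customer objects are all hypothetical.

```python
from collections import Counter


def numeric_similarity(a, b, value_range):
    """1 minus the range-normalized absolute difference, assuming both values
    lie within `value_range` of each other."""
    return 1.0 - abs(a - b) / value_range


def categorical_similarity(a, b):
    """1.0 for identical categories, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def bag_similarity(a, b):
    """Multiset Jaccard: shared occurrences divided by total occurrences."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())  # per-value maximum counts
    if union == 0:
        return 1.0  # two empty bags are treated as identical
    return sum((a & b).values()) / union  # per-value minimum counts


def object_similarity(x, y, age_range=60.0):
    """Unweighted average of the per-attribute similarities (an assumption)."""
    parts = [
        numeric_similarity(x["age"], y["age"], age_range),
        categorical_similarity(x["job"], y["job"]),
        bag_similarity(x["orders"], y["orders"]),
    ]
    return sum(parts) / len(parts)


# Hypothetical customers; "orders" is a bag gathered through a 1:n relationship.
alice = {"age": 30, "job": "engineer", "orders": ["pasta", "salad", "tea"]}
bob = {"age": 36, "job": "engineer", "orders": ["pasta", "pasta", "tea"]}
carol = {"age": 70, "job": "retired", "orders": ["soup"]}

print(object_similarity(alice, bob))
print(object_similarity(alice, carol))
```

How the per-attribute similarities are weighted is itself a modeling decision, which is part of why selecting an appropriate measure is difficult.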

20

References
- Han, J., Kamber, M., Data Mining: Concepts and Techniques, Second Edition, Elsevier Inc., 2006, 770 p., ISBN 978-1-55860-901-3.
- Silberschatz, A., Korth, H. F., Sudarshan, S., Database System Concepts, Fifth Edition, McGraw-Hill, 2005, ISBN 0-07-295886-3.
- Ryu, T., Eick, C., A Database Clustering Methodology and Tool, Information Sciences 171(1-3): 29-59 (2005).