1
An Introduction to Data Mining
Hosein Rostani, Alireza Zohdi
Report 1 for the “Advanced Database” course
Supervisor: Dr. Masoud Rahgozar
December 2007
2
Outline
  Why data mining?
  Data mining applications
  Data mining functionalities
    Concept description
    Association analysis
    Outlier analysis
    Evolution analysis
    Classification
    Clustering
3
Why data mining?
  Motivation:
    Wide availability of huge amounts of data
    Need for turning data into useful information & knowledge
  Data mining:
    Extracting or “mining” knowledge from large amounts of data
    Knowledge: useful patterns
    Semiautomatic process
      Focus on automatic aspects
4
Data mining applications
  Prediction. Examples:
    Credit risk
    Customers switching to competitors
    Fraudulent phone calling card usage
  Associations. Examples:
    Related books to buy
    Related accessories to suggest, e.g. for a camera
    Causation discovery, e.g. in medicine
  Clusters. Example:
    Clusters of disease
5
Data mining functionalities
  Concept description
    Characterization & discrimination
  Association analysis
  Outlier analysis
  Evolution analysis
  Classification and prediction
  Clustering
6
Concept description
  Description of concepts
    Summarized, concise & precise
  Ways:
    Data characterization
      Summarizing the data of the target class in general terms
    Data discrimination
      Comparison of the target class with the contrasting class(es)
  Examples of output forms:
    Pie charts, bar charts, curves & multidimensional tables
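The two ways of describing a concept can be sketched in a few lines of Python. This is a minimal illustration, not a method from the slides: the customer records, the `segment` and `age_group` attribute names, and both helper functions are hypothetical. Characterization summarizes one class on its own; discrimination puts the target class next to a contrasting class.

```python
from collections import Counter

# Hypothetical customer records (illustrative data only).
customers = [
    {"segment": "big_spender", "age_group": "40-50"},
    {"segment": "big_spender", "age_group": "40-50"},
    {"segment": "big_spender", "age_group": "30-40"},
    {"segment": "budget", "age_group": "20-30"},
    {"segment": "budget", "age_group": "20-30"},
    {"segment": "budget", "age_group": "40-50"},
]

def characterize(records, segment, attribute):
    """Data characterization: share of each attribute value within one class."""
    values = [r[attribute] for r in records if r["segment"] == segment]
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}

def discriminate(records, target, contrast, attribute):
    """Data discrimination: (target share, contrast share) for every value."""
    t = characterize(records, target, attribute)
    c = characterize(records, contrast, attribute)
    return {v: (t.get(v, 0.0), c.get(v, 0.0)) for v in sorted(set(t) | set(c))}

summary = characterize(customers, "big_spender", "age_group")
comparison = discriminate(customers, "big_spender", "budget", "age_group")
for value, (target_share, contrast_share) in comparison.items():
    print(f"{value}: target {target_share:.2f}, contrast {contrast_share:.2f}")
```

The resulting shares are the kind of numbers that would feed a pie chart or bar chart on the output side.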
7
Association analysis
  Mining frequent patterns
    For discovery of interesting associations within data
  Kinds of frequent patterns:
    Frequent itemset
      Set of items that frequently appear together. E.g. milk and bread
    Frequent subsequence
      E.g. pattern of customers’ purchases: first a PC, then a digital camera & then a memory card
    Frequent substructure
      Structural forms such as graphs, trees, or lattices
  Support and confidence
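Support and confidence are commonly defined as follows: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A => B is the fraction of transactions containing A that also contain B. A minimal sketch, assuming a hypothetical list of market-basket transactions (the data and function names are illustrative only):

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

milk_bread_support = support({"milk", "bread"}, transactions)
milk_to_bread_confidence = confidence({"milk"}, {"bread"}, transactions)
print(milk_bread_support)        # 0.6  (3 of 5 transactions)
print(milk_to_bread_confidence)  # 0.75 (3 of the 4 transactions with milk)
```

A frequent-itemset miner would keep only itemsets whose support reaches a chosen minimum, then report rules whose confidence is high enough.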
8
Outlier analysis
  Outliers:
    Data objects that do not comply with the general behavior of the data
  Approaches to outliers
    Discard as noise or exceptions
    Keep for applications such as fraud detection
      Example: detecting fraudulent usage of credit cards
  Ways:
    Using statistical tests
    Using distance measures
    Using deviation-based methods
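Two of the ways listed above can be sketched on one-dimensional data. This is a rough illustration under stated assumptions: the card amounts are made up, the z-score rule stands in for "statistical tests" in general, and the thresholds (`threshold`, `radius`, `min_neighbors`) are hypothetical parameters that would need tuning on real data.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Statistical approach: flag values far from the mean in standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

def distance_outliers(values, radius, min_neighbors):
    """Distance approach: flag values with too few other values within `radius`."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, w in enumerate(values) if j != i and abs(v - w) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers

# Hypothetical credit card purchase amounts; 500 is the unusual one.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
by_zscore = zscore_outliers(amounts)
by_distance = distance_outliers(amounts, radius=10, min_neighbors=2)
print(by_zscore, by_distance)
```

On small samples a single extreme value inflates the standard deviation, which is why the sketch uses a threshold of 2 rather than the more common 3.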
9
Evolution analysis
  Description and modeling of trends
    For objects whose behavior changes over time
  Ways:
    Applying other data mining tasks to time-related data
      Association analysis, classification, prediction, clustering & …
    Distinct ways
      Time-series data analysis
      Sequence or periodicity pattern matching
      Similarity-based data analysis
  Example: stock market: predict future trends in prices
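As a very small taste of time-series analysis, a moving average smooths short-term fluctuations so that the underlying trend is easier to see. The sketch below is an assumption-laden illustration: the prices are hypothetical, the moving average is one simple technique among many, and it describes the past trend rather than forecasting future prices.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

prices = [10, 11, 13, 12, 14, 15, 17]  # hypothetical daily closing prices
smoothed = moving_average(prices, window=3)

# Crude trend label: compare the last smoothed value with the first.
trend = "up" if smoothed[-1] > smoothed[0] else "down or flat"
print(smoothed, trend)
```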
10
Classification and prediction
  Classification:
    Process of finding a model that distinguishes data classes
    Purpose: using the model to predict the class of new objects
  Deriving the model:
    Based on the analysis of a set of training data
      Data objects with known class labels
  Example:
    In a credit card company
      Classification of customers based on their payment history
      Prediction of a new customer’s creditworthiness
11
Classification
  A two-step process for classification:
    First: learning or training step
      Building the classifier by analyzing or learning from training data
    Second: classifying step
      Using the classifier for classification
  Accuracy of a classifier (on a given test set)
    Percentage of test set tuples correctly classified by the classifier
  Classification methods:
    Decision tree, naïve Bayesian classification, neural network, k-nearest neighbor classification, …
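The two-step process and the accuracy measure can be sketched with k-nearest neighbor classification, one of the methods listed above. The customer features (number of late payments, credit utilization), the labels, and the split into training and test sets are all hypothetical. For k-NN the "training step" is simply storing the labeled tuples.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classifying step: majority label among the k nearest training tuples."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(1 for x, y in test if knn_predict(train, x, k) == y)
    return correct / len(test)

# Hypothetical tuples: (late payments, credit utilization) -> class label.
train = [
    ((0, 0.2), "good"), ((1, 0.3), "good"), ((0, 0.4), "good"),
    ((5, 0.9), "risky"), ((4, 0.8), "risky"), ((6, 0.7), "risky"),
]
test = [((1, 0.25), "good"), ((5, 0.85), "risky"), ((3, 0.6), "good")]

acc = accuracy(train, test)
print(f"accuracy on the test set: {acc:.2f}")  # 2 of 3 correct
```

The third test tuple sits between the two groups and is misclassified, which is exactly the kind of case the accuracy figure is meant to expose.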
12
Decision tree
  Decision tree induction:
    Learning of decision trees from class-labeled training tuples
  Decision tree: a flowchart-like tree structure
    Internal nodes: tests on attributes
    Branches: outcomes of the tests
    Leaves: class labels
  Usage in classification:
    Prediction by tracing a path from the root to a leaf node
      Testing attribute values of the new tuple against the decision tree
    Easy conversion of a decision tree to classification rules
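The usage side of a decision tree (tracing a path and converting the tree to rules) can be sketched as follows. The tree is written by hand rather than induced from training data, and its attributes (`payment_history`, `income`) and class labels are hypothetical. An internal node is a dict, a leaf is a class-label string.

```python
# Hand-built tree (not induced): internal node = dict, leaf = class label.
tree = {
    "attribute": "payment_history",
    "branches": {
        "good": "low_risk",
        "poor": {
            "attribute": "income",
            "branches": {"high": "medium_risk", "low": "high_risk"},
        },
    },
}

def classify(node, record):
    """Trace a path from the root to a leaf by testing the record's attributes."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node

def to_rules(node, conditions=()):
    """One IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {node}"]
    rules = []
    for outcome, child in node["branches"].items():
        rules += to_rules(child, conditions + ((node["attribute"], outcome),))
    return rules

label = classify(tree, {"payment_history": "poor", "income": "low"})
rules = to_rules(tree)
print(label)
for rule in rules:
    print(rule)
```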
14
Bayesian classification
  Bayesian classification
    Predicting the probability that a new tuple belongs to a particular class
    High accuracy and speed in large databases
  Based on Bayes’ theorem
    Conditional probability
  Naïve Bayesian classifier
    Assumption: class conditional independence
      Good for simplifying computations
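Under class conditional independence, the score of a class C for a tuple X is proportional to P(C) multiplied by the product of P(x_i | C) over the attributes. A minimal sketch for categorical attributes follows. The training rows and attribute names are hypothetical, and the add-one (Laplace) smoothing is an addition of this sketch, not something stated in the slides; it keeps an unseen value from forcing a probability of zero.

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """Count classes and per-class attribute values from (features, label) rows."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    domains = defaultdict(set)           # attribute -> values seen in training
    for features, label in rows:
        for attr, value in features.items():
            value_counts[(label, attr)][value] += 1
            domains[attr].add(value)
    return class_counts, value_counts, domains

def posterior(model, features):
    """Normalized P(class | features) under the naive independence assumption."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for attr, value in features.items():
            seen = value_counts[(label, attr)][value]
            # Add-one smoothing (an assumption of this sketch).
            score *= (seen + 1) / (n + len(domains[attr]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical training tuples with known class labels.
rows = [
    ({"history": "good", "income": "high"}, "low_risk"),
    ({"history": "good", "income": "low"}, "low_risk"),
    ({"history": "good", "income": "high"}, "low_risk"),
    ({"history": "poor", "income": "low"}, "high_risk"),
    ({"history": "poor", "income": "high"}, "high_risk"),
]
model = train_nb(rows)
probs = posterior(model, {"history": "poor", "income": "low"})
predicted = max(probs, key=probs.get)
print(predicted, probs)
```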
15
Clustering
  The process of grouping a set of physical or abstract objects into classes of similar objects
    Generating class labels for objects currently without a label
  Clustering is based on this principle:
    Maximizing the intraclass similarity and
    Minimizing the interclass similarity
  Clustering also facilitates taxonomy formation
    Hierarchical organization of observations
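The slides do not name a specific clustering algorithm. As one common illustration of the principle above, the sketch below uses k-means, which assigns each object to its nearest centroid and then moves each centroid to the mean of its cluster. The points and the starting centroids are hypothetical, and the fixed iteration count is a simplification.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical unlabeled objects forming two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
clusters, centroids = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters)
print(centroids)
```

The cluster index that each object ends up with plays the role of the generated class label.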
16
An example: clustering customers in a restaurant
  [Figure: diagram with stages labeled Restaurant database, Preprocessing, Object view for clustering, Clustering, A set of similar object clusters, and Summarization; example clusters: “Young at midnight”, “White collar for dinner”, “Retired for lunch”]
17
Steps of database clustering
  1. Define object-view
  2. Select relevant attributes
  3. Generate suitable input format for the clustering tool
  4. Define similarity measure
  5. Select parameter settings for the chosen clustering algorithm
  6. Run clustering algorithm
  7. Characterize the computed clusters
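Steps 1 to 3 can be sketched for a tiny relational example. Everything here is hypothetical: the `customers` and `visits` tables, their column names, and the `build_object_view` helper are made up for illustration. The sketch joins a 1:n relationship into one object per customer, so that related rows become a bag of values on that object.

```python
from collections import defaultdict

# Hypothetical relational tables with a 1:n link from customers to visits.
customers = [
    {"id": 1, "age": 24, "job": "student"},
    {"id": 2, "age": 45, "job": "engineer"},
]
visits = [
    {"customer_id": 1, "meal": "midnight"},
    {"customer_id": 1, "meal": "dinner"},
    {"customer_id": 2, "meal": "dinner"},
    {"customer_id": 2, "meal": "dinner"},
]

def build_object_view(customers, visits, attributes=("age", "job")):
    """One object per customer: selected attributes plus a bag of visited meals."""
    meals = defaultdict(list)
    for visit in visits:
        meals[visit["customer_id"]].append(visit["meal"])
    view = []
    for customer in customers:
        obj = {a: customer[a] for a in attributes}   # step 2: relevant attributes
        obj["meals"] = sorted(meals[customer["id"]])  # bag of values (duplicates kept)
        view.append(obj)
    return view

object_view = build_object_view(customers, visits)
for obj in object_view:
    print(obj)
```

The resulting list of dicts is the input format a clustering tool would then consume in the later steps.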
18
Challenge: database clustering
  Data collections are in many different formats
    Flat files
    Relational databases
    Object-oriented databases
  Flat file format:
    The simplest and most frequently used format in the traditional data analysis area
  Databases are more complex than flat files
19
Challenge: database clustering (cont.)
  Challenge: changing clustering algorithms to become more directly applicable to real-world databases
  Issues related to databases:
    Different types of objects in the DB
    Relationships between objects: 1:1, 1:n & n:m
    Complexity in the definition of object similarity
      Due to the presence of bags of values for an object
    Difficulty in selecting an appropriate similarity measure
      Due to the presence of different types for attributes of objects
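To make the last two issues concrete, here is one possible similarity measure for objects that mix a numeric attribute, a categorical attribute, and a bag of values. It is an illustrative sketch, not the measure from the cited methodology: the attribute names, the `age_range` normalizer, the multiset Jaccard for bags, and the equal weighting of the three parts are all assumptions.

```python
from collections import Counter

def bag_similarity(a, b):
    """Multiset Jaccard: shared occurrences divided by total occurrences."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 1.0

def object_similarity(x, y, age_range=60):
    """Average of per-attribute similarities, each scaled to [0, 1]."""
    numeric = max(0.0, 1 - abs(x["age"] - y["age"]) / age_range)
    categorical = 1.0 if x["job"] == y["job"] else 0.0
    bag = bag_similarity(x["meals"], y["meals"])
    return (numeric + categorical + bag) / 3

# Hypothetical objects of the kind an object view would produce.
a = {"age": 24, "job": "student", "meals": ["dinner", "midnight"]}
b = {"age": 45, "job": "engineer", "meals": ["dinner", "dinner"]}

meal_overlap = bag_similarity(a["meals"], b["meals"])
print(f"{meal_overlap:.2f}", f"{object_similarity(a, b):.2f}")
```

Changing the weighting or the per-type measures changes which objects count as similar, which is why selecting the measure is listed as a difficulty.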
20
References
  Han, J., Kamber, M., Data Mining: Concepts and Techniques, Second Edition, Elsevier Inc., 2006, 770 p., ISBN 1-55860-901-3.
  Silberschatz, A., Korth, H. F., Sudarshan, S., Database System Concepts, Fifth Edition, McGraw-Hill, 2005, ISBN 0-07-295886-3.
  Ryu, T., Eick, C., A Database Clustering Methodology and Tool, Information Sciences 171(1-3): 29-59 (2005).