mit-652 data mining applications · network dbms 1970s: relational data model, relational dbms...

MIT-652: DM 1: Introduction to Data Mining 1

Introduction to Data Mining

MIT-652 Data Mining Applications

Thimaporn Phetkaew

School of Informatics,Walailak University


Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining

Summary


Motivation: “Necessity is the Mother of Invention”

Data explosion problem

Automated data collection tools and mature database technology

lead to tremendous amounts of data stored in databases, data

warehouses and other information repositories

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing (OLAP)

Data Mining: extraction of interesting knowledge (rules,

regularities, patterns, constraints) from data in large databases


Evolution of Database Technology

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s: Data mining and data warehousing, multimedia databases, and Web databases

2000s: Data mining and its applications, Web technology (XML, data integration) and global information systems


Introduction









Summary


What Is Data Mining?

Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databasesData mining: a misnomer?

Alternative names: Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: is everything “data mining”?Simple search and query processing Expert systems


Potential Data Mining Applications

Database analysis and decision supportMarket analysis and management

Target marketing, customer relation management (CRM), market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and detection of unusual patterns (outliers)

Other ApplicationsText mining (news group, email, documents) and Web mining

Bioinformatics and bio-data analysis


Data Mining: A KDD Process

Data mining: the core of knowledge discovery process

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Data Selection

Data Mining

Pattern Evaluation

Data Transformation


Steps of a KDD Process

Data cleaning: to remove noise and incomplete data

Data integration: where multiple data sources may be combined

Data selection: where data relevant to the analysis task are retrieved from the database/repository

Data transformation: where data are transformed into forms appropriate for mining


Steps of a KDD Process

Data mining: where intelligent methods are applied in order to extract data patterns

Pattern evaluation: to identify the truly interesting patterns representing knowledge

Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user


Data Mining and Business Intelligence

Increasing potentialto supportbusiness decisions

End User

BusinessAnalyst

DataAnalyst

DBA

DecisionsMaking

Data PresentationVisualization Techniques

Data MiningInformation Discovery

Data ExplorationStatistical Summary, Querying and Reporting

Data Preprocessing/Integration, Data Warehouses

Data SourcesPaper, Files, Web documents, Scientific Experiments, Database Systems


Architecture of a Typical Data Mining System

data cleaning, integration, and selection

Database or Data Warehouse Server

Data Mining Engine

Pattern Evaluation

Graphical User Interface

Knowledge-Base

Database Data Warehouse

World-WideWeb

Other InfoRepositories


Introduction









Summary


Data Mining: On What Kind of Data?

Relational databasesTransactional databasesData warehousesAdvanced DB and information repositories

Object-oriented and object-relational databasesSpatial databasesTime-series data and temporal dataText databases Multimedia databasesHeterogeneous and legacy databasesWWW


Relational Databases

A collection of tables consisting of a set of tuples(records or rows)Each tuple owns a set of attributes (columns or fields) A tuple is identified by a unique keySQL “Show me total sales of last year grouped by branch”Data mining can search for trends and data patterns e.g. detect deviations such as

Items whose sales are far from those expected compared to last year; Predict credit risk of new customers based on their income, age,credit history


Transactional Databases

Consist of a file where each record represents a transaction (TransID + ItemList, such as items purchased in a store)Q: How many transactions include item #13?Data mining: Market basket analysis: identify frequent itemsets to enable maximizing sales

Knowing printers are commonly purchased together with computers, thus discounting an expensive model printers with a purchase of selected computers (hoping selling more expensive printers)Shelf diapers, beers and potato chips nearby to increase sales


Data Warehouses

A repository of information collected from multiple sources, organized under a unified schema at a single site in order to facilitate management decision makingProvide a multidimensional view of data and allows precomputation and fast accessing of summarized dataData warehouse systems are well suited for On-line Analytical Processing (OLAP)Data mining uses data warehouse tools to support data analysis

OLAP operations such as drill-down and roll-up


Object-Oriented Databases

Based on OO programming paradigmData and code encapsulated into a single unitEntity is considered as object associated with

A set of variables describe the object (attributes in E-R model)A set of messages the object uses to communicate with other objectA set of methods which return a value in response upon receiving a message

Superclass/subclass with inheritance benefits information sharing


Object-Relational Databases

Extend basic relational model by adding power to handle complex data types, class hierarchies, and object inheritanceData mining in OO and OR systems share some similarities with relational Data mining

Techniques need to be developed for handling complex object structure, class and subclass hierarchies, property inheritance,and methods and procedures


Spatial Databases

Include geographic (map) databases, medical image databases and satellite image databasesApplications: forestry and ecology planning, vehicle navigation, providing public service information regarding location of telephone and electric cables, pipes, and sewage systemsData mining uncover patterns describing

Characteristics of houses located near a parkClimate of mountainous areas located at various altitudes


Time-series and Temporal Databases

Time-series database stores sequences of values that change with time e.g. stock exchangeTemporal database stores relational data that include time-related attributesData mining finds characteristics of object evolution or the trend of changes for objects in database e.g.

Aid in scheduling of bank tellers according to the volume of customer trafficUncover trends that could help investment strategy planning (when is the best time to purchase XXX stock?)


Text Databases

Contain word descriptions for objects such as library databases, articles, product specifications, and bug reportsData mining may uncover keyword or content associations, clustering behavior of text objects e.g. search engines cluster documents according to the words contained


Multimedia Databases

Store text, image, audio,video dataApplication: picture content-based retrieval, voice mail system, speech-based user interfacesData mining needs storage and search techniques to be integrated to support real-time retrieval of countinuousmedia data


Heterogeneous and Legacy Databases

Heterogenous database consists of a set of interconnected, autonomous component databasesLegacy database is a group of heterogeneous databases that combined different kinds of data systems, such as relational or OO databases, spreadsheets, or file systemsData mining may provide an interesting solution to the information exchange problem by transforming the given data into higher, more generalized, conceptual levels


A huge, widely distributed information repository made available by internetData mining: Mining path traversal patterns: understand user access patterns

Help providing efficient access between highly correlated objects (improve system desigh)Lead to better marketing decisions e.g. placing advertisementsProviding better customer classification and behavior analysis

World Wide Web


Introduction









Summary


Data Mining Functionalities

Concept description: Characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Frequent patterns, association (correlation and causality)

Multi-dimensional vs. single-dimensional associationage(X, “20..29”) ^ income(X, “20..29K”) -> buys(X, “PC”) [support = 20%, confidence = 60%]contains(X, “computer”) -> contains(X, “software”) [10%, 75%]



Classification and PredictionFinding models (functions) that describe and distinguish classesor concepts for future prediction

E.g., classify countries based on climate, or classify cars based on gas mileage

Presentation: decision-tree, classification rule, neural network

Prediction: Predict some unknown or missing numerical values

Cluster analysisClass label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity



Outlier analysisOutlier: a data object that does not comply with the general

behavior of the data

It can be considered as noise or exception but is quite useful in

fraud detection, rare events analysis

Trend and evolution analysisTrend and deviation: regression analysis

Sequential pattern mining: e.g., digital camera large SD memory

Periodicity analysis

Similarity-based analysis


Introduction









Summary


Are All the “Discovered” Patterns Interesting?

A data mining system/query may generate thousands of

patterns, not all of them are interesting.Suggested approach: Human-centered, query-based, focused

mining

Interestingness measures: A pattern is interesting if it is easily understood by humans,

valid on new or test data with some degree of certainty,

potentially useful, novel, or validates some hypothesis that a

user seeks to confirm


Are All the “Discovered” Patterns Interesting?

Objective vs. subjective interestingness measures:

Objective: based on statistics and structures of

patterns, e.g., support, confidence, etc.

Subjective: based on user’s belief in the data, e.g.,

unexpectedness, novelty, actionability, etc.


Can We Find All and Only Interesting Patterns?

Find all the interesting patterns: CompletenessCan a data mining system find all the interesting patterns?

Heuristic vs. exhaustive search

Association vs. classification vs. clustering

Search for only interesting patterns: OptimizationCan a data mining system find only the interesting patterns?

Approaches

First general all the patterns and then filter out the uninteresting ones

Generate only the interesting patterns—mining query optimization


Introduction









Summary


Data Mining: Confluence of Multiple Disciplines

Data Mining

Database Technology Statistics

OtherDisciplines

InformationScience

MachineLearning Visualization

Algorithm


Data Mining: Classification Schemes

General functionalityDescriptive data mining characterize general properties of data

in database e.g. association rules

Predictive data mining perform inference on current data in

order to make predictions

Different views, different classificationsData view: kinds of databases to be mined

Knowledge view: kinds of knowledge to be discovered

Method view: kinds of techniques utilized

Application view: kinds of applications adapted


Different views, different classifications (1)Databases to be mined

Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.

Knowledge to be minedCharacterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.Multiple/integrated functions and mining at multiple levels



Different views, different classifications (2)Techniques utilized

Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

Applications adaptedRetail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.



Introduction









Summary


What Defines a Data Mining Task ?

Task-relevant data

Type of knowledge to be mined

Background knowledge

Pattern interestingness measurements

Visualization of discovered patterns


Task-Relevant Data

Database or data warehouse name

Database tables or data cubes

Condition for data selection

Relevant attributes or dimensions

Data grouping criteria


Types of knowledge to be mined

Characterization

Discrimination

Association

Classification/prediction

Clustering

Outlier analysis

Other data mining tasks


Types of knowledge to be mined: Metapatterns

Metapatterns: all discovered patterns must match

For example, metarule for association rules

P(X:customer,W) ^ Q(X,Y) ⇒ buys(X,Z)

age(X,”30..39”) ^ income(X,”40K..49K”) ⇒ buys(X,”VCR”)

[2.2%, 60%]

occupation(X,”student”) ^ age(X,”20..29”) ⇒

buys(X,”Computer”) [1.4%, 70%]


Background Knowledge: Concept Hierarchies

Schema hierarchyE.g., street < city < province_or_state < country

Set-grouping hierarchyE.g., {20-39} = young, {40-59} = middle_aged, {60-89} = senior

Operation-derived hierarchyE.g., email address: [email protected]

login-name < department < university < countryRule-based hierarchy

low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $50


Measurements of Pattern Interestingness

Simplicitye.g., (association) rule length, (decision) tree size

Certaintye.g., confidence, P(A|B) = n(A and B)/ n (B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.

Utilitypotential usefulness, e.g., support (association), noise threshold (description)

Noveltynot previously known, contribute new information, e.g., exception


Visualization of Discovered Patterns

Different backgrounds/usages may require different forms of representation

E.g., rules, tables, crosstabs, pie/bar chart etc.

Concept hierarchy is also important Discovered knowledge might be more understandable when represented at high level of abstraction

Interactive drill up/down, pivoting, slicing and dicing provide different perspective to data

Different kinds of knowledge require different representation: association, classification, clustering, etc.


Introduction









Summary


Major Issues in Data Mining

Mining methodology and user interaction

Mining different kinds of knowledge in databases

Interactive mining of knowledge at multiple levels of abstraction

Incorporation of background knowledge

Data mining query languages and ad-hoc data mining

Expression and visualization of data mining results

Handling noise and incomplete data

Pattern evaluation: the interestingness problem


Major Issues in Data Mining

Issues relating to the diversity of data typesHandling relational and complex types of dataMining information from heterogeneous databases and global information systems (WWW)

Performance and scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed and incremental mining methods


Introduction









Summary


Summary

Database technology: evolved from primitive file processing to the development of database management systems

Data mining: discovering interesting patterns from large amounts of data in a variety of information repositories

A KDD process: includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation

Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.


Summary

Pattern interestingness: easily understood, valid (with some degree of certainty), useful, novel

Measures of pattern interestingness, either objective or subjective can be used to guide the discovery process

Data mining systems: can be classified according to the kinds of databases mined, the kinds of knowledge mined, the techniques used, or the applications adapted


Summary

Five primitives for specification of a data mining task

task-relevant data (i.e., the data set to be mined)

kind of knowledge to be mined

background knowledge

interestingness measures

knowledge presentation and visualization techniques to be used for displaying the discovered patterns

mit-652 data mining applications · network dbms 1970s: relational data model, relational dbms...

Documents