
Page 1:

Data Mining

Edward Hong Zhang
CS Dept, SUNY Albany

CSI 668, March 20, 2001

Page 2:

Presentation Outline

Motivation

Background (KDD Process)

What’s Data Mining?

Why Data Mining?

The Data Mining Process

Data Mining Algorithms

Data Mining Research Trend

Existing Systems for Data Mining

Conclusions

Page 3:

Motivation “Necessity is the mother of invention”

Data explosion problem: automated data collection tools, the availability of increasingly cheap storage devices, and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.

We are drowning in data, but starving for knowledge! Data is everywhere.

Understanding and using this data is an urgent task!

Solution: Knowledge Discovery (Data warehousing and data mining)

Page 4:

Evolution of Database Technology

1960s-1970s:

Data collection, database creation, IMS and network DBMS.

1970s-1980s:

Relational data model, relational DBMS implementation.

1980s-1990s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).

1990s-present: Data mining and data warehousing, multimedia databases, and Web-based database technology.

Page 5:

Background

Knowledge Discovery (KD): the process of finding general patterns/principles that summarize/explain a set of "observations".

Knowledge Discovery in Databases (KDD): Very Large DataBases (VLDB) have become the industry standard, making it impossible for human beings to mine the data "by hand" to look for interesting patterns. Automated tools are therefore needed to help extract these patterns.

Page 6:

Background Cont.

Knowledge discovery in databases (KDD) consists of three steps:

Data Integration (Data Warehousing): collecting the target data observations from the different data sources, removing noise from the observations, and integrating them into an appropriate format.

Data Mining (covered in detail below): applying a concrete algorithm to find useful and novel patterns in the integrated data.

Page 7:

Background Cont.

Pattern Evaluation:

Interpreting mined patterns, evaluating them according to usefulness/interestingness criteria, and possibly using visualization tools to aid in understanding the patterns graphically.

See KDD process graph below:

Page 8:

Data Mining: KDD process

[Figure: the KDD process. Databases -> data cleaning and data integration -> data warehouse -> selection of task-relevant data -> data mining -> pattern evaluation.]

Data mining is the core of the knowledge discovery process.

Page 9:

What Is Data Mining?

Data mining (knowledge discovery in databases): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases, data warehouses, or other information repositories.

What is not data mining? (Deductive) query processing; expert systems or machine learning/statistical programs; online analytical processing (OLAP); software agents.

Data Mining: Confluence of Multiple Disciplines

Page 10:

[Figure: data mining at the confluence of multiple disciplines: databases and OLAP, high-performance computing, machine learning (AI), visualization, information science, pattern recognition, and statistics/modeling.]

Page 11:

Why Data Mining? – Potential Applications

Database analysis and decision support systems (DSS).

Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation.

Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.

Text mining (text databases, documents): keyword search and analysis.

DNA sequence analysis and gene expression.

Page 12:

Data Mining and Business Intelligence

[Figure: the data mining and business intelligence pyramid; the potential to support business decisions increases toward the top.]

Data Sources: paper, files, information providers, database systems, OLTP
Data Warehouses / Data Marts: OLAP, MDA
Data Exploration: statistical analysis, querying and reporting
Data Mining: discovery of useful patterns
Visualization Techniques
Making Decisions

Roles, from data management up to decision making: DBA, Data Analyst, Business Analyst, End User.

Page 13:

Why Data Mining? – Potential Applications (Cont.)

Internet: Web Surf-Aid (Web mining). IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat.

Page 14:

The Data Mining Process

[Figure: the data mining process. A data mining system trains a model on a historical training data set using a data mining algorithm; the model is scored/evaluated and then applied to new data for prediction, producing results and patterns.]

Page 15:

Examples of “Discovered” Patterns

Association rules: find rules relating different attributes. Example: 98% of AOL users also have eBay accounts.

Classification: classify data based on the values of a classifying attribute. Example: people aged less than 40 with salary > $40,000 trade online.

Clustering: group data to form new classes. Example: users A and B access similar URLs, so they belong to the same group, which has similar user profiles.
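To make the association-rule example concrete, here is a minimal Python sketch (with invented toy records, not data from the talk) that measures the support and confidence of a rule such as "AOL user => eBay account":

```python
# Support/confidence of a simple association rule over toy records
# (hypothetical data, for illustration only).

records = [
    {"aol": True,  "ebay": True},
    {"aol": True,  "ebay": True},
    {"aol": True,  "ebay": False},
    {"aol": False, "ebay": True},
]

def support_confidence(records, antecedent, consequent):
    """Support = fraction of all records containing antecedent and consequent;
    confidence = among records with the antecedent, fraction with the consequent."""
    with_a = [r for r in records if r[antecedent]]
    with_both = [r for r in with_a if r[consequent]]
    support = len(with_both) / len(records)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

s, c = support_confidence(records, "aol", "ebay")
print(f"support={s:.2f}, confidence={c:.2f}")   # support=0.50, confidence=0.67
```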

Page 16:

Are All the “Discovered” Patterns Interesting?

A data mining system/query may generate thousands of patterns, and not all of them are interesting. Suggested approach: query-based, focused mining.

Interestingness measures: a pattern is interesting if it is easily understood by humans; valid on new or test data with some degree of certainty; potentially useful; and novel, or validating some hypothesis that a user seeks to confirm.

Page 17:

How can we Find All and Only Interesting Patterns?

Find all the interesting patterns (completeness): can a data mining system find all the interesting patterns?

Search only the interesting patterns (optimization): can a data mining system find only the interesting patterns?

Approaches:

First generate all the patterns and then filter out the uninteresting ones.

Generate only the interesting patterns (mining query optimization).

Page 18:

Data Mining Algorithms

Four common DM algorithm types:

The k-Nearest Neighbor algorithm (KNN)
Artificial Neural Networks (ANN)
Rule Induction
Decision Trees

Page 19:

The k-Nearest Neighbor Algorithm (KNN)

A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset

Uses the entire training database as the model.

Find the nearest data point(s) and do the same thing as was done for those records.

[Figure: a query point xq surrounded by training points labeled + and -; the class assigned to xq is the majority class among its k nearest neighbors.]

Page 20:

The k-Nearest Neighbor Algorithm (KNN) (Cont.)

Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors.

Advantages: can calculate the mean values of the k nearest neighbors; robust to noisy data by averaging over the k nearest neighbors; very easy to implement.

Disadvantages: huge models (the entire training database); more difficult to use in production.
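As a concrete illustration of the description above (a sketch on made-up 2-D points, not code from the talk), the following shows plain and distance-weighted k-nearest-neighbor classification:

```python
import math
from collections import defaultdict

# Hypothetical training data: 2-D points labeled '+' or '-'.
training = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.1), "+"),
            ((3.0, 3.2), "-"), ((3.1, 2.9), "-"), ((2.8, 3.0), "-")]

def knn_classify(query, training, k=3, weighted=False):
    """Classify `query` by the (optionally distance-weighted) vote of its
    k nearest neighbors; the whole training set is the 'model'."""
    neighbors = sorted((math.dist(query, p), label) for p, label in training)[:k]
    votes = defaultdict(float)
    for d, label in neighbors:
        # With weighting, closer neighbors contribute more to the vote.
        votes[label] += 1.0 / (d * d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

print(knn_classify((1.1, 1.0), training))                 # '+'
print(knn_classify((2.9, 3.1), training, weighted=True))  # '-'
```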

Page 21:

Artificial Neural Networks (ANN)

Non-linear predictive models that learn through training and loosely resemble biological neural networks in structure.

Inputs are transformed through a network of simple processors; each processor combines its (weighted) inputs and produces an output value.

Page 22:

Artificial neural networks (Cont.)

[Figure: a single neuron. The input vector x = (x0, x1, ..., xn) is combined with the weight vector w = (w0, w1, ..., wn) into a weighted sum, which is passed through an activation function f to produce the output y.]

The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping.
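A minimal sketch of the single-unit computation in the figure, assuming a sigmoid activation and a bias term (common choices, not specified by the slide):

```python
import math

def neuron(x, w, bias):
    """One processing unit: the weighted sum of the inputs, shifted by a bias,
    is passed through a nonlinear activation (here a sigmoid)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias
    return 1.0 / (1.0 + math.exp(-s))

x = [0.5, 0.2, 0.9]    # input vector
w = [0.4, -0.6, 0.3]   # weight vector
print(neuron(x, w, bias=0.1))   # output y in (0, 1)
```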

Page 23:

Multilayer Perceptron Artificial Neural Networks

[Figure: a multilayer perceptron. The input vector xi feeds the input nodes, which connect through hidden nodes to the output nodes that produce the output vector.]

Page 24:

Artificial Neural Network evaluation

Advantages: prediction accuracy is generally high; robust, still works when training examples contain errors.

Disadvantages:

Key problem: difficult to understand. The neural network model is difficult to understand, and there is no intuitive understanding of the results.

Long training time: although the trained network is very quick to apply, the training process itself is time-consuming.

Significant pre-processing of the data is often required.

Page 25:

Rule Induction

Rule induction (rule-based prediction): we first generate a set of rules from a data warehouse, then use them to predict values for new data items. It works much better on larger (and real) data sets, not just on samples of data.

Two phases:

Rule discovery: analyze a historical database and automatically generate a set of rules.

Prediction: apply the rules to a new data set, matching rules to make predictions.

Page 26:

Rule Induction Example

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Training Set

Page 27:

Rule Induction Example (Cont.)

Four attributes:
Outlook: sunny, overcast, rain (3 cases)
Temperature: hot, mild, cool (3 cases)
Humidity: high, normal (2 cases)
Windy: true, false (2 cases)

1 outcome: class (N: no class, P: have class)

In total there are 3*3*2*2 = 36 possible combinations, of which 14 are present in the set of input examples.

Page 28:

Rule Induction Example (Cont.)

Some rules induced from the above dataset:

Classification rules:
If outlook = sunny and humidity = high then class = N
If outlook = rain and windy = true then class = N
If outlook = overcast then class = P

Association rules:
If temperature = cool then humidity = normal
If windy = false and class = N then outlook = sunny and humidity = high
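To illustrate the prediction phase, here is a small sketch (not the algorithm used in the talk) that applies the three classification rules above to records shaped like the training table:

```python
def classify(record):
    """Apply the classification rules induced from the weather data."""
    if record["outlook"] == "sunny" and record["humidity"] == "high":
        return "N"
    if record["outlook"] == "rain" and record["windy"]:
        return "N"
    if record["outlook"] == "overcast":
        return "P"
    return "P"   # every remaining case in the training table is class P

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))    # 'N'
print(classify({"outlook": "overcast", "humidity": "high", "windy": True}))  # 'P'
```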

Page 29:

What is a decision tree?

A decision tree is a flow-chart-like tree structure:
An internal node denotes a test on an attribute.
A branch represents an outcome of the test; all tuples in a branch have the same value for the tested attribute.
A leaf node represents a class label or a class label distribution.

In effect, a decision tree is a series of nested if/then rules: understandable!

Page 30:

A Sample Decision Tree

Outlook?
  sunny    -> Humidity? (high -> N, normal -> P)
  overcast -> P
  rain     -> Windy?    (true -> N, false -> P)

(Built from the same training set as the Rule Induction example.)

Page 31:

Another Example for DT

If x = 1 and y = 0 then class = a
If x = 0 and y = 1 then class = a
If x = 0 and y = 0 then class = b
If x = 1 and y = 1 then class = b

Page 32:

Another Example for DT

Salary   Education      Label
10000    high school    reject
40000    undergraduate  accept
15000    undergraduate  reject
75000    graduate       accept
18000    graduate       accept

Credit Analysis decision tree:

salary < 20000?
  no  -> accept
  yes -> education in graduate?
           yes -> accept
           no  -> reject

Page 33:

Decision-Tree Classification Methods

The basic top-down decision tree generation approach usually consists of two phases:

Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.

Tree pruning: aims at removing tree branches that may lead to errors when classifying test data (training data may contain noise, statistical fluctuations, ...).

Page 34:

How to construct a tree?

Algorithm: a greedy algorithm.

Make the locally optimal choice at each step: select the best attribute for each tree node.

Proceed in a top-down, recursive, divide-and-conquer manner: from the root to the leaves, split each node into several branches and, for each branch, recursively run the algorithm.
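A minimal greedy construction sketch in the spirit of this slide; it assumes information gain (as in ID3) as the "best attribute" measure, which the slide does not specify:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, attributes, target):
    """Pick the attribute whose split yields the largest information gain."""
    base = entropy([r[target] for r in rows])
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, attributes, target):
    """Greedy top-down construction: a node becomes a majority-class leaf when it
    is pure or no attributes remain; otherwise split on the best attribute."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    rest = [a for a in attributes if a != attr]
    return {attr: {v: build_tree([r for r in rows if r[attr] == v], rest, target)
                   for v in {r[attr] for r in rows}}}

rows = [{"outlook": "sunny", "humidity": "high", "class": "N"},
        {"outlook": "sunny", "humidity": "normal", "class": "P"},
        {"outlook": "overcast", "humidity": "high", "class": "P"}]
print(build_tree(rows, ["outlook", "humidity"], "class"))
```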

Page 35:

How to prune a tree?

A decision tree constructed using the training data may have too many branches/leaf nodes, caused by noise and overfitting, and may give poor accuracy for unseen samples.

Prune the tree: merge a subtree into a leaf node, using a set of data different from the training data. At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it with the majority class.
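Building on the nested-dict tree representation from the construction sketch above, here is a hedged reduced-error-pruning sketch that follows the slide's rule, collapsing a subtree to a majority-class leaf when splitting does not help on a separate pruning set:

```python
from collections import Counter

def classify_with_tree(tree, record, default):
    """Walk a nested-dict tree such as {'humidity': {'high': 'N', 'normal': 'P'}};
    unseen attribute values fall back to `default`."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(record[attr], default)
    return tree

def prune(tree, rows, target, default):
    """Prune bottom-up using `rows`, a data set separate from the training data."""
    if not isinstance(tree, dict) or not rows:
        return tree
    attr, branches = next(iter(tree.items()))
    pruned = {attr: {v: prune(sub, [r for r in rows if r[attr] == v], target, default)
                     for v, sub in branches.items()}}
    majority = Counter(r[target] for r in rows).most_common(1)[0][0]
    acc_split = sum(classify_with_tree(pruned, r, default) == r[target] for r in rows)
    acc_leaf = sum(r[target] == majority for r in rows)
    # Replace the subtree with a leaf if splitting does not improve pruning-set accuracy.
    return majority if acc_leaf >= acc_split else pruned

tree = {"humidity": {"high": "N", "normal": "P"}}
pruning_rows = [{"humidity": "high", "class": "P"}, {"humidity": "normal", "class": "P"}]
print(prune(tree, pruning_rows, "class", default="P"))   # collapses to the leaf 'P'
```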

Page 36:

How to use a tree?

Directly: test the attribute values of the unknown sample against the tree; a path is traced from the root to a leaf, which holds the class label.

Indirectly: the decision tree is converted to classification rules; one rule is created for each path from the root to a leaf. IF-THEN rules are easier for humans to understand.
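Continuing the nested-dict representation used in the earlier sketches, a small helper (an illustration, not the talk's code) that emits one IF-THEN rule per root-to-leaf path:

```python
def tree_to_rules(tree, conditions=()):
    """Yield one (conditions, label) pair for each root-to-leaf path of a
    nested-dict decision tree."""
    if not isinstance(tree, dict):            # leaf: emit the accumulated rule
        yield conditions, tree
        return
    attr, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attr, value),))

tree = {"outlook": {"sunny": {"humidity": {"high": "N", "normal": "P"}},
                    "overcast": "P",
                    "rain": {"windy": {True: "N", False: "P"}}}}
for conds, label in tree_to_rules(tree):
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in conds) + f" THEN class = {label}")
```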

Page 37:

Decision tree for a covering algorithm

[Figure: points labeled a and b in the x-y plane. A first split at x = 1.2 and a further split at y = 2.6 progressively separate the a points from the b points, as a covering algorithm would.]

Page 38:

Data Mining Algorithm Summary

KNN: quick and easy; models tend to be very large.

ANN: difficult to interpret; can require significant amounts of time to train.

Rule Induction: understandable; need to limit calculations.

Decision Trees: understandable; relatively fast.

Other DM technologies: genetic algorithms, rough sets, Bayesian networks, mixture models, and many more.

Page 39:

Data Mining Research Trend

Text mining: Text database and information retrieval

Multimedia data mining

OLAM (OLAP Mining)

Web mining (data mining and the WWW): e-commerce, information retrieval (search), network management.

Page 40:

Why Mine the Web?

The Web: a huge, widely distributed, highly heterogeneous, semi-structured, hypertext/hypermedia, interconnected, evolving information repository.

The Web is a huge collection of documents, plus hyperlink information and access and usage information.

Enormous wealth of information on the Web: financial information (e.g., stock quotes), book/CD/video stores (e.g., Amazon), restaurant information (e.g., Zagats), car prices (e.g., Carpoint).

Lots of data on user access patterns: Web logs contain the sequences of URLs accessed by users, as sketched below.
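As a small illustration of mining Web access logs (invented log entries, not data from the talk), the following counts page-to-page transitions per user:

```python
from collections import Counter

# Hypothetical, simplified access log: (user, url) pairs in time order.
log = [("u1", "/home"), ("u1", "/products"), ("u1", "/cart"),
       ("u2", "/home"), ("u2", "/products"), ("u2", "/products/cd42")]

transitions = Counter()
last_url = {}
for user, url in log:
    if user in last_url:
        transitions[(last_url[user], url)] += 1   # consecutive pages visited by the same user
    last_url[user] = url

for (src, dst), n in transitions.most_common():
    print(f"{src} -> {dst}: {n}")   # e.g. /home -> /products: 2
```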

Page 41:

Why Is Web Mining Different?

Huge: the Web is a huge collection of documents, plus hyperlink information and access and usage information.

Dynamic: the Web is very dynamic; new pages are constantly being generated.

Unstructured: the complexity of Web pages is far greater than that of a text document collection.

Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to exploit hyperlinks and access patterns, and to be incremental.

Page 42:

Types of Web Mining

Web Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Structure Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking

Page 43:

Web Mining Applications

E-commerce (infrastructure): generating user profiles, targeted advertising, fraud detection, similar-image retrieval.

Information retrieval (search) on the Web: automated generation of topic hierarchies, Web knowledge bases, extraction of schemas for XML documents.

Network management: performance management, fault management.

Page 44:

Existing Systems for Data Mining

IBM: Intelligent Miner
SAS Institute: Enterprise Miner
Silicon Graphics: MineSet
Integral Solutions Ltd.: Clementine
Information Discovery Inc.: Data Mining Suite

DBMiner Technology Inc.: DBMiner
Rutgers: DataMine
GMD: Explora
Univ. Munich: VisDB

Page 45:

Microsoft OLE DB for Data Mining

Microsoft OLE, OLE DB, OLE DB for OLAP, and OLE DB for Data Mining.

OLE DB for DM: standardization from July 1999 to March 2000.

Microsoft SQL Server 2000: Analysis Manager, which consists of OLAP and data mining components. The data mining part has two modules (classification/prediction and clustering).

OLE DB for DM: data mining providers (such as association modules and other classification or clustering modules).

Page 46:

Research Progress for Data Mining in the Last Decade

Multi-dimensional data analysis: data warehouses and OLAP (on-line analytical processing)
Association, correlation, and causality analysis
Classification: scalability and new approaches
Clustering and outlier analysis
Sequential patterns and time-series analysis
Text mining, Web mining, and Web log analysis
Spatial, multimedia, and scientific data analysis
Data preprocessing and database compression
Data visualization and visual data mining

Page 47:

Conclusions

Knowledge Discovery in Databases (KDD)

Data warehouses: an industry trend. A DW stores a huge amount of subject-oriented, cleansed, integrated, consolidated, time-related data.

Data mining: a rich, promising, young field with broad applications and many challenging research issues. Good science, with a leading position in the research community.

Page 48:

Conclusions (Cont.)

Data mining tasks: characterization, association, classification, clustering, prediction, sequence and pattern analysis, etc.

Data mining algorithms: the k-Nearest Neighbor algorithm (KNN), Artificial Neural Networks (ANN), Rule Induction, and Decision Trees.

Research progress and trends in data mining.

Page 49:

Future Work

Theoretical foundations of data mining.

Implementation and new data mining methodologies: a set of well-tuned, standard mining operators; data and knowledge visualization tools; integration of multiple data mining strategies.

Data mining in advanced information systems: spatial, multimedia, and Web mining.

Data mining applications: content browsing, query optimization, multi-resolution models, etc.

Social issues: a threat to security and privacy.