Knowledge Discovery in Data [and Data Mining] (KDD)

Page 1: Knowledge Discovery in Data [and Data Mining] (KDD)

Christoph F. Eick: Introduction Knowledge Discovery and Data Mining (KDD) 1

Knowledge Discovery in Data [and Data Mining] (KDD)

Let us find something interesting!

Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Frequently, the term data mining is used to refer to KDD.

Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html).

The field is more dominated by industry than by research institutions.

Page 2: Knowledge Discovery in Data [and Data Mining] (KDD)

Making Sense of Data: Knowledge Discovery and Data Mining

1. Introduction to KDD (1 class)

2. Data Warehouses and OLAP (2 classes)

3. Association Rule Mining (1.5 classes)

4. Learning to Classify (1 class)

5. Other Techniques: Clustering, Deviation Detection, Sequential Pattern Analysis (0.5 classes)

Page 3: Knowledge Discovery in Data [and Data Mining] (KDD)

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining shown at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]

Page 4: Knowledge Discovery in Data [and Data Mining] (KDD)

General KDD Steps

[Figure: data sources -> (select/preprocess) -> selected/preprocessed data -> (transform) -> transformed data -> (data mine) -> extracted information -> (interpret/evaluate/assimilate) -> knowledge. The figure also carries the label "Data preparation".]
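The four steps can be read as a chain of functions, each consuming the output of the previous one. The following is only a minimal sketch: the records, the field names, and the four function bodies are hypothetical placeholders chosen for illustration and are not taken from the slides.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
data_sources = [
    {"age": 27, "city": "sf", "bought": "yes"},
    {"age": None, "city": "la", "bought": "no"},
    {"age": 35, "city": "la", "bought": "yes"},
    {"age": 50, "city": "la", "bought": "no"},
]

def select_preprocess(records):
    """Select/preprocess: keep only records without missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transform: replace the numeric age by a coarse age group."""
    return [
        {"age_group": "under40" if r["age"] < 40 else "40plus",
         "city": r["city"], "bought": r["bought"]}
        for r in records
    ]

def data_mine(records):
    """Data mine: count how often each (age_group, bought) pair occurs."""
    return Counter((r["age_group"], r["bought"]) for r in records)

def interpret(counts):
    """Interpret/evaluate: express the counts as readable statements."""
    return [f"{n} customer(s) in group {group} answered {answer}"
            for (group, answer), n in sorted(counts.items())]

knowledge = interpret(data_mine(transform(select_preprocess(data_sources))))
for statement in knowledge:
    print(statement)
```

In a real project each step is far more involved, and the process is usually iterative rather than a single pass.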

Page 5: Knowledge Discovery in Data [and Data Mining] (KDD)

Popular KDD Tasks
Classification (try to learn how to classify)
Clustering (finding groups of similar objects)
Estimation and Prediction (try to learn a function that predicts the value of a continuous output variable based on a set of input variables)
Bayesian and Dependency Networks
Deviation and Fraud Detection
Text Mining
Web Mining
Visualization
Transformation and Data Cleaning

Page 6: Knowledge Discovery in Data [and Data Mining] (KDD)

KDD and Classical Data Analysis

KDD is less focused than classical data analysis in that it looks for interesting patterns in data; classical data analysis centers on analyzing particular relationships in data. The notion of interestingness is a key concept in KDD. Classical data analysis centers more on generating and testing pre-structured hypotheses with respect to a given sample set.

KDD is more centered on analyzing large volumes of data (many fields, many tuples, many tables, …).

In a nutshell, the KDD process consists of preprocessing (generating a target data set), data mining (finding something interesting in the data set), and postprocessing (representing the found patterns in understandable form and evaluating their usefulness in a particular domain); classical data analysis is less concerned with the preprocessing step.

KDD involves collaboration between multiple disciplines, namely statistics, AI, visualization, and databases.

KDD employs non-traditional data analysis techniques (neural networks, association rules, decision trees, fuzzy logic, evolutionary computing, …).

Page 7: Knowledge Discovery in Data [and Data Mining] (KDD)

Generating Models as an Example

The goal of model generation (sometimes also called predictive data mining) is the creation, evaluation, and use of models to make predictions and to understand the relationships between various variables that are described in a data collection. Typical example applications include:
– generate a model that predicts a student's academic performance based on the applicant's data, such as the applicant's past grades, test scores, past degree, …
– generate a model that predicts (based on economic data) which stocks to sell, hold, and buy.
– generate a model to predict if a patient suffers from a particular disease based on the patient's medical and other data.

Model generation centers on deriving a function that can predict a variable using the values of other variables: v = f(a1, …, an)

Neural networks, decision trees, naïve Bayesian classifiers and networks, regression analysis and many other statistical techniques, fuzzy logic and neuro-fuzzy systems, and association rules are the most popular model generation tools in the KDD area.

All model generation tools and environments employ the basic train-evaluate-predict cycle.
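The train-evaluate-predict cycle can be illustrated with the simplest case of v = f(a1): a straight line fitted by ordinary least squares. This is only a sketch under stated assumptions: the data (test score, later grade point average) is invented for illustration, and least-squares regression stands in for whichever model generation tool is actually used.

```python
def train(examples):
    """Train: fit v = slope * a + intercept by ordinary least squares."""
    n = len(examples)
    mean_a = sum(a for a, _ in examples) / n
    mean_v = sum(v for _, v in examples) / n
    slope = (sum((a - mean_a) * (v - mean_v) for a, v in examples)
             / sum((a - mean_a) ** 2 for a, _ in examples))
    intercept = mean_v - slope * mean_a
    return lambda a: slope * a + intercept

def evaluate(model, examples):
    """Evaluate: mean absolute error on examples not used for training."""
    return sum(abs(model(a) - v) for a, v in examples) / len(examples)

# Hypothetical data: (test score, later grade point average).
training_set = [(50, 2.0), (60, 2.5), (70, 3.0), (80, 3.5)]
test_set = [(65, 2.8), (75, 3.2)]

f = train(training_set)           # train
error = evaluate(f, test_set)     # evaluate on held-out data
prediction = f(90)                # predict for a new input

print(f"mean absolute error: {error:.3f}")
print(f"predicted value for a1 = 90: {prediction:.2f}")
```

Keeping the test set separate from the training set is the point of the evaluate step: the error is measured on data the model has not seen.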

Page 8: Knowledge Discovery in Data [and Data Mining] (KDD)

Why Do We Need so many Data Mining / Analysis Techniques?
No generally good technique exists.
Different methods make different assumptions with respect to the data set to be analyzed (to be discussed on the next transparency)
Cross-fertilization between different methods is desirable and frequently helpful in obtaining a deeper understanding of the analyzed dataset.

Page 9: Knowledge Discovery in Data [and Data Mining] (KDD)

Motivation: “Necessity is the Mother of Invention”

Data explosion problem
– Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Page 10: Knowledge Discovery in Data [and Data Mining] (KDD)

Why Data Mining? — Potential Applications

Database analysis and decision support
– Market analysis and management
• target marketing, customer relation management, market basket analysis, cross selling, market segmentation
– Risk analysis and management
• forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Fraud detection and management
Other Applications
– Text mining (news group, email, documents) and Web analysis
– Intelligent query answering

Page 11: Knowledge Discovery in Data [and Data Mining] (KDD)

Market Analysis and Management
Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
– Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information

Page 12: Knowledge Discovery in Data [and Data Mining] (KDD)

Fraud Detection and Management
Applications
– widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
– use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
– auto insurance: detect a group of people who stage accidents to collect on insurance
– money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of doctors, and rings of references

Page 13: Knowledge Discovery in Data [and Data Mining] (KDD)

Other Applications

Sports

– IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat

Astronomy

– JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

Internet Web Surf-Aid

– IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Page 14: Knowledge Discovery in Data [and Data Mining] (KDD)

Data Mining and Business Intelligence

[Figure: pyramid; the potential to support business decisions increases toward the top. Labels from top to bottom: Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration; OLAP, MDA; Statistical Analysis, Querying and Reporting; Data Warehouses / Data Marts; Data Sources (Paper, Files, Information Providers, Database Systems, OLTP). Roles shown alongside: End User, Business Analyst, Data Analyst, DBA.]

Page 15: Knowledge Discovery in Data [and Data Mining] (KDD)

Architecture of a Typical Data Mining System

[Figure: system components: graphical user interface; pattern evaluation; data mining engine; knowledge base; database or data warehouse server; databases and data warehouse, reached through data cleaning & data integration and filtering]

Page 16: Knowledge Discovery in Data [and Data Mining] (KDD)

An OLAM Architecture

[Figure: four-layer architecture. Layer 1, Data Repository: databases and data warehouse with meta data (data cleaning, data integration, filtering). Layer 2, MDDB. Layer 3, OLAP/OLAM: OLAP engine and OLAM engine. Layer 4, User Interface: mining query in, mining result out. Interfaces labeled in the figure: Database API, Data Cube API, User GUI API, Filtering & Integration.]

Page 17: Knowledge Discovery in Data [and Data Mining] (KDD)

Example: Decision Tree Approach

Page 18: Knowledge Discovery in Data [and Data Mining] (KDD)

Decision Tree Approach 2

Page 19: Knowledge Discovery in Data [and Data Mining] (KDD)

Decision Trees

Example:
• Conducted survey to see which customers were interested in a new model car
• Want to select customers for advertising campaign

Training set (table "sale"):

custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no

Page 20: Knowledge Discovery in Data [and Data Mining] (KDD)

One Possibility

(Same training set "sale" as on Page 19.)

age<30
  Y: city=sf
       Y: likely
       N: unlikely
  N: car=van
       Y: likely
       N: unlikely
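Read as code, this tree is a pair of nested tests. The sketch below assumes the branch layout suggested by the slide residue (root age<30; the yes-branch tests city=sf, the no-branch tests car=van; each yes-leaf is "likely"), which is a reconstruction rather than something the transcript states explicitly. It then checks the tree against the six survey records.

```python
# Training set "sale" from the slide: (custId, car, age, city, newCar)
training_set = [
    ("c1", "taurus", 27, "sf", "yes"),
    ("c2", "van",    35, "la", "yes"),
    ("c3", "van",    40, "sf", "yes"),
    ("c4", "taurus", 22, "sf", "yes"),
    ("c5", "merc",   50, "la", "no"),
    ("c6", "taurus", 25, "la", "no"),
]

def classify(car, age, city):
    """Assumed tree: root age<30, then city=sf (yes) or car=van (no)."""
    if age < 30:
        return "likely" if city == "sf" else "unlikely"
    return "likely" if car == "van" else "unlikely"

predictions = {cust_id: classify(car, age, city)
               for cust_id, car, age, city, _ in training_set}

# The tree fits the survey if "likely" coincides with newCar == "yes".
consistent = all((predictions[cust_id] == "likely") == (new_car == "yes")
                 for cust_id, *_, new_car in training_set)

print(predictions)
print("consistent with training set:", consistent)
```

Under this reading the tree classifies all six training records correctly, which says nothing yet about how it behaves on customers outside the survey.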

Page 21: Knowledge Discovery in Data [and Data Mining] (KDD)

Another Possibility

(Same training set "sale" as on Page 19.)

car=taurus
  Y: city=sf
       Y: likely
       N: unlikely
  N: age<45
       Y: likely
       N: unlikely
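Both candidate trees can fit the same training data and still disagree on new customers. The sketch below encodes the two trees under an assumed branch layout (reconstructed from the slide residue: the yes-branch of each root tests city=sf, and each yes-leaf is "likely") and compares them on the survey records and on one hypothetical customer that is not part of the slides.

```python
# Training set "sale" from the slides: (custId, car, age, city, newCar)
training_set = [
    ("c1", "taurus", 27, "sf", "yes"),
    ("c2", "van",    35, "la", "yes"),
    ("c3", "van",    40, "sf", "yes"),
    ("c4", "taurus", 22, "sf", "yes"),
    ("c5", "merc",   50, "la", "no"),
    ("c6", "taurus", 25, "la", "no"),
]

def tree_age_first(car, age, city):
    """First possibility (assumed layout): root age<30."""
    if age < 30:
        return "likely" if city == "sf" else "unlikely"
    return "likely" if car == "van" else "unlikely"

def tree_car_first(car, age, city):
    """Second possibility (assumed layout): root car=taurus."""
    if car == "taurus":
        return "likely" if city == "sf" else "unlikely"
    return "likely" if age < 45 else "unlikely"

# Both trees give the same answer on every training record ...
agree_on_training = all(
    tree_age_first(car, age, city) == tree_car_first(car, age, city)
    for _, car, age, city, _ in training_set
)

# ... but not on this hypothetical unseen customer (van, 47, la).
unseen = ("van", 47, "la")
answers = (tree_age_first(*unseen), tree_car_first(*unseen))

print("agree on training set:", agree_on_training)
print("answers for unseen customer:", answers)
```

This is why fitting the training set is not enough to choose between trees; some further criterion, such as performance on held-out data, is needed.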

Page 22: Knowledge Discovery in Data [and Data Mining] (KDD)

Example: Nearest Neighbor Approach

Page 23: Knowledge Discovery in Data [and Data Mining] (KDD)

Clustering

[Figure: plot with the axes age, income, and education]

Page 24: Knowledge Discovery in Data [and Data Mining] (KDD)

Another Example: Text

Each document is a vector
– e.g., <100110...> contains words 1,4,5,...
Clusters contain “similar” documents
Useful for understanding, searching documents

[Figure: document clusters labeled international news, sports, business]
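A small sketch of the vector representation: each position of the vector stands for one vocabulary word and holds 1 if the document contains it. The six-word vocabulary, the sample sentences, and the choice of Jaccard similarity as the notion of “similar” are all assumptions made for illustration; the slides do not specify a similarity measure.

```python
# Hypothetical vocabulary; position i of a vector corresponds to word i+1.
vocabulary = ["goal", "market", "election", "team", "score", "stock"]

def to_vector(text):
    """Binary vector: 1 where the document contains the vocabulary word."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

def similarity(u, v):
    """Jaccard similarity: shared words divided by words in either document."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

doc_a = to_vector("Goal by the home team lifts the score")
doc_b = to_vector("The team celebrates a late goal")
doc_c = to_vector("Stock market slides after election")

print(doc_a)                     # contains words 1, 4, 5
print(similarity(doc_a, doc_b))  # two sports documents
print(similarity(doc_a, doc_c))  # sports versus business
```

Documents with a high pairwise similarity would end up in the same cluster; real systems typically use weighted term vectors rather than plain 0/1 entries.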

Page 25: Knowledge Discovery in Data [and Data Mining] (KDD)

Issues
Given desired number of clusters?
Finding “best” clusters
Are clusters semantically meaningful?
– e.g., “yuppies” cluster?
Using clusters for disk storage
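The first issue, whether the number of clusters is given, shows up directly in k-means, where k is an input. The slides do not name k-means; it is used here only as a familiar example. The sketch is a minimal version with a deterministic start (the first k points serve as initial centroids), and the (age, income in thousands) points are invented.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Minimal k-means; the caller must supply the number of clusters k."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (age, income in thousands)
customers = [(25, 30), (27, 32), (26, 31), (50, 80), (52, 82), (51, 81)]
centroids, clusters = kmeans(customers, k=2)

print(centroids)
print(clusters)
```

Whether the resulting groups are semantically meaningful is a separate question that the algorithm itself does not answer.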

Page 26: Knowledge Discovery in Data [and Data Mining] (KDD)

Association Rule Mining

Sales records (market-basket data):

transaction id  customer id  products bought
tran1           cust33       p2, p5, p8
tran2           cust45       p5, p8, p11
tran3           cust12       p1, p9
tran4           cust40       p5, p8, p11
tran5           cust12       p2, p9
tran6           cust12       p9

• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9
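Both trends can be checked by simple counting over the sales records. The sketch below counts how often each pair of products occurs in the same transaction and how often each customer buys each product; it is a brute-force illustration on the six records from the slide, not an association rule mining algorithm that would scale to large databases.

```python
from collections import Counter
from itertools import combinations

# Sales records from the slide: (transaction id, customer id, products bought)
sales = [
    ("tran1", "cust33", ["p2", "p5", "p8"]),
    ("tran2", "cust45", ["p5", "p8", "p11"]),
    ("tran3", "cust12", ["p1", "p9"]),
    ("tran4", "cust40", ["p5", "p8", "p11"]),
    ("tran5", "cust12", ["p2", "p9"]),
    ("tran6", "cust12", ["p9"]),
]

# Count every unordered product pair that appears in one transaction.
pair_counts = Counter()
for _, _, products in sales:
    pair_counts.update(combinations(sorted(products), 2))

# Support of a pair: fraction of all transactions containing both products.
support = {pair: count / len(sales) for pair, count in pair_counts.items()}

# Count how often each customer bought each product.
customer_products = Counter(
    (customer, product) for _, customer, products in sales for product in products
)

print(pair_counts.most_common(1))
print(support[("p5", "p8")])
print(customer_products[("cust12", "p9")])
```

The pair (p5, p8) appears in three of the six transactions, and cust12 bought p9 in all three of its transactions, which matches the two trends listed on the slide.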

Page 27: Knowledge Discovery in Data [and Data Mining] (KDD)

Summary Data Mining
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining

Page 28: Knowledge Discovery in Data [and Data Mining] (KDD)

Where to Find References?
Data mining and KDD (SIGKDD member CDROM):
– Conference proceedings: KDD, and others, such as PKDD, PAKDD, etc.
– Journal: Data Mining and Knowledge Discovery
Database field (SIGMOD member CD ROM):
– Conference proceedings: ACM-SIGMOD, ACM-PODS, VLDB, ICDE, EDBT, DASFAA
– Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
AI and Machine Learning:
– Conference proceedings: Machine learning, AAAI, IJCAI, etc.
– Journals: Machine Learning, Artificial Intelligence, etc.
Statistics:
– Conference proceedings: Joint Stat. Meeting, etc.
– Journals: Annals of Statistics, etc.
Visualization:
– Conference proceedings: CHI, etc.
– Journals: IEEE Trans. Visualization and Computer Graphics, etc.