Transcript
Page 1: Overview of Data Mining and the KDD Process

Overview of Data Mining and the KDD Process

Bamshad Mobasher, DePaul University

Page 2: Overview of Data Mining and the KDD Process

From Data to Wisdom

• Data
  - The raw material of information
• Information
  - Data organized and presented by someone
• Knowledge
  - Information read, heard or seen and understood and integrated
• Wisdom
  - Distilled knowledge and understanding which can lead to decisions

[Figure: The Information Hierarchy; levels from top to bottom: Wisdom, Knowledge, Information, Data]

Page 3: Overview of Data Mining and the KDD Process

Why Data Mining?

• The Explosive Growth of Data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: Remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, images, video, documents
    - Internet …

Page 5: Overview of Data Mining and the KDD Process

How much data?

• Google: ~20-30 PB a day
• Wayback Machine has ~4 PB + 100-200 TB/month
• Facebook: ~3 PB of user data + 25 TB/day
• eBay: ~7 PB of user data + 50 TB/day
• CERN’s Large Hadron Collider generates 15 PB a year
• In 2010, enterprises stored 7 Exabytes = 7,000,000,000 GB

640K ought to be enough for anybody.

Page 6: Overview of Data Mining and the KDD Process

Big Data Growing

The Untapped Data Gap: Most of the useful data will not be tagged or analyzed – partly due to skill shortage

IDC predicts: From 2005 to 2020, the digital universe will double every 2 years and grow from 130 exabytes to 40,000 exabytes, or 5,200 GB / person in 2020.
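The quoted figures can be sanity-checked with a little arithmetic. The sketch below is not part of the original slides; it assumes decimal units (1 EB = 10^9 GB) and uses only the numbers quoted above to derive the doubling time and population they imply.

```python
import math

# Figures quoted on the slide (IDC forecast).
start_eb = 130        # digital universe in 2005, in exabytes
end_eb = 40_000       # forecast for 2020, in exabytes
years = 2020 - 2005

GB_PER_EB = 10**9     # assumption: decimal units, 1 EB = 1,000,000,000 GB

# Overall growth and the doubling time it implies.
growth_factor = end_eb / start_eb
doublings = math.log2(growth_factor)
implied_doubling_time = years / doublings

# World population implied by the "5,200 GB / person" figure.
implied_population = end_eb * GB_PER_EB / 5_200

print(f"growth factor: {growth_factor:.0f}x")
print(f"implied doubling time: {implied_doubling_time:.2f} years")
print(f"implied population: {implied_population / 1e9:.1f} billion")
```

Under these assumptions the implied doubling time comes out a little under two years, so "double every 2 years" is a rounded description of the 130-to-40,000 exabyte growth rather than an exact one.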

Page 7: Overview of Data Mining and the KDD Process

What Is Data Mining?

• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: Data mining, the automated analysis of massive data sets
• Data Mining: A Definition
  The non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data in large data repositories
  - Non-trivial: obvious knowledge is not useful
  - Implicit: hidden, difficult-to-observe knowledge
  - Previously unknown
  - Potentially useful: actionable; easy to understand

Page 8: Overview of Data Mining and the KDD Process

Data Mining: Confluence of Multiple Disciplines

[Diagram: Data Mining at the center, surrounded by Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, Database Technology]

Page 9: Overview of Data Mining and the KDD Process

Data Mining’s Virtuous Cycle

1. Identifying the problem

2. Mining data to transform it into actionable information

3. Acting on the information

4. Measuring the results

Page 10: Overview of Data Mining and the KDD Process

The Knowledge Discovery Process

• Data Mining vs. Knowledge Discovery in Databases (KDD)
  - DM and KDD are often used interchangeably
  - Actually, DM is only part of the KDD process

- The KDD Process

Page 11: Overview of Data Mining and the KDD Process

Types of Knowledge Discovery

• Two kinds of knowledge discovery: directed and undirected
• Directed Knowledge Discovery
  - Purpose: Explain the value of some field in terms of all the others (goal-oriented)
  - Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances
  - Examples:
    - what products show increased sales when cream cheese is discounted
    - which banner ad to use on a web page for a given user coming to the site
• Undirected Knowledge Discovery
  - Purpose: Find patterns in the data that may be interesting (no target field)
  - Method: clustering, affinity grouping
  - Examples:
    - which products in the catalog often sell together
    - market segmentation (find groups of customers/users with similar characteristics or behavioral patterns)
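The contrast between directed and undirected discovery can be sketched on a toy data set. Everything below is an illustrative assumption rather than material from the slides: the customer records, the `responded` target field, and the choice of a nearest-neighbor predictor (directed) and a two-cluster k-means-style grouping (undirected).

```python
import math

# Hypothetical toy records: (visits per month, average basket in dollars).
customers = [(1, 12.0), (2, 15.0), (1, 10.0), (9, 80.0), (10, 95.0), (8, 70.0)]
# Target field for the directed case: did the customer respond to a promotion?
responded = [False, False, False, True, True, True]


def distance(a, b):
    """Euclidean distance between two records."""
    return math.dist(a, b)


def predict_nearest_neighbor(new_record):
    """Directed: predict the target field of a new instance from labeled data."""
    nearest = min(range(len(customers)), key=lambda i: distance(customers[i], new_record))
    return responded[nearest]


def two_means(records, iterations=10):
    """Undirected: split records into two groups without using any target field."""
    centers = [min(records), max(records)]  # simple deterministic initialization
    for _ in range(iterations):
        groups = [[], []]
        for r in records:
            closest = 0 if distance(r, centers[0]) <= distance(r, centers[1]) else 1
            groups[closest].append(r)
        # Recompute each center as the mean of its group (keep old center if empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return groups


print(predict_nearest_neighbor((7, 60.0)))  # uses the target field
print(two_means(customers))                 # ignores the target field
```

The directed function needs the labeled `responded` column to answer a specific question, while the undirected one only looks for structure in the records themselves. The features are left unscaled to keep the sketch short.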

Page 12: Overview of Data Mining and the KDD Process

From Data Mining to Data Science

Page 13: Overview of Data Mining and the KDD Process

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
  - Object-relational databases, heterogeneous databases and legacy databases
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and information networks
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

Page 14: Overview of Data Mining and the KDD Process

Data Mining: What Kind of Data?

• Structured Databases
  - relational, object-relational, etc.
  - can use SQL to perform parts of the process
    e.g., SELECT count(*) FROM Items WHERE type='video' GROUP BY category
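To show the kind of query mentioned above actually running, here is a minimal sketch using Python's built-in `sqlite3` module. The `Items` table layout and its rows are hypothetical, and the query additionally selects `category` (not in the slide's version) so that each count is labeled.

```python
import sqlite3

# Hypothetical Items table; the column names follow the slide's query,
# the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("Item A", "video", "comedy"),
        ("Item B", "video", "comedy"),
        ("Item C", "video", "drama"),
        ("Item D", "book", "drama"),
    ],
)

# Count video items per category. The string literal is quoted, and the
# category column is selected so each count is labeled.
rows = conn.execute(
    "SELECT category, count(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
conn.close()

for category, n in rows:
    print(category, n)
```

SQL handles the selection and aggregation step here; the pattern-finding steps of the process would typically happen outside the database.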

Page 15: Overview of Data Mining and the KDD Process

Data Mining: What Kind of Data?

• Flat Files
  - most common data source
  - can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
• Transactional databases
  - set of records each with a transaction id, time stamp, and a set of items
  - may have an associated “description” file for the items
  - typical source of data used in market basket analysis
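The transactional record layout described above (transaction id, time stamp, set of items) maps directly onto a simple market basket computation. The sketch below uses made-up transactions and counts how often each pair of items appears together; it is a simplified illustration, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: (transaction id, time stamp, set of items).
transactions = [
    (1, "2024-01-05 09:12", {"bagels", "cream cheese", "coffee"}),
    (2, "2024-01-05 09:40", {"bagels", "cream cheese"}),
    (3, "2024-01-05 10:03", {"coffee", "milk"}),
    (4, "2024-01-05 10:15", {"bagels", "coffee"}),
]

# Count how many transactions contain each unordered pair of items.
pair_counts = Counter()
for _tid, _timestamp, items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of all transactions that contain both items.
support = {pair: count / len(transactions) for pair, count in pair_counts.items()}

# List pairs from most to least frequently co-purchased.
for pair, s in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
    print(pair, s)
```

Sorting the items before forming pairs keeps each pair in one canonical order, so `("bagels", "coffee")` and `("coffee", "bagels")` are never counted separately.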

Page 16: Overview of Data Mining and the KDD Process

Data Mining: What Kind of Data?

• Other Types of Databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information, such as maps, or satellite imaging data, etc.)
  - Time Series Temporal Data (time dependent information such as stock market data; usually very dynamic)
• World Wide Web
  - basically a large, heterogeneous, distributed database
  - need for new or additional tools and techniques
    - information retrieval, filtering and extraction
    - agents to assist in browsing and filtering
    - Web content, usage, and structure (linkage) mining tools
  - The “social Web”
    - User generated meta-data, social networks, shared resources, etc.

Page 17: Overview of Data Mining and the KDD Process

What Can Data Mining Do?

• Many Data Mining Tasks
  - often inter-related
  - often need to try different techniques/algorithms for each task
  - each task may require different types of knowledge discovery
• What are some data mining tasks?
  - Classification
  - Prediction
  - Clustering
  - Affinity Grouping / Association discovery
  - Sequence Analysis
  - Characterization
  - Discrimination

Page 18: Overview of Data Mining and the KDD Process

Some Applications of Data Mining

• Business data analysis and decision support
  - Marketing focalization
    - Recognizing specific market segments that respond to particular characteristics
    - Return on mailing campaign (target marketing)
  - Customer Profiling
    - Segmentation of customers for marketing strategies and/or product offerings
    - Customer behavior understanding
    - Customer retention and loyalty
    - Mass customization / personalization

Page 19: Overview of Data Mining and the KDD Process

Some Applications of Data Mining

• Business data analysis and decision support (cont.)
  - Market analysis and management
    - Provide summary information for decision-making
    - Market basket analysis, cross selling, market segmentation
    - Resource planning
  - Risk analysis and management
    - "What if" analysis
    - Forecasting
    - Pricing analysis, competitive analysis
    - Time-series analysis (e.g., stock market)

Page 20: Overview of Data Mining and the KDD Process

Some Applications of Data Mining

• Fraud detection
  - Detecting telephone fraud:
    - Telephone call model: destination of the call, duration, time of day or week
    - Analyze patterns that deviate from an expected norm
    - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud scheme
  - Detection of credit-card fraud
  - Detecting suspicious money transactions (money laundering)
• Text mining:
  - Message filtering (e-mail, newsgroups, etc.)
  - Newspaper articles analysis
  - Text and document categorization
• Web Mining
  - Mining patterns from the content, usage, and structure of Web resources
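The fraud-detection idea of flagging "patterns that deviate from an expected norm" can be illustrated with a very small sketch. It assumes the norm is just the mean and standard deviation of one account's past call durations, which is far simpler than the call model on the slide (destination, duration, time of day or week) and is not a description of how any real telecom system works. The data and the threshold of 3 standard deviations are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical call durations in minutes for one account.
history = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]
# New calls to check; the last one is deliberately unusual.
new_calls = [4.0, 3.2, 95.0]

# The "expected norm" is modeled here as the mean and standard deviation of past calls.
mu = mean(history)
sigma = stdev(history)


def is_anomalous(duration, threshold=3.0):
    """Flag a call whose duration is more than `threshold` standard deviations from the mean."""
    return abs(duration - mu) / sigma > threshold


flagged = [d for d in new_calls if is_anomalous(d)]
print(flagged)
```

A fuller model would score several attributes of each call together rather than duration alone.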

Page 21: Overview of Data Mining and the KDD Process

Types of Web Mining

[Diagram: Web Mining branches into Web Content Mining, Web Structure Mining, and Web Usage Mining]

Page 22: Overview of Data Mining and the KDD Process

Types of Web Mining

Applications:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools

Page 23: Overview of Data Mining and the KDD Process

Types of Web Mining

Applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems

Page 24: Overview of Data Mining and the KDD Process

Types of Web Mining

Applications:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis
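The slide does not say how “hubs” and “authorities” are discovered. One well-known approach is HITS-style mutual reinforcement: a good hub links to good authorities, and a good authority is linked to by good hubs. The sketch below assumes that approach and runs it on a hypothetical four-page link graph.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["C", "D"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}


def hits(graph, iterations=20):
    """Iteratively compute hub and authority scores (HITS-style updates)."""
    hubs = {page: 1.0 for page in graph}
    auths = {page: 1.0 for page in graph}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to this page.
        auths = {
            page: sum(hubs[src] for src, targets in graph.items() if page in targets)
            for page in graph
        }
        # Hub score: sum of authority scores of the pages this page links to.
        hubs = {page: sum(auths[t] for t in graph[page]) for page in graph}
        # Normalize so scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auths.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hubs.values())) or 1.0
        auths = {p: v / a_norm for p, v in auths.items()}
        hubs = {p: v / h_norm for p, v in hubs.items()}
    return hubs, auths


hubs, auths = hits(links)
print("top authority:", max(auths, key=auths.get))
print("hub scores:", hubs)
```

In this toy graph, page D receives the most links and ends up with the highest authority score, while A and B, which link to both C and D, score highest as hubs.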

Page 25: Overview of Data Mining and the KDD Process

The Knowledge Discovery Process

- The KDD Process

• Next: We first focus on understanding the data and data preparation/transformation

