Overview of Data Mining and the KDD Process Bamshad Mobasher DePaul University


Post on 25-Feb-2016


Page 1: Overview of Data Mining and the KDD Process

Overview of Data Mining and the KDD Process

Bamshad Mobasher
DePaul University

Page 2: Overview of Data Mining and the KDD Process

From Data to Wisdom

• Data
  - The raw material of information
• Information
  - Data organized and presented by someone
• Knowledge
  - Information read, heard or seen and understood and integrated
• Wisdom
  - Distilled knowledge and understanding which can lead to decisions

[Figure: The Information Hierarchy (top to bottom: Wisdom, Knowledge, Information, Data)]

Page 3: Overview of Data Mining and the KDD Process

Why Data Mining?

• The Explosive Growth of Data: from terabytes to petabytes
  - Data collection and data availability
    * Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    * Business: Web, e-commerce, transactions, stocks, …
    * Science: Remote sensing, bioinformatics, scientific simulation, …
    * Society and everyone: news, images, video, documents
    * Internet …

Page 5: Overview of Data Mining and the KDD Process

How much data?

• Google: ~20-30 PB a day
• Wayback Machine has ~4 PB + 100-200 TB/month
• Facebook: ~3 PB of user data + 25 TB/day
• eBay: ~7 PB of user data + 50 TB/day
• CERN’s Large Hadron Collider generates 15 PB a year
• In 2010, enterprises stored 7 exabytes = 7,000,000,000 GB

640K ought to be enough for anybody.

Page 6: Overview of Data Mining and the KDD Process

Big Data Growing

The Untapped Data Gap: Most of the useful data will not be tagged or analyzed, partly due to skill shortage

IDC predicts: From 2005 to 2020, the digital universe will double every 2 years and grow from 130 exabytes to 40,000 exabytes, or 5,200 GB / person in 2020.
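
As a rough sanity check on the IDC figures above, the arithmetic can be worked through in a few lines. This is only a sketch: it assumes decimal units (1 exabyte = 10^9 GB), and the variable names are illustrative rather than anything taken from the slides.

```python
import math

GB_PER_EB = 10**9  # assumption: decimal (SI) units, 1 EB = 1e9 GB

start_eb, end_eb = 130, 40_000  # size of the digital universe in 2005 and 2020 (from the slide)
years = 2020 - 2005

# Number of doublings needed to grow from 130 EB to 40,000 EB
doublings = math.log2(end_eb / start_eb)

# Doubling period implied by the two endpoints
implied_doubling_years = years / doublings

# World population implied by 5,200 GB per person in 2020
implied_population = end_eb * GB_PER_EB / 5_200

print(f"doublings: {doublings:.2f}")
print(f"implied doubling period: {implied_doubling_years:.2f} years")
print(f"implied population: {implied_population / 1e9:.2f} billion")
```

The two endpoints imply a doubling period a little under two years, so "double every 2 years" reads as a rounded figure, and the per-person number corresponds to a population of roughly 7.7 billion.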

Page 7: Overview of Data Mining and the KDD Process

What Is Data Mining?

• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: Data mining, the automated analysis of massive data sets
• Data Mining: A Definition

  The non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data in large data repositories

  - Non-trivial: obvious knowledge is not useful
  - Implicit: hidden, difficult-to-observe knowledge
  - Previously unknown
  - Potentially useful: actionable; easy to understand

Page 8: Overview of Data Mining and the KDD Process

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, and Database Technology]

Page 9: Overview of Data Mining and the KDD Process


Data Mining’s Virtuous Cycle

1. Identifying the problem

2. Mining data to transform it into actionable information

3. Acting on the information

4. Measuring the results

Page 10: Overview of Data Mining and the KDD Process

The Knowledge Discovery Process

• Data Mining vs. Knowledge Discovery in Databases (KDD)
  - DM and KDD are often used interchangeably
  - Actually, DM is only part of the KDD process

- The KDD Process

Page 11: Overview of Data Mining and the KDD Process

Types of Knowledge Discovery

• Two kinds of knowledge discovery: directed and undirected
• Directed Knowledge Discovery
  - Purpose: Explain the value of some field in terms of all the others (goal-oriented)
  - Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances
  - Examples:
    * what products show increased sales when cream cheese is discounted
    * which banner ad to use on a web page for a given user coming to the site
• Undirected Knowledge Discovery
  - Purpose: Find patterns in the data that may be interesting (no target field)
  - Method: clustering, affinity grouping
  - Examples:
    * which products in the catalog often sell together
    * market segmentation (find groups of customers/users with similar characteristics or behavioral patterns)
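
The contrast between the two kinds of discovery can be sketched on a toy data set. The records, field names, and the "one rule" majority-vote learner below are all hypothetical illustrations, not a method from the slides: the directed function is told which field is the target, while the undirected one is given no target and only reports recurring combinations.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (time_of_day, device, clicked_ad)
records = [
    ("morning", "mobile", "sports"),
    ("morning", "desktop", "news"),
    ("evening", "mobile", "sports"),
    ("evening", "desktop", "movies"),
    ("morning", "mobile", "sports"),
    ("evening", "desktop", "movies"),
]

def train_one_rule(rows, feature_index, target_index):
    """Directed: for each value of one field, learn the most common target value."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[feature_index]][row[target_index]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

def recurring_combinations(rows):
    """Undirected: no target field; count how often each full combination occurs."""
    return Counter(rows)

# Directed: predict which banner ad to show (target) from the device (feature)
rule = train_one_rule(records, feature_index=1, target_index=2)
print(rule["mobile"], rule["desktop"])

# Undirected: look for combinations that recur, without naming a target
profile = recurring_combinations(records)
print(profile.most_common(2))
```

In practice the directed case would use a proper classifier and the undirected case a clustering or affinity-grouping method; the sketch only shows where the target field enters.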

Page 12: Overview of Data Mining and the KDD Process

From Data Mining to Data Science


Page 13: Overview of Data Mining and the KDD Process

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
  - Object-relational databases, heterogeneous databases and legacy databases
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and information networks
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

Page 14: Overview of Data Mining and the KDD Process

Data Mining: What Kind of Data?

• Structured Databases
  - relational, object-relational, etc.
  - can use SQL to perform parts of the process
    e.g., SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category
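
To make the SQL step concrete, here is a minimal runnable sketch using Python's built-in sqlite3 module. The `Items` table layout (`name`, `type`, `category`) and the sample rows are assumptions made up for illustration; the slide only gives the query. The string literal is quoted and the grouping column is selected so that each count is labeled with its category.

```python
import sqlite3

# In-memory database with a hypothetical Items table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("A", "video", "comedy"),
        ("B", "video", "drama"),
        ("C", "video", "comedy"),
        ("D", "book", "drama"),
    ],
)

# Count video items per category; ORDER BY makes the result order deterministic
rows = conn.execute(
    "SELECT category, COUNT(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
print(rows)
conn.close()
```

A summary like this is typically one early step (selection and aggregation) rather than the mining itself.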

Page 15: Overview of Data Mining and the KDD Process

Data Mining: What Kind of Data?

• Flat Files
  - most common data source
  - can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
• Transactional databases
  - set of records each with a transaction id, time stamp, and a set of items
  - may have an associated “description” file for the items
  - typical source of data used in market basket analysis
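
The transactional record layout described above (transaction id, time stamp, set of items) lends itself to a simple market basket computation. The sketch below counts how often each pair of items appears together; the sample transactions and the `pair_support` helper are hypothetical, and real association-rule miners (for example Apriori-style algorithms) do considerably more than this.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: (transaction id, time stamp, set of items)
transactions = [
    (1, "2016-02-01 09:15", {"bagels", "cream cheese", "coffee"}),
    (2, "2016-02-01 09:40", {"bagels", "cream cheese"}),
    (3, "2016-02-01 10:05", {"coffee", "milk"}),
    (4, "2016-02-01 10:30", {"bagels", "coffee"}),
]

def pair_support(txns):
    """Return the fraction of transactions that contain each unordered item pair."""
    counts = Counter()
    for _tid, _timestamp, items in txns:
        # Sort so each pair is always counted under the same key
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n / len(txns) for pair, n in counts.items()}

support = pair_support(transactions)
# Most frequent pairs first, ties broken alphabetically
for pair, s in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
    print(pair, s)
```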

Page 16: Overview of Data Mining and the KDD Process

Data Mining: What Kind of Data?

• Other Types of Databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information, such as maps, or satellite imaging data, etc.)
  - time series / temporal data (time-dependent information such as stock market data; usually very dynamic)
• World Wide Web
  - basically a large, heterogeneous, distributed database
  - need for new or additional tools and techniques
    * information retrieval, filtering and extraction
    * agents to assist in browsing and filtering
    * Web content, usage, and structure (linkage) mining tools
  - The “social Web”
    * User-generated meta-data, social networks, shared resources, etc.

Page 17: Overview of Data Mining and the KDD Process

What Can Data Mining Do?

• Many Data Mining Tasks
  - often inter-related
  - often need to try different techniques/algorithms for each task
  - each task may require different types of knowledge discovery
• What are some data mining tasks?
  - Classification
  - Prediction
  - Clustering
  - Affinity Grouping / Association discovery
  - Sequence Analysis
  - Characterization
  - Discrimination
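
As one illustration of the clustering task listed above, here is a minimal k-means sketch (Lloyd's algorithm) on one-dimensional data. The choice of k-means is an assumption for illustration, since the slides do not name a specific algorithm, and the customer ages and starting centers are made-up values.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm in one dimension with caller-supplied initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, clusters

# Hypothetical customer ages, split into two segments
ages = [20, 22, 25, 61, 64, 70]
centers, clusters = kmeans_1d(ages, centers=[20, 70])
print(centers)
print(clusters)
```

No target field is supplied here, which is what makes clustering an undirected task; results on real data depend on the initial centers and on the number of clusters chosen.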

Page 18: Overview of Data Mining and the KDD Process

Some Applications of Data Mining

• Business data analysis and decision support
  - Marketing focalization
    * Recognizing specific market segments that respond to particular characteristics
    * Return on mailing campaign (target marketing)
  - Customer Profiling
    * Segmentation of customers for marketing strategies and/or product offerings
    * Customer behavior understanding
    * Customer retention and loyalty
    * Mass customization / personalization

Page 19: Overview of Data Mining and the KDD Process

Some Applications of Data Mining

• Business data analysis and decision support (cont.)
  - Market analysis and management
    * Provide summary information for decision-making
    * Market basket analysis, cross selling, market segmentation
    * Resource planning
  - Risk analysis and management
    * "What if" analysis
    * Forecasting
    * Pricing analysis, competitive analysis
    * Time-series analysis (e.g., stock market)

Page 20: Overview of Data Mining and the KDD Process

Some Applications of Data Mining

• Fraud detection
  - Detecting telephone fraud:
    * Telephone call model: destination of the call, duration, time of day or week
    * Analyze patterns that deviate from an expected norm
    * British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud scheme
  - Detection of credit-card fraud
  - Detecting suspicious money transactions (money laundering)
• Text mining:
  - Message filtering (e-mail, newsgroups, etc.)
  - Newspaper articles analysis
  - Text and document categorization
• Web Mining
  - Mining patterns from the content, usage, and structure of Web resources
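
The idea of analyzing "patterns that deviate from an expected norm" in the telephone fraud item above can be sketched with a simple z-score test on call durations. This is an illustrative stand-in, not the method British Telecom used; the threshold of 3 standard deviations, the function name, and the sample durations are all assumptions.

```python
from statistics import mean, stdev

def flag_deviations(history, new_values, threshold=3.0):
    """Flag new values lying more than `threshold` standard deviations
    from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 points
    if sigma == 0:
        return []  # no variation in the history, so no scale to measure against
    return [v for v in new_values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes for one account
past_calls = [3, 5, 4, 6, 5, 4, 3, 5, 4, 6]
todays_calls = [4, 5, 95, 6]

suspicious = flag_deviations(past_calls, todays_calls)
print(suspicious)
```

A realistic call model would combine several fields from the slide (destination, duration, time of day or week) rather than a single number.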

Page 21: Overview of Data Mining and the KDD Process

Types of Web Mining

Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining

Page 22: Overview of Data Mining and the KDD Process

Types of Web Mining

Web Content Mining applications:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools

Page 23: Overview of Data Mining and the KDD Process

Types of Web Mining

Web Usage Mining applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems

Page 24: Overview of Data Mining and the KDD Process

Types of Web Mining

Web Structure Mining applications:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis
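
The "hubs" and "authorities" item above refers to link-analysis scoring in the style of Kleinberg's HITS algorithm. The following is a minimal power-iteration sketch under that assumption; the function name, the fixed iteration count, and the four-page link graph are hypothetical and chosen only to keep the example small.

```python
import math

def hits(links, iterations=50):
    """Power-iteration sketch of HITS. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize to unit length so the scores stay bounded
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: page -> pages it links to
web = {"portal": ["docs", "blog"], "index": ["docs"], "docs": [], "blog": ["docs"]}
hub, auth = hits(web)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```

A good hub points to many good authorities and a good authority is pointed to by many good hubs, which is why the two scores are updated from each other in every pass.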

Page 25: Overview of Data Mining and the KDD Process

The Knowledge Discovery Process

- The KDD Process

• Next: We first focus on understanding the data and data preparation/transformation