IR&DM ’13/14 15 October 2013 DM Intro-1 What is Data Mining?

What is Data Mining?

“Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”

“Data mining, in a broad sense, is the set of techniques for analyzing and understanding data.”

“An Unethical Econometric practice of massaging and manipulating the data to obtain the desired results.”

“Data mining is the process of extracting hidden patterns from data.”

The KDD process

Input data → Data pre-processing → Data mining → Post-processing → Information

• Data pre-processing
– Normalisation
– Dimensionality reduction
– Feature selection
– Handling missing values
• Post-processing
– Filtering patterns
– Visualisation
– Pattern interpretation

Interactive data mining
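
The stages of the KDD process can be sketched end to end in a few lines. The following is a minimal illustration, not the course's own code: the toy data, the column names, and the choice of Pearson correlation as the "mining" step are all assumptions made for the example. Pre-processing fills missing values and normalises, mining scores column pairs, and post-processing filters the patterns.

```python
from itertools import combinations
from math import sqrt


def impute_missing(rows):
    """Pre-processing: replace None with the mean of the observed values in that column."""
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen))
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


def normalise(rows):
    """Pre-processing: min-max scale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # constant column: avoid dividing by zero
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def mine_correlations(rows, names):
    """Data mining (assumed technique): score every pair of columns by correlation."""
    cols = list(zip(*rows))
    return [(names[i], names[j], pearson(cols[i], cols[j]))
            for i, j in combinations(range(len(names)), 2)]


def filter_patterns(patterns, min_abs_r=0.8):
    """Post-processing: keep only strong correlations, strongest first."""
    strong = [p for p in patterns if abs(p[2]) >= min_abs_r]
    return sorted(strong, key=lambda p: -abs(p[2]))


# Hypothetical input data, invented for illustration; None marks a missing value.
names = ["temperature", "ice_cream_sales", "umbrella_sales"]
raw = [
    [30.0, 200.0, 5.0],
    [25.0, None, 8.0],
    [20.0, 120.0, 12.0],
    [15.0, 80.0, None],
    [10.0, 40.0, 20.0],
]

clean = normalise(impute_missing(raw))
patterns = filter_patterns(mine_correlations(clean, names))
for a, b, r in patterns:
    print(f"{a} ~ {b}: r = {r:+.2f}")
```

In an interactive setting the analyst would look at the printed patterns, adjust `min_abs_r` or the pre-processing choices, and run the pipeline again.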

Data Mining vs. Information Retrieval

• IR is answering questions the user asked
– “Show me web pages relevant to this query”
• DM is answering questions the user didn’t ask
– “Show me interesting patterns in these web pages’ contents”
• Vague problem: how to evaluate the results?
• Typical approach: pre-define some measurable quality of “interestingness”
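
One common way to make "interestingness" measurable is lift, which compares how often two items occur together with how often they would if they were independent. The sketch below is only an illustration under assumptions: the slides do not name a specific measure, and the function `pair_lift` and the toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(a, b): lift} for every item pair that co-occurs at least once.

    lift(a, b) = P(a and b) / (P(a) * P(b)). Values above 1 mean the items
    appear together more often than independence would predict.
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Hypothetical toy data, invented for illustration.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
    {"milk", "beer"},
    {"milk", "bread"},
]

scores = pair_lift(baskets)
for pair, lift in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(lift, 2))
```

Lift alone tends to reward rare pairs that were seen only once, so in practice it is usually combined with a minimum-support threshold.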

Data mining’s position in sciences

• Data mining uses statistical tools and methods to infer from data
– Is data mining just a fancy name for statistics?
• Data mining uses methods to learn the unseen
– Is data mining just a boring name for machine learning?

Is data mining a voodoo science?

Why Data Mining?

The “PHT” Pirate wanted all information of the world. But before he realized most of it was useless, he was already buried under it.

—Stanisław Lem, The Cyberiad

Data, data, data, data, …

• ≈ 5 GB of climate data
• 40 000 000 000 photos
• 340 000 000 tweets per day
• 1 000 000 customer transactions per hour

To utilise this data, we need tools to analyse and understand it.

We need data mining.

Data Mining Applications

• Business intelligence
– What do customers buy together?
– What are the seasonal trends?
– How to make more money?
• Scientific data analysis
– What genes cause diseases?
– What species co-inhabit areas?
– What happens if the average temperature rises?
• And anything else where you have data…
– Whom did Barack Obama have to persuade to vote for him?
– Is there a problem in the International Space Station?

National Security

Data mining and privacy

• Very important!
– Everybody wants to do data mining, nobody wants to be data mined
– Google’s privacy policies, deleting your information from Facebook, EU and the right to be forgotten, …
• Often imposed by laws
– Medical data, personal information records, …
• Governments use data mining techniques for national security
– Profiling

Summary

• We’re collecting more and more data
– Most of it is uninteresting, but how to find what’s interesting?
• Scientific method: form hypothesis, collect data, test hypothesis
• DM approach: collect data and let the computer find what (interesting) hypotheses hold in it
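
The contrast between the two approaches can be made concrete with a small sketch. Everything here is an assumption for illustration: hypotheses are restricted to rules of the form "if A then B" over boolean attributes, a rule "holds" when its confidence reaches a threshold, and the records and function names are hypothetical.

```python
from itertools import permutations


def confidence(records, if_attr, then_attr):
    """Fraction of the records with `if_attr` set that also have `then_attr` set."""
    matching = [r for r in records if r[if_attr]]
    if not matching:
        return 0.0
    return sum(r[then_attr] for r in matching) / len(matching)


def check_hypothesis(records, if_attr, then_attr, min_conf=0.9):
    """Scientific method: test one rule that was stated before looking at the data."""
    return confidence(records, if_attr, then_attr) >= min_conf


def mine_hypotheses(records, min_conf=0.9):
    """DM approach: enumerate every 'if A then B' rule and keep the ones that hold."""
    attrs = list(records[0])
    found = []
    for a, b in permutations(attrs, 2):
        c = confidence(records, a, b)
        if c >= min_conf:
            found.append((a, b, c))
    return found


# Hypothetical observations, invented for illustration.
records = [
    {"rain": True, "umbrella": True, "weekend": False},
    {"rain": True, "umbrella": True, "weekend": True},
    {"rain": False, "umbrella": False, "weekend": True},
    {"rain": False, "umbrella": True, "weekend": False},
    {"rain": True, "umbrella": True, "weekend": False},
]

print(check_hypothesis(records, "rain", "umbrella"))
print(mine_hypotheses(records))
```

Because the mining variant tests every candidate rule, some rules can pass the threshold purely by chance on a small data set, which is one reason the results still need post-processing and interpretation.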
