
Berendt: Advanced databases, first semester 2008, http://www.cs.kuleuven.be/~berendt/teaching/2008w/adb/

1

Advanced databases –

Inferring new knowledge from data(bases):

Deductive Databases;

Knowledge Discovery in Databases

Bettina Berendt

Katholieke Universiteit Leuven, Department of Computer Science

http://www.cs.kuleuven.be/~berendt/teaching/2008w/adb/

Last update: 29 October 2008

2

Agenda

Motivation I: Application examples

Motivation II: Types of reasoning

A key concept of deductive DBs: Recursion

The process of knowledge discovery (KDD)

KDD: Origins and context

A short overview of key KDD techniques

An algorithm for decision-tree learning: ID3

3

What is the impact of genetically modified organisms?

4

Is our school system good for immigrants and/or children from poor backgrounds?

5

What are the effects of teaching in English at universities?

6

What makes people happy?

7

What do men and women like?

8

Is this a man or a woman?

clicked on

9

And here's a somewhat speculative case ...

Who owes money to whom (causing the current financial crisis)?

11

Deductive database languages / Datalog: Motivation

SQL-92 (= SQL2) cannot express some queries:

Are we running low on any parts needed to build a ZX600 sports car?

What is the total component and assembly cost to build a ZX600 at today's part prices?

NB: SQL saw a new version (SQL3) in 1999 and further developments since then. Some DDB concepts are used to support the advanced features of more recent SQL standards.

12

What is a deductive database (system)?

A deductive database system is a database system which can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.

13

Styles of reasoning: „All swans are white“

Deductive: towards the consequences

Premise: All swans are white.

Premise: Tessa is a swan.

Conclusion: Tessa is white.

Inductive: towards a generalisation of observations

Premise: Joe and Lisa and Tex and Wili and ... (all observed swans) are swans.

Premise: Joe and Lisa and Tex and Wili and ... (all observed swans) are white.

Conclusion: All swans are white.

Abductive: towards the (most likely) explanation of an observation

Premise: Tessa is white.

Premise: All swans are white.

Conclusion: Tessa is a swan.
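The three styles can be contrasted in a few lines of code. The following is only a toy sketch of the swan example; the function names and the tuple encoding of the rule are assumptions made for illustration, not part of the lecture material.

```python
# Toy illustration of the three reasoning styles on the swan example.
# The helper names and the (kind, colour) rule encoding are hypothetical.

def deduce(is_swan, all_swans_white):
    """Deduction: rule + case -> result (truth-preserving)."""
    if is_swan and all_swans_white:
        return "white"
    return None  # nothing follows


def induce(observations):
    """Induction: generalise observed (kind, colour) pairs to a rule."""
    swan_colours = {colour for kind, colour in observations if kind == "swan"}
    # Generalise only if every observed swan had the same colour.
    if len(swan_colours) == 1:
        return ("swan", swan_colours.pop())
    return None


def abduce(observed_colour, rule):
    """Abduction: rule + result -> a possible (not guaranteed) explanation."""
    kind, colour = rule
    return kind if observed_colour == colour else None


observed = [("swan", "white")] * 4  # Joe, Lisa, Tex, Wili
rule = induce(observed)
tessa_colour = deduce(True, rule == ("swan", "white"))
tessa_kind = abduce("white", rule)
```

Only `deduce` is guaranteed to return a true conclusion when its inputs are true; a single black swan makes `induce` withdraw the rule, and `abduce` would also label a white goose a swan.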

14

What about truth?

Deductive:

Given the truth of the assumptions, a valid deduction guarantees the truth of the conclusion

Inductive:

the premises of an argument (are believed to) support the conclusion but do not ensure it

has been attacked several times by logicians and philosophers

Abductive:

formally equivalent to the logical fallacy of affirming the consequent

15

What about new knowledge?

C.S. Peirce:

Introduced „abduction“ to modern logic

(after 1900): used „abduction“ to mean: creating new rules to explain new observations (this meaning is actually closest to induction)

<<Abduction is the only logical process that actually creates anything new.>>

essential for scientific discovery

17

Deductive databases in a Computer Science context

Deductive databases have grown out of the desire to combine logic programming with relational databases to construct systems that support a powerful formalism and are still fast and able to deal with very large datasets.

Deductive databases are more expressive than relational databases but less expressive than logic programming systems.

Deductive databases have not found widespread adoption outside academia, but some of their concepts are used in today's relational databases to support the advanced features of more recent SQL standards (≥ SQL:1999).

18

Datalog

a query and rule language for deductive databases that syntactically is a subset of Prolog.

Roots in 1970s; the term Datalog was coined in the mid 1980s by a group of researchers interested in database theory.

Query evaluation is sound and complete and can be done efficiently even for large databases.

Query evaluation is usually done using bottom-up strategies.

In contrast to Prolog, Datalog

disallows complex terms as arguments of predicates, e.g. P(1, 2) is admissible but not P(f1(1), 2),

imposes certain stratification restrictions on the use of negation and recursion, and

only allows range-restricted variables, i.e. each variable in the conclusion of a rule must also appear in a non-negated literal in the premise of that rule.

19

Deductive database languages / Datalog: Motivation

SQL-92 cannot express some queries:

Are we running low on any parts needed to build a ZX600 sports car?

What is the total component and assembly cost to build a ZX600 at today's part prices?

Can we extend the query language to cover such queries?

Yes, by adding recursion.

20

Datalog

SQL queries can be read as follows: “If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer.”

Datalog is a query language that has the same if-then flavor:

New: The answer table can appear in the From clause, i.e., be defined recursively.

Prolog style syntax is commonly used.

21

Example

Find the components of a trike.

We can write a relational algebra query to compute the answer on the given instance of Assembly.

But there is no R.A. (or SQL-92) query that computes the answer on all Assembly instances.

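Assembly instance:

part    subpart  number
trike   wheel    3
trike   frame    1
frame   seat     1
frame   pedal    1
wheel   spoke    2
wheel   tire     1
tire    rim      1
tire    tube     1

[Figure: the Assembly hierarchy drawn as a tree. trike has the children wheel (3) and frame (1); wheel has spoke (2) and tire (1); frame has seat (1) and pedal (1); tire has rim (1) and tube (1).]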

22

The Problem with Relational Algebra and SQL-92

Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.

This takes us one level down the Assembly hierarchy.

To find components that are one level deeper (e.g., rim), we need another join.

To find all components, we need as many joins as there are levels in the given instance!

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, simply by including more levels than the number of joins in the expression!
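The limitation can be made concrete with a small simulation. The sketch below encodes the Assembly instance from the trike example as (part, subpart) pairs and mimics a query with a fixed number of self-joins; the function names are illustrative assumptions.

```python
# Assembly instance from the trike example as (part, subpart) pairs
# (quantities omitted because they do not matter for reachability).
ASSEMBLY = {
    ("trike", "wheel"), ("trike", "frame"),
    ("frame", "seat"), ("frame", "pedal"),
    ("wheel", "spoke"), ("wheel", "tire"),
    ("tire", "rim"), ("tire", "tube"),
}


def join(left, right):
    """Self-join step: (a, b) in left and (b, c) in right give (a, c)."""
    return {(a, c) for a, b in left for b2, c in right if b == b2}


def components_with_joins(part, assembly, num_joins):
    """Components of `part` found by a query with a fixed number of joins."""
    reachable = set(assembly)
    level = set(assembly)
    for _ in range(num_joins):
        level = join(assembly, level)  # one level deeper
        reachable |= level
    return {sub for p, sub in reachable if p == part}


one_join = components_with_joins("trike", ASSEMBLY, 1)
two_joins = components_with_joins("trike", ASSEMBLY, 2)
```

With one join, rim and tube are missed; two joins suffice for this instance, but adding one more level below rim would defeat the two-join query in the same way.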

23

A Datalog Query that Does the Job

Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).

Can read the second rule as follows: “For all values of Part, Part2, Subpt and Qty, if there is a tuple (Part, Part2, Qty) in Assembly and a tuple (Part2, Subpt) in Comp, then there must be a tuple (Part, Subpt) in Comp.”

In each rule, the part to the left of :- is the head of the rule, the part to the right is the body of the rule, and :- denotes implication (the body implies the head).

24

Using a Rule to Deduce New Tuples

Each rule is a template: by assigning constants to the variables in such a way that each body “literal” is a tuple in the corresponding relation, we identify a tuple that must be in the head relation.

By setting Part=trike, Subpt=wheel, Qty=3 in the first rule, we can deduce that the tuple <trike,wheel> is in the relation Comp.

This is called an inference using the rule.

Given a set of tuples, we apply the rule by making all possible inferences with these tuples in the body.

25

Example

For any instance of Assembly, we can compute all Comp tuples by repeatedly applying the two rules. (Actually, we can apply Rule 1 just once, then apply Rule 2 repeatedly.)

Comp tuples got by applying Rule 2 once:

trike spoke
trike tire
trike seat
trike pedal
wheel rim
wheel tube

Comp tuples got by applying Rule 2 twice:

trike spoke
trike tire
trike seat
trike pedal
wheel rim
wheel tube
trike rim
trike tube
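The repeated rule application can be sketched as a naive bottom-up fixpoint computation. This is a minimal illustration, assuming the Assembly instance of the trike example; real Datalog engines typically use more refined strategies (such as semi-naive evaluation), and the function names are hypothetical.

```python
ASSEMBLY = [
    ("trike", "wheel", 3), ("trike", "frame", 1),
    ("frame", "seat", 1), ("frame", "pedal", 1),
    ("wheel", "spoke", 2), ("wheel", "tire", 1),
    ("tire", "rim", 1), ("tire", "tube", 1),
]


def rule1(assembly):
    """Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty)."""
    return {(part, subpt) for part, subpt, _qty in assembly}


def rule2(assembly, comp):
    """Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt)."""
    return {
        (part, subpt)
        for part, part2, _qty in assembly
        for part2_c, subpt in comp
        if part2 == part2_c
    }


def evaluate(assembly):
    """Apply Rule 1 once, then Rule 2 until no new tuples appear."""
    comp = rule1(assembly)
    rounds = 0  # number of Rule 2 applications that produced new tuples
    while True:
        new_tuples = rule2(assembly, comp) - comp
        if not new_tuples:
            return comp, rounds
        comp |= new_tuples
        rounds += 1


comp, rounds = evaluate(ASSEMBLY)
```

On this instance Rule 2 yields new tuples twice (six tuples, then two more), so `comp` ends with the 8 tuples from Rule 1 plus the 8 listed above. The loop terminates on any finite instance because only finitely many (part, subpart) pairs exist.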

26

Datalog vs. SQL:1999 (SQL3) notation

A collection of Datalog rules can be rewritten in SQL syntax, if recursion is allowed (this is the case in SQL:1999).

WITH RECURSIVE Comp(Part, Subpt) AS (
    SELECT A1.Part, A1.Subpt FROM Assembly A1
    UNION
    SELECT A2.Part, C1.Subpt
    FROM Assembly A2, Comp C1
    WHERE A2.Subpt = C1.Part
)
SELECT * FROM Comp
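To try the recursive query, one option is SQLite, which supports WITH RECURSIVE and ships with Python's standard sqlite3 module. The sketch below is one possible setup, not part of the lecture: the column types are assumptions, and the parentheses are placed around the whole UNION because SQLite expects that form.

```python
import sqlite3

ASSEMBLY_ROWS = [
    ("trike", "wheel", 3), ("trike", "frame", 1),
    ("frame", "seat", 1), ("frame", "pedal", 1),
    ("wheel", "spoke", 2), ("wheel", "tire", 1),
    ("tire", "rim", 1), ("tire", "tube", 1),
]

QUERY = """
WITH RECURSIVE Comp(Part, Subpt) AS (
    SELECT A1.Part, A1.Subpt FROM Assembly A1
    UNION
    SELECT A2.Part, C1.Subpt
    FROM Assembly A2, Comp C1
    WHERE A2.Subpt = C1.Part
)
SELECT * FROM Comp
"""


def components(rows):
    """Load the Assembly rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Column types are an assumption; the slides only name the columns.
        conn.execute("CREATE TABLE Assembly (Part TEXT, Subpt TEXT, Qty INTEGER)")
        conn.executemany("INSERT INTO Assembly VALUES (?, ?, ?)", rows)
        return set(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


comp = components(ASSEMBLY_ROWS)
```

UNION (rather than UNION ALL) removes duplicates, which is what lets the recursion stop once no new (Part, Subpt) pairs are produced.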

28

„Data mining“ and „knowledge discovery“

(informal definition):

data mining is about discovering knowledge in (huge amounts of) data

Therefore, it is clearer to speak about “knowledge discovery in data(bases)”

29

Recall: Data, information, and knowledge

Data represents a fact or statement of event without relation to other things.

Ex: It is raining.

Information embodies the understanding of a relationship of some sort, possibly cause and effect.

Ex: The temperature dropped 15 degrees and then it started raining.

Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described or what will happen next.

Ex: If the humidity is very high and the temperature drops substantially, the atmosphere is often unlikely to be able to hold the moisture, so it rains.

(This is from knowledge-management theory. If you want to know about wisdom, check the Web page:

G. Bellinger, D. Castro, & A. Mills: Data, Information, Knowledge, and Wisdom. http://www.systems-thinking.org/dikw/dikw.htm )

30

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web, computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras,

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”: data mining, i.e. the automated analysis of massive data sets

31

Background: Evolution of Database Technology

1960s:

Data collection, database creation, IMS and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:

Data mining, data warehousing, multimedia databases, and Web databases

2000s:

Stream data management and mining

Data mining and its applications

Web technology (XML, data integration) and global information systems

32

A note on: Data Warehousing for finding implicit knowledge in data, and why I don't include this in the course (now)

33

The KDD process

The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, Smyth, 1996)

non-trivial process: multiple process

valid: justified patterns/models

novel: previously unknown

useful: can be used

understandable: by human and machine

34

The process part of knowledge discovery

CRISP-DM

CRoss Industry Standard Process for Data Mining

A data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

35

Knowledge discovery, machine learning, data mining

Knowledge discovery

= the whole process

Machine learning

the application of induction algorithms and other algorithms that can be said to „learn.“

= „modeling“ phase

Data mining

sometimes = KD, sometimes = ML

36

The KDD Process

[Figure: flowchart of the KDD process with five numbered stages. Box labels: Data organized by function; Data warehousing; Create/select target database; Select sampling technique and sample data; Supply missing values; Eliminate noisy data; Normalize values; Transform values; Create derived attributes; Find important attributes & value ranges; Select DM task(s); Select DM method(s); Extract knowledge; Test knowledge; Refine knowledge; Transform to different representation; Query & report generation; Aggregation & sequences; Advanced methods.]

38

Main Contributing Areas of KDD

Databases: Store, access, search, update data (deduction)

Statistics: Infer info from data (deduction & induction, mainly numeric data)

Machine Learning: Computer algorithms that improve automatically through experience (mainly induction, symbolic data)

KDD

[data warehouses:integrated data]

[OLAP: On-Line Analytical Processing]

39

Data Mining: Classification Schemes

General functionality

Descriptive data mining

Predictive data mining

Different views lead to different classifications

Data view: Kinds of data to be mined

Knowledge view: Kinds of knowledge to be discovered

Method view: Kinds of techniques utilized

Application view: Kinds of applications adapted

40

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.]

41

Why Not Traditional Data Analysis?

Tremendous amount of data

Algorithms must be highly scalable to handle data volumes such as terabytes

High-dimensionality of data

Micro-array may have tens of thousands of dimensions

High complexity of data

Data streams and sensor data

Time-series data, temporal data, sequence data

Structured data, graphs, social networks and multi-linked data

Heterogeneous databases and legacy databases

Spatial, spatiotemporal, multimedia, text and Web data

Software programs, scientific simulations

New and sophisticated applications

43

Classification

“What factors determine cancerous cells?”

[Figure: examples (cancerous cell data) are fed into a mining algorithm (a classification algorithm), which turns the data into general patterns. Classification algorithms shown: rule induction, decision tree, neural network.]

44

Classification: Rule Induction

“What factors determine whether a cell is cancerous?”

If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)

If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)

45

Classification: Decision Trees

[Figure: decision tree for classifying cells. Internal nodes test Color (dark / light), #nuclei (1 / 2) and #tails (1 / 2); leaves are labelled healthy or cancerous.]

46

Classification: Neural Networks

“What factors determine whether a cell is cancerous?”

[Figure: neural network with the inputs Color = dark, # nuclei = 1, # tails = 2 and the outputs Healthy and Cancerous.]

47

Clustering

“Are there clusters of similar cells?”

[Figure: clusters of cells, labelled: Light color with 1 nucleus; Dark color with 2 tails and 2 nuclei; 1 nucleus and 1 tail; Dark color with 1 tail and 2 nuclei.]

48

Association Rule Discovery

Task: Discovering association rules among items in a transaction database.

An association among two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.

In general: A1, A2, … => B

49

Association Rule Discovery

“Are there any associations between the characteristics of the cells?”

If color = light and # nuclei = 1 then # tails = 1 (support = 12.5%; confidence = 50%)

If # nuclei = 2 and Cell = Cancerous then # tails = 2 (support = 25%; confidence = 100%)

If # tails = 1 then Color = light (support = 37.5%; confidence = 75%)
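The slides do not define support and confidence, so the sketch below assumes the usual definitions: support is the fraction of all records that satisfy both the "if" and the "then" part, and confidence is that count divided by the number of records satisfying the "if" part. The eight cell records are hypothetical; the original cell data are not part of this transcript, and the records were merely chosen so that the three rules above come out with the stated values.

```python
# Hypothetical cell records, constructed only to be consistent with the
# support/confidence figures quoted above (the real data set is not given).
CELLS = [
    {"color": "light", "nuclei": 1, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 1, "tails": 2, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 1, "tails": 1, "cell": "cancerous"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 1, "tails": 2, "cell": "cancerous"},
]


def matches(record, conditions):
    """True if the record has the required value for every listed attribute."""
    return all(record[attr] == value for attr, value in conditions.items())


def support_and_confidence(records, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    antecedent_count = sum(1 for r in records if matches(r, antecedent))
    both = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = both / len(records)
    confidence = both / antecedent_count if antecedent_count else 0.0
    return support, confidence


rule_a = support_and_confidence(CELLS, {"color": "light", "nuclei": 1}, {"tails": 1})
rule_b = support_and_confidence(CELLS, {"nuclei": 2, "cell": "cancerous"}, {"tails": 2})
rule_c = support_and_confidence(CELLS, {"tails": 1}, {"color": "light"})
```

Note that a rule can have high confidence but low support (it is reliable but rarely applicable), which is why association-rule miners usually threshold on both.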

50

Many Other Data Mining Techniques

Genetic Algorithms

Statistics

Bayesian Networks

Rough Sets

Time Series

Text Mining

51

A goal: From databases to deductive databases to inductive databases

A deductive database system is a database system which can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.

Inductive databases (IDBs) contain not only data, but also patterns.

In an IDB, inductive queries can be used to generate (mine), manipulate, and apply patterns.

The IDB framework supports the process of knowledge discovery in databases (KDD):

– the results of one (inductive) query can be used as input for another

– nontrivial multi-step KDD scenarios can be supported, rather than just single data mining operations.

53

Input data ... Q: when does this person play tennis?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

54

The goal: a decision tree for classification / prediction

In which weather will someone play (tennis etc.)?

55

Constructing decision trees

Strategy: top down, in recursive divide-and-conquer fashion

First: select an attribute for the root node; create a branch for each possible attribute value

Then: split the instances into subsets, one for each branch extending from the node

Finally: repeat recursively for each branch, using only the instances that reach the branch

Stop if all instances have the same class

56

Which attribute to select?


58

Criterion for attribute selection

Which is the best attribute?

Want to get the smallest tree

Heuristic: choose the attribute that produces the “purest” nodes

Popular impurity criterion: information gain

Information gain increases with the average purity of the subsets

Strategy: choose the attribute that gives the greatest information gain

59

Computing information

Measure information in bits

Given a probability distribution, the info required to predict an event is the distribution's entropy

Entropy gives the information required in bits (can involve fractions of bits!)

Formula for computing the entropy (logarithms to base 2):

$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$

60

Example: attribute Outlook

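Outlook = Sunny: $\mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = -\tfrac{2}{5}\log\tfrac{2}{5} - \tfrac{3}{5}\log\tfrac{3}{5} = 0.971$ bits

Outlook = Overcast: $\mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = -1 \log 1 - 0 \log 0 = 0$ bits (taking $0 \log 0$ to be 0)

Outlook = Rainy: $\mathrm{info}([3,2]) = \mathrm{entropy}(3/5, 2/5) = -\tfrac{3}{5}\log\tfrac{3}{5} - \tfrac{2}{5}\log\tfrac{2}{5} = 0.971$ bits

Expected information for the attribute: $\mathrm{info}([2,3],[4,0],[3,2]) = \tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.693$ bits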
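These values can be checked numerically. The sketch below assumes base-2 logarithms and the usual convention that a class with probability 0 contributes nothing to the entropy; the function names are chosen for illustration.

```python
from math import log2


def entropy(*probabilities):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


def info(counts):
    """Information (in bits) of a node with the given class counts."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))


def expected_info(branches):
    """Weighted average information over the branches of a split."""
    total = sum(sum(counts) for counts in branches)
    return sum(sum(counts) / total * info(counts) for counts in branches)


# Class counts [yes, no] per value of Outlook in the weather data.
sunny, overcast, rainy = [2, 3], [4, 0], [3, 2]
outlook_info = expected_info([sunny, overcast, rainy])
```

At full precision `outlook_info` is about 0.6935, which the slide reports as 0.693 bits.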

61

Computing information gain

Information gain = information before splitting - information after splitting

gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits

Information gain for attributes from weather data:

gain(Outlook) = 0.247 bits

gain(Temperature) = 0.029 bits

gain(Humidity) = 0.152 bits

gain(Windy) = 0.048 bits
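The four gains can be recomputed directly from the 14 rows of the weather data. This is a minimal sketch, assuming base-2 logarithms; the tuple layout and function names are illustrative choices.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ("Outlook", "Temp", "Humidity", "Windy")
WEATHER = [  # (Outlook, Temp, Humidity, Windy, Play)
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]


def info(labels):
    """Entropy (bits) of the class distribution in a list of labels."""
    total = len(labels)
    return sum(-n / total * log2(n / total) for n in Counter(labels).values())


def gain(rows, attr_index):
    """Information before splitting minus information after splitting."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attr_index]].append(row[-1])
    after = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return info([row[-1] for row in rows]) - after


gains = {name: round(gain(WEATHER, i), 3) for i, name in enumerate(ATTRIBUTES)}
```

Outlook has the largest gain, so it is the attribute chosen for the root of the tree.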

62

Continuing to split

Within the branch Outlook = Sunny:

gain(Temperature) = 0.571 bits

gain(Humidity) = 0.971 bits

gain(Windy) = 0.020 bits
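These three gains match what one obtains from the five weather-data rows with Outlook = Sunny, which suggests the slide shows the second split in that branch. A short check, with the same assumptions as before (base-2 logarithms, illustrative names):

```python
from collections import Counter, defaultdict
from math import log2

SUNNY_ATTRIBUTES = ("Temp", "Humidity", "Windy")
SUNNY = [  # (Temp, Humidity, Windy, Play) for the rows with Outlook = Sunny
    ("Hot", "High", False, "No"),
    ("Hot", "High", True, "No"),
    ("Mild", "High", False, "No"),
    ("Cool", "Normal", False, "Yes"),
    ("Mild", "Normal", True, "Yes"),
]


def info(labels):
    """Entropy (bits) of the class distribution in a list of labels."""
    total = len(labels)
    return sum(-n / total * log2(n / total) for n in Counter(labels).values())


def gain(rows, attr_index):
    """Information before splitting minus information after splitting."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attr_index]].append(row[-1])
    after = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return info([row[-1] for row in rows]) - after


sunny_gains = {
    name: round(gain(SUNNY, i), 3) for i, name in enumerate(SUNNY_ATTRIBUTES)
}
```

Humidity separates the Sunny rows perfectly (High is always No, Normal is always Yes), so its gain equals the full 0.971 bits of the node and no further split is needed below it.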

63

Final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes

Splitting stops when data can’t be split any further
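Putting the pieces together, the whole tree can be built recursively. The following is a compact sketch of ID3-style induction on the weather data, not Quinlan's original code: the nested-dictionary tree format, the majority-class fallback for impure leaves and all names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ("Outlook", "Temp", "Humidity", "Windy")
WEATHER = [  # (Outlook, Temp, Humidity, Windy, Play)
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]


def info(labels):
    """Entropy (bits) of the class distribution in a list of labels."""
    total = len(labels)
    return sum(-n / total * log2(n / total) for n in Counter(labels).values())


def split(rows, index):
    """Group rows by their value for the attribute at `index`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[index]].append(row)
    return subsets


def gain(rows, index):
    """Information before splitting minus information after splitting."""
    after = sum(
        len(s) / len(rows) * info([r[-1] for r in s])
        for s in split(rows, index).values()
    )
    return info([r[-1] for r in rows]) - after


def id3(rows, candidates):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    # Stop if the node is pure or the data cannot be split any further;
    # an impure leaf is labelled with the majority class (an assumption).
    if len(set(labels)) == 1 or not candidates:
        return Counter(labels).most_common(1)[0][0]
    best = max(candidates, key=lambda i: gain(rows, i))
    remaining = [i for i in candidates if i != best]
    # Branches are created only for values that occur in `rows`.
    return {
        ATTRIBUTES[best]: {
            value: id3(subset, remaining)
            for value, subset in split(rows, best).items()
        }
    }


tree = id3(WEATHER, list(range(len(ATTRIBUTES))))
```

On this data the sketch splits on Outlook at the root, on Humidity under Sunny and on Windy under Rainy, while Overcast is a pure "Yes" leaf.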

64

Wishlist for a purity measure

Properties we require from a purity measure:

When node is pure, measure should be zero

When impurity is maximal (i.e. all classes equally likely), measure should be maximal

Measure should obey the multistage property (i.e. decisions can be made in several stages)

Entropy is the only function that satisfies all three properties!

65

Properties of the entropy

The multistage property:

Simplification of computation:

Note: instead of maximizing info gain we could just minimize information
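The two formulas on this slide did not survive extraction. A plausible reconstruction, assuming the slide follows the standard presentation in Witten & Frank (on which the ID3 part is based), is given below; the class counts [2,3,4] in the second formula are an illustrative example, and all logarithms are to base 2.

```latex
% Multistage property: a three-way decision can be made in two stages
\[
\mathrm{entropy}(p, q, r) = \mathrm{entropy}(p, q + r)
  + (q + r) \cdot \mathrm{entropy}\!\left(\frac{q}{q + r}, \frac{r}{q + r}\right)
\]

% Simplification of computation, for a node with class counts [2,3,4]
\[
\mathrm{info}([2,3,4])
  = -\tfrac{2}{9}\log\tfrac{2}{9} - \tfrac{3}{9}\log\tfrac{3}{9} - \tfrac{4}{9}\log\tfrac{4}{9}
  = \frac{-2\log 2 - 3\log 3 - 4\log 4 + 9\log 9}{9}
\]
```

The simplification follows from writing each $\log\tfrac{k}{9}$ as $\log k - \log 9$ and collecting the $\log 9$ terms, which avoids computing the fractions explicitly.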

66

Discussion / outlook decision trees

Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan

Various improvements, e.g. C4.5: deals with numeric attributes, missing values, noisy data

Gain ratio instead of information gain [see Witten & Frank slides, ch. 4, pp. 40-45]

Similar approach: CART …

67

Agenda

Motivation I: Application examples

Motivation II: Types of reasoning

A key concept of deductive DBs: Recursion

The process of knowledge discovery (KDD)

KDD: Origins and context

A short overview of key KDD techniques

An algorithm for decision-tree learning: ID3

Mining semistructured and unstructured data

68

References / background reading

Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:

a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook

a machine learning perspective: Witten, I.H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html

a statistics perspective: Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2

The CRISP-DM phase model can be found at http://www.crisp-dm.org

69

Acknowledgements

p. 12, 17: http://en.wikipedia.org/wiki/Deductive_database

pp. 14, 15: http://en.wikipedia.org/wiki/Abductive_reasoning

p. 18: http://en.wikipedia.org/wiki/Datalog

pp. 19-26 taken from (with minor modifications):

Ramakrishnan, R. & Gehrke, J. (2002?). Database Management Systems, 3rd Edition 2002. Instructor Slides. Ch. 25 - Deductive Databases. http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch25_DedDB-95.pdf

pp. 33, 36, 38, 43-50 were taken from (with minor modifications): Tzacheva, A.A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewI.ppt

pp. 47-50 were taken from Tzacheva, A.A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewII.ppt

pp. 30, 31, 39-41 were taken from Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques. Chapter 1: Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt

The ID3 part is based on Witten, I.H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html

In particular, the instructor slides for that book available at http://books.elsevier.com/companions/9780120884070/ (chapters 1-4): http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter1.pdf (and ...chapter2.pdf, chapter3.pdf, chapter4.pdf) or http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter1.odp (and ...chapter2.odp, chapter3.odp, chapter4.odp)

70

Picture credits

See “notes“ of the slides