kdd knowledge discovery in databasesformas.ufba.br/dclaro/mat700/aula 3 - kdd e mecatronica -...

KDD KNOWLEDGE DISCOVERY IN DATABASES PREDICTION METHODS CLASSIFICATION AND REGRESSION Daniela Barreiro Claro


Post on 21-Aug-2020


Page 1: KDD KNOWLEDGE DISCOVERY IN DATABASESformas.ufba.br/dclaro/mat700/Aula 3 - KDD e Mecatronica - Classific… · Data Cleaning Data Integration Data Transformation Data Reduction Data

KDD – KNOWLEDGE DISCOVERY IN DATABASES PREDICTION METHODS – CLASSIFICATION AND REGRESSION

Daniela Barreiro Claro

Outline

Introduction
KDD
Pre-Processing
Data Mining
Tasks
Post-Processing

Introduction

Are you ready for the Big Data era?


Introduction

Big Data = cloud + social + mobile

Introduction

What is Big Data?
  Big data is data that exceeds the processing capacity of conventional database systems.
  The data is too big, moves too fast, or doesn't fit the structures of a database architecture.
  The term became a buzzword around 2012.

FORMAS - UFBA

Internet of Things

Physical Objects + Controllers, Sensors, and Actuators + Internet = Internet of Things

Source: Adrian McEwen & Hakim Cassimally. Designing the Internet of Things.

Internet of Things

Integrate things into the existing web
  HTML and REST
Smart things

Big Data

Huge amounts of data
  There is an urgent need for new techniques and tools that automate the process of extracting data
  These techniques and tools can help transform this huge amount of data into relevant and useful information
  “Necessity is the mother of invention”
Data mining
  Automated analysis of huge data sets

Big Data

A large number of transactions run each day, for instance at Walmart and Carrefour
Remote sensors
Telecommunications networks
Medical records, patient records, etc.
Traffic sensors
Devices

Big Data

“The world is data rich but information poor”
Collected data is being stored in large repositories
  Data Tombs
    Archived data that is rarely visited
    E.g., video camera footage

KDD - Knowledge Discovery in Databases

Knowledge discovery process applied to stored data
According to Fayyad (1996), KDD is:
  “The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
KDD has several steps:
  Selection, pre-processing (transformation), interpretation/evaluation, and knowledge


KDD - Knowledge Discovery in Databases

1. Domain knowledge
2. Creation of the dataset
3. Pre-processing and transformations
4. Choice of DM technique
5. Choice of DM algorithm
6. Interpretation and evaluation of the patterns found
7. Knowledge discovery

KDD - Knowledge Discovery in Databases

Some steps of KDD can be visualized as a Data Warehouse (DW)

KDD - Knowledge Discovery in Databases

Three macro steps
  Pre-Processing
    Data Cleaning
    Data Integration
    Data Transformation
    Data Reduction
  Data Mining
    Techniques of DM
    Algorithms of DM
  Post-Processing
    Analysis and evaluation of the patterns discovered

Pre-Processing

Real data normally have the following characteristics:
  Incomplete
    Attribute values are missing, or only aggregate data is available
  Incorrect
    There are errors; attributes have unexpected values
  Inconsistent
    There are discrepancies among data items; attributes that represent the same concept can have distinct names in different databases
  Huge volume
    A large amount of data makes the data mining process very slow

Pre-Processing

Pre-processing comprises four steps. The first is:
Data Cleaning
  To clean the data
  To fill in missing data
  To resolve inconsistencies
  To smooth out errors
  To eliminate or minimize discrepancies among data
  If the data is dirty, the results will be unreliable
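
The cleaning goals on this slide (completing missing data, resolving inconsistencies) can be illustrated with a minimal sketch. The records, the field names and the choice of the mean as the fill value are assumptions made for illustration; they are not part of the original course material.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "city": "Salvador"},
    {"age": None, "city": "salvador "},
    {"age": 41, "city": "SALVADOR"},
    {"age": 30, "city": None},
]

def clean(rows):
    """Return cleaned copies of rows: fill missing ages, unify city spelling."""
    known_ages = [r["age"] for r in rows if r["age"] is not None]
    fill_age = mean(known_ages)  # assumption: complete missing ages with the mean
    cleaned = []
    for r in rows:
        age = r["age"] if r["age"] is not None else fill_age
        # Resolve inconsistencies: the same city written in different ways.
        city = r["city"].strip().title() if r["city"] else "Unknown"
        cleaned.append({"age": age, "city": city})
    return cleaned

cleaned = clean(records)
for row in cleaned:
    print(row)
```

Filling with the mean is only one possible policy; a median, a most-frequent value, or dropping the record may suit other data better.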

Pre-Processing

Data Integration
  Integrate data from different databases, data cubes, file systems, etc.
  Attributes that represent the same concept can have different names in different databases
    E.g., IdCliente, ClienteID, Cli_ID
  Some attributes can be derived from others
    E.g., annual salary, total amount
  Data integration often generates redundancy. In these cases, the Data Cleaning step must be re-executed to eliminate the redundancy generated by this phase
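
As a sketch of the integration problem described here, the snippet below maps the three attribute names from the slide (IdCliente, ClienteID, Cli_ID) onto one canonical name and drops the duplicate rows that merging produces. The canonical name `client_id`, the two source tables and their contents are hypothetical.

```python
# Hypothetical mapping from source-specific names to one canonical name.
CANONICAL = {"IdCliente": "client_id", "ClienteID": "client_id", "Cli_ID": "client_id"}

def integrate(*sources):
    """Merge rows from several sources, renaming keys and dropping duplicates."""
    merged, seen = [], set()
    for rows in sources:
        for row in rows:
            unified = {CANONICAL.get(k, k): v for k, v in row.items()}
            key = tuple(sorted(unified.items()))
            if key not in seen:  # skip redundancy introduced by integration
                seen.add(key)
                merged.append(unified)
    return merged

# Two hypothetical databases that describe overlapping clients.
sales = [{"IdCliente": 1, "name": "Ana"}, {"IdCliente": 2, "name": "Pedro"}]
support = [{"Cli_ID": 2, "name": "Pedro"}, {"ClienteID": 3, "name": "Miguel"}]

clients = integrate(sales, support)
print(clients)
```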

Pre-Processing

Data Transformation
  This step covers two main procedures
  Aggregation
    Combination of two or more objects into a single object
      E.g., aggregate 365 days into 12 months
    Change of scale
    Smaller datasets need less memory and processing time
    Aggregate quantities, such as averages and totals, have less variance than single objects
    Disadvantage
      Loss of interesting details
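
The slide's example of aggregating 365 days into 12 months can be sketched as below. The daily values are invented (one unit per day of 2019) purely so that the result is easy to check by hand.

```python
from collections import defaultdict
from datetime import date, timedelta

def monthly_totals(daily):
    """Aggregate {date: value} into {(year, month): total}."""
    totals = defaultdict(float)
    for day, value in daily.items():
        totals[(day.year, day.month)] += value
    return dict(totals)

# Hypothetical data: one unit sold on each of the 365 days of 2019.
start = date(2019, 1, 1)
daily_sales = {start + timedelta(days=i): 1.0 for i in range(365)}

by_month = monthly_totals(daily_sales)
print(len(daily_sales), "daily objects ->", len(by_month), "monthly objects")
```

The 12 monthly totals need less memory than the 365 daily values, but the day-to-day detail is lost, which is the disadvantage noted on the slide.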

Pre-Processing

Data Transformation
  Normalization or Standardization
    Discrete data sets have some properties
    If different variables need to be combined, they must be transformed so that large values do not dominate the results
      E.g., two variables: age and salary
      The two differ greatly in scale: salary (thousands of dollars) and age (less than 130)
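
To illustrate the age/salary example, here is a minimal sketch of two common rescaling schemes, min-max normalization and z-score standardization. The slide does not say which scheme the course uses, and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical sample: salary is several orders of magnitude larger than age.
ages = [20, 35, 50, 65]
salaries = [20000, 50000, 80000, 110000]

print(min_max(ages))
print(min_max(salaries))
print(z_score(salaries))
```

After rescaling, both variables lie on comparable ranges, so salary no longer dominates a distance or a sum merely because of its unit.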

Pre-Processing

Data Reduction
  Reducing the volume of the data representation while still producing the same (or a similar) analytical result
  Strategies
    Aggregation
      To construct a data cube
    Attribute selection
      To eliminate irrelevant attributes through correlation analysis
    Dimension reduction
    Data discretization
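
One possible reading of "attribute selection by correlation analysis" is to keep only the attributes that correlate with the target. The sketch below uses the Pearson coefficient and an arbitrary threshold of 0.5; both choices, as well as the attribute names and values, are assumptions and not taken from the slides.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_attributes(columns, target, threshold=0.5):
    """Keep attributes whose |correlation| with the target reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical data set with one relevant and one irrelevant attribute.
target = [1, 2, 3, 4, 5]
columns = {
    "monthly_salary": [10, 20, 30, 40, 50],
    "shoe_size": [40, 38, 40, 38, 40],
}

selected = select_attributes(columns, target)
print(selected)
```

Correlation analysis can also be applied between pairs of attributes to detect redundant ones; the same `pearson` function would serve for that.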

Pre-Processing

Data Reduction
  Dimension reduction
    Dimensionality refers to the number of attributes
    Can eliminate irrelevant features and reduce noise
    Can generate a more understandable model
    Can reduce the data, often making them easier to examine
    Often attributes are combined to generate new attributes, that is, combinations of the old attributes
  Data Discretization
    Transforming a continuous attribute into a categorical (discrete) attribute or into binary attributes (binarization)
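
Discretization and binarization, as defined on this slide, can be sketched as follows. The cut points (80 and 120, in thousands), the labels and the income values are hypothetical.

```python
def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the interval it falls in."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]  # at or above the last cut point

def binarize(label, labels):
    """Turn one categorical value into one binary attribute per category."""
    return [1 if label == candidate else 0 for candidate in labels]

LABELS = ["low", "medium", "high"]
CUTS = [80, 120]  # hypothetical thresholds, in thousands

incomes = [60, 95, 220]
categories = [discretize(x, CUTS, LABELS) for x in incomes]
print(categories)
print([binarize(c, LABELS) for c in categories])
```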

Data Mining

Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.

Data Mining

[Figure: Data Mining surrounded by related fields: Machine Learning, Statistics, Applications (BI / Web Search), Visualization, Database]

Data Mining

It is one of the steps in a KDD process
Two macro aims:
  Prediction
    Predict the values of future or unknown variables
  Description
    Discover patterns that describe the data set

Data Mining

Techniques
  Prediction
    Classification
    Regression
  Description
    Clustering
    Summarization
    Association

Supervised vs. Unsupervised Learning

Supervised learning (prediction)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
Unsupervised learning (description)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes, clusters, or associations in the data


Classification techniques

Classification
  Predicts categorical class labels (discrete or nominal)
  Constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the model's classification
    The accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set is independent of the training set (otherwise the estimate is inflated by overfitting)
  If the accuracy is acceptable, use the model to classify new data
Note: if the test set is used to select models, it is called a validation (test) set
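
The two steps can be sketched in a few lines: build a model on a training set, then estimate its accuracy on an independent test set. The "model" here is deliberately trivial (it always predicts the majority class), and the data, the 30% test fraction and the function names are all hypothetical.

```python
import random
from collections import Counter

def split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and return (training set, test set)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def learn_majority(train_rows):
    """Step 1, model construction: remember the most common class label."""
    majority = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: majority

def accuracy(model, test_rows):
    """Step 2: percentage of test samples whose known label matches the prediction."""
    correct = sum(1 for features, label in test_rows if model(features) == label)
    return 100.0 * correct / len(test_rows)

# Hypothetical labeled samples: (features, class label).
data = [((i,), "No" if i % 3 else "Yes") for i in range(10)]

train, test = split(data)
model = learn_majority(train)
print(f"accuracy on the held-out test set: {accuracy(model, test):.1f}%")
```

Because `train` and `test` do not share samples, the printed figure is not inflated by samples the model has already seen.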

Classification techniques

Training Set -> Learning algorithm -> Learn Model (induction) -> Model
Test Set -> Apply Model (deduction)

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification algorithms

Decision Tree based Methods
Rule-based Methods
Memory-based Reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines

Classification - Decision tree

Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
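
As a sanity check, the tree on this slide can be written as a plain function and run against the ten training records. This is only a hand-coded transcription of the drawn tree, not a tree induction algorithm. The slide does not say which branch an income of exactly 80K takes; sending it to the "> 80K" side is an assumption.

```python
TRAINING = [
    # (Tid, Refund, MaritalStatus, TaxableIncome in thousands, Cheat)
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def predict_cheat(refund, marital_status, taxable_income):
    """Follow the slide's tree: Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # Assumption: exactly 80K is treated like the "> 80K" branch.
    return "No" if taxable_income < 80 else "Yes"

errors = [tid for tid, refund, status, income, cheat in TRAINING
          if predict_cheat(refund, status, income) != cheat]
print("misclassified training records:", errors)
```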

Classification - Decision tree: another example

Training Data: the same ten records (Tid 1 to 10) as in the previous example

MarSt
  Married: NO
  Single, Divorced: Refund
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

Training Set -> Tree induction algorithm -> Learn Model (induction) -> Model: Decision Tree
Test Set -> Apply Model (deduction)

The Training Set (Tid 1 to 10) and the Test Set (Tid 11 to 15) are the same tables shown on the "Classification techniques" slide.

Apply Model to Test Data

Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree.


Apply Model to Test Data

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Refund = No, so follow the No branch to MarSt
MarSt = Married, so follow the Married branch to the leaf NO

Assign Cheat to “No”
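
The walk from the root to a leaf can be sketched with the tree stored as nested dictionaries. The dictionary layout, the `branch` functions and the ">= 80K" label are illustrative choices; the slides only show the tree as a drawing.

```python
def same(value):
    """Categorical attributes: the branch key is the value itself."""
    return value

# Income node shared by the Single and Divorced branches.
# Assumption: an income of exactly 80 (thousand) goes to the upper branch.
INCOME_NODE = {
    "attribute": "TaxInc",
    "branch": lambda k: "< 80K" if k < 80 else ">= 80K",
    "children": {"< 80K": "No", ">= 80K": "Yes"},
}

TREE = {
    "attribute": "Refund",
    "branch": same,
    "children": {
        "Yes": "No",
        "No": {
            "attribute": "MarSt",
            "branch": same,
            "children": {
                "Married": "No",
                "Single": INCOME_NODE,
                "Divorced": INCOME_NODE,
            },
        },
    },
}

def classify(node, record):
    """Walk from the root to a leaf; return (label, list of tests applied)."""
    path = []
    while isinstance(node, dict):
        key = node["branch"](record[node["attribute"]])
        path.append(f'{node["attribute"]} = {key}')
        node = node["children"][key]
    return node, path

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
label, path = classify(TREE, test_record)
print(path, "-> Cheat =", label)
```

For the test record on the slide, the walk stops at the Married leaf, so the income test is never reached.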


Classification - Decision tree

4 macro steps:
1. Divide the data into a training data set and a test data set
2. Choose the classification attribute (labeled attribute)
     Decide which features of the data are relevant to the target class we want to predict
     Verify the relevant attributes (entropy and information gain)
3. Generate the decision tree
4. Test the effectiveness of the classification algorithm using the test data set

Classification - Decision tree

Entropy
  It is a measure of impurity. For a binary class with values a/b it is defined as:
    Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))
Information gain
  It is usually a good measure for deciding the relevance of an attribute
  It is used to define a preferred order of attributes to investigate, so as to narrow down the state of the predicted class most rapidly
  A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values
    E.g., one of the input attributes might be the customer's credit card number: it identifies each customer uniquely, so its information gain is high, yet it does not generalize to new customers
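
Entropy and information gain can be computed directly on the Refund and Cheat columns of the training data shown earlier. The base-2 logarithm is an assumption (the slide writes only "log"), and information gain is taken here as the class entropy minus the weighted entropy of the groups produced by the attribute.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Impurity of a list of class labels (assumption: logarithm in base 2)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Columns taken from the training data on the decision tree slides (Tid 1 to 10).
tid = list(range(1, 11))
refund = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

print(round(entropy(cheat), 3))                   # about 0.881
print(round(information_gain(refund, cheat), 3))  # about 0.192
print(round(information_gain(tid, cheat), 3))     # about 0.881
```

Tid behaves like the credit card number in the slide: every value is unique, so it gets the maximum possible gain even though it is useless for classifying new records.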

Classification - Exercise

Classification - Decision tree - Results

Using a decision tree algorithm

Classification - Decision tree - Exercise

Name      Gender
Pedro     M
Miguel    M
Ana       F
Gabriela  F
Daniela   ?

Predict Daniela's gender

Features
  Ends with a vowel
  Number of vowels
  Length
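
The three features of the exercise can be extracted from each name as sketched below. The function name and the dictionary keys are hypothetical, and the sketch stops at feature extraction, leaving the choice of splitting attribute to the exercise.

```python
VOWELS = set("aeiou")

def features(name):
    """Compute the exercise's three features for one name."""
    lower = name.lower()
    return {
        "ends_vowel": lower[-1] in VOWELS,
        "num_vowels": sum(ch in VOWELS for ch in lower),
        "length": len(name),
    }

# Training names and known genders from the slide.
training = [("Pedro", "M"), ("Miguel", "M"), ("Ana", "F"), ("Gabriela", "F")]

for name, gender in training:
    print(name, gender, features(name))
print("Daniela", "?", features("Daniela"))
```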

Regression techniques

Represent a function that predicts a number
  E.g., predict the height of a child given the child's age
Linear regression is the simplest to use
Examples of algorithms
  GLM - Generalized Linear Model
    Based on statistical techniques
  SVM - Support Vector Machines
    Supports linear and non-linear regression
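
The child's height example can be sketched with simple linear regression fitted by ordinary least squares. The ages and heights below are invented for illustration and are not real growth data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical observations: age in years, height in centimeters.
ages = [2, 4, 6, 8, 10]
heights = [86, 102, 115, 128, 139]

slope, intercept = fit_line(ages, heights)

def predict_height(age):
    """Predict a number (height) from the fitted function."""
    return slope * age + intercept

print(f"height = {slope:.2f} * age + {intercept:.2f}")
print(f"predicted height at age 7: {predict_height(7):.1f} cm")
```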

Post-Processing

Analyze the retrieved information
Generate knowledge
Often this is a cyclic process, that is, steps must be redone in order to find useful information
KDD is a slow process


/formasresearchgroup /formasresearch

www.formas.ufba.br

Semantic Applications and Formalisms Research Group

Prof. Daniela Barreiro Claro

Email: [email protected]

Our course: formas.ufba.br/dclaro