#11 opentalks

Machine Learning Mariana Lopes [email protected] [email protected]

Upload: mariana-lopes

Post on 14-Apr-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: #11 opentalks

Machine Learning

Mariana Lopes

[email protected]@gmail.com

Page 2: #11 opentalks

About me

Mariana Vieira Ribeiro Lopes

BSc in Computer Science (class of 2008) - UFSCar

MSc in Computer Science (Artificial Intelligence) - UFSCar

Data Scientist - Bravi (since August 2015)

Tutor for the distance-learning (EaD) BSI course - UFSCar

Page 3: #11 opentalks

Machine Learning

Artificial Intelligence
● Machine Learning
● Computer Vision
● Robotics
● Game Theory
● Speech Recognition
● Natural Language Processing
● Reasoning and Decision Making

Page 4: #11 opentalks

Machine Learning Paradigms

● Symbolic Learning and Rule Induction
  ○ Learning from examples, decision trees (DTs)
● Probabilistic and Statistical Models
  ○ Bayesian models, Naive Bayesian models, SVM
● Evolution-based Algorithms
  ○ Genetic Algorithms, Evolution Strategies
● Neural Networks

Page 5: #11 opentalks

Machine Learning

Examples (training data) → Induction → New Knowledge

Page 6: #11 opentalks

Machine Learning

Unlabeled Data or Labeled Data → Induction → New Knowledge

Page 7: #11 opentalks

Labeled

weather  temperature  humidity  wind  play
sunny    85           85        F     no
sunny    80           90        T     no
cloudy   83           86        F     yes
rainy    70           96        F     yes
rainy    68           80        F     yes
rainy    65           70        T     no
cloudy   64           65        T     yes
sunny    72           95        F     no
sunny    69           70        F     yes
rainy    75           80        F     yes
sunny    75           70        T     yes
cloudy   72           90        T     yes
cloudy   81           75        F     yes
rainy    71           91        T     no

Page 8: #11 opentalks

Unlabeled

weather  temperature  humidity  wind
sunny    85           85        F
sunny    80           90        T
cloudy   83           86        F
rainy    70           96        F
rainy    68           80        F
rainy    65           70        T
cloudy   64           65        T
sunny    72           95        F
sunny    69           70        F
rainy    75           80        F
sunny    75           70        T
cloudy   72           90        T
cloudy   81           75        F
rainy    71           91        T

Page 9: #11 opentalks

Machine Learning

Unlabeled Data → Induction → New Knowledge

Unsupervised Learning: identify patterns in data

Page 10: #11 opentalks

Machine Learning

Labeled Data → Induction → New Knowledge

Supervised Learning

Page 11: #11 opentalks

Supervised Learning

● Determine the correct class of unlabeled data
● Predict future events

Discrete label: classification problem

Continuous label: regression problem

Knowledge can be represented by inductive models. Example: decision trees

Page 12: #11 opentalks

Decision Trees

● Induced from labeled training data (Knowledge Base)
● Generalizes knowledge and learns concepts
● Classifies unknown data

Intuitive

Accurate

Interpretable

Graphic representation or by rules

Embedded feature selection

Robust method

Scalable method

Low cost for induction and inference

Continuous and discrete features

Popular

Page 13: #11 opentalks

Decision Trees

Z (root node)
├─ X (internal node)
│   ├─ c2 (leaf node)
│   └─ c1 (leaf node)
└─ X (internal node)
    ├─ c1 (leaf node)
    └─ c2 (leaf node)

Page 14: #11 opentalks

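Decision Trees

Z (root node)
├─ X (internal node)
│   ├─ c2 (leaf node)
│   └─ c1 (leaf node)
└─ X (internal node)
    ├─ c1 (leaf node)
    └─ c2 (leaf node)

Root node and internal nodes: features

Leaf nodes: labels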

Page 15: #11 opentalks

Decision Trees

Z
├─ z2 → X
│   ├─ x1 → c2
│   └─ x2 → c1
└─ z1 → X
    ├─ x1 → c1
    └─ x2 → c2

Rule Base:
IF Z = z2 AND X = x1 THEN C = c2
IF Z = z2 AND X = x2 THEN C = c1
IF Z = z1 AND X = x1 THEN C = c1
IF Z = z1 AND X = x2 THEN C = c2
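The rule base above reads naturally as code. The following is a minimal sketch, not part of the original slides: the function names `classify` and `classify_with_tree` and the nested-dictionary layout are assumptions chosen for illustration.

```python
def classify(z, x):
    """Apply the four IF-THEN rules of the rule base, one per leaf."""
    if z == "z2" and x == "x1":
        return "c2"
    if z == "z2" and x == "x2":
        return "c1"
    if z == "z1" and x == "x1":
        return "c1"
    if z == "z1" and x == "x2":
        return "c2"
    raise ValueError(f"no rule covers Z={z!r}, X={x!r}")


# The same knowledge as a tree: feature Z at the root, feature X below it,
# class labels at the leaves (hypothetical nested-dict representation).
TREE = {
    "z2": {"x1": "c2", "x2": "c1"},
    "z1": {"x1": "c1", "x2": "c2"},
}


def classify_with_tree(z, x):
    """Walk the tree from the root (Z) through X down to a leaf."""
    return TREE[z][x]


print(classify("z2", "x1"), classify_with_tree("z2", "x1"))
```

Both views give the same answer for every combination of Z and X, which is why a tree can be presented either graphically or as rules.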

Page 16: #11 opentalks

Decision Trees Induction

Training data:
X   Y   Z   C
x1  y1  z1  c1
x2  y2  z2  c1
x1  y2  z2  c2
x2  y1  z1  c2

Tree so far:
? (Z)

Page 17: #11 opentalks

Decision Trees Induction

Training data:
X   Y   Z   C
x1  y1  z1  c1
x2  y2  z2  c1
x1  y2  z2  c2
x2  y1  z1  c2

Tree so far:
? (Z)

Selects the feature with the best purity measure (entropy, information gain, gain ratio, Gini index, ...)
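For reference, the usual textbook definitions of these measures are sketched below. The notation is an assumption, not taken from the slides: $S$ is the set of examples at a node, $p_i$ the fraction of $S$ belonging to class $i$ (out of $k$ classes), and $S_v$ the subset of $S$ where feature $A$ takes the value $v$.

```latex
\begin{align}
H(S) &= -\sum_{i=1}^{k} p_i \log_2 p_i
  && \text{(entropy)} \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
  && \text{(information gain)} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}
  {-\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}
  && \text{(gain ratio)} \\
\mathrm{Gini}(S) &= 1 - \sum_{i=1}^{k} p_i^2
  && \text{(Gini index)}
\end{align}
```

Entropy and the Gini index are both zero for a pure node (all examples in one class) and largest when the classes are evenly mixed.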

Page 18: #11 opentalks

Decision Trees Induction

Training data:
X   Y   Z   C
x1  y1  z1  c1
x2  y2  z2  c1
x1  y2  z2  c2
x2  y1  z1  c2

Tree so far:
Z
├─ z2 → ? (X)
└─ z1

Subset at the z2 branch:
X   Y   Z   C
x2  y2  z2  c1
x1  y2  z2  c2

Stop criterion was reached?
Yes → Leaf Node
No → Internal Node

Selects the feature with the best purity measure.

Page 19: #11 opentalks

Decision Trees Induction

Subset at the z2 branch:
X   Y   Z   C
x2  y2  z2  c1
x1  y2  z2  c2

Subset at the x1 branch:
X   Y   Z   C
x1  y2  z2  c2

Tree so far:
Z
├─ z2 → X
│   ├─ x1 → c2?
│   └─ x2
└─ z1

Page 20: #11 opentalks

Decision Trees Induction

Subset at the x1 branch:
X   Y   Z   C
x1  y2  z2  c2

Tree so far:
Z
├─ z2 → X
│   ├─ x1 → c2?
│   └─ x2
└─ z1

Stop criterion was reached?
Yes → Leaf Node
No → Internal Node

Page 21: #11 opentalks

Decision Trees Induction

Subset at the x1 branch:
X   Y   Z   C
x1  y2  z2  c2

Tree so far:
Z
├─ z2 → X
│   ├─ x1 → c2
│   └─ x2
└─ z1

Page 22: #11 opentalks

Decision Trees Induction

The induction process is recursive and ends when there are no more nodes to be divided.

Z
├─ z2 → X
│   ├─ x1 → c2
│   └─ x2 → c1
└─ z1 → X
    ├─ x1 → c1
    └─ x2 → c2
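The recursion described above can be sketched in a few lines of Python. This is an illustrative ID3-style sketch under stated assumptions, not the speaker's implementation: it uses information gain as the purity measure, stops when a node is pure or no features remain, and the names `induce`, `predict`, and the tuple-based tree layout are hypothetical. For the X/Y/Z/C table from the slides all three features tie at the root, so the sketch may pick a different root than the Z shown on the slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the node minus the weighted entropy of its children."""
    gain = entropy([r[target] for r in rows])
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


def induce(rows, features, target):
    """Recursively build a tree: a leaf is a label, a node is (feature, children)."""
    labels = [r[target] for r in rows]
    # Stop criterion (assumed): pure node, or nothing left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Select the feature with the best purity measure; ties keep the first one.
    best = max(features, key=lambda f: information_gain(rows, f, target))
    remaining = [f for f in features if f != best]
    children = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        children[value] = induce(subset, remaining, target)
    return (best, children)


def predict(tree, row):
    """Follow the branches matching the row until a leaf (a label) is reached."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[row[feature]]
    return tree


# Training table from the slides.
ROWS = [
    {"X": "x1", "Y": "y1", "Z": "z1", "C": "c1"},
    {"X": "x2", "Y": "y2", "Z": "z2", "C": "c1"},
    {"X": "x1", "Y": "y2", "Z": "z2", "C": "c2"},
    {"X": "x2", "Y": "y1", "Z": "z1", "C": "c2"},
]

tree = induce(ROWS, ["X", "Y", "Z"], "C")
print(tree)
```

Whatever root the tie-breaking selects, the induced tree classifies all four training rows correctly, which is the property the slides illustrate.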

Page 23: #11 opentalks

Purity measure

A   B   C   X
a1  b1  c1  x1
a2  b2  c1  x1
a1  b2  c2  x2
a2  b1  c2  x2

Page 24: #11 opentalks

Purity measure

A   B   C   X
a1  b1  c1  x1
a2  b2  c1  x1
a1  b2  c2  x2
a2  b1  c2  x2

Split on A:
A = a1: (a1 b1 c1 x1), (a1 b2 c2 x2)
A = a2: (a2 b2 c1 x1), (a2 b1 c2 x2)

Split on B:
B = b1: (a1 b1 c1 x1), (a2 b1 c2 x2)
B = b2: (a2 b2 c1 x1), (a1 b2 c2 x2)

Split on C:
C = c1: (a1 b1 c1 x1), (a2 b2 c1 x1)
C = c2: (a1 b2 c2 x2), (a2 b1 c2 x2)
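To make the comparison between the three candidate splits concrete, here is a small sketch that scores each one with the Gini index, treating X as the class column. The helper names `gini` and `weighted_gini` are assumptions for illustration; the slides do not say which measure was used.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 0.0 means the node is pure."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(rows, feature, target):
    """Average Gini index of the partitions produced by splitting on `feature`."""
    score = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        score += len(subset) / len(rows) * gini(subset)
    return score


# Table from the slide; X is the class to be predicted.
ROWS = [
    dict(zip("ABCX", row))
    for row in [
        ("a1", "b1", "c1", "x1"),
        ("a2", "b2", "c1", "x1"),
        ("a1", "b2", "c2", "x2"),
        ("a2", "b1", "c2", "x2"),
    ]
]

scores = {feature: weighted_gini(ROWS, feature, "X") for feature in "ABC"}
best = min(scores, key=scores.get)  # lower impurity is better
print(scores, best)
```

Splitting on A or B leaves both partitions evenly mixed (score 0.5), while splitting on C yields two pure partitions (score 0.0), so C is the feature a tree inducer would select here.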

Page 26: #11 opentalks

Stop criterion

● Number of instances in the node
● The gain obtained by the division reaches a predefined threshold
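One possible reading of these two criteria is sketched below. It assumes that a node stops being divided when it holds fewer than a minimum number of instances, or when the best achievable gain falls below a threshold; the function name, parameter names, and default values are all hypothetical.

```python
def should_stop(n_instances, best_gain, min_instances=2, min_gain=1e-9):
    """Return True when the node should become a leaf instead of being divided.

    Assumed interpretation of the slide:
    - too few instances in the node to justify another division, or
    - the gain of the best division is below the predefined threshold.
    """
    too_few_instances = n_instances < min_instances
    gain_too_small = best_gain < min_gain
    return too_few_instances or gain_too_small


print(should_stop(n_instances=10, best_gain=0.5))
```

Raising either threshold produces smaller trees, since more nodes are turned into leaves early.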

Page 27: #11 opentalks

Decision Trees - Summary

Training set:
A1   A2   An   Class
x1   y1   z1   Cx
x2   y2   z2   Cy
...  ...  ...  ...
xm   ym   zm   Cx

Training set → Induction

Page 28: #11 opentalks

Decision Trees - Summary

Training set:
A1   A2   An   Class
x1   y1   z1   Cx
x2   y2   z2   Cy
...  ...  ...  ...
xm   ym   zm   Cx

Unknown data to be classified:
A1   A2   An   Class
x2   ym   z1   ?

Training set → Induction → ?

Page 29: #11 opentalks

Decision Trees - Summary

Training set:
A1   A2   An   Class
x1   y1   z1   Cx
x2   y2   z2   Cy
...  ...  ...  ...
xm   ym   zm   Cx

Unknown data to be classified:
A1   A2   An   Class
x2   ym   z1   Cx

Training set → Induction → ?

Unknown data to be classified → Inference → Cx

Page 30: #11 opentalks

How to validate?

Examples (used as training set) → Induction → Classifier

Page 31: #11 opentalks

How to validate?

Examples (used as test set) → Classifier → Validation

Hmmm… This classifier is 98% accurate!!!

But… is that close to reality??

Page 32: #11 opentalks

Resampling methods

● To estimate the future accuracy of the classifier
● Examples that are used in the training set should not be used in the test set

Page 33: #11 opentalks

Resampling methods - Holdout

Examples: E1 E2 E3 E4 E5 E6 E7 E8 E9 E10

Training set: E1 E2 E3 E4 E5 → Induction → Classifier

Test set: E6 E7 E8 E9 E10 → Validation
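The holdout split can be sketched as follows. This is a minimal illustration, not the speaker's code: `holdout_split` and its parameters are hypothetical names, and shuffling is off by default only so that the output matches the E1..E5 / E6..E10 split shown on the slide. In practice the examples are usually shuffled first.

```python
import random


def holdout_split(examples, test_fraction=0.5, shuffle=False, seed=None):
    """Split examples into disjoint training and test sets."""
    pool = list(examples)
    if shuffle:
        random.Random(seed).shuffle(pool)
    n_test = int(round(len(pool) * test_fraction))
    split = len(pool) - n_test
    return pool[:split], pool[split:]


EXAMPLES = [f"E{i}" for i in range(1, 11)]

# Same split as on the slide: first half for induction, second half for validation.
train, test = holdout_split(EXAMPLES)
print(train, test)

# A shuffled variant; the two sets are still disjoint.
train_s, test_s = holdout_split(EXAMPLES, shuffle=True, seed=0)
```

Because no example appears in both lists, the accuracy measured on the test set is an estimate for data the classifier has not seen.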

Page 34: #11 opentalks

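Resampling methods - Cross Validation

Examples: E1 E2 E3 E4 E5 E6 E7 E8 E9 E10

The examples are split into a training set and a test set five times:

Training set → Induction → Classifier1 → Validation on test set → Accuracy1
Training set → Induction → Classifier2 → Validation on test set → Accuracy2
Training set → Induction → Classifier3 → Validation on test set → Accuracy3
Training set → Induction → Classifier4 → Validation on test set → Accuracy4
Training set → Induction → Classifier5 → Validation on test set → Accuracy5

Accuracy1 ... Accuracy5 → Mean Accuracy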
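A five-fold version of this procedure might look like the sketch below. Everything beyond the E1..E10 names is an assumption made for illustration: the contiguous fold layout, the function names, the made-up yes/no labels, and the trivial majority-class "classifier" that stands in for a real induction algorithm.

```python
from collections import Counter
from statistics import mean


def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; every example is tested exactly once."""
    indices = list(range(n_examples))
    sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(examples, labels, k, induce, accuracy):
    """Induce one classifier per fold and return (fold accuracies, mean accuracy)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        classifier = induce([examples[i] for i in train_idx],
                            [labels[i] for i in train_idx])
        scores.append(accuracy(classifier,
                               [examples[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return scores, mean(scores)


def induce_majority(train_x, train_y):
    """Placeholder inducer: always predicts the most frequent training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda example: majority


def accuracy(classifier, test_x, test_y):
    """Fraction of test examples whose predicted label matches the true label."""
    hits = sum(classifier(x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


EXAMPLES = [f"E{i}" for i in range(1, 11)]
LABELS = ["yes"] * 7 + ["no"] * 3  # made-up labels, for illustration only

scores, mean_accuracy = cross_validate(EXAMPLES, LABELS, 5, induce_majority, accuracy)
print(scores, mean_accuracy)
```

Each example lands in the test set of exactly one fold, so the mean accuracy uses all the data for validation without ever testing a classifier on examples it was trained on.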

Page 35: #11 opentalks

Example in Python

Page 36: #11 opentalks

Example in Spark