#11 opentalks
TRANSCRIPT
Machine Learning
Mariana Lopes
[email protected]@gmail.com
About me
Mariana Vieira Ribeiro Lopes
B.Sc. in Computer Science (Class of 2008) - UFSCar
M.Sc. in Computer Science (Artificial Intelligence) - UFSCar
Data Scientist - Bravi (since August 2015)
Tutor in the distance-learning (EaD) Information Systems (BSI) program - UFSCar
Machine Learning
Computer Vision
Robotics
Game Theory
Speech Recognition
Natural Language Processing
Reasoning and Decision Making

[Diagram: Machine Learning as one subfield within Artificial Intelligence]
Machine Learning Paradigms
● Symbolic Learning and Rule Induction
○ Learning from examples, decision trees (DTs)
● Probabilistic and Statistical Models
○ Bayesian models, Naive Bayes models, SVMs
● Evolution-based Algorithms
○ Genetic Algorithms, Evolution Strategies
● Neural Networks
Machine Learning
Examples (training data) → Induction → New Knowledge

Machine Learning
Unlabeled Data → Induction → New Knowledge
Labeled Data → Induction → New Knowledge
Labeled
weather  temperature  humidity  wind  play
sunny    85           85        F     no
sunny    80           90        T     no
cloudy   83           86        F     yes
rainy    70           96        F     yes
rainy    68           80        F     yes
rainy    65           70        T     no
cloudy   64           65        T     yes
sunny    72           95        F     no
sunny    69           70        F     yes
rainy    75           80        F     yes
sunny    75           70        T     yes
cloudy   72           90        T     yes
cloudy   81           75        F     yes
rainy    71           91        T     no
Unlabeled
weather  temperature  humidity  wind
sunny    85           85        F
sunny    80           90        T
cloudy   83           86        F
rainy    70           96        F
rainy    68           80        F
rainy    65           70        T
cloudy   64           65        T
sunny    72           95        F
sunny    69           70        F
rainy    75           80        F
sunny    75           70        T
cloudy   72           90        T
cloudy   81           75        F
rainy    71           91        T
Machine Learning
Unlabeled Data → Induction → New Knowledge
Unsupervised Learning: identify patterns in the data
Machine Learning
Labeled Data → Induction → New Knowledge
Supervised Learning: determine the correct class of unlabeled data; predict future events
Discrete label → Classification problem
Continuous label → Regression problem
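A quick way to see the difference in code — a minimal sketch assuming scikit-learn (the library is not named in the talk), with features taken from the weather table; the continuous "hours played" targets are invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy features taken from the weather table: [temperature, humidity]
X = [[85, 85], [80, 90], [83, 86], [70, 96], [68, 80], [65, 70]]

# Discrete label ("play": yes/no) -> a classification problem
y_class = ["no", "no", "yes", "yes", "yes", "no"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[72, 90]])[0])   # one of the discrete classes

# Continuous label (hypothetical "hours played") -> a regression problem
y_reg = [0.0, 0.5, 3.0, 2.5, 2.0, 0.0]
reg = DecisionTreeRegressor().fit(X, y_reg)
print(reg.predict([[72, 90]])[0])   # a continuous value
```

The same tree-induction idea serves both problem types; only the label type (and the leaf-value computation) changes.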
Knowledge can be represented by inductive models. Example:
Decision Trees
Decision Trees
● Induced from labeled training data (the knowledge base)
● Generalizes knowledge and learns concepts
● Classifies unknown data
Intuitive
Accurate
Interpretable (graphical representation or rules)
Embedded feature selection
Robust method
Scalable method
Low cost for induction and inference
Handles continuous and discrete features
Popular
Decision Trees
[Tree diagram: root node Z; two internal nodes X; leaf nodes c2, c1, c1, c2]

Decision Trees
[Same tree, annotated: the root and internal nodes test features (Z, X); the leaf nodes carry labels (c1, c2)]
Decision Trees
[Tree: Z at the root; Z = z2 → X (x1 → c2, x2 → c1); Z = z1 → X (x1 → c1, x2 → c2)]
Rule Base:
IF Z = z2 AND X = x1 THEN C = c2
IF Z = z2 AND X = x2 THEN C = c1
IF Z = z1 AND X = x1 THEN C = c1
IF Z = z1 AND X = x2 THEN C = c2
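This rule base translates directly into code. A minimal sketch (the function name `classify` is mine, not from the talk):

```python
def classify(z, x):
    """Apply the rule base extracted from the decision tree."""
    if z == "z2" and x == "x1":
        return "c2"
    if z == "z2" and x == "x2":
        return "c1"
    if z == "z1" and x == "x1":
        return "c1"
    if z == "z1" and x == "x2":
        return "c2"
    raise ValueError("unknown feature values")

print(classify("z2", "x1"))  # -> c2
```

Each root-to-leaf path of the tree becomes exactly one IF rule, which is why trees are considered interpretable.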
Decision Trees Induction
Training data:
X   Y   Z   C
x1  y1  z1  c1
x2  y2  z2  c1
x1  y2  z2  c2
x2  y1  z1  c2
[The induction starts at the root node, marked '?': which feature should be tested there?]
Decision Trees Induction
Select the feature with the best purity measure (entropy, information gain, gain ratio, Gini index, ...). Here Z is tested at the root; the branch Z = z2 keeps the subset:
X   Y   Z   C
x2  y2  z2  c1
x1  y2  z2  c2
Has the stop criterion been reached?
Yes → leaf node
No → internal node
Decision Trees Induction
Select the feature with the best purity measure.
X   Y   Z   C
x1  y1  z1  c1
x2  y2  z2  c1
x1  y2  z2  c2
x2  y1  z1  c2
[Tree so far: Z at the root (branches z2, z1); under z2, X is tested (branches x1, x2); the x1 branch holds the subset (x1, y2, z2, c2), the other branches are still open ('?')]
Decision Trees Induction
Subset at the node X (Z = z2):
X   Y   Z   C
x2  y2  z2  c1
x1  y2  z2  c2
Has the stop criterion been reached?
Yes → leaf node
No → internal node
[The x1 branch, with the pure subset (x1, y2, z2, c2), is closed as the leaf c2]
Decision Trees Induction
The induction process is recursive; it ends when there are no more nodes to be split.
[Final tree: Z at the root; Z = z2 → X (x1 → c2, x2 → c1); Z = z1 → X (x1 → c1, x2 → c2)]
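The recursive procedure walked through on these slides can be written out in a few lines of plain Python. This is a minimal ID3-style sketch (the names `induce` and `entropy` are mine, not from the talk); note that on this XOR-like dataset every single feature gives the same purity at the root, so ties are broken by feature-list order, and the order `["Z", "X", "Y"]` is chosen so the result matches the tree on the slides:

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the target label within a set of rows."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def induce(rows, features, target):
    """Recursively grow a decision tree; a node stops (becomes a leaf
    with the majority label) when it is pure or no features remain."""
    if len({r[target] for r in rows}) == 1 or not features:
        return Counter(r[target] for r in rows).most_common(1)[0][0]

    def split_entropy(f):
        # weighted entropy of the partition induced by feature f
        return sum(
            len(sub) / len(rows) * entropy(sub, target)
            for v in {r[f] for r in rows}
            for sub in [[r for r in rows if r[f] == v]]
        )

    best = min(features, key=split_entropy)   # best purity measure
    rest = [f for f in features if f != best]
    return {best: {v: induce([r for r in rows if r[best] == v], rest, target)
                   for v in {r[best] for r in rows}}}

# The four training examples from the slides
rows = [
    {"X": "x1", "Y": "y1", "Z": "z1", "C": "c1"},
    {"X": "x2", "Y": "y2", "Z": "z2", "C": "c1"},
    {"X": "x1", "Y": "y2", "Z": "z2", "C": "c2"},
    {"X": "x2", "Y": "y1", "Z": "z1", "C": "c2"},
]
tree = induce(rows, ["Z", "X", "Y"], "C")
print(tree)  # Z at the root, then X under each branch, as on the slides
```

The nested-dict result encodes the same tree as the diagram: the outer key is the tested feature, each branch value is either a subtree or a leaf label.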
Purity measure
A   B   C   X
a1  b1  c1  x1
a2  b2  c1  x1
a1  b2  c2  x2
a2  b1  c2  x2

Candidate splits of this training set:
Split on A:
  A = a1 → {(a1, b1, c1, x1), (a1, b2, c2, x2)}  — mixed labels
  A = a2 → {(a2, b2, c1, x1), (a2, b1, c2, x2)}  — mixed labels
Split on B:
  B = b1 → {(a1, b1, c1, x1), (a2, b1, c2, x2)}  — mixed labels
  B = b2 → {(a2, b2, c1, x1), (a1, b2, c2, x2)}  — mixed labels
Split on C:
  C = c1 → {(a1, b1, c1, x1), (a2, b2, c1, x1)}  — pure: all x1
  C = c2 → {(a1, b2, c2, x2), (a2, b1, c2, x2)}  — pure: all x2
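The purity comparison above can be made numeric with entropy and information gain. A short sketch (the helper names are mine) that confirms C is the only split producing pure subsets:

```python
import math
from collections import Counter

rows = [
    {"A": "a1", "B": "b1", "C": "c1", "X": "x1"},
    {"A": "a2", "B": "b2", "C": "c1", "X": "x1"},
    {"A": "a1", "B": "b2", "C": "c2", "X": "x2"},
    {"A": "a2", "B": "b1", "C": "c2", "X": "x2"},
]

def entropy(subset):
    """Entropy of the label X within a subset of rows."""
    counts = Counter(r["X"] for r in subset)
    n = len(subset)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(feature):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    n = len(rows)
    remainder = sum(
        len(sub) / n * entropy(sub)
        for value in {r[feature] for r in rows}
        for sub in [[r for r in rows if r[feature] == value]]
    )
    return entropy(rows) - remainder

for f in ["A", "B", "C"]:
    print(f, information_gain(f))
# A and B leave both branches mixed (gain 0); C yields two pure branches (gain 1)
```

An induction algorithm would therefore choose C as the split feature here.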
Stop criterion
● Number of instances in the node
● The gain obtained by the split reaches a pre-defined threshold
Decision Trees - Summary
Training set:
A1  A2  ...  An  Class
x1  y1  ...  z1  Cx
x2  y2  ...  z2  Cy
...
xm  ym  ...  zm  Cx
→ Induction
Decision Trees - Summary
Training set:
A1  A2  ...  An  Class
x1  y1  ...  z1  Cx
x2  y2  ...  z2  Cy
...
xm  ym  ...  zm  Cx
→ Induction
Unknown data to be classified:
A1  A2  ...  An  Class
x2  ym  ...  z1  ?
Decision Trees - Summary
Training set:
A1  A2  ...  An  Class
x1  y1  ...  z1  Cx
x2  y2  ...  z2  Cy
...
xm  ym  ...  zm  Cx
→ Induction
Unknown data, classified by inference with the induced tree:
A1  A2  ...  An  Class
x2  ym  ...  z1  Cx
How to validate?
Examples → Induction → Classifier (the examples are used as the training set)

How to validate?
Examples → Validation → Classifier (the examples are used as the test set)
Hmmm… This classifier is 98% accurate!!!
But… is that close to reality??
Resampling methods
● To estimate the future accuracy of the classifier
● Examples used in the training set should not be used in the test set
Resampling methods - Holdout
Examples: E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Training set: E1 E2 E3 E4 E5 E6 E7 → Induction → Classifier
Test set: E8 E9 E10 → Validation
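A holdout split is just one shuffle and one cut. A minimal sketch in plain Python (the helper name `holdout_split` and the 70/30 fraction and seed are illustrative choices):

```python
import random

examples = [f"E{i}" for i in range(1, 11)]  # E1 .. E10

def holdout_split(examples, train_fraction=0.7, seed=42):
    """Shuffle once, then cut into a training set and a test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(examples)
print(len(train), len(test))  # 7 3
```

Because the two slices never overlap, no training example leaks into the validation step.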
Resampling methods - Cross Validation
Examples: E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
The examples are partitioned into 5 folds of 2; each fold is the test set exactly once, and the remaining 8 examples form the training set:
Classifier1: test set E1 E2  → Accuracy1
Classifier2: test set E3 E4  → Accuracy2
Classifier3: test set E5 E6  → Accuracy3
Classifier4: test set E7 E8  → Accuracy4
Classifier5: test set E9 E10 → Accuracy5
Induction on each training set, Validation on each test set → Mean Accuracy = mean of Accuracy1 ... Accuracy5
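The k-fold scheme can be sketched in a few lines of plain Python (the helper `k_fold_splits` is mine, not from the talk):

```python
def k_fold_splits(examples, k):
    """Yield (training set, test set) pairs: each fold is the test set once."""
    fold_size = len(examples) // k
    for i in range(k):
        test = examples[i * fold_size:(i + 1) * fold_size]
        train = [e for e in examples if e not in test]
        yield train, test

examples = [f"E{i}" for i in range(1, 11)]  # E1 .. E10
for train, test in k_fold_splits(examples, 5):
    print(test)  # each pair E1-E2, E3-E4, ..., E9-E10 serves as test set once
```

In a full run one would induce a classifier on each training set, measure its accuracy on the matching test set, and report the mean of the five accuracies.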
Example in Python
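The transcript does not include the demo code itself. A plausible minimal version, assuming scikit-learn's `DecisionTreeClassifier` on the weather dataset from the earlier slides (the integer encodings for the categorical columns are my own simplification), might look like this:

```python
from sklearn.tree import DecisionTreeClassifier

# The labeled weather dataset from the earlier slides
data = [
    ("sunny", 85, 85, "F", "no"),   ("sunny", 80, 90, "T", "no"),
    ("cloudy", 83, 86, "F", "yes"), ("rainy", 70, 96, "F", "yes"),
    ("rainy", 68, 80, "F", "yes"),  ("rainy", 65, 70, "T", "no"),
    ("cloudy", 64, 65, "T", "yes"), ("sunny", 72, 95, "F", "no"),
    ("sunny", 69, 70, "F", "yes"),  ("rainy", 75, 80, "F", "yes"),
    ("sunny", 75, 70, "T", "yes"),  ("cloudy", 72, 90, "T", "yes"),
    ("cloudy", 81, 75, "F", "yes"), ("rainy", 71, 91, "T", "no"),
]

# Hand-rolled integer codes for the categorical columns (a simplification;
# sklearn trees treat them as ordinal numbers)
weather_code = {"sunny": 0, "cloudy": 1, "rainy": 2}
wind_code = {"F": 0, "T": 1}
X = [[weather_code[w], t, h, wind_code[wd]] for w, t, h, wd, _ in data]
play = [label for *_, label in data]

clf = DecisionTreeClassifier(random_state=0).fit(X, play)
print(clf.score(X, play))  # a fully grown tree fits the training set: 1.0
print(clf.predict([[weather_code["sunny"], 70, 65, wind_code["F"]]])[0])
```

Note that the 1.0 here is training accuracy; as the validation slides stress, real accuracy must be estimated on held-out data (holdout or cross-validation).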
Example in Spark