Machine Learning [email protected] Winter 2012


DESCRIPTION

Overview of common classifiers in machine learning.

TRANSCRIPT

Page 1: Machine Learning

Machine Learning

[email protected] 2012

Page 2: Machine Learning

Machine Learning

Classification: Predicting discrete values
Regression: Predicting continuous values
Clustering: Detecting similar groups
Optimization: Finding input that maximizes output

Page 3: Machine Learning

Machine Learning

Classification and Regression

Imagine an omniscient oracle who answers questions:

oracle( question ) = answer

Goal: From previous questions and answers, create a function that approximates the oracle

f( question ) -> oracle( question ) as examples -> ∞
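A minimal sketch of this idea in Python (the sine "oracle" and scikit-learn's KNeighborsRegressor are stand-ins, not from the slides):

# Sketch: approximate an unknown oracle from example question/answer pairs.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def oracle(question):
    return np.sin(question)            # pretend we cannot see inside this

questions = np.random.uniform(0, 10, size=(1000, 1))
answers = oracle(questions).ravel()    # previous questions and their answers

f = KNeighborsRegressor(n_neighbors=5).fit(questions, answers)

# With more examples, f(question) approaches oracle(question).
print(f.predict([[2.5]]), oracle(2.5))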

Page 4: Machine Learning

Classification: Predicting discrete values

ExampleGiven:

Shape     Color   Width(in)  Weight(g)  Calories  Taste   Type
Round     Red     4.2        205        73        Sweet   Apple
Round     Green   3.7        145        52        Sour    Apple
Round     Orange  3.2        131        62        Sweet   Orange
Round     Orange  5.7        181        75        Bitter  Grapefruit
Cylinder  Yellow  1.5        140        123       Sweet   Banana
Oval      Yellow  2.2        58         17        Sour    Lemon
Round     Purple  0.7        2.4        2         Sweet   Grape
Round     Green   2.0        65         45        Tart    Kiwi
Round     Green   8.0        4518       1366      Sweet   Watermelon

Predict Type:

Shape     Color   Width(in)  Weight(g)  Calories  Taste   Type
Round     Red     5.2        193        78        Bitter  ?
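As an illustration, one way to fit a classifier to this table with scikit-learn (a decision tree and one-hot encoding are just one reasonable choice for the discrete columns):

# Sketch: predict the unknown fruit from the table above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

cols = ["Shape", "Color", "Width", "Weight", "Calories", "Taste"]
rows = [
    ("Round",    "Red",    4.2,  205,   73, "Sweet"),   # Apple
    ("Round",    "Green",  3.7,  145,   52, "Sour"),    # Apple
    ("Round",    "Orange", 3.2,  131,   62, "Sweet"),   # Orange
    ("Round",    "Orange", 5.7,  181,   75, "Bitter"),  # Grapefruit
    ("Cylinder", "Yellow", 1.5,  140,  123, "Sweet"),   # Banana
    ("Oval",     "Yellow", 2.2,   58,   17, "Sour"),    # Lemon
    ("Round",    "Purple", 0.7,  2.4,    2, "Sweet"),   # Grape
    ("Round",    "Green",  2.0,   65,   45, "Tart"),    # Kiwi
    ("Round",    "Green",  8.0, 4518, 1366, "Sweet"),   # Watermelon
]
labels = ["Apple", "Apple", "Orange", "Grapefruit", "Banana",
          "Lemon", "Grape", "Kiwi", "Watermelon"]

X = pd.get_dummies(pd.DataFrame(rows, columns=cols))   # one-hot the discrete columns
tree = DecisionTreeClassifier().fit(X, labels)

query = pd.DataFrame([("Round", "Red", 5.2, 193, 78, "Bitter")], columns=cols)
query = pd.get_dummies(query).reindex(columns=X.columns, fill_value=0)
print(tree.predict(query))   # the tree's guess for the mystery fruit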

Page 5: Machine Learning

Classification: Predicting discrete values

Decision Trees
  Inputs: Discrete and Continuous
  Labels: n
  Rule: Which leaf of a binary tree do I belong to?

Support Vector Machines
  Inputs: Continuous
  Labels: 2
  Rule: Which side of a hyperplane am I on?

Nearest Neighbors
  Inputs: Continuous
  Labels: n
  Rule: Who am I closest to?

Naïve Bayes
  Inputs: Discrete and Continuous
  Labels: n
  Rule: What am I most likely to be?

Neural Networks
  Inputs: Continuous
  Labels: n
  Rule: Which node do I map to after moving through a weighted network?

Page 6: Machine Learning

Classification: Predicting discrete values

Decision Trees
  Inputs: Discrete and Continuous
  Labels: n
  Rule: Which leaf of a binary tree do I belong to?

Nearest Neighbors
  Inputs: Continuous
  Labels: n
  Rule: Who am I closest to?

Naïve Bayes
  Inputs: Discrete and Continuous
  Labels: n
  Rule: What am I most likely to be?

Neural Networks
  Inputs: Continuous
  Labels: n
  Rule: Which node do I map to after moving through a weighted network?

Page 7: Machine Learning

Classification: Predicting discrete values

Support Vector Machines
  Inputs: Continuous
  Labels: 2
  Rule: Which side of a hyperplane am I on?

Page 8: Machine Learning

Classification: Predicting discrete values

Support Vector Machines
  Inputs: Continuous
  Labels: 2
  Rule: Which side of a hyperplane am I on?

Page 9: Machine Learning

Classification: Predicting discrete values

Support Vector Machines
  Inputs: Continuous
  Labels: 2
  Rule: Which side of a hyperplane am I on?

Kernel Trick: You can compute distances in higher dimensions, even infinite, without actually moving there. (Mercer’s condition)

Page 10: Machine Learning

Classification: Predicting discrete values

Support Vector Machines
  Inputs: Continuous
  Labels: 2
  Rule: Which side of a hyperplane am I on?

Common Kernels:
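For reference, the kernels most often used with SVMs are the linear, polynomial, and RBF (Gaussian) kernels; a plain-Python sketch of each (NumPy assumed, parameter values arbitrary):

# Standard SVM kernels written out for reference.
import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, c=1.0, d=3):
    return (np.dot(x1, x2) + c) ** d

def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.dot(x1 - x2, x1 - x2))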

Page 11: Machine Learning

Classification: Predicting discrete values

Support Vector Machines
  Inputs: Continuous
  Labels: 2
  Rule: Which side of a hyperplane am I on?

Kernel: Homogeneous Polynomial k(x1, x2) = (x1 ∙ x2)**2
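A quick numeric check of the kernel trick for this kernel: in 2-D, (x1 ∙ x2)**2 equals an ordinary dot product after the explicit map phi(a, b) = (a², √2·ab, b²), so the higher-dimensional space never has to be built explicitly:

# Verify that the degree-2 homogeneous polynomial kernel equals a dot
# product in an explicitly expanded feature space (2-D input assumed).
import numpy as np

def phi(x):
    a, b = x
    return np.array([a * a, np.sqrt(2) * a * b, b * b])

def k(x1, x2):
    return np.dot(x1, x2) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

print(k(x1, x2))                  # kernel computed in the original space
print(np.dot(phi(x1), phi(x2)))   # same value, computed the long way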

Page 12: Machine Learning

Classification: Predicting discrete values

Naïve Bayes
  Inputs: Discrete and Continuous
  Labels: n
  Rule: What am I most likely to be?

Naïve Assumption:
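The naïve assumption, stated in its usual form, is that the features are conditionally independent given the class:

P(x1, …, xn | C) = P(x1 | C) · P(x2 | C) · … · P(xn | C)

so the predicted label is the class C that maximizes P(C) · P(x1 | C) · … · P(xn | C).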

Page 13: Machine Learning

Classification: Predicting discrete values

How the classifiers see the same data:

[Figure: Decision Trees, Support Vector Machines, Naïve Bayes]
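A minimal sketch of reproducing this kind of comparison with scikit-learn (the moons dataset and the RBF SVM are arbitrary choices, not from the slide):

# Fit the three classifiers on the same 2-D data and plot their boundaries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
models = {"Decision Tree": DecisionTreeClassifier(),
          "SVM (RBF)": SVC(kernel="rbf"),
          "Naive Bayes": GaussianNB()}

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X, y)
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)      # each model carves the space differently
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(name)
plt.show()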

Page 14: Machine Learning

Classifier Ensembles

An ensemble is a collection of classifiers, each trained on a different subset of the training data. At prediction time, the classifiers vote on the correct label. The result is a probability for each label, based on the proportion of classifiers that voted for it; one typically takes the label with the highest probability.

Voting strategies:

Bagging: Multiple classifiers vote on the correct label; all votes are counted equally.

Boosting: Multiple classifiers vote, but each classifier's vote is weighted by its error rate on a reserved test set.
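A minimal sketch of bagging (hand-rolled bootstrap sampling and an unweighted majority vote; scikit-learn decision trees assumed as the base classifier, X and y assumed to be NumPy arrays):

# Bagging by hand: each tree trains on a bootstrap sample,
# then all trees get an equal vote on the label.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_bagged(X, y, n_classifiers=25):
    ensemble = []
    for _ in range(n_classifiers):
        idx = np.random.randint(0, len(X), size=len(X))   # sample with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def predict_vote(ensemble, x):
    votes = Counter(clf.predict([x])[0] for clf in ensemble)
    label, count = votes.most_common(1)[0]
    return label, count / len(ensemble)   # winning label plus its vote share

# Boosting differs only in that each vote would be weighted by the
# classifier's measured error rate instead of being counted equally.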

Page 15: Machine Learning

Random Forests

A Random Forest is an ensemble of decision trees. Under suitable conditions, Random Forests are consistent: as the number of training examples grows, their predictions converge to the oracle's.

Gerard Biau (Ecole Normale Superieure), "Analysis of a Random Forests Model", Journal of Machine Learning Research (2012)

“Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm.

In this paper, we […] show that the procedure is consistent and adapts to sparsity, in the sense that [the] rate of convergence depends only on the number of strong features and not on how many noise variables are present.”
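In code, the whole forest is typically driven through a single class; a minimal scikit-learn sketch (the synthetic dataset is an arbitrary stand-in):

# A Random Forest is bagged decision trees with extra per-split randomness.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)   # a few strong features, the rest noise
forest = RandomForestClassifier(n_estimators=100).fit(X, y)
print(forest.predict(X[:5]), forest.predict_proba(X[:5])[0])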

Page 16: Machine Learning

Naïve Bayes

Naïve Bayes classifiers appear to give good results in practice even though they are based on a potentially unrealistic assumption. Recently, researchers have been attempting to explain the conditions that lead to successful Naïve Bayes classifiers.

Harry Zhang (University of New Brunswick), "The Optimality of Naïve Bayes", American Association for Artificial Intelligence (2004)

“In a given dataset, two attributes may depend on each other, but the dependence may distribute evenly in each class. Clearly, in this case, the conditional independence assumption is violated, but Naïve Bayes is still the optimal classifier.

Furthermore […] if we look at two attributes, there may exist strong dependence between them that affects the classification. When the dependencies among all attributes work together, however, they may cancel each other out and no longer affect the classification.”

Page 17: Machine Learning

Choosing a Classifier

Random Forests
  Pros:
    Highly convergent
    Ignores noise
  Cons:
    Whole forest must be retrained (batch)
    Time to classify depends on number of trees

Naïve Bayes
  Pros:
    Fast to train
    Fast to classify
    Incremental updates (stream)
    Evaluates in O(1) time
    Seems to work in practice
  Cons:
    Does not capture feature covariance
    Continuous inputs need distribution estimation
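The "incremental updates (stream)" point can be seen directly in scikit-learn, whose GaussianNB exposes partial_fit; a minimal sketch (the synthetic batches are arbitrary stand-ins):

# Streaming updates with Gaussian Naive Bayes: earlier data is never revisited,
# each new batch just updates the per-class statistics.
import numpy as np
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
classes = np.array([0, 1])
for _ in range(10):                                    # pretend batches arrive over time
    X_batch = np.random.randn(50, 3)
    y_batch = (X_batch[:, 0] > 0).astype(int)
    nb.partial_fit(X_batch, y_batch, classes=classes)  # incremental update

print(nb.predict([[1.0, 0.0, 0.0]]))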

Page 18: Machine Learning

Links

Naïve Bayes in 50 lines of Python
http://ebiquity.umbc.edu/blogger/2010/12/07/naive-bayes-classifier-in-50-lines/

NIST Special Database 19 - Handwriting Samples
http://gorillamatrix.com/files/nist-sd19.rar

Analysis of a Random Forests Model
http://jmlr.csail.mit.edu/papers/volume13/biau12a/biau12a.pdf

The Optimality of Naïve Bayes
http://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/Optimality_of_Naive_Bayes.pdf

Apache Mahout
http://mahout.apache.org/

Programming Collective Intelligence
http://shop.oreilly.com/product/9780596529321.do

Page 19: Machine Learning

Thank you

Page 20: Machine Learning

“Vision without action is a daydream. Action without vision is a nightmare.”

- Japanese Proverb