
An Introduction to Machine Learning
Research Group Meeting 2017

Melissa R. McGuirl

February 22, 2017


Table of Contents

1 Introduction
2 Supervised Learning
    Generalized Linear Models
    Support Vector Machines
    Decision Trees
3 Unsupervised Learning
    Manifold learning
    Clustering
    Neural Networks
4 Model Selection and Evaluation
5 Applications
6 Pitfalls to avoid
7 Machine Learning that Matters
8 Bibliography


Introduction

General Set-up

Set-up and Goal: Suppose we have data samples $X_1, X_2, \dots, X_n$. Can we predict properties of any given $X_{n+1}, X_{n+2}, \dots, X_N$?

Machine learning systems attempt to predict properties of unknown data based on the attributes or features of the data.


Introduction

Supervised learning

1 Data come with attributes that we want our algorithm to predict.

2 The machine learning algorithm is given data attributes and desired outputs, and the goal is to learn a way to map inputs to outputs in a general way.

3 Two main types of learning (see the sketch below):
    1 Classification: the data belong to different classes/groups, and we want to be able to predict which class/group unlabeled data belongs to.
    2 Regression: the data are labeled with one or more continuous variables (parameters), and the task is to predict the values of these variables for unknown data.
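To make the two task types concrete, here is a minimal scikit-learn sketch (not from the slides; the toy data and model choices are assumptions for illustration):

```python
# A classifier predicts a discrete class label; a regressor predicts a
# continuous value. Toy data below are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])             # one feature per sample

clf = LogisticRegression().fit(X, [0, 0, 1, 1])        # classification: class labels
print(clf.predict([[1.8]]))                            # -> a class label

reg = LinearRegression().fit(X, [0.1, 0.9, 2.1, 2.9])  # regression: continuous targets
print(reg.predict([[1.8]]))                            # -> a number, roughly 1.8
```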


Introduction

Image source: http://ipython-books.github.io/featured-04/


Introduction

Unsupervised Learning

1 No labels are given to the algorithm.

2 The training data consist of a set of input vectors $\{X_1, X_2, \dots, X_n\}$ with no target outputs.

3 Three types of goals in this setting:
    1 Clustering: discover groups with similar features within the data.
    2 Density estimation: determine the distribution of the data within the input space.
    3 Dimensionality reduction: project the data into a lower-dimensional space than the input space.


Supervised Learning: Generalized Linear Models

Generalized Linear Models

Set-up: data in the form $X_i = (x_{i1}, x_{i2}, \dots, x_{ip})$.

Goal: regression. Find $y$ with

$$y(w, X_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip},$$

where $w_0$ is called the intercept and $w_1, \dots, w_p$ are the coefficients.


Supervised Learning: Generalized Linear Models

GLM: Ordinary Least Squares

Fits the linear model $y(w, X_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip}$, where the coefficients $w = (w_1, \dots, w_p)$ are obtained by solving

$$\min_w \|Xw - y\|_2^2$$

The model relies on the independence of the model terms.
If terms are correlated, then the least-squares estimate is highly sensitive to random errors in the observed response (large variance).
Complexity: $O(np^2)$
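As a concrete illustration, a minimal ordinary-least-squares fit with scikit-learn; the toy data and true coefficients are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # n = 100 samples, p = 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)

model = LinearRegression().fit(X, y)        # solves min_w ||Xw - y||_2^2
print(model.intercept_)                     # estimate of w0 (about 3.0)
print(model.coef_)                          # estimates of w1, ..., wp
```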


Supervised Learning: Generalized Linear Models

GLM: Ridge Regression

Fits the linear model $y(w, X_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip}$, where the coefficients $w = (w_1, \dots, w_p)$ are obtained by solving

$$\min_w \|Xw - y\|_2^2 + \alpha \|w\|_2^2$$

$\alpha > 0$ is a complexity parameter that controls how robust the coefficients are to collinearity.
The penalty on the size of the coefficients addresses the issues with the ordinary least squares method.
Complexity: $O(np^2)$
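A minimal ridge sketch, assuming toy data with nearly collinear columns to show the stabilizing effect of the penalty:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.001 * rng.normal(size=100)   # nearly collinear features
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)    # min_w ||Xw - y||_2^2 + alpha ||w||_2^2
print(ols.coef_)                      # can be large and unstable on cols 0 and 2
print(ridge.coef_)                    # shrunken, stabler entries
```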


Supervised Learning: Generalized Linear Models

GLM: Lasso

Fits the linear model $y(w, X_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip}$, where the coefficients $w = (w_1, \dots, w_p)$ are obtained by solving

$$\min_w \frac{1}{2n} \|Xw - y\|_2^2 + \alpha \|w\|_1$$

The $\ell_1$ norm means this linear model estimates sparse coefficients.
This method reduces the number of variables upon which the solution is dependent.
Useful in compressed sensing, and can be used for feature selection.
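A minimal lasso sketch, assuming a toy problem where only two of ten features matter, to show the sparsity of the estimated coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                                   # 10 candidate features
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=100)   # only 2 matter

lasso = Lasso(alpha=0.1).fit(X, y)    # min_w (1/2n)||Xw - y||_2^2 + alpha ||w||_1
print(lasso.coef_)                    # sparse: nonzero mainly at indices 0 and 3
```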


Supervised Learning: Support Vector Machines

SVM: Mathematical Setup

An SVM constructs a hyperplane (or set of hyperplanes) in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.


Supervised Learning: Support Vector Machines

SVM: Linear, separable case

We want to find the hyperplane that maximizes the margin, as follows:

Mathematical formulation: given $X_1, X_2, \dots, X_n \in \mathbb{R}^p$ with labels $Y_i \in \{-1, 1\}$, minimize $\|w\|^2$ subject to

$$\begin{cases} w \cdot X_i + b \ge 1, & Y_i = 1 \\ w \cdot X_i + b \le -1, & Y_i = -1 \end{cases}$$

Equivalently,

$$Y_i (w \cdot X_i + b) \ge 1$$

Decision function:

$$f(X, w, b) = \operatorname{sign}(w \cdot X + b)$$
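A minimal sketch of this separable case with scikit-learn; the blob data and the large-C choice (to approximate a hard margin) are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, size=(50, 2)),    # class +1
               rng.normal(-2.0, 0.5, size=(50, 2))])   # class -1
Y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, Y)            # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print(np.all(np.sign(X @ w + b) == Y))                 # True: f = sign(w.X + b)
```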


Supervised Learning: Support Vector Machines

SVM: Linear, non-separable case

Rewrite the problem as:

Mathematical formulation: given $X_1, X_2, \dots, X_n \in \mathbb{R}^p$ with labels $Y_i \in \{-1, 1\}$, minimize $\|w\|^2 + C \sum_{i=1}^{n} \xi_i$ subject to

$$Y_i (w \cdot X_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

This adds a hinge-loss term.

General decision: $f(X) = w \cdot X + b$. Solve:

$$\min P(w, b) = \frac{1}{2}\|w\|^2 + C \sum_i H_1[Y_i f(X_i)]$$

= maximize margin + minimize error.
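A sketch of the role of $C$ on overlapping, non-separable toy data (the data are assumptions): small $C$ tolerates more slack, large $C$ penalizes margin violations harder:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),   # overlapping classes:
               rng.normal(-1.0, 1.0, size=(50, 2))])  # not linearly separable
Y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, Y)
    # more support vectors at small C: more points sit inside the margin
    print(C, clf.score(X, Y), clf.support_.size)
```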


Supervised Learning: Support Vector Machines

SVM: non-linear case

For some data, linear classifiers are not complex enough.
The solution is to map the data into a feature space, and then construct a hyperplane in the feature space as before:

$$X \mapsto \Phi(X); \qquad \text{learn } f(X) = w \cdot \Phi(X) + b$$


Supervised Learning: Support Vector Machines

SVM: Kernel Trick

Kernel trick: if $\Phi(X)$ is high-dimensional, it can be hard to solve for $w$.
By the representer theorem (Kimeldorf & Wahba, 1971),

$$w = \sum_i \alpha_i \Phi(X_i)$$

for some $\alpha_i$. Optimize the $\alpha_i$ instead of $w$:

$$f(X) = \sum_i \alpha_i K(X_i, X) + b, \qquad K(X_i, X) = \Phi(X_i) \cdot \Phi(X)$$

Rewrite all the SVM equations as before, but with $w = \sum_i \alpha_i \Phi(X_i)$.

Dual:

$$\min P(w, b) = \frac{1}{2}\Big\|\sum_i \alpha_i \Phi(X_i)\Big\|^2 + C \sum_i H_1[Y_i f(X_i)]$$


Supervised Learning: Support Vector Machines

Common kernel examples

RBF-SVM:

$$K(X, X') = \exp(-\gamma \|X - X'\|^2)$$

Adds a bump around each data point.

Polynomial-SVM:

$$K(X, X') = (X \cdot X')^d$$

When $d$ is large, the kernel still only requires $n$ computations, whereas an explicit representation may not fit in memory.
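A sketch of these kernels on toy data that no linear separator can handle (two concentric rings; the dataset choice is an assumption). Note that scikit-learn's polynomial kernel is the slight generalization $(\gamma X \cdot X' + c_0)^d$:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, Y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for name, clf in [("linear", SVC(kernel="linear")),
                  ("rbf",    SVC(kernel="rbf", gamma=1.0)),  # exp(-gamma ||X - X'||^2)
                  ("poly",   SVC(kernel="poly", degree=2))]: # (gamma X.X' + c0)^d
    print(name, clf.fit(X, Y).score(X, Y))  # linear struggles; rbf and poly separate the rings
```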


Supervised Learning: Support Vector Machines

Pros and Cons of SVM

Advantages:
Effective in high-dimensional spaces.
Can be used for classification or regression.
Memory efficient, since it only uses a subset of the training points in the decision function.
Versatile, since we can use different kernel functions for the decision function.


Supervised Learning: Support Vector Machines

Pros and Cons of SVM

Disadvantages:
Non-probabilistic: SVMs do not directly provide probability estimates.
The method will likely not do well if the number of features is much greater than the number of samples.


Supervised Learning: Decision Trees

Decision Trees formulation

Set-up: training vectors $X_i \in \mathbb{R}^n$, $i = 1, \dots, p$, and a label vector $Y \in \mathbb{R}^p$.

A decision tree recursively partitions the space such that samples with the same labels are grouped together:
Denote the data at node $m$ by $Q$.
Consider all possible splits of $Q$ into left and right groups.
Choose the split which minimizes some measure of impurity.
A maximum tree depth parameter bounds the recursion.
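A minimal decision-tree sketch; Iris is a stand-in dataset, and Gini impurity and depth 3 are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree))     # the learned recursive partition, printed as text
print(tree.score(X, y))      # training accuracy
```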


Supervised Learning: Decision Trees

Decision Tree Example

Image source: https://www.projectrhea.org/rhea/index.php/Lecture_21_-_Decision_Trees(Continued)_Old_Kiwi


Supervised Learning: Decision Trees

Pros and Cons of Decision Trees

Advantages:
Simple to understand and to interpret; trees can be visualized.
The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
Possible to validate a model using statistical tests.
Able to handle both numerical and categorical data.


Supervised Learning: Decision Trees

Pros and Cons of Decision Trees

Disadvantages:
Decision-tree learners can create over-complex trees that do not generalize the data well.
Decision-tree learners create biased trees if some classes dominate.
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts.


Supervised Learning: Decision Trees

Other Supervised Learning Methods

Stochastic gradient descent
Nearest neighbors
Gaussian processes
Naive Bayes
Ensemble methods


Unsupervised Learning: Manifold learning

Manifold Learning

A dimension-reduction approach that attempts to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Examples (see the sketch below):

Isomap: seeks a lower-dimensional embedding which maintains geodesic distances between all points.
Locally linear embedding: seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods.
Multidimensional scaling: seeks a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space.
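A sketch of the three methods on the S-curve dataset (3-D points lying on a 2-D manifold; the dataset and parameter values are assumptions):

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=500, random_state=0)

for embed in (Isomap(n_components=2),                                  # geodesic distances
              LocallyLinearEmbedding(n_components=2, n_neighbors=10),  # local neighborhoods
              MDS(n_components=2, random_state=0)):                    # pairwise distances
    X2 = embed.fit_transform(X)
    print(type(embed).__name__, X2.shape)   # (500, 2) embeddings
```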


Unsupervised Learning: Clustering

Clustering: K-means

Goal: separate the samples into $K$ clusters $C$ by solving

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \|x_i - \mu_j\|^2,$$

where $\mu_j$ is the centroid of the $j$th cluster.

The within-cluster sum-of-squares criterion is referred to as inertia, and it measures how internally coherent the clusters are.
Inertia makes the assumption that clusters are convex and isotropic.
Inertia is not a normalized metric; it is better to run PCA before doing K-means clustering if you are in a high-dimensional space (curse of dimensionality).
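A minimal K-means sketch; the blob data are an assumption, and `inertia_` is scikit-learn's name for the within-cluster sum of squares above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)    # the centroids mu_j
print(km.inertia_)            # sum_i min_j ||x_i - mu_j||^2
```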


Unsupervised Learning: Clustering

K-means clustering

Image source: http://webstaff.itn.liu.se/~reile/edu/TNM025/Matlab/html/KMeansDemo.html


Unsupervised Learning: Clustering

Clustering: Spectral

Spectral clustering does a low-dimension embedding of the affinity matrix between samples, followed by K-means in the low-dimensional space.
The general approach is to use a standard clustering method on the relevant eigenvectors of a Laplacian matrix of the similarity matrix $A$, where $A_{ij} \ge 0$ represents a measure of the similarity between data points with indexes $i$ and $j$.
Works well for a small number of clusters, but is not advised when using many clusters.
Very useful when the structure of the individual clusters is highly non-convex, or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster.
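A sketch of that advantage on two-moons data (an assumption), where K-means on the raw coordinates fails but spectral clustering succeeds:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

spec = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                          n_neighbors=10, random_state=0).fit_predict(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# agreement with the true labels, up to a possible label swap
print(max((spec == y).mean(), (spec != y).mean()))   # ~1.0
print(max((km == y).mean(), (km != y).mean()))       # noticeably lower
```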


Unsupervised Learning: Neural Networks

Neural Networks

A neural network is a system that receives an input, processes the data, and provides an output.
The algorithm mimics the biological processes of a neuron.
Simplest case: a "neuron" is a computational unit that takes as input $x_1, x_2, x_3, \dots$ and outputs

$$h_{W,b}(x) = f(W^T x) = f\Big(\sum_i W_i x_i + b\Big),$$

where $f : \mathbb{R} \to \mathbb{R}$ is called the activation function, the $W_i$ are weights, and $b$ is the bias.
A common choice for the activation function is $\tanh$.
Add hidden layers to create a more sophisticated network.
One needs to learn the interconnection pattern between the different layers of neurons, and the weights of the interconnections, based on a cost function.
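A minimal sketch of the single neuron defined above; the input, weight, and bias values are assumptions:

```python
import numpy as np

def neuron(x, W, b, f=np.tanh):
    """One computational unit: activation of a weighted sum plus bias."""
    return f(W @ x + b)

x = np.array([0.5, -1.0, 2.0])    # inputs x1, x2, x3
W = np.array([0.1, 0.4, -0.3])    # weights W_i
b = 0.2                           # bias
print(neuron(x, W, b))            # tanh(0.05 - 0.4 - 0.6 + 0.2) = tanh(-0.75)
```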


Unsupervised Learning: Neural Networks

Neural Networks

Image source: https://visualstudiomagazine.com/articles/2014/11/01/use-python-with-your-neural-networks.aspx


Unsupervised Learning: Neural Networks

Pros and Cons of Neural Networks

Advantages:
Can be trained directly on data with thousands of inputs.
Once trained, predictions are fast.
Performs tasks that linear programs cannot.
Can be used with supervised or unsupervised learning.
Can be done in parallel.


Unsupervised Learning: Neural Networks

Pros and Cons of Neural Networks

Disadvantages:
Training is computationally expensive.
Complex and can be hard to interpret.
Could require a lot of parameter tweaking.


Model Selection and Evaluation

Things to consider:

What is your goal? Classify? Parameter prediction?
Dimension of your data
Variability of data
Pre-existing knowledge of data


Model Selection and Evaluation

Classification metrics:

$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i)$$

Regression metrics:

$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} |y_i - \hat{y}_i|$$

$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} (y_i - \hat{y}_i)^2$$

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_{\text{samples}}-1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_{\text{samples}}-1} (y_i - \bar{y})^2}$$
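A sketch of the four metrics via scikit-learn; the y values are assumptions chosen so the results are easy to check by hand:

```python
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# classification
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))   # 3 of 4 correct -> 0.75

# regression
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(mean_absolute_error(y_true, y_pred))          # MAE = 0.5
print(mean_squared_error(y_true, y_pred))           # MSE = 0.375
print(r2_score(y_true, y_pred))                     # R^2, about 0.95
```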


Applications

Applications of Machine Learning

Spam email detection
Improving weather prediction
Targeted advertising and web searches
Predicting emergency room wait times using staffing levels, patient data, charts, and the layout of the ER
Identifying heart failure from physician's notes
Predicting hospital readmissions
Learning dynamical systems models directly from high-dimensional sensor data (Byron Boots, Georgia Tech)


Applications

Useful Software Packages

PyML, PyMC, scikit-learn in Python
TensorFlow (Google)
MATLAB's Statistics and Machine Learning Toolbox
Spider (MATLAB)
Shogun
mlpack in C++
Torch (C++)
Weka (Java)
Orange (open-source machine learning and data visualization)


Pitfalls to avoid

Contamination of classifier

Goal: a general classifier.
Problem: the illusion of success (over-tuning parameters).
Use cross-validation to avoid this (see the sketch below):

Randomly divide the training data into multiple subsets.
Only use one subset for training at a time.
Test each classifier on the data not used for training.
Average the results to see how well the classifiers do.
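A minimal cross-validation sketch; the dataset and classifier are stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# 5 folds: each fold is held out once, the model trains on the rest
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
print(scores.mean(), scores.std())   # average held-out accuracy and its spread
```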


Pitfalls to avoid

Overfitting

"Hallucinating a classifier" occurs when the data are not sufficient to determine the correct classifier.
In this case the classifiers will be encoding random features in the data which are not grounded in reality.
Overfitting can be decomposed into bias and variance:

Bias is the learner's tendency to consistently learn the same (wrong) thing.
Variance is the tendency to learn random things irrespective of the real signal.
Linear classifiers have high bias.
Decision trees have low bias but high variance.

A more powerful algorithm is not necessarily better.
Use cross-validation and add a regularization term to avoid overfitting.


Pitfalls to avoid

Bias and Variance

Image source: [1]


Pitfalls to avoid

Curse of Dimensionality

The second biggest problem in machine learning.
The expression was coined in 1961 to refer to the fact that many algorithms work well in low dimensions but fail in high dimensions.
Generalizing correctly becomes exponentially more difficult as the dimensionality, or number of features, increases.
Similarity-based reasoning fails in high dimensions.
Good news: high-dimensional data are usually concentrated on or near a lower-dimensional manifold, so we can use dimension-reduction techniques to avoid this "curse," as in the sketch below.
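A sketch of that escape route, assuming synthetic data embedded in 100 dimensions but concentrated near a 3-dimensional subspace:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))          # intrinsic dimension 3
X = latent @ rng.normal(size=(3, 100))      # ambient dimension 100
X += 0.01 * rng.normal(size=X.shape)        # small off-manifold noise

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_.sum())  # close to 1.0: 3 components suffice
X_low = pca.transform(X)                    # (500, 3): use downstream, e.g. for K-means
```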


Machine Learning that Matters

Ways Machine Learning Limits Its Impact on the World

C. Rudin and K. Wagstaff argue that most people do not do ML with the primary intention of having a significant impact on the world.
Many papers in the ML community are rejected solely because their algorithms and analyses are not novel, even if their scientific contribution could have an important impact on society.
In a survey of 152 papers published at ICML 2011, only 1% of papers interpreted results in a domain context, whereas 39% used synthetic data and 37% used standard data from the UCI archive.
Abstract metrics for the performance of algorithms do not measure the impact of the results.
Theoretical advances are not connected back to real-world impact.


Machine Learning that Matters

Making Machine Learning Matter

1 Step 1: Define or select evaluation methods that will allow you to measure the impact of your results.

2 Step 2: Collaborate with experts in other fields who can help define the ML problem and label data for classification and regression tasks.

3 Step 3: Consider the potential impact when deciding which research problem to work on.


Machine Learning that Matters

Wagstaff’s Impact Challenges

(Original list from Carbonell, 1992)
1 A law passed or legal decision made that relies on the results of an ML analysis.
2 Save $100M through an improved decision-making algorithm provided by an ML system.
3 Avert a conflict between nations through high-quality translation provided by an ML system.
4 Save a human life through a diagnosis or intervention recommended by an ML system.
5 Reduce cybersecurity break-ins by 50% through ML defenses.
6 Improve the Human Development Index by 10% in a country using an ML system.


Machine Learning that Matters

Obstacles to ML Impact

1 Jargon
ML vocabulary creates barriers between experts and society, or even experts in other fields.
Replace "feature extraction" with "representation," "variance" with "instability," and so on.

2 Risk
"With great power comes great responsibility."
Who is at fault when errors have a significant impact?
Concerns are high in fields such as medicine, spacecraft, finance, etc.

3 Complexity
ML is not simple enough for researchers across fields to use freely.
Simplifying, maturing, and "robustifying" ML tools will promote wider, more independent uses of ML.


Machine Learning that Matters

Thanks for listening!!!


Bibliography

REFERENCES

[1] Domingos, Pedro. A Few Useful Things to Know about Machine Learning.
[2] Rudin, Cynthia and Kiri L. Wagstaff. Machine Learning for Science and Society. Machine Learning, Springer (2013).
[3] Wagstaff, Kiri L. Machine Learning that Matters. Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland (2012).
[4] scikit-learn.org/stable/user_guide.html
[5] https://www.quantstart.com/articles/Support-Vector-Machines-A-Guide-for-Beginners
[6] www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf