
Teaching Machine Learning to a Diverse Audience: the Foundation-based Approach

Hsuan-Tien Lin [email protected]

Department of Computer Science and Information Engineering, National Taiwan University

Malik Magdon-Ismail [email protected]

Department of Computer Science, Rensselaer Polytechnic Institute

Yaser S. Abu-Mostafa [email protected]

Department of Electrical Engineering and Department of Computer Science, California Institute of Technology

Abstract

How should we teach machine learning to a diverse audience? Is a ‘foundation-based’ approach appropriate or should we switch to a ‘techniques-oriented’ approach? We argue that the foundation-based approach is still the way to go.

The foundation-based approach emphasizes the fundamentals first, before moving to algorithms or paradigms. Here, we focus on three key concepts in machine learning that cover the basic understanding, the algorithmic modeling, and the practical tuning. We then demonstrate how more sophisticated techniques, like the popular support vector machine, can be understood within this framework.

1. Introduction

Machine learning, ‘the art of learning from data’, is now a wide discipline attracting theorists and practitioners from social media, biology, economics, engineering, etc. The discipline represents a prevailing paradigm in many applications: we don’t know the solution of our problem, but we have data that represents the solution. A consequence is that the typical entry-level machine learning class will consist of a diverse audience, as seen in most university offerings – multiple backgrounds, motivations and goals. To name a few cases: at National Taiwan University, the 2011 machine learning class was of size 77

Appearing in Proceedings of the Workshop on Teaching Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

and consisted of 42 students from computer science, 26 from electrical engineering, and other students from finance, information management, mechanical engineering and bio-engineering. At Rensselaer Polytechnic Institute, the 2011 machine learning class was of size 50, consisting of 24 undergraduate and 26 graduate students, whose majors are spread over computer science, electrical engineering, information science and engineering, physics, cognitive science, mathematics, chemistry and biology. At the California Institute of Technology, the 2012 machine learning class consists of 151 registered students coming from 15 different majors. The online version of the class (http://work.caltech.edu/telecourse) is vastly more diverse, ranging from practitioners with minimal background all the way to a winner of the US National Medal of Technology.

Machine learning in many universities is commonly listed as an advanced undergraduate course, relying on some background in calculus, linear algebra and probability. We don’t know whether this status will go the way of calculus and introductory computer programming, with separate offerings for biology, management, engineering, etc. But we do know that we have to address the diversity challenge facing instructors now. And perhaps the most important choice to be made is between solid foundations and practical techniques that can immediately be used. Of course, to keep a student interested, some practical techniques must be there, but where should the focus be?

It’s tempting to present the latest and greatest techniques; that is certainly the sexiest way to go. A typical depth-based approach would focus on one or a few promising techniques, such as the support vector machine algorithm (Vapnik, 1998). The approach aims at examining the technique thoroughly, from its intuitive and theoretical backbone to its heuristic and practical usage. Students who are taught through this approach can master the particular technique and exploit it appropriately in their domain. However, restriction to one or a few techniques has difficulty satisfying the needs of a diverse audience.

The other extreme is the breadth-based approach. A typical breadth-based approach travels through the forest of learning paradigms and algorithms, jumping from paradigm to paradigm, from algorithm to algorithm. The goal is generally to cram enough together in an attempt to be ‘complete’. An ambitious instructor may intend to cover supervised learning (including nearest neighbors, radial basis functions, support vector machines, kernel methods, neural networks, adaptive boosting, decision trees, random forests); unsupervised learning (including principal component analysis and k-means clustering); probabilistic modeling (including Gaussian processes and graphical models); reinforcement learning; etc. We can certainly see the appeal of this approach to a diverse audience, but let’s not forget that the student has to actually be able to use the techniques in their domain. Students can easily get lost in this forest.

It is dangerous to have tools available to use without knowing when and how to use them.

Given the plethora of open-source implementations of cutting-edge techniques on the Internet, we can always download the latest and greatest, for example the support vector machine; what cannot be downloaded is the judgment of when and how to use it. So it seems that the right place to start is the foundations, even for a diverse audience. We can only argue from experience, and we say this on the basis of more than a decade of successful experience¹ in teaching this material.

Foundations First. By foundations we do not mean death by theory. To pull this off, it is necessary to isolate the crucial core that all students should know; the core that will enable them to understand the map of machine learning and navigate through it on their own. We have distilled a three-step core that, in our experience, works well, and we feel that any student of machine learning who has mastered this core is truly poised to succeed in any machine learning application.

¹Professor Y. Abu-Mostafa received the Caltech Richard P. Feynman prize for excellence in teaching in 1996; Professor Hsuan-Tien Lin received the NTU outstanding teaching award in 2011.

1. Learnability, approximation and generalization: when can we learn and what are the tradeoffs?

2. Careful use of simple models: the linear model coupled with the nonlinear transform is typically enough for most applications. Only when it fails should we go after a more complex model.

3. Noise and overfitting: how do we deal with the adverse effects of noise in learning, in particular by using regularization and validation.

We have observed that, when equipped with these foundations, students, including both practitioners and researchers, can then move on to one technique or another with ease and are able to appreciate the larger framework within which those techniques fit. We use this core as the immutable foundation for a short course in introductory machine learning for both practitioners and researchers; we then complement this foundation with specific techniques and algorithms that we wish to emphasize within a particular context, and these specific techniques may change from year to year.

In this position paper, we briefly introduce these three key concepts in the foundation, and present a simple case of how to position new techniques within this foundation. We welcome discussion on what should be added to or removed from the foundation. Our hope is to highlight the foundation-based approach to teaching machine learning, an approach that seems to be getting neglected in current teaching trends that mostly focus on presenting as many flashy techniques as possible.

2. Learnability, Approximation and Generalization

– what we see is not what we get

For beginners in machine learning, it is all too common that they throw data at a learning system and are then surprised when they come to deploy the learned system on test examples. The first principle that should be instilled early on is that what we see in-sample is not always what we get out-of-sample. We need to emphasize that the out-of-sample performance is what we care about, and this leads to a natural question: how can we claim that learning (getting low out-of-sample error) is feasible, while only being able to observe in-sample? And so, we arrive at the logical reformulation of the learning task into two steps.

1. Make sure we have good in-sample performance;

2. Make sure in-sample performance reflects out-of-sample performance.
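For instructors who like to make this point computationally, a minimal sketch such as the one below lets students watch the gap between the two steps appear. It is only an illustration and assumes NumPy and scikit-learn; the synthetic data, the choice of a decision tree, and the 50/50 split are arbitrary teaching props rather than part of our course material.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic two-class data: the labels depend on only two of the
    # twenty features; the rest are pure noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Hold out half of the data to play the role of out-of-sample points.
    X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

    # A flexible model easily takes care of step 1 (in-sample performance)...
    tree = DecisionTreeClassifier(random_state=0).fit(X_in, y_in)
    print("in-sample accuracy     :", tree.score(X_in, y_in))

    # ...but nothing so far ensures step 2: the out-of-sample accuracy is
    # noticeably worse, and that gap is exactly what must be controlled.
    print("out-of-sample accuracy :", tree.score(X_out, y_out))

The numbers themselves are unimportant; what matters is that the first print statement is all we can observe, while the second is what we actually care about.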


There are several advantages to emphasizing the two-step approach. First, it crystallizes the tradeoff between approximation (the ability to ensure step 1) and generalization (the ability to ensure step 2). Second, it clearly separates learning into the process we can observe (step 1) and the process we cannot observe (step 2). Both steps are necessary, but there is an asymmetry: since step 2 cannot be observed, it must be inferred in a probabilistic manner. There is a deep consequence of not being able to observe step 2, namely that we must ensure it via powerful and abstract methods, lest it be our unseen Achilles heel. This enables us to characterize successful learning according to what is happening in step 1, in-sample, given that we have ensured step 2. If we manage to get the in-sample performance we want, then we are done; if not, we have failed, but at least we will know that we failed.

Contrast this with the typical approach of the beginner: throw a complex model at the data and try to knock down the in-sample error to get good in-sample performance; then hope that things luckily work out during deployment on test examples. This is not the mode we advocate. The mode we advocate is to ensure step 2 and then try our best with the in-sample performance.

We cannot guarantee learning. We can ‘guarantee’ no disasters. That is, after we learn we will either declare success or failure, and in both cases we will be right.

There are different levels of mathematical rigor at which this message can be told, but we feel that the essence of the message is a must. Armed with this insight, the enlightened student will rarely falter. It motivates them to understand when step 2 can be ensured, and hence the need for the abstract side of learning. It motivates the practical side too – knocking down the in-sample performance.

In our teaching experience, we find that students from diverse backgrounds are highly motivated after encountering the learnability issue, mostly because the issue makes machine learning a non-trivial and fascinating field to them.

3. Linear Models

– simple works well in practice

It is just a dogma that the beginner, when faced with data, must have a go at it with the latest and greatest. Armed with an understanding of learnability, approximation and generalization, students instantly see how the linear model quantitatively ensures step 2 (generalization) by suitably controlling the dimension. So all that remains is step 1, and efficient algorithms exist to knock down the error. If that does not work, careful use of the nonlinear feature transform usually solves the problem. In fact, recent advances in large-scale learning suggest that linear models are more successful than current dogma gives them credit for (Yuan et al., 2012).
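As a concrete illustration of this recipe (a sketch only, assuming scikit-learn; the circular target and the second-order polynomial transform are arbitrary classroom choices), the same linear algorithm can be run twice, once on the raw inputs and once after a fixed nonlinear feature transform:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import PolynomialFeatures

    # A target that is not linearly separable in the raw inputs:
    # the label depends on the distance from the origin.
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(300, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)

    # Step 1 with the plain linear model: the in-sample fit is poor.
    linear = LogisticRegression().fit(X, y)
    print("linear model, in-sample accuracy         :", linear.score(X, y))

    # The same linear model after a fixed second-order transform: the
    # transformed dimension is still small, so generalization (step 2)
    # remains under control, and the in-sample fit now succeeds.
    Z = PolynomialFeatures(degree=2).fit_transform(X)
    transformed = LogisticRegression().fit(Z, y)
    print("with nonlinear transform, in-sample accuracy:", transformed.score(Z, y))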

So the recipe is simple. Try the linear model first. It is simple and often effective, and not much can go wrong. If the in-sample performance of a linear model is good, we confidently assert success. We always have the option, later, to go with a more complicated model, but with care. And the linear model is typically the building block for the most popular complex models anyway.

In our teaching experience, we introduce the linear models within three related contexts: classification, regression, and probability estimation (logistic regression). Each context comes with learning algorithms and good generalization. The algorithms illustrate different optimization methods (combinatorial, analytic, and iterative), and so a lot of territory can be covered early by the instructor. Students coming from diverse backgrounds not only get the big picture, but also the finer details in a concrete setting.
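One possible code-level version of these three contexts is sketched below, with NumPy only. It is a hedged illustration: the perceptron loop, the pseudo-inverse, and the gradient steps are textbook forms, and the step size and iteration counts are placeholders rather than tuned values.

    import numpy as np

    rng = np.random.default_rng(2)
    X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # prepend a bias coordinate
    y = np.sign(X @ np.array([0.1, 1.0, -1.0]))          # a linearly separable target

    # Classification (combinatorial): the perceptron learning algorithm
    # corrects one misclassified point at a time.
    w_pla = np.zeros(3)
    for _ in range(10000):
        mistakes = np.where(np.sign(X @ w_pla) != y)[0]
        if len(mistakes) == 0:
            break
        n = mistakes[0]
        w_pla = w_pla + y[n] * X[n]

    # Regression (analytic): linear regression in one line via the pseudo-inverse.
    w_lin = np.linalg.pinv(X) @ y

    # Probability estimation (iterative): logistic regression by gradient
    # descent on the cross-entropy error.
    w_log = np.zeros(3)
    for _ in range(1000):
        grad = -np.mean((y * X.T) / (1.0 + np.exp(y * (X @ w_log))), axis=1)
        w_log = w_log - 0.5 * grad

All three weight vectors come from the same linear hypothesis set; only the error measure and the optimization style differ.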

4. Noise and Overfitting

– complex situations call for simpler models

Overfitting, the troubling phenomenon where pushing down the training error no longer indicates that we will get a decent test error, is arguably one of the most common traps for beginners. Intuitively, the ubiquitous stochastic noise that corrupts any finite data set can lead even the linear model astray. We should also not lose sight of the other major cause of overfitting: deterministic noise, the extent to which the target function cannot be modeled by our learning model (Abu-Mostafa et al., 2012).

Whenever there is noise, extra precautions are called for. This goes for stochastic as well as deterministic noise.

It is counterintuitive to a student that if the target function gets more complex we should choose a simpler learning model. But, that is why proper understanding of machine learning matters for all students, regardless of what domain their application is in; what separates the amateur from the professional is the ability to deal with overfitting, in particular how to deal with both types of noise.
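A tiny numerical demonstration often makes the trap vivid before any formal treatment. The sketch below (NumPy only; the quadratic target, the noise level and the two degrees are arbitrary choices) fits polynomials of two different orders to the same small noisy sample:

    import numpy as np

    # Ten noisy points from a simple quadratic target.
    rng = np.random.default_rng(4)
    x_train = rng.uniform(-1, 1, 10)
    y_train = x_train ** 2 + 0.2 * rng.normal(size=10)
    x_test = rng.uniform(-1, 1, 1000)
    y_test = x_test ** 2 + 0.2 * rng.normal(size=1000)

    for degree in (2, 9):
        coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
        e_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        e_out = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: in-sample error {e_in:.3f}, out-of-sample error {e_out:.3f}")

The ninth-order fit drives the in-sample error essentially to zero while the out-of-sample error typically deteriorates sharply: the model has fit the stochastic noise, and it would fit the deterministic noise of a more complex target just as eagerly.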


We recommend a substantive treatment of two basic tools, regularization and validation. An understanding of overfitting sets a perfect stage to introduce them: regularization constrains the model complexity to prevent overfitting, and validation allows one to certify whether a model is good or bad.
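A compact sketch can show both tools at once (assuming scikit-learn; the tenth-order transform, the ridge penalty, and the particular values of the regularization parameter are illustrative rather than recommended):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Thirty noisy points from a simple linear target, pushed through a
    # tenth-order polynomial transform that invites overfitting.
    rng = np.random.default_rng(3)
    X = rng.uniform(-1, 1, size=(30, 1))
    y = X[:, 0] + 0.3 * rng.normal(size=30)
    Z = np.hstack([X ** d for d in range(1, 11)])

    # Regularization: the ridge penalty constrains the effective complexity.
    # Validation: cross-validation certifies each choice on held-out data,
    # so the final model is selected on estimated out-of-sample performance.
    for alpha in (1e-3, 1e-1, 10.0):
        scores = cross_val_score(Ridge(alpha=alpha), Z, y, cv=5)
        print(f"alpha = {alpha:6.3f}   estimated out-of-sample R^2: {scores.mean():.2f}")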

These three core concepts (learnability, simple models and overfitting) continue to be crucial when we move to the complex models. Each of those models has its own way of dealing with these issues, and so the background set by this core is the platform from which to learn these more complex models (such as neural networks or support vector machines).

5. Support Vector Machines?

We cannot avoid the more complex methods, but the advantage of having given the core is that students are fully equipped to study any method independently. For example, the neural network is just a cascade of linear models. Ensemble methods are typically combinations of linear models voted together in some principled way. Support Vector Machines (SVM; Vapnik, 1998), which we now focus on, are linear models with a ‘robust’ formulation that can be solved with quadratic programming; and the nonlinear transform can be incorporated efficiently with SVM using the kernel trick.

One reason to choose SVM (for classification) is that it is currently the most popular. In reality, most models are good; it is just that SVM requires the least amount of ‘expertise’ to use effectively. But given an understanding of the foundations, the student is equipped to understand the why and the how. Again, the discussion can be taken to a mathematical level of the instructor’s choice, but the concepts will remain firm.

How does SVM accomplish the two steps required for learnability? Generalization is handled because the model is simple: it is a linear model with an ingredient of margin control to ensure the generalization performance (step 2) via theoretical results.

What about getting good in-sample performance? First, there is an efficient algorithm, based on quadratic programming. And if that fails, we can exploit the nonlinear transform efficiently using a kernel. What if our data is too large for general quadratic programming tools? The many other tools available for linear models, such as the stochastic gradient descent (SGD) commonly used for linear logistic regression, can be applied to the linear SVM to deal with scalability.
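As a hedged example of this last point (scikit-learn is assumed, the loss-function names follow recent releases of that library, and the synthetic data stands in for a genuinely large problem), the same stochastic gradient loop trains either a linear SVM or a logistic regression, depending only on the chosen loss:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    # A larger synthetic problem, standing in for data too big to hand
    # to a general-purpose quadratic programming solver.
    X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

    # hinge loss -> the linear soft-margin SVM objective
    # log_loss   -> linear logistic regression, trained the same way
    linear_svm = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0).fit(X, y)
    logistic = SGDClassifier(loss="log_loss", alpha=1e-4, random_state=0).fit(X, y)

    print("linear SVM via SGD, in-sample accuracy    :", linear_svm.score(X, y))
    print("logistic reg. via SGD, in-sample accuracy :", logistic.score(X, y))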

What about overfitting? We can efficiently estimate the leave-one-out error and hence validate (certify) the model. In addition, the primal form of SVM asks us to minimize the norm of the weight vector subject to fitting the data, which is precisely a form of regularization.
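For instructors who want the formula on the board, the standard soft-margin primal problem makes this regularization reading explicit (this is the usual textbook form, with C the familiar tradeoff parameter and \xi_n the slack variables):

    \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} \;+\; C \sum_{n=1}^{N} \xi_n
    \quad \text{subject to} \quad y_n\left(\mathbf{w}^{\top}\mathbf{x}_n + b\right) \ge 1 - \xi_n, \quad \xi_n \ge 0.

The first term is exactly a penalty on the norm of the weight vector, and the constraints say ‘fit the data’, up to the slacks that the second term charges for.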

So we see why SVM is popular among the diverse machine learning crowd – lots of things are taken care of automatically. That does not mean that these things cannot be done with neural networks. And therein lies the effectiveness of the foundation-based approach: SVM is not some magic box; it is just one way of realizing the three core concepts.

6. Conclusion

We discussed a proposed three-step core as a foundation for machine learning, in both practice and theory. The three concepts cover the philosophical understanding (when learning is possible), the algorithmic modeling (how to learn with simple models), and the practical tuning (how to combat overfitting). We argue that this foundation can solidify the understanding of students coming from diverse backgrounds, and can be carried through to any technique, simple or complex. Ironically, we are not proposing that we alter the teaching to accommodate more diverse audiences. Quite the opposite: the foundation-based approach is all the more important for such audiences, to ensure that they use machine learning correctly.

References

Abu-Mostafa, Yaser S., Magdon-Ismail, Malik, and Lin, Hsuan-Tien. Learning from Data: A Short Course. AMLBook.com, 2012.

Vapnik, Vladimir N. Statistical Learning Theory. Wiley, New York, NY, 1998.

Yuan, Guo-Xun, Ho, Chia-Hua, and Lin, Chih-Jen. Recent advances of large-scale linear classification. Proceedings of the IEEE, 2012. doi: 10.1109/JPROC.2012.2188013. To appear.

