Francesca Odone and Lorenzo Rosasco
RegML 2013
Regularization Methods for High Dimensional Learning
Genova, June, 3-7 2013
Course organized within the PhD Program in Computer Science, for the PhD School in Sciences and Technologies for Information and Knowledge (STIC) and the PhD School in Life and Humanoid Technologies
Who are we?
The course is co-organized by:
- SLIPGURU, University of Genova
- Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology
Schedule
Course Schedule and Material
COURSE MATERIAL
- Introductory information at: http://slipguru.disi.unige.it/Teaching/odone_rosasco/
- Course page: disi.unige.it/dottorato/corsi/RegML2013/
- Slides and additional material on Aulaweb: http://stsi.aulaweb.unige.it/course/view.php?id=59
- Material also available upon request (instructors' e-mails): [email protected], [email protected]

OTHER SOURCES
- Slipguru: slipguru.disi.unige.it
- LCSL: lcsl.mit.edu

TEACHING ASSISTANTS
Nicoletta Noceti, Silvia Villa, Alessandra Staglianò, Sean Fanello, Gabriele Chiusano, Alessandro Rudi, Luca Zini

Exams? Credits? Certificates? Attendance? Housing?
What We Talk About When We Talk About (Machine) Learning
Francesca Odone and Lorenzo Rosasco
RegML 2013
Menu
Appetizer: AI, some context and some history
Entree: Machine Learning at a Glance
Main Course: Intro to Statistical Learning Theory
(Artificial) Intelligence
Build intelligent machines
Understand Intelligence
Science and Engineering of Intelligence
(Artificial) Intelligence: A Working Definition
The Turing test (Alan Turing, 1912-1954) suggests the ingredients for AI:
- natural language processing
- knowledge representation
- automated reasoning
- machine learning
- computer vision
- robotics (to manipulate)
(Artificial) Intelligence & its Neighbors
Neuroscience
Psychology
Cognitive Science
AI
Mathematics
Engineering
Computer Science
Philosophy:
- What are the formal rules to draw valid conclusions?
- What can be computed?
- How do we reason with uncertain information?
Birth of a Dream
1943: Arturo Rosenblueth, Norbert Wiener and Julian Bigelow coin the term "cybernetics". Wiener's popular book by that name is published in 1948.
1944: Game theory, which would prove invaluable in the progress of AI, is introduced with the 1944 paper Theory of Games and Economic Behavior by mathematician John von Neumann and economist Oskar Morgenstern.
1945: Vannevar Bush publishes "As We May Think" (The Atlantic Monthly, July 1945), a prescient vision of the future in which computers assist humans in many activities.
1948: John von Neumann (quoted by E. T. Jaynes), in response to a comment at a lecture that it was impossible for a machine to think: "You insist that there is something a machine cannot do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that!" Von Neumann was presumably alluding to the Church-Turing thesis, which states that any effective procedure can be simulated by a (generalized) computer.
1950: Alan Turing proposes the Turing Test as a measure of machine intelligence.
1950: Claude Shannon publishes a detailed analysis of chess playing as search.
1955: The first Dartmouth College summer AI conference is organized by John McCarthy, Marvin Minsky, Nathan Rochester of IBM and Claude Shannon.
1956: The name "artificial intelligence" is used for the first time as the topic of the second Dartmouth Conference, organized by John McCarthy.
How did it go?
"We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer."
Dartmouth Summer Research Conference on Artificial Intelligence organised by John McCarthy and proposed by McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon.
Late 1990s: Web crawlers and other AI-based information extraction programs become essential in widespread use of the World Wide Web.
1997: The Deep Blue chess machine (IBM) beats the world chess champion, Garry Kasparov.
2004: DARPA introduces the DARPA Grand Challenge, requiring competitors to produce autonomous vehicles for prize money.
10/15 years ago
How are we doing now?
Pedestrian Detection at Human-Level Performance
ML and AI
Machine Learning: systems are trained on examples rather than being programmed.
Menu
Appetizer: AI, some context and some history
Entree: Machine Learning at a Glance
Main Course: Intro to Statistical Learning Theory
Basic Setting: Classification
$$(x_1, y_1), \ldots, (x_n, y_n), \qquad x_i \in \mathbb{R}^p, \quad y_i \in Y = \{-1, 1\}, \quad i = 1, \ldots, n$$

$$X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
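For concreteness, a minimal Python sketch of this setup with synthetic data; the sizes and the linear labeling rule are assumptions invented for the example:

```python
import numpy as np

# A made-up stand-in for the setup above: n examples, p features.
rng = np.random.default_rng(0)
n, p = 100, 5
Xn = rng.normal(size=(n, p))          # one row per example, as in the matrix X_n

# Labels in {-1, +1}, here produced by a hypothetical linear rule.
w_true = rng.normal(size=p)
Yn = np.where(Xn @ w_true >= 0, 1, -1)

print(Xn.shape, Yn.shape)             # (100, 5) (100,)
```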
Genomics
n patients, p gene expression measurements:
$$X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
Text Classification
Text Classification: Bag of Words
$$X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}$$
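To make the bag-of-words matrix concrete, a small sketch in which each row is a document and each column counts one vocabulary word; the toy corpus is invented for illustration:

```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "my cat ate the dog food"]

# One column per distinct word in the corpus.
vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

# Xn[i, j] = number of occurrences of word j in document i.
Xn = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        Xn[i, col[w]] += 1

print(vocab)
print(Xn)
```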
Image Classification
Examples: handwriting recognition, automatic car plate reading, ...
Image Classification
$$X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}$$
From classification to regression
$$(x_1, y_1), \ldots, (x_n, y_n), \qquad x_i \in \mathbb{R}^D, \quad i = 1, \ldots, n$$
Instead of $y_i \in Y = \{-1, 1\}$, the outputs are now real-valued: $y_i \in Y \subseteq \mathbb{R}$, $i = 1, \ldots, n$.
CS229 Lecture notes
Andrew Ng
Supervised learning
Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:
Living area (feet²)   Price ($1000s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540
...                   ...
We can plot this data:
[Plot: "housing prices"; x-axis: square feet (500 to 5000), y-axis: price in $1000s (0 to 1000).]
Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?
Part I
Linear Regression
To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:
Living area (feet²)   #bedrooms   Price ($1000s)
2104                  3           400
1600                  3           330
2400                  3           369
1416                  2           232
3000                  4           540
...                   ...         ...
Here, the x's are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the i-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out in Portland gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on. We'll say more about feature selection later, but for now let's take the features as given.)
To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$ and write it more simply as $h(x)$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that
$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,$$
where on the right-hand side above we are viewing $\theta$ and $x$ both as vectors, and here n is the number of input variables (not counting $x_0$).
$$y_i = f(x_i) + \sigma \varepsilon_i, \quad \sigma > 0, \qquad \text{e.g. } f(x) = w^T x, \quad \varepsilon_i \sim N(0, 1)$$
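A quick sketch of this noise model: generate $y_i = w^T x_i + \sigma \varepsilon_i$ and recover $w$ by ordinary least squares. The particular $w$, $\sigma$, and sizes are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
sigma = 0.1                                  # noise level (arbitrary here)

X = rng.normal(size=(n, p))
w = np.array([1.0, -2.0, 0.5])               # the "true" w, made up
eps = rng.normal(size=n)                     # eps_i ~ N(0, 1)
y = X @ w + sigma * eps                      # y_i = w^T x_i + sigma * eps_i

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                                 # close to w when sigma is small
```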
Batch Learning
$$(x_1, y_1), \ldots, (x_n, y_n)$$
[Diagram: a training set S of input-output pairs is fed to a learning machine (LM), which produces an estimator f : X -> Y; for a new input x, the machine outputs the prediction f(x).]
Machine Learning: Problems and Approaches
Learning Problems
- Supervised Learning
- Semisupervised
- Online
- ...

Learning Approaches
- Batch Learning
- Online
- Active
- ...
Variations on a Theme
$$(x_1, y_1), \ldots, (x_n, y_n)$$
Multiclass: $x_i \in \mathbb{R}^D$ and $y_i \in Y = \{1, \ldots, T\}$, $i = 1, \ldots, n$.
Multitask: $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}^T$, $i = 1, \ldots, n$.
Learning a similarity function: $(x_1, x_1; y_{1,1}), (x_1, x_2; y_{1,2}), \ldots, (x_n, x_n; y_{n,n})$, with $x_i, x_j \in \mathbb{R}^D$ and $y_{i,j} \in [0, 1]$, $i, j = 1, \ldots, n$.
Semisupervised Learning
In addition to the n labeled examples, a (typically larger) set of u unlabeled inputs is available:
$$X_u = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_u^1 & \cdots & x_u^p \end{pmatrix} \;\cup\; X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
Manifold Learning
Online Learning
Data arrive one at a time and the estimator is updated at each step:
$$f_0 \xrightarrow{(x_1, y_1)} f_1 \xrightarrow{(x_2, y_2)} f_2 \longrightarrow \cdots \longrightarrow f_n$$
Machine Learning: Problems and Approaches
Learning Problems
- Supervised Learning
- Semisupervised
- Online
- Unsupervised Learning

Learning Approaches
- Batch Learning
- Online
- Active
- ...

[Diagram: training set S -> learning machine (LM) -> estimator f; input x -> prediction f(x).]
Unsupervised Learning
Given only inputs $x_1, \ldots, x_n$ (no labels $Y_n$):
$$X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}$$
Goal: extract patterns...
- Clustering
- Dimensionality reduction
- Learning data representation
- ...
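As one concrete instance of pattern extraction, a plain k-means clustering sketch; k-means is just one of the clustering methods the slide alludes to, and this implementation assumes no cluster ever becomes empty:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```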
Online/Incremental Learning
Rather than processing the whole batch $(x_1, y_1), \ldots, (x_n, y_n)$ at once, the estimator is updated as each example arrives:
$$f_0 \xrightarrow{(x_1, y_1)} f_1 \xrightarrow{(x_2, y_2)} f_2 \longrightarrow \cdots \longrightarrow f_n$$
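A minimal sketch of this update loop using the classical perceptron rule; the perceptron is an illustrative choice, not an algorithm prescribed by the course:

```python
import numpy as np

def online_perceptron(examples, p):
    """f_0 -> f_1 -> ... -> f_n: update a linear estimator one example at a time."""
    w = np.zeros(p)                    # f_0
    for x, y in examples:              # (x_1, y_1), (x_2, y_2), ...
        if y * (w @ x) <= 0:           # current f_t errs on this example
            w = w + y * x              # f_t -> f_{t+1}
    return w                           # f_n
```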
Active Learning
Learner can query points: it chooses which inputs to have labeled while it learns.
$$f_0 \xrightarrow{(x_1, y_1)} f_1 \xrightarrow{(x_2, y_2)} f_2 \longrightarrow \cdots \longrightarrow f_n$$
[Figure 8.7 from Foundations and Applications of Sensor Management, "The two step procedure for d = 2": (a) initial unpruned RDP and n/2 samples; (b) preview step RDP (note that a pruned cell may still contain part of the boundary); (c) additional sampling for the refinement step; (d) refinement step.]
The final estimator is constructed by assembling the estimate "away" from the boundary obtained in the preview step with the estimate in the vicinity of the boundary obtained in the refinement step. To formally show that this algorithm attains the desired faster rates, a further technical assumption is needed: the boundary set must be "cusp-free" (it cannot behave like the graph of |x|^{1/2} at the origin, though less "aggressive" kinks, as in the graph of |x|, are allowed). This condition is rather technical but not very restrictive, and encompasses many interesting situations, including boundary fragments; it seems to be necessary for the algorithm to perform well, and is not simply an artifact of the proof. For a more detailed explanation see [52].
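One simple way a learner can query points is uncertainty sampling: repeatedly label the pool point the current estimator is least sure about. A hedged sketch only; the least-squares learner, the margin criterion, and the helper names are assumptions for illustration (assumes n_queries >= 1):

```python
import numpy as np

def uncertainty_sampling(X_pool, oracle, n_queries, seed=0):
    """Repeatedly query the pool point whose margin |w^T x| is smallest."""
    rng = np.random.default_rng(seed)
    queried = list(rng.choice(len(X_pool), size=2, replace=False))
    ys = {i: oracle(X_pool[i]) for i in queried}        # initial labeled seed
    for _ in range(n_queries):
        Xq = X_pool[queried]
        yq = np.array([ys[i] for i in queried])
        w, *_ = np.linalg.lstsq(Xq, yq, rcond=None)     # current estimator f
        margin = np.abs(X_pool @ w)
        margin[queried] = np.inf                        # never re-query a point
        i = int(margin.argmin())                        # least certain point
        queried.append(i)
        ys[i] = oracle(X_pool[i])                       # ask for its label
    return w
```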
Some Remarks
- We look for computer systems that are trained, rather than programmed, to perform a task.
- Learning from examples is a unifying paradigm in AI: it makes it possible to exploit the availability of data and computational resources.
- "Learning is the acquisition of knowledge or skills through study, experience, or being taught."
Menu
Appetizer: AI, some context and some history
Entree: Machine Learning at a Glance
Main Course: Intro to Statistical Learning Theory
**Warning: Math**
The course contains many ideas and (quite) a bit of math; questions help prevent sleeping...
Training Set
Given a training set
$$S = (x_1, y_1), \ldots, (x_n, y_n)$$
find $f$ such that
$$f(x) \sim y$$
Loss function
We need a way to measure errors: a loss function
$$V(f(x), y)$$
Loss function examples
- 0-1 loss: $V(f(x), y) = \Theta(-y f(x))$ ($\Theta$ is the step function)
- square loss (L2): $V(f(x), y) = (f(x) - y)^2 = (1 - y f(x))^2$
- absolute value (L1): $V(f(x), y) = |f(x) - y|$
- Vapnik's $\epsilon$-insensitive loss: $V(f(x), y) = (|f(x) - y| - \epsilon)_+$
- hinge loss: $V(f(x), y) = (1 - y f(x))_+$
- logistic loss: $V(f(x), y) = \log(1 + e^{-y f(x)})$ (logistic regression)
- exponential loss: $V(f(x), y) = e^{-y f(x)}$
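For concreteness, these losses in plain numpy; a sketch, where the convention that a zero margin counts as an error in the 0-1 loss is a choice:

```python
import numpy as np

# fx = f(x), a real-valued prediction; y = the true label.
def zero_one(fx, y):        return np.where(y * fx <= 0, 1.0, 0.0)
def square(fx, y):          return (fx - y) ** 2
def absolute(fx, y):        return np.abs(fx - y)
def eps_insensitive(fx, y, eps=0.1):
    return np.maximum(np.abs(fx - y) - eps, 0.0)
def hinge(fx, y):           return np.maximum(1.0 - y * fx, 0.0)
def logistic(fx, y):        return np.log(1.0 + np.exp(-y * fx))
def exponential(fx, y):     return np.exp(-y * fx)
```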
Empirical error
Given a loss function $V(f(x), y)$, we can define the empirical error
$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$$
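The empirical error is then just an average over the training set; a minimal sketch:

```python
import numpy as np

def empirical_error(f, X, y, V):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i)."""
    return np.mean(V(f(X), y))

# e.g., with the hinge loss above and a linear f:
# err = empirical_error(lambda X: X @ w, X, y, hinge)
```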
Hypotheses Space
"Learning processes do not take place in vacuum."
(Cucker and Smale, AMS 2001)
We need to fix a hypotheses space
$$\mathcal{H} \subset \mathcal{F} = \{f \mid f : X \to Y\}$$
Hypotheses Space
$$\mathcal{H} \subset \mathcal{F} = \{f \mid f : X \to Y\}$$
- Linear model (parametric): $f(x) = \sum_{j=1}^{p} x^j w^j$
- Generalized linear models (semi-parametric): $f(x) = \sum_{j=1}^{p} \phi(x)^j w^j$
- Reproducing kernel Hilbert spaces (non-parametric): $f(x) = \sum_{j \geq 1} \phi(x)^j w^j = \sum_{i \geq 1} K(x, x_i)\,\alpha_i$, where $K(x, x')$ is a symmetric positive definite function called a reproducing kernel.
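A sketch of the non-parametric case: a Gaussian kernel, one standard example of a reproducing kernel (the width parameter is an arbitrary assumption), and the expansion $f(x) = \sum_i K(x, x_i)\,\alpha_i$:

```python
import numpy as np

def gaussian_kernel(X1, X2, width=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 * width^2)), symmetric and positive definite."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * width ** 2))

def kernel_expansion(X_new, X_train, alpha, width=1.0):
    """f(x) = sum_i K(x, x_i) * alpha_i, evaluated at each row of X_new."""
    return gaussian_kernel(X_new, X_train, width) @ alpha
```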
Minimizing the empirical error
Empirical Risk Minimization (ERM):
$$\min_{f \in \mathcal{H}} I_S[f] = \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$$
Which one is a good solution?
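The question can be made concrete with ERM over nested hypothesis spaces; in this sketch (synthetic data, polynomial models, all choices arbitrary) the empirical error keeps shrinking as the space grows, which is exactly why ERM alone cannot pick a good solution:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=20)

# ERM with the square loss over H_d = polynomials of degree d.
for degree in (1, 3, 15):
    coef = np.polyfit(x, y, degree)                 # minimizes the empirical error
    err = np.mean((np.polyval(coef, x) - y) ** 2)
    print(degree, err)                              # shrinks as H grows
```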
Statistical Learning: Overfitting and Generalization
To establish notation for future use, we'll use $x^{(i)}$ to denote the "input" variables (living area in this example), also called input features, and $y^{(i)}$ to denote the "output" or target variable that we are trying to predict (price). A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we'll be using to learn, a list of m training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values. In this example, $X = Y = \mathbb{R}$.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h : X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:
[Diagram: training set -> learning algorithm -> hypothesis h; x (living area of house) -> h -> predicted y (predicted price of house).]
When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
The training set
$$S = (x_1, y_1), \ldots, (x_n, y_n)$$
is sampled identically and independently (i.i.d.) from a fixed, unknown probability distribution $p(x, y) = p(x)\,p(y|x)$.
Generalization and Stability: ERM and Ill-posedness
Learning is an ill-posed problem (Jacques Hadamard). Ill-posed problems often arise if one tries to infer general laws from few data:
- the hypothesis space is too large
- there are not enough data
In general, ERM leads to ill-posed solutions because:
- the solution may be too complex
- it may not be unique
- it may change radically when leaving one sample out
Regularization theory provides results and techniques to restore well-posedness, that is stability (hence generalization).
Theory of Machine Learning
- Beyond drawings & intuitions (...) there is a deep, rigorous mathematical foundation of regularized learning algorithms (Cucker and Smale, Vapnik and Chervonenkis, ...).
- The theory of learning is a synthesis of different fields, e.g. Computer Science (Algorithms, Complexity) and Mathematics (Optimization, Probability, Statistics).
- Central to the theory of machine learning is the problem of understanding the conditions under which ERM can solve
$$\inf_{f} \mathcal{E}(f), \qquad \mathcal{E}(f) = \mathbb{E}_{(x,y)}\, V(y, f(x))$$
(Tikhonov) Regularization
$$\min_{f \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda R(f) \right\} \;\to\; f_S^\lambda$$
with $\lambda$ the regularization parameter and $R$ the regularizer.
- The regularizer describes the complexity of the solution. [Picture: a smooth function $f_1$ and a wiggly function $f_2$; $R(f_2)$ is bigger than $R(f_1)$.]
- The regularization parameter determines the trade-off between complexity and empirical risk.
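For the linear model with the square loss and $R(f) = \|w\|^2$, Tikhonov regularization reduces to ridge regression, which has a closed form; a minimal sketch assuming that setting:

```python
import numpy as np

def tikhonov_linear(X, y, lam):
    """argmin_w (1/n) * ||y - X w||^2 + lam * ||w||^2
    Closed form: w = (X^T X / n + lam * I)^{-1} (X^T y / n)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

# Large lam -> small ||w|| (simple solution, larger empirical risk);
# small lam -> close to plain ERM (better fit, possibly unstable).
```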
Some Remarks and Some Questions
- Supervised learning in statistical learning theory: basic concepts/notation.
- The regularization approach:
  - Which hypotheses space? Which regularizer?
  - How can we find a solution in an efficient way?
  - How do we solve the fitting/regularizing trade-off?