An Overview of Other Topics, Conclusions, and Perspectives
Piyush Rai
Machine Learning (CS771A)
Nov 11, 2016
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 1
Plan for today
Survey of some other topics
Sparse Modeling
Time-series modeling
Reinforcement Learning
Multitask/Transfer Learning
Active Learning
Bayesian Learning
Conclusions and take-aways
Sparse Modeling
Sparse Modeling
Many ML problems can be written in the following general form
y = Xw + ε
where y is N × 1, X is N × D, w is D × 1, and ε denotes observation noise
Often we expect/want w to be sparse (i.e., at most, say, s ≪ D nonzeros)
Becomes especially important when D ≫ N
Examples: Sparse regression/classification, sparse matrix factorization, compressive sensing, dictionary learning, and many others.
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 4
Sparse Modeling
Ideally, our goal will be to solve the following problem
arg min_w ||y − Xw||² s.t. ||w||₀ ≤ s
Note: ||w||₀ is known as the ℓ₀ norm of w
In general, an NP-hard problem: a combinatorial optimization problem with ∑_{i=1}^{s} (D choose i) possible locations of the nonzeros in w
Also, the constraint ||w||₀ ≤ s is non-convex (figure on next slide).
A huge body of work on solving these problems. Primarily in two categories
Convex-ify the problem (e.g., replace the ℓ₀ norm by the ℓ₁ norm)
arg min_w ||y − Xw||² s.t. ||w||₁ ≤ t   OR   arg min_w ||y − Xw||² + λ||w||₁
(note: the ℓ₁ term makes the objective non-differentiable at 0, but there are many ways to handle this)
Use non-convex optimization methods, e.g., iterative hard thresholding (rough idea: use gradient descent to solve for w and set the D − s smallest entries to zero in every iteration; basically a projected GD method)
†See “Optimization Methods for ℓ₁ Regularization” by Schmidt et al. (2009)
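The iterative hard thresholding idea can be sketched in a few lines of NumPy. This is a minimal illustration under assumed settings (the step size, iteration count, and the toy data are all invented for the demo), not the tuned variants from the literature:

```python
import numpy as np

def iht(X, y, s, n_iters=200):
    """Iterative hard thresholding for min ||y - Xw||^2 s.t. ||w||_0 <= s.
    Projected gradient descent: take a gradient step, then project onto
    the set of s-sparse vectors by zeroing the D - s smallest entries."""
    N, D = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)    # conservative step size
    w = np.zeros(D)
    for _ in range(n_iters):
        w = w + step * X.T @ (y - X @ w)        # gradient step on the squared loss
        w[np.argsort(np.abs(w))[:D - s]] = 0.0  # hard threshold: keep s largest
    return w

# Toy problem: D = 50 features, true w has s = 3 nonzeros, N = 100 samples
rng = np.random.default_rng(0)
N, D, s = 100, 50, 3
X = rng.standard_normal((N, D))
w_true = np.zeros(D)
w_true[:s] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.01 * rng.standard_normal(N)

w_hat = iht(X, y, s)
print(np.count_nonzero(w_hat))   # exactly s entries survive the thresholding
```

Since the projection zeroes D − s entries on every pass, the returned w is s-sparse by construction; whether it finds the *right* support depends on the conditioning of X.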
Sparsity using the ℓ₁ Norm
Why does the ℓ₁ norm give sparsity? There are many explanations. An informal one:
The error-function contours are more likely to meet the constraint contour on the coordinate axes in the ℓ₁ case
Another explanation: Between the ℓ₂ and ℓ₁ norms, ℓ₁ is “closer” to the ℓ₀ norm (in fact, the ℓ₁ norm is the closest convex approximation to the ℓ₀ norm)
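A small NumPy comparison makes the zeros-vs-shrinkage contrast concrete. The ℓ₁ solver below is proximal gradient (ISTA), whose prox step is soft-thresholding and thus sets small coordinates exactly to zero; the penalty weight, data, and absorbed constants (e.g., the 1/2 on the loss) are assumptions made for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 80, 40
X = rng.standard_normal((N, D))
w_true = np.zeros(D)
w_true[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ w_true + 0.05 * rng.standard_normal(N)

step = 1.0 / (np.linalg.norm(X, 2) ** 2)
lam = 5.0

# l1 (lasso-style): proximal gradient (ISTA). The prox of lam*||.||_1 is
# soft-thresholding, which sets small coordinates exactly to zero.
w_l1 = np.zeros(D)
for _ in range(500):
    g = w_l1 + step * X.T @ (y - X @ w_l1)
    w_l1 = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)

# l2 (ridge-style): plain gradient descent. The penalty shrinks weights
# toward zero but never makes them exactly zero.
w_l2 = np.zeros(D)
for _ in range(500):
    w_l2 = w_l2 + step * (X.T @ (y - X @ w_l2) - lam * w_l2)

print(np.count_nonzero(w_l1), np.count_nonzero(w_l2))
```

On this toy data the ℓ₁ solution keeps roughly the 4 true features and zeroes most of the rest, while every coordinate of the ℓ₂ solution stays (slightly) nonzero.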
Learning from Time-Series Data
Modeling Time-Series Data
The input is a sequence of (non-i.i.d.) examples y_1, y_2, …, y_T
The problem may be supervised or unsupervised, e.g.,
Forecasting: Predict y_{T+1}, given y_1, y_2, …, y_T
Cluster the examples or perform dimensionality reduction
Evolution of time-series data can be attributed to several factors
Teasing apart these factors of variation is also an important problem
Auto-regressive Models
Auto-regressive (AR): Regress each example on p previous examples
y_t = c + ∑_{i=1}^{p} w_i y_{t−i} + ε_t : An AR(p) model
Moving Average (MA): Regress each example on p previous stochastic errors
y_t = c + ε_t + ∑_{i=1}^{p} w_i ε_{t−i} : An MA(p) model
Auto-regressive Moving Average (ARMA): Regress each example on p previous examples and q previous stochastic errors
y_t = c + ε_t + ∑_{i=1}^{p} w_i y_{t−i} + ∑_{i=1}^{q} v_i ε_{t−i} : An ARMA(p, q) model
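Since an AR(p) model is just linear regression of y_t on its own lags, it can be fit by ordinary least squares. A minimal sketch on synthetic AR(2) data (the coefficients, noise level, and series length are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate an AR(2) process: y_t = c + w1*y_{t-1} + w2*y_{t-2} + eps_t
c, w = 0.5, np.array([0.6, -0.3])   # chosen so the process is stationary
T = 2000
y = np.zeros(T)
for t in range(2, T):
    y[t] = c + w[0] * y[t - 1] + w[1] * y[t - 2] + 0.1 * rng.standard_normal()

# Fit AR(p) by least squares: regress each y_t on [1, y_{t-1}, ..., y_{t-p}]
p = 2
lags = [y[p - i - 1: T - i - 1] for i in range(p)]   # columns y_{t-1}, y_{t-2}, ...
A = np.column_stack([np.ones(T - p)] + lags)
coef, *_ = np.linalg.lstsq(A, y[p:], rcond=None)
c_hat, w_hat = coef[0], coef[1:]

# Forecasting: one-step prediction of y_{T+1} from the last p observed values
y_next = c_hat + w_hat @ y[-1:-p - 1:-1]
```

With enough data the recovered (c_hat, w_hat) should be close to the true (c, w); fitting MA/ARMA models is harder because the past errors ε_{t−i} are not observed.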
State-Space Models
Assume that each observation y_t in the time-series is generated by a low-dimensional latent factor x_t (one-hot or continuous)
Basically, a generative latent factor model: y_t = g(x_t) and x_t = f(x_{t−1}), where g and f are probability distributions
Very similar to PPCA/FA, except that latent factor x_t depends on x_{t−1}
Some popular SSMs: Hidden Markov Models (one-hot latent factor x_t), Kalman Filters (real-valued latent factor x_t)
Note: Models like RNN/LSTM are also similar, except that these are not generative (but can be made generative)
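As a concrete instance of the SSM structure, here is a minimal 1-D Kalman filter on synthetic data, alternating a predict step (push the belief through the transition model f) and an update step (condition on the observation y_t). The scalar model parameters (a, q, r) are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(7)

# A 1-D linear-Gaussian state-space model (the Kalman filter setting):
#   x_t = a * x_{t-1} + process noise,   y_t = x_t + observation noise
a, q, r, T = 0.95, 0.1, 0.5, 300
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + q * rng.standard_normal()
    y[t] = x[t] + r * rng.standard_normal()

m, P = 0.0, 1.0            # filtered mean and variance of the latent x_t
means = np.zeros(T)
for t in range(T):
    # predict: p(x_t | y_1, ..., y_{t-1})
    m_pred, P_pred = a * m, a * a * P + q * q
    # update: condition on y_t
    K = P_pred / (P_pred + r * r)        # Kalman gain
    m = m_pred + K * (y[t] - m_pred)
    P = (1 - K) * P_pred
    means[t] = m

# The filtered estimate tracks x more closely than the raw observations do
err_filter = np.mean((means - x) ** 2)
err_obs = np.mean((y - x) ** 2)
```

An HMM replaces the real-valued x_t with a one-hot state and the predict/update recursion with the forward algorithm; the structure is the same.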
Reinforcement Learning
Reinforcement Learning
A paradigm for interactive learning or “learning by doing”
Different from supervised learning. Supervision is “implicit”
The learner (agent) performs “actions” and gets “rewards”
Goal: Learn a policy that maximizes the agent’s cumulative expected reward (the policy tells what action the agent should take next)
Order in which data arrives matters (sequential, non-i.i.d. data)
Agent’s actions affect the subsequent data it receives
Many applications: Robotics and control, computer game playing (e.g., Atari, Go), online advertising, financial trading, etc.
Markov Decision Processes
Markov Decision Process (MDP) gives a way to formulate RL problems
The MDP formulation assumes the agent knows its state in the environment
Partially Observed MDP (POMDP) can be used when the state is unknown
Assume there are K possible states and a set of actions at any state
A transition model p(s_{t+1} = ℓ | s_t = k, a_t = a) = P_a(k, ℓ)
Each P_a is a K × K matrix of transition probabilities (one for each action a)
A reward function R(s_t = k, a_t = a)
Goal: Find a “policy” π(s_t = k) which returns the optimal action for s_t = k
(P_a, R) and π can be estimated in an alternating fashion
Estimating P_a and R requires some training data. This can be done even when the state space is continuous (requires solving a function approximation problem)
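Given the model (P_a, R), the policy can be found by value iteration. A toy sketch with invented transition matrices, rewards, and discount factor (here P_a and R are assumed known rather than estimated from data):

```python
import numpy as np

# A tiny MDP with K = 2 states and two actions, solved by value iteration.
# P[a] is the K x K transition matrix P_a(k, l); R[k, a] is the reward.
P = np.array([
    [[0.9, 0.1],     # P_0: transition probabilities under action 0
     [0.2, 0.8]],
    [[0.1, 0.9],     # P_1: transition probabilities under action 1
     [0.6, 0.4]],
])
R = np.array([[1.0, -1.0],   # rewards in state 0 for actions 0 and 1
              [0.0,  2.0]])  # rewards in state 1
gamma = 0.9                  # discount factor

# Bellman backup: V(k) <- max_a [ R(k, a) + gamma * sum_l P_a(k, l) V(l) ]
V = np.zeros(2)
for _ in range(500):
    Q = R + gamma * np.einsum('akl,l->ka', P, V)   # Q[k, a]
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)    # greedy policy: optimal action in each state
print(policy)                # prints [0 1]
```

The backup is a contraction (factor gamma), so V converges to the optimal value function and the greedy policy read off from Q is optimal for this model.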
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 13
![Page 32: An Overview of Other Topics, Conclusions, and Perspectives · 2017. 8. 14. · Sparse Modeling. Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives](https://reader035.vdocuments.mx/reader035/viewer/2022071502/61227592c14bf87b31001997/html5/thumbnails/32.jpg)
Markov Decision Processes
Markov Decision Process (MDP) gives a way to formulate RL problems
MDP formulation assumes the agent knows the its state in the environment
Partially Observed MDP (POMDP) can be used when state is unknown
Assume there are K possible states and a set of actions at any state
A transition model p(st+1 = `|st = k, at = a) = Pa(k, `)
Each Pa is a K × K matrix of transition probabilites (one for each action a)
A reward function R(st = k, at = a)
Goal: Find a “policy” π(st = k) which returns the optimal action for st = k
(Pa,R) and π can be estimated in an alternating fashion
Estimating Pa and R requires some training data. Can be done even when the state space iscontinuous (requires solving a function approximation problem)
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 13
![Page 33: An Overview of Other Topics, Conclusions, and Perspectives · 2017. 8. 14. · Sparse Modeling. Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives](https://reader035.vdocuments.mx/reader035/viewer/2022071502/61227592c14bf87b31001997/html5/thumbnails/33.jpg)
Markov Decision Processes
Markov Decision Process (MDP) gives a way to formulate RL problems
MDP formulation assumes the agent knows the its state in the environment
Partially Observed MDP (POMDP) can be used when state is unknown
Assume there are K possible states and a set of actions at any state
A transition model p(st+1 = `|st = k, at = a) = Pa(k, `)
Each Pa is a K × K matrix of transition probabilites (one for each action a)
A reward function R(st = k, at = a)
Goal: Find a “policy” π(st = k) which returns the optimal action for st = k
(Pa,R) and π can be estimated in an alternating fashion
Estimating Pa and R requires some training data. Can be done even when the state space iscontinuous (requires solving a function approximation problem)
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 13
![Page 34: An Overview of Other Topics, Conclusions, and Perspectives · 2017. 8. 14. · Sparse Modeling. Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives](https://reader035.vdocuments.mx/reader035/viewer/2022071502/61227592c14bf87b31001997/html5/thumbnails/34.jpg)
Markov Decision Processes
Markov Decision Process (MDP) gives a way to formulate RL problems
MDP formulation assumes the agent knows the its state in the environment
Partially Observed MDP (POMDP) can be used when state is unknown
Assume there are K possible states and a set of actions at any state
A transition model p(st+1 = `|st = k, at = a) = Pa(k, `)
Each Pa is a K × K matrix of transition probabilites (one for each action a)
A reward function R(st = k, at = a)
Goal: Find a “policy” π(st = k) which returns the optimal action for st = k
(Pa,R) and π can be estimated in an alternating fashion
Estimating Pa and R requires some training data. Can be done even when the state space iscontinuous (requires solving a function approximation problem)
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 13
Multitask/Transfer Learning
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 14
Multitask/Transfer Learning

An umbrella term for settings where we want to learn multiple models that could potentially benefit from sharing information with each other

Each learning problem is a “task”

Note: We rarely know how the different tasks are related to each other

We have to learn the relatedness structure as well

For M tasks, we jointly learn M models, say w1, w2, . . . , wM

Need to jointly regularize the models to make them similar to each other based on their degree/way of relatedness (to be learned)

Some other related problem settings: Domain Adaptation, Covariate Shift
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 15
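One simple form of joint regularization pulls every task's weights toward the task average. A sketch under hypothetical assumptions (4 synthetic regression tasks whose true weights share a common component; a fixed coupling strength lam; gradient steps that treat the task mean as fixed, block-coordinate style):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D = 4, 30, 5            # hypothetical: 4 tasks, 30 examples each, 5 features

# Related tasks: each true weight vector is a shared vector plus small task-specific noise
w_shared = rng.normal(size=D)
Xs = [rng.normal(size=(N, D)) for _ in range(M)]
ys = [X @ (w_shared + 0.1 * rng.normal(size=D)) for X in Xs]

lam = 1.0                     # strength of the "stay close to the task mean" coupling
W = np.zeros((M, D))          # one weight vector per task, learned jointly

# Joint objective: sum_m ||y_m - X_m w_m||^2 + lam * sum_m ||w_m - mean(W)||^2
for _ in range(2000):
    W_bar = W.mean(axis=0)    # current estimate of the "relatedness" center
    for m in range(M):
        # gradient of task m's loss plus coupling term (W_bar held fixed)
        grad = 2 * Xs[m].T @ (Xs[m] @ W[m] - ys[m]) + 2 * lam * (W[m] - W_bar)
        W[m] -= 1e-3 * grad
```

Richer multitask methods replace the single mean with a learned task-relatedness structure (e.g., a task covariance matrix), but the pattern — per-task loss plus a coupling regularizer — is the same.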
Covariate Shift

Note: The input or feature vector x is also known as a “covariate”

Suppose training inputs come from distribution ptr(x)

Suppose test inputs come from a different distribution pte(x), i.e., ptr(x) ≠ pte(x)

The following importance-weighted loss function can be used to handle this issue:

∑ₙ₌₁ᴺ [pte(xn) / ptr(xn)] ℓ(yn, f(xn)) + R(f)

Why will this work? Well, because

E(x,y)∼pte [ℓ(y, x, w)] = E(x,y)∼ptr [ (pte(x, y) / ptr(x, y)) ℓ(y, x, w) ]

If p(y|x) doesn’t change and only p(x) changes, then pte(x, y)/ptr(x, y) = pte(x)/ptr(x)

The ratio can actually be estimated without estimating the two densities (there is a huge body of work on this problem)
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 16
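A minimal sketch of the importance-weighting idea, under hypothetical assumptions: a 1-D regression problem where both densities are known Gaussians, p(y|x) is fixed and quadratic, and the fitted model is deliberately misspecified (linear) so the shift actually matters:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def true_f(x):                # p(y|x): quadratic mean, fixed across train and test
    return 1.0 + 2.0 * x - x ** 2

N = 500
x_tr = rng.normal(0.0, 0.5, size=N)      # training inputs: p_tr(x) = N(0, 0.5^2)
y_tr = true_f(x_tr) + 0.1 * rng.normal(size=N)
mu_te, s_te = 1.0, 0.3                   # test inputs: p_te(x) = N(1, 0.3^2)

# Importance weights p_te(x)/p_tr(x); here both densities are known in closed form
w = gauss_pdf(x_tr, mu_te, s_te) / gauss_pdf(x_tr, 0.0, 0.5)

# Fit a *linear* model with and without the weights (weighted least squares)
Phi = np.stack([np.ones(N), x_tr], axis=1)
beta_plain = np.linalg.lstsq(Phi, y_tr, rcond=None)[0]
Wd = np.diag(w)
beta_iw = np.linalg.solve(Phi.T @ Wd @ Phi, Phi.T @ Wd @ y_tr)

# Evaluate both fits on inputs drawn from the test distribution
x_te = rng.normal(mu_te, s_te, size=2000)
y_te = true_f(x_te) + 0.1 * rng.normal(size=2000)
Phi_te = np.stack([np.ones(2000), x_te], axis=1)
mse_plain = np.mean((Phi_te @ beta_plain - y_te) ** 2)
mse_iw = np.mean((Phi_te @ beta_iw - y_te) ** 2)
```

The weighted fit concentrates on the region the test distribution cares about, so its test error should be lower; in real problems the ratio itself has to be estimated, as the slide notes.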
Active Learning
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 17
Active Learning
Allow the learner to ask for the most informative training examples
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 18
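The most common instance of this idea is pool-based uncertainty sampling: repeatedly query the label of the unlabeled point the current model is least confident about. A sketch with hypothetical synthetic data and a hand-rolled logistic regression (step size, iteration counts, and regularization strength are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 200, 2
X_pool = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0])
y_pool = (X_pool @ w_true + 0.1 * rng.normal(size=N) > 0).astype(float)

def fit_logreg(X, y, iters=500, lr=0.1, lam=1e-2):
    # plain gradient descent on L2-regularized logistic loss
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + lam * w)
    return w

labeled = list(rng.choice(N, size=5, replace=False))   # small random seed set
for _ in range(20):                                    # 20 active-learning queries
    w = fit_logreg(X_pool[labeled], y_pool[labeled])
    p = 1.0 / (1.0 + np.exp(-X_pool @ w))
    unc = -np.abs(p - 0.5)                             # p closest to 0.5 = most uncertain
    unc[labeled] = -np.inf                             # never re-query a labeled point
    labeled.append(int(np.argmax(unc)))
```

The queried points tend to cluster near the decision boundary, which is exactly where labels are most informative for this model class.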
Bayesian Learning
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 19
Optimization vs Inference

Learning as Optimization

Parameter θ is a fixed unknown

Minimize a “loss” and find a point estimate (best answer) for θ, given data X:

θ̂ = argmin_{θ∈Θ} Loss(X; θ), or argmax_θ log p(X|θ) (Maximum Likelihood), or argmax_θ log p(X|θ)p(θ) (Maximum-a-Posteriori Estimation)

Learning as (Bayesian) Inference

Treat the parameter θ as a random variable with a prior distribution p(θ)

Infer a posterior distribution over the parameters using Bayes’ rule:

p(θ|X) = p(X|θ) p(θ) / p(X) ∝ Likelihood × Prior

The posterior becomes the new prior for the next batch of observed data

No “fitting”, so no overfitting!
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 20
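The contrast is easiest to see in the conjugate Beta-Bernoulli model, where all three quantities have closed forms. A sketch with hypothetical numbers (7 heads out of 10 tosses, a Beta(2, 2) prior):

```python
# Coin-flip model: x_n ~ Bernoulli(theta), prior theta ~ Beta(a, b)
n1, n0 = 7, 3        # hypothetical data: 7 heads, 3 tails
a, b = 2.0, 2.0      # hypothetical prior hyperparameters

# MLE: argmax_theta log p(X|theta)
theta_mle = n1 / (n1 + n0)

# MAP: argmax_theta log p(X|theta) + log p(theta)  (mode of the Beta posterior)
theta_map = (n1 + a - 1) / (n1 + n0 + a + b - 2)

# Full Bayesian inference: by conjugacy the posterior is again a Beta,
# p(theta|X) = Beta(a + n1, b + n0) -- a whole distribution, not a point estimate
post_a, post_b = a + n1, b + n0
post_mean = post_a / (post_a + post_b)
post_var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
```

Note also how "the posterior becomes the new prior": updating Beta(a, b) with two batches of counts in sequence gives exactly the same Beta(a + n1, b + n0) as one combined update.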
Why be Bayesian?

Can capture/quantify the uncertainty (or “variance”) in θ via the posterior

Can make predictions by averaging over the posterior:

p(y|x, X, Y) = ∫ p(y|x, θ) p(θ|X, Y) dθ   (the “predictive posterior”)

Many other benefits (wait for next semester :) )
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 21
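Continuing the Beta-Bernoulli example, the predictive posterior integral can be approximated by Monte Carlo averaging over posterior samples, and here it also has a closed form (the posterior mean) to check against. The counts and hyperparameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior from the coin-flip example: Beta(a + n1, b + n0)
a, b, n1, n0 = 2.0, 2.0, 7, 3
post_a, post_b = a + n1, b + n0

# Predictive posterior: p(x_new = 1 | X) = integral of p(x_new=1|theta) p(theta|X) dtheta.
# Approximate the integral by averaging p(x_new=1|theta) = theta over posterior samples.
samples = rng.beta(post_a, post_b, size=200_000)   # theta ~ p(theta|X)
pred_mc = np.mean(samples)

# For this conjugate model the integral is exactly the posterior mean
pred_exact = post_a / (post_a + post_b)
```

The same posterior-averaging recipe applies when no closed form exists, which is where MCMC and variational inference come in.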
Conclusion and Take-aways
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 22
Conclusion and Take-aways

Most learning problems can be cast as optimizing a regularized loss function

Probabilistic viewpoint: Most learning problems can be cast as doing MLE/MAP on a probabilistic model of the data

Negative log-likelihood (NLL) = loss function, log-prior = regularizer

More sophisticated models can be constructed with this basic understanding: just think of the appropriate loss function/probability model for the data, and the appropriate regularizer/prior

Always start with simple models. Linear models can be really powerful given a good feature representation.

Learn to first diagnose a learning algorithm rather than trying new ones

No free lunch: no learning algorithm is “universally” good.
Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives 23
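The “NLL = loss, log-prior = regularizer” correspondence can be checked numerically: MAP estimation with a Gaussian likelihood and a zero-mean Gaussian prior gives exactly ridge regression with λ = σ²/τ². A sketch with hypothetical synthetic data and variances:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

sigma2, tau2 = 0.1 ** 2, 1.0   # hypothetical noise and prior variances
lam = sigma2 / tau2            # the equivalent ridge strength

# Ridge / regularized least squares: argmin ||y - Xw||^2 + lam ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# MAP with likelihood N(y | Xw, sigma2 I) and prior N(w | 0, tau2 I):
# setting the gradient of the log-posterior to zero gives these normal equations
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(D) / tau2, X.T @ y / sigma2)
```

Multiplying the MAP normal equations through by σ² recovers the ridge ones, so the two solutions coincide term by term.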
![Page 77: An Overview of Other Topics, Conclusions, and Perspectives · 2017. 8. 14. · Sparse Modeling. Machine Learning (CS771A) An Overview of Other Topics, Conclusions, and Perspectives](https://reader035.vdocuments.mx/reader035/viewer/2022071502/61227592c14bf87b31001997/html5/thumbnails/77.jpg)
Thank You!