machine learning summer school 2016
TRANSCRIPT
data science @ NYT
Chris Wiggins
Aug 8/9, 2016
Outline

1. overview of DS@NYT
2. prediction + supervised learning
3. prescription, causality, and RL
4. description + inference
5. (if interest) designing data products
0. Thank the organizers!
Figure 1: prepping slides until last minute
Lecture 1: overview of DS@NYT
data science @ The New York Times
@chrishwiggins
references: http://bit.ly/stanf16
data science: searches
data science: mindset & toolset
drew conway, 2010
modern history:
2009
biology: 1892 vs. 1995
biology changed for good.
new toolset, new mindset
genetics: 1837 vs. 2012
ML toolset; data science mindset
arxiv.org/abs/1105.5821 ; github.com/rajanil/mkboost
data science: mindset & toolset
1851
news: 20th century
church | state

news: 21st century
church | state | data
newspapering: 1851 vs. 1996
example:
millions of views per hour, 2015
"...social activities generate large quantities of potentially valuable data...The data were not generated for the
purpose of learning; however, the potential for learning is great’’
![Page 34: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/34.jpg)
"...social activities generate large quantities of potentially valuable data...The data were not generated for the
purpose of learning; however, the potential for learning is great’’ - J Chambers, Bell Labs,1993, “GLS”
![Page 35: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/35.jpg)
data science: the web

- is your "online presence"
- is a microscope
- is an experimental tool
- is an optimization tool
newspapering: 1851 vs. 1996 vs. 2008
"a startup is a temporary organization in search of a repeatable and scalable business model" — Steve Blank
every publisher is now a startup
news: 21st century
church | state | data
learnings

- predictive modeling
- descriptive modeling
- prescriptive modeling

cf. modelingsocialdata.org
(actually ML, shhhh…)

- (supervised learning)
- (unsupervised learning)
- (reinforcement learning)

h/t michael littman
(diagram: Supervised Learning, Reinforcement Learning, Unsupervised Learning (reports))
stats.stackexchange.com
from "are you a bayesian or a frequentist" — michael jordan

$L = \sum_{i=1}^{N} \phi\left(y_i f(x_i;\beta)\right) + \lambda \lVert \beta \rVert$
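As a concrete reading of this objective, here is a minimal numeric sketch (the data and linear model are illustrative assumptions, not from the talk), with $\phi$ the logistic surrogate and $f(x;\beta) = x^T\beta$:

```python
import numpy as np

def objective(beta, X, y, lam):
    margins = y * (X @ beta)                       # y_i f(x_i; beta) for linear f
    phi = np.log1p(np.exp(-margins))               # logistic surrogate phi on each example
    return phi.sum() + lam * np.linalg.norm(beta)  # sum of losses + lambda * ||beta||_2

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 3))                      # toy design matrix (assumption)
beta_true = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ beta_true + 0.3 * rng.normal(size=200))
print(objective(np.zeros(3), X, y, lam=0.1))       # loss at beta = 0
print(objective(beta_true, X, y, lam=0.1))         # lower loss near the truth
```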
predictive modeling, e.g., "the funnel"
cf. modelingsocialdata.org
interpretable predictive modeling: super cool stuff
cf. modelingsocialdata.org ; arxiv.org/abs/q-bio/0701021
optimization & learning, e.g.,
"How The New York Times Works," Popular Mechanics, 2015
optimization & prediction, e.g.,
(some models)
(some moneys)
“newsvendor problem,” literally (+prediction+experiment)
recommendation as inference
bit.ly/AlexCTM
descriptive modeling, e.g., "segments"

$\operatorname{argmax}_z p(z|x) = 14$ → "baby boomer"

cf. modelingsocialdata.org
- descriptive data product
cf. daeilkim.com
![Page 70: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/70.jpg)
descriptive modeling, e.g.,
cf. daeilkim.com ; import bnpy
modeling your audience

bit.ly/Hughes-Kim-Sudderth-AISTATS15
(optimization, ultimately)
also allows insight+targeting as inference
prescriptive modeling
"off policy value estimation" (cf. "causal effect estimation")

cf. Langford '08–'16; Horvitz & Thompson '52; Holland '86

Vapnik's razor: "When solving a (learning) problem of interest, do not solve a more complex problem as an intermediate step."
prescriptive modeling, aka "A/B testing"; RCT
cf. modelingsocialdata.org
prescriptive modeling: from A/B to….

(diagram: Reporting → Learning → Test; aka "A/B testing"; business as usual (esp. predictive))

"Some of the most recognizable personalization in our service is the collection of 'genre' rows. …Members connect with these rows so well that we measure an increase in member retention by placing the most tailored rows higher on the page instead of lower."

cf. modelingsocialdata.org
real-time A/B -> "bandits"
(GOOG blog screenshot)
cf. modelingsocialdata.org
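A minimal sketch of the bandit idea (a Thompson-sampling toy of my own, not the GOOG implementation): sample a plausible rate for each arm from its Beta posterior and serve whichever arm sampled highest, so traffic shifts toward the winner in real time.

```python
import numpy as np

rng = np.random.default_rng(8)
true_rates = [0.04, 0.06]                 # hypothetical click-through rates (assumption)
wins = np.ones(2)                         # Beta(1, 1) priors on each arm's rate
losses = np.ones(2)

for _ in range(10_000):
    arm = int(np.argmax(rng.beta(wins, losses)))   # sample a rate per arm, play the max
    reward = rng.random() < true_rates[arm]        # simulated click
    wins[arm] += reward
    losses[arm] += 1 - reward

print("pulls per arm:", (wins + losses - 2).astype(int))  # traffic shifts to the better arm
```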
prescriptive modeling, e.g., leverage methods which are predictive yet performant
NB: data-informed, not data-driven
predicting views/cascades: doable?

- KDD '09: how many people are online?
- WWW '14: FB shares
- WWW '16: TWIT RT's

predicting views/cascades: features?
(diagram: Reporting → Learning → Test → Optimizing → Explore; descriptive → predictive → prescriptive)
things: what does the DS team deliver?

- build data products
- build APIs
- impact roadmaps
data science @ The New York Times
@chrishwiggins
references: http://bit.ly/stanf16
Lecture 2: predictive modeling @ NYT
desc/pred/pres
Figure 2: desc/pred/pres
caveat: difference between observation and experiment. why?
blossom example
Figure 3: Reminder: Blossom
blossom + boosting (‘exponential’)
Figure 4: Reminder: Surrogate Loss Functions
tangent: logistic function as surrogate loss function

define $f(x) \equiv \log\frac{p(y=1|x)}{p(y=-1|x)} \in \mathbb{R}$

$p(y=1|x) + p(y=-1|x) = 1 \;\Rightarrow\; p(y|x) = 1/(1 + \exp(-yf))$

$-\log_2 p(y_1^N) = \sum_i \log_2\!\left(1 + e^{-y_i f(x_i)}\right) \equiv \sum_i \ell(y_i f(x_i))$

$\ell'' > 0$ and $\ell(\mu) > 1[\mu < 0]\;\forall \mu \in \mathbb{R}$.

∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classification (though not strongly convex, cf. Yoram's talk)

but $\sum_i \log_2\!\left(1 + e^{-y_i w^T h(x_i)}\right)$ is not as easy as $\sum_i e^{-y_i w^T h(x_i)}$
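A small numerical check (my own illustration, not from the slides) that the logistic surrogate upper-bounds the 0-1 loss on the margin $\mu = y f(x)$:

```python
import numpy as np

def surrogate_loss(margin):
    """ell(mu) = log2(1 + exp(-mu)): convex, upper-bounds the 0-1 loss."""
    return np.log2(1.0 + np.exp(-margin))

def zero_one_loss(margin):
    """1[mu < 0]: one if misclassified, else zero."""
    return (margin < 0).astype(float)

margins = np.linspace(-3, 3, 7)
for m, z, s in zip(margins, zero_one_loss(margins), surrogate_loss(margins)):
    print(f"mu={m:+.1f}  0-1={z:.0f}  logistic={s:.3f}")  # logistic >= 0-1 everywhere
```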
boosting¹

exponential surrogate loss function, summed over examples:

$L[F] = \sum_i \exp(-y_i F(x_i)) = \sum_i \exp\!\left(-y_i \sum_{t'}^{t} w_{t'} h_{t'}(x_i)\right) \equiv L_t(\mathbf{w}_t)$

Draw $h_t \in \mathcal{H}$, a large space of rules s.t. $h(x) \in \{-1, +1\}$; label $y \in \{-1, +1\}$

$L_{t+1}(\mathbf{w}_t; w) \equiv \sum_i d_i^t \exp(-y_i w h_{t+1}(x_i)) = \sum_{y = h'} d_i^t e^{-w} + \sum_{y \neq h'} d_i^t e^{+w} \equiv e^{-w} D_+ + e^{+w} D_-$

$\therefore w_{t+1} = \operatorname{argmin}_w L_{t+1}(w) = (1/2)\log(D_+/D_-)$

$L_{t+1}(w_{t+1}) = 2\sqrt{D_+ D_-} = 2D\sqrt{\nu_+(1 - \nu_+)}$, where $0 \le \nu_+ \equiv D_+/D = D_+/L_t \le 1$

update example weights $d_i^{t+1} = d_i^t e^{\mp w}$

Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, $L_1$, $L_\infty$, …

¹Duchi + Singer, "Boosting with structural sparsity," ICML '09
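A minimal sketch (assumptions mine: the stumps and toy data are illustrative) of one boosting round with the closed-form step $w = (1/2)\log(D_+/D_-)$ and the multiplicative weight update derived above:

```python
import numpy as np

def boosting_round(X, y, d, stumps):
    """One round: pick the stump minimizing 2*sqrt(D+ D-), set w in closed form,
    and reweight examples. y in {-1,+1}; d are nonnegative example weights."""
    best = None
    for h in stumps:
        pred = h(X)
        D_plus = d[pred == y].sum()              # weight on correctly classified points
        D_minus = d[pred != y].sum()             # weight on misclassified points
        loss = 2.0 * np.sqrt(D_plus * D_minus)   # L_{t+1} at the optimal w
        if best is None or loss < best[0]:
            best = (loss, h, D_plus, D_minus, pred)
    _, h, D_plus, D_minus, pred = best
    w = 0.5 * np.log((D_plus + 1e-12) / (D_minus + 1e-12))  # (1/2) log D+/D-
    d = d * np.exp(-w * y * pred)                # e^{-w} if correct, e^{+w} if not
    return h, w, d

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))          # noisy 1-d labels
stumps = [lambda X, t=t: np.where(X[:, 0] > t, 1.0, -1.0)  # threshold rules in H
          for t in np.linspace(-1, 1, 9)]
h, w, d = boosting_round(X, y, np.ones(100) / 100, stumps)
print("chosen step size w =", round(w, 3))
```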
predicting people: "customer journey" prediction

- fun covariates
- observational complications vs. structural models
predicting people (reminder)
Figure 5: both in science and in real world, feature analysis guides futureexperiments
![Page 122: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/122.jpg)
single copy (reminder)
Figure 6: from Lecture 1
![Page 123: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/123.jpg)
example in CAR (computer assisted reporting)

Figure 7: Tabuchi article

- cf. Freedman's "Statistical Models and Shoe Leather"²
- Takata airbag fatalities
- 2219 labeled³ examples from 33,204 comments
- cf. Box's "Science and Statistics"⁴

²Freedman, David A. "Statistical models and shoe leather." Sociological Methodology 21.2 (1991): 291-313.
³By Hiroko Tabuchi, a Pulitzer winner
⁴Box, George E. P. "Science and Statistics." Journal of the American Statistical Association, Vol. 71, No. 356 (Dec. 1976), pp. 791-799.
computer assisted reporting
Impact
Figure 8: impact
![Page 129: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/129.jpg)
Lecture 3: prescriptive modeling @ NYT
![Page 130: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/130.jpg)
the natural abstraction

- operators⁵ make decisions
- faster horses v. cars
- general insights v. optimal policies

⁵In the sense of business deciders; that said, doctors, including those who operate, also have to make decisions, cf. personalized medicine
maximizing outcome

the problem: maximizing an outcome over policies…
…while inferring causality from observation
…which is different from predicting the outcome in the absence of an action/policy
examples: observation is not experiment

- e.g., (Med.) smoking hurts vs. unhealthy people smoke
- e.g., (Med.) the affluent get prescribed different meds/treatment
- e.g., (life) veterans earn less vs. the rich serve less⁶
- e.g., (life) admitted to school vs. learn at school?

⁶Angrist, Joshua D. (1990). "Lifetime Earnings and the Vietnam Draft Lottery: Evidence from Social Security Administrative Records". American Economic Review 80 (3): 313–336.
reinforcement/machine learning/graphical models

- key idea: model the joint $p(y, a, x)$
- explore/exploit: family of joints $p_\alpha(y, a, x)$
- "causality": $p_\alpha(y, a, x) = p(y|a, x)\, p_\alpha(a|x)\, p(x)$, i.e., "a causes y"
- nomenclature: 'response', 'policy'/'bias', 'prior' for the three factors above (see the sketch below)
in general
Figure 9: policy/bias, response, and prior define the distribution
also describes both the ‘exploration’ and ‘exploitation’ distributions
![Page 146: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/146.jpg)
randomized controlled trial
Figure 10: RCT: ‘bias’ removed, random ‘policy’ (response and priorunaffected)
also Pearl’s ‘do’ distribution: a distribution with “no arrows”pointing to the action variable.
![Page 147: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/147.jpg)
POISE: calculation, estimation, optimization

POISE: "policy optimization via importance sample estimation"

- Monte Carlo importance sampling estimation, aka "off policy estimation"
- role of "IPW"
- reduction
- normalization
- hyper-parameter searching
- unexpected connection: personalized medicine
POISE setup and Goal

- "a causes y" $\iff \exists$ family $p_\alpha(y, a, x) = p(y|a, x)\, p_\alpha(a|x)\, p(x)$
- define the off-policy/exploration distribution $p_-(y, a, x) = p(y|a, x)\, p_-(a|x)\, p(x)$
- define the exploitation distribution $p_+(y, a, x) = p(y|a, x)\, p_+(a|x)\, p(x)$
- Goal: maximize $E_+(Y)$ over $p_+(a|x)$ using data drawn from $p_-(y, a, x)$
- notation: $x, a, y \in X, A, Y$; i.e., $E_\alpha(Y)$ is not a function of $y$
POISE math: IS + Monte Carlo estimation = ISE

i.e., "importance sampling estimation"

$E_+(Y) \equiv \sum_{y,a,x} y\, p_+(y, a, x)$

$E_+(Y) = \sum_{y,a,x} y\, p_-(y, a, x)\, \frac{p_+(y, a, x)}{p_-(y, a, x)}$

$E_+(Y) = \sum_{y,a,x} y\, p_-(y, a, x)\, \frac{p_+(a|x)}{p_-(a|x)}$

$E_+(Y) \approx N^{-1} \sum_i y_i\, \frac{p_+(a_i|x_i)}{p_-(a_i|x_i)}$

let's spend some time getting to know this last equation, the importance sampling estimate of outcome in a "causal model" ("a causes y") among $y, a, x$ (a numerical sketch follows)
Observation (cf. Bottou⁷)

- factorizing: $\frac{P_+(x)}{P_-(x)} = \prod \frac{\text{factors in } P_+ \text{ but not } P_-}{\text{factors in } P_- \text{ but not } P_+}$
- origin: importance sampling $E_q(f) = E_p(f\, q/p)$ (as in variational methods)
- the "causal" model $p_\alpha(y, a, x) = p(y|a, x)\, p_\alpha(a|x)\, p(x)$ helps here
- the factors left over are the numerator ($p_+(a|x)$, to optimize) and the denominator ($p_-(a|x)$, to infer if not an RCT)
- unobserved confounders will confound us (later)

⁷Counterfactual Reasoning and Learning Systems, arXiv:1209.2355
Reduction (cf. Langford⁸ ⁹ ¹⁰ ('05, '08, '09))

- consider the numerator for a deterministic policy: $p_+(a|x) = 1[a = h(x)]$
- $E_+(Y) \propto \sum_i \big(y_i/p_-(a_i|x_i)\big)\, 1[a_i = h(x_i)] \equiv \sum_i w_i\, 1[a_i = h(x_i)]$
- note: $1[c = d] = 1 - 1[c \neq d]$
- $\therefore E_+(Y) \propto \text{constant} - \sum_i w_i\, 1[a_i \neq h(x_i)]$
- $\therefore$ policy optimization reduces to (weighted) classification (see the sketch below)

⁸Langford & Zadrozny, "Relating Reinforcement Learning Performance to Classification Performance," ICML 2005
⁹Beygelzimer & Langford, "The offset tree for learning with partial labels" (KDD 2009)
¹⁰Tutorial on "Reductions" (including at ICML 2009)
Reduction w/optimistic complication

- prescription $\iff$ classification: $L = \sum_i w_i\, 1[a_i \neq h(x_i)]$
- weight $w_i = y_i/p_-(a_i|x_i)$, inferred or from an RCT
- destroys measure by treating $p_-(a|x)$ differently than $1/p_-(a|x)$
- normalize as $L \equiv \frac{\sum_i y_i\, 1[a_i \neq h(x_i)]/p_-(a_i|x_i)}{\sum_i 1[a_i \neq h(x_i)]/p_-(a_i|x_i)}$
- but that destroys the lovely reduction
- or simply¹¹ $L(\lambda) = \sum_i (y_i - \lambda)\, 1[a_i \neq h(x_i)]/p_-(a_i|x_i)$
- hidden here is a 2nd parameter in the classification, $\therefore$ a harder search (see the sketch below)

¹¹Suggestion by Dan Hsu
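A minimal sketch (the weights and data are illustrative assumptions) of the two loss variants: the self-normalized ratio and the λ-shifted form suggested above.

```python
import numpy as np

def normalized_loss(mismatch, y, p):
    """Self-normalized form: ratio of IPW sums over disagreeing examples."""
    iw = mismatch / p                              # 1[a_i != h(x_i)] / p_minus(a_i|x_i)
    return np.sum(y * iw) / np.sum(iw)

def shifted_loss(mismatch, y, p, lam):
    """L(lambda) = sum_i (y_i - lambda) 1[a_i != h(x_i)] / p_minus(a_i|x_i)."""
    return np.sum((y - lam) * mismatch / p)

rng = np.random.default_rng(4)
y = rng.binomial(1, 0.4, 500).astype(float)        # toy outcomes
p = np.full(500, 0.5)                              # logging propensities
mismatch = rng.binomial(1, 0.3, 500)               # 1[a_i != h(x_i)] for some fixed h
print(normalized_loss(mismatch, y, p))             # a weighted average of y
print(shifted_loss(mismatch, y, p, lam=0.4))       # near zero when lambda matches it
```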
POISE punchlines

- allows policy planning even with implicit logged exploration data¹²
- e.g., the two-hospital story
- "personalized medicine" is also a policy
- abundant data available, under-explored IMHO

¹²Strehl, Alex, et al. "Learning from logged implicit exploration data." Advances in Neural Information Processing Systems. 2010.
tangent: causality as told by an economist

different, related goal: economists think in terms of ATE/ITE instead of policy

ATE: $\tau \equiv E_0(Y|a=1) - E_0(Y|a=0) \equiv Q(a=1) - Q(a=0)$

CATE, aka Individualized Treatment Effect (ITE): $\tau(x) \equiv E_0(Y|a=1, x) - E_0(Y|a=0, x) \equiv Q(a=1, x) - Q(a=0, x)$

(a numerical sketch follows)
Q-note: “generalizing” Monte Carlo w/kernels
MC: E_p(f) = ∑_x p(x) f(x) ≈ N⁻¹ ∑_{i∼p} f(x_i)
![Page 195: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/195.jpg)
Q-note: “generalizing” Monte Carlo w/kernels
MC: E_p(f) = ∑_x p(x) f(x) ≈ N⁻¹ ∑_{i∼p} f(x_i)
K: p(x) ≈ N⁻¹ ∑_i K(x|x_i)
![Page 196: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/196.jpg)
Q-note: “generalizing” Monte Carlo w/kernels
MC: E_p(f) = ∑_x p(x) f(x) ≈ N⁻¹ ∑_{i∼p} f(x_i)
K: p(x) ≈ N⁻¹ ∑_i K(x|x_i)
⇒ ∑_x p(x) f(x) ≈ N⁻¹ ∑_i ∑_x f(x) K(x|x_i)
![Page 197: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/197.jpg)
Q-note: “generalizing” Monte Carlo w/kernels
MC: E_p(f) = ∑_x p(x) f(x) ≈ N⁻¹ ∑_{i∼p} f(x_i)
K: p(x) ≈ N⁻¹ ∑_i K(x|x_i)
⇒ ∑_x p(x) f(x) ≈ N⁻¹ ∑_i ∑_x f(x) K(x|x_i)
K can be any normalized function, e.g., K(x|x_i) = δ_{x,x_i}, which yields MC.
![Page 198: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/198.jpg)
Q-note: “generalizing” Monte Carlo w/kernels
MC: E_p(f) = ∑_x p(x) f(x) ≈ N⁻¹ ∑_{i∼p} f(x_i)
K: p(x) ≈ N⁻¹ ∑_i K(x|x_i)
⇒ ∑_x p(x) f(x) ≈ N⁻¹ ∑_i ∑_x f(x) K(x|x_i)
K can be any normalized function, e.g., K(x|x_i) = δ_{x,x_i}, which yields MC.
multivariate: E_p(f) ≈ N⁻¹ ∑_i ∑_{y,a,x} f(y,a,x) K₁(y|y_i) K₂(a|a_i) K₃(x|x_i)
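A small numpy sketch of the kernelized estimator (the test function f, the sampling distribution p, and the bandwidth h are my illustrative choices): each sample x_i is replaced by a normalized kernel row over x, and a delta kernel recovers plain MC.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.sin(x) ** 2            # illustrative test function
xs = rng.normal(size=2000)              # x_i ~ p, here p = N(0, 1)

# plain MC: E_p[f] ~= N^-1 sum_i f(x_i)   (the delta-kernel case)
mc = f(xs).mean()

# kernel version: each sample becomes a normalized weight K(x|x_i)
# over a grid of x, then E_p[f] ~= N^-1 sum_i sum_x f(x) K(x|x_i)
h = 0.2                                 # bandwidth (my choice)
grid = np.linspace(-6, 6, 1201)
K = np.exp(-0.5 * ((grid[None, :] - xs[:, None]) / h) ** 2)
K /= K.sum(axis=1, keepdims=True)       # normalize each row over x
kmc = (K @ f(grid)).mean()

print(mc, kmc)   # both near the exact value (1 - e**-2) / 2 ~ 0.432
```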
![Page 199: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/199.jpg)
Q-note: application w/strata+matching, setup
Helps think about economists’ approach:
Q(a,x) ≡ E(Y|a,x) = ∑_y y p(y|a,x) = ∑_y y p⁻(y,a,x) / (p⁻(a|x) p(x))
![Page 200: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/200.jpg)
Q-note: application w/strata+matching, setup
Helps think about economists’ approach:
Q(a,x) ≡ E(Y|a,x) = ∑_y y p(y|a,x) = ∑_y y p⁻(y,a,x) / (p⁻(a|x) p(x))
= (p⁻(a|x) p(x))⁻¹ ∑_y y p⁻(y,a,x)
![Page 201: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/201.jpg)
Q-note: application w/strata+matching, setup
Helps think about economists’ approach:
Q(a,x) ≡ E(Y|a,x) = ∑_y y p(y|a,x) = ∑_y y p⁻(y,a,x) / (p⁻(a|x) p(x))
= (p⁻(a|x) p(x))⁻¹ ∑_y y p⁻(y,a,x)
stratify x using z(x) such that ∪_z z = X and z ∩ z′ = ∅
![Page 202: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/202.jpg)
Q-note: application w/strata+matching, setup
Helps think about economists’ approach:
Q(a,x) ≡ E(Y|a,x) = ∑_y y p(y|a,x) = ∑_y y p⁻(y,a,x) / (p⁻(a|x) p(x))
= (p⁻(a|x) p(x))⁻¹ ∑_y y p⁻(y,a,x)
stratify x using z(x) such that ∪_z z = X and z ∩ z′ = ∅
n(x) = ∑_i 1[z(x_i) = z(x)] = number of points in x’s stratum
![Page 203: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/203.jpg)
Q-note: application w/strata+matching, setup
Helps think about economists’ approach:
Q(a,x) ≡ E(Y|a,x) = ∑_y y p(y|a,x) = ∑_y y p⁻(y,a,x) / (p⁻(a|x) p(x))
= (p⁻(a|x) p(x))⁻¹ ∑_y y p⁻(y,a,x)
stratify x using z(x) such that ∪_z z = X and z ∩ z′ = ∅
n(x) = ∑_i 1[z(x_i) = z(x)] = number of points in x’s stratum
Ω(x) = ∑_{x′} 1[z(x′) = z(x)] = area of x’s stratum
![Page 204: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/204.jpg)
Q-note: application w/strata+matching, setup
Helps think about economists’ approach:
Q(a,x) ≡ E(Y|a,x) = ∑_y y p(y|a,x) = ∑_y y p⁻(y,a,x) / (p⁻(a|x) p(x))
= (p⁻(a|x) p(x))⁻¹ ∑_y y p⁻(y,a,x)
stratify x using z(x) such that ∪_z z = X and z ∩ z′ = ∅
n(x) = ∑_i 1[z(x_i) = z(x)] = number of points in x’s stratum
Ω(x) = ∑_{x′} 1[z(x′) = z(x)] = area of x’s stratum
∴ K₃(x|x_i) = 1[z(x) = z(x_i)] / Ω(x)
![Page 205: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/205.jpg)
Q-note: application w/strata+matching, setup
Helps think about economists’ approach:
Q(a,x) ≡ E(Y|a,x) = ∑_y y p(y|a,x) = ∑_y y p⁻(y,a,x) / (p⁻(a|x) p(x))
= (p⁻(a|x) p(x))⁻¹ ∑_y y p⁻(y,a,x)
stratify x using z(x) such that ∪_z z = X and z ∩ z′ = ∅
n(x) = ∑_i 1[z(x_i) = z(x)] = number of points in x’s stratum
Ω(x) = ∑_{x′} 1[z(x′) = z(x)] = area of x’s stratum
∴ K₃(x|x_i) = 1[z(x) = z(x_i)] / Ω(x)
as in MC, K₁(y|y_i) = δ_{y,y_i}, K₂(a|a_i) = δ_{a,a_i}
![Page 206: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/206.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
![Page 207: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/207.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
![Page 208: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/208.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
∴ Q(a,x) ≈ p⁻(a|x)⁻¹ n(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
![Page 209: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/209.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
∴ Q(a,x) ≈ p⁻(a|x)⁻¹ n(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
![Page 210: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/210.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
∴ Q(a,x) ≈ p⁻(a|x)⁻¹ n(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
“matching” means: choose each z to contain 1 positive example & 1 negative example,
p⁻(a|x) ≈ 1/2, n(x) = 2
![Page 211: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/211.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
∴ Q(a,x) ≈ p⁻(a|x)⁻¹ n(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
“matching” means: choose each z to contain 1 positive example & 1 negative example,
p⁻(a|x) ≈ 1/2, n(x) = 2
∴ τ(x) = Q(a=1,x) − Q(a=0,x) = y₁(x) − y₀(x)
![Page 212: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/212.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
∴ Q(a,x) ≈ p⁻(a|x)⁻¹ n(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
“matching” means: choose each z to contain 1 positive example & 1 negative example,
p⁻(a|x) ≈ 1/2, n(x) = 2
∴ τ(x) = Q(a=1,x) − Q(a=0,x) = y₁(x) − y₀(x)
z-generalizations: graphs, digraphs, k-NN, “matching”
![Page 213: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/213.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
∴ Q(a,x) ≈ p⁻(a|x)⁻¹ n(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
“matching” means: choose each z to contain 1 positive example & 1 negative example,
p⁻(a|x) ≈ 1/2, n(x) = 2
∴ τ(x) = Q(a=1,x) − Q(a=0,x) = y₁(x) − y₀(x)
z-generalizations: graphs, digraphs, k-NN, “matching”
K-generalizations: continuous a, any metric or similarity you like, ...
![Page 214: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/214.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
∴ Q(a,x) ≈ p⁻(a|x)⁻¹ n(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
“matching” means: choose each z to contain 1 positive example & 1 negative example,
p⁻(a|x) ≈ 1/2, n(x) = 2
∴ τ(x) = Q(a=1,x) − Q(a=0,x) = y₁(x) − y₀(x)
z-generalizations: graphs, digraphs, k-NN, “matching”
K-generalizations: continuous a, any metric or similarity you like, ...
![Page 215: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/215.jpg)
Q-note: application w/strata+matching, payoff
∑_y y p⁻(y,a,x) ≈ N⁻¹ Ω(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
p(x) ≈ (n(x)/N) Ω(x)⁻¹
∴ Q(a,x) ≈ p⁻(a|x)⁻¹ n(x)⁻¹ ∑_{a_i=a, z(x_i)=z(x)} y_i
“matching” means: choose each z to contain 1 positive example & 1 negative example,
p⁻(a|x) ≈ 1/2, n(x) = 2
∴ τ(x) = Q(a=1,x) − Q(a=0,x) = y₁(x) − y₀(x)
z-generalizations: graphs, digraphs, k-NN, “matching”
K-generalizations: continuous a, any metric or similarity you like, ...
IMHO underexplored
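A numpy sketch of the stratified estimator (the simulation, binning, and known logging propensity are illustrative assumptions): Q̂(a,x) per the formula above is a reweighted within-stratum mean.

```python
import numpy as np

rng = np.random.default_rng(2)

# illustrative logged data: logged policy p-(a=1|x) = 0.5 everywhere,
# outcome depends on both a and x, so tau(x) = 1 - x
N = 50_000
x = rng.uniform(0, 1, size=N)
a = rng.integers(0, 2, size=N)
y = rng.normal(loc=x + a * (1.0 - x), scale=0.5)

z = np.minimum((x * 10).astype(int), 9)      # strata z(x): 10 equal bins
p_log = 0.5                                  # known logging propensity

def Q_hat(action, stratum):
    """Q(a,x) ~= p-(a|x)^-1 n(x)^-1 sum_{a_i=a, z(x_i)=z(x)} y_i"""
    in_stratum = z == stratum
    n = in_stratum.sum()                     # n(x): all points in stratum
    return (1.0 / p_log) * y[in_stratum & (a == action)].sum() / n

for s in range(10):                          # tau per stratum ~ 1 - x
    print(s, round(Q_hat(1, s) - Q_hat(0, s), 3))
```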
![Page 216: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/216.jpg)
causality, as understood in marketing
a/b testing and RCT
Figure 11: Blattberg, Robert C., Byung-Do Kim, and Scott A. Neslin. Database Marketing, Springer New York, 2008
![Page 217: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/217.jpg)
causality, as understood in marketing
a/b testing and RCT
yield optimization
Figure 11: Blattberg, Robert C., Byung-Do Kim, and Scott A. Neslin. Database Marketing, Springer New York, 2008
![Page 218: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/218.jpg)
causality, as understood in marketing
a/b testing and RCT
yield optimization
Lorenz curve (vs ROC plots)
Figure 11: Blattberg, Robert C., Byung-Do Kim, and Scott A. Neslin. Database Marketing, Springer New York, 2008
![Page 219: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/219.jpg)
unobserved confounders vs. “causality” modeling
truth: p_α(y,a,x,u) = p(y|a,x,u) p_α(a|x,u) p(x,u)
![Page 220: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/220.jpg)
unobserved confounders vs. “causality” modeling
truth: p_α(y,a,x,u) = p(y|a,x,u) p_α(a|x,u) p(x,u)
but: p⁺(y,a,x,u) = p(y|a,x,u) p⁻(a|x) p(x,u)
![Page 221: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/221.jpg)
unobserved confounders vs. “causality” modeling
truth: p_α(y,a,x,u) = p(y|a,x,u) p_α(a|x,u) p(x,u)
but: p⁺(y,a,x,u) = p(y|a,x,u) p⁻(a|x) p(x,u)
E⁺(Y) ≡ ∑_{y,a,x,u} y p⁺(y,a,x,u) ≈ N⁻¹ ∑_{i∼p⁻} y_i p⁺(a_i|x_i)/p⁻(a_i|x_i,u_i)
![Page 222: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/222.jpg)
unobserved confounders vs. “causality” modeling
truth: p_α(y,a,x,u) = p(y|a,x,u) p_α(a|x,u) p(x,u)
but: p⁺(y,a,x,u) = p(y|a,x,u) p⁻(a|x) p(x,u)
E⁺(Y) ≡ ∑_{y,a,x,u} y p⁺(y,a,x,u) ≈ N⁻¹ ∑_{i∼p⁻} y_i p⁺(a_i|x_i)/p⁻(a_i|x_i,u_i)
denominator cannot be inferred; ignore at your peril
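A minimal simulation of that warning (all numbers illustrative, and I evaluate an "always treat" target policy as the example): u drives the logged actions, and weighting with the u-blind denominator p⁻(a) gives a biased estimate.

```python
import numpy as np

rng = np.random.default_rng(3)

# illustrative world: unobserved u drives both logged action and outcome
N = 200_000
u = rng.integers(0, 2, size=N)
p_log_au = np.where(u == 1, 0.8, 0.2)        # p-(a=1|u): depends on u
a = (rng.uniform(size=N) < p_log_au).astype(int)
y = rng.normal(loc=a + 2.0 * u, scale=0.5)   # u confounds y

# target policy: always treat (p+(a=1) = 1)
# truth: E+(Y) = 1 + 2 E[u] = 2.0
w_correct = (a == 1) / p_log_au              # denominator uses u
w_naive = (a == 1) / a.mean()                # marginal p-(a=1), u ignored

print((w_correct * y).mean())   # ~2.0, unbiased
print((w_naive * y).mean())     # ~2.6, biased
```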
![Page 223: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/223.jpg)
cautionary tale problem: Simpson’s paradox
a: admissions (a=1: admitted, a=0: declined)
![Page 224: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/224.jpg)
cautionary tale problem: Simpson’s paradox
a: admissions (a=1: admitted, a=0: declined)
x: gender (x=1: female, x=0: male)
![Page 225: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/225.jpg)
cautionary tale problem: Simpson’s paradox
a: admissions (a=1: admitted, a=0: declined)
x: gender (x=1: female, x=0: male)
lawsuit (1973): .44 = p(a=1|x=0) > p(a=1|x=1) = .35
![Page 226: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/226.jpg)
cautionary tale problem: Simpson’s paradox
a: admissions (a=1: admitted, a=0: declined)
x: gender (x=1: female, x=0: male)
lawsuit (1973): .44 = p(a=1|x=0) > p(a=1|x=1) = .35
‘resolved’ by Bickel (1975)13 (see also Pearl14)
13P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975). “Sex Bias in Graduate Admissions: Data From Berkeley”. Science 187 (4175): 398–404
14Pearl, Judea (December 2013). “Understanding Simpson’s paradox”. UCLA Cognitive Systems Laboratory, Technical Report R-414.
![Page 227: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/227.jpg)
cautionary tale problem: Simpson’s paradox
a: admissions (a=1: admitted, a=0: declined)
x: gender (x=1: female, x=0: male)
lawsuit (1973): .44 = p(a=1|x=0) > p(a=1|x=1) = .35
‘resolved’ by Bickel (1975)13 (see also Pearl14)
u: unobserved department they applied to
13P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975). “Sex Bias in Graduate Admissions: Data From Berkeley”. Science 187 (4175): 398–404
14Pearl, Judea (December 2013). “Understanding Simpson’s paradox”. UCLA Cognitive Systems Laboratory, Technical Report R-414.
![Page 228: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/228.jpg)
cautionary tale problem: Simpson’s paradox
a: admissions (a=1: admitted, a=0: declined)
x: gender (x=1: female, x=0: male)
lawsuit (1973): .44 = p(a=1|x=0) > p(a=1|x=1) = .35
‘resolved’ by Bickel (1975)13 (see also Pearl14)
u: unobserved department they applied to
p(a|x) = ∑_{u=1}^{6} p(a|x,u) p(u|x)
13P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975). “Sex Bias in Graduate Admissions: Data From Berkeley”. Science 187 (4175): 398–404
14Pearl, Judea (December 2013). “Understanding Simpson’s paradox”. UCLA Cognitive Systems Laboratory, Technical Report R-414.
![Page 229: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/229.jpg)
cautionary tale problem: Simpson’s paradox
a: admissions (a=1: admitted, a=0: declined)
x: gender (x=1: female, x=0: male)
lawsuit (1973): .44 = p(a=1|x=0) > p(a=1|x=1) = .35
‘resolved’ by Bickel (1975)13 (see also Pearl14)
u: unobserved department they applied to
p(a|x) = ∑_{u=1}^{6} p(a|x,u) p(u|x)
e.g., if gender-blind within departments: p(a|1) − p(a|0) = ∑_u p(a|u) (p(u|1) − p(u|0))
13P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975). “Sex Bias in Graduate Admissions: Data From Berkeley”. Science 187 (4175): 398–404
14Pearl, Judea (December 2013). “Understanding Simpson’s paradox”. UCLA Cognitive Systems Laboratory, Technical Report R-414.
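The reversal is easy to reproduce; the counts below are illustrative, in the spirit of (not equal to) the Berkeley data:

```python
# illustrative counts (NOT the real Berkeley numbers): within each
# department x=1 has the higher admit rate, yet the aggregate reverses
apps = {                       # (dept, x) -> (applicants, admits)
    ("easy", 0): (800, 400),   # p(a=1) = .50
    ("easy", 1): (200, 120),   # p(a=1) = .60
    ("hard", 0): (200,  50),   # p(a=1) = .25
    ("hard", 1): (800, 240),   # p(a=1) = .30
}

for x in (0, 1):
    n = sum(v[0] for k, v in apps.items() if k[1] == x)
    adm = sum(v[1] for k, v in apps.items() if k[1] == x)
    print(x, adm / n)          # x=0 -> 0.45 > x=1 -> 0.36: reversal
```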
![Page 230: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/230.jpg)
confounded approach: quasi-experiments + instruments 17
Q: does engagement drive retention? (NYT, NFLX, . . . )
![Page 231: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/231.jpg)
confounded approach: quasi-experiments + instruments 17
Q: does engagement drive retention? (NYT, NFLX, . . . )
we don’t directly control engagement
![Page 232: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/232.jpg)
confounded approach: quasi-experiments + instruments 17
Q: does engagement drive retention? (NYT, NFLX, . . . )
we don’t directly control engagement
nonetheless useful since many things can influence it
![Page 233: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/233.jpg)
confounded approach: quasi-experiments + instruments 17
Q: does engagement drive retention? (NYT, NFLX, . . . )
we don’t directly control engagement
nonetheless useful since many things can influence it
Q: does serving in the Vietnam war decrease earnings15?
15Angrist, Joshua D. “Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records.” The American Economic Review (1990): 313-336.
![Page 234: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/234.jpg)
confounded approach: quasi-experiments + instruments 17
Q: does engagement drive retention? (NYT, NFLX, . . . )
we don’t directly control engagement
nonetheless useful since many things can influence it
Q: does serving in the Vietnam war decrease earnings15?
US didn’t directly control serving in Vietnam, either16
15Angrist, Joshua D. “Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records.” The American Economic Review (1990): 313-336.
16cf., George Bush, Donald Trump, Bill Clinton, Dick Cheney...
![Page 235: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/235.jpg)
confounded approach: quasi-experiments + instruments 17
Q: does engagement drive retention? (NYT, NFLX, . . . )
we don’t directly control engagement
nonetheless useful since many things can influence it
Q: does serving in the Vietnam war decrease earnings15?
US didn’t directly control serving in Vietnam, either16
requires strong assumptions, including a linear model
15Angrist, Joshua D. “Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records.” The American Economic Review (1990): 313-336.
16cf., George Bush, Donald Trump, Bill Clinton, Dick Cheney...
17I thank Sinan Aral, MIT Sloan, for bringing this to my attention
![Page 236: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/236.jpg)
IV: graphical model assumption
Figure 12: independence assumption
![Page 237: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/237.jpg)
IV: graphical model assumption (sideways)
Figure 13: independence assumption
![Page 238: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/238.jpg)
IV: review s/OLS/MOM/ (E is empirical average)
a endogenous
![Page 239: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/239.jpg)
IV: review s/OLS/MOM/ (E is empirical average)
a endogenous
e.g., ∃u s.t. p(y|a,x,u), p(a|x,u)
![Page 240: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/240.jpg)
IV: review s/OLS/MOM/ (E is empirical average)
a endogenous
e.g., ∃u s.t. p(y|a,x,u), p(a|x,u)
linear ansatz: y = βᵀa + ε
![Page 241: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/241.jpg)
IV: review s/OLS/MOM/ (E is empirical average)
a endogenous
e.g., ∃u s.t. p(y|a,x,u), p(a|x,u)
linear ansatz: y = βᵀa + ε
if a exogenous (e.g., OLS), use E[Y A_j] = E[βᵀA A_j] + E[ε A_j]
(note that E[A_j A_k] gives a square matrix; invert for β)
![Page 242: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/242.jpg)
IV: review s/OLS/MOM/ (E is empirical average)
a endogenous
e.g., ∃u s.t. p(y|a,x,u), p(a|x,u)
linear ansatz: y = βᵀa + ε
if a exogenous (e.g., OLS), use E[Y A_j] = E[βᵀA A_j] + E[ε A_j]
(note that E[A_j A_k] gives a square matrix; invert for β)
add instrument x uncorrelated with ε
![Page 243: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/243.jpg)
IV: review s/OLS/MOM/ (E is empirical average)
a endogenous
e.g., ∃u s.t. p(y|a,x,u), p(a|x,u)
linear ansatz: y = βᵀa + ε
if a exogenous (e.g., OLS), use E[Y A_j] = E[βᵀA A_j] + E[ε A_j]
(note that E[A_j A_k] gives a square matrix; invert for β)
add instrument x uncorrelated with ε
E[Y X_k] = E[βᵀA X_k] + E[ε] E[X_k]
![Page 244: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/244.jpg)
IV: review s/OLS/MOM/ (E is empirical average)
a endogenous
e.g., ∃u s.t. p(y|a,x,u), p(a|x,u)
linear ansatz: y = βᵀa + ε
if a exogenous (e.g., OLS), use E[Y A_j] = E[βᵀA A_j] + E[ε A_j]
(note that E[A_j A_k] gives a square matrix; invert for β)
add instrument x uncorrelated with ε
E[Y X_k] = E[βᵀA X_k] + E[ε] E[X_k]
E[Y] = E[βᵀA] + E[ε] (from ansatz)
![Page 245: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/245.jpg)
IV: review s/OLS/MOM/ (E is empirical average)
a endogenous
e.g., ∃u s.t. p(y|a,x,u), p(a|x,u)
linear ansatz: y = βᵀa + ε
if a exogenous (e.g., OLS), use E[Y A_j] = E[βᵀA A_j] + E[ε A_j]
(note that E[A_j A_k] gives a square matrix; invert for β)
add instrument x uncorrelated with ε
E[Y X_k] = E[βᵀA X_k] + E[ε] E[X_k]
E[Y] = E[βᵀA] + E[ε] (from ansatz)
C(Y, X_k) = βᵀ C(A, X_k): not an “inversion” problem, requires “two-stage regression”
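A numpy sketch of the moment conditions (simulation illustrative): the OLS moment is biased here because E[εA] ≠ 0, while the instrument moment C(Y, X) = βᵀC(A, X) recovers β; everything is scalar, so no two-stage machinery is needed.

```python
import numpy as np

rng = np.random.default_rng(4)

N = 200_000
beta = 2.0
u = rng.normal(size=N)                        # unobserved confounder
x = rng.normal(size=N)                        # instrument: independent of u
a = 0.7 * x + u + 0.3 * rng.normal(size=N)    # endogenous treatment
y = beta * a + u + 0.3 * rng.normal(size=N)   # eps = u + noise

# OLS moment E[YA] = beta E[AA] + E[eps A]: biased, E[eps A] != 0
beta_ols = np.cov(y, a)[0, 1] / np.var(a)

# IV moment C(Y, X) = beta C(A, X): x uncorrelated with eps
beta_iv = np.cov(y, x)[0, 1] / np.cov(a, x)[0, 1]

print(beta_ols)   # > 2.0 (upward-biased)
print(beta_iv)    # ~2.0
```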
![Page 246: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/246.jpg)
IV: binary, binary case (aka “Wald estimator”)
y = βa + ε
![Page 247: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/247.jpg)
IV: binary, binary case (aka “Wald estimator”)
y = βa + ε
E(Y|x) = βE(A|x) + E(ε); evaluate at x = 0, 1
![Page 248: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/248.jpg)
IV: binary, binary case (aka “Wald estimator”)
y = βa + ε
E(Y|x) = βE(A|x) + E(ε); evaluate at x = 0, 1
β = (E(Y|x=1) − E(Y|x=0)) / (E(A|x=1) − E(A|x=0)).
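In the binary/binary case the estimate collapses to that ratio; a draft-lottery-flavored sketch (numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

N = 200_000
beta = -1.0                                   # true effect of a on y
u = rng.normal(size=N)                        # unobserved confounder
x = rng.integers(0, 2, size=N)                # binary instrument (lottery)
# binary treatment: instrument shifts take-up; u also matters (endogeneity)
a = (0.2 + 0.4 * x + 0.2 * (u > 0) > rng.uniform(size=N)).astype(int)
y = beta * a + u + 0.5 * rng.normal(size=N)

num = y[x == 1].mean() - y[x == 0].mean()     # E(Y|x=1) - E(Y|x=0)
den = a[x == 1].mean() - a[x == 0].mean()     # E(A|x=1) - E(A|x=0)
print(num / den)                              # ~ -1.0
```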
![Page 249: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/249.jpg)
bandits: obligatory slide
Figure 14: almost all the talks I’ve gone to on bandits have this image
![Page 250: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/250.jpg)
bandits
wide applicability: humane clinical trials, targeting, . . .
![Page 251: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/251.jpg)
bandits
wide applicability: humane clinical trials, targeting, ...
replace meetings with code
![Page 252: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/252.jpg)
bandits
wide applicability: humane clinical trials, targeting, ...
replace meetings with code
requires software engineering to replace decisions with, e.g., Javascript
![Page 253: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/253.jpg)
bandits
wide applicability: humane clinical trials, targeting, ...
replace meetings with code
requires software engineering to replace decisions with, e.g., Javascript
most useful if decisions or items get “stale” quickly
![Page 254: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/254.jpg)
bandits
wide applicability: humane clinical trials, targeting, ...
replace meetings with code
requires software engineering to replace decisions with, e.g., Javascript
most useful if decisions or items get “stale” quickly
less useful for one-off, major decisions to be “interpreted”
![Page 255: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/255.jpg)
bandits
wide applicability: humane clinical trials, targeting, ...
replace meetings with code
requires software engineering to replace decisions with, e.g., Javascript
most useful if decisions or items get “stale” quickly
less useful for one-off, major decisions to be “interpreted”
![Page 256: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/256.jpg)
bandits
wide applicability: humane clinical trials, targeting, ...
replace meetings with code
requires software engineering to replace decisions with, e.g., Javascript
most useful if decisions or items get “stale” quickly
less useful for one-off, major decisions to be “interpreted”
examples
ε-greedy (no context, aka ‘vanilla’, aka ‘context-free’)
![Page 257: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/257.jpg)
bandits
wide applicability: humane clinical trials, targeting, ...
replace meetings with code
requires software engineering to replace decisions with, e.g., Javascript
most useful if decisions or items get “stale” quickly
less useful for one-off, major decisions to be “interpreted”
examples
ε-greedy (no context, aka ‘vanilla’, aka ‘context-free’)
UCB1 (2002) (no context) + LinUCB (with context)
![Page 258: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/258.jpg)
bandits
wide applicability: humane clinical trials, targeting, ...
replace meetings with code
requires software engineering to replace decisions with, e.g., Javascript
most useful if decisions or items get “stale” quickly
less useful for one-off, major decisions to be “interpreted”
examples
ε-greedy (no context, aka ‘vanilla’, aka ‘context-free’)
UCB1 (2002) (no context) + LinUCB (with context)
Thompson Sampling (1933)18,19,20 (general, with or without context)
18Thompson, William R. “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples”. Biometrika, 25(3–4):285–294, 1933.
19AKA “probability matching”, “posterior sampling”
20cf., “Bayesian Bandit Explorer” (link)
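For the context-free case, ε-greedy is only a few lines; a sketch with illustrative Bernoulli arms:

```python
import numpy as np

rng = np.random.default_rng(6)

true_means = [0.3, 0.5, 0.7]         # illustrative Bernoulli arms
eps, T = 0.1, 10_000
S = np.zeros(3)                      # successes per arm
F = np.zeros(3)                      # failures per arm

for t in range(T):
    if rng.uniform() < eps:          # explore with probability eps
        arm = int(rng.integers(0, 3))
    else:                            # otherwise exploit empirical best
        arm = int(np.argmax(S / np.maximum(S + F, 1)))
    reward = rng.uniform() < true_means[arm]
    S[arm] += reward
    F[arm] += 1 - reward

print(S + F)          # pull counts: mostly the best arm
print(S / (S + F))    # empirical means
```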
![Page 259: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/259.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
![Page 260: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/260.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
![Page 261: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/261.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
![Page 262: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/262.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
![Page 263: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/263.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
(NB: ours was deterministic: p(a|x) = 1[a = h(x)])
![Page 264: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/264.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
(NB: ours was deterministic: p(a|x) = 1[a = h(x)])
prior p(x): either avoid by importance sampling or estimate via kernel methods
![Page 265: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/265.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
(NB: ours was deterministic: p(a|x) = 1[a = h(x)])
prior p(x): either avoid by importance sampling or estimate via kernel methods
In the economics approach we focus on
![Page 266: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/266.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
(NB: ours was deterministic: p(a|x) = 1[a = h(x)])
prior p(x): either avoid by importance sampling or estimate via kernel methods
In the economics approach we focus on
τ(...) ≡ Q(a=1,...) − Q(a=0,...) “treatment effect”
![Page 267: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/267.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
(NB: ours was deterministic: p(a|x) = 1[a = h(x)])
prior p(x): either avoid by importance sampling or estimate via kernel methods
In the economics approach we focus on
τ(...) ≡ Q(a=1,...) − Q(a=0,...) “treatment effect”
where Q(a,...) = ∑_y y p(y|...)
![Page 268: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/268.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
(NB: ours was deterministic: p(a|x) = 1[a = h(x)])
prior p(x): either avoid by importance sampling or estimate via kernel methods
In the economics approach we focus on
τ(...) ≡ Q(a=1,...) − Q(a=0,...) “treatment effect”
where Q(a,...) = ∑_y y p(y|...)
![Page 269: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/269.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
(NB: ours was deterministic: p(a|x) = 1[a = h(x)])
prior p(x): either avoid by importance sampling or estimate via kernel methods
In the economics approach we focus on
τ(...) ≡ Q(a=1,...) − Q(a=0,...) “treatment effect”
where Q(a,...) = ∑_y y p(y|...)
In Thompson sampling we will generate 1 datum at a time, by asserting a parameterized generative model for p(y|a,x,θ)
![Page 270: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/270.jpg)
TS: connecting w/“generative causal modeling” 0
WAS p(y,x,a) = p(y|x,a) p_α(a|x) p(x)
These 3 terms were treated by
response p(y|a,x): avoid regression/inferring using importance sampling
policy p_α(a|x): optimize ours, infer theirs
(NB: ours was deterministic: p(a|x) = 1[a = h(x)])
prior p(x): either avoid by importance sampling or estimate via kernel methods
In the economics approach we focus on
τ(...) ≡ Q(a=1,...) − Q(a=0,...) “treatment effect”
where Q(a,...) = ∑_y y p(y|...)
In Thompson sampling we will generate 1 datum at a time, by asserting a parameterized generative model for p(y|a,x,θ)
using a deterministic but averaged policy
![Page 271: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/271.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
![Page 272: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/272.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
21Note that θ is a vector, with components for each action.
![Page 273: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/273.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
21Note that θ is a vector, with components for each action.
![Page 274: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/274.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
21Note that θ is a vector, with components for each action.
![Page 275: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/275.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
then choose h(x;θ) = argmax_a Q(a,x,θ)
21Note that θ is a vector, with components for each action.
![Page 276: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/276.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
then choose h(x;θ) = argmax_a Q(a,x,θ)
inducing policy p(a|x,θ) = 1[a = h(x;θ) = argmax_a Q(a,x,θ)]
21Note that θ is a vector, with components for each action.
![Page 277: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/277.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
then choose h(x;θ) = argmax_a Q(a,x,θ)
inducing policy p(a|x,θ) = 1[a = h(x;θ) = argmax_a Q(a,x,θ)]
idea: use prior data D = {y, a, x}^t_1 to define a non-deterministic policy:
21Note that θ is a vector, with components for each action.
![Page 278: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/278.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
then choose h(x;θ) = argmax_a Q(a,x,θ)
inducing policy p(a|x,θ) = 1[a = h(x;θ) = argmax_a Q(a,x,θ)]
idea: use prior data D = {y, a, x}^t_1 to define a non-deterministic policy:
p(a|x) = ∫ dθ p(a|x,θ) p(θ|D)
21Note that θ is a vector, with components for each action.
![Page 279: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/279.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
then choose h(x;θ) = argmax_a Q(a,x,θ)
inducing policy p(a|x,θ) = 1[a = h(x;θ) = argmax_a Q(a,x,θ)]
idea: use prior data D = {y, a, x}^t_1 to define a non-deterministic policy:
p(a|x) = ∫ dθ p(a|x,θ) p(θ|D)
p(a|x) = ∫ dθ 1[a = argmax_{a′} Q(a′,x,θ)] p(θ|D)
21Note that θ is a vector, with components for each action.
![Page 280: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/280.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
then choose h(x;θ) = argmax_a Q(a,x,θ)
inducing policy p(a|x,θ) = 1[a = h(x;θ) = argmax_a Q(a,x,θ)]
idea: use prior data D = {y, a, x}^t_1 to define a non-deterministic policy:
p(a|x) = ∫ dθ p(a|x,θ) p(θ|D)
p(a|x) = ∫ dθ 1[a = argmax_{a′} Q(a′,x,θ)] p(θ|D)
hold up:
21Note that θ is a vector, with components for each action.
![Page 281: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/281.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
then choose h(x;θ) = argmax_a Q(a,x,θ)
inducing policy p(a|x,θ) = 1[a = h(x;θ) = argmax_a Q(a,x,θ)]
idea: use prior data D = {y, a, x}^t_1 to define a non-deterministic policy:
p(a|x) = ∫ dθ p(a|x,θ) p(θ|D)
p(a|x) = ∫ dθ 1[a = argmax_{a′} Q(a′,x,θ)] p(θ|D)
hold up:
Q1: what’s p(θ|D)?
21Note that θ is a vector, with components for each action.
![Page 282: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/282.jpg)
TS: connecting w/“generative causal modeling” 1
model true world response function p(y|a,x) parametrically as p(y|a,x,θ∗)
(i.e., θ∗ is the true value of the parameter)21
if you knew θ:
could compute Q(a,x,θ) ≡ ∑_y y p(y|x,a,θ∗) directly
then choose h(x;θ) = argmax_a Q(a,x,θ)
inducing policy p(a|x,θ) = 1[a = h(x;θ) = argmax_a Q(a,x,θ)]
idea: use prior data D = {y, a, x}^t_1 to define a non-deterministic policy:
p(a|x) = ∫ dθ p(a|x,θ) p(θ|D)
p(a|x) = ∫ dθ 1[a = argmax_{a′} Q(a′,x,θ)] p(θ|D)
hold up:
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
21Note that θ is a vector, with components for each action.
![Page 283: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/283.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
![Page 284: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/284.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
![Page 285: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/285.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
![Page 286: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/286.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
![Page 287: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/287.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α)
![Page 288: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/288.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α) ∝ p(y^t_1|a^t_1, x^t_1, θ) p(θ|α)
![Page 289: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/289.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α) ∝ p(y^t_1|a^t_1, x^t_1, θ) p(θ|α) = p(θ|α) Π_t p(y_t|a_t, x_t, θ)
![Page 290: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/290.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α) ∝ p(y^t_1|a^t_1, x^t_1, θ) p(θ|α) = p(θ|α) Π_t p(y_t|a_t, x_t, θ)
warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here
![Page 291: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/291.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α) ∝ p(y^t_1|a^t_1, x^t_1, θ) p(θ|α) = p(θ|α) Π_t p(y_t|a_t, x_t, θ)
warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here
warning 2: don’t need historical record of θ_t.
![Page 292: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/292.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α) ∝ p(y^t_1|a^t_1, x^t_1, θ) p(θ|α) = p(θ|α) Π_t p(y_t|a_t, x_t, θ)
warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here
warning 2: don’t need historical record of θ_t.
(we used Bayes rule, but only in θ and y.)
![Page 293: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/293.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α) ∝ p(y^t_1|a^t_1, x^t_1, θ) p(θ|α) = p(θ|α) Π_t p(y_t|a_t, x_t, θ)
warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here
warning 2: don’t need historical record of θ_t.
(we used Bayes rule, but only in θ and y.)
A2: evaluate integral by N = 1 Monte Carlo22
22AFAIK it is an open research area what happens when you replace N = 1 with N = N₀t^{−ν}.
![Page 294: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/294.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α) ∝ p(y^t_1|a^t_1, x^t_1, θ) p(θ|α) = p(θ|α) Π_t p(y_t|a_t, x_t, θ)
warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here
warning 2: don’t need historical record of θ_t.
(we used Bayes rule, but only in θ and y.)
A2: evaluate integral by N = 1 Monte Carlo22
take 1 sample “θ_t” of θ from p(θ|D)
22AFAIK it is an open research area what happens when you replace N = 1 with N = N₀t^{−ν}.
![Page 295: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/295.jpg)
TS: connecting w/“generative causal modeling” 2
Q1: what’s p(θ|D)?
Q2: how am I going to evaluate this integral?
A1: p(θ|D) definable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a,x,θ).
(now you’re not only generative, you’re Bayesian.)
p(θ|D) = p(θ|y^t_1, a^t_1, x^t_1, α) ∝ p(y^t_1|a^t_1, x^t_1, θ) p(θ|α) = p(θ|α) Π_t p(y_t|a_t, x_t, θ)
warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here
warning 2: don’t need historical record of θ_t.
(we used Bayes rule, but only in θ and y.)
A2: evaluate integral by N = 1 Monte Carlo22
take 1 sample “θ_t” of θ from p(θ|D)
a_t = h(x_t; θ_t) = argmax_a Q(a, x_t, θ_t)
22AFAIK it is an open research area what happens when you replace N = 1 with N = N₀t^{−ν}.
![Page 296: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/296.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
![Page 297: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/297.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
![Page 298: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/298.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
Bernoulli (coin flipping), keep track of
![Page 299: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/299.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
Bernoulli (coin flipping), keep track of
S_a ≡ number of successes flipping coin a
![Page 300: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/300.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
Bernoulli (coin flipping), keep track of
S_a ≡ number of successes flipping coin a
F_a ≡ number of failures flipping coin a
![Page 301: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/301.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
Bernoulli (coin flipping), keep track of
S_a ≡ number of successes flipping coin a
F_a ≡ number of failures flipping coin a
![Page 302: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/302.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
Bernoulli (coin flipping), keep track of
S_a ≡ number of successes flipping coin a
F_a ≡ number of failures flipping coin a
Then
p(θ|D) ∝ p(θ|α) Π_t p(y_t|a_t, θ)
![Page 303: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/303.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
Bernoulli (coin flipping), keep track of
S_a ≡ number of successes flipping coin a
F_a ≡ number of failures flipping coin a
Then
p(θ|D) ∝ p(θ|α) Π_t p(y_t|a_t, θ)
= (Π_a θ_a^{α−1} (1−θ_a)^{β−1}) (Π_t θ_{a_t}^{y_t} (1−θ_{a_t})^{1−y_t})
![Page 304: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/304.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
Bernoulli (coin flipping), keep track of
S_a ≡ number of successes flipping coin a
F_a ≡ number of failures flipping coin a
Then
p(θ|D) ∝ p(θ|α) Π_t p(y_t|a_t, θ)
= (Π_a θ_a^{α−1} (1−θ_a)^{β−1}) (Π_t θ_{a_t}^{y_t} (1−θ_{a_t})^{1−y_t})
= Π_a θ_a^{α+S_a−1} (1−θ_a)^{β+F_a−1}
![Page 305: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/305.jpg)
That sounds hard.
No, just general. Let’s do toy case:
y ∈ {0, 1},
no context x,
Bernoulli (coin flipping), keep track of
S_a ≡ number of successes flipping coin a
F_a ≡ number of failures flipping coin a
Then
p(θ|D) ∝ p(θ|α) Π_t p(y_t|a_t, θ)
= (Π_a θ_a^{α−1} (1−θ_a)^{β−1}) (Π_t θ_{a_t}^{y_t} (1−θ_{a_t})^{1−y_t})
= Π_a θ_a^{α+S_a−1} (1−θ_a)^{β+F_a−1}
∴ θ_a ∼ Beta(α + S_a, β + F_a)
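That conjugate posterior makes the whole Thompson sampling loop a few lines of numpy; a sketch with illustrative arm means, using exactly the sample-then-argmax (N = 1 Monte Carlo) step:

```python
import numpy as np

rng = np.random.default_rng(7)

true_means = [0.3, 0.5, 0.7]         # illustrative Bernoulli arms
alpha, beta = 1.0, 1.0               # Beta prior hyperparameters
S = np.zeros(3)                      # S_a: successes per arm
F = np.zeros(3)                      # F_a: failures per arm

for t in range(10_000):
    # N=1 Monte Carlo: one sample theta_t from p(theta|D)
    theta = rng.beta(alpha + S, beta + F)
    a = int(np.argmax(theta))        # a_t = argmax_a Q(a, theta_t)
    y = float(rng.uniform() < true_means[a])
    S[a] += y
    F[a] += 1 - y

print(S + F)                                  # pulls concentrate on arm 2
print((alpha + S) / (alpha + beta + S + F))   # posterior means
```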
![Page 306: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/306.jpg)
Thompson sampling: results (2011)
Figure 15: Chapelle and Li 2011
![Page 307: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/307.jpg)
TS: words
Figure 16: from Chapelle and Li 2011
![Page 308: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/308.jpg)
TS: p-code
Figure 17: from Chapelle and Li 2011
![Page 309: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/309.jpg)
TS: Bernoulli bandit p-code23
Figure 18: from Chapelle and Li 2011
![Page 310: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/310.jpg)
TS: Bernoulli bandit p-code (results)
Figure 19: from Chapelle and Li 2011
![Page 311: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/311.jpg)
UCB1 (2002), p-code
Figure 20: UCB1
from Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem.” Machine Learning 47.2-3 (2002): 235-256.
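A sketch of UCB1 in the same toy Bernoulli setting (arm means illustrative); the bonus is the √(2 ln t / n_a) term from Auer et al.:

```python
import numpy as np

rng = np.random.default_rng(8)

true_means = [0.3, 0.5, 0.7]           # illustrative Bernoulli arms
K, T = 3, 10_000
n = np.zeros(K)                        # pulls per arm
s = np.zeros(K)                        # total reward per arm

for a in range(K):                     # initialization: play each arm once
    s[a] += rng.uniform() < true_means[a]
    n[a] += 1

for t in range(K, T):
    ucb = s / n + np.sqrt(2 * np.log(t) / n)   # mean + exploration bonus
    a = int(np.argmax(ucb))
    s[a] += rng.uniform() < true_means[a]
    n[a] += 1

print(n)        # pulls concentrate on the best arm
print(s / n)    # empirical means
```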
![Page 312: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/312.jpg)
TS: with context
Figure 21: from Chapelle and Li 2011
![Page 313: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/313.jpg)
LinUCB: UCB with context
Figure 22: LinUCB
![Page 314: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/314.jpg)
TS: with context (results)
Figure 23: from Chapelle and Li 2011
![Page 315: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/315.jpg)
Bandits: Regret via Lai and Robbins (1985)
Figure 24: Lai Robbins
![Page 316: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/316.jpg)
Thompson sampling (1933) and optimality (2013)
Figure 25: TS result
from S. Agrawal, N. Goyal, “Further optimal regret bounds for Thompson Sampling”, AISTATS 2013; see also Agrawal, Shipra, and Navin Goyal. “Analysis of Thompson Sampling for the Multi-armed Bandit Problem.” COLT. 2012 and Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213. Springer, 2012.
![Page 317: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/317.jpg)
other ‘Causalities’: structure learning
Figure 26: from Heckerman 1995
D. Heckerman. A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research, March, 1995.
![Page 318: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/318.jpg)
other ‘Causalities’: potential outcomes
model distribution of p(y_i(1), y_i(0), a_i, x_i)
![Page 319: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/319.jpg)
other ‘Causalities’: potential outcomes
model distribution of p(y_i(1), y_i(0), a_i, x_i)
“action” replaced by “observed outcome”
![Page 320: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/320.jpg)
other ‘Causalities’: potential outcomes
model distribution of p(y_i(1), y_i(0), a_i, x_i)
“action” replaced by “observed outcome”
aka Neyman-Rubin causal model: Neyman (’23); Rubin (’74)
![Page 321: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/321.jpg)
other ‘Causalities’: potential outcomes
model distribution of p(y_i(1), y_i(0), a_i, x_i)
“action” replaced by “observed outcome”
aka Neyman-Rubin causal model: Neyman (’23); Rubin (’74)
see Morgan + Winship24 for connections between frameworks
24Morgan, Stephen L., and Christopher Winship. Counterfactuals and Causal Inference. Cambridge University Press, 2014.
![Page 322: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/322.jpg)
Lecture 4: descriptive modeling @ NYT
![Page 323: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/323.jpg)
review: (latent) inference and clustering
what does kmeans mean?
![Page 324: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/324.jpg)
review: (latent) inference and clustering
what does kmeans mean?
given xi ∈ RD
![Page 325: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/325.jpg)
review: (latent) inference and clustering
what does kmeans mean?
given xi ∈ RD
given d : RD → R1
![Page 326: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/326.jpg)
review: (latent) inference and clustering
what does kmeans mean?
given xi ∈ RD
given d : RD → R1
assign zi
![Page 327: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/327.jpg)
review: (latent) inference and clustering
what does kmeans mean?
given xi ∈ RD
given d : RD → R1
assign zi
generative modeling gives meaning
![Page 328: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/328.jpg)
review: (latent) inference and clustering
what does kmeans mean?
given xi ∈ RD
given d : RD → R1
assign zi
generative modeling gives meaning
given p(x|z, θ)
![Page 329: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/329.jpg)
review: (latent) inference and clustering
what does kmeans mean?
given xi ∈ RD
given d : RD → R1
assign zi
generative modeling gives meaning
given p(x|z, θ)
maximize p(x|θ)
![Page 330: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/330.jpg)
review: (latent) inference and clustering
what does kmeans mean?
given xi ∈ RD
given d : RD → R1
assign zi
generative modeling gives meaning
given p(x|z, θ)
maximize p(x|θ)
output assignment p(z|x, θ)
![Page 331: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/331.jpg)
actual math
define P ≡ p(x, z|θ)
![Page 332: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/332.jpg)
actual math
define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log ∑_z P = log E_q[P/q] (cf. importance sampling)
![Page 333: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/333.jpg)
actual math
define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log ∑_z P = log E_q[P/q] (cf. importance sampling)
Jensen’s: L ≥ 𝓛 ≡ E_q log(P/q) = E_q log P + H[q] = −(U − H) = −F
![Page 334: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/334.jpg)
actual math
define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log ∑_z P = log E_q[P/q] (cf. importance sampling)
Jensen’s: L ≥ 𝓛 ≡ E_q log(P/q) = E_q log P + H[q] = −(U − H) = −F
analogy to free energy in physics
![Page 335: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/335.jpg)
actual math
define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log ∑_z P = log E_q[P/q] (cf. importance sampling)
Jensen’s: L ≥ 𝓛 ≡ E_q log(P/q) = E_q log P + H[q] = −(U − H) = −F
analogy to free energy in physics
alternate optimization on θ and on q
![Page 336: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/336.jpg)
actual math
define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log ∑_z P = log E_q[P/q] (cf. importance sampling)
Jensen’s: L ≥ 𝓛 ≡ E_q log(P/q) = E_q log P + H[q] = −(U − H) = −F
analogy to free energy in physics
alternate optimization on θ and on q
NB: q step gives q(z) = p(z|x, θ)
![Page 337: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/337.jpg)
actual math
define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log ∑_z P = log E_q[P/q] (cf. importance sampling)
Jensen’s: L ≥ 𝓛 ≡ E_q log(P/q) = E_q log P + H[q] = −(U − H) = −F
analogy to free energy in physics
alternate optimization on θ and on q
NB: q step gives q(z) = p(z|x, θ)
NB: log P convenient for independent examples w/ exponential families
![Page 338: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/338.jpg)
actual math
define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log ∑_z P = log E_q[P/q] (cf. importance sampling)
Jensen’s: L ≥ 𝓛 ≡ E_q log(P/q) = E_q log P + H[q] = −(U − H) = −F
analogy to free energy in physics
alternate optimization on θ and on q
NB: q step gives q(z) = p(z|x, θ)
NB: log P convenient for independent examples w/ exponential families
e.g., GMMs: μ_k ← E[x|z] and σ²_k ← E[(x − μ)²|z] are sufficient statistics
![Page 339: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/339.jpg)
actual math
define P ≡ p(x, z|θ)
log-likelihood L ≡ log p(x|θ) = log ∑_z P = log E_q[P/q] (cf. importance sampling)
Jensen’s: L ≥ 𝓛 ≡ E_q log(P/q) = E_q log P + H[q] = −(U − H) = −F
analogy to free energy in physics
alternate optimization on θ and on q
NB: q step gives q(z) = p(z|x, θ)
NB: log P convenient for independent examples w/ exponential families
e.g., GMMs: μ_k ← E[x|z] and σ²_k ← E[(x − μ)²|z] are sufficient statistics
e.g., LDAs: word counts are sufficient statistics
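As a concrete instance of the alternating optimization above, a minimal EM sketch for a 1-D Gaussian mixture, under the slide's λ ≡ 1/σ² convention; function and variable names are mine, not from the talk:

```python
import numpy as np

def em_gmm(x, K, iters=100):
    """Minimal EM for a 1-D Gaussian mixture; lam is the precision 1/sigma^2."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=K, replace=False)  # init means from the data
    lam = np.ones(K)
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E step (the q step): responsibilities r_ik = p(z_i = k | x_i, theta)
        log_r = np.log(pi) + 0.5 * np.log(lam) - 0.5 * lam * (x[:, None] - mu) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)  # for numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M step (the theta step): driven by the sufficient statistics
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk                # mu_k <- E[x|k]
        lam = Nk / (r * (x[:, None] - mu) ** 2).sum(axis=0)   # 1/lam_k <- E[(x-mu)^2|k]
        pi = Nk / len(x)
    return pi, mu, lam
```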
![Page 340: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/340.jpg)
tangent: more math on GMMs, part 1
Energy U (to be minimized):
−U ≡ E_q log P = ∑_z ∑_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
![Page 341: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/341.jpg)
tangent: more math on GMMs, part 1
Energy U (to be minimized):
−U ≡ E_q log P = ∑_z ∑_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
−U_x ≡ ∑_z ∑_i q_i(z) log p(x_i|z_i)
![Page 342: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/342.jpg)
tangent: more math on GMMs, part 1
Energy U (to be minimized):
−U ≡ E_q log P = ∑_z ∑_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
−U_x ≡ ∑_z ∑_i q_i(z) log p(x_i|z_i) = ∑_i ∑_z q_i(z) ∑_k 1[z_i = k] log p(x_i|z_i)
![Page 343: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/343.jpg)
tangent: more math on GMMs, part 1
Energy U (to be minimized):
−U ≡ E_q log P = ∑_z ∑_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
−U_x ≡ ∑_z ∑_i q_i(z) log p(x_i|z_i) = ∑_i ∑_z q_i(z) ∑_k 1[z_i = k] log p(x_i|z_i)
define r_ik ≡ ∑_z q_i(z) 1[z_i = k]
![Page 344: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/344.jpg)
tangent: more math on GMMs, part 1
Energy U (to be minimized):
−U ≡ E_q log P = ∑_z ∑_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
−U_x ≡ ∑_z ∑_i q_i(z) log p(x_i|z_i) = ∑_i ∑_z q_i(z) ∑_k 1[z_i = k] log p(x_i|z_i)
define r_ik ≡ ∑_z q_i(z) 1[z_i = k]
−U_x = ∑_i r_ik log p(x_i|k)
![Page 345: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/345.jpg)
tangent: more math on GMMs, part 1
Energy U (to be minimized):
−U ≡ E_q log P = ∑_z ∑_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
−U_x ≡ ∑_z ∑_i q_i(z) log p(x_i|z_i) = ∑_i ∑_z q_i(z) ∑_k 1[z_i = k] log p(x_i|z_i)
define r_ik ≡ ∑_z q_i(z) 1[z_i = k]
−U_x = ∑_i r_ik log p(x_i|k)
Gaussian²⁵ ⇒ −U_x = ∑_i r_ik ( −½(x_i − μ_k)² λ_k + ½ ln λ_k − ½ ln 2π )
²⁵math is simpler if you work with λ_k ≡ σ⁻²
![Page 346: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/346.jpg)
tangent: more math on GMMs, part 1
Energy U (to be minimized):
−U ≡ E_q log P = ∑_z ∑_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
−U_x ≡ ∑_z ∑_i q_i(z) log p(x_i|z_i) = ∑_i ∑_z q_i(z) ∑_k 1[z_i = k] log p(x_i|z_i)
define r_ik ≡ ∑_z q_i(z) 1[z_i = k]
−U_x = ∑_i r_ik log p(x_i|k)
Gaussian²⁵ ⇒ −U_x = ∑_i r_ik ( −½(x_i − μ_k)² λ_k + ½ ln λ_k − ½ ln 2π )
²⁵math is simpler if you work with λ_k ≡ σ⁻²
![Page 347: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/347.jpg)
tangent: more math on GMMs, part 1
Energy U (to be minimized):
−U ≡ E_q log P = ∑_z ∑_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
−U_x ≡ ∑_z ∑_i q_i(z) log p(x_i|z_i) = ∑_i ∑_z q_i(z) ∑_k 1[z_i = k] log p(x_i|z_i)
define r_ik ≡ ∑_z q_i(z) 1[z_i = k]
−U_x = ∑_i r_ik log p(x_i|k)
Gaussian²⁵ ⇒ −U_x = ∑_i r_ik ( −½(x_i − μ_k)² λ_k + ½ ln λ_k − ½ ln 2π )
simple to minimize for parameters ϑ = {μ_k, λ_k}
²⁵math is simpler if you work with λ_k ≡ σ⁻²
![Page 348: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/348.jpg)
tangent: more math on GMMs, part 2
−U_x = ∑_i r_ik ( −½(x_i − μ_k)² λ_k + ½ ln λ_k − ½ ln 2π )
![Page 349: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/349.jpg)
tangent: more math on GMMs, part 2
−U_x = ∑_i r_ik ( −½(x_i − μ_k)² λ_k + ½ ln λ_k − ½ ln 2π )
μ_k ← E[x|k] solves μ_k ∑_i r_ik = ∑_i r_ik x_i
![Page 350: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/350.jpg)
tangent: more math on GMMs, part 2
−U_x = ∑_i r_ik ( −½(x_i − μ_k)² λ_k + ½ ln λ_k − ½ ln 2π )
μ_k ← E[x|k] solves μ_k ∑_i r_ik = ∑_i r_ik x_i
λ_k⁻¹ ← E[(x − μ)²|k] solves ∑_i r_ik ½(x_i − μ_k)² = ½ λ_k⁻¹ ∑_i r_ik
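A quick numeric sanity check of these two stationarity conditions, with made-up data x and responsibilities r for a single fixed component k:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=500)   # made-up data
r = rng.random(500)                 # made-up responsibilities r_ik, k fixed

mu_k = (r * x).sum() / r.sum()                  # weighted mean solves the mu equation
lam_k = r.sum() / (r * (x - mu_k) ** 2).sum()   # weighted second moment solves the lambda equation

assert np.isclose(mu_k * r.sum(), (r * x).sum())
assert np.isclose((r * 0.5 * (x - mu_k) ** 2).sum(), 0.5 * r.sum() / lam_k)
```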
![Page 351: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/351.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
![Page 352: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/352.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
![Page 353: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/353.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
e.g., the Gaussian case²⁶:
²⁶Choosing η(θ) = η is called ‘canonical form’
![Page 354: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/354.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
e.g., the Gaussian case²⁶:
T₁ = x,
²⁶Choosing η(θ) = η is called ‘canonical form’
![Page 355: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/355.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
e.g., the Gaussian case²⁶:
T₁ = x, T₂ = x²
²⁶Choosing η(θ) = η is called ‘canonical form’
![Page 356: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/356.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
e.g., the Gaussian case²⁶:
T₁ = x, T₂ = x²
η₁ = μ/σ² = μλ
²⁶Choosing η(θ) = η is called ‘canonical form’
![Page 357: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/357.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
e.g., the Gaussian case²⁶:
T₁ = x, T₂ = x²
η₁ = μ/σ² = μλ
η₂ = −λ/2 = −1/(2σ²)
²⁶Choosing η(θ) = η is called ‘canonical form’
![Page 358: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/358.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
e.g., the Gaussian case²⁶:
T₁ = x, T₂ = x²
η₁ = μ/σ² = μλ
η₂ = −λ/2 = −1/(2σ²)
A = λμ²/2 − ½ ln λ
²⁶Choosing η(θ) = η is called ‘canonical form’
![Page 359: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/359.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
e.g., the Gaussian case²⁶:
T₁ = x, T₂ = x²
η₁ = μ/σ² = μλ
η₂ = −λ/2 = −1/(2σ²)
A = λμ²/2 − ½ ln λ
exp(B(x)) = (2π)^(−1/2)
²⁶Choosing η(θ) = η is called ‘canonical form’
![Page 360: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/360.jpg)
tangent: Gaussians ∈ exponential family²⁷
as before, −U = ∑_i r_ik log p(x_i|k)
define p(x_i|k) = exp( η(θ)·T(x) − A(θ) + B(x) )
e.g., the Gaussian case²⁶:
T₁ = x, T₂ = x²
η₁ = μ/σ² = μλ
η₂ = −λ/2 = −1/(2σ²)
A = λμ²/2 − ½ ln λ
exp(B(x)) = (2π)^(−1/2)
note that in a mixture model, there are separate η (and thus A(η)) for each value of z
²⁶Choosing η(θ) = η is called ‘canonical form’
²⁷NB: Gaussians ∈ exponential family, but a GMM ∉ exponential family! (Thanks to Eszter Vértes for pointing out this error in an earlier title.)
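A small numeric check, for one arbitrary (μ, σ), that this natural-parameter form reproduces the Gaussian density; a sketch, with nothing here beyond the formulas on the slide:

```python
import numpy as np

mu, sigma = 1.3, 0.7
lam = 1.0 / sigma**2
eta = np.array([mu * lam, -0.5 * lam])   # (eta_1, eta_2) from the slide
A = lam * mu**2 / 2 - 0.5 * np.log(lam)  # log-partition A
B = -0.5 * np.log(2 * np.pi)             # exp(B(x)) = (2 pi)^(-1/2)

x = np.linspace(-3, 5, 7)
T = np.stack([x, x**2])                  # sufficient statistics T_1, T_2

exp_fam = np.exp(eta @ T - A + B)
gauss = np.exp(-0.5 * lam * (x - mu) ** 2) * np.sqrt(lam / (2 * np.pi))
assert np.allclose(exp_fam, gauss)
```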
![Page 361: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/361.jpg)
tangent: variational joy ∈ exponential family
as before, −U = ∑_i r_ik ( η_kᵀ T(x_i) − A(η_k) + B(x_i) )
![Page 362: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/362.jpg)
tangent: variational joy ∈ exponential family
as before, −U = ∑_i r_ik ( η_kᵀ T(x_i) − A(η_k) + B(x_i) )
η_k,α solves ∑_i r_ik T_k,α(x_i) = (∂A(η_k)/∂η_k,α) ∑_i r_ik (canonical)
![Page 363: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/363.jpg)
tangent: variational joy ∈ exponential family
as before, −U = ∑_i r_ik ( η_kᵀ T(x_i) − A(η_k) + B(x_i) )
η_k,α solves ∑_i r_ik T_k,α(x_i) = (∂A(η_k)/∂η_k,α) ∑_i r_ik (canonical)
∴ ∂A(η_k)/∂η_k,α ← E[T_k,α|k] (canonical)
![Page 364: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/364.jpg)
tangent: variational joy ∈ exponential family
as before, −U = ∑_i r_ik ( η_kᵀ T(x_i) − A(η_k) + B(x_i) )
η_k,α solves ∑_i r_ik T_k,α(x_i) = (∂A(η_k)/∂η_k,α) ∑_i r_ik (canonical)
∴ ∂A(η_k)/∂η_k,α ← E[T_k,α|k] (canonical)
nice connection w/ physics, esp. mean field theory²⁸
²⁸read MacKay, David JC. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003, to learn more. Actually you should read it regardless.
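A sketch verifying the moment-matching condition ∂A/∂η = E[T] for the canonical Gaussian, where A(η) = −η₁²/(4η₂) − ½ ln(−2η₂) follows from substituting the slide's η into A; the numerical-gradient check is my own illustration:

```python
import numpy as np

mu, sigma = 0.5, 2.0
lam = 1.0 / sigma**2
eta = np.array([mu * lam, -0.5 * lam])

def A(eta):
    # log-partition in natural parameters
    return -eta[0] ** 2 / (4 * eta[1]) - 0.5 * np.log(-2 * eta[1])

# central-difference gradient of A at eta
eps = 1e-6
grad = np.array([(A(eta + eps * np.eye(2)[i]) - A(eta - eps * np.eye(2)[i])) / (2 * eps)
                 for i in range(2)])

# E[T] = (E[x], E[x^2]) = (mu, mu^2 + sigma^2)
assert np.allclose(grad, [mu, mu**2 + sigma**2], atol=1e-4)
```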
![Page 365: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/365.jpg)
clustering and inference: GMM/k-means case study
generative model gives meaning and optimization
![Page 366: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/366.jpg)
clustering and inference: GMM/k-means case study
generative model gives meaning and optimization
large freedom to choose different optimization approaches
![Page 367: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/367.jpg)
clustering and inference: GMM/k-means case study
generative model gives meaning and optimization
large freedom to choose different optimization approaches
e.g., hard clustering limit
![Page 368: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/368.jpg)
clustering and inference: GMM/k-means case study
generative model gives meaning and optimization
large freedom to choose different optimization approaches
e.g., hard clustering limit
e.g., streaming solutions
![Page 369: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/369.jpg)
clustering and inference: GMM/k-means case study
generative model gives meaning and optimization
large freedom to choose different optimization approaches
e.g., hard clustering limit (see the sketch below)
e.g., streaming solutions
e.g., stochastic gradient methods
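As promised above, a minimal sketch of the hard-clustering limit: freeze the variances, replace the soft responsibilities with an argmax, and EM for a GMM reduces to k-means; the init and loop count below are arbitrary:

```python
import numpy as np

def kmeans(x, K, iters=50):
    rng = np.random.default_rng(0)
    mu = x[rng.choice(len(x), K, replace=False)]           # init centers from data
    for _ in range(iters):
        z = np.argmin((x[:, None] - mu) ** 2, axis=1)      # hard E step: argmax responsibility
        mu = np.array([x[z == k].mean() if np.any(z == k) else mu[k]
                       for k in range(K)])                 # M step: cluster means
    return mu, z
```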
![Page 370: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/370.jpg)
general framework: E+M/variational
e.g., GMM+hard clustering gives kmeans
![Page 371: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/371.jpg)
general framework: E+M/variational
e.g., GMM+hard clustering gives kmeans
e.g., some favorite applications:
![Page 372: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/372.jpg)
general framework: E+M/variational
e.g., GMM+hard clustering gives kmeans
e.g., some favorite applications:
hmm
![Page 373: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/373.jpg)
general framework: E+M/variational
e.g., GMM+hard clustering gives kmeans
e.g., some favorite applications:
hmm
vbmod: arXiv:0709.3512
![Page 374: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/374.jpg)
general framework: E+M/variational
e.g., GMM+hard clustering gives kmeans
e.g., some favorite applications:
hmm
vbmod: arXiv:0709.3512
ebfret: ebfret.github.io
![Page 375: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/375.jpg)
general framework: E+M/variational
e.g., GMM+hard clustering gives kmeans
e.g., some favorite applications:
hmm
vbmod: arXiv:0709.3512
ebfret: ebfret.github.io
EDHMM: edhmm.github.io
![Page 376: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/376.jpg)
example application: LDA+topics
Figure 27: From Blei 2003
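A hedged sketch of fitting LDA topics in practice, using scikit-learn's LatentDirichletAllocation on a made-up toy corpus (an illustration, not the pipeline used at NYT):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["election results and polling data",
        "the team won the championship game",
        "polling shows the election is close",
        "the game went to overtime"]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)            # word counts: the sufficient statistics
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))               # per-document topic proportions
```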
![Page 378: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/378.jpg)
recall: recommendation via factoring
Figure 29: From Blei 2011
![Page 379: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/379.jpg)
CTM: combined loss function
Figure 30: From Blei 2011
![Page 380: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/380.jpg)
CTM: updates for factors
Figure 31: From Blei 2011
![Page 381: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/381.jpg)
CTM: (via Jensen’s, again) bound on loss
Figure 32: From Blei 2011
![Page 382: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/382.jpg)
Lecture 5: data products
![Page 383: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/383.jpg)
data science and design thinking
knowing customer
![Page 384: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/384.jpg)
data science and design thinking
knowing customer
right tool for right job
![Page 385: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/385.jpg)
data science and design thinking
knowing customer
right tool for right job
practical matters:
![Page 386: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/386.jpg)
data science and design thinking
knowing customer
right tool for right job
practical matters:
munging
![Page 387: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/387.jpg)
data science and design thinking
knowing customer
right tool for right job
practical matters:
munging
data ops
![Page 388: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/388.jpg)
data science and design thinking
knowing customer
right tool for right job
practical matters:
munging
data ops
ML in prod
![Page 389: Machine Learning Summer School 2016](https://reader033.vdocuments.mx/reader033/viewer/2022051300/589fb3031a28abf9038b5541/html5/thumbnails/389.jpg)
Thanks!
Thanks, MLSS students, for your great questions; please contact me @chrishwiggins or chris.wiggins@{nytimes,gmail}.com with any questions, comments, or suggestions!