TRANSCRIPT
Statistical Inference from Dependent Observations
Constantinos Daskalakis, EECS and CSAIL, MIT
Supervised Learning Setting
Given:
• Training set {(x_i, y_i)}_{i=1}^n of examples
• x_i: feature (aka covariate) vector of example i
• y_i: label (aka response) of example i
• Assumption: {(x_i, y_i)}_{i=1}^n ∼ i.i.d. D, where D is unknown
• Hypothesis class H = {h_θ | θ ∈ Θ} of (randomized) responses
• e.g. H = {h_θ : ‖θ‖_2 ≤ 1}
• e.g.2 H = {ResNets or Deep Neural Networks}
• generally, h_θ might be randomized
• Loss function ℓ: A × A → ℝ
Goal: Select θ to minimize expected loss: min_θ E_(x,y)∼D [ℓ(h_θ(x), y)]
Goal 2: In the realizable setting (i.e. when, under D, y ∼ h_{θ*}(x)), estimate θ*
[figure: example images labeled "dog", …, "cat"]
E.g. Linear Regression
Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n
Assumption: for all i, y_i is sampled as y_i = θ^T x_i + ε_i, with independent Gaussian noise ε_i ∼ N(0, σ²)
Goal: Infer ๐
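To make the model concrete, here is a minimal numpy sketch (not from the talk; the instance sizes and noise level are illustrative): with Gaussian noise, maximizing the likelihood over θ is exactly ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance sizes: n samples, d features.
n, d = 500, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))          # rows are the covariate vectors x_i
eps = rng.normal(size=n)             # Gaussian noise
y = X @ theta_true + eps             # y_i = theta^T x_i + eps_i

# With Gaussian noise, the MLE for theta coincides with least squares.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

err = np.linalg.norm(theta_hat - theta_true)
print(err)  # on the order of sqrt(d/n) = 0.1 here
```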
E.g.2 Logistic Regression
Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}
Assumption: for all i, y_i is sampled as Pr[y_i = y | x_i] = 1 / (1 + exp(−2y · θ^T x_i))
Goal: Infer ๐
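A matching sketch for the logistic model (illustrative, not from the talk; it uses the ±1-label convention Pr[y | x] = 1/(1 + exp(−2y·θ^T x)), which matches the conditional law appearing later in the talk): the log-likelihood is concave, so plain gradient ascent recovers θ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 3
theta_true = np.array([1.0, -0.5, 0.25])
X = rng.normal(size=(n, d))

# Sample labels: Pr[y_i = y | x_i] = 1 / (1 + exp(-2 y theta^T x_i)).
p_plus = 1.0 / (1.0 + np.exp(-2.0 * (X @ theta_true)))
y = np.where(rng.random(n) < p_plus, 1.0, -1.0)

# MLE by gradient ascent on the concave average log-likelihood.
theta = np.zeros(d)
for _ in range(500):
    margins = 2.0 * y * (X @ theta)
    # gradient of (1/n) sum_i log Pr[y_i | x_i] with respect to theta
    grad = ((2.0 * y / (1.0 + np.exp(margins))) @ X) / n
    theta += 0.5 * grad

err = np.linalg.norm(theta - theta_true)
print(err)
```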
Maximum Likelihood Estimator (MLE)
In the standard linear and logistic regression settings, under mild assumptions about the design matrix ๐, whose rows are the covariate vectors, MLE is strongly consistent.
The MLE θ̂ satisfies: ‖θ̂ − θ‖_2 = O(√(d/n))
• Recall:
• Assumption: (X₁, Y₁), …, (Xₙ, Yₙ) ∼ i.i.d. D, where D is unknown
• Goal: choose h ∈ H to minimize expected loss under some loss function ℓ, i.e.
min_{h ∈ H} E_(x,y)∼D [ℓ(h(x), y)]
• Let ĥ ∈ H be the Empirical Risk Minimizer, i.e.
ĥ ∈ argmin_{h ∈ H} (1/n) Σ_i ℓ(h(x_i), y_i)
• Then:
Loss_D(ĥ) ≤ inf_{h ∈ H} Loss_D(h) + O(√(VC(H)/n)), for Boolean H and the 0-1 loss ℓ
Loss_D(ĥ) ≤ inf_{h ∈ H} Loss_D(h) + O(L · R_n(H)), for general H and L-Lipschitz ℓ, where R_n(H) is the Rademacher complexity
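A toy instance of these ERM bounds (illustrative, not from the talk): 1-D threshold classifiers h_t(x) = sign(x − t), a class of VC dimension 1, trained under the 0-1 loss in the realizable setting. The excess risk of the ERM should be on the order of √(VC/n).

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical class: 1-D thresholds h_t(x) = sign(x - t); VC dimension 1.
thresholds = np.linspace(-2.0, 2.0, 81)
t_true = 0.4          # realizable setting: labels come from h_{t_true}

n = 500
x = rng.normal(size=n)
y = np.sign(x - t_true)

# Empirical Risk Minimizer under the 0-1 loss.
emp_risk = [(np.sign(x - t) != y).mean() for t in thresholds]
t_hat = thresholds[int(np.argmin(emp_risk))]

# Estimate the true risk of h_{t_hat} on a large fresh test set;
# it should be within roughly sqrt(1/n) ~ 0.045 of inf_h Loss(h) = 0.
x_test = rng.normal(size=100_000)
true_risk = (np.sign(x_test - t_hat) != np.sign(x_test - t_true)).mean()
print(t_hat, true_risk)
```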
Beyond Realizable Setting: Learnability
Supervised Learning Setting on Steroids
But what do these results really mean?
Broader Perspective
- ML methods commonly operate under stringent assumptions
o train set: independent and identically distributed (i.i.d.) samples
o test set: i.i.d. samples from the same distribution as training
- Goal: relax standard assumptions to accommodate two important challenges
o (i) censored/truncated samples and (ii) dependent samples
o Censoring/truncation ⇐ systematically missing data
→ train set ≠ test set
o Data dependencies ⇐ peer effects; spatial, temporal dependencies
→ no apparent source of independence
Todayโs Topic: Statistical Inference from Dependent Observations
Observations (x_i, y_i)_i are commonly collected on a spatial domain, a temporal domain, or a social network
As such, they are not independent, but intricately dependent
Why dependent?
e.g. spin glasses
• neighboring particles influence each other
[figure: lattice of +/− spins]
e.g. social networks
• decisions/opinions of nodes are influenced by the decisions of their neighbors (peer effects)
Statistical Physics and Machine Learning
โข Spin Systems [Isingโ25]
• MRFs, Bayesian Networks, Boltzmann Machines
• Probability Theory
โข MCMC
โข Machine Learning
โข Computer Vision
โข Game Theory
โข Computational Biology
โข Causality
โข โฆ
Peer Effects on Social Networks
Several studies of peer effects, in applications such as:
• criminal networks [Glaeser et al.'96]
• welfare participation [Bertrand et al.'00]
• school achievement [Sacerdote'01]
• participation in retirement plans [Duflo-Saez'03]
• obesity [Trogdon et al.'08, Christakis-Fowler'13]
AddHealth Dataset:
• National Longitudinal Study of Adolescent Health
• National study of students in grades 7-12
• friendship networks, personal and school life, age, gender, race, socio-economic background, health, …
Microeconomics:
• Behavior/opinion dynamics, e.g. [Schelling'78], [Ellison'93], [Young'93, '01], [Montanari-Saberi'10], …
Econometrics:
• Disentangling individual from network effects [Manski'93], [Bramoulle-Djebbari-Fortin'09], [Li-Levina-Zhu'16], …
Menu
โข Motivation
โข Part I: Regression w/ dependent observations
โข Part II: Statistical Learning Theory w/ dependent observations
โข Conclusions
Standard Linear Regression
Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n
Assumption: for all i, y_i is sampled as y_i = θ^T x_i + ε_i, with independent Gaussian noise ε_i ∼ N(0, σ²)
Goal: Infer ๐
Standard Logistic Regression
Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}
Assumption: for all i, y_i is sampled as Pr[y_i = y | x_i] = 1 / (1 + exp(−2y · θ^T x_i))
Goal: Infer ๐
Maximum Likelihood Estimator
In the standard linear and logistic regression settings, under mild assumptions about the design matrix ๐, whose rows are the covariate vectors, MLE is strongly consistent.
The MLE θ̂ satisfies: ‖θ̂ − θ‖_2 = O(√(d/n))
Linear Regression w/ Dependent Samples
Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n
Assumption: for all i, y_i is sampled as follows, conditioning on y_{−i}:
Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)
Goal: Infer ๐, ๐ฝ
Logistic Regression w/ Dependent Samples (today's focus)
Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}
Assumption: for all i, y_i is sampled as follows, conditioning on y_{−i}:
Pr[y_i = y | y_{−i}] = 1 / (1 + exp(−2y (θ^T x_i + β A_i^T y)))
Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)
Goal: Infer ๐, ๐ฝ
Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}
Assumption: y₁, …, yₙ are sampled jointly according to the measure
Pr[ȳ = σ] = exp(Σ_i θ^T x_i σ_i + β σ^T A σ) / Z
Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)
Goal: Infer θ, β
This is an Ising model w/ inverse temperature β and external fields θ^T x_i
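For intuition, this joint measure can be computed exactly on a tiny instance by brute-force enumeration of all 2^n spin configurations (an illustrative sketch, not from the talk; the sizes and the cycle-shaped interaction matrix are hypothetical choices):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

# Hypothetical tiny instance: n = 8 nodes on a cycle, d = 2 features.
n, d, beta = 8, 2, 0.2
theta = np.array([0.5, -0.3])
X = rng.normal(size=(n, d))
A = np.zeros((n, n))                 # known interaction matrix: a cycle
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 0.5

fields = X @ theta                   # external field theta^T x_i at node i

# Enumerate all 2^n spin configurations to get the partition function Z and
# exact probabilities Pr[y = s] = exp(sum_i field_i s_i + beta s^T A s) / Z.
configs = np.array(list(product([-1.0, 1.0], repeat=n)))
energies = configs @ fields + beta * np.einsum('ki,ij,kj->k', configs, A, configs)
weights = np.exp(energies)
Z = weights.sum()
probs = weights / Z
print(Z, probs.sum())
```

Enumeration is of course only feasible for toy n; for real instances one resorts to Gibbs sampling, which only needs the conditionals and never touches Z.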
Logistic Regression w/ Dependent Samples (today's focus)
Challenge: we get a single sample ȳ, and the likelihood contains Z
(i.e. no law of large numbers to invoke, and the MLE is hard to compute)
For β = 0, this is equivalent to standard logistic regression
Main Result for Logistic Regression from Dependent Samples
N.B. We show similar results for linear regression with peer effects.
Menu
โข Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
โข Part II: Statistical Learning Theory w/ dependent observations
โข Conclusions
MPLE instead of MLE
• The likelihood involves Z = Z_{θ,β} (the partition function), so it is non-trivial to compute.
• [Besag'75, …, Chatterjee'07] study the maximum pseudolikelihood estimator (MPLE)
• The log-pseudolikelihood does not contain Z and is concave. Is the MPLE consistent?
• [Chatterjee'07]: yes, when β > 0, θ = 0
• [BM'18, GM'18]: yes, when β > 0, θ ≠ 0 ∈ ℝ, and x_i = 1 for all i
• General case?
Problem: Given
• x̄ = (x₁, …, xₙ) and
• ȳ = (y₁, …, yₙ) ∈ {±1}^n sampled as Pr[ȳ = σ] = exp(Σ_i θ^T x_i σ_i + β σ^T A σ) / Z,
infer θ, β.
PL(β, θ) ≜ Π_i Pr_{β,θ,A}[y_i | y_{−i}]^{1/n}
= Π_i [1 / (1 + exp(−2y_i (θ^T x_i + β A_i^T y)))]^{1/n}
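A sketch of the whole pipeline (illustrative; the instance sizes, the cycle interaction matrix, and the step sizes are hypothetical choices): one joint sample ȳ is generated by Gibbs sampling from the conditionals above, and the concave log-pseudolikelihood is then maximized by gradient ascent over (θ, β).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical instance: n nodes on a cycle, d = 2 covariates per node.
n, d = 400, 2
beta0, theta0 = 0.3, np.array([0.8, -0.4])
X = rng.normal(size=(n, d))
A = np.zeros((n, n))                   # known symmetric interaction matrix
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 0.5
fields = X @ theta0

# One joint sample of y via Gibbs sampling from the conditionals
# Pr[y_i = s | y_-i] = 1 / (1 + exp(-2 s (theta^T x_i + beta A_i^T y))).
y = rng.choice([-1.0, 1.0], size=n)
for _ in range(200):                   # sweeps; plenty at this temperature
    for i in range(n):
        m = fields[i] + beta0 * (A[i] @ y)
        y[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * m)) else -1.0

# MPLE: gradient ascent on the concave log-pseudolikelihood in (theta, beta).
theta, beta = np.zeros(d), 0.0
Ay = A @ y                             # A_i^T y for every i (diag(A) = 0)
for _ in range(1000):
    m = X @ theta + beta * Ay          # per-node margins
    g = 2.0 * y / (1.0 + np.exp(2.0 * y * m))   # d logPL / d m_i (times n)
    theta += 0.2 * (g @ X) / n
    beta += 0.2 * (g @ Ay) / n

print(theta, beta)
```

Note that the maximization never touches the partition function: each factor of the pseudolikelihood only involves one node's conditional law.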
Analysis
Problem: Given x̄ = (x₁, …, xₙ) and ȳ = (y₁, …, yₙ) ∈ {±1}^n sampled as Pr[ȳ = σ] = exp(Σ_i θ^T x_i σ_i + β σ^T A σ) / Z, infer θ, β.
• Concentration: the gradient of the log-pseudolikelihood at the truth should be small;
we show that ‖∇ log PL(θ₀, β₀)‖₂² is O(d/n) with probability at least 99%
• Anti-concentration: a constant lower bound on min_{θ,β} λ_min(−∇² log PL(θ, β)) with probability ≥ 99%
Menu
โข Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
โข Part II: Statistical Learning Theory w/ dependent observations
โข Conclusions
Supervised Learning
Given:
• Training set {(x_i, y_i)}_{i=1}^n of examples
• Hypothesis class H = {h_θ | θ ∈ Θ} of (randomized) responses
• Loss function ℓ: A × A → ℝ
Assumption: {(x_i, y_i)}_{i=1}^n ∼ i.i.d. D, where D is unknown
Goal: Select θ to minimize expected loss: min_θ E_(x,y)∼D [ℓ(h_θ(x), y)]
Goal 2: In the realizable setting (i.e. when, under D, y ∼ h_{θ*}(x)), estimate θ*
[figure: example images labeled "dog", …, "cat"]
Supervised Learning w/ Dependent Observations
Given:
• Training set {(x_i, y_i)}_{i=1}^n of examples
• Hypothesis class H = {h_θ | θ ∈ Θ} of (randomized) responses
• Loss function ℓ: A × A → ℝ
Assumption′: a joint distribution D samples the training examples and the unknown test sample jointly, i.e.
• {(x_i, y_i)}_{i=1}^n ∪ {(x, y)} ∼ D
Goal: Select θ to minimize: min_θ E [ℓ(h_θ(x), y)]
Main result
Learnability is possible when the joint distribution D over the training set and test set satisfies Dobrushin's condition.
Dobrushinโs condition
• Given a jointly distributed random vector z̄ = (z₁, …, zₙ), define the influence of variable j on variable i:
• "the worst effect of z_j on the conditional distribution of z_i given z_{−i−j}"
• Inf_{j→i} = sup_{z_j, z_j′, z_{−i−j}} d_TV( Pr[z_i | z_j, z_{−i−j}], Pr[z_i | z_j′, z_{−i−j}] )
• Dobrushin's condition:
• "all nodes have limited total influence exerted on them"
• For all i: Σ_{j≠i} Inf_{j→i} < 1.
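Dobrushin's condition can be checked by brute force on a small model. The sketch below (a hypothetical instance, not from the talk) computes every influence Inf_{j→i} exactly for an Ising model on a cycle; since each spin's conditional law is a two-point distribution, the total-variation distance reduces to a difference of the +1-probabilities.

```python
import numpy as np
from itertools import product

# Hypothetical small Ising model on a cycle:
# Pr[z_i = +1 | z_-i] = 1 / (1 + exp(-2 (h_i + beta * A_i^T z))).
n, beta = 6, 0.2
h = np.zeros(n)
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0   # two neighbors per node

def p_plus(i, z):
    return 1.0 / (1.0 + np.exp(-2.0 * (h[i] + beta * (A[i] @ z))))

def influence(j, i):
    # sup over z_j, z_j', z_{-i-j} of d_TV between conditional laws of z_i;
    # for two-point laws this is |Pr[z_i=+1 | z_j] - Pr[z_i=+1 | z_j']|.
    rest = [k for k in range(n) if k not in (i, j)]
    worst = 0.0
    for vals in product([-1.0, 1.0], repeat=len(rest)):
        z = np.zeros(n)
        z[rest] = vals
        z[j] = 1.0
        p1 = p_plus(i, z)
        z[j] = -1.0
        p2 = p_plus(i, z)
        worst = max(worst, abs(p1 - p2))
    return worst

# Dobrushin's condition: for every i, the total influence on i is < 1.
total = max(sum(influence(j, i) for j in range(n) if j != i) for i in range(n))
print(total)
```

At this temperature each node has two neighbors and non-neighbors exert zero influence, so the total influence on every node is well below 1 and the condition holds; pushing β up makes it fail.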
Implies:
• Concentration of measure
• Fast mixing of the Gibbs sampler
Learnability under Dobrushinโs condition
• Suppose:
• (X₁, Y₁), …, (Xₙ, Yₙ), (X, Y) ∼ D, where D is a joint distribution
• D satisfies Dobrushin's condition
• D has the same marginals
• Define Loss_D(h) = E [ℓ(h(x), y)]
• There exists a learning algorithm which, given (X₁, Y₁), …, (Xₙ, Yₙ), outputs ĥ ∈ H such that
Loss_D(ĥ) ≤ inf_{h ∈ H} Loss_D(h) + Õ(√(VC(H)/n)), for Boolean H and the 0-1 loss ℓ
Loss_D(ĥ) ≤ inf_{h ∈ H} Loss_D(h) + Õ(L · R_n(H)), for general H and L-Lipschitz ℓ
(a condition stronger than Dobrushin's is needed for general H)
Essentially the same bounds as in the i.i.d. setting
Menu
โข Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
โข Part II: Statistical Learning Theory w/ dependent observations
โข Conclusions
Goal: relax standard assumptions to accommodate two important challenges
o (i) censored/truncated samples and (ii) dependent samples
o Censoring/truncation ⇐ systematically missing data
→ train set ≠ test set
o Data dependencies ⇐ peer effects; spatial, temporal dependencies
→ no apparent source of independence
Goals of Broader Line of Work
o Regression on a network (linear and logistic, √(d/n) rates):
Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas: Regression from Dependent Observations. In the 51st Annual ACM Symposium on the Theory of Computing (STOC'19).
o Statistical Learning Theory Framework (learnability and generalization bounds):
Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti: Generalization and Learning under Dobrushin's Condition. In the 32nd Annual Conference on Learning Theory (COLT'19).
Statistical Estimation from Dependent Observations
Thank you!