Statistical Inference from Dependent Observations Constantinos Daskalakis EECS and CSAIL, MIT


Page 1:

Statistical Inference from Dependent Observations

Constantinos Daskalakis, EECS and CSAIL, MIT

Page 2:

Supervised Learning Setting

Given:

• Training set {(x_i, y_i)}_{i=1}^n of examples

• x_i: feature (aka covariate) vector of example i

• y_i: label (aka response) of example i

• Assumption: {(x_i, y_i)}_{i=1}^n ~ i.i.d. D, where D is unknown

• Hypothesis class H = {h_θ : θ ∈ Θ} of (randomized) responses
  • e.g. H = {⟨θ, ·⟩ : ‖θ‖₂ ≤ 1}
  • e.g. 2: H = {some class of Deep Neural Networks}
  • generally, h_θ might be randomized

• Loss function ℓ : Y × Y → ℝ

Goal: Select θ to minimize expected loss: min_θ E_{(x,y)∼D}[ℓ(h_θ(x), y)]

Goal 2: In the realizable setting (i.e. when, under D, y ∼ h_{θ*}(x)), estimate θ*

[Example training images labeled "dog", …, "cat"]
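To make the setting concrete, here is a minimal ERM sketch (illustrative, not from the deck): it fits the first example class H = {⟨θ, ·⟩ : ‖θ‖₂ ≤ 1} under the squared loss by projected gradient descent on a toy i.i.d. training set. The ground-truth θ*, noise level, and step size are all hypothetical choices.

```python
# Minimal ERM sketch: fit h_theta(x) = <theta, x> with ||theta||_2 <= 1 by
# projected gradient descent on the empirical squared loss.
import numpy as np

rng = np.random.default_rng(0)

def project_ball(theta, radius=1.0):
    """Project theta onto the L2 ball of the given radius."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def erm_linear(X, y, steps=500, lr=0.1):
    """Empirical risk minimizer over H = {<theta, .> : ||theta||_2 <= 1}, squared loss."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ theta - y) / n   # gradient of (1/n) sum_i (x_i.theta - y_i)^2
        theta = project_ball(theta - lr * grad)
    return theta

# Toy i.i.d. training set drawn from a hypothetical ground truth theta*.
d, n = 5, 200
theta_star = project_ball(rng.normal(size=d))
X = rng.normal(size=(n, d))
y = X @ theta_star + 0.1 * rng.normal(size=n)
theta_hat = erm_linear(X, y)
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```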

Page 3:

E.g. Linear Regression

Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n

Assumption: for all i, y_i is sampled as follows:

Goal: Infer θ
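The displayed sampling rule from the slide did not survive extraction. A minimal sketch, assuming the standard model y_i = θ^T x_i + ε_i with i.i.d. Gaussian noise ε_i (this model is my assumption, not quoted from the slide):

```python
# Sketch of standard linear regression, assuming y_i = theta^T x_i + eps_i,
# eps_i ~ N(0, sigma^2) i.i.d. (the slide's displayed sampling rule is not in the transcript).
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 500, 10, 0.5
theta = rng.normal(size=d)                  # unknown coefficient vector to be inferred
X = rng.normal(size=(n, d))                 # feature vectors x_i as rows
y = X @ theta + sigma * rng.normal(size=n)  # independent noisy responses

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # MLE under Gaussian noise = least squares
print("||theta_hat - theta||_2 =", np.linalg.norm(theta_hat - theta))
```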

Page 4:

E.g. 2: Logistic Regression

Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}

Assumption: for all i, y_i is sampled as follows:

Goal: Infer θ
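Again the displayed sampling rule is missing from the transcript. The sketch below assumes the conditional law Pr[y_i = y | x_i] = 1 / (1 + exp(−2 y θ^T x_i)), which is the β = 0 case of the dependent model discussed later in the deck, and fits θ by gradient descent on the convex negative log-likelihood; all numeric choices are hypothetical.

```python
# Sketch of standard logistic regression with labels in {-1, +1}, assuming
# Pr[y_i = y | x_i] = 1 / (1 + exp(-2 y theta^T x_i)) -- the beta = 0 case of the
# conditional law that appears later in the deck.
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 5
theta = rng.normal(size=d)
X = rng.normal(size=(n, d))
p_plus = 1.0 / (1.0 + np.exp(-2.0 * X @ theta))        # Pr[y_i = +1 | x_i]
y = np.where(rng.random(n) < p_plus, 1.0, -1.0)

def neg_log_likelihood_grad(theta_hat):
    """Gradient of the (convex) average negative log-likelihood."""
    margins = y * (X @ theta_hat)
    weights = -2.0 * y / (1.0 + np.exp(2.0 * margins))  # d/dm of log(1 + exp(-2 y m))
    return X.T @ weights / n

theta_hat = np.zeros(d)
for _ in range(2000):                                   # plain gradient descent on the MLE objective
    theta_hat -= 0.5 * neg_log_likelihood_grad(theta_hat)
print("||theta_hat - theta||_2 =", np.linalg.norm(theta_hat - theta))
```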

Page 5:

Maximum Likelihood Estimator (MLE)

In the standard linear and logistic regression settings, under mild assumptions about the design matrix X (whose rows are the covariate vectors), the MLE is strongly consistent.

The MLE θ̂ satisfies: ‖θ̂ − θ‖₂ = O(√(d/n))

Page 6:

Beyond Realizable Setting: Learnability

• Recall:
  • Assumption: (X_1, Y_1), …, (X_n, Y_n) ∼ i.i.d. D, where D is unknown
  • Goal: choose h ∈ H to minimize expected loss under some loss function ℓ, i.e.

    min_{h ∈ H} E_{(x,y)∼D}[ℓ(h(x), y)]

• Let ĥ ∈ H be the Empirical Risk Minimizer, i.e.

    ĥ ∈ argmin_{h ∈ H} (1/n) Σ_i ℓ(h(x_i), y_i)

• Then, writing Loss_D(h) = E_{(x,y)∼D}[ℓ(h(x), y)]:

  Loss_D(ĥ) ≤ inf_{h ∈ H} Loss_D(h) + O(√(VC(H)/n)),  for Boolean H, 0-1 loss ℓ

  Loss_D(ĥ) ≤ inf_{h ∈ H} Loss_D(h) + O(λ · ℛ_n(H)),  for general H, λ-Lipschitz ℓ

Page 7:

Supervised Learning Setting on Steroids

But what do these results really mean?

Page 8:

Broader Perspective

① ML methods commonly operate under stringent assumptions

  o train set: independent and identically distributed (i.i.d.) samples

  o test set: i.i.d. samples from the same distribution as training

② Goal: relax standard assumptions to accommodate two important challenges

  o (i) censored/truncated samples and (ii) dependent samples

  o Censoring/truncation ⇝ systematic missingness of data
    ⇒ train set ≠ test set

  o Data dependencies ⇝ peer effects, spatial/temporal dependencies
    ⇒ no apparent source of independence

Page 9:

Today's Topic: Statistical Inference from Dependent Observations

[Figure: examples of independent vs. dependent observations]

Page 10:

Observations {(x_i, y_i)}_i are commonly collected on some spatial domain, some temporal domain, or on a social network.

As such, they are not independent, but intricately dependent.

Why dependent?

e.g. spin glasses: neighboring particles influence each other

[Figure: lattice of +/− spins]

e.g. social networks: decisions/opinions of nodes are influenced by the decisions of their neighbors (peer effects)

Page 11:

Statistical Physics and Machine Learning

• Spin Systems [Ising'25]

• MRFs, Bayesian Networks, Boltzmann Machines

• Probability Theory

• MCMC

• Machine Learning

• Computer Vision

• Game Theory

• Computational Biology

• Causality

• …


Page 12:

Peer Effects on Social Networks

Several studies of peer effects, in applications such as:
• criminal networks [Glaeser et al.'96]
• welfare participation [Bertrand et al.'00]
• school achievement [Sacerdote'01]
• participation in retirement plans [Duflo-Saez'03]
• obesity [Trogdon et al.'08, Christakis-Fowler'13]

AddHealth Dataset:
• National Longitudinal Study of Adolescent Health
• national study of students in grades 7-12
• friendship networks, personal and school life, age, gender, race, socio-economic background, health, …

Microeconomics:
• behavior/opinion dynamics, e.g. [Schelling'78], [Ellison'93], [Young'93, '01], [Montanari-Saberi'10], …

Econometrics:
• disentangling individual from network effects [Manski'93], [Bramoulle-Djebbari-Fortin'09], [Li-Levina-Zhu'16], …

Page 13:

Menu

• Motivation

• Part I: Regression w/ dependent observations

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 14:

Menu

• Motivation

• Part I: Regression w/ dependent observations

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 15:

Standard Linear Regression

Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n

Assumption: for all i, y_i is sampled as follows:

Goal: Infer θ

Page 16:

Standard Logistic Regression

Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}

Assumption: for all i, y_i is sampled as follows:

Goal: Infer θ

Page 17:

Maximum Likelihood Estimator

In the standard linear and logistic regression settings, under mild assumptions about the design matrix X (whose rows are the covariate vectors), the MLE is strongly consistent.

The MLE θ̂ satisfies: ‖θ̂ − θ‖₂ = O(√(d/n))

Page 18:

Standard Linear Regression

Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n

Assumption: for all i, y_i is sampled as follows:

Goal: Infer θ

Page 19:

Linear Regression w/ Dependent Samples

Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n

Assumption: for all i, y_i is sampled as follows, conditioning on y_{−i}:

Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)

Goal: Infer θ, β

Page 20:

Standard Logistic Regression

Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}

Assumption: for all i, y_i is sampled as follows:

Goal: Infer θ

Page 21:

Logistic Regression w/ Dependent Samples (today's focus)

Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}

Assumption: for all i, y_i is sampled as follows, conditioning on y_{−i}:

Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)

Goal: Infer θ, β

Page 22:

Logistic Regression w/ Dependent Samples (today's focus)

Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}

Assumption: y_1, …, y_n are sampled jointly according to the measure

  Pr[y⃗ = σ] = exp(Σ_i θ^T x_i σ_i + β σ^T A σ) / Z,  σ ∈ {±1}^n,

i.e. an Ising model w/ inverse temperature β and external fields θ^T x_i; for β = 0 this is equivalent to standard logistic regression.

Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)

Goal: Infer θ, β

Challenge: one sample, and the likelihood contains Z (the partition function), i.e. lack of an LLN and the MLE is hard to compute.
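For intuition, here is a sketch (not part of the deck) of how such jointly dependent responses can be generated: a Gibbs sampler that repeatedly resamples each y_i from the conditional law used for the pseudolikelihood later in the deck, Pr[y_i = +1 | y_{−i}] = 1 / (1 + exp(−2(θ^T x_i + β A_i^T y))). The cycle-graph A and the values of θ and β are hypothetical toy choices; A is taken symmetric with zero diagonal.

```python
# Sketch: generate dependent binary responses y in {-1,+1}^n by Gibbs sampling,
# resampling each coordinate from the conditional law
#   Pr[y_i = +1 | y_{-i}] = 1 / (1 + exp(-2 (theta^T x_i + beta * A_i^T y))).
# A is assumed symmetric with zero diagonal, so A_i^T y does not involve y_i.
import numpy as np

def gibbs_sample(X, theta, beta, A, sweeps=200, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    n = X.shape[0]
    fields = X @ theta                       # external fields theta^T x_i
    y = rng.choice([-1.0, 1.0], size=n)      # arbitrary starting configuration
    for _ in range(sweeps):
        for i in rng.permutation(n):         # one sweep = resample every coordinate once
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (fields[i] + beta * A[i] @ y)))
            y[i] = 1.0 if rng.random() < p_plus else -1.0
    return y

# Toy instance: hypothetical theta, beta and a cycle graph as the known interaction matrix A.
rng = np.random.default_rng(3)
n, d = 50, 3
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 0.5
X = rng.normal(size=(n, d))
theta_true, beta_true = rng.normal(size=d), 0.3
y = gibbs_sample(X, theta_true, beta_true, A, rng=rng)
print(y[:10])
```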

Page 23:

Main Result for Logistic Regression from Dependent Samples

N.B. We show similar results for linear regression with peer effects.

Page 24:

Menu

• Motivation

• Part I: Regression w/ dependent observations
  • Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 25:

Menu

• Motivation

• Part I: Regression w/ dependent observations
  • Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 26:

MPLE instead of MLE

• The likelihood involves Z = Z_{θ,β} (the partition function), so it is non-trivial to compute.

• [Besag'75, …, Chatterjee'07] study the maximum pseudolikelihood estimator (MPLE).

• log PL does not contain Z and is concave. Is the MPLE consistent?

• [Chatterjee'07]: yes, when β > 0, θ = 0

• [BM'18, GM'18]: yes, when β > 0, θ ∈ ℝ with θ ≠ 0, and x_i = 1 for all i

• General case?

Problem: Given
• x⃗ = (x_1, …, x_n) and
• y⃗ = (y_1, …, y_n) ∈ {±1}^n sampled as

  Pr[y⃗ = σ] = exp(Σ_i θ^T x_i σ_i + β σ^T A σ) / Z

infer θ, β.

PL(β, θ) := Π_i Pr_{β,θ,A}[y_i | y_{−i}]^{1/n}
          = Π_i [ 1 / (1 + exp(−2 y_i (θ^T x_i + β A_i^T y))) ]^{1/n}

Page 27:

Analysis

Problem: Given
• x⃗ = (x_1, …, x_n) and
• y⃗ = (y_1, …, y_n) ∈ {±1}^n sampled as

  Pr[y⃗ = σ] = exp(Σ_i θ^T x_i σ_i + β σ^T A σ) / Z

infer θ, β.

PL(β, θ) := Π_i Pr_{β,θ,A}[y_i | y_{−i}]^{1/n}

Page 28:

Analysis

Problem: Given
• x⃗ = (x_1, …, x_n) and
• y⃗ = (y_1, …, y_n) ∈ {±1}^n sampled as

  Pr[y⃗ = σ] = exp(Σ_i θ^T x_i σ_i + β σ^T A σ) / Z

infer θ, β.

PL(β, θ) := Π_i Pr_{β,θ,A}[y_i | y_{−i}]^{1/n}

• Concentration: the gradient of the log-pseudolikelihood at the truth should be small;
  show that ‖∇ log PL(θ_0, β_0)‖₂² is O(d/n) with probability at least 99%.

• Anti-concentration: a constant lower bound on min_{θ,β} λ_min(−∇² log PL(θ, β)) with probability ≥ 99%.

Page 29:

Menu

• Motivation

• Part I: Regression w/ dependent observations
  • Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 30:

Menu

• Motivation

• Part I: Regression w/ dependent observations
  • Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 31:

Supervised Learning

Given:

• Training set {(x_i, y_i)}_{i=1}^n of examples

• Hypothesis class H = {h_θ : θ ∈ Θ} of (randomized) responses

• Loss function ℓ : Y × Y → ℝ

Assumption: {(x_i, y_i)}_{i=1}^n ~ i.i.d. D, where D is unknown

Goal: Select θ to minimize expected loss: min_θ E_{(x,y)∼D}[ℓ(h_θ(x), y)]

Goal 2: In the realizable setting (i.e. when, under D, y ∼ h_{θ*}(x)), estimate θ*

[Example training images labeled "dog", …, "cat"]

Page 32:

Supervised Learning w/ Dependent Observations

Given:

• Training set {(x_i, y_i)}_{i=1}^n of examples

• Hypothesis class H = {h_θ : θ ∈ Θ} of (randomized) responses

• Loss function ℓ : Y × Y → ℝ

Assumption: {(x_i, y_i)}_{i=1}^n ~ i.i.d. D, where D is unknown

Goal: Select θ to minimize expected loss: min_θ E_{(x,y)∼D}[ℓ(h_θ(x), y)]

Goal 2: In the realizable setting (i.e. when, under D, y ∼ h_{θ*}(x)), estimate θ*

[Example training images labeled "dog", …, "cat"]

Page 33:

Supervised Learning w/ Dependent Observations

Given:

• Training set {(x_i, y_i)}_{i=1}^n of examples

• Hypothesis class H = {h_θ : θ ∈ Θ} of (randomized) responses

• Loss function ℓ : Y × Y → ℝ

Assumption': a joint distribution D samples the training examples and an unknown test sample jointly, i.e.

• {(x_i, y_i)}_{i=1}^n ∪ {(x, y)} ∼ D

Goal: select θ to minimize: min_θ E[ℓ(h_θ(x), y)]

[Example training images labeled "dog", …, "cat"]

Page 34:

Main result

Learnability is possible when the joint distribution D over the training set and test set satisfies Dobrushin's condition.

Page 35:

Dobrushin's condition

• Given a jointly distributed random vector Z̄ = (Z_1, …, Z_m), define the influence of variable i on variable j:
  • "the worst effect of Z_i on the conditional distribution of Z_j given Z_{−i−j}"
  • Inf_{i→j} = sup_{z_i, z_i′, z_{−i−j}} d_TV( Pr[Z_j | z_i, z_{−i−j}], Pr[Z_j | z_i′, z_{−i−j}] )

• Dobrushin's condition:
  • "all nodes have limited total influence exerted on them"
  • For all i: Σ_{j≠i} Inf_{j→i} < 1.

Implies:
• concentration of measure
• fast mixing of the Gibbs sampler
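A brute-force sketch (illustrative) of the definition: enumerate all conditioning assignments for a small Ising-type model and check Σ_{j≠i} Inf_{j→i} < 1 for every i. The model, β, and external fields below are hypothetical, and the conditional-law normalization assumes A is symmetric with zero diagonal.

```python
# Sketch: brute-force check of Dobrushin's condition for a small Ising-type model
#   Pr[z = sigma] ∝ exp(sigma^T theta + beta * sigma^T A sigma)   over {-1,+1}^m,
# with A symmetric and zero-diagonal (so z_i never enters its own conditional field).
import itertools
import numpy as np

def conditional_plus(i, z, theta, beta, A):
    """Pr[Z_i = +1 | z_{-i}] under the Ising measure above."""
    field = theta[i] + 2.0 * beta * (A[i] @ z)    # A[i, i] = 0, so z[i] does not contribute
    return 1.0 / (1.0 + np.exp(-2.0 * field))

def influence(j, i, theta, beta, A, m):
    """Inf_{j->i}: worst-case total-variation effect of Z_j on the law of Z_i."""
    worst = 0.0
    for rest in itertools.product([-1.0, 1.0], repeat=m):
        z_a, z_b = np.array(rest), np.array(rest)
        z_a[j], z_b[j] = +1.0, -1.0
        worst = max(worst, abs(conditional_plus(i, z_a, theta, beta, A)
                               - conditional_plus(i, z_b, theta, beta, A)))
    return worst

m, beta = 6, 0.05                                 # small beta, so the condition should hold
rng = np.random.default_rng(4)
theta = 0.2 * rng.normal(size=m)
A = np.triu(rng.random((m, m)) < 0.5, k=1).astype(float)
A = A + A.T                                       # symmetric, zero diagonal
total_influence = [sum(influence(j, i, theta, beta, A, m) for j in range(m) if j != i)
                   for i in range(m)]
print("Dobrushin's condition holds:", max(total_influence) < 1)
```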

Page 36:

Learnability under Dobrushin's condition

• Suppose:
  • (X_1, Y_1), …, (X_n, Y_n), (X, Y) ∼ D, where D is a joint distribution
  • D satisfies Dobrushin's condition
  • D has the same marginals

• Define Loss_D(h) = E[ℓ(h(x), y)]

• There exists a learning algorithm which, given (X_1, Y_1), …, (X_n, Y_n), outputs ĥ ∈ H such that

  Loss_D(ĥ) ≤ inf_{h ∈ H} Loss_D(h) + Õ(√(VC(H)/n)),  for Boolean H, 0-1 loss ℓ

  Loss_D(ĥ) ≤ inf_{h ∈ H} Loss_D(h) + Õ(λ · 𝒢_n(H)),  for general H, λ-Lipschitz ℓ
  (a condition stronger than Dobrushin's is needed for general H)

Essentially the same bounds as in the i.i.d. case.

Page 37:

Menu

• Motivation

• Part I: Regression w/ dependent observations
  • Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 38:

Menu

• Motivation

• Part I: Regression w/ dependent observations
  • Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 39:

Goals of Broader Line of Work

Goal: relax standard assumptions to accommodate two important challenges

o (i) censored/truncated samples and (ii) dependent samples

o Censoring/truncation ⇝ systematic missingness of data
  ⇒ train set ≠ test set

o Data dependencies ⇝ peer effects, spatial/temporal dependencies
  ⇒ no apparent source of independence

Page 40:

o Regression on a network (linear and logistic, d/n rates):

Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas: Regression from Dependent Observations. In the 51st Annual ACM Symposium on the Theory of Computing (STOC'19).

o Statistical Learning Theory framework (learnability and generalization bounds):

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti: Generalization and Learning under Dobrushin's Condition. In the 32nd Annual Conference on Learning Theory (COLT'19).

Statistical Estimation from Dependent Observations

Thank you!