
Learning with Memory and Communication Constraints

Jacob Steinhardt*
Stanford University
[email protected]

July 30, 2015

*with John Duchi, Gregory Valiant, and Stefan Wager

Motivation

Computational constraints are becoming the bottleneck in many systems.

There is not yet a good theory of computationally bounded statistics.

Study the sample complexity of resource-constrained learning algorithms.

(Cover, 1969; Hellman & Cover, 1970; Ben-David & Dichterman, 1998; Balcan et al., 2012; Berthet & Rigollet, 2013; Chandrasekaran & Jordan, 2013; Duchi, Jordan, & Wainwright, 2013; Zhang et al., 2013; Zhang, Wainwright, & Jordan, 2014; Christiano, 2014; Daniely, Linial, & Shalev-Shwartz, 2014; Garg, Ma, & Nguyen, 2014; Shamir, 2014; Braverman et al., 2015; S. & Duchi, 2015; S., Valiant, & Wager, 2015)

This work: memory, communication.

1 Memory, Communication, and Statistical Queries

2 Memory-Constrained Sparse Regression

Setting

Assume: a polynomial number of i.i.d. samples (x, ℓ(x)) ∈ X × {−1, +1}, with the labeling function ℓ in some concept class F.

COM(b): each sample is held by a separate party; each party can interactively broadcast up to b bits.

COM(b, k): each party gets k samples (instead of 1).

MEM(b): access the data in a stream, storing at most b bits of state.

Relate both classes to the well-studied statistical query model:

SQ: can query E[ψ(x, ℓ(x))] for any function ψ : X × {±1} → [−1, 1]; the answer is accurate to tolerance τ = 1/poly(n).
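
To make the SQ model concrete, here is a minimal sketch (not from the talk) of simulating a statistical-query oracle from i.i.d. samples: answer each query with an empirical mean, which by Hoeffding's inequality is within roughly 1/√n of E[ψ(x, ℓ(x))] with high probability. The helper names and the toy concept below are illustrative assumptions.

```python
import random

def sq_oracle(samples, psi):
    """Answer a statistical query with the empirical mean of psi over the samples.

    With n samples the answer is within roughly 1/sqrt(n) of E[psi(x, label(x))]
    with high probability, so tolerance 1/poly(n) is reachable from polynomially
    many samples.
    """
    return sum(psi(x, y) for (x, y) in samples) / len(samples)

def draw_samples(n, dim=5):
    """Toy data source: x uniform in {0,1}^dim, label determined by x[0]."""
    samples = []
    for _ in range(n):
        x = tuple(random.randint(0, 1) for _ in range(dim))
        y = +1 if x[0] == 1 else -1
        samples.append((x, y))
    return samples

if __name__ == "__main__":
    data = draw_samples(10000)
    # Query E[I[y = +1]]; for this toy concept the true value is 0.5.
    print(sq_oracle(data, lambda x, y: 1.0 if y == +1 else 0.0))
```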

Main Results: Communication

Theorem. If F is learnable with m samples and b bits of communication, then it is learnable with O(bm) statistical queries of tolerance τ = Ω(1/(2^b m)).

Implications of the theorem:

For any constant C > 0, COM(1) = COM(C log(n)) = SQ.

Let PARITY(n) be the problem where x ∼ Uniform({0,1}^n) and ℓ(x) = (−1)^(c⊤x) for an unknown c ∈ {0,1}^n.

Then PARITY(n) ∉ COM(n/4).

In addition, PARITY(n) ∉ COM(n/16, n/4).

Open Problem. Can PARITY(n) be solved with n²/4 bits of memory?
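
As a concrete illustration of the PARITY(n) concept class (the secret vector below is made up for illustration):

```python
import random

def draw_parity_example(c):
    """One labelled example: x ~ Uniform({0,1}^n), label = (-1)^(c . x mod 2)."""
    x = [random.randint(0, 1) for _ in range(len(c))]
    dot = sum(ci * xi for ci, xi in zip(c, x)) % 2
    return x, (-1 if dot == 1 else +1)

if __name__ == "__main__":
    c = [1, 0, 1, 1]              # unknown parity vector (illustrative)
    print([draw_parity_example(c) for _ in range(3)])
```

Without resource constraints, about n such examples pin down c by Gaussian elimination over GF(2); the theorem says this is impossible when each party may broadcast only n/4 bits.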

Main Results: Memory

Theorem. If F can be learned with m statistical queries of tolerance τ, then it can be learned with

O(log|F| · log(m/τ)) bits of state and

O(m log|F| / τ²) samples.

Caveat: the reduction is not computationally efficient.

Implications of the theorem:

Let REP be the class of efficiently representable problems: log|F| = O(n).

Then SQ ∩ REP ⊆ MEM(O(n)).

k-sparse linear regression in d dimensions can be solved with k · polylog(d) bits of state and d · poly(k) samples.

If the covariates are r-sparse, then only poly(r, k) samples are needed.

Reduction: Communication

Goal: reduce a communication-constrained algorithm to an SQ algorithm.

Idea: use queries to estimate the probability that the next communicated bit is 0 or 1.

Consider an intermediate state of the algorithm [figure: five parties, each holding one sample; the first few have broadcast bits c1, c2, ..., and the next bit is still unknown]:

p(c3 = 1 | c1:2 = 10) = p(c1:3 = 101) / p(c1:2 = 10)
                      = E[I[c1:3 = 101]] / p(c1:2 = 10),

where the numerator E[I[c1:3 = 101]] is a statistical query.

Error: τ / p(c1:2).

E[τ / p(c1:2)] = τ · Σ_c [p(c) · 1/p(c)] = 4τ (in general, 2^b τ).

Cumulative error: m 2^b τ ⟹ okay as long as τ ≪ 1/(m 2^b)!
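
A minimal sketch of the reduction's inner loop, assuming we are handed an SQ oracle `sq` and the protocol's (public, deterministic) message function; this is illustrative code, not the talk's exact construction. Each transcript bit is drawn from its conditional distribution, estimated as a ratio of a statistical query to the running probability of the bits already committed, exactly as in the displayed formula.

```python
import random

def simulate_party_message(sq, message_of, public_prefix, b):
    """Sample one party's b-bit broadcast using only statistical queries.

    sq(psi): statistical-query oracle; returns E[psi(x, y)] up to tolerance tau.
    message_of(x, y, public_prefix): the b-bit string this party would broadcast
        if its sample were (x, y), given the public transcript so far.
    """
    sent = ""
    prob_sent = 1.0                      # running estimate of p(bits sent so far)
    for _ in range(b):
        target = sent + "1"
        # SQ: probability a fresh sample would produce `target` as its first bits.
        q = sq(lambda x, y, t=target: 1.0
               if message_of(x, y, public_prefix).startswith(t) else 0.0)
        p1 = min(max(q / max(prob_sent, 1e-12), 0.0), 1.0)
        bit = "1" if random.random() < p1 else "0"
        prob_sent *= p1 if bit == "1" else (1.0 - p1)   # error per step ~ tau / p(prefix)
        sent += bit
    return sent
```

Over m samples with b bits each, the per-step errors accumulate to about m 2^b τ, which is why tolerance τ ≪ 1/(m 2^b) suffices.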

Reduction: Memory

Goal: represent an SQ algorithm in a memory-efficient way.

Step 1: replace queries with threshold queries (i.e., "Is E[ψ] > t?").

The algorithm is now a decision tree of depth m [figure: root query ψ with children ψ0, ψ1 and grandchildren ψ00, ψ01, ψ10, ψ11, branching on the answers 0/1].

Issue: naively remembering the position in the tree requires Θ(m) memory.

Can we somehow identify "important" queries?

Idea: Normalizing Queries

Consider a threshold query (ψ, t) of tolerance τ:

SQ(ψ, t) = 1 if E[ψ] > t + τ,
           0 if E[ψ] < t − τ,
           arbitrary otherwise.

[figure: the interval [t − τ, t + τ] around t on the E[ψ] axis, with the answer forced to 0 to its left and forced to 1 to its right]

Call (ψ, t, τ) "good" if at least one of the answers 0, 1 narrows down F by a factor of 1/2.

Consider (ψ, t − τ/2, τ/2) and (ψ, t + τ/2, τ/2). At least one must be good (one of the two sets {f : E_f[ψ] < t}, {f : E_f[ψ] ≥ t} contains at most half of F, and the corresponding shifted query has an answer that pins E[ψ] to that side).

So we can always normalize queries to be good!

Compression Scheme

Normalize all queries to be good.

At each node, color the child edge whose answer reduces F by at least a factor of 1/2 [figure: the decision tree from before, with one outgoing edge colored at each node].

Note: any path has at most log|F| colored edges.

So we can remember the indices of the colored edges with log|F| · log(m) bits of memory!
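
A minimal sketch of the bookkeeping behind this compression (illustrative, not the talk's exact construction). The query tree is fixed in advance and exactly one child edge of every node is colored, so the current position is determined by the depths at which the colored edge was taken, and the counting argument above says any path contains at most log|F| such depths.

```python
def encode_path(answers, colored_child):
    """Compress a root-to-node path of a known decision tree.

    answers: the 0/1 query answers along the path (length <= m).
    colored_child(depth, prefix): which child (0 or 1) is colored at the node
        reached by `prefix`; known in advance because the tree is fixed.

    Returns the depths where the colored edge was taken; each fits in log2(m) bits.
    """
    colored_depths, prefix = [], []
    for depth, a in enumerate(answers):
        if a == colored_child(depth, tuple(prefix)):
            colored_depths.append(depth)
        prefix.append(a)
    return colored_depths

def decode_path(colored_depths, colored_child, length):
    """Invert encode_path: colored edge at stored depths, the other edge elsewhere."""
    answers, prefix = [], []
    for depth in range(length):
        c = colored_child(depth, tuple(prefix))
        a = c if depth in colored_depths else 1 - c
        answers.append(a)
        prefix.append(a)
    return answers
```

For any colored_child function cc, decode_path(encode_path(answers, cc), cc, len(answers)) recovers answers exactly, so storing only the colored depths suffices to resume the algorithm.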

Summary

COM → SQ: simulate the conditional probabilities of messages with statistical queries.

SQ → MEM: normalize queries, store a compressed representation of the decision path.

Next: study sparse regression in more detail.

1 Memory, Communication, and Statistical Queries

2 Memory-Constrained Sparse Regression

Setting

Sparse linear regression in R^d:

Y^(i) = ⟨w*, X^(i)⟩ + ε^(i),    ‖w*‖₀ = k,    k ≪ d.

Memory constraint:

(X^(i), Y^(i)) observed as a read-only stream.

Only b bits of state Z^(i) may be kept between successive observations.
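
For concreteness, a toy instance of this streaming setup (the dimensions, sparsity, and noise level below are arbitrary illustrative choices):

```python
import numpy as np

d, k, noise = 1000, 5, 0.1
rng = np.random.default_rng(0)

# k-sparse ground truth w*
w_star = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
w_star[support] = rng.choice([-1.0, 1.0], size=k)

def next_observation():
    """One element (X^(i), Y^(i)) of the read-only stream."""
    x = rng.standard_normal(d)
    y = x @ w_star + noise * rng.standard_normal()
    return x, y

# A memory-bounded learner may keep only b bits of state between calls to
# next_observation(); it never revisits past (x, y) pairs.
```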

Problem Statement

How much data n is needed to obtain an estimator ŵ with E[‖ŵ − w*‖₂²] ≤ ε?

Classical case (no memory constraint):

Theorem (Wainwright, 2009). log(d) ≲ n ≲ (k/ε) · log(d).

Achievable with O(d) memory (Agarwal et al., 2012; S., Wager, & Liang, 2015).

With a memory constraint of b bits:

Theorem (S. & Duchi, 2015). d/b ≲ n ≲ (k/ε²) · (d/b).

[Note: up to log factors; assumes k log(d) ≪ b ≤ d.]

Exponential increase if b ≪ d!
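
To get a feel for the gap, a rough back-of-the-envelope comparison, ignoring constants and log factors as the theorems do; the specific numbers are purely illustrative.

```python
import math

d, k, eps, b = 10**6, 10, 0.1, 10**4           # illustrative values

n_classical = (k / eps) * math.log(d)          # ~ (k/eps) * log d
n_memory_bounded = (k / eps**2) * (d / b)      # ~ (k/eps^2) * d/b

print(f"classical:      n ~ {n_classical:.0f}")       # about 1.4e3
print(f"memory-bounded: n ~ {n_memory_bounded:.0f}")  # about 1e5
```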

Proof Overview

Lower bound: information-theoretic; strong data-processing inequality.

[Diagram: Markov chain W* → (X, Y) → Z, where X has d coordinates, Y is a single label, and Z is limited to b bits.]

Main challenge: dependence between X and Y.

Upper bound: count-min sketch + ℓ₁-regularized dual averaging; more regularization → easier sketching problem.

Lower Bound Construction

Split coordinates into k blocks of size d/k.

w* in each block: single non-zero coordinate J, equal to ±δ with equal probability.

Direct sum argument: reduce to k = 1.

[Diagram: one block of d/k coordinates, with the non-zero coordinate (here J = 2) highlighted.]

Estimation to testing:

E[‖w* − ŵ‖₂²] ≥ (δ²/2) · P[Ĵ ≠ J]

Looking ahead: bound the KL divergence between P_j and the base distribution P_0.
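The estimation-to-testing inequality above follows from a short calculation. Here is a sketch of the standard argument, under the convention (implicit on the slide) that Ĵ is taken to be the index of the largest |ŵ_j| within the block:

% Sketch: take \hat{J} = \arg\max_j |\hat{w}_j|. On the event \hat{J} \neq J,
% write a = |\hat{w}_J|, so that |\hat{w}_{\hat{J}}| \ge a, and (reverse triangle inequality)
\|\hat{w} - w^*\|_2^2
  \;\ge\; (\hat{w}_J - w^*_J)^2 + \hat{w}_{\hat{J}}^2
  \;\ge\; (\delta - a)^2 + a^2
  \;\ge\; \tfrac{1}{2}\delta^2,
% since (\delta - a)^2 + a^2 is minimized at a = \delta/2. Taking expectations,
\mathbb{E}\bigl[\|\hat{w} - w^*\|_2^2\bigr] \;\ge\; \tfrac{\delta^2}{2}\,\mathbb{P}[\hat{J} \neq J].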

Some Information Theory

Let X ∼ Uniform({±1}^d).

Let P_j(Z^(1:n)) be the distribution conditioned on J = j.

Let P_0(Z^(1:n)) be the distribution in which Y is independent of X.

Assouad's method:

P[Ĵ ≠ J] ≥ 1/2 − sqrt( (1/d) Σ_{j=1}^d D_kl( P_0(Z^(1:n)) || P_j(Z^(1:n)) ) )

[Diagram: the coordinate X_j takes the values −1 and +1.]

Key fact: (Y, X_j) is independent of X_{¬j} under P_j.

Intuition: D_kl(P_0 || P_j) is small unless Z stores information about X_j; Z would need to store the majority of the X_j to make the average D_kl large.
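To put a number on this intuition: each observation carries only O(δ²) nats of information about X_j. The slides do not spell out the joint law of (X, Y), so the Bernoulli model below is purely a hypothetical stand-in: suppose that under P_j the label satisfies Y = X_j with probability (1 + δ)/2, while under P_0 it is a fair coin independent of X. The per-observation KL divergence is then ≈ δ²/2, consistent with the 4δ² factor on the next slide.

# Hypothetical illustration (the talk does not specify the (X, Y) distribution):
# under P_j, Y = X_j with probability (1+δ)/2; under P_0, Y is a fair ±1 coin.
# The per-observation KL divergence is Θ(δ²).
import math

def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence D(Ber(p) || Ber(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for delta in (0.3, 0.1, 0.03, 0.01):
    kl = kl_bernoulli(0.5, (1 + delta) / 2)   # D_kl(P_0(Y | X_j) || P_j(Y | X_j))
    print(f"δ = {delta:5.2f}:  KL = {kl:.6f}   δ²/2 = {delta**2 / 2:.6f}")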

Strong Data-Processing Inequality

Focus on a single index i, so Z = Z^(i), with the past messages Z̄ = Z^(1:i−1) fixed to z̄.

Proposition. For any z̄,

D_kl( P_0(Z | z̄) || P_j(Z | z̄) ) ≤ 4δ² · I(X_j; Z | Y, Z̄ = z̄)
                                 ≤ 4δ² · I(X_j; Z, Y | Z̄ = z̄)

Plug into Assouad:

(1/d) Σ_{j=1}^d D_kl(P_0 || P_j) ≤ (4δ²/d) Σ_{j=1}^d I(X_j; Z, Y | Z̄)
                                 ≤ (4δ²/d) · I(X; Z, Y | Z̄),   where I(X; Z, Y | Z̄) ≤ b + O(1).

Only get 4δ²b/d bits per round!
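Putting the pieces together for a single block (k = 1); this is a sketch only, with constants and the direct-sum bookkeeping over the k blocks elided:

% Chain rule for KL over the n rounds, then the per-round bound above:
\frac{1}{d}\sum_{j=1}^d D_{\mathrm{kl}}\bigl(P_0(Z^{(1:n)}) \,\|\, P_j(Z^{(1:n)})\bigr)
  \;\lesssim\; \frac{4\delta^2 (b + O(1))\, n}{d}.
% Assouad's method then gives
\mathbb{P}[\hat{J} \neq J] \;\ge\; \frac12 - \sqrt{\frac{4\delta^2 b\, n}{d}},
% while estimation-to-testing with \delta^2 \asymp \varepsilon says that any estimator with
% E\|\hat{w} - w^*\|_2^2 \le \varepsilon keeps \mathbb{P}[\hat{J} \neq J] below a small constant.
% The two displays are compatible only if
n \;\gtrsim\; \frac{d}{\delta^2 b} \;\asymp\; \frac{d}{\varepsilon b},
% and the direct-sum argument over the k blocks yields the stated k d/(b\varepsilon) lower bound.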

Upper Bound

Solve the ℓ₁-regularized dual averaging problem (Xiao, 2010), with λ ≫ 1:

w^(i) = argmin_w { ⟨θ^(i), w⟩ + λ√n ‖w‖₁ + (1/(2η)) ‖w‖₂² },   θ^(i) = Σ_{i′=1}^{i−1} x^(i′) (y^(i′) − ⟨w^(i′), x^(i′)⟩).

Hard part: determine the support of w^(i).

Need to distinguish |θ_j| ≥ λ√n (signal) from |θ_j| ≈ √n (noise).

Can use a count-min sketch; memory usage ≈ d log(d)/λ².

⇒ Regularization decreases computation; seen before in the ℓ₂ case (Shalev-Shwartz & Zhang, 2013; Bruer et al., 2014).
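The following is a minimal, schematic sketch of ℓ₁-regularized dual averaging for sparse regression, not the memory-bounded algorithm of the talk: θ is kept as a dense vector here, whereas the talk replaces it with a count-min sketch so that only coordinates with |θ_j| ≳ λ√n (the candidate support) are tracked. Signs follow the usual RDA convention (θ accumulates gradients), which may differ from the slide's display by a sign, and all parameter values are illustrative.

import numpy as np

def l1_rda(X, y, lam, eta):
    """One streaming pass of ℓ1-regularized dual averaging over (X, y)."""
    n, d = X.shape
    theta = np.zeros(d)        # running sum of gradients (dense here; sketched in the talk)
    w = np.zeros(d)
    thresh = lam * np.sqrt(n)  # ℓ1 threshold λ√n
    for i in range(n):
        g = X[i] * (X[i] @ w - y[i])   # gradient of ½(⟨w, x⟩ − y)² at the current w
        theta += g
        # Closed-form RDA step: coordinates with |θ_j| ≤ λ√n are set exactly to zero.
        # This is the support-identification step that the count-min sketch approximates.
        w = -eta * np.sign(theta) * np.maximum(np.abs(theta) - thresh, 0.0)
    return w

# Tiny synthetic check with hypothetical parameters.
rng = np.random.default_rng(0)
n, d, k = 4000, 400, 4
w_star = np.zeros(d)
w_star[:k] = 1.0
X = rng.choice([-1.0, 1.0], size=(n, d))
y = X @ w_star + 0.1 * rng.standard_normal(n)
w_hat = l1_rda(X, y, lam=5.0, eta=1.0 / n)
print("largest coordinates of w_hat:", np.argsort(-np.abs(w_hat))[:k])  # ideally the true support
print("true support:", np.nonzero(w_star)[0])

The large λ plays the same role as on the slide: the bigger the threshold λ√n, the fewer coordinates ever leave zero, so the less state an approximate (sketched) version of θ has to resolve accurately.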

Discussion

Summary:

Upper and lower bounds on memory-constrained regression

Lower bound: extend the data-processing inequality to handle covariates

Upper bound: use an ℓ₁ regularizer to reduce to sketching

Future work:

Close the gap (kd/(bε) vs. kd/(bε²))

Weaken the assumptions in the upper bound