Learning with Memory and Communication Constraints
Jacob Steinhardt*
Stanford University
jsteinhardt@cs.stanford.edu
July 30, 2015
*with John Duchi, Gregory Valiant, and Stefan Wager
Motivation
Computational constraints are becoming the bottleneck in many systems.
There is not yet a good theory of computationally bounded statistics.
Study the sample complexity of resource-constrained learning algorithms.
(Cover, 1969; Hellman & Cover, 1970; Ben-David & Dichterman, 1998; Balcan et al., 2012; Berthet & Rigollet, 2013; Chandrasekaran & Jordan, 2013; Duchi, Jordan, & Wainwright, 2013; Zhang et al., 2013; Zhang, Wainwright, & Jordan, 2014; Christiano, 2014; Daniely, Linial, & Shalev-Shwartz, 2014; Garg, Ma, & Nguyen, 2014; Shamir, 2014; Braverman et al., 2015; S. & Duchi, 2015; S., Valiant, & Wager, 2015)
This work: memory, communication.
1 Memory, Communication, and Statistical Queries
2 Memory-Constrained Sparse Regression
Setting
Assume: a polynomial number of i.i.d. samples (x, ℓ(x)) ∈ X × {−1, +1}, with ℓ in some concept class F.
COM(b): each sample is held by a separate party; each party can interactively broadcast up to b bits.
COM(b, k): each party gets k samples (instead of 1).
MEM(b): access the data in a stream, storing at most b bits of state.
Relate both classes to the well-studied statistical query model:
SQ: can query E[ψ(x, ℓ(x))] for any function ψ : X × {±1} → [−1, 1]; the answer is accurate to tolerance τ = 1/poly(n).
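To make the SQ oracle concrete, here is a minimal Python sketch of an oracle answering queries from i.i.d. samples; the function names and the way the tolerance is modeled are illustrative assumptions, not part of the talk.

```python
import random

def make_sq_oracle(samples, tau):
    """Return an SQ oracle over (x, label) pairs.

    Answers E[psi(x, l(x))] up to a perturbation of magnitude at most tau,
    mimicking the tolerance guarantee of the statistical query model.
    """
    def oracle(psi):
        # Empirical estimate of E[psi(x, l(x))] over the samples.
        avg = sum(psi(x, y) for x, y in samples) / len(samples)
        # The model only promises accuracy to tolerance tau; model that
        # here with a bounded random perturbation.
        return avg + random.uniform(-tau, tau)
    return oracle

# Example: correlation of coordinate 0 with the label.
samples = [((1, -1), 1), ((-1, 1), -1), ((1, 1), 1)]
oracle = make_sq_oracle(samples, tau=0.05)
print(oracle(lambda x, y: x[0] * y))
```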
Main Results: Communication
Theorem. If F is learnable with m samples and b bits of communication, then it is learnable with O(bm) statistical queries of tolerance τ = Ω(1/(2^b m)).
Implications of the theorem:
For any constant C > 0, COM(1) = COM(C log(n)) = SQ.
Let PARITY(n) be the problem where x ∼ Uniform({0,1}^n) and ℓ(x) = (−1)^(c⊤x) for an unknown c ∈ {0,1}^n.
Then PARITY(n) ∉ COM(n/4).
In addition, PARITY(n) ∉ COM(n/16, n/4).
Open Problem. Can PARITY(n) be solved with n²/4 bits of memory?
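For context on why PARITY(n) is easy without resource constraints, here is a Python sketch of the standard unconstrained learner, Gauss-Jordan elimination over GF(2); it is illustrative background rather than one of the talk's constrained algorithms, and it uses on the order of n² bits of working state, which is exactly the regime the open problem asks about.

```python
def learn_parity(samples, n):
    """Recover c in {0,1}^n from examples (x, (-1)^(c.x)) by Gauss-Jordan
    elimination over GF(2). Needs n linearly independent x's and roughly
    n^2 bits of working state."""
    # Augmented rows: x followed by its parity bit (0 if label +1, else 1).
    rows = [list(x) + [0 if y == 1 else 1] for x, y in samples]
    pivot_of = {}  # pivot column -> row index
    for i in range(len(rows)):
        row = rows[i]
        for col in range(n):
            if row[col] == 1:
                # Columns of earlier pivots are already zero in this row,
                # so the first 1 we see is a fresh pivot column.
                pivot_of[col] = i
                for j in range(len(rows)):
                    if j != i and rows[j][col] == 1:
                        rows[j] = [a ^ b for a, b in zip(rows[j], row)]
                break
    if len(pivot_of) < n:
        return None  # not enough linearly independent examples
    return [rows[pivot_of[col]][n] for col in range(n)]

# Example with c = (1, 0, 1):
samples = [((1, 0, 0), -1), ((0, 1, 0), 1), ((0, 0, 1), -1)]
print(learn_parity(samples, 3))  # -> [1, 0, 1]
```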
Main Results: Memory
Theorem. If F can be learned with m statistical queries of tolerance τ, then it can be learned with
O(log|F| · log(m/τ)) bits of state and
O(m log|F| / τ²) samples.
Caveat: the reduction is not computationally efficient.
Implications of the theorem:
Let REP be the class of efficiently representable problems: log|F| = O(n).
Then SQ ∩ REP ⊆ MEM(O(n)).
k-sparse linear regression in d dimensions can be solved with k · polylog(d) bits of state and d · poly(k) samples.
If the covariates are r-sparse, then only poly(r, k) samples are needed.
Reduction: Communication
Goal: reduce a communication-constrained algorithm to an SQ algorithm.
Idea: use statistical queries to estimate the probability that the next bit communicated is 0 or 1.
Consider an intermediate state of the algorithm:
[Figure: parties 1-5 broadcasting bits in turn; the transcript so far is c1 c2 = 10, and the next bit c3 has not yet been sent.]
p(c3 = 1 | c1:2 = 10) = p(c1:3 = 101) / p(c1:2 = 10) = E[I[c1:3 = 101]] / p(c1:2 = 10), where the numerator E[I[c1:3 = 101]] is a statistical query.
Error: τ / p(c1:2).
E[τ / p(c1:2)] = τ · ∑_c p(c) · (1/p(c)) = 4τ (in general, 2^b τ).
Cumulative error: m 2^b τ. ⟹ Okay as long as τ ≪ 1/(m 2^b)!
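A schematic Python sketch of this simulation (an illustrative reconstruction, not the talk's exact construction): each transcript bit is sampled from a conditional probability estimated with a statistical query, reusing the make_sq_oracle sketch from the Setting slide. For simplicity, each bit is assumed to depend only on the speaking party's own sample and the public prefix, so the conditional probability is itself a single query; the slide's ratio of two queries handles the general interactive case.

```python
import random

def simulate_transcript(oracle, next_bit_fn, total_bits, rng=random):
    """Simulate a communication transcript using only statistical queries.

    next_bit_fn(x, y, prefix) -> 0/1 is an illustrative stand-in for the
    protocol of the party speaking next, given its sample (x, y) and the
    transcript prefix. For each position we ask one SQ for
    p(next bit = 1 | prefix) and then sample the bit; with tolerance tau,
    the accumulated error over m samples and 2^b transcripts is on the
    order of m * 2^b * tau, as on the slide.
    """
    prefix = []
    for _ in range(total_bits):
        # SQ over the speaking party's sample: probability that it sends 1.
        psi = lambda x, y, p=tuple(prefix): float(next_bit_fn(x, y, p))
        p_one = min(max(oracle(psi), 0.0), 1.0)  # clamp the noisy answer
        prefix.append(1 if rng.random() < p_one else 0)
    return prefix
```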
Reduction: Memory
Goal: represent an SQ algorithm in a memory-efficient way.
Step 1: replace value queries with threshold queries (i.e., "Is E[ψ] > t?").
The algorithm is now a decision tree of depth m:
[Figure: binary tree with root query ψ, children ψ0, ψ1, grandchildren ψ00, ψ01, ψ10, ψ11, and so on, branching on the 0/1 answers.]
Issue: naïvely remembering the position in the tree requires Θ(m) memory.
Can we somehow identify "important" queries?
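A small Python sketch of Step 1 (my illustration, with assumed function names): a value query can be emulated by about log2(1/τ) threshold queries via binary search, which is consistent with a log(m/τ)-type factor appearing in the memory theorem.

```python
import math

def estimate_with_thresholds(threshold_query, tau):
    """Emulate the value query "what is E[psi]?" using only threshold
    queries "Is E[psi] > t?", each answered to tolerance tau.

    Binary search over [-1, 1]; about log2(2/tau) threshold queries give
    E[psi] up to O(tau). Illustrative sketch, not the talk's construction.
    """
    lo, hi = -1.0, 1.0
    for _ in range(math.ceil(math.log2(2.0 / tau))):
        mid = (lo + hi) / 2
        if threshold_query(mid):  # noisy "Is E[psi] > mid?"
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```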
Idea: Normalizing Queries
Consider a threshold query (ψ, t) of tolerance τ:
SQ(ψ, t) = 1 if E[ψ] > t + τ; 0 if E[ψ] < t − τ; arbitrary otherwise.
[Figure: the interval [t − τ, t + τ] around t on the E[ψ] axis; answer 1 is allowed anywhere above t − τ, and answer 0 anywhere below t + τ.]
Call (ψ, t, τ) "good" if at least one of the answers 0, 1 narrows down F by a factor of 1/2.
Consider (ψ, t − τ/2, τ/2) and (ψ, t + τ/2, τ/2). At least one must be good: split F according to whether E[ψ] < t or E[ψ] ≥ t; whichever half has at most |F|/2 concepts is contained in the set consistent with answer 0 of the first query or answer 1 of the second, respectively.
So we can always normalize queries to be good!
Compression Scheme
Normalize all queries to be good.
At each node, color the child edge whose answer reduces F by at least a factor of 1/2:
[Figure: the decision tree again, with one colored child edge at every node.]
Note: any root-to-leaf path has at most log|F| colored edges.
Can remember the indices of the colored edges with log|F| · log(m) bits of memory!
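A hedged Python sketch of this bookkeeping (the names and the replay interface are my assumptions): the position in the tree is encoded by the depths at which the path took the colored edge; every other step is forced, so at most log|F| indices of log(m) bits each are stored. (Deciding which edge is colored requires tracking the consistent subset of F, one reason the reduction is not computationally efficient.)

```python
def compress_path(answers, colored_edge_at):
    """Encode a root-to-node path by the depths where it followed the
    colored (set-halving) edge. `colored_edge_at(prefix)` is an assumed
    helper that re-runs the normalized SQ algorithm along the answer
    prefix and reports which child edge of the current node is colored."""
    depths, prefix = [], []
    for i, a in enumerate(answers):
        if a == colored_edge_at(tuple(prefix)):
            depths.append(i)  # at most log|F| of these on any path
        prefix.append(a)
    return depths

def decompress_path(depths, colored_edge_at, depth):
    """Reconstruct the full answer sequence from the stored depths:
    uncolored steps are forced to take the other edge."""
    colored, prefix = set(depths), []
    for i in range(depth):
        c = colored_edge_at(tuple(prefix))
        prefix.append(c if i in colored else 1 - c)
    return prefix
```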
Summary
COM → SQ: simulate the conditional probabilities of messages with statistical queries.
SQ → MEM: normalize queries, store a compressed representation of the decision path.
Next: study sparse regression in more detail.
1 Memory, Communication, and Statistical Queries
2 Memory-Constrained Sparse Regression
Setting
Sparse linear regression in R^d:
Y^(i) = ⟨w*, X^(i)⟩ + ε^(i)
‖w*‖_0 = k,  k ≪ d
Memory constraint:
(X^(i), Y^(i)) observed as a read-only stream
Only b bits of state Z^(i) are kept between successive observations
Problem Statement
How much data n is needed to obtain an estimator ŵ with E[‖ŵ − w*‖₂²] ≤ ε?
Classical case (no memory constraint):
Theorem (Wainwright, 2009). (k/ε) log(d) ≲ n ≲ (k/ε) log(d)
Achievable with O(d) memory (Agarwal et al., 2012; S., Wager, & Liang, 2015).
With a memory constraint of b bits:
Theorem (S. & Duchi, 2015). (k/ε) · (d/b) ≲ n ≲ (k/ε²) · (d/b)
Exponential increase if b ≪ d!
[Note: up to log factors; assumes k log(d) ≪ b ≤ d]
Proof Overview
Lower bound:
information-theoretic
strong data-processing inequality
[Figure: Markov chain W* → (X, Y) → Z.]
main challenge: dependence between X and Y
Upper bound:
count-min sketch + ℓ1-regularized dual averaging
more regularization → easier sketching problem
Lower Bound Construction
Split the coordinates into k blocks of size d/k.
w* in each block: a single non-zero coordinate J, equal to ±δ with equal probability.
Direct sum argument: reduce to k = 1.
[Figure: one block of d/k coordinates, with the non-zero coordinate at J = 2.]
Estimation to testing:
E[‖w* − ŵ‖₂²] ≥ (δ²/2) P[Ĵ ≠ J]
Looking ahead: bound the KL divergence between P_j and a base distribution P_0.
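A short check of the estimation-to-testing step, under the natural (assumed, not stated on the slide) choice Ĵ = argmax_j |ŵ_j|: if Ĵ ≠ J, then |ŵ_J| ≤ |ŵ_Ĵ| and the squared error is at least δ²/2.

```latex
% If \hat{J} \neq J, write v = |\hat{w}_J| \le |\hat{w}_{\hat{J}}|. Then
\|w^* - \hat{w}\|_2^2
  \;\ge\; (|w^*_J| - |\hat{w}_J|)^2 + \hat{w}_{\hat{J}}^2
  \;\ge\; (\delta - v)^2 + v^2
  \;\ge\; \tfrac{\delta^2}{2},
% minimized at v = \delta/2. Taking expectations over the event
% \{\hat{J} \neq J\} gives
% E[\|w^* - \hat{w}\|_2^2] \ge \tfrac{\delta^2}{2}\, P[\hat{J} \neq J].
```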
Some Information Theory
Let X ∼ Uniform({±1}^d).
Let P_j(Z^(1:n)) be the distribution conditioned on J = j.
Let P_0(Z^(1:n)) be the distribution with Y independent of X.
Assouad's method:
P[Ĵ ≠ J] ≥ 1/2 − sqrt( (1/d) ∑_{j=1}^d D_kl( P_0(Z^(1:n)) ‖ P_j(Z^(1:n)) ) )
[Figure: X_j takes values −1 and +1; the corresponding shift is 2δ.]
Key fact: (Y, X_j) is independent of X_¬j under P_j.
Intuition: D_kl(P_0 ‖ P_j) is small unless Z stores information about X_j; one needs to store information about a majority of the X_j to make the average D_kl large.
Strong Data-Processing Inequality
Focus on a single index Z = Z^(i), with the past states z = z^(1:i−1) fixed.
Proposition. For any z,
D_kl( P_0(Z | z) ‖ P_j(Z | z) ) ≤ 4δ² · I(X_j; Z | Y, z)   [a mutual information]
                                ≤ 4δ² · I(X_j; Z, Y | z)
Plug into Assouad:
(1/d) ∑_{j=1}^d D_kl(P_0 ‖ P_j) ≤ (4δ²/d) ∑_{j=1}^d I(X_j; Z, Y | Z^(1:i−1))
                                ≤ (4δ²/d) · I(X; Z, Y | Z^(1:i−1)), and I(X; Z, Y | Z^(1:i−1)) ≤ b + O(1).
Only get 4δ²b/d bits per round!
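A heuristic back-of-the-envelope (my reconstruction, not from the slides) for how the per-round bound turns into the sample-complexity lower bound:

```latex
% Summing the per-round bound over n rounds, Assouad's method only fails to
% identify J once the accumulated KL is of constant order:
n \cdot \frac{4\delta^2 b}{d} \gtrsim 1
  \quad\Longrightarrow\quad n \gtrsim \frac{d}{\delta^2 b}.
% With k independent blocks and target error \epsilon \asymp k\,\delta^2,
% this gives n \gtrsim \frac{k}{\epsilon}\cdot\frac{d}{b}, matching the
% lower bound on the Problem Statement slide.
```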
Upper Bound
Solve the ℓ1-regularized dual averaging problem (Xiao, 2010) with λ ≫ 1:
w^(i) = argmin_w { ⟨θ^(i), w⟩ + λ√n ‖w‖₁ + (1/(2η)) ‖w‖₂² },
θ^(i) = ∑_{i′=1}^{i−1} x^(i′) ( y^(i′) − ⟨w^(i′), x^(i′)⟩ ).
Hard part: determining the support of w^(i).
Need to distinguish |θ_j| ≥ λ√n (signal) from |θ_j| ≈ √n (noise).
Can use a count-min sketch; memory usage ≈ d log(d) / λ².
⟹ regularization decreases computation; seen before in the ℓ2 case (Shalev-Shwartz & Zhang, 2013; Bruer et al., 2014)
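A minimal Python sketch of the sketching side (illustrative assumptions: the width/depth choices, a count-median-style estimate since updates are signed, and the closed-form dual-averaging weight are my reconstructions rather than the talk's exact algorithm):

```python
import hashlib

class CountMinSketch:
    """Small sketch for real-valued updates theta[j] += v.

    Because updates can be negative, the estimate takes a median across
    rows (a count-median-style variant) instead of the classical min.
    Coordinates with |theta_j| >= lam*sqrt(n) stand out above the sketch's
    additive error; O(sqrt(n))-sized noise coordinates do not.
    """
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0.0] * width for _ in range(depth)]

    def _index(self, row, j):
        h = hashlib.blake2b(f"{row}:{j}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.width

    def update(self, j, value):
        for row in range(self.depth):
            self.table[row][self._index(row, j)] += value

    def estimate(self, j):
        vals = sorted(self.table[row][self._index(row, j)]
                      for row in range(self.depth))
        return vals[len(vals) // 2]

def rda_weight(theta_j, lam, n, eta):
    """Closed-form coordinate solution of the dual-averaging objective:
    w_j = -eta * sign(theta_j) * max(|theta_j| - lam*sqrt(n), 0)."""
    shrunk = max(abs(theta_j) - lam * n ** 0.5, 0.0)
    return -eta * (1 if theta_j > 0 else -1) * shrunk

# Example: coordinate 7 of theta carries signal, the rest are small noise.
sketch = CountMinSketch(width=512, depth=5)
sketch.update(7, 150.0)
for j in range(100, 200):
    sketch.update(j, 0.3)
print(sketch.estimate(7), sketch.estimate(150))
print(rda_weight(sketch.estimate(7), lam=10.0, n=100, eta=0.01))
```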
Discussion
Summary:
Upper and lower bounds on memory-constrained regression
Lower bound: extend the data-processing inequality to handle covariates
Upper bound: use an ℓ1 regularizer to reduce to sketching
Future work:
Close the gap (kd/(bε) vs kd/(bε²))
Weaken the upper bound assumptions