
Page 1:

Sublinear-time Algorithms for Machine Learning

Ken Clarkson (IBM Almaden), Elad Hazan (Technion), David Woodruff (IBM Almaden)

Page 2:

Linear Classification

[Figure: linearly separable points with a separating hyperplane and margin ε]

Page 3:

Linear Classification

n vectors in d dimensions: A_1, …, A_n ∈ R^d

Can assume the norms of the A_i are bounded by 1

Labels y_1, …, y_n ∈ {-1, +1}

Find a vector x such that:

∀ i ∈ [n]: sign(A_i · x) = y_i
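Read as code, the goal is a vector x that gets every sign right; a minimal NumPy sketch (the function name and the margin parameter eps are illustrative, not from the slides):

```python
import numpy as np

def separates(A, y, x, eps=0.0):
    """Check whether hyperplane x classifies every row A[i] correctly,
    i.e. y[i] * (A[i] . x) > eps for all i (eps > 0 enforces a margin)."""
    margins = y * (A @ x)          # signed margins y_i * <A_i, x>
    return bool(np.all(margins > eps))
```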

Page 4:

The Perceptron Algorithm

Page 5:

The Perceptron Algorithm

[Rosenblatt 1957, Novikoff 1962, Minsky & Papert 1969]

Iteratively:

1. Find a vector A_i for which sign(A_i · x) ≠ y_i

2. Add A_i to x: x ← x + A_i

Note: can assume all y_i = +1 by multiplying A_i by y_i
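A minimal NumPy sketch of this loop, assuming (per the note) that each A_i has already been multiplied by its label, so a separator must satisfy A_i · x > 0 for all i:

```python
import numpy as np

def perceptron(A, max_iters=10_000):
    """Classic perceptron. Rows of A are assumed pre-multiplied by their
    labels, so a separator x must satisfy A[i] . x > 0 for all i."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(max_iters):
        scores = A @ x
        bad = np.flatnonzero(scores <= 0)   # misclassified examples
        if bad.size == 0:
            return x                        # all points correctly classified
        x += A[bad[0]]                      # add a violating example to x
    return x                                # may not have converged
```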

Page 6:

Thm [Novikoff 1962]: converges in 1/ε² iterations

Proof:

Let x* be the optimal hyperplane, for which A_i · x* ≥ ε for all i.
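In LaTeX, the standard Novikoff argument:

```latex
% Novikoff mistake bound. On each mistake, A_i \cdot x_t \le 0,
% \|A_i\| \le 1, and \|x^*\| = 1.
\begin{align*}
  x_{t+1} \cdot x^* &= x_t \cdot x^* + A_i \cdot x^* \ge x_t \cdot x^* + \varepsilon
    &&\Rightarrow\; x_T \cdot x^* \ge \varepsilon T \\
  \|x_{t+1}\|^2 &= \|x_t\|^2 + 2\,A_i \cdot x_t + \|A_i\|^2 \le \|x_t\|^2 + 1
    &&\Rightarrow\; \|x_T\| \le \sqrt{T} \\
  \varepsilon T &\le x_T \cdot x^* \le \|x_T\| \le \sqrt{T}
    &&\Rightarrow\; T \le 1/\varepsilon^2
\end{align*}
```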

Page 7:

For n vectors in d dimensions:

1/ε² iterations

Each iteration takes n × d time, so total time: O(nd/ε²)

New algorithm: Õ((n + d)/ε²), sublinear time with high probability, a leading-order-term improvement

(in running times, poly-log factors are omitted)

Page 8:

Why is it surprising?

[Figure: ε-margin instance]

Page 9:

More results

- O(n/ε² + d/ε) time algorithm for minimum enclosing ball (MEB), assuming the norms of the input points are known

- Sublinear-time kernel versions, e.g., the polynomial kernel of degree q

- Poly-log space / low-pass / sublinear-time algorithms for these problems

All running times are tight up to polylog factors (we give information-theoretic lower bounds)

Page 10:

Talk outline

- primal-dual optimization in learning

- ℓ₂ sampling

- MEB

- Kernels

Page 11:

A Primal-dual Perceptron

η = Θ̃(ε)

Iteratively:

1. Primal player supplies hyperplane x_t

2. Dual player supplies distribution p_t

3. Updates:

x_{t+1} = x_t + η Σ_i p_t(i) A_i

p_{t+1}(i) = p_t(i) · e^{−η A_i · x_t}
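A minimal NumPy sketch of this exact primal-dual loop (the step size eta = eps and the unit-ball projection are assumptions in the spirit of η = Θ̃(ε) above):

```python
import numpy as np

def primal_dual_perceptron(A, eps, T):
    """Primal-dual perceptron: OGD on x against multiplicative weights on p.
    Rows of A are assumed label-multiplied and of norm <= 1."""
    n, d = A.shape
    eta = eps                          # step size, Theta~(eps) per the slides
    x = np.zeros(d)
    p = np.full(n, 1.0 / n)            # uniform initial distribution over examples
    x_sum = np.zeros(d)
    for _ in range(T):
        x_sum += x
        margins = A @ x                         # A_i . x_t for all i
        x = x + eta * (A.T @ p)                 # primal (OGD) step toward weighted examples
        x /= max(1.0, np.linalg.norm(x))        # project back onto the unit ball B
        p = p * np.exp(-eta * margins)          # dual (MW) step toward violated examples
        p /= p.sum()
    return x_sum / T                   # averaged iterate, as in the reduction
```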

Page 12:

The Primal-dual Perceptron

[Figure: distribution over examples]

Page 13:

Optimization via learning

[Diagram: reduction between a repeated game and an offline optimization problem]

Player 1: low-regret algorithm

Player 2: low-regret algorithm

Converges to the min-max solution.

Classification problem:

max_{x∈B} min_{i∈[n]} A_i · x = max_{x∈B} min_{p∈Δ} Σ_i p(i) A_i · x = min_{p∈Δ} max_{x∈B} Σ_i p(i) A_i · x

Low-regret algorithm = after many game iterations, the average payoff converges to the best payoff attainable in hindsight by a fixed strategy

Page 14:

Thm: the number of iterations to converge to an ε-approximate solution is bounded by the smallest T for which Regret₁ + Regret₂ ≤ εT/2.

Total time = # iterations × time-per-iteration

Advantages:

- Generic optimization

- Easy to apply randomization

Player 1 regret: Tε ≤ max_{x∈B} Σ_t p_tᵀ A x ≤ Σ_t p_tᵀ A x_t + Regret₁

Player 2 regret: Σ_t p_tᵀ A x_t ≤ min_{p∈Δ} Σ_t pᵀ A x_t + Regret₂

So, min_{i∈[n]} Σ_t A_i · x_t ≥ Tε − Regret₁ − Regret₂

Output x̄ = (1/T) Σ_t x_t

Page 15:

A Primal-dual Perceptron

Iteratively:

1. Primal player supplies hyperplane x_t

2. Dual player supplies distribution p_t

3. Updates:

x_{t+1} = x_t + η Σ_i p_t(i) A_i

p_{t+1}(i) = p_t(i) · e^{−η A_i · x_t}

# iterations via regret of OGD/MW: T = Õ(1/ε²)

Page 16:

A Primal-dual Perceptron

Total time?

Speed up via randomization:

1. Sufficient to look at one example per iteration

2. Sufficient to obtain crude estimates of inner products

Page 17:

ℓ₂ sampling

Consider two vectors u, v on the d-dimensional sphere

- Sample coordinate i w.p. v_i²

- Return u_i / v_i

Notice that:

- The expectation is correct: E[u_i / v_i] = Σ_i v_i² (u_i / v_i) = u · v

- Variance at most one (magnitude can be d)

- Time: O(d)
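A minimal NumPy sketch of the estimator (for unit vectors u, v):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_sample_inner(u, v):
    """Unbiased estimate of <u, v> for unit vectors: sample coordinate i
    with probability v_i^2, return u_i / v_i.
    Expectation: sum_i v_i^2 * (u_i / v_i) = <u, v>."""
    i = rng.choice(len(v), p=v * v)   # v is a unit vector, so v*v sums to 1
    return u[i] / v[i]

# Averaging independent samples concentrates around the true inner product:
d = 1000
u = rng.normal(size=d); u /= np.linalg.norm(u)
v = rng.normal(size=d); v /= np.linalg.norm(v)
est = np.mean([l2_sample_inner(u, v) for _ in range(10_000)])
print(est, u @ v)   # the two values should be close
```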

Page 18:

The Primal-Dual Perceptron

Iteratively:

1. Primal player supplies hyperplane x_t, ℓ₂-sample from x_t

2. Dual player supplies distribution p_t, samples index i_t from it

3. Updates:

x_{t+1} = x_t + η A_{i_t}

exact rule: p_{t+1}(i) = p_t(i) · e^{−η A_i · x_t}

sampled surrogate: p_{t+1}(i) ← p_t(i) · (1 − η ℓ₂-sample(A_i · x_t) + η² ℓ₂-sample(A_i · x_t)²)

Important: preprocess x_t only once for all estimates

Running time: O((n + d)/ε²) (see the sketch below)
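Combining the two speed-ups, a rough NumPy sketch of a single sampled iteration (a simplification: the paper's clipping and variance-control details are omitted, and eta is an assumed step size):

```python
import numpy as np

rng = np.random.default_rng(1)

def sublinear_perceptron_step(A, x, p, eta):
    """One sampled primal-dual step: O(d) work for the primal update plus
    O(n) work for the dual update, instead of O(n*d) for exact updates."""
    n, d = A.shape
    # Dual player samples one example i_t ~ p_t; primal update touches one row.
    i_t = rng.choice(n, p=p)
    x_new = x + eta * A[i_t]
    x_new /= max(1.0, np.linalg.norm(x_new))      # stay in the unit ball
    # Primal player prepares ONE l2-sample index from x_t, reused for all n rows.
    xx = x @ x
    if xx == 0.0:                                 # degenerate start: estimates are 0
        est = np.zeros(n)
    else:
        j = rng.choice(d, p=x * x / xx)           # sample j with prob x_j^2 / ||x||^2
        est = A[:, j] * xx / x[j]                 # estimates A_i . x_t for every i
    p_new = p * (1 - eta * est + eta**2 * est**2) # variance-aware MW surrogate
    return x_new, p_new / p_new.sum()
```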

Page 19:

Analysis

Some difficulties:

- Non-trivial regret analysis due to sampling

- Need a new multiplicative-update analysis for bounded variance

- The analysis shows a good solution with constant probability

- Need a way to verify a solution to get high probability

Page 20:

Streaming Implementation

- See rows one at a time

- Can't afford to store x_t or p_t

- Want few passes, poly(log n / ε) space, and sublinear time

- Want to output a succinct representation of the hyperplane: a list of 1/ε² row indices

- In the t-th iteration, when ℓ₂-sampling from x_t, use the same index j_t for all n rows

- Store the samples i_1, …, i_T of rows chosen by the dual player, and j_1, …, j_T of ℓ₂-sampling indices of the primal player

- Sample in a stream using known algorithms (see the sketch below)
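One standard one-pass way to ℓ₂-sample a coordinate is weighted reservoir sampling; a minimal sketch (this particular routine illustrates the "known algorithms" bullet and is not necessarily the paper's choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def streaming_l2_sample(stream):
    """One-pass l2-sample: return index j with probability x_j^2 / ||x||^2,
    keeping only O(1) state (weighted reservoir sampling)."""
    total, chosen, chosen_val = 0.0, None, 0.0
    for j, x_j in enumerate(stream):
        w = x_j * x_j
        total += w
        if total > 0 and rng.random() < w / total:   # replace reservoir w.p. w/total
            chosen, chosen_val = j, x_j
    return chosen, chosen_val
```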

Page 21:

Lower Bound

- Consider an n × d matrix

- The first 1/ε² rows each contain a random position equal to ε; all other entries are 0

- Each of the remaining n − 1/ε² rows is a copy of a random row among the first 1/ε²

- With probability ½, choose a random row and replace its value ε by −ε; with probability ½, do nothing

- Deciding which case you're in requires reading Ω((n + d)/ε²) entries (see the sketch below)
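For concreteness, a minimal generator for this hard instance (a sketch; the parameter handling is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def hard_instance(n, d, eps):
    """Lower-bound instance: k = 1/eps^2 'pattern' rows, each with a single
    entry eps in a random column; remaining rows copy random pattern rows.
    With prob 1/2 one random row has its eps flipped to -eps. Assumes n >= k."""
    k = int(1 / eps**2)
    A = np.zeros((n, d))
    cols = rng.integers(d, size=k)
    A[np.arange(k), cols] = eps                  # the k pattern rows
    A[k:] = A[rng.integers(k, size=n - k)]       # copies of random pattern rows
    flipped = rng.random() < 0.5
    if flipped:
        A[rng.integers(n)] *= -1.0               # replace eps by -eps in one row
    return A, flipped
```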

Page 22:

MEB (minimum enclosing ball)

Page 23:

A Primal-dual algorithm

Iteratively:

1. Primal player supplies point x_t

2. Dual player supplies distribution p_t

3. Updates:

x_{t+1} = x_t + η Σ_i p_t(i) (A_i − x_t)

p_{t+1}(i) = p_t(i) · e^{η ‖A_i − x_t‖²}

# iterations via regret of OGD/MW: T = Õ(1/ε²)
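A minimal NumPy sketch of these exact MEB updates (eta = eps and the averaged output follow the general reduction; both are assumptions here):

```python
import numpy as np

def primal_dual_meb(A, eps, T):
    """Primal-dual MEB: x_t tracks the ball center, p_t a distribution
    concentrating on far-away points. Point norms are assumed <= 1."""
    n, d = A.shape
    eta = eps
    x = A.mean(axis=0)                             # any reasonable starting center
    p = np.full(n, 1.0 / n)
    centers = []
    for _ in range(T):
        centers.append(x.copy())
        diffs = A - x                              # A_i - x_t
        sq = np.einsum('ij,ij->i', diffs, diffs)   # ||A_i - x_t||^2
        x = x + eta * (p @ diffs)                  # pull center toward weighted points
        p = p * np.exp(eta * sq)                   # upweight far-away points
        p /= p.sum()
    return np.mean(centers, axis=0)                # averaged iterate
```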

Page 24:

ℓ₂-sampling speed-up

Iteratively:

1. Primal player supplies point x_t

2. Dual player supplies distribution p_t, samples index i_t from it

3. Updates:

x_{t+1} = x_t + η A_{i_t}

p_{t+1}(i) = p_t(i) · (1 + η ℓ₂-sample(‖A_i − x_t‖²) + η² ℓ₂-sample(‖A_i − x_t‖²)²)

# iterations via regret of OGD/MW: T = Õ(1/ε²)

Page 25:

Regret speed-up

Updates:

with probability ε: x_{t+1} = x_t + η A_{i_t}

p_{t+1}(i) = p_t(i) · (1 + η ℓ₂-sample(‖A_i − x_t‖²) + η² ℓ₂-sample(‖A_i − x_t‖²)²)

# iterations remains Õ(1/ε²)

But only in an ε-fraction of the iterations do we have to do O(d) work, though in all iterations we do O(n) work

O(n/ε² + d/ε) total time

Page 26:

Kernels

Page 27:

Kernels

Map the input to a higher-dimensional space via a non-linear mapping Φ, e.g. a polynomial map (for degree q, Φ(x) · Φ(y) = (xᵀy)^q).

Classification via a linear classifier in the new space.

Efficient classification and optimization if inner products can be computed efficiently (the "kernel function").

Page 28:

The Primal-Dual Perceptron

Iteratively:

1. Primal player supplies hyperplane x_t, ℓ₂-sample from x_t

2. Dual player supplies distribution p_t, samples index i_t from it

3. Updates: x_{t+1} ← x_t + η A_{i_t}

Page 29:

The Primal-Dual Kernel Perceptron

Iteratively:

1. Primal player supplies hyperplane x_t, ℓ₂-sample from x_t

2. Dual player supplies distribution p_t, samples index i_t from it

3. Updates: x_{t+1} ← x_t + η Φ(A_{i_t})

Page 30:

ℓ₂ sampling for kernels

Polynomial kernel: Φ(x) · Φ(y) = (xᵀy)^q

Kernel ℓ₂-sample = product of q independent ℓ₂-samples of xᵀy

Running time increases by a factor of q

Can also use a Taylor expansion to handle, say, Gaussian kernels
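A minimal sketch: each ℓ₂-sample is unbiased for xᵀy, and the q samples are independent, so their product is unbiased for (xᵀy)^q (the estimator is redefined here so the block is self-contained):

```python
import numpy as np

rng = np.random.default_rng(4)

def l2_sample_inner(u, v):
    """Unbiased estimate of <u, v> for unit vectors: sample i w.p. v_i^2,
    return u_i / v_i."""
    i = rng.choice(len(v), p=v * v)
    return u[i] / v[i]

def kernel_l2_sample(u, v, q):
    """Unbiased estimate of the degree-q polynomial kernel (u . v)^q:
    the product of q independent l2-samples of u . v."""
    return np.prod([l2_sample_inner(u, v) for _ in range(q)])
```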
