
Page 1

Para-active learning

Alekh Agarwal
Microsoft Research

Joint work with Léon Bottou, Miroslav Dudík and John Langford

Agarwal, Bottou, Dudík, Langford · Para-active learning

Page 2

Motivation

Many existing distributed learning approaches

- Parallelize existing algorithms (e.g. distributed optimization)
- Variants of existing algorithms (e.g. distributed mini-batches)
- Bagging, model averaging, . . .

Model/gradients cheaply communicated, meaningfully averaged

Limited use of the statistical problem structure (beyond i.i.d.)


Page 5

Peculiarities of models

Models not always parsimoniously described

- Kernel methods: the model/gradient cannot be described without the training data
- High-dimensional/non-parametric models

Models not always meaningfully averaged

- Matrix factorization: M = UV = (−U)(−V), so factors cannot be averaged naively
- More generally, non-convex models: neural networks, mixture models


Page 7

Data, data everywhere, but . . .

Not all data points are equally informative

A small number of support vectors specifies the SVM solution


Page 9

Active learning

Active learning identifies informative examples

Similar idea to support vectors, but works more generally

Efficient algorithms (and heuristics) for typical hypothesis classes

Examples

- Query x with probability g(|h(x)|)
- Query x based on similarity with previously queried samples
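The first rule can be sketched in a few lines. This is an illustrative toy, not the talk's implementation: the exponential decay g and its slope are assumptions, and h is any real-valued scorer whose magnitude measures confidence.

```python
import math
import random

def g(margin, slope=4.0):
    # Map a confidence margin |h(x)| to a query probability: points near
    # the decision boundary (small margin) are queried most often.
    # The exponential form and the slope are illustrative choices.
    return math.exp(-slope * margin)

def maybe_query(h, x):
    # Query the label of x with probability g(|h(x)|).
    return random.random() < g(abs(h(x)))

# Toy scorer with its boundary at x = 2: boundary points are always
# queried, confidently classified points almost never are.
h = lambda x: 0.5 * x - 1.0
near = sum(maybe_query(h, 2.0) for _ in range(1000))   # margin 0
far = sum(maybe_query(h, 10.0) for _ in range(1000))   # margin 4
```

With this g, a margin-0 point is queried every time while a margin-4 point is queried with probability e^(-16), so almost all labeling effort concentrates near the boundary.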


Page 11

Para-active learning

Sift for informative examples in parallel

Update model on selected examples


Page 14

Synchronous para-active learning

Input: initial hypothesis h_1, batch size B, active sifter A, passive updater P

For rounds t = 1, 2, . . . , T:
  For all nodes i = 1, 2, . . . , k in parallel:
    - Work on a local dataset of size B/k
    - A creates a subsampled dataset
  Collect the subsampled datasets from each node
  Update h_{t+1} by running the passive updater P on the collected data

Example

- h_t is a kernel SVM trained on the examples selected so far
- A samples based on g(|h_t(x)|) at round t
- P computes h_{t+1} from h_t using an online kernel SVM

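One synchronous round can be sketched as follows. Everything here is a hypothetical stand-in: a 1-d threshold hypothesis h(x) = x − θ, an exponential query rule for A, and a stochastic nudge for P; a real system would sift the k shards in parallel rather than in a loop.

```python
import math
import random

def g(margin):
    # Hypothetical query rule for the sifter A.
    return math.exp(-margin)

def sift(theta, shard):
    # Sifter A on one node: h(x) = x - theta is a toy 1-d hypothesis;
    # keep x with probability g(|h(x)|).
    return [x for x in shard if random.random() < g(abs(x - theta))]

def update(theta, selected):
    # Passive updater P (toy): nudge the threshold toward each selected point.
    for x in selected:
        theta += 0.1 * (x - theta)
    return theta

def synchronous_round(theta, batch, k):
    # One round: split the batch of B examples across k nodes, sift each
    # shard with the current hypothesis (in parallel in a real system),
    # pool the selected examples, and run P on the pooled set.
    shards = [batch[i::k] for i in range(k)]
    selected = [x for shard in shards for x in sift(theta, shard)]
    return update(theta, selected)

theta = 0.0
for t in range(20):  # T = 20 rounds, B = 200, k = 4
    batch = [random.gauss(3.0, 1.0) for _ in range(200)]
    theta = synchronous_round(theta, batch, k=4)
```

Over the rounds θ drifts toward the data's center near 3, driven only by the sifted examples; the model update sees a small fraction of the batch.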

Page 15

Asynchronous para-active learning

Input: initial hypothesis h_1, batch size B, active sifter A, passive updater P

Initialize Q_S^i = ∅ for each node i
For all nodes i = 1, 2, . . . , k in parallel:
  While Q_S^i is not empty:
    - Fetch a selected example from Q_S^i
    - Update the hypothesis using P on this example
  If Q_F^i is non-empty:
    - Fetch a candidate example from Q_F^i
    - Use A to decide whether the example is selected
    - If selected, broadcast the example for addition to Q_S^j for all j

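The two-queue protocol can be sketched with threads standing in for nodes. The selection rule 1/(1 + margin) and the toy 1-d update are assumptions for illustration; the structure (drain Q_S with P first, then sift one candidate from Q_F with A, broadcasting selections to every Q_S) follows the slide.

```python
import queue
import random
import threading

K = 3
Q_F = [queue.Queue() for _ in range(K)]  # per-node fresh candidates
Q_S = [queue.Queue() for _ in range(K)]  # per-node selected examples
theta = [0.0] * K                        # per-node toy 1-d hypotheses

def broadcast(x):
    # A selected example is appended to Q_S on every node j.
    for q in Q_S:
        q.put(x)

def node(i, rounds=2000):
    for _ in range(rounds):
        # Drain Q_S first: the passive updater P consumes selected examples.
        while True:
            try:
                x = Q_S[i].get_nowait()
            except queue.Empty:
                break
            theta[i] += 0.1 * (x - theta[i])
        # Then the sifter A examines one fresh candidate, if any.
        try:
            x = Q_F[i].get_nowait()
        except queue.Empty:
            continue
        if random.random() < 1.0 / (1.0 + abs(x - theta[i])):
            broadcast(x)

for i in range(K):
    for _ in range(500):
        Q_F[i].put(random.gauss(3.0, 1.0))

threads = [threading.Thread(target=node, args=(i,)) for i in range(K)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Apply selections that arrived after a node stopped looping.
for i in range(K):
    while not Q_S[i].empty():
        theta[i] += 0.1 * (Q_S[i].get() - theta[i])
```

Because every selection is broadcast, all nodes converge on similar hypotheses even though no model parameters are ever exchanged, only the selected examples.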

Page 20

Computational complexity

- Training time for n examples: T(n)
- Evaluation time per example after training on n examples: S(n)
- Number of subsampled examples out of n: φ(n)
- Number of nodes: k

            Seq. Passive   Seq. Active            Para-active
Operations  T(n)           nS(φ(n)) + T(φ(n))     nS(φ(n)) + kT(φ(n))
Time        T(n)           nS(φ(n)) + T(φ(n))     nS(φ(n))/k + T(φ(n))
Broadcasts  0              0                      φ(n)

Example 1, kernel SVM:

T(n) ~ O(n²), S(n) ~ O(n)

Often φ(n) ≪ n

T(n) ≫ nS(φ(n)) ≫ nS(φ(n))/k

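Plugging the kernel-SVM costs into the table makes the gap concrete. The subsampling rate φ(n) = √n is purely an illustrative assumption:

```python
import math

def T(m):  # training time on m examples, O(m^2) for kernel SVM
    return m * m

def S(m):  # per-example evaluation time against m kept examples, O(m)
    return m

n, k = 10**6, 64
phi = math.isqrt(n)  # hypothetical subsampling: phi(n) = sqrt(n) = 1000

seq_passive = T(n)                      # 10^12 operations
seq_active = n * S(phi) + T(phi)        # 10^9 + 10^6
para_active = n * S(phi) // k + T(phi)  # ~1.6 * 10^7
```

Sifting (the nS(φ(n)) term) dominates the sequential active cost, and it is exactly the term the k nodes divide, so para-active retains nearly the full factor-k speedup over sequential active learning.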

Page 21

Computational complexity

Example 2, neural nets with backprop:

T(n) ~ O(nd), S(n) ~ O(d)

Often φ(n) ≪ n

T(n) ≈ nS(φ(n)) ≫ nS(φ(n))/k


Page 22

Communication complexity

Communication complexity equals the query complexity of active learning

Active learning analyses typically assume labels are queried and used immediately

We have a delay before the model is updated

Theorem: Delay of τ leads to query complexity at most τ + φ(n − τ)


Page 23

Experimental evaluation

Large version of MNIST (8.1M examples), obtained by elastic deformations of the original images

Two learning algorithms:

- Simulated kernel SVM: RBF kernel, LASVM algorithm
- Parallel neural nets: one hidden layer with 100 nodes

Active learning: select a point x with probability based on |f(x)|, for a fixed subsampling rate


Page 24

Kernel SVM simulation

Simulated synchronous para-active learning

Fixed batch size B, split into portions of size B/k

Sift each portion in turn, take largest sifting time

Update model with new examples, take training time

Used as an estimate of parallel computation time


Page 25

SVM simulation runtimes

Classifying {3, 1} vs {5, 7}

Running time vs test error

[Plot: running time (s) vs test error for the SVM; curves for passive seq., active seq., and para-active with 1, 4, 16, and 64 nodes]


Page 26

SVM simulated speedup over passive

Classifying {3, 1} vs {5, 7}

Speedup over sequential passive

[Plot: speedup over sequential passive vs number of nodes (log-log), at test-error levels 60, 30, 20, 14, and 12]


Page 27

SVM simulated speedup over delayed active

Classifying {3, 1} vs {5, 7}

Speedup over delayed active

[Plot: speedup over delayed active vs number of nodes (log-log), at test-error levels 60, 30, 20, 14, and 12]


Page 28

Parallel neural net results

Classifying 3 vs 5

Running time vs test error

[Plot: log running time (s) vs test error; curves for passive, active, and para-active with 2, 4, 8, and 16 nodes]


Page 29

Conclusions

General strategy for distributed learning

- Applicable to diverse hypothesis classes and algorithms
- Particularly appealing for non-parametric and/or non-convex models
- Theoretically justified, empirically promising

Future work:

- Real distributed implementation for kernel SVMs
- Other algorithms and datasets
- Better subsampling strategies

