Strata New York 2012

Online Learning: Bayesian bandits and more

©MapR Technologies - Confidential


TRANSCRIPT

Page 1

Online Learning Bayesian bandits and more

Page 3

Online

Scalable

Incremental

Page 4

Scalability and Learning

What does scalable mean?

What are inherent characteristics of scalable learning?

What are the logical implications?

Page 5

Scalable ≈ On-line

If you squint just right

Page 6

unit of work ≈ unit of time

Page 7

[Diagram: an infinite data stream feeds a learning process that maintains state.]

Page 8

Pick One

Page 9

Page 10

Page 11

Now pick again

Page 12

A Quick Diversion

You see a coin

– What is the probability of heads?

– Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again

I catch the coin and ask again

I look at the coin (and you don’t) and ask again

Why does the answer change?

– And did it ever have a single value?

Page 13

Which One to Play?

One may be better than the other

The better coin pays off at some rate

Playing the other will pay off at a lesser rate

– Playing the lesser coin has “opportunity cost”

But how do we know which is which?

– Explore versus Exploit!

Page 14

A First Conclusion

Probability as expressed by humans is subjective and depends on information and experience

Page 15

A Second Conclusion

A single number is a bad way to express uncertain knowledge

A distribution of values might be better
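One way to carry a distribution of values instead of a single number is a Beta prior over the coin's heads probability, updated as flips are observed. This is a minimal sketch of that idea, not code from the talk; the true heads rate of 0.7 and the flip count are made-up illustration values, and NumPy's `default_rng` API is assumed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Beta(1, 1) is a uniform prior: "I dunno" about the heads probability
heads, tails = 1, 1

# Observe 10 flips of a coin whose (hidden) true heads rate is 0.7
for flip in rng.random(10) < 0.7:
    if flip:
        heads += 1
    else:
        tails += 1

# The posterior Beta(heads, tails) summarizes our uncertain knowledge:
# its mean is heads / (heads + tails), and its spread says how much
# the true rate could plausibly still vary.
samples = rng.beta(heads, tails, size=100_000)
print(samples.mean(), samples.std())
```

Each flip tightens the posterior a little, which is exactly the "answer changes as information arrives" behavior of the coin diversion above.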

Page 16

I Dunno

Page 17

5 and 5

Page 18

2 and 10

Page 19

The Cynic Among Us

Page 20

Demo
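The demo itself is not captured in the transcript. The Bayesian-bandit idea it illustrates can be sketched as Thompson sampling: keep a Beta posterior per coin, draw one sample from each, and play whichever sample is largest. The payoff rates below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.4, 0.6]      # hypothetical payoff rates, unknown to the player
wins = np.ones(2)            # Beta(1, 1) prior for each coin
losses = np.ones(2)
plays = np.zeros(2, dtype=int)

for _ in range(2000):
    # Thompson sampling: sample from each posterior and play the
    # coin whose sample is largest -- explore vs. exploit in one step
    arm = int(np.argmax(rng.beta(wins, losses)))
    if rng.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1
    plays[arm] += 1

print(plays)  # the better coin ends up played far more often
```

Early on the posteriors overlap, so both coins get explored; as evidence accumulates, play concentrates on the better coin and the opportunity cost of the lesser one shrinks.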

Page 21

An Example

Page 22

An Example

Page 23

The Cluster Proximity Features

Every point can be described by the nearest cluster

– 4.3 bits per point in this case

– Significant error that can be decreased (to a point) by increasing number of clusters

Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)

– Error is negligible

– Unwinds the data into a simple representation
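The two-nearest-clusters encoding can be sketched roughly as follows. This is an illustrative reconstruction, not the talk's code: for each point, record the indices of its two nearest centroids, the two proximities, and a sign bit; here the sign bit is guessed to mean which side of the line through the two centroids a 2-D point falls on:

```python
import numpy as np

def proximity_features(point, centroids):
    """Describe a 2-D point by its two nearest clusters:
    (nearest, second, dist1, dist2, side)."""
    d = np.linalg.norm(centroids - point, axis=1)
    i, j = np.argsort(d)[:2]
    # Sign bit: which side of the line through the two centroids
    # the point lies on (2-D cross product).
    u = centroids[j] - centroids[i]
    v = point - centroids[i]
    side = 1 if u[0] * v[1] - u[1] * v[0] >= 0 else -1
    return int(i), int(j), float(d[i]), float(d[j]), side

centroids = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
print(proximity_features(np.array([1.0, 0.5]), centroids))
```

Two cluster indices at a few bits each, two distances, and one sign bit reconstruct the point's position almost exactly, which is the "unwinds the data" claim above.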

Page 24

Diagonalized Cluster Proximity

Page 25

Lots of Clusters Are Fine

Page 26

Surrogate Method

Start with sloppy clustering into κ = k log n clusters

Use these clusters as a weighted surrogate for the data

Cluster surrogate data using ball k-means

Results are provably high quality for highly clusterable data

Sloppy clustering can be done on-line

Surrogate can be kept in memory

Ball k-means pass can be done at any time
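The sloppy on-line pass can be sketched as follows (an illustrative toy, not the production implementation, which lives in Apache Mahout's streaming k-means; the distance threshold here is a made-up parameter): each point either folds into the nearest existing centroid or, if everything is too far away, starts a new weighted centroid. The weighted centroids are the in-memory surrogate that a ball k-means pass can recluster at any time.

```python
import numpy as np

def sloppy_cluster(stream, threshold):
    """One on-line pass: keep weighted centroids as a surrogate for
    the stream.  Points far from every centroid start new ones."""
    centroids, weights = [], []
    for x in stream:
        if centroids:
            d = np.linalg.norm(np.array(centroids) - x, axis=1)
            i = int(np.argmin(d))
            if d[i] < threshold:
                # fold the point into the nearest centroid (running mean)
                weights[i] += 1
                centroids[i] += (x - centroids[i]) / weights[i]
                continue
        centroids.append(x.astype(float).copy())
        weights.append(1)
    return np.array(centroids), np.array(weights)

rng = np.random.default_rng(1)
# two well-separated blobs of 500 points each
stream = np.concatenate([rng.normal(0, 0.1, (500, 2)),
                         rng.normal(5, 0.1, (500, 2))])
rng.shuffle(stream)
centroids, weights = sloppy_cluster(stream, threshold=1.0)
print(len(centroids), weights.sum())  # a handful of centroids carrying weight 1000
```

The surrogate is tiny compared to the stream, which is why it fits in memory and why the batch pass over it is cheap.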

Page 27

Algorithm Costs

O(k d log n) per point for Lloyd’s algorithm

… not so good for k = 2000, n = 10⁸

Surrogate methods

… O(d log κ) = O(d (log k + log log n)) per point

This is a big deal:

– k d log n = 2000 × 10 × 26 ≈ 500,000

– d (log k + log log n) = 10 × (11 + 5) = 160

– 3,000 times faster makes the grade as a bona fide big deal
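The cost arithmetic above can be checked directly. A minimal sketch (not from the talk), using base-2 logs and the slide's values k = 2000, d = 10, n = 10⁸:

```python
import math

k, d, n = 2000, 10, 1e8

# Lloyd's algorithm: O(k d log n) per point
lloyd = k * d * math.log2(n)

# Surrogate method: O(d (log k + log log n)) per point
surrogate = d * (math.log2(k) + math.log2(math.log2(n)))

print(lloyd, surrogate, lloyd / surrogate)  # roughly 3,000x faster
```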

Page 28

3,000 times faster sounds good

Page 29

3,000 times faster sounds good

but that isn’t the big news

Page 30

3,000 times faster sounds good

but that isn’t the big news

these algorithms do on-line clustering

Page 31

Parallel Speedup?

[Figure: time per point (μs) vs. number of threads, comparing the threaded version against the non-threaded version and perfect scaling.]

Page 32

What about deployment?

Page 33

[Diagram, repeated from Page 7: an infinite data stream feeds a learning process that maintains state.]

Page 34

[Diagram: a data split feeds a mapper that maintains state.]

Page 35

[Diagram: several mappers, each reading its own data split, all needing access to the same state.]

Need shared memory!

Page 36

whoami – Ted Dunning

We’re hiring at MapR

Ted Dunning

[email protected]

[email protected]

@ted_dunning

For slides and other info

http://www.slideshare.net/tdunning