strata new-york-2012

Upload: ted-dunning

Post on 10-May-2015

DESCRIPTION

This set of slides describes several on-line learning algorithms that, taken together, can provide significant benefit to real-time applications.

TRANSCRIPT

Page 1: Strata new-york-2012

1©MapR Technologies - Confidential

Online Learning

Bayesian bandits and more

Page 2: Strata new-york-2012

whoami – Ted Dunning

Ted [email protected]@apache.org

@ted_dunning

We’re hiring at MapR

For slides and other info http://www.slideshare.net/tdunning

Page 3: Strata new-york-2012

Online

Scalable

Incremental

Page 4: Strata new-york-2012

Scalability and Learning

What does scalable mean?

What are inherent characteristics of scalable learning?

What are the logical implications?

Page 5: Strata new-york-2012

Scalable ≈ On-line

If you squint just right

Page 6: Strata new-york-2012

unit of work ≈ unit of time

Page 7: Strata new-york-2012

Learning

State

Infinite Data Stream
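
The picture of a fixed learning state fed by an unbounded stream can be illustrated with a small sketch. It is not taken from the slides: the `RunningStats` class and the `learn` function are hypothetical names, and a running mean and variance (Welford's update) stand in for whatever model is actually being learned. The point is only that each item costs a constant amount of work and no history is stored.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Fixed-size learning state: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's update: constant work per observation, no stored history.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 when fewer than two items were seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def learn(stream):
    """Consume an iterable one item at a time and return the final state."""
    state = RunningStats()
    for x in stream:
        state.update(x)
    return state
```

Calling `learn([2, 4, 4, 4, 5, 5, 7, 9])` gives a state whose mean is 5.0. The same `update` call works unchanged when the items arrive one at a time from a live source.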

Page 8: Strata new-york-2012

Pick One

Page 9: Strata new-york-2012

Page 10: Strata new-york-2012

Page 11: Strata new-york-2012

Now pick again

Page 12: Strata new-york-2012

A Quick Diversion

You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again
I catch the coin and ask again
I look at the coin (and you don’t) and ask again
Why does the answer change?
– And did it ever have a single value?

Page 13: Strata new-york-2012

Which One to Play?

One may be better than the other
The better coin pays off at some rate
Playing the other will pay off at a lesser rate
– Playing the lesser coin has “opportunity cost”

But how do we know which is which?
– Explore versus Exploit!
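
One common way to balance exploring and exploiting is posterior sampling (Thompson sampling) with a Beta distribution per coin. The sketch below assumes that this is what "Bayesian bandit" refers to here; the function name `thompson_two_coins`, the uniform prior, and the payoff rates in the usage note are illustrative choices, not taken from the slides.

```python
import random


def thompson_two_coins(true_rates, plays, seed=0):
    """Play coins with hidden payoff rates using Beta-posterior sampling."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(plays):
        # Draw one plausible payoff rate per coin from Beta(wins + 1, losses + 1),
        # which is the posterior under a uniform prior (an assumption).
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        # Play the coin whose sampled rate is largest.
        coin = max(range(len(draws)), key=draws.__getitem__)
        # Simulate the payoff and record it for that coin only.
        if rng.random() < true_rates[coin]:
            wins[coin] += 1
        else:
            losses[coin] += 1
    return wins, losses
```

With something like `thompson_two_coins([0.4, 0.6], plays=2000)`, most plays should end up on the better coin: early on the wide posteriors make both coins likely to be tried, and as evidence accumulates the lesser coin is sampled less and less often.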

Page 14: Strata new-york-2012

A First Conclusion

Probability as expressed by humans is subjective and depends on information and experience

Page 15: Strata new-york-2012

A Second Conclusion

A single number is a bad way to express uncertain knowledge

A distribution of values might be better
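
As a hedged illustration of what "a distribution of values" can look like for a coin, the sketch below summarizes a Beta distribution built from observed heads and tails. The function name `beta_summary` and the uniform prior are assumptions, and reading slide titles such as "5 and 5" or "2 and 10" as heads/tails counts is a guess about the plots, not something the transcript states.

```python
import math


def beta_summary(heads, tails):
    """Mean and standard deviation of Beta(heads + 1, tails + 1)."""
    # Uniform prior Beta(1, 1) updated with the observed counts.
    a, b = heads + 1, tails + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)
```

With no observations, `beta_summary(0, 0)` gives a mean of 0.5 with a wide spread. `beta_summary(5, 5)` keeps the mean at 0.5 but narrows the spread, and `beta_summary(2, 10)` moves the mean to 3/14 and narrows it further. The single number and the width together carry more than the single number alone.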

Page 16: Strata new-york-2012

I Dunno

Page 17: Strata new-york-2012

5 and 5

Page 18: Strata new-york-2012

2 and 10

Page 19: Strata new-york-2012

The Cynic Among Us

Page 20: Strata new-york-2012

Demo

Page 21: Strata new-york-2012

An Example

Page 22: Strata new-york-2012

An Example

Page 23: Strata new-york-2012

The Cluster Proximity Features

Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing number of clusters

Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
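
A minimal sketch of the two-nearest-cluster description follows. The names `proximity_features` and `bits_per_index` are hypothetical, Euclidean distance is assumed as the proximity measure, and the sign bit mentioned on the slide is not modeled because the transcript does not say how it is defined.

```python
import math


def proximity_features(point, centroids):
    """Return ((nearest index, distance), (second-nearest index, distance))."""
    # Rank every centroid by Euclidean distance to the point.
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return (i1, d1), (i2, d2)


def bits_per_index(num_clusters):
    """Bits needed to name one cluster out of num_clusters."""
    return math.log2(num_clusters)
```

For example, `bits_per_index(20)` is about 4.3, so the slide's figure would be consistent with roughly 20 clusters. That is an inference from the arithmetic, not a number given in the deck.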

Page 24: Strata new-york-2012

Diagonalized Cluster Proximity

Page 25: Strata new-york-2012

Lots of Clusters Are Fine

Page 26: Strata new-york-2012

Surrogate Method

Start with sloppy clustering into κ = k log n clusters
Use these clusters as a weighted surrogate for the data
Cluster surrogate data using ball k-means

Results are provably high quality for highly clusterable data
Sloppy clustering can be done on-line
Surrogate can be kept in memory
Ball k-means pass can be done at any time
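
The two stages can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the algorithm from the talk: the sloppy pass uses a fixed distance `threshold` and never caps the number of centroids at κ, and the second stage is plain weighted Lloyd iteration standing in for ball k-means. The function names and parameters are hypothetical.

```python
import math
import random


def sloppy_cluster(points, threshold, rng):
    """One pass over the stream; returns [centroid, weight] pairs."""
    surrogate = []
    for p in points:
        # Distance to the nearest existing centroid (infinite if none yet).
        j, d = -1, math.inf
        for i, (c, _) in enumerate(surrogate):
            dist = math.dist(p, c)
            if dist < d:
                j, d = i, dist
        if rng.random() < d / threshold:
            # Far from everything seen so far: start a new centroid.
            surrogate.append([list(p), 1.0])
        else:
            # Close enough: fold the point into the nearest centroid.
            c, w = surrogate[j]
            merged = [(ci * w + pi) / (w + 1) for ci, pi in zip(c, p)]
            surrogate[j] = [merged, w + 1]
    return surrogate


def weighted_kmeans(surrogate, k, iterations, rng):
    """Weighted Lloyd iterations over the in-memory surrogate."""
    centers = [list(c) for c, _ in rng.sample(surrogate, k)]
    dim = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for c, w in surrogate:
            # Assign each weighted centroid to its nearest center.
            j = min(range(k), key=lambda i: math.dist(c, centers[i]))
            totals[j] += w
            for a in range(dim):
                sums[j][a] += w * c[a]
        for j in range(k):
            if totals[j] > 0:  # keep the old center if nothing was assigned
                centers[j] = [s / totals[j] for s in sums[j]]
    return centers
```

The first function touches each point once and keeps only the weighted centroids, so it can run on a stream. The second works only on that small in-memory summary, which is why it can be run whenever a final clustering is wanted.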

Page 27: Strata new-york-2012

Algorithm Costs

O(k d log n) per point for Lloyd’s algorithm … not so good for k = 2000, n = 10^8

Surrogate methods … O(d log κ) = O(d (log k + log log n)) per point

This is a big deal:
– k d log n = 2000 x 10 x 26 ≈ 500,000
– log k + log log n = 11 + 5 = 16
– 30,000 times faster makes the grade as a bona fide big deal
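
The arithmetic can be checked with a few lines. This sketch assumes base-2 logarithms and d = 10 (both inferred from the factors 10 and 26 on the slide) and takes n as 10^8, which is what log n = 26 implies in base 2.

```python
import math

k, d, n = 2000, 10, 10**8  # k from the slide; d and n as inferred above

log_n = math.log2(n)                                # about 26.6
lloyd = k * d * log_n                               # k d log n, roughly 500,000
surrogate_terms = math.log2(k) + math.log2(log_n)   # log k + log log n, about 16

# The slide compares these two figures directly; its second figure
# does not include the factor d from the O(d log kappa) bound.
ratio = lloyd / surrogate_terms
print(f"{lloyd:,.0f} / {surrogate_terms:.1f} = {ratio:,.0f}")
```

The ratio comes out in the low tens of thousands, the same order of magnitude as the slide's rounded figure of 30,000.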

Page 28: Strata new-york-2012

30,000 times faster sounds good

Page 29: Strata new-york-2012

30,000 times faster sounds good

but that isn’t the big news

Page 30: Strata new-york-2012

30,000 times faster sounds good

but that isn’t the big news

these algorithms do on-line clustering

Page 31: Strata new-york-2012

Parallel Speedup?

Page 32: Strata new-york-2012

What about deployment?

Page 33: Strata new-york-2012

Learning

State

Infinite Data Stream

Page 34: Strata new-york-2012

Mapper

State

Data Split

Page 35: Strata new-york-2012

Mapper

State

Data Split

Need shared memory!

Mapper

Mapper
