strata new-york-2012

© MapR Technologies - Confidential

Online Learning: Bayesian Bandits and More


whoami – Ted Dunning

Ted [email protected]@apache.org@ted_dunning

We’re hiring at MapR

For slides and other info http://www.slideshare.net/tdunning





Online

Scalable

Incremental


Scalability and Learning

What does scalable mean?

What are inherent characteristics of scalable learning?

What are the logical implications?


Scalable ≈ On-line

If you squint just right


unit of work ≈ unit of time


Learning

State

Infinite Data

Stream


Pick One


Now pick again


A Quick Diversion

You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?

I flip the coin and, while it is in the air, ask again
I catch the coin and ask again
I look at the coin (and you don’t) and ask again
Why does the answer change?
– And did it ever have a single value?


Which One to Play?

One may be better than the other
The better coin pays off at some rate
Playing the other will pay off at a lesser rate
– Playing the lesser coin has “opportunity cost”

But how do we know which is which?
– Explore versus exploit!
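One common way to trade off exploring against exploiting is Thompson sampling over Beta posteriors. The sketch below is an illustration under assumptions, not the demo from the talk: the payoff rates 0.3 and 0.7, the uniform Beta(1, 1) prior, the seed, and the function name `thompson_two_coins` are all hypothetical.

```python
import random

def thompson_two_coins(true_rates, n_plays, seed=0):
    """Play coins using Beta-Bernoulli Thompson sampling.

    true_rates holds the hidden payoff probabilities (illustrative values).
    Returns per-coin (wins, losses) counts.
    """
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(n_plays):
        # Draw one plausible payoff rate per coin from its Beta posterior
        # (uniform Beta(1, 1) prior), then play the coin with the largest draw.
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        coin = draws.index(max(draws))
        if rng.random() < true_rates[coin]:
            wins[coin] += 1
        else:
            losses[coin] += 1
    return wins, losses

wins, losses = thompson_two_coins([0.3, 0.7], n_plays=2000)
plays = [w + l for w, l in zip(wins, losses)]
print("plays per coin:", plays)
```

Early on both coins get played because both posteriors are wide. As evidence accumulates, the lesser coin is sampled less and less often, which is how the opportunity cost shrinks over time.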


A First Conclusion

Probability as expressed by humans is subjective and depends on information and experience


A Second Conclusion

A single number is a bad way to express uncertain knowledge

A distribution of values might be better


I Dunno


5 and 5


2 and 10
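The three slide titles above read naturally as observed flip counts. As a rough sketch, assuming “5 and 5” means 5 heads and 5 tails, “2 and 10” means 2 heads and 10 tails, “I Dunno” means no flips yet, and a uniform Beta(1, 1) prior (all of these are assumptions, as is the helper name `beta_summary`), the resulting distributions can be summarized like this:

```python
import math

def beta_summary(heads, tails, prior_a=1.0, prior_b=1.0):
    """Mean and standard deviation of the Beta posterior over P(heads)."""
    a = prior_a + heads
    b = prior_b + tails
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Counts are an assumed reading of the slide titles.
for label, heads, tails in [("I dunno", 0, 0), ("5 and 5", 5, 5), ("2 and 10", 2, 10)]:
    mean, sd = beta_summary(heads, tails)
    print(f"{label}: mean={mean:.3f} sd={sd:.3f}")
```

The mean alone cannot distinguish “I dunno” from “5 and 5”; only the spread shows that ten flips have narrowed the belief.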


The Cynic Among Us


Demo


An Example


The Cluster Proximity Features

Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing the number of clusters

Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
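A minimal sketch of the two-nearest-cluster description, using made-up centroids and a hypothetical helper `proximity_features`. The figure of 20 clusters is an inference from the 4.3 bits quoted on the slide (log2 20 ≈ 4.32), not something the slide states, and the sign bit is left out of this sketch.

```python
import math

def proximity_features(point, centroids):
    """Describe a point by its two nearest centroids and the distances to them."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return {"nearest": i1, "second": i2, "d_nearest": d1, "d_second": d2}

# Illustrative centroids only; real ones would come from a clustering pass.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
features = proximity_features((1.0, 0.0), centroids)
print(features)

# Bits needed to name one of 20 clusters (assumed cluster count).
bits_per_index = math.log2(20)
print(round(bits_per_index, 2))
```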


Diagonalized Cluster Proximity


Lots of Clusters Are Fine


Surrogate Method

Start with sloppy clustering into κ = k log n clusters
Use these clusters as a weighted surrogate for the data
Cluster the surrogate data using ball k-means

Results are provably high quality for highly clusterable data
Sloppy clustering can be done on-line
Surrogate can be kept in memory
Ball k-means pass can be done at any time
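The two-stage shape of the surrogate method can be sketched as follows. This is a simplified illustration under several assumptions: the fixed distance `threshold` is a hypothetical stand-in for whatever rule keeps the surrogate near κ clusters, the nearest-centroid search is brute force rather than the fast search the cost analysis relies on, and the second stage uses plain weighted Lloyd iterations with farthest-first seeding in place of ball k-means.

```python
import math
import random

def sloppy_cluster(stream, threshold):
    """One pass over the stream: fold each point into the nearest centroid if
    it lies within `threshold`, otherwise start a new centroid.
    Returns a list of [centroid, weight] pairs (the weighted surrogate)."""
    surrogate = []
    for p in stream:
        best = min(surrogate, key=lambda cw: math.dist(p, cw[0]), default=None)
        if best is None or math.dist(p, best[0]) > threshold:
            surrogate.append([list(p), 1.0])
        else:
            c, w = best
            # Running mean keeps each centroid at the mean of its points.
            best[0] = [(ci * w + pi) / (w + 1) for ci, pi in zip(c, p)]
            best[1] = w + 1
    return surrogate

def weighted_kmeans(surrogate, k, iterations=10):
    """Weighted Lloyd iterations over the surrogate (stand-in for ball k-means)."""
    # Seed with the heaviest centroid, then repeatedly add the farthest one.
    centers = [list(max(surrogate, key=lambda cw: cw[1])[0])]
    while len(centers) < k:
        far = max(surrogate, key=lambda cw: min(math.dist(cw[0], c) for c in centers))
        centers.append(list(far[0]))
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in centers]
        totals = [0.0] * len(centers)
        for c, w in surrogate:
            j = min(range(len(centers)), key=lambda i: math.dist(c, centers[i]))
            totals[j] += w
            sums[j] = [s + w * ci for s, ci in zip(sums[j], c)]
        # Keep the old center if nothing was assigned to it.
        centers = [[s / t for s in sm] if t > 0 else old
                   for sm, t, old in zip(sums, totals, centers)]
    return centers

# Two well-separated synthetic blobs, interleaved as a stream.
rng = random.Random(0)
stream = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for _ in range(200) for cx, cy in [(0.0, 0.0), (10.0, 10.0)]]
surrogate = sloppy_cluster(stream, threshold=0.5)
centers = sorted(weighted_kmeans(surrogate, k=2))
print(len(surrogate), centers)
```

Only the surrogate needs to stay in memory, so the second stage can be rerun at any time without touching the original stream.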


Algorithm Costs

O(k d log n) per point for Lloyd’s algorithm … not so good for k = 2000, n = 10^8

Surrogate methods …. O(d log κ) = O(d (log k + log log n)) per point

This is a big deal:
– k d log n = 2000 x 10 x 26 ≈ 500,000
– log k + log log n ≈ 11 + 5 = 16
– Roughly 30,000 times faster makes the grade as a bona fide big deal
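The arithmetic can be checked directly. The sketch below assumes base-2 logarithms, d = 10, and n = 10^8 (the reading that matches log n ≈ 26). The slide's ratio sets k d log n against log k + log log n without the factor d, so the code prints that ratio and, for comparison, the ratio with d kept on both sides.

```python
import math

k, d, n = 2000, 10, 10**8  # values from the slide, with n read as 10^8

log_n = math.log2(n)                          # about 26.6
lloyd = k * d * log_n                         # O(k d log n) per point
log_kappa = math.log2(k) + math.log2(log_n)   # log κ with κ = k log n

ratio_as_on_slide = lloyd / log_kappa         # factor d omitted, as on the slide
ratio_with_d = lloyd / (d * log_kappa)        # factor d kept: O(d log κ)

print(round(lloyd), round(log_kappa, 1))
print(round(ratio_as_on_slide), round(ratio_with_d))
```

Either way the surrogate approach is cheaper per point by several orders of magnitude; which figure applies depends on whether the nearest-centroid search really costs d per comparison.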


30,000 times faster sounds good



but that isn’t the big news



these algorithms do on-line clustering


Parallel Speedup?

✓


What about deployment?


Learning

State

Infinite Data

Stream


Mapper

State

Data Split


Mapper

State

Data Split

Need shared memory!

Mapper

Mapper







