online learning with stream mining

Upload: mikio-braun

Post on 14-Apr-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/30/2019 Online Learning with Stream Mining

    1/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Online Learning withStream Mining

    Mikio L. Braun, @mikiobraunhttp://blog.mikiobraun.de

    TWIMPACT

    http://twimpact.com

    http://blog.mikiobraun.de/http://twimpact.com/http://twimpact.com/http://blog.mikiobraun.de/
  • 7/30/2019 Online Learning with Stream Mining

    2/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Event Data

    Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

    FinanceFinance GamingGaming MonitoringMonitoring

    Social MediaSocial Media

    Sensor NetworksSensor Networks Advertisment Advertisment

  • 7/30/2019 Online Learning with Stream Mining

    3/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Online Learning

    Isn't all learning online?

  • 7/30/2019 Online Learning with Stream Mining

    4/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Isn't Machine Learning easilyOnline?

    Stochastic gradient descent

    converges, e.g. if

    http://leon.bottou.org/research/stochastic

    http://leon.bottou.org/research/stochastichttp://leon.bottou.org/research/stochastic
  • 7/30/2019 Online Learning with Stream Mining

    5/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Online vs Batch: Non-stationarities

    Very good tutorial by Albert Bifet et al. on these issues at http://sites.google.com/site/advancedstreamingtutorial

    http://sites.google.com/site/advancedstreamingtutorialhttp://sites.google.com/site/advancedstreamingtutorial
  • 7/30/2019 Online Learning with Stream Mining

    6/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Time horizons vs. Learning rate

    You can't just do online learning on event data!

  • 7/30/2019 Online Learning with Stream Mining

    7/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Also, Event Data is huge The problem: You easily get A LOT OF DATA!

    100 events per second 360k events per hour 8.6M events per day 260M events per month 3.2B events per year

  • 7/30/2019 Online Learning with Stream Mining

    8/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    So, online learning challenges:

    So much data!

    Concept Drift Online (as in not batch)

    is not the whole story.

  • 7/30/2019 Online Learning with Stream Mining

    9/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Digging into Least Squares

    d

    d

    with entries

    this could be huge!

    d x d is probably ok

    Idea: Batch method like least squares on recent portionof the data.

    But: It's just a sum!

  • 7/30/2019 Online Learning with Stream Mining

    10/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Another Problem: High-dimensionalSpaces

    Potentially large spaces: distinct words: >100k IP addresses: >100M users in a social network: >10M

    http://www.flickr.com/photos/arenamontanus/269158554/http://wordle.net

  • 7/30/2019 Online Learning with Stream Mining

    11/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Stream Mining to the rescue Stream mining algorithms:

    answer stream queries with finite resources Typical examples:

    how often does an item appear in a stream? how many distinct elements are in the stream? what are the top-k most frequent items?

    Continuous Stream of Data

    Bounded Resource Analyzer Stream Queries

  • 7/30/2019 Online Learning with Stream Mining

    12/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    The Trade-Off

    ExactFast

    Big Data

    Stream Mining Map Reduceand friends

    First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra

    http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandrahttp://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra
  • 7/30/2019 Online Learning with Stream Mining

    13/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Heavy Hitters (a.k.a. Top-k) Count activities over large item sets (millions,

    even more, e.g. IP addresses, Twitter users) Interested in most active elements only.

    Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conferenceon Database Theory, 2005

    132

    142

    432

    553

    712023

    15

    12

    85

    32

    Fixed tables of counts

    Case 1: element already in data base

    Case 2: new element

    142 142 12 13

    713 023 2

    713 3

  • 7/30/2019 Online Learning with Stream Mining

    14/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Count-Min Sketches Summarize histograms over large feature sets Like bloom filters, but better

    Query: Take minimum over all hash functions

    0 0 3 0

    1 1 0 2

    0 2 0 0

    0 3 5 20 5 3 2

    2 4 5 01 3 7 3

    0 2 0 8

    m bins

    n differenthash functions

    Updates for new entryQuery result: 1

    G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications.LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .

  • 7/30/2019 Online Learning with Stream Mining

    15/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Clustering with count-min Sketches

    Online clustering For each data point:

    Map to closest centroid ( compute distances) Update centroid

    count-min sketches to represent sum over allvectors in a class

    Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009

    0 0 3 01 1 0 2 0 2 0 00 3 5 20 5 3 22 4 5 0

    1 3 7 30 2 0 8

  • 7/30/2019 Online Learning with Stream Mining

    16/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Heavy Hitters over Time-Window

    Time

    DB

    Keep quite a big log (amonth?)

    Constant write/erase indatabase

    Alternative: Exponentialdecay

  • 7/30/2019 Online Learning with Stream Mining

    17/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Exponential Decay

    Instead of a fixed window, use exponentialdecay

    The beauty: updates are recursive

    halftime

    score

    timestamp

    time shift term

  • 7/30/2019 Online Learning with Stream Mining

    18/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Exponential Decay Collect stats by a table of expdecay counters

    counters[item] # countersts[item] # last timestamp

    update(C, item, timestamp, count) update countsC.counters[item] = count +

    weight(timestamp, C.ts[item]) * C.counters[item]C.ts[item] = timestampC.lastupdate = timestamp

    score(C, item) return scorereturn weight(C.lastupdate, C.ts[item])

    * C.counters[item]

  • 7/30/2019 Online Learning with Stream Mining

    19/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Least Squares Revisited Need to compute For each do

    Then, reconstruct

    As a reminder:

  • 7/30/2019 Online Learning with Stream Mining

    20/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    More: Maximum-Likelihood Estimate probabilistic models

    But wait, how do I 1/n with randomly spacedevents?

    based on

    which is slightlybiased, butsimpler

  • 7/30/2019 Online Learning with Stream Mining

    21/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Outlier detection Once you have a model, you can compute

    p-values (based on recent time frames!)

  • 7/30/2019 Online Learning with Stream Mining

    22/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    TF-IDF estimate word document frequencies

    for each word: update(word, t, 1.0) for each document: update(#docs, t, 1.0)

    query: score(word) / score(#docs)

  • 7/30/2019 Online Learning with Stream Mining

    23/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Extracting a relevant subset

  • 7/30/2019 Online Learning with Stream Mining

    24/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Classification with Nave Bayes Naive Bayes is also just counting, right?

    class priors

    Multinomnial nave Bayes

    Priors

    Total number of words in class

    Number of times word appears in classfrequency of word in document

  • 7/30/2019 Online Learning with Stream Mining

    25/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Classification with Naive Bayes

    ICML 2003

  • 7/30/2019 Online Learning with Stream Mining

    26/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Classification with Naive Bayes 7 Steps to improve NB:

    transform TF to log( . + 1) IDF-style normalization square length normalization use complement probability another log

    normalize those weights again Predict linearly using those weights

  • 7/30/2019 Online Learning with Stream Mining

    27/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    What about non-parametricmethods and Kernel Methods?

    Problem here, no real accumulation of information instatistics, e.g. SVMs

    Could still use streamdrill to extract a representativesubset.

    sum over all elements!

  • 7/30/2019 Online Learning with Stream Mining

    28/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Streamdrill Heavy Hitters counting + exponential decay Instant counts & top-k results over time

    windows. Indices! Snapshots for historical analysis Beta demo available at http://streamdrill.com ,

    launch imminent

    http://streamdrill.com/http://streamdrill.com/
  • 7/30/2019 Online Learning with Stream Mining

    29/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Architecture Overview

  • 7/30/2019 Online Learning with Stream Mining

    30/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    REST API Create a trend

    /1/create/plays/user:song:location?size=1000?timescales=day,hour

    Update a word/1/update/plays/frank:123123:San+Francisco

    Another word (with timestamp)/1/update/plays/paul:145323:Berlin?ts=131341354135

    Get most played songs for SF or Paul/1/query/plays?city=San+Francisco/1/query/plays?user=paul

    Get score for a word/1/query/score/hello

  • 7/30/2019 Online Learning with Stream Mining

    31/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Example: Twitter Stock Analysis

    http://play.streamdrill.com/vis/

    http://play.streamdrill.com/vis/http://play.streamdrill.com/vis/
  • 7/30/2019 Online Learning with Stream Mining

    32/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Example: Twitter Stock Analysis Trends:

    symbol:combinations $AAPL:$GOOG symbol:hashtag $AAPL:#trading symbol:keywords $GOOG:disruption symbol:mentions $GOOG:WallStreetCom symbol trend $AAPL

    symbol:url $FB:http://on.wsj.com/15fHaZW

  • 7/30/2019 Online Learning with Stream Mining

    33/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Example: Twitter Stock Analysis

  • 7/30/2019 Online Learning with Stream Mining

    34/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Example: Twitter Stock Analysis

  • 7/30/2019 Online Learning with Stream Mining

    35/36

    Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT

    Example: Twitter Stock Analysis

    Twitter

    stream drill

    JavaScriptvia REST

    Tweet Analyzer

    tweets

    updates

  • 7/30/2019 Online Learning with Stream Mining

    36/36

    M hi L i M t S F i A il 24 2013 ( ) 2013 b TWIMPACT

    Summary Doesn't always have to be scaling! Stream mining : Approximate results with

    finite resources. stream drill: stream analysis engine