Online Learning for Big Data Analytics



    Online Learning for Big Data Analytics

    Irwin King, Michael R. Lyu and Haiqin Yang

    Department of Computer Science & Engineering

    The Chinese University of Hong Kong

    Tutorial presentation at IEEE Big Data, Santa Clara, CA, 2013


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    What is Big Data?

    There is no consensus on how to define Big Data

    "A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." - Wikipedia

    "Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population." - Teradata Magazine article, 2011

    "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." - The McKinsey Global Institute, 2011


    What is Big Data?

    [Figure: data growth measured by file/object size and content volume vs. activity (IOPS)]

    Big Data refers to datasets that grow so large and complex that it is difficult to capture, store, manage, share, analyze, and visualize them within current computational architectures.


    Evolution of Big Data

    Birth: 1880 US census

    Adolescence: Big Science

    Modern Era: Big Business


    Birth: 1880 US census


    The First Big Data Challenge

    1880 census

    50 million people

    Age, gender (sex), occupation, education level, no. of insane people in household


    Manhattan Project (1942 - 1946)

    $2 billion (approx. $26 billion in 2013 dollars)

    Catalyst for Big Science


    Space Program (1960s)

    Began in the late 1950s

    An active area of big data nowadays


    Adolescence: Big Science


    Big Science

    The International Geophysical Year

      An international scientific project

      Lasted from Jul. 1, 1957 to Dec. 31, 1958

      A synoptic collection of observational data on a global scale

    Implications

      Big budgets, big staffs, big machines, big laboratories


    Summary of Big Science

    Laid foundation for ambitious projects

    International Biological Program

    Long Term Ecological Research Network

    Ended in 1974; many participants viewed it as a failure

    Nevertheless, it was a success

      Transformed the way of processing data

      Realized original incentives

      Provided a renewed legitimacy for synoptic data collection


    Lessons from Big Science

    Spawned new big data projects

      Weather prediction

      Physics research (supercollider data analytics)

      Astronomy images (planet detection)

      Medical research (drug interaction)

    Businesses latched onto its techniques, methodologies, and objectives


    Modern Era: Big Business


    Big Data is Everywhere!

    Lots of data is being collected and warehoused

      Science experiments

      Web data, e-commerce

      Purchases at department/grocery stores

      Bank/credit card transactions

      Social networks


    Big Data in Science

    CERN Large Hadron Collider

      ~10 PB/year at start; ~1000 PB in ~10 years

      2500 physicists collaborating

    Large Synoptic Survey Telescope (NSF, DOE, and private donors)

      ~5-10 PB/year at start in 2012; ~100 PB by 2025

    Pan-STARRS (Haleakala, Hawaii), US Air Force

      now: 800 TB/year; soon: 4 PB/year


    Big Data from Different Sources

    12+ TB of tweet data every day

    25+ TB of log data every day

    ? TB of data every day

    2+ billion people on the Web by end 2011

    30 billion RFID tags today (1.3B in 2005)

    4.6 billion camera phones worldwide

    100s of millions of GPS-enabled devices sold annually

    76 million smart meters in 2009; 200M by 2014


    Big Data in Business Sectors

    Source: McKinsey Global Institute


    Characteristics of Big Data

    4V: Volume, Velocity, Variety, Veracity

    Volume: 8 billion TB in 2015, 40 ZB in 2020; 5.2 TB per person

    Variety: text, videos, images, audio

    Velocity: e.g., new sharing of over 2.5 billion items per day, new data of over 500 TB per day

    Veracity: e.g., the headline "US predicts powerful quake this week; UAE body says it's a rumour" (By Wam/Agencies, published Saturday, April 20, 2013)


    Big Data Analytics

    Definition: a process of inspecting, cleaning, transforming, and modeling big data with the goal of discovering useful information, suggesting conclusions, and supporting decision making

    Connection to data mining

      Analytics includes both data analysis (mining) and communication (guiding decision making)

      Analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Learning Techniques Overview

    Learning paradigms

    Supervised learning

    Semi-supervised learning

    Transductive learning

    Unsupervised learning

    Universum learning

    Transfer learning


    What is Online Learning?

    Batch/offline learning

      Observe a batch of training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$

      Learn a model from them

      Predict new samples accurately

    Online learning

      Observe a sequence of data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_t, y_t), \ldots$

      Learn a model incrementally as instances come

      Make the sequence of online predictions accurately

    [Diagram: the learner makes a prediction, the user reveals the true response, and the model is updated]


    Online Prediction Algorithm

    An initial prediction rule $f_0(\cdot)$

    For $t = 1, 2, \ldots$

      We observe $\mathbf{x}_t$ and make a prediction $f_{t-1}(\mathbf{x}_t)$

      We observe the true outcome $y_t$ and then compute a loss $l(f_{t-1}(\mathbf{x}_t), y_t)$

      The online algorithm updates the prediction rule using the new example and constructs $f_t(\cdot)$
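    In code, this protocol is just a loop. A minimal Python sketch, with a hypothetical `model` object exposing predict(x) and update(x, y) (both names are illustrative, not from the tutorial):

```python
def online_protocol(model, stream, loss):
    """Run the online loop: predict with f_{t-1}, observe y_t and suffer
    the loss, then update the model to obtain f_t."""
    total = 0.0
    for x, y in stream:
        y_hat = model.predict(x)   # prediction f_{t-1}(x_t)
        total += loss(y_hat, y)    # loss l(f_{t-1}(x_t), y_t)
        model.update(x, y)         # construct f_t from the new example
    return total
```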


    Online Prediction Algorithm

    The total error of the method is $\sum_{t=1}^{T} l(f_{t-1}(\mathbf{x}_t), y_t)$

    Goal: make this error as small as possible

    Predict the unknown future one step at a time: similar to generalization error


    Regret Analysis

    $f^*(\cdot)$: optimal prediction function from a class $H$, e.g., the class of linear classifiers, with minimum error after seeing all examples:

    $f^* = \arg\min_{f \in H} \sum_{t=1}^{T} l(f(\mathbf{x}_t), y_t)$

    Regret for the online learning algorithm:

    $\text{regret} = \frac{1}{T} \sum_{t=1}^{T} \left[ l(f_{t-1}(\mathbf{x}_t), y_t) - l(f^*(\mathbf{x}_t), y_t) \right]$

    We want the regret to be as small as possible


    Why Low Regret?

    Regret for the online learning algorithm:

    $\text{regret} = \frac{1}{T} \sum_{t=1}^{T} \left[ l(f_{t-1}(\mathbf{x}_t), y_t) - l(f^*(\mathbf{x}_t), y_t) \right]$

    Advantages

      We do not lose much from not knowing future events

      We can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight

      We can also compete with a changing environment


    Advantages of Online Learning

    Meets many applications where data arrive sequentially while predictions are required on-the-fly

    Avoids re-training when adding new data

    Applicable in adversarial and competitive environments

    Strong adaptability to changing environments

    High efficiency and excellent scalability

    Simple to understand and easy to implement

    Easy to parallelize

    Theoretical guarantees


    Where to Apply Online Learning?

    Online learning: social media, Internet security, finance


    Online Learning for Social Media

    Recommendation, sentiment/emotion analysis


    Online Learning for Internet Security

    Electronic business sectors

      Spam email filtering

      Fraudulent credit card transaction detection

      Network intrusion detection systems, etc.


    Online Learning for Financial Decision

    Financial decision

    Online portfolio selection

    Sequential investment, etc.


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Perceptron Algorithm (F. Rosenblatt 1958)

    Goal: find a linear classifier with small error
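    A minimal NumPy sketch of the Perceptron (assuming labels in {-1, +1} and a stream of (x, y) pairs; the function name and signature are illustrative):

```python
import numpy as np

def perceptron(stream, dim, eta=1.0):
    """On each mistake, move the weights toward the example: w <- w + eta*y*x."""
    w = np.zeros(dim)
    for x, y in stream:                      # y in {-1, +1}
        y_hat = 1.0 if w @ x >= 0 else -1.0  # linear classifier sgn(w'x)
        if y_hat != y:                       # misclassification triggers an update
            w += eta * y * x
    return w
```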


    Geometric View

    [Figure: the initial weight vector w0 misclassifies (x1, y1 = +1)]


    Geometric View

    [Figure: after the update w1 = w0 + y1x1, (x1, y1 = +1) is correctly classified]


    Geometric View

    [Figure: w1 misclassifies the next example (x2, y2 = -1)]


    Geometric View

    [Figure: after the update w2 = w1 + y2x2, both examples are correctly classified]


    Perceptron Mistake Bound

    Consider $\mathbf{w}^*$ that separates the data: $y_i (\mathbf{w}^{*\top} \mathbf{x}_i) > 0$ for all $i$

    Define the margin

    $\gamma = \min_i \frac{y_i (\mathbf{w}^{*\top} \mathbf{x}_i)}{\|\mathbf{w}^*\|_2 \, \sup_i \|\mathbf{x}_i\|_2}$

    The number of mistakes the Perceptron makes is at most $1/\gamma^2$


    Proof of Perceptron Mistake Bound

    [Novikoff, 1963]

    Proof: Let $\mathbf{v}_k$ be the hypothesis before the $k$-th mistake, with $\mathbf{v}_1 = \mathbf{0}$. Assume that the $k$-th mistake occurs on the input example $(\mathbf{x}_i, y_i)$, and let $R = \sup_i \|\mathbf{x}_i\|_2$.

    First, $\mathbf{v}_{k+1}^\top \mathbf{w}^* = (\mathbf{v}_k + y_i \mathbf{x}_i)^\top \mathbf{w}^* \ge \mathbf{v}_k^\top \mathbf{w}^* + \gamma R \|\mathbf{w}^*\|_2$, so by induction $\mathbf{v}_{k+1}^\top \mathbf{w}^* \ge k \gamma R \|\mathbf{w}^*\|_2$.

    Second, $\|\mathbf{v}_{k+1}\|_2^2 = \|\mathbf{v}_k\|_2^2 + 2 y_i \mathbf{v}_k^\top \mathbf{x}_i + \|\mathbf{x}_i\|_2^2 \le \|\mathbf{v}_k\|_2^2 + R^2$ (the middle term is non-positive on a mistake), so $\|\mathbf{v}_{k+1}\|_2^2 \le k R^2$.

    Hence, $k \gamma R \|\mathbf{w}^*\|_2 \le \mathbf{v}_{k+1}^\top \mathbf{w}^* \le \|\mathbf{v}_{k+1}\|_2 \|\mathbf{w}^*\|_2 \le \sqrt{k} R \|\mathbf{w}^*\|_2$, which gives $k \le 1/\gamma^2$.


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Online Non-Sparse Learning

    First-order learning methods

      Online gradient descent (Zinkevich, 2003)

      Passive-aggressive learning (Crammer et al., 2006)

    Others (including but not limited to)

      ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001)

      ROMMA: Relaxed Online Maximum Margin Algorithm (Li and Long, 2002)

      MIRA: Margin Infused Relaxed Algorithm (Crammer and Singer, 2003)

      DUOL: A Double Updating Approach for Online Learning (Zhao et al., 2009)


    Online Gradient Descent (OGD)

    Online convex optimization (Zinkevich, 2003)

    Consider a convex objective function $f(\mathbf{w})$, where $\mathbf{w} \in \mathcal{K}$ and $\mathcal{K}$ is a bounded convex set

    Update by Online Gradient Descent (OGD) or Stochastic Gradient Descent (SGD):

    $\mathbf{w}_{t+1} = \Pi_{\mathcal{K}} \left( \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t) \right)$

    where $\eta_t$ is a learning rate and $\Pi_{\mathcal{K}}$ denotes projection onto $\mathcal{K}$


    Online Gradient Descent (OGD)

    For $t = 1, 2, \ldots$

      An unlabeled sample $\mathbf{x}_t$ arrives

      Make a prediction based on existing weights: $\hat{y}_t = \operatorname{sgn}(\mathbf{w}_t^\top \mathbf{x}_t)$

      Observe the true class label $y_t \in \{+1, -1\}$

      Update the weights by $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla l(\mathbf{w}_t; (\mathbf{x}_t, y_t))$, where $\eta_t$ is a learning rate
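    A minimal sketch of this loop for the hinge loss, assuming a fixed learning rate and omitting the projection step:

```python
import numpy as np

def ogd_hinge(stream, dim, eta=0.1):
    """Online gradient descent on the hinge loss max(0, 1 - y*w'x)."""
    w = np.zeros(dim)
    for x, y in stream:              # y in {-1, +1}
        if y * (w @ x) < 1.0:        # hinge loss active: subgradient is -y*x
            w += eta * y * x         # w <- w - eta * subgradient
    return w
```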


    Passive Aggressive Online Learning

    Closed-form solutions can be derived: $\mathbf{w}_{t+1} = \mathbf{w}_t + \tau_t y_t \mathbf{x}_t$, with hinge loss $l_t = \max\{0, 1 - y_t \mathbf{w}_t^\top \mathbf{x}_t\}$ and

      PA: $\tau_t = \frac{l_t}{\|\mathbf{x}_t\|^2}$

      PA-I: $\tau_t = \min\left\{ C, \frac{l_t}{\|\mathbf{x}_t\|^2} \right\}$

      PA-II: $\tau_t = \frac{l_t}{\|\mathbf{x}_t\|^2 + \frac{1}{2C}}$
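    A short sketch of the PA-I variant following the closed form above:

```python
import numpy as np

def pa1(stream, dim, C=1.0):
    """Passive-Aggressive I: closed-form update with aggressiveness cap C."""
    w = np.zeros(dim)
    for x, y in stream:                        # y in {-1, +1}
        loss = max(0.0, 1.0 - y * (w @ x))     # hinge loss l_t
        if loss > 0.0:
            tau = min(C, loss / (x @ x))       # PA-I step size tau_t
            w += tau * y * x
    return w
```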


    Online Non-Sparse Learning

    First-order methods

      Learn a linear weight vector (first-order information) of the model

    Pros and Cons

    Simple and easy to implement

    Efficient and scalable for high-dimensional data

    Relatively slow convergence rate


    Online Non-Sparse Learning

    Second-order online learning methods

      Update the weight vector w by maintaining and exploring both first-order and second-order information

    Some representative methods (including but not limited to)

      SOP: Second Order Perceptron (Cesa-Bianchi et al., 2005)

      CW: Confidence Weighted learning (Dredze et al., 2008)

      AROW: Adaptive Regularization of Weights (Crammer et al., 2009)

      IELLIP: Online Learning by Ellipsoid Method (Yang et al., 2009)

      NHERD: Gaussian Herding (Crammer & Lee, 2010)

      NAROW: New variant of the AROW algorithm (Orabona & Crammer, 2010)

      SCW: Soft Confidence Weighted (Hoi et al., 2012)


    Online Non-Sparse Learning

    Second-order online learning methods

      Learn the weight vector w by maintaining and exploring both first-order and second-order information

    Pros and Cons

      Faster convergence rate

      Expensive for high-dimensional data

      Relatively sensitive to noise


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Online Sparse Learning

    Motivation

      Space constraint: RAM overflow

      Test-time constraint

    How to induce sparsity in the weights of online learning algorithms?

    Some representative work

      Truncated gradient (Langford et al., 2009)

      FOBOS: Forward Looking Subgradients (Duchi and Singer, 2009)

      Dual averaging (Xiao, 2010)

      etc.


    Truncated Gradient (Langford et al., 2009)

    Objective function: $\min_{\mathbf{w}} \frac{1}{T} \sum_{t=1}^{T} L(\mathbf{w}; \mathbf{z}_t) + g \|\mathbf{w}\|_1$

    Stochastic gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t; \mathbf{z}_t)$

    Simple coefficient rounding: every $K$ steps, round coefficients smaller than a threshold to zero

    Truncated gradient: impose sparsity by modifying the stochastic gradient descent update


    Truncated Gradient (Langford et al., 2009)

    [Figure: simple coefficient rounding vs. the less aggressive truncation function]


    Truncated Gradient (Langford et al., 2009)

    The amount of shrinkage is measured by a gravity parameter

    The truncation can be performed every $K$ online steps

    When the gravity parameter is zero, the update rule is identical to the standard SGD

    Loss functions:

      Logistic

      SVM (hinge)

      Least squares
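    A sketch of the update with the truncation operator (`grad` is a hypothetical user-supplied function returning the loss gradient at (w, x, y)):

```python
import numpy as np

def truncate(w, alpha, theta):
    """Shrink coefficients with |w_j| <= theta toward zero by alpha."""
    out = w.copy()
    small = np.abs(w) <= theta
    out[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - alpha, 0.0)
    return out

def truncated_gradient(stream, dim, grad, eta=0.1, g=0.01, K=10, theta=np.inf):
    """SGD with truncation applied every K steps; gravity accumulates over K."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        w -= eta * grad(w, x, y)
        if t % K == 0:
            w = truncate(w, K * eta * g, theta)
    return w
```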


    Truncated Gradient (Langford et al., 2009)

    Regret bound: the truncated-gradient update retains a sublinear, online-gradient-descent-style regret guarantee


    FOBOS (Duchi & Singer, 2009)

    Objective function: $\min_{\mathbf{w}} f(\mathbf{w}) + r(\mathbf{w})$ (loss plus regularizer)

    Repeat:

      I. Unconstrained (stochastic sub)gradient step on the loss

      II. Incorporate the regularization

    Similar to forward-backward splitting (Lions and Mercier, 1979)

    Composite gradient methods (Wright et al., 2009; Nesterov, 2007)


    FOBOS: Step I

    Objective function: $\min_{\mathbf{w}} f(\mathbf{w}) + r(\mathbf{w})$

    Unconstrained (stochastic sub)gradient step on the loss:

    $\mathbf{w}_{t+\frac{1}{2}} = \mathbf{w}_t - \eta_t \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f(\mathbf{w}_t)$


    FOBOS: Step II

    Objective function: $\min_{\mathbf{w}} f(\mathbf{w}) + r(\mathbf{w})$

    Incorporate the regularization:

    $\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \tfrac{1}{2} \|\mathbf{w} - \mathbf{w}_{t+\frac{1}{2}}\|^2 + \eta_{t+\frac{1}{2}} r(\mathbf{w}) \right\}$
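    For the $\ell_1$ regularizer $r(\mathbf{w}) = \lambda \|\mathbf{w}\|_1$, step II has a closed-form soft-thresholding solution; a short sketch (again with a hypothetical `grad` function):

```python
import numpy as np

def fobos_l1(stream, dim, grad, eta=0.1, lam=0.01):
    """FOBOS with L1: gradient step (I), then soft-thresholding (II)."""
    w = np.zeros(dim)
    for x, y in stream:
        w_half = w - eta * grad(w, x, y)                                   # step I
        w = np.sign(w_half) * np.maximum(np.abs(w_half) - eta * lam, 0.0)  # step II
    return w
```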


    Forward Looking Property

    The optimum satisfies

    $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{g}_t^f - \eta_{t+\frac{1}{2}} \mathbf{g}_{t+1}^r$

    where $\mathbf{g}_t^f \in \partial f(\mathbf{w}_t)$ and $\mathbf{g}_{t+1}^r \in \partial r(\mathbf{w}_{t+1})$

    Current subgradient of the loss, forward subgradient of the regularization


    Batch Convergence and Online Regret

    Set the step size appropriately (e.g., $\eta_t \propto 1/\sqrt{t}$) to obtain batch convergence

    Online (average) regret bounds of $O(1/\sqrt{T})$


    High Dimensional Efficiency

    Input space is sparse but huge

    Need to perform lazy updates to the weight vector

    Proposition: the lazy and the eager updates are equivalent


    Dual Averaging (Xiao, 2010)

    Objective function: $\min_{\mathbf{w}} \frac{1}{t} \sum_{i=1}^{t} f_i(\mathbf{w}) + \Psi(\mathbf{w})$

    Problem: truncated gradient doesn't produce truly sparse weights due to the small learning rate

    Fix: dual averaging, which keeps two state representations: the parameter $\mathbf{w}_t$ and the average gradient vector

    $\bar{\mathbf{g}}_t = \frac{1}{t} \sum_{i=1}^{t} \nabla f_i(\mathbf{w}_i)$


    Dual Averaging (Xiao, 2010)

    The update

    $\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \bar{\mathbf{g}}_t^\top \mathbf{w} + \Psi(\mathbf{w}) + \frac{\beta_t}{t} h(\mathbf{w}) \right\}$

    has an entry-wise closed-form solution for $\ell_1$ regularization

    Advantage: sparsity on the weight

    Disadvantage: keeps a non-sparse average subgradient

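    A sketch of $\ell_1$-regularized dual averaging under the update above (with $h(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ and $\beta_t = \gamma \sqrt{t}$; the truncation threshold $\lambda$ stays constant instead of shrinking with the step size, which is what yields truly sparse weights):

```python
import numpy as np

def rda_l1(stream, dim, grad, lam=0.01, gamma=1.0):
    """Entry-wise closed form of L1 dual averaging (a sketch)."""
    w = np.zeros(dim)
    gbar = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        gbar += (grad(w, x, y) - gbar) / t            # running average gradient
        shrunk = np.maximum(np.abs(gbar) - lam, 0.0)  # |gbar_j| <= lam  =>  w_j = 0
        w = -(np.sqrt(t) / gamma) * np.sign(gbar) * shrunk
    return w
```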

    Convergence and Regret

    Average regret: $O(1/\sqrt{T})$

    Theoretical bound: similar to gradient descent


    Comparison

    FOBOS

      Subgradient; local Bregman divergence

      Coefficient $1/\eta_t \sim \sqrt{t}$

      Equivalent to the TG method when truncation is performed every step

    Dual Averaging

      Average subgradient; global proximal function

      Coefficient $\beta_t / t \sim 1/\sqrt{t}$


    Comparison

    [Figure: sparsity of the TG vs. RDA solutions under two regularization settings]


    Variants of Online Sparse Learning

    Models

      Online feature selection (OFS): a variant of sparse online learning

        The key difference is that OFS focuses on selecting a fixed subset of features in the online learning process

        Could be used as an alternative tool for batch feature selection when dealing with big data

    Other existing work

      Online learning for group lasso (Yang et al., 2010) and online learning for multi-task feature selection (Yang et al., 2013), to select features in a group manner or features among similar tasks


    Online Sparse Learning

    Objective

      Induce sparsity in the weights of online learning algorithms

    Pros and Cons

      Simple and easy to implement

      Efficient and scalable for high-dimensional data

      Relatively slow convergence rate

      No perfect way to attain a sparse solution yet


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Online Unsupervised Learning

    Assumption: data are generated from some underlying parametric probabilistic density function

    Goal: estimate the parameters of the density to give a suitable compact representation

    Typical work

      Online singular value decomposition (SVD) (Brand, 2003)

    Others (including but not limited to)

      Online principal component analysis (PCA) (Warmuth and Kuzmin, 2006)

      Online dictionary learning for sparse coding (Mairal et al., 2009)

      Online learning for latent Dirichlet allocation (LDA) (Hoffman et al., 2010)

      Online variational inference for the hierarchical Dirichlet process (HDP) (Wang et al., 2011)

      ...


    SVD: Definition

    $A_{[m \times n]} = U_{[m \times r]} \, \Sigma_{[r \times r]} \, V^\top_{[r \times n]}$

    $A$: input data matrix, $m \times n$ (e.g., $m$ documents, $n$ terms)

    $U$: left singular vectors, $m \times r$ ($m$ documents, $r$ topics)

    $\Sigma$: diagonal matrix of singular values, $r \times r$ (strength of each topic)

    $r$: rank of the matrix $A$

    $V$: right singular vectors, $n \times r$ ($n$ terms, $r$ topics)

    Equivalently, $A = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$, where each $\sigma_i$ is a scalar and $\mathbf{u}_i$, $\mathbf{v}_i$ are vectors
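    The definition is easy to check numerically; a small NumPy example using the users-to-movies matrix from the following slides:

```python
import numpy as np

A = np.array([[1, 3, 1, 0], [3, 4, 3, 1], [4, 2, 4, 0], [5, 4, 5, 1],
              [0, 1, 0, 4], [0, 0, 0, 3], [0, 1, 0, 5]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD
print(np.round(s, 1))                              # singular values, decreasing
assert np.allclose(A, U @ np.diag(s) @ Vt)         # exact reconstruction
```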

  • 8/14/2019 Online Learning for Big Data Analytics

    80/116

    S i

  • 8/14/2019 Online Learning for Big Data Analytics

    81/116

    SVD Properties

    It is always possible to do SVD, i.e., decompose a matrix $A$ into $A = U \Sigma V^\top$, where

      $U$, $\Sigma$, $V$: unique

      $U$, $V$: column orthonormal, $U^\top U = I$, $V^\top V = I$ ($I$: identity matrix)

      $\Sigma$: diagonal

        Entries (singular values) are non-negative and sorted in decreasing order ($\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$)


    SVD: Example, Users-to-Movies

    Ratings matrix (rows: users; columns: The Avengers, Star Wars, Twilight, Matrix; the first four users are SciFi fans, the last three Romance fans):

    1 3 1 0
    3 4 3 1
    4 2 4 0
    5 4 5 1
    0 1 0 4
    0 0 0 3
    0 1 0 5

    Topics or concepts, a.k.a. latent dimensions or latent factors


    SVD: Example, Users-to-Movies

    A = U x Sigma x V^T:

    1 3 1 0        0.24 0.02 0.69
    3 4 3 1        0.48 0.02 0.43
    4 2 4 0        0.49 0.08 0.52      11.9  0   0       0.59 0.54 0.59 0.05
    5 4 5 1   =    0.68 0.07 0.16   x   0   7.1  0   x   0.10 0.12 0.10 0.98
    0 1 0 4        0.06 0.57 0.06       0    0  2.5      0.37 0.83 0.37 0.17
    0 0 0 3        0.01 0.41 0.20
    0 1 0 5        0.07 0.70 0.01

    The first concept is the SciFi-concept; the second is the Romance-concept


    SVD: Example, Users-to-Movies

    U is the user-to-concept similarity matrix


    SVD: Example, Users-to-Movies

    The singular values give the strength of each concept (here 11.9 for the SciFi-concept)


    SVD: Example, Users-to-Movies

    V is the movie-to-concept similarity matrix


    SVD: Interpretation #1

    Users, movies, and concepts:

      $U$: user-to-concept similarity matrix

      $V$: movie-to-concept similarity matrix

      $\Sigma$: its diagonal elements give the strength of each concept


    SVD: Interpretation #2

    SVD gives the best axis to project on

    "Best" = minimal sum of squares of projection errors

    In other words, minimum reconstruction error


    SVD: Interpretation #2

    Example: in the factorization above, the projection onto the first axis $\mathbf{v}_1$ has the largest variance (spread)


    SVD: Interpretation #2

    Example: $U \Sigma$ gives the coordinates of the points (users) on the projection axes; the first column is the projection of the users on the SciFi axis:

    2.86 0.24 8.21
    5.71 0.24 5.12
    5.83 0.95 6.19
    8.09 0.83 1.90
    0.71 6.78 0.71
    0.12 4.88 2.38
    0.83 8.33 0.12


    SVD: Interpretation #2

    Q: How exactly is dimension reduction done?


    SVD: Interpretation #2

    Q: How exactly is dimension reduction done?

    A: Set the smallest singular values to zero


    SVD: Interpretation #2

    Q: How exactly is dimension reduction done?

    A: Set the smallest singular values to zero

    Approximate the original matrix by low-rank matrices


    SVD: Interpretation #2

    Q: How exactly is dimension reduction done?

    A: Set the smallest singular values to zero

    Approximate the original matrix by low-rank matrices:

    1 3 1 0        0.24 0.02
    3 4 3 1        0.48 0.02
    4 2 4 0        0.49 0.08      11.9  0       0.59 0.54 0.59 0.05
    5 4 5 1   ~    0.68 0.07   x   0   7.1  x   0.10 0.12 0.10 0.98
    0 1 0 4        0.06 0.57
    0 0 0 3        0.01 0.41
    0 1 0 5        0.07 0.70
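    A small NumPy sketch of this truncation (keep the top-$k$ singular triplets):

```python
import numpy as np

def low_rank(A, k):
    """Best rank-k approximation: zero out all but the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# e.g., low_rank(A, 2) on the users-to-movies matrix keeps the SciFi and
# Romance concepts and drops the weakest third concept.
```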


    SVD: Best Low Rank Approximation

    Theorem: Let $A = U \Sigma V^\top$ (rank$(A) = r$, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$), and let $B = U S V^\top$, where $S$ is the diagonal matrix with $s_i = \sigma_i$ ($i \le k$) and $s_i = 0$ ($i > k$).

    Then $B$ is the best rank-$k$ approximation to $A$:

    $B = \arg\min_{\operatorname{rank}(B) \le k} \|A - B\|_F$

    Intuition (spectral decomposition): $A = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$, and $B$ keeps only the first $k$ terms

    Why is setting small $\sigma_i$ to 0 the right thing to do? Vectors $\mathbf{u}_i$ and $\mathbf{v}_i$ are unit length, so $\sigma_i$ scales them. Therefore, zeroing small $\sigma_i$ introduces less error.


    SVD: Interpretation #2

    Q: How many singular values to keep?

    A: Rule of thumb: keep 80-90% of the energy, $\sum_i \sigma_i^2$

    $A \approx \sigma_1 \mathbf{u}_1 \mathbf{v}_1^\top + \sigma_2 \mathbf{u}_2 \mathbf{v}_2^\top + \cdots$

    Assume $\sigma_1 \ge \sigma_2 \ge \cdots$
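    A small NumPy helper implementing this rule of thumb:

```python
import numpy as np

def rank_for_energy(s, frac=0.9):
    """Smallest k whose top-k singular values retain `frac` of sum(s**2);
    assumes s is sorted in decreasing order."""
    s = np.asarray(s, dtype=float)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(min(np.searchsorted(energy, frac) + 1, len(s)))

# For s = [11.9, 7.1, 2.5], rank_for_energy(s, 0.9) returns 2.
```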


    SVD: Complexity

    SVD for a full $m \times n$ matrix costs $O(mn \cdot \min(m, n))$

    But faster:

      if we only want to compute the singular values

      or if we only want the first $k$ singular vectors (thin SVD)

      or if the matrix is sparse (sparse SVD)

    Stable implementations: LAPACK, Matlab, PROPACK

    Available in most common languages


    SVD: Conclusions so far

    SVD: $A = U \Sigma V^\top$: unique

      $U$: user-to-concept similarities

      $V$: movie-to-concept similarities

      $\Sigma$: strength of each concept

    Dimensionality reduction

      Keep the few largest singular values (80-90% of the energy)

      SVD picks up linear correlations


    SVD: Relationship to Eigen-decomposition

    SVD gives us $A = U \Sigma V^\top$

    Eigen-decomposition: $A = X \Lambda X^\top$

      $A$ is symmetric; $U$, $V$, $X$ are orthonormal ($U^\top U = I$); $\Lambda$, $\Sigma$ are diagonal

    Equivalence:

      $A A^\top = U \Sigma V^\top V \Sigma^\top U^\top = U \Sigma \Sigma^\top U^\top$

      $A^\top A = V \Sigma^\top \Sigma V^\top$

    This shows how to use eigen-decomposition to compute SVD

    And also: the eigenvalues of $A A^\top$ (and of $A^\top A$) are the squares of the singular values of $A$


    Online SVD (Brand, 2003)

    Challenges: storage and computation

    Idea: an incremental algorithm computes the principal eigenvectors of a matrix without storing the entire matrix in memory
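    A simplified sketch of the idea: fold each new column into the current thin SVD through a small core matrix and re-diagonalize. This is the flavor of Brand-style rank-one updating, not the paper's exact algorithm (no re-orthogonalization; names are illustrative):

```python
import numpy as np

def svd_append_column(U, s, Vt, c, rank):
    """Update the thin SVD (U, s, Vt) of a matrix when column c is appended."""
    m = U.T @ c                          # component of c inside current subspace
    p = c - U @ m                        # residual orthogonal to the subspace
    p_norm = np.linalg.norm(p)
    P = p / p_norm if p_norm > 1e-12 else np.zeros_like(p)

    r = len(s)                           # small (r+1)x(r+1) core absorbs c
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s)
    K[:r, r] = m
    K[r, r] = p_norm

    Uk, sk, Vkt = np.linalg.svd(K)       # cheap: K is tiny compared to the data
    U_new = np.hstack([U, P[:, None]]) @ Uk
    V_pad = np.vstack([Vt.T, np.zeros((1, r))])
    last = np.r_[np.zeros(Vt.shape[1]), 1.0][:, None]
    V_new = np.hstack([V_pad, last]) @ Vkt.T

    return U_new[:, :rank], sk[:rank], V_new[:, :rank].T  # truncate to rank
```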


    Online Unsupervised Learning

    Unsupervised learning: minimizing the reconstruction errors

    Online: rank-one update

    Pros and Cons

      Simple to implement

      Heuristic, but intuitively works

      Lack of theoretical guarantees

      Relatively poor performance


    Discussions and Open Issues

    How to learn from Big Data to tackle the 4V characteristics?

      Volume: size grows from terabytes to exabytes to zettabytes

      Velocity: batch data, real-time data, streaming data

      Variety: structured, semi-structured, unstructured

      Veracity: data inconsistency, ambiguities, deception


    One-slide Takeaway

    Online learning is a promising tool for big data analytics

    Many challenges exist

      Real-world scenarios: concept drifting, sparse data, high-dimensional data, uncertain/imprecise data, etc.

      More advanced online learning algorithms: faster convergence rate, less memory cost, etc.

      Parallel implementation or running on distributed platforms

    Toolbox: http://appsrv.cse.cuhk.edu.hk/~hqyang/doku.php?id=software

    Other Toolboxes

    MOA: Massive Online Analysis http://moa.cms.waikato.ac.nz/

    Vowpal Wabbit

    https://github.com/JohnLangford/vowpal_wabbit/wiki


    Q & A

    If you have any problems, please send emails to [email protected]!


    References

    M. Brand. Fast online SVD revisions for lightweight recommender systems. In SDM, 2003.

    N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM J. Comput., 34(3):640-668, 2005.

    K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.

    K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS, pages 414-422, 2009.

    K. Crammer and D. D. Lee. Learning via Gaussian herding. In NIPS, pages 451-459, 2010.

    K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951-991, 2003.

    M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In ICML, pages 264-271, 2008.

    J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009.

    C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213-242, 2001.

    M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. In NIPS, pages 856-864, 2010.


    S. C. H. Hoi, J. Wang, and P. Zhao. Exact soft confidence-weighted learning. In ICML, 2012.

    J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.

    Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361-387, 2002.

    P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal., 16(6):964-979, 1979.

    J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, page 87, 2009.

    Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, Center for Operations Research and Econometrics, 2007.

    F. Orabona and K. Crammer. New adaptive algorithms for online classification. In NIPS, pages 1840-1848, 2010.

    F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408, 1958.


    C. Wang, J. W. Paisley, and D. M. Blei. Online variational inference for the hierarchical Dirichlet process. Journal of Machine Learning Research - Proceedings Track, 15:752-760, 2011.

    M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In NIPS, pages 1481-1488, 2006.

    S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479-2493, 2009.

    L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543-2596, 2010.

    H. Yang, M. R. Lyu, and I. King. Efficient online learning for multi-task feature selection. ACM Transactions on Knowledge Discovery from Data, 2013.

    H. Yang, Z. Xu, I. King, and M. R. Lyu. Online learning for group lasso. In ICML, pages 1191-1198, Haifa, Israel, 2010.

    L. Yang, R. Jin, and J. Ye. Online learning by ellipsoid method. In ICML, page 145, 2009.

    P. Zhao, S. C. H. Hoi, and R. Jin. DUOL: A double updating approach for online learning. In NIPS, 2009.