Online Learning for Big Data Analytics



    Online Learning for Big Data Analytics

    Irwin King, Michael R. Lyu and Haiqin Yang

    Department of Computer Science & Engineering

    The Chinese University of Hong Kong

    Tutorial presentation at IEEE Big Data, Santa Clara, CA, 2013


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    What is Big Data?

    There is no consensus on how to define Big Data

    "A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." - Wikipedia

    "Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population." - Teradata Magazine article, 2011

    "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." - The McKinsey Global Institute, 2011


    What is Big Data?

    [Figure: data growth measured by file/object size and content volume vs. activity (IOPS)]

    Big Data refers to datasets that grow so large and complex that it is difficult to capture, store, manage, share, analyze, and visualize them within current computational architectures.


    Evolution of Big Data

    Birth: 1880 US census

    Adolescence: Big Science

    Modern Era: Big Business


    Birth: 1880 US census


    The First Big Data Challenge

    1880 census

    50 million people

    Age, gender (sex), occupation, education level, no. of insane people in household


    Manhattan Project (1942 - 1946)

    $2 billion (approx. $26 billion in 2013 dollars)

    Catalyst for Big Science


    Space Program (1960s)

    Began in the late 1950s

    An active area of big data nowadays


    Adolescence: Big Science


    Big Science

    The International Geophysical Year

      An international scientific project

      Lasted from Jul. 1, 1957 to Dec. 31, 1958

      A synoptic collection of observational data on a global scale

    Implications

      Big budgets, big staffs, big machines, big laboratories


    Summary of Big Science

    Laid foundation for ambitious projects

    International Biological Program

    Long Term Ecological Research Network

    Ended in 1974; many participants viewed it as a failure

    Nevertheless, it was a success

      Transformed the way of processing data

      Realized original incentives

      Provided a renewed legitimacy for synoptic data collection


    Lessons from Big Science

    Spawned new big data projects

      Weather prediction

      Physics research (supercollider data analytics)

      Astronomy images (planet detection)

      Medical research (drug interaction)

    Businesses latched onto its techniques, methodologies, and objectives


    Modern Era: Big Business


    Big Data is Everywhere!

    Lots of data is being collected and warehoused

      Science experiments

      Web data, e-commerce

      Purchases at department/grocery stores

      Bank/credit card transactions

      Social networks


    Big Data in Science

    CERN Large Hadron Collider

      ~10 PB/year at start; ~1000 PB in ~10 years

      2500 physicists collaborating

    Large Synoptic Survey Telescope (NSF, DOE, and private donors)

      ~5-10 PB/year at start in 2012; ~100 PB by 2025

    Pan-STARRS (Haleakala, Hawaii), US Air Force

      now: 800 TB/year; soon: 4 PB/year


    Big Data from Different Sources

    12+ TB of tweet data every day

    25+ TB of log data every day

    ? TB of data every day

    2+ billion people on the Web by end 2011

    30 billion RFID tags today (1.3B in 2005)

    4.6 billion camera phones worldwide

    100s of millions of GPS-enabled devices sold annually

    76 million smart meters in 2009; 200M by 2014


    Big Data in Business Sectors

    Source: McKinsey Global Institute


    Characteristics of Big Data

    4V: Volume, Velocity, Variety, Veracity

    Volume: 8 billion TB in 2015, 40 ZB in 2020; 5.2 TB per person

    Variety: text, videos, images, audio

    Velocity: e.g., new sharing of over 2.5 billion items per day, new data of over 500 TB per day

    Veracity: e.g., the headline "US predicts powerful quake this week; UAE body says it's a rumour" (By Wam/Agencies, published Saturday, April 20, 2013)


    Big Data Analytics

    Definition: a process of inspecting, cleaning, transforming, and modeling big data with the goal of discovering useful information, suggesting conclusions, and supporting decision making

    Connection to data mining

      Analytics includes both data analysis (mining) and communication (guiding decision making)

      Analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Learning Techniques Overview

    Learning paradigms

    Supervised learning

    Semi-supervised learning

    Transductive learning

    Unsupervised learning

    Universum learning

    Transfer learning


    What is Online Learning?

    Batch/offline learning

      Observe a batch of training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$

      Learn a model from them

      Predict new samples accurately

    Online learning

      Observe a sequence of data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_t, y_t), \ldots$

      Learn a model incrementally as instances come

      Make the sequence of online predictions accurately

    [Diagram: the learner makes a prediction, the user reveals the true response, and the model is updated]


    Online Prediction Algorithm

    An initial prediction rule $f_0(\cdot)$

    For $t = 1, 2, \ldots$

      We observe $\mathbf{x}_t$ and make a prediction $f_{t-1}(\mathbf{x}_t)$

      We observe the true outcome $y_t$ and then compute a loss $l(f_{t-1}(\mathbf{x}_t), y_t)$

      The online algorithm updates the prediction rule using the new example and constructs $f_t(\cdot)$
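    In code, this protocol is just a loop. A minimal Python sketch, with a hypothetical `model` object exposing predict(x) and update(x, y) (both names are illustrative, not from the tutorial):

```python
def online_protocol(model, stream, loss):
    """Run the online loop: predict with f_{t-1}, observe y_t and suffer
    the loss, then update the model to obtain f_t."""
    total = 0.0
    for x, y in stream:
        y_hat = model.predict(x)   # prediction f_{t-1}(x_t)
        total += loss(y_hat, y)    # loss l(f_{t-1}(x_t), y_t)
        model.update(x, y)         # construct f_t from the new example
    return total
```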


    Online Prediction Algorithm

    The total error of the method is $\sum_{t=1}^{T} l(f_{t-1}(\mathbf{x}_t), y_t)$

    Goal: make this error as small as possible

    Predict the unknown future one step at a time: similar to generalization error


    Regret Analysis

    $f^*(\cdot)$: optimal prediction function from a class $H$, e.g., the class of linear classifiers, with minimum error after seeing all examples:

    $f^* = \arg\min_{f \in H} \sum_{t=1}^{T} l(f(\mathbf{x}_t), y_t)$

    Regret for the online learning algorithm:

    $\text{regret} = \frac{1}{T} \sum_{t=1}^{T} \left[ l(f_{t-1}(\mathbf{x}_t), y_t) - l(f^*(\mathbf{x}_t), y_t) \right]$

    We want the regret to be as small as possible


    Why Low Regret?

    Regret for the online learning algorithm:

    $\text{regret} = \frac{1}{T} \sum_{t=1}^{T} \left[ l(f_{t-1}(\mathbf{x}_t), y_t) - l(f^*(\mathbf{x}_t), y_t) \right]$

    Advantages

      We do not lose much from not knowing future events

      We can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight

      We can also compete with a changing environment


    Advantages of Online Learning

    Meets many applications where data arrive sequentially while predictions are required on-the-fly

    Avoids re-training when adding new data

    Applicable in adversarial and competitive environments

    Strong adaptability to changing environments

    High efficiency and excellent scalability

    Simple to understand and easy to implement

    Easy to parallelize

    Theoretical guarantees


    Where to Apply Online Learning?

    Online learning: social media, Internet security, finance


    Online Learning for Social Media

    Recommendation, sentiment/emotion analysis


    Online Learning for Internet Security

    Electronic business sectors

      Spam email filtering

      Fraudulent credit card transaction detection

      Network intrusion detection systems, etc.


    Online Learning for Financial Decision

    Financial decision

    Online portfolio selection

    Sequential investment, etc.


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Perceptron Algorithm (F. Rosenblatt 1958)

    Goal: find a linear classifier with small error
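    A minimal NumPy sketch of the Perceptron (assuming labels in {-1, +1} and a stream of (x, y) pairs; the function name and signature are illustrative):

```python
import numpy as np

def perceptron(stream, dim, eta=1.0):
    """On each mistake, move the weights toward the example: w <- w + eta*y*x."""
    w = np.zeros(dim)
    for x, y in stream:                      # y in {-1, +1}
        y_hat = 1.0 if w @ x >= 0 else -1.0  # linear classifier sgn(w'x)
        if y_hat != y:                       # misclassification triggers an update
            w += eta * y * x
    return w
```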


    Geometric View

    [Figure: the initial weight vector w0 misclassifies (x1, y1 = +1)]


    Geometric View

    [Figure: after the update w1 = w0 + y1x1, (x1, y1 = +1) is correctly classified]


    Geometric View

    [Figure: w1 misclassifies the next example (x2, y2 = -1)]


    Geometric View

    [Figure: after the update w2 = w1 + y2x2, both examples are correctly classified]


    Perceptron Mistake Bound

    Consider $\mathbf{w}^*$ that separates the data: $y_i (\mathbf{w}^{*\top} \mathbf{x}_i) > 0$ for all $i$

    Define the margin

    $\gamma = \min_i \frac{y_i (\mathbf{w}^{*\top} \mathbf{x}_i)}{\|\mathbf{w}^*\|_2 \, \sup_i \|\mathbf{x}_i\|_2}$

    The number of mistakes the Perceptron makes is at most $1/\gamma^2$


    Proof of Perceptron Mistake Bound

    [Novikoff, 1963]

    Proof: Let $\mathbf{v}_k$ be the hypothesis before the $k$-th mistake, with $\mathbf{v}_1 = \mathbf{0}$. Assume that the $k$-th mistake occurs on the input example $(\mathbf{x}_i, y_i)$, and let $R = \sup_i \|\mathbf{x}_i\|_2$.

    First, $\mathbf{v}_{k+1}^\top \mathbf{w}^* = (\mathbf{v}_k + y_i \mathbf{x}_i)^\top \mathbf{w}^* \ge \mathbf{v}_k^\top \mathbf{w}^* + \gamma R \|\mathbf{w}^*\|_2$, so by induction $\mathbf{v}_{k+1}^\top \mathbf{w}^* \ge k \gamma R \|\mathbf{w}^*\|_2$.

    Second, $\|\mathbf{v}_{k+1}\|_2^2 = \|\mathbf{v}_k\|_2^2 + 2 y_i \mathbf{v}_k^\top \mathbf{x}_i + \|\mathbf{x}_i\|_2^2 \le \|\mathbf{v}_k\|_2^2 + R^2$ (the middle term is non-positive on a mistake), so $\|\mathbf{v}_{k+1}\|_2^2 \le k R^2$.

    Hence, $k \gamma R \|\mathbf{w}^*\|_2 \le \mathbf{v}_{k+1}^\top \mathbf{w}^* \le \|\mathbf{v}_{k+1}\|_2 \|\mathbf{w}^*\|_2 \le \sqrt{k} R \|\mathbf{w}^*\|_2$, which gives $k \le 1/\gamma^2$.


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Online Non-Sparse Learning

    First-order learning methods

      Online gradient descent (Zinkevich, 2003)

      Passive-aggressive learning (Crammer et al., 2006)

    Others (including but not limited to)

      ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001)

      ROMMA: Relaxed Online Maximum Margin Algorithm (Li and Long, 2002)

      MIRA: Margin Infused Relaxed Algorithm (Crammer and Singer, 2003)

      DUOL: A Double Updating Approach for Online Learning (Zhao et al., 2009)


    Online Gradient Descent (OGD)

    Online convex optimization (Zinkevich, 2003)

    Consider a convex objective function $f(\mathbf{w})$, where $\mathbf{w} \in \mathcal{K}$ and $\mathcal{K}$ is a bounded convex set

    Update by Online Gradient Descent (OGD) or Stochastic Gradient Descent (SGD):

    $\mathbf{w}_{t+1} = \Pi_{\mathcal{K}} \left( \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t) \right)$

    where $\eta_t$ is a learning rate and $\Pi_{\mathcal{K}}$ denotes projection onto $\mathcal{K}$


    Online Gradient Descent (OGD)

    For $t = 1, 2, \ldots$

      An unlabeled sample $\mathbf{x}_t$ arrives

      Make a prediction based on existing weights: $\hat{y}_t = \operatorname{sgn}(\mathbf{w}_t^\top \mathbf{x}_t)$

      Observe the true class label $y_t \in \{+1, -1\}$

      Update the weights by $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla l(\mathbf{w}_t; (\mathbf{x}_t, y_t))$, where $\eta_t$ is a learning rate
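    A minimal sketch of this loop for the hinge loss, assuming a fixed learning rate and omitting the projection step:

```python
import numpy as np

def ogd_hinge(stream, dim, eta=0.1):
    """Online gradient descent on the hinge loss max(0, 1 - y*w'x)."""
    w = np.zeros(dim)
    for x, y in stream:              # y in {-1, +1}
        if y * (w @ x) < 1.0:        # hinge loss active: subgradient is -y*x
            w += eta * y * x         # w <- w - eta * subgradient
    return w
```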


    Passive Aggressive Online Learning

    Closed-form solutions can be derived: $\mathbf{w}_{t+1} = \mathbf{w}_t + \tau_t y_t \mathbf{x}_t$, with hinge loss $l_t = \max\{0, 1 - y_t \mathbf{w}_t^\top \mathbf{x}_t\}$ and

      PA: $\tau_t = \frac{l_t}{\|\mathbf{x}_t\|^2}$

      PA-I: $\tau_t = \min\left\{ C, \frac{l_t}{\|\mathbf{x}_t\|^2} \right\}$

      PA-II: $\tau_t = \frac{l_t}{\|\mathbf{x}_t\|^2 + \frac{1}{2C}}$
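    A short sketch of the PA-I variant following the closed form above:

```python
import numpy as np

def pa1(stream, dim, C=1.0):
    """Passive-Aggressive I: closed-form update with aggressiveness cap C."""
    w = np.zeros(dim)
    for x, y in stream:                        # y in {-1, +1}
        loss = max(0.0, 1.0 - y * (w @ x))     # hinge loss l_t
        if loss > 0.0:
            tau = min(C, loss / (x @ x))       # PA-I step size tau_t
            w += tau * y * x
    return w
```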


    Online Non-Sparse Learning

    First-order methods

      Learn a linear weight vector (first-order information) of the model

    Pros and Cons

    Simple and easy to implement

    Efficient and scalable for high-dimensional data

    Relatively slow convergence rate


    Online Non-Sparse Learning

    Second-order online learning methods

      Update the weight vector w by maintaining and exploring both first-order and second-order information

    Some representative methods (including but not limited to)

      SOP: Second Order Perceptron (Cesa-Bianchi et al., 2005)

      CW: Confidence Weighted learning (Dredze et al., 2008)

      AROW: Adaptive Regularization of Weights (Crammer et al., 2009)

      IELLIP: Online Learning by Ellipsoid Method (Yang et al., 2009)

      NHERD: Gaussian Herding (Crammer & Lee, 2010)

      NAROW: New variant of the AROW algorithm (Orabona & Crammer, 2010)

      SCW: Soft Confidence Weighted (Hoi et al., 2012)


    Online Non-Sparse Learning

    Second-order online learning methods

      Learn the weight vector w by maintaining and exploring both first-order and second-order information

    Pros and Cons

      Faster convergence rate

      Expensive for high-dimensional data

      Relatively sensitive to noise


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Online Sparse Learning

    Motivation

      Space constraint: RAM overflow

      Test-time constraint

    How to induce sparsity in the weights of online learning algorithms?

    Some representative work

      Truncated gradient (Langford et al., 2009)

      FOBOS: Forward Looking Subgradients (Duchi and Singer, 2009)

      Dual averaging (Xiao, 2010)

      etc.


    Truncated Gradient (Langford et al., 2009)

    Objective function: $\min_{\mathbf{w}} \frac{1}{T} \sum_{t=1}^{T} L(\mathbf{w}; \mathbf{z}_t) + g \|\mathbf{w}\|_1$

    Stochastic gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t; \mathbf{z}_t)$

    Simple coefficient rounding: every $K$ steps, round coefficients smaller than a threshold to zero

    Truncated gradient: impose sparsity by modifying the stochastic gradient descent update


    Truncated Gradient (Langford et al., 2009)

    [Figure: simple coefficient rounding vs. the less aggressive truncation function]


    Truncated Gradient (Langford et al., 2009)

    The amount of shrinkage is measured by a gravity parameter

    The truncation can be performed every $K$ online steps

    When the gravity parameter is zero, the update rule is identical to the standard SGD

    Loss functions:

      Logistic

      SVM (hinge)

      Least squares
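    A sketch of the update with the truncation operator (`grad` is a hypothetical user-supplied function returning the loss gradient at (w, x, y)):

```python
import numpy as np

def truncate(w, alpha, theta):
    """Shrink coefficients with |w_j| <= theta toward zero by alpha."""
    out = w.copy()
    small = np.abs(w) <= theta
    out[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - alpha, 0.0)
    return out

def truncated_gradient(stream, dim, grad, eta=0.1, g=0.01, K=10, theta=np.inf):
    """SGD with truncation applied every K steps; gravity accumulates over K."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        w -= eta * grad(w, x, y)
        if t % K == 0:
            w = truncate(w, K * eta * g, theta)
    return w
```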


    Truncated Gradient (Langford et al., 2009)

    Regret bound: the truncated-gradient update retains a sublinear, online-gradient-descent-style regret guarantee


    FOBOS (Duchi & Singer, 2009)

    Objective function: $\min_{\mathbf{w}} f(\mathbf{w}) + r(\mathbf{w})$ (loss plus regularizer)

    Repeat:

      I. Unconstrained (stochastic sub)gradient step on the loss

      II. Incorporate the regularization

    Similar to forward-backward splitting (Lions and Mercier, 1979)

    Composite gradient methods (Wright et al., 2009; Nesterov, 2007)


    FOBOS: Step I

    Objective function: $\min_{\mathbf{w}} f(\mathbf{w}) + r(\mathbf{w})$

    Unconstrained (stochastic sub)gradient step on the loss:

    $\mathbf{w}_{t+\frac{1}{2}} = \mathbf{w}_t - \eta_t \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f(\mathbf{w}_t)$


    FOBOS: Step II

    Objective function: $\min_{\mathbf{w}} f(\mathbf{w}) + r(\mathbf{w})$

    Incorporate the regularization:

    $\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \tfrac{1}{2} \|\mathbf{w} - \mathbf{w}_{t+\frac{1}{2}}\|^2 + \eta_{t+\frac{1}{2}} r(\mathbf{w}) \right\}$
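    For the $\ell_1$ regularizer $r(\mathbf{w}) = \lambda \|\mathbf{w}\|_1$, step II has a closed-form soft-thresholding solution; a short sketch (again with a hypothetical `grad` function):

```python
import numpy as np

def fobos_l1(stream, dim, grad, eta=0.1, lam=0.01):
    """FOBOS with L1: gradient step (I), then soft-thresholding (II)."""
    w = np.zeros(dim)
    for x, y in stream:
        w_half = w - eta * grad(w, x, y)                                   # step I
        w = np.sign(w_half) * np.maximum(np.abs(w_half) - eta * lam, 0.0)  # step II
    return w
```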


    Forward Looking Property

    The optimum satisfies

    $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{g}_t^f - \eta_{t+\frac{1}{2}} \mathbf{g}_{t+1}^r$

    where $\mathbf{g}_t^f \in \partial f(\mathbf{w}_t)$ and $\mathbf{g}_{t+1}^r \in \partial r(\mathbf{w}_{t+1})$

    Current subgradient of the loss, forward subgradient of the regularization


    Batch Convergence and Online Regret

    Set the step size appropriately (e.g., $\eta_t \propto 1/\sqrt{t}$) to obtain batch convergence

    Online (average) regret bounds of $O(1/\sqrt{T})$


    High Dimensional Efficiency

    Input space is sparse but huge

    Need to perform lazy updates to the weight vector

    Proposition: the lazy and the eager updates are equivalent


    Dual Averaging (Xiao, 2010)

    Objective function: $\min_{\mathbf{w}} \frac{1}{t} \sum_{i=1}^{t} f_i(\mathbf{w}) + \Psi(\mathbf{w})$

    Problem: truncated gradient doesn't produce truly sparse weights due to the small learning rate

    Fix: dual averaging, which keeps two state representations: the parameter $\mathbf{w}_t$ and the average gradient vector

    $\bar{\mathbf{g}}_t = \frac{1}{t} \sum_{i=1}^{t} \nabla f_i(\mathbf{w}_i)$


    Dual Averaging (Xiao, 2010)

    The update

    $\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \bar{\mathbf{g}}_t^\top \mathbf{w} + \Psi(\mathbf{w}) + \frac{\beta_t}{t} h(\mathbf{w}) \right\}$

    has an entry-wise closed-form solution for $\ell_1$ regularization

    Advantage: sparsity on the weight

    Disadvantage: keeps a non-sparse average subgradient

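    A sketch of $\ell_1$-regularized dual averaging under the update above (with $h(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ and $\beta_t = \gamma \sqrt{t}$; the truncation threshold $\lambda$ stays constant instead of shrinking with the step size, which is what yields truly sparse weights):

```python
import numpy as np

def rda_l1(stream, dim, grad, lam=0.01, gamma=1.0):
    """Entry-wise closed form of L1 dual averaging (a sketch)."""
    w = np.zeros(dim)
    gbar = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        gbar += (grad(w, x, y) - gbar) / t            # running average gradient
        shrunk = np.maximum(np.abs(gbar) - lam, 0.0)  # |gbar_j| <= lam  =>  w_j = 0
        w = -(np.sqrt(t) / gamma) * np.sign(gbar) * shrunk
    return w
```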

    Convergence and Regret

    Average regret: $O(1/\sqrt{T})$

    Theoretical bound: similar to gradient descent


    Comparison

    FOBOS

      Subgradient; local Bregman divergence

      Coefficient $1/\eta_t \sim \sqrt{t}$

      Equivalent to the TG method when truncation is performed every step

    Dual Averaging

      Average subgradient; global proximal function

      Coefficient $\beta_t / t \sim 1/\sqrt{t}$


    Comparison

    [Figure: sparsity of the TG vs. RDA solutions under two regularization settings]


    Variants of Online Sparse Learning

    Models

      Online feature selection (OFS): a variant of sparse online learning

        The key difference is that OFS focuses on selecting a fixed subset of features in the online learning process

        Could be used as an alternative tool for batch feature selection when dealing with big data

    Other existing work

      Online learning for group lasso (Yang et al., 2010) and online learning for multi-task feature selection (Yang et al., 2013), to select features in a group manner or features among similar tasks


    Online Sparse Learning

    Objective

      Induce sparsity in the weights of online learning algorithms

    Pros and Cons

      Simple and easy to implement

      Efficient and scalable for high-dimensional data

      Relatively slow convergence rate

      No perfect way to attain a sparse solution yet


    Outline

    Introduction (60 min.)

      Big data and big data analytics (30 min.)

      Online learning and its applications (30 min.)

    Online Learning Algorithms (60 min.)

      Perceptron (10 min.)

      Online non-sparse learning (10 min.)

      Online sparse learning (20 min.)

      Online unsupervised learning (20 min.)

    Discussions + Q & A (5 min.)


    Online Unsupervised Learning

    Assumption: data are generated from some underlying parametric probabilistic density function

    Goal: estimate the parameters of the density to give a suitable compact representation

    Typical work

      Online singular value decomposition (SVD) (Brand, 2003)

    Others (including but not limited to)

      Online principal component analysis (PCA) (Warmuth and Kuzmin, 2006)

      Online dictionary learning for sparse coding (Mairal et al., 2009)

      Online learning for latent Dirichlet allocation (LDA) (Hoffman et al., 2010)

      Online variational inference for the hierarchical Dirichlet process (HDP) (Wang et al., 2011)

      ...


    SVD: Definition

    $A_{[m \times n]} = U_{[m \times r]} \, \Sigma_{[r \times r]} \, V^\top_{[r \times n]}$

    $A$: input data matrix, $m \times n$ (e.g., $m$ documents, $n$ terms)

    $U$: left singular vectors, $m \times r$ ($m$ documents, $r$ topics)

    $\Sigma$: diagonal matrix of singular values, $r \times r$ (strength of each topic)

    $r$: rank of the matrix $A$

    $V$: right singular vectors, $n \times r$ ($n$ terms, $r$ topics)

    Equivalently, $A = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$, where each $\sigma_i$ is a scalar and $\mathbf{u}_i$, $\mathbf{v}_i$ are vectors
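    The definition is easy to check numerically; a small NumPy example using the users-to-movies matrix from the following slides:

```python
import numpy as np

A = np.array([[1, 3, 1, 0], [3, 4, 3, 1], [4, 2, 4, 0], [5, 4, 5, 1],
              [0, 1, 0, 4], [0, 0, 0, 3], [0, 1, 0, 5]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD
print(np.round(s, 1))                              # singular values, decreasing
assert np.allclose(A, U @ np.diag(s) @ Vt)         # exact reconstruction
```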

  • 8/14/2019 Online Learning for Big Data Analytics

    80/116

    S i

  • 8/14/2019 Online Learning for Big Data Analytics

    81/116

    SVD Properties

    It is always possible to do SVD, i.e., decompose a matrix $A$ into $A = U \Sigma V^\top$, where

      $U$, $\Sigma$, $V$: unique

      $U$, $V$: column orthonormal, $U^\top U = I$, $V^\top V = I$ ($I$: identity matrix)

      $\Sigma$: diagonal

        Entries (singular values) are non-negative and sorted in decreasing order ($\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$)


    SVD: Example, Users-to-Movies

    Ratings matrix (rows: users; columns: The Avengers, Star Wars, Twilight, Matrix; the first four users are SciFi fans, the last three Romance fans):

    1 3 1 0
    3 4 3 1
    4 2 4 0
    5 4 5 1
    0 1 0 4
    0 0 0 3
    0 1 0 5

    Topics or concepts, a.k.a. latent dimensions or latent factors


    SVD: Example, Users-to-Movies

    A = U x Sigma x V^T:

    1 3 1 0        0.24 0.02 0.69
    3 4 3 1        0.48 0.02 0.43
    4 2 4 0        0.49 0.08 0.52      11.9  0   0       0.59 0.54 0.59 0.05
    5 4 5 1   =    0.68 0.07 0.16   x   0   7.1  0   x   0.10 0.12 0.10 0.98
    0 1 0 4        0.06 0.57 0.06       0    0  2.5      0.37 0.83 0.37 0.17
    0 0 0 3        0.01 0.41 0.20
    0 1 0 5        0.07 0.70 0.01

    The first concept is the SciFi-concept; the second is the Romance-concept


    SVD: Example, Users-to-Movies

    U is the user-to-concept similarity matrix


    SVD: Example, Users-to-Movies

    The singular values give the strength of each concept (here 11.9 for the SciFi-concept)


    SVD: Example, Users-to-Movies

    V is the movie-to-concept similarity matrix


    SVD: Interpretation #1

    Users, movies, and concepts:

      $U$: user-to-concept similarity matrix

      $V$: movie-to-concept similarity matrix

      $\Sigma$: its diagonal elements give the strength of each concept


    SVD: Interpretation #2

    SVD gives the best axis to project on

    "Best" = minimal sum of squares of projection errors

    In other words, minimum reconstruction error


    SVD: Interpretation #2

    Example: in the factorization above, the projection onto the first axis $\mathbf{v}_1$ has the largest variance (spread)


    SVD: Interpretation #2

    Example: $U \Sigma$ gives the coordinates of the points (users) on the projection axes; the first column is the projection of the users on the SciFi axis:

    2.86 0.24 8.21
    5.71 0.24 5.12
    5.83 0.95 6.19
    8.09 0.83 1.90
    0.71 6.78 0.71
    0.12 4.88 2.38
    0.83 8.33 0.12


    SVD: Interpretation #2

    Q: How exactly is dimension reduction done?


    SVD: Interpretation #2

    Q: How exactly is dimension reduction done?

    A: Set the smallest singular values to zero


    SVD: Interpretation #2

    Q: How exactly is dimension reduction done?

    A: Set the smallest singular values to zero

    Approximate the original matrix by low-rank matrices


    SVD: Interpretation #2

    Q: How exactly is dimension reduction done?

    A: Set the smallest singular values to zero

    Approximate the original matrix by low-rank matrices:

    1 3 1 0        0.24 0.02
    3 4 3 1        0.48 0.02
    4 2 4 0        0.49 0.08      11.9  0       0.59 0.54 0.59 0.05
    5 4 5 1   ~    0.68 0.07   x   0   7.1  x   0.10 0.12 0.10 0.98
    0 1 0 4        0.06 0.57
    0 0 0 3        0.01 0.41
    0 1 0 5        0.07 0.70
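    A small NumPy sketch of this truncation (keep the top-$k$ singular triplets):

```python
import numpy as np

def low_rank(A, k):
    """Best rank-k approximation: zero out all but the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# e.g., low_rank(A, 2) on the users-to-movies matrix keeps the SciFi and
# Romance concepts and drops the weakest third concept.
```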


    SVD: Best Low Rank Approximation

    Theorem: Let $A = U \Sigma V^\top$ (rank$(A) = r$, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$), and let $B = U S V^\top$, where $S$ is the diagonal matrix with $s_i = \sigma_i$ ($i \le k$) and $s_i = 0$ ($i > k$).

    Then $B$ is the best rank-$k$ approximation to $A$:

    $B = \arg\min_{\operatorname{rank}(B) \le k} \|A - B\|_F$

    Intuition (spectral decomposition): $A = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$, and $B$ keeps only the first $k$ terms

    Why is setting small $\sigma_i$ to 0 the right thing to do? Vectors $\mathbf{u}_i$ and $\mathbf{v}_i$ are unit length, so $\sigma_i$ scales them. Therefore, zeroing small $\sigma_i$ introduces less error.


    SVD: Interpretation #2

    Q: How many singular values to keep?

    A: Rule of thumb: keep 80-90% of the energy, $\sum_i \sigma_i^2$

    $A \approx \sigma_1 \mathbf{u}_1 \mathbf{v}_1^\top + \sigma_2 \mathbf{u}_2 \mathbf{v}_2^\top + \cdots$

    Assume $\sigma_1 \ge \sigma_2 \ge \cdots$
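    A small NumPy helper implementing this rule of thumb:

```python
import numpy as np

def rank_for_energy(s, frac=0.9):
    """Smallest k whose top-k singular values retain `frac` of sum(s**2);
    assumes s is sorted in decreasing order."""
    s = np.asarray(s, dtype=float)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(min(np.searchsorted(energy, frac) + 1, len(s)))

# For s = [11.9, 7.1, 2.5], rank_for_energy(s, 0.9) returns 2.
```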


    SVD: Complexity

    SVD for a full $m \times n$ matrix costs $O(mn \cdot \min(m, n))$

    But faster:

      if we only want to compute the singular values

      or if we only want the first $k$ singular vectors (thin SVD)

      or if the matrix is sparse (sparse SVD)

    Stable implementations: LAPACK, Matlab, PROPACK

    Available in most common languages


    SVD: Conclusions so far

    SVD: $A = U \Sigma V^\top$: unique

      $U$: user-to-concept similarities

      $V$: movie-to-concept similarities

      $\Sigma$: strength of each concept

    Dimensionality reduction

      Keep the few largest singular values (80-90% of the energy)

      SVD picks up linear correlations


    SVD: Relationship to Eigen-decomposition

    SVD gives us $A = U \Sigma V^\top$

    Eigen-decomposition: $A = X \Lambda X^\top$

      $A$ is symmetric; $U$, $V$, $X$ are orthonormal ($U^\top U = I$); $\Lambda$, $\Sigma$ are diagonal

    Equivalence:

      $A A^\top = U \Sigma V^\top V \Sigma^\top U^\top = U \Sigma \Sigma^\top U^\top$

      $A^\top A = V \Sigma^\top \Sigma V^\top$

    This shows how to use eigen-decomposition to compute SVD

    And also: the eigenvalues of $A A^\top$ (and of $A^\top A$) are the squares of the singular values of $A$


    Online SVD (Brand, 2003)

    Challenges: storage and computation

    Idea: an incremental algorithm computes the principal eigenvectors of a matrix without storing the entire matrix in memory
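    A simplified sketch of the idea: fold each new column into the current thin SVD through a small core matrix and re-diagonalize. This is the flavor of Brand-style rank-one updating, not the paper's exact algorithm (no re-orthogonalization; names are illustrative):

```python
import numpy as np

def svd_append_column(U, s, Vt, c, rank):
    """Update the thin SVD (U, s, Vt) of a matrix when column c is appended."""
    m = U.T @ c                          # component of c inside current subspace
    p = c - U @ m                        # residual orthogonal to the subspace
    p_norm = np.linalg.norm(p)
    P = p / p_norm if p_norm > 1e-12 else np.zeros_like(p)

    r = len(s)                           # small (r+1)x(r+1) core absorbs c
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s)
    K[:r, r] = m
    K[r, r] = p_norm

    Uk, sk, Vkt = np.linalg.svd(K)       # cheap: K is tiny compared to the data
    U_new = np.hstack([U, P[:, None]]) @ Uk
    V_pad = np.vstack([Vt.T, np.zeros((1, r))])
    last = np.r_[np.zeros(Vt.shape[1]), 1.0][:, None]
    V_new = np.hstack([V_pad, last]) @ Vkt.T

    return U_new[:, :rank], sk[:rank], V_new[:, :rank].T  # truncate to rank
```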


    Online Unsupervised Learning

    Unsupervised learning: minimizing the reconstruction errors

    Online: rank-one update

    Pros and Cons

      Simple to implement

      Heuristic, but intuitively works

      Lack of theoretical guarantees

      Relatively poor performance


    Discussions and Open Issues

    How to learn from Big Data to tackle the 4V characteristics?

      Volume: size grows from terabytes to exabytes to zettabytes

      Velocity: batch data, real-time data, streaming data

      Variety: structured, semi-structured, unstructured

      Veracity: data inconsistency, ambiguities, deception


    One-slide Takeaway

    Online learning is a promising tool for big data analytics

    Many challenges exist

      Real-world scenarios: concept drifting, sparse data, high-dimensional data, uncertain/imprecise data, etc.

      More advanced online learning algorithms: faster convergence rate, less memory cost, etc.

      Parallel implementation or running on distributed platforms

    Toolbox: http://appsrv.cse.cuhk.edu.hk/~hqyang/doku.php?id=software

    Other Toolboxes

    MOA: Massive Online Analysis http://moa.cms.waikato.ac.nz/

    Vowpal Wabbit

    https://github.com/JohnLangford/vowpal_wabbit/wiki


    Q & A

    If you have any problems, please send emails to [email protected]!


    References

    M. Brand. Fast online SVD revisions for lightweight recommender systems. In SDM, 2003.

    N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM J. Comput., 34(3):640-668, 2005.

    K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.

    K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS, pages 414-422, 2009.

    K. Crammer and D. D. Lee. Learning via Gaussian herding. In NIPS, pages 451-459, 2010.

    K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951-991, 2003.

    M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In ICML, pages 264-271, 2008.

    J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009.

    C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213-242, 2001.

    M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. In NIPS, pages 856-864, 2010.


    S. C. H. Hoi, J. Wang, and P. Zhao. Exact soft confidence-weighted learning. In ICML, 2012.

    J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.

    Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361-387, 2002.

    P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal., 16(6):964-979, 1979.

    J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, page 87, 2009.

    Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, Center for Operations Research and Econometrics, 2007.

    F. Orabona and K. Crammer. New adaptive algorithms for online classification. In NIPS, pages 1840-1848, 2010.

    F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408, 1958.


    C. Wang, J. W. Paisley, and D. M. Blei. Online variational inference for the hierarchical Dirichlet process. Journal of Machine Learning Research - Proceedings Track, 15:752-760, 2011.

    M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In NIPS, pages 1481-1488, 2006.

    S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479-2493, 2009.

    L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543-2596, 2010.

    H. Yang, M. R. Lyu, and I. King. Efficient online learning for multi-task feature selection. ACM Transactions on Knowledge Discovery from Data, 2013.

    H. Yang, Z. Xu, I. King, and M. R. Lyu. Online learning for group lasso. In ICML, pages 1191-1198, Haifa, Israel, 2010.

    L. Yang, R. Jin, and J. Ye. Online learning by ellipsoid method. In ICML, page 145, 2009.

    P. Zhao, S. C. H. Hoi, and R. Jin. DUOL: A double updating approach for online learning. In NIPS, 2009.