Online Learning for Big Data Analytics
TRANSCRIPT
-
8/14/2019 Online Learning for Big Data Analytics
1/116
Online Learning for Big Data
Analytics
Irwin King, Michael R. Lyu, and Haiqin Yang
Department of Computer Science & Engineering
The Chinese University of Hong Kong
Tutorial presentation at IEEE Big Data, Santa Clara, CA, 2013
Outline
Introduction (60 min.)
  Big data and big data analytics (30 min.)
  Online learning and its applications (30 min.)
Online Learning Algorithms (60 min.)
  Perceptron (10 min.)
  Online non-sparse learning (10 min.)
  Online sparse learning (20 min.)
  Online unsupervised learning (20 min.)
Discussions + Q & A (5 min.)
What is Big Data?
There is no consensus on how to define Big Data.

"A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." - Wikipedia

"Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population." - Teradata Magazine article, 2011

"Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." - The McKinsey Global Institute, 2011
What is Big Data?
[Figure: data growth in file/object size, content volume, and activity (IOPS)]
Big Data refers to datasets that grow so large and complex that it is difficult to capture, store, manage, share, analyze, and visualize them within current computational architectures.
Evolution of Big Data
Birth: 1880 US census
Adolescence: Big Science
Modern Era: Big Business
Birth: 1880 US census
The First Big Data Challenge
1880 census
  50 million people
  Age, gender (sex), occupation, education level, no. of insane people in household
Manhattan Project (1942 - 1946)
$2 billion (approx. $26 billion in 2013)
Catalyst for Big Science
Space Program (1960s)
Began in late 1950s
An active area of big data nowadays
Adolescence: Big Science
Big Science
The International Geophysical Year
  An international scientific project
  Lasted from Jul. 1, 1957 to Dec. 31, 1958
  A synoptic collection of observational data on a global scale
Implications
  Big budgets, big staffs, big machines, big laboratories
Summary of Big Science
Laid the foundation for ambitious projects
  International Biological Program
  Long Term Ecological Research Network
Ended in 1974
  Many participants viewed it as a failure
  Nevertheless, it was a success
    Transformed the way of processing data
    Realized original incentives
    Provided a renewed legitimacy for synoptic data collection
Lessons from Big Science
Spawned new big data projects
Weather prediction
Physics research (supercollider data analytics)
Astronomy images (planet detection)
Medical research (drug interaction)
Businesses latched onto its techniques, methodologies, and objectives
Modern Era: Big Business
Big Data is Everywhere!
Lots of data is being collected and warehoused
  Science experiments
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/credit card transactions
  Social networks
Big Data in Science
CERN - Large Hadron Collider
  ~10 PB/year at start; ~1000 PB in ~10 years
  2500 physicists collaborating
Large Synoptic Survey Telescope (NSF, DOE, and private donors)
  ~5-10 PB/year at start in 2012; ~100 PB by 2025
Pan-STARRS (Haleakala, Hawaii) - US Air Force
  now: 800 TB/year; soon: 4 PB/year
Big Data from Different Sources
12+ TB of tweet data every day
25+ TB of log data every day
? TB of data every day
2+ billion people on the Web by end 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009; 200M by 2014
Big Data in Business Sectors
SOURCE: McKinsey Global Institute
Characteristics of Big Data
4V: Volume, Velocity, Variety, Veracity
[Figure: the 4Vs of Big Data]
  Volume: 8 billion TB in 2015, 40 ZB in 2020; 5.2 TB per person
  Velocity: new sharing over 2.5 billion per day; new data over 500 TB per day
  Variety: text, videos, images, audio
  Veracity: e.g., "US predicts powerful quake this week" vs. "UAE body says it's a rumour" (Wam/Agencies, Saturday, April 20, 2013)
Big Data Analytics
Definition: a process of inspecting, cleaning, transforming, and modeling big data with the goal of discovering useful information, suggesting conclusions, and supporting decision making
Connection to data mining
  Analytics includes both data analysis (mining) and communication (guiding decision making)
  Analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology
Learning Techniques Overview
Learning paradigms
Supervised learning
Semi-supervised learning
Transductive learning
Unsupervised learning
Universum learning
Transfer learning
What is Online Learning?
Batch/offline learning
  Observe a batch of training data
  Learn a model from them
  Predict new samples accurately
Online learning
  Observe a sequence of data
  Learn a model incrementally as instances come
  Make the sequence of online predictions accurately
[Figure: online learning loop - make a prediction, observe the true response from the user, update the model on (x_1, y_1), ..., (x_t, y_t)]
Online Prediction Algorithm
An initial prediction rule f_0(·)
For t = 1, 2, ...
  We observe x_t and make a prediction f_{t-1}(x_t)
  We observe the true outcome y_t and then compute a loss l(f_{t-1}(x_t), y_t)
  The online algorithm updates the prediction rule using the new example and constructs f_t(x)
Online Prediction Algorithm
The total error of the method is
  sum_{t=1}^T l(f_{t-1}(x_t), y_t)
Goal: make this error as small as possible
Predicting the unknown future one step at a time: similar to generalization error
Regret Analysis
f^* = argmin_{f in H} sum_{t=1}^T l(f(x_t), y_t): the optimal prediction function from a class H (e.g., the class of linear classifiers), with minimum error after seeing all examples
Regret for the online learning algorithm:
  regret = (1/T) sum_{t=1}^T [ l(f_{t-1}(x_t), y_t) - l(f^*(x_t), y_t) ]
We want the regret to be as small as possible
Why Low Regret?
Regret for the online learning algorithm:
  regret = (1/T) sum_{t=1}^T [ l(f_{t-1}(x_t), y_t) - l(f^*(x_t), y_t) ]
Advantages
  We do not lose much from not knowing future events
  We can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight
  We can also compete with a changing environment
Advantages of Online Learning
Meets many applications where data arrive sequentially while predictions are required on-the-fly
Avoids re-training when adding new data
Applicable in adversarial and competitive environments
Strong adaptability to changing environments
High efficiency and excellent scalability
Simple to understand and easy to implement
Easy to parallelize
Theoretical guarantees
Where to Apply Online Learning?
[Diagram: Online Learning applied to Social Media, Internet Security, and Finance]
Online Learning for Social Media
Recommendation, sentiment/emotion analysis
Online Learning for Internet Security
Electronic business sectors
  Spam email filtering
  Fraudulent credit card transaction detection
  Network intrusion detection systems, etc.
Online Learning for Financial Decisions
Financial decisions
  Online portfolio selection
  Sequential investment, etc.
Perceptron Algorithm (F. Rosenblatt 1958)
Goal: find a linear classifier with small error
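To make the algorithm concrete, here is a minimal perceptron sketch in Python; the toy dataset and the helper name `perceptron` are invented for this illustration.

```python
def perceptron(data, epochs=10):
    """Train a perceptron: on each mistake, update w <- w + y*x (Rosenblatt, 1958)."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            # Predict with the current hypothesis.
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # mistake (or on the boundary)
                # Online update: move the hyperplane toward the example.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # converged on linearly separable data
            break
    return w, b

# Toy linearly separable data: label = sign(x0 - x1)
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], -1), ([0.5, 1.5], -1)]
w, b = perceptron(data)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x, _ in data]
```

On linearly separable data the loop stops as soon as a full pass makes no mistakes, which is the situation covered by the mistake bound discussed below.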
Geometric View
[Figures: the perceptron update, step by step]
  w0 misclassifies (x1, y1(=1)); update: w1 = w0 + y1 x1
  w1 misclassifies (x2, y2(=-1)); update: w2 = w1 + y2 x2
Perceptron Mistake Bound
Consider w* that separates the data: y_i (w*^T x_i) > 0 for all i
Define the margin gamma = min_i y_i (w*^T x_i) / ||w*||, and let R = sup_i ||x_i||
The number of mistakes the perceptron makes is at most (R / gamma)^2
Proof of Perceptron Mistake Bound
[Novikoff, 1963]
Proof: Let v_k be the hypothesis before the k-th mistake. Assume that the k-th mistake occurs on the input example (x_i, y_i).
First, v_{k+1}^T w* = (v_k + y_i x_i)^T w* >= v_k^T w* + gamma ||w*||, so after k mistakes v_{k+1}^T w* >= k gamma ||w*||.
Second, ||v_{k+1}||^2 = ||v_k + y_i x_i||^2 <= ||v_k||^2 + R^2 (the mistake condition makes the cross term non-positive), so ||v_{k+1}||^2 <= k R^2.
Hence, k gamma ||w*|| <= v_{k+1}^T w* <= ||v_{k+1}|| ||w*|| <= sqrt(k) R ||w*||, which gives k <= (R / gamma)^2.
Online Non-Sparse Learning
First-order learning methods
  Online gradient descent (Zinkevich, 2003)
  Passive-aggressive learning (Crammer et al., 2006)
Others (including but not limited to)
  ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001)
  ROMMA: Relaxed Online Maximum Margin Algorithm (Li and Long, 2002)
  MIRA: Margin Infused Relaxed Algorithm (Crammer and Singer, 2003)
  DUOL: A Double Updating Approach for Online Learning (Zhao et al., 2009)
Online Gradient Descent (OGD)
Online convex optimization (Zinkevich, 2003)
Consider a convex objective function f(w), where w lies in a bounded convex set W
Update by Online Gradient Descent (OGD) or Stochastic Gradient Descent (SGD):
  w_{t+1} = Pi_W( w_t - eta_t g_t ), where g_t is a (sub)gradient at w_t
where eta_t is a learning rate and Pi_W denotes projection onto W
Online Gradient Descent (OGD)
For t = 1, 2, ...
  An unlabeled sample x_t arrives
  Make a prediction based on the existing weights: y_hat_t = sgn(w_t^T x_t)
  Observe the true class label y_t in {+1, -1}
  Update the weights by w_{t+1} = w_t - eta_t g_t, where g_t is a (sub)gradient of the loss at w_t and eta_t is a learning rate
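As a sketch of the update above, the following applies OGD to the logistic loss on a small synthetic stream; the dataset, the function name, and the default step size eta0 are invented for illustration.

```python
import math

def ogd_logistic(stream, eta0=0.5):
    """Online gradient descent on the logistic loss l(w) = log(1 + exp(-y w.x)).
    Step size eta_t = eta0 / sqrt(t), as in standard OGD analyses."""
    dim = len(stream[0][0])
    w = [0.0] * dim
    mistakes = 0
    for t, (x, y) in enumerate(stream, start=1):
        score = sum(wi * xi for wi, xi in zip(w, x))
        if y * score <= 0:          # count online prediction mistakes
            mistakes += 1
        # Gradient of the logistic loss wrt w is g * x with:
        g = -y / (1.0 + math.exp(y * score))
        eta = eta0 / math.sqrt(t)
        w = [wi - eta * g * xi for wi, xi in zip(w, x)]
    return w, mistakes

# Synthetic stream: the label follows the sign of the first coordinate.
stream = [([1.0, 0.2], 1), ([0.9, -0.1], 1),
          ([-1.0, 0.3], -1), ([-0.8, -0.2], -1)] * 25
w, mistakes = ogd_logistic(stream)
```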
Passive Aggressive Online Learning
At each round, solve w_{t+1} = argmin_w (1/2)||w - w_t||^2 subject to zero hinge loss on (x_t, y_t)
Closed-form solutions can be derived: w_{t+1} = w_t + tau_t y_t x_t, with tau_t = l_t / ||x_t||^2, where l_t is the hinge loss (PA-I and PA-II add a slack term with aggressiveness parameter C)
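A sketch of one such closed-form update (the PA-I variant with aggressiveness parameter C, following Crammer et al., 2006); the example vectors are invented.

```python
def pa_update(w, x, y, C=1.0):
    """One passive-aggressive (PA-I) update: suffer hinge loss
    l = max(0, 1 - y*w.x); if l > 0, step size tau = min(C, l/||x||^2)
    and w <- w + tau*y*x."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * score)
    if loss == 0.0:
        return w  # passive: the margin is already satisfied
    tau = min(C, loss / sum(xi * xi for xi in x))  # aggressive: closed form
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], 1)  # loss 1, ||x||^2 = 1, tau = 1
w = pa_update(w, [1.0, 0.0], 1)  # margin satisfied: no change
```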
Online Non-Sparse Learning
First-order methods
  Learn a linear weight vector (first-order information) of the model
Pros and cons
  Simple and easy to implement
  Efficient and scalable for high-dimensional data
  Relatively slow convergence rate
Online Non-Sparse Learning
Second-order online learning methods
  Update the weight vector w by maintaining and exploring both first-order and second-order information
Some representative methods (but not limited to)
  SOP: Second-Order Perceptron (Cesa-Bianchi et al., 2005)
  CW: Confidence-Weighted learning (Dredze et al., 2008)
  AROW: Adaptive Regularization of Weights (Crammer et al., 2009)
  IELLIP: Online Learning by Ellipsoid Method (Yang et al., 2009)
  NHERD: Gaussian Herding (Crammer & Lee, 2010)
  NAROW: New variant of the AROW algorithm (Orabona & Crammer, 2010)
  SCW: Soft Confidence-Weighted learning (Hoi et al., 2012)
Online Non-Sparse Learning
Second-order online learning methods
  Learn the weight vector w by maintaining and exploring both first-order and second-order information
Pros and cons
  Faster convergence rate
  Expensive for high-dimensional data
  Relatively sensitive to noise
Online Sparse Learning
Motivation
  Space constraint: RAM overflow
  Test-time constraint
How to induce sparsity in the weights of online learning algorithms?
Some representative work
  Truncated gradient (Langford et al., 2009)
  FOBOS: Forward-Looking Subgradients (Duchi and Singer, 2009)
  Dual averaging (Xiao, 2009)
  etc.
Truncated Gradient (Langford et al., 2009)
Truncated gradient: impose sparsity by modifying the stochastic gradient descent update
  Objective function
  Stochastic gradient descent
  Simple coefficient rounding
Truncated Gradient (Langford et al., 2009)
[Figure: simple coefficient rounding vs. less aggressive truncation]
Truncated Gradient (Langford et al., 2009)
The amount of shrinkage is measured by a gravity parameter g
The truncation can be performed every K online steps
When g = 0, the update rule is identical to standard SGD
Loss functions considered: logistic, SVM (hinge), least squares
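The truncation step can be sketched as follows; the function name and the toy values are invented, and this shows a single step rather than a full training loop. Coefficients that fall in (0, theta] after the SGD step are shrunk toward zero by eta*gravity*K (symmetrically for negative coefficients), which is how sparsity arises.

```python
def truncated_gradient_step(w, grad, eta, gravity, theta, K, t):
    """One step of truncated gradient (Langford et al., 2009), sketched:
    a plain SGD step, followed every K steps by shrinking small
    coefficients toward zero. gravity = 0 recovers standard SGD."""
    w = [wi - eta * gi for wi, gi in zip(w, grad)]  # SGD step
    if gravity > 0 and t % K == 0:                  # truncate every K steps
        shrink = eta * gravity * K
        out = []
        for wi in w:
            if 0 < wi <= theta:
                out.append(max(0.0, wi - shrink))   # shrink, never cross zero
            elif -theta <= wi < 0:
                out.append(min(0.0, wi + shrink))
            else:
                out.append(wi)                      # large weights untouched
        w = out
    return w

# Small coefficients get driven exactly to zero, inducing sparsity.
w = [0.5, 0.01, -0.02]
w = truncated_gradient_step(w, [0.0, 0.0, 0.0],
                            eta=0.1, gravity=0.5, theta=0.1, K=1, t=1)
```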
Truncated Gradient (Langford et al., 2009)
Regret bound
[Equation omitted; see Langford et al. (2009)]
FOBOS (Duchi & Singer, 2009)
Objective function: a loss plus a (sparsity-inducing) regularizer
Repeat:
  I. Unconstrained (stochastic sub)gradient step on the loss
  II. Incorporate the regularization
Similar to forward-backward splitting (Lions and Mercier, 1979) and composite gradient methods (Wright et al., 2009; Nesterov, 2007)
FOBOS: Step I
  Unconstrained (stochastic sub)gradient step on the loss:
    w_{t+1/2} = w_t - eta_t g_t, where g_t is a subgradient of the loss at w_t
FOBOS: Step II
  Incorporate regularization:
    w_{t+1} = argmin_w { (1/2)||w - w_{t+1/2}||^2 + eta_t r(w) }
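For the common case r(w) = lambda ||w||_1, Step II has a closed-form soft-thresholding solution. A minimal sketch with invented example values:

```python
def fobos_l1_step(w, grad, eta, lam):
    """One FOBOS step with L1 regularization (Duchi & Singer, 2009):
    Step I: unconstrained (sub)gradient step on the loss;
    Step II: the proximal problem
      argmin_w 0.5*||w - w_half||^2 + eta*lam*||w||_1
    is solved entry-wise by soft-thresholding at eta*lam."""
    half = [wi - eta * gi for wi, gi in zip(w, grad)]  # Step I
    thr = eta * lam
    # Step II: shrink each entry toward zero by thr, zeroing small ones.
    return [max(abs(h) - thr, 0.0) * (1.0 if h > 0 else -1.0) for h in half]

w = fobos_l1_step([0.0, 0.0], grad=[-1.0, -0.05], eta=1.0, lam=0.1)
```

Here the large-gradient coordinate survives (shrunk by the threshold) while the small one is truncated exactly to zero.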
Forward Looking Property
The optimum w_{t+1} satisfies
  w_{t+1} = w_t - eta_t g_t - eta_t r'(w_{t+1})
i.e., the update combines the current subgradient of the loss with a forward subgradient of the regularization
Batch Convergence and Online Regret
Set eta_t proportional to 1/sqrt(t) (or 1/sqrt(T)) to obtain batch convergence
Online (average) regret bounds of order O(1/sqrt(T)) follow
High Dimensional Efficiency
Input space is sparse but huge
Need to perform lazy updates to w
Proposition: the lazy (on-demand) updates are equivalent to the eager updates
Dual Averaging (Xiao, 2010)
Objective function: a loss plus a sparsity-inducing regularizer
Problem: truncated gradient doesn't produce truly sparse weights, due to the small learning rate
Fix: dual averaging, which keeps two state representations:
  the parameter w_t and the average subgradient vector g_bar_t = (1/t) sum_{i=1}^t g_i
Dual Averaging (Xiao, 2010)
The update has an entry-wise closed-form solution
Advantage: sparsity on the weight w
Disadvantage: keeps a non-sparse average subgradient
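A minimal sketch of the entry-wise closed form for the L1 case (simplified from Xiao, 2010; the toy gradient stream and the gamma default are invented). Because the threshold lam is applied to the average subgradient rather than being scaled by a decaying learning rate, coordinates with no persistent signal are set exactly to zero.

```python
import math

def rda_l1(grads, lam, gamma=1.0):
    """Regularized dual averaging with L1: keep the running average
    subgradient g_bar; at step t set, entry-wise,
      w_i = -(sqrt(t)/gamma) * sign(g_bar_i) * max(|g_bar_i| - lam, 0)."""
    dim = len(grads[0])
    g_bar = [0.0] * dim
    w = [0.0] * dim
    for t, g in enumerate(grads, start=1):
        # Incrementally maintain the average subgradient.
        g_bar = [(gb * (t - 1) + gi) / t for gb, gi in zip(g_bar, g)]
        w = [-(math.sqrt(t) / gamma) * (1.0 if gb > 0 else -1.0)
             * max(abs(gb) - lam, 0.0) for gb in g_bar]
    return w

# Coordinate 0 has a persistent gradient signal; coordinate 1 averages to ~0.
grads = [[-1.0, 0.02], [-1.0, -0.02]] * 8
w = rda_l1(grads, lam=0.1)
```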
Convergence and Regret
Average regret: the theoretical bound is similar to that of gradient descent
Comparison
FOBOS
  Current subgradient
  Local Bregman divergence
  Equivalent to the TG method when the regularizer is L1
Dual Averaging
  Average subgradient
  Global proximal function
Comparison
[Figure: left and right panels compare the sparsity of TG and RDA under different truncation parameters]
Variants of Online Sparse Learning
Models
Online feature selection (OFS)
  A variant of sparse online learning
  The key difference is that OFS focuses on selecting a fixed subset of features in the online learning process
  Could be used as an alternative tool for batch feature selection when dealing with big data
Other existing work
  Online learning for group lasso (Yang et al., 2010) and online learning for multi-task feature selection (Yang et al., 2013), to select features in a group manner or features among similar tasks
Online Sparse Learning
Objective
  Induce sparsity in the weights of online learning algorithms
Pros and cons
  Simple and easy to implement
  Efficient and scalable for high-dimensional data
  Relatively slow convergence rate
  No perfect way to attain sparse solutions yet
Online Unsupervised Learning
Assumption: data generated from some underlying parametric probabilistic density function
Goal: estimate the parameters of the density to give a suitable compact representation
Typical work
  Online singular value decomposition (SVD) (Brand, 2003)
Others (including but not limited to)
  Online principal component analysis (PCA) (Warmuth and Kuzmin, 2006)
  Online dictionary learning for sparse coding (Mairal et al., 2009)
  Online learning for latent Dirichlet allocation (LDA) (Hoffman et al., 2010)
  Online variational inference for the hierarchical Dirichlet process (HDP) (Wang et al., 2011)
  ...
SVD: Definition
A = U Sigma V^T
  A: input data matrix, m x n (e.g., m documents, n terms)
  U: left singular vectors, m x r matrix (m documents, r topics)
  Sigma: singular values, r x r diagonal matrix (strength of each topic); r = rank of matrix A
  V: right singular vectors, n x r matrix (n terms, r topics)
Equivalently, A = sum_i sigma_i u_i v_i^T, where each sigma_i is a scalar and u_i, v_i are vectors
SVD Properties
It is always possible to do SVD, i.e., decompose a matrix A into A = U Sigma V^T, where
  U, Sigma, V: unique
  U, V: column orthonormal
    U^T U = I, V^T V = I (I: identity matrix)
  Sigma: diagonal
    Entries (singular values) are non-negative and sorted in decreasing order (sigma_1 >= sigma_2 >= ... >= 0)
SVD: Example - Users-to-Movies
A: users-to-movies rating matrix; columns are The Avengers, Star Wars, Twilight, Matrix. Topics or concepts (aka latent dimensions, latent factors): SciFi and Romance.

  A =
    1 3 1 0
    3 4 3 1
    4 2 4 0
    5 4 5 1
    0 1 0 4
    0 0 0 3
    0 1 0 5

  A = U Sigma V^T:

  U (user-to-concept similarity matrix):
    0.24  0.02  0.69
    0.48  0.02  0.43
    0.49  0.08  0.52
    0.68  0.07  0.16
    0.06  0.57  0.06
    0.01  0.41  0.20
    0.07  0.70  0.01

  Sigma = diag(11.9, 7.1, 2.5); e.g., 11.9 is the strength of the SciFi concept

  V^T (V is the movie-to-concept similarity matrix):
    0.59  0.54  0.59  0.05
    0.10  0.12  0.10  0.98
    0.37  0.83  0.37  0.17
SVD: Interpretation #1
Users, movies, and concepts:
  U: user-to-concept similarity matrix
  V: movie-to-concept similarity matrix
  Sigma: its diagonal elements give the strength of each concept
SVD: Interpretation #2
SVD gives the best axis to project on
  best = minimal sum of squares of projection errors
  In other words, minimum reconstruction error
SVD: Interpretation #2
- example: in the factorization A = U Sigma V^T above, sigma_1 measures the variance (spread) on the v1 axis
SVD: Interpretation #2
- example: U Sigma gives the coordinates of the points in the projection axes; the first column is the projection of users on the SciFi axis

  U Sigma =
    2.86  0.24  8.21
    5.71  0.24  5.12
    5.83  0.95  6.19
    8.09  0.83  1.90
    0.71  6.78  0.71
    0.12  4.88  2.38
    0.83  8.33  0.12
SVD: Interpretation #2
Q: how exactly is dimensionality reduction done?
A: set the smallest singular values to zero and approximate the original matrix by low-rank matrices. Keeping the top two singular values:

  A ≈ U' Sigma' V'^T, with

  U' (first two columns of U):
    0.24  0.02
    0.48  0.02
    0.49  0.08
    0.68  0.07
    0.06  0.57
    0.01  0.41
    0.07  0.70

  Sigma' = diag(11.9, 7.1)

  V'^T (first two rows of V^T):
    0.59  0.54  0.59  0.05
    0.10  0.12  0.10  0.98
SVD: Best Low Rank Approximation
Theorem: let A = U Sigma V^T with rank(A) = r and sigma_1 >= ... >= sigma_r, and let B = U S V^T, where S is diagonal with s_i = sigma_i (1 <= i <= k) and s_i = 0 (i > k); equivalently, B = sum_{i=1}^k sigma_i u_i v_i^T.
Then B is the best rank-k approximation to A: B = argmin_{rank(X) <= k} ||A - X||_F
Intuition (spectral decomposition): A = sum_i sigma_i u_i v_i^T, with sigma_1 >= sigma_2 >= ... >= 0
Why is setting small sigma_i to 0 the right thing to do?
  The vectors u_i and v_i are unit length, so sigma_i scales them; zeroing the small sigma_i therefore introduces the least error.
SVD: Interpretation #2
Q: how many sigma's to keep?
A: rule of thumb: keep 80-90% of the energy (sum_i sigma_i^2)
A = sigma_1 u_1 v_1^T + sigma_2 u_2 v_2^T + ..., assuming sigma_1 >= sigma_2 >= ...
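A quick arithmetic check of this rule of thumb on the singular values from the users-to-movies example:

```python
# Energy retained when keeping the top 2 of the 3 singular values
# from the users-to-movies example (sigma = 11.9, 7.1, 2.5).
sigmas = [11.9, 7.1, 2.5]
energy = sum(s * s for s in sigmas)            # total energy, sum of sigma_i^2
kept = sum(s * s for s in sigmas[:2]) / energy  # fraction kept with k = 2
```

Keeping two singular values retains well over 90% of the energy, so k = 2 satisfies the 80-90% rule here.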
SVD: Complexity
SVD of a full m x n matrix costs O(m n min(m, n))
But it is faster if we only want the singular values, or only the first k singular vectors (thin SVD), or if the matrix is sparse (sparse SVD)
Stable implementations: LAPACK, Matlab, PROPACK
  Available in most common languages
SVD: Conclusions so far
SVD: A = U Sigma V^T: unique
  U: user-to-concept similarities
  V: movie-to-concept similarities
  Sigma: strength of each concept
Dimensionality reduction
  Keep the few largest singular values (80-90% of the energy)
SVD picks up linear correlations

SVD: Relationship to Eigen-decomposition
SVD gives us A = U Sigma V^T
Eigen-decomposition: A = X Lambda X^T, where
  A is symmetric; U, V, X are orthonormal (X^T X = I); Lambda, Sigma are diagonal
Equivalence:
  A A^T = U Sigma V^T V Sigma U^T = U Sigma^2 U^T
  A^T A = V Sigma^2 V^T
This shows how to use eigen-decomposition to compute SVD; the eigenvalues of A A^T are the squares of the singular values of A
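This relationship suggests a simple way to approximate sigma_1 without a full SVD: power iteration on A^T A. A pure-Python sketch, with a 2x2 symmetric test matrix whose singular values are its eigenvalues, 3 and 1:

```python
def top_singular_value(A, iters=100):
    """Estimate the largest singular value of A by power iteration on
    A^T A, using the fact that sigma_1^2 is its top eigenvalue."""
    m, n = len(A), len(A[0])
    v = [1.0] * n
    for _ in range(iters):
        # One multiplication by A^T A: u = A v, then w = A^T u.
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]          # renormalize the iterate
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    return sum(x * x for x in u) ** 0.5    # ||A v|| = sigma_1 for unit v

sigma1 = top_singular_value([[2.0, 1.0], [1.0, 2.0]])
```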
Online SVD (Brand, 2003)
Challenges: storage and computation
Idea: an incremental algorithm computes the principal eigenvectors of a matrix without storing the entire matrix in memory
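Brand's algorithm itself performs careful rank-one updates of U, Sigma, and V. As a simpler illustration of the incremental idea, here is Oja's rule, which tracks the top principal direction of a stream one sample at a time; this is not Brand's method, and the stream and step size are invented for the example.

```python
def oja_top_direction(rows, eta=0.1):
    """Incrementally track the top right-singular direction of a data
    stream with Oja's rule: v <- v + eta * (x.v) * x, renormalized.
    No matrix is ever stored; each sample is seen once per pass."""
    v = [1.0, 0.0]
    for x in rows:
        proj = sum(xi * vi for xi, vi in zip(x, v))
        v = [vi + eta * proj * xi for vi, xi in zip(v, x)]
        norm = sum(vi * vi for vi in v) ** 0.5
        v = [vi / norm for vi in v]  # keep the estimate unit length
    return v

# Stream drawn (noisily) along the direction (1, 1): the estimate aligns.
rows = [[1.0, 1.0], [2.0, 2.0], [1.0, 0.9], [-1.0, -1.1]] * 20
v = oja_top_direction(rows)
```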
Online Unsupervised Learning
Unsupervised learning: minimizing the reconstruction errors
Online: rank-one update
Pros and cons
  Simple to implement
  Heuristic, but works intuitively
  Lack of theoretical guarantees
  Relatively poor performance
Discussions and Open Issues
  Variety: structured, semi-structured, unstructured
  Veracity: data inconsistency, ambiguities, deception
  Velocity: batch data, real-time data, streaming data
  Volume: size grows from terabytes to exabytes to zettabytes
How to learn from Big Data to tackle the 4V characteristics?
One-slide Takeaway
Online learning is a promising tool for big data analytics
Many challenges exist
  Real-world scenarios: concept drifting, sparse data, high-dimensional data, uncertain/imprecise data, etc.
  More advanced online learning algorithms: faster convergence rate, less memory cost, etc.
  Parallel implementation or running on distributed platforms
Toolbox: http://appsrv.cse.cuhk.edu.hk/~hqyang/doku.php?id=software

Other Toolboxes
  MOA: Massive Online Analysis - http://moa.cms.waikato.ac.nz/
  Vowpal Wabbit - https://github.com/JohnLangford/vowpal_wabbit/wiki
Q & A
If you have any problems, please send emails to [email protected]!
References
M. Brand. Fast online SVD revisions for lightweight recommender systems. In SDM, 2003.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM J. Comput., 34(3):640-668, 2005.
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.
K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS, pages 414-422, 2009.
K. Crammer and D. D. Lee. Learning via Gaussian herding. In NIPS, pages 451-459, 2010.
K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951-991, 2003.
M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In ICML, pages 264-271, 2008.
J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009.
C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213-242, 2001.
M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. In NIPS, pages 856-864, 2010.
S. C. H. Hoi, J. Wang, and P. Zhao. Exact soft confidence-weighted learning. In ICML, 2012.
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.
Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361-387, 2002.
P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal., 16(6):964-979, 1979.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, page 87, 2009.
Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, Center for Operations Research and Econometrics, 2007.
F. Orabona and K. Crammer. New adaptive algorithms for online classification. In NIPS, pages 1840-1848, 2010.
F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408, 1958.
C. Wang, J. W. Paisley, and D. M. Blei. Online variational inference for the hierarchical Dirichlet process. Journal of Machine Learning Research - Proceedings Track, 15:752-760, 2011.
M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In NIPS, pages 1481-1488, 2006.
S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479-2493, 2009.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543-2596, 2010.
H. Yang, M. R. Lyu, and I. King. Efficient online learning for multi-task feature selection. ACM Transactions on Knowledge Discovery from Data, 2013.
H. Yang, Z. Xu, I. King, and M. R. Lyu. Online learning for group lasso. In ICML, pages 1191-1198, Haifa, Israel, 2010.
L. Yang, R. Jin, and J. Ye. Online learning by ellipsoid method. In ICML, page 145, 2009.
P. Zhao, S. C. H. Hoi, and R. Jin. DUOL: A double updating approach for online learning. In NIPS, 2009.