Learning from Infinite Training Examples
3.18.2009, 3.19.2009
Prepared for NKU and NUTN seminars
Presenter: Chun-Nan Hsu (許鈞南)
Institute of Information Science, Academia Sinica, Taipei, Taiwan
The Ever-Growing Web (Zhuangzi, ca. 400 BC)
Human life is finite, but knowledge is infinite. Following the infinite with the finite is doomed to fail.
人之生也有涯,而知也無涯。以有涯隨無涯,殆矣。(Zhuangzi 莊子, ca. 400 BC)
Analogously…
Computing power is finite, but the Web is infinite. Mining the infinite Web with finite computing power…
is doomed to fail?
Other “holy grails” in Artificial Intelligence
Learning to understand natural languages
Learning to recognize millions of objects in computer vision
Speech recognition in noisy environments, such as in a car
On-Line Learning vs. Off-Line Learning
(This has nothing to do with human learning by browsing the web.)
Definition: given a set of new training data, an on-line learner can update its model without re-reading old data while improving its performance.
By contrast, an off-line learner must combine old and new data and start learning all over again; otherwise its performance will suffer.
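The distinction can be sketched with a toy mistake-driven (perceptron-style) learner. This is only an illustration of the update pattern, not any of the algorithms discussed later; the data and learning rate are made up:

```python
import numpy as np

def online_update(w, X_new, y_new, lr=0.1):
    """On-line learner: update on the new examples only; old data is never re-read."""
    for x, y in zip(X_new, y_new):
        if y * (w @ x) <= 0:          # mistake-driven update on this new example
            w = w + lr * y * x
    return w

def offline_retrain(X_all, y_all, lr=0.1, passes=10):
    """Off-line learner: must combine old and new data and train from scratch."""
    w = np.zeros(X_all.shape[1])
    for _ in range(passes):
        w = online_update(w, X_all, y_all, lr)
    return w
```

When a new batch arrives, the on-line learner calls `online_update(w, X_new, y_new)` on its current `w`; the off-line learner must call `offline_retrain` on the full accumulated data set.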
Off-Line Learning
Nearly all popular ML algorithms are off-line today
They iterate over the training examples for many passes until an objective function is minimized.
For example: the SMO algorithm for SVMs, L-BFGS for CRFs, the EM algorithm for HMMs and GMMs, etc.
Why on-line learning?
Single-pass on-line learning
The key for on-line learning to win is to achieve satisfactory performance right after scanning the new training examples in a single pass.
Previous work on on-line learning
Perceptron (Rosenblatt 1957)
Stochastic Gradient Descent (Widrow & Hoff 1960)
Bregman Divergence (Azoury & Warmuth 2001)
MIRA, large margin (Crammer & Singer 2003)
LaRank (Bordes & Bottou 2005, 2007)
EG, exponentiated gradient (Collins, Bartlett et al. 2008)
Stochastic Gradient Descent (SGD)
Learning is to minimize a loss function L(θ; D) given training examples D, i.e., to find θ such that ∇L(θ; D) = 0.
SGD updates with the gradient of one example (or a small batch B) at a time:
θ(t+1) = θ(t) − η ∇L(θ(t); B)
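A minimal SGD sketch on a toy least-squares problem (illustrative, with synthetic data; not the CRF setup used later in the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # toy regression inputs
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                        # noiseless targets

# L(theta; D) = mean of (x_i . theta - y_i)^2; SGD follows one example's gradient at a time
theta = np.zeros(3)
eta = 0.05
for t in range(1000):
    i = rng.integers(len(X))                      # pick a random example
    grad = 2.0 * (X[i] @ theta - y[i]) * X[i]     # gradient of that example's squared error
    theta = theta - eta * grad                    # the SGD update from the slide
```

After enough stochastic steps, `theta` is close to `theta_true`, even though each step sees only one example.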
Optimal Step Size (Benveniste et al. 1990, Murata et al. 1998)
Solving ∇L = 0 by Newton's method gives the update
θ(t+1) = θ(t) − η(t) H⁻¹ ∇L(θ(t); D)
The step size is asymptotically optimal if it approaches
η(t) = 1/t
Single-Pass Result (Bottou & LeCun 2004)
The optimum for n+1 examples is one Newton step away from the optimum for n examples:
θ*(n+1) = θ*(n) − (1/(n+1)) H⁻¹ ∇L(θ*(n); D(n+1)) + O(1/n²)
where L(θ; D(n)) and L(θ; D(n+1)) denote the losses over n and n+1 examples, respectively.
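For a quadratic loss the statement is exact, which a small least-squares check illustrates (toy data of my own, not from the talk): the optimum for n+1 examples is reached in exactly one Newton step from the optimum for n examples.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=50)   # noisy targets

def optimum(X, y):
    # exact minimiser of the (quadratic) sum-of-squares loss
    return np.linalg.solve(X.T @ X, X.T @ y)

n = 40
theta_n = optimum(X[:n], y[:n])           # optimum for the first n examples

# one Newton step on the loss over n+1 examples, starting from theta_n
Xn1, yn1 = X[:n + 1], y[:n + 1]
H = 2.0 * Xn1.T @ Xn1                     # Hessian of the new loss
g = 2.0 * Xn1.T @ (Xn1 @ theta_n - yn1)   # gradient of the new loss at theta_n
theta_step = theta_n - np.linalg.solve(H, g)
```

Because the loss is quadratic, `theta_step` coincides with the exact optimum for the n+1 examples; for general losses the Newton step is accurate up to the O(1/n²) term.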
2nd Order SGD
2nd-order SGD (2SGD): adjust the step size so that it approaches (1/t) H⁻¹.
Good news: from previous work, given sufficiently many training examples, 2SGD achieves the empirical optimum in a single pass!
Bad news: computing H⁻¹ is prohibitively expensive.
E.g., with 10K features, H is a 10K × 10K matrix, a 100M-element floating-point array.
With 1M features, it would be a 10¹²-element array.
Approximating the Jacobian (Aitken 1925, Schafer 1997)
Learning algorithms can be considered as fixed-point iterations of a mapping θ = M(θ).
Taylor expansion gives
θ(t+1) − θ* = M(θ(t)) − M(θ*) ≈ J (θ(t) − θ*)
The eigenvalues of J can be approximated componentwise by
γ_i ≈ (θ_i(t+2) − θ_i(t+1)) / (θ_i(t+1) − θ_i(t))
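The eigenvalue estimate can be checked on a linear fixed-point mapping, where it is exact (a hand-built example; the matrix A and offset b are arbitrary choices):

```python
import numpy as np

# a linear fixed-point mapping M(theta) = A @ theta + b, so the Jacobian J equals A
A = np.diag([0.9, 0.5])
b = np.array([1.0, 2.0])

theta0 = np.zeros(2)
theta1 = A @ theta0 + b
theta2 = A @ theta1 + b

# componentwise ratio of successive differences approximates the eigenvalues of J
gamma = (theta2 - theta1) / (theta1 - theta0)
```

Here `gamma` recovers the diagonal of A, i.e., the eigenvalues of J, from three iterates alone, with no access to J itself.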
Approximating the Hessian
Consider the SGD mapping as a fixed-point iteration, too:
M(θ(t)) = θ(t) − η ∇L(θ(t); B)
Since J = M′ = I − ηH, we have eig(J) = eig(I − ηH) = 1 − η·eig(H) (because H is symmetric); therefore
eig(H⁻¹) = η / (1 − eig(J)) = η / (1 − γ)
Estimating Eigenvalues Periodically
Since the SGD mapping is stochastic, estimating the eigenvalues at every iteration may yield inaccurate estimates.
To make the mapping more stationary, we use b consecutive mappings, M^b = M(M(…M(θ)…)).
By the law of large numbers, M^b will be less "stochastic".
From the approximation above, we can estimate eig(J^b) by
γ_i^b ≈ (θ_i(t+2b) − θ_i(t+b)) / (θ_i(t+b) − θ_i(t))
The PSA algorithm (Huang, Chang & Hsu 2007)
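The preceding slides suggest a rough sketch of the PSA idea: run b fixed-rate SGD steps twice, estimate eig(J^b) componentwise, and derive per-component step sizes via eig(H⁻¹) = η/(1 − γ). This is my own loose reconstruction on toy least-squares data, not the published PSA implementation; the data, b, and the clipping safeguard are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 3))
y = X @ np.array([2.0, -1.0, 0.5])

def sgd_pass(theta, eta, start, b):
    """b fixed-rate SGD steps (the 'parallel chord' phase)."""
    for t in range(start, start + b):
        i = t % len(X)
        theta = theta - eta * 2.0 * (X[i] @ theta - y[i]) * X[i]
    return theta

eta, b = 0.02, 50
theta0 = np.zeros(3)
theta1 = sgd_pass(theta0, eta, 0, b)      # iterate after b steps
theta2 = sgd_pass(theta1, eta, b, b)      # iterate after 2b steps

# componentwise estimate of eig(J^b); epsilon and clipping keep the estimate sane
gamma_b = np.clip((theta2 - theta1) / (theta1 - theta0 + 1e-12), 0.0, 0.999)
gamma = gamma_b ** (1.0 / b)              # eig(J), since eig(J^b) = eig(J)^b
eta_new = eta / (1.0 - gamma)             # per-component step ~ eta / (1 - gamma) ~ eig(H^-1)
```

The adjusted `eta_new` would then replace the fixed step size in subsequent SGD passes, approximating the Newton scaling without ever forming H.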
Experimental Results
Conditional Random Fields (CRF) (Lafferty et al. 2001)
Sequence labeling problems – gene mention tagging
Conditional Random Fields
In effect, a CRF encodes a probabilistic rule-based system with rules of the form:
If fj1(X,Y) & fj2(X,Y) & … & fjn(X,Y) are non-zero, then the labels of the sequence are Y, with score P(Y|X).
If we have d features and consider a context window of size w, an order-1 CRF encodes on the order of 2·d·w·|x|·|y|² rules.
Tasks and Setups
CoNLL 2000 base NP: tag noun phrases; 8,936 training / 2,012 test sentences; 3 tags; 1,015,662 features
CoNLL 2000 chunking: tag 11 phrase types; 8,936 training / 2,012 test sentences; 23 tags; 7,448,606 features
BioNLP/NLPBA 2004: tag 5 types of bio-entities (e.g., gene, protein, cell line); 18,546 training / 3,856 test sentences; 5,977,675 features
BioCreative 2: tag gene names; 15,000 training / 5,000 test sentences; 10,242,972 features
Performance measure: F-score
F = 2pr / (p + r), where p = TP / (TP + FP) and r = TP / (TP + FN)
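The F-score formula above translates directly into code:

```python
def f_score(tp, fp, fn):
    """F1 score from raw counts of true positives, false positives, false negatives."""
    p = tp / (tp + fp)            # precision
    r = tp / (tp + fn)            # recall
    return 2 * p * r / (p + r)    # harmonic mean of precision and recall
```

For example, 80 true positives with 20 false positives and 20 false negatives gives precision = recall = 0.8 and hence F = 0.8.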
Feature types for BioCreative 2
O(22M) rules are encoded in our CRF model!
Convergence Performance: CoNLL 2000 base NP
Convergence Performance: CoNLL chunking
Convergence Performance: BioNLP/NLPBA 2004
Convergence Performance: BioCreative 2
Execution Time: CoNLL 2000 base NP
First Pass 23.74 sec
Execution Time: CoNLL chunking
First Pass 196.44 sec
Execution Time: BioNLP/NLPBA 2004
First Pass 287.48 sec
Execution Time: BioCreative 2
First Pass 394.04 sec
Experimental results for linear SVM and convolutional neural net
Data sets
Linear SVM
Convolutional Neural Net (5 layers)
Layer trick: step sizes in the lower layers should be larger than in the higher layers.
Mini-conclusion: Single-Pass
By approximating the Jacobian, we can approximate the Hessian, too.
By approximating the Hessian, we can achieve near-optimal single-pass performance in practice.
With a single-pass on-line learner, virtually infinitely many training examples can be used.
Analysis of PSA
PSA is a member of the family of "discretized Newton methods". Other well-known members include the secant method (aka Quickprop) and Steffensen's method (aka Triple Jump).
General form of these methods:
θ(t+1) = θ(t) − A[h; θ(t)]⁻¹ g(θ(t))
where A is a matrix designed to approximate the Hessian without actually computing the derivative.
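One classical member of this family, the one-dimensional secant method, can be sketched as follows (a generic textbook formulation for root finding, not PSA itself):

```python
def secant(g, x0, x1, tol=1e-12, max_iter=60):
    """Secant method: a Newton step with the derivative replaced by a
    finite difference built from the last two iterates (no explicit g')."""
    for _ in range(max_iter):
        denom = g(x1) - g(x0)
        if denom == 0.0:              # flat secant; cannot divide, stop
            break
        x0, x1 = x1, x1 - g(x1) * (x1 - x0) / denom
        if abs(x1 - x0) < tol:        # successive iterates have converged
            break
    return x1

# find the root of g(x) = x^2 - 2, i.e. sqrt(2)
root = secant(lambda x: x * x - 2.0, 1.0, 2.0)
```

The finite-difference quotient plays the role of the matrix A above: it approximates the derivative from past iterates instead of computing it.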
PSA
PSA is neither the secant method nor Steffensen's method.
PSA iterates a 2b-step "parallel chord" method (i.e., fixed-rate SGD) followed by an approximated Newton step.
The off-line 2-step parallel chord method is known to have order-4 convergence.
Convergence analysis of PSA
Are we there yet?
With single-pass on-line learning, we can learn from infinite training examples now, at least in theory.
Still needed: a cheaper, quicker method to annotate labels for training examples,
plus a lot of computers…
Human life is finite, but knowledge is infinite.
Learning from infinite examples by applying PSA to 2nd-order SGD is a good idea!
Thank you for your attention!
http://aiia.iis.sinica.edu.tw
http://chunnan.iis.sinica.edu.tw/~chunnan
This research is supported mostly by NRPGM’s advanced bioinformatics core facility grant 2005-2011.