IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 11, NOVEMBER 2006 4775
Online Regularized Classification Algorithms
Yiming Ying and Ding-Xuan Zhou
Abstract—This paper considers online classification learning algorithms based on regularization schemes in reproducing kernel Hilbert spaces associated with general convex loss functions. A novel capacity independent approach is presented. It verifies the strong convergence of the algorithm under a very weak assumption on the step sizes and yields satisfactory convergence rates for polynomially decaying step sizes. Explicit learning rates with respect to the misclassification error are given in terms of the choice of step sizes and the regularization parameter (depending on the sample size). Error bounds associated with the hinge loss, the least square loss, and the support vector machine q-norm loss are presented to illustrate our method.
Index Terms—Classification algorithm, error analysis, online learning, regularization, reproducing kernel Hilbert spaces.
I. INTRODUCTION
IN this paper, we study online classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and reproducing kernel Hilbert spaces.
Let $X$ be a compact metric space and $Y = \{-1, 1\}$. A function $\mathcal{C}: X \to Y$ is called a (binary) classifier, which divides the input space $X$ into two classes. A real-valued function $f: X \to \mathbb{R}$ can be used to generate a classifier $\mathcal{C} = \mathrm{sgn}(f)$, where the sign function is defined as $\mathrm{sgn}(f)(x) = 1$ if $f(x) \ge 0$ and $\mathrm{sgn}(f)(x) = -1$ if $f(x) < 0$. For such a real-valued function $f$, a loss function $\phi: \mathbb{R} \to \mathbb{R}_+$ is often used to measure the error: $\phi(yf(x))$ is the local error at the point $x \in X$ while the label $y$ is assigned to the event $x$.
Reproducing kernel Hilbert spaces are often used as hypothesis spaces in the design of classification algorithms. Let $K: X \times X \to \mathbb{R}$ be continuous, symmetric, and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, \dots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite. Such a function $K$ is called a Mercer kernel.
The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined [2] to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with
Manuscript received May 11, 2005; revised October 26, 2005. This work was supported by a grant from the Research Grants Council of Hong Kong [Project No. CityU 103704] and by a grant from City University of Hong Kong [Project No. 7001816].
Y. Ying was with the Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China. He is now with the Department of Computer Science, University College London, London, U.K., on leave from the Institute of Mathematics, Chinese Academy of Sciences, Beijing 100080, China (e-mail: [email protected]).
D.-X. Zhou is with the Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China (e-mail: [email protected]).
Communicated by P. L. Bartlett, Associate Editor for Pattern Recognition, Statistical Learning, and Inference.
Digital Object Identifier 10.1109/TIT.2006.883632
the inner product given by $\langle K_x, K_y \rangle_K = K(x, y)$. The reproducing property takes the form
$$f(x) = \langle K_x, f \rangle_K, \quad \forall x \in X, \ f \in \mathcal{H}_K. \tag{1}$$
Classification algorithms considered here are induced by regularization schemes learned from samples. Assume that $\rho$ is a probability distribution on $Z := X \times Y$ and $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^{m}$ is a set of random samples independently drawn according to $\rho$. The batch learning algorithm for classification is implemented by an off-line regularization scheme [29] in the RKHS $\mathcal{H}_K$ involving the sample $\mathbf{z}$ and the loss function $\phi$ as
$$f_{\mathbf{z}} = f_{\mathbf{z},\lambda} := \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} \phi(y_i f(x_i)) + \lambda \|f\|_K^2 \right\}. \tag{2}$$
The classifier is induced by the real-valued function $f_{\mathbf{z}}$ as $\mathrm{sgn}(f_{\mathbf{z}})$.
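To make scheme (2) concrete, the following is a minimal numerical sketch (our illustration, not the paper's implementation): it minimizes the regularized empirical objective for the hinge loss by subgradient descent on kernel-expansion coefficients. The Gaussian kernel, the toy data, and all function names are assumptions chosen for the demo.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # Gram matrix of the Mercer kernel K(x, x') = exp(-|x - x'|^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def batch_regularized_fit(K, y, lam=0.1, lr=0.1, n_iter=1000):
    # Minimize (1/m) sum_i phi(y_i f(x_i)) + lam ||f||_K^2 with the hinge loss
    # phi(v) = max(0, 1 - v), writing f = sum_j c_j K_{x_j} (so ||f||_K^2 = c^T K c)
    # and running subgradient descent on the coefficient vector c.
    m = len(y)
    c = np.zeros(m)
    for _ in range(n_iter):
        margins = y * (K @ c)                  # y_i f(x_i)
        g = np.where(margins < 1, -y, 0.0)     # hinge subgradient at each sample
        c -= lr * (K @ g / m + 2 * lam * (K @ c))
    return c

# toy usage: labels determined (up to small noise) by the first coordinate
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=40))
K = gaussian_gram(X)
c = batch_regularized_fit(K, y)
train_acc = np.mean(np.sign(K @ c) == y)
```

The quadratic penalty $\lambda c^\top K c$ is what makes this a Tikhonov scheme; for the least square loss the same minimization has a closed form, while general convex losses require iterative solvers as above.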
The off-line algorithm induced by (2), a Tikhonov regularization scheme for learning [14], has been extensively studied in the literature; in particular, its error analysis is well understood. See, e.g., [27], [5], [35], [4], [7], [21], [31], [26]. The main idea of the analysis is to show that $f_{\mathbf{z}}$ behaves similarly to the regularizing function $f_\lambda$ of scheme (2) defined by
$$f_\lambda := \arg\min_{f \in \mathcal{H}_K} \left\{ \mathcal{E}(f) + \lambda \|f\|_K^2 \right\}. \tag{3}$$
Here $\mathcal{E}(f)$ is the generalization error defined as
$$\mathcal{E}(f) := \int_Z \phi(yf(x)) \, d\rho.$$
This expected similarity between $f_{\mathbf{z}}$ and $f_\lambda$ is motivated by the law of large numbers, which tells us that $\frac{1}{m} \sum_{i=1}^{m} \phi(y_i f(x_i))$ converges to $\mathcal{E}(f)$ with confidence for any fixed function $f$. For a function set, such as the union of unit balls of reproducing kernel Hilbert spaces associated with a set of Mercer kernels, the theory of uniform convergence is involved. See, e.g., [29], [1], [34].
Though the off-line algorithm (2) performs well in theory and in many applications, it might be practically challenging when the sample size or the data is very large. For example, for the loss functions corresponding to the support vector machines, the scheme (2) is a quadratic optimization problem whose standard complexity is about $O(m^3)$. When $m$ is very large, the algorithm is hard to implement.
Online algorithms with linear complexity $O(m)$ can be applied and provide efficient classifiers when the sample size is
0018-9448/$20.00 © 2006 IEEE
large. In this paper we investigate online classification algorithms generated by Tikhonov regularization schemes in reproducing kernel Hilbert spaces. Convergence to $f_\lambda$ in the RKHS norm as well as with respect to the excess misclassification error (to be defined) will be shown, and a rate analysis will be carried out. These online algorithms can be considered as a descent method; see the discussion in the Appendix.
Throughout the paper, we assume that the loss function $\phi$ has the following form.
Definition 1: We say that $\phi: \mathbb{R} \to \mathbb{R}_+$ is an admissible loss function if it is convex and differentiable at $0$ with $\phi'(0) < 0$.
The convexity of $\phi$ tells us that the left derivative $\phi'_-(x)$ exists at every $x \in \mathbb{R}$ and is nondecreasing. It is the same as the derivative $\phi'(x)$ when it coincides with the right derivative $\phi'_+(x)$.
This paper is aimed at the following online algorithm for classification given in [9], [22], and [17].
Definition 2: The Stochastic Gradient Descent Online Algorithm for classification is defined by $f_1 = 0$ and
$$f_{t+1} = f_t - \eta_t \left\{ \phi'_-(y_t f_t(x_t)) \, y_t K_{x_t} + \lambda f_t \right\} \quad \text{for } t = 1, \dots, m \tag{4}$$
where $\lambda > 0$ is the regularization parameter and $\eta_t > 0$ is called the step size. The classifier is given by the sign function $\mathrm{sgn}(f_{m+1})$.
We call the sequence $\{f_t\}$ the learning sequence for the online scheme (4), which will be used to learn the regularizing function $f_\lambda$.
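For illustration (our own sketch, not part of the paper), the update (4) can be simulated by storing $f_t$ through its kernel-expansion coefficients on the points seen so far: the $-\eta_t \lambda f_t$ term shrinks all existing coefficients, and each step appends one new coefficient on $K_{x_t}$. The Gaussian kernel, the step-size schedule, and the hinge left derivative are assumptions chosen for the demo.

```python
import numpy as np

def online_rkhs_sgd(samples, phi_left_deriv, lam=0.01, theta=0.5, eta1=0.5, sigma=1.0):
    # One pass of scheme (4): f_1 = 0 and
    #   f_{t+1} = f_t - eta_t * ( phi'_-(y_t f_t(x_t)) y_t K_{x_t} + lam f_t ),
    # with f_t stored as coefficients on the points seen so far.
    pts, coef = [], []
    for t, (x, y) in enumerate(samples, start=1):
        eta = eta1 * t ** (-theta)                   # polynomially decaying step size
        fx = sum(c * np.exp(-np.sum((p - x) ** 2) / (2 * sigma ** 2))
                 for p, c in zip(pts, coef))         # evaluate f_t(x_t)
        coef = [(1 - eta * lam) * c for c in coef]   # the -eta*lam*f_t shrinkage
        pts.append(x)
        coef.append(-eta * phi_left_deriv(y * fx) * y)
    return pts, coef

# usage with the hinge loss, whose left derivative is -1 for v <= 1 and 0 otherwise
hinge_left = lambda v: -1.0 if v <= 1 else 0.0
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Y = np.sign(X[:, 0])
pts, coef = online_rkhs_sgd(zip(X, Y), hinge_left)
f = lambda x: sum(c * np.exp(-np.sum((p - x) ** 2) / 2) for p, c in zip(pts, coef))
acc = np.mean([np.sign(f(x)) == y for x, y in zip(X, Y)])
```

Each sample is touched once, so one pass costs $O(m)$ kernel expansions, in contrast with the quadratic program behind (2).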
There is a vast literature on online learning. Let us mention some works relating to this paper. In [22], a stochastic gradient method in a Hilbert space $W$ is considered. Let $SL_+(W)$ be the space of positive definite linear operators on $W$, and $A: Z \to SL_+(W)$ and $b: Z \to W$ be two maps. To learn a stationary point $w^*$ satisfying $\mathbb{E}[A(z)] w^* = \mathbb{E}[b(z)]$, they proposed the learning sequence
$$w_{t+1} = w_t - \eta_t \left( A(z_t) w_t - b(z_t) \right). \tag{5}$$
But the online scheme (4) involving a general loss function is nonlinear and is hard to write in the setting (5), except for the least square loss. When $\phi$ is the least square loss, we shall improve the estimates in [22] for online regression by presenting capacity independent learning rates (in Section II, Theorem 3), comparable to the best ones in [36], [26] for the off-line regularization setting (2). This is a byproduct of our analysis for online classification algorithms.
The cumulative loss for online algorithms more general than (4) has been well studied in the literature. See, for example, [15], [3], [10], [9], [16] and references therein. In particular, cumulative loss bounds are derived for online density estimation in [3] and for online linear regression with least square loss in [9]. In Section VI of [15], for a learning algorithm different from (4), the relative expected instantaneous loss, measuring the prediction ability in the linear regression problem, is analyzed in detail.
A general regularized online learning scheme (4) is introduced and analyzed in [17]. Assume the loss function $\phi$ is convex and uniformly Lipschitz continuous, the step sizes have a fixed form, and $\lambda$ is fixed. It was proved there that the average instantaneous risk converges toward the regularized generalization error with an explicit error bound.
In this paper we mainly analyze the error $\|f_{m+1} - f_\lambda\|_K$ in the $\mathcal{H}_K$ norm, which is different from estimating cumulative loss bounds as done in many previous results (e.g., [3], [10], [17]). Such a strong approximation was considered for least square regression in [25].
II. MAIN RESULTS
The purpose of this paper is to give error bounds for $\mathbb{E}\|f_{m+1} - f_\lambda\|_K^2$ with fixed $\lambda > 0$, and then apply them to the analysis of the online classification algorithm (4): estimating the misclassification error, a quantity measuring the prediction ability, by suitable choices of the regularization parameter $\lambda = \lambda(m)$.
Our first main result states that the sequence $\{f_t\}$ defined by (4) converges in expectation to $f_\lambda$ in $\mathcal{H}_K$ as long as the sequence is uniformly bounded.
Theorem 1: Let $\lambda > 0$ and the sequence of positive step sizes $\{\eta_t\}$ satisfy
$$\lim_{t \to \infty} \eta_t = 0 \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t = \infty. \tag{6}$$
If the learning sequence $\{f_t\}$ given in (4) is uniformly bounded in $\mathcal{H}_K$, then
$$\lim_{m \to \infty} \mathbb{E}\left[ \|f_{m+1} - f_\lambda\|_K^2 \right] = 0. \tag{7}$$
Theorem 1 will be proved in Section V. The assumption of uniform boundedness is mild and will be studied in Section III. In particular, we shall verify that $\{f_t\}$ is uniformly bounded when the step sizes satisfy the restriction of Theorem 4 below, e.g., polynomially decaying step sizes with a suitable constant factor.
Recall from (3) that $f_\lambda$ is the minimizer of the regularized generalization error
$$\mathcal{E}_\lambda(f) := \mathcal{E}(f) + \lambda \|f\|_K^2, \quad f \in \mathcal{H}_K. \tag{8}$$
Our second main result is the following important relation between the excess regularized generalization error and the $\mathcal{H}_K$ metric, which plays an essential role in proving Theorem 1.
Theorem 2: Let $\phi$ be an admissible loss function and $\lambda > 0$. For any $f \in \mathcal{H}_K$, there holds
$$\lambda \|f - f_\lambda\|_K^2 \le \mathcal{E}_\lambda(f) - \mathcal{E}_\lambda(f_\lambda). \tag{9}$$
Theorem 2 will be proved in Section IV. It will be used to derive convergence rates of $\mathbb{E}\|f_{m+1} - f_\lambda\|_K^2$ (more quantitative than (7)) when the step size decays polynomially in $t$ (see Theorem 5 in Section VI). These rates, together with the approximation error (called the regularization error below) between $f_\lambda$ and the minimizer of the
generalization error, yield capacity independent learning rates of the misclassification error of the online algorithm (4). This is our last main result, illustrated in the following subsections.
A. Misclassification Error of Classification Algorithms
The prediction power of a classification algorithm is often measured by the misclassification error, which is defined for a classifier $\mathcal{C}: X \to Y$ to be the probability of the event $\{\mathcal{C}(x) \ne y\}$:
$$\mathcal{R}(\mathcal{C}) := \mathrm{Prob}\{\mathcal{C}(x) \ne y\} = \int_X P(y \ne \mathcal{C}(x) \mid x) \, d\rho_X. \tag{10}$$
Here $\rho_X$ denotes the marginal distribution of $\rho$ on $X$, and $\rho(\cdot \mid x)$ the conditional probability measure. The best classifier minimizing the misclassification error is called the Bayes rule [13] and can be expressed as $f_c = \mathrm{sgn}(f_\rho)$, where $f_\rho$ is the regression function
$$f_\rho(x) := \int_Y y \, d\rho(y \mid x) = P(y = 1 \mid x) - P(y = -1 \mid x), \quad x \in X. \tag{11}$$
Recall that for the online learning algorithm (4), we are interested in the classifier $\mathrm{sgn}(f_{m+1})$ produced by the real-valued function $f_{m+1}$ learned from a sample $\mathbf{z}$. So the error analysis for the classification algorithm (4) is aimed at the excess misclassification error
$$\mathcal{R}(\mathrm{sgn}(f_{m+1})) - \mathcal{R}(f_c). \tag{12}$$
Let us present some examples, proved in Section VII, to illustrate the learning rates of (12) resulting from suitable choices of the regularization parameter $\lambda$ and the step sizes $\eta_t$.
The first example corresponds to the classical support vector machine (SVM) with $\phi$ being the hinge loss $\phi(v) = (1 - v)_+ = \max\{0, 1 - v\}$. For this loss, the online algorithm (4) can be expressed as $f_1 = 0$ and
$$f_{t+1} = \begin{cases} f_t - \eta_t (\lambda f_t - y_t K_{x_t}), & \text{if } y_t f_t(x_t) \le 1 \\ f_t - \eta_t \lambda f_t, & \text{if } y_t f_t(x_t) > 1. \end{cases} \tag{13}$$
Corollary 1: Let . Assume for some, the pair satisfies
(14)
For any , choose
and where .Then
(15)
In (15), the expectation is taken with respect to the random sample $\mathbf{z} \in Z^m$. We shall use this notation throughout the paper when the random variable for the expectation is not specified.
The condition (14) concerns the approximation of the target function in a suitable function space by functions from the RKHS $\mathcal{H}_K$. It can be characterized by requiring the target to lie in an interpolation space of the pair, an intermediate space between the metric space and the much smaller approximating space $\mathcal{H}_K$. For details, see the discussion in [7]. The assumption (14) is satisfied [34] when we use the Gaussian kernels with flexible variances and the distribution satisfies some geometric noise condition [21].
Assumptions like (14) are necessary to determine the regularization parameter for achieving the learning rate (15). This can be seen from the literature [21], [31], [35] on the off-line algorithm (2): learning rates are obtained by suitable choices of the regularization parameter $\lambda = \lambda(m)$, according to the behavior of the approximation error estimated from a priori conditions on the distribution $\rho$ and the space $\mathcal{H}_K$.
The second example is for the least square loss. Here the approximation error [23], [25] can be studied by the integral operator $L_K$ on the space $L^2_{\rho_X}$ defined by
$$L_K(f)(x) := \int_X K(x, u) f(u) \, d\rho_X(u), \quad x \in X.$$
Since $K$ is a Mercer kernel, the operator $L_K$ is symmetric, compact, and positive. Therefore its power $L_K^r$ is well defined for any $r > 0$.
Corollary 2: Let and be in the rangeof with some . For any , take
and . Then
(16)
Let us analyze the assumption as in [25]. Denote by $\{\mu_i\}$ the positive eigenvalues of the positive compact operator $L_K$ and by $\{\psi_i\}$ the corresponding orthonormal eigenfunctions. Any function $f \in L^2_{\rho_X}$ can be expressed as $f = \sum_i a_i \psi_i$ with coefficients $\{a_i\}$ satisfying $\sum_i a_i^2 < \infty$. The function $f$ lies in the range of $L_K^r$ for some $r > 0$ if and only if $\sum_i a_i^2 \mu_i^{-2r} < \infty$, that is, the sequence $\{a_i\}$ has such a nice decay that $a_i = \mu_i^r b_i$ for another square-summable sequence $\{b_i\}$. In particular [11], $f$ is in the range of $L_K^{1/2}$ if and only if $f \in \mathcal{H}_K$. So the smaller $r$ is, the less demanding the assumption in Corollary 2. This assumption can also be characterized [23] by the decay of the approximation error. For the least square loss, there holds $\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_{L^2_{\rho_X}}^2$. This leads to the condition on the regularization error discussed in the next subsection for the general loss.
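Numerically, the integral operator and the source condition can be pictured through the Gram matrix: with samples drawn from $\rho_X$, the matrix $\frac{1}{m}(K(x_i, x_j))$ approximates $L_K$. The sketch below (our illustration; the kernel and the sampling law are assumptions) eigendecomposes this matrix and builds coefficients $a_i = \mu_i^r b_i$ as in the characterization above.

```python
import numpy as np

# Empirical picture of L_K: (1/m) * Gram matrix approximates the integral
# operator, and its eigenpairs approximate (mu_i, psi_i).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))
K = np.exp(-((X - X.T) ** 2) / 0.5)        # Gaussian Gram matrix
mu, psi = np.linalg.eigh(K / len(X))       # eigh returns eigenvalues ascending
mu, psi = mu[::-1], psi[:, ::-1]           # reorder to descending
mu = np.clip(mu, 0.0, None)                # clip tiny negative round-off

# A function f = L_K^r g has expansion coefficients a_i = mu_i^r * b_i with
# {b_i} square-summable, so a larger r forces faster decay of the a_i.
r = 0.5
b = rng.normal(size=len(mu))
a = (mu ** r) * b
```

The fast spectral decay of a smooth kernel is what makes the source condition progressively stronger as $r$ grows.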
B. Comparison Theorem and Regularization Error
Estimating the excess misclassification error $\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c)$ in (12) can often be done by bounding the excess generalization error [35], [4], [37]
$$\mathcal{E}(f) - \mathcal{E}(f_\rho^\phi) \tag{17}$$
where $f_\rho^\phi$ is a minimizer of the generalization error
$$f_\rho^\phi := \arg\min \left\{ \mathcal{E}(f) : f \text{ is measurable on } X \right\}.$$
In particular, for the SVM 1-norm soft margin classifier with the hinge loss $\phi(v) = (1 - v)_+$, we have $f_\rho^\phi = f_c$ [30], and an important relation was given in [35] as
$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \le \mathcal{E}(f) - \mathcal{E}(f_c). \tag{18}$$
Such a relation is called a comparison theorem. For a general loss function, a simple comparison theorem was established in [7] and [4].
Proposition A: Let $\phi$ be an admissible loss function such that $\phi''(0)$ exists and is positive. Then there is a constant $c_\phi$ such that for any measurable function $f: X \to \mathbb{R}$, there holds
$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \le c_\phi \sqrt{\mathcal{E}(f) - \mathcal{E}(f_\rho^\phi)}. \tag{19}$$
If, moreover, for some $q \in [0, \infty]$ and $c_q > 0$, $\rho$ satisfies a Tsybakov noise condition: for any measurable function $f$
$$\rho_X\left( \left\{ x \in X : \mathrm{sgn}(f)(x) \ne f_c(x) \right\} \right) \le c_q \left( \mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \right)^{\frac{q}{q+1}} \tag{20}$$
then (19) can be improved to
$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \le \tilde{c}_\phi \left( \mathcal{E}(f) - \mathcal{E}(f_\rho^\phi) \right)^{\frac{q+1}{q+2}}. \tag{21}$$
The Tsybakov noise condition (20) was introduced in [28], where the reader can find more details and explanation. The greater $q$ is, the smaller the noise of $\rho$. In particular, any distribution satisfies (20) with $q = 0$ and $c_q = 1$.
With a comparison theorem, it is sufficient for us to estimate the excess generalization error (17). In order to do so, we need the regularization error [24] between $f_\lambda$ and $f_\rho^\phi$.
Definition 3: The regularization error for (2) associated with the triple $(K, \phi, \rho)$ is
$$\mathcal{D}(\lambda) := \inf_{f \in \mathcal{H}_K} \left\{ \mathcal{E}(f) - \mathcal{E}(f_\rho^\phi) + \lambda \|f\|_K^2 \right\} = \mathcal{E}_\lambda(f_\lambda) - \mathcal{E}(f_\rho^\phi).$$
Now we have the error decomposition [32] for (17) as
$$\mathcal{E}(f_{m+1}) - \mathcal{E}(f_\rho^\phi) \le \left\{ \mathcal{E}_\lambda(f_{m+1}) - \mathcal{E}_\lambda(f_\lambda) \right\} + \mathcal{D}(\lambda). \tag{22}$$
The regularization error term $\mathcal{D}(\lambda)$ in the error decomposition (22) is independent of the sample $\mathbf{z}$. It can be estimated by $K$-functionals from the rich knowledge of approximation theory. For more details, see the discussions in [23], [7], [31].
The first term $\mathcal{E}_\lambda(f_{m+1}) - \mathcal{E}_\lambda(f_\lambda)$ in (22) is called the sample error, which may be bounded by the error $\|f_{m+1} - f_\lambda\|_K$. To show the idea, we mention a rough approach here. More refined estimates are possible for specific loss functions, as shown in Section VII. Observe from the convexity of $\phi$ that $\phi'_-$ and $\phi'_+$ are both nondecreasing, and $\phi'_-(x) \le \phi'_+(x)$ for every $x$.
Denote by $C(X)$ the space of continuous functions on $X$ with the norm $\|\cdot\|_\infty$. Then the reproducing property (1) tells us that
$$\|f\|_\infty \le \kappa \|f\|_K, \quad \forall f \in \mathcal{H}_K, \quad \text{where } \kappa := \sup_{x \in X} \sqrt{K(x, x)}. \tag{23}$$
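The norm relation (23) is easy to verify numerically. The following sketch (our illustration, with an assumed Gaussian kernel, for which $\kappa = 1$) checks $\sup_x |f(x)| \le \kappa \|f\|_K$ on a grid for a random kernel expansion.

```python
import numpy as np

# Check (23): for f = sum_i c_i K_{x_i}, ||f||_K^2 = c^T G c with G the Gram
# matrix, and sup_x |f(x)| <= kappa ||f||_K where kappa = sup_x sqrt(K(x, x)).
rng = np.random.default_rng(2)
centers = rng.uniform(-1, 1, size=(15, 1))
c = rng.normal(size=15)
G = np.exp(-((centers - centers.T) ** 2) / 2)   # Gaussian kernel Gram matrix
norm_K = np.sqrt(c @ G @ c)
kappa = 1.0                                     # Gaussian kernel: K(x, x) = 1

grid = np.linspace(-2, 2, 2001)[:, None]
f_vals = np.exp(-((grid - centers.T) ** 2) / 2) @ c
sup_norm = np.abs(f_vals).max()                 # sup_norm <= kappa * norm_K holds
```

The inequality follows from $|f(x)| = |\langle K_x, f \rangle_K| \le \|K_x\|_K \|f\|_K = \sqrt{K(x,x)} \, \|f\|_K$, which is exactly the reproducing property plus Cauchy–Schwarz.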
Corollary 3: Let $\phi$ be an admissible loss function and $\lambda > 0$. There holds
$$\mathcal{E}_\lambda(f) - \mathcal{E}_\lambda(f_\lambda) \le \left\{ \kappa \max_{s \in J} \max\left( |\phi'_-(s)|, |\phi'_+(s)| \right) + \lambda \left( \|f\|_K + \|f_\lambda\|_K \right) \right\} \|f - f_\lambda\|_K$$
where $J$ is the interval $[-R, R]$ with $R = \kappa \max\{\|f\|_K, \sqrt{\phi(0)/\lambda}\}$.
Proof: Since $\mathcal{E}_\lambda(f_\lambda) \le \mathcal{E}_\lambda(0) = \phi(0)$, the definition (3) of $f_\lambda$ tells us that
$$\|f_\lambda\|_K \le \sqrt{\phi(0)/\lambda}. \tag{24}$$
Note the elementary inequality
$$|\phi(a) - \phi(b)| \le \max\left\{ |\phi'_-(s)|, |\phi'_+(s)| : s \in I \right\} |a - b|$$
where $I$ is an interval containing $a$ and $b$. Applying this inequality with $a = yf(x)$ and $b = yf_\lambda(x)$, we know that for any $f \in \mathcal{H}_K$, $|\mathcal{E}(f) - \mathcal{E}(f_\lambda)|$ can be bounded by $\max\{ |\phi'_-(s)|, |\phi'_+(s)| : s \in I \} \, \|f - f_\lambda\|_\infty$, where $I$ is an interval containing the values $yf(x)$ and $yf_\lambda(x)$. But (23) implies $\|f - f_\lambda\|_\infty \le \kappa \|f - f_\lambda\|_K$ and $\|f_\lambda\|_\infty \le \kappa \sqrt{\phi(0)/\lambda}$. In connection with the fact
$$\lambda \|f\|_K^2 - \lambda \|f_\lambda\|_K^2 = \lambda \left( \|f\|_K + \|f_\lambda\|_K \right) \left( \|f\|_K - \|f_\lambda\|_K \right)$$
and the inequality $\|f\|_K - \|f_\lambda\|_K \le \|f - f_\lambda\|_K$, we verify the desired bound.
Let us show, by the example of the SVM $q$-norm loss, the learning rates of the excess misclassification error for the online algorithm (4), derived from the regularization error, the choice of the regularization parameter, and the step sizes.
Corollary 4: Let with . Assume thatfor some , there holds . For any
, take and
with
Then
If moreover, the distribution satisfies (20), then the rate can beimproved to
C. A Byproduct for Regression and Rate Comparison
Our method for the classification algorithm can provide notonly estimates for the excess misclassification error, but alsoestimates for the strong approximation in the metric. Letus show this by the least square loss. We assume that lies inthe range of for some .
Theorem 3: Let $\phi$ be the least square loss and $f_\rho$ be in the range of $L_K^r$ for some $r > 0$. For any $m$, take
and
Then
(25)
Let us compare our learning rates for the least square loss with the existing results.
First, we will show that our learning rate in the online setting is comparable to that in the off-line setting under the same assumption on the approximation error $\mathcal{D}(\lambda)$. In [5], [36], a leave-one-out technique was used to bound the expected error of the off-line regularization algorithm (2). The results in [35], [36] can be expressed as
In terms of the regularization error, it can be restated as
(26)
We know from Lemma 3 in [26] that if $f_\rho$ is in the range of $L_K^r$ for some $r > 0$ then
Putting this into (26) and trading off the two terms, we know that a suitable choice of $\lambda = \lambda(m)$ gives the optimal rate
(27)
If $f_\rho$ is in the range of $L_K^r$ with appropriate $r$ and an additional condition holds for some constant, the error
is given in Theorem 2 of [26] for the off-line regularizationscheme (2) as
(28)
Hence our online learning rate (25) is almost the same as the corresponding rate (28) in the off-line setting, while the rate (16) is suboptimal compared with (27).
Next, we will compare our learning rate (16) with the one in [22] under the same assumption: $f_\rho$ is in the range of $L_K^r$ for some $r > 0$. The result given in Theorem A and Remark 2.1 of [22] is
for (29)
where and are constants.
Selecting the parameters appropriately, there is another constant such that (29) can be rewritten as
Since $f_\rho$ is in the range of $L_K^r$ for some $r > 0$, by [26] we have
Consequently, the corresponding choice of $\lambda$ gives the optimal rate with a constant:
Setting the remaining parameters accordingly gives the optimal rate
(30)
Our learning rate (16) is better, since its exponent dominates that in (30).
III. BOUNDING THE LEARNING SEQUENCE
In this section, we show how a local Lipschitz condition on the loss function at the origin and some restrictions on the step sizes ensure the uniform boundedness of the learning sequence $\{f_t\}$, a crucial assumption for the convergence of the online scheme (4) stated in Theorem 1. The uniform boundedness then holds for all loss functions encountered in specific classification algorithms.
Definition 4: We say that $\phi$ is locally Lipschitz at the origin if the local Lipschitz constant
$$L(R) := \sup_{u \ne v, \; u, v \in [-R, R]} \frac{|\phi(u) - \phi(v)|}{|u - v|} \tag{31}$$
is finite for any $R > 0$.
By convexity, the above local Lipschitz constant can be expressed through the one-sided derivatives as $L(R) = \max\{ |\phi'_+(-R)|, |\phi'_-(R)| \}$. Thus, when $\phi$ is twice continuously differentiable on $\mathbb{R}$, $\phi$ is locally Lipschitz at the origin. Examples of loss functions will be discussed after the following theorem on the boundedness of the sequence $\{f_t\}$.
Theorem 4: Assume that is locally Lipschitz at the origin.Define by (4). If the step size satisfies
for each , then
(32)
Proof: We prove by induction. It is trivial thatsatisfies the bound (32). Suppose that this bound holds true for
. Consider . It can be written as
Write the middle term as
(33)
Here, by means of the reproducing property (1) for $K_{x_t}$, we have denoted a self-adjoint, rank-one, positive linear operator given by
The operator norm can be bounded as sincefor any .
The local Lipschitz condition tells us thatis well defined (set as zero when ). It is bounded by
, since byour induction hypothesis. Since is nondecreasing,
for any . Thus
Therefore, is a self-adjoint, positive linearoperator on . Its norm is bounded by .When , the linear operator
on is self-adjoint, positive, and . It followsthat
since . This in connection withthe induction hypothesis on implies that
This completes the induction procedure and proves the theorem.
The following are some commonly used examples of loss functions (e.g., [4], [18], [19], [12], [35]).
Corollary 5: Let $\lambda > 0$ and $\{f_t\}$ be defined by (4). Then the learning sequence is uniformly bounded in $\mathcal{H}_K$ if each step size $\eta_t$ satisfies:
1) for the least square loss;
2) for the $q$-norm SVM loss with $1 < q \le 2$;
3) for the $q$-norm SVM loss with $q > 2$;
4) for the exponential loss.
Proof: Our conclusion follows from Theorem 4 and the expressions for the local Lipschitz constant $L(R)$ derived separately.
1) The least square loss $\phi(v) = (1 - v)^2$ is twice continuously differentiable with $\phi'(v) = -2(1 - v)$; hence $L(R) = 2(R + 1)$.
2) For the $q$-norm SVM loss $\phi(v) = ((1 - v)_+)^q$ with $1 < q \le 2$, the derivative is
$$\phi'(v) = \begin{cases} -q(1 - v)^{q-1}, & \text{if } v \le 1 \\ 0, & \text{if } v > 1 \end{cases}$$
which gives $L(R) = q(1 + R)^{q-1}$ for all $R > 0$.
3) When $q > 2$, the same expression for $\phi'$ holds, and we find that $L(R) = q(1 + R)^{q-1}$.
4) For the exponential loss $\phi(v) = e^{-v}$, we have $\phi'(v) = -e^{-v}$ and $L(R) = e^{R}$.
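The closed forms above can be checked numerically. This sketch (ours, with an assumed range $R = 2$) estimates $L(R)$ in (31) by finite differences and compares it with the stated expressions for the least square, hinge, and exponential losses.

```python
import numpy as np

def lipschitz_estimate(phi, R, n=4001):
    # Estimate L(R) = sup |phi(u) - phi(v)| / |u - v| over [-R, R] by the
    # largest slope between adjacent grid points (approaches L(R) from below).
    v = np.linspace(-R, R, n)
    return np.abs(np.diff(phi(v)) / np.diff(v)).max()

R = 2.0
cases = {
    "least square": (lambda v: (1 - v) ** 2, 2 * (R + 1)),   # L(R) = 2(R+1)
    "hinge":        (lambda v: np.maximum(0.0, 1 - v), 1.0), # L(R) = 1
    "exponential":  (lambda v: np.exp(-v), np.exp(R)),       # L(R) = e^R
}
estimates = {name: lipschitz_estimate(phi, R) for name, (phi, _) in cases.items()}
```

For a convex loss the largest chord slope occurs at the endpoints of the interval, so the finite-difference estimate converges to the one-sided-derivative formula as the grid refines.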
IV. EXCESS GENERALIZATION ERROR AND RKHS METRIC
In this section, we prove Theorem 2, a relation between $\|f - f_\lambda\|_K^2$ and $\mathcal{E}_\lambda(f) - \mathcal{E}_\lambda(f_\lambda)$. This relation is very important for the proof of the general convergence result, Theorem 1, as well as for the error analysis done in the next section.
Lemma 1: Assume is differentiable. Then satisfies
(34)
for any $f \in \mathcal{H}_K$.
Proof: Since $f_\lambda$ is a minimizer of the regularized generalization error defined by (8), taking a suitable comparison function yields
Hence . Then for anyand , we know that
is nonnegative and equals
Letting the parameter tend to zero, by the Lebesgue dominated convergence theorem, we see that
But . It follows that
This is true for every , which implies
(35)
Taking inner products with the appropriate element in (35) proves the lemma.
We first prove Theorem 2 for differentiable loss functions.
Lemma 2: Let $\lambda > 0$ and $\phi$ be a differentiable convex loss function. Then for any $f \in \mathcal{H}_K$ there holds
Proof: Let . Define a univariate functionby
We have
(36)
Since $\phi$ is differentiable, this univariate function is also differentiable. In fact, writing out its difference quotient, the derivative equals
The Lebesgue dominated convergence theorem ensures that
This in connection with Lemma 1 tells us that equals
(37)
Since is convex in , it satisfies
Using this for the two relevant pairs of arguments, we see from (37) that
Therefore
This proves the desired result.
The intermediate step (37) in the above proof yields an interesting result for the least square loss, which may be useful in analyzing the Tikhonov regularization scheme (2).
Corollary 6: Let $\phi(v) = (1 - v)^2$ be the least square loss. For any $f \in \mathcal{H}_K$, there holds
Proof: It follows directly from (37) and the special form of the least square loss.
If $\phi$ is not differentiable, like the hinge loss, we approximate it by a convex, differentiable loss function defined for each approximation parameter as
The approximation is valid: the approximants converge to $\phi$ uniformly on bounded sets. Hence for any $f$, there holds
as the parameter tends to zero. (38)
Now we can prove Theorem 2 for a general loss function.Proof of Theorem 2: We define, for any ,
and
(39)
Here, for the approximating loss, we have used the conventional notation for the corresponding regularized error and its minimizer.
Since is the minimizer of (39), by taking we get
which implies
for any (40)
Since any closed ball of the Hilbert space $\mathcal{H}_K$ is weakly compact, the estimate (40) tells us that there exists a sequence such that
it converges to some limit weakly. That is,
(41)
In particular
and
(42)
Let in (41). Then the reproducing property (1) yields
(43)
This, in connection with the continuity of the loss function and the Lebesgue dominated convergence theorem, gives
The uniform bound (40), in connection with the uniform convergence (38), ensures that
(44)
Therefore, by (42), we have
Taking in (39), we know that
which means
This tells us that the weak limit is also a minimizer of $\mathcal{E}_\lambda$. The strict convexity of the functional $\mathcal{E}_\lambda$ on $\mathcal{H}_K$ verifies the uniqueness of the minimizer, which leads to the identification of the weak limit with $f_\lambda$. That is, (41) and (42) hold with the limit replaced by $f_\lambda$.
Applying Lemma 2 to the modified loss function, we have
(45)
Applying (41), we know that
which can be bounded accordingly. Hence
This, in connection with (45), (42), and (44), implies that the quantity in question is bounded by
Letting the approximation parameter tend to zero, the conclusion of Theorem 2 is proved.
V. GENERAL CONVERGENCE RESULTS
In this section, we prove our first main result, Theorem 1. The essential estimate in the proof will also be used in the next section to give convergence rates. Note that the uniform boundedness of $\{f_t\}$ in $\mathcal{H}_K$ implies, by (23), a uniform bound on $\|f_t\|_\infty$. For simplicity, we introduce shorthand notation for the recurring quantities.
Lemma 3: Assume that for some constants there holds
(46)
for any $t$. Then
(47)
which can be further controlled by
(48)
Proof: Recall that where. Then
(49)
By the reproducing property (1), part of the last term of (49) equals
(50)
Since $\phi$ is a convex function on $\mathbb{R}$, we know that
Applying this relation twice, together with (50), yields
(51)
The Schwarz inequality
implies
Putting this and (51) into the middle term of (49), we know that the relevant quantity is bounded by
Since this bound depends on $t$ but not on the sample point, it follows that the expectation can be bounded by
(52)
This, in connection with (46) and (49), gives
(53)
By Theorem 2, this implies that the expected squared distance is bounded by
(54)
Applying this relation iteratively for $t = 1, \dots, m$, we see that the final quantity is bounded by
This proves the first statement.
The second statement follows from the inequality $1 - x \le e^{-x}$ for any $x \ge 0$.
We are in a position to prove Theorem 1 stated in the Introduction. For this purpose, we use (47), while the bound (48) will be used to derive explicit learning rates in the next section.
Proof of Theorem 1: By (6), there exists an integer such that the step-size condition holds for all later $t$. Since $\{f_t\}$ is uniformly bounded in $\mathcal{H}_K$, (46) is true for some constant. Applying Lemma 3, it is sufficient to estimate the right-hand side of (47). According to the assumption (6) on the step sizes, the second term of (47) tends to zero; so for any $\epsilon > 0$ there exists some index such that the second term of (47) is bounded by $\epsilon$ whenever $m$ is large enough.
To deal with the first term, we use the assumption that the step sizes tend to zero and know that there exists some index beyond which they are small. Write the first term as
(55)
Since this index is fixed, we can find some stage such that for each later $t$, the required smallness holds. It follows that for each such $t$, there holds
This, in connection with the bound for each early term, tells us that the first term of (55) is bounded as
The second term of (55) is dominated by the corresponding tail sum, which is small.
Then
Therefore, when $m$ is sufficiently large, by Lemma 3, we obtain the desired smallness of $\mathbb{E}\|f_{m+1} - f_\lambda\|_K^2$. This proves Theorem 1.
VI. CONVERGENCE RATES
Now we can derive convergence rates for the error $\mathbb{E}\|f_{m+1} - f_\lambda\|_K^2$. The step size here is often of the form $\eta_t = \eta_1 t^{-\theta}$ for some $0 < \theta \le 1$ and $\eta_1 > 0$. So, to apply Lemma 3 for obtaining error bounds, we need to estimate the summations in (48); this leads to the following lemmas.
Lemma 4: For any and , there holds
if ,if .
(56)
The proof follows from the simple inequality $t^{-\theta} \le \int_{t-1}^{t} x^{-\theta} \, dx$ for $t \ge 2$.
The next lemma, in a modified form, was given in [22] for a special case.
Lemma 5: Let and . Then
is bounded by
,
.
Proof: Denote .For , we apply (56) in Lemma 4 and see that isbounded by
For , we have and
. Then is bounded by
(57)
Decompose the above integral into two parts. We have
(58)
When , we have and
. Also, . Hence
(59)
Combining (57), (58), and (59), we get
For , the inequality (56) implies is bounded by. For , there holds
. Hence the stated bound holds, which proves the lemma.
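The integral-comparison device behind Lemma 4 can be sanity-checked numerically. The bounds below are the standard ones obtained from $t^{-\theta} \le \int_{t-1}^{t} x^{-\theta} \, dx$ (our formulation; the exact constants in (56) may differ).

```python
import numpy as np

def partial_sum(theta, T):
    # sum_{t=1}^{T} t^(-theta)
    t = np.arange(1, T + 1, dtype=float)
    return np.sum(t ** (-theta))

# Standard integral-comparison bounds:
#   sum_{t=1}^T t^(-theta) <= T^(1-theta) / (1-theta)   for 0 < theta < 1,
#   sum_{t=1}^T t^(-1)     <= 1 + log T                  for theta = 1.
T = 10_000
checks = [partial_sum(th, T) <= T ** (1 - th) / (1 - th) for th in (0.25, 0.5, 0.75)]
checks.append(partial_sum(1.0, T) <= 1 + np.log(T))
```

The two regimes, polynomial growth for $\theta < 1$ and logarithmic growth for $\theta = 1$, are exactly why the rate statements in this section split into two cases.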
We are in a position to state the convergence rates for the online algorithm (4). To this end, we need the following constant depending on the step-size parameters:
(60)
Theorem 5: Assume that $\phi$ is locally Lipschitz at the origin. Choose the step sizes as $\eta_t = \eta_1 t^{-\theta}$ for some $0 < \theta \le 1$ and suitable $\eta_1 > 0$. Define $\{f_t\}$ by (4) and the constant by (60).
1) For $0 < \theta < 1$,
(61)
2) For $\theta = 1$, $\mathbb{E}\|f_{m+1} - f_\lambda\|_K^2$ can be bounded by
(62)
Proof: The condition on the step sizes tells us that the assumption of Theorem 4 holds for each $t$. By Theorem 4, this yields a uniform bound on $\|f_t\|_K$ and hence, by (23), a uniform bound on $\|f_t\|_\infty$. Consequently, both $y_t f_t(x_t)$ and 0 lie in a bounded interval. It follows that the local Lipschitz bounds apply, and (46) holds with explicit constants. This, in connection with (48) of Lemma 3, tells us that
where
Since $\mathcal{E}_\lambda(f_\lambda) \le \phi(0)$, (24) gives a bound on $\|f_\lambda\|_K$. By Lemma 4, we know that
if
if .
By Lemma 5, can be bounded by
if ,
if .
This proves Theorem 5.
To apply Theorem 5 for deriving rates for the misclassification error, we need the constants involved. They depend on the loss function and play an essential role in obtaining learning rates. When $\phi$ satisfies the following increment condition with some exponent:
(63)
we can find the constants explicitly and then derive learning rates for the total error from Theorem 5. Denote a constant depending only on the loss parameters and satisfying
(64)
Corollary 7: Assume that $\phi$ satisfies (63) and is locally Lipschitz at the origin. Let $\lambda > 0$. Choose the step sizes as
with some $0 < \theta \le 1$ and the constant given by (64). Define $\{f_t\}$ by (4). Then
(65)
Proof: The increment condition (63) for $\phi$ tells us that
which implies
Also, the local Lipschitz constant $L(R)$ can be bounded as
Choose . Then
and our conclusion follows from Theorem 5.
VII. TOTAL ERROR BOUNDS AND LEARNING RATES
Applying the techniques mentioned above, we can derive the learning rates of the excess misclassification error for the online algorithm (4) from our analysis on $\mathbb{E}\|f_{m+1} - f_\lambda\|_K^2$ together with the regularization error $\mathcal{D}(\lambda)$.
A. Rates for the Online Algorithm With the Hinge Loss
First, we prove the learning rates for the online algorithm (13) with the hinge loss.
Proof of Corollary 1: Consider the hinge loss $\phi(v) = (1 - v)_+$. Recall the relation (18) between the excess misclassification error and the excess generalization error. Then
(66)
Using the uniform Lipschitz continuity of the hinge loss, we know that the sample error is controlled by $\|f_{m+1} - f_\lambda\|_K$. Combined with the assumption on the decay of the regularization error for some exponent, it follows from (66) that
(67)
Now we apply Corollary 7. It is easy to see that the hinge loss satisfies (63) with explicit constants. Choose the parameters
in Corollary 7. We know that the error term is bounded by
Select the step-size exponent appropriately. Since the required asymptotic behavior holds for any admissible choice of the parameters, we know that there exists a constant depending only on these parameters such that
Putting this back into (67), we have
Now take $\lambda = \lambda(m)$ as in the statement. We know that the following holds
This proves our conclusion.
B. Rates for the Online Algorithm With the Least Square Loss
Turn to the least square loss. It has the special feature that $\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_{L^2_{\rho_X}}^2$. This enables us to improve the learning rates.
Proof of Corollary 2: For the least square loss, we use the
relation (see [13])
Then can be bounded by
(68)
We know from Lemma 3 in [26] that if $f_\rho$ is in the range of $L_K^r$ for some $r > 0$ then the second term of (68) can be bounded as
(69)
The first term can be estimated by a refined bound for $\mathbb{E}\|f_{m+1} - f_\lambda\|_K^2$. Namely, under the stated choices of the parameters, for each $t$ there holds
(70)
Let us first prove (70). The choice of the step sizes and Theorem 4 tell us that the learning sequence is uniformly bounded as
The special feature of the least square loss yields
(71)
Combining this with the estimate above, we know that (71) is bounded by
(72)
The same procedure as for (53) in the proof of Lemma 3 gives
This, combined with (71) and (72), implies
From this bound, Theorem 2, and the restriction on the parameters, we know that the error can be bounded by
Applying this relation iteratively for , we see that
By Lemma 5 with , we know that
which verifies (70).
Choose the parameters as stated. For some constant, we have
Thus, the quantity is bounded by
Combining with (68) and (69), for a constant we have
Choose the parameters as in the statement. We obtain the desired bound
The proof of the corollary is complete.
Proof of Theorem 3: If the regression function $f_\rho$ is in the range of $L_K^r$ for some $r > 0$, then by Lemma 3 in [26] we know that
Replacing the corresponding estimate and arguing as in the proof of Corollary 2, we obtain the desired result.
C. Rates With the -Norm SVM Loss
We shall use the relation (19) to derive the learning rate. To estimate the excess generalization error using the error in the $\mathcal{H}_K$ metric, we need the following lemma.
Lemma 6: For the $q$-norm SVM loss $\phi(v) = ((1 - v)_+)^q$ with $1 < q < \infty$, and any $f \in \mathcal{H}_K$, there holds
(73)
Proof: Note the inequality [7]
It follows, by taking the appropriate functions, that the left-hand side can be bounded by
where the bound (24) for $\|f_\lambda\|_K$ is used. This completes the proof of the lemma.
Proof of Corollary 4: We apply Corollary 7. Observe that $\phi$ satisfies (63) with explicit constants. Choose the step sizes and the parameters accordingly. It follows from (65) that there is a constant, depending only on the kernel and loss parameters, such that
This, in connection with (73) and the assumption on the regularization error, shows that the relevant quantity can be bounded by
where the constant depends on the same parameters. Thus, with another constant, the excess generalization error can be bounded by
Combining with the error decomposition (22) and the regularization error decay, this implies that the total error is bounded by
Choose
then the desired result follows from the comparison relations (19) and (21).
VIII. CONCLUSION AND QUESTIONS
In this paper, we have investigated the strong convergence of online regularized classification algorithms involving general loss functions and reproducing kernel Hilbert spaces. We verified the convergence under rather weak assumptions on the step sizes and loss functions. A novel relation between the error in the RKHS norm and the excess regularized generalization error played an important role. As done for the off-line setting in the literature, by suitable choices of the regularization parameter $\lambda = \lambda(m)$ according to a priori conditions on the approximation error, we presented explicit capacity independent learning rates of the excess misclassification error for commonly used loss functions such as the hinge loss, the least square loss, and the SVM $q$-norm loss. In particular, we have shown (remarks following Theorem 3) that our learning rates in the $\mathcal{H}_K$ metric with the least square loss are comparable to the ones in the off-line setting.
Let us mention some questions for further study.
1) In this paper the regularization parameter depends only on the sample size, which means that (4) studied here is not a fully online algorithm. How to analyze the algorithm (4) with changing with the steps remains open.
2) It would be interesting to analyze the online algorithm in the case . This is closely related to the perceptron algorithms (see [29], [10]).
3) Here we assume the data is i.i.d. according to an unknown distribution . In many applications the data is not independent or identically distributed. It is unknown whether Markov chains and the theory of martingales can be used to deal with this case.
TABLE I
NOTATIONS
APPENDIX
ONLINE ALGORITHM AS DESCENT METHOD
The classical descent method [6] is an efficient method to solve unconstrained minimization problems of the form

min_w f(w)

where f is convex and continuously differentiable. The descent method is to find a suitable sequence {w_t} to approximate the minimum point w*:

w_{t+1} = w_t + η_t d_t

where η_t > 0 is the step size and d_t denotes the descent direction.
The descent method requires that

f(w_{t+1}) < f(w_t)

except when w_t is optimal. The Taylor expansion of f tells us

f(w_t + η_t d_t) ≈ f(w_t) + η_t ⟨∇f(w_t), d_t⟩

which requires

⟨∇f(w_t), d_t⟩ < 0.

That is, the direction d_t must make an obtuse angle with the gradient ∇f(w_t). If we select the descent direction as the negative gradient direction, d_t = −∇f(w_t), this gives the well-known descent algorithm

w_{t+1} = w_t − η_t ∇f(w_t).

Under some assumptions on the step sizes η_t, one can get the convergence f(w_t) → min_w f(w) [6].
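As a concrete illustration (the function, starting point, and step size below are our own choices, not from the paper), the negative-gradient update described above can be sketched as:

```python
def gradient_descent(grad, w0, steps, eta=0.1):
    """Iterate w_{t+1} = w_t - eta * grad(w_t); here the step size is constant,
    whereas the appendix allows a time-varying schedule eta_t."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3);
# the iterates contract geometrically toward the minimizer w* = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, steps=200)
```

For this convex quadratic, each step multiplies the distance to the minimizer by (1 − 2η), so convergence is geometric whenever 0 < η < 1.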
The above descent method motivates the classification algorithm (4). To see this, let us assume that the classifying loss is differentiable. Since the regularizing function is an unconstrained minimizer in of the functional (the regularized generalization error)
as the descent algorithm, we use the following sequence to approximate :
with
Note that the functional derivative (see [33])
However, the distribution is unknown in practical classification problems. What we have is a random sample
. So we replace the integral part by the random value ,
and the above descent scheme then becomes the stochastic gradient descent online algorithm (4).
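A minimal sketch of a kernel stochastic-gradient update of this kind is given below. The hinge loss, the Gaussian kernel, and all parameter values are illustrative assumptions, not the paper's algorithm (4) verbatim; the current hypothesis f_t is stored through its kernel expansion over the examples seen so far.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """A Mercer kernel K(x, z); the Gaussian kernel is an illustrative choice."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def online_regularized_hinge(samples, lam=0.01, step=lambda t: 1.0 / np.sqrt(t + 1)):
    """At step t, with step size eta_t, perform the stochastic gradient update
        f_{t+1} = (1 - eta_t * lam) * f_t - eta_t * V'(y_t * f_t(x_t)) * y_t * K_{x_t}
    where V(r) = max(0, 1 - r) is the hinge loss, so V'(r) = -1 for r < 1."""
    coeffs, points = [], []
    for t, (x, y) in enumerate(samples):
        eta = step(t)
        # evaluate f_t(x) through its kernel expansion
        fx = sum(a * gaussian_kernel(xi, x) for a, xi in zip(coeffs, points))
        # regularization term: shrink the whole expansion by (1 - eta * lam)
        coeffs = [(1 - eta * lam) * a for a in coeffs]
        # loss term: the hinge loss contributes -y only when the margin is < 1
        if y * fx < 1:
            coeffs.append(eta * y)
            points.append(x)
    return lambda x: sum(a * gaussian_kernel(xi, x)
                         for a, xi in zip(coeffs, points))
```

For a fixed regularization parameter this resembles the kernel online updates of [17]; the analysis in this paper instead ties the parameter to the sample size.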
ACKNOWLEDGMENT
The authors would like to thank the referees for valuable comments and suggestions.
REFERENCES
[1] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[2] N. Aronszajn, “Theory of reproducing kernels,” Trans. Amer. Math. Soc., vol. 68, pp. 337–404, 1950.
[3] K. S. Azoury and M. K. Warmuth, “Relative loss bounds for on-line density estimation with the exponential family of distributions,” Machine Learn., vol. 43, pp. 211–246, 2001.
[4] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” J. Amer. Statist. Assoc., vol. 101, pp. 138–156, 2006.
[5] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Machine Learn. Res., vol. 2, pp. 499–526, 2002.
[6] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[7] D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou, “Support vector machine soft margin classifiers: Error analysis,” J. Machine Learn. Res., vol. 5, pp. 1143–1175, 2004.
[8] N. Cesa-Bianchi, A. Conconi, and C. Gentile, “A second-order perceptron algorithm,” SIAM J. Comput., vol. 34, pp. 640–688, 2005.
[9] N. Cesa-Bianchi, P. Long, and M. K. Warmuth, “Worst-case quadratic loss bounds for prediction using linear functions and gradient descent,” IEEE Trans. Neural Netw., vol. 7, pp. 604–619, 1996.
[10] N. Cesa-Bianchi, A. Conconi, and C. Gentile, “On the generalization ability of on-line learning algorithms,” IEEE Trans. Inf. Theory, vol. 50, pp. 2050–2057, 2004.
[11] F. Cucker and S. Smale, “On the mathematical foundations of learning,” Bull. Amer. Math. Soc., vol. 39, pp. 1–49, 2001.
[12] E. De Vito, A. Caponnetto, and L. Rosasco, “Model selection for regularized least-squares algorithm in learning theory,” Found. Comput. Math., vol. 5, pp. 59–85, 2005.
[13] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1997.
[14] T. Evgeniou, M. Pontil, and T. Poggio, “Regularization networks and support vector machines,” Adv. Comput. Math., vol. 13, pp. 1–50, 2000.
[15] J. Forster and M. K. Warmuth, “Relative expected instantaneous loss bounds,” J. Comput. Syst. Sci., vol. 64, pp. 76–102, 2002.
[16] M. Herbster and M. K. Warmuth, “Tracking the best expert,” Machine Learn., vol. 32, pp. 151–178, 1998.
[17] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE Trans. Signal Process., vol. 52, pp. 2165–2176, 2004.
[18] Y. Lin, “Support vector machines and the Bayes rule in classification,” Data Min. Knowledge Discovery, vol. 6, pp. 259–275, 2002.
[19] G. Lugosi and N. Vayatis, “On the Bayes-risk consistency of regularized boosting methods,” Ann. Stat., vol. 32, pp. 30–55, 2004.
[20] P. Niyogi and F. Girosi, “On the relationships between generalization error, hypothesis complexity and sample complexity for radial basis functions,” Neural Comp., vol. 8, pp. 819–842, 1996.
[21] C. Scovel and I. Steinwart, “Fast rates for support vector machines,” in Proc. Conf. Learn. Theory (COLT-2005), pp. 279–294.
[22] S. Smale and Y. Yao, “Online learning algorithms,” Found. Comput. Math., vol. 6, pp. 145–170, 2006.
[23] S. Smale and D. X. Zhou, “Estimating the approximation error in learning theory,” Anal. Appl., vol. 1, pp. 17–41, 2003.
[24] ——, “Shannon sampling and function reconstruction from point values,” Bull. Amer. Math. Soc., vol. 41, pp. 279–305, 2004.
[25] ——, “Shannon sampling II: Connection to learning theory,” Appl. Comput. Harmonic Anal., vol. 19, pp. 285–302, 2005.
[26] ——, “Learning theory estimates via integral operators and their applications,” Constr. Approx., to be published.
[27] I. Steinwart, “Support vector machines are universally consistent,” J. Complex., vol. 18, pp. 768–791, 2002.
[28] A. B. Tsybakov, “Optimal aggregation of classifiers in statistical learning,” Ann. Stat., vol. 32, pp. 135–166, 2004.
[29] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[30] G. Wahba, Spline Models for Observational Data. Philadelphia, PA: SIAM, 1990.
[31] Q. Wu, Y. Ying, and D. X. Zhou, “Multi-kernel regularized classifiers,” J. Complex., to be published.
[32] Q. Wu and D. X. Zhou, “SVM soft margin classifiers: Linear programming versus quadratic programming,” Neural Comput., vol. 17, pp. 1160–1187, 2005.
[33] K. Yosida, Functional Analysis, 6th ed. New York: Springer-Verlag, 1980.
[34] Y. Ying and D. X. Zhou, “Learnability of Gaussians with flexible variances,” J. Machine Learn. Res., to be published.
[35] T. Zhang, “Statistical behavior and consistency of classification methods based on convex risk minimization,” Ann. Stat., vol. 32, pp. 56–85, 2004.
[36] ——, “Leave-one-out bounds for kernel methods,” Neural Comput., vol. 15, pp. 1397–1437, 2003.
[37] D. X. Zhou, “The covering number in learning theory,” J. Complex., vol. 18, pp. 739–767, 2002.
[38] ——, “Capacity of reproducing kernel spaces in learning theory,” IEEE Trans. Inf. Theory, vol. 49, pp. 1743–1752, 2003.