IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 11, NOVEMBER 2006

Online Regularized Classification Algorithms

Yiming Ying and Ding-Xuan Zhou

Abstract—This paper considers online classification learning algorithms based on regularization schemes in reproducing kernel Hilbert spaces associated with general convex loss functions. A novel capacity independent approach is presented. It verifies the strong convergence of the algorithm under a very weak assumption on the step sizes and yields satisfactory convergence rates for polynomially decaying step sizes. Explicit learning rates with respect to the misclassification error are given in terms of the choice of step sizes and the regularization parameter (depending on the sample size). Error bounds associated with the hinge loss, the least square loss, and the support vector machine q-norm loss are presented to illustrate our method.

Index Terms—Classification algorithm, error analysis, online learning, regularization, reproducing kernel Hilbert spaces.

I. INTRODUCTION

IN this paper, we study online classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and reproducing kernel Hilbert spaces.

Let $X$ be a compact metric space and $Y = \{1, -1\}$. A function $\mathcal{C}: X \to Y$ is called a (binary) classifier which divides the input space $X$ into two classes. A real valued function $f: X \to \mathbb{R}$ can be used to generate a classifier $\mathcal{C} = \operatorname{sgn}(f)$, where the sign function is defined as $\operatorname{sgn}(f)(x) = 1$ if $f(x) \ge 0$ and $\operatorname{sgn}(f)(x) = -1$ for $f(x) < 0$. For such a real valued function $f$, a loss function $\phi: \mathbb{R} \to \mathbb{R}_+$ is often used to measure the error: $\phi(y f(x))$ is the local error at the point $x \in X$ while $y \in Y$ is assigned to the event $x$.

Reproducing kernel Hilbert spaces are often used as hypothesis spaces in the design of classification algorithms. Let $K: X \times X \to \mathbb{R}$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, \ldots, x_\ell\} \subset X$, the matrix $\bigl(K(x_i, x_j)\bigr)_{i,j=1}^{\ell}$ is positive semidefinite. Such a function is called a Mercer kernel.

The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined [2] to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ given by $\langle K_x, K_t \rangle_K = K(x, t)$.

Manuscript received May 11, 2005; revised October 26, 2005. This work was supported by a grant from the Research Grants Council of Hong Kong [Project No. CityU 103704] and by a grant from City University of Hong Kong [Project No. 7001816].

Y. Ying was with the Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China. He is now with the Department of Computer Science, University College London, London, U.K., on leave from the Institute of Mathematics, Chinese Academy of Sciences, Beijing 100080, China (e-mail: [email protected]).

D.-X. Zhou is with the Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China (e-mail: [email protected]).

Communicated by P. L. Bartlett, Associate Editor for Pattern Recognition, Statistical Learning, and Inference.

Digital Object Identifier 10.1109/TIT.2006.883632

The reproducing property takes the form

$$f(x) = \langle f, K_x \rangle_K \qquad \text{for all } x \in X,\ f \in \mathcal{H}_K. \tag{1}$$
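As a small illustration (not from the paper), here is a minimal Python sketch, assuming the Gaussian kernel as a concrete example of a Mercer kernel: it builds the kernel matrix on a finite point set and checks positive semidefiniteness numerically. The function and variable names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, t, sigma=0.5):
    """Gaussian Mercer kernel K(x, t) = exp(-|x - t|^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - t) ** 2) / (2.0 * sigma ** 2))

# A finite set of distinct points in a toy input space X = [0, 1].
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(8, 1))

# Kernel matrix (K(x_i, x_j))_{i,j}; the Mercer property requires it to be PSD.
K = np.array([[gaussian_kernel(xi, xj) for xj in pts] for xi in pts])

eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())   # nonnegative up to round-off
```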

Classification algorithms considered here are induced by regularization schemes learned from samples. Assume that $\rho$ is a probability distribution on $Z := X \times Y$ and $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{T}$ is a set of random samples independently drawn according to $\rho$. The batch learning algorithm for classification is implemented by an off-line regularization scheme [29] in the RKHS $\mathcal{H}_K$ involving the sample $\mathbf{z}$ and the loss function $\phi$ as

$$f_{\mathbf{z},\lambda} := \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{T} \sum_{i=1}^{T} \phi\bigl(y_i f(x_i)\bigr) + \frac{\lambda}{2} \|f\|_K^2 \right\}. \tag{2}$$

The classifier is induced by the real valued function $f_{\mathbf{z},\lambda}$ as $\operatorname{sgn}(f_{\mathbf{z},\lambda})$.

The off-line algorithm induced by (2), a Tikhonov regularization scheme for learning [14], has been extensively studied in the literature. In particular, the error analysis is well understood due to many results; see, e.g., [27], [5], [35], [4], [7], [21], [31], [26]. The main idea of the analysis is to show that $f_{\mathbf{z},\lambda}$ has behaviors similar to the regularizing function $f_\lambda$ of scheme (2) defined by

$$f_\lambda := \arg\min_{f \in \mathcal{H}_K} \left\{ \mathcal{E}(f) + \frac{\lambda}{2}\|f\|_K^2 \right\}. \tag{3}$$

Here $\mathcal{E}(f)$ is the generalization error defined as

$$\mathcal{E}(f) := \int_Z \phi\bigl(y f(x)\bigr)\, d\rho.$$

This expectation of the similarity between $f_{\mathbf{z},\lambda}$ and $f_\lambda$ is motivated by the law of large numbers telling us that $\frac{1}{T}\sum_{i=1}^{T}\phi(y_i f(x_i)) \to \mathcal{E}(f)$ with confidence for any fixed function $f$. For a function set, such as the union of unit balls of reproducing kernel Hilbert spaces associated with a set of Mercer kernels, the theory of uniform convergence is involved. See, e.g., [29], [1], [34].

Though the off-line algorithm (2) performs well in theory and in many applications, it might be practically challenging when the sample size or the data is very large. For example, when we consider the hinge loss or the $q$-norm SVM loss corresponding to the support vector machines, the scheme (2) is a quadratic optimization problem. Its standard complexity is about $O(T^3)$. When $T$ is very large, the algorithm is hard to implement. Online algorithms with linear complexity $O(T)$ can be applied and provide efficient classifiers when the sample size is large. In this paper we investigate online classification algorithms generated by Tikhonov regularization schemes in reproducing kernel Hilbert spaces. Convergence to $f_\lambda$ in the RKHS norm as well as with respect to the excess misclassification error (to be defined) will be shown, and a rate analysis will be done. These online algorithms can be considered as a descent method; see the discussion in the Appendix.

Throughout the paper, we assume that the loss function has the following form.

Definition 1: We say that $\phi: \mathbb{R} \to \mathbb{R}_+$ is an admissible loss function if it is convex and differentiable at $0$ with $\phi'(0) < 0$.

The convexity of $\phi$ tells us that the left derivative $\phi'_-(v) = \lim_{\delta \to 0^-} \frac{\phi(v+\delta) - \phi(v)}{\delta}$ exists and equals $\sup_{\delta < 0} \frac{\phi(v+\delta) - \phi(v)}{\delta}$. It is the same as $\phi'(v)$ when it coincides with the right derivative $\phi'_+(v) = \lim_{\delta \to 0^+} \frac{\phi(v+\delta) - \phi(v)}{\delta}$.

This paper is aimed at the following online algorithm for classification given in [9], [22], and [17].

Definition 2: The Stochastic Gradient Descent Online Algorithm for classification is defined by $f_1 = 0$ and

$$f_{t+1} = f_t - \eta_t \bigl\{ \phi'_-\bigl(y_t f_t(x_t)\bigr)\, y_t K_{x_t} + \lambda f_t \bigr\} \qquad \text{for } t = 1, \ldots, T \tag{4}$$

where $\lambda > 0$ is the regularization parameter and $\eta_t > 0$ is called the step size. The classifier is given by the sign function $\operatorname{sgn}(f_{T+1})$.

We call the sequence $\{f_t\}_{t=1}^{T+1}$ the learning sequence for the online scheme (4) which will be used to learn the regularizing function $f_\lambda$.
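For concreteness only, the following Python sketch implements the update of scheme (4), assuming the iterate is stored through its kernel expansion; the helper names online_regularized, predict, kernel and loss_deriv are hypothetical, not from the paper.

```python
def online_regularized(samples, kernel, loss_deriv, lam, step_sizes):
    """Sketch of scheme (4):  f_1 = 0 and
       f_{t+1} = f_t - eta_t * ( phi'_-(y_t f_t(x_t)) * y_t * K_{x_t} + lam * f_t ).
    f_t is stored as a kernel expansion f_t = sum_i alphas[i] * K(xs[i], .)."""
    xs, alphas = [], []
    for (x, y), eta in zip(samples, step_sizes):
        fx = sum(a * kernel(xi, x) for a, xi in zip(alphas, xs))  # f_t(x_t)
        g = loss_deriv(y * fx)                                    # phi'_-(y_t f_t(x_t))
        alphas = [(1.0 - eta * lam) * a for a in alphas]          # the -eta*lam*f_t part
        xs.append(x)
        alphas.append(-eta * g * y)                               # the -eta*phi'_-*y_t*K_{x_t} part
    return xs, alphas

def predict(xs, alphas, kernel, x):
    """Evaluate the learned function f_{T+1} at a point x."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, xs))
```

Only the one-sided derivative of the loss enters the update, which is why the algorithm applies to non-differentiable losses such as the hinge loss.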

There is a vast literature on online learning. Let us mention some works relating to this paper. In [22], a stochastic gradient method in a Hilbert space $\mathcal{H}$ is considered. Let $SL(\mathcal{H})$ be the space of positive definite bounded linear operators on $\mathcal{H}$, and $A: Z \to SL(\mathcal{H})$ and $b: Z \to \mathcal{H}$ be two maps. To learn a stationary point $f^*$ satisfying $\mathbb{E}_z\bigl[A(z) f^* - b(z)\bigr] = 0$, they proposed the learning sequence

$$f_{t+1} = f_t - \eta_t \bigl( A(z_t) f_t - b(z_t) \bigr). \tag{5}$$

But the online scheme (4) involving the general loss function is nonlinear and is hard to write in the setting (5) except for the least square loss. When $\phi$ is the least square loss, we shall improve the estimates in [22] for online regression by presenting capacity independent learning rates (in Section II, Theorem 3), comparable to the best ones in [36], [26] for the off-line regularization setting (2). This is a byproduct of our analysis for online classification algorithms.

The cumulative loss for online algorithms more general than (4) has been well studied in the literature. See, for example, [15], [3], [10], [9], [16] and the references therein. In particular, cumulative loss bounds are derived for online density estimation in [3] and for online linear regression with the least square loss in [9]. In Section VI of [15], for a learning algorithm different from (4), the relative expected instantaneous loss, measuring the prediction ability in the linear regression problem, is analyzed in detail.

A general regularized online learning scheme (4) is introduced and analyzed in [17]. Assume the loss function $\phi$ is convex and uniformly Lipschitz continuous, the step size has a prescribed decaying form, and $\lambda$ is fixed. It was proved there that the average instantaneous risk converges toward the regularized generalization error $\mathcal{E}_\lambda(f_\lambda)$ with an explicit error bound.

In this paper we mainly analyze the error $\|f_{T+1} - f_\lambda\|_K$ in the $\mathcal{H}_K$ norm, which is different from estimating the cumulative loss bounds as done in many previous results (e.g., [3], [10], [17]). Such a strong approximation was considered for least square regression in [25].

II. MAIN RESULTS

The purpose of this paper is to give error bounds for $\|f_{T+1} - f_\lambda\|_K$ with $\lambda$ fixed, and then apply them to the analysis of the online classification algorithm (4): estimating the misclassification error, a quantity measuring the prediction ability, by suitable choices of the regularization parameter $\lambda = \lambda(T)$.

Our first main result states that the sequence $\{f_t\}$ defined by (4) converges in expectation to $f_\lambda$ in $\mathcal{H}_K$ as long as the sequence is uniformly bounded.

Theorem 1: Let $\lambda > 0$ and the sequence of positive step sizes satisfy

$$\lim_{t \to \infty} \eta_t = 0 \qquad \text{and} \qquad \sum_{t=1}^{\infty} \eta_t = \infty. \tag{6}$$

If the learning sequence $\{f_t\}$ given in (4) is uniformly bounded in $\mathcal{H}_K$, then

$$\mathbb{E}_{z_1, \ldots, z_T}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr] \to 0 \qquad \text{as } T \to \infty. \tag{7}$$

Theorem 1 will be proved in Section V. The assumption of uniform boundedness is mild and will be studied in Section III. In particular, we shall verify that $\{f_t\}$ is uniformly bounded when the step sizes satisfy $\eta_t \le c$ for any $t$ and for some constant $c$ depending on $\phi$, $\lambda$ and $K$.

Recall from (3) that $f_\lambda$ is the minimizer of the regularized generalization error

$$\mathcal{E}_\lambda(f) := \mathcal{E}(f) + \frac{\lambda}{2}\|f\|_K^2. \tag{8}$$

Our second main result is the following important relation between the excess regularized generalization error and the $\mathcal{H}_K$ metric, which plays an essential role in proving Theorem 1.

Theorem 2: Let $\phi$ be an admissible loss function and $\lambda > 0$. For any $f \in \mathcal{H}_K$, there holds

$$\frac{\lambda}{2}\|f - f_\lambda\|_K^2 \le \mathcal{E}_\lambda(f) - \mathcal{E}_\lambda(f_\lambda). \tag{9}$$

Theorem 2 will be proved in Section IV. It will be used to derive convergence rates of $\mathbb{E}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr]$ (more quantitative than (7)) when the step size decays polynomially in the form $\eta_t = \eta_1 t^{-\theta}$ with a constant $\theta > 0$ (see Theorem 5 in Section VI). These rates, together with the approximation error (called the regularization error below) between $f_\lambda$ and the minimizer of the generalization error, yield capacity independent learning rates of the misclassification error of the online algorithm (4). This is our last main result, illustrated in the following subsections.

A. Misclassification Error of Classification Algorithms

The prediction power of classification algorithms is often measured by the misclassification error, which is defined for a classifier $\mathcal{C}: X \to Y$ to be the probability of the event $\{\mathcal{C}(x) \ne y\}$:

$$\mathcal{R}(\mathcal{C}) := \operatorname{Prob}\{\mathcal{C}(x) \ne y\} = \int_X P\bigl(y \ne \mathcal{C}(x) \mid x\bigr)\, d\rho_X. \tag{10}$$

Here $\rho_X$ denotes the marginal distribution of $\rho$ on $X$, and $P(\cdot \mid x)$ the conditional probability measure at $x$. The best classifier minimizing the misclassification error is called the Bayes rule [13] and can be expressed as $f_c = \operatorname{sgn}(f_\rho)$ where $f_\rho$ is the regression function

$$f_\rho(x) := \int_Y y\, d\rho(y \mid x) = P(y = 1 \mid x) - P(y = -1 \mid x), \qquad x \in X. \tag{11}$$

Recall that for the online learning algorithm (4), we are interested in the classifier $\operatorname{sgn}(f_{T+1})$ produced by the real valued function $f_{T+1}$ from a sample $\mathbf{z} = \{(x_t, y_t)\}_{t=1}^{T}$. So the error analysis for the classification algorithm (4) is aimed at the excess misclassification error

$$\mathcal{R}\bigl(\operatorname{sgn}(f_{T+1})\bigr) - \mathcal{R}(f_c). \tag{12}$$
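As a side note, the misclassification error $\mathcal{R}(\operatorname{sgn}(f))$ itself can be estimated on held-out data by a simple Monte Carlo average; the excess error (12) additionally subtracts $\mathcal{R}(f_c)$, which requires knowledge of $\rho$. A minimal sketch, with illustrative names:

```python
import numpy as np

def empirical_misclassification(f, samples):
    """Monte Carlo estimate of R(sgn(f)) = Prob{ sgn(f)(x) != y } on held-out samples."""
    wrong = [(1 if f(x) >= 0 else -1) != y for x, y in samples]
    return float(np.mean(wrong))
```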

Let us present some examples, proved in Section VII, to illustrate the learning rates of (12) obtained from suitable choices of the regularization parameter $\lambda$ and the step size $\eta_t$.

The first example corresponds to the classical support vector machine (SVM) with $\phi$ being the hinge loss $\phi(v) = (1 - v)_+ = \max\{1 - v, 0\}$. For this loss, the online algorithm (4) can be expressed as $f_1 = 0$ and

$$f_{t+1} = \begin{cases} (1 - \eta_t \lambda) f_t + \eta_t y_t K_{x_t}, & \text{if } y_t f_t(x_t) \le 1,\\[2pt] (1 - \eta_t \lambda) f_t, & \text{if } y_t f_t(x_t) > 1. \end{cases} \tag{13}$$
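For illustration, plugging the left derivative of the hinge loss into the online_regularized sketch given after Definition 2 reproduces the update (13). The data, kernel width, regularization parameter and step sizes below are placeholders, not the choices prescribed by the corollaries.

```python
import numpy as np

def hinge_left_derivative(v):
    """Left derivative of the hinge loss phi(v) = max(1 - v, 0): -1 for v <= 1, else 0."""
    return -1.0 if v <= 1.0 else 0.0

# Toy run (reusing the hypothetical online_regularized/predict helpers from the earlier sketch).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=200)
Y = np.where(X + 0.1 * rng.standard_normal(200) >= 0.0, 1, -1)
samples = list(zip(X, Y))

kernel = lambda s, t: np.exp(-(s - t) ** 2 / 0.5)
lam = 0.05
etas = [0.5 / (t + 1) ** 0.6 for t in range(len(samples))]   # polynomially decaying steps

xs, alphas = online_regularized(samples, kernel, hinge_left_derivative, lam, etas)
acc = np.mean([(1 if predict(xs, alphas, kernel, x) >= 0 else -1) == y for x, y in samples])
print("training sign accuracy:", acc)
```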

Corollary 1: Let $\phi(v) = (1 - v)_+$ be the hinge loss. Assume that for some $0 < \beta \le 1$ and $c_\beta > 0$, the pair $(\rho, K)$ satisfies

$$\inf_{f \in \mathcal{H}_K} \Bigl\{ \mathcal{E}(f) - \mathcal{E}(f_c) + \frac{\lambda}{2}\|f\|_K^2 \Bigr\} \le c_\beta \lambda^\beta \qquad \text{for all } \lambda > 0. \tag{14}$$

For any $T \in \mathbb{N}$, choose $\lambda = \lambda(T)$ and the polynomially decaying step sizes $\eta_t = \eta_1 t^{-\theta}$, with exponents determined by $\beta$ as derived in Section VII. Then

(15)

In (15), the expectation is taken with respect to the random sample $\mathbf{z} = \{z_t\}_{t=1}^{T}$. We shall use this notation throughout the paper, if the random variable for the expectation is not specified.

The condition (14) concerns the approximation of the function $f_c$ in the space $L^1_{\rho_X}$ by functions from the RKHS $\mathcal{H}_K$. It can be characterized by requiring $f_c$ to lie in an interpolation space of the pair $(L^1_{\rho_X}, \mathcal{H}_K)$, an intermediate space between the metric space $L^1_{\rho_X}$ and the much smaller approximating space $\mathcal{H}_K$. For details, see the discussion in [7]. The assumption (14) is satisfied [34] when we use the Gaussian kernels with flexible variances and the distribution satisfies some geometric noise condition [21].

Assumptions like (14) are necessary to determine the regularization parameter for achieving the learning rate (15). This can be seen from the literature [21], [31], [35] on the off-line algorithm (2): learning rates are obtained by suitable choices of the regularization parameter $\lambda = \lambda(T)$, according to the behavior of the approximation error estimated from a priori conditions on the distribution $\rho$ and the space $\mathcal{H}_K$.

The second example is for the least square loss $\phi(v) = (1 - v)^2$. Here the approximation error [23], [25] can be studied by the integral operator $L_K$ on the space $L^2_{\rho_X}$ defined by

$$L_K f(x) := \int_X K(x, t) f(t)\, d\rho_X(t), \qquad x \in X.$$

Since $K$ is a Mercer kernel, the operator $L_K$ is symmetric, compact and positive. Therefore its power $L_K^r$ is well-defined for any $r > 0$.
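A minimal numerical sketch of this operator, assuming a Gaussian kernel and a Monte Carlo discretization of the integral (all names and parameter values are illustrative): the symmetric positive matrix below plays the role of $L_K$ on a finite sample, and its eigen-decomposition gives the fractional powers $L_K^r$.

```python
import numpy as np

# Discretize (L_K f)(x) = \int_X K(x, t) f(t) d rho_X(t) with x_1,...,x_m ~ rho_X,
# i.e. approximate L_K by the matrix (1/m) * (K(x_i, x_j))_{i,j}.
rng = np.random.default_rng(2)
m = 300
xs = rng.uniform(-1.0, 1.0, size=m)                      # sample from rho_X
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / 0.5)      # Gaussian Mercer kernel
LK = K / m                                               # discretized integral operator

# Symmetric, compact, positive: eigen-decompose and form a fractional power L_K^r.
evals, evecs = np.linalg.eigh((LK + LK.T) / 2.0)
evals = np.clip(evals, 0.0, None)
r = 0.5
LK_r = (evecs * evals ** r) @ evecs.T                    # L_K^r on the sample points
print("largest eigenvalue:", evals.max())
```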

Corollary 2: Let $\phi(v) = (1 - v)^2$ be the least square loss and $f_\rho$ be in the range of $L_K^r$ with some $r > 0$. For any $T \in \mathbb{N}$, take $\lambda = \lambda(T)$ and the step sizes $\eta_t = \eta_1 t^{-\theta}$ with exponents determined by $r$. Then

(16)

Let us analyze the assumption as in [25]. Denote $\{\mu_i\}$ as the positive eigenvalues of the positive, compact operator $L_K$ and $\{\psi_i\}$ the corresponding orthonormal eigenfunctions. Any function $f \in L^2_{\rho_X}$ can be expressed as $f = \sum_i c_i \psi_i$ with $c_i = \langle f, \psi_i \rangle_{L^2_{\rho_X}}$ satisfying $\sum_i c_i^2 < \infty$. The function $f$ lies in the range of $L_K^r$ for some $r > 0$ if and only if $\sum_i c_i^2 \mu_i^{-2r} < \infty$, that is, the sequence $\{c_i\}$ has such a nice decay that $c_i = \mu_i^r d_i$ for another square-summable sequence $\{d_i\}$. In particular [11], $f$ is in the range of $L_K^{1/2}$ if and only if $f \in \mathcal{H}_K$. So the smaller $r$ is, the less demanding the assumption in Corollary 2. When $r$ lies in a suitable range, this assumption can also be characterized [23] by the decay of the approximation error. For the least square loss, there holds $\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_{L^2_{\rho_X}}^2$. This leads to the condition on the regularization error discussed in the next subsection for the general loss.

B. Comparison Theorem and Regularization Error

Estimating the excess misclassification error $\mathcal{R}(\operatorname{sgn}(f_{T+1})) - \mathcal{R}(f_c)$ in (12) can often be done by bounding the excess generalization error [35], [4], [37]

$$\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\phi) \tag{17}$$

where $f_\phi$ is a minimizer of the generalization error

$$f_\phi := \arg\min\bigl\{ \mathcal{E}(f) : f \text{ is measurable on } X \bigr\}.$$


In particular, for the SVM 1-norm soft margin classifier with the hinge loss $\phi(v) = (1 - v)_+$, we have $f_\phi = f_c$ [30] and an important relation was given in [35] as

$$\mathcal{R}\bigl(\operatorname{sgn}(f)\bigr) - \mathcal{R}(f_c) \le \mathcal{E}(f) - \mathcal{E}(f_c). \tag{18}$$

Such a relation is called a comparison theorem. For the general loss function, a simple comparison theorem was established in [7] and [4].

Proposition A: Let $\phi$ be an admissible loss function such that $\phi''(0)$ exists and is positive. Then there is a constant $c_\phi$ such that for any measurable function $f$, there holds

$$\mathcal{R}\bigl(\operatorname{sgn}(f)\bigr) - \mathcal{R}(f_c) \le c_\phi \sqrt{\mathcal{E}(f) - \mathcal{E}(f_\phi)}. \tag{19}$$

If, moreover, for some $q \in [0, \infty)$ and $c_q > 0$, $\rho$ satisfies a Tsybakov noise condition: for any measurable function $f$

$$\rho_X\bigl(\{x \in X : \operatorname{sgn}(f)(x) \ne f_c(x)\}\bigr) \le c_q \bigl(\mathcal{R}(\operatorname{sgn}(f)) - \mathcal{R}(f_c)\bigr)^{q/(q+1)} \tag{20}$$

then (19) can be improved as

$$\mathcal{R}\bigl(\operatorname{sgn}(f)\bigr) - \mathcal{R}(f_c) \le \widetilde{c}_\phi \bigl(\mathcal{E}(f) - \mathcal{E}(f_\phi)\bigr)^{(q+1)/(q+2)}. \tag{21}$$

The Tsybakov noise condition (20) was introduced in [28] where the reader can find more details and explanation. The greater $q$ is, the smaller the noise of $\rho$. In particular, any distribution satisfies (20) with $q = 0$ and $c_q = 1$.

With a comparison theorem, it is sufficient for us to estimate the excess generalization error (17). In order to do so, we need the regularization error [24] between $f_\lambda$ and $f_\phi$.

Definition 3: The regularization error for (2) associated with the triple $(K, \phi, \rho)$ is

$$\mathcal{D}(\lambda) := \inf_{f \in \mathcal{H}_K} \Bigl\{ \mathcal{E}(f) - \mathcal{E}(f_\phi) + \frac{\lambda}{2}\|f\|_K^2 \Bigr\}.$$

Now we have the error decomposition [32] for (17) as

$$\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\phi) \le \bigl\{ \mathcal{E}_\lambda(f_{T+1}) - \mathcal{E}_\lambda(f_\lambda) \bigr\} + \mathcal{D}(\lambda). \tag{22}$$

The regularization error term $\mathcal{D}(\lambda)$ in the error decomposition (22) is independent of the sample $\mathbf{z}$. It can be estimated by $K$-functionals from the rich knowledge of approximation theory. For more details, see the discussions in [23], [7], [31].

The first term $\mathcal{E}_\lambda(f_{T+1}) - \mathcal{E}_\lambda(f_\lambda)$ in (22) is called the sample error, which may be bounded by the error $\|f_{T+1} - f_\lambda\|_K$. To show the idea, we mention a rough approach here. More refined estimates are possible for specific loss functions, as shown in Section VII. Observe from the convexity of $\phi$ that $\phi'_-$ and $\phi'_+$ are both nondecreasing, and $\phi'_-(v) \le \phi'_+(v)$ for every $v$.

Denote $C(X)$ as the space of continuous functions on $X$ with the norm $\|\cdot\|_\infty$. Then the reproducing property (1) tells us that

$$\|f\|_\infty \le \kappa \|f\|_K \quad \text{for all } f \in \mathcal{H}_K, \qquad \text{where } \kappa := \sup_{x \in X} \sqrt{K(x, x)}. \tag{23}$$

Corollary 3: Let $\phi$ be an admissible loss function and $\lambda > 0$. For any $f \in \mathcal{H}_K$, there holds

$$\mathcal{E}_\lambda(f) - \mathcal{E}_\lambda(f_\lambda) \le \Bigl( \kappa \max\bigl\{ |\phi'_-(v)|, |\phi'_+(v)| : v \in J \bigr\} + \frac{\lambda}{2}\bigl(\|f\|_K + \|f_\lambda\|_K\bigr) \Bigr)\, \|f - f_\lambda\|_K$$

where $J$ is the interval $[-M, M]$ with $M = \kappa \max\{\|f\|_K, \|f_\lambda\|_K\}$.

Proof: Since $\mathcal{E}_\lambda(f_\lambda) \le \mathcal{E}_\lambda(0) = \phi(0)$, the definition of $f_\lambda$ tells us that

$$\|f_\lambda\|_K \le \sqrt{2\phi(0)/\lambda}. \tag{24}$$

Note the elementary inequality

$$|\phi(a) - \phi(b)| \le \max\bigl\{ |\phi'_-(v)|, |\phi'_+(v)| : v \in I \bigr\}\, |a - b|$$

where $I$ is an interval containing $a$ and $b$. Applying this inequality with $a = y f(x)$ and $b = y f_\lambda(x)$, we know that for any $(x, y) \in Z$, $|\phi(y f(x)) - \phi(y f_\lambda(x))|$ can be bounded by $\max\{|\phi'_-(v)|, |\phi'_+(v)| : v \in I\}\, |f(x) - f_\lambda(x)|$, where $I$ is an interval containing $y f(x)$ and $y f_\lambda(x)$. But (23) implies $\|f\|_\infty \le \kappa\|f\|_K$ and $\|f_\lambda\|_\infty \le \kappa\|f_\lambda\|_K$. In connection with the fact $\frac{\lambda}{2}\bigl(\|f\|_K^2 - \|f_\lambda\|_K^2\bigr) = \frac{\lambda}{2}\bigl(\|f\|_K + \|f_\lambda\|_K\bigr)\bigl(\|f\|_K - \|f_\lambda\|_K\bigr)$ and the inequality $\|f\|_K - \|f_\lambda\|_K \le \|f - f_\lambda\|_K$, we verify the desired bound.

Let us show by the example of the SVM $q$-norm loss the learning rates of the excess misclassification error for the online algorithm (4), derived from the regularization error, the choice of the regularization parameter and the step size.

Corollary 4: Let $\phi(v) = (1 - v)_+^q$ with $1 < q < \infty$. Assume that for some $0 < \beta \le 1$, there holds $\mathcal{D}(\lambda) = O(\lambda^\beta)$. For any $T \in \mathbb{N}$, take $\lambda = \lambda(T)$ and the step sizes $\eta_t = \eta_1 t^{-\theta}$, with exponents depending on $\beta$ and $q$. Then the excess misclassification error decays polynomially in $T$, with an exponent depending on $\beta$ and $q$ as derived in Section VII. If, moreover, the distribution satisfies (20), then the rate can be improved accordingly.

C. A Byproduct for Regression and Rate Comparison

Our method for the classification algorithm can provide not only estimates for the excess misclassification error, but also estimates for the strong approximation of the regression function $f_\rho$. Let us show this by the least square loss. We assume that $f_\rho$ lies in the range of $L_K^r$ for some $r > 0$.


Theorem 3: Let $\phi$ be the least square loss and $f_\rho$ be in the range of $L_K^r$ for some $r > 0$. For any $T \in \mathbb{N}$, take $\lambda = \lambda(T)$ and the step sizes $\eta_t = \eta_1 t^{-\theta}$, with exponents determined by $r$. Then

(25)

Let us compare our learning rates for the least square loss with the existing results.

First, we will show that our learning rate in the online setting is comparable to that in the off-line setting under the same assumption on the approximation error. In [5], [36], a leave-one-out technique was used to derive the expected value of the off-line regularization algorithm (2). The results in [35], [36] can be expressed as

In terms of the regularization error, it can be restated as

(26)

We know from Lemma 3 in [26] that if $f_\rho$ is in the range of $L_K^r$ for some $r > 0$ then

Putting this into (26) and trading off the two terms, we know that a suitable choice of $\lambda = \lambda(T)$ gives the optimal rate

(27)

If $f_\rho$ is in the range of $L_K^r$ with $r$ in a suitable range and under an additional condition with some constant, the error for the off-line regularization scheme (2) is given in Theorem 2 of [26] as

(28)

Hence our online learning rate (25) is almost the same as the corresponding rate (28) in the off-line setting, while the rate (16) is suboptimal compared with (27).

Next, we will compare our learning rate (16) with the one in [22] under the same assumption that $f_\rho$ is in the range of $L_K^r$ for some $r > 0$. The result given in Theorem A and Remark 2.1 of [22] is

(29)

where the involved quantities are constants. Selecting the parameters suitably, there is another constant such that (29) can be rewritten accordingly. Since $f_\rho$ is in the range of $L_K^r$ for some $r > 0$, by [26] we have a corresponding bound for the regularization error. Consequently, a suitable choice of $\lambda$ gives the optimal rate with a constant, and setting the remaining parameters gives the optimal rate

(30)

Our learning rate (16) is better, since its exponent of $T$ is larger than that in (30).

III. BOUNDING THE LEARNING SEQUENCE

In this section, we show how a local Lipschitz condition on the loss function at the origin and some restrictions on the step size ensure the uniform boundedness of the learning sequence $\{f_t\}$, a crucial assumption for the convergence of the online scheme (4) stated in Theorem 1. Hence the uniform boundedness holds for all loss functions encountered in specific classification algorithms.

Definition 4: We say that $\phi$ is locally Lipschitz at the origin if the local Lipschitz constant

$$C_R := \sup_{0 < |v| \le R} \frac{|\phi'_-(v) - \phi'_-(0)|}{|v|} \tag{31}$$

is finite for any $R > 0$.

The above local Lipschitz condition is equivalent to the existence, for each $R > 0$, of some $c_R > 0$ such that $|\phi'_-(v) - \phi'_-(0)| \le c_R |v|$ for every $|v| \le R$. In fact, the latter requirement implies the equation shown at the bottom of the page. Thus, when $\phi$ is twice continuously differentiable on $\mathbb{R}$, $\phi$ is locally Lipschitz at the origin with $C_R \le \max_{|v| \le R} |\phi''(v)|$. Examples of loss functions will be discussed after the following theorem on the boundedness of the sequence $\{f_t\}$.

Theorem 4: Assume that $\phi$ is locally Lipschitz at the origin. Define $\{f_t\}$ by (4). If the step size $\eta_t$ satisfies an explicit restriction (in terms of $\lambda$, $\kappa$ and the local Lipschitz constant) for each $t$, then the learning sequence is uniformly bounded:

(32)


Proof: We prove the bound by induction. It is trivial that $f_1 = 0$ satisfies the bound (32). Suppose that this bound holds true for $f_t$. Consider $f_{t+1}$. It can be written as

Write the middle term as

(33)

Here, by means of the reproducing property (1) for $\mathcal{H}_K$, we have denoted a self-adjoint, rank-one, positive linear operator given by

The operator norm can be bounded since $\|K_x\|_K = \sqrt{K(x, x)} \le \kappa$ for any $x \in X$.

The local Lipschitz condition tells us that the difference quotient of $\phi'_-$ is well defined (set as zero when the argument vanishes). It is bounded by the local Lipschitz constant, since the induction hypothesis bounds $y_t f_t(x_t)$. Since $\phi'_-$ is nondecreasing, the quotient is nonnegative for any argument. Thus

Therefore, the resulting operator is a self-adjoint, positive linear operator on $\mathcal{H}_K$. Its norm is bounded accordingly. When the step-size restriction holds, the linear operator

on $\mathcal{H}_K$ is self-adjoint, positive, and of norm at most one. It follows that

since the operator is a contraction. This, in connection with the induction hypothesis on $\|f_t\|_K$, implies that

This completes the induction procedure and proves the theorem.

The following are some commonly used examples of loss functions (e.g., [4], [18], [19], [12], [35]).

Corollary 5: Let $\lambda > 0$ and $\{f_t\}$ be defined by (4). Then the learning sequence is uniformly bounded in $\mathcal{H}_K$ if each step size $\eta_t$ satisfies

1) for the least square loss;
2) for the $q$-norm SVM loss with $q = 1$;
3) for the $q$-norm SVM loss with $1 < q < \infty$;
4) for the exponential loss.

Proof: Our conclusion follows from Theorem 4 and the expressions for the local Lipschitz constant derived separately for each loss.

1) Note that the least square loss $\phi(v) = (1 - v)^2$ is twice continuously differentiable and $\phi''(v) \equiv 2$. Then the local Lipschitz constant equals $2$.

2) When $q = 1$, the $q$-norm SVM loss is the hinge loss $\phi(v) = (1 - v)_+$ with $\phi'_-(0) = -1$. We have $C_R \le 1$ for all $R > 0$ since

$$\frac{\phi'_-(v) - \phi'_-(0)}{v} = \begin{cases} 0, & \text{if } v < 0,\\ 0, & \text{if } 0 < v \le 1,\\ 1/v \le 1, & \text{if } v > 1. \end{cases}$$

3) When $1 < q < \infty$, we have $\phi'(v) = -q(1 - v)_+^{q-1}$ and $\phi'(0) = -q$. So we find that the local Lipschitz constant is finite for every $R > 0$.

4) For the exponential loss $\phi(v) = e^{-v}$, we have $\phi'(v) = -e^{-v}$ and $\phi'(0) = -1$. Then we have $C_R \le e^R$.
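For reference, here is a minimal sketch collecting the one-sided derivatives of the losses named in Corollary 5, as they would be fed into the update (4); the function names and the default value of $q$ are illustrative, and the precise step-size restrictions of the corollary are not encoded here.

```python
import numpy as np

def least_square_deriv(v):          # phi(v) = (1 - v)^2
    return 2.0 * (v - 1.0)

def q_norm_svm_deriv(v, q=2.0):     # phi(v) = max(1 - v, 0)^q, q >= 1
    if v < 1.0:
        return -q * (1.0 - v) ** (q - 1.0)
    # q = 1 is the hinge loss; its left derivative at v = 1 is -1, otherwise 0.
    return -1.0 if (q == 1.0 and v == 1.0) else 0.0

def exponential_deriv(v):           # phi(v) = exp(-v)
    return -np.exp(-v)
```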

IV. EXCESS GENERALIZATION ERROR AND RKHS METRIC

In this section, we prove Theorem 2, a relation between the excess regularized generalization error $\mathcal{E}_\lambda(f) - \mathcal{E}_\lambda(f_\lambda)$ and the RKHS distance $\|f - f_\lambda\|_K$. This relation is very important for the proof of the general convergence result, Theorem 1, as well as for the error analysis done in the next section.

Lemma 1: Assume $\phi$ is differentiable. Then $f_\lambda$ satisfies

$$\lambda \langle f_\lambda, g \rangle_K + \int_Z \phi'\bigl(y f_\lambda(x)\bigr)\, y\, g(x)\, d\rho = 0 \qquad \text{for any } g \in \mathcal{H}_K. \tag{34}$$

Proof: Since $f_\lambda$ is a minimizer of the regularized generalization error $\mathcal{E}_\lambda$ defined by (8), taking $f = f_\lambda + \epsilon g$ yields $\mathcal{E}_\lambda(f_\lambda + \epsilon g) \ge \mathcal{E}_\lambda(f_\lambda)$. Hence, for any $g \in \mathcal{H}_K$ and $\epsilon > 0$, we know that the quotient $\{\mathcal{E}_\lambda(f_\lambda + \epsilon g) - \mathcal{E}_\lambda(f_\lambda)\}/\epsilon$ is nonnegative and equals

Letting $\epsilon \to 0$, by the Lebesgue Dominated Convergence Theorem, we see that the limit of the quotient is nonnegative. But the same argument applies with $g$ replaced by $-g$. It follows that the limit vanishes. This is true for every $g \in \mathcal{H}_K$, which implies

$$\int_Z \phi'\bigl(y f_\lambda(x)\bigr)\, y\, K_x\, d\rho + \lambda f_\lambda = 0. \tag{35}$$

Taking inner products with $g$ in (35) proves the lemma.

We first prove Theorem 2 for differentiable loss functions.

Lemma 2: Let $\lambda > 0$ and $\phi$ be a differentiable convex loss function. Then for any $f \in \mathcal{H}_K$ there holds

$$\frac{\lambda}{2}\|f - f_\lambda\|_K^2 \le \mathcal{E}_\lambda(f) - \mathcal{E}_\lambda(f_\lambda).$$

Proof: Let $f \in \mathcal{H}_K$. Define a univariate function $F$ on $[0, 1]$ by

$$F(s) := \mathcal{E}_\lambda\bigl(f_\lambda + s(f - f_\lambda)\bigr).$$

We have

(36)

Since $\phi$ is differentiable, as a function of $s$, $F$ is differentiable. In fact, if we denote $g := f - f_\lambda$, then $F'(s)$ equals

The Lebesgue Dominated Convergence Theorem ensures that

This in connection with Lemma 1 tells us that $F'(s)$ equals

(37)

Since $\phi$ is convex in its argument, it satisfies

Using this for the relevant arguments, we see from (37) that for $s \in (0, 1]$

Therefore

This proves the desired result.

The intermediate step (37) in the above proof yields an interesting result for the least square loss which may be useful in analyzing the Tikhonov regularization scheme (2).

Corollary 6: Let $\phi(v) = (1 - v)^2$ be the least square loss. For any $f \in \mathcal{H}_K$, there holds

$$\mathcal{E}_\lambda(f) - \mathcal{E}_\lambda(f_\lambda) = \|f - f_\lambda\|_{L^2_{\rho_X}}^2 + \frac{\lambda}{2}\|f - f_\lambda\|_K^2.$$

Proof: It follows directly from (37) since the least square loss satisfies $\phi(a) - \phi(b) - \phi'(b)(a - b) = (a - b)^2$, which implies the stated identity.

If $\phi$ is not differentiable, like the hinge loss, we approximate it by $\phi_\epsilon$ which is convex, differentiable and defined for $\epsilon > 0$ as

The approximation is valid: $\phi_\epsilon$ converges to $\phi$ uniformly as $\epsilon \to 0$. Hence for any $f$, there holds

$$\mathcal{E}_\lambda^\epsilon(f) \to \mathcal{E}_\lambda(f) \qquad \text{as } \epsilon \to 0 \tag{38}$$

where $\mathcal{E}_\lambda^\epsilon$ denotes the regularized generalization error associated with $\phi_\epsilon$.

Now we can prove Theorem 2 for a general loss function.

Proof of Theorem 2: We define, for any $\epsilon > 0$,

and

(39)

For the approximating loss $\phi_\epsilon$, we have used the conventional notation for the associated generalization error and its minimizer $f_\lambda^\epsilon$ in $\mathcal{H}_K$.

Since $f_\lambda^\epsilon$ is the minimizer of (39), by taking $f = 0$ we get

which implies

for any $\epsilon > 0$. (40)

Since any closed ball of the Hilbert space $\mathcal{H}_K$ is weakly compact, the estimate (40) tells us that there exists a sequence $\epsilon_j \to 0$ such that $\{f_\lambda^{\epsilon_j}\}$ is bounded and converges to some $f^* \in \mathcal{H}_K$ weakly. That is,

(41)

In particular

and

(42)


Let $g = K_x$ in (41). Then the reproducing property (1) yields

(43)

This in connection with the continuity of $\phi$ and the Lebesgue Dominated Convergence Theorem gives the convergence of the corresponding generalization errors. The uniform bound (40), in connection with the uniform convergence (38) of $\phi_\epsilon$ to $\phi$, ensures that

(44)

Therefore, by (42), we have

Taking $f = f^*$ in (39), we know that

which means

It tells us that $f^*$ is also a minimizer of $\mathcal{E}_\lambda$. The strict convexity of the functional $\mathcal{E}_\lambda$ on $\mathcal{H}_K$ verifies the uniqueness of the minimizer, which leads to $f^* = f_\lambda$. That is, (41) and (42) hold with $f^*$ replaced by $f_\lambda$.

Apply Lemma 2 to the modified loss function $\phi_\epsilon$. We have

(45)

Applying (41), we know that

which can be bounded accordingly. Hence

This in connection with (45), (42), and (44) implies that $\frac{\lambda}{2}\|f - f_\lambda\|_K^2$ is bounded by

Letting $\epsilon \to 0$, the conclusion of Theorem 2 is proved.

V. GENERAL CONVERGENCE RESULTS

In this section we prove our first main result, Theorem 1. The essential estimate in the proof will also be used in the next section to give convergence rates. Note that the uniform boundedness of $\{f_t\}$ in $\mathcal{H}_K$ implies, by (23), a uniform bound on $\{\|f_t\|_\infty\}$. For simplicity, we introduce some shorthand notation for the quantities appearing below.

Lemma 3: Assume that for some constants, there holds

(46)

for any $t$. Then for $t = 1, \ldots, T$,

(47)

which can be further controlled by

(48)

Proof: Recall that $f_{t+1} = f_t - \eta_t g_t$ where $g_t = \phi'_-\bigl(y_t f_t(x_t)\bigr) y_t K_{x_t} + \lambda f_t$. Then

(49)

By the reproducing property (1), part of the last term of (49) equals

(50)

Since $\phi$ is a convex function on $\mathbb{R}$, we know that

Applying this relation to the relevant arguments together with (50) yields

(51)

The Schwarz inequality implies


Putting this and (51) into the middle term of (49), we know that it is bounded by

Since $f_t$ depends on $z_1, \ldots, z_{t-1}$ but not on $z_t$, it follows that the expectation can be bounded by

(52)

This in connection with (46) and (49) gives

(53)

By Theorem 2, this implies that $\mathbb{E}\bigl[\|f_{t+1} - f_\lambda\|_K^2\bigr]$ is bounded by

(54)

Applying this relation iteratively for $t = 1, \ldots, T$, we see that $\mathbb{E}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr]$ is bounded by

This proves the first statement. The second statement follows from the inequality $1 - x \le e^{-x}$ for any $x \ge 0$.

We are in a position to prove Theorem 1 stated in the Introduction. For this purpose, we use (47), while the bound (48) will be used to derive explicit learning rates in the next section.

Proof of Theorem 1: By (6), there exists an integer $t_0$ such that $\lambda \eta_t \le 1$ for all $t \ge t_0$. Since $\{f_t\}$ is uniformly bounded in $\mathcal{H}_K$, (46) is true for some constant. Applying Lemma 3, it is sufficient to estimate the right side of (47). According to the assumption (6) on the step size, we have

as $T \to \infty$. So for any $\epsilon > 0$ there exists some $T_1$ such that the second term of (47) is bounded by $\epsilon$ whenever $T \ge T_1$.

To deal with the first term, we use the assumption $\eta_t \to 0$ and know that there exists some $t_1$ such that

for every $t \ge t_1$. Write the first term as

(55)

Since $t_1$ is fixed, we can find some $T_2$ such that for each $T \ge T_2$, there holds

It follows that for each $t \le t_1$, there holds

This in connection with the bound for each term tells us that the first term of (55) is bounded as

The second term of (55) is dominated by

But

Then

Therefore, when $T$ is sufficiently large, by Lemma 3, the error $\mathbb{E}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr]$ is bounded by a constant multiple of $\epsilon$. This proves Theorem 1.

VI. CONVERGENCE RATES

Now we can derive convergence rates for the error $\mathbb{E}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr]$. The step size here is often of the form $\eta_t = \eta_1 t^{-\theta}$ for some $\eta_1 > 0$ and $0 < \theta \le 1$. So to apply Lemma 3 for getting error bounds, we need to estimate the summations in (48), which leads to the following lemmas.

Lemma 4: For any $0 < \theta \le 1$ and $T \in \mathbb{N}$, there holds

(56)

The proof follows from the simple inequality comparing the sum $\sum_{t=1}^{T} t^{-\theta}$ with the corresponding integral.

The next lemma in a modified form was given in [22] for a special case.

Lemma 5: Let $0 < \theta \le 1$ and $T \in \mathbb{N}$. Then the weighted sum appearing in (48) is bounded by the quantities stated in the two cases below, according to the value of $\theta$.


Proof: Denote the sum to be bounded. For the first case of $\theta$, we apply (56) in Lemma 4 and see that the sum is bounded by

For the second case, we have

Then the sum is bounded by

(57)

Decompose the above integral into two parts. We have

(58)

When the index lies in the first part, we have

Also,

Hence

(59)

Combining (57), (58), and (59), we get

For the remaining range of $\theta$, the inequality (56) implies the sum is bounded accordingly, which proves the lemma.

We are in a position to state the convergence rates for the online algorithm (4). To this end, we need the following constant depending on the step-size parameters:

(60)

Theorem 5: Assume that $\phi$ is locally Lipschitz at the origin. Choose the step sizes as $\eta_t = \eta_1 t^{-\theta}$ for some $\eta_1 > 0$ and $0 < \theta \le 1$. Define $\{f_t\}$ by (4) and the constant by (60).

1) For the first range of $\theta$,

(61)

2) For the remaining range of $\theta$, $\mathbb{E}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr]$ can be bounded by

(62)

Proof: The condition on the step size tells us that the restriction required in Theorem 4 holds for each $t$. By Theorem 4, this yields a uniform bound on $\|f_t\|_K$, and hence on $\|f_t\|_\infty$, for each $t$. Consequently, both $y_t f_t(x_t)$ and $0$ lie in a fixed bounded interval. It follows that the one-sided derivatives of $\phi$ are uniformly bounded there. Then (46) holds with explicit constants. This in connection with (48) of Lemma 3 tells us that

where

Since $\mathcal{E}_\lambda(f_\lambda) \le \phi(0)$, (24) gives $\|f_\lambda\|_K \le \sqrt{2\phi(0)/\lambda}$. By Lemma 4, we know that

if

if

By Lemma 5, the remaining sum can be bounded by

if

if

This proves Theorem 5.

To apply Theorem 5 for deriving rates for the misclassification error, we need the constants appearing in (46). They depend on the loss function and play an essential role in getting learning rates. When $\phi$ satisfies the following increment condition:

(63)

we can find these constants explicitly and then derive learning rates for the total error from Theorem 5. Denote a constant depending only on $\phi$, $\lambda$ and $\kappa$ and satisfying

(64)

Corollary 7: Assume that $\phi$ satisfies (63) and is locally Lipschitz at the origin. Let $0 < \theta \le 1$. Choose the step sizes as $\eta_t = \eta_1 t^{-\theta}$ with $\eta_1$ given by (64). Define $\{f_t\}$ by (4). Then

(65)

Proof: The increment condition (63) for $\phi$ tells us that

which implies

Also, the local Lipschitz constant can be bounded as

Choose the constants accordingly. Then

and our conclusion follows from Theorem 5.
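To make the role of the parameter choices concrete, the following Python sketch is a driver in the spirit of this section: it sets $\lambda = \lambda(T)$ and polynomially decaying steps $\eta_t = \eta_1 t^{-\theta}$, runs the online scheme, and reports the held-out misclassification error. It reuses the hypothetical online_regularized, predict and empirical_misclassification helpers from the earlier sketches, and the exponents are placeholders, not the rates prescribed by Corollaries 1 and 7.

```python
def run_online_classifier(train, test, kernel, loss_deriv,
                          theta=0.6, gamma=0.3, eta1=0.5):
    """Illustrative driver: lambda = T^{-gamma}, eta_t = eta1 * t^{-theta},
    then run scheme (4) and estimate the misclassification error on test data."""
    T = len(train)
    lam = T ** (-gamma)
    etas = [eta1 * (t + 1) ** (-theta) for t in range(T)]
    xs, alphas = online_regularized(train, kernel, loss_deriv, lam, etas)
    f = lambda x: predict(xs, alphas, kernel, x)
    return empirical_misclassification(f, test)
```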

VII. TOTAL ERROR BOUNDS AND LEARNING RATES

Applying the above-mentioned techniques, we can derive the learning rates of the excess misclassification error for the online algorithm (4) from our analysis on $\mathbb{E}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr]$ together with the regularization error $\mathcal{D}(\lambda)$.

A. Rates for the Online Algorithm With the Hinge Loss

First, we prove the learning rates for the online algorithm (13) with the hinge loss.

Proof of Corollary 1: Consider the hinge loss $\phi(v) = (1 - v)_+$. Recall the relation (18) between the excess misclassification error and the excess generalization error. Then

(66)

Using the uniform Lipschitz continuity of the hinge loss, we know that $\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\lambda) \le \kappa \|f_{T+1} - f_\lambda\|_K$. Combined with the assumption (14) on the regularization error for some $\beta$, it follows from (66) that

(67)

Now we apply Corollary 7. It is easy to see that $\phi$ satisfies (63). For any $T \in \mathbb{N}$, choose the parameters in Corollary 7 accordingly. We know that $\mathbb{E}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr]$ is bounded by

Select $\theta$ suitably. Since the exponential term decays faster than any polynomial of $T$, we know that there exists a constant depending only on the parameters such that

Putting this back into (67), we have

Now take $\lambda = \lambda(T)$ as stated. We know that the following holds

This proves our conclusion.

B. Rates for the Online Algorithm With the Least Square Loss

Turn to the least square loss. It has the special feature that $\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_{L^2_{\rho_X}}^2$. This enables us to improve the learning rates.

Proof of Corollary 2: For the least square loss, we use the relation (see [13])

$$\mathcal{R}\bigl(\operatorname{sgn}(f)\bigr) - \mathcal{R}(f_c) \le \|f - f_\rho\|_{L^2_{\rho_X}}.$$

Then the excess misclassification error can be bounded by

(68)

We know from Lemma 3 in [26] that if $f_\rho$ is in the range of $L_K^r$ for some $r > 0$ then the second term of (68) can be bounded as

(69)

The first term can be estimated by a refined bound for $\mathbb{E}\bigl[\|f_{T+1} - f_\lambda\|_K^2\bigr]$. Namely, if the step sizes are chosen as in the corollary, then for each $T$, there holds

(70)

Let us first prove (70). The choice of $\eta_t$ and Theorem 4 tell us that the learning sequence is uniformly bounded as


The special feature of the least square loss yields

(71)

Combining with the estimate above, we know (71) is bounded by

(72)

The same procedure as (53) in the proof of Lemma 3 gives

This, combined with (71) and (72), implies

From this bound, Theorem 2 and the restriction on the step size, we know that $\mathbb{E}\bigl[\|f_{t+1} - f_\lambda\|_K^2\bigr]$ can be bounded by

Applying this relation iteratively for $t = 1, \ldots, T$, we see that

By Lemma 5 with $\theta$ as chosen, we know that

which verifies (70).

Choose $\lambda = \lambda(T)$ as stated. For some constant, we have

Thus, the first term of (68) is bounded by

Combining with (68) and (69), for a constant we have

Choose the exponents as stated in Corollary 2. We obtain the desired bound

The proof of the corollary is completed.

Proof of Theorem 3: If the regression function $f_\rho$ is in the range of $L_K^r$ for $r > 0$, then by Lemma 3 in [26] we know that

Replacing the relevant quantity accordingly and arguing as in the proof of Corollary 2, we obtain the desired result.

C. Rates With the $q$-Norm SVM Loss

We shall use the relation (19) to derive the learning rate. To estimate the excess generalization error using the error in the $\mathcal{H}_K$ metric, we need the following lemma.

Lemma 6: For the $q$-norm SVM loss $\phi(v) = (1 - v)_+^q$ with $1 < q < \infty$, $\lambda > 0$ and $f \in \mathcal{H}_K$, there holds

(73)

Proof: Note the inequality [7]

It follows by taking the relevant arguments that the left-hand side of (73) can be bounded by

where the bound (24) for $\|f_\lambda\|_K$ is used. This completes the proof of the lemma.

Proof of Corollary 4: We apply Corollary 7. Observe that $\phi(v) = (1 - v)_+^q$ satisfies (63). Choose $\lambda = \lambda(T)$ and $\eta_t = \eta_1 t^{-\theta}$ as stated. It follows from (65) that, for a constant depending only on the loss and kernel parameters,

This in connection with (73) and the assumption on the regularization error implies that $\mathbb{E}\bigl[\mathcal{E}_\lambda(f_{T+1}) - \mathcal{E}_\lambda(f_\lambda)\bigr]$ can be bounded by

where the constant depends on $q$, $\kappa$ and $\beta$. Thus, with another constant, $\mathbb{E}\bigl[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\lambda)\bigr]$ can be bounded by

Combining with the error decomposition (22) and the regularization error decay, this implies that $\mathbb{E}\bigl[\mathcal{E}(f_{T+1}) - \mathcal{E}(f_\phi)\bigr]$ is bounded by

Choose the parameters to balance the two terms; then the desired result follows by the comparison relations (19) and (21).

VIII. CONCLUSION AND QUESTIONS

In this paper we have investigated the strong convergence of the online regularized classification algorithm involving general loss functions and reproducing kernel Hilbert spaces. We verified the convergence under rather weak assumptions on the step sizes and loss functions. A novel relation between the error in the RKHS norm and the excess regularized generalization error played an important role. As done for the off-line setting in the literature, by suitable choices of the regularization parameter $\lambda = \lambda(T)$ according to a priori conditions on the approximation error, we presented explicit capacity independent learning rates of the excess misclassification error for commonly used loss functions such as the hinge loss, the least square loss, and the SVM $q$-norm loss. In particular, we have shown (remarks following Theorem 3) that our learning rates with the least square loss are comparable to the ones in the off-line setting.

Let us mention some questions for further study.

1) In this paper the regularization parameter $\lambda = \lambda(T)$ depends only on the sample size $T$, which means that (4) studied here is not a fully online algorithm. How to analyze the algorithm (4) with $\lambda = \lambda_t$ changing with the steps $t$ remains open.

2) It would be interesting to analyze the online algorithm in the case $\lambda = 0$. This is closely related to the perceptron algorithms (see [29], [10]).

3) Here we assume the data $\mathbf{z} = \{z_t\}_{t=1}^{T}$ is i.i.d. according to an unknown distribution $\rho$. In many applications the data is not independent or identically distributed. It is unknown whether Markov chains and the theory of martingales can be used to deal with this case.

TABLE I: NOTATIONS

APPENDIX

ONLINE ALGORITHM AS DESCENT METHOD

The classical descent method [6] is an efficient method to solve unconstrained minimization problems of the form

$$\min_{w}\ F(w)$$

where $F$ is convex and continuously differentiable. The descent method is to find a suitable sequence $\{w_t\}$ to approximate the minimum point $w^*$:

$$w_{t+1} = w_t + \eta_t \Delta w_t$$

where $\eta_t$ is the step size and $\Delta w_t$ denotes the descent direction.

The descent method requires that

$$F(w_{t+1}) < F(w_t)$$

except when $w_t$ is optimal. The Taylor expansion of $F$ tells us

which requires

$$\nabla F(w_t)^{\mathsf T} \Delta w_t < 0.$$

That is, the direction $\Delta w_t$ must make an obtuse angle with the gradient $\nabla F(w_t)$. If we select the descent direction as the negative gradient direction, this gives the well-known descent algorithm

$$w_{t+1} = w_t - \eta_t \nabla F(w_t).$$

Under some assumptions on the step sizes $\eta_t$, one can get the convergence [6]

$$F(w_t) \to F(w^*) \qquad \text{as } t \to \infty.$$

The above descent method motivates the classification algorithm (4). To see this, let us assume that the classifying loss $\phi$ is differentiable. Since the regularizing function $f_\lambda$ is an unconstrained minimizer in $\mathcal{H}_K$ of the functional (the regularized generalization error)

$$\mathcal{E}_\lambda(f) = \int_Z \phi\bigl(y f(x)\bigr)\, d\rho + \frac{\lambda}{2}\|f\|_K^2,$$

as in the descent algorithm, we use the following sequence to approximate $f_\lambda$:

$$f_{t+1} = f_t - \eta_t\, \nabla \mathcal{E}_\lambda(f_t) \qquad \text{with } f_1 = 0.$$

Note that the functional derivative (see [33]) is

$$\nabla \mathcal{E}_\lambda(f) = \int_Z \phi'\bigl(y f(x)\bigr)\, y\, K_x\, d\rho + \lambda f.$$

However, the distribution $\rho$ is unknown in practical classification problems. What we have is a random sample $\mathbf{z} = \{(x_t, y_t)\}_{t=1}^{T}$. So we replace the integral part $\int_Z \phi'\bigl(y f_t(x)\bigr)\, y\, K_x\, d\rho$ by the random value $\phi'\bigl(y_t f_t(x_t)\bigr)\, y_t K_{x_t}$, and the above descent scheme then becomes the stochastic gradient descent online algorithm (4).
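The following Python sketch contrasts the (population) descent step with its single-sample replacement; the expectation over $\rho$ is approximated by an average over a batch of samples, and taking the batch to be the single point $(x_t, y_t)$ recovers the online update (4). The function and argument names are illustrative and the iterate is kept as a kernel expansion as in the earlier sketches.

```python
def batch_gradient_step(xs, alphas, batch, kernel, loss_deriv, lam, eta):
    """One step of  f <- f - eta * ( E_z[ phi'(y f(x)) y K_x ] + lam * f ),
    with the expectation over rho replaced by the average over `batch`."""
    new_alphas = [(1.0 - eta * lam) * a for a in alphas]   # the -eta*lam*f part
    new_xs = list(xs)
    for x, y in batch:
        fx = sum(a * kernel(xi, x) for a, xi in zip(alphas, xs))
        new_xs.append(x)
        new_alphas.append(-eta * loss_deriv(y * fx) * y / len(batch))
    return new_xs, new_alphas

# With batch = [(x_t, y_t)] this is exactly the stochastic gradient step of scheme (4).
```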

ACKNOWLEDGMENT

The authors would like to thank the referees for valuable comments and suggestions.

REFERENCES

[1] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge, U.K.: Cambridge Univ. Press, 1999.

[2] N. Aronszajn, "Theory of reproducing kernels," Trans. Amer. Math. Soc., vol. 68, pp. 337–404, 1950.

[3] K. S. Azoury and M. K. Warmuth, "Relative loss bounds for on-line density estimation with the exponential family of distributions," Machine Learn., vol. 43, pp. 211–246, 2001.

[4] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds," J. Amer. Statist. Assoc., vol. 101, pp. 138–156, 2006.

[5] O. Bousquet and A. Elisseeff, "Stability and generalization," J. Machine Learn. Res., vol. 2, pp. 499–526, 2002.

[6] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.

[7] D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou, "Support vector machine soft margin classifiers: Error analysis," J. Machine Learn. Res., vol. 5, pp. 1143–1175, 2004.

[8] N. Cesa-Bianchi, A. Conconi, and C. Gentile, "A second-order perceptron algorithm," SIAM J. Comput., vol. 34, pp. 640–688, 2005.

[9] N. Cesa-Bianchi, P. Long, and M. K. Warmuth, "Worst-case quadratic loss bounds for prediction using linear functions and gradient descent," IEEE Trans. Neural Netw., vol. 7, pp. 604–619, 1996.

[10] N. Cesa-Bianchi, A. Conconi, and C. Gentile, "On the generalization ability of on-line learning algorithms," IEEE Trans. Inf. Theory, vol. 50, pp. 2050–2057, 2004.

[11] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bull. Amer. Math. Soc., vol. 39, pp. 1–49, 2001.

[12] E. De Vito, A. Caponnetto, and L. Rosasco, "Model selection for regularized least-squares algorithm in learning theory," Found. Comput. Math., vol. 5, pp. 59–85, 2005.

[13] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1997.

[14] T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Adv. Comput. Math., vol. 13, pp. 1–50, 2000.

[15] J. Forster and M. K. Warmuth, "Relative expected instantaneous loss bounds," J. Comput. Syst. Sci., vol. 64, pp. 76–102, 2002.

[16] M. Herbster and M. K. Warmuth, "Tracking the best expert," Machine Learn., vol. 32, pp. 151–178, 1998.

[17] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Trans. Signal Process., vol. 52, pp. 2165–2176, 2004.

[18] Y. Lin, "Support vector machines and the Bayes rule in classification," Data Min. Knowledge Discovery, vol. 6, pp. 259–275, 2002.

[19] G. Lugosi and N. Vayatis, "On the Bayes-risk consistency of regularized boosting methods," Ann. Stat., vol. 32, pp. 30–55, 2004.

[20] P. Niyogi and F. Girosi, "On the relationships between generalization error, hypothesis complexity and sample complexity for radial basis functions," Neural Comp., vol. 8, pp. 819–842, 1996.

[21] C. Scovel and I. Steinwart, "Fast rates for support vector machines," in Proc. Conf. Learning Theory (COLT 2005), pp. 279–294.

[22] S. Smale and Y. Yao, "Online learning algorithms," Found. Comput. Math., vol. 6, pp. 145–170, 2006.

[23] S. Smale and D. X. Zhou, "Estimating the approximation error in learning theory," Anal. Appl., vol. 1, pp. 17–41, 2003.

[24] ——, "Shannon sampling and function reconstruction from point values," Bull. Amer. Math. Soc., vol. 41, pp. 279–305, 2004.

[25] ——, "Shannon sampling II: Connection to learning theory," Appl. Comput. Harmonic Anal., vol. 19, pp. 285–302, 2005.

[26] ——, "Learning theory estimates via integral operators and their applications," Constr. Approx., to be published.

[27] I. Steinwart, "Support vector machines are universally consistent," J. Complex., vol. 18, pp. 768–791, 2002.

[28] A. B. Tsybakov, "Optimal aggregation of classifiers in statistical learning," Ann. Stat., vol. 32, pp. 135–166, 2004.

[29] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[30] G. Wahba, Spline Models for Observational Data. Singapore: SIAM, 1990.

[31] Q. Wu, Y. Ying, and D. X. Zhou, "Multi-kernel regularized classifiers," J. Complex., to be published.

[32] Q. Wu and D. X. Zhou, "SVM soft margin classifiers: Linear programming versus quadratic programming," Neural Comput., vol. 17, pp. 1160–1187, 2005.

[33] K. Yosida, Functional Analysis, 6th ed. New York: Springer-Verlag, 1980.

[34] Y. Ying and D. X. Zhou, "Learnability of Gaussians with flexible variances," J. Mach. Learning, 2004, to be published.

[35] T. Zhang, "Statistical behavior and consistency of classification methods based on convex risk minimization," Ann. Stat., vol. 32, pp. 56–85, 2004.

[36] ——, "Leave-one-out bounds for kernel methods," Neural Comput., vol. 15, pp. 1397–1437, 2003.

[37] D. X. Zhou, "The covering number in learning theory," J. Complex., vol. 18, pp. 739–767, 2002.

[38] ——, "Capacity of reproducing kernel spaces in learning theory," IEEE Trans. Inf. Theory, vol. 49, pp. 1743–1752, 2003.

[38] ——, “Capacity of reproducing kernel spaces in learning theory,” IEEETrans. Inf. Theory, vol. 49, pp. 1743–1752, 2003.