Acta Mathematica Sinica, English Series
May, 2009, Vol. 25, No. 5, pp. 855–868
Published online: April 25, 2009
DOI: 10.1007/s10114-008-6210-8
http://www.ActaMath.com
© The Editorial Office of AMS & Springer-Verlag 2009
Rademacher Complexity in Neyman–Pearson Classification
Min HAN
Department of Applied Mathematics, Beijing University of Technology, Beijing 100124, P. R. China
E-mail: [email protected]
Di Rong CHEN 1)
Department of Mathematics, and LMIB, Beijing University of Aeronautics and Astronautics,
Beijing 100083, P. R. China
E-mail: [email protected]
Zhao Xu SUN
School of Applied Mathematics, Central University of Finance and Economics,
Beijing 100081, P. R. China
E-mail: sunzhaoxu@163.com
Abstract The Neyman–Pearson (NP) criterion is one of the most important criteria in hypothesis testing, and it also serves as a criterion for classification. This paper addresses the problem of bounding the estimation error of NP classification in terms of Rademacher averages. We investigate the behavior of the global and local Rademacher averages, present new NP classification error bounds based on the localized averages, and indicate how the estimation error can be estimated without a priori knowledge of the class at hand.
Keywords Neyman–Pearson lemma, VC classes, Rademacher complexity, Neyman–Pearson classification
MR(2000) Subject Classification 62G05, 68T05, 68T10
1 Introduction
The Neyman–Pearson (NP) approach to hypothesis testing applies when the a priori probability is unknown or when one type of error is far more costly than another. For any α > 0, the Neyman–Pearson lemma [1] provides the most powerful test of size α, but it requires full or very strong knowledge of the true distribution under each hypothesis. In NP classification, the goal is to learn a classifier from labeled training data such that the probability of a false negative is minimized while the probability of a false positive is kept below a user-specified level α ∈ (0, 1). Unlike NP hypothesis testing, NP classification does not assume knowledge of the underlying distributions beyond a collection of independent and identically distributed (i.i.d.) training examples. Up to the present there have been few works on NP classification.
This framework is an important alternative to other more common approaches to classification that seek to minimize the probability of error or the expected Bayes cost. One advantage
Received April 26, 2006, Accepted July 6, 2007
Research supported in part by NSF of China under Grant Nos. 10801004, 10871015; supported in part by Startup Grant for Doctoral Research of Beijing University of Technology
1) Corresponding author
of NP classification is that in many important applications, such as disease classification or network intrusion detection, it is more natural to specify a constraint on the false-positive probability than to assign costs to the different kinds of errors. A second and no less important advantage is that, unlike decision-theoretic approaches, NP classification does not rely in any way on knowledge of the a priori class probabilities. This is extremely important in applications where the class frequencies in the training data do not accurately reflect class frequencies in the larger population. A third advantage is that NP classification follows the paradigm of statistical learning theory, which can handle small sample sizes and general learning machines.
In estimating the performance of NP classification, the goal is to obtain the sharpest possible estimates on the complexities of function classes. The data-based Rademacher average is the complexity measure we will consider. In this paper we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM) based on data-based Rademacher complexities, and establish several new bounds in both cases. The estimates we establish give better rates and are based not only on global data-based Rademacher complexities, but also on local data-based Rademacher complexities. The classes in this paper are random classes of measurable functions, with Rademacher averages as complexity measures.
1.1 Notations
We focus exclusively on binary classification, although extensions to multi-class settings are possible. Let X be a set and Y = {0, 1} be the set of class labels. Let z = (x, y) be a random variable with values in (Z, P), where Z = X × Y. A classifier is a Borel measurable function f : X → Y. In Neyman–Pearson testing, y = 0 corresponds to the null hypothesis. Therefore, if f is a classifier and (x, y) is a realization, a false positive occurs when f(x) = 1 but y = 0. Similarly, a false negative occurs when f(x) = 0 but y = 1. In the standard learning theory setup [2], the indicator function I{f(x)≠y} is the loss function, and the performance of f is measured by the probability of error e(f) = P(f(x) ≠ y). We will focus on the false-alarm and miss probabilities, denoted by e_j(f) = P_{X|y=j}(f(x) ≠ j | y = j) for j = 0 and j = 1, respectively. The false-alarm probability is also known as the size, while one minus the miss probability is the power. Note that e(f) = π0 e0(f) + π1 e1(f), where π_j = P_Y(Y = j) is the (unknown) a priori probability of class j.
Let α ∈ [0, 1] be a user-specified level of significance or false-alarm threshold. Given a class of classifiers F, define F_α = {f ∈ F : e0(f) ≤ α}. Neyman–Pearson classification seeks the classifier f*_α given by

f*_α = arg min{e1(f) : f ∈ F_α}.   (1)
That is, f*_α is the most powerful test/classifier in F of size α. If h0 and h1 are the class-conditional Lebesgue densities of x (with respect to a Lebesgue measure μ) corresponding to classes 0 and 1 respectively, then the Neyman–Pearson lemma [1] states that f*_α is given by a likelihood ratio test (LRT): f*_α = I{h1(x) > λ_α h0(x)}, where λ_α is defined implicitly by the integral equation

∫_{h1(x) > λ_α h0(x)} h0(x) dx = α.

Thus, when h0 and h1 are known (or in certain special cases where the likelihood ratio is a monotonic function of an unknown parameter), the Neyman–Pearson lemma delivers the optimal classifier. In this paper we are interested in the case where our only
information about the probability measure is a finite training sample.

Let Z^n = {(x_i, y_i)}_{i=1}^n ∈ Z^n be a collection of n i.i.d. samples of z = (x, y). A learning algorithm is a mapping A : Z^n → F(X, Y), where F(X, Y) is the set of all classifiers. In other words, the learner A is a rule for selecting a classifier based on a training sample. Let P^n denote the product measure on Z^n induced by P, and E^n denote expectation with respect to P^n. For j = 0, 1, let n_j = Σ_{i=1}^n I{y_i = j} be the number of samples with y_i = j. Let

ê_j(f) = (1/n_j) Σ_{i: y_i = j} I{f(x_i) ≠ j}

denote the empirical false-alarm or miss probability, corresponding to j = 0 or j = 1. Let ε0 > 0 and F̂_α = {f ∈ F : ê0(f) ≤ α + ε0}. NP-ERM seeks the classifier f̂_{α,n} given by

f̂_{α,n} = arg min ê1(f)   s.t.   ê0(f) ≤ α + ε0.   (2)
Finally, set e*_{1,α} = e1(f*_α), the miss probability of the optimal classifier f*_α.

Let σ1, . . . , σn be n independent Rademacher random variables [2], that is, independent random variables with Pr(σ_i = 1) = Pr(σ_i = −1) = 1/2. For a function f : X → R, define R_n f = (1/n) Σ_{i=1}^n σ_i f(x_i). For a class F, set R_n F = sup_{f∈F} R_n f. Define E_σ to be the expectation with respect to the random variables σ1, . . . , σn, conditioned on all of the other random variables. The Rademacher average of F is E R_n F, and the empirical (or conditional) Rademacher average of F is

E_σ R_n F = (1/n) E[ sup_{f∈F} Σ_{i=1}^n σ_i f(x_i) | x1, . . . , xn ].
For our purpose, we introduce, for j = 0, 1,

R^j_{n_j}(F) = (1/n_j) sup_{f∈F} Σ_{i: y_i = j} σ_i f(x_i).

These play the same role in our analysis as R_n F plays in the standard learning theory. Moreover, as in the standard learning theory, we define the (NP) empirical Rademacher averages E_σ R^j_{n_j}(F) of F for j = 0, 1 by

E_σ R^j_{n_j}(F) = E(R^j_{n_j}(F) | x1, . . . , xn).
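For a finite class, the conditional expectation E_σ R^j_{n_j}(F) can be approximated by Monte Carlo over draws of the signs σ_i. A minimal sketch (our own illustration; the function name and toy data are assumptions, not from the paper):

```python
import numpy as np

def empirical_rademacher(classifiers, Xj, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher average
    E_sigma sup_f (1/n_j) * sum_i sigma_i * f(x_i) over one class's points."""
    rng = np.random.default_rng(seed)
    nj = len(Xj)
    preds = np.stack([f(Xj) for f in classifiers])       # shape (|F|, n_j)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, nj))  # Rademacher signs
    # for each draw of sigma, take the supremum over the class, then average
    sups = (sigma @ preds.T).max(axis=1) / nj
    return sups.mean()

rng = np.random.default_rng(1)
X0 = rng.normal(0, 1, 100)  # the class-0 sample points
classifiers = [lambda x, t=t: (x > t).astype(float) for t in np.linspace(-2, 2, 21)]
R0 = empirical_rademacher(classifiers, X0)
```

Since the estimate is conditional on the observed points, it is exactly the data-dependent quantity that enters the bounds below, up to Monte Carlo error.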
1.2 Related Work
Neyman–Pearson learning was first studied by Cannon, Howse, Hush and Scovel [3]. They gave an analysis of NP-ERM and showed that the estimation error of a classifier designed by NP-ERM is bounded by the error deviance, as in the standard statistical theory (see [2, 4–6]). Also, using the VC inequality [4–5], they determined probabilistic guarantees on the estimation error of Neyman–Pearson learning, again as in the standard theory. But they considered NP learning only on fixed Vapnik–Chervonenkis (VC) classes. In a separate paper, Cannon et al. [7] considered NP-ERM and were able to bound the estimation error in this case as well. They also provided an algorithm for NP-ERM in some simple cases involving linear and spherical classifiers, and presented an experimental evaluation. But they considered only classes with finite VC dimension, because their results are based on [3].
One limitation of NP-ERM over a fixed F is that most possibilities for the optimal rule f*_α cannot be approximated arbitrarily well. A solution to this problem is known as structural risk minimization (SRM) [9], whereby a classifier is selected from a family F^k, k = 1, 2, . . .,
of increasingly rich classes of classifiers. Scott and Nowak [8] first studied NP-SRM, a version of SRM adapted to the NP setting. They considered NP-ERM over finite classes and presented NP-SRM. A general condition on VC classes or finite classes was given to guarantee universal consistency of NP-SRM. But their results are limited to VC classes.
In a recent paper, Scott [10] presented the following two families of performance measures for evaluating and comparing classifiers:

M_k(f) = k(e0(f) − α)_+ + (e1(f) − e*_{1,α})_+,

N_k(f) = k(e0(f) − α)_+ + e1(f) − e*_{1,α},

where (x)_+ = max(x, 0) and k > 0. He suggested that N_k(f) is meaningful (it is minimized by the target classifier f*_α) when k ≥ 1/α. Finally, several PAC bounds on NP-ERM with respect to M_k(f) and N_k(f) were given. But the bounds were at best O(√(log n / n)), which is not tight enough.
In estimating the performance of a learning algorithm, the goal is to obtain the sharpest possible estimates on the complexity of function classes. Distribution-free notions of complexity, such as the Vapnik–Chervonenkis dimension [6] or the metric entropy [11], typically give conservative estimates. Data-dependent estimates, in particular, are desirable. One of the most interesting data-dependent complexities is the Rademacher average of the function class. It was first proposed as an effective complexity measure by Koltchinskii [12], Bartlett, Boucheron and Lugosi [13] and Mendelson [14], and was further studied in [15]; global estimates on the complexity of the function class via the global Rademacher averages yield error rates of at best √(log n / n). Bartlett, Bousquet and Mendelson [16] considered local Rademacher averages as complexity measures, which guarantee fast rates of O(log n / n). See also Bousquet, Koltchinskii and Panchenko [17], and Lugosi and Wegkamp [18].
In this paper, we consider the performance of NP-ERM and NP-SRM in terms of the Rademacher complexity of the function classes. We obtain several bounds containing data-dependent complexities using data-dependent global and local Rademacher averages. In this case, the best error rate is O(log n / n).

The paper is organized as follows. In Section 2, we consider the performance of NP-ERM. We give three PAC bounds using global Rademacher complexities for NP-ERM in Subsection 2.1. Subsection 2.2 establishes two NP-ERM algorithms using local Rademacher averages, and proves performance guarantees for both algorithms. Section 3 gives two oracle bounds for NP-SRM with global and local Rademacher averages.
2 Neyman–Pearson Empirical Risk Minimization (NP-ERM)
Before proceeding further, we review a main result of Cannon et al. [3].
Lemma 1 (Cannon [3]) Let ε0, ε1 > 0, α ≥ 0, and take f̂_{α,n} as in (2). Then

P[(e0(f̂_{α,n}) − α > 2ε0) or (e1(f̂_{α,n}) − e*_{1,α} > 2ε1)]
≤ P[sup_{f∈F} |ê0(f) − e0(f)| > ε0] + P[sup_{f∈F} |ê1(f) − e1(f)| > ε1].
2.1 PAC Bounds Using Global Rademacher Averages for NP-ERM
Set

ε_j(F, n_j, δ_j) = 2E_σ R^j_{n_j}(F) + 3√(log(2/δ_j) / (2n_j))   for j = 0, 1,

and substitute them for the ε_j in (2)
respectively. We have the following Theorem 1, which gives a PAC bound for NP-ERM (2) using the global Rademacher average.
Theorem 1 Let δ0, δ1 > 0. For NP-ERM (2) over F, we have

P[(e0(f̂_{α,n}) − α > 4E_σ R^0_{n_0}(F) + 3√(2 log(2/δ0) / n_0))
or (e1(f̂_{α,n}) − e1(f*_α) > 4E_σ R^1_{n_1}(F) + 3√(2 log(2/δ1) / n_1))] ≤ δ0 + δ1.
The proof is given in Appendix A.

For ε_j(F, n_j, δ_j) defined above, we consider the following constrained minimization, called NP-CON [10]:

f̂^c_α = arg min ê1(f) + ε1(F, n_1, δ1) + γ ε0(F, n_0, δ0)   s.t.   ê0(f) ≤ α + ν ε0(F, n_0, δ0).

We have the following theorem:
Theorem 2 Fix 0 < k < ∞. Let δ0, δ1 > 0 and let f̂^c_α be the rule of NP-CON with ν ≥ −1 and γ ≥ (k + λ_α)(1 + ν). Then

e0(f̂^c_α) ≤ α + (1 + ν) ε0(F, n_0, δ0)

and

M_k(f̂^c_α) ≤ inf_{f ∈ F^ν_α} {γ ε0(F, n_0, δ0) + e1(f) − e*_{1,α} + 2ε1(F, n_1, δ1)}

with probability at least 1 − (δ0 + δ1) with respect to Z^n, where F^ν_α = {f ∈ F : e0(f) ≤ α − (1 − ν) ε0(F, n_0, δ0)}. If k > λ_α and we replace M_k by N_k, the same statement holds when γ ≥ k(1 + ν).
If we substitute ε_j(F, n_j, δ_j) into the following unconstrained minimization, called NP-UNC [10]:

f̂^u_α = arg min_{f∈F} k(ê0(f) − α − ν ε0(F, n_0, δ0))_+ + γ ε0(F, n_0, δ0) + ê1(f) + ε1(F, n_1, δ1),

we have the following PAC bound:

Theorem 3 Fix 0 < k < ∞. Let δ0, δ1 > 0 and let f̂^u_α be the classifier of NP-UNC with ν ∈ R and γ = k(1 + ν). Then

N_k(f̂^u_α) ≤ inf_{f∈F} {k(e0(f) − α)_+ + 2k ε0(F, n_0, δ0) + e1(f) − e*_{1,α} + 2ε1(F, n_1, δ1)}

with probability at least 1 − (δ0 + δ1) with respect to Z^n.

Theorems 2 and 3 extend Theorems 3 and 2 of Scott [10], respectively, to more general function classes.
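Over a finite class, the NP-UNC objective can be minimized directly by scanning the class. The sketch below is our own illustration of that rule (names, penalties and toy data are assumptions; only the form of the objective and the choice γ = k(1 + ν) come from the text):

```python
import numpy as np

def np_unc(classifiers, X, y, alpha, k, nu, eps0, eps1):
    """Sketch of NP-UNC: minimize over f
    k*(e0_hat(f) - alpha - nu*eps0)_+ + gamma*eps0 + e1_hat(f) + eps1,
    with gamma = k*(1 + nu) as in Theorem 3."""
    gamma = k * (1.0 + nu)
    X0, X1 = X[y == 0], X[y == 1]

    def objective(f):
        e0_hat = float(np.mean(f(X0) != 0))  # empirical false-alarm rate
        e1_hat = float(np.mean(f(X1) != 1))  # empirical miss rate
        return (k * max(e0_hat - alpha - nu * eps0, 0.0)
                + gamma * eps0 + e1_hat + eps1)

    return min(classifiers, key=objective)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(2, 1, 300)])
y = np.concatenate([np.zeros(300), np.ones(300)])
classifiers = [lambda x, t=t: (x > t).astype(int) for t in np.linspace(-1, 4, 51)]
f_u = np_unc(classifiers, X, y, alpha=0.1, k=10.0, nu=0.0, eps0=0.02, eps1=0.02)
```

Because the constraint is folded into the objective as a hinge penalty, NP-UNC always returns a classifier even when no classifier meets the false-alarm constraint exactly.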
2.2 PAC Bounds Using Local Rademacher Averages for NP-ERM

Consider the loss class associated with F = {f : f a Borel measurable function from X to {0, 1}}, whose elements are the functions (x, y) ↦ I{f(x) ≠ y}, f ∈ F. Since the loss takes values in {0, 1}, we obviously have, for j = 0, 1 and all f ∈ F, Var_j(f) = e_j(f²) − (e_j(f))² ≤ e_j(f²) = e_j(f). Let ψ_j be a sub-root function (the definition and properties are given in Appendix B) and let r*_j be the fixed point of ψ_j. Assume ψ_j satisfies, for any r ≥ r*_j,

ψ_j(r) ≥ E R^j_{n_j} {f ∈ F : e_j(f) ≤ r}.
Set

ε0 = ε0(n_0, δ0) = 651K r*_0 + 44(K + 1) log(1/δ0)/n_0

and

ε1 = ε1(n_1, δ1) = 651K r*_1 + 44(K + 1) log(1/δ1)/n_1.
Using the arguments in the proof of [16, Theorem 3.3], we get the following theorem:

Theorem 4 Let F, ψ_j, r*_j, e_j, ê_j and n_j, j = 0, 1, be as described above. Then, for any K > 1 and δ_j > 0, with probability at least 1 − δ_j,

∀ f ∈ F,   e_j(f) ≤ (K/(K − 1)) ê_j(f) + 651K r*_j + 44(K + 1) log(1/δ_j)/n_j,

and

∀ f ∈ F,   ê_j(f) ≤ ((K + 1)/K) e_j(f) + 651K r*_j + 44(K + 1) log(1/δ_j)/n_j.

Proof Substitute B = 1, a = 0, b = 1, λ = 4 and α = 20 into the proof of Theorem 3.3 of [16], and notice that, as in the proof of Theorem 1,

E R^j_{n_j} {(x, y) ↦ I{f(x) ≠ y} : e_j(f) ≤ r} = E R^j_{n_j} {f ∈ F : e_j(f) ≤ r}.
For any K > 1, we consider

f̂^K_{α,n} = arg min ê1(f)   s.t.   ê0(f) ≤ ((K + 1)/K) α + ε0,   (3)

and F̂^K_α = {f ∈ F : ê0(f) ≤ ((K + 1)/K) α + ε0}.

Lemma 2 Let

Θ0 = {Z^n : e0(f̂^K_{α,n}) > ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) ε0},

Θ1 = {Z^n : e1(f̂^K_{α,n}) > ((K + 1)/(K − 1)) e*_{1,α} + ((2K − 1)/(K − 1)) ε1},

Ω0 = {Z^n : (sup_{f∈F} (e0(f) − (K/(K − 1)) ê0(f)) > ε0) or (sup_{f∈F} (ê0(f) − ((K + 1)/K) e0(f)) > ε0)},

Ω1 = {Z^n : (sup_{f∈F} (e1(f) − (K/(K − 1)) ê1(f)) > ε1) or (sup_{f∈F} (ê1(f) − ((K + 1)/K) e1(f)) > ε1)},

where f̂^K_{α,n} and f*_α are as in (3) and (1). Then we have

Θ0 ∪ Θ1 ⊂ Ω0 ∪ Ω1.
The proof is given in Appendix C. The following Corollary 2 can be easily derived from Lemma 2.

Corollary 2 Given ε0, ε1 > 0 and α > 0, let f̂^K_{α,n} and f*_α be defined as above. Then, for all K > 1,

P[(e0(f̂^K_{α,n}) > ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) ε0) or (e1(f̂^K_{α,n}) − ((K + 1)/(K − 1)) e*_{1,α} > ((2K − 1)/(K − 1)) ε1)]
≤ P[(sup_{f∈F} (e0(f) − (K/(K − 1)) ê0(f)) > ε0) or (sup_{f∈F} (ê0(f) − ((K + 1)/K) e0(f)) > ε0)]
+ P[(sup_{f∈F} (e1(f) − (K/(K − 1)) ê1(f)) > ε1) or (sup_{f∈F} (ê1(f) − ((K + 1)/K) e1(f)) > ε1)].
Combining Theorem 4 with Corollary 2, we obtain the following PAC bound for NP-ERM (3):

Theorem 5 Let α > 0, and let e0, ê0, e1, ê1, f̂^K_{α,n}, f*_α, ε0, ε1 be defined as above. Then, for all K > 1 and δ_j > 0,

P[(e0(f̂^K_{α,n}) > ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) ε0) or (e1(f̂^K_{α,n}) − ((K + 1)/(K − 1)) e1(f*_α) > ((2K − 1)/(K − 1)) ε1)] ≤ δ0 + δ1.
Theorem 5 is a PAC bound for NP-ERM (3) using local Rademacher averages, which enjoys a fast rate of convergence.

For δ0, δ1 > 0, K > 1, and ε_j and 0 < k < ∞ as above, let

M^k_K(f) = k(e0(f) − ((K + 1)/(K − 1)) α)_+ + (e1(f) − ((K + 1)/(K − 1)) e*_{1,α})_+,

F̂^ν_α = {f ∈ F : ê0(f) ≤ ((K + 1)/K) α + ν ε0},

F^ν_α = {f ∈ F : e0(f) ≤ α − (K/(K + 1))(1 − ν) ε0},

F_α = {f ∈ F : e0(f) ≤ α}.

Let us consider the following constrained minimization NP-CON:

f̂^c_α = arg min (K/(K − 1)) ê1(f) + ε1 + γ ε0   s.t.   ê0(f) ≤ ((K + 1)/K) α + ν ε0.   (4)
For this NP-ERM, the following PAC bound gives a fast error rate as in Theorem 5.

Theorem 6 For δ0, δ1 > 0 and the classifier f̂^c_α of NP-CON in (4) with ν ≤ −1, if γ ≥ (λ_α + k)(1 + (K/(K − 1)) ν), then, with probability at least 1 − (δ0 + δ1),

e0(f̂^c_α) ≤ ((K + 1)/(K − 1)) α + (1 + (K/(K − 1)) ν) ε0,

and

M^k_K(f̂^c_α) ≤ inf_{f ∈ F^ν_α} {γ ε0 + ((2K − 1)/(K − 1)) ε1 + ((K + 1)/(K − 1))(e1(f) − e*_{1,α})}.
Before giving the proof of Theorem 6, let us prove several lemmas.

Lemma 3 If α′ > α, then ((K + 1)/(K − 1)) e*_{1,α} − e*_{1,α′} ≤ λ_α (α′ − ((K + 1)/(K − 1)) α); if α′ < α, then e*_{1,α′} − ((K + 1)/(K − 1)) e*_{1,α} ≥ λ_α (((K + 1)/(K − 1)) α − α′).
Proof Since f*_α = I{h1(x) > λ_α h0(x)}, we set G_{0,α} = {x : f*_α = 0} = {x : h1(x) < λ_α h0(x)} and G_{1,α} = {x : f*_α = 1} = {x : h1(x) > λ_α h0(x)}, the complement of G_{0,α}. Define G_{0,α′} and G_{1,α′} similarly. By the definition of G_{0,α}, we have

((K + 1)/(K − 1)) e*_{1,α} − e*_{1,α′} = ∫_{G_{0,α} \ G_{0,α′}} h1(x) dx + (2/(K − 1)) ∫_{G_{0,α}} h1(x) dx
≤ λ_α ∫_{G_{0,α} \ G_{0,α′}} h0(x) dx + (2/(K − 1)) λ_α ∫_{G_{0,α}} h0(x) dx.

Then we have

((K + 1)/(K − 1)) e*_{1,α} − e*_{1,α′} ≤ λ_α ∫_{G_{1,α′} \ G_{1,α}} h0(x) dx − (2/(K − 1)) λ_α ∫_{G_{1,α}} h0(x) dx
= λ_α (α′ − ((K + 1)/(K − 1)) α).
This proves the first part of the lemma. The second part follows from a similar argument.

Lemma 4 If Z^n ∉ Ω0 and f ∈ F̂^ν_α, then e0(f) ≤ ((K + 1)/(K − 1)) α + (1 + (K/(K − 1)) ν) ε0.

Proof Since Z^n ∉ Ω0 and f ∈ F̂^ν_α, we have ê0(f) ≤ ((K + 1)/K) α + ν ε0, and hence

e0(f) = e0(f) − (K/(K − 1)) ê0(f) + (K/(K − 1)) ê0(f)
≤ ((K + 1)/(K − 1)) α + (1 + (K/(K − 1)) ν) ε0.

Lemma 5 Let ν ≤ −1. If Z^n ∉ Ω0 and f ∈ F̂^ν_α, then ((K + 1)/(K − 1)) e*_{1,α} − e1(f) ≤ λ_α (1 + (K/(K − 1)) ν) ε0.
Proof Let Z^n ∉ Ω0 and f ∈ F̂^ν_α. It follows from Lemma 4 that

e0(f) ≤ ((K + 1)/(K − 1)) α + (1 + (K/(K − 1)) ν) ε0 =: α′,

which implies e1(f) ≥ e1(f*_{α′}). We have

((K + 1)/(K − 1)) e*_{1,α} − e1(f) ≤ ((K + 1)/(K − 1)) e*_{1,α} − e1(f*_{α′})
≤ λ_α (α′ − ((K + 1)/(K − 1)) α)
= λ_α (1 + (K/(K − 1)) ν) ε0,

where the second inequality follows from Lemma 3. The proof is complete.
Proof of Theorem 6 Assume Z^n ∉ Ω0 ∪ Ω1. Then, by the definition of NP-CON (4) and Z^n ∉ Ω0, we have

e0(f̂^c_α) = (K/(K − 1)) ê0(f̂^c_α) + e0(f̂^c_α) − (K/(K − 1)) ê0(f̂^c_α)
≤ (K/(K − 1)) (((K + 1)/K) α + ν ε0) + sup_{f∈F} (e0(f) − (K/(K − 1)) ê0(f))
≤ ((K + 1)/(K − 1)) α + (1 + (K/(K − 1)) ν) ε0.
This establishes the first inequality of the theorem.

We claim that the second inequality follows whenever the condition

M^k_K(f̂^c_α) ≤ γ ε0 + e1(f̂^c_α) − ((K + 1)/(K − 1)) e*_{1,α}   (5)

holds. In fact, for all f ∈ F^ν_α, ê0(f) ≤ ε0 + ((K + 1)/K) e0(f) ≤ ((K + 1)/K) α + ν ε0, which shows F^ν_α ⊂ F̂^ν_α. Then, by Z^n ∉ Ω1 and the definition of NP-CON (4),

M^k_K(f̂^c_α) ≤ γ ε0 + ε1 + (K/(K − 1)) ê1(f̂^c_α) − ((K + 1)/(K − 1)) e*_{1,α}
= inf_{f ∈ F̂^ν_α} {γ ε0 + ε1 + (K/(K − 1)) ê1(f) − ((K + 1)/(K − 1)) e*_{1,α}}
≤ inf_{f ∈ F̂^ν_α} {γ ε0 + ((2K − 1)/(K − 1)) ε1 + ((K + 1)/(K − 1)) e1(f) − ((K + 1)/(K − 1)) e*_{1,α}}.

Thus the second inequality of the theorem follows from F^ν_α ⊂ F̂^ν_α, as claimed. Therefore it only remains to show that (5) holds. We consider the following four separate cases:

(i) e1(f̂^c_α) < ((K + 1)/(K − 1)) e*_{1,α} and e0(f̂^c_α) > ((K + 1)/(K − 1)) α;

(ii) e1(f̂^c_α) ≥ ((K + 1)/(K − 1)) e*_{1,α} and e0(f̂^c_α) > ((K + 1)/(K − 1)) α;

(iii) e1(f̂^c_α) ≥ ((K + 1)/(K − 1)) e*_{1,α} and e0(f̂^c_α) ≤ ((K + 1)/(K − 1)) α;

(iv) e1(f̂^c_α) < ((K + 1)/(K − 1)) e*_{1,α} and e0(f̂^c_α) ≤ ((K + 1)/(K − 1)) α.
In the first case, we have by Lemmas 4 and 5 that

M^k_K(f̂^c_α) = k(e0(f̂^c_α) − ((K + 1)/(K − 1)) α)
≤ k(1 + (K/(K − 1)) ν) ε0 + ((K + 1)/(K − 1)) e*_{1,α} − e1(f̂^c_α) + e1(f̂^c_α) − ((K + 1)/(K − 1)) e*_{1,α}
≤ (k + λ_α)(1 + (K/(K − 1)) ν) ε0 + e1(f̂^c_α) − ((K + 1)/(K − 1)) e*_{1,α}.

Since γ ≥ (λ_α + k)(1 + (K/(K − 1)) ν), we obtain (5).

For the second case, we have

M^k_K(f̂^c_α) = k(e0(f̂^c_α) − ((K + 1)/(K − 1)) α) + e1(f̂^c_α) − ((K + 1)/(K − 1)) e*_{1,α}
≤ k(1 + (K/(K − 1)) ν) ε0 + e1(f̂^c_α) − ((K + 1)/(K − 1)) e*_{1,α}
≤ γ ε0 + e1(f̂^c_α) − ((K + 1)/(K − 1)) e*_{1,α}.

For the third case, note that M^k_K(f̂^c_α) = e1(f̂^c_α) − ((K + 1)/(K − 1)) e*_{1,α}, which yields (5). For the last case, (5) is obvious since M^k_K(f̂^c_α) = 0.

Hence the inequality (5) holds under the assumptions of the theorem, and the proof is complete.
3 Neyman–Pearson Structural Risk Minimization (NP-SRM)
3.1 NP-SRM Using Global Rademacher Averages
Let F^k, k = 1, 2, . . . be given, with F = ∪_k F^k. Define

ε_j(n_j, δ_j, k) = 4E_σ R^j_{n_j}(F^k) + 6√((2 log k + log(4/δ_j)) / n_j).

Let K = K(n) be a nondecreasing integer-valued function of n with K(1) = 1.

• For each k = 1, 2, . . . , K(n), set

f̂^k_{α,n} = arg min_{f ∈ F^k} ê1(f)   s.t.   ê0(f) ≤ α + (1/2) ε0(n_0, δ0, k);

• Set

f̂_α = arg min {ê1(f̂^k_{α,n}) + (1/2) ε1(n_1, δ1, k) : k = 1, 2, . . . , K(n)}.
We have the following oracle bound based on global Rademacher complexities for NP-SRM.

Theorem 7 With probability at least 1 − (δ0 + δ1) over the training sample Z^n, we have

e0(f̂_α) − α ≤ ε0(n_0, δ0, K(n))

and

e1(f̂_α) − e*_{1,α} ≤ inf_{1 ≤ k ≤ K(n)} [ε1(n_1, δ1, k) + inf_{f ∈ F^k} e1(f) − e*_{1,α}].
Proof The ε_j here equals the ε_j of Subsection 2.1 with δ_j replaced by (1/2) δ_j k^{−2}, which is used to show that (1/2) Σ_{k=1}^∞ k^{−2} δ_j < δ_j. With the same proof as that of Theorem 5 of C. Scott and R. Nowak [8], we arrive at the result.
3.2 NP-SRM Using Local Rademacher Averages
Let F^k, k = 1, 2, . . . , K(n) be given, with F^1 ⊂ F^2 ⊂ · · · ⊂ F^{K(n)}, where K(n) is a nondecreasing integer-valued function of n with K(1) = 1. For any integer K > 1 and x_{k,j} > 0, j = 0, 1, define

ε_j(n_j, k, x_{k,j}) = 651K r*_{k,j} + 44(K + 1) x_{k,j}/n_j,

where r*_{k,j} is the fixed point of a sub-root function ψ_{k,j}(r). Assume ψ_{k,j} satisfies, for any r ≥ r*_{k,j},

ψ_{k,j}(r) ≥ E R^j_{n_j} {f ∈ F^k : e_j(f) ≤ r},

and x_{k,j} is an arbitrary positive number depending only on k and n_j. Set

f*_{α,k} = arg min_{f ∈ F^k} e1(f)   s.t.   e0(f) ≤ α.

NP-SRM produces a classifier f̂_{α,n} according to the following two-step process:

1. For each k = 1, 2, . . . , K(n), set

f̂^k_{α,n} = arg min_{f ∈ F^k} ê1(f)   s.t.   ê0(f) ≤ ((K + 1)/K) α + ε0(n_0, k, x_{k,0}).

2. Set

f̂_{α,n} = arg min {ê1(f̂^k_{α,n}) + ((K − 1)/K) ε1(n_1, k, x_{k,1}) : k = 1, 2, . . . , K(n)}.

The term ((K − 1)/K) ε1(n_1, k, x_{k,1}) may be viewed as a penalty measuring the complexity of the class F^k.
Let V(j, F) = sup_{f∈F} (e_j(f) − (K/(K − 1)) ê_j(f)) and W(j, F) = sup_{f∈F} (ê_j(f) − ((K + 1)/K) e_j(f)) for j = 0, 1. We have the following PAC bound for NP-SRM.

Theorem 8 Let K > 1, and let x0 > 0 depend only on n_0 and x1 > 0 only on n_1. Set C(K, n_j, x_j) = 44(K + 1)(log(π²/6) + 2 log k + x_j)/n_j for j = 0, 1. Then, with probability at least 1 − (e^{−x0} + e^{−x1}), we have

e0(f̂_{α,n}) − ((K + 1)/(K − 1)) α ≤ ((2K − 1)/(K − 1)) [651K r*_{K(n),0} + C(K, n_0, x_0)]

and

e1(f̂_{α,n}) − e*_{1,α} ≤ inf_{1 ≤ k ≤ K(n)} [((K + 1)/(K − 1)) e1(f*_{α,k}) − e*_{1,α} + ((2K − 1)/(K − 1))(651K r*_{k,1} + C(K, n_1, x_1))].
The proof is similar to that of Lemma 2.
As in standard statistical learning, we are also interested in bounds in expectation:

Theorem 9 For the NP-SRM defined above, we have

E e0(f̂_{α,n}) ≤ ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) (651K E r*_{K(n),0} + 88(K + 1) log(n_0 K(n))/n_0)

and

E e1(f̂_{α,n}) − e*_{1,α} ≤ inf_{1 ≤ k ≤ K(n)} [((K + 1)/(K − 1)) e1(f*_{α,k}) − e*_{1,α} + (651K(2K − 1)/(K − 1)) E r*_{k,1} + C(k, n_0, n_1)],

where C(k, n_0, n_1) = (88(K + 1)(2K − 1)/(K − 1)) · log(n_1 k)/n_1 + 2/n_0² + 4/n_1².
The proof is given in Appendix D.
Appendix A Proof of Theorem 1

For j = 0, 1, by McDiarmid's inequality, for any δ′_j ∈ (0, 1/2), with probability at least 1 − δ′_j,

sup_{f∈F} |ê_j(f) − e_j(f)| ≤ E sup_{f∈F} |ê_j(f) − e_j(f)| + √(log(1/δ′_j)/(2n_j))
≤ 2E sup_{f∈F} (1/n_j) Σ_{i: y_i = j} σ_i I{f(x_i) ≠ j} + √(log(1/δ′_j)/(2n_j)).   (6)
Notice that

E[sup_{f∈F} (1/n_j) Σ_{i: y_i = j} σ_i I{f(x_i) ≠ j}] = E[sup_{f∈F} (1/n_j) Σ_{i: y_i = j} σ_i (f(x_i) − y_i)] = E R^j_{n_j}(F).   (7)
Applying McDiarmid's inequality also to E_σ R^j_{n_j}(F), we get, with probability at least 1 − δ′_j,

E R^j_{n_j}(F) ≤ E_σ R^j_{n_j}(F) + √(log(1/δ′_j)/(2n_j)),

which, together with (6) and (7), yields (taking δ′_j = δ_j/2) that for j = 0, 1 and any δ_j ∈ (0, 1),

P[sup_{f∈F} |ê_j(f) − e_j(f)| > 2E_σ R^j_{n_j}(F) + 3√(log(2/δ_j)/(2n_j))] ≤ δ_j.

The result then follows from Lemma 1.
Appendix B Sub-root Functions

Definition A.1 A function φ : [0, ∞) → [0, ∞) is sub-root if it is nonnegative, nondecreasing, and r ↦ φ(r)/√r is non-increasing for r > 0.

Among other useful properties, sub-root functions have a unique fixed point.

Lemma A.2 If φ : [0, ∞) → [0, ∞) is a sub-root function, then it is continuous on [0, ∞) and the equation φ(r) = r has a unique positive solution. Moreover, if we denote the solution by r*, then for all r > 0, r ≥ φ(r) if and only if r* ≤ r.

In some cases, these fixed points can be estimated from global information on the function class and admit bounds of the form

r* ≤ c d log(n/d)/n,

where c is an absolute constant and d < ∞ is the VC dimension. See [16] and [19] for more details.
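The fixed point r* can also be computed numerically: since φ is nondecreasing and, by Lemma A.2, φ(r) ≥ r below r* and φ(r) ≤ r above it, the iteration r ← φ(r) is monotone and converges to r* from any positive starting point. A minimal sketch (our own code, assuming φ has the sub-root properties of Definition A.1):

```python
import math

def fixed_point(phi, r0=1.0, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration r <- phi(r) for a sub-root function phi.
    Converges monotonically to the unique positive fixed point r*."""
    r = r0
    for _ in range(max_iter):
        r_next = phi(r)
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r

# phi(r) = c*sqrt(r) is sub-root; its fixed point is exactly c**2
c = 0.3
r_star = fixed_point(lambda r: c * math.sqrt(r))
```

In practice φ itself would be an upper bound on the localized Rademacher average, as in the assumption ψ_j(r) ≥ E R^j_{n_j}{f ∈ F : e_j(f) ≤ r} of Subsection 2.2.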
Appendix C Proof of Lemma 2

Set C = {Z^n : ê0(f*_α) > ((K + 1)/K) α + ε0}. Then

Θ1 ∪ Θ0 = (Θ1 ∩ C) ∪ (Θ1 ∩ C^c) ∪ Θ0 ⊂ (Θ1 ∩ C^c) ∪ C ∪ Θ0.
In the following we prove Θ1 ∩ C^c ⊂ Ω1, C ⊂ Ω0 and Θ0 ⊂ Ω0, respectively.

For the first term, let Z^n ∈ C^c, and notice that ê1(f̂^K_{α,n}) ≤ ê1(f*_α) by the definition of f̂^K_{α,n}. Therefore,

e1(f̂^K_{α,n}) − ((K + 1)/(K − 1)) e1(f*_α)
= e1(f̂^K_{α,n}) − (K/(K − 1)) ê1(f̂^K_{α,n}) + (K/(K − 1))(ê1(f̂^K_{α,n}) − ((K + 1)/K) e1(f*_α))
≤ e1(f̂^K_{α,n}) − (K/(K − 1)) ê1(f̂^K_{α,n}) + (K/(K − 1))(ê1(f*_α) − ((K + 1)/K) e1(f*_α))
≤ sup_{f∈F} (e1(f) − (K/(K − 1)) ê1(f)) + (K/(K − 1)) sup_{f∈F} (ê1(f) − ((K + 1)/K) e1(f)).

Hence, for Z^n ∈ C^c ∩ Ω1^c, the right-hand side is at most ε1 + (K/(K − 1)) ε1 = ((2K − 1)/(K − 1)) ε1, so Z^n ∉ Θ1. This shows Θ1 ∩ C^c ⊂ Ω1.

For the second term, let Z^n ∈ C, so that ê0(f*_α) > ((K + 1)/K) α + ε0. Also, by the definition of f*_α, e0(f*_α) ≤ α. Therefore,

ê0(f*_α) − ((K + 1)/K) e0(f*_α) > ((K + 1)/K) α + ε0 − ((K + 1)/K) e0(f*_α) ≥ ((K + 1)/K) α + ε0 − ((K + 1)/K) α = ε0.

This implies C ⊂ {Z^n : ê0(f*_α) − ((K + 1)/K) e0(f*_α) > ε0} ⊂ Ω0.

For the third term, if Z^n ∈ Θ0, then e0(f̂^K_{α,n}) > ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) ε0, while by the definition of f̂^K_{α,n} we have ê0(f̂^K_{α,n}) ≤ ((K + 1)/K) α + ε0. Therefore,

e0(f̂^K_{α,n}) − (K/(K − 1)) ê0(f̂^K_{α,n}) > ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) ε0 − (K/(K − 1))(((K + 1)/K) α + ε0) = ε0.

So Θ0 ⊂ Ω0, which completes the proof of the lemma.
Appendix D Proof of Theorem 9

Let

Θ0 = {Z^n : e0(f̂_{α,n}) − ((K + 1)/(K − 1)) α > ((2K − 1)/(K − 1)) ε0(n_0, K(n), x_{K(n),0})};

Θ1 = {Z^n : e1(f̂_{α,n}) − e*_{1,α} > inf_{1 ≤ k ≤ K(n)} [((2K − 1)/(K − 1)) ε1(n_1, k, x_{k,1}) + ((K + 1)/(K − 1)) e1(f*_{α,k}) − e*_{1,α}]};

Ω^k_0 = {Z^n : V(0, F^k) > ε0(n_0, k, x_{k,0}) or W(0, F^k) > ε0(n_0, k, x_{k,0})};

Ω^k_1 = {Z^n : V(1, F^k) > ε1(n_1, k, x_{k,1}) or W(1, F^k) > ε1(n_1, k, x_{k,1})};

Θ^k_0 = {Z^n : e0(f̂^k_{α,n}) − ((K + 1)/(K − 1)) α > ((2K − 1)/(K − 1)) ε0(n_0, k, x_{k,0})};

Θ^k_1 = {Z^n : e1(f̂^k_{α,n}) − ((K + 1)/(K − 1)) e1(f*_{α,k}) > ((2K − 1)/(K − 1)) ε1(n_1, k, x_{k,1})}.
Before giving the proof of Theorem 9, let us prove a lemma.

Lemma 6 Let x_{k,j} > 0 and suppose P(Ω^k_j) ≤ e^{−x_{k,j}} for j = 0, 1. Then, for any f ∈ F^k, we have

E[e_j(f) − (K/(K − 1)) ê_j(f) − ε_j(n_j, k, x_{k,j})] ≤ e^{−x_{k,j}}

and

E[ê_j(f) − ((K + 1)/K) e_j(f) − ε_j(n_j, k, x_{k,j})] ≤ e^{−x_{k,j}}.
Proof We only prove the lemma for j = 1; the case j = 0 follows in the same way. For any f ∈ F^k, we have

P(e1(f) − (K/(K − 1)) ê1(f) − 651K r*_{k,1} − 44(K + 1) x_{k,1}/n_1 > 0) ≤ e^{−x_{k,1}}.

We can always take x_{k,1} such that e1(f) − (K/(K − 1)) ê1(f) − 651K r*_{k,1} − 44(K + 1) x_{k,1}/n_1 ≤ 1, and then

E[e1(f) − (K/(K − 1)) ê1(f) − 651K r*_{k,1} − 44(K + 1) x_{k,1}/n_1]
≤ P(e1(f) − (K/(K − 1)) ê1(f) − 651K r*_{k,1} − 44(K + 1) x_{k,1}/n_1 > 0) ≤ e^{−x_{k,1}}.
Proof of Theorem 9 For the NP-SRM defined above, by Theorem 4 we have, for all k ∈ {1, 2, . . . , K(n)}, x_{k,0} > 0 and x_{k,1} > 0,

P(Ω^k_0) ≤ e^{−x_{k,0}} and P(Ω^k_1) ≤ e^{−x_{k,1}}.

Notice that, by Lemma 2, Θ^k_0 ⊂ Ω^k_0 for any k, and hence P(Θ^k_0) ≤ P(Ω^k_0) ≤ e^{−x_{k,0}}. Since f̂^k_{α,n} ∈ F^k,

P(Θ^k_0) = P(e0(f̂^k_{α,n}) > ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) ε0(n_0, k, x_{k,0})) ≤ e^{−x_{k,0}}.

Arguing as in the proof of Lemma 6, we have

E e0(f̂_{α,n}) ≤ ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) E ε0(n_0, k, x_{k,0}) + e^{−x_{k,0}}
≤ ((K + 1)/(K − 1)) α + ((2K − 1)/(K − 1)) E ε0(n_0, K(n), x_{K(n),0}) + e^{−x_{K(n),0}}.   (8)
For any k ∈ {1, 2, . . . , K(n)}, let C_k = ε1(n_1, k, x_{k,1}). We note that

e1(f̂_{α,n}) − e*_{1,α} = e1(f̂_{α,n}) − (K/(K − 1)) ê1(f̂_{α,n}) − C_k + (K/(K − 1))(ê1(f̂_{α,n}) + ((K − 1)/K) C_k) − e*_{1,α}.

Taking expectations, we obtain

E e1(f̂_{α,n}) − e*_{1,α}
= (K/(K − 1)) E(ê1(f̂_{α,n}) + ((K − 1)/K) C_k) + E(e1(f̂_{α,n}) − (K/(K − 1)) ê1(f̂_{α,n}) − C_k) − e*_{1,α}
≤ (K/(K − 1)) E(ê1(f̂^k_{α,n}) + ((K − 1)/K) C_k) + E sup_k (e1(f̂^k_{α,n}) − (K/(K − 1)) ê1(f̂^k_{α,n}) − C_k) − e*_{1,α}
≤ (K/(K − 1)) E(ê1(f̂^k_{α,n})) + E C_k + Σ_k E(e1(f̂^k_{α,n}) − (K/(K − 1)) ê1(f̂^k_{α,n}) − C_k) − e*_{1,α}.   (9)
Here we used the definition of NP-SRM.

Next we show that F^k_0 ⊂ F̂^k_0 holds on (Ω^k_0)^c, where F^k_0 = {f ∈ F^k : e0(f) ≤ α} and F̂^k_0 = {f ∈ F^k : ê0(f) ≤ ((K + 1)/K) α + ε0(n_0, k, x_{k,0})}. In fact, if Z^n ∉ Ω^k_0, then for all f ∈ F^k_0,

ê0(f) ≤ sup_{f∈F^k} (ê0(f) − ((K + 1)/K) e0(f)) + ((K + 1)/K) e0(f)
≤ ((K + 1)/K) α + ε0(n_0, k, x_{k,0}).

Therefore ê1(f̂^k_{α,n}) ≤ ê1(f*_{α,k}) on (Ω^k_0)^c. Then Lemma 6 tells us

E(ê1(f̂^k_{α,n})) = E(ê1(f̂^k_{α,n}) 1_{(Ω^k_0)^c}) + E(ê1(f̂^k_{α,n}) 1_{Ω^k_0})
≤ E(ê1(f*_{α,k})) + P(Ω^k_0)
≤ E(ê1(f*_{α,k})) + e^{−x_{k,0}}.   (10)
Here the first inequality follows from the fact that the loss function is an indicator function, so ê1 ≤ 1. Since f*_{α,k} ∈ F^k, we have

P(ê1(f*_{α,k}) > ((K + 1)/K) e1(f*_{α,k}) + C_k) ≤ P(Ω^k_1) ≤ e^{−x_{k,1}}.

Using Lemma 6 and (10), we obtain

E(ê1(f̂^k_{α,n})) ≤ ((K + 1)/K) e1(f*_{α,k}) + E C_k + e^{−x_{k,0}} + e^{−x_{k,1}}.   (11)

Since f̂^k_{α,n} ∈ F^k, by Lemma 6 we also have

E(e1(f̂^k_{α,n}) − (K/(K − 1)) ê1(f̂^k_{α,n}) − C_k) ≤ e^{−x_{k,1}}.   (12)
Finally, substituting (11) and (12) into (9), we obtain

E e1(f̂_{α,n}) − e*_{1,α} ≤ ((K + 1)/(K − 1)) e1(f*_{α,k}) − e*_{1,α} + ((2K − 1)/(K − 1)) E C_k + e^{−x_{k,0}} + e^{−x_{k,1}} + Σ_k e^{−x_{k,1}}
≤ ((K + 1)/(K − 1)) e1(f*_{α,k}) − e*_{1,α} + ((2K − 1)/(K − 1)) E C_k + Σ_k e^{−x_{k,0}} + 2Σ_k e^{−x_{k,1}}.   (13)

Taking x_{k,0} = 2 log(n_0 k) and x_{k,1} = 2 log(n_1 k) and substituting them into (8) and (13), we complete the proof.
References
[1] Lehmann, E.: Testing Statistical Hypotheses, New York: Wiley, 1986
[2] Vapnik, V.: The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995
[3] Cannon, A., Howse, J., Hush, D., Scovel, C.: Learning with the Neyman–Pearson and min-max criteria, Tech. Rep. LA-UR 02-2951, Los Alamos National Laboratory, 2002
[4] Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition, New York: Springer, 1996
[5] Vapnik, V., Chervonenkis, A.: Theory of Pattern Recognition, Moscow: Nauka, 1974
[6] Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 264–280 (1971)
[7] Cannon, A., Howse, J., Hush, D., Scovel, C.: Simple classifiers, Tech. Rep. LA-UR 03-0193, Los Alamos National Laboratory, 2003
[8] Scott, C., Nowak, R.: A Neyman–Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11), 3806–3819 (2005)
[9] Vapnik, V.: Estimation of Dependencies Based on Empirical Data, New York: Springer-Verlag, 1982
[10] Scott, C.: Performance measures for Neyman–Pearson classification, Preprint, 2005
[11] Pollard, D.: Convergence of Stochastic Processes, New York: Springer-Verlag, 1984
[12] Koltchinskii, V.: Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5), 1902–1914 (2001)
[13] Bartlett, P. L., Boucheron, S., Lugosi, G.: Model selection and error estimation. Machine Learning, 48, 85–113 (2002)
[14] Mendelson, S.: Rademacher averages and phase transitions in Glivenko–Cantelli classes. IEEE Transactions on Information Theory, 48(1), 251–263 (2002)
[15] Bartlett, P. L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482 (2002)
[16] Bartlett, P. L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Annals of Statistics, 33(4), 1497–1537 (2005)
[17] Bousquet, O., Koltchinskii, V., Panchenko, D.: Some local measures of complexity of convex hulls and generalization bounds. In: Kivinen, J., Sloan, R. H. (eds.), Proceedings of the 15th Annual Conference on Computational Learning Theory, 2002, 59–73
[18] Lugosi, G., Wegkamp, M.: Complexity regularization via localized random penalties. Annals of Statistics, 32(4), 1679–1697 (2004)
[19] Bousquet, O.: Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms, Ph.D. thesis, 2002