Data-derived pointwise consistency
Venkat Anantharam, University of California, Berkeley
Challenges for Optimization in Data Science, Université Pierre et Marie Curie – Paris 6
July 1–2, 2015
Joint work with Narayana Prasad Santhanam, University of Hawaii, Manoa
Why is auto insurance mandatory?
Actually, it is not (in the U.S.): it is left to the individual states.
When mandated, perhaps insurance is meant to save you from damage to self?
Not really: Liability
What makes the insurance industry tick?
Aggregation of risk
Insurer and insured: the insurer compensates losses; the insured pays premiums.
How to set the premiums?
Probabilistic approaches [Lundberg, Cramér, ...]
Parametric: assume there is enough data to build a realistic model for the distribution of losses.
The insurer starts with some initial capital; losses with the known distribution occur at the times of some point process; premiums are set based on the model. Avoiding bankruptcy is certainly the most basic goal.
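To make the parametric picture concrete, here is a minimal Monte Carlo sketch of a Cramér–Lundberg-style ruin model; the exponential losses, rates, and safety loading are illustrative assumptions, not parameters from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def ruin_probability(u0, c, lam, horizon, trials):
    """Estimate the probability of ruin before `horizon` when capital starts at u0,
    premiums come in at rate c, and claims arrive as a Poisson(lam) process with
    i.i.d. Exponential(mean 1) losses."""
    ruined = 0
    for _ in range(trials):
        t, capital = 0.0, u0
        while t < horizon:
            dt = rng.exponential(1.0 / lam)   # time until the next claim
            t += dt
            capital += c * dt                 # premiums collected in the meantime
            capital -= rng.exponential(1.0)   # pay the claim
            if capital < 0:
                ruined += 1
                break
    return ruined / trials

# Mean loss 1, claim rate 1, premium rate 1.2 (a 20% safety loading).
print(ruin_probability(u0=5.0, c=1.2, lam=1.0, horizon=1_000.0, trials=2_000))
```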
How we got started on this work
The availability of insurance enables risk taking and so is a driver of innovation.
So one should encourage the emergence of an insurance industry.
But there are markets where there is not enough data to build a reliable loss model, e.g. cyberinsurance.
Our approach to the problem
The insurer adopts a nonparametric model for losses.
The loss distribution is only assumed to be a member of a class P of loss distributions.
The premium-setting strategy should work universally over loss distributions in this class.
We ask: which classes P are insurable?
Nonparametric Approach to Insurance
Insurer and insured: the insurer compensates losses; the insured pays premiums.
The insurer charges premiums, building up reserves $\Phi_{t+1}, \Phi_{t+2}, \ldots$; the insured incurs losses $X_1, X_2, \ldots$
Entry time: making sense of “universality”
Losses $X_1, X_2, \ldots, X_t, X_{t+1}, X_{t+2}, \ldots \sim p \in P$. After observing $X_1, \ldots, X_t$, the scheme enters and charges premiums $\Phi_{t+1}, \Phi_{t+2}, \ldots$ (payments flow both ways: the insured pays the premiums, the insurer pays out the losses $X_{t+1}, X_{t+2}, \ldots$).
If at any step $j$, $\Phi_j - X_j \le 0$: the scheme is bankrupt.
Require: for every $p \in P$, the scheme must enter the game almost surely.
$X_i \in \mathbb{N}$ (no bound on $X_i$); the $X_i$ are independent.
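The protocol above can be phrased as a simple game loop. The following toy Python sketch is my own illustration; the premium rule shown (a margin over the largest loss seen so far) is just a placeholder, not the scheme constructed later in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_scheme(losses, entry_time, premium_rule):
    """Observe losses[:entry_time], then at each later step charge a premium based
    on the past and pay out the current loss; report bankruptcy if it occurs."""
    for j in range(entry_time, len(losses)):
        phi = premium_rule(losses[:j])        # premium may depend on all past losses
        if phi - losses[j] <= 0:              # Phi_j - X_j <= 0: the scheme is bankrupt
            return f"bankrupt at step {j}"
    return "survived"

# Placeholder premium rule: twice the largest loss seen so far, plus one.
premium_rule = lambda past: 2 * int(past.max()) + 1

losses = rng.geometric(p=0.3, size=10_000)    # one possible loss law p on the naturals
print(run_scheme(losses, entry_time=50, premium_rule=premium_rule))
```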
Insurability
Want: for all $\eta > 0$, there exists a premium scheme $\Phi$ such that for all $p \in P$,
$$p(\text{bankruptcy}) \le \eta .$$
Conversely, if there is $\eta > 0$ such that all schemes fail (on some $p \in P$) with probability $\ge \eta$, then $P$ is not insurable.
Insurability as a form of learning
Generative model for data; a tail probability learning problem.
The equivalence of insurability to tail probability learning is not obvious. It is a consequence of the proof of our characterization of which model classes P are insurable.
Deceptive distributions
Let
$$J(p, q) := D\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + D\!\left(q \,\Big\|\, \frac{p+q}{2}\right),$$
where
$$D(p\|q) := \sum_i p(i) \log \frac{p(i)}{q(i)} .$$
A probability distribution $p \in P$ is called deceptive (for tails) if: for all $\epsilon > 0$, there exists $\delta > 0$ such that for all $x \in \mathbb{R}$, there exists $q \in P$ with
$$J(p, q) \le \epsilon \quad \text{and} \quad F_q^{-1}(1-\delta) > x .$$
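To see the mechanism in the definition, here is a small Python sketch (my own illustration, not from the talk): take $p$ to be the point mass at loss 0 and let $q$ put a small mass $\epsilon$ on an arbitrarily large loss $x$. Then $J(p, q) \to 0$ as $\epsilon \to 0$, no matter how large $x$ is, while the $(1-\delta)$-quantile of $q$ equals $x$ for every $\delta < \epsilon$; this is exactly the effect the finite-support construction later exploits.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits over a common finite support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def J(p, q):
    """J(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2)."""
    m = (p + q) / 2
    return kl(p, m) + kl(q, m)

# Two atoms: loss 0 and some huge loss x (only the probabilities enter J).
for eps in (0.1, 0.01, 0.001):
    p = np.array([1.0, 0.0])              # point mass on loss 0
    q = np.array([1.0 - eps, eps])        # mass eps on the huge loss x
    print(f"eps={eps:<6} J(p, q)={J(p, q):.5f}")
# J(p, q) shrinks with eps even though F_q^{-1}(1 - delta) = x for any delta < eps.
```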
Necessary and sufficient conditions for insurability
Consider an $\epsilon$-neighborhood around $p \in P$, namely
$$\left\{ q \in P : D\!\left(p \,\Big\|\, \tfrac{p+q}{2}\right) + D\!\left(q \,\Big\|\, \tfrac{p+q}{2}\right) \le \epsilon \right\} .$$
Consider the $1-\delta$ percentile.
Is the sup (over all $q$ in the neighborhood) of $q$'s $1-\delta$ percentile bounded?
Is the above statement true for all $\delta$?
Theorem: $P$ is insurable iff there exists a neighborhood around every $p \in P$ such that the above two conditions are true.
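Restated in symbols (my paraphrase of the two bulleted conditions), the theorem says:
$$P \text{ is insurable} \iff \forall p \in P\ \exists \epsilon_p > 0:\ \sup\bigl\{ F_q^{-1}(1-\delta) : q \in P,\ J(p,q) \le \epsilon_p \bigr\} < \infty \ \text{ for every } \delta > 0 ,$$
which is precisely the statement that no $p \in P$ is deceptive for tails.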
Data-derived pointwise consistency
Pointwise consistency has two flavors: (i) where we cannot tell from the data whether we are doing well; (ii) where we can.
Data-derived pointwise consistency:
In addition to a pointwise consistent estimator, for any given confidence $1-\eta$ there is a stopping rule that stops, universally for all $p \in P$, after the estimator starts doing well with confidence $> 1-\eta$.
Examples
If P is:
The set of all finite support distributions, then it is not insurable.
The set of all uniform distributions, then it is insurable.
The set of all monotone distributions (with finite entropy), then it is not insurable.
The set of all distributions with a uniformly bounded mean, then it is insurable.
The set of all distributions with a finite mean, then it is not insurable.
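A quick way to see why the uniformly-bounded-mean class meets the quantile condition of the theorem (my own remark, not from the slides): if $E_p[X] \le M$ for every $p$ in the class, then Markov's inequality bounds all the $1-\delta$ percentiles at once,
$$p(X \ge x) \le \frac{M}{x} \quad \Longrightarrow \quad F_p^{-1}(1-\delta) \le \frac{M}{\delta} \quad \text{for every such } p ,$$
whereas the finite-mean class has no such uniform bound.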
Idea behind the necessity of the condition
Take $P$ to be the set of all finite support distributions. $P$ is not insurable.
We show this by proving that all schemes fail with probability $\ge 1 - \frac{2}{e}$.
Note that $p_0$ and $p_{L,\epsilon}$, for all $L \ge 1$ and $\epsilon > 0$, are in $P$ (here $p_0$ is the point mass at loss 0, and $p_{L,\epsilon}$ puts probability $\epsilon$ on the loss $L$ and $1-\epsilon$ on 0).
The insurer has to enter after observing some number, say $N$, of losses with value 0 (if not, the insurer will never enter, w.p. 1, if $p_0$ is in force, which is not allowed).
All finite support distributions
The sequence of losses is $0^N\, 0^m\, ?\, \cdots \sim p_{L,\epsilon} \in P$: the scheme observes $0^N$, enters, and then charges the premium $\Phi_{N+m}$ after $m$ further zero losses.
Fix $M = \lceil (e-1)N \rceil$. Let $L > \max\{\Phi_{N+1}, \ldots, \Phi_{N+M}\}$.
The scheme gets bankrupted on each of $0^N L,\ 0^{N+1} L,\ \ldots,\ 0^{N+M-1} L$.
$$p_{L,\frac{1}{Ne}}(\text{bankruptcy}) \;\ge\; \left(1-\tfrac{1}{Ne}\right)^{N} - \left(1-\tfrac{1}{Ne}\right)^{N+M} \;\ge\; 1 - \tfrac{N}{Ne} - \left(1-\tfrac{1}{Ne}\right)^{N+M} \;\ge\; 1 - \tfrac{1}{e} - \left(1-\tfrac{1}{Ne}\right)^{N+(e-1)N} \;=\; 1 - \tfrac{1}{e} - \left(1-\tfrac{1}{Ne}\right)^{Ne} \;\ge\; 1 - \tfrac{1}{e} - \tfrac{1}{e} .$$
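A quick numerical sanity check of the displayed chain (my own, not part of the talk): for $M = \lceil (e-1)N \rceil$, the first expression already exceeds $1 - \frac{2}{e} \approx 0.264$ for every $N$.

```python
import math

def lower_bound(N):
    """First expression in the chain, with M = ceil((e-1) * N)."""
    M = math.ceil((math.e - 1) * N)
    x = 1.0 - 1.0 / (N * math.e)
    return x**N - x**(N + M)

for N in (1, 10, 100, 1000, 10**6):
    print(N, round(lower_bound(N), 4), ">=", round(1 - 2 / math.e, 4))
```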
Sketch of proof of sufficiency
Every $p \in P$ has a neighborhood (reach) $\epsilon_p$ such that
for all percentiles $1-\delta$, there is an upper bound on the $1-\delta$ percentile of all $q \in P$ in the neighborhood.
When $p \in P$ is in effect, the insurer observes the type (empirical distribution) of the losses to decide what to do.
Good traps
Types captured by $p'$ will not harm $p$ if $p$ is within the reach of $p'$.
Bad traps
$p$ generates types captured by hostile $p'$, but such types have low probability. Putting these together gives a constructive proof for the sufficiency condition.
Proof of sufficiency: two lemmas
For any two probability distributions $p$ and $q$ on $\mathbb{N}$, we have
$$\frac{1}{4 \ln 2}\, |p - q|_1^2 \;\le\; J(p, q) \;\le\; \frac{1}{\ln 2}\, |p - q|_1 .$$
For any three probability distributions $p, q, r$ on $\mathbb{N}$, we have
$$J(p, q) + J(q, r) \;\ge\; \frac{\ln 2}{8}\, J(p, r)^2 .$$
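Both inequalities are easy to spot-check numerically; the following Python sketch (my own) samples random triples of distributions on a common finite support and verifies the two lemmas, with $J$ computed in bits as above.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def J(p, q):
    m = (p + q) / 2
    return kl(p, m) + kl(q, m)

def l1(p, q):
    return float(np.abs(p - q).sum())

for _ in range(10_000):
    p, q, r = rng.dirichlet(np.ones(8), size=3)
    assert l1(p, q) ** 2 / (4 * np.log(2)) <= J(p, q) <= l1(p, q) / np.log(2) + 1e-12
    assert J(p, q) + J(q, r) >= (np.log(2) / 8) * J(p, r) ** 2 - 1e-12
print("both inequalities held on all random triples")
```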
Data compression
Generative model for data:
Want to find a measure $q$ on infinite sequences of natural numbers and measure its performance against $p \in P$ via $\frac{1}{n} \log \frac{p(X^n)}{q(X^n)}$ at time $n$.
Strong and weak compressibility
$P$ is called strongly compressible if there exists $q$ such that
$$\limsup_{n \to \infty}\ \sup_{p \in P}\ \frac{1}{n} E_p \log \frac{p(X^n)}{q(X^n)} = 0 .$$
This is a rephrasing of the requirement that the compression redundancy go to zero asymptotically, uniformly over distributions in the class.
$P$ is called weakly compressible if there exists $q$ such that for all $p \in P$,
$$\limsup_{n \to \infty}\ \frac{1}{n} E_p \log \frac{p(X^n)}{q(X^n)} = 0 .$$
That this boils down to the requirement that the compression redundancy go to zero asymptotically for each distribution in the class is proved by Kieffer (1978).
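As a small illustration of weak compressibility (my own example, not from the talk), take $P$ to be the uniform distributions on $\{1,\ldots,k\}$ and let $q$ be the mixture code $q(x^n) = \sum_j w_j\, j^{-n}\, \mathbf{1}\{x^n \subseteq \{1,\ldots,j\}\}$ with weights $w_j = 6/(\pi^2 j^2)$. For each fixed $k$, the per-symbol redundancy is at most $\frac{1}{n}\log_2 \frac{1}{w_k} \to 0$, but the bound degrades as $k$ grows, consistent with the class being weakly but not strongly compressible.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_log2_prob(xs):
    """log2 q(x^n) for the mixture over 'i.i.d. uniform on {1..j}' codes."""
    n, top = len(xs), int(xs.max())
    js = np.arange(top, top + 2000, dtype=float)   # truncate the tail of the mixture
    w = 6.0 / (np.pi ** 2 * js ** 2)
    return float(np.log2(np.sum(w * js ** (-n))))

k, n = 16, 200
xs = rng.integers(1, k + 1, size=n)                # i.i.d. uniform on {1..k}
log2_p = -n * np.log2(k)
per_symbol_redundancy = (log2_p - mixture_log2_prob(xs)) / n
print(per_symbol_redundancy, "<=", np.log2(np.pi ** 2 * k ** 2 / 6) / n)
```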
Data-derived weak compressibility
Fix a probability distribution $q$ on sequences of natural numbers. Given $\delta > 0$, a stopping time $\tau$ of the sequence $X_1, X_2, \ldots$ is called $\delta$-premature at the sequence $(x_1, \ldots, x_n)$ for the probability distribution $p \in P$ if there exists some $1 \le l \le n$ such that $\tau(x_1, \ldots, x_l) = 1$ and $\frac{1}{l} E_p \log \frac{p(X^l)}{q(X^l)} > \delta$.
If there exists $q$ such that for all $\delta > 0$ and $\eta > 0$ there is a stopping time $\tau(\delta, \eta)$ such that for all $p \in P$ we have
$$p(\tau(\delta, \eta) \text{ is } \delta\text{-premature}) < \eta ,$$
then we say that $P$ is data-derived weakly compressible.
Reiterating the value of the data derived notion
Pointwise consistency has two flavors: (i) where we cannot tell from the data whether we are doing well; (ii) where we can.
Data-derived pointwise consistency:
In addition to a pointwise consistent estimator, for any given confidence $1-\eta$ there is a stopping rule that stops, universally for all $p \in P$, after the estimator starts doing well with confidence $> 1-\eta$.
Characterizing data-derived weak compressibility
$p \in P$ is called deceptive (for compression) if
$$\lim_{\epsilon \to 0}\ \inf_q\ \limsup_{n \to \infty}\ \sup_{p' \in B(p,\epsilon)}\ \frac{1}{n} D_n(p' \| q) > 0 .$$
Theorem (with Wojciech Szpankowski and Aleksandr Kavcic): $P$ is data-derived weakly compressible iff no $p \in P$ is deceptive.
Examples
If P is:
The set of all uniform distributions, then it is d.w.c. but it is not strongly compressible.
The set of all distributions with a finite mean, then it is weakly compressible but it is not d.w.c.
The set of all monotone distributions with finite entropy, then it is weakly compressible but it is not d.w.c.
The set of all monotone distributions with finite entropy and with a uniformly bounded second moment of the self-information, i.e. $E_p[(\log \frac{1}{p(X)})^2] \le h < \infty$, then it is strongly compressible.
Proofs
Different from, but of the same flavor as, the proofs for the insurability theorem.
Insurability does not imply strong compressibility
The class of all finitely supported uniform distributions is insurable but is not strongly compressible.
Strong compressibility implies insurability
$$R := \inf_q\ \sup_{n \in \mathbb{N}}\ \sup_{p \in P}\ \log \frac{p(n)}{q(n)} < \infty ,$$
and it is achieved by the Shtarkov distribution
$$q^*(n) := \frac{\sup_{p \in P} p(n)}{\sum_{n' \ge 0} \sup_{p \in P} p(n')} .$$
Since
$$q^*(n) \ge \frac{p(n)}{2^R} \quad \text{for all } n \in \mathbb{N} ,$$
we have
$$\sum_{n \ge N} p(n) \le 2^R \sum_{n \ge N} q^*(n) \quad \text{for all } N \in \mathbb{N} ,$$
so for all $\delta > 0$ we have
$$F_p^{-1}(1-\delta) \le F_{q^*}^{-1}\!\left(1 - \frac{\delta}{2^R}\right) .$$
This implies that $P$ is insurable.
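Here is a tiny numerical sketch of this single-letter argument (my own; the toy class of truncated geometric laws is just for illustration): compute the Shtarkov distribution $q^*$, the redundancy $R$, and check the pointwise and tail comparisons used above.

```python
import numpy as np

support = np.arange(10)                          # toy alphabet {0,...,9}
P = [(1 - r) * r ** support / (1 - r ** 10) for r in (0.3, 0.5, 0.7)]

sup_p = np.max(np.vstack(P), axis=0)             # pointwise sup over the class
q_star = sup_p / sup_p.sum()                     # Shtarkov (normalized maximum likelihood)
R = float(np.log2(sup_p.sum()))                  # worst-case redundancy, in bits

for p in P:
    assert np.all(q_star >= p / 2 ** R - 1e-12)           # q*(n) >= p(n) / 2^R
    p_tail = p[::-1].cumsum()[::-1]                        # sum_{n >= N} p(n)
    q_tail = q_star[::-1].cumsum()[::-1]                   # sum_{n >= N} q*(n)
    assert np.all(p_tail <= 2 ** R * q_tail + 1e-12)       # tail comparison
print("R =", round(R, 3), "bits; all checks passed")
```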
Weak compressibility does not imply insurability
The class of all finite mean distributions is not insurable but is weakly compressible.
Insurability does not imply weak compressibility
Partition $\mathbb{N}$ as
$$\{1, 2\} \cup \{3, 4, 5, 6\} \cup \{7, 8, 9, 10, 11, 12, 13, 14\} \cup \ldots$$
Let $I$ be the class of all probability distributions where the $i$-th partition element is picked with probability $\frac{6}{\pi^2 i^2}$ and then an integer in this set is picked uniformly at random.
For all $p \in I$ and $\delta > 0$ we have $F_p^{-1}(1-\delta) \le 2^{k(\delta)} - 2$, where $k(\delta)$ is the smallest integer satisfying $\delta > \frac{6}{\pi^2} \sum_{i \ge k(\delta)} \frac{1}{i^2}$, so $I$ is insurable.
However, for every probability distribution $q$ on $\mathbb{N}$ and every $i \ge 1$, there is an integer $n$ in the $i$-th set of the partition with $q(n) \le \frac{1}{2^i}$, so there is some $p \in I$ for which
$$\sum_{n \in \mathbb{N}} p(n) \log \frac{1}{q(n)} = \infty .$$
The class B of probability distributions
Partition $\mathbb{N}$ as $\bigcup_{i \ge 0} T_i$, where
$$T_i = \{2^i, \ldots, 2^{i+1} - 1\} .$$
Given $\epsilon > 0$, let $n_\epsilon := \lceil \frac{1}{\epsilon} \rceil$, and for each $1 \le j \le 2^{n_\epsilon}$, let $p_{\epsilon,j}$ be the probability distribution assigning probability $1-\epsilon$ to $T_0$ and $\epsilon$ to the $j$-th smallest element of $T_{n_\epsilon}$.
$B$ is comprised of the $p_{\epsilon,j}$ and the distribution $p_0$, which assigns probability 1 to $T_0$.
Then $B$ is insurable, weakly compressible, but not d.w.c.
PAC learning model
Assume $\mathcal{X} = \mathbb{N}$ and $\mathcal{Y} = \{0, 1\}$ for concreteness. Let $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$.
Training samples:
$$z = (z_1, \ldots, z_n) = ((x_1, y_1), \ldots, (x_n, y_n)) .$$
$\mathcal{H}$: a class of hypotheses (maps from $\mathcal{X}$ to $\mathcal{Y}$).
$\mathrm{er}_P(h) := P(h(X) \ne Y)$.
$\mathrm{opt}_P(\mathcal{H}) := \inf_{g \in \mathcal{H}} \mathrm{er}_P(g)$.
$\mathrm{er}_z(h) := \frac{1}{n} |\{i : h(x_i) \ne y_i\}|$.
$L : \cup_{n \ge 1} \mathcal{Z}^n \mapsto \mathcal{H}$: a learning algorithm.
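A minimal Python sketch of these objects (my own; the threshold hypothesis class and the particular $P$ are hypothetical choices for illustration): $\mathcal{H}$ is the class of thresholds $h_t(x) = \mathbf{1}\{x \ge t\}$, and $L$ is empirical (sample) error minimization.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(t, x):
    """Threshold hypothesis h_t(x) = 1{x >= t}."""
    return (x >= t).astype(int)

def emp_err(t, xs, ys):
    """er_z(h_t): fraction of training samples the threshold rule gets wrong."""
    return float(np.mean(h(t, xs) != ys))

def erm(xs, ys, thresholds):
    """A learning algorithm L: return the threshold minimizing empirical error."""
    return min(thresholds, key=lambda t: emp_err(t, xs, ys))

# A particular P: X uniform on {1,...,20}, Y = 1{X >= 12} with labels flipped w.p. 0.1,
# so opt_P(H) = 0.1, attained at the threshold t = 12.
n = 500
xs = rng.integers(1, 21, size=n)
ys = (xs >= 12).astype(int)
flip = rng.random(n) < 0.1
ys[flip] = 1 - ys[flip]

t_hat = erm(xs, ys, thresholds=range(1, 22))
print("learned threshold:", t_hat, " empirical error:", emp_err(t_hat, xs, ys))
```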
PAC learning
Suppose that for all $0 < \epsilon < 1$ and all $0 < \delta < 1$, there is $n_0(\epsilon, \delta) \ge 1$ such that, for all $n \ge n_0(\epsilon, \delta)$ and all $P$, we have
$$P\!\left(\mathrm{er}_P(L(Z^n)) < \mathrm{opt}_P(\mathcal{H}) + \epsilon\right) \ge 1 - \delta .$$
Then we say that the learning algorithm $L$ PAC learns the hypothesis class $\mathcal{H}$.
Note the requirement of uniformity over the (i.i.d.) probability distribution driving the data.
VC dimension
The VC dimension of $\mathcal{H}$ is the supremum of the sizes of subsets of $\mathcal{X}$ shattered by $\mathcal{H}$.
There is a learning algorithm that learns $\mathcal{H}$ iff it has finite VC dimension. In fact, sample error minimization algorithms will then learn $\mathcal{H}$.
Data-derived pointwise PAC learning
A stopping time $\tau$ of the sequence $z_1, z_2, \ldots$ is called $\epsilon$-premature for the generating distribution $P$ and the sample sequence $z = (z_1, \ldots, z_n)$ if for some $k \le n$ we have
$$\tau(z_1, \ldots, z_k) = 1 \quad \text{and} \quad \mathrm{er}_P(L(z_1, \ldots, z_k)) \ge \mathrm{opt}_P(\mathcal{H}) + \epsilon .$$
Suppose that for all $0 < \epsilon < 1$ and all $0 < \delta < 1$, there is a stopping time $\tau(\epsilon, \delta)$ of the sequence $z_1, z_2, \ldots$ such that for all $P$ we have
$$P(\tau(\epsilon, \delta) \text{ is } \epsilon\text{-premature}) \le \delta .$$
Then we say the learning algorithm $L$ learns $\mathcal{H}$ in the data-derived pointwise PAC learning sense.
Note that we have given the learning algorithm the opportunity to declare that learning has taken place at a time that can depend on the generating distribution (through the data itself).
The question of interest for PAC learning
For $\mathcal{X} = \mathbb{N}$ and $\mathcal{Y} = \{0, 1\}$ as above, in fact the set of all hypotheses is data-derived pointwise PAC learnable.
This can be (vaguely) understood as coming from the "essential finiteness" of the support of each probability distribution on $\mathbb{N}$.
Question: Is data-derived pointwise PAC learnability of a hypothesis class characterized by finiteness of "local" complexity notions such as "local VC dimension", "local Rademacher complexity", "local Gaussian complexity", etc.?
Proof of sufficiency: preliminaries
Enumerate the finite subsets of $\mathbb{N}$. Let $i(A)$ denote the number associated to $A \subset \mathbb{N}$.
Since every $p \in P$ is not deceptive, it has a reach, defined as an $\epsilon_p > 0$ such that for all $\delta > 0$ there exists $x_{p,\delta} \in \mathbb{R}$ such that for all $q \in P$ with $J(p, q) < \epsilon_p$ we have $F_q^{-1}(1-\delta) < x_{p,\delta}$.
For a finite sequence of nonnegative integers $\sigma$, its type, or empirical distribution, is denoted $q(\sigma)$. We will assume that this retains knowledge of the length of the sequence (unnormalized denominator).
Given $p$, let
$$I_p := \left\{ q : |p - q|_1 < \frac{\epsilon_p^2 (\ln 2)^2}{16} \right\} .$$
Proof of sufficiency: the scheme
Having observed $\sigma$, first find
$$P'_\sigma := \{ p' \in P : q(\sigma) \in I_{p'} \} .$$
Next, check whether
$$2^{-n \frac{\epsilon_{p'}^4 (\ln 2)^3}{512}} \;\le\; \frac{1}{(n+1)^{|A(\sigma)|}} \cdot \frac{\eta}{2^{i(A(\sigma))}\, 2 n^2} \cdot \frac{36}{\pi^4}$$
for any $p' \in P'_\sigma$.
Enter at the first time at which the answer to the preceding step is "yes", basically deciding that the underlying law is any such $p'$. If the true law is $p$, we will say that $p$ has been trapped by $p'$ on this realization. At time $n$ after entering we charge the premium $2 x_{p', \delta(n)/2} + 1$, where $\delta(n) := \frac{6\eta}{2\pi^2 n^2}$.
Proof of sufficiency: entering with probability 1
Suppose the true loss distribution is $p$.
We will eventually have $p \in P'_\sigma$ (eventually in $\sigma$).
Since $p$ is assumed to have finite alphabet, the empirical alphabet eventually stabilizes. Hence $i(A(\sigma))$ is eventually a constant. This means the second criterion is also eventually met, so eventually the observed sequence is trapped by some probability distribution in $P$.
Note: the only reason we need to assume finiteness of the range of each $p \in P$ is because we need the empirical alphabet to stabilize.
Proof of sufficiency: Good traps
If the trapping $p'$ is good, then at time $n$ after being trapped, the premium charged is $2 x_{p', \delta(n)/2} + 1$, where $\delta(n) := \frac{6\eta}{2\pi^2 n^2}$.
This is at least as big as $2 F_p^{-1}\!\left(1 - \frac{\delta(n)}{2}\right) + 1$.
So at time $n$ after entering, the loss bankrupts the scheme with probability at most $\delta(n) = \frac{6\eta}{2\pi^2 n^2}$.
Sum over $n \ge 1$ (using $\sum_{n \ge 1} \frac{1}{n^2} = \frac{\pi^2}{6}$): the overall probability of going bankrupt over observed loss sequences that get trapped in a good trap is at most $\frac{\eta}{2}$.
Proof of sufficiency: bad traps
If the trapping $p'$ is bad, and at the time of trapping the observed loss sequence was $\sigma$, of length $n = |\sigma|$, then, from the lemmas,
$$J(p, q(\sigma)) \ge \frac{\epsilon_{p'}^2 \ln 2}{16} .$$
Hence, from the lemmas,
$$D(q(\sigma) \| p) \ge \frac{\epsilon_{p'}^4 (\ln 2)^3}{512} .$$
Hence
$$2^{-n D(q(\sigma) \| p)} \le \frac{1}{(n+1)^{|A(\sigma)|}} \cdot \frac{\eta}{2^{i(A(\sigma))}\, 2 n^2} \cdot \frac{36}{\pi^4} .$$
There are at most $(n+1)^{|A(\sigma)|}$ types of length-$n$ sequences with alphabet $A(\sigma)$, and each type class has probability under $p$ of at most $2^{-n D(q(\sigma) \| p)}$. Summing over all $\sigma$ shows that the probability of falling into a bad trap is at most $\frac{\eta}{2}$.
Proof of sufficiency without finite support assumption
The main observation is that $P$ is Lindelöf in the $\ell_1$-topology, so the open covering of $P$ by the sets $\{I_p, p \in P\}$ has a countable subcovering.
We can index these by associating an index $i(p')$ to each $p'$ defining one such $I_{p'}$. Only such $p'$ will be allowed as traps.
As a proxy for the size term $(n+1)^{|A(\sigma)|}$, use the number $C_{p'}$ defined by
$$C_{p'} := 2^{\sup_{r \in B_{p'}} F_r^{-1}\left(1 - \frac{D_{p'}}{3}\right)} ,$$
where $B_{p'}$ is the set of distributions in $P$ within the reach of $p'$ and
$$D_{p'} := \frac{\epsilon_{p'}^4 (\ln 2)^3}{512} .$$
The rest of the proof is the same.
Proof of necessity without finite support assumption
$$H_{p,\gamma} := \left\{ y \in A_p : y \le F_p^{-1}\!\left(1 - \frac{\gamma}{2}\right) \right\} ,$$
where $A_p$ denotes the support of $p$, is called the head of $p$ at level $\gamma$.
Let $p \in P$ be deceptive. Given $\epsilon > 0$ and a scheme $\Phi$, for each $n \ge 1$ let
$$R_n := \{ x^n : \Phi(x^n) < \infty \} ,$$
as before.
Now, pick $N \ge 1$ such that
$$p(R_N) > 1 - \frac{\alpha}{2} ,$$
where $\alpha < 1 - \ln 2$ as in the lemma. Later $N \to \infty$.
Let
$$R_{p,\gamma,i} := \{ x^i \in R_i : A(x^i) \subseteq H_{p,\gamma},\ \text{entered by } N \} ,$$
where $A(x^n)$ denotes the set of values showing up in $x^n$.
Proof of necessity without finite support assumption
Pick $k \ge 2$ (depending on $\delta$) such that
$$\left(1 - \frac{1}{Nk}\right)^{N + \lceil \frac{1}{\delta} \rceil} \ge 1 - \frac{\alpha}{2} .$$
Let
$$c_{N,\delta} := \max_{x^i \in R_{p,\frac{1}{Nk},i},\ N \le i \le N + \lceil \frac{1}{\delta} \rceil} \Phi(x^i) .$$
Find $\delta > 0$ and $q \in P$ with $J(p, q) < \frac{1}{N^2}$ and $F_q^{-1}(1-\delta) > c_{N,\delta}$. This is possible because $p$ is deceptive.
Then, for all $N \le i \le N + \lceil \frac{1}{\delta} \rceil$, we have
$$q\!\left(R_{p,\frac{1}{Nk},i}\right) \ge 1 - \frac{2}{N} - 2 h(\alpha) ,$$
by the lemma.
The rest of the proof is as before.