
From Nonparametrics to Model-Free: a time series time line

Dimitris N. Politis, University of California, San Diego

1988-1989

- Stanford, 1988-1989.
- Idea: bootstrap confidence intervals for the spectral density.
- Murray: “What are you going to do about the bias?”



Davis, June 1989

- Week-long lectures by Murray on Stochastic curve estimation, University of California, Davis, June 1989.
- Stochastic Curve Estimation, NSF-CBMS Lecture Notes, 1991.
- Talk by Peter Hall: bootstrap for the probability density.
- Solution for the bias: undersmoothing!


Some background

- Data: $X_1, \ldots, X_n$ from a stationary, weakly dependent time series $\{X_t, t \in \mathbb{Z}\}$ with unknown mean $\mu = EX_t$ and (equally unknown) autocovariance $\gamma(k) = \mathrm{Cov}(X_t, X_{t+k})$.
- $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$: consistent and asymptotically efficient.
- $\sigma_n^2 = \mathrm{Var}(\sqrt{n}\,\bar{X}_n) = \sum_{s=-n}^{n}\big(1 - \frac{|s|}{n}\big)\gamma(s)$.
- Under regularity conditions,
  $\sigma_\infty^2 := \lim_{n\to\infty} \sigma_n^2 = \sum_{s=-\infty}^{\infty}\gamma(s) = 2\pi f(0),$
  where $f(w) = (2\pi)^{-1}\sum_{s=-\infty}^{\infty} e^{iws}\gamma(s)$, for $w \in [-\pi,\pi]$, is the spectral density function.
- Standard error estimation is nontrivial under dependence.


Periodogram

- Spectral density: $f(w) = (2\pi)^{-1}\sum_{s=-\infty}^{\infty} e^{iws}\gamma(s)$.
- Naive plug-in estimator: $T(w) = (2\pi)^{-1}\sum_{s=-\infty}^{\infty} e^{iws}\hat{\gamma}(s)$, where $\hat{\gamma}(s) = n^{-1}\sum_{t=1}^{n-|s|}(X_t - \bar{X}_n)(X_{t+|s|} - \bar{X}_n)$ (see the code sketch below).
- The periodogram $T(w)$ is inconsistent for $f(w)$:
  - $E\,T(w) = f(w) + O(1/n)$ for $w \neq 0$;
  - $\mathrm{Var}\,T(w) \simeq c_w \not\to 0$, where $c_w = f^2(w)\,(1 + 1\{w/\pi \in \mathbb{Z}\})$.
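The plug-in formula above translates directly into code. Below is a minimal sketch (an AR(1) toy series is assumed purely for illustration); the helper names sample_acov and periodogram are my own, and the lag-domain sum is just one of several equivalent ways to compute $T(w)$.

import numpy as np

def sample_acov(x, s):
    # gamma_hat(s) = n^{-1} * sum_{t=1}^{n-|s|} (X_t - Xbar_n)(X_{t+|s|} - Xbar_n)
    x = np.asarray(x, dtype=float)
    n, s = len(x), abs(s)
    xc = x - x.mean()
    return np.sum(xc[:n - s] * xc[s:]) / n

def periodogram(x, w):
    # T(w) = (2*pi)^{-1} * sum_{|s| < n} exp(i*w*s) * gamma_hat(s); real by symmetry
    n = len(x)
    g = np.array([sample_acov(x, s) for s in range(n)])
    s = np.arange(1, n)
    return (g[0] + 2.0 * np.sum(np.cos(w * s) * g[1:])) / (2.0 * np.pi)

# toy AR(1) series X_t = 0.5 X_{t-1} + Z_t, Z_t i.i.d. N(0,1)
rng = np.random.default_rng(0)
n = 400
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
print(periodogram(x, np.pi / 4), periodogram(x, np.pi / 2))

Re-running this with larger n illustrates the slide's point: the values hover around $f(w)$ but their spread does not shrink.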


Bartlett's blocking scheme

- Bartlett (1946): take the average of short periodograms.
- Blocking has been a core idea in time series analysis for 100 years, e.g., the big-block/small-block technique used to prove CLTs.
- Rosenblatt's (1957) CLT for strong mixing "models".
- A stationary series $\{X_t\}$ is strong mixing if $\alpha_X(k) \to 0$ as $k \to \infty$, where
  $\alpha_X(k) = \sup_{A,B} |P(A \cap B) - P(A)P(B)|$
  with $A \in \mathcal{F}_{-\infty}^{0}$, $B \in \mathcal{F}_{k}^{\infty}$, and $\mathcal{F}_{k}^{K} = \sigma\{X_t,\ k \le t \le K\}$.


Blocking schemes

Basic assumption: $b \to \infty$ but $b/n \to 0$ as $n \to \infty$. (A code sketch of the three schemes follows below.)

- Fully overlapping (number of blocks $q = n - b + 1$): $B_i = (X_i, X_{i+1}, \ldots, X_{i+b-1})$ for $i = 1, \ldots, n - b + 1$.
- Non-overlapping (number of blocks $Q = [n/b]$): $B_i = (X_{(i-1)b+1}, \ldots, X_{ib})$ for $i = 1, \ldots, Q$.
- Non-overlapping with 'buffer' (number of blocks $[n/(2b)]$): as above, but only every other block is kept, so consecutive retained blocks are separated by a discarded buffer block of length $b$.
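A minimal sketch of the three schemes as index arithmetic; the function names and the toy series are illustrative only.

import numpy as np

def overlapping_blocks(x, b):
    # fully overlapping: q = n - b + 1 blocks B_i = (X_i, ..., X_{i+b-1})
    n = len(x)
    return [x[i:i + b] for i in range(n - b + 1)]

def nonoverlapping_blocks(x, b, buffered=False):
    # non-overlapping: Q = floor(n/b) blocks; with buffered=True keep every other
    # block, so retained blocks are separated by a discarded buffer of length b
    n = len(x)
    blocks = [x[i * b:(i + 1) * b] for i in range(n // b)]
    return blocks[::2] if buffered else blocks

x = np.arange(1, 17)                                      # stands in for X_1, ..., X_16
print(len(overlapping_blocks(x, b=4)))                    # 13 = n - b + 1
print(len(nonoverlapping_blocks(x, b=4)))                 # 4 = floor(n/b)
print(len(nonoverlapping_blocks(x, b=4, buffered=True)))  # 2 = floor(n/(2b))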

Bartlett's spectral estimation scheme

- Consider one of the blocking schemes; for simplicity: non-overlapping.
- Let $T_i(w)$ be the periodogram calculated from block $B_i$:
  - $E\,T_i(w) = f(w) + O(1/b)$;
  - $\mathrm{Var}\,T_i(w) \simeq c_w$ as before.
- Define $\bar{T}(w) = Q^{-1}\sum_{i=1}^{Q} T_i(w)$:
  - $E\,\bar{T}(w) = f(w) + O(1/b)$;
  - $\mathrm{Var}\,\bar{T}(w) \simeq c_w/Q = c_w\,[b/n]$.
- If $b \to \infty$ but $b/n \to 0$, then $\bar{T}(w) \stackrel{P}{\longrightarrow} f(w)$.
- The same argument works for the overlapping scheme; only $c_w$ is different: 33% smaller.
- Alternative expression for Bartlett's estimator (see the code sketch below):
  $\bar{T}(w) \simeq (2\pi)^{-1}\sum_{s=-\infty}^{\infty} \lambda(s/b)\,\hat{\gamma}(s)\,e^{iws},$
  where $\lambda(x) = (1 - |x|)_+$ is the 'tent' function.
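A minimal sketch of the lag-window form of Bartlett's estimator with the 'tent' $\lambda$; the AR(1) example and the choice $b \approx n^{1/3}$ are illustrative, not prescriptive.

import numpy as np

def sample_acov(x, max_lag):
    # gamma_hat(s) for s = 0, ..., max_lag, each with divisor n
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.array([np.sum(xc[:n - s] * xc[s:]) / n for s in range(max_lag + 1)])

def bartlett_estimator(x, w, b):
    # T_bar(w) ~ (2*pi)^{-1} * sum_{|s| < b} (1 - |s|/b) * gamma_hat(s) * exp(i*w*s)
    g = sample_acov(x, b - 1)
    s = np.arange(1, b)
    tent = 1.0 - s / b                       # lambda(s/b) = (1 - |s|/b)_+
    return (g[0] + 2.0 * np.sum(tent * g[1:] * np.cos(w * s))) / (2.0 * np.pi)

rng = np.random.default_rng(1)
n = 1000
x = np.zeros(n)
for t in range(1, n):                        # AR(1) with phi = 0.5: f(0) = 1/(2*pi*(1-0.5)^2)
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
b = round(n ** (1 / 3))
print(bartlett_estimator(x, w=0.0, b=b), 1.0 / (2.0 * np.pi * 0.25))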


1989–1990

- Under the non-overlapping scheme, $T_1, T_2, \ldots$ are approximately independent.
- $\bar{T}(w) = Q^{-1}\sum_{i=1}^{Q} T_i(w)$ is the sample mean of $Q$ (approximately) i.i.d. variables: it can be bootstrapped!
- But overlapping blocks are more efficient; in that case, $T_1, T_2, \ldots$ are stationary and mixing.
- Block bootstrap for the sample mean of a mixing series: Künsch (1989) and Liu/Singh (1988, 1992). (A code sketch follows below.)
- 1990 Ph.D. thesis: block bootstrap on the sample mean of $T_1, \ldots, T_Q$. Tricky point: this is really a triangular array.
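A minimal sketch of an overlapping-block (moving block) bootstrap for the sample mean of a mixing series, in the spirit of Künsch (1989); the block length, the number of replicates B, and the AR(1) toy series are illustrative assumptions.

import numpy as np

def block_bootstrap_means(x, b, B=1000, rng=None):
    # resample k = ceil(n/b) overlapping blocks of length b with replacement,
    # concatenate, truncate to length n, and record the bootstrap sample mean
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = int(np.ceil(n / b))
    means = np.empty(B)
    for j in range(B):
        starts = rng.integers(0, n - b + 1, size=k)
        xs = np.concatenate([x[s:s + b] for s in starts])[:n]
        means[j] = xs.mean()
    return means

rng = np.random.default_rng(2)
n = 500
x = np.zeros(n)
for t in range(1, n):                        # AR(1) with phi = 0.5: sigma_infty^2 = 4
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
boot = block_bootstrap_means(x, b=round(n ** (1 / 3)), B=2000, rng=rng)
print(n * boot.var())                        # estimate of Var(sqrt(n)*Xbar_n) ~ 2*pi*f(0) = 4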


Self-reference

- Recall $\sigma_n^2 = \mathrm{Var}(\sqrt{n}\,\bar{X}_n) \to 2\pi f(0)$.
- Let $\sigma^2_{BB}$ (and $\sigma^2_{SUB}$) denote the block bootstrap (and subsampling; Politis/Romano, 1992) estimates of $\mathrm{Var}(\sqrt{n}\,\bar{X}_n)$.
- Up to edge effects, $\sigma^2_{BB}$ and $\sigma^2_{SUB}$ are tantamount to $2\pi\bar{T}(0)$, i.e., Bartlett's estimator of $2\pi f(0) \approx \mathrm{Var}(\sqrt{n}\,\bar{X}_n)$.
- How can you estimate the variance of $\bar{T}(0)$, the sample mean of $T_1(0), \ldots, T_Q(0)$?
- Blocks-of-blocks bootstrap: apply a spectral estimation technique to get the variance of a spectral estimator!


Bias revisited

- The Bartlett estimator has large bias.
- $\mathrm{Bias}\,\bar{T}(w) = O(1/b)$ and $\mathrm{Var}\,\bar{T}(w) = O(b/n)$.
- To minimize the MSE of $\bar{T}(w)$, choose $b \sim \mathrm{const}\cdot n^{1/3}$.
  - The minimized MSE is $O(n^{-2/3})$; the same holds for $\sigma^2_{BB}$ and $\sigma^2_{SUB}$.
  - But: we can achieve MSE $= O(n^{-4/5})$ with windows different from Bartlett's 'tent', e.g., the Parzen, Tukey, or Daniell windows.
  - We can even achieve MSE $= O(n^{-2r/(2r+1)})$ with windows of order $r$, i.e., having $r$ derivatives vanishing at the origin.
- Or: we can bias-correct the Bartlett estimator.



Bias corrected estimation

- Let $\hat{f}_b$ denote the Bartlett estimator with block size $b$: $\mathrm{Bias}\,\hat{f}_b \simeq c/b + o(1/b)$.
- Let $\hat{f}_{2b}$ denote the Bartlett estimator with block size $2b$: $\mathrm{Bias}\,\hat{f}_{2b} \simeq c/(2b) + o(1/b)$; relative to it, $\hat{f}_b$ is over-smoothed.
- The bias-corrected estimator $\tilde{f} = 2\hat{f}_{2b} - \hat{f}_b$ then has $\mathrm{Bias} = o(1/b)$, since the leading $c/b$ terms cancel. (A code sketch follows below.)
- Gained an order of magnitude? Actually, much more!
- $\mathrm{Bias}\,\tilde{f} = O(1/b^k)$, where $k$ is the number of derivatives of $f$.
- If $f$ has all derivatives, $\tilde{f}$ is $\sqrt{n}$-consistent (up to a log term).
- $\tilde{f}$ is tantamount to smoothing with a window of infinite order; in this case, a 'flat-top' window of trapezoidal shape.
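A minimal sketch of the bias correction, checking numerically that $2\hat{f}_{2b} - \hat{f}_b$ coincides with smoothing by a trapezoidal flat-top lag window; the notation, the AR(1) example, and the value of $b$ are illustrative assumptions.

import numpy as np

def sample_acov(x, max_lag):
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.array([np.sum(xc[:n - s] * xc[s:]) / n for s in range(max_lag + 1)])

def lag_window_estimator(x, w, weights):
    # (2*pi)^{-1} * sum_s lambda(s/b) * gamma_hat(s) * exp(i*w*s);
    # weights[s] holds lambda(s/b) for s = 0, ..., len(weights) - 1
    g = sample_acov(x, len(weights) - 1)
    s = np.arange(1, len(weights))
    return (weights[0] * g[0] + 2.0 * np.sum(weights[1:] * g[1:] * np.cos(w * s))) / (2.0 * np.pi)

def bartlett_weights(b):
    return 1.0 - np.arange(b) / b                   # tent, support |s| < b

def flat_top_weights(b):
    # trapezoid: 1 for |s| <= b, linear down to 0 at |s| = 2b;
    # identical to 2*(tent with bandwidth 2b) minus (tent with bandwidth b)
    return np.minimum(1.0, 2.0 - np.arange(2 * b) / b)

rng = np.random.default_rng(3)
n = 2000
x = np.zeros(n)
for t in range(1, n):                               # AR(1): f(0) = 1/(2*pi*0.25) ~ 0.6366
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
b = 12
f_b  = lag_window_estimator(x, 0.0, bartlett_weights(b))
f_2b = lag_window_estimator(x, 0.0, bartlett_weights(2 * b))
print(2.0 * f_2b - f_b)                                    # bias-corrected estimate
print(lag_window_estimator(x, 0.0, flat_top_weights(b)))   # same number, via the flat-top window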


[Figure 1: Different choices for the lag window $\lambda$ and the corresponding kernel $\Lambda$ for the estimator $(2\pi)^{-1}\sum_{s=-\infty}^{\infty}\lambda(s/b)\,\hat{\gamma}(s)\,e^{iws} = \Lambda \star T(w)$. Panels: (a) lag window, (b) Fejér kernel; (c) lag window, (d) Dirichlet kernel; (e) lag window, (f) flat-top kernel, c = 0.5; (g), (h) further panels.]

Interlude: Van Gogh’s transformation

Van Gogh: Paris 1886

Van Gogh: Paris 1886 → Arles 1888

I.i.d. set-up

- Let $\epsilon_1, \ldots, \epsilon_n$ be i.i.d. from the (unknown) cdf $F_\epsilon$.
- GOAL: prediction of the future $\epsilon_{n+1}$ based on the data.
- $F_\epsilon$ is the predictive distribution, and its quantiles could be used to form predictive intervals.
- The mean and median of $F_\epsilon$ are optimal point predictors under an $L_2$ and an $L_1$ criterion, respectively.


Non-i.i.d. data

- In general, data $\underline{Y}_n = (Y_1, \ldots, Y_n)'$ are not i.i.d.
- So the predictive distribution of $Y_{n+1}$ given the data will depend on $\underline{Y}_n$ and on $X_{n+1}$, which is a matrix of observable, explanatory (predictor) variables.
- Key examples: regression and time series.

Models

- Regression: $Y_t = \mu(x_t) + \sigma(x_t)\,\epsilon_t$ with $\epsilon_t \sim$ i.i.d. $(0,1)$.
- Time series: $Y_t = \mu(Y_{t-1}, \cdots, Y_{t-p}; x_t) + \sigma(Y_{t-1}, \cdots, Y_{t-p}; x_t)\,\epsilon_t$.
- The above are flexible, nonparametric models.
- Given one of the above models, optimal model-based predictors of a future $Y$-value can be constructed.
- Nevertheless, the prediction problem can be carried out in a fully model-free setting, offering, at the very least, robustness against model mis-specification.


Transformation vs. modeling

- DATA: $\underline{Y}_n = (Y_1, \ldots, Y_n)'$.
- GOAL: predict the future value $Y_{n+1}$ given the data.
- Find an invertible transformation $H_m$ so that (for all $m$) the vector $\underline{\epsilon}_m = H_m(\underline{Y}_m)$ has i.i.d. components $\epsilon_k$, where $\underline{\epsilon}_m = (\epsilon_1, \ldots, \epsilon_m)'$:
  $\underline{Y} \stackrel{H_m}{\longrightarrow} \underline{\epsilon} \qquad \text{and} \qquad \underline{Y} \stackrel{H_m^{-1}}{\longleftarrow} \underline{\epsilon}.$


Transformation

(i) $(Y_1, \ldots, Y_m) \stackrel{H_m}{\longrightarrow} (\epsilon_1, \ldots, \epsilon_m)$
(ii) $(Y_1, \ldots, Y_m) \stackrel{H_m^{-1}}{\longleftarrow} (\epsilon_1, \ldots, \epsilon_m)$

- (i) implies that $\epsilon_1, \ldots, \epsilon_n$ are known given the data $Y_1, \ldots, Y_n$.
- (ii) implies that $Y_{n+1}$ is a function of $\epsilon_1, \ldots, \epsilon_n$ and $\epsilon_{n+1}$.
- So, given the data $\underline{Y}_n$, $Y_{n+1}$ is a function of $\epsilon_{n+1}$ only, i.e., $Y_{n+1} = h(\epsilon_{n+1})$.


Model-free prediction principle

$Y_{n+1} = h(\epsilon_{n+1})$

- Suppose $\epsilon_1, \ldots, \epsilon_n \sim$ cdf $F_\epsilon$.
- The mean and median of $h(\epsilon)$, where $\epsilon \sim F_\epsilon$, are optimal point predictors of $Y_{n+1}$ under the $L_2$ or $L_1$ criterion. (A code sketch follows below.)
- The whole predictive distribution of $Y_{n+1}$ is the distribution of $h(\epsilon)$ when $\epsilon \sim F_\epsilon$.
- To predict $Y_{n+1}^2$, replace $h$ by $h^2$; to predict $g(Y_{n+1})$, replace $h$ by $g \circ h$.
- The unknown $F_\epsilon$ can be estimated by $\hat{F}_\epsilon$, the edf of $\epsilon_1, \ldots, \epsilon_n$.
- But the predictive distribution needs bootstrapping, also because $h$ is estimated from the data.
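A minimal sketch of the principle in the simplest possible setting: a location-scale model $Y_i = a + b\,\epsilon_i$, where the transformation to (approximate) i.i.d.-ness is just studentization and $h(\epsilon) = \hat{a} + \hat{b}\,\epsilon$. All names are illustrative, and the closing comment flags exactly the point of the last bullet.

import numpy as np

rng = np.random.default_rng(4)
a_true, b_true, n = 2.0, 1.5, 300
eps = rng.standard_exponential(n) - 1.0        # skewed, mean-zero i.i.d. errors
y = a_true + b_true * eps

a_hat, b_hat = y.mean(), y.std(ddof=1)
eps_hat = (y - a_hat) / b_hat                  # H_n(Y): approximately i.i.d., with edf F_eps_hat

def h(e):                                      # Y_{n+1} = h(eps_{n+1}), with estimated a, b plugged in
    return a_hat + b_hat * e

vals = h(eps_hat)                              # distribution of h(eps) under the residual edf
print("L2 predictor (mean):  ", vals.mean())
print("L1 predictor (median):", np.median(vals))
print("predictor of Y^2:     ", (vals ** 2).mean())          # replace h by h^2
print("naive 90% interval:   ", np.quantile(vals, [0.05, 0.95]))
# As the slide notes, a proper predictive interval must also bootstrap the
# estimation error in h itself (here, in a_hat and b_hat).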


Predictive inference for stationary time series

TIME SERIES DATA: $Y_1, \ldots, Y_n$
GOAL: predict $Y_{n+1}$ given the data.

Problem: we cannot choose the regressor $x_f$; it is given by the recent time series history, e.g., $Y_{n-1}, \cdots, Y_{n-p}$ for some $p$.

Bootstrap data $Y_1^*, Y_2^*, \ldots, Y_{n-1}^*, Y_n^*, Y_{n+1}^*, \ldots$ must be such that they have the same values for the recent history, i.e., $Y_{n-1}^* = Y_{n-1}, Y_{n-2}^* = Y_{n-2}, \ldots, Y_{n-p}^* = Y_{n-p}$.

Model: $(\star)$ $Y_t = \mu(Y_{t-1}, \cdots, Y_{t-p}; x_t) + \sigma(Y_{t-1}, \cdots, Y_{t-p}; x_t)\,\epsilon_t$ with $\epsilon_t \sim$ i.i.d. $(0,1)$; we drop the regressor $x_t$ in what follows.


Nonparametric autoregression models

- i.i.d. errors: $Y_t = \mu(X_{t-1}) + \epsilon_t$
- heteroscedastic errors: $Y_t = \mu(X_{t-1}) + \sigma(X_{t-1})\,\epsilon_t$

where $X_{t-1} = (Y_{t-1}, \cdots, Y_{t-p})'$ and the errors $\epsilon_t$ are i.i.d. and independent of the past $\{Y_s, s < t\}$.

Estimate the conditional mean $\mu(x) = E[Y_{t+1} \mid X_t = x]$ and variance $\sigma^2(x) = E[(Y_{t+1} - \mu(x))^2 \mid X_t = x]$ by the usual Nadaraya-Watson kernel estimates $m$ and $\hat{\sigma}^2$ (a code sketch follows below), defined as
$m(x) = \frac{\sum_{t=1}^{n-1} K\!\big(\frac{\|x - x_t\|}{h}\big)\, y_{t+1}}{\sum_{t=1}^{n-1} K\!\big(\frac{\|x - x_t\|}{h}\big)}, \qquad \hat{\sigma}^2(x) = \frac{\sum_{t=1}^{n-1} K\!\big(\frac{\|x - x_t\|}{h}\big)\,\big(y_{t+1} - m(x_t)\big)^2}{\sum_{t=1}^{n-1} K\!\big(\frac{\|x - x_t\|}{h}\big)}.$
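A minimal sketch of the Nadaraya-Watson estimates above, written for a Gaussian kernel $K$ and general order $p$; the bandwidth and the toy series (Model 1 from the simulations later in the talk) are illustrative choices, not the paper's settings.

import numpy as np

def nw_fit(y, h, p=1):
    # design rows x_t = (y_t, ..., y_{t-p+1}), responses y_{t+1};
    # returns a weight function based on the Gaussian kernel K(||x - x_t|| / h)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([y[p - 1 - j: n - 1 - j] for j in range(p)])
    resp = y[p:]
    def weights(x):
        d = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)
        w = np.exp(-0.5 * (d / h) ** 2)
        return w / w.sum()
    return X, resp, weights

def nw_mean(x, resp, weights):
    return np.sum(weights(x) * resp)           # m(x)

def nw_var(x, X, resp, weights):
    # sigma2_hat(x) weights the squared residuals y_{t+1} - m(x_t), as on the slide
    fitted = np.array([nw_mean(xt, resp, weights) for xt in X])
    return np.sum(weights(x) * (resp - fitted) ** 2)

rng = np.random.default_rng(5)
n = 300
y = np.zeros(n)
for t in range(1, n):
    y[t] = np.sin(y[t - 1]) + rng.standard_normal()
X, resp, weights = nw_fit(y, h=0.5, p=1)
print(nw_mean([1.0], resp, weights), np.sin(1.0))   # m(1) should be near sin(1)
print(nw_var([1.0], X, resp, weights))              # sigma2_hat(1) should be near 1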


Nonparametric autoregression: residual bootstrap I

Mammen, Franke and Kreiss (1998, 2002)

- Given observations $y_1, \cdots, y_n$, we generate the bootstrap pseudo-data forward by the following recursions:
  - $y_i^* = m(x_{i-1}^*) + \epsilon_i^*$
  - $y_i^* = m(x_{i-1}^*) + \hat{\sigma}(x_{i-1}^*)\,\epsilon_i^*$
  where $x_{i-1}^* = (y_{i-1}^*, \cdots, y_{i-p}^*)'$, and the $\epsilon_i^*$ can be resampled from fitted or predictive residuals.
- Fitted residuals:
  - $\hat{\epsilon}_i = y_i - m(x_{i-1})$, for $i = p+1, \cdots, n$
  - $\hat{\epsilon}_i = \dfrac{y_i - m(x_{i-1})}{\hat{\sigma}(x_{i-1})}$, for $i = p+1, \cdots, n$
- Predictive residuals as in Politis (2013):
  - $\epsilon_t^{(t)} = y_t - m^{(t)}(x_{t-1})$, for $t = p+1, \cdots, n$
  - $\epsilon_t^{(t)} = \dfrac{y_t - m^{(t)}(x_{t-1})}{\hat{\sigma}^{(t)}(x_{t-1})}$, for $t = p+1, \cdots, n$.

Nonparametric autoregression: prediction intervals I

- Predictive root:
  - $Y_{n+1} - \hat{Y}_{n+1} = m(x_n) - \hat{m}(x_n) + \epsilon_{n+1}$
  - $Y_{n+1} - \hat{Y}_{n+1} = m(x_n) - \hat{m}(x_n) + \sigma(x_n)\,\epsilon_{n+1}$
- Bootstrap predictive root:
  - $Y_{n+1}^* - \hat{Y}_{n+1}^* = \hat{m}(x_n) - \hat{m}^*(x_n) + \epsilon_{n+1}^*$
  - $Y_{n+1}^* - \hat{Y}_{n+1}^* = \hat{m}(x_n) - \hat{m}^*(x_n) + \hat{\sigma}(x_n)\,\epsilon_{n+1}^*$
- Prediction interval: collect $B$ bootstrap root replicates in the form of an empirical distribution whose $\alpha$-quantile is denoted $q(\alpha)$. Then a $(1-\alpha)100\%$ equal-tailed predictive interval for $Y_{n+1}$ is given by $[\hat{m}(x_n) + q(\alpha/2),\ \hat{m}(x_n) + q(1-\alpha/2)]$. (A code sketch follows below.)
- This is the forward bootstrap fixing the last $p$ values; is a backward bootstrap feasible?
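A minimal end-to-end sketch of the forward bootstrap interval for the i.i.d.-error case with $p = 1$: fit $\hat{m}$ by Nadaraya-Watson, resample centered fitted residuals, regenerate a series forward, re-estimate, and collect bootstrap roots. The bandwidth, B, the start-up value of the pseudo-series, and the residual centering are simplifying assumptions of this sketch.

import numpy as np

def nw(xd, yr, x, h):
    # Nadaraya-Watson estimate of E[Y_{t+1} | Y_t = x] with a Gaussian kernel
    w = np.exp(-0.5 * ((xd - x) / h) ** 2)
    return np.sum(w * yr) / np.sum(w)

def forward_bootstrap_interval(y, h, B=200, alpha=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    xd, yr = y[:-1], y[1:]
    fitted = np.array([nw(xd, yr, x, h) for x in xd])
    eps = (yr - fitted) - (yr - fitted).mean()         # centered fitted residuals
    m_xn = nw(xd, yr, y[-1], h)                        # point predictor m_hat(x_n)
    roots = np.empty(B)
    for j in range(B):
        # forward generation: y*_t = m_hat(y*_{t-1}) + eps*_t, started at y_1
        ystar = np.empty(n)
        ystar[0] = y[0]
        estar = rng.choice(eps, size=n, replace=True)
        for t in range(1, n):
            ystar[t] = nw(xd, yr, ystar[t - 1], h) + estar[t]
        m_star_xn = nw(ystar[:-1], ystar[1:], y[-1], h)    # re-estimate, evaluate at x_n = y_n
        roots[j] = m_xn - m_star_xn + rng.choice(eps)      # bootstrap predictive root
    q = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return m_xn + q[0], m_xn + q[1]

rng = np.random.default_rng(6)
n = 200
y = np.zeros(n)
for t in range(1, n):
    y[t] = np.sin(y[t - 1]) + rng.standard_normal()
print(forward_bootstrap_interval(y, h=0.5, rng=rng))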

Model-free Prediction for Stationary Time Series

- Drop the nonparametric autoregression model.
- Stationary time series data: $Y_1, \ldots, Y_n$.
- Define $\eta_1 = D_{Y_1}(Y_1)$, so $\eta_1 \sim \mathrm{Unif}(0,1)$.
- Let $\eta_2 = D_{Y_2|Y_1}(Y_2|Y_1)$. One can show that $\eta_2 \sim \mathrm{Unif}(0,1)$ and is independent of $\eta_1$.
- Let $\eta_3 = D_{Y_3|Y_2,Y_1}(Y_3|Y_2,Y_1)$. One can show that $\eta_3 \sim \mathrm{Unif}(0,1)$ and is independent of $\eta_1$ and $\eta_2$.
- $\eta_1, \ldots, \eta_n \sim$ i.i.d. $\mathrm{Unif}(0,1)$: the Rosenblatt transformation.
- To use this in practice, we need to estimate $n$ functions: $D_{Y_1}(\cdot), D_{Y_2|Y_1}(\cdot), D_{Y_3|Y_2,Y_1}(\cdot), \ldots, D_{Y_n|Y_{n-1},\ldots,Y_1}(\cdot)$.
- Infeasible! Unless $\cdots$ we have Markov structure, in which case all these functions (except the first) are the same!


Model-free bootstrap for Markov (Pan and Politis), I

Assume $\{Y_t\}$ is Markov of order $p$.

- Let $X_{t-1} = (Y_{t-1}, \cdots, Y_{t-p})'$.
- Distribution of $X_t$: $D(x) = P(X_t \le x)$ for $x \in \mathbb{R}^p$.
- Distribution of $Y_t$ given $X_{t-1} = x$: $D_x(y) = P(Y_t \le y \mid X_{t-1} = x)$.
- Let $\eta_p = D(X_p)$ and $\eta_t = D_{X_{t-1}}(Y_t)$ for $t = p+1, \cdots, n$. Then
  $P(\eta_p \le z) = P(D(X_p) \le z) = P(X_p \le D^{-1}(z)) = D(D^{-1}(z)) = z$
  and
  $P(\eta_t \le z \mid X_{t-1} = x) = P(D_x(Y_t) \le z \mid X_{t-1} = x) = P(Y_t \le D_x^{-1}(z) \mid X_{t-1} = x) = D_x(D_x^{-1}(z)) = z$ (uniform),
  which does not depend on $x$ (independence).

Model-free bootstrap for Markov (Pan and Politis), II

- Given observations $y_1, \cdots, y_n$, estimate $D_x(y)$ by the usual kernel estimator $\hat{D}_x(y)$:
  $\hat{D}_x(y) = \frac{\sum_{i=p+1}^{n} 1\{y_i \le y\}\, K\!\big(\frac{\|x - x_{i-1}\|}{h}\big)}{\sum_{k=p+1}^{n} K\!\big(\frac{\|x - x_{k-1}\|}{h}\big)}.$
- $\hat{D}$ is a step function; one can use linear interpolation to produce a continuous function $\tilde{D}$, or...
- Replace $1\{y_i \le y\}$ by a continuous cdf $\Lambda\!\big(\frac{y - y_i}{h_0}\big)$.
- The doubly smoothed estimator $\bar{D}_x(y)$ is defined by (see the code sketch below)
  $\bar{D}_x(y) = \frac{\sum_{i=p+1}^{n} \Lambda\!\big(\frac{y - y_i}{h_0}\big)\, K\!\big(\frac{\|x - x_{i-1}\|}{h}\big)}{\sum_{k=p+1}^{n} K\!\big(\frac{\|x - x_{k-1}\|}{h}\big)}.$
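A minimal sketch of the doubly smoothed conditional cdf for $p = 1$, with Gaussian $K$ and a standard normal cdf for $\Lambda$, plus a grid-based inversion that can serve as $\bar{D}^{-1}_x$; the bandwidths, the grid, and the toy series are illustrative choices.

import numpy as np
from scipy.stats import norm

def cond_cdf(y_vals, x, y_obs, h, h0):
    # D_bar_x(y) = sum_i Lambda((y - y_i)/h0) K(|x - x_{i-1}|/h) / sum_k K(|x - x_{k-1}|/h),  p = 1
    x_lag, y_now = y_obs[:-1], y_obs[1:]
    K = np.exp(-0.5 * ((x - x_lag) / h) ** 2)
    K = K / K.sum()
    return np.array([np.sum(K * norm.cdf((y - y_now) / h0)) for y in np.atleast_1d(y_vals)])

def cond_quantile(u, x, y_obs, h, h0, n_grid=800):
    # D_bar_x^{-1}(u) by monotone interpolation on a y-grid
    grid = np.linspace(y_obs.min() - 3.0, y_obs.max() + 3.0, n_grid)
    return np.interp(u, cond_cdf(grid, x, y_obs, h, h0), grid)

rng = np.random.default_rng(7)
n = 300
y = np.zeros(n)
for t in range(1, n):
    y[t] = np.sin(y[t - 1]) + rng.standard_normal()
h = 0.5
h0 = h ** 2                                            # h0 = h^2, as on the simulation slide
print(cond_cdf(np.sin(1.0), 1.0, y, h, h0))            # ~ 0.5: sin(1) is the conditional median
print(cond_quantile(0.5, 1.0, y, h, h0), np.sin(1.0))  # conditional median ~ sin(1)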

Model-free bootstrap for Markov (Pan and Politis), III

- Construct transformed data $u_{p+1}, \cdots, u_n$ that are i.i.d.
  - Fitted transformation: $u_t = \bar{D}_{x_{t-1}}(y_t)$, for $t = p+1, \cdots, n$.
  - Predictive transformation (delete-one): $u_t^{(t)} = \bar{D}^{(t)}_{x_{t-1}}(y_t)$ with
    $\bar{D}^{(t)}_{x_{t-1}}(y_t) = \frac{\sum_{i=p+1, i \ne t}^{n} \Lambda\!\big(\frac{y_t - y_i}{h_0}\big)\, K\!\big(\frac{\|x_{t-1} - x_{i-1}\|}{h}\big)}{\sum_{k=p+1, k \ne t}^{n} K\!\big(\frac{\|x_{t-1} - x_{k-1}\|}{h}\big)}, \qquad t = p+1, \cdots, n.$
- Generate the pseudo-series $Y_1^*, \ldots, Y_n^*$: let $Y_t^* = \bar{D}^{-1}_{x^*_{t-1}}(u_t^*)$ for $t = 1, \ldots, n$, where $u_t^*$ is sampled from:
  - $u_{p+1}, \cdots, u_n$ (MF);
  - $u^{(t)}_{p+1}, \cdots, u^{(t)}_n$ (PMF);
  - directly from the Unif(0,1) distribution (LMF).
- Also let $Y_{n+1}^* = \bar{D}^{-1}_{x_n}(u^*_{n+1})$.

Model-free predictors and prediction intervals

The $L_2$-optimal point predictor and its bootstrap analog are
$\hat{Y}_{n+1} = \frac{1}{n-p}\sum_{t=p+1}^{n} \bar{D}^{-1}_{x_n}(u_t), \qquad \hat{Y}^*_{n+1} = \frac{1}{n-p}\sum_{t=p+1}^{n} \bar{D}^{*\,-1}_{x_n}(u^*_t),$
using either MF, PMF or LMF.

- Predictive root: $Y_{n+1} - \hat{Y}_{n+1}$.
- Bootstrap predictive root: $Y^*_{n+1} - \hat{Y}^*_{n+1}$.
- Obtain $B$ replicates of the bootstrap roots and collect them in an empirical distribution with $\alpha$-quantile denoted $q(\alpha)$.
- A $(1-\alpha)100\%$ equal-tailed predictive interval for $Y_{n+1}$ is given by $[\hat{Y}_{n+1} + q(\alpha/2),\ \hat{Y}_{n+1} + q(1-\alpha/2)]$.


Alternative model-free approaches for Markov processes

- Bootstrap based on the transition density (Rajarshi 1990):
  - Estimate the transition density $f(y|x)$ by smoothing, and generate $Y_1^*, \ldots, Y_n^*$ from the estimated transition density.
  - Either a forward or a backward generation scheme can be used.
- The Local Bootstrap [Paparoditis and Politis (2001)]:
  - Generate $Y_1^*, \ldots, Y_n^*$ from the estimated transition distribution $\hat{D}_x(y)$, which is a step function.
  - The local bootstrap is to Rajarshi's method what Efron's (1979) bootstrap is to the smooth bootstrap, i.e., resampling from a smoothed e.d.f.

Simulation models

- Model 1: $Y_{t+1} = \sin(Y_t) + \epsilon_{t+1}$
- Model 2: $Y_{t+1} = \sin(Y_t) + \sqrt{0.5 + 0.25\,Y_t^2}\;\epsilon_{t+1}$

where the errors $\{\epsilon_t\}$ are i.i.d. $N(0,1)$. (A code sketch for generating such series follows below.)

- 500 true series from each model.
- The kernels $K$ and $\Lambda$ had a normal shape.
- Bandwidth $h$ for all methods chosen by cross-validation, except...
- For Rajarshi's method, use the rule-of-thumb formula $h = 0.9\,A\,n^{-1/4}$, where $A = \min\!\big(\hat{\sigma}, \frac{\mathrm{IQR}}{1.34}\big)$; $\hat{\sigma}$ is the sample standard deviation and IQR is the interquartile range.
- The smoothing bandwidth for $\Lambda$ was taken to be $h_0 = h^2$.
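A minimal sketch for generating series from the two models (with a burn-in so the start-up value washes out) and for the rule-of-thumb bandwidth quoted above; the seed and lengths are arbitrary.

import numpy as np

def simulate(model, n, rng, burn=100):
    # Model 1: Y_{t+1} = sin(Y_t) + eps_{t+1}
    # Model 2: Y_{t+1} = sin(Y_t) + sqrt(0.5 + 0.25*Y_t^2) * eps_{t+1},  eps_t i.i.d. N(0,1)
    y = np.zeros(n + burn)
    eps = rng.standard_normal(n + burn)
    for t in range(n + burn - 1):
        scale = 1.0 if model == 1 else np.sqrt(0.5 + 0.25 * y[t] ** 2)
        y[t + 1] = np.sin(y[t]) + scale * eps[t + 1]
    return y[burn:]

rng = np.random.default_rng(8)
y1 = simulate(1, n=100, rng=rng)
y2 = simulate(2, n=100, rng=rng)

# rule-of-thumb bandwidth h = 0.9 * A * n^{-1/4}, A = min(sigma_hat, IQR/1.34)
A = min(y1.std(ddof=1), (np.quantile(y1, 0.75) - np.quantile(y1, 0.25)) / 1.34)
print(0.9 * A * len(y1) ** (-0.25))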


                      nominal coverage 95%          nominal coverage 90%
n = 100               CVR     LEN     st.dev.       CVR     LEN     st.dev.
nonpara-f             0.927   3.860   0.393         0.873   3.255   0.310
nonpara-p             0.943   4.099   0.402         0.894   3.456   0.317
trans-forward         0.942   4.137   0.627         0.901   3.535   0.519
trans-backward        0.942   4.143   0.621         0.900   3.531   0.519
LB-forward            0.930   3.980   0.625         0.886   3.409   0.511
LB-backward           0.932   4.001   0.605         0.886   3.411   0.508
Hybrid-trans-f        0.921   3.822   0.412         0.868   3.241   0.335
Hybrid-trans-p        0.936   4.045   0.430         0.889   3.441   0.341
Hybrid-LB-f           0.923   3.815   0.430         0.869   3.226   0.343
Hybrid-LB-p           0.937   4.018   0.433         0.890   3.414   0.338
MF                    0.916   3.731   0.551         0.869   3.221   0.489
PMF                   0.946   4.231   0.647         0.902   3.471   0.530

n = 200               CVR     LEN     st.dev.       CVR     LEN     st.dev.
nonpara-f             0.938   3.868   0.272         0.886   3.263   0.219
nonpara-p             0.948   4.012   0.283         0.899   3.385   0.231
trans-forward         0.944   4.061   0.501         0.902   3.472   0.415
trans-backward        0.944   4.058   0.507         0.902   3.470   0.424
LB-forward            0.937   3.968   0.530         0.891   3.369   0.439
LB-backward           0.937   3.979   0.551         0.893   3.383   0.448
Hybrid-trans-f        0.932   3.838   0.359         0.880   3.238   0.290
Hybrid-trans-p        0.942   3.977   0.360         0.893   3.358   0.281
Hybrid-LB-f           0.932   3.798   0.336         0.882   3.228   0.272
Hybrid-LB-p           0.942   3.958   0.338         0.895   3.356   0.265
MF                    0.924   3.731   0.464         0.877   3.208   0.387
PMF                   0.946   4.123   0.570         0.899   3.439   0.444

Table: $Y_{t+1} = \sin(Y_t) + \epsilon_{t+1}$ with $\epsilon_t \sim N(0,1)$ (normal innovations).

                      nominal coverage 95%          nominal coverage 90%
n = 100               CVR     LEN     st.dev.       CVR     LEN     st.dev.
nonpara-f             0.894   3.015   0.926         0.843   2.566   0.783
nonpara-p             0.922   3.318   1.003         0.868   2.744   0.826
trans-forward         0.943   3.421   0.629         0.901   2.928   0.561
trans-backward        0.943   3.439   0.648         0.901   2.930   0.573
LB-forward            0.938   3.425   0.628         0.895   2.908   0.553
LB-backward           0.937   3.410   0.616         0.894   2.903   0.549
Hybrid-trans-f        0.888   2.997   0.873         0.833   2.564   0.747
Hybrid-trans-p        0.914   3.301   0.980         0.855   3.726   0.792
Hybrid-LB-f           0.888   2.988   0.877         0.834   2.553   0.748
Hybrid-LB-p           0.916   3.301   0.996         0.858   2.727   0.796
MF                    0.921   3.119   0.551         0.874   2.679   0.476
PMF                   0.951   3.587   0.620         0.908   2.964   0.513

n = 200               CVR     LEN     st.dev.       CVR     LEN     st.dev.
nonpara-f             0.903   2.903   0.774         0.848   2.537   0.647
nonpara-p             0.921   3.164   0.789         0.863   2.636   0.654
trans-forward         0.943   3.428   0.627         0.901   2.921   0.548
trans-backward        0.943   3.430   0.633         0.901   2.921   0.552
LB-forward            0.942   3.425   0.578         0.898   2.894   0.483
LB-backward           0.941   3.406   0.562         0.895   2.858   0.462
Hybrid-trans-f        0.892   2.991   0.789         0.836   2.541   0.652
Hybrid-trans-p        0.908   3.147   0.816         0.849   2.631   0.663
Hybrid-LB-f           0.892   2.953   0.763         0.837   2.520   0.643
Hybrid-LB-p           0.911   3.148   0.810         0.853   2.628   0.663
MF                    0.926   3.167   0.504         0.879   2.707   0.420
PMF                   0.947   3.481   0.574         0.900   2.890   0.473

Table: $Y_{t+1} = \sin(Y_t) + \sqrt{0.5 + 0.25\,Y_t^2}\;\epsilon_{t+1}$ with $\epsilon_t \sim N(0,1)$ (normal innovations).

[Figure (scatterplots, axes x and y): (a) Data from Model 1; (b) Data from Model 2. n = 100.]