TRANSCRIPT
From Nonparametrics to Model-Free: A Time Series Time Line
Dimitris N. Politis, University of California, San Diego
1988-1989
- Stanford, 1988-1989
- Idea: bootstrap confidence intervals for the spectral density
- Murray: “What are you going to do about the bias?”
Davis, June 1989
- Week-long lectures by Murray on Stochastic Curve Estimation, University of California, Davis, June 1989.
- Stochastic Curve Estimation, NSF-CBMS Lecture Notes, 1991.
- Talk by Peter Hall: bootstrap for the probability density.
- Solution for bias: undersmoothing!
Some background
- Data: $X_1, \dots, X_n$ from a stationary, weakly dependent time series $\{X_t,\ t \in \mathbb{Z}\}$ with unknown mean $\mu = EX_t$ and (equally unknown) autocovariance $\gamma(k) = \mathrm{Cov}(X_t, X_{t+k})$.
- $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$: consistent & asymptotically efficient.
- $\sigma_n^2 = \mathrm{Var}(\sqrt{n}\,\bar X_n) = \sum_{s=-n}^{n}\left(1 - \frac{|s|}{n}\right)\gamma(s)$.
- Under regularity:
  $$\sigma_\infty^2 := \lim_{n\to\infty}\sigma_n^2 = \sum_{s=-\infty}^{\infty}\gamma(s) = 2\pi f(0),$$
  where $f(w) = (2\pi)^{-1}\sum_{s=-\infty}^{\infty} e^{iws}\gamma(s)$, for $w \in [-\pi,\pi]$, is the spectral density function.
- Standard error estimation is nontrivial under dependence.
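As a quick numerical check of the formulas above, the sketch below evaluates $\sigma_n^2$ and its limit $2\pi f(0)$ in closed form; the AR(1) model is an assumption made purely for illustration:

```python
import numpy as np

# Illustrative AR(1): X_t = phi*X_{t-1} + Z_t, Z_t ~ (0, sigma_z^2), for which
# gamma(s) = sigma_z^2 * phi^|s| / (1 - phi^2) is available in closed form.
phi, sigma_z = 0.6, 1.0

def gamma(s):
    return sigma_z**2 * phi ** np.abs(s) / (1 - phi**2)

def sigma2_n(n):
    """sigma_n^2 = Var(sqrt(n)*Xbar_n) = sum_{s=-n}^{n} (1 - |s|/n) * gamma(s)."""
    s = np.arange(-n, n + 1)
    return np.sum((1 - np.abs(s) / n) * gamma(s))

for n in (10, 100, 1000):
    print(n, round(sigma2_n(n), 4))
print("2*pi*f(0) =", sigma_z**2 / (1 - phi) ** 2)   # = sum_s gamma(s)
```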
Periodogram
- Spectral density: $f(w) = (2\pi)^{-1}\sum_{s=-\infty}^{\infty} e^{iws}\gamma(s)$.
- Naive plug-in estimator: $T(w) = (2\pi)^{-1}\sum_{s=-\infty}^{\infty} e^{iws}\hat\gamma(s)$, where $\hat\gamma(s) = n^{-1}\sum_{t=1}^{n-|s|}(X_t - \bar X_n)(X_{t+|s|} - \bar X_n)$.
- The periodogram $T(w)$ is inconsistent for $f(w)$:
  - $ET(w) = f(w) + O(1/n)$ for $w \neq 0$;
  - $\mathrm{Var}\,T(w) \simeq c_w \not\to 0$, where $c_w = f^2(w)\,(1 + 1\{w/\pi \in \mathbb{Z}\})$.
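A minimal simulation of the inconsistency (an illustration, not from the talk): for Gaussian white noise $f(w) = 1/(2\pi) \approx 0.159$, and the periodogram's variance does not shrink as $n$ grows.

```python
import numpy as np

def periodogram(x, w):
    """T(w) = (2*pi*n)^{-1} |sum_t (x_t - xbar) e^{-iwt}|^2, which equals the
    lag-sum form (2*pi)^{-1} sum_s gammahat(s) e^{iws} from the slide."""
    n = len(x)
    d = np.sum((x - x.mean()) * np.exp(-1j * w * np.arange(n)))
    return np.abs(d) ** 2 / (2 * np.pi * n)

rng = np.random.default_rng(5)
for n in (100, 1000, 10000):      # mean stays ~f(w); variance stays ~f(w)^2
    vals = [periodogram(rng.standard_normal(n), 1.0) for _ in range(500)]
    print(n, round(float(np.mean(vals)), 3), round(float(np.var(vals)), 3))
```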
Bartlett’s blocking scheme
- Bartlett (1946): take the average of short periodograms.
- Blocking has been a core idea in time series analysis for 100 years, e.g., the big-block/small-block technique used to prove CLTs.
- Rosenblatt’s (1957) CLT for strong mixing “models”.
- A stationary series $\{X_t\}$ is strong mixing if $\alpha_X(k) \to 0$ as $k \to \infty$, where
  $$\alpha_X(k) = \sup_{A,B}|P(A \cap B) - P(A)P(B)|$$
  with $A \in \mathcal{F}_{-\infty}^{0}$, $B \in \mathcal{F}_{k}^{\infty}$, and $\mathcal{F}_{k}^{K} = \sigma\{X_t,\ k \le t \le K\}$.
Blocking schemes
Basic assumption: $b \to \infty$ but $b/n \to 0$ as $n \to \infty$.
- Fully overlapping; number of blocks $q = n - b + 1$:
  $B_1 = (X_1, \dots, X_b),\ B_2 = (X_2, \dots, X_{b+1}),\ B_3 = (X_3, \dots, X_{b+2}),\ \dots,\ B_q = (X_{n-b+1}, \dots, X_n)$.
- Non-overlapping; number of blocks $Q = [n/b]$:
  $B_1 = (X_1, \dots, X_b),\ B_2 = (X_{b+1}, \dots, X_{2b}),\ B_3 = (X_{2b+1}, \dots, X_{3b}),\ \dots$
- Non-overlapping with ‘buffer’; number of blocks $[n/(2b)]$:
  $B_1 = (X_1, \dots, X_b)$, buffer $(X_{b+1}, \dots, X_{2b})$, $B_2 = (X_{2b+1}, \dots, X_{3b})$, buffer, $\dots$
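The three schemes are easy to state in code; a small sketch (0-based indices, purely illustrative):

```python
import numpy as np

def blocks(n, b, scheme="overlapping"):
    """Index sets of the blocks B_i for the three schemes on this slide."""
    if scheme == "overlapping":          # q = n - b + 1 blocks
        starts = range(n - b + 1)
    elif scheme == "nonoverlapping":     # Q = floor(n/b) blocks
        starts = range(0, (n // b) * b, b)
    elif scheme == "buffered":           # floor(n/(2b)) blocks, gap of b between
        starts = range(0, (n // (2 * b)) * 2 * b, 2 * b)
    else:
        raise ValueError(scheme)
    return [np.arange(s, s + b) for s in starts]

n, b = 12, 3
for sch in ("overlapping", "nonoverlapping", "buffered"):
    print(sch, [list(B) for B in blocks(n, b, sch)])
```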
Bartlett’s spectral estimation scheme
- Consider one of the blocking schemes; for simplicity, non-overlapping.
- Let $T_i(w)$ be the periodogram calculated from $B_i$:
  - $ET_i(w) = f(w) + O(1/b)$;
  - $\mathrm{Var}\,T_i(w) \simeq c_w$ as before.
- Define $\bar T(w) = Q^{-1}\sum_{i=1}^{Q} T_i(w)$:
  - $E\bar T(w) = f(w) + O(1/b)$;
  - $\mathrm{Var}\,\bar T(w) \simeq c_w/Q = c_w\,[b/n]$.
- If $b \to \infty$ but $b/n \to 0$, then $\bar T(w) \overset{P}{\longrightarrow} f(w)$.
- The same argument works for the overlapping scheme; only $c_w$ is different: 33% smaller.
- Alternative expression for Bartlett’s estimator:
  $$\bar T(w) \simeq (2\pi)^{-1}\sum_{s=-\infty}^{\infty}\lambda(s/b)\,\hat\gamma(s)\,e^{iws},$$
  where $\lambda(x) = (1 - |x|)_+$ is the ‘tent’ function.
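A sketch of the non-overlapping Bartlett estimate, averaging block periodograms exactly as above; white noise (with $f(w) = 1/(2\pi)$) is used as an illustrative check:

```python
import numpy as np

def bartlett_estimate(x, b, w):
    """Average the periodograms of the Q = floor(n/b) non-overlapping blocks;
    consistent for f(w) when b -> infinity and b/n -> 0 (a sketch)."""
    n = len(x)
    Q = n // b
    t = np.arange(b)
    vals = []
    for i in range(Q):
        blk = x[i * b:(i + 1) * b]
        dft = np.sum((blk - blk.mean()) * np.exp(-1j * w * t))
        vals.append(np.abs(dft) ** 2 / (2 * np.pi * b))
    return float(np.mean(vals))

rng = np.random.default_rng(3)
x = rng.standard_normal(5000)              # white noise: f(w) = 1/(2*pi) ~ 0.159
print(bartlett_estimate(x, b=17, w=1.0))   # b ~ n^(1/3)
```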
1989–1990
- Under the non-overlapping scheme, $T_1, T_2, \dots$ are approximately independent.
- $\bar T(w) = Q^{-1}\sum_{i=1}^{Q} T_i(w)$ is the sample mean of $Q$ (approximately) i.i.d. variables: it can be bootstrapped!
- But overlapping blocks are more efficient; in that case, $T_1, T_2, \dots$ are stationary and mixing.
- Block bootstrap for the sample mean of mixing series: Künsch (1989) and Liu/Singh (1988, 1992).
- 1990 Ph.D. thesis: block bootstrap on the sample mean of $T_1, \dots, T_Q$. Tricky point: this is really a triangular array.
Self-reference
- Recall $\sigma_n^2 = \mathrm{Var}(\sqrt{n}\,\bar X_n) \to 2\pi f(0)$.
- Let $\sigma_{BB}^2$ (and $\sigma_{SUB}^2$) denote the block bootstrap (and subsampling; Politis/Romano, 1992) estimates of $\mathrm{Var}(\sqrt{n}\,\bar X_n)$.
- Up to edge effects, $\sigma_{BB}^2$ and $\sigma_{SUB}^2$ are tantamount to $2\pi\bar T(0)$, i.e., Bartlett’s estimator of $2\pi f(0) \approx \mathrm{Var}(\sqrt{n}\,\bar X_n)$.
- How can you estimate the variance of $\bar T(0)$, the sample mean of $T_1(0), \dots, T_Q(0)$?
- Blocks-of-blocks bootstrap: apply a spectral estimation technique to get the variance of a spectral estimator!
Bias revisited
- The Bartlett estimator has large bias.
- $\mathrm{Bias}\,\bar T(w) = O(1/b)$ and $\mathrm{Var}\,\bar T(w) = O(b/n)$.
- To minimize the MSE of $\bar T(w)$, choose $b \sim \mathrm{const}\cdot n^{1/3}$.
  - The minimized MSE is $O(n^{-2/3})$; the same holds for $\sigma_{BB}^2$ and $\sigma_{SUB}^2$.
  - But: we can achieve MSE $= O(n^{-4/5})$ with windows different from Bartlett’s ‘tent’, e.g., the Parzen, Tukey, or Daniell windows.
  - We can even achieve MSE $= O(n^{-2r/(2r+1)})$ with windows of order $r$, i.e., having $r$ derivatives vanishing at the origin.
- Or: we can bias-correct the Bartlett estimator.
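The $n^{1/3}$ rate follows from balancing the squared bias against the variance; spelling out this standard calculation (with generic constants $c_1, c_2$):

```latex
\mathrm{MSE}\{\bar T(w)\} \;\asymp\; \frac{c_1}{b^{2}} \;+\; c_2\,\frac{b}{n},
\qquad
\frac{d}{db}\!\left(\frac{c_1}{b^{2}} + \frac{c_2\,b}{n}\right) = 0
\;\Longrightarrow\;
b_{\mathrm{opt}} = \left(\frac{2\,c_1\,n}{c_2}\right)^{1/3} \sim \mathrm{const}\cdot n^{1/3},
\qquad
\mathrm{MSE}\big|_{b_{\mathrm{opt}}} = O\!\left(n^{-2/3}\right).
```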
Bias-corrected estimation
- $\hat f$ is Bartlett with block size $b$, and $\mathrm{Bias}\,\hat f \simeq c/b + o(1/b)$.
- $\check f$ is Bartlett with block size $\check b$, and $\mathrm{Bias}\,\check f \simeq c/\check b + o(1/\check b)$.
- $\check f$ is over-smoothed, i.e., $b > \check b$.
- If we choose $b = 2\check b$, then the bias-corrected estimator $\tilde f = 2\hat f - \check f$ has $\mathrm{Bias} = o(1/b)$.
- Gained an order of magnitude? Actually, much more!
- $\mathrm{Bias}\,\tilde f = O(1/b^k)$, where $k$ = the number of derivatives of $f$.
- If $f$ has all derivatives, $\tilde f$ is $\sqrt{n}$-consistent (up to a log term).
- $\tilde f$ is tantamount to smoothing with a window of infinite order; in this case: a ‘flat-top’ window of trapezoidal shape.
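To see the last point concretely, here is a sketch of the induced lag window, assuming the reading $\tilde f = 2\hat f - \check f$ with $b = 2\check b$ from above: combining the two tents gives exactly the trapezoidal flat-top shape.

```python
import numpy as np

def tent(x):
    """Bartlett 'tent' lag window: (1 - |x|)_+ ."""
    return np.maximum(1 - np.abs(x), 0.0)

def flat_top(x):
    """Window induced by 2*fhat_b - fhat_{b/2}: 2*tent(x) - tent(2x).
    Equals 1 for |x| <= 1/2 (flat top) and decays linearly to 0 at |x| = 1."""
    return 2 * tent(x) - tent(2 * x)

x = np.linspace(-1.2, 1.2, 13)
print(np.round(flat_top(x), 2))   # flat at 1 near the origin, trapezoid overall
```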
Different choices for the lag window $\lambda$ and kernel $\Lambda$ for the estimator
$$(2\pi)^{-1}\sum_{s=-\infty}^{\infty}\lambda(s/b)\,\hat\gamma(s)\,e^{iws} = \Lambda \star T(w).$$

[Figure 1: three lag windows (panels a, c, e) with their corresponding kernels (panels b, d, f): the Fejér kernel, the Dirichlet kernel, and the flat-top kernel with c = 0.5.]
I.i.d. set-up
- Let $\varepsilon_1, \dots, \varepsilon_n$ be i.i.d. from the (unknown) cdf $F_\varepsilon$.
- GOAL: prediction of the future $\varepsilon_{n+1}$ based on the data.
- $F_\varepsilon$ is the predictive distribution, and its quantiles can be used to form predictive intervals.
- The mean and median of $F_\varepsilon$ are optimal point predictors under an $L_2$ and an $L_1$ criterion, respectively.
Non-i.i.d. data
- In general, data $\mathbf{Y}_n = (Y_1, \dots, Y_n)'$ are not i.i.d.
- So the predictive distribution of $Y_{n+1}$ given the data will depend on $\mathbf{Y}_n$ and on $x_{n+1}$, a matrix of observable, explanatory (predictor) variables.
- Key examples: regression and time series.
Models
- Regression: $Y_t = \mu(x_t) + \sigma(x_t)\,\varepsilon_t$ with $\varepsilon_t \sim$ i.i.d. $(0,1)$.
- Time series: $Y_t = \mu(Y_{t-1}, \dots, Y_{t-p};\, x_t) + \sigma(Y_{t-1}, \dots, Y_{t-p};\, x_t)\,\varepsilon_t$.
- The above are flexible, nonparametric models.
- Given one of the above models, optimal model-based predictors of a future $Y$-value can be constructed.
- Nevertheless, the prediction problem can be carried out in a fully model-free setting, offering, at the very least, robustness against model mis-specification.
Transformation vs. modeling
- DATA: $\mathbf{Y}_n = (Y_1, \dots, Y_n)'$.
- GOAL: predict the future value $Y_{n+1}$ given the data.
- Find an invertible transformation $H_m$ so that (for all $m$) the vector $\boldsymbol{\varepsilon}_m = H_m(\mathbf{Y}_m)$ has i.i.d. components $\varepsilon_k$, where $\boldsymbol{\varepsilon}_m = (\varepsilon_1, \dots, \varepsilon_m)'$:
  $$\mathbf{Y} \overset{H_m}{\longrightarrow} \boldsymbol{\varepsilon},
  \qquad
  \mathbf{Y} \overset{H_m^{-1}}{\longleftarrow} \boldsymbol{\varepsilon}.$$
Transformation
(i) $(Y_1, \dots, Y_m) \overset{H_m}{\longrightarrow} (\varepsilon_1, \dots, \varepsilon_m)$
(ii) $(Y_1, \dots, Y_m) \overset{H_m^{-1}}{\longleftarrow} (\varepsilon_1, \dots, \varepsilon_m)$
- (i) implies that $\varepsilon_1, \dots, \varepsilon_n$ are known given the data $Y_1, \dots, Y_n$.
- (ii) implies that $Y_{n+1}$ is a function of $\varepsilon_1, \dots, \varepsilon_n$ and $\varepsilon_{n+1}$.
- So, given the data $\mathbf{Y}_n$, $Y_{n+1}$ is a function of $\varepsilon_{n+1}$ only, i.e., $Y_{n+1} = h(\varepsilon_{n+1})$.
Model-free prediction principle
$$Y_{n+1} = h(\varepsilon_{n+1})$$
- Suppose $\varepsilon_1, \dots, \varepsilon_n \sim$ cdf $F_\varepsilon$.
- The mean and median of $h(\varepsilon)$, where $\varepsilon \sim F_\varepsilon$, are optimal point predictors of $Y_{n+1}$ under an $L_2$ or $L_1$ criterion.
- The whole predictive distribution of $Y_{n+1}$ is the distribution of $h(\varepsilon)$ when $\varepsilon \sim F_\varepsilon$.
- To predict $Y_{n+1}^2$, replace $h$ by $h^2$; to predict $g(Y_{n+1})$, replace $h$ by $g \circ h$.
- The unknown $F_\varepsilon$ can be estimated by $\hat F_\varepsilon$, the edf of $\varepsilon_1, \dots, \varepsilon_n$.
- But the predictive distribution needs bootstrapping, also because $h$ is estimated from the data.
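A schematic sketch of the principle; the map $h$ and the residuals below are stand-ins invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
eps = rng.standard_normal(200)       # stand-in for the transformed i.i.d. residuals
h = lambda e: 1.5 + 0.8 * e          # stand-in for the fitted map Y_{n+1} = h(eps_{n+1})

# Estimate F_eps by the edf of eps_1, ..., eps_n; the predictive distribution of
# Y_{n+1} is then the distribution of h(eps*) with eps* drawn from that edf.
draws = h(rng.choice(eps, size=5000, replace=True))
print("L2-optimal predictor (mean):  ", draws.mean())
print("L1-optimal predictor (median):", np.median(draws))
print("90% predictive interval:      ", np.quantile(draws, [0.05, 0.95]))
```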
Predictive inference for stationary time series
TIME SERIES DATA: $Y_1, \dots, Y_n$.
GOAL: predict $Y_{n+1}$ given the data.
Problem: we cannot choose the regressor $x_f$; it is given by the recent time series history, e.g., $Y_{n-1}, \dots, Y_{n-p}$ for some $p$.
Bootstrap data $Y_1^*, Y_2^*, \dots, Y_{n-1}^*, Y_n^*, Y_{n+1}^*, \dots$ must have the same values for the recent history, i.e., $Y_{n-1}^* = Y_{n-1},\ Y_{n-2}^* = Y_{n-2},\ \dots,\ Y_{n-p}^* = Y_{n-p}$.
Model $(\star)$: $Y_t = \mu(Y_{t-1}, \dots, Y_{t-p};\, x_t) + \sigma(Y_{t-1}, \dots, Y_{t-p};\, x_t)\,\varepsilon_t$ with $\varepsilon_t \sim$ i.i.d. $(0,1)$; we drop the regressor $x_t$ in what follows.
Nonparametric autoregression models
- i.i.d. errors: $Y_t = \mu(X_{t-1}) + \varepsilon_t$
- heteroscedastic errors: $Y_t = \mu(X_{t-1}) + \sigma(X_{t-1})\,\varepsilon_t$
where $X_{t-1} = (Y_{t-1}, \dots, Y_{t-p})'$ and the errors $\varepsilon_t$ are i.i.d. and independent of the past $\{Y_s,\ s < t\}$.
Estimate the conditional mean $\mu(x) = E[Y_{t+1} \mid X_t = x]$ and variance $\sigma^2(x) = E[(Y_{t+1} - \mu(x))^2 \mid X_t = x]$ by the usual Nadaraya-Watson kernel estimates $\hat m$ and $\hat\sigma^2$, defined as
$$\hat m(x) = \frac{\sum_{t=1}^{n-1} K\!\left(\frac{\|x - x_t\|}{h}\right) y_{t+1}}{\sum_{t=1}^{n-1} K\!\left(\frac{\|x - x_t\|}{h}\right)},
\qquad
\hat\sigma^2(x) = \frac{\sum_{t=1}^{n-1} K\!\left(\frac{\|x - x_t\|}{h}\right)\left(y_{t+1} - \hat m(x_t)\right)^2}{\sum_{t=1}^{n-1} K\!\left(\frac{\|x - x_t\|}{h}\right)}.$$
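A sketch of these Nadaraya-Watson estimates with a Gaussian kernel $K$ (the kernel choice and the toy data are assumptions made for illustration):

```python
import numpy as np

def nw_estimates(y, p, h):
    """Nadaraya-Watson estimates of m(x) and sigma^2(x), evaluated at each
    sample point x_t = (y_t, ..., y_{t-p+1}), per the formulas above (sketch)."""
    n = len(y)
    X = np.column_stack([y[p - 1 - j:n - 1 - j] for j in range(p)])   # lagged vectors x_t
    Y = y[p:]                                                         # responses y_{t+1}
    K = lambda u: np.exp(-0.5 * u**2)                                 # Gaussian kernel
    W = K(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) / h)  # kernel weights
    m = W @ Y / W.sum(axis=1)                                         # mhat at each x_t
    s2 = W @ ((Y - m) ** 2) / W.sum(axis=1)                           # sigma2hat at each x_t
    return X, Y, m, s2

rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(299):                  # toy data from Y_t = sin(Y_{t-1}) + eps_t
    y[t + 1] = np.sin(y[t]) + rng.standard_normal()
X, Y, m, s2 = nw_estimates(y, p=1, h=0.5)
```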
Nonparametric autoregression: residual bootstrap I
Mammen, Franke and Kreiss (1998, 2002)
- Given observations $y_1, \dots, y_n$, generate the bootstrap pseudo-data forward by one of the recursions
  - $y_i^* = \hat m(x_{i-1}^*) + \varepsilon_i^*$
  - $y_i^* = \hat m(x_{i-1}^*) + \hat\sigma(x_{i-1}^*)\,\varepsilon_i^*$
  where $x_{i-1}^* = (y_{i-1}^*, \dots, y_{i-p}^*)'$, and the $\varepsilon_i^*$ can be resampled from fitted or predictive residuals.
- Fitted residuals:
  - $\hat\varepsilon_i = y_i - \hat m(x_{i-1})$, for $i = p+1, \dots, n$
  - $\hat\varepsilon_i = \dfrac{y_i - \hat m(x_{i-1})}{\hat\sigma(x_{i-1})}$, for $i = p+1, \dots, n$
- Predictive residuals as in Politis (2013), based on delete-one estimates:
  - $\hat\varepsilon_t^{(t)} = y_t - \hat m^{(t)}(x_{t-1})$, $t = p+1, \dots, n$
  - $\hat\varepsilon_t^{(t)} = \dfrac{y_t - \hat m^{(t)}(x_{t-1})}{\hat\sigma^{(t)}(x_{t-1})}$, $t = p+1, \dots, n$.
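A sketch of the forward generation step in the i.i.d.-error case; the conditional-mean function and the residuals are assumed given (e.g., from a Nadaraya-Watson fit):

```python
import numpy as np

def residual_bootstrap_series(y, p, m_hat, eps_fitted, rng):
    """Forward recursion y*_i = m_hat(x*_{i-1}) + eps*_i, with eps*_i drawn
    from the centered fitted residuals (a sketch of the i.i.d.-error case)."""
    n = len(y)
    eps_c = np.asarray(eps_fitted) - np.mean(eps_fitted)  # center the residuals
    y_star = np.empty(n)
    y_star[:p] = y[:p]                                    # initialize with real data
    for i in range(p, n):
        x_prev = y_star[i - p:i][::-1]                    # (y*_{i-1}, ..., y*_{i-p})
        y_star[i] = m_hat(x_prev) + rng.choice(eps_c)
    return y_star

rng = np.random.default_rng(1)
y = rng.standard_normal(50)                               # placeholder data
y_star = residual_bootstrap_series(y, p=1, m_hat=lambda x: np.sin(x[0]),
                                   eps_fitted=rng.standard_normal(49), rng=rng)
```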
Nonparametric autoregression: prediction intervals I
- Predictive root:
  - $Y_{n+1} - \hat Y_{n+1} = m(x_n) - \hat m(x_n) + \varepsilon_{n+1}$
  - $Y_{n+1} - \hat Y_{n+1} = m(x_n) - \hat m(x_n) + \sigma(x_n)\,\varepsilon_{n+1}$
- Bootstrap predictive root:
  - $Y_{n+1}^* - \hat Y_{n+1}^* = \hat m(x_n) - \hat m^*(x_n) + \varepsilon_{n+1}^*$
  - $Y_{n+1}^* - \hat Y_{n+1}^* = \hat m(x_n) - \hat m^*(x_n) + \hat\sigma(x_n)\,\varepsilon_{n+1}^*$
- Prediction interval: collect $B$ bootstrap root replicates in the form of an empirical distribution whose $\alpha$-quantile is denoted $q(\alpha)$. Then a $(1-\alpha)100\%$ equal-tailed predictive interval for $Y_{n+1}$ is given by $[\hat m(x_n) + q(\alpha/2),\ \hat m(x_n) + q(1-\alpha/2)]$.
- This is the forward bootstrap fixing the last $p$ values; is a backward bootstrap feasible?
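A sketch of assembling the interval from bootstrap predictive roots; `fit` and `resample` are placeholder callables standing in for the user's estimator and the generation step above:

```python
import numpy as np

def prediction_interval(y, p, fit, resample, alpha=0.05, B=500, rng=None):
    """Equal-tailed interval [mhat(x_n)+q(alpha/2), mhat(x_n)+q(1-alpha/2)]
    from bootstrap roots (sketch; `fit(y)` returns (m_hat, fitted residuals),
    `resample(y, m_hat, eps)` returns a pseudo-series fixing the last p values)."""
    if rng is None:
        rng = np.random.default_rng()
    m_hat, eps = fit(y)
    eps_c = eps - eps.mean()
    x_n = y[-1:-p - 1:-1]                         # last p values, most recent first
    roots = np.empty(B)
    for j in range(B):
        y_star = resample(y, m_hat, eps_c)        # bootstrap pseudo-series
        m_star, _ = fit(y_star)                   # re-estimate on the pseudo-series
        y_next_star = m_hat(x_n) + rng.choice(eps_c)   # bootstrap Y*_{n+1}
        roots[j] = y_next_star - m_star(x_n)      # = m_hat(x_n) - m*(x_n) + eps*
    q = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return float(m_hat(x_n) + q[0]), float(m_hat(x_n) + q[1])
```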
Model-free Prediction for Stationary Time Series
- Drop the nonparametric autoregression model.
- Stationary time series data: $Y_1, \dots, Y_n$.
- Define $\eta_1 = D_{Y_1}(Y_1)$, so $\eta_1 \sim \mathrm{Unif}(0,1)$.
- Let $\eta_2 = D_{Y_2|Y_1}(Y_2 \mid Y_1)$. One can show $\eta_2 \sim \mathrm{Unif}(0,1)$, independent of $\eta_1$.
- Let $\eta_3 = D_{Y_3|Y_2,Y_1}(Y_3 \mid Y_2, Y_1)$. One can show $\eta_3 \sim \mathrm{Unif}(0,1)$, independent of $\eta_1$ and $\eta_2$.
- $\eta_1, \dots, \eta_n \sim$ i.i.d. $\mathrm{Unif}(0,1)$: the Rosenblatt transformation.
- To use this in practice, we need to estimate the $n$ functions $D_{Y_1}(\cdot), D_{Y_2|Y_1}(\cdot), D_{Y_3|Y_2,Y_1}(\cdot), \dots, D_{Y_n|Y_{n-1},\dots,Y_1}(\cdot)$.
- Infeasible! Unless... we have Markov structure, in which case all these functions (except the first) are the same!
Model-free bootstrap for Markov processes (Pan and Politis), I
Assume $\{Y_t\}$ is Markov of order $p$.
- Let $X_{t-1} = (Y_{t-1}, \dots, Y_{t-p})'$.
- Distribution of $X_t$: $D(x) = P(X_t \le x)$ for $x \in \mathbb{R}^p$.
- Distribution of $Y_t$ given $X_{t-1} = x$: $D_x(y) = P(Y_t \le y \mid X_{t-1} = x)$.
- Let $\eta_p = D(X_p)$ and $\eta_t = D_{X_{t-1}}(Y_t)$, for $t = p+1, \dots, n$. Then
  $$P(\eta_p \le z) = P(D(X_p) \le z) = P(X_p \le D^{-1}(z)) = D(D^{-1}(z)) = z,$$
  $$P(\eta_t \le z \mid X_{t-1} = x) = P(D_x(Y_t) \le z \mid X_{t-1} = x) = P(Y_t \le D_x^{-1}(z) \mid X_{t-1} = x) = D_x(D_x^{-1}(z)) = z \quad \text{(Uniform)},$$
  and the latter does not depend on $x$ (Independence).
Model-free bootstrap for Markov processes (Pan and Politis), II
- Given observations $y_1, \dots, y_n$, estimate $D_x(y)$ by the usual kernel estimator $\hat D_x(y)$:
  $$\hat D_x(y) = \frac{\sum_{i=p+1}^{n} 1\{y_i \le y\}\, K\!\left(\frac{\|x - x_{i-1}\|}{h}\right)}{\sum_{k=p+1}^{n} K\!\left(\frac{\|x - x_{k-1}\|}{h}\right)}.$$
- $\hat D$ is a step function; one can use linear interpolation to produce a continuous function $\bar D$, or...
- Replace $1\{y_i \le y\}$ by a continuous cdf $\Lambda\!\left(\frac{y - y_i}{h_0}\right)$.
- The doubly smoothed estimator $\tilde D_x(y)$ is defined by
  $$\tilde D_x(y) = \frac{\sum_{i=p+1}^{n} \Lambda\!\left(\frac{y - y_i}{h_0}\right) K\!\left(\frac{\|x - x_{i-1}\|}{h}\right)}{\sum_{k=p+1}^{n} K\!\left(\frac{\|x - x_{k-1}\|}{h}\right)}.$$
Model-free bootstrap for Markov processes (Pan and Politis), III
- Construct transformed data $u_{p+1}, \dots, u_n$ that are i.i.d.
  - Fitted transformation: $u_t = \tilde D_{x_{t-1}}(y_t)$, for $t = p+1, \dots, n$.
  - Predictive transformation (delete-one): $u_t^{(t)} = \tilde D^{(t)}_{x_{t-1}}(y_t)$ with
    $$\tilde D^{(t)}_{x_{t-1}}(y) = \frac{\sum_{i=p+1,\, i \neq t}^{n} \Lambda\!\left(\frac{y - y_i}{h_0}\right) K\!\left(\frac{\|x_{t-1} - x_{i-1}\|}{h}\right)}{\sum_{k=p+1,\, k \neq t}^{n} K\!\left(\frac{\|x_{t-1} - x_{k-1}\|}{h}\right)}, \quad t = p+1, \dots, n.$$
- Generate the pseudo-series $Y_1^*, \dots, Y_n^*$: let $Y_t^* = \tilde D^{-1}_{x^*_{t-1}}(u_t^*)$ for $t = 1, \dots, n$, where $u_t^*$ is sampled:
  - from $u_{p+1}, \dots, u_n$ (MF);
  - from $u^{(t)}_{p+1}, \dots, u^{(t)}_n$ (PMF);
  - directly from the Unif(0,1) distribution (LMF).
- Also let $Y_{n+1}^* = \tilde D^{-1}_{x_n}(u^*_{n+1})$.
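A sketch of the generation step: invert the smoothed conditional cdf numerically on a grid of y-values (it is monotone in y), drawing $u^*$ according to MF, PMF or LMF; this assumes `D_tilde` from the sketch above:

```python
import numpy as np

def D_tilde_inverse(u, x, ys, xlags, h, h0, grid):
    """Numerical inverse of the smoothed conditional cdf over a y-grid."""
    cdf = np.array([D_tilde(g, x, ys, xlags, h, h0) for g in grid])
    j = int(np.searchsorted(cdf, u))
    return float(grid[min(j, len(grid) - 1)])

rng = np.random.default_rng(2)
# u* per scheme: rng.choice(u_fitted) for MF, rng.choice(u_predictive) for PMF,
# or rng.uniform() for LMF; then Y*_t = D_tilde_inverse(u*, x*_{t-1}, ...).
u_star = rng.uniform()                        # the LMF draw is shown here
```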
Model-free predictors and prediction intervals
The $L_2$-optimal point predictor and its bootstrap analog are
$$\hat Y_{n+1} = \frac{1}{n-p}\sum_{t=p+1}^{n} \tilde D^{-1}_{x_n}(u_t),
\qquad
\hat Y^*_{n+1} = \frac{1}{n-p}\sum_{t=p+1}^{n} \tilde D^{*\,-1}_{x_n}(u^*_t),$$
using either MF, PMF or LMF.
- Predictive root: $Y_{n+1} - \hat Y_{n+1}$.
- Bootstrap predictive root: $Y^*_{n+1} - \hat Y^*_{n+1}$.
- Obtain $B$ replicates of the bootstrap roots and collect them in an empirical distribution with $\alpha$-quantile denoted $q(\alpha)$.
- A $(1-\alpha)100\%$ equal-tailed predictive interval for $Y_{n+1}$ is given by $[\hat Y_{n+1} + q(\alpha/2),\ \hat Y_{n+1} + q(1-\alpha/2)]$.
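The point predictor is then just the average of inverse-cdf values at the observed $u_t$, conditionally on $x_n$; a sketch assuming `D_tilde_inverse` from above:

```python
import numpy as np

def l2_predictor(u_vals, x_n, ys, xlags, h, h0, grid):
    """Yhat_{n+1} = (n - p)^{-1} * sum_t D_tilde^{-1}_{x_n}(u_t) (a sketch)."""
    return float(np.mean([D_tilde_inverse(u, x_n, ys, xlags, h, h0, grid)
                          for u in u_vals]))
```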
Alternative Model-free approaches for Markov Processes
- Bootstrap based on the transition density (Rajarshi 1990):
  - Estimate the transition density $f(y|x)$ by smoothing, and generate $Y_1^*, \dots, Y_n^*$ from the estimated transition density.
  - Can use either a forward or a backward generation scheme.
- The Local Bootstrap [Paparoditis and Politis (2001)]:
  - Generate $Y_1^*, \dots, Y_n^*$ from the estimated transition distribution $\hat D_x(y)$, which is a step function.
  - The local bootstrap is to Rajarshi’s method what Efron’s (1979) bootstrap is to the smooth bootstrap, i.e., resampling from a smoothed e.d.f.
Simulation models
- Model 1: $Y_{t+1} = \sin(Y_t) + \varepsilon_{t+1}$
- Model 2: $Y_{t+1} = \sin(Y_t) + \sqrt{0.5 + 0.25\,Y_t^2}\;\varepsilon_{t+1}$
where the errors $\{\varepsilon_t\}$ are i.i.d. $N(0,1)$.
- 500 true series from each model.
- The kernels $K$ and $\Lambda$ had a normal shape.
- Bandwidth $h$ for all methods by cross-validation, except...
- For Rajarshi’s method, use the rule-of-thumb formula $h = 0.9\,A\,n^{-1/4}$, where $A = \min(\hat\sigma, \mathrm{IQR}/1.34)$; $\hat\sigma$ is the sample standard deviation and IQR is the interquartile range.
- The smoothing bandwidth for $\Lambda$ was taken to be $h_0 = h^2$.
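A sketch generating series from the two simulation models (the burn-in length is an assumption, not stated on the slide):

```python
import numpy as np

def simulate(model, n, rng, burn=100):
    """Generate a length-n series from Model 1 or Model 2."""
    y = np.zeros(n + burn)
    eps = rng.standard_normal(n + burn)
    for t in range(n + burn - 1):
        scale = 1.0 if model == 1 else np.sqrt(0.5 + 0.25 * y[t] ** 2)
        y[t + 1] = np.sin(y[t]) + scale * eps[t + 1]
    return y[burn:]

rng = np.random.default_rng(6)
y1, y2 = simulate(1, 100, rng), simulate(2, 100, rng)
```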
normal innovations       nominal coverage 95%        nominal coverage 90%
n = 100                  CVR    LEN    st.dev.       CVR    LEN    st.dev.
nonpara-f                0.927  3.860  0.393         0.873  3.255  0.310
nonpara-p                0.943  4.099  0.402         0.894  3.456  0.317
trans-forward            0.942  4.137  0.627         0.901  3.535  0.519
trans-backward           0.942  4.143  0.621         0.900  3.531  0.519
LB-forward               0.930  3.980  0.625         0.886  3.409  0.511
LB-backward              0.932  4.001  0.605         0.886  3.411  0.508
Hybrid-trans-f           0.921  3.822  0.412         0.868  3.241  0.335
Hybrid-trans-p           0.936  4.045  0.430         0.889  3.441  0.341
Hybrid-LB-f              0.923  3.815  0.430         0.869  3.226  0.343
Hybrid-LB-p              0.937  4.018  0.433         0.890  3.414  0.338
MF                       0.916  3.731  0.551         0.869  3.221  0.489
PMF                      0.946  4.231  0.647         0.902  3.471  0.530

n = 200                  CVR    LEN    st.dev.       CVR    LEN    st.dev.
nonpara-f                0.938  3.868  0.272         0.886  3.263  0.219
nonpara-p                0.948  4.012  0.283         0.899  3.385  0.231
trans-forward            0.944  4.061  0.501         0.902  3.472  0.415
trans-backward           0.944  4.058  0.507         0.902  3.470  0.424
LB-forward               0.937  3.968  0.530         0.891  3.369  0.439
LB-backward              0.937  3.979  0.551         0.893  3.383  0.448
Hybrid-trans-f           0.932  3.838  0.359         0.880  3.238  0.290
Hybrid-trans-p           0.942  3.977  0.360         0.893  3.358  0.281
Hybrid-LB-f              0.932  3.798  0.336         0.882  3.228  0.272
Hybrid-LB-p              0.942  3.958  0.338         0.895  3.356  0.265
MF                       0.924  3.731  0.464         0.877  3.208  0.387
PMF                      0.946  4.123  0.570         0.899  3.439  0.444

Table: $Y_{t+1} = \sin(Y_t) + \varepsilon_{t+1}$ with $\varepsilon_t \sim N(0,1)$.
normal innovations       nominal coverage 95%        nominal coverage 90%
n = 100                  CVR    LEN    st.dev.       CVR    LEN    st.dev.
nonpara-f                0.894  3.015  0.926         0.843  2.566  0.783
nonpara-p                0.922  3.318  1.003         0.868  2.744  0.826
trans-forward            0.943  3.421  0.629         0.901  2.928  0.561
trans-backward           0.943  3.439  0.648         0.901  2.930  0.573
LB-forward               0.938  3.425  0.628         0.895  2.908  0.553
LB-backward              0.937  3.410  0.616         0.894  2.903  0.549
Hybrid-trans-f           0.888  2.997  0.873         0.833  2.564  0.747
Hybrid-trans-p           0.914  3.301  0.980         0.855  3.726  0.792
Hybrid-LB-f              0.888  2.988  0.877         0.834  2.553  0.748
Hybrid-LB-p              0.916  3.301  0.996         0.858  2.727  0.796
MF                       0.921  3.119  0.551         0.874  2.679  0.476
PMF                      0.951  3.587  0.620         0.908  2.964  0.513

n = 200                  CVR    LEN    st.dev.       CVR    LEN    st.dev.
nonpara-f                0.903  2.903  0.774         0.848  2.537  0.647
nonpara-p                0.921  3.164  0.789         0.863  2.636  0.654
trans-forward            0.943  3.428  0.627         0.901  2.921  0.548
trans-backward           0.943  3.430  0.633         0.901  2.921  0.552
LB-forward               0.942  3.425  0.578         0.898  2.894  0.483
LB-backward              0.941  3.406  0.562         0.895  2.858  0.462
Hybrid-trans-f           0.892  2.991  0.789         0.836  2.541  0.652
Hybrid-trans-p           0.908  3.147  0.816         0.849  2.631  0.663
Hybrid-LB-f              0.892  2.953  0.763         0.837  2.520  0.643
Hybrid-LB-p              0.911  3.148  0.810         0.853  2.628  0.663
MF                       0.926  3.167  0.504         0.879  2.707  0.420
PMF                      0.947  3.481  0.574         0.900  2.890  0.473

Table: $Y_{t+1} = \sin(Y_t) + \sqrt{0.5 + 0.25\,Y_t^2}\;\varepsilon_{t+1}$ with $\varepsilon_t \sim N(0,1)$.
[Figure: (a) Data from Model 1; (b) Data from Model 2. n = 100.]