


Streaming Solutions for Time-Varying Optimization Problems
Tomer Hamam, Justin Romberg, Fellow, IEEE

Abstract—This paper studies streaming optimization problems that have objectives of the form $\sum_{t=1}^{T} f(x_{t-1}, x_t)$. In particular, we are interested in how the solution $\hat{x}_{t|T}$ for the $t$th frame of variables changes as $T$ increases. While incrementing $T$ and adding a new functional and a new set of variables does in general change the solution everywhere, we give conditions under which $\hat{x}_{t|T}$ converges to a limit point $x^*_t$ at a linear rate as $T \to \infty$. As a consequence, we are able to derive theoretical guarantees for algorithms with limited memory, showing that limiting the solution updates to only a small number of frames in the past sacrifices almost nothing in accuracy. We also present a new efficient Newton online algorithm (NOA), inspired by these results, that updates the solution with fixed complexity of $O(3Bn^3)$, independent of $T$, where $B$ corresponds to how far in the past the variables are updated, and $n$ is the size of a single block-vector. Two streaming optimization examples, online reconstruction from non-uniform samples and non-homogeneous Poisson intensity estimation, support the theoretical results and show how the algorithm can be used in practice.

Index Terms—time-varying, aggregated, incremental, optimization, unconstrained, filtering, streaming, cumulative, Newton method, graph optimization, Bayesian filtering, online, time-series, partially separable, block-tridiagonal.

I. INTRODUCTION

We consider time-varying convex optimization problems of the form
\[
  \underset{x_T}{\text{minimize}} \quad J_T(x_T) := \sum_{t=1}^{T} f_t(x_{t-1}, x_t), \tag{1}
\]
where each $x_t$ is a block variable in $\mathbb{R}^n$ and $x_T = (x_0; x_1; \cdots; x_T) \in \mathbb{R}^{n(T+1)}$. We are particularly interested in solving these programs dynamically; we study below how the solutions $\hat{x}_T$ evolve as $T$ increases and how we can move from the minimizer of $J_T$ to the minimizer of $J_{T+1}$ in an efficient manner.

Optimization problems of the form (1) arise, broadly speaking, in applications where we are trying to estimate a time-varying quantity from data that is presented sequentially. In signal processing, they are used for online least-squares [1], [2] and estimation of sparse vectors [3]–[7]. They have also been used for low-rank matrix recovery in recommendation systems [8], [9], audio restoration and enhancement [10], [11], medical imagery applications [12], [13], and inverse problems [14], [15]. Applications in other fields include online convex optimization [16]–[18], adaptive learning [19], [20], time series prediction [21], [22], and optimal control [23], [24]. Closely related problems also come from estimation algorithms in wireless sensor networks [25], [26], multi-task learning [16], and in simultaneous localization and mapping (SLAM) and pose graph optimization (PGO) [27]–[30].

The authors are with the School of Electrical and Computer Engineering at Georgia Tech in Atlanta. Email: [email protected], [email protected].

This work was supported by ARL DCIST CRA W911NF-17-2-0181 and by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

Submitted October 30, 2021.

Even though each of the functions in (1) only depends on two block variables, the reliance of $f_t$ on both $x_t$ and $x_{t-1}$ couples all of the functionals in the sum. When a new “frame” is added to the program above, meaning $T \to T+1$, a new term is added to the sum in the functional, and a new set of variables is introduced. This new term will affect the optimality of all of the $x_t$.

The most well-known example of (1) is when the $f_t$ are least-squares losses on linear functions of $x_{t-1}$ and $x_t$,
\[
  f_t(x_{t-1}, x_t) = \|B_t x_{t-1} + A_t x_t - y_t\|^2. \tag{2}
\]

In this case, (1) has the same mathematical structure as the Kalman filter [31]–[33] and can be solved with a streaming least-squares algorithm. If we use $\hat{x}_{t|T}$ to denote the optimal estimate of $x_t$ when $J_T$ is minimized, then $\hat{x}_{T+1|T+1}$ can be computed by applying an appropriate affine function to $\hat{x}_{T|T}$, and then the $\hat{x}_{T-j|T}$, $j = 1, \ldots, T$, are computed recursively (by again applying affine functions) with a backward sweep through the variables. This updating algorithm, which requires updating each frame only once, follows from the fact that the system of equations for solving $J_{T+1}$ has a block-tridiagonal structure; a block LU factorization can be computed on the fly and the $\hat{x}_{T-j|T}$ are then computed using back substitution.

When the $f_t$ have any other form than (2), moving from the solution of $J_T$ to $J_{T+1}$ is much more complicated. Unlike the linear least-squares case, we cannot update the solutions with a single backward sweep. However, as we describe in detail in Section III below, (1) retains a key piece of structure from the least-squares case. The Hessian matrix of $J_T$ has the same block-tridiagonal structure as the system matrix corresponding to (2).

Our main mathematical contribution shows that when the Hessian matrix exhibits a kind of block diagonal dominance in addition to the tridiagonal structure, the solution vectors are only weakly coupled, in that adding a new term to (1) does not significantly affect the solutions far in the past.

Theorem 2.1 in Section II and Theorem 3.1 in Section III below show that the difference between $\hat{x}_{t|T+1}$ and $\hat{x}_{t|T}$ decreases exponentially in $T - t$. As a result, $\hat{x}_{t|T}$ and $\hat{x}_{t|T+1}$ will not be too different for even moderate $T - t$.



The correction terms’ rapid convergence gives rise to the optimization filtering approach described in the second half of Section III. We show how we can (approximately) solve (1) with bounded memory by only updating a relatively small number of the $x_t$ in the past each time a new loss function is added. We show that under appropriate conditions, the error due to this “truncation” of early terms does not accumulate as $T$ grows, meaning that the online algorithm is stable. Theorem 3.3 gives these sufficient conditions and bounds the error as a function of the memory in the system.

The remainder of the paper is organized as follows. We briefly overview related work in the existing literature in Section I-A. In Section II, we study the particular case of least-squares loss (2), and give conditions under which the $\hat{x}_{t|T}$ converge rapidly as $T$ increases. This allows us to control the error of a standard recursive least-squares algorithm when the updates are truncated. Section III extends these results to the case where the $f_t$ are general (smooth and strongly) convex loss functions, with an online Newton-type algorithm for solving these problems presented in Section IV. Numerical examples to support our theoretical results are given in Section V. Proofs are deferred to the appendices. Notation follows the standard convention.

A. Related work

Further work on time-varying convex optimization problems considers the problem of tracking the state of the solution to the optimization problem $\min_{x \in X(t)} f(x; t)$ for all $t \geq 0$. As an example of recent work on such problems, [34]–[36] have adapted prediction-correction methods by implicitly extracting the dynamics of the solution and using it for a prediction step. A recent survey on problems of this sort and their applications can be found in [37].

Another related research area is that of online convex optimization. The online convex optimization framework [38] arises in various learning applications where the problem’s data is obtained sequentially and, more importantly, can (and often will) vary based on some action that followed our previous solution. Having an inherently different structure and notion of cost, the objective in online convex learning, unlike standard convex optimization problems, is to minimize a regret that measures how well we do compared to the best solution in hindsight. The problem’s model was not assumed dynamic in its original setting, but some alternative regret models have been considered in recent work; adaptive regret [39] and shifting regret [40] are two examples.

As already mentioned in the introduction, Bayesian inference, and in particular Bayesian estimation of time-varying systems [41], also resembles the concepts discussed in this work. The idea of optimal filtering and recursive updating of the solution is not new, and the Kalman filter, with its numerous variations, is considered the workhorse for such problems. Greatly inspired by these ideas, our objective, in part, is to extend them so they can be applied to time-varying optimization problems in a principled way. In furthering this connection between the two fields, we also hope to open the door to utilizing additional tools from the rich optimization framework in time-varying statistical inference problems.

II. STREAMING LEAST-SQUARES

In this section, we look at solving (1) in the particular case where the cost functions are regularized linear least-squares terms similar to (2):¹
\[
  f_t(x_{t-1}, x_t) = \|B_t x_{t-1} + A_t x_t - y_t\|^2 + \gamma \|x_t\|^2. \tag{3}
\]

To ease the notation, we will fix the sizes of the variables $x_t \in \mathbb{R}^n$ and matrices $A_t, B_t \in \mathbb{R}^{m \times n}$ to be the same, but everything below is easy to generalize. An important application, and one which is discussed further in Section V below, is streaming reconstruction from non-uniform samples of signals that are represented by local overlapping basis elements. There is an elegant way to solve this type of streaming least-squares problem that is essentially equivalent to the Kalman filter. We will quickly review how this solution comes about below, as it will allow us to draw parallels to the case when the $f_t$ are general convex functions.

We begin by presenting a matrix formulation of the problem that expresses the minimizer as the solution to an inverse problem. This formulation exposes the problem’s nice structure, leading to an efficient forward-backward LU factorization solver. We then discuss the stability of the factorization and show that it also leads to convergence of the updates (Theorem 2.1). Lastly, as a corollary of Theorem 2.1, we show that the updates can be truncated, allowing the streaming solution to operate with finite memory, with very little additional error.

A. Matrix formulation

With the $f_t$ as in (3), the optimization problem (1) is equivalent to solving a structured linear inverse problem. At frame $T$, we are estimating the right-hand side of
\[
  \underbrace{\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ \vdots \\ y_T \end{bmatrix}}_{y_T}
  =
  \underbrace{\begin{bmatrix}
    A_0 & 0 & \cdots & & & 0 \\
    B_1 & A_1 & 0 & \cdots & & 0 \\
    0 & B_2 & A_2 & 0 & \cdots & 0 \\
    0 & 0 & B_3 & A_3 & \cdots & 0 \\
    0 & 0 & 0 & B_4 & A_4 & 0 \\
    & & & \ddots & \ddots & \vdots \\
    0 & \cdots & & & B_T & A_T
  \end{bmatrix}}_{\Phi_T}
  \underbrace{\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_T \end{bmatrix}}_{x_T}
  + \text{noise},
\]
with a least-squares loss. The minimizer of this (regularized) least-squares problem satisfies the normal equations
\[
  \hat{x}_T = (\Phi_T^{\mathrm{T}} \Phi_T + \gamma I)^{-1} \Phi_T^{\mathrm{T}} y_T. \tag{4}
\]
When $T$ is large, solving (4) can be expensive or even infeasible. However, the structure of $\Phi_T$ (it has only two nonzero block diagonals) allows an efficient method for updating the solution as $T \to T+1$.

¹Throughout, $\|\cdot\|$ refers to the Euclidean norm for vectors and its induced norm for matrices.
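As a concrete illustration (our own sketch, not from the paper), the following NumPy snippet assembles $\Phi_T$ from randomly generated blocks and solves the normal equations (4) in one batch; the block sizes $n, m$, the horizon $T$, and the regularization $\gamma$ are arbitrary illustrative choices.

    import numpy as np

    # Build the two-block-diagonal matrix Phi_T from random A_t, B_t blocks and
    # solve the regularized normal equations (4) directly (batch solve).
    rng = np.random.default_rng(0)
    n, m, T, gamma = 4, 6, 5, 0.1              # illustrative sizes and regularization
    A = [rng.standard_normal((m, n)) for _ in range(T + 1)]
    B = [None] + [rng.standard_normal((m, n)) for _ in range(T)]

    Phi = np.zeros(((T + 1) * m, (T + 1) * n))
    for t in range(T + 1):
        Phi[t * m:(t + 1) * m, t * n:(t + 1) * n] = A[t]        # A_t on the main block diagonal
        if t >= 1:
            Phi[t * m:(t + 1) * m, (t - 1) * n:t * n] = B[t]    # B_t on the first sub-diagonal

    y = rng.standard_normal((T + 1) * m)                        # stacked measurements y_T
    x_hat = np.linalg.solve(Phi.T @ Phi + gamma * np.eye((T + 1) * n), Phi.T @ y)

For large $T$ this batch solve is exactly what the streaming algorithm developed below avoids.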


B. Block-tridiagonal systems

The key piece of structure that our analysis and algorithms take advantage of is that the system matrix $\Phi_T^{\mathrm{T}} \Phi_T + \gamma I$ in (4) is block-tridiagonal. In this section, we overview how this type of system can be solved recursively and introduce conditions that guarantee that these computations are stable. We will use these results both to analyze the least-squares case and the more general convex case, where the Hessian matrix has the same block-tridiagonal structure.

Consider a general block-tridiagonal system of equations
\[
  \begin{bmatrix}
    H_0 & E_0^{\mathrm{T}} & & & \\
    E_0 & H_1 & E_1^{\mathrm{T}} & & \\
    & \ddots & \ddots & \ddots & \\
    & & E_{T-2} & H_{T-1} & E_{T-1}^{\mathrm{T}} \\
    & & & E_{T-1} & H_T
  \end{bmatrix}
  \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{T-1} \\ x_T \end{bmatrix}
  =
  \begin{bmatrix} g_0 \\ g_1 \\ \vdots \\ g_{T-1} \\ g_T \end{bmatrix}. \tag{5}
\]

There is a standard numerical linear algebra technique for calculating the (block) LU factorization of a block-banded matrix (see [42, Chapter 4.5]); the matrix above gets factored as
\[
  \underbrace{\begin{bmatrix}
    Q_0 & 0 & \cdots & & 0 \\
    E_0 & Q_1 & 0 & & \\
    0 & E_1 & Q_2 & \ddots & \vdots \\
    & & \ddots & \ddots & 0 \\
    0 & \cdots & 0 & E_{T-1} & Q_T
  \end{bmatrix}}_{L_T}
  \underbrace{\begin{bmatrix}
    I & U_0 & 0 & & \\
    0 & I & U_1 & 0 & \\
    \vdots & & \ddots & \ddots & \\
    & & & I & U_{T-1} \\
    0 & & & 0 & I
  \end{bmatrix}}_{U_T}.
\]

The $E_t$ in this factorization are the same as in (5), while the $Q_t$ and $U_t$ can be computed recursively using $Q_0 = H_0$, and then for $t = 1, \ldots, T$
\[
  U_{t-1} = Q_{t-1}^{-1} E_{t-1}^{\mathrm{T}}, \qquad
  Q_t = H_t - E_{t-1} U_{t-1} = H_t - E_{t-1} Q_{t-1}^{-1} E_{t-1}^{\mathrm{T}}. \tag{6}
\]

This, in turn, gives us an efficient way to solve the system in (5) using a forward-backward sweep: we start by initializing the “forward variable” as $v_0 = Q_0^{-1} g_0$, then move forward; for $t = 1, \ldots, T$ we compute
\[
  v_t = Q_t^{-1}(g_t - E_{t-1} v_{t-1}). \tag{7}
\]
After computing $v_T$, we hold $v = L^{-1} g$ in our hands. To compute $x = U^{-1} v$, we first set $x_T = v_T$ and then move backward; for $t = T-1, \ldots, 0$ we compute
\[
  x_t = v_t - U_t x_{t+1}. \tag{8}
\]
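The recursion (6) and the two sweeps (7)–(8) translate almost line for line into code. The NumPy sketch below (our own illustration, not the authors' implementation) solves a system of the form (5) given lists of diagonal blocks $H_t$, sub-diagonal blocks $E_t$, and right-hand-side blocks $g_t$.

    import numpy as np

    def block_tridiag_solve(H, E, g):
        """Solve the block-tridiagonal system (5) via the block LU recursion (6)
        followed by the forward sweep (7) and the backward sweep (8).
        H: diagonal blocks H_0..H_T; E: sub-diagonal blocks E_0..E_{T-1}; g: blocks g_0..g_T."""
        T = len(H) - 1
        Q, U, v = [H[0]], [], [np.linalg.solve(H[0], g[0])]
        for t in range(1, T + 1):
            U.append(np.linalg.solve(Q[t - 1], E[t - 1].T))               # U_{t-1} = Q_{t-1}^{-1} E_{t-1}^T
            Q.append(H[t] - E[t - 1] @ U[t - 1])                          # Q_t = H_t - E_{t-1} U_{t-1}
            v.append(np.linalg.solve(Q[t], g[t] - E[t - 1] @ v[t - 1]))   # forward sweep (7)
        x = [None] * (T + 1)
        x[T] = v[T]
        for t in range(T - 1, -1, -1):                                    # backward sweep (8)
            x[t] = v[t] - U[t] @ x[t + 1]
        return x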

This algorithm relies on the $Q_t$ being invertible for all $t$. As these are defined recursively, it might, in general, be hard to determine their invertibility without actually computing them. The lemma below, however, shows that if the system in (5) is block diagonally dominant, in that the $E_t$ are smaller than the $H_t$, then well-conditioned $H_t$ will result in well-conditioned $Q_t$, and as a result, the reconstruction process above is well-defined and stable.

Lemma 2.1. Suppose that there exist a $\kappa \geq 1$ and $\delta, \theta$ such that for all $t \geq 0$, $\|\kappa^{-1} H_t - I\| \leq \delta$ and $\|E_t\| \leq \kappa\theta$, with $\delta < 1$ and $\theta \leq (1-\delta)/2$. Then for all $t \geq 0$,
\[
  \|\kappa^{-1} Q_t - I\| \leq \varepsilon^*, \qquad
  \varepsilon^* = \frac{1+\delta}{2} - \sqrt{\frac{(1-\delta)^2}{4} - \theta^2} < 1.
\]

Proof. From the recursion that defines the $Q_t$ in (6) we have $\|\kappa^{-1} Q_t - I\| \leq \varepsilon_t$, where $\{\varepsilon_t\}$ is a sequence that obeys
\[
  \varepsilon_0 = \delta, \qquad \varepsilon_t = \delta + \frac{\theta^2}{1 - \varepsilon_{t-1}}.
\]
For the $\theta, \delta$ in the lemma statement, these $\varepsilon_t$ form a monotonically nondecreasing sequence that converges to $\varepsilon^*$.

One of the main consequences of Lemma 2.1, and a result that we will use for our least-squares and more general convex analysis, is that a uniform bound on the size of the blocks $\|g_t\|$ on the right-hand side of (5) implies a uniform bound on the size of the blocks in the solution $\|x_t\|$.

Lemma 2.2. Suppose that the conditions of Lemma 2.1 hold. Suppose also that we have a uniform bound on the norm of each of the blocks of $g$, $\|g_t\| \leq M$. Then if $\rho = \theta/(1-\varepsilon^*) < 1$,
\[
  \|x_t\| \leq \frac{M(1 - \rho^{T-t+1})}{(1-\varepsilon^*)(1-\rho)^2} \leq \frac{M}{(1-\varepsilon^*)(1-\rho)^2}.
\]

The proof of Lemma 2.2, which we present in Appendix A, essentially just traces the steps through the forward-backward sweep, making judicious use of the triangle inequality. Later on, we will see that our ability to bound the solution on a frame-by-frame basis plays a key role in showing that the streaming solutions to (1) converge as $T$ increases.

From general linear algebra, we have the following result that will be of use later on.

Lemma 2.3. Consider the $(m+n) \times (m+n)$ system
\[
  \begin{bmatrix} A & V^{\mathrm{T}} \\ V & B \end{bmatrix}
  \begin{bmatrix} h_0 \\ y \end{bmatrix}
  =
  \begin{bmatrix} q_0 \\ 0 \end{bmatrix},
\]
with matrices $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{n \times m}$, and vectors $h_0, q_0 \in \mathbb{R}^m$ and $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$. Suppose that $\|V\| \leq \alpha$, and $B$ is nonsingular with $\|B^{-1}\| \leq \beta$. Then $\|y\| \leq \alpha\beta \|h_0\|$, and in particular, $\|y_i\| \leq \alpha\beta \|h_0\|$.

Lemma 2.3 shows there is a simple relation between the first and remaining elements of the solution for the particular right-hand side above.

C. Streaming solutions

The forward-backward algorithm for solving block-tridiagonal systems can immediately be converted into a streaming solver for problems of the form (3). Indeed, the algorithm that does this, detailed explicitly as Algorithm 1 below, mirrors the steps in the classic Kalman filter exactly (with the backtracking updates akin to “smoothing”).


For fixed $T$, the least-squares problem in (3) amounts to a tridiagonal solve as in (5) with
\[
  E_t = A_{t+1}^{\mathrm{T}} B_{t+1}, \qquad
  H_t = A_t^{\mathrm{T}} A_t + B_{t+1}^{\mathrm{T}} B_{t+1} + \gamma I, \quad t = 0, \ldots, T-1, \qquad
  H_T = A_T^{\mathrm{T}} A_T + \gamma I.
\]

If we have computed the solution $\hat{x}_T = \{\hat{x}_{t|T}\}_{t=0}^{T}$, then when $T$ is incremented, we can move to the new solution $\hat{x}_{T+1}$ by updating the $Q_T$ matrix, computing the new terms $E_{t-1}, U_{t-1}$ in the block LU factorization (thus completing the forward sweep in one additional step), computing $\hat{x}_{T+1|T+1}$, then sweeping backward to update the solution by computing the $\hat{x}_{t|T+1}$ for $t = T, \ldots, 0$.

Algorithm 1 Streaming Least-Squares
    [y_0, A_0] ← GetSampleBatch(0)
    Q'_0 ← A_0^T A_0 + γI
    g'_0 ← A_0^T y_0
    x_0 ← (Q'_0)^{-1} g'_0
    for t = 1, 2, ... do
        [y_t, A_t, B_t] ← GetSampleBatch(t)
        Q_{t-1} ← Q'_{t-1} + B_t^T B_t
        g_{t-1} ← g'_{t-1} + B_t^T y_t
        v_{t-1} ← Q_{t-1}^{-1} (g_{t-1} − E_{t-2} v_{t-2})
        E_{t-1} ← A_t^T B_t
        U_{t-1} ← Q_{t-1}^{-1} E_{t-1}^T
        Q'_t ← A_t^T A_t − E_{t-1} U_{t-1} + γI
        g'_t ← A_t^T y_t
        x_{t|t} ← (Q'_t)^{-1} (g'_t − E_{t-1} v_{t-1})
        for ℓ = 1, ..., t do
            x_{t-ℓ|t} ← v_{t-ℓ} − U_{t-ℓ} x_{t-ℓ+1|t}
        end for
    end for
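For concreteness, the sketch below is a direct NumPy transcription of Algorithm 1; get_sample_batch is a placeholder for the application's data source, returning (y_0, A_0) for t = 0 and (y_t, A_t, B_t) for t ≥ 1. It illustrates the update order only and is not the authors' code.

    import numpy as np

    def streaming_least_squares(get_sample_batch, T, gamma):
        """Run Algorithm 1 for T batches and return the current estimates x_{t|T}."""
        y0, A0 = get_sample_batch(0)
        n = A0.shape[1]
        Qp = A0.T @ A0 + gamma * np.eye(n)             # Q'_0
        gp = A0.T @ y0                                 # g'_0
        x = [np.linalg.solve(Qp, gp)]                  # x_{0|0}
        Q, gq, v, E, U = [], [], [], [], []
        for t in range(1, T + 1):
            yt, At, Bt = get_sample_batch(t)
            Q.append(Qp + Bt.T @ Bt)                   # Q_{t-1}
            gq.append(gp + Bt.T @ yt)                  # g_{t-1}
            rhs = gq[-1] if t == 1 else gq[-1] - E[-1] @ v[-1]
            v.append(np.linalg.solve(Q[-1], rhs))      # v_{t-1} (forward sweep)
            E.append(At.T @ Bt)                        # E_{t-1}
            U.append(np.linalg.solve(Q[-1], E[-1].T))  # U_{t-1}
            Qp = At.T @ At - E[-1] @ U[-1] + gamma * np.eye(n)   # Q'_t
            gp = At.T @ yt                             # g'_t
            x.append(np.linalg.solve(Qp, gp - E[-1] @ v[-1]))    # x_{t|t}
            for ell in range(1, t + 1):                # full backward sweep
                x[t - ell] = v[t - ell] - U[t - ell] @ x[t - ell + 1]
        return x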

We now have the natural questions: under what conditions does an $x^*_t$ exist such that $\hat{x}_{t|T} \to x^*_t$, and if one exists, how fast do the solutions converge? Our first theorem provides one possible answer to these questions. At a high level, it says that while the solution in every frame changes as we go from $T$ to $T+1$, this effect is local if the block-tridiagonal system is block diagonally dominant. Well-conditioned $H_t$ and relatively small $E_t$ result in rapid (linear) convergence of $\hat{x}_{t|T} \to x^*_t$.

Theorem 2.1. Suppose that the $H_t$ and $E_t$ generated by Algorithm 1 obey the conditions of Lemma 2.1 and $\theta < 1 - \varepsilon^*$. Suppose also that the size of the $y_t$ are bounded as
\[
  M_y = \sup_{t \geq 0} \left\| \begin{bmatrix} y_t \\ y_{t+1} \end{bmatrix} \right\|.
\]
Then there exist $x^*_t$ such that
\[
  \hat{x}_{t|T} \to x^*_t \quad \text{as } t \leq T \to \infty,
\]
and there is a constant $C(\varepsilon^*, \theta, \delta)$ such that
\[
  \|\hat{x}_{t|T} - x^*_t\| \leq C(\varepsilon^*, \theta, \delta)\, M_y \left( \frac{\theta}{1-\varepsilon^*} \right)^{T-t}
\]
for all $t \leq T$.

The purpose of $\kappa$ in the theorem statement above is to make the result scale-invariant; a natural choice is to take $\kappa$ as the average of the largest and smallest eigenvalues of the matrices along the main block diagonal.

Finally, we note that the condition that $\theta < 1 - \varepsilon^*$ is closely related to asking the system matrix to be (strictly) block diagonally dominant. We could guarantee block diagonal dominance by asking that the smallest eigenvalue of each $H_t$ is larger than $\|E_{t-1}\| + \|E_t\|$, which we can ensure by taking $\theta/(1-\delta) < 1/2$. In the next theorem, we show that the exponential convergence of the least-squares estimate in Theorem 2.1 allows us to stop updating $\hat{x}_{t|T}$ once $T - t$ gets large enough with almost no loss in accuracy.

D. Truncating the updates

The result of Theorem 2.1, that the updates exhibit exponential convergence, suggests that we might be able to “prune” the updates by terminating the backtracking step early. In this section, we formalize this by bounding the additional error if we limit ourselves to a buffer of size $B$. This allows the algorithm to truly run online, as the memory and computational requirements remain bounded (proportional to $B$).

To produce the truncated result, we use a simple modification of Algorithm 1. The forward sweep remains the same (and exact), while the backtracking loop (the inner ‘for’ loop at the end) only updates the $B$ most recent frames, stopping at $\ell = B-1$:
    for ℓ = 1, ..., B − 1 do
        z_{t-ℓ|t} ← v_{t-ℓ} − U_{t-ℓ} z_{t-ℓ+1|t}
    end for
To avoid confusion, we have written the truncated solutions as $z_{t|t'}$, and we will use $z^*_t$ to denote the “final” value $z^*_t = z_{t|t+B-1}$. Note that to perform the truncated update, we only have to store the matrices $U_t$ and the vectors $v_t$ for $B$ steps in the past. The schematic diagram in Figure 1 illustrates the architecture and dynamical flow of the algorithm.

Fig. 1. Schematic diagram of streaming with the finite-memory data architecture: the buffers pertain to fast memory, which is assumed limited and hence used only to store the last $B$ loss functions and the latest corresponding estimated solution variables $\{z_{t|T}\}$, $t = T-B, \ldots, T$. After each solve, we have feedback to the buffer and, as a possible implementation, $z_{T-B|T}$ is offloaded to peripheral memory to keep track of the entire solution’s trajectory.
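A minimal sketch of the truncated backtracking step (our own illustration): the full inner loop of Algorithm 1 is replaced by an update of only the $B$ most recent frames, so only the last $B$ matrices $U_t$ and vectors $v_t$ ever need to be kept in fast memory.

    def truncated_backward_sweep(z, v, U, t, B):
        """Update only frames t-1, ..., t-B+1 at time t (lists z, v, U are indexed by
        frame, as in the sketch after Algorithm 1); earlier frames stay frozen."""
        for ell in range(1, min(B, t + 1)):
            z[t - ell] = v[t - ell] - U[t - ell] @ z[t - ell + 1]
        return z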


The following corollary shows that the effect of the buffer is mild: the error in the final answer decreases exponentially in the buffer size.

Corollary 2.1. Let $\{x^*_t\}$ denote the sequence of untruncated asymptotic solutions, and $\{z^*_t\}$ the final truncated solutions for a buffer size $B$. Under the conditions of Theorem 2.1, we have
\[
  \|x^*_t - z^*_t\| \leq C(\varepsilon^*, \theta, \delta)\, M_y \left( \frac{\theta}{1-\varepsilon^*} \right)^{B}
\]
for all $0 \leq t \leq T - B$.

The corollary is established simply by realizing that $z_{t|T} = \hat{x}_{t|T}$ for $T = t, \ldots, t+B$ and then applying Theorem 2.1. Section V-A demonstrates that for a practical application, excellent accuracy can be achieved with a modest buffer size. There we look at the problem of signal reconstruction from non-uniform level crossings.

III. STREAMING WITH CONVEX COST FUNCTIONS

In this section, we consider solving (1) in the more general case where the $f_t$ are convex functions. In studying the least-squares case above, we saw that the coupling between the variables means that adding a term to (1) (increasing $T$) requires an update of the entire solution. We also saw that if the least-squares system in (5) is block diagonally dominant, implying in some sense that the coupling between the $x_t$ is “loose,” then these updates are essentially local, in that the magnitudes of the updates decay exponentially as they backpropagate. Below, we will see that a similar effect occurs for general smooth and strongly convex $f_t$.

We present our main theoretical result, the linear convergence of the updates, in two steps. Theorem 3.1 shows that if it is possible to find an initialization point with uniformly bounded gradients, then we have the same exponential decay in the updates to the $x_t$ as $t$ moves backward away from $T$. Theorem 3.2 shows that we can guarantee such initialization points when the Hessian of the aggregated objective is block diagonally dominant.

A. Convergence of the streaming solution

Let $S^{2,1}_{\mu,L}(D)$ denote the class of $\mu$-strongly convex functions with two continuous derivatives, with their first derivatives $L$-Lipschitz continuous on $D$.

Assumption 3.1. For all $t \geq 1$, we assume that:
(1) $f_t \in S^{2,1}_{\mu_t, L_t}(\mathbb{R}^n \times \mathbb{R}^n)$;
(2) $L_t$ and $\mu_t$ are uniformly bounded,
\[
  0 < \mu_{\min} \leq \mu_t \leq L_t \leq L_{\max} < \infty.
\]

This assumption is equivalent to a uniform bound on the eigenvalues of the Hessians of the $f_t$; for every $t \geq 0$ and every $x, y \in D$ we have
\[
  \mu_t \leq \lambda_{\min}(\nabla^2 f_t(x, y)) \leq \lambda_{\max}(\nabla^2 f_t(x, y)) \leq L_t.
\]

The following lemma translates the uniform bounds on the convexity and Lipschitz constants of the $f_t$ into similar bounds on the aggregate function.

Lemma 3.1. If $f_t(x_{t-1}, x_t) \in S^{2,1}_{\mu_t, L_t}(\mathbb{R}^n \times \mathbb{R}^n)$, then $J_T = \sum_{t=1}^{T} f_t \in S^{2,1}_{\mu, L}(\mathbb{R}^n \times \cdots \times \mathbb{R}^n)$, for $L = 2L_{\max}$ and $\mu = \mu_{\min}$.

Lemma 3.1 above is enough to establish the convergence of the updates provided that we can assume uniform bounds on the gradients at the initialization points.

Theorem 3.1. Suppose that the $f_t$ are smooth and strongly convex as in Assumption 3.1. Suppose also that there exists a constant $M_g$ and a set of initialization points $\{w_T\}$ such that
\[
  \|\nabla f_T(\hat{x}_{T-1|T-1}, w_T)\| \leq M_g, \tag{9}
\]
for all $T > 0$. Then there exist $x^*_t$ such that
\[
  \hat{x}_{t|T} \to x^*_t \quad \text{as } t \leq T \to \infty,
\]
and a constant $C_1$ that depends on $\mu_{\min}, L_{\max}, M_g$ such that
\[
  \|\hat{x}_{t|T} - x^*_t\| \leq C_1 \left( \frac{2L_{\max} - \mu_{\min}}{2L_{\max} + \mu_{\min}} \right)^{T-t}. \tag{10}
\]

We prove Theorem 3.1 in Appendix D. The argument works by tracing the steps the gradient descent algorithm takes when minimizing $J_T$ after being initialized at the minimizer of $J_{T-1}$ (with the newly introduced variables initialized at a $w_T$ satisfying (9)).

The condition (9) is slightly unsatisfying as it relies on properties of the $f_t$ around the global solutions. We will see in Theorem 3.2 below how to remove this condition by adding additional assumptions on the intrinsic structure of the $f_t$ and their relationships. Nevertheless, (9) does not seem unreasonable. For example, if the solutions $\{\hat{x}_{T|T}\}$ are shown to be uniformly bounded, then we could use the fact that the $f_t$ have Lipschitz gradients to establish (9) through an appropriate choice of $w_T$.

Although our theory lets us consider any $w_T$, there are two very natural choices in practice. One option is to simply minimize $f_T$ with the first variable fixed at the previous solution,
\[
  w_T = \arg\min_{w} f_T(\hat{x}_{T-1|T-1}, w). \tag{11}
\]
Alternatively, $w_T$ could be fixed a priori at the $\tilde{x}_{T|T}$ that minimizes $f_T$ in isolation (see the definition in (12) below).
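A small sketch of the first option (11), assuming SciPy is available and the new loss $f_T$ is supplied as a callable of its two block arguments; this is one convenient way to compute the warm start, not a prescribed implementation.

    import numpy as np
    from scipy.optimize import minimize

    def warm_start(f_T, x_prev, n):
        """Option (11): minimize w -> f_T(x_{T-1|T-1}, w) starting from zero.
        f_T is a placeholder callable; x_prev is the last frame of the previous solution."""
        res = minimize(lambda w: f_T(x_prev, w), np.zeros(n), method="BFGS")
        return res.x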

B. Boundedness through block diagonal dominance

By adding some additional assumptions on the structure of the problem, we can guarantee condition (9) in Theorem 3.1. Our first (very mild) additional assumption is that the minimizers of the individual $f_t$, when computed in isolation, are bounded.

Assumption 3.2. With the minimizers of the isolated $f_t$ denoted as
\[
  (\tilde{x}_{t-1|t}, \tilde{x}_{t|t}) := \arg\min_{x_{t-1}, x_t} f_t(x_{t-1}, x_t), \tag{12}
\]
we assume that there exists a constant $M_x$ such that for all $t \geq 0$
\[
  \left\| \begin{bmatrix} \tilde{x}_{t-1|t} \\ \tilde{x}_{t|t} \end{bmatrix} \right\| \leq M_x. \tag{13}
\]


The strong convexity of the $f_t$ implies that these isolated minimizers are unique.

Note that there are two local solutions for the variables $x_t$ at time $t$: $\tilde{x}_{t|t}$ is the second argument of the minimizer of $f_t$, while $\tilde{x}_{t|t+1}$ is the first argument of the minimizer of $f_{t+1}$. Assumption 3.2 also implies that these are close: $\|\tilde{x}_{t|t} - \tilde{x}_{t|t+1}\| \leq 2M_x$. We emphasize that Assumption 3.2 only prescribes structure on the minimizers of the individual $f_t$, not on the minimizers of the aggregate $J_T$. Showing how this assumed (but very reasonable) bound on the size of the minimizers of the individual $f_t$ translates into a bound on the size of the minimizers $\hat{x}_{t|T}$ of $J_T$ is a large part of Theorem 3.2.

The last piece of structure that allows us to make this link comes from the Hessian of the objective $J_T$. The fact that $f_t$ depends only on variables $x_{t-1}$ and $x_t$ means that the Hessian is block-tridiagonal, similar to the system matrix (5) in the least-squares case:
\[
  \nabla^2 J_T =
  \begin{bmatrix}
    H_0 & E_0^{\mathrm{T}} & & & \\
    E_0 & H_1 & E_1^{\mathrm{T}} & & \\
    & \ddots & \ddots & \ddots & \\
    & & E_{T-2} & H_{T-1} & E_{T-1}^{\mathrm{T}} \\
    & & & E_{T-1} & H_T
  \end{bmatrix},
\]

where the main diagonal terms are given by²
\[
  H_t =
  \begin{cases}
    \nabla_{0,0} f_1(x_0, x_1), & t = 0; \\
    \nabla_{t,t}\big(f_t(x_{t-1}, x_t) + f_{t+1}(x_t, x_{t+1})\big), & 1 \leq t < T; \\
    \nabla_{T,T} f_T(x_{T-1}, x_T), & t = T;
  \end{cases}
\]
and the off-diagonal terms by
\[
  E_t = \nabla_{t+1,t} f_{t+1}(x_t, x_{t+1}), \quad t = 0, \ldots, T-1.
\]

If the Hessian is block diagonally dominant everywhere, then we can leverage the boundedness of the isolated solutions (13) to show the boundedness of the aggregate solutions $\hat{x}_{t|T}$.

Theorem 3.2. Suppose that the $f_t$ are as in Assumption 3.1 and let
\[
  \kappa := \frac{2L_{\max} + \mu_{\min}}{2}, \qquad
  \delta := \frac{2L_{\max} - \mu_{\min}}{2L_{\max} + \mu_{\min}},
\]
and suppose that for another constant $\theta > 0$ it holds that
\[
  \|E_t(x)\| \leq \kappa\theta, \quad \forall t \geq 0.
\]
If it holds that $\theta < (1-\delta)/2$, and in addition the isolated minimizers $\{(\tilde{x}_{t|t}, \tilde{x}_{t|t+1})\}_{t=0}^{T}$ are bounded as in Assumption 3.2, then the minimizers $\{\hat{x}_{t|T}\}_{t=0}^{T}$ of $J_T$ will be bounded as
\[
  \|\hat{x}_{t|T}\| \leq \frac{M_g(1 - \rho^{T-t+1})}{(1-\varepsilon^*)(1-\rho)^2}, \tag{14}
\]
where
\[
  M_g = 2 M_x \kappa \sqrt{L_{\max}^2 + \theta^2},
\]
for some $\varepsilon^*, \rho$ with $\rho = \theta/(1-\varepsilon^*) < 1$.

²We use $\nabla_{i,j}(\cdot)$ for $\partial^2(\cdot)/\partial x_i \partial x_j$.

The bound (14) on the size of the solutions can be used to bound the size of the gradient in (9). If we initialize the variables in frame $T$ as the isolated minimizer from (12), $w_T = \tilde{x}_{T|T}$, we have
\[
  \|\nabla f_T(\hat{x}_{T-1|T-1}, w_T)\|
  = \|\nabla f_T(\hat{x}_{T-1|T-1}, \tilde{x}_{T|T}) - \nabla f_T(\tilde{x}_{T-1|T}, \tilde{x}_{T|T})\|
  \leq L_{\max} \|\hat{x}_{T-1|T-1} - \tilde{x}_{T-1|T}\|
  \leq L_{\max} \big( \|\hat{x}_{T-1|T-1}\| + \|\tilde{x}_{T-1|T}\| \big).
\]
These two terms can then be bounded by (14) and (13). Thus the conditions of Theorem 3.2 ensure the rapid convergence in (10).

As in the least-squares case, the $\kappa$ in Theorem 3.2 is just a scaling constant so that the eigenvalues of the $\kappa^{-1} H_t$ are within $1 \pm \delta$. Perhaps more important is the condition that $\theta < (1-\delta)/2$. We can interpret this condition as meaning that the coupling between the $f_t$ is “loose”: in a second-order approximation to $J_T$, the quadratic interactions of the $x_t$ with themselves are stronger than their interactions with $x_{t-1}$ or $x_{t+1}$. As in the least-squares case, if the Hessian is strictly block diagonally dominant, then again $(1-\delta) > 2\theta$.

In the least-squares case, the fast convergence of the updates enabled efficient computations through truncation of the backward updates. As we show next, truncation is also effective in the general convex case.

C. Streaming with finite updates

Just as we did in the least-squares case, we can control the error in the convex case when the updates are truncated. While our goals parallel those in Section II-D, the analysis is more delicate. The main difference in the general convex case is that when we stop updating a set of variables $x_{t|T}$, it introduces (small) errors in the future estimates $x_{t'|T'}$ for $t' > t$, $T' \geq T$; we were able to avoid this in the least-squares case since the forward “state” variables $v_t$ carry all the information needed to compute future estimates optimally. Nonetheless, we show that the errors introduced by truncation in the convex case are manageable.

Consider the truncation of the objective functional at time $T$ that keeps its last $B$ loss terms,
\[
  J_{T,B}(x_{T-B}, \ldots, x_T) = \sum_{t=T-B+1}^{T} f_t(x_{t-1}, x_t). \tag{15}
\]

The overlapping structure of the $f_t$ implies that fixing the value of any one block variable ($x_{T-B}$ for our purpose) breaks the problem into two independent problems. Moreover, the value at which we fix the variable also determines the (unique) solution for all the other variables.

Proposition 3.1. Let $\{y^*_{T-B+i}\}_{i=0}^{B}$ be the solution to
\[
  \min\; J_{T,B}(y_{T-B}, \ldots, y_T), \quad \text{s.t. } y_{T-B} = \hat{x}_{T-B|T}.
\]
Then the solution satisfies
\[
  y^*_{T-B+i} = \hat{x}_{T-B+i|T}, \quad i = 1, \ldots, B.
\]

As before, we denote the truncated solutions as $z_{t|t'}$ and their final values $z^*_t = z_{t|T}$ for $T > t + B$.


In the spirit of Proposition 3.1, at each time step $T$, we fix $z_{T-B} = z^*_{T-B}$ as a boundary condition, then minimize the last $B$ terms in the sum in (1), setting $t' = T-B+1$:
\[
  \underset{(z_{t'}, \ldots, z_T)}{\text{minimize}} \quad f_{t'}(z^*_{t'-1}, z_{t'}) + \sum_{t=t'+1}^{T} f_t(z_{t-1}, z_t). \tag{16}
\]

It should be clear from Proposition 3.1 and the problem formulation in (16) that any difference between $\hat{x}_{T-B+i|T}$ and $z_{T-B+i|T}$ depends entirely on the difference between $\hat{x}_{T-B|T}$ and $z^*_{T-B}$.

Lemma 3.2. Suppose the conditions of Theorem 3.2 hold. Then
\[
  \|\hat{x}_{T-B+i|T} - z_{T-B+i|T}\| \leq \frac{\theta}{(1-\delta)} \|\hat{x}_{T-B|T} - z^*_{T-B}\|,
\]
for $i = 1, \ldots, B$.

We give the proof of the lemma in Appendix F. Theorem 3.3 below builds on the result of the lemma to establish an error bound between $\{x^*_t\}$ and $\{z^*_t\}$ as a function of $B$.

Theorem 3.3. Let $\{x^*_t\}$ denote the sequence of untruncated asymptotic solutions, and $\{z^*_t\}$ the final truncated solutions for a buffer size $B$. Under the conditions of Theorem 3.2, we have
\[
  \|x^*_t - z^*_t\| \leq C_b \left( \frac{2L_{\max} - \mu_{\min}}{2L_{\max} + \mu_{\min}} \right)^{B}
\]
for some positive constant $C_b(\mu_{\min}, L_{\max})$.

Theorem 3.3 shows that, again, the truncation error shrinks exponentially as we increase the buffer size, where the shrinkage factor depends on the convex conditioning of the problem. In the next section, we leverage this result to derive an online truncated Newton algorithm.

IV. NEWTON ALGORITHM

In this section, we present an efficient truncated Newton online algorithm (NOA). The main result enabling this derivation stems from the bound given in Theorem 3.3 in Section III and from the Hessian’s block-tridiagonal structure. It implies that we can compute each Newton step using the same forward and then backward sweep we derived for the least-squares case. Moreover, Theorem 3.3 implies that the solve can be implemented with finite memory at a moderate computational cost that does not change in time. Hence, our proposed algorithm sidesteps the otherwise prohibitive complexity hurdle associated with Newton’s method while maintaining its favorable quadratic convergence.

A. Setting up with truncated updates

Assumption 4.1. $\nabla^2 J_T$ is $M$-Lipschitz continuous.

Assumption 4.1 is standard when considering Newton’s method. Combined with the assumptions stated in Section III, we have sufficient conditions to guarantee the (local) quadratic convergence of Newton’s algorithm (see, e.g., [43]). For smooth convex functions, the first-order optimality condition implies the solution to the nonlinear system $\nabla J_T(x_T) = 0$ is also the optimal solution to (1); hence solving (16) is equivalent to solving
\[
  F(z_{T-B+1}, \ldots, z_T) = 0, \tag{17}
\]
where
\[
  F(z_{T-B+1}, \ldots, z_T) = \nabla J_{T,B}(z^*_{T-B}, z_{T-B+1}, \ldots, z_T).
\]

B. The Newton Online Algorithm (NOA)

To avoid confusion with the batch estimates, we use $\{y^{(k)}\}$ to denote the sequence of updates obtained using the Newton iterations. The three main blocks of the algorithm, the system update, the initialization, and the computation of the Newton step, are described next.

1) Newton system: The Newton method solves the system in (17) iteratively. Starting with an initial guess $y^{(0)}$, at every iteration we update with
\[
  s^{(m)} = -F'(y^{(m)})^{-1} F(y^{(m)}), \qquad
  y^{(m+1)} = y^{(m)} + \tau^{(m)} s^{(m)},
\]
where $\tau^{(m)} \in (0, 1]$ controls the step size, ensuring global convergence.

2) Initialization: The smoothness of the objective implies that a warm start is a reasonable choice. Hence, we initialize as
\[
  y^{(0)}_t := z_{t|T-1}, \quad 0 \leq t \leq T-1, \tag{18}
\]
where for the new block variable $y^{(0)}_T$ we recommend using one of the two options proposed in (11) or (12).

3) Computing the Newton step: $F'$ has the same block-tridiagonal structure as in the least-squares case, meaning we can compute the Newton step with the same LU forward-backward solver derived in Section II. A big difference, however, is that for nonlinear systems the factorized $Q_t$ and $U_t$ blocks are no longer stationary; updating $y^{(k)}$ updates $F'$ and hence updates the factorization blocks. In other words, except for the first step, where the change in $F'$ is still local, we need to compute a new LU factorization for every Newton step.³

Although every step involves recomputing the factorization, the exponential convergence of the updates means that we can get away with good accuracy even for a small $B$, which, together with the sparsity of the Hessian, implies a very reasonable complexity of the order $O(3Bn^3)$, independent of $T$.
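The sketch below illustrates one damped Newton iteration of NOA over the buffered frames, reusing the block_tridiag_solve routine sketched in Section II-B; grad_blocks and hess_blocks are placeholders that the application must supply (the block gradient of $F$ and the diagonal/sub-diagonal blocks of $F'$), and the step size $\tau$ would in practice come from a line search.

    def noa_newton_step(z, grad_blocks, hess_blocks, tau=1.0):
        """One damped Newton update of the B buffered frames z_{T-B+1}, ..., z_T.
        grad_blocks(z) -> list of gradient blocks of F at z;
        hess_blocks(z) -> (H, E), the diagonal and sub-diagonal blocks of F'(z)."""
        g = grad_blocks(z)
        H, E = hess_blocks(z)
        s = block_tridiag_solve(H, E, [-gt for gt in g])   # Newton direction: F'(z) s = -F(z)
        return [zt + tau * st for zt, st in zip(z, s)]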

V. NUMERICAL EXAMPLES

In this section, we consider two working examples to showcase the proposed algorithms. In the first example, we use Algorithm 1 from the least-squares section for streaming reconstruction of a signal from its level-crossing samples, and in the second example, we use NOA from Section IV to efficiently solve a neural spiking data regression.

³For $s^{(0)}$, the initialization means only the last LU blocks change, so we can carry most blocks over from the previous solve.


A. Least-squares example

Inspired by the recent interest in level-crossing analog-to-digital converters (ADCs), we consider the problem of reconstructing a signal from its level crossings. Rather than sample a continuous-time signal $x(t)$ on a set of uniformly spaced times, level-crossing ADCs output the times at which $x(t)$ crosses one of a predetermined set of levels. The result is a stream of samples taken at non-uniform (and signal-dependent) locations. An illustration is shown in Figure 2. The particulars of the experiment are as follows. A bandlimited signal was randomly generated by superimposing sinc functions (with the appropriate widths) spaced $1/64$ apart on the interval $[-5, 21]$; the sinc functions’ heights were drawn from a standard normal distribution. The level crossings were computed for $L = 16$ different levels equally spaced in $[-2.5, 2.5)$ over the time interval $[-0.25, 16.25]$. This produced 4677 samples.

The signal was reconstructed using the lapped orthogonal transform (LOT). We applied the LOT to cosine-IV basis functions, resulting in a set of 16 frames of orthonormal basis bundles, each with $N = 75$ basis functions and transition width $\eta = 1/4$. A single sample $x(t_m)$ in batch $k$ (so $t_m \in \mathcal{T}_k$) can then be written in terms of the expansion coefficients in frame bundles $k-1$ and $k$ as
\[
  x(t_m) = \sum_{n=1}^{N} x_{k-1,n} \psi_{k-1,n}(t_m) + \sum_{n=1}^{N} x_{k,n} \psi_{k,n}(t_m)
  = \langle x_{k-1}, b_m \rangle + \langle x_k, a_m \rangle,
\]
where $x_{k-1}, x_k \in \mathbb{R}^N$ are the coefficient vectors (across all $N$ components) in bundles $k-1$ and $k$, and $a_m, b_m \in \mathbb{R}^N$ are samples of the basis functions at $t_m$ (and are independent of the actual signal $x(t)$). Collecting the corresponding measurement vectors $a_m, b_m$ for all $M_k = |\mathcal{M}_k|$ samples in batch $k$ together as rows gives the $M_k \times N$ matrices $A_k$ and $B_k$ from (3), and we can then use Algorithm 1 to reconstruct the signal.

The results in Figure 2 show the reconstructed signal at three consecutive time frames. As is evident from the graphs, already with a buffer of three frames we achieve seven digits of accuracy. Hence terminating the loop in Algorithm 1 early, setting $B = 3$, costs us almost nothing in terms of reconstruction performance.

B. Nonlinear regression of intensity function

In this example, we consider the problem of estimating a neuron’s intensity function, which amounts to a nonlinear convex program: recover a signal from non-uniform samples of a non-homogeneous Poisson process (NHPP). Estimating the rate function is a fundamental problem in neuroscience [44], as the spikes’ temporal pattern encodes information that characterizes the neuron’s internal properties. A standard model for neural spiking under a rate model [45]–[47] assumes that, given the rate function, the spiking observations follow a Poisson likelihood.

Given a time series of spikes $\mathcal{H}_T = \{\tau_1, \ldots, \tau_m\}$ in $[0, T]$, generated by $\mathcal{H}_T \mid \lambda(t) \sim \text{Poisson}(\lambda(t))$, we want to estimate the underlying rate function $\lambda(t)$ using the maximum likelihood estimator [48]–[50]. The associated optimization program is
\[
  \hat{\lambda}(t) = \arg\min_{\lambda(t)} \int_0^T \lambda(t)\,dt - \sum_{i=1}^{m} \log(\lambda(\tau_i)). \tag{19}
\]

To make (19) well-posed, we need to introduce a model for $\lambda(t)$. One simple model is to write $\lambda$ as a superposition of basis functions, $\lambda(t) = \sum_i x_i \psi_i(t)$. Splines, and in particular (cardinal) B-splines, have been successfully used to capture the relationship between the neural spiking and the underlying intensity function [51], [52].

B-splines have properties (see, e.g., [53], [54]) that make them favorable for this particular application. They are non-negative everywhere, so restricting $x$ to $\mathbb{R}_+$ guarantees $\lambda(t) \geq 0$. They have compact support (minimal support for a given smoothness degree), which nicely breaks the basis functions into overlapping frames. In addition, they are convenient to work with from the point of view of numerical analysis. The results below were obtained with second-order B-splines, but higher-order models can be obtained in the same way.

We divide the time axis into short intervals (“frames”) and associate with each frame $N$ B-splines. The frames’ length is set such that for each $t$, $\lambda(t)$ is expressed by basis functions from at most two frames (cf. [55] for additional details).
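For reference, a cardinal second-order (quadratic) B-spline can be evaluated from its standard piecewise formula as in the sketch below (an illustrative helper, not tied to the paper's exact frame layout); the printed check confirms the partition-of-unity property of the shifted splines.

    import numpy as np

    def bspline2(x):
        """Cardinal quadratic B-spline supported on [0, 3]."""
        x = np.asarray(x, dtype=float)
        return np.where((0 <= x) & (x < 1), 0.5 * x**2,
               np.where((1 <= x) & (x < 2), 0.5 * (-2 * x**2 + 6 * x - 3),
               np.where((2 <= x) & (x < 3), 0.5 * (3 - x)**2, 0.0)))

    x = np.linspace(2.0, 3.0, 5)
    print(sum(bspline2(x - k) for k in range(-2, 3)))   # integer shifts sum to 1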

To cast (19) as in (1), we define the local loss functions $f_t$ as
\[
  f_t(x_{t-1}, x_t) = \langle x_t, a_t \rangle + \langle x_{t-1}, b_t \rangle
  - \sum_{m \in \mathcal{M}_t} \log\big( \langle x_t, c_{t,m} \rangle + \langle x_{t-1}, d_{t,m} \rangle \big),
\]
where $x_{k-1}, x_k \in \mathbb{R}^N$ are the coefficient vectors in the $(k-1)$st and $k$th frames, and $a_k, b_k \in \mathbb{R}^N$ are the basis functions from the same frames integrated over the $k$th frame:
\[
  a_k = \int_{T_{k-1}}^{T_k}
  \begin{bmatrix} \psi_{k,1}(t) \\ \psi_{k,2}(t) \\ \vdots \\ \psi_{k,N}(t) \end{bmatrix} dt,
  \qquad
  b_k = \int_{T_{k-1}}^{T_k}
  \begin{bmatrix} \psi_{k-1,1}(t) \\ \psi_{k-1,2}(t) \\ \vdots \\ \psi_{k-1,N}(t) \end{bmatrix} dt,
\]
and $c_{k,m}, d_{k,m} \in \mathbb{R}^N$ are samples of the basis functions at events observed during the $k$th frame:
\[
  c_{k,m} =
  \begin{bmatrix} \psi_{k,1}(\tau_m) \\ \psi_{k,2}(\tau_m) \\ \vdots \\ \psi_{k,N}(\tau_m) \end{bmatrix},
  \qquad
  d_{k,m} =
  \begin{bmatrix} \psi_{k-1,1}(\tau_m) \\ \psi_{k-1,2}(\tau_m) \\ \vdots \\ \psi_{k-1,N}(\tau_m) \end{bmatrix},
\]
with $\mathcal{M}_k = \{m : \tau_m \in [T_{k-1}, T_k)\}$.

The MLE optimization program can then be rewritten as
\[
  \arg\min_{(x_0, \ldots, x_T)} \sum_{t=1}^{T} f_t(x_{t-1}, x_t), \quad \text{s.t. } x_T \geq 0. \tag{20}
\]
The condition $x_T \geq 0$ is set to ensure $\lambda(t) \geq 0$ for all $t$. We solve (20) with the NOA algorithm described in Section IV with the following modification: since NOA was originally derived for unconstrained optimization problems, we use a log-barrier modification as in [56, §11].
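To make the connection to NOA concrete, the following sketch (our own illustration, with hypothetical names) evaluates one local loss $f_t$ together with its block gradient and Hessian blocks, which are exactly the quantities the Newton solver of Section IV consumes; the cross block Hpc plays the role of the coupling block $E_t$.

    import numpy as np

    def poisson_frame_loss(x_prev, x_cur, a, b, C, D):
        """Local NHPP negative log-likelihood f_t with its block derivatives.
        a, b: (N,) integrated basis vectors a_t, b_t;
        C, D: (M_t, N) matrices whose rows are c_{t,m}, d_{t,m} at the observed spikes."""
        lam = C @ x_cur + D @ x_prev                   # lambda(tau_m) at the spikes in frame t
        val = a @ x_cur + b @ x_prev - np.sum(np.log(lam))
        grad_cur = a - C.T @ (1.0 / lam)               # d f / d x_t
        grad_prev = b - D.T @ (1.0 / lam)              # d f / d x_{t-1}
        W = 1.0 / lam**2
        Hcc = C.T @ (W[:, None] * C)                   # d^2 f / d x_t d x_t
        Hpp = D.T @ (W[:, None] * D)                   # d^2 f / d x_{t-1} d x_{t-1}
        Hpc = D.T @ (W[:, None] * C)                   # cross block (plays the role of E_t)
        return val, (grad_prev, grad_cur), (Hpp, Hpc, Hcc)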



Fig. 2. (a) Level crossing samples; (b)-(d) The original signal (blue) and the reconstructed signal (orange) at time steps k = 4, 5, 6.

Fig. 3. Illustration of the convergence of the corrections as they backpropagate. The figures show, from top to bottom, $\hat{\lambda}(t)$ at $T = 24, 26, 28$. Following the dashed rectangles, moving down the plots shows the convergence to $\lambda^*(t)$ (the dashed line).

The simulation data was created by generating a random smooth intensity function $\lambda^*(t)$, which was then used to simulate events (“spikes”) following a non-homogeneous Poisson distribution using the standard thinning method [57]. Focusing on the effect of online estimation, we set the “ground truth” reference as the batch solution, to which we then compare the online estimates.

The rapid convergence of the updates is depicted in Figure 3, and in Figure 4 we show the effect of truncating the updates on the solution accuracy.

VI. SUMMARY AND FUTURE WORK

This paper focused on optimization problems expressed as a sum of compact convex loss functions that locally share variables. We have shown that the updates converge rapidly under mild conditions that correspond to loose coupling between the variables. The main impact of the convergence result for the updates is that it allows us to approximate the solution by truncating the updates early with a negligible sacrifice of accuracy. The primary underlying mechanism driving these results, and one that resurfaced throughout, was the block-tridiagonal structure of the Hessian.

Fig. 4. 3D illustration of the truncation error for various buffer sizes. The (log) magnitude of the asymptotic error is plotted for different buffer sizes against time; for a fixed $B$, the error is uniformly distributed in time and does not accumulate. It decreases exponentially as $B$ increases, as predicted by the theoretical analysis.

Arising from the structure of the derivatives of the loss functions, the block-tridiagonal structure led to efficient numerical algorithms and played a prominent role in the analysis and proofs of many of the results in this paper. Future work includes extending these results to problems associated with more general variable dependency graphs and to optimization programs with local constraints.

APPENDIX

A. Proof of Lemma 2.2

Our proof uses the simple fact that a recursive set of inequalities of the form
\[
  z_0 \leq b, \qquad z_t \leq b + a z_{t-1}, \qquad a, b \geq 0, \tag{21}
\]
will also obey, if $a < 1$,
\[
  z_t \leq b \left( \frac{1 - a^{t+1}}{1 - a} \right) \leq \frac{b}{1 - a}.
\]

In the forward sweep in (7), we have
\[
  \|v_0\| = \|Q_0^{-1} g_0\| \leq M/(1-\varepsilon^*),
\]
and then for $t = 1, \ldots, T-1$,
\[
  \|v_t\| = \|Q_t^{-1}(g_t - E_{t-1} v_{t-1})\| \leq M/(1-\varepsilon^*) + \rho \|v_{t-1}\|
  \;\;\Longrightarrow\;\;
  \|v_t\| \leq \frac{M}{(1-\varepsilon^*)(1-\rho)} =: M_v.
\]


For the backward sweep in (8), we have
\[
  \|x_T\| = \|v_T\| \leq M_v,
\]
and then for $t = T-1, \ldots, 0$,
\[
  \|x_t\| = \|v_t - Q_t^{-1} E_t^{\mathrm{T}} x_{t+1}\| \leq M_v + \rho \|x_{t+1}\|
  \;\;\Longrightarrow\;\;
  \|x_t\| \leq M_v \left( \frac{1 - \rho^{T-t+1}}{1 - \rho} \right)
  = \frac{M(1 - \rho^{T-t+1})}{(1-\varepsilon^*)(1-\rho)^2}
  \leq \frac{M}{(1-\varepsilon^*)(1-\rho)^2}.
\]

B. Proof of Theorem 2.1

We start with a simple relation that connects the correction in frame bundle $t$ to the correction in bundle $t-1$ as we move from measurement batch $T \to T+1$. From the update equations, we see that for $t \leq T$,
\[
  \hat{x}_{t-1|T+1} - \hat{x}_{t-1|T} = -U_{t-1}\big(\hat{x}_{t|T+1} - \hat{x}_{t|T}\big),
\]
so
\[
  \|\hat{x}_{t-1|T+1} - \hat{x}_{t-1|T}\| \leq \|U_{t-1}\| \cdot \|\hat{x}_{t|T+1} - \hat{x}_{t|T}\|
  \leq \frac{\theta}{1-\varepsilon^*} \cdot \|\hat{x}_{t|T+1} - \hat{x}_{t|T}\|,
\]
where we used the fact that $\|U_{t-1}\| = \|Q_{t-1}^{-1} E_{t-1}^{\mathrm{T}}\| \leq (1-\varepsilon^*)^{-1}\theta$. Applying this bound iteratively and using $t = T$, we can bound the correction error $\ell$ frames back as
\[
  \|\hat{x}_{T-\ell|T+1} - \hat{x}_{T-\ell|T}\| \leq \left( \frac{\theta}{1-\varepsilon^*} \right)^{\ell} \|\hat{x}_{T|T+1} - \hat{x}_{T|T}\|. \tag{22}
\]

This says that the size of the update decreases geometrically as it backpropagates through previously estimated frames. We can get a uniform bound on the size of the initial update $\|\hat{x}_{T|T+1} - \hat{x}_{T|T}\|$ by using Lemma 2.2 with
\[
  M = \max_t \left\| \begin{bmatrix} A_t^{\mathrm{T}} & B_{t+1}^{\mathrm{T}} \end{bmatrix}
  \begin{bmatrix} y_t \\ y_{t+1} \end{bmatrix} \right\| \leq \sqrt{1+\delta}\, M_y,
\]
to get
\[
  \|\hat{x}_{T|T+1} - \hat{x}_{T|T}\| \leq \|\hat{x}_{T|T+1}\| + \|\hat{x}_{T|T}\|
  \leq \frac{M(2+\rho)}{(1-\varepsilon^*)(1-\rho)} =: M_x.
\]

Thus we have
\[
  \|\hat{x}_{t|T+1} - \hat{x}_{t|T}\| \leq M_x \left( \frac{\theta}{1-\varepsilon^*} \right)^{T-t}, \quad t \leq T,
\]
and the $\{\hat{x}_{t|T}\}$ converge to some limit $\{x^*_t\}$. We can write the difference of the estimate $\hat{x}_{t|T}$ from its limit point as the telescoping sum
\[
  \hat{x}_{t|T} - x^*_t = \sum_{\ell=0}^{\infty} \hat{x}_{t|T+\ell} - \hat{x}_{t|T+\ell+1},
\]
and then using the triangle inequality,
\[
  \|\hat{x}_{t|T} - x^*_t\| \leq \sum_{\ell=0}^{\infty} \|\hat{x}_{t|T+\ell} - \hat{x}_{t|T+\ell+1}\|
  \leq \sum_{\ell=0}^{\infty} M_x \left( \frac{\theta}{1-\varepsilon^*} \right)^{T+\ell-t}
  = M_x \left( \frac{1-\varepsilon^*}{1-\varepsilon^*-\theta} \right) \left( \frac{\theta}{1-\varepsilon^*} \right)^{T-t}. \tag{23}
\]

C. Proof of Lemma 3.1

We establish upper and lower bounds on the eigenvalues of the Hessian of $J_T$. Let $H = \nabla^2 J_T$ and⁴ $G_t = \nabla^2 f_t(x_{t-1}, x_t)$, and consider the factorization $H = H_o + H_e$ with $H_o = G_1 + G_3 + \cdots$ and $H_e = G_2 + G_4 + \cdots$. The fact that $f_t$ depends only on variables $x_{t-1}$ and $x_t$ implies that $H_o$ and $H_e$ are both block diagonal matrices, and so both $\|H_o\|$ and $\|H_e\|$ have the easy upper bound $L_{\max}$. It follows that $\|H\| \leq 2L_{\max}$.

For the lower bound, let $y$ be an arbitrary block vector. Then
\[
  y^{\mathrm{T}} H y = y^{\mathrm{T}} \Big( \sum_i G_i \Big) y = \sum_i y^{\mathrm{T}} G_i y
  = \sum_i \begin{bmatrix} y_{i-1}^{\mathrm{T}} & y_i^{\mathrm{T}} \end{bmatrix} G_i \begin{bmatrix} y_{i-1} \\ y_i \end{bmatrix}
  \geq \mu_{\min} \sum_i \left\| \begin{bmatrix} y_{i-1} \\ y_i \end{bmatrix} \right\|^2
  \geq \mu_{\min} \sum_i \|y_i\|^2 = \mu_{\min} \|y\|^2.
\]
Thus the smallest eigenvalue of $H$ is at least $\mu_{\min}$.

⁴We assume that the $G_t$ are zero padded to be the same size as $H$.

D. Proof of Theorem 3.1

Recall that $\hat{x}_{t|T-1}$ is frame $t$ of the minimizer of $J_{T-1}$, and $\hat{x}_{t|T}$ is frame $t$ of the minimizer of $J_T$. We will start by upper bounding how much the solution in frame $t$ moves as we transition from the solution of $J_{T-1}$ to the solution of $J_T$. We bound this quantity, $\|\hat{x}_{t|T} - \hat{x}_{t|T-1}\|$, by tracing the steps of the gradient descent algorithm for minimizing $J_T$ when initialized at the minimizer of $J_{T-1}$ and an appropriate choice for the new variables introduced in frame $T$. Next, we give a simple argument that the $\{\hat{x}_{t|T}\}_{t \leq T}$ form a Cauchy (and hence convergent) sequence.

To avoid confusion in the notation, we will use $y^{(k)} = \{y^{(k)}_t\}_{t=0}^{T}$ to denote the gradient descent iterates. We initialize $y^{(0)}$ with the $\{\hat{x}_{t|T-1}\}_t$ and a $w_T$ that obeys (9),
\[
  y^{(0)}_t :=
  \begin{cases}
    \hat{x}_{t|T-1}, & t = 0, \ldots, T-1, \\
    w_T, & t = T,
  \end{cases} \tag{24}
\]
and iterate using
\[
  y^{(k+1)} \leftarrow y^{(k)} + d^{(k)}, \qquad d^{(k)} := -h \nabla J_T(y^{(k)}).
\]
We know that for an appropriate choice of stepsize $h$ (this will be discussed shortly), the iterations converge to $y^* = \hat{x}_T$.


We also know that since $J_T$ is strongly convex and has a Lipschitz gradient (recall Lemma 3.1), this convergence is linear (see, for example, [58, Thm 2.1.15]):
\[
  \|y^{(k)} - y^*\| \leq r_0 a^k, \quad \text{where } a = \left( \frac{2L_{\max} - \mu_{\min}}{2L_{\max} + \mu_{\min}} \right), \tag{25}
\]
and $r_0 = \|y^* - y^{(0)}\|$.

and r0 = ‖y∗ − y(0)‖.The key fact is that the d(k), which are proportional to the

gradient, are highly structured. Using the notation ∇t to mean“gradient with respect to the variables in frame t”, we canwrite ∇JT in block form as

∇JT (y(k)) =

∇0f1(y(k)0 ,y

(k)1 )

∇1f1(y(k)0 ,y

(k)1 ) +∇1f2(y

(k)1 ,y

(k)2 )

...∇T−1fT−1(y

(k)T−2,y

(k)T−1) +∇T−1fT (y

(k)T−1,y

(k)T )

∇T fT (y(k)T−1,y

(k)T )

.

Because of the initialization (24), the first step $d^{(0)}$ is zero in every single block location except the last two,
\[
  d^{(0)} = \begin{bmatrix} 0 & \cdots & 0 & * & * \end{bmatrix}^{\mathrm{T}}.
\]
As such, only the variables in these last two blocks change; we will have $y^{(1)}_t = y^{(0)}_t$ for $t = 0, \ldots, T-2$. Now we can see that $\nabla J_T(y^{(1)})$ (and hence $d^{(1)}$) is nonzero only in the last three blocks. Propagating this effect backward, we can say that for any $1 \leq \tau \leq T$,
\[
  d^{(k)}_{T-\tau} = 0, \quad \text{for } k = 0, \ldots, \tau-1.
\]
Thus we have
\[
  \|\hat{x}_{T-\tau|T} - \hat{x}_{T-\tau|T-1}\| = \Big\| \sum_{k=\tau}^{\infty} d^{(k)}_{T-\tau} \Big\|
  = \|y^{(\tau)}_{T-\tau} - y^*_{T-\tau}\|
  \leq \|y^{(\tau)} - y^*\| \leq r_0 a^{\tau}.
\]

To bound $r_0$ we use the $\mu$-strong convexity of $J_T$ from Lemma 3.1 and the assumption (9),
\[
  r_0 = \|y^* - y^{(0)}\| \leq \frac{2}{\mu} \|\nabla J_T(y^{(0)})\|
  = \frac{2}{\mu} \|\nabla f_T(\hat{x}_{T-1|T-1}, w_T)\| \leq \mu^{-1} M_g,
\]
and so
\[
  \|\hat{x}_{t|T} - \hat{x}_{t|T-1}\| \leq C_0 a^{T-t}, \quad t < T, \qquad C_0 := \mu^{-1} M_g. \tag{26}
\]

An immediate consequence of (26) is that $\{\hat{x}_{t|T}\}_T$ is a Cauchy sequence. For all $t \geq 0$ and $n > l > 0$ we have that
\[
  \|\hat{x}_{t|t+n+l} - \hat{x}_{t|t+n}\|
  = \Big\| \sum_{k=1}^{l} \hat{x}_{t|t+n+k} - \hat{x}_{t|t+n+k-1} \Big\|
  \leq \sum_{k=1}^{l} \|\hat{x}_{t|t+n+k} - \hat{x}_{t|t+n+k-1}\|
  \leq C_0 \sum_{k=1}^{l} a^{n+k}
  = C_0 \left( \frac{1-a^l}{1-a} \right) a^{n+1},
\]
and so
\[
  \lim_{n,l \to \infty} \|\hat{x}_{t|t+n+l} - \hat{x}_{t|t+n}\| = 0,
\]
and $\{\hat{x}_{t|T}\}_T$ has a limit point that is well defined [59],
\[
  x^*_t = \lim_{T \to \infty} \hat{x}_{t|T}.
\]

By taking $n = T - t$ and taking the limit as $l \to \infty$, we have
\[
  \|\hat{x}_{t|T} - x^*_t\| \leq C_0 \left( \frac{a}{1-a} \right) a^{T-t}. \tag{27}
\]
Plugging in our expression for $a$ yields
\[
  \|x^*_t - \hat{x}_{t|T}\| \leq C_1 \cdot \left( \frac{2L_{\max} - \mu_{\min}}{2L_{\max} + \mu_{\min}} \right)^{T-t}, \tag{28}
\]
with $C_1 := C_0 \left( \frac{2L_{\max} - \mu_{\min}}{2\mu_{\min}} \right)$.

E. Proof of Theorem 3.2

We begin by recalling the gradient theorem, which states that for a twice differentiable function,
\[
  \nabla f(y) = \nabla f(x) + \left( \int_0^1 \nabla^2 f(x + \tau(y - x))\, d\tau \right)(y - x), \tag{29}
\]
for any $x, y$. In particular, since $\nabla f_t(\tilde{x}_{t-1|t}, \tilde{x}_{t|t}) = 0$, we can write

(29)for any x,y. In particular, since ∇ft(xt−1|t, xt|t) = 0, wecan write

∇ft(xt−1|T , xt|T ) =

[Gt−1,t ET

t

Et Gt,t

] [xt−1|T − xt−1|t

xt|T − xt|t

],

where[Gt−1,t ET

tEt Gt,t

]=

∫ 1

0∇2ft

([xt−1|t + τ(xt−1|T − xt−1|t)

xt|t + τ(xt|T − xt|t)

])dτ ,

(30)

and we have used the stacked notation ft

([uv

])in place of

ft(u,v) for convenience. With this notation, we can rewritethe optimality condition

\[
  \nabla J_T =
  \begin{bmatrix}
    \nabla_0 f_1(\hat{x}_{0|T}, \hat{x}_{1|T}) \\
    \nabla_1 f_1(\hat{x}_{0|T}, \hat{x}_{1|T}) + \nabla_1 f_2(\hat{x}_{1|T}, \hat{x}_{2|T}) \\
    \vdots \\
    \nabla_{T-1} f_{T-1}(\hat{x}_{T-2|T}, \hat{x}_{T-1|T}) + \nabla_{T-1} f_T(\hat{x}_{T-1|T}, \hat{x}_{T|T}) \\
    \nabla_T f_T(\hat{x}_{T-1|T}, \hat{x}_{T|T})
  \end{bmatrix} = 0
\]
as a block-tridiagonal system with the same form as (5), where
\[
  H_t =
  \begin{cases}
    G_{0,1}, & t = 0; \\
    G_{t,t} + G_{t,t+1}, & t = 1, \ldots, T-1; \\
    G_{T,T}, & t = T;
  \end{cases} \tag{31}
\]
and
\[
  g_t =
  \begin{cases}
    \begin{bmatrix} G_{0,1} & E_0^{\mathrm{T}} \end{bmatrix}
    \begin{bmatrix} \tilde{x}_{0|1} \\ \tilde{x}_{1|1} \end{bmatrix}, & t = 0; \\[1ex]
    \begin{bmatrix} E_{t-1} & G_{t,t} \end{bmatrix}
    \begin{bmatrix} \tilde{x}_{t-1|t} \\ \tilde{x}_{t|t} \end{bmatrix}
    + \begin{bmatrix} G_{t,t+1} & E_t^{\mathrm{T}} \end{bmatrix}
    \begin{bmatrix} \tilde{x}_{t|t+1} \\ \tilde{x}_{t+1|t+1} \end{bmatrix}, & t = 1, \ldots, T-1; \\[1ex]
    \begin{bmatrix} E_{T-1} & G_{T,T} \end{bmatrix}
    \begin{bmatrix} \tilde{x}_{T-1|T} \\ \tilde{x}_{T|T} \end{bmatrix}, & t = T.
  \end{cases}
\]

Since the $f_t$ are strongly convex, we have that $\mu_{\min} \leq \|G_{i,j}\| \leq L_{\max}$ for all $i$ and $j = i, i+1$.


Hence, the main diagonal blocks $H_t$ satisfy $\|\kappa^{-1} H_t - I\| \leq \delta$ with $\kappa = (2L_{\max} + \mu_{\min})/2$ and $\delta = (2L_{\max} - \mu_{\min})/(2L_{\max} + \mu_{\min})$. Defining $\theta$ as the smallest upper bound such that $\|E_t\| \leq \kappa\theta$, we have
\[
  \left\| \begin{bmatrix} G_{t-1,t} & E_{t-1} \end{bmatrix} \right\| \leq \kappa \sqrt{L_{\max}^2 + \theta^2},
\]
and can use the same bound for $\left\| \begin{bmatrix} E_{t-1} & G_{t,t} \end{bmatrix} \right\|$, and so
\[
  \|g_t\| \leq 2 M_x \kappa \sqrt{L_{\max}^2 + \theta^2} =: M_g,
\]
for all $t$. Then if $\|E_t\| \leq \mu_{\min}/2$, by Lemma 2.1 there will be an $\varepsilon^*$ such that $\|Q_t^{-1}\| \leq 1/(1-\varepsilon^*)$ for all $t$, satisfying $\rho = \theta/(1-\varepsilon^*) < 1$, and the result follows by applying Lemma 2.2.

F. Proof of Lemma 3.2

The optimality of $\{z_{t|T}\}$ implies that $\nabla J_{T,B}(z_{[T-B:T|T]})$ is all zeros except for the first term. Proposition 3.1 implies the same for $\nabla J_{T,B}(x_{[T-B:T|T]})$. Applying (29) to $J_{T,B}$ with $y := x_{[T-B:T|T]}$ and $x := z_{[T-B:T|T]}$, and following the same process as in the proof of Theorem 3.2, we get
\[
  \begin{bmatrix}
    H'_{T-B} & E_{T-B}^{\mathrm{T}} & & \\
    E_{T-B} & H_{T-B+1} & E_{T-B+1}^{\mathrm{T}} & \\
    & \ddots & \ddots & \\
    & & E_{T-1} & H_T
  \end{bmatrix}
  \begin{bmatrix}
    z_{T-B|T} - x_{T-B|T} \\
    z_{T-B+1|T} - x_{T-B+1|T} \\
    \vdots \\
    z_{T|T} - x_{T|T}
  \end{bmatrix}
  =
  \begin{bmatrix} q_0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\]
with $H_{T-B+i}$, $i = 1, \ldots, B$, as in (31), $H'_{T-B} = G_{T-B,T-B+1}$, and $q_0 = \nabla_{T-B} J_{T,B}(z_{[T-B:T|T]}) - \nabla_{T-B} J_{T,B}(x_{[T-B:T|T]})$. Applying Lemma 2.3 with $\alpha = \theta$ and $\beta = (1-\delta)^{-1}$ gives the lemma’s result.

G. Proof of Theorem 3.3

The proof follows by breaking the error term into two parts: a self error term due to early termination of the updates, and a bias term due to errors in preceding terms. The argument then follows by expressing the bias error as a convergent recursive sequence.

Note that $z^*_t = z_{t|t+B-1}$, the last round at which it was updated. Adding and subtracting $\hat{x}_{t|t+B-1}$, we have
\[
  \|x^*_t - z^*_t\| \leq \underbrace{\|\hat{x}_{t|t+B-1} - z_{t|t+B-1}\|}_{e_b} + \underbrace{\|x^*_t - \hat{x}_{t|t+B-1}\|}_{e_f}.
\]

For $e_f$, we have the easy bound (cf. Theorem 3.1)
\[
  e_f = \|x^*_t - \hat{x}_{t|t+B-1}\| \leq C_1 \left( \frac{2L_{\max} - \mu_{\min}}{2L_{\max} + \mu_{\min}} \right)^{B-1}.
\]

Invoking Lemma 3.2 with $T = t + B - 1$, we have that
\[
  \|\hat{x}_{t|t+B-1} - z_{t|t+B-1}\|
  \overset{(1)}{\leq} \frac{\theta}{(1-\delta)} \|\hat{x}_{t-1|t+B-1} - z^*_{t-1}\|
  \overset{(2)}{=} \frac{\theta}{(1-\delta)} \|\hat{x}_{t-1|t+B-1} - z_{t-1|t+B-2}\|
  \overset{(3)}{\leq} \frac{\theta}{(1-\delta)} \|\hat{x}_{t-1|t+B-1} - \hat{x}_{t-1|t+B-2}\|
  + \frac{\theta}{(1-\delta)} \|\hat{x}_{t-1|t+B-2} - z_{t-1|t+B-2}\|,
\]
where (1) is a direct result of Lemma 3.2, (2) holds because $z^*_{t-1}$ is the same as the solution $z_{t-1|t+B-2}$, and in (3) we add and subtract $\hat{x}_{t-1|t+B-2}$ and then apply the triangle inequality.

From (26), we know that
\[
  \|\hat{x}_{t-1|t+B-1} - \hat{x}_{t-1|t+B-2}\| \leq C_0 a^B.
\]

Defining $r(t) := \|\hat{x}_{t|t+B-1} - z^*_t\|$, we obtain the recursive relation
\[
  r(t) \leq C_0 a^B + \frac{\theta}{(1-\delta)} r(t-1).
\]
The geometric series, using (21), converges to
\[
  e_b \leq C_0 a^B \frac{1}{(1 - \theta(1-\delta)^{-1})}.
\]

Lastly, combining $e_f$ and $e_b$ yields
\[
  \|x^*_t - z^*_t\| \leq C_b \left( \frac{2L_{\max} - \mu_{\min}}{2L_{\max} + \mu_{\min}} \right)^{B},
\]
with
\[
  C_b := C_0 \left( \frac{1}{1-a} + \frac{1}{(1 - \theta(1-\delta)^{-1})} \right).
\]
