


Distributionally Robust Chance Constrained Data-enabled Predictive Control

Jeremy Coulson, John Lygeros, Florian Dörfler

Accepted for publication: IEEE Transactions on Automatic Control. (DOI: 10.1109/TAC.2021.3097706) © 2021 IEEE.

Abstract—We study the problem of finite-time constrained optimal control of unknown stochastic linear time-invariant systems, which is the key ingredient of a predictive control algorithm – albeit one typically having access to a model. We propose a novel distributionally robust data-enabled predictive control (DeePC) algorithm which uses noise-corrupted input/output data to predict future trajectories and compute optimal control inputs while satisfying output chance constraints. The algorithm is based on (i) a non-parametric representation of the subspace spanning the system behaviour, where past trajectories are sorted in Page or Hankel matrices; and (ii) a distributionally robust optimization formulation which gives rise to strong probabilistic performance guarantees. We show that for certain objective functions, DeePC exhibits strong out-of-sample performance, and at the same time respects constraints with high probability. The algorithm provides an end-to-end approach to control design for unknown stochastic linear time-invariant systems. We illustrate the closed-loop performance of DeePC in an aerial robotics case study.

Index Terms—Data-driven control, predictive control, distributionally robust optimization.

I. INTRODUCTION

OPTIMAL control of unknown systems can be approached in two ways: model-based and data-driven. In model-based control, a predictive model for the system of interest is first identified from data and subsequently used for control design. What have come to be known as “data-driven methods”, on the other hand, aim to design controllers directly from data, without explicitly identifying a predictive model. These methods are suitable for applications where first-principle models are not conceivable (e.g., in human-in-the-loop applications), when models are too complex for control design (e.g., in fluid dynamics), and when thorough modelling and parameter identification is too costly (e.g., in robotics). Data-driven control has recently gained a lot of popularity, but most methods cannot be applied (respectively, lack formal certificates) for real-time control of safety-critical systems.

In this work, we focus on a data-driven control technique for unknown, stochastic, and constrained linear systems. In particular, we present a method for finite-horizon optimal predictive control using input/output data from the unknown system, where the system behaviour is characterized by a data matrix time series. This method was first presented for deterministic systems in [1] and was later extended to stochastic systems in [2]. These works were motivated by [3], in which a unique method of direct data-driven open-loop control was conceived based on the seminal work on behavioural systems theory [4].

All authors are with the Department of Information Technology and Electrical Engineering at ETH Zürich, Switzerland {jcoulson, lygeros, dorfler}@control.ee.ethz.ch. The research was supported by the ERC under project OCAL, grant agreement 787845, and ETH Zürich funds.

A key challenge for such data-driven control methods is ensuring the performance and safety of the system in the presence of uncertainties, corrupted data, and noise. In this work, we present an end-to-end optimal data-driven control approach which comes with such guarantees but without explicitly identifying a predictive model, and which is agnostic to the particular probabilistic uncertainty. The approach combines the so-called data-enabled predictive control (DeePC) method from [1], [2] and distributionally robust optimization techniques from [5].

Data-driven control has historically been approached using, e.g., iterative feedback tuning and virtual reference feedback tuning [6], [7]. More recently the default approach is often reinforcement learning. There are many approaches to reinforcement learning, many of which are episodic, where learning and control alternate; see [8] for a review of methods and challenges. Here, we follow different lines of literature.

Behavioural Framework for Control: Instead of learning a parametric system representation, one can describe the entire subspace of possible trajectories of a linear time-invariant (LTI) system only in terms of raw data sorted into a Hankel matrix. This result became known as the fundamental lemma [4], has been inspired by subspace system identification [9], and was first leveraged in [3] for the computation of open-loop control and for simulating system responses. The result was used in [1], in which a predictive control algorithm was proposed, and was extended to stochastic systems in [2]. Robust closed-loop stability guarantees have been provided in [10]. Additionally, numerical case studies have illustrated that the algorithm performs robustly on some stochastic and weakly nonlinear systems and often outperforms system identification followed by conventional model predictive control (MPC) [11]–[13]. The behavioural framework was used in [14] to construct explicit feedback controllers. Since then, the behavioural framework has become popular for control design, giving rise to numerous methods [15]–[20].

Learning-based MPC: In learning-based MPC the unknown system dynamics are substituted with a learned model mapping inputs to output predictions; see [21] for a comprehensive survey. A learning MPC approach for iterative tasks is presented in [22]. Other approaches conceptually related to ours exploit previously measured trajectories (i.e., motion primitives or trajectory dictionaries) based on which they synthesize new trajectories [23]–[25]. These methods, however, require learning a dictionary of libraries that best fits a dataset, but do not take into account the control objective.

Sequential System Identification and Control: System identification (ID) produces a nominal model and an uncertainty estimate allowing for robust control design. Most approaches offer asymptotic guarantees, but some provide finite sample

arXiv:2006.01702v2 [math.OC] 21 Jul 2021


certificates [26]–[29]. In this spirit, an end-to-end ID and control pipeline is given in [30], [31] and arrives at a data-driven control solution with guarantees on the sample efficiency, stability, performance, and robustness. We refer also to the earlier surveys on identification for control [32], [33]. Well-known shortcomings of sequential ID and control are as follows [32]–[35]: the system ID step is known to be the most time-consuming part of model-based control design. Furthermore, most system ID techniques seek the model that best fits the data, but do not take into account the control objective, possibly leading to unsatisfactory performance. Finally, practitioners often demand end-to-end automated solutions. Lastly, we mention approaches related to subspace identification and (in hindsight) also to the behavioural perspective: [36] derives a linear model from a Hankel matrix to be used for predictive control (see also [37] for an overview of subspace methods for identification and predictive control). Connections between indirect data-driven control (sequential system ID and control) and direct data-driven control are discussed in [38].

Stochastic MPC: To account for the stochastic uncertainty in the system dynamics, stochastic MPC approaches usually consider minimizing the expectation of the objective function, while satisfying chance constraints [39]. While chance constraints may be unsuitable for applications where constraint violation has catastrophic consequences, they are often used when transient violations are allowed or when violations are associated with financial penalties. Closest to our work is [40], in which a stochastic MPC approach is developed for the case when the probability distribution of the stochastic disturbance is unknown. The authors consider a distributionally robust approach to safeguard against the unknown distribution, assuming that an accurate system model is given.

The approach of this paper follows our earlier work [1], [2]. Our contributions are as follows:

Distributionally robust data-driven control formulation: We formulate a novel distributionally robust data-enabled predictive control problem based on behavioural systems theory and distributionally robust optimization.

Probabilistic guarantees on performance: We show that a distributionally robust optimal control problem admits a tractable reformulation, and with high confidence, its solutions exhibit strong out-of-sample guarantees. That is, we prove that the optimal value of the tractable formulation is an upper confidence bound on the out-of-sample performance of an optimal worst-case solution. We provide sample complexity results, and are able to leverage additional data measurements in comparison to [2]. Furthermore, the distributional nature of the robustness means that the method is robust against a set of systems compatible with the data collected, which can include non-Gaussian and non-additive noise as well as (weakly) nonlinear systems.

Safety through chance constraint satisfaction: We enforce chance constraints on the outputs of the system using a distributionally robust conditional value-at-risk formulation. We provide a tractable reformulation of the chance constraints and provide sample complexity results such that they hold with high confidence and are agnostic to the underlying probabilistic uncertainty.

Tight guarantees due to new data structure: We provide a novel formulation of the so-called fundamental lemma from behavioural system theory by proving that the subspace of trajectories of an LTI system can be spanned by trajectories organized in a Page matrix. This is in contrast with all of the literature leveraging the behavioural framework for control, in which a Hankel matrix is used. In particular, this provides a novel contribution relative to [1], [2] as well as the body of literature leveraging the behavioural framework for control. Furthermore, we show that this alternative matrix structure gives tighter guarantees and results in better performance.

The guarantees provided are for the finite-horizon optimal control problem. Providing guarantees for a closed-loop receding horizon implementation is a subject for future work. We refer to [10] for related results in this direction.

Section II reviews preliminaries on behavioural systems and presents non-parametric representations. In Section III, we present the data-enabled predictive control algorithm for deterministic systems. In Section IV, we present a distributionally robust data-enabled predictive control algorithm for stochastic systems. In Section V, we illustrate the main results in simulation on a quadcopter and present a detailed analysis of the hyperparameters involved. We make concluding remarks in Section VI. All proofs have been deferred to the Appendix.

Notation: We denote by Z≥0 and Z>0 the sets of non-negative and positive integers, respectively. Given x, y ∈ R^n, 〈x, y〉 := x⊤y denotes the usual inner product on R^n. We denote the dual norm of a norm ‖·‖ on R^n by ‖x‖∗ := sup_{‖y‖≤1} 〈x, y〉. The convex conjugate of a function f : X → R is denoted by f∗(θ) := sup_{x∈X} 〈θ, x〉 − f(x). We define the positive part of a real-valued function f as f+(x) := max{f(x), 0}. We denote by δ_x the Dirac distribution at x. Given a signal u : Z → R^m, we denote the restriction of the signal to an interval by u_{[1,T]} := col(u_1, . . . , u_T), where col(u_1, . . . , u_T) denotes the stacked column vector (u_1⊤, . . . , u_T⊤)⊤. We use the ˆ· (hat) symbol to denote recorded data samples and to indicate that objects depend on data samples.

II. BEHAVIOURAL SYSTEMS

A. Preliminaries and Notation

Behavioural system theory is a natural way of viewing a dynamical system when one is not concerned with a particular system representation. This is in contrast to classical system theory, where a particular parametric system representation (such as a state-space model) is used to describe the behaviour, and system properties are derived by studying the chosen representation. Following [41], we define a dynamical system and its properties in terms of its behaviour.

Definition 2.1: A dynamical system is a 3-tuple (Z≥0, W, B), where Z≥0 is the discrete-time axis, W is a signal space, and B ⊆ W^{Z≥0} is the behaviour.

Definition 2.2: Let (Z≥0, W, B) be a dynamical system.
(i) (Z≥0, W, B) is linear if W is a vector space and B is a linear subspace of W^{Z≥0}.
(ii) (Z≥0, W, B) is time invariant if B ⊆ σB, where σ is the backward time shift defined by (σw)(t) = w(t + 1) and σB = {σw | w ∈ B}.


(iii) (Z≥0, W, B) is complete if B is closed in the topology of pointwise convergence.

Note that if a dynamical system satisfies (i)-(ii), then (iii) is equivalent to finite dimensionality of W (see [41, Section 7.1]). We denote the class of systems (Z≥0, R^{m+p}, B) satisfying (i)-(iii) by L^{m+p}, where m, p ∈ Z≥0. With slight abuse of notation and terminology, we denote a dynamical system in L^{m+p} only by its behaviour B.

Definition 2.3: The restricted behaviour in the interval [1, T] is the set B_T = {w ∈ (R^{m+p})^T | ∃ v ∈ B s.t. w_t = v_t, 1 ≤ t ≤ T}. A vector w ∈ B_T is called a T-length trajectory of the dynamical system B.

Definition 2.4: A system B ∈ L^{m+p} is controllable if for every T ∈ Z>0, w¹ ∈ B_T, w² ∈ B, there exist w ∈ B and T′ ∈ Z>0 such that w_t = w¹_t for 1 ≤ t ≤ T and w_t = w²_{t−T−T′} for t > T + T′.

In other words, a behavioural system is controllable if any two trajectories can be patched together in finite time.

B. Parametric system representation

Without loss of generality, any trajectory w ∈ B can be written as w = col(u, y), where col(u, y) := (u⊤, y⊤)⊤ (see [42, Theorem 2]). In what follows, we will associate u and y with inputs and outputs. There are several equivalent ways of representing a behavioural system B ∈ L^{m+p}, including the classical input/output/state representation denoted by B(A, B, C, D) = {col(u, y) ∈ (R^{m+p})^{Z≥0} | ∃ x ∈ (R^n)^{Z≥0} s.t. σx = Ax + Bu, y = Cx + Du}. The input/output/state representation of smallest order (i.e., smallest state dimension) is called a minimal representation, and we denote its order by n(B). Another important property of a system B ∈ L^{m+p} is the lag, defined as the smallest integer ℓ ∈ Z>0 such that the observability matrix O_ℓ(A, C) := col(C, CA, . . . , CA^{ℓ−1}) has rank n(B). We denote the lag by ℓ(B) (see [41, Section 7.2] for equivalent definitions). The lower triangular Toeplitz (impulse response) matrix consisting of Tf ∈ Z>0 system Markov parameters is denoted by

T_{Tf}(A, B, C, D) :=
[ D              0       · · ·   0 ]
[ CB             D       · · ·   0 ]
[ ⋮                       ⋱      ⋮ ]
[ CA^{Tf−2}B     · · ·    CB     D ].

Lemma 2.1 ([3, Lemma 1]): Let B ∈ L^{m+p} and let B(A, B, C, D) be a minimal input/output/state representation. Let Tini, Tf ∈ Z>0 with Tini ≥ ℓ(B) and col(uini, u, yini, y) ∈ B_{Tini+Tf}. Then there exists a unique xini ∈ R^{n(B)} such that

y = O_{Tf}(A, C) xini + T_{Tf}(A, B, C, D) u.

In other words, given a sufficiently long window of initial system data col(uini, yini), the initial state xini is unique and can be computed given knowledge of A, B, C, D and col(u, y).
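The objects in Lemma 2.1 are easy to reproduce numerically. The following sketch builds O_Tf(A, C) and T_Tf(A, B, C, D) for a small illustrative system (the matrices A, B, C, D and the horizon are hypothetical, chosen only for the example) and checks the batch relation y = O_Tf xini + T_Tf u against a step-by-step simulation:

```python
import numpy as np

# Hypothetical 2-state SISO system (illustrative values, not from the paper).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.1]])
n, m, p = 2, 1, 1
Tf = 5

# Extended observability matrix O_Tf(A, C) = col(C, CA, ..., CA^{Tf-1}).
O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(Tf)])

# Lower triangular Toeplitz matrix T_Tf(A, B, C, D) of Markov parameters.
T = np.zeros((p * Tf, m * Tf))
for i in range(Tf):
    for j in range(i + 1):
        blk = D if i == j else C @ np.linalg.matrix_power(A, i - j - 1) @ B
        T[i * p:(i + 1) * p, j * m:(j + 1) * m] = blk

# Verify the batch relation y = O_Tf x_ini + T_Tf u against a simulation.
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
x_ini = x.copy()
u = rng.standard_normal(m * Tf)
y_sim = []
for k in range(Tf):
    uk = u[k * m:(k + 1) * m]
    y_sim.append(C @ x + D @ uk)
    x = A @ x + B @ uk
y_sim = np.concatenate(y_sim)
y_batch = O @ x_ini + T @ u
assert np.allclose(y_sim, y_batch)
```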

C. Hankel and Page Matrices

We introduce two important matrix structures: the Hankel matrix and the Page matrix.

Definition 2.5: Let L, T ∈ Z>0. Let u_{[1,T]} = {u_j}_{j=1}^T ⊂ R^m be a sequence of vectors.
• We define the Hankel matrix¹ of depth L as

H_L(u_{[1,T]}) :=
[ u_1    u_2      · · ·   u_{T−L+1} ]
[ u_2    u_3      · · ·   u_{T−L+2} ]
[ ⋮      ⋮         ⋱      ⋮         ]
[ u_L    u_{L+1}  · · ·   u_T       ].

• We define the Page matrix of depth L as

P_L(u_{[1,T]}) :=
[ u_1    u_{L+1}  · · ·   u_{(⌊T/L⌋−1)L+1} ]
[ u_2    u_{L+2}  · · ·   u_{(⌊T/L⌋−1)L+2} ]
[ ⋮      ⋮         ⋱      ⋮                ]
[ u_L    u_{2L}   · · ·   u_{⌊T/L⌋L}       ],

where ⌊·⌋ is the floor function which rounds its argument down to the nearest integer.

Note that if T is a multiple of L, then ⌊T/L⌋L = T and the above expressions simplify accordingly. Hankel matrices have a long history in subspace identification [9] and more recently in data-driven control [1], [3], [10], [14]. Page matrices have also been used as an alternative to the classical Hankel matrix [43], [44].
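For concreteness, both constructions can be sketched in a few lines of numpy (the helper names and the scalar example sequence are ours, not from the paper):

```python
import numpy as np

def hankel_matrix(u, L):
    # u: (T, m) array of the vectors u_1, ..., u_T stacked row-wise.
    # Column j is col(u_{j+1}, ..., u_{j+L}); consecutive columns overlap.
    T, m = u.shape
    return np.column_stack([u[j:j + L].reshape(-1) for j in range(T - L + 1)])

def page_matrix(u, L):
    # Column k is col(u_{kL+1}, ..., u_{(k+1)L}); no entry is repeated.
    T, m = u.shape
    return np.column_stack([u[k * L:(k + 1) * L].reshape(-1) for k in range(T // L)])

# Scalar example (m = 1): u = (1, 2, 3, 4, 5, 6), depth L = 2.
u = np.arange(1, 7, dtype=float).reshape(-1, 1)
H = hankel_matrix(u, 2)   # shape (2, 5); columns (1,2), (2,3), ..., (5,6)
P = page_matrix(u, 2)     # shape (2, 3); columns (1,2), (3,4), (5,6)
```

Note the overlap between consecutive Hankel columns versus the disjoint blocks of the Page matrix; the latter is what eliminates repeated entries.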

Definition 2.6: Let L, T, M ∈ Z>0 and u_{[1,T]} = {u_j}_{j=1}^T ⊂ R^m be a sequence of vectors. We call u_{[1,T]}:
• Hankel exciting of order L if the matrix H_L(u_{[1,T]}) has full row rank.
• L-Page exciting of order M if the matrix

[ P_L(u_{[1, T−(M−1)L]})   ]
[ P_L(u_{[L+1, T−(M−2)L]}) ]
[ ⋮                        ]
[ P_L(u_{[L(M−1)+1, T]})   ]

has full row rank.

Roughly speaking, the terms Hankel and Page exciting refer to a collection of inputs sufficiently rich and long to yield an output sequence representative of the system's behaviour. Note that for a sequence of vectors to be Hankel exciting of order L, one requires that T ≥ L(m + 1) − 1. For a sequence of vectors to be L-Page exciting of order M, one requires that T ≥ L((mL + 1)M − 1). We also observe that the definition of an L-Page exciting sequence of order M depends on two indices, L and M (the number of block rows of the Page matrices and the number of vertically concatenated Page matrices, respectively), whereas the definition of a Hankel exciting sequence of order L depends only on a single index L (the number of block rows of the Hankel matrix).

The definition for Hankel exciting appeared in [4] under the name persistently exciting. A more general notion of persistency of excitation appeared in [18] in the case when the data matrix was a mosaic Hankel matrix and was termed collective persistency of excitation. Our notion of Page exciting is a modification of these notions; the main difference is that there are no repeated entries in each of the data matrices. This

¹The classical definition of a Hankel matrix requires it to be square. We slightly abuse this classical terminology, allowing for general dimensions.


has important implications when the entries of the matrix are corrupted by noise; see Section V-B, the case study [12], and references [43], [44]. In short, the independence of the Page matrix entries leads to statistically and algorithmically favourable properties, e.g., singular-value thresholding can be used for de-noising. We will observe later that certain robustness and optimality guarantees become tight for Page matrices, and that they lead to superior control performance. The price to pay is an increasing number of data samples. Almost all results within this paper hold for both Hankel and Page matrix structures, and we will explicitly comment on when the results differ depending on the matrix structure chosen.

D. Non-parametric System Representation

We now present a result known in behavioural systems theory as the Fundamental Lemma [3]. The result first appeared in [4, Theorem 1] in the case when the Hankel matrix was used, and was recently extended in [18, Theorem 2] to more general mosaic Hankel matrices. We present the result using the Page matrix data structure introduced in Definition 2.5.

Theorem 2.1: Consider a controllable system B ∈ L^{m+p}. Let L, T ∈ Z>0 with L ≥ n(B). Let (u_{[1,T]}, y_{[1,T]}) = {(u_j, y_j)}_{j=1}^T ⊂ R^{m+p} be a T-length trajectory of B. Assume u_{[1,T]} to be L-Page exciting of order n(B) + 1. Then (u_{[1,L]}, y_{[1,L]}) = col(u_1, . . . , u_L, y_1, . . . , y_L) ∈ B_L if and only if there exists a vector g ∈ R^{⌊T/L⌋} such that

( P_L(u_{[1,T]}) )        ( u_{[1,L]} )
( P_L(y_{[1,T]}) ) g  =   ( y_{[1,L]} ).        (1)

The original result [4, Theorem 1] required Hankel excitation of order L + n(B), and it coincides with Theorem 2.1 for L = 1. Theorem 2.1 replaces the need for a model or system identification process and allows any trajectory of a controllable LTI system to be constructed using a finite number of data samples generated by a sufficiently rich (i.e., L-Page exciting) input sequence. Each column of the Page matrix in (1) is a trajectory of the system, and can be thought of as a motion primitive. By linearly combining elements of this trajectory library, we recover the whole space of trajectories.

In a sense, the Page matrix in (1) is a non-parametric predictive model based on raw data. To be precise in behavioural language, the Page matrix in (1) is an image representation of the restricted behaviour B_L. This is in contrast to kernel representations (with latent variables) such as state-space models, which parameterize B_L by means of its orthogonal complement. In what follows, we will express a linear system B by its data matrix representation in (1). We refer to the Page matrix on the left-hand side of (1) as the data matrix.
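The data matrix representation can be validated numerically. The following sketch (system matrices hypothetical, chosen only for illustration) collects a long trajectory of a small LTI system, forms the stacked input/output Page matrix of (1), and verifies that a freshly simulated L-length trajectory lies in its column span:

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative controllable and observable SISO system (not from the paper).
A = np.array([[0.8, 0.3], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
n, m, p = 2, 1, 1
L = 4  # L >= n(B)

def simulate(u, x0):
    # Outputs of the deterministic LTI system for an input sequence u of shape (T, m).
    x, ys = x0.copy(), []
    for ut in u:
        ys.append(C @ x + D @ ut)
        x = A @ x + B @ ut
    return np.array(ys)

def page(z, L):
    return np.column_stack([z[k * L:(k + 1) * L].reshape(-1)
                            for k in range(z.shape[0] // L)])

# Offline data: a long random (hence, almost surely Page-exciting) input.
T = 400
ud = rng.standard_normal((T, m))
yd = simulate(ud, rng.standard_normal(n))
data = np.vstack([page(ud, L), page(yd, L)])   # data matrix of (1)

# The data matrix has rank mL + n(B): its span is the restricted behaviour B_L.
assert np.linalg.matrix_rank(data, tol=1e-8) == m * L + n

# Any fresh L-length trajectory lies in the column span: solve (1) for g.
u_new = rng.standard_normal((L, m))
y_new = simulate(u_new, rng.standard_normal(n))
w = np.concatenate([u_new.reshape(-1), y_new.reshape(-1)])
g, *_ = np.linalg.lstsq(data, w, rcond=None)
assert np.allclose(data @ g, w, atol=1e-6)
```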

Remark 2.1: When L ≥ ℓ(B), it can be shown that the rank of the data matrix in the data matrix representation given by (1) is mL + n(B), where m is the dimension of the inputs. Hence, the column span of the data matrix carves out a low-dimensional subspace of R^{L(m+p)} which coincides with the restricted behaviour B_L, the space of L-length trajectories. Since in general n(B) is unknown, it suffices to replace n(B) in Theorem 2.1 with an upper bound on the system order. •

For the rest of the paper, we think of an LTI system B only in terms of its data matrix representation. We see next that Theorem 2.1 allows one to implicitly estimate the state, predict the future behaviour, and design optimal control inputs [3].

III. DETERMINISTIC DEEPC

Consider the controllable LTI system B ∈ L^{m+p} whose model is unknown. We address the problem of finite-horizon optimal control, where the goal is to design a finite sequence of control inputs that results in desirable outputs. In particular, let t ∈ Z≥0 be the current time, let Tf ∈ Z>0 be the prediction horizon, and let f_1 : R^{mTf} → R≥0 (respectively, f_2 : R^{pTf} → R≥0) be a cost function on the future inputs (respectively, outputs). We wish to design an input sequence col(u_t, . . . , u_{t+Tf−1}) ∈ R^{mTf} such that the corresponding output sequence col(y_t, . . . , y_{t+Tf−1}) ∈ R^{pTf} minimizes the cost f_1 + f_2. Furthermore, we wish for the inputs and outputs to lie in the constraint sets U ⊆ R^{mTf} and Y ⊆ R^{pTf}, respectively.

The conventional formulation of this finite-time optimal control problem uses a parametric input/output/state representation B(A, B, C, D) for prediction and estimation:

min_{u,y,x}  f_1(u) + f_2(y)
s.t.  x_{k+1} = A x_k + B u_k   ∀ k ∈ {0, . . . , Tf − 1},
      y_k = C x_k + D u_k       ∀ k ∈ {0, . . . , Tf − 1},
      x_0 = x_t,
      u ∈ U,  y ∈ Y.                                        (2)

Here, x_t is the current state at time t, which in the deterministic case can be obtained, e.g., by propagating the model equations backward in time by Tini ≥ ℓ(B) steps; see Lemma 2.1. When optimization problem (2) is implemented in a receding horizon fashion it is known as model predictive control (MPC) and is a widely celebrated control technique. Typically, in receding horizon MPC problems the constraints take the form of separable stage constraints with an additional terminal constraint (similarly for the cost functions) [45].
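With quadratic costs and the constraint sets taken as the whole space, problem (2) reduces to a linear least-squares problem in the batch matrices of Lemma 2.1. A sketch under these simplifying assumptions (all numerical values are illustrative):

```python
import numpy as np

# Illustrative system and horizon (hypothetical values, not from the paper).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
n, m, p, Tf = 2, 1, 1, 10

# Batch prediction matrices: y = O x_0 + T u (cf. Lemma 2.1).
O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(Tf)])
T = np.zeros((p * Tf, m * Tf))
for i in range(Tf):
    for j in range(i + 1):
        T[i, j] = (D if i == j else C @ np.linalg.matrix_power(A, i - j - 1) @ B).item()

# Costs f1(u) = r_u ||u||^2, f2(y) = ||y - y_ref||^2; U, Y unconstrained.
x0 = np.array([1.0, -0.5])
y_ref = np.ones(p * Tf)
r_u = 0.01
# Minimizer of ||T u + O x0 - y_ref||^2 + r_u ||u||^2 via the normal equations.
u_opt = np.linalg.solve(T.T @ T + r_u * np.eye(m * Tf), T.T @ (y_ref - O @ x0))
y_opt = O @ x0 + T @ u_opt
```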

Using Theorem 2.1, we can see that raw data collected from system B can be used to perform implicit state estimation at the same time as predicting forward trajectories of B. Indeed, let Tini, Tf ∈ Z>0 and let (u_{[1,T]}, y_{[1,T]}) = {(u_j, y_j)}_{j=1}^T ⊂ R^{m+p} be a T-length trajectory of B collected offline. Assume that u_{[1,T]} is (Tini + Tf)-Page exciting of order n(B) + 1. We partition the data matrices into two parts; one will be used to perform the implicit state estimation and the other to perform forward predictions. More formally, we use the language of subspace system identification [9] and define

( U_p )                                ( Y_p )
( U_f ) := P_{Tini+Tf}(u_{[1,T]}),     ( Y_f ) := P_{Tini+Tf}(y_{[1,T]}),        (3)

where U_p ∈ R^{mTini×⌊T/(Tini+Tf)⌋} consists of the first Tini block rows of P_{Tini+Tf}(u_{[1,T]}) and U_f ∈ R^{mTf×⌊T/(Tini+Tf)⌋} of the last Tf block rows (similarly for Y_p and Y_f). Let t ∈ Z≥0 be the current time and denote by col(uini, yini) the most recent Tini-length trajectory measured from B. Then, given a sequence of


inputs u = col(u_t, . . . , u_{t+Tf−1}) of length Tf ∈ Z>0, one may solve for a vector g ∈ R^{⌊T/(Tini+Tf)⌋} satisfying

( U_p )        ( uini )
( Y_p ) g  =   ( yini )                    (4)
( U_f )        (  u   )

By Theorem 2.1, the predicted Tf-length output of B is then given by y = Y_f g. Note that in general the g satisfying (4) is non-unique. However, by Lemma 2.1, if Tini ≥ ℓ(B) then the predicted output y is unique. The top two block equations of (4) can be thought of as implicitly fixing the initial state from which the future trajectory will depart.
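The prediction step (4) can be reproduced with a few lines of numpy. The sketch below (system and dimensions illustrative; Tini is chosen at least as large as the lag) partitions the Page matrices as in (3), solves (4) by least squares, and confirms that y = Y_f g matches the true continuation of the trajectory:

```python
import numpy as np

rng = np.random.default_rng(3)
# Illustrative system (hypothetical values, not from the paper); its lag is 2.
A = np.array([[0.8, 0.3], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
n, m, p = 2, 1, 1
Tini, Tf = 2, 5          # Tini >= lag of the system
L = Tini + Tf

def simulate(u, x0):
    x, ys = x0.copy(), []
    for ut in u:
        ys.append(C @ x + D @ ut)
        x = A @ x + B @ ut
    return np.array(ys), x

def page(z, L):
    return np.column_stack([z[k * L:(k + 1) * L].reshape(-1)
                            for k in range(z.shape[0] // L)])

# Offline data and partition (3): first Tini block rows "past", last Tf "future".
T = 700
ud = rng.standard_normal((T, m))
yd, _ = simulate(ud, rng.standard_normal(n))
U, Y = page(ud, L), page(yd, L)
Up, Uf = U[:m * Tini], U[m * Tini:]
Yp, Yf = Y[:p * Tini], Y[p * Tini:]

# Online: the most recent Tini-length trajectory fixes the implicit initial state.
u_ini = rng.standard_normal((Tini, m))
y_ini, x_t = simulate(u_ini, rng.standard_normal(n))
u_fut = rng.standard_normal((Tf, m))
y_true, _ = simulate(u_fut, x_t)

# Solve (4) for some g; the prediction y = Yf g is unique although g is not.
lhs = np.vstack([Up, Yp, Uf])
rhs = np.concatenate([u_ini.reshape(-1), y_ini.reshape(-1), u_fut.reshape(-1)])
g, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
y_pred = Yf @ g
assert np.allclose(y_pred, y_true.reshape(-1), atol=1e-6)
```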

We now recall the deterministic DeePC method presented in [1] (using Hankel matrices) for controlling the system B.

Data Collection (offline): Apply to B a T-length input sequence u_{[1,T]} that is (Tini + Tf)-Page exciting of order n(B) + 1. Measure the corresponding output trajectory y_{[1,T]}. Form the data matrices U_p, U_f, Y_p, Y_f according to (3).

DeePC (online): Let f_1 : R^{mTf} → R≥0, f_2 : R^{pTf} → R≥0 be cost functions describing the costs on future inputs and outputs of B, respectively. Collect the most recent Tini-length trajectory col(uini, yini) measured from B. Solve the finite-horizon optimal control problem:

min_g  f_1(U_f g) + f_2(Y_f g)

s.t.  ( U_p )        ( uini )
      ( Y_p ) g  =   ( yini ),
      U_f g ∈ U,
      Y_f g ∈ Y.                           (5)

Similar to MPC, the online DeePC optimization can be implemented in a receding horizon fashion. Note that the optimization problem (5) only requires input/output measurements from the system and does not require the identification of a model. The equivalence of DeePC to classical model predictive control when a system model for B is given is established in [1] in the case when the data matrix is a Hankel matrix.
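For quadratic costs and with U and Y taken as the whole space, (5) becomes an equality-constrained least-squares problem. One way to solve it, sketched below with illustrative data (a general implementation with inequality constraints would instead call a QP solver), eliminates the equality constraint through a null-space parametrization g = g0 + Nz:

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative system used only to generate data (DeePC never uses A, B, C, D online).
A = np.array([[0.8, 0.3], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
n, m, p = 2, 1, 1
Tini, Tf = 2, 5
L = Tini + Tf

def simulate(u, x0):
    x, ys = x0.copy(), []
    for ut in u:
        ys.append(C @ x + D @ ut)
        x = A @ x + B @ ut
    return np.array(ys), x

def page(z, L):
    return np.column_stack([z[k * L:(k + 1) * L].reshape(-1)
                            for k in range(z.shape[0] // L)])

# Offline data collection and partition (3).
T = 700
ud = rng.standard_normal((T, m))
yd, _ = simulate(ud, rng.standard_normal(n))
U, Y = page(ud, L), page(yd, L)
Up, Uf, Yp, Yf = U[:m * Tini], U[m * Tini:], Y[:p * Tini], Y[p * Tini:]

# Online: the most recent Tini-length trajectory.
u_ini = rng.standard_normal((Tini, m))
y_ini, _ = simulate(u_ini, rng.standard_normal(n))

# Costs f1(u) = r_u ||u||^2, f2(y) = ||y - y_ref||^2; no inequality constraints.
y_ref, r_u = np.ones(p * Tf), 1e-3
Aeq = np.vstack([Up, Yp])
beq = np.concatenate([u_ini.reshape(-1), y_ini.reshape(-1)])
g0, *_ = np.linalg.lstsq(Aeq, beq, rcond=None)        # particular solution
_, s, Vt = np.linalg.svd(Aeq)
N = Vt[(s > 1e-10).sum():].T                          # orthonormal null-space basis
M = np.vstack([np.sqrt(r_u) * Uf @ N, Yf @ N])
v = np.concatenate([-np.sqrt(r_u) * Uf @ g0, y_ref - Yf @ g0])
z, *_ = np.linalg.lstsq(M, v, rcond=None)
g = g0 + N @ z
u_opt, y_opt = Uf @ g, Yf @ g
assert np.allclose(Aeq @ g, beq, atol=1e-8)           # (uini, yini) constraint holds
```

By Proposition 3.1 below, this data-driven solution coincides with the one a model-based MPC with the same costs would produce.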

Proposition 3.1: Consider an LTI system B whose input/output/state representation is given by B(A, B, C, D). Consider the MPC optimization problem (2) and the optimization problem (5) with Tini ≥ ℓ(B). Then g satisfying the constraints of (5) implies that (u, y) = (U_f g, Y_f g) satisfies the constraints of (2). Conversely, (u, y) satisfying the constraints of (2) implies the existence of g satisfying the constraints of (5) with (u, y) = (U_f g, Y_f g).

As a direct corollary of Proposition 3.1, the DeePC problem (5) and the MPC problem (2) result in equivalent and unique (assuming convexity) open-loop (resp., closed-loop) behaviour when applied in an open-loop (resp., receding horizon) fashion to a deterministic LTI system B. As a major difference, most MPC approaches assume direct state measurements, which allows certain methods to be applied (e.g., inclusion of terminal state constraints). We remark that (uini, yini, u, y) are coordinates of a (generally non-minimal) realization allowing for similar methodologies. See [10] in which stability and recursive feasibility for a variant of (5) have been shown.

IV. DISTRIBUTIONALLY ROBUST DEEPC

We now consider the case where system B is subject to stochastic disturbances, which result in noisy output measurements in the data matrices. We begin by modifying the objective function and constraints of the deterministic DeePC problem (5) to robustify against the stochastic disturbances. This leads to a semi-infinite optimization problem which we reformulate, under reasonable assumptions, into a finite, convex, and tractable program termed the distributionally robust DeePC problem. Finally, we show that the distributionally robust DeePC problem enjoys robust performance guarantees.

A. Problem Setup

We view the output data matrices Yp and Yf given by (3) as particular realizations of random variables which we denote by Yp and Yf, respectively. Recall that we use the ˆ· notation to denote recorded data samples and to indicate that objects depend on recorded data samples. Define the random variable ξ = (ξ₁^⊤, . . . , ξ_{p(Tini+Tf)}^⊤)^⊤, where ξ_i^⊤ ∈ R^{⌊T/(Tini+Tf)⌋} denotes the i-th row of the matrix (Yp^⊤ Yf^⊤)^⊤. We denote the support set of the random variable ξ by Ξ and the probability distribution of ξ by P. Note that this distribution is induced by the system B itself and any stochastic disturbance acting on the system.

Due to the stochastic nature of the system, the future trajectory predictions obtained from the deterministic DeePC optimization problem (5) are uncertain. Furthermore, the realized data matrix (1) may not describe a low-dimensional subspace as described in Remark 2.1. In fact, the data matrix in (1) will likely have full rank and can thus predict arbitrary future trajectories. Finally, the distribution P itself is unknown since we do not have models of the system and disturbances. Hence, we make some changes to the objective function and the constraints in (5) to robustify against the noisy data.

Objective Function: Since the constraint used to obtain the initial condition of the future trajectory, given by Yp g = yini, is affected by the stochastic disturbance and may not be feasible, we lift it into the objective function using an estimation function f3 : R^{pTini} → R≥0 mapping Yp g − yini to a penalty. Including f3 in the objective function can be thought of as an implicit estimation of the initial condition from which the predicted future trajectory must evolve. Choosing f3(·) = ‖·‖₂ would result in a least-squares type initial condition estimate reminiscent of moving horizon estimation [45].

Since the system is stochastic, we focus on minimizing the expectation of the objective function in (5), where the expectation is taken with respect to the true distribution P, i.e., we wish to minimize E_P[f1(Uf g) + f2(Yf g) + f3(Yp g − yini)].

Recall that the distribution P pertains to the offline samples (u[1,T], y[1,T]) and not to the real-time measurement yini addressed through the above least-squares estimation. As discussed before, the distribution P itself is unknown and must be estimated from data. In order to be robust to errors in this estimate, we use distributionally robust optimization techniques; namely, we construct a so-called ambiguity set P depending on the collected data (specified precisely later) such that the true distribution lies in the ambiguity set with


high confidence. We then optimize the expectation of the cost function, where the expectation is taken with respect to the worst-case distribution in P. This leads us to the distributionally robust objective

sup_{Q∈P}  E_Q[f1(Uf g) + f2(Yf g) + f3(Yp g − yini)].

We note that this formulation leads to robustness against a “set of systems” compatible with the (possibly noisy) data samples, and that this “set of systems” is broader than mere LTI systems with additive process and measurement noise.

Constraints: Requiring the future outputs to lie in a particular subset Y almost surely as in (5) may not be possible. Thus, we relax the hard output constraint to a so-called conditional value at risk (CVaR) constraint. Let the output constraint be written as Y = {y | h(y) ≤ 0}, where h : R^{pTf} → R describes desired constraints on future trajectories y. We define

CVaR^P_{1−α}(h(y)) := inf_{τ∈R} { τ + (1/α) E_P[(h(y) − τ)₊] },

where α ∈ (0, 1) is a user-chosen confidence parameter. It is well known that the constraint CVaR^P_{1−α}(h(y)) ≤ 0 is a convex relaxation of the classical chance constraint (or VaR) given by P(h(y) ≤ 0) ≥ 1 − α (see, e.g., [46]). In other words, the CVaR constraint ensures that the constraint y ∈ Y will hold with high probability. In the case when the stochastic disturbance is a continuous random variable, the CVaR constraint can be interpreted as penalizing the expected violation in the α percent of cases where violations do occur (see Figure 1). The latter point is particularly important in control applications where large violations of constraints could lead to catastrophic system behaviour. Hence, the CVaR constraint is not merely a relaxation of the classical VaR constraint, but often the more reasonable constraint formulation for control problems.
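As a numerical sanity check on this definition (a sketch with synthetic losses, not the paper's data), the empirical CVaR computed from the variational infimum formula coincides with the average of the worst α-fraction of samples:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)       # sample losses h(y) (illustrative)
alpha = 0.1

def cvar(x, alpha):
    """Empirical CVaR via inf_tau { tau + E[(x - tau)_+] / alpha }.
    The piecewise-linear objective attains its minimum at a sample,
    so minimizing over the sample points is exact."""
    taus = np.sort(x)
    vals = taus + np.mean(np.maximum(x[None, :] - taus[:, None], 0.0),
                          axis=1) / alpha
    return vals.min()

# Cross-check: CVaR equals the mean of the worst alpha-fraction of samples
k = int(alpha * len(x))
tail_mean = np.mean(np.sort(x)[-k:])
print(abs(cvar(x, alpha) - tail_mean))
```

Here CVaR averages the magnitude of the worst 10% of losses, which is exactly the "expected violation in the cases where violations occur" interpretation above.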

Fig. 1: Depiction of the conditional value at risk CVaR^P_{1−α}(X) at level α = 0.15 for a Gaussian random variable X. The shaded region accounts for α × 100% of the mass of the Gaussian distribution.

Similar to minimizing the worst-case expected objective function, we ask that the CVaR constraint be satisfied for the worst-case distribution in P. This leads us to the robust constraint

sup_{Q∈P}  CVaR^Q_{1−α}(h(y)) ≤ 0.

One particular ambiguity set P that offers strong performance guarantees as well as mathematical tractability, while allowing the user to adjust its conservatism, is the Wasserstein ambiguity set centred around the so-called empirical distribution, i.e., the sample distribution [5]. In particular, let M(Ξ) be the set of all distributions Q supported on Ξ such that E_Q[‖ξ‖_r] < ∞, where ‖·‖_r is the r-norm for some r ∈ [1,∞].

Definition 4.1: Let r ∈ [1,∞]. The Wasserstein metric d_W : M(Ξ) × M(Ξ) → R≥0 is defined as

d_W(Q1, Q2) := inf_Π  ∫_{Ξ²} ‖ξ1 − ξ2‖_r Π(dξ1, dξ2),

where Π is a joint distribution of ξ1 and ξ2 with marginal distributions Q1 ∈ M(Ξ) and Q2 ∈ M(Ξ), respectively.

The Wasserstein metric can be viewed as a distance between probability distributions, where the distance is calculated via an optimal mass transport plan Π. For ε ≥ 0, we denote the Wasserstein ball of radius ε centred around distribution Q by

B_ε(Q) := {Q′ ∈ M(Ξ) | d_W(Q, Q′) ≤ ε}.

We chose the Wasserstein ball as the ambiguity set, as opposed to other popular ambiguity sets such as f-divergence or moment-based ambiguity sets, since the latter require prior knowledge and assumptions on the true data-generating distribution (absolute continuity with respect to a nominal distribution, or known moments).

We now present the distributionally robust DeePC method for controlling stochastic systems.

Data Collection (offline): We collect N ∈ Z_{>0} trajectories via repeated identical experiments. We start by fixing an input sequence u[1,T] that is (Tini + Tf)-Page exciting of order n(B) + 1. For each experiment i ∈ {1, . . . , N}, we assume that the system is initialized to the same (unknown) state, apply the inputs u[1,T], and measure the corresponding output trajectory y^(i)_[1,T]. The superscript (i) denotes the trajectory obtained from the i-th experiment. For each data batch i ∈ {1, . . . , N}, build data matrices Up, Uf, Y^(i)_p, and Y^(i)_f using (3). The output data matrices implicitly define N data samples which we denote by ξ^(i) for i ∈ {1, . . . , N}. Lastly, we use the collected data to build the empirical distribution P̂_N = (1/N) Σ_{i=1}^N δ_{ξ^(i)}. Note that collecting only one T-length trajectory (i.e., N = 1) is possible. As we will see in Section V-B though, choosing N larger can lead to less uncertainty and better performance.

Distributionally Robust DeePC (online): We consider the Wasserstein ball of radius ε (which will be chosen as a hyperparameter quantifying a desired robustness level) around the empirical distribution P̂_N as the ambiguity set. Hence, the distributionally robust DeePC problem is given by

min_g   sup_{Q∈B_ε(P̂_N)}  E_Q[f1(Uf g) + f2(Yf g) + f3(Yp g − yini)]
s.t.    Up g = uini
        Uf g ∈ U
        sup_{Q∈B_ε(P̂_N)}  CVaR^Q_{1−α}(h(Yf g)) ≤ 0.      (6)

By assuming that the system is initialized to the same (unknown) state at the beginning of each experiment and applying an identical input sequence for each experiment, we ensure that the collection of data matrices Y^(i)_p and Y^(i)_f describe N independent and identically distributed (i.i.d.) samples of the random variable ξ. Before their realization, the collection {ξ^(i)}_{i=1}^N can be viewed as a random object governed by the N-fold product distribution P^N. The empirical distribution P̂_N


is our sample average estimate for the true distribution P and is set as the centre of the ambiguity set in (6).

Compared to MPC or deterministic DeePC discussed in Section III, the distributionally robust program (6) features the additional robustness parameter ε, whose selection will be discussed later, and α, whose value is usually chosen to be 0.1, 0.05, or 0.01. One notable feature of (6) is that it is robust to all probability distributions within the ambiguity set that could describe the data matrices Yp and Yf. Hence, the above formulation is robust to a “set of systems” which captures more than LTI systems with additive noise. In fact, for a sufficiently large robustness parameter ε, the ambiguity set considers all time series originating from linear or nonlinear systems as long as they have “finite expectation”, i.e., the integrals defining the Wasserstein metric exist. This follows from the fact that the Wasserstein ball contains all Dirac distributions generated by time series ξ satisfying d_W(δ_ξ, P̂_N) ≤ ε. Hence, for ε large enough, the time series ξ can be arbitrary.

Problem (6) is a semi-infinite optimization problem due to the suprema over the space of distributions in the cost and in the constraints. Moreover, the Wasserstein metric (resp. CVaR) is itself defined via an infinite (resp. finite) program. Thus, (6) does not immediately present itself as a tractable program. Below, we show that under reasonable assumptions on the objective and constraint functions, the semi-infinite optimization problem above admits a tractable reformulation.

B. Main results

We provide a tractable reformulation of the distributionally robust DeePC problem (6) which depends on two intermediate lemmas (Lemmas A.3 and A.4 in the Appendix) providing separate reformulations for the objective function and constraints. In the statement of the result below, we make use of the dual norm ‖·‖_q = ‖·‖_{r,∗}, where q and r satisfy 1/r + 1/q = 1.

Theorem 4.1: (Tractable Reformulation): Assume that f1 is convex and f2, f3, and h are convex and Lipschitz continuous. Specifically, let Lobj > 0 (resp., Lcon > 0) be the Lipschitz constant with respect to the r-norm of the mapping (x, y) ↦ f2(x) + f3(y) (resp., h). Then the optimal value of problem (6) is upper bounded by the optimal value of

min_{g,τ,s_i}   f1(Uf g) + (1/N) Σ_{i=1}^N ( f2(Y^(i)_f g) + f3(Y^(i)_p g − yini) ) + Lobj ε ‖g‖_{r,∗}
s.t.    Up g = uini
        Uf g ∈ U
        −τα + Lcon ε ‖g‖_{r,∗} + (1/N) Σ_{i=1}^N s_i ≤ 0
        τ + h(Y^(i)_f g) ≤ s_i   ∀ i ≤ N
        s_i ≥ 0   ∀ i ≤ N.      (7)

Moreover, the solution g⋆ to the above will satisfy CVaR^Q_{1−α}(h(Yf g⋆)) ≤ 0 for all Q ∈ B_ε(P̂_N). The objective of (7) coincides with the objective of (6) when Ξ = R^{p(Tini+Tf)⌊T/(Tini+Tf)⌋}.

Observe that when the data matrices in (3) are Hankel matrices, then Ξ is always a strict subspace of R^{p(Tini+Tf)⌊T/(Tini+Tf)⌋} encoding the Hankel structure. When the matrices are Page matrices, the support set Ξ may coincide with R^{p(Tini+Tf)⌊T/(Tini+Tf)⌋} (e.g., if the measurements are affected by Gaussian noise). A similar result holds for piecewise affine constraint functions by combining Lemmas A.3 and A.5. In this case, the reformulated constraint set can be made tight at the cost of making the objective an upper bound (see Remark A.1). Since we treat the objective and constraints separately (via Lemmas A.3 and A.4), it is possible to choose different values of ε in the constraint and objective of (6), though we refrain from doing so for clarity of exposition. Furthermore, by treating the objective and the constraints separately, their worst-case distributions may differ, leading to a more conservative solution.

The above result shows that the distributionally robust objective can be reformulated as the sample average of the objective function plus an additional regularization term. To the best of our knowledge, this observation was first made in [47] in the context of machine learning problems. Hence, being robust in the trajectory space (data space) in the sense of the r-norm requires regularizing the sample average objective with the dual q-norm, where the weight of regularization depends on the Lipschitz constant of the objective function and the radius of the Wasserstein ball. For example, the 1-norm regularization adopted in [1] corresponds to ∞-norm robustness in the trajectory space.
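The identity "worst case over the Wasserstein ball = sample average + Lipschitz-weighted dual-norm regularizer" can be probed numerically. The sketch below uses an absolute-value loss whose Lipschitz constant in ξ (with respect to the 2-norm, whose dual is again the 2-norm) is ‖g‖₂, and checks that randomly perturbed empirical distributions with average transport cost ε never exceed the regularized bound; all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)

# Loss f(xi) = |g . xi - y| is Lipschitz in xi with constant ||g||_2.
d, N, eps = 5, 50, 0.3
g = rng.standard_normal(d)
y = 0.7
xi = rng.standard_normal((N, d))      # data samples xi^(i)

loss = lambda z: np.abs(z @ g - y)
bound = loss(xi).mean() + eps * np.linalg.norm(g)   # regularized objective

# Randomized check: perturbing each sample by delta_i, with average
# transport cost (1/N) sum ||delta_i|| equal to eps, can never beat the
# sample-average-plus-regularizer bound, by Lipschitz continuity.
worst = 0.0
for _ in range(200):
    delta = rng.standard_normal((N, d))
    delta *= eps / np.linalg.norm(delta, axis=1).mean()   # mean cost = eps
    worst = max(worst, loss(xi + delta).mean())
print(worst <= bound)
```

The inequality holds deterministically here; for this particular loss and unbounded support the bound is in fact attained in the limit, which is the tightness statement of Theorem 4.1.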

We now show that the robust DeePC problem (7) exhibits strong probabilistic guarantees. We need the following auxiliary measure concentration result under a light-tailedness assumption.

Theorem 4.2: (Measure Concentration [48, Theorem 2]): Assume that E_P[exp(‖ξ‖_r^a)] < ∞ for some a > 1. Then

P^N{ d_W(P, P̂_N) ≥ ε }  ≤  { c1 exp(−c2 N ε^{⌊T/(Tini+Tf)⌋})   if ε ≤ 1,
                             { c1 exp(−c2 N ε^a)                 if ε > 1,

for all N ≥ 1 and ε > 0, where P^N is the N-fold product distribution, and c1, c2 are positive constants depending on a, ⌊T/(Tini+Tf)⌋, and the value of E_P[exp(‖ξ‖_r^a)].

We can use Theorem 4.2 to compute the minimum Wasserstein radius ε such that the true data-generating distribution P lies inside the Wasserstein ball B_ε(P̂_N) with confidence at least 1 − β. Indeed, by inverting the above inequalities the minimum radius is given by

ε(β,N) = { ( log(c1 β^{−1}) / (c2 N) )^{1/⌊T/(Tini+Tf)⌋}   if N ≥ log(c1 β^{−1}) / c2,
         { ( log(c1 β^{−1}) / (c2 N) )^{1/a}               if N < log(c1 β^{−1}) / c2.      (8)

Theorem 4.3: (Robust Performance Guarantee): Assume that E_P[e^{‖ξ‖_r^a}] < ∞ for some a > 1. Let β ∈ (0, 1) be the desired level of confidence. Let J(g) := E_P[f1(Uf g) + f2(Yf g) + f3(Yp g − yini)]. Let Ĵ(g) denote the value of the objective function in (7) evaluated at g, and let g⋆ denote an optimizer of problem (7) with ε(β,N) chosen as in (8). Then

P^N{ J(g⋆) ≤ Ĵ(g⋆) } ≥ 1 − β,  and

P^N{ CVaR^P_{1−α}(h(Yf g⋆)) ≤ 0 } ≥ 1 − β.

Theorem 4.3 shows that with probability 1 − β (with respect to the N-fold product distribution P^N of P), the optimal value of the robust DeePC problem (7) is an upper bound for the expected cost with respect to the true distribution P. Furthermore, the CVaR constraint with respect to the true distribution P holds with probability 1 − β.

Remark 4.1: The Wasserstein radius given by (8) decays with a rate of N^{−1/⌊T/(Tini+Tf)⌋}. To decrease the radius by 50%, the number of samples N must increase by a factor of 2^{⌊T/(Tini+Tf)⌋}. This rate (albeit unfortunate) is tight [49]. In practice, the Wasserstein ball radius given by (8) is often larger than necessary, i.e., P ∉ B_ε(P̂_N) with probability much less than β. Furthermore, even when P ∉ B_ε(P̂_N), the robust quantity Ĵ(g⋆) may still serve as an upper bound for the out-of-sample performance J(g⋆) [5]. In addition to this conservativeness, we note that the Wasserstein radius provided by (8) is hardly quantifiable as it depends on constants which are unknown to us in our data-driven setting. For practical purposes, one should choose the radius of B_ε(P̂_N) in a data-driven fashion (e.g., as we have done via tuning in the experiments in Section V-B). Alternatively, one could use standard cross-validation techniques (some of which are stated in [5]) or more advanced techniques (albeit only providing asymptotic guarantees) provided in [50]. •

The robust DeePC Algorithm 1 implements (7) in a receding horizon fashion. We note that all results in this section hold when the data matrices in (3) are Hankel matrices. However, the upper bound and set inclusions presented in Lemmas A.3, A.4 and Theorem 4.1 are not tight, because the Hankel structure restricts Ξ to a strict subspace of R^{p(Tini+Tf)⌊T/(Tini+Tf)⌋}. If the constraints are piecewise affine (see Lemma A.5), then the constraint reformulation can be tight for Hankel matrices. In the following case study we will compare the performance of Hankel and Page matrix structures.

Algorithm 1 Robust DeePC

Input: data trajectories col(u, y^(i)) ∈ R^{(m+p)T} for i ∈ {1, . . . , N} with u (Tini + Tf)-Page exciting of order n(B) + 1; most recent input/output measurements col(uini, yini) ∈ R^{(m+p)Tini}

1: Solve (7) for g⋆.
2: Compute the optimal input sequence u⋆ = Uf g⋆.
3: Apply the optimal input sequence (u_t, . . . , u_{t+ν−1}) = (u⋆₁, . . . , u⋆_ν) for some ν ≤ Tf.
4: Set t to t + ν and update uini and yini to the Tini most recent input/output measurements.
5: Return to 1.
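The receding-horizon loop of Algorithm 1 can be sketched as follows, with `solve_robust_deepc` and `plant` as hypothetical placeholders for solving (7) and for the real system, respectively:

```python
import numpy as np
from collections import deque

Tini, Tf, nu, m = 2, 3, 1, 1        # illustrative horizons and input size

def solve_robust_deepc(uini, yini):
    # Placeholder for steps 1-2: solve (7), return Uf @ g_star
    return np.zeros(Tf * m)

def plant(u):
    # Placeholder for the real system's measurement response
    return 0.5 * u

u_hist = deque([0.0] * Tini, maxlen=Tini)   # uini buffer
y_hist = deque([0.0] * Tini, maxlen=Tini)   # yini buffer

for t in range(10):
    u_star = solve_robust_deepc(np.array(u_hist), np.array(y_hist))
    for u in u_star[:nu]:           # step 3: apply the first nu inputs
        y = plant(u)
        u_hist.append(u)            # step 4: slide the Tini-length window
        y_hist.append(y)

print(len(u_hist), len(y_hist))     # buffers stay at length Tini
```

The `maxlen` deques implement step 4 implicitly: appending a new measurement discards the oldest one, keeping exactly the Tini most recent input/output pairs.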

V. NUMERICAL RESULTS

A. Nonlinear and Stochastic Aerial Robotics Case Study

We illustrate the performance of distributionally robust DeePC with a high-fidelity nonlinear quadcopter simulation [13]. The output measurements are the 3 spatial coordinates (px, py, pz) of the quadcopter. We denote the output at time t by yt = (px,t, py,t, pz,t) ∈ R³ and the i-th component of yt by yt,i. The inputs are the total thrust produced by all 4 rotors, ftot, and the angular body rates around the x and y body axes of the quadcopter, ωx and ωy, respectively. We denote the input at time t by ut = (ftot,t, ωx,t, ωy,t) ∈ R³ and the i-th component of ut by ut,i. The output measurements are affected by additive zero-mean Gaussian noise during the data collection phase (offline) and the control phase (online) in which the distributionally robust DeePC Algorithm 1 is implemented. Statistics of the noise were chosen to closely match the experimental setup [13].

The output measurements were taken at a rate of 25 Hz (i.e., every 40 ms). We performed N = 10 offline experiments, each of which yielded 15500 input/output measurements to populate the Page matrices Up, Uf, Y^(i)_p, and Y^(i)_f for i ∈ {1, . . . , 10}. The inputs used to excite the system were drawn from a Gaussian distribution. The same randomly generated inputs were used for every repeated experiment. Because the quadcopter system is open-loop unstable, these randomly generated inputs were added to an existing controller that maintains the quadcopter around a hover state. This existing controller was only used during the offline data collection phase, and was not used during the online control phase. Since the randomly generated inputs were much larger in magnitude than the adjustments made by the existing stabilizing controller during data collection, the matrices Y^(i)_p, Y^(i)_f are approximately i.i.d.

The future prediction horizon was chosen as Tf = 25 (1 second in real time), and the time horizon used to implicitly estimate the initial condition was set to Tini = 6. The inputs were constrained to the set [0.1597, 0.4791] × [−π/2, π/2]². The output constraint function h was chosen such that (px, py, pz) was constrained to the set [−0.6, 1.6]³, with α = 0.1 the confidence parameter in the CVaR constraint. The composite objective function consisted of weighted 2-norms

f1(u[1,Tf]) = 16 ‖u[1,Tf],1 − fref‖₂ + 4 ‖(u[1,Tf],2, u[1,Tf],3)‖₂
f2(y[1,Tf])² = 1600² Σ_{t=1}^{Tf} ‖yt − yref‖²₂
f3(σ) = 750000 ‖σ‖₂,

where fref = (0.27, . . . , 0.27) is the thrust needed to hover the quadcopter and yref = (px,ref, py,ref, pz,ref) = (0.5, 0.5, 1.5). The Wasserstein norm was chosen as the 2-norm and ε = 0.003. The optimization problem (7) was implemented in a receding horizon fashion with control horizon ν = 1 (see Algorithm 1). A representative trajectory of the nonlinear and stochastic closed-loop system is illustrated in Figure 2(a), where distributionally robust DeePC succeeds in steering the quadcopter to the reference trajectory while satisfying output constraints.


B. Choosing the Hyperparameters

In this section we study the effect of the hyperparameters on the performance of distributionally robust DeePC. To compare the performance of Algorithm 1 for different hyperparameters, we simulated the nonlinear stochastic quadcopter in receding horizon with control horizon ν = 1. We computed the tracking error Σ_{t=0}^{Tsim} ‖yt − yref‖²₂, where Tsim = 250 (corresponding to 10 seconds in real time).

1) ε: We varied the Wasserstein radius for fixed N = 1 and N = 10, Tini = 6, and 300 columns in the data matrix. Figures 2(b) and 2(c) show the relationship between tracking error and ε for Page and Hankel matrix structures, respectively. Increasing the number of data batches N increases the range of radii that lead to satisfactory tracking performance. This is because the empirical distribution P̂_N at the centre of the Wasserstein ball in problem (6) gets closer (in the Wasserstein sense) to the true distribution P from which the data is drawn. This allows the user to decrease ε. Since the optimal value of the distributionally robust DeePC problem (7) is monotonically increasing with ε, choosing the minimum ε such that P ∈ B_ε(P̂_N) gives the minimum upper bound on the out-of-sample performance that holds with high confidence. Hence, increasing N decreases this minimum ε, leading to better out-of-sample performance with high confidence. We note that the effect of increasing N is more pronounced for the Hankel matrices. This may be due to the fact that Page matrices already require more data points to construct when compared to a Hankel matrix with the same number of columns.

Performing experiments to obtain the range of ε resulting in stable flight (Figures 2(b) and 2(c)) may not always be possible. We refer the reader to [5, Section 7.2.2] for methods approximating an optimal ε using the data already collected.

2) T: We varied the number of data points used in the data matrices constructed in (3). In particular, we studied the effect of adding additional columns to the data matrices with N = 1, Tini = 6, and choosing the ε ∈ {a × 10^{−b} | a ∈ {1, . . . , 9}, b ∈ {2, 3, 4}} which minimizes the tracking error. In these simulations the number of data points used to construct the Page matrices did not meet the theoretical minimum required by Theorem 2.1, but good tracking performance was still exhibited. Figure 2(d) indicates that increasing the amount of data in the data matrices can significantly improve the tracking performance. Past a certain threshold (200 columns for the Page matrix structure, 300 columns for the Hankel matrix structure), the tracking performance remains approximately constant. This is likely due to the fact that with this amount of data, the data matrices characterize a rich enough subspace of trajectories well approximating the nonlinear dynamics. This intuition is drawn from the fact that nonlinear dynamics (under certain assumptions) can be lifted to large (often infinite) dimensional linear dynamics, and a large enough data set well approximates the dominant modes.

Aside from comparing the number of columns (and thus the size of the optimization variables), we also compare the Hankel and Page matrices given the same fixed amount of data T. A representative comparison is presented in Table I. The mean solve time represents the average time in seconds it took to solve (7) with N = 1, Tini = 6, and the ε ∈ {a × 10^{−b} | a ∈ {1, . . . , 9}, b ∈ {2, 3, 4}} which minimizes the tracking error. The optimization was performed using MOSEK [51] on a 3.4 GHz Intel Core i5 with 16 GB of RAM.

Fig. 2: (a) Step trajectory of quadcopter using distributionally robust DeePC Algorithm 1; (b)–(c) Dependence of tracking error on the Wasserstein radius ε; (d) Dependence of tracking error on the number of columns in the data matrix; (e) Comparison of tracking error for Page and Hankel matrices; (f) Dependence of tracking error on the initial condition horizon Tini.

TABLE I: Comparison of mean solve times and tracking error for Hankel and Page matrices for varying data lengths.

              Mean solve time (s)      Tracking error
Data length   T = 527   T = 15407     T = 527    T = 15407
Hankel        0.182     3.70          33.94      34.35
Page          0.007     0.166         7 × 10^6   33.97

When T = 527, the number of columns in the Hankel and Page matrices are 497 and 17, respectively. When T = 15407, the number of columns in the Hankel and Page matrices are 15377 and 497, respectively. Each additional Tini + Tf data points allow the addition of one column to the Page matrix and Tini + Tf columns to the Hankel matrix. Hence, the size of the optimization variable g in (7) when using a Hankel matrix increases Tini + Tf times faster than that of a Page matrix.
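The column counts above follow directly from how the two structures consume data: a depth-L Hankel matrix built from T points slides one step per column, while a Page matrix uses disjoint L-blocks. A quick check against the case-study numbers (L = Tini + Tf = 31):

```python
def hankel_cols(T, L):
    # Hankel: one new column per additional data point
    return T - L + 1

def page_cols(T, L):
    # Page: one new column per additional L data points
    return T // L

L = 6 + 25                  # Tini + Tf from the case study
print(hankel_cols(527, L), page_cols(527, L))       # 497 17
print(hankel_cols(15407, L), page_cols(15407, L))   # 15377 497
```

These reproduce the counts reported for Table I and make the Tini + Tf growth-rate gap between the two structures explicit.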

In terms of tracking error, we see that for T = 527 the performance of the Page matrix is poor (i.e., the quadcopter crashes) since there are not enough columns in the Page matrix to construct a rich enough subspace of trajectories. Increasing to T = 15407 results in good performance for both the Hankel and Page matrices, at the cost of a much larger solve time for the Hankel matrix. Furthermore, the tracking errors confirm the observation from Figure 2(d) that there is a threshold on the number of data points above which the tracking errors remain approximately constant.

3) Data matrix structure: We performed simulations with hyperparameters N ∈ {1, 2, 5, 10}, Tini = 6, number of columns in the data matrix ranging from 150 to 500, and ε ∈ {a × 10^{−b} | a ∈ {1, . . . , 9}, b ∈ {2, 3, 4}}. Over all simulations, the same data set was used to populate the Page/Hankel matrices constructed in (3). The 273 simulations with tracking error no larger than 200 (i.e., those exhibiting stable flight of the quadcopter) are plotted in the histogram in Figure 2(e). The Page matrix significantly outperforms the Hankel matrix in terms of tracking performance. This observation may be due to the fact that the tractable reformulation of the distributionally robust DeePC objective in (7) is tight for the Page matrix. Additionally, the entries of the Hankel matrix are repeated, which may result in a higher sensitivity to noise compared to the Page matrix, whose entries are not repeated. Lastly, we note that using a Page matrix allows one to perform a singular value decomposition (SVD) to optimally preprocess the data. This is not the case for the Hankel matrix since the SVD does not preserve Hankel structure. See [52] for a relevant case study.
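A sketch of the SVD preprocessing mentioned above, on synthetic data: a noisy low-rank matrix standing in for a Page matrix is denoised by truncating its SVD (applied to a Hankel matrix, this truncation would destroy the Hankel structure). The rank and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic "Page matrix": rank-5 signal plus small measurement noise.
signal = rng.standard_normal((31, 5)) @ rng.standard_normal((5, 100))
noisy = signal + 0.01 * rng.standard_normal(signal.shape)

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 5                                      # presumed subspace dimension
denoised = (U[:, :k] * s[:k]) @ Vt[:k]     # best rank-k approximation

# Truncation discards the noise outside the dominant subspace
print(np.linalg.norm(denoised - signal) <= np.linalg.norm(noisy - signal))
```

The truncation keeps only the dominant trajectory subspace, which is the sense in which the SVD "optimally" preprocesses Page-structured data.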

4) Tini: We varied the horizon over which the initial condition is implicitly estimated while fixing N = 1 and 300 columns in the data matrix. For each value of Tini ∈ {1, . . . , 10}, a simulation was performed for every ε ∈ {a × 10^{−b} | a ∈ {1, . . . , 9}, b ∈ {2, 3, 4}}. For the best performing ε, the value of the tracking error for each Tini is reported in Figure 2(f). Once Tini is past a certain threshold (Tini = 5), the distributionally robust DeePC algorithm exhibits satisfactory tracking performance. This is because Tini is the main parameter to fix the “system complexity” inside the DeePC algorithm (see Lemma 2.1).

VI. CONCLUSION

We presented a data-driven method for controlling stochastic constrained LTI systems using only raw data collected from the system and without the need to explicitly identify a model. The method comes with strong out-of-sample performance guarantees and ensures constraint satisfaction with high confidence. Furthermore, we discussed how the performance of the algorithm is affected by the hyperparameters. Future work includes exploring the extension of this method to strongly nonlinear systems, extending to closed-loop guarantees, and comparison to identification-based control approaches. Furthermore, our Page excitation condition is only sufficient, and we seek relaxed conditions. Finally, we are interested in an online adaptation of the data matrix.

ACKNOWLEDGMENT

The authors would like to thank Linbin Huang, Peyman Mohajerin Esfahani, Claudio de Persis, Ivan Markovsky, and Manfred Morari for useful discussions, as well as Ezzat Elokda and Paul Beuchat for providing the simulation model.

APPENDIX

Proof of Theorem 2.1: The “if” part of the statement is obvious by linearity and time-invariance of the system. We now prove the “only if” part. Our proof strategy is partially inspired by arguments in [18]. Let u[1,L] = (u1, . . . , uL), y[1,L] = (y1, . . . , yL) be a trajectory of B. Hence, there exist matrices A, B, C, D such that (u[1,L], y[1,L]) is a trajectory of B(A,B,C,D) with initial state x1 ∈ R^{n(B)}. For ease of notation, throughout the proof we will denote n(B) by n. Define the matrix of state sequences from x_i to x_{i+jL} as

X_{i,j} = ( x_i  x_{i+L}  · · ·  x_{i+jL} ).

Then by the system dynamics we have that

col(P_L(u[1,T]), P_L(y[1,T])) = [ 0          I                ] col(X_{1,⌊T/L⌋−1}, P_L(u[1,T])),
                                [ O_L(A,C)   T_L(A,B,C,D)     ]

where I is the identity matrix. Furthermore,

col(u[1,L], y[1,L]) = [ 0          I                ] col(x1, u[1,L]).
                      [ O_L(A,C)   T_L(A,B,C,D)     ]

If there exists a vector g such that

col(x1, u[1,L]) = col(X_{1,⌊T/L⌋−1}, P_L(u[1,T])) g,      (9)

then

col(u[1,L], y[1,L]) = [ 0          I                ] col(X_{1,⌊T/L⌋−1}, P_L(u[1,T])) g
                      [ O_L(A,C)   T_L(A,B,C,D)     ]

                    = col(P_L(u[1,T]), P_L(y[1,T])) g,

which is what we wanted to show. Hence it remains to prove that there always exists a vector g satisfying (9) for any x1 and u[1,L]. Equivalently, we will show that the matrix col(X_{1,⌊T/L⌋−1}, P_L(u[1,T])) has full row rank. Let (ξ  η) be an arbitrary vector in the left kernel of this matrix, where ξ^⊤ ∈ R^n and η^⊤ ∈ R^{mL}. We will show that (ξ  η) must be the zero vector.

To do this, we switch our attention to the matrix

col(X, U) := col( X_{1,⌊T/L⌋−n−1}, P_L(u[1,T−nL]), P_L(u[L+1,T−(n−1)L]), . . . , P_L(u[nL+1,T]) )      (10)

and follow arguments from [18] to construct n + 1 vectors in the left kernel of (10). By definition, (ξ  η  0_{nmL}) is in the left kernel of (10), where 0_{nmL}^⊤ ∈ R^{nmL} denotes a vector containing nmL zeros. Furthermore, we have that

col(X_{L+1,⌊T/L⌋−n−1}, P_L(u[L+1,T−(n−1)L])) = M col(X, U),      (12)

where M is defined in (11) and 0_{n×n} is the zero matrix. Since col(X_{L+1,⌊T/L⌋−n−1}, P_L(u[L+1,T−(n−1)L])) is a submatrix of col(X_{1,⌊T/L⌋−1}, P_L(u[1,T])), then by definition of (ξ  η), we have

(ξ  η) col(X_{L+1,⌊T/L⌋−n−1}, P_L(u[L+1,T−(n−1)L])) = 0^⊤_{⌊T/L⌋−n}.

Hence, by (12) we have that

(ξA^L  ξA^{L−1}B  · · ·  ξAB  ξB  η  0_{(n−1)mL})

is in the left kernel of (10). Note also that

X_{2L+1,⌊T/L⌋−n−1} = ( A^{2L}  A^{2L−1}B  · · ·  AB  B  0_{n×(n−1)mL} ) col(X, U).

Proceeding as above we find that the vector

(ξA^{2L}  ξA^{2L−1}B  · · ·  ξAB  ξB  η  0_{(n−2)mL})

is in the left kernel of (10). Continuing in this way, we obtain n + 1 vectors in the left kernel of (10) given by

w0 = (ξ  η  0_{nmL})
w1 = (ξA^L  ξA^{L−1}B  · · ·  ξAB  ξB  η  0_{(n−1)mL})
w2 = (ξA^{2L}  ξA^{2L−1}B  · · ·  ξAB  ξB  η  0_{(n−2)mL})
...
wn = (ξA^{nL}  ξA^{nL−1}B  · · ·  ξAB  ξB  η).

However, since the input vectors are L-Page exciting of order n + 1, the left kernel of (10) has dimension at most n. This implies that the vectors w0, . . . , wn are linearly dependent. From the structure of the vectors we see that η = 0_{mL}. By Cayley–Hamilton, there exist αi ∈ R, i ∈ {0, . . . , n}, with αn = 1 such that Σ_{i=0}^n αi (A^L)^i = 0. Define

v = Σ_{i=0}^n αi wi
  = ( ξ Σ_{i=0}^n αi (A^L)^i   · · ·   ξA^{L−1}B αn   · · ·   ξB αn   0_{mL} )
  = ( 0_n   ξ Σ_{i=1}^n αi A^{iL−1}B   · · ·   ξA^{L−1}B   · · ·   ξB   0_{mL} ).

Since v is a linear combination of vectors withen it is itself in the left kernel of (10). Hence,(ξ∑ni=1 αiA

iL−1B · · · ξAL−1B · · · ξB 0mL)

is in the left kernel of U . However, U has full row rank bythe excitation assumption implying the above vector mustbe the zero vector. Thus, ξ

(AL−1B · · · B

)= 0mL. By

assumption we have that L ≥ n and B is controllable. Thusξ = 0n. Hence,

(ξ η

)= 0n+mL implying that (10) has full

row rank which concludes the proof.Proof of Proposition 3.1: We first look at the feasible

set of (2). By rewriting the constraints in (2) we obtain

y = OTf(A,C)xt + TTf(A,B,C,D)u, u ∈ U , y ∈ Y,

where xt is the state at time t. We now look at the feasibleset of (5). Since u[1,T ] is (Tini + Tf)-Page exciting of ordern(B)+1, then by Theorem 2.1 image

(col(Up, Yp, Uf , Yf)

)=

BTini+Tf , where Up, Yp, Uf , Yf are defined as in (3). Hence,the feasible set of (5) is equal to (u, y) ∈ U × Y |col(uini, u, yini, y) ∈ BTini+Tf. Since the system B yields anequivalent representation given by B(A,B,C,D), then byLemma 2.1 the feasible set of (5) can be written as the setof pairs (u, y) ∈ U × Y satisfying

y = OTf(A,C)xini + TTf(A,B,C,D)u,

where xini is uniquely determined from col(uini, yini) and hencecoincides with xt. This proves the claim.
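For concreteness, the Page matrix $P_L(u_{[1,T]})$ used throughout these proofs stacks consecutive non-overlapping length-$L$ windows of the signal as columns, giving an $mL \times \lfloor T/L \rfloor$ matrix (in contrast to the overlapping windows of a Hankel matrix). A minimal numpy sketch, assuming that convention; the function name and signature are our own:

```python
import numpy as np

def page_matrix(u, L):
    """Page matrix P_L(u): columns are consecutive, non-overlapping
    length-L windows of the sequence u = (u_1, ..., u_T), u_t in R^m.

    u : array of shape (T, m).  Returns an array of shape (m*L, T // L).
    """
    T, m = u.shape
    cols = T // L                                 # floor(T/L) complete windows
    blocks = u[: cols * L].reshape(cols, L * m)   # row k = window k, flattened
    return blocks.T                               # shape (m*L, cols)

# Example: scalar input (m = 1), T = 7, L = 2 -> floor(7/2) = 3 columns.
u = np.arange(1, 8).reshape(-1, 1)   # samples 1, 2, ..., 7
P = page_matrix(u, 2)
# Columns are (1,2), (3,4), (5,6); the trailing sample 7 is discarded.
```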

The following two lemmas will serve as essential cornerstones of our subsequent results.

Lemma A.1: Assume that $\varphi$ is proper, convex, and lower semicontinuous. Then

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}[\varphi(\xi)] = \inf_{\lambda \ge 0}\ \lambda\varepsilon + \frac{1}{N}\sum_{i=1}^{N} \sup_{\xi \in \Xi} \bigl(\varphi(\xi) - \lambda\|\xi - \hat{\xi}^{(i)}\|_r\bigr).$$

Proof: The proof follows from [5, Theorem 4.2], specifically from Equation (12b) in [5], which relies on a marginalization and dualization of the constraint $Q \in \mathcal{B}_\varepsilon(\hat{P}_N)$.

Next we present a result similar to [2, Lemma 4.1]. The result relaxes the conservative assumption, imposed in [2], that $\Xi = \prod_{i=1}^{p(T_{\rm ini}+T_f)} \Xi_i$ for appropriately defined $\Xi_i$. The proof strategy is partially inspired by arguments in [5].

$$M := \begin{pmatrix}
A^{L} & A^{L-1}B \ \cdots \ AB \ \ B & 0_{n \times mL} & \cdots & 0_{n \times mL} \\
0_{m \times n} & 0_{m \times m} \ \cdots \ 0_{m \times m} & \begin{bmatrix} I & \cdots & 0_{m \times m} \end{bmatrix} & \cdots & \begin{bmatrix} 0_{m \times m} & \cdots & 0_{m \times m} \end{bmatrix} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0_{m \times n} & 0_{m \times m} \ \cdots \ 0_{m \times m} & \begin{bmatrix} 0_{m \times m} & \cdots & I \end{bmatrix} & \cdots & \begin{bmatrix} 0_{m \times m} & \cdots & 0_{m \times m} \end{bmatrix}
\end{pmatrix} \tag{11}$$
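As a sanity check (our own illustration, not part of the original analysis), Lemma A.1's dual reformulation can be verified numerically in a one-dimensional case where the worst-case expectation has a known closed form: with $\Xi = \mathbb{R}$, $r = 1$, and the $1$-Lipschitz convex function $\varphi(\xi) = |\xi|$, the supremum over the Wasserstein ball equals the empirical mean of $\varphi$ plus $\varepsilon$, and brute-forcing the dual over $(\lambda, \xi)$ grids recovers exactly that value:

```python
import numpy as np

# Empirical samples of a scalar uncertainty xi and a Wasserstein radius eps.
xi_hat = np.array([0.5, -1.2, 2.0, 0.3])
eps = 0.1

xi_grid = np.linspace(-100.0, 100.0, 20001)   # proxy for sup over Xi = R
lam_grid = np.linspace(0.0, 3.0, 301)         # proxy for inf over lambda >= 0

def dual_value(lam):
    # lambda*eps + (1/N) * sum_i  sup_xi ( |xi| - lambda*|xi - xi_hat_i| )
    inner = np.max(np.abs(xi_grid)[None, :]
                   - lam * np.abs(xi_grid[None, :] - xi_hat[:, None]), axis=1)
    return lam * eps + inner.mean()

dual_min = min(dual_value(lam) for lam in lam_grid)
closed_form = np.mean(np.abs(xi_hat)) + eps * 1.0   # empirical mean + eps*L_phi
print(dual_min, closed_form)                        # both approx 1.1
```

The inner supremum blows up for $\lambda < L_\varphi = 1$ and is attained at the samples for $\lambda \ge 1$, so the infimum sits exactly at the Lipschitz constant, mirroring the threshold in Lemma A.2 below.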

Lemma A.2: Let $c, d \in \mathbb{Z}_{>0}$ and $\Omega \subseteq \mathbb{R}^{cd}$. Let $q$ be such that $\frac{1}{r} + \frac{1}{q} = 1$. Let $\bar\zeta = \mathrm{col}(\bar\zeta_1, \dots, \bar\zeta_c) \in \Omega$ be given, where each $\bar\zeta_i \in \mathbb{R}^d$. Let $\lambda \in \mathbb{R}_{>0}$ and $\varphi : \mathbb{R}^c \to \mathbb{R}$ be a convex and Lipschitz continuous function with Lipschitz constant $L_\varphi$ with respect to the $r$-norm. Then for $b \in \mathbb{R}^d$,

$$\sup_{\zeta \in \Omega}\ \varphi(\zeta_1^\top b, \dots, \zeta_c^\top b) - \lambda\|\zeta - \bar\zeta\|_r \le \begin{cases} \varphi(\bar\zeta_1^\top b, \dots, \bar\zeta_c^\top b) & \text{if } L_\varphi \|b\|_q \le \lambda \\ \infty & \text{otherwise.} \end{cases}$$

The above is an equality when $\Omega = \mathbb{R}^{cd}$.

Proof: We begin by first noting that

$$\sup_{\zeta \in \Omega}\ \varphi(\zeta_1^\top b, \dots, \zeta_c^\top b) - \lambda\|\zeta - \bar\zeta\|_r \le \sup_{\zeta \in \mathbb{R}^{cd}}\ \varphi(\zeta_1^\top b, \dots, \zeta_c^\top b) - \lambda\|\zeta - \bar\zeta\|_r,$$

with equality when $\Omega = \mathbb{R}^{cd}$. Let $\Phi(\zeta) = \varphi(\zeta_1^\top b, \dots, \zeta_c^\top b)$. By definition of the conjugate function,

$$\begin{aligned}
\Phi^*(z) &= \sup_{\zeta \in \mathbb{R}^{cd}}\ \langle z, \zeta\rangle - \Phi(\zeta) \\
&= \sup_{\zeta \in \mathbb{R}^{cd}}\ \langle z, \zeta\rangle - \varphi(\zeta_1^\top b, \dots, \zeta_c^\top b) \\
&= \sup_{\zeta \in \mathbb{R}^{cd},\, s \in \mathbb{R}^c}\ \bigl\{\langle z, \zeta\rangle - \varphi(s) \,\big|\, s_j = \zeta_j^\top b \ \ \forall j \in \{1, \dots, c\}\bigr\},
\end{aligned}$$

where $s = \mathrm{col}(s_1, \dots, s_c)$. The Lagrangian of the above is given by

$$\mathcal{L}(s, \zeta, \theta) = \langle z, \zeta\rangle + \langle \theta, (s_1 - \zeta_1^\top b, \dots, s_c - \zeta_c^\top b)\rangle - \varphi(s).$$

By strong duality (see, e.g., [53, Proposition 5.3.1]),

$$\begin{aligned}
\Phi^*(z) &= \inf_\theta \sup_{\zeta \in \mathbb{R}^{cd},\, s}\ \langle z, \zeta\rangle + \langle \theta, (s_1 - \zeta_1^\top b, \dots, s_c - \zeta_c^\top b)\rangle - \varphi(s) \\
&= \inf_\theta \sup_{\zeta \in \mathbb{R}^{cd}}\ \langle z, \zeta\rangle - \langle \theta, (\zeta_1^\top b, \dots, \zeta_c^\top b)\rangle + \varphi^*(\theta) \\
&= \inf_\theta \sup_{\zeta \in \mathbb{R}^{cd}}\ \langle (z_1 - \theta_1 b, \dots, z_c - \theta_c b), \zeta\rangle + \varphi^*(\theta),
\end{aligned}$$

where $z = \mathrm{col}(z_1, \dots, z_c)$. The above problem may take the value $\infty$ unless $z_i = \theta_i b$ for all $i \in \{1, \dots, c\}$. Hence, by taking $\Theta := \{\theta \mid \varphi^*(\theta) < \infty\}$ as the effective domain of the conjugate function of $\varphi$, we obtain

$$\Phi^*(z) = \inf_\theta \{\varphi^*(\theta) \mid z_i = \theta_i b,\ \forall i \in \{1, \dots, c\}\} = \inf_{\theta \in \Theta} \{\varphi^*(\theta) \mid z_i = \theta_i b,\ \forall i \in \{1, \dots, c\}\}.$$

Since $\Phi$ is convex and continuous, the biconjugate $\Phi^{**}$ coincides with the function $\Phi$ itself. Hence,

$$\begin{aligned}
\Phi(\zeta) &= \sup_z\ \langle z, \zeta\rangle - \Phi^*(z) \\
&= \sup_z\ \langle z, \zeta\rangle - \inf_{\theta \in \Theta} \bigl\{\varphi^*(\theta) \,\big|\, z_i = \theta_i b,\ \forall i \in \{1, \dots, c\}\bigr\} \\
&= \sup_{\theta \in \Theta}\ \langle (\theta_1 b, \dots, \theta_c b), \zeta\rangle - \varphi^*(\theta).
\end{aligned}$$

Hence,

$$\begin{aligned}
&\sup_{\zeta \in \mathbb{R}^{cd}}\ \Phi(\zeta) - \lambda\|\zeta - \bar\zeta\|_r \\
&\quad= \sup_{\zeta \in \mathbb{R}^{cd}} \sup_{\theta \in \Theta}\ \langle (\theta_1 b, \dots, \theta_c b), \zeta\rangle - \varphi^*(\theta) - \lambda\|\zeta - \bar\zeta\|_r \\
&\quad= \sup_{\zeta \in \mathbb{R}^{cd}} \sup_{\theta \in \Theta} \inf_{\|\mu\|_q \le \lambda}\ \langle (\theta_1 b, \dots, \theta_c b), \zeta\rangle - \varphi^*(\theta) - \langle \mu, \zeta - \bar\zeta\rangle,
\end{aligned}$$

where the last equality comes from the definition of the dual norm and homogeneity of the norm. Using the minimax theorem (see, e.g., [53, Proposition 5.5.4]) we switch the supremum and infimum in the above, giving

$$\begin{aligned}
&\sup_{\zeta \in \mathbb{R}^{cd}}\ \Phi(\zeta) - \lambda\|\zeta - \bar\zeta\|_r \\
&\quad= \sup_{\theta \in \Theta} \inf_{\|\mu\|_q \le \lambda} \sup_{\zeta \in \mathbb{R}^{cd}}\ \langle (\theta_1 b, \dots, \theta_c b), \zeta\rangle - \varphi^*(\theta) - \langle \mu, \zeta - \bar\zeta\rangle \\
&\quad= \sup_{\theta \in \Theta} \inf_{\|\mu\|_q \le \lambda} \sup_{\zeta \in \mathbb{R}^{cd}}\ \langle (\theta_1 b - \mu_1, \dots, \theta_c b - \mu_c), \zeta\rangle - \varphi^*(\theta) + \langle \mu, \bar\zeta\rangle,
\end{aligned}$$

where $\mu = \mathrm{col}(\mu_1, \dots, \mu_c)$. Carrying out the supremum over $\zeta$ yields

$$\begin{aligned}
\sup_{\zeta \in \mathbb{R}^{cd}}\ \Phi(\zeta) - \lambda\|\zeta - \bar\zeta\|_r
&= \sup_{\theta \in \Theta} \begin{cases} \langle (\theta_1 b, \dots, \theta_c b), \bar\zeta\rangle - \varphi^*(\theta) & \text{if } \mu_i = \theta_i b \ \ \forall i \in \{1, \dots, c\} \\ \infty & \text{otherwise} \end{cases} \\
&= \sup_{\theta \in \Theta} \begin{cases} \langle (\theta_1 b, \dots, \theta_c b), \bar\zeta\rangle - \varphi^*(\theta) & \text{if } \|(\theta_1 b, \dots, \theta_c b)\|_q \le \lambda \\ \infty & \text{otherwise} \end{cases} \\
&= \begin{cases} \varphi(\bar\zeta_1^\top b, \dots, \bar\zeta_c^\top b) & \text{if } \sup_{\theta \in \Theta} \|(\theta_1 b, \dots, \theta_c b)\|_q \le \lambda \\ \infty & \text{otherwise,} \end{cases}
\end{aligned}$$

where we used the definition of the biconjugate function and $\varphi^{**} = \varphi$ (due to convexity and continuity). Note that $\|(\theta_1 b, \dots, \theta_c b)\|_q = \|\theta\|_q \|b\|_q$. From [5, Proposition 6.5], we know that $\|\theta\|_q \le L_\varphi$. In fact, one can show that $\sup_{\theta \in \Theta} \|\theta\|_q = L_\varphi$ [49, Remark 3]. Substituting this into the expression above proves the result.

The following lemmas enable our main results and provide separate reformulations for the objective function and constraints. The proof strategies are partially inspired by [5], [47], [54].
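The threshold behaviour in Lemma A.2 (a finite supremum equal to $\varphi(\bar\zeta_1^\top b, \dots, \bar\zeta_c^\top b)$ when $L_\varphi\|b\|_q \le \lambda$, and blow-up otherwise) can be checked numerically in one dimension; all numbers below are invented for illustration:

```python
import numpy as np

# One-dimensional instance of Lemma A.2: c = d = 1, phi(x) = |x|
# (so L_phi = 1), b = 2, zeta_bar = 1.5, r = q = 2 (self-dual norms).
# Prediction: sup equals phi(b * zeta_bar) = 3 iff L_phi*|b| = 2 <= lambda.
b, zeta_bar = 2.0, 1.5
zeta_grid = np.linspace(-1000.0, 1000.0, 200001)   # proxy for sup over R

def sup_value(lam):
    return np.max(np.abs(b * zeta_grid) - lam * np.abs(zeta_grid - zeta_bar))

finite_case = sup_value(2.5)   # lambda above the threshold: finite value 3
blowup_case = sup_value(1.5)   # lambda below the threshold: grows with the grid
print(finite_case, blowup_case)
```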

Lemma A.3 (Objective Reformulation): Assume that $f_2$ and $f_3$ are convex and Lipschitz continuous functions. Specifically, let $L_{\rm obj} > 0$ be the Lipschitz constant with respect to the $r$-norm of the mapping $(x,y) \mapsto f_2(x) + f_3(y)$. Then

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}\bigl[f_1(U_f g) + f_2(Y_f g) + f_3(Y_p g - y_{\rm ini})\bigr] \le f_1(U_f g) + \frac{1}{N}\sum_{i=1}^{N} \bigl(f_2(Y_f^{(i)} g) + f_3(Y_p^{(i)} g - y_{\rm ini})\bigr) + L_{\rm obj}\,\varepsilon\,\|g\|_{r,*}.$$

The above is an equality when $\Xi = \mathbb{R}^{p(T_{\rm ini}+T_f)\left\lfloor \frac{T}{T_{\rm ini}+T_f} \right\rfloor}$.


Proof: By definition of $\xi$, we have $f_2(Y_f g) + f_3(Y_p g - y_{\rm ini}) = f_2(\xi_{pT_{\rm ini}+1}^\top g, \dots, \xi_{p(T_{\rm ini}+T_f)}^\top g) + f_3((\xi_1^\top g, \dots, \xi_{pT_{\rm ini}}^\top g) - y_{\rm ini})$. Define

$$\varphi(\xi_1^\top g, \dots, \xi_{p(T_{\rm ini}+T_f)}^\top g) := f_2(\xi_{pT_{\rm ini}+1}^\top g, \dots, \xi_{p(T_{\rm ini}+T_f)}^\top g) + f_3((\xi_1^\top g, \dots, \xi_{pT_{\rm ini}}^\top g) - y_{\rm ini}).$$

By definition, we have $L_\varphi = L_{\rm obj}$. Define also $\Phi(\xi) := \varphi(\xi_1^\top g, \dots, \xi_{p(T_{\rm ini}+T_f)}^\top g)$. Then from Lemma A.1 applied to $\Phi$ we obtain

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}\bigl[f_1(U_f g) + f_2(Y_f g) + f_3(Y_p g - y_{\rm ini})\bigr] = \inf_{\lambda \ge 0}\ \lambda\varepsilon + f_1(U_f g) + \frac{1}{N}\sum_{i=1}^{N} \sup_{\xi \in \Xi} \bigl(\varphi(\xi_1^\top g, \dots, \xi_{p(T_{\rm ini}+T_f)}^\top g) - \lambda\|\xi - \hat{\xi}^{(i)}\|_r\bigr).$$

Applying Lemma A.2 yields

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}\bigl[f_1(U_f g) + f_2(Y_f g) + f_3(Y_p g - y_{\rm ini})\bigr] \le \varepsilon L_\varphi \|g\|_{r,*} + f_1(U_f g) + \frac{1}{N}\sum_{i=1}^{N} \bigl(f_2(Y_f^{(i)} g) + f_3(Y_p^{(i)} g - y_{\rm ini})\bigr).$$

The above is an equality when $\Xi = \mathbb{R}^{p(T_{\rm ini}+T_f)\left\lfloor \frac{T}{T_{\rm ini}+T_f} \right\rfloor}$. Substituting $L_\varphi = L_{\rm obj}$ yields the result.

The following two results depend on the constraint set

$$\mathcal{G}_{\rm CVaR} := \Bigl\{ g \,\Big|\, \sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathrm{CVaR}_{1-\alpha}^{Q}\bigl(h(Y_f g)\bigr) \le 0 \Bigr\}.$$

Lemma A.4 (Lipschitz Constraint Reformulation): Let $h$ be convex and Lipschitz continuous with Lipschitz constant $L_{\rm con}$ with respect to the $r$-norm. Then

$$\left\{ g \,\middle|\, \begin{array}{l} \exists\, \tau, s_i \text{ such that} \\ -\tau\alpha + L_{\rm con}\varepsilon\|g\|_{r,*} + \frac{1}{N}\sum_{i=1}^{N} s_i \le 0 \\ \tau + h(Y_f^{(i)} g) \le s_i \\ s_i \ge 0 \quad \forall i \le N \end{array} \right\} \subseteq \mathcal{G}_{\rm CVaR}.$$

The sets coincide if $\Xi = \mathbb{R}^{p(T_{\rm ini}+T_f)\left\lfloor \frac{T}{T_{\rm ini}+T_f} \right\rfloor}$ and $h(Y_f g)$ is bounded on $\Xi$ for all $g$.

Remark A.1: Note that if $\Xi = \mathbb{R}^{p(T_{\rm ini}+T_f)\left\lfloor \frac{T}{T_{\rm ini}+T_f} \right\rfloor}$, then $h(Y_f g)$ can only be bounded on $\Xi$ for all $g$ if $h$ is constant. Thus, for non-constant constraint functions $h$, the above will be an inner approximation of $\mathcal{G}_{\rm CVaR}$. •
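The proof of Lemma A.4 below rests on the variational characterization of CVaR, namely $\inf_{\tau} -\tau\alpha + \mathbb{E}[(h+\tau)_+] = \alpha\,\mathrm{CVaR}_{1-\alpha}(h)$, so the left-hand side is nonpositive exactly when the CVaR is. A quick numerical check of this identity on invented sample losses (a sketch, not part of the original text), with $\mathrm{CVaR}_{1-\alpha}$ computed directly as the mean of the worst $\alpha$-fraction of samples:

```python
import numpy as np

alpha = 0.1
h = np.arange(1.0, 101.0)            # 100 sample losses: 1, 2, ..., 100
cvar = np.sort(h)[-10:].mean()       # mean of the 10 worst losses = 95.5

# Brute-force inf_tau  -tau*alpha + E[(h + tau)_+]  over a tau grid.
tau_grid = np.linspace(-150.0, 50.0, 2001)
lhs = min(-tau * alpha + np.maximum(h + tau, 0.0).mean() for tau in tau_grid)
print(lhs, alpha * cvar)             # both approx 9.55
```

Any minimizing $\tau$ is the negative of the value-at-risk, and the objective is flat between adjacent order statistics, which is why a coarse grid already attains the exact minimum here.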

Proof of Lemma A.4: By definition of CVaR,

$$\mathrm{CVaR}_{1-\alpha}^{P}(h(y)) \le 0 \iff \inf_{\tau \in \mathbb{R}}\ -\tau\alpha + \mathbb{E}^{P}[(h(y) + \tau)_+] \le 0.$$

This can be shown by multiplying the definition of CVaR by $\alpha > 0$ and changing $\tau$ to $-\tau$. Letting $\ell(g,\xi) := h(\xi_{pT_{\rm ini}+1}^\top g, \dots, \xi_{p(T_{\rm ini}+T_f)}^\top g)$, the constraint $g \in \mathcal{G}_{\rm CVaR}$ is equivalent to $g$ satisfying

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \inf_{\tau \in \mathbb{R}}\ -\tau\alpha + \mathbb{E}^{Q}\bigl[(\ell(g,\xi) + \tau)_+\bigr] \le 0. \tag{13}$$

We have

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \inf_{\tau \in \mathbb{R}}\ -\tau\alpha + \mathbb{E}^{Q}[(\ell(g,\xi) + \tau)_+] \le \inf_{\tau \in \mathbb{R}}\ -\tau\alpha + \sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}[(\ell(g,\xi) + \tau)_+] \tag{14}$$

by the max-min inequality. From Lemma A.1 we have

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}[(\ell(g,\xi) + \tau)_+] = \inf_{\lambda \ge 0}\ \lambda\varepsilon + \frac{1}{N}\sum_{i=1}^{N} \sup_{\xi \in \Xi} \bigl((\ell(g,\xi) + \tau)_+ - \lambda\|\xi - \hat{\xi}^{(i)}\|_r\bigr).$$

Recalling that $\ell(g,\xi) = h(\xi_{pT_{\rm ini}+1}^\top g, \dots, \xi_{p(T_{\rm ini}+T_f)}^\top g)$ and using Lemma A.2 gives

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}[(\ell(g,\xi) + \tau)_+] \le \inf_{\lambda \ge 0} \begin{cases} \lambda\varepsilon + \frac{1}{N}\sum_{i=1}^{N} (\ell(g,\hat{\xi}^{(i)}) + \tau)_+ & \text{if } L_{\rm con}\|g\|_{r,*} \le \lambda \\ \infty & \text{otherwise.} \end{cases} \tag{15}$$

Hence, by resolving the inf and resorting to an epigraph formulation, we arrive at

$$\inf_{\tau \in \mathbb{R}}\ -\tau\alpha + \sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}[(\ell(g,\xi) + \tau)_+] \le \begin{array}{cl} \inf_{\tau \in \mathbb{R},\, s_i \in \mathbb{R}} & -\tau\alpha + \varepsilon L_{\rm con}\|g\|_{r,*} + \frac{1}{N}\sum_{i=1}^{N} s_i \\ \text{s.t.} & \tau + \ell(g,\hat{\xi}^{(i)}) \le s_i, \quad s_i \ge 0 \quad \forall i \le N. \end{array}$$

Substituting the definition of $\ell$ and recalling from (13) that we wish for the right-hand side of the above inequality to be nonpositive gives

$$\left\{ g \,\middle|\, \begin{array}{l} \inf_{\tau, s_i}\ -\tau\alpha + L_{\rm con}\varepsilon\|g\|_{r,*} + \frac{1}{N}\sum_{i=1}^{N} s_i \le 0 \\ \tau + h(Y_f^{(i)} g) \le s_i \\ s_i \ge 0 \quad \forall i \le N \end{array} \right\} \subseteq \mathcal{G}_{\rm CVaR}.$$

We can remove the infimum from the above formulation by noting that there exist variables $\tau, s_i$ satisfying the constraints above if and only if the above infimum constraint holds. The "only if" part is obvious. The "if" part can be split into two cases: the infimum is achieved, or the infimum is not achieved. If the infimum is achieved, then the optimizer of the infimum satisfies the above. If the infimum is not achieved (i.e., the infimum is $-\infty$), then the first constraint is trivially satisfied, and one can find variables $\tau, s_i$ satisfying the remaining constraints. Note that (14) is an equality if $g \mapsto \ell(g,\xi)$ is convex for every $\xi$ and $\xi \mapsto \ell(g,\xi)$ is bounded on $\Xi$ for every $g$ (see [54, Theorem 2.2]). Furthermore, (15) is an equality if $\Xi = \mathbb{R}^{p(T_{\rm ini}+T_f)\left\lfloor \frac{T}{T_{\rm ini}+T_f} \right\rfloor}$. Since $h$ is convex, this proves the claim.

We present an alternative result for the special case of piecewise affine constraints in which it is possible to obtain a tight set reformulation.


Lemma A.5 (Piecewise Affine Constraint Reformulation): Let $\Xi = \{\xi \mid F\xi \le d\}$ for some $F$ and $d$ of appropriate dimensions. Let $\ell(g,\xi) := h(\xi_{pT_{\rm ini}+1}^\top g, \dots, \xi_{p(T_{\rm ini}+T_f)}^\top g)$. Assume $\ell$ is piecewise affine in $\xi$, i.e., $\ell(g,\xi) = \max_{k \le K} \langle M_k g, \xi\rangle + b_k$ for matrices $M_k$, scalars $b_k \in \mathbb{R}$, and $K \in \mathbb{Z}_{>0}$. Then

$$\left\{ g \,\middle|\, \begin{array}{l} \exists\, \tau, \lambda, s_i, \gamma_{ik} \text{ such that} \\ -\tau\alpha + \lambda\varepsilon + \frac{1}{N}\sum_{i=1}^{N} s_i \le 0 \\ b_k + \tau + \langle M_k g, \hat{\xi}^{(i)}\rangle + \langle \gamma_{ik}, d - F\hat{\xi}^{(i)}\rangle \le s_i \\ \|F^\top \gamma_{ik} - M_k g\|_{r,*} \le \lambda \\ s_i \ge 0 \quad \forall i \le N,\ \forall k \le K \end{array} \right\} \subseteq \mathcal{G}_{\rm CVaR}.$$

The sets coincide when $\ell(g,\xi)$ is bounded on $\Xi$ for all $g$.

Proof: Recall that the support set of the random variable $\xi$ is defined as $\Xi = \{\xi \mid F\xi \le d\}$. Substituting $\ell(g,\xi) = \max_{k \le K} \langle M_k g, \xi\rangle + b_k$ and applying [5, Corollary 5.1] we have that

$$\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}[(\ell(g,\xi) + \tau)_+] = \begin{array}{cl} \inf_{\lambda, s_i, \gamma_{ik}} & \lambda\varepsilon + \frac{1}{N}\sum_{i=1}^{N} s_i \\ \text{s.t.} & b_k + \tau + \langle M_k g, \hat{\xi}^{(i)}\rangle + \langle \gamma_{ik}, d - F\hat{\xi}^{(i)}\rangle \le s_i \\ & \|F^\top \gamma_{ik} - M_k g\|_{r,*} \le \lambda \\ & s_i \ge 0 \quad \forall i \le N,\ \forall k \le K. \end{array}$$

Invoking (13) and (14) and substituting the above yields the following set inclusion:

$$\left\{ g \,\middle|\, \begin{array}{cl} \inf_{\tau, \lambda, s_i, \gamma_{ik}} & -\tau\alpha + \lambda\varepsilon + \frac{1}{N}\sum_{i=1}^{N} s_i \le 0 \\ \text{s.t.} & b_k + \tau + \langle M_k g, \hat{\xi}^{(i)}\rangle + \langle \gamma_{ik}, d - F\hat{\xi}^{(i)}\rangle \le s_i \\ & \|F^\top \gamma_{ik} - M_k g\|_{r,*} \le \lambda \\ & s_i \ge 0 \quad \forall i \le N,\ \forall k \le K \end{array} \right\} \subseteq \mathcal{G}_{\rm CVaR}.$$

Similarly as in the proof of Lemma A.4, we can replace the infimum in the above formulation with the existence of variables $\tau, \lambda, s_i, \gamma_{ik}$ satisfying the constraints. Hence,

$$\left\{ g \,\middle|\, \begin{array}{l} \exists\, \tau, \lambda, s_i, \gamma_{ik} \text{ such that} \\ -\tau\alpha + \lambda\varepsilon + \frac{1}{N}\sum_{i=1}^{N} s_i \le 0 \\ b_k + \tau + \langle M_k g, \hat{\xi}^{(i)}\rangle + \langle \gamma_{ik}, d - F\hat{\xi}^{(i)}\rangle \le s_i \\ \|F^\top \gamma_{ik} - M_k g\|_{r,*} \le \lambda \\ s_i \ge 0 \quad \forall i \le N,\ \forall k \le K \end{array} \right\} \subseteq \mathcal{G}_{\rm CVaR}.$$

If $g \mapsto \ell(g,\xi)$ is convex for every $\xi$ and $\xi \mapsto \ell(g,\xi)$ is bounded on $\Xi$ for every $g$, then (14) reduces to an equality (see [54, Theorem 2.2]), which proves the claim.

Proof of Theorem 4.1: Applying Lemma A.3 to the objective function and Lemma A.4 to the constraint of optimization problem (6) gives the result.

Proof of Theorem 4.3: Choosing $\varepsilon(\beta, N)$ as in (8) ensures that $P^{N}(P \in \mathcal{B}_\varepsilon(\hat{P}_N)) \ge 1 - \beta$. Hence, $J(g)$ is upper bounded by $\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathbb{E}^{Q}\bigl[f_1(U_f g) + f_2(Y_f g) + f_3(Y_p g - y_{\rm ini})\bigr]$ with confidence $1 - \beta$. Furthermore, by Theorem 4.1, $J(g)$ is upper bounded by $\hat{J}(g)$ with confidence $1 - \beta$ for all $g$. Choosing $g = g^\star$ proves the first statement. From Lemma A.4, the constraint set of (7) is a subset of the set of vectors $g$ satisfying $\sup_{Q \in \mathcal{B}_\varepsilon(\hat{P}_N)} \mathrm{CVaR}_{1-\alpha}^{Q}(h(Y_f g)) \le 0$. Invoking the fact that $P^{N}(P \in \mathcal{B}_\varepsilon(\hat{P}_N)) \ge 1 - \beta$ proves the second statement.
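The certificate $\hat{J}(g)$ used in the proof above is cheap to evaluate directly from data. The following toy sketch evaluates it for one fixed $g$; all matrices and the choices $f_1 = 0$, $f_2 = f_3 = \|\cdot\|_1$, $r = 1$ (so $\|\cdot\|_{r,*}$ is the $\infty$-norm and $L_{\rm obj} = 1$) are invented for illustration:

```python
import numpy as np

# Toy data: N sample matrices Y_f^(i), Y_p^(i), an initial output y_ini,
# and a fixed decision vector g.  All values are synthetic.
rng = np.random.default_rng(0)
N, rows, dim = 20, 5, 8
Yf = rng.normal(size=(N, rows, dim))
Yp = rng.normal(size=(N, rows, dim))
y_ini = rng.normal(size=rows)
g = rng.normal(size=dim)
eps, L_obj = 0.3, 1.0

# Empirical part of the objective: (1/N) sum_i f2(Yf_i g) + f3(Yp_i g - y_ini).
empirical = np.mean([np.abs(Yf[i] @ g).sum() + np.abs(Yp[i] @ g - y_ini).sum()
                     for i in range(N)])
# J_hat(g) = empirical cost + L_obj * eps * ||g||_{r,*}  (f1 = 0 here).
J_hat = empirical + L_obj * eps * np.abs(g).max()
print(empirical, J_hat)
```

The gap between $\hat{J}(g)$ and the empirical cost is exactly the regularization term $L_{\rm obj}\,\varepsilon\,\|g\|_{r,*}$, which is what buys the out-of-sample guarantee.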

REFERENCES

[1] J. Coulson, J. Lygeros, and F. Dörfler, "Data-enabled predictive control: In the shallows of the DeePC," in 2019 18th European Control Conference (ECC), 2019, pp. 307–312.

[2] ——, "Regularized and distributionally robust data-enabled predictive control," in 2019 IEEE Conference on Decision and Control (CDC), 2019.

[3] I. Markovsky and P. Rapisarda, "Data-driven simulation and control," International Journal of Control, vol. 81, no. 12, pp. 1946–1959, 2008.

[4] J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. De Moor, "A note on persistency of excitation," Systems & Control Letters, vol. 54, no. 4, pp. 325–329, 2005.

[5] P. Mohajerin Esfahani and D. Kuhn, "Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations," Mathematical Programming, vol. 171, no. 1-2, pp. 115–166, 2018.

[6] H. Hjalmarsson, M. Gevers, S. Gunnarsson, and O. Lequin, "Iterative feedback tuning: Theory and applications," IEEE Control Systems Magazine, vol. 18, no. 4, pp. 26–41, 1998.

[7] M. C. Campi, A. Lecchini, and S. M. Savaresi, "Virtual reference feedback tuning: A direct method for the design of feedback controllers," Automatica, vol. 38, no. 8, pp. 1337–1346, 2002.

[8] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.

[9] P. Van Overschee and B. De Moor, Subspace Identification for Linear Systems: Theory–Implementation–Applications. Springer Science & Business Media, 2012.

[10] J. Berberich, J. Köhler, M. A. Müller, and F. Allgöwer, "Data-driven model predictive control with stability and robustness guarantees," IEEE Transactions on Automatic Control, 2020.

[11] L. Huang, J. Coulson, J. Lygeros, and F. Dörfler, "Data-enabled predictive control for grid-connected power converters," in 2019 IEEE Conference on Decision and Control (CDC), 2019.

[12] L. Huang, J. Coulson, J. Lygeros, and F. Dörfler, "Data-driven wide-area control," arXiv preprint arXiv:1911.12151, 2019.

[13] E. Elokda, J. Coulson, P. Beuchat, J. Lygeros, and F. Dörfler, "Data-enabled predictive control for quadcopters," ETH Zürich, Research Collection: 10.3929/ethz-b-000415427, 2019.

[14] C. De Persis and P. Tesi, "Formulas for data-driven control: Stabilization, optimality and robustness," IEEE Transactions on Automatic Control, 2019.

[15] J. Berberich, A. Romer, C. W. Scherer, and F. Allgöwer, "Robust data-driven state-feedback design," arXiv preprint arXiv:1909.04314, 2019.

[16] J. Berberich, J. Köhler, M. A. Müller, and F. Allgöwer, "Robust constraint satisfaction in data-driven MPC," arXiv preprint arXiv:2003.06808, 2020.

[17] A. Koch, J. Berberich, J. Köhler, and F. Allgöwer, "Determining optimal input-output properties: A data-driven approach," arXiv preprint arXiv:2002.03882, 2020.

[18] H. J. van Waarde, C. De Persis, M. K. Camlibel, and P. Tesi, "Willems' fundamental lemma for state-space systems and its extension to multiple datasets," IEEE Control Systems Letters, vol. 4, no. 3, pp. 602–607, 2020.

[19] H. J. van Waarde, J. Eising, H. L. Trentelman, and M. K. Camlibel, "Data informativity: A new perspective on data-driven analysis and control," IEEE Transactions on Automatic Control, 2020.

[20] C. De Persis and P. Tesi, "Low-complexity learning of linear quadratic regulators from noisy data," arXiv preprint arXiv:2005.01082, 2020.

[21] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, "Learning-based model predictive control: Toward safe learning in control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, 2019.

[22] U. Rosolia and F. Borrelli, "Learning model predictive control for iterative tasks. A data-driven control framework," IEEE Transactions on Automatic Control, vol. 63, no. 7, pp. 1883–1896, 2018.

[23] E. Kaiser, J. N. Kutz, and S. L. Brunton, "Sparse identification of nonlinear dynamics for model predictive control in the low-data limit," Proceedings of the Royal Society A, vol. 474, no. 2219, p. 20180335, 2018.

[24] J. R. Salvador, D. Ramirez, T. Alamo, D. M. de la Peña, and G. García-Marín, "Data driven control: An offset free approach," in 2019 18th European Control Conference (ECC). IEEE, 2019, pp. 23–28.

[25] R. Krug and D. Dimitrov, "Model predictive motion control based on generalized dynamical movement primitives," Journal of Intelligent & Robotic Systems, vol. 77, no. 1, pp. 17–35, 2015.

[26] M. C. Campi and E. Weyer, "Finite sample properties of system identification methods," IEEE Transactions on Automatic Control, vol. 47, no. 8, pp. 1329–1334, 2002.

[27] M. Vidyasagar and R. L. Karandikar, "A learning theory approach to system identification and stochastic adaptive control," in Probabilistic and Randomized Methods for Design under Uncertainty. Springer, 2006, pp. 265–302.

[28] S. Tu, R. Boczar, A. Packard, and B. Recht, "Non-asymptotic analysis of robust control from coarse-grained identification," arXiv preprint arXiv:1707.04791, 2017.

[29] T. Sarkar, A. Rakhlin, and M. A. Dahleh, "Finite-time system identification for partially observed LTI systems of unknown order," arXiv preprint arXiv:1902.01848, 2019.

[30] R. Boczar, N. Matni, and B. Recht, "Finite-data performance guarantees for the output-feedback control of an unknown system," arXiv preprint arXiv:1803.09186, 2018.

[31] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, "On the sample complexity of the linear quadratic regulator," Foundations of Computational Mathematics, pp. 1–47, 2019.

[32] H. Hjalmarsson, "From experiment design to closed-loop control," Automatica, vol. 41, no. 3, pp. 393–438, 2005.

[33] M. Gevers, "Identification for control: From the early achievements to the revival of experiment design," European Journal of Control, vol. 11, pp. 1–18, 2005.

[34] B. A. Ogunnaike, "A contemporary industrial perspective on process control theory and practice," Annual Reviews in Control, vol. 20, pp. 1–8, 1996.

[35] J. Richalet, "Industrial applications of model based predictive control," Automatica, vol. 29, no. 5, pp. 1251–1274, 1993.

[36] W. Favoreel, B. De Moor, and M. Gevers, "SPC: Subspace predictive control," IFAC Proceedings Volumes, vol. 32, no. 2, pp. 4004–4009, 1999.

[37] B. Huang and R. Kadali, Dynamic Modeling, Predictive Control and Performance Monitoring: A Data-driven Subspace Approach. Springer, 2008.

[38] F. Dörfler, J. Coulson, and I. Markovsky, "Bridging direct & indirect data-driven control formulations via regularizations and relaxations," arXiv preprint arXiv:2101.01273, 2021.

[39] A. Mesbah, "Stochastic model predictive control: An overview and perspectives for future research," IEEE Control Systems Magazine, vol. 36, no. 6, pp. 30–44, 2016.

[40] B. P. Van Parys, D. Kuhn, P. J. Goulart, and M. Morari, "Distributionally robust control of constrained stochastic systems," IEEE Transactions on Automatic Control, vol. 61, no. 2, pp. 430–442, 2015.

[41] I. Markovsky, J. C. Willems, S. Van Huffel, and B. De Moor, Exact and Approximate Modeling of Linear Systems: A Behavioral Approach. SIAM, 2006.

[42] J. C. Willems, "From time series to linear system – Part I. Finite dimensional linear time invariant systems," Automatica, vol. 22, no. 5, pp. 561–580, 1986.

[43] A. Damen, P. Van den Hof, and A. Hajdasinski, "Approximate realization based upon an alternative to the Hankel matrix: the Page matrix," Systems & Control Letters, vol. 2, no. 4, pp. 202–208, 1982.

[44] A. Agarwal, M. J. Amjad, D. Shah, and D. Shen, "Model agnostic time series analysis via matrix estimation," Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 2, no. 3, pp. 1–39, 2018.

[45] F. Borrelli, A. Bemporad, and M. Morari, Predictive Control for Linear and Hybrid Systems. Cambridge University Press, 2017.

[46] A. Nemirovski and A. Shapiro, "Convex approximations of chance constrained programs," SIAM Journal on Optimization, vol. 17, no. 4, pp. 969–996, 2006.

[47] S. Shafieezadeh-Abadeh, D. Kuhn, and P. Mohajerin Esfahani, "Regularization via mass transportation," Journal of Machine Learning Research, vol. 20, no. 103, pp. 1–68, 2019.

[48] N. Fournier and A. Guillin, "On the rate of convergence in Wasserstein distance of the empirical measure," Probability Theory and Related Fields, vol. 162, no. 3-4, pp. 707–738, 2015.

[49] D. Kuhn, P. Mohajerin Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh, "Wasserstein distributionally robust optimization: Theory and applications in machine learning," in Operations Research & Management Science in the Age of Analytics. INFORMS, 2019, pp. 130–166.

[50] J. Blanchet, Y. Kang, and K. Murthy, "Robust Wasserstein profile inference and applications to machine learning," Journal of Applied Probability, vol. 56, no. 3, pp. 830–857, 2019.

[51] MOSEK ApS, The MOSEK Optimization Toolbox for MATLAB Manual. Version 9.0., 2019. [Online]. Available: http://docs.mosek.com/9.0/toolbox/index.html

[52] L. Huang, J. Coulson, J. Lygeros, and F. Dörfler, "Decentralized data-enabled predictive control for power system oscillation damping," arXiv preprint arXiv:1911.12151, 2019.

[53] D. P. Bertsekas, Convex Optimization Theory. Athena Scientific, 2009.

[54] A. R. Hota, A. Cherukuri, and J. Lygeros, "Data-driven chance constrained optimization under Wasserstein ambiguity sets," arXiv preprint arXiv:1805.06729, 2018.

Jeremy Coulson is a PhD student with the Automatic Control Laboratory at ETH Zürich. He received his Master of Applied Science in Mathematics & Engineering from Queen's University, Canada, in August 2017. He received his B.Sc.Eng. degree in Mechanical Engineering & Applied Mathematics from Queen's University in 2015. His research interests include data-driven control methods, stochastic optimization, and control of partial differential equations.

John Lygeros completed a B.Eng. degree in electrical engineering in 1990 and an M.Sc. degree in Systems Control in 1991, both at Imperial College of Science, Technology and Medicine, London, U.K. In 1996 he obtained a Ph.D. degree from the Electrical Engineering and Computer Sciences Department, University of California, Berkeley. During the period 1996–2000 he held a series of postdoctoral researcher appointments at the Laboratory for Computer Science, M.I.T., and the Electrical Engineering and Computer Sciences Department at U.C. Berkeley. Between 2000 and 2003 he was a University Lecturer at the Department of Engineering, University of Cambridge, U.K., and a Fellow of Churchill College. Between 2003 and 2006 he was an Assistant Professor at the Department of Electrical and Computer Engineering, University of Patras, Greece. In July 2006 he joined the Automatic Control Laboratory at ETH Zürich, where he is currently serving as the Head of the Automatic Control Laboratory. His research interests include modelling, analysis, and control of hierarchical, hybrid, and stochastic systems, with applications to biochemical networks, automated highway systems, air traffic management, power grids, and camera networks. John Lygeros is a Fellow of the IEEE, and a member of the IET and the Technical Chamber of Greece; since 2013 he has served as the Treasurer of the International Federation of Automatic Control.


Florian Dörfler is an Associate Professor at the Automatic Control Laboratory at ETH Zürich. He received his Ph.D. degree in Mechanical Engineering from the University of California at Santa Barbara in 2013, and a Diplom degree in Engineering Cybernetics from the University of Stuttgart in 2008. From 2013 to 2014 he was an Assistant Professor at the University of California, Los Angeles. His primary research interests are centered around control, optimization, and system theory with applications in network systems such as electric power grids, robotic coordination, and social networks. He is a recipient of the distinguished young researcher awards by IFAC (Manfred Thoma Medal 2020) and EUCA (European Control Award 2020). His students were winners or finalists for Best Student Paper awards at the European Control Conference (2013, 2019), the American Control Conference (2016), and the PES PowerTech Conference (2017). He is furthermore a recipient of the 2010 ACC Student Best Paper Award, the 2011 O. Hugo Schuck Best Paper Award, the 2012–2014 Automatica Best Paper Award, the 2016 IEEE Circuits and Systems Guillemin-Cauer Best Paper Award, and the 2015 UCSB ME Best PhD Award.