
A deep generative model for probabilistic energy forecasting in power systems: normalizing flows

Jonathan Dumas a,*, Antoine Wehenkel a, Damien Lanaspeze b, Bertrand Cornélusse a, Antonio Sutera a

a Liège University, Departments of Computer Science and Electrical Engineering, Belgium
b Mines ParisTech, France

Abstract

Greater direct electrification of end-use sectors with a higher share of renewables is one of the pillars to power a carbon-neutral society by 2050. However, in contrast to conventional power plants, renewable energy is subject to uncertainty, raising challenges for its interaction with power systems. Scenario-based probabilistic forecasting models have become a vital tool to equip decision-makers. This paper presents to power systems forecasting practitioners a recent deep learning technique, normalizing flows, to produce accurate scenario-based probabilistic forecasts that are crucial to face the new challenges in power systems applications. The strength of this technique is to directly learn the stochastic multivariate distribution of the underlying process by maximizing the likelihood. Through comprehensive empirical evaluations using the open data of the Global Energy Forecasting Competition 2014, we demonstrate that this methodology is competitive with other state-of-the-art deep learning generative models: generative adversarial networks and variational autoencoders. The models producing weather-based wind, solar power, and load scenarios are properly compared in terms of forecast value, by considering the case study of an energy retailer, and quality, using several complementary metrics. The numerical experiments are simple and easily reproducible. Thus, we hope they will encourage other forecasting practitioners to test and use normalizing flows in power system applications such as bidding on electricity markets, scheduling power systems with a high penetration of renewable energy sources, energy management of virtual power plants or microgrids, and unit commitment.

Keywords: Deep learning, Normalizing flows, Energy forecasting, Time series, Generative adversarial networks, Variational autoencoders

1. Introduction

To limit climate change and achieve the ambitious targets prescribed by the Intergovernmental Panel on Climate Change [1], the transition towards a carbon-free society goes through an inevitable increase of the share of renewable generation in the energy mix. However, in contrast to conventional power plants, renewable energy is subject to uncertainty. Therefore, the operational predictability of modern power systems has become challenging as the total installed capacity of renewable energy sources (RES) increases and new distributed energy resources are introduced into the existing power networks. To address this challenge, point forecasts are widely used in the industry as inputs to decision-making tools. However, they are inherently uncertain and, in the context of decision-making, a point forecast plus an uncertainty interval is of genuine added value.

* Corresponding author. Email address: [email protected] (Jonathan Dumas)

In this context, probabilistic forecasts [2], which aim at modeling the distribution of all possible future realizations, have become an important tool [3] to equip decision-makers [4], hopefully leading to better decisions in energy applications [5].

The various types of probabilistic forecasts range from quantile forecasts to density forecasts, through prediction intervals and scenarios [4]. This paper focuses on scenario generation, a popular probabilistic forecasting method to capture the uncertainty of load, photovoltaic (PV) generation, and wind generation. It consists of producing sequences of possible load or power generation realizations for one or more locations.

Forecasting methodologies can typically be classified into two groups: statistical and machine learning models. On the one hand, statistical approaches are more interpretable than machine learning techniques, which are sometimes referred to as black-box models.


On the other hand, machine learning techniques are generally more robust, user-friendly, and successful in addressing the non-linearity in the data than statistical techniques. We provide in the following a few examples of statistical approaches. More references can be found in Khoshrou and Pauwels [6] and Mashlakov et al. [7].

Multiple linear regression models [8] and autoregressive integrated moving average models [9] are among the most fundamental and widely used models. The latter generates spatiotemporal scenarios with given power generation profiles at each renewable generation site [10]. These models mostly learn a relationship between several explanatory variables and a dependent target variable. However, they require some expert knowledge to formulate the relevant interactions between the different variables. Therefore, the performance of such models is only satisfactory if the dependent variables are well formulated based on the explanatory variables. Another class of statistical approaches consists of using simple parametric distributions, e.g., the Weibull distribution for wind speed [11] or the beta distribution for solar irradiance [12], to model the density associated with the generative process. In this line, the (Gaussian) copula method has been widely used to model the spatial and temporal characteristics of wind [13] and PV generation [14]. For instance, the problem of generating probabilistic forecasts for the aggregated power of a set of renewable power plants harvesting different energy sources is addressed by Camal et al. [15].

Overall, these approaches usually make statistical assumptions that increase the difficulty of modeling the underlying stochastic process. The generated scenarios approximate the future uncertainty but cannot correctly describe all the salient features of the power output from renewable energy sources. Deep learning is one of the newest trends in artificial intelligence and machine learning to tackle the limitations of statistical methods, with promising results across various application domains.

1.1. Related work

Recurrent neural networks (RNNs) are among the most famous deep learning techniques adopted in energy forecasting applications. A novel pooling-based deep recurrent neural network is proposed by Shi et al. [16] in the field of short-term household load forecasting. It outperforms statistical approaches such as autoregressive integrated moving average and classical RNNs. A tailored forecasting tool, named encoder-decoder, is implemented in Dumas et al. [17] to compute intraday multi-output PV quantile forecasts. Guidelines and best practices are developed by Hewamalage et al. [18] for forecasting practitioners through an extensive empirical study with an open-source software framework of existing RNN architectures. In this continuity, Toubeau et al. [19] implemented a bidirectional long short-term memory (BLSTM) architecture. It is trained using quantile regression and combined with a copula-based approach to generate scenarios. A scenario-based stochastic optimization case study compares this approach to other models regarding forecast quality and value. Finally, Salinas et al. [20] trained an autoregressive recurrent neural network on several real-world datasets. It produces accurate probabilistic forecasts with little or no hyper-parameter tuning.

Deep generative modeling is a class of techniques that trains deep neural networks to model the distribution of the observations. In recent years, there has been a growing interest in this field made possible by the appearance of large open-access datasets and breakthroughs in both general deep learning architectures and generative models. Several approaches exist, such as energy-based models, variational autoencoders, generative adversarial networks, autoregressive models, normalizing flows, and numerous hybrid strategies. They all make trade-offs in terms of computation time, diversity, and architectural restrictions. We recommend two papers to get a broader knowledge of this field: (1) the comprehensive overview of generative modeling trends conducted by Bond-Taylor et al. [21], which presents generative models to forecasting practitioners under a single cohesive statistical framework; (2) the thorough comparison of normalizing flows, variational autoencoders, and generative adversarial networks provided by Ruthotto and Haber [22], which describes the advantages and disadvantages of each approach using numerical experiments in the field of computer vision. In the following, we focus on the applications of generative models in power systems.

In contrast to statistical approaches, deep generative models such as Variational AutoEncoders (VAEs) [23] and Generative Adversarial Networks (GANs) [24] directly learn a generative process of the data. They have demonstrated their effectiveness in many applications to compute accurate probabilistic forecasts, including power system applications. They both make probabilistic forecasts in the form of Monte Carlo samples that can be used to compute consistent quantile estimates for all sub-ranges in the prediction horizon. Thus, they cannot suffer from the issue raised by Ordiano et al. [25] on the non-differentiable quantile loss function. Note that generative models such as GANs and VAEs generate scenarios of the variable of interest directly,


in contrast with methods that first compute weather scenarios to generate probabilistic forecasts, such as those implemented by Sun et al. [26] and Khoshrou and Pauwels [6]. A VAE composed of a succession of convolutional and feed-forward layers is proposed by Zhanga et al. [27] to capture the spatial-temporal complementary and fluctuating characteristics of wind and PV power with high model accuracy and low computational complexity. Both single and multi-output PV forecasts using a VAE are compared by Dairi et al. [28] to several deep learning methods such as LSTMs, BLSTMs, convolutional LSTM networks, and stacked autoencoders, where the VAE consistently outperformed the other methods. A GAN is used by Chen et al. [29] to produce a set of wind power scenarios that represent possible future behaviors based only on historical observations and point forecasts. This method has a better performance compared to the Gaussian copula. A Bayesian GAN is introduced by Chen et al. [30] to generate wind and solar scenarios, and a progressive growing of GANs is designed by Yuan et al. [31] to propose a novel scenario forecasting method. In a different application, a GAN is implemented for building occupancy modeling without any prior assumptions [32]. Finally, a conditional version of the GAN using several labels representing some characteristics of the demand is introduced by Lan et al. [33] to output power load data considering demand response programs.

Improved versions of GANs and VAEs have also been studied in the context of energy forecasting. The Wasserstein GAN enforces the Lipschitz continuity through a gradient penalty term (WGAN-GP), as the original GANs are challenging to train and suffer from mode collapse and over-fitting. Several studies applied this improved version in power systems: (1) a method using unsupervised labeling and a conditional WGAN-GP models the uncertainties and variations in wind power [34]; (2) a WGAN-GP models both the uncertainties and the variations of the load [35]; (3) Jiang et al. [36] implemented scenario generation tasks both for a single site and for multiple correlated sites without any changes to the model structure. Concerning VAEs, they suffer from inherent shortcomings, such as the difficulty of tuning the hyper-parameters or of generalizing a specific generative model structure to other databases. An improved VAE is proposed by Qi et al. [37] with the implementation of a β hyper-parameter in the VAE objective function to balance the two parts of the loss. This improved VAE is used to generate PV and wind power scenarios from historical values.

However, most of these studies did not benefit from conditional information such as weather forecasts to generate improved PV, wind power, and load scenarios. In addition, to the best of our knowledge, only Ge et al. [38] compared NFs to these techniques for the generation of daily load profiles. Nevertheless, the comparison only considers quality metrics, and the models do not incorporate weather forecasts.

1.2. Research gaps and scientific contributions

This study investigates the implementation of Normalizing Flows [39, NFs] in power system applications. NFs define a new class of probabilistic generative models that have gained increasing interest from the deep learning community in recent years. A NF learns a sequence of transformations, a flow, from a density known analytically, e.g., a Normal distribution, to a complex target distribution. In contrast to other deep generative models, NFs can directly be trained by maximum likelihood estimation. They have proven to be an effective way to model complex data distributions with neural networks in many domains: first, speech synthesis [40]; second, fundamental physics, to increase the speed of gravitational wave inference by several orders of magnitude [41] or to sample Boltzmann distributions of lattice field theories [42]; finally, the capacity firming framework of Dumas et al. [43].

The present work goes several steps further than Ge et al. [38], which demonstrated the competitiveness of NFs with respect to GANs and VAEs for generating daily load profiles. First, we study the conditional version of these models to demonstrate that they can handle additional contextual information such as weather forecasts or geographical locations. Second, we extensively compare the models' performances both in terms of forecast value and quality. The forecast quality corresponds to the ability of the forecasts to genuinely inform of future events by mimicking the characteristics of the processes involved. The forecast value relates to the benefits of using forecasts in decision-making, such as participation in the electricity market. Third, we consider PV and wind generation in addition to load profiles. Finally, in contrast to the affine NFs used in their work, we rely on monotonic transformations, which are universal density approximators [44].

Given that Normalizing Flows are rarely used in the power systems community despite their potential, our main aim is to present this recent deep learning technique and demonstrate its interest and competitiveness with state-of-the-art generative models such as GANs and VAEs on a simple and easily reproducible case study.


The research gaps motivating this paper are three-fold:

1. To the best of our knowledge, only Ge et al. [38] compared NFs to GANs and VAEs for the generation of daily load profiles. Nevertheless, the comparison is only based on quality metrics, and the models do not take into account weather forecasts;

2. Most of the studies that propose or compare forecasting techniques only consider the forecast quality, such as Ge et al. [38], Sun et al. [26], and Mashlakov et al. [7];

3. The conditional versions of the models are not always addressed, such as in Ge et al. [38]. However, weather forecasts are essential for computing accurate probabilistic forecasts.

With these research gaps in mind, the main contributions of this paper are three-fold:

1. We provide a fair comparison, both in terms of quality and value, with the state-of-the-art deep learning generative models, GANs and VAEs, using the open data of the Global Energy Forecasting Competition 2014 (GEFcom 2014) [45]. To the best of our knowledge, this is the first study that extensively compares NFs, GANs, and VAEs on several datasets, PV generation, wind generation, and load, with a proper assessment of the quality and value based on complementary metrics, and an easily reproducible case study;

2. We implement conditional generative models to compute improved weather-based PV, wind power, and load scenarios, in contrast to most of the previous studies, which focused mainly on past observations;

3. Overall, we demonstrate that NFs are more accurate in quality and value, providing further evidence for deep learning practitioners to implement this approach in more advanced power system applications.

In addition to these contributions, this study also provides open access to the Python code1 to help the community to reproduce the experiments. Figure 1 provides the framework of the proposed method, and Table 1 presents a comparison of the present study to three state-of-the-art papers using deep learning generative models to generate scenarios.

1 https://github.com/jonathandumas/generative-models

Criteria              [35]  [37]  [38]  This study
GAN                   X     ×     X     X
VAE                   ×     X     X     X
NF                    ×     ×     X     X
Number of models      4     1     3     3
PV                    ×     X     ×     X
Wind power            ×     X     ×     X
Load                  X     ∼     X     X
Weather-based         X     ×     ×     X
Quality assessment    X     X     X     X
Quality metrics       5     3     5     8
Value assessment      ×     X     ×     X
Open dataset          ∼     ×     X     X
Value replicability   -     ∼     -     X
Open-access code      ×     ×     ×     X

Table 1: Comparison of the paper's contributions to three state-of-the-art studies using deep generative models. X: criterion fully satisfied, ∼: criterion partially satisfied, ×: criterion not satisfied, ?: no information, -: not applicable. GAN: a GAN model is implemented; VAE: a VAE model is implemented; NF: a NF model is implemented; PV: PV scenarios are generated; Wind power: wind power scenarios are generated; Load: load scenarios are generated; Weather-based: the model generates weather-based scenarios; Quality assessment: a quality evaluation is conducted; Quality metrics: number of quality metrics considered; Value assessment: a value evaluation is considered with a case study; Open dataset: the data used for the quality and value evaluations are in open access; Value replicability: the case study considered for the value evaluation is easily reproducible; Open-access code: the code used to conduct the experiments is in open access. Note: the justifications are provided in Appendix A.1.


Figure 1: The framework of the paper. The paper's primary purpose is to present and demonstrate the potential of NFs in power systems. A fair comparison is conducted both in terms of quality and value with the state-of-the-art deep learning generative models, GANs and VAEs, using the open data of the Global Energy Forecasting Competition 2014 [45]. The PV, wind power, and load datasets are used to assess the models. The quality evaluation is conducted by using eight complementary metrics, and the value assessment by considering the day-ahead bidding of an energy retailer using stochastic optimization. Overall, NFs tend to be more accurate both in terms of quality and value and are competitive with GANs and VAEs.

1.3. Applicability of the generative models

Probabilistic forecasting of PV generation, wind generation, electrical consumption, and electricity prices plays a vital role in renewable integration and power system operations. The deep learning generative models presented in this paper can be integrated into practical engineering applications. We present a non-exhaustive list of five applications in the following. (1) The forecasting module of an energy management system (EMS) [46]. Indeed, EMSs are used by several energy market players to operate various power systems such as a single renewable plant, or a grid-connected or off-grid microgrid composed of several generation, consumption, and storage devices. An EMS is composed of several key modules: monitoring, forecasting, planning, control, etc. The forecasting module aims to provide the most accurate forecast of the variable of interest to be used as input of the planning and control modules. (2) Stochastic unit commitment models that employ scenarios to model the uncertainty of weather-dependent renewables. For instance, the optimal day-ahead scheduling and dispatch of a system composed of renewable plants, generators, and electrical demand are addressed by Camal et al. [15]. (3) Ancillary services market participation. A virtual power plant aggregating wind, PV, and small hydropower plants is studied by Camal et al. [15] to optimally bid, on a day-ahead basis, the energy and automatic frequency restoration reserve. (4) More generally, generative models can be used to compute scenarios for any variable of interest, e.g., energy prices, renewable generation, loads, or the water inflow of hydro reservoirs, as long as data are available. (5) Finally, quantiles can be derived from the scenarios and used in robust optimization models such as in the capacity firming framework [43].
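To make point (5) concrete, the short sketch below derives quantile forecasts from a set of generated scenarios with NumPy; the function name and the probability levels are illustrative and are not taken from the paper's code.

```python
import numpy as np

def scenario_quantiles(scenarios, levels=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Empirical quantiles derived from M day-ahead scenarios.

    scenarios: array of shape (M, T), one row per scenario.
    Returns an array of shape (len(levels), T), one row per probability level.
    """
    return np.quantile(scenarios, levels, axis=0)
```

The resulting quantiles can then feed interval or robust optimization models in place of the raw scenarios.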

1.4. Organization

The remainder of this paper is organized as follows.

Section 2 presents the generative models implemented: NFs, GANs, and VAEs. Section 3 provides the quality and value assessment methodologies. Section 4 details empirical results on the GEFcom 2014 dataset, and Section 5 summarizes the main findings and highlights ideas for further work. Appendix A presents the justifications of Table 1 and Table 5, Appendix B provides additional information on the generative models, Appendix C and Appendix D detail the quality metrics and the energy retailer case study formulation, and Appendix E presents additional quality results.

2. Background

This section formally introduces the conditional versions of the NFs, GANs, and VAEs implemented in this study. We assume the reader is familiar with the basics of neural networks; for further information, Goodfellow et al. [47] and Zhang et al. [48] provide a comprehensive introduction to modern deep learning approaches.

2.1. Multi-output forecast

Let us consider some dataset $\mathcal{D} = \{x^i, c^i\}_{i=1}^N$ of $N$ independent and identically distributed samples from the joint distribution $p(x, c)$ of two continuous variables $X$ and $C$, $X$ being the wind generation, PV generation, or load, and $C$ the weather forecasts. They are both composed of $T$ periods per day, with $x^i := [x_1^i, \ldots, x_T^i]^\intercal \in \mathbb{R}^T$ and $c^i := [c_1^i, \ldots, c_T^i]^\intercal \in \mathbb{R}^T$. The goal of this work is to generate multi-output weather-based scenarios $\hat{x} \in \mathbb{R}^T$ that are distributed under $p(x|c)$.

A generative model is a probabilistic model $p_\theta(\cdot)$, with parameters $\theta$, that can be used as a generator of the data. Its purpose is to generate synthetic but realistic data $\hat{x} \sim p_\theta(x|c)$ whose distribution is as close as possible to the unknown data distribution $p(x|c)$. In our application, it computes on a day-ahead basis a set of $M$ scenarios at day $d-1$ for day $d$

$\hat{x}_d^i := [\hat{x}_{d,1}^i, \ldots, \hat{x}_{d,T}^i]^\intercal \in \mathbb{R}^T, \quad i = 1, \ldots, M. \quad (1)$

For the sake of clarity, we omit the indexes $d$ and $i$ when referring to a scenario $\hat{x}$ in the following.


2.2. Deep generative models

Figure 2 provides a high-level comparison of the three categories of generative models considered in this paper: normalizing flows, generative adversarial networks, and variational autoencoders.

Figure 2: High-level comparison of the three categories of generative models considered in this paper: normalizing flows, generative adversarial networks, and variational autoencoders. All models are conditional as they use the weather forecasts c to generate scenarios x̂ of the distribution of interest x: PV generation, wind power, load. Normalizing flows allow exact likelihood calculation. In contrast to generative adversarial networks and variational autoencoders, they explicitly learn the data distribution and directly access the exact likelihood of the model's parameters. The inverse of the flow is used to generate scenarios. The training of generative adversarial networks relies on a min-max problem where the generator and the discriminator parameters are jointly optimized. The generator is used to compute the scenarios. Variational autoencoders indirectly optimize the log-likelihood of the data by maximizing the variational lower bound. The decoder computes the scenarios. Note: Section 2.3 provides a theoretical comparison of these models.

2.2.1. Normalizing flows

A normalizing flow is defined as a sequence of invertible transformations $f_k : \mathbb{R}^T \to \mathbb{R}^T$, $k = 1, \ldots, K$, composed together to create an expressive invertible mapping $f_\theta := f_1 \circ \cdots \circ f_K : \mathbb{R}^T \to \mathbb{R}^T$. This composed function can be used to perform density estimation, using $f_\theta$ to map a sample $x \in \mathbb{R}^T$ onto a latent vector $z \in \mathbb{R}^T$ equipped with a known and tractable probability density function $p_z$, e.g., a Normal distribution. The transformation $f_\theta$ implicitly defines a density $p_\theta(x)$ that is given by the change of variables

$p_\theta(x) = p_z(f_\theta(x)) \, |\det J_{f_\theta}(x)|, \quad (2)$

where $J_{f_\theta}$ is the Jacobian of $f_\theta$ with respect to $x$. The model is trained by maximizing the log-likelihood $\sum_{i=1}^N \log p_\theta(x^i, c^i)$ of the model's parameters $\theta$ given the dataset $\mathcal{D}$. For simplicity, let us assume a single-step flow $f_\theta$ and drop the index $k$ for the rest of the discussion.

In general, $f_\theta$ can take any form as long as it defines a bijection. However, a common solution to make the Jacobian computation tractable in (2) consists of implementing an autoregressive transformation [49], i.e., such that $f_\theta$ can be rewritten as a vector of scalar bijections $f^i$

$f_\theta(x) := [f^1(x_1; h^1), \ldots, f^T(x_T; h^T)]^\intercal, \quad (3a)$

$h^i := h^i(x_{<i}; \varphi_i), \quad 2 \le i \le T, \quad (3b)$

$x_{<i} := [x_1, \ldots, x_{i-1}]^\intercal, \quad 2 \le i \le T, \quad (3c)$

$h^1 \in \mathbb{R}, \quad (3d)$

where $f^i(\cdot; h^i) : \mathbb{R} \to \mathbb{R}$ is partially parameterized by an autoregressive conditioner $h^i(\cdot; \varphi_i) : \mathbb{R}^{i-1} \to \mathbb{R}^{|h^i|}$ with parameters $\varphi_i$, and $\theta$ is the union of all parameters $\varphi_i$.

There is a large choice of transformers $f^i$: affine, non-affine, integration-based, etc. This work implements an integration-based transformer by using the class of Unconstrained Monotonic Neural Networks (UMNN) [50]. It is a universal density approximator of continuous random variables when combined with autoregressive functions. The UMNN consists of a neural network architecture that enables learning arbitrary monotonic functions. It is achieved by parameterizing the bijection $f^i$ as follows

$f^i(x_i; h^i) = \int_0^{x_i} \tau^i(t, h^i) \, dt + \beta^i(h^i), \quad (4)$

where $\tau^i(\cdot; h^i) : \mathbb{R}^{|h^i|+1} \to \mathbb{R}^+$ is the integrand neural network with a strictly positive scalar output, $h^i \in \mathbb{R}^{|h^i|}$ an embedding made by the conditioner, and $\beta^i(\cdot) : \mathbb{R}^{|h^i|} \to \mathbb{R}$ a neural network with a scalar output. The forward evaluation of $f^i$ requires solving the integral (4) and is efficiently approximated numerically by using the Clenshaw-Curtis quadrature. The pseudo-code of the forward and backward passes is provided by Wehenkel and Louppe [50].

The Masked Autoregressive Flow (MAF) of Papamakarios et al. [51] is implemented to simultaneously parameterize the $T$ autoregressive embeddings $h^i$ of the flow (3). Then, the change of variables formula applied to the UMNN-MAF transformation results in the following log-density when considering the weather forecasts

$\log p_\theta(x, c) = \log \big[ p_z(f_\theta(x, c)) \, |\det J_{f_\theta}(x, c)| \big] \quad (5a)$

$= \log p_z(f_\theta(x, c)) + \sum_{i=1}^T \log \tau^i(x_i, h^i(x_{<i}), c), \quad (5b)$


that can be computed exactly and efficiently with a single forward pass. The UMNN-MAF approach implemented is referred to as NF in the rest of the paper. Figure 3 depicts the process of conditional normalizing flows with a three-step NF for PV generation. Note: Appendix B.1 provides additional information on NFs.
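To illustrate how training by exact maximum likelihood works in practice, the sketch below implements a single-step conditional affine autoregressive flow in PyTorch. It is a deliberately simplified stand-in for the UMNN-MAF used in the paper: the affine transformer is not a universal density approximator, and the class name, layer sizes, and tensor shapes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """One-step conditional affine autoregressive flow (illustrative sketch)."""

    def __init__(self, dim_x, dim_c, hidden=64):
        super().__init__()
        self.dim_x = dim_x
        # One conditioner h^i per dimension, fed with x_{<i} and the context c,
        # producing the affine parameters (mu_i, log_s_i).
        self.conditioners = nn.ModuleList([
            nn.Sequential(nn.Linear(i + dim_c, hidden), nn.ReLU(),
                          nn.Linear(hidden, 2))
            for i in range(dim_x)
        ])

    def forward(self, x, c):
        """Map x to the latent z and return log|det J| of the transformation."""
        z, log_det = [], 0.0
        for i, net in enumerate(self.conditioners):
            mu, log_s = net(torch.cat([x[:, :i], c], dim=1)).unbind(dim=1)
            z.append((x[:, i] - mu) * torch.exp(-log_s))
            log_det = log_det - log_s            # dz_i/dx_i = exp(-log_s)
        return torch.stack(z, dim=1), log_det

    def nll(self, x, c):
        """Negative log-likelihood under a standard Normal base density."""
        z, log_det = self.forward(x, c)
        log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * self.dim_x * math.log(2 * math.pi)
        return -(log_pz + log_det).mean()

    @torch.no_grad()
    def sample(self, c, n):
        """Generate n scenarios by inverting the flow dimension by dimension.
        c: tensor of shape (1, dim_c) holding the day's weather forecasts."""
        z = torch.randn(n, self.dim_x)
        x = torch.zeros(n, self.dim_x)
        for i, net in enumerate(self.conditioners):
            mu, log_s = net(torch.cat([x[:, :i], c.expand(n, -1)], dim=1)).unbind(dim=1)
            x[:, i] = z[:, i] * torch.exp(log_s) + mu
        return x
```

Minimizing nll with any stochastic gradient optimizer trains the flow; the paper's NF replaces the affine transformer by a monotonic integrand network and stacks several steps.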

Figure 3: The process of conditional normalizing flows is illustrated with a three-step NF for PV generation. The model fθ is trained by maximizing the log-likelihood of the model's parameters θ given a dataset composed of PV observations and weather forecasts. Recall that fθ defines a bijection between the variable of interest x, the PV generation, and a Normal distribution z. Then, the PV scenarios x̂ are generated by using the inverse of fθ, which takes as inputs samples from the Normal distribution z and the weather forecasts c.

2.2.2. Variational autoencoders

A VAE is a deep latent variable model composed of an encoder and a decoder which are jointly trained to maximize a lower bound on the likelihood. The encoder $q_\phi(\cdot) : \mathbb{R}^T \times \mathbb{R}^{|c|} \to \mathbb{R}^d$ approximates the intractable posterior $p(z|x, c)$, and the decoder $p_\theta(\cdot) : \mathbb{R}^d \times \mathbb{R}^{|c|} \to \mathbb{R}^T$ the likelihood $p(x|z, c)$, with $z \in \mathbb{R}^d$. Maximum likelihood estimation is intractable as it would require marginalizing with respect to all possible realizations of the latent variables $z$. Kingma and Welling [23] addressed this issue by maximizing the variational lower bound $\mathcal{L}_{\theta,\phi}(x, c)$ as follows

$\log p_\theta(x|c) = KL\big[q_\phi(z|x, c) \,\|\, p(z|x, c)\big] + \mathcal{L}_{\theta,\phi}(x, c) \quad (6a)$

$\geq \mathcal{L}_{\theta,\phi}(x, c), \quad (6b)$

$\mathcal{L}_{\theta,\phi}(x, c) := \mathbb{E}_{q_\phi(z|x,c)}\left[\log \frac{p(z)\, p_\theta(x|z, c)}{q_\phi(z|x, c)}\right], \quad (6c)$

as the Kullback-Leibler (KL) divergence [52] is non-negative. Appendix B.2 details how to compute the gradients of $\mathcal{L}_{\theta,\phi}(x, c)$ and its exact expression for the implemented VAE, which is composed of fully connected neural networks for both the encoder and decoder. Figure 4 depicts the process of a conditional variational autoencoder for PV generation.
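A minimal conditional VAE sketch follows, assuming fully connected encoder and decoder, a Gaussian reconstruction term, and illustrative layer sizes; the paper's exact architecture and loss expression are given in its Appendix B.2, so this is only an indicative Monte Carlo estimate of the lower bound (6c).

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Illustrative conditional VAE with fully connected encoder/decoder."""

    def __init__(self, dim_x, dim_c, dim_z=16, hidden=128):
        super().__init__()
        self.dim_z = dim_z
        self.enc = nn.Sequential(nn.Linear(dim_x + dim_c, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * dim_z))   # outputs (mu, log_var)
        self.dec = nn.Sequential(nn.Linear(dim_z + dim_c, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim_x))

    def elbo(self, x, c):
        """Single-sample estimate of the variational lower bound, to be maximized."""
        mu, log_var = self.enc(torch.cat([x, c], dim=1)).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterization trick
        x_hat = self.dec(torch.cat([z, c], dim=1))
        rec = -((x - x_hat) ** 2).sum(dim=1)       # Gaussian log-likelihood up to a constant
        kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(dim=1)
        return (rec - kl).mean()

    @torch.no_grad()
    def sample(self, c, n):
        """Generate n scenarios for a context c of shape (1, dim_c)."""
        z = torch.randn(n, self.dim_z)
        return self.dec(torch.cat([z, c.expand(n, -1)], dim=1))
```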

Figure 4: The process of the conditional variational autoencoder is illustrated for PV generation. The VAE is trained by maximizing the variational lower bound given a dataset composed of PV observations and weather forecasts. The encoder qφ maps the variable of interest x to a latent space z. The decoder pθ generates the PV scenarios x̂ by taking as inputs samples z from the latent space and the weather forecasts c.

2.2.3. Generative adversarial networks

GANs are a class of deep generative models proposed by Goodfellow et al. [24] where the key idea is the adversarial training of two neural networks, the generator and the discriminator, during which the generator learns iteratively to produce realistic scenarios until they cannot be distinguished anymore from real data by the discriminator. The generator $g_\theta(\cdot) : \mathbb{R}^d \times \mathbb{R}^{|c|} \to \mathbb{R}^T$ maps a latent vector $z \in \mathbb{R}^d$, equipped with a known and tractable prior probability density function $p(z)$, e.g., a Normal distribution, onto a sample $\hat{x} \in \mathbb{R}^T$, and is trained to fool the discriminator. The discriminator $d_\phi(\cdot) : \mathbb{R}^T \times \mathbb{R}^{|c|} \to [0, 1]$ is a classifier trained to distinguish between true samples $x$ and generated samples $\hat{x}$. Goodfellow et al. [24] demonstrated that solving the following min-max problem

$\theta^\star = \arg \min_\theta \max_\phi V(\phi, \theta), \quad (7)$

where $V(\phi, \theta)$ is the value function, recovers the data generating distribution if $g_\theta(\cdot)$ and $d_\phi(\cdot)$ are given enough capacity. The state-of-the-art conditional Wasserstein GAN with gradient penalty (WGAN-GP) proposed by Gulrajani et al. [53] is implemented, with $V(\phi, \theta)$ defined as

$V(\phi, \theta) = -\big(\mathbb{E}_{x}[d_\phi(x|c)] - \mathbb{E}_{\hat{x}}[d_\phi(\hat{x}|c)] + \lambda \, GP\big), \quad (8a)$

$GP = \mathbb{E}_{\tilde{x}}\big[\big(\|\nabla_{\tilde{x}} d_\phi(\tilde{x}|c)\|_2 - 1\big)^2\big], \quad (8b)$

where $\tilde{x}$ is implicitly defined by sampling convex combinations between the data and the generator distributions, $\tilde{x} = \rho x + (1-\rho)\hat{x}$ with $\rho \sim U(0, 1)$. The WGAN-GP constrains the gradient norm of the discriminator's output with respect to its input to enforce the 1-Lipschitz condition. This strategy differs from the weight clipping of the WGAN, which sometimes generates only poor samples or fails to converge. Appendix B.3 details the successive improvements from the original GAN to the WGAN and the final WGAN-GP implemented, referred to as GAN in the rest of the paper. Figure 5 depicts the process of a conditional generative adversarial network for PV generation.
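The gradient penalty term (8b) can be estimated as in the sketch below; the critic signature, penalty weight, and one common sign convention for the critic loss are illustrative assumptions rather than the paper's implementation.

```python
import torch

def gradient_penalty(critic, x_real, x_fake, c):
    """Monte Carlo estimate of GP in (8b) on interpolates x_tilde."""
    rho = torch.rand(x_real.size(0), 1)                        # one rho per sample
    x_tilde = (rho * x_real + (1.0 - rho) * x_fake).requires_grad_(True)
    d_out = critic(x_tilde, c)
    grads, = torch.autograd.grad(outputs=d_out.sum(), inputs=x_tilde,
                                 create_graph=True)
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(critic, x_real, x_fake, c, lam=10.0):
    """Standard WGAN-GP critic loss (one common sign convention)."""
    gp = gradient_penalty(critic, x_real, x_fake, c)
    return -(critic(x_real, c).mean() - critic(x_fake, c).mean()) + lam * gp
```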

Figure 5: The process of the conditional generative adversarial network is illustrated for PV generation. The GAN is trained by solving a min-max problem given a dataset composed of PV observations x and weather forecasts. The generator gθ computes PV scenarios x̂ by taking as inputs samples from the Normal distribution z and the weather forecasts c, and the discriminator dφ tries to distinguish true data from scenarios.

2.3. Theoretical comparison

Normalizing flows are generative models that allow exact likelihood calculation. They are efficiently parallelizable and offer a valuable latent space for downstream tasks. In contrast to GANs and VAEs, NFs explicitly learn the data distribution and provide direct access to the exact likelihood of the model's parameters, hence offering a sound and direct way to optimize the network parameters [54]. However, NFs suffer from drawbacks [21]. One disadvantage of requiring transformations to be invertible is that the input dimension must be equal to the output dimension, which can make the model difficult to train or inefficient. Each transformation must be sufficiently expressive while being easily invertible to compute the Jacobian determinant efficiently. The first issue is also raised by Ruthotto and Haber [22], where ensuring sufficient similarity between the distribution of interest and the latent distribution is of high importance to obtain meaningful and relevant samples. However, in our numerical simulations, we did not encounter this problem. Concerning the second issue, the UMNN-MAF transformation provides an expressive and effective way of computing the Jacobian.

VAEs indirectly optimize the log-likelihood of the data by maximizing the variational lower bound. The advantage of VAEs over NFs is their ability to handle non-invertible generators and an arbitrary dimension of the latent space. However, it has been observed that, when applied to complex datasets such as natural images, VAE samples tend to be unrealistic. There is evidence that the limited approximation of the true posterior, with a common choice being a normally distributed prior with diagonal covariance, is the root cause [55]. This statement comes from the field of computer vision. However, it may explain the shape of the scenarios observed in our numerical experiments in Section 4.

The training of GANs relies on a min-max problem where the generator and the discriminator are jointly optimized. Therefore, it does not rely on estimates of the likelihood or of the latent variable. The adversarial nature of GANs makes them notoriously difficult to train due to the saddle point problem [56]. Another drawback is mode collapse, where one network stays in bad local minima and only a small subset of the data distribution is learned. Several improvements have been designed to address these issues, such as the Wasserstein GAN with gradient penalty. Thus, GAN models are widely used in computer vision and power systems. However, most GAN approaches require cumbersome hyper-parameter tuning to achieve results similar to those of VAEs or NFs. In our numerical simulations, the GAN is highly sensitive to hyper-parameter variations, which is consistent with [22].

Each method has its advantages and drawbacks and makes trade-offs in terms of computing time, hyper-parameter tuning, architecture complexity, etc. Therefore, the choice of a particular method depends on the user's criteria and the dataset considered. In addition, the challenges of power systems differ from those of computer vision, so the limitations established in the computer vision literature, such as Bond-Taylor et al. [21] and Ruthotto and Haber [22], must be considered with caution. We thus encourage energy forecasting practitioners to test and compare these methods in power systems applications.

3. Value and quality assessment

For predictions in any form, one must differentiate between their quality and their value [4]. Forecast quality corresponds to the ability of the forecasts to genuinely inform of future events by mimicking the characteristics of the processes involved. Forecast value relates, instead, to the benefits of using forecasts in a decision-making process, such as participation in the electricity market.

3.1. Forecast quality

Evaluating and comparing generative models remains a challenging task. Several measures have been introduced with the emergence of new models, particularly in the field of computer vision. However, there is no consensus or guidelines as to which metric best captures the strengths and limitations of models. Generative models need to be evaluated directly for the application they are intended for [57]. Indeed, good performance on one criterion does not imply good performance on the others. Several studies propose metrics and make attempts to determine their pros and cons. We selected two that provide helpful information. (1) 24 quantitative and five qualitative measures for evaluating generative models are reviewed and compared by Borji [58], with a particular emphasis on GAN-derived models. (2) Several representative sample-based evaluation metrics for GANs are investigated by Xu et al. [59], where the kernel Maximum Mean Discrepancy (MMD) and the 1-Nearest-Neighbour (1-NN) two-sample test seem to satisfy most of the desirable properties. The key message is to combine several complementary metrics to assess the generative models. Some of the metrics proposed are related to image generation and cannot directly be transposed to energy forecasting.

Therefore, we use eight complementary quality metrics to conduct a relevant quality analysis, inspired by the energy forecasting and computer vision fields. They can be divided into four groups: (1) the univariate metrics, composed of the continuous ranked probability score, the quantile score, and the reliability diagram, which can only assess the quality of the scenarios through their marginals; (2) the multivariate metrics, composed of the energy and variogram scores, which can directly assess multivariate scenarios; (3) the specific metrics, composed of a classifier-based metric and the correlation matrix between scenarios for a given context; and (4) the Diebold-Mariano statistical test. The basics of these metrics are provided in the following, and Appendix C presents the mathematical definitions and the implementation details.

Univariate metrics

The continuous ranked probability score (CRPS) [60] is a univariate scoring rule that penalizes the lack of resolution of the predictive distributions as well as biased forecasts. It is negatively oriented, i.e., the lower, the better, and for deterministic forecasts it reduces to the mean absolute error (MAE). The CRPS is used to compare the skill of the predictive marginals for each component of the random variable of interest, in our case the twenty-four time periods of the day.
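A common sample-based CRPS estimator, averaged over the T marginals, is sketched below; the paper's exact definition and conventions are given in its Appendix C.

```python
import numpy as np

def crps(scenarios, obs):
    """Sample-based CRPS per marginal, averaged over the T periods.

    scenarios: array (M, T) of generated trajectories, obs: array (T,).
    For each marginal, CRPS ~= E|X - y| - 0.5 E|X - X'|.
    """
    term1 = np.mean(np.abs(scenarios - obs), axis=0)                       # (T,)
    term2 = 0.5 * np.mean(np.abs(scenarios[:, None, :] - scenarios[None, :, :]),
                          axis=(0, 1))                                     # (T,)
    return float(np.mean(term1 - term2))
```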

The quantile score (QS), also known as the pinball loss, complements the CRPS. It permits obtaining detailed information about the forecast quality at specific probability levels, i.e., over-forecasting or under-forecasting, and particularly those related to the tails of the predictive distribution [61]. It is negatively oriented and assigns asymmetric weights to negative and positive errors for each quantile.
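The sketch below computes the pinball loss from the quantiles derived from the scenarios; the grid of 99 percentiles follows the evaluation described later in the paper, while the rest is illustrative.

```python
import numpy as np

def quantile_score(scenarios, obs, taus=np.linspace(0.01, 0.99, 99)):
    """Pinball loss averaged over the 99 percentiles and the T periods.

    scenarios: array (M, T), obs: array (T,).
    """
    q = np.quantile(scenarios, taus, axis=0)               # (99, T) quantile forecasts
    diff = obs[None, :] - q
    loss = np.maximum(taus[:, None] * diff, (taus[:, None] - 1.0) * diff)
    return float(loss.mean())
```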

Finally, the reliability diagram is a visual verification used to evaluate the reliability of the quantiles derived from the scenarios. Quantile forecasts are reliable if their nominal proportions are equal to the proportions of observed values.

Multivariate metrics

The energy score (ES) is the most commonly used scoring rule when a finite number of trajectories represents the distributions. It is a multivariate generalization of the CRPS and has been formulated and introduced by Gneiting and Raftery [60]. The ES is proper and negatively oriented, i.e., a lower score represents a better forecast. The ES is used as a multivariate scoring rule by Golestaneh et al. [62] to investigate and analyze the spatio-temporal dependency of PV generation. They emphasize the pros and cons of the ES: it is capable of evaluating forecasts relying on marginals with correct variances but biased means; unfortunately, its ability to detect incorrectly specified correlations between the components of the multivariate quantity is somewhat limited. The ES is selected as a multivariate scoring rule in this study to quantitatively assess the performance of the generative methods, similarly to the mean absolute error when considering point forecasts.
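The ES can be estimated from the scenarios with the standard sample-based estimator below; the exact convention used in the paper is given in its Appendix C.

```python
import numpy as np

def energy_score(scenarios, obs):
    """Sample-based energy score for one day (lower is better).

    scenarios: array (M, T), obs: array (T,).
    ES ~= (1/M) sum_m ||x_m - y|| - 1/(2 M^2) sum_{m,m'} ||x_m - x_m'||.
    """
    m = scenarios.shape[0]
    term1 = np.mean(np.linalg.norm(scenarios - obs, axis=1))
    pair_dist = np.linalg.norm(scenarios[:, None, :] - scenarios[None, :, :], axis=2)
    return float(term1 - pair_dist.sum() / (2 * m ** 2))
```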

An alternative class of proper scoring rules based on the geostatistical concept of variograms is proposed by Scheuerer and Hamill [63]. They study the sensitivity of these variogram-based scoring rules to inaccurately predicted means, variances, and correlations. The results indicate that these scores are distinctly more discriminative with respect to the correlation structure. Thus, in contrast to the energy score, the variogram score captures correlations between multivariate components.
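A sketch of the variogram score of order p for one day follows, with uniform weights; the order and weights actually used in the paper are specified in its Appendix C, so both are assumptions here.

```python
import numpy as np

def variogram_score(scenarios, obs, p=0.5):
    """Variogram score of order p with unit weights (lower is better).

    scenarios: array (M, T), obs: array (T,).
    """
    obs_vario = np.abs(obs[:, None] - obs[None, :]) ** p                   # (T, T)
    scen_vario = np.mean(np.abs(scenarios[:, :, None] - scenarios[:, None, :]) ** p,
                         axis=0)                                           # (T, T)
    return float(np.sum((obs_vario - scen_vario) ** 2))
```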

Specific metrics

A conditional classifier-based scoring rule is designed by implementing an Extra-Trees classifier [64] to discriminate true from generated samples. The receiver operating characteristic (ROC) curves are computed for each generative model on the testing set. The best generative model should achieve an area under the ROC curve (AUC) of 0.5, i.e., each sample is equally likely to be predicted as true or false, meaning the classifier is unable to discriminate generated scenarios from actual observations. Note: the ROC curve is the relationship between the true positive rate and the false positive rate given by different thresholds, and the AUC is the area under the ROC curve, the metric used to measure how well the model can distinguish two classes. We recommend the article by Fawcett [65], which is designed both as a tutorial introduction to ROC graphs and as a practical guide for using them in research.
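A simplified, unconditional version of this metric is sketched below: an Extra-Trees classifier is trained to separate observed days from generated scenarios and its ROC AUC is reported. The paper's conditional protocol and hyper-parameters are described in its Appendix C; the split size and the number of trees here are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def classifier_auc(x_true, x_gen, seed=0):
    """AUC of an Extra-Trees classifier discriminating real (1) from generated (0)."""
    X = np.vstack([x_true, x_gen])
    y = np.concatenate([np.ones(len(x_true)), np.zeros(len(x_gen))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    clf = ExtraTreesClassifier(n_estimators=500, random_state=seed).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

An AUC close to 0.5 means the generated scenarios are statistically hard to distinguish from the observations.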

The second specific metric consists of computing the correlation matrix between the scenarios generated for given weather forecasts. Formally, let $\{\hat{x}^i\}_{i=1}^M$ be the set of M scenarios generated for a given day of the testing set. It is a matrix (M × 24) where each row is a scenario. Then, the Pearson correlation coefficients are computed into a correlation matrix (24 × 24). This metric indicates the variety of the scenario shapes.
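Once the M scenarios of a day are stacked row-wise, this metric reduces to a single NumPy call (an illustrative helper):

```python
import numpy as np

def scenario_correlation(scenarios):
    """Pearson correlation matrix (24 x 24) across the scenarios of one day.

    scenarios: array (M, 24), one row per scenario, one column per period.
    """
    return np.corrcoef(scenarios, rowvar=False)
```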

Statistical testing

Using relevant metrics to assess the forecast quality is essential. However, it is also necessary to analyze whether any difference in accuracy is statistically significant. Indeed, when different models have almost identical values in the selected error measures, it is not easy to draw statistically significant conclusions on the outperformance of the forecasts of one model by those of another. The Diebold-Mariano (DM) test [66] is probably the most commonly used statistical testing tool to evaluate the significance of differences in forecasting accuracy. It is model-free, i.e., it compares the forecasts of models, and not the models themselves. The DM test is used in this study to assess the CRPS, QS, ES, and VS metrics. The CRPS and QS are univariate scores, and a value of the CRPS and QS is computed per marginal (time period of the day). Therefore, the multivariate variant of the DM test is implemented following Ziel and Weron [67], where only one statistic for each pair of models is computed, based on the 24-dimensional vector of errors for each day.
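A sketch of this multivariate DM test is given below: one loss per day is obtained from the norm of the 24-dimensional error vector, and the mean loss differential is tested against zero with an asymptotic Normal statistic. The choice of the L1 norm and of the two-sided alternative are assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy import stats

def dm_test(errors_a, errors_b):
    """Multivariate Diebold-Mariano test on daily 24-dimensional error vectors.

    errors_a, errors_b: arrays (n_days, 24) of per-period errors of two models.
    Returns the DM statistic and the two-sided p-value.
    """
    d = np.abs(errors_a).sum(axis=1) - np.abs(errors_b).sum(axis=1)   # daily loss differential
    dm = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(dm)))
    return dm, p_value
```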

3.2. Forecast value

A model that yields lower errors in terms of forecast quality may not always point to a more effective model for forecasting practitioners [5]. To this end, similarly to Toubeau et al. [19], the forecast value is assessed by considering the day-ahead market scheduling of electricity aggregators, such as energy retailers or generation companies. The energy retailer aims to balance its portfolio on an hourly basis to avoid financial penalties in case of imbalance, by exchanging the surplus or deficit of energy in the day-ahead electricity market. The energy retailer may use a battery energy storage system (BESS) to manage its portfolio and minimize imports from the main grid when day-ahead prices are prohibitive.

Let $e_t$ [MWh] be the net energy retailer position on the day-ahead market during the t-th hour of the day, modeled as a first-stage variable. Let $y_t$ [MWh] be the realized net energy retailer position during the t-th hour of the day, which is modeled as a second-stage variable due to the stochastic processes of the PV generation, wind generation, and load. Let $\pi_t$ [€/MWh] be the clearing price in the day-ahead spot market for the t-th hour of the day, $q_t$ the ex-post settlement price for a negative imbalance $y_t < e_t$, and $\lambda_t$ the ex-post settlement price for a positive imbalance $y_t > e_t$. The energy retailer is assumed to be a price taker in the day-ahead market, motivated by the individual energy retailer capacity being negligible relative to the whole market. The forward settlement price $\pi_t$ is assumed to be fixed and known. As imbalance prices tend to exhibit volatility and are difficult to forecast, they are modeled as random variables, with expectations denoted by $\hat{q}_t = \mathbb{E}[q_t]$ and $\hat{\lambda}_t = \mathbb{E}[\lambda_t]$. They are assumed to be random variables independent of the energy retailer portfolio.

A stochastic planner with a linear programming formulation and linear constraints is implemented using a scenario-based approach. The planner computes the day-ahead bids $e_t$ that cannot be modified in the future when the uncertainty is resolved. The second stage corresponds to the dispatch decisions $y_{t,\omega}$ in scenario $\omega$ that aim at avoiding portfolio imbalances, modeled by a cost function $f^c$. The second-stage decisions are therefore scenario-dependent and can be adjusted according to the realization of the stochastic parameters. The stochastic planner objective to maximize is

$J_S = \mathbb{E}\Big[\sum_{t \in \mathcal{T}} \pi_t e_t + f^c(e_t, y_{t,\omega})\Big], \quad (9)$

where the expectation is taken with respect to the random variables: the PV generation, wind generation, and load. Using a scenario-based approach, (9) is approximated by

$J_S \approx \sum_{\omega \in \Omega} \alpha_\omega \sum_{t \in \mathcal{T}} \big[\pi_t e_t + f^c(e_t, y_{t,\omega})\big], \quad (10)$

with $\alpha_\omega$ the probability of scenario $\omega \in \Omega$, and $\sum_{\omega \in \Omega} \alpha_\omega = 1$. The optimization problem is detailed in Appendix D.
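The two-stage structure of (9)-(10) can be sketched as a small linear program. The snippet below uses CVXPY, replaces the imbalance cost $f^c$ by a simple two-price settlement with the expected prices, and assumes $\hat{\lambda}_t \le \pi_t \le \hat{q}_t$ so that deviations are penalized; the complete formulation with the storage constraints is in Appendix D, so every name and simplification here is illustrative.

```python
import cvxpy as cp

def day_ahead_bids(pi, q_hat, lam_hat, net_scenarios, alpha):
    """Scenario-based day-ahead bidding sketch (simplified imbalance cost).

    pi: day-ahead prices (T,); q_hat, lam_hat: expected settlement prices for
    negative / positive imbalances (T,); net_scenarios: (n_scen, T) realized net
    positions y_{t,w} implied by the PV, wind, and load scenarios;
    alpha: scenario probabilities (n_scen,).
    """
    n_scen, T = net_scenarios.shape
    e = cp.Variable(T)                                   # first-stage bids e_t
    surplus = cp.Variable((n_scen, T), nonneg=True)      # (y - e)^+, sold at lam_hat
    shortfall = cp.Variable((n_scen, T), nonneg=True)    # (e - y)^+, bought at q_hat
    cons = [net_scenarios[w] - e == surplus[w] - shortfall[w] for w in range(n_scen)]
    profit = pi @ e + sum(alpha[w] * (lam_hat @ surplus[w] - q_hat @ shortfall[w])
                          for w in range(n_scen))
    cp.Problem(cp.Maximize(profit), cons).solve()
    return e.value
```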

4. Numerical Results

The quality and value evaluations of the models are conducted on the load, wind, and PV tracks of the open-access GEFcom 2014 dataset [45], composed of one, ten, and three zones, respectively. Figure 6 depicts the methodology used to assess both the quality and value of the GAN, VAE, and NF models implemented in this study.

4.1. Implementation details

By appropriate normalization, we standardize the weather forecasts to have a zero mean and unit variance. Table 2 provides a summary of the implementation details described in what follows. For the sake of proper model training and evaluation, the dataset is divided into three parts per track considered: learning, validation, and testing sets. The learning set (LS) is used to fit the models, the validation set (VS) to select the optimal hyper-parameters, and the testing set (TS) to assess the forecast quality and value. The number of samples (#), expressed in days, of the VS and TS is 50·nz, with nz the number of zones of the track considered. The 50 days are selected randomly from the dataset, and the learning set is composed of the remaining part, with D·nz samples, where D is provided for each track in Table 2. The NF, VAE, and GAN use the weather forecasts as inputs to generate, on a day-ahead basis, M scenarios x̂ ∈ R^T. The hyper-parameter values used for the experiments are provided in Appendix B.4.
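The preprocessing described above can be sketched as follows; the helper names, the random seed, and fitting the statistics on the learning set only are assumptions about the exact procedure.

```python
import numpy as np

def standardize(c_train, c_other):
    """Scale weather forecasts to zero mean and unit variance (column-wise),
    with statistics fitted on the learning set only."""
    mu, sigma = c_train.mean(axis=0), c_train.std(axis=0) + 1e-8
    return (c_train - mu) / sigma, (c_other - mu) / sigma

def split_days(n_days, n_val=50, n_test=50, seed=0):
    """Random day indices for the learning, validation, and testing sets."""
    idx = np.random.default_rng(seed).permutation(n_days)
    return idx[n_val + n_test:], idx[:n_val], idx[n_val:n_val + n_test]
```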

Figure 6: Methodology to assess both the quality and value of the GAN, VAE, and NF models implemented in this study. The PV, wind power, and load datasets of the open-access Global Energy Forecasting Competition 2014 are divided into three parts: learning, validation, and testing sets. The learning set is used to train the models, the validation set to select the optimal hyper-parameters, and the testing set to conduct the numerical experiments. The quality and value of the models are assessed by using the scenarios generated on the testing set. The quality evaluation consists of eight complementary metrics. The value assessment uses the simple and easily reproducible case study of the day-ahead bidding of an energy retailer. The energy retailer portfolio is composed of PV generation, wind power generation, load, and a storage system device. The retailer bids on the day-ahead market by computing a planning based on stochastic optimization. The dispatch is computed by using the observations of the PV generation, wind power, and load. Then, the profits are evaluated and compared.

Wind track

The zonal $u^{10}$, $u^{100}$ and meridional $v^{10}$, $v^{100}$ wind components at 10 and 100 meters are selected, and six features are derived following the formulas provided by Landry et al. [68] to compute the wind speed $ws^{10}$, $ws^{100}$, energy $we^{10}$, $we^{100}$, and direction $wd^{10}$, $wd^{100}$ at 10 and 100 meters

$ws = \sqrt{u^2 + v^2}, \quad (11a)$

$we = \frac{1}{2} ws^3, \quad (11b)$

$wd = \frac{180}{\pi} \arctan(u, v). \quad (11c)$

For each generative model, the wind zone is taken into account with one-hot encoding variables $Z_1, \ldots, Z_{10}$, and the wind feature input vector for a given day $d$ is

$c_d^{wind} = [u_d^{10}, u_d^{100}, v_d^{10}, v_d^{100}, ws_d^{10}, ws_d^{100}, we_d^{10}, we_d^{100}, wd_d^{10}, wd_d^{100}, Z_1, \ldots, Z_{10}], \quad (12)$

of dimension $n_f \cdot T + n_z = 10 \cdot 24 + 10$.
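The feature derivation of Eq. (11) translates directly into NumPy; the arctangent argument order follows the text above, and the function name is illustrative.

```python
import numpy as np

def wind_features(u, v):
    """Wind speed, energy, and direction from zonal u and meridional v components."""
    ws = np.sqrt(u ** 2 + v ** 2)            # Eq. (11a)
    we = 0.5 * ws ** 3                       # Eq. (11b)
    wd = np.degrees(np.arctan2(u, v))        # Eq. (11c): (180 / pi) * arctan(u, v)
    return ws, we, wd
```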

PV track

The solar irradiation $I$, the air temperature $T$, and the relative humidity $rh$ are selected, and two features are derived by computing $I^2$ and $IT$. For each generative model, the PV zone is taken into account with one-hot encoding variables $Z_1, Z_2, Z_3$, and the PV feature input vector for a given day $d$ is

$c_d^{PV} = [I_d, T_d, rh_d, I_d^2, IT_d, Z_1, Z_2, Z_3], \quad (13)$

of dimension $n_f \cdot T + n_z$. For practical reasons, the periods where the PV generation is always 0, across all zones and days, are removed, and the final dimension of the input feature vector is $n_f \cdot T + n_z = 5 \cdot 16 + 3$.

Load track

The 25 weather station temperature forecasts $w_1, \ldots, w_{25}$ are used. There is only one zone, and the load feature input vector for a given day $d$ is

$c_d^{load} = [w_1, \ldots, w_{25}], \quad (14)$

of dimension $n_f \cdot T = 25 \cdot 24$.

4.2. Quality results

A thorough comparison of the models is conducted on the wind track, and Appendix E provides the figures of the other tracks for the sake of clarity. Note: the model ranking differs slightly depending on the track.

                     Wind          PV            Load
T periods            24            16            24
nz zones             10            3             —
nf features          10            5             25
cd dimension         nf·T + nz     nf·T + nz     nf·T
# LS (days)          631·nz        720·nz        1999
# VS/TS (days)       50·nz         50·nz         50

Table 2: Dataset and implementation details. Each dataset is divided into three parts: learning, validation, and testing sets. The number of samples (#) is expressed in days and is set to 50 days for the validation and testing sets. T is the number of periods per day considered, nz the number of zones of the dataset, nf the number of weather variables used, and cd is the dimension of the conditional vector for a given day, which includes the weather forecasts and the one-hot encoding variables when there are several zones. Note: the days of the learning, validation, and testing sets are selected randomly.

Wind track

In addition to the generative models, a naive approach is designed (RAND), where the scenarios of the learning, validation, and testing sets are sampled randomly from the learning, validation, and testing sets, respectively. Intuitively, it assumes that past observations are repeated; these scenarios are realistic but may not be compatible with the context. Each model generates a set of 100 scenarios for each day of the testing set, and the scores are computed following the mathematical definitions provided in Appendix C. Figure 7 compares the QS, reliability diagram, and CRPS of the wind (markers), PV (plain), and load (dashed) tracks. Overall, for the wind track in terms of CRPS, QS, and reliability diagrams, the VAE achieves slightly better scores, followed by the NF and the GAN. The ES and VS multivariate scores confirm this trend with 54.82 and 17.87 for the VAE vs 56.71 and 18.54 for the NF, respectively.

Figure 8 provides the results of the DM tests for these metrics. The heat map indicates the range of the p-values. The closer they are to zero, dark green, the more significant the difference between the scores of two models for a given metric. The statistical threshold is five %, but the color scale is capped at ten % for a better exposition of the relevant results. For instance, when considering the DM test for the RAND CRPS, all the columns of the RAND row are in dark green, indicating that the RAND scenarios are always significantly outperformed by the other models. These DM tests confirm that the VAE outperforms the NF for the wind track considering these metrics. Then, the NF is only outperformed by the VAE, and the GAN by both the VAE and NF. These results are consistent with the classifier-based metric depicted in Figure 9, where the VAE is the best at misleading the classifier, followed by the NF and GAN.

The left part of Figure 10 provides 50 scenarios, (a) NF, (c) GAN, and (e) VAE, generated for a given day selected randomly from the testing set. Notice how the shape of the NF's scenarios differs significantly from the GAN and VAE, as they tend to be more variable with no identifiable trend. In contrast, the VAE and GAN scenarios seem to differ mainly in nominal power but have similar shapes. This behavior is even more pronounced for the GAN, where the scenarios rarely cross over periods. For instance, there is a gap in generation around periods 17 and 18 that all the GAN's scenarios follow. These observations are confirmed by computing the corresponding time correlation matrices, depicted in the right part of Figure 10, demonstrating that there is no correlation between the NF's scenarios. On the contrary, the VAE and GAN correlation matrices tend to be similar, with a time correlation of the scenarios over a few periods, and more correlated periods for the GAN. This difference in the scenarios' shape is striking, not necessarily captured by metrics such as the CRPS, QS, or even the classifier-based metric, and is also observed on the PV and load tracks, as explained in the next paragraph.

All tracks

Table 3 provides the averaged quality scores. The CRPS is averaged over the 24 time periods, and the QS over the 99 percentiles. The MAE-r is the mean absolute error between the reliability curve and the diagonal, and the AUC is the mean of the 50 AUCs. Overall, for the PV and load tracks, in terms of CRPS, QS, reliability diagrams, AUC, ES, and VS, the NF outperforms the VAE and GAN and is slightly outperformed by the VAE on the wind track. On the load track, the VAE outperforms the GAN. However, the VAE and GAN achieved similar results on the PV track, and the GAN performed better in terms of ES and VS. These results are confirmed by the DM tests depicted in Figure E.16. The classifier-based metric results for both the load and PV tracks, provided in Figure E.17, confirm this trend, where the NF is the best at tricking the classifier, followed by the VAE and GAN.

Similar to the wind track, the shape of the scenarios differs significantly between the NF and the other models for both the load and PV tracks, as indicated by the left part of Figures E.18 and E.19, and the corresponding correlation matrices provided by the right part of Figures E.18 and E.19. Note: the load track

Figure 7: Quality standard metrics comparison on the wind (markers), PV (plain), and load (dashed) tracks: (a) quantile score, (b) reliability diagram, (c) CRPS per marginal.
Quantile score (a): the lower and the more symmetrical, the better. Note: the quantile score has been averaged over the marginals (the 24 time periods of the day). Reliability diagram (b): the closer to the diagonal, the better. Continuous ranked probability score per marginal (c): the lower, the better. The NF outperforms the VAE and GAN for both the PV and load tracks and is slightly outperformed by the VAE on the wind track. Note: all models tend to have more difficulties forecasting the wind power, which seems less predictable than the PV generation or the load.


            NF      VAE     GAN     RAND
Wind
  CRPS      9.07    8.80    9.79    16.92
  QS        4.58    4.45    4.95    8.55
  MAE-r     2.83    2.67    6.82    1.01
  AUC       0.935   0.877   0.972   0.918
  ES        56.71   54.82   60.52   96.15
  VS        18.54   17.87   19.87   23.21
PV
  CRPS      2.35    2.60    2.61    4.92
  QS        1.19    1.31    1.32    2.48
  MAE-r     2.66    9.04    4.94    3.94
  AUC       0.950   0.969   0.997   0.947
  ES        23.08   24.65   24.15   41.53
  VS        4.68    5.02    4.88    13.40
Load
  CRPS      1.51    2.74    3.01    6.74
  QS        0.76    1.39    1.52    3.40
  MAE-r     7.70    13.97   9.99    0.88
  AUC       0.823   0.847   0.999   0.944
  ES        9.17    15.11   17.96   38.08
  VS        1.63    1.66    3.81    7.28

Table 3: Averaged quality scores per dataset.
The best performing deep learning generative model for each track is written in bold. The CRPS, QS, MAE-r, and ES are expressed in %. Overall, for both the PV and load tracks, the NF outperforms the VAE and GAN, and it is slightly outperformed by the VAE on the wind track.

Figure 8: Wind track Diebold-Mariano tests of the CRPS, QS, ES, and VS metrics: (a) CRPS DM test, (b) QS DM test, (c) ES DM test, (d) VS DM test.
The Diebold-Mariano tests of the continuous ranked probability, quantile, energy, and variogram scores confirm that the VAE outperforms the NF on the wind track for these metrics. The heat map indicates the range of the p-values. The closer they are to zero, dark green, the more significant the difference between the scores of two models for a given metric. The statistical threshold is five %, but the color scale is capped at ten % for a better exposition of the relevant results.

scenarios are highly correlated for both the VAE and GAN. Finally, Figure E.20 provides the average of the correlation matrices over all days of the testing set for each dataset. The trend depicted above is confirmed. This difference between the NF and the other generative models may be explained by the design of the methods. The NF explicitly learns the probability density function (PDF) of the multi-dimensional random variable considered. Thus, the NF scenarios are generated according to the learned PDF, producing multiple shapes of scenarios. In contrast, the generator of the GAN is trained to fool the discriminator, and it may find a shape that is particularly efficient, leading to a set of similar scenarios. Concerning the VAE, it is less obvious. However, by design, the decoder is trained to generate scenarios from the latent space, assumed to follow a Gaussian distribution, which may lead to less variability.

4.3. Value results

The energy retailer portfolio comprises wind power, PV generation, load, and a battery energy storage device. The 50 days of the testing set are used and combined with the 30 possible PV and wind generation zones (three PV zones and ten wind farms), resulting in


Figure 9: Wind track classifier-based metric (ROC curves: true positive rate vs. false positive rate).
The VAE (orange) is the best at misleading the classifier, followed by the NF (blue) and GAN (green). Note: there are 50 ROC curves depicted for each model, each corresponding to a generated scenario used as input of the classifier. It allows taking into account the variability of the scenarios to avoid having results dependent on a particular scenario.

1 500 independent simulated days. A two-step approach is employed to evaluate the forecast value:

• First, for each generative model and the 1 500 simulated days, the two-stage stochastic planner computes the day-ahead bids of the energy retailer portfolio using the PV, wind power, and load scenarios. After solving the optimization, the day-ahead decisions are recorded.

• Then, a real-time dispatch is carried out using the PV, wind power, and load observations, with the day-ahead decisions as parameters.

This two-step methodology is applied to evaluate the three generative models, namely the NF, GAN, and VAE. Figure 11 illustrates an arbitrary random day of the testing set with the first zone for both the PV and wind. π_t [€/MWh] is the day-ahead price of the Belgian day-ahead market on February 6, 2020, used for the 1 500 simulated days. The negative q_t and positive λ_t imbalance prices are set to 2 × π_t, ∀t ∈ T. The retailer aims to balance the net power, the red curve in Figure 11, by importing/exporting from/to the main grid. Usually, the net is positive (negative) at noon (evening) when the PV generation is maximal (minimal) and the load is minimal (maximal). As the day-ahead spot price is often maximal during the evening load peak, the retailer seeks to save power during the day by charging the battery to decrease the import during the evening. Therefore, the more accurate the PV, wind generation,

Figure 10: Wind power scenarios shape comparison and analysis.
Left part (a) NF, (c) GAN, and (e) VAE: 50 wind power scenarios (grey) of a randomly selected day of the testing set along with the ten % (blue), 50 % (black), and 90 % (green) quantiles, and the observations (red). Right part (b) NF, (d) GAN, and (f) VAE: the corresponding Pearson time correlation matrices of these scenarios with the periods as rows and columns. The NF tends to exhibit no time correlation between scenarios. In contrast, the VAE and GAN tend to be partially time-correlated over a few periods.

and load scenarios are, the better the day-ahead planning is.

The battery minimum s_min and maximum s_max capacities are 0 and 1, respectively. It is assumed to be capable of fully (dis)charging in two hours, with y_max^{dis} = y_max^{cha} = s_max/2, and the (dis)charging efficiencies are η_dis = η_cha = 95 %. Each simulation day is independent, with a fully discharged battery at the first and last period of each day: s_ini = s_end = 0. The 1 500 stochastic optimization problems are solved with 50 PV, wind generation, and load scenarios. The Gurobi Python library is used to implement the algorithms in Python 3.7, and Gurobi2 9.0.2 is used to solve the optimization problems. Numerical experiments are performed on an Intel

2https://www.gurobi.com/


Figure 11: Energy retailer case study: illustration of the observations (PV, wind, load, net) and the day-ahead price π_t [€/MWh] on a random day of the testing set.
The energy retailer portfolio comprises PV generation, wind power, load, and a storage device. The PV, wind power, and load scenarios from the testing set are used as inputs of the stochastic day-ahead planner to compute the optimal bids. The net is the power balance of the energy retailer portfolio. The day-ahead prices π_t are obtained from the Belgian day-ahead market on February 6, 2020.

Core i7-8700 3.20 GHz based computer with 12 threads and 32 GB of RAM running on Ubuntu 18.04 LTS.
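For readers who want to reproduce the spirit of this two-step evaluation, the following is a minimal gurobipy sketch of a two-stage stochastic day-ahead problem. The variable names, the simplified battery and imbalance modeling, and the equal scenario weights are illustrative assumptions and not the exact formulation solved in this study.

# Minimal two-stage sketch (illustrative): first stage, the day-ahead bid p_da[t];
# second stage, per-scenario battery dispatch and imbalance penalties q[t], lam[t]
# (both set to 2 * pi[t] in the case study).
import gurobipy as gp
from gurobipy import GRB

def day_ahead_bids(scenarios, pi, q, lam, T=24, s_max=1.0, eta=0.95):
    """scenarios: list of dicts with 'pv', 'wind', 'load' arrays of length T."""
    n_s = len(scenarios)
    m = gp.Model("retailer")
    m.Params.OutputFlag = 0
    p_da = m.addVars(T, lb=-GRB.INFINITY, name="bid")      # day-ahead bid
    cha = m.addVars(n_s, T, ub=s_max / 2, name="cha")      # charging power
    dis = m.addVars(n_s, T, ub=s_max / 2, name="dis")      # discharging power
    soc = m.addVars(n_s, T, ub=s_max, name="soc")          # state of charge
    d_short = m.addVars(n_s, T, name="short")              # negative imbalance
    d_long = m.addVars(n_s, T, name="long")                # positive imbalance
    for i, s in enumerate(scenarios):
        for t in range(T):
            net = s["pv"][t] + s["wind"][t] - s["load"][t]
            # net generation plus battery must match the bid up to the imbalance
            m.addConstr(net + dis[i, t] - cha[i, t] - p_da[t] == d_long[i, t] - d_short[i, t])
            prev = 0.0 if t == 0 else soc[i, t - 1]        # s_ini = 0
            m.addConstr(soc[i, t] == prev + eta * cha[i, t] - dis[i, t] / eta)
        m.addConstr(soc[i, T - 1] == 0.0)                  # s_end = 0
    revenue = gp.quicksum(pi[t] * p_da[t] for t in range(T))
    penalty = (1.0 / n_s) * gp.quicksum(q[t] * d_short[i, t] + lam[t] * d_long[i, t]
                                        for i in range(n_s) for t in range(T))
    m.setObjective(revenue - penalty, GRB.MAXIMIZE)
    m.optimize()
    return [p_da[t].X for t in range(T)]                   # recorded day-ahead decisions

The real-time dispatch of the second step can then be obtained by re-solving the second-stage part with the observations in place of the scenarios and the bids fixed to the returned values.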

The net profit, that is, the profit minus the penalty, is computed for the 1 500 days of the simulation and aggregated in the first row of Table 4. The ranking of each model is computed for the 1 500 days. The cumulative ranking is expressed as a percentage in Table 4. The NF outperformed both the GAN and VAE with a total net profit of 107 k€. There is still room for improvement, as the oracle, which has perfect future knowledge, achieved 300 k€. The NF ranked first in 39.0 % of the 1 500 simulated days and achieved the first or second rank in 69.6 % of them. Overall, in terms of forecast value,

                    NF      VAE     GAN
Net profit (k€)     107     97      93
1 (%)               39.0    31.8    29.2
1 & 2 (%)           69.6    68.3    62.1
1 & 2 & 3 (%)       100     100     100

Table 4: Total net profit (k€) and cumulative ranking (%).
Using the NF PV, wind power, and load scenarios, the stochastic planner achieved the highest net profit with 107 k€, and ranked first 39.0 %, second 30.6 %, and third 30.4 % of the 1 500 days of simulation. The second-best model, the VAE, achieved a net profit of 97 k€, and ranked first 31.8 %, second 36.5 %, and third 31.7 %.

the NF outperforms the VAE and GAN. However, this case study is "simple," and the stochastic optimization relies mainly on the quality of the average of the scenarios. Therefore, one may consider taking advantage of the particularities of a specific method by considering more advanced case studies. In particular, the specificity of the NFs to provide direct access to the probability density function may be of great interest in specific applications. It is left for future investigations, as more advanced case studies would prevent a fair comparison between models.

4.4. Results summary

Table 5 summarizes the main results of this study by comparing the VAE, GAN, and NF implemented through easily comparable star ratings. The rating for each criterion is determined using the following rules - 1 star: third rank, 2 stars: second rank, and 3 stars: first rank. Specifically, the training speed is assessed based on the reported total training times for each dataset: PV generation, wind power, and load; the sample speed is based on the reported total generating times for each dataset; the quality is evaluated with the metrics considered; the value is based on the case study of the day-ahead bidding of the energy retailer; the hyper-parameters search is assessed by the number of configurations tested before reaching satisfactory and stable results over the validation set; the hyper-parameters sensitivity is evaluated by the impact on the quality metrics of deviations from the optimal hyper-parameter values found during the hyper-parameter search; the implementation-friendly criterion is appraised regarding the complexity of the technique and the amount of knowledge required to implement it.

5. Conclusion

This paper proposed a fair and thorough comparison of normalizing flows with the state-of-the-art deep learning generative models, generative adversarial networks and variational autoencoders, both in quality and value. The experiments adopt the open data of the Global Energy Forecasting Competition 2014, where the generative models use the conditional information to compute improved weather-based PV power, wind power, and load scenarios. The results demonstrate that normalizing flows can challenge generative adversarial networks and variational autoencoders. Overall, they are more accurate in quality and value and can be used effectively by non-expert deep learning practitioners. In addition, normalizing flows have several advantages over more traditional deep learning approaches that should motivate their introduction into power system applications:


Criteria            VAE     GAN     NF
Train speed         ★★★     ★★      ★
Sample speed        ★★★     ★★      ★
Quality             ★★      ★       ★★★
Value               ★★      ★       ★★★
Hp search           ★★      ★       ★★★
Hp sensibility      ★★      ★       ★★★
Implementation      ★★★     ★★      ★

Table 5: Comparison between the deep generative models.
The rating for each criterion is determined using the following rules - 1 star: third rank, 2 stars: second rank, and 3 stars: first rank. Train speed: training computation time; Sample speed: scenario generation computation time; Quality: forecast quality based on the eight complementary metrics considered; Value: forecast value based on the day-ahead energy retailer case study; Hp search: assesses the difficulty of identifying relevant hyper-parameters; Hp sensibility: assesses the sensitivity of the model to a given set of hyper-parameters (the more stars, the more robust to hyper-parameter modifications); Implementation: assesses the difficulty of implementing the model (the more stars, the more implementation-friendly). Note: the justifications are provided in Appendix A.2.

(i) Normalizing flows directly learn the stochastic multivariate distribution of the underlying process by maximizing the likelihood. In contrast to variational autoencoders and generative adversarial networks, they provide access to the exact likelihood of the model's parameters. Hence, they offer a sound and direct way to optimize the network parameters. It may open a new range of advanced applications benefiting from this advantage, for instance, to transfer scenarios from one location to another based on the knowledge of the probability density function. A second application is importance sampling for stochastic optimization based on a scenario approach. Indeed, normalizing flows provide the likelihood of each generated scenario, making it possible to filter relevant scenarios used in stochastic optimization.

(ii) In our opinion, normalizing flows are easier to use by non-expert deep learning practitioners once the libraries are available. Indeed, they are more reliable and robust in terms of hyper-parameter selection. Generative adversarial networks and variational autoencoders are particularly sensitive to the latent space dimension, the structure of the neural networks, the learning rate, etc. Generative adversarial network convergence, by design, is unstable, and for a given set of hyper-parameters, the scenario quality may differ completely. In contrast, it was easier to retrieve relevant normalizing flow hyper-parameters by manually testing a few sets of values that led to satisfying training convergence and quality results.

Nevertheless, their usage as a base component of the machine learning toolbox is still limited compared to generative adversarial networks or variational autoencoders.

6. Acknowledgments

The authors would like to acknowledge the authors and contributors of the Scikit-Learn [69], PyTorch [70], and Weights & Biases [71] Python libraries. In addition, the authors would like to thank the editor and the reviewers for the comments that helped improve the paper. Antoine Wehenkel is a recipient of an F.R.S.-FNRS fellowship and acknowledges the financial support of the FNRS (Belgium). Antonio Sutera is supported via the Energy Transition Funds project EPOC 2030-2050 organized by the FPS economy, S.M.E.s, Self-employed and Energy.

References

[1] M. Allen, P. Antwi-Agyei, F. Aragon-Durand, M. Babiker, P. Bertoldi, M. Bind, S. Brown, M. Buckeridge, I. Camilloni, A. Cartwright, et al., Technical Summary: Global warming of 1.5° C. An IPCC Special Report on the impacts of global warming of 1.5° C above pre-industrial levels and related global greenhouse gas emission pathways, in the context of strengthening the global response to the threat of climate change, sustainable development, and efforts to eradicate poverty, Technical Report, Intergovernmental Panel on Climate Change, 2019.

[2] T. Gneiting, M. Katzfuss, Probabilistic forecasting, Annual Review of Statistics and Its Application 1 (2014) 125–151.

[3] T. Hong, S. Fan, Probabilistic electric load forecasting: A tutorial review, International Journal of Forecasting 32 (2016) 914–938.

[4] J. M. Morales, A. J. Conejo, H. Madsen, P. Pinson, M. Zugno, Integrating renewables in electricity markets: operational problems, volume 205, Springer Science & Business Media, 2013.

[5] T. Hong, P. Pinson, Y. Wang, R. Weron, D. Yang, H. Zareipour, et al., Energy forecasting: A review and outlook, Technical Report, Department of Operations Research and Business Intelligence, Wroclaw . . . , 2020.

[6] A. Khoshrou, E. J. Pauwels, Short-term scenario-based probabilistic load forecasting: A data-driven approach, Applied Energy 238 (2019) 1258–1268.

[7] A. Mashlakov, T. Kuronen, L. Lensu, A. Kaarna, S. Honkapuro, Assessing the performance of deep learning models for multivariate probabilistic energy forecasting, Applied Energy 285 (2021) 116405.

[8] P. Wang, B. Liu, T. Hong, Electric load forecasting with recency effect: A big data approach, International Journal of Forecasting 32 (2016) 585–597.


[9] J. G. De Gooijer, R. J. Hyndman, 25 years of time series forecasting, International Journal of Forecasting 22 (2006) 443–473.

[10] J. M. Morales, R. Minguez, A. J. Conejo, A methodology to generate statistically dependent wind speed scenarios, Applied Energy 87 (2010) 843–855.

[11] S. H. Karaki, B. A. Salim, R. B. Chedid, Probabilistic model of a two-site wind energy conversion system, IEEE Transactions on Energy Conversion 17 (2002) 530–536.

[12] S. Karaki, R. Chedid, R. Ramadan, Probabilistic performance assessment of autonomous solar-wind energy conversion systems, IEEE Transactions on Energy Conversion 14 (1999) 766–772.

[13] P. Pinson, H. Madsen, H. A. Nielsen, G. Papaefthymiou, B. Klockl, From probabilistic forecasts to statistical scenarios of short-term wind power production, Wind Energy: An International Journal for Progress and Applications in Wind Power Conversion Technology 12 (2009) 51–62.

[14] H. Zhang, Z. Lu, W. Hu, Y. Wang, L. Dong, J. Zhang, Coordinated optimal operation of hydro–wind–solar integrated systems, Applied Energy 242 (2019) 883–896.

[15] S. Camal, F. Teng, A. Michiorri, G. Kariniotakis, L. Badesa, Scenario generation of aggregated wind, photovoltaics and small hydro production for power systems applications, Applied Energy 242 (2019) 1396–1406.

[16] H. Shi, M. Xu, R. Li, Deep learning for household load forecasting—a novel pooling deep RNN, IEEE Transactions on Smart Grid 9 (2017) 5271–5280.

[17] J. Dumas, C. Cointe, X. Fettweis, B. Cornelusse, Deep learning-based multi-output quantile forecasting of PV generation, in: 2021 IEEE Madrid PowerTech, 2021, pp. 1–6. doi:10.1109/PowerTech46648.2021.9494976.

[18] H. Hewamalage, C. Bergmeir, K. Bandara, Recurrent neural networks for time series forecasting: Current status and future directions, International Journal of Forecasting 37 (2020) 388–427.

[19] J.-F. Toubeau, J. Bottieau, F. Vallee, Z. De Greve, Deep learning-based multivariate probabilistic forecasting for short-term scheduling in power markets, IEEE Transactions on Power Systems 34 (2018) 1203–1215.

[20] D. Salinas, V. Flunkert, J. Gasthaus, T. Januschowski, DeepAR: Probabilistic forecasting with autoregressive recurrent networks, International Journal of Forecasting 36 (2020) 1181–1191.

[21] S. Bond-Taylor, A. Leach, Y. Long, C. G. Willcocks, Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models, arXiv preprint arXiv:2103.04922 (2021).

[22] L. Ruthotto, E. Haber, An introduction to deep generative modeling, GAMM-Mitteilungen (2021) e202100008.

[23] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).

[24] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, arXiv preprint arXiv:1406.2661 (2014).

[25] J. A. G. Ordiano, L. Groll, R. Mikut, V. Hagenmeyer, Probabilistic energy forecasting using the nearest neighbors quantile filter and quantile regression, International Journal of Forecasting 36 (2020) 310–323.

[26] M. Sun, C. Feng, J. Zhang, Probabilistic solar power forecasting based on weather scenario generation, Applied Energy 266 (2020) 114823.

[27] H. Zhanga, W. Hua, R. Yub, M. Tangb, L. Dingc, Optimized operation of cascade reservoirs considering complementary characteristics between wind and photovoltaic based on variational auto-encoder, in: MATEC Web of Conferences, volume 246, EDP Sciences, 2018, p. 01077.

[28] A. Dairi, F. Harrou, Y. Sun, S. Khadraoui, Short-term forecasting of photovoltaic solar power production using variational auto-encoder driven deep learning approach, Applied Sciences 10 (2020) 8400.

[29] Y. Chen, Y. Wang, D. Kirschen, B. Zhang, Model-free renewable scenario generation using generative adversarial networks, IEEE Transactions on Power Systems 33 (2018) 3265–3275.

[30] Y. Chen, P. Li, B. Zhang, Bayesian renewables scenario generation via deep generative networks, in: 2018 52nd Annual Conference on Information Sciences and Systems (CISS), IEEE, 2018, pp. 1–6.

[31] R. Yuan, B. Wang, Z. Mao, J. Watada, Multi-objective wind power scenario forecasting based on PG-GAN, Energy (2021) 120379.

[32] Z. Chen, C. Jiang, Building occupancy modeling using generative adversarial network, Energy and Buildings 174 (2018) 372–379.

[33] J. Lan, Q. Guo, H. Sun, Demand side data generating based on conditional generative adversarial networks, Energy Procedia 152 (2018) 1188–1193.

[34] Y. Zhang, Q. Ai, F. Xiao, R. Hao, T. Lu, Typical wind power scenario generation for multiple wind farms using conditional improved Wasserstein generative adversarial network, International Journal of Electrical Power & Energy Systems 114 (2020) 105388.

[35] Y. Wang, G. Hug, Z. Liu, N. Zhang, Modeling load forecast uncertainty using generative adversarial networks, Electric Power Systems Research 189 (2020) 106732.

[36] C. Jiang, Y. Mao, Y. Chai, M. Yu, Day-ahead renewable scenario forecasts based on generative adversarial networks, International Journal of Energy Research 45 (2021) 7572–7587.

[37] Y. Qi, W. Hu, Y. Dong, Y. Fan, L. Dong, M. Xiao, Optimal configuration of concentrating solar power in multienergy power systems with an improved variational autoencoder, Applied Energy 274 (2020) 115124.

[38] L. Ge, W. Liao, S. Wang, B. Bak-Jensen, J. R. Pillai, Modeling daily load profiles of distribution network for scenario generation using flow-based generative network, IEEE Access 8 (2020) 77587–77597.

[39] D. Rezende, S. Mohamed, Variational inference with normalizing flows, in: International Conference on Machine Learning, PMLR, 2015, pp. 1530–1538.

[40] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al., Parallel WaveNet: Fast high-fidelity speech synthesis, in: International Conference on Machine Learning, PMLR, 2018, pp. 3918–3926.

[41] S. R. Green, J. Gair, Complete parameter inference for GW150914 using deep learning, Machine Learning: Science and Technology 2 (2021) 03LT01.

[42] M. S. Albergo, D. Boyda, D. C. Hackett, G. Kanwar, K. Cranmer, S. Racaniere, D. J. Rezende, P. E. Shanahan, Introduction to normalizing flows for lattice field theory, arXiv preprint arXiv:2101.08176 (2021).

[43] J. Dumas, C. Cointe, A. Wehenkel, A. Sutera, X. Fettweis, B. Cornelusse, A probabilistic forecast-driven strategy for a risk-aware participation in the capacity firming market, 2021. Manuscript submitted for publication to IEEE Transactions on Sustainable Energy.

[44] C.-W. Huang, D. Krueger, A. Lacoste, A. Courville, Neural autoregressive flows, in: International Conference on Machine Learning, PMLR, 2018, pp. 2078–2087.

[45] T. Hong, P. Pinson, S. Fan, H. Zareipour, A. Troccoli, R. J. Hyndman, Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond, 2016.

[46] J. A. A. Silva, J. C. Lopez, N. B. Arias, M. J. Rider, L. C. da Silva, An optimal stochastic energy management system for resilient microgrids, Applied Energy 300 (2021) 117435.

[47] I. Goodfellow, Y. Bengio, A. Courville, Y. Bengio, Deep learning, volume 1, MIT Press, Cambridge, 2016.

[48] A. Zhang, Z. C. Lipton, M. Li, A. J. Smola, Dive into Deep Learning, 2020. https://d2l.ai.

[49] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, M. Welling, Improving variational inference with inverse autoregressive flow, arXiv preprint arXiv:1606.04934 (2016).

[50] A. Wehenkel, G. Louppe, Unconstrained monotonic neural networks, in: Advances in Neural Information Processing Systems, 2019, pp. 1545–1555.

[51] G. Papamakarios, T. Pavlakou, I. Murray, Masked autoregressive flow for density estimation, in: Advances in Neural Information Processing Systems, 2017, pp. 2338–2347.

[52] F. Perez-Cruz, Kullback-Leibler divergence estimation of continuous distributions, in: 2008 IEEE International Symposium on Information Theory, IEEE, 2008, pp. 1666–1670.

[53] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved training of Wasserstein GANs, arXiv preprint arXiv:1704.00028 (2017).

[54] A. Wehenkel, G. Louppe, Graphical normalizing flows, arXiv preprint arXiv:2006.02548 (2020).

[55] S. Zhao, J. Song, S. Ermon, Towards deeper understanding of variational autoencoding models, arXiv preprint arXiv:1702.08658 (2017).

[56] M. Arjovsky, L. Bottou, Towards principled methods for training generative adversarial networks, arXiv preprint arXiv:1701.04862 (2017).

[57] L. Theis, A. v. d. Oord, M. Bethge, A note on the evaluation of generative models, arXiv preprint arXiv:1511.01844 (2015).

[58] A. Borji, Pros and cons of GAN evaluation measures, Computer Vision and Image Understanding 179 (2019) 41–65.

[59] Q. Xu, G. Huang, Y. Yuan, C. Guo, Y. Sun, F. Wu, K. Weinberger, An empirical study on evaluation metrics of generative adversarial networks, arXiv preprint arXiv:1806.07755 (2018).

[60] T. Gneiting, A. E. Raftery, Strictly proper scoring rules, prediction, and estimation, Journal of the American Statistical Association 102 (2007) 359–378.

[61] P. Lauret, M. David, P. Pinson, Verification of solar irradiance probabilistic forecasts, Solar Energy 194 (2019) 254–271.

[62] F. Golestaneh, H. B. Gooi, P. Pinson, Generation and evaluation of space–time trajectories of photovoltaic power, Applied Energy 176 (2016) 80–91.

[63] M. Scheuerer, T. M. Hamill, Variogram-based proper scoring rules for probabilistic forecasts of multivariate quantities, Monthly Weather Review 143 (2015) 1321–1334.

[64] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine Learning 63 (2006) 3–42.

[65] T. Fawcett, ROC graphs: Notes and practical considerations for researchers, Machine Learning 31 (2004) 1–38.

[66] F. X. Diebold, R. S. Mariano, Comparing predictive accuracy, Journal of Business & Economic Statistics 20 (2002) 134–144.

[67] F. Ziel, R. Weron, Day-ahead electricity price forecasting with high-dimensional structures: Univariate vs. multivariate modeling frameworks, Energy Economics 70 (2018) 396–420.

[68] M. Landry, T. P. Erlinger, D. Patschke, C. Varrichio, Probabilistic gradient boosting machines for GEFCom2014 wind forecasting, International Journal of Forecasting 32 (2016) 1061–1066.

[69] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, G. Varoquaux, API design for machine learning software: experiences from the scikit-learn project, in: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.

[70] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.

[71] L. Biewald, Experiment tracking with Weights and Biases, 2020. URL: https://www.wandb.com/, software available from wandb.com.

[72] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, B. Lakshminarayanan, Normalizing flows for probabilistic modeling and inference, arXiv preprint arXiv:1912.02762 (2019).

[73] I. Kobyzev, S. Prince, M. Brubaker, Normalizing flows: An introduction and review of current methods, IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).

[74] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 214–223.

[75] C. Villani, Optimal transport: old and new, volume 338, Springer Science & Business Media, 2008.

[76] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

[77] M. Zamo, P. Naveau, Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts, Mathematical Geosciences 50 (2018) 209–234.

[78] T. Gneiting, L. I. Stanberry, E. P. Grimit, L. Held, N. A. Johnson, Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds, Test 17 (2008) 211–235.

[79] E. L. Lehmann, J. P. Romano, Testing statistical hypotheses, Springer Science & Business Media, 2006.

Appendix A. Additional arguments

Appendix A.1. Table 1 justifications

Wang et al. [35] use a Wasserstein GAN with gradient penalty to model both the uncertainties and the variations of the load. First, point forecasting is conducted, and the corresponding residuals are derived. Then, the GAN generates residual scenarios conditional on the day type, temperatures, and historical loads. The GAN model is compared with the same version without gradient penalty and two quantile regression models: random forest and gradient boosting regression tree. The quality evaluation is conducted on open load datasets from the Independent System Operator-New England3 with five metrics: (1) the continuous ranked probability score; (2) the quantile score; (3) the Winkler score; (4) reliability diagrams; (5) Q-Q plots. Note: the forecast value is not assessed.

3https://www.iso-ne.com/


Qi et al. [37] propose a concentrating solar power (CSP) configuration method to determine the CSP capacity in multi-energy power systems. The configuration model considers the uncertainty by scenario analysis. A β-VAE, an improved version of the original VAE, generates the scenarios. However, it does not consider weather forecasts, and the model is trained only by using historical observations. The quality evaluation is conducted on two wind farms and six PV plants using three metrics: (1) the leave-one-out accuracy of the 1-nearest neighbor classifier; (2) the comparison of the frequency distributions of the actual data and the generated scenarios; (3) the comparison of the spatial and temporal correlations of the actual data and the scenarios by computing Pearson correlation coefficients. The value is assessed by considering the case study of the CSP configuration model, where the β-VAE is used to generate PV, wind power, and load scenarios. However, the VAE is not compared to another generative model for either the quality or the value evaluation. Note: the dataset does not seem to be open-access. Finally, the value evaluation case study is not trivial due to the mathematical formulation that requires a certain level of knowledge of the system. Thus, the replicability criterion is only partially satisfied.

Ge et al. [38] compared NFs to VAEs and GANs for the generation of daily load profiles. The models do not take into account weather forecasts but only historical observations. However, an example is given to illustrate the principle of generating conditional daily load profiles by using three groups: light load, medium load, and heavy load. The quality evaluation uses five indicators. Four assess the temporal correlation: (1) probability density function; (2) autocorrelation function; (3) load duration curve; (4) a wave rate defined to evaluate the volatility of the daily load profile. One additional indicator addresses the spatial correlation: (5) the Pearson correlation coefficient is used to measure the spatial correlation among multiple daily load profiles. The simulations use the open-access London smart meter and Spanish transmission service operator datasets of Kaggle. The forecast value is not assessed.

Appendix A.2. Table 5 justifications

The VAE is the fastest to train, with a recorded computation time of 7 seconds on average per dataset. The training time of the GAN is approximately three times longer, with an average computation time of 20 seconds per dataset. Finally, the NF is the slowest, with an average training time of 4 minutes. This ranking is preserved for the sample speed, with the VAE the fastest, followed by the GAN and NF models. The VAE and the GAN generate the samples over the testing sets, 5 000 in total, in less than a second. However, the NF considered takes a few minutes. In contrast, the affine autoregressive version of the NF is much faster to train and to generate samples. Note: even a training time of a few hours is compatible with day-ahead planning applications. In addition, once the model is trained, it is not necessarily required to retrain it every day.

The quality and value assessments have already been discussed in Section 4. Overall, the NF outperforms both the VAE and GAN models.

Concerning the hyper-parameters search and sensibility, the NF tends to be the most accessible model to calibrate. Compared with the VAE and GAN, we found relevant hyper-parameter values by testing only a few combinations. In addition, the NF is robust to hyper-parameter modifications. In contrast, the GAN is the most sensitive. Variations of the hyper-parameters may result in very poor scenarios, both in terms of quality and shape. Even for a fixed set of hyper-parameter values, two separate trainings may not converge towards the same results, illustrating the GAN training instabilities. The VAE is more straightforward to train than the GAN but is also sensitive to the hyper-parameter values, although less so than the GAN.

Finally, we discuss the implementation-friendly criterion of the models. Note: this discussion is only valid for the models implemented in this study. There exist various architectures of GANs, VAEs, and NFs with simple and complex versions. In our opinion, the VAE is the most effortless model to implement, as the encoder and decoder are both simple feed-forward neural networks. The only difficulty lies in the reparameterization trick that should be carefully addressed. The GAN is a bit more difficult to deploy due to the gradient penalty to handle, but it is similar to the VAE, with both the discriminator and the generator being feed-forward neural networks. The NF is the most challenging model to implement from scratch because the UMNN-MAF approach requires an additional integrand network. An affine autoregressive NF is easier to implement. However, it may be less capable of modeling the stochasticity of the variable of interest. Nevertheless, forecasting practitioners do not necessarily have to implement generative models from scratch and can use numerous existing Python libraries.


Appendix B. Background

Appendix B.1. Normalizing flows

Normalizing flow computation

Evaluating the likelihood of a distribution modeled by a normalizing flow requires computing (2), i.e., the normalizing direction, as well as its log-determinant. Increasing the number of sub-flows of the transformation to K results in only O(K) growth in the computational complexity, as the log-determinant of J_{f_θ} can be expressed as

\log |\det J_{f_θ}(x)| = \log \left| \prod_{k=1}^{K} \det J_{f_{k,θ}}(x) \right|,   (B.1a)
                       = \sum_{k=1}^{K} \log |\det J_{f_{k,θ}}(x)|.   (B.1b)

However, with no further assumption on f_θ, the computational complexity of the log-determinant is O(K · T^3), which can be intractable for large T. Therefore, the efficiency of these operations is essential during training, where the likelihood is repeatedly computed. There are many possible implementations of NFs, detailed by Papamakarios et al. [72] and Kobyzev et al. [73], to address this issue.

Autoregressive flow

The Jacobian of the autoregressive transformation f_θ defined by (3) is lower triangular, and its log-absolute-determinant is

\log |\det J_{f_θ}(x)| = \log \prod_{i=1}^{T} \left| \frac{\partial f^i}{\partial x_i}(x_i; h^i) \right|,   (B.2a)
                       = \sum_{i=1}^{T} \log \left| \frac{\partial f^i}{\partial x_i}(x_i; h^i) \right|,   (B.2b)

which is calculated in O(T) instead of O(T^3).

Affine autoregressive flow

A simple choice of transformer is the class of affine functions

f^i(x_i; h^i) = α_i x_i + β_i,   (B.3)

where f^i(·; h^i): R → R is parameterized by h^i = {α_i, β_i}, α_i controls the scale, and β_i controls the location of the transformation. Invertibility is guaranteed if α_i ≠ 0, and this can be easily achieved by, e.g., taking α_i = \exp(\tilde{α}_i), where \tilde{α}_i is an unconstrained parameter, in which case h^i = {\tilde{α}_i, β_i}. The derivative of the transformer with respect to x_i is equal to α_i. Hence, the log-absolute-determinant of the Jacobian becomes

\log |\det J_{f_θ}(x)| = \sum_{i=1}^{T} \log |α_i| = \sum_{i=1}^{T} \tilde{α}_i.   (B.4)

Affine autoregressive flows are simple and computationally efficient. However, they are limited in expressiveness, requiring many stacked flows to represent complex distributions. It is unknown whether affine autoregressive flows with multiple layers are universal approximators [72].

Appendix B.2. Variational autoencoders

Gradients computation

By using (6), L_{θ,φ} is decomposed in two parts

L_{θ,φ}(x, c) = E_{q_φ(z|x,c)}[\log p_θ(x|z, c)] − KL[q_φ(z|x, c) || p(z)].   (B.5)

∇_θ L_{θ,φ} is estimated with the usual Monte Carlo gradient estimator. However, the estimation of ∇_φ L_{θ,φ} requires the reparameterization trick proposed by Kingma and Welling [23], where the random variable z is re-expressed as a deterministic variable

z = g_φ(ε, x),   (B.6)

with ε an auxiliary variable with independent marginal p_ε, and g_φ(·) some vector-valued function parameterized by φ. Then, the first right-hand side term of (B.5) becomes

E_{q_φ(z|x,c)}[\log p_θ(x|z, c)] = E_{p(ε)}[\log p_θ(x|g_φ(ε, x), c)].   (B.7)

∇_φ L_{θ,φ} is now estimated with Monte Carlo integration.

Conditional variational autoencoders implemented

Following Kingma and Welling [23], we implemented Gaussian multi-layered perceptrons (MLPs) for both the encoder NN_φ and decoder NN_θ. In this case, p(z) is a centered isotropic multivariate Gaussian, and p_θ(x|z, c) and q_φ(z|x, c) are both multivariate Gaussians with a diagonal covariance and parameters {μ_θ, σ_θ} and {μ_φ, σ_φ}, respectively. Note: there is no restriction on the encoder and decoder architectures, and they could as well be arbitrarily complex convolutional networks. Under these assumptions, the conditional VAE implemented is

p(z) = N(z; 0, I),   (B.8a)
p_θ(x|z, c) = N(x; μ_θ, σ_θ^2 I),   (B.8b)
q_φ(z|x, c) = N(z; μ_φ, σ_φ^2 I),   (B.8c)
μ_θ, \log σ_θ^2 = NN_θ(z, c),   (B.8d)
μ_φ, \log σ_φ^2 = NN_φ(x, c).   (B.8e)


Then, by using the valid reparameterization trick proposed by Kingma and Welling [23],

ε ∼ N(0, I),   (B.9a)
z := μ_φ + σ_φ ⊙ ε,   (B.9b)

L_{θ,φ} is computed and differentiated without estimation using the expressions

KL[q_φ(z|x, c) || p(z)] = −\frac{1}{2} \sum_{j=1}^{d} (1 + \log σ_{φ,j}^2 − μ_{φ,j}^2 − σ_{φ,j}^2),   (B.10a)
E_{p(ε)}[\log p_θ(x|z, c)] ≈ −\frac{1}{2} \left\| \frac{x − μ_θ}{σ_θ} \right\|^2,   (B.10b)

with d the dimensionality of z.
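A minimal PyTorch sketch of this conditional Gaussian VAE loss, using the reparameterization trick (B.9) and the closed-form terms (B.10), is given below; the network sizes and names are illustrative, not the exact architecture of Table B.6.

# Conditional Gaussian VAE loss sketch: encoder/decoder are small MLPs that output
# the mean and log-variance of q_phi(z|x,c) and p_theta(x|z,c).
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim, c_dim, z_dim=20, hidden=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))   # -> (mu_phi, log sigma_phi^2)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * x_dim))   # -> (mu_theta, log sigma_theta^2)

    def loss(self, x, c):
        mu_q, logvar_q = self.enc(torch.cat([x, c], -1)).chunk(2, -1)
        eps = torch.randn_like(mu_q)
        z = mu_q + torch.exp(0.5 * logvar_q) * eps                           # reparameterization (B.9)
        mu_p, logvar_p = self.dec(torch.cat([z, c], -1)).chunk(2, -1)
        kl = -0.5 * (1 + logvar_q - mu_q ** 2 - logvar_q.exp()).sum(-1)      # (B.10a)
        rec = 0.5 * (((x - mu_p) / torch.exp(0.5 * logvar_p)) ** 2).sum(-1)  # negative of (B.10b)
        return (rec + kl).mean()                                             # negative ELBO to minimize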

Appendix B.3. Generative adversarial networks

Original generative adversarial network

The original GAN value function V(φ, θ) proposed by Goodfellow et al. [24] is

V(φ, θ) = E_x[\log d_φ(x|c)] + E_{\hat{x}}[\log(1 − d_φ(\hat{x}|c))] := −L_d,   (B.11a)
L_g := −E_{\hat{x}}[\log(1 − d_φ(\hat{x}|c))],   (B.11b)

where \hat{x} denotes a generated sample, L_d is the cross-entropy, and L_g the probability that the discriminator wrongly classifies the generated samples.

Wasserstein generative adversarial network

The divergences that GANs typically minimize are responsible for their training instabilities, for reasons investigated theoretically by Arjovsky and Bottou [56]. Arjovsky et al. [74] proposed instead using the Earth mover distance, also known as the Wasserstein-1 distance

W_1(p, q) = \inf_{γ ∈ Π(p,q)} E_{(x,y)∼γ}[\|x − y\|],   (B.12)

where Π(p, q) denotes the set of all joint distributions γ(x, y) whose marginals are respectively p and q, γ(x, y) indicates how much mass must be transported from x to y in order to transform the distribution p into q, \|·\| is the L1 norm, and \|x − y\| represents the cost of moving a unit of mass from x to y. However, the infimum in (B.12) is intractable. Therefore, Arjovsky et al. [74] used the Kantorovich-Rubinstein duality [75] to propose the Wasserstein GAN (WGAN) by solving the min-max problem

θ^⋆ = \arg\min_θ \max_{φ ∈ W} E_x[d_φ(x|c)] − E_{\hat{x}}[d_φ(\hat{x}|c)],   (B.13)

where W = {φ : \|d_φ(·)\|_L ≤ 1} is the 1-Lipschitz space, and the classifier d_φ(·): R^T × R^{|c|} → [0, 1] is replaced by a critic function d_φ(·): R^T × R^{|c|} → R. However, the weight clipping used to enforce the 1-Lipschitzness of d_φ can sometimes lead the WGAN to generate only poor samples or to fail to converge [53]. Therefore, we implemented the WGAN-GP to tackle this issue.
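For completeness, the following PyTorch sketch shows the gradient penalty term of the WGAN-GP [53] as it is commonly implemented; the function signature and the critic interface are assumptions for illustration.

# WGAN-GP gradient penalty: push the critic's gradient norm towards 1 on random
# interpolates between real and generated samples.
import torch

def gradient_penalty(critic, x_real, x_fake, c, lam=10.0):
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = critic(x_hat, c)
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()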

Appendix B.4. Hyper-parameters

Table B.6 provides the hyper-parameters of the NF, VAE, and GAN implemented. The Adam optimizer [76] is used to train the generative models with a batch size of 10 % of the learning set. The NF implemented

                        Wind        PV          Load
(a) NF
  Embedding Net         4 × 300     4 × 300     4 × 300
  Embedding size        40          40          40
  Integrand Net         3 × 40      3 × 40      3 × 40
  Weight decay          5·10^-4     5·10^-4     5·10^-4
  Learning rate         10^-4       5·10^-4     10^-4
(b) VAE
  Latent dimension      20          40          5
  E/D Net               1 × 200     2 × 200     1 × 500
  Weight decay          10^-3.4     10^-3.5     10^-4
  Learning rate         10^-3.4     10^-3.3     10^-3.9
(c) GAN
  Latent dimension      64          64          256
  G/D Net               2 × 256     3 × 256     2 × 1024
  Weight decay          10^-4       10^-4       10^-4
  Learning rate         2·10^-4     2·10^-4     2·10^-4

Table B.6: (a) NF, (b) VAE, and (c) GAN hyper-parameters.
The hyper-parameter selection is performed on the validation set using the Python library Weights & Biases [71]. This library is an experiment tracking tool for machine learning, making it easier to track experiments. The GAN model was the most time-consuming during this process, followed by the VAE and NF. Indeed, the GAN is highly sensitive to hyper-parameter modifications, making it challenging to identify a relevant set of values. In contrast, the NF achieved satisfactory results, both in terms of scenario shapes and quality, by testing only a few sets of hyper-parameter values.

is a one-step monotonic normalizer using the UMNN-MAF4. The embedding size |h^i| is set to 40, and the embedding neural network is composed of l layers of n neurons (l × n). The same integrand neural network τ_i(·), ∀i = 1, . . . , T, is used and is composed of 3 layers of |h^i| neurons (3 × 40). Both the encoder and decoder

4https://github.com/AWehenkel/Normalizing-Flows


Figure B.12: Variational autoencoder structure implemented for the wind dataset.
Both the encoder and decoder are feed-forward neural networks composed of one hidden layer with 200 neurons. Increasing the number of layers did not improve the results for this dataset. The latent space dimension is 20.

of the VAE are feed-forward neural networks (l × n), with ReLU activation functions for the hidden layers and no activation function for the output layer. Both the generator and discriminator of the GAN are feed-forward neural networks (l × n). The activation functions of the hidden layers of the generator (discriminator) are ReLU (Leaky ReLU). The activation function of the discriminator output layer is ReLU, and there is no activation function for the generator output layer. The generator is trained once after the discriminator is trained five times to stabilize the training process, and the gradient penalty coefficient λ in (7) is set to 10, as suggested by Gulrajani et al. [53].

Figures B.12, B.13, and B.14 illustrate the VAE, GAN, and NF structures implemented for the wind dataset, where the number of weather variables selected and the number of zones are 10 and 10, respectively. Recall: c := weather forecasts, \hat{x} := scenarios, x := wind power observations, z := latent space variable, ε := Normal variable (only for the VAE).

Appendix C. Quality metrics

Recall that the PV generation, wind power, and load are assumed to be multivariate random variables of dimension T, x ∈ R^T, with T the number of periods per day. Let ∪_{d∈TS} \{x_d^i\}_{i=1}^M be the set of #TS × M scenarios generated, with M scenarios per day of the testing set,

Figure B.13: Generative adversarial network structure implemented for the wind dataset.
Both the discriminator and generator are feed-forward neural networks composed of two hidden layers with 256 neurons. The latent space dimension is 64.

where x_d^i ∈ R^T ∀i, d, and x_{d,k}^i is the component k of scenario i on day d of the testing set, as specified by (1).

Appendix C.1. Continuous ranked probability score

Gneiting and Raftery [60] propose a formulation called the energy form of the CRPS, since it is just the one-dimensional case of the energy score, defined in negative orientation as follows

CRPS(P, x_k) = E_P[|X − x_k|] − \frac{1}{2} E_P[|X − X'|],   (C.1)

where X and X' are independent random variables with distribution P and finite first moment, and E_P is the expectation according to the probabilistic distribution P. The CRPS is computed over the marginals of x by using the estimator of (C.1) provided by Zamo and Naveau [77]. For a given day d of the testing set, the CRPS per marginal k = 1, · · · , T is

CRPS_{d,k} = \frac{1}{M} \sum_{i=1}^{M} |x_{d,k}^i − x_{d,k}| − \frac{1}{2M^2} \sum_{i,j=1}^{M} |x_{d,k}^i − x_{d,k}^j|.   (C.2)

Then, it is averaged over the entire testing set

\overline{CRPS}_k = \frac{1}{\#TS} \sum_{d∈TS} CRPS_{d,k}.   (C.3)


In Table 3, \overline{CRPS}_k is averaged over all time periods

\overline{CRPS} = \frac{1}{T} \sum_{k=1}^{T} \overline{CRPS}_k.   (C.4)
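A minimal NumPy sketch of the estimator (C.2) for one day is given below; array names and shapes are illustrative.

# CRPS per marginal from M scenarios of one day.
import numpy as np

def crps_per_marginal(scenarios, obs):
    """scenarios: (M, T) array for one day; obs: (T,) observation. Returns CRPS_{d,k}, shape (T,)."""
    M = scenarios.shape[0]
    term1 = np.abs(scenarios - obs).mean(axis=0)                                   # (1/M) sum_i |x_i - x|
    term2 = np.abs(scenarios[:, None, :] - scenarios[None, :, :]).sum((0, 1)) / (2 * M ** 2)
    return term1 - term2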

Appendix C.2. Energy score

Gneiting and Raftery [60] introduced a generalization of the continuous ranked probability score, defined in negative orientation as follows

ES(P, x) = E_P\|X − x\| − \frac{1}{2} E_P\|X − X'\|,   (C.5)

where X and X' are independent random variables with distribution P and finite first moment, E_P is the expectation according to the probabilistic distribution P, and \| · \| is the Euclidean norm. For a given day d of the testing set, the ES is computed following Gneiting et al. [78]

ES_d = \frac{1}{M} \sum_{i=1}^{M} \|x_d^i − x_d\| − \frac{1}{2M^2} \sum_{i,j=1}^{M} \|x_d^i − x_d^j\|.   (C.6)

Then, it is averaged over the testing set

\overline{ES} = \frac{1}{\#TS} \sum_{d∈TS} ES_d.   (C.7)

Note: when we consider the marginals of x, it is easy to recognize that (C.6) is the CRPS.
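A minimal NumPy sketch of the estimator (C.6) for one day is given below; names and shapes are illustrative.

# Energy score from M scenarios of one day.
import numpy as np

def energy_score(scenarios, obs):
    """scenarios: (M, T); obs: (T,). Returns ES_d for one day."""
    M = scenarios.shape[0]
    term1 = np.linalg.norm(scenarios - obs, axis=1).mean()
    diffs = scenarios[:, None, :] - scenarios[None, :, :]
    term2 = np.linalg.norm(diffs, axis=2).sum() / (2 * M ** 2)
    return term1 - term2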

Appendix C.3. Variogram score

For a given day d of the testing set and a T-variate observation x_d ∈ R^T, the variogram score of order γ is formally defined as

VS_d = \sum_{k,k'}^{T} w_{kk'} \left( |x_{d,k} − x_{d,k'}|^γ − E_P|x_{d,k} − x_{d,k'}|^γ \right)^2,   (C.8)

where x_{d,k} and x_{d,k'} are the k-th and k'-th components of the random vector x_d distributed according to P, for which the γ-th absolute moment exists, and w_{kk'} are non-negative weights. Given a set of M scenarios \{x_d^i\}_{i=1}^M for this given day d, the forecast variogram E_P|x_{d,k} − x_{d,k'}|^γ can be approximated ∀k, k' = 1, · · · , T by

E_P|x_{d,k} − x_{d,k'}|^γ ≈ \frac{1}{M} \sum_{i=1}^{M} |x_{d,k}^i − x_{d,k'}^i|^γ.   (C.9)

Then, it is averaged over the testing set

\overline{VS} = \frac{1}{\#TS} \sum_{d∈TS} VS_d.   (C.10)

In this study, we evaluate the variogram score with equal weights across all hours of the day, w_{kk'} = 1, and a γ of 0.5, which in most cases provides a good discriminating ability, as reported in Scheuerer and Hamill [63].
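A minimal NumPy sketch of (C.8)-(C.9) with equal weights is given below; names and shapes are illustrative.

# Variogram score from M scenarios of one day, with w_{kk'} = 1 and gamma = 0.5.
import numpy as np

def variogram_score(scenarios, obs, gamma=0.5):
    """scenarios: (M, T); obs: (T,). Returns VS_d for one day."""
    obs_vario = np.abs(obs[:, None] - obs[None, :]) ** gamma                          # |x_k - x_k'|^gamma
    fcst_vario = (np.abs(scenarios[:, :, None] - scenarios[:, None, :]) ** gamma).mean(axis=0)
    return ((obs_vario - fcst_vario) ** 2).sum()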

Appendix C.4. Quantile score

For a given day d of the testing set, a set of 99 quantiles (1, 2, ..., 99-th quantile) \{x_d^q\}_{q=1}^{99} is computed from the set of M scenarios \{x_d^i\}_{i=1}^M, with q the quantile index (q = 0.01, . . . , 0.99). For a given day d of the testing set, the quantile score, per marginal, is defined by

ρ_q(x_{d,k}^q, x_{d,k}) = (1 − q) (x_{d,k}^q − x_{d,k})   if x_{d,k} < x_{d,k}^q,
                          q (x_{d,k} − x_{d,k}^q)          if x_{d,k} ≥ x_{d,k}^q.   (C.11)

Then, it is averaged over all time periods and the testing set

\overline{QS}_q = \frac{1}{\#TS} \sum_{d∈TS} \frac{1}{T} \sum_{k=1}^{T} ρ_q(x_{d,k}^q, x_{d,k}).   (C.12)

In Table 3, \overline{QS}_q is averaged over all quantiles

\overline{QS} = \frac{1}{99} \sum_{q=1}^{99} \overline{QS}_q.   (C.13)

Appendix C.5. Classifier-based metric

Modern binary classifiers can easily be turned into robust two-sample tests, where the goal is to assess whether two samples are drawn from the same distribution [79]. In other words, it aims at assessing whether a generated scenario can be distinguished from an observation. To this end, the generator is evaluated on a held-out testing set that is split into testing-train and testing-test subsets. The testing-train set is used to train a classifier distinguishing generated scenarios from the actual distribution. Then, the final score is computed as the performance of this classifier on the testing-test set.

In principle, any binary classifier can be adopted for computing classifier two-sample tests (C2ST). A variation of this evaluation methodology is proposed by Xu et al. [59] and is known as the 1-Nearest Neighbor (NN) classifier. The advantage of using 1-NN over other classifiers is that it requires no special training and little hyper-parameter tuning. This process is conducted as follows. Given two sets of observations S_r and generated samples S_g with the same size, i.e., |S_r| = |S_g|, it is possible to compute the leave-one-out (LOO) accuracy of a 1-NN classifier trained on S_r and S_g with positive labels for S_r and negative labels for S_g. The LOO accuracy can vary from 0 % to 100 %. The 1-NN classifier should yield a 50 % LOO accuracy when |S_r| = |S_g| is large. It is achieved when the two distributions match. Indeed, the level of 50 % happens when a label is randomly assigned to a generated scenario, meaning the classifier is not capable of discriminating generated scenarios from observations. If the generative model overfits S_g to S_r, i.e., S_g = S_r, the accuracy would be 0 %. On the contrary, if it generates widely different samples from the observations, the performance should be 100 %. Therefore, the closer the LOO accuracy is to 1, the higher the degree of under-fitting of the model; the closer the LOO accuracy is to 0, the higher the degree of over-fitting of the model. The C2ST approach using LOO with 1-NN is adopted by Qi et al. [37] to assess the PV and wind power scenarios of a β-VAE.

However, this approach has several limitations. First, it uses the testing set to train the classifier during the LOO. Second, the 1-NN is very sensitive to outliers, as it simply chooses the closest neighbor based on a distance criterion. This behavior is amplified when combined with the LOO, where the testing-test set is composed of only one sample. Third, the Euclidean distance cannot deal with a context such as weather forecasts. Therefore, we cannot use a conditional version of the 1-NN using weather forecasts to classify weather-based renewable generation and the observations. Fourth, C2ST with LOO cannot provide a ROC curve but only accuracy scores. An essential point about ROC graphs is that they measure the ability of a classifier to produce good relative instance scores. In our case, we are interested in discriminating the generated scenarios from the observations, and the ROC provides more information than the accuracy metric to achieve this goal. A standard method to reduce ROC performance to a single scalar value representing expected performance is to calculate the area under the ROC curve, abbreviated AUC. The AUC has an essential statistical property: it is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [65].

To deal with these issues, we decided to modify this classifier-based evaluation by conducting the C2ST as follows: (1) the scenarios generated on the learning set are used to train the classifier using the C2ST; therefore, the classifier uses the entire testing set and ROC curves can be computed; (2) the classifier is an Extra-Trees classifier that can deal with context such as weather forecasts. More formally, for a given generative model g, the following steps are conducted:

1. Initialization step: the generative model g has been trained on the LS and has generated M weather-based scenarios per day of both the LS and TS: \{x_{LS}^i\}_{i=1}^M := ∪_{d∈LS} \{x_d^i\}_{i=1}^M and \{x_{TS}^i\}_{i=1}^M := ∪_{d∈TS} \{x_d^i\}_{i=1}^M. For the sake of clarity the index g is omitted, but both of these sets are dependent on model g.

2. M pairs of learning and testing sets are built with an equal proportion of generated scenarios and observations: D_{LS}^i := \{(x_{LS}^i, 0)\} ∪ \{(x_{LS}, 1)\} and D_{TS}^i := \{(x_{TS}^i, 0)\} ∪ \{(x_{TS}, 1)\}, where x_{LS} and x_{TS} denote the observations of the LS and TS. Note: |D_{LS}^i| = 2|LS| and |D_{TS}^i| = 2|TS|.

3. For each pair of learning and testing sets \{D_{LS}^i, D_{TS}^i\}_{i=1}^M, a classifier d_g^i is trained and makes predictions.

4. The ROC_g^i curves and corresponding AUC_g^i are computed for i = 1, · · · , M.

This classifier-based methodology is conducted for all models $g$, and the results are compared. Figure C.15 depicts the overall approach. The classifiers $d^i_g$ are all Extra-Trees classifiers composed of 1 000 unconstrained trees, i.e., with the hyper-parameter "max_depth" set to "None" and "n_estimators" to 1 000.
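As an illustration (a minimal sketch under assumed data layouts, not the authors' implementation), suppose the $i$-th scenario of each day, the corresponding observations, and the weather forecasts are NumPy arrays of shapes (number of days, 24) and (number of days, number of weather features). The $i$-th classifier and its ROC/AUC could then be computed with scikit-learn:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, roc_auc_score


def c2st_extra_trees(x_gen_ls, x_obs_ls, x_gen_ts, x_obs_ts, w_ls, w_ts):
    """Train an Extra-Trees classifier to separate generated scenarios (label 0)
    from observations (label 1), conditioned on weather forecasts, and return
    the ROC curve and AUC evaluated on the testing set."""
    # Classifier learning set: scenarios and observations of the LS, with weather context.
    X_train = np.vstack([np.hstack([x_gen_ls, w_ls]), np.hstack([x_obs_ls, w_ls])])
    y_train = np.concatenate([np.zeros(len(x_gen_ls)), np.ones(len(x_obs_ls))])
    # Classifier testing set: scenarios and observations of the TS, with weather context.
    X_test = np.vstack([np.hstack([x_gen_ts, w_ts]), np.hstack([x_obs_ts, w_ts])])
    y_test = np.concatenate([np.zeros(len(x_gen_ts)), np.ones(len(x_obs_ts))])

    clf = ExtraTreesClassifier(n_estimators=1000, max_depth=None, n_jobs=-1)
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]          # probability of being an observation
    fpr, tpr, _ = roc_curve(y_test, scores)
    return fpr, tpr, roc_auc_score(y_test, scores)
```

Repeating this for each scenario index $i = 1, \ldots, M$ and each generative model $g$ yields the $M$ ROC curves and AUC values that are compared across models.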

Appendix C.6. Diebold-Mariano test

For a given day $d$ of the testing set, let $\varepsilon_d \in \mathbb{R}$ be the error computed by an arbitrary forecast loss function of the observations and scenarios. The test consists of computing the difference between the errors of the pair of models $g$ and $h$ over the testing set
$$\Delta(g, h)_d = \varepsilon^g_d - \varepsilon^h_d, \quad \forall d \in \text{TS}, \tag{C.14}$$
and performing an asymptotic z-test for the null hypothesis that the expected forecast errors are equal, i.e., that the mean of the loss differential series is zero, $\mathbb{E}[\Delta(g, h)_d] = 0$, meaning there is no statistically significant difference in the accuracy of the two competing forecasts. The statistic of the test is
$$\text{DM}(g, h) = \sqrt{\#\text{TS}}\, \frac{\mu}{\sigma}, \tag{C.15}$$
with $\#\text{TS}$ the number of days of the testing set, and $\mu$ and $\sigma$ the sample mean and standard deviation of $\Delta(g, h)_d$.


Under the assumption of covariance stationarity of the loss differential series $\Delta(g, h)_d$, the DM statistic is asymptotically standard normal. The lower the p-value, i.e., the closer it is to zero, the more the observed data are inconsistent with the null hypothesis, in favor of the alternative $\mathbb{E}[\Delta(g, h)_d] < 0$, i.e., the forecasts of model $g$ are more accurate than those of model $h$. If the p-value is below the commonly accepted 5 % level, the null hypothesis is rejected, meaning that the forecasts of model $g$ are significantly more accurate than those of model $h$.
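As an illustration of (C.14)–(C.15), a minimal sketch (not the authors' code), assuming per-day error arrays errors_g and errors_h, e.g., the daily ES or VS values of two models over the testing set:

```python
import numpy as np
from scipy.stats import norm


def dm_test(errors_g: np.ndarray, errors_h: np.ndarray):
    """Diebold-Mariano statistic (C.15) and one-sided p-value for the alternative
    E[delta] < 0, i.e., model g is more accurate than model h."""
    delta = errors_g - errors_h          # loss differential series, one value per day (C.14)
    n = len(delta)                       # #TS, number of days of the testing set
    mu, sigma = delta.mean(), delta.std(ddof=1)
    dm = np.sqrt(n) * mu / sigma         # asymptotically standard normal under the null
    p_one_sided = norm.cdf(dm)           # small value -> g significantly more accurate
    # For a two-sided test, use instead: 2 * (1 - norm.cdf(abs(dm)))
    return dm, p_one_sided
```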

When considering the ES or VS scores, there is one value per day of the testing set, $\text{ES}_d$ or $\text{VS}_d$. In this case, $\varepsilon_d = \text{ES}_d$ or $\varepsilon_d = \text{VS}_d$. However, when considering the CRPS or QS, there is one value per marginal and per day of the testing set, $\text{CRPS}_{d,k}$ or $\text{QS}_{d,k}$. A first solution consists of computing 24 independent tests, one for each hour of the day, and then comparing the models based on the number of hours for which the predictions of one model are significantly better than those of another. Another approach is a multivariate variant of the DM test, where the test is performed jointly for all hours using the multivariate loss differential series. In this case, for a given day $d$, $\varepsilon^g_d = [\varepsilon^g_{d,1}, \ldots, \varepsilon^g_{d,24}]^\intercal$ and $\varepsilon^h_d = [\varepsilon^h_{d,1}, \ldots, \varepsilon^h_{d,24}]^\intercal$ are the vectors of errors for a given metric of models $g$ and $h$, respectively. Then the multivariate loss differential series
$$\Delta(g, h)_d = \|\varepsilon^g_d\|_1 - \|\varepsilon^h_d\|_1, \tag{C.16}$$
defines the differences of errors using the $\|\cdot\|_1$ norm. The p-value of the two-sided DM test is then computed for each model pair as described above. The univariate version of the test provides a deeper analysis, as it indicates which forecast is significantly better for which hour of the day. The multivariate version enables a more compact representation of the results, since it summarizes the comparison in a single p-value that can be conveniently visualized using heat maps arranged as chessboards. In this study, we adopt the multivariate DM test for the CRPS and QS.
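Under the same assumptions, the multivariate variant (C.16) only changes how the daily loss differential is built from the hourly errors; crps_g and crps_h below are hypothetical arrays of shape (number of days, 24), and dm_test refers to the sketch above:

```python
import numpy as np

# ||eps^g_d||_1 and ||eps^h_d||_1 per day d of the testing set, as in (C.16),
# then the DM statistic is computed on the resulting daily series.
l1_g = np.abs(crps_g).sum(axis=1)
l1_h = np.abs(crps_h).sum(axis=1)
dm_stat, p_value = dm_test(l1_g, l1_h)
```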

Appendix D. Value assessment

Appendix D.1. Notation

Sets and indexes
Name            Description
$t$             Time period index.
$\omega$        Scenario index.
$T$             Number of time periods per day.
$\#\Omega$      Number of scenarios.
$\mathcal{T}$   Set of time periods, $\mathcal{T} = \{1, 2, \ldots, T\}$.
$\Omega$        Set of scenarios, $\Omega = \{1, 2, \ldots, \#\Omega\}$.

Parameters
Name                                  Description
$e^{\min}_t$, $e^{\max}_t$            Minimum/maximum day-ahead bid [MWh].
$y^{\min}_t$, $y^{\max}_t$            Minimum/maximum retailer net position [MWh].
$y^{dis}_{\max}$, $y^{cha}_{\max}$    BESS maximum (dis)charging power [MW].
$\eta^{dis}$, $\eta^{cha}$            BESS (dis)charging efficiency [-].
$s^{\min}$, $s^{\max}$                BESS minimum/maximum capacity [MWh].
$s^{ini}$, $s^{end}$                  BESS initial/final state of charge [MWh].
$\pi_t$                               Day-ahead price [€/MWh].
$q_t$, $\lambda_t$                    Negative/positive imbalance price [€/MWh].
$\Delta t$                            Duration of a time period [hour].

Variables
For the sake of clarity the subscript $\omega$ is omitted.
Name               Range                          Description
$e_t$              $[e^{\min}_t, e^{\max}_t]$     Day-ahead bid [MWh].
$y_t$              $[y^{\min}_t, y^{\max}_t]$     Retailer net position [MWh].
$y^{pv}_t$         $[0, 1]$                       PV generation [MW].
$y^{w}_t$          $[0, 1]$                       Wind generation [MW].
$y^{cha}_t$        $[0, y^{cha}_{\max}]$          Charging power [MW].
$y^{dis}_t$        $[0, y^{dis}_{\max}]$          Discharging power [MW].
$s_t$              $[s^{\min}, s^{\max}]$         BESS state of charge [MWh].
$d^-_t$, $d^+_t$   $\mathbb{R}^+$                 Short/long deviation [MWh].
$y^{b}_t$          $\{0, 1\}$                     BESS binary variable [-].

Appendix D.2. Problem formulation

The mixed-integer linear programming (MILP) optimization problem to solve is
$$\max_{e_t \in \mathcal{X},\; y_{t,\omega} \in \mathcal{Y}(e_t)} \; \sum_{\omega \in \Omega} \alpha_\omega \sum_{t \in \mathcal{T}} \big[ \pi_t e_t - q_t d^-_{t,\omega} - \lambda_t d^+_{t,\omega} \big], \tag{D.1a}$$
$$\mathcal{X} = \big\{ e_t : e_t \in [e^{\min}_t, e^{\max}_t] \big\}, \tag{D.1b}$$
$$\mathcal{Y}(e_t) = \big\{ y_{t,\omega} : \text{(D.2a)--(D.2m)} \big\}. \tag{D.1c}$$

The optimization variables are $e_t$, the day-ahead bid of the net position, and, $\forall \omega \in \Omega$: $y_{t,\omega}$, the retailer net position in scenario $\omega$; $d^-_{t,\omega}$, the short deviation; $d^+_{t,\omega}$, the long deviation; $y^{pv}_{t,\omega}$, the PV generation; $y^{w}_{t,\omega}$, the wind generation; $y^{cha}_{t,\omega}$, the battery energy storage system (BESS) charging power; $y^{dis}_{t,\omega}$, the BESS discharging power; $s_{t,\omega}$, the BESS state of charge; and $y^{b}_{t,\omega}$, a binary variable that prevents simultaneous charging and discharging.


The imbalance penalty is modeled by the constraints (D.2a)–(D.2b), $\forall \omega \in \Omega$, which define the short and long deviation variables $d^-_{t,\omega}, d^+_{t,\omega} \in \mathbb{R}^+$. The energy balance is provided by (D.2c), $\forall \omega \in \Omega$. The constraints bounding the $y^{pv}_{t,\omega}$ and $y^{w}_{t,\omega}$ variables are (D.2d)–(D.2e), $\forall \omega \in \Omega$, where $\hat{y}^{pv}_{t,\omega}$ and $\hat{y}^{w}_{t,\omega}$ are the PV and wind generation scenarios. The load is assumed to be non-flexible and is a parameter (D.2f), $\forall \omega \in \Omega$, where $\hat{y}^{l}_{t,\omega}$ are the load scenarios. The BESS constraints are provided by (D.2g)–(D.2j), and the BESS dynamics by (D.2k)–(D.2m), $\forall \omega \in \Omega$.

$$-d^-_{t,\omega} \le -(e_t - y_{t,\omega}), \quad \forall t \in \mathcal{T} \tag{D.2a}$$
$$-d^+_{t,\omega} \le -(y_{t,\omega} - e_t), \quad \forall t \in \mathcal{T} \tag{D.2b}$$
$$\frac{y_{t,\omega}}{\Delta t} = y^{pv}_{t,\omega} + y^{w}_{t,\omega} - y^{l}_{t,\omega} + y^{dis}_{t,\omega} - y^{cha}_{t,\omega}, \quad \forall t \in \mathcal{T} \tag{D.2c}$$
$$y^{pv}_{t,\omega} \le \hat{y}^{pv}_{t,\omega}, \quad \forall t \in \mathcal{T} \tag{D.2d}$$
$$y^{w}_{t,\omega} \le \hat{y}^{w}_{t,\omega}, \quad \forall t \in \mathcal{T} \tag{D.2e}$$
$$y^{l}_{t,\omega} = \hat{y}^{l}_{t,\omega}, \quad \forall t \in \mathcal{T} \tag{D.2f}$$
$$y^{cha}_{t,\omega} \le y^{b}_{t,\omega}\, y^{cha}_{\max}, \quad \forall t \in \mathcal{T} \tag{D.2g}$$
$$y^{dis}_{t,\omega} \le (1 - y^{b}_{t,\omega})\, y^{dis}_{\max}, \quad \forall t \in \mathcal{T} \tag{D.2h}$$
$$-s_{t,\omega} \le -s^{\min}, \quad \forall t \in \mathcal{T} \tag{D.2i}$$
$$s_{t,\omega} \le s^{\max}, \quad \forall t \in \mathcal{T} \tag{D.2j}$$
$$\frac{s_{1,\omega} - s^{ini}}{\Delta t} = \eta^{cha} y^{cha}_{1,\omega} - \frac{y^{dis}_{1,\omega}}{\eta^{dis}}, \tag{D.2k}$$
$$\frac{s_{t,\omega} - s_{t-1,\omega}}{\Delta t} = \eta^{cha} y^{cha}_{t,\omega} - \frac{y^{dis}_{t,\omega}}{\eta^{dis}}, \quad \forall t \in \mathcal{T} \setminus \{1\} \tag{D.2l}$$
$$s_{T,\omega} = s^{end} = s^{ini}. \tag{D.2m}$$

Notice that if $\lambda_t < 0$, the surplus quantity is remunerated at a non-negative price. In practice, such a situation could be avoided provided that the energy retailer has curtailment capabilities, and $(q_t, \lambda_t)$ are strictly positive in our case study. The deterministic formulation with perfect forecasts, the oracle (O), is a specific case of the stochastic formulation obtained by considering only one scenario where $\hat{y}^{pv}_{t,\omega}$, $\hat{y}^{w}_{t,\omega}$, and $\hat{y}^{l}_{t,\omega}$ become the actual values of PV, wind, and load, $\forall t \in \mathcal{T}$. The optimization variables are $e_t$, $y_t$, $d^-_t$, $d^+_t$, $y^{pv}_t$, $y^{w}_t$, $y^{cha}_t$, $y^{dis}_t$, $s_t$, and $y^{b}_t$.
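For illustration only, a minimal sketch of the two-stage stochastic MILP (D.1)–(D.2) written with the PuLP modeling library; the scenario and parameter containers (scen, params) and their keys are assumptions mirroring the notation of Appendix D.1, not the authors' implementation:

```python
import pulp


def build_planner(scen, params):
    """Sketch of the stochastic MILP (D.1)-(D.2). scen is a list of scenario dicts with
    "pv", "wind", "load" profiles per period; params holds the bounds, prices,
    efficiencies, and BESS data of Appendix D.1 (names are illustrative)."""
    T, W = range(params["T"]), range(len(scen))
    dt, alpha = params["dt"], 1.0 / len(scen)   # equiprobable scenarios assumed
    prob = pulp.LpProblem("retailer_day_ahead", pulp.LpMaximize)

    e = {t: pulp.LpVariable(f"e_{t}", params["e_min"][t], params["e_max"][t]) for t in T}
    y = {(t, w): pulp.LpVariable(f"y_{t}_{w}", params["y_min"][t], params["y_max"][t]) for t in T for w in W}
    dn = {(t, w): pulp.LpVariable(f"dneg_{t}_{w}", lowBound=0) for t in T for w in W}
    dp = {(t, w): pulp.LpVariable(f"dpos_{t}_{w}", lowBound=0) for t in T for w in W}
    ypv = {(t, w): pulp.LpVariable(f"ypv_{t}_{w}", 0) for t in T for w in W}
    ywd = {(t, w): pulp.LpVariable(f"yw_{t}_{w}", 0) for t in T for w in W}
    ych = {(t, w): pulp.LpVariable(f"ycha_{t}_{w}", 0, params["ycha_max"]) for t in T for w in W}
    ydi = {(t, w): pulp.LpVariable(f"ydis_{t}_{w}", 0, params["ydis_max"]) for t in T for w in W}
    s = {(t, w): pulp.LpVariable(f"s_{t}_{w}", params["s_min"], params["s_max"]) for t in T for w in W}
    yb = {(t, w): pulp.LpVariable(f"yb_{t}_{w}", cat=pulp.LpBinary) for t in T for w in W}

    # Objective (D.1a): expected day-ahead revenue minus imbalance penalties.
    prob += pulp.lpSum(alpha * (params["pi"][t] * e[t]
                                - params["q"][t] * dn[t, w]
                                - params["lam"][t] * dp[t, w]) for t in T for w in W)

    for w in W:
        for t in T:
            prob += dn[t, w] >= e[t] - y[t, w]                          # (D.2a)
            prob += dp[t, w] >= y[t, w] - e[t]                          # (D.2b)
            prob += y[t, w] == dt * (ypv[t, w] + ywd[t, w] - scen[w]["load"][t]
                                     + ydi[t, w] - ych[t, w])           # (D.2c), (D.2f)
            prob += ypv[t, w] <= scen[w]["pv"][t]                       # (D.2d)
            prob += ywd[t, w] <= scen[w]["wind"][t]                     # (D.2e)
            prob += ych[t, w] <= yb[t, w] * params["ycha_max"]          # (D.2g)
            prob += ydi[t, w] <= (1 - yb[t, w]) * params["ydis_max"]    # (D.2h)
            prev = params["s_ini"] if t == 0 else s[t - 1, w]
            prob += s[t, w] - prev == dt * (params["eta_cha"] * ych[t, w]
                                            - (1.0 / params["eta_dis"]) * ydi[t, w])  # (D.2k)-(D.2l)
        prob += s[params["T"] - 1, w] == params["s_end"]                # (D.2m)
    return prob, e
```

Calling prob.solve() returns the day-ahead bids $e_t$ together with the per-scenario recourse variables; the scenario probabilities $\alpha_\omega$ are assumed equiprobable in this sketch.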

Appendix D.3. Dispatching

Once the bids $e_t$ have been computed by the planner, the dispatching consists of computing the second-stage variables given the observations of PV, wind power, and load. The dispatch formulation is a specific case of the stochastic formulation with $e_t$ as a parameter and considering only one scenario where $\hat{y}^{pv}_{t,\omega}$, $\hat{y}^{w}_{t,\omega}$, and $\hat{y}^{l}_{t,\omega}$ become the actual values of PV, wind, and load, $\forall t \in \mathcal{T}$. The optimization variables are $y_t$, $d^-_t$, $d^+_t$, $y^{pv}_t$, $y^{w}_t$, $y^{cha}_t$, $y^{dis}_t$, $s_t$, and $y^{b}_t$.
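A possible dispatch sketch, reusing the hypothetical build_planner function above with a single "scenario" containing the observations and the bids frozen at the planner solution (all names are placeholders):

```python
# Dispatch: re-solve the same model with the observed PV, wind, and load as the only
# scenario, and freeze the day-ahead bids at the planner solution e_star.
prob, e = build_planner([{"pv": pv_obs, "wind": wind_obs, "load": load_obs}], params)
for t, var in e.items():
    var.lowBound = var.upBound = e_star[t]   # e_t becomes a parameter
prob.solve()
```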


Figure B.14: Normalizing flow structure implemented for the wind dataset.
A single-step monotonic normalizing flow is implemented with a feed-forward neural network composed of four hidden layers of 300 neurons. The latent space dimension is 40. Note: for clarity, the integrand network is not depicted; it is a feed-forward neural network composed of three hidden layers of 40 neurons. Increasing the number of steps of the normalizing flow did not improve the results: the monotonic transformation is complex enough to capture the stochasticity of the variable of interest. However, when considering affine autoregressive normalizing flows, the number of steps should be higher: numerical experiments indicated that a five-step autoregressive flow was required to achieve similar results for this dataset. These results are not reported in this study for the sake of clarity.

Figure C.15: Classifier-based metric methodology.
Each generative model generates M scenarios per day of the learning and testing sets. They are used to build M pairs of learning and testing sets for a conditional classifier, each with an equal proportion of observations and generated scenarios and conditioned on the weather forecasts. M conditional classifiers per model are trained and make predictions. The M ROC curves and AUC values are computed per model, and the results are compared.


Appendix E. Quality results

Figure E.16: PV and load tracks Diebold-Mariano tests.
Panels: (a) PV CRPS DM; (b) load CRPS DM; (c) PV QS DM; (d) load QS DM; (e) PV ES DM; (f) load ES DM; (g) PV VS DM; (h) load VS DM; each panel compares the NF, VAE, GAN, and RAND models.
The Diebold-Mariano tests of the CRPS, QS, ES, and VS demonstrate that the NF outperforms both the VAE and GAN. Note: the GAN outperforms the VAE for both the ES and VS on the PV track. However, the VAE outperforms the GAN on this dataset for both the CRPS and QS.

Figure E.17: Classifier-based metric for both the PV and load tracks.
Panels: (a) ROC PV; (b) ROC load, plotting the true positive rate against the false positive rate.
The NF (blue) is the best at fooling the classifier, followed by the VAE (orange) and the GAN (green).

Figure E.18: PV scenario shape comparison and analysis.
Left part, (a) NF, (c) GAN, and (e) VAE: 50 PV scenarios (grey) of a randomly selected day of the testing set, along with the 10 % (blue), 50 % (black), and 90 % (green) quantiles, and the observations (red). Right part, (b) NF, (d) GAN, and (f) VAE: the corresponding Pearson time-correlation matrices of these scenarios, with the time periods as rows and columns. Similar to the wind power and load scenarios, the NF tends to exhibit no time correlation between scenarios. In contrast, the VAE and GAN tend to be partially time-correlated over a few time periods.


Figure E.19: Load scenario shape comparison and analysis.
Left part, (a) NF, (c) GAN, and (e) VAE: 50 load scenarios (grey) of a randomly selected day of the testing set, along with the 10 % (blue), 50 % (black), and 90 % (green) quantiles, and the observations (red). Right part, (b) NF, (d) GAN, and (f) VAE: the corresponding Pearson time-correlation matrices of these scenarios, with the time periods as rows and columns. Similar to the PV and wind power scenarios, the NF tends to exhibit no time correlation between scenarios. In contrast, the VAE and GAN tend to be highly time-correlated.

Figure E.20: Average of the correlation matrices over the testing set for the three datasets.
Panels: (a) Wind-NF; (b) PV-NF; (c) NF-load; (d) Wind-GAN; (e) PV-GAN; (f) GAN-load; (g) Wind-VAE; (h) PV-VAE; (i) VAE-load (left column: wind power; center: PV; right: load).
The trend in terms of time correlation is observed on each day of the testing set for all the datasets. The NF scenarios are not correlated. In contrast, the VAE and GAN scenarios tend to be time-correlated over a few periods. In particular, the VAE generates highly time-correlated scenarios for the load dataset.
