
High-Performance Computing Methods in Large-Scale Power System Simulation

Lukas Razik
Institute for Automation of Complex Power Systems

High-Performance Computing Methods in Large-Scale Power System Simulation

Dissertation approved by the Faculty of Electrical Engineering and Information Technology of RWTH Aachen University in fulfillment of the requirements for the academic degree of Doktor der Ingenieurwissenschaften

submitted by

Dipl.-Inform. Lukas Daniel Razik

from Hindenburg

Reviewers: Univ.-Prof. Antonello Monti, Ph. D., and Univ.-Prof. Dr.-Ing. Andrea Benigni

Date of the oral examination: 8 May 2020

This dissertation is available online on the website of the University Library.

Bibliographic information of the German National Library: The German National Library lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available online at http://dnb-nb.de.

D 82 (Diss. RWTH Aachen University, 2020)

Editor: Univ.-Prof. Dr. ir. Dr. h. c. Rik W. De Doncker, Director, E.ON Energy Research Center
Institute for Automation of Complex Power Systems (ACS), E.ON Energy Research Center, Mathieustraße 10, 52074 Aachen

E.ON Energy Research Center | 81st volume of the series ACS | Automation of Complex Power Systems

Copyright Lukas Razik. All rights reserved, including those of partial reprint, partial or complete reproduction, storage in data processing systems, and translation.

Printed in Germany
ISBN: 978-3-942789-80-6
1st edition 2020

Publisher: E.ON Energy Research Center, RWTH Aachen University, Mathieustraße 10, 52074 Aachen
Internet: www.eonerc.rwth-aachen.de
E-Mail: [email protected]

Zusammenfassung

In the European Union's Renewable Energy Directive, in effect since 2009, the member states agreed that the share of renewable energy should reach at least 20 % of energy consumption by 2020. The concomitantly growing number of renewable energy producers such as photovoltaic systems and wind power plants leads to increasingly decentralized power generation, which requires more complex power grid management.

To nevertheless ensure secure grid operation, conventional power grids are being transformed into so-called smart grids, in which, for example, status information not only of power producers but also of consumers (e. g. heat pumps and electric vehicles) is included in grid management. Exploiting flexibility on the generation and demand side and deploying energy storage systems to achieve a stable and economical power supply require new solutions for the planning and operation of smart grids. Otherwise, changes to the systems of the public energy sector (power grid, ICT infrastructure, energy market, etc.) can lead to unexpected problems and thus also to power outages. Computer simulations can therefore help to estimate the behavior of smart grids under such changes without risking the negative consequences of immature solutions or incompatibilities.

The main objective of this dissertation is the application and analysis of methods from high-performance computing (HPC) and computer science to improve (co-)simulation software for electrical power systems, so that more complex component models as well as larger system models can be simulated in reasonable time. Increasing automation and control in smart grids, ever higher demands on their flexibility, and the need for stronger market integration of consumers make power grid models more and more complex. The simulations therefore demand ever greater performance from the computer systems employed. The focus of this work is hence on improving various aspects of modern and currently developed simulation solutions. However, the intention was not to develop new simulation concepts or applications that would require high-performance computing on supercomputers or large computer clusters in the first place.

Rather, this dissertation presents the integration of modern direct solvers for sparse linear systems into various power grid simulation back ends and their subsequent analysis using large-scale power grid models. In addition, a new method for the automatic coarse-grained parallelization of power grid system models at the component level is presented. Besides such concrete applications of HPC methods to simulation environments, a comparative analysis of different HPC approaches for increasing the performance of Python-based software by means of (just-in-time) compilers is presented, since Python, usually an interpreted programming language, is becoming increasingly popular in software development for the energy sector. Furthermore, the dissertation presents the integration of an HPC network technology based on the open InfiniBand standard into a software framework that can be used to couple different simulation environments into a co-simulation and to exchange data in Hardware-in-the-Loop (HiL) setups.

For the processing of power system topologies by the simulation environments to which the above HPC methods were applied, support for a standardized data model is necessary. The dissertation therefore also deals with the Common Information Model (CIM), as standardized in IEC 61970 / 61968, which can be used to specify data models representing power system topologies. First, a holistic data model is introduced that was developed, by extending CIM, for co-simulations of the power grid together with the associated communication network and the energy market. To achieve sustainable development of CIM-related software tools, an automated generation of (de-)serializers from CIM specifications is then presented. The deserialization of CIM documents is a step used in the subsequently developed translation of CIM-based grid topologies into simulator-specific system models, which is also covered in this dissertation.

Many of the presented findings and approaches can also be used to improve other software in the field of electrical engineering and beyond. Moreover, all approaches presented in this dissertation were implemented in publicly accessible open-source software projects.


Abstract

In the Renewables Directive of the European Union, in effect since 2009, the member states agreed that the share of renewable energy should be 20 % of the total energy consumption by 2020. The concomitantly growing number of renewable energy producers such as photovoltaic systems and wind power plants leads to more decentralized power generation and, in consequence, to more complex power grid management.

To ensure secure power grid operation even so, there is a transformation from conventional power grids to so-called smart grids where, for instance, not only status information of power producers but also of consumers (e. g. heat pumps and electric vehicles) is included in the power grid management. The utilization of flexibility on the generation and demand side and the use of energy storage systems for achieving a stable and economic power supply require new solutions for the planning and operation of smart grids. Otherwise, changes to the systems in the public energy sector (i. e. power grid, information and communications technology (ICT) infrastructure, energy market, etc.) can lead to unexpected problems such as power failures. Computer simulations can therefore help to estimate the behavior of smart grids under any changes without the risk of negative consequences in case of immature solutions or incompatibilities.

The main objective of this dissertation is the application and analysis of high-performance computing (HPC) and computer science methods for improving power system (co-)simulation software, to allow simulating more detailed models in a time appropriate for the particular use case. Through more automation and control in smart grids, higher demands on flexibility, and the need for stronger market integration of consumers, power system models become more and more complex. This requires ever greater performance of the utilized computer systems. The focus was on the improvement of different aspects of state-of-the-art and currently developed simulation solutions. The intention was not to develop new simulation concepts or applications that would make large-scale HPC on supercomputers or large computer clusters necessary.

The dissertation presents the integration of modern direct solvers for sparse linear systems in various power grid simulation back ends and subsequent analyses with the aid of large-scale power grid models. Furthermore, a new method for an automatic coarse-grained parallelization of power grid system models at the component level is shown. Besides such concrete applications of HPC methods to simulation environments, a comparative analysis of various HPC approaches for performance improvement of Python-based software with the aid of (just-in-time) compilers is presented, as Python, usually an interpreted programming language, becomes more and more popular in the area of power system related software. Moreover, the dissertation shows the integration of an HPC interconnect solution based on InfiniBand, an open standard, in a software framework for the coupling of different simulation environments to a co-simulation and for Hardware-in-the-Loop (HiL) setups.

The processing of power system topologies by the simulation environments to which the aforementioned HPC methods were applied requires support of a standardized data model. Therefore, the dissertation concerns the Common Information Model (CIM), as standardized, inter alia, by IEC 61970 / 61968, which can be used for the specification of data models representing power system topologies. At first, a holistic data model is introduced that was developed, by extending CIM, for co-simulations of the power grid with the associated communication network and the energy market. To achieve a sustainable development of CIM-related software tools, an automated (de-)serializer generation from CIM specifications is presented. The deserialization from CIM is a step needed for the subsequently developed template-based translation from CIM to simulator-specific system models, which is also covered in this dissertation.

Many of the presented findings and approaches can be used for improving further software from the area of electrical engineering and beyond. Moreover, all presented approaches were implemented in open-source software projects accessible to the public.


Acknowledgement

I would like to thank the following people:

My doctoral supervisor, Prof. Antonello Monti, for the guidance and the support of my initiatives throughout my whole time as a doctoral student at the Institute for Automation of Complex Power Systems; my second reviewer, Prof. Andrea Benigni, for the kind feedback on my dissertation manuscript; and Prof. Ferdinanda Ponci for the helpful feedback and support regarding my scientific publications.

My colleagues: Jan Dinkelbach, for reading the manuscript (especially the boring parts) and for the great support on my way from a computer scientist to an engineer; Markus Mirz, for a great cooperation as well as the inclusion of my humble self in interesting additional projects and activities; Steffen Vogel, for the assistance in software-technical matters; Simon Pickartz, for the sophisticated LaTeX template; and Stefan Dähling, for proofreading the final version.

All student researchers and students who participated in the research and development related to this dissertation.

The Réseau de Transport d’Électricité co-workers Adrien Guironnet andGautier Bureau for a successful and enjoyable cooperation.

Above all, I would like to thank my parents, who gave up a great deal and whose support made it possible for me to pursue this career path in the first place.

Last but not least, thank you too, my darling, for your support and patience during my doctoral studies!

Aachen, May 2020 Lukas Daniel Razik


Contents

Acknowledgement viii

List of Publications xv

1 Introduction 1
  1.1 Challenges in Smart Grids 1
  1.2 Large-Scale Multi-Domain Co-Simulation as a Solution 3
  1.3 Contribution 6
  1.4 Outline 11

2 Multi-Domain Co-Simulation 13
  2.1 Fundamentals and Related Work 14
    2.1.1 Architecture and Topology Data Model 14
    2.1.2 Common Information Model 15
    2.1.3 Simulation of Smart Grids 16
    2.1.4 Classification of Simulations 16
  2.2 Use Case 17
  2.3 Challenges 18
  2.4 Concept of the Co-Simulation Environment 19
    2.4.1 Holistic Topology Data Model 19
    2.4.2 Model Data Processing and Simulation Setup 22
    2.4.3 Synchronization 23
    2.4.4 Co-Simulation Runtime Interaction 24
  2.5 Validation by Use Case 26
  2.6 Conclusion 27

3 Automated De-/Serializer Generation 29
  3.1 CIM Formalisms and Formats 31
  3.2 CIM++ Concept 33
  3.3 From CIM UML to Compilable C++ Code 35
    3.3.1 Gathering Generated CIM Sources 37
    3.3.2 Refactoring Generated CIM Sources 38
    3.3.3 Primitive CIM Data Types 40
  3.4 Automated CIM (De-)Serializer Generation 41
    3.4.1 The Common Base Class 41
    3.4.2 Integrating an XML Parser 42
    3.4.3 Unmarshalling 43
    3.4.4 Unmarshalling Code Generator 46
    3.4.5 Marshalling 49
  3.5 libcimpp Implementation 50
  3.6 Evaluation 50
  3.7 Conclusion and Outlook 51

4 From CIM to Simulator-Specific System Models 55
  4.1 CIMverter Fundamentals 57
    4.1.1 Modelica 57
    4.1.2 Template Engine 59
  4.2 CIMverter Concept 59
  4.3 CIMverter Implementation 62
    4.3.1 Mapping from CIM to Modelica 63
    4.3.2 CIM Object Handler 64
  4.4 Modelica Workshop Implementation 65
    4.4.1 Base Class of the Modelica Workshop 66
    4.4.2 CIM to Modelica Object Mapping 66
    4.4.3 Component Connections 67
  4.5 Evaluation 68
  4.6 Conclusion and Outlook 70

5 Modern LU Decompositions in Power Grid Simulation 75
  5.1 LU Decompositions in Power Grid Simulation 76
    5.1.1 From DAEs to LU Decompositions 76
    5.1.2 LU Decompositions for Linear System Solving 78
    5.1.3 KLU, NICSLU, GLU, and Basker by Comparison 80
  5.2 Analysis of Modern LU Decompositions for Electrical Circuits 83
    5.2.1 Analysis on Benchmark Matrices from Large-Scale Grids 84
    5.2.2 Analysis on Power Grid Simulations 92
  5.3 Conclusion and Outlook 95

6 Exploiting Parallelism in Power Grid Simulation 97
  6.1 Parallelism in Simulation Models 98
    6.1.1 Task Scheduling 100
    6.1.2 Task Parallelization in DPsim 106
    6.1.3 System Decoupling 110
  6.2 Analysis of Task Parallelization in DPsim 111
    6.2.1 Use Cases 112
    6.2.2 Schedulers 113
    6.2.3 System Decoupling 117
    6.2.4 Compiler Environments 122
  6.3 Conclusion and Outlook 124

7 HPC Python Internals and Benefits 127
  7.1 HPC Python Fundamentals 129
    7.1.1 Classical Python 130
    7.1.2 PyPy 136
    7.1.3 Numba 139
    7.1.4 Cython 143
  7.2 Benchmarking Methodology 147
  7.3 Comparative Analysis 150
  7.4 Conclusion and Outlook 155

8 HPC Network Communication for HiL and RT Co-Simulation 157
  8.1 VILLAS Fundamentals 158
  8.2 InfiniBand Fundamentals 159
    8.2.1 InfiniBand Architecture 161
    8.2.2 OpenFabrics Software Stack 165
  8.3 Concept of InfiniBand Support in VILLAS 167
    8.3.1 VILLASnode Basics 167
    8.3.2 Original Read and Write Interface 167
    8.3.3 Requirements on InfiniBand Node-Type Interface 170
    8.3.4 Memory Management of InfiniBand Node-Type 171
    8.3.5 States of InfiniBand Node-Type 172
    8.3.6 Implementation of InfiniBand Node-Type 173
  8.4 Analysis of the InfiniBand Support in VILLAS 175
    8.4.1 Service Types of InfiniBand Node-Type 178
    8.4.2 InfiniBand vs. Zero-Latency Node-Type 181
    8.4.3 InfiniBand vs. Existing Server-Server Node-Types 182
  8.5 Conclusion and Outlook 183

9 Conclusion 185
  9.1 Summary and Discussion 185
  9.2 Outlook 189

A Code Listings 193
  A.1 Exploiting Parallelism in Power Grid Simulation 193

B Python Environment Measurements 195
  B.1 Execution Times 195
  B.2 Memory Space Consumption 197

List of Acronyms 201

Glossary 207

List of Figures 209

List of Tables 213

Bibliography 215

List of Publications

Journal Articles

[DRM20] S. Dähling, L. Razik, and A. Monti. “OWL2Go: Auto-generation of Go data models for OWL ontologies with integrated serialization and deserialization functionality”. In: To appear in SoftwareX (2020).

[Raz+19b] L. Razik, N. Berr, S. Khayyam, F. Ponci, and A. Monti. “REM-S – Railway Energy Management in Real Rail Operation”. In: IEEE Transactions on Vehicular Technology 68.2 (Feb. 2019), pp. 1266–1277. doi: 10.1109/TVT.2018.2885007.

[Kha+18] S. Khayyamim, N. Berr, L. Razik, M. Fleck, F. Ponci, and A. Monti. “Railway System Energy Management Optimization Demonstrated at Offline and Online Case Studies”. In: IEEE Transactions on Intelligent Transportation Systems 19.11 (Nov. 2018), pp. 3570–3583. issn: 1524-9050. doi: 10.1109/TITS.2018.2855748.

[Mir+18] M. Mirz, L. Razik, J. Dinkelbach, H. A. Tokel, G. Alirezaei, R. Mathar, and A. Monti. “A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid”. In: Hindawi Complexity 2018 (Feb. 2018). doi: 10.1155/2018/7154031.

[Raz+18a] L. Razik, M. Mirz, D. Knibbe, S. Lankes, and A. Monti. “Automated deserializer generation from CIM ontologies: CIM++ — an easy-to-use and automated adaptable open-source library for object deserialization in C++ from documents based on user-specified UML models following the Common Information Model (CIM) standards for the energy sector”. In: Computer Science - Research and Development 33.1 (Feb. 2018), pp. 93–103. issn: 1865-2042. doi: 10.1007/s00450-017-0350-y.

[Raz+18b] L. Razik, J. Dinkelbach, M. Mirz, and A. Monti. “CIMverter — a template-based flexibly extensible open-source converter from CIM to Modelica”. In: Energy Informatics 1.1 (Oct. 2018), p. 47. issn: 2520-8942. doi: 10.1186/s42162-018-0031-5.

[Gre+16] F. Gremse, A. Höfter, L. Razik, F. Kiessling, and U. Naumann. “GPU-accelerated adjoint algorithmic differentiation”. In: Computer Physics Communications 200 (2016), pp. 300–311. issn: 0010-4655. doi: 10.1016/j.cpc.2015.10.027.

[Fin+09b] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. “Low-Latency Linux Drivers for Ethernet over High-Speed Networks”. In: IAENG International Journal of Computer Science 36.4 (2009).

Book Chapters

[Fin+10] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. “Transparent Integration of a Low-Latency Linux Driver for Dolphin SCI and DX”. In: Electronic Engineering and Computing Technology. Ed. by S.-I. Ao and L. Gelman. Dordrecht: Springer Netherlands, 2010, pp. 539–549. isbn: 978-90-481-8776-8. doi: 10.1007/978-90-481-8776-8_46.

Conference Articles

[Raz+19a] L. Razik, L. Schumacher, A. Monti, A. Guironnet, and G. Bureau. “A comparative analysis of LU decomposition methods for power system simulations”. In: 2019 IEEE Milan PowerTech. June 2019, pp. 1–6.

[Vog+17] S. Vogel, M. Mirz, L. Razik, and A. Monti. “An Open Solution for Next-generation Real-time Power System Simulation”. In: 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2). Nov. 2017, pp. 1–6. doi: 10.1109/EI2.2017.8245739.

[Pic+16] S. Pickartz, N. Eiling, S. Lankes, L. Razik, and A. Monti. “Migrating LinuX Containers Using CRIU”. In: High Performance Computing. Ed. by M. Taufer, B. Mohr, and J. M. Kunkel. Cham: Springer International Publishing, 2016, pp. 674–684. isbn: 978-3-319-46079-6.

[Var+11] E. Varnik, L. Razik, V. Mosenkis, and U. Naumann. “Fast Conservative Estimation of Hessian Sparsity”. In: Fifth SIAM Workshop on Combinatorial Scientific Computing, May 19–21, 2011, Darmstadt, Germany. May 2011, pp. 18–21.

[Fin+09a] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. “ETHOM, an Ethernet over SCI and DX Driver for Linux”. In: Proceedings of 2009 International Conference of Parallel and Distributed Computing (ICPDC 2009), London, UK. 2009.

[Fin+08] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. “ETHOS, a generic Ethernet over Sockets Driver for Linux”. In: Proceedings of the 20th IASTED International Conference. Vol. 631. 017. 2008, p. 239.


1 Introduction

In 1993, announcements in German newspapers claimed that sun, water, and wind would not cover more than 4 percent of the electricity demand even in the long run [Büt16]. Yet as early as 2007, their share of the electricity supply amounted to 14.2 percent. This was also the year in which the first official definition of smart grid was provided by the Energy Independence and Security Act of 2007 [Con07], approved by the US Congress. Meanwhile, the term smart grid is used worldwide for research, development, and investment programs concerned with technology innovations and the expansion of power grids. The principal approaches for the transformation of conventional power grids into smart grids were developed by the expert team of the Advisory Council of the European Technology Platform in the years 2005 to 2008, to establish a conceptual basis for the secure grid integration of significant electrical generation capacities based on renewable, mostly volatile and weather-dependent energy sources.

1.1 Challenges in Smart Grids

Smart grids particularly require improved coordination of grid operation and grid-user behavior with the aid of information and communications technology (ICT), with the objective of ensuring a sustainably economical, reliable, secure, and eco-friendly power supply in an environment of increased energy efficiency and decreased greenhouse gas emissions. For instance, Smart Distribution plays a major role in the area of smart grids. It can be divided into three pillars with the following challenges [BS14a; BS14b]:

1. Automation and remote control of local distribution grids: e. g. voltage control at distribution level (traditional as well as including the grid users), possibilities of power flow control, accelerated fault location and resumption of normal grid operation, as well as enhanced protection concepts;

2. Flexibility by virtual power plants (VPPs): i. e. demand-side management and the benefits of VPPs in a prospective market organization;

3. Smart Metering and market integration of consumers: i. e. dynamic tariffs, demand-side response, and electromobility.

Since the efficiency of suitable solutions can be improved by collecting and analyzing information (i. e. data), a research field called Energy Informatics has been established around 2012, with conferences such as the DACH+ Conference on Energy Informatics and the ACM e-Energy Conference. Furthermore, Centers for Energy Informatics were founded, for example at the University of Southern Denmark and the University of Southern California, to address the ICT challenges of smart grids, e. g. with the help of artificial intelligence and machine learning approaches. However, new approaches and international standards in the area of ICT are not sufficient. Besides regulatory (i. e. legal) aspects, the introduction of new market rules is also necessary.

A successful realization of the European goals for reducing greenhouse gases, increasing energy efficiency, and continuously expanding the use of renewable energy sources requires a harmonized design of the interrelations between all participants in the process of electrical power supply [BS14a; BS14b]. Proponents and opponents of renewable energies alike agree: the contribution of 37.8 percent from renewable energy sources to gross electricity consumption in Germany [Umw19] can be significantly increased in the long term only by utilizing flexibility (on the generation and demand side) and using energy storage systems.

Since modifications of the subsystems involved in the energy sector (i. e. power grid, ICT infrastructure, energy market, etc.) involve the risk of technical and economical faults such as destabilization, major changes should not be made without an accurate analysis of their possible effects on the power system. Computer simulations, with the aid of mathematical models, can help to estimate how systems behave under such modifications, so as to avoid negative consequences that could occur in real systems. In the following, different kinds of power system simulation are introduced, and it is motivated why large-scale multi-domain co-simulation is a solution for the three pillars of smart grid challenges presented here.

1.2 Large-Scale Multi-Domain Co-Simulation as a Solution

There are different types of (co-)simulations, depending on their goals. Depending on the considered aspects, simulation types can be classified by

mathematical models with, e. g., pure algebraic equations for steady-state observations or ordinary differential equations (ODEs) for dynamic observations;

simulation time, which, e. g., can be continuous for “floating” physical processes or discrete for events that occur at particular points in time, marking a change of the system’s state;

orchestration, which, e. g., can be hybrid, when multiple system models from different domains are simulated by the same solver, or a co-simulation, when multiple system models are computed by different simulation solvers which are coupled (i. e. exchange information during the simulation) [Sch+15].
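As a toy illustration of the first distinction (not taken from the dissertation, and with an arbitrary placeholder matrix), the following sketch contrasts a steady-state view, which solves a purely algebraic system once, with a dynamic view, which integrates an ODE over continuous time until it settles at the same operating point:

```python
# Toy 2x2 system: steady-state (algebraic) vs. dynamic (ODE) view.
# The matrix A and vector b are arbitrary placeholders.

A = [[2.0, -1.0], [-1.0, 2.0]]
b = [1.0, 0.0]

# Steady state: solve the algebraic system A x = b directly (Cramer's rule).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
x_steady = [
    (b[0] * A[1][1] - A[0][1] * b[1]) / det,
    (A[0][0] * b[1] - b[0] * A[1][0]) / det,
]

# Dynamic: integrate dx/dt = -A x + b with explicit Euler steps; the
# trajectory converges to the algebraic solution as t grows.
x = [0.0, 0.0]
dt = 0.01
for _ in range(5000):
    dx0 = -(A[0][0] * x[0] + A[0][1] * x[1]) + b[0]
    dx1 = -(A[1][0] * x[0] + A[1][1] * x[1]) + b[1]
    x = [x[0] + dt * dx0, x[1] + dt * dx1]

# The settled dynamic trajectory matches the steady-state solution.
assert abs(x[0] - x_steady[0]) < 1e-3 and abs(x[1] - x_steady[1]) < 1e-3
```

In real power grid simulation the algebraic case corresponds to a power flow calculation and the ODE case to a dynamic (e. g. transient) simulation, where each time step may itself require solving a large sparse linear system.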

Obviously, this list is not complete. The next sections, however, shall provide a more general overview of simulation types and some of their goals to motivate the contribution of this work. First, online and offline simulations shall be differentiated.

Online Simulation

Online simulations are performed, e. g., for steady-state security assessment (SSA) and dynamic security assessment (DSA). In the case of SSA, power flow simulations of a sequence of (n-1)-states are needed to verify compliance with the principle that, under the predicted maximum transmission and supply responsibilities, grid security is ensured even when a component such as a transformer or a line unexpectedly becomes inoperative. DSA is based on dynamic simulation, which supplements the steady-state grid security calculations with calculations of power plant dynamics in case of nearby short circuits and grid equipment outages. These dynamic stability calculations can be very time-consuming. In Germany, this means that for a timely availability of DSA results, to be used as a decision aid for the dispatcher, all (n-1)-scenarios should be available within 5 minutes. Around 100 dynamic stability calculations must thus be accomplished within this time frame. Such a real-time (RT) requirement is very challenging and hence requires an intelligent management of the calculation cases as well as low simulation execution times [BS14a; BS14b].
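Structurally, such an assessment amounts to a screening loop over all (n-1) cases against a fixed deadline. A minimal sketch, in which `simulate_contingency` is a hypothetical stand-in for one dynamic stability calculation and all component names are invented for the example:

```python
import time

# Illustrative (n-1) screening loop. `simulate_contingency` is a hypothetical
# placeholder for a full dynamic stability calculation; the component list and
# the single "critical" outage are invented for this sketch.

def simulate_contingency(component: str) -> bool:
    """Pretend to simulate the grid with `component` out of service."""
    return component != "line-7"  # pretend exactly one outage is insecure

# One (n-1) case per component that can fail.
components = [f"line-{i}" for i in range(10)] + [f"trafo-{i}" for i in range(3)]

deadline_s = 5 * 60  # all (n-1) cases must be screened within 5 minutes
start = time.monotonic()
insecure = [c for c in components if not simulate_contingency(c)]
elapsed = time.monotonic() - start

# Trivially met here; with ~100 real dynamic simulations per cycle, meeting
# the deadline requires parallel scheduling and fast solvers.
assert elapsed < deadline_s
```

In a real DSA setting each call is a time-consuming dynamic simulation, which is exactly why the deadline motivates both intelligent case management and the low simulation execution times this dissertation targets.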

Offline Simulation

Offline simulations (steady-state, electromagnetic transient simulation (EMT), etc.) are performed, e. g., for grid expansion planning, maintenance planning, commissioning of new operating equipment, and so forth. As offline simulations are not performed simultaneously with grid operation, they do not have any RT requirements. Nevertheless, low simulation execution times are important to obtain simulation results in acceptable time frames when many use cases or scenarios (e. g. the same power grid with various switching events changing its topology during simulation) have to be simulated, or in case of simulation models with thousands of nodes.

Large-Scale Simulation

Such simulations with several thousand nodes, called large-scale, become important when simulation environments shall be applicable also to real-world scenarios rather than to lab experiments only. Though there are commercial simulation tools which allow large-scale power grid simulations for certain use cases, they have a significant disadvantage: they are closed source, and thus changes to existing models (i. e. component models) or to the solvers are often not possible. However, further development of models is an essential concern of scientific research to adapt them for future applications in smart grids, as the lack of inertia in power grids, caused by a decreasing share of big power generators and more distributed energy resources (DERs), can lead to frequency instabilities that cannot be simulated by conventional models. Hence, at the Institute for Automation of Complex Power Systems (ACS), new methods and concepts are implemented in open-source software which can be used and improved by everyone. It should be noted that not only publicly funded scientific facilities can benefit from open-source simulation software but also economic enterprises, some of which increasingly count on open-source alternatives instead of closed-source products. For instance, RTE-France, the French transmission system operator (TSO), also develops open-source simulation environments such as Dynaωo [Gui+18]. But, as with commercial software, compliance with international standards of associations such as CIGRE, IEC, IEEE, and VDE is crucial for the comparability of solution approaches, study results, and applicability in existing system environments.

Co-Simulation

Especially in the case of co-simulation – a definition is given by [Sch+15] – where multiple simulators are coupled together, standardized data models for the information exchange between them are usually necessary. There are single-domain and multi-domain co-simulations.

Single-domain co-simulations can be conducted with and without RT requirements. Without RT requirements, co-simulations can be useful if the involved simulators have complementary features but there is no need for a synchronization of the simulation time with the real time (i. e. the co-simulation time can run slower or faster than the wall clock). Particularly in the power grid domain, RT requirements can come into play, e. g. with (power) Hardware-in-the-Loop (HiL), Control-in-the-Loop (CiL), and Software-in-the-Loop (SiL) use cases, where a solution (i. e. an embedded system such as a control device) has to be connected to a simulated environment to verify its correct functioning within a real environment. A special case hereof is geographically distributed real-time simulation (GD-RTS), which is based on the concept of a virtual interconnection of laboratories in real time [Mon+18]. In this concept, a monolithic simulation model is partitioned into subsystem portions that are simulated concurrently on multiple digital real-time simulators (DRTSs). As a result, comprehensive and large-scale real-world scenarios can be simulated for the validation of the interoperability of novel hardware and software solutions with the existing power grid, without the need for the involved in-the-Loop setups to be located at the same facility.

Multi-domain co-simulation in the following denotes a coupling of one or more power grid simulators with other simulators from different domains such as ICT, market, weather, and so on, to obtain a holistic view of the power system. Therefore, the term power system in this work does not stand for the power grid only but for the power grid together with any associated system, such as the ICT infrastructure and the energy market, in a holistic view. This is the key to an extensive analysis and understanding of smart grids as depicted in [BS14a; BS14b].

In the previous sections, the use of large-scale single- and multi-domain co-simulation as a solution for the analysis and development of smart grids with a continually growing share of renewable energy sources was motivated. The merit of applying simulations during power system operation, as well as for the planning of power systems as conducted for decades, is undisputed.


Chapter 1 Introduction

Due to the three main challenges arising through the transition to smart grids (see Sect. 1.1) – more automation and control in local distribution grids (e. g. because of the needed digitalization), the higher demand for flexibility (e. g. by demand side management), and the need for a stronger market integration of consumers – power system models become more and more complex. This requires an ever greater performance of the utilized computer systems.

1.3 Contribution

The main objective of this dissertation is the application of high-performance computing (HPC) methods in the area of Energy Informatics and their analysis for improving power system (co-)simulation software, to allow simulating more complex component models as well as larger system models in an appropriate time. While in the past processor performance increased continuously with increasing CPU clock rates, since around 2005 this is no longer the case because of the power wall [Bos11]. From then on, computer performance was increased by a growing number of cores per processor and by accelerators such as graphics processing units (GPUs) and Intel Xeon Phi adapters. Nowadays, the power draw is a problem not only of central processing units (CPUs) but also of whole supercomputers. Therefore, while the trend to more parallelism continues, HPC system designers are more and more turning to hardware architectures respectively accelerators with high power efficiency (usually measured in FLOPS per watt) like GPUs, Advanced RISC Machines (ARM) processor based systems, or field-programmable gate array (FPGA) accelerators [Gag+19]. As in the case of multi-core and manycore systems with special instruction sets for performance improvements (e. g. vector instructions), software nowadays must be adapted continuously to make use of such new hardware features and accelerators. Under these circumstances, the focus is on the improvement of different aspects of state-of-the-art and currently developed simulation solutions in academia as well as in enterprises. Thus, the intention was not to develop new simulation concepts or applications that would make large-scale HPC on supercomputers or large computer clusters necessary. Rather, especially the computer and network hardware of modern commodity clusters is in the focus of the contribution.

Figure 1.1 shows the real-world challenge of an improved coordination of smart grid operation and grid user behavior. This is addressed by a solution based on an appropriate, and therefore increasingly complex, modeling as well as (co-)simulation for smart grid planning and operation. The three major aspects of the solution, to which the contribution of this work refers, are modeling, simulation, and information exchange. Arrows from bottom to top illustrate the contribution of this work to these major aspects of large-scale power system (co-)simulation.

[Figure 1.1: Contribution overview of this work – the transition of conventional power grids to smart grids poses the challenge that smart grids require improved coordination of grid operation and grid user behavior; the solution is appropriate and more complex modeling and (co-)simulation for smart grid planning and operation. The three aspects modeling, simulation, and information exchange are addressed by Chapter 2 (multi-domain co-simulation with a holistic CIM-based topology data model), Chapter 3 (automated (de-)serializer generation from CIM UML), Chapter 4 (from CIM-based topologies to simulator-specific system models), Chapter 5 (modern LU decompositions in power grid simulation), Chapter 6 (exploiting parallelism in power grid simulation), Chapter 7 (HPC Python internals and benefits), and Chapter 8 (HPC network communication for HiL and RT co-simulation), all resting on high-performance computing and energy informatics.]

On the one hand, mathematical component models become more complex, for example because of an increasing use of power electronics; on the other hand, the complexity of system models increases, for example because of new electrical equipment and facilities, in the case of smart grids ever more often with connections to other domains such as ICT, weather, mechanics, the energy market, etc., which also need to be simulated. Therefore, a contribution of this thesis is the presentation of a multi-domain co-simulation architecture with a holistic (i. e. multi-domain) topology data model which is based on the Common Information Model (CIM) as standardized in IEC 61970 / 61968 / 62325, describing terms in the energy sector and the relations between them. CIM plays an important role as it belongs to the IEC core semantic standards for smart grids [LE].

CIM makes use of the Unified Modeling Language (UML), which is state-of-the-art in computer science for the specification of classes and their relationships in object-oriented software design. This thesis therefore contributes the concept of an automated (de-)serializer generation from a specification based on UML. Among others, the automated code generation process implements a CIM data model in C++ according to the given UML specification. It can be applied whenever the UML specification changes between its versions, which usually happens a couple of times per year. This avoids manual changes in a code base with currently around one thousand classes and many relations between them, which would be very time-consuming and error-prone. The resulting (de-)serializer allows reading CIM documents into C++, according to the CIM-based data model, modifying the data in main memory, and writing the data back into CIM documents.
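
The generation idea can be sketched as follows; the tiny class specification, the type mapping, and the emitted skeletons are invented for this sketch and do not reproduce the actual CIM++ generator, which additionally emits the (de-)serialization code:

```python
# Minimal sketch: generate C++ class skeletons from a tiny, hand-written
# class specification standing in for the CIM UML model. All names here
# (ACLineSegment, attributes, the type mapping) are illustrative only.
SPEC = {
    "ACLineSegment": {"base": "Conductor", "attrs": [("r", "float"), ("x", "float")]},
    "Conductor":     {"base": None,        "attrs": [("length", "float")]},
}

CPP_TYPES = {"float": "double", "string": "std::string"}

def emit_class(name, cls):
    """Emit one C++ class skeleton for a class entry of the specification."""
    base = f" : public {cls['base']}" if cls["base"] else ""
    lines = [f"class {name}{base} {{", "public:"]
    for attr, typ in cls["attrs"]:
        lines.append(f"    {CPP_TYPES[typ]} {attr};")
    lines.append("};")
    return "\n".join(lines)

def generate(spec):
    # A real generator would also handle associations between classes and
    # regenerate everything whenever the UML specification changes.
    return {name: emit_class(name, cls) for name, cls in spec.items()}

if __name__ == "__main__":
    for code in generate(SPEC).values():
        print(code, end="\n\n")
```

Because the whole code base is derived mechanically from the specification, a new UML version only requires rerunning the generator instead of editing a thousand classes by hand.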

Due to CIM’s fine granularity over several abstraction levels, a component (e. g. a power transformer) consists of many CIM objects. This is a reason why a mapping from CIM to a simulator-specific system model is intricate. However, when a mapping to the system model of a certain simulator is achieved, the mapping often can also be reused for the system models of different simulators. Therefore, a template-based mapping from CIM to system models is proposed. The templates allow a specification of how model parameters from a CIM document have to be written into the simulator-specific system model target format. The advantage of templates is that if the system model format is written in a given language (e. g. Modelica), the templates are written in the same language, with placeholders for the data from a CIM object to be mapped. Therefore, the user does not need to learn another language for specifying the system model format.
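
The placeholder idea can be sketched with Python’s `string.Template`; the Modelica-like snippet and the parameter names are invented for illustration and do not follow the actual CIMverter template syntax:

```python
from string import Template

# Illustrative only: a template written in the target language (here a
# Modelica-like line) with placeholders for values taken from a CIM object.
MODELICA_LINE_TEMPLATE = Template(
    "  PowerSystem.Line ${name}(R = ${r}, X = ${x});"
)

# Hypothetical parameters gathered from the many CIM objects of one component.
cim_object = {"name": "line1", "r": 0.05, "x": 0.12}

print(MODELICA_LINE_TEMPLATE.substitute(cim_object))
```

Since the template is ordinary target-language text plus placeholders, adapting the output to a new simulator version means editing the template, not the converter.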

Simulation environments make use of various approaches for the transition from a system model to a system of mathematical equations, which is solved by the simulation solver. In the case of power grid simulations, for instance, the resistive companion approach in combination with the Newton(-Raphson) method can be applied, which results in a linear system of equations (LSE) for each time step and Newton iteration. In another approach, all component models can be combined into a differential-algebraic system of equations (DAE) which is then passed to a DAE solver, which finally linearizes it to LSEs as well. For power grid simulations, LSEs typically are very sparse (i. e. the fraction of non-zero elements in the matrix is typically much less than 1 ‰) and therefore require appropriate LSE solvers. The contribution in this work is a comparative analysis of several modern LU decompositions for the solution of sparse LSEs coming from power grids against KLU¹, which is a well-established LU decomposition for electric circuits and therefore taken as the reference. The LU decompositions concerned are called modern as they are developed especially for current multi-core or massively parallel computer architectures. The comparison is based on benchmark matrices that arose during power grid simulation and on simulations performed by existing simulation environments into which the most promising LU decompositions were integrated.
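
The workflow of such a sparse LSE solver can be sketched with SciPy’s `splu`, which wraps SuperLU (the decomposition alluded to in the footnote on KLU); the 3×3 matrix is invented and far smaller and denser than the benchmark matrices discussed here:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# Small stand-in for a power-grid nodal matrix; real matrices from
# large-scale grids have thousands of rows and a non-zero fraction
# well below one per mille.
A = csc_matrix(np.array([
    [ 4.0, -1.0,  0.0],
    [-1.0,  4.0, -1.0],
    [ 0.0, -1.0,  4.0],
]))
b = np.array([1.0, 2.0, 3.0])

lu = splu(A)     # factorize A = L*U once (SciPy wraps SuperLU)
x = lu.solve(b)  # forward/backward substitution per right-hand side
print(np.allclose(A @ x, b))  # True
```

Factorizing once and reusing the factors for many right-hand sides is what makes LU decompositions attractive for time-stepped simulation, where a similar LSE must be solved in every step.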

There are various methods of expressing parallelism in power system simulation. On the one hand, the processing within a simulation solver can be parallelized, for instance with the aid of a certain parallel programming paradigm in the solver’s programming language (e. g. with parallel constructs using OpenMP in C++ [Ope19b]). Similarly, on the other hand, parallelism in a system model can also be expressed with the aid of a formalism for parallel structures in the model (e. g. with parallel constructs in the modeling language ParModelica [Geb+12]). Besides such an explicit expression of parallelism in the solver or model, it is also possible to extract parallelism, e. g., from mathematical models at equation level, which is a variant of already existing automatic fine-grained parallelization of mathematical models. The contribution of this work, however, is the introduction of an automatic exploitation of parallelism in system models at component level, therefore called an automatic coarse-grained parallelization of mathematical models. For this coarse-grained parallelization, parallel task schedules are introduced. Accordingly, various task schedulers allow the parallel processing of tasks related to component models within one simulation step. An analysis of the whole implementation shows the execution time speedups with respect to different scheduling methods and other modeling and software-technical aspects.

¹ The “K” in KLU stands for “Clark Kent”, which is the bourgeois identity behind the fictional superhero Superman. This is an allusion to SuperLU, which is a well-known LU decomposition for sparse linear systems [DP10].
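
One common family of such schedules is level scheduling, sketched below; the task graph, the `levels` helper, and the thread-pool execution are illustrative assumptions, not the actual DPsim scheduler:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical component-model tasks of one simulation step and their data
# dependencies: a task may run once all of its predecessors have finished.
deps = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}

def levels(deps):
    """Group tasks into levels; tasks within one level are mutually independent."""
    level_of, result = {}, []
    remaining = dict(deps)
    while remaining:
        ready = [t for t, ds in remaining.items() if all(d in level_of for d in ds)]
        for t in ready:
            level_of[t] = len(result)
            del remaining[t]
        result.append(ready)
    return result

def run_step(deps, work):
    with ThreadPoolExecutor() as pool:
        for level in levels(deps):       # levels run one after another ...
            list(pool.map(work, level))  # ... tasks inside a level in parallel

if __name__ == "__main__":
    run_step(deps, lambda task: print("processed", task))
```

Here the independent tasks A and B run concurrently, then C, then D and E, mirroring how data-independent component tasks can be processed in parallel within one simulation step.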

Power system simulation requires not only the simulation itself but also data processing before the simulation (e. g. load and generation profiles), during the simulation (e. g. data exchanged between simulators), and after the simulation (e. g. simulation results). Since Python, as a modern and relatively easy-to-learn scripting language, is enjoying ever growing popularity among programming beginners, many power engineers program diverse parts of software projects in the area of power system simulation in Python. Especially the pre- and postprocessing of simulation data is performed in Python, while the simulation cores are often programmed in other languages such as C++. Sometimes the execution times of (usually interpreted) Python applications are too long for given use cases, and there is not enough time or a lack of know-how to port the Python application to a more runtime-efficient language such as, e. g., C++. Admittedly, there are Python modules, just-in-time (JIT) compilers, and Python language extensions which allow improving the runtime efficiency of Python programs, but their internals and benefits are rather unknown. The contribution of this work is therefore an overview and comparative analysis of the most popular approaches for the performance improvement of Python, not necessarily with the aid of parallelization (e. g. multithreading).
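
As a minimal illustration of one such approach – vectorization with NumPy, which moves the inner loop into compiled code – the following invented example compares a pure-Python loop with its vectorized counterpart; it is a sketch, not one of the benchmark algorithms analyzed later:

```python
import time
import numpy as np

def rms_python(values):
    # Pure-Python loop: every iteration pays interpreter overhead.
    acc = 0.0
    for v in values:
        acc += v * v
    return (acc / len(values)) ** 0.5

def rms_numpy(values):
    # Vectorized: the loop runs in compiled code inside NumPy.
    a = np.asarray(values)
    return float(np.sqrt(np.mean(a * a)))

# Invented stand-in for postprocessed simulation data.
samples = [float(i % 100) for i in range(200_000)]

for f in (rms_python, rms_numpy):
    t0 = time.perf_counter()
    r = f(samples)
    print(f"{f.__name__}: {r:.3f} in {time.perf_counter() - t0:.4f} s")
```

Both functions compute the same result; the timing difference illustrates why moving hot loops out of the interpreter is often the first step before resorting to a full port to C++.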

Co-simulations as well as HiL setups require an information exchange between simulators as well as between devices and simulators. Especially in the case of RT applications, short latencies in the information exchange can be crucial. To reduce latencies, HPC interconnects, in contrast to commonly used interconnects, provide connection modes in which data is directly transmitted to or read from the main memory of a remote server without involving the operating system or a process running on the remote server, as is usually the case. Therefore, a contribution of this thesis is the presentation of InfiniBand (IB), a widely used HPC network communication standard, and its integration into a state-of-the-art software framework that can, for instance, be freely used for hard RT coupling of devices with simulators in HiL setups as well as for the coupling of simulators in hard RT co-simulations with very low latencies.

All the contributed approaches were implemented or integrated in existing or new open-source software projects which can be used and investigated. Moreover, the concepts and analyses introduced in this work for an improved modeling, simulation, and information exchange shall support other researchers, developers, and users of (co-)simulation software.


1.4 Outline

Chapter 2 shows the benefits of multi-domain co-simulation and introduces an appropriate co-simulation environment for the three smart grid domains power grid, communication network, and energy market, developed in the research project SINERGIEN. This SINERGIEN environment is the starting point for several approaches, concepts, and analyses which are presented in the following chapters. The usage of UML for the specification of CIM allows extending it to a holistic topology data model that is used for the SINERGIEN co-simulation environment with simulators for the three mentioned domains.

Chapter 3 presents the automated (de-)serializer generation from a specification based on UML. The automated deserializer generation is implemented in the CIM++ software project, which can map CIM, as specified by UML, to a C++ code base, also implementing the holistic CIM-based data model. The thus created open-source software library allows reading and writing arbitrary CIM-based documents in C++.

Chapter 4 shows the approach of how CIM-based documents for power grid topology representation can be translated into simulator-specific system models with the aid of template documents. In the SINERGIEN environment this became necessary for the power grid simulator based on Modelica to run simulations of power grid topologies stored in CIM-based documents, as CIM is used more and more by distribution system operators (DSOs) and TSOs. The translation from CIM to a simulator-specific model was implemented in the open-source software CIMverter. It uses template documents, making it possible to modify the simulator-specific system models to be output in case the input format of the target simulator changes, e. g. because of a newer version which allows more parameters to be set or new component models to be included in a system model. This allows a flexible adaptation of the translation from CIM to a supported simulator-specific model without a recompilation of CIMverter, which is also shown in this chapter.

Chapter 5 outlines the comparative analysis of several modern LU decompositions for sparse linear systems. In the first part of the analysis they are compared on different benchmark matrices arising from simulations of large-scale power grids. This analysis helped in deciding which LU decompositions are worth integrating into existing simulation environments. In the second part, the most promising modern decomposition (after its integration) is compared with the reference decomposition by simulations with both a fixed time step and a variable time step solver. For this purpose, these LU decompositions were i. a. integrated into the DAE solver used by the open-source simulation environments OpenModelica and Dynaωo.

Chapter 6 presents the approach for exploiting parallelism in power grid simulation from the newly introduced type of approaches described as automatic coarse-grained parallelization of mathematical models, for a higher performance through the thereby enabled parallel computations in power system simulators. This approach is applied to a newly developed open-source power grid simulator called DPsim. At first, the implemented parallelization approach is categorized into the existing parallelism categories of simulation models. Moreover, an overview of formally defined scheduling methods for the parallel processing of data-independent tasks is provided. A performance analysis of the implemented task parallelization methods follows.

Chapter 7 provides an overview of the internals of HPC approaches to improve the runtime of Python applications and a comparative analysis of these approaches. The comparative analysis is based on various benchmark algorithms of different algorithm classes that were programmed in Python and, as an efficient reference, in C++. This comparative analysis can help Python programmers to choose the right approach for increasing the performance of Python applications with or without multithreading – with threads that are executed truly in parallel, which, as will also be explained, is not always the case in Python.

Chapter 8 presents the integration of HPC network communication into HiL and RT co-simulation. The HPC interconnect solution chosen for the integration into the open-source VILLASframework, which can be utilized for the setup of HiL simulations and (even hard) RT coupling of DRTSs, is based on IB. IB was chosen as it is an open standard that is implemented by various manufacturers. The integration of IB is also compared with other communication methods provided by the VILLASframework.

Chapter 9 concludes the dissertation, providing a summary and discussion of all topics of this work. Moreover, it gives an overview of future work that can be conducted for an improvement of the introduced concepts as well as their analyses and implementations.


2 Multi-Domain Co-Simulation

More and more distributed energy resources (DERs) at the distribution level cause bidirectional power flows between the distribution and transmission levels, which require changes in the related information and communications technology (ICT) and energy market mechanisms. The associated extension of the measurement infrastructure in lower voltage layers, for instance, requires appropriate communication network capabilities for meeting the requirements on the exchange of measurement data between the measurement devices and all involved entities such as control centers and substations. Therefore, electrical grids and the corresponding communication networks should be planned holistically to take the interactions between both domains into account [Li+14]. Apart from that, new energy market models are developed for customers (i. e. prosumers) to empower them to a more active role in the exchange of energy with the grid [WH16], in a way that their behavior will be considered in grid operation [EFF15] and possibly vice versa. Given these facts, it is reasonable to also include the energy market simulation in the planning to get a holistic picture of future grids.

Future studies on power grids that integrate energy market mechanisms, the communication network, and the power grid are hampered by a lack of established modeling approaches which encompass the three domains, and there are only few tools which enable a joint simulation. In this chapter a comprehensive data model is presented together with a co-simulation architecture based on it. Both enable an investigation of dynamic interactions between power grid, communication network, and market. Such interactions can be technical constraints of the grid which require actions on the market side, communication failures which affect the communication between grid and market, or market decisions that change the behavior of a generation unit or of energy prosumers connected to the grid. For this purpose, a data model based on the Common Information Model (CIM), as standardized in IEC 61970/61968/62325, was created to be able to describe an entire smart grid topology with components and actors from all three domains. This data model is called SINERGIEN_CIM as it resulted from the research project SINERGIEN. It allows the storage of the whole network topology, with components from all three domains, in a single well-defined data model, hiding some complexity of the simulation from the users. SINERGIEN_CIM-based topology descriptions are processed by the co-simulation architecture as presented in [Mir+18].

Some parts of the SINERGIEN co-simulation architecture will be addressed in the following, as they are relevant for the research and development that is presented in the subsequent chapters of this dissertation.

After a section on the related work and another one on various use cases for multi-domain simulation, the challenges for the realization of the implemented SINERGIEN co-simulation environment are discussed. A section about the concept follows, and a further one on its validation by a use case. The chapter is concluded with final remarks in its last section. The work in this chapter has been partially presented in [Mir+18]¹.

2.1 Fundamentals and Related Work

2.1.1 Architecture and Topology Data Model

A major formal modeling method for future intelligent power grids is given by the Smart Grid Architecture Model (SGAM) [Sma12]. The SGAM framework provides five layers: for physical components in the network (component layer), protocols for information exchange between services or systems (communication layer), data models which define the rules for data structures (information layer), functions and services (function layer), as well as business and market models (business layer). Furthermore, the model divides all two-dimensional layers in the domain dimension from generation via transmission, and so forth, to customer premises, and in the zones dimension from process via field, and so on, to the market.

¹ “A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid” by Markus Mirz, Lukas Razik, Jan Dinkelbach, Halil Alper Tokel, Gholamreza Alirezaei, Rudolf Mathar, and Antonello Monti is licensed under CC BY 4.0


SGAM shall accelerate and standardize the development of unified data models, services, and applications in industry and research. In this context, the SINERGIEN data model and the co-simulation framework build upon SGAM as follows:

• the unified data model formally defines the data exchange structure in alignment with the information layer concept of SGAM (see Sect. 2.4);

• the domain-specific simulators of our co-simulation environment include models of power grid and communication network components as well as market actors in the distribution, DER, and customer premise domains of the SGAM component layer;

• the communication layer is abstracted by a co-simulation interface and software extensions for the particular domain-specific simulators in order to enable data exchange between the components (see Sect. 2.4);

• the example use case presented in Sect. 2.2, an optimal management of distributed battery storage systems, is an example of a system function that would fall on the SGAM function layer. Furthermore, the business model motivating the provision of a proper system function, e. g., an incentive by a distribution system operator (DSO), is defined within the business layer.

For our unified data model we chose CIM as a well-established basis for power grid data that can be extended in a flexible manner. An extension of CIM was needed for the communication infrastructure and the energy market, as shown for example in [Haq+11] and [Fre+09].

2.1.2 Common Information Model

Some of the most important smart grid related standards (i. e. core standards) are from the IEC Technical Committee 57 (IEC TC 57). The so-called CIM is standardized in IEC 61970 (Energy Management Systems), IEC 61968 (Distribution Management), and IEC 62325 (Energy Market Communications) [IEC12b; IEC12a; IEC14]. Therefore, CIM belongs to the core standards included in the IEC/TR 62357 reference architecture [IEC; IEC16b]. Originally, CIM was developed as a database model for energy management systems (EMSs) and supervisory control and data acquisition (SCADA) systems but then changed into an object-oriented approach for electric distribution, transmission, and generation. Use cases of CIM are system integration using pre-defined interfaces between the IT of distribution management systems (DMSs) and automation parts, custom system integration using XML-based payloads for semantically sound coupling of systems, and serializing topology data using the Resource Description Framework (RDF) [Usl+12]. The IEC considers CIM and the IEC 61850 series as the pillars for a realization of the smart grid objectives of interoperability and device management [LE].

2.1.3 Simulation of Smart Grids

Example approaches for co-simulations of power grids and communication are presented in [Li+14; Hop+06; ZCN11; Lin+12], with a focus on short-term effects and therefore not including the energy market. In MOCES [EFF15] a holistic approach is taken for modeling distributed energy systems, but the result is a monolithic simulation and not a co-simulation with a hybrid simulation for the physical part and an agent-based part for behavior-based simulations, e. g., coming from the market. With the SINERGIEN co-simulation environment, the advantages of existing tools shall be harnessed, which enhances the credibility of simulation results and obviates reinventing the wheel. The SINERGIEN co-simulation platform consists of several domain-specific simulators, with the possibility to use the “best tool” for each domain.

2.1.4 Classification of Simulations

In [Sch+15] a classification scheme for energy-related co-simulations is introduced, with the four modeling categories continuous processes, discrete processes / events, roles, and statistical elements. The power grid in the SINERGIEN co-simulation environment is modeled based on Modelica. A short introduction to Modelica is provided in Sect. 4.1.1. Thermal systems [Mol+14] as well as power grids [MNM16] were modeled in Modelica. The Modelica models in the SINERGIEN co-simulation express continuous as well as discrete processes and events, which makes the power grid simulation a hybrid simulation. The communication network is simulated with available discrete event simulation (DES) tools, such as ns-3. In a DES the simulation time does not proceed continuously but with the occurrence of certain events such as packet arrival, time expiry, etc. [WGG10]. The energy market simulation was also implemented as a DES, but in Python, which is flexible and suitable to test different optimization methods [Har+12]. Each market participant aims at optimizing the schedule for its assets, e. g., minimizing energy costs and maximizing its profit. Examples for statistical elements are, e. g., the wind farm models of the power grid simulator.
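
The event-driven time advance of a DES can be sketched with a minimal event queue; the event names and the follow-up scheduling rule are invented for illustration and do not correspond to ns-3 or the SINERGIEN market simulator:

```python
import heapq

def simulate(events, horizon):
    """Minimal discrete event simulation: time jumps from event to event.

    `events` is a list of (time, name) tuples; the handler below may
    schedule follow-up events. Purely illustrative.
    """
    queue = list(events)
    heapq.heapify(queue)  # priority queue ordered by event time
    log = []
    while queue:
        t, name = heapq.heappop(queue)   # advance directly to the next event
        if t > horizon:
            break
        log.append((t, name))
        if name == "packet_sent":        # invented follow-up rule: a sent
            heapq.heappush(queue, (t + 2.0, "packet_arrival"))  # packet arrives later
    return log

if __name__ == "__main__":
    print(simulate([(0.0, "packet_sent"), (5.0, "timer_expiry")], horizon=10.0))
```

Note that the simulation clock moves only at event times (0.0, 2.0, 5.0 here), in contrast to the fixed or variable time steps of a continuous power grid solver.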



In view of the above, the SINERGIEN co-simulation environment is formalized as a coupled Discrete Event System Specification (DEVS) as defined in [ZPK00]. This formalization is shown in Sect. 2.4.

2.2 Use Case

The SINERGIEN environment can be used for an evaluation of different scenarios with

• fast phenomena in the range from microseconds to seconds (i. e. with smaller simulation time steps) between highly dynamic power grid components, e. g., power electronics, and the communication network, and

• slow phenomena in the range from minutes to hours (i. e. with larger simulation time steps) that include market entities, power grid, and communication network.

More on these two phenomena classes can be found in [Mir+18], with a focus on slow phenomena and a discussion of fast phenomena containing a description of the adaptations needed for fast phenomena investigations.

Based on this classification, it can be concluded that the three simulators do not necessarily need to participate in each co-simulation. The example use case chosen in [Mir+18] for a validation of the SINERGIEN environment was an optimal management of distributed storage systems for peak shaving to support the grid operation. The SINERGIEN environment including the communication network allows testing the effects of communication failures on the operation strategy and eventually on the electrical grid, which can provide valuable insights for decision making. Simulation results for this example are also provided in [Mir+18].

Before the co-simulation is initiated, it is necessary to define and store the topology under investigation along with the scenario-specific parameters. For example, various scenarios in which failures in the communication network are stochastically or deterministically set by the user in the data model can be examined. From a user perspective it would be advantageous if all components, their links, and parameters could be defined in one environment rather than splitting this information between different software solutions and formats. Then, the data model for the topology needs to include components that couple different domains.

Under these requirements, the following challenges were identified:

• definition of a common data model that includes components of all domains and their interconnections;


• interaction of simulators with different simulation types, e. g., event-driven for the communication network and continuous processes for the power grid;

• choice of the co-simulation time step, which is limited by the synchronization method connecting the simulators.

2.3 Challenges

A major issue in the coupling of simulators with different modeling approaches is the selection and implementation of a synchronization mechanism which ensures a proper progress of the simulation time and a timely data exchange between the simulators. This selection is of crucial significance for the minimization of the error propagation in the co-simulation and of the synchronization overhead in terms of simulation time. Since this is out of the scope of this work, please refer to [Mir+18] for more details; the definition and implementation of a new proper data model, involving all three mentioned domains, in contrast, is crucial for the whole following work on large-scale co-simulation.

Holistic Topology Data Model

A common data model that covers the power grid, communication infrastructure, and electrical market did not exist before. Besides the benefit for the user of a co-simulation environment with a single data model for the specification of a holistic co-simulation topology, the data exchange between simulators is also simplified. A system description that encompasses all components of smart grids, as shown in Fig. 2.1 (1), can either be used directly by a single multi-domain smart grid simulator or divided into subsystems for a co-simulation, as in Fig. 2.1 (2). For many components, this division is obvious since their parameters are only needed by one domain-specific simulator, but some components (called inter-domain components) constitute natural coupling points between the three domains. For instance, a battery storage device connected to the grid can act as a market participant that offers its capability to charge or discharge. In order to enable its participation in the energy market, the battery storage needs an interface, which is a communication modem in this case. The modem can be seen as part of the battery storage. For a co-simulation, the information on inter-domain components must be split into several parts, as each simulator has to simulate a dedicated part of these components.



2.4 Concept of the Co-Simulation Environment

2.4.1 Holistic Topology Data Model

Figure 2.1: Exemplary topology including components of (1) all domains and (2) domain-specific topologies

As already mentioned, a holistic data model for a complete three-domain co-simulation topology can be based on CIM with an extension by further classes. These classes, introduced to complete CIM in its representation of smart grids, are linked to already existing CIM classes using the Unified Modeling Language (UML). The proposed format can be structured in four packages:

• Original CIM (IEC 61970 / 61968 / 62325),

• Communication,

• Market, and

• EnergyGrid.

Whenever suitable, original CIM classes are used. However, some components do not have an associated class in the standard yet and are therefore added in one of the other three packages. This approach allows a flexible update to a new CIM version without losing the added classes and their links.

The most important feature of the SINERGIEN data model is the interconnection of domains. Examples of inter-domain components, namely BatteryStorage, SolarGeneratingUnit, and MarketCogeneration, are shown in Fig. 2.2, an excerpt from the SINERGIEN data model. According to the UML diagram, the energy market components are associated with the power grid components, whereas power grid components have an aggregation relationship to communication devices. This means that parameters specific to the market, communication network, and power grid which relate to the same device are linked with each other. Therefore, all information on one device is easily accessible, but at the same time there is a separation according to the domains. The connections between classes of different domains are defined in a logical and not a topological manner; topological connections, in contrast, exist to interconnect power grid components, for instance.

Figure 2.2: Inter-domain connections between classes of power grid, communication network and market

In the mentioned battery storage device example, the data model is as follows: the device is part of the grid and has electrical parameters. Furthermore, the battery storage might participate in the market, e. g., as part of a virtual power plant (VPP). Market-specific information can be stored in objects of the MarketBatteryStorage class, which is associated with the BatteryStorage. The communication modem ComMod, which could be used to communicate with the VPP, is aggregated to the BatteryStorage class.

The three additional packages EnergyGrid, Communication, and Market are needed for the following:

• some newer components occurring in power grids are missing in original CIM. For instance, it was necessary to create a new model for electrical energy storages like stationary batteries. A battery storage is a conducting equipment that is able to regulate its energy throughput in both directions. Therefore, the class BatteryStorage added in the EnergyGrid package is a specialization of a CIM RegulatingConductingEquipment since it can influence the flow of power at a specific point in the grid.

• the key component of the Market package for the scenarios that we would like to investigate is a VPP, since the aggregation of small DER units enables their participation in electricity markets.

• the Communication package includes all additionally defined classes that are related to the communication network model, such as classes for communication links and technologies, modems, and network nodes, along with their parameters and their relations with the classes in CIM, the power grid package, and the market package.

Figure 2.3 shows an excerpt from the communication data model with an aggregation to a WindGeneratingUnit. By means of the associated classes for modems, communication requirements, and channels, the model enables a description of network parameters and topology. More on the packages can be found in [Mir+18].



2.4.2 Model Data Processing and Simulation Setup

The overall information flow for the simulation setup is depicted in Fig. 2.4. After the holistic topology is edited in a graphical Topology Builder, including all objects of the three domains, it is forwarded to the co-simulation interface. In order to execute a simulation, the Modelica solver requires a Modelica model, whereas the communication network topology can be given to the communication network simulator in CIM format, which includes the components of the network, their connections, and parameters. The co-simulation interface incorporates a component called CIMverter, based on CIM++ [Raz+18a] presented in Chap. 3. The CIMverter [Raz+18b] reads in the CIM document and outputs a Modelica system model (Chap. 4) for the power grid simulator. In contrast, the Python-based market simulation relies on a C++/Python interface, which could be realized using one of the common libraries for wrapping C++ data types and functions in Python, to retrieve the market-relevant information from the C++ objects and store it in Python objects. A detailed explanation of the translation from CIM to Modelica is given in Chap. 4.

Figure 2.3: Communication network class association example



2.4.3 Synchronization

The synchronization during simulation is performed at fixed time steps. For slow phenomena scenarios, this is managed by mosaik, a well-established co-simulation framework [SST11]. It allows coupling the three simulators in a simple manner, as explained in Sect. 2.4.4, in the case of longer synchronization time steps. VILLASnode, a software project for coupling real-time simulations in LANs [Vog+17; Ste+17], is a suitable alternative to mosaik in the case of very short synchronization time steps.

Figure 2.4: Overall SINERGIEN architecture for simulation setup

In Modelica, the synchronization data exchange is achieved by integrating Modelica blocks of the Modelica_DeviceDrivers library, which was originally developed for interfacing devices with Modelica environments [Thi19]. The library conveniently allows the definition of a fixed interval for data exchange that can differ from the simulation time step. More on this choice and the integration can be found in [Mir+18]. Figure 2.5 depicts the flow of time for the co-simulation and each simulator. The power grid and market simulators compute in parallel, whereas the communication network waits for their inputs. The interactions between the simulators in each co-simulation step can be formalized by

up(n + 1) = Fc(Fm(um(n))),  (2.1)

um(n + 1) = Fc(Fp(up(n))),  (2.2)

where uc, um, and up are the corresponding input values of the simulators for the communication network, energy market, and power grid at each time step. Therefore, it is required to set the initial values up(0), um(0), and uc(0) at the beginning of the co-simulation. n denotes the current co-simulation time step. Fc (communication), Fm (market), and Fp (power grid) are the functions describing the calculation within a step.

2.4.4 Co-Simulation Runtime Interaction

Figure 2.6 shows the coupling of the simulators for their co-simulation runtime interaction with the following entities:

mosaik As already mentioned, mosaik is used for the coordination during the synchronization steps of several minutes (in simulation time) regarding all simulators [Sch19].

Market Simulator Implemented in Python, it can make use of mosaik's so-called high-level API, as illustrated in Fig. 2.6.

Communication Network Simulator Based on available DES tools, their network simulation modules are extended with inter-process communication functionalities for message exchange with mosaik.

Figure 2.5: Synchronization scheme of simulators at co-simulation time steps



Power System Simulator The integration of so-called TCPIP_Send/Recv_IO blocks from Modelica_DeviceDrivers into the Modelica models allows the exchange of simulation data via sockets, but in the form of Modelica variables as bitvectors instead of messages in JSON, an open-source and human-readable data format [ecm19]. Therefore, the MODD Server is implemented.

MODD Server It receives commands from the socket connected with mosaik. Based on these commands it starts, for example, the power grid simulator, or it receives the bitstream from Modelica_DeviceDrivers and encapsulates it into JSON messages before transferring them to mosaik. Besides the synchronization steps controlled by mosaik, there are also more fine-grained synchronization steps of fractions of a second between the power grid and communication network simulators. That is why a VILLASnode gateway is included.

Figure 2.6: Scheme of runtime interaction between co-simulation components

VILLASnode Instead of the Transmission Control Protocol (TCP), as in the case of mosaik, VILLASnode can make use of InfiniBand (IB) interconnects for data exchange between real-time simulators on different machines and of shared-memory regions on the same machine. The use of shared-memory regions and IB interconnects leads to lower latencies and consequently to shorter synchronization time steps, as shown in Chap. 8.

For more on the formalization of the SINERGIEN co-simulation and the limitations of the environment, please refer to [Mir+18].

2.5 Validation by Use Case

The proper functioning of the SINERGIEN co-simulation environment has been validated with the aid of different use case scenarios. In the use case presented in [Mir+18], it is assumed that a VPP operator tries to reduce the VPP's peak power. This behavior could be desired by the responsible DSO and come with financial incentives. Therefore, a peak-shaving algorithm is utilized for an optimal management of distributed battery storage systems.

First, simulation results obtained without the SINERGIEN environment were compared with results obtained with it, demonstrating that the results do not change under the assumption of an ideal communication network when simulating the same scenario. Furthermore, another scenario was presented in which the communication network impairs the control loop between the power grid and the market due to communication device failures. More details on the co-simulated scenarios can be found in [Mir+18], as the simulations themselves are not the focus of this work. Nevertheless, the simulation results of both scenarios demonstrate the proper functioning of the co-simulation environment.



2.6 Conclusion

The architecture of the implemented multi-domain co-simulation environment presented here shows the applicability of the CIM-based holistic data model for smart grid simulations which include the three domains: power grid, communication network, and market. The data model facilitates the use of the software environment since the domain-specific smart grid component parameters and their interconnections can be modified and stored in a self-contained topology description. Due to the SINERGIEN co-simulation approach, the user can take advantage of established domain-specific simulators for each domain.

For this purpose, new software tools have also been developed. The ModPowerSystem library can be used for scientific research on various models since Modelica as a modeling language simplifies the development and improvement of component models. Because of the increasing use of CIM-based documents for grid topology representation, the choice of Modelica led to the development of a CIM-to-Modelica mapping that is presented in Chap. 4. Besides initiating the CIM-related topics (Chap. 3 and Chap. 4), the SINERGIEN co-simulation architecture illustrates how the work on HPC Python (Chap. 7) and the integration of InfiniBand in VILLAS (Chap. 8) can be used in power system co-simulation. The work in Chap. 5 and Chap. 6, in turn, contributes to a higher performance of the simulation itself, which is accomplished by the simulators of the co-simulation environment.

In the following chapter, the automated generation of a (de-)serializer for reading and writing CIM-based documents, implemented in the mentioned CIM++ software library, is presented.


3 Automated De-/Serializer Generation

Due to growing automation in smart grids, aided by increasing digitalization and a rising number of decentralized energy systems, the actors in this area are increasingly dependent on ICT systems that must be compatible with each other, which in particular concerns the data exchange between these systems. Therefore, different countries, organizations, and vendors started to develop smart grid related standards with different foci on technical and economic aspects. Eventually, only a few national standards have been integrated into standards of the International Electrotechnical Commission (IEC) or the International Organization for Standardization (ISO) [Usl+12].

In recent years, the CIM standards (IEC 61970/61968/62325, see Sect. 2.1.2) have been the subject of numerous research activities, often related to use cases for CIM [MDC09; DK12; Wei+07]. Some of them, like the research project SINERGIEN, also introduce extensions by classes not included in original CIM, as, for instance, in [MMS13], where a methodology for modeling telemetry in power systems using IEC 61970/68 in the case of a US independent system operator is presented. There are also harmonization approaches motivated by the data exchange between energy-related software systems based on the CIM standards and ICT for substation automation based on IEC 61850 [LK17; Bec10; SRS10].

Since CIM is object-oriented, it specifies classes of objects containing information about energy system aspects as well as relations between these classes (referred to as the ontology) [GDD+06]. Currently, more and more commercial software tools in the energy sector provide import and export of CIM documents. Moreover, there are already about 200 corporate members organized in the CIM User Group (CIMug) providing CIM models for common visual Unified Modeling Language (UML) editors [CIM].

This high acceptance among companies and institutions has pushed the adoption of CIM also in the simulation environment, as presented in Chap. 2 with respect to the SINERGIEN co-simulation environment, where the data format of the multi-domain component-based co-simulation model (referred to as the topology) is based on CIM. As the topology, including power grid and communication network components as well as energy market actors, evolves continuously, high compatibility, updatability, and extensibility of the chosen data model are key requirements.

The object-oriented design with concepts such as inheritance, associations, aggregations, etc. led to a CIM data encapsulation format referred to as RDF/XML [IEC06], which comes from the area of the semantic web [AH11] and is not as common in other domains. This, together with the huge specification of CIM with hundreds of classes and relations between them (which makes CIM very extensible and universally applicable in comparison to other, more specific and static data models), can have a deterrent effect on new users. Furthermore, continuously keeping CIM-based software up to date can be too effortful, especially in the scientific and academic area. These could be the reasons why there are hardly any software libraries for handling CIM documents.

Therefore, in this chapter an automated (de-)serializer generation from UML-based CIM ontologies is presented. The approach was introduced in [Raz+18a] and implemented in a chain of tools for generating an open-source library libcimpp within the CIM++ software project [FEI19a]. libcimpp can be used for reading CIM RDF/XML documents directly into CIM C++ objects (called deserialization) and is currently also being extended for serialization (i. e., writing of CIM C++ objects from memory to RDF/XML documents). Due to a model-driven architecture (MDA), libcimpp can be adapted to new CIM versions or user-specified CIM-based ontologies in an automated way. For this purpose, the approach makes use of a common visual UML editor and our CIM++ toolchain, which generates a complete compilable CIM C++ codebase from given CIM UML models (i. e., CIM profiles) which are kept up to date (e. g., by the CIMug). It is also shown how this CIM C++ codebase can be used for holding the deserialized CIM objects as well as for an automated generation of C++ code for exactly this deserialization. Hence, if the CIM C++ codebase changes (because of changes in the CIM UML), there is no need to adapt the code of libcimpp by hand.



The direct deserialization into C++ objects makes the library very easy to apply because its user neither needs any CIM RDF/XML knowledge nor has to handle intermediate representations of the CIM RDF/XML document, such as a Document Object Model (DOM) in combination with the Resource Description Framework (RDF) syntax. For instance, in the case of a power grid topology stored in CIM documents, a power grid simulator can directly access the CIM objects, deserialized by libcimpp, in the form of common C++ objects.

The chapter gives a short introduction to data formats as well as other components used in CIM++, followed by an overview of the overall concept. Then it explicitly describes how the Common Information Model (CIM) is automatedly mapped to compilable C++ code, which is used by the CIM++ Deserializer (i. e., libcimpp) during the so-called unmarshalling step, explained subsequently together with its automated generation. Following this, the final libcimpp is introduced. Finally, the chapter is concluded by a roundup and an outlook on future work. The work in this chapter has been partially presented in [Raz+18a]1.

3.1 CIM Formalisms and Formats

An introduction to CIM is provided in Sect. 2.1.2. CIM makes use of several formalisms and formats, which are explained in the following.

UML

UML is a well-established formalism for graphical object-oriented modeling [RJB04]. In CIM, only UML class diagrams with attributes and inheritance as well as associations, aggregations, and compositions with multiplicities are used. The CIM UML contains no class methods since CIM defines just the semantics of its object classes and their relations, without any functionality of the objects, in order to specify which kind of information a CIM object contains.

CIM UML diagrams can, as other UML diagrams, be edited by visual UML editors and stored in a proprietary or open format like XML Metadata Interchange (XMI) [KH02]. Conveniently, the CIMug provides such CIM model drafts [CIM]. While UML resp. XMI is used for the definition of all classes with their attributes and the relations among them, the actual objects (i. e., instances of these classes) are stored in the form of RDF/XML documents.

1 Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Computer Science – Research and Development ("Automated deserializer generation from CIM ontologies: CIM++ — an easy-to-use and automated adaptable open-source library for object deserialization in C++ from documents based on user-specified UML models following the Common Information Model (CIM) standards for the energy sector", Lukas Razik, Markus Mirz, Daniel Knibbe, Stefan Lankes, Antonello Monti), © (2017)

XML and RDF

The Extensible Markup Language (XML) is a widely used text-based formalism for human- and machine-readable documents [Bra+97]. In general, XML documents have a tree structure, which is why XML itself is not well suited for representing arbitrary graphs. Therefore, it is combined with RDF [Pan09]. RDF provides triples of the form "<Subject> <Predicate> <Object>", which allow representing a relation (<Predicate>) between resources (<Subject> and <Object>). Therefore, links (i. e., instances of associations, aggregations, etc.) between CIM objects, as specified in the UML ontology, can be expressed in RDF/XML.

For instance, in List. 3.1 the object of class BatteryStorage has an rdf:ID (line 7) which is referenced in the Terminal (line 5) via the RDF/XML attribute rdf:resource="#BS7". A brief introduction to CIM and its key concepts is provided by [McM07].

Listing 3.1: Snippet of a CIM document representing an IEEE European Low Voltage Test Feeder with an additional BatteryStorage

 1 <cim:Terminal rdf:ID="BADCAB1E">
 2   <cim:IdentifiedObject.name>T1
 3   </cim:IdentifiedObject.name>
 4   ...
 5   <cim:Terminal.ConductingEquipment rdf:resource="#BS7"/>
 6 </cim:Terminal>
 7 <cim:BatteryStorage rdf:ID="BS7">
 8   <cim:Equipment.EquipmentContainer rdf:resource="#C7"/>
 9   <cim:IdentifiedObject.name>Battery-1
10   </cim:IdentifiedObject.name>
11   <cim:BatteryStorage.nominalP>5000
12   </cim:BatteryStorage.nominalP>
13   <cim:BatteryStorage.ratedU>400
14   </cim:BatteryStorage.ratedU>
15   ...
16 </cim:BatteryStorage>

XML Parsers

There are three common types of pure XML parsers [Fri16; HR07; KH14]. During parse time, a so-called DOM parser generates a tree-like structure with strings of the whole document, which can be very memory demanding. For further processing, the particular strings have to be picked out manually and interpreted, i. e., converted to the desired data types. To avoid loading a whole document into memory, StAX parsers (a kind of pull parser [Slo01]) can be used. They are a compromise solution between DOM and Simple API for XML (SAX) parsers, as they allow random access to all elements within a document. SAX parsers are the most commonly used. They traverse XML documents linearly and trigger event callbacks at certain positions. Because one linear reading of the CIM document is sufficient for its deserialization, a SAX parser is used.

C++ Source Code Analysis

For C++ source code analysis, correction, and adaptation, which are needed in several steps of the automated generation, a compiler front-end was chosen. It can transform source code into a so-called abstract syntax tree (AST) [Aho03]. With further functionalities provided by the compiler front-end, e. g., static code analysis [Bou13], source code manipulations can be performed. One of the conceptual ideas is to use the data from the AST as input for a template engine.

Template Engines

Template engines are mainly used for the generation of dynamic web pages [Fow02; STM10]. The core idea behind them is to separate static content (e. g., HTML code defining the structure of a web page) from dynamic data (e. g., the actual web page content). Therefore, the static part can be written in template documents with placeholders that are filled by the template engine with data from a database, as described in Sect. 3.4.4.

3.2 CIM++ Concept

A conceptual overview of the automated (de-)serializer generation from CIM UML is presented in Fig. 3.1. The upper part of the diagram shows the automated code generation process from the definition of the ontology in CIM UML to the (un-)marshalling code generation of the CIM++ (De-)Serializer libcimpp. The lower part shows the deserialization process from a given topology (based on the specified CIM ontology) to CIM C++ objects. The CIM-based specification, which represents classes and their relations in UML, is loaded with a visual UML editor and transformed to a C++ codebase. Before this C++ codebase can be included by the (de-)serializer's source code (i. e., libcimpp), it is adapted by the developed CIM++ code toolchain to compilable C++ code, as the original CIM C++ codebase is not complete, as explained later. This adapted codebase is then used by the CIM++ (Un-)Marshalling Generator for generating the unmarshalling code needed for the CIM++ (de-)serializer. Originally, only deserialization was implemented in libcimpp, but serialization is currently being implemented, as this concept can be applied in both directions. The code toolchain as well as the (un-)marshalling generator make use of a compiler front-end, and the latter also makes use of a template engine that gets its data from abstract syntax trees created by the compiler front-end while reading in the adapted CIM++ codebase. Afterwards, the template engine can fill the data about the codebase into the (un-)marshalling code templates. After all these automated steps, which can be repeated whenever the CIM-based specification in UML form is visually modified, the CIM++ deserializer can be compiled to a library.

Figure 3.1: Overall concept of the CIM++ project

This CIM++ (de-)serializer library (libcimpp) can be used by C++ programs for reading (by deserialization into C++ objects) and writing (by serialization of C++ objects) CIM documents. In the shown topology editor screenshot, for instance, all components of a grid with their links (i. e., the grid's topology corresponding to the previously defined CIM specification) are stored by a topology editor in one or more CIM RDF/XML documents. These documents can be directly transformed to C++ objects by libcimpp.

C++ was chosen as the programming language because of its high execution time and memory space efficiency and in order to be directly compatible with programs written in C++. Before the automated generation of (un-)marshalling code is introduced, the mapping of the CIM UML specification to the adapted and therefore compilable C++ codebase is presented.

3.3 From CIM UML to Compilable C++ Code

With visual UML editors, the CIM model can be rapidly modified or extended to individual requirements. Moreover, many tools follow MDA approaches, making round-trip engineering (RTE) possible. RTE in relation to UML allows the user to keep UML models and related source code consistent by two software engineering principles: forward engineering, where changes to UML diagrams lead to an automated adaptation of the belonging source code, and reverse engineering (if supported by the UML editor), where changes to source code lead to an automated adaptation of the belonging UML diagrams [Dav03; Reu+16]. In our case, these principles provide the ability for incremental development of CIM ontologies (i. e., data models based on CIM) and the automatically generated CIM C++ codebases. This leads to better software documentation and compatibility between different (distributed) software development entities.

Figure 3.2: UML diagram of the HydroPowerPlant class, whose instances can be associated with no more than one Reservoir instance

Unfortunately, there are no standardized canonical mappings between UML associations, aggregations, compositions, etc. on the one side and object-oriented programming (OOP) languages on the other. Therefore, different C++ code representations for the CIM UML aspects had to be chosen for the code generation. For instance, in the case of no multiplicity, the chosen representation of an association is a pointer to an instance of the associated class. In the case of a possible multiplicity greater than 1, it is a pointer to a Standard Template Library (STL) list of pointers to instances of the associated class.

The CIM UML specification of the HydroPowerPlant class is partly presented in Fig. 3.2. Since the given multiplicity of the aggregated HydroPump objects can be greater than one, a list is used for the belonging HydroPumps attribute in the generated code, as depicted in List. 3.2. The HydroPowerPlant aggregates one or more HydroPump instances. Furthermore, there can be multiple HydroPowerPlant instances associated with a Reservoir, and a HydroPump can also exist without being aggregated by a HydroPowerPlant.

Listing 3.2: Snippet of the HydroPowerPlant class

class HydroPowerPlant : public
        IEC61970::Base::Core::PowerSystemResource
{
public:
    std::list<IEC61970::Base::Generation::Production::HydroPump*>*
        HydroPumps;
    IEC61970::Base::Generation::Production::Reservoir* Reservoir;
    ...
};

Inheritance in CIM UML can be easily represented by C++ inheritance. Due to the fact that no operations are defined in CIM UML, i. a., the generated standard constructors are empty, there are no further class functions, and all UML-defined attributes stay uninitialized. The classes defined in the CIM standard as primitive types (see also Sect. 3.3.3) are generated as empty classes. In the case of the used code generator, and most likely most others, the generated enum types are not strongly typed and therefore have no scope. Besides these circumstances, due to the CIM UML standards in conjunction with C++, the generated code also comes with some software-technical deficiencies. For instance, the #include directive for the chosen std::list container is not inserted automatically, etc.

The mentioned facts lead to source code files that could not be directly used for the subsequent automated (un-)marshalling code generation. Therefore, the following solution approaches were also considered: Replacing C++ by any other programming language would not guarantee a solution of any mentioned issues. Writing a new generation tool would result in an additional sophisticated software project just for the special case of generating C++ code from a machine-readable CIM UML representation such as XMI. Therefore, a cost-benefit analysis led to the decision to develop a toolchain for automated code correction and adaption based on existing, widely used transformation and C++ refactoring techniques. Thus, the developed toolchain should be easily adaptable for the usage with different general-purpose CIM UML to C++ code generators. A source code transformation by hand, e. g., in case of IEC 61970 / 61968 on around 2000 source files, would be a cumbersome and error-prone task. The demands on the generated code after its transformation by the toolchain are hierarchical includes of all header files as well as an adequate usage of the chosen container class (i. e. std::list). Furthermore, a common BaseClass for all CIM classes is needed, as will be shown later.

3.3.1 Gathering Generated CIM Sources

The first steps are performed by the CIM-Include-Tool, which groups together all C++ source files created from CIM UML by the code generator of the visual UML editor. The tool scans all source files written by the code generator for the container class chosen for associations with multiplicities and adds missing header includes (here, i. e., #include <list>). In case of the used code generator, all files are grouped together according to the CIM packages. For instance, the definition of the IEC 61970 class Terminal, located in the package Base::Core, is stored in the directory path IEC61970/Base/Core. This is why all occurrences of

#include "Terminal.h"

are transformed to


Chapter 3 Automated De-/Serializer Generation

#include "IEC61970/Base/Core/Terminal.h"

for keeping the hierarchical structure of all directories and files [Daw; ISO14].

3.3.2 Refactoring Generated CIM Sources

After that, the CIMRefactorer, based on the Clang LibTooling library, is executed. Clang is a compiler front-end supporting C/C++, developed within the LLVM compiler infrastructure project [LLV]. During parse time, the library creates an AST containing objects that represent the whole source code, like declarations and statements [DMS14]. For AST traversal, the visitor design pattern [KWS07] is used to evaluate and process the AST. Due to the usage of a visitor pattern, the implementation of the so-called composite does not need to be adapted if its processing has to be changed or extended. If a new action has to be performed on the AST, only a new visitor has to be implemented. Clang provides the class template clang::RecursiveASTVisitor<Derived> for this need. It is derived with an appropriate implementation of a visitor class given as template parameter, as pictured in Fig. 3.3 for an example visitor MyASTVisitor. By design, there also exists the MyASTConsumer class inheriting from clang::ASTConsumer, which determines the entry point of the AST. It calls TraverseDecl() of the AST visitor, which then calls the appropriate methods of the given MyASTVisitor.

The CIM models provided by the CIMug include UML enumerations, e. g., for units and multipliers. Thereby, several enumerations contain the same symbols. For example, the enumeration UnitSymbol contains the enumerator m as unit for meters, while the enumeration UnitMultiplier contains the enumerator m as SI prefix for milli. Since C++ requires unique

[Figure 3.3: UML diagram of the class MyASTVisitor. MyASTVisitor, with VisitDecl(Decl* D) and VisitStmt(Stmt* S), derives from clang::RecursiveASTVisitor, which provides TraverseDecl(Decl* D). MyASTConsumer derives from clang::ASTConsumer, overrides HandleTopLevelDecl(clang::DeclGroupRef DR), holds a std::unordered_set<std::string> Locations, and owns one MyASTVisitor.]


symbols, which is not true for the symbol m declared twice, the generated code with unscoped enum types is incorrect. Therefore, VisitDecl() i. a. adds the class keyword to all visited unscoped enumerations. However, this does not define them as classes; it is just a reuse of an existing C++ keyword. Furthermore, the visitor checks each statement for whether any used data type is an enumeration and adds its corresponding scope as prefix.

Hence, e. g., the unscoped enumeration with a corresponding assignment statement

enum UnitSymbol { F, ... }
...
const IEC61970::Base::Domain::UnitSymbol Capacitance::unit = F;

is adapted by VisitDecl() to a strongly typed enumeration

enum class UnitSymbol { F, ... }
...
const IEC61970::Base::Domain::UnitSymbol Capacitance::unit =
    IEC61970::Base::Domain::UnitSymbol::F;

with the needed scope in the assignment statement. Such modifications are not performed by the visitor on existing code directly but are temporarily stored in the designated container provided by Clang for later usage, to avoid invalid ASTs.
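As a minimal self-contained illustration of why the refactoring is needed (not part of the generated codebase): with scoped enumerations, the enumerator m may appear in both UnitSymbol (meters) and UnitMultiplier (milli) without a redeclaration conflict.

```cpp
#include <cassert>

// With plain "enum", the two declarations of `m` below would clash in the
// enclosing scope. With "enum class", each enumerator lives in its own scope.
enum class UnitSymbol { m, F };
enum class UnitMultiplier { m, k };
```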

Furthermore, initialization lists are added to the standard constructors for all class attributes, which are provided by Clang together with their belonging data types. Also, all pointer operators * to the chosen container type in case of associations with given multiplicities are removed. The list of attributes with their data types specified in the visited class declaration is also provided by Clang. Thus, such associations are finally represented as lists of pointers.

std::list<IEC61970::Base::Generation::Production::HydroPump*> HydroPumps;

Almost all of the thousands of CIM headers include other headers, which would lead to many repeatedly visited declarations and consequently to very long execution times. As already mentioned, MyASTConsumer defines the entry points of the AST, which are the top-level declarations of the CIM C++ headers. A top-level declaration is not included in another declaration. Hence, each top-level declaration is traversed in order to visit all nodes of the AST. During this, the position of each node in the source code is stored in a hash table with an average-case time complexity for search operations of O(1). As a result, in case a declaration's position is already contained in the table, the declaration is ignored.
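This deduplication step can be sketched as follows. The LocationFilter class and the "file:line:column" strings are illustrative stand-ins for Clang's source locations; the actual CIMRefactorer code differs.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Sketch: skip declarations whose source position has already been visited,
// mirroring the O(1) average-case hash-table lookup described in the text.
class LocationFilter {
public:
    // Returns true if the declaration at `loc` is seen for the first time.
    bool shouldVisit(const std::string& loc) {
        // insert() returns {iterator, bool}; bool is false for duplicates.
        return visited_.insert(loc).second;
    }

private:
    std::unordered_set<std::string> visited_;
};
```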


3.3.3 Primitive CIM Data Types

The CIM standards do not only define classes for virtual instances of real objects but also the so-called primitive (data) types String, Integer, Float, Decimal, and Boolean, which correspond to intrinsic data types of many programming languages. All other CIM data types are classes that can contain these primitive types. For the CIM classes representing such primitive types, just empty skeletons are generated, with the result that they must be implemented depending on their aim, which can differ between different CIM or libcimpp users. In the used CIM model there is also the Decimal type, which is not specified as primitive (in present CIM standards) but used like the four others and is thus handled by the toolchain like a primitive type. Thus, two different methods for the implementation of primitive types have been discussed: simple type definitions, e. g., with typedef on intrinsic C++ data types, and the implementation of C++ classes.

For the unmarshalling step (explained in Sect. 3.4.3), it is mandatory that the class attributes provide reading from C++ string streams. Since a design decision was to throw exceptions when trying to read from never-defined CIM class attributes, primitive types were implemented in the form of classes (and not, e. g., just typedefs on intrinsic C++ types). Moreover, in case of numeric data types, a sufficient precision can only be guaranteed since C++11, which is already the standard used for CIM++ because of other language features.

The primitive String type is based on std::string since it can store arbitrarily long UTF-8 encoded strings, as required by the standard. The integral type Integer is implemented based on long, whose size depends on the used platform but usually is 32 bit, which should be sufficient in most cases. Float is CIM's floating-point number type, for which the double type was chosen instead of float, as a sufficient accuracy is in case of CIM more important than a higher runtime performance. All these types already provide reading in from streams. Boolean is based on bool, which also provides reading in from streams, but only in case of the digits 0/1 and not in case of the words true/false as used in CIM RDF/XML documents. Therefore, it was implemented with appropriate stream and cast operators, which make it i. a. comparable with other types. Decimal was implemented based on std::string to keep the read value as it is, because of the standard's requirement that it should be able to represent a base-10 real number without any restriction on its length. Afterwards, it can be converted by the libcimpp user, e. g., into an arbitrary-precision representation such as provided for example by the
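A minimal sketch of this idea, with illustrative names rather than the actual libcimpp implementation: a Boolean wrapper that reads the words true/false from a stream (as found in CIM RDF/XML documents) and throws when a never-defined attribute is accessed.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Illustrative primitive-type class: tracks whether the attribute was ever
// set and throws on access otherwise, as the design decision in the text
// demands. Not the exact libcimpp code.
class Boolean {
public:
    bool initialized = false;
    bool value = false;

    operator bool() const {
        if (!initialized)
            throw std::runtime_error("Boolean attribute never defined");
        return value;
    }
};

// Stream operator interpreting the textual form used in CIM documents.
std::istream& operator>>(std::istream& is, Boolean& b) {
    std::string token;
    is >> token;
    if (token == "true")       { b.value = true;  b.initialized = true; }
    else if (token == "false") { b.value = false; b.initialized = true; }
    else                       { is.setstate(std::ios::failbit); }
    return is;
}
```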


Multiple-Precision Binary Floating-Point Library With Correct Rounding (MPFR) [Fou+07].

Overall CIM C++ Source Code Transformation

In addition to the previously described procedures and the implementation of primitive data types in the form of a patch, a couple of code-fixing patches are also applied to the generated CIM C++ code. Besides software-technical details, like a correction of definitions inside the IEC61970Version class or making all source files Portable Operating System Interface (POSIX) conform [IEE18], some conceptual issues also have to be solved. As the CIM standards define an enumerated type for three-digit currency codes of ISO 4217, which can have a leading 0, these are interpreted in C++ code as octal numbers, which is why such leading zeros must be removed. Moreover, CIM defines an attribute switch, which in C++ is a keyword. Therefore, the attribute is renamed, which must be taken into account later on during the unmarshalling step of the deserializer, for reading in the attribute by its original name. Afterwards, the code is checked for its compilability with clang-check, which could be done by any C++ compiler, too; this check also serves to detect errors when the CIM standard is changed or extended with the aid of a visual UML editor. If the check is successful, the code can be used for the automated CIM++ (de-)serializer generation. Finally, the documentation generator Doxygen is applied on the now compilable CIM C++ code as support for the CIM++ user [FEIe].

3.4 Automated CIM (De-)Serializer Generation

With a UML to source code generator and the previously introduced toolchain, the CIM UML model is transformed into a compilable codebase which can be used as a CIM data model with instantiatable C++ objects. These objects can then be filled with data read from a CIM RDF/XML document by a common XML parser with the aid of automatically generated unmarshalling routines. Alternatively, the objects can be filled or modified by C++ statements and serialized into a CIM RDF/XML document. However, for being able to store CIM C++ objects in (e. g. STL) containers, further work needs to be done.

3.4.1 The Common Base Class

During reading of CIM RDF/XML documents, the thereby created CIM objects are stored on the heap and therefore referenced by pointers, which


are collected in a list container. Due to the fact that STL containers store items of one single type, the concept of base class pointers is used. This means that objects of derived classes can also be referenced by pointers of their base classes. The motivation is that not all CIM classes inherit from the CIM class IdentifiedObject. Due to the absence of a common base class for all CIM classes, it is not possible to collect them in a container of objects of one base class.

Several solutions for this issue have been discussed. One possibility would be to use typeless pointers (void*), but as C++ is a strictly typed language, this was rejected. Another possibility is the use of a container type like boost::any from the Boost.Any library [Hen]. But to remain with the STL, to keep it simple for the CIM++ user, and to avoid further software dependencies, each top-level CIM class (i. e. a class without super class) derives from a newly introduced BaseClass. As a consequence, it is the base class for all CIM classes and thus is added to the CIM C++ codebase by the previously introduced CIMRefactorer.
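The base-class-pointer concept can be sketched as follows. Only the names BaseClass, IdentifiedObject, and Terminal follow the text; the class bodies are reduced to a minimum and GeographicalRegion merely stands for some CIM class that does not inherit from IdentifiedObject.

```cpp
#include <cassert>
#include <list>
#include <string>

// The common base class is empty apart from a virtual destructor, so that
// dynamic_cast works and derived objects can be deleted through base pointers.
class BaseClass {
public:
    virtual ~BaseClass() = default;
};

class IdentifiedObject : public BaseClass {
public:
    std::string name;
};

class Terminal : public IdentifiedObject {};

// Stand-in for a CIM class without IdentifiedObject in its ancestry.
class GeographicalRegion : public BaseClass {};
```

With this, heterogeneous CIM objects fit into one std::list<BaseClass*>, regardless of whether they descend from IdentifiedObject.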

3.4.2 Integrating an XML Parser

Basically, CIM RDF/XML can be read by any XML parser. As already described in Sect. 3.2, RDF extends XML i. a. by the possibility of referencing other elements from within an XML element. There are a couple of libraries for RDF handling, such as Apache Jena for Java [Apa] and the Redland RDF Libraries written in C [Bec]. The relevant Redland libraries are librdf, the actual RDF/XML parser, and libraptor, providing data access by RDF triples. Redland's implementation is similar to a DOM parser: data from RDF/XML documents is stored in its own container residing in the main memory. However, the main goal of CIM++ is to deserialize the CIM objects stored in RDF/XML into C++ objects. Consequently, all CIM data already stored in an intermediate format would need to be copied into the objects instantiated according to the defined CIM C++ classes. Therefore, the choice fell on a SAX parser which, with a succeeding unmarshalling step, can directly fill the read CIM RDF/XML data into the CIM C++ objects.

The first versions of the developed libcimpp library were using the event-based SAX parser of libxml++ [Cum], which is a C++ wrapper for the well-established libxml library. In the current libcimpp version, libxml++ was replaced by the Arabica XML Toolkit, which comes with uniform SAX wrappers for several XML parsers [Hig], making libcimpp usable on different Unix-like operating systems as well as on Windows. All event-based SAX parsers provide callback functions called whenever a certain event occurs during parse time. In case of libcimpp, these methods call the


unmarshalling code, which instantiates proper CIM C++ objects and fills them with the read data.

Whenever a new opening XML tag is encountered, startElement() is called, which gets the XML tag and its attributes that will be stored for later use. If the tag represents a CIM class, an object of this class is instantiated on the heap and referenced by a BaseClass pointer, which is pushed onto a stack and later on popped from the stack by a call of endElement(). If an opened XML tag contains an RDF attribute which refers to another CIM object, a task is created and inserted into a task queue. This can be the case in all kinds of CIM UML associations. Finally, at the end of the document, endDocument() is called, which processes all tasks of the task queue. These tasks connect associated CIM objects by pointers. Therefore, all objects of the CIM document have to be instantiated before their pointers can be set correctly. Furthermore, a certain routine is called whenever the SAX parser encounters characters which represent no XML tag. These characters and the uppermost elements of the tag and the object stack are passed to an assignment function, which interprets the characters as values and tries to assign them to the proper attributes of the belonging CIM object.
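A much-simplified skeleton of this event handling is sketched below. The method signatures are hypothetical (the real SAX callbacks receive full tag and attribute data, and task resolving links actual objects); only the push/pop and deferred-task structure follows the description above.

```cpp
#include <cassert>
#include <stack>
#include <string>
#include <vector>

struct BaseClass { virtual ~BaseClass() = default; };

// Placeholder for the real Task class: just remembers the referenced rdf:ID.
struct Task { std::string rdfId; };

class Handler {
public:
    std::vector<BaseClass*> objects;     // all deserialized objects
    std::stack<BaseClass*> objectStack;  // currently open CIM elements
    std::vector<Task> taskQueue;         // deferred association links
    int resolved = 0;

    // isCimClass: the tag names a CIM class (vs. an association reference).
    void startElement(bool isCimClass, const std::string& rdfRef) {
        if (isCimClass) {
            BaseClass* obj = new BaseClass;
            objects.push_back(obj);
            objectStack.push(obj);
        } else if (!rdfRef.empty()) {
            taskQueue.push_back({rdfRef});  // resolve later in endDocument()
        }
    }

    void endElement(bool isCimClass) {
        if (isCimClass && !objectStack.empty()) objectStack.pop();
    }

    void endDocument() {
        // All objects exist now, so associations can be linked safely.
        for (const Task& t : taskQueue) { (void)t.rdfId; ++resolved; }
        taskQueue.clear();
    }

    ~Handler() { for (BaseClass* o : objects) delete o; }
};
```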

3.4.3 Unmarshalling

The previously introduced assignment functions form the core of the unmarshaller. Since the CIM UML model is transformed into a correct, compilable CIM C++ codebase, it is possible to map XML elements with their contents into the previously instantiated CIM C++ objects. For this purpose, a proper mapping function was defined, which will be exemplarily described by the CIM RDF/XML snippet shown in List. 3.1. For instance, a function has to assign the name of the Terminal element (List. 3.1 line 2) to the name attribute of the corresponding C++ object, which is an instance of the Terminal class that inherits the attribute from IdentifiedObject, whose code snippet is shown in List. 3.3. The simplest way in general would be using reflection of the programming language, which is the ability to examine, introspect, and modify its own structure and behavior at runtime [DM95]. Reflection in OOP languages i. a. allows “looking” into an object. For instance, that would allow the program to check if a certain object has the attribute name and access it at runtime. Usually, it would also be possible to iterate through all attributes of an object, entirely independently of their types. Contrary to dynamic


Listing 3.3: Snippet of the CIM C++ class IdentifiedObject.

class IdentifiedObject {
public:
    IdentifiedObject();
    IEC61970::Base::Domain::String name;
    ...
};

Listing 3.4: Assignment function for IdentifiedObject.name

bool assign_IdentifiedObject_name(std::stringstream& buffer,
                                  BaseClass* base_class_ptr) {
    if (IEC61970::Base::Core::IdentifiedObject* element =
            dynamic_cast<IEC61970::Base::Core::IdentifiedObject*>(base_class_ptr))
    {
        buffer >> element->name;
        ...
    }
}

programming languages such as Python, which provide reflection and also object runtime alteration [ŠD12; Chu01], C++ by design provides only very limited reflection mechanisms. Without additional programming effort, only information like, e. g., the object's type identifier can be queried at runtime, which is no solution in this context. There are methods to extend C++ by reflection mechanisms with the aid of libraries adding meta information, but such an approach would increase the complexity of the CIM++ project significantly, add further dependencies, and also deteriorate its maintainability and flexibility. Instead, Clang's LibTooling is used for generating the mapping functions based on information provided by the previously adapted CIM C++ codebase.

A mapping function needs the object whose attribute has to be accessed, the attribute's name, and the character string which has to be interpreted and assigned to the attribute. In List. 3.1 line 2, the attribute is identified by cim:IdentifiedObject.name, where cim is the namespace. By implication, a mapping function calls an appropriate assignment function which, for the given case, is presented in List. 3.4. If the dynamic_cast is successful, the stream operator, which was previously implemented for all primitive types, is used for interpreting the given characters as the proper value and for its subsequent assignment to the attribute.


In addition to primitive types, there are also CIM classes which are no data types but are used similarly in CIM-based CGMES documents [ENT16] and, in the context of OOP, are called structured data types. Apart from a value attribute, these classes just contain members representing enumerated types, units, or multipliers. CIM's Domain package defines most of these classes, such as Base::Domain::Voltage with the attributes value of the type IEC61970::Base::Domain::Float, multiplier of the type Base::Domain::UnitMultiplier, and unit of the type Base::Domain::UnitSymbol. According to the presented assignment function, the assignment for an attribute nominalVoltage of the type Base::Domain::Voltage would be:

buffer >> element->nominalVoltage.value;

Since similar assignment functions have to be implemented for all such attributes, they are generated with the aid of a template engine by the unmarshaller generator explained in Sect. 3.4.4. In case of IEC 61970 only, more than 3000 assignment functions are generated. To find the right one by if-branches at runtime would lead to an average-case time complexity of O(n) for each assignment, with n being the total number of assignment functions. For improving the performance, a kind of dynamic switch statement was implemented. For this, pointers to all assignment functions are stored in a hash table with the attributes' names as keys. Therefore, lookups in the hash table have an average time complexity of O(1).
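The dynamic switch can be sketched as follows. Apart from assign_IdentifiedObject_name, which follows List. 3.4 (with flattened namespaces), all names and the dispatch helper are illustrative, not the generated libcimpp code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

struct BaseClass { virtual ~BaseClass() = default; };
struct IdentifiedObject : BaseClass { std::string name; };

// Uniform signature shared by all generated assignment functions.
using AssignFn = bool (*)(std::stringstream&, BaseClass*);

bool assign_IdentifiedObject_name(std::stringstream& buffer,
                                  BaseClass* base_class_ptr) {
    if (auto* element = dynamic_cast<IdentifiedObject*>(base_class_ptr)) {
        buffer >> element->name;
        return true;
    }
    return false;
}

// The "dynamic switch": attribute tags map to function pointers, so each
// lookup is an O(1) average-case hash-table access instead of O(n) branches.
std::unordered_map<std::string, AssignFn> assign_map = {
    {"cim:IdentifiedObject.name", assign_IdentifiedObject_name},
};

bool dispatch(const std::string& tag, std::stringstream& buf, BaseClass* obj) {
    auto it = assign_map.find(tag);
    return it != assign_map.end() && it->second(buf, obj);
}
```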

Before any assignment can take place, the proper objects have to be instantiated. As already described, this happens when a new opening XML tag is encountered. In case of <cim:Terminal rdf:ID="BADCAB1E">, a new object on the heap is instantiated by new Base::Core::Terminal. The mapping of such an XML tag to its line of code is also done with the aid of the dynamic switch statement concept. For each CIM class, there is a function instantiating respective objects. These functions are part of a factory design pattern [Ale01] implemented in the CIMFactory class, which is part of libcimpp. The object's rdf:ID is stored as key in a hash table together with a pointer to the object for later task resolving. The Task class has a resolve() method which is called for setting the association between objects as mentioned before. During construction, a Task instance gets the CIM object which represents the end of the regarding association together with the association's identifier. The identifier is the XML tag belonging to the association. To resolve a task in resolve(), the rdf:ID is looked up in the hash table for getting the address of the associated CIM object. Afterwards, a set of assignment functions is used to link the objects together.
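A minimal sketch of the factory lookup and the rdf:ID registry: Terminal_factory follows List. 3.6 (with flattened namespaces), while the surrounding scaffolding is illustrative rather than the actual CIMFactory implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct BaseClass { virtual ~BaseClass() = default; };
struct Terminal : BaseClass {};

using FactoryFn = BaseClass* (*)();

// Generated per CIM class (cf. List. 3.6).
BaseClass* Terminal_factory() { return new Terminal; }

// XML tag -> factory function: the same dynamic-switch concept as for the
// assignment functions.
std::unordered_map<std::string, FactoryFn> factory_map = {
    {"cim:Terminal", Terminal_factory},
};

// rdf:ID -> object address, filled while parsing, used for task resolving.
std::unordered_map<std::string, BaseClass*> id_map;
```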


3.4.4 Unmarshalling Code Generator

In the previous section, the unmarshalling process of libcimpp was described. The developed CIM-Unmarshalling-Generator application generates C++ code for the introduced classes Task and CIMFactory as well as for the assignment functions. The step is performed with the aid of the CTemplate engine [Spe].

Each template engine needs a data source for template file rendering. To remain as independent as possible of any tools, no proprietary format containing the CIM model was used. It would be possible to export the available CIM model to an open format like XMI, but this approach was rejected for different reasons: like the code generation, the XMI export of the used visual UML editor can have inadequacies, too. The present corrected and adapted CIM C++ codebase already contains all needed information about the given CIM model and can be used as input for the template engine's database. Thus, subsequent manual changes to the CIM C++ codebase can also be considered by the CIM++ toolchain. Therefore, the database needed for the template engine is built from the CIM C++ codebase.

As already mentioned, the introduced class CIMFactory creates instances of CIM classes that are requested by their names. Therefore, appropriate functions are needed for each CIM class. These functions can be expressed by a template snippet presented in List. 3.5. There, {{#FACTORY}} begins a reiterative section, and {{CLASS_NAME}} as well as {{QUAL_CLASS_NAME}} are placeholders which are replaced at render time by values read from the database. Based on this template, the CIM-Unmarshalling-Generator creates the appropriate function for each CIM class. For this purpose, the AST visitor creates a section dictionary for each class definition it finds in the CIM C++ files, since the CTemplate engine works with dictionaries to set the placeholders at render time. The final code for Terminal after the so-called rendering by the template engine is shown in List. 3.6.

The AST visitor also has access to a whitelist in which all CIM classes that are used like data types (i. e. they just occur in attribute declarations

Listing 3.5: Snippet of CIMFactory template

{{#FACTORY}}
BaseClass* {{CLASS_NAME}}_factory() {
    return new {{QUAL_CLASS_NAME}};
}
{{/FACTORY}}


Listing 3.6: Automatically generated Terminal_factory()

BaseClass* Terminal_factory() {
    return new IEC61970::Base::Core::Terminal;
}

of other CIM classes) are listed and, as a consequence, are not directly instantiated. For these classes, no sections are generated.

The function which initializes the hash table of the CIMFactory is also part of the shown template and filled with the aid of the created section dictionaries. The template for Task contains sections for attributes of a pointer type or a list of pointers (in case of given multiplicities greater than 1).

Although associations in CIM are generally modeled in form of bidirectional links, in typical CIM RDF/XML documents they are implemented as unidirectional relations. Therefore, this is done analogously with CIM C++ objects. An example is the association of the class Terminal with ConnectivityNode. In C++, this association is realized as a pointer attribute of Terminal. In CIM RDF/XML documents, it is realized in form of the tag cim:Terminal.ConnectivityNode with an RDF reference to the RDF ID of an object of the class ConnectivityNode. For resolving a corresponding task, a function is needed which assigns the address of the object referenced by the given RDF ID to the attribute ConnectivityNode of the class Terminal.

Compositions are not used in the available CIM model, but there are many aggregations, which are unidirectional, too. Nevertheless, the CIM C++ implementation of aggregations expressed in CIM RDF/XML is not that straightforward. In C++, the aggregating object contains an attribute of the type pointer or a list of pointers which point(s) to the aggregated object(s). The XML document, however, contains XML tags which are part of elements embedded in the aggregated objects. These aggregated objects contain RDF references to their aggregating object. Therefore, functions are needed which assign pointers to the aggregated objects to the pointers or lists of pointers of the aggregating objects.
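A sketch of such a linking function, assuming strongly reduced class definitions; the list attribute is named Terminal_ here only to avoid confusion with the type name, and the function is illustrative, not the generated code.

```cpp
#include <cassert>
#include <list>

struct BaseClass { virtual ~BaseClass() = default; };

struct Terminal;
struct TopologicalNode : BaseClass {
    std::list<Terminal*> Terminal_;  // pointers to the aggregated terminals
};
struct Terminal : BaseClass {};

// Generated-style linking function: takes base-class pointers, casts them,
// and appends the aggregated Terminal to the aggregating node's list.
bool assign_TopologicalNode_Terminal(BaseClass* aggregating,
                                     BaseClass* aggregated) {
    auto* node = dynamic_cast<TopologicalNode*>(aggregating);
    auto* term = dynamic_cast<Terminal*>(aggregated);
    if (node && term) {
        node->Terminal_.push_back(term);
        return true;
    }
    return false;
}
```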

The AST visitor generates an assignment function for each pointer or list attribute of the CIM C++ classes. These functions get base class pointers to the objects which have to be linked together as arguments. The lookup of the proper function is accomplished by another hash table. The main issue is the generation of the function which initializes the hash table with the correct XML tags as keys to the function pointers.


In some cases, the association representation is rather simple. Exemplarily, for Terminal with the attribute ConnectivityNode, the AST visitor generates the key value cim:Terminal.ConnectivityNode. This is expressed by the following template:

cim:{{CLASS_NAME}}.{{FIELD_NAME}}

In other cases (depending on the CIM UML specification), the generation of correct key values is different. For instance, TopologicalNode aggregates one or more instances of the CIM class Terminal, but the XML tag representing the association (here an aggregation) is written the other way round (therefore called inverted XML tag), namely cim:Terminal.TopologicalNode. Therefore, in the case of the C++ class TopologicalNode with the attribute Terminal, which represents the association, the key value can be expressed by the template:

cim:{{FIELD_NAME}}.{{CLASS_NAME}}

This proceeding is sufficient in most cases, but in some CIM documents the XML tag representing the association looks different. Therefore, there are configuration files with proper mappings from key values generated by the previous template to the inverted XML tags representing associations in the CIM RDF/XML documents to be deserialized. These configuration files (one for primitive types and another for the remaining classes) are read by libcimpp at runtime. Currently, there are only around a dozen such cases. With these and further template sections, the unmarshalling code of CIM++ is completed.

The sections of the class Task are filled (as shown with the previous examples) by the AST visitor with the aid of further placeholders and dictionaries. Furthermore, the template for the assign function consists of two sections. The first one (ENUM_STRINGSTREAM) generates the unmarshalling function for enumerations, and the second one the actual assignments of the read-in data to the CIM C++ objects. In this unmarshalling function, the stream operators for enumerations are implemented. Therefore, for all enumerated types, proper CIM RDF/XML data can be read in with the aid of streams, as for primitive types. Since the enumerated types are strongly typed, besides the placeholder {{ENUM_CLASS_TYPE}} for enumerations without a scope, there is {{QUAL_ENUM_CLASS_TYPE}} for scoped enumerations. For filling these placeholders, the AST visitor traverses all enum class declarations and generates the needed section dictionaries.

Finally, an ASSIGNMENT section for the assignment function contains several placeholders which are filled using section dictionaries generated while visiting all attributes of the CIM C++ classes which are a data type or an enumeration.


Listing 3.7: serialize function of ACLineSegment

 1  std::string ACLineSegment::serialize(bool isXmlElement,
 2                                       std::map<BaseClass*,
 3                                                std::string>
 4                                       *id_map)
 5  {
 6      std::string output = "";
 7
 8      if (isXmlElement) {
 9          output.append("<cim:ACLineSegment rdf:ID=\"" +
10                        mRID + "\">\n");
11      }
12
13      output.append(IEC61970::Base::Wires::Conductor::
14                    serialize(false, id_map));
15
16      if (bch.value.initialized) {
17          output.append("  <cim:ACLineSegment.bch>" +
18                        std::to_string(bch.value) +
19                        "</cim:ACLineSegment.bch>\n");
20      }
21      ...
22
23      if (isXmlElement) {
24          output.append("</cim:ACLineSegment>\n");
25      }
26  }

3.4.5 Marshalling

For the serialization of CIM C++ objects from the main memory to CIM documents, BaseClass was extended by the member function

virtual std::string serialize(bool isXmlElement,
                              std::map<BaseClass*, std::string>* id_map)

that can be overridden by all CIM subclasses, as they all inherit (directly or indirectly through other classes) from BaseClass. For instance, ACLineSegment overrides it by the function partly depicted in List. 3.7. The isXmlElement parameter tells the serialize method whether the attributes to be serialized belong to a superclass of the instance (isXmlElement = false) or to the class of the instance (isXmlElement = true). In the latter case, XML element tags (see lines 10 and 24) must be wrapped around the attributes' marshalling output (between lines 12 and 21). This means that if an instance of ACLineSegment has to be serialized, ACLineSegment::serialize is called with isXmlElement = true, leading to a serialization with the introductory XML line <cim:ACLineSegment


rdf:ID=...>. In line 14, the serialize method of its superclass Conductor is called with isXmlElement = false to achieve a marshalling of the superclass' attributes without any XML tags introducing a new Conductor object.

3.5 libcimpp Implementation

The CIM++ (de-)serializer is implemented as a library which is completed by the code generated with the CIM++ toolchain. Afterwards, it can easily be built as a CMake project. The libcimpp library is available as an open-source project [FEI19a]. It already contains automatically generated code for current CIM versions.

Pointers to the C++ objects deserialized from CIM documents are provided in form of a list. Furthermore, a documentation for libcimpp is generated by Doxygen [Hee] and available to the user.

3.6 Evaluation

The flexibility and usability of the developed and implemented approaches are demonstrated here by a use case scenario. Regarding the flexibility, it shall be shown that the developed toolchain for CIM C++ code adaption (presented in Sect. 3.3) and the CIM-Unmarshalling-Generator can be successfully applied to a given CIM model which was changed or extended with a visual UML editor. As already mentioned, the currently available open-source version of libcimpp was generated and can be used for deserialization of different CIM versions as published by the CIMug. However, the main goal was to make CIM++ able to deserialize objects of classes added with a visual UML editor. This flexibility will be shown exemplarily by newly introduced component classes which are missing in the original CIM standards and needed in the SINERGIEN co-simulation environment. There, the original CIM classes are extended by a Sinergien package containing the mentioned additional classes. One of them is the class BatteryStorage, which has become necessary now that battery storages are increasingly integrated at distribution level. After an extension of the original IEC 61970 standard (iec61970cim16v29a_iec61968cim12v08_iec62325cim03v01a from CIMug) by the Sinergien package with the aid of Enterprise Architect (v11.0.1106), the CIM C++ code is generated and the introduced toolchain for adapting the CIM C++ code to be compilable is applied. This also allows an application of Doxygen on the code, which generates the developer documentation i. a. for the added Sinergien classes [FEId]. For instance, this


also includes the collaboration diagram of the BatteryStorage class as depicted in Fig. 3.4. After the toolchain for CIM C++ code adaption, the CIM-Unmarshalling-Generator is applied, which completes libcimpp with the code for unmarshalling. The correct functionality of the unmarshalling code generated by the CIM-Unmarshalling-Generator is demonstrated in [Raz+18a] and in Chap. 4 by the translation of an established power grid topology with the aid of libcimpp.

3.7 Conclusion and Outlook

In this chapter, the concept of an automated CIM RDF/XML (de-)serializer generation has been presented. The approach is based on an automated mapping from CIM UML to compilable C++ code with the aid of a visual UML editor, a compiler front-end, and a template engine. Using these components, the implemented code adaption toolchain is flexible enough to generate correct CIM C++ code from different CIM-based ontologies which, together with the automatically generated unmarshalling code, can then be integrated into the libcimpp (de-)serializer library.

Besides software-technical improvements related to libcimpp itself, the approach could be extended by serialization from C++ objects to CIM RDF/XML documents as well as to XML streams, e. g. for XMPP communication. After a definition of the required steps, the so-called marshalling code can be added to the classes by the code adaption toolchain.

Additionally, it can happen that the generated CIM C++ codebase contains circular class dependencies. In the case of present CIM models there are only a few of them (always at the same positions), which is why they are resolved during code adaption by the mentioned code patches using forward declarations. Although circular dependencies should be avoided by a clean UML design, it could be researched how such forward declarations and different solutions could be applied by the code adaption toolchain in an automated way.

Such efforts currently contribute to the first drafts of a harmonization standard [IEC17]. Differently from the mapping of CIM primitive data types to intrinsic C++ types and classes presented in this work, in [Lee+15] a data type unification of IEC 61850 and CIM is shown. This also includes definitions of operations from CIM and IEC 61850 types to unified data types using Query/View/Transformation (QVT), which is specified by the Object Management Group (OMG) as part of MDA. Since the main importance for libcimpp is to store data adequately, transformations are only performed if a sufficient accuracy can be achieved as specified by the


Chapter 3 Automated De-/Serializer Generation

[Figure 3.4 (diagram omitted): a Doxygen collaboration diagram relating Sinergien::EnergyGrid::EnergyStorage::BatteryStorage to the CIM classes IEC61970::Base::Wires::RegulatingCondEq, IEC61970::Base::Core::ConductingEquipment, Equipment, PowerSystemResource, and IdentifiedObject, as well as to RegulatingControl, PSRType, BaseClass, and the Sinergien classes ElectricalCapacity and communicationRequirement, including attributes such as ratedS, ratedU, nominalP, and nominalQ.]

Figure 3.4: Section of collaboration diagram for BatteryStorage generated by Doxygen on the automatically adapted CIM C++ codebase. The entire diagram can be found in [FEIb]


CIM standards. However, in conformance with the CIM++ approach, the generated CIM C++ classes could be extended during their automated adaption by member functions providing such QVTs for areas where a harmonization with IEC 61850 is desirable.

Approaches to synchronize UML models and source code in an automated way are continuously improved [GDD+06; Sad+09]. The main idea behind such round-trip engineering (RTE) methods is a more visual software development [Die07] which is not finished after the software design phase but is iteratively repeated during the implementation phase. Hence, there is also ongoing research which began with reverse engineering methods and so-called Computer-Aided Software Engineering (CASE) tools [Nic+00]. For instance, in [Kol+02] a comparison of the reverse engineering capabilities of commercial and academic CASE tools is presented. Because of the increasing complexity of software systems, the application of MDA-based methods is becoming more and more important. Thus, our approach contributes to these efforts.

Besides generic XML and RDF/XML parsers, as mentioned in Sect. 3.4.2, which are subjects of research activities as well [Mae12], there is also a CIM-specific parser with serialization capabilities according to [IEC16a] available, called PyCIM [Lin], currently supporting only CIM versions until 2011. Since that project is not maintained anymore, a new project for CIM document (de-)serialization called CIMpy is being developed at ACS. Besides CIM, it will also support CGMES, which is defined using information on CIM [ENT16]. CGMES is currently also being integrated into libcimpp in an automated way, with deserialization as well as serialization capabilities.


4 From CIM to Simulator-Specific System Models

In Chap. 3 the relevance of the Common Information Model (CIM) for power grids has been outlined and an automatically generated (de-)serializer library for documents based on the CIM has been presented. Because of the widespread use of CIM-based grid topology interchange, commercial power system simulation and analysis tools such as NEPLAN and PowerFactory can handle CIM. The problem of such proprietary simulation solutions in the academic area is often an insufficient or unavailable possibility for component model as well as solver modifications. As a consequence, many open-source and free power system simulation tools have been developed in recent years, for instance MATPOWER [MAT19], which is compatible with the proprietary MATLAB as well as the open-source GNU Octave [Eat19] environment [ZMT11], its Python port PYPOWER [Lin19a], and pandapower [Fra19]. Other open-source solutions are programmed in the object- and component-oriented multi-domain modeling language Modelica [Fri15b]. Since it allows a declarative definition of the model equations, the Modelica user resp. programmer does not need to transform mathematical models into imperative code (i. e. assignments). Modelica simulations can be executed with the aid of proprietary environments such as Dymola and open-source ones such as OpenModelica [Fri+06] and JModelica [Åke+10] with various numerical back-ends. Modelica libraries with models for power system simulations are PowerSystems [FW14] and ModPowerSystems [MNM16]. The use of Modelica for power system simulation is not limited to academia but it


is also applied in real operation, especially with CIM-based grid topologies as shown in [Vir+17]. However, in the presented approach an intermediate data format, called IIDM, is used.

The main contribution of this chapter is the presentation of a template-based transformation from CIM to Modelica system models. It has been implemented in the open-source tool CIMverter which, in its current version, transforms CIM documents into Modelica system models based on arbitrary Modelica libraries, as specified by the user.

The transformation into arbitrary Modelica system models allows the execution of any kind of Modelica simulations which shall make use of information stored in CIM documents. To achieve this, CIMverter utilizes a template engine that processes template files written in Modelica, containing placeholders. These placeholders are filled by the template engine with data from CIM documents and combined into a complete system model that can be simulated in an arbitrary Modelica environment. The use of a template engine leads to encapsulation, clarity, division of labor, component reuse, single point-of-change, interchangeable views, and so forth, as stated in [Par04]. For instance, this means that in case of many interface changes of a component model, the Modelica user does not need to modify the CIMverter source files but just the templates written in Modelica. Hence, no special knowledge of CIMverter's programming language (C++) or of any domain-specific language (DSL) is needed. Furthermore, this chapter presents examples of how CIM objects can be mapped to objects of a usual Modelica power system library. Our template-based approach can also be used for conversions to formats other than Modelica. Therefore, system models of the Distributed Agent-Based Simulation of Complex Power Systems (DistAIX) simulator [Kol+18] can also be generated from CIM documents just through an adaption of the template files used by CIMverter for the transformation.

This chapter gives a short introduction to the data formats as well as the main software components used in CIMverter, followed by an overview of the overall concept. Then it describes how the mapping from CIM to Modelica is performed at top level and at bottom level with the usage of a C++ representation of the Modelica classes in the so-called Modelica Workshop. Following this, the approach and implementation are evaluated with the aid of two Modelica power system libraries and validated against a commercial simulation tool. Finally, related work is discussed and the chapter is concluded by a roundup and an outlook on future work. The work in this chapter has been partially presented in [Raz+18b]1.

1 “CIMverter—a template-based flexibly extensible open-source converter from CIM to Modelica” by Lukas Razik, Jan Dinkelbach, Markus Mirz, Antonello Monti is licensed under CC BY 4.0


4.1 CIMverter Fundamentals

For an introduction to CIM, RDF, and XML please have a look at Sect. 3.1. In the following, Modelica and template engines will be introduced briefly.

4.1.1 Modelica

Modelica enables engineers to focus on the formulation of the physical model by implementing the underlying equations in a declarative manner [Fri15b]. The physical model can be readily implemented without the necessity to fix any causality through the definition of input and output variables, thus increasing the flexibility and reusability of the models [Til01]. Besides, existing Modelica environments relieve the engineer from the implementation of numerical methods to solve the specified equation system.

Modelica Models

The concept of component modeling by equations is shown exemplarily in List. 4.1 for a constant power load, which is typically employed to represent residential and industrial load characteristics in electrical grid simulations.

The presented PQLoad model is part of the ModPowerSystems [MNM16] library and is derived from the base model OnePortGrounded using the keyword extends, underlining that the Modelica language supports object-oriented modeling by inheritance. In the equation section, the physical behavior of the model is defined in a declarative manner by the common equations for active and reactive power. The parameters employed in the equations are declared in the PQLoad model beforehand, while the

Listing 4.1: Component model of a constant power load

model PQLoad "Constant power load"
  extends ModPowerSystems.Base.Interfaces.ComplexPhasor.SinglePhase.OnePortGrounded;
  parameter SI.ActivePower Pnom = 0.5e6 "active power";
  parameter SI.ReactivePower Qnom = 0.5e6 "reactive power";
equation
  Pnom/3 = real(v*conj(i));
  Qnom/3 = imag(v*conj(i));
end PQLoad;


declarations of the complex variables voltage and current are inherited from the base model OnePortGrounded. A complex system, e. g. an entire electrical grid, can be implemented as a system model by instantiating multiple components and specifying their interaction by means of connection equations, see line 25 in List. 4.6. The connect construct involves two connectors and introduces a fixed relation between their respective variables, e. g. between their voltages (equality coupling) and currents (sum-to-zero coupling). Typically, Modelica environments provide a GUI for the graphical composition of models.

Modelica Simulations

In [Fri15a] the translation and execution of a Modelica system model is sketched. At first, the system model is converted into an internal representation (i. e. an abstract syntax tree (AST)) of the Modelica compiler. On this representation, the Modelica language specific functionality is applied and the equations of the used component models (which are the blocks in the graphical representation of the system model) are connected together. This results in the so-called flat model.

Then all equations are sorted according to the data flow among them and transformed by algebraic simplification algorithms, symbolic index reduction methods, and so forth, into a set of equations that will be solved numerically. For instance, duplicates of equations are removed. Also, equations in explicit form are transformed into assignment statements (i. e. an imperative form), which is possible since they have been sorted. The established execution order leads to an evaluation of the equations in conjunction with the iteration step of the numeric solver. Subsequently, the equations are translated to C code, equipped with a driver (i. e. C code with a main routine), and compiled to an executable (i. e. a program) which is linked to the utilized numerical libraries. This program is then executed according to a configuration file which defines, e. g., the simulation's start and end times, the numerical methods to be utilized, the simulation results format, and so forth. Initial values are usually taken from the model definitions in Modelica.

For the conversion from CIM to Modelica system models it must be defined where the topology parameters (written in the CIM document to be converted) must be placed in the Modelica system model (i. e. the resulting Modelica file). For this purpose, a template engine is used, whose functional principle is introduced in the following.


4.1.2 Template Engine

A template engine (also called template processor or template system) is commonly used in web site development; in CIMverter it generates the Modelica code. Template engines allow the separation of model (i. e. logic as well as data) and view (i. e. resulting code). For CIMverter this shortly means that there is no Modelica code within the C++ source code of CIMverter. To achieve this, template engines have a

data model, for instance based on a database, a text or binary file, or a container type of the template engine's programming language,

template files (also called templates) written in the language of the resulting documents together with special template language statements, and

result documents which are generated after the processing of data and template files, the so-called expanding,

as illustrated in Fig. 4.1, where an example HTML code template with a placeholder {{name}} is filled with the name from a database, resulting in a complete HTML document. Such placeholders are one type of template markers.

4.2 CIMverter Concept

The concept of CIMverter is depicted in Fig. 4.2. The upper part shows the automated code generation process from the definition of the ontology by CIM UML to the unmarshalling code generation of the CIM++ (De-)Serializer library libcimpp. The middle part shows the transformation process from a given topology (based on the specified CIM ontology) to a Modelica system model, based on Modelica libraries which are addressed

[Figure 4.1 (diagram omitted): the template <title>Hello {{name}}!</title> is expanded by the template engine with the database entry name = "World" to the output <title>Hello World!</title>.]

Figure 4.1: Template engine example with HTML code


by appropriate Modelica templates. It uses and extends the concept of CIM++ as introduced in [Raz+18b]. The CIM UML ontology can be edited by a visual UML editor and exported to a CIM C++ codebase which is not compilable and therefore needs to be completed by the CIM++ code toolchain. The resulting adapted CIM C++ codebase, representing all CIM classes with their relations, is compilable and used by the CIM++ (Un-)Marshalling Generator for the generation of the code which is needed for the actual deserialization process of libcimpp. The CIM++ toolchain and the (Un-)Marshalling Generator are applied in an automated way whenever the ontology is changed. This keeps libcimpp compatible with the newest CIM RDF/XML documents.

CIMverter uses libcimpp for the deserialization of CIM objects from RDF/XML documents to C++ objects. Therefore, CIMverter also includes the adapted CIM C++ codebase, especially the headers for all CIM classes. Due to the ongoing development of CIM and the concomitant automated modifications of these headers, one might suppose that the CIMverter development has to keep track of all CIM modifications, but in the vast majority of cases a subsequent modification of CIMverter code is unneeded. This is because the continuous development of CIM mostly leads to new

[Figure 4.2 (diagram omitted): it connects the visual UML editor (CIM UML ontology), the CIM++ code toolchain (CIM C++ codebase → adapted CIM C++ codebase), the CIM++ Unmarshalling Generator, and the CIM++ Deserializer in the upper part with the CIMverter template engine, the Modelica Workshop, the Modelica templates, the Modelica editor (component models of Modelica libraries), and the topology editor (CIM RDF/XML topology documents) in the middle part, which together produce the Modelica system model.]

Figure 4.2: Overall concept of the CIMverter project


CIM classes with further relations or new attributes in existing classes. Such extensions of existing CIM classes require no changes to the CIMverter code using them.

With a Modelica editor, the component models of Modelica libraries can be edited. In case the interface of a component model is changed, the appropriate Modelica template files have to be adapted by the CIMverter user. Thereby, using the template engine with the concomitant model-view separation leads to the following advantages:

clarity: the templates are written in Modelica with only a few kinds of template keywords (i. e. markers).

division of labor: the CIMverter user, typically a person with an electrical engineering background and knowledge of Modelica, can adapt the Modelica templates easily in parallel with the CIMverter programmer, reducing conflicts during their developments. While the engineer needs neither C++ programming skills nor any knowledge of CIMverter internals, the programmer does not need to keep CIMverter up-to-date with all Modelica libraries that could be used with CIMverter.

component reuse: for better readability, templates can include other templates, which can be reused for different component models of the same or further Modelica libraries.

interchangeable views: some Modelica models can be compiled with various options, e. g. for the use of different model equations, which can be defined directly in the code of the system model. For this purpose, the user can easily specify another set of templates.

maintenance: changes to the Modelica code to be generated, which are needed, e. g., due to changes of component model interfaces, can be achieved by editing template files in a multitude of cases. Changing a template, by the way, is less risky than changing a program, which can lead to bugs. Furthermore, recompiling and reinstalling of CIMverter is unnecessary.

As already pointed out, some changes to the Modelica libraries require more than a template adaption, which is related to the mapping of the deserialized CIM C++ objects to the dictionaries of the template engine used to complete the Modelica templates to full system models.

For a clear mapping of the relevant data from the CIM C++ objects to the template dictionaries, the Modelica Workshop was introduced. For each Modelica component, the Workshop contains a C++ class with attributes


holding the values to be inserted into the appropriate dictionary, which will be used for the Modelica code fragment expansion of the belonging component within the Modelica system model. The mapping from CIM C++ objects to these Modelica Workshop objects is defined by C++ code. An alternative would have been the introduction of a DSL for a more flexible mapping definition. However, a really flexible DSL would have to support data conversions and computations for data mappings from CIM to Modelica class instances. Despite tools for DSL specification and parser generation etc., the complexity of the CIMverter project would increase. Moreover, CIMverter users as well as the programmers would need to get familiar with the DSL. Both reasons would make CIMverter's maintenance and further development more sophisticated and therefore less attractive to potential developers. For instance, the co-simulation framework mosaik at the beginning also made use of a specially developed DSL for scenario definitions [Sch11], but it was removed later on and now the scenarios are described in Python, in which mosaik is implemented, as this is more flexible and powerful. The Modelica Workshop and other implementation design aspects, as described in the next sections, shall perform the C++ coded mappings in an intuitive and understandable way, making CIMverter therefore easily extensible by further Modelica component models and libraries.

4.3 CIMverter Implementation

As described conceptually, CIMverter utilizes libcimpp for the deserialization of CIM topology documents (e. g. power grids) for the generation of full system models based on the chosen Modelica library (e. g. ModPowerSystems). C++ was selected as programming language because of libcimpp, with its included CIM C++ codebase, as well as CTemplate, both written in C++. As C++ is a static, strongly type-checked language with fewer runtime type information (RTTI) capabilities than a dynamic language such as Python, speculative dynamic typecasts are used for the return of the correct CIM C++ class object. Anyway, the time for converting CIM to Modelica models is negligible in comparison to the compile time of the generated Modelica models. The usage of C++ also allows looking up CIM details in the Doxygen documentation generated from the adapted CIM C++ codebase of CIM++.

CIMverter has a command line interface (CLI) and follows the UNIX philosophy of developing one program for one task [MPT78; Ray03]. Therefore, it can be simply integrated into a chain of tasks which need to be performed between the creation of a CIM topology and the simulations within a Modelica environment, as realized in the SINERGIEN Co-Simulation project [Mir+18] described in Chap. 2.

A configuration file is handled with the aid of the libconfig++ library, where i. a. the default graphical extent of each Modelica component can be adjusted. It also allows the definition of default CIM datatype multipliers (e. g. M for MW in the case of IEC61970::Base::Domain::ActivePower) which are not defined in some CIM RDF/XML documents, such as the ones from NEPLAN based on the European Network of Transmission System Operators for Electricity (ENTSO-E) profile, specified by [ENT]. After these implementation details, in the following subsections the main aspects of the overall implementation are presented.

4.3.1 Mapping from CIM to Modelica

The mapping from CIM documents to Modelica system models can be divided into three levels of consideration, as in [Cao+15].

At the first level, there are the library mappings. The relevant data from CIM C++ objects, as deserialized by CIM++, is first stored in an intermediate object representation (i. e. in the Modelica Workshop) with a class structure similar to the one of the Modelica library. Hence, for each Modelica library there can be a set of appropriate C++ class definitions in the Modelica Workshop.

Object mappings are part of the second level. These are not just one-to-one mappings, as illustrated in Fig. 4.3. Sometimes, several CIM objects are mapped to one Modelica object resp. component, such as the IEC61970::Base::Wires::PowerTransformer. There are also CIM objects like IEC61970::Base::Core::Terminal (electrical connection points, linked to other CIM objects) which are not mapped to any Modelica component models.

Parameter and unit conversions are performed at the third level between the CIM C++ objects and the Modelica Workshop objects. Examples are

[Figure 4.3 (diagram omitted): CIM C++ objects on the left are related to Modelica objects on the right by one-to-one as well as many-to-one mappings.]

Figure 4.3: Mapping at second level between CIM and Modelica objects


voltages, coordinates, and so forth. The next section addresses the second and third level mappings as part of the Modelica Workshop, but before that, the CIM object handling is explained.

4.3.2 CIM Object Handler

The CIMObjectHandler is in charge of the CIM object handling. Listing 4.2 shows a part of its main routine ModelicaCodeGenerator. This is

Listing 4.2: Snippet of the routine ModelicaCodeGenerator

ctemplate::TemplateDictionary *dict =
    new ctemplate::TemplateDictionary("MODELICA");
...
for (BaseClass *Object : this->_CIMObjects) {
  if (auto *tp_node = dynamic_cast<TPNodePtr>(Object)) {
    BusBar busbar =
        this->TopologicalNodeHandler(tp_node, dict);
    ...
    std::list<TerminalPtr>::iterator terminal_it;
    for (terminal_it = tp_node->Terminal.begin();
         terminal_it != tp_node->Terminal.end(); ++terminal_it) {
      ...
      if (auto *power_trafo = dynamic_cast<PowerTrafoPtr>(
              (*terminal_it)->ConductingEquipment)) {
        Transformer trafo =
            PowerTransformerHandler(tp_node, (*terminal_it),
                                    power_trafo, dict);
        Connection conn(&busbar, &trafo);
        connectionQueue.push(conn);
      }
      ...
because topological nodes have a central role in bus-branch based CIM topologies of power grids [Pra+11]. Therefore, on finding a TopologicalNode (saved as tp_node), a busbar object of the Modelica Workshop class BusBar is initialized with it. busbar is needed later on for the connections of all kinds of conducting equipment (i. e. power grid components) that are connected to it.

Then, the inner loop iterates over all terminals of the found tp_node and checks which kind of ConductingEquipment is connected by the respective terminal to the tp_node. In the case of a PowerTransformer, a trafo object of the Modelica Workshop class Transformer is initialized with the data from the PowerTransformerHandler. Furthermore, a new connection between the previously created busbar and the trafo is constructed and pushed onto a queue of all connections. These steps are performed for all other kinds


of components, which is why the ModelicaCodeGenerator calls handlers for all of them.

The tp_node with the terminal connected to the regarded component (here: trafo) are passed to the appropriate component handler (here: PowerTransformerHandler). Besides, the handler also gets the main template dictionary dict, called "MODELICA". Within a handler, the conversions from the required CIM C++ object(s) to the Modelica Workshop object trafo are performed. Furthermore, a subdictionary (here called "TRANSFORMER", used for the Transformer subtemplate, see e. g. List. 4.4) is created and linked to the given main template dictionary (see List. 4.3).

Some conversions are related to the graphical representation of the CIM objects. This is because a graphical power grid editor which can export CIM documents can link an IEC61970::Base::DiagramLayout::DiagramObject to each component, with information about the position of this component, i. e. (x, y)-coordinates, in the coordinate system of the graphical editor. Since the coordinate system of the CIM exporting editor (e. g. NEPLAN) can differ from the one of the Modelica editor (e. g. OMEdit), the coordinates are converted by the following code lines:

t_points.xPosition = trans_para[0]*x + trans_para[1];
t_points.yPosition = trans_para[2]*y + trans_para[3];

For reasons of flexibility, the four parameters trans_para can be set in the configuration file and in the case of NEPLAN and OMEdit are initialized with {1,0,-1,0} (for trans_para[0] to trans_para[3]). Furthermore, the NEPLAN generated CIM documents have several DiagramObject instances linked to one component. To avoid multiple occurrences of the same component in the Modelica connections diagram, the middle point of these DiagramObject coordinates is calculated. This middle point then defines the component's position in the Modelica connections diagram.

Another conversion must be performed for the instance names of Modelica classes, which are derived from the name attribute of the CIM object and may not begin with or contain certain characters. Each such object derives its name attribute from the elementary IEC61970::Base::Core::IdentifiedObject superclass. More details on the electrics-related conversions will be given in the next section.

4.4 Modelica Workshop Implementation

In List. 4.2, different CIM object handlers (e. g. PowerTransformerHandler) return appropriate Modelica Workshop objects which represent components of the targeted Modelica library. It should be stated at this juncture that CIM is not only related to power grid components and, for


instance, also includes energy market players (e. g. Customer), Asset, and so forth. Moreover, as presented in [Mir+18], CIM can also be extended by further classes of different domains. Hence, the Modelica Workshop does not need to be reduced to power grid components, even though the current Modelica Workshop is related to components for power grid simulations. This is due to ModPowerSystems being the first Modelica library targeted by the CIMverter converter. Nonetheless, the current Modelica Workshop can be used as is for the utilization of another Modelica library, as presented in the Evaluation. To avoid reimplementations, each Modelica Workshop class representing a Modelica component, such as Slack or Transformer, inherits from the so-called ModBaseClass.

4.4.1 Base Class of the Modelica Workshop

All Modelica components need annotation information which defines the visibility of the component, its extent, rotation, etc. Each Modelica Workshop class inheriting from ModBaseClass therefore has an annotation member holding the annotation data in the form used in the Modelica component's annotation statement. For this purpose, ModBaseClass also holds several member functions which combine the annotation data into well structured strings as needed for the template dictionary used for filling the annotation statements of all Modelica template files, as the annotation statements of all Modelica components have the same structure and the same markers (see lines 12-14 and 20-22 of List. 4.6).

For the Modelica statements which differ between different Modelica components (see lines 8-11 and 16-19 of List. 4.6) there exists a virtual function set_template_values. In each of the component subclasses this function is overridden with a specialized one which sets all markers that are needed for a complete filling of the belonging Modelica component template, such as presented in List. 4.4.

Further member variables of ModBaseClass hold the name of the object and the specified units information, whose default values are set in the configuration file. The object's name is read from the name attribute of the CIM class IdentifiedObject. Besides, ModBaseClass accumulates objects of the CIM class DiagramObject, where the object's rotation and points in the GUI coordinate systems are stored.

4.4.2 CIM to Modelica Object Mapping

One of the most interesting mappings is from the CIM PowerTransformer to the Modelica Workshop Transformer class, as presented in Tab. 4.1. The PowerTransformer consists of two or more coupled windings and therefore


CIM                    Contained / Accumulated                  Modelica Workshop
PowerTransformer       Member Variables                         Transformer
--------------------------------------------------------------------------------
PowerTransformerEnd1   BaseVoltage->nominalVoltage.value * mV   Vnom1
PowerTransformerEnd2   BaseVoltage->nominalVoltage.value * mV   Vnom2
PowerTransformerEnd1   ratedS.value * mP                        Sr
PowerTransformerEnd1   r.value                                  r
PowerTransformerEnd1   x.value                                  x
                       r * Sr / Vnom1^2 * 100                   Pcur
                       sqrt(r^2 + x^2) * Sr / Vnom1^2 * 100     Vscr

Table 4.1: CIM PowerTransformer to Modelica Workshop Transformer mapping. The left column shows the primary and secondary PowerTransformerEnd, which accumulate further CIM objects, as listed in the middle column, holding the information needed for the initialization of the Transformer attributes as listed in the right column. The constants mV and mP stand for the voltage and power value multipliers. The bottom of the table shows that additionally two conversions are needed to calculate the rated short circuit voltage Vsc,r and the short circuit losses Pcu,r in percent.

accumulates objects of the class PowerTransformerEnd which represent the connectors of the PowerTransformer [FEIc]. Further important mappings implemented in the Modelica Workshop are listed in Tab. 4.2.

4.4.3 Component Connections

After the instantiation of all components in the Modelica system model, the connections must be defined as well. In List. 4.2, for each newly created component a connection (i. e. an instance of the Connection class) to the corresponding busbar is created. For this purpose, a function template of Connection with the signature

template <typename T> void cal_middle_points(T *component);

is called in the constructors of Connection and computes one or two middle points between the endpoints of the connection line. The four different cases for the middle points are illustrated in Fig. 4.4.

Furthermore, the connectors of the different components can vary between different Modelica libraries. Therefore, the connector names can be


Chapter 4 From CIM to Simulator-Specific System Models

CIM                          ModPowerSystems

TopologicalNode,             Slack
ExternalNetworkInjection

ACLineSegment                PiLine

TopologicalNode,             PQLoad
EnergyConsumer,
SvPowerFlow

Table 4.2: Excerpt of further important mappings from CIM to ModPowerSystems as implemented in the Modelica Workshop

configured in a separate configuration file, called connectors.cfg, which is included in the directory of the corresponding Modelica template files. Its settings are read by all Connection constructors, combined, and fed into the dictionary which is used for filling the connections subtemplate included by the main template file. The final Modelica code generation is presented exemplarily in the next section.
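The principle of filling such a dictionary into a template can be illustrated by a naive marker substitution. CIMverter relies on an external template engine; this sketch only demonstrates the {{ MARKER }} lookup, nothing more.

```cpp
#include <map>
#include <string>

// Replaces every "{{ MARKER }}" occurrence in tmpl by its dictionary value.
std::string fill_template(std::string tmpl,
                          const std::map<std::string, std::string>& dict) {
    for (const auto& [marker, value] : dict) {
        const std::string key = "{{ " + marker + " }}";
        for (auto pos = tmpl.find(key); pos != std::string::npos;
             pos = tmpl.find(key, pos + value.size()))
            tmpl.replace(pos, key.size(), value);
    }
    return tmpl;
}
```

For instance, filling the markers PIN1 and PIN2 of a connections subtemplate with the connector names read from connectors.cfg would turn "connect( {{ PIN1 }},{{ PIN2 }} )" into a concrete connect statement.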

4.5 Evaluation

For an evaluation of the approach and its implementation, exemplary templates as well as the resulting Modelica models are shown. To demonstrate the flexibility and applicability of CIMverter, two different power system libraries are used: the ModPowerSystems and the PowerSystems library. Besides, the simulation results obtained with the generated models are validated against the commercial simulation tool NEPLAN.

The main Modelica template defines the overall structure of the Modelica system model and contains markers for component instantiations and connection equations, List. 4.3. The inserted subtemplates hold information regarding the library and package from which the models are taken. For instance, see line 1 in the corresponding subtemplates, List. 4.4 (for ModPowerSystems) and List. 4.5 (for PowerSystems), of the Transformer


Figure 4.4: Connections with zero, one, and two middle points betweenthe endpoints. The endpoints are marked with circles


model. As a use case, we generate the components for a steady-state simulation of a symmetrical power system in balanced operation. For the ModPowerSystems library, we utilize models from the PhasorSinglePhase package, since complex phasor variables and a single-phase representation are suitable for this type of simulation. In case of the PowerSystems library, we perform the simulation with models from the AC3ph package, obtaining comparable results by considering the dq0 transform in the synchronously rotating reference frame. Other types of simulation might be performed by changing package and model names accordingly in the subtemplates. The considered Transformer subtemplates, List. 4.4 and List. 4.5, contain markers to define the primary and secondary nominal voltage as well as the rated apparent power. The interface of the ModPowerSystems component specifies the Transformer's electrical characteristics by rated short circuit voltage Vsc,r and short circuit losses Pcu,r, while resistance R and reactance X are defined for the PowerSystems component.

In our use case, we model the benchmark system described in [Rud+06], which is a medium-voltage distribution network with rural character. Integrated components are a slack bus, busbars, transformers, Pi lines, and PQ loads. List. 4.6 shows an extract of the resulting Modelica system model generated from the CIM data with the presented CIMverter converter. The system model of the benchmark grid was additionally generated

Listing 4.3: Main Modelica template related to ModPowerSystems, including several sections (e. g. SYSTEM_SETTINGS) and subtemplates (e. g. PQLOAD)

{{# HEADER_FOOTER_SECTION }}
model {{ GRID_NAME }}
{{/ HEADER_FOOTER_SECTION }}
{{# SYSTEM_SETTINGS_SECTION }}
inner ModPowerSystems.Base.System
  {{ NAME }}( freq_nom ( displayUnit = "{{ FNOM_UNIT }}") = {{ FNOM }})
  annotation ( Placement ( visible = {{ VISIBLE }},
    transformation ( extent = {{ TRANS_EXTENT_POINTS }},
    rotation = {{ ROTATION }})));
{{/ SYSTEM_SETTINGS_SECTION }}
...
{{ >PQLOAD }}
{{ >TRANSFORMER }}
...
equation
{{ >CONNECTIONS }}
{{# HEADER_FOOTER_SECTION }}
...
end {{ GRID_NAME }};
{{/ HEADER_FOOTER_SECTION }}


Listing 4.4: Transformer subtemplate related to the ModPowerSystems library

ModPowerSystems.PhasorSinglePhase.Transformers.Transformer
  {{ NAME }}( Vnom1 = {{ VNOM1 }}, Vnom2 = {{ VNOM2 }},
  Sr( displayUnit = "{{ SR_DISPLAYUNIT }}") = {{ SR }},
  Pcur = {{ PCUR }}, Vscr = {{ VSCR }})
  annotation ( Placement ( visible = {{ VISIBLE }},
    transformation ( extent = {{ TRANS_EXTENT_POINTS }},
    rotation = {{ ROTATION }}, origin = {{ ORIGIN_POINT }})));

Listing 4.5: Transformer subtemplate related to PowerSystems library

PowerSystems.AC3ph.Transformers.TrafoStray
  {{ NAME }}( redeclare record Data =
    PowerSystems.AC3ph.Transformers.Parameters.TrafoStray
    ( puUnits = false, V_nom = { {{ VNOM1 }}, {{ VNOM2 }} },
      r = { {{R}}, 0 }, x = { {{X}}, 0 }, S_nom = {{ SR }}))
  annotation ( Placement ( visible = {{ VISIBLE }},
    transformation ( extent = {{ TRANS_EXTENT_POINTS }},
    rotation = {{ ROTATION }}, origin = {{ ORIGIN_POINT }})));

for the use of the PowerSystems library, simply by switching from the ModPowerSystems to the PowerSystems template set. The connection diagrams of the resulting models, Fig. 4.5, show the same grid topology involving the respective components from both libraries.

For the validation of both Modelica system models, they were built and simulated. Afterwards, the simulation results were compared with those of the proprietary simulation tool NEPLAN, Tab. 4.3.

4.6 Conclusion and Outlook

This chapter presents an approach for the transformation from CIM to Modelica. The mapping of CIM RDF/XML documents to Modelica system models is based on a CIM to C++ deserializer, a Modelica Workshop representing the Modelica classes in C++, and a template engine. CIMverter, the implementation of this approach, is flexible enough to address arbitrary Modelica libraries, as demonstrated by the generation of system models for two power system libraries. In case of ModPowerSystems, there is no need to modify the mappings as implemented in the CIM object handlers when switching to the PowerSystems library. Also, the Modelica Workshop


Listing 4.6: Medium-voltage benchmark grid [Rud+06] as converted from CIM to a system model based on the ModPowerSystems library

1  model modpowersystems_mv_benchmark_grid
2    inner ModPowerSystems.Base.System
3      System ( freq_nom ( displayUnit = "Hz") = 50.0)
4      annotation ( Placement ( visible = true,
5        transformation ( extent = {{0.0,-30.0},{30.0,0.0}},
6        rotation = 0)));
7    ...
8    ModPowerSystems.PhasorSinglePhase.Loads.PQLoad
9      CIM_Load12_H ( Pnom( displayUnit = "W") = 15000000.000,
10     Qnom( displayUnit = "var") = 3000000.000,
11     Vnom( displayUnit = "V") = 20000.000)
12     annotation ( Placement ( visible = true,
13       transformation ( extent = {{-8.0,-8.0},{8.0,8.0}},
14       rotation = 0, origin = {237.1,-107.8})));
15   ...
16   ModPowerSystems.PhasorSinglePhase.Transformers.Transformer
17     CIM_TR1 ( Vnom1 = 110000.000, Vnom2 = 20000.000,
18     Sr( displayUnit = "W") = 40000000.000,
19     Pcur = 0.63000, Vscr = 12.04000)
20     annotation ( Placement ( visible = true,
21       transformation ( extent = {{-8.0,-8.0},{8.0,8.0}},
22       rotation = -90, origin = {86.0,-64.3})));
23   ...
24   equation
25   connect ( CIM_N0.Pin1, CIM_TR1.Pin1 )
26     annotation ( Line( points = {{153.80,-40.00},{153.80,-56.15},
27       {86.00,-56.15},{86.00,-72.30}},
28       color = {0,0,0}, smooth = Smooth.None ));
29   ...
30 end modpowersystems_mv_benchmark_grid;

classes are compatible with both libraries. Subsequently, the generated system models, simulated with a Modelica environment, are successfully validated against a common power systems simulation tool. CIMverter has already been successfully applied in the research area of power grid simulations as, for instance, in [Din+18].

It is obvious that the current implementation can also be used for conversions into formats other than Modelica, even with the current Modelica Workshop, as the introduced template markers can be used in every file format. Therefore, the Modelica Workshop could be cleaned up and extended to a general Power Systems Workshop, addressing data formats used by other power system analysis and simulation tools. Furthermore, the template-based approach also allows different target system model


[Connection diagrams of the benchmark grid: (1) ModPowerSystems, (2) PowerSystems]

Figure 4.5: Medium-voltage benchmark grid [Rud+06] converted from CIM to a system model in Modelica based on the ModPowerSystems and PowerSystems libraries


Grid Node   NEPLAN               ModPowerSystems      PowerSystems
            |V| [kV]   ∠V [°]    |V| [kV]   ∠V [°]    |V| [kV]   ∠V [°]
N0          110.000     0.000    110.000     0.000    110.000     0.000
N1           19.531    -4.300     19.532    -4.268     19.532    -4.268
N10          18.828    -4.900     18.828    -4.852     18.828    -4.852
N11          18.825    -4.900     18.826    -4.852     18.826    -4.852

Table 4.3: Excerpt from the numerical results for node phase-to-phase voltage magnitude and angle regarding the medium-voltage benchmark grid. The models based on the ModPowerSystems and PowerSystems libraries yield equal results using the Dymola environment and the dassl solver. The results deviate marginally from the reference results obtained with the proprietary tool NEPLAN, which might be explained by numerical rounding and different solution methods.

formats than Modelica. Meanwhile, the system model format for the DistAIX simulator [FEIa] has also been implemented.

Additionally, the current middle point calculations for the Modelica connection diagrams could be improved by the use of a graph layout library such as Graphviz [Ell+01]. This would allow CIMverter to equip the output document with proper diagram data even if the CIM topology to be converted contains no diagram data at all.


5 Modern LU Decompositions in Power Grid Simulation

With the aid of CIMverter, which was presented in Chap. 4, system models based on ModPowerSystems (MPS) can finally be created from up-to-date industry standard grid models (i. e. based on the Common Information Model (CIM)). This allows scientific studies on real-world use cases with usually higher complexity than simple lab examples. These studies often involve newly developed and more accurate models as well as smaller time steps for higher-resolution simulations. One possibility to accomplish more accurate simulations within the same computation time is the improvement of the numerical back-end of the utilized simulation environment.

During the development of the MPS library [FEI19b] (for more on Modelica see Sect. 4.1.1) by ACS and of the iTesla Power System Library (iPSL), developed i. a. by Réseau de Transport d'Électricité (RTE) [AIA19], a cooperation between RTE and ACS was established. Shortly before, the SUNDIALS/IDA solver [Hin+05] for differential-algebraic systems of equations (DAEs) had been integrated into OpenModelica to achieve a potentially higher simulation performance in case of large models with a sparse structure [Ope19a]. During its execution, IDA applies a backward differentiation formula (BDF) to the given DAE, resulting in a nonlinear algebraic system of equations that is solved by Newton iterations [HSC19]. Within each iteration, a linear system needs to be solved. For linear system solution, IDA provides several iterative and direct methods [MV11]: BLAS/LAPACK [Uni17; Uni19] implementations are supplied for


dense as well as banded matrices, and KLU [DP10] as well as SuperLU_MT (a multithreaded version of the well-known SuperLU [Sup]) are supplied for sparse linear systems.

In the European project PEGASE [CRS+11], KLU showed the highest overall performance of all compared LU decompositions (the others were LAPACK, UMFPACK, MUMPS, SuperLU_MT, and PARDISO), applied to linear systems (i. e. Jacobian matrices) coming from different power grid simulation scenarios. However, new LU decompositions have been developed since PEGASE: the parallelized NICSLU [CWY13] and Basker [BRT16] for conventional shared-memory computer architectures [Roo99] and GLU for graphics processing units (GPUs) [Che+15].

This chapter provides a comparison of the mentioned LU decompositions (i. e. KLU, NICSLU, Basker, and GLU), which were all developed especially for circuit simulation. This comprises a brief introduction to the working principles of the decompositions to illustrate the main ideas behind them. The subsequent analysis is carried out on a set of benchmark matrices which came up during simulations with Dynaωo, an open-source simulation tool developed at RTE [Adr19]. Finally, the results are summarized and a conclusion follows. The work in this chapter has already been partially presented in [Raz+19a].

5.1 LU Decompositions in Power Grid Simulation

In many simulation environments such as OpenModelica [Fri15a], system models with algebraic and differential equations are transformed into a DAE. More on this transformation procedure from system models to DAEs is provided in Sect. 4.1.1. A numeric DAE solver computes the values of all relevant variables in the specified simulation time interval [t_start, t_end].

5.1.1 From DAEs to LU Decompositions

Two well-known DAE solvers are DASSL [Pet82] and IDA from the open-source SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers (SUNDIALS) [Hin+05]. IDA solves the initial value problem (IVP) for a DAE of the form

F(t, y, ẏ) = 0,   y(t_0) = y_0,   ẏ(t_0) = ẏ_0,   (5.1)

where F, y, ẏ ∈ R^N, t is the independent (time) variable, ẏ = dy/dt, and the initial values y_0, ẏ_0 are given [HSC19].


The integration method in IDA is the so-called variable-order, variable-coefficient BDF in fixed-leading-coefficient form [BCP96] of order q ∈ {1, . . . , 5}, given by the multistep formula

∑_{i=0}^{q} α_{n,i} y_{n−i} = h_n ẏ_n,   (5.2)

where y_n, ẏ_n are the computed approximations to y(t_n) and ẏ(t_n), with (time) step size h_n = t_n − t_{n−1} and coefficients α_{n,i} determined depending on q. The application of this BDF to the DAE results in the following nonlinear algebraic system to be solved at each step:

G(y_n) := F( t_n, y_n, h_n^{−1} ∑_{i=0}^{q} α_{n,i} y_{n−i} ) = 0.   (5.3)
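As an illustration (an addition for exposition, not part of the original derivation): for the simplest order q = 1 with coefficients α_{n,0} = 1 and α_{n,1} = −1, Eq. (5.2) gives ẏ_n = (y_n − y_{n−1})/h_n, so Eq. (5.3) reduces to the implicit (backward) Euler scheme:

```latex
G(y_n) = F\left(t_n,\; y_n,\; \frac{y_n - y_{n-1}}{h_n}\right) = 0
```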

IDA solves Eq. (5.3) with the Newton method (or a user-defined nonlinear solver). G(y), where y := y_n in the n-th time step and y = (y_1, . . . , y_N)^T ∈ R^N, is linearized with the aid of Newton's method by applying the Taylor series to the component G_i around y^(m) in the m-th Newton iteration:

G_i(y) = G_i(y^(m)) + ∑_{j=1}^{N} ∂G_i(y^(m))/∂y_j · (y_j − y_j^(m)) + O(‖y − y^(m)‖_2^2),   (5.4)

with i = 1, . . . , N, which can be shortened by using the Jacobian matrix definition [DR08]

    ( ∂G_1/∂y_1  ···  ∂G_1/∂y_N )
J = (     ⋮       ⋱       ⋮     )   (5.5)
    ( ∂G_N/∂y_1  ···  ∂G_N/∂y_N )

to the equation

G(y) = G(y^(m)) + J(y^(m)) (y − y^(m)) + O(‖y − y^(m)‖_2^2).   (5.6)

Hence, neglecting the Taylor remainder (i. e. the O-term) and setting G(y) to 0 for finding the zeros, in each Newton iteration a linear system of the form

J [y_n^(m+1) − y_n^(m)] = −G(y_n^(m)),   (5.7)

needs to be solved, where y_n^(m) is the m-th approximation to y_n in the n-th simulation time step. For solving the linear system, LU decompositions can be utilized.
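The chain from Eq. (5.7) to an LU-based solve can be sketched with a toy dense example. This is an illustrative assumption: the system G and its Jacobian J are made up, and IDA of course operates on large sparse systems with the solvers discussed below.

```cpp
#include <array>
#include <cmath>
#include <utility>

constexpr int N = 2;
using Vec = std::array<double, N>;
using Mat = std::array<std::array<double, N>, N>;

// Hypothetical test system G(y) = 0 and its Jacobian J(y):
// G_1 = y_1^2 + y_2 - 3, G_2 = y_1 - y_2.
Vec G(const Vec& y) { return {y[0] * y[0] + y[1] - 3.0, y[0] - y[1]}; }
Mat J(const Vec& y) { return {{{2.0 * y[0], 1.0}, {1.0, -1.0}}}; }

// Solves A x = b via in-place LU decomposition with partial pivoting.
Vec lu_solve(Mat A, Vec b) {
    for (int k = 0; k < N; ++k) {
        int piv = k;                                  // partial pivoting
        for (int i = k + 1; i < N; ++i)
            if (std::fabs(A[i][k]) > std::fabs(A[piv][k])) piv = i;
        std::swap(A[k], A[piv]);
        std::swap(b[k], b[piv]);
        for (int i = k + 1; i < N; ++i) {             // elimination
            double f = A[i][k] / A[k][k];
            for (int j = k; j < N; ++j) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }
    Vec x{};
    for (int i = N - 1; i >= 0; --i) {                // back substitution
        x[i] = b[i];
        for (int j = i + 1; j < N; ++j) x[i] -= A[i][j] * x[j];
        x[i] /= A[i][i];
    }
    return x;
}

// Newton iterations, Eq. (5.7): solve J dy = -G(y), then y += dy.
Vec newton_solve(Vec y) {
    for (int m = 0; m < 50; ++m) {
        Vec g = G(y);
        Vec dy = lu_solve(J(y), {-g[0], -g[1]});
        for (int i = 0; i < N; ++i) y[i] += dy[i];
        if (std::hypot(dy[0], dy[1]) < 1e-12) break;
    }
    return y;
}
```

Starting from (2, 2), newton_solve converges to the root y_1 = y_2 = (√13 − 1)/2 of this toy system; in IDA, each such inner solve is delegated to one of the LU decompositions discussed in the following.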


5.1.2 LU Decompositions for Linear System Solving

LU decompositions belong to the category of direct solvers. There are various methods for different matrix types, such as the Cholesky decomposition for Hermitian positive-definite matrices [FO08]. For the decomposition of sparse matrices, special LU decomposition algorithms are used which store just the non-zero entries of the matrices to reduce memory consumption and arithmetic operations.
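A common such storage scheme is the compressed sparse column (CSC) format, on which KLU and related codes build; a minimal sketch with toy data:

```cpp
#include <vector>

// Compressed sparse column (CSC) storage: only non-zeros are stored.
struct CscMatrix {
    int n;                    // dimension (n x n)
    std::vector<int> colptr;  // size n+1: column j occupies [colptr[j], colptr[j+1])
    std::vector<int> rowind;  // row index of each stored entry
    std::vector<double> val;  // numerical value of each stored entry
};

// y = A * x for a CSC matrix (column-wise scatter).
std::vector<double> spmv(const CscMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (int j = 0; j < A.n; ++j)
        for (int k = A.colptr[j]; k < A.colptr[j + 1]; ++k)
            y[A.rowind[k]] += A.val[k] * x[j];
    return y;
}
```

For instance, the 3×3 matrix with rows (4, 0, 1), (0, 3, 0), (2, 0, 5) is stored with only its five non-zeros as colptr = {0, 2, 3, 5}, rowind = {0, 2, 1, 0, 2}, val = {4, 2, 3, 1, 5}.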

During factorization, a non-zero entry can arise at a position where a zero entry was before, which is called fill-in. Therefore, LU decompositions usually perform a preordering step before the actual factorization step for fill-in reduction, leading to lower memory and time consumption during the subsequent factorization step [TW67]. In general, the problem of computing the ordering with the lowest fill-in is NP-complete [Yan81].

Apart from direct solution methods, [CRS+11] also analyzed how well iterative methods perform for solving the linear systems inside the Newton iterations. The iterative Generalized Minimal Residual method (GMRES) was chosen as it is suitable for general matrices. The conclusion was that GMRES is too costly on the Jacobian matrices from the area of power grids, especially when complex preconditioning methods must be applied before the solver in order to achieve a better convergence behavior. Furthermore, the Jacobian matrices are not only sparse but also generate little fill-in during the processing by the LU decompositions. Similarly, [SV01] states that large electric circuits are not easy to solve in an efficient manner by iterative methods, but there is development potential, as not much research has been done in this area yet. In the following, the two main steps of current LU decomposition methods are introduced.

Preprocessing

Usually, the preprocessing consists of a preordering step and partial pivoting. During the preordering, permutation matrices are computed. The partial pivoting is accomplished to reduce the round-off error during the subsequent factorization. Hence, for a given linear system Ax = b, the final system of equations which has to be solved, after preordering and factorization with pivoting, can be represented as

(PAQ) Q^T x = P b,

where the row permutations as well as partial pivoting are performed by P and the column permutations by Q [DP10]. The preordering methods


for fill-in reduction are usually based on one of the following heuristics:

minimum degree (MD) which belongs to the greedy algorithms [Heg+01];

nested dissection (ND) which is based on graph partition computationby a divide and conquer approach [Geo73].

In general, nested-dissection-based fill-in reduction algorithms are more time-consuming [Heg+01], but their results usually lead to less fill-in [KMS92]. Besides the permutations coming from fill-in reduction, some LU decomposition methods perform further permutations during preprocessing, as well as matrix scaling and scheduling of the parallel factorization (if any). This is mentioned in the introduction of each particular decomposition method.
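The impact of the elimination order on fill-in can be demonstrated on a toy 5×5 "arrowhead" pattern (an illustrative example, not taken from the benchmark set): eliminating the dense row/column first fills the entire matrix, while eliminating it last creates no fill-in at all.

```cpp
#include <array>

constexpr int N = 5;

// Symbolic Gaussian elimination on the boolean pattern of an arrowhead
// matrix (dense first row/column plus diagonal), with row/column order
// given by perm; returns the number of fill-in entries created.
int count_fill(std::array<int, N> perm) {
    bool a[N][N] = {};
    for (int i = 0; i < N; ++i) a[i][i] = a[0][i] = a[i][0] = true;
    bool p[N][N];                         // apply the ordering to the pattern
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) p[i][j] = a[perm[i]][perm[j]];
    int fill = 0;
    for (int k = 0; k < N; ++k)           // eliminate column k symbolically
        for (int i = k + 1; i < N; ++i)
            if (p[i][k])
                for (int j = k + 1; j < N; ++j)
                    if (p[k][j] && !p[i][j]) { p[i][j] = true; ++fill; }
    return fill;
}
```

In the natural order {0, 1, 2, 3, 4} the dense node is eliminated first and 12 fill-in entries arise (the whole matrix becomes dense); in the reversed order it is eliminated last and no fill-in arises, which is exactly the kind of ordering effect MD and ND heuristics exploit.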

Factorization

The actual LU factorization with the factors L and U is performed on the previously permuted matrix A′ = PAQ, such that A′ = LU, with b′ = Pb. For efficiency reasons, preorderings are not performed before each factorization. In case the values of a Jacobian change but its structure remains, the same permutations can be reapplied; in circuit simulation this is very often the case [CWY12].

Solving with the computed LU decomposition

Usually, LU decompositions also provide functionality for right-hand-side solving, as this requires the permutations of the preordering to return correct results. Hence, for a given A′ = LU, the solution x of Ax = b can be computed from the solution vector x′, whereby

A′x′ = b′  ⇔  Ly = b′ and Ux′ = y,

with x = Qx′ and b′ = Pb.

The solving step is computationally less expensive than the two steps before, but it is repeated many times in Newton's (iterative) method. In this work, the term decomposition is used when the whole method, such as KLU, is meant, whereas factorization refers to the actual factorization step of the decomposition. The considered LU decompositions for electrical circuits (NICSLU, GLU, and Basker, with KLU as reference) are compared in the following.


5.1.3 KLU, NICSLU, GLU, and Basker by Comparison

Contrary to KLU, all newer LU decompositions (i. e. NICSLU, Basker, and GLU) are developed especially for modern computer architectures with multi-core central processing units (CPUs) or even GPUs. As there has been practically no single-core performance improvement since around the year 2005 [Pre12], the utilization of parallel architectures is of essential importance for a higher runtime efficiency on newer computer hardware, which comes with more and more CPU cores as well as increasingly performant accelerators.

KLU

KLU is a decomposition algorithm for asymmetric sparse matrices as they arise in circuit simulation [DP10]. Besides commercial tools, such as the numerical computing environment MATLAB and the circuit simulator Xyce, KLU is integrated into IDA. Since KLU was developed with a focus on circuit matrices, it shows a high runtime efficiency in the area of power grid simulations [CRS+11]. Therefore, the OpenModelica and Dynaωo simulation environments use KLU as linear solver within IDA, which serves as solver for the initial value problems of DAEs resulting from system models.

When solving the first matrix (in a sequence), KLU performs four steps:

1. A permutation of the given matrix A, to be factorized into L and U, is performed by the matrices P (row permutation) and Q (column permutation) into a block triangular form (BTF):

           ( A_11  A_12  ···  A_1n )
   P A Q = (  0    A_22  ···   ⋮   )
           (  ⋮     ⋱     ⋱    ⋮   )
           (  0    ···    0   A_nn )

   The diagonal blocks are independent of each other and are therefore the only ones that require factorization.

2. The Approximate Minimum Degree (AMD) ordering algorithm is performed block-wise on each block A_kk for fill-in reduction [ADD04]. Fill-in is defined as a non-zero entry arising during factorization in L or U at a position at which A has a zero entry. Fill-in reduction is a crucial step in sparse matrix factorizations, as new non-zero entries in sparse matrices require memory space (zero entries need none). This leads to a higher memory consumption and, during further processing, to more memory accesses, which can be very


time-costly, decreasing the performance of the whole factorization significantly (especially on modern processors because of the memory wall [ECT17]). Therefore, KLU is optimized for fill-in reduction of circuit matrices. As alternatives to AMD, the Column Approximate Minimum Degree (COLAMD) ordering algorithm [Dav+04], orderings provided by CHOLMOD such as nested dissection based on METIS (an unstructured graph partitioning and sparse matrix ordering package [KK95]), as well as a user-defined permutation can be chosen for each block.

3. Each A_kk is scaled and symbolically as well as numerically factorized using KLU's implementation of Gilbert/Peierls' (GP) left-looking algorithm. The scaling of the block matrices (i. e. achieving matrix entries of comparable magnitude) is a pre-step for pivoting, which is performed on each A_kk as the factorization method is likewise applied block-wise, and leads to a higher numerical stability.

4. Optional: The whole system is solved with the resulting factorizationusing block back substitution.

In case of subsequent factorizations of matrices with the same non-zero pattern, the first two steps are omitted and, in the third step, a simplified left-looking method skips the partial pivoting. This so-called refactorization step allows the omission of the depth-first search within the GP algorithm, leading to a higher performance. The first two steps constitute the preordering. A parallelization approach was mentioned in [Abu+18], but without any implementation details; the official KLU version is not parallelized.

NICSLU

NICSLU is a shared-memory parallelized [Roo99] LU decomposition [CWY13]. Nevertheless, some steps performed by NICSLU are similar to those of KLU:

1. Instead of BTF, the MC64 algorithm is utilized for finding a permu-tation and diagonal scaling for sparse matrices. Putting large entrieson the diagonal can make the subsequent pivoting numerically morestable.

2. As opposed to KLU, the AMD algorithm for fill-in reduction is applied not to each diagonal block but to the whole matrix.

3. This step determines if the subsequent factorization shall be per-formed (in 4.1.) sequentially or (in 4.2.) in parallel (i. e. withmultiple threads, e. g., on several CPU cores).


4.1. The sequential factorization is based on the left-looking GP algorithm,performing a symbolic factorization, a numeric factorization, andpartial pivoting.

4.2. The parallel factorization was developed based on the left-lookingGP and KLU algorithm [CWY13].

5. Optional: The whole system is solved with the resulting factorizationusing classical right-hand solving.

Analogous to KLU, the first two steps make up the preordering phase and, together with step 3, the whole preprocessing. In [CWY13], the authors present a benchmark of, i. a., NICSLU vs. KLU on 23 circuit matrices, with NICSLU showing geometric-mean speedups of 2.11 to 8.38 when executed with 1 to 8 parallel computing threads. These parallel speedups were one reason for the choice of NICSLU in the later presented comparative analysis of modern LU decompositions.

GLU

GLU is also a parallelized LU decomposition, but for CUDA-enabled GPUs [Che+15]. As it was also developed for circuit matrices, its steps are similar to those of KLU and NICSLU:

1. MC64 is performed as in NICSLU.

2. AMD is performed as in NICSLU (i. e. on the whole matrix).

3. A symbolic factorization, with 0 and 1 as the only entries for zero and non-zero values, is performed to determine the structure of L and U as well as a grouping of independent columns into so-called column-levels.

4. A hybrid right-looking LU factorization (instead of left-looking as in GP) is performed, which benefits from the column-level concurrency and the symbolic factorization.

The first three (preprocessing) steps are executed on the CPU and only step 4 on the GPU. Experimental results were presented in [Che+15], including, e. g., speedups of 19.56 over KLU on the set of typical circuit matrices from the University of Florida.

Basker

Basker is the newest of the four LU decompositions and, like NICSLU, is shared-memory parallelized, but from an algorithmic point of view it is mostly


similar to KLU [BRT16]. It was developed as an alternative to KLU for circuit simulation and employs a two-level parallelism, between blocks and within blocks, as described below:

1. As in KLU, BTF is performed (it can be disabled). The resulting matrix has large and small diagonal blocks.

2.1. The small diagonal blocks can be factorized in parallel, in a so-called Fine Block Triangular Structure, as they do not depend on each other. Hereby,

a) each small diagonal block is symbolically factorized in paralleland afterwards

b) a parallel loop over all these small blocks applies the sequentialGP algorithm on each of them.

2.2. The large diagonal blocks, in contrast, could be too large to be factorized by sequential GP, as this could dominate the complete LU decomposition time. Therefore, large blocks

a) are reordered by ND and

b) the ND structure is mapped to threads by using a task dependency graph, which is transformed into a task dependency tree representing level sets that can be executed in parallel. After that,

c) the parallel ND Symbolic Factorization and

d) the parallel ND Numeric Factorization are performed.

In [BRT16], a geometric-mean speedup of 5.91 over KLU is stated for a CPU-based system with 16 cores.

5.2 Analysis of Modern LU Decompositions for Electrical Circuits

For the comparative analysis of the new LU decompositions with KLU as reference, they have been integrated into a measurement environment with drivers for a set of benchmark matrices, in order to evaluate which of them could be integrated into a power grid simulation environment for further analyses. In this work, the results presented in [Raz+19a] are extended by further analyses, especially of Basker.


5.2.1 Analysis on Benchmark Matrices from Large-Scale Grids

For a uniform measurement of all LU decomposition methods, a measurement environment was developed in C++, which also helped with the integration of promising methods into proper simulation environments. The driver executes each decomposition and measures the wall-clock time of each relevant processing step.
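The timing core of such a driver can be sketched as follows; this is a hedged illustration with placeholder phases, not the actual measurement environment:

```cpp
#include <chrono>
#include <functional>

// Returns the wall-clock duration of one processing phase in milliseconds.
double time_ms(const std::function<void()>& phase) {
    const auto t0 = std::chrono::steady_clock::now();
    phase();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A driver would then record, for instance, `double t_factor = time_ms([&] { /* call the decomposition's factorization routine */ });` separately for the preprocessing, factorization, and solving steps of each method under test.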

Benchmark Matrices

For an analysis of the correctness and performance of the LU decompositions, a benchmark around seven matrices has been developed. The matrices have been extracted from Dynaωo static phasor simulations of real test cases conducted at RTE, spanning from a regional portion of the grid to a test case representing a merge of the networks of different countries, with high voltage (HV) and extra high voltage (EHV) parts.

The modeling choices are the same for all scenarios (except the load models): synchronous machines with their control for classical generation units, standard static var compensators, controllers as well as special protection schemes (tap and phase shifter, current limit controller, voltage controller, etc.), primary frequency control, and primary as well as secondary voltage regulations.

Loads are modeled either as first-order restorative loads, denoted as simplified loads (SLs), or as voltage dependent loads (VDLs) behind one or two transformers. Both models are used at RTE depending on the study scope and are thus of practical relevance. Tab. 5.1 presents all benchmark matrices provided by RTE with some information about their origin and their characteristics. Moreover, Fig. 5.1 depicts the matrix sparsity patterns which are typical for power grid matrices. Usually they

No.  Power Grid                           K      N       NNZ     d [%]
(1)  French EHV with SL                   2000   26432   92718   0.013
(2)  French EHV with VDL                  2000   60236   188666  0.0051
(3)  F. + one neighbor EHV, SL            3000   47900   205663  0.0089
(4)  F. + one neighbor EHV, VDL           3000   75300   266958  0.0047
(5)  F. + neighb. countries EHV, SL       7500   70434   267116  0.0054
(6)  F. EHV + regional HV, VDL            4000   197288  586745  0.0015
(7)  F. + neighb. countries EHV, VDL      7500   220828  693442  0.0014

Table 5.1: Characteristics of square N × N matrices with K nodes, sorted by number of nonzeros NNZ, and with density factor d = NNZ/(N · N) in %


show a very low density factor (i. e. few non-zero elements), mainly concentrated around the diagonal.

In all the shown matrices, the upper left part corresponds to the network part. It is followed by many small blocks around the diagonal: the injection models (generators, loads, etc.), which are modeled using only one interface to the network (current and voltage). Finally, the columns in the right part of the matrix, containing non-zero elements, result from the system-wide controls such as calculations of the system frequency that are related to all generators.

The density factor is higher with SL models than with VDL models, as VDLs have many more variables which are mainly linked to each other but not to outer variables (except through a single network interface). More information on the (a-)symmetry of circuit matrices can be found in [DP10].
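The density factor from Tab. 5.1 follows directly from its definition; the helper below is a hypothetical illustration, not part of the measurement environment.

```cpp
// Density factor d = NNZ / (N * N) in percent of an N x N sparse matrix,
// as listed in Tab. 5.1.
double densityPercent(long long nnz, long long n) {
    return 100.0 * static_cast<double>(nnz)
                 / (static_cast<double>(n) * static_cast<double>(n));
}
```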

Measurement Environment

The following execution time measurements were performed on a server with 2 sockets, each with an Intel Xeon E5-2643 v4 3.4 GHz (3.7 GHz Turbo) 6-core CPU with Hyper-Threading (HT); 32 GB DDR4 main memory; an NVIDIA TESLA P40 (GP102 Pascal, 24 GB GDDR5); running an x86_64 Ubuntu 16.04 Server Linux with kernel a) 4.13.0-46-generic for general measurements, b) 4.11.5-RT (with enabled PREEMPT_RT [Lin19b]) for real-time (RT) kernel measurements, and c) 4.13.16-custom for GLU measurements with NVIDIA driver x86_64-396.44 and CUDA 9.2. The versions of the LU decompositions and compilers are: KLU v1.3.8 with gcc-7.2.0, NICSLU v3.0.1 with clang-4.0.1-6, GLU v2.0 with g++-7.2.0, all built with compiler optimization level 2 as this leads to the highest performance. All measured times are wall-clock times.


[Figure 5.1: Sparsity patterns of benchmark matrices (1)–(7)]


Complete Decomposition

The total execution times (i. e. preprocessing and factorization) for a complete decomposition of the benchmark matrices are plotted in Fig. 5.2. Across almost all matrices, Basker is the most time-consuming method, followed by NICSLU, which on some matrices is nearly as performant as KLU. Only in case of matrix no. 3 does Basker show a better performance than NICSLU. As pictured in Fig. 5.1, this matrix has many relatively big blocks on its diagonal. While the times of all CPU-based implementations are below ca. 1 s, the GLU times are in most cases around 10 times higher. The main reason is the preprocessing time, as can be seen in the next plots.

[Figure 5.2: Total (preprocessing + factorization) times; (1) Basker, KLU, and NICSLU, (2) GLU]

Preprocessing

The preprocessing times of KLU, as shown in Fig. 5.3, are lowest in all cases. The main reason for this is the application of AMD on (smaller) submatrices instead of the whole matrix. In case of Basker, not only the whole runtime but also the preprocessing of matrix no. 3 is relatively quick in comparison to the other matrices. For GLU, it can be seen that the preprocessing occupies most of the total decomposition time, which is due to the symbolic factorization step performed on the CPU.


[Figure 5.3: Preprocessing times; (1) Basker, KLU, and NICSLU, (2) GLU]

Factorization

Basker's factorization times in Fig. 5.4 are also higher than those of KLU and NICSLU. The times NICSLU needs are mostly equal to or lower than those of KLU, especially for matrix no. 6, which is one of the larger matrices and has a quite big dense block in its upper left corner, as depicted in Fig. 5.1. The factorizations performed by GLU on the TESLA device need only a fraction of the total decomposition time but are still around 10 times slower than KLU and NICSLU on the CPU.

[Figure 5.4: Factorization times; (1) Basker, KLU, and NICSLU, (2) GLU]

Complete Decomposition and Preprocessing on RT kernel

In Fig. 5.5, the execution times for the most promising decomposition methods, KLU and NICSLU, are compared between the generic and the


RT kernel. KLU needs considerably more time on the RT than on the generic kernel. In case of NICSLU, there are only small differences between the kernels. As a consequence, the total times of KLU with the generic kernel are always lower and with the RT kernel often higher, compared to NICSLU. At this point it is important to note that a real-time optimized system resp. kernel does not need to run faster than a generic one; instead, the goal is that it runs deterministically within well-specified time constraints. The pure preprocessing times of KLU, as shown in Fig. 5.5, are lowest in all cases. The main reason for this is the application of AMD on (smaller) submatrices instead of the whole matrix. Again, the runtimes of KLU on the different kernels differ more than the ones of NICSLU.

[Figure 5.5: Execution times on generic vs. RT kernel; (1) total (preprocessing + factorization), (2) preprocessing]

Refactorization vs. Factorization

Since neither Basker nor GLU currently supports refactorization, these execution times were only measured for KLU and NICSLU on the generic and the RT kernel, as depicted in Fig. 5.6. For both methods, the time for refactorizations is much lower than for factorizations. Refactorizations are performed much faster by NICSLU than by KLU. On the RT kernel, most NICSLU factorizations of the provided matrices are even faster than KLU refactorizations.


[Figure 5.6: (Re-)factorization times; (1) generic kernel, (2) RT kernel]

Parallel Shared-Memory Processing of Basker and NICSLU

The CPU-based LU decompositions in the previously presented measurements were executed sequentially. As there is no official parallelized version of KLU available to the authors, only the parallel processing of Basker and NICSLU is considered in the following. The parallel processing of NICSLU is shown in Fig. 5.7. The total execution times for multiple threads are always higher than for one single thread. This cannot be caused by the turbo mode (i. e. higher CPU clock rate) only, as the times with two threads are also higher. Even the pure factorization times with multiple threads are higher than with a single thread. Obviously, the parallelization of NICSLU does not scale for the benchmark matrices. The reason for the low performance with 16 threads is the total number of 12 physical processors (i. e. 24 logical processors with HT), leading the operating system scheduler to switch between running and waiting threads more often. The accompanying context switching causes longer execution times.

The parallel processing of Basker is shown in Fig. 5.8. Contrary to NICSLU, the factorization performed by Basker can scale well with multiple threads, e. g. for matrix no. 6. Still, Basker is not faster than the sequential KLU even with a higher number of threads. Since Basker is in an alpha stage, it can only handle thread counts that are powers of two. Hence, not more than 8 truly independently executed threads could have been started on the 12-core system; moreover, software-technical issues of Basker with some matrices led to the limit of 4 threads in the measurements.


[Figure 5.7: NICSLU's scaling over multiple threads (T); (1) total (preprocessing + factorization), (2) factorization]

[Figure 5.8: Basker's scaling over multiple threads (T); (1) total (preprocessing + factorization), (2) factorization]

Alternative Preordering Methods

For a performance analysis with different preordering methods (AMD, METIS, and COLAMD), we integrated METIS and COLAMD into NICSLU. In case of METIS, the total execution times for the LU decompositions, as depicted in Fig. 5.9, are significantly higher than in case of AMD and COLAMD. The reason is the long execution time of METIS itself. This can be derived from Fig. 5.10, as the factorization times after METIS preorderings are comparable to those after AMD and COLAMD.

On the generic kernel, KLU benefits from AMD. On the RT kernel it benefits from COLAMD, but in case of pure factorizations it can benefit from METIS as well. The NICSLU factorization times, in case of AMD


[Figure 5.9: Total times with different preorderings; (1) KLU on generic kernel, (2) NICSLU on generic kernel, (3) KLU on RT kernel, (4) NICSLU on RT kernel]

and COLAMD, are in all cases very close together and lower than after METIS preorderings.

5.2.2 Analysis on Power Grid Simulations

The benchmarks presented in this subsection were performed by RTE, and parts of the text were authored by the co-authors from RTE of the publication [Raz+19a]. Because of the low performance in comparison to the other LU decompositions, GLU was not selected for the integration into simulation environments. Basker, however, was integrated into OpenModelica but, because of its early development stage, it was not mature enough for adequate simulation benchmarks as it generates errors at certain system sizes.

Due to NICSLU's relatively high performance on the benchmark matrices, it was integrated into the IDA versions used by OpenModelica and Dynaωo. Moreover, due to positive performance results in parallel mode,


[Figure 5.10: Factorization times with different preorderings; (1) KLU on generic kernel, (2) NICSLU on generic kernel, (3) KLU on RT kernel, (4) NICSLU on RT kernel]

Basker was integrated into the IDA version of OpenModelica for testing. This needed more effort as Basker is in a too early development stage (e. g. it returns errors for certain matrices), which is why it was not integrated into Dynaωo. As a result, simulations were performed with NICSLU in Dynaωo [Gui+18], which contains two solvers utilizing SUNDIALS. Three test cases have been selected to measure the performance of both LU decompositions with the aforementioned two solvers, which will be introduced later on:

(1) French EHV network with SL models

(2) French EHV network with VDL models

(3) French EHV/HV network with VDL models

Measurement Environment

For each test case, the simulation lasts for 200 s with a line disconnection at t = 100 s and is done on a machine with an Intel Core i7-6820HQ 2.7 GHz


(3.6 GHz Turbo), 4-core CPU with HT; 62 GB DDR4 main memory; running on Fedora Linux with kernel 4.13.16-100.fc25.x86_64. All measured times are wall-clock times.

Dynaωo’s Fixed Time Step Solver

The first of the two solvers available in Dynaωo is a fixed time step solver, inherited from PEGASE [Fab+11; FC09] and specifically designed for a fast long-term voltage stability simulation. It applies a first-order Euler method using a Newton-Raphson (NR) approximation for resolving the nonlinear system at each time step (with KINSOL, an NR-based method available in SUNDIALS). In this approach, the LU decomposition for the Jacobian is computed as few times as possible.
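The interplay of the implicit first-order Euler step and the NR iteration can be sketched on a scalar toy problem; freezing the Jacobian mimics the strategy of computing the LU decomposition as few times as possible. This is an illustration of the scheme, not the PEGASE/KINSOL implementation.

```cpp
#include <cmath>
#include <functional>

// One implicit Euler step for x' = f(x): solve g(y) = y - x - h*f(y) = 0
// with Newton-Raphson. The Jacobian g'(y) = 1 - h*f'(y) is evaluated once
// at the start value and reused in all iterations (frozen "LU").
double implicitEulerStep(double x, double h,
                         const std::function<double(double)>& f,
                         const std::function<double(double)>& df) {
    double y = x;                            // initial Newton guess
    const double gPrime = 1.0 - h * df(x);   // frozen Jacobian of g
    for (int it = 0; it < 50; ++it) {
        const double g = y - x - h * f(y);   // residual of the Euler equation
        if (std::fabs(g) < 1e-12) break;
        y -= g / gPrime;                     // Newton update, Jacobian reused
    }
    return y;
}
```

For the linear test problem x' = -x with h = 0.1, a single Newton update already reaches the exact implicit Euler value x/(1 + h); in the large sparse case, each Newton update corresponds to one forward/backward solve with the stored LU factors.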

Tab. 5.2 shows that only a few milliseconds are spent in the LU decomposition and the Jacobian evaluation. Moreover, most of the time elapses in the residual evaluation. It is important to notice that the LU decompositions are performed only when there are major changes of the grid.

Case    KLU         NICSLU      Eval. JF    Eval. F
no.     [s]    C    [s]    C    [s]    C    [s]    C
(1)     0.095  3    0.071  3    0.11   3    2.01   561
(2)     0.215  4    0.215  4    0.46   4    5.96   617
(3)     0.847  13   0.790  13   1.61   13   9.41   767

Table 5.2: Total execution times and numbers C of calls of the corresponding routines within the fixed time step solver, with Jacobian JF and residual function vector F

Dynaωo’s Variable Time Step Solver

The second solver available in Dynaωo is a variable time step solver based on SUNDIALS/IDA plus additional routines to deal with algebraic mode changes due to topology modifications of the grid. Jacobian evaluations and LU decompositions occur much more often than with the fixed time step solver.

Table 5.3 presents the results with the variable time step solver. They confirm the trends observed with the individual matrices, i. e. the preordering step takes more time with NICSLU than with KLU but these extra costs are offset by a substantial reduction in the factorization and refactorization steps. Usually, there should be mainly refactorizations. Factorizations should appear only when there is a change in the matrix structure (corresponding either to a change in the grid topology or to a deep change in the form of the injection equations). Keeping this point in mind, it should be possible to gain time with NICSLU on complete simulation times compared to KLU (26.67 s vs. 34.56 s in case (3)). This gain remains minimal at the moment compared to the overall numerical resolution time (36 s in case (1), 102 s in (2), and 266 s in (3)), but if improvements are also achieved on the other elementary tasks, it could help make the difference in the long term.
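The resulting policy can be sketched as a simple decision rule; the boolean structure-change flag is a hypothetical input that a simulator would derive from topology events, and real codes would call e.g. KLU's or NICSLU's factor/refactor entry points accordingly.

```cpp
#include <vector>

// Full factorization (with pivoting) is only needed when the sparsity
// structure changes, e.g. after a grid topology change; otherwise the
// numeric values are updated by the cheaper refactorization that reuses
// the existing pivot order and symbolic analysis.
enum class Step { Factorization, Refactorization };

std::vector<Step> planDecompositions(const std::vector<bool>& structureChanged) {
    std::vector<Step> plan;
    bool havePattern = false;
    for (bool changed : structureChanged) {
        if (changed || !havePattern) {
            plan.push_back(Step::Factorization);   // new sparsity pattern
            havePattern = true;
        } else {
            plan.push_back(Step::Refactorization); // same pattern, new values
        }
    }
    return plan;
}
```

With topology changes being rare, almost all planned steps become refactorizations, which is exactly where NICSLU's advantage in Tab. 5.3 comes from.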

5.3 Conclusion and Outlook

This chapter presented the most promising recently developed LU decomposition methods (Basker, NICSLU, and GLU) for electric circuit simulation that have been found in current literature. After a short introduction of the main ideas behind the methods, a comparative analysis with KLU (as the reference LU decomposition for power grids) was conducted on benchmark matrices from large-scale phasor time-domain simulations. Since the integration of NICSLU into OpenModelica and Dynaωo is stable, it can already be used in productive environments. The immature Basker implementation, however, can be software-technically improved and tested within the OpenModelica environment, where it was integrated, to gain better runtime stability.

The analysis shows that KLU and NICSLU achieve a similar performance for total execution times on benchmark matrices, while Basker's performance, especially in single-threaded mode, is lower. However, Basker can achieve speedups of the factorization when running parallel threads.

Case    Preord. [s]    Fact. [s]    Refact. [s]    Sum [s]    D      f       Method
(1)     2.42           2.58         2.85           7.85       461    0.33    KLU
        2.74           0.88         0.72           4.34       461    0.33    NICSLU
(2)     4.98           2.81         2.72           10.51      466    0.34    KLU
        6.28           1.59         1.22           9.09       466    0.34    NICSLU
(3)     15.01          10.79        8.76           34.56      899    0.42    KLU
        18.96          4.87         2.84           26.67      899    0.42    NICSLU

Table 5.3: Accumulated execution times for the listed steps of the variable time step solver, with D LU decompositions and a factorization ratio f = #Fact./#Refact.


Moreover, Basker's speedup behaves superlinearly on a subset of the benchmark matrices. Superlinear speedup often occurs due to hardware features regarding CPU caches [Ris+16]: it can be caused by a smaller amount of data per thread, which fits better into the caches. Basker's developers indeed mentioned that for a larger number of threads, the ND tree may provide smaller cache-friendly submatrices [BRT16]. Since Basker's implementation is in an alpha state, one could possibly achieve better results with further development. For instance, it depends on the Trilinos library [Tri], especially on the parallel execution framework Kokkos; an individual parallelization of Basker, however, could result in a higher performance. GLU, despite its massive parallelization for GPUs, cannot compete with current CPU-based implementations in the presented analysis, as it showed a much lower performance in all cases.

The preprocessing of NICSLU is usually slower than that of KLU, but especially refactorizations are performed faster. Like many other shared-memory parallelized LU decompositions for sparse systems, NICSLU often cannot make use of multiple CPU cores. This is a problem since CPU clock speeds are not increasing anymore and the performance of processors nowadays is mainly increased by adding more CPU cores.

Executed on an RT kernel, NICSLU has shown a better performance than KLU, but more causal investigation is needed. Both KLU and now also NICSLU can benefit from different preorderings. Regarding complete simulations, NICSLU can provide improvements compared to KLU, benefiting from its refactorization step, which is more common during simulations than a complete factorization step.

The analysis of the unitary LU decompositions opens new perspectives for the generic numerical schemes and the choices made to improve the performance of power grid simulation solvers as well as other power grid related software that can make use of LU decompositions. Furthermore, the integration of a performant LU decomposition (esp. into the widely-used SUNDIALS library) allows simulation environment users to switch between different solvers not just for a better runtime performance under different circumstances – e. g. offline vs. real-time simulations – but also for a different numerical behavior. This can lead to better results in case of possible numerical instabilities, and it also offers an alternative in case of a solver issue.


6 Exploiting Parallelism in Power Grid Simulation

Besides runtime improvements through the application of numerical methods such as LU decompositions that are better suited in general or in the special case of power grid simulation, proper methods from the area of high-performance computing (HPC) can also be applied to the respective simulation software. One simulator recently developed at the Institute for Automation of Complex Power Systems (ACS) is the Dynamic Phasor Real-Time Simulator (DPsim), which introduces the dynamic phasor (DP) approach to real-time (RT) power grid simulation, as larger simulation steps are possible without losing accuracy [Mir+19]. This leads to a smaller impact of communication delays, e. g., between geographically distributed simulators running in different laboratories with special Hardware-in-the-Loop (HiL) setups. A reason for coupling into one RT co-simulation could be the lack of needed resources (e. g. hardware, software, know-how, location, etc.) to run a complete HiL simulation in just one laboratory [Mir+17].

DPsim uses several external software libraries, which include the VILLAS framework for the communication with other real-time simulators, control/monitoring software as well as hardware, and so forth. Grid data in a Common Information Model (CIM)-based format is read using the libcimpp library of the CIM++ project, as introduced in Chap. 3. Furthermore, multiple numerical libraries are used as there are several solvers implemented in DPsim, such as a modified nodal analysis (MNA) based solver which utilizes LU factorizations on dense and sparse matrices of


Eigen [Eig19]. Also the SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers (SUNDIALS) library is used as backend of DPsim's ordinary differential equation (ODE) solver.

To benefit from modern shared-memory multi-core systems, the computations within one time step are partitioned into multiple tasks, as defined by the parts of the simulation such as the utilization of different solvers and interfaces (e. g. an interface for real-time data exchange, a user interface for monitoring, and so forth). At this point, it should be noted that different solvers can be utilized within a single time step, as this depends on the components of the power grid model. Because of data dependencies between these tasks, they cannot all run in parallel, as this would lead to data races with wrong results [Qui03]. Therefore, a task dependency analysis is performed to achieve a data race free parallel task execution.

This chapter gives an overview of multiple kinds of model parallelization approaches on different abstraction resp. implementation levels that have been implemented, exemplarily, in the OpenModelica simulation environment. It also introduces schedulers for parallel task execution and describes how they are implemented in DPsim in combination with the task dependency analysis. This is followed by a runtime analysis of DPsim with the implemented approach on various power grids of different sizes. The chapter concludes with a discussion on the advantages and disadvantages of the parallel execution as well as on the utilized schedulers. This chapter presents outcomes of the supervised thesis [Rei19].

6.1 Parallelism in Simulation Models

Chapter 5, i. a., dealt with approaches where parallelism within numerical methods (LU decompositions in the present case) is used for shorter execution times on multi-core architectures. Instead of using the parallelism of numerical solvers, however, it is also possible to exploit the inherent parallelism of the model as such. The inherent parallelism of a model can either be expressed by its developer or be recognized automatically. Without any claim to completeness, [Lun+09] describes the first three of the following types of approaches for exploiting parallelism in mathematical models:

1. Explicit Parallel Programming

This type concerns approaches where parallel constructs are expressed in the programming language of the mathematical model itself. For example, ParModelica [Geb+12] is an extension of the modeling language Modelica which allows the user to express parallelism in algorithm sections (i. e.


in imperatively programmed parts of a model instead of declarative parts as expressed by equation sections). In this approach, the developer of the model is responsible for its (correct) parallelization. For this purpose, ParModelica provides parallel variables (allocated in different memory spaces) as well as functions, parfor loops, and kernel functions which are executed on OpenCL devices (e. g. graphics processing units (GPUs)) as part of so-called heterogeneous computer systems.

2. Explicit Parallelization Using Computational Components

Another type of explicit parallelization exploitation is achieved by structuring the model into computational components using strongly typed communication interfaces. For this, the architectural language properties of Modelica, supporting components and strongly typed connectors, are generalized to distributed components and connectors. An example of this approach is the Transmission Line Modeling (TLM), where the physical model is distributed among numerically isolated components [Sjö+10]. Hence, the equations of each submodel can be solved independently and thus in parallel.

This kind of explicit parallelization is implemented in DPsim by system decoupling in the form of two different methods: the Decoupled Line Model and Diakoptics, as presented in [Mir20].

3. Automatic Fine-Grained Parallelization of Mathematical Models

Besides the explicit expression of parallelism, it is also possible to extract parallelism from the high-level mathematical model or from the numerical methods used for solving the problem. The parallelism exploitation from mathematical models is categorized by the following subtypes:

Parallelism over time: for example in case of discrete event simulations where certain events are independent from other events and can therefore be handled in parallel;

Parallelism of the system: this means that the modeled system (i. e. the model equations) is parallelized. There has been much research on automatic parallelization, especially on equation-level methods [Aro06; Cas13; Wal+14].

Similarly to the fine-grained approach, the following new (4.) approach type was introduced.


(4.) Automatic Coarse-Grained Parallelization of Mathematical Models

Rather than exploiting the parallelism at equation level, it is also possible to consider it at component level. This new methodology was implemented in DPsim by splitting one simulation step into separate tasks, whereby every component in the power grid model declares a list of tasks that have to be processed in each simulation step. The approach will be presented in the following.

6.1.1 Task Scheduling

This chapter deals with the scheduling of tasks, i. e. parts of a solution procedure, which can be performed by multiple threads that are spawned on a multiprocessor system by the process' main thread. It is not about any operating system schedulers for processes running on a single- or multiprocessor system [Tan09]. The term multiprocessor system refers to logical central processing units (CPUs) and therefore includes systems with a single physical CPU and multiple cores as well as systems with multiple physical CPUs and multiple cores. As the simulated models are small enough to fit into the main memory of current workstations and servers, only shared-memory parallel programming is considered. Therefore, multiple threads sharing the same memory regions can be used instead of multiple processes running in parallel on multiple interconnected processors with distributed memory.

The obstacle in case of shared-memory parallelization is that multiple threads could access the same data concurrently, which can lead to so-called race conditions [Roo99] causing wrong results. Therefore, synchronization between the parallel running threads must be performed and the execution order of program statements that depend on each other must be kept. For example, if a value is calculated in a program statement S1 and used in S2 as input value, then statement S1 must be executed before S2. Statement S2 depends on S1 and, therefore, both statements cannot be executed in parallel. Dependency analyses on statements have long been a subject of research [WB87] but can also be performed on procedures or tasks: what applies to single statements equally applies to groups of statements (i. e. tasks).

The scheduling of tasks to a set of processors is divided in [KA99] into different categories, as pictured in Fig. 6.1. As the considered tasks depend on each other, a scheduling variant from the scheduling and mapping category must be chosen, with the two subcategories dynamic scheduling and static scheduling. Dynamic scheduling is chosen when there is not enough a priori information about the tasks' processing durations available before


their processing. Static scheduling, instead, can be used in case enough a priori information is given, which can be used for a mostly efficient scheduling. Static scheduling can again be divided into approaches based on task interaction graphs and on task precedence graphs. Task interaction graphs can be used when loosely coupled communicating processes need to be scheduled, which can be the case on a distributed (memory) system. As this does not apply to the intended shared-memory parallelization, static scheduling based on task precedence graphs (in the following called task graphs) was chosen.

The task processing times of a time step can be exploited for the next steps as long as the mathematical structure of the grid model resp. the control flow within the tasks does not change too much. A reason for a change of the control flow within a task could be a switching between one and another simulation step within a component, whereas a switching between components (e. g. by a breaker) could change the data flow resp. the dependencies between tasks and therefore require an updated task graph. In the following, some formal definitions for the used terms are introduced.

Basic Terms

At this point, a task can be considered as a sequence of program statements that are executed by a processor sequentially.

Definition 6.1 (Dependency and task graph)
Given a set T = {T1, ..., Tn} of tasks, which is the set of nodes of the belonging task graph, an edge (Ti, Tj) ∈ E ⊆ T × T, with i, j ∈ {1, ..., n}, expresses a data dependency of Tj on Ti, requiring that Ti must be performed before Tj, also denoted as Ti ≺ Tj. The resulting directed acyclic graph (DAG) G = (T, E) is called the task graph.

[Figure 6.1: Categories of parallel task scheduling — parallel program scheduling splits into job scheduling (independent tasks) and scheduling and mapping (multiple interacting tasks); the latter into dynamic and static scheduling; static scheduling into approaches based on task interaction graphs and on task precedence graphs]

Definition 6.2 (Task types, weight, and length)
Given a task graph G = (T = {T1, ..., Tn}, E ⊆ T × T),

• a task V ∈ T without incoming edges, i. e., for which it holds that ∀U ∈ T : (U, V ) ∉ E, is called an entry task;

• a task V ∈ T without outgoing edges, i. e., for which it holds that ∀W ∈ T : (V, W ) ∉ E, is called an exit task;

• the weight (i. e. execution time) of a task V ∈ T is given by w(V ), with the weight function w : T → N;

• the length lp of a path p = Ti1 ≺ ... ≺ Tik, with k ∈ N tasks, is defined as the sum of the weights of its tasks, i. e., lp = ∑_{j=1}^{k} w(Tij).

In case of a distributed-memory system, communication costs could also be taken into account as edge weights (e. g. because of message passing between the computing nodes), but they are neglected for the intended shared-memory parallelization. An example task graph is given in Fig. 6.2, where the weight of each task is given in parentheses beside the task identifier. In the following, it is shown how task graphs can be utilized to distribute the tasks among multiple processors in a way that is optimal regarding the total processing time.
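As a small check of the length definition above, the path length can be computed directly from the task weights. The following Python sketch uses the weights from Fig. 6.2; the chosen path T1 ≺ T4 ≺ T6 is an assumption reconstructed from the example graph and not stated explicitly in the text:

```python
# Path length per Def. 6.2: the sum of the task weights along the path.
# Task weights taken from the example task graph in Fig. 6.2.
w = {"T1": 1, "T2": 2, "T3": 1, "T4": 3, "T5": 1, "T6": 1}

def path_length(path, w):
    """Length l_p of a path p = T_i1 < ... < T_ik (Def. 6.2)."""
    return sum(w[t] for t in path)

# A longest (critical) path in the example graph, assuming the
# dependencies T1 < T4 < T6:
print(path_length(["T1", "T4", "T6"], w))  # 5
```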

Figure 6.2: Example task graph (tasks with weights in parentheses: T1(1), T2(2), T3(1), T4(3), T5(1), T6(1))


6.1 Parallelism in Simulation Models

General Scheduling Problem for Parallel Processing

A task schedule must provide the start time for each task. This can be formalized on the basis of [Ull75] as follows.

Definition 6.3 (Schedule function, optimal schedule)
Given a set of tasks T = {T1, ..., Tn} to be executed on a system with p ∈ N processors, a schedule function f : T → N0, which specifies the start time of each task, is sought, for which the following restrictions hold:

• a task Tj that depends on Ti may not start before Ti has finished, i. e., ∀Ti, Tj ∈ T : if Ti ≺ Tj, then f(Ti) + w(Ti) ≤ f(Tj);

• at each time point, at most p tasks are processed concurrently, i. e., ∀t ∈ N0 : |{V ∈ T | f(V) ≤ t < f(V) + w(V)}| ≤ p.

A schedule specified by the schedule function fopt is an optimal schedule iff the total execution time is minimal under the restrictions above, i. e.,

max_i {fopt(Ti) + w(Ti)} = min_f max_i {f(Ti) + w(Ti)}.

The problem of finding an optimal schedule in case of p = 2 and a single execution time tconst = w(V) for all V ∈ T can be solved deterministically in polynomial time, but for p > 2 it is generally NP-complete [Ull75]. Therefore, instead of trying to find an optimal schedule, heuristic algorithms are applied. Two classes of such heuristic schedulers are presented in the following.
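Although finding an optimal schedule is NP-complete, verifying that a given schedule satisfies the two restrictions of Def. 6.3 is straightforward. The following Python sketch (not part of DPsim) checks both restrictions; the task weights and the dependency edges are an assumption reconstructed from the example graph of Figs. 6.2 and 6.6, and the tested schedule corresponds to the HLFET result shown in Fig. 6.7:

```python
def is_valid_schedule(f, w, edges, p):
    """Check the two restrictions of Def. 6.3 for a schedule f
    (task -> start time) with weights w, dependency edges and p processors."""
    # Restriction 1: a dependent task may not start before its predecessor ends.
    for (ti, tj) in edges:
        if f[ti] + w[ti] > f[tj]:
            return False
    # Restriction 2: at most p tasks run concurrently at any time point.
    horizon = max(f[t] + w[t] for t in f)
    for t in range(horizon):
        running = [v for v in f if f[v] <= t < f[v] + w[v]]
        if len(running) > p:
            return False
    return True

w = {"T1": 1, "T2": 2, "T3": 1, "T4": 3, "T5": 1, "T6": 1}
edges = [("T1", "T4"), ("T2", "T5"), ("T3", "T5"), ("T4", "T6"), ("T5", "T6")]
# The optimal schedule of Fig. 6.7: P1 runs T1, T4, T6; P2 runs T2, T3, T5.
f = {"T1": 0, "T2": 0, "T3": 2, "T4": 1, "T5": 3, "T6": 4}
print(is_valid_schedule(f, w, edges, p=2))  # True
```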

Level Scheduling

A level-scheduling-based approach for the equation-based parallelization of Modelica was implemented in OpenModelica. At the beginning, all entry tasks are assigned to the first level, as they do not depend on each other. All tasks that depend only on tasks in the first level are assigned to the second level, and so forth.

Definition 6.4 (Level scheduling)
Given a task V ∈ T = {T1, ..., Tn} and the set of predecessors PV = {U ∈ T | U ≺ V}, the level function l : T → N0 returns the level of the task V according to the recursive definition

l(V) = 0, if PV = ∅, and l(V) = 1 + max{l(S) | S ∈ PV}, otherwise.
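The recursive level definition can be sketched in Python as follows; the dependency edges are an assumption reconstructed from the example graph (Figs. 6.2 and 6.6), and the result matches the level assignment of Fig. 6.3:

```python
def levels(tasks, edges):
    """Level of each task per Def. 6.4: 0 for entry tasks,
    otherwise 1 plus the maximum level over all predecessors."""
    preds = {t: [u for (u, v) in edges if v == t] for t in tasks}
    memo = {}
    def l(v):
        if v not in memo:
            memo[v] = 0 if not preds[v] else 1 + max(l(u) for u in preds[v])
        return memo[v]
    return {t: l(t) for t in tasks}

tasks = ["T1", "T2", "T3", "T4", "T5", "T6"]
edges = [("T1", "T4"), ("T2", "T5"), ("T3", "T5"), ("T4", "T6"), ("T5", "T6")]
print(levels(tasks, edges))
# {'T1': 0, 'T2': 0, 'T3': 0, 'T4': 1, 'T5': 1, 'T6': 2}  (cf. Fig. 6.3)
```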

As the tasks within a certain level are independent of each other, they can be executed in any order or in parallel. In the simplest form, the tasks within a level are therefore distributed among the available processors without regard to their execution times. If the integer division of n by p leaves a remainder, the remaining tasks are distributed arbitrarily among the processors, so that certain processors have to execute one task more than the others. Fig. 6.3 shows an example of how levels could be assigned to the tasks in Fig. 6.2. Derived from this level assignment, a final schedule for p = 2 processors is illustrated in Fig. 6.4.

Figure 6.3: Example task graph including levels (level 0: T1, T2, T3; level 1: T4, T5; level 2: T6)

Figure 6.4: Schedule for the task graph in Fig. 6.2 with p = 2 using level scheduling (P1: T1, T2, T4, T6; P2: T3, T5)

In case of level scheduling, the synchronization (typically of threads on a shared-memory system) confines itself to barriers [Cha+08] between the executions of the levels. This leads to a simple implementation and low synchronization costs. It could be improved, however, by an enhanced assignment of the tasks within a level to the processors in order to minimize the execution time of each level. This corresponds to the NP-complete problem of multi-way number partitioning, where a given set of integers needs to be divided into a collection of subsets such that the sums of the numbers in each subset are as nearly equal as possible [Kor09]. A famous greedy heuristic [Cor+01] for solving this problem is to sort the numbers (here: w(Ti), with i = 1, ..., n) in decreasing order and assign each one to the subset (here: processor) with the smallest sum so far. Since the partial order ≺, restricted to the tasks within a level, is empty (as the tasks within a level are independent), the ratio between the execution time resulting from the greedy heuristic and the optimal execution time is bounded by 4/3 − 1/(3p) [Gra69]. This can be an acceptable value in many cases, but it must be kept in mind that the division of tasks into levels is generally not optimal. With the aid of the greedy heuristic, the two smaller tasks T1 and T3 in level 0 of the example shown in Fig. 6.3 are assigned to the first processor and task T2 to the second one (see Fig. 6.5), resulting in a shorter execution time of level 0 than before (see Fig. 6.4). The total execution time of all levels therefore reduces from 7 to 6. The next implemented method is list scheduling, introduced in the following.
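The greedy heuristic for multi-way number partitioning can be sketched in Python as follows; which tasks land on which processor depends on tie-breaking, but the resulting level-0 finish time matches the improved schedule described above:

```python
def greedy_partition(weights, p):
    """Greedy multi-way number partitioning: sort the tasks by decreasing
    weight and assign each to the processor with the smallest load so far."""
    loads = [0] * p
    assignment = [[] for _ in range(p)]
    for task in sorted(weights, key=weights.get, reverse=True):
        i = loads.index(min(loads))   # processor with the smallest sum
        assignment[i].append(task)
        loads[i] += weights[task]
    return assignment, loads

# Level 0 of Fig. 6.3: T1(1), T2(2), T3(1) on p = 2 processors.
assignment, loads = greedy_partition({"T1": 1, "T2": 2, "T3": 1}, 2)
print(loads)  # [2, 2] -> level 0 now finishes after 2 time units
```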

Figure 6.5: Schedule for the task graph in Fig. 6.2 with p = 2 using level scheduling considering execution times (P1: T1, T3, T4, T6; P2: T2, T5)

List Scheduling

A comparison of list schedules for parallel processing systems is provided in [ACD74]. All of them accomplish the following steps:

1. Creation of a scheduling list (i. e. the sequence of tasks to be scheduled) by assigning priorities to them.

2. While the task graph is not empty:

a) assignment of the task with the highest priority to the next available processor and

b) removal of it from the task graph.

The difference between the algorithms lies in how the tasks' priorities are determined. Two frequently used attributes for the assignment of priorities to tasks are the t-level (top level) and the b-level (bottom level). The t-level of a task V ∈ T is the length (as defined in Def. 6.2) of a longest path from an entry task to V. Analogously, the b-level of a task V is the length of a longest path from V to an exit task.


Figure 6.6: Example task graph including b-levels, with node label format Ti(w(Ti), b(Ti)): T1(1, 5), T2(2, 4), T3(1, 3), T4(3, 4), T5(1, 2), T6(1, 1)

Definition 6.5 (B-level function)
Given a task V ∈ T = {T1, ..., Tn}, the b-level function b : T → N returns the b-level of the task V according to the recursive definition

b(V) = w(V), if {W ∈ T | V ≺ W} = ∅, and b(V) = w(V) + max{b(W) | V ≺ W}, otherwise.
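The recursive b-level definition can be sketched in Python; the edge set below is an assumption reconstructed from the b-levels given in Fig. 6.6:

```python
def b_levels(weights, edges):
    """b-level per Def. 6.5: w(V) for exit tasks, otherwise
    w(V) plus the maximum b-level over all successors."""
    succs = {t: [v for (u, v) in edges if u == t] for t in weights}
    memo = {}
    def b(v):
        if v not in memo:
            s = succs[v]
            memo[v] = weights[v] + (max(b(x) for x in s) if s else 0)
        return memo[v]
    return {t: b(t) for t in weights}

weights = {"T1": 1, "T2": 2, "T3": 1, "T4": 3, "T5": 1, "T6": 1}
edges = [("T1", "T4"), ("T2", "T5"), ("T3", "T5"), ("T4", "T6"), ("T5", "T6")]
print(b_levels(weights, edges))
# {'T1': 5, 'T2': 4, 'T3': 3, 'T4': 4, 'T5': 2, 'T6': 1}  (cf. Fig. 6.6)
```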

A critical path (CP) of a DAG is a longest path in the DAG and thus of high importance for a schedule (see [KA99], where algorithms for t- and b-level computations are also presented). In general, scheduling in descending b-level order tends to schedule tasks on a CP first, while scheduling in ascending t-level order tends to schedule tasks in topological order (for more on topological ordering, see [KK04]). In [ACD74], the performance of different heuristic list scheduling algorithms is analyzed. It has been shown that the CP-based algorithms have near-optimal performance. One of these is the Highest Level First with Estimated Times (HLFET) algorithm. Another algorithm with a similar procedure but assuming a uniform execution time w(V) = 1 for all V ∈ T is the Highest Level First with No Estimated Times (HLFNET) algorithm. Figure 6.6 shows the example graph, extended by the b-level of each node. Using HLFET on it results in an optimal schedule as shown in Fig. 6.7. More on these and other scheduling algorithms can be found in [KA99].
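A simplified HLFET-style list scheduler can be sketched in Python as follows. This is a sketch under stated assumptions, not DPsim's implementation: ready tasks are started in descending b-level order on the earliest available processor, and the edge set is the one reconstructed from Fig. 6.6. On this example graph, the sketch reproduces the optimal makespan of 5, i. e. the length of the critical path:

```python
def hlfet(weights, edges, p):
    """HLFET sketch: repeatedly start the ready task with the highest
    b-level on the earliest available processor; returns start times
    and the makespan. The b-levels (Def. 6.5) are computed inline."""
    succs = {t: [v for (u, v) in edges if u == t] for t in weights}
    preds = {t: [u for (u, v) in edges if v == t] for t in weights}
    b = {}
    def blev(v):
        if v not in b:
            s = succs[v]
            b[v] = weights[v] + (max(blev(x) for x in s) if s else 0)
        return b[v]
    for t in weights:
        blev(t)
    free = [0] * p                    # next free time of each processor
    start, done = {}, {}
    while len(start) < len(weights):
        ready = [t for t in weights if t not in start
                 and all(u in done for u in preds[t])]
        t = max(ready, key=lambda x: b[x])     # highest b-level first
        i = free.index(min(free))              # earliest free processor
        s = max(free[i], max((done[u] for u in preds[t]), default=0))
        start[t], done[t], free[i] = s, s + weights[t], s + weights[t]
    return start, max(done.values())

weights = {"T1": 1, "T2": 2, "T3": 1, "T4": 3, "T5": 1, "T6": 1}
edges = [("T1", "T4"), ("T2", "T5"), ("T3", "T5"), ("T4", "T6"), ("T5", "T6")]
start, makespan = hlfet(weights, edges, 2)
print(makespan)  # 5, the length of the critical path (cf. Fig. 6.7)
```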

6.1.2 Task Parallelization in DPsim

Figure 6.7: Schedule for the task graph in Fig. 6.2 with p = 2 using HLFET (P1: T1, T4, T6; P2: T2, T3, T5)

The core part of the simulation tool is the actual simulation solver for power grid simulation. One of its main steps is calculating the system matrix A by iterating through a list of power grid components, accumulating each component's contribution. The simulation at time point t can then be sketched with the following steps:

1. computing the right-hand-side vector b(t) by accumulating each component's contribution (similar to the procedure composing the matrix A);

2. solving the system equation Ax(t) = b(t);

3. updating components’ states (e. g. equivalent current sources) usingthe solution x(t).

These are just the major tasks; others have to be performed in each step as well, such as simulating the dynamics of the mechanical parts of electromechanical components like synchronous generators, and during a distributed simulation, simulation values must be exchanged between the time steps. Eventually, simulation results and logs are saved where needed.

A single step is split into tasks defined by a list of tasks for each component which has to be simulated. Further tasks are added for the main step of system solving and optionally also for the logging of results as well as for data exchange with other processes (e. g. simulators) or HiL.

Task Dependency Analysis

For the representation of dependencies, a system of attributes is implemented. Attributes are properties of components, such as the voltage of a voltage source, which are accessed during the simulation by read or write operations. A task has two sets of attributes: one for attributes with read accesses and one for those with write accesses. If an attribute is written by a task T1 and read by a task T2, then T2 depends on T1, which is represented by a task graph as defined in Def. 6.1 for all tasks within one simulation step. The task graph for an example circuit (see Fig. 6.8) is depicted in Fig. 6.9. In PreStep, certain values necessary for the current simulation step (i. e. contributions to the right-hand-side vector) are computed depending on the solutions of the previous simulation step. In PostStep, certain component-specific values are calculated from the system solution computed by Sim.Solve in the current simulation step. For optimization purposes, tasks that are not necessary in a certain simulation are omitted. In case of the Resistor component, e. g., a PostStep task is processed which calculates the current through it based on the voltages from the system solution (e. g. calculated in Sim.Solve). More on the task dependency analysis can be found in [Mir20].
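The derivation of the task graph from read and write attribute sets can be sketched in Python. This is an illustration only: the attribute names (rhs, solution, R1.current) are hypothetical and not DPsim's actual attribute names:

```python
def build_task_graph(tasks):
    """Derive dependency edges from attribute accesses: if task A writes
    an attribute that task B reads, B depends on A (edge A -> B).
    Task format: name -> (set of read attributes, set of written attributes)."""
    edges = []
    for a, (_, writes_a) in tasks.items():
        for b, (reads_b, _) in tasks.items():
            if a != b and writes_a & reads_b:
                edges.append((a, b))
    return edges

# Hypothetical attribute sets loosely following Fig. 6.9:
tasks = {
    "V1.PreStep":  (set(),          {"rhs"}),
    "Sim.Solve":   ({"rhs"},        {"solution"}),
    "R1.PostStep": ({"solution"},   {"R1.current"}),
    "Sim.Log":     ({"R1.current"}, set()),
}
print(sorted(build_task_graph(tasks)))
# [('R1.PostStep', 'Sim.Log'), ('Sim.Solve', 'R1.PostStep'), ('V1.PreStep', 'Sim.Solve')]
```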

Task Schedulers

Before the actual simulation, a scheduler analyzes the task graph in order to create a schedule for the simulation using a certain number of concurrent threads, which can be scheduled by the operating system on different parallel processors for potential execution time improvements. Several schedulers based on the presented scheduling methods (see Sect. 6.1.1) were implemented in DPsim, as given in Tab. 6.1. Each scheduler has a createSchedule method for initialization purposes based on the task graph and a step method called in the main simulation loop. The SequentialScheduler sorts the task graph in topological order to obtain a valid task schedule for sequential processing. For the actual parallel processing, different Application Programming Interfaces (APIs) are used: OpenMP [Ope19b], providing a simple interface for the (incremental) development of parallel applications, and the std::thread class from the system's C++ Standard Library [Jos12].

Figure 6.8: Example circuit (voltage source V1, resistor R1, capacitor C1)

Figure 6.9: Task graph resulting from Fig. 6.8 (V1.PreStep and C1.PreStep precede Sim.Solve, which precedes Sim.Log, V1.PostStep, R1.PostStep, and C1.PostStep)

Table 6.1: Overview of the implemented schedulers

Scheduler class name | Short name | Paradigm | Algorithm
SequentialScheduler | sequential | – | Topological sort
OpenMPLevelScheduler | omp_level | OpenMP | Level scheduling
ThreadLevelScheduler | thread_level | std::thread | Level scheduling
ThreadListScheduler | thread_list | std::thread | HLF(N)ET
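The topological sort used conceptually by the SequentialScheduler can be sketched with Kahn's algorithm in Python (a sketch, not DPsim's C++ code; the example graph is the one reconstructed from Fig. 6.2):

```python
from collections import deque

def topological_order(tasks, edges):
    """Kahn's algorithm: yields a valid sequential execution order of a
    task graph, i.e., every task appears after all its predecessors."""
    indeg = {t: 0 for t in tasks}
    for (_, v) in edges:
        indeg[v] += 1
    queue = deque(t for t in tasks if indeg[t] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for (x, v) in edges:
            if x == u:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return order

tasks = ["T1", "T2", "T3", "T4", "T5", "T6"]
edges = [("T1", "T4"), ("T2", "T5"), ("T3", "T5"), ("T4", "T6"), ("T5", "T6")]
print(topological_order(tasks, edges))
# ['T1', 'T2', 'T3', 'T4', 'T5', 'T6']
```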

The OpenMPLevelScheduler has the simplest implementation, as it utilizes the OpenMP API. Its step function (see List. A.1) forks a given number of concurrent threads (through a parallel section) in which a loop over the levels is processed by each thread sequentially (i. e. each thread processes each level). Within this level loop, a parallel loop over the tasks within a level is executed with an OpenMP schedule(static) clause, causing a nearly equal distribution of the tasks among the threads. As a parallel for-loop in OpenMP has an implicit barrier by default, the concurrent threads process the levels synchronously. An advantage of OpenMP is that there are many implementations for different computer platforms, although there can be significant differences in computing performance [Mül03]. Also, the simple OpenMP pragmas allow an incremental development but also prevent influence over some implementation details.
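The described level-synchronous execution can be mimicked in Python as a sketch only (DPsim uses OpenMP in C++): the tasks of each level are distributed statically among the threads, and a barrier at the end of each level plays the role of the implicit OpenMP barrier:

```python
import threading

def run_levels(levels, num_threads):
    """Run the tasks of each level in parallel; a barrier at the end of
    every level keeps the threads synchronized, analogous to the
    implicit barrier of an OpenMP parallel for-loop."""
    barrier = threading.Barrier(num_threads)
    lock = threading.Lock()
    results = []

    def worker(tid):
        for level in levels:
            for task in level[tid::num_threads]:   # static distribution
                r = task()
                with lock:
                    results.append(r)
            barrier.wait()                         # end-of-level barrier

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Levels of the example graph in Fig. 6.3, as trivial tasks that
# just return their names:
levels = [[lambda n=n: n for n in ("T1", "T2", "T3")],
          [lambda n=n: n for n in ("T4", "T5")],
          [lambda n=n: n for n in ("T6",)]]
out = run_levels(levels, 2)
print(sorted(out))  # ['T1', 'T2', 'T3', 'T4', 'T5', 'T6']
```

Because of the barriers, every task of a level finishes before any task of the next level starts, so T6 is always executed last.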

The ThreadScheduler was implemented based on the std::thread class from the C++ standard library [Wil19], implementing the step function to have more control over the synchronization between the threads. In every time step, each thread executes its list of assigned tasks successively, synchronized by atomic counters supporting two operations: an atomic increment of the counter's value and waiting until it reaches a given value, which is implemented in the form of busy waiting [Tan09]. The counter of each task is incremented after its processing. Before each step, the atomic wait method is called on the counters of all tasks with an edge (in the task graph) to the current task. The actual distribution of the tasks among the threads is accomplished by the two subclasses of the ThreadScheduler.
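The atomic-counter synchronization can be sketched in Python as a simplified stand-in for the C++ implementation; the busy-waiting loop mirrors the described wait operation. Here, a task T4 waits on the counter of its predecessor T1 (dependency taken from the example graph):

```python
import threading

class TaskCounter:
    """Per-task counter sketch: an atomic increment plus busy waiting
    until the counter reaches a given value."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()
    def increment(self):
        with self._lock:
            self.value += 1
    def wait_until(self, target):
        while self.value < target:   # busy waiting
            pass

log = []
counters = {"T1": TaskCounter()}

def run_t1():
    log.append("T1")
    counters["T1"].increment()       # signal completion of T1

def run_t4():
    counters["T1"].wait_until(1)     # T4 depends on T1
    log.append("T4")

t4 = threading.Thread(target=run_t4)
t4.start()
run_t1()
t4.join()
print(log)  # ['T1', 'T4'] -- the dependency order is always respected
```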

The ThreadLevelScheduler, like the OpenMPLevelScheduler, realizes level scheduling but with a different behavior. In case of the OpenMP-based scheduler, there are barriers for all threads at each level's end, causing also threads without tasks within a certain level to wait before the execution of (independent) tasks of the next level. Such unnecessary barriers are not conducted by the ThreadLevelScheduler. Moreover, it can make use of execution times measured during a previous execution by applying the greedy heuristic for multi-way partitioning to keep the subsequent execution times per level mostly uniform between the threads (see Sect. 6.1.1).

The ThreadListScheduler, which also derives from ThreadScheduler, implements the list scheduling algorithm based on HLFET in case execution times are provided, and on Highest Level First with No Estimated Times (HLFNET) if not (i. e. the execution time per task is assumed to be uniform).

Component-Based Modified Nodal Analysis

The system to be simulated is passed as a list of component objects to an MNA solver, implemented with the MNASolver class. All components that can be simulated using the MNA approach have the following in common:

• their internal state is initialized depending on the system frequency and time step;

• their presence may change the system matrix;

• they specify tasks, such as PreStep and PostStep, which have to be processed at each time step.

At simulation start, each component is initialized, its contribution is accumulated into the system matrix, and the decomposition is calculated. More details on the MNA implementation itself can be found in [Mir20].

A Simulation class constructs the task graph from the given list of tasks as well as from tasks for logging and interfacing if needed. During the simulation, the scheduler's step method (for proceeding in time) is called, which executes all tasks in a correct order (i. e. avoiding race conditions). Because of the distinction in the implementation between the scheduler and the solver, the implemented framework for parallel processing is not MNA-solver specific but can be adapted to any solver structure which, however, must be divisible into tasks.

6.1.3 System Decoupling

Solving a linear system of size n requires O(n³) operations, which leads to long execution times in case of large matrices. Even if the system matrix stays fixed between the simulation steps, so that an LU decomposition of it can be reused for solving the system, the forward/backward substitutions would still require O(n²) operations at each time step. Because of the requirements on the time step in real-time simulation (dependent on the simulation model, method, and use case), this would limit the size of the system model. A possible approach is to split the system into smaller matrices that can be solved independently and to compose the solution of the whole system from all partial solutions. In case the LU decomposition can be reused, the potential speedup of solving k systems of size n/k over solving one system of size n is n² / (k · (n/k)²) = k. As the smaller matrices are independent, they can be solved concurrently, which results in an even higher performance. Therefore, two methods for increasing the performance gain of the presented parallelization by splitting the system matrix into smaller parts were implemented.
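The speedup estimate can be checked with a one-line cost model; the operation counts below are the asymptotic ones from the text (forward/backward substitution ~ n² operations), not measured values, and the chosen n and k are arbitrary:

```python
# Cost model: with a reusable LU decomposition, one time step costs
# ~n^2 operations; splitting the system into k independent subsystems
# of size n/k costs k * (n/k)^2 = n^2 / k operations.
def substitution_ops(n):
    return n ** 2

n, k = 1200, 4
full = substitution_ops(n)                 # one system of size n
decoupled = k * substitution_ops(n // k)   # k systems of size n/k
print(full / decoupled)  # 4.0 -> speedup k, even before solving concurrently
```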

Decoupled Transmission Line Model

The application of the TLM (in literature also called the decoupled transmission line model), which belongs to the explicit parallelization approaches using computational components (see Sect. 6.1), can split a grid into two subgrids which are not topologically connected. This allows the creation of two separate system matrices that can be solved concurrently during each time step. DPsim automatically recognizes such cases, solves the systems separately, and simulates the line behavior of the equivalent components connecting the two subnetworks.

Diakoptics

Diakoptics is another method which allows the user to divide a grid into subgrids. The resulting subgrids can also be computed concurrently, and their results can be combined into the whole solution. More on the implementation of TLM and diakoptics in DPsim can be found in [Mir20].

6.2 Analysis of Task Parallelization in DPsim

In the following, the performance benefits of the previously introduced parallelization methods are analyzed on models without and with system decoupling. For that purpose, the average wall clock time needed for a single simulation step is used as the metric in all analyses. It was chosen because of its importance for soft real-time simulation, where the elapsed times of all time steps must stay below a specified average. At first, the execution times for the different schedulers are analyzed for several system model sizes. Afterwards, the effect of the parallelization on the system decoupling methods is investigated. Finally, the parallel performance is compared when DPsim is built with various popular compiler environments.

Measurement Environment

All measurements in this section were accomplished on a server with 2 sockets, each with an Intel Xeon Silver 4114 2.2 GHz (3.0 GHz Turbo) 10-core CPU with Hyper-Threading (HT); 160 GB DDR4 main memory; running an x86_64 Ubuntu 16.04 Server Linux with gcc v8.1.0 as the default compiler environment for DPsim.

6.2.1 Use Cases

The Western System Coordinating Council (WSCC) 9-bus transmission benchmark network was used as the reference network. It consists of three generators, each connected to a power transformer, and three loads connected to the generators by six lines in a ring topology. The whole network (i. e. system model) as depicted in Fig. 6.10 was provided in form of a CIM-based file. Its components were modeled in the following way:

• synchronous generators represented with the aid of an inductance and an ideal voltage source whose value was updated in each step based on a model for transient stability studies;

• power transformers modeled as ideal transformers with an additional resistance and inductance on the primary side to model in particular the electrical impact of the windings and the related power losses;

• transmission lines represented by PI models with additional small so-called snubber conductances to ground at both ends;

• loads modeled as having a constant impedance and inductive behavior, thus represented by a resistance and an inductance in parallel.

More on the component models can be found in [Mir20].

For an analysis of various system model sizes, multiple replications of the WSCC 9-bus system were combined in an automated way. For this purpose, further transmission lines were added between nodes connected to loads (labeled in Fig. 6.10 as BUS5, BUS6, and BUS8) to form further rings between components of the system copies. The resulting topologies for two and three system copies are illustrated in Fig. 6.11, where different node colors signify different copies of the original 9-bus system and newly added transmission lines are represented by solid lines. Only the relevant buses are shown, and the omitted parts are sketched as dashed lines.


Figure 6.10: WSCC 9-bus transmission benchmark network (buses BUS1–BUS9 with generators GEN1–GEN3, transformers TR14, TR27, TR39, PI-model transmission lines, and loads LOAD5, LOAD6, LOAD8)

6.2.2 Schedulers

In the first part of the scheduler analysis, the different schedulers were compared on various benchmark networks of different sizes. In Fig. 6.12, the average wall clock times per step for simulating the 9-bus system are plotted for each implemented scheduler, depending on the number of threads from one to ten (due to the 10-core server). The simulation had a time step of 100 µs and a duration of 100 ms, and the average execution time for a single time step was calculated from the execution times of 50 simulations. The scheduler names in the plot's legend are as defined in Tab. 6.1, whereby the adjunct meas indicates that the measured average task execution times were passed to the scheduler.

Compared to the sequential scheduler (dashed line), the parallel processing as scheduled by all methods is slower than the sequential processing. All schedulers, except the OpenMP-based one with an additional overhead, lead to similar execution times, which increase with the number of threads.

Figure 6.11: Schematic representation of the connections between system copies ((1) two system copies, (2) three system copies; new transmission lines between BUS5, BUS6, BUS8 and their copies)

Figure 6.12: Performance comparison of schedulers for the WSCC 9-bus system (average wall clock time per step [s] over the number of threads)

Therefore, the same benchmark was performed on a network with 20 interlinked copies of the 9-bus system. For this larger system, the parallel processing with all schedulers performed better than the sequential one, as depicted in Fig. 6.13. Here, the OpenMP-based level scheduler implementation led to the highest speedup of ∼1.27 in relation to sequential processing, but again there are only slight differences between the schedulers.

Figure 6.13: Performance comparison of schedulers for 20 copies of the WSCC 9-bus system (average wall clock time per step [s] over the number of threads)

At the end of the scheduler analysis, the number of threads was fixed at eight (i. e. a few less than the number of cores, in order to reduce context switching caused by other system threads on the same CPU), whereas the system size was varied up to forty 9-bus copies. The resulting average execution times for a single time step, plotted in Fig. 6.14, were calculated from 10 measurements because of the rising overall simulation times for larger systems. From fifteen 9-bus copies on, the parallel processing shows a performance improvement over sequential processing, and again there is no relevant difference between the parallel schedulers. Furthermore, the required simulation time grows quadratically with the system size.

Figure 6.14: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system (average wall clock time per step [s] over the number of system copies)

As usual, it can be seen that a system must have a certain size to benefit from parallelization, as the synchronization between multiple threads, realized by OpenMP barriers or the counter mechanism of the other schedulers, requires too much time compared to the actual simulation computations. In the dependency graph (see Fig. 6.15), where the area of a circle (representing a task) is proportional to its execution time, it can be seen that most time is spent in one single task which solves the system equation. As the system to be solved grows quadratically with the number of nodes, the parallelization speedup is limited by this SolveTask. This is also the reason for the small differences between the various schedulers, as there is only a small number of meaningfully different schedules. The reduction of one big SolveTask to multiple smaller subtasks was therefore the main motivation for the system decoupling investigated in the following.


6.2.3 System Decoupling

In this analysis, the impact of the parallelization methods on decoupled systems was examined. For this, the 9-bus system copies were connected as described before. Then, the TLM was applied to the added transmission lines. In a second case, the added transmission lines were used as so-called splitting branches for the diakoptics method [Mir20]. Again, the simulation had a time step of 100 µs and a duration of 100 ms, and the average execution time for a single time step was calculated from the execution times of 10 simulations.

At first, the parallel performance for an increasing number of systems using the TLM is depicted in Fig. 6.16, exemplarily for the OpenMPLevelScheduler and the ThreadLevelScheduler (without any information about the execution times of the tasks in a previous step), depending on the number of deployed threads. The parallel processing leads to much lower execution times in case of both schedulers and scales up to 8 threads on the utilized 10-core system, although the execution times needed by sequential processing are already much lower than without TLM. The maximum speedups achieved with 8 as well as 10 threads in relation to sequential execution are around two orders of magnitude.

The TLM performance of all schedulers was measured with 8 threads and is shown in Fig. 6.17. There, the average execution time per step is nearly the same for all schedulers. It does not grow linearly with the number of system copies (as the solving effort for the decoupled subsystems grows quadratically), and the plots show sharp increases at some points, which could stem from a system size which no longer fits in the cache of a certain level, leading to higher latencies while accessing the cache of the next level resp. the main memory.

Similar measurements were performed using diakoptics instead of TLM, as depicted in Fig. 6.18. Again, the parallel processing scheduled by the OpenMPLevelScheduler and the ThreadLevelScheduler shows a higher performance compared to sequential processing, with maximum speedups of around one order of magnitude. Unfortunately, the speedup from two to more threads is very limited.

Figure 6.15: Task graph for simulation of the WSCC 9-bus system

The diakoptics performance of all schedulers was measured with 8 threads and is shown in Fig. 6.19. Here as well, the parallel processing based on all schedulers leads to very similar execution times, but without the regular sharp increases observed for the parallel processing on decoupled systems using TLM.


Figure 6.16: Performance for a varying number of copies of the WSCC 9-bus system using the decoupled line model ((1) OpenMPLevelScheduler, (2) ThreadLevelScheduler; average wall clock time per step [s] over the number of system copies for sequential processing and 2, 4, 8, and 10 threads)


Figure 6.17: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using the decoupled line model with 8 threads (average wall clock time per step [s], logarithmic scale)


Figure 6.18: Performance for a varying number of copies of the WSCC 9-bus system using diakoptics ((1) OpenMPLevelScheduler, (2) ThreadLevelScheduler; average wall clock time per step [s] over the number of system copies for sequential processing and 2, 4, 8, and 10 threads)


Figure 6.19: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using diakoptics with 8 threads (average wall clock time per step [s], logarithmic scale)

6.2.4 Compiler Environments

The performance of the parallelization depends not only on the scheduling methods but also on the parallelization paradigms (OpenMP and C++11 threads) of the used compiler environments and their optimizations. Table 6.2 lists three compiler environments that are nowadays often used in the scientific area, together with the applied optimization levels for comparable results (i. e. programs). The simulation was repeated with all three compilers, with a time step of 100 µs and a duration of 100 ms. The average execution time for a single time step was calculated from the execution times of 50 simulations, as presented in Fig. 6.20. The gcc and icc compilers lead to a comparable performance for all schedulers which, in case of the small simulated system, is lower with parallelization than with sequential execution. The executable compiled with clang, however, has the lowest performance.

Figure 6.20: Performance comparison of compilers for 20 copies of the WSCC 9-bus system (average wall clock time per step [s] over the number of threads for gcc, clang, and icc with sequential, omp_level, and thread_level scheduling)

Table 6.2: Overview of the tested compilers

Compiler | Version | Flags | Reference
GNU Compiler Collection (gcc) | 8.1.0 | -O3 -march=native | [GCC]
Clang (clang) | 7.0.1 | -O3 -march=native | [Cla]
Intel C++ Compiler (icc) | 19.0.1.144 | -O3 -xHost | [Int]

Therefore, simulations with the same parameters as before were performed on a system model consisting of twenty interlinked 9-bus copies. The plots for all compilers are qualitatively similar to the ones in Fig. 6.13. For all compilers, the parallel processing on this larger system model achieves lower execution times than sequential processing. Here again, gcc yields the highest performance.

6.3 Conclusion and Outlook

This chapter provides an overview of approaches exploiting parallelism in mathematical models. In addition to the three approach types known in the simulation area [Lun+09], it introduces a new Automatic Fine-Grained Parallelization of Mathematical Models, which was implemented in DPsim, an existing power grid simulation software. After a presentation and formal definition of the different scheduling methods used for the implementation of the parallelization methods belonging to this new approach type, i. a. the task dependency analysis, two system decoupling methods (TLM and diakoptics) are sketched.

The subsequent analysis of the task parallelization methods implemented in DPsim for shared-memory systems has shown sublinear speedups for small system models, with execution times per simulation step increasing with the number of used CPUs. However, in the case of larger system models (with more than 100 nodes) in combination with TLM, superlinear speedups have been achieved. Unfortunately, TLM has some restrictions on the simulation time steps as well as on the types of transmission lines to which it can be applied, and it also potentially introduces inaccuracies at higher frequencies. The utilization of diakoptics, which does not introduce such disadvantages, leads to parallel speedups when applying the implemented parallelization methods.



On the 9-bus system model, the various scheduling algorithms showed almost no performance differences in many cases. Moreover, the existing differences are not caused by the different scheduling concepts but instead by the particular parallelization paradigm and the compiler environment. The reason is the general structure of the task dependency graphs, which leaves only little flexibility for the algorithms to generate strongly differing schedules. However, as the task dependency graphs depend on the system models, a comprehensive analysis with different models could result in a variety of execution times depending on the parallel scheduling method.

The implemented task dependency analysis is general enough to introduce a finer-grained inner task parallelization. For instance, GPUs on a heterogeneous architecture could be utilized as accelerators (e. g. for computations of complex component models) by porting task-related code for CPUs to GPU kernel code. Then, the schedulers could deploy tasks among CPUs and GPUs. A utilization of further parallel programming paradigms for distributed-memory architectures, such as the Message Passing Interface (MPI), could be considered, but only in the case of very large system models because of the higher latencies usually introduced by interactions (i. e. memory accesses and synchronizations) between different computer nodes.

Furthermore, optimization efforts for the processing within tasks were begun by the usage of explicit single instruction, multiple data (SIMD) vectorization, where vector instructions (such as Advanced Vector Extensions (AVX)) of modern CPUs are utilized. With these, a higher performance can be achieved if the same CPU instructions are performed on vectors instead of scalars. Modern compilers already perform automatic vectorization, but only for parts of the code where they can assure correctness by a static code analysis, and they can do so only in the case of certain control flow patterns. In more complex computations, explicit vectorization can be enabled by the programmer, e. g. using OpenMP SIMD compiler directives or SIMD compiler intrinsics.


7 HPC Python Internals and Benefits

In the past decade, Python has developed into one of the most popular programming languages. In many rankings of the most widely used programming languages, it occupies one of the first three positions [Cas19]. Especially in the engineering sector it enjoys a steadily growing popularity, as it is said to be easy to learn because of its clear syntax and a relatively small set of keywords. Furthermore, there are several open-source Python implementations with a comprehensive standard library available for free. Python allows, e. g., object-oriented as well as functional programming, and the very portable Python implementations allow its use on many platforms. As Python programs are usually interpreted, they do not need to be compiled, which is why Python is often used as a script language for small tasks.

Besides the duration and simplicity of software development, the time efficiency of a programming language is also crucial, especially in scientific computing. Python's simple syntax and automatic memory management lead to short development times in comparison to other programming languages. However, the execution times of interpreted programming languages are usually considerably higher than those of compiled languages.

Therefore, various language extensions, optimized interpreters, and compilers are developed to increase the time and memory efficiency of Python programs. Important representatives are the Python package NumPy [VCV11; Numb], the just-in-time (JIT) compilers numba [LPS15; Numa] and PyPy, as well as the language extension Cython [Beh+11; Cyt]. But if an engineer, for instance, developed a software project in Python with all


Chapter 7 HPC Python Internals and Benefits

needed features but insufficient performance, the question arises which of the mentioned solutions should be chosen for which kinds of algorithms. Around these efforts, a scientific community has grown in the past years, with conferences on Python for High-Performance and Scientific Computing [Ger]. However, no systematic comparative analysis of the methods improving Python's runtime performance has been accomplished so far.

In a blog post [Pug16], an execution time comparison between Python 3, C, and Julia, based just on an LU decomposition, was shown in combination with the (JIT) compilers Cython and numba as well as the modules NumPy and SciPy [Bre12; Scid], the latter containing numerical algorithms based on NumPy. The result of this benchmark was that the execution of conventional Python was one order of magnitude slower than C and Julia. With the applied improvements, however, the performance of the Python solution was similar to C and Julia. The SciPy-based implementation was even more performant than the ones in C and Julia when using precompiled functions of the SciPy and NumPy modules. Except for the conventional Python 3 solution, each implementation was optimized for vector CPU instructions.

In [Rog16], a benchmark of Python runtime environments was presented. The comparison was accomplished with the conventionally used reference C implementation CPython [Pytb] and the Java implementation Jython of the Python interpreter on the one hand, as well as PyPy and Cython as compilers on the other hand. The results are interesting, as Jython achieves a higher performance than CPython and Cython is as fast as CPython. The latter is the case because the Cython version was not adapted to make use of Cython's features, which will be introduced later in this chapter. Furthermore, only Python 2 was used for this benchmark, which is now deprecated, and Python 3 is not backward compatible with Python 2.

The available benchmarks focus on the execution time only. For a holistic view of the solutions, the memory consumption must be taken into account as well, which has not been considered in the previous analyses.

Therefore, this chapter presents a comparative analysis of the currently most popular performance improvement solutions for Python programs on different kinds of standard algorithms from the area of numerical methods and operations on common abstract data types (ADTs). These algorithm implementations based on the various Python solutions are compared with reference implementations in C++, which is considered a time- and memory-efficient object-oriented programming language. The comparative analysis presented here does not only compare the execution times of the programs but also their memory consumption. It shall provide Python programmers an overview of current solutions to improve the performance of their Python programs. Moreover, it shall provide them information on



how much effort is required for the application of a certain solution on the one hand and which gain can be expected on the other hand.

The chapter gives an introduction to the HPC-relevant properties of the Python language and its reference implementation CPython. A short introduction of the aforementioned Python runtime environments follows, with a focus on their different approaches. Hereafter, the benchmarking methodology based on representative algorithms is presented. The algorithms are used for the comparative analysis of the execution times and memory consumption of the various Python environments, presented in the following section. Finally, a conclusion on the comparative analysis is given with an outlook on future work. This chapter presents outcomes of the supervised thesis [Kas17].

7.1 HPC Python Fundamentals

Before the available Python environments are presented, a short overview of the HPC-relevant peculiarities of Python is given. Usually, high-level languages (HLLs) like Python are structured in a way that humans are able to read and maintain them easily, reuse certain parts of the program, and so forth. Hence, before such programs can be executed on a central processing unit (CPU), the source code must be transformed into a sequence of instructions of the actual CPU. This can be accomplished, for instance, with an interpreter or a compiler.

Interpreter

An interpreter processes the source code at the run time of a program. It reads the program's source code, analyzes or even preprocesses it, and executes the statements by translating them successively into instructions of the target CPU. In the case of Python programs interpreted by the CPython environment, the preprocessing consists of a transformation of the Python code into an intermediate format, the bytecode (stored in .pyc files), for a virtual machine [Ben]. The Python interpreter is an implementation of that virtual machine.

The successive execution of source code by an interpreter makes the programming language usable for scripting and usually allows a better error diagnosis [Aho03]. However, this has the disadvantage of a tendentially slower execution speed of interpreted programs in comparison with compiled programs.



Compiler

A compiler for HLLs usually translates the whole relevant source code to executable machine code (i. e. instructions of the target CPU). It can also generate intermediate codes from the source code, but the main difference of this approach, in contrast to an interpreter, is that after the compilation process the program is available in a form that can be executed on the CPU directly. The direct execution of the program instructions on the CPU leads to a high execution speed, but disadvantages are, e. g., that the machine code is CPU architecture dependent and must be compiled again for different computer platforms. The same applies in the case of source code changes. Such compilers are therefore also called ahead-of-time (AOT) compilers.

Just-in-Time Compiler

In contrast to AOT compilers, JIT compilers translate the source code mostly during run time. Only those parts of the program that need to be executed are compiled. JIT compilers can be used to increase the execution speed of interpreted programs when the execution of the compiled part of the source code is so much faster than its interpretation that the compilation process of the JIT compiler does not have a negative effect on the whole execution time. Parts compiled once do not need to be compiled again in the case of multiple executions, such as in loops.

Tracing Just-in-Time Compiler

A tracing just-in-time (TJIT) compiler makes use of the assumption that most of a typical program's run time is spent in loops [Bol+09]. Therefore, a TJIT compiler tries to identify often-executed paths within loops. The instruction sequences of such execution paths are called traces. After their identification, the traces are usually optimized and translated to machine code.

7.1.1 Classical Python

Python is continuously developed further. Currently, Python is available in version 3, which has many new features breaking backward compatibility with version 2 [Tad; Rosb]. The Python version numbers refer to the major version numbers of the reference Python interpreter implementation CPython [Pytb]. After around 20 years of development, Python 2 has been retired and the last CPython 2.7.18 was released in April 2020 [Pytd].



Nevertheless, Python 2 was considered in this dissertation since there isstill much Python 2 code that has not been ported to Python 3.

Data Types

In Python, variables are not declared and can be used without a data type definition. Everything is an object in Python and associated with a certain data type [Mon]. A Python variable can reference different objects of different types. The type of an object is determined dynamically at run time with the aid of its attributes and methods, which is called Duck Typing [FM09].
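Duck typing can be illustrated with a minimal sketch (the class and function names Duck, Person, and let_it_quack are hypothetical, chosen for illustration): any object offering the expected method is accepted, regardless of its class.

```python
class Duck:
    def quack(self):
        return "Quack!"

class Person:
    def quack(self):
        return "I'm quacking, too!"

def let_it_quack(thing):
    # No type check: any object providing a quack() method works.
    return thing.quack()

print(let_it_quack(Duck()))    # Quack!
print(let_it_quack(Person()))  # I'm quacking, too!
```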

There are so-called mutable and immutable objects. Objects of the types int, float, bool, and tuple, e. g., are immutable. An instance of an immutable data type has a constant value which cannot be changed. Multiple variables with the same value do not necessarily reference multiple instances; instead, the same instance can be referenced by all these variables. In contrast, a mutable instance can change its value during run time, which is why a new mutable object is created in memory each time one is requested. Mutable objects are, e. g., of the types list, dict, and set [Cara].
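This behavior can be observed with the is operator. Note that the instance sharing for immutables is an implementation detail of CPython (e. g. its small-integer cache), not a language guarantee:

```python
a = 7
b = 7
print(a is b)  # True in CPython: small ints share one instance

x = [1, 2]
y = [1, 2]
print(x is y)  # False: each list literal creates a new object

x.append(3)    # mutable objects can be changed in place
print(x)       # [1, 2, 3]
```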

Python 3 distinguishes between several types for numbers. Integers are of the type int and have an arbitrary precision. In Python 2, an int represents an integer value with 64 bit, and the type long corresponds to int of Python 3.

An instance of type list is a sequence of objects that can have an arbitrary type. The content of the list can be changed during run time, and the contained objects can be mutable or immutable. Unlike a list, the content of a tuple cannot be modified during run time.

An object of type dict is an associative data field which consists ofkey-value pairs. The keys, which can only be of an immutable type, referto objects of an arbitrary type.
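The properties of these three container types can be summarized in a short sketch:

```python
lst = [1, "two", 3.0]   # list: mutable, elements of mixed types
lst[0] = 99             # in-place modification is allowed

tpl = (1, "two", 3.0)   # tuple: immutable
try:
    tpl[0] = 99
except TypeError:
    print("tuples cannot be modified")

# dict: keys must be of an immutable type (e.g. tuple, str),
# values may be arbitrary objects.
d = {(1, 2): "a", "k": [3]}
print(d[(1, 2)])        # a
```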

Parameter Passing

The two most common evaluation strategies for parameters during a function call in HLLs are call by value and call by reference. In the first case, the value of the given expression (passed to the function) is assigned to the function's parameter. In the second case, the object that is referenced by the given expression is also referenced by the function's parameter within the function. The latter leads to the fact that the object's value changes in the calling part of the code when it is changed within the called function.



Python, however, uses a mechanism referred to as call by object (reference) [Kle]. If a variable x in main is passed to a function as parameter y, then x and y refer to the same object. This behavior corresponds to call by reference. If, subsequently, another object is assigned to y within the function, y refers to the new object and x in main stays untouched, which corresponds to call by value behavior.
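Both aspects of this mechanism can be demonstrated in a few lines (modify is a hypothetical example function):

```python
def modify(y):
    y.append(2)  # mutation: visible to the caller (call-by-reference-like)
    y = [99]     # rebinding: y now refers to a new object; the
                 # caller's variable is untouched (call-by-value-like)

x = [1]
modify(x)
print(x)  # [1, 2]
```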

Side Effects

The call by object reference principle can cause side effects. If a mutable object, e. g. of the type list, referenced in main by the variable l, is passed to a function with a parameter m, all modifications to m within the function also apply to the list in main. To avoid this, a copy of the list can be passed with the aid of the slicing syntax, by writing l[:] instead of l as argument in the function call.
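A short sketch of this side effect and the slicing-based workaround (clear_all is a hypothetical example function):

```python
def clear_all(m):
    m[:] = []        # in-place modification of the passed list

l = [1, 2, 3]
clear_all(l[:])      # pass a shallow copy: l stays untouched
print(l)             # [1, 2, 3]

clear_all(l)         # pass the list itself: l is emptied
print(l)             # []
```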

NumPy Module

NumPy, standing for Numerical Python, is a package for scientific computing with Python [Numb]. It contains an N-dimensional array object implementation (called ndarray), functions which can work on such arrays, tools for integrating C/C++ and Fortran code, and linear algebra, Fourier transform, as well as random number capabilities. The ndarray can be used for numerical computations instead of the normal Python list. All elements of the ndarray must be of the same data type, as the NumPy package is implemented with the use of C and can therefore benefit from static typing at compile time for higher run time efficiency. Possible data types are, e. g., bool_, int_, float_, and complex_ for the equivalent C types, as shown in [SWD15]. A one-dimensional ndarray with n 64 bit floating-point numbers containing zeros can be created as follows:

numpy.zeros(n, float)

An ndarray provides a Python object around a functionally extended C array. The following Python code shows a matrix-matrix multiplication:

for i in range(n):
    for j in range(n):
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]

A usage of ndarrays for the matrices would lead to an additional overhead in the innermost loop. The overhead would occur at the border between the pure Python code around the +=-statement (i. e. the three loops) and the NumPy code executed during the evaluation of the statement. In the



case of a 10 × 10 matrix, the border within the 3 loops would be passed 10^3 = 1000 times. That could make the program execution slower than with the normal Python list, which is why NumPy functions should be called only for sufficiently long processing on the provided data.

Instead of applying pure Python operations over the entries of an ndarray, it is recommended to apply operations over the whole ndarray in C code. For this purpose, NumPy provides precompiled functions implemented in C, such as the following one, which can be used for the multiplication of two matrices:

numpy.dot(A, B)

With this function, the border between Python and NumPy code is passed just once. Moreover, the precompiled functions of NumPy for linear algebra make use of BLAS [Uni17] and LAPACK [Uni19]. In comparison to Python lists, the ndarray generates less overhead with regard to execution time and memory usage [Coh], as it consists of contiguous memory blocks (at least in virtual memory), whereas a Python list consists of pointers to memory blocks which can be randomly distributed in the memory, which is unfavorable for CPU caches, as depicted in Fig. 7.1.
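The two approaches can be contrasted in a small sketch, assuming NumPy is installed. The triple loop crosses the Python/NumPy border in every innermost iteration, whereas numpy.dot crosses it once:

```python
import numpy as np

n = 10
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Pure Python triple loop: crosses the Python/NumPy border n^3 times.
C = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]

# Precompiled, BLAS-backed multiplication: crosses the border once.
D = np.dot(A, B)

print(np.allclose(C, D))  # True
```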

Array Module

Python's array module defines an object type which can compactly represent an array of basic values of one C data type such as, e. g., char, int, float, double, etc. Hence, the module is also implemented using C but is not as powerful as ndarray, because only one-dimensional arrays can be defined and there are no precompiled functions. More on this can be found in [Pyta].

[Figure 7.1 (diagram): memory layout of a Python list (PyObject_HEAD, length, items pointing to objects scattered over the memory) vs. a NumPy array (PyObject_HEAD, data, dimensions, strides over one contiguous block)]

Figure 7.1: NumPy ndarray vs. Python list [Van]

Memory Management in CPython

The reference implementation CPython comes with an automatic memory management based on so-called reference counting and a garbage collector (GC) [Dig]. Each Python object has a reference counter which is increased when the object is referenced once more and decreased if a reference is dissolved. If the reference count equals zero, the memory allocated for the object can be freed. However, the reference counting of CPython cannot detect reference cycles, which can occur, for instance, when one or more objects are referencing each other [Glo]. Therefore, CPython has a generational cyclic GC that runs periodically, determining reference cycles in order to free the memory occupied by objects which are only referencing each other. As the garbage collection interrupts the execution of the Python program, there are certain thresholds that can be adjusted. More on that can be read in [Debb].
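Both mechanisms are accessible from Python code via the standard library; a small sketch (the concrete threshold values are CPython defaults and may differ between versions):

```python
import gc
import sys

x = []
# getrefcount() reports at least 2: the variable x plus the
# temporary reference held by the function argument itself.
print(sys.getrefcount(x))

print(gc.get_threshold())        # generation thresholds, e.g. (700, 10, 10)
gc.set_threshold(1000, 15, 15)   # tune how often collections run
print(gc.get_threshold())        # (1000, 15, 15)
```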

Architecture of the CPython Environment

The software architecture of the CPython environment is depicted in Fig. 7.2. Before CPython can be used, it must be compiled from the CPython source code by a proper C compiler. The resulting python program can then be applied to the Python code to be executed, which is translated to bytecode and interpreted by the bytecode interpreter into instructions of the target CPU.

The bytecode interpreter is implemented in the form of a stack-based virtual machine (VM) [Ben]. For the Python function

def add(a, b):
    z = a + b
    return z

the following sequence of bytecode instructions is executed by the VM. First, the two operands a and b are pushed onto the stack by LOAD_FAST instructions. Then the BINARY_ADD instruction pops the two operands from the stack, performs the addition, and pushes the result onto the stack. A STORE_FAST instruction stores the result in z, which is then pushed again onto the stack by a further LOAD_FAST, to be returned by a RETURN_VALUE instruction.
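This instruction sequence can be inspected with the standard library's dis module (the exact opcode names vary slightly across CPython versions; e. g. newer versions replace BINARY_ADD by a generic BINARY_OP):

```python
import dis

def add(a, b):
    z = a + b
    return z

# Prints the bytecode instruction sequence of add(), including the
# LOAD_FAST instructions for the operands a and b.
dis.dis(add)
```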

The data types of the objects (here: a, b, and z) are determined at the execution of each bytecode instruction. Therefore, a BINARY_ADD can be performed on two integer values during one call as well as on two lists



during another call. This makes the interpretation process very flexible but also much more time-consuming than the direct execution of machine instructions from a compiled program. For example, the call of BINARY_ADD on two integer values consists of the following steps:

1. Determine data type of a

2. a is an int: get value of a

3. Determine data type of b

4. b is an int: get value of b

5. Call C function int binary_add(int, int) on values of a and b

6. Result of type int will be stored in result

Parallel Processing in CPython

CPython allows multithreading with the aid of the threading module [Pyte], which is based on POSIX threads on a Portable Operating System Interface (POSIX) conform operating system [IEE18], mapping the Python threads to native threads of the operating system. However, because of a global interpreter lock (GIL), the Python threads within one CPython interpreter are not really running concurrently. The reason for the GIL is, i. a., the automatic memory management by reference counters as explained above. Without the GIL in CPython, multiple threads that are using the same Python object could increment and decrement its reference counter concurrently. This could lead to a race condition on the reference counter, resulting in a wrong value. Besides the memory management, also global variables as well as mutable objects cause issues for a thread-safe program execution: if a thread modifies a global variable, another thread could use an old value – the same applies to mutable objects.

[Figure 7.2 (diagram): the CPython source (C) is compiled by a C compiler into the python program, which translates the Python code into bytecode; the bytecode interpreter executes it as machine code on the computer platform]

Figure 7.2: Software architecture of CPython (python command)

Therefore, a Python thread in CPython must hold the GIL to be ableto execute bytecode instructions. How the GIL is assigned to the threadsdepends on the CPython version. If multiple threads are created, one getsthe GIL and the others wait (blocking) on it.

In Python 2, a check is implemented which counts the ticks (bytecode instructions) since the creation of a new thread [Bea]. After 100 ticks, the active thread yields the GIL and all inactive threads get a signal for requesting the GIL. One of them gets it and continues with the execution of its bytecode while the other threads wait on the GIL.

In Python 3, each thread gets a time interval of 5 ms instead of ticks [Gir]. After each interval, the GIL is yielded and assigned to the next thread in a row. This avoids a competition between the threads, leading to a fair task scheduling.
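In CPython 3, this switch interval can be queried and adjusted via the sys module (the 5 ms default is an implementation detail of CPython):

```python
import sys

print(sys.getswitchinterval())  # 0.005 s by default in CPython 3
sys.setswitchinterval(0.01)     # let each thread run for up to 10 ms
print(sys.getswitchinterval())  # 0.01
```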

CPython also provides the multiprocessing module [Pytc], with which child processes can be created within a Python process. Each child has its own process memory that is independent from other processes. Hence, the memory management, global variables, and so forth are no issue for concurrently running processes belonging to the same process tree. The communication between such processes can be performed with the aid of a Manager object.
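A minimal sketch of both facilities: a Pool distributes CPU-bound work over processes (and is therefore not limited by the GIL), while a Manager provides shared objects for inter-process communication. The __main__ guard is required so that child processes can safely re-import the module:

```python
from multiprocessing import Manager, Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Each worker process runs its own interpreter with its own GIL.
    with Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]

    # A Manager proxies objects shared between processes.
    with Manager() as manager:
        shared = manager.list()
        shared.append(42)
        print(list(shared))  # [42]
```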

All previously presented Python peculiarities are important to under-stand what the Python environments other than CPython do differentlyto achieve a higher run time performance. These Python environmentswill be presented in the following.

7.1.2 PyPy

Contrary to CPython, PyPy's Python interpreter, implementing the full Python language, is written in Restricted Python (RPython) rather than in C. RPython is a restricted subset of Python and therefore suitable for static analysis. For instance, variables should contain values of at most one type at each control flow point [Min]. The PyPy interpreter was



written in RPython as the language was designed for the development of implementations of dynamic languages. RPython code can be compiled by the RPython translation toolchain [PyPe], as it is done for the PyPy interpreter. Due to a separation of the language specification of the dynamic language to be implemented from implementation aspects, the RPython toolchain can automatically generate a JIT compiler for the dynamic language. As a subset of Python, RPython can also be interpreted by an arbitrary Python interpreter [Min].

Architecture of the PyPy Environment

The software architecture of the PyPy environment is depicted in Fig. 7.3. Here, the program which runs the Python code to be executed is called pypy and must be compiled from the PyPy source code with the RPython toolchain. Similar to CPython, the Python program is first compiled to bytecode, which is also processed by a stack-based virtual machine [PyPa]. The important difference between the CPython and the PyPy interpreter is that the latter delegates all actual manipulations of the users' Python objects to a so-called object space, which is some kind of library of built-in types [PyPc]. Hence, PyPy's interpreter treats the Python objects as black boxes.

[Figure 7.3 (diagram): the PyPy source (RPython) is compiled by the RPython toolchain into the pypy program, which translates the Python code into bytecode; the bytecode interpreter together with the tracing JIT executes it as machine code on the computer platform]

Figure 7.3: Software architecture of PyPy (pypy command)



The BINARY_ADD in PyPy is implemented as follows [BW12]:

def BINARY_ADD(space, frame):
    object1 = frame.pop()                 # pop left operand off stack
    object2 = frame.pop()                 # pop right operand off stack
    result = space.add(object1, object2)  # perform operation
    frame.push(result)                    # record result on stack

The interpreter pops the two operand objects from the stack and passes them to the add method of the object space. In contrast to CPython, the PyPy interpreter does not determine the types of the objects, which is why it does not need to be adapted when new data types need to be supported.

The TJIT compiler, automatically generated by the RPython toolchain, uses meta-tracing [Bol+09]. Therefore, at run time of the actual Python program executed by the user, the PyPy interpreter, implemented as a stack-based VM in RPython, is traced and not the user program itself. Typically, a TJIT approach is based on a tracing VM which goes through the following phases [Cun10]:

Interpretation At first, the bytecode is interpreted as usual, with the addition of lightweight code for profiling the execution to detect which loops are run most frequently (i. e. hot loops). For this purpose, a counter is incremented at each backward jump. At a certain threshold, the VM enters the tracing phase.

Tracing The interpreter records all instructions of a whole hot loop iteration. This record is called a trace, which is passed to the JIT compiler. The trace is a list of instructions with their operands and results.

Compilation The JIT compiler turns a trace into efficient machine codethat is immediately executable and can be reused for the next itera-tion of the hot loop.

Running The compiled machine code is executed.

The phases above represent only the nodes of a graph with many possible paths, which is not linear. For ensuring correctness, a trace contains a guard at each point where the path in the control flow graph (CFG) could have followed another branch, e. g. at conditional statements. If a guard fails, the VM falls back into interpretation mode.

However, the meta-tracing approach of PyPy is different. As the traced program is the PyPy interpreter itself and not the interpreted program, a hot loop is the bytecode dispatch loop (and for many simple interpreters this is the only hot loop). Tracing one iteration of this loop means that the recorded trace corresponds to executing one opcode (i. e. a machine code



instruction), and it is very unlikely that the same opcode is executed many times in a row. Therefore, the corresponding guard will fail, meaning that the performance is not improved. It would be better if the execution of several opcodes could be traced, which would effectively unroll the bytecode dispatch loop. Ideally, the bytecode dispatch loop should be unrolled exactly so much that the unrolled version corresponds to one loop in the interpreted user program. Such user loops can be recognized if the program counter (PC) of the PyPy interpreter VM has the same value several times. Since the JIT cannot know which part of the PyPy interpreter represents the PC of the VM, the developer of the interpreter needs to mark the relevant variables with a so-called hint. More on meta-tracing can be found in [Bol+09].

PyPy provides different parameters controlling the behavior of JIT compilation with some magic numbers, which are [BL15]:

Loop threshold Determines the number of times a loop must be iteratedto be identified as hot loop (default: 1619);

Function threshold Determines how often a function must be called to betraced from the beginning (default: 1039);

Trace eagerness If guard failures happen more often than this threshold, the TJIT attempts to translate the sub-path from the point of the path failure to the loop's end, which is called a bridge (default: 200).

Memory Management in PyPy

Since PyPy's initial release in 2007, many garbage collection methods without reference counting were implemented, such as Mark and Sweep, Semispace Copying Collector, Generational GC, Hybrid GC, Mark & Compact GC, and Minimark GC [PyPb]. Currently, the default one is Incminimark, a generational moving collector [PyPd]. Since Incminimark is an incremental GC, the major collection is incremental (i. e. there are different stages of collection). The goal is not to have any pause longer than 1 ms, but in practice it depends on the size and characteristics of the heap, and there can be pauses between 10 and 100 ms.

7.1.3 Numba

Numba is an open-source JIT compiler translating a subset of Python and NumPy into machine code [Numa] using the LLVM compiler infrastructure project [LLV]. Most commonly, Numba is used through so-called decorators applied to code parts that shall be compiled instead of being interpreted by



CPython. Numba is therefore not an alternative to CPython but an extension to it, and it is available for Python 2 and Python 3.

Features of the Numba Environment

Since code compilation can be time intensive, only code parts that have a high share in the total execution time should be compiled. There are two modes in which the compiler treats the code [Anad]:

Nopython mode Numba generates code which is independent of the Python C API, which is the interface for C programs to the Python interpreter. A function can be compiled in nopython mode only if a data type can be assigned to all objects accessed by the function. In nopython mode, atomic (i. e. thread-safe) reference counters are used [Anaa] instead of the ones in CPython, which are not thread-safe.

Object mode Numba generates code which declares all objects as Pythonobjects on which operations are performed with the aid of the PythonC API. Therefore, the performance improvement is lower than innopython mode, unless so-called loop-jitting can be applied by Numba.In the latter case the loop can be automatically extracted andcompiled in nopython mode which is possible if the loop containsnopython-supported operations only [Anae].

Numba supports standard data types such as int16, float32, and complex128, with a precision of up to 64 bit per value [Anaf]. For the compilation of a function by Numba, a decorator must be written before the function:

@jit
def f(x, y):
    return x + y

There are two possibilities for using the jit-decorator:

Lazy compilation Numba determines the function parameters’ types as well as the result type at run time, compiling specialized code for different input data types.

Eager compilation The programmer determines all data types manually, i. e. in the case of the example above the type definition could be:

@jit(int32(int32, int32))

Moreover, the following arguments can be set to True in the decorator [Anac]:

nopython Numba tries to compile the function in nopython mode and emits an error message if this is not possible, instead of automatically falling back to object mode.


7.1 HPC Python Fundamentals

cache Numba saves the machine code of the compiled function instead of compiling it at each call.

nogil Since atomic reference counters are used in nopython mode, the GIL can be disabled, leading to real concurrent execution of parallel running threads.

Supporting the NumPy module, Numba provides the possibility to build NumPy universal functions (ufuncs). A ufunc is a function that operates on ndarrays (for a definition see Sect. 7.1.1) in an element-by-element fashion, supporting several standard features [Scic]. Hence, a ufunc is a vectorized wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs. The wrapper therefore enables applying the wrapped function to ndarrays of variable length. For the generation of a ufunc, the vectorize-decorator is used, which allows lazy and eager compilation. In the case of lazy compilation, where no data types were defined, a dynamic universal function (DUFunc) is built, which behaves like a ufunc with the difference that machine code is compiled for loops if the given data types cannot be cast to the types of the existing code. In the case of ufuncs, an error is thrown if the provided data cannot be cast [Scib]. The advantage of ufuncs over functions compiled with the jit-decorator is the support of features like broadcasting. Basic operations on ndarrays are performed element-wise, which works on arrays of the same size. The broadcasting conversion, however, defines a way of applying operations to arrays of different sizes, as specified in [Scia].
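The broadcasting behavior that ufuncs provide can be illustrated with NumPy’s own built-in ufuncs (here np.add), independently of Numba; a minimal sketch:

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(3, 2)  # shape (3, 2)
b = np.array([10.0, 20.0])                        # shape (2,)

# np.add is a ufunc: it operates element-wise and broadcasts
# the 1-D array b across each row of the 2-D array a.
c = np.add(a, b)
print(c)  # [[10. 21.] [12. 23.] [14. 25.]]
```

A Numba-compiled ufunc created with the vectorize-decorator behaves in the same way on its inputs.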

The vectorize-decorator supports scalar arguments only, while guvectorize allows multi-dimensional arrays as input and output. Unlike with vectorize, in GUfunc signatures also the dimensions and relations of the inputs must be provided in a symbolic way. A guvectorize-decorator for the well-known matrix-matrix multiplication could be used as follows:

@guvectorize(["void(int32, float64[:,:], float64[:,:], float64[:,:])"],
             "(),(m,m),(m,m)->(m,m)", nopython=True)
def multiplication(n, A, B, C):
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]

In both decorators, the nopython parameter can be specified to avoid a fallback to object mode.

Numba does not support the whole Python language in nopython mode. Moreover, not all Standard Library modules of Python are supported. More on both can be read in [Anah]. However, NumPy is well integrated [Anag].


General Procedure of Numba

Figure 7.4 shows the stages of the Numba compiler [Anab]:

1) Bytecode Analysis Numba analyzes the function bytecode to find the CFG.

2) Numba-IR Generation Based on the CFG and a data flow analysis, the bytecode is translated to Numba’s intermediate representation (IR), which is better suited to analyze and translate as it is not based on a stack representation (used by the Python interpreter) but on a register machine representation (used by LLVM).

3) Macro Expansion This step converts specific decorator attributes (e. g. CUDA intrinsics for grid, block, and thread dimension) into Numba-IR nodes representing function calls.

4) Untyped IR Rewriting Certain transformations on the untyped IR are performed, e. g., for the detection of certain kinds of statements.

5) Type Inference The data type determination is performed as explained for lazy and eager compilation, with fallback to object mode or an error in nopython mode.

6a) Typed IR Rewriting Optimizations like loop fusion are performed, where two loops with operations on the same array are merged into one loop.

6b) Automatic Parallelization This stage is performed only if the parallel parameter is passed to a jit-decorator for automatic exploitation of parallelism in the semantics of operations in Numba-IR.

7a) Nopython Mode LLVM-IR Generation If a Numba type was found for every intermediate variable, Numba can (potentially) generate specialized native code. This is called lowering, as Numba-IR is an abstract high-level intermediate language while LLVM-IR is a machine-dependent low-level representation. The LLVM toolchain is then able to optimize this into efficient code for the target CPU.

7b) Object Mode LLVM-IR Generation If type inference fails to find Numba types for all values inside a function, it is compiled in object mode, which generates a significantly longer LLVM-IR, as calls to the Python C API will be performed for basically all operations.

8) LLVM-IR Compilation The LLVM-IR is compiled to machine code by the LLVM JIT compiler.


Figure 7.4: Numba compilation stages

Numba does not implement a vectorization of the Numba-IR, but LLVM can apply automatic vectorization for single instruction multiple data (stream) (SIMD) capable CPUs [Anai]. LLVM’s behavior in this respect can be changed by Numba environment variables [Anaj].

7.1.4 Cython

Cython is the name of a compiled programming language and of an open-source project, written in Python and C, that implements a Cython compiler with static code optimization [Cyt]. The Cython language shall combine the simplicity of Python with the performance of C/C++, as sketched in Fig. 7.5 [Behc], with mostly usual Python and additional C-inspired syntax. Therefore, it mostly supports Python 2 as well as Python 3 and extends Python by C data types and structures. A detailed


documentation of the differences in the semantics between the compiled code and Python is provided in [Beha].

Cython Extending Python

In Cython it is possible to optimize Python code by static variable declarations such as

cdef int i

as it supports all basic C data types as well as pointers, arrays, typedef-ed alias types, structs/unions, and function pointers. Furthermore, also Python types such as list and dict can be declared statically. Variables without a static variable declaration are handled by the Cython compiler, with the aid of the Python C API, as Python objects. Moreover, it is possible to let the Cython compiler type variables statically in an automatic way, in certain functions or even the whole code, with the following compiler directive:

@cython.infer_types(True)

The compiler then tries to infer the right data types from the assignments in the related code. However, static typing of variables is not intended for the whole program. Only the variables within parts which are relevant for the performance should be statically typed. In any case, a conversion from Python objects to C or C++ types or objects is unavoidable, as will become apparent later. A Python integer, for instance, can be converted to char, int, long, etc., and a Python string can be converted to a C++ std::string [Smi15].

Python and C functions have similarities, as both take arguments and return values, but Python functions are more powerful and flexible, which makes them potentially slower. Cython therefore supports Python as well as C functions, which can call each other.

Figure 7.5: Comparison of Cython with other programming languages


A Python function is valid Cython code and can contain static type definitions as introduced before. These Python functions can be directly called by external Python code.

A C function can be included by a wrapper or directly implemented in Cython and is then declared with the keyword cdef instead of def. Contrary to Python code, C code is not processed by the Cython compiler, as will become apparent later, too. A cdef function is ultimately a C function that is implemented in Python-based syntax. The types of the function arguments and return values are defined statically. In cdef-functions, C pointers as well as structs and further C types can be used. Moreover, the call of a cdef-function is as performant as the call of a pure C function by a wrapper, and the overhead of the call is minimal. It is also possible to use Python objects as well as dynamically typed variables in cdef-functions and pass them to the function in form of arguments. A cdef function cannot be called from external Python code, but it is possible to write a Python function within the same module which is externally visible and calls the cdef function, as for example the following one:

def externally_visible_cfunction_wrapper(argument):
    return cfunction(argument)

A third possibility for the implementation of a function is provided by cpdef, which combines the accessibility of Python functions with the performance of C functions [Rosa].

There is a restriction on Cython functions: the data types of the arguments and the return value must be compatible with C and Python. While each Python object can be represented in C, not every C type can be represented in Python, as for example C pointers and arrays.

Cython provides a set of predefined Python and C/C++ related header files with the filename extension .pxd. Most important is the C standard library libc with the header files stdlib, stdio, math, etc. The same applies to the Standard Template Library (STL), with the option to make use of containers such as vector, list, map, etc.

Cython allows an efficient access to NumPy’s ndarrays (for a definition see Sect. 7.1.1), which is defined in a separate .pxd file as it is written in C. Besides ndarrays, also Python’s array module can be used efficiently, as Python accesses the elementary C array directly.

Since it is possible to access Python functions at runtime, they are not defined in header files. Both declarations and definitions are located in the implementation files with the filename extension .pyx.


Cython Compilation Pipeline

Cython produces a standard Python module, but in an unconventional manner that is depicted in Fig. 7.6. A script (here: setup.py) is used to start the setuptools build procedure, which translates the Cython implementation file(s) (here: hello.pyx) to optimized and platform-independent C code (here: hello.c) with the aid of the Cython compiler. For instance, the mult function

def mult(a, b):
    return a * b

is compiled to several thousand lines of C code which mainly consist of defines for portability reasons:

__pyx_t_1 = PyNumber_Multiply(__pyx_v_a, __pyx_v_b);
if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_r = __pyx_t_1;
__pyx_t_1 = 0;

It contains automatically generated variable names which make the code hard to read. However, this is no problem, since no manual changes to it are expected. The first line invokes the function PyNumber_Multiply from the Python C API, which performs a multiplication between two Python objects that are passed in form of pointers to their addresses. The if-statement checks whether the multiplication was successful, and GOTREF implements the reference counting.

Figure 7.6: Cython’s workflow for Python module building [Dav]


Using Cython’s advantage of static type declarations, the Cython code, in the case that int is used, is translated to the following C code:

__pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_a * __pyx_v_b);
if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_r = __pyx_t_1;
__pyx_t_1 = 0;

Here, the multiplication is performed directly in the first line of the above C code, and the result is converted to a Python integer. It is also possible to convert the Cython code to C++, but the default target language is C. The code output by the Cython compiler can also be adapted by some directives listed in [Carb].

Afterwards, the generated C code is compiled by a C compiler such as gcc [GCC] or Clang [Cla] to a shared library file (here: hello.so on POSIX systems and hello.pyd on Windows). These shared libraries are called C extension modules and can be used like pure Python modules after a usual import. Depending on the setuptools script that is used, an extension module for the particular Python environment is generated. Therefore, Cython is not self-sufficient, as it depends on a Python environment such as CPython or PyPy.
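A minimal setuptools build script of the kind referred to above could look as follows. This is only a sketch, assuming Cython is installed and the implementation file is named hello.pyx as in Fig. 7.6; the exact form depends on the Cython and setuptools versions used:

```python
# setup.py -- build the extension with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="hello",
    # cythonize() runs the Cython compiler on hello.pyx to produce hello.c,
    # which setuptools then compiles with the platform's C compiler.
    ext_modules=cythonize("hello.pyx"),
)
```

After the build, the resulting extension module can be imported like any pure Python module (import hello).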

Parallel Programming in Cython

With the nogil keyword after a function definition, the GIL is released:

cdef int function(double x) nogil:

After the return, the GIL is active again. Also external C/C++ functions can make use of concurrent processing by multithreading with nogil:

cdef extern from "header.h":
    double function(double var) nogil

This is possible only if no Python objects are used within the function. Based on this, also OpenMP can be used in an efficient manner [Behb].

7.2 Benchmarking Methodology

The benchmarking of the different environments for a high-performance execution of Python programs was performed with the aid of the following algorithms:

Quicksort A sorting algorithm based on the divide-and-conquer approach [Cor+01].


Dijkstra Finds the shortest paths between a start node and all other nodes in a graph [Cor+01].

AVL Tree Insertion Insertion of values into an Adelson-Velsky and Landis (AVL) tree, which is a self-balancing binary search tree [Cor+01].

Matrix-Matrix Multiplication Performs the multiplication of two square matrices of the same size.

Gauss-Jordan Elimination Solves a system of linear equations by row reductions [Sto+13].

Cholesky Decomposition Computes the decomposition of a symmetric and positive definite matrix into a lower left triangular matrix and its transpose. The product of these equals the original matrix [Sto+13].

PI Calculation Iterative algorithm for the approximation of π based on an integration using the rectangle rule [Qui03].

These algorithms were chosen to represent different algorithm categories: classical data processing on common ADTs such as lists, graphs, and trees on the one hand, and numerical mathematics on the other. All algorithms except the PI calculation were implemented sequentially. The PI algorithm was chosen as a known example of a perfectly parallel workload. As such, it can be used to benchmark how well the individual environments perform when the workload can be parallelized in an optimal way (i. e. with no synchronization or communication between the parallel processors).

All algorithms are implemented in an iterative (not recursive) manner and in the following languages:

• C++, as a time- and memory-efficient object-oriented compiled programming language

• Pure Python 2

• Pure Python 3

• Pure Python 3 with NumPy

• Pure Python 3 with NumPy and Numba decorators

• Python 3 with Cython


For some algorithms, certain implementations are not available; for instance, there was no reasonable use of an ndarray from NumPy in the case of the AVL tree implementation. Furthermore, no additional PyPy-specific implementations were needed. The source code can be obtained by contacting the author.
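As an illustration of the iterative style used in the benchmarks, a stack-based (non-recursive) Quicksort can be sketched in pure Python as follows; the actual benchmark code is not reproduced here and may differ in details such as the pivot choice:

```python
def quicksort(arr):
    """Sort arr in place using an explicit stack of (low, high) ranges instead of recursion."""
    stack = [(0, len(arr) - 1)]
    while stack:
        low, high = stack.pop()
        if low >= high:
            continue
        pivot = arr[high]                 # simple last-element pivot (Lomuto partition)
        i = low - 1
        for j in range(low, high):
            if arr[j] <= pivot:
                i += 1
                arr[i], arr[j] = arr[j], arr[i]
        arr[i + 1], arr[high] = arr[high], arr[i + 1]
        p = i + 1
        stack.append((low, p - 1))        # push sub-ranges instead of recursing
        stack.append((p + 1, high))
    return arr
```

Replacing recursion with an explicit stack avoids Python's recursion limit and keeps the implementations structurally comparable across languages.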

Matrix-Matrix Multiplication Implementation as Example

For the matrix-matrix multiplication in Python, the code presented in Sect. 7.1.1 is used. In C++, the matrix is implemented based on a struct with an elementary two-dimensional array, for which memory from the heap is allocated dynamically:

struct Matrix { int n; double** doublePtr; };

Matrix newMatrix(int n) {
    Matrix mat;
    mat.n = n;
    mat.doublePtr = new double*[n];
    for (int i = 0; i < n; i++)
        mat.doublePtr[i] = new double[n];
    return mat;
}

In pure Python, the matrices are represented by lists and initialized with the aid of list comprehensions in Python 2 and Python 3:

def newMatrix(n):
    return [[0 for x in range(n)] for y in range(n)]

For the ndarray, the same data types as in the C++ version are used:

def newMatrix(n):
    return np.zeros(shape=(n, n), dtype='float_')

Relevant for the execution time are the three nested loops, which is why the corresponding function in the Numba version is based on the ndarray and has the following decorator:

@guvectorize(["void(int32, float64[:,:], float64[:,:], float64[:,:])"],
             "(),(m,m),(m,m)->(m,m)", nopython=True)
def multiplication(n, A, B, C):

In Cython, the function that performs the actual multiplication was equipped with static type definitions, and the ndarray was used for efficiency reasons as well:

def multiplication(int n,
                   np.ndarray[np.float64_t, ndim=2] A,
                   np.ndarray[np.float64_t, ndim=2] B):
    cdef np.ndarray[np.float64_t, ndim=2] C = newMatrix(n)
    cdef:
        int i
        int j
        int k
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

Here again, the data types are of the same precision as in the C++ version. The further algorithms are implemented in a similar way to the matrix-matrix multiplication presented here.

Realization of the Measurements

The execution time (wall clock time) of each runtime environment or executable (in the case of C++) was measured with the shell command time [IEE18]. For memory usage measurements, libmemusage.so was used, and the captured value was the heap peak as defined in [Man].

To make sure that a given algorithm is executed with the same values, pseudo-random generators were implemented to achieve the same input data in each run and runtime environment. For comparison reasons, automatic vectorization for CPUs with SIMD instructions was disabled in all Python environments as well as during the Cython and C++ compilation.
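A deterministic pseudo-random generator of the kind mentioned above can be sketched as a linear congruential generator (LCG). The actual generator and constants used in the benchmarks are not specified in the text, so the parameters below are only an example (the widely used Numerical Recipes constants):

```python
class LCG:
    """Linear congruential generator: x_{n+1} = (a * x_n + c) mod m.

    The same seed always yields the same sequence, so every runtime
    environment sorts/processes identical input data.
    """
    def __init__(self, seed=42):
        self.state = seed
        self.a = 1664525       # example multiplier (Numerical Recipes)
        self.c = 1013904223    # example increment
        self.m = 2 ** 32

    def next_int(self):
        self.state = (self.a * self.state + self.c) % self.m
        return self.state
```

Because such a generator is trivial to reimplement identically in C++, Python, and Cython, it avoids any dependence on language-specific random modules.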

7.3 Comparative Analysis

A comparative analysis of the different Python runtimes against C++ was performed on the following computer system.

Measurements Environment

All measurements in this section were performed on a server with 2 sockets, each with an Intel Xeon X7550 2.0 GHz (2.4 GHz Turbo) 8-core CPU with Hyper-Threading (HT); 256 GB DDR3 main memory; running an x86_64 Scientific Linux 6. The following software packages were used:

• CPython2 v2.7

• CPython3 v3.6.0 in combination with
– NumPy 1.12.0
– Cython 0.25.2
– Numba 0.31.0

• PyPy2 v5.6.0 in combination with


– NumPy 1.12.0
– Cython 0.25.0

• PyPy3 v5.5.0

• Clang v3.9.1 with LLVM v3.9.1

PyPy’s NumPyPy module was not considered, as it was too incomplete at the time of the analysis. The legends of the plots show the different measured cases with the following meanings:

CPython2 Implementation in pure Python 2 and executed by CPython2

CPython3 Implementation in pure Python 3 and executed by CPython3

C++ Implemented in C++ and compiled by Clang

Cython (Pure Python) Implementation in pure Python 3, translated to C and compiled by Clang. The thus generated extension module is imported by a Python file

Cython (Optimized) Implementation in Cython with C data types, translated to C and compiled by Clang. The thus generated extension module is imported by a Python file

CythonPyPy (Optimized) Analogous to “Cython (Optimized)”. The extension module is utilized with the aid of cpyext, PyPy’s subsystem which provides a compatibility layer to compile and run CPython C extensions inside PyPy [Cun]

Numba Implementation in Python 3

Numba + NumPy Implementation in Python 3 with the aid of ndarray

PyPy2 (Pure Python) Implementation in pure Python 2, executed by PyPy 2

PyPy3 (Pure Python) Implementation in pure Python 3, executed by PyPy 3

PyPy2 + NumPy Implementation in Python 2 with ndarray, utilized with the aid of the cpyext subsystem, and executed by PyPy 2

CPython3 + Array Implementation in Python 3 with the array module and executed by CPython3


Figure 7.7: Execution times for Quicksort

Quicksort Analysis as Example

In Fig. 7.7 the execution times of the various Quicksort implementations are plotted with a logarithmic y-scale. Due to much higher execution times than in the other cases, the measurement of PyPy2 with NumPy was aborted when 2 million elements were to be sorted. The other measurements can be divided into a slower and a faster group. In the slower group are, among others, both reference Python environments CPython2 and CPython3. Over the whole measurement, the CPython2 solution was around 30 % faster than CPython3, but the Cython-compiled version of the pure Python 3 code is even faster than both CPython versions. The usage of the arrays from the NumPy and array modules has no positive effect; CPython3 with NumPy is even slower than pure Python on CPython3. In the faster group, PyPy2 and PyPy3 have similar execution times: for a size of 10 million elements they are around 35 times faster than pure Python on CPython3. Even faster are both Numba cases (pure Python and with NumPy) and the optimized Cython module built for PyPy2 as runtime environment. Only the optimized Cython module for CPython3 has a higher performance than all other versions and is almost as fast as


Figure 7.8: Memory consumption (maximum heap peak) for Quicksort

the C++ implementation. For more execution time measurements see Appendix B.1.

In Fig. 7.8 the memory consumption (i. e. maximum heap peak) of all previously mentioned Quicksort implementations is plotted, now with a linear y-scale. There, PyPy2 with NumPy and CPython2 show a markedly higher memory consumption than the other cases, which is why they were plotted separately in Fig. 7.9. As the execution times of the three cases PyPy2 + NumPy, CPython3 + Array, and CPython3 + NumPy were very high, the memory measurements were aborted at 2 million elements. Here, the resulting plot lines can be divided into three groups. The group with the pure Python implementation on Numba and the array-module-based implementation on CPython3 shows the highest memory consumption. The second group consists of PyPy2, PyPy3, CPython3, and the Cython module for CPython3 (all four pure Python). In the most memory-efficient group are Numba + NumPy and CPython3 + NumPy. Moreover, the optimized Cython implementation for CPython3 has the lowest memory consumption, which corresponds to that of the C++ implementation. For more memory consumption measurements see Appendix B.2.


Figure 7.9: Memory consumption (maximum heap peak) for Quicksort of selected runtime environments

Parallel Processing

For the PI calculation, the best-of-three execution times, which were measured for each case, are plotted in Fig. 7.10. The integration of the PI calculation was performed with 1 trillion rectangles. Depending on the number of threads to be forked for the measurement, a range of rectangles was assigned to each thread to achieve a parallel processing of the work by multithreading vs. multiprocessing. PyPy was not considered, as it has a GIL like CPython. Therefore, as expected, there are no speedups gained from multithreading in the case of CPython3 and CPython2. Instead, the JIT-compiled Numba case (pure Python 3 plus jit-decorator with nopython = True, nogil = True in its signature) and the AOT-compiled Cython case (optimized with static types) achieve speedups through multithreading. The C- and Cython-based solution with four threads has an execution time of 2.5 s, while with one thread 10 s were needed, which corresponds to a parallel speedup of 4 and an efficiency of 100 %. The multiprocessing case (pure Python 3) also leads to speedups, but these are much lower.



Figure 7.10: Execution times for PI calculations with multiple threads

7.4 Conclusion and Outlook

In this chapter, a comparative analysis of various high-performance Python environments was presented. For this, benchmark algorithms from different categories were chosen and also implemented in C++ as reference language. Furthermore, great care was taken to ensure an equivalent implementation in each case, to achieve similar conditions for the different cases. Hence, only one C/C++ compiler (i. e. Clang) with the same options was used for the compilation of the Cython and C++ solutions.

In the case of the sequential Python-based solutions, Cython optimized by static data type declarations and built for CPython3 has shown the shortest execution times in all test cases and could always compete with the C++ implementations. Even the unoptimized Cython builds for CPython3 have led to performance gains of 40 to 50 %. The execution times for PyPy2 builds have shown great variations. Numba solutions have also led to high performance gains, despite the longest start-up times of the runtime environment. Both PyPy versions have had similar execution times, which in the case of pure Python were always at least one order of magnitude faster than the CPython environments.


The memory consumption measurements regarding the heap peak during runtime have shown the highest consumption for CPython2. Both PyPy versions have also shown a higher memory consumption than CPython3. For particularly large data structures, it has been shown that a very efficient memory consumption can be achieved through the usage of static data types in Cython built for CPython3. In the case of Cython built for PyPy2, the results have shown great variations depending on the algorithms. Numba has also led to a higher memory efficiency than CPython.

The multithreading benchmark has shown no parallel speedup for either CPython version because of the GIL. In the case of Numba and Cython, where the GIL could be disabled, multithreading led to high parallel speedups. Especially in the case of Cython, the perfectly parallelizable PI calculation led to a parallel efficiency of around 100 %. The usage of multiprocessing in the case of CPython led to low speedups only.

The presented comparative analysis gives Python programmers an overview of the analyzed solutions to improve the performance of their Python programs. Moreover, it provides information on how much effort is required for the application of a certain solution on the one hand and which gain can be expected on the other.

In future work, a closer look should be taken at Numba’s precompiled functions that are vectorized. These were not considered in this work to achieve comparable implementations between the various solutions. Furthermore, besides multithreading and multiprocessing as parallel paradigms suitable for shared-memory computer architectures, paradigms that are suitable for distributed-memory machines, such as the Message Passing Interface (MPI) for Python, should also be included in future considerations. An implementation of MPI for Python is mpi4py [Dal]. Since all Python runtime environments and the language itself are under development, a framework for automated, ongoing comparative analyses and result presentation would be useful.


8 HPC Network Communication for Hardware-in-the-Loop and Real-Time Co-Simulation

A digital real-time simulator (DRTS) for power grids reproduces voltage and current waveforms with a desired accuracy that represent the behavior of the real power grid that is simulated. To be RT-capable, the DRTS needs to solve the power grid model equations for each time step within the time passed in the real world (i. e. according to the wall clock time) [Far+15; BDD91]. To achieve this, outputs are generated in the simulations at discrete time intervals, while the system states are computed at certain discrete time intervals with a fixed time step. In [Far+15] two classes of digital real-time (RT) simulations are defined. There are full digital RT simulations that are modeled in the DRTS completely, and (power) Hardware-in-the-Loop (HiL) RT simulations, which can exchange simulation data through I/O interfaces with real hardware. Besides the improvement of DRTSs, e. g., with the aid of more performant numerical algorithms, to be able to simulate increasingly complex models of Smart Grids in real time, it is also possible to distribute a simulation among multiple DRTSs. An approach for a coupling of DRTSs in laboratories, even from different countries, is presented in [Ste+17]. The coupling of this so-called geographically distributed real-time simulation (GD-RTS) was performed with the VILLASframework [Vog+17], abbreviated in the following as VILLAS, which was chosen for the integration of InfiniBand (IB).


In the following section, the fundamentals of VILLAS are covered to motivate its choice for the integration of IB. The fundamentals of IB are introduced in the subsequent section. Then, the concept of the IB support in VILLAS is presented and analyzed comparatively with other interconnection methods available in VILLAS. Finally, the chapter is concluded and an outlook on future work is given. This chapter presents outcomes of the supervised thesis [Pot18].

8.1 VILLAS Fundamentals

The VILLASframework is a collection of open-source software packages for local and geographically distributed RT (co-)simulations. VILLASnode is part of the collection and can be used as a gateway for simulation data. It supports several interfaces (called node-types) of the three classes,

internal communication, such as file for logging and replay, shmem for shared-memory communication, signal for test signal generation, etc.;

server-server communication, such as socket for UDP/IP communication, mqtt for Message Queue Telemetry Transport (MQTT) communication, websocket for WebSocket based communication, etc.;

simulator-server communication, such as opal for connections to OPAL-RT devices, gtwif for connections to RTDS devices, fpga for connections to VILLASfpga PCI-e cards, etc.

The instance of a node-type is called a node.

In Fig. 8.1, besides VILLASnode, the VILLASweb component of the whole framework is also depicted. As sketched in the figure, a lab(-oratory) can contain multiple nodes which are used as gateways between software (SW) and hardware (HW) solutions. The interconnected nodes can run on the same or on different host systems in one or multiple labs. Some of the nodes can be hard or just soft RT capable, depending on their node-type.

While there are hard RT capable node-types for internal and simulator-server communication, there was no such node-type for server-server communication, because these all depend on the Internet Protocol (IP), which is mostly used with Ethernet based interconnects for local area networks (LANs). One problem of Ethernet based solutions is the relatively high latency of data transfers, caused in part by the network protocol stack of the operating system [Lar+09]. Another problem of Ethernet based solutions is that quality of service (QoS) support is very limited.


That is why latencies of the data transfers have a relatively high variability, which is a disadvantage for hard RT applications. To achieve hard RT between different computer hosts, IB was used as an alternative technology designed for low-latency and high-throughput server-server and device-server communication (e. g. for interconnecting storage solutions with computer clusters). The following section introduces how these properties of IB are achieved.

8.2 InfiniBand Fundamentals

Before an introduction to IB with its benefits, the main difference to the classical utilization of network interface controllers (NICs) must be explained. Usually, NICs are utilized through sockets (also called Berkeley or BSD sockets), which is an Application Programming Interface (API) for inter-process communication (IPC) coming from the Unix-like Berkeley Software Distribution (BSD) operating system (OS) [Tan09] and, with little modification, standardized in the Portable Operating System Interface (POSIX) specification. However, socket API implementations are not only part of POSIX conform OSs but, e. g., also of Windows and Linux. The focus in this chapter is on the latter, as the approach presented here was implemented based on Linux. A POSIX socket is a user space abstraction

Figure 8.1: Overview of the VILLASframework


of network communication, which mainly uses the operating system kernel based TCP/IP or UDP/IP stack (Transmission Control Protocol (TCP), User Datagram Protocol (UDP), IP) [Ker]. The network communication through sockets is accomplished via function calls on the socket. As in many other OSs, these user space calls are mapped on system calls (i. e. OS kernel function calls) which generate so-called traps (a type of synchronous interrupt) or, in case of modern computer architectures, sysenter instructions, which let the central processing unit (CPU) switch from user to kernel mode [Ker10; Tan09]. The switches between user and kernel mode (and back) can be time expensive in relation to the data transfer through the NIC itself. This and other drawbacks were the reason for the development of the virtual interface architecture (VIA) [Com97]. Some of the VIA characteristics mentioned in [Dun+98] are the avoidance of system calls whenever possible, data transfers with zero-copy, no interrupts for initializing and completing data transport, and a simple set of instructions for exchanging data. Therefore, some of the tasks which are handled by the IP stack in case of standard sockets (i. e. such that are mapped on standard kernel sockets), such as data transfer scheduling, must in VIA be handled by the NIC.

Contrary to standard sockets, VIA provides virtual interfaces (VIs), which are direct interfaces to the NIC through which each process assumes to own the interface, so that there is no need for system calls during data transfers. Each such VI consists of a send and a receive (work) queue, which can hold descriptors that contain all information needed for data transfers, such as the destination address, transfer mode, and location of the payload in the main memory. After completed transfers (with or without an error), the descriptors are marked by the NIC. Usually, the so-called VI consumer, residing in the user space, is responsible for removing processed descriptors from their work queues. Alternatively, on creation, a VI can be bound to a Completion Queue (CQ) to which notifications on completed transfers are directed. Each CQ has to be bound to at least one work queue, which means that notifications of multiple work queues can be directed to a single queue.

The VIA supports the two following asynchronously operating data transfer models:

Send and receive messaging model (channel semantics) The receiving computer node (in this section referred to as node) specifies where in its local memory received data shall be saved by submitting a descriptor to the receive work queue. Afterwards, a sending node acts analogously with its data to be sent and the send work queue.


Remote Direct Memory Access (RDMA) model (memory semantics) The so-called active node specifies the local memory region and the remote memory region of the so-called passive node. There are two possible operations in the RDMA model: In case of an RDMA write transfer, the active node specifies with the local memory region the data to be sent, while with the remote memory region it specifies where the data shall be stored. In case of an RDMA read transfer, the active node makes analogous specifications. To initiate an RDMA transfer, the active node specifies the local and remote memory addresses as well as the operation mode in a descriptor and submits it to the send work queue. The operating system and other software on the passive node do not actively participate in the (write or read) transfer. Therefore, no descriptors are submitted to the receive queue at the passive node.
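The descriptor-and-queue mechanism common to both models can be illustrated with a minimal, self-contained C sketch. A plain ring buffer stands in for a VI work queue and a function call for the NIC's processing; all names are illustrative and not part of the VIA specification or any real verbs API:

```c
#include <stddef.h>

/* Illustrative descriptor: where the payload lives and what to do with it. */
enum op { OP_SEND, OP_RDMA_WRITE, OP_RDMA_READ };

struct descriptor {
    enum op  op;     /* transfer mode                     */
    void    *local;  /* local payload location            */
    size_t   len;    /* payload length                    */
    int      done;   /* marked by the "NIC" on completion */
};

#define QUEUE_DEPTH 8

/* A work queue processed in FIFO order, as in the VIA. */
struct work_queue {
    struct descriptor slots[QUEUE_DEPTH];
    unsigned head, tail; /* head: next to process, tail: next free slot */
};

/* Submit a descriptor; returns 0 on success, -1 if the queue is full. */
int wq_submit(struct work_queue *q, const struct descriptor *d)
{
    if (q->tail - q->head == QUEUE_DEPTH)
        return -1;
    q->slots[q->tail % QUEUE_DEPTH] = *d;
    q->tail++;
    return 0;
}

/* The "NIC" side: process the oldest descriptor and mark it done. */
struct descriptor *wq_process_next(struct work_queue *q)
{
    if (q->head == q->tail)
        return NULL;
    struct descriptor *d = &q->slots[q->head % QUEUE_DEPTH];
    d->done = 1;
    q->head++;
    return d;
}
```

The consumer submits descriptors and later scans for the done mark (or, with a CQ, picks up completion notifications), which is exactly why no system call is needed on the data path.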

8.2.1 InfiniBand Architecture

The InfiniBand Architecture (IBA) makes use of the abstract VIA design decisions [Pfi01]. The InfiniBand Trade Association (IBTA), founded in 1999 by more than 180 companies, describes the IBA in [Inf07] and the physical implementation of IB in [Inf16].

Network Stack

In Fig. 8.2 the IBA is depicted in the form of a network stack which consists of a physical, link, network, and transport layer. Hints on the IBA realizations are given to the right of the various layers.

Endnodes and Channel Adapters

The communication within an IB network takes place between (end)nodes, which can be, e. g., a server node or a storage system in a computer cluster. A Channel Adapter (CA) is the interface between a node and a link. It can either be a Host Channel Adapter (HCA), which is used in computer hosts and supports certain software features defined by so-called verbs, or a Target Channel Adapter (TCA), which has no defined software interface and is normally used in devices such as a storage system.


Service Types

InfiniBand supports the following service types:

Reliable Connection (RC) A connection between nodes is established and messages are reliably transferred between them (optional for TCAs). One Queue Pair (QP, which is IB's equivalent to a VI) on a local node is connected to one QP on a remote node.

Reliable Datagram (RD) Single packet messages are transferred reliably without a one-to-one connection. A local QP can communicate with any other RD QP. This is optional and not implemented in the OFED stack (see Sect. 8.2.2).

Unreliable Connection (UC) Analogous to RC but unreliable (i. e. packets can get lost).

Unreliable Datagram (UD) Analogous to RD but unreliable.

Raw Datagram Packets are sent without IBA specific headers.

Message Segmentation

The payload is divided into messages between 0 B and 2 GiB for all service types except for UD, which supports messages between 0 B and 4096 B, depending on the maximum transmission unit (MTU). Messages bigger

Figure 8.2: Network stack of the InfiniBand Architecture (IBA)


than the MTU are segmented into smaller packets by the IB hardware, which thus should not affect the performance as in case of software based segmentation [CDZ05]. In the following, QPs are explained further.
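The segmentation arithmetic itself is a simple ceiling division; a small sketch (illustrative only, not taken from the IB specification) makes the relation between message size, MTU, and packet count explicit:

```c
#include <stddef.h>

/* Number of packets a message occupies for a given MTU (ceiling division).
 * A 0-byte message still travels in one packet. Illustrative arithmetic,
 * ignoring per-packet header overhead. */
size_t packets_for_message(size_t msg_len, size_t mtu)
{
    if (msg_len == 0)
        return 1;
    return (msg_len + mtu - 1) / mtu;
}
```

For example, a 9000 B message with a 4096 B MTU is split into three packets, all of which the HCA produces without software involvement.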

Queue Pairs and Completion Queues

Figure 8.3 shows an abstract model of the IBA. Like VIs, QPs have Send Queues (SQs) and Receive Queues (RQs) which enable processes to directly communicate with the HCA. Like descriptors in the VIA, Work Requests (WRs) are submitted to a work queue before message transfer, resulting in Work Queue Elements (WQEs) in the queue. In case of a send WR, the WQE contains the address of the memory location containing the data to be sent. In case of a receive WR, the WQE contains the address of the memory location where received data shall be stored. Not each QP can access each memory location due to memory protection mechanisms

Figure 8.3: InfiniBand Architecture (IBA) model


that also handle which locations can be accessed by remote hosts and the HCA. A WQE in the SQ also contains the network address of the remote node and the transfer model (i. e. send messaging or RDMA).

Data Transmission Example

Figure 8.4 shows a sending and a receiving node, each with three QPs. Each QP is always initialized with a send and a receive queue, but for the sake of clarity the unused empty queues are not depicted.

Before a transmission, the receiving node submits WRs to its RQs. In the figure, the receiving node's consumer is submitting a WR to the red RQ. Afterwards, WRs can be submitted to the SQs and will then be processed by the CA. While the processing order between queues depends on the priority of the services, on congestion control, and on the HCA, WQEs within a certain queue are processed in first in – first out (FIFO) order. In the figure, the sending node's consumer is submitting a WR to the black SQ and the HCA is processing a WQE from the blue SQ.

After the HCA has processed a WQE, it places a Completion Queue Entry (CQE) in the completion queue, which, i. a., contains information about the processed WQE and the status of the operation, indicating a successful transmission or an error if, e. g., the queue concerned was full. When a CQE is posted for a processed WQE depends on the used service type. For instance, in case of an unreliable type, a CQE is posted as soon as the HCA sends the data belonging to a send WR. Instead, in case of

Figure 8.4: InfiniBand data transmission example


a reliable type, the CQE is not posted until the message is successfully received by the remote node.

In the figure, the receiving node's HCA is consuming a WQE from the blue receive queue. After consuming a WQE, the HCA will write the received message to the memory location contained in the WQE and post a CQE. If the sending node's consumer has included so-called immediate data in the message, it will be present in the CQE of the receiving node.

Work Queue Entry Processing

After the submission of a WR to a queue by a process, the HCA starts processing the resulting WQE. In Fig. 8.3 it can be seen that an internal Direct Memory Access (DMA) engine accesses the memory location contained in the WQE and copies the data from that location to a local buffer of the HCA. Every HCA port has several such buffers, called Virtual Lanes (VLs). After this step, the arbiter of each port decides from which VL packets will be sent through the physical link. More on that and further details on the InfiniBand Architecture can be found in [Pot18].

8.2.2 OpenFabrics Software Stack

The IBA does not include a full API specification, to allow vendor specific APIs. In 2004 the nonprofit OpenIB Alliance was founded, later renamed to OpenFabrics Alliance, which releases the open-source OpenFabrics Enterprise Distribution (OFED). OFED is a software stack including, i. a., software drivers, kernel code, and user-level interfaces such as verbs. Most InfiniBand vendors provide OFED based software, with little adaptions and enhancements, together with their hardware solutions. Figure 8.5 shows the sketch of an OFED stack [Mel18] in which the user and the kernel verbs can be seen; in this work, verbs always refer to user verbs.

Submitting Work Requests to Queues

The submission of WRs to the work queues allows user space processes to initiate data transfers through an HCA without an intervention of the operating system kernel. As mentioned before, WQEs contain memory locations for data read and written by the HCA. A WR contains a pointer to a list with at least one scatter/gather element (sge) containing the memory address and length of a memory location as well as a local key for access control. Besides a list of sges, the receive WR structure contains only a few further data elements such as a pointer to the next


receive WR. Additionally, the send WR structure contains even more elements by which various (sometimes optional) features of HCAs can be enabled. The opcode element defines the operation to send the associated message(s). Which operations are allowed depends on the QP the WR is submitted to and the chosen service type. Furthermore, send_flags can hold various flags defining how the send WR shall be processed. One of the flags is IBV_SEND_INLINE, which causes the data pointed to by the sge to be copied directly into the WQE by the CPU. This avoids a copy, performed by the HCA's DMA engine, from the host's main memory to the internal buffer of the HCA. The inline send mode is not defined in the original IBA and therefore not supported by every HCA.

Figure 8.5: An overview of the OFED stack


Since it potentially leads to lower latencies and the buffers can be released for re-use immediately after submission of the send WR, the InfiniBand integration presented here makes use of the inline mode. More details about the OFED can be found in [Pot18].

8.3 Concept of InfiniBand Support in VILLAS

The InfiniBand support was implemented in the VILLASnode sub-project of the VILLASframework. Therefore, the VILLASnode component is introduced in the following.

8.3.1 VILLASnode Basics

As already mentioned in Sect. 8.1, VILLASnode supports different node-types. One VILLASnode instance, called super-node, can have multiple nodes that are sources and / or sinks of simulation data. Hence, a super-node can serve as a gateway for simulation data. In the context of VILLAS, a node is defined as an instance of a node-type from one of the three categories introduced in Sect. 8.1. The connections within a super-node are realized with paths between input and output nodes. A path starts from an input node obtaining data that can, optionally, be sent through a hook to modify the data (e. g. by a filter). Subsequently, the data is written to a FIFO queue (for buffering) before it can be sent through a register which can multiplex and mask it. After this, it can be manipulated by further hooks before it is passed to the output queue which holds the data until the output node is ready. The data is transmitted as samples holding the payload (e. g. simulation data) together with metadata (timestamps and a sequence number). As a sample is the internal format for payload exchange between nodes of arbitrary types, its structure is kept simple to avoid overhead.

Figure 8.6 depicts a super-node with five node-type instances: opal, file, socket, mqtt, and the additionally implemented IB node presented in this chapter. The paths (1 to 3) connect the nodes (n1 to n5) through hooks (h1 to h6), registers (r1 to r3), and input queues (qi,1 to qi,5) as well as output queues (qo,1 to qo,4). More on the node-types can be found in [FEIg].

8.3.2 Original Read and Write Interface

For the interoperability between nodes of different types, various functions such as start(), stop(), read(), write() must be provided by the implementation of a new node-type in the form of assignments of the implemented functions' addresses to the specified function pointers, as for instance:

int (*read)(struct node *n, struct sample *smps[], unsigned cnt);

int (*write)(struct node *n, struct sample *smps[], unsigned cnt);

Some of the functions are optional and will be omitted if no implementation is available for a certain node-type.
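A node-type filling in these function pointers can be sketched as follows. The struct sample and struct node_type below are simplified stand-ins for illustration, not the actual VILLASnode definitions; only the read / write signatures are taken from the snippet above:

```c
#include <stddef.h>

struct sample { unsigned seq; double value; };   /* simplified stand-in */
struct node;                                     /* opaque here          */

/* Interface a node-type must provide (cf. the signatures above). */
struct node_type {
    int (*read)(struct node *n, struct sample *smps[], unsigned cnt);
    int (*write)(struct node *n, struct sample *smps[], unsigned cnt);
};

/* A trivial "loopback" node-type used for illustration. */
static int loopback_read(struct node *n, struct sample *smps[], unsigned cnt)
{
    (void) n;
    for (unsigned i = 0; i < cnt; i++)
        smps[i]->value = (double) i;  /* pretend we received data */
    return (int) cnt;                 /* number of samples read   */
}

static int loopback_write(struct node *n, struct sample *smps[], unsigned cnt)
{
    (void) n;
    (void) smps;
    return (int) cnt;                 /* pretend all were sent */
}

struct node_type loopback = { loopback_read, loopback_write };
```

The super-node only ever calls through the function pointers, so any node-type that assigns compatible implementations interoperates with all others.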

Read Function in General

Figure 8.7 depicts the general proceeding of the read function of an arbitrary node-type. During the call of a read function, the super-node passes the address of a field of sample addresses (*smps[]) of the length cnt ≥ 1 for the data the super-node wants to read from the node. A sample contains, i. a., a sequence number, a reference counter (refcnt), and a field for the actual signal (i. e. payload such as, e. g., 64 bit integers and floating-point numbers). During the allocation of a sample by the

Figure 8.6: An example super-node with three paths connecting five nodes of different node-types


super-node, its refcnt is set to 1, and its memory will not be freed as long as refcnt ≥ 1. Releasing a sample means decreasing its refcnt. Within the read function, the node (i. e. its receive module) is instructed to store at most cnt received samples in the passed list of samples. When the receiving module has finished copying ret ≤ cnt samples, the read function returns ret.

After that, the super-node can process the samples by hooks before forwarding them to another node. Finally, all samples are released, usually resulting in the freeing of their memory.
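The reference-counting behavior described above can be modeled with a few lines of C. This is a minimal sketch of the semantics (allocation sets refcnt to 1, releasing decrements it, memory is freed when it reaches 0); it is not the actual VILLASnode sample implementation:

```c
#include <stdlib.h>

/* Simplified reference-counted sample. */
struct rc_sample {
    int    refcnt;
    double data;
};

struct rc_sample *sample_alloc(void)
{
    struct rc_sample *s = malloc(sizeof *s);
    if (s)
        s->refcnt = 1;  /* allocation starts with one reference */
    return s;
}

/* Release one reference; returns the remaining count (0 = freed). */
int sample_release(struct rc_sample *s)
{
    int left = --s->refcnt;
    if (left == 0)
        free(s);
    return left;
}
```

With this scheme, a hook or queue that still holds a sample simply keeps an extra reference, and the memory outlives the read call that produced it.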

Write Function in General

The general proceeding of the write function of an arbitrary node-type is similar to that of the read function. Here the super-node passes a field with cnt samples to the function, which are copied to the send module within the write function. The send module tries to send all samples, which blocks the return of the write function. After the sending has finished, the number of sent samples, ret, is returned. If ret is not equal to cnt, the super-node handles the sending error properly. In any case, the refcnt of all cnt samples is decremented.

Figure 8.7: General read function working principle in VILLAS


8.3.3 Requirements on InfiniBand Node-Type Interface

InfiniBand with its zero-copy principle, inherited from the VIA, requires that the receive / send modules do not copy any data between their local buffers and the super-nodes' buffers. Instead, pointers to the super-nodes' buffers and their lengths should be passed to the HCA, which uses them directly for received data and for data to be sent. In the following, the desired behavior of the read and write function is sketched.

Read Function of InfiniBand Node-Type

Figure 8.8 depicts the read function's proceeding of the IB node-type. The QP is instructed to receive data by a WQE in its RQ. Therefore, a receive WR pointing to buffers of the super-node must be submitted to the RQ. For compatibility reasons with existing node-types, the further steps were implemented in a way that causes as few changes as possible. Therefore, within the read function, the addresses of the samples (passed as the *smps[] parameter) are assigned to sges and inserted into WRs which are then submitted to the RQ. This results in a direct storage of received

Figure 8.8: InfiniBand node-type read function working principle


data by the HCA in the super-node's samples field, avoiding data copying. Furthermore, the returning of the read function is very different from other node-types. If the CQ contains no CQEs, the HCA has received no data and, thus, the ret value should be 0. However, the sample buffers must not be released (i. e. no refcnt may be decreased) as they are submitted to the RQ of the HCA. If the CQ contains CQEs, the addresses of the buffers from the CQ holding the received data are assigned to the pointers of the smps[] field that was passed to the read function. Moreover, the ret value is set to the number of pointers that have been replaced. Furthermore, the buffers containing the received data must be released after being processed by the super-node. This approach requires that the super-node calls the read function once (during initialization) without reading any data, since only after this first call does the HCA know where to store received data.
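The pointer exchange at the heart of this zero-copy read can be modeled in a few lines. In the sketch below, cq[] stands in for the buffer addresses recovered from CQE wr_id fields when polling a real completion queue; everything else is an illustrative simplification, not the VILLASnode implementation:

```c
#include <stddef.h>

struct buf { double data[8]; };   /* stand-in for a sample buffer */

/* Minimal model of the IB read function's pointer swap: buffers whose
 * receives completed replace the pointers in smps[], so the payload is
 * never copied. Returns the number of swapped pointers (0 if nothing
 * completed). */
int ib_read_model(struct buf *smps[], unsigned cnt,
                  struct buf *cq[], unsigned n_cqe)
{
    unsigned ret = n_cqe < cnt ? n_cqe : cnt;
    for (unsigned i = 0; i < ret; i++)
        smps[i] = cq[i];  /* hand completed buffers to the super-node */
    return (int) ret;
}
```

The buffers originally in smps[] stay owned by the HCA's receive queue, which is why the super-node must not release them when ret is 0.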

Write Function of InfiniBand Node-Type

The write function’s proceeding of the IB node-type must be similar tothe read function in order to achieve zero-copy. When the addresses ofthe sample buffers that are passed to the write function are submittedvia send WRs to the SQ, ret must be set to the number of submittedpointers. If the CQ is empty, none of the passed buffers may be releasedas the HCA has to send the data they contain. If the CQ is not empty,previously submitted WRs are finished and the buffers they point to canbe released. Therefore, the addresses of the buffers that were passed ina previous call of the write function are assigned to the pointers of thepointers of the smps[] field that was passed with the current call of theread function. Furthermore, the super-node must be notified to releasethe sample buffers that were yielded by the according CQEs.

Adapted Read and Write Interface

The original interface could be adapted in order to return the number of samples that must be released by the super-node, as it cannot predict this number, especially in the case of sending inline, where buffers can be released immediately after send WR submission, or in the case where a WR could not be successfully submitted to the SQ. The information on the number of samples to be released could be passed to the super-node by a further integer pointer in the signatures of the read and write function.
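One possible shape of such an adapted interface is sketched below. The extra output parameter and the example implementation are hypothetical illustrations of the idea, not the actual VILLASnode API:

```c
#include <stddef.h>

struct node;    /* opaque stand-ins, not the real VILLASnode types */
struct sample;

/* Adapted interface: an additional output parameter tells the super-node
 * how many of the passed samples it must release itself (e. g. after
 * inline sends or failed submissions). */
struct node_type_v2 {
    int (*read)(struct node *n, struct sample *smps[], unsigned cnt,
                unsigned *release);
    int (*write)(struct node *n, struct sample *smps[], unsigned cnt,
                 unsigned *release);
};

/* Example: an inline-sending write submits all samples; since the CPU
 * copied the payload into the WQEs, all buffers are reusable at once. */
static int inline_write(struct node *n, struct sample *smps[], unsigned cnt,
                        unsigned *release)
{
    (void) n;
    (void) smps;
    *release = cnt;   /* buffers reusable right after submission */
    return (int) cnt;
}
```

A non-inline write would instead report only the buffers whose CQEs already appeared, deferring the rest to later calls.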

8.3.4 Memory Management of InfiniBand Node-Type

VILLASnode allows memory allocation that is optimized for real-time processing. The implemented alloc() function can allocate huge pages,


which leads to a faster mapping between virtual and physical memory [Deba]. Furthermore, it can lead to fewer page faults, and in case of enabled page pinning, the pages must remain in main memory (i. e. are not swapped), which avoids delays in the execution of the program that could cause real-time violations. These and some other memory types are not sufficient for the IB node-type, as the HCA will access the buffers allocated by the super-node and referenced by WRs. Therefore, every node-type defines what kind of memory allocation is performed by alloc() and whether it should be registered with a memory region (as needed for IB). Furthermore, this also allows implementing functionality for acquiring local keys for samples that are passed to the read / write functions. The definition of a preferred memory-type for a node-type allows the super-nodes to use proper memory allocations for input and output buffers that are connected to nodes of that type.
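On Linux, huge-page backed buffers of this kind can be requested via mmap(). The following sketch (not the VILLASnode alloc() implementation) falls back to regular pages when no huge pages are configured on the host; for IB, the returned region would additionally be registered as a memory region, e. g. with ibv_reg_mr():

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Allocate an anonymous buffer, preferring huge pages (MAP_HUGETLB),
 * falling back to regular pages if none are available. For huge pages,
 * len should be a multiple of the huge page size (typically 2 MiB).
 * Returns NULL on failure. Linux-specific sketch. */
void *alloc_dma_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)  /* no huge pages configured: fall back */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Pinning (mlock() or the implicit pinning done by memory registration) would then keep the region resident, avoiding page faults on the data path.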

8.3.5 States of InfiniBand Node-Type

Before the implementation of the IB node-type, a node could be in six states, which are depicted in Fig. 8.9 as circles with solid lines. If, e. g., the _start() function of the node-type interface is called successfully,

Figure 8.9: VILLASnode state diagram with newly implemented states


the transition checked→started is performed. According to the VIA, a node could be initiated but not connected (i. e. the node is not able to send data). Therefore, the started state of VILLASnode is not sufficient and was extended by the new state connected. Moreover, before the receiving of any data, WQEs must be present in the regarding RQ. These circumstances lead to the finite-state machine in Fig. 8.9 with the new states printed with dashed lines. If a node is in one of these states, the super-node interprets it as if it were in the started state. This finite-state machine can also be used for future node-types other than IB that are based on the VIA.
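The extended state machine can be encoded compactly as an enum plus a transition predicate. The sketch below is illustrative; the state names follow Fig. 8.9, but the exact transition set of VILLASnode may differ:

```c
#include <stdbool.h>

/* Node states from Fig. 8.9, including the new VIA-motivated states. */
enum node_state {
    ST_INITIALIZED, ST_PARSED, ST_CHECKED, ST_STARTED,
    ST_PENDING_CONNECT, ST_CONNECTED, ST_STOPPED, ST_DESTROYED
};

/* Is the transition from -> to allowed? (illustrative transition set) */
bool transition_valid(enum node_state from, enum node_state to)
{
    switch (from) {
    case ST_INITIALIZED:     return to == ST_PARSED;
    case ST_PARSED:          return to == ST_CHECKED;
    case ST_CHECKED:         return to == ST_STARTED;
    case ST_STARTED:         return to == ST_PENDING_CONNECT ||
                                    to == ST_CONNECTED ||
                                    to == ST_STOPPED;
    case ST_PENDING_CONNECT: return to == ST_CONNECTED || to == ST_STOPPED;
    case ST_CONNECTED:       return to == ST_STOPPED;
    case ST_STOPPED:         return to == ST_DESTROYED;
    default:                 return false;
    }
}
```

Guarding every state change through such a predicate makes invalid sequences (e. g. sending before the connection is established) detectable at the interface boundary.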

8.3.6 Implementation of InfiniBand Node-Type

An overview of the implemented IB node-type is shown in Fig. 8.10. The most important aspects are explained in the following, i. e. the read and

Figure 8.10: Components of InfiniBand node-type


write function, which allow the kernel bypass offered by InfiniBand. The whole source code is open-source and part of the VILLAS project [FEIf].

Start Function

The start function is called by the super-node for initialization purposes. First, RDMA communication event channels are created to be able to resolve the remote address (as active node) or to place itself into a listening state (as passive node). Whether a node is active or passive is defined by its configuration. In case of a successful start, the super-node transitions into the started state.

Communication Management Thread

The communication management thread is spawned by the start function. It waits in the blocking rdma_get_cm_event() function for events such as connection requests, errors, rejections, and connection establishment. Depending on the node, the thread acts as follows:

Active node As the node tries to connect to another node, RDMA_CM_EVENT_ADDR_RESOLVED signals that the address could be resolved. After a successful initialization of various structures, the RDMA route is resolved, which should end with an RDMA_CM_EVENT_ROUTE_RESOLVED event, followed by an RDMA_CM_EVENT_ESTABLISHED event if the remote node accepts the connection, which results in a transition to the connected state of the node. In this state data can be transmitted.

Passive node As the node listens for connection requests of other nodes, the RDMA_CM_EVENT_CONNECT_REQUEST event occurs if another node performs such a request. In case the service type is UC or RC, the node transitions to the pending connect state. In case of the unconnected service type UD, it transitions to the connected state. An RDMA_CM_EVENT_ESTABLISHED event occurs after a successfully established connection, which lets the node transition to the connected state.
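The event handling of both roles can be summarized as a switch over connection-manager events driving the node's state. The event names below mirror the RDMA CM constants, but this is a self-contained model of the logic described above, not code using librdmacm:

```c
/* Connection-manager events relevant here (modeled after RDMA CM). */
enum cm_event {
    EV_ADDR_RESOLVED, EV_ROUTE_RESOLVED, EV_CONNECT_REQUEST, EV_ESTABLISHED
};

enum conn_state { CS_STARTED, CS_PENDING_CONNECT, CS_CONNECTED };

/* Advance the connection state for one event. unconnected_service is
 * nonzero for datagram service types such as UD, where a connect request
 * immediately yields a usable "connection". */
enum conn_state handle_cm_event(enum conn_state s, enum cm_event ev,
                                int unconnected_service)
{
    switch (ev) {
    case EV_CONNECT_REQUEST:   /* passive node */
        return unconnected_service ? CS_CONNECTED : CS_PENDING_CONNECT;
    case EV_ESTABLISHED:       /* both sides   */
        return CS_CONNECTED;
    default:                   /* address/route resolution: no change */
        return s;
    }
}
```

The real thread additionally handles errors and rejections, which would map to further events returning the node to a stopped or started state.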

Read Function

The read function’s functionality differs from the principle as depicted inFig. 8.7 as it can happen that samples could not be submitted successfullyand therefore must be released again. For this purpose, a thresholdnumber can be defined in the node’s configuration to achieve that at least


threshold samples can be received. If the threshold is reached, the CQ is polled until it contains enough CQEs, which intentionally blocks the further execution of the read function. Moreover, entries in *smps[] are freed as it can hold only a certain amount of values.

Write Function

When the super-node calls the write function, it tries to submit all passed samples to the SQ. Iterating through the samples, the node decides dynamically in which manner the samples have to be sent:

1. samples are submitted normally and are not released by the super-node until a CQE with the proper address appears;

2. samples are submitted normally but some are marked as bad and must be released by the super-node;

3. samples will be sent inline (i. e. are copied by the CPU directly into the HCA's buffers) and must be released by the super-node.

More on the implementation of the InfiniBand node-type can be found in [Pot18].

8.4 Analysis of the InfiniBand Support in VILLAS

The performance of the newly implemented IB node-type is evaluated in the following in comparison to other already existing node-types of VILLASnode.

Measurement Environment

All measurements in this section were performed on a DELL T630 server with an NT78X mainboard providing 2 sockets, each with an Intel Xeon E5-2643 v4 3.4 GHz (3.7 GHz Turbo) 6-core CPU with Hyper-Threading (HT); 32 GB DDR4 main memory at 2400 MHz; 2x Mellanox ConnectX-4 MT27700 HCAs with 100 Gbit/s, interconnected via a 0.5 m Mellanox MCP100-E00A passive copper cable; running an x86_64 Fedora Linux with kernel 4.13.9-200 and MLNX OFED Linux 4.4-2.0.7.0. Moreover, the system was optimized for real-time processing.


Chapter 8 HPC Network Communication for HiL and RT Co-Simulation

Real-Time Optimizations

The following real-time optimizations were applied [Pot18]:

Memory optimizations Achieved through the utilization of huge pages, aligned memory allocations, and memory pinning.

CPU isolation and affinity Achieved by using the isolcpus kernel parameter, which excludes processor cores from general balancing and scheduling mechanisms, so that no process is scheduled to an excluded CPU unless it is explicitly assigned to that CPU by sched_setaffinity(). Moreover, cpusets are used to allow threads that are forked by processes on an excluded CPU to be scheduled among all available excluded CPUs instead of being assigned only to the CPU of their forking process.

Non-movable kernel threads During system booting, kernel threads are created for kernel tasks and pinned to CPUs. This can be prevented so that no kernel threads run on the excluded CPUs.

Interrupt affinity This is used for re-routing interrupts that would disturb a CPU performing time-critical operations (e. g. busy polling on a signal variable for a certain event) to a CPU that is not assigned to time-critical processing.

Tuned daemon Red Hat based systems such as the used Fedora Linux provide the tuned daemon for monitoring devices and adjusting system settings for higher performance. Supported tuning plugins are, e. g., cpu, net, and sysctl. tuned offers many predefined profiles, such as latency-performance for low-latency applications. For instance, this profile sets the CPU frequency governor to performance.
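The CPU affinity assignment described above can be sketched with the Linux sched_setaffinity() interface; a minimal sketch, assuming a Linux host (the helper name pin_to_cpu and the choice of CPU are illustrative):

```cpp
#include <cassert>
#include <sched.h>   // sched_setaffinity(), CPU_* macros (Linux-specific)

// Pin the calling thread to a single CPU, as done for time-critical
// processes on the cores excluded via isolcpus. The helper name is an
// assumption for this sketch.
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 refers to the calling thread; 0 is returned on success
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

After a successful call, the kernel schedules the calling thread only on the given CPU, which is the mechanism used to keep time-critical VILLASnode processes on the isolated cores.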

Figure 8.11 shows the distribution of the CPUs among the cpusets. CPUs in the two real-time<N> cpusets are limited to the memory locations of their non-uniform memory access (NUMA) node. This leads to lower memory access latencies since, in NUMA computer architectures, the main memory is distributed among the nodes (here: processors) of the system, as shown in Fig. 8.11 for the described test system. The limited memory locations are also used by the respective HCAs for writing and reading received data and data to be sent. Therefore, all time-critical processes using the HCAs (i. e. mlx5_0 and mlx5_1) were executed on the CPUs 16, 18, 20, and 22 as well as 17, 19, 21, and 23 (see Fig. 8.11).


[Figure 8.11: Computer system with NUMA architecture used for measurements. A Dell PowerEdge T630 with two NUMA nodes (internal distance 10, distance between nodes 21), each comprising a Xeon E5-2643 v4 processor and 16 GB of DDR4-2400 memory. The even-numbered CPUs 16, 18, 20, and 22 form the cpuset real-time-0 and the odd-numbered CPUs 17, 19, 21, and 23 the cpuset real-time-1 (no IRQs are routed to these groups); the remaining CPUs belong to the system cpusets. One Mellanox ConnectX-4 HCA (mlx5_0/net-ib0 and mlx5_1/net-ib1) is attached per node; the HCAs are interconnected by a 0.5 m cable.]

VILLASnode Node-Type Benchmark

The VILLASnode node-type benchmark was used to compare the performance of different node-types. Its working principle is depicted in Fig. 8.12. First, the already existing signal node generates samples with timestamps, which are then forwarded to a file node that stores them in a file of comma-separated values (CSV), called in. Concurrently, the samples are sent to a sending node of the type to be benchmarked. A receiving node gets the samples and writes them, together with the timestamps of their reception, to a CSV file called out. Therefore, the benchmark

[Figure 8.12: VILLASnode node-type benchmark working principle. On super-node 1, a signal node feeds both a file node writing the in file and a sending node of the node-type under test; on super-node 2, the receiving node of the type under test feeds a file node writing the out file.]


is utilized for measuring the transfer latencies between nodes. Although the out file contains the generation and reception timestamps, the in file is needed to determine which samples were lost. Since the signal node can miss steps during sample generation at high rates, it can thus be determined whether samples are missing because of the signal node or because they were lost between the nodes. The reason for the missed steps is explained in the following.

Sample Generation

For payload generation at different rates, the signal node was configured accordingly. It can make use of two different timers: the timerfd, which relies on waiting on a file descriptor used for notifications, and the time-stamp counter (TSC), which is a 64 bit CPU register that is incremented mainly depending on the CPU’s maximum core frequency. In separate latency benchmarks [Pot18] with a rate of 100 kHz, ten 64 bit floating-point numbers per sample, and RC as the service type of an IB node, it was determined that the timerfd had a higher effect than the TSC on the median tlat of the measured latencies. With the TSC, however, steps were missed at relatively low rates below 2500 Hz. For example, the fraction of missed steps at 100 Hz was around 8 %. Since using the timerfd at these low rates would have skewed the results, and since a deviation of 8 Hz at a rate of 100 Hz will hardly influence any latency results, the TSC was chosen.
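The principle of fixed-rate sample generation with missed-step accounting can be sketched as follows; std::chrono::steady_clock is used here in place of the raw TSC register, and the RateGenerator type is an illustrative assumption, not the signal node’s actual code.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Sketch of fixed-rate sample generation with missed-step accounting;
// steady_clock replaces the raw TSC register read here.
struct RateGenerator {
    std::chrono::steady_clock::time_point next;  // next step deadline
    std::chrono::nanoseconds period;
    std::uint64_t missed = 0;                    // skipped (late) steps

    explicit RateGenerator(double rate_hz)
        : next(std::chrono::steady_clock::now()),
          period(static_cast<std::int64_t>(1e9 / rate_hz)) {}

    // Busy-poll until the next deadline; if the generator fell behind by
    // one or more full periods, those steps are counted as missed.
    void wait_for_next_step() {
        auto now = std::chrono::steady_clock::now();
        while (now >= next + period) {  // deadline already over
            next += period;
            ++missed;
        }
        while (std::chrono::steady_clock::now() < next) { /* busy poll */ }
        next += period;
    }
};
```

Dividing the missed counter by the total number of steps yields the missed-step fraction reported in the measurements below.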

8.4.1 Service Types of InfiniBand Node-Type

In the following, different service types of the IB node-type were compared, namely RC and UD, as they are officially supported by the RDMA communication manager (CM) (see Fig. 8.5) and therefore do not require any modifications of the RDMA library. All measurements in this section were performed with 250’000 samples.

Various Sample Generation Rates

In these measurements, the samples contained 8 random 64 bit floating-point numbers and were generated at rates between 100 Hz and 100 kHz. For RC with 24 B of metadata such a message has 88 B, and for UD with a Global Routing Header (GRH) of 40 B a message has 128 B – all messages were sent inline. Figure 8.13 shows the results, which are relatively similar for both types (RC and UD) over all rates. This is typical for InfiniBand, as the reliability is implemented in the HCA, which causes less overhead than


an implementation in the network stack (e. g. TCP/IP) of the operating system.

In both cases, tlat decreases with higher frequencies and, thus, shorter periods between sample transmissions. Assuming one-way transmission times of 1 µs [Pot18], transmission rates of up to approximately 1 MHz should be possible. However, rates higher than 100 kHz were not measured, as the signal node of VILLAS missed even more steps. A higher rate could not be achieved despite optimizations of the file node. Another limitation is that the rate at which the IB node clears the CQ and refills the RQ depends on the rate of the read function calls. If the RQ size is sufficient, it can absorb short message peaks, but not continuously high rates.
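Why a sufficiently sized RQ absorbs short peaks but not continuously high rates can be illustrated with a toy queue model; this is purely illustrative, not VILLASnode code, and all names are assumptions of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy queue model: per tick, `arrivals[t]` messages enter the RQ and up
// to `service` are consumed (modeling the rate of read calls); whatever
// exceeds the capacity is dropped. Returns the total number of drops.
std::uint64_t simulate_rq(std::uint64_t capacity, std::uint64_t service,
                          const std::vector<std::uint64_t> &arrivals) {
    std::uint64_t queued = 0, dropped = 0;
    for (std::uint64_t a : arrivals) {
        queued += a;
        if (queued > capacity) {   // RQ overflow: messages are lost
            dropped += queued - capacity;
            queued = capacity;
        }
        queued -= queued < service ? queued : service;  // read() drains
    }
    return dropped;
}
```

A single burst of 10 messages is fully absorbed by a 16-entry queue drained at 4 messages per tick, whereas a sustained arrival rate of 8 messages per tick eventually overflows the same queue.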

Various Sample Sizes

For a measurement over various sample sizes, the rate was fixed to 25 kHz and the messages contained 1 to 64 values per sample, resulting in messages of 32 B to 536 B for the RC and 74 B to 576 B for the UD type. Messages of 188 B or smaller were sent inline by the used HCAs. In Fig. 8.14, an increasing median latency can be seen when the message size exceeds about 128 B, which is in accordance with the findings presented in [MR12]. Furthermore, the variability of the latencies with the UD type is higher than with the RC type. Moreover, the RC type shows lower latencies than the UD type, which can be explained by the addition of the remote node’s address handle (AH) to every send WR and of the GRH to every message, both not needed in case of the RC type.

[Figure 8.13: Median latencies tlat over various sample rates. Axes: sample generation rate from 100 Hz to 100 kHz, tlat from 0 to 4 µs; curves for RC and UD.]


Various Generation Rates and Sample Sizes

The results of a combined measurement with various generation rates and sample sizes are shown in Fig. 8.15 for the RC service type only. The findings of both previous measurements are reflected in this overall measurement diagram. The percentage of missed steps is also shown in Fig. 8.15 and

[Figure 8.14: Median latencies tlat over various sample sizes. Axes: number of values per sample from 1 to 64, tlat from 0 to 5 µs; curves for RC and UD.]

[Figure 8.15: Median latencies tlat over various sample rates (100 Hz to 100 kHz) and sample sizes (2 to 64 values per sample) for the RC type; min tlat: 1.706 µs, max tlat: 4.915 µs. The percentage of samples missed by the signal generator is annotated per bar: about 8 % at 100 Hz and 3 % at 2500 Hz, 0 % in almost all other cases, but 10 %, 22 %, and 55 % for the largest samples at the highest rates.]


colored in red if the signal node missed more than 10 % of the steps. With these results, the data rate T can be calculated with

T = (1 − p_missed / 100 %) · s_sample · f_signal, (8.1)

where p_missed is the percentage of missed samples, s_sample the size of a sample, and f_signal the sample generation rate. In the measurements, the data rate was limited to approximately 20 MiB/s, which shows that the file node was not able to process large amounts of data.
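Equation (8.1) can be evaluated directly; a small sketch, with the helper name data_rate chosen here for illustration only:

```cpp
#include <cassert>

// Sketch of Eq. (8.1): effective data rate T from the percentage of
// missed steps, the sample size in bytes, and the generation rate.
double data_rate(double p_missed_percent, double s_sample_bytes,
                 double f_signal_hz) {
    return (1.0 - p_missed_percent / 100.0) * s_sample_bytes * f_signal_hz;
}

// Example: at 100 kHz with 88 B messages (RC) and 8 % missed steps,
// T = 0.92 * 88 B * 100000 1/s = 8096000 B/s, i.e. about 7.7 MiB/s.
```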

8.4.2 InfiniBand vs. Zero-Latency Node-Type

For the comparison of the IB node-type with a zero-latency node-type, the shmem node-type was chosen, as it utilizes the POSIX shared-memory API for the communication between VILLAS nodes; the latency between two shmem nodes therefore corresponds to the access latency of the shared-memory region used by both of them. Again, 250’000 samples were sent at rates between 100 Hz and 100 kHz, each containing 8 random 64 bit floating-point numbers.
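The idea of shared-memory data exchange between two processes can be illustrated generically; this is not the VILLASnode shmem implementation (which uses the named POSIX shared-memory API), and the function name is an assumption of the sketch.

```cpp
#include <cassert>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Generic illustration of inter-process shared-memory exchange: a
// MAP_SHARED anonymous mapping is inherited by the forked child, so a
// "sample" written by one process is visible to the other without any
// copy through a network stack.
double exchange_sample(double value) {
    void *mem = mmap(nullptr, sizeof(double), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    double *shared = static_cast<double *>(mem);
    *shared = 0.0;
    pid_t pid = fork();
    if (pid == 0) {      // child acts as the "sending" node
        *shared = value;
        _exit(0);
    }
    waitpid(pid, nullptr, 0);   // parent waits, then reads the sample
    double result = *shared;
    munmap(shared, sizeof(double));
    return result;
}
```

Because reading and writing the mapped region are ordinary memory accesses, such an exchange is essentially limited only by memory latency, which is why shmem serves as the zero-latency baseline here.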

Figure 8.16 shows the results for both node-types. The latency differences between the node-types can be assumed to be caused by the IB communication. Both the median latency of the IB node and that of the shmem node decreased with higher frequencies. Therefore, this effect cannot be caused by the PCI-e bus or the IB node implementation itself.

[Figure 8.16: Median latencies tlat of the IB (RC) vs. the shmem node-type over various sample rates from 100 Hz to 100 kHz, with the percentage of samples missed by the signal generator annotated per rate: 8.03 % and 3.72 % at the two lowest rates, below 0.5 % otherwise for IB, and 8.03 % down to 0.28 % for shmem.]


Furthermore, the IB node missed only a negligible amount of steps more than the shmem node. This implies that the write function of the IB node returned fast enough and did not influence the signal generation too much. With median latencies of around 2.5 µs, transmission rates of up to ~400 kHz could be possible.

8.4.3 InfiniBand vs. Existing Server-Server Node-Types

One reason for the integration of IB into VILLASnode was the lack of a hard real-time capable server-server node-type. Therefore, this section compares the IB and shmem node-types with existing node-types commonly used for server-server communication, zeromq and nanomsg. Once more, 250’000 samples were sent at rates between 100 Hz and 100 kHz, each containing 8 random 64 bit floating-point numbers.

Loopback vs Physical Link

First, the loopback IP 127.0.0.1 was used for the IP-based node-types zeromq and nanomsg. Afterward, the measurements were repeated on a physical link, which for IP-based node-types is usually based on Ethernet technology. However, to avoid skewing the results by another link technology such as Ethernet, the IB HCAs were also used as the physical link for the communication between the IP-based node-types. This was realized utilizing the Internet Protocol over InfiniBand (IPoIB) driver, which provides an IP-based interface that can be used by the IP-based node-types.

Figure 8.17 shows the results for the IP-based node-types. For rates below 25 kHz, there were no relevant latency deviations between the loopback and the physical link. Above 25 kHz, the latencies on the physical links increased, especially with zeromq. The percentage of missed steps at 100 Hz and 2500 Hz was the same for both IP-based nodes as for the IB and shmem nodes, again indicating the TSC to be the reason.

In Fig. 8.18, the results of the IP-based nodes on physical links are compared to those of the IB and shmem nodes. It can be seen that the latencies of the hard real-time capable IB node were at least one order of magnitude lower than those of the IP-based node-types, which are soft real-time capable only. Also, the variability of the latencies in case of IB was very low compared to the IP-based types, especially for rates above 25 kHz, when the IP-based types showed increasing latency magnitudes.


8.5 Conclusion and Outlook

The results presented in this chapter show that the integration of InfiniBand into the VILLAS framework enables the transmission of samples at relatively high rates, with latencies of a few microseconds, and under hard real-time requirements. These low latencies were achieved by strict compliance with the principles of VIA, such as zero-copy, and the utilization of InfiniBand’s capabilities to initiate data transmissions without using system calls or

[Figure 8.17: Median latencies of the nanomsg and zeromq node-types over various sample rates (100 Hz to 100 kHz) via loopback (lo) and physical link; tlat on a logarithmic axis between 10^2 and 10^3 µs.]

[Figure 8.18: Median latencies of the nanomsg and zeromq node-types via physical links as well as of the IB and shmem node-types over various sample rates; tlat on a logarithmic axis between 10^0 and 10^3 µs.]


the active participation of a CPU. Thanks to this design, the IB node-type can also be adapted for other VIA-based interfaces.

While for small messages at high rates the IB node-type showed median latencies of around 1.7 µs, the median latencies for larger message sizes at low rates were around 4.9 µs. Compared to the almost zero-latency shmem node-type, the median latencies were only 1.5 to 2.5 µs higher, which is of high value in the area of real-time processing, as shmem only allows communication between the nodes of a shared-memory system, which are typically located on the same computer. Moreover, existing VILLAS node-types for communication among different systems over IP showed median latencies that were one to two orders of magnitude larger than in the case of IB. The latter can furthermore be used for much higher sample rates.

With the IB node-type, VILLASnode can be used for the hard real-time capable coupling of simulators running on conventional and inexpensive computer hardware in academia and industry. Moreover, in the future, e. g., HiL setups are possible where devices to be connected to a computer host running the simulation can be equipped with an InfiniBand TCA for low-latency data transfers between the device and the simulation. The same setup can be used for real-time operation.

The IB node-type implementation could be further improved for real-time processing with the aid of an RT_PREEMPT-capable Linux kernel. Further performance analyses, e. g., based on a profiling of the node-type’s read and write functions, could be accomplished for a code optimization leading to even lower latencies.


9 Conclusion

In the following, the conclusions of all previous chapters are summarized and discussed.

9.1 Summary and Discussion

This dissertation presents various methods from high-performance computing (HPC) in support of power system simulation. These methods shall help other power system simulation users and developers in their undertakings. Therefore, all presented approaches were implemented in open-source software projects.

In Chapter 2, a data model for multi-domain smart grid topologies, based on the Common Information Model (CIM), was presented. CIM was chosen despite its lack of all object classes needed for communication networks, of many classes for energy markets, and of some classes for power grids. The CIM extensions that are thus needed can lead to diverging extensions by different organizations. Of course, CIM is not the only possible information model, but it provides the biggest well-specified subset of the classes needed for a holistic representation of smart grid topologies, from a high to a relatively detailed level. Moreover, the CIM User Group (CIMug) extends CIM with new classes to achieve an increasingly holistic model of smart grid topologies. The developed SINERGIEN_CIM data model was furthermore used for the successful validation of the automated CIM (de-)serializer generation in Chap. 3 with ontologies that extend CIM, ensuring a general applicability of the approach.


Chapter 3 introduced an approach for an automated (de-)serializer generation based on a UML to C++ code generation, a subsequent code adaption plus extension, and a template based (un-)marshalling code generation with the aid of a C++ compiler front-end. In contrast, instead of making use of a UML editor such as Enterprise Architect (EA), one could save the UML specification in an open document format such as XML Metadata Interchange (XMI). Then a code generator (to be developed) could directly generate (un-)marshalling code by traversing the XMI document. This would make the code adaption as well as the compiler front-end processing unnecessary. In fact, this procedure is currently applied for the integration of the Common Grid Model Exchange Standard (CGMES) into CIM++. However, instead of an XMI document representing the CGMES UML specification, the generator for CGMES (un-)marshalling code for CIM++ makes use of Resource Description Framework Schema (RDFS) documents, which define the structure of a concrete RDF based document type. Furthermore, the compilation of more than a thousand CIM classes into libcimpp is not needed for each project using it. To reduce its size, e. g. for the application in embedded systems with very limited main memory and program storage, an approach will be implemented which makes it possible to choose a certain subset of CIM classes. Of course, all superclasses of a given subset must be automatically integrated into libcimpp as well. Despite all that, libcimpp is already utilized not only in Institute for Automation of Complex Power Systems (ACS) software projects but also by a Swiss and a Czech company, and potentially also by other GitHub users who stay anonymous. This indicates that it is usable not only in academia but also in enterprise applications.

Chapter 4 presents a template based translation method from CIM to simulator-specific system models as implemented in the CIMverter project. One could argue that the template based approach is too inflexible in comparison to a domain-specific language (DSL) based approach, as more complex mappings must be implemented in C++. A contrary indication is that further component translations from CIM to Modelica required hardly any or even no changes in the C++ Modelica Workshop. Also, the integration of the DistAIX system model format in CIMverter was accomplished in a couple of person-days, as it could be performed mainly with new templates. Besides these examples, it must be assumed that complex mappings would also require a comprehensive DSL. This would lead to higher efforts for learning the DSL and implementing it in CIMverter. Furthermore, the presented mapping from CIM to simulator-specific system models covers so-called bus-branch models only. Therefore, CIMverter is not able to handle node-breaker models, but this follows the


chosen UNIX philosophy of developing one program for one task [MPT78; Ray03], as a node-breaker model does not provide all the information needed for a final system model. In fact, it provides a set of topologies which can differ depending on the configuration of the breakers. Hence, the mapping of a node-breaker model to a bus-branch model should be handled by a separate topology processor with respect to the given breaker configuration.

Chapter 5 presents modern LU decomposition methods for circuit simulation that have been parallelized for current parallel processors. It shows a comparative analysis with the state-of-the-art LU decomposition KLU, based on benchmark matrices as well as on real simulations performed by Dynaωo. It could be regarded as a disadvantage that only the solving of linear systems was considered, although in power system simulation usually non-linear systems are solved. One reason is that solving a non-linear system usually is implemented by linearization and a subsequent solving of a linear system. Furthermore, in case of large-scale power grid models, Réseau de Transport d’Électricité (RTE), the French transmission system operator (TSO), found out that during the solving of DAEs (e. g. with the aid of IDA) most of the CPU time is spent in the LU factorization (i. e. KLU). RTE and other partners of the PEGASE research project conducted a comprehensive analysis of different solvers. The outcomes of that analysis are another reason why only LU decompositions, and thus direct solvers, have been analyzed and no iterative ones.

Chapter 6 introduces the new approach of an automatic fine-grained parallelization of mathematical models that was implemented in the new power grid simulator DPsim. It is about exploiting parallelism in mathematical power system models to make use of multi-core processors with shared-memory architectures, which are common in today’s computer systems. The MNA solver itself, which is based on the SparseLU method of the Eigen library, implementing the supernodal LU factorization [Sup] for sparse non-symmetric systems of linear equations, has not been improved. Obviously, at this point other LU decompositions could also be integrated and improved, analogously to the work already done for OpenModelica and Dynaωo in Chap. 5. This would improve the performance of the task processing itself, in addition to the implemented parallel processing of the tasks, which was the goal of this work. From an HPC point of view, the power grid simulations that were performed, e. g., by OpenModelica, Dynaωo, and DPsim never led, because of the sparsity of the linear systems, to matrices of a size large enough for an efficient use of distributed-memory systems or even supercomputers. Even in case of large-scale static phasor power grid simulations with more than 7500 nodes, the matrices had indeed a size of 200000 × 200000, but with up to 700000 nonzeros their memory consumption was only around 5 MiB.


Hence, no parallelization approaches for distributed-memory architectures were needed or implemented. This could change in case of large-scale dynamic simulations, but such simulations have not been considered yet.

Chapter 7 addresses different approaches for increasing the performance of Python programs. After an introduction to the ideas and internals of the Python runtime environments implementing these approaches, a comparative analysis based on algorithms from different algorithm classes is presented. The analysis helps programmers to understand how to adapt Python programs to achieve a better runtime performance in a certain environment, for instance based on just-in-time (JIT) compilation. Thus, it also helps to estimate the efforts and benefits of the development. The analysis was mainly focused on sequential processing, with shared-memory parallelization by multithreading only, but achieving better performance precisely in the case of sequential Python code was the focus of the analysis. There are also other JIT compilation based programming languages, such as Julia. Julia’s syntax is similar to MATLAB and Python, and it provides memory management, making it easy to learn for programming beginners. However, Julia was developed as a language for scientific computing, and Python is much more popular in engineering.

Chapter 8 presents the implementation of InfiniBand (IB) support in the VILLASframework for Hardware-in-the-Loop (HiL) setups and the real-time (RT) coupling of simulators. The implemented IB communication shows transmission latencies that are one order of magnitude lower than the corresponding latencies of Internet Protocol (IP) based communication with nanomsg and zeromq. Furthermore, the IB latencies are only slightly higher than in case of a shared-memory based data exchange, which is limited to the same computer host. The InfiniBand latencies also show a very low variability, which is important for RT requirements. Even though InfiniBand based communication is not suitable for wide area networks (WANs), distances above 15 m can be bridged by active fiber optic cables with hundreds of meters in length. Therefore, with InfiniBand interconnects, a widely-used HPC network technology can be applied for hardware-server and server-server communication, even with hard RT requirements, via the VILLASnode software gateway for simulation data exchange.

Taken as a whole, it can be concluded that the work presented in this dissertation already improved or can improve the performance of different (co-)simulation environments. Furthermore, it enables the use of CIM topologies in different power system simulators, allowing the simulation of large-scale real world power systems. Also, many findings and approaches can be used for improving further software from the area of electrical engineering and beyond. The implemented open-source


software projects can be used and improved by scientists and developers in academia and the private sector.

9.2 Outlook

Some concrete improvements of the developed concepts and approaches have already been mentioned in the discussion above. One of today’s most important goals in the area of HPC for smart grid simulation is a solver for the linear systems of equations arising during simulations performed by power grid simulation environments, which scales with the cores of modern multi-core processors. At least in case of state-of-the-art steady-state simulations, it has been seen that there is no need for parallel architectures with distributed memory. Workstations and servers with a shared-memory architecture can cope with steady-state simulations, but larger and increasingly complex system models (on the basis of component model improvements, more elaborate models of new equipment, and more grid nodes) require an efficient utilization of the underlying computer hardware. Hence, if simulations with more complex system models are not to run longer, the software must make use of new hardware developments. Therefore, more research and development on the power system simulation environments and their numerical back-ends is needed to make use of the wider vector units of today’s processors and of accelerators such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs).

Further scientific work related to HPC in power system simulation is also needed in the area of dynamic security assessment (DSA) based on dynamic grid simulation. In DSA systems, different scenario computations can be triggered by certain events, such as the outage of grid equipment. Then, dynamic (n-1)-computations must be performed, which can provide information on the voltage stability, small-signal stability, and transient stability of the system. DSA systems can make use of expert systems, for example on the basis of neural networks, that can derive grid operation improvements from the mentioned analyses. Since the real-time requirements on DSA systems can be very challenging and dynamic computations can be very time-consuming, the application of high-throughput computing (HTC) on distributed-memory systems can be the method of choice, where HTC denotes a computing paradigm that focuses on the efficient execution of a large number of loosely-coupled tasks [Eur].

In the context of dynamic (n-1)-computations, a topology processor which generates bus-branch topologies from node-breaker models with a given breaker setting could be helpful. The (n-1)-computation control could make use of such a topology processor providing all topologies to


be considered in case of the scenarios which have to be calculated for an event triggering DSA computations. These topologies could be used for the aforementioned stability calculations as well as for additional dynamic and static protection simulations.

Besides multithreading paradigms for shared-memory parallelization in Python programs, paradigms for distributed-memory parallelization, e. g. with the aid of the Message Passing Interface (MPI), should also be examined. The benchmarking of an MPI implementation itself should not be performed by examining the performance of a set of MPI based applications but more systematically for different kinds of communication operations (e. g. individual, collective, one-sided, etc.), for various communication patterns (one-to-one, one-to-many, all-to-all), and for multiple communication modes. For this purpose, an approach such as the one implemented in the special Karlsruher MPI-benchmark (SKaMPI) could be followed, which performs various measurements of different MPI functions in a customizable way.


A Code Listings

A.1 Exploiting Parallelism in Power Grid Simulation

Listing A.1: step method of the OpenMP-based level scheduler

void OpenMPLevelScheduler::step(Real time, Int timeStepCount) {
    size_t i = 0, level = 0;

    #pragma omp parallel shared(time, timeStepCount) \
        private(level, i) num_threads(mNumThreads)
    for (level = 0; level < mLevels.size(); level++) {
        #pragma omp for schedule(static)
        for (i = 0; i < mLevels[level].size(); i++) {
            mLevels[level][i]->execute(time, timeStepCount);
        }
    }
}


B Python Environment Measurements

B.1 Execution Times

[Figure B.1: Execution times for AVL Tree Insertion over the number of nodes (up to 35000), on a logarithmic time axis, for C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), PyPy3 (Pure Python), and PyPy2 (Pure Python).]


[Figure B.2: Execution times for Dijkstra over the number of nodes (up to 2500), on a logarithmic time axis, for C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), and PyPy2 (Pure Python).]

[Figure B.3: Execution times for Gauss-Jordan Elimination over the size of the matrices (up to 3500), on a logarithmic time axis, for C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), and PyPy2 (Pure Python).]


B.2 Memory Space Consumption

[Figure B.4: Memory consumption (maximum heap peak, up to about 100 MB) for Cholesky Decomposition over the size of the matrices (up to 800) for C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), and PyPy2 (Pure Python).]


[Figure B.5: Memory consumption (maximum heap peak, up to about 200 MB) for Gauss-Jordan Elimination over the size of the matrices (up to 2000) for C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), and PyPy2 (Pure Python).]

[Figure B.6: Memory consumption (maximum heap peak, up to about 40 MB) for Gauss-Jordan Elimination of selected runtime environments (C++, CPython3, Cython (Pure Python), Cython (Optimized), Numba + NumPy) over the size of the matrices (up to 2000).]


[Plot: maximum heap peak in MB (0 to 350) over the size of the matrices (0 to 2000) for Matrix-Matrix Multiplication; one curve per runtime environment: C++, CPython3, CPython2, Cython (pure-Python and optimized variants), PyPy2/PyPy3 (pure-Python and optimized variants), and Numba + NumPy]

Figure B.7: Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication

[Plot: maximum heap peak in MB (0 to 100) over the size of the matrices (0 to 2000) for Matrix-Matrix Multiplication; curves for the selected runtime environments C++, CPython3, Cython (Pure Python), Cython (Optimized), PyPy (Optimized), and Numba + NumPy]

Figure B.8: Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication of selected runtime environments


List of Acronyms

ACS Institute for Automation of Complex Power Systems
ADT abstract data type
AH address handle
AMD Approximate Minimum Degree
AOT ahead-of-time
API Application Programming Interface
ARM Advanced RISC Machines
AST abstract syntax tree
AVL Adelson-Velsky and Landis
AVX Advanced Vector Extensions

BDF backward differentiation formula
BSD Berkeley Software Distribution
BTF block triangular form

CA Channel Adapter
CASE Computer-Aided Software Engineering
CFG control flow graph
CGMES Common Grid Model Exchange Standard
CiL Control-in-the-Loop
CIM Common Information Model
CIMug CIM User Group
CLI command line interface
CM communication manager
COLAMD Column Approximate Minimum Degree
CP critical path
CPU central processing unit


CQ Completion Queue
CQE Completion Queue Entry
CSV comma-separated values

DAE differential-algebraic system of equations
DAG directed acyclic graph
DER distributed energy resource
DES discrete event simulation
DES Discrete Event System Specification
DistAIX Distributed Agent-Based Simulation of Complex Power Systems
DMA Direct Memory Access
DMS distribution management system
DOM Document Object Model
DP dynamic phasor
DPsim Dynamic Phasor Real-Time Simulator
DRTS digital real-time simulator
DSA dynamic security assessment
DSL domain-specific language
DSO distribution system operator
DUFunc dynamic universal function

EA Enterprise Architect
EHV extra high voltage
EMS energy management system
EMT electromagnetic transient simulation
ENTSO-E European Network of Transmission System Operators for Electricity

FIFO first in – first out
FPGA field-programmable gate array

GC garbage collector
GD-RTS geographically distributed real-time simulation
GIL global interpreter lock
GMRES Generalized Minimal Residual Algorithm
GP Gilbert/Peierls'
GPU graphics processing unit
GRH Global Routing Header

HCA Host Channel Adapter
HiL Hardware-in-the-Loop


HLFET Highest Level First with Estimated Times
HLFNET Highest Level First with No Estimated Times
HLL high-level language
HPC high-performance computing
HT Hyper-Threading
HTC high-throughput computing
HV high voltage
HW hardware

IB InfiniBand
IBA InfiniBand Architecture
IBTA InfiniBand Trade Association
ICT information and communications technology
IEC International Electrotechnical Commission
IP Internet Protocol
IPC inter-process communication
IPoIB Internet Protocol over InfiniBand
iPSL iTesla Power System Library
IR intermediate representation
ISO International Organization for Standardization
IVP initial value problem

JIT just-in-time

LAN local area network
LSE linear system of equations

MD minimum degree
MDA model-driven architecture
MNA modified nodal analysis
MPI Message Passing Interface
MPS ModPowerSystems
MQTT Message Queue Telemetry Transport
MTU message transmission unit

ND nested dissection
NIC network interface controller
NR Newton-Raphson
NUMA non-uniform memory access

ODE ordinary differential equation


OFED OpenFabrics Enterprise Distribution
OMG Object Management Group
OOP object-oriented programming
OS operating system

PC program counter
POSIX Portable Operating System Interface

QoS quality of service
QP Queue Pair
QVT Query/View/Transformation

RC Reliable Connection
RD Reliable Datagram
RDF Resource Description Framework
RDFS Resource Description Framework Schema
RDMA Remote Direct Memory Access
RPython Restricted Python
RQ Receive Queue
RT real-time
RTE Réseau de Transport d'Électricité
RTTI runtime type information

SAX Simple API for XML
SCADA supervisory control and data acquisition
SGAM Smart Grid Architecture Model
sge scatter/gather element
SiL Software-in-the-Loop
SIMD single instruction multiple data (stream)
SKaMPI special Karlsruher MPI-benchmark
SL simplified load
SQ Send Queue
SSA steady-state security assessment
STL Standard Template Library
SUNDIALS SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers
SW software

TCA Target Channel Adapter
TCP Transmission Control Protocol
TJIT tracing just-in-time


TLM Transmission Line Modeling
TSC time-stamp counter
TSO transmission system operator

UC Unreliable Connection
UD Unreliable Datagram
UDP User Datagram Protocol
ufunc universal function
UML Unified Modeling Language

VDL voltage dependent load
VI virtual interface
VIA virtual interface architecture
VL Virtual Lanes
VM virtual machine
VPP virtual power plant

WAN wide area network
WQE Work Queue Element
WR Work Request
WSCC Western System Coordinating Council

XMI XML Metadata Interchange
XML Extensible Markup Language


Glossary

barrier A barrier is a synchronization primitive, e.g. among a set of threads or processes, for which it holds that each thread or process of the set in question must execute all instructions before the barrier before any of them continues with the instructions after the barrier.
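This definition can be illustrated with Python's standard threading.Barrier; an illustrative sketch, not code from this work, with the names worker, before, and after made up for the example:

```python
import threading

NUM_THREADS = 3
barrier = threading.Barrier(NUM_THREADS)
before, after = [], []     # record which thread finished which phase
lock = threading.Lock()

def worker(i):
    with lock:
        before.append(i)   # "instructions before the barrier"
    barrier.wait()         # block until all NUM_THREADS threads arrive
    with lock:
        after.append(i)    # "instructions after the barrier"

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread completed its pre-barrier phase before any post-barrier phase,
# because barrier.wait() only returns once all parties have reached it.
assert sorted(before) == [0, 1, 2] and sorted(after) == [0, 1, 2]
```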

driver In the context of numerical software (i.e. not a hardware driver): a program which applies numerical methods, e.g. implemented in libraries that are linked to the program, with all needed initializations and parameters, to a particular problem to be solved.

flat model The semantics of the Modelica language is specified by means of a set of rules for translating any class described in the Modelica language to a flat Modelica structure (i.e. a flat model). A class must have additional properties in order that its flat Modelica structure can be further transformed into a set of differential, algebraic and discrete equations (i.e. a flat hybrid DAE). [Mod]

Modelica Modelica is a free object-oriented multi-domain modeling language for component-oriented modeling.

OpenModelica An open-source Modelica-based modeling and simulation environment intended for industrial and academic usage.


pivoting The pivot element of a row or column of a matrix is the first element selected by an algorithm (e.g. during a Gaussian elimination) before a certain calculation step. Finding this element is called pivoting. In Gaussian elimination with pivoting, usually the element with the highest absolute value is chosen.
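A minimal sketch of Gaussian elimination with partial pivoting, illustrating the definition above (all function names here are hypothetical, not from this work):

```python
def pivot_row(A, k):
    """Pivoting: index i >= k of the row with the largest |A[i][k]|."""
    return max(range(k, len(A)), key=lambda i: abs(A[i][k]))

def gauss_eliminate(A, b):
    """Forward elimination with partial pivoting (A and b are modified in place)."""
    n = len(A)
    for k in range(n):
        p = pivot_row(A, k)            # choose the pivot element
        A[k], A[p] = A[p], A[k]        # swap the pivot row into position k
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    return A, b

def back_substitute(A, b):
    """Solve the upper triangular system produced by gauss_eliminate."""
    n = len(A)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

# Without pivoting, the first step would divide by A[0][0] == 0 here.
U, c = gauss_eliminate([[0.0, 2.0], [4.0, 1.0]], [2.0, 9.0])
x = back_substitute(U, c)              # x == [2.0, 1.0]
```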

preordering Computation of permutation matrices which are applied to the matrix to be factorized before the actual factorization step.
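As an illustrative sketch of this definition (the preorderings evaluated in this work, such as AMD or nested dissection, use far more elaborate heuristics; degree_order and permute are hypothetical toy helpers), a permutation can be computed from the matrix structure and applied symmetrically before factorization:

```python
def degree_order(A):
    """Toy preordering: permutation that sorts rows by their nonzero count."""
    nnz = [sum(1 for v in row if v != 0) for row in A]
    return sorted(range(len(A)), key=lambda i: nnz[i])

def permute(A, p):
    """Apply the symmetric permutation P*A*P^T, with P given as an index list p."""
    return [[A[pi][pj] for pj in p] for pi in p]

A = [
    [4.0, 1.0, 1.0],
    [1.0, 3.0, 0.0],
    [1.0, 0.0, 2.0],
]
p = degree_order(A)        # the densest row/column is ordered last
Ap = permute(A, p)         # the factorization would now operate on Ap
```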

race condition A race condition is a condition in which the result of concurrently executed program statements depends on the (uncontrollable) execution order of the CPU instructions belonging to the statements.
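The classic lost-update race can be sketched deterministically by forcing the bad interleaving by hand; in a real race this ordering is decided by the CPU and scheduler and is therefore uncontrollable:

```python
counter = 0

# Threads A and B each intend to execute: counter = counter + 1.
# The read-modify-write is not atomic, so the reads can interleave:
read_a = counter        # A reads 0
read_b = counter        # B reads 0 before A has written back
counter = read_a + 1    # A writes 1
counter = read_b + 1    # B also writes 1: A's increment is lost

assert counter == 1     # with the other ordering the result would be 2
```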

thread A thread (of execution) is a set of instructions associated with a process. A multi-threaded process has multiple threads. If the computer system allows these threads to run concurrently, the process can benefit from higher computational power.

thread-safe A part of a program is thread-safe if multiple threads can execute that part concurrently while always generating results as if the threads had executed the part in a sequential order (i.e. one thread executes the whole part, the next thread executes the whole part, and so forth until all threads have finished). The sequential order can vary between executions of the program.
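In Python, such a part of a program can be made thread-safe with a lock, so that the shared read-modify-write appears to execute in some sequential order; an illustrative sketch, with SafeCounter as a hypothetical class:

```python
import threading

class SafeCounter:
    """Counter whose increment is thread-safe thanks to a mutual-exclusion lock."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:        # at most one thread executes this part at a time
            self.value += 1

counter = SafeCounter()

def work():
    for _ in range(1000):
        counter.increment()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert counter.value == 4000    # no increment was lost
```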

wall clock time The wall clock time is the time which elapses in reality during the measured process.
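The contrast with CPU time can be sketched with Python's standard timers: during a sleep, wall clock time keeps elapsing while almost no CPU time is consumed (illustrative sketch):

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU-time timer of this process

time.sleep(0.1)                    # the "measured process": just waiting

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

assert wall_elapsed >= 0.09        # real time passed during the sleep
assert cpu_elapsed < wall_elapsed  # but hardly any CPU time was used
```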


List of Figures

1.1 Contribution overview of this work
2.1 Exemplary topology including components of (1) all domains and (2) domain-specific topologies
2.2 Inter-domain connections between classes of power grid, communication network and market
2.3 Communication network class association example
2.4 Overall SINERGIEN architecture for simulation setup
2.5 Synchronization scheme of simulators at co-simulation time steps
2.6 Scheme of runtime interaction between co-simulation components
3.1 Overall concept of the CIM++ project
3.2 UML diagram of the HydroPowerPlant class, whose instances can be associated with no more than one Reservoir instance
3.3 UML diagram of the class MyASTVisitor
3.4 Section of the collaboration diagram for BatteryStorage generated by Doxygen on the automatically adapted CIM C++ codebase. The entire diagram can be found in [FEIb]
4.1 Template engine example with HTML code
4.2 Overall concept of the CIMverter project
4.3 Mapping at second level between CIM and Modelica objects
4.4 Connections with zero, one, and two middle points between the endpoints. The endpoints are marked with circles
4.5 Medium-voltage benchmark grid [Rud+06] converted from CIM to a system model in Modelica based on the ModPowerSystems and PowerSystems library


5.1 Sparsity patterns of benchmark matrices
5.2 Total (preprocessing + factorization) times
5.3 Preprocessing times
5.4 Factorization times
5.5 Execution times on generic vs. RT kernel
5.6 (Re-)factorization times
5.7 NICSLU's scaling over multiple threads (T)
5.8 Basker's scaling over multiple threads (T)
5.9 Total times with different preorderings
5.10 Factorization times with different preorderings
6.1 Categories of parallel task scheduling
6.2 Example task graph
6.3 Example task graph including levels
6.4 Schedule for task graph in Fig. 6.2 with p = 2 using level scheduling
6.5 Schedule for task graph in Fig. 6.2 with p = 2 using level scheduling considering execution times
6.6 Example task graph including b-levels
6.7 Schedule for task graph in Fig. 6.2 with p = 2 using Highest Level First with Estimated Times (HLFET)
6.8 Example circuit
6.9 Task graph resulting from Fig. 6.8
6.10 Western System Coordinating Council (WSCC) 9-bus transmission benchmark network
6.11 Schematic representation of the connections between system copies
6.12 Performance comparison of schedulers for the WSCC 9-bus system
6.13 Performance comparison of schedulers for 20 copies of the WSCC 9-bus system
6.14 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system
6.15 Task graph for simulation of the WSCC 9-bus system
6.16 Performance for a varying number of copies of the WSCC 9-bus system using the decoupled line model
6.17 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using the decoupled line model with 8 threads
6.18 Performance for a varying number of copies of the WSCC 9-bus system using diakoptics


6.19 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using diakoptics with 8 threads
6.20 Performance comparison of compilers for 20 copies of the WSCC 9-bus system
7.1 NumPy ndarray vs. Python list [Van]
7.2 Software architecture of CPython (python command)
7.3 Software architecture of PyPy (pypy command)
7.4 Numba compilation stages
7.5 Comparison of Cython with other programming languages
7.6 Cython's workflow for Python module building [Dav]
7.7 Execution times for Quicksort
7.8 Memory consumption (maximum heap peak) for Quicksort
7.9 Memory consumption (maximum heap peak) for Quicksort of selected runtime environments
7.10 Execution times for PI calculations with multiple threads
8.1 Overview of the VILLASframework
8.2 Network stack of the InfiniBand Architecture (IBA)
8.3 InfiniBand Architecture (IBA) model
8.4 InfiniBand data transmission example
8.5 An overview of the OFED stack
8.6 An example super-node with three paths connecting five nodes of different node-types
8.7 General read function working principle in VILLAS
8.8 InfiniBand node-type read function working principle
8.9 VILLASnode state diagram with newly implemented states
8.10 Components of InfiniBand node-type
8.11 Computer system with NUMA architecture used for measurements
8.12 VILLASnode node-type benchmark working principle
8.13 Median latencies t_lat over various sample rates
8.14 Median latencies t_lat over various sample sizes
8.15 Median latencies t_lat over various sample rates and sample sizes
8.16 Median latencies t_lat of IB vs. shmem node-type over various sample rates
8.17 Median latencies of nanomsg and zeromq node-type over various sample rates via loopback (lo) and physical link
8.18 Median latencies of nanomsg and zeromq via physical links as well as IB and shmem node-type over various sample rates


B.1 Execution times for AVL Tree Insertion
B.2 Execution times for Dijkstra
B.3 Execution times for Gauss-Jordan Elimination
B.4 Memory consumption (maximum heap peak) for Cholesky
B.5 Memory consumption (maximum heap peak) for Gauss-Jordan Elimination
B.6 Memory consumption (maximum heap peak) for Gauss-Jordan Elimination of selected runtime environments
B.7 Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication
B.8 Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication of selected runtime environments


List of Tables

4.1 CIM PowerTransformer to Modelica Workshop Transformer mapping
4.2 Excerpt of further important mappings from CIM to ModPowerSystems as implemented in the Modelica Workshop
4.3 Excerpt from the numerical results for node phase-to-phase voltage magnitude and angle regarding the medium-voltage benchmark grid
5.1 Characteristics of squared matrices with size N × N, K nodes, sorted by nonzeros NNZ, and with density factor d = NNZ/(N·N) in %
5.2 Total execution times and numbers C of calls of the corresponding routines within the fixed time step solver, with Jacobian J_F and residual function vector F
5.3 Accumulated execution times for the listed steps of the variable time step solver, with D LU decompositions and a factorization ratio f = #Fact./#Refact.
6.1 Overview of the implemented schedulers
6.2 Overview of the tested compilers


Bibliography

[Abu+18] A. Abusalah et al. “CPU based parallel computation of electromagnetic transients for large power grids”. In: Electric Power Systems Research 162 (Sept. 2018), pp. 57–63.

[ACD74] T. L. Adam, K. M. Chandy, and J. Dickson. “A comparison of list schedules for parallel processing systems”. In: Communications of the ACM 17.12 (1974), pp. 685–690.

[ADD04] P. R. Amestoy, T. A. Davis, and I. S. Duff. “Algorithm 837: AMD, an Approximate Minimum Degree Ordering Algorithm”. In: ACM Trans. Math. Softw. 30.3 (Sept. 2004), pp. 381–388. issn: 0098-3500. doi: 10.1145/1024074.1024081.

[Adr19] Adrien Guironnet. GitHub - dynawo/dynawo. 2019. url: https://github.com/dynawo/dynawo (visited on 10/21/2019).

[AH11] D. Allemang and J. Hendler. Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL. Elsevier, 2011.

[Aho03] A. Aho. Compilers: Principles, Techniques and Tools (for Anna University), 2/e. Pearson Education, 2003. isbn: 978-8-13176-234-9.

[AIA19] AIAitesla. GitHub - itesla/ipsl. 2019. url: https://github.com/itesla/ipsl (visited on 10/21/2019).

[Åke+10] J. Åkesson et al. “Modeling and optimization with Optimica and JModelica.org – Languages and tools for solving large-scale dynamic optimization problems”. In: Computers & Chemical Engineering 34.11 (2010), pp. 1737–1749.

[Ale01] A. Alexandrescu. Modern C++ design: generic programming and design patterns applied. Addison-Wesley, 2001.


[Anaa] Anaconda, Inc. Notes on Numba Runtime. url: http://numba.pydata.org/numba-doc/dev/developer/numba-runtime.html (visited on 02/09/2020).

[Anab] Anaconda, Inc. Numba architecture. url: http://numba.pydata.org/numba-doc/dev/developer/architecture.html (visited on 02/10/2020).

[Anac] Anaconda, Inc. Numba: Compilation Options. url: http://numba.pydata.org/numba-doc/latest/user/jit.html#compilation-options (visited on 02/09/2020).

[Anad] Anaconda, Inc. Numba: Just-in-Time compilation. url: http://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html (visited on 02/09/2020).

[Anae] Anaconda, Inc. Numba: LoopJitting. url: http://numba.pydata.org/numba-doc/dev/developer/numba-runtime.html (visited on 02/09/2020).

[Anaf] Anaconda, Inc. Numba: Numbers. url: http://numba.pydata.org/numba-doc/latest/reference/types.html#numbers (visited on 02/09/2020).

[Anag] Anaconda, Inc. Numba: Supported NumPy features. url: http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html (visited on 02/10/2020).

[Anah] Anaconda, Inc. Numba: Supported Python features. url: http://numba.pydata.org/numba-doc/dev/reference/pysupported.html (visited on 02/10/2020).

[Anai] Anaconda, Inc. Numba: Why my loop is not vectorized? url: http://numba.pydata.org/numba-doc/dev/user/faq.html#does-numba-vectorize-array-computations-simd (visited on 02/10/2020).

[Anaj] Anaconda, Inc. Numba: Why my loop is not vectorized? url: http://numba.pydata.org/numba-doc/0.30.1/reference/envvars.html#compilation-options (visited on 02/10/2020).

[Apa] Apache Jena. Apache Jena - Home. url: http://jena.apache.org (visited on 12/23/2019).

[Aro06] P. Aronsson. “Automatic Parallelization of Equation-Based Simulation Programs”. PhD thesis. Institutionen för datavetenskap, 2006.


[BCP96] K. Brenan, S. Campbell, and L. Petzold. Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 1996. isbn: 9780898713534.

[BDD91] T. Berry, A. Daniels, and R. Dunn. “Real time simulation of power system transient behaviour”. In: 1991 Third International Conference on Power System Monitoring and Control. IET. 1991, pp. 122–127.

[Bea] D. Beazley. Understanding the Python GIL. url: http://www.dabeaz.com/python/UnderstandingGIL.pdf (visited on 02/09/2020).

[Bec] Beckett, Dave. Redland RDF Libraries. url: http://librdf.org (visited on 12/23/2019).

[Bec10] D. Becker. “Harmonizing the International Electrotechnical Commission Common Information Model (CIM) and 61850”. In: Electric Power Research Institute (EPRI), Tech. Rep 1020098 (2010).

[Beha] S. Behnel. Limitations – Cython 3.0a0 documentation. url: http://www.behnel.de/cython200910/talk.html (visited on 02/10/2020).

[Behb] S. Behnel. Using Parallelism – Cython 3.0a0 documentation. url: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html (visited on 02/10/2020).

[Behc] S. Behnel. Using the Cython Compiler to write fast Python code. url: http://www.behnel.de/cython200910/talk.html (visited on 02/10/2020).

[Beh+11] S. Behnel et al. “Cython: The Best of Both Worlds”. In: Computing in Science & Engineering 13.2 (2011), p. 31.

[Ben] J. Bennett. An introduction to Python bytecode. url: https://opensource.com/article/18/4/introduction-python-bytecode (visited on 02/09/2020).

[BL15] M. Barros and Y. Labiche. Search-Based Software Engineering: 7th International Symposium, SSBSE 2015, Bergamo, Italy, September 5-7, 2015, Proceedings. Lecture Notes in Computer Science. Springer International Publishing, 2015. isbn: 9783319221830.


[Bol+09] C. F. Bolz et al. “Tracing the Meta-Level: PyPy’s Trac-ing JIT Compiler”. In: Proceedings of the 4th Workshop onthe Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems. ICOOOLPS’09. Genova, Italy: Association for Computing Machinery,2009, pp. 18–25. isbn: 9781605585413.

[Bos11] P. Bose. “Power Wall”. In: Encyclopedia of Parallel Computing.Ed. by D. Padua. Boston, MA: Springer US, 2011, pp. 1593–1608. isbn: 978-0-387-09766-4. url: https://doi.org/10.1007/978-0-387-09766-4_499 (visited on 12/24/2019).

[Bou13] J.-L. Boulanger. Static Analysis of Software: The AbstractInterpretation. John Wiley & Sons, 2013.

[Bra+97] T. Bray et al. “Extensible Markup Language (XML)”. In:World Wide Web Journal 2.4 (1997), pp. 27–66.

[Bre12] E. Bressert. SciPy and NumPy: An Overview for Developers. O’Reilly Media, 2012. isbn: 9781449361631. url: https://books.google.de/books?id=c-xzkDMDev0C.

[BRT16] J. D. Booth, S. Rajamanickam, and H. Thornquist. “Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts”. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). May 2016, pp. 673–682. doi: 10.1109/IPDPSW.2016.92.

[BS14a] B. Buchholz and Z. Styczynski. Smart Grids: Grundlagen und Technologien der elektrischen Netze der Zukunft. VDE-Verlag, 2014. isbn: 9783800735624.

[BS14b] B. M. Buchholz and Z. Styczynski. Smart grids – fundamentals and technologies in electricity networks. Vol. 396. Springer, 2014.

[Büt16] F. Bütow. Zeitgeistwandel: Vom Aufbruch der Neuzeit zum Aufbruch ins planetarische Zeitalter. Books on Demand, 2016. isbn: 9783734741074.

[BW12] A. Brown and G. Wilson. PyPy. The Architecture of Open Source Applications. Creative Commons, 2012. isbn: 978-1-10557-181-7. url: http://aosabook.org/en/pypy.html (visited on 02/09/2020).


[Cao+15] J. Cao et al. “A flexible model transformation to link BIM with different Modelica libraries for building energy performance simulation”. In: Proceedings of the 14th IBPSA Conference. 2015.

[Cara] A. Carattino. Mutable and Immutable Objects. url: https://www.pythonforthelab.com/blog/mutable-and-immutable-objects/ (visited on 02/09/2020).

[Carb] C. Carey. Why Python is Slow: Looking Under the Hood | Pythonic Perambulations. url: https://github.com/cython/cython/wiki/enhancements-compilerdirectives (visited on 02/10/2020).

[Cas13] F. Casella. “A Strategy for Parallel Simulation of Declarative Object-Oriented Models of Generalized Physical Networks”. In: Proceedings of the 5th International Workshop on Equation-Based Object-Oriented Modeling Languages and Tools; April 19; University of Nottingham; Nottingham; UK. 084. Linköping University Electronic Press. 2013, pp. 45–51.

[Cas19] S. Cass. “The 2018 Top Programming Languages”. In: IEEE Spectrum (2019). url: https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages (visited on 12/10/2019).

[CDZ05] D. Crupnicoff, S. Das, and E. Zahavi. Deploying Quality of Service and Congestion Control in InfiniBand-based Data Center Networks. Tech. rep. 2379. 2005.

[Cha+08] B. Chapman et al. Using OpenMP: Portable Shared Memory Parallel Programming. Scientific Computation Series. Books24x7.com, 2008. isbn: 9780262533027.

[Che+15] X. Chen et al. “GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling”. In: IEEE Transactions on Parallel and Distributed Systems 26.3 (Mar. 2015), pp. 786–795. doi: 10.1109/tpds.2014.2312199.

[Chu01] W. Chun. Core Python Programming. Vol. 1. Prentice Hall Professional, 2001.

[CIM] CIM User Group. Home - CIMug. url: http://cimug.ucaiug.org (visited on 12/22/2019).

[Cla] Clang community. Clang C Language Family Frontend for LLVM. url: https://clang.llvm.org (visited on 12/22/2019).


[Coh] O. Cohen. Is your Numpy optimized for speed? url: https://towardsdatascience.com/is-your-numpy-optimized-for-speed-c1d2b2ba515 (visited on 02/09/2020).

[Com97] Compaq, Intel, Microsoft. Virtual Interface Architecture Specification. Version 1.0. Compaq, Intel, Microsoft. Dec. 1997.

[Con07] 110th United States Congress. Energy Independence and Security Act of 2007. 2007. url: https://www.govinfo.gov/content/pkg/PLAW-110publ140/html/PLAW-110publ140.htm (visited on 10/21/2019).

[Cor+01] T. Cormen et al. Introduction To Algorithms. MIT Press, 2001. isbn: 9780262032933.

[CRS+11] CRSA et al. D4.1: Algorithmic requirements for simulation of large network extreme scenarios. Tech. rep. PEGASE Consortium, 2011.

[Cum] M. Cumming. libxml++ – An XML Parser for C++. url: http://libxmlplusplus.sourceforge.net (visited on 12/23/2019).

[Cun] A. Cuni. PyPy Status Blog. url: https://morepypy.blogspot.com/2018/09/inside-cpyext-why-emulating-cpython-c.html (visited on 02/10/2020).

[Cun10] A. Cuni. “High performance implementation of Python for CLI/.NET with JIT compiler generation for dynamic languages”. PhD thesis. Dipartimento di Informatica e Scienze dell’Informazione, 2010.

[CWY12] X. Chen, Y. Wang, and H. Yang. “An Adaptive LU Factorization Algorithm for Parallel Circuit Simulation”. In: 17th Asia and South Pacific Design Automation Conference. Jan. 2012, pp. 359–364. doi: 10.1109/ASPDAC.2012.6164974.

[CWY13] X. Chen, Y. Wang, and H. Yang. “NICSLU: An Adaptive Sparse Matrix Solver for Parallel Circuit Simulation”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32.2 (Feb. 2013), pp. 261–274. doi: 10.1109/tcad.2012.2217964.

[Cyt] Cython community. Cython: C-Extensions for Python. url: https://cython.org (visited on 12/11/2019).

[Dal] L. Dalcin. MPI for Python – MPI for Python 3.0.3 documentation. url: https://mpi4py.readthedocs.io/en/stable/ (visited on 02/10/2020).


[Dav] DavidBrooksPokorny. Cython. url: https://en.wikipedia.org/wiki/Cython#/media/File:Cython_CPython_Ext_Module_Workflow.png (visited on 02/10/2020).

[Dav+04] T. A. Davis et al. “Algorithm 836: COLAMD, a Column Approximate Minimum Degree Ordering Algorithm”. In: ACM Trans. Math. Softw. 30.3 (Sept. 2004), pp. 377–380. issn: 0098-3500.

[Dav03] D. S. Frankel. Model driven architecture: applying MDA to enterprise computing. 2003.

[Daw] B. Dawes. Filesystem Home - Boost.org. url: http://www.boost.org/libs/filesystem (visited on 12/23/2019).

[Deba] Debian Wiki team. Hugepages – Debian Wiki. url: https://wiki.debian.org/Hugepages (visited on 02/14/2020).

[Debb] A. Debrie. Python Garbage Collection: What It Is and How It Works. url: https://stackify.com/python-garbage-collection/ (visited on 02/09/2020).

[Die07] S. Diehl. Software visualization: visualizing the structure, behaviour, and evolution of software. Springer Science & Business Media, 2007.

[Dig] Digi International Inc. Python garbage collection. url: https://www.digi.com/resources/documentation/digidocs/90001537/references/r_python_garbage_coll.htm (visited on 02/09/2020).

[Din+18] J. Dinkelbach et al. “Hosting Capacity Improvement Unlocked by Control Strategies for Photovoltaic and Battery Storage Systems”. In: 2018 Power Systems Computation Conference (PSCC). IEEE. 2018, pp. 1–7.

[DK12] P. Dutta and M. Kezunovic. “Unified representation of data and model for sparse measurement based fault location”. In: Power and Energy Society General Meeting, 2012 IEEE. IEEE. 2012, pp. 1–8.

[DM95] F.-N. Demers and J. Malenfant. “Reflection in logic, functional and object-oriented programming: a short comparative study”. In: Proceedings of the IJCAI. Vol. 95. 1995, pp. 29–38.

[DMS14] E. B. Duffy, B. A. Malloy, and S. Schaub. “Exploiting the Clang AST for Analysis of C++ Applications”. In: Proceedings of the 52nd Annual ACM Southeast Conference. 2014.


[DP10] T. A. Davis and E. Palamadai Natarajan. “Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems”. In: ACM Trans. Math. Softw. 37.3 (Sept. 2010), 36:1–36:17. issn: 0098-3500. doi: 10.1145/1824801.1824814.

[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Springer-Lehrbuch. Springer Berlin Heidelberg, 2008. isbn: 9783540764939.

[DRM20] S. Dähling, L. Razik, and A. Monti. “OWL2Go: Auto-generation of Go data models for OWL ontologies with integrated serialization and deserialization functionality”. In: To appear in SoftwareX (2020).

[Dun+98] D. Dunning et al. “The Virtual Interface Architecture”. In: IEEE Micro 18.2 (Mar. 1998), pp. 66–76. issn: 0272-1732.

[Eat19] J. Eaton. GNU Octave. 2019. url: https://www.gnu.org/software/octave (visited on 11/25/2019).

[ecm19] ecma International. Standard ECMA-404 – The JSON Data Interchange Syntax. 2019. url: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf (visited on 10/23/2019).

[ECT17] D. Efnusheva, A. Cholakoska, and A. Tentov. “A Survey of Different Approaches for Overcoming the Processor-Memory Bottleneck”. In: International Journal of Computer Science & Information Technology (2017). doi: 10.5121/ijcsit.2017.9214.

[EFF15] L. Exel, F. Felgner, and G. Frey. “Multi-domain modeling of distributed energy systems – The MOCES approach”. In: Smart Grid Communications (SmartGridComm), 2015 IEEE International Conference on. IEEE. 2015, pp. 774–779.

[Eig19] Eigen Developers. Eigen. Aug. 2019. url: http://eigen.tuxfamily.org (visited on 10/21/2019).

[Ell+01] J. Ellson et al. “Graphviz—open source graph drawing tools”. In: International Symposium on Graph Drawing. Springer. 2001, pp. 483–484.

[ENT] ENTSO-E. Common Information Model (CIM) – Model Exchange Profile 1. url: https://docstore.entsoe.eu/Documents/CIM_documents/Grid_Model_CIM/140610_ENTSO-E_CIM_Profile_v1_UpdateIOP2013.pdf (visited on 05/13/2018).


[ENT16] ENTSO-E. Common Grid Model Exchange Specification (CGMES) – Version 2.5. 2016. url: https://docstore.entsoe.eu/Documents/CIM_documents/IOP/CGMES_2_5_TechnicalSpecification_61970-600_Part%201_Ed2.pdf (visited on 11/21/2019).

[Eur] European Grid Infrastructure community. Glossary V1 - EGI-Wiki. url: https://wiki.egi.eu/wiki/Glossary_V1#High_Throughput_Computing (visited on 12/24/2019).

[Fab+11] D. Fabozzi et al. “On simplified handling of state events in time-domain simulation”. In: Proc. of the 17th Power System Computation Conference PSCC. 2011.

[Far+15] M. O. Faruque et al. “Real-Time Simulation Technologies for Power Systems Design, Testing, and Analysis”. In: IEEE Power and Energy Technology Systems Journal 2.2 (2015), pp. 63–73.

[FC09] D. Fabozzi and T. V. Cutsem. “Simplified time-domain simulation of detailed long-term dynamic models”. In: 2009 IEEE Power & Energy Society General Meeting. IEEE, July 2009.

[FEIa] FEIN Aachen e. V. DistAIX – Scalable simulation of cyber-physical power distribution systems. url: https://fein-aachen.org/en/projects/distaix/ (visited on 12/26/2019).

[FEIb] FEIN Aachen e. V. Doxygen generated webpages of CIM++ Adapted CIM_SINERGIEN Codebase: BatteryStorage Class Reference. url: https://cim.fein-aachen.org/libcimpp/doc/IEC61970_16v29a_SINERGIEN_20170324//classSinergien_1_1EnergyGrid_1_1EnergyStorage_1_1BatteryStorage.html (visited on 12/23/2019).

[FEIc] FEIN Aachen e. V. Doxygen generated webpages of CIM++ Adapted CIM_SINERGIEN Codebase: PowerTransformer Class Reference. url: http://cim.fein-aachen.org/libcimpp/doc/IEC61970_16v29a_IEC61968_12v08/classIEC61970_1_1Base_1_1Wires_1_1PowerTransformer.html (visited on 05/31/2018).

[FEId] FEIN Aachen e. V. Doxygen generated webpages of CIM++ Adapted CIM_SINERGIEN Codebase. url: https://cim.fein-aachen.org/libcimpp/doc/IEC61970_16v29a_SINERGIEN_20170324/ (visited on 12/23/2019).


[FEIe] FEIN Aachen e. V. IEC61970 16v29a - IEC61968 12v08: Class List. url: https://cim.fein-aachen.org/libcimpp/doc/IEC61970_16v29a_IEC61968_12v08/annotated.html (visited on 12/26/2019).

[FEIf] FEIN Aachen e. V. VILLAS. url: https://villas.fein-aachen.org/website (visited on 02/14/2020).

[FEIg] FEIN Aachen e. V. VILLASframework: Node-types. url: https://villas.fein-aachen.org/doc/node-types.html (visited on 02/14/2020).

[FEI19a] FEIN Aachen e.V. CIM++. 2019. url: https://www.fein-aachen.org/projects/modpowersystems (visited on 11/21/2019).

[FEI19b] FEIN Aachen e.V. ModPowerSystems. 2019. url: https://www.fein-aachen.org/projects/modpowersystems (visited on 10/21/2019).

[Fin+08] R. Finocchiaro et al. “ETHOS, a generic Ethernet over Sockets Driver for Linux”. In: Proceedings of the 20th IASTED International Conference. Vol. 631. 017. 2008, p. 239.

[Fin+09a] R. Finocchiaro et al. “ETHOM, an Ethernet over SCI and DX Driver for Linux”. In: Proceedings of 2009 International Conference of Parallel and Distributed Computing (ICPDC 2009), London, UK. 2009.

[Fin+09b] R. Finocchiaro et al. “Low-Latency Linux Drivers for Ethernet over High-Speed Networks”. In: IAENG International Journal of Computer Science 36.4 (2009).

[Fin+10] R. Finocchiaro et al. “Transparent Integration of a Low-Latency Linux Driver for Dolphin SCI and DX”. In: Electronic Engineering and Computing Technology. Ed. by S.-I. Ao and L. Gelman. Dordrecht: Springer Netherlands, 2010, pp. 539–549. isbn: 978-90-481-8776-8. doi: 10.1007/978-90-481-8776-8_46.

[FM09] M. Foord and C. Muirhead. IronPython in Action. Manning Pubs Co Series. Manning, 2009. isbn: 9781933988337. url: http://www.voidspace.org.uk/python/articles/duck_typing.shtml#duck-typing (visited on 02/09/2020).

[FO08] H.-r. Fang and D. P. O’Leary. “Modified Cholesky algorithms: a catalog with new approaches”. In: Mathematical Programming 115.2 (Oct. 2008), pp. 319–349. issn: 1436-4646.


[Fou+07] L. Fousse et al. “MPFR: A Multiple-Precision Binary Floating-Point Library With Correct Rounding”. In: ACM Transactions on Mathematical Software (TOMS) 33.2 (2007), p. 13.

[Fow02] M. Fowler. Patterns of Enterprise Application Architecture. Addison-Wesley Longman Publishing Co., Inc., 2002.

[Fra19] Fraunhofer IEE and University of Kassel. pandapower. 2019. url: https://pandapower.readthedocs.io (visited on 11/25/2019).

[Fre+09] J. Fremont et al. “CIM extensions for ERDF information system projects”. In: Power & Energy Society General Meeting, 2009. PES ’09. IEEE. 2009, pp. 1–5.

[Fri+06] P. Fritzson et al. “OpenModelica – A free open-source environment for system modeling, simulation, and teaching”. In: Computer Aided Control System Design, 2006 IEEE International Conference on Control Applications, 2006 IEEE International Symposium on Intelligent Control, 2006 IEEE. IEEE. 2006, pp. 1588–1595.

[Fri15a] P. Fritzson. Principles of Object-Oriented Modeling and Simulation with Modelica 3.3: A Cyber-Physical Approach. Wiley, 2015. isbn: 9781118859162.

[Fri15b] P. A. Fritzson. Principles of object oriented modeling and simulation with Modelica 3.3. 2nd ed. Hoboken: John Wiley & Sons, 2015.

[Fri16] J. Friesen. Java XML and JSON. Apress, 2016.

[FW14] R. Franke and H. Wiesmann. “Flexible modeling of electrical power systems – the Modelica PowerSystems library”. In: Proceedings of the 10th International Modelica Conference; March 10-12; 2014; Lund; Sweden. 096. Linköping University Electronic Press. 2014, pp. 515–522.

[Gag+19] F. Gagliardi et al. “The international race towards Exascale in Europe”. In: CCF Transactions on High Performance Computing (2019), pp. 1–11.

[GCC] GCC team. GCC, the GNU Compiler Collection. url: https://gcc.gnu.org (visited on 12/22/2019).

[GDD+06] D. Gašević, D. Djurić, and V. Devedžić. Model Driven Architecture and Ontology Development. Springer Science & Business Media, 2006.


[Geb+12] M. Gebremedhin et al. “A Data-Parallel Algorithmic Modelica Extension for Efficient Execution on Multi-Core Platforms”. In: Proceedings of the 9th International MODELICA Conference; September 3-5; 2012; Munich; Germany. 76. Linköping University Electronic Press; Linköpings universitet, 2012, pp. 393–404.

[Geo73] A. George. “Nested Dissection of a Regular Finite Element Mesh”. In: SIAM Journal on Numerical Analysis 10.2 (1973), pp. 345–363.

[Ger] German Aerospace Center (DLR). DLR – Simulation and Software Technology – 8th Workshop on Python for High-Performance and Scientific Computing. url: https://www.dlr.de/sc/en/desktopdefault.aspx/tabid-12954/22625_read-52397/ (visited on 12/11/2019).

[Gir] C. Giridhar. Understanding Python GIL. url: https://callhub.io/understanding-python-gil/ (visited on 02/09/2020).

[Glo] A. Gloubin. Garbage collection in Python: things you need to know. url: https://rushter.com/blog/python-garbage-collector (visited on 02/09/2020).

[Gra69] R. L. Graham. “Bounds on multiprocessing timing anomalies”. In: SIAM Journal on Applied Mathematics 17.2 (1969), pp. 416–429.

[Gre+16] F. Gremse et al. “GPU-accelerated adjoint algorithmic differentiation”. In: Computer Physics Communications 200 (2016), pp. 300–311. issn: 0010-4655. doi: 10.1016/j.cpc.2015.10.027.

[Gui+18] A. Guironnet et al. “Towards an open-source solution using Modelica for time-domain simulation of power systems”. In: Proc. 8th IEEE PES ISGT Europe. Sarajevo, Bosnia and Herzegovina, Oct. 2018.

[Haq+11] E. Haq et al. “Use of Common Information Model (CIM) in electricity market at California ISO”. In: Power and Energy Society General Meeting, 2011 IEEE. IEEE. 2011, pp. 1–6.

[Har+12] W. E. Hart et al. Pyomo – Optimization Modeling in Python. Vol. 67. Springer, 2012.

[Hee] D. van Heesch. Doxygen: Main Page. url: http://www.doxygen.org (visited on 12/23/2019).


[Heg+01] P. Heggernes et al. The Computational Complexity of the Minimum Degree Algorithm. Tech. rep. Lawrence Livermore National Lab., CA (US), 2001.

[Hen] K. Henney. Chapter 5. Boost.Any. url: http://www.boost.org/doc/libs/release/libs/any (visited on 12/23/2019).

[Hig] J. Higgins. Arabica. url: https://github.com/RWTH-ACS/arabica (visited on 12/23/2019).

[Hin+05] A. C. Hindmarsh et al. “SUNDIALS: Suite of Nonlinear and Differential/Algebraic Equation Solvers”. In: ACM Trans. Math. Softw. 31.3 (Sept. 2005), pp. 363–396. issn: 0098-3500. doi: 10.1145/1089014.1089020.

[Hop+06] K. Hopkinson et al. “EPOCHS: a platform for agent-based electric power and communication simulation built from commercial off-the-shelf components”. In: IEEE Transactions on Power Systems 21.2 (2006), pp. 548–558.

[HR07] S. C. Haw and G. R. K. Rao. “A Comparative Study and Benchmarking on XML Parsers”. In: Advanced Communication Technology, The 9th International Conference on. Vol. 1. IEEE. 2007, pp. 321–325.

[HSC19] A. C. Hindmarsh, R. Serban, and A. Collier. User Documentation for IDA v4.1.0. 2019. url: https://computing.llnl.gov/sites/default/files/public/ida_guide.pdf (visited on 10/21/2019).

[IEC] IEC. IEC Smart Grid - IEC Standards. url: http://www.iec.ch/smartgrid/standards (visited on 12/22/2019).

[IEC06] IEC. IEC 61970-501:2006 Energy management system application program interface (EMS-API) – Part 501: Common Information Model Resource Description Framework (CIM RDF) schema. 2006.

[IEC12a] IEC. IEC 61968-11:2013 Application integration at electric utilities - System interfaces for distribution management – Part 11: Common information model (CIM) extensions for distribution. 2012.

[IEC12b] IEC. IEC 61970-301:2012 Energy management system application program interface (EMS-API) – Part 301: Common Information Model (CIM) base. 2012.


[IEC14] IEC. IEC 62325-301:2014 Framework for energy market communications – Part 301: Common information model (CIM) extensions for markets. 2014.

[IEC16a] IEC. IEC 61970-552:2016 Energy management system application program interface (EMS-API) - Part 552: CIMXML Model exchange format. 2016.

[IEC16b] IEC. IEC/TR 62357-1:2016 Power systems management and associated information exchange - Part 1: Reference architecture. 2016.

[IEC17] IEC. IEC TS 62361-102 ED1 Power systems management and associated information exchange - Interoperability in the long term - Part 102: CIM - IEC 61850 harmonization. 2017.

[IEE18] IEEE and The Open Group. The Open Group Base Specifications Issue 7 – IEEE Std 1003.1, 2018 Edition. New York, NY, USA: IEEE, 2018. url: http://pubs.opengroup.org/onlinepubs/9699919799.

[Inf07] InfiniBand Trade Association. InfiniBand Architecture Specification, Volume 1. Release 1.2.1. InfiniBand Trade Association et al. Nov. 2007.

[Inf16] InfiniBand Trade Association. InfiniBand Architecture Specification Volume 2. Release 1.3.1. InfiniBand Trade Association et al. Nov. 2016.

[Int] Intel Corporation. Intel C++ Compiler. url: https://software.intel.com/en-us/c-compilers (visited on 12/22/2019).

[ISO14] ISO. ISO/IEC JTC 1/SC 22/WG 21 N4100 Programming Languages – C++ – File System Technical Specification. 2014.

[Jos12] N. Josuttis. The C++ Standard Library: A Tutorial and Reference. Addison-Wesley, 2012. isbn: 9780321623218.

[KA99] Y.-K. Kwok and I. Ahmad. “Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors”. In: ACM Comput. Surv. 31.4 (Dec. 1999), pp. 406–471. issn: 0360-0300.

[Kas17] S. Kaster. Runtime Analysis of Python Programs. 2017.

[Ker] Kernel development community. Networking – The Linux Kernel documentation. url: https://linux-kernel-labs.github.io/master/labs/networking.html (visited on 02/14/2020).


[Ker10] M. Kerrisk. The Linux Programming Interface: A Linux and UNIX System Programming Handbook. No Starch Press, 2010. isbn: 978-1-59327-220-3.

[KH02] J. Kovse and T. Härder. “Generic XMI-based UML model transformations”. In: Object-Oriented Information Systems (2002), pp. 183–190.

[KH14] G. Krüger and H. Hansen. Java-Programmierung – Das Handbuch zu Java 8. O’Reilly Germany, 2014.

[Kha+18] S. Khayyamim et al. “Railway System Energy Management Optimization Demonstrated at Offline and Online Case Studies”. In: IEEE Transactions on Intelligent Transportation Systems 19.11 (Nov. 2018), pp. 3570–3583. issn: 1524-9050. doi: 10.1109/TITS.2018.2855748.

[KK04] W. Kocay and D. Kreher. Graphs, Algorithms, and Optimization. Discrete Mathematics and Its Applications. CRC Press, 2004. isbn: 978-0-20348-905-5.

[KK95] G. Karypis and V. Kumar. METIS – Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0. Tech. rep. University of Minnesota, Department of Computer Science, 1995.

[Kle] B. Klein. Python3-Tutorial: Parameterübergabe. url: https://www.python-kurs.eu/python3_parameter.php (visited on 02/09/2020).

[KMS92] M. S. Khaira, G. L. Miller, and T. J. Sheffler. Nested Dissection: A survey and comparison of various nested dissection algorithms. Carnegie-Mellon University. Department of Computer Science, 1992.

[Kol+02] R. Kollmann et al. “A Study on the Current State of the Art in Tool-Supported UML-Based Static Reverse Engineering”. In: Reverse Engineering, 2002. Proceedings. Ninth Working Conference on. IEEE. 2002, pp. 22–32.

[Kol+18] S. Kolen et al. “Enabling the Analysis of Emergent Behavior in Future Electrical Distribution Systems Using Agent-Based Modeling and Simulation”. In: Complexity 2018 (2018).

[Kor09] R. E. Korf. “Multi-Way Number Partitioning”. In: Twenty-First International Joint Conference on Artificial Intelligence. 2009.


[KWS07] U. Kastens, W. M. Waite, and A. M. Sloane. Generating Software from Specifications. Jones & Bartlett Learning, 2007.

[Lar+09] S. Larsen et al. “Architectural breakdown of end-to-end latency in a TCP/IP network”. In: International Journal of Parallel Programming 37.6 (Dec. 2009), pp. 556–571. issn: 1573-7640. doi: 10.1007/s10766-009-0109-6.

[LE] T. Lefebvre and H. Englert. IEC TC57 Power system management and associated information exchange. url: https://www.iec.ch/resources/tcdash/Poster_IEC_TC57.pdf (visited on 02/22/2020).

[Lee+15] B. Lee et al. “Unifying data types of IEC 61850 and CIM”. In: IEEE Transactions on Power Systems 30.1 (2015), pp. 448–456.

[Li+14] W. Li et al. “Cosimulation for Smart Grid Communications”. In: IEEE Transactions on Industrial Informatics 10.4 (2014), pp. 2374–2384.

[Lin] R. Lincoln. PyCIM – Python implementation of the Common Information Model. url: https://github.com/rwl/pycim (visited on 12/23/2019).

[Lin+12] H. Lin et al. “GECO: Global event-driven co-simulation framework for interconnected power system and communication network”. In: IEEE Transactions on Smart Grid 3.3 (2012), pp. 1444–1456.

[Lin19a] R. Lincoln. PYPOWER. 2019. url: https://pypi.org/project/PYPOWER/ (visited on 11/25/2019).

[Lin19b] R.-T. Linux. realtime:start [Wiki]. 2019. url: https://wiki.linuxfoundation.org/realtime/start (visited on 10/21/2019).

[LK17] B. Lee and D.-K. Kim. “Harmonizing IEC 61850 and CIM for connectivity of substation automation”. In: Computer Standards & Interfaces 50 (2017), pp. 199–208.

[LLV] LLVM Foundation. The LLVM Compiler Infrastructure Project. url: http://www.llvm.org (visited on 12/23/2019).

[LPS15] S. K. Lam, A. Pitrou, and S. Seibert. “Numba: a LLVM-based Python JIT compiler”. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. ACM. 2015, p. 7.


[Lun+09] H. Lundvall et al. “Automatic Parallelization of Simulation Code for Equation-based Models with Software Pipelining and Measurements on Three Platforms”. In: SIGARCH Comput. Archit. News 36.5 (June 2009), pp. 46–55. issn: 0163-5964.

[Mae12] K. Maeda. “Performance Evaluation of Object Serialization Libraries in XML, JSON and Binary Formats”. In: Digital Information and Communication Technology and it’s Applications (DICTAP), 2012 Second International Conference on. IEEE. 2012, pp. 177–182.

[Man] Man-Pages Authors. memusage(1) - Linux manual page. url: http://man7.org/linux/man-pages/man1/memusage.1.html (visited on 02/10/2020).

[MAT19] MATPOWER Developers. MATPOWER. 2019. url: https://matpower.org (visited on 11/25/2019).

[McM07] A. W. McMorran. “An Introduction to IEC 61970-301 & 61968-11: The Common Information Model”. In: University of Strathclyde 93 (2007), p. 124.

[MDC09] A. Mercurio, A. Di Giorgio, and P. Cioci. “Open-Source Implementation of Monitoring and Controlling Services for EMS/SCADA Systems by Means of Web Services – IEC 61850 and IEC 61970 Standards”. In: IEEE Transactions on Power Delivery 24.3 (2009), pp. 1148–1153.

[Mel18] Mellanox Technologies. Mellanox OFED for Linux User Manual. 2877. Rev 4.3. Mellanox Technologies. Mar. 2018.

[Min] S. Mingshen. Getting Started. url: http://mesapy.org/rpython-by-example/getting-started/index.html (visited on 02/09/2020).

[Mir+17] M. Mirz et al. “Dynamic phasors to enable distributed real-time simulation”. In: 2017 6th International Conference on Clean Electrical Power (ICCEP). June 2017, pp. 139–144.

[Mir+18] M. Mirz et al. “A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid”. In: Hindawi Complexity 2018 (Feb. 2018). doi: 10.1155/2018/7154031.

[Mir+19] M. Mirz et al. “DPsim—A dynamic phasor real-time simulator for power systems”. In: SoftwareX 10 (2019), p. 100253. issn: 2352-7110. doi: 10.1016/j.softx.2019.100253. url: http://www.sciencedirect.com/science/article/pii/S2352711018302760.


[Mir20] M. Mirz. “A Dynamic Phasor Real-Time Simulation Based Digital Twin for Power Systems”. PhD thesis. RWTH Aachen University, 2020.

[MMS13] N. V. Mago, J. D. Moseley, and N. Sarma. “A methodology for modeling telemetry in power systems models using IEC-61968/61970”. In: Innovative Smart Grid Technologies-Asia (ISGT Asia), 2013 IEEE. IEEE. 2013, pp. 1–6.

[MNM16] M. Mirz, L. Netze, and A. Monti. “A multi-level approach to power system Modelica models”. In: Control and Modeling for Power Electronics (COMPEL), 2016 IEEE 17th Workshop on. IEEE. 2016, pp. 1–7.

[Mod] Modelica Association. Introduction – Modelica Language Specification 3.3 Revision 1. url: https://modelica.readthedocs.io/en/latest/introduction.html (visited on 12/26/2019).

[Mol+14] C. Molitor et al. “MESCOS – A Multienergy System Cosimulator for City District Energy Systems”. In: IEEE Transactions on Industrial Informatics 10.4 (2014), pp. 2247–2256.

[Mon] M. de Montaigne. Native datatypes – Dive Into Python 3. url: https://diveintopython3.net/native-datatypes.html (visited on 02/09/2020).

[Mon+18] A. Monti et al. “A Global Real-Time Superlab: Enabling High Penetration of Power Electronics in the Electric Grid”. In: IEEE Power Electronics Magazine 5.3 (Sept. 2018), pp. 35–44.

[MPT78] M. D. McIlroy, E. Pinson, and B. Tague. “UNIX Time-Sharing System: Foreword”. In: Bell Labs Technical Journal 57.6 (1978), pp. 1899–1904.

[MR12] P. MacArthur and R. D. Russell. “A Performance Study to Guide RDMA Programming Decisions”. In: High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012, pp. 778–785.

[Mül03] M. S. Müller. “An OpenMP Compiler Benchmark”. In: Scientific Programming 11.2 (2003), pp. 125–131.


[MV11] A. Meister and C. Vömel. Numerik linearer Gleichungssysteme: Eine Einführung in moderne Verfahren. Mit MATLAB®-Implementierungen von C. Vömel. Vieweg+Teubner Verlag, 2011. isbn: 9783834881007.

[Nic+00] U. A. Nickel et al. “Roundtrip engineering with FUJABA”. In: Proceedings of the 2nd Workshop on Software-Reengineering (WSR), August. Citeseer. 2000.

[Numa] Numba community. Numba: A High Performance Python Compiler. url: http://numba.pydata.org (visited on 12/11/2019).

[Numb] NumPy developers. NumPy. url: https://numpy.org (visited on 12/11/2019).

[Ope19a] OpenModelica Developers. Major OpenModelica Releases. 2019. url: https://www.openmodelica.org/doc/OpenModelicaUsersGuide/latest/tracreleases.html#release-notes-for-openmodelica-1-11-0 (visited on 10/21/2019).

[Ope19b] OpenMP Architecture Review Board. Home – OpenMP. 2019. url: https://www.openmp.org (visited on 10/21/2019).

[Pan09] J. Z. Pan. “Resource Description Framework”. In: Handbook on Ontologies. Springer, 2009, pp. 71–90.

[Par04] T. J. Parr. “Enforcing strict model-view separation in template engines”. In: Proceedings of the 13th international conference on World Wide Web. ACM. 2004, pp. 224–233.

[Pet82] L. Petzold. Description of DASSL: a differential/algebraic system solver. Tech. rep. Sandia National Labs., Livermore, CA (USA), Sept. 1982.

[Pfi01] G. F. Pfister. “An Introduction to the Infiniband Architecture”. In: High Performance Mass Storage and Parallel I/O 42 (2001), pp. 617–632.

[Pic+16] S. Pickartz et al. “Migrating LinuX Containers Using CRIU”. In: High Performance Computing. Ed. by M. Taufer, B. Mohr, and J. M. Kunkel. Cham: Springer International Publishing, 2016, pp. 674–684. isbn: 978-3-319-46079-6.

[Pot18] D. Potter. Implementation and Analysis of an InfiniBand based Communication in a Real-Time Co-Simulation Framework. 2018.


[Pra+11] Y. Pradeep et al. “CIM-Based Connectivity Model for Bus-Branch Topology Extraction and Exchange”. In: IEEE Transactions on Smart Grid 2.2 (June 2011), pp. 244–253. issn: 1949-3061. doi: 10.1109/TSG.2011.2109016.

[Pre12] J. Preshing. A Look Back at Single-Threaded CPU Performance. 2012. url: https://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance/ (visited on 10/21/2019).

[Pug16] J. F. Puget. A Speed Comparison Of C, Julia, Python, Numba, and Cython on LU Factorization. 2016. url: https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en (visited on 02/10/2020).

[PyPa] PyPy community. Bytecode Interpreter. url: http://doc.pypy.org/en/latest/interpreter.html#introduction-and-overview (visited on 02/09/2020).

[PyPb] PyPy community. Garbage Collection in PyPy. url: https://doc.pypy.org/en/release-2.4.x/garbage_collection.html (visited on 02/09/2020).

[PyPc] PyPy community. Goals and Architecture Overview. url: http://doc.pypy.org/en/latest/architecture.html#id1 (visited on 02/09/2020).

[PyPd] PyPy community. Incminimark. url: https://doc.pypy.org/en/latest/gc_info.html#incminimark (visited on 02/09/2020).

[PyPe] PyPy community. RPython Documentation. url: https://rpython.readthedocs.io/en/latest/index.html#index (visited on 02/09/2020).

[Pyta] Python Software Foundation. array — Efficient arrays of numeric values. url: https://docs.python.org/3/library/array.html#module-array (visited on 02/09/2020).

[Pytb] Python Software Foundation. CPython. url: https://www.python.org (visited on 12/12/2019).

[Pytc] Python Software Foundation. multiprocessing – Process-based parallelism. url: https://docs.python.org/3.6/library/multiprocessing.html (visited on 02/09/2020).


[Pytd] Python Software Foundation. Python Software Foundation: Press Release 20-Dec-2019. url: https://www.python.org/psf/press-release/pr20191220/ (visited on 02/09/2020).

[Pyte] Python Software Foundation. threading – Thread-based parallelism. url: https://docs.python.org/3.6/library/threading.html (visited on 02/09/2020).

[Qui03] M. Quinn. Parallel Programming in C with MPI and OpenMP.McGraw-Hill, 2003. isbn: 9780071232654.

[Ray03] E. S. Raymond. The art of Unix programming. Addison-Wesley Professional, 2003.

[Raz+18a] L. Razik et al. “Automated deserializer generation from CIMontologies: CIM++—an easy-to-use and automated adaptableopen-source library for object deserialization in C++ fromdocuments based on user-specified UML models following theCommon Information Model (CIM) standards for the energysector”. In: Computer Science - Research and Development33.1 (Feb. 2018), pp. 93–103. issn: 1865-2042. doi: 10.1007/s00450-017-0350-y.

[Raz+18b] L. Razik et al. “CIMverter—a template-based flexibly ex-tensible open-source converter from CIM to Modelica”. In:Energy Informatics 1.1 (Oct. 2018), p. 47. issn: 2520-8942.doi: 10.1186/s42162-018-0031-5.

[Raz+19a] L. Razik et al. “A comparative analysis of LU decompositionmethods for power system simulations”. In: 2019 IEEE MilanPowerTech. June 2019, pp. 1–6.

[Raz+19b] L. Razik et al. “REM-S-–Railway Energy Management inReal Rail Operation”. In: IEEE Transactions on VehicularTechnology 68.2 (Feb. 2019), pp. 1266–1277. doi: 10.1109/TVT.2018.2885007.

[Rei19] G. Reinke. Development of a Dependency Analysis betweenPower System Simulation Components for their Parallel Pro-cessing. 2019.

[Reu+16] R. H. Reussner et al. Modeling and Simulating Software Ar-chitectures: The Palladio Approach. MIT Press, 2016.

[Ris+16] S. Ristov et al. “Superlinear speedup in HPC systems: Whyand when?” In: 2016 Federated Conference on ComputerScience and Information Systems (FedCSIS). IEEE. 2016,pp. 889–898.

Bibliography

[RJB04] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modeling Language Reference Manual (2nd Edition). Pearson Higher Education, 2004. isbn: 0321245628.

[Rog16] A. Roghult. “Benchmarking Python Interpreters”. In: KTH Royal Institute of Technology (2016). url: http://kth.diva-portal.org/smash/get/diva2:912464/FULLTEXT01.pdf.

[Roo99] S. Roosta. Parallel Processing and Parallel Algorithms: Theory and Computation. Springer New York, 1999. isbn: 978-0-38798-716-3.

[Rosa] P. Ross. Cython Function Declarations – Cython def, cdef and cpdef functions 0.1.0 documentation. url: https://notes-on-cython.readthedocs.io/en/latest/function_declarations.html (visited on 02/10/2020).

[Rosb] G. van Rossum. What’s New In Python 3.0. url: https://docs.python.org/3/whatsnew/3.0.html (visited on 02/09/2020).

[Rud+06] K. Rudion et al. “Design of benchmark of medium voltage distribution network for investigation of DG integration”. In: Power Engineering Society General Meeting, 2006. IEEE. 2006, 6 pp.

[Sad+09] A. Sadovykh et al. “On Study Results: Round Trip Engineering of Space Systems”. In: European Conference on Model Driven Architecture – Foundations and Applications. Springer. 2009, pp. 265–276.

[Sch+15] F. Schloegl et al. “Towards a classification scheme for co-simulation approaches in energy systems”. In: Smart Electric Distribution Systems and Technologies (EDST), 2015 International Symposium on. IEEE. 2015, pp. 516–521.

[Sch11] S. Schütte. “A Domain-Specific Language For Simulation Composition”. In: ECMS. 2011, pp. 146–152.

[Sch19] S. Scherfke. mosaik Documentation — Release 2.5.1. June 2019. url: https://media.readthedocs.org/pdf/mosaik/latest/mosaik.pdf (visited on 10/21/2019).

[Scia] SciPy community. Broadcasting. url: http://scipy-lectures.org/intro/numpy/operations.html#broadcasting (visited on 02/09/2020).

[Scib] SciPy community. Casting Rules. url: https://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules (visited on 02/09/2020).

[Scic] SciPy community. Universal functions (ufunc). url: https://docs.scipy.org/doc/numpy/reference/ufuncs.html (visited on 02/09/2020).

[Scid] SciPy developers. SciPy.org. url: https://www.scipy.org (visited on 12/11/2019).

[ŠD12] V. Štuikys and R. Damaševičius. Meta-Programming and Model-Driven Meta-Program Development: Principles, Processes and Techniques. Vol. 5. Springer Science & Business Media, 2012.

[Sjö+10] M. Sjölund et al. “Towards Efficient Distributed Simulation in Modelica using Transmission Line Modeling”. In: 3rd International Workshop on Equation-Based Object-Oriented Modeling Languages and Tools; Oslo; Norway; October 3. 047. Linköping University Electronic Press. 2010, pp. 71–80.

[Slo01] A. Slominski. Design of a Pull and Push Parser System for Streaming XML. Tech. rep. TR-550, Indiana University, 2001.

[Sma12] Smart Grid Coordination Group, CEN-CENELEC-ETSI. Smart Grid Reference Architecture. Nov. 2012. url: https://ec.europa.eu/energy/sites/ener/files/documents/xpert_group1_reference_architecture.pdf (visited on 10/21/2019).

[Smi15] K. Smith. Cython: A Guide for Python Programmers. O’Reilly Media, 2015. isbn: 9781491901755. url: https://books.google.de/books?id=ERFkBgAAQBAJ.

[Spe] O. van der Spek. C++ CTemplate system. url: https://github.com/OlafvdSpek/ctemplate (visited on 12/23/2019).

[SRS10] R. Santodomingo, J. Rodríguez-Mondéjar, and M. Sanz-Bobi. “Ontology Matching Approach to the Harmonization of CIM and IEC 61850 Standards”. In: Smart Grid Communications (SmartGridComm), 2010 First IEEE International Conference on. IEEE. 2010, pp. 55–60.

[SST11] S. Schütte, S. Scherfke, and M. Tröschel. “Mosaik: A framework for modular simulation of active components in Smart Grids”. In: Smart Grid Modeling and Simulation (SGMS), 2011 IEEE First International Workshop on. IEEE. 2011, pp. 55–60.

[Ste+17] M. Stevic et al. “Multi-site European framework for real-time co-simulation of power systems”. In: IET Generation, Transmission & Distribution 11.17 (2017), pp. 4126–4135. issn: 1751-8687. doi: 10.1049/iet-gtd.2016.1576.

[STM10] L. Surhone, M. Timpledon, and S. Marseken. Template Processor. Betascript Publishing, 2010. isbn: 9786130536886.

[Sto+13] J. Stoer et al. Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer New York, 2013. isbn: 9781475722727.

[Sup] SuperLU developers. SuperLU: Home Page. url: https://portal.nersc.gov/project/sparse/superlu (visited on 12/24/2019).

[SV01] Y. Saad and H. A. Van Der Vorst. “Iterative Solution of Linear Systems in the 20th Century”. In: Numerical Analysis: Historical Developments in the 20th Century. Elsevier, 2001, pp. 175–207.

[SWD15] R. Sedgewick, K. Wayne, and R. Dondero. Introduction to Programming in Python: An Interdisciplinary Approach. Pearson Education, 2015. isbn: 9780134076522. url: https://introcs.cs.princeton.edu/python/appendix_numpy/ (visited on 02/09/2020).

[Tad] Tadeck. Why is Python 3 not backwards compatible? url: https://stackoverflow.com/questions/9066956/why-is-python-3-not-backwards-compatible (visited on 02/09/2020).

[Tan09] A. Tanenbaum. Modern Operating Systems. Pearson Prentice Hall, 2009. isbn: 9780138134594.

[Thi19] B. Thiele. GitHub - modelica-3rdparty/Modelica_DeviceDrivers. 2019. url: https://github.com/modelica-3rdparty/Modelica_DeviceDrivers (visited on 10/23/2019).

[Til01] M. Tiller. Introduction to physical modeling with Modelica. Boston: Kluwer Academic Publishers, 2001.

[Tri] Trilinos developers. GitHub - trilinos/Trilinos. url: https://github.com/trilinos/Trilinos (visited on 02/23/2020).

[TW67] W. F. Tinney and J. W. Walker. “Direct Solutions of Sparse Network Equations by Optimally Ordered Triangular Factorization”. In: Proceedings of the IEEE 55.11 (Nov. 1967), pp. 1801–1809. issn: 0018-9219. doi: 10.1109/PROC.1967.6011.

[Ull75] J. D. Ullman. “NP-Complete Scheduling Problems”. In: Journal of Computer and System Sciences 10.3 (1975), pp. 384–393.

[Umw19] Umweltbundesamt (German Environment Agency). Erneuerbare Energien in Deutschland – Daten zur Entwicklung im Jahr 2018. 2019. url: https://www.umweltbundesamt.de/sites/default/files/medien/1410/publikationen/uba_hgp_eeinzahlen_2019_bf.pdf (visited on 10/21/2019).

[Uni17] University of Tennessee, Knoxville. BLAS (Basic Linear Algebra Subprograms). 2017. url: http://www.netlib.org/blas/ (visited on 10/21/2019).

[Uni19] University of Tennessee, Knoxville et al. LAPACK – Linear Algebra PACKage. 2019. url: http://www.netlib.org/lapack/ (visited on 10/21/2019).

[Usl+12] M. Uslar et al. The Common Information Model CIM: IEC 61968/61970 and 62325 – A practical introduction to the CIM. Power Systems. Springer Berlin Heidelberg, 2012. isbn: 9783642252150. url: https://books.google.de/books?id=cdw6gtzwc-QC.

[Van] J. VanderPlas. Why Python is Slow: Looking Under the Hood | Pythonic Perambulations. url: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow (visited on 02/10/2020).

[Var+11] E. Varnik et al. “Fast Conservative Estimation of Hessian Sparsity”. In: Fifth SIAM Workshop on Combinatorial Scientific Computing, May 19–21, 2011, Darmstadt, Germany. May 2011, pp. 18–21.

[VCV11] S. Van Der Walt, S. C. Colbert, and G. Varoquaux. “The NumPy Array: A Structure for Efficient Numerical Computation”. In: Computing in Science & Engineering 13.2 (2011), p. 22.

[Vir+17] R. Viruez et al. “A Modelica-based Tool for Power System Dynamic Simulations”. In: Proceedings of the 12th International Modelica Conference, Prague, Czech Republic, May 15–17, 2017. 132. Linköping University Electronic Press. 2017, pp. 235–239.

[Vog+17] S. Vogel et al. “An Open Solution for Next-generation Real-time Power System Simulation”. In: 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2). Nov. 2017, pp. 1–6. doi: 10.1109/EI2.2017.8245739.

[Wal+14] M. Walther et al. “Equation based parallelization of Modelica models”. In: Proceedings of the 10th International Modelica Conference; March 10–12; 2014; Lund; Sweden. 096. Linköping University Electronic Press. 2014, pp. 1213–1220.

[WB87] M. Wolfe and U. Banerjee. “Data dependence and its application to parallel processing”. In: International Journal of Parallel Programming 16.2 (Apr. 1987), pp. 137–178. issn: 1573-7640.

[Wei+07] S. Wei et al. “Multi-Agent Architecture of Energy Management System Based on IEC 61970 CIM”. In: Power Engineering Conference, 2007. IPEC 2007. International. IEEE. 2007, pp. 1366–1370.

[WGG10] K. Wehrle, M. Günes, and J. Gross. Modeling and tools for network simulation. Springer Science & Business Media, 2010.

[WH16] Z. Wang and Y. He. “Two-stage optimal demand response with battery energy storage systems”. In: IET Generation, Transmission & Distribution 10.5 (2016), pp. 1286–1293.

[Wil19] A. Williams. C++ Concurrency in Action. Manning Publications Company, 2019. isbn: 9781617294693.

[Yan81] M. Yannakakis. “Computing the Minimum Fill-In is NP-Complete”. In: SIAM Journal on Algebraic Discrete Methods 2.1 (1981), pp. 77–79.

[ZCN11] K. Zhu, M. Chenine, and L. Nordstrom. “ICT architecture impact on wide area monitoring and control systems’ reliability”. In: IEEE Transactions on Power Delivery 26.4 (2011), pp. 2801–2808.

[ZMT11] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas. “MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education”. In: IEEE Transactions on Power Systems 26.1 (2011), pp. 12–19.

[ZPK00] B. P. Zeigler, H. Praehofer, and T. G. Kim. Theory of Modeling and Simulation: Integrating Discrete Event and Continuous Complex Dynamic Systems. Academic Press, 2000.

E.ON ERC Band 1: Streblow, R. Thermal Sensation and Comfort Model for Inhomogeneous Indoor Environments. 1. Auflage 2011. ISBN 978-3-942789-00-4
E.ON ERC Band 2: Naderi, A. Multi-phase, multi-species reactive transport modeling as a tool for system analysis in geological carbon dioxide storage. 1. Auflage 2011. ISBN 978-3-942789-01-1
E.ON ERC Band 3: Westner, G. Four Essays related to Energy Economic Aspects of Combined Heat and Power Generation. 1. Auflage 2012. ISBN 978-3-942789-02-8
E.ON ERC Band 4: Lohwasser, R. Impact of Carbon Capture and Storage (CCS) on the European Electricity Market. 1. Auflage 2012. ISBN 978-3-942789-03-5
E.ON ERC Band 5: Dick, C. Multi-Resonant Converters as Photovoltaic Module-Integrated Maximum Power Point Tracker. 1. Auflage 2012. ISBN 978-3-942789-04-2
E.ON ERC Band 6: Lenke, R. A Contribution to the Design of Isolated DC-DC Converters for Utility Applications. 1. Auflage 2012. ISBN 978-3-942789-05-9
E.ON ERC Band 7: Brännström, F. Einsatz hybrider RANS-LES-Turbulenzmodelle in der Fahrzeugklimatisierung. 1. Auflage 2012. ISBN 978-3-942789-06-6
E.ON ERC Band 8: Bragard, M. The Integrated Emitter Turn-Off Thyristor - An Innovative MOS-Gated High-Power Device. 1. Auflage 2012. ISBN 978-3-942789-07-3
E.ON ERC Band 9: Hoh, A. Exergiebasierte Bewertung gebäudetechnischer Anlagen. 1. Auflage 2013. ISBN 978-3-942789-08-0
E.ON ERC Band 10: Köllensperger, P. The Internally Commutated Thyristor - Concept, Design and Application. 1. Auflage 2013. ISBN 978-3-942789-09-7
E.ON ERC Band 11: Achtnicht, M. Essays on Consumer Choices Relevant to Climate Change: Stated Preference Evidence from Germany. 1. Auflage 2013. ISBN 978-3-942789-10-3
E.ON ERC Band 12: Panašková, J. Olfaktorische Bewertung von Emissionen aus Bauprodukten. 1. Auflage 2013. ISBN 978-3-942789-11-0
E.ON ERC Band 13: Vogt, C. Optimization of Geothermal Energy Reservoir Modeling using Advanced Numerical Tools for Stochastic Parameter Estimation and Quantifying Uncertainties. 1. Auflage 2013. ISBN 978-3-942789-12-7
E.ON ERC Band 14: Benigni, A. Latency exploitation for parallelization of power systems simulation. 1. Auflage 2013. ISBN 978-3-942789-13-4
E.ON ERC Band 15: Butschen, T. Dual-ICT – A Clever Way to Unite Conduction and Switching Optimized Properties in a Single Wafer. 1. Auflage 2013. ISBN 978-3-942789-14-1
E.ON ERC Band 16: Li, W. Fault Detection and Protection in Medium Voltage DC Shipboard Power Systems. 1. Auflage 2013. ISBN 978-3-942789-15-8
E.ON ERC Band 17: Shen, J. Modeling Methodologies for Analysis and Synthesis of Controls and Modulation Schemes for High-Power Converters with Low Pulse Ratios. 1. Auflage 2014. ISBN 978-3-942789-16-5
E.ON ERC Band 18: Flieger, B. Innenraummodellierung einer Fahrzeugkabine in der Programmiersprache Modelica. 1. Auflage 2014. ISBN 978-3-942789-17-2
E.ON ERC Band 19: Liu, J. Measurement System and Technique for Future Active Distribution Grids. 1. Auflage 2014. ISBN 978-3-942789-18-9
E.ON ERC Band 20: Kandzia, C. Experimentelle Untersuchung der Strömungsstrukturen in einer Mischlüftung. 1. Auflage 2014. ISBN 978-3-942789-19-6
E.ON ERC Band 21: Thomas, S. A Medium-Voltage Multi-Level DC/DC Converter with High Voltage Transformation Ratio. 1. Auflage 2014. ISBN 978-3-942789-20-2
E.ON ERC Band 22: Tang, J. Probabilistic Analysis and Stability Assessment for Power Systems with Integration of Wind Generation and Synchrophasor Measurement. 1. Auflage 2014. ISBN 978-3-942789-21-9
E.ON ERC Band 23: Sorda, G. The Diffusion of Selected Renewable Energy Technologies: Modeling, Economic Impacts, and Policy Implications. 1. Auflage 2014. ISBN 978-3-942789-22-6
E.ON ERC Band 24: Rosen, C. Design considerations and functional analysis of local reserve energy markets for distributed generation. 1. Auflage 2014. ISBN 978-3-942789-23-3
E.ON ERC Band 25: Ni, F. Applications of Arbitrary Polynomial Chaos in Electrical Systems. 1. Auflage 2015. ISBN 978-3-942789-24-0
E.ON ERC Band 26: Michelsen, C. C. The Energiewende in the German Residential Sector: Empirical Essays on Homeowners’ Choices of Space Heating Technologies. 1. Auflage 2015. ISBN 978-3-942789-25-7
E.ON ERC Band 27: Rohlfs, W. Decision-Making under Multi-Dimensional Price Uncertainty for Long-Lived Energy Investments. 1. Auflage 2015. ISBN 978-3-942789-26-4
E.ON ERC Band 28: Wang, J. Design of Novel Control algorithms of Power Converters for Distributed Generation. 1. Auflage 2015. ISBN 978-3-942789-27-1
E.ON ERC Band 29: Helmedag, A. System-Level Multi-Physics Power Hardware in the Loop Testing for Wind Energy Converters. 1. Auflage 2015. ISBN 978-3-942789-28-8
E.ON ERC Band 30: Togawa, K. Stochastics-based Methods Enabling Testing of Grid-related Algorithms through Simulation. 1. Auflage 2015. ISBN 978-3-942789-29-5
E.ON ERC Band 31: Huchtemann, K. Supply Temperature Control Concepts in Heat Pump Heating Systems. 1. Auflage 2015. ISBN 978-3-942789-30-1
E.ON ERC Band 32: Molitor, C. Residential City Districts as Flexibility Resource: Analysis, Simulation, and Decentralized Coordination Algorithms. 1. Auflage 2015. ISBN 978-3-942789-31-8
E.ON ERC Band 33: Sunak, Y. Spatial Perspectives on the Economics of Renewable Energy Technologies. 1. Auflage 2015. ISBN 978-3-942789-32-5
E.ON ERC Band 34: Cupelli, M. Advanced Control Methods for Robust Stability of MVDC Systems. 1. Auflage 2015. ISBN 978-3-942789-33-2
E.ON ERC Band 35: Chen, K. Active Thermal Management for Residential Air Source Heat Pump Systems. 1. Auflage 2015. ISBN 978-3-942789-34-9
E.ON ERC Band 36: Pâques, G. Development of SiC GTO Thyristors with Etched Junction Termination. 1. Auflage 2016. ISBN 978-3-942789-35-6
E.ON ERC Band 37: Garnier, E. Distributed Energy Resources and Virtual Power Plants: Economics of Investment and Operation. 1. Auflage 2016. ISBN 978-3-942789-37-0
E.ON ERC Band 38: Calì, D. Occupants' Behavior and its Impact upon the Energy Performance of Buildings. 1. Auflage 2016. ISBN 978-3-942789-36-3
E.ON ERC Band 39: Isermann, T. A Multi-Agent-based Component Control and Energy Management System for Electric Vehicles. 1. Auflage 2016. ISBN 978-3-942789-38-7
E.ON ERC Band 40: Wu, X. New Approaches to Dynamic Equivalent of Active Distribution Network for Transient Analysis. 1. Auflage 2016. ISBN 978-3-942789-39-4
E.ON ERC Band 41: Garbuzova-Schiftler, M. The Growing ESCO Market for Energy Efficiency in Russia: A Business and Risk Analysis. 1. Auflage 2016. ISBN 978-3-942789-40-0
E.ON ERC Band 42: Huber, M. Agentenbasierte Gebäudeautomation für raumlufttechnische Anlagen. 1. Auflage 2016. ISBN 978-3-942789-41-7
E.ON ERC Band 43: Soltau, N. High-Power Medium-Voltage DC-DC Converters: Design, Control and Demonstration. 1. Auflage 2017. ISBN 978-3-942789-42-4
E.ON ERC Band 44: Stieneker, M. Analysis of Medium-Voltage Direct-Current Collector Grids in Offshore Wind Parks. 1. Auflage 2017. ISBN 978-3-942789-43-1
E.ON ERC Band 45: Bader, A. Entwicklung eines Verfahrens zur Strompreisvorhersage im kurzfristigen Intraday-Handelszeitraum. 1. Auflage 2017. ISBN 978-3-942789-44-8
E.ON ERC Band 46: Chen, T. Upscaling Permeability for Fractured Porous Rocks and Modeling Anisotropic Flow and Heat Transport. 1. Auflage 2017. ISBN 978-3-942789-45-5
E.ON ERC Band 47: Ferdowsi, M. Data-Driven Approaches for Monitoring of Distribution Grids. 1. Auflage 2017. ISBN 978-3-942789-46-2
E.ON ERC Band 48: Kopmann, N. Betriebsverhalten freier Heizflächen unter zeitlich variablen Randbedingungen. 1. Auflage 2017. ISBN 978-3-942789-47-9
E.ON ERC Band 49: Fütterer, J. Tuning of PID Controllers within Building Energy Systems. 1. Auflage 2017. ISBN 978-3-942789-48-6
E.ON ERC Band 50: Adler, F. A Digital Hardware Platform for Distributed Real-Time Simulation of Power Electronic Systems. 1. Auflage 2017. ISBN 978-3-942789-49-3
E.ON ERC Band 51: Harb, H. Predictive Demand Side Management Strategies for Residential Building Energy Systems. 1. Auflage 2017. ISBN 978-3-942789-50-9
E.ON ERC Band 52: Jahangiri, P. Applications of Paraffin-Water Dispersions in Energy Distribution Systems. 1. Auflage 2017. ISBN 978-3-942789-51-6
E.ON ERC Band 53: Adolph, M. Identification of Characteristic User Behavior with a Simple User Interface in the Context of Space Heating. 1. Auflage 2018. ISBN 978-3-942789-52-3
E.ON ERC Band 54: Galassi, V. Experimental evidence of private energy consumer and prosumer preferences in the sustainable energy transition. 1. Auflage 2017. ISBN 978-3-942789-53-0
E.ON ERC Band 55: Sangi, R. Development of Exergy-based Control Strategies for Building Energy Systems. 1. Auflage 2018. ISBN 978-3-942789-54-7
E.ON ERC Band 56: Stinner, S. Quantifying and Aggregating the Flexibility of Building Energy Systems. 1. Auflage 2018. ISBN 978-3-942789-55-4
E.ON ERC Band 57: Fuchs, M. Graph Framework for Automated Urban Energy System Modeling. 1. Auflage 2018. ISBN 978-3-942789-56-1
E.ON ERC Band 58: Osterhage, T. Messdatengestützte Analyse und Interpretation sanierungsbedingter Effizienzsteigerungen im Wohnungsbau. 1. Auflage 2018. ISBN 978-3-942789-57-8
E.ON ERC Band 59: Frieling, J. Quantifying the Role of Energy in Aggregate Production Functions for Industrialized Countries. 1. Auflage 2018. ISBN 978-3-942789-58-5
E.ON ERC Band 60: Lauster, M. Parametrierbare Gebäudemodelle für dynamische Energiebedarfsrechnungen von Stadtquartieren. 1. Auflage 2018. ISBN 978-3-942789-59-2
E.ON ERC Band 61: Zhu, L. Modeling, Control and Hardware in the Loop in Medium Voltage DC Shipboard Power Systems. 1. Auflage 2018. ISBN 978-3-942789-60-8
E.ON ERC Band 62: Feron, B. An optimality assessment methodology for Home Energy Management System approaches based on uncertainty analysis. 1. Auflage 2018. ISBN 978-3-942789-61-5
E.ON ERC Band 63: Diekerhof, M. Distributed Optimization for the Exploitation of Multi-Energy Flexibility under Uncertainty in City Districts. 1. Auflage 2018. ISBN 978-3-942789-62-2
E.ON ERC Band 64: Wolisz, H. Transient Thermal Comfort Constraints for Model Predictive Heating Control. 1. Auflage 2018. ISBN 978-3-942789-63-9
E.ON ERC Band 65: Pickartz, S. Virtualization as an Enabler for Dynamic Resource Allocation in HPC. 1. Auflage 2019. ISBN 978-3-942789-64-6
E.ON ERC Band 66: Khayyamim, S. Centralized-decentralized Energy Management in Railway System. 1. Auflage 2019. ISBN 978-3-942789-65-3
E.ON ERC Band 67: Schlösser, T. Methodology for Holistic Evaluation of Building Energy Systems under Dynamic Boundary Conditions. 1. Auflage 2019. ISBN 978-3-942789-66-0
E.ON ERC Band 68: Cui, S. Modular Multilevel DC-DC Converters Interconnecting High-Voltage and Medium-Voltage DC Grids. 1. Auflage 2019. ISBN 978-3-942789-67-7
E.ON ERC Band 69: Hu, J. Modulation and Dynamic Control of Intelligent Dual-Active-Bridge Converter Based Substations for Flexible DC Grids. 1. Auflage 2019. ISBN 978-3-942789-68-4
E.ON ERC Band 70: Schiefelbein, J. Optimized Placement of Thermo-Electric Energy Systems in City Districts under Uncertainty. 1. Auflage 2019. ISBN 978-3-942789-69-1
E.ON ERC Band 71: Ferdinand, R. Grid Operation of HVDC-Connected Offshore Wind Farms: Power Quality and Switching Strategies. 1. Auflage 2019. ISBN 978-3-942789-70-7
E.ON ERC Band 72: Musa, A. Advanced Control Strategies for Stability Enhancement of Future Hybrid AC/DC Networks. 1. Auflage 2019. ISBN 978-3-942789-71-4
E.ON ERC Band 73: Angioni, A. Uncertainty modeling for analysis and design of monitoring systems for dynamic electrical distribution grids. 1. Auflage 2019. ISBN 978-3-942789-72-1
E.ON ERC Band 74: Möhlenkamp, M. Thermischer Komfort bei Quellluftströmungen. 1. Auflage 2019. ISBN 978-3-942789-73-8
E.ON ERC Band 75: Voss, J. Multi-Megawatt Three-Phase Dual-Active Bridge DC-DC Converter. 1. Auflage 2019. ISBN 978-3-942789-74-5
E.ON ERC Band 76: Siddique, H. The Three-Phase Dual-Active Bridge Converter Family: Modeling, Analysis, Optimization and Comparison of Two-Level and Three-Level Converter Variants. 1. Auflage 2019. ISBN 978-3-942789-75-2
E.ON ERC Band 77: Heesen, F. An Interdisciplinary Analysis of Heat Energy Consumption in Energy-Efficient Homes: Essays on Economic, Technical and Behavioral Aspects. 1. Auflage 2019. ISBN 978-3-942789-76-9
E.ON ERC Band 78: Möller, R. Untersuchung der Durchschlagspannung von Mineral-, Silikonölen und synthetischen Estern bei mittelfrequenten Spannungen. 1. Auflage 2020. ISBN 978-3-942789-77-6
E.ON ERC Band 79: Höfer, T. Transition Towards a Renewable Energy Infrastructure: Spatial Interdependencies and Stakeholder Preferences. 1. Auflage 2020. ISBN 978-3-942789-78-3
E.ON ERC Band 80: Freitag, H. Investigation of the Internal Flow Behavior in Active Chilled Beams. 1. Auflage 2020. ISBN 978-3-942789-79-0

This dissertation deals with established and newly developed methods from the field of high-performance computing (HPC) and computer science, which were implemented in existing and new software for the simulation of large-scale power systems. The motivation for this is the transformation of conventional power grids into smart grids, driven by the growing share of renewable energies, which requires more complex power grid management. The presented HPC methods make it possible to exploit the potential of modern computer hardware, which, for example, offers ever more parallel computing units and decreasing latencies in network communication that can be of decisive importance, especially for real-time applications. In addition to measures for optimizing hardware utilization, the dissertation also deals with the representation of power systems. In the simulation of smart grids, this includes not only the power grid but also, for instance, the associated communication network and the energy market. Therefore, a data model for smart grid topologies based on existing standards is introduced and validated in a co-simulation environment. In addition, an approach is presented that automatically generates a software library from the specification of the data model. Subsequently, an approach is shown which uses this library to convert topological data into various simulator-specific system models. All presented approaches were implemented in open-source software projects that are publicly accessible.

ISBN 978-3-942789-80-6