High-Performance Computing Methods in Large-Scale Power System Simulation
Lukas Razik, Institute for Automation of Complex Power Systems
Von der Fakultät für Elektrotechnik und Informationstechnik der Rheinisch-Westfälischen Technischen Hochschule Aachen zur Erlangung des akademischen Grades eines Doktors der Ingenieurwissenschaften genehmigte Dissertation

vorgelegt von

Dipl.-Inform. Lukas Daniel Razik
aus Hindenburg

Berichter: Univ.-Prof. Antonello Monti, Ph. D.
Univ.-Prof. Dr.-Ing. Andrea Benigni

Tag der mündlichen Prüfung: 8. Mai 2020

Diese Dissertation ist auf den Internetseiten der Universitätsbibliothek online verfügbar.
Bibliographische Information der Deutschen Nationalbibliothek
Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deutschen Nationalbibliografie; detaillierte bibliografische Daten sind im Internet über http://dnb-nb.de abrufbar.
D 82 (Diss. RWTH Aachen University, 2020)
Herausgeber: Univ.-Prof. Dr. ir. Dr. h. c. Rik W. De Doncker, Direktor, E.ON Energy Research Center, Institute for Automation of Complex Power Systems (ACS), Mathieustraße 10, 52074 Aachen
E.ON Energy Research Center | 81. Ausgabe der Serie ACS | Automation of Complex Power Systems
Copyright Lukas Razik. Alle Rechte, auch das des auszugsweisen Nachdrucks, der auszugsweisen oder vollständigen Wiedergabe, der Speicherung in Datenverarbeitungsanlagen und der Übersetzung, vorbehalten.
Printed in Germany
ISBN: 978-3-942789-80-6
1. Auflage 2020
Verlag: E.ON Energy Research Center, RWTH Aachen University, Mathieustraße 10, 52074 Aachen
Internet: www.eonerc.rwth-aachen.de
E-Mail: [email protected]
Zusammenfassung
In der seit 2009 geltenden Erneuerbare-Energien-Richtlinie der Europäischen Union haben sich die Mitgliedsstaaten darauf verständigt, dass der Anteil erneuerbarer Energien bis 2020 bei mindestens 20 % des Energieverbrauchs liegen soll. Die damit einhergehende wachsende Zahl von erneuerbaren Energieerzeugern wie Photovoltaik- und Windkraftanlagen führt zu einer vermehrt dezentralen Stromerzeugung, die ein komplexeres Stromnetzmanagement erfordert.
Um dennoch einen sicheren Netzbetrieb zu gewährleisten, findet ein Wandel von konventionellen Stromnetzen zu sogenannten Smart Grids statt, bei denen z. B. nicht nur Statusinformationen der Stromerzeuger, sondern auch der Verbraucher (z. B. Wärmepumpen und Elektrofahrzeuge) in das Netzmanagement einbezogen werden. Die Nutzung von Flexibilitäten auf Erzeugungs- und Nachfrageseite und der Einsatz von Energiespeichern zur Erreichung einer stabilen und wirtschaftlichen Stromversorgung erfordert neue Lösungen für die Planung und den Betrieb von Smart Grids. Andernfalls können Veränderungen an den Systemen des öffentlichen Energiesektors (Stromnetz, IKT-Infrastruktur, Energiemarkt usw.) zu unerwarteten Problemen und damit auch zu Stromausfällen führen. Computersimulationen können deswegen helfen, das Verhalten von Smart Grids bei Veränderungen abzuschätzen, ohne das Risiko negativer Folgen bei unausgereiften Lösungen oder Inkompatibilitäten einzugehen.
Die wesentliche Zielsetzung der vorliegenden Dissertation ist die Anwendung und Analyse von Methoden des High-Performance Computings (HPC) und der Informatik zur Verbesserung von (Co-)Simulationssoftware elektrischer Energiesysteme, um komplexere Komponentenmodelle sowie größere Systemmodelle in angemessener Zeit simulieren zu können. Durch die zunehmende Automatisierung und Regelung in Smart Grids, die immer höheren Anforderungen an deren Flexibilität und die Notwendigkeit einer stärkeren Marktintegration der Verbraucher werden Stromnetzmodelle immer komplexer. Die Simulationen erfordern daher eine immer höhere Leistungsfähigkeit der eingesetzten Rechnersysteme. Der Schwerpunkt der Arbeiten liegt deshalb auf der Verbesserung verschiedener Aspekte moderner und derzeit entwickelter Simulationslösungen. Dabei sollten jedoch keine neuen Simulationskonzepte oder -anwendungen entwickelt werden, die ein Hochleistungsrechnen auf Supercomputern oder großen Computerclustern erst erforderlich machen würden.
Vielmehr werden in dieser Dissertation die Integrationen moderner direkter Löser für dünnbesetzte lineare Systeme in verschiedene Stromnetzsimulations-Backends und die anschließenden Analysen mithilfe von großskaligen Stromnetzmodellen vorgestellt. Darüber hinaus wird eine neue Methode zur automatischen grobgranularen Parallelisierung von Stromnetz-Systemmodellen auf Komponentenebene präsentiert. Neben solchen konkreten Anwendungen von HPC-Methoden auf Simulationsumgebungen wird auch eine vergleichende Analyse verschiedener HPC-Ansätze zur Leistungssteigerung Python-basierter Software mithilfe von (Just-in-Time-)Kompilierern vorgestellt, da Python – in der Regel eine interpretierte Programmiersprache – im Bereich der Softwareentwicklung im Energiesektor immer beliebter wird. Im Weiteren stellt die Dissertation die Integration einer HPC-Netzwerktechnologie auf Basis des offenen InfiniBand-Standards in ein Software-Framework vor, das für die Kopplung verschiedener Simulationsumgebungen zu einer Co-Simulation und für den Datenaustausch in Hardware-in-the-Loop-Aufbauten (HiL) genutzt werden kann.
Für die Verarbeitung von Energiesystemtopologien durch Simulationsumgebungen, auf denen die oben genannten HPC-Methoden angewendet wurden, ist die Unterstützung eines standardisierten Datenmodells notwendig. Die Dissertation behandelt daher auch das Common Information Model (CIM), wie in IEC 61970 / 61968 standardisiert, welches für die Spezifikation von Datenmodellen zur Repräsentierung von Energiesystemtopologien verwendet werden kann. Zunächst wird ein gesamtheitliches Datenmodell vorgestellt, das für Co-Simulationen des Stromnetzes mit dem zugehörigen Kommunikationsnetz und dem Energiemarkt durch eine Erweiterung von CIM entwickelt wurde. Um eine nachhaltige Entwicklung von CIM-bezogenen Softwaretools zu erreichen, wird im Folgenden eine automatisierte (De-)Serializer-Generierung aus CIM-Spezifikationen vorgestellt. Die Deserialisierung von CIM-Dokumenten ist ein Schritt, der für die anschließend entwickelte Übersetzung von CIM-basierten Netztopologien in simulatorspezifische Systemmodelle genutzt wird, die ebenfalls in dieser Dissertation behandelt wird.
Viele der vorgestellten Erkenntnisse und Ansätze können auch zur Verbesserung anderer Software im Bereich der Elektrotechnik und darüber hinaus genutzt werden. Zudem wurden alle in der Dissertation vorgestellten Ansätze in öffentlich zugänglichen Open-Source-Softwareprojekten implementiert.
Abstract
In the Renewables Directive of the European Union, in effect since 2009, the member states agreed that the share of renewable energy should be 20 % of the total energy consumption by 2020. The concomitantly growing number of renewable energy producers, such as photovoltaic systems and wind power plants, leads to a more decentralized power generation. This results in a more complex power grid management.
To ensure a secure power grid operation even so, there is a transformation from conventional power grids to so-called smart grids where, for instance, not only status information of power producers but also of consumers (e. g. heat pumps and electric vehicles) is included in the power grid management. The utilization of flexibility on the generation and demand side and the use of energy storage systems for achieving a stable and economic power supply require new solutions for the planning and operation of smart grids. Otherwise, changes to the systems in the public energy sector (i. e. power grid, information and communications technology (ICT) infrastructure, energy market, etc.) can lead to unexpected problems such as power failures. Computer simulations can therefore help to estimate the behavior of smart grids in response to changes, without the risk of negative consequences in case of immature solutions or incompatibilities.
The main objective of this dissertation is the application and analysis of high-performance computing (HPC) and computer science methods for improving power system (co-)simulation software, to allow simulating more detailed models in a time appropriate for the particular use case. Through more automation and control in smart grids, the higher demand on flexibility, and the need for a stronger market integration of consumers, power system models become more and more complex. This requires an ever greater performance of the utilized computer systems. The focus was on the improvement of different aspects of state-of-the-art and currently developed simulation solutions. The intention was not to develop new simulation concepts or applications that would make large-scale HPC on supercomputers or large computer clusters necessary.
The dissertation presents the integration of modern direct solvers for sparse linear systems in various power grid simulation back-ends and subsequent analyses with the aid of large-scale power grid models. Furthermore, a new method for an automatic coarse-grained parallelization of power grid system models at component level is shown. Besides such concrete applications of HPC methods on simulation environments, a comparative analysis of various HPC approaches for the performance improvement of Python-based software with the aid of (just-in-time) compilers is also presented, as Python – usually an interpreted programming language – is becoming more popular in the area of power-system-related software. Moreover, the dissertation shows the integration of an HPC interconnect solution based on InfiniBand – an open standard – in a software framework for the coupling of different simulation environments to a co-simulation and for Hardware-in-the-Loop (HiL) setups.
For the processing of power system topologies by the simulation environments to which the aforementioned HPC methods were applied, the support of a standardized data model is necessary. Therefore, the dissertation concerns the Common Information Model (CIM) as standardized, inter alia, in IEC 61970 / 61968, which can be used for the specification of data models representing power system topologies. At first, a holistic data model is introduced that was developed for co-simulations of the power grid with the associated communication network and the energy market by extending CIM. To achieve a sustainable development of CIM-related software tools, an automated (de-)serializer generation from CIM specifications is presented. The deserialization from CIM is a step needed for the subsequently developed template-based translation from CIM to simulator-specific system models, which is also covered in this dissertation.
Many of the presented findings and approaches can be used for improving further software from the area of electrical engineering and beyond. Moreover, all presented approaches were implemented in open-source software projects that are publicly accessible.
Acknowledgement
I would like to thank the following people: My doctoral supervisor, Prof. Antonello Monti, for the guidance and the support of my initiatives throughout my whole time as a doctoral student at the Institute for Automation of Complex Power Systems; my second reviewer, Prof. Andrea Benigni, for the kind feedback on my dissertation manuscript; and Prof. Ferdinanda Ponci for the helpful feedback and support regarding my scientific publications.
My colleagues: Jan Dinkelbach, for reading the manuscript (especially the boring parts) and for the great support on my way from a computer scientist to an engineer; Markus Mirz, for a great cooperation as well as the inclusion of my humble self in interesting additional projects and activities; Steffen Vogel, for the assistance in software-technical matters; Simon Pickartz, for the sophisticated LaTeX template; and Stefan Dähling, for proofreading the final version.
All student researchers and students who participated in the researchand development related to this dissertation.
The Réseau de Transport d’Électricité co-workers Adrien Guironnet andGautier Bureau for a successful and enjoyable cooperation.
Vor allem möchte ich meinen Eltern danken, die auf vieles verzichtet und es mir durch ihre Unterstützung erst ermöglicht haben, diesen beruflichen Weg zu beschreiten.
Zu guter Letzt danke auch Dir, mein Schatz, für Deine Unterstützungund Geduld während meiner Promotionszeit!
Aachen, May 2020 Lukas Daniel Razik
Contents
Acknowledgement  viii
List of Publications  xv
1 Introduction  1
  1.1 Challenges in Smart Grids  1
  1.2 Large-Scale Multi-Domain Co-Simulation as a Solution  3
  1.3 Contribution  6
  1.4 Outline  11
2 Multi-Domain Co-Simulation  13
  2.1 Fundamentals and Related Work  14
    2.1.1 Architecture and Topology Data Model  14
    2.1.2 Common Information Model  15
    2.1.3 Simulation of Smart Grids  16
    2.1.4 Classification of Simulations  16
  2.2 Use Case  17
  2.3 Challenges  18
  2.4 Concept of the Co-Simulation Environment  19
    2.4.1 Holistic Topology Data Model  19
    2.4.2 Model Data Processing and Simulation Setup  22
    2.4.3 Synchronization  23
    2.4.4 Co-Simulation Runtime Interaction  24
  2.5 Validation by Use Case  26
  2.6 Conclusion  27
3 Automated De-/Serializer Generation  29
  3.1 CIM Formalisms and Formats  31
  3.2 CIM++ Concept  33
  3.3 From CIM UML to Compilable C++ Code  35
    3.3.1 Gathering Generated CIM Sources  37
    3.3.2 Refactoring Generated CIM Sources  38
    3.3.3 Primitive CIM Data Types  40
  3.4 Automated CIM (De-)Serializer Generation  41
    3.4.1 The Common Base Class  41
    3.4.2 Integrating an XML Parser  42
    3.4.3 Unmarshalling  43
    3.4.4 Unmarshalling Code Generator  46
    3.4.5 Marshalling  49
  3.5 libcimpp Implementation  50
  3.6 Evaluation  50
  3.7 Conclusion and Outlook  51
4 From CIM to Simulator-Specific System Models  55
  4.1 CIMverter Fundamentals  57
    4.1.1 Modelica  57
    4.1.2 Template Engine  59
  4.2 CIMverter Concept  59
  4.3 CIMverter Implementation  62
    4.3.1 Mapping from CIM to Modelica  63
    4.3.2 CIM Object Handler  64
  4.4 Modelica Workshop Implementation  65
    4.4.1 Base Class of the Modelica Workshop  66
    4.4.2 CIM to Modelica Object Mapping  66
    4.4.3 Component Connections  67
  4.5 Evaluation  68
  4.6 Conclusion and Outlook  70
5 Modern LU Decompositions in Power Grid Simulation  75
  5.1 LU Decompositions in Power Grid Simulation  76
    5.1.1 From DAEs to LU Decompositions  76
    5.1.2 LU Decompositions for Linear System Solving  78
    5.1.3 KLU, NICSLU, GLU, and Basker by Comparison  80
  5.2 Analysis of Modern LU Decompositions for Electrical Circuits  83
    5.2.1 Analysis on Benchmark Matrices from Large-Scale Grids  84
    5.2.2 Analysis on Power Grid Simulations  92
  5.3 Conclusion and Outlook  95
6 Exploiting Parallelism in Power Grid Simulation  97
  6.1 Parallelism in Simulation Models  98
    6.1.1 Task Scheduling  100
    6.1.2 Task Parallelization in DPsim  106
    6.1.3 System Decoupling  110
  6.2 Analysis of Task Parallelization in DPsim  111
    6.2.1 Use Cases  112
    6.2.2 Schedulers  113
    6.2.3 System Decoupling  117
    6.2.4 Compiler Environments  122
  6.3 Conclusion and Outlook  124
7 HPC Python Internals and Benefits  127
  7.1 HPC Python Fundamentals  129
    7.1.1 Classical Python  130
    7.1.2 PyPy  136
    7.1.3 Numba  139
    7.1.4 Cython  143
  7.2 Benchmarking Methodology  147
  7.3 Comparative Analysis  150
  7.4 Conclusion and Outlook  155
8 HPC Network Communication for HiL and RT Co-Simulation  157
  8.1 VILLAS Fundamentals  158
  8.2 InfiniBand Fundamentals  159
    8.2.1 InfiniBand Architecture  161
    8.2.2 OpenFabrics Software Stack  165
  8.3 Concept of InfiniBand Support in VILLAS  167
    8.3.1 VILLASnode Basics  167
    8.3.2 Original Read and Write Interface  167
    8.3.3 Requirements on InfiniBand Node-Type Interface  170
    8.3.4 Memory Management of InfiniBand Node-Type  171
    8.3.5 States of InfiniBand Node-Type  172
    8.3.6 Implementation of InfiniBand Node-Type  173
  8.4 Analysis of the InfiniBand Support in VILLAS  175
    8.4.1 Service Types of InfiniBand Node-Type  178
    8.4.2 InfiniBand vs. Zero-Latency Node-Type  181
    8.4.3 InfiniBand vs. Existing Server-Server Node-Types  182
  8.5 Conclusion and Outlook  183
9 Conclusion  185
  9.1 Summary and Discussion  185
  9.2 Outlook  189
A Code Listings  193
  A.1 Exploiting Parallelism in Power Grid Simulation  193
B Python Environment Measurements  195
  B.1 Execution Times  195
  B.2 Memory Space Consumption  197
List of Acronyms  201
Glossary  207
List of Figures  209
List of Tables  213
Bibliography  215
List of Publications
Journal Articles
[DRM20] S. Dähling, L. Razik, and A. Monti. “OWL2Go: Auto-generation of Go data models for OWL ontologies with integrated serialization and deserialization functionality”. In: To appear in SoftwareX (2020).

[Raz+19b] L. Razik, N. Berr, S. Khayyam, F. Ponci, and A. Monti. “REM-S – Railway Energy Management in Real Rail Operation”. In: IEEE Transactions on Vehicular Technology 68.2 (Feb. 2019), pp. 1266–1277. doi: 10.1109/TVT.2018.2885007.

[Kha+18] S. Khayyamim, N. Berr, L. Razik, M. Fleck, F. Ponci, and A. Monti. “Railway System Energy Management Optimization Demonstrated at Offline and Online Case Studies”. In: IEEE Transactions on Intelligent Transportation Systems 19.11 (Nov. 2018), pp. 3570–3583. issn: 1524-9050. doi: 10.1109/TITS.2018.2855748.

[Mir+18] M. Mirz, L. Razik, J. Dinkelbach, H. A. Tokel, G. Alirezaei, R. Mathar, and A. Monti. “A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid”. In: Hindawi Complexity 2018 (Feb. 2018). doi: 10.1155/2018/7154031.

[Raz+18a] L. Razik, M. Mirz, D. Knibbe, S. Lankes, and A. Monti. “Automated deserializer generation from CIM ontologies: CIM++ – an easy-to-use and automated adaptable open-source library for object deserialization in C++ from documents based on user-specified UML models following the Common Information Model (CIM) standards for the energy sector”. In: Computer Science – Research and Development 33.1 (Feb. 2018), pp. 93–103. issn: 1865-2042. doi: 10.1007/s00450-017-0350-y.

[Raz+18b] L. Razik, J. Dinkelbach, M. Mirz, and A. Monti. “CIMverter – a template-based flexibly extensible open-source converter from CIM to Modelica”. In: Energy Informatics 1.1 (Oct. 2018), p. 47. issn: 2520-8942. doi: 10.1186/s42162-018-0031-5.

[Gre+16] F. Gremse, A. Höfter, L. Razik, F. Kiessling, and U. Naumann. “GPU-accelerated adjoint algorithmic differentiation”. In: Computer Physics Communications 200 (2016), pp. 300–311. issn: 0010-4655. doi: 10.1016/j.cpc.2015.10.027.

[Fin+09b] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. “Low-Latency Linux Drivers for Ethernet over High-Speed Networks”. In: IAENG International Journal of Computer Science 36.4 (2009).
Book Chapters
[Fin+10] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. “Transparent Integration of a Low-Latency Linux Driver for Dolphin SCI and DX”. In: Electronic Engineering and Computing Technology. Ed. by S.-I. Ao and L. Gelman. Dordrecht: Springer Netherlands, 2010, pp. 539–549. isbn: 978-90-481-8776-8. doi: 10.1007/978-90-481-8776-8_46.

Conference Articles

[Raz+19a] L. Razik, L. Schumacher, A. Monti, A. Guironnet, and G. Bureau. “A comparative analysis of LU decomposition methods for power system simulations”. In: 2019 IEEE Milan PowerTech. June 2019, pp. 1–6.

[Vog+17] S. Vogel, M. Mirz, L. Razik, and A. Monti. “An Open Solution for Next-generation Real-time Power System Simulation”. In: 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2). Nov. 2017, pp. 1–6. doi: 10.1109/EI2.2017.8245739.

[Pic+16] S. Pickartz, N. Eiling, S. Lankes, L. Razik, and A. Monti. “Migrating LinuX Containers Using CRIU”. In: High Performance Computing. Ed. by M. Taufer, B. Mohr, and J. M. Kunkel. Cham: Springer International Publishing, 2016, pp. 674–684. isbn: 978-3-319-46079-6.

[Var+11] E. Varnik, L. Razik, V. Mosenkis, and U. Naumann. “Fast Conservative Estimation of Hessian Sparsity”. In: Fifth SIAM Workshop on Combinatorial Scientific Computing, May 19–21, 2011, Darmstadt, Germany. May 2011, pp. 18–21.

[Fin+09a] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. “ETHOM, an Ethernet over SCI and DX Driver for Linux”. In: Proceedings of 2009 International Conference of Parallel and Distributed Computing (ICPDC 2009), London, UK. 2009.

[Fin+08] R. Finocchiaro, L. Razik, S. Lankes, and T. Bemmerl. “ETHOS, a generic Ethernet over Sockets Driver for Linux”. In: Proceedings of the 20th IASTED International Conference. Vol. 631. 017. 2008, p. 239.
1 Introduction
In 1993, German newspapers announced that sun, water, and wind would, even in the long run, not cover more than 4 percent of the electricity demand [Büt16]. As early as 2007, their share of the electricity supply amounted to 14.2 percent. This was also the year in which the first official definition of smart grid was provided by the Energy Independence and Security Act of 2007 [Con07], which was approved by the US Congress in January 2007. Meanwhile, the term smart grid is used worldwide for research, development, and investment programs with regard to technology innovations and the expansion of power grids. The principal approaches for the transformation of conventional power grids to smart grids were developed by the expert team Advisory Council of the Technology Platform for Europeans in the years 2005 to 2008, to establish a conceptual base for a secure grid integration of significant electrical generation capacities on the basis of renewable, mostly volatile and weather-dependent energy sources.
1.1 Challenges in Smart Grids
Smart grids particularly require an improved coordination of grid operation and grid user behavior with the aid of information and communications technology (ICT), with the objective of ensuring a sustainable, economical, reliable, secure, and ecofriendly power supply in an environment of increased energy efficiency and decreased greenhouse gas emissions. For instance, Smart Distribution plays a major role in the area of smart grids. It can be divided into three pillars with the following challenges [BS14a; BS14b]:
1. Automation and remote control of local distribution grids: e. g. voltage control at distribution level (traditional as well as including the grid users), possibilities of power flow control, accelerated fault location and resumption of normal grid operation, as well as enhanced protection concepts;
2. Flexibility by virtual power plants (VPPs): i. e. demand side management and benefits of VPPs in a perspectival market organization;
3. Smart Metering and market integration of consumers: i. e. dynamic tariffs, demand side response, and electromobility.
Since the efficiency of proper solutions can be improved by collecting and analyzing information (i. e. data), a research field called Energy Informatics was established around 2012, with conferences such as the DACH+ Conference on Energy Informatics and the ACM e-Energy Conference. Furthermore, Centers for Energy Informatics, for example at the University of Southern Denmark and the University of Southern California, were founded to address the ICT challenges of smart grids, e. g. with the help of artificial intelligence and machine learning approaches. However, new approaches and international standards in the area of ICT are not sufficient. Besides regulatory (i. e. legal) aspects, the introduction of new market rules is also necessary.
A successful realization of the European goals for the reduction of greenhouse gases, the increase of energy efficiency, as well as the continuously rising use of renewable energy sources requires a harmonized design of the interrelations between all participants in the process of electrical power supply [BS14a; BS14b]. Both the proponents and the opponents of renewable energies agree: the contribution of 37.8 percent from renewable energy sources to gross electricity consumption in Germany [Umw19] can be significantly increased in the long term only with a utilization of flexibility (on the generation and demand side) and the use of energy storage systems.
Since modifications of the subsystems involved in the energy sector (i. e. power grid, ICT infrastructure, energy market, etc.) involve the risk of technical and economic faults such as destabilization, major changes should not be made without an accurate analysis of their possible effects on the power system. Computer simulations, with the aid of mathematical models, can help to estimate the behavior of systems in response to modifications and thus avoid negative consequences that could occur in real systems. In the following, different kinds of power system simulation are introduced, and it is motivated why large-scale multi-domain co-simulation is a solution for the three pillars of smart grid challenges presented here.
1.2 Large-Scale Multi-Domain Co-Simulation as a Solution
There are different types of (co-)simulations, depending on their goals. Depending on the considered aspects, simulation types can be classified by:

- mathematical models with, e. g., pure algebraic equations for steady-state observations or ordinary differential equations (ODEs) for dynamic observations;
- simulation time, which, e. g., can be continuous for “floating” physical processes or discrete for events that occur at particular points in time, marking a change of the system’s state;
- orchestration, which, e. g., can be hybrid, when multiple system models from different domains are simulated by the same solver, or a co-simulation, when multiple system models are computed by different simulation solvers which are coupled (i. e. exchanging information during the simulation) [Sch+15].
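The co-simulation case in the last item can be illustrated with a minimal fixed-step coupling loop. The sketch below is purely illustrative: the two toy first-order models and their coupling variables are invented for this example and do not correspond to any simulator discussed in this work.

```python
# Minimal fixed-step co-simulation sketch: two "simulators" advance
# independently and exchange their coupling variables once per macro
# step. Both models are invented first-order toys, not real grid or
# controller models.

def make_grid():
    """Toy grid simulator: frequency deviation via explicit Euler,
    d(df)/dt = -df - p_load."""
    state = {"df": 0.0}
    def step(dt, p_load):
        state["df"] += dt * (-state["df"] - p_load)
        return state["df"]
    return step

def make_controller():
    """Toy demand-side controller: shifts the load proportionally to
    the received frequency deviation."""
    def step(dt, df):
        return 0.5 * df  # coupling variable sent back to the grid
    return step

def cosimulate(t_end=1.0, dt=0.01):
    """Couple the two solvers by exchanging coupling variables at
    every macro time step."""
    grid, ctrl = make_grid(), make_controller()
    p_load, df, t = 1.0, 0.0, 0.0
    while t < t_end:
        df = grid(dt, p_load)        # simulator 1 advances one step
        p_load = 1.0 + ctrl(dt, df)  # simulator 2 reacts; exchange
        t += dt
    return df

print(round(cosimulate(), 4))
```

In a real co-simulation framework, the exchange would pass through a coupling layer (e. g. sockets or shared memory) and the two step functions would be full simulators; the structure of the orchestration loop, however, stays the same.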
Obviously, this list is not complete. The next sections, however, shall provide a more general overview of simulation types and some of their goals in order to motivate the contribution of this work. First, a distinction shall be made between online and offline simulations.
Online Simulation
Online simulations are performed, e. g., for steady-state security assessment (SSA) and dynamic security assessment (DSA). In the case of SSA, power flow simulations of a sequence of (n-1)-states are needed to examine the abidance of the principle that, under the predicted maximum transmission and supply responsibilities, grid security is ensured even when a component such as a transformer or a line unexpectedly becomes inoperative. The DSA is based on dynamic simulation, which supplements the steady-state grid security calculations with calculations of power plant dynamics in the case of close short circuits and grid equipment outages. These dynamic stability calculations can be very time-consuming. In Germany, this means that for a timely availability of DSA results, to be used as a decision aid for the dispatcher, all (n-1)-scenarios should be available within 5 minutes. This means that around 100 dynamic stability calculations must be accomplished within this time frame. Such a real-time (RT) requirement is very challenging and hence requires an intelligent management of the calculation cases as well as low simulation execution times [BS14a; BS14b].
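The 5-minute deadline translates directly into a parallelism requirement. The following back-of-the-envelope sketch illustrates this; the 30-second runtime per dynamic stability case is an invented figure for illustration, not a measured one:

```python
# Back-of-the-envelope sizing for the DSA real-time requirement:
# all (n-1) dynamic stability cases must complete within 5 minutes.
import math

def required_workers(n_cases, seconds_per_case, deadline_s=300):
    """Minimum number of parallel workers so that the total sequential
    workload (n_cases * seconds_per_case) fits into the deadline,
    assuming perfect load balancing and no scheduling overhead."""
    total_work = n_cases * seconds_per_case
    return math.ceil(total_work / deadline_s)

# Around 100 cases (see text); an assumed 30 s per case gives 3000 s
# of sequential work, which needs at least 10 workers in a 300 s window.
print(required_workers(100, 30))  # -> 10
```

The same arithmetic shows why lowering the per-case execution time is as important as adding workers: halving the solver runtime halves the required degree of parallelism.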
Offline Simulation
Offline simulations (steady-state, electromagnetic transient simulation(EMT), etc.) are performed, e. g., for grid expansion planning, maintenanceplanning, commissioning of new operating equipment, and so forth. Asoffline simulations are not performed simultaneously to grid operation,they do not have any RT requirements. Nevertheless, low simulationexecution times are important to obtain simulation results in acceptabletime frames when many use cases or scenarios (e. g. the same power gridwith various switching events, changing its topology during simulation)have to be simulated or in case of simulation models with thousands ofnodes.
Large-Scale Simulation
Such simulations with several thousand nodes, called large-scale, become important when simulation environments shall be applicable to real-world scenarios rather than to lab experiments only. Though there are commercial simulation tools which allow large-scale power grid simulations for certain use cases, they have a significant disadvantage: they are closed source, and thus changes to existing models (i. e. component models) or to the solvers are often not possible. However, the further development of models is an essential concern of scientific research, to adapt them for future applications in smart grids, as the lack of inertia in power grids through a decreasing ratio of big power generators and more distributed energy resources (DERs) can lead to frequency instabilities that cannot be simulated by conventional models. Hence, at the Institute for Automation of Complex Power Systems (ACS), new methods and concepts are implemented in open-source software which can be used and improved by everyone. Here, it should be noted that not only publicly funded scientific facilities can benefit from open-source simulation software but also economic enterprises, of which some increasingly count on open-source alternatives instead of closed-source products. For instance, RTE-France, the French transmission system operator (TSO), also develops open-source simulation environments such as Dynaωo [Gui+18]. But, as in the case of commercial software, compliance with international standards of associations such as CIGRE, IEC, IEEE, and VDE is crucial for the comparability of solution approaches and study results, and for applicability in existing system environments.
Co-Simulation
Especially in the case of co-simulation – a definition is given in [Sch+15] – where multiple simulators are coupled together, standardized data models for the information exchange between them are usually necessary. There are single-domain and multi-domain co-simulations.
Single-domain co-simulations can be conducted with and without RT requirements. Without RT requirements, co-simulations can be useful if the involved simulators have complementary features but there is no need for a synchronization of the simulation time with the real time (i. e. the co-simulation time can run slower or faster than the wall clock). Particularly in the power grids domain, RT requirements can come into play, e. g. with (power) Hardware-in-the-Loop (HiL), Control-in-the-Loop (CiL), and Software-in-the-Loop (SiL) use cases, where a solution (i. e. an embedded system such as a control device) has to be connected to a simulated environment to verify its correct functioning within a real environment. A special case hereof is the geographically distributed real-time simulation (GD-RTS), which is based on the concept of a virtual interconnection of laboratories in real time [Mon+18]. In this concept, a monolithic simulation model is partitioned into subsystem portions that are simulated concurrently on multiple digital real-time simulators (DRTSs). As a result, comprehensive and large-scale real-world scenarios can be simulated for the validation of the interoperability of novel hardware and software solutions with the existing power grid, without the need for all in-the-Loop setups to be located at the same facility.
Multi-domain co-simulation in the following denotes a coupling of one or more power grid simulators with simulators of other domains such as ICT, market, weather, and so on, in order to obtain a holistic view of the power system. Therefore, the term power system in this work does not stand for the power grid only but for the power grid together with any associated system, such as the ICT infrastructure and the energy market, in a holistic view. This is the key to an extensive analysis and understanding of smart grids as depicted in [BS14a; BS14b].
In the previous sections, the use of large-scale single- and multi-domain co-simulation as a solution for the analysis and development of smart grids with a continually growing share of renewable energy sources was motivated. The merit of applying simulations during power system operation, as well as for the planning of power systems as conducted for decades, is undisputed.
Due to the three main challenges arising through the transition to smart grids (see Sect. 1.1) – more automation and control in local distribution grids (e. g. because of the needed digitalization), the higher demand for flexibility (e. g. by demand side management), and the need for a stronger market integration of consumers – the power system models become more and more complex. This requires an ever greater performance of the utilized computer systems.
1.3 Contribution
The main objective of this dissertation is the application of high-performance computing (HPC) methods in the area of Energy Informatics and their analysis for improving power system (co-)simulation software, in order to allow simulating more complex component models as well as larger system models in an appropriate time. While in the past processor performance increased continuously with increasing CPU clock rates, since around 2005 this has no longer been the case because of the power wall [Bos11]. From then on, computer performance was increased by a growing number of cores per processor and by accelerators such as, e. g., graphics processing units (GPUs) and Intel Xeon Phi adapters. Nowadays, the power draw is not a problem of central processing units (CPUs) only but also of whole supercomputers. Therefore, while the trend to more parallelism continues, HPC system designers are more and more turning to hardware architectures and accelerators with high power efficiency (usually measured in FLOPS per watt) like GPUs, Advanced RISC Machines (ARM) processor based systems, or field-programmable gate array (FPGA) accelerators [Gag+19]. As in the case of multi-core and manycore systems with special instruction sets for performance improvements (e. g. vector instructions), software nowadays must be adapted continuously to make use of such new hardware features and accelerators. Under these circumstances, the focus is on the improvement of different aspects of state-of-the-art and currently developed simulation solutions in academia as well as in enterprises. Thus, the intention was not to develop new simulation concepts or applications that would make large-scale HPC on supercomputers or large computer clusters necessary. Rather, especially the computer and network hardware of modern commodity clusters is in the focus of the contribution.
Figure 1.1 shows the real-world challenge of an improved coordination of smart grid operation and grid user behavior. This is addressed by a solution based on an appropriate and therefore increasingly complex modeling as well as (co-)simulation for smart grid planning and operation. The three major aspects of the solution, the contribution of this work
[Figure 1.1: Contribution overview of this work. The diagram links the transition of conventional power grids to smart grids (challenge: smart grids require improved coordination of grid operation and grid user behavior; solution: appropriate and more complex modeling and (co-)simulation for smart grid planning and operation) to the three aspects modeling, simulation, and information exchange, underpinned by high-performance computing and energy informatics. Chapter 2: multi-domain co-simulation with a holistic CIM-based topology data model; Chapter 3: automated (de-)serializer generation from CIM UML; Chapter 4: from CIM-based topologies to simulator-specific system models; Chapter 5: modern LU decompositions in power grid simulation; Chapter 6: exploiting parallelism in power grid simulation; Chapter 7: HPC Python internals and benefits; Chapter 8: HPC network communication for HiL and RT co-simulation.]
refers to, are modeling, simulation, and information exchange. Arrows from bottom to top illustrate the contribution of this work to these major aspects of large-scale power system (co-)simulation.
On the one hand, mathematical component models become more complex, for example because of an increasing use of power electronics; on the other hand, the complexity of system models increases, for example because of new electrical equipment and facilities, in the case of smart grids ever more often with connections to other domains such as ICT, weather, mechanics, the energy market, etc., which also need to be simulated. Therefore, a contribution of this thesis is the presentation of a multi-domain co-simulation architecture with a holistic (i. e. multi-domain) topology data model which is based on the Common Information Model (CIM) as standardized by IEC 61970 / 61968 / 62325, describing terms in the energy sector and the relations between them. CIM plays an important role as it belongs to the IEC core semantic standards for smart grids [LE].
CIM makes use of the Unified Modeling Language (UML), which is state-of-the-art in computer science for the specification of classes and their relationships in object-oriented software design. This thesis therefore contributes the concept of an automated (de-)serializer generation from a specification based on UML. Among others, the automated code generation process implements a CIM data model in C++ according to the given UML specification. It can be applied whenever the UML specification changes between its versions, which usually happens a couple of times per year. This avoids manual changes in a code base with currently around one thousand classes and many relations among them, which would be very time-consuming and error-prone. The resulting (de-)serializer allows reading in CIM documents in C++, according to the CIM-based data model, modifying the data in the main memory, and writing the data into CIM documents.
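The principle of such specification-driven code generation can be illustrated with a much-simplified sketch in Python (the actual CIM++ generator emits C++ from the CIM UML schema; the class specification, attribute names, and serialization format below are invented for illustration):

```python
# Hedged sketch of (de-)serializer generation from a class specification.
# The spec format is illustrative, not the actual CIM UML schema.
CLASS_SPEC = {
    "ACLineSegment": {"r": "float", "x": "float", "length": "float"},
    "PowerTransformer": {"ratedS": "float", "name": "str"},
}

TEMPLATE = '''class {name}:
    def __init__(self, {args}):
{assigns}

    def serialize(self):
        return {{"class": "{name}", {fields}}}
'''

def generate_class(name, attrs):
    """Emit source code for one class according to the specification."""
    args = ", ".join(f"{a}=None" for a in attrs)
    assigns = "\n".join(f"        self.{a} = {a}" for a in attrs)
    fields = ", ".join(f'"{a}": self.{a}' for a in attrs)
    return TEMPLATE.format(name=name, args=args, assigns=assigns, fields=fields)

# Regenerating the whole code base for a new spec version is just a loop:
namespace = {}
for cls, attrs in CLASS_SPEC.items():
    exec(generate_class(cls, attrs), namespace)

line = namespace["ACLineSegment"](r=0.1, x=0.35, length=2.0)
```

Because every class is generated from the specification, a changed specification only requires rerunning the generator instead of editing around one thousand classes by hand.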
Due to CIM’s fine granularity over several abstraction levels, a component (e. g. a power transformer) consists of many CIM objects. This is a reason why a mapping from CIM to a simulator-specific system model is intricate. However, when a mapping to the system model of a certain simulator is achieved, the mapping can often also be used for the system models of different simulators. Therefore, a template-based mapping from CIM to system models is proposed. The templates allow a specification of how model parameters from a CIM document have to be written into the simulator-specific system model target format. The advantage of templates is that if the system model format is written in a given language (e. g. Modelica), the templates are written in the same language, with placeholders for the data from a CIM object to be mapped. Therefore, the user
does not need to learn another language for specifying the system model format.
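The template idea can be sketched as follows, assuming a hypothetical Modelica-like component line with invented parameter names; CIMverter's actual template syntax differs:

```python
from string import Template

# Hedged sketch of template-based CIM mapping: the template is written
# in the target language (here, a Modelica-like one-liner) with
# placeholders for values taken from a CIM object. All names are
# illustrative, not real CIM or CIMverter identifiers.
line_template = Template("PiLine $name(R=$r, X=$x, len=$length);")

# values gathered from the many CIM objects describing one component
cim_object = {"name": "Line12", "r": 0.1, "x": 0.35, "length": 2.0}

model_line = line_template.substitute(cim_object)
print(model_line)  # PiLine Line12(R=0.1, X=0.35, len=2.0);
```

Because the template is plain target-language text, adapting the output to a new simulator version means editing the template, not the translation code.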
Simulation environments make use of various approaches for the transition from a system model to a system of mathematical equations, which is solved by the simulation solver. In the case of power grid simulations, for instance, the resistive companion approach in combination with the Newton(-Raphson) method can be applied, which results in a linear system of equations (LSE) for each time step and Newton iteration. In another approach, all component models can be combined into a differential-algebraic system of equations (DAE), which is then passed to a DAE solver that finally linearizes it to LSEs as well. For power grid simulations, LSEs typically are very sparse (i. e. the fraction of non-zero elements in the matrix is typically much less than 1 ‰) and therefore require appropriate LSE solvers. The contribution in this work is a comparative analysis of several modern LU decompositions for the solution of sparse LSEs coming from power grids against KLU¹, which is a well-established LU decomposition for electric circuits and therefore taken as the reference. The LU decompositions concerned are called modern as they are developed especially for current multi-core or massively parallel computer architectures. The comparison is based on benchmark matrices that arose during power grid simulation, and on simulations performed by existing simulation environments into which the most promising LU decompositions were integrated.
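To illustrate what an LU-based LSE solver does in principle, the following is a toy dense LU elimination with partial pivoting in Python; production solvers such as KLU additionally perform a symbolic analysis and fill-reducing orderings to exploit the sparsity discussed above:

```python
# Toy Gaussian elimination (LU with partial pivoting) for A x = b.
# Purely illustrative: real sparse solvers avoid touching zero entries.
def lu_solve(A, b):
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = b[:]
    for k in range(n):
        # partial pivoting: largest magnitude in column k
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]       # multiplier (entry of L)
            for j in range(k, n):
                A[i][j] -= m * A[k][j]  # update row (entries of U)
            b[i] -= m * b[k]
    # back substitution on the upper-triangular factor
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# e.g. a small nodal system
x = lu_solve([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

For the repeated solves per time step and Newton iteration, solvers reuse the factorization structure, which is where the modern decompositions differ most.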
There are various methods of expressing parallelism in power system simulation. On the one hand, the processing within a simulation solver can be parallelized, for instance with the aid of a certain parallel programming paradigm in the solver's programming language (e. g. with parallel constructs using OpenMP in C++ [Ope19b]). Similarly, on the other hand, parallelism in a system model can also be expressed with the aid of a formalism for parallel structures in the model (e. g. with parallel constructs in the modeling language ParModelica [Geb+12]). Besides such an explicit expression of parallelism in the solver or model, it is also possible to extract parallelism, e. g., from mathematical models at equation level, which is a variant of the already existing automatic fine-grained parallelization of mathematical models. The contribution of this work, however, is the introduction of an automatic exploitation of parallelism in system models at component level, therefore called an automatic coarse-grained parallelization of mathematical models. For this coarse-grained parallelization of mathematical models, parallel task scheduling methods are introduced. Accordingly, various task schedulers allow the parallel
¹ The “K” in KLU stands for “Clark Kent”, which is the bourgeois identity behind the fictional superhero Superman. This is an allusion to SuperLU, which is a well-known LU decomposition for sparse linear systems [DP10].
processing of tasks related to component models within one simulation step. An analysis of the whole implementation shows the execution time speedups with respect to different scheduling methods as well as other modeling and software engineering aspects.
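The principle of such coarse-grained, level-based task scheduling can be sketched in Python (the task graph below is invented; DPsim's actual schedulers, discussed in Chapter 6, are more elaborate):

```python
from concurrent.futures import ThreadPoolExecutor

# Hedged sketch of level-based task scheduling: tasks whose dependencies
# are all satisfied form one "level" and are data independent, so they
# can run concurrently within a simulation step. Task names are made up.
tasks = {                       # task -> set of tasks it depends on
    "line1": set(), "line2": set(),
    "node_solve": {"line1", "line2"},
    "logger": {"node_solve"},
}

def levels(deps):
    """Group tasks into dependency levels (simple list scheduling)."""
    done, order = set(), []
    while len(done) < len(deps):
        level = [t for t, d in deps.items() if t not in done and d <= done]
        if not level:
            raise ValueError("dependency cycle")
        order.append(sorted(level))
        done.update(level)
    return order

executed = []
with ThreadPoolExecutor(max_workers=4) as pool:
    for level in levels(tasks):
        # tasks within one level run concurrently; levels run in order
        list(pool.map(executed.append, level))
```

A real scheduler would additionally balance task costs across workers, which is one of the differences between the scheduling methods compared in Chapter 6.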
Power system simulation requires not only simulation itself but also data processing before the simulation (e. g. load and generation profiles), during the simulation (e. g. data exchanged between simulators), and after the simulation (e. g. simulation results). Since Python, as a modern and relatively easy-to-learn scripting language, is enjoying ever growing popularity among programming beginners, many power engineers program diverse parts of software projects in the area of power system simulation in Python. Especially the pre- and postprocessing of simulation data is performed in Python, while the simulation cores are often programmed in other programming languages such as C++. Sometimes the execution times of (usually interpreted) Python applications are too long for given use cases, and there is not enough time or a lack of know-how to port the Python application to a more runtime-efficient language such as, e. g., C++. Admittedly, there are Python modules, just-in-time (JIT) compilers, and Python language extensions which allow improving the runtime efficiency of Python programs, but their internals and benefits are rather unknown. The contribution of this work is therefore an overview and comparative analysis of the most popular approaches for the performance improvement of Python, not necessarily with the aid of parallelization (e. g. multithreading).
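The root cause addressed by these approaches can be illustrated with a small timing sketch: moving an inner loop from interpreted bytecode into compiled code (here the C-implemented built-in `sum`, standing in for what NumPy, Cython, or a JIT compiler achieve) typically yields large speedups:

```python
import timeit

def py_sum(data):
    """Summation where every iteration runs through the interpreter."""
    total = 0.0
    for v in data:
        total += v
    return total

data = [float(i) for i in range(100_000)]

t_py = timeit.timeit(lambda: py_sum(data), number=20)  # interpreted loop
t_c = timeit.timeit(lambda: sum(data), number=20)      # loop in C
print(f"interpreted: {t_py:.4f}s, compiled loop: {t_c:.4f}s")
```

Both variants compute the same result; only the per-iteration interpreter overhead differs, which is exactly the overhead that the approaches analyzed in Chapter 7 reduce.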
Co-simulations as well as HiL setups require an information exchange between simulators as well as between devices and simulators. Especially in the case of RT applications, short latencies in information exchange can be crucial. To reduce latencies, HPC interconnects, in contrast to commonly used interconnects, provide connection modes in which data is directly transmitted to or read from the main memory of a remote server without involving the operating system or a process running on the remote server, as is usually the case. Therefore, a contribution of this thesis is the presentation of InfiniBand (IB), a widely used HPC network communication standard, and its integration into a state-of-the-art software framework that can, for instance, be freely used for the hard RT coupling of devices with simulators in case of HiL setups as well as for the coupling of simulators in case of hard RT co-simulations with very low latencies.
All the contributed approaches were implemented or integrated in existing or new open-source software projects which can be used and investigated. Moreover, the concepts and analyses introduced in this work for an improved modeling, simulation, and information exchange shall support other researchers, developers, and users of (co-)simulation software.
1.4 Outline
Chapter 2 shows the benefits of multi-domain co-simulation and introduces an appropriate co-simulation environment for the three smart grid domains power grid, communication network, and energy market, developed in the research project SINERGIEN. This SINERGIEN environment is the starting point for several approaches, concepts, and analyses which are presented in the following chapters. The usage of UML for the specification of CIM allows extending it to a holistic topology data model that is used for the SINERGIEN co-simulation environment with simulators for the three mentioned domains.
Chapter 3 presents the automated (de-)serializer generation from a specification based on UML. The automated deserializer generation is implemented in the CIM++ software project, which can map CIM, as specified by UML, to a C++ code base, also implementing the holistic CIM-based data model. The open-source software library created this way allows reading and writing arbitrary CIM-based documents in C++.
Chapter 4 shows the approach on how CIM-based documents for power grid topology representation can be translated into simulator-specific system models with the aid of template documents. In the SINERGIEN environment this became necessary for the power grid simulator based on Modelica to run simulations of power grid topologies stored in CIM-based documents, as CIM is used more and more by distribution system operators (DSOs) and TSOs. The translation from CIM to a simulator-specific model was implemented in the open-source software CIMverter. It uses template documents which make it possible to modify the simulator-specific system models to be output in case the input format of the target simulator changes, e. g. because of a newer version which allows setting more parameters or including new component models in a system model. This allows a flexible adaptation of the translation from CIM to a supported simulator-specific model without a recompilation of CIMverter, which is also shown in this chapter.
Chapter 5 outlines the comparative analysis of several modern LU decompositions for sparse linear systems. In the first part of the analysis, they are compared on different benchmark matrices arising from simulations of large-scale power grids. This analysis was a help for deciding which LU decomposition is worth integrating into existing simulation environments. In the second part, the most promising modern decomposition (after its integration) is compared with the reference decomposition by simulations with both a fixed time step and a variable time step solver. For this purpose, these LU decompositions were, inter alia, integrated into the DAE
solver used by the open-source simulation environments OpenModelica and Dynaωo.
Chapter 6 presents the approach for exploiting parallelism in power grid simulation from the newly introduced type of approaches described as automatic coarse-grained parallelization of mathematical models, for a higher performance through the parallel computations it enables in power system simulators. This approach is applied to a newly developed open-source power grid simulator called DPsim. At first, the implemented parallelization approach is categorized into the existing parallelism categories of simulation models. Moreover, an overview of formally defined scheduling methods for the parallel processing of data independent tasks is provided. This is followed by a performance analysis of the implemented task parallelization methods.
Chapter 7 provides an overview of the internals of HPC approaches to improve the runtime of Python applications and a comparative analysis of these approaches. The comparative analysis is based on various benchmark algorithms of different algorithm classes that were programmed in Python and in C++, as an efficient reference. This comparative analysis can help Python programmers to choose the right approach for increasing the performance of Python applications with or without multithreading, i. e. with threads that are really executed in parallel, which is not always the case in Python, as will be explained as well.
Chapter 8 presents the integration of an HPC network communication into HiL and RT co-simulation. The HPC interconnect solution chosen for the integration into the open-source VILLASframework, which can be utilized for the setup of HiL simulations and the (even hard) RT coupling of DRTSs, is based on IB. IB was chosen as it is an open standard that is implemented by various manufacturers. The integration of IB is also compared with other communication methods provided by the VILLASframework.
Chapter 9 concludes the dissertation, providing a summary and discussion on all topics of this work. Moreover, it gives an overview of future work that can be conducted for an improvement of the introduced concepts as well as their analyses and implementations.
2 Multi-Domain Co-Simulation
More and more distributed energy resources (DERs) at distribution level cause bidirectional power flows between distribution and transmission level, which require changes in the related information and communications technology (ICT) and energy market mechanisms. The associated extension of measurement devices to lower voltage layers, for instance, requires appropriate communication network capabilities to meet the requirements on the exchange of measurement data between the measurement devices and all involved entities such as control centers and substations. Therefore, electrical grids and the associated communication networks should be planned holistically to take the interactions between both domains into account [Li+14]. Apart from that, new energy market models are developed for customers (i. e. prosumers) to empower them to take a more active role in the exchange of energy with the grid [WH16], in a way that their behavior will be considered in grid operation [EFF15] and possibly vice versa. Given these facts, it is reasonable to also include the energy market simulation in the planning to get a holistic picture of future grids.
The integration of energy market mechanisms, the communication network, and the power grid into future studies on power grids is hampered by a lack of established modeling approaches which encompass the three domains, and there are only few tools which enable a joint simulation. In this chapter, a comprehensive data model is presented together with a co-simulation architecture based on it. Both enable an investigation of dynamic interactions between power grid, communication network, and market. Such interactions can be technical constraints of the grid which require actions
on the market side, as well as communication failures which affect the communication between grid and market, or market decisions that change the behavior of a generation unit or of energy prosumers connected to the grid. For this purpose, a data model based on the Common Information Model (CIM) as standardized in IEC 61970/61968/62325 was created, to be able to describe an entire smart grid topology with components and actors from all three domains. This data model is called SINERGIEN_CIM as it resulted from the research project SINERGIEN. It allows the storage of the whole network topology with components from all three domains in a single well-defined data model, hiding some complexity of the simulation from the users. SINERGIEN_CIM-based topology descriptions are processed by the co-simulation architecture as presented in [Mir+18].
Some parts of the SINERGIEN co-simulation architecture will be addressed in the following as they are relevant for the research and development that is presented in the following chapters of this dissertation.
After a section on the related work and another one on various use cases for multi-domain simulation, the challenges for the realization of the implemented SINERGIEN co-simulation environment are discussed. This is followed by a section about the concept and a further one on its validation by a use case. The chapter is concluded with final remarks in its last section. The work in this chapter has been partially presented in [Mir+18]¹.
2.1 Fundamentals and Related Work
2.1.1 Architecture and Topology Data Model
A major formal modeling method for future intelligent power grids is given by the Smart Grid Architecture Model (SGAM) [Sma12]. The SGAM framework provides five layers: physical components in the network (component layer), protocols for information exchange between services or systems (communication layer), data models which define the rules for data structures (information layer), functions and services (function layer), as well as business and market models (business layer). Furthermore, the model divides all two-dimensional layers along the domain dimension, from generation over transmission and so forth to customer premises, and along the zones dimension, from process over field and so on to the market.
¹ “A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid” by Markus Mirz, Lukas Razik, Jan Dinkelbach, Halil Alper Tokel, Gholamreza Alirezaei, Rudolf Mathar, and Antonello Monti is licensed under CC BY 4.0
SGAM shall accelerate and standardize the development of unified data models, services, and applications in industry and research. In this context, the SINERGIEN data model and the co-simulation framework build upon SGAM as follows:
• the unified data model formally defines the data exchange structure in alignment with the information layer concept of SGAM (see Sect. 2.4);
• the domain-specific simulators of our co-simulation environment include models of power grid and communication network components as well as market actors in the distribution, DER, and customer premise domains of the SGAM component layer;
• the communication layer is abstracted by a co-simulation interface and software extensions for the particular domain-specific simulators in order to enable data exchange between the components (see Sect. 2.4);
• the example use case presented in Sect. 2.2, with an optimal management of distributed battery storage systems, is an example of a system function that would fall into the SGAM function layer. Furthermore, the business model motivating the provision of a proper system function, e. g. an incentive by a distribution system operator (DSO), is defined within the business layer.
For our unified data model we chose CIM as a well-established basis for power grid data that can be extended in a flexible manner. An extension of CIM was needed for the communication infrastructure and the energy market, as for example shown in [Haq+11] and [Fre+09].
2.1.2 Common Information Model

Some of the most important smart grid related standards (i. e. core standards) are issued by the IEC Technical Committee 57 (IEC TC 57). The so-called CIM is standardized in IEC 61970 (Energy Management Systems), IEC 61968 (Distribution Management), and IEC 62325 (Energy Market Communications) [IEC12b; IEC12a; IEC14]. Therefore, CIM belongs to the core standards included in the IEC/TR 62357 reference architecture [IEC; IEC16b]. Originally, CIM was developed as a database model for energy management systems (EMSs) and supervisory control and data acquisition (SCADA) systems but then changed into an object-oriented approach for electric distribution, transmission, and generation. Use cases of CIM are system integration using pre-defined interfaces between the
IT of distribution management systems (DMSs) and automation parts, custom system integration using XML-based payloads for a semantically sound coupling of systems, and serializing topology data using the Resource Description Framework (RDF) [Usl+12]. The IEC considers CIM and the IEC 61850 series as the pillars for a realization of the smart grid objectives of interoperability and device management [LE].
2.1.3 Simulation of Smart Grids
Example approaches for co-simulations of power grids and communication are presented in [Li+14; Hop+06; ZCN11; Lin+12], with a focus on short-term effects and therefore not including the energy market. In MOCES [EFF15] a holistic approach is taken for modeling distributed energy systems, but the result is a monolithic simulation and not a co-simulation with a hybrid simulation for the physical part and an agent-based part for behavior-based simulations, e. g., coming from the market. With the SINERGIEN co-simulation environment, the advantages of existing tools shall be harnessed, which enhances the credibility of simulation results and obviates reinventing the wheel. The SINERGIEN co-simulation platform consists of several domain-specific simulators with the possibility to use the “best tool” for each domain.
2.1.4 Classification of Simulations
In [Sch+15] a classification scheme for energy-related co-simulations is introduced, with the four modeling categories continuous processes, discrete processes / events, roles, and statistical elements. The power grid in the SINERGIEN co-simulation environment is modeled based on Modelica. A short introduction to Modelica is provided in Sect. 4.1.1. Thermal systems [Mol+14] as well as power grids [MNM16] have been modeled in Modelica. The Modelica models in the SINERGIEN co-simulation express continuous as well as discrete processes and events, which makes the power grid simulation a hybrid simulation. The communication network is simulated with available discrete event simulation (DES) tools, such as ns-3. In a DES, the simulation time does not proceed continuously but advances with the occurrence of certain events such as packet arrival, timer expiry, etc. [WGG10]. The energy market simulation was also implemented as a DES, but in Python, which is flexible and suitable for testing different optimization methods [Har+12]. Each market participant aims at optimizing the schedule for its assets, e. g., minimizing energy costs and maximizing its profit. Examples of statistical elements are, e. g., the wind farm models of the power grid simulator.
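A minimal DES event loop, where simulation time jumps from event to event instead of advancing in fixed steps, can be sketched in Python (the event names are illustrative, in the spirit of an ns-3-style network simulation):

```python
import heapq

def run(events, horizon):
    """Minimal DES loop. events: list of (time, name, callback) tuples;
    a callback may schedule new events by returning more such tuples."""
    queue = list(events)
    heapq.heapify(queue)                  # priority queue ordered by time
    trace = []
    while queue and queue[0][0] <= horizon:
        t, name, callback = heapq.heappop(queue)
        trace.append((t, name))           # time jumps directly to t
        for new_event in callback(t):
            heapq.heappush(queue, new_event)
    return trace

# packet sent at t=1.0 arrives 0.2 s later; a timer fires at t=3.0
send = lambda t: [(t + 0.2, "packet_arrival", lambda t2: [])]
timer = lambda t: []
trace = run([(1.0, "packet_sent", send), (3.0, "timer_expiry", timer)], 10.0)
```

Nothing happens between events, which is why a DES can cover long market horizons cheaply while a continuous power grid simulation must step through the interval.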
In view of the above, the SINERGIEN co-simulation environment is formalized as a coupled Discrete Event System Specification (DEVS) as defined in [ZPK00]. This formalization is shown in Sect. 2.4.
2.2 Use Case
The SINERGIEN environment can be used for an evaluation of different scenarios with
• fast phenomena in the range from microseconds to seconds (i. e. with smaller simulation time steps) between highly dynamic power grid components, e. g., power electronics, and the communication network, and

• slow phenomena in the range from minutes to hours (i. e. with larger simulation time steps) that include market entities, the power grid, and the communication network.
More on these two phenomena classes can be found in [Mir+18], with a focus on slow phenomena and a discussion on fast phenomena containing a description of the adaptations needed for fast phenomena investigations.
Based on this classification, it can be concluded that the three simulators do not necessarily need to participate in each co-simulation. The example use case that was chosen in [Mir+18] for a validation of the SINERGIEN environment was an optimal management of distributed storage systems for peak-shaving to support the grid operation. The SINERGIEN environment including the communication network allows testing the effects of communication failures on the operation strategy and ultimately on the electrical grid, which can provide valuable insights for decision making. Simulation results for this example are also provided in [Mir+18].
Before the co-simulation is initiated, it is necessary to define and store the topology under investigation along with the scenario-specific parameters. For example, various scenarios in which failures in the communication network are stochastically or deterministically set by the user in the data model can be examined. From a user perspective it would be advantageous if all components, their links, and parameters could be defined in one environment rather than splitting this information between different software solutions and formats. Then, the data model for the topology needs to include components that couple different domains.
Under these requirements, the following challenges were identified:
• definition of a common data model that includes components of all domains and their interconnections;
• interaction of simulators with different simulation types, e. g., event-driven for the communication network and continuous processes for the power grid;
• choice of the co-simulation time step, which is limited by the synchronization method connecting the simulators.
2.3 Challenges
A major issue in the coupling of simulators with different modeling approaches is the selection and implementation of a synchronization mechanism which ensures a proper progress of the simulation time and a timely data exchange between the simulators. This selection is of crucial significance for the minimization of the error propagation in the co-simulation and of the synchronization overhead in terms of simulation time. Since this is out of the scope of this work, please refer to [Mir+18] for more details. The definition and implementation of a new proper data model, however, involving all three mentioned domains, is crucial for the whole following work on large-scale co-simulation.
Holistic Topology Data Model
A common data model that covers the power grid, communication infrastructure, and electrical market did not exist. Besides the benefit for the user of a co-simulation environment with a single data model for the specification of a holistic co-simulation topology, the data exchange between simulators is also simplified. A system description that encompasses all components of smart grids as shown in Fig. 2.1 (1) can either be used directly by a single multi-domain smart grid simulator or divided into subsystems for a co-simulation as in Fig. 2.1 (2). For many components, this division is obvious since their parameters are only needed by one domain-specific simulator, but some components (called inter-domain components) constitute natural coupling points between the three domains. For instance, a battery storage device connected to the grid can act as a market participant that offers its capability to charge or discharge. In order to enable its participation in the energy market, the battery storage needs an interface, which is a communication modem in this case. The modem can be seen as a part of the battery storage. For a co-simulation, the information on inter-domain components must be split into several parts, as each simulator has to simulate a dedicated part of these components.
2.4 Concept of the Co-Simulation Environment
2.4.1 Holistic Topology Data Model
As already mentioned, a holistic data model for a whole three-domain co-simulation topology can be based on CIM with an extension by further classes. These classes, introduced to complete CIM in its representation of smart grids, are linked to already existing CIM classes using the Unified
Figure 2.1: Exemplary topology including components of (1) all domains and (2) domain-specific topologies
Chapter 2 Multi-Domain Co-Simulation
Modeling Language (UML). The proposed format can be structured in four packages:
• Original CIM (IEC 61970 / 61968 / 62325),
• Communication,
• Market, and
• EnergyGrid.
Whenever suitable, original CIM classes are reused. However, some components do not have an associated class in the standard yet and are therefore added in one of the other three packages. This approach allows a flexible update to a new CIM version without losing the added classes with their links.
The most important feature of the SINERGIEN data model is the interconnection of domains. Examples of inter-domain components, namely BatteryStorage, SolarGeneratingUnit, and MarketCogeneration, are shown in Fig. 2.2, an excerpt from the SINERGIEN data model. According to the UML diagram, the energy market components are associated with the power grid components, whereas power grid components have an
Figure 2.2: Inter-domain connections between classes of power grid, communication network and market
aggregation relationship to communication devices. This means that parameters specific to the market, communication network, and power grid which relate to the same device are linked with each other. Therefore, all information on one device is easily accessible, but at the same time there is a separation according to the domains. The connections between classes of different domains are defined in a logical and not a topological manner; topological connections exist, for instance, to interconnect power grid components.
In the mentioned battery storage device example, the data model is as follows: the device is a part of the grid and has electrical parameters. Furthermore, the battery storage might participate in the market, e. g., as part of a virtual power plant (VPP). Market-specific information can be stored in objects of the MarketBatteryStorage class, which is associated with the BatteryStorage. The communication modem ComMod, which could be used to communicate with the VPP, is aggregated to the BatteryStorage class.
The three additional packages EnergyGrid, Communication, and Market are needed for the following reasons:
• some newer components occurring in power grids are missing in original CIM. For instance, it was necessary to create a new model for electrical energy storages like stationary batteries. A battery storage is a conducting equipment that is able to regulate its energy throughput in both directions. Therefore, the class BatteryStorage added in the EnergyGrid package is a specialization of a CIM RegulatingConductingEquipment since it can influence the flow of power at a specific point in the grid.
• the key component of the Market package for the scenarios that we would like to investigate is a VPP, since the aggregation of small DER units enables their participation in electricity markets.
• the Communication package includes all additionally defined classes that are related to the communication network model, such as classes for communication links and technologies, modems, and network nodes, along with their parameters and their relations with the classes in CIM, the power grid package, and the market package.
Figure 2.3 shows an excerpt from the communication data model with an aggregation to a WindGeneratingUnit. By means of the associated classes for modems, communication requirements and channels, the model enables a description of network parameters and topology. More on the packages can be found in [Mir+18].
2.4.2 Model Data Processing and Simulation Setup
The overall information flow for the simulation setup is depicted in Fig. 2.4. After the holistic topology, including all objects of the three domains, is edited in a graphical Topology Builder, it is forwarded to the co-simulation interface. In order to execute a simulation, the Modelica solver requires a Modelica model, whereas the communication network topology can be given to the communication network simulator in CIM format, which includes the components of the network, their connections, and parameters. The co-simulation interface incorporates a component called CIMverter, based on CIM++ [Raz+18a] presented in Chap. 3. The CIMverter [Raz+18b] reads in the CIM document and outputs a Modelica system model (Chap. 4) for the power grid simulator. In contrast, the Python-based market simulation relies on a C++/Python interface, which could be realized using one of the common libraries for wrapping C++ data types and functions in Python, to retrieve the market-relevant information from the C++ objects and store it in Python objects. A detailed explanation of the translation from CIM to Modelica is given in Chap. 4.
Figure 2.3: Communication network class association example
2.4.3 Synchronization
The synchronization during simulation is performed at fixed time steps. For slow phenomena scenarios this is managed by mosaik, a well-established co-simulation framework [SST11]. It allows coupling the three simulators in a simple manner, as explained in Sect. 2.4.4, in the case of longer synchronization time steps. VILLASnode, a software project for coupling real-time simulations in LANs [Vog+17; Ste+17], is a suitable alternative to mosaik in the case of very short synchronization time steps.
In Modelica, the synchronization data exchange is achieved by integrating Modelica blocks of the Modelica_DeviceDrivers library, which was originally developed for interfacing devices to Modelica environments [Thi19]. The library conveniently allows the definition of a fixed interval for data exchange that can be different from the simulation time step. More on this choice and the integration can be found in [Mir+18]. Figure 2.5 depicts the flow of time for the co-simulation and each simulator. The power grid and market simulators compute in parallel, whereas the communication network is waiting for their inputs.

Figure 2.4: Overall SINERGIEN architecture for simulation setup

The interactions between the simulators in each co-simulation step can be formalized by
u_p(n + 1) = F_c(F_m(u_m(n))),  (2.1)
u_m(n + 1) = F_c(F_p(u_p(n))),  (2.2)

where u_c, u_m and u_p are the corresponding input values of the simulators for the communication network, energy market and power grid at each time step. Therefore, it is required to set the initial values u_p(0), u_m(0), u_c(0) at the beginning of the co-simulation. n denotes the current co-simulation time step. F_c (communication), F_m (market) and F_p (power grid) are the functions describing the calculation within a step.
2.4.4 Co-Simulation Runtime Interaction

Figure 2.6 shows the coupling of the simulators for their co-simulation runtime interaction with the following entities:

mosaik As already mentioned, mosaik is used for the coordination of all simulators during the synchronization steps of several minutes (in simulation time) [Sch19].
Market Simulator Implemented in Python, it can make use of mosaik's so-called high-level API as illustrated in Fig. 2.6.
Communication Network Simulator Based on available DES tools, their network simulation modules are extended with inter-process communication functionalities for message exchange with mosaik.
Figure 2.5: Synchronization scheme of simulators at co-simulation time steps
Power System Simulator The integration of so-called TCPIP_Send/Recv_IO blocks from Modelica_DeviceDrivers into the Modelica models allows the exchange of simulation data via sockets, but in the form of Modelica variables as bitvectors instead of messages in JSON, an open-source and human-readable data format [ecm19]. Therefore, the MODD Server is implemented.
MODD Server It receives commands from the socket connected with mosaik. Based on these commands it starts, for example, the power
Figure 2.6: Scheme of runtime interaction between co-simulation components
grid simulator, or it receives the bitstream from Modelica_DeviceDrivers and encapsulates it into JSON messages before transferring them to mosaik. Besides the synchronization steps controlled by mosaik, there will also be more fine-grained synchronization steps of fractions of seconds between the power grid and communication network simulator. That is why a VILLASnode gateway is included.
VILLASnode Instead of the Transmission Control Protocol (TCP) as in the case of mosaik, VILLASnode can make use of InfiniBand (IB) interconnects for data exchange between real-time simulators on different machines and of shared-memory regions on the same machine. The use of shared-memory regions and IB interconnects leads to lower latencies and consequently to shorter synchronization time steps, as shown in Chap. 8.
For more on the formalization of the SINERGIEN co-simulation and the limitations of the environment, please refer to [Mir+18].
2.5 Validation by Use Case
The proper functioning of the SINERGIEN co-simulation environment has been validated with the aid of different use case scenarios. In the use case presented in [Mir+18], it is assumed that a VPP operator tries to reduce the VPP's peak power. This behavior could be desired by the responsible DSO and come with financial incentives. Therefore, a peak-shaving algorithm is utilized for an optimal management of distributed battery storage systems.
First, simulation results obtained without the SINERGIEN environment were compared with results obtained with it, demonstrating that the results do not change under the assumption of an ideal communication network when simulating the same scenario. Furthermore, another scenario was presented in which the communication network was assumed to impair the control loop between the power grid and the market due to communication device failures. More details on the co-simulated scenarios can be found in [Mir+18], as the simulations themselves are not the focus of this work. With the simulation results of both scenarios, the proper functioning of the co-simulation environment has been demonstrated.
2.6 Conclusion
The architecture of the implemented multi-domain co-simulation environment presented here shows the applicability of the CIM-based holistic data model for smart grid simulations which include the three domains: power grid, communication network, and market. The data model facilitates the use of the software environment, since the domain-specific smart grid component parameters and their interconnections can be modified and stored in a self-contained topology description. Due to the SINERGIEN co-simulation approach, the user can take advantage of established domain-specific simulators for each domain.
For this purpose, new software tools have also been developed. The ModPowerSystem library can be used for scientific research on various models, since Modelica as a modeling language simplifies the development and improvement of component models. Because of the increasing use of CIM-based documents for grid topology representation, the choice of Modelica led to the development of a CIM to Modelica mapping that is presented in Chap. 4. Besides the initiation of the CIM-related topics (Chap. 3 and Chap. 4), the SINERGIEN co-simulation architecture illustrates how the work on HPC Python (Chap. 7) and the integration of InfiniBand in VILLAS (Chap. 8) can be used in power system co-simulation. The work in Chap. 5 and Chap. 6, however, contributes to a higher performance of the simulation itself, which is accomplished by the simulators of the co-simulation environment.
In the following chapter, the automated generation of a (de-)serializer for reading and writing CIM-based documents, implemented in the mentioned CIM++ software library, is presented.
3 Automated De-/Serializer Generation
Due to growing automation in smart grids, driven by increasing digitalization and a rising number of decentralized energy systems, the actors in this area are increasingly dependent on ICT systems that must be compatible with each other, which in particular concerns the data exchange between these systems. Therefore, different countries, organizations, and vendors started to develop smart grid related standards with different focuses on technical and economic aspects. Eventually, only a few national standards have been integrated into standards of the International Electrotechnical Commission (IEC) or the International Organization for Standardization (ISO) [Usl+12].
In recent years, the CIM standards (IEC 61970/61968/62325, see Sect. 2.1.2) have been the subject of numerous research activities, often related to use cases for CIM [MDC09; DK12; Wei+07]. Some of them, like the research project SINERGIEN, also introduce extensions by classes not included in original CIM, as, for instance, in [MMS13], where a methodology for modeling telemetry in power systems using IEC 61970/68 in the case of a US independent system operator is presented. There are also harmonization approaches motivated by the data exchange between energy-related software systems based on the CIM standards and ICT for substation automation based on IEC 61850 [LK17; Bec10; SRS10].
Since CIM is object-oriented, it specifies classes of objects containing information about energy system aspects as well as relations between these classes (referred to as the ontology) [GDD+06]. Currently, more
and more commercial software tools in the energy sector provide import and export of CIM documents. Moreover, there are already about 200 corporate members organized in the CIM User Group (CIMug), which provides CIM models for common visual Unified Modeling Language (UML) editors [CIM].
This high acceptance among companies and institutions has pushed the adoption of CIM also in the simulation environment presented in Chap. 2 with respect to the SINERGIEN co-simulation environment, where the data format of the multi-domain component-based co-simulation model (referred to as the topology) is based on CIM. As the topology, including power grid and communication network components as well as energy market actors, evolves continuously, high compatibility, updatability, and extensibility of the chosen data model are key requirements.
The object-oriented design with concepts such as inheritance, associations, aggregations, etc. led to a CIM data encapsulation format referred to as RDF/XML [IEC06], which comes from the area of the semantic web [AH11] and is not as common in other domains. This, together with the huge specification of CIM with hundreds of classes and relations between them, which makes CIM very extensible and universally applicable in comparison to other more specific and static data models, has a deterrent effect on new users. Moreover, continuously keeping CIM-based software up to date can be too laborious, especially in the scientific and academic area. These could be the reasons why there are hardly any software libraries specifically for handling CIM documents.
Therefore, in this chapter an automated (de-)serializer generation from UML-based CIM ontologies is presented. The approach was introduced in [Raz+18a] and implemented in a chain of tools for generating an open-source library, libcimpp, within the CIM++ software project [FEI19a]. libcimpp can be used for reading CIM RDF/XML documents directly into CIM C++ objects (called deserialization) and is currently also being extended for serialization (i. e. writing of CIM C++ objects from memory to RDF/XML documents). Due to a model-driven architecture (MDA), libcimpp can be adapted to new CIM versions or user-specified CIM-based ontologies in an automated way. For this purpose, the approach makes use of a common visual UML editor and our CIM++ toolchain, which generates a complete compilable CIM C++ codebase from given CIM UML models (i. e. CIM profiles) that are kept up to date (e. g. by the CIMug). It is also shown how this CIM C++ codebase can be used for holding the deserialized CIM objects as well as for an automated generation of C++ code for exactly this deserialization. Hence, if the CIM C++ codebase changes (because of changes in the CIM UML), there is no need to adapt the code of libcimpp by hand.
The direct deserialization into C++ objects makes the library very easy to apply because its user neither needs any CIM RDF/XML knowledge nor has to handle intermediate representations of the CIM RDF/XML document like a Document Object Model (DOM) in combination with the Resource Description Framework (RDF) syntax. For instance, in the case of a power grid topology stored in CIM documents, a power grid simulator can directly access the CIM objects deserialized by libcimpp in the form of common C++ objects.
This chapter gives a short introduction to the data formats as well as other components used in CIM++, followed by an overview of the overall concept. Then it describes explicitly how the Common Information Model (CIM) is mapped in an automated way to compilable C++ code, which is used by the CIM++ Deserializer (i. e. libcimpp) during the so-called unmarshalling step, explained subsequently together with its automated generation. Following this, the final libcimpp is introduced. Finally, the chapter is concluded by a roundup and an outlook on future work. The work in this chapter has been partially presented in [Raz+18a]¹.
3.1 CIM Formalisms and Formats
An introduction to CIM is provided in Sect. 2.1.2. CIM makes use of several formalisms and formats, which are explained in the following.
UML
UML is a well-established formalism for graphical object-oriented modeling [RJB04]. In CIM, only UML class diagrams with attributes and inheritance as well as associations, aggregations, and compositions with multiplicities are used. The CIM UML contains no class methods, as CIM defines just the semantics of its object classes and their relations, without any functionality of the objects, in order to specify which kind of information a CIM object contains.
CIM UML diagrams can, like other UML diagrams, be edited by visual UML editors and stored in a proprietary or open format like XML Metadata Interchange (XMI) [KH02]. Conveniently, the CIMug provides such CIM model drafts [CIM]. While UML resp. XMI is used for the definition
¹ Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Computer Science – Research and Development (“Automated deserializer generation from CIM ontologies: CIM++ — an easy-to-use and automated adaptable open-source library for object deserialization in C++ from documents based on user-specified UML models following the Common Information Model (CIM) standards for the energy sector”, Lukas Razik, Markus Mirz, Daniel Knibbe, Stefan Lankes, Antonello Monti), © (2017)
of all classes with their attributes and the relations among them, the actual objects (i. e. instances of these classes) are stored in the form of RDF/XML documents.
XML and RDF
The Extensible Markup Language (XML) is a widely used text-based formalism for human- and machine-readable documents [Bra+97]. In general, XML documents have a tree structure, which is why XML itself is not well suited for representing arbitrary graphs. Therefore, it is combined with RDF [Pan09]. RDF provides triples of the form “<Subject> <Predicate> <Object>” which allow representing a relation (<Predicate>) between resources (<Subject> and <Object>). Hence, links (i. e. instances of associations, aggregations, . . . ) between CIM objects, as specified in the UML ontology, can be expressed by RDF/XML.
For instance, in List. 3.1 the object of class BatteryStorage has an rdf:ID (line 7) which is referenced in the Terminal (line 5) with the RDF/XML attribute rdf:resource="#BS7". A brief introduction to CIM with its key concepts is provided by [McM07].
XML Parsers
There are three common types of pure XML parsers [Fri16; HR07; KH14]. During parse time, the so-called DOM parser generates a treelike structure
Listing 3.1: Snippet of a CIM document representing an IEEE European Low Voltage Test Feeder with an additional BatteryStorage
 1 <cim:Terminal rdf:ID="BADCAB1E">
 2   <cim:IdentifiedObject.name>T1
 3   </cim:IdentifiedObject.name>
 4   ...
 5   <cim:Terminal.ConductingEquipment rdf:resource="#BS7"/>
 6 </cim:Terminal>
 7 <cim:BatteryStorage rdf:ID="BS7">
 8   <cim:Equipment.EquipmentContainer rdf:resource="#C7"/>
 9   <cim:IdentifiedObject.name>Battery-1
10   </cim:IdentifiedObject.name>
11   <cim:BatteryStorage.nominalP>5000
12   </cim:BatteryStorage.nominalP>
13   <cim:BatteryStorage.ratedU>400
14   </cim:BatteryStorage.ratedU>
15   ...
16 </cim:BatteryStorage>
with strings of the whole document, which can be very memory-demanding. For further processing, the particular strings have to be picked out manually and interpreted, i. e. converted to the desired data types. To avoid loading a whole document into memory, StAX parsers (a kind of pull parser [Slo01]) can be used. They are a compromise between DOM and Simple API for XML (SAX) parsers, as they allow random access to all elements within a document. SAX parsers are most commonly used. They traverse XML documents linearly and trigger event callbacks at certain positions. Because one linear reading of the CIM document is sufficient for its deserialization, a SAX parser is used.
C++ Source Code Analysis
For C++ source code analysis, correction, and adaption, which are needed in several steps of the automated generation, a compiler front-end was chosen. It can transform source code into a so-called abstract syntax tree (AST) [Aho03]. With further functionalities provided by the compiler front-end, e. g. static code analysis [Bou13], source code manipulations can be performed. One of the conceptual ideas is to use the data from the AST as input for a template engine.
Template Engines
Template engines are mainly used for the generation of dynamic web pages [Fow02; STM10]. The core idea behind them is to separate static content (e. g. HTML code defining the structure of a web page) from dynamic data (e. g. the actual web page content). Therefore, the static part can be written in template documents with placeholders filled by the template engine with data from a database, as described in Sect. 3.4.4.
3.2 CIM++ Concept
A conceptual overview of the automated (de-)serializer generation from CIM UML is presented in Fig. 3.1. The upper part of the diagram shows the automated code generation process from the definition of the ontology in CIM UML to the (un-)marshalling code generation of the CIM++ (De-)Serializer libcimpp. The lower part shows the deserialization process from a given topology (based on the specified CIM ontology) to CIM C++ objects. The CIM-based specification, which represents classes and their relations in UML, is loaded with a visual UML editor and transformed to a C++ codebase. Before this C++ codebase can be included by the (de-)serializer's source code (i. e. libcimpp), it is adapted by the developed
Figure 3.1: Overall concept of the CIM++ project
CIM++ code toolchain to compilable C++ code, as the original CIM C++ codebase is not complete, as explained later. This adapted codebase is used by the CIM++ (Un-)Marshalling Generator for the unmarshalling code generation needed for the CIM++ (de-)serializer. Originally, only a deserialization was implemented in libcimpp, but a serialization is currently being implemented, as the concept can be applied in both directions. The code toolchain as well as the (un-)marshalling generator make use of a compiler front-end, and the latter also makes use of a template engine getting its data from abstract syntax trees created by the compiler front-end while reading in the adapted CIM++ codebase. Afterwards, the template engine can fill the data about the codebase into the (un-)marshalling code templates. After all these automated steps, which can be repeated whenever the CIM-based specification in UML form is visually modified, the CIM++ deserializer can be compiled to a library.
This CIM++ (de-)serializer library (libcimpp) can be used by C++ programs for reading (by deserialization of C++ objects) and writing (by serialization of C++ objects) CIM documents. In the shown topology editor screenshot, for instance, all components of a grid with their links (i. e. the grid's topology corresponding to the previously defined CIM specification) are stored by the topology editor in one or more CIM RDF/XML documents. These documents can be directly transformed to C++ objects by libcimpp.
C++ was chosen as the programming language because of its high execution-time and memory-space efficiency and in order to be directly compatible with programs written in C++. Before the automated generation of (un-)marshalling code can be introduced, the mapping of the CIM UML specification to the adapted and therefore compilable C++ codebase is presented.
3.3 From CIM UML to Compilable C++ Code
With visual UML editors, the CIM model can be rapidly modified or extended to individual requirements. Moreover, many tools follow MDA approaches, making round trip engineering (RTE) possible. RTE in relation to UML allows the user to keep UML models and related source code consistent by two software engineering principles: forward engineering, where changes to UML diagrams lead to an automated adaption of the belonging source code, and reverse engineering (if supported by the UML editor), where changes to the source code lead to an automated adaption of the belonging UML diagrams [Dav03; Reu+16]. In our case, these principles provide the ability for incremental development of CIM ontologies
Figure 3.2: UML diagram of the HydroPowerPlant class, whose instances can be associated with no more than one Reservoir instance
(respectively, data models based on CIM) and the automatically generated CIM C++ codebases. This leads to better software documentation and compatibility between different (distributed) software developing entities.
Unfortunately, there are no standardized canonical mappings between UML associations, aggregations, compositions, etc. on the one side and object-oriented programming (OOP) languages on the other. Therefore, different C++ code representations for CIM UML aspects had to be chosen for the code generation. For instance, in the case of no multiplicity, the chosen representation of an association is a pointer to an instance of the associated class. In the case of a possible multiplicity greater than 1, it is a pointer to a Standard Template Library (STL) list of pointers to instances of the associated class.
The CIM UML specification of the HydroPowerPlant class is partly presented in Fig. 3.2. Since the given multiplicity of the aggregated HydroPump objects can be greater than one, a list is used for the corresponding HydroPumps attribute in the generated code, as depicted in List. 3.2. The HydroPowerPlant aggregates one or more HydroPump instances. Furthermore, there can be multiple HydroPowerPlant instances associated with a Reservoir, and a HydroPump can also exist without being aggregated by a HydroPowerPlant.
Inheritance in CIM UML can be easily represented by C++ inheritance. Due to the fact that no operations are defined in CIM UML, i. a. the
Listing 3.2: Snippet of HydroPowerPlant class
class HydroPowerPlant :
    public IEC61970::Base::Core::PowerSystemResource
{
public:
    std::list<IEC61970::Base::Generation::Production::HydroPump*>* HydroPumps;
    IEC61970::Base::Generation::Production::Reservoir* Reservoir;
    ...
};
generated standard constructors are empty, there are no further class functions, and all UML-defined attributes stay uninitialized. The classes defined in the CIM standard as primitive types (see also Sect. 3.3.3) are generated as empty classes. In the case of the used code generator, and most likely most others, the generated enum types are not strongly typed and therefore have no scope. Besides these circumstances, due to the CIM UML standards in conjunction with C++, the generated code also comes with some software-technical deficiencies. For instance, the #include directive for the chosen std::list container is not automatically inserted, etc.
The mentioned facts lead to source code files that could not be directly used for the subsequent automated (un-)marshalling code generation. Therefore, the following solution approaches were also considered: replacing C++ by any other programming language would not guarantee a solution of any of the mentioned issues, and writing a new generation tool would result in an additional sophisticated software project just for the special case of generating C++ code from a machine-readable CIM UML representation such as XMI. Therefore, a cost-benefit analysis led to the decision to develop a toolchain for automated code correction and adaption based on existing, widely used transformation and C++ refactoring techniques. Thus, the developed toolchain should be easily adaptable for use with different general-purpose CIM UML to C++ code generators. A source code transformation by hand, e. g., in the case of IEC 61970 / 61968 with around 2000 source files, would be a cumbersome and error-prone task. The demands on the generated code after its transformation by the toolchain are hierarchical includes of all header files as well as an adequate usage of the chosen container class (i. e. std::list). Furthermore, a common BaseClass for all CIM classes is needed, as will be shown later.
3.3.1 Gathering Generated CIM Sources
The first steps are performed by the CIM-Include-Tool, which groups together all C++ source files created from CIM UML by the code generator of the visual UML editor. The tool scans all source files written by the code generator for the container class that was chosen for associations with multiplicities and adds missing header includes (here, i. e., #include <list>). In case of the used code generator, all files are grouped according to the CIM packages. For instance, the definition of the IEC 61970 class Terminal, located in the package Base::Core, is stored in the directory path IEC61970/Base/Core. This is why all occurrences of
#include "Terminal.h"
are transformed to
Chapter 3 Automated De-/Serializer Generation
#include "IEC61970/Base/Core/Terminal.h"
in order to keep the hierarchical structure of all directories and files [Daw; ISO14].
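The effect of this include rewriting can be sketched with a small standalone function. The package lookup table here is a hypothetical stand-in for the tool's actual directory scan; the function and variable names are illustrative, not the CIM-Include-Tool's real API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch: rewrite a bare quoted include to its package-qualified path.
// `packages` maps a header name to the directory derived from its CIM package.
std::string qualifyInclude(const std::string& line,
                           const std::map<std::string, std::string>& packages) {
    const std::string prefix = "#include \"";
    if (line.rfind(prefix, 0) != 0)
        return line;                              // not a quoted include: keep as is
    std::size_t end = line.find('"', prefix.size());
    std::string header = line.substr(prefix.size(), end - prefix.size());
    auto it = packages.find(header);
    if (it == packages.end())
        return line;                              // unknown header: keep as is
    return prefix + it->second + "/" + header + "\"";
}
```

Angle-bracket includes such as `#include <list>` pass through unchanged, matching the behavior described above.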
3.3.2 Refactoring Generated CIM Sources
After that, the CIMRefactorer, based on the Clang LibTooling library, is executed. Clang is a compiler front-end supporting C/C++, developed within the LLVM compiler infrastructure project [LLV]. During parse time, the library creates an AST containing objects that represent the whole source code, such as declarations and statements [DMS14]. For AST traversal, the visitor design pattern [KWS07] is used to evaluate and process the AST. Thanks to the visitor pattern, the implementation of the so-called composite does not need to be adapted if its processing has to be changed or extended: if a new action has to be performed on the AST, only a new visitor has to be implemented. Clang provides the class template clang::RecursiveASTVisitor<Derived> for this need. It is derived with an appropriate implementation of a visitor class given as template parameter, as pictured in Fig. 3.3 for an example visitor MyASTVisitor. By design there also exists the MyASTConsumer class inheriting from clang::ASTConsumer, which determines the entry point of the AST. It calls TraverseDecl() of the AST visitor, which then calls the appropriate methods of the given MyASTVisitor.
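The CRTP mechanism behind clang::RecursiveASTVisitor can be illustrated with a heavily simplified toy model (this is not Clang's real class; the Decl structure and hook names here are illustrative): the traversal logic lives in the base template, while a derived visitor only overrides the Visit* hooks it cares about.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy AST node: a declaration with a name and child declarations.
struct Decl { std::string name; std::vector<Decl> children; };

// Simplified CRTP sketch of the RecursiveASTVisitor idea.
template <typename Derived>
struct RecursiveVisitor {
    void TraverseDecl(const Decl& d) {
        static_cast<Derived*>(this)->VisitDecl(d);   // hook on each node
        for (const auto& c : d.children)
            TraverseDecl(c);                         // recurse into children
    }
    void VisitDecl(const Decl&) {}                   // default: do nothing
};

// A concrete visitor only implements the hook it needs.
struct NameCollector : RecursiveVisitor<NameCollector> {
    std::vector<std::string> names;
    void VisitDecl(const Decl& d) { names.push_back(d.name); }
};
```

Because the base class calls into the derived class statically, no virtual dispatch is needed, which is the same design choice Clang makes.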
The CIM models provided by the CIMug include UML enumerations, e. g., for units and multipliers. Several of these enumerations contain the same symbols. For example, the enumeration UnitSymbol contains the enumerator m as the unit for meters, while the enumeration UnitMultiplier contains the enumerator m as the SI prefix for milli. Since C++ requires unique
Figure 3.3: UML diagram of the class MyASTVisitor. (The figure shows MyASTVisitor deriving from clang::RecursiveASTVisitor, which provides TraverseDecl(Decl* D), and implementing VisitDecl(Decl* D) and VisitStmt(Stmt* S); MyASTConsumer derives from clang::ASTConsumer with HandleTopLevelDecl(clang::DeclGroupRef DR) and holds a MyASTVisitor together with a std::unordered_set<std::string> Locations.)
symbols, which does not hold for the symbol m declared twice, the generated code with unscoped enum types is incorrect. Therefore, VisitDecl(), among other things, adds the class keyword to all visited unscoped enumerations. However, this does not define them as classes; it is just a reuse of an existing C++ keyword. Furthermore, the visitor checks for each statement whether any used data type is an enumeration and adds its corresponding scope as a prefix.
Hence, e. g., the unscoped enumeration with a corresponding assignment statement

enum UnitSymbol { F, ... };
...
const IEC61970::Base::Domain::UnitSymbol
    Capacitance::unit = F;

is adapted by VisitDecl() to a strongly typed enumeration

enum class UnitSymbol { F, ... };
...
const IEC61970::Base::Domain::UnitSymbol
    Capacitance::unit = IEC61970::Base::Domain::UnitSymbol::F;
with the needed scope in the assignment statement. Such modifications are not performed by the visitor on existing code directly but are temporarily stored in the designated container provided by Clang for later usage, in order to avoid invalid ASTs.
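The name collision that motivates this refactoring can be reproduced in isolation. With plain enums, the two `m` enumerators below would clash in the enclosing scope; `enum class` gives each enumerator its own scope, as in the adapted CIM code (the enumerator lists here are abbreviated examples, not the full CIM enumerations).

```cpp
#include <cassert>

// With unscoped enums, both `m` enumerators would be ambiguous.
// Scoped enums keep them apart, mirroring the refactorer's fix.
enum class UnitSymbol     { m, A, V };   // m = meters
enum class UnitMultiplier { m, k, M };   // m = milli

constexpr UnitSymbol     unit = UnitSymbol::m;
constexpr UnitMultiplier mul  = UnitMultiplier::m;
```

Replacing `enum class` with plain `enum` in this snippet makes it fail to compile with a redefinition error for `m`, which is exactly the situation the generated code was in.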
Furthermore, initialization lists are added to the standard constructors for all class attributes, which are provided by Clang together with their data types. Also, all pointer operators * to the chosen container type are removed in case of associations with given multiplicities. The list of attributes with their data types specified in the visited class declaration is also provided by Clang. Thus, such associations are finally represented as lists of pointers:
std::list<IEC61970::Base::Generation::Production::HydroPump*> HydroPumps;
Almost all of the thousands of CIM headers include other headers, which would lead to many repeatedly visited declarations and consequently to very long execution times. As already mentioned, MyASTConsumer defines the entry points of the AST, which are the top-level declarations of the CIM C++ headers; a top-level declaration is not included in another declaration. Each top-level declaration is traversed in order to visit all nodes of the AST. During this, the position of each node in the source code is stored in a hash table with an average-case time complexity for search operations of O(1). As a result, if a declaration's position is already contained in the table, the declaration is ignored.
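The duplicate filter amounts to a hash set of already-seen source locations. A minimal sketch, with a hypothetical class name and a "file:line" string encoding standing in for Clang's SourceLocation:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Sketch of the duplicate-declaration filter: each visited source location
// is recorded in a hash set with O(1) average lookup; already-seen
// declarations are skipped by the caller.
class SeenLocations {
    std::unordered_set<std::string> locations_;
public:
    // Returns true if the location was already recorded (i.e. skip it).
    bool checkAndRecord(const std::string& loc) {
        return !locations_.insert(loc).second;     // insert reports novelty
    }
};
```

The `insert` call both tests and records in one hash lookup, so the visitor pays no extra cost for the bookkeeping.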
3.3.3 Primitive CIM Data Types
The CIM standards do not only define classes for virtual instances of real objects but also so-called primitive (data) types String, Integer, Float, Decimal, and Boolean, which correspond to intrinsic data types of many programming languages. All other CIM data types are classes that can contain these primitive types. For the CIM classes representing such primitive types, just empty skeletons are generated, with the result that they must be implemented depending on their purpose, which can differ between CIM and thus libcimpp users. The used CIM model also contains the Decimal type, which is not specified as primitive (in present CIM standards) but is used like the four others and is thus handled by the toolchain like a primitive type. Two different methods for the implementation of primitive types have been discussed: simple type definitions, e. g., with typedef on intrinsic C++ data types, and the implementation of C++ classes.
For the unmarshalling step (explained in Sect. 3.4.3) it is mandatory that the class attributes support reading from C++ string streams. Since a design decision was to throw exceptions when trying to read from never-defined CIM class attributes, primitive types were implemented in the form of classes (and not, e. g., just as typedefs on intrinsic C++ types). Moreover, in the case of numeric data types, a sufficient precision can only be guaranteed since C++11, which is already the standard used for CIM++ because of other language features.
The primitive String type is based on std::string since it can store arbitrarily long UTF-8 encoded strings as required by the standard. The integral type Integer is implemented based on long, whose size depends on the used platform but is usually at least 32 bit, which should be sufficient in most cases. Float is CIM's floating-point number type, for which the double type was chosen instead of float, as a sufficient accuracy is more important in the case of CIM than a higher runtime performance. All these types already support reading from streams. Boolean is based on bool, which also supports reading from streams, but only for the digits 0/1 and not for the words true/false as used in CIM RDF/XML documents. Therefore, it was implemented with appropriate stream and cast operators, which, among other things, also make it comparable with other types. Decimal was implemented based on std::string to keep the read value as it is, because of the standard's requirement that it should be able to represent a base-10 real number without any restriction on its length. Afterwards, it can be converted by the libcimpp user, e. g., into an arbitrary-precision representation such as provided by the
Multiple-Precision Binary Floating-Point Library with Correct Rounding (MPFR) [Fou+07].
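A Boolean wrapper of the kind described above can be sketched as follows. This is a hedged illustration, not libcimpp's actual class: the member names `value` and `initialized` and the exact operator set are assumptions; the point is that a custom stream operator can accept the words true/false used in CIM RDF/XML, which plain bool streams cannot.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Sketch of a Boolean primitive-type class: reads "true"/"false",
// tracks initialization, and is comparable with bool via a cast operator.
class Boolean {
public:
    bool value = false;
    bool initialized = false;
    operator bool() const { return value; }   // enables comparison with bool
};

inline std::istream& operator>>(std::istream& in, Boolean& b) {
    std::string token;
    in >> token;                              // accept the RDF/XML word form
    b.value = (token == "true" || token == "1");
    b.initialized = true;
    return in;
}
```

The `initialized` flag is what allows the deserializer to detect reads from never-defined attributes, as mentioned for the exception-throwing design decision.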
Overall CIM C++ Source Code Transformation
In addition to the previously described procedures and the implementation of primitive data types in the form of a patch, a couple of code-fixing patches are applied to the generated CIM C++ code. Besides software-technical details like a correction of definitions inside the IEC61970Version class or making all source files Portable Operating System Interface (POSIX) conform [IEE18], some conceptual issues also have to be solved. The CIM standards define an enumerated type for three-digit currency codes of ISO 4217, which can have a leading 0; in C++ code these are interpreted as octal numbers, which is why such leading zeros must be removed. Moreover, CIM defines an attribute switch, which in C++ is a keyword. Therefore, the attribute is renamed, which must be taken into account later during the unmarshalling step of the deserializer so that the attribute can be read in by its original name. Afterwards, the code is checked for compilability with clang-check (which could be done by any C++ compiler, too) in order to detect errors introduced when the CIM standard is changed or extended with the aid of a visual UML editor. If the check is successful, the code can be used for the automated CIM++ (De-)Serializer generation. Finally, the documentation generator Doxygen is applied to the now compilable CIM C++ code as support for the CIM++ user [FEIe].
3.4 Automated CIM (De-)Serializer Generation
With a UML to source code generator and the previously introduced toolchain, the CIM UML model is transformed into a compilable codebase, which can be used as a CIM data model with instantiatable C++ objects. These objects can then be filled with data read from a CIM RDF/XML document by a common XML parser with the aid of automatically generated unmarshalling routines, or they can be filled and modified by C++ statements and serialized into a CIM RDF/XML document. However, to be able to store CIM C++ objects in (e. g. STL) containers, further work needs to be done.
3.4.1 The Common Base Class

While reading CIM RDF/XML documents, the thereby created CIM objects are stored on the heap and therefore referenced by pointers, which
are collected in a list container. Since STL containers store items of one single type, the concept of base class pointers is used: objects of derived classes can also be referenced by pointers of their base classes. The motivation is that not all CIM classes inherit from the CIM class IdentifiedObject. Due to the absence of a common base class for all CIM classes, it is not possible to collect them in a container of one base class type.
Several solutions to this issue have been discussed. One possibility would be to use typeless pointers (void*), but as C++ is a strictly typed language, this was rejected. Another possibility is the use of a container type like boost::any from the Boost.Any library [Hen]. But to remain with the STL, to keep things simple for the CIM++ user, and to avoid further software dependencies, each top-level CIM class (i. e., a class without a superclass) derives from a newly introduced BaseClass. As a consequence, it is the base class for all CIM classes and is added to the CIM C++ codebase by the previously introduced CIMRefactorer.
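The common-base-class idea can be sketched in a few lines. The class bodies here are illustrative stubs (the real CIM classes carry many more attributes); `Voltage` stands in for a CIM class without IdentifiedObject ancestry.

```cpp
#include <cassert>
#include <list>
#include <string>

// Every top-level CIM class derives from BaseClass, so objects of any
// CIM type can live in one STL container of BaseClass pointers.
class BaseClass { public: virtual ~BaseClass() = default; };

class IdentifiedObject : public BaseClass { public: std::string name; };
class Terminal : public IdentifiedObject {};
class Voltage : public BaseClass {};   // stub: no IdentifiedObject ancestry
```

The virtual destructor makes BaseClass polymorphic, which is also what allows the dynamic_cast in the assignment functions shown later to recover the concrete type.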
3.4.2 Integrating an XML Parser

Basically, CIM RDF/XML can be read by any XML parser. As already described in Sect. 3.2, RDF extends XML, among other things, by the possibility of referencing other elements from within an XML element. There are a couple of libraries for RDF handling, such as Apache Jena for Java [Apa] and the Redland RDF Libraries written in C [Bec]. The relevant Redland libraries are libraptor, the actual RDF/XML parser, and librdf, providing data access by RDF triples. Redland's implementation is similar to a DOM parser: data from RDF/XML documents is stored in its own container residing in main memory. However, the main goal of CIM++ is to deserialize the CIM objects stored in RDF/XML into C++ objects. Consequently, all CIM data already stored in an intermediate format would need to be copied into the objects instantiated according to the defined CIM C++ classes. Therefore, the choice fell on a SAX parser which, with a succeeding unmarshalling step, can fill the read CIM RDF/XML data directly into the CIM C++ objects.
The first versions of the developed libcimpp library used the event-based SAX parser of libxml++ [Cum], a C++ wrapper for the well-established libxml library. In the current libcimpp version, libxml++ was replaced by the Arabica XML toolkit, which comes with uniform SAX wrappers for several XML parsers [Hig], making libcimpp usable on different Unix-like operating systems as well as on Windows. All event-based SAX parsers provide callback functions that are called whenever a certain event occurs during parse time. In the case of libcimpp, these methods call the
unmarshalling code which instantiates proper CIM C++ objects and fillsthem with the read data.
Whenever a new opening XML tag is encountered, startElement is called, which receives the XML tag and its attributes; these are stored for later use. If the tag represents a CIM class, an object of this class is instantiated on the heap and referenced by a BaseClass pointer, which is pushed onto a stack and later popped from the stack by a call of endElement. If an opened XML tag contains an RDF attribute that refers to another CIM object, a task is created and inserted into a task queue; this can be the case in all kinds of CIM UML associations. Finally, at the end of the document, endDocument is called, which processes all tasks of the task queue. These tasks connect associated objects by pointers. Therefore, all objects of the CIM document have to be instantiated before their pointers can be set correctly. Furthermore, a certain routine is called whenever the SAX parser encounters characters that represent no XML tag. These characters and the uppermost elements of the tag and object stacks are passed to an assignment function, which interprets the characters as values and tries to assign them to the proper attributes of the corresponding CIM object.
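The SAX bookkeeping described above can be condensed into a small sketch. The class and member names are hypothetical simplifications of libcimpp's internals: an object stack for the currently open elements and a queue of deferred tasks that run once the whole document has been read, so that every referenced object exists before pointers are set.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <stack>

struct BaseClass { virtual ~BaseClass() = default; };

// Sketch of the SAX callback state: startElement pushes the new object,
// endElement pops it, and cross-references are deferred as tasks.
struct ParserState {
    std::stack<BaseClass*> objects;            // currently open CIM objects
    std::queue<std::function<void()>> tasks;   // deferred reference resolving

    void startElement(BaseClass* obj) { objects.push(obj); }
    void endElement()                 { objects.pop(); }
    void endDocument() {                       // resolve all queued tasks
        while (!tasks.empty()) { tasks.front()(); tasks.pop(); }
    }
};
```

Deferring the tasks until endDocument is what makes forward references in the RDF/XML document unproblematic.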
3.4.3 Unmarshalling

The previously introduced assignment functions form the core of the unmarshaller. Since the CIM UML model is transformed into a correct, compilable CIM C++ codebase, it is possible to map XML elements with their contents onto the previously instantiated CIM C++ objects. For this purpose, a proper mapping function was defined, which will be described exemplarily by the CIM RDF/XML snippet shown in List. 3.1. For instance, a function has to assign the name of the Terminal element (List. 3.1, line 2) to the name attribute of the corresponding C++ object, which is an instance of the Terminal class that inherits the attribute from IdentifiedObject, whose code snippet is shown in List. 3.3. The simplest way in general would be to use reflection of the programming language, which is the ability to examine, introspect, and modify its own structure and behavior at runtime [DM95]. Reflection in OOP languages, among other things, allows "looking" into an object. For instance, that would allow the program to check at runtime whether a certain object has the attribute name and to access it. Usually, it would also be possible to iterate through all attributes of an object, entirely independently of their types. Contrary to dynamic
Listing 3.3: Snippet of the CIM C++ class IdentifiedObject.
class IdentifiedObject {
public:
    IdentifiedObject();
    IEC61970::Base::Domain::String name;
    ...
};
Listing 3.4: Assignment function for IdentifiedObject.name
bool assign_IdentifiedObject_name(std::stringstream& buffer,
                                  BaseClass* base_class_ptr) {
    if (IEC61970::Base::Core::IdentifiedObject* element =
            dynamic_cast<IEC61970::Base::Core::IdentifiedObject*>(
                base_class_ptr)) {
        buffer >> element->name;
        ...
    }
}
programming languages such as Python, which provide reflection and also object runtime alteration [ŠD12; Chu01], C++ by design provides only very limited reflection mechanisms. Without additional programming effort, only information like the object's type identifier can be queried at runtime, which is no solution in this context. There are methods to extend C++ by reflection mechanisms with the aid of libraries adding meta information, but such an approach would increase the complexity of the CIM++ project significantly, add further dependencies, and also deteriorate its maintainability and flexibility. Instead, Clang's LibTooling is used for generating the mapping functions based on information provided by the previously adapted CIM C++ codebase.
A mapping function needs the object whose attribute has to be accessed, the attribute's name, and the character string which has to be interpreted and assigned to the attribute. In List. 3.1, line 2, the attribute is identified by cim:IdentifiedObject.name, where cim is the namespace. By implication, a mapping function calls an appropriate assignment function which, for the given case, is presented in List. 3.4. If the dynamic_cast is successful, the stream operator, which was previously implemented for all primitive types, is used for interpreting the given characters as the proper value and for its subsequent assignment to the attribute.
In addition to primitive types, there are also CIM classes that are not data types but are used similarly in CIM-based CGMES documents [ENT16] and, in the context of OOP, are called structured data types. Apart from a value attribute, these classes just contain members representing enumerated types, units, or multipliers. CIM's Domain package defines most of these classes, such as Base::Domain::Voltage with the attributes value of the type IEC61970::Base::Domain::Float, multiplier of the type Base::Domain::UnitMultiplier, and unit of the type Base::Domain::UnitSymbol. Analogously to the presented assignment function, the assignment for an attribute nominalVoltage of the type Base::Domain::Voltage would be:
buffer >> element->nominalVoltage.value;
Since similar assignment functions have to be implemented for all such attributes, they are generated with the aid of a template engine by the unmarshaller generator explained in Sect. 3.4.4. For IEC 61970 alone, more than 3000 assignment functions are generated. Finding the right one by if-branches at runtime would lead to an average-case time complexity of O(n) for each assignment, with n being the total number of assignment functions. To improve the performance, a kind of dynamic switch statement was implemented: pointers to all assignment functions are stored in a hash table with the attributes' names as keys. Therefore, lookups in the hash table have an average time complexity of O(1).
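The "dynamic switch" can be sketched as a hash table from attribute names to assignment function pointers. The class stubs below are simplified stand-ins for the generated CIM classes; only the lookup mechanism itself is the point.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

struct BaseClass { virtual ~BaseClass() = default; };
struct IdentifiedObject : BaseClass { std::string name; };

using AssignFn = bool (*)(std::stringstream&, BaseClass*);

// One of the ~3000 generated assignment functions (simplified).
bool assign_IdentifiedObject_name(std::stringstream& buf, BaseClass* p) {
    if (auto* obj = dynamic_cast<IdentifiedObject*>(p)) {
        buf >> obj->name;
        return true;
    }
    return false;
}

// Hash table keyed by attribute name: O(1) average lookup instead of
// O(n) if-branches over all assignment functions.
const std::unordered_map<std::string, AssignFn> assignMap = {
    {"cim:IdentifiedObject.name", &assign_IdentifiedObject_name},
};
```

At parse time, the unmarshaller looks up the XML attribute name in the table and calls the returned function pointer directly.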
Before any assignment can take place, the proper objects have to be instantiated. As already described, this happens when a new opening XML tag is encountered. In the case of <cim:Terminal rdf:ID="BADCAB1E">, a new object is instantiated on the heap by new Base::Core::Terminal. The mapping of such an XML tag to its line of code is also done with the aid of the dynamic switch statement concept. For each CIM class, there is a function instantiating the respective objects. These functions are part of a Factory design pattern [Ale01] implemented in the CIMFactory class, which is part of libcimpp. The object's rdf:ID is stored as a key in a hash table together with a pointer to the object for later task resolving. The Task class has a resolve() method which is called for setting the association between objects, as mentioned before. During construction, a Task instance gets the CIM object which represents the end of the regarded association together with the association's identifier. The identifier is the XML tag belonging to the association. To resolve a task in resolve(), the rdf:ID is looked up in the hash table to get the address of the associated CIM object. Afterwards, a set of assignment functions is used to link the objects together.
3.4.4 Unmarshalling Code Generator
In the previous section, the unmarshalling process of libcimpp was described. The developed CIM-Unmarshalling-Generator application generates C++ code for the introduced classes Task and CIMFactory as well as for the assignment functions. This step is performed with the aid of the CTemplate engine [Spe].
Every template engine needs a data source for template file rendering. To stay as independent as possible from any tools, no proprietary format containing the CIM model was used. It would be possible to export the available CIM model to an open format like XMI, but this approach was rejected for several reasons: like the code generation, the XMI export of the used visual UML editor can have inadequacies, too. The present corrected and adapted CIM C++ codebase already contains all needed information about the given CIM model and can be used as input for the template engine's database. Thus, subsequent manual changes to the CIM C++ codebase can also be taken into account by the CIM++ toolchain. Therefore, the database needed for the template engine is built from the CIM C++ codebase.
As already mentioned, the introduced class CIMFactory creates instances of CIM classes requested by their names. Therefore, appropriate functions are needed for each CIM class. These functions can be expressed by the template snippet presented in List. 3.5. There, {{#FACTORY}} begins a reiterative section, and {{CLASS_NAME}} as well as {{QUAL_CLASS_NAME}} are placeholders which are replaced at render time by values read from the database. Based on this template, the CIM-Unmarshalling-Generator creates the appropriate function for each CIM class. Since the CTemplate engine works with dictionaries to set the placeholders at render time, the AST visitor creates a section dictionary for each class definition it finds in the CIM C++ files. The final code for Terminal after the so-called rendering by the template engine is shown in List. 3.6.
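The rendering step itself can be illustrated with a tiny stand-in for the template engine. This is not CTemplate's API; it only shows the placeholder-substitution principle: a dictionary maps placeholder names to values, and each `{{KEY}}` occurrence is replaced at render time.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy renderer: replaces every {{KEY}} in the template with dict[KEY].
// (CTemplate additionally handles sections like {{#FACTORY}}.)
std::string render(std::string tmpl,
                   const std::map<std::string, std::string>& dict) {
    for (const auto& [key, value] : dict) {
        const std::string ph = "{{" + key + "}}";
        for (std::size_t pos = tmpl.find(ph); pos != std::string::npos;
             pos = tmpl.find(ph, pos + value.size()))
            tmpl.replace(pos, ph.size(), value);
    }
    return tmpl;
}
```

Rendering the factory template body with a dictionary for Terminal yields exactly the kind of function shown in List. 3.6.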
The AST visitor also has access to a whitelist in which all CIM classes that are used like data types (i. e., they just occur in attribute declarations
Listing 3.5: Snippet of CIMFactory template
{{#FACTORY}}
BaseClass* {{CLASS_NAME}}_factory() {
    return new {{QUAL_CLASS_NAME}};
}
{{/FACTORY}}
Listing 3.6: Automatically generated Terminal_factory()
BaseClass* Terminal_factory() {
    return new IEC61970::Base::Core::Terminal;
}
of other CIM classes) are listed and, as a consequence, are not directly instantiated. For these classes, no sections are generated.
The function that initializes the hash table of the CIMFactory is also part of the shown template, with the aid of the created section dictionaries. The template for Task contains sections for attributes of a pointer type or a list of pointers (in case of multiplicities greater than 1).
Although associations in CIM are generally modeled as bidirectional links, in typical CIM RDF/XML documents they are implemented as unidirectional relations. Therefore, this is done analogously with CIM C++ objects. An example is the association of the class Terminal with ConnectivityNode. In C++, this association is realized as a pointer attribute of Terminal. In CIM RDF/XML documents, it is realized in the form of the tag cim:Terminal.ConnectivityNode with an RDF reference to the RDF ID of an object of the class ConnectivityNode. For resolving a corresponding task, a function is needed which assigns the address of the object referenced by the given RDF ID to the attribute ConnectivityNode of the class Terminal.
Compositions are not used in the available CIM model, but there are many aggregations, which are unidirectional, too. Nevertheless, the CIM C++ implementation of aggregations expressed in CIM RDF/XML is not as straightforward. In C++, the aggregating object contains an attribute of a pointer type, or a list of pointers, which point(s) to the aggregated object(s). The XML document, however, contains XML tags which are part of the elements embedded in the aggregated objects; these aggregated objects contain RDF references to their aggregating object. Therefore, functions are needed which assign pointers to the aggregated objects to the pointers or lists of pointers of the aggregating objects.
The AST visitor generates an assignment function for each pointer or list attribute of the CIM C++ classes. These functions get base class pointers to the objects which have to be linked together as arguments. The lookup of the proper function is accomplished by another hash table. The main issue is the generation of the function which initializes the hash table with the correct XML tags as keys to the function pointers.
In some cases, the association representation is rather simple. For Terminal with the attribute ConnectivityNode, for example, the AST visitor generates the key value cim:Terminal.ConnectivityNode. This is expressed by the following template:
cim:{{CLASS_NAME}}.{{FIELD_NAME}}
In other cases (depending on the CIM UML specification), the generation of correct key values differs. For instance, TopologicalNode aggregates one or more instances of the CIM class Terminal, but the XML tag representing the association (here an aggregation) is written the other way round (therefore called an inverted XML tag), namely cim:Terminal.TopologicalNode. Therefore, in the case of the C++ class TopologicalNode with the attribute Terminal, which represents the association, the key value can be expressed by the template:
cim:{{FIELD_NAME}}.{{CLASS_NAME}}
This approach is sufficient in the vast majority of cases, but in some CIM documents the XML tag representing the association looks different. Therefore, there are configuration files with proper mappings from key values generated by the previous template to the inverted XML tags representing associations in the CIM RDF/XML documents to be deserialized. These configuration files (one for primitive types and another for the remaining classes) are read by libcimpp at runtime. Currently, there are only around a dozen such cases. With these and further template sections, the unmarshalling code of CIM++ is completed.
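The two key-generation rules can be sketched as small helpers (the function names are hypothetical; in the toolchain these strings are produced by the templates shown above):

```cpp
#include <cassert>
#include <string>

// Regular association key: class name first, then field name.
std::string regularKey(const std::string& cls, const std::string& field) {
    return "cim:" + cls + "." + field;
}

// Inverted XML tag for aggregations: field and class swap places.
std::string invertedKey(const std::string& cls, const std::string& field) {
    return "cim:" + field + "." + cls;
}
```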
The sections of the class Task are filled (as shown with the previous examples) by the AST visitor with the aid of further placeholders and dictionaries. Furthermore, the template for the assign function consists of two sections. The first one (ENUM_STRINGSTREAM) generates the unmarshalling function for enumerations, the second one the actual assignments of the read-in data to the CIM C++ objects. In this unmarshalling function, the stream operators for enumerations are implemented. Therefore, for all enumerated types, proper CIM RDF/XML data can be read in with the aid of streams, as for primitive types. Since the enumerated types are strongly typed, besides the placeholder {{ENUM_CLASS_TYPE}} for enumerations without a scope, there is {{QUAL_ENUM_CLASS_TYPE}} for scoped enumerations. To fill these placeholders, the AST visitor traverses all enum class declarations and generates the needed section dictionaries.
Finally, an ASSIGNMENT section of the assignment function template contains several placeholders which are filled using section dictionaries generated while visiting the attributes of all CIM C++ classes which are of a data type or an enumeration.
Listing 3.7: serialize function of ACLineSegment
 1 std::string ACLineSegment::serialize(bool isXmlElement,
 2                                      std::map<BaseClass*,
 3                                               std::string>
 4                                          *id_map)
 5 {
 6     std::string output = "";
 7
 8     if (isXmlElement) {
 9         output.append("<cim:ACLineSegment rdf:ID=\"" +
10                       mRID + "\">\n");
11     }
12
13     output.append(IEC61970::Base::Wires::Conductor::
14                   serialize(false, id_map));
15
16     if (bch.value.initialized) {
17         output.append("  <cim:ACLineSegment.bch>" +
18                       std::to_string(bch.value) +
19                       "</cim:ACLineSegment.bch>\n");
20     }
21     ...
22
23     if (isXmlElement) {
24         output.append("</cim:ACLineSegment>\n");
25     }
26 }
3.4.5 Marshalling

For the serialization of CIM C++ objects from main memory to CIM documents, BaseClass was extended by the member function
virtual std::string serialize(bool isXmlElement,
                              std::map<BaseClass*,
                                       std::string>* id_map)
which can be overridden by all CIM subclasses, as they all inherit (directly or indirectly through other classes) from BaseClass. For instance, ACLineSegment overrides it with the function partly depicted in List. 3.7. The isXmlElement parameter tells the serialize method whether the attributes to be serialized belong to a superclass of the instance (isXmlElement = false) or to the class of the instance itself (isXmlElement = true). In the latter case, XML element tags (see lines 10 and 24) must be wrapped around the attributes' marshalling output (between lines 12 and 21). This means that if an instance of ACLineSegment has to be serialized, ACLineSegment::serialize is called with isXmlElement = true, leading to a serialization with the introductory XML line <cim:ACLineSegment
rdf:ID=...>. In line 14, the serialize method of its superclass Conductor is called with isXmlElement = false to achieve a serialization of the superclass's attributes without any XML tags introducing a new Conductor object.
3.5 libcimpp Implementation
The CIM++ (De-)Serializer is implemented as a library which is completed by the code generated with the CIM++ toolchain. Afterwards, it can easily be built as a CMake project. The libcimpp library is available as an open-source project [FEI19a] and already contains automatically generated code for current CIM versions.
Pointers to the C++ objects deserialized from CIM documents are provided in the form of a list. Furthermore, a documentation for libcimpp is generated by Doxygen [Hee] and available to the user.
3.6 Evaluation
The flexibility and usability of the developed and implemented approaches are demonstrated here by a use case scenario. Regarding flexibility, it shall be shown that the developed toolchain for CIM C++ code adaption (presented in Sect. 3.3) and the CIM-Unmarshalling-Generator can be successfully applied to a given CIM model which was changed or extended with a visual UML editor. As already mentioned, the currently available open-source version of libcimpp was generated and can be used for the deserialization of different CIM versions as published by the CIMug. However, the main goal was to make CIM++ able to deserialize objects of classes added with a visual UML editor. This flexibility will be shown exemplarily by newly introduced component classes which are missing in the original CIM standards and needed in the SINERGIEN co-simulation environment. There, the original CIM classes are extended by a Sinergien package containing the mentioned additional classes. One of them is the class BatteryStorage, which has become necessary now that battery storages are increasingly integrated at the distribution level. After an extension of the original IEC 61970 standard (iec61970cim16v29a_iec61968cim12v08_iec62325cim03v01a from the CIMug) by the Sinergien package with the aid of Enterprise Architect (v11.0.1106), the CIM C++ code is generated, and the introduced toolchain for adapting the CIM C++ code to be compilable is applied. This also allows an application of Doxygen on the code, which generates the developer documentation, among other things, for the added Sinergien classes [FEId]. For instance, this
also includes the collaboration diagram of the BatteryStorage class as depicted in Fig. 3.4. After the toolchain for CIM C++ code adaption, the CIM-Unmarshalling-Generator is applied, which completes libcimpp with the code for unmarshalling. The correct functionality of the generated unmarshalling code is demonstrated in [Raz+18a] and in Chap. 4 by the translation of an established power grid topology with the aid of libcimpp, which, among other things, shows the correct functionality of the code generated by the CIM-Unmarshalling-Generator.
3.7 Conclusion and Outlook
In this chapter, the concept of an automated CIM RDF/XML (de-)serializer generation has been presented. The approach is based on an automated mapping from CIM UML to compilable C++ code with the aid of a visual UML editor, a compiler front-end, and a template engine. Using these components, the implemented code adaption toolchain is flexible enough to generate correct CIM C++ code from different CIM-based ontologies which then, together with the automatically generated unmarshalling code, can be integrated into the libcimpp (de-)serializer library.
Besides software engineering improvements to libcimpp itself, the approach could be extended by serialization from C++ objects to CIM RDF/XML documents as well as to XML streams, e. g. for XMPP communication. After a definition of the required steps, the so-called marshalling code can be added to the classes by the code adaption toolchain.
Additionally, the generated CIM C++ codebase can contain circular class dependencies. In the present CIM models there are only a few of them (always at the same positions), which is why they are resolved during code adaption by the mentioned code patches using forward declarations. Although circular dependencies should be avoided by a clean UML design, it could be researched how such forward declarations and other solutions could be applied by the code adaption toolchain in an automated way.
Such efforts currently contribute to the first drafts of a harmonization standard [IEC17]. In contrast to the mapping from CIM primitive data types to intrinsic C++ types and classes presented in this work, [Lee+15] shows a data type unification of IEC 61850 and CIM. This also includes definitions of operations from CIM and IEC 61850 types to unified data types using Query/View/Transformation (QVT), which is specified by the Object Management Group (OMG) as part of MDA. Since the main concern for libcimpp is to store data adequately, transformations are only performed if a sufficient accuracy can be achieved as specified by the
Figure 3.4: Section of the collaboration diagram for BatteryStorage generated by Doxygen on the automatically adapted CIM C++ codebase. The entire diagram can be found in [FEIb]
CIM standards. However, in conformity with the CIM++ approach, the generated CIM C++ classes could be extended during their automated adaption by member functions providing such QVTs for areas where a harmonization with IEC 61850 is desirable.
Approaches to synchronize UML models and source code in an automated way are continuously improved [GDD+06; Sad+09]. The main idea behind such RTE methods is a more visual software development [Die07] which is not finished after the software design phase but is iteratively repeated during the implementation phase. There is also ongoing research in this direction, which began with reverse engineering methods and so-called Computer-Aided Software Engineering (CASE) tools [Nic+00]. For instance, [Kol+02] presents a comparison of the reverse engineering capabilities of commercial and academic CASE tools. Because of the increasing complexity of software systems, the application of MDA-based methods is becoming more and more important. Thus, our approach contributes to these efforts.
Besides generic XML and RDF/XML parsers as mentioned in Sect. 3.4.2, which are subjects of research activities as well [Mae12], there is also a CIM-specific parser with serialization capabilities according to [IEC16a], called PyCIM [Lin], currently supporting only CIM versions until 2011. Since that project is not maintained anymore, a new project for CIM document (de-)serialization called CIMpy is being developed at ACS. Besides CIM it will also support CGMES, which is defined on the basis of CIM [ENT16]. CGMES is currently also being integrated into libcimpp in an automated way, with deserialization as well as serialization capabilities.
4 From CIM to Simulator-Specific System Models
In Chap. 3, the relevance of the Common Information Model (CIM) for power grids has been outlined and an automatically generated (de-)serializer library for documents based on the CIM has been presented. Because of the widespread use of CIM-based grid topology interchange, commercial power system simulation and analysis tools such as NEPLAN and PowerFactory can handle CIM. The problem of such proprietary simulation solutions in the academic field is often an insufficient or unavailable possibility for modifying component models as well as solvers. As a consequence, many open-source and free power system simulation tools have been developed during recent years, for instance MATPOWER [MAT19], which is compatible with the proprietary MATLAB as well as the open-source GNU Octave [Eat19] environment [ZMT11], its Python port PYPOWER [Lin19a], and pandapower [Fra19]. Other open-source solutions are programmed in the object- and component-oriented multi-domain modeling language Modelica [Fri15b]. Since it allows a declarative definition of the model equations, the Modelica user or programmer does not need to transform mathematical models into imperative code (i. e. assignments). Modelica simulations can be executed with the aid of proprietary environments such as Dymola and open-source ones such as OpenModelica [Fri+06] and JModelica [Åke+10] with various numerical back-ends. Modelica libraries with models for power system simulations are PowerSystems [FW14] and ModPowerSystems [MNM16]. The use of Modelica for power system simulation is not limited to academia but it
is also applied in real operation, especially with CIM-based grid topologies as shown in [Vir+17]. However, the approach presented there uses an intermediate data format called IIDM.
The main contribution of this chapter is the presentation of a template-based transformation from CIM to Modelica system models. It has beenimplemented in the open-source tool called CIMverter which, in its currentversion, transforms CIM documents into Modelica system models basedon arbitrary Modelica libraries, as specified by the user.
The transformation into arbitrary Modelica system models allows the execution of any kind of Modelica simulation which shall make use of information stored in CIM documents. To achieve this, CIMverter utilizes a template engine that processes template files written in Modelica, containing placeholders. These placeholders are filled by the template engine with data from CIM documents and combined into a complete system model that can be simulated in an arbitrary Modelica environment. The use of a template engine leads to encapsulation, clarity, division of labor, component reuse, single point-of-change, interchangeable views, and so forth, as stated in [Par04]. For instance, this means that in case of many interface changes of a component model, the Modelica user does not need to modify the CIMverter source files but just the templates written in Modelica. Hence, no special knowledge of CIMverter's programming language (C++) or of any domain-specific language (DSL) is needed. Furthermore, this chapter presents examples of how CIM objects can be mapped to objects of a common Modelica power system library. Our template-based approach can also be used for conversions to formats other than Modelica. Hence, system models of the Distributed Agent-Based Simulation of Complex Power Systems (DistAIX) simulator [Kol+18] can also be generated from CIM documents simply by adapting the template files used by CIMverter for the transformation.
This chapter gives a short introduction to data formats as well as the main software components used in CIMverter, followed by an overview of the overall concept. Then it describes how the mapping from CIM to Modelica is performed at the top level and at the bottom level with the usage of a C++ representation of the Modelica classes in the so-called Modelica Workshop. Following this, the approach and implementation are evaluated with the aid of two Modelica power system libraries and validated against a commercial simulation tool. Finally, related work is discussed and the chapter is concluded by a roundup and an outlook on future work. The work in this chapter has been partially presented in [Raz+18b]1.
1 “CIMverter—a template-based flexibly extensible open-source converter from CIMto Modelica” by Lukas Razik, Jan Dinkelbach, Markus Mirz, Antonello Monti islicensed under CC BY 4.0
4.1 CIMverter Fundamentals
For an introduction to CIM, RDF, and XML, please refer to Sect. 3.1. In the following, Modelica and template engines are introduced briefly.
4.1.1 Modelica
Modelica enables engineers to focus on the formulation of the physicalmodel by the implementation of the underlying equations in a declarativemanner [Fri15b]. The physical model can be readily implemented withoutthe necessity to fix any causality through the definition of input and outputvariables, thus, increasing the flexibility and reusability of the models[Til01]. Besides, existing Modelica environments relieve the engineer fromthe implementation of numerical methods to solve the specified equationsystem.
Modelica Models
The concept of component modeling by equations is shown exemplarilyin List. 4.1 for a constant power load, which is typically employed torepresent residential and industrial load characteristics in electrical gridsimulations.
The presented PQLoad model is part of the ModPowerSystems [MNM16]library and is derived from the base model OnePortGrounded using thekeyword extends, underlining that the Modelica language supports object-oriented modeling by inheritance. In the equation section, the physicalbehavior of the model is defined in a declarative manner by the commonequations for active and reactive power. The parameters employed inthe equations are declared in the PQLoad model beforehand, while the
Listing 4.1: Component model of a constant power load
model PQLoad "Constant power load"
  extends ModPowerSystems.Base.Interfaces.ComplexPhasor.SinglePhase.OnePortGrounded;
  parameter SI.ActivePower Pnom = 0.5e6 "active power";
  parameter SI.ReactivePower Qnom = 0.5e6 "reactive power";
equation
  Pnom/3 = real(v*conj(i));
  Qnom/3 = imag(v*conj(i));
end PQLoad;
declarations of the complex variables voltage and current are inherited from the base model OnePortGrounded. A complex system, e. g. an entire electrical grid, can be implemented as a system model by instantiating multiple components and specifying their interaction by means of connection equations, see line 25 in List. 4.6. The connect construct involves two connectors and introduces a fixed relation between their respective variables, e. g. between their voltages (equality coupling) and currents (sum-to-zero coupling). Typically, Modelica environments provide a GUI for the graphical composition of models.
Modelica Simulations
In [Fri15a] the translation and execution of a Modelica system model is sketched. At first, the system model is converted into an internal representation (i. e. an abstract syntax tree (AST)) of the Modelica compiler. On this representation, the Modelica language specific functionality is applied and the equations of the used component models (which are the blocks in the graphical representation of the system model) are connected together. This results in the so-called flat model.
Then all equations are sorted according to the data flow among them and transformed by algebraic simplification algorithms, symbolic index reduction methods, and so forth, into a set of equations that will be solved numerically. For instance, duplicates of equations are removed. Also, equations in explicit form are transformed into assignment statements (i. e. an imperative form), which is possible since they have been sorted. The established execution order leads to an evaluation of the equations in conjunction with the iteration step of the numeric solver. Subsequently, the equations are translated to C code, equipped with a driver (i. e. C code with a main routine), and compiled to an executable (i. e. a program) which is linked to the utilized numerical libraries. This program is then executed according to a configuration file which defines, e. g., the simulation's start and end times, the numerical methods to be utilized, the simulation results format, and so forth. Initial values are usually taken from the model definitions in Modelica.
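The step from sorted explicit equations to per-iteration assignments can be illustrated by a minimal, hypothetical sketch: a single first-order model der(x) = -x/tau integrated with an explicit Euler step. The names and the solver are purely illustrative and not what an actual Modelica compiler emits.

```cpp
#include <cmath>

// Hypothetical "flat model" der(x) = -x/tau with x(0) = 1. After sorting,
// the explicit equation has become an assignment that is evaluated once
// per iteration step of the numeric solver (here: explicit Euler).
double simulate(double tau, double dt, double t_end) {
    double x = 1.0;                  // initial value from the model definition
    for (double t = 0.0; t < t_end; t += dt) {
        double der_x = -x / tau;     // equation in explicit form -> assignment
        x += dt * der_x;             // solver iteration step
    }
    return x;                        // analytic solution: exp(-t_end/tau)
}
```

In a real generated program, step size, end time, and solver choice would come from the mentioned configuration file rather than being function arguments.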
For the conversion from CIM to Modelica system models, it must be defined where the topology parameters (written in the CIM document to be converted) are to be placed in the Modelica system model (i. e. the resulting Modelica file). For this purpose, a template engine is used, whose working principle is introduced in the following.
4.1.2 Template Engine

A template engine (also called template processor or template system) is commonly used in web site development and, in CIMverter, generates the Modelica code. Template engines allow the separation of model (i. e. logic as well as data) and view (i. e. resulting code). For CIMverter this means, in short, that there is no Modelica code within the C++ source code of CIMverter. To achieve this, template engines have a
data model for instance based on a database, a text / binary file, or acontainer type of the template engine’s programming language,
template files (also called templates) written in the language of the resulting documents together with special template language statements, and
result documents which are generated after the processing of data andtemplate files, so-called expanding,
as illustrated in Fig. 4.1, where an example HTML code template with a placeholder {{name}} is filled with the name from a database, resulting in a complete HTML document. Such placeholders are one type of template markers.
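The expanding step can be sketched with a minimal hand-rolled stand-in for a template engine that replaces {{...}} markers; this is a simplified illustration of the principle, not the CTemplate API actually used by CIMverter.

```cpp
#include <map>
#include <string>

// Minimal sketch of "expanding": replace every {{key}} marker in the
// template with the value from the data model (here a simple map).
std::string expand(std::string tpl,
                   const std::map<std::string, std::string>& data) {
    for (const auto& kv : data) {
        const std::string marker = "{{" + kv.first + "}}";
        std::string::size_type pos = 0;
        while ((pos = tpl.find(marker, pos)) != std::string::npos) {
            tpl.replace(pos, marker.size(), kv.second);
            pos += kv.second.size();
        }
    }
    return tpl;
}
```

For the example of Fig. 4.1, expanding "<title>Hello {{name}}!</title>" with the data model entry name = "World" yields the complete HTML fragment "<title>Hello World!</title>".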
4.2 CIMverter Concept
The concept of CIMverter is depicted in Fig. 4.2. The upper part shows the automated code generation process from the definition of the ontology by CIM UML to the unmarshalling code generation of the CIM++ (De-)Serializer library libcimpp. The middle part shows the transformation process from a given topology (based on the specified CIM ontology) to a Modelica system model, based on Modelica libraries which are addressed
Figure 4.1: Template engine example with HTML code. The template <title>Hello {{name}}!</title> is expanded with the value name = "World" from a database into the output <title>Hello World!</title>
by appropriate Modelica templates. It uses and extends the concept of CIM++ as introduced in [Raz+18b]. The CIM UML ontology can be edited by a visual UML editor and exported to a CIM C++ codebase which is not compilable and therefore needs to be completed by the CIM++ code toolchain. The resulting adapted CIM C++ codebase, representing all CIM classes with their relations, is compilable and used by the CIM++ (Un-)Marshalling Generator for the generation of the code which is needed for the actual deserialization process of libcimpp. The CIM++ toolchain and the (Un-)Marshalling Generator are applied in an automated way whenever the ontology is changed. This keeps libcimpp compatible with the newest CIM RDF/XML documents.
CIMverter uses libcimpp for the deserialization of CIM objects from RDF/XML documents to C++ objects. Therefore, CIMverter also includes the adapted CIM C++ codebase, especially the headers for all CIM classes. Due to the ongoing development of CIM and the concomitant automated modifications of these headers, one might suppose that the CIMverter developers have to keep track of all CIM modifications, but in the vast majority of cases a subsequent modification of CIMverter code is unnecessary. This is because the continuous development of CIM mostly leads to new
Figure 4.2: Overall concept of the CIMverter project
CIM classes with further relations or new attributes in existing classes. Such extensions of existing CIM classes require no changes to the CIMverter code using them.
With a Modelica editor, the component models of Modelica libraries can be edited. In case the interface of a component model is changed, the appropriate Modelica template files have to be adapted by the CIMverter user. Thereby, using the template engine with the concomitant model-view separation leads to the following advantages:
clarity: the templates are written in Modelica with only a few kinds of template keywords (i. e. markers).
division of labor: the CIMverter user, typically a person with an electrical engineering background and knowledge of Modelica, can adapt the Modelica templates easily in parallel with the CIMverter programmer, reducing conflicts during their development. While the engineer needs neither C++ programming skills nor any knowledge of CIMverter internals, the programmer does not need to keep CIMverter up-to-date with all Modelica libraries that could be used with CIMverter.
component reuse: for better readability, templates can include other tem-plates, which can be reused for different component models of thesame or further Modelica libraries.
interchangeable views: some Modelica models can be compiled with various options, e. g. for the use of different model equations, which can be defined directly in the code of the system model. For this purpose, the user can easily specify another set of templates.
maintenance: changes to the Modelica code to be generated, which are needed, e. g., due to changes of component model interfaces, can in many cases be achieved by editing template files. Changing a template, by the way, is less risky than changing a program, where changes can lead to bugs. Furthermore, recompiling and reinstalling of CIMverter is unnecessary.
As already pointed out, some changes to the Modelica libraries require more than a template adaption, namely those related to the mapping of the deserialized CIM C++ objects to the dictionaries of the template engine, which are used to complete the Modelica templates to full system models.
For a clear mapping between the relevant data from the CIM C++ objects and the template dictionaries, the Modelica Workshop was introduced. For each Modelica component, the Workshop contains a C++ class with attributes
holding the values to be inserted in the appropriate dictionary, which will be used for the Modelica code fragment expansion of the belonging component within the Modelica system model. The mapping from CIM C++ objects to these Modelica Workshop objects is defined by C++ code. An alternative would have been the introduction of a DSL for a more flexible mapping definition. However, a really flexible DSL would have to support data conversions and computations for data mappings from CIM to Modelica class instances. Despite tools for DSL specification, parser generation, etc., the complexity of the CIMverter project would increase. Moreover, CIMverter users as well as the programmers would need to get familiar with the DSL. Both reasons would make CIMverter's maintenance and further development more complicated and therefore less attractive to potential developers. For instance, the co-simulation framework mosaik initially also made use of a specially developed DSL for scenario definitions [Sch11], but it was removed later on and now the scenarios are described in Python, in which mosaik is implemented, as this is more flexible and powerful. The Modelica Workshop and other implementation design aspects, as described in the next sections, shall realize the C++ coded mappings in an intuitive and understandable way, thereby making CIMverter easily extensible by further Modelica component models and libraries.
4.3 CIMverter Implementation
As described conceptually, CIMverter utilizes libcimpp for the deserialization of CIM topology documents (e. g. power grids) for the generation of full system models based on the chosen Modelica library (e. g. ModPowerSystems). C++ was selected as the programming language because of libcimpp, with its included CIM C++ codebase, as well as CTemplate, both written in C++. Since C++ is a statically typed language with stronger type checking and fewer runtime type information (RTTI) capabilities than a dynamic language such as Python, speculative dynamic typecasts are used to obtain an object of the correct CIM C++ class. In any case, the time for converting CIM to Modelica models is negligible in comparison to the compile time of the generated Modelica models. The usage of C++ also allows looking up CIM details in the Doxygen documentation generated from the adapted CIM C++ codebase of CIM++.
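The mentioned speculative dynamic typecasts can be sketched as follows; the stand-in classes are simplified here, while the real code casts deserialized BaseClass pointers to concrete CIM classes, cf. List. 4.2.

```cpp
#include <string>

struct BaseClass { virtual ~BaseClass() = default; };  // polymorphic root
struct TopologicalNode : BaseClass {};                  // simplified stand-ins
struct PowerTransformer : BaseClass {};                 // for CIM C++ classes

// Speculative downcast: try each concrete type in turn; dynamic_cast
// yields nullptr if the object is not of the guessed class.
std::string classify(BaseClass* obj) {
    if (dynamic_cast<TopologicalNode*>(obj)) return "TopologicalNode";
    if (dynamic_cast<PowerTransformer*>(obj)) return "PowerTransformer";
    return "unhandled";
}
```

The same pattern appears in the ModelicaCodeGenerator, where each handler is dispatched depending on which dynamic_cast succeeds.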
CIMverter has a command line interface (CLI) and follows the UNIX philosophy of developing one program for one task [MPT78; Ray03]. Therefore, it can simply be integrated into a chain of tasks which need to be performed between the creation of a CIM topology and the simulations within a Modelica environment, as realized in the SINERGIEN Co-Simulation project [Mir+18] described in Chap. 2.
A configuration file is handled with the aid of the libconfig++ library, where, among other things, the default graphical extent of each Modelica component can be adjusted. It also allows the definition of default CIM datatype multipliers (e. g. M for MW in case of IEC61970::Base::Domain::ActivePower) which are not defined in some CIM RDF/XML documents, such as the ones from NEPLAN based on the European Network of Transmission System Operators for Electricity (ENTSO-E) profile, specified by [ENT]. After these implementation details, the following subsections present the main aspects of the overall implementation.
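Applying such a default multiplier (e. g. M for MW) might look like the following sketch; the helper is hypothetical and not libconfig++ or CIMverter code.

```cpp
#include <map>
#include <string>

// Map a CIM unit multiplier symbol to its numeric factor, so that a
// value such as 20 (with multiplier M, i.e. MW) becomes 20e6 W.
// Simplified sketch; unknown symbols leave the value unchanged.
double apply_multiplier(double value, const std::string& symbol) {
    static const std::map<std::string, double> factors = {
        {"k", 1e3}, {"M", 1e6}, {"G", 1e9}, {"", 1.0}};
    auto it = factors.find(symbol);
    return it != factors.end() ? value * it->second : value;
}
```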
4.3.1 Mapping from CIM to Modelica

The mapping from CIM documents to Modelica system models can be divided into three levels of consideration, as in [Cao+15].
At the first level, there are the library mappings. The relevant data from CIM C++ objects, as deserialized by CIM++, is first stored in an intermediate object representation (i. e. in the Modelica Workshop) with a class structure similar to the one of the Modelica library. Hence, for each Modelica library there can be a set of appropriate C++ class definitions in the Modelica Workshop.
Object mappings are part of the second level. These are not just one-to-one mappings, as illustrated in Fig. 4.3. Sometimes, several CIM objects are mapped to one Modelica object resp. component, such as for the IEC61970::Base::Wires::PowerTransformer. There are also CIM objects like IEC61970::Base::Core::Terminal (electrical connection points, linked to other CIM objects) which are not mapped to any Modelica component models.
Parameter and unit conversions are performed at the third level between the CIM C++ objects and the Modelica Workshop objects. Examples are
Figure 4.3: Mapping at the second level between CIM and Modelica objects
voltages, coordinates, and so forth. The next section addresses the second and third level mappings as part of the Modelica Workshop, but first, the CIM object handling is explained.
4.3.2 CIM Object Handler

The CIMObjectHandler is in charge of handling the CIM objects. Listing 4.2 shows a part of its main routine ModelicaCodeGenerator. This is
Listing 4.2: Snippet of the routine ModelicaCodeGenerator
ctemplate::TemplateDictionary *dict =
    new ctemplate::TemplateDictionary("MODELICA");
...
for (BaseClass *Object : this->_CIMObjects) {
  if (auto *tp_node = dynamic_cast<TPNodePtr>(Object)) {
    BusBar busbar =
        this->TopologicalNodeHandler(tp_node, dict);
    ...
    std::list<TerminalPtr>::iterator terminal_it;
    for (terminal_it = tp_node->Terminal.begin();
         terminal_it != tp_node->Terminal.end(); ++terminal_it) {
      ...
      if (auto *power_trafo = dynamic_cast<PowerTrafoPtr>(
              (*terminal_it)->ConductingEquipment)) {
        Transformer trafo =
            PowerTransformerHandler(tp_node, (*terminal_it),
                                    power_trafo, dict);
        Connection conn(&busbar, &trafo);
        connectionQueue.push(conn);
      }
      ...
because topological nodes have a central role in bus-branch based CIM topologies of power grids [Pra+11]. Hence, when a TopologicalNode (saved as tp_node) is found, a busbar object of the Modelica Workshop class BusBar is initialized with it. busbar is needed later on for the connections of all kinds of conducting equipment (i. e. power grid components) connected to it.
Then, the inner loop iterates over all terminals of the found tp_node and checks which kind of ConductingEquipment is connected by the respective terminal to the tp_node. In case of a PowerTransformer, a trafo object of the Modelica Workshop class Transformer is initialized with the data from the PowerTransformerHandler. Furthermore, a new connection between the previously created busbar and the trafo is constructed and pushed onto a queue of all connections. These steps are performed for all other kinds
of components, which is why the ModelicaCodeGenerator calls handlersfor all of them.
The tp_node and the terminal connected to the respective component (here: trafo) are passed to the appropriate component handler (here: PowerTransformerHandler). Besides, the handler also gets the main template dictionary dict, called "MODELICA". Within a handler, the conversions from the required CIM C++ object(s) to the Modelica Workshop object trafo are performed. Furthermore, a sub-dictionary (here called "TRANSFORMER", used for the Transformer subtemplate, see e. g. List. 4.4) is created and linked to the given main template dictionary (see List. 4.3).
Some conversions are related to the graphical representation of the CIM objects. This is because a graphical power grid editor which can export CIM documents can link an IEC61970::Base::DiagramLayout::DiagramObject to each component, with information about the position of this component, i. e. (x, y)-coordinates, in the coordinate system of the graphical editor. Since the coordinate system of the CIM exporting editor (e. g. NEPLAN) can differ from the one of the Modelica editor (e. g. OMEdit), the coordinates are converted by the following code lines:
t_points.xPosition = trans_para[0]*x + trans_para[1];
t_points.yPosition = trans_para[2]*y + trans_para[3];
For reasons of flexibility, the four parameters trans_para can be set in the configuration file and, in case of NEPLAN and OMEdit, are initialized with {1,0,-1,0} (for trans_para[0] to trans_para[3]). Furthermore, the NEPLAN generated CIM documents have several DiagramObject instances linked to one component. To avoid multiple occurrences of the same component in the Modelica connections diagram, the middle point of these DiagramObject coordinates is calculated. This middle point then defines the component's position in the Modelica connections diagram.
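The coordinate conversion together with the middle-point calculation over several DiagramObject positions can be sketched as follows; the helper names are hypothetical, with trans_para = {1, 0, -1, 0} as for the NEPLAN to OMEdit case (flipping the y axis).

```cpp
#include <vector>

struct Point { double xPosition, yPosition; };

// Affine conversion between the coordinate systems of the CIM exporting
// editor and the Modelica editor, configured by the four parameters.
Point convert(double x, double y, const double trans_para[4]) {
    return { trans_para[0] * x + trans_para[1],
             trans_para[2] * y + trans_para[3] };
}

// Middle point of all DiagramObject positions linked to one component,
// used as the component's single position in the connections diagram.
Point middle(const std::vector<Point>& pts) {
    Point m{0.0, 0.0};
    for (const Point& p : pts) {
        m.xPosition += p.xPosition;
        m.yPosition += p.yPosition;
    }
    m.xPosition /= pts.size();
    m.yPosition /= pts.size();
    return m;
}
```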
Another conversion must be performed for the instance names of Modelica classes, which are derived from the name attribute of the CIM object and may not begin with or contain certain characters. Each such object derives its name attribute from the elementary IEC61970::Base::Core::IdentifiedObject superclass. More details on the electrics-related conversions will be given in the next section.
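A sanitization of CIM name attributes into valid Modelica instance names, as just described, might look like the following simplified sketch; these are not CIMverter's actual replacement rules.

```cpp
#include <cctype>
#include <string>

// Derive a valid Modelica instance name from a CIM name attribute:
// replace forbidden characters by '_' and prefix names that would
// otherwise start with a digit (simplified, hypothetical rules).
std::string to_modelica_name(const std::string& cim_name) {
    std::string out;
    for (unsigned char c : cim_name)
        out += (std::isalnum(c) || c == '_') ? static_cast<char>(c) : '_';
    if (out.empty() || std::isdigit(static_cast<unsigned char>(out[0])))
        out = "N_" + out;
    return out;
}
```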
4.4 Modelica Workshop Implementation
In List. 4.2, different CIM object handlers (e. g. PowerTransformerHandler) return appropriate Modelica Workshop objects which represent components of the targeted Modelica library. It should be stated at this juncture that CIM is not only related to power grid components and, for
instance, also includes energy market players (e. g. Customer), Asset, and so forth. Moreover, as presented in [Mir+18], CIM can also be extended by further classes of different domains. Hence, the Modelica Workshop does not need to be reduced to power grid components, even though the current Modelica Workshop is related to components for power grid simulations. This is due to ModPowerSystems being the first Modelica library targeted by the CIMverter converter. Nonetheless, the current Modelica Workshop can be used as is for the utilization of another Modelica library, as presented in the Evaluation. To avoid reimplementations, each Modelica Workshop class representing a Modelica component, such as Slack or Transformer, inherits from the so-called ModBaseClass.
4.4.1 Base Class of the Modelica Workshop

All Modelica components need annotation information which defines the visibility of the component, its extent, rotation, etc. Each Modelica Workshop class inheriting from ModBaseClass therefore has an annotation member holding the annotation data in the form used in the Modelica component's annotation statement. For this purpose, ModBaseClass also holds several member functions which combine the annotation data into well structured strings as needed for the template dictionary used for filling the annotation statements of all Modelica template files, since the annotation statements of all Modelica components have the same structure and the same markers (see lines 12-14 and 20-22 of List. 4.6).
For the Modelica statements which differ between different Modelica components (see lines 8-11 and 16-19 of List. 4.6) there exists a virtual function set_template_values. In each of the component subclasses this function is overridden with a specialized one which sets all markers that are needed for a complete filling of the corresponding Modelica component template, as presented in List. 4.4.
Further member variables of ModBaseClass hold the name of the object and the specified units information, whose default values are set in the configuration file. The object's name is read from the name attribute of the CIM class IdentifiedObject. Besides, ModBaseClass accumulates objects of the CIM class DiagramObject, in which the object's rotation and its points in the GUI coordinate system are stored.
4.4.2 CIM to Modelica Object Mapping

One of the most interesting mappings is from the CIM PowerTransformer to the Modelica Workshop Transformer class, as presented in Tab. 4.1. The PowerTransformer consists of two or more coupled windings and therefore
CIM                    Contained / Accumulated Member Variables   Modelica Workshop Transformer
PowerTransformerEnd1   BaseVoltage->nominalVoltage.value * mV     Vnom1
PowerTransformerEnd2   BaseVoltage->nominalVoltage.value * mV     Vnom2
PowerTransformerEnd1   ratedS.value * mP                          Sr
PowerTransformerEnd1   r.value                                    r
PowerTransformerEnd1   x.value                                    x
                       r * Sr / Vnom1^2 * 100                     Pcu,r
                       sqrt(r^2 + x^2) * Sr / Vnom1^2 * 100       Vsc,r

Table 4.1: CIM PowerTransformer to Modelica Workshop Transformer mapping. The left column shows the primary and secondary PowerTransformerEnd which accumulate further CIM objects, as listed in the middle column, holding the information needed for the initialization of the Transformer attributes as listed in the right column. The constants mV and mP stand for the voltage and power value multipliers. The bottom of the table shows that additionally two conversions are needed to calculate the rated short-circuit voltage Vsc,r and the short-circuit losses Pcu,r in percent
accumulates objects of the class PowerTransformerEnd, which represent the connectors of the PowerTransformer [FEIc]. Further important mappings implemented in the Modelica Workshop are listed in Tab. 4.2.
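The two conversions at the bottom of Tab. 4.1 can be checked dimensionally: with $r$ and $x$ in $\Omega$, $S_r$ in VA, and $V_{nom1}$ in V, the quotient is dimensionless, so the factor 100 yields a percentage:

$$P_{cu,r} = \frac{r\,S_r}{V_{nom1}^2} \cdot 100\,\%, \qquad V_{sc,r} = \frac{\sqrt{r^2 + x^2}\;S_r}{V_{nom1}^2} \cdot 100\,\%, \qquad \left[\frac{\Omega \cdot \mathrm{VA}}{\mathrm{V}^2}\right] = \left[\frac{\Omega \cdot \mathrm{A}}{\mathrm{V}}\right] = 1.$$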
4.4.3 Component Connections

After the instantiation of all components in the Modelica system model, the connections must be defined as well. In List. 4.2, for each newly created component a connection (i.e., an instance of the Connection class) to the corresponding busbar is created. To this end, a function template of Connection with the signature
template <typename T> void cal_middle_points(T* component);
is called in the constructors of Connection and computes one or two middle points between the endpoints of the connection line. The four different cases for the middle points are illustrated in Fig. 4.4.
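The middle-point computation can be illustrated with a small sketch (the routing rules below are assumptions for illustration; the actual cal_middle_points in CIMverter may route differently): collinear endpoints need no middle point, an L-shaped route needs one corner point, and a Z-shaped route needs two.

```cpp
#include <utility>
#include <vector>

// Illustrative middle-point computation for axis-parallel connection
// routes between two endpoints a and b (hypothetical logic, not the
// actual CIMverter implementation).
using Point = std::pair<double, double>;

std::vector<Point> middle_points(Point a, Point b, bool z_shaped) {
    std::vector<Point> mid;
    if (a.first == b.first || a.second == b.second)
        return mid;  // endpoints are collinear: zero middle points
    if (!z_shaped) {
        mid.push_back({a.first, b.second});     // L-shape: one corner point
    } else {
        double mx = (a.first + b.first) / 2.0;  // Z-shape: two corner
        mid.push_back({mx, a.second});          // points at the horizontal
        mid.push_back({mx, b.second});          // midpoint
    }
    return mid;
}
```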
Furthermore, the connectors of the different components can vary between different Modelica libraries. Therefore, the connector names can be
Chapter 4 From CIM to Simulator-Specific System Models
CIM                                          | ModPowerSystems
---------------------------------------------|----------------
TopologicalNode, ExternalNetworkInjection    | Slack
ACLineSegment                                | PiLine
TopologicalNode, EnergyConsumer, SvPowerFlow | PQLoad
Table 4.2: Excerpt of further important mappings from CIM to ModPowerSystems as implemented in the Modelica Workshop
configured in a separate configuration file, called connectors.cfg, which is located in the directory of the corresponding Modelica template files. Its settings are read by all Connection constructors, combined, and fed into the dictionary used for filling the connections subtemplate, which is included by the main template file. The final Modelica code generation is presented by example in the next section.
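Purely as an illustration, such a file might map component types to the connector names used in the generated connect() equations (the keys, values, and syntax below are assumptions, not the actual CIMverter file format):

```
# connectors.cfg (illustrative sketch; the actual syntax may differ)
# component type -> connector name(s) used in connect() equations
PQLoad      = Pin1
Transformer = Pin1, Pin2
PiLine      = Pin1, Pin2
```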
4.5 Evaluation
For an evaluation of the approach and its implementation, exemplary templates as well as the resulting Modelica models are shown. To demonstrate the flexibility and applicability of CIMverter, two different power system libraries are used: the ModPowerSystems and the PowerSystems library. In addition, the simulation results obtained with the generated models are validated against the commercial simulation tool NEPLAN.
The main Modelica template defines the overall structure of the Modelica system model and contains markers for component instantiations and connection equations (List. 4.3). The inserted subtemplates hold information regarding the library and package from which the models are taken. For instance, see line 1 in the corresponding subtemplates, List. 4.4 (for ModPowerSystems) and List. 4.5 (for PowerSystems), of the Transformer
Figure 4.4: Connections with zero, one, and two middle points between the endpoints. The endpoints are marked with circles
model. As a use case, we generate the components for a steady-state simulation of a symmetrical power system in balanced operation. For the ModPowerSystems library, we utilize models from the PhasorSinglePhase package, since complex phasor variables and a single-phase representation are suitable for this type of simulation. In the case of the PowerSystems library, we perform the simulation with models from the AC3ph package, obtaining comparable results by considering the dq0 transform in the synchronously rotating reference frame. Other types of simulation can be performed by changing package and model names accordingly in the subtemplates. The considered Transformer subtemplates, List. 4.4 and List. 4.5, contain markers to define the primary and secondary nominal voltage as well as the rated apparent power. The interface of the ModPowerSystems component specifies the Transformer's electrical characteristics by rated short circuit voltage Vsc,r and short circuit losses Pcu,r, while resistance R and reactance X are defined for the PowerSystems component.
In our use case, we model the benchmark system described in [Rud+06], which is a medium-voltage distribution network with rural character. Integrated components are a slack bus, busbars, transformers, Pi lines, and PQ loads. An extract of the resulting Modelica system model, generated from the CIM data with the presented CIMverter converter, is shown in List. 4.6. The system model of the benchmark grid was additionally generated
Listing 4.3: Main Modelica template related to ModPowerSystems, including several sections (e.g., SYSTEM_SETTINGS) and subtemplates (e.g., PQLOAD)
{{# HEADER_FOOTER_SECTION }}
model {{ GRID_NAME }}
{{/ HEADER_FOOTER_SECTION }}
{{# SYSTEM_SETTINGS_SECTION }}
inner ModPowerSystems.Base.System
  {{ NAME }}(freq_nom(displayUnit = "{{ FNOM_UNIT }}") = {{ FNOM }})
  annotation(Placement(visible = {{ VISIBLE }},
    transformation(extent = {{ TRANS_EXTENT_POINTS }},
      rotation = {{ ROTATION }})));
{{/ SYSTEM_SETTINGS_SECTION }}
...
{{ >PQLOAD }}
{{ >TRANSFORMER }}
...
equation
{{ >CONNECTIONS }}
{{# HEADER_FOOTER_SECTION }}
...
end {{ GRID_NAME }};
{{/ HEADER_FOOTER_SECTION }}
Listing 4.4: Transformer subtemplate related to ModPowerSystems library
ModPowerSystems.PhasorSinglePhase.Transformers.Transformer
  {{ NAME }}(Vnom1 = {{ VNOM1 }}, Vnom2 = {{ VNOM2 }},
    Sr(displayUnit = "{{ SR_DISPLAYUNIT }}") = {{ SR }},
    Pcur = {{ PCUR }}, Vscr = {{ VSCR }})
  annotation(Placement(visible = {{ VISIBLE }},
    transformation(extent = {{ TRANS_EXTENT_POINTS }},
      rotation = {{ ROTATION }}, origin = {{ ORIGIN_POINT }})));
Listing 4.5: Transformer subtemplate related to PowerSystems library
PowerSystems.AC3ph.Transformers.TrafoStray
  {{ NAME }}(redeclare record Data =
    PowerSystems.AC3ph.Transformers.Parameters.TrafoStray(
      puUnits = false, V_nom = { {{ VNOM1 }}, {{ VNOM2 }} },
      r = { {{R}}, 0 }, x = { {{X}}, 0 }, S_nom = {{ SR }}))
  annotation(Placement(visible = {{ VISIBLE }},
    transformation(extent = {{ TRANS_EXTENT_POINTS }},
      rotation = {{ ROTATION }}, origin = {{ ORIGIN_POINT }})));
for the use of the PowerSystems library, simply by switching from the ModPowerSystems to the PowerSystems template set. The connection diagrams of the resulting models, Fig. 4.5, show the same grid topology involving the respective components from both libraries.
For the validation of both Modelica system models, they were built and simulated. Afterwards, the simulation results were compared with those of the proprietary simulation tool NEPLAN (Tab. 4.3).
4.6 Conclusion and Outlook
This chapter presents an approach for the transformation from CIM to Modelica. The mapping of CIM RDF/XML documents to Modelica system models is based on a CIM to C++ deserializer, a Modelica Workshop representing the Modelica classes in C++, and a template engine. CIMverter, the implementation of this approach, is flexible enough to address arbitrary Modelica libraries, as demonstrated by the generation of system models for two power system libraries. In the case of ModPowerSystems, there is no need to modify the mappings as implemented in the CIM object handlers when switching to the PowerSystems library. Also, the Modelica Workshop
Listing 4.6: Medium-voltage benchmark grid [Rud+06] as converted from CIM to a system model based on the ModPowerSystems library
 1  model modpowersystems_mv_benchmark_grid
 2    inner ModPowerSystems.Base.System
 3      System(freq_nom(displayUnit = "Hz") = 50.0)
 4      annotation(Placement(visible = true,
 5        transformation(extent = {{0.0,-30.0},{30.0,0.0}},
 6        rotation = 0)));
 7    ...
 8    ModPowerSystems.PhasorSinglePhase.Loads.PQLoad
 9      CIM_Load12_H(Pnom(displayUnit = "W") = 15000000.000,
10        Qnom(displayUnit = "var") = 3000000.000,
11        Vnom(displayUnit = "V") = 20000.000)
12      annotation(Placement(visible = true,
13        transformation(extent = {{-8.0,-8.0},{8.0,8.0}},
14        rotation = 0, origin = {237.1,-107.8})));
15    ...
16    ModPowerSystems.PhasorSinglePhase.Transformers.Transformer
17      CIM_TR1(Vnom1 = 110000.000, Vnom2 = 20000.000,
18        Sr(displayUnit = "W") = 40000000.000,
19        Pcur = 0.63000, Vscr = 12.04000)
20      annotation(Placement(visible = true,
21        transformation(extent = {{-8.0,-8.0},{8.0,8.0}},
22        rotation = -90, origin = {86.0,-64.3})));
23    ...
24  equation
25    connect(CIM_N0.Pin1, CIM_TR1.Pin1)
26      annotation(Line(points = {{153.80,-40.00},{153.80,-56.15},
27        {86.00,-56.15},{86.00,-72.30}},
28        color = {0,0,0}, smooth = Smooth.None));
29    ...
30  end modpowersystems_mv_benchmark_grid;
classes are compatible with both libraries. Subsequently, the generated system models, simulated with a Modelica environment, were successfully validated against a common power system simulation tool. CIMverter has already been successfully applied in the research area of power grid simulations, for instance in [Din+18].
It is obvious that the current implementation can also be used for conversions into formats other than Modelica, even with the current Modelica Workshop, as the introduced template markers can be used in every file format. Therefore, the Modelica Workshop could be cleaned up and extended to a general Power Systems Workshop, addressing data formats used by other power system analysis and simulation tools. Furthermore, the template-based approach also allows different target system model
[Connection diagrams of the generated benchmark-grid models: (1) ModPowerSystems, (2) PowerSystems]

Figure 4.5: Medium-voltage benchmark grid [Rud+06] converted from CIM to a system model in Modelica based on the ModPowerSystems and the PowerSystems library
Grid   NEPLAN              ModPowerSystems     PowerSystems
Node   |V| [kV]  ∠V [°]    |V| [kV]  ∠V [°]    |V| [kV]  ∠V [°]
N0     110.000   0.000     110.000   0.000     110.000   0.000
N1      19.531  -4.300      19.532  -4.268      19.532  -4.268
N10     18.828  -4.900      18.828  -4.852      18.828  -4.852
N11     18.825  -4.900      18.826  -4.852      18.826  -4.852
Table 4.3: Excerpt from the numerical results for node phase-to-phase voltage magnitude and angle regarding the medium-voltage benchmark grid. The models based on the ModPowerSystems and PowerSystems libraries yield equal results using the Dymola environment and the dassl solver. The results deviate marginally from the reference results obtained with the proprietary tool NEPLAN, which might be explained by numerical rounding and different solution methods
formats than Modelica. Meanwhile, the system model format of the DistAIX simulator [FEIa] has also been implemented.
Additionally, the current middle point calculation for the Modelica connection diagrams could be improved by using a graph layout library such as Graphviz [Ell+01]. This would allow CIMverter to equip the generated document with proper diagram data even if the CIM topology to be converted contains no diagram data at all.
5 Modern LU Decompositions in Power Grid Simulation
With the aid of CIMverter, which was presented in Chap. 4, system models based on the ModPowerSystems (MPS) library can finally be created from up-to-date industry-standard grid models (i.e., based on the Common Information Model (CIM)). This allows scientific studies on real-world use cases, which usually have a higher complexity than simple lab examples. These studies often involve newly developed and more accurate models as well as smaller time steps for higher-resolution simulations. One possibility to accomplish more accurate simulations within the same computation time is to improve the numerical back-end of the utilized simulation environment.
During the development of the MPS library [FEI19b] (for more on Modelica see Sect. 4.1.1) by ACS and of the iTesla Power System Library (iPSL), developed among others by Réseau de Transport d'Électricité (RTE) [AIA19], a cooperation between RTE and ACS was established. Shortly before, the SUNDIALS/IDA solver [Hin+05] for differential-algebraic systems of equations (DAEs) had been integrated into OpenModelica to achieve a potentially higher simulation performance in the case of large models with a sparse structure [Ope19a]. During its execution, IDA applies a backward differentiation formula (BDF) to the given DAE, resulting in a nonlinear algebraic system of equations that is solved by Newton iterations [HSC19]. Within each iteration, a linear system needs to be solved. For linear system solution, IDA provides several iterative and direct methods [MV11]: BLAS/LAPACK [Uni17; Uni19] implementations are supplied for
dense as well as banded matrices, and KLU [DP10] as well as SuperLU_MT (a multithreaded version of the well-known SuperLU [Sup]) are supplied for sparse linear systems.
In the European project PEGASE [CRS+11], KLU showed the highest overall performance of all compared LU decompositions (the others were LAPACK, UMFPACK, MUMPS, SuperLU_MT, and PARDISO), applied to linear systems (i.e., Jacobian matrices) arising from different power grid simulation scenarios. However, new LU decompositions have been developed since PEGASE: the parallelized NICSLU [CWY13] and Basker [BRT16] for conventional shared-memory computer architectures [Roo99], and GLU for graphics processing units (GPUs) [Che+15].
This chapter provides a comparison of the mentioned LU decompositions (i.e., KLU, NICSLU, Basker, and GLU), which were all developed especially for circuit simulation. This comprises a brief introduction to the working principles of the decompositions, illustrating the main ideas behind them. The subsequent analysis is carried out on a set of benchmark matrices which arose during simulations with Dynaωo, an open-source simulation tool developed at RTE [Adr19]. Finally, the results are summarized and a conclusion is drawn. The work in this chapter has already been partially presented in [Raz+19a].
5.1 LU Decompositions in Power Grid Simulation
In many simulation environments, such as OpenModelica [Fri15a], system models with algebraic and differential equations are transformed into a DAE. More on this transformation procedure from system models to DAEs is provided in Sect. 4.1.1. A numeric DAE solver computes the values of all relevant variables in the specified simulation time interval [tstart, tend].
5.1.1 From DAEs to LU Decompositions
Two well-known DAE solvers are DASSL [Pet82] and IDA from the open-source SUite of Nonlinear and DIfferential/ALgebraic equation Solvers (SUNDIALS) [Hin+05]. IDA solves the initial value problem (IVP) for a DAE of the form
$$F(t, y, \dot y) = 0, \qquad y(t_0) = y_0, \qquad \dot y(t_0) = \dot y_0, \tag{5.1}$$

where $F, y, \dot y \in \mathbb{R}^N$, $t$ is the independent (time) variable, $\dot y = \mathrm{d}y/\mathrm{d}t$, and the initial values $y_0$, $\dot y_0$ are given [HSC19].
The integration method in IDA is the so-called variable-order, variable-coefficient BDF in fixed-leading-coefficient form [BCP96] of order $q \in \{1, \dots, 5\}$, given by the multistep formula

$$\sum_{i=0}^{q} \alpha_{n,i}\, y_{n-i} = h_n \dot y_n, \tag{5.2}$$

where $y_n$ and $\dot y_n$ are the computed approximations to $y(t_n)$ and $\dot y(t_n)$, with (time) step size $h_n = t_n - t_{n-1}$ and coefficients $\alpha_{n,i}$ determined depending on $q$. The application of this BDF to the DAE results in the following nonlinear algebraic system to be solved at each step:

$$G(y_n) := F\Big(t_n,\, y_n,\, h_n^{-1} \sum_{i=0}^{q} \alpha_{n,i}\, y_{n-i}\Big) = 0. \tag{5.3}$$
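As a concrete instance, for $q = 1$ the coefficients are $\alpha_{n,0} = 1$ and $\alpha_{n,1} = -1$, so Eq. (5.2) reduces to the implicit Euler rule and Eq. (5.3) becomes:

$$\dot y_n = \frac{y_n - y_{n-1}}{h_n}, \qquad G(y_n) = F\Big(t_n,\, y_n,\, \frac{y_n - y_{n-1}}{h_n}\Big) = 0.$$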
IDA solves Eq. (5.3) with the Newton method (or a user-defined nonlinear solver). Writing $y := y_n$ in the $n$-th time step, with $y = (y_1, \dots, y_N)^T \in \mathbb{R}^N$, $G$ is linearized by applying a Taylor expansion to each component $G_i$ around $y^{(m)}$ in the $m$-th Newton iteration:

$$G_i(y) = G_i(y^{(m)}) + \sum_{j=1}^{N} \frac{\partial G_i(y^{(m)})}{\partial y_j}\,\big(y_j - y_j^{(m)}\big) + \mathcal{O}\big(\lVert y - y^{(m)} \rVert_2^2\big), \tag{5.4}$$
with $i = 1, \dots, N$, which can be shortened by using the Jacobian matrix definition [DR08]

$$J = \begin{pmatrix} \dfrac{\partial G_1}{\partial y_1} & \cdots & \dfrac{\partial G_1}{\partial y_N} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial G_N}{\partial y_1} & \cdots & \dfrac{\partial G_N}{\partial y_N} \end{pmatrix} \tag{5.5}$$
to the equation

$$G(y) = G(y^{(m)}) + J(y^{(m)})\,\big(y - y^{(m)}\big) + \mathcal{O}\big(\lVert y - y^{(m)} \rVert_2^2\big). \tag{5.6}$$
Hence, neglecting the Taylor remainder (i.e., the $\mathcal{O}$-term) and setting $G(y)$ to $0$ for finding the zeros, in each Newton iteration a linear system of the form

$$J\,\big[y_n^{(m+1)} - y_n^{(m)}\big] = -G\big(y_n^{(m)}\big) \tag{5.7}$$

needs to be solved, where $y_n^{(m)}$ is the $m$-th approximation to $y_n$ in the $n$-th simulation time step. For solving the linear system, LU decompositions can be utilized.
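A minimal numeric sketch of the iteration (5.7), with a hand-coded dense 2x2 solve standing in for the sparse LU decompositions discussed in this chapter (the example system $G(y) = (y_1^2 + y_2^2 - 4,\; y_1 - y_2)^T$ is chosen arbitrarily; its root is $(\sqrt{2}, \sqrt{2})^T$):

```cpp
#include <array>
#include <cmath>

// Newton iteration: in each step the linear system J * delta = -G is
// solved (here via Cramer's rule for a 2x2 Jacobian; IDA would use a
// sparse LU decomposition instead) and the iterate is updated.
using Vec2 = std::array<double, 2>;

Vec2 newton_solve(Vec2 y, int iters) {
    for (int m = 0; m < iters; ++m) {
        double G0 = y[0] * y[0] + y[1] * y[1] - 4.0;  // G_1(y)
        double G1 = y[0] - y[1];                      // G_2(y)
        // Jacobian J = [[2*y0, 2*y1], [1, -1]]
        double a = 2.0 * y[0], b = 2.0 * y[1], c = 1.0, d = -1.0;
        double det = a * d - b * c;
        y[0] += (-G0 * d + G1 * b) / det;  // delta_0 of J * delta = -G
        y[1] += (-a * G1 + c * G0) / det;  // delta_1
    }
    return y;
}
```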
5.1.2 LU Decompositions for Linear System Solving
LU decompositions belong to the category of direct solvers. There are various methods for different matrix types, such as the Cholesky decomposition for Hermitian positive-definite matrices [FO08]. For the decomposition of sparse matrices, special LU decomposition algorithms are used which store just the non-zero entries of the matrices to reduce memory consumption and arithmetic operations.
During factorization, a non-zero entry can arise at a position where a zero entry has been before, which is called fill-in. Therefore, LU decompositions usually perform a preordering step for fill-in reduction before the actual factorization step, leading to lower memory and time consumption during the subsequent factorization [TW67]. In general, the problem of computing an ordering with the lowest fill-in is NP-complete [Yan81].
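The effect of the ordering on fill-in can be demonstrated with a toy example (dense code for clarity, not a sparse implementation): an "arrow" matrix with a full first row and column fills in completely during elimination, while the reverse ordering produces no fill-in at all.

```cpp
#include <array>

// Counts the fill-in produced by LU elimination without pivoting on a
// small dense matrix: entries that become non-zero at positions where
// the input had a zero (an illustration of why preordering matters).
template <int N>
int count_fill_in(std::array<std::array<double, N>, N> A) {
    std::array<std::array<bool, N>, N> was_zero{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            was_zero[i][j] = (A[i][j] == 0.0);
    int fill = 0;
    for (int k = 0; k < N; ++k)
        for (int i = k + 1; i < N; ++i) {
            double l = A[i][k] / A[k][k];
            for (int j = k; j < N; ++j) {
                A[i][j] -= l * A[k][j];
                if (was_zero[i][j] && A[i][j] != 0.0) {
                    ++fill;                  // a new non-zero appeared
                    was_zero[i][j] = false;
                }
            }
        }
    return fill;
}
```

For the 4x4 arrow matrix with non-zeros only in the first row, first column, and on the diagonal, eliminating the first column fills the remaining block (6 new entries), whereas the reversed ordering keeps the pattern intact.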
Apart from direct solution methods, [CRS+11] also analyzed how well iterative methods perform for solving the linear systems inside the Newton iterations. The iterative Generalized Minimal Residual method (GMRES) was chosen, as it is suitable for general matrices. The conclusion was that GMRES is too costly on the Jacobian matrices from the area of power grids, especially when complex preconditioning methods must be applied before the solver in order to achieve a better convergence behavior. Furthermore, the Jacobian matrices are not only sparse but also generate little fill-in during processing by the LU decompositions. Similarly, [SV01] states that large electric circuits are not easy to solve efficiently by iterative methods, but that there is development potential, as not much research has been done in this area yet. In the following, the two main steps of current LU decomposition methods are introduced.
Preprocessing
Usually, the preprocessing consists of a preordering step and partial pivoting. During the preordering, permutation matrices are computed. Partial pivoting is performed to reduce the round-off error during the subsequent factorization. Hence, for a given linear system $Ax = b$, the final system of equations to be solved after preordering and factorization with pivoting can be represented as
$$(PAQ)\,Q^T x = Pb,$$
where the row permutations as well as partial pivoting are performed by $P$ and the column permutations by $Q$ [DP10]. The preordering methods
for fill-in reduction are usually based on one of the following heuristics:
- minimum degree (MD), which belongs to the greedy algorithms [Heg+01];
- nested dissection (ND), which is based on graph partition computation by a divide-and-conquer approach [Geo73].
In general, fill-in reduction algorithms based on nested dissection are more time-consuming [Heg+01], but their results usually lead to less fill-in [KMS92]. Besides the permutations coming from fill-in reduction, some LU decomposition methods perform further permutations during preprocessing, as well as matrix scaling and scheduling of the parallel factorization (if any). These are mentioned in the introduction of the respective decomposition method.
Factorization
The actual LU factorization with the factors $L$ and $U$ is performed on the previously permuted matrix $A' = PAQ$, such that $A' = LU$, and $b' = Pb$. For efficiency reasons, preorderings are not performed before each factorization: if, e.g., the values of a Jacobian change but its structure remains the same, the same permutations can be reapplied. In circuit simulation, this is very often the case [CWY12].
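The factorize/refactorize split can be sketched with a toy 2x2 dense code (an illustration of the concept, not the KLU API): the first factorization searches for pivots and records the row permutation, while the refactorization reuses that permutation on a matrix with the same non-zero pattern but new values.

```cpp
#include <array>
#include <cmath>
#include <utility>

// Toy LU with partial pivoting (first call) and pivot-reusing
// refactorization (subsequent calls), mirroring the analyze/factor/
// refactor split of sparse solvers such as KLU.
constexpr int N = 2;
using Mat = std::array<std::array<double, N>, N>;

struct LU {
    Mat m;                    // L below the diagonal, U on and above it
    std::array<int, N> perm;  // recorded row permutation P
};

static void eliminate(LU& f) {
    for (int k = 0; k < N; ++k)
        for (int i = k + 1; i < N; ++i) {
            f.m[i][k] /= f.m[k][k];  // multiplier, stored as L entry
            for (int j = k + 1; j < N; ++j)
                f.m[i][j] -= f.m[i][k] * f.m[k][j];
        }
}

LU lu_factor(const Mat& A) {  // full factorization with pivot search
    LU f{A, {0, 1}};
    if (std::fabs(f.m[1][0]) > std::fabs(f.m[0][0])) {
        std::swap(f.m[0], f.m[1]);
        std::swap(f.perm[0], f.perm[1]);
    }
    eliminate(f);
    return f;
}

LU lu_refactor(const Mat& A, const LU& prev) {  // reuse old pivoting
    LU f{{A[prev.perm[0]], A[prev.perm[1]]}, prev.perm};
    eliminate(f);
    return f;
}

std::array<double, N> lu_solve(const LU& f, std::array<double, N> b) {
    std::array<double, N> pb{b[f.perm[0]], b[f.perm[1]]};  // apply P
    pb[1] -= f.m[1][0] * pb[0];                   // forward:  L y = P b
    double x1 = pb[1] / f.m[1][1];                // backward: U x = y
    double x0 = (pb[0] - f.m[0][1] * x1) / f.m[0][0];
    return {x0, x1};
}
```

In a simulation loop, lu_factor would be called once per non-zero pattern, and lu_refactor plus lu_solve in every Newton iteration.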
Solving with the computed LU decomposition
Usually, LU decomposition packages also provide functionality for right-hand side solving, as this needs the permutations of the preordering to return correct results. Hence, for a given $A' = LU$, the solution $x$ of $Ax = b$ can be computed from the solution vector $x'$, whereby

$$A'x' = b' \quad\Leftrightarrow\quad Ly = b' \;\text{ and }\; Ux' = y$$

with $x = Qx'$ and $b = P^T b'$. The solving step is computationally less expensive than the two
steps before, but it is repeated many times in Newton's (iterative) method. In this work, the term decomposition is used when the whole method, such as KLU, is meant, whereas factorization is used when the focus rests upon the actual factorization step of the decomposition. The considered LU decompositions for electrical circuits (NICSLU, GLU, and Basker, with KLU as reference) are compared in the following.
5.1.3 KLU, NICSLU, GLU, and Basker by Comparison

Contrary to KLU, all newer LU decompositions (i.e., NICSLU, Basker, and GLU) were developed especially for modern computer architectures with multi-core central processing units (CPUs) or even GPUs. As single-core performance has essentially stagnated since around the year 2005 [Pre12], the utilization of parallel architectures is essential for a higher runtime efficiency on newer computer hardware, which comes with more and more CPU cores as well as increasingly powerful accelerators.
KLU
KLU is a decomposition algorithm for asymmetric sparse matrices in power grid simulation [DP10]. Besides commercial tools, such as the numerical computing environment MATLAB and the circuit simulator Xyce, KLU is integrated into IDA. Since KLU was developed with a focus on circuit matrices, it shows a high runtime efficiency in the area of power grid simulations [CRS+11]. Therefore, the OpenModelica and Dynaωo simulation environments use KLU as the linear solver within IDA, which in turn serves as the solver for the initial value problems of DAEs resulting from system models.
When solving the first matrix (in a sequence), KLU performs four steps:
1. A permutation of the given matrix $A$, to be factorized into $L$ and $U$, is performed by the matrices $P$ (row permutation) and $Q$ (column permutation), yielding a block triangular form (BTF):

$$PAQ = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ 0 & A_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & A_{nn} \end{pmatrix}$$

The diagonal blocks are mutually independent and are therefore the only blocks that require factorization.
2. The Approximate Minimum Degree (AMD) ordering algorithm is performed block-wise on each block Akk for fill-in reduction [ADD04]. Fill-in is defined as a non-zero entry arising during factorization in L or U at a position at which A has a zero entry. Fill-in reduction is a crucial step in sparse matrix factorizations, as new non-zero entries in sparse matrices require memory space (zero entries need none). This leads to a higher memory consumption and, during further processing, to more memory accesses, which can be very
time-costly, decreasing the performance of the whole factorization significantly (especially on modern processors, because of the memory wall [ECT17]). Therefore, KLU is optimized for fill-in reduction of circuit matrices. Alternatively to AMD, the Column Approximate Minimum Degree (COLAMD) ordering algorithm [Dav+04], CHOLMOD orderings such as nested dissection based on METIS (an unstructured graph partitioning and sparse matrix ordering package [KK95]), or a user-defined permutation can be chosen for each block.
3. Each Akk is scaled and symbolically as well as numerically factorized using KLU's implementation of Gilbert/Peierls' (GP) left-looking algorithm. The scaling of the block matrices (i.e., achieving matrix entries of comparable magnitude) is a pre-step for pivoting, which is performed on each Akk, as the factorization method is also applied block-wise; this leads to a higher numerical stability.
4. Optional: The whole system is solved with the resulting factorization using block back substitution.
In the case of subsequent factorizations of matrices with the same non-zero pattern, the first two steps are omitted, and in the third step a simplified left-looking method without partial pivoting is applied. This so-called refactorization step allows the omission of the depth-first search within the GP algorithm, leading to a higher performance. The first two steps constitute the preordering. A parallelization approach was mentioned in [Abu+18], but without any implementation details; the official KLU version is not parallelized.
NICSLU
NICSLU is a shared-memory parallelized [Roo99] LU decomposition [CWY13]. Nevertheless, some steps performed by NICSLU are similar to those of KLU:
1. Instead of BTF, the MC64 algorithm is utilized to find a permutation and diagonal scaling for sparse matrices. Putting large entries on the diagonal can make the subsequent pivoting numerically more stable.
2. As opposed to KLU, the AMD algorithm for fill-in reduction is not applied on each diagonal block but on the whole matrix.
3. This step determines whether the subsequent factorization shall be performed sequentially (in 4.1.) or in parallel (in 4.2.), i.e., with multiple threads, e.g., on several CPU cores.
4.1. The sequential factorization is based on the left-looking GP algorithm, performing a symbolic factorization, a numeric factorization, and partial pivoting.
4.2. The parallel factorization was developed based on the left-looking GP and KLU algorithm [CWY13].
5. Optional: The whole system is solved with the resulting factorization using classical right-hand side solving.
Analogous to KLU, the first two steps make up the preordering phase and, together with step 3, the whole preprocessing. In [CWY13] the authors present a benchmark of, among others, NICSLU vs. KLU on 23 circuit matrices, with NICSLU showing geometric-mean speedups of 2.11 to 8.38 when executed with 1 to 8 parallel computing threads. These parallel speedups were one reason for the choice of NICSLU in the comparative analysis of modern LU decompositions presented later.
GLU
GLU is also a parallelized LU decomposition, but for CUDA-enabled GPUs [Che+15]. As it was also developed for circuit matrices, its steps are similar to those of KLU and NICSLU:
1. MC64 is performed as in NICSLU.
2. AMD is performed as in NICSLU (i. e. on whole matrix).
3. A symbolic factorization, with 0 and 1 as the only entries for zero and non-zero values, is performed to determine the structure of L and U as well as the grouping of independent columns into so-called column levels.
4. A hybrid right-looking LU factorization (instead of left-looking as in GP) is performed, which benefits from the column-level concurrency and the symbolic factorization.
The first three (preprocessing) steps are executed on the CPU and only step 4 on the GPU. Experimental results were presented in [Che+15], including, e.g., speedups of 19.56 over KLU on the set of typical circuit matrices from the University of Florida collection.
Basker
Basker is the newest of the four LU decompositions and, like NICSLU, also shared-memory parallelized, but from an algorithmic point of view it is mostly
similar to KLU [BRT16]. It was developed as an alternative to KLU for circuit simulation, exploiting two levels of parallelism, between blocks and within blocks, as described below:
1. As in KLU, BTF is performed (this can be disabled). The resulting matrix has large and small diagonal blocks.
2.1. The small diagonal blocks can be factorized in parallel, in a so-called Fine Block Triangular Structure, as they do not depend on each other. Hereby,
a) each small diagonal block is symbolically factorized in parallel and afterwards
b) a parallel loop over all these small blocks applies the sequential GP algorithm to each of them.
2.2. The large diagonal blocks, in contrast, could be too large to be factorized by the sequential GP algorithm, as this could dominate the complete LU decomposition time. Therefore, large blocks
a) are reordered by ND and
b) the ND structure is mapped to threads by using a task dependency graph, which is transformed into a task dependency tree representing level sets that can be executed in parallel. After that,
c) the parallel ND Symbolic Factorization and
d) the parallel ND Numeric Factorization are performed.
In [BRT16] a geometric-mean speedup of 5.91 over KLU is stated for a CPU-based system with 16 cores.
5.2 Analysis of Modern LU Decompositions for Electrical Circuits
For the comparative analysis of the new LU decompositions with KLU as reference, they were integrated into a measurement environment with drivers for a set of benchmark matrices, to evaluate which of them could be integrated into a power grid simulation environment for further analyses. In this work, the results presented in [Raz+19a] are extended by further analyses, especially on Basker.
5.2.1 Analysis on Benchmark Matrices from Large-Scale Grids
For an equal measurement of all LU decomposition methods, a measurement environment was developed in C++, which also helped with the integration of promising methods into proper simulation environments. Its driver executes each decomposition and measures the wall clock time of each relevant processing step.
Benchmark Matrices
For an analysis of the correctness and performance of the LU decompositions, a benchmark around seven matrices was developed. The matrices were extracted from Dynaωo static phasor simulations of real test cases conducted at RTE, spanning from a regional portion of the grid to a test case representing a merge of the networks of different countries, with high voltage (HV) and extra high voltage (EHV) parts.
The modeling choices are the same for all scenarios (except the load models): synchronous machines with their controls for classical generation units, standard static VAR compensators, controllers as well as special protection schemes (tap and phase shifter, current limit controller, voltage controller, etc.), primary frequency control, and primary as well as secondary voltage regulation.
Loads are modeled either as first-order restorative loads, denoted as simplified loads (SLs), or as voltage-dependent loads (VDLs) behind one or two transformers. Both models are used at RTE depending on the study scope and are thus of practical relevance. Tab. 5.1 presents all benchmark matrices provided by RTE with some information about their origin and their characteristics. Moreover, Fig. 5.1 depicts the matrix sparsity patterns, which are typical for power grid matrices. Usually they
No.  Power Grid                        K     N       NNZ     d [%]
(1)  French EHV with SL                2000  26432   92718   0.013
(2)  French EHV with VDL               2000  60236   188666  0.0051
(3)  F. + one neighbor EHV, SL         3000  47900   205663  0.0089
(4)  F. + one neighbor EHV, VDL        3000  75300   266958  0.0047
(5)  F. + neighb. countries EHV, SL    7500  70434   267116  0.0054
(6)  F. EHV + regional HV, VDL         4000  197288  586745  0.0015
(7)  F. + neighb. countries EHV, VDL   7500  220828  693442  0.0014

Table 5.1: Characteristics of the square N × N matrices with K nodes, sorted by number of nonzeros NNZ, and with density factor d = NNZ / N² in %
84
5.2 Analysis of Modern LU Decompositions for Electrical Circuits
show a very low density factor (i. e. number of non-zero elements), mainlyconcentrated around the diagonal.
In all the shown matrices, the upper left part corresponds to the network part. It is followed by many small blocks around the diagonal: the injection models (generators, loads, etc.), which are modeled using only one interface to the network (current and voltage). Finally, the columns in the right part of the matrix containing non-zero elements result from the system-wide controls, such as calculations of the system frequency, that are related to all generators.
The density factor is higher with SL models than with VDL models, as VDLs have many more variables which are mainly linked to each other but not to outer variables (except through a single network interface). More information on the (a-)symmetry of circuit matrices can be found in [DP10].
Measurement Environment
The following execution time measurements were performed on a server with 2 sockets, each with an Intel Xeon E5-2643v4 3.4 GHz (3.7 GHz Turbo) 6-core CPU with Hyper-Threading (HT); 32 GB DDR4 main memory; an NVIDIA TESLA P40 (GP102 Pascal, 24 GB GDDR5); running an x86_64 Ubuntu 16.04 Server Linux with kernel a) 4.13.0-46-generic for general measurements, b) 4.11.5-RT (with enabled PREEMPT_RT [Lin19b]) for real-time (RT) kernel measurements, and c) 4.13.16-custom for GLU measurements with NVIDIA driver x86_64-396.44 and CUDA 9.2. The versions of the LU decompositions and compilers are: KLU v1.3.8 with gcc-7.2.0, NICSLU v3.0.1 with clang-4.0.1-6, and GLU v2.0 with g++-7.2.0, all built with compiler optimization level 2 as this leads to the highest performance. All measured times are wall clock times.
Figure 5.1: Sparsity patterns of the benchmark matrices (panels (1)–(7), one per matrix in Tab. 5.1)
Complete Decomposition
The total execution times (i. e. preprocessing and factorization) for a complete decomposition of the benchmark matrices are plotted in Fig. 5.2. For almost all matrices, Basker is the most time-consuming method, followed by NICSLU, which on some matrices is nearly as performant as KLU. Only in case of matrix no. 3 does Basker show a better performance than NICSLU. As pictured in Fig. 5.1, this matrix has many relatively big blocks on its diagonal. While the times of all CPU-based implementations are below ca. 1 s, the GLU times are in most cases around 10 times higher. The main reason is the preprocessing time, as can be seen in the next plots.
Figure 5.2: Total (preprocessing + factorization) times over matrices no. 1–7; (1) KLU, NICSLU, and Basker; (2) GLU
Preprocessing
The preprocessing times of KLU, as shown in Fig. 5.3, are the lowest in all cases. The main reason for this is the application of AMD on (smaller) submatrices instead of the whole matrix. In case of Basker, not only the whole runtime but also the preprocessing of matrix no. 3 is relatively fast in comparison to the other matrices. For GLU it can be seen that the preprocessing occupies most of the total decomposition time, which is due to the symbolic factorization step being performed on the CPU.
Figure 5.3: Preprocessing times over matrices no. 1–7; (1) KLU, NICSLU, and Basker; (2) GLU
Factorization
The factorization times of Basker in Fig. 5.4 are also higher than those of KLU and NICSLU. The times NICSLU needs are mostly equal to or lower than those of KLU, especially for matrix no. 6, which is one of the larger matrices and has a quite big dense block in its upper left corner, as depicted in Fig. 5.1. The factorizations performed by GLU on the TESLA device need only a fraction of the total decomposition time but are still around 10 times slower than KLU and NICSLU on the CPU.
Figure 5.4: Factorization times over matrices no. 1–7; (1) KLU, NICSLU, and Basker; (2) GLU
Complete Decomposition and Preprocessing on RT kernel
In Fig. 5.5 the execution times of the most promising decomposition methods, KLU and NICSLU, are compared between the generic and the RT kernel. KLU needs considerably more time on the RT than on the generic kernel. In case of NICSLU there are only small differences between the kernels. As a consequence, the total times of KLU are always lower than those of NICSLU on the generic kernel and often higher on the RT kernel. At this point it is important to note that a real-time optimized system resp. kernel does not need to run faster than a generic one; instead, the goal is that it runs deterministically within well-specified time constraints. The pure preprocessing times of KLU, as shown in Fig. 5.5, are the lowest in all cases; the main reason for this is again the application of AMD on (smaller) submatrices instead of the whole matrix. Again, the run-times of KLU on the two kernels differ more than those of NICSLU.
Figure 5.5: Execution times of KLU and NICSLU on generic vs. RT kernel; (1) total (preprocessing + factorization); (2) preprocessing
Refactorization vs. Factorization
Since neither Basker nor GLU currently supports refactorization, these execution times were measured only for KLU and NICSLU, on the generic and the RT kernel, as depicted in Fig. 5.6. For both methods, the time for refactorization is much lower than for factorization. NICSLU performs refactorizations much faster than KLU. On the RT kernel, most NICSLU factorizations of the provided matrices are even faster than KLU refactorizations.
Figure 5.6: (Re-)factorization times of KLU and NICSLU; (1) generic kernel; (2) RT kernel
Parallel Shared-Memory Processing of Basker and NICSLU
The CPU-based LU decompositions in the previously presented measurements were executed sequentially. As no official parallelized version of KLU was available to the authors, only the parallel processing of Basker and NICSLU is considered in the following. The parallel processing of NICSLU is shown in Fig. 5.7. The total execution times with multiple threads are always higher than with a single thread. This cannot be caused by the turbo mode (i. e. a higher CPU clock rate) alone, as the times with two threads are also higher. Even the pure factorization times with multiple threads are higher than with a single thread. Obviously, the parallelization of NICSLU does not scale for the benchmark matrices. The reason for the low performance with 16 threads is the total number of 12 physical processors (i. e. 24 logical processors with HT), leading the operating system scheduler to switch between running and waiting threads more often; the accompanying context switching causes longer execution times.
The parallel processing of Basker is shown in Fig. 5.8. Contrary to NICSLU, the factorization performed by Basker can scale well with multiple threads, e. g. for matrix no. 6. But still, Basker is not faster than the sequential KLU, even with a higher number of threads. Since Basker is in an alpha stage, it can only handle thread counts that are powers of two. Hence, no more than 8 truly independently executed threads could have been started on the 12-core system; moreover, software issues of Basker with some matrices led to the limit of 4 threads in the measurements.
Figure 5.7: NICSLU’s scaling over multiple threads (T), with 1, 2, 4, 8, and 16 threads; (1) total (preprocessing + factorization); (2) factorization
Figure 5.8: Basker’s scaling over multiple threads (T), with 1, 2, and 4 threads vs. sequential KLU; (1) total (preprocessing + factorization); (2) factorization
Alternative Preordering Methods
For a performance analysis with different preordering methods (AMD, METIS, and COLAMD), we integrated METIS and COLAMD into NICSLU. In case of METIS, the total execution times for the LU decompositions, as depicted in Fig. 5.9, are significantly higher than in case of AMD and COLAMD. The reason is the long execution time of METIS itself: as can be derived from Fig. 5.10, the factorization times after METIS preorderings are comparable to the factorization times after AMD and COLAMD preorderings.
On the generic kernel, KLU benefits from AMD. On the RT kernel it benefits from COLAMD, but in case of pure factorizations it can benefit from METIS as well. The NICSLU factorization times in case of AMD and COLAMD are in all cases very close together and lower than after METIS preorderings.

Figure 5.9: Total times with different preorderings; (1) KLU on generic kernel; (2) NICSLU on generic kernel; (3) KLU on RT kernel; (4) NICSLU on RT kernel
5.2.2 Analysis on Power Grid Simulations
The benchmarks presented in this subsection were performed by RTE, and parts of the text were authored by the co-authors from RTE of the publication [Raz+19a]. Because of its low performance in comparison to the other LU decompositions, GLU was not selected for the integration into simulation environments. Basker, however, was integrated into OpenModelica, but because of its early development stage it was not mature enough for adequate simulation benchmarks, as it generates errors at certain system sizes.
Figure 5.10: Factorization times with different preorderings; (1) KLU on generic kernel; (2) NICSLU on generic kernel; (3) KLU on RT kernel; (4) NICSLU on RT kernel

Due to NICSLU’s relatively high performance on the benchmark matrices, it was integrated into the IDA versions used by OpenModelica and Dynaωo. Moreover, due to positive performance results in parallel mode, Basker was integrated into the IDA version of OpenModelica for testing. This needs more effort, as Basker is at too early a development stage (e. g. it returns errors for certain matrices), which is also why it was not integrated into Dynaωo. As a result, simulations were performed with NICSLU in Dynaωo [Gui+18], which contains two solvers utilizing SUNDIALS. Three test cases have been selected to measure the performance of both LU decompositions with the two solvers, which are introduced below:
(1) French EHV network with SL models
(2) French EHV network with VDL models
(3) French EHV/HV network with VDL models
Measurement Environment
For each test case, the simulation lasts for 200 s with a line disconnection at t = 100 s and is run on a machine with an Intel Core i7-6820HQ 2.7 GHz (3.6 GHz Turbo) 4-core CPU with HT; 62 GB DDR4 main memory; running Fedora Linux with kernel 4.13.16-100.fc25.x86_64. All measured times are wall clock times.
Dynaωo’s Fixed Time Step Solver
The first of the two solvers available in Dynaωo is a fixed time step solver, inherited from PEGASE [Fab+11; FC09] and specifically designed for fast long-term voltage stability simulation. It applies a first-order Euler method using a Newton-Raphson (NR) approximation for resolving the nonlinear system at each time step (with KINSOL, an NR-based method available in SUNDIALS). In this approach, the LU decomposition of the Jacobian is computed as few times as possible.
Tab. 5.2 shows that only a few milliseconds are spent in the LU decomposition and the Jacobian evaluation. Moreover, most of the time elapses in the residual evaluation. It is important to note that the LU decompositions are performed only when there are major changes in the grid.
Case   KLU         NICSLU      Eval. J_F   Eval. F
no.    [s]    C    [s]    C    [s]    C    [s]    C
(1)    0.095  3    0.071  3    0.11   3    2.01   561
(2)    0.215  4    0.215  4    0.46   4    5.96   617
(3)    0.847  13   0.790  13   1.61   13   9.41   767

Table 5.2: Total execution times and numbers C of calls of the corresponding routines within the fixed time step solver, with Jacobian J_F and residual function vector F
Dynaωo’s Variable Time Step Solver
The second solver available in Dynaωo is a variable time step solver based on SUNDIALS/IDA plus additional routines to deal with algebraic mode changes due to topology modifications of the grid. Jacobian evaluations and LU decompositions occur much more often than with the fixed time step solver.
Table 5.3 presents the results with the variable time step solver. They confirm the trends observed with the individual matrices, i. e. the preordering step takes more time with NICSLU than with KLU, but these extra costs are offset by a substantial reduction of the factorization and refactorization steps. Usually, there should be mainly refactorizations; factorizations should appear only when there is a change in the matrix structure (corresponding either to a change in the grid topology or a deep change in the form of the injection equations). Keeping this point in mind, it should be possible to gain time with NICSLU on complete simulation times compared to KLU (26.67 s vs. 34.56 s in case (3)). This gain remains minimal at the moment compared to the overall numerical resolution time (36 s in case (1), 102 s in (2), and 266 s in (3)), but if improvements are also achieved on the other elementary tasks, it could make the difference in the long term.
5.3 Conclusion and Outlook
This chapter presents the most promising recently developed LU decomposition methods (Basker, NICSLU, and GLU) for electric circuit simulation that have been found in current literature. After a short introduction of the main ideas behind the methods, a comparative analysis with KLU (as the reference LU decomposition for power grids) was conducted on benchmark matrices from large-scale phasor time-domain simulation. The integration of NICSLU into OpenModelica and Dynaωo is stable, so it can already be used in productive environments. The immature Basker implementation, however, can be software-technically improved and tested within the OpenModelica environment, where it was integrated, to gain better runtime stability.
The analysis shows that KLU and NICSLU achieve a similar performance in total execution times on the benchmark matrices, while Basker’s performance, especially in single-threaded mode, is lower. However, Basker can achieve speedups of the factorization when running parallel threads.
Case   Preord. [s]   Fact. [s]   Refact. [s]   Sum [s]   D     f      Method
(1)    2.42          2.58        2.85          7.85      461   0.33   KLU
       2.74          0.88        0.72          4.34      461   0.33   NICSLU
(2)    4.98          2.81        2.72          10.51     466   0.34   KLU
       6.28          1.59        1.22          9.09      466   0.34   NICSLU
(3)    15.01         10.79       8.76          34.56     899   0.42   KLU
       18.96         4.87        2.84          26.67     899   0.42   NICSLU

Table 5.3: Accumulated execution times for the listed steps of the variable time step solver, with D LU decompositions and a factorization ratio f = #Fact./#Refact.
Moreover, Basker’s speedup behaves superlinearly for a subset of the benchmark matrices. Superlinear speedup often occurs due to hardware features regarding CPU caches [Ris+16]: it can be caused by a smaller amount of data per thread, which fits better into the caches. Basker’s developers indeed mentioned that for a larger number of threads, the ND-tree may provide smaller cache-friendly submatrices [BRT16]. Since Basker’s implementation is in an alpha state, one could possibly achieve better results with further development. For instance, it is dependent on the Trilinos library [Tri], especially on the parallel execution framework Kokkos; an individual parallelization of Basker, however, could result in a higher performance. GLU, despite its massive parallelization for GPUs, cannot compete with current CPU-based implementations in the presented analysis, as it showed a much lower performance in all cases.
The preprocessing of NICSLU is usually slower than that of KLU, but refactorizations in particular are performed faster. Like other shared-memory parallelized LU decompositions for sparse systems, NICSLU in many cases cannot make use of multiple CPU cores. This is a problem since CPU clock speeds are no longer increasing and the performance of processors nowadays is mainly increased by adding more CPU cores.
Executed on an RT kernel, NICSLU has shown a better performance than KLU, but further investigation of the causes is needed. Both KLU and now also NICSLU can benefit from different preorderings. Regarding complete simulations, NICSLU can provide improvements compared to KLU, benefiting from its distinct refactorization step, which is more common during simulations than a complete factorization step.
The analysis of the individual LU decompositions opens new perspectives for the generic numerical schemes and the choices made to improve the performance of power grid simulation solvers as well as of other power grid related software that can make use of LU decompositions. Furthermore, the integration of a performant LU decomposition (esp. into the widely used SUNDIALS library) allows simulation environment users to switch between different solvers not just for a better runtime performance under different circumstances (e. g. offline vs. real-time simulations) but also for a different numerical behavior. This can lead to better results in case of possible numerical instabilities and also provides an alternative in case of a solver issue.
6 Exploiting Parallelism in Power Grid Simulation
Besides runtime improvements through the application of numerical methods, such as LU decompositions that are better suited in general or in the special case of power grid simulation, proper methods from the area of high-performance computing (HPC) can also be applied to the respective simulation software. One such software recently developed at the Institute for Automation of Complex Power Systems (ACS) is the Dynamic Phasor Real-Time Simulator (DPsim), which introduces the dynamic phasor (DP) approach to real-time (RT) power grid simulation, as larger simulation steps are possible without losing accuracy [Mir+19]. This leads to a smaller impact of communication delays, e. g., between geographically distributed simulators running in different laboratories with special Hardware-in-the-Loop (HiL) setups. A reason for the coupling into one RT co-simulation could be the lack of needed resources (e. g. hardware, software, know-how, location, etc.) to run a complete HiL simulation in just one laboratory [Mir+17].
DPsim uses several external software libraries, which include the VILLAS framework for the communication with other real-time simulators, control/monitoring software as well as hardware, and so forth. Grid data in a Common Information Model (CIM) based format is read using the libcimpp library of the CIM++ project, as introduced in Chap. 3. Furthermore, multiple numerical libraries are used, as there are several solvers implemented in DPsim, such as a modified nodal analysis (MNA) based solver which utilizes LU factorizations on dense and sparse matrices of Eigen [Eig19]. Also the SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers (SUNDIALS) library is used as the backend of DPsim’s ordinary differential equation (ODE) solver.
To benefit from modern shared-memory multi-core systems, the computations within one time step are partitioned into multiple tasks, as defined by the parts of the simulation such as the utilization of different solvers and interfaces (e. g. an interface for real-time data exchange, a user interface for monitoring, and so forth). At this point, it should be noted that different solvers can be utilized within a single time step, as this depends on the components of the power grid model. Because of data dependencies between these tasks, they cannot all run in parallel, as this would lead to data races with wrong results [Qui03]. Therefore, a task dependency analysis is performed to achieve a data-race-free parallel task execution.
This chapter gives an overview of multiple kinds of model parallelization approaches on different abstraction resp. implementation levels that have been implemented, exemplarily, in the OpenModelica simulation environment. It also introduces schedulers for parallel task execution and describes how they are implemented in DPsim in combination with the task dependency analysis. This is followed by a runtime analysis of DPsim with the implemented approach on various power grids of different sizes. The chapter concludes with a discussion on the advantages and disadvantages of the parallel execution as well as of the utilized schedulers. This chapter presents outcomes of the supervised thesis [Rei19].
6.1 Parallelism in Simulation Models
Chapter 5, among other things, dealt with approaches where parallelism within numerical methods (LU decompositions in the present case) is used for shorter execution times on multi-core architectures. But instead of using the parallelism of numerical solvers, it is also possible to exploit the inherent parallelism of the model as such. The inherent parallelism of a model can be either expressed by its developer or automatically recognized. Without any claim to completeness, [Lun+09] describes the following first three types of approaches for exploiting parallelism in mathematical models:
1. Explicit Parallel Programming
This type concerns approaches where parallel constructs are expressed in the programming language of the mathematical model itself. For example, ParModelica [Geb+12] is an extension of the modeling language Modelica which allows the user to express parallelism in algorithm sections (i. e. in imperatively programmed parts of a model, as opposed to the declarative parts expressed by equation sections). In this approach, the developer of the model is responsible for its (correct) parallelization. For this purpose, ParModelica provides parallel variables (allocated in different memory spaces) as well as functions, parfor loops, and kernel functions which are executed on OpenCL devices (e. g. graphics processing units (GPUs)) as part of so-called heterogeneous computer systems.
2. Explicit Parallelization Using Computational Components
Another type of explicit parallelization exploitation is achieved by structuring the model into computational components using strongly typed communication interfaces. For this, the architectural language properties of Modelica, supporting components and strongly typed connectors, are generalized to distributed components and connectors. An example of this approach is Transmission Line Modeling (TLM), where the physical model is distributed among numerically isolated components [Sjö+10]. Hence, the equations of each submodel can be solved independently and thus in parallel.
This kind of explicit parallelization is implemented in DPsim by system decoupling in the form of two different methods: the Decoupled Line Model and Diakoptics, as presented in [Mir20].
3. Automatic Fine-Grained Parallelization of Mathematical Models
Besides the explicit expression of parallelism, it is also possible to extract parallelism from the high-level mathematical model or from the numerical methods used for solving the problem. The parallelism exploitation from mathematical models is categorized into the following subtypes:
• Parallelism over time: for example, in case of discrete event simulations where certain events are independent of other events and can therefore be handled in parallel;

• Parallelism of the system: here the modeled system (i. e. the model equations) is parallelized. Much research has been done on automatic parallelization, especially on equation-level methods [Aro06; Cas13; Wal+14].
Similarly to the fine-grained approach, the following new (4.) approach type was introduced.
(4.) Automatic Coarse-Grained Parallelization of Mathematical Models
Rather than exploiting the parallelism at equation level, it is also possible to consider it at component level. This new methodology was implemented in DPsim by splitting one simulation step into separate tasks, whereby every component in the power grid model declares a list of tasks that have to be processed in each simulation step. The approach is presented in the following.
6.1.1 Task Scheduling
This section deals with the scheduling of tasks, i. e. parts of a solution procedure, which can be performed by multiple threads that are spawned on a multiprocessor system by the process’ main thread. It is not about operating system schedulers for processes running on a single- or multiprocessor system [Tan09]. The term multiprocessor system refers to logical central processing units (CPUs) and therefore includes systems with a single physical CPU and multiple cores as well as systems with multiple physical CPUs and multiple cores. As the simulated models are small enough to fit into the main memory of current workstations and servers, only shared-memory parallel programming is considered. Therefore, multiple threads sharing the same memory regions can be used instead of multiple processes running in parallel on multiple interconnected processors with distributed memory.
The obstacle in case of shared-memory parallelization is that multiple threads could access the same data concurrently, which can lead to so-called race conditions [Roo99] causing wrong results. Therefore, a synchronization between the parallel running threads must be performed, and the execution order of program statements that depend on each other must be kept. For example, if a value is calculated in a program statement S1 and used in S2 as an input value, then statement S1 must be executed before S2. Statement S2 depends on S1 and, therefore, both statements cannot be executed in parallel. Dependency analyses on statements have long been subjects of research [WB87] but can also be performed on procedures or tasks: what applies to single statements equally applies to groups of statements (i. e. tasks).
The scheduling of tasks onto a set of processors is divided in [KA99] into different categories, as pictured in Fig. 6.1. As the considered tasks depend on each other, a scheduling variant from scheduling and mapping must be chosen, with the two subcategories dynamic scheduling and static scheduling. Dynamic scheduling is chosen when not enough a priori information about the tasks’ processing durations is available before their processing. Static scheduling, in contrast, can be used when enough a priori information is given to allow a mostly efficient schedule. Static scheduling can again be divided into approaches based on task interaction graphs and on task precedence graphs. Task interaction graphs can be used when loosely coupled communicating processes need to be scheduled, which can be the case on a distributed (memory) system. As this does not apply to the intended shared-memory parallelization, static scheduling based on task precedence graphs (in the following called task graphs) was chosen.
The task processing times measured in one time step can be exploited for the next steps as long as the mathematical structure of the grid model resp. the control flow within the tasks does not change too much. The control flow within a task could change, for instance, due to a switching between one simulation branch and another within a component, whereas a switching between components (e. g. by a breaker) could change the data flow resp. the dependencies between tasks and therefore require an updated task graph. In the following, some formal definitions of the used terms are introduced.
Basic Terms
At this point, a task can be considered as a sequence of program statements that are executed by a processor sequentially.
Definition 6.1 (Dependency and task graph)
Given a set T = {T1, ..., Tn} of tasks, which is the set of nodes of the corresponding task graph, an edge (Ti, Tj) ∈ E ⊆ T × T, with i, j ∈ {1, ..., n}, expresses a data dependency of Tj on Ti, requiring that Ti must be performed before Tj, also denoted as Ti ≺ Tj. The resulting directed acyclic graph (DAG) G = (T, E) is called the task graph.

Figure 6.1: Categories of parallel task scheduling (parallel program scheduling divides into job scheduling of independent tasks and scheduling and mapping of multiple interacting tasks; the latter into dynamic and static scheduling; static scheduling into approaches based on task interaction graphs and task precedence graphs)
Definition 6.2 (Task types, weight, and length)
Given a task graph G = (T = {T1, ..., Tn}, E ⊆ T × T),

• a task V ∈ T without incoming edges, i. e., for which there is no (U, V) ∈ E with U ∈ T, is called an entry task;

• a task V ∈ T without outgoing edges, i. e., for which there is no (V, W) ∈ E with W ∈ T, is called an exit task;

• the weight (i. e. execution time) of a task V ∈ T is given by w(V), with the weight function w: T → N;

• the length l_p of a path p = T_i1 ≺ ... ≺ T_ik, with k ∈ N tasks, is defined as the sum of the weights of its tasks, i. e., l_p = w(T_i1) + ... + w(T_ik).
In case of a distributed-memory system, communication costs could also be taken into account as edge weights (e. g. because of message passing between the computing nodes), but they are neglected for the intended shared-memory parallelization. An example task graph is given in Fig. 6.2, where the weight of each task is given in parentheses beside the task identifier. In the following it is shown how task graphs can be utilized to distribute the tasks among multiple processors in an optimal way regarding their total processing time.
Figure 6.2: Example task graph with tasks T1(1), T2(2), T3(1), T4(3), T5(1), and T6(1); weights in parentheses
General Scheduling Problem for Parallel Processing
A task schedule must provide the start time for each task. This can be formalized on the basis of [Ull75] as follows.
Definition 6.3 (Schedule function, optimal schedule)
Given a set of tasks T = {T1, ..., Tn} to be executed on a system with p ∈ N processors, a schedule function f: T → N0, which specifies the start time of each task, is sought, for which the following restrictions hold:

• a task Tj that depends on Ti may not start before Ti has finished, i. e., ∀Ti, Tj ∈ T: if Ti ≺ Tj, then f(Ti) + w(Ti) ≤ f(Tj);

• at each time point, at most p tasks are processed concurrently, i. e., ∀t ∈ N0: |{V ∈ T | f(V) ≤ t < f(V) + w(V)}| ≤ p.

A schedule specified by a schedule function f_opt is an optimal schedule iff. the total execution time is minimal under the restrictions above, i. e.,

max_i {f_opt(Ti) + w(Ti)} = min_f max_i {f(Ti) + w(Ti)}.
The problem of finding an optimal schedule in case of p = 2 and a single execution time tconst = w(V) for all V ∈ T can be solved deterministically in polynomial time, but for p > 2 it is generally NP-complete [Ull75]. Therefore, instead of trying to find an optimal schedule, heuristic algorithms are applied. Two classes of such heuristic schedulers are presented in the following.
Level Scheduling
A level-scheduling-based approach for equation-based parallelization of Modelica was implemented in OpenModelica. At the beginning, all entry tasks are assigned to the first level as they do not depend on each other. All tasks that depend only on tasks in the first level are assigned to the second level, and so forth.

Definition 6.4 (Level scheduling)
Given a task V ∈ T = {T1, ..., Tn} and the set of predecessors PV = {U ∈ T | U ≺ V}, the level function l : T → N0 returns the level of the task V according to the recursive definition

l(V) = 0, if PV = ∅;
l(V) = 1 + max{l(S) | S ∈ PV}, otherwise.

As the tasks within a certain level are independent of each other, they can be executed in any order or in parallel. In the simplest form, the tasks
Chapter 6 Exploiting Parallelism in Power Grid Simulation
[Figure: the task graph of Fig. 6.2 with level assignments Level 0: T1, T2, T3; Level 1: T4, T5; Level 2: T6]
Figure 6.3: Example task graph including levels
within a level are therefore distributed among the available processors without regard to their execution times. If the integer division of n by p leaves a remainder, the remaining tasks are arbitrarily distributed among the processors, so that some processors have to execute one more task than the others. Fig. 6.3 shows exemplarily how levels could be assigned to the tasks in Fig. 6.2. Derived from this level assignment, a final schedule for p = 2 processors is illustrated in Fig. 6.4.
In case of level scheduling, the synchronization (typically of threads on a shared-memory system) confines itself to barriers [Cha+08] between the executions of the levels. This leads to a simple implementation and low synchronization costs. But it could be improved by an enhanced assignment of the tasks within a level to the processors to minimize the execution time of each level. However, this corresponds to the NP-complete problem of multi-way number partitioning, where a given set of integers needs to be divided into a collection of subsets so that the sum of the numbers in each subset is as nearly equal as possible [Kor09]. A famous greedy heuristic [Cor+01] for solving this problem is to sort the numbers (here: w(Ti), with i = 1, ..., n) in decreasing order and assign each one to the subset (here: processor) with the smallest sum so far. Since the partial order ≺,
P1: T1, T2, T4, T6
P2: T3, T5
Figure 6.4: Schedule for the task graph in Fig. 6.2 with p = 2 using level scheduling
restricted to the tasks within a level, is empty (as the tasks within a level are independent), the ratio between the execution time resulting from the greedy heuristic and the optimal execution time is bounded by 4/3 − 1/(3p) [Gra69]. This can be an acceptable value in many cases, but it must be kept in mind that the division of tasks into levels is generally not optimal. With the aid of the greedy heuristic, the two smaller tasks T1 and T3 in level 0 of the example shown in Fig. 6.3 are assigned to the first processor and task T2 to the second one (see Fig. 6.5), resulting in a shorter execution time of level 0 than before (see Fig. 6.4). The total execution time of all levels therefore reduces from 7 to 6. The next implemented method is list scheduling, introduced in the following.
P1: T1, T3, T4, T6
P2: T2, T5
Figure 6.5: Schedule for the task graph in Fig. 6.2 with p = 2 using level scheduling considering execution times
List Scheduling
A comparison of list schedules for parallel processing systems is provided by [ACD74]. All of them accomplish the following steps:
1. Creation of a scheduling list (i. e. a sequence of the tasks to be scheduled) by assigning them priorities.
2. While the task graph is not empty:
a) assignment of the task with the highest priority to the next available processor and

b) removal of it from the task graph.
The difference between the algorithms lies in the determination of the tasks' priorities. Two often-used attributes for the assignment of priorities to tasks are the t-level (top level) and the b-level (bottom level). The t-level of a task V ∈ T is the length (as defined in Def. 6.2) of a longest path from an entry task to V. Analogously, the b-level of a task V is the length of a longest path from V to an exit task.
[Figure: the task graph of Fig. 6.2 with node labels T1(1, 5), T2(2, 4), T3(1, 3), T4(3, 4), T5(1, 2), T6(1, 1)]
Figure 6.6: Example task graph including b-levels, with node label format Ti(w(Ti), b(Ti))
Definition 6.5 (B-level function)
Given a task V ∈ T = {T1, ..., Tn}, the b-level function b : T → N returns the b-level of the task V according to the recursive definition

b(V) = w(V), if {W ∈ T | V ≺ W} = ∅;
b(V) = w(V) + max{b(W) | V ≺ W}, otherwise.
A critical path (CP) of a DAG is a longest path in the DAG and thus of high importance for a schedule (see [KA99], where also algorithms for t- and b-level computations are presented). In general, scheduling in descending b-level order tends to schedule tasks on a CP first, while scheduling in ascending t-level order tends to schedule tasks in topological order (for more on topological ordering see [KK04]). In [ACD74], the performance of different heuristic list scheduling algorithms is analyzed. It has been shown that the CP-based algorithms have near-optimal performance. One of these is the Highest Level First with Estimated Times (HLFET) algorithm. Another algorithm with a similar procedure, but assuming a uniform execution time w(V) = 1 for all V ∈ T, is the Highest Level First with No Estimated Times (HLFNET) algorithm. Figure 6.6 shows the example graph, extended by the b-level for each node. Applying HLFET to it results in an optimal schedule, as shown in Fig. 6.7. More on these and other scheduling algorithms can be found in [KA99].
6.1.2 Task Parallelization in DPsim
The core part of the simulation tool is the actual simulation solver for power grid simulation. One of its main steps is calculating the system matrix A by iterating through a list of power grid components, accumulating each component's contribution. The simulation at time point t can then
P1: T1, T4, T6
P2: T2, T3, T5
Figure 6.7: Schedule for the task graph in Fig. 6.2 with p = 2 using HLFET
be sketched with the following steps:
1. computing the right-hand side vector b(t) by accumulating each component's contribution (similar to the procedure composing the matrix A);
2. solving the system equation Ax(t) = b(t);
3. updating the components' states (e. g. equivalent current sources) using the solution x(t).
These are just the major tasks; others have to be performed in each step as well, such as simulating the dynamics of the mechanical parts of electromechanical components like synchronous generators, and simulation values must be exchanged between the time steps during a distributed simulation. Eventually, simulation results and logs are saved where this is needed.
A single step is split into tasks defined by a list of tasks for each component which has to be simulated. Further tasks are added for the main step of system solving and optionally also for the logging of results as well as data exchange with other processes (e. g. simulators) or HiL.
Task Dependency Analysis
For the representation of dependencies, a system of attributes is implemented. Attributes are properties of components, such as the voltage of a voltage source, which are accessed during the simulation by read or write operations. A task has two sets of attributes: one for attributes with read accesses and one for those with write accesses. If an attribute is written by a task T1 and read by a task T2, then T2 depends on T1, which is represented by a task graph as defined in Def. 6.1 for all tasks within one simulation step. The task graph for an example circuit (see Fig. 6.8) is depicted in Fig. 6.9. In PreStep, certain values necessary for the current simulation step (i. e. contributions to the right-hand-side vector) are computed depending on
the solutions of the previous simulation step. In PostStep, certain component-specific values are calculated from the system solution computed by Sim.Solve in the current simulation step. For optimization purposes, tasks that are not necessary in a certain simulation are omitted. In case of the Resistor component, e. g., a PostStep task is processed, calculating the current through it based on the voltages from the system solution (e. g. calculated in Sim.Solve). More on the task dependency analysis can be found in [Mir20].
Task Schedulers
Before the actual simulation, a scheduler analyzes the task graph in order to create a schedule for the simulation using a certain number of concurrent threads, which can be scheduled by the operating system on different parallel processors for potential execution time improvements. Several schedulers based on the presented scheduling methods (see Sect. 6.1.1) were implemented in DPsim, as given in Tab. 6.1. Each scheduler has a createSchedule method for initialization purposes based on the task graph and a step method called in the main simulation loop. The SequentialScheduler sorts the task graph in topological order to obtain a valid task schedule for sequential processing. For the actual parallel processing, different
[Figure: circuit with voltage source V1, resistor R1, and capacitor C1]
Figure 6.8: Example circuit
[Figure: task graph with V1.PreStep and C1.PreStep feeding Sim.Solve, which feeds V1.PostStep, R1.PostStep, C1.PostStep, and Sim.Log]
Figure 6.9: Task graph resulting from Fig. 6.8
Table 6.1: Overview of the implemented schedulers

Scheduler class name | Short name   | Paradigm    | Algorithm
SequentialScheduler  | sequential   | -           | Topological sort
OpenMPLevelScheduler | omp_level    | OpenMP      | Level scheduling
ThreadLevelScheduler | thread_level | std::thread | Level scheduling
ThreadListScheduler  | thread_list  | std::thread | HLF(N)ET
Application Programming Interfaces (APIs) are used: OpenMP [Ope19b], providing a simple interface for the (incremental) development of parallel applications, and the std::thread class from the system's C++ Standard Library [Jos12].
The OpenMPLevelScheduler has the simplest implementation as it utilizes the OpenMP API. Its step function (see List. A.1) forks a given number of concurrent threads (through a parallel section) in which a loop over the levels is processed by each thread sequentially (i. e. each thread processes each level). Within this level loop, a parallel loop over the tasks within a level is executed with an OpenMP schedule(static) clause, causing a nearly equal distribution of the tasks among the threads. As a parallel for-loop in OpenMP has an implicit barrier per default, the concurrent threads process the levels synchronously. An advantage of OpenMP is that there are many implementations for different computer platforms, but there can be significant differences in computing performance [Mül03]. Also, the simple OpenMP pragmas allow an incremental development but also prevent influence over some implementation details.
The ThreadScheduler was implemented based on the std::thread class from the C++ standard library [Wil19], implementing the step function to have more control over the synchronization between the threads. In every time step, each thread executes its list of assigned tasks successively, synchronized by atomic counters supporting two operations: an atomic increment of the counter's value, and waiting until it reaches a given value, which is implemented in the form of busy waiting [Tan09]. The counter of each task is incremented after its processing. Before a task is executed, the atomic wait method is called on the counters of all tasks with an edge (in the task graph) to that task. The actual distribution of the tasks among the threads is accomplished by the two subclasses of the ThreadScheduler.
The ThreadLevelScheduler, like the OpenMPLevelScheduler, realizes level scheduling but with a different behavior. In case of the OpenMP-based scheduler, there are barriers for all threads at the end of each level, causing also
threads without tasks within a certain level to wait before the execution of (independent) tasks of the next level. Such unnecessary barriers are not conducted by the ThreadLevelScheduler. Moreover, it can make use of execution times measured during a previous execution by applying the greedy heuristic for multi-way partitioning to keep the subsequent execution time per level between the threads mostly uniform (see Sect. 6.1.1).
The ThreadListScheduler, which also derives from ThreadScheduler, implements the list scheduling algorithm based on HLFET in case execution times are provided, and on HLFNET if not (i. e. the execution time per task is assumed to be uniform).
Component-Based Modified Nodal Analysis
The system to be simulated is passed as a list of component objects to an MNA solver, implemented with the MNASolver class. All components that can be simulated using the MNA approach have the following in common:
• their internal state is initialized depending on the system frequency and time step;

• their presence may change the system matrix;

• they specify tasks such as PreStep and PostStep which have to be processed at each time step.
At simulation start, each component is initialized, its contribution is accumulated into the system matrix, and the decomposition is calculated. More details on the MNA implementation itself can be found in [Mir20].
A Simulation class constructs the task graph from the given list of tasks, including tasks for logging as well as interfacing if needed. During the simulation, the scheduler's step method (for proceeding in time) is called, which executes all tasks in a correct order (i. e. avoiding race conditions). Because of the separation between scheduler and solver in the implementation, the implemented framework for parallel processing is not MNA-solver specific but can be adapted to any solver structure, which, however, must be divisible into tasks.
6.1.3 System Decoupling
Solving a linear system of size n requires O(n³) operations, which leads to long execution times in case of large matrices. Even if the system matrix stays fixed between the simulation steps, so that an LU decomposition of it could be reused for solving the system, the
forward/backward substitutions would require O(n²) operations at each time step. Because of the requirements on the time step in real-time simulation (dependent on the simulation model/method and use case), this would limit the size of the system model. A possible approach is to split the system into smaller matrices that can be solved independently and to compose the solution of the whole system from all partial solutions. In case the LU decomposition can be reused, the potential speedup of solving k systems of size n/k over solving one system of size n is n² / (k · (n/k)²) = k. As the smaller matrices are independent, they can be solved concurrently, which results in a higher performance. Therefore, two methods for increasing the performance gain from the presented parallelization by splitting the system matrix into smaller parts were implemented.
Decoupled Transmission Line Model
The application of the TLM (in literature also called the decoupled transmission line model), which belongs to the explicit parallelization approaches using computational components (see Sect. 6.1), can split a grid into two subgrids which are not topologically connected. This allows the creation of two separate system matrices that can be solved concurrently during each time step. DPsim automatically recognizes such cases, solves the systems separately, and simulates the line behavior of the equivalent components connecting the two subnetworks.
Diakoptics
Diakoptics is another method which allows the user to divide a grid into subgrids. The resulting subgrids can also be computed concurrently, and their results can be combined into the whole solution. More on the implementation of TLM and diakoptics in DPsim can be found in [Mir20].
6.2 Analysis of Task Parallelization in DPsim
In the following, the performance benefits of the previously introduced parallelization methods are analyzed on models without and with system decoupling. For that purpose, the average wall clock time needed for a single simulation step is used as the metric in all analyses. It was chosen because of its importance for soft real-time simulation, where the elapsed times of all time steps must stay below a specified average. At first, the execution times for the different schedulers are analyzed for several system model sizes. Afterwards, the effect of the parallelization on the system
decoupling methods is investigated. Finally, the parallel performance is compared when DPsim is built with various popular compiler environments.
Measurement Environment
All measurements in this section were performed on a server with 2 sockets, each with an Intel Xeon Silver 4114 2.2 GHz (3.0 GHz Turbo) 10-core CPU with Hyper-Threading (HT); 160 GB DDR4 main memory; running an x86_64 Ubuntu 16.04 Server Linux with gcc v8.1.0 as the default compiler environment for DPsim.
6.2.1 Use Cases

The Western System Coordinating Council (WSCC) 9-bus transmission benchmark network was used as the reference network. It consists of three generators, each connected to a power transformer, and three loads connected to the generators by six lines in a ring topology. The whole network (i. e. system model), as depicted in Fig. 6.10, was provided in the form of a CIM-based file. Its components were modeled in the following way:
• synchronous generators, represented with the aid of an inductance and an ideal voltage source whose value was updated in each step based on a model for transient stability studies;

• power transformers, modeled as ideal transformers with an additional resistance and inductance on the primary side to model in particular the electrical impact of the windings and the related power losses;

• transmission lines, represented by PI models with additional small so-called snubber conductances to ground at both ends;

• loads, modeled as having a constant impedance and inductive behavior, thus represented by a resistance and inductance in parallel.
More on the component models can be found in [Mir20].

For an analysis of various system model sizes, multiple replications of the WSCC 9-bus system were combined in an automated way. For this purpose, further transmission lines were added between nodes connected to loads (labeled in Fig. 6.10 as BUS5, BUS6 and BUS8) to form further rings between components of the system copies. The resulting topologies for two and three system copies are illustrated in Fig. 6.11, where different node colors signify different copies of the original 9-bus system and newly added transmission lines are represented by solid lines. Only the relevant buses are shown; the omitted parts are sketched as dashed lines.
[Figure: single-line diagram with buses BUS1–BUS9, generators GEN1–GEN3 (DP::Ph1::SynchronGeneratorTrStab), transformers TR14, TR27, TR39 (DP::Ph1::Transformer), lines LINE54, LINE64, LINE75, LINE78, LINE89, LINE96 (DP::Ph1::PiLine), and loads LOAD5, LOAD6, LOAD8 (DP::Ph1::RXLoad)]
Figure 6.10: WSCC 9-bus transmission benchmark network
6.2.2 Schedulers
In the first part of the scheduler analysis, the different schedulers were compared on various benchmark networks of different sizes. In Fig. 6.12, the average wall clock times per step for simulating the 9-bus system are plotted for each implemented scheduler, depending on the number of threads from one to ten (due to the 10-core server). The simulation had a time step of 100 µs and a duration of 100 ms, and the average execution time for a single time step was calculated from the execution times of 50 simulations. The scheduler names in the plot's legend are as defined in Tab. 6.1, whereby the adjunct meas indicates that the measured average task execution times were passed to the scheduler.
Compared to the sequential scheduler (dashed line), the parallel processing as scheduled by all methods is slower than sequential processing. All schedulers, except the OpenMP-based one with an additional overhead,
[Figure: buses BUS5, BUS6, BUS8 of the original system interconnected with the corresponding buses BUS5', BUS6', BUS8' (and BUS5'', BUS6'', BUS8'') of the copies; (1) two system copies, (2) three system copies]
Figure 6.11: Schematic representation of the connections between system copies
[Figure: wall clock time per step [s] over number of threads (1–10) for sequential, omp_level, thread_level, thread_level meas, thread_list, and thread_list meas]
Figure 6.12: Performance comparison of schedulers for the WSCC 9-bus system
lead to similar execution times, which increase with the number of threads.
Therefore, the same benchmark was performed on a network with 20 interlinked copies of the 9-bus system. For this larger system, the parallel processing for all schedulers performed better than the sequential one, as depicted in Fig. 6.13. Here, the OpenMP-based level scheduler
[Figure: wall clock time per step [s] over number of threads (1–10) for all schedulers]
Figure 6.13: Performance comparison of schedulers for 20 copies of the WSCC 9-bus system
implementation led to the highest speedup of ∼1.27 in relation to sequential processing, but again there are only slight differences between the schedulers.
At the end of the scheduler analysis, the number of threads was fixed at eight (i. e. a few less than the number of cores, to reduce context switching caused by other system threads on the same CPU), whereas the system size was varied up to forty 9-bus copies. The resulting average execution times for a single time step, plotted in Fig. 6.14, were calculated from 10 measurements because of the rising overall simulation times for larger systems. From fifteen 9-bus copies on, the parallel processing shows a performance improvement over sequential processing, and again there is no relevant difference between
[Figure: wall clock time per step [s] over number of system copies (0–40) for all schedulers with 8 threads]
Figure 6.14: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system
the parallel schedulers. Furthermore, the required simulation time growsquadratically with the system size.
As usual, it can be seen that a system must have a certain size for parallel processing to pay off, as the synchronization between multiple threads, realized by OpenMP barriers or the counter mechanism of the other schedulers, requires too much time compared to the actual simulation computations. In the dependency graph (see Fig. 6.15), where the area of a circle (representing a task) is proportional to its execution time, it can be seen that most time is spent in one single task which solves the system equation. As the system to be solved grows quadratically with the number of nodes, the parallelization speedup is limited by this SolveTask. This is also the reason for the small differences between the various schedulers, as there is only a small number of meaningfully different schedules. The reduction of one big SolveTask to multiple smaller subtasks was therefore the main motivation for the system decoupling discussed in the following.
6.2.3 System Decoupling
In this analysis, the impact of the parallelization methods on decoupled systems was examined. For this, the 9-bus system copies were connected as described before. Then, the TLM was applied to the added transmission lines. In a second case, the added transmission lines were used as so-called splitting branches for the diakoptics method [Mir20]. Again, the simulation had a time step of 100 µs and a duration of 100 ms, and the average execution time for a single time step was calculated from the execution times of 10 simulations.
At first, the parallel performance for an increasing number of systems using the TLM is depicted in Fig. 6.16, exemplarily for the OpenMPLevelScheduler and the ThreadLevelScheduler (without any information about the execution times of the tasks in a previous step), depending on the number of deployed threads. The parallel processing leads to much lower execution times for both schedulers and scales up to 8 threads on the utilized 10-core system, although the execution times of sequential processing are already much lower than without TLM. The maximum speedups achieved with 8 as well as 10 threads in relation to sequential execution are around two orders of magnitude.
The TLM performance of all schedulers was measured with 8 threads and is shown in Fig. 6.17. There, the average execution time per step is nearly the same for all schedulers. It does not grow linearly with the number of system copies (as the solving effort for the decoupled subsystems grows quadratically), and the plots show sharp increases at some points, which could stem from the system size no longer fitting into a cache of a certain level, leading to higher latencies when accessing the cache of the next level or the main memory.
[Figure: task graph with circle areas proportional to task execution times; one large solve task dominates]
Figure 6.15: Task graph for simulation of the WSCC 9-bus system

Similar measurements were performed using diakoptics instead of TLM, as depicted in Fig. 6.18. Again, the parallel processing scheduled by the OpenMPLevelScheduler and the ThreadLevelScheduler shows a higher performance compared to sequential processing, with maximum speedups of around one order of magnitude. Unfortunately, the speedup from two to more threads is very limited.
The diakoptics performance of all schedulers was measured with 8 threads and is shown in Fig. 6.19. Here as well, the parallel processing based on all schedulers leads to very similar execution times, but without the regular sharp increases observed for parallel processing on decoupled systems using TLM.
[Figure: wall clock time per step [s] over number of system copies (0–40) for sequential processing and 2, 4, 8, 10 threads; (1) OpenMPLevelScheduler, (2) ThreadLevelScheduler]
Figure 6.16: Performance for a varying number of copies of the WSCC 9-bus system using the decoupled line model
[Figure: wall clock time per step [s] (log scale) over number of system copies (0–40) for all schedulers]
Figure 6.17: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using the decoupled line model with 8 threads
[Figure: wall clock time per step [s] over number of system copies (0–40) for sequential processing and 2, 4, 8, 10 threads; (1) OpenMPLevelScheduler, (2) ThreadLevelScheduler]
Figure 6.18: Performance for a varying number of copies of the WSCC 9-bus system using diakoptics
[Figure: wall clock time per step [s] (log scale) over number of system copies (0–40) for all schedulers]
Figure 6.19: Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using diakoptics with 8 threads
6.2.4 Compiler Environments
The performance of the parallelization does not only depend on the scheduling methods but also on the parallelization paradigms (OpenMP and C++11 threads) of the used compiler environments and their optimizations. Table 6.2 lists three compiler environments that are nowadays often used in the scientific area, together with the applied optimization level for comparable results (i. e. programs). The simulation was repeated with all three compilers, with a time step of 100 µs and a duration of 100 ms. The average execution time for a single time step was calculated from the execution times of 50 simulations, as presented in Fig. 6.20. The gcc and
[Figure: wall clock time per step [s] over number of threads (1–10) for gcc, clang, and icc, each with sequential, omp_level, and thread_level scheduling]
Figure 6.20: Performance comparison of compilers for 20 copies of the WSCC 9-bus system
Table 6.2: Overview of the tested compilers

Compiler                      | Version    | Flags             | Reference
GNU Compiler Collection (gcc) | 8.1.0      | -O3 -march=native | [GCC]
Clang (clang)                 | 7.0.1      | -O3 -march=native | [Cla]
Intel C++ Compiler (icc)      | 19.0.1.144 | -O3 -xHost        | [Int]
icc compilers lead to a comparable performance for all schedulers, which, in case of the small system to be simulated, is lower with parallelization than with sequential execution. The executable compiled with clang, however, has the lowest performance.
Therefore, simulations with the same parameters as before were performed on a system model consisting of twenty interlinked 9-bus copies. The plots for all compilers are qualitatively similar to the ones in Fig. 6.13. For all compilers, the parallel processing on this larger system model achieves lower execution times than sequential processing. Here again, gcc yields the highest performance.
6.3 Conclusion and Outlook
This chapter provides an overview of approaches exploiting parallelism in mathematical models. In addition to the three approach types known in the simulation area [Lun+09], it introduces a new one, the Automatic Fine-Grained Parallelization of Mathematical Models, which was implemented in DPsim, an existing power grid simulation software. After a presentation and formal definition of the scheduling methods used for the implementation, the parallelization methods belonging to the new approach type, i. a. the task dependency analysis, as well as two system decoupling methods (TLM and diakoptics) are sketched.
The subsequent analysis of the task parallelization methods implemented in DPsim for shared-memory systems has shown sublinear speedups for small system models, with execution times per simulation step increasing with the number of used CPUs. However, in case of larger system models (with more than 100 nodes) in combination with TLM, superlinear speedups have been achieved. Unfortunately, TLM imposes some restrictions on the simulation time steps as well as on the types of transmission lines to which it can be applied, and it also potentially introduces inaccuracies at higher frequencies. The utilization of diakoptics, which does not introduce such disadvantages, also leads to parallel speedups when applying the implemented parallelization methods.
On the 9-bus system model, the various scheduling algorithms showed almost no performance differences in many cases. Moreover, the existing differences are not caused by the different scheduling concepts but instead by the particular parallelization paradigm and the compiler environment. The reason is the general structure of the task dependency graphs, which leaves only little flexibility for the algorithms to generate strongly differing schedules. However, as the task dependency graphs depend on the system models, a comprehensive analysis with different models could yield a variety of execution times depending on the parallel scheduling method.
The implemented task dependency analysis is general enough to introduce a finer-grained inner task parallelization. For instance, GPUs on a heterogeneous architecture could be utilized as accelerators (e. g. for computations of complex component models) by porting task-related CPU code to GPU kernel code. The schedulers could then deploy tasks among CPUs and GPUs. The utilization of further parallel programming paradigms for distributed-memory architectures, such as the Message Passing Interface (MPI), could be considered, but only for very large system models because of the higher latencies usually introduced by interactions (i. e. memory accesses and synchronizations) between different computer nodes.
Furthermore, optimization of the processing within tasks was begun by using explicit single instruction multiple data (SIMD) vectorization, in which vector instructions (such as the Advanced Vector Extensions (AVX)) of modern CPUs are utilized. With these, a higher performance can be achieved if the same CPU instructions are performed on vectors instead of scalars. Modern compilers already perform automatic vectorization, but only for parts of the code whose correctness they can assure by static code analysis, and they recognize only certain control flow patterns. For more complex computations, explicit vectorization can be enabled by the programmer, e. g. using OpenMP SIMD compiler directives or SIMD compiler intrinsics.
7 HPC Python Internals and Benefits
In the past decade, Python has developed into one of the most popular programming languages. In many rankings of the most widely used programming languages it occupies one of the first three positions [Cas19]. Especially in the engineering sector it enjoys a steadily growing popularity, as it is said to be easy to learn because of its clear syntax and a relatively small set of keywords. Furthermore, several open-source Python implementations with a comprehensive standard library are available for free. Python supports, e. g., object-oriented as well as functional programming, and the very portable Python implementations allow its use on many platforms. As Python programs are usually interpreted, they do not need to be compiled, which is why Python is often used as a scripting language for small tasks.
Besides the duration and simplicity of software development, the time efficiency of a programming language is also crucial, especially in scientific computing. Python's simple syntax and automatic memory management lead to short development times in comparison to other programming languages. However, the execution times of interpreted programming languages are usually considerably higher than those of compiled languages.
Therefore, various language extensions, optimized interpreters, and compilers have been developed to increase the time and memory efficiency of Python programs. Important representatives are the Python package NumPy [VCV11; Numb], the just-in-time (JIT) compilers numba [LPS15; Numa] and PyPy, as well as the language extension Cython [Beh+11; Cyt]. But if an engineer, for instance, developed a software project in Python with all
needed features but insufficient performance, the question arises which of the mentioned solutions should be chosen for which kind of algorithm. Around these efforts, a scientific community has grown in the past years, with conferences on Python for High-Performance and Scientific Computing [Ger]. However, no systematic comparative analysis of the methods improving Python's runtime performance has been accomplished so far.
In a blog post [Pug16], an execution time comparison between Python 3, C, and Julia, based just on an LU decomposition, was shown in combination with the compilers Cython and numba as well as the modules NumPy and SciPy [Bre12; Scid], the latter containing numerical algorithms based on NumPy. The result of this benchmark was that the execution of conventional Python was one order of magnitude slower than C and Julia. With the applied improvements, however, the performance of the Python solution was similar to C and Julia. The SciPy-based implementation was even more performant than the ones in C and Julia when using precompiled functions of the SciPy and NumPy modules. Except for the conventional Python 3 solution, each implementation was optimized for vector CPU instructions.
In [Rog16], a benchmark of Python runtime environments was presented. The comparison covered the conventionally used C reference implementation CPython [Pytb] and the Java implementation Jython of the Python interpreter on the one hand, as well as PyPy and Cython as compilers on the other hand. The results are interesting, as Jython achieves a higher performance than CPython, while Cython is only as fast as CPython. The latter is the case because the Cython version was not adapted to make use of Cython's features, which will be introduced later in this chapter. Furthermore, only Python 2 was used for this benchmark, which is now deprecated, and Python 3 is not backward compatible with Python 2.
The available benchmarks focus on the execution time only. For a holistic view of the solutions, the memory consumption must be taken into account as well, which has not been considered in the previous analyses.
Therefore, this chapter presents a comparative analysis of the currently most popular performance improvement solutions for Python programs on different kinds of standard algorithms from the area of numerical methods and operations on common abstract data types (ADTs). These algorithm implementations based on the various Python solutions are compared with reference implementations in C++, which is considered a time- and memory-efficient object-oriented programming language. The comparative analysis presented here compares not only the execution times of the programs but also their memory consumption. It shall provide Python programmers with an overview of current solutions to improve the performance of their Python programs. Moreover, it shall provide them information on
how much effort is required for the application of a certain solution on the one hand and which gain can be expected on the other.
The chapter gives an introduction to the HPC-relevant properties of the Python language and its reference implementation CPython. A short introduction of the aforementioned Python runtime environments follows, with a focus on their different approaches. Hereafter, the benchmarking methodology based on representative algorithms is presented. The algorithms are used for the comparative analysis of the execution times and memory consumption of the various Python environments, presented in the following section. Finally, a conclusion on the comparative analysis is given with an outlook on future work. This chapter presents outcomes of the supervised thesis [Kas17].
7.1 HPC Python Fundamentals
Before the available Python environments are presented, a short overview of the HPC-relevant peculiarities of Python is given. Usually, high-level languages (HLLs) like Python are structured in a way that humans are able to read and maintain them easily, reuse certain parts of the program, and so forth. Hence, before such programs can be executed on a central processing unit (CPU), the source code must be transformed into a sequence of instructions of the actual CPU. This can be accomplished, for instance, with an interpreter or a compiler.
Interpreter
An interpreter processes the source code at the run time of a program. It reads the program's source code, analyzes or even preprocesses it, and executes the statements by translating them successively into instructions of the target CPU. In the case of Python programs interpreted by the CPython environment, the preprocessing consists of a transformation of the Python code into an intermediate format, the bytecode (stored in .pyc files), for a virtual machine [Ben]. The Python interpreter is an implementation of that virtual machine.
The successive execution of source code by an interpreter makes the programming language usable for scripting and usually allows a better error diagnosis [Aho03]. However, this has the disadvantage that interpreted programs tend to execute more slowly than compiled ones.
Compiler
A compiler for HLLs usually translates the whole relevant source code into executable machine code (i. e. instructions of the target CPU). It can also generate intermediate code from the source code, but the main difference of this approach, in contrast to an interpreter, is that after the compilation process the program is available in a form that can be executed on the CPU directly. The direct execution of the program instructions on the CPU leads to a high execution speed, but disadvantages are, e. g., that the machine code is CPU-architecture dependent and must be compiled again for different computer platforms. The same applies in case of source code changes. Such compilers are therefore also called ahead-of-time (AOT) compilers.
Just-in-Time Compiler
In contrast to AOT compilers, JIT compilers translate the source code mostly during run time. Only those parts of the program that need to be executed are compiled. JIT compilers can increase the execution speed of interpreted programs when the execution of the compiled part of the source code is so much faster than its interpretation that the compilation process of the JIT compiler does not have a negative effect on the overall execution time. Parts compiled once do not need to be compiled again in case of multiple executions, such as in loops.
Tracing Just-in-Time Compiler
A tracing just-in-time (TJIT) compiler makes use of the assumption thatmost of a typical program’s run time is spent in loops [Bol+09]. Therefore,a TJIT compiler tries to identify often executed paths within loops. Theinstruction sequences of such execution paths are called traces. After theiridentification, the traces are usually optimized and translated to machinecode.
7.1.1 Classical Python
Python is continuously being developed further. Currently, Python is available in version 3, which has many new features breaking backward compatibility with version 2 [Tad; Rosb]. The Python version numbers refer to the major version numbers of the reference Python interpreter implementation CPython [Pytb]. After around 20 years of development, Python 2 has been retired; the last CPython 2.7.18 was released in April 2020 [Pytd].
Nevertheless, Python 2 was considered in this dissertation since there isstill much Python 2 code that has not been ported to Python 3.
Data Types
In Python, variables are not declared and can be used without a data type definition. Everything in Python is an object associated with a certain data type [Mon]. A Python variable can reference different objects of different types, and the type of an object is determined dynamically at run time with the aid of its attributes and methods, which is called Duck Typing [FM09].
There are so-called mutable and immutable objects. Objects of, e. g., the types int, float, bool, and tuple are immutable. An instance of an immutable data type has a constant value which cannot be changed. Multiple variables with the same value do not necessarily reference multiple instances; instead, the same instance can be referenced by all these variables. In contrast, a mutable instance can change its value during run time, which is why a new mutable object is created in memory each time one is requested. Mutable objects are, e. g., of the types list, dict, and set [Cara].
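The distinction can be observed with the built-in id-based identity operator is. Note that the sharing of immutable instances shown below relies on CPython's caching of small integers, which is an implementation detail rather than a language guarantee:

```python
# CPython caches small integers, so equal immutable values may share
# one instance (an implementation detail, not a language guarantee)
a = 7
b = 7
assert a is b

# lists are mutable: two equal lists are nevertheless distinct objects
xs = [1, 2]
ys = [1, 2]
assert xs == ys
assert xs is not ys

# a mutation is visible through every name referencing the same object
zs = xs
zs.append(3)
print(xs)  # [1, 2, 3]
```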
Python 3 distinguishes between several types for numbers. Integers are of the type int and have arbitrary precision. In Python 2, an int represents an integer value with 64 bit, and the type long corresponds to the int of Python 3.
An instance of type list is a sequence of objects that can have anarbitrary type. The content of the list can be changed during run timeand the objects can be mutable or immutable. Unlike a list, the contentof a tuple cannot be modified during run time.
An object of type dict is an associative data field which consists ofkey-value pairs. The keys, which can only be of an immutable type, referto objects of an arbitrary type.
Parameter Passing
The two most common evaluation strategies for parameters during a function call in HLLs are call by value and call by reference. In the first case, the value of the given expression (passed to the function) is assigned to the function's parameter. In the second case, the object that is referenced by the given expression is also referenced by the function's parameter within the function. The latter means that the object's value changes in the calling part of the code when it is changed within the called function.
Python, however, uses a mechanism referred to as call by object (reference) [Kle]. If a variable x in main is passed to a function as parameter y, then x and y refer to the same object. This behavior corresponds to call by reference. If, subsequently, another object is assigned to y within the function, y refers to the new object and x in main stays untouched, which corresponds to call by value behavior.
Side Effects
The call by object reference principle can cause side effects. If a mutable object, e. g. of the type list, referenced in main by the variable l, is passed to a function with a parameter m, all modifications to m within the function also apply to the list in main. To avoid this, a copy of the list can be passed with the aid of the slicing operation, by writing l[:] instead of l as the argument in the function call.
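A minimal sketch of this side effect and of the slicing workaround (the function name append_four is chosen for illustration only):

```python
def append_four(m):
    # m references the caller's list object, so this append mutates it
    m.append(4)

l = [1, 2, 3]
append_four(l)      # side effect: l is modified in the caller
append_four(l[:])   # a slice copy is passed, so l stays unchanged
print(l)  # [1, 2, 3, 4]
```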
NumPy Module
NumPy, standing for Numerical Python, is a package for scientific computing with Python [Numb]. It contains an N-dimensional array object implementation (called ndarray), functions which can work on such arrays, tools for integrating C/C++ and Fortran code, and linear algebra, Fourier transform, as well as random number capabilities. The ndarray can be used for numerical computations instead of the normal Python list. All elements of an ndarray must be of the same data type, as the NumPy package is implemented in C and can therefore benefit from static typing at compile time for higher run time efficiency. Possible data types are, e. g., bool_, int_, float_, and complex_ for the equivalent C types, as shown in [SWD15]. A one-dimensional ndarray with n 64-bit floating-point numbers containing zeros can be created as follows:
numpy.zeros(n, float)
An ndarray provides a Python object around a functionally extendedC-array. The following Python code shows a matrix-matrix multiplication:
for i in range(n):
    for j in range(n):
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]
A usage of ndarrays for the matrices would lead to an additional overheadin the innermost loop. The overhead would occur at the border betweenthe pure Python code around the +=-statement (i. e. the three loops) andthe NumPy code executed during the evaluation of the statement. In the
case of a 10 × 10 matrix, the border within the three loops would be passed 10³ = 1000 times. That could make the program execution slower than with the normal Python list, which is why NumPy functions should only be called for sufficiently long processing on the provided data.
Instead of applying pure Python operations over the entries of an ndarrayit is recommended to apply operations over the whole ndarray in C code.For this purpose, NumPy provides precompiled functions implementedin C as the following one that can be used for the multiplication of twomatrices:
numpy.dot(A, B)
With this function, the border between Python and NumPy code is passed just once. Moreover, the precompiled functions of NumPy for linear algebra make use of BLAS [Uni17] and LAPACK [Uni19]. In comparison to Python lists, the ndarray generates less overhead with regard to execution time and memory usage [Coh], as it consists of contiguous memory blocks (at least in virtual memory), whereas a Python list consists of pointers to memory blocks which can be randomly distributed in memory, which is unfavorable for CPU caches, as depicted in Fig. 7.1.
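The two variants above can be contrasted in a short sketch: the triple loop crosses the Python/NumPy border n³ times, while the precompiled numpy.dot crosses it once and delegates to BLAS. Both compute the same product (the matrix size n = 10 is arbitrary):

```python
import numpy as np

n = 10
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

# pure-Python triple loop: crosses the Python/NumPy border n**3 times
C = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]

# precompiled BLAS-backed routine: crosses the border only once
D = np.dot(A, B)
assert np.allclose(C, D)
```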
Array Module
Python’s array module defines an object type which can compactly repre-sent an array of basic values of one C data type such as, e. g., char, int,float, double, etc. Hence, the module is also implemented using C butis not as powerful as ndarray because only one-dimensional arrays can
Figure 7.1: NumPy ndarray vs. Python list [Van]
be defined and there are no precompiled functions. More on this can befound in [Pyta].
Memory Management in CPython
The reference implementation CPython comes with an automatic memory management based on so-called reference counting and a garbage collector (GC) [Dig]. Each Python object has a reference counter which is increased when the object is referenced once more and decreased when a reference is dissolved. If the reference count equals zero, the memory allocated for the object can be freed. However, the reference counting of CPython cannot detect reference cycles, which can occur, for instance, when one or more objects reference each other [Glo]. Therefore, CPython has a generational cyclic GC that runs periodically, determining reference cycles in order to free the memory occupied by objects which reference just themselves. As the garbage collection interrupts the execution of the Python program, there are certain thresholds that can be adjusted. More on that can be read in [Debb].
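Both mechanisms can be observed from Python itself, as a small sketch: sys.getrefcount exposes the reference counter, and gc.collect triggers the cyclic collector for a self-referencing list that plain reference counting could never free:

```python
import gc
import sys

x = []
# getrefcount reports at least 2 references: the name x and the
# temporary reference created by passing x as an argument
n_refs = sys.getrefcount(x)
print(n_refs)

# a reference cycle: the list references itself, so its counter
# can never drop to zero by reference counting alone
a = []
a.append(a)
del a

# the generational cyclic GC detects and frees such cycles
unreachable = gc.collect()
print(unreachable)
```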
Architecture of the CPython Environment
The software architecture of the CPython environment is depicted in Fig. 7.2. Before CPython can be used, it must be compiled from the CPython source code by a proper C compiler. The resulting python program can then be applied to the Python code to be executed, which is translated to bytecode and interpreted by the bytecode interpreter into CPU instructions of the target CPU.
The bytecode interpreter is implemented in the form of a stack-based virtual machine (VM) [Ben]. For the Python function
def add(a, b):
    z = a + b
    return z
the following sequence of bytecode instructions is executed by the VM. First, the two operands a and b are pushed onto the stack by LOAD_FAST instructions. Then the BINARY_ADD instruction pops the two operands from the stack, performs the addition, and pushes the result onto the stack. A STORE_FAST instruction stores the result in z, which is then pushed onto the stack again by a further LOAD_FAST, to be returned by a RETURN_VALUE instruction.
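This bytecode sequence can be inspected directly with the standard-library dis module; note that the exact opcode names depend on the CPython version (e.g. BINARY_ADD was replaced by the generic BINARY_OP in CPython 3.11):

```python
import dis

def add(a, b):
    z = a + b
    return z

# disassemble the function to see the bytecode the VM executes:
# LOAD_FAST, an addition opcode, STORE_FAST, LOAD_FAST, RETURN_VALUE
dis.dis(add)
```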
The data types of the objects (here: a, b, and z) are determined at the execution of each bytecode instruction. Therefore, a BINARY_ADD can be performed on two integer values during one call as well as on two lists
during another call. This makes the interpretation process very flexible but also much more time-consuming than the direct execution of machine instructions from a compiled program. For example, the call of BINARY_ADD on two integer values consists of the following steps:
1. Determine data type of a
2. a is an int: get value of a
3. Determine data type of b
4. b is an int: get value of b
5. Call C function int binary_add(int, int) on values of a and b
6. Result of type int will be stored in result
Parallel Processing in CPython
CPython allows multithreading with the aid of the threading module [Pyte], which is based on POSIX threads on a Portable Operating System Interface (POSIX) conformant operating system [IEE18] and maps the Python threads to native threads of the operating system. However, because of a global interpreter lock (GIL), the Python threads within one
Figure 7.2: Software architecture of CPython (python command)
CPython interpreter are not really running concurrently. One reason for the GIL is the automatic memory management by reference counters, as explained above. Without the GIL, multiple threads using the same Python object could increment and decrement its reference counter concurrently. This could lead to a race condition on the reference counter, resulting in a wrong value. Besides the memory management, global variables as well as mutable objects also cause issues for thread-safe program execution: if a thread modifies a global variable, another thread could use an old value; the same applies to mutable objects.
Therefore, a Python thread in CPython must hold the GIL to be ableto execute bytecode instructions. How the GIL is assigned to the threadsdepends on the CPython version. If multiple threads are created, one getsthe GIL and the others wait (blocking) on it.
In Python 2, a check is implemented which counts the ticks (bytecode instructions) since the creation of a new thread [Bea]. After 100 ticks, the active thread yields the GIL and all inactive threads get a signal for requesting the GIL. One of them gets it and continues with the execution of its bytecode while the other threads wait on the GIL.
In Python 3, each thread gets a time interval of 5 ms instead of ticks [Gir]. After each interval, the GIL is yielded and assigned to the next thread in a row. This avoids competition between the threads, leading to fairness in task scheduling.
CPython also provides the multiprocessing module [Pytc], with which child processes can be created within a Python process. Each child has its own process memory that is independent of the other processes. Hence, memory management, global variables, and so forth are no issue for concurrently running processes belonging to the same process tree. The communication between such processes can be performed with the aid of a Manager object.
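A minimal sketch of such Manager-based communication: a child process appends to a manager-backed list, and the change becomes visible in the parent despite the separate process memories (the names worker and run_demo are chosen for illustration only):

```python
from multiprocessing import Manager, Process

def worker(shared):
    # the proxy forwards the append to the manager's server process
    shared.append(42)

def run_demo():
    with Manager() as manager:
        shared = manager.list()  # list living in the manager process
        p = Process(target=worker, args=(shared,))
        p.start()
        p.join()
        return list(shared)

if __name__ == "__main__":
    print(run_demo())  # [42]
```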
All previously presented Python peculiarities are important to under-stand what the Python environments other than CPython do differentlyto achieve a higher run time performance. These Python environmentswill be presented in the following.
7.1.2 PyPy

Contrary to CPython, PyPy's Python interpreter, implementing the full Python language, is written in Restricted Python (RPython) rather than in C. RPython is a restricted subset of Python and therefore suitable for static analysis. For instance, variables should contain values of at most one type at each control flow point [Min]. The PyPy interpreter was
written in RPython as the language was designed for the development of implementations of dynamic languages. RPython code can be compiled by the RPython translation toolchain [PyPe], as is done for the PyPy interpreter. Due to the separation between the language specification of the dynamic language to be implemented and the implementation aspects, the RPython toolchain can automatically generate a JIT compiler for the dynamic language. As a subset of Python, RPython can also be interpreted by an arbitrary Python interpreter [Min].
Architecture of the PyPy Environment
The software architecture of the PyPy environment is depicted in Fig. 7.3. Here, the program which runs the Python code to be executed is called pypy and must be compiled from the PyPy source code with the RPython toolchain. Similar to CPython, the Python program is first compiled to bytecode, which is also processed by a stack-based virtual machine [PyPa]. The important difference between the CPython and PyPy interpreters is that the latter delegates all actual manipulations of the user's Python objects to a so-called object space, which is some kind of library of built-in types [PyPc]. Hence, PyPy's interpreter treats the Python objects as black boxes.
Figure 7.3: Software architecture of PyPy (pypy command)
The BINARY_ADD in PyPy is implemented as follows [BW12]:

def BINARY_ADD(space, frame):
    object1 = frame.pop()                 # pop left operand off stack
    object2 = frame.pop()                 # pop right operand off stack
    result = space.add(object1, object2)  # perform operation
    frame.push(result)                    # record result on stack
The interpreter pops the two operand objects from the stack and passes them to the add method of the object space. In contrast to CPython, the PyPy interpreter does not determine the types of the objects, which is why it does not need to be adapted when new data types are to be supported.
The TJIT compiler, automatically generated by the RPython toolchain, uses meta-tracing [Bol+09]. Thereby, at run time of the actual Python program executed by the user, the PyPy interpreter, implemented as a stack-based VM in RPython, is traced, not the user program itself. Typically, a TJIT approach is based on a tracing VM which goes through the following phases [Cun10]:
Interpretation At first, the bytecode is interpreted as usual, with the addition of lightweight profiling code to detect which loops are run most frequently (i. e. hot loops). For this purpose, a counter is incremented at each backward jump. At a certain threshold, the VM enters the tracing phase.
Tracing The interpreter records all instructions of a whole hot loop itera-tion. This record is called a trace which is passed to the JIT compiler.The trace is a list of instructions with their operands and results.
Compilation The JIT compiler turns a trace into efficient machine codethat is immediately executable and can be reused for the next itera-tion of the hot loop.
Running The compiled machine code is executed.
The phases above represent only the nodes of a graph with many possible paths; the process is not linear. To ensure correctness, a trace contains a guard at each point where the path in the control flow graph (CFG) could have followed another branch, e. g. at conditional statements. If a guard fails, the VM falls back into interpretation mode.
However, the meta-tracing approach of PyPy is different. As the tracedprogram is the PyPy interpreter itself and not the interpreted program, ahot loop is the bytecode dispatch loop (and for many simple interpretersthis is the only hot loop). Tracing one iteration of this loop means that therecorded trace corresponds to executing one opcode (i. e. a machine code
instruction), and it is very unlikely that the same opcode is executed many times in a row. Therefore, the corresponding guard will fail, meaning that the performance is not improved. It would be better if the execution of several opcodes could be traced, which would effectively unroll the bytecode dispatch loop. Ideally, the bytecode dispatch loop should be unrolled exactly so much that the unrolled version corresponds to one loop in the interpreted user program. Such user loops can be recognized if the program counter (PC) of the PyPy interpreter VM has the same value several times. Since the JIT cannot know which part of the PyPy interpreter represents the PC of the VM, the developer of the interpreter needs to mark the relevant variables with a so-called hint. More on meta-tracing can be found in [Bol+09].
PyPy provides different parameters controlling the behavior of JIT compilation, with some magic numbers [BL15]:
Loop threshold Determines the number of times a loop must be iteratedto be identified as hot loop (default: 1619);
Function threshold Determines how often a function must be called to betraced from the beginning (default: 1039);
Trace eagerness If guard failures happen above this threshold, the TJIT attempts to translate the sub-path from the point of the path failure to the loop's end, which is called a bridge (default: 200).
Memory Management in PyPy
Since PyPy's initial release in 2007, many garbage collection methods without reference counting have been implemented, such as Mark and Sweep, Semispace Copying Collector, Generational GC, Hybrid GC, Mark & Compact GC, and Minimark GC [PyPb]. Currently, the default one is Incminimark, a generational moving collector [PyPd]. Since Incminimark is an incremental GC, the major collection is incremental (i. e. there are different stages of collection). The goal is to not have any pause longer than 1 ms, but in practice this depends on the size and characteristics of the heap, and there can be pauses between 10 and 100 ms.
7.1.3 Numba

Numba is an open-source JIT compiler translating a subset of Python and NumPy into machine code [Numa] using the LLVM compiler infrastructure project [LLV]. Most commonly, Numba is used through so-called decorators to mark code parts that shall be compiled instead of being interpreted by
CPython. Numba is therefore not an alternative to CPython but an extension to it, available for Python 2 and Python 3.
Features of the Numba environment
Since code compilation can be time intensive, only code parts that have a high share in the total execution time should be compiled. There are two modes in which the compiler treats the code [Anad]:

Nopython mode Numba generates code which is independent of the Python C API, which is the interface for C programs to the Python interpreter. A function can be compiled in nopython mode only if a data type can be assigned to all objects accessed by the function. In nopython mode, atomic (i. e. thread-safe) reference counters are used [Anaa] instead of the ones in CPython, which are not thread-safe.

Object mode Numba generates code which declares all objects as Python objects on which operations are performed with the aid of the Python C API. Therefore, the performance improvement is lower than in nopython mode, unless so-called loop-jitting can be applied by Numba. In the latter case, the loop can be automatically extracted and compiled in nopython mode, which is possible if the loop contains nopython-supported operations only [Anae].
Numba supports standard data types such as, e. g., int16, float32,and complex128 with a precision of up to 64 bit per value [Anaf]. For thecompilation of a function by Numba, a decorator must be written beforethe function:
@jit
def f(x, y):
    return x + y
There are two possibilities for using the jit-decorator:

Lazy compilation Numba determines the function parameters' types as well as the result type at run time, compiling specialized code for different input data types.

Eager compilation The programmer determines all data types manually; in case of the upper example, the type definition could be: @jit(int32(int32, int32))
Moreover, the following arguments can be set to True in the decorator [Anac]:

nopython Numba tries to compile the function in nopython mode, with an error message if this is not possible, instead of an automatic fallback to object mode.
cache Numba saves the machine code of the compiled function instead ofcompiling it at each call.
nogil Since atomic reference counters are used in nopython mode, the GIL can be disabled, enabling a truly concurrent execution of parallel running threads.
Supporting the NumPy module, Numba provides the possibility to build NumPy universal functions (ufuncs). A ufunc is a function that operates on ndarrays (for the definition see Sect. 7.1.1) in an element-by-element fashion, supporting several standard features [Scic]. Hence, a ufunc is a vectorized wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs. The wrapper therefore enables applying the wrapped function to ndarrays of variable length. For the generation of a ufunc, the vectorize-decorator is used, which allows lazy and eager compilation. In the case of lazy compilation, where no data types were defined, a dynamic universal function (DUFunc) is built which behaves like a ufunc, with the difference that machine code is compiled for loops if the given data types cannot be cast to the types of the existing code. In the case of ufuncs, an error is thrown if the provided data cannot be cast [Scib]. The advantage of ufuncs over functions compiled with the jit-decorator is the support of features like broadcasting. Basic operations on ndarrays are performed element-wise, which works on arrays of the same size. The broadcasting conversion, however, defines a way of applying operations to arrays of different sizes, as specified in [Scia].
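Broadcasting can be sketched with a built-in NumPy ufunc (np.add behaves like the ufuncs generated by the vectorize-decorator): a row of shape (3,) and a column of shape (3, 1) are stretched to a common shape (3, 3) before the element-wise operation is applied.

```python
import numpy as np

row = np.arange(3)                # shape (3,)
col = np.arange(3).reshape(3, 1)  # shape (3, 1)

# np.add is a ufunc: element-wise operation with broadcasting,
# here both operands are stretched to the common shape (3, 3)
table = np.add(row, col)
print(table.shape)  # (3, 3)
```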
The vectorize decorator supports scalar arguments only, while guvectorize also allows multi-dimensional arrays as input and output. Unlike with vectorize, in GUfunc signatures the dimensions and relations of the inputs must also be provided in a symbolic way. A guvectorize decorator for the well-known matrix-matrix multiplication could be used as follows:
@guvectorize(["void(int32, float64[:,:], float64[:,:], float64[:,:])"],
             "(),(m,m),(m,m)->(m,m)", nopython=True)
def multiplication(n, A, B, C):
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
In both decorators the nopython parameter can be specified to avoid a fallback to object mode.
Numba does not support the whole Python language in nopython mode. Moreover, not all Standard Library modules of Python are supported. Details on both are given in [Anah]. NumPy, however, is well integrated [Anag].
Chapter 7 HPC Python Internals and Benefits
General Procedure of Numba
Figure 7.4 shows the stages of the Numba compiler [Anab]:

1) Bytecode Analysis Numba analyzes the function bytecode to find the CFG.

2) Numba-IR Generation Based on the CFG and a data flow analysis, the bytecode is translated to Numba's intermediate representation (IR), which is better suited for analysis and translation as it is not based on a stack representation (used by the Python interpreter) but on a register machine representation (used by LLVM).

3) Macro Expansion This step converts specific decorator attributes (e. g. CUDA intrinsics for grid, block, and thread dimension) into Numba-IR nodes representing function calls.

4) Untyped IR Rewriting Certain transformations on the untyped IR are performed, e. g., for the detection of certain kinds of statements.

5) Type Inference The data type determination is performed as explained for lazy and eager compilation, with fallback to object mode or an error in nopython mode.

6a) Typed IR Rewriting Optimizations like loop fusion are performed, where two loops with operations on the same array are merged into one loop.

6b) Automatic Parallelization This stage is performed only if the parallel parameter is passed to a jit decorator, for automatic exploitation of the parallelism in the semantics of operations in Numba-IR.

7a) Nopython Mode LLVM-IR Generation If a Numba type was found for every intermediate variable, Numba can (potentially) generate specialized native code. This is called lowering, as Numba-IR is an abstract high-level intermediate language while LLVM-IR is a machine-dependent low-level representation. The LLVM toolchain is then able to optimize this into efficient code for the target CPU.

7b) Object Mode LLVM-IR Generation If type inference fails to find Numba types for all values inside a function, it is compiled in object mode, which generates significantly longer LLVM-IR as calls to the Python C API are performed for basically all operations.

8) LLVM-IR Compilation The LLVM-IR is compiled to machine code by the LLVM JIT compiler.
[Figure: a Python function (bytecode) and its function arguments enter the compilation pipeline: Bytecode Analysis → Numba-IR Generation → Macro Expansion → Untyped IR Rewriting → Type Inference → Typed IR Rewriting and Automatic Parallelization → Nopython & Object Mode LLVM-IR Generation → LLVM-IR Compilation → machine code for the computer platform]

Figure 7.4: Numba compilation stages
Numba does not implement a vectorization of the Numba-IR, but LLVM can apply automatic vectorization for single instruction multiple data (stream) (SIMD) capable CPUs [Anai]. LLVM's behavior in this respect can be changed via Numba environment variables [Anaj].
7.1.4 Cython
Cython is the name of a compiled programming language as well as of an open-source project, written in Python and C, that implements a Cython compiler with static code optimization [Cyt]. The Cython language combines the simplicity of Python with the performance of C/C++, as sketched in Fig. 7.5 [Behc], using mostly the usual Python syntax plus additional C-inspired syntax. It largely supports both Python 2 and Python 3 and extends Python by C data types and structures. A detailed
documentation of the differences in semantics between the compiled code and Python is provided in [Beha].
Cython Extending Python
In Cython it is possible to optimize Python code by static variable declarations such as
cdef int i
as it supports all basic C data types as well as pointers, arrays, typedef'ed alias types, structs/unions, and function pointers. Furthermore, Python types such as list and dict can also be declared statically. Variables without a static variable declaration are handled by the Cython compiler, with the aid of the Python C API, as Python objects. Moreover, it is possible to let the Cython compiler typify variables as static automatically, in certain functions or even the whole code, with the following compiler directive:

@cython.infer_types(True)
The compiler then tries to find the right data types based on the assignments in the related code. However, static typing of variables is not intended for the whole program; only the variables within performance-relevant parts should be statically typed. In any case, a conversion from Python objects to C or C++ types or objects is unavoidable, as will become apparent later. A Python integer, for instance, can be converted to char, int, long, etc., and a Python string can be converted to a C++ std::string [Smi15].
Python and C functions are similar in that they take arguments and return values, but Python functions are more powerful and flexible, which makes them potentially slower. Cython therefore supports Python as well as C functions, which can call each other.
[Figure: axes "Simplicity" vs. "Performance"; Cython is positioned between Python on the one side and C, C++, and Fortran on the other]

Figure 7.5: Comparison of Cython with other programming languages
A Python function is valid Cython code and can contain static type definitions as introduced before. These Python functions can be directly called by external Python code.
A C function can be included via a wrapper or implemented directly in Cython, in which case it is declared with the keyword cdef instead of def. Contrary to Python code, C code is not processed by the Cython compiler, as will also become apparent later. A cdef function is thus a C function that is implemented in Python-based syntax. The types of the function arguments and the return type are defined statically. In cdef functions, C pointers as well as structs and further C types can be used. Moreover, a call of a cdef function is as performant as a call of a pure C function through a wrapper, and the overhead of the call is minimal. It is also possible to use Python objects as well as dynamically typed variables in cdef functions and to pass them to the function as arguments. A cdef function cannot be called from external Python code, but it is possible to write a Python function within the same module which is externally visible and calls the cdef function, as for example the following one:
def externally_visible_cfunction_wrapper(argument):
    return cfunction(argument)
A third possibility for the implementation of a function is provided by cpdef, which combines the accessibility of Python functions with the performance of C functions [Rosa].
There is a restriction on Cython functions: the data types of the arguments and the return value must be compatible with both C and Python. While every Python object can be represented in C, not every C type can be represented in Python, for example C pointers and arrays.
Cython provides a set of predefined Python and C/C++ related header files with the filename extension .pxd. Most important is the C standard library libc with the header files stdlib, stdio, math, etc. The same applies to the Standard Template Library (STL), with the option to make use of containers such as vector, list, map, etc.
Cython allows an efficient access to NumPy's ndarray (for definition see Sect. 7.1.1), which is declared in a separate .pxd file as it is written in C. Besides ndarrays, Python's array module can also be used efficiently, as Python accesses the underlying C array directly.
Since it is possible to access Python functions at runtime, they are not declared in header files. Both declarations and definitions are located in the implementation files with the filename extension .pyx.
Cython Compilation Pipeline
Cython produces a standard Python module, but in an unconventional manner that is depicted in Fig. 7.6. A script (here: setup.py) is used to start the setuptools build procedure, which translates the Cython implementation file(s) (here: hello.pyx) into optimized and platform-independent C code (here: hello.c) with the aid of the Cython compiler. For instance, the mult function
def mult(a, b):
    return a * b
is compiled to several thousand lines of C code, which mainly consist of defines for portability reasons:
__pyx_t_1 = PyNumber_Multiply(__pyx_v_a, __pyx_v_b);
if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_r = __pyx_t_1;
__pyx_t_1 = 0;
It contains automatically generated variable names, which make the code hard to read. However, this is no problem since no manual changes to it are expected. The first line invokes the function PyNumber_Multiply from the Python C API, which performs a multiplication between two Python objects that are passed in the form of pointers. The if statement checks whether the multiplication was successful, and __Pyx_GOTREF implements the reference counting.
[Figure: setup.py drives the build; hello.pyx is translated by the Cython compiler to hello.c, which a C compiler turns into hello.so; launch.py then imports the resulting module]

Figure 7.6: Cython's workflow for Python module building [Dav]
Using Cython's advantage of static type declarations, the Cython code, in the case that int is used, is translated to the following C code:
__pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_a * __pyx_v_b);
if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
__pyx_r = __pyx_t_1;
__pyx_t_1 = 0;
Here, the multiplication is performed directly in the first line of the C code above, and the result is converted to a Python integer. It is also possible to translate the Cython code to C++, but the default target language is C. The code emitted by the Cython compiler can also be adapted by some directives listed in [Carb].
Afterwards, the generated C code is compiled by a C compiler such as gcc [GCC] or Clang [Cla] into a shared library file (here: hello.so on POSIX systems and hello.pyd on Windows). These shared libraries are called C extension modules and can be used like pure Python modules after a usual import. Depending on the setuptools script that is used, an extension module for the particular Python environment is generated. Therefore, Cython is not self-sufficient, as it depends on a Python environment such as CPython or PyPy.
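A minimal setup.py of the kind assumed in Fig. 7.6 could look as follows (a build-configuration sketch using Cython's documented cythonize helper; real projects typically add compiler options and package metadata):

```python
# setup.py -- build the extension with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("hello.pyx"))
```

Running the build command produces hello.c and the shared library, after which `import hello` works like any other module import.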
Parallel Programming in Cython
With the nogil keyword after a function definition, the GIL is released:

cdef int function(double x) nogil:
After the return, the GIL is active again. External C/C++ functions can also make use of concurrent processing by multithreading with nogil:

cdef extern from "header.h":
    double function(double var) nogil
This is possible only if no Python objects are used within the function. Based on this, OpenMP can also be used in an efficient manner [Behb].
7.2 Benchmarking Methodology
The benchmarking of the different environments for high-performance execution of Python programs was performed with the aid of the following algorithms:

Quicksort A sorting algorithm based on the divide-and-conquer approach [Cor+01].
Dijkstra Finds the shortest paths between a start node and all other nodes in a graph [Cor+01].
AVL Tree Insertion Insertion of values into an Adelson-Velsky and Landis (AVL) tree, which is a self-balancing binary search tree [Cor+01].
Matrix-Matrix Multiplication Performs the multiplication of two square matrices of the same size.
Gauss-Jordan Elimination Solves a system of linear equations by row reductions [Sto+13].
Cholesky Decomposition Computes the decomposition of a symmetric and positive definite matrix into a lower left triangular matrix and its transpose, whose product equals the original matrix [Sto+13].
PI Calculation Iterative algorithm for the approximation of π based on an integration using the rectangle rule [Qui03].
These algorithms were chosen to represent different algorithm categories: classical data processing on common ADTs such as lists, graphs, and trees on the one side, and numerical mathematics on the other side. All algorithms except the PI calculation were implemented sequentially. The PI algorithm was chosen as a well-known example of a perfectly parallel workload. As such, it can be used to benchmark how well the individual environments perform when the workload can be parallelized optimally (i. e. with no synchronization / communication between the parallel processors).
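The rectangle-rule π approximation integrates f(x) = 4/(1+x²) over [0, 1], which equals π. A minimal sequential sketch in pure Python (the function name approximate_pi and the midpoint variant of the rectangle rule are choices of this sketch, not necessarily the exact benchmark code):

```python
import math

def approximate_pi(num_rectangles):
    """Rectangle-rule integration of 4/(1+x^2) on [0, 1], which equals pi."""
    width = 1.0 / num_rectangles
    total = 0.0
    for i in range(num_rectangles):
        x = (i + 0.5) * width            # midpoint of the i-th rectangle
        total += 4.0 / (1.0 + x * x)
    return total * width

error = abs(approximate_pi(100_000) - math.pi)  # error well below 1e-9
```

Every iteration is independent of the others, which is exactly what makes this workload perfectly parallel.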
All algorithms are implemented in an iterative (not recursive) manner and in the following languages:
• C++, as a time- and memory-efficient object-oriented compiled programming language
• Pure Python 2
• Pure Python 3
• Pure Python 3 with NumPy
• Pure Python 3 with NumPy and Numba decorators
• Python 3 with Cython
For some algorithms certain implementations are not available; for instance, there was no reasonable use of an ndarray from NumPy in the AVL tree implementation. Furthermore, no additional PyPy-specific implementations were needed. The source code can be obtained by contacting the author.
Matrix-Matrix Multiplication Implementation as Example
For the matrix-matrix multiplication in Python, the code presented in Sect. 7.1.1 is used. In C++ the matrix is implemented based on a struct with an elementary two-dimensional array for which memory is allocated dynamically from the heap:
struct Matrix { int n; double** doublePtr; };

Matrix newMatrix(int n) {
    Matrix mat;
    mat.n = n;
    mat.doublePtr = new double*[n];
    for (int i = 0; i < n; i++)
        mat.doublePtr[i] = new double[n];
    return mat;
}
In pure Python the matrices are represented by lists and initialized with the aid of list comprehensions in Python 2 and Python 3:

def newMatrix(n):
    return [[0 for x in range(n)] for y in range(n)]

For the ndarray, the same data types as in the C++ version are used:

def newMatrix(n):
    return np.zeros(shape=(n, n), dtype='float_')
Relevant for the execution time are the three nested loops, which is why the corresponding function in the Numba version is based on the ndarray and has the following decorator:

@guvectorize(["void(int32, float64[:,:], float64[:,:], float64[:,:])"],
             "(),(m,m),(m,m)->(m,m)", nopython=True)
def multiplication(n, A, B, C):
In Cython, the function that performs the actual multiplication was equipped with static type definitions, and the ndarray was used for efficiency reasons as well:

def multiplication(int n,
                   np.ndarray[np.float64_t, ndim=2] A,
                   np.ndarray[np.float64_t, ndim=2] B):
    cdef np.ndarray[np.float64_t, ndim=2] C = newMatrix(n)
    cdef:
        int i
        int j
        int k
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
Here again, the data types are of the same precision as in the C++ version. The implementations of the other algorithms are realized in a similar way to the matrix-matrix multiplication presented here.
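For reference, a complete pure-Python (list-based) multiplication matching the newMatrix helper above can be sketched as follows (an illustrative sketch consistent with the described implementations):

```python
def newMatrix(n):
    # n x n matrix as a nested list, as in the pure-Python benchmark version
    return [[0 for x in range(n)] for y in range(n)]

def multiplication(n, A, B):
    """Triple-loop matrix-matrix product C = A * B for n x n lists."""
    C = newMatrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Multiplying by the identity, for instance, returns the original matrix unchanged.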
Realization of the Measurements
The execution time (wall-clock time) of each runtime environment resp. executable (in case of C++) was measured with the shell command time [IEE18]. For the memory usage measurements libmemusage.so was used, and the captured value was the heap peak as defined in [Man].
To make sure that a given algorithm is executed with the same values, pseudo-random generators were implemented to achieve the same input data in each run and runtime environment. For comparison reasons, automatic vectorization for CPUs with SIMD instructions was disabled in all Python environments as well as during the Cython and C++ compilation.
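A deterministic pseudo-random generator of the kind mentioned can be as simple as a linear congruential generator (LCG). The sketch below is illustrative only (the constants and the function name lcg are choices of this sketch, not the exact generator used in the benchmarks), but it shows the key property: identical input data in every run and runtime environment:

```python
def lcg(seed, n, modulus=2**31 - 1, a=1103515245, c=12345):
    """Linear congruential generator: state' = (a*state + c) mod m.
    Fully deterministic, hence reproducible across runs and runtimes."""
    values = []
    state = seed
    for _ in range(n):
        state = (a * state + c) % modulus
        values.append(state)
    return values

# The same seed always yields the same input data:
assert lcg(42, 5) == lcg(42, 5)
```

Because the recurrence uses only integer arithmetic, every environment (CPython, PyPy, a C++ port, etc.) produces bit-identical sequences from the same seed.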
7.3 Comparative Analysis
A comparative analysis of the different Python runtimes against C++ was accomplished on the following computer system.
Measurements Environment
All measurements in this section were accomplished on a server with 2 sockets, each with an Intel Xeon X7550 2.0 GHz (2.4 GHz Turbo) 8-core CPU with Hyper-Threading (HT); 256 GB DDR3 main memory; running an x86_64 Scientific Linux 6. The following software packages were used:
• CPython2 v2.7
• CPython3 v3.6.0 in combination with
  – NumPy 1.12.0
  – Cython 0.25.2
  – Numba 0.31.0
• PyPy2 v5.6.0 in combination with
  – NumPy 1.12.0
  – Cython 0.25.0
• PyPy3 v5.5.0
• Clang v3.9.1 with LLVM v3.9.1
PyPy's NumPyPy module was not considered as it was too incomplete at the time of the analysis. The legends of the plots show the different measured cases with the following meanings:
CPython2 Implementation in pure Python 2 and executed by CPython2
CPython3 Implementation in pure Python 3 and executed by CPython3
C++ Implemented in C++ and compiled by Clang
Cython (Pure Python) Implementation in pure Python 3, translated to C and compiled by Clang. The extension module generated in this way is imported by a Python file
Cython (Optimized) Implementation in Cython with C data types, translated to C and compiled by Clang. The extension module generated in this way is imported by a Python file
CythonPyPy (Optimized) Analogous to “Cython (Optimized)”. The extension module is utilized with the aid of cpyext, PyPy's subsystem which provides a compatibility layer to compile and run CPython C extensions inside PyPy [Cun]
Numba Implementation in Python 3
Numba + NumPy Implementation in Python 3 with the aid of ndarray
PyPy2 (Pure Python) Implementation in pure Python 2, executed byPyPy 2
PyPy3 (Pure Python) Implementation in pure Python 3, executed byPyPy 3
PyPy2 + NumPy Implementation in Python 2 with ndarray, utilized withthe aid of the cpyext subsystem, and executed by PyPy 2
CPython3 + Array Implementation in Python 3 with array module andexecuted by CPython3
[Figure: execution time in s (logarithmic scale) over array size up to 2.5 × 10⁷ elements, for all measured cases]

Figure 7.7: Execution times for Quicksort
Quicksort Analysis as Example
In Fig. 7.7 the execution times of the various Quicksort implementations are plotted with a logarithmic y-scale. Due to much higher execution times than in the other cases, the measurement of PyPy2 with NumPy was aborted when 2 million elements were to be sorted. The other measurements can be divided into a slower and a faster group. The slower group contains, among others, both reference Python environments CPython2 and CPython3. Throughout the whole measurement, the CPython2 solution was around 30 % faster than CPython3, but the Cython-compiled version of the pure Python 3 code is even faster than both CPython versions. The usage of the arrays from the NumPy and the array module has no positive effect; CPython3 with NumPy is even slower than pure Python on CPython3. In the faster group, PyPy2 and PyPy3 have similar execution times: for a size of 10 million elements they are around 35 times faster than pure Python on CPython3. Even faster are both Numba cases (pure Python and with NumPy) and the optimized Cython module built for PyPy2 as runtime environment. The optimized Cython module for CPython3 outperforms all other versions and is almost as fast as
[Figure: heap peak in MB (linear scale) over array size up to 1.6 × 10⁷ elements, for all measured cases]

Figure 7.8: Memory consumption (maximum heap peak) for Quicksort
the C++ implementation. For more execution time measurements see Appendix B.1.
In Fig. 7.8 the memory consumption (i. e. maximum heap peak) of all previously mentioned Quicksort implementations is plotted, now with a linear y-scale. There, PyPy2 with NumPy and CPython2 show an eminently higher memory consumption than the other cases, which is why they are plotted separately in Fig. 7.9. As the execution times of the three cases PyPy2 + NumPy, CPython3 + Array, and CPython3 + NumPy were very high, their memory measurements were aborted at 2 million elements. Here, the resulting plot lines can be divided into three groups. The group with the pure Python implementation on Numba and the array module based implementation on CPython3 shows the highest memory consumption. The second group consists of PyPy2, PyPy3, CPython3, and the Cython module for CPython3 (all four pure Python). In the most memory efficient group are Numba + NumPy and CPython3 + NumPy. Moreover, the optimized Cython implementation for CPython3 has the lowest memory consumption, which corresponds to that of the C++ implementation. For more memory consumption measurements see Appendix B.2.
[Figure: heap peak in MB over array size up to 1.6 × 10⁷ elements, for the selected cases]

Figure 7.9: Memory consumption (maximum heap peak) for Quicksort of selected runtime environments
Parallel Processing
For the PI calculation, the best-of-three execution times measured for each case are plotted in Fig. 7.10. The integration of the PI calculation was performed with 1 trillion rectangles. Depending on the number of threads to be forked for the measurement, a range of rectangles was assigned to each thread to achieve parallel processing of the work by multithreading vs. multiprocessing. PyPy was not considered as it has a GIL like CPython. Therefore, as expected, there are no speedups gained from multithreading in case of CPython3 and CPython2. Instead, the JIT-compiled Numba case (pure Python 3 plus jit decorator with nopython = True, nogil = True in its signature) and the AOT-compiled Cython case (optimized with static types) achieve speedups through multithreading. The C- and Cython-based solutions with four threads have an execution time of 2.5 s, while with one thread 10 s were needed, which equals a parallel speedup of 4 and an efficiency of 100 %. The multiprocessing case (pure Python 3) also leads to speedups, but these are much lower.
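The per-thread work assignment described above can be sketched as follows (illustrative pure Python; partial_pi is a hypothetical helper, and the real benchmark evaluates the chunks in parallel threads rather than in a sequential loop):

```python
import math

def partial_pi(start, end, num_rectangles):
    """Sum the midpoint-rule contributions of rectangles [start, end)."""
    width = 1.0 / num_rectangles
    total = 0.0
    for i in range(start, end):
        x = (i + 0.5) * width
        total += 4.0 / (1.0 + x * x)
    return total * width

n, workers = 100_000, 4
chunk = n // workers
# Each worker evaluates its own contiguous range; the partial results
# simply add up to the full approximation, with no communication needed:
parts = [partial_pi(w * chunk, (w + 1) * chunk, n) for w in range(workers)]
pi_estimate = sum(parts)
```

Because the partial sums are independent, the only synchronization point is the final addition, which is why this workload scales with near-perfect efficiency.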
[Figure: execution time in s (logarithmic scale) for 1 to 4 threads; cases: C, CPython2, CPython3, Numba, Cython, Multiprocessing]

Figure 7.10: Execution times for PI calculations with multiple threads
7.4 Conclusion and Outlook
In this chapter a comparative analysis of various high-performance Python environments was presented. For this, benchmark algorithms from different categories were chosen and also implemented in C++ as the reference language. Furthermore, particular care was taken to ensure an equivalent implementation in each case to achieve similar conditions for the different cases. Hence, only one C/C++ compiler (i. e. Clang) with the same options was used for the compilation of the Cython and C++ solutions.
In case of the sequential Python-based solutions, Cython optimized by static data type declarations and built for CPython3 has shown the shortest execution times in all test cases and could always compete with the C++ implementations. Even the unoptimized Cython builds for CPython3 have led to performance gains of 40 to 50 %. The execution times of the PyPy2 builds have shown great variations. Numba solutions have also led to high performance gains, despite the longest startup times of the runtime environment. Both PyPy versions have had similar execution times, which in case of pure Python were always at least one order of magnitude faster than the CPython environments.
The memory consumption measurements regarding the heap peak during runtime have shown the highest consumption for CPython2. Both PyPy versions have also shown a higher memory consumption than CPython3. For particularly large data structures, it has been shown that a very efficient memory consumption can be achieved through the usage of static data types in Cython built for CPython3. In case of Cython built for PyPy2, the results have shown great variations depending on the algorithm. Numba has also led to a higher memory efficiency than CPython.
The multithreading benchmark has shown no parallel speedup for both CPython versions because of the GIL. In case of Numba and Cython, where the GIL could be disabled, multithreading led to high parallel speedups. Especially in case of Cython, the perfectly parallelizable PI calculation led to a perfect parallel efficiency of around 100 %. The usage of multiprocessing in case of CPython led to low speedups only.
The presented comparative analysis gives Python programmers an overview of the analyzed solutions to improve the performance of their Python programs. Moreover, it provides information on how much effort is required for the application of a certain solution on the one side and which gain can be expected on the other side.
In future work, a closer look should be taken at Numba's precompiled functions that are vectorized. These were not considered in this work in order to achieve comparable implementations between the various solutions. Furthermore, besides multithreading and multiprocessing as parallel paradigms suitable for shared-memory computer architectures, paradigms that are suitable for distributed-memory machines, such as the Message Passing Interface (MPI) for Python, should also be included in future considerations. An implementation of MPI for Python is mpi4py [Dal]. Since all Python runtime environments and the language itself are under development, a framework for automated ongoing comparative analyses and result presentation would be useful.
8 HPC Network Communication for Hardware-in-the-Loop and Real-Time Co-Simulation
A digital real-time simulator (DRTS) for power grids reproduces voltage and current waveforms with a desired accuracy that represent the behavior of the real power grid being simulated. To be RT-capable, the DRTS needs to solve the power grid model equations for each time step within the time passed in the real world (i. e. according to the wall-clock time) [Far+15; BDD91]. To achieve this, the simulation generates outputs at discrete time intervals while the system states are computed at discrete time intervals with a fixed time step. In [Far+15] two classes of digital real-time (RT) simulations are defined: full digital RT simulations, which are modeled completely in the DRTS, and (power) Hardware-in-the-Loop (HiL) RT simulations, which can exchange simulation data with real hardware through I/O interfaces. Besides the improvement of DRTSs, e. g., with the aid of more performant numerical algorithms to be able to simulate increasingly complex models of Smart Grids in real time, it is also possible to distribute a simulation among multiple DRTSs. An approach for coupling DRTSs in laboratories, even across different countries, is presented in [Ste+17]. The coupling of this so-called geographically distributed real-time simulation (GD-RTS) was performed with the VILLASframework [Vog+17], abbreviated in the following as VILLAS, which was chosen for the integration of InfiniBand (IB).
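The real-time constraint — each simulation step must finish within its wall-clock time slot — can be pictured as a fixed-step loop that checks its deadline. This is a conceptual illustration only (run_realtime_loop is an invented name; actual DRTS schedulers are far more elaborate and run with hard-RT guarantees):

```python
import time

def run_realtime_loop(step_fn, time_step_s, num_steps):
    """Execute step_fn once per fixed time step; count missed deadlines."""
    overruns = 0
    next_deadline = time.monotonic() + time_step_s
    for _ in range(num_steps):
        step_fn()                      # solve the model equations for this step
        remaining = next_deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)      # wait out the rest of the time slot
        else:
            overruns += 1              # the step took longer than its slot
        next_deadline += time_step_s
    return overruns

# A trivial step against a 10 ms slot; misses is typically 0 on an idle system:
misses = run_realtime_loop(lambda: None, 0.01, 20)
```

When a step overruns its slot, the simulation is no longer real-time capable for that model and step size, which is exactly the limit that faster solvers or distribution across multiple DRTSs push outward.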
In the following section, the fundamentals of VILLAS are covered to motivate its choice for the integration of IB. The fundamentals of IB are introduced in the subsequent section. Then, the concept of the IB support in VILLAS is presented and analyzed comparatively with other interconnection methods available in VILLAS. Finally, the chapter is concluded and an outlook on future work is given. This chapter presents outcomes of the supervised thesis [Pot18].
8.1 VILLAS Fundamentals
The VILLASframework is a collection of open-source software packages for local and geographically distributed RT (co-)simulations. VILLASnode is the part of the collection that can be used as a gateway for simulation data. It supports several interfaces (called node-types) of the following three classes:

internal communication such as file for logging and replay, shmem for shared-memory communication, signal for test signal generation, etc.;

server-server communication such as socket for UDP/IP communication, mqtt for Message Queue Telemetry Transport (MQTT) communication, websocket for WebSocket based communication, etc.;

simulator-server communication such as opal for connections to OPAL-RT devices, gtwif for connections to RTDS devices, fpga for connections to VILLASfpga PCI-e cards, etc.
The instance of a node-type is called a node. In Fig. 8.1, besides VILLASnode, the VILLASweb component of the whole framework is also depicted. As sketched in the figure, a laboratory can contain multiple nodes which are used as gateways between software (SW) and hardware (HW) solutions. The interconnected nodes can run on the same or on different host systems in one or multiple labs. Some of the nodes can be hard or just soft RT capable, depending on their node-type.
While there are hard RT capable node-types for internal and simulator-server communication, there was no such node-type for server-server communication, because these all depend on the Internet Protocol (IP), which is mostly used with Ethernet-based interconnects for local area networks (LANs). One problem of Ethernet-based solutions is the relatively high latency of data transfers, caused in part by the network protocol stack of the operating system [Lar+09]. Another problem of Ethernet-based solutions is that quality of service (QoS) support is very limited.
That is why the latencies of the data transfers have a relatively high variability, which is a disadvantage for hard RT applications. To achieve hard RT between different computer hosts, IB was used as an alternative technology designed for low-latency and high-throughput server-server and device-server communication (e. g. for interconnecting storage solutions with computer clusters). The following section introduces how these properties of IB are achieved.
8.2 InfiniBand Fundamentals
Before an introduction to IB with its benefits, the main difference to the classical utilization of network interface controllers (NICs) must be explained. Usually, NICs are utilized through sockets (also called Berkeley or BSD sockets), an Application Programming Interface (API) for inter-process communication (IPC) originating from the Unix-like Berkeley Software Distribution (BSD) operating system (OS) [Tan09] and, with little modification, standardized in the Portable Operating System Interface (POSIX) specification. However, socket API implementations are not only part of POSIX-conformant OSs but, e. g., also of Windows and Linux. The focus in this chapter is on the latter, as the approach presented here was implemented based on Linux. A POSIX socket is a user space abstraction
[Figure: users 1…n access VILLASweb via web-based interfaces for model parameter setting, domain-specific offline analysis, data as a service, and simulation as a service; an offline integration layer connects VILLASweb to labs 1…n, each containing SW/HW nodes; the nodes are interconnected through the soft and hard real-time integration layers of VILLASnode]

Figure 8.1: Overview of the VILLASframework
of network communication, which mainly uses the operating system kernel based TCP/IP or UDP/IP stack (Transmission Control Protocol (TCP), User Datagram Protocol (UDP), IP) [Ker]. The network communication through sockets is accomplished via function calls on the socket. As in many other OSs, these user space calls are mapped to system calls (i. e. OS kernel function calls), which generate so-called traps (a type of synchronous interrupt) or, in case of modern computer architectures, sysenter instructions, which let the central processing unit (CPU) switch from user to kernel mode [Ker10; Tan09]. The switches between user and kernel mode (and back) can be expensive in time relative to the data transfer through the NIC itself. This and other drawbacks were the reason for the development of the virtual interface architecture (VIA) [Com97]. Some of the VIA characteristics mentioned in [Dun+98] are the avoidance of system calls whenever possible, data transfers with zero-copy, no interrupts for initializing and completing data transport, and a simple set of instructions for exchanging data. Therefore, some of the tasks which are handled by the IP stack in case of standard sockets (i. e. those mapped on standard kernel sockets), such as data transfer scheduling, must in VIA be handled by the NIC.
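The socket usage described above — user-space calls that drop into the kernel for every transfer — looks as follows in Python (a minimal local example using a connected socket pair; socketpair is part of Python's standard socket module):

```python
import socket

# socketpair() returns two already connected sockets (AF_UNIX on POSIX
# systems); every sendall()/recv() below results in a system call and
# thus in a user-to-kernel mode switch.
left, right = socket.socketpair()
left.sendall(b"sample payload")
data = right.recv(1024)

left.close()
right.close()
```

It is precisely these per-transfer mode switches (and the in-kernel copies behind them) that VIA and InfiniBand avoid.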
Contrary to standard sockets, VIA provides virtual interfaces (VIs), which are direct interfaces to the NIC through which each process assumes to own the interface, so that no system calls are needed during data transfers. Each such VI consists of a send and a receive (work) queue, which can hold descriptors containing all information needed for a data transfer, such as the destination address, the transfer mode, and the location of the payload in main memory. After completed transfers (with or without an error), the descriptors are marked by the NIC. Usually, the so-called VI consumer, residing in user space, is responsible for removing processed descriptors from the work queues. Alternatively, on creation, a VI can be bound to a Completion Queue (CQ) to which notifications on completed transfers are directed. Each CQ has to be bound to at least one work queue, which means that notifications of multiple work queues can be directed to a single queue.
The VIA supports the following two asynchronously operating data transfer models:
Send and receive messaging model (channel semantics) The receiving computer node (in this section referred to as node) specifies where received data shall be saved in its local memory by submitting a descriptor to the receive work queue. Afterwards, a sending node acts analogously with the data to be sent and the send work queue.
8.2 InfiniBand Fundamentals
Remote Direct Memory Access (RDMA) model (memory semantics) The so-called active node specifies the local memory region and the remote memory region of the so-called passive node. There are two possible operations in the RDMA model: In case of an RDMA write transfer, the active node specifies with the local memory region the data to be sent, while with the remote memory region it specifies where the data shall be stored. In case of an RDMA read transfer, the roles of the regions are reversed: the remote memory region specifies the data to be read and the local memory region where it shall be stored. To initiate an RDMA transfer, the active node specifies the local and remote memory addresses as well as the operation mode in a descriptor and submits it to the send work queue. The operating system and other software on the passive node do not actively participate in the (write or read) transfer. Therefore, no descriptors are submitted to the receive queue on the passive node.
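The information carried by a VIA descriptor for these transfer models can be sketched as a much simplified struct. All names and the layout here are illustrative inventions for this sketch; the real VIA descriptor format is considerably more involved.

```c
#include <assert.h>
#include <stdint.h>

/* A hypothetical, much simplified VIA-style transfer descriptor. The
 * field names are illustrative, not taken from the VIA specification. */
enum xfer_mode { XFER_SEND, XFER_RDMA_WRITE, XFER_RDMA_READ };

struct descriptor {
    enum xfer_mode mode;
    uint64_t local_addr;   /* location of the payload in local memory  */
    uint64_t remote_addr;  /* only used by the RDMA (memory) semantics */
    uint32_t length;
    int      done;         /* marked by the NIC on completion          */
};

/* Build a descriptor for an RDMA write: the active node names both the
 * local source region and the remote destination region. */
struct descriptor make_rdma_write(uint64_t local, uint64_t remote,
                                  uint32_t len)
{
    struct descriptor d = { XFER_RDMA_WRITE, local, remote, len, 0 };
    return d;
}
```

Note that for the channel semantics only local_addr would be filled in, since the receiver names the destination itself via its receive queue.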
8.2.1 InfiniBand Architecture
The InfiniBand Architecture (IBA) makes use of the abstract VIA design decisions [Pfi01]. The InfiniBand Trade Association (IBTA), founded in 1999 by more than 180 companies, describes the IBA in [Inf07] and the physical implementation of IB in [Inf16].
Network Stack
In Fig. 8.2 the IBA is depicted in the form of a network stack which consists of a physical, link, network, and transport layer. Hints for the IBA realizations are given to the right of the various layers.
Endnodes and Channel Adapters
The communication within an IB network takes place between (end)nodes, which can be, e. g., a server node or a storage system in a computer cluster. A Channel Adapter (CA) is the interface between a node and a link. It can be either a Host Channel Adapter (HCA), which is used in computer hosts and supports certain software features defined by so-called verbs, or a Target Channel Adapter (TCA), which has no defined software interface and is normally used in devices such as storage systems.
Service Types
InfiniBand supports the following service types:
Reliable Connection (RC) A connection between nodes is established and messages are reliably transferred between them (optional for TCAs). One Queue Pair (QP), which is IB's equivalent of a VI, on a local node is connected to one QP on a remote node.
Reliable Datagram (RD) Single-packet messages are transferred reliably without a one-to-one connection. A local QP can communicate with any other RD QP. This type is optional and not implemented in the OFED stack (see Sect. 8.2.2).
Unreliable Connection (UC) Analogous to RC but unreliable (i. e. packets can get lost).
Unreliable Datagram (UD) Analogous to RD but unreliable.
Raw Datagram Packets are sent without IBA-specific headers.
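The two properties distinguishing these service types, reliability and connection orientation, can be captured in a small lookup. The enum names below are made up for this sketch; the verbs API uses its own names (e. g. IBV_QPT_RC).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative encoding of the IB service types listed above. */
enum service_type { SVC_RC, SVC_RD, SVC_UC, SVC_UD, SVC_RAW };

/* Reliable types guarantee delivery in the HCA itself. */
bool is_reliable(enum service_type t)
{
    return t == SVC_RC || t == SVC_RD;
}

/* Connected types tie one local QP to exactly one remote QP. */
bool is_connected(enum service_type t)
{
    return t == SVC_RC || t == SVC_UC;
}
```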
Message Segmentation
The payload is divided into messages between 0 B and 2 GiB for all service types except for UD, which supports messages between 0 B and 4096 B, depending on the maximum transmission unit (MTU). Messages bigger
Figure 8.2: Network stack of the InfiniBand Architecture (IBA)
than the MTU are segmented into smaller packets by the IB hardware, which, thus, should not affect the performance as software-based segmentation would [CDZ05]. In the following, QPs are explained further.
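The number of packets such a segmented message occupies follows from a simple ceiling division, sketched below under the simplifying assumption that per-packet header overhead is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Number of packets a message of msg_len bytes occupies when the channel
 * adapter segments it into MTU-sized packets; an empty message still
 * consumes one packet. (Simplified: header overhead is ignored.) */
uint64_t packets_per_message(uint64_t msg_len, uint64_t mtu)
{
    if (msg_len == 0)
        return 1;
    return (msg_len + mtu - 1) / mtu;   /* ceiling division */
}
```

A maximum-size 2 GiB message at the common 4096 B MTU thus occupies 524 288 packets, all produced by the hardware without software involvement.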
Queue Pairs and Completion Queues
Figure 8.3 shows an abstract model of the IBA. Like VIs, QPs have Send Queues (SQs) and Receive Queues (RQs) which enable processes to communicate directly with the HCA. Like descriptors in the VIA, Work Requests (WRs) are submitted to a work queue before a message transfer, resulting in Work Queue Elements (WQEs) in the queue. In case of a send WR, the WQE contains the address of the memory location containing the data to be sent. In case of a receive WR, the WQE contains the address of the memory location where received data shall be stored. Not every QP can access every memory location, due to memory protection mechanisms
Figure 8.3: InfiniBand Architecture (IBA) model
that also handle which locations can be accessed by remote hosts and the HCA. A WQE in the SQ also contains the network address of the remote node and the transfer model (i. e. send messaging or RDMA).
Data Transmission Example
Figure 8.4 shows a sending and a receiving node, each with three QPs. Each QP is always initialized with a send and a receive queue, but for the sake of clarity the unused empty queues are not depicted.
Before a transmission, the receiving node submits WRs to its RQs. In the figure, the receiving node's consumer is submitting a WR to the red RQ. Afterwards, WRs can be submitted to the SQs and will then be processed by the CA. While the processing order between queues depends on the priority of the services, on congestion control, and on the HCA, WQEs within a certain queue are processed in first-in first-out (FIFO) order. In the figure, the sending node's consumer is submitting a WR to the black SQ and the HCA is processing a WQE from the blue SQ.
After the HCA has processed a WQE, it places a Completion Queue Entry (CQE) in the completion queue, which, i. a., contains information about the processed WQE and the status of the operation, indicating a successful transmission or an error if, e. g., the queue concerned was full. When a CQE is posted depends on the used service type. For instance, in case of an unreliable type, a CQE is posted as soon as the HCA sends the data belonging to a send WR. Instead, in case of
Figure 8.4: InfiniBand data transmission example
a reliable type, the CQE is not posted until the message has been successfully received by the remote node.
In the figure, the receiving node's HCA is consuming a WQE from the blue receive queue. After consuming a WQE, the HCA writes the received message to the memory location contained in the WQE and posts a CQE. If the sending node's consumer has included so-called immediate data in the message, it will be present in the CQE of the receiving node.
Work Queue Entry Processing
After the submission of a WR to a queue by a process, the HCA starts processing the resulting WQE. In Fig. 8.3 it can be seen that an internal Direct Memory Access (DMA) engine accesses the memory location contained in the WQE and copies the data from that location to a local buffer of the HCA. Every HCA port has several such buffers, called Virtual Lanes (VLs). After this step, the arbiter of each port decides from which VL packets will be sent through the physical link. More on that and further details on the InfiniBand Architecture can be found in [Pot18].
8.2.2 OpenFabrics Software Stack
The IBA does not include a full API specification, in order to allow vendor-specific APIs. In 2004 the nonprofit OpenIB Alliance was founded, later renamed to the OpenFabrics Alliance, which releases the open-source OpenFabrics Enterprise Distribution (OFED). OFED is a software stack including, i. a., software drivers, kernel code, and user-level interfaces such as verbs. Most InfiniBand vendors provide OFED-based software, with little adaption and some enhancements, together with their hardware solutions. Figure 8.5 shows a sketch of an OFED stack [Mel18] in which the user and the kernel verbs can be seen; in this work, verbs always refers to user verbs.
Submitting Work Requests to Queues
The submission of WRs to the work queues allows user space processes to initiate data transfers through an HCA without intervention of the operating system kernel. As mentioned before, WQEs contain memory locations for data read and written by the HCA. A WR contains a pointer to a list with at least one scatter/gather element (sge) containing the memory address and length of a memory location as well as a local key for access control. Besides a list of sges, the receive WR structure contains only a few further data elements, such as a pointer to the next
receive WR. The send WR structure, in contrast, contains even more elements by which various (sometimes optional) features of HCAs can be enabled. The opcode element defines the operation with which the associated message(s) are sent. Which operations are allowed depends on the QP the WR is submitted to and the chosen service type. Furthermore, send_flags can hold various flags defining how the send WR shall be processed. One of the flags is IBV_SEND_INLINE, which causes the data pointed to by the sge to be copied directly into the WQE by the CPU. This avoids a copy, performed by the HCA's DMA engine, from the host's main memory to the internal buffer of the HCA. The inline send mode is not defined in the original IBA and therefore not supported by every HCA.
Figure 8.5: An overview of the OFED stack
Since it potentially leads to lower latencies and the buffers can be released for re-use immediately after submission of the send WR, the InfiniBand integration presented here makes use of the inline mode. More details about the OFED can be found in [Pot18].
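The effect of the inline flag on buffer reuse can be sketched with mock structures whose layout is merely inspired by ibv_sge and ibv_send_wr; they are not the real libibverbs definitions, and the 188 B limit is just the example value of the HCAs used later in this chapter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SEND_INLINE 0x1   /* stands in for IBV_SEND_INLINE          */
#define WQE_BUF_MAX 188   /* max inline payload of this mock "HCA"  */

struct sge { const uint8_t *addr; uint32_t length; };

struct send_wr {
    struct sge sg;
    unsigned   send_flags;
};

struct wqe {                    /* what ends up in the send queue */
    uint8_t  buf[WQE_BUF_MAX];  /* payload copy, inline mode only */
    const uint8_t *addr;        /* payload pointer, DMA mode      */
    uint32_t length;
    int      inlined;
};

/* "Post" a send WR: in inline mode the CPU copies the payload into the
 * WQE, so the sender's buffer may be reused immediately; otherwise the
 * (mock) DMA engine would later read from wr->sg.addr. */
int post_send(struct wqe *w, const struct send_wr *wr)
{
    if ((wr->send_flags & SEND_INLINE) && wr->sg.length <= WQE_BUF_MAX) {
        memcpy(w->buf, wr->sg.addr, wr->sg.length);
        w->inlined = 1;
    } else {
        w->addr = wr->sg.addr;
        w->inlined = 0;
    }
    w->length = wr->sg.length;
    return 0;
}
```

The design trade-off is visible here: inline mode spends CPU cycles on the copy but decouples the lifetime of the application buffer from the transfer, which is what allows the immediate release mentioned above.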
8.3 Concept of InfiniBand Support in VILLAS
The InfiniBand support was implemented in the VILLASnode sub-project of the VILLASframework. Therefore, the VILLASnode component is introduced in the following.
8.3.1 VILLASnode Basics
As already mentioned in Sect. 8.1, VILLASnode supports different node-types. One VILLASnode instance, called a super-node, can have multiple nodes that are sources and/or sinks of simulation data. Hence, a super-node can serve as a gateway for simulation data. In the context of VILLAS, a node is defined as an instance of a node-type from one of the three categories introduced in Sect. 8.1. The connections within a super-node are realized with paths between input and output nodes. A path starts from an input node obtaining data that can, optionally, be sent through a hook which modifies the data (e. g. by a filter). Subsequently, the data is written to a FIFO queue (for buffering) before it can be sent through a register which can multiplex and mask it. After this, it can be manipulated by further hooks before it is passed to the output queue, which holds the data until the output node is ready. The data is transmitted as samples holding the payload (e. g. simulation data) together with metadata (timestamps and a sequence number). As a sample is the internal format for payload exchange between nodes of arbitrary types, its structure is kept simple to avoid overhead.
Figure 8.6 depicts a super-node with five node-type instances: opal, file, socket, mqtt, and the additionally implemented IB node presented in this chapter. The paths (1 to 3) connect the nodes (n1 to n5) through hooks (h1 to h6), registers (r1 to r3), and input queues (qi,1 to qi,5) as well as output queues (qo,1 to qo,4). More on the types of the nodes can be found in [FEIg].
8.3.2 Original Read and Write Interface
For the interoperability between nodes of different types, various functions such as start(), stop(), read(), and write() must be provided by the implementation of a new node-type, in the form of assignments of the implemented functions' addresses to the specified function pointers, for instance:
int (*read)(struct node *n, struct sample *smps[], unsigned cnt);
int (*write)(struct node *n, struct sample *smps[], unsigned cnt);
Some of the functions are optional and will be omitted if no implementation is available for a certain node-type.
Read Function in General
Figure 8.7 depicts the general procedure of the read function of an arbitrary node-type. During the call of a read function, the super-node passes the address of a field of sample addresses (*smps[]) of length cnt ≥ 1 for the data the super-node wants to read from the node. A sample contains, i. a., a sequence number, a reference counter (refcnt), and a field for the actual signal (i. e. payload such as, e. g., 64 bit integers and floating-point numbers). During the allocation of a sample by the
Figure 8.6: An example super-node with three paths connecting five nodes of different node-types
super-node, its refcnt is set to 1; the sample's memory is not freed as long as refcnt ≥ 1. Releasing a sample means decreasing its refcnt. Within the read function, the node (i. e. its receive module) is instructed to store at most cnt received samples in the passed list of samples. When the receive module has finished copying ret ≤ cnt samples, the read function returns ret.
After that, the super-node can process the samples by hooks before forwarding them to another node. Finally, all samples are released, usually resulting in the freeing of their memory.
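The refcnt semantics described above can be sketched as follows; this is a minimal model, as the real VILLASnode sample struct carries more fields (sequence number, timestamps, signal data) and uses atomic operations.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal sketch of the sample reference counting described above. */
struct sample {
    int refcnt;
    /* ... payload and metadata ... */
};

struct sample *sample_alloc(void)
{
    struct sample *s = malloc(sizeof(*s));
    if (s)
        s->refcnt = 1;      /* the allocating super-node holds one ref */
    return s;
}

/* Releasing decrements refcnt; the memory is freed only when the last
 * reference is gone. Returns 1 if the sample was actually freed. */
int sample_release(struct sample *s)
{
    if (--s->refcnt > 0)
        return 0;
    free(s);
    return 1;
}
```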
Write Function in General
The general procedure of the write function of an arbitrary node-type is similar to that of the read function. Here, the super-node passes a field with cnt samples to the function, which are copied to the send module within the write function. The send module tries to send all samples, which blocks the return of the write function. After the sending has finished, the number of sent samples, ret, is returned. If ret is not equal to cnt, the super-node handles the sending error properly. In any case, the refcnt of all cnt samples is decremented.
Figure 8.7: General read function working principle in VILLAS
8.3.3 Requirements on InfiniBand Node-Type Interface

InfiniBand with its zero-copy principle, inherited from the VIA, requires that the receive / send modules do not copy any data between their local buffers and the super-node's buffers. Instead, pointers to the super-node's buffers and their lengths should be passed to the HCA, which uses them directly for received data and for data to be sent. In the following, the desired behavior of the read and write function is sketched.
Read Function of InfiniBand Node-Type
Figure 8.8 depicts the read function's procedure for the IB node-type. The QP is instructed to receive data by a WQE in its RQ. Therefore, a receive WR pointing to buffers of the super-node must be submitted to the RQ. For compatibility with existing node-types, the further steps were implemented in a way that causes as few changes as possible. Therefore, within the read function, the addresses of the samples (passed as *smps[] parameter) are assigned to sges and inserted into WRs which are then submitted to the RQ. This results in a direct storage of received
Figure 8.8: InfiniBand node-type read function working principle
data by the HCA in the super-node's samples field, avoiding data copying. Furthermore, the return behavior of the read function differs considerably from other node-types. If the CQ contains no CQEs, the HCA has received no data and, thus, the ret value should be 0. However, the sample buffers must not be released (i. e. no refcnt may be decreased) as they are submitted to the RQ of the HCA. If the CQ contains CQEs, the addresses of the buffers from the CQ holding the received data are assigned to the pointers of the smps[] field that was passed to the read function. Moreover, the ret value is set to the number of pointers that have been replaced. Furthermore, the buffers containing the received data must be released after being processed by the super-node. This approach requires that the super-node calls the read function once (during initialization) without reading any data, since only after this first call does the HCA know where to store received data.
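The pointer swap at the heart of this zero-copy read path can be modeled in a few lines. All types here are illustrative stand-ins, not VILLASnode or verbs structures; in particular, the wr_id field mimics how a completion identifies the buffer it belongs to.

```c
#include <assert.h>

/* Toy model of the IB read path: instead of copying payloads, the
 * pointers in smps[] are replaced by the addresses of the buffers that
 * the (mock) CQEs report as filled. */
struct cqe { void *wr_id; };   /* identifies the completed receive buffer */

/* Returns how many smps[] pointers were redirected to received data. */
unsigned swap_in_completions(void *smps[], unsigned cnt,
                             const struct cqe cq[], unsigned ncqe)
{
    unsigned ret = ncqe < cnt ? ncqe : cnt;
    for (unsigned i = 0; i < ret; i++)
        smps[i] = cq[i].wr_id;   /* zero-copy: hand over the buffer */
    return ret;
}
```

Because only pointers change hands, the cost of "reading" is independent of the message size; the price is the bookkeeping of which buffers belong to the RQ and which to the super-node, as discussed above.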
Write Function of InfiniBand Node-Type
The write function's procedure for the IB node-type must be similar to that of the read function in order to achieve zero-copy. When the addresses of the sample buffers that are passed to the write function are submitted via send WRs to the SQ, ret must be set to the number of submitted pointers. If the CQ is empty, none of the passed buffers may be released as the HCA still has to send the data they contain. If the CQ is not empty, previously submitted WRs are finished and the buffers they point to can be released. Therefore, the addresses of the buffers that were passed in a previous call of the write function are assigned to the pointers of the smps[] field that was passed with the current call of the write function. Furthermore, the super-node must be notified to release the sample buffers that were yielded by the corresponding CQEs.
Adapted Read and Write Interface
The original interface could be adapted to return the number of samples that must be released by the super-node, as the super-node cannot predict this number, especially in the case of inline sending, where buffers can be released immediately after send WR submission, or in the case that a WR could not be successfully submitted to the SQ. The information on the number of samples to be released could be passed to the super-node by an additional integer pointer in the signatures of the read and write function.
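A possible shape of such an adapted interface is sketched below. The extra *release out-parameter and all names are hypothetical illustrations of the proposal, not the actual VILLASnode API.

```c
#include <assert.h>

struct node;
struct sample;

/* Hypothetical adapted write interface: the additional output parameter
 * *release tells the super-node how many of the passed samples it must
 * release itself. */
typedef int (*write_fn)(struct node *n, struct sample *smps[],
                        unsigned cnt, unsigned *release);

/* Toy implementation standing in for an inline-sending node-type: all
 * buffers can be released right after submission. */
static int write_inline_all(struct node *n, struct sample *smps[],
                            unsigned cnt, unsigned *release)
{
    (void) n; (void) smps;
    *release = cnt;   /* inline send: every buffer is reusable at once */
    return (int) cnt;
}
```

A non-inline node-type would instead set *release to the number of CQEs it drained in this call, decoupling "submitted" from "releasable".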
8.3.4 Memory Management of InfiniBand Node-Type

VILLASnode allows memory allocation that is optimized for real-time processing. The implemented alloc() function can allocate huge pages, which leads to a faster mapping between virtual and physical memory [Deba]. Furthermore, it can lead to fewer page faults, and in case of enabled page pinning, the pages must remain in main memory (i. e. are not swapped), which avoids delays in the execution of the program that could cause real-time violations. These and some other memory types are not sufficient for the IB node-type, as the HCA will access the buffers allocated by the super-node and referenced by WRs. Therefore, every node-type defines what kind of memory allocation is performed by alloc() and whether it should be registered with a memory region (as needed for IB). Furthermore, this also allows implementing functionality for acquiring local keys for samples that are passed to the read / write functions. The definition of a preferred memory type for a node-type allows the super-nodes to use a proper memory allocation for input and output buffers that are connected to nodes of that type.
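The huge-page and pinning idea can be illustrated with standard Linux mmap/mlock calls. This is only a sketch of the technique, not VILLASnode's actual alloc() implementation; it falls back to regular pages when no huge pages are configured, and treats pinning as best-effort since mlock may fail without privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Best-effort real-time allocation: try huge pages first (faster
 * virtual-to-physical mapping, fewer page faults), fall back to regular
 * anonymous pages, then pin the region so it cannot be swapped out. */
void *alloc_rt(size_t len)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   flags | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)                    /* no huge pages available */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    (void) mlock(p, len);   /* pinning may fail without privileges */
    return p;
}
```

For the IB node-type, a buffer obtained this way would additionally have to be registered as a memory region with the HCA to obtain the local key mentioned above.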
8.3.5 States of InfiniBand Node-Type

Before the implementation of the IB node-type, a node could be in six states, depicted in Fig. 8.9 as circles with solid lines. If, e. g., the _start() function of the node-type interface is called successfully,
Figure 8.9: VILLASnode state diagram with newly implemented states
the transition checked→started is performed. Following the VIA, a node can be initialized but not yet connected (i. e. the node is not able to send data). Therefore, the started state of VILLASnode is not sufficient and was extended by the new state connected. Moreover, before any data can be received, WQEs must be present in the respective RQ. These circumstances lead to the finite-state machine in Fig. 8.9 with the new states drawn with dashed lines. If a node is in one of these states, the super-node interprets it as if it were in the started state. This finite-state machine can also be used for future node-types other than IB that are based on the VIA.
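The extended state set and the two rules just stated (the super-node treats the new states as started; only a connected node may transmit) can be sketched as follows. Enum and helper names are illustrative, not taken from the VILLASnode source.

```c
#include <assert.h>
#include <stdbool.h>

/* States of the extended finite-state machine of Fig. 8.9. */
enum node_state {
    ST_INITIALIZED, ST_PARSED, ST_CHECKED, ST_STARTED,
    ST_PENDING_CONNECT, ST_CONNECTED,      /* new, for VIA-based types */
    ST_STOPPED, ST_DESTROYED
};

/* The super-node treats the new states like "started". */
bool counts_as_started(enum node_state s)
{
    return s == ST_STARTED || s == ST_PENDING_CONNECT || s == ST_CONNECTED;
}

/* Only a connected VIA-based node may transmit data. */
bool may_send(enum node_state s)
{
    return s == ST_CONNECTED;
}
```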
8.3.6 Implementation of InfiniBand Node-Type
An overview of the implemented IB node-type is shown in Fig. 8.10. The most important aspects are explained in the following, i. e. the read and
Figure 8.10: Components of InfiniBand node-type
write function, which allow the kernel bypass offered by InfiniBand. The whole source code is open-source and part of the VILLAS project [FEIf].
Start Function
The start function is called by the super-node for initialization purposes. First, RDMA communication event channels are created to be able to resolve the remote address (as active node) or to place the node into a listening state (as passive node). Whether a node is active or passive is defined by its configuration. In case of a successful start, the node transitions into the started state.
Communication Management Thread
The communication management thread is spawned by the start function. It waits in the blocking rdma_get_cm_event() function for events such as connection requests, errors, rejections, and connection establishment. Depending on the node, the thread acts as follows:
Active node As the node tries to connect to another node, the RDMA_CM_EVENT_ADDR_RESOLVED event signals that the address could be resolved. After a successful initialization of various structures, the RDMA route is resolved, which should end with an RDMA_CM_EVENT_ROUTE_RESOLVED event, followed by an RDMA_CM_EVENT_ESTABLISHED event if the remote node accepts the connection, which results in a transition of the node to the connected state. In this state, data can be transmitted.
Passive node As the node listens for connection requests of other nodes, the RDMA_CM_EVENT_CONNECT_REQUEST event occurs if another node performs such a request. In case the service type is UC or RC, the node transitions to the pending connect state. In case of the unconnected service type UD, it transitions to the connected state. An RDMA_CM_EVENT_ESTABLISHED event occurs after a successfully established connection, which lets the node transition to the connected state.
Read Function
The read function's behavior differs from the principle depicted in Fig. 8.7, as it can happen that samples could not be submitted successfully and therefore must be released again. For this purpose, a threshold number can be defined in the node's configuration to ensure that at least threshold samples can be received. If the threshold is reached, the CQ is polled until it contains enough CQEs, which intentionally blocks the further execution of the read function. Moreover, entries in *smps[] are freed, as it can hold only a certain number of values.
Write Function
When the super-node calls the write function, it tries to submit all passed samples to the SQ. Iterating through the samples, the node decides dynamically in which manner each sample has to be sent:
1. samples are submitted normally and are not released by the super-node until a CQE with the proper address appears;
2. samples are submitted normally but some are marked as bad and must be released by the super-node;
3. samples are sent inline (i. e. are copied by the CPU directly into the HCA's buffers) and must be released by the super-node.
More on the implementation of the InfiniBand node-type can be found in [Pot18].
8.4 Analysis of the InfiniBand Support in VILLAS
The performance of the newly implemented IB node-type is evaluated in the following in comparison to other, already existing node-types of VILLASnode.
Measurement Environment
All measurements in this section were performed on a Dell T630 server with an NT78X mainboard providing 2 sockets, each with an Intel Xeon E5-2643 v4 3.4 GHz (3.7 GHz Turbo) 6-core CPU with Hyper-Threading (HT); 32 GB DDR4 main memory at 2400 MHz; 2x Mellanox ConnectX-4 MT27700 HCAs with 100 Gbit/s, interconnected via a 0.5 m Mellanox MCP100-E00A passive copper cable; running an x86_64 Fedora Linux with kernel 4.13.9-200 and MLNX OFED Linux 4.4-2.0.7.0. Moreover, the system was optimized for real-time processing.
Real-Time Optimizations
The following real-time optimizations were applied [Pot18]:
Memory optimizations Achieved through the utilization of huge pages, aligned memory allocations, and memory pinning.
CPU isolation and affinity Achieved by using the isolcpus kernel parameter, excluding processor cores from the general balancing and scheduling mechanisms, so that no process is scheduled to the excluded CPUs unless it is explicitly assigned to one by sched_setaffinity(). Moreover, cpusets are used to allow threads that are forked by processes on an excluded CPU to be scheduled among all available excluded CPUs instead of being assigned only to the CPU of their forking process.
Non-movable kernel threads During system booting, kernel threads are created for tasks of the kernel and pinned to CPUs. This can be avoided so that no kernel threads run on the excluded CPUs.
Interrupt affinity This is used for re-routing interrupts that would disturb a CPU performing time-critical operations (e. g. busy polling on a signal variable for a certain event) to a CPU which is not assigned to time-critical processing.
Tuned daemon Red Hat based systems such as the used Fedora Linux support the tuned daemon for monitoring devices and adjusting system settings for higher performance. Supported tuning plugins are, e. g., cpu, net, and sysctl. tuned offers many predefined profiles, such as latency-performance for low-latency applications. For instance, this profile sets the CPU frequency governor to performance.
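The explicit CPU assignment mentioned in the second item can be sketched with the Linux sched_setaffinity() call; the helper name pin_to_cpu and the choice of CPU number are illustrative, while the API calls themselves are standard glibc/Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU, as done (together with
 * isolcpus and cpusets) for the time-critical VILLASnode processes. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this thread */
}
```

Combined with isolcpus, such a pinned process is the only user-space work the excluded core ever executes, which is the point of the isolation.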
Figure 8.11 shows the distribution of the CPUs among the cpusets. CPUs in the two real-time<N> cpusets are limited to the memory locations of their non-uniform memory access (NUMA) node. This leads to lower memory access latencies, as in NUMA computer architectures the main memory is distributed among the nodes (here: processors) of the system, as shown in Fig. 8.11 for the described test system. The limited memory locations are also used by the respective HCAs for writing and reading of received data and data to be sent. Therefore, all time-critical processes using the HCAs (i. e. mlx5_0 and mlx5_1) were executed on the CPUs 16, 18, 20, and 22 as well as 17, 19, 21, and 23 (see Fig. 8.11).
Figure 8.11: Computer system with NUMA architecture used for measurements
VILLASnode Node-Type Benchmark
The VILLASnode node-type benchmark was used to compare the performance of different node-types. Its working principle is depicted in Fig. 8.12. First, the already existing signal node generates samples with timestamps, which are then forwarded to a file node that stores them in a file of comma-separated values (CSV), called in. Concurrently, the samples are sent to a sending node of the type to be benchmarked. A receiving node gets the samples and writes them, together with timestamps of their reception, to a CSV file, called out. Therefore, the benchmark
Figure 8.12: VILLASnode node-type benchmark working principle
is utilized for measuring the transfer latencies between nodes. Although the out file contains the generation and receive timestamps, the in file is needed to determine which samples were lost. Since the signal node can miss steps during sample generation at high rates, it can thus be determined whether samples are missing because of the signal node or because they were lost between the nodes. The reason for the missed steps is explained in the following.
Sample Generation
For payload generation at different rates, the signal node was configured accordingly. It can make use of two different timers: the timerfd, which relies on waiting on a file descriptor used for notifications, and the time-stamp counter (TSC), which is a 64 bit CPU register that is incremented mainly depending on the CPU's maximum core frequency. In separate latency benchmarks [Pot18] at a rate of 100 kHz, with ten 64 bit floating-point numbers per sample and RC as service type of an IB node, it was determined that the timerfd had a higher effect than the TSC on the median tlat of the measured latencies. However, with the TSC and relatively low rates below 2500 Hz, steps were missed. For example, the fraction of missed steps at 100 Hz was around 8 %. Since using the timerfd at these low rates would have skewed the results, and since a deviation of 8 Hz at a rate of 100 Hz will hardly influence any latency results, the TSC was chosen.
8.4.1 Service Types of InfiniBand Node-Type
In the following, different service types of the IB node-type are compared, namely RC and UD, as they are officially supported by the RDMA communication manager (CM) (see Fig. 8.5) and therefore do not require any modifications of the RDMA library. All measurements in this section were performed with 250'000 samples.
Various Sample Generation Rates
In these measurements the samples contained 8 random 64 bit floating-point numbers and were generated at rates between 100 Hz and 100 kHz. For RC, with 24 B of metadata, such a message has 88 B; for UD, with a Global Routing Header (GRH) of 40 B, a message has 128 B – all messages were sent inline. Figure 8.13 shows the results, which are relatively similar for both types (RC and UD) over all rates. This is typical for InfiniBand, as the reliability is implemented in the HCA, which causes less overhead than
8.4 Analysis of the InfiniBand Support in VILLAS
an implementation in the network stack (e. g. TCP/IP) of the operating system.
In both cases, tlat decreases with higher frequencies and, thus, shorter periods between sample transmissions. Assuming one-way transmission times of 1 µs [Pot18], transmission rates of up to approximately 1 MHz should be possible. However, rates higher than 100 kHz were not measured, as the signal node of VILLAS missed even more steps; a higher rate could not be achieved despite optimizations of the file node. Another limitation is that the rate at which the IB node clears the CQ and refills the RQ depends on the rate of the read function calls. If the RQ size is sufficient, it can absorb short message peaks, but not continuously high rates.
Various Sample Sizes
For a measurement over various sample sizes, the rate was fixed to 25 kHz and the messages contained 1 to 64 values per sample, resulting in messages of 32 B to 536 B for the RC and 74 B to 576 B for the UD type. Messages of 188 B or smaller were sent inline by the HCAs used. In Fig. 8.14 an increasing median latency can be seen when the message size exceeds about 128 B, which is in accordance with the findings presented in [MR12]. Furthermore, the variability of the latencies with the UD type is higher than with the RC type. Moreover, the RC type shows lower latencies than the UD type, which can be explained by the addition of the remote node's address handle (AH) to every send WR and of the GRH to every message, both of which are not needed with the RC type.
Figure 8.13: Median latencies tlat over various sample rates
Various Generation Rates and Sample Sizes
The results of a combined measurement with various generation rates and sample sizes are shown in Fig. 8.15, for the RC service type only. The findings of both previous measurements are reflected in this overall measurement diagram. The percentage of missed steps is also shown in Fig. 8.15 and
Figure 8.14: Median latencies tlat over various sample sizes
[Surface plot over rate and number of values per sample; annotated min tlat: 1.706 µs, max tlat: 4.915 µs; percentage of samples missed by the signal generator shown per configuration]
Figure 8.15: Median latencies tlat over various sample rates and sample sizes
colored in red if the signal node missed more than 10 % of the steps. With these results, the data rate T can be calculated as
T = (1 − pmissed / 100 %) · ssample · fsignal, (8.1)
with pmissed being the percentage of missed samples, ssample the size of a sample, and fsignal the sample generation rate. In the measurements the data rate was approximately 20 MiB/s, which shows that the file node was not able to process large amounts of data.
8.4.2 InfiniBand vs. Zero-Latency Node-Type

For the comparison of the IB node-type with a zero-latency node-type, the shmem node-type was chosen, as it utilizes the POSIX shared-memory API for the communication between VILLAS nodes; the latency between two shmem nodes therefore corresponds to the access latency of the shared-memory region used by both of them. Again, 250'000 samples were sent at rates between 100 Hz and 100 kHz, each containing 8 random 64 bit floating-point numbers.
Figure 8.16 shows the results for both node-types. The latency differences between the node-types can be assumed to be caused by the IB communication. Both the median latency of the IB node and that of the shmem node decreased with higher frequencies. Therefore, this effect cannot be caused by the PCI-e bus or the IB node implementation itself.
[Plot; annotated percentages of samples missed by the signal generator per rate – infiniband (RC): 8.03 %, 3.72 %, 0.03 %, 0.08 %, 0.13 %, 0.25 %, 0.49 %; shmem: 8.03 %, 3.71 %, 0.03 %, 0.04 %, 0.07 %, 0.14 %, 0.28 %]
Figure 8.16: Median latencies tlat of IB vs. shmem node-type over various sample rates
Furthermore, the IB node missed only a negligible number of steps more than the shmem node. This implies that the write function of the IB node returned fast enough and did not influence the signal generation too much. With median latencies of around 2.5 µs, transmission rates up to ~400 kHz could be possible.
8.4.3 InfiniBand vs. Existing Server-Server Node-Types
One reason for the integration of IB into VILLASnode was the lack of a hard real-time capable server-server node-type. Therefore, this section compares the IB and shmem node-types with existing node-types commonly used for server-server communication, zeromq and nanomsg. Once more, 250'000 samples were sent at rates between 100 Hz and 100 kHz, each containing 8 random 64 bit floating-point numbers.
Loopback vs Physical Link
First, the loopback IP 127.0.0.1 was used for the IP-based node-types zeromq and nanomsg. Afterward, the measurements were repeated on a physical link, which for IP-based node-types is usually based on Ethernet technology. However, to avoid skewing the results by another link technology such as Ethernet, the IB HCAs were also used as the physical link for the communication between the IP-based node-types. This was realized with the Internet Protocol over InfiniBand (IPoIB) driver, which provides an IP-based interface that can be used by the IP-based node-types.
Figure 8.17 shows the results for the IP-based node-types. For rates below 25 kHz there were no relevant latency deviations between the loopback and the physical link. Above 25 kHz the latencies on the physical links increased, especially with zeromq. The percentage of missed steps for 100 Hz and 2500 Hz was the same for both IP-based nodes as for the IB and shmem nodes, again indicating the TSC to be the reason.
In Fig. 8.18 the results of the IP-based nodes on physical links are compared to those of the IB and shmem nodes. It can be seen that the latencies of the hard real-time capable IB node were at least one order of magnitude lower than those of the IP-based node-types, which are soft real-time capable only. Also, the variability of the latencies in case of IB was very low compared to the IP-based types, especially for rates above 25 kHz, where the IP-based types showed increasing latency magnitudes.
8.5 Conclusion and Outlook
The results presented in this chapter show that the integration of InfiniBand in the VILLAS framework enables the transmission of samples at relatively high rates, with latencies of a few microseconds, and under hard real-time requirements. These low latencies were achieved by a strict compliance with the principles of VIA, such as zero-copy, and by the utilization of InfiniBand's capabilities to initiate data transmissions without using system calls or
Figure 8.17: Median latencies of nanomsg and zeromq node-type over various sample rates via loopback (lo) and physical link
Figure 8.18: Median latencies of nanomsg and zeromq via physical links as well as of the IB and shmem node-types over various sample rates
the active participation of a CPU. For this reason, the IB node-type can also be adapted for other VIA-based interfaces.
While for small messages at high rates the IB node-type showed median latencies of around 1.7 µs, the median latencies for larger message sizes at low rates were around 4.9 µs. Compared to the – almost zero-latency – shmem node-type, the median latencies were only 1.5–2.5 µs higher, which is of high value in the area of real-time processing, as shmem allows communication only between the nodes of a shared-memory system, which are typically located on the same computer. Moreover, existing VILLAS node-types for communication among different systems over IP showed median latencies that were one to two orders of magnitude larger than in case of IB. The latter can furthermore be used for much higher sample rates.
With the IB node-type, VILLASnode can be used for the hard real-time capable coupling of simulators running on conventional and inexpensive computer hardware in academia and industry. Moreover, in the future, HiL setups are possible where devices to be connected to the computer host running the simulation are equipped with an InfiniBand TCA for low-latency data transfers between the device and the simulation. The same setup can be used for real-time operation.
The IB node-type implementation could be further improved for real-time processing with the aid of an RT_PREEMPT-capable Linux kernel. Further performance analyses, e. g., based on a profiling of the node-type's read and write functions, could be carried out for a code optimization leading to even lower latencies.
9 Conclusion
In the following, the conclusions of all previous chapters are summarized and discussed.
9.1 Summary and Discussion
This dissertation presents various methods from high-performance computing (HPC) in support of power system simulation. These methods shall help other power system simulation users and developers in their undertakings. Therefore, all presented approaches were implemented in open-source software projects.
In Chapter 2 a data model for multi-domain smart grid topologies, based on the Common Information Model (CIM), was presented. CIM was chosen despite lacking all object classes for communication networks, many classes for energy markets, and some classes for power grids. The CIM extensions thus needed can lead to diverging extensions by different organizations. Of course, CIM is not the only possible information model, but it provides the largest well-specified subset of the classes needed for a holistic representation of smart grid topologies, from a high to a relatively detailed level. Moreover, the CIM User Group (CIMug) extends CIM with new classes to achieve an increasingly holistic model of smart grid topologies. The developed SINERGIEN_CIM data model was furthermore used for the successful validation of the automated CIM (de-)serializer generation in Chap. 3 with ontologies that extend CIM, ensuring a general applicability of the approach.
Chapter 9 Conclusion
Chapter 3 introduced an approach for an automated (de-)serializer generation based on a UML to C++ code generation, a subsequent code adaption plus extension, and a template-based (un-)marshalling code generation with the aid of a C++ compiler front-end. Instead of making use of a UML editor such as Enterprise Architect (EA), one could alternatively save the UML specification in an open document format such as XML Metadata Interchange (XMI). Then a code generator (to be developed) could directly generate (un-)marshalling code by traversing the XMI document. This would make the code adaption as well as the compiler front-end processing unnecessary. In fact, this procedure is currently applied for the integration of the Common Grid Model Exchange Standard (CGMES) into CIM++. However, instead of an XMI document representing the CGMES UML specification, the generator for CGMES (un-)marshalling code for CIM++ makes use of Resource Description Framework Schema (RDFS) documents, which define the structure of a concrete RDF-based document type. Furthermore, the compilation of more than a thousand CIM classes into libcimpp is not needed for each project using it. To reduce its size, e. g. for the application in embedded systems with a very limited main memory and program storage, an approach will be implemented which makes it possible to choose a certain subset of CIM classes. Of course, all superclasses of a given subset must be automatically integrated into libcimpp as well. Despite all that, libcimpp is already utilized not only in Institute for Automation of Complex Power Systems (ACS) software projects but also by a Swiss and a Czech company, and potentially by other GitHub users who remain anonymous. This indicates that it is usable not only in academia but also in enterprise applications.
Chapter 4 presents a template-based translation method from CIM to simulator-specific system models as implemented in the CIMverter project. One could argue that the template-based approach is too inflexible in comparison to an approach based on a domain-specific language (DSL), as more complex mappings must be implemented in C++. A contrary indication is that further component translations from CIM to Modelica required hardly any or even no changes in the C++ Modelica Workshop. Also the integration of the DistAIX system model format in CIMverter was accomplished in a couple of person-days, as it could be performed mainly with new templates. Besides these examples, it must be assumed that complex mappings would also require a comprehensive DSL. This would lead to higher efforts while learning the DSL and implementing it in CIMverter. Furthermore, the presented mapping from CIM to simulator-specific system models covers so-called bus-branch models only. Therefore, CIMverter is not able to handle node-breaker models, but this follows the
chosen UNIX philosophy of developing one program for one task [MPT78; Ray03], as a node-breaker model does not provide all the information needed for a final system model. In fact, it provides a set of topologies which can differ depending on the configuration of the breakers. Hence, the mapping of a node-breaker model to a bus-branch model should be handled by a separate topology processor with respect to the given breaker configuration.
Chapter 5 presents modern LU decomposition methods for circuit simulation that have been parallelized for current parallel processors. It shows a comparative analysis with the state-of-the-art LU decomposition KLU, based on benchmark matrices as well as on real simulations performed by Dynaωo. It could be regarded as a disadvantage that only the solving of linear systems was considered, although in power system simulation usually non-linear systems are solved. One reason is that solving a non-linear system is usually implemented by linearization and a subsequent solving of a linear system. Furthermore, in case of large-scale power grid models, Réseau de Transport d'Électricité (RTE), the French transmission system operator (TSO), found out that during the solving of DAEs (e. g. with the aid of IDA) most of the CPU time is spent in the LU factorization (i. e. KLU). RTE and other partners of the PEGASE research project conducted a comprehensive analysis on different solvers. The outcomes of this analysis are another reason why only LU decompositions, and thus direct solvers, have been analyzed and no iterative ones.
Chapter 6 introduces a new type of approach, the automatic fine-grained parallelization of mathematical models, implemented in the new power grid simulator DPsim. The approach exploits parallelism in mathematical power system models to make use of multi-core processors with shared-memory architectures, which are common in today's computer systems. The MNA solver itself, which is based on the SparseLU method of the Eigen library implementing the supernodal LU factorization [Sup] for sparse non-symmetric systems of linear equations, has not been improved. Obviously, at this point other LU decompositions could also be integrated and improved, analogously to the work already done for OpenModelica and Dynaωo in Chap. 5. This would improve the performance of the task processing itself, in addition to the implemented parallel processing of the tasks, which was the goal of this work. From an HPC point of view, the power grid simulations performed, e. g., by OpenModelica, Dynaωo, and DPsim never led, because of the sparsity of the linear systems, to matrices large enough for an efficient use of distributed-memory systems or even supercomputers. Even in case of large-scale static phasor power grid simulations with more than 7500 nodes, the matrices had indeed a size of 200000 × 200000, but with up to 700000 nonzeros their memory consumption was only around 5 MiB.
Hence, no parallelization approaches for distributed-memory architectures were needed or implemented. This could change in case of large-scale dynamic simulations, but such simulations have not been considered yet.
Chapter 7 addresses different approaches for increasing the performance of Python programs. After an introduction to the ideas and internals of the Python runtime environments implementing these approaches, a comparative analysis based on algorithms from different algorithm classes is presented. The analysis helps programmers to understand how to adapt Python programs to achieve a better runtime performance in a certain environment, for instance based on just-in-time (JIT) compilation. Thus, it also helps to estimate the efforts and benefits of the development. The analysis was mainly focused on sequential processing, with shared-memory parallelization by multithreading only, because achieving better performance precisely for sequential Python code was the focus of the analysis. There are also other programming languages based on JIT compilation, such as Julia. Julia's syntax is similar to MATLAB and Python, and it provides memory management, making it easy to learn for programming beginners. However, Julia was developed as a language for scientific computing, and Python is much more popular in engineering.
Chapter 8 presents the implementation of InfiniBand (IB) support in the VILLASframework for Hardware-in-the-Loop (HiL) setups and the real-time (RT) coupling of simulators. The implemented IB communication shows transmission latencies that are one order of magnitude lower than the corresponding latencies of Internet Protocol (IP) based communication with nanomsg and zeromq. Furthermore, the IB latencies are only slightly higher than in case of a shared-memory based data exchange, which is limited to the same computer host. The InfiniBand latencies also show a very low variability, which is important for RT requirements. Even though InfiniBand based communication is not suitable for wide area networks (WANs), distances above 15 m can be bridged by active fiber optic cables with lengths of hundreds of meters. Therefore, with InfiniBand interconnects, a widely-used HPC network technology can be applied for hardware-server and server-server communication, even with hard RT requirements, via the VILLASnode software gateway for simulation data exchange.
Taken as a whole, it can be concluded that the work presented in this dissertation already improved or can improve the performance of different (co-)simulation environments. Furthermore, it enables the use of CIM topologies in different power system simulators, allowing the simulation of large-scale real-world power systems. Also, many findings and approaches can be used for improving further software from the area of electrical engineering and beyond. The implemented open-source
software projects can be used and improved by scientists and developers in academia and the private sector.
9.2 Outlook
Some concrete improvements of the developed concepts and approaches have already been mentioned in the discussion above. One of today's most important goals in the area of HPC for smart grid simulation is a solver for the linear systems of equations arising during simulations performed by power grid simulation environments which scales with the cores of modern multi-core processors. At least in case of state-of-the-art steady-state simulations, it has been seen that there is no need for parallel architectures with distributed memory. Workstations and servers with a shared-memory architecture can cope with steady-state simulations, but larger and increasingly complex system models, based on component model improvements, more elaborate models of new equipment, and more grid nodes, require an efficient utilization of the available computer hardware. Hence, if simulations with more complex system models are not to run longer, the software must make use of new hardware developments. Therefore, more research and development on the power system simulation environments and their numerical back-ends is needed to make use of the wider vector units of today's processors and of accelerators such as graphic processing units (GPUs) and field-programmable gate arrays (FPGAs).
Further scientific work related to HPC in power system simulation is also needed in the area of dynamic security assessment (DSA) based on dynamic grid simulation. In DSA systems, different scenario computations can be triggered by certain events such as the outage of grid equipment. Then, dynamic (n-1)-computations must be performed, which can provide information on the voltage stability, small-signal stability, and transient stability of the system. DSA systems can make use of expert systems, for example on the basis of neural networks, that can derive grid operation improvements from the mentioned analyses. Since the real-time requirements on DSA systems can be very challenging and dynamic computations can be very time-consuming, the application of high-throughput computing (HTC) on distributed-memory systems can be the method of choice, where HTC denotes a computing paradigm that focuses on the efficient execution of a large number of loosely-coupled tasks [Eur].
In the context of dynamic (n-1)-computations, a topology processor which generates bus-branch topologies from node-breaker models with a given breaker setting could be helpful. The (n-1)-computation control could make use of such a topology processor, providing all topologies to
be considered in case of the scenarios which have to be calculated for an event triggering DSA computations. These topologies could be used for the aforementioned stability calculations as well as for additional dynamic and static protection simulations.
Besides multithreading paradigms for shared-memory parallelization in Python programs, paradigms for distributed-memory parallelization, e. g. with the aid of the Message Passing Interface (MPI), could also be analyzed. The benchmarking of an MPI implementation itself should not be performed by examining the performance of a set of MPI based applications but more systematically: for different kinds of communication operations (e. g. individual, collective, one-sided, etc.), for various communication patterns (one-to-one, one-to-many, all-to-all), and for multiple communication modes. For this purpose, an approach such as the one implemented in the special Karlsruher MPI-benchmark (SKaMPI) could be followed, which performs various measurements of different MPI functions in a customizable way.
A Code Listings
A.1 Exploiting Parallelism in Power Grid Simulation
Listing A.1: step method of the OpenMP-based level scheduler
void OpenMPLevelScheduler::step(Real time, Int timeStepCount) {
    size_t i = 0, level = 0;

    #pragma omp parallel shared(time, timeStepCount) \
        private(level, i) num_threads(mNumThreads)
    for (level = 0; level < mLevels.size(); level++) {
        #pragma omp for schedule(static)
        for (i = 0; i < mLevels[level].size(); i++) {
            mLevels[level][i]->execute(time, timeStepCount);
        }
    }
}
B Python Environment Measurements
B.1 Execution Times
[Plot of time in s over number of nodes; compared environments: C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), PyPy3 (Pure Python), PyPy2 (Pure Python)]
Figure B.1: Execution times for AVL Tree Insertion
[Plot of time in s over number of nodes; compared environments: C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), PyPy2 (Pure Python)]
Figure B.2: Execution times for Dijkstra
[Plot of time in s over size of matrices; compared environments: C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), PyPy2 (Pure Python)]
Figure B.3: Execution times for Gauss-Jordan Elimination
B.2 Memory Space Consumption
[Plot of heap peak in MB over size of matrices; compared environments: C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), PyPy2 (Pure Python)]
Figure B.4: Memory consumption (maximum heap peak) for Cholesky
[Plot of heap peak in MB over size of matrices; compared environments: C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), PyPy2 (Pure Python)]
Figure B.5: Memory consumption (maximum heap peak) for Gauss-Jordan Elimination
[Plot of heap peak in MB over size of matrices; compared environments: C++, CPython3, Cython (Pure Python), Cython (Optimized), Numba + NumPy]
Figure B.6: Memory consumption (maximum heap peak) for Gauss-Jordan Elimination of selected runtime environments
[Plot of heap peak in MB over size of matrices; compared environments: C++, CPython3, CPython2, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy, PyPy3 (Pure Python), PyPy2 (Pure Python)]
Figure B.7: Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication
[Plot of heap peak in MB over size of matrices; compared environments: C++, CPython3, Cython (Pure Python), Cython (Optimized), Cython, PyPy (Optimized), Numba + NumPy]
Figure B.8: Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication of selected runtime environments
List of Acronyms
ACS Institute for Automation of Complex Power Systems 4, 97, 186
ADT abstract data type 128
AH address handle 179
AMD Approximate Minimum Degree 80
AOT ahead-of-time 130
API Application Programming Interface 109, 159
ARM Advanced RISC Machines 6
AST abstract syntax tree 33, 58
AVL Adelson-Velsky and Landis 148
AVX Advanced Vector Extensions 125
BDF backward differentiation formula 75
BSD Berkeley Software Distribution 159
BTF block triangular form 80
CA Channel Adapter 161
CASE Computer-Aided Software Engineering 53
CFG control flow graph 138
CGMES Common Grid Model Exchange Standard 186
CiL Control-in-the-Loop 5
CIM Common Information Model iv, viii, 8, 14, 31, 55, 75, 97, 185
CIMug CIM User Group 30, 185
CLI command line interface 62
CM communication manager 178
COLAMD Column Approximate Minimum Degree 81
CP critical path 106
CPU central processing unit 6, 80, 100, 129, 160
CQ Completion Queue 160
CQE Completion Queue Entry 164
CSV comma-separated values 177
DAE differential-algebraic system of equations 9, 75
DAG directed acyclic graph 102
DER distributed energy resource 4, 13
DES discrete event simulation 16
DEVS Discrete Event System Specification 17
DistAIX Distributed Agent-Based Simulation of Complex Power Systems 56
DMA Direct Memory Access 165
DMS distribution management system 16
DOM Document Object Model 31
DP dynamic phasor 97
DPsim Dynamic Phasor Real-Time Simulator 97
DRTS digital real-time simulator 5, 157
DSA dynamic security assessment 3, 189
DSL domain-specific language 56, 186
DSO distribution system operator 11, 15
DUFunc dynamic universal function 141
EA Enterprise Architect 186
EHV extra high voltage 84
EMS energy management system 15
EMT electromagnetic transient simulation 4
ENTSO-E European Network of Transmission System Operators for Electricity 63
FIFO first in – first out 164
FPGA field-programmable gate array 6, 189
GC garbage collector 134
GD-RTS geographically distributed real-time simulation 5, 157
GIL global interpreter lock 135
GMRES Generalized Minimal Residual Algorithm 78
GP Gilbert/Peierls' 81
GPU graphic processing unit 6, 76, 99, 189
GRH Global Routing Header 178
HCA Host Channel Adapter . . . . . . 161
HiL Hardware-in-the-Loop . . . . . . iv, viii, 5, 97, 157, 188
List of Acronyms
HLFET Highest Level First with Estimated Times . . . . . . 106, 210
HLFNET Highest Level First with No Estimated Times . . . . . . 106, 110
HLL high-level language . . . . . . 129
HPC high-performance computing . . . . . . vii, 6, 97, 185
HT Hyper-Threading . . . . . . 85, 112, 150, 175
HTC high-throughput computing . . . . . . 189
HV high voltage . . . . . . 84
HW hardware . . . . . . 158
IB InfiniBand . . . . . . 10, 26, 157, 188
IBA InfiniBand Architecture . . . . . . 161, 211
IBTA InfiniBand Trade Association . . . . . . 161
ICT information and communications technology . . . . . . vii, 1, 13
IEC International Electrotechnical Commission . . . . . . 29
IP Internet Protocol . . . . . . 158, 188
IPC inter-process communication . . . . . . 159
IPoIB Internet Protocol over InfiniBand . . . . . . 182
iPSL iTesla Power System Library . . . . . . 75
IR intermediate representation . . . . . . 142
ISO International Organization for Standardization . . . . . . 29
IVP initial value problem . . . . . . 76
JIT just-in-time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10, 127, 188
LAN local area network . . . . . . 158
LSE linear system of equations . . . . . . 9
MD minimum degree . . . . . . 79
MDA model-driven architecture . . . . . . 30
MNA modified nodal analysis . . . . . . 97
MPI Message Passing Interface . . . . . . 125, 156, 190
MPS ModPowerSystems . . . . . . 75
MQTT Message Queue Telemetry Transport . . . . . . 158
MTU maximum transmission unit . . . . . . 162
ND nested dissection . . . . . . 79
NIC network interface controller . . . . . . 159
NR Newton-Raphson . . . . . . 94
NUMA non-uniform memory access . . . . . . 176
ODE ordinary differential equation . . . . . . . . . . . . . . . . . . . . . . . 3, 98
OFED OpenFabrics Enterprise Distribution . . . . . . 165
OMG Object Management Group . . . . . . 51
OOP object-oriented programming . . . . . . 36
OS operating system . . . . . . 159
PC program counter . . . . . . 139
POSIX Portable Operating System Interface . . . . . . 41, 135, 159
QoS quality of service . . . . . . 158
QP Queue Pair . . . . . . 162
QVT Query/View/Transformation . . . . . . 51
RC Reliable Connection . . . . . . 162
RD Reliable Datagram . . . . . . 162
RDF Resource Description Framework . . . . . . 16, 31
RDFS Resource Description Framework Schema . . . . . . 186
RDMA Remote Direct Memory Access . . . . . . 161
RPython Restricted Python . . . . . . 136
RQ Receive Queue . . . . . . 163
RT real-time . . . . . . 4, 85, 97, 157, 188
RTE Réseau de Transport d’Électricité . . . . . . 75, 187
RTTI runtime type information . . . . . . 62
SAX Simple API for XML . . . . . . 33
SCADA supervisory control and data acquisition . . . . . . 15
SGAM Smart Grid Architecture Model . . . . . . 14
sge scatter/gather element . . . . . . 165
SiL Software-in-the-Loop . . . . . . 5
SIMD single instruction multiple data (stream) . . . . . . 125, 143
SKaMPI special Karlsruher MPI-benchmark . . . . . . 190
SL simplified load . . . . . . 84
SQ Send Queue . . . . . . 163
SSA steady-state security assessment . . . . . . 3
STL Standard Template Library . . . . . . 36, 145
SUNDIALS SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers . . . . . . 76, 98
SW software . . . . . . 158
TCA Target Channel Adapter . . . . . . 161
TCP Transmission Control Protocol . . . . . . 26, 160
TJIT tracing just-in-time . . . . . . 130
TLM Transmission Line Modeling . . . . . . 99
TSC time-stamp counter . . . . . . 178
TSO transmission system operator . . . . . . 4, 187
UC Unreliable Connection . . . . . . 162
UD Unreliable Datagram . . . . . . 162
UDP User Datagram Protocol . . . . . . 160
ufunc universal function . . . . . . 141
UML Unified Modeling Language . . . . . . 8, 19, 30
VDL voltage dependent load . . . . . . 84
VI virtual interface . . . . . . 160
VIA virtual interface architecture . . . . . . 160
VL Virtual Lanes . . . . . . 165
VM virtual machine . . . . . . 134
VPP virtual power plant . . . . . . 2, 21
WAN wide area network . . . . . . 188
WQE Work Queue Element . . . . . . 163
WR Work Request . . . . . . 163
WSCC Western System Coordinating Council . . . . . . 112, 210
XMI XML Metadata Interchange . . . . . . 31, 186
XML Extensible Markup Language . . . . . . 32
Glossary
barrier A barrier is a synchronization primitive, e. g. among a set of threads or processes: each thread or process of the set in question must execute all instructions before the barrier before any of them continues with the instructions after the barrier. . . . . . 104
driver In the context of numerical software (i. e. not a hardware driver): a program that applies numerical methods, e. g. implemented in libraries linked to the program, with all needed initializations and parameters to a particular problem to be solved. . . . . . 58, 83
flat model The semantics of the Modelica language is specified by means of a set of rules for translating any class described in the Modelica language to a flat Modelica structure (i. e. a flat model). A class must have additional properties in order that its flat Modelica structure can be further transformed into a set of differential, algebraic and discrete equations (i. e. a flat hybrid DAE). [Mod] . . . . . . 58
Modelica Modelica is a free object-oriented multi-domain modelinglanguage for component-oriented modeling. . . . . . . . 16, 98
OpenModelica An open-source Modelica-based modeling and simulation environment intended for industrial and academic usage. . . . . . 12, 76, 98
pivoting The pivot element of a row or column of a matrix is the element selected first by an algorithm (e. g. during a Gaussian elimination) before a certain calculation step. Finding this element is called pivoting. In Gaussian elimination with pivoting, usually the element with the highest absolute value is chosen. . . . . . 78
preordering Computation of permutation matrices which are applied to the matrix to be factorized before the actual factorization step. . . . . . 78
race condition A race condition is a situation in which the result of concurrently executed program statements depends on the (uncontrollable) execution order of the CPU instructions belonging to those statements. . . . . . 136
thread A thread (of execution) is a set of instructions associated with a process. A multi-threaded process has multiple threads. If the computer system allows these threads to run concurrently, the process can benefit from higher computational power. . . . . . 81, 100
thread-safe A part of a program is thread-safe if multiple threads can execute it concurrently and always produce results as if the threads had executed it in a sequential order (i. e. one thread executes the whole part, then the next thread executes the whole part, and so forth until all threads have finished). The sequential order can vary between executions of the program. . . . . . 136
wall clock time The wall clock time is the time that elapses in reality during the measured process. . . . . . 84, 111, 150
List of Figures
1.1 Contribution overview of this work . . . . . . . . . . . . . 7
2.1 Exemplary topology including components of (1) all domains and (2) domain-specific topologies . . . . . . 19
2.2 Inter-domain connections between classes of power grid,communication network and market . . . . . . . . . . . . 20
2.3 Communication network class association example . . . . . . 22
2.4 Overall SINERGIEN architecture for simulation setup . . . . . . 23
2.5 Synchronization scheme of simulators at co-simulation time steps . . . . . . 24
2.6 Scheme of runtime interaction between co-simulation components . . . . . . 25
3.1 Overall concept of the CIM++ project . . . . . . 34
3.2 UML diagram of the HydroPowerPlant class, whose instances can be associated with no more than one Reservoir instance . . . . . . 36
3.3 UML diagram of the class MyASTVisitor . . . . . . 38
3.4 Section of the collaboration diagram for BatteryStorage generated by Doxygen from the automatically adapted CIM C++ codebase. The entire diagram can be found in [FEIb] . . . . . . 52
4.1 Template engine example with HTML code . . . . . . 59
4.2 Overall concept of the CIMverter project . . . . . . 60
4.3 Mapping at second level between CIM and Modelica objects . . . . . . 63
4.4 Connections with zero, one, and two middle points between the endpoints. The endpoints are marked with circles . . . . . . 68
4.5 Medium-voltage benchmark grid [Rud+06] converted from CIM to a system model in Modelica based on the ModPowerSystems and PowerSystems library . . . . . . 72
5.1 Sparsity patterns of benchmark matrices . . . . . . 86
5.2 Total (preprocessing + factorization) times . . . . . . 87
5.3 Preprocessing times . . . . . . 88
5.4 Factorization times . . . . . . 88
5.5 Execution times on generic vs. RT kernel . . . . . . 89
5.6 (Re-)factorization times . . . . . . 90
5.7 NICSLU’s scaling over multiple threads (T) . . . . . . 91
5.8 Basker’s scaling over multiple threads (T) . . . . . . 91
5.9 Total times with different preorderings . . . . . . 92
5.10 Factorization times with different preorderings . . . . . . 93
6.1 Categories of parallel task scheduling . . . . . . 101
6.2 Example task graph . . . . . . 102
6.3 Example task graph including levels . . . . . . 104
6.4 Schedule for task graph in Fig. 6.2 with p = 2 using level scheduling . . . . . . 104
6.5 Schedule for task graph in Fig. 6.2 with p = 2 using level scheduling considering execution times . . . . . . 105
6.6 Example task graph including b-levels . . . . . . 106
6.7 Schedule for task graph in Fig. 6.2 with p = 2 using Highest Level First with Estimated Times (HLFET) . . . . . . 107
6.8 Example circuit . . . . . . 108
6.9 Task graph resulting from Fig. 6.8 . . . . . . 108
6.10 Western System Coordinating Council (WSCC) 9-bus transmission benchmark network . . . . . . 113
6.11 Schematic representation of the connections between system copies . . . . . . 114
6.12 Performance comparison of schedulers for the WSCC 9-bus system . . . . . . 114
6.13 Performance comparison of schedulers for 20 copies of the WSCC 9-bus system . . . . . . 115
6.14 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system . . . . . . 116
6.15 Task graph for simulation of the WSCC 9-bus system . . . . . . 117
6.16 Performance for a varying number of copies of the WSCC 9-bus system using the decoupled line model . . . . . . 119
6.17 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using the decoupled line model with 8 threads . . . . . . 120
6.18 Performance for a varying number of copies of the WSCC 9-bus system using diakoptics . . . . . . 121
6.19 Performance comparison of schedulers for a varying number of copies of the WSCC 9-bus system using diakoptics with 8 threads . . . . . . 122
6.20 Performance comparison of compilers for 20 copies of the WSCC 9-bus system . . . . . . 123
7.1 NumPy ndarray vs. Python list [Van] . . . . . . 133
7.2 Software architecture of CPython (python command) . . . . . . 135
7.3 Software architecture of PyPy (pypy command) . . . . . . 137
7.4 Numba compilation stages . . . . . . 143
7.5 Comparison of Cython with other programming languages . . . . . . 144
7.6 Cython’s workflow for Python module building [Dav] . . . . . . 146
7.7 Execution times for Quicksort . . . . . . 152
7.8 Memory consumption (maximum heap peak) for Quicksort . . . . . . 153
7.9 Memory consumption (maximum heap peak) for Quicksort of selected runtime environments . . . . . . 154
7.10 Execution times for PI calculations with multiple threads . . . . . . 155
8.1 Overview of the VILLASframework . . . . . . 159
8.2 Network stack of the InfiniBand Architecture (IBA) . . . . . . 162
8.3 InfiniBand Architecture (IBA) model . . . . . . 163
8.4 InfiniBand data transmission example . . . . . . 164
8.5 An overview of the OFED stack . . . . . . 166
8.6 An example super-node with three paths connecting five nodes of different node-types . . . . . . 168
8.7 General read function working principle in VILLAS . . . . . . 169
8.8 InfiniBand node-type read function working principle . . . . . . 170
8.9 VILLASnode state diagram with newly implemented states . . . . . . 172
8.10 Components of InfiniBand node-type . . . . . . 173
8.11 Computer system with NUMA architecture used for measurements . . . . . . 177
8.12 VILLASnode node-type benchmark working principle . . . . . . 177
8.13 Median latencies tlat over various sample rates . . . . . . 179
8.14 Median latencies tlat over various sample sizes . . . . . . 180
8.15 Median latencies tlat over various sample rates and sample sizes . . . . . . 180
8.16 Median latencies tlat of IB vs. shmem node-type over various sample rates . . . . . . 181
8.17 Median latencies of nanomsg and zeromq node-type over various sample rates via loopback (lo) and physical link . . . . . . 183
8.18 Median latencies of nanomsg and nanomsg via physical links as well as IB and shmem node-type over various sample rates . . . . . . 183
B.1 Execution times for AVL Tree Insertion . . . . . . 195
B.2 Execution times for Dijkstra . . . . . . 196
B.3 Execution times for Gauss-Jordan Elimination . . . . . . 196
B.4 Memory consumption (maximum heap peak) for Cholesky . . . . . . 197
B.5 Memory consumption (maximum heap peak) for Gauss-Jordan Elimination . . . . . . 198
B.6 Memory consumption (maximum heap peak) for Gauss-Jordan Elimination of selected runtime environments . . . . . . 198
B.7 Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication . . . . . . 199
B.8 Memory consumption (maximum heap peak) for Matrix-Matrix Multiplication of selected runtime environments . . . . . . 199
List of Tables
4.1 CIM PowerTransformer to Modelica Workshop Transformer mapping . . . . . . 67
4.2 Excerpt of further important mappings from CIM to ModPowerSystems as implemented in the Modelica Workshop . . . . . . 68
4.3 Excerpt from the numerical results for node phase-to-phase voltage magnitude and angle regarding the medium-voltage benchmark grid . . . . . . 73
5.1 Characteristics of square matrices with size N × N, K nodes, sorted by nonzeros NNZ, and with density factor d = NNZ/(N·N) in % . . . . . . 84
5.2 Total execution times and numbers C of calls of the corresponding routines within the fixed time step solver, with Jacobian JF and residual function vector F . . . . . . 94
5.3 Accumulated execution times for the listed steps of the variable time step solver, with D LU decompositions and a factorization ratio f = #Fact./#Refact. . . . . . 95
6.1 Overview of the implemented schedulers . . . . . . 109
6.2 Overview of the tested compilers . . . . . . 124
Bibliography
[Abu+18] A. Abusalah et al. “CPU based parallel computation of electromagnetic transients for large power grids”. In: Electric Power Systems Research 162 (Sept. 2018), pp. 57–63.
[ACD74] T. L. Adam, K. M. Chandy, and J. Dickson. “A comparison of list schedules for parallel processing systems”. In: Communications of the ACM 17.12 (1974), pp. 685–690.
[ADD04] P. R. Amestoy, T. A. Davis, and I. S. Duff. “Algorithm 837: AMD, an Approximate Minimum Degree Ordering Algorithm”. In: ACM Trans. Math. Softw. 30.3 (Sept. 2004), pp. 381–388. issn: 0098-3500. doi: 10.1145/1024074.1024081.
[Adr19] Adrien Guironnet. GitHub - dynawo/dynawo. 2019. url: https://github.com/dynawo/dynawo (visited on 10/21/2019).
[AH11] D. Allemang and J. Hendler. Semantic Web for the WorkingOntologist: Effective Modeling in RDFS and OWL. Elsevier,2011.
[Aho03] A. Aho. Compilers: Principles, Techniques and Tools (for Anna University), 2/e. Pearson Education, 2003. isbn: 978-8-13176-234-9.
[AIA19] AIAitesla. GitHub - itesla/ipsl. 2019. url: https://github.com/itesla/ipsl (visited on 10/21/2019).
[Åke+10] J. Åkesson et al. “Modeling and optimization with Optimica and JModelica.org — Languages and tools for solving large-scale dynamic optimization problems”. In: Computers & Chemical Engineering 34.11 (2010), pp. 1737–1749.
[Ale01] A. Alexandrescu. Modern C++ design: generic programmingand design patterns applied. Addison-Wesley, 2001.
[Anaa] Anaconda, Inc. Notes on Numba Runtime. url: http://numba.pydata.org/numba-doc/dev/developer/numba-runtime.html (visited on 02/09/2020).
[Anab] Anaconda, Inc. Numba architecture. url: http://numba.pydata.org/numba-doc/dev/developer/architecture.html (visited on 02/10/2020).
[Anac] Anaconda, Inc. Numba: Compilation Options. url: http://numba.pydata.org/numba-doc/latest/user/jit.html#compilation-options (visited on 02/09/2020).
[Anad] Anaconda, Inc. Numba: Just-in-Time compilation. url: http://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html (visited on 02/09/2020).
[Anae] Anaconda, Inc. Numba: LoopJitting. url: http://numba.pydata.org/numba-doc/dev/developer/numba-runtime.html (visited on 02/09/2020).
[Anaf] Anaconda, Inc. Numba: Numbers. url: http://numba.pydata.org/numba-doc/latest/reference/types.html#numbers(visited on 02/09/2020).
[Anag] Anaconda, Inc. Numba: Supported NumPy features. url: http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html (visited on 02/10/2020).
[Anah] Anaconda, Inc. Numba: Supported Python features. url: http://numba.pydata.org/numba-doc/dev/reference/pysupported.html (visited on 02/10/2020).
[Anai] Anaconda, Inc. Numba: Why my loop is not vectorized? url:http://numba.pydata.org/numba-doc/dev/user/faq.html#does-numba-vectorize-array-computations-simd(visited on 02/10/2020).
[Anaj] Anaconda, Inc. Numba: Why my loop is not vectorized? url: http://numba.pydata.org/numba-doc/0.30.1/reference/envvars.html#compilation-options (visited on 02/10/2020).
[Apa] Apache Jena. Apache Jena - Home. url: http : / / jena .apache.org (visited on 12/23/2019).
[Aro06] P. Aronsson. “Automatic Parallelization of Equation-Based Simulation Programs”. PhD thesis. Institutionen för datavetenskap, 2006.
[BCP96] K. Brenan, S. Campbell, and L. Petzold. Numerical Solutionof Initial-Value Problems in Differential-Algebraic Equations.Classics in Applied Mathematics. Society for Industrial andApplied Mathematics, 1996. isbn: 9780898713534.
[BDD91] T. Berry, A. Daniels, and R. Dunn. “Real time simulation of power system transient behaviour”. In: 1991 Third International Conference on Power System Monitoring and Control. IET. 1991, pp. 122–127.
[Bea] D. Beazley. Understanding the Python GIL. url: http://www.dabeaz.com/python/UnderstandingGIL.pdf (visited on02/09/2020).
[Bec] Beckett, Dave. Redland RDF Libraries. url: http://librdf.org (visited on 12/23/2019).
[Bec10] D. Becker. “Harmonizing the International Electrotechnical Commission Common Information Model (CIM) and 61850”. In: Electric Power Research Institute (EPRI), Tech. Rep. 1020098 (2010).
[Beha] S. Behnel. Limitations – Cython 3.0a0 documentation. url:http://www.behnel.de/cython200910/talk.html (visitedon 02/10/2020).
[Behb] S. Behnel. Using Parallelism – Cython 3.0a0 documentation.url: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html (visited on 02/10/2020).
[Behc] S. Behnel. Using the Cython Compiler to write fast Pythoncode. url: http://www.behnel.de/cython200910/talk.html (visited on 02/10/2020).
[Beh+11] S. Behnel et al. “Cython: The Best of Both Worlds”. In:Computing in Science & Engineering 13.2 (2011), p. 31.
[Ben] J. Bennett. An introduction to Python bytecode. url: https://opensource.com/article/18/4/introduction-python-bytecode (visited on 02/09/2020).
[BL15] M. Barros and Y. Labiche. Search-Based Software Engineering: 7th International Symposium, SSBSE 2015, Bergamo, Italy, September 5-7, 2015, Proceedings. Lecture Notes in Computer Science. Springer International Publishing, 2015. isbn: 9783319221830.
[Bol+09] C. F. Bolz et al. “Tracing the Meta-Level: PyPy’s Tracing JIT Compiler”. In: Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems. ICOOOLPS ’09. Genova, Italy: Association for Computing Machinery, 2009, pp. 18–25. isbn: 9781605585413.
[Bos11] P. Bose. “Power Wall”. In: Encyclopedia of Parallel Computing.Ed. by D. Padua. Boston, MA: Springer US, 2011, pp. 1593–1608. isbn: 978-0-387-09766-4. url: https://doi.org/10.1007/978-0-387-09766-4_499 (visited on 12/24/2019).
[Bou13] J.-L. Boulanger. Static Analysis of Software: The Abstract Interpretation. John Wiley & Sons, 2013.
[Bra+97] T. Bray et al. “Extensible Markup Language (XML)”. In:World Wide Web Journal 2.4 (1997), pp. 27–66.
[Bre12] E. Bressert. SciPy and NumPy: An Overview for Developers.O’Reilly Media, 2012. isbn: 9781449361631. url: https://books.google.de/books?id=c-xzkDMDev0C.
[BRT16] J. D. Booth, S. Rajamanickam, and H. Thornquist. “Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts”. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). May 2016, pp. 673–682. doi: 10.1109/IPDPSW.2016.92.
[BS14a] B. Buchholz and Z. Styczynski. Smart Grids: Grundlagen und Technologien der elektrischen Netze der Zukunft. VDE-Verlag, 2014. isbn: 9783800735624.
[BS14b] B. M. Buchholz and Z. Styczynski. Smart grids – fundamentals and technologies in electricity networks. Vol. 396. Springer, 2014.
[Büt16] F. Bütow. Zeitgeistwandel: Vom Aufbruch der Neuzeit zum Aufbruch ins planetarische Zeitalter. Books on Demand, 2016. isbn: 9783734741074.
[BW12] A. Brown and G. Wilson. PyPy. The Architecture of OpenSource Applications. Creative Commons, 2012. isbn: 978-1-10557-181-7. url: http://aosabook.org/en/pypy.html(visited on 02/09/2020).
[Cao+15] J. Cao et al. “A flexible model transformation to link BIM with different Modelica libraries for building energy performance simulation”. In: Proceedings of the 14th IBPSA Conference. 2015.
[Cara] A. Carattino. Mutable and Immutable Objects. url: https://www.pythonforthelab.com/blog/mutable-and-immutable-objects/ (visited on 02/09/2020).
[Carb] C. Carey. Why Python is Slow: Looking Under the Hood |Pythonic Perambulations. url: https://github.com/cython/cython/wiki/enhancements-compilerdirectives (visitedon 02/10/2020).
[Cas13] F. Casella. “A Strategy for Parallel Simulation of Declarative Object-Oriented Models of Generalized Physical Networks”. In: Proceedings of the 5th International Workshop on Equation-Based Object-Oriented Modeling Languages and Tools; April 19; University of Nottingham; Nottingham; UK. 084. Linköping University Electronic Press. 2013, pp. 45–51.
[Cas19] S. Cass. “The 2018 Top Programming Languages”. In: IEEESpectrum (2019). url: https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages(visited on 12/10/2019).
[CDZ05] D. Crupnicoff, S. Das, and E. Zahavi. Deploying Quality ofService and Congestion Control in InfiniBand-based DataCenter Networks. Tech. rep. 2379. 2005.
[Cha+08] B. Chapman et al. Using OpenMP: Portable Shared Memory Parallel Programming. Scientific Computation Series. Books24x7.com, 2008. isbn: 9780262533027.
[Che+15] X. Chen et al. “GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling”. In: IEEE Transactions on Parallel and Distributed Systems 26.3 (Mar. 2015), pp. 786–795. doi: 10.1109/tpds.2014.2312199.
[Chu01] W. Chun. Core Python Programming. Vol. 1. Prentice HallProfessional, 2001.
[CIM] CIM User Group. Home - CIMug. url: http://cimug.ucaiug.org (visited on 12/22/2019).
[Cla] Clang community. Clang C Language Family Frontend forLLVM. url: https://clang.llvm.org (visited on 12/22/2019).
[Coh] O. Cohen. Is your Numpy optimized for speed? url: https://towardsdatascience.com/is-your-numpy-optimized-for-speed-c1d2b2ba515 (visited on 02/09/2020).
[Com97] Compaq, Intel, Microsoft. Virtual Interface Architecture Spec-ification. Version 1.0. Compaq, Intel, Microsoft. Dec. 1997.
[Con07] Congress, 110th United States. Energy Independence andSecurity Act of 2007. 2007. url: https://www.govinfo.gov/content/pkg/PLAW-110publ140/html/PLAW-110publ140.htm (visited on 10/21/2019).
[Cor+01] T. Cormen et al. Introduction To Algorithms. MIT Press,2001. isbn: 9780262032933.
[CRS+11] CRSA et al. D4.1: Algorithmic requirements for simulation of large network extreme scenarios. Tech. rep. PEGASE Consortium, 2011.
[Cum] M. Cumming. libxml++ – An XML Parser for C++. url: http://libxmlplusplus.sourceforge.net (visited on 12/23/2019).
[Cun] A. Cuni. PyPy Status Blog. url: https://morepypy.blogspot.com/2018/09/inside-cpyext-why-emulating-cpython-c.html (visited on 02/10/2020).
[Cun10] A. Cuni. “High performance implementation of Python for CLI/.NET with JIT compiler generation for dynamic languages”. PhD thesis. Dipartimento di Informatica e Scienze dell’Informazione, 2010.
[CWY12] X. Chen, Y. Wang, and H. Yang. “An Adaptive LU Factorization Algorithm for Parallel Circuit Simulation”. In: 17th Asia and South Pacific Design Automation Conference. Jan. 2012, pp. 359–364. doi: 10.1109/ASPDAC.2012.6164974.
[CWY13] X. Chen, Y. Wang, and H. Yang. “NICSLU: An AdaptiveSparse Matrix Solver for Parallel Circuit Simulation”. In:IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems 32.2 (Feb. 2013), pp. 261–274. doi:10.1109/tcad.2012.2217964.
[Cyt] Cython community. Cython: C-Extensions for Python. url:https://cython.org (visited on 12/11/2019).
[Dal] L. Dalcin. MPI for Python – MPI for Python 3.0.3 documen-tation. url: https://mpi4py.readthedocs.io/en/stable/(visited on 02/10/2020).
[Dav] DavidBrooksPokorny. Cython. url: https://en.wikipedia.org/wiki/Cython#/media/File:Cython_CPython_Ext_Module_Workflow.png (visited on 02/10/2020).
[Dav+04] T. A. Davis et al. “Algorithm 836: COLAMD, a Column Approximate Minimum Degree Ordering Algorithm”. In: ACM Trans. Math. Softw. 30.3 (Sept. 2004), pp. 377–380. issn: 0098-3500.
[Dav03] F. S. David. Model driven architecture: applying MDA toenterprise computing. 2003.
[Daw] Dawes, Beman. Filesystem Home - Boost.org. url: http://www.boost.org/libs/filesystem (visited on 12/23/2019).
[Deba] Debian Wiki team. Hugepages – Debian Wiki. url: https://wiki.debian.org/Hugepages (visited on 02/14/2020).
[Debb] A. Debrie. Python Garbage Collection: What It Is and HowIt Works. url: https://stackify.com/python-garbage-collection/ (visited on 02/09/2020).
[Die07] S. Diehl. Software visualization: visualizing the structure, behaviour, and evolution of software. Springer Science & Business Media, 2007.
[Dig] Digi International Inc. Python garbage collection. url: https://www.digi.com/resources/documentation/digidocs/90001537/references/r_python_garbage_coll.htm(visited on 02/09/2020).
[Din+18] J. Dinkelbach et al. “Hosting Capacity Improvement Unlockedby Control Strategies for Photovoltaic and Battery StorageSystems”. In: 2018 Power Systems Computation Conference(PSCC). IEEE. 2018, pp. 1–7.
[DK12] P. Dutta and M. Kezunovic. “Unified representation of dataand model for sparse measurement based fault location”. In:Power and Energy Society General Meeting, 2012 IEEE. IEEE.2012, pp. 1–8.
[DM95] F.-N. Demers and J. Malenfant. “Reflection in logic, func-tional and object-oriented programming: a short comparativestudy”. In: Proceedings of the IJCAI. Vol. 95. 1995, pp. 29–38.
[DMS14] E. B. Duffy, B. A. Malloy, and S. Schaub. “Exploiting theClang AST for Analysis of C++ Applications”. In: Proceedingsof the 52nd Annual ACM Southeast Conference. 2014.
Bibliography
[DP10] T. A. Davis and E. Palamadai Natarajan. “Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems”. In: ACM Trans. Math. Softw. 37.3 (Sept. 2010), 36:1–36:17. issn: 0098-3500. doi: 10.1145/1824801.1824814.
[DR08] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Springer-Lehrbuch. Springer Berlin Heidelberg, 2008. isbn: 9783540764939.
[DRM20] S. Dähling, L. Razik, and A. Monti. “OWL2Go: Auto-generation of Go data models for OWL ontologies with integrated serialization and deserialization functionality”. In: To appear in SoftwareX (2020).
[Dun+98] D. Dunning et al. “The Virtual Interface Architecture”. In: IEEE Micro 18.2 (Mar. 1998), pp. 66–76. issn: 0272-1732.
[Eat19] J. Eaton. GNU Octave. 2019. url: https://www.gnu.org/software/octave (visited on 11/25/2019).
[ecm19] Ecma International. Standard ECMA-404 – The JSON Data Interchange Syntax. 2019. url: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf (visited on 10/23/2019).
[ECT17] D. Efnusheva, A. Cholakoska, and A. Tentov. “A Survey of Different Approaches for Overcoming the Processor-Memory Bottleneck”. In: International Journal of Computer Science & Information Technology (2017). doi: 10.5121/ijcsit.2017.9214.
[EFF15] L. Exel, F. Felgner, and G. Frey. “Multi-domain modeling of distributed energy systems – The MOCES approach”. In: Smart Grid Communications (SmartGridComm), 2015 IEEE International Conference on. IEEE. 2015, pp. 774–779.
[Eig19] Eigen Developers. Eigen. Aug. 2019. url: http://eigen.tuxfamily.org (visited on 10/21/2019).
[Ell+01] J. Ellson et al. “Graphviz – open source graph drawing tools”. In: International Symposium on Graph Drawing. Springer. 2001, pp. 483–484.
[ENT] ENTSO-E. Common Information Model (CIM) – Model Exchange Profile 1. url: https://docstore.entsoe.eu/Documents/CIM_documents/Grid_Model_CIM/140610_ENTSO-E_CIM_Profile_v1_UpdateIOP2013.pdf (visited on 05/13/2018).
[ENT16] ENTSO-E. Common Grid Model Exchange Specification (CGMES) – Version 2.5. 2016. url: https://docstore.entsoe.eu/Documents/CIM_documents/IOP/CGMES_2_5_TechnicalSpecification_61970-600_Part%201_Ed2.pdf (visited on 11/21/2019).
[Eur] European Grid Infrastructure community. Glossary V1 - EGI Wiki. url: https://wiki.egi.eu/wiki/Glossary_V1#High_Throughput_Computing (visited on 12/24/2019).
[Fab+11] D. Fabozzi et al. “On simplified handling of state events in time-domain simulation”. In: Proc. of the 17th Power System Computation Conference PSCC. 2011.
[Far+15] M. O. Faruque et al. “Real-Time Simulation Technologies for Power Systems Design, Testing, and Analysis”. In: IEEE Power and Energy Technology Systems Journal 2.2 (2015), pp. 63–73.
[FC09] D. Fabozzi and T. V. Cutsem. “Simplified time-domain simulation of detailed long-term dynamic models”. In: 2009 IEEE Power & Energy Society General Meeting. IEEE, July 2009.
[FEIa] FEIN Aachen e. V. DistAIX – Scalable simulation of cyber-physical power distribution systems. url: https://fein-aachen.org/en/projects/distaix/ (visited on 12/26/2019).
[FEIb] FEIN Aachen e. V. Doxygen generated webpages of CIM++ Adapted CIM_SINERGIEN Codebase: BatteryStorage Class Reference. url: https://cim.fein-aachen.org/libcimpp/doc/IEC61970_16v29a_SINERGIEN_20170324//classSinergien_1_1EnergyGrid_1_1EnergyStorage_1_1BatteryStorage.html (visited on 12/23/2019).
[FEIc] FEIN Aachen e. V. Doxygen generated webpages of CIM++ Adapted CIM_SINERGIEN Codebase: PowerTransformer Class Reference. url: http://cim.fein-aachen.org/libcimpp/doc/IEC61970_16v29a_IEC61968_12v08/classIEC61970_1_1Base_1_1Wires_1_1PowerTransformer.html (visited on 05/31/2018).
[FEId] FEIN Aachen e. V. Doxygen generated webpages of CIM++ Adapted CIM_SINERGIEN Codebase. url: https://cim.fein-aachen.org/libcimpp/doc/IEC61970_16v29a_SINERGIEN_20170324/ (visited on 12/23/2019).
[FEIe] FEIN Aachen e. V. IEC61970 16v29a - IEC61968 12v08: Class List. url: https://cim.fein-aachen.org/libcimpp/doc/IEC61970_16v29a_IEC61968_12v08/annotated.html (visited on 12/26/2019).
[FEIf] FEIN Aachen e. V. VILLAS. url: https://villas.fein-aachen.org/website (visited on 02/14/2020).
[FEIg] FEIN Aachen e. V. VILLASframework: Node-types. url: https://villas.fein-aachen.org/doc/node-types.html (visited on 02/14/2020).
[FEI19a] FEIN Aachen e. V. CIM++. 2019. url: https://www.fein-aachen.org/projects/modpowersystems (visited on 11/21/2019).
[FEI19b] FEIN Aachen e. V. ModPowerSystems. 2019. url: https://www.fein-aachen.org/projects/modpowersystems (visited on 10/21/2019).
[Fin+08] R. Finocchiaro et al. “ETHOS, a generic Ethernet over Sockets Driver for Linux”. In: Proceedings of the 20th IASTED International Conference. Vol. 631. 017. 2008, p. 239.
[Fin+09a] R. Finocchiaro et al. “ETHOM, an Ethernet over SCI and DX Driver for Linux”. In: Proceedings of 2009 International Conference of Parallel and Distributed Computing (ICPDC 2009), London, UK. 2009.
[Fin+09b] R. Finocchiaro et al. “Low-Latency Linux Drivers for Ethernet over High-Speed Networks”. In: IAENG International Journal of Computer Science 36.4 (2009).
[Fin+10] R. Finocchiaro et al. “Transparent Integration of a Low-Latency Linux Driver for Dolphin SCI and DX”. In: Electronic Engineering and Computing Technology. Ed. by S.-I. Ao and L. Gelman. Dordrecht: Springer Netherlands, 2010, pp. 539–549. isbn: 978-90-481-8776-8. doi: 10.1007/978-90-481-8776-8_46.
[FM09] M. Foord and C. Muirhead. IronPython in Action. Manning Pubs Co Series. Manning, 2009. isbn: 9781933988337. url: http://www.voidspace.org.uk/python/articles/duck_typing.shtml#duck-typing (visited on 02/09/2020).
[FO08] H.-r. Fang and D. P. O’Leary. “Modified Cholesky algorithms: a catalog with new approaches”. In: Mathematical Programming 115.2 (Oct. 2008), pp. 319–349. issn: 1436-4646.
[Fou+07] L. Fousse et al. “MPFR: A Multiple-Precision Binary Floating-Point Library With Correct Rounding”. In: ACM Transactions on Mathematical Software (TOMS) 33.2 (2007), p. 13.
[Fow02] M. Fowler. Patterns of Enterprise Application Architecture. Addison-Wesley Longman Publishing Co., Inc., 2002.
[Fra19] Fraunhofer IEE and University of Kassel. pandapower. 2019. url: https://pandapower.readthedocs.io (visited on 11/25/2019).
[Fre+09] J. Fremont et al. “CIM extensions for ERDF information system projects”. In: Power & Energy Society General Meeting, 2009. PES’09. IEEE. IEEE. 2009, pp. 1–5.
[Fri+06] P. Fritzson et al. “OpenModelica – A free open-source environment for system modeling, simulation, and teaching”. In: Computer Aided Control System Design, 2006 IEEE International Conference on Control Applications, 2006 IEEE International Symposium on Intelligent Control, 2006 IEEE. IEEE. 2006, pp. 1588–1595.
[Fri15a] P. Fritzson. Principles of Object-Oriented Modeling and Simulation with Modelica 3.3: A Cyber-Physical Approach. Wiley, 2015. isbn: 9781118859162.
[Fri15b] P. A. Fritzson. Principles of Object-Oriented Modeling and Simulation with Modelica 3.3. 2nd ed. Hoboken: John Wiley & Sons, 2015.
[Fri16] J. Friesen. Java XML and JSON. Apress, 2016.
[FW14] R. Franke and H. Wiesmann. “Flexible modeling of electrical power systems – the Modelica PowerSystems library”. In: Proceedings of the 10th International Modelica Conference; March 10-12; 2014; Lund; Sweden. 096. Linköping University Electronic Press. 2014, pp. 515–522.
[Gag+19] F. Gagliardi et al. “The international race towards Exascale in Europe”. In: CCF Transactions on High Performance Computing (2019), pp. 1–11.
[GCC] GCC team. GCC, the GNU Compiler Collection. url: https://gcc.gnu.org (visited on 12/22/2019).
[GDD+06] D. Gašević, D. Djurić, and V. Devedžić. Model Driven Architecture and Ontology Development. Springer Science & Business Media, 2006.
[Geb+12] M. Gebremedhin et al. “A Data-Parallel Algorithmic Modelica Extension for Efficient Execution on Multi-Core Platforms”. In: Proceedings of the 9th International MODELICA Conference; September 3-5; 2012; Munich; Germany. 76. Linköping University Electronic Press; Linköpings universitet, 2012, pp. 393–404.
[Geo73] A. George. “Nested Dissection of a Regular Finite Element Mesh”. In: SIAM Journal on Numerical Analysis 10.2 (1973), pp. 345–363.
[Ger] German Aerospace Center (DLR). DLR – Simulation and Software Technology – 8th Workshop on Python for High-Performance and Scientific Computing. url: https://www.dlr.de/sc/en/desktopdefault.aspx/tabid-12954/22625_read-52397/ (visited on 12/11/2019).
[Gir] C. Giridhar. Understanding Python GIL. url: https://callhub.io/understanding-python-gil/ (visited on 02/09/2020).
[Glo] A. Gloubin. Garbage collection in Python: things you need to know. url: https://rushter.com/blog/python-garbage-collector (visited on 02/09/2020).
[Gra69] R. L. Graham. “Bounds on multiprocessing timing anomalies”. In: SIAM Journal on Applied Mathematics 17.2 (1969), pp. 416–429.
[Gre+16] F. Gremse et al. “GPU-accelerated adjoint algorithmic differentiation”. In: Computer Physics Communications 200 (2016), pp. 300–311. issn: 0010-4655. doi: 10.1016/j.cpc.2015.10.027.
[Gui+18] A. Guironnet et al. “Towards an open-source solution using Modelica for time-domain simulation of power systems”. In: Proc. 8th IEEE PES ISGT Europe. Sarajevo, Bosnia and Herzegovina, Oct. 2018.
[Haq+11] E. Haq et al. “Use of Common Information Model (CIM) in electricity market at California ISO”. In: Power and Energy Society General Meeting, 2011 IEEE. IEEE. 2011, pp. 1–6.
[Har+12] W. E. Hart et al. Pyomo – Optimization Modeling in Python. Vol. 67. Springer, 2012.
[Hee] D. van Heesch. Doxygen: Main Page. url: http://www.doxygen.org (visited on 12/23/2019).
[Heg+01] P. Heggernes et al. The Computational Complexity of the Minimum Degree Algorithm. Tech. rep. Lawrence Livermore National Lab., CA (US), 2001.
[Hen] K. Henney. Chapter 5. Boost.Any. url: http://www.boost.org/doc/libs/release/libs/any (visited on 12/23/2019).
[Hig] J. Higgins. Arabica. url: https://github.com/RWTH-ACS/arabica (visited on 12/23/2019).
[Hin+05] A. C. Hindmarsh et al. “SUNDIALS: Suite of Nonlinear and Differential/Algebraic Equation Solvers”. In: ACM Trans. Math. Softw. 31.3 (Sept. 2005), pp. 363–396. issn: 0098-3500. doi: 10.1145/1089014.1089020.
[Hop+06] K. Hopkinson et al. “EPOCHS: a platform for agent-based electric power and communication simulation built from commercial off-the-shelf components”. In: IEEE Transactions on Power Systems 21.2 (2006), pp. 548–558.
[HR07] S. C. Haw and G. R. K. Rao. “A Comparative Study and Benchmarking on XML Parsers”. In: Advanced Communication Technology, The 9th International Conference on. Vol. 1. IEEE. 2007, pp. 321–325.
[HSC19] A. C. Hindmarsh, R. Serban, and A. Collier. User Documentation for IDA v4.1.0. 2019. url: https://computing.llnl.gov/sites/default/files/public/ida_guide.pdf (visited on 10/21/2019).
[IEC] IEC. IEC Smart Grid - IEC Standards. url: http://www.iec.ch/smartgrid/standards (visited on 12/22/2019).
[IEC06] IEC. IEC 61970-501:2006 Energy management system application program interface (EMS-API) – Part 501: Common Information Model Resource Description Framework (CIM RDF) schema. 2006.
[IEC12a] IEC. IEC 61968-11:2013 Application integration at electric utilities - System interfaces for distribution management – Part 11: Common information model (CIM) extensions for distribution. 2012.
[IEC12b] IEC. IEC 61970-301:2012 Energy management system application program interface (EMS-API) – Part 301: Common Information Model (CIM) base. 2012.
[IEC14] IEC. IEC 62325-301:2014 Framework for energy market communications – Part 301: Common information model (CIM) extensions for markets. 2014.
[IEC16a] IEC. IEC 61970-552:2016 Energy management system application program interface (EMS-API) - Part 552: CIMXML Model exchange format. 2016.
[IEC16b] IEC. IEC/TR 62357-1:2016 Power systems management and associated information exchange - Part 1: Reference architecture. 2016.
[IEC17] IEC. IEC TS 62361-102 ED1 Power systems management and associated information exchange - Interoperability in the long term - Part 102: CIM - IEC 61850 harmonization. 2017.
[IEE18] IEEE and The Open Group. The Open Group Base Specifications Issue 7 – IEEE Std 1003.1, 2018 Edition. New York, NY, USA: IEEE, 2018. url: http://pubs.opengroup.org/onlinepubs/9699919799.
[Inf07] InfiniBand Trade Association. InfiniBand Architecture Specification, Volume 1. Release 1.2.1. InfiniBand Trade Association et al. Nov. 2007.
[Inf16] InfiniBand Trade Association. InfiniBand Architecture Specification Volume 2. Release 1.3.1. InfiniBand Trade Association et al. Nov. 2016.
[Int] Intel Corporation. Intel C++ Compiler. url: https://software.intel.com/en-us/c-compilers (visited on 12/22/2019).
[ISO14] ISO. ISO/IEC JTC 1/SC 22/WG 21 N4100 Programming Languages – C++ – File System Technical Specification. 2014.
[Jos12] N. Josuttis. The C++ Standard Library: A Tutorial and Reference. Addison-Wesley, 2012. isbn: 9780321623218.
[KA99] Y.-K. Kwok and I. Ahmad. “Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors”. In: ACM Comput. Surv. 31.4 (Dec. 1999), pp. 406–471. issn: 0360-0300.
[Kas17] S. Kaster. Runtime Analysis of Python Programs. 2017.
[Ker] Kernel development community. Networking – The Linux Kernel documentation. url: https://linux-kernel-labs.github.io/master/labs/networking.html (visited on 02/14/2020).
[Ker10] M. Kerrisk. The Linux Programming Interface: a Linux and UNIX System Programming Handbook. No Starch Press, 2010. isbn: 978-1-59327-220-3.
[KH02] J. Kovse and T. Härder. “Generic XMI-based UML model transformations”. In: Object-Oriented Information Systems (2002), pp. 183–190.
[KH14] G. Krüger and H. Hansen. Java-Programmierung – Das Handbuch zu Java 8. O’Reilly Germany, 2014.
[Kha+18] S. Khayyamim et al. “Railway System Energy Management Optimization Demonstrated at Offline and Online Case Studies”. In: IEEE Transactions on Intelligent Transportation Systems 19.11 (Nov. 2018), pp. 3570–3583. issn: 1524-9050. doi: 10.1109/TITS.2018.2855748.
[KK04] W. Kocay and D. Kreher. Graphs, Algorithms, and Optimization. Discrete Mathematics and Its Applications. CRC Press, 2004. isbn: 978-0-20348-905-5.
[KK95] G. Karypis and V. Kumar. METIS – Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0. Tech. rep. University of Minnesota, Department of Computer Science, 1995.
[Kle] B. Klein. Python3-Tutorial: Parameterübergabe. url: https://www.python-kurs.eu/python3_parameter.php (visited on 02/09/2020).
[KMS92] M. S. Khaira, G. L. Miller, and T. J. Sheffler. Nested Dissection: A survey and comparison of various nested dissection algorithms. Carnegie-Mellon University, Department of Computer Science, 1992.
[Kol+02] R. Kollmann et al. “A Study on the Current State of the Art in Tool-Supported UML-Based Static Reverse Engineering”. In: Reverse Engineering, 2002. Proceedings. Ninth Working Conference on. IEEE. 2002, pp. 22–32.
[Kol+18] S. Kolen et al. “Enabling the Analysis of Emergent Behavior in Future Electrical Distribution Systems Using Agent-Based Modeling and Simulation”. In: Complexity 2018 (2018).
[Kor09] R. E. Korf. “Multi-Way Number Partitioning”. In: Twenty-First International Joint Conference on Artificial Intelligence. 2009.
[KWS07] U. Kastens, W. M. Waite, and A. M. Sloane. Generating Software from Specifications. Jones & Bartlett Learning, 2007.
[Lar+09] S. Larsen et al. “Architectural breakdown of end-to-end latency in a TCP/IP network”. In: International Journal of Parallel Programming 37.6 (Dec. 2009), pp. 556–571. issn: 1573-7640. doi: 10.1007/s10766-009-0109-6.
[LE] T. Lefebvre and H. Englert. IEC TC57 Power system management and associated information exchange. url: https://www.iec.ch/resources/tcdash/Poster_IEC_TC57.pdf (visited on 02/22/2020).
[Lee+15] B. Lee et al. “Unifying data types of IEC 61850 and CIM”. In: IEEE Transactions on Power Systems 30.1 (2015), pp. 448–456.
[Li+14] W. Li et al. “Cosimulation for Smart Grid Communications”. In: IEEE Transactions on Industrial Informatics 10.4 (2014), pp. 2374–2384.
[Lin] R. Lincoln. PyCIM – Python implementation of the Common Information Model. url: https://github.com/rwl/pycim (visited on 12/23/2019).
[Lin+12] H. Lin et al. “GECO: Global event-driven co-simulation framework for interconnected power system and communication network”. In: IEEE Transactions on Smart Grid 3.3 (2012), pp. 1444–1456.
[Lin19a] R. Lincoln. PYPOWER. 2019. url: https://pypi.org/project/PYPOWER/ (visited on 11/25/2019).
[Lin19b] Real-Time Linux project. realtime:start [Wiki]. 2019. url: https://wiki.linuxfoundation.org/realtime/start (visited on 10/21/2019).
[LK17] B. Lee and D.-K. Kim. “Harmonizing IEC 61850 and CIM for connectivity of substation automation”. In: Computer Standards & Interfaces 50 (2017), pp. 199–208.
[LLV] LLVM Foundation. The LLVM Compiler Infrastructure Project. url: http://www.llvm.org (visited on 12/23/2019).
[LPS15] S. K. Lam, A. Pitrou, and S. Seibert. “Numba: a LLVM-based Python JIT compiler”. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. ACM. 2015, p. 7.
[Lun+09] H. Lundvall et al. “Automatic Parallelization of Simulation Code for Equation-based Models with Software Pipelining and Measurements on Three Platforms”. In: SIGARCH Comput. Archit. News 36.5 (June 2009), pp. 46–55. issn: 0163-5964.
[Mae12] K. Maeda. “Performance Evaluation of Object Serialization Libraries in XML, JSON and Binary Formats”. In: Digital Information and Communication Technology and its Applications (DICTAP), 2012 Second International Conference on. IEEE. 2012, pp. 177–182.
[Man] Man-Pages Authors. memusage(1) - Linux manual page. url: http://man7.org/linux/man-pages/man1/memusage.1.html (visited on 02/10/2020).
[MAT19] MATPOWER Developers. MATPOWER. 2019. url: https://matpower.org (visited on 11/25/2019).
[McM07] A. W. McMorran. “An Introduction to IEC 61970-301 & 61968-11: The Common Information Model”. In: University of Strathclyde 93 (2007), p. 124.
[MDC09] A. Mercurio, A. Di Giorgio, and P. Cioci. “Open-Source Implementation of Monitoring and Controlling Services for EMS/SCADA Systems by Means of Web Services – IEC 61850 and IEC 61970 Standards”. In: IEEE Transactions on Power Delivery 24.3 (2009), pp. 1148–1153.
[Mel18] Mellanox Technologies. Mellanox OFED for Linux User Manual. 2877. Rev 4.3. Mellanox Technologies. Mar. 2018.
[Min] S. Mingshen. Getting Started. url: http://mesapy.org/rpython-by-example/getting-started/index.html (visited on 02/09/2020).
[Mir+17] M. Mirz et al. “Dynamic phasors to enable distributed real-time simulation”. In: 2017 6th International Conference on Clean Electrical Power (ICCEP). June 2017, pp. 139–144.
[Mir+18] M. Mirz et al. “A Cosimulation Architecture for Power System, Communication, and Market in the Smart Grid”. In: Hindawi Complexity 2018 (Feb. 2018). doi: 10.1155/2018/7154031.
[Mir+19] M. Mirz et al. “DPsim – A dynamic phasor real-time simulator for power systems”. In: SoftwareX 10 (2019), p. 100253. issn: 2352-7110. doi: 10.1016/j.softx.2019.100253. url: http://www.sciencedirect.com/science/article/pii/S2352711018302760.
[Mir20] M. Mirz. “A Dynamic Phasor Real-Time Simulation Based Digital Twin for Power Systems”. PhD thesis. RWTH Aachen University, 2020.
[MMS13] N. V. Mago, J. D. Moseley, and N. Sarma. “A methodology for modeling telemetry in power systems models using IEC-61968/61970”. In: Innovative Smart Grid Technologies-Asia (ISGT Asia), 2013 IEEE. IEEE. 2013, pp. 1–6.
[MNM16] M. Mirz, L. Netze, and A. Monti. “A multi-level approach to power system Modelica models”. In: Control and Modeling for Power Electronics (COMPEL), 2016 IEEE 17th Workshop on. IEEE. 2016, pp. 1–7.
[Mod] Modelica Association. Introduction – Modelica Language Specification 3.3 Revision 1. url: https://modelica.readthedocs.io/en/latest/introduction.html (visited on 12/26/2019).
[Mol+14] C. Molitor et al. “MESCOS – A Multienergy System Cosimulator for City District Energy Systems”. In: IEEE Transactions on Industrial Informatics 10.4 (2014), pp. 2247–2256.
[Mon] M. Pilgrim. Native datatypes – Dive Into Python 3. url: https://diveintopython3.net/native-datatypes.html (visited on 02/09/2020).
[Mon+18] A. Monti et al. “A Global Real-Time Superlab: Enabling High Penetration of Power Electronics in the Electric Grid”. In: IEEE Power Electronics Magazine 5.3 (Sept. 2018), pp. 35–44.
[MPT78] M. D. McIlroy, E. Pinson, and B. Tague. “UNIX Time-Sharing System: Foreword”. In: Bell Labs Technical Journal 57.6 (1978), pp. 1899–1904.
[MR12] P. MacArthur and R. D. Russell. “A Performance Study to Guide RDMA Programming Decisions”. In: High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012, pp. 778–785.
[Mül03] M. S. Müller. “An OpenMP Compiler Benchmark”. In: Scientific Programming 11.2 (2003), pp. 125–131.
[MV11] A. Meister and C. Vömel. Numerik linearer Gleichungssysteme: Eine Einführung in moderne Verfahren. Mit MATLAB®-Implementierungen von C. Vömel. Vieweg+Teubner Verlag, 2011. isbn: 9783834881007.
[Nic+00] U. A. Nickel et al. “Roundtrip engineering with FUJABA”. In: Proceedings of the 2nd Workshop on Software-Reengineering (WSR), August. Citeseer. 2000.
[Numa] Numba community. Numba: A High Performance Python Compiler. url: http://numba.pydata.org (visited on 12/11/2019).
[Numb] NumPy developers. NumPy. url: https://numpy.org (visited on 12/11/2019).
[Ope19a] OpenModelica Developers. Major OpenModelica Releases. 2019. url: https://www.openmodelica.org/doc/OpenModelicaUsersGuide/latest/tracreleases.html#release-notes-for-openmodelica-1-11-0 (visited on 10/21/2019).
[Ope19b] OpenMP Architecture Review Board. Home – OpenMP. 2019. url: https://www.openmp.org (visited on 10/21/2019).
[Pan09] J. Z. Pan. “Resource Description Framework”. In: Handbook on Ontologies. Springer, 2009, pp. 71–90.
[Par04] T. J. Parr. “Enforcing strict model-view separation in template engines”. In: Proceedings of the 13th international conference on World Wide Web. ACM. 2004, pp. 224–233.
[Pet82] L. Petzold. Description of DASSL: a differential/algebraic system solver. Tech. rep. Sandia National Labs., Livermore, CA (USA), Sept. 1982.
[Pfi01] G. F. Pfister. “An Introduction to the InfiniBand Architecture”. In: High Performance Mass Storage and Parallel I/O 42 (2001), pp. 617–632.
[Pic+16] S. Pickartz et al. “Migrating LinuX Containers Using CRIU”. In: High Performance Computing. Ed. by M. Taufer, B. Mohr, and J. M. Kunkel. Cham: Springer International Publishing, 2016, pp. 674–684. isbn: 978-3-319-46079-6.
[Pot18] D. Potter. Implementation and Analysis of an InfiniBand based Communication in a Real-Time Co-Simulation Framework. 2018.
[Pra+11] Y. Pradeep et al. “CIM-Based Connectivity Model for Bus-Branch Topology Extraction and Exchange”. In: IEEE Transactions on Smart Grid 2.2 (June 2011), pp. 244–253. issn: 1949-3061. doi: 10.1109/TSG.2011.2109016.
[Pre12] J. Preshing. A Look Back at Single-Threaded CPU Performance. 2012. url: https://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance/ (visited on 10/21/2019).
[Pug16] J. F. Puget. A Speed Comparison Of C, Julia, Python, Numba, and Cython on LU Factorization. 2016. url: https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en (visited on 02/10/2020).
[PyPa] PyPy community. Bytecode Interpreter. url: http://doc.pypy.org/en/latest/interpreter.html#introduction-and-overview (visited on 02/09/2020).
[PyPb] PyPy community. Garbage Collection in PyPy. url: https://doc.pypy.org/en/release-2.4.x/garbage_collection.html (visited on 02/09/2020).
[PyPc] PyPy community. Goals and Architecture Overview. url: http://doc.pypy.org/en/latest/architecture.html#id1 (visited on 02/09/2020).
[PyPd] PyPy community. Incminimark. url: https://doc.pypy.org/en/latest/gc_info.html#incminimark (visited on 02/09/2020).
[PyPe] PyPy community. RPython Documentation. url: https://rpython.readthedocs.io/en/latest/index.html#index (visited on 02/09/2020).
[Pyta] Python Software Foundation. array — Efficient arrays of numeric values. url: https://docs.python.org/3/library/array.html#module-array (visited on 02/09/2020).
[Pytb] Python Software Foundation. CPython. url: https://www.python.org (visited on 12/12/2019).
[Pytc] Python Software Foundation. multiprocessing – Process-based parallelism. url: https://docs.python.org/3.6/library/multiprocessing.html (visited on 02/09/2020).
[Pytd] Python Software Foundation. Python Software Foundation: Press Release 20-Dec-2019. url: https://www.python.org/psf/press-release/pr20191220/ (visited on 02/09/2020).
[Pyte] Python Software Foundation. threading – Thread-based parallelism. url: https://docs.python.org/3.6/library/threading.html (visited on 02/09/2020).
[Qui03] M. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill, 2003. isbn: 9780071232654.
[Ray03] E. S. Raymond. The art of Unix programming. Addison-Wesley Professional, 2003.
[Raz+18a] L. Razik et al. “Automated deserializer generation from CIM ontologies: CIM++ – an easy-to-use and automated adaptable open-source library for object deserialization in C++ from documents based on user-specified UML models following the Common Information Model (CIM) standards for the energy sector”. In: Computer Science - Research and Development 33.1 (Feb. 2018), pp. 93–103. issn: 1865-2042. doi: 10.1007/s00450-017-0350-y.
[Raz+18b] L. Razik et al. “CIMverter – a template-based flexibly extensible open-source converter from CIM to Modelica”. In: Energy Informatics 1.1 (Oct. 2018), p. 47. issn: 2520-8942. doi: 10.1186/s42162-018-0031-5.
[Raz+19a] L. Razik et al. “A comparative analysis of LU decomposition methods for power system simulations”. In: 2019 IEEE Milan PowerTech. June 2019, pp. 1–6.
[Raz+19b] L. Razik et al. “REM-S – Railway Energy Management in Real Rail Operation”. In: IEEE Transactions on Vehicular Technology 68.2 (Feb. 2019), pp. 1266–1277. doi: 10.1109/TVT.2018.2885007.
[Rei19] G. Reinke. Development of a Dependency Analysis between Power System Simulation Components for their Parallel Processing. 2019.
[Reu+16] R. H. Reussner et al. Modeling and Simulating Software Architectures: The Palladio Approach. MIT Press, 2016.
[Ris+16] S. Ristov et al. “Superlinear speedup in HPC systems: Why and when?” In: 2016 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE. 2016, pp. 889–898.
[RJB04] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modeling Language Reference Manual (2nd Edition). Pearson Higher Education, 2004. isbn: 0321245628.
[Rog16] A. Roghult. “Benchmarking Python Interpreters”. In: KTH Royal Institute of Technology (2016). url: http://kth.diva-portal.org/smash/get/diva2:912464/FULLTEXT01.pdf.
[Roo99] S. Roosta. Parallel Processing and Parallel Algorithms: Theory and Computation. Springer New York, 1999. isbn: 978-0-38798-716-3.
[Rosa] P. Ross. Cython Function Declarations – Cython def, cdef and cpdef functions 0.1.0 documentation. url: https://notes-on-cython.readthedocs.io/en/latest/function_declarations.html (visited on 02/10/2020).
[Rosb] G. van Rossum. What’s New In Python 3.0. url: https://docs.python.org/3/whatsnew/3.0.html (visited on 02/09/2020).
[Rud+06] K. Rudion et al. “Design of benchmark of medium voltage distribution network for investigation of DG integration”. In: Power Engineering Society General Meeting, 2006. IEEE. IEEE. 2006, 6 pp.
[Sad+09] A. Sadovykh et al. “On Study Results: Round Trip Engineering of Space Systems”. In: European Conference on Model Driven Architecture-Foundations and Applications. Springer. 2009, pp. 265–276.
[Sch+15] F. Schloegl et al. “Towards a classification scheme for co-simulation approaches in energy systems”. In: Smart Electric Distribution Systems and Technologies (EDST), 2015 International Symposium on. IEEE. 2015, pp. 516–521.
[Sch11] S. Schütte. “A Domain-Specific Language For Simulation Composition”. In: ECMS. 2011, pp. 146–152.
[Sch19] S. Scherfke. mosaik Documentation — Release 2.5.1. June 2019. url: https://media.readthedocs.org/pdf/mosaik/latest/mosaik.pdf (visited on 10/21/2019).
[Scia] SciPy community. Broadcasting. url: http://scipy-lectures.org/intro/numpy/operations.html#broadcasting (visited on 02/09/2020).
[Scib] SciPy community. Casting Rules. url: https://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules (visited on 02/09/2020).
[Scic] SciPy community. Universal functions (ufunc). url: https://docs.scipy.org/doc/numpy/reference/ufuncs.html (visited on 02/09/2020).
[Scid] SciPy developers. SciPy.org. url: https://www.scipy.org (visited on 12/11/2019).
[ŠD12] V. Štuikys and R. Damaševičius. Meta-Programming and Model-Driven Meta-Program Development: Principles, Processes and Techniques. Vol. 5. Springer Science & Business Media, 2012.
[Sjö+10] M. Sjölund et al. “Towards Efficient Distributed Simulation in Modelica using Transmission Line Modeling”. In: 3rd International Workshop on Equation-Based Object-Oriented Modeling Languages and Tools; Oslo; Norway; October 3. 047. Linköping University Electronic Press. 2010, pp. 71–80.
[Slo01] A. Slominski. Design of a Pull and Push Parser System for Streaming XML. Tech. rep. TR-550, Indiana University, 2001.
[Sma12] Smart Grid Coordination Group, CEN-CENELEC-ETSI. Smart Grid Reference Architecture. Nov. 2012. url: https://ec.europa.eu/energy/sites/ener/files/documents/xpert_group1_reference_architecture.pdf (visited on 10/21/2019).
[Smi15] K. Smith. Cython: A Guide for Python Programmers. O’Reilly Media, 2015. isbn: 9781491901755. url: https://books.google.de/books?id=ERFkBgAAQBAJ.
[Spe] O. van der Spek. C++ CTemplate system. url: https://github.com/OlafvdSpek/ctemplate (visited on 12/23/2019).
[SRS10] R. Santodomingo, J. Rodríguez-Mondéjar, and M. Sanz-Bobi. “Ontology Matching Approach to the Harmonization of CIM and IEC 61850 Standards”. In: Smart Grid Communications (SmartGridComm), 2010 First IEEE International Conference on. IEEE. 2010, pp. 55–60.
[SST11] S. Schütte, S. Scherfke, and M. Tröschel. “Mosaik: A framework for modular simulation of active components in Smart Grids”. In: Smart Grid Modeling and Simulation (SGMS), 2011 IEEE First International Workshop on. IEEE. 2011, pp. 55–60.
[Ste+17] M. Stevic et al. “Multi-site European framework for real-time co-simulation of power systems”. In: IET Generation, Transmission & Distribution 11.17 (2017), pp. 4126–4135. issn: 1751-8687. doi: 10.1049/iet-gtd.2016.1576.
[STM10] L. Surhone, M. Timpledon, and S. Marseken. Template Processor. Betascript Publishing, 2010. isbn: 9786130536886.
[Sto+13] J. Stoer et al. Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer New York, 2013. isbn: 9781475722727.
[Sup] SuperLU developers. SuperLU: Home Page. url: https://portal.nersc.gov/project/sparse/superlu (visited on 12/24/2019).
[SV01] Y. Saad and H. A. Van Der Vorst. “Iterative Solution of Linear Systems in the 20th Century”. In: Numerical Analysis: Historical Developments in the 20th Century. Elsevier, 2001, pp. 175–207.
[SWD15] R. Sedgewick, K. Wayne, and R. Dondero. Introduction to Programming in Python: An Interdisciplinary Approach. Pearson Education, 2015. isbn: 9780134076522. url: https://introcs.cs.princeton.edu/python/appendix_numpy/ (visited on 02/09/2020).
[Tad] Tadeck. Why is Python 3 not backwards compatible? url: https://stackoverflow.com/questions/9066956/why-is-python-3-not-backwards-compatible (visited on 02/09/2020).
[Tan09] A. Tanenbaum. Modern Operating Systems. Pearson Prentice Hall, 2009. isbn: 9780138134594.
[Thi19] B. Thiele. GitHub - modelica-3rdparty/Modelica_DeviceDrivers. 2019. url: https://github.com/modelica-3rdparty/Modelica_DeviceDrivers (visited on 10/23/2019).
[Til01] M. Tiller. Introduction to physical modeling with Modelica. Boston: Kluwer Academic Publishers, 2001.
[Tri] Trilinos developers. GitHub - trilinos/Trilinos. url: https://github.com/trilinos/Trilinos (visited on 02/23/2020).
[TW67] W. F. Tinney and J. W. Walker. “Direct Solutions of Sparse Network Equations by Optimally Ordered Triangular Factorization”. In: Proceedings of the IEEE 55.11 (Nov. 1967), pp. 1801–1809. issn: 0018-9219. doi: 10.1109/PROC.1967.6011.
[Ull75] J. D. Ullman. “NP-Complete Scheduling Problems”. In: Journal of Computer and System Sciences 10.3 (1975), pp. 384–393.
[Umw19] Umweltbundesamt. Erneuerbare Energien in Deutschland – Daten zur Entwicklung im Jahr 2018. 2019. url: https://www.umweltbundesamt.de/sites/default/files/medien/1410/publikationen/uba_hgp_eeinzahlen_2019_bf.pdf (visited on 10/21/2019).
[Uni17] University of Tennessee, Knoxville. BLAS (Basic Linear Algebra Subprograms). 2017. url: http://www.netlib.org/blas/ (visited on 10/21/2019).
[Uni19] University of Tennessee, Knoxville et al. LAPACK – Linear Algebra PACKage. 2019. url: http://www.netlib.org/lapack/ (visited on 10/21/2019).
[Usl+12] M. Uslar et al. The Common Information Model CIM: IEC 61968/61970 and 62325 – A Practical Introduction to the CIM. Power Systems. Springer Berlin Heidelberg, 2012. isbn: 9783642252150. url: https://books.google.de/books?id=cdw6gtzwc-QC.
[Van] J. VanderPlas. Why Python is Slow: Looking Under the Hood | Pythonic Perambulations. url: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow (visited on 02/10/2020).
[Var+11] E. Varnik et al. “Fast Conservative Estimation of Hessian Sparsity”. In: Fifth SIAM Workshop on Combinatorial Scientific Computing, May 19–21, 2011, Darmstadt, Germany. May 2011, pp. 18–21.
[VCV11] S. Van Der Walt, S. C. Colbert, and G. Varoquaux. “The NumPy Array: A Structure for Efficient Numerical Computation”. In: Computing in Science & Engineering 13.2 (2011), p. 22.
[Vir+17] R. Viruez et al. “A Modelica-based Tool for Power System Dynamic Simulations”. In: Proceedings of the 12th International Modelica Conference, Prague, Czech Republic, May 15–17, 2017. 132. Linköping University Electronic Press. 2017, pp. 235–239.
[Vog+17] S. Vogel et al. “An Open Solution for Next-generation Real-time Power System Simulation”. In: 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2). Nov. 2017, pp. 1–6. doi: 10.1109/EI2.2017.8245739.
[Wal+14] M. Walther et al. “Equation based parallelization of Modelica models”. In: Proceedings of the 10th International Modelica Conference; March 10–12; 2014; Lund; Sweden. 096. Linköping University Electronic Press. 2014, pp. 1213–1220.
[WB87] M. Wolfe and U. Banerjee. “Data dependence and its application to parallel processing”. In: International Journal of Parallel Programming 16.2 (Apr. 1987), pp. 137–178. issn: 1573-7640.
[Wei+07] S. Wei et al. “Multi-Agent Architecture of Energy Management System Based on IEC 61970 CIM”. In: Power Engineering Conference, 2007. IPEC 2007. International. IEEE. 2007, pp. 1366–1370.
[WGG10] K. Wehrle, M. Günes, and J. Gross. Modeling and Tools for Network Simulation. Springer Science & Business Media, 2010.
[WH16] Z. Wang and Y. He. “Two-stage optimal demand response with battery energy storage systems”. In: IET Generation, Transmission & Distribution 10.5 (2016), pp. 1286–1293.
[Wil19] A. Williams. C++ Concurrency in Action. Manning Publications Company, 2019. isbn: 9781617294693.
[Yan81] M. Yannakakis. “Computing the Minimum Fill-In is NP-Complete”. In: SIAM Journal on Algebraic Discrete Methods 2.1 (1981), pp. 77–79.
[ZCN11] K. Zhu, M. Chenine, and L. Nordstrom. “ICT architecture impact on wide area monitoring and control systems’ reliability”. In: IEEE Transactions on Power Delivery 26.4 (2011), pp. 2801–2808.
[ZMT11] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas. “MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education”. In: IEEE Transactions on Power Systems 26.1 (2011), pp. 12–19.
[ZPK00] B. P. Zeigler, H. Praehofer, and T. G. Kim. Theory of Modeling and Simulation: Integrating Discrete Event and Continuous Complex Dynamic Systems. Academic Press, 2000.
E.ON ERC Band 1
Streblow, R.
Thermal Sensation and
Comfort Model for
Inhomogeneous Indoor
Environments
1. Auflage 2011
ISBN 978-3-942789-00-4
E.ON ERC Band 2
Naderi, A.
Multi-phase, multi-species
reactive transport modeling as
a tool for system analysis in
geological carbon dioxide
storage
1. Auflage 2011
ISBN 978-3-942789-01-1
E.ON ERC Band 3
Westner, G.
Four Essays related to Energy
Economic Aspects of
Combined Heat and Power
Generation
1. Auflage 2012
ISBN 978-3-942789-02-8
E.ON ERC Band 4
Lohwasser, R.
Impact of Carbon Capture and
Storage (CCS) on the European
Electricity Market
1. Auflage 2012
ISBN 978-3-942789-03-5
E.ON ERC Band 5
Dick, C.
Multi-Resonant Converters as
Photovoltaic Module-
Integrated Maximum Power
Point Tracker
1. Auflage 2012
ISBN 978-3-942789-04-2
E.ON ERC Band 6
Lenke, R.
A Contribution to the Design of
Isolated DC-DC Converters for
Utility Applications
1. Auflage 2012
ISBN 978-3-942789-05-9
E.ON ERC Band 7
Brännström, F.
Einsatz hybrider RANS-LES-
Turbulenzmodelle in der
Fahrzeugklimatisierung
1. Auflage 2012
ISBN 978-3-942789-06-6
E.ON ERC Band 8
Bragard, M.
The Integrated Emitter Turn-
Off Thyristor - An Innovative
MOS-Gated High-Power
Device
1. Auflage 2012
ISBN 978-3-942789-07-3
E.ON ERC Band 9
Hoh, A.
Exergiebasierte Bewertung
gebäudetechnischer Anlagen
1. Auflage 2013
ISBN 978-3-942789-08-0
E.ON ERC Band 10
Köllensperger, P.
The Internally Commutated
Thyristor - Concept, Design
and Application
1. Auflage 2013
ISBN 978-3-942789-09-7
E.ON ERC Band 11
Achtnicht, M.
Essays on Consumer Choices
Relevant to Climate Change:
Stated Preference Evidence
from Germany
1. Auflage 2013
ISBN 978-3-942789-10-3
E.ON ERC Band 12
Panašková, J.
Olfaktorische Bewertung von
Emissionen aus Bauprodukten
1. Auflage 2013
ISBN 978-3-942789-11-0
E.ON ERC Band 13
Vogt, C.
Optimization of Geothermal
Energy Reservoir Modeling
using Advanced Numerical
Tools for Stochastic Parameter
Estimation and Quantifying
Uncertainties
1. Auflage 2013
ISBN 978-3-942789-12-7
E.ON ERC Band 14
Benigni, A.
Latency exploitation for
parallelization of
power systems simulation
1. Auflage 2013
ISBN 978-3-942789-13-4
E.ON ERC Band 15
Butschen, T.
Dual-ICT – A Clever Way to
Unite Conduction and
Switching Optimized
Properties in a Single Wafer
1. Auflage 2013
ISBN 978-3-942789-14-1
E.ON ERC Band 16
Li, W.
Fault Detection and
Protection in Medium
Voltage DC Shipboard
Power Systems
1. Auflage 2013
ISBN 978-3-942789-15-8
E.ON ERC Band 17
Shen, J.
Modeling Methodologies for
Analysis and Synthesis of
Controls and Modulation
Schemes for High-Power
Converters with Low Pulse
Ratios
1. Auflage 2014
ISBN 978-3-942789-16-5
E.ON ERC Band 18
Flieger, B.
Innenraummodellierung einer
Fahrzeugkabine
in der Programmiersprache
Modelica
1. Auflage 2014
ISBN 978-3-942789-17-2
E.ON ERC Band 19
Liu, J.
Measurement System and
Technique for Future Active
Distribution Grids
1. Auflage 2014
ISBN 978-3-942789-18-9
E.ON ERC Band 20
Kandzia, C.
Experimentelle Untersuchung
der Strömungsstrukturen in
einer Mischlüftung
1. Auflage 2014
ISBN 978-3-942789-19-6
E.ON ERC Band 21
Thomas, S.
A Medium-Voltage Multi-
Level DC/DC Converter with
High Voltage Transformation
Ratio
1. Auflage 2014
ISBN 978-3-942789-20-2
E.ON ERC Band 22
Tang, J.
Probabilistic Analysis and
Stability Assessment for Power
Systems with Integration of
Wind Generation and
Synchrophasor Measurement
1. Auflage 2014
ISBN 978-3-942789-21-9
E.ON ERC Band 23
Sorda, G.
The Diffusion of Selected
Renewable Energy
Technologies: Modeling,
Economic Impacts, and Policy
Implications
1. Auflage 2014
ISBN 978-3-942789-22-6
E.ON ERC Band 24
Rosen, C.
Design considerations and
functional analysis of local
reserve energy markets for
distributed generation
1. Auflage 2014
ISBN 978-3-942789-23-3
E.ON ERC Band 25
Ni, F.
Applications of Arbitrary
Polynomial Chaos in Electrical
Systems
1. Auflage 2015
ISBN 978-3-942789-24-0
E.ON ERC Band 26
Michelsen, C. C.
The Energiewende in the
German Residential Sector:
Empirical Essays on
Homeowners’ Choices of
Space Heating Technologies
1. Auflage 2015
ISBN 978-3-942789-25-7
E.ON ERC Band 27
Rohlfs, W.
Decision-Making under Multi-
Dimensional Price Uncertainty
for Long-Lived Energy
Investments
1. Auflage 2015
ISBN 978-3-942789-26-4
E.ON ERC Band 28
Wang, J.
Design of Novel Control
algorithms of Power
Converters for Distributed
Generation
1. Auflage 2015
ISBN 978-3-942789-27-1
E.ON ERC Band 29
Helmedag, A.
System-Level Multi-Physics
Power Hardware in the Loop
Testing for Wind Energy
Converters
1. Auflage 2015
ISBN 978-3-942789-28-8
E.ON ERC Band 30
Togawa, K.
Stochastics-based Methods
Enabling Testing of Grid-
related Algorithms through
Simulation
1. Auflage 2015
ISBN 978-3-942789-29-5
E.ON ERC Band 31
Huchtemann, K.
Supply Temperature Control
Concepts in Heat Pump
Heating Systems
1. Auflage 2015
ISBN 978-3-942789-30-1
E.ON ERC Band 32
Molitor, C.
Residential City Districts as
Flexibility Resource: Analysis,
Simulation, and Decentralized
Coordination Algorithms
1. Auflage 2015
ISBN 978-3-942789-31-8
E.ON ERC Band 33
Sunak, Y.
Spatial Perspectives on the
Economics of Renewable
Energy Technologies
1. Auflage 2015
ISBN 978-3-942789-32-5
E.ON ERC Band 34
Cupelli, M.
Advanced Control Methods for
Robust Stability of MVDC
Systems
1. Auflage 2015
ISBN 978-3-942789-33-2
E.ON ERC Band 35
Chen, K.
Active Thermal Management
for Residential Air Source Heat
Pump Systems
1. Auflage 2015
ISBN 978-3-942789-34-9
E.ON ERC Band 36
Pâques, G.
Development of SiC GTO
Thyristors with Etched
Junction Termination
1. Auflage 2016
ISBN 978-3-942789-35-6
E.ON ERC Band 37
Garnier, E.
Distributed Energy Resources
and Virtual Power Plants:
Economics of Investment and
Operation
1. Auflage 2016
ISBN 978-3-942789-37-0
E.ON ERC Band 38
Calì, D.
Occupants' Behavior and its
Impact upon the Energy
Performance of Buildings
1. Auflage 2016
ISBN 978-3-942789-36-3
E.ON ERC Band 39
Isermann, T.
A Multi-Agent-based
Component Control and
Energy Management System
for Electric Vehicles
1. Auflage 2016
ISBN 978-3-942789-38-7
E.ON ERC Band 40
Wu, X.
New Approaches to Dynamic
Equivalent of Active
Distribution Network for
Transient Analysis
1. Auflage 2016
ISBN 978-3-942789-39-4
E.ON ERC Band 41
Garbuzova-Schiftler, M.
The Growing ESCO Market for
Energy Efficiency in Russia: A
Business and Risk Analysis
1. Auflage 2016
ISBN 978-3-942789-40-0
E.ON ERC Band 42
Huber, M.
Agentenbasierte
Gebäudeautomation für
raumlufttechnische Anlagen
1. Auflage 2016
ISBN 978-3-942789-41-7
E.ON ERC Band 43
Soltau, N.
High-Power Medium-Voltage
DC-DC Converters: Design,
Control and Demonstration
1. Auflage 2017
ISBN 978-3-942789-42-4
E.ON ERC Band 44
Stieneker, M.
Analysis of Medium-Voltage
Direct-Current Collector Grids
in Offshore Wind Parks
1. Auflage 2017
ISBN 978-3-942789-43-1
E.ON ERC Band 45
Bader, A.
Entwicklung eines Verfahrens
zur Strompreisvorhersage im
kurzfristigen Intraday-
Handelszeitraum
1. Auflage 2017
ISBN 978-3-942789-44-8
E.ON ERC Band 46
Chen, T.
Upscaling Permeability for
Fractured Porous Rocks and
Modeling Anisotropic Flow
and Heat Transport
1. Auflage 2017
ISBN 978-3-942789-45-5
E.ON ERC Band 47
Ferdowsi, M.
Data-Driven Approaches for
Monitoring of Distribution
Grids
1. Auflage 2017
ISBN 978-3-942789-46-2
E.ON ERC Band 48
Kopmann, N.
Betriebsverhalten freier
Heizflächen unter zeitlich
variablen Randbedingungen
1. Auflage 2017
ISBN 978-3-942789-47-9
E.ON ERC Band 49
Fütterer, J.
Tuning of PID Controllers
within Building Energy
Systems
1. Auflage 2017
ISBN 978-3-942789-48-6
E.ON ERC Band 50
Adler, F.
A Digital Hardware Platform
for Distributed Real-Time
Simulation of Power Electronic
Systems
1. Auflage 2017
ISBN 978-3-942789-49-3
E.ON ERC Band 51
Harb, H.
Predictive Demand Side
Management Strategies for
Residential Building Energy
Systems
1. Auflage 2017
ISBN 978-3-942789-50-9
E.ON ERC Band 52
Jahangiri, P.
Applications of Paraffin-Water
Dispersions in Energy
Distribution Systems
1. Auflage 2017
ISBN 978-3-942789-51-6
E.ON ERC Band 53
Adolph, M.
Identification of Characteristic
User Behavior with a Simple
User Interface in the Context of
Space Heating
1. Auflage 2018
ISBN 978-3-942789-52-3
E.ON ERC Band 54
Galassi, V.
Experimental evidence of
private energy consumer and
prosumer preferences in the
sustainable energy transition
1. Auflage 2017
ISBN 978-3-942789-53-0
E.ON ERC Band 55
Sangi, R.
Development of Exergy-based
Control Strategies for Building
Energy Systems
1. Auflage 2018
ISBN 978-3-942789-54-7
E.ON ERC Band 56
Stinner, S.
Quantifying and Aggregating
the Flexibility of Building
Energy Systems
1. Auflage 2018
ISBN 978-3-942789-55-4
E.ON ERC Band 57
Fuchs, M.
Graph Framework for
Automated Urban Energy
System Modeling
1. Auflage 2018
ISBN 978-3-942789-56-1
E.ON ERC Band 58
Osterhage, T.
Messdatengestützte Analyse
und Interpretation
sanierungsbedingter
Effizienzsteigerungen im
Wohnungsbau
1. Auflage 2018
ISBN 978-3-942789-57-8
E.ON ERC Band 59
Frieling, J.
Quantifying the Role of Energy
in Aggregate Production
Functions for Industrialized
Countries
1. Auflage 2018
ISBN 978-3-942789-58-5
E.ON ERC Band 60
Lauster, M.
Parametrierbare
Gebäudemodelle für
dynamische
Energiebedarfsrechnungen von
Stadtquartieren
1. Auflage 2018
ISBN 978-3-942789-59-2
E.ON ERC Band 61
Zhu, L.
Modeling, Control and
Hardware in the Loop in
Medium Voltage DC
Shipboard Power Systems
1. Auflage 2018
ISBN 978-3-942789-60-8
E.ON ERC Band 62
Feron, B.
An optimality assessment
methodology for Home Energy
Management System
approaches based on
uncertainty analysis
1. Auflage 2018
ISBN 978-3-942789-61-5
E.ON ERC Band 63
Diekerhof, M.
Distributed Optimization for
the Exploitation of Multi-
Energy Flexibility under
Uncertainty in City Districts
1. Auflage 2018
ISBN 978-3-942789-62-2
E.ON ERC Band 64
Wolisz, H.
Transient Thermal Comfort
Constraints for Model
Predictive Heating Control
1. Auflage 2018
ISBN 978-3-942789-63-9
E.ON ERC Band 65
Pickartz, S.
Virtualization as an Enabler for
Dynamic Resource Allocation
in HPC
1. Auflage 2019
ISBN 978-3-942789-64-6
E.ON ERC Band 66
Khayyamim, S.
Centralized-decentralized
Energy Management in
Railway System
1. Auflage 2019
ISBN 978-3-942789-65-3
E.ON ERC Band 67
Schlösser, T.
Methodology for Holistic
Evaluation of Building Energy
Systems under Dynamic
Boundary Conditions
1. Auflage 2019
ISBN 978-3-942789-66-0
E.ON ERC Band 68
Cui, S.
Modular Multilevel DC-DC
Converters Interconnecting
High-Voltage and Medium-
Voltage DC Grids
1. Auflage 2019
ISBN 978-3-942789-67-7
E.ON ERC Band 69
Hu, J.
Modulation and Dynamic
Control of Intelligent Dual-
Active-Bridge Converter Based
Substations for Flexible DC
Grids
1. Auflage 2019
ISBN 978-3-942789-68-4
E.ON ERC Band 70
Schiefelbein, J.
Optimized Placement of
Thermo-Electric Energy
Systems in City Districts under
Uncertainty
1. Auflage 2019
ISBN 978-3-942789-69-1
E.ON ERC Band 71
Ferdinand, R.
Grid Operation of HVDC-
Connected Offshore Wind
Farms: Power Quality and
Switching Strategies
1. Auflage 2019
ISBN 978-3-942789-70-7
E.ON ERC Band 72
Musa, A.
Advanced Control Strategies
for Stability Enhancement of
Future Hybrid AC/DC
Networks
1. Auflage 2019
ISBN 978-3-942789-71-4
E.ON ERC Band 73
Angioni, A.
Uncertainty modeling for
analysis and design of
monitoring systems for
dynamic electrical distribution
grids
1. Auflage 2019
ISBN 978-3-942789-72-1
E.ON ERC Band 74
Möhlenkamp, M.
Thermischer Komfort bei
Quellluftströmungen
1. Auflage 2019
ISBN 978-3-942789-73-8
E.ON ERC Band 75
Voss, J.
Multi-Megawatt Three-Phase
Dual-Active Bridge DC-DC
Converter
1. Auflage 2019
ISBN 978-3-942789-74-5
E.ON ERC Band 76
Siddique, H.
The Three-Phase Dual-Active
Bridge Converter Family:
Modeling, Analysis,
Optimization and
Comparison of Two-Level and
Three-Level Converter
Variants
1. Auflage 2019
ISBN 978-3-942789-75-2
E.ON ERC Band 77
Heesen, F.
An Interdisciplinary Analysis
of Heat Energy Consumption
in Energy-Efficient Homes:
Essays on Economic, Technical
and Behavioral Aspects
1. Auflage 2019
ISBN 978-3-942789-76-9
E.ON ERC Band 78
Möller, R.
Untersuchung der
Durchschlagspannung von
Mineral-, Silikonölen und
synthetischen Estern bei
mittelfrequenten Spannungen
1. Auflage 2020
ISBN 978-3-942789-77-6
E.ON ERC Band 79
Höfer, T.
Transition Towards a
Renewable Energy
Infrastructure: Spatial
Interdependencies and Stakeholder Preferences
1. Auflage 2020
ISBN 978-3-942789-78-3
E.ON ERC Band 80
Freitag, H.
Investigation of the Internal
Flow Behavior in Active
Chilled Beams
1. Auflage 2020
ISBN 978-3-942789-79-0
This dissertation deals with established and newly developed methods from the field of high-performance computing (HPC) and computer science, implemented in existing and new software for the simulation of large-scale power systems. The work is motivated by the transformation of conventional power grids into smart grids, driven by the growing share of renewable energies, which requires more complex power grid management. The presented HPC methods exploit the potential of modern computer hardware, such as its increasing number of parallel computing units and the decreasing latencies of network communication, which can be decisive especially for real-time applications. Beyond measures for optimizing hardware utilization, the dissertation also addresses the representation of power systems: in smart grid simulation, this comprises not only the power grid itself but also, for instance, the associated communication network and the energy market. A data model for smart grid topologies based on existing standards is therefore introduced and validated in a co-simulation environment. In addition, an approach is presented that automatically generates a software library from the specification of the data model, followed by an approach that uses this library to convert topological data into various simulator-specific system models. All presented approaches were implemented in publicly accessible open-source software projects.
ISBN 978-3-942789-80-6