Portavita Benchmark: A Dataset Generator for Healthcare

Editors: Albana Gaba, Yeb Havinga, Tom van der Weide
License: Creative Commons, Attribution-ShareAlike
Date: February 3, 2015

Contributors: Albana Gaba, Yeb Havinga, Tom van der Weide, Jasper Visser, Evert Jan Hoijtink, Heimen Brons, Jan Willem Kijne, Pieter Spoelstra (Portavita) - Willem Dijksta, Fabian Walraven (MGRID)

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 318633.

Contents

1. Summary
2. Introduction
3. Background on healthcare information model standards
    3.1. HL7 Version 3
    3.2. Clinical Document Architecture (CDA)
    3.3. Fast Healthcare Interoperability Resources (FHIR)
4. Building data models
    4.1. Overview of Portavita’s data representation
        4.1.1. Organizations
        4.1.2. Roles
        4.1.3. Treatments
        4.1.4. Examinations
    4.2. Modeling Clinical Data
        4.2.1. Modeling organizations
        4.2.2. Modeling Patients
        4.2.3. Modeling examinations frequency
        4.2.4. Modeling Examinations
        4.2.5. The problem of missing values
    4.3. Limitations
5. Dataset generator
    5.1. Synthetic data generation process
    5.2. Validation
        5.2.1. Single variable
        5.2.2. Multi-variable
    5.3. Performance evaluation
        5.3.1. Portavita Benchmark v1
        5.3.2. Portavita Benchmark v1: Performance evaluation
        5.3.3. Portavita Benchmark v2
        5.3.4. Portavita Benchmark v2: Performance evaluation
        5.3.5. Conclusions on performance
6. Benchmark queries
    6.1. Queries overview
7. Conclusions
A. Clinical Document Architecture
B. List of Observations per Examination
Bibliography

List of Figures

3.1. Simplified representation of the RIM classes.
3.2. UML model of the FHIR Observation resource.
4.1. Data generation process.
4.2. Organization hierarchy.
4.3. Treatment representation.
4.4. An examination representation.
5.1. Steps for generating a synthetic dataset with only one organization.
5.2. Discrete real data
5.3. Discrete synthetic data
5.4. Continuous real data
5.5. Continuous synthetic data
5.6. Discrete real data
5.7. Discrete synthetic data
5.8. Continuous real data
5.9. Continuous synthetic data
5.10. Imputed missing real data
5.11. Imputed missing synth data
5.12. Co-missing real data
5.13. Co-missing synthetic data
5.14. The components involved in the batch processing architecture of Portavita Benchmark v1.
5.15. Performance of Portavita Benchmark v1 grouped by component. Time required to create 1GB of synthetic data for various dataset sizes.
5.16. Portavita Benchmark v2. Micro-batch architecture.
5.17. Performance of Portavita Benchmark v2. Time required to generate 1GB of data.
5.18. Comparative performance evaluation. Amount of documents loaded per second for v1 and v2.

List of Tables

4.1. Data used to model a patient in relation to a treatment.
4.2. Data used to model the distribution of examinations per patient for each year.
4.3. An example of lab examinations. Each row represents a lab examination instance.

1. Summary

In this document we introduce Portavita Benchmark [6], a data generator for benchmarking on healthcare data. The generator is based on statistical models of anonymized clinical data from Portavita’s care management system. It generates organizations, practitioners and nearly 50 different kinds of examinations that consist of 940 different kinds of observations. The data generated is fully compliant with the HL7 healthcare interoperability standards. Portavita Benchmark includes both clinical document generation and transformation to relational database persistence, and generates up to 1TB/hour of clinical documents. Benchmark queries are included for database performance measurements.

2. Introduction

Developing effective techniques for processing and storing large volumes of medical data requires evaluating and comparing the performance of the systems that implement them. In this document we introduce Portavita Benchmark, the first dataset generator specific to healthcare data. It uses real clinical data for building models, which are then used to create arbitrarily large synthetic datasets.

A notable aspect that distinguishes medical data from other data is the use of domain-specific conventions, typically put forward by standards organizations. Standards are instrumental because health data is exchanged between various healthcare organizations, or even between various departments of the same organization. They provide, among other things, well-defined information models and specific names for each medical concept.

Portavita Benchmark is composed of two parts: data modeling based on existing clinical data, and data generation. In the first part we see how data is structured in the Portavita care management system, which is compliant with Health Level 7 (HL7) standards. In the second part, synthetic health records are generated in two HL7 exchange formats, namely CDA and FHIR. This process is followed by the transformation and storage of these documents into a PostgreSQL DBMS. Finally, we provide a brief overview of the queries that come with the Portavita Benchmark for benchmarking purposes.

3. Background on healthcare information model standards

Standards play a fundamental role in facilitating interoperability between healthcare organizations spread across different countries. They determine common ways to model data and to express domain-specific concepts.

In this chapter we introduce three important Health Level 7 information models which are widely adopted by the healthcare community. Portavita Benchmark relies heavily on them, as we will see in the next chapters.

3.1. HL7 Version 3

HL7 Version 3 provides an object-oriented development methodology based on a reference information model (RIM). The RIM is an essential part of HL7 Version 3, as it provides a universal information model for healthcare interoperability, covering the entire healthcare domain [2].

In the RIM there are six high-level concepts to describe all clinical data: entities, roles, acts, participations, role links, and act relationships, as shown in Figure 3.1. There are several specializations of the main classes. We briefly describe the most important classes that we use in our data model.

[Figure 3.1: class diagram. Entity (classCode, name), specialized by Person and Organization. Role (classCode, code, effectiveTime, confidentialityCode), specialized by Employee, Patient (VIPCode) and AssignedEntity. Participation (typeCode, effectiveTime). Act (classCode, moodCode, effectiveTime, confidentialityCode, statusCode, negationInd), specialized by CareProvision, Observation and SubstanceAdministration. ActRelationship. The main classes are connected by 1 to 0..n associations.]

Figure 3.1.: Simplified representation of the RIM classes.

Entities can be organizations or persons. A role can be played by an entity and scoped in another entity. For example, patient is a role that is played by a person (an entity) and is scoped by an organization (also an entity). Acts describe events. For example, observations and examinations are acts. A role participates in acts. For example, a patient role can participate as a subject in an observation and a practitioner role can participate as a performer in an observation. Specializations, e.g., Observation, inherit properties from the class they specialize, e.g., Act, while also having their own specific attributes.
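The relationships described above can be sketched in a few lines of Python. This is only an illustrative model, not the RIM itself or the Portavita schema: the class layout is reduced to the attributes needed for the example, and the helper `participants` as well as the sample names and codes are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    class_code: str                    # illustrative: "PSN" person, "ORG" organization
    name: str


@dataclass
class Role:
    class_code: str                    # illustrative: "PAT" for a patient role
    player: Entity                     # the entity that plays the role
    scoper: Optional[Entity] = None    # the entity that scopes the role


@dataclass
class Participation:
    type_code: str                     # illustrative: "SBJ" subject, "PRF" performer
    role: Role


@dataclass
class Act:
    class_code: str
    mood_code: str
    participations: List[Participation] = field(default_factory=list)


@dataclass
class Observation(Act):
    # A specialization inherits the Act attributes and adds its own.
    code: str = ""
    value: Optional[float] = None


def participants(act: Act, type_code: str) -> List[str]:
    """Names of the entities participating in an act with the given type."""
    return [p.role.player.name for p in act.participations
            if p.type_code == type_code]
```

A patient role played by a person and scoped by an organization can then take part in an observation as its subject, while a practitioner role takes part as its performer.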

3.2. Clinical Document Architecture (CDA)

Like any clinical documentation, the Clinical Document Architecture (CDA) is used to describe the care provided to a patient, to maintain a patient’s medical record, and to exchange information between healthcare providers. CDA is an XML-based markup standard built on the HL7 Version 3 RIM and, as such, it fully maps clinical data modeled with the RIM.

A CDA document consists of two parts. The header contains contextual information, such as the patient it applies to, the organization and the person who wrote it, and the time when the document was written. The body contains human-readable narrative text and optional structured clinical statements, including act, observation, substance administration, encounter, procedure, organizer and supply [3]. Appendix A shows an example of a CDA document.

3.3. Fast Healthcare Interoperability Resources (FHIR)

FHIR has been introduced by HL7 as a “next generation standards healthcare framework”. It combines the best features of HL7’s Version 2, Version 3 and CDA product lines while applying a tight focus on easing the exchange of clinical documents and on implementability.

For example, FHIR provides highly modular components called “Resources”, which can be easily assembled so that they fully represent clinical data. Most resource elements and data type properties include mappings to the RIM. So, just like the RIM components, FHIR provides resources for modeling RIM classes, such as Observation, Patient, Organization and so on.

Figure 3.2 shows the UML diagram of an Observation FHIR resource. Like an observation RIM class, it has references to the patient, the performer and the organization and, among other fields, it includes the observation code and the corresponding value.

FHIR resources are easily accessible in a wide variety of contexts, including mobile phone applications. In particular, FHIR additionally provides a RESTful API which defines a set of common interactions performed on a repository of typed resources (read, update, search, etc.). These interactions follow the RESTful paradigm of managing state by Create/Read/Update/Delete actions on a set of identified resources [4].
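To make the idea of Create/Read/Update/Delete interactions on a repository of typed, identified resources concrete, the following is a minimal in-memory sketch. It is not the FHIR API and performs no HTTP calls; the class `ResourceRepository`, its method names and the dictionary layout are assumptions made for illustration only.

```python
import itertools


class ResourceRepository:
    """In-memory stand-in for a repository of typed, identified resources."""

    def __init__(self):
        self._store = {}                  # (resource_type, id) -> resource dict
        self._ids = itertools.count(1)    # simple id generator (an assumption)

    def create(self, resource_type, body):
        rid = str(next(self._ids))
        self._store[(resource_type, rid)] = dict(
            body, resourceType=resource_type, id=rid)
        return rid

    def read(self, resource_type, rid):
        return self._store[(resource_type, rid)]

    def update(self, resource_type, rid, body):
        if (resource_type, rid) not in self._store:
            raise KeyError((resource_type, rid))
        self._store[(resource_type, rid)] = dict(
            body, resourceType=resource_type, id=rid)

    def delete(self, resource_type, rid):
        del self._store[(resource_type, rid)]

    def search(self, resource_type, **criteria):
        """Return resources of one type whose fields equal all criteria."""
        return [r for (t, _), r in self._store.items()
                if t == resource_type
                and all(r.get(k) == v for k, v in criteria.items())]
```

Each interaction addresses a resource by its type and identifier, which is the property the RESTful paradigm relies on.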


Figure 3.2.: UML model of the FHIR Observation resource.


4. Building data models

An important aspect of Portavita Benchmark is that it generates data that resembles real clinical data. Figure 4.1 shows the process Portavita Benchmark uses to generate synthetic datasets based on real data. First, we retrieve the necessary data, according to the models we aim to create, and aggregate them into CSV documents. Next, the aggregated data is used to create models by training Bayesian Networks. Finally, the generator uses these models to create arbitrarily large datasets of CDA and FHIR documents.

We start by introducing the way data is originally structured in Portavita. This will help in understanding the structure of the synthetic datasets later.

[Figure 4.1: flow diagram. Portavita Database -> Aggregate -> Create Bayesian Networks -> Dataset generator -> CDA / FHIR]

Figure 4.1.: Data generation process.

4.1. Overview of Portavita’s data representation

Portavita provides a care management system for treating patients with chronic conditions. The treatments covered by Portavita are numerous, but for the purpose of Portavita Benchmark we consider only data from diabetes, COPD (Chronic Obstructive Pulmonary Disease) and CVRM (CardioVascular Risk Management) treatments, which are the ones with the highest number of records.

Portavita uses HL7 Version 3 RIM data models to represent clinical data. This section gives a short overview of the main data concepts, which are also represented in the data generated by Portavita Benchmark.

4.1.1. Organizations

The organizations of Portavita’s customers are structured as hierarchies. For Portavita, a top-level organization is called a care group. A care group has a number of sub-organizations, each of which can have sub-organizations of its own. An example hierarchy is given in Figure 4.2.

[Figure 4.2: tree diagram with a care group at the root and GP surgeries, a pharmacy and a dietitian as sub-organizations.]

Figure 4.2.: Organization hierarchy.

4.1.2. Roles

There is a large variety of roles played by users in Portavita’s system. Here are some of the most relevant for Portavita Benchmark.

• A care group employee is a role played by users that work for the care group and act as superusers. One of their main activities is monitoring organizations within the care group.

• A quality employee is a role that a user can have, which allows them to compare practitioners within an organization or set of organizations.

• A practitioner is a care provider that can be active in the treatment of patients.

• A patient is a person who receives treatment.

• A researcher is a person who is granted access to part of the data to perform research.

Note that a person can play multiple roles inside the organization. For example, a practitioner could also be a patient within the same organization.

4.1.3. Treatments

When a new patient is entered in the system, a treatment, such as diabetes or CVRM, is assigned to them. Every treatment has at least two participations: a subject participation (i.e., the patient), and a performer participation (i.e., the principal practitioner). The principal practitioner has the final responsibility for the treatment. The diagram in Figure 4.3 illustrates the relationships between a treatment (a type of act), examinations (also acts) performed in the context of a treatment, and the patient and principal practitioner participations in the treatment. It is important to note that each observation performed on a patient is typically linked to a care treatment the patient is part of.

[Figure 4.3: diagram of a Treatment with Patient and Principal practitioner participations and three linked Examinations.]

Figure 4.3.: Treatment representation.

4.1.4. Examinations

Examinations make up the largest part of the data in Portavita’s database. They are typically performed in the context of a treatment. For example, in a diabetes treatment, a patient typically undergoes examinations on a regular basis, such as yearly checkups, foot examinations, eye checkups and so on. The general structure of an examination is shown in Figure 4.4. An examination has at least four participations: a subject participation (the patient), a performer participation (the practitioner who performs the examination), a data enterer participation (the person who enters the data into the system), and a legal authenticator participation (the practitioner who approves the data that was entered). Most often, though, a single practitioner participates in the roles of performer, data enterer and legal authenticator.

[Figure 4.4: diagram of an Examination with Patient, Performer, Data enterer and Legal authenticator participations. The examination contains Observations directly and Organizers that group further Observations.]

Figure 4.4.: An examination representation.

Examinations consist of simple observations and organizers, which contain observations grouped together in meaningful sets. Typically, an examination has a small number of observations that are mandatory. However, larger examinations may consist of up to a hundred different observations. Often, certain observations can only be entered if certain other observations have been made. For example, the performer can enter details about a skin rash only if the patient has a skin rash.

4.2. Modeling Clinical Data

We create models of clinical data by training Bayesian Networks on Portavita production data. Bayesian Networks capture dependencies between variables and the distribution of their values. For each concept to be modeled, we aggregate the necessary data by querying the production Portavita database. The queries are performed on anonymized patient data. We create models for the following concepts: organization, patient, examination, and treatment.

We use the R package bnlearn for training Bayesian networks. To train Bayesian Networks it is important to differentiate between discrete and continuous variables. For example, a discrete variable is used to express whether a patient smokes, with possible values ‘yes’, ‘no’, or ‘in the past’. A continuous variable is, for instance, HbA1c, which takes values within a certain range. Integer numbers (e.g., number of days) are treated as continuous values by bnlearn.

4.2.1. Modeling organizations

To generate an organization with a certain patient population size and number of practitioners, we learn a Gaussian mixture from the sizes of the actual organizations from one care group. For this, we use the R package mixtools.
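Once a Gaussian mixture has been fitted, drawing an organization size amounts to picking a component by its weight and sampling from that component. The sketch below illustrates this step only; the component parameters are invented for the example, and the rounding and the lower bound of 1 are assumptions, not something the benchmark is stated to do.

```python
import random


def sample_organization_size(components, rng):
    """Draw one size from a Gaussian mixture.

    components: list of (weight, mean, stddev) tuples, as a fitted
    mixture model would provide them.
    """
    weights = [w for w, _, _ in components]
    # Pick a component with probability proportional to its weight.
    _, mean, stddev = rng.choices(components, weights=weights, k=1)[0]
    # Assumption: round to a whole number of patients and keep at least 1.
    return max(1, round(rng.gauss(mean, stddev)))


# Hypothetical mixture: many small practices, a few large ones.
EXAMPLE_COMPONENTS = [(0.7, 200.0, 50.0), (0.3, 2000.0, 400.0)]
```

The same function could be used for the number of practitioners by supplying a mixture fitted to practitioner counts.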

4.2.2. Modeling Patients

A patient is represented by their age, the treatment they are part of, and the duration of that treatment. Therefore, for each treatment we train Bayesian networks using the variables shown in Table 4.1.

Field                        Type
Patient Age (days)           Continuous
Treatment                    Discrete
Treatment duration (days)    Continuous

Table 4.1.: Data used to model a patient in relation to a treatment.

4.2.3. Modeling examinations frequency

To capture the dependency between the age of the patient and the kind of examinations that are performed in one year, we train Bayesian networks using the patient’s age and, for each type of examination, the number of examinations that were performed in that year. Table 4.2 shows the data used to model examinations.

Field                 Type
Patient Age (days)    Continuous
# Examinations A      Continuous
# Examinations B      Continuous
...                   ...

Table 4.2.: Data used to model the distribution of examinations per patient for each year.

4.2.4. Modeling Examinations

There are about 50 examinations in Portavita’s system for the treatments we have considered, such as yearly checkup, foot checkup, lab examinations and so on. A list of examinations is provided in Appendix B. We create a model for each of them. For every kind of examination that can be entered into the system, a query is performed that returns a table with a column for each possible observation value. For every examination instance, a single row is returned, which contains a cell for every observation that could be measured within that examination.

As an example, let us look at the “Lab examination”. The blood of the patient must be examined on a regular basis. When performing a lab examination, a blood sample is collected and sent to the lab. The lab examines the blood on a subset of around 30 measurable variables, such as HbA1c and Cholesterol. Not all variables are always measured, only those that were requested. Furthermore, the data contain both continuous and discrete variables, as well as missing values, as shown in Table 4.3.

HbA1c HDL LDL Triglyceride .. Albumin

47 4.7 .. 274.5 3.9 ..

4.7 .. 31.. .. .. .. .. ..

Table 4.3.: An example of lab examinations. Each row represents a lab examination instance.
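The table layout described above, one row per examination instance and one column per possible observation, can be sketched as a simple pivot. The input format (a flat list of examination id, observation code and value) and the function name are assumptions for illustration; the actual aggregation is done by database queries that are not shown in this document.

```python
def pivot_examinations(observations, columns):
    """Build one row per examination with a cell for every possible observation.

    observations: iterable of (examination_id, observation_code, value).
    columns: all observation codes that can occur in this kind of examination.
    Cells for observations that were not measured stay None (missing).
    """
    rows = {}
    for exam_id, code, value in observations:
        row = rows.setdefault(exam_id, {c: None for c in columns})
        if code in row:            # ignore codes outside this examination type
            row[code] = value
    return rows
```

For a lab examination in which only HbA1c and HDL were requested, the remaining cells of that row are missing values.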

The R package bnlearn cannot be used to learn hybrid models, i.e., models containing both discrete and continuous variables. We have tried other packages, such as deal, but without success: the number of variables was too large (e.g. 80), so the package was unable to allocate an array of the right size. Because the data consist of discrete and continuous values, we came up with the following solution. For every examination we train three Bayesian networks:

• a network for discrete values (missing values are represented explicitly),
• a network for continuous values, and
• a network for the missing values in continuous variables.

Note that missing values for discrete variables can be seen as simply another value, but that for continuous variables this is not the case. To generate examinations that have similar patterns of missing values, we therefore must train another network for the patterns in the missing data of continuous variables.
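The split into three training tables can be illustrated as follows. This is a sketch under assumptions: rows are dictionaries with None for missing cells, the label "missing" for discrete variables is a placeholder chosen here, and the actual preparation in the benchmark is done in R.

```python
def split_for_training(rows, discrete_cols, continuous_cols):
    """Derive the three tables used to train the three networks.

    Returns (discrete, continuous, missing_pattern):
    - discrete: missing cells become the explicit value "missing"
    - continuous: numeric cells, still containing None where not measured
    - missing_pattern: True where a continuous cell is missing
    """
    discrete = [
        {c: ("missing" if r.get(c) is None else r[c]) for c in discrete_cols}
        for r in rows
    ]
    continuous = [{c: r.get(c) for c in continuous_cols} for r in rows]
    missing_pattern = [
        {c: r.get(c) is None for c in continuous_cols} for r in rows
    ]
    return discrete, continuous, missing_pattern
```

The continuous table still contains gaps at this point, which is why imputation is needed before training.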

4.2.5. The problem of missing values

Because the input data contain many missing values and many machine learning algorithms cannot deal with missing values, it is important to impute the data. To keep the performance within reasonable levels, we use two packages in R for imputation: mice and imputation. We start with the mice imputation method called “norm.predict”. This method calculates regression weights from the observed data. However, the result of this method may still contain missing values. If this is the case, then we continue with the imputation package. If there are more than 1000 rows in the data, then we use the method gbmImpute, which uses boosted regression trees for each column x to predict x using all other columns except x. gbmImpute is only used when the dataset is large enough, because otherwise it does not work well. If there are fewer than 1000 rows, then we use lmImpute, which fills missing values in a column by running locally weighted least squares regression.
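The control flow of this two-stage imputation can be summarized in a short sketch. The imputation methods themselves are passed in as callables and are not implemented here; the labels "boosted_trees" and "local_regression" are placeholders for the two fallback methods, and treating exactly 1000 rows as the small case is an assumption, since the text does not specify it.

```python
def has_missing(rows):
    """True if any cell in any row is None."""
    return any(v is None for r in rows for v in r.values())


def choose_fallback(n_rows, threshold=1000):
    """Pick the fallback method by dataset size.

    Assumption: exactly `threshold` rows counts as the small case.
    """
    return "boosted_trees" if n_rows > threshold else "local_regression"


def impute(rows, first_pass, fallbacks):
    """Run the first-pass method, then a size-dependent fallback if needed.

    first_pass: callable taking and returning a list of row dicts.
    fallbacks: dict mapping the labels above to such callables.
    """
    rows = first_pass(rows)
    if has_missing(rows):
        rows = fallbacks[choose_fallback(len(rows))](rows)
    return rows
```

Keeping the methods pluggable makes the size threshold the only decision encoded in this sketch.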

4.3. Limitations

The models created to represent the original data determine the quality of the synthetic dataset. Therefore, it is worth recalling the limitations of the models built throughout this stage.

Hybrid Bayesian Networks: We train separate networks for continuous (numeric) values, discrete values and missing values. As a result, within an examination there may be discrete variables that are inconsistent with continuous values. For example, the discrete variable “isSmoker” may be “no”, while the number of cigarettes smoked per day may be a non-zero number.

Time-based observations per patient: Subsequent examinations related to a single patient are not correlated. This means that observations concerning a single patient over time are not consistent.

Imputed missing values: Since a number of observations were imputed, the accuracy of the models including such observations may be affected. As a consequence, the values generated may be less representative of the original data.

Natural numbers: The models do not distinguish between natural and real numbers. As a result, all generated numeric data are real numbers. For instance, smoking daily units are natural numbers in the original data, but real numbers in the synthetic dataset. Another consequence is that negative numbers are created in the synthetic dataset, even when the data type in the original dataset has no negative values.
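The natural-number limitation can be reproduced with a deliberately simplified stand-in for the continuous model: a single Gaussian fitted to one numeric variable. This is an assumption made to keep the illustration small; the benchmark trains Bayesian networks over many variables. The sample data are invented.

```python
import random
import statistics


def fit_and_sample(values, n, rng):
    """Fit a single Gaussian to the values and draw n synthetic samples."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Invented example: cigarettes per day, natural numbers only.
real_units = [0, 0, 0, 1, 2, 5, 10, 20]
synthetic_units = fit_and_sample(real_units, 1000, random.Random(1))

# The synthetic values are real numbers, and some fall below zero.
n_negative = sum(1 for x in synthetic_units if x < 0)
n_non_integer = sum(1 for x in synthetic_units if x != int(x))
```

Both counters end up well above zero, even though the input contains only non-negative whole numbers.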


5. Dataset generator

5.1. Synthetic data generation process

The generator uses the model of a healthcare organization, as described in Section 4.2.1, to assign the number of patients and practitioners to a synthetically generated organization. This way, the size of a generated dataset is determined by the number of organizations that are required to be generated (numorganizations).

Figure 5.1 depicts the consecutive steps to construct synthetic healthcare information. After an organization is created, it is assigned a set of patients and practitioners. Next, each patient p is assigned a number of treatments. For each treatment of patient p, a number of examinations is generated, distributed over the duration of the treatment. Finally, each examination is assigned a practitioner chosen at random from the set of practitioners of the organization and, based on the model of the examination, a number of observation types with their corresponding values.

[Figure 5.1: sequence of steps. 1. Organization, 2. Practitioners, 3. Patients, 4. Treatment, 5. Examinations, 6. Observations]

Figure 5.1.: Steps for generating a synthetic dataset with only one organization.

The patients and the organizations created by Portavita Benchmark are FHIR resources, while the rest, examinations and observations, are CDA documents. Both the CDA documents and the FHIR resources are generated as XML.
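The generation steps for a single organization can be sketched as nested loops. Every distribution in this sketch is a uniform placeholder standing in for the trained models, the identifiers and dictionary layout are invented for the example, and observation values and the CDA/FHIR serialization are left out.

```python
import random

TREATMENTS = ["diabetes", "COPD", "CVRM"]


def generate_organization(org_id, rng, n_patients, n_practitioners, exam_types):
    """Generate examination records for one synthetic organization."""
    # Steps 1-2: the organization and its practitioners.
    practitioners = [f"org{org_id}-prac{i}" for i in range(n_practitioners)]
    documents = []
    # Step 3: patients.
    for p in range(n_patients):
        patient = f"org{org_id}-pat{p}"
        # Step 4: treatments per patient (placeholder: one or two, at random).
        for treatment in rng.sample(TREATMENTS, k=rng.randint(1, 2)):
            duration_days = rng.randint(30, 3650)      # placeholder duration
            # Step 5: examinations spread over the treatment duration.
            for _ in range(rng.randint(1, 5)):
                documents.append({
                    "patient": patient,
                    "treatment": treatment,
                    "examination": rng.choice(exam_types),
                    "day": rng.randint(0, duration_days),
                    # A practitioner of the organization, chosen at random.
                    "performer": rng.choice(practitioners),
                })
    return documents
```

Generating a larger dataset then means calling this function once per organization, with sizes drawn from the organization model.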

5.2. Validation

In order to validate that the dataset generator produces meaningful healthcare information, we compare synthetic data with real healthcare data.

Figure 5.2.: Discrete real data
Figure 5.3.: Discrete synthetic data

Taking into account the way in which separate Bayesian Networks are used to model dependencies between variables, we can expect the following structure in the synthetic data:

• Correlation between continuous variables in the same examination instance. Refer to Section 4.1.4 for a definition of an examination and Appendix B for lists of observations that occur within an examination.

• Correlation between discrete variables in the same examination instance.

• Percentage of missing values and correlation between occurrences of missing values.

The synthetic dataset generator does not create the full statistical structure that is present in the real data. See Section 4.3 for a discussion of the limitations of the models. Two kinds of correlations that are not present are worth mentioning:

• Correlation between different instances of examinations. This means that the synthetic data can show a patient with an irreversible complication, such as retinopathy, at one examination and without it at the following examination. A consequence is that aggregation will cause loss of correlation between variables. For instance, there will be no correlation between yearly averages of systolic and diastolic blood pressures.

• Correlation between discrete and continuous variables, such as the discrete value ‘smoking y/n’ and the numeric variable ‘amount of daily smoking units’. Since the Bayesian network toolkit we used does not support hybrid networks, separate networks for discrete and continuous variables are used to model dependencies, as described in Section 4.2.4.

We compare the real and synthetic datasets in four ways, as described in the next two sections. We compare single variables for both discrete and continuous attributes, and we compare dependencies between pairs of variables. We use Orange v2.7 [5] to analyze and visualize the data.

5.2.1. Single variable

The histograms in Figures 5.2 and 5.3 show the frequency of values for the discrete variable wellbeing in the real and synthetic datasets. Visual inspection reveals that the distribution of values is similar.

Figure 5.4.: Continuous real data
Figure 5.5.: Continuous synthetic data

Figure 5.6.: Discrete real data
Figure 5.7.: Discrete synthetic data

For continuous variables, Figures 5.4 and 5.5 show box plots of the blood pressure in the real and synthetic data. We can see that the mean and standard deviation of the synthetic and real data are similar. The box plots also reveal a difference in the number of distinct values. As described in Section 4.3, the generator makes no distinction between natural and real numbers and treats all numeric data as real numbers. As a consequence, almost every value in the synthetic dataset is unique, whereas observations with a natural number domain in the real dataset contain fewer distinct values.
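The effect on the number of distinct values can be illustrated with a small sketch. This is not the generator's code: the mean of 130 and standard deviation of 15 are arbitrary illustration values, and rounding is only a stand-in for a real dataset that records whole numbers.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical blood pressure samples drawn as real numbers,
# mimicking a generator that treats all numeric data as real.
synthetic = [random.gauss(130.0, 15.0) for _ in range(1000)]

# Stand-in for real data recorded in a natural number domain.
recorded = [round(value) for value in synthetic]

distinct_synthetic = len(set(synthetic))
distinct_recorded = len(set(recorded))

print("distinct real-valued samples:", distinct_synthetic)
print("distinct whole-number samples:", distinct_recorded)
```

Both lists have the same spread, but the real-valued list contains far more distinct values, which is the difference the box plots make visible.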

5.2.2. Multi-variable

The mosaic diagrams in Figures 5.6 and 5.7 give insight into co-occurrences of pairs of values for the discrete attributes exercise and wellbeing. The size of each area indicates the number of samples with the corresponding values in the dataset. Both graphs show a similar structure for all combinations of values.

For continuous variables, correlation between variables is shown using scatter plots. Figures 5.8 and 5.9 show the correlation between systolic and diastolic blood pressure.
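The report relies on visual inspection of the scatter plots. As a complement, a correlation coefficient could be computed for each dataset and compared. The following is a minimal sketch of the Pearson coefficient; the sample values are invented for illustration and do not come from the real or synthetic dataset.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Invented example readings in mm Hg (not taken from the report).
systolic = [120.0, 135.0, 150.0, 110.0, 142.0]
diastolic = [80.0, 85.0, 95.0, 70.0, 90.0]

r = pearson(systolic, diastolic)
print(f"Pearson r = {r:.3f}")
```

Running the same function on the real and on the synthetic blood pressure columns would give two numbers that can be compared directly.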


Figure 5.8.: Continuous real data Figure 5.9.: Continuous synthetic data

Figure 5.10.: Imputed missing real data Figure 5.11.: Imputed missing synth data

Finally, we consider the percentage and co-occurrence of missing values for continuous data.¹ To visualize missing values, we impute missing values for systolic and diastolic blood pressure with the value 300. Figures 5.10 and 5.11 compare the frequencies of missing values in the real and synthetic dataset for 330 samples. We can see that both datasets show a comparable number of missing values for the systolic blood pressure.

Besides the amount of missing data, co-occurrences of missing values in the real dataset should also be reflected in the synthetic data. Again we use data with missing values imputed to the value 300. Figures 5.10 and 5.11 show systolic blood pressure plotted against diastolic blood pressure. The presence of a dot at the point (300, 300) in the graph, with no other dots on the x = 300 or y = 300 lines, indicates that a missing value for systolic blood pressure is always matched with a missing value for diastolic blood pressure, in both the real and synthetic dataset.

¹ For discrete data, missing values are modeled with an additional nominal in the value domain, and hence require no additional validation.
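The check described here can also be expressed in a few lines. The sketch below assumes missing readings are represented as None before imputation; the sample pairs and function names are invented for illustration.

```python
from typing import List, Optional, Tuple

MISSING = 300.0  # sentinel value used for imputation in the text

RawPair = Tuple[Optional[float], Optional[float]]
Point = Tuple[float, float]


def impute(pairs: List[RawPair]) -> List[Point]:
    """Replace missing systolic or diastolic readings by the sentinel."""
    return [
        (MISSING if sys is None else sys, MISSING if dia is None else dia)
        for sys, dia in pairs
    ]


def always_co_missing(points: List[Point]) -> bool:
    """True if every point on x = 300 or y = 300 is exactly (300, 300)."""
    return all((sys == MISSING) == (dia == MISSING) for sys, dia in points)


# Invented (systolic, diastolic) samples; None marks a missing reading.
samples: List[RawPair] = [(128.0, 82.0), (None, None), (141.5, 90.0), (None, None)]
points = impute(samples)
print(points, always_co_missing(points))
```

A dataset in which only one of the two readings is missing would produce a point on one of the two lines away from (300, 300), and the check would return False.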


Figure 5.12.: Co-missing real data Figure 5.13.: Co-missing synthetic data

5.3. Performance evaluation

To perform any operation on the dataset created by the generator, it is necessary to transform and store CDA/FHIR documents in a relational database. To this end, Portavita has delivered two versions of the database generator. This section describes architectural differences between the two versions, and provides results of tests that compare the speed of data generation.

5.3.1. Portavita Benchmark v1

As shown in Figure 5.14, Portavita Benchmark v1 consists of the following components:

1. Clinical Document Architecture and FHIR XML message generator (genxml)
2. Message converter from XML to SQL (xml2sql)
3. Loading SQL documents in a staging database (sql2db)
4. Update statistics used by the PostgreSQL planner to determine the most efficient way to execute a query (vacuumanalyze)
5. Transformation of staging data to dimensional warehouse format (transform2dimentional)
6. Loading the dimensional warehouse format to the final database (copy2dwh)

[Figure: pipeline diagram with the stages CDA/FHIR-generator (GENXML), XML2SQL, SQL2DB and Transform*, ending in the Dimensional Data Warehouse; regions are labelled "Files on disk" and "In DB". *Transform to dimensional Data Warehouse.]

Figure 5.14.: The components involved in the batch processing architecture of Portavita Benchmark v1.


The v1 database generator operates in batches: first, all XML documents are created on the file system in step 1. Then the message converter reads the XML files and produces SQL scripts that can be run against a database, and so forth until the last step, which copies data from the staging database to the final database.

The batch-wise approach has the following drawbacks:

• Some of the steps are single-threaded. The benchmark results show that the sequential steps dominate the generation time of large datasets.

• The larger the batches are, the harder it gets to keep the state between the various batches. For instance, using /tmp as temporary storage between steps will cause problems if /tmp is on the root filesystem with limited space.

5.3.2. Portavita Benchmark v1 — Performance evaluation

We measured the time it takes to create datasets of different sizes. These tests were performed on the AXLE Manchester server with the PostgreSQL 9.5 development version from December 11th, 2014. The specifications of this server are:

• 8 x 8 Intel(R) Xeon(R) CPU E5-4620 @ 2.20GHz
• 256GB RAM

Each data point represents the average result of at least two executions with the same parameters.

Figure 5.15.: Performance of Portavita Benchmark v1 grouped by component. Time required to create 1GB of synthetic data for various dataset sizes.


Figure 5.15 shows the time it takes to create 1 GB in the database for scalings of 100,000, 500,000, 2,500,000, 5,000,000 and 10,000,000 XML documents. These database scalings conform to the requests made by AXLE partners during internal mailing-list discussions.

As Figure 5.15 shows, genxml and vacuumanalyze are the least resource-intensive components. In particular, the XML generator genxml requires ca. 3 seconds per GB, which translates to over 1TB/hour of XML data. Most CPU time is spent on tasks 2, 3 and 5. Portavita Benchmark v2 focuses on improving the performance of these tasks, as we will see in the next section.

5.3.3. Portavita Benchmark v2

The design of Portavita Benchmark v2 focused on addressing the shortcomings of v1 as follows:

1. Redesign of the architecture from a batch-oriented to a near-real-time streaming architecture using micro-batches. The purpose-built multidimensional star schema model from v1 was removed; in v2 the HL7v3 RIM model is used directly as the source atomic data of the data warehouse. This resulted in the elimination of the sequential step 5 ‘transform2dimensional’.

2. Steps 2 and 3 were already parallelized on a single node using GNU parallel. We analyzed the performance of each component and mitigated performance bottlenecks. In addition, we added a scale-out option for steps 2 and 3, to go beyond single-node performance.

[Figure: diagram with the CDA/FHIR-generator (GENXML), three parallel XML2SQL/SQL2DB pairs, and the Data Lake.]

Figure 5.16.: Portavita Benchmark v2. Micro-batch architecture.

Together these two design decisions should lead to faster creation of synthetic databases. Nonetheless, scale-out of the system also introduces a new component, the RabbitMQ message broker, and with the broker come new configuration and flow control options that require tuning and monitoring to reach maximum throughput.
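The difference between batch and micro-batch processing can be sketched as follows. This is an illustration only, not the v2 implementation: the functions genxml and xml2sql are hypothetical stand-ins named after the components in the text, no message broker is involved, and a plain list takes the place of the database.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def genxml(count: int) -> Iterator[str]:
    """Hypothetical stand-in for the generator: yields documents lazily."""
    for i in range(count):
        yield f'<ClinicalDocument id="{i}"/>'


def micro_batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a stream into small batches without materialising all of it."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def xml2sql(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for the converter: one SQL line per document."""
    return [f"-- SQL for {doc}" for doc in batch]


loaded: List[str] = []      # stand-in for the database (sql2db step)
batch_sizes: List[int] = []

for batch in micro_batches(genxml(10), size=4):
    batch_sizes.append(len(batch))
    loaded.extend(xml2sql(batch))  # each batch is converted and loaded at once

print(batch_sizes, len(loaded))
```

In a batch design every stage would finish all documents before the next stage starts; here each small batch flows through all stages before the next one is generated, so intermediate storage stays small.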

5.3.4. Portavita Benchmark v2 — Performance evaluation

In Portavita Benchmark v2 we use a different way to configure the amount of data generated. Unlike v1, where we specify the number of documents we wish to generate, in v2 we set the number of organizations, which ultimately determines the number of generated documents and hence the database size. Figure 5.17 shows the performance of Portavita Benchmark v2 in terms of the time it takes to generate 1 GB of data for databases of various sizes. The size of the databases is shown by both the number of organizations and the number of documents generated. The graph shows that the generation rate is about 11 hours/TB of data.

[Figure: x-axis: number of organizations (5, 10, 20), corresponding to a total number of documents of 1,248,261, 2,470,915 and 4,991,004; y-axis: time (seconds) / 1 GB, ranging from 0 to 45.]

Figure 5.17.: Performance of Portavita Benchmark v2. Time required to generate 1GB of data.

Since Portavita Benchmark v2 is a streaming architecture based on micro-batches, it is not possible to measure the processing time of each step, as was done for v1. Moreover, as the database format was changed in v2, we can only compare the database generation rate based on the number of documents loaded into the final database per second. Figure 5.18 shows the performance of both v1 and v2 on the same single-node server. The data generation rate of v2 is more than twice that of v1.

[Figure: x-axis: total number of documents (0 to 10,000,000); y-axis: documents/second (0 to 350); series: v1 and v2.]

Figure 5.18.: Comparative performance evaluation. Number of documents loaded per second for v1 and v2.


5.3.5. Conclusions on performance

The core of the AXLE synthetic dataset generator is the XML generator. It is this generator that is most useful as a benchmarking tool for software related to the exchange of healthcare data, since it emits HL7v3 Clinical Document Architecture (CDA) XML and FHIR messages. The XML generator speed is over 1TB/hour. With additional transformation to database format and additional processing, the generation speed is 1TB per 11 hours on a single-node server.
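The unit conversion behind these two rates is simple arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB), which the report does not state explicitly; the function names are illustrative.

```python
def tb_per_hour(seconds_per_gb: float, gb_per_tb: float = 1000.0) -> float:
    """Convert a cost in seconds per GB into a rate in TB per hour."""
    return 3600.0 / seconds_per_gb / gb_per_tb


def hours_per_tb(seconds_per_gb: float, gb_per_tb: float = 1000.0) -> float:
    """Convert a cost in seconds per GB into hours needed per TB."""
    return seconds_per_gb * gb_per_tb / 3600.0


# XML generation: ca. 3 seconds per GB, as reported for genxml.
xml_rate = tb_per_hour(3.0)

# Full pipeline: 11 hours per TB corresponds to this many seconds per GB.
full_seconds_per_gb = 11.0 * 3600.0 / 1000.0

print(f"XML generation: {xml_rate:.1f} TB/hour")
print(f"Full pipeline: {full_seconds_per_gb:.1f} s/GB")
```

Under this assumption, 3 seconds per GB corresponds to 1.2 TB/hour, and 11 hours per TB corresponds to roughly 40 seconds per GB.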


6. Benchmark queries

We provide a number of queries for secondary use, i.e., for reporting and analytics purposes. The full document describing the queries was delivered in June 2014, and the query sources are included in the GitHub repository of the AXLE Healthcare Benchmark [1].

6.1. Queries overview

Cross-Organization comparative analysis: This query compares all the organizations based on the percentage of patients who have had one of the following examinations in the last year: Fundus checkup, Foot checkup, Intermediary checkup, Risk inventory, Diabetes medication, Dietary advice.

Cross-practitioner comparative analysis: This query is similar to the previous one in that it compares on the same indicators, with the only difference that it focuses on the performance of each practitioner within an organization.

Deviation of organization performance: This query reports on organization average values with respect to a number of important patient measurements, such as glucose level, blood pressure and so on. These measurements give a high-level overview of how well the patients within an organization are doing.

Extreme values: This query shows all patients with observation values that have been classified as ‘extreme’. Such patients are shown with some additional data such as gender, age, most recent HbA1c, triglyceride and blood pressure.

Relative extreme values: The definition of an extreme value is often determined by national benchmarks. But there are local patient populations with local averages that deviate considerably from the national average. Considering that the organizations are geographically distributed across The Netherlands, this query reports on the patients that have ‘extreme values’ compared to the average observation values within the organization where these patients are treated.

Abnormal blood pressure or macroangiopathy: This query retrieves all patients that currently have either an abnormal blood pressure (highly related to the age of the patient) or macroangiopathy. Typically, these patients have an increased risk of developing new complications and are monitored closely.

Data Analysis - Influence of medication on HbA1c: HbA1c is a type of blood value that shows glucose levels over longer periods of time. This query shows, for all patients and for all of their medications, the average HbA1c one year before the patient started the initial medication, the average HbA1c in between the initial medication and the new medication, and the average HbA1c one year after the new medication.

Trend Analysis - Trends in the process: This query aims to gain insight into how often examinations are performed across various organizations, and into the trend. For every organization, type of examination, and time period, this query shows the average number of times that the examination was performed per patient in that period in that organization. Six periods of three months each are defined starting from the current date.

Trend Analysis - Trends in smoking: This query shows the trends in smoking per organization. For each of the last 6 periods of half a year, this query reports the number of active patients in the organization, how many of them smoked in that period, and how many ceased smoking in that period.
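Both trend queries depend on a series of fixed-length periods counted from the current date. The sketch below shows one way such windows could be computed. It is an assumption that the periods run backwards from the current date and that the end of each window is exclusive; the actual queries in the repository may define them differently.

```python
import calendar
from datetime import date
from typing import List, Tuple


def shift_months(day: date, months: int) -> date:
    """Move a date by a number of months, clamping to the month's last day."""
    index = day.year * 12 + (day.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def periods(today: date, count: int = 6, months: int = 3) -> List[Tuple[date, date]]:
    """Return (start, end) windows, most recent first; end is exclusive."""
    return [
        (shift_months(today, -(i + 1) * months), shift_months(today, -i * months))
        for i in range(count)
    ]


# Six three-month windows (process trends), counted from an example date.
quarterly = periods(date(2015, 2, 3), count=6, months=3)
# Six half-year windows (smoking trends).
half_yearly = periods(date(2015, 2, 3), count=6, months=6)

for start, end in quarterly:
    print(start, "to", end)
```

Each window can then be used as a date range filter when counting examinations or smoking observations per organization.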


7. Conclusions

In this document we presented Portavita Benchmark, a dataset generator specific to healthcare. The generated data are based on models built upon real health records, and comply with the exchangeable HL7 formats, namely CDA and FHIR (XML-based). Portavita Benchmark borrows libraries from MGRID in order to efficiently transform and store the generated data in a PostgreSQL DBMS.

The synthetic clinical data includes examinations and observations which are assigned to synthetic patients in the context of a diabetes, COPD or CVRM treatment.

The validation of Portavita Benchmark showed that observation values, correlations among various observations, and the occurrence of missing values in the synthetic dataset resemble the original data.

We showed that XML data generation is relatively fast, at over 1TB/hour, compared to the subsequent transformation and storage processes, which bring the total to about 11 hours per TB on a single-node server.


Appendix


Appendix A.

Clinical Document Architecture

The following XML snippet depicts an example of a CDA document. The first part of the CDA consists of contextual information, such as the code to identify the type of document (code), the time it was issued (effectiveTime), the classification level of the document (confidentialityCode), the patient reference (recordTarget) and information about the person who authored, entered and/or authenticated the document. In this case all three roles are covered by the same person.

The second part of the CDA starts with the component structuredBody and has information about the observations performed. In this case, the element organizer contains more contextual information about the observations, such as time and performer, and an additional nested organizer with display name ‘Blood pressure’. This organizer contains two observation elements that have values for the systolic and diastolic measurements.

<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <typeId root="2.16.840.1.113883.1.3" extension="POCD_HD000040"/>
  <id root="2.16.840.1.113883.2.4.3.31.3.1" extension="1aa415ca-61ed-4498-9665-d1523a7477d3"/>
  <code code="68608-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC"
        displayName="Summarization note"/>
  <title>PHR Update</title>
  <effectiveTime value="20130709150524"/>
  <confidentialityCode code="N" codeSystem="2.16.840.1.113883.5.25" codeSystemName="Confidentiality"
                       displayName="Normal"/>
  <recordTarget>
    <patientRole>
      <id nullFlavor="UNK"/>
      <patient>
        <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="71738452"/>
      </patient>
      <providerOrganization>
        <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="206605404"/>
      </providerOrganization>
    </patientRole>
  </recordTarget>
  <author>
    <time value="20130709150524"/>
    <assignedAuthor>
      <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="-1774162843"/>
      <representedOrganization>
        <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="206605404"/>
      </representedOrganization>
    </assignedAuthor>
  </author>
  <dataEnterer>
    <time value="20130709150524"/>
    <assignedEntity>
      <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="-937850301"/>
      <representedOrganization>
        <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="206605404"/>
      </representedOrganization>
    </assignedEntity>
  </dataEnterer>
  <legalAuthenticator>
    <time value="20130709150524"/>
    <signatureCode code="S"/>
    <assignedEntity>
      <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="667284500"/>
      <representedOrganization>
        <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="206605404"/>
      </representedOrganization>
    </assignedEntity>
  </legalAuthenticator>
  <documentationOf>
    <serviceEvent classCode="PCPR">
      <id root="2.16.840.1.113883.2.4.3.31.3.1" extension="1894524026"/>
      <code code="17074200" codeSystem="2.16.840.1.113883.6.96" codeSystemName="SNOMED-CT"
            displayName="Diabetes treatment"/>
    </serviceEvent>
  </documentationOf>
  <component>
    <structuredBody>
      <component>
        <section>
          <entry>
            <organizer classCode="BATTERY" moodCode="EVN">
              <id root="2.16.840.1.113883.2.4.3.31.3.1" extension="2096295173"/>
              <code code="Portavita1234" codeSystem="2.16.840.1.113883.2.4.3.31.2.1"
                    codeSystemName="Portavita" displayName="Self-check CVRM"/>
              <statusCode code="completed"/>
              <effectiveTime>
                <low value="20120504055554"/>
              </effectiveTime>
              <performer typeCode="PRF">
                <assignedEntity>
                  <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="1236771654"/>
                  <representedOrganization>
                    <id root="2.16.840.1.113883.2.4.3.31.3.2" extension="206605404"/>
                  </representedOrganization>
                </assignedEntity>
              </performer>
              <component>
                <organizer classCode="BATTERY" moodCode="EVN">
                  <id root="2.16.840.1.113883.2.4.3.31.3.1" extension="2096295172"/>
                  <code code="Portavita1235" codeSystem="2.16.840.1.113883.2.4.3.31.2.1"
                        codeSystemName="Portavita" displayName="Blood pressure Self measurement"/>
                  <statusCode code="completed"/>
                  <effectiveTime>
                    <low value="20120504055554"/>
                  </effectiveTime>
                  <component>
                    <observation classCode="OBS" moodCode="EVN">
                      <id root="2.16.840.1.113883.2.4.3.31.3.1" extension="2096295170"/>
                      <code code="Portavita1236" codeSystem="2.16.840.1.113883.2.4.3.31.2.1"
                            codeSystemName="Portavita" displayName="Systolic BP, self measurement"/>
                      <statusCode code="completed"/>
                      <effectiveTime>
                        <low value="20120504055554"/>
                      </effectiveTime>
                      <value value="130.63331049967894" unit="mm Hg" xsi:type="PQ"/>
                    </observation>
                  </component>
                  <component>
                    <observation classCode="OBS" moodCode="EVN">
                      <id root="2.16.840.1.113883.2.4.3.31.3.1" extension="2096295171"/>
                      <code code="Portavita1237" codeSystem="2.16.840.1.113883.2.4.3.31.2.1"
                            codeSystemName="Portavita" displayName="Diastolic BP, self measurement"/>
                      <statusCode code="completed"/>
                      <effectiveTime>
                        <low value="20120504055554"/>
                      </effectiveTime>
                      <value value="68.57299120677533" unit="mm Hg" xsi:type="PQ"/>
                    </observation>
                  </component>
                </organizer>
              </component>
            </organizer>
          </entry>
        </section>
      </component>
    </structuredBody>
  </component>
</ClinicalDocument>
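As an illustration of how software under test might consume such a document, the following sketch extracts the physical quantity (PQ) observations with Python's standard xml.etree.ElementTree module. It is not part of Portavita Benchmark. The embedded fragment is a shortened version of the example above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

NS = {"hl7": "urn:hl7-org:v3"}
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Shortened version of the CDA example: only the two observations are kept.
CDA_FRAGMENT = """
<ClinicalDocument xmlns="urn:hl7-org:v3"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <component><structuredBody><component><section><entry>
    <organizer classCode="BATTERY" moodCode="EVN">
      <component>
        <observation classCode="OBS" moodCode="EVN">
          <code code="Portavita1236" displayName="Systolic BP, self measurement"/>
          <value value="130.63331049967894" unit="mm Hg" xsi:type="PQ"/>
        </observation>
      </component>
      <component>
        <observation classCode="OBS" moodCode="EVN">
          <code code="Portavita1237" displayName="Diastolic BP, self measurement"/>
          <value value="68.57299120677533" unit="mm Hg" xsi:type="PQ"/>
        </observation>
      </component>
    </organizer>
  </entry></section></component></structuredBody></component>
</ClinicalDocument>
"""


def physical_quantities(xml_text: str) -> Dict[str, Tuple[float, str]]:
    """Map each PQ observation's display name to its (value, unit)."""
    root = ET.fromstring(xml_text.strip())
    result: Dict[str, Tuple[float, str]] = {}
    for obs in root.iterfind(".//hl7:observation", NS):
        code = obs.find("hl7:code", NS)
        value = obs.find("hl7:value", NS)
        if code is None or value is None:
            continue
        if value.get(XSI_TYPE) != "PQ":
            continue  # skip coded (CD) and other non-numeric values
        result[code.get("displayName")] = (float(value.get("value")), value.get("unit"))
    return result


readings = physical_quantities(CDA_FRAGMENT)
for name, (amount, unit) in readings.items():
    print(f"{name}: {amount} {unit}")
```

All CDA elements live in the urn:hl7-org:v3 namespace, so every path step needs the namespace prefix; the xsi:type attribute is looked up under its fully qualified name.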


Appendix B.

List of Observations per Examination

Below follows the list of examinations and their observations generated by Portavita Benchmark. The examinations are denoted with ===== and a .json suffix and have a corresponding human-readable display name. Examinations are composed of a number of observations, which are listed after the examination name. Each observation has a code that is preceded by the corresponding data type, which is either PQ (physical quantity, continuous) or CD (coded description, discrete).

Examination/Observation code | displayname
--------------------------------+--------------------------------------------------------------------------------------------
===== 12133-5.json ===== | Systolic flow/Diastolic flow:VelRto:Pt:Cerebral artery anterior^fetus:Qn:US.doppler
pq_8462.4 | Intravascular diastolic:Pres:Pt:Arterial system:Qn:
pq_8480.6 | Intravascular systolic:Pres:Pt:Arterial system:Qn:
===== 127783003.json ===== | Spirometry
cd_Portavita1327 | Interpretation
pq_313232000 | Peak expiratory flow rate after bronchodilation
pq_313276007 | Peak expiratory flow rate before bronchodilation
pq_401012008 | FEV1 before bronchodilation
pq_401013003 | FEV1 after bronchodilation
pq_407561008 | Forced vital capacity (FVC) after bronchodilation
pq_407602006 | Forced expiratory volume 1 (FEV1)/ forced vital capacity (FVC) ratio before bronchodilator
pq_407603001 | Forced expiratory volume 1 (FEV1)/ forced vital capacity (FVC) ratio after bronchodilator
pq_Portavita534 | FEV1 Pre (% of predicted)
pq_Portavita535 | FEV1 Post (% of predicted)
pq_Portavita536 | Reversibility FEV1 (%)
pq_Portavita537 | FVC Pre (Absolute)
pq_Portavita538 | FVC Pre (% of predicted)
pq_Portavita539 | FVC Post (% of predicted)
pq_Portavita540 | PEF Pre (% of predicted)
pq_Portavita541 | PEF Post (% of predicted)
pq_Portavita542 | TLC Pre (Absolute)
pq_Portavita543 | TLC Pre (% of predicted)
===== 164847006.json ===== | Standard ECG
cd_271921002 | ECG finding
===== 170744004.json ===== | Follow-up diabetic assessment
cd_129863004 | Deficient knowledge of dietary regimen
cd_237635002 | Nocturnal hypoglycaemia
cd_302866003 | Hypoglycaemia
cd_361137007 | Irregular heart beat
cd_365275006 | General well-being finding
cd_Portavita1342 | Antihypertensives
cd_Portavita1343 | Diuretics
cd_Portavita1344 | Beta-blockers
cd_Portavita1345 | Calcium antagonists
cd_Portavita1346 | Drugs affecting the renin-angiotensin system
cd_Portavita1347 | Alpha-blockers
cd_Portavita1348 | Other antihypertensives
cd_Portavita1349 | Blood-thinning drugs
cd_Portavita1350 | Platelet aggregation inhibitors
cd_Portavita1351 | Anticoagulants
cd_Portavita1352 | Other blood-thinning drugs
cd_Portavita1353 | Lipid-lowering drugs
cd_Portavita1354 | Statins
cd_Portavita1355 | Other lipid-lowering drugs
cd_Portavita1428 | Extra attention for Individual care plan
cd_Portavita24 | Fasting hypos
cd_Portavita25 | Hypos after breakfast
cd_Portavita26 | Hypos before lunch
cd_Portavita27 | Hypos after lunch
cd_Portavita28 | Hypos before dinner
cd_Portavita29 | Hypos after dinner
cd_Portavita30 | Hypos before bedtime
cd_Portavita34 | Therapy compliance
cd_Portavita38 | Dietary advice problems
cd_Portavita39 | Insufficient application of guidelines
cd_Portavita648 | Diabetes medication
pq_170749009 | Frequency of hypoglycaemia attacks


pq_27113001 | Body weight
pq_364075005 | Heart rate
pq_365811003 | Glucose level - finding
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_8462.4 | Intravascular diastolic:Pres:Pt:Arterial system:Qn:
pq_8480.6 | Intravascular systolic:Pres:Pt:Arterial system:Qn:
pq_Portavita175 | Fasting blood glucose
pq_Portavita176 | Blood glucose after breakfast
pq_Portavita177 | Blood glucose before lunch
pq_Portavita178 | Blood glucose after lunch
pq_Portavita179 | Blood glucose before dinner
pq_Portavita180 | Blood glucose after dinner
pq_Portavita181 | Blood glucose before bedtime
pq_Portavita182 | Nighttime blood glucose
===== 170757007.json ===== | Fundoscopy - diabetic check
cd_Portavita220 | Assessment of fundus image
===== 170777000.json ===== | Diabetic annual review
cd_106070007 | Cardiac auscultation finding
cd_129863004 | Deficient knowledge of dietary regimen
cd_162274004 | Visual symptoms
cd_207057006 | [D]Shortness of breath
cd_207260005 | [D]Other specified symptoms
cd_219006 | Current drinker
cd_22298006 | Myocardial infarction
cd_228450008 | Time spent exercising
cd_230690007 | Cerebrovascular accident
cd_237635002 | Nocturnal hypoglycaemia
cd_249475006 | Thirst symptom
cd_266257000 | Transient cerebral ischaemia
cd_28442001 | Polyuria
cd_29857009 | Chest pain
cd_302866003 | Hypoglycaemia
cd_30782001 | Diastolic murmur
cd_309597007 | Foot abnormality - diabetes-related
cd_312975006 | Microalbuminuria
cd_31574009 | Systolic murmur
cd_32738000 | Pruritus
cd_361137007 | Irregular heart beat
cd_365275006 | General well-being finding
cd_365980008 | Tobacco use and exposure - finding
cd_367416001 | Angina pectoris
cd_370992007 | Dyslipidaemia
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_400047006 | Peripheral vascular disease
cd_401207004 | Medication side effects present
cd_40733004 | Infectious disease
cd_52311001 | Homocystinaemia
cd_82184000 | Aortic bruit
cd_84114007 | Heart failure
cd_Portavita10 | Carotid aorta right
cd_Portavita11 | Renal aorta left
cd_Portavita12 | Renal aorta right
cd_Portavita13 | Femoral aorta left
cd_Portavita1342 | Antihypertensives
cd_Portavita1343 | Diuretics
cd_Portavita1344 | Beta-blockers
cd_Portavita1345 | Calcium antagonists
cd_Portavita1346 | Drugs affecting the renin-angiotensin system
cd_Portavita1347 | Alpha-blockers
cd_Portavita1348 | Other antihypertensives
cd_Portavita1349 | Blood-thinning drugs
cd_Portavita1350 | Platelet aggregation inhibitors
cd_Portavita1351 | Anticoagulants
cd_Portavita1352 | Other blood-thinning drugs
cd_Portavita1353 | Lipid-lowering drugs
cd_Portavita1354 | Statins
cd_Portavita1355 | Other lipid-lowering drugs
cd_Portavita14 | Femoral aorta right
cd_Portavita1428 | Extra attention for Individual care plan
cd_Portavita24 | Fasting hypos
cd_Portavita25 | Hypos after breakfast
cd_Portavita26 | Hypos before lunch
cd_Portavita27 | Hypos after lunch
cd_Portavita28 | Hypos before dinner
cd_Portavita29 | Hypos after dinner
cd_Portavita30 | Hypos before bedtime
cd_Portavita308 | Assessment of ophthalmic examination
cd_Portavita34 | Therapy compliance
cd_Portavita38 | Dietary advice problems
cd_Portavita39 | Insufficient application of guidelines
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita44 | Symptoms indicating hypoglycemia
cd_Portavita440 | Follow-up appointment made
cd_Portavita48 | Decrease of physical capacity
cd_Portavita5 | Auscultation indicated
cd_Portavita50 | Pain in calves when walking
cd_Portavita52 | Pain or tingling in legs
cd_Portavita54 | Sexual dysfunction disorders
cd_Portavita6 | First sound
cd_Portavita61 | Hypoglycemia recognition
cd_Portavita63 | Patient uses caffeine
cd_Portavita64 | Products with glycyrrhizic acid
cd_Portavita648 | Diabetes medication
cd_Portavita68 | Diabetes in first-degree or second-degree relatives


cd_Portavita69 | Lipid metabolism disorder in first-degree relatives
cd_Portavita7 | Second sound
cd_Portavita70 | Hypertension in first-degree relatives
cd_Portavita71 | Cardiovascular diseases in first-degree relatives
cd_Portavita8 | Auscultation of arteries performed
cd_Portavita9 | Carotid aorta left
pq_160573003 | Alcohol intake
pq_170749009 | Frequency of hypoglycaemia attacks
pq_266918002 | Tobacco smoking consumption
pq_27113001 | Body weight
pq_364075005 | Heart rate
pq_365811003 | Glucose level - finding
pq_396552003 | Abdominal circumference
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_8462.4 | Intravascular diastolic:Pres:Pt:Arterial system:Qn:
pq_8480.6 | Intravascular systolic:Pres:Pt:Arterial system:Qn:
pq_Portavita175 | Fasting blood glucose
pq_Portavita176 | Blood glucose after breakfast
pq_Portavita177 | Blood glucose before lunch
pq_Portavita178 | Blood glucose after lunch
pq_Portavita179 | Blood glucose before dinner
pq_Portavita180 | Blood glucose after dinner
pq_Portavita181 | Blood glucose before bedtime
pq_Portavita182 | Nighttime blood glucose
===== 183056000.json ===== | Patient advised about diabetic diet
cd_129863004 | Deficient knowledge of dietary regimen
cd_365275006 | General well-being finding
cd_Portavita1428 | Extra attention for Individual care plan
cd_Portavita34 | Therapy compliance
cd_Portavita38 | Dietary advice problems
cd_Portavita39 | Insufficient application of guidelines
pq_27113001 | Body weight
pq_396552003 | Abdominal circumference
pq_50373000 | Body height measure
pq_60621009 | Body mass index
===== 27113001.json ===== | Body weight
pq_27113001 | Body weight
===== 282294001.json ===== | Laboratory test finding
pq_102737005 | HDL cholesterol
pq_102739008 | LDL cholesterol
pq_103232008 | HbA1c
pq_166842003 | Total cholesterol:HDL ratio measurement
pq_250745003 | Albumin/creatinine ratio measurement
pq_26091008 | Aspartate aminotransferase
pq_271000000 | Urine albumin measurement
pq_275788007 | Sodium in sample
pq_275789004 | Potassium in sample
pq_275792000 | Creatinine in sample
pq_275795003 | Albumin in sample
pq_38082009 | Haemoglobin
pq_52302001 | Glucose measurement, fasting
pq_56935002 | Alanine aminotransferase
pq_60153001 | gamma-Glutamyltransferase
pq_75828004 | Creatine kinase
pq_84698008 | Cholesterol
pq_85600001 | Triacylglycerol
pq_8879006 | Creatinine measurement, 24 hour urine
pq_Portavita1338 | PTH
pq_Portavita1339 | Vitamin D
pq_Portavita1340 | MCV
pq_Portavita1356 | Calcium
pq_Portavita1357 | Phosphate
pq_Portavita189 | Creatinine clearance (Cockcroft)
pq_Portavita190 | Non-fasting blood glucose
pq_Portavita191 | Creatinine clearance (24-hour urine)
pq_Portavita304 | Creatinine clearance (MDRD)
pq_Portavita845 | BNP
===== 396552003.json ===== | Abdominal circumference
pq_396552003 | Abdominal circumference
===== 401191002.json ===== | Diabetic foot examination
cd_122480009 | Hallux valgus
cd_201251005 | Neuropathic diabetic ulcer - foot
cd_249802001 | Pes cavus
cd_268068002 | Ankle and/or foot joint stiffness
cd_275520000 | Claudication
cd_299653001 | Amputated foot
cd_403059006 | Onychomycosis of toenails
cd_53226007 | Pes planus
cd_86380000 | Acquired claw toes
cd_Portavita107 | Muscle cramp in calves when lying down
cd_Portavita109 | Skin defect and/or infection
cd_Portavita110 | Autonomic neuropathy
cd_Portavita111 | Tylosis or clavus
cd_Portavita114 | Pressure sores
cd_Portavita116 | Purple discoloration
cd_Portavita118 | Temperature difference
cd_Portavita120 | Posterior tibial artery
cd_Portavita121 | Dorsalis pedis artery
cd_Portavita123 | Superficial sensitivity disorder
cd_Portavita131 | Deep sensitivity disorders
cd_Portavita1326 | Peripheral Arterial Disease (PAD)
cd_Portavita712 | SIMMS classification
===== 401221002.json ===== | Ankle brachial pressure index - ABPI
pq_Portavita714 | Systolic blood pressure left ankle
pq_Portavita715 | Systolic blood pressure right ankle
pq_Portavita716 | Systolic blood pressure left arm

34

Page 35: Portavita Benchmark: A Dataset Generator for Healthcare...storage of these documents into a PostgreSQL DBMS. Finally, we provide a brief overview of the queries that come with the

pq_Portavita717 | Systolic blood pressure right arm
pq_Portavita718 | ABPI left
pq_Portavita719 | ABPI right
===== 50373000.json ===== | Body height measure
pq_50373000 | Body height measure
===== 77386006.json ===== | Patient currently pregnant
cd_77386006 | Patient currently pregnant
===== BATT204006.json ===== | Smoking
cd_365980008 | Tobacco use and exposure - finding
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita440 | Follow-up appointment made
pq_266918002 | Tobacco smoking consumption
pq_Portavita435 | Smoking history: number of years
pq_Portavita436 | Smoking history: units per day
pq_Portavita437 | Cessation attempts
pq_Portavita457 | Pack years
===== Portavita1224.json ===== | Inhalation technique
cd_Portavita466 | Inhalation technique examined
cd_Portavita467 | Problems with inhalation technique
===== Portavita1232.json ===== | Diagnosis (Diabetes)
cd_Portavita1232 | Diagnosis (Diabetes)
===== Portavita1234.json ===== | Self-check (CVRM)
===== Portavita136.json ===== | Self-check (Diabetes)
cd_169449001 | Trying to conceive
cd_309597007 | Foot abnormality - diabetes-related
cd_365275006 | General well-being finding
cd_77386006 | Patient currently pregnant
cd_Portavita34 | Therapy compliance
cd_Portavita38 | Dietary advice problems
pq_27113001 | Body weight
pq_60621009 | Body mass index
pq_8462.4 | Intravascular diastolic:Pres:Pt:Arterial system:Qn:
pq_8480.6 | Intravascular systolic:Pres:Pt:Arterial system:Qn:
===== Portavita140.json ===== | Risk inventory (Diabetes)
cd_106070007 | Cardiac auscultation finding
cd_108365000 | Infection of skin
cd_162274004 | Visual symptoms
cd_169449001 | Trying to conceive
cd_207057006 | [D]Shortness of breath
cd_207260005 | [D]Other specified symptoms
cd_219006 | Current drinker
cd_22298006 | Myocardial infarction
cd_228450008 | Time spent exercising
cd_230690007 | Cerebrovascular accident
cd_249475006 | Thirst symptom
cd_266257000 | Transient cerebral ischaemia
cd_267036007 | Dyspnoea
cd_278542003 | Dental appliance or restoration finding
cd_279333002 | Pruritus of skin
cd_28442001 | Polyuria
cd_29857009 | Chest pain
cd_30782001 | Diastolic murmur
cd_309597007 | Foot abnormality - diabetes-related
cd_312975006 | Microalbuminuria
cd_31574009 | Systolic murmur
cd_32738000 | Pruritus
cd_361137007 | Irregular heart beat
cd_365275006 | General well-being finding
cd_365980008 | Tobacco use and exposure - finding
cd_367416001 | Angina pectoris
cd_370992007 | Dyslipidaemia
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_400047006 | Peripheral vascular disease
cd_401207004 | Medication side effects present
cd_402599005 | Acanthosis nigricans
cd_40733004 | Infectious disease
cd_52311001 | Homocystinaemia
cd_56727007 | Vitiligo
cd_77386006 | Patient currently pregnant
cd_82184000 | Aortic bruit
cd_84114007 | Heart failure
cd_Portavita10 | Carotid aorta right
cd_Portavita11 | Renal aorta left
cd_Portavita12 | Renal aorta right
cd_Portavita1232 | Diagnosis (Diabetes)
cd_Portavita13 | Femoral aorta left
cd_Portavita14 | Femoral aorta right
cd_Portavita1428 | Extra attention for Individual care plan
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita44 | Symptoms indicating hypoglycemia
cd_Portavita440 | Follow-up appointment made
cd_Portavita48 | Decrease of physical capacity
cd_Portavita5 | Auscultation indicated
cd_Portavita50 | Pain in calves when walking
cd_Portavita52 | Pain or tingling in legs
cd_Portavita54 | Sexual dysfunction disorders
cd_Portavita6 | First sound
cd_Portavita63 | Patient uses caffeine
cd_Portavita64 | Products with glycyrrhizic acid
cd_Portavita68 | Diabetes in first-degree or second-degree relatives
cd_Portavita69 | Lipid metabolism disorder in first-degree relatives
cd_Portavita7 | Second sound
cd_Portavita70 | Hypertension in first-degree relatives
cd_Portavita71 | Cardiovascular diseases in first-degree relatives
cd_Portavita8 | Auscultation of arteries performed
cd_Portavita9 | Carotid aorta left
pq_160573003 | Alcohol intake
pq_266918002 | Tobacco smoking consumption
pq_27113001 | Body weight
pq_364075005 | Heart rate
pq_396552003 | Abdominal circumference
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_8462.4 | Intravascular diastolic:Pres:Pt:Arterial system:Qn:
pq_8480.6 | Intravascular systolic:Pres:Pt:Arterial system:Qn:
===== Portavita141.json ===== | Risk factors
cd_22298006 | Myocardial infarction
cd_230690007 | Cerebrovascular accident
cd_266257000 | Transient cerebral ischaemia
cd_312975006 | Microalbuminuria
cd_367416001 | Angina pectoris
cd_370992007 | Dyslipidaemia
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_400047006 | Peripheral vascular disease
cd_52311001 | Homocystinaemia
cd_84114007 | Heart failure
===== Portavita1422.json ===== | Annual check-up (Asthma/COPD)
cd_10312003 | Prednisone preparation
cd_162895003 | O/E - accessory resp.m’s.used
cd_170617002 | Respiratory drug side effect
cd_22298006 | Myocardial infarction
cd_228450008 | Time spent exercising
cd_230690007 | Cerebrovascular accident
cd_266257000 | Transient cerebral ischaemia
cd_268929007 | O/E - rhonchi present
cd_28743005 | Productive cough
cd_301272007 | Chest auscultation finding
cd_301282008 | Finding of respiration
cd_365980008 | Tobacco use and exposure - finding
cd_367416001 | Angina pectoris
cd_3723001 | Arthritis
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_390800000 | Goal achievement finding
cd_400047006 | Peripheral vascular disease
cd_41006004 | Depression
cd_414059009 | Drug therapy compliance finding
cd_417523004 | Loss of interest in previously enjoyable activity
cd_419597003 | Respiratory corticosteroid
cd_48694002 | Anxiety
cd_64859006 | Osteoporosis
cd_66493003 | Theophylline
cd_69896004 | Rheumatoid arthritis
cd_73211009 | Diabetes mellitus
cd_78275009 | Obstructive sleep apnoea syndrome
cd_79015004 | Worried
cd_79042003 | Crepitation
cd_84114007 | Heart failure
cd_Portavita1428 | Extra attention for Individual care plan
cd_Portavita359 | Undesirable weight loss
cd_Portavita400 | Problems in coughing up slime
cd_Portavita427 | More complaints after exposure to work/hobby
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita440 | Follow-up appointment made
cd_Portavita459 | Short acting b2-mimetic
cd_Portavita460 | Long-acting b2-mimetic
cd_Portavita461 | Short-acting anticholinergics
cd_Portavita462 | Long-acting anticholinergics
cd_Portavita463 | LTRCA
cd_Portavita466 | Inhalation technique examined
cd_Portavita467 | Problems with inhalation technique
cd_Portavita565 | Barrel chest
cd_Portavita713 | Combination therapy
cd_Portavita857 | Impairment due to COPD
pq_266918002 | Tobacco smoking consumption
pq_27113001 | Body weight
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_86290005 | Respiratory rate
pq_Portavita360 | Fat Free Mass Index (FFMI)
pq_Portavita435 | Smoking history: number of years
pq_Portavita436 | Smoking history: units per day
pq_Portavita437 | Cessation attempts
pq_Portavita457 | Pack years
===== Portavita1423.json ===== | Interim check-up (Asthma/COPD)
cd_10312003 | Prednisone preparation
cd_162895003 | O/E - accessory resp.m’s.used
cd_170617002 | Respiratory drug side effect
cd_22298006 | Myocardial infarction
cd_228450008 | Time spent exercising
cd_230690007 | Cerebrovascular accident
cd_266257000 | Transient cerebral ischaemia
cd_268929007 | O/E - rhonchi present
cd_28743005 | Productive cough
cd_301272007 | Chest auscultation finding
cd_301282008 | Finding of respiration
cd_365980008 | Tobacco use and exposure - finding
cd_367416001 | Angina pectoris
cd_3723001 | Arthritis
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_390800000 | Goal achievement finding
cd_400047006 | Peripheral vascular disease
cd_41006004 | Depression
cd_414059009 | Drug therapy compliance finding
cd_417523004 | Loss of interest in previously enjoyable activity
cd_419597003 | Respiratory corticosteroid
cd_48694002 | Anxiety
cd_64859006 | Osteoporosis
cd_66493003 | Theophylline
cd_69896004 | Rheumatoid arthritis
cd_73211009 | Diabetes mellitus
cd_78275009 | Obstructive sleep apnoea syndrome
cd_79015004 | Worried
cd_79042003 | Crepitation
cd_84114007 | Heart failure
cd_Portavita1428 | Extra attention for Individual care plan
cd_Portavita359 | Undesirable weight loss
cd_Portavita400 | Problems in coughing up slime
cd_Portavita427 | More complaints after exposure to work/hobby
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita440 | Follow-up appointment made
cd_Portavita459 | Short acting b2-mimetic
cd_Portavita460 | Long-acting b2-mimetic
cd_Portavita461 | Short-acting anticholinergics
cd_Portavita462 | Long-acting anticholinergics
cd_Portavita463 | LTRCA
cd_Portavita466 | Inhalation technique examined
cd_Portavita467 | Problems with inhalation technique
cd_Portavita565 | Barrel chest
cd_Portavita713 | Combination therapy
cd_Portavita857 | Impairment due to COPD
pq_266918002 | Tobacco smoking consumption
pq_27113001 | Body weight
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_86290005 | Respiratory rate
pq_Portavita360 | Fat Free Mass Index (FFMI)
pq_Portavita435 | Smoking history: number of years
pq_Portavita436 | Smoking history: units per day
pq_Portavita437 | Cessation attempts
pq_Portavita457 | Pack years
===== Portavita1442.json ===== | Diagnosis (Asthma/COPD)
===== Portavita154.json ===== | Interim check-up (Diabetes)
cd_182838006 | Change of medication
cd_274785000 | Examination of blood pressure
cd_299478007 | Foot problem
cd_316360006 | [V]Other reasons for encounter
cd_33747003 | Glucose measurement, blood
cd_361137007 | Irregular heart beat
cd_54777007 | Deficient knowledge
cd_78164000 | Feeding problem
cd_Portavita1342 | Antihypertensives
cd_Portavita1343 | Diuretics
cd_Portavita1344 | Beta-blockers
cd_Portavita1345 | Calcium antagonists
cd_Portavita1346 | Drugs affecting the renin-angiotensin system
cd_Portavita1347 | Alpha-blockers
cd_Portavita1348 | Other antihypertensives
cd_Portavita1349 | Blood-thinning drugs
cd_Portavita1350 | Platelet aggregation inhibitors
cd_Portavita1351 | Anticoagulants
cd_Portavita1352 | Other blood-thinning drugs
cd_Portavita1353 | Lipid-lowering drugs
cd_Portavita1354 | Statins
cd_Portavita1355 | Other lipid-lowering drugs
cd_Portavita1428 | Extra attention for Individual care plan
cd_Portavita648 | Diabetes medication
pq_364075005 | Heart rate
pq_8462.4 | Intravascular diastolic:Pres:Pt:Arterial system:Qn:
pq_8480.6 | Intravascular systolic:Pres:Pt:Arterial system:Qn:
pq_Portavita175 | Fasting blood glucose
pq_Portavita176 | Blood glucose after breakfast
pq_Portavita177 | Blood glucose before lunch
pq_Portavita178 | Blood glucose after lunch
pq_Portavita179 | Blood glucose before dinner
pq_Portavita180 | Blood glucose after dinner
pq_Portavita181 | Blood glucose before bedtime
pq_Portavita182 | Nighttime blood glucose
===== Portavita174.json ===== | Glucose curve
pq_Portavita175 | Fasting blood glucose
pq_Portavita176 | Blood glucose after breakfast
pq_Portavita177 | Blood glucose before lunch
pq_Portavita178 | Blood glucose after lunch
pq_Portavita179 | Blood glucose before dinner
pq_Portavita180 | Blood glucose after dinner
pq_Portavita181 | Blood glucose before bedtime
pq_Portavita182 | Nighttime blood glucose
===== Portavita305.json ===== | Ophthalmic examination (Diabetes)
cd_Portavita308 | Assessment of ophthalmic examination
===== Portavita316.json ===== | Asthma Control Questionnaire (ACQ)
===== Portavita323.json ===== | Respiratory Illness Questionnaire-Monitoring 10 (RIQ-Mon10)
cd_Portavita324 | Symptoms score
cd_Portavita325 | Activities score
===== Portavita336.json ===== | Clinical COPD Questionnaire (CCQ)
cd_Portavita337 | Symptoms score
cd_Portavita338 | Functional score
cd_Portavita339 | Mental score
===== Portavita351.json ===== | Lung specialist consultation
cd_Portavita1333 | Lung function measurement reliable
cd_Portavita1334 | Lung function matching
cd_Portavita1335 | Indication for referral to lung specialist
===== Portavita353.json ===== | Intake/Diagnostics (Asthma/COPD)
cd_10312003 | Prednisone preparation
cd_161524000 | H/O: hay fever
cd_161527007 | H/O: asthma
cd_161561009 | H/O: eczema
cd_162895003 | O/E - accessory resp.m’s.used
cd_170617002 | Respiratory drug side effect
cd_21719001 | Allergic rhinitis due to pollen
cd_22298006 | Myocardial infarction
cd_228450008 | Time spent exercising
cd_230690007 | Cerebrovascular accident
cd_232347008 | Dander (animal) allergy
cd_266257000 | Transient cerebral ischaemia
cd_267036007 | Dyspnoea
cd_268929007 | O/E - rhonchi present
cd_28743005 | Productive cough
cd_301272007 | Chest auscultation finding
cd_301282008 | Finding of respiration
cd_365980008 | Tobacco use and exposure - finding
cd_367416001 | Angina pectoris
cd_3723001 | Arthritis
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_400047006 | Peripheral vascular disease
cd_41006004 | Depression
cd_414059009 | Drug therapy compliance finding
cd_417523004 | Loss of interest in previously enjoyable activity
cd_419597003 | Respiratory corticosteroid
cd_48694002 | Anxiety
cd_64859006 | Osteoporosis
cd_66493003 | Theophylline
cd_68154008 | Chronic cough
cd_69896004 | Rheumatoid arthritis
cd_73211009 | Diabetes mellitus
cd_78275009 | Obstructive sleep apnoea syndrome
cd_79015004 | Worried
cd_79042003 | Crepitation
cd_80313002 | Palpitations
cd_84114007 | Heart failure
cd_Portavita359 | Undesirable weight loss
cd_Portavita378 | Frequent respiratory infections
cd_Portavita380 | Complaints triggered by medication
cd_Portavita381 | Respiratory medication discontinued in the past
cd_Portavita391 | Pulmonary diseases of first-degree relatives
cd_Portavita392 | Atopic disorders in first-degree relatives
cd_Portavita394 | Frequency of shortness of breath
cd_Portavita396 | Frequency of wheezing
cd_Portavita400 | Problems in coughing up slime
cd_Portavita402 | Periods without symptoms
cd_Portavita404 | Nighttime symptoms
cd_Portavita409 | Shortness of breath, wheezing or coughing in response to specific or non-specific stimuli
cd_Portavita410 | Dusty or wet environment
cd_Portavita414 | Tobacco smoke
cd_Portavita416 | Other non-specific stimuli (cold air, fog, baking odours, paint odours, perfume, etc.)
cd_Portavita419 | RAST test indicated
cd_Portavita420 | Results of previous RAST test
cd_Portavita422 | Skin prick test indicated
cd_Portavita423 | Results previous skin prick test
cd_Portavita427 | More complaints after exposure to work/hobby
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita440 | Follow-up appointment made
cd_Portavita441 | Ethnicity
cd_Portavita459 | Short acting b2-mimetic
cd_Portavita460 | Long-acting b2-mimetic
cd_Portavita461 | Short-acting anticholinergics
cd_Portavita462 | Long-acting anticholinergics
cd_Portavita463 | LTRCA
cd_Portavita466 | Inhalation technique examined
cd_Portavita467 | Problems with inhalation technique
cd_Portavita565 | Barrel chest
cd_Portavita566 | Wheezing
cd_Portavita713 | Combination therapy
cd_Portavita857 | Impairment due to COPD
pq_266918002 | Tobacco smoking consumption
pq_27113001 | Body weight
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_86290005 | Respiratory rate
pq_Portavita360 | Fat Free Mass Index (FFMI)
pq_Portavita435 | Smoking history: number of years
pq_Portavita436 | Smoking history: units per day
pq_Portavita437 | Cessation attempts
pq_Portavita457 | Pack years
===== Portavita354.json ===== | Follow-up consultation (Asthma/COPD)
cd_10312003 | Prednisone preparation
cd_162895003 | O/E - accessory resp.m’s.used
cd_170617002 | Respiratory drug side effect
cd_22298006 | Myocardial infarction
cd_228450008 | Time spent exercising
cd_230690007 | Cerebrovascular accident
cd_266257000 | Transient cerebral ischaemia
cd_268929007 | O/E - rhonchi present
cd_28743005 | Productive cough
cd_301272007 | Chest auscultation finding
cd_301282008 | Finding of respiration
cd_365980008 | Tobacco use and exposure - finding
cd_367416001 | Angina pectoris
cd_3723001 | Arthritis
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_390800000 | Goal achievement finding
cd_400047006 | Peripheral vascular disease
cd_41006004 | Depression
cd_414059009 | Drug therapy compliance finding
cd_417523004 | Loss of interest in previously enjoyable activity
cd_419597003 | Respiratory corticosteroid
cd_48694002 | Anxiety
cd_64859006 | Osteoporosis
cd_66493003 | Theophylline
cd_69896004 | Rheumatoid arthritis
cd_73211009 | Diabetes mellitus
cd_78275009 | Obstructive sleep apnoea syndrome
cd_79015004 | Worried
cd_79042003 | Crepitation
cd_84114007 | Heart failure
cd_Portavita1428 | Extra attention for Individual care plan
cd_Portavita359 | Undesirable weight loss
cd_Portavita400 | Problems in coughing up slime
cd_Portavita427 | More complaints after exposure to work/hobby
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita440 | Follow-up appointment made
cd_Portavita459 | Short acting b2-mimetic
cd_Portavita460 | Long-acting b2-mimetic
cd_Portavita461 | Short-acting anticholinergics
cd_Portavita462 | Long-acting anticholinergics
cd_Portavita463 | LTRCA
cd_Portavita466 | Inhalation technique examined
cd_Portavita467 | Problems with inhalation technique
cd_Portavita565 | Barrel chest
cd_Portavita713 | Combination therapy
cd_Portavita857 | Impairment due to COPD
pq_266918002 | Tobacco smoking consumption
pq_27113001 | Body weight
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_86290005 | Respiratory rate
pq_Portavita360 | Fat Free Mass Index (FFMI)
pq_Portavita435 | Smoking history: number of years
pq_Portavita436 | Smoking history: units per day
pq_Portavita437 | Cessation attempts
pq_Portavita457 | Pack years
===== Portavita571.json ===== | Cessation of smoking
cd_13432000 | Nortriptyline
cd_365980008 | Tobacco use and exposure - finding
cd_Portavita438 | Motivation to stop smoking
cd_Portavita574 | Nicotine replacement therapy
cd_Portavita577 | Smoking cessation appointment made
cd_Portavita581 | Fear of weight gain
cd_Portavita582 | Stress
cd_Portavita583 | Social pressure
cd_Portavita584 | Withdrawal symptoms
cd_Portavita585 | Increase of respiratory complaints
cd_Portavita586 | Previous failed cessation attempts
cd_Portavita587 | Not the right time
cd_Portavita589 | Follow-up appointment made
cd_Portavita594 | Weight gain
cd_Portavita595 | Stress
cd_Portavita596 | Social pressure
cd_Portavita597 | Withdrawal symptoms
cd_Portavita598 | Increase of respiratory complaints
cd_Portavita843 | Varenicline
pq_266918002 | Tobacco smoking consumption
pq_Portavita435 | Smoking history: number of years
pq_Portavita436 | Smoking history: units per day
pq_Portavita437 | Cessation attempts
pq_Portavita457 | Pack years
===== Portavita652.json ===== | Clinimetry Physiotherapy
cd_Portavita665 | Borg Dyspnea scale before the test
cd_Portavita666 | Borg Dyspnea scale after the test
cd_Portavita668 | Borg dyspnea severity before the test
cd_Portavita669 | Borg dyspnea severity after the test
cd_Portavita671 | Bode Index
pq_Portavita655 | Predicted distance
pq_Portavita656 | Test result for distance
pq_Portavita657 | Distance percentage
pq_Portavita659 | Oxygen saturation before the test
pq_Portavita660 | Oxygen saturation after the test
pq_Portavita662 | Heart rate before the test
pq_Portavita663 | Heart rate after the test
pq_Portavita670 | Oxygen saturation
pq_Portavita673 | Predicted quadriceps force
pq_Portavita674 | Test result for quadriceps force
pq_Portavita675 | Quadriceps force percentage
pq_Portavita677 | Predicted hand grip strength
pq_Portavita678 | Test result for hand grip strength
pq_Portavita679 | Hand grip strength percentage
pq_Portavita681 | Predicted mouth pressure
===== Portavita67.json ===== | Family anamnesis (Diabetes)
cd_Portavita68 | Diabetes in first-degree or second-degree relatives
cd_Portavita69 | Lipid metabolism disorder in first-degree relatives
cd_Portavita70 | Hypertension in first-degree relatives
cd_Portavita71 | Cardiovascular diseases in first-degree relatives
===== Portavita684.json ===== | Blood pressure 24 hours
pq_Portavita687 | Systolic average
pq_Portavita688 | Diastolic average
pq_Portavita690 | Systolic standard deviation
pq_Portavita691 | Diastolic standard deviation
pq_Portavita694 | Systolic average
pq_Portavita695 | Diastolic average
pq_Portavita697 | Systolic standard deviation
pq_Portavita698 | Diastolic standard deviation
pq_Portavita701 | Systolic average
pq_Portavita702 | Diastolic average
pq_Portavita704 | Systolic standard deviation
pq_Portavita705 | Diastolic standard deviation
pq_Portavita707 | Average heart rhythm
pq_Portavita708 | Average heart rhythm
pq_Portavita709 | Average heart rhythm
pq_Portavita710 | Drop in systolic blood pressure
pq_Portavita711 | Drop in diastolic blood pressure
===== Portavita727.json ===== | Annual check-up (CVRM)
cd_169449001 | Trying to conceive
cd_207057006 | [D]Shortness of breath
cd_207260005 | [D]Other specified symptoms
cd_219006 | Current drinker
cd_22298006 | Myocardial infarction
cd_228450008 | Time spent exercising
cd_230690007 | Cerebrovascular accident
cd_266257000 | Transient cerebral ischaemia
cd_29857009 | Chest pain
cd_312975006 | Microalbuminuria
cd_361137007 | Irregular heart beat
cd_365980008 | Tobacco use and exposure - finding
cd_367416001 | Angina pectoris
cd_370992007 | Dyslipidaemia
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_400047006 | Peripheral vascular disease
cd_401207004 | Medication side effects present
cd_48194001 | Pregnancy-induced hypertension
cd_69896004 | Rheumatoid arthritis
cd_73211009 | Diabetes mellitus
cd_77386006 | Patient currently pregnant
cd_84114007 | Heart failure
cd_Portavita1342 | Antihypertensives
cd_Portavita1343 | Diuretics
cd_Portavita1344 | Beta-blockers
cd_Portavita1345 | Calcium antagonists
cd_Portavita1346 | Drugs affecting the renin-angiotensin system
cd_Portavita1347 | Alpha-blockers
cd_Portavita1348 | Other antihypertensives
cd_Portavita1349 | Blood-thinning drugs
cd_Portavita1350 | Platelet aggregation inhibitors
cd_Portavita1351 | Anticoagulants
cd_Portavita1352 | Other blood-thinning drugs
cd_Portavita1353 | Lipid-lowering drugs
cd_Portavita1354 | Statins
cd_Portavita1355 | Other lipid-lowering drugs
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita440 | Follow-up appointment made
cd_Portavita48 | Decrease of physical capacity
cd_Portavita50 | Pain in calves when walking
cd_Portavita54 | Sexual dysfunction disorders
cd_Portavita69 | Lipid metabolism disorder in first-degree relatives
cd_Portavita71 | Cardiovascular diseases in first-degree relatives
cd_Portavita735 | AAA (abdominal aortic aneurysm) in first-degree relatives
cd_Portavita748 | Intervention
cd_Portavita751 | Knowledge of healthy diet
cd_Portavita753 | Insight into own diet
cd_Portavita755 | Reason for adjustment of diet
cd_Portavita756 | Motivation for adjustment of diet
cd_Portavita758 | Intervention
cd_Portavita761 | Awareness of effects of alcohol use
cd_Portavita763 | Insight into own alcohol use
cd_Portavita765 | Reason for adjustment of alcohol use
cd_Portavita766 | Motivation for adjustment of alcohol use
cd_Portavita768 | Intervention
cd_Portavita771 | Awareness of importance of physical exercise
cd_Portavita773 | Insight into own physical exercise
cd_Portavita775 | Reason for adjustment of physical activity pattern
cd_Portavita776 | Motivation for adjustment of physical activity pattern
cd_Portavita778 | Intervention
cd_Portavita780 | Stress symptoms more than 3 months
cd_Portavita783 | Insight into own stress status
cd_Portavita785 | Intervention
cd_Portavita788 | Awareness of effects of obesity
cd_Portavita790 | Insight into own target weight
cd_Portavita792 | Reason for losing weight
cd_Portavita793 | Motivation for losing weight
cd_Portavita795 | Intervention
cd_Portavita798 | Awareness of effects of high blood pressure
cd_Portavita800 | Insight into own target values
cd_Portavita802 | Intervention
cd_Portavita805 | Awareness of effects of increased cholesterol values
cd_Portavita807 | Insight into own target values
cd_Portavita809 | Intervention
cd_Portavita817 | Reduced walking distance
cd_Portavita829 | Chronic Obstructive Pulmonary Disease (COPD)
pq_160573003 | Alcohol intake
pq_266918002 | Tobacco smoking consumption
pq_27113001 | Body weight
pq_364075005 | Heart rate
pq_396552003 | Abdominal circumference
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_8462.4 | Intravascular diastolic:Pres:Pt:Arterial system:Qn:
pq_8480.6 | Intravascular systolic:Pres:Pt:Arterial system:Qn:
pq_Portavita435 | Smoking history: number of years
pq_Portavita436 | Smoking history: units per day
pq_Portavita437 | Cessation attempts
pq_Portavita457 | Pack years
===== Portavita730.json ===== | Family anamnesis (CVRM)
cd_Portavita69 | Lipid metabolism disorder in first-degree relatives
cd_Portavita71 | Cardiovascular diseases in first-degree relatives
cd_Portavita735 | AAA (abdominal aortic aneurysm) in first-degree relatives
===== Portavita746.json ===== | Interim check-up (CVRM)
cd_219006 | Current drinker
cd_228450008 | Time spent exercising
cd_361137007 | Irregular heart beat
cd_365980008 | Tobacco use and exposure - finding
cd_Portavita1342 | Antihypertensives
cd_Portavita1343 | Diuretics
cd_Portavita1344 | Beta-blockers
cd_Portavita1345 | Calcium antagonists
cd_Portavita1346 | Drugs affecting the renin-angiotensin system
cd_Portavita1347 | Alpha-blockers
cd_Portavita1348 | Other antihypertensives
cd_Portavita1349 | Blood-thinning drugs
cd_Portavita1350 | Platelet aggregation inhibitors
cd_Portavita1351 | Anticoagulants
cd_Portavita1352 | Other blood-thinning drugs
cd_Portavita1353 | Lipid-lowering drugs
cd_Portavita1354 | Statins
cd_Portavita1355 | Other lipid-lowering drugs
cd_Portavita438 | Motivation to stop smoking
cd_Portavita439 | Stop smoking advice given
cd_Portavita440 | Follow-up appointment made
cd_Portavita748 | Intervention
cd_Portavita751 | Knowledge of healthy diet
cd_Portavita753 | Insight into own diet
cd_Portavita755 | Reason for adjustment of diet
cd_Portavita756 | Motivation for adjustment of diet
cd_Portavita758 | Intervention
cd_Portavita761 | Awareness of effects of alcohol use
cd_Portavita763 | Insight into own alcohol use
cd_Portavita765 | Reason for adjustment of alcohol use
cd_Portavita766 | Motivation for adjustment of alcohol use
cd_Portavita768 | Intervention
cd_Portavita771 | Awareness of importance of physical exercise
cd_Portavita773 | Insight into own physical exercise
cd_Portavita775 | Reason for adjustment of physical activity pattern
cd_Portavita776 | Motivation for adjustment of physical activity pattern
cd_Portavita778 | Intervention
cd_Portavita780 | Stress symptoms more than 3 months
cd_Portavita783 | Insight into own stress status
cd_Portavita785 | Intervention
cd_Portavita788 | Awareness of effects of obesity
cd_Portavita790 | Insight into own target weight
cd_Portavita792 | Reason for losing weight
cd_Portavita793 | Motivation for losing weight
cd_Portavita795 | Intervention
cd_Portavita798 | Awareness of effects of high blood pressure
cd_Portavita800 | Insight into own target values
cd_Portavita802 | Intervention
cd_Portavita805 | Awareness of effects of increased cholesterol values
cd_Portavita807 | Insight into own target values
cd_Portavita809 | Intervention
pq_160573003 | Alcohol intake
pq_266918002 | Tobacco smoking consumption
pq_27113001 | Body weight
pq_364075005 | Heart rate
pq_396552003 | Abdominal circumference
pq_50373000 | Body height measure
pq_60621009 | Body mass index
pq_8462.4 | Intravascular diastolic:Pres:Pt:Arterial system:Qn:
pq_8480.6 | Intravascular systolic:Pres:Pt:Arterial system:Qn:
pq_Portavita435 | Smoking history: number of years
pq_Portavita436 | Smoking history: units per day
pq_Portavita437 | Cessation attempts
pq_Portavita457 | Pack years
===== Portavita814.json ===== | Comorbidities (CVRM)
cd_22298006 | Myocardial infarction
cd_230690007 | Cerebrovascular accident
cd_266257000 | Transient cerebral ischaemia
cd_312975006 | Microalbuminuria
cd_367416001 | Angina pectoris
cd_370992007 | Dyslipidaemia
cd_38341003 | Hypertensive disorder
cd_386137000 | Tortuous coronary artery
cd_400047006 | Peripheral vascular disease
cd_48194001 | Pregnancy-induced hypertension
cd_69896004 | Rheumatoid arthritis
cd_73211009 | Diabetes mellitus
cd_84114007 | Heart failure
cd_Portavita829 | Chronic Obstructive Pulmonary Disease (COPD)
===== Portavita846.json ===== | Thorax
cd_168734001 | Standard chest X-ray abnormal

(939 rows)

Bibliography

[1] AXLE project GitHub page: AXLE Healthcare Benchmark. https://github.com/AXLEproject/axle-healthcare-benchmark, 2015.

[2] T. Benson. Principles of health interoperability HL7 and SNOMED. Springer Science & Business Media, 2012.

[3] K. Boone. The CDA™ Book. Springer London, 2011.

[4] HL7 FHIR documentation. http://hl7-fhir.github.io/index.html, 2014.

[5] Orange Data Mining. http://orange.biolab.si, 2015.

[6] Portavita Benchmark web page. http://portavitabenchmark.com, 2015.
