Journal of Medical Systems, Vol. 25, No. 3, 2001

An Approach for Integrating Heterogeneous Information Sources in a Medical Data Warehouse

E. M. Kerkri,1,6 C. Quantin,1 F. A. Allaert,1,2 Y. Cottin,3 Ph. Charve,4 F. Jouanot,5 and K. Yetongnon5

In recent years, medical professionals have witnessed an explosive growth in data collected by various organizations and institutions. At the same time, ongoing developments in networking technologies give doctors the capability to access these data across the boundaries of interconnected computers. In this paper we present a medical data warehousing methodology that uses data semantics to regroup and merge patients’ medical data from different health information systems, which may be autonomous and heterogeneous. The proposed solution takes into account European laws concerning the security and anonymity of personal data.

KEY WORDS: data warehousing; data integration; heterogeneous systems.

INTRODUCTION

In recent years, medical professionals have witnessed an explosive growth in data collected by various organizations and institutions. At the same time, ongoing developments in networking technologies give doctors the capability to access data across the boundaries of interconnected computers.

To perform semantic integration, the authors of (1) propose a pragmatic way to describe the semantics of the elements of a database, based on a bottom-up three-step process: (1) a back documentation of the elements of the system from their description contained in the data catalog of the database, (2) a first semantic extension to transform a data catalog into a data dictionary, and (3) a second semantic extension to create a dictionary of the medical concepts from a data dictionary. Their approach focuses on the design of a dictionary structure able to describe the data catalog of a database and to extend it with the semantics expressed at the conceptual level, that is, the semantics of the domain of end users.

1Dijon University Hospital, Medical Informatics Department (Pr Quantin), CHU, 1 Bd. Jeanne d’Arc, BP 1542, 21034 Dijon cedex, France.

2Healthcare Security and Software Quality, European Committee for Standardization.

3Dijon University Hospital, Cardiology Department (Pr Wolf), Dijon, France.

4Dijon University Hospital, Anesthesia Unit, Dijon, France.

5University of Burgundy, LE2I, Informatics Engineering team (Pr Yetongnon), Universite de Bourgogne, LE2I, Equipe Ingenierie Informatique et Bases de Donnees, BP 47870, 21078 Dijon cedex, France.

6To whom correspondence should be addressed.

0148-5598/01/0600-0167$19.50/0 © 2001 Plenum Publishing Corporation

The aim of this work is to define a data warehouse framework that regroups patients’ medical information from various health structures at a regional level and integrates it into a comprehensive information system. The warehouse provides access to relevant data concerning the patient, such as his or her antecedents and risk factors, in order to enhance diagnosis and medical decisions. The information stored in the warehouse can also be used for epidemiological and medicoeconomic studies.

To provide reliable, secure, and controlled information to practitioners, we propose a data integration approach that combines a data warehouse strategy with a context mediation methodology. The fusion of medical data is guided by a semantic data warehousing methodology that facilitates the integration and migration of data from distributed and heterogeneous systems.

We propose a semantic model to associate a meaning with information and a context fusion mechanism (1) to optimize the discovery of relevant data according to pathology or symptom requirements, (2) to automate the integration of retained information and the migration of data, and (3) to generate views on the data warehouse suitable for pathology requirements.

Our solution respects European legislation concerning medical file linkage. An anonymous record linkage procedure has been developed(2) to ensure anonymity. The following section presents a myocardial infarction follow-up project as a motivating example, together with data warehousing background. The general architecture of Epidware is then introduced under Epidware: A Medical Data Warehousing Framework. Tools and methodologies for constructing the integrator and the wrapper are presented under Constructing the Integrator and Constructing the Wrapper, respectively, before the paper concludes.

MOTIVATING EXAMPLE AND BACKGROUND

Epidemiological Follow-up of the Myocardial Infarction Cases

At the international level, there are important variations in strategies for the care of myocardial infarction (IDM), in particular in the diagnostic and therapeutic approaches. A recent paper(3) shows the difficulties in interpreting the results of the main randomized studies that aim to compare these strategies, because the patients and the “reference centers” included in these studies are not representative of the whole population.

In France, Hannebicque(4) stresses that there are currently no epidemiological data concerning the participation of nonteaching hospitals in the management of myocardial infarction. For example, the USIK study,(5) which included 373 French centers in November 1995, did not provide any information on the proportion of patients cared for according to hospital type (teaching, nonteaching, for-profit, not-for-profit).

Myocardial infarction registries (MONICA in particular) more often concern the pre-hospitalization period and the first hospital stay for the acute information but do not provide precise information about the organization of the care after this first stay.

The main objective of the project currently being developed in the Burgundy region is to obtain a follow-up of myocardial infarction cases. This project will make it possible to obtain an economic assessment of medical strategies and to develop epidemiological studies, such as an evaluation of the impact of risk factors on cardiovascular mortality, taking into account the hospital type.

The Epidware architecture and methodology will make it possible to regroup useful data for such projects. The heterogeneity of information sources, due for example to the different codes used for encoding medical data, will be transparent to end users.

Data Warehousing Concept

In the database technology area, a data warehouse is defined as a collection of decision support functionalities, or an environment, that allows decision-makers to make more rapid and relevant decisions. The data warehouse has to ensure the coherence of information and facilitate access to it. These objectives imply the integration of query tools, analysis tools, and information presentation tools. Data have to be carefully gathered from different information sources, cleaned and filtered, and then stored only after validation of their quality. Thus, the information provided by a data warehouse is not crude data but information that assists decision-makers. Data in the warehouse are subject-oriented, aggregated, easily accessible, reliable, relevant, nonvolatile, and possess a specific temporal context.(6)

The principle of data warehousing is to extract useful information, and then to combine and consolidate it into a coherent data repository. This improves data quality and allows users to retrieve necessary data by themselves. The coherence of data is measured globally, based on the viewpoint of the manager who seeks, for example, data that are complete and without contradiction.(7)

Data Warehousing Characterization

A recent study based on 456 published articles on data warehousing classifies the data warehousing literature and identifies the advantages and disadvantages that Information Systems managers will encounter when developing data warehouses.(8) Some of the advantages mentioned are: allowing existing legacy systems to continue in operation, consolidating inconsistent data from various legacy systems into one coherent set, improving data quality, and allowing users to retrieve necessary data by themselves. Another important advantage is that clients of the data warehouse cannot directly query the data sources, which improves both the security and the productivity of the production databases.

The most frequently mentioned disadvantages concern complexity of development, time, and cost constraints. Each data warehouse has a unique architecture and a set of requirements. Builders need to pay as much attention to the structure, definitions, and flow of data as they do to choosing hardware and software. Data warehouse construction requires a sense of anticipation about the future usage of the collected records. In some cases, specifying and justifying the needs may take much time. Data must be moved or copied from existing databases, sometimes manually, and data need to be translated into a common format. These are some of the reasons why data warehouses are expensive to build.

Data Warehousing in Medical Area

In the past few years, some data warehousing projects have been developed with statistical or managerial objectives. Medical and epidemiological domains now also require data warehouse solutions. At a regional level, health structures do not have the same legal status and are independent of each other. Moreover, their information systems are autonomous and heterogeneous. Thus, building a data warehouse for medical and epidemiological data will be very useful to healthcare networks, enabling the enhancement of patient care.

However, some specificities of the medical area have to be taken into account to ensure the success of such projects. At a legal and deontological level, the most important requirement is to respect the privacy of the patient’s medical data. In the same way, as health professionals do not belong to the same structures, which are independent of each other, the confidentiality of the activity of each structure must be ensured as well. At a technical level, information sources are heterogeneous, autonomous, and have an independent life cycle. Thus, cooperation between these systems needs specific solutions.

EPIDWARE: A MEDICAL DATA WAREHOUSING FRAMEWORK

Epidware is an integrated system for providing access to a collection of heterogeneous medical information systems. It is based on two evolving information design methodologies: data warehousing, and database integration and interoperation. Data warehousing, the implementation process of data warehouses, is a progressive process that can be carried out in two main steps. First, data marts, which are data warehouses related to a given department or activity, are implemented. Next, two types of development are possible according to the organizational choice of the enterprise or institution: (1) progressive centralization of strategic data or (2) decentralized implementation of service-specific data marts.

The second methodology used in Epidware is the integration or interoperation of information systems. The first step is to create a target schema that defines the overall structure of the data that are merged to create the data mart or warehouse. To model the target data structure, the designers must establish, with the help of the enterprise or institution, a dictionary of data descriptions (metadata) and specify tools to extract, translate, and integrate data from the different data or information sources (the initial databases). This requires integration techniques to reconcile semantic differences among the schemas or data structures of the underlying databases. Once the target schema is implemented, the data warehouse must be updated to ensure that the format of data corresponds to the users’ needs, which may change later on. Furthermore, changes in the sources of information have to be propagated to the data warehouse. Finally, data from the source databases are extracted, aggregated, and filtered to harmonize their formats and to eliminate redundancies. Before loading data into the data warehouse, coherent values are assigned to variables or fields that cannot be initialized from the local systems.

Fig. 1. EPIDWARE architecture.

Figure 1 depicts the general architecture of the Epidware integrated system. It consists of information systems at the lowest level, a group of components called wrappers at the intermediate level, and the Integrator and Data Warehouse at the top level. These components are described in detail in the following sections.

Information Sources

The low-level information sources can have different formats(9): database systems, file systems, HTML documents, or knowledge bases. In the medical context, the information sources are databases or any information systems used in public or private hospitals, biomedical analysis laboratories, and radiology departments. The data are either patients’ case data or general public health medical information.

Wrapper

A wrapper provides three main functions: translation, monitoring, and anonymity. The main role of the Translator is to convert data from the format and model of the information sources to the common format and model of the data warehouse. Translation consists in making the underlying information source appear as if it were described in the data warehouse model. For example, if the information source is a simple data file and the data warehouse uses the relational data model, the wrapper must provide an interface that presents the data in a relational format. (Most commercial data warehousing systems assume that both the information sources and the data warehouse are relational.)
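The flat-file case mentioned above can be illustrated with a minimal sketch. It assumes a semicolon-delimited export and uses an in-memory SQLite database as a stand-in for the relational model of the warehouse; the file layout, the table name `admission`, and the column names are hypothetical and do not come from Epidware.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file export of a local source (layout and values are illustrative).
FLAT_FILE = """patient_code;diagnosis;admission_date
a1f3;I21.9;1999-03-02
b7c2;I21.0;1999-04-17
"""


def wrap_flat_file(text: str, conn: sqlite3.Connection) -> int:
    """Present a delimited file as a relational table and return the number of rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS admission ("
        "patient_code TEXT, diagnosis TEXT, admission_date TEXT)"
    )
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = [(r["patient_code"], r["diagnosis"], r["admission_date"]) for r in reader]
    conn.executemany("INSERT INTO admission VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = wrap_flat_file(FLAT_FILE, conn)

# Once wrapped, the file can be queried like any relational source.
codes = [
    row[0]
    for row in conn.execute(
        "SELECT patient_code FROM admission "
        "WHERE diagnosis LIKE 'I21%' ORDER BY patient_code"
    )
]
```

The point of the sketch is only that the consumer of the wrapper sees a table and issues relational queries, regardless of how the source stores its data.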

The main role of the Monitor is to detect data changes that concern the data warehouse and to propagate them to the integrator. It initiates an update of the data warehouse when changes are made to local data. Changes in the local data are translated by the wrapper from the local format and model of the information source to the format and model of the data warehouse, just as for the data themselves.

A different approach is not to detect changes as they occur and to delay updating the data warehouse. The data warehouse is then updated by periodically extracting and copying data from the information sources. The Integrator can then combine these data with data from other sources or ask all information sources to make changes to the warehouse. This implies that the data warehouse is off line during updates, and thus cannot be concurrently accessed. However, if continuous access to up-to-date information is required, it is preferable to adopt the principle of detecting changes and propagating them to the data of the warehouse.

The first stage of the Anonymity procedure is included in the Wrapper. This tool encodes the information extracted from sources to guarantee the confidentiality of patients’ data. This encoding is based on a nonreversible encryption algorithm.(2)
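As an illustration of change detection by the Monitor, the following sketch compares two successive snapshots of a source and reports what was inserted, updated, or deleted. Snapshot comparison is only one possible detection technique and is an assumption here, as are the function name `detect_changes` and the sample data; the source does not specify how Epidware detects changes.

```python
def detect_changes(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by record id and report the differences."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = sorted(previous.keys() - current.keys())
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


# Hypothetical snapshots of a local table: record id -> diagnosis code.
previous_snapshot = {1: "I21.9", 2: "I21.0", 3: "I25.2"}
current_snapshot = {1: "I21.9", 2: "I21.4", 4: "I20.0"}

changes = detect_changes(previous_snapshot, current_snapshot)
```

Only the differences would then need to be translated and sent to the integrator, rather than the whole content of the source.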

Integrator

To load information into the data warehouse, the integrator must filter, refine, integrate, and combine information coming from the local sources. It provides the decision-maker with a clear, relevant, and nonredundant view of patients’ medical information. To achieve this, the integrator includes several functionalities. First, it receives change notifications from the wrapper and updates the data warehouse. Second, it applies the anonymity algorithms to data received from information sources. Finally, it merges received medical files to create a coherent and integrated medical record for patients.
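The first of these functionalities, applying change notifications to the warehouse, might look like the sketch below. The notification layout (`source`, `record_id`, `op`, `row`) and the use of a dictionary as the warehouse are hypothetical simplifications chosen for the example, not the Epidware design.

```python
def apply_notification(warehouse: dict, note: dict) -> None:
    """Apply one change notification, already translated by the wrapper, to the warehouse."""
    key = (note["source"], note["record_id"])
    if note["op"] == "delete":
        warehouse.pop(key, None)
    elif note["op"] in ("insert", "update"):
        warehouse[key] = note["row"]
    else:
        raise ValueError(f"unknown operation: {note['op']}")


warehouse = {}
notifications = [
    {"source": "cardiology", "record_id": 7, "op": "insert", "row": {"diagnosis": "I21.9"}},
    {"source": "cardiology", "record_id": 7, "op": "update", "row": {"diagnosis": "I21.0"}},
    {"source": "anesthesia", "record_id": 3, "op": "insert", "row": {"asa_class": 2}},
    {"source": "anesthesia", "record_id": 3, "op": "delete", "row": None},
]
for note in notifications:
    apply_notification(warehouse, note)
```

Keying rows by source and local record id keeps notifications from different sources from overwriting each other before the linkage step merges them per patient.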

CONSTRUCTING THE INTEGRATOR

Data Conflicts

Merging or combining data from heterogeneous information sources can be hindered by many heterogeneity problems resulting from different types of conflicts. Low-level conflicts concern hardware, operating systems, and interconnection networks. For example, communication protocol conflicts can be resolved by using Internet technologies such as HTTP, TCP/IP, Java, and CORBA.

Data conflicts are the result of information heterogeneity at the application level. Three main types of data conflicts can be distinguished. Syntactic conflicts are related to conceptual differences between the models used to represent information (XML, relational, object-oriented, etc.). Schematic conflicts appear when different sources use different data structures and classifications to represent the same kind of information (type conflicts, naming conflicts, and conceptual granularity conflicts). Finally, semantic conflicts are due to the fact that information can be interpreted differently or have a different meaning depending on the local domain of application. Different semantic conflicts have been identified, including value, taxonomy, scale, granularity, and cognitive conflicts.

Data Fusion: An Interoperable Framework

The goal of data fusion is to resolve data conflicts in order to combine information from remote sources in a transparent, correct, and coherent manner. The literature on cooperative information systems and database research presents several approaches directed toward solving data conflict problems. Several categories of approaches have been identified. The federation of databases(10) is used to define a static cooperation of information systems in which a federated schema is created and used to merge or integrate the local systems. User queries submitted on the federated schema are decomposed into elementary subqueries against the local schemas. The mediation approach is used for a more dynamic environment comprising a large number of information sources. Typically, it consists of two modules: a wrapper is used to resolve syntactic conflicts of local sources, while a mediator is used to carry out data fusion by resolving schematic and semantic data conflicts.(11) Depending on which data conflict resolution tools are used, two mediation approaches can be distinguished. In the schema mediation approach, the mediator contains an integrated schema that is used to reconcile one or more information systems. A static query process(12) is used to execute user requests. In the context mediation approach, semantic knowledge or metadata are used to resolve data conflicts and process queries dynamically.(13) A context precisely defines a domain of application. It can be based on an ontology that provides a language and a vocabulary to describe a “thematic world.”
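To make the idea of context mediation concrete, the sketch below resolves a scale conflict at query time from declared contexts rather than from a fixed integrated schema. The source names, the `Context` class, and the choice of patient height as the conflicting attribute are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Semantic description of how a source reports patient height."""
    unit: str


# Metadata: scale factors from each unit to a canonical unit (metres).
TO_METRES = {"m": 1.0, "cm": 0.01}

# Hypothetical contexts declared for one local source and for the warehouse.
CONTEXTS = {"hospital_a": Context(unit="cm"), "warehouse": Context(unit="m")}


def mediate(value: float, source: str, target: str) -> float:
    """Convert a value from the context of `source` to the context of `target`."""
    src, dst = CONTEXTS[source], CONTEXTS[target]
    return value * TO_METRES[src.unit] / TO_METRES[dst.unit]


height_m = mediate(172, "hospital_a", "warehouse")  # about 1.72
```

Adding a new source then means declaring its context, without rewriting the conversion logic or an integrated schema.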

Metadata Repository

The Epidware methodology proposed in this article is based on a semantic mediation approach to data integration that is under development at the University of Burgundy. An important characteristic of Epidware is the use of metadata. The metadata are stored in a metadata repository, which is essential both for resolving semantic differences among information sources and for understanding the information stored in data warehouses. Metadata are defined as data or information about data; in other words, metadata are a definition or description of data. According to Brackett,(14) there are two aspects to metadata. The semantic aspect includes data that represent the meaning of the data: what the data represent in the real world. It includes the formal data names, comprehensive data definitions, data integrity and accuracy, and any other data that help business clients readily understand the data. The corporeal aspect includes data that represent the physical nature of the data, as they are stored. It includes the data types, formats, locations, database management systems, and any other data that help database analysts manage the database.
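The two aspects of metadata described above can be pictured as a repository entry with a semantic part and a corporeal part. The class and field names below are hypothetical and the sample values are invented for the illustration; the source does not describe the actual structure of the Epidware repository.

```python
from dataclasses import dataclass


@dataclass
class SemanticAspect:
    """What the data mean in the real world."""
    formal_name: str
    definition: str
    integrity_rule: str


@dataclass
class CorporealAspect:
    """How and where the data are physically stored."""
    data_type: str
    storage_format: str
    location: str
    dbms: str


@dataclass
class MetadataEntry:
    semantic: SemanticAspect
    corporeal: CorporealAspect


repository = {}


def register(entry: MetadataEntry) -> None:
    """Index an entry by its formal data name."""
    repository[entry.semantic.formal_name] = entry


register(
    MetadataEntry(
        semantic=SemanticAspect(
            formal_name="admission_date",
            definition="Date on which the patient was admitted to the care unit",
            integrity_rule="must not be later than the discharge date",
        ),
        corporeal=CorporealAspect(
            data_type="DATE",
            storage_format="YYYY-MM-DD",
            location="cardiology.stay.adm_dt",
            dbms="relational",
        ),
    )
)
```

A lookup such as `repository["admission_date"].corporeal.location` would tell a translator where to find the value, while the semantic part tells an end user what it means.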

Fig. 2. Data model translation.

CONSTRUCTING THE WRAPPER

Data Model Translation

The integration of data from remote heterogeneous sites requires the presentation of data in a syntactically homogeneous format. The relational model is retained as the common model to exchange information. Epidware uses many translators, which transform source information into a target representation written in this common model. Specialists build the data warehouse by integrating the whole shared schema expressed in the common representation and can focus on the resolution of schematic and semantic conflicts (helped by metadata).

A translator is a black box with one input and one output. The transformation process is similar to a 2-2 translator, with a schema S1 using a specific model M1 as input and a schema S2 using the common model MC as output (Fig. 2). Instances of S1 follow the same translation path: all information contained in instances of S1 has to be found in instances of S2.

A translator respects several properties:

• Source schema S1 and target schema S2 have to be equivalent. As M1 and MC are heterogeneous, S1 and S2 cannot be identical; however, all information represented in S1 is expressed in S2. S1 and S2 are said to be semantically equivalent: they represent the same real-world description with different concepts. The translator has to minimize semantic loss.

• All instances of S1 have to be correctly rewritten as instances of S2. A translator has functionalities to adapt data (unit change, value name change, etc.) and to aggregate data: data values have a homogeneous format, which facilitates the integration process.

The monitor sends new instances to the translator, which transforms them into new information populating the data warehouse.
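The data adaptation performed by a translator (unit change and value name change) can be sketched as follows. The source field names `sexe` and `poids_g`, the code table, and the target field names are hypothetical; they stand for an instance of S1 being rewritten as an instance of S2.

```python
# Value name change: map the local coding of sex to the common coding (illustrative codes).
SEX_CODES = {"1": "M", "2": "F", "M": "M", "F": "F"}


def translate_instance(record: dict) -> dict:
    """Rewrite one instance of a hypothetical source schema S1 as an instance of S2."""
    return {
        "patient_code": record["patient_code"],   # anonymous key preserved unchanged
        "sex": SEX_CODES[record["sexe"]],          # value name change
        "weight_kg": record["poids_g"] / 1000,     # unit change: grams to kilograms
    }


source_instance = {"patient_code": "a1f3", "sexe": "2", "poids_g": 68500}
target_instance = translate_instance(source_instance)
```

Every field of the source instance has a counterpart in the target instance, which reflects the requirement that no information be lost in translation.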

Anonymous data are used as key attributes or foreign key attributes in the common representation of a given schema. They allow one to combine data from a source and to fuse patient data from many distributed sites. The preservation of anonymous fields during the translation step is the key feature of the integration and migration steps.

Secure Data Transfer

From a legal and deontological point of view, the anonymity and linkage procedures ensure the confidentiality of medical data.(2) The anonymity software is based on the Secure Hash Algorithm (SHA), which applies an irreversible transformation to the nominative variables. SHA is considered to be one of the most secure systems against cryptanalytic attacks. Although irreversible, this algorithm does not prevent dictionary attacks, which exhaustively compare the code to be deciphered with a large number of hashed identities. To guard against this attack, a pad key is introduced before SHA is applied. Moreover, a spelling treatment can be applied to some variables. This provides two advantages: (1) it reduces the effect of spelling and orthographic mistakes, and (2) it applies a first transformation to the initial string, which complicates the task of a potential malicious cryptanalyst.
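The combination of spelling treatment, pad key, and hashing can be sketched as below. Several points are assumptions: the source names only "SHA", so SHA-256 from Python's `hashlib` is used as a stand-in; the spelling treatment is reduced to removing accents, case, and non-alphanumeric characters; and the pad value and the way it is combined with the data (simple prefixing) are illustrative, not the published procedure.

```python
import hashlib
import unicodedata

# Hypothetical secret pad shared by all sites taking part in the linkage.
PAD = b"shared-secret-pad"


def normalize(value: str) -> str:
    """Simplified spelling treatment: drop accents, case, spaces, and punctuation."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.upper() if ch.isalnum())


def anonymize(value: str, pad: bytes = PAD) -> str:
    """Hash the pad followed by the normalized value; the result cannot be reversed."""
    return hashlib.sha256(pad + normalize(value).encode("ascii")).hexdigest()


code_a = anonymize("Hélène")
code_b = anonymize(" helene")  # same person, different spelling at another site
```

Because both spellings normalize to the same string, the two sites obtain the same code, while an attacker who does not know the pad cannot precompute a dictionary of hashed identities.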

From the data-engineering point of view, the code resulting from the anonymity procedure can be considered as an identifier, especially if the number of variables concerned by the procedure is large enough (3 to 5). This makes it possible to merge medical data and information concerning a given patient with a near-zero risk of error.

The first name, the last name, the date of birth, and the zip code are nominative (in the sense of the law, i.e., directly or indirectly nominative) and have to be transformed and hashed in order to ensure the confidentiality of personal data. The same pad and the same spelling treatment have to be applied to the values of these variables to enable the merging of data from heterogeneous sources.

At the level of the information sources, the patient identifier may not be the same in different sources. It depends on the choice of the DBMS administrator (in general, a numerical or alphanumerical code). As we cannot use the Social Security Number because of legal considerations, we consider the hashed variables as the key identifier of the patient’s record.

To decide whether to merge two informative-object instantiations, we have to make sure that they belong to the same person, that is, that the values of the variables constituting the identifier are respectively identical.
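The merge decision described above amounts to grouping records whose hashed identifying variables are all identical. The sketch below assumes that each record already carries the four hashed variables under hypothetical field names (`h_first`, `h_last`, `h_birth`, `h_zip`); the short strings stand in for real hash codes.

```python
from collections import defaultdict


def link_records(records: list) -> dict:
    """Group records whose hashed identifying variables are respectively identical."""
    linked = defaultdict(list)
    for rec in records:
        key = (rec["h_first"], rec["h_last"], rec["h_birth"], rec["h_zip"])
        linked[key].append(rec)
    return dict(linked)


# Hypothetical records from two sources; hash codes are shortened for readability.
records = [
    {"h_first": "9a", "h_last": "c4", "h_birth": "77", "h_zip": "e1",
     "source": "cardiology", "event": "myocardial infarction"},
    {"h_first": "9a", "h_last": "c4", "h_birth": "77", "h_zip": "e1",
     "source": "laboratory", "event": "troponin assay"},
    {"h_first": "9a", "h_last": "c4", "h_birth": "12", "h_zip": "e1",
     "source": "cardiology", "event": "angioplasty"},
]

patients = link_records(records)
```

The third record differs from the first two only in the hashed date of birth, so it is kept as a separate patient: a single differing variable is enough to prevent a merge.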

CONCLUSION

In this paper, we have presented the general architecture of Epidware to build data warehouses for epidemiological and medical studies. Our solution could be useful at several levels. In the public health care area as well as in the scientific domain, it will enable statistical and epidemiological studies, both medical-case oriented and patient oriented. This is made possible, on the one hand, by the anonymity software, which applies an irreversible transformation to the nominative variables (first name, last name, date of birth, etc.), and, on the other hand, by a linkage procedure that regroups and merges the medical information relating to a given patient.

In the medical care area, these procedures will improve the medical care of patients in the Care Units by providing exhaustive information on the patient’s medical history concerning a given medical problem.

ACKNOWLEDGMENTS

This research has been sponsored by the Burgundy Regional Council and the French Department of Education and Science.

REFERENCES

1. Staccini, P., Joubert, M., Fieschi, M., and Fieschi, D., Towards semantic integration within an existing medical information system. Proceedings of MEDINFO 98 (B. Cesnik et al., eds.), pp. 935–939, 1998.

2. Quantin, C., Bouzelat, H., Allaert, F. A., Benhamiche, A. M., Faivre, J., and Dusserre, L., How to ensure data security of an epidemiological follow-up: quality assessment of an anonymous record linkage procedure. Int. J. Med. Informatics 49:117–122, 1998.

3. Danchin, N., La reperfusion au stade aigu de l’infarctus du myocarde. La lettre de la thrombolyse 25:113–115, 1998.

4. Hannebicque, G., Bearez, E., Rifai, A., et al., Prise en charge de l’infarctus du myocarde dans les hopitaux generaux sans plateau technique. La lettre de la thrombolyse 25:119–127, 1998.

5. Cambou, J. P., Genes, N., Vaur, L., et al., Epidemiologie de l’infarctus du myocarde en France. Specificites regionales. Archi. Maladies Coeur et des Vaisseaux 90:1511–1519, 1997.

6. Inmon, W. H., and Hackathorn, R. D., Using the Data Warehouse, Wiley-QED Publication, 1994.

7. Kimball, R., The Data Warehouse Toolkit, French version by Raimond C., International Thomson Publishing France, Paris, 1997.

8. Sakaguchi, T., and Frolick, M. N., A Review of the Data Warehousing Literature. Web: http://www.people.memphis.edu/~tsakagch/dw-web.htm. Jan. 31, 1996.

9. Chaudhuri, S., and Dayal, U., An overview of data warehousing and OLAP technology. SIGMOD Record 26(1):65–74, 1997.

10. Sheth, A. P., and Larson, J. A., Federated database systems for managing distributed, heterogeneous, and autonomous databases. Computing Surveys 22(3):183–236, 1990.

11. Wiederhold, G., Mediators in the architecture of future information systems. IEEE Computer 25(3):38–49, 1992.

12. Li, C., Yerneni, R., Vassalos, V., et al., Capability Based Mediation in TSIMMIS, SIGMOD Conference 564–566, 1998.

13. Bressan, S., Goh, C. H., Fynn, K., et al., The COntext INterchange Mediator Prototype, SIGMOD Conference 525–527, 1997.

14. Brackett, M. H., A New Approach to Metadata Management. http://www.tticom.com/mddq/md-whitepaper.htm.
