

Integrating GIS components with knowledge discovery technology for environmental health decision support

Yvan Bédard a,c,*, Pierre Gosselin b,c, Sonia Rivest a, Marie-Josée Proulx a, Martin Nadeau a, Germain Lebel c, Marie-France Gagnon c

a Centre for Research in Geomatics, Université Laval, Pavillon Casault, Québec, Canada, G1K 7P4
b Institut national de santé publique du Québec INSPQ, Beauport, Canada
c Centre hospitalier universitaire de Québec (CHUQ), 2705 boulevard Laurier, Sainte-Foy, Québec, Canada, G1V 4G2

Received 20 June 2002; received in revised form 25 August 2002; accepted 17 October 2002

KEYWORDS Decision-support; Environmental health; Geographic knowledge discovery (GKD); Geographic information systems (GIS); Spatial on-line analytical processing (SOLAP); Public health

Summary This paper presents a new category of decision-support tools that builds on today’s Geographic Information Systems (GIS) and On-Line Analytical Processing (OLAP) technologies to facilitate Geographic Knowledge Discovery (GKD). This new category, named Spatial OLAP (SOLAP), has been an R&D topic for about 5 years in a few university labs and is now being implemented by early adopters in different fields, including public health where it provides numerous advantages. In this paper, we present an example of a SOLAP application in the field of environmental health: the ICEM-SE project. After having presented this example, we describe the design of this system and explain how it provides fast and easy access to the detailed and aggregated data that are needed for GKD and decision-making in public health. The SOLAP concept is also described and a comparison is made with traditional GIS applications. © 2002 Elsevier Science Ireland Ltd. All rights reserved.

1. Introduction

Public health organizations collect significant volumes of data. Monitoring and assessing trends of environmental exposures and related health problems require health specialists to access appropriate information in a timely manner. This is true for public health planning, management and surveillance purposes in general. Quality information helps to identify and prioritize problems, to develop and evaluate policies and actions, to organize clinical health services delivery, to guide research and development, to contribute to standards and guidelines development as well as to monitor progress and to inform the public. These general needs can be fulfilled with a series of systems that can be grouped into the following classes, by increasing level of technical complexity:

*Corresponding author. E-mail addresses: [email protected] (Y. Bédard), [email protected] (P. Gosselin), [email protected] (S. Rivest), [email protected] (M.-J. Proulx), [email protected] (M. Nadeau), [email protected] (G. Lebel), [email protected] (M.-F. Gagnon).

International Journal of Medical Informatics (2003) 70, 79-94

www.elsevier.com/locate/ijmedinf

1386-5056/03/$ - see front matter © 2002 Elsevier Science Ireland Ltd. All rights reserved. doi:10.1016/S1386-5056(02)00126-0

1) find what information exists and where it is stored (e.g. digital libraries, web portals, search engines, metadata engines, spatial data infrastructures);

2) access and query the data (e.g. Database Management Systems (DBMS), low-end Geographic Information Systems (GIS), spatial viewers);

3) visualize pre-built outputs (e.g. Executive Information Systems (EIS), dashboards, DBMS views, GIS result sets);

4) create new types of outputs (e.g. query builders, report builders, low-end GIS);

5) perform advanced analysis (e.g. statistical packages, high-end GIS);

6) perform interactive exploration of large amounts of data (e.g. On-Line Analytical Processing (OLAP), Spatial OLAP (SOLAP));

7) trigger automatic detection of patterns in the data (e.g. data mining).

A recent study of the needs of high-level health specialists conducted across Canada in different organizations (federal, provincial, universities, private companies) showed that the above functions are currently being used by the respondents to an important degree, but mostly with the non-spatial component of the data [1]. Non-spatial knowledge discovery is already emerging at that level of decision-making. However, the picture is completely different in non-specialized local or regional public health agencies, as shown for instance in a large survey done for the province of Quebec [2], where these tools are only sparingly used, although in demand for the mid-term.

The pan-Canadian study also showed that the geospatial component is expected to be used in all of the functions listed above in the very short term. According to this study, it is going to happen because the use of the geospatial component allows for better presentation and visualization of the data, improved dissemination and communication, enhanced analysis and better support for decision-making. The study also indicated that specialists gradually implement geo-digital libraries, spatial viewers, low-end and high-end GIS packages to conduct the first five functions with the geospatial component of the data. However, the integration of the geospatial component in the last two functions is only starting, mostly in research groups. New technologies that better support these functions would give health specialists a new exploration and analysis potential known as Geographic Knowledge Discovery (GKD) [3,4]. GKD tools directly support decision-making involving geospatial data.

Although knowledge discovery in general may be conducted using tools such as OLAP and data mining (respectively, user-driven and software-driven knowledge discovery), today’s commercial packages rarely take into account the geospatial component of the data. To support GKD in an efficient manner, appropriate geospatial technology has to be developed. GIS alone is simply not an efficient solution because it is built on a transactional paradigm [5]. In order to better meet the above-mentioned needs with a fast and intuitive solution, one needs a system built on the multidimensional paradigm. Integrating GIS and OLAP to offer interactive data exploration in a hypertext-style manner with cartographic, statistical and tabular views represents an innovative solution. Their combination into so-called SOLAP [6-9] provides a new capability for GKD that goes well beyond traditional GIS. For example, SOLAP and spatial data mining [10,7,11] allow users to navigate easily and rapidly within geospatial datasets as well as within descriptive data. This better supports the creation and/or the first validation of hypotheses: implicit spatial relationships between phenomena rapidly become evident, and new relations are more likely to emerge in the mind of the user, who does not have to bother with Structured Query Language (SQL)-type commands or wait more than a few seconds between the display of detailed or general maps, detailed or aggregated histograms, detailed or summarized tables, etc. Such GKD tools help to discover new tracks for analysis and to better focus research, to rapidly eliminate irrelevant hypotheses and also to give access to large volumes of data that are sometimes difficult to access for non-GIS and non-database specialists. Furthermore, improved and easier access to basic GIS functionalities for non-specialists facilitates the inclusion of data-based evidence in the decision-making processes for such generic tasks as locating service delivery units (on the basis of criteria such as population density, access by public transit, rates of disease in the community, etc.) and marketing health promotion programs where they are most needed (for instance, high-risk subpopulations).

Today, SOLAP has become a viable solution to explore geospatial health data and the most recent research aims at improving the underlying fundamental concepts as well as the capabilities of emerging commercial offerings (see [4] for good examples). The goal of this paper is to present an example of a SOLAP-based system developed for environmental health purposes during the ICEM-SE project. We begin with a presentation of the system that has been developed, including an overview of its content, its functions and its architecture. Then, we describe the SOLAP concept in more detail with references to the ICEM-SE data and GIS technology where appropriate. This description is followed by a comparison between SOLAP and GIS.

2. The ICEM-SE project: a practical example of a SOLAP-based system for environmental health

This section describes a practical example of a SOLAP-based system developed during the ICEM-SE project (Cartographic Interface for the Multidimensional Exploration of Environmental Health Indicators on the World Wide Web). This research and development project aimed at building a new type of cartographic interface for the multidimensional exploration of environmental and health indicators via the World Wide Web. It is one of the GEOIDE projects (Canada’s Network of Centers of Excellence in Geomatics) [12].

2.1. Objectives of the ICEM-SE project

The general objective of the project was to improve the capabilities of geomatics technologies for decision-support (user-driven GKD). More precisely, the project aimed at building and testing a new geomatics solution, a SOLAP-based user interface, to facilitate the access and exploration of environmental and health geospatial data for health specialists [13].

The long-term goal of the ICEM-SE project was to help reduce health risks caused by an environmental source by providing quick and easy access to high quality environmental and health data to improve decision-making and interventions, access to statistics and other information and the discovery of new knowledge.

The geomatics research objectives were to adapt the concepts coming from multidimensional databases and OLAP for a geospatial context. In parallel, the health research objectives were to develop meaningful indicators in environmental health for their use in the SOLAP-based interface.

The technical objective was to develop fully functional prototypes that would provide users without any GIS background the capability to easily and rapidly explore their data in order to understand complex phenomena related to environmental health.

2.2. General description of the prototypes

The project strategy included the exploration of different combinations of technologies and the development of generic shells to build prototypes. Using commercial software components, our prototypes were designed as much as possible to facilitate their reuse for new types of data exploration or other public health applications. This strategy allows for:

- the addition of new health and environmental data;
- the addition of new cartographical layers;
- the customization of the different displays (number of classes used in a statistical chart or map, default parameters for map or chart semiology, selection of axes in tables, etc.).

Within the prototypes developed, users can easily navigate through:

- different levels of detail, for example from local to regional and provincial levels and conversely, or from all cancers to cancers of the respiratory system to lung cancers;
- different themes, for example from asthma to cancer and other diseases, or from industrial pollutants to drinking water quality and other environmental factors;
- different epochs;
- different subgroups of population (age groups and sex);
- different statistical measures.

Results are presented via displays with synchronized refreshing that can be used to navigate into the database with functions such as ‘drill-down’, ‘roll-up’, ‘pivot’ and ‘drill-across’. The synchronized displays may include several:

- thematic maps;
- statistical diagrams (bar charts, pie charts);
- tables.

2.3. Development of the prototypes

Two SOLAP prototypes have been developed: an entry-level prototype for simpler navigation and a high-end prototype with more GKD functions. Both prototypes are based on the multidimensional database structure as used in data warehousing, OLAP and data mining systems. The developed prototypes provide quick and easy access to environmental and health data for a temporal coverage varying from 5 to 15 years depending on the topic and for a geographic coverage for the province of Quebec at the local level (community health centers (CLSC)), regional level (regional health authorities (RSS)) and provincial level. Currently, the prototypes contain data and metadata about the following indicators:

- Cancer (incidence and deaths);
- Respiratory diseases (hospitalizations and deaths);
- Notifiable diseases (incidence);
- Poisonings;
- Air quality monitoring;
- National pollutant release inventory;
- Greenhouse gas;
- Pesticides sales;
- Waste management;
- Environmental health teams activities.

Other indicators can easily be added to either prototype, as this has become a straightforward operation. The actual indicators are structured differently in the two prototypes as explained in the next pages.

2.4. Details on the health data sources

The health data sources used in this project were the following:

- individual data on new cancer cases (incidence): the Quebec tumors file (‘Fichier des tumeurs du Québec’), Quebec Ministry of Health and Social Services;
- individual death data: the deaths file (‘Fichier des décès’), Quebec Ministry of Health and Social Services;
- individual hospitalization data: the Med-Echo registry (‘Registre Med-Echo’), Quebec Ministry of Health and Social Services.

For each case (incidence, death or hospitalization), the data collected at the time of the event were: the diagnosis or the death cause (according to the International Classification of Disease, Ninth revision), the sex, the age, the event date, the municipal code and the postal code of the individual’s principal residence. The postal code has been used to assign the correct CLSC code. This process has been done according to the territorial divisions (the M-22 system) effective 31st March, 1999, as recommended by the Quebec Ministry of Health and Social Services.

Due to the confidentiality of individual health data, these are currently only available to the public health director for the region under his responsibility. For this project, an agreement has been made with the Quebec regional board, and only the postal codes for the Quebec region residents were available.

The population data per year, per sex, per 5-year age group and per community health center were obtained from the Quebec Ministry of Health and Social Services [14].

The standardized rates were calculated according to the direct standardization method. The weight system used in the standardization process has been calculated from the 1991 population data (men and women grouped) for the whole province of Quebec. The comparative figure corresponds to the ratio of the standardized rate for a particular territory and the standardized rate for the province.
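The calculation described above can be illustrated with a minimal Python sketch. It assumes the usual form of direct standardization (age-specific rates weighted by the share of each age group in a reference population). The function names, the three age groups and all counts are hypothetical and are not taken from the ICEM-SE data; the project itself computed these statistics with a statistical package.

```python
def direct_standardized_rate(cases, population, standard_population, per=100_000):
    """Directly standardized rate for one territory.

    cases, population: counts per age group in the territory.
    standard_population: counts per age group in the reference population
    (in the paper, the 1991 Quebec population, men and women grouped).
    """
    total_standard = sum(standard_population)
    rate = 0.0
    for c, p, s in zip(cases, population, standard_population):
        weight = s / total_standard   # share of this age group in the standard
        rate += weight * (c / p)      # weighted age-specific rate
    return rate * per


def comparative_figure(territory_rate, province_rate):
    """Ratio of the territory's standardized rate to the provincial one."""
    return territory_rate / province_rate


# Hypothetical reference population and counts for three age groups.
STANDARD = [2000, 5000, 3000]

territory = direct_standardized_rate([4, 5, 9], [1000, 2500, 1500], STANDARD)
province = direct_standardized_rate([30, 50, 120], [10000, 25000, 15000], STANDARD)

print(round(territory, 1), round(province, 1))
print(round(comparative_figure(territory, province), 2))
```

With these invented counts the territory rate is about 360 per 100,000, the provincial rate about 400 per 100,000, and the comparative figure about 0.9, i.e. the territory sits slightly below the provincial level once age structure is accounted for.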

2.5. Entry-level ICEM-SE prototype

First, to illustrate the operation of the entry-level prototype, a simple analysis is presented: what were the regional health authorities, in the province of Quebec, with a high comparative hospitalization figure from asthma in 1998? For the Quebec region (the region of Quebec City) in particular, what was each community health center’s comparative hospitalization figure? Were the comparative figures different according to the sex of the affected persons?

To conduct this analysis with a GIS would require several lines of SQL querying (if the user masters SQL and the database structure) or complex manipulation (if the GIS provides a graphical user interface). In addition, the response times could vary from several seconds to minutes. On the other hand, with the ICEM-SE entry-level SOLAP prototype, the user rapidly executes the following tasks with a few mouse clicks:

1) The user clicks on the desired information elements (called dimension members and measures in the multidimensional vocabulary), for example ‘Asthma’ - ‘Hospitalizations’ - ‘Comparative figure’ - ‘Regional health authorities’ - ‘1998’ in the selection trees of the navigation panel (see Fig. 1) and, after clicking on the appropriate button, the prototype always displays the corresponding thematic map within 10 s.

2) To have more details about the Quebec region, the user clicks on this region directly on the map and then executes a drill-down operation (by clicking on the appropriate button of the user interface).

3) To see the comparative figures for men and women, the user first selects ‘Women’ in the ‘Population’ selection tree and displays the ‘Women’ map by clicking on the ‘Map’ button. Then, the user selects ‘Men’ in the ‘Population’ selection tree and displays a second map by clicking on the ‘Map’ button.

4) The user then changes the type of display to a bar chart by clicking on the ‘bar chart’ button of the interface.

Fig. 1 Interface of the ICEM-SE entry-level prototype application.

Fig. 2 Map of the comparative hospitalization figure from asthma, for the different regions of the province of Quebec, in 1998, for all the population.

Fig. 3 Map of the comparative hospitalization figure from asthma, in 1998, for the community health centers corresponding to the Quebec region, for all the population.

Fig. 4 Map of the comparative hospitalization figure from asthma, in 1998, for the community health centers corresponding to the Quebec region, for women.

Fig. 5 Map of the comparative hospitalization figure from asthma, in 1998, for the community health centers corresponding to the Quebec region, for men.

Fig. 6 Bar chart of the comparative hospitalization figure from asthma, in 1998, for the community health centers corresponding to the Quebec region, for men.

This sequence of tasks is illustrated in Figs. 1-6.

It is to be noted that this example does not take into account the level of statistical significance (this is the default option). However, it is possible to display the results using the 1 or 5% levels of statistical significance of the comparative hospitalization figure.

This new type of interface is very easy to use and fast enough to support decision-making. It allows fast and easy navigation within the health and environmental data at whatever level of aggregation they are. The displays are drawn and updated very quickly (e.g. 3-10 s on a notebook PC). The same analysis, conducted with a traditional GIS and transactional database, would have required the following steps (assuming that the health statistics are already calculated, inside a DBMS or a statistical package, for example).

To create the first thematic map (regional health authorities):

- to select the appropriate geographic layer;
- to select the type of thematic map (range map);
- to select the field to map;
- to create the ranges of the map;
- to modify the styles of the ranges.

To create the second thematic map (community health centers):

- to repeat all the same steps required to create the first map.

To create the third thematic map:

- to modify the field to map;
- to modify the ranges of the map;
- to modify the styles of the ranges.

To create the fourth thematic map:

- to modify the field to map;
- to modify the ranges of the map;
- to modify the styles of the ranges.

To create the bar chart:

- to select the data to graph;
- to select the type of graph;
- to modify the different parameters of the graph according to the preferences.

These steps may require several mouse clicks each and they are not straightforward for non-GIS specialists (e.g. doctors, epidemiologists). In addition, the process of creating the different displays takes a certain time. This does not help the user to maintain a train of thought when analyzing the data or when trying to find correlations and trends.

The entry-level prototype has been developed using a so-called ‘star data structure’ with Microsoft Access and the SoftMap cartographic engine (a cartographic visualization software from SoftMap Technologies Inc.) in a custom Visual Basic application. It aims to satisfy the needs of typical users (about 90% of the public health professional staff) and remains a low-cost solution. Currently, base data are imported from different government sources. Statistical data and their aggregates are calculated by specialists using the SAS statistical package before being integrated in the SOLAP database. Users can access these prepared data as they are, without any customization except for on-the-fly reclassifying. Nevertheless, users have the freedom to choose the desired analysis elements (called dimension members and measures in the multidimensional vocabulary), the display types and the graphical semiology. Different types of choropleth maps and point maps are possible on top of raster 1:50 000, 1:250 000 and 1:8 000 000 topographic base maps (satellite imagery is also available). Navigation is possible within the navigation menus and the thematic maps, but not in the statistical charts and tables. With this prototype, doctors, epidemiologists and other health professionals can now produce, within seconds, several hundreds of thousands of maps without even touching their keyboard. This entry-level prototype presently works on a local workstation. In the next version, however, it will be extended for use on the Internet by replacing the SoftMap GIS engine by JMap (a Java-based web mapping solution that supports groupware, from Kheops Technologies Inc.).
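To make the ‘star data structure’ mentioned above more concrete, here is a small, hedged sketch of a star schema: one central fact table whose rows point to surrounding dimension tables. It uses Python's built-in sqlite3 module in place of Microsoft Access purely for illustration. All table names, column names and counts are invented and do not reflect the actual ICEM-SE database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical names): one row per member.
CREATE TABLE dim_territory (territory_id INTEGER PRIMARY KEY, clsc TEXT, region TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_population (population_id INTEGER PRIMARY KEY, sex TEXT);
-- Fact table at the centre of the star: foreign keys plus a measure.
CREATE TABLE fact_hospitalization (
    territory_id INTEGER REFERENCES dim_territory,
    time_id INTEGER REFERENCES dim_time,
    population_id INTEGER REFERENCES dim_population,
    hospitalizations INTEGER
);
""")
conn.executemany("INSERT INTO dim_territory VALUES (?, ?, ?)", [
    (1, "CLSC A", "Region 1"), (2, "CLSC B", "Region 1"), (3, "CLSC C", "Region 2"),
])
conn.execute("INSERT INTO dim_time VALUES (1, 1998)")
conn.executemany("INSERT INTO dim_population VALUES (?, ?)", [(1, "Women"), (2, "Men")])
conn.executemany("INSERT INTO fact_hospitalization VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 12), (1, 1, 2, 15), (2, 1, 1, 7), (2, 1, 2, 9), (3, 1, 1, 20), (3, 1, 2, 18),
])


def hospitalizations_by(level, year, sex=None):
    """Aggregate the measure at a territorial level ('clsc' or 'region')."""
    if level not in ("clsc", "region"):   # whitelist: level is spliced into the SQL
        raise ValueError("unknown level: " + level)
    sql = (
        f"SELECT t.{level}, SUM(f.hospitalizations) "
        "FROM fact_hospitalization f "
        "JOIN dim_territory t ON t.territory_id = f.territory_id "
        "JOIN dim_time d ON d.time_id = f.time_id "
        "JOIN dim_population p ON p.population_id = f.population_id "
        "WHERE d.year = ? AND (? IS NULL OR p.sex = ?) "
        f"GROUP BY t.{level} ORDER BY t.{level}"
    )
    return dict(conn.execute(sql, (year, sex, sex)).fetchall())


print(hospitalizations_by("region", 1998))
print(hospitalizations_by("clsc", 1998, sex="Women"))
```

Switching `level` from `"region"` to `"clsc"` mimics a drill-down on the territorial dimension, and the optional `sex` argument mimics selecting a member of the population dimension. In the prototype such aggregates were prepared in advance rather than computed on demand.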

2.6. High-end ICEM-SE prototype

Below is another simple analysis to illustrate the operation of the high-end prototype. For example, the user is searching for possible causes of the high comparative hospitalization figure from asthma for the period covering 1994-1998. With the ICEM-SE high-end SOLAP prototype, the user rapidly executes the following tasks:

Fig. 7 Interface of the ICEM-SE high-end application prototype.

Fig. 8 Comparative hospitalization figure from asthma for the different regions of the province of Quebec, for 1994-1998, for all the population.

Fig. 9 Average SO2 concentration for 24 h periods (default), for the different regions of the province of Quebec, for 1994-1998.

Fig. 10 Average SO2 concentration for hour periods, for the different regions of the province of Quebec, for 1994-1998.

Fig. 11 Average O3 concentration for hour periods, for the different regions of the province of Quebec, for 1994-1998.

Fig. 12 Average O3 concentration for hour periods, for the different sampling stations of the Montreal region, for 1994-1998.

1) The user clicks on the desired dimension members and measures, for example ‘Hospitalizations from respiratory diseases’ - ‘Asthma’ - ‘Comparative figure’ - ‘Regional health authorities’ - ‘1994-1998’ in the selection trees of the navigation panel. The displays are updated automatically. The user finds that high comparative hospitalization figures are mostly located in the regions surrounding Montreal.

2) The user wants to look at the air quality monitoring results for these regions and selects ‘Air quality monitoring’ - ‘SO2’ - ‘Average concentration’ - ‘Regional health authorities’ - ‘1994-1998’ in the selection trees. The results for the 24-h periods (the default) do not seem to have a relation with the high comparative figures.

3) The user selects the ‘Hour’ period in the ‘Periods’ selection tree. The new results do not seem to have a relation with the asthma problems.

4) The user selects ‘O3’ in the ‘Contaminants’ selection tree. The results seem to be more interesting than the SO2 results. More investigation could then be undertaken to verify if there is a certain correlation between the high comparative hospitalization figures from asthma and the high average O3 concentrations.

5) To have more details about the average O3 concentration at the different sampling sites of the Montreal region, which has the highest average concentration among all the regions of the province of Quebec, the user executes a drill-down operation directly in the table (in the Montreal region cell).

This sequence of tasks is illustrated in Figs. 7-12.

This example shows that a SOLAP-based interface allows the user to concentrate on his analysis needs rather than on how to use the software or on how to formulate queries.

This second prototype uses Microsoft SQL Server, Microsoft Analysis Services (Microsoft’s OLAP server), ProClarity (an OLAP client from ProClarity Inc.) and KMapX (a MapInfo MapX-based plug-in allowing basic cartographic visualization and manipulation of the geospatial data, also from ProClarity Inc.). It is developed using HyperText Markup Language (HTML) and VBScript and is accessible via the Internet for the clients that have installed the ProClarity plug-in. The cartographic component of this high-end prototype will also, in the short term, be replaced by JMap as the latter is a more flexible cartographic visualization and manipulation engine with groupware functions, interoperability capabilities and a very efficient vector-based applet.

This high-end prototype is also very easy to use and very fast over the web. It aims at satisfying the needs of technically advanced users (about 10% of the public health professional staff). It is more flexible than the first prototype presented and allows the users to create their own dimension members and measures from the data that are stored in the databases. The different health and environment indicators are here structured in multidimensional data cubes. Users have the freedom to select the desired dimension members, existing or new measures, the display types and the graphical semiology. Navigation is possible via the navigation menus and via all the display types. Navigation via the legend is also possible. This high-end prototype is intended for Internet use.

Table 1 Characteristics of OLTP and OLAP systems

OLTP | OLAP
--- | ---
Original source | Copy or read-only data
Detailed data | Detailed and aggregated data
Current data | Historical and current data
Priority to data security and integrity | Priority to data exploration and analysis
Normalized data structure (no, or low, data redundancy) | Denormalized data structure (redundancy encouraged if it increases query performance)
Continually updated | No update, periodical addition of new data only
Query tool dependent on the data structure (a user must know the data structure to query it efficiently) | No query tool, the user interacts directly with the data
Non-aggregative queries (little data per transaction, mostly update operations) | Aggregative queries (lots of data per transaction, analysis operations)
Concepts: table, column, tuple, key | Concepts: dimension, member, measure, fact, cube


3. The spatial on-line analytical processing (SOLAP) concept and its comparison with GIS

OLAP has been defined for the first time as ‘(...) the name given to the dynamic enterprise analysis required to create, manipulate, animate and synthesize information from exegetical, contemplative and formulaic data analysis models. This includes the ability to discern new or unanticipated relationships between variables, the ability to identify the parameters necessary to handle large amounts of data, to create an unlimited number of dimensions, and to specify cross-dimensional conditions and expressions’ [15]. The reader is referred to [15] for a detailed description of each data analysis model. Caron [16] proposed another OLAP definition: “A software category intended for the rapid exploration and analysis of data based on a multidimensional approach with several aggregation levels”. We must add to this latter definition the fact that the exploration and analysis of data is usually driven by the user with OLAP technology while it is usually automated with data mining technology (and the boundary between the two tends to blur over time).

OLAP technology relies on the multidimensional database approach, which introduces concepts that differ from those of the transactional database approach typical of GIS applications. These multidimensional concepts include dimensions, members, measures, granularity, facts and data cubes. The dimensions represent the analysis themes, or analysis axes (e.g. ‘time’, ‘cancer’, ‘territorial subdivisions’). A dimension contains members (e.g. ‘1998’, ‘stomach cancer’, ‘Quebec region’) that are organized hierarchically into levels of granularity (e.g. ‘province’, ‘regional health authorities’, ‘local health authorities’). The members of one level (e.g. months) can be aggregated to form the members of the next higher level (e.g. years). Dimensions can be of different types: temporal, spatial (non-cartographic in the case of a conventional OLAP tool) and descriptive (or thematic). The measures (e.g. standardized rate) are the numerical attributes analyzed against the different dimensions. A measure can thus be considered the dependent variable, while the dimensions are the independent variables (e.g. the measure ‘standardized rate’ depends on the members of the ‘cancer’, ‘time’, ‘population’ and ‘territorial subdivisions’ dimensions). The different combinations of dimension members and measures represent facts (e.g. the standardized rate of death from stomach cancer for the year 1998, for women and for the Quebec region, is 4.079). A data cube is a set of measures aggregated according to a set of dimensions [17]. Inside a data cube, the possible aggregations of measures over all possible combinations of dimension members (the facts) can be pre-computed to increase query performance. Several data cubes can be built from the same data sources, as they are ‘read-only’ datasets (e.g. several SOLAP applications could import their data from the same GIS). Table 1 presents the differences between a transactional database (also called ‘On-Line Transaction Processing’ (OLTP)) in a relational server (typical of GIS applications) and an analytical database in a multidimensional server (typical of decision-support systems built with OLAP).
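To make the vocabulary above concrete, the following is a minimal sketch of a data cube in plain Python. It is not the ICEM-SE implementation: the fact values, the `ALL` marker and the function name `build_cube` are assumptions made for illustration, and a simple additive measure (a death count) stands in for the standardized rate used in the paper.

```python
from collections import defaultdict
from itertools import product

# Hypothetical facts: (time, cancer, territory, deaths).
# All values are invented for illustration only.
FACTS = [
    ("1998", "stomach cancer", "Quebec region", 12),
    ("1998", "lung cancer", "Quebec region", 40),
    ("1999", "stomach cancer", "Quebec region", 10),
    ("1998", "stomach cancer", "Montreal region", 30),
]

ALL = "ALL"  # assumed marker meaning "aggregated over every member"


def build_cube(facts):
    """Pre-compute the measure for every combination of dimension
    members, including the ALL member of each dimension."""
    cube = defaultdict(int)
    for *members, deaths in facts:
        # Each fact contributes to 2^n cells: every dimension either
        # keeps its own member or is aggregated to ALL.
        for key in product(*[(m, ALL) for m in members]):
            cube[key] += deaths
    return dict(cube)


cube = build_cube(FACTS)
print(cube[("1998", "stomach cancer", "Quebec region")])  # 12
print(cube[("1998", "stomach cancer", ALL)])              # 42
print(cube[(ALL, ALL, ALL)])                              # 92
```

Because every aggregate is computed once when the cube is built, answering a query becomes a dictionary lookup, which is the performance argument made in the text. The cost is that the number of stored cells grows quickly with the number of dimensions and members.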

The general OLAP architecture comprises three components: the multidimensionally structured database, the OLAP server and the OLAP client, which accesses the database via the OLAP server. The OLAP client allows the end user to visualize the data using different types of diagrams (e.g. bar charts and pie charts) and tables. It also allows the user to explore and analyze the data using operators such as drill-down (show a more detailed level inside a dimension), roll-up (show a more general level inside a dimension), drill-across (show another theme at the same level of detail) and swap (interchange visible dimensions in the chart or table). Such a system is built especially to navigate within the data cube, i.e. to go from one fact to another in a simple manner and to obtain fast responses.
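The navigation operators described above can be sketched as small functions over a table of cells. This is only an illustration under stated assumptions: the monthly counts are invented, the function names `roll_up`, `drill_down` and `swap` are hypothetical, and a real OLAP client would delegate these operations to the OLAP server rather than compute them locally.

```python
from collections import defaultdict

# Hypothetical monthly counts: (year, month, territory) -> deaths.
MONTHLY = {
    ("1998", "01", "Quebec region"): 3,
    ("1998", "02", "Quebec region"): 2,
    ("1999", "01", "Quebec region"): 4,
    ("1998", "01", "Montreal region"): 5,
}


def roll_up(cells):
    """Move from the month level to the year level of the time dimension."""
    out = defaultdict(int)
    for (year, _month, territory), value in cells.items():
        out[(year, territory)] += value
    return dict(out)


def drill_down(cells, year, territory):
    """Show the monthly detail behind one (year, territory) cell."""
    return {
        month: value
        for (y, month, t), value in cells.items()
        if y == year and t == territory
    }


def swap(table):
    """Interchange the two visible dimensions of a two-dimensional view."""
    return {(b, a): value for (a, b), value in table.items()}


yearly = roll_up(MONTHLY)
print(yearly[("1998", "Quebec region")])                 # 5
print(drill_down(MONTHLY, "1998", "Quebec region"))      # {'01': 3, '02': 2}
print(swap(yearly)[("Quebec region", "1998")])           # 5
```

Drill-across is omitted here because it requires a second measure or theme at the same level of detail, which this small table does not contain.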

It is commonly reported in the literature, and our prototypes have also shown, that the multidimensional approach to analysis agrees better with the end user’s mental model of the data than the traditional transactional approach [18]. The interface of a tool exploiting the multidimensional paradigm, such as OLAP, provides unique capabilities to explore data in an intuitive and interactive way (similar to web hyperlinks). The user can perform simple to complex analyses mostly by clicking on data that are organized in a meaningful way [19]. Such ease and speed are two essential conditions for an analyst to maintain a train of thought when exploring or validating hypotheses. Health users already report that OLAP provides timely information and assistance in decision-making, program evaluation and analysis [1]. This will likely prove even more evident with SOLAP.

Fig. 13 Differences between typical GIS and OLAP applications with regards to three axes of requirements for spatial decision-support. After [20].

GIS are known to be poorly adapted to decision-making: they have complex query interfaces and spatial operators that are not intuitive for non-specialists (e.g. doctors, epidemiologists), they do not support aggregate data well, and processing times may be very long for the complex queries that are typical of strategic decision-making. However, they are very useful for the visualization and manipulation of cartographic data. Since data visualization facilitates the extraction of insight from the complexity of the spatio-temporal phenomena and processes being analyzed, some authors claim that GIS are decision-support tools. Nevertheless, one needs to fully harness the power of multi-scale maps with multi-level, multi-theme, multi-epoch data to reach a better understanding of the structure and relationships contained within the datasets. GIS alone cannot do this in a fast and intuitive manner; one needs SOLAP capabilities on top of, or in tandem with, GIS. In the context of GKD, maps and graphics do more than make data visible; they are active instruments in the end user’s thinking process [20] and as such must support spatial navigation operators like spatial drill-down and spatial roll-up as well as thematic operators. Such data manipulation allows access to the intelligence contained in the data. Fig. 13 shows the characteristics of typical SOLAP applications compared with the characteristics of typical GIS applications.

Fig. 14 Current architecture of the ICEM-SE prototypes.

Fig. 15 Future architecture of the ICEM-SE applications.

Without spatial navigation operators and map visualization, conventional OLAP possesses only a limited potential to support GKD [16]. Commercial systems integrating OLAP with spatial display functionalities have recently appeared on the market, but they have many limitations. The ideal SOLAP tool must offer a level of flexibility not currently available in order to meet multidimensional spatio-temporal analysis needs [21].
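The spatial roll-up and spatial drill-down operators discussed in this section can be illustrated with a minimal sketch over a two-level territorial hierarchy. Everything here is an assumption made for illustration: the territory names, the counts, and the use of a crude rate per 100,000 in place of the standardized rate reported by the prototypes. A real SOLAP client would also redraw the map at the new level, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical spatial hierarchy: local health authority -> regional one.
PARENT_REGION = {
    "Local A": "Region 1",
    "Local B": "Region 1",
    "Local C": "Region 2",
}

# Hypothetical additive measures per local territory: (deaths, population).
LOCAL_CELLS = {
    "Local A": (4, 50_000),
    "Local B": (6, 150_000),
    "Local C": (3, 100_000),
}


def spatial_roll_up(cells, parent):
    """Aggregate additive measures to the next higher spatial level."""
    out = defaultdict(lambda: [0, 0])
    for territory, (deaths, population) in cells.items():
        region = parent[territory]
        out[region][0] += deaths
        out[region][1] += population
    return {region: tuple(totals) for region, totals in out.items()}


def spatial_drill_down(cells, parent, region):
    """Return the local territories that make up one region."""
    return {t: v for t, v in cells.items() if parent[t] == region}


def crude_rate_per_100k(deaths, population):
    """Derived measure, recomputed at each level rather than summed."""
    return 100_000 * deaths / population


regional = spatial_roll_up(LOCAL_CELLS, PARENT_REGION)
print(regional["Region 1"])                        # (10, 200000)
print(crude_rate_per_100k(*regional["Region 1"]))  # 5.0
print(sorted(spatial_drill_down(LOCAL_CELLS, PARENT_REGION, "Region 1")))
```

The design choice worth noting is that the rate is derived from the aggregated counts at each level. Adding the local rates (8.0 and 4.0 here) would give 12.0 rather than the regional value of 5.0, so only additive measures are rolled up directly.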

4. Discussion

Fig. 14 shows the current architecture of both prototypes, which are based on different technologies. For the entry-level prototype, the source data are imported into SAS®, where the statistical data are calculated. These statistical data are then integrated into a Microsoft Access® database and accessed by the prototype user interface. For the high-end prototype, the source data are imported directly from the sources into a temporary data warehouse stored in Microsoft SQL Server®. Multidimensional data cubes are then built, in Microsoft Analysis Services®, from the data stored in the warehouse. These data cubes are accessed by the prototype user interface.
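The data flow of the high-end prototype (sources, then a temporary warehouse, then data cubes, then the user interface) can be sketched as follows. This is a stand-in under explicit assumptions, not the prototype's code: an in-memory SQLite database replaces the SQL Server warehouse, a Python dictionary replaces the Analysis Services cube, and the table name `fact_deaths`, the function names and the source rows are all hypothetical.

```python
import sqlite3

# Hypothetical source records: (year, territory, deaths).
# Values are invented for illustration only.
SOURCE_ROWS = [
    ("1998", "Quebec region", 12),
    ("1998", "Montreal region", 30),
    ("1999", "Quebec region", 10),
]


def load_warehouse(rows):
    """Step 1: import the source data into a temporary warehouse
    (an in-memory SQLite database stands in for the real server)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_deaths (year TEXT, territory TEXT, deaths INTEGER)"
    )
    conn.executemany("INSERT INTO fact_deaths VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def build_cube(conn):
    """Step 2: build aggregated cells from the warehouse,
    one query per aggregation level."""
    cube = {}
    detailed = conn.execute(
        "SELECT year, territory, SUM(deaths) FROM fact_deaths "
        "GROUP BY year, territory"
    )
    for year, territory, total in detailed:
        cube[(year, territory)] = total
    by_year = conn.execute(
        "SELECT year, SUM(deaths) FROM fact_deaths GROUP BY year"
    )
    for year, total in by_year:
        cube[(year, "ALL")] = total
    return cube


# Step 3: the user interface would read from the cube, not the warehouse.
conn = load_warehouse(SOURCE_ROWS)
cube = build_cube(conn)
conn.close()
print(cube[("1998", "ALL")])  # 42
```

Once the cube is built, the warehouse connection can be closed, which mirrors the role of the warehouse as a temporary staging area in the architecture described above.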

A future version of the system, to be implemented at the Quebec Ministry of Health and Social Services in 2002, will be based on the architecture presented in Fig. 15.

For the high-end application, the source data are imported into a temporary data warehouse stored in Microsoft SQL Server®. Multidimensional data cubes are then built, in Microsoft Analysis Services®, from the data stored in the warehouse. These data cubes are accessed by the application. For the entry-level application, the source data are imported directly from the multidimensional data cubes into the relational database. This database is accessed by the application. Both applications will use JMap® as the mapping and spatial navigation engine.

This approach provides a built-in quality control mechanism in the sense that methodological and organizational decisions are made only once, in the central unit in charge of the system, by specialists in epidemiology, statistics, computer science, geomatics and confidentiality protection. Crucial concerns such as data validation, choice of appropriate statistical tests and measures, statistical stability of data, restriction of access to personal data, warnings and other similar topics can be addressed in a uniform and state-of-the-art manner before wider dissemination and use of the data for everyday interventions. This can best be done in the administrative unit responsible for data quality and confidentiality. In North America, this is likely to be the provincial or state public health agency and, at a more aggregate level, a federal agency such as Health Canada or the Centers for Disease Control and Prevention (CDC). One may also find enough expertise in large metropolitan areas (or even small areas for some categories of data) to apply such a quality-control approach, and it then becomes a matter of internal administrative agreements within an organization.

Very little training is necessary to use the above applications. Our tests with end users have shown that less than an hour of training is sufficient to use the software. The first results have been very well received by future users in Public Health and also by users in other fields of application where similar systems are being developed with new research challenges.

5. Conclusion

This paper presented a tool that has been developed to support GKD in the field of environmental health. This tool enhances GIS software with Spatial OLAP (SOLAP) capabilities to better support decision-making for health users who need to analyze interconnections between risk factors, clusters, interventions and outcomes. It also better supports health users who need to rapidly discover or eliminate potential relations between health problems and environmental factors, to better target intervention efforts or the distribution of medical resources. These are only a few applications in Public Health that can benefit from a technology providing fast access to detailed and aggregated data, whether on maps, tables or charts, and providing database navigation capabilities without the need to learn a query language. Such a SOLAP application:

• aims at supporting, transparently, the way public health specialists think and analyze;

• allows them to focus on the results of the navigation rather than on the analysis process itself (i.e. focus on ‘what to obtain’ rather than on ‘how to obtain it’);

• is used without knowing any query language;

• provides practically instantaneous response times (the optimal response time for spatio-temporal exploration and analysis being less than 10 s [22]).

The two prototypes developed during this project have achieved these objectives. They have been described and an example analysis has been presented for each. The first results have been very well received by future users in Public Health. The final development and implementation of both prototypes for the Quebec Ministry of Health and Social Services should be completed by the beginning of 2003.

Acknowledgements

This research was carried out with the financial support of the GEOIDE Network of Centers of Excellence projects SOC#1 (Cartographic Interface for the Multidimensional Exploration of Environmental Health Indicators) and DEC#2 (Designing the Technological Foundations of Geospatial Decision-Making with the World Wide Web), the Quebec Ministry of Health and Social Services and the Natural Sciences and Engineering Research Council of Canada individual research grant program.

References

[1] P. Gosselin, Y. Bédard, M. Jerrett, S.J. Elliott, R. Catelan, P. Poitras, A. Gingras, GIS and OLAP in Health Surveillance: Needs Analysis for Successful Integration, Report presented to Health Protection Branch, Health Canada, 2000.

[2] D. Bélanger, P. Gosselin, G. Lebel, Bilan et perspectives en matière de surveillance en protection de la santé publique, Rapport déposé au ministère de la Santé et des Services sociaux du Québec, réalisé avec la collaboration de l’INSPQ et du Centre de recherche du CHUQ, 2002.

[3] W.J. Frawley, G. Piatetsky-Shapiro, C.J. Matheus, Knowledge discovery in databases: an overview, in: G. Piatetsky-Shapiro, W.J. Frawley (Eds.), Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, 1991.

[4] H.J. Miller, J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor & Francis, London, 2001.

[5] Y. Bédard, T. Merrett, J. Han, Fundamentals of spatial data warehousing for geographic knowledge discovery, in: H. Miller, J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor & Francis, London, 2001.

[6] Y. Bédard, Spatial OLAP, Videoconference, 2ème Forum annuel sur la R-D, Géomatique VI: Un monde accessible, Montreal, Canada, November, 1997.

[7] J. Han, Conference Tutorial Notes: Spatial Data Mining and Spatial Data Warehousing, Paper presented at the Fifth International Symposium on Spatial Databases (SSD’97), Berlin, Germany, 1997.

[8] N. Stefanovic, Design and Implementation of On-Line Analytical Processing (OLAP) of Spatial Data, M.Sc. Thesis, Simon Fraser University, Vancouver, Canada, 1997.

[9] M.L. Gonzales, Spatial OLAP: Conquering Geography, DB2 Magazine, Retrieved 23 November, 1999, from http://www.db2mag.com/db_area/archives/1999/q1/99sp_gonz.shtml, Spring, 1999.

[10] K. Koperski, J. Adhikary, J. Han, Spatial Data Mining: Progress and Challenges, Paper presented at the SIGMOD’96 Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, June, 1996.

[11] M. Ester, H.P. Kriegel, J. Sander, Spatial data mining: a database approach, in: M. Scholl, A. Voisard (Eds.), Advances in Spatial Databases, Springer, Berlin, 1997, pp. 47–66.

[12] The GEOIDE Network of Centers of Excellence: Geomatics for Informed Decisions, Retrieved September 17, 2002 from http://www.geoide.ulaval.ca

[13] Y. Bédard, M. Nadeau, M.J. Proulx, A New Tool for User-driven Geographic Knowledge Discovery with Application to Environment Health Indicators, Paper presented at the General Annual Meeting of the NCE GEOIDE, Fredericton, Canada, June 2001.

[14] G. Pelletier, La population du Québec par territoire de CLSC, de DSC et de RSS, pour la période 1981 à 2016, Rapport préparé par le Ministère de la Santé et des Services sociaux, 1996.

[15] E.F. Codd, S.B. Codd, C.T. Salley, Providing OLAP (On-Line Analytical Processing) to User-Analysts: an IT Mandate, Hyperion White Paper, 1993.

[16] P.-Y. Caron, Étude du potentiel OLAP pour supporter l’analyse spatio-temporelle, Mémoire de maîtrise, Université Laval, Sainte-Foy, Canada, 1998.

[17] E. Thomsen, G. Spofford, D. Chase, Microsoft OLAP Solutions, Wiley, New York, 1999.

[18] OLAP Council, OLAP and OLAP Server Definitions, Retrieved October 10, 1999 from http://www.olapcouncil.org/research/glossaryly.htm, 1995.

[19] P. Youngworth, OLAP Spells Success For Users and Developers, Data Based Advisor, December, 1995, pp. 38–49.

[20] A.M. MacEachren, M.-J. Kraak, Research challenges in geovisualization, Cartography and Geographic Information Science 28 (1) (2001) 3–12.

[21] S. Rivest, Y. Bédard, P. Marchand, Towards better support for spatial decision-making: defining the characteristics of Spatial On-Line Analytical Processing (SOLAP), Geomatica, Journal of the Canadian Institute of Geomatics 55 (4) (2001) 539–555.

[22] P. Marchand, Y. Bédard, G. Edwards, A hypercube-based method for spatio-temporal exploration and analysis, GeoInformatica, July, 2002, in press.
