interactive map reports summarizing bivariate geographic data · interactive map reports...

Procedia Computer Science 00 (2019) 1–10

ProcediaComputerScience

Interactive Map Reports Summarizing Bivariate Geographic Data

Shahid Latif, Fabian Beck

University of Duisburg-Essen, Germany

paluno – The Ruhr Institute of Software Technology

Abstract

Bivariate map visualizations use different encodings to visualize two variables but comparison across multiple encodings is challenging.Compared to a univariate visualization, it is significantly harder to read regional differences and spot geographical outliers. Especially targetinginexperienced users of visualizations, we advocate the use of natural language text for augmenting map visualizations and understanding therelationship between two geo-statistical variables. We propose an approach that selects interesting findings from data analysis, generates arespective text and visualization, and integrates both into a single document. The generated reports interactively link the visualization with thetextual narrative. Users can get additional explanations and have the ability to compare different regions. The text generation process is flexibleand adapts to various geographical and contextual settings based on small sets of parameters. We showcase this flexibility through a number ofapplication examples.

Keywords: Geographic visualization, natural language generation, interactive documents

1. Introduction

The interplay of two variables reveals how one entity poten-tially influences the other. In a geographic context, this influ-ence may depend on the geography of the region. For instance,storms might cause more fatalities in densely populated areas.Standard map visualizations such as heatmaps, choropleths, andcartograms are designed to visualize one numerical variable at atime. The visualization of two geo-statistical variables on mapsis more challenging. To simultaneously visualize two variables,a combination of two univariate maps can be overlaid, for in-stance, the first variable shown as a choropleth map and the sec-ond variable encoded in sizes of overlaid shapes; alternatively,separate views can be used. However, especially inexperiencedusers might face problems interpreting the bivariate visualiza-tion correctly and effectively. Users with a low visualizationliteracy might have problems understanding the respective vi-sualization. And even users with more experience might findit hard to detect spatial patterns and spot outliers. Hence, thereis a need to make bivariate geo-statistical visualizations moreself-explaining and guide users through the data analysis.

When a visualization is not fully self-explaining, we canadd text in the form of captions and annotations to it. Also,

Email addresses: [email protected] (Shahid Latif),[email protected] (Fabian Beck)

to describe the results of data analysis, textual representationscan easily guide a reader through important findings. Hence,we believe that augmenting a bivariate map visualization witha textual report and interactively linking both can significantlyimprove users’ abilities to understand the data. Using naturallanguage generation technology, we could easily develop a jointautomatic generation of the visualization and an accompanyingtext for a specific type of applications (e.g., reports of fatal-ities). However, we want to find more generalizable solutionsthat can deal with various types of geo-statistical variables (e.g.,fatalities, monetary, demographic). Whereas visualizations areusually generalizable to different scenarios already, automati-cally produced text heavily depends on domain vocabulary andcontext. In contrast, we propose a text generation process that isflexible and adaptable to different variable types and geographicsettings. This flexibility is achieved through a set of parametersproviding meta data and context. The parameters also influencethe visual encoding. Text and visualization are finally presentedtogether in a linked interactive representation.

We developed interactive Map Reports (iMR), a Web-basedtool that automatically generates a narrative and visualization todescribe the analysis results for bivariate geo-statistical data.The sample shown in Figure 1 explains fatalities caused bystorm events in the USA, 2017. The reports summarize note-worthy patterns and relationships among the variables. In ad-dition, they provide explanations on selected regions and the

1

Shahid Latif, Fabian Beck / Procedia Computer Science 00 (2019) 1–10 2

Figure 1. A map report describing loss of lives due to storms in the USA during 2017. The map visualization uses two different encodings to visualize a focus and acontext variable. The narrative (right column) provides an overview of the data analysis. Graphics in the text help establish linking between the two representations.Users can get additional details on a selected region or on a comparison of two selected regions (dashed rectangles).

ability to compare any two regions of interest on demand. Thecolor and shape encodings of variables in the text help establishquick linking of respective regions across both representations.To generate the reports, we combine data analysis techniqueswith natural language generation and interactive visualization.Our main scientific contributions are:

• an automatic detection and selection of relevant informa-tion from bivariate geo-statistical data (Section 3),

• a versatile template-based text generation technique toproduce a narrative adaptable to different contextual sce-narios (Section 4),

• an adaptable method to produce map reports for variousgeographical regions and granularity consisting of inter-actively linked narrative and visualizations (Section 5),and

• a demonstration of generalizability of the approach invarious application scenarios (Section 6).

The interactive system (iMR) is available at https:

//vis-tools.paluno.uni-due.de/imr and the sup-plemental material contains an interactive appendix havingadditional examples.

2. Related Work

We explore the existing literature from two differentperspectives—techniques for visualizing bivariate statisticaldata on maps and the approaches that use a combination of textand visualizations for communicating the results of analysis.

Thematic maps are used to visualize variations in values ofa variable across geographical space. The variable is mostlyencoded as colors [1], sizes and shapes of the geographical re-gions (cartograms), or specific symbols overlaid on top of themap [2]. Although most of the thematic maps are used to visu-alize a single variable (e.g., choropleths, heatmaps), there ex-ist techniques that generalize this concept to bivariate [3][4]and multivariate data [5]. A taxonomy of bivariate map typescan be found in Elmer’s work [6, Figure 2.1]. This classifica-tion is based on various combinations of visual variables andis adapted from the work of Nelson [7] and MacEachren [8].According to Elmer [6], although there are more than elevendifferent types (identified from six cartography books [6, Table2.1]) of bivariate map types, only two (bivariate choroplethsand choropleths with overlaid graduated symbols) have beengenerally used in previous literature. Bivariate map visualiza-tions, by construction, are visually more complex and harder tocomprehend in comparison to their univariate counterparts. Tofacilitate the analysis of relationships among variables in ge-

2

https://vis-tools.paluno.uni-due.de/imr

https://vis-tools.paluno.uni-due.de/imr


ographical setting, Monmonier [9] combines spatial represen-tation with visual statistical summaries (scatter plot matrix). Incontrast, we advocate textual explanations that make the visual-izations self-explaining and provide an anchor point to explorethe information specifically to users lacking visualization liter-acy.

The automatic generation of written narratives from dataand other abstracted information falls under the scope of Nat-ural Language Generation (NLG) [10][11][12]. Commercialtools such as Wordsmith and Arria NLG Studio allow for build-ing customizable templates for text generation and use an ad-vanced grammar model to do the grammar related tasks (e.g.,subject-verb agreement). However, in these systems, users haveto construct templates for each application or dataset. In con-trast, we aim at generalizing the template-based text generationapproach to different contexts by using a minimalistic set of pa-rameters.

Most of the existing approaches generate natural languagetext for non-geographic data; only few have addressed geo-graphical data. Among these, Dale et al. [13] generate routedescriptions of paths constructed from the Geographical Infor-mation System datasets. Ramos et al. [14] produce weatherforecast reports. Thomas and Sripada [15] provide (audio) sum-maries of geo-referenced data for better conveying informationto the visually impaired population. Turner et al. [16] presentroad ice weather forecasts by taking into account geographicfeatures such as altitude, direction, population, etc. A very re-cent system, SafeDrive [17], provides the textual feedback ondriving style of drivers to improve driving habits. The work ofMolina et al. [18] is close to ours as they produce descriptions ofgeographically distributed hydrological sensor data alongsidea map. While their system includes a geographical map withthe possibility to get temporal distributions of individual sen-sor readings, the textual and graphical representations are notinteractively linked. Furthermore, the major focus of these ap-proaches is on NLG and the text generation process is data de-pendent which makes them hard to generalize to other datasets.

Among the many other existing text generation ap-proaches [12], only few discuss the generation of text incombination with visualizations. They are spread acrossmany different domains, for instance, generation of instructionmanuals for simple machines [19], health care data report-ing [20][21], performance analysis of participants in virtuallearning environment [22], scuba diver’s profiling [23], anddescription of the execution behavior of a program [24]. Otherapproaches have discussed the automatic technical summariza-tion of already generated graphical content [25][26][27]. Thesesystems focus on the explanatory aspect of data analysis anddo not offer much explorability.

More recent approaches show that the interlinking of textand visualization can facilitate users in the visual explorationof data. For instance, Voder [28] uses automatically gener-ated textual descriptions about the visualized data as interac-tive links to suggest other relevant types of visualizations forbetter understanding. VisJockey [29] suggests animating corre-sponding parts of a visualization (e.g., parallel coordinates) oninteracting with the relevant text. But here the text is written

Figure 2. Bivariate map visualizations used in the explorative study. (Left, P1)Deaths caused by storms in various states of the USA. (Right, P2) Average lifeexpectancy and health expenditures across Europe.

by a human expert and not automatically generated. VIS Au-thor Profiles [30] combines generated text and visualizationsin an interactive document offering both explanatory and ex-ploratory aspect of data analysis. However, the focus of VISAuthor Profiles and other discussed approaches is narrow andthey are tailored for specific applications and are not easy to begeneralized for different contexts and datasets. In comparison,our focus is broader and the presented approach generalizes todifferent contextual settings.

Besides the interactive linking of text and visualizations, theuse of word-sized graphics or sparklines [31] has also been sug-gested for better integrating textual and visual content [32][33].In this paper, we use filled circles in line with text to better con-nect it with the map visualization as shown in Figure 1.

3. Content Selection

Any natural language generation begins with content deter-mination, which decides on what information would be con-veyed [11]. Before delving into the details of data analysis,we first introduce our target dataset and the results of an explo-rative study aimed at getting an initial overview on the possiblecontent.

3.1. Bivariate Geo-statistical DataWe are focusing on the analysis of bivariate geo-statistical

data—measurements of two numerical variables for a geo-graphic region. Particularly relevant are those scenarios whereone variable potentially influences the other. For instance, lifeexpectancy might depend on the amount of health expenditure.Similarly, intensity and number of storms can influence thenumber of lives lost. In the following, we refer to the variablethat potentially depends on the other as the focus variableand the other as the context variable. If a causality can beassumed (e.g., because it is obvious or there exists a reasonableexplanation for it), it points from context to focus variable. Inour analysis, we use three levels of geography, namely regions(e.g., USA), groups of subregions (e.g., group of states), andsubregions (e.g., individual states). We use the storm–deathdataset as our running example.

3.2. Explorative StudyTo get an initial overview of what aspects to consider while

describing bivariate geo-statistical data, we conducted an ex-3


plorative study with two participants (P1, P2). Both partici-pants were PhD students working in the field of visualization,but were not involved in this project. They were presented withan interactive version of a bivariate map visualization as shownin Figure 2, similar to the ones later used in our interactive sys-tem. In these visualizations, the focus variable was encoded inthe radius of filled circles placed on top of the choropleth mapshowing the context variable.

The participants were asked to summarize the visualization(Task I), describe one particular subregion (Task II), and pro-vide a comparative view of two given subregions (Task III).They had the possibility to write as much text as they wantedand there was no time limit.

Task I – Summary: Both participants began with the de-scription of the subregions having minimum and maximum val-ues of the focus variable followed by the explanation of outly-ing regions. P1 included information on a possible correlationbetween the variables. P2 described spatial trend of the con-text variable. Finally, both participants noticed and describedabrupt changes of values between neighboring subregions.

Task II – Region-specific description: P1 described the val-ues of both variables for the given subregion, followed by nam-ing the other regions that show similar behavior, whereas P2provided a comparison with the mean values. Further, P2 alsohighlighted a specific subregion that has higher values of thecontext variable compared to its direct neighbors.

Task III – Comparison: Both participants compared the val-ues of each variable for two regions and described them in asingle sentence. P1 included more details about one subregionas one of the subregions presented to him for comparison wasan outlier.

The results show that the prominent aspects are the report-ing of outliers (univariate and bivariate), comparisons of re-gions with their neighbors, variations of variable values acrossspace, and subregions showing similar behavior. In addition, acorrelation and variations of values across different subregionallevels (e.g., parts of Europe) might reveal interesting patternsand are worth reporting. For instance, in Figure 2 (left), thecorrelation is much stronger in the Southern states (ρ = 0.753)compared to the overall correlation for the country (ρ = 0.400).

3.3. Data Analysis

Next, we discuss statistical approaches to automaticallyidentify the content that will be part of our narrative. In contrastto basic information such as ranges of variable values, corre-lations among variables, and extreme values, the detection ofunivariate outliers, bivariate outliers, and regional differencesrequires more sophisticated data analysis approaches.

3.3.1. Univariate OutliersThe importance of extreme values (minimum and max-

imum) in a dataset varies depending on the distribution ofvariables. A Tukey’s boxplot [34] uses measures namely,the first quartile (Q1), median (Q2), third quartile (Q3), andinterquartile range (IQR = Q3 − Q1) to describe a univariatedistribution. Hoaglin et al. [35] categorize the observations

0 3,621

0 184

No. of storms

No. of deathsNevada

Florida

Figure 3. Box plots showing the distribution of deaths caused by storms in theUSA during 2017. The dataset contains univariate outliers in both variables.

Nevada

Florida

Figure 4. Bagplot of deaths caused by storms in the USA during 2017. The bag(blue) contains almost 50% of the data points, the loop (light blue) includingpoints outside bag but inside the fence. Bivariate outliers are marked as reddots.

smaller than Q1− 1.5 · IQR or larger than Q3 + 1.5 · IQR as thepotential candidates for outliers. Although somewhat arbitrary,this threshold for detecting outliers works well based on theirexperience with many datasets.

We analyze each variable individually and identify the uni-variate outliers i.e., the points lying outside [Q1−1.5·IQR,Q3+

1.5 · IQR] range. Figure 3 shows the distribution and outlierscorresponding to each of the two variables in our exemplarydataset.

3.3.2. Bivariate OutliersWe are also interested in subregions that demonstrate dif-

ferent behavior compared to the rest of the subregions based onvalues of both variables. Such a bivariate outlier may not nec-essarily be an outlier in both of the univariate variables. Forinstance, although the states of Nevada and Florida in Figure 3(marked with red dots) are not outliers in variable storms, theyare bivariate outliers as shown by the bagplot in Figure 4. Abagplot [36] is a bivariate generalization of a boxplot and vi-sualizes the distribution, spread, and outliers jointly for bothvariables. Three main components of a bagplot are: the bagcontaining 50% of the observations, the fence usually obtainedby inflating the bag by a factor of 3 separating inliers from out-liers, and the loop, that is the convex hull of the points lyingbetween the bag and the fence.

4


The detection of bivariate outliers depends on the shapeor distribution of the data, which is often characterized bya covariance matrix. For identifying the outliers, we usea well-known distance measure, the Mahalanobis distance,which takes into account the covariance matrix and is definedas the distance between an observation and a multivariatedistribution. Mathematically, this distance is specified as:

d =

√(x − µ)T S −1(x − µ) (1)

where x = (x1, x2) is the vector of variables, µ = (µ1, µ2)is the vector of means and S is a two-dimensional symmetriccovariance matrix. The resulting value d represents the Maha-lanobis distance of point x from the mean µ of the distribution.

For a constant value of d, Equation 1 defines a two-dimensional ellipsoid centered at µ. The probability ofellipsoid follows a χ2 distribution with p degrees of free-dom [37]. Therefore, the ellipsoid satisfying

(x − µ)T S −1(x − µ) ≤ χ2p(α) (2)

has a probability of 1 − α. Hence, for p = 2 (bivariate case)and α = 0.5 ⇒ χ2 = 5.99. Equation 2 states that any obser-vation is considered a bivariate outlier for which the squaredMahalanobis distance is greater than 5.99.

3.3.3. Geospatial TrendsThe behavior of any statistical variable can vary consider-

ably depending on the geographical subregion. For instance,Figure 1 shows that the coastal states of the USA have experi-enced a higher number of storms, and as a consequence, morecasualties. To identify this behavior, we take a regional subdi-vision of the overall shown geographic region under considera-tion. The United Nations geoscheme [38] provides a classifica-tion of the countries of the world into groups. For instance, Eu-ropean countries are grouped into Eastern, Western, Northern,and Southern countries. Similarly, the regional classification ofthe USA discerns West, Midwest, Northeast, and South. Usingthis grouping (or other externally provided groupings), we canlook for differences between these groups. In particular, we de-tect if there is a strong positive or negative correlation betweenfocus and context variable in one or more of these groups. Be-sides the bivariate outliers, an identification of subregions thatshow different behavior compared to the adjacent subregionscan be of interest. For instance, Figure 1 shows that the stateof Nevada has different statistics with respect to both variablescompared to its neighboring states, Arizona, California, Idaho,Oregon, and Utah. To this end, we compare the values of eachvariable for every subregion with its neighbors to identify theregions showing different statistics.

4. Text Generation

In contrast to advanced text generation approaches based ona grammar model or machine learning [12], we use a template-based text generation approach because of its good applicabilityand sufficient flexibility. Deemter et al. [10] provide an in-depthcomparison of generation approaches.

Yes

StartUnivariate

outliers?

Correlation?Bivariate

outliers?

Bivar.

outlier desc.

Vis. desc.

Pos. / neg.

corr. descr.

Uni. outlier

desc.

Extent of

focus var.

Regional

differences?

Stop

>+0.5 / < -0.5

Reg. differences

desc.

No

Yes Outlier among

neighbors desc.

Outlier among

neighbors?

Yes

No

Yes

Correlation in

reg. groups?

Reg. corr.

desc.

No

[-0.5,0.5]

Yes

>+0.75 / < -0.75

[-0.75,0.75]

No

Figure 5. Decision graph that shows the text generation process. Round-rectangular decision nodes control the path while rectangular text nodes adda text fragment when visited. The green path marks the narrative generation forthe example in Figure 1.

Table 1. User-defined parameters for configuring the map reports.

Parameter Description Values

Region Name of the region for whichmap is displayed

String value, e.g., World,Europe, Germany

Subregion Level Name of the type of regions themap is subdivided in

String value, e.g., countries,states, cities

Focus/Context Type Variable types according topredefined categories

incidents, casualties,demographic-indicator,quantitative, percentage,monetary, or indicator

Situation Type of situation with respectto focus variable

positive, negative, or neutral

Causality If causality can be assumedfrom context to focus variable

yes or no

4.1. Templates to NarrativeHaving selected the content, the next step is to transform

this information into a written narrative that consists of para-graphs containing interconnected sentences. To this end, weuse a similar approach for controlling the sequence of the gen-erated phrases and sentences as described in Method ExecutionReports [24] and VIS Author Profiles [30]. Directed acyclic de-cision graphs guide the generation flow and produce text frompre-written templates. Figure 5 shows the decision graph thatis responsible for generating the main part of our map reports.The process begins with the Start node and follows a determin-istic path until the Stop node is reached. The decision nodes(rounded rectangular) lead the path according to the values ofthe decision variables. The text nodes (rectangular) are respon-sible for sentence creation and, when visited, add a new sen-tence or phrase to the narrative. Any traversal from Start toStop node results in a meaningful narrative.

4.2. Adaptable TemplatesTo achieve flexibility in narration and to make the template

adaptable to different datasets, we leverage user-defined param-eters that describe the meta data about the scenario. Throughthese parameters, we add semantics and domain-specific vocab-ulary that cannot be automatically detected from the raw data.The list of parameters along with short descriptions and pos-sible values is shown in Table 1. The parameters Region and

5


Subregion Level define the name of the region and the nameof the level of detail for regions respectively. The parametersFocus and Context Type define the type of both variables thatcan be selected from a list of predefined values. The choiceof adjectives, quantifiers, and verbs depends on these variabletypes. For instance, for the type casualties, possible phrasesare: “X suffered several casualties”, “X reported large num-ber of deaths”, or “X lost many lives”. Similarly, for the vari-able type monetary, a possible phrase could be “X spent largeamount on Y” or “X spent less on Y”. Depending on the vari-able type, we pick verbs from a list of synonymous verbs tomake the text more interesting to read.

In addition to the quantifiers and verbs, the choice of ad-verbs (e.g., better, worst) depends on the context or situationunder consideration. We describe three possible situations:

• Positive: situations where higher values of the focus vari-able are desirable. For instance, higher values of averagelife expectancy are commonly considered to be desirable.

• Negative: situations that favor lower values of the focusvariable. For example, the cities reporting less number offatalities occurred in road accidents would be consideredas better.

• Neutral: situations that do not clearly favor small orlarge values of the focus variable. For instance, onlydepending on a country’s situation (e.g., aging society orunemployment of young people), lower or higher birthrates are desirable.

Combining the variable type with the situation, we can now usemore expressive and specific phrases to describe the results. ForFocus Type← demographic-indicator and Situation← positive,a possible phrase could be; “X reports better values of lifeexpectancy compared to Y”. Similarly, the Context Type ←incidents but situation ← negative could result in; “X was thesafest subregion due to the least number of accidents”.

Another consideration is the presence of a strong correla-tion which may wrongly be interpreted as causal. However,correlation does not imply causality and it is not possible to au-tomatically extract causality from the numerical data. The pa-rameter Causality helps in avoiding wrong interpretations aboutcausality based on the values of correlations.

4.3. Long Lists of Items

During the analysis process, we need to handle long lists ofsubregions, e.g., a larger number of univariate outliers. Eachmember of the list is associated with a numerical value of thevariable attached. Since our final output is natural languagetext, inclusion of long lists makes the text lengthy and boring toread. Therefore, we restrict the size of these lists. However,instead of cutting the list to a fixed size, we use a dynamicselection method that slices the list to have items in a givenrange [30, Section 4.4]. The list is cut at the point where thedifference to the following value is quite large.

5. Interactive Map Reports

To implement our approach, we developed Interactive MapReports (iMR), a Web-based system that generates analysis re-ports for bivariate geo-statistical data. Figure 1 shows the in-terface of our tool and the components of the generated report.A map visualization on the left visualizes two variables usingtwo different encodings. The right column presents the gener-ated narrative consisting of an overview and additional detailson the selected subregion or a comparison of any two selectedsubregions (shown below the map in Figure 1 for space effi-ciency). The small info icon indicates the availability of ad-ditional explanations—for instance, a complete list of regionswith their respective variable values or details on the analysismethods used to phrase the respective sentence. The use ofsmall graphics (circles for the focus and color coding for thecontext variable) in the text supports the quick comparison ofvarious regions while reading the text and also makes it easierto find the corresponding subregion on the map. The subregionnames are produced in boldface characters and are clickable—when clicked, the system highlights the respective subregion onthe map. A tool-tip presents the exact numerical values of bothvariables when hovering over the subregions.

5.1. Bivariate Map VisualizationFor visualizing bivariate geo-statistical data on a geograph-

ical map, we employed a standard technique that performedwell in a user study comparing different bivariate map visual-izations [6]. It uses two different encodings, one for each vari-able. The context variable is visualized as a choropleth mapbased on a single-colored linear brightness gradient. The val-ues of the focus variable are encoded in the radii of filled circlesand are overlaid on top of the choropleth map. These circles arepositioned at the centroid of the respective subregion.

The selection of colors for encoding the focus variable de-pends on the specified situation, i.e., positive→ green, negative→ red, and neutral→ orange. This choice is based on the factthat green color is generally associated with positive and safesituations while red is considered to be a sign of warning or dan-ger. However, the choice of orange color for neutral situationis somewhat arbitrary and has been chosen for better visibilityas it has to be overlaid on top of black and gray color. For thecontext variable, we always use the same neutral gradient (lightgray to dark gray) irrespective of the situation. as the situationdepends only on the focus variable.

5.2. Textual Summary of AnalysisThe first section of the generated narrative is the overview

that summarizes the results of the data analysis. This section isdivided into three paragraphs. The structure and ordering of theparagraphs are fixed but the sentences change considerably de-pending on the dataset and scenario. In Figure 1, the overviewis generated by traversing the green path in the decision graphof Figure 5.

The opening paragraph consists of a single sentence that in-troduces the dataset and the visual encodings with the help ofin-line legends. The second paragraph summarizes the results

6


Table 2. Parameter configuration for shown examples.

Fig. Title Region Subregion Level Focus Type Context Type Focus Name Context Name Situation Causality

1 Fatalities caused by storms, USA, 2017 USA states casualties incidents deaths storms negative yes

6 Average life expectancy and spendings onhealth, Europe, 2018

Europe countries demographic-indicators

monetary average life ex-pectancy

health expenditure positive no

7 Adolescent birth rates and use of Internet,World, 2015

World countries demographic-indicators

percentage adolescentbirth rate

Internet users neutral no

8 Obesity and consumption of alcohol, World,2010

World countries percentage indicator obese people alcohol consump-tion

negative no

of the univariate analysis for the focus variable. It starts by stat-ing the average values of focus variable, followed by the rangeof its values accompanied by the subregion names (text nodeVis. desc.). In case of multiple subregions having the sameminimum (or maximum) value, it names one subregion as anexample. The complete list of these regions can be viewed byhovering over the info icon . The next sentence lists the re-gions that are univariate outliers according to the focus variable(text node Uni. outlier desc.). This and all other similar listsof subregions are restricted to show only 2 to 4 subregions ac-cording to the dynamic selection method (Section 4.3) with thepossibility to view the complete list on demand.

Next, in the second paragraph, the text node Outlier amongneighbors desc. is responsible for describing the regions thatexhibit substantially different values compared to their adja-cent subregions. We use the method described in Section 3.3.1,Tukey’s fences, for identifying local outliers. It works well ifthe number of adjacent subregions are larger (e.g., for Missouri,Nevada, Texas, and Wisconsin). However, in case of few neigh-boring subregions (e.g., Florida has only two adjacent states),cannot detect meaningful outliers. For this particular case, evenDixon’s Q test [39]–efficient method for detecting outliers insmall number of observations–failed to identify Florida as anoutlier. Since these situations are harder to identify, we take aconservative decision and exclude all the subregions that haveless than three neighbors from the analysis.

The last sentence of the second paragraph goes into detailsof the regional differences found in the values of both variables(text node Reg. differences desc.). Depending on the regionalclassification of the geographical region under consideration,we describe the subregional groups that show distinct behaviorcompared to the other groups. For example, Figure 1 depictsthat Southern states lost more lives to storms in comparison toother states. The same text node produces another variationof this sentence in Figure 6 describing that, although countriesin Western Europe spend more on health, Southern Europeancountries have higher average life expectancies.

The final paragraph highlights the relationship of the con-text to the focus variable followed by a description of bivariateoutliers. It begins by describing a positive or negative correla-tion among the variables (text node Pos./neg. correlation). Incase of Causality set to yes, a different phrasing and vocabu-lary is used to imply causality. For instance, Figure 1, thirdparagraph, it is stated that “Texas experienced a high num-ber of deaths as a result of a high number of storms”. Thechoice of the phrase “as a result of” is specific to causality.

The first sentence of this paragraph is not available in Figure 1as the overall value of correlation is below the threshold value(shown in Figure 5); Figure 6 gives an example of this sen-tence. The presence of a strong positive or negative correlationamong one or more subregional groups is highlighted in thenext sentence (text node Reg. corr. desc.). Then follows the de-scription of regions that shows bivariate outliers. For instance,Figure 1 highlights that the states of Texas and Nevada are bi-variate outliers—Texas having maximum values for both vari-ables whereas Nevada suffered a very high number of casualtiesin a relatively small number of storms.

5.3. On-Demand Explanations

The overview section provides a high-level summary of theanalysis and does not include descriptions of every subregion.Therefore, in addition to the tool-tips showing the values of thefocus and context variables, we present additional descriptionson every subregion. Users can click any subregion to acquireadditional details that are displayed below the overview sectionas shown in Figure 1. The generation process for the on-demand explanations follows a similar decision graph to theone shown in Figure 5; the generated text consists of a singleparagraph.

The first sentence compares the values of the focus and con-text variables for the selected subregion with the respective av-erage values across all subregions. If the selected subregion isamong one of the extreme cases, it is stated by using quanti-fiers such as highest, lowest, most, etc. For instance, in caseof Texas, this sentence reads: “Texas experienced the highestnumber of deaths (184) and highest number of storms (3,621)among all states of the United States of America.” The nextsentence states the statistical ranking of the selected subregionwith respect to the focus variable. The last sentence providesa comparison of the selected subregion with its neighboring re-gions for highlighting similar or dissimilar statistics. For in-stance, the state of Utah is the only state among its neighboringstates that does not report any casualties.

Besides the explanations on one subregion, it is also possi-ble to compare any two subregions by simultaneously selectingthem. Here, the generated text consists of a single sentence thatcontrasts both regions based on the values of both variables. Forinstance, Figure 7 and Figure 8 presents two different instancesof comparison texts.

7


6. Results

We present a number of examples to demonstrate the use-fulness of our approach and support our claims that the iMR(i) detects outliers, regional differences, and prominent patternsreliably for various datasets, (ii) produces meaningful textualdescriptions about the analysis results, and (iii) adapts to dif-ferent variable types and different levels of subregional gran-ularity. In addition to the examples presented in this section,readers can explore more examples by running the iMR systemin any modern Web browser.

In what follows, we demonstrate map reports for three dif-ferent Regions: world (Figure 7 and 8), continent (Figure 6),and country (Figure 1) and two different Subregional Levels:countries and states. Table 2 shows the values of the user-defined parameters for the examples. At world level, the re-port describes the group of countries showing distinct behavior.For instance, the European countries have higher numbers ofInternet users and lower adolescent birth rates in comparison tothe rest of the world. On the continent level, Figure 6 reportsthe differences among various parts of Europe—countries in theSouthern Europe have better average life expectancies despitespending less on health. At the country level, in addition to de-scribing the differences across various states of the country, thereport also highlights the states showing dissimilar behavior incontrast to their adjacent states. For instance, Figure 1 showsthat the states Missouri, Nevada, and Wisconsin suffered a lotmore deaths than their neighbors.

To show the adaptability of the generated text to varioussituations, we showcase examples for each type of situation.Figure 7 highlights the relationship between adolescent birthrates—annual number of live births per women aged 15–19,and number of people who have access to the Internet. Theadolescent birth rate (focus) is neither clearly positive nor nega-tive, this report is generated according to the neutral situation.

The map reports shown in Figure 1 and Figure 8 are pro-duced with the negative value of Situation. For instance, thephrase “suffered a lot more casualties” (Figure 1). Althoughboth examples share the same value of situation, the narrativediffers considerably based on the variable types and the pres-ence of correlations. The former example highlights the pres-ence of a positive correlation among Southern states while therewas not considerable correlation for the entire USA. In contrast,there is no paragraph about correlations between variables inFigure 8 as the value is not large enough.

Figure 6 presents the average life expectancy and the moneyper household spent on health. Here, higher values of life ex-pectancy are favorable so the situation is positive. The phrase“better is the life expectancy”, reflects the positive characterof the situation while describing the higher values of the focusvariable.

Mostly, the choice of quantifiers and verbs depends on thevariable type. For instance, Figure 1 uses the casualties as thetype of the focus variable and hence the phrase “number of”is used. Similar is the case with variable type percentage inFigure 7 (“percentage of Internet users”) and Figure 8 (“per-centage of obese people”). Referring to Figure 6, the choice

Figure 6. An interactive map report describing average life expectancy andhealth expenditure across Europe in 2018.

Figure 7. An interactive map report showing the possible relationship betweenadolescent birth rates and percentage of Internet users across countries of theworld in 2015.

of quantifier “values of” for the variable health expenditure isbased on the variable type monetary. However, the variabletype demographic-indicators does not need any phrase as seenin the examples of Figure 6 and Figure 7 (“African countriesshowed higher adolescent birth rates”). In Figure 1, the verbs“suffered”, “experienced”, and “faced” correspond to FocusType← casualties.

7. Discussion and Conclusion

We demonstrated an approach to create bivariate geo-statistical data analysis reports consisting of generatednarrative accompanying a map visualization. The reports guidethrough the analysis results and provide additional explanationsfor interpreting the data. Through a number of examples, wehave shown the flexibility of our approach and that it producesmeaningful interactive reports for different variable types,geographic regions, and scenarios.

The scope of our work is limited to bivariate geo-statistical(i.e., numeric) data where one variable potentially influences

8


Figure 8. An interactive map report describing the percentage of obese peopleand alcohol consumption in the world during 2010.

the other. While we chose a specific geographic visualization toencode the bivariate data, it would be relatively easy to replacethe visualization by a different one or even make it customiz-able. An option to achieve it, could be the use of comprehensivedeclarative model for producing visualizations as described byJo et al. [40]. We have coverage for a number of variable types,but cannot claim that every variable can be classified as one ofthe mentioned categories. However, the generic category quan-titative results in a less tailored but still meaningful narrative.Interesting future work includes extending the approach to cat-egorical variables, multivariate data (i.e., more than two vari-ables), and spatiotemporal information. While the first exten-sion is estimated to require only smaller changes, the two latterscenarios will likely require substantially different data analy-sis, narration, and visualization techniques. In contrast to mostprevious systems that generate combined visual and textual de-scriptions, our focus is comparatively broad and covers differentscenarios. The use of a small set of parameters gives sufficientflexibility to tailor the same text generation process for differentdatasets. Hence, our approach can be considered as lying be-tween the fully automated text generation systems (which aretailored to a narrow scenario) and tools that allow for build-ing fully customizable templates. However, our approach doesnot support manual refining or extending the reports beyond theconfigurations that can be specified through the parameters.

Since our map reports contain text and visualizations, astwo different parts of the report, one might argue that this intro-duces split-attention effect. However, this is true for any rep-resentation of data that includes multiple views. The availableinteractions in our reports counterbalance this effect by provid-ing an easier and quicker way of cross-referencing the two rep-resentations. Furthermore, the use of word-sized graphics alsohelps better integrating textual and visual information [32].

One problem with the auto-generated reports like ours isthat the possibility of incorrect information being generatedcannot be excluded. One might argue that the said problem en-dangers visualizations to some extent as well especially when

the mapping between data and visual elements is complex.However, the problem is more severe with auto-generated textas it is more explicit. Like for any complex software system,one possible countermeasure is to apply thorough testing.

At the moment our reports are generated based on pre-defined settings and do not take into account user interactionsin the generation process. It would be interesting to considerexploration histories, and even relative user characteristics (asfound in Toker et al.’s work [41]). In addition to the on-demandtextual explanations, the use of summary visualizations asdiscussed by Monmonier [9] to reveal relationships amongvariables on a sub-selection of regions can further enrich thedocuments.

Although it is obvious that both visual and textual datadescriptions have their advantages, it remains a largely openresearch question which information is better represented inwhich modality. Some existing works have already providedevidence that a bimodal (i.e., text and visualization) represen-tation can be beneficial for understanding and interpreting theinformation. Gkatzia et al. [42] demonstrate better decisionmaking under uncertainty using a task-based study. Similarly,Sripada and Gao [23] claim that divers find the bimodalrepresentation more comprehensive while judging the safetyof a deep dive. However, their results are based on specificdatasets and cannot be generalized. Our approach assumesthat a bivariate map visualization is given; the narration onlyaccompanies the visual representation, but is not consideredto live without the visualization. As a next step, it would beinteresting to perform user studies to investigate if especiallyusers rather inexperienced with visualization will profit fromthe additional text as expected.

Acknowledgments

Fabian Beck is indebted to the Baden-Wurttemberg Stiftungfor the financial support of this research project within the Post-doctoral Fellowship for Leading Early Career Researchers.

References

[1] C. A. Brewer, A. M. MacEachren, L. W. Pickle, D. Herrmann, Mappingmortality: Evaluating color schemes for choropleth maps, Annals of theAssociation of American Geographers 87 (3) (1997) 411–438. doi:

10.1111/1467-8306.00061.[2] J. J. Flannery, The relative effectiveness of some common graduated point

symbols in the presentation of quantitative data, Cartographica: The Inter-national Journal for Geographic Information and Geovisualization 8 (2)(1971) 96–109. doi:10.3138/J647-1776-745H-3667.

[3] D. Howard, A. M. MacEachren, Interface design for geographic vi-sualization: Tools for representing reliability, Cartography and Geo-graphic Information Systems 23 (2) (1996) 59–77. doi:10.1559/

152304096782562109.[4] C. Brewer, A. J. Campbell, Beyond graduated circles: varied point sym-

bols for representing quantitative data on maps, Cartographic Perspec-tives (29) (1998) 6–25. doi:10.14714/CP29.672.

[5] S. Kim, R. Maciejewski, A. Malik, Y. Jang, D. S. Ebert, T. Isenberg,Bristle Maps: A multivariate abstraction technique for geovisualization,IEEE Transactions on Visualization and Computer Graphics 19 (9) (2013)1438–1454. doi:10.1109/TVCG.2013.66.

[6] M. E. Elmer, Symbol considerations for bivariate thematic mapping,Ph.D. thesis, University of Wisconsin–Madison (2012).

9

http://dx.doi.org/10.1111/1467-8306.00061

http://dx.doi.org/10.1111/1467-8306.00061

http://dx.doi.org/10.3138/J647-1776-745H-3667

http://dx.doi.org/10.1559/152304096782562109

http://dx.doi.org/10.1559/152304096782562109

http://dx.doi.org/10.14714/CP29.672

http://dx.doi.org/10.1109/TVCG.2013.66


[7] E. Nelson, The impact of bivariate symbol design on task performance ina map setting, Cartographica: The International Journal for GeographicInformation and Geovisualization 37 (4) (2000) 61–78. doi:10.3138/

V743-K505-5510-66Q5.[8] A. M. MacEachren, How maps work: representation, visualization, and

design, Guilford Press, 2004.[9] M. Monmonier, Strategies for the visualization of geographic time-series

data, Cartographica: The International Journal for Geographic Infor-mation and Geovisualization 27 (1) (1990) 30–45. doi:10.3138/

U558-H737-6577-8U31.[10] K. V. Deemter, M. Theune, E. Krahmer, Real versus template-based nat-

ural language generation: A false opposition?, Computational Linguistics31 (1) (2005) 15–24. doi:10.1162/0891201053630291.

[11] E. Reiter, R. Dale, Z. Feng, Building natural language generation systems,MIT Press, 2000.

[12] A. Gatt, E. Krahmer, Survey of the state of the art in natural languagegeneration: Core tasks, applications and evaluation, Journal of ArtificialIntelligence Research 61 (2018) 65–170. doi:10.1613/jair.5477.

[13] R. Dale, S. Geldof, J.-P. Prost, Using natural language generation in auto-matic route description, Journal of Research and practice in InformationTechnology 37 (1) (2005) 89.

[14] A. Ramos-Soto, A. J. Bugarin, S. Barro, J. Taboada, Linguistic descrip-tions for automatic generation of textual short-term weather forecasts onreal prediction data, IEEE Transactions on Fuzzy Systems 23 (1) (2015)44–57. doi:10.1109/TFUZZ.2014.2328011.

[15] K. Thomas, S. Sripada, Atlas.txt: Linking geo-referenced data to text forNLG, in: Proceedings of the Eleventh European Workshop on NaturalLanguage Generation, ENLG ’07, 2007, pp. 163–166.

[16] R. Turner, S. Sripada, E. Reiter, I. P. Davy, Using spatial reference framesto generate grounded textual summaries of georeferenced data, in: Pro-ceedings of the Fifth International Natural Language Generation Confer-ence, INLG ’08, Association for Computational Linguistics, Stroudsburg,PA, USA, 2008, pp. 16–24. doi:10.3115/1708322.1708328.

[17] D. Braun, E. Reiter, A. Siddharthan, SaferDrive: An NLG-based be-haviour change support system for drivers, Natural Language Engineering(2018) 1–38doi:10.1017/S1351324918000050.

[18] M. Molina, J. Sanchez-Soriano, O. Corcho, Using open geographicdata to generate natural language descriptions for hydrological sen-sor networks, Sensors 15 (7) (2015) 16009–16026. doi:10.3390/

s150716009.[19] W. Wahlster, E. Andre, W. Finkler, H.-J. Profitlich, T. Rist, Plan-based

integration of natural language and graphics generation, Artificial In-telligence 63 (1-2) (1993) 387–427. doi:10.1016/0004-3702(93)

90022-4.[20] A. Jain, J. M. Keller, Textual summarization of events leading to health

alerts, in: Engineering in Medicine and Biology Society, 37th AnnualInternational Conference of the IEEE, EMBC ’15, 2015, pp. 7634–7637.doi:10.1109/EMBC.2015.7320160.

[21] J. Hunter, A. Gatt, F. Portet, E. Reiter, S. Sripada, Using naturallanguage generation technology to improve information flows in in-tensive care units, in: ECAI, 2008, pp. 678–682. doi:10.3233/

978-1-58603-891-5-678.[22] A. Ramos-Soto, B. Vazquez-Barreiros, A. Bugarın, A. Gewerc, S. Barro,

Evaluation of a data-to-text system for verbalizing a learning analyticsdashboard, International Journal of Intelligent Systems 32 (2) (2017)177–193. doi:10.1002/int.21835.

[23] S. G. Sripada, F. Gao, Summarizing dive computer data: A case studyin integrating textual and graphical presentations of numerical data, in:Workshop on Multimodal Output Generation, MOG ’07, 2007, p. 149.

[24] F. Beck, H. A. Siddiqui, A. Bergel, D. Weiskopf, Method Execution Re-ports: Generating text and visualization to describe program behavior, in:Proceedings of the 5th IEEE Working Conference on Software Visualiza-tion, IEEE, 2017, pp. 1–10. doi:10.1109/VISSOFT.2017.11.

[25] V. O. Mittal, S. F. Roth, J. D. Moore, J. Mattis, G. Carenini, Generatingexplanatory captions for information graphics, in: Proceedings of the 14thInternational Joint Conference on Artificial Intelligence, IJCAI, MorganKaufmann Publishers Inc., 1995, pp. 1276–1283.

[26] S. Demir, S. Carberry, K. F. McCoy, Summarizing information graphicstextually, Computational Linguistics 38 (3) (2012) 527–574. doi:10.

1162/COLI_a_00091.

[27] J. Hullman, N. Diakopoulos, E. Adar, Contextifier: automatic generationof annotated stock visualizations, in: Proceedings of the SIGCHI Con-ference on Human Factors in Computing Systems, CHI, ACM, 2013, pp.2707–2716. doi:10.1145/2470654.2481374.

[28] A. Srinivasan, S. M. Drucker, A. Endert, J. Stasko, Augmenting visu-alizations with interactive data facts to facilitate interpretation and com-munication, IEEE Transactions on Visualization and Computer Graphics25 (1) (2019) 672–681. doi:10.1109/TVCG.2018.2865145.

[29] B. C. Kwon, F. Stoffel, D. Jckle, B. Lee, D. Keim, VisJockey: EnrichingData Stories through Orchestrated Interactive Visualization, in: Compu-tation+Journalism Symposium 2014, 2014.

[30] S. Latif, F. Beck, VIS Author Profiles: Interactive descriptions of publi-cation records combining text and visualization, IEEE Transactions onVisualization and Computer Graphics 25 (1) (2019) 152–161. doi:

10.1109/TVCG.2018.2865022.[31] E. R. Tufte, Beautiful Evidence, 1st Edition, Graphics Press, 2006.[32] F. Beck, D. Weiskopf, Word-sized graphics for scientific texts, IEEE

Transactions on Visualization and Computer Graphics 23 (6) (2017)1576–1587. doi:10.1109/TVCG.2017.2674958.

[33] P. Goffin, W. Willett, J.-D. Fekete, P. Isenberg, Exploring the place-ment and design of word-scale visualizations, IEEE Transactions on Vi-sualization and Computer Graphics 20 (12) (2014) 2291–2300. doi:

10.1109/TVCG.2014.2346435.[34] J. W. Tukey, Exploratory Data Analysis, Vol. 2, Reading, Mass., 1977.[35] D. C. Hoaglin, F. Mosteller, J. W. Tukey, Understanding Robust and

Exploratory Data Analysis, Vol. 1, Wiley Classic Library, 2000.[36] P. J. Rousseeuw, I. Ruts, J. W. Tukey, The bagplot: A bivariate box-

plot, The American Statistician 53 (4) (1999) 382–387. doi:10.1080/

00031305.1999.10474494.[37] W. Hardle, L. Simar, Applied multivariate statistical analysis, Vol. 22007,

Springer, 2007.[38] United Nations, Statistical Division, Countries or areas / geographical

regions, https://unstats.un.org/unsd/methodology/m49.[39] R. B. Dean, W. Dixon, Simplified statistics for small numbers of obser-

vations, Analytical Chemistry 23 (4) (1951) 636–638. doi:10.1021/

ac60052a025.[40] J. Jo, F. Vernier, P. Dragicevic, J.-D. Fekete, A declarative rendering

model for multiclass density maps, IEEE Transactions on Visualizationand Computer Graphics 25 (1) (2019) 470–480. doi:10.1109/TVCG.

2018.2865141.[41] D. Toker, C. Conati, G. Carenini, User-adaptive support for processing

magazine style narrative visualizations: Identifying user characteristicsthat matter, in: 23rd International Conference on Intelligent User Inter-faces, ACM, 2018, pp. 199–204. doi:10.1145/3172944.3173009.

[42] D. Gkatzia, O. Lemon, V. Rieser, Data-to-text generation improvesdecision-making under uncertainty, IEEE Computational IntelligenceMagazine 12 (3) (2017) 10–17. doi:10.1109/MCI.2017.2708998.

10

http://dx.doi.org/10.3138/V743-K505-5510-66Q5

http://dx.doi.org/10.3138/V743-K505-5510-66Q5

http://dx.doi.org/10.3138/U558-H737-6577-8U31

http://dx.doi.org/10.3138/U558-H737-6577-8U31

http://dx.doi.org/10.1162/0891201053630291

http://dx.doi.org/10.1613/jair.5477

http://dx.doi.org/10.1109/TFUZZ.2014.2328011

http://dx.doi.org/10.3115/1708322.1708328

http://dx.doi.org/10.1017/S1351324918000050

http://dx.doi.org/10.3390/s150716009

http://dx.doi.org/10.3390/s150716009

http://dx.doi.org/10.1016/0004-3702(93)90022-4

http://dx.doi.org/10.1016/0004-3702(93)90022-4

http://dx.doi.org/10.1109/EMBC.2015.7320160

http://dx.doi.org/10.3233/978-1-58603-891-5-678

http://dx.doi.org/10.3233/978-1-58603-891-5-678

http://dx.doi.org/10.1002/int.21835

http://dx.doi.org/10.1109/VISSOFT.2017.11

http://dx.doi.org/10.1162/COLI_a_00091

http://dx.doi.org/10.1162/COLI_a_00091

http://dx.doi.org/10.1145/2470654.2481374







http://dx.doi.org/10.1080/00031305.1999.10474494

http://dx.doi.org/10.1080/00031305.1999.10474494

https://unstats.un.org/unsd/methodology/m49

http://dx.doi.org/10.1021/ac60052a025

http://dx.doi.org/10.1021/ac60052a025



http://dx.doi.org/10.1145/3172944.3173009

http://dx.doi.org/10.1109/MCI.2017.2708998

interactive map reports summarizing bivariate geographic data · interactive map reports...

Documents