Visualising complex linked data · 2015-11-01

AUSTRALIAN NATIONAL UNIVERSITY

Canberra, ACT

Visualising complex linked data

A thesis submitted in partial fulfilment

of the requirement for the course of

Computing Project (COMP 8715)

by

Quanwei Han

under the guidance of

Assoc. Prof. Peter Christen and Mr. Jeffrey Fisher

Table of contents

Abstract

1. Introduction

1.1 Glossary

2. Data collection

2.1 Background

2.2 Data description

3. Data visualisation

3.1 Single life segment visualisation

3.2 Multiple life segment visualisation

3.3 Data inconsistencies visualisation

3.4 Supplementary functions

3.5 Flexibility and reusability issues

4. Evaluation

5. Conclusion

Acknowledgement

References

Abstract

Data linkage and data visualisation are both widely used in data analysis and

presentation. However, very little effort has been devoted to visualising complex

linked data. This thesis takes a data set of linked demographic data from the Isle of

Skye in Scotland and describes a visualisation technique for communicating linked

data to users in a way that shows the characteristics of linked data and potentially wrong

or inconsistent links together in one figure. The technique is implemented as a

Python-based program, which is flexible and configurable and thus can be easily

applied to other demographic datasets. The resulting prototype has been evaluated by

data linkage researchers and was considered an effective tool to illustrate linked

demographic data.

1. Introduction

As a mature technique, data linkage has been increasingly relied on by researchers,

government agencies and businesses to integrate and analyse their data. It helps

increase the integration and quality of available data, and allows data mining to be

applied to multiple databases (Christen & Goiser, 2007). Data visualisation is also

widely used by many industries for communicating information to users efficiently

and intuitively (Ward, Grinstein, & Keim, 2010), as well as finding hidden patterns

and trends (Keim, 2002). However, little research has been done on applying visualisation

to complex linked databases. Visualisation of linked data is important because it

illustrates the resulting data in a straightforward way. In addition, it accelerates and

facilitates the process of identifying potentially wrong or inconsistent links (Appendix

A).

This project aims to develop innovative network-based techniques for visualising

complex linked data. The data used for visualisation is a demographic dataset which

links historical birth, death, marriage and census records from the Isle of Skye in

Scotland. These visualisation techniques are implemented as a Python prototype

program that generates different visualisations of the Isle of Skye historical data. This

program is flexible and configurable so that it can be applied to other linked

demographic datasets (see Appendix A for details).

During the last couple of decades, many visualisation techniques have been applied to

reflect the quality of data linkage strategies and they played an important role in

evaluating data linkage techniques (Abecasis & Cookson, 2000; Wigginton &

Abecasis, 2005; Christen & Goiser, 2007). Whereas almost all of these visualisations

focus on visualising the quality measures used for data linkage, like accuracy and

precision, hardly any of them try to visualise the linked data themselves. In other

words, these visualisation techniques can be used to assess data linkage results but

they do not work for identifying potentially wrong links. This project researches novel

visualisation methods for linked data, which will enable researchers to identify

potentially wrong links, and look into them by checking relevant characteristics of the

linked data.

The main challenge in this project is how to visualise the characteristics of linked data

and data inconsistencies together in one figure. They need to be shown concurrently

because they are complementary to each other when identifying potentially wrong

links. Potentially wrong links must be validated by checking the characteristics of

relevant linked records, and the volume of linked data is often so large that it is

impossible to traverse it without hints about potentially wrong links. These two

kinds of elements should be clearly distinguishable so that researchers can easily

focus on one of them from different perspectives. In this project, a solution has been

found for this problem, which is discussed in Section 3.3.

Much research has been conducted on visualising demographic data (Wang,

Ibarra, Adnan, Longley, & Maciejewski, 2014), and the resulting techniques are

considerably rich (Andreev, 2000). One possible solution for this project is making

use of these techniques and incorporating the visualisation of data inconsistencies into

them. The problem here is that such demographic visualisations are designed only to

reflect the characteristics of demographic data. They tend to encode all the necessary

and useful information so only little space is left for extra information. In this project,

two visualisations for demographic data are developed, and they are succinct and

compatible with the visualisation of data inconsistencies.

The major contribution of this project is a technique for visualising complex linked

data. We demonstrate this technique using a demographic linked dataset as an

example, and evaluate how effective and intuitive it is. Though it is not a standardised

visualisation technique which works with all kinds of datasets, it is still a useful

reference for future projects. Another contribution is a set of concise and

clear visualisations of demographic data, which are implemented as a Python-based

program. This program is flexible and configurable, so it can easily be

applied to other similar datasets.

1.1 Glossary

This thesis involves many domains. To avoid any ambiguity, we provide a glossary

here. Terms are emphasised in bold. These terms are to be used throughout this thesis.

Data linkage: a technique used to create links between records which represent the

same entity. If two records which represent two different entities are linked together,

that is a wrong link. The information in two wrongly linked records usually conflicts;

such a conflict is a data inconsistency, which can be used to detect potentially wrong

links.

Data visualisation: a technique used to communicate information in a graphical

format. The term visualisation can also refer to a model which visualises data in a

specific way. In this project, data are presented in the format of a figure.

Life segment: An important part of this work is the visualisation of individuals' life

histories. However, an individual's life history may be divided into multiple segments

due to missed links or the individual's absence from the observed area. Thus, strictly speaking,

it is the life segments that are visualised. For the purpose of comprehensibility, when

describing life segment visualisation, we still refer to a life segment as an individual. We

use Lidlife_segment_id to represent the life segment whose id is life_segment_id,

e.g. Lid1559 represents the life segment whose id is 1559.

Event: an event that happened in an individual's life. Each event corresponds to a civil

registration record or a census record. It is represented by an event object in the data

interface, and represented by a circle or wedge (when multiple events share a circle)

in the visualisation. These circles are concatenated by a lifeline in chronological

order.

Related person: a person who has a relationship to the individual we are focusing on.

It is represented by a related person object in the data interface, and represented by a

square in the visualisation.

2. Data collection

2.1 Background

The data used in this program come from a project conducted by Alice Reid, Ros

Davies and Eilidh Garrett, which linked 19th-century census data and civil registration

data from the Isle of Skye in Scotland to form a dynamic model showing how the

family structures changed and how the population migrated (Reid, Davies, & Garrett,

2002). Before the data linkage process, the census records and civil registration records

are discrete data points. They are only useful for gathering population statistics. In

contrast, the data linkage process concatenates these data points into data lines and

reveals the connections such as causation between data points. Moreover, these data

lines together with relationships between individuals, which are embedded in the raw

data, form a data network where more hidden patterns and trends can be tapped.

When applying data linkage on demographic data, population mobility is an important

factor which may affect the quality of the data linkage result. If a location has a

transient population, it is unlikely that the data linkage result will be fruitful. Only

part of these transient individuals' demographic data will be recorded in this place,

which will result in many incomplete data lines in the data linkage result. Thus, an

island like the Isle of Skye, which has a natural boundary to impede migration, makes a

logical choice for a small linkage project (Reid, Davies, & Garrett, 2002).

Alice Reid, Ros Davies and Eilidh Garrett (2002) chose a 'sets of related individuals'

approach to perform data linkage. With this approach, the census records and civil

registration records belonging to one individual are linked under the context of family.

Specifically, it identifies as many families from different datasets as possible and then

matches the remaining individuals. According to the authors' analysis, this approach

leads to more robust results than linkage at a purely individual level because the chance

of two families sharing the same name and structure is much less than that of two

individuals, especially when the name pool is quite small.

2.2 Data description

The linked data from the work described in Section 2.1 (Reid, Davies, & Garrett, 2002)

is the dataset used in this project. It includes nine tables: longitudinal, birth, marriage,

death, and five census tables, each corresponding to one of the censuses in 1861, 1871,

1881, 1891 and 1901. Among them, only the longitudinal table was created during the

data linkage process. It acts as a "hub" which links all the other tables together. Each

record in it represents a life segment that consists of a set of identifiers corresponding

to census and vital event records.

Here are some important characteristics of the longitudinal table:

• Identifiers for all birth, marriage, death and census records are included in the table.

• All finished links are reflected in the table.

• Each birth ID only appears once, as an individual can only be born once.

• Each death ID only appears once, as an individual can only die once.

• Each census ID for each year only appears once, as an individual can only be recorded in each census once.

• Each marriage ID must appear twice, once for the bride and once for the groom, but each individual can marry multiple times. In this dataset, the maximum number of marriages for one individual is four.
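The uniqueness properties listed above can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the function name `check_longitudinal_ids` and the representation of rows as dictionaries are hypothetical and not taken from the thesis program. Because some marriages may be linked for only one spouse, the sketch only reports marriage IDs that appear more than twice.

```python
from collections import Counter


def check_longitudinal_ids(rows):
    """Return (column, id, count) tuples that break the ID rules listed above.

    `rows` is a list of dicts keyed by column name; a missing key or None
    stands for a null value (hypothetical representation).
    """
    problems = []
    # A birth or death ID may be held by at most one life segment.
    for column in ("BirthID", "DeathID"):
        counts = Counter(r[column] for r in rows if r.get(column) is not None)
        problems += [(column, i, n) for i, n in counts.items() if n > 1]
    # A marriage ID is shared by bride and groom, so at most two life
    # segments may hold it, whichever marriageID column it sits in.
    marriage_columns = ("marriageID1", "marriageID2", "marriageID3", "marriageID4")
    counts = Counter(
        r[c] for r in rows for c in marriage_columns if r.get(c) is not None
    )
    problems += [("marriageID", i, n) for i, n in counts.items() if n > 2]
    return problems


# Four of the sample life segments from Figure 2.2.
sample_rows = [
    {"lifesegmentID": 1559, "BirthID": 1673, "DeathID": 684, "marriageID1": 86},
    {"lifesegmentID": 88, "BirthID": 140, "marriageID1": 86},
    {"lifesegmentID": 17802, "marriageID1": 16},
    {"lifesegmentID": 51617, "DeathID": 1390, "marriageID1": 16},
]
print(check_longitudinal_ids(sample_rows))
```

An empty result means that none of the checked rules is violated by the given rows.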

Figure 2.1 shows all columns in the longitudinal table together with comments and

statistics about the columns, and Figure 2.2 gives a sample set of data from the

longitudinal table.

| Column name | Comment | Number of records with value | Number of records with value missing | Number of unique values |
|---|---|---|---|---|
| lifesegmentID | Unique ID for life segment | 54,537 | 0 | 54,537 |
| sex | Sex of the life segment | 54,261 | 276 | 2 |
| BirthID | ID from birth table, refers to the individual's own birth record | 17,614 | 36,923 | 17,614 |
| sibsetID | Indicates the sibset group in which individual is a sibling | 16,713 | 37,824 | 4,326 |
| parentmarriageID | ID from marriage table, refers to the individual's parents' marriage record | 10,571 | 43,966 | 2,134 |
| DeathID | ID from death table, refers to the individual's own death record | 12,285 | 42,252 | 12,285 |
| marriageID1 | ID from marriage table, refers to the individual's first marriage record | 5,237 | 49,300 | 2,666 |
| marriageID2 | ID from marriage table, refers to the individual's second marriage record | 96 | 54,441 | 94 |
| marriageID3 | ID from marriage table, refers to the individual's third marriage record | 2 | 54,535 | 2 |
| marriageID4 | ID from marriage table, refers to the individual's fourth marriage record | 1 | 54,536 | 1 |
| 61pid | Person ID and Scheme (household) ID from 1861 census table, together refer to the individual's person record in 1861 census | 19,604 | 34,933 | 19,604 |
| 61sch | (shared with 61pid) | 19,604 | 34,933 | 4,078 |
| 71pid | As for 61pid and 61sch, for the 1871 census | 18,101 | 36,436 | 18,101 |
| 71sch | (shared with 71pid) | 18,101 | 36,436 | 3,773 |
| 81pid | As for 61pid and 61sch, for the 1881 census | 17,684 | 36,853 | 4,305 |
| 81sch | (shared with 81pid) | 17,684 | 36,853 | 3,796 |
| 91pid | As for 61pid and 61sch, for the 1891 census | 16,476 | 38,061 | 3,933 |
| 91sch | (shared with 91pid) | 16,476 | 38,061 | 3,664 |
| 01pid | As for 61pid and 61sch, for the 1901 census | 14,609 | 39,928 | 14,609 |
| 01sch | (shared with 01pid) | 14,609 | 39,928 | 3,385 |

Figure 2.1: Columns in the longitudinal table

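| lifesegmentID | sex | BirthID | sibsetID | parentmarriageID | DeathID | marriageID1 | 61pid | 61sch | 71pid | 71sch | 81pid | 81sch | 91pid | 91sch | 01pid | 01sch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 88 | m | 140 | 1474 | 23 | | 86 | | | 74 | 7101015 | 250 | 8103005 | 98 | 9102007 | | |
| 617 | m | 673 | 1473 | 86 | | | | | | | | | 100 | 9102007 | 222 | 102029 |
| 641 | f | 697 | 1473 | 86 | | | | | | | | | 101 | 9102007 | 223 | 102029 |
| 667 | m | 723 | 1473 | 86 | 648 | | | | | | | | 102 | 9102007 | | |
| 700 | f | 756 | 1473 | 86 | | | | | | | | | | | 224 | 102029 |
| 741 | m | 798 | 1473 | 86 | | | | | | | | | | | 225 | 102029 |
| 1559 | f | 1673 | 3251 | 16 | 684 | 86 | | | 3330 | 7207021 | 154 | 8102012 | 99 | 9102007 | | |
| 17802 | f | | | | | 16 | 189 | 6102011 | 3328 | 7207021 | 152 | 8102012 | 103 | 9102007 | 221 | 102029 |
| 51617 | m | | | | 1390 | 16 | | | | | | | | | | |

Figure 2.2: Sample rows from the longitudinal table. Note: blank cells represent null values. The columns marriageID2, marriageID3 and

marriageID4 are hidden because all their values for the given rows are null.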

Birth records

| ID | Year |
|---|---|
| 140 | 1864 |
| 673 | 1888 |
| 697 | 1889 |
| 723 | 1891 |
| 756 | 1893 |
| 798 | 1895 |
| 1673 | 1867 |

Death records

| ID | Year |
|---|---|
| 648 | 1894 |
| 684 | 1896 |
| 1390 | 1869 |

Marriage records

| ID | Year |
|---|---|
| 16 | 1862 |
| 23 | 1863 |
| 86 | 1887 |

Figure 2.3: Sample partial records of vital events. Note: all identifiers here come from Figure 2.2

All other tables are from raw data, which means that they comprise the original

columns as well as those added during the data linkage process. Due to confidentiality

requirements, these tables are inaccessible for this project, except identifiers and

occurrence years for vital event records (census years are already reflected in the

longitudinal table). However, this information is enough to support this project.

Figure 2.4 contains statistics about vital event records, and Figure 2.3 on the previous

page gives some sample records of vital events. Note that all identifiers appearing in

Figure 2.3 are for life segments shown in Figure 2.2. The records in Figures 2.2 and

2.3 will be used to assist in the demonstration of data visualisation.

| Event type | Number of records | Number of records with year missing | Number of unique years | Minimum year | Maximum year |
|---|---|---|---|---|---|
| birth | 17,614 | 0 | 41 | 1861 | 1901 |
| marriage | 2,668 | 0 | 41 | 1861 | 1901 |
| death | 12,279 | 0 | 41 | 1861 | 1901 |

Figure 2.4: Statistics about vital event records

3. Data visualisation

As mentioned in the introduction, the aim of this project is to develop a novel approach

to visualise the characteristics of linked demographic data as well as data

inconsistencies to help identify the potentially wrong links generated during the data

linkage process (Appendix A). Because the latter feature requires some

domain-specific knowledge and is more subject to adjustment, we decided to design it

first and then incorporate the former feature into it.

When showing the characteristics of demographic data, two aspects are important to

reflect one individual: the individual's life experience and family relationships.

According to common practice, life experiences are usually drawn in a lifeline which

concatenates meaningful events that happened in an individual's life, and

relationships are also often displayed as a line between two objects. In this project, we

do not intend to break either of these conventions, so different orientations are used to

distinguish them: lifelines are drawn horizontally, and relationship lines are placed

vertically. Thus, a conceptual graph is formed which is shown in Figure 3.1.

Figure 3.1: Conceptual graph which illustrates demographic data

Based on the conceptual graph, two models are created to visualise linked

demographic data: single life segment visualisation and multiple life segment

visualisation. They are described at length respectively in the following two sections.

Although the feature of showing the characteristics of linked data and the feature of

highlighting data inconsistencies are complementary for the purpose of

identifying and correcting such data inconsistencies, users should not be confused when

they are shown together in one figure. Thus, both features should be clearly

distinguishable so that users can easily focus on one of them from different perspectives.

The detailed designs are described in Section 3.3.

Some additional functions were added to equip the program with greater interactivity,

in order to facilitate ease of use as well as provide more information. These functions

are presented in Section 3.4.

The visualisation techniques used in this project are developed based on the linked data

sets from the Isle of Skye in Scotland, as described in Section 2.1. However, this

program should be flexible and configurable so that it can be reused for other data sets.

Data interfaces and configuration files were defined to satisfy this demand, and Section

3.5 describes them in detail.

3.1 Single life segment visualisation

As the name suggests, the single life segment model is used to reflect the life and family

relationships of an individual. The first conceptual graph in Figure 3.1 at the beginning

of this chapter shows that an individual is represented by a horizontal lifeline. Related

persons are connected to this lifeline by vertical lines. Both events and related persons

are represented by nodes. For the purpose of differentiation, we use circles to represent

events and squares for related persons.

In order to provide extra information in an intuitive way, year of event is set as the

x-axis and year of birth of a person is set as the y-axis. Thus, when a user checks an

individual's lifeline from left to right, the sequence of events and time intervals

between them are shown visually. At the same time, distance on the y-axis between

different individuals describes the age distribution of a family, e.g. age gaps between

different generations.
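The axis convention can be illustrated with a short Python sketch. It is only an illustration under assumptions: the function names are hypothetical, and placing a related person's square at the x-coordinate of the linking event is inferred from the vertical relationship lines described above.

```python
def lifeline_points(birth_year, event_years):
    """Positions of an individual's events: x = event year, y = own birth year."""
    return [(year, birth_year) for year in sorted(event_years)]


def related_square(link_event_year, related_birth_year):
    """Position of a related person's square (assumed convention).

    x is the year of the event that creates the relationship, so the
    relationship line is vertical; y is the related person's birth year.
    """
    return (link_event_year, related_birth_year)


# Lid1559 was born in 1867, married in 1887 and died in 1896 (Figures 2.2 and 2.3).
print(lifeline_points(1867, [1887, 1867, 1896]))
# Her husband Lid88 was born in 1864 and is linked through the 1887 marriage.
print(related_square(1887, 1864))
```

With this convention, every event of one individual lies on the same horizontal line, and the vertical distance between two persons equals the difference between their birth years.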

Based on the demographic data, important events include birth, marriage, census and

death. Some other events can also be derived; for example, the birth of a baby also

means a 'birth of child' event for the parents. In the visualisation, different colours are

used to indicate different types of events (green: birth, gray: census, blue: marriage,

yellow: birth of child, red: death).

Besides the function of describing an individual's lifeline, these events are important to

link people with their related persons. A wedding (technically, a marriage registration)

creates a spousal relationship between the bride and the groom, and the birth of a baby

builds a parent-child relationship between the parents and the baby. In this project,

squares share the same colours with correlated circles. Different types of circles and

squares are illustrated in Figure 3.2.

Figure 3.2: Different types of circles and squares in this project

One problem here is that there might be more than one event in the same year; for

instance, a newly married couple may welcome their first child in the same year they

marry. Under these circumstances, the circles that represent these events share the

same x-coordinate; in other words, these circles will cover each other.

This problem can be solved by stacking circles vertically, rather than stringing them strictly on

the lifeline (shown on the left of Figure 3.3). However, in practice, this method can

cause another issue: the stacked circles usually overlap the squares which represent

spouses (generally the age gap between bride and groom is not too large and the

y-coordinate of a lifeline or a square is the birth year of the corresponding person). The

final solution is to let multiple events share a circle if they happen in the same year as

shown on the right of Figure 3.3. In practice, this approach performs well.

Figure 3.3: Different approaches to show three events (marriage, birth of child, and

census) which happened in the same year
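The shared-circle approach can be sketched in Python as follows. This is a hedged illustration rather than the thesis implementation: the function name `wedges_by_year` is hypothetical, and dividing the circle into equally sized wedges is an assumption.

```python
from collections import defaultdict


def wedges_by_year(events):
    """Group (year, event_type) pairs by year and split each circle into wedges.

    Returns {year: [(event_type, start_angle, end_angle), ...]} with angles in
    degrees. A year with a single event gets one full circle (0 to 360).
    """
    grouped = defaultdict(list)
    for year, event_type in events:
        grouped[year].append(event_type)
    layout = {}
    for year, types in grouped.items():
        step = 360.0 / len(types)  # equal wedges: an assumption of this sketch
        layout[year] = [(t, i * step, (i + 1) * step) for i, t in enumerate(types)]
    return layout


events = [(1887, "marriage"), (1891, "census"), (1891, "birth of child")]
for year, wedges in sorted(wedges_by_year(events).items()):
    print(year, wedges)
```

Because all events of one year are drawn inside a single circle, the circle never leaves the lifeline and cannot collide with the squares above or below it.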

In certain cases, squares can overlap each other, e.g. when two squares represent parents of

the same age (shown on the left of Figure 3.4). In such a situation, the squares will be

moved upwards or downwards slightly to make them distinguishable. Specifically,

denote by $y$ the original y-coordinate of the squares and by $y_1$ and $y_2$ the new

y-coordinates after the movement; then $y_1 = y + \delta$ and $y_2 = y - \delta$, where $\delta$ is a small offset. Some

asterisks are also added to inform the users that these squares have been moved

marginally, as shown on the right of Figure 3.4.

Figure 3.4: A comparison of the effects before and after the movement

Figure 3.5 illustrates the effect of the visualisation of an example individual. In the

graph, it can be observed that Lid1559, who was born in 1867 to her parents,

experienced three censuses during her life and married Lid88 when she was 20; her

husband was three years older than her. In the next 8 years, she gave birth to five

children: Lid617, Lid641, Lid667, Lid700 and Lid741. Unfortunately, she died very

young, at the age of 29. Also, it can be noticed that the age gap between her parents is

much larger than that between her and her husband.

Figure 3.5: Visualisation of an example individual's life and her related persons. Note:

the records used in this figure are presented in Figure 2.2 and Figure 2.3

3.2 Multiple life segment visualisation

In most cases of the single life segment visualisation, there are large blank areas at the

lower left and upper right of the figures. The birth event, which creates the

relationship between an individual and his/her parents, is always the starting point

(leftmost) of that individual's lifeline, and parents must be older than their children, so the

squares representing parents typically cluster in the upper left corner. Similarly, the

squares representing children tend to cluster in the lower right corner. This

characteristic can be observed in Figure 3.5.

These empty areas can be used to provide more information in one figure. A multiple

life segment visualisation can be obtained by expanding all squares denoting related persons

into lifelines. In this model, the given individual and related persons are all represented

by horizontal lifelines so that multiple individuals' life experiences can be viewed

simultaneously. In order to highlight the individual currently being focused on, the

circles' size is decreased and a thinner lifeline is used for all related persons.

Figure 3.6: Multiple life segment visualisation of an example individual. Note: the

records used in this figure are presented in Figure 2.2 and Figure 2.3

Figure 3.6 shows an example of a multiple life segment visualisation. The data used here

are the same as those of Figure 3.5. In this figure, the given individual and her related

persons' lifelines are displayed together. It can be observed that her father (Lid51617)

died young and her mother (Lid17802) had a relatively long life, and one of her

children (Lid667) died in infancy.

In some cases, two individuals are of the same age. The solution in this model is: if one

of them is the individual highlighted, smaller circles that denote events of related

persons are displayed on top of larger circles (representing the central individual) so

that they will not be hidden, as shown on the top of Figure 3.7. If both of them are

related persons, their lifelines will be moved slightly in the vertical direction to make

sure they are distinguishable. Shifted lifelines are marked by an asterisk to indicate the

displacement, as shown on the bottom of Figure 3.7.

Figure 3.7: The solutions to the problems of two individuals sharing the same age

3.3 Data inconsistencies visualisation

The second part of this project was the visualisation of potential inconsistencies.

However, this depends on some information which is not directly included in the

provided linked datasets. Before they can be visualised, functionality to detect those

data inconsistencies is needed.

In this project, data inconsistency descriptions are expressed in the form of rules. Here

are some examples:

a) An individual cannot have an event prior to their birth;

b) An individual cannot give birth to his/her first child before the age of 8;

c) There should not be more than 3 marriages in one individual's life.

It is impossible to enumerate all data inconsistencies here, because for projects with

different research directions, demographic datasets (output by different systems) or

backgrounds, the definition of data inconsistencies can vary. Also, as a data linkage

and analysis project proceeds, more data inconsistencies might be found and need to

be added into the set of current rules. Based on these demands, the function should be

easily maintainable and extensible.

Here, the concept of constraints in the database field can be used for reference. The

domain-specific rules are reduced to a few inbuilt domain-independent rules. For

instance, all examples mentioned above can be supported by the following two

domain-independent rules:

a) For one individual, the time lag between two types of events cannot be larger or

smaller than a given threshold;

b) For one individual, the frequency of a particular type of event cannot be higher or

lower than a given threshold.

This approach greatly increases the flexibility and reusability of the program. If a

domain-specific rule is supported by current domain-independent rules, it can be

added in a configuration file rather than having to modify the source code. A more

detailed description of domain-independent rules will be presented in Section 3.5.
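The two domain-independent rules can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the function names, the representation of events as (year, event_type) tuples and the threshold parameters are all assumptions made for this example.

```python
def time_lag_violations(events, first_type, second_type, min_lag=None, max_lag=None):
    """Rule a): check the time lag between two types of events.

    `events` is a list of (year, event_type) tuples for one individual.
    Returns the (first_year, second_year) pairs whose lag
    second_year - first_year is below min_lag or above max_lag.
    """
    firsts = [year for year, kind in events if kind == first_type]
    seconds = [year for year, kind in events if kind == second_type]
    violations = []
    for first in firsts:
        for second in seconds:
            lag = second - first
            too_small = min_lag is not None and lag < min_lag
            too_large = max_lag is not None and lag > max_lag
            if too_small or too_large:
                violations.append((first, second))
    return violations


def frequency_violation(events, event_type, min_count=None, max_count=None):
    """Rule b): check how often one type of event occurs for one individual."""
    count = sum(1 for _, kind in events if kind == event_type)
    too_few = min_count is not None and count < min_count
    too_many = max_count is not None and count > max_count
    return too_few or too_many


# Events of Lid1559 taken from Figures 2.2 and 2.3 (first child only).
events = [(1867, "birth"), (1887, "marriage"), (1888, "birth of child"), (1896, "death")]
# Domain-specific rule b) above: no first child before the age of 8.
print(time_lag_violations(events, "birth", "birth of child", min_lag=8))
# Domain-specific rule c) above: no more than 3 marriages.
print(frequency_violation(events, "marriage", max_count=3))
```

In this sketch, a new domain-specific rule only needs different arguments, which mirrors the idea of adding rules through configuration instead of source code changes.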

Note that not all data inconsistencies happen inside one lifeline (intra-inconsistencies).

A case in point is an age gap between bride and groom that is

larger than 20 years. If a data inconsistency happens between two individuals, it is

referred to as an inter-inconsistency. The detection of an intra-inconsistency indicates

that there may exist some errors when linking the individual's relevant records in

different datasets, and an inter-inconsistency suggests that the errors can be related to

either or both involved individuals. The visualisation of data inconsistencies should

clearly reflect their differences.

Another issue about data inconsistencies is that, just like exceptions in software, data

inconsistencies also can be divided into multiple levels. In this project, two levels are

defined: error and warning. An error means that something in the linked data must be wrong,

either introduced during the data linkage process or present in the raw data, while a warning

indicates that something is unusual but not physically impossible, so an error is not certain.

For example, if the three rules mentioned at the beginning of this section are applied,

data inconsistencies detected by a) and b) are errors, because these situations should

never happen, and those detected by c) are warnings as they are possible, although

very rare. Thus, these levels should be included when defining rules and when

visualising data inconsistencies.

When visualising the characteristics of demographic data, circles and squares are used to

represent events and related persons respectively, and different colours indicate different event

or relationship types. These circles and squares are linked by horizontal lifelines and

vertical relationship lines. These lines can be used to reflect the data inconsistencies:

lifelines for intra-inconsistencies and relationship lines for inter-inconsistencies.

Normally, lines are black. However, if one or more data inconsistencies exist,

the corresponding line will be coloured differently: red for an error and orange for a

warning. If both types of data inconsistencies are reflected on one line, the line will be

coloured red.
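The colouring rule amounts to a simple precedence check, sketched below in Python. The function name `line_colour` and the constants are hypothetical; the integer codes follow the levels used in the configuration file (1 for error, 2 for warning).

```python
ERROR = 1    # level code for an error
WARNING = 2  # level code for a warning


def line_colour(levels):
    """Return the colour of a lifeline or relationship line.

    `levels` holds the levels of all data inconsistencies detected on the
    line. An error takes precedence over a warning.
    """
    if ERROR in levels:
        return "red"
    if WARNING in levels:
        return "orange"
    return "black"


print(line_colour([]))                # no inconsistency
print(line_colour([WARNING]))         # warnings only
print(line_colour([WARNING, ERROR]))  # both levels on one line
```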

Figure 3.8 is an example of data inconsistency visualisation. In the figure, the lifeline

of Lid38052 is coloured orange, indicating one or more warnings exist. Right clicking

on the coloured line will pop up a window which shows the detailed information

about the corresponding data inconsistencies, i.e. the rule which is violated. In this

case, the warning is: there are three marriages in Lid38052's life. The right-click

function will be described in Section 3.4.

Figure 3.8: An example of data inconsistencies visualisation

3.4 Supplementary functions

Some additional functions were added to make the figures more interactive and thus

facilitate users' research. They aim to provide extra information (directly or indirectly)

when a certain element has been chosen in the current figure. As Figure 3.8 shows,

there are dozens of elements in a figure and it is infeasible to choose one of them just

based on keyboard inputs, so these additional functions mainly respond to

mouse actions, such as button presses.

Figure 3.9: The window shown when a circle is right clicked. Note: the records used in

this figure are presented in Figure 2.2 and Figure 2.3

Right clicking a circle (or a wedge if multiple events share a circle) or a coloured line will

pop up a window which shows some additional information about the circle (wedge) or

line. For a circle (wedge), the information is a detailed description of the

corresponding event, while for a line it is a list of data inconsistencies which explains

the colour of the line. Figure 3.9 illustrates the effect of right clicking a circle.

Left clicking a circle in the lifeline which corresponds to a related person will jump to

another figure which focuses on the related person's lifeline and shows all his/her

related persons. This enables users to look into an individual's related persons and

conduct continuous research. In addition, users' browsing histories are recorded so that

users can use the arrow buttons at the top right corner to go back to previous views or

forward to following views. If the circle representing the 1871 census in Lid17802's life in Figure 3.9 is left clicked instead of right clicked, the program jumps to another figure which selects Lid17802 as the first person, as shown in Figure 3.10. Also, note that the "go back" button in the top right corner is no longer greyed out, which means that it can be clicked to go back to the previous view, shown in Figure 3.9.

Figure 3.10: Resulting figure when left clicking Lid17802's lifeline in Figure 3.9
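The back and forward behaviour described above can be sketched as a simple linear history. This is only an illustration and not the program's actual code: the class name ViewHistory, its methods and the identifier "first_person_id" are hypothetical, and the choice to discard later views when the user jumps from an earlier view is an assumption borrowed from web browsers.

```python
class ViewHistory:
    """Linear browsing history of focused individuals (hypothetical helper)."""

    def __init__(self, first_person):
        self._views = [first_person]  # visited individuals, oldest first
        self._index = 0               # position of the view currently shown

    @property
    def current(self):
        return self._views[self._index]

    def can_go_back(self):
        # When this is False the "go back" button would be greyed out.
        return self._index > 0

    def can_go_forward(self):
        return self._index < len(self._views) - 1

    def visit(self, person):
        """Record a jump triggered by left clicking a related person."""
        # Assumption: jumping from an earlier view discards the later views.
        del self._views[self._index + 1:]
        self._views.append(person)
        self._index += 1

    def back(self):
        if self.can_go_back():
            self._index -= 1
        return self.current

    def forward(self):
        if self.can_go_forward():
            self._index += 1
        return self.current


# Usage: start at some individual, then left click Lid17802's circle.
history = ViewHistory("first_person_id")  # hypothetical identifier
history.visit("Lid17802")
```

After the jump, can_go_back() returns True, which corresponds to the "go back" button no longer being greyed out in Figure 3.10.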

3.5 Flexibility and reusability issues

Efforts have been made to make sure the program is configurable and can be applied to other demographic datasets. As discussed in Section 3.3, specific data inconsistencies are reduced to domain-independent data inconsistencies so that they are configurable. The visualisation function and the data inconsistency detection function share the same data interface, so once demographic data are loaded into this structure, they can be visualised without changing the source code.

If a domain-specific rule is supported by the current domain-independent rules, it can be added to the program by configuring a profile, without modifying the source code.

Below is the structure for each line in the configuration file:

in-built_rule: formula "shown_information" inconsistency_level

Here, in-built_rule is the abbreviated name of an in-built domain-independent rule. Formula defines how to detect a certain type of data inconsistency with the chosen domain-independent rule. It is similar to an arithmetic expression, and its grammar varies according to the chosen domain-independent rule. Shown_information is the text which appears in the pop-up window when the corresponding coloured line in a visualised figure is right clicked, and inconsistency_level is an integer which represents the level of the detected data inconsistency: 1 stands for error and 2 stands for warning.
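As an illustration of this line structure, the following minimal sketch splits one profile line into its four parts. It is not the program's actual parser: the names RuleConfig, LINE_PATTERN and parse_profile_line are hypothetical, and the sketch assumes that the shown information is always enclosed in straight double quotes and contains no double quote itself.

```python
import re
from typing import NamedTuple


class RuleConfig(NamedTuple):
    rule: str      # abbreviated in-built rule name, e.g. "timelag(intra)"
    formula: str   # rule-specific expression, left unparsed here
    message: str   # text shown in the pop-up window
    level: int     # 1 = error, 2 = warning


# rule name, colon, formula, quoted message, integer level
LINE_PATTERN = re.compile(r'^\s*([^:]+?)\s*:\s*(.+?)\s*"([^"]*)"\s*(\d+)\s*$')


def parse_profile_line(line: str) -> RuleConfig:
    """Split one profile line into rule, formula, message and level."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"malformed profile line: {line!r}")
    rule, formula, message, level = match.groups()
    if int(level) not in (1, 2):
        raise ValueError(f"inconsistency level must be 1 or 2: {line!r}")
    return RuleConfig(rule, formula, message, int(level))


config = parse_profile_line(
    'timelag(intra): birth - marriage > 0 "birth year after marriage year" 1'
)
```

The formula is kept as a plain string because its grammar depends on the chosen rule; each rule would need its own formula parser.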

Currently, there are three in-built domain-independent rules for detecting data

inconsistencies. Below are their definitions, formula grammars and some examples:

a) Intra time lag inconsistency: for one individual, the largest (smallest) time lag

between two types of events is larger (smaller) than a given threshold. Here is the

grammar for this rule:

event_type1 - event_type2 comparison_operator threshold

Event_type can be birth, marriage, census, birthofchild or death. The - sign stands for minus. Comparison_operator can be >, >=, < or <=. Threshold is the number of years against which the time lag is compared. Here is an example which adds four lines to the profile to detect whether any event of an individual happens before his/her birth:

timelag(intra): birth - marriage > 0 "birth year after marriage year" 1

timelag(intra): birth - census > 0 "birth year after census year" 1

timelag(intra): birth - birthofchild > 0 "birth year after his/her 1st child's birth year" 1

timelag(intra): birth - death > 0 "birth year after death year" 1

b) Inter time lag inconsistency: similar to intra time lag inconsistency, but it applies to the two individuals on each side of a particular type of relationship (parent-child or marriage). Here is the grammar for this rule:

role1.event_type1 - role2.event_type2 comparison_operator threshold

One of role1 and role2 must be main, which represents the individual currently

being focused on, and the other can be a type of related person: parent, spouse or

child. The domains of the other elements are the same as those of intra time lag inconsistency. Here is an example which adds two lines to the profile to detect whether the age gap between an individual and his/her spouse(s) is 20 years or more:

timelag(inter): main.birth - spouse.birth >= 20 "the age gap between bride and groom is larger than 20" 2

timelag(inter): spouse.birth - main.birth >= 20 "the age gap between bride and groom is larger than 20" 2

c) Intra frequency inconsistency: for one individual, the highest (lowest) frequency

of a certain type of event is higher (lower) than a given threshold. Here is the

grammar for this rule:

event_type comparison_operator number in interval year(s)

The domains of event_type and comparison_operator are the same as those of intra time lag inconsistency. However, the (frequency) threshold is defined by number/interval. The "in interval year(s)" part at the end of the grammar can be omitted if the interval is the individual's whole life. Here are two examples which respectively detect whether an individual has 3 or more children born in one year and whether he/she married 3 or more times:

frequency(intra): birthofchild >= 3 in 1 year "give birth to more than 3 children in 1 year" 2

frequency(intra): marriage >= 3 "more than 3 marriages in his/her life" 2

Note that in the second example (line) the "in interval year(s)" part has been omitted.
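To make rules a) and c) concrete, here is a minimal sketch of how they could be evaluated on one individual's events. It is not the thesis implementation. All function names are hypothetical, and events are assumed to be dictionaries with "type" and "event_year" keys. For rule a), "the largest lag is larger than the threshold" is treated as equivalent to "some pair of events has a lag larger than the threshold" (and likewise for the smallest lag with < or <=). For rule c), a window of interval consecutive calendar years starting at each event year is an assumption, and only > and >= are handled. The inter time lag rule b) is omitted.

```python
import operator

# Comparison operators allowed by the formula grammars.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def years_of(events, event_type):
    """Return the years of all events of the given type."""
    return [e["event_year"] for e in events if e["type"] == event_type]


def violates_intra_timelag(events, type1, type2, op, threshold):
    """True if some (type1, type2) pair of events has a lag satisfying op."""
    compare = OPS[op]
    return any(
        compare(y1 - y2, threshold)
        for y1 in years_of(events, type1)
        for y2 in years_of(events, type2)
    )


def violates_frequency(events, event_type, op, number, interval=None):
    """Compare the highest event count in any window with number.

    interval=None means the whole life, i.e. the total count is used.
    """
    if op not in (">", ">="):
        raise NotImplementedError("this sketch only covers > and >=")
    years = sorted(years_of(events, event_type))
    if interval is None:
        highest = len(years)
    else:
        # Assumed window: [start, start + interval) for each event year.
        highest = max(
            (sum(1 for y in years if start <= y < start + interval)
             for start in years),
            default=0,
        )
    return OPS[op](highest, number)


# Made-up example data, not taken from the Isle of Skye records.
example_events = [
    {"type": "birth", "event_year": 1850},
    {"type": "marriage", "event_year": 1871},
    {"type": "marriage", "event_year": 1875},
    {"type": "marriage", "event_year": 1880},
]
too_many_marriages = violates_frequency(example_events, "marriage", ">=", 3)
```

With this made-up data the marriage rule is triggered, because three marriages satisfy the condition marriage >= 3.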

Besides the configurability of data inconsistency detection rules, a data interface is also defined to make sure the data inconsistency detection function and the visualisation function can be reused. The general idea is that an individual can be depicted by a list of events, which can be used to construct his/her lifeline, and a list of related persons, which describe relationships. Each event and related person has a number of attributes.

Based on the analysis above, the data interface for an individual (denoted by A) includes a list of event objects sorted chronologically and a list of related person objects. Here, list refers to a data structure. Every event object or related person object is represented by an associative array, where each key-value pair describes the

name and value of an attribute. Figure 3.11 and Figure 3.12 show the attributes of an event object and a related person object respectively, together with some comments.

Attribute name   Comment
id               the identity of the event, used to distinguish one event from another
type             the type of the event, i.e. birth, marriage, census, birthofchild or death
event_year       the occurrence year of the event
birth_year       the birth year of A
description      a detailed description of the event, to be shown when the corresponding circle (wedge) is right clicked

Figure 3.11: Attributes of an event object

Attribute name   Comment
id               the identity of the related person, used to distinguish one related person from another
type             the type of the related person, i.e. parent, spouse or child
event_year       the occurrence year of the event which creates the relationship between the related person and A
birth_year       the birth year of the related person
description      a detailed description of the related person, to be shown when the square representing the related person is right clicked

Figure 3.12: Attributes of a related person object

When another linked demographic dataset is given, if the data of all its individuals are provided in a form conforming to the data interface, then the information contained will allow data inconsistency detection and visualisation, and the dataset can be visualised in the program without changes to the source code.
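A record conforming to this data interface could look like the sketch below, which also includes a small conformance check. The attribute names and allowed types come from Figures 3.11 and 3.12; everything else is an assumption made for illustration: the function name check_individual, the identifiers and years in the example data, and the choice of event_year as the key for chronological order.

```python
EVENT_TYPES = {"birth", "marriage", "census", "birthofchild", "death"}
RELATION_TYPES = {"parent", "spouse", "child"}
REQUIRED_KEYS = {"id", "type", "event_year", "birth_year", "description"}


def check_individual(events, related_persons):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    groups = (
        ("event", events, EVENT_TYPES),
        ("related person", related_persons, RELATION_TYPES),
    )
    for kind, objects, allowed in groups:
        for obj in objects:
            missing = REQUIRED_KEYS - set(obj)
            if missing:
                problems.append(f"{kind} {obj.get('id')!r} lacks {sorted(missing)}")
            elif obj["type"] not in allowed:
                problems.append(f"{kind} {obj['id']!r} has unknown type {obj['type']!r}")
    # Assumption: chronological order is judged by event_year.
    years = [e["event_year"] for e in events if "event_year" in e]
    if years != sorted(years):
        problems.append("events are not sorted chronologically")
    return problems


# Made-up record for an individual A (identifiers and years are invented).
events_of_a = [
    {"id": "e1", "type": "birth", "event_year": 1850, "birth_year": 1850,
     "description": "birth of A"},
    {"id": "e2", "type": "marriage", "event_year": 1872, "birth_year": 1850,
     "description": "marriage of A"},
]
related_to_a = [
    {"id": "p1", "type": "spouse", "event_year": 1872, "birth_year": 1848,
     "description": "spouse of A"},
]
problems_found = check_individual(events_of_a, related_to_a)
```

Running such a check while loading a new dataset would report records that cannot be visualised before any figure is drawn.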

4. Evaluation

In general, the proportion of data visualisation papers having evaluation is much

lower than that of papers from other domains. Most papers published at four main

visualisation conferences (EuroVis, InfoVis, IVS and VAST) do not have any

evaluation (Elmqvist & Yi, 2015). In addition, unlike in other domains, the

evaluation techniques for data visualisation tend to be qualitative rather than

quantitative (Redpath & Srinivasan, 2003), which means that evaluation techniques

that work for other domains are not applicable to this project.

In this project, qualitative evaluation is applied. When choosing evaluation methods,

Komlodi, Sears and Stanziola's classification of information visualisation evaluation

practices (Plaisant, 2004) is used for reference, and four main patterns for evaluating

data visualisation techniques are defined:

a) Controlled experiments comparing design elements: compare the effect of certain

elements in the same visualisation technique.

b) Usability evaluation of a technique: get feedback from users.

c) Controlled experiments comparing two or more techniques: compare two or more

visualisation techniques in the same scenario.

d) Case studies of techniques in realistic settings: evaluate a technique in a natural

environment doing real tasks.

Among them, c) is impractical in this project as there is hardly any other research about

visualising linked data, and d) is too time consuming to be adopted. As a result, both a)

and b) are applied as each of them has its own advantage.

During the development of visualisations, the spiral model (Boehm, 1988) was used. At

the end of each cycle, the models were assessed by the project supervisors. The criteria

include the accuracy and integrity of the reflected information, as well as the

accessibility and unambiguity of figures. If a certain design element performed poorly, an alternative element was designed and compared with the original one, and the original was replaced if the alternative led to a better quality figure. One case in

point is that multiple events share a circle rather than being stacked as circles vertically, as illustrated in Figure 3.3. Some other improvements made after self-evaluations are listed below:

- The Y axis was inverted to make the figures more intuitive.

- An individual's information is always shown on the left of the lifeline.

- A dashed vertical line is used to make the lifelines and relationship lines more distinguishable.

- A window containing detailed information pops up when a circle or a coloured line is right clicked, rather than hovered over.

- In multiple life segment visualisation, small circles (representing related persons' events) are always in front of large circles (representing the events of the individual in the spotlight) to make sure they are not hidden, as described in Section 3.2.

- If two related persons share the same age, their squares or lifelines are moved upwards or downwards slightly to make sure they are distinguishable, as described in Sections 3.1 and 3.2.

After the visualisations were completed, two videos demonstrating the models were made by a supervisor and sent to data linkage researchers in Cambridge and Scotland for evaluation. Feedback on the videos (see Appendix B; the feedback is anonymised for privacy reasons) shows that most of the reviewers consider these visualisations a "clean and tidy" way to illustrate "messy" linked records, and that showing data inconsistencies by colouring lines is novel and useful. Some of them mentioned that the usefulness of the models cannot be confirmed without practical application. In addition, some researchers suggested it would be useful to visualise the data at a higher level and check more individuals simultaneously. This is the goal of a future project.

Other positive opinions in the feedback include:

- It can be applied to other domains.

- The right click function makes it possible to track a wrong link back to its origin.

- It can be "a good base" for further extensions.

Some critical opinions in the feedback are listed below:

- In multiple life segment visualisation, it is not clear whether a vertical relationship line just passes by a lifeline or connects with it.

- It is not clear whether having small circles always in front of large circles helps make figures more intuitive.

In addition, many other suggestions are provided in the feedback. These suggestions were all recorded and could be used as references for future work:

- Indicate gender with colour coding.

- Evaluate the visualisations by comparing them with other diagram styles.

- Visualise a family together with the causes of all deaths in this family.

- Enable users to select and organise required data in a more arbitrary way, for example with a query system.

- Enable users to choose how many generations are shown simultaneously in one figure.

5. Conclusion

This thesis has described a novel approach for the visualisation of complex linked databases, showing the characteristics of linked data as well as data inconsistencies in a concise and clear way. Though this technique is not universal enough to be applied to all kinds of linked datasets, it still has reference value for other projects. In addition, this method provides original and standardised models for the visualisation of demographic data, so other similar datasets can be easily visualised using these models if they are loaded into the pre-defined data interface.

In order to keep visualisations concise and clear as well as make sure all important

information is included, a multi-layer display is used. The structure and major attributes

of the given demographic data are shown in a graphical interface, and detailed

information is provided via supplementary functions when users interact with the

visualisations.

Current results have shown that the characteristics of data and data inconsistencies can

be visualised in one figure together to help identify potentially wrong links, without

confusing a user. Feedback from the data linkage researchers in Cambridge and

Scotland also indicates they consider these visualisations "clean and tidy" when

illustrating intricate linked records, and the visualisation of data inconsistencies is an

effective way to highlight areas that researchers should pay more attention to.

However, with only low-level visualisations, this approach is not ideal for identifying

potentially wrong or inconsistent links, as it does not support roll-up and drill-down

operations which are common in visual data analysis. Future work will

concentrate on visualising the given data at higher levels, such as family tree

visualisation and communal visualisation, as well as connecting the visualisation at

different levels with each other.

Acknowledgement

This project was conducted under the guidance of Assoc. Prof. Peter Christen and Mr.

Jeffrey Fisher. I would like to express my sincere gratitude to them for their

continuous support and infinite patience. They spent a lot of time helping me and

provided plenty of suggestions, ranging from the design of the visualisations and the

implementation of the Python program, to the structure and contents of this thesis as

well as the preparation of the presentations.

Thanks also to Assoc. Prof. Weifa Liang for his help while I was choosing the project

and writing this thesis.

References

1. Abecasis, G. R., & Cookson, W. (2000). GOLD—graphical overview of linkage

disequilibrium. Bioinformatics, pp. 182-183.

2. Andreev, K. (2000). Sex differentials in survival in the Canadian population, 1921-1997: a descriptive analysis with focus on age-specific structure. Demographic Res, 3:article 12.

3. Boehm, B. (1988). A spiral model of software development and enhancement. Computer, pp. 61-72.

4. Christen, P., & Goiser, K. (2007). Quality and complexity measures for data linkage and deduplication. In Quality Measures in Data Mining (pp. 127-151). Springer.

5. Elmqvist, N., & Yi, J. S. (2015, 7). Patterns for visualization evaluation. Information Visualization, pp. 250-269.

6. Keim, D. (2002). Information visualization and visual data mining. Visualization and Computer Graphics, IEEE Transactions on, pp. 1-8.

7. Plaisant, C. (2004). The Challenge of Information Visualization Evaluation. In Proceedings of the working conference on Advanced visual interfaces (pp. 109-116). ACM.

8. Redpath, R., & Srinivasan, B. (2003). Criteria for a Comparative Study of Visualization. In Intelligent Systems Design and Applications (pp. 609-620). Springer Berlin Heidelberg.

9. Reid, A., Davies, R., & Garrett, E. (2002). Nineteenth-Century Scottish Demography From Linked Censuses and Civil Registers: A 'Sets of Related Individuals' Approach. History and Computing, pp. 61-86.

10. Wang, F., Ibarra, J., Adnan, M., Longley, P., & Maciejewski, R. (2014). What's in a Name? Data Linkage, Demography and Visual Analytics. In EUROGRAPHICS (pp. 7-11).

11. Ward, M. O., Grinstein, G., & Keim, D. (2010). Interactive data visualization: foundations, techniques, and applications. CRC Press.

12. Wigginton, J. E., & Abecasis, G. R. (2005). PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data. Bioinformatics, pp. 3445-3447.

Appendix A

INDEPENDENT STUDY CONTRACT Note: Enrolment is subject to approval by the projects co-ordinator

SECTION A (Students and Supervisors)

UniID: u5455264

SURNAME: Han FIRST NAMES: Quanwei

PROJECT SUPERVISOR (may be external): Assoc Prof Peter Christen, Mr Jeffrey Fisher

COURSE SUPERVISOR (a RSCS academic): ________________________________________________

COURSE CODE, TITLE AND UNIT: COMP8715 Computing Project 12units

SEMESTER S1 S2 YEAR: 2015

PROJECT TITLE:

Visualising complex linked data

LEARNING OBJECTIVES:

Become familiar with complex linked data sets; learn about graph visualisation techniques; and become

proficient in the Python programming language to develop prototypes for processing and analysing large data

sets.

PROJECT DESCRIPTION:

Social science researchers, government agencies and businesses increasingly rely on the linking of larger and

complex databases. Such linking allows enrichment of data, helps improve data quality, and enables data

mining not feasible on a single database. One crucial aspect so far neglected by many data linkage projects is

the visualisation of complex linked databases. Visualisation is important when identifying potentially wrong or

inconsistent links, and helps users better understand the characteristics of linked data.

The first outcome of this project is a set of Python prototype programs that generate different visualisations of

the Isle of Skye historical data, as well as other data sets in a similar format. The programs need to be flexible

and configurable in order to work with other data sets. The second outcome is a report describing the

visualisation techniques implemented in the Python prototype programs, how to use these programs, and how

they have been tested and evaluated.

The aims of this project are to research and develop novel network- and graph-based visualisation techniques

for complex linked data. Specifically, using a data set of linked historical birth, death, marriage and census data

from the Isle of Skye in Scotland, the objective is to develop and evaluate a set of visualisation techniques

implemented as Python-based prototypes.

ASSESSMENT (as per course’s project rules web page, with the differences noted below):

Assessed project components: % of mark Due date Evaluated by:

Report: name style: _____________________________

(e.g. research report, software description...)

60%

Artefact: name kind: ____________________________

(e.g. software, user interface, robot...)

30%

Peter Christen

and Jeff Fisher

Presentation:

10%

MEETING DATES (IF KNOWN): Weekly time and day to be arranged.

STUDENT DECLARATION: I agree to fulfil the above defined contract:

______________________________ ______________________

Signature Date

SECTION B (Supervisor):

I am willing to supervise and support this project. I have checked the student's academic record and

believe this student can complete the project.

______________________________ ______________________

Signature Date

REQUIRED DEPARTMENT RESOURCES:

SECTION C (Course coordinator approval)

______________________________ ______________________

Signature Date

SECTION D (Projects coordinator approval)

______________________________ ______________________

Signature Date

Appendix B

Feedback on videos demonstrating the visualisations of life segments in this project.

It's hard to be confident about usefulness without having actually used it for a specific

purpose - but I would be optimistic - it looks very nice.

The only specific thing that occurs to me is to wonder whether gender could also be

indicated with colour coding?

Is it significant that the dotted vertical lines on the right side of the screen in video 1

don't connect with the male?

I assume it was a deliberate design decision to limit the quantity of data being

displayed to 3 generations. For evaluation, it might be interesting to contrast this with

other common genealogy diagram styles e.g. tree, fan etc.

It would be interesting, if perhaps less useful, to be able to zoom out to see a large

number of people simultaneously, perhaps with the ability to highlight sets of people

e.g. ancestors of a given person.

Related to that, the orange coding for errors looks very useful when inspecting

particular parts of the graph. It might also be useful to have a way of indicating errors

in a zoomed-out view, so that you could find them without having to browse around.

In the future obviously it would be worth considering whether this sort of tool could

be extended to indicate linkage, uncertainties etc. As you may remember we have an

HCI group here who might well be interested in helping with the visualization side of

things.

I have managed to download videolan and watch the two videos, many thanks. It

certainly got me wanting to be able to click on the circles to see what happened.

Well done to the student for managing to make what can be quite 'messy' data look so

'clean and tidy'. I thought the way that it linked marriage partners, their parents and

their children together was excellent.

The colour coded dots on the time line were easy to follow, and I thought that the

colour coded 'warning' lines were a very good idea, and would highlight for

researchers where extra 'digging' might be needed.

I am off to find the gentleman with 19 children in the Skye records....

One thing I wondered for an 'extra' would be the ability to show a family with the

causes of all deaths displayed shown at once - might give some interesting research

leads.

I've taken a look at the videos and its looking nice and clean as a visualisation tool.

One thing I wasn't able to work out from the videos was in relation to the colouring of

lines to visualise warnings and errors. Are these only shown when the person in

question is 'in focus'? If not I feel these things may be easy to miss if viewing the

population at a larger scale (i.e. more zoomed out) where there maybe many more

people on a single screen.

I'm not sure if having peoples time lines laid atop of one another as in video 2 makes

thing easier to intuitively understand - in the complex example that is given it doesn't

seem possible to work out whose death event is whose without clicking into the

individual (I guess to whom the death belongs could be inferred by genders of married

individuals - but this may be an assumption that isn't possible to make in more

modern populations).

I can see it being useful across a number of domains. From a generating synthetic

populations viewpoint it would be very useful to be able to graphically review

subsections of a population to check it for sanity (here having higher zoom levels

would be useful too).

From a linkage viewpoint again it could be useful to be able to identify places where

automated linkages have gone awry - here it being possible to display, on the right

click, the underlying provenance of linkage and the values associated with the linkage

information may be useful.

It's looking like a nice tool, that potentially could be a good base for further modules

to be added onto.

I'd be interested to see how useful the tool would be for reviewing the synthetic

populations I'm currently generating, if you'd be able to send over a copy of the code

it would be appreciated.

So I think the visualisation looks good - ie clean and clear, so the question is therefore

how it can help with various tasks/ uses

As an exploratory tool to look across records it is clearly very helpful ie rather than

looking across multiple tables etc. and then perhaps running queries to calculate birth

intervals etc.

useful extension to this use would be:

- ways of organising, grouping and searching the life segments ie 'show all segments

for a part of Skye', 'show a particular pedigree starting at person X', 'show all families

with >9 births', show all segments with an average birth interval <x

- How many records could be shown on average on a page? Could there be a

change in form with progression 'up' to more records ie - could there be a visual

summary (typology) at 'higher levels'. This would then allow a user to drill down into

a particular set of records that might be of interest to them (eg short lived).

This would then help a user find the segments they are interested in
