
A Rule-based Approach to External Context Extraction from Biomedical Literature: URL and Role Extraction

A dissertation submitted to The University of Manchester for the degree of

Master of Science Informatics

In the Faculty of Engineering and Physical Sciences

2010

Azad Dehghan

School of Computer Science


Table of Contents

Table of Contents
List of Tables
List of Figures
List of Abbreviations
Abstract
Declaration
Copyright Statement
Dedication
Acknowledgement
1. Introduction
1.1. Motivation
1.2. Project Aims
1.2.1. Conceptualisation of Project Specific Terminology
1.3. Project Objectives
1.4. Availability
1.5. Overview of Chapters
2. Background
2.1. Text Mining
2.1.1. Information Retrieval
2.1.2. Natural Language Processing
2.2. Information Extraction
2.2.1. Rule-based and Statistical-based Approaches to IE
2.2.2. IE Application Development Tools/Software
2.3. NLM Journal Archiving and Publishing DTDs
2.4. Related Work
2.5. Summary of Chapter
3. Software Requirements
3.1. Description of Main Tasks
3.1.1. URL Extraction
3.1.2. Acknowledgement Extraction
3.2. Functional User and System Requirements
3.2.1. Functional User Requirements and Use Case Diagram
3.2.2. Functional System Requirements
3.2.3. Requirement Traceability Matrix
3.3. Non-Functional Requirements
4. System Design and Analysis
4.1. Generic System Architecture
4.2. Description of External Context Extraction
4.2.1. URL Module
4.2.2. IE Module
4.3. System Architecture
4.3.1. Subsystems Architecture
4.4. System Design
4.4.1. Database Layer
4.4.2. Application Layer
4.4.3. Presentation Layer
5. Implementation
5.1. Tools & Implementation Environment
5.2. Implementation of URL Module
5.2.1. Extraction of URLs
5.2.2. Checking Resource Availability
5.2.3. Determining Resource Type
5.3. Implementation of IE Module
5.3.1. GATE
5.3.2. Java Annotation Pattern Engine
5.3.3. Implementation of IE Module Described
5.3.4. Information Extraction
6. Evaluation
6.1. URL Extraction
6.1.1. Discussions
6.2. Role Extraction
6.2.1. Discussions
6.3. System Limitations
7. Conclusion
7.1. Limitations and Future Work
References
Appendix A – System Architecture and Design
Appendix B – Implementation
Appendix C – Evaluation Data


List of Tables

Table 1 – Relevant XML Tags
Table 2 – Most Acknowledged Funding Organisations
Table 3 – Ideal Results from URL Extraction Process
Table 4 – Ideal Results of TM Process
Table 5 – Description of Actor (AC)
Table 6 – Description of Use Cases
Table 7 – Mapping between Projects Objective and Implementation Objectives
Table 8 – Implementation Objective 1







Table 15 – (Implementation) Objective 8
Table 16 – Requirement Traceability Matrix
Table 17 – Ideal Results from URL Extraction Process
Table 18 – HTTP Response Codes
Table 19 – Examples of REs for Collaborators and Funders
Table 20 – Results of TM Process
Table 21 – Regular Expressions for URL Validation
Table 22 – Sample of Keywords
Table 23 – Distributed Score of Soft Decision Algorithm
Table 24 – Result by Soft Decision Algorithm
Table 25 – Sample of One-Word Role Expression Lists
Table 26 – Sample of Multi-Word Role Expression Lists
Table 27 – Results of Role Extraction
Table 28 – Evaluation Terms Described
Table 29 – Total Resource Type Referenced
Table 30 – Resource Availability by Year
Table 31 – True Positives: Role Extraction
Table 32 – Most Acknowledged Funding Organisation
Table 33 – Description of RE Transducers Rule
Table 34 – Development and Evaluation Environment
Table 35 – Accomplished Project Aims
Table 36 – List Keywords for Resource Type Identification
Table 37 – URL Extraction Data
Table 38 – Role Extraction Data
Table 39 – Role Expression Extraction Data
Table 40 – Name Entity Extraction Data


List of Figures

Figure 1 – URL Decay (Wren, 2008)
Figure 2 – Use Case Diagram
Figure 3 – High-Level System Architecture
Figure 4 – URL Module Overview
Figure 5 – Generic NLP/IE Pipeline
Figure 6 – ExtConX2 Layered Subsystems
Figure 7 – ExtConX2 Database Layer
Figure 8 – Relational Database Schema
Figure 9 – ExtConX2 Application Layer
Figure 10 – ExtConX2 Presentation Layer
Figure 11 – IE Application Pipeline
Figure 12 – URL Decay
Figure 13 – System Db EER Diagram
Figure 14 – ExtConX2 Architectural Design
Figure 15 – ANNIE Default IE Modules (www.gate.ac.uk)


List of Abbreviations

A Nearly-New Information Extraction System ANNIE

Simple API for XML SAX

Common Pattern Specification Language CPSL

Data Mining DM

Digital Object Identifier DOI

Document Object Model DOM

Graphical User Interface GUI

Human Computer Interaction HCI

Hypertext Transfer Protocol HTTP

Information Extraction IE

Information Retrieval IR

Integrated Development Environment IDE

Java Annotation Pattern Engine JAPE

Java Virtual Machine JVM

Left-hand-side LHS

Model-View-Controller MVC

National Center for Biotechnology Information NCBI

National Institutes of Health NIH

National Library of Medicine NLM

Natural Language Processing NLP

Object Oriented Programming OOP

PubMed Central PMC

Relational Database Management System RDBMS

Right-hand-side RHS

Role Expression RE

Separation of Concerns SoC

Software Development Processes SDP

Software Requirements Engineering SRE

Software Requirements Specification SRS

Text Mining TM


Abstract

With a huge number of publications within the biomedical domain, there is an increasing number of references to URLs and of acknowledgements of individuals and funding organisations. This project was motivated by the need to examine the scope of the problem of URL decay, and to explore and uncover facts such as the most active funding organisations, the relationships between funding agencies and research themes, and the relationships between scientists and research themes.

EXTernal CONtext eXtractor 2 (ExtConX2) was developed to support this aim. Rule-based approaches were adopted to extract URLs and acknowledgements from PubMed Central (PMC) documents. From the entire PMC dataset of roughly 190,000 documents processed, 147,133 URLs and 194,539 roles were extracted.

Using this data, we analysed trends in URL decay and acknowledgements. For example, we found that URL decay can be described as a function of publication year: the older the publication, the less accessible the resources it references. We also found that most funding acknowledgements were associated with the National Institutes of Health, the National Science Foundation, and the Wellcome Trust, in that order.

The adopted approach for URL extraction achieved a precision of 98.6% and a recall of 96%. The role extraction task achieved a precision of 92.6% and a recall of 67.6%.


Declaration

No portion of the work referred to in the dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright Statement

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns any copyright in it (the "Copyright") and he has given The University of Manchester the right to use such Copyright for any administrative, promotional, educational and/or teaching purposes.

ii. Copies of this dissertation, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

iii. The ownership of any patents, designs, trademarks and any and all other intellectual property rights except for the Copyright (the "Intellectual Property Rights") and any reproductions of copyright works, for example graphs and tables ("Reproductions"), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and exploitation of this dissertation, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of School of Computer Science.


Dedication

This project is first and foremost dedicated to Science. I hope that the excellence of science and reason will continue to prevail! The earth is round indeed!

Secondly, I would also like to dedicate this project to my family: my parents Siavash Dehghan and Shahnaz Gharehjani, and my brother Arash, for his support.

Acknowledgement

I am grateful to Dr. Goran Nenadic for helpful comments and suggestions. I would also like to acknowledge the gnTeam for providing the PubMed Central dataset.


1. Introduction

The overwhelming amount of unstructured textual information within scientific literature has made machine-supported analysis of text ever more important in aiding scientists with hypothesis generation and knowledge discovery (Ananiadou & McNaught, 2006; Ananiadou et al., 2005; Uramoto et al., 2004). A specific problem domain is that of the biological sciences, as reflected by the sheer volume of academic publications. For instance, in the previous year alone (2009), over 710,000 approved references were added to MEDLINE®/PubMed®, or between 60,000 and 120,000 references each month (NLM, 2008; NLM, 2009). Such a volume of publications cannot be digested by any individual scientist.

In this domain in particular, the application of text mining (TM) techniques to analyse huge quantities of unstructured information has become a vital means of extending and furthering scientific knowledge discovery (Ananiadou & McNaught, 2006). The limitations of pursuing knowledge discovery or generating scientific hypotheses without the aid of TM techniques should be evident.

With a huge number of publications within the biomedical domain, there is an increasing number of (1) references to URLs or online resources (e.g., publications, software, and so on), and (2) acknowledgements of individuals and funding organisations. The aim of this dissertation may be described as discovery-oriented (see Fayyad et al., 1996), i.e., to uncover previously unknown facts or knowledge regarding relationships and patterns involving these aspects using TM techniques.

1.1. Motivation

The unprecedented growth of biomedical literature is coupled with the increasing practice of referencing online resources (URLs) that become inaccessible over time (i.e., URL decay). This project is motivated by the need to analyse the scope of this problem. While previous studies (Wren, 2004; Wren, 2008) have confirmed the issue of URL decay, this project extends previous research by drawing a more holistic conclusion from the analysis of a broader dataset.

Another motivation similarly derives, in part, from the unprecedented quantity of research and publication within the biomedical domain. As biomedical research attracts billions of pounds of research grants and investment from governmental, commercial, and academic sources worldwide each year, it will be interesting to explore and uncover patterns such as the most active funding agencies or institutions, the relationships between funding agencies and research themes, the relationships between scientists and research themes, and so on.¹

1.2. Project Aims

The aim of this project is to design and implement a system that enables the analysis of trends such as URL decay (i.e., the phenomenon of online resources becoming inaccessible), the types of online resources most often referenced, and the exploration of acknowledgements: individuals and organisations and their respective roles in relation to the research article in which they are acknowledged. Therefore, the system must enable extraction of so-called external context from biomedical research: (1) URLs and (2) acknowledgements. This software system will be referred to as EXTernal CONtext eXtractor 2, or ExtConX2, hereafter.²

Moreover, ExtConX2 may be described as two systems in one: (1) a URL extractor and (2) an acknowledgement extractor. A description of these subsystems follows:

(1) URL Extractor

The URL Extractor must (1) extract URLs, (2) determine, for each URL extracted, the type of resource referenced (i.e., Document, Databank, Software, or Organisation), and (3) determine whether the URL is accessible.

(2) Acknowledgement Extractor

The Acknowledgement Extractor must identify and extract (1) named entities (NEs) such as persons and organisations, (2) role expressions (REs), i.e., the acknowledged role of a given NE, and (3) the relations or associations between an NE and its corresponding RE.
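The first and third URL Extractor steps can be illustrated with a minimal Python sketch. It is not the ExtConX2 implementation: the regular expression, the function names `extract_urls` and `is_accessible`, the restriction to HTTP(S) and bare `www.` addresses, and the rule that any HTTP status below 400 counts as accessible are all assumptions made for illustration. Resource-type detection is left out.

```python
import re
import urllib.error
import urllib.request

# Assumed pattern for illustration: http(s) URLs and bare "www." hosts.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s<>"')\]]+""", re.IGNORECASE)


def extract_urls(text):
    """Return URLs found in text, with trailing sentence punctuation removed."""
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]


def is_accessible(url, timeout=10.0):
    """Return True if the URL answers with an HTTP status below 400 (assumed rule)."""
    if url.lower().startswith("www."):
        url = "http://" + url
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors (4xx/5xx), DNS failures, timeouts and malformed URLs
        # are all treated as "not accessible" in this sketch.
        return False


sample = "Data are available at http://www.example.org/db. See also www.example.com/tool, version 2."
print(extract_urls(sample))
```

`is_accessible` needs network access and is therefore not called here. A server that rejects `HEAD` requests would be reported as inaccessible, so a fuller version might retry with `GET`.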

1.2.1. Conceptualisation of Project Specific Terminology

Several project-specific terms are used throughout this dissertation. This section defines them for ease of reference:

1 Apart from the practical applications described in section 1.2.1, biomedical research can at times be controversial (e.g., stem-cell research; health risks of cigarettes); hence, uncovering patterns between funding organisations and research could be important for maintaining scientific and academic integrity.

2 The "2" indicates the number of tasks the system handles: (1) URL extraction and (2) acknowledgement extraction.


(1) Conceptualisation of Role Entities:

i. Collaborator – any NE (person or organisation), apart from the author(s), that provides any non-financial support (e.g., editorial, conceptual, technical, and so on).

ii. Funder – any NE that provides financial support to the corresponding research.

iii. Role Expression – the literal role of a collaborator or funder.

Note that collaborator / contributor, and sponsor / funder will be used interchangeably throughout

this report.

(2) Conceptualisation of Resource Types:

i. Databank – any database or repository of information which may facilitate dynamic

information retrieval.

ii. Document – any article, report, book, or any static information resource.

iii. Organisation – any organisation or institute (literal definition).

iv. Software – any computer program or application (literal definition).

1.3. Project Objectives

This project will aim to achieve the following objectives:

1. Design and implement a relational database (Db) schema to store extracted data.

2. Design and implement a module to extract URLs from documents, determine if the given

URL is accessible or not, determine type of resource (or URL) extracted/referenced and

insert this data into a database.

3. Design and implement a module to identify and extract funders and collaborators (i.e.,

persons/organisations and their respective roles) from acknowledgements and insert this

data into a database.

4. Design and implement a GUI that will facilitate exploration of system functionalities and

which provides general statistics.

5. Evaluation of the proposed methodology.
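As a rough illustration of objectives 1 to 3, the sketch below creates a small relational schema and stores one extracted URL. Everything in it is hypothetical: the table and column names are invented for this example, SQLite is used only because it ships with Python, and the identifier `PMC0000000` is a placeholder. It does not reproduce the schema designed for ExtConX2.

```python
import sqlite3

# Hypothetical schema for illustration only; the table and column names are
# assumptions and do not reproduce the ExtConX2 database design.
SCHEMA = """
CREATE TABLE document (
    pmcid    TEXT PRIMARY KEY,
    pub_year INTEGER
);
CREATE TABLE url (
    id            INTEGER PRIMARY KEY,
    pmcid         TEXT NOT NULL REFERENCES document(pmcid),
    address       TEXT NOT NULL,
    resource_type TEXT CHECK (resource_type IN
                      ('Document', 'Databank', 'Software', 'Organisation')),
    accessible    INTEGER CHECK (accessible IN (0, 1))
);
CREATE TABLE role (
    id              INTEGER PRIMARY KEY,
    pmcid           TEXT NOT NULL REFERENCES document(pmcid),
    entity          TEXT NOT NULL,
    role_type       TEXT CHECK (role_type IN ('Collaborator', 'Funder')),
    role_expression TEXT
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    connection = sqlite3.connect(path)
    connection.executescript(SCHEMA)
    return connection


db = create_database()
db.execute("INSERT INTO document VALUES (?, ?)", ("PMC0000000", 2008))
db.execute(
    "INSERT INTO url (pmcid, address, resource_type, accessible) VALUES (?, ?, ?, ?)",
    ("PMC0000000", "http://www.example.org/db", "Databank", 1),
)
print(db.execute("SELECT resource_type, accessible FROM url").fetchall())
```

The `CHECK` constraints restrict resource and role types to the categories defined in section 1.2.1, so a mistyped category is rejected when the row is inserted.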

1.4. Availability

The PubMed Central dataset will be available from gnode1 (gnode1.mib.man.ac.uk) for use within this project.


1.5. Overview of Chapters

The remainder of this dissertation is organised as follows:

Chapter 2 – Background: provides a general description of the project background, such as Text Mining (TM) processes and concepts, and a review of related work.

Chapter 3 – Software Requirements: provides a high-level description of the main requirements

of ExtConX2, and further defines functional and non-functional requirements.

Chapter 4 – System Design and Analysis: illustrates and discusses the overall system design and

individual software components of ExtConX2.

Chapter 5 – Implementation: discusses the implementation of the system by analysing selected implementation components.

Chapter 6 – Evaluation: presents and discusses the results of the knowledge discovery stage of the dissertation and the evaluation of the adopted methods.

Chapter 7 – Conclusion: concludes the dissertation by reflecting on the project aims, the limitations of the system, and suggestions for future work.


2. Background

2.1. Text Mining

TM generally involves the application of techniques such as Information Retrieval (IR), Natural

Language Processing (NLP), Information Extraction (IE), and Data Mining (DM) (JISC, 2006;

Uramoto et al., 2004) to unstructured text. Hearst (2003) summarises the general notion of TM as:

the discovery by computer of new, previously unknown information, by automatically [or

semi-automatically] extracting information from different written resources. A key element

is the linking together of the extracted information together to form new facts or new

hypotheses to be explored further by more conventional means of experimentation.

While TM is often an iterative process, its techniques and stages are generally applied in an ordered manner; TM, or knowledge discovery, is a process-oriented activity. Further, because TM is a relatively new research field, the concepts used are not always consistent across the literature (see Hotho et al., 2005; Fayyad et al., 1996). Although it is beyond the scope of this report to discuss this issue further, it is important to acknowledge it. Hence, this section briefly reviews the processes, techniques, and concepts involved in TM, in order to clarify the conceptual foundation and aid understanding of the project described in later chapters.

2.1.1. Information Retrieval

Information retrieval is a discipline concerned with the problem of finding documents or information (Hotho et al., 2005). IR covers a wide variety of research areas such as document classification and categorisation, data visualisation, filtering, modelling, and so forth (Baeza-Yates & Ribeiro-Neto, 1999). Often-referenced IR systems are search engines such as Yahoo³ and Google⁴, which identify documents or information according to the user's search queries (JISC, 2006). IR systems within the biomedical domain include Entrez PubMed and PubMed Central (PMC). PubMed® is a free resource which provides access to MEDLINE® (Medical Literature Analysis and Retrieval System Online), the U.S. National Library of Medicine's (NLM) database of citations and abstracts. Currently, PubMed contains over 19 million references from approximately 5,400 biomedical journals published worldwide (NLM, 2010a). PubMed Central is the corresponding (free) full-text digital archive developed and managed by the U.S. National Institutes of Health's (NIH) National Center for Biotechnology Information (NCBI).

3 www.yahoo.co.uk
4 www.google.co.uk


Moreover, within the context of the TM or knowledge discovery process, IR refers to the process of finding and retrieving documents relevant to a particular problem (JISC, 2006). While IR is considered a sub-process of NLP by some researchers (e.g., Polajnar, 2006), within this project IR is regarded as a separate process that precedes NLP.

2.1.2. Natural Language Processing

Natural language processing is concerned with the problem of understanding natural language (NL) by the use of computers (JISC, 2006; Hotho et al., 2005). Due to the inherent ambiguity of NL, analysing it by machine is evidently complex. Thus, NLP is commonly divided into several levels of processing (Hahn & Wermter, 2006): lexical, syntactic, and semantic. Lexical level processing deals with how words can be recognised, analysed, and identified to enable further processing (Hahn & Wermter, 2006). Syntactic level analysis deals with the identification of structural relationships between groups of words in sentences, and the semantic level is concerned with the content-oriented perspective, or the meaning attributed to the various entities identified at the syntactic level (Hahn & Wermter, 2006).

(1) Lexical Level Processing

The tokenisation process, or the segmentation of text into individual meaningful elements, is the initial stage of lexical level processing. Tokens such as words, acronyms, abbreviations, numbers, and so on are linguistically identified (Hahn & Wermter, 2006). Other interrelated sub-processes associated with lexical level processing include (Hahn & Wermter, 2006):

• Part-Of-Speech (POS) tagging, which is considered the core of this level of processing
• Morphological analysis (the association/linking of varied forms of lexical elements to their canonical base form)
• Unknown word handling
• Acronym detection
• Named Entity Recognition (NER)

A widely used and reliable POS tagger within the biomedical domain is GENIA Tagger v3.0 (Tsuruoka et al., 2005). Computational lexicons (e.g., BioThesaurus) are also utilised at this stage to aid the overall lexical level processing. While lexicons often vary depending upon the domain or task, at a bare minimum computational lexicons contain lexical elements such as full or canonical base forms of words, together with additional linguistic information (e.g., part-of-speech category and morphological information).
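The lexical steps described above (tokenisation followed by a lexicon lookup) can be sketched in a few lines of Python. This is a toy example under stated assumptions: the token pattern, the three-entry `LEXICON` and the `UNK` tag for unknown words are invented for illustration and are far simpler than tools such as GENIA Tagger.

```python
import re

# Assumed token pattern: words (optionally with internal hyphens or
# apostrophes), numbers with separators, then any other single symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*|\d+(?:[.,]\d+)*|\S")

# Tiny hypothetical lexicon: surface form -> (canonical base form, POS tag).
LEXICON = {
    "systems": ("system", "NNS"),
    "processed": ("process", "VBD"),
    "journals": ("journal", "NNS"),
}


def tokenise(text):
    """Segment text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def annotate(tokens):
    """Attach a base form and POS tag to each token; unknown words get 'UNK'."""
    annotated = []
    for token in tokens:
        base, pos = LEXICON.get(token.lower(), (token, "UNK"))
        annotated.append((token, base, pos))
    return annotated


tokens = tokenise("Rule-based systems processed 5,400 journals.")
print(tokens)
print(annotate(tokens))
```

Tokens missing from the lexicon fall through to the `UNK` tag, which is the simplest possible form of the unknown word handling listed above.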


(2) Syntactic Level Processing

Common methods applied in syntactic level processing are chunkers and parsers. Chunkers partition sentences into labelled phrasal units (i.e., noun, prepositional, verb, or adjective phrases) (see Hahn & Wermter, 2006, p.23 for details), and parsers identify clauses such as word sequences containing a subject and a predicate (Hahn & Wermter, 2006, p.25). An example of a domain-specific (i.e., biomedical) shallow parser is GENIA Tagger. Moreover, applying a named entity recogniser (NER) at this level of processing has proven beneficial within biological text mining, as most named entities are contained within noun or prepositional phrases (Hahn & Wermter, 2006). Some examples of NER systems include ANNIE for, e.g., person and organisation name recognition (Cunningham et al., 2010), LINNAEUS for species name recognition (Gerner et al., 2010), and TerMine for technical term recognition.
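To make the idea of chunking concrete, the sketch below groups POS-tagged tokens into noun phrases. It is a deliberately naive illustration, not how GENIA Tagger or ANNIE work: the input tags are hand-assigned Penn Treebank-style tags, and the single pattern (optional determiner, any adjectives, one or more nouns) is an assumption chosen for brevity.

```python
import re


def chunk_noun_phrases(tagged):
    """Return noun phrases matching: optional determiner, adjectives, nouns."""
    # Encode each token's tag as one character so that regex match offsets
    # correspond directly to token indices.
    codes = "".join(
        "D" if tag == "DT"
        else "J" if tag.startswith("JJ")
        else "N" if tag.startswith("NN")
        else "O"
        for _, tag in tagged
    )
    return [
        " ".join(word for word, _ in tagged[match.start():match.end()])
        for match in re.finditer(r"D?J*N+", codes)
    ]


# Hand-tagged example sentence (tags assigned manually for this sketch).
sentence = [
    ("The", "DT"), ("funding", "NN"), ("agency", "NN"), ("supported", "VBD"),
    ("the", "DT"), ("new", "JJ"), ("study", "NN"), (".", "."),
]
print(chunk_noun_phrases(sentence))
```

Because named entities mostly sit inside noun phrases, chunks such as "The funding agency" are natural candidates to pass on to a named entity recogniser.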

Resources commonly utilised to aid the overall syntactic level process are grammars and treebanks. Treebanks are text corpora with syntactic annotations at sentence level (i.e., POS tags and syntactic structures), and grammars contain some subset of linguistic syntax, commonly rules or constraints which characterise morpho-syntactic and nonterminal grammar categories (see Hahn & Wermter, 2006, p.21). An example of a widely used treebank within the biomedical domain is GENIA Treebank v1.0, which is based upon annotated PubMed abstracts (Kim & Tsujii, 2006; Tateisi, 2004).

(3) Semantic Level Processing

Semantic level analysis consists of linking terms or concepts to form logical or knowledge propositions (Hahn & Wermter, 2006). This level of processing builds directly upon the combination of lexical and syntactic level analysis. For instance, within the scope of this project, semantic level processing involves the linking of NEs and their respective roles.

2.2. Information Extraction

Information extraction may be described as a stage subsequent to NLP. IE is the process of automatically or semi-automatically extracting predefined data from unstructured text (JISC, 2006) and inserting this data into forms or templates (see McNaught & Black, 2006, p.143), which subsequently convey the data as factual information (Hotho et al., 2005). As defined by the Message Understanding Conference (MUC), tasks commonly associated with IE are:

• Recognising and classifying words denoting names of persons, organisations, and locations, as well as numeric and temporal expressions (i.e., named entity task).
• Identifying references that link to the entities extracted (i.e., coreference task).
• Extracting identifying and descriptive attributes of named entities (i.e., template element task).
• Extracting relationships between named entities (i.e., template relation task).
• Extracting events in combination with either template element or template relation tasks (McNaught and Black, 2006, p.147).

Moreover, a commonly used method to aid the overall NER process is the use of gazetteers (i.e., lists defining NEs such as persons, organisations, etc.).
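A gazetteer lookup can be sketched as a simple search for listed names in a text. The following Python example is only an illustration: the two short lists reuse names mentioned elsewhere in this dissertation, the function name `find_entities` is invented, and matching is exact and case-sensitive, which is much cruder than the gazetteer components of a framework such as GATE.

```python
# Hypothetical gazetteer: entity type -> list of known names.
GAZETTEER = {
    "Organisation": [
        "National Institutes of Health",
        "National Science Foundation",
        "Wellcome Trust",
    ],
    "Person": ["Goran Nenadic"],
}


def find_entities(text):
    """Return (offset, name, type) for every gazetteer name found in text."""
    found = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                found.append((start, name, entity_type))
                start = text.find(name, start + 1)
    # Sort by character offset so results follow the order of the text.
    return sorted(found)


text = "This work was funded by the Wellcome Trust and the National Science Foundation."
for offset, name, entity_type in find_entities(text):
    print(offset, name, entity_type)
```

Exact matching misses spelling variants and abbreviations (e.g., "NIH"), which is one reason gazetteers are usually combined with pattern rules rather than used alone.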

Data mining refers to the process of identifying patterns in (often large) structured datasets (such as a database). Within the TM process, DM techniques are typically applied to facts extracted during the IE stage in order to identify patterns and discover new knowledge (JISC, 2006).

2.2.1. Rule-based and Statistical-based Approaches to IE

Methods which may be used for IE tasks include rule-based (e.g., Common Pattern Specification

Language; Java Annotated Pattern Engine) and statistical-based (e.g., Support Vector Machines;

Hidden Markov Models) approaches. Both types of methods have their strengths and weaknesses.

For instance, statistical-based methods tend to require more computing resources than
rule-based methods, which tend to be more lightweight (thus resulting in faster processing). On the other
hand, the rule-based (or knowledge engineering) approach is domain or even task dependent, while
the statistical (or automatic training) approach is relatively domain independent (Appelt & Israel, 1999).
Hence, domain portability is quite straightforward with statistical-based approaches (Appelt &
Israel, 1999). While both methods can be equally labour and time intensive, they differ
in their inherent way of designing an IE application. The rule-based approach often requires domain
knowledge and a skilled knowledge engineer to implement effective rules for the IE task. The
statistical-based approach, on the other hand, requires annotator(s) with some knowledge of the domain
and task in order to annotate a training corpus that models the information to be extracted
(Appelt & Israel, 1999).

2.2.2. IE Application Development Tools/Software

Many tools/software are available to aid scientists and developers in creating IE applications, e.g.,
CAFETIERE (see Black et al., 2005), LingPipe,5 MinorThird,6 and GATE (General Architecture
for Text Engineering).7 A common denominator across the latter three tools is that they provide
Java APIs for use within custom-built standalone applications.

5 http://alias-i.com/lingpipe/
6 http://sourceforge.net/apps/trac/minorthird/wiki

(1) CAFETIERE (or Conceptual Annotation for Facts, Events, Terms, Individual Entities, and

RElations) is a rule-based information extraction system for various IE tasks as specified within its

title. CAFETIERE provides various NLP components such as tokenisers, POS taggers, NERs, etc., for

text pre-processing and a customised rule-based language that may be used for semantic level

processing of text (Black et al., 2005). Further, CAFETIERE provides a graphical user interface

(GUI) (i.e., Analyser and Annotation Editor) which supports viewing and editing annotation (which

is useful for iterative development of IE rules).

(2) LingPipe may be described as a toolkit for processing text using computational linguistics and

primarily contains Java APIs for NER, POS, classification, and so on.

(3) MinorThird is another toolkit containing a collection of Java APIs for various NLP and IE

tasks. In contrast to LingPipe, MinorThird also provides a GUI for invoking APIs and debugging or

manipulating annotations.

(4) GATE may be considered more mature than the latter two tools, owing to its extensive
documentation and user-friendly GUI. GATE is in essence an integrated development environment
providing reusable processing resources that enable the development and deployment of customised
applications to solve NLP problems/tasks (Cunningham et al., 2010). Processing resources are
individual NLP processing components such as tokenisers, POS taggers, NERs, etc., which may be
applied to individual documents or a corpus in a customised order to create an IE application.8
These resources are collectively known as a Collection of REusable Objects for Language
Engineering (CREOLE). GATE may be used to create annotations over documents (for instance, to
be used with statistical-based approaches) or to create IE applications which may be used independently
of the GATE interface via APIs (GATE Embedded)9 (Cunningham et al., 2010).

2.3. NLM Journal Archiving and Publishing DTDs

Both PubMed and PubMed Central (PMC) documents are provided in XML formats (defined by the
NLM Journal Archiving and Publishing DTDs) as an alternative to the common Portable Document
Format (PDF). As previously mentioned, PubMed contains citations and abstracts, and PMC is the
corresponding full-text digital archive. The dataset from PMC, which contains approximately
190,000 documents, will be used in this project.

7 http://www.Gate.ac.uk
8 Java APIs from LingPipe, Google, Yahoo (and many more) for NLP/IE are provided as processing resources.
9 GATE API to integrate the IE application into a Java application.

While the NLM Journal Archiving and Interchange Tag Suite was created in order to provide a common
format for publishers and archives to exchange journal content (NLM, 2010b), its usefulness for TM
applications has been widely appreciated. This Tag Suite defines elements and attributes to describe
full article contents such as meta-data, acknowledgements, abstract, article body, citations, URLs,
and so on. This has proven beneficial to researchers who may only be interested in particular
section(s) of articles, e.g., abstracts or acknowledgements. For instance, instead of using regular
expressions over a whole document to identify particular sections of interest, a researcher could use
an XML parser10 to parse documents and extract the relevant section. This has at least a couple of
advantages over the use of regular expressions. Provided that a tag set exists for the particular
document content of interest, the use of XML tags to extract this content could often be
more accurate than using regular expressions (hence improving results). In addition, when
designing a TM application, which often processes a huge number of documents, the
opportunity to parse documents only for specific content rather than process whole documents
could significantly improve performance (i.e., response time and use of computing resources).

Currently there exist seven Tag Suite versions, or Document Type Definitions (DTDs),11
for PMC articles. However, these versions are consistent with regard to the tags used for
content of interest to this project, namely acknowledgements and URLs.

Table 1 describes the XML tags which will be used in the implementation of ExtConX2 (NLM,
2010c):

Table 1 – Relevant XML Tags

| Tag / Attribute | Description |
|---|---|
| (1a) <ext-link> </ext-link> | Tag defining an external resource outside of the scope of an article. |
| (1b) ext-link-type="uri" | Tag (1a) must contain the attribute ext-link-type, which has the value uri. This indicates that the tag contains a URL. |
| (1c) xlink:href | Finally, within the tag element a third attribute (1c) must identify the external link. |
| (2) <ack> </ack> | Tag defining acknowledgement content/section. |

10 XML parser generally refers to an API that enables one to programmatically read XML files and extract content of interest. Common APIs used for XML parsing in Java include Document Object Model and Simple API for XML.
11 Tag Suite versions include: 1.0, 1.1, 2.0, 2.1, 2.2, 2.3, and 3.0 (current).

Below is a simplified XML skeleton in the NLM Archiving and Interchange format. Samples of the tags
described in Table 1 may be found at lines 28 and 34 in the following example:

 1 <article>
 2   <front>
 3     <journal-meta>
 4       <journal-id>Journal Acronym</journal-id>
 5       ...
 6     </journal-meta>
 7     <article-meta>
 8       ...
 9       <contrib id="A1" contrib-type="author">
10         <name>
11           <surname>First</surname>
12           <given-names>Last</given-names>
13         </name>
14       </contrib>
15
16       <abstract> ... </abstract>
17     </article-meta>
18   </front>
19   <body>
20     <sec> <title>Introduction</title>
21       <p> … </p>
22     </sec>
23     <sec sec-type="method"> <title> Methods </title>
24       <p> … </p>
25     </sec>
26   </body>
27   <back>
28     <ack> We like to thank Armand Seguin for his support of the project and for many simulating discussions. </ack>
29
30     <ref-list>
31       <ref id="A1">
32         <citation citation-type="other">
33           <article-title>An Online Resource</article-title>
34           <ext-link ext-link-type="uri" xlink:href="http://www.web.com"/>
35         </citation>
36       </ref>
37     </ref-list>
38   </back>
39 </article>
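To illustrate how the tags in Table 1 can be read with an XML parser rather than regular expressions, the following is a minimal Python sketch (Python is used here purely for illustration; the tools discussed in this chapter expose Java APIs). The `SAMPLE` document is an abridged, hypothetical variant of the skeleton above, and it assumes that the `xlink` prefix is declared on the root element, which the simplified skeleton omits. The function names are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace URI conventionally bound to the xlink prefix (assumed to be
# declared on the root element of the document being parsed).
XLINK = "http://www.w3.org/1999/xlink"

# Abridged, hypothetical sample based on the skeleton in this section.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <back>
    <ack> We like to thank Armand Seguin for his support of the
      project. </ack>
    <ref-list>
      <ref id="R1">
        <citation citation-type="other">
          <article-title>An Online Resource</article-title>
          <ext-link ext-link-type="uri" xlink:href="http://www.web.com"/>
        </citation>
      </ref>
    </ref-list>
  </back>
</article>"""


def extract_urls(root):
    """Return the xlink:href value of every <ext-link> with ext-link-type="uri"."""
    urls = []
    for link in root.iter("ext-link"):
        if link.get("ext-link-type") == "uri":
            href = link.get("{%s}href" % XLINK)
            if href:
                urls.append(href)
    return urls


def extract_acknowledgements(root):
    """Return the whitespace-normalised text of every <ack> element."""
    return [" ".join("".join(ack.itertext()).split()) for ack in root.iter("ack")]


root = ET.fromstring(SAMPLE)
urls = extract_urls(root)
acks = extract_acknowledgements(root)
```

Only the `<ext-link>` and `<ack>` elements are visited, so the rest of the article body never has to be scanned with pattern matching.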

2.4. Related Work

Giles and Councill (2004) developed a system for acknowledgment extraction from Information
Science literature.12 Based upon their analysis of the extracted data, a classification scheme of six
categories of acknowledgements was identified: (1) moral support, (2) financial support, (3)
editorial support, (4) presentational support (i.e., presenting the paper at a conference), (5)
instrumental/technical support, and (6) conceptual support, or peer interactive communication
(PIC) as coined by Giles and Councill. They justified their classification scheme on the basis of the
significance of acknowledgements. For instance, conceptual and technical support is arguably more
noteworthy as an academic contribution than moral support (Giles & Councill, 2004). Nevertheless,
their argument was never reflected in their results.13

12 The IR system utilised for document retrieval: CiteSeer digital library - http://www.citeseer.ist.psu.edu

Giles and Councill's method is inherently an NER system, as actual roles were only determined by
post-extraction analysis. For instance, they provide a table which partly includes acknowledged
companies and funding agencies. However, it cannot be concluded with certainty whether these
acknowledged entities provided funding, material, or even intellectual support. Giles and Councill's
conclusion is based on prior knowledge of the names of funding organisations and an analysis of a subset
of the most acknowledged entities. Thus, acknowledgements of funding agencies and companies can
only be assumed to represent financial support (see Giles & Councill 2004, p.17601). ExtConX2
will be more sophisticated in that respect, as NEs and their respective roles will be identified and
extracted from acknowledgements. Hence, this task will be more challenging than Giles
and Councill's and, as the nature of the evaluation will differ, good evaluation scores
will be harder to obtain.

The methodology adopted by Giles and Councill (2004) is a combination of rule-based and
statistical-based approaches. Initially, regular expressions were used to identify sections which most
likely contained acknowledgements, specifically, sections with headings labelled acknowledgment. In
addition, the authors also identified acknowledgement passages within unmarked sections of
articles, typically within the document header (i.e., before the abstract/introduction or on the first
page) or footnotes (i.e., before the references or first appendix). Hence, all text on the first page of the
document and on the last page before the reference section or the appendix was processed using an SVM
to identify sentences containing acknowledgements. Subsequently, a rule-based parser was applied
to extract acknowledged named entities. Through extensive testing involving 1,800 manually
labelled documents, the method achieved 78.45% precision and 89.55% recall.

Table 2 is an excerpt from Giles and Councill's (2004, p.17602) results for the most acknowledged
funding agencies.

Table 2 – Most Acknowledged Funding Organisations

| Funding Agencies | No. of acknowledgements |
|---|---|
| National Science Foundation | 12,287 |
| Defence Advanced Research Projects Agency | 4,712 |
| Office of Naval Research | 3,080 |
| Deutsche Forschungsgemeinschaft | 2,780 |
| National Aeronautics and Space Administration | 2,408 |
| Engineering and Physical Sciences Research Council | 2,007 |
| Air Force Office of Scientific Research | 1,657 |
| National Sciences and Engineering Research Council of Canada | 1,422 |
| Department of Energy | 1,054 |
| Australian Research Council | 1,010 |
| European Union Information Technologies Program | 825 |
| National Institutes of Health | 709 |
| Army Research Office | 666 |
| Netherlands Organization for Scientific Research | 646 |
| Science and Engineering Research Council | 489 |

13 Apart from financial support, no other category was presented in their results.

Another study related to one of the applications of ExtConX2 is Wren's (2004, 2008) study of
URL decay within MEDLINE/PubMed citations. Wren justified his motivation by the growth
in electronic references and the assumption that online resources are less reliable than
traditional printed journals. This assumption was confirmed by the results of his study. The
methodology used by Wren within the knowledge discovery process was straightforward. Wren
used Visual Basic as the programming language and regular expressions to identify and
extract URLs from XML documents (containing the citations). Additional heuristic rules and
manual editing were applied to handle/correct human errors such as mistyped URLs. However,
neither the heuristic rules nor the regular expressions were provided. Nevertheless, the commonly
encountered errors discussed were inappropriate spaces within URLs, the use of backslashes instead of
forward slashes, non-alphanumeric characters, and the inclusion of erroneous characters (see Wren,
2004, p.669).

Wren's (2004) initial study involved 1630 URLs extracted from nearly 13 million PubMed
citations. These URLs were programmatically checked for availability over a four-week period
using the Microsoft Component Objects Internet Transfer Control (API). A URL was considered
inaccessible if it did not respond within 60 seconds or if the response code received indicated that
the resource was inaccessible (e.g., 404 not found, file not found, etc.). In addition, if 25 consecutive
tries failed, a URL was considered inaccessible. URLs that were accessible 90% of the time
checked were considered active. This method is appropriate as web servers do not tend to have
100% up-time (i.e., be available 100% of the time). Hence, repeated checking makes the
availability statistics more robust to temporary outages.
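The decision rule described above can be sketched in a few lines of Python. This is an illustrative reading of the prose, not Wren's implementation (which was written in Visual Basic and not published): the function name and parameters are hypothetical, the 90% threshold is assumed to be inclusive, and URLs below the threshold are assumed to count as inactive, which the description leaves open.

```python
def classify_url(check_results, min_active_ratio=0.9, max_consecutive_failures=25):
    """Classify a URL as 'active' or 'inactive' from repeated availability checks.

    check_results is a list of booleans, one per check, where True means the
    URL responded successfully within the timeout.
    """
    if not check_results:
        raise ValueError("at least one check result is required")

    # Find the longest run of consecutive failed checks.
    longest_run = current_run = 0
    for ok in check_results:
        current_run = 0 if ok else current_run + 1
        longest_run = max(longest_run, current_run)
    if longest_run >= max_consecutive_failures:
        return "inactive"

    # Assumption: the threshold is inclusive and everything below it is inactive.
    ratio = sum(check_results) / len(check_results)
    return "active" if ratio >= min_active_ratio else "inactive"


mostly_up = classify_url([True] * 9 + [False])
long_outage = classify_url([False] * 25 + [True] * 975)
```

Separating the collection of check results from their classification keeps the rule testable without any network access.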

Wren's (2008) follow-up study used practically the same method as described above. URLs were
extracted/surveyed in the years following the initial study (except for 2006): 2004 (total of 2294
URLs surveyed), 2005 (3327 URLs), and 2007 (6154 URLs). Both studies (Wren 2004; Wren
2008) showed time-dependent decay of URLs. More specifically, URL decay could be described as
a function of publication year: the older the publication, the fewer accessible resources it contained.
Below is a graph representing the results of URL decay from Wren's studies (2004, 2008):

Figure 1 - URL Decay (Wren, 2008)

While Wren's approach is solely focused on abstracts, ExtConX2 will be applied to full-text
articles, thus covering a larger scope. This will also mean that a more holistic conclusion can be
drawn regarding URL decay. In addition, as previously stated, URLs will be classified into four
categories, enabling a broader analysis of the nature of the resources referenced.
Nevertheless, Wren's results provide an excellent benchmark for post-research evaluation and
comparison. For instance, I would hypothesise that URL decay will be more severe within full-text
articles as opposed to citations.

2.5. Summary of Chapter

The aim of this project is to develop a system (ExtConX2) to enable discovery of specific trends

within the biomedical domain. Specifically: (1) the exploration of acknowledgements of

individuals and organisations, and (2) analysis of URL decay and most often referenced resources.

The dataset which will be utilised within this project is full-text XML articles from PubMed

Central.

TM techniques will be used to achieve the main aims defined. In particular, NLP techniques such
as lexical, syntactic, and semantic level processing will be utilised to enable role extraction. In
addition, XML tags provided by the NLM Archiving and Interchange DTDs will also be used for the
extraction of URLs (not exclusively) and to aid the initial extraction of acknowledgement text from
PMC articles.

While prior research has had similar applications to ExtConX2, this project looks at extending the
scope by analysing larger datasets and adopting more sophisticated approaches. For instance, Wren's
(2004, 2008) study of URL decay was solely confined to PubMed citations. In contrast, ExtConX2
will enable the analysis of URL decay within full-text articles. This will enable us to draw a more
holistic conclusion with regard to the implications of URL decay and the types of resources most often
referenced within the biomedical domain. Moreover, acknowledgement extraction has yet to be
applied within the biomedical domain. ExtConX2 is the first system to do so. Giles and Councill's
(2004) research on acknowledgement extraction is concerned with publications within the CiteSeer
digital library. Their approach can at best be described as an NER system, as semantic level
processing is never applied. For instance, their results for the most acknowledged funding agencies and
companies are based on an assumption and on the analysis of a subset of articles. In contrast, ExtConX2
will enable us to determine whether extracted NEs have in fact provided funding or not by extracting the NEs'
corresponding roles as acknowledged in the text.

3. Software Requirements

The initial part of this chapter (Section 3.1) provides a high-level description of ExtConX2's main
requirements: (1) URL extraction and (2) role extraction. Subsequently, detailed descriptions of
functional user and system requirements, and of non-functional system requirements, are provided
(Sections 3.2 and 3.3). These requirements have been derived from the project's objectives and the
software requirement engineering (SRE) process during the initial stages of this dissertation. These
requirements constitute the foundation of ExtConX2.

3.1. Description of Main Tasks

This section provides a brief high-level description of the main functional requirements of ExtConX2:
(1) URL extraction and related processes and (2) acknowledgement extraction. Some details have
been deliberately omitted for the sake of simplicity (e.g., the use of XML
documents).

3.1.1. URL Extraction

As previously described, ExtConX2 must enable the extraction of URLs from biomedical
publications. For each URL extracted, the system must determine the type of resource referenced
(refer to Section 1.2.1) and whether the given URL is accessible or not (URL Status: see Table 3). For
instance, given these hypothetical examples:

1. R-Project (http://www.r-project.org) was used for statistical processing of data.

2. The data was collected using GenBank (http://www.ncbi.nlm.nih.gov).

The ideal results of the subsequent processing of these sentences (inserted into a database) ought to be
(Table 3):

Table 3 – Ideal Results from URL Extraction Process

| URL | Type of Resource | URL Status | Date Checked |
|---|---|---|---|
| (1) http://www.r-project.org | Software | Active/Inactive | 2010-09-01 |
| (2) http://www.ncbi.nlm.nih.gov | Databank | Active/Inactive | 2010-09-01 |

3.1.2. Acknowledgement Extraction

Acknowledgement extraction involves the extraction of NEs and their respective REs from
acknowledgement sections. The ideal results of processing the acknowledgements given
below (inserted into a database) should be (see Table 4):

(1) Financial support was obtained from the Swedish Research Council.

(2) The authors thank Ms. Maureen Stoddard Marlow for editing.

Table 4 - Ideal Results of TM Process

| No. | Name Entity | Role (enumeration) | Role Expression |
|---|---|---|---|
| (1) | Swedish Research Council | Funder | Financial support |
| (2) | Ms. Maureen Stoddard Marlow | Collaborator | Editing |
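To make the target output in Table 4 concrete, the following Python sketch applies two hand-written patterns to the example sentences. It is a deliberately simplified stand-in for the rule-based processing specified later (which relies on GATE): the patterns, the `NAME` expression, and the function name are all hypothetical and cover only these two phrasings.

```python
import re

# Hypothetical pattern for a capitalised name, allowing titles such as "Ms.".
NAME = r"(?:[A-Z][A-Za-z]*\.?)(?: [A-Z][A-Za-z]*\.?)*"

# Each rule maps a surface pattern to a role from the enumeration in Table 4.
RULES = [
    ("Funder", re.compile(
        r"(?P<expr>[Ff]inancial support) was obtained from (?:the )?(?P<ne>" + NAME + ")")),
    ("Collaborator", re.compile(
        r"thank (?P<ne>" + NAME + r") for (?P<expr>[a-z]+)")),
]


def extract_roles(sentence):
    """Return a list of dicts linking each matched name entity to its role."""
    results = []
    for role, pattern in RULES:
        for match in pattern.finditer(sentence):
            results.append({
                # Drop a sentence-final full stop captured with the name.
                "name_entity": match.group("ne").rstrip("."),
                "role": role,
                "role_expression": match.group("expr"),
            })
    return results


funding = extract_roles("Financial support was obtained from the Swedish Research Council.")
editing = extract_roles("The authors thank Ms. Maureen Stoddard Marlow for editing.")
```

Patterns of this kind break down quickly on real acknowledgement text, which is why the requirements below call for separate NE recognition, RE recognition, and a linking step.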

3.2. Functional User and System Requirements

3.2.1. Functional User Requirements and Use Case Diagram

[R1]. The user shall be able to initiate extraction of URLs from PMC XML documents (stored in
the Shared Database) and insert this data and respective attributes into the System
Database.14
a. Attributes for each URL include:
(1) URL status: whether the link is active or inactive,
(2) type of resource (i.e., Databank, Document, Organisation, or Software),
(3) decision data: data used to determine the type of resource, and
(4) date checked.

[R2]. The user shall be able to initiate role extraction (i.e., extraction of NEs and their respective
REs) from full-text XML documents and insert this data and an additional attribute into the
System Database.
a. The attribute for each set of roles is: (1) the acknowledgement text from which the role(s)
have been extracted.

[R3]. The user shall be able to view general statistics:
a. (1) Number of documents processed, (2) number of URLs extracted, including
descriptive statistics of URL status (i.e., by year; in total), and (3) number of roles
extracted.

[R4]. The user shall be able to set parameters, e.g., the number of documents to be processed for IE
processes (i.e., R1 and R2).

14 The System Database (Db) refers to the Db specifically designed for ExtConX2, used to store processed data. The
Shared Db is provided by the gnTeam (http://gnode1.mib.man.ac.uk/) and contains the PMC dataset.

A use-case diagram derived from the functional user requirements is provided below (Figure 2):

Figure 2 - Use Case Diagram

Description of Use Case Diagram:

Table 5 – Description of Actor (AC)

| ID | Actor | Description |
|---|---|---|
| AC01 | User | System user. |

Table 6 – Description of Use Cases

| ID | Use Case | Description |
|---|---|---|
| UC01 | URL Extraction | AC01 may initiate URL Extraction and related processes to determine URL status, determine type of resource, compose decision data, and insert this data (including the date inserted) into the System Database. |
| UC02 | Role Extraction | AC01 may initiate Role Extraction and insert this data (including the acknowledgement text) into the System Database. |
| UC03 | View Statistics | AC01 will be able to view statistics of IE processes: (1) number of documents processed, (2) number of URLs extracted, (2a) descriptive statistics of URL status (i.e., by year; in total), and (3) number of roles extracted. |
| UC04 | Set Parameters | AC01 can set system parameters, e.g., number of documents to be processed for IE processes (i.e., UC01 and UC02). |

3.2.2. Functional System Requirements

This section describes functional system requirements and related processes by implementation
objectives (Tables 8-15).15 The project objectives have been refined into implementation objectives
to reflect the architectural design of the system, e.g., database operations have been separated into a
separate objective (implementation objective 6). See Table 7 for the mapping between the project
objectives and implementation objectives.

Table 7 – Mapping between Project Objectives and Implementation Objectives

| Project Objective | Implementation Objectives | Tables |
|---|---|---|
| 1 | 1 | Table 8 |
| 2 | 2-4; (6) | Tables 9-11 and 13 |
| 3 | 5; (6) | Tables 12 and 13 |
| 4 | 7 | Table 14 |
| 5 | 8 | Table 15 |

(1) Conceptualisation of Terms:

Conceptualisation of terms used in the following tables (Tables 8-15):

Risk – refers to degree of risk of completing a module/task and is based on several

factors such as time constraint, difficulty, dependency on other modules/tasks, and

external dependency. The level of risk is based on a subjective estimate of these

factors.

External Dependency – refers to dependency on external factors, e.g., IR

system(s), database(s), software, and so on.

Shared Database (Db) – refers to the database containing PMC articles in XML

format (i.e., gnode1).

System Db – refers to the database designed and implemented to store

extracted/processed data.

15 Evaluation (Table 15) is also included for the sake of completeness of requirements even though it is not a functional requirement.

Table 8 – Implementation Objective 1

1. Design and implement a relational database schema to store extracted data (i.e, System

Db).

Functional Requirement: N/A

Risk: Low.

External Dependency: None.

Priority: High.

Pre-condition: Installed relational database management system (RDBMS), such

as MySQL.

Post-condition: Skeleton or empty Db schema: System Db.

Difficulty: Easy

Processes:
1. Design Enhanced Entity Relationship (EER) diagram.
2. Translate EER to relational schema.
3. Implement relational schema.


Table 9 – Implementation Objective 2

2. Design and implement a module to extract URLs from PMC XML documents.
Functional Requirement: [R5]. The module shall be able to identify and extract URLs from PMC XML documents.
Risk: Low.
External Dependency: Availability of Shared Db.
Priority: Intermediate.
Pre-condition: Objective 1, and Objective 6 (A).
Post-condition: A set of extracted URLs.
Difficulty: Intermediate
Process overview:
1. Objective 6, process A (Table 13).
2. Parse document and extract URL(s).


Table 10 – Implementation Objective 3

3. Design and implement a module to determine type of resource (or URL) extracted/referenced.
Functional Requirement: [R6]. The module shall be able to identify the type of online resource referenced: Databank, Document, Organisation, or Software.
Risk: Low.
External Dependency: Availability of Shared Db.
Priority: High.
Pre-condition: Objective 2 (this module is in essence a sub-module of Obj. 2).
Post-condition: Return type of resource or URL referenced (i.e., Databank, Document, Organisation, or Software).
Difficulty: Intermediate
Process overview:
1. Get URL context.
2. Determine resource type by:
a. keyword(s) within the URL string,
b. keyword(s) within URL reference context (i.e., title of reference and/or description of reference), or
c. keyword(s) within the article body where the URL is cited.
3. Return resource type.
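The keyword-based decision in the process overview above could look roughly as follows in Python. This is a sketch under stated assumptions: the `KEYWORDS` lists are invented for illustration and are not the keywords used by ExtConX2, and the three contexts are simply tried in the order (a) to (c), returning `None` when nothing matches.

```python
# Hypothetical keyword lists per resource type (illustrative only).
KEYWORDS = {
    "Databank": ("database", "databank", "genbank"),
    "Software": ("software", "tool", "package", "download"),
    "Document": (".pdf", ".doc", "manual", "guideline"),
    "Organisation": ("institute", "university", "consortium"),
}


def match_type(text):
    """Return the first resource type whose keyword occurs in text, else None."""
    lowered = text.lower()
    for resource_type, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return resource_type
    return None


def classify_resource(url, reference_context="", body_context=""):
    """Try (a) the URL string, (b) the reference context, (c) the article body."""
    for source in (url, reference_context, body_context):
        found = match_type(source)
        if found is not None:
            return found
    return None


by_url = classify_resource("http://example.org/genbank/")
by_reference = classify_resource(
    "http://example.org/x", reference_context="A software package for alignment")
```

Plain substring matching is naive (a keyword may occur inside a longer, unrelated word), so a real implementation would likely match on tokens instead.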


Table 11 – Implementation Objective 4

4. Design and implement a module to determine URL status: active or inactive link.
Functional Requirement: [R7]. The module shall be able to determine whether a URL is active or inactive (accessible or not).
Risk: Low.
External Dependency: No direct dependency, see pre-condition.
Priority: High.
Pre-condition: Objective 2 (this module is in essence a sub-module of Obj. 2).
Post-condition: Return URL status: 0/FALSE if inaccessible or 1/TRUE if accessible.
Difficulty: Easy
Process overview:
1. Get URL to be checked (see Obj. 2).
2. Check if URL is active/inactive: if inactive return 0/FALSE, else (if active) return 1/TRUE.


Table 12 – Implementation Objective 5

5. Design and implement a module to identify and extract sponsors and contributors (NEs such as persons/organisations and their respective roles) from acknowledgments.
Functional Requirements:
[R8]. The module shall be able to identify NEs, such as persons and organisations/institutions.
[R9]. The module shall be able to identify REs (i.e., sponsors/funders or collaborators/contributors).
[R10]. The module shall be able to link NEs to their respective REs.
[R11]. The module shall be able to extract NEs and their respective roles from annotated documents.
Risk: High. Main reasons for risk level:
1. Dependent upon the use of appropriate methodology, and efficient use of tools (i.e., GATE 5.2.1).
2. Difficulty: Hard.
3. Time constraint: approaching project deadline.
External Dependency: GATE 5.2.1 (see Section 2.2.2).
Priority: High.
Pre-condition: Objective 1, and Objective 6 (A).
Post-condition: Return NEs and corresponding REs identified.
Difficulty: Hard
Process overview:
1. Implementation objective 6, process A (see Table 13).
2. Parse document and extract acknowledgement passage.
3. Process acknowledgement passage through text processing application designed with GATE 5.2.1 (which returns a GATE XML document with tags representing annotated entities: NEs and corresponding REs).
4. Parse GATE XML document.
5. Extract annotated NEs and their respective roles.


Table 13 – Implementation Objective 6

6. Design and implement a module to handle database operations: (1) ensure synchronisation of retrieval of documents for processing and documents already processed, (2) insert extracted/processed data into the System Database.
Functional Requirements:
[R12]. The module shall be able to synchronise retrieval of documents for processing (from the Shared Db) and documents already processed (in the System Db).
[R13]. The module shall be able to insert a given tuple of data into the System Database.
Risk: Low.
External Dependency: -
Priority: High.
Pre-condition: Implementation objectives 2-4, or 5.
Post-condition: Relevant data is inserted into the System Db.
Difficulty: Easy
Process overview:
This module is separated into two different tasks: (A) synchronisation of processed documents (in the System Db) and of retrieval of documents (from the Shared Db) for processing, and (B) data insertion into the System Db.
A. Check last document processed for role extraction / URL extraction:
a. if none, get the first document from the Shared Db (documents may be retrieved in ascending order, enabled by auto-incremented keys of records in the Shared Db),16
b. else, get the auto-incremented id of the last document processed in the System Db and start the retrieval process from the Shared Db at last document processed + 1.
B. Either get URL data (implementation objectives 2-4) or role data (implementation objective 5) and insert this data into the System Database.
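Task (A) above amounts to resuming retrieval from the last processed key. The Python sketch below illustrates the idea with two in-memory SQLite databases standing in for the Shared Db and the System Db; the table and column names (`documents`, `processed`, `shared_doc_id`) are hypothetical and are not the project's actual schema, and SQLite is used only so that the example is self-contained.

```python
import sqlite3


def next_documents(shared, system, limit):
    """Return up to `limit` (id, xml) rows that follow the last processed document."""
    last_id = system.execute("SELECT MAX(shared_doc_id) FROM processed").fetchone()[0]
    if last_id is None:
        last_id = 0  # nothing processed yet: start from the first document
    return shared.execute(
        "SELECT id, xml FROM documents WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    ).fetchall()


# Stand-in for the Shared Db holding PMC XML documents.
shared = sqlite3.connect(":memory:")
shared.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY AUTOINCREMENT, xml TEXT)")
shared.executemany(
    "INSERT INTO documents (xml) VALUES (?)",
    [("<article>a</article>",), ("<article>b</article>",), ("<article>c</article>",)],
)

# Stand-in for the System Db, recording which shared documents were processed.
system = sqlite3.connect(":memory:")
system.execute("CREATE TABLE processed (shared_doc_id INTEGER PRIMARY KEY)")

first_batch = next_documents(shared, system, limit=2)
system.executemany(
    "INSERT INTO processed (shared_doc_id) VALUES (?)",
    [(doc_id,) for doc_id, _ in first_batch],
)
second_batch = next_documents(shared, system, limit=2)
```

Because the position is derived from the System Db on every call, a new session picks up where the previous one stopped without any extra bookkeeping.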


Table 14 – Implementation Objective 7

7. Design and implement a GUI that will facilitate exploration of system functionalities and provide general statistics.
Functional Requirements:
[R14]. The module shall be able to display general statistics upon user request, such as: (1) number of documents processed, (2) number of URLs extracted, (2a) descriptive statistics of URL status (i.e., by year; in total), and (3) number of roles extracted.
[R15]. The module shall be able to accept user parameters for the number of documents to be processed.
Risk: Intermediate. Main reasons for risk level:
1. Time constraint: approaching project deadline.
2. Dependent on successful completion of previous modules.
Priority: Intermediate.
Pre-condition: Implementation objectives 1-6.
Post-condition: Interactive GUI.
Difficulty: Intermediate.
Process overview: See Use Case Diagram (Figure 2).

16 The implementation will take advantage of the available auto-incremented key within the Shared Db (and the corresponding foreign key in the System Db) to keep track of documents processed or documents to be processed when a new session is initiated.

Table 15 – (Implementation) Objective 8

8. Evaluation of the proposed methodology.
Functional Requirement: N/A
Risk: Intermediate. Main reasons for risk level:
1. Time constraint: approaching project deadline.
2. Dependent upon successful completion of system modules.
Priority: High.
Pre-condition: Completion of 1-4
Post-condition: -
Difficulty: Easy
Process overview:
1. Choose a random sample of results derived from previous steps and apply evaluation metrics (see Chapter 6).
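As an indication of what applying evaluation metrics to a sample might involve, the following Python sketch computes precision, recall, and F-measure over sets of extracted and manually verified items. It assumes these are the metrics of interest (precision and recall are the measures reported for Giles and Councill's system in Section 2.4); the metrics actually used are defined in Chapter 6, and the function name and sample data here are hypothetical.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall and F1 for extracted items against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sample: items extracted by the system vs. items found manually.
system_output = {"url-a", "url-b", "url-c", "url-d"}
manual_annotation = {"url-a", "url-b", "url-e"}
precision, recall, f1 = precision_recall_f1(system_output, manual_annotation)
```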

3.2.3. Requirement Traceability Matrix

Requirement Traceability Matrix (Table 16) by User and System Functional Requirements versus
project objectives:

Table 16 – Requirement Traceability Matrix

Obj. 2 Obj. 3 Obj. 4

[R01] X

[R02] X

[R03] X

[R04] X

[R05] X

[R06] X

[R07] X

[R08] X

[R09] X

[R10] X

[R11] X

[R12] X X

[R13] X X

[R14] X

[R15] X


3.3. Non-Functional Requirements

In addition to functional requirements, a set of non-functional requirements has been derived from
the SRE process (i.e., the requirement elicitation and analysis stage). While non-functional
requirements typically include product, external, and organisational requirements (Sommerville,
2004), this dissertation solely focuses on product requirements, specifically, system properties to
guide the architectural design and implementation of ExtConX2.

1. Extensibility

Within software engineering, extensibility refers to the design/implementation of
a system in a way that takes into consideration potential future extension of system functionalities
(Wikipedia, 2009). Extensibility may also be described as a system architecture designed to
accommodate future changes with minimal effort. For instance, a system architecture based
upon modularity or compartmentalisation, in which the various software functions/components
are separated by concern (SoC),17 may address this requirement. The use of an object-oriented
programming (OOP) language may also help to achieve this end.

2. Maintainability

The notion of maintainability is similar to extensibility in some respects, as the approaches used to accommodate these requirements may intersect. Nevertheless, the aim of this requirement is to make the system easy to maintain, to ease the amendment of implemented features, and to help locate hidden software bugs. The use of an OOP language, SoC, and detailed documentation may be used to fulfil this requirement.

3. Reusability

The system ought to enable reusability of modules to the extent possible. This will facilitate both extensibility and maintainability, in addition to providing software components which may be used within future (unrelated) applications/research. The application of SoC at class level may be used to fulfil this requirement.

17 Separation of concerns (SoC) refers to a logical separation of system functionalities. For instance, an analogy may be drawn with the Model-View-Controller (MVC) paradigm often used in web applications.

4. System Design and Analysis

This chapter is divided into two general sections:

a) Generic overview of the system architecture/design which describes high-level approaches

to extraction of external context (i.e., URLs and acknowledgements).

b) System Design and Analysis.

4.1. Generic System Architecture

A high-level overview of ExtConX2 is provided in Figure 3 (see the footnote for a description of the arrows), followed by a brief description of each module:

Figure 3 – High-Level System Architecture18

1. The Database Module is responsible for (1) synchronisation between the Shared Database

(containing PMC XML documents) and the System Database, (2) retrieval of documents

(Db Traverser) for processing, and (3) insertion of extracted/processed data (Data Inserter)

into the System Database.

2. The URL Module is responsible for (1) parsing PMC documents and extracting URLs (URL Extractor), (2) determining whether a given URL is accessible (URL Status), and (3) determining the type of resource referenced (Resource Type).

3. The IE Module is responsible for role extraction (IE Application). This module encapsulates the text pre-processing and IE tasks required to identify and extract NEs and their respective REs.

18 Solid arrows represent data flow. Dashed arrows denote a sub-module relationship: the arrowhead points toward the super module.

4. The Parser Module encapsulates the XML parser. In addition, it handles the NLM Journal Archiving and Interchange DTDs, which are needed to parse PMC documents. The DTD Resolver redirects the XML System IDs to a local repository where the DTDs are stored.

The ExtConX2 architecture is guided by the design principle of SoC at the system level: the Database Module (including the Shared Db and System Db) encapsulates database operations (i.e., the Database Layer), and the URL Module and IE Module (including the Parser Module) encapsulate application logic (i.e., the Application Layer). This approach is termed subsystems architecture, where each subsystem represents a different level of abstraction (Bennett et al., 2006).19 This could be considered as an approach to fulfil the non-functional requirements previously defined (Section 3.3).

4.2. Description of External Context Extraction

This section provides high-level description of external context extraction based upon the generic

system design (Figure 3).

4.2.1. URL Module

The URL Module (refer to Figure 3) performs three main tasks: (1) extracting URLs from PMC documents, (2) determining the resource type for each URL extracted, and (3) determining whether a URL is active or inactive (i.e., whether the resource is accessible or not).

An approach to process a given sentence containing a citation to an online resource is illustrated

below (Figure 4).

19

The system is divided into SoC: the Database Layer deals solely with retrieving documents and inserting

data (this includes the RDBMS), while the Application Layer is solely responsible for application logic.

Figure 4 – URL Module Overview

Given the following sentence:

1. The report was provided by World Health Organisation (http://www.who.int).

The output (Processed Data) of the given process (Figure 4) ought to be as follows (Table 17):

Table 17 – Ideal Results from URL Extraction Process

     URL                  Type of Resource   URL Status   Date Inserted
(1)  http://www.who.int   Document           Active       2010-09-01

A more detailed description follows. The following subsection describes (a) extraction of URLs

and determination of URL status, and (b) determining resource type (from the extracted URL

context), respectively:

a) URL Extraction

As PMC documents are provided in the NLM Archiving and Interchange format (XML), the dedicated tag for identifying URLs may be used to extract them. For instance, consider the following hypothetical example of a URL within a PMC document (disregarding any context):

1 <ext-link ext-link-type="uri" xlink:href="http://www.who.int">

2 http://www.who.int

3 </ext-link>

The approach that may be adopted to extract the given URL follows:

Get URL:

1. Parse the given document using an XML Parser.

2. Traverse through the parsed XML document to find the XML tag identifying URLs (i.e.,

ext-link): see line 1 in the example above.

a. Ensure that ext-link contains the attribute ext-link-type and that this attribute equals uri (i.e., ext-link-type="uri").20 This is taken as an indication that the XML tag contains an external URL.21

3. Subsequently, either (a) extract the URL between the ext-link start and end tags (on line 2 of the given example), or (b) extract the value of the attribute xlink:href (which also contains the URL: on line 1).

The XML tag pattern discussed above is consistent across all NLM Archiving and Interchange DTDs used for PMC documents. Thus, this single approach ought to be a sufficient method to extract URLs from differently formatted PMC documents.
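The "Get URL" steps above can be sketched with the standard Java DOM API. This is a minimal illustration under stated assumptions rather than the ExtConX2 code: the names `ExtLinkSketch` and `extractUrls` are hypothetical, the input is an in-memory string, and no DTD resolution is attempted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ExtLinkSketch {
    // Returns the URLs of all ext-link elements whose ext-link-type is "uri".
    static List<String> extractUrls(String xml) {
        try {
            // Step 1: parse the document (not namespace-aware, so the
            // attribute is looked up by its qualified name "xlink:href").
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Step 2: find the elements that identify URLs.
            NodeList links = doc.getElementsByTagName("ext-link");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                // Step 2a: keep only ext-link-type="uri".
                if (!"uri".equals(link.getAttribute("ext-link-type"))) {
                    continue;
                }
                // Step 3: prefer the attribute value, fall back to the element text.
                String url = link.getAttribute("xlink:href");
                if (url.isEmpty()) {
                    url = link.getTextContent().trim();
                }
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
            return urls;
        } catch (Exception e) {
            throw new RuntimeException("Could not parse document", e);
        }
    }
}
```

Reading the attribute first means that elements written without a separate closing tag, which carry no text content, are still covered.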

Determine URL Status:

URL status may be determined programmatically from Hypertext Transfer Protocol (HTTP) response codes. For instance, common response codes returned by HTTP when trying to establish a connection (either through a browser or programmatically) include those listed in Table 18 (Berners-Lee et al., 1996):

Table 18 – HTTP Response Codes

HTTP Response Code              Description
HTTP/1.0 200 OK                 The request was successful: URL accessible
HTTP/1.0 401 Unauthorized       Unauthorised access: inaccessible
HTTP/1.0 404 Error/Not Found    The resource could not be found: inaccessible

Determine Resource Type:

For each URL extracted, the system must determine the type of resource referenced: for instance, whether the URL is a reference to a Databank, Document, Organisation, or Software (refer to Section 1.2.1 for conceptualisation of terms).

A potential approach to determine the resource type of a given URL is to combine rules with keyword lists, each of which corresponds to a specific resource type. Consider the following hypothetical example:

1 <ref id="CR9">
2 <citation>
3 The report was provided by World Health Organisation (
4 <ext-link ext-link-type="uri" xlink:href="http://www.who.int/report">
5 http://www.who.int/report
6 </ext-link>
7 ).
8 </citation>
9 </ref>

20 Another valid value for ext-link-type is ftp (File Transfer Protocol).

21 An external URL refers to a resource outside the scope of the article. There are also other URLs within PMC documents (which may be described as internal) that serve XML-specific purposes (e.g., namespace declarations); these are not considered valid.

A potential solution to determine referenced resource type is:

1. Analyse the extracted URL string for keywords that characterise specific URL classes (e.g., report could be used as a keyword indicating the Document resource type); if unable to determine the resource type, continue with step 2.

2. Get the URL context (lines 3-7 of the example above):

The report was provided by World Health Organisation (http://www.who.int/report).

3. Subsequently, analyse this context (word by word) for keywords, starting from the location of the URL within the string and moving back to the start of the sentence.

In this example, report could be used as a keyword to determine the resource type (Document). For each of the URL types, a list of characteristic keywords will be constructed and used.
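The three steps above can be sketched as follows. This is an illustration only: the class `ContextScanSketch`, its three-entry keyword map, and the choice to return the keyword nearest to the URL are assumptions made for the sketch, not details confirmed by the design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ContextScanSketch {
    // Hypothetical keyword list: keyword -> resource type.
    static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("report", "Document");
        KEYWORDS.put("database", "Databank");
        KEYWORDS.put("software", "Software");
    }

    // Returns the resource type whose keyword occurs in the word, or null.
    static String typeOf(String word) {
        String lower = word.toLowerCase();
        for (Map.Entry<String, String> e : KEYWORDS.entrySet()) {
            if (lower.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    static String classify(String sentence, String url) {
        // Step 1: look for keywords within the URL string itself.
        String fromUrl = typeOf(url);
        if (fromUrl != null) {
            return fromUrl;
        }
        // Step 2: take the context that precedes the URL in the sentence.
        int at = sentence.indexOf(url);
        String before = at >= 0 ? sentence.substring(0, at) : sentence;
        // Step 3: scan word by word, from the URL back to the sentence start.
        String[] words = before.trim().split("\\s+");
        for (int i = words.length - 1; i >= 0; i--) {
            String type = typeOf(words[i]);
            if (type != null) {
                return type;
            }
        }
        return null; // no keyword found
    }
}
```

For the example sentence, `classify` already returns Document at step 1, because the URL string `http://www.who.int/report` contains the keyword report.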

4.2.2. IE Module

The IE Module encapsulates the IE application which is responsible for role extraction.

Specifically, given an acknowledgement sentence, the IE Module must enable the identification and

extraction of NEs and their respective REs.

a) Acknowledgement Extraction

A rule-based approach in conjunction with gazetteers may be adopted for role extraction. Apart

from common TM stages previously discussed (see Section 2.1), some notable highlights are:

1. The use of gazetteers to define:

i. NEs: persons and organisations

ii. REs: collaborators and funders (Table 19)

Table 19 – Examples of REs for Collaborators and Funders

Collaborator Roles          Funder Roles
Editorial support           Financial support
Reviewing the manuscript    Grant-in-aid
Helpful comments            Grant
Helpful suggestions         Funding

2. A rule-based approach applied at semantic level processing (see Section 2): linking of NEs

and their respective REs (Role Matcher: Figure 5).

3. Subsequently, programmatically extract these sets of NEs and corresponding REs (IE) and

insert them into a predefined template/database.

The generic NLP/IE pipeline is given in Figure 5.

Figure 5 - Generic NLP/IE Pipeline

For instance, consider the following acknowledgements:

1. The authors are grateful to John Dough for reviewing the manuscript.

2. This research was funded by BBSRC.

The NLP/IE process is as follows:

a) Get NEs

i. Person NE: John Dough

ii. Organisation NE: BBSRC

b) Get REs

i. Collaborator RE: reviewing the manuscript

ii. Funder RE: funded

c) Identify respective RE for each NE :

Patterns which indicate association between NE and RE, identified from above examples

are:

1. NE for RE (collaborator)

2. RE by NE (funder)

Hence, the application of rules to identify given patterns will be sufficient at semantic level

processing, for the given example.

d) Insert this data into predefined template/database:

Table 20 - Results of TM Process

(1) Named Entity: John Dough

Role (enumeration): Collaborator

Role Expression: reviewing the manuscript

(2) Named Entity: BBSRC

Role (enumeration): Funder

Role Expression: funded
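The two linking patterns (NE for RE, RE by NE) can be illustrated in plain Java. Note that this is a stand-in for explanation only: the IE Application itself is designed with GATE, and the class `RoleMatcherSketch`, its tiny gazetteers, and the output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RoleMatcherSketch {
    // Hypothetical gazetteers, reduced to the entries used in the example.
    static final List<String> NES = Arrays.asList("John Dough", "BBSRC");
    static final List<String> COLLABORATOR_RES =
            Arrays.asList("reviewing the manuscript", "helpful comments");
    static final List<String> FUNDER_RES = Arrays.asList("funded", "financial support");

    // Builds "(entry1|entry2|...)" with every entry quoted literally.
    static String alternation(List<String> entries) {
        return entries.stream().map(Pattern::quote)
                .collect(Collectors.joining("|", "(", ")"));
    }

    // Pattern 1: NE for RE (collaborator). Pattern 2: RE by NE (funder).
    static final Pattern NE_FOR_RE =
            Pattern.compile(alternation(NES) + " for " + alternation(COLLABORATOR_RES));
    static final Pattern RE_BY_NE =
            Pattern.compile(alternation(FUNDER_RES) + " by " + alternation(NES));

    // Returns one "NE | Role | RE" row per match found in the sentence.
    static List<String> extract(String sentence) {
        List<String> rows = new ArrayList<>();
        Matcher m = NE_FOR_RE.matcher(sentence);
        while (m.find()) {
            rows.add(m.group(1) + " | Collaborator | " + m.group(2));
        }
        m = RE_BY_NE.matcher(sentence);
        while (m.find()) {
            rows.add(m.group(2) + " | Funder | " + m.group(1));
        }
        return rows;
    }
}
```

Applied to the two example acknowledgements, the sketch yields the same three fields per entry as Table 20.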

4.3. System Architecture

System architecture is the organisation of a system in terms of its software components, including subsystems and the relationships and interactions among them, and the principles that guide the design of that software system (Bennett et al., 2006, p.340). System architecture can directly influence the non-functional features of a system (Bennett et al., 2006). For instance, subsystems architecture is known for advantages such as maximising reusability and improving maintainability, among other things (Bennett et al., 2006). Therefore, the non-functional requirements previously defined (Section 3.3) have been a central factor in the architectural design and implementation of ExtConX2.

4.3.1. Subsystems Architecture

The design of ExtConX2 is based on subsystems architecture, i.e., SoC at system level or the subdivision of software components which share some common properties (Bennett et al., 2006). This means that a system is subdivided into different layers of abstraction, or layers of service, each responsible for a different aspect of the functionality of the system as a whole (Bennett et al., 2006, p.350). This approach has several known advantages, such as:

Maximise reusability

Aid developers to handle complexities

Improve maintainability

Aid portability

ExtConX2 has three layers of abstraction:

1. Presentation Layer

The presentation layer is the topmost layer and is responsible for human-computer interaction (HCI). This layer enables interaction between the user and the system functionalities through a graphical user interface (GUI). A user is able to control/initiate system functionalities (encapsulated by layer 2, the application layer) through input parameters, and to view the output resulting from the processing of the application layer. The presentation layer satisfies functional user requirements 1-4 and functional system requirements 14-15 (refer to Section 3.2).

2. Application Layer

The application layer is responsible for domain logic or domain specific functionalities of

ExtConX2: the core functional requirements of the system (i.e., functional system

requirements 5-11).

3. Database Layer

The database layer encapsulates the relational database management system (RDBMS) and system specific database operations such as synchronisation between the Shared DB and System DB (i.e., between processed documents and PMC documents available for processing), retrieval of documents to be processed, and insertion of data into the System DB. The database layer satisfies functional system requirements 12-13.

The architecture of ExtConX2 is based on layered subsystems (see Bennett et al. 2006, p.351): any

layer N can only use the services provided by the layer immediately below it (N - 1). For instance,

the presentation layer cannot directly use any services provided by the database layer (see Figure

6). This level of abstraction minimises dependencies among layers (and software components) and

facilitates extensibility and maintainability of the system (Bennett et al., 2006).

Figure 6 - ExtConX2 Layered Subsystems

4.4. System Design

This section provides a detailed description of the system design, covering the database, application, and presentation layers. All illustrations provided are based on class implementations. The complete system design is provided in Appendix A, Figure 14.

4.4.1. Database Layer

The database layer encapsulates system functionalities or services which are responsible for

database operations. This layer provides services for the application layer directly above it (N + 1).

Figure 7 illustrates the main components of the database layer.

Figure 7 - ExtConX2 Database Layer

a) Description of Database Layer

1. Db Manager - The Db Manager is responsible for maintaining synchronisation between the Shared Db (containing PMC XML documents) and the System Db. This is achieved by two methods: (1) one determines the last existing PMC document in the Shared Db, and (2) the other determines the last processed PMC document stored in the System Db.22

2. Db Traverser - The Db Traverser is responsible for retrieving data from the Shared Db. In

addition, Db Manager is utilised by Db Traverser to ensure synchronisation.

3. Data Inserter - The Data Inserter encapsulates methods to insert processed data into the

System Db.

b) Relational System’s Schema

The Relational Database Schema used by ExtConX2 is given below; the EER Diagram may be viewed in Appendix A, Figure 13. The Shared Db (in part)23 and System Db are both represented by Figure 8.

PMC Articles contains PMC articles in XML format, and is linked from the Shared Db.

The System Db contains four relations: Meta Data, URL, Role, and Acknowledgement.

22 Both methods rely on the auto-incremented key and foreign key in the Shared Db and System Db respectively.

23 Only the relevant relation (PMC-Articles) and attributes of the Shared Db are included in the Relational/EER diagram.

Figure 8 - Relational Database Schema24

4.4.2. Application Layer

The application layer encapsulates domain logic: functional system requirements 5-11. This layer is

further subdivided into three separate modules (see Figure 9):

URL Module, which contains classes for URL extraction and related processes.

IE Module, which contains classes for role extraction and related processes.

Parser Module, which encapsulates classes for parsing and for handling NLM Journal Archiving and Interchange DTDs.

This subdivision of the application layer into further refined SoC is another example (in addition to

the subdivision at system level) of architectural design which addresses non-functional

requirements of ExtConX2.

24

Different types of arrows are only for visibility.

Figure 9 - ExtConX2 Application Layer

a) URL Module

The URL Module is responsible for extracting URLs from PMC documents,25 checking whether each extracted URL is accessible, and determining the type of resource referenced. The URL Module contains the following classes:

1. URL - The URL class may be described as a super-class; its responsibility includes

extraction of URLs from PMC documents and invoking other operations (i.e., URL Status

and Resource Type). In addition, URL acts as a gateway between the database layer and

application layer (i.e., retrieving PMC documents and returning processed data).

2. URL Status - URL Status checks if a given URL is accessible or not.

3. URL Identifier - URL Identifier is responsible for syntactically validating URLs and for identifying URL protocols, if any (i.e., http:// and ftp://). The latter functionality is used by URL Status.

25

Not including URLs which are part of the article metadata, i.e., the corresponding prepublication paper and licence

(http://creativecommons.org).

4. Resource Type - Resource Type is responsible for collecting possible types of resource

referenced (i.e., Databank, Document, Organisation, or Software). Refer to Section 0 for

further description.

5. Soft Decision - Soft Decision may be described as a sub-class of Resource Type which

contains a method to determine the most likely URL resource type from a set of collected

possibilities (refer to Section 4.2.1 for description).

b) IE Module

The IE Module encapsulates the TM application which handles role extraction: specifically, the pre-processing of acknowledgement text (i.e., NLP) and the subsequent IE (i.e., extraction of collaborators and funders, and their respective REs).

1. IE - The IE class is the super-class within the IE Module; it extracts acknowledgement text from PMC documents and invokes the IE Application and Role Extractor in order to complete the acknowledgement extraction sequence.

2. IE Application - The IE Application encapsulates the TM application (designed with GATE). This class handles the pre-processing of acknowledgement text (including the annotation of NEs and their respective REs). Further description is provided in Section 4.4.2.

3. Role Extractor - The Role Extractor extracts NEs and their corresponding roles from pre-

processed acknowledgement text.

c) Parser Module

The Parser Module encapsulates the parser and a class to handle NLM Journal Archiving and

Interchange DTDs.

1. Parser - The Parser encapsulates the Document Object Model (DOM) parser used to parse

PMC documents.

2. DTD Resolver - DTD Resolver is responsible for redirecting XML System IDs26 to the local directory where NLM Journal Archiving and Interchange DTDs are stored. This class is needed due to the variety of DTDs required for parsing PMC documents.
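A resolver of this kind can be sketched with the standard `org.xml.sax.EntityResolver` interface. The sketch below is an assumption about how such a class might look, not the ExtConX2 implementation: the name `LocalDtdResolverSketch`, the directory layout, and the rule of matching DTDs by file name are all hypothetical.

```java
import java.io.File;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

final class LocalDtdResolverSketch implements EntityResolver {
    private final File dtdDir; // local directory holding the DTD files

    LocalDtdResolverSketch(File dtdDir) {
        this.dtdDir = dtdDir;
    }

    // Maps a System ID to a file of the same name in the local directory.
    File localFileFor(String systemId) {
        String name = systemId.substring(systemId.lastIndexOf('/') + 1);
        return new File(dtdDir, name);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null) {
            return null;
        }
        File local = localFileFor(systemId);
        // Returning null lets the parser fall back to its default resolution.
        return local.isFile() ? new InputSource(local.toURI().toString()) : null;
    }
}
```

Such a resolver would be registered on the parser before parsing, for example with `DocumentBuilder.setEntityResolver(new LocalDtdResolverSketch(new File("dtds")))`, where `dtds` is a placeholder directory name.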

26 System ID is the URI/URL pointing to the given XML document's DTD.

4.4.3. Presentation Layer

The presentation layer encapsulates methods for HCI (Figure 10). It includes the following classes:

Figure 10 - ExtConX2 Presentation Layer

1. Function Panel - This class constructs the function panel or buttons to initiate various

functionalities (e.g., initiating URL extraction and role extraction).

2. Entry Panel - This class constructs the entry panel: e.g., text fields for user input such as

parameters for number of documents to be processed etc.

3. Quitable Frame - This class is responsible for the popup dialog box that asks the user to confirm before exiting the program.

4. GUI - This class constructs the GUI by invoking other classes.

5. InvokeApp - This class acts as a gateway to the application layer, initiating application logic in response to user input (see Appendix A, Figure 14).

5. Implementation

This chapter describes the implementation of the main functional requirements of ExtConX2: URL

Module and IE Module (refer to Figure 9). However, these descriptions are not comprehensive, as

only a few of the more noteworthy aspects are included. Other materials not provided in this

dissertation are available on the project website (http://gnode1.mib.man.ac.uk/projects/ExtConX2/).

5.1. Tools & Implementation Environment

Tools used to implement the various components of ExtConX2 include:

1. Java Standard Edition 6 & Java Platform Enterprise Edition 6

Due to the wide availability of tools and APIs for TM, Java was used as the main programming language.

2. Eclipse IDE

Eclipse was used as the development environment.

3. Xerces Java Parser 2.6.1 – Document Object Model Parser

The DOM API is used by ExtConX2 to parse XML documents. While the Simple API for XML (SAX) uses fewer resources and outperforms the DOM parser in terms of speed (Frankling, 2010), DOM provides greater flexibility in terms of functionality for the tasks required. For instance, within some PMC documents, and all GATE XML documents (see Section 5.3.3.1), certain XML tags lack separate closing tags, e.g.:

<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov" />

In these cases, the SAX-based approach did not recognise these tags and was therefore unable to extract these URLs, whereas DOM provided the functionality required.

Descriptions of other tools used are provided by relevant implementation modules/components.

5.2. Implementation of URL Module

This section provides a detailed description of the implementation of the URL Module: specifically, the methods adopted for URL extraction, the method adopted to determine the type of resource referenced (including soft decision), and a brief description of the implemented process of extracting a URL and determining its resource type.

5.2.1. Extraction of URLs

Extraction of URLs from PMC documents is achieved through the use of the tags inherent to the NLM Journal Archiving and Interchange Tag Suite together with regular expressions. The two methods supplement each other and achieve better recall and precision than either method on its own. An analysis of roughly 100 documents showed that it is becoming common practice to provide hyperlink text within XML documents rather than a visible URL (see the example below). Thus, the sole use of regular expressions on printable text resulted in poor recall.

<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov"> hyperlink</ext-link>

In addition, a clear majority of documents providing visible URLs also include the URLs as an attribute value within the XML tags. Therefore, the extraction of URLs may be achieved solely through the use of XML tags. However, regular expressions were still used to syntactically validate the URLs extracted, in order to catch human error. This helped improve precision.

The implemented process to extract URLs is described below:

1. Parse PMC document using the DOM parser.

2. Traverse through the parsed document to find XML tags defining URLs (i.e., ext-link).

a. Ensure that this tag includes the attribute ext-link-type, and that this attribute has the value uri (i.e., ext-link-type="uri"). This is taken as an indication that the XML element contains a URL.

3. Extract the attribute value of xlink:href which, by inference, ought to be a URL.

4. Finally, the value of xlink:href is syntactically validated as a URL by applying regular expressions (see Table 21). This step is applied as a precaution against potential human error.27

Table 21 – Regular Expressions for URL Validation

(((http|https|ftp)://)|(www\\.))+

([\\d\\D&&[^\\s(@]])*\\.([\\d\\D&&[^\\s@)]]\\.?)+
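Step 4 can be sketched with `java.util.regex`. The sketch assumes that the two fragments in Table 21 form one expression when concatenated, and that the whole attribute value must match; both points are assumptions, as is the class name `UrlSyntaxSketch`.

```java
import java.util.regex.Pattern;

final class UrlSyntaxSketch {
    // The two fragments of Table 21, concatenated (an assumption).
    static final Pattern URL = Pattern.compile(
            "(((http|https|ftp)://)|(www\\.))+"
          + "([\\d\\D&&[^\\s(@]])*\\.([\\d\\D&&[^\\s@)]]\\.?)+");

    // True if the whole (trimmed) candidate matches the expression.
    static boolean looksLikeUrl(String candidate) {
        return candidate != null && URL.matcher(candidate.trim()).matches();
    }
}
```

Under these assumptions, a value with no protocol or www prefix, such as a bare DOI, fails the check and would be discarded.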

5.2.2. Checking Resource Availability

The Java API class URLConnection (specifically, its sub-class HttpURLConnection) is used to check whether extracted URLs are accessible. A connection request is sent for each URL extracted, with a connection timeout of 10 seconds. The URL is considered accessible/active if an HTTP 200 OK response code is received (see Table 18). If no response code is returned within 10 seconds, or if any other response code is received, the URL is considered inaccessible.
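The availability check described above might look roughly as follows. This is a sketch, not the ExtConX2 source: the class and method names are hypothetical, and applying the same 10-second limit to reads as well as to connection set-up is an assumption.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class UrlStatusSketch {
    static final int TIMEOUT_MS = 10_000; // the 10-second timeout from the text

    // Only HTTP 200 OK counts as active; every other code counts as inactive.
    static boolean isActive(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_OK;
    }

    // Returns true if the URL answers 200 OK within the timeout.
    static boolean check(String address) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(address).openConnection();
            conn.setConnectTimeout(TIMEOUT_MS);
            conn.setReadTimeout(TIMEOUT_MS); // assumption: same limit for reads
            try {
                return isActive(conn.getResponseCode());
            } finally {
                conn.disconnect();
            }
        } catch (IOException | ClassCastException e) {
            // Malformed URL, timeout, unknown host, or a non-HTTP connection.
            return false;
        }
    }
}
```

Keeping the response-code rule in its own method (`isActive`) makes the 200-only policy easy to test without network access.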

27 For instance, a common error is the inclusion of a Digital Object Identifier (DOI) instead of a URL within tags intended for URLs.

5.2.3. Determining Resource Type

For each URL extracted, the system must determine the type of resource referenced (refer to Section 1.2.1 for conceptualisation of resource types). The approach used to achieve this end is rule-based, in conjunction with lists containing keywords (and URLs). The keywords were chosen through iterative testing and analysis of roughly 100 PMC documents, and were selected carefully to reflect the relevant resource type. Table 22 shows a subset of five keywords used for each resource type; the full list is provided in Appendix B, Table 36.

Table 22 – Sample of Keywords

Databank                   Document   Organisation           Software
data bank                  .doc       organisation           software
databank                   .pdf       organization           sourceforge
database                   journal    institute              program
genBank                    report     international agency   application
ncbi.nlm.nih.gov/protein   facts      -                      system

Moreover, all keywords are loaded as regular expressions. This has advantages such as:

1. Keywords can easily be matched case-insensitively; separate uppercase and lowercase spellings of each word are not needed.

2. The use of the grammatical root form of keywords is sufficient.28 Hence, shorter keyword lists are sufficient to fulfil this function.
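Both advantages can be illustrated with a short Java sketch. The class `KeywordSketch`, the reduced keyword lists, the merged expressions (e.g. `data ?bank`), and the order in which resource types are tried are assumptions chosen for illustration; they are not taken from Appendix B.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class KeywordSketch {
    // Resource type -> keywords compiled as case-insensitive expressions.
    static final Map<String, List<Pattern>> KEYWORDS = new LinkedHashMap<>();

    static void register(String type, String... regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        }
        KEYWORDS.put(type, patterns);
    }

    static {
        // A subset of Table 22, written as regular expressions.
        register("Databank", "data ?bank", "database", "genbank");
        register("Document", "\\.doc", "\\.pdf", "journal", "report");
        register("Organisation", "organi[sz]ation", "institute");
        register("Software", "software", "sourceforge", "program");
    }

    // Returns the first resource type with a keyword found in the text, or null.
    static String classify(String text) {
        for (Map.Entry<String, List<Pattern>> e : KEYWORDS.entrySet()) {
            for (Pattern p : e.getValue()) {
                if (p.matcher(text).find()) { // find(): root forms match inflections
                    return e.getKey();
                }
            }
        }
        return null;
    }
}
```

Because `find()` matches substrings, the root form report also matches REPORTS, so neither the plural nor the uppercase spelling needs its own entry.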

5.2.3.1. Soft Decision

Soft decision is a method/algorithm used to determine the most likely resource type for each URL extracted. Up to four instances of resource type may be determined for each URL mentioned, through analysis of the URL context (see also the description of the implementation in Section 5.2.3.2):

1. By keywords identified within the URL string.

2. By keywords identified within the parent node of the URL tag.

Typically, within the reference list the parent node contains the title and/or description of a reference.

3. By keywords identified within the parent-parent node of the URL tag.

See the previous description. This is needed due to the inconsistent use of nodes within XML documents: some reference titles/descriptions are not contained within the first parent node, but rather within the parent-parent node.

4. By keywords identified within the citation context of the article (i.e., the actual sentence where the resource is cited within the article body).

28 For instance, the singular and plural forms of each keyword are not both needed.

Once all instances have been collected, this data is subsequently processed by soft decision.

The soft decision algorithm distributes a total weight of 1 across the four (resource type) instances. Subsequently, the resource type with the largest total weight is identified as the most likely resource type. If two resource types have equal weight, the first identified (instance) resource type is returned as the likely type. The distributed weight is based upon an iterative analysis of which decision instance is most reliable, and is defined as follows:

Table 23 – Distributed Score of Soft Decision Algorithm

Instance   Distributed Score   Description
1          0.400               Keyword identified within the URL string.
2          0.225               Keyword identified within parent node.
3          0.225               Keyword identified within parent-parent node.
4          0.150               Keyword identified by citation reference within the article body.

5.2.3.2. Implementation of URL Module Described

Consider this hypothetical example:

<ref id="CR9">
  <citation>
    MZmine 2 – software for mass-spectrometry was used in this research (
    <ext-link ext-link-type="uri" xlink:href="http://www.mzm.sourceforge.net">
      http://www.mzm.sourceforge.net
    </ext-link>
    ); to process the data presented in the results section.
  </citation>
</ref>

The implemented process adopted to determine referenced resource type is as follows:

1. Parse the document using DOM parser.

2. Traverse through the parsed document to extract the URL.

a. Analyse URL string for keywords (see bold text below). Save the result for

analysis by soft decision.

http://www.mzm.sourceforge.net

b. (1) Get the parent node's (i.e., citation) context (all text between the citation start

and end node). (2) Analyse this context (word by word) for keywords, starting

from the location of the URL within this string until the start of the sentence (see

bold text below).

MZmine 2 – software for mass-spectrometry was

used in this research

(http://www.mzm.sourceforge.net); to process

the data presented in the results section.

If unable to determine a resource type, (3) analyse whole citation context (see bold

text below) starting from the beginning of the sentence to the end.

MZmine 2 – software for mass-spectrometry was

used in this research

(http://www.mzm.sourceforge.net); to process

the data presented in the results section.

Save the result for analysis by soft decision.

c. Do the same as previous step but with the parent-parent node context (in this

example the parent-parent node is ref tag; and the analysis of its context will give

identical result as the previous step). Save the result for analysis by soft decision.

d. (1) Get ref element attribute (id) value (i.e., CR9), if it exists (if not, return null).

(2) Find this citation within the article body by the reference id (CR9). (3) Finally,

analyse the sentence word by word for keywords starting from the location of

citation until the start of that sentence/paragraph. Save result for analysis by soft

decision.

3. Determine the most likely resource type by soft decision.

a. The soft decision data derived from the example above, based on the keywords

provided in Table 22, would be (Table 24):

Table 24 – Result by Soft Decision Algorithm

Instance   Weight   Resource Type   Description
1          0.400    Software        By keyword: sourceforge within the URL string
2          0.225    Software        By keyword: software within the parent context
3          0.225    Software        By keyword: software within the parent-parent context
4          0.150    null            Assuming unidentifiable keywords within the article body citation.

Hence, the Software resource type would have a total weight of 0.85, so even if the last instance were identified as any other resource type, Software would be returned by soft decision as the likely resource type.
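The weighting in Table 23 and the worked example in Table 24 can be reproduced with a short sketch. The class `SoftDecisionSketch` and the representation of instances as an array of type names (with null for "not identified") are assumptions; the weights and the tie rule are those stated in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SoftDecisionSketch {
    // Weights from Table 23, in instance order: URL string, parent node,
    // parent-parent node, citation context within the article body.
    static final double[] WEIGHTS = {0.400, 0.225, 0.225, 0.150};

    // instances[i] is the resource type found for instance i + 1, or null.
    static String decide(String[] instances) {
        // LinkedHashMap keeps resource types in the order first identified.
        Map<String, Double> totals = new LinkedHashMap<>();
        for (int i = 0; i < instances.length && i < WEIGHTS.length; i++) {
            if (instances[i] != null) {
                totals.merge(instances[i], WEIGHTS[i], Double::sum);
            }
        }
        String best = null;
        double bestWeight = 0.0;
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            // Strict '>' keeps the first identified type when weights are equal.
            if (e.getValue() > bestWeight) {
                best = e.getKey();
                bestWeight = e.getValue();
            }
        }
        return best; // null if no instance identified a resource type
    }
}
```

With the instances of Table 24, Software accumulates 0.400 + 0.225 + 0.225 and is returned regardless of the fourth instance, which can contribute at most 0.150 to another type.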

5.3. Implementation of IE Module

This section provides a detailed description of a subset of the implementation of the IE Module

(refer to Figure 9). It presents the methods adopted for identification and extraction of NEs, REs,

and the semantic level processing.

5.3.1. GATE

GATE was used to develop the IE Application for extraction of acknowledgements. While many alternatives exist, such as LingPipe or MinorThird, GATE was used due to the availability of extensive documentation, a user-friendly IDE for debugging and development, and easy integration with Java.

GATE's default IE system, A Nearly-New Information Extraction System (ANNIE), was used as a starting point for the development of the IE Application. ANNIE contains a set of default processing resources, mostly based on the Java Annotation Pattern Engine (JAPE)29 (see the default ANNIE pipeline in Appendix A, Figure 15), which were amended and extended as required to meet the requirements of this module.

5.3.2. Java Annotation Pattern Engine

Java Annotation Pattern Engine (JAPE) is a rule-based language which provides finite state transduction over annotations (Cunningham et al., 2010), enabling various IE tasks through the manipulation of existing annotations and the creation of new ones. A JAPE grammar may be split into a set of phases consisting of pattern and action rules that are run sequentially (Cunningham et al., 2010) in a customised, user-defined order. In fact, the ability to create sequential pattern/action rules enables the extraction of complex patterns to be broken down into incremental, simplified rules (see Section 5.3 for an example).

A JAPE rule consists of two primary parts: the left-hand side (LHS) and the right-hand side (RHS). The LHS consists of rule-based pattern description(s), and the RHS consists of action rules or annotation manipulation statements. The JAPE syntax used for pattern description is quite similar to the regular expressions used in many programming languages, hence no description of syntax will be provided (refer to Cunningham et al., 2010, Chapter 8). The following example is a simplified JAPE rule which identifies the pattern of two consecutive proper nouns with uppercase initials, and subsequently labels them as Person (see the description of syntax provided in the comments):

29 JAPE is based on the Common Pattern Specification Language (CPSL).

Phase: AnnotatePerson // Phase name or identifier for the rule

// Input annotations (e.g., created by the POS tagger) that will be
// used by the pattern description must be declared
Input: Token

Rule: Person1 // Rule name
(
  // Pattern: NNP NNP (with uppercase initials)
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
):temp // Temporary label

--> // Everything above this symbol is the LHS, everything below is the RHS

// Convert temporary label to permanent annotation/label: Person
:temp.Person = {rule = "Person1"}

5.3.3. Implementation of IE Module Described

Similarly to Giles and Councill (2004), the NLP/IE process is not applied to entire PMC documents, but solely to the acknowledgement sections extracted from them. The general process for role extraction is as follows:

1. PMC documents are parsed using a DOM parser.

2. The acknowledgement (section) is extracted using the NLM Journal Archiving and Interchange DTD tag: ack.

3. Subsequently, this text is processed using the IE Application developed using GATE (refer to Figure 9). The output of this process is a GATE XML document which contains the dump of annotations (i.e., NEs and their respective REs) in an XML format.

4. The GATE XML document is programmatically processed (or parsed) to extract the NEs and their respective REs, which are then inserted into the System Db.
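Steps 1 and 2 above can be sketched with the standard Java DOM API. This is a minimal illustration, not the ExtConX2 code: the class name AckExtractor and the inline sample article are hypothetical, and the sample omits the DOCTYPE declaration that real PMC files carry (a parser may try to resolve the referenced DTD, which would need to be handled separately).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: pull the text of every ack element out of an article. */
final class AckExtractor {

    static List<String> extractAcknowledgements(String articleXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(articleXml.getBytes(StandardCharsets.UTF_8)));
            NodeList acks = doc.getElementsByTagName("ack");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < acks.getLength(); i++) {
                // Collapse layout whitespace so that the text can be passed on as one paragraph.
                result.add(acks.item(i).getTextContent().trim().replaceAll("\\s+", " "));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse article XML", e);
        }
    }

    public static void main(String[] args) {
        String article = "<article><body><p>Main text.</p></body>"
                + "<back><ack><p>We thank Dr Melvin Simon for helpful discussions.</p></ack></back>"
                + "</article>";
        System.out.println(extractAcknowledgements(article));
    }
}
```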


Figure 11 - IE Application Pipeline

5.3.3.1. Description of IE Application

Of the eight processing resources used for text pre-processing and the IE task (Figure 11), four are custom designed: the Gazetteer (partially), the NE-Extended Transducer, the Role Expression Transducer, and the Role Context Transducer. The latter three are developed using JAPE. Descriptions of these processing resources and some implementation examples follow:

1. Gazetteer - The ANNIE gazetteer, which is used for name entity recognition by default, is further extended to accommodate role extraction.30 In particular:

i. The organisations list is extended with known funding organisations.31

ii. Role Expression lists are added, containing collaboration and funder roles (see Tables 25 and 26). Each type of role has two separate lists: (1) multi-word and (2) one-word. This enables prioritisation of multi-word roles during semantic level processing, which gives better evaluation results (i.e., one-word roles tend to result in partial identification of roles).

30 Extended lists are available on the project website (http://gnode1.mib.man.ac.uk/projects/ExtConX2/)

31 Resources used for collecting research funding organisation names include: Wikipedia (2010), NIH (2010), and Giles and Councill (2004).

Table 25 – Sample of One-Word Role Expression Lists

Funding Roles    Collaboration Roles
Grant-In-Aid     advice
grants           assistance
sponsor          discussions
sponsored        comments
sponsors         encouragement

Table 26 – Sample of Multi-Word Role Expression Lists

Funding Roles           Collaboration Roles
financially supported   assistance and comments
fellowship award        critically reading
financial support       critically reviewing the manuscript
research fund           helpful comments
research funds          technical assistance

2. NE-Extended Transducer - The ANNIE Gazetteer provides annotation of name entities (e.g., persons: first and last names); subsequently the ANNIE NE Transducer, which is based on JAPE, contains rules that manipulate these annotations further to create, e.g., person full names (i.e., linking of first and last names). The NE-Extended Transducer is required to complement the ANNIE Gazetteer and NE Transducer. Initial testing of the ANNIE system showed that a considerable number of NEs were missed, in particular non-English names. This resource was needed to improve the performance of the semantic processing resource (i.e., the Role Context Transducer), that is, the linking of NEs and their respective REs.32 Below is a simplified version of one of the rules used; see the comments for a description:

Rule: PersonExt1 // Rule name or identifier

/* VB  - verb, base form: subsumes imperatives, infinitives and subjunctives.
 * VBP - verb, non-3rd person singular present.
 * Target: e.g., "thank", "grateful", and so on
 */
({Token.kind==word, Token.category==VB}|
 {Token.kind==word, Token.category==VBP})

// Any word token that is not a Person or Organization.
// Target: e.g., 'to', 'for', and so on.
({Token.kind==word, !Token.orth==upperInitial,
  !Person, !Organization})?

/* NNP - proper noun, singular: all names are typically capitalised.
 * Create a temporary label over the following pattern, given
 * that the preceding patterns are true.
 */
(
  {Token.kind==word, Token.category==NNP,
   Token.orth==upperInitial, !Person, !Organization}
  {Token.kind==word, Token.category==NNP,
   Token.orth==upperInitial, !Person, !Organization}
):person // Temporary label

--> // LHS --> RHS

/* Convert temporary label, "person", to permanent label
 * "Person" with given features:
 * "rule = PersonExt1" and "rule1 = PersonFull"
 */
:person.Person = {rule = "PersonExt1", rule1 = "PersonFull"}

32 This is because the role association rules depend on a well-functioning NER system.

The above rule annotates two consecutive proper nouns (NNP) with uppercase initials as Person, given that the NNPs have not been annotated by the default ANNIE resources and that they are preceded by a verb: either base form (VB), which subsumes imperatives, infinitives and subjunctives, or non-3rd person singular present (VBP). Relevant VB/VBP include: thank, grateful, etc. For instance, given the following sentences:

i. We are grateful to Jong Zang...

ii. We thank Youm Dom...

Jong Zang and Youm Dom will be annotated as Person by the rule provided above.

3. Role Expression Transducer - In addition to the use of lists (gazetteer) to identify collaboration roles, JAPE grammar is used. In fact, using JAPE grammar to identify REs results in better performance than lists, because there are far too many varieties of collaboration roles to account for and include in lists. For instance, consider the following acknowledgement:

i. We thank Youm Dom for constructive feedback and providing GEO212.

The following rule (RoleExpression1), assuming that Youm Dom is annotated as Person, annotates the following part of the sentence as an RE (shown in square brackets):

ii. We thank Youm Dom for [constructive feedback] and providing GEO212.

This annotation would also have been produced by the use of the gazetteer. However, it is only a partial identification of the RE (i.e., providing GEO212 is missing). A description follows:


Rule: RoleExpression1 // Rule name
(
  {Person} // Annotated NE: Person

  // NE may be (note use of "?") followed by '( ..... )',
  // typically containing associations
  ({Token.string=="("} ({Token})* {Token.string==")"})?

  // There might exist additional words between {Person} and
  // ['for'|'who']
  ({Token.kind==word, !Person}{Token.kind==word, !Person}|
   {Token.kind==word, !Person}{Token.kind==word, !Person}{Token.kind==word, !Person})?

  // NE must be followed by 'for' or 'who', indicating the beginning
  // of a role expression
  ({Token.string=="for"}|{Token.string=="who"})

  // PRP$ - possessive pronoun. Target cases: his, her,
  // or their (may exist)
  ({Token.category=="PRP$"})?
)

/* Annotate the following tokens/words as role (with temporary label).
 * End annotation if negation cases are true: [.,;] or 'and'
 */
(
  ({Token, !Token.string==~"[.,;]", !Token.string=="and"})*
):role // Temporary label
--> // LHS --> RHS
// Convert temporary label to permanent label: RoleExpression1 with
// given features.
:role.RoleExpression1 =
  {kind = "PersonCollab", rule = "CollabRule1"}

The acknowledgement example given above involves two challenges: (1) the RE extends over a conjunction (i.e., and), which could also indicate the end of an RE, and (2) the RE itself cannot be accounted for prior to processing the text (as previously discussed). JAPE provides the facility to split rules into a set of separate rules/phases in order to approach this sort of complexity. This approach has been adopted for annotating REs.33

The following rule (RoleExpression3) is applied to the text after RoleExpression1; hence, continuing the prior example, the result of the RE annotation would be as follows (shown in square brackets):

iii. We thank Youm Dom for [constructive feedback and providing GEO212].

33 The use of phases has been adopted for all three processing resources developed: the NE-Extended Transducer, Role Expression Transducer, and Role Context Transducer.

Rule: RoleExpression3 // Rule name
(
  // Annotated Role Expression (derived from previous rule)
  ({RoleExpression1})

  // Ensure RoleExpression1 is not followed by a new
  // acknowledgement
  ({Token.string=="and"}{!Person})
  ({Token.kind==word, !Person, !Token.string=="and"})?
  ({Token, !Token.string=="and", !Person, !Organization,
    !Token.string==~"[.,;]"})*
):temp // Temporary label

--> // LHS --> RHS

:temp.RoleExpression3 =
  {kind = "PersonCollab", rule = "CollabRule3"}

4. Role Context Transducer

The Role Context Transducer is responsible for semantic level processing: the linking of NEs and their respective REs. This is the last processing resource before the extraction of annotated roles. Below are examples of rules which link NEs (i.e., organisations) and their corresponding REs.

Consider the following example:

i. National Institute of Health provided funding for this research.

The application of the following rule (OrgFund1) results in the annotation of the NE and the RE (which is annotated with the customised RE lists discussed earlier) as a role context, i.e., the span from the organisation name up to and including the funding RE (shown in square brackets):

ii. [National Institute of Health provided funding] for this research.

Rule: OrgFund1 // Rule name
(
  {Organization} // Annotated NE: Organisation

  // There might exist a word between NE and RE (e.g., provided)
  ({Token.kind==word, !Token.string==","})?

  /* Find Gazetteer annotated REs: Funder/Sponsor,
   * priority given to 'multi-word' roles.
   */
  ({Lookup.majorType==role_fund, Lookup.minorType==multi_word}|
   {Lookup.majorType==role_fund, Lookup.minorType==one_word})
):temp // Temporary label

--> // LHS --> RHS

/* Create new annotation from temporary label */
:temp.roleContext = {rule="OrgFund1"}

Another useful feature of JAPE, which is utilised by the Role Context Transducer, is that it enables prioritisation of rules which are applied sequentially or in a cascade. For instance, if rules may overlap, prioritisation weights may be applied accordingly, prioritising one rule or set of annotations over another. The following rule (OrgFund2), which is applied after the previous rule, identifies a funder RE (as annotated by the gazetteer) and annotates the remainder of the sentence as a role context. This method is feasible in practice because acknowledgements of funders are typically separated by sentence.

Rule: OrgFund2
(
  /* Find Gazetteer annotated roles: Funder/Sponsor,
   * priority given to 'multi-word' roles.
   */
  ( {Lookup.majorType==role_fund, Lookup.minorType==multi_word}|
    {Lookup.majorType==role_fund, Lookup.minorType==one_word} )

  /* Label all tokens until end of sentence */
  ({Token, !Split})*
):temp // Temporary label

--> // LHS --> RHS

/* Create new annotation from temporary label */
:temp.roleContext = {rule="OrgFund2"}

5.3.4. Information Extraction

Following the text pre-processing, the IE Application (refer to Figure 9) returns a GATE XML document (see the sample GATE document at the end of this section). The process adopted to extract roles is as follows:

1. Parse the GATE XML document using DOM parser.

2. Find annotation type: Role Context and store its StartNode and EndNode values (e.g., 9 and

87 respectively in the example below) for later reference:

28 <Annotation Id="1924" Type="roleContext" StartNode="9" EndNode="87">

3. Get annotation type Role Expression within the range of Role Context StartNode and

EndNode:

34 <Annotation Id="1923" Type="RoleExp" StartNode="29" EndNode="87">

35 <Feature>

36 <Name className="java.lang.String">rule </Name>

37 <Value className="java.lang.String">CollabRule3</Value>

38 </Feature>

39 <Feature>

40 <Name className="java.lang.String">kind</Name>

41 <Value className="java.lang.String">PersonCollab</Value>

42 </Feature>

43 </Annotation>


4. Determine the type of Role Expression by appropriate child node. In this example (see

above: line 41), it is identified as a Person Collaboration role (hence, NE: Person and RE:

Collaboration). In addition, store Role Expression StartNode and EndNode values (i.e., 29

and 87 respectively) for later reference.

5. Get annotation type Person within the range of Role Context, and store its StartNode and

EndNode values (i.e., 9 and 24 respectively) for later reference:

44 <Annotation Id="1921" Type="Person" StartNode="9" EndNode="24">

45 <Feature>

46 <Name className="java.lang.String">rule</Name>

47 <Value className="java.lang.String">PersonFinal</Value>

48 </Feature>

49 </Annotation>

6. Extract NE: Person and Role Expression by previously stored node values from the

document content area (which contains the acknowledgement text with serialised nodes

corresponding to annotations):

10 <TextWithNodes>

11 <Node id="0" />We<Node id="2" />

12 <Node id="3" />thank<Node id="8" />

13 <Node id="9" />Dr<Node id="11" />

14 <Node id="12" />Melvin<Node id="18" />

15 <Node id="19" />Simon<Node id="24" />

16 <Node id="25" />for<Node id="28" />

17 <Node id="29" />critical<Node id="37" />

18 <Node id="38" />reading<Node id="45" />

19 <Node id="46" />of<Node id="48" />

20 <Node id="49" />the<Node id="52" />

21 <Node id="53" />manuscript<Node id="63" />

22 <Node id="64" />and<Node id="67" />

23 <Node id="68" />helpful<Node id="75" />

24 <Node id="76" />discussions<Node id="87" />.<Node id="88" />

25 </TextWithNodes>

7. Result of role extraction from given example is provided in Table 27. The full GATE

XML document used in this example follows.

Table 27 - Results of Role Extraction

(1) Name Expression: Dr Melvin Simon
    Role Expression: Critical reading of the manuscript and helpful discussions


Gate XML Document:

1 <GateDocument>

2 

3 <GateDocumentFeatures>

4 <Feature>

5 <Name className="java.lang.String">MimeType</Name>

6 <Value className="java.lang.String">text/plain</Value>

7 </Feature>

8 </GateDocumentFeatures>

9 

10 <TextWithNodes>

11 <Node id="0" />We<Node id="2" />

12 <Node id="3" />thank<Node id="8" />

13 <Node id="9" />Dr<Node id="11" />

14 <Node id="12" />Melvin<Node id="18" />

15 <Node id="19" />Simon<Node id="24" />

16 <Node id="25" />for<Node id="28" />

17 <Node id="29" />critical<Node id="37" />

18 <Node id="38" />reading<Node id="45" />

19 <Node id="46" />of<Node id="48" />

20 <Node id="49" />the<Node id="52" />

21 <Node id="53" />manuscript<Node id="63" />

22 <Node id="64" />and<Node id="67" />

23 <Node id="68" />helpful<Node id="75" />

24 <Node id="76" />discussions<Node id="87" />.<Node id="88" />

25 </TextWithNodes>

26 

27 <AnnotationSet>

28 <Annotation Id="1924" Type="roleContext" StartNode="9" EndNode="87">

29 <Feature>

30 <Name className="java.lang.String">rule</Name>

31 <Value className="java.lang.String">PersonCollab1</Value>

32 </Feature>

33 </Annotation>

34 <Annotation Id="1923" Type="RoleEntity" StartNode="29" EndNode="87">

35 <Feature>

36 <Name className="java.lang.String">rule </Name>

37 <Value className="java.lang.String">CollabRule3</Value>

38 </Feature>

39 <Feature>

40 <Name className="java.lang.String">kind</Name>

41 <Value className="java.lang.String">PersonCollab</Value>

42 </Feature>

43 </Annotation>

44 <Annotation Id="1921" Type="Person" StartNode="9" EndNode="24">

45 <Feature>

46 <Name className="java.lang.String">rule</Name>

47 <Value className="java.lang.String">PersonFinal</Value>

48 </Feature>

49 </Annotation>

50 </AnnotationSet>

51 </GateDocument>
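The extraction steps of Section 5.3.4 can be sketched with the standard Java DOM API as shown below. This is a minimal illustration under stated assumptions rather than the ExtConX2 implementation: the class and method names are hypothetical, the role annotation is looked up under the type name RoleEntity as it appears in the full listing above, and StartNode/EndNode are assumed to be character offsets into the text content of TextWithNodes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch of steps 1 to 6; not the ExtConX2 source. */
final class GateRoleExtractor {

    /** Returns {name expression, role expression} for the first roleContext, or null if none. */
    static String[] extractFirstRole(String gateXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(gateXml.getBytes(StandardCharsets.UTF_8)));
            // Assumption: StartNode/EndNode are character offsets into this text.
            String text = doc.getElementsByTagName("TextWithNodes").item(0).getTextContent();
            NodeList annotations = doc.getElementsByTagName("Annotation");

            Element context = firstOfType(annotations, "roleContext", 0, text.length());
            if (context == null) {
                return null;
            }
            int from = start(context);
            int to = end(context);
            Element person = firstOfType(annotations, "Person", from, to);
            Element role = firstOfType(annotations, "RoleEntity", from, to);
            if (person == null || role == null) {
                return null;
            }
            return new String[] {
                text.substring(start(person), end(person)),
                text.substring(start(role), end(role))
            };
        } catch (Exception e) {
            throw new IllegalStateException("Could not read GATE XML", e);
        }
    }

    /** First annotation of the given type lying completely inside [from, to]. */
    private static Element firstOfType(NodeList annotations, String type, int from, int to) {
        for (int i = 0; i < annotations.getLength(); i++) {
            Element annotation = (Element) annotations.item(i);
            if (type.equals(annotation.getAttribute("Type"))
                    && start(annotation) >= from && end(annotation) <= to) {
                return annotation;
            }
        }
        return null;
    }

    private static int start(Element annotation) {
        return Integer.parseInt(annotation.getAttribute("StartNode"));
    }

    private static int end(Element annotation) {
        return Integer.parseInt(annotation.getAttribute("EndNode"));
    }

    public static void main(String[] args) {
        String xml = "<GateDocument><TextWithNodes>We thank Dr Melvin Simon for critical reading"
                + " of the manuscript and helpful discussions.</TextWithNodes><AnnotationSet>"
                + "<Annotation Id='1924' Type='roleContext' StartNode='9' EndNode='87'/>"
                + "<Annotation Id='1923' Type='RoleEntity' StartNode='29' EndNode='87'/>"
                + "<Annotation Id='1921' Type='Person' StartNode='9' EndNode='24'/>"
                + "</AnnotationSet></GateDocument>";
        String[] role = extractFirstRole(xml);
        System.out.println(role[0] + " | " + role[1]);
    }
}
```

The sketch handles a single Person per role context; acknowledgements naming several persons would require collecting every Person annotation inside the context range.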


6. Evaluation

This chapter presents and discusses the evaluation of the methods adopted and the results obtained from facts analysed during the knowledge discovery stage of the dissertation. It is subdivided into three main sections: (1) the evaluation of URL Extraction, (2) the evaluation of Role Extraction, and (3) system issues.

The ExtConX2 IE tasks were evaluated using customary recall- and precision-based metrics. Table 28 defines the evaluation terms used in the subsequent definitions of Precision (P), Recall (R), and F Measure (F).

Table 28 – Evaluation Terms Described

                Relevant               Non-relevant
Extracted       True positives (tp)    False positives (fp)
Not Extracted   False negatives (fn)   True negatives (tn)

\[ P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}, \qquad F = \frac{2 \times P \times R}{P + R} \]
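As a worked illustration of the three definitions, the following Java sketch computes P, R and F from raw counts. The class name Metrics and the counts used in main are hypothetical and are not taken from the evaluation data.

```java
/** Hypothetical helper implementing the definitions of P, R and F given above. */
final class Metrics {

    /** P = tp / (tp + fp); yields NaN when nothing was extracted. */
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    /** R = tp / (tp + fn); yields NaN when nothing was relevant. */
    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    /** F = 2PR / (P + R), the harmonic mean of precision and recall. */
    static double fMeasure(double p, double r) {
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Made-up counts: 8 correct extractions, 2 spurious ones, 2 missed items.
        double p = precision(8, 2);
        double r = recall(8, 2);
        System.out.println(p + " " + r + " " + fMeasure(p, r));
    }
}
```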

6.1. URL Extraction

Roughly 190,000 PMC documents were processed; of these, 47,644 contained a total of 147,133 URLs, of which 95,799 were unique. Based on the evaluation of a random sample of 50 documents (222 URLs), the adopted approach achieved 98.6% precision and 96% recall for URL extraction (see Appendix C for evaluation data). In addition, the soft decision algorithm achieved a recall of 81.1% and a precision of 88.7% for the classification of resources.

The resource types referenced, by total number of extracted URLs, are presented below (Table 29). As expected, the Document resource type is referenced most of all. However, an interesting discovery is the percentage of references to the Software type (see Chapter 7).

Table 29 – Total Resource Type Referenced

Resource Type   Total Identified URLs   % of URLs
None            16,865                  11.46%
Databank        15,409                  10.47%
Document        72,197                  49.07%
Organisation    7,353                   5.00%
Software        35,309                  24.00%
TOTAL:          147,133                 100%


Table 30 provides a summary of accessible/inaccessible online resources by year of publication of

URLs:34

Table 30 – Resource Availability by Year

Year   Total URLs   Accessible URLs   Inaccessible URLs   % Inaccessible by Year
2010   1,382        1,248             134                 9.70
2009   42,995       38,251            4,744               11.03
2008   37,790       32,242            5,548               14.68
2007   26,133       21,874            4,259               16.30
2006   16,669       13,390            3,279               19.67
2005   9,932        7,561             2,371               23.87
2004   6,745        4,910             1,835               27.21
2003   2,561        1,827             734                 28.66
2002   1,659        1,179             480                 28.93
2001   729          470               259                 35.53
2000   251          172               79                  31.47
1999   186          115               71                  38.17
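The last column of Table 30 follows from the two preceding count columns. As a small worked example (the class name UrlDecay is hypothetical), the percentage for a given year can be recomputed as follows:

```java
import java.util.Locale;

/** Hypothetical helper recomputing the "% Inaccessible by Year" column of Table 30. */
final class UrlDecay {

    /** Share of URLs that could not be accessed, in percent. */
    static double percentInaccessible(int totalUrls, int accessibleUrls) {
        return 100.0 * (totalUrls - accessibleUrls) / totalUrls;
    }

    /** Same value rendered with two decimals, as in the table. */
    static String formatted(int totalUrls, int accessibleUrls) {
        return String.format(Locale.ROOT, "%.2f", percentInaccessible(totalUrls, accessibleUrls));
    }

    public static void main(String[] args) {
        System.out.println("2010: " + formatted(1382, 1248)); // row 2010 of Table 30
        System.out.println("2001: " + formatted(729, 470));   // row 2001 of Table 30
    }
}
```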

Table 30 is illustrated by Figure 12:

Figure 12 – URL Decay

As is evident from Figure 12, the notion of URL decay applies to full-text journal articles just as it has been found to apply to citations (see Wren 2004; Wren 2008). The trend may be described as a function of publication year: the older the publication, the smaller the share of the resources it references that remain accessible.

34 The summary is based on a total of 147,032 URLs, not 147,133, because metadata for the articles containing the remaining 101 URLs was not extracted.

[Figure 12 plot: % of Accessible Resources (y-axis, 0 to 100) by year of publication (x-axis, 2010 to 1999)]

6.1.1. Discussions

This section discusses the facts presented with regard to URLs within PMC documents and the aspects of the underlying implementation of ExtConX2 which may have affected those facts. Potential improvements are also suggested.

(1) FTPs

One of the limitations of the data presented is that FTP URIs were not checked for availability. However, the analysis of the extracted data showed that, out of 147,133 URLs extracted, only 791 (or 0.5%) were FTPs; hence we can conclude that the impact on the statistics presented is minimal.

(2) Resource Availability

The method adopted to check the availability of resources has some weaknesses. As availability is only checked once, before insertion into the database, the accuracy of the availability results may be affected. For instance, web servers do not have 100% up-time or unlimited capacity for online traffic; hence, either of these factors may have impacted the results. A better approach to maximise the accuracy of URL availability would be to implement an additional module which crawls the database and updates the URL status appropriately. For instance, Wren's (2004, 2008) approach would be ideal: URLs were checked every day over a 4 week period and any URL which was accessible over 90% of the time was deemed an active resource.
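The repeated-check criterion described above can be sketched as a simple threshold over a history of checks. The following Java fragment is a hypothetical illustration (the class name AvailabilityPolicy and its parameters are assumptions, not part of ExtConX2 or of Wren's implementation):

```java
/** Hypothetical sketch: decide whether a URL counts as active from repeated daily checks. */
final class AvailabilityPolicy {

    /**
     * @param dailyChecks one entry per check; true if the URL was reachable at that time
     * @param threshold   required share of successful checks, e.g. 0.90
     * @return true if the URL was reachable in more than the given share of checks
     */
    static boolean isActive(boolean[] dailyChecks, double threshold) {
        if (dailyChecks.length == 0) {
            return false; // no evidence, so do not report the resource as active
        }
        int successes = 0;
        for (boolean reachable : dailyChecks) {
            if (reachable) {
                successes++;
            }
        }
        return successes / (double) dailyChecks.length > threshold;
    }

    public static void main(String[] args) {
        boolean[] fourWeeks = new boolean[28];
        java.util.Arrays.fill(fourWeeks, true);
        fourWeeks[3] = false; // a single failed check out of 28
        System.out.println(isActive(fourWeeks, 0.90));
    }
}
```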

In addition, due to the project time constraint, the implementation for checking URL availability had a 10 second time-out limit.35 As some web servers take longer to respond to HTTP requests, this limit may have affected the results presented.36

(3) Soft Decision and Resource Identification

Approximately 10.5% of the resources identified were incorrectly classified and 8.5% were not identified at all. A manual review of these documents (and others) shows two primary issues with the implementation. The use of keywords to identify resources failed due to (1) a lack of keywords within the citation context indicating the type of resource, and (2) resource types not accounted for within the implementation, e.g., laboratory tools and equipment. The latter limitation may be addressed by creating a new list of keywords that characterises laboratory tools and equipment and making some minor amendments to the implementation to support an additional resource type.

35 Wren's (2004, 2008) implementation had a 60 second time-out limit: this is probably an appropriate limit. As the URL data was loaded into the database during the last 10 days of the dissertation, a 60 second time-out would have meant around 12 days to insert the data into the database (considering existing system issues: see Section 6.3).

36 Testing of the implementation for checking URL availability confirmed cases which took more than 10 seconds to confirm the accessibility of URLs.

Moreover, both the soft decision algorithm (i.e., the distributed weights applied to instances) and the method used for resource classification could be further improved. For instance, consider the following generic citation, which is similar to examples found in the manually analysed documents and which the soft decision failed to classify:

1. James [1] proved that the method has good performance.

This example does not include any keywords as such that would enable classification of the resource referenced. However, the citation style (James [1]) indicates the Document type. Thus, regular expressions matching the pattern 'NE [NUMBER]' may be applied as an additional method alongside the use of keyword lists.
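A minimal sketch of such a pattern is given below. It is an assumption-laden illustration rather than part of ExtConX2: a capitalised word is used as a crude stand-in for an NE (in the system itself the NE would come from the NER), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the 'NE [NUMBER]' citation-style check. */
final class CitationStyle {

    // A capitalised word, optionally one space, then a bracketed number, e.g. "James [1]".
    private static final Pattern NAME_THEN_NUMBER =
            Pattern.compile("\\b\\p{Lu}\\p{L}+\\s?\\[\\d+\\]");

    /** True if the sentence contains a name-like token directly followed by [NUMBER]. */
    static boolean looksLikeDocumentCitation(String sentence) {
        return NAME_THEN_NUMBER.matcher(sentence).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDocumentCitation(
                "James [1] proved that the method has good performance."));
    }
}
```

A match could then contribute an additional weighted instance for the Document type in the soft decision, alongside the keyword-based instances.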

6.2. Role Extraction

The adopted rule-based approach to role extraction (i.e., extraction of NEs and corresponding REs) achieved a recall of 67.6%, a precision of 92.6%, and an F-score of 77.7%. The NER achieved a recall of 69.9% and a precision of 95%, and the extraction of REs achieved a recall of 75% and a precision of 97.6%. The evaluation was based on a random sample of 50 documents. From the whole PMC dataset processed, 86,751 acknowledgements were extracted, of which 71,615 were identified as containing roles.

(1) Evaluation Principles:

The evaluation was guided by the following principles:

Acknowledgements of NEs with no roles were not considered.

Acknowledgements of entities that were not individuals or organisations (e.g., laboratory staff, teams/groups, etc.) were not considered.

In addition, some acknowledgements, in particular of organisations, could have two valid REs. In such cases, either role extracted was considered a true positive. For instance, in the following example both supported and grant are considered true positives:37

1. This work was supported by NIH grant

Acknowledgements of multiple NEs with an identical RE were considered separate acknowledgements. For instance, the following acknowledgement would be considered to contain three separate roles (see Table 31):

2. We like to thank John Dough, Jim Baker, Zoe Zindan for reviewing the manuscript.

37 For the evaluation results of REs extracted, this example would be considered as containing 1 RE, and either one extracted would be considered a true positive.

Table 31 – True Positives: Role Extraction

(1) Name Entity: John Dough


(2) Name Entity: Jim Baker


(3) Name Entity: Zoe Zindan


(2) Extracted Facts

Table 32 shows the most acknowledged funding organisations within PMC. As the role extraction system does not handle acronyms prior to IE (i.e., organisations and their corresponding acronyms are extracted as separate roles), additional manual analysis was needed to produce this result. In addition, some organisations have identical names in different countries. For instance, a National Cancer Institute exists in both the US and Canada. This was not taken into consideration. However, the other organisations presented (Table 32) are unique, either by country or globally.

Table 32 – Most Acknowledged Funding Organisation

Nr.   Name of Funding Organisation                                      Total Nr. Acknowledgements
1     National Institutes of Health                                     10,613
2     National Science Foundation                                       3,099
3     Wellcome Trust                                                    2,287
4     European Union                                                    1,443
5     Deutsche Forschungsgemeinschaft                                   1,301
6     National Cancer Institute (US and Canada)                         1,114
7     Canadian Institutes of Health Research                            928
8     Biotechnology and Biological Sciences Research Council (BBSRC)    829
9     European Commission                                               746
10    National Health and Medical Research Council (NHMRC)              663
11    National Natural Science Foundation of China                      548
12    Swedish Research Council                                          538
13    Swiss National Science Foundation                                 467

6.2.1. Discussions

The overall performance of the IE task was quite poor in terms of recall. This was due to a combination of factors, the most notable being the performance of the NER. As both the RE Transducer and the Role Context Transducer (refer to Section 4.4.2) rely on the good performance of the NER, a domino effect led to the overall poor performance. A discussion of the NER, RE Transducer, and Role Context Transducer follows:

(1) NER

Two issues with the NER processing resources were the missing or partial recognition (1) of non-English names and (2) of multi-word organisation NEs.

NEs that did not adhere to the customary orthographical rules used in the English spelling of names (i.e., capitalised initials of NNPs) accounted for a significant number of cases. For instance, common examples included Italian names, e.g., Marco de Bartol (note the lower-case "de"), and Chinese names, which often adhere to English orthography but include two-letter NNPs, e.g., Hurng-Yi Wang (note the two-letter "Yi"), which were not recognised by the NER.

Another issue was the non-recognition of multi-word organisations. Some examples from the data

extracted include:

i. Ministry of Health, Labour and Welfare of Japan

ii. Ministry of Education, Science, Sports and Culture of Japan

iii. Mental Illness Research, Education and Clinical Centre

A potential approach to handling this issue would be at the level of lexical processing, such as the expansion of the gazetteer. While around 150 organisation names were added to it during the development process, this was clearly inadequate.

(2) RE Tranducer

Factors affecting the performance of the RE Transducer (labelling of collaboration and funder roles) include: (1) the poor performance of the NER system, and (2) limitations in terms of the variety of rules used.

The sole pattern used for labelling collaboration roles was (see Table 33 for explanation):38

i. [Person] [for|who|provided] [PRP]? [ROLE]

38 The given pattern is somewhat simplified, but represents the generic rule applied in the RE Transducer.

Table 33 – Description of RE Transducer Rule

Pattern              Description
[Person]             NE: person
[for|who|provided]   Word token: for, who, or provided
[PRP]?               Possessive pronoun: his, her, their, etc. (may or may not exist).
[ROLE]               The role being labelled, if and only if the preceding patterns were matched.

Thus, roles that did not adhere to the above pattern were ignored. Below is a common example identified during the evaluation of the system (the NEs are Jim Dough, John Stew, and John Crow):

i. We like to thank Jim Dough, John Stew, and John Crow from Manchester University, UK, for helping with the laboratory work.

Here, no NE directly precedes the relevant RE (i.e., helping with the laboratory work). As a result, the processing resource fails to identify the RE. See the discussion of the Role Context Transducer for an example of the RE Transducer failing to identify an RE due to the poor performance of the NER.

(3) Role Context Transducer

The performance of the Role Context Transducer is almost entirely dependent on the preceding resources, in particular the NER and the RE Transducer. The semantic level processing uses a pattern identical to that used by the RE Transducer. However, in contrast, an NE or consecutive NEs which are followed by an RE (identified by the prior processing resource) are collectively labelled as a Role Context. Given that the NER and RE Transducer have correctly identified the existing NEs and the RE, the following example illustrates the ideal result of the application of the Role Context Transducer (see highlighted text):

i. We are indebted to Brian Boyle, Mark Andersen, and Jeffrey Dean for critically

reviewing the manuscript.

However, due to a domino effect initiated by the poor performance of the NER, the performance of the Role Context Transducer, and therefore the evaluation results, were affected. The following examples illustrate a couple of common results observed during the evaluation stage (identified NEs are in bold and the identified RE is in bold and underlined):

i. We are indebted to Michel Cusson, Pierre Fobert, Frédéric Vigneault, Brian Boyle, Mark

Andersen, and Jeffrey Dean for critically reviewing the manuscript.

ii. We are indebted to Michel Cusson, Brian Boyle, Mark Andersen, and especially Jeffrey

Dean for critically reviewing the manuscript.

In the first example given, Mark Andersen is not identified as an NE by the NER process. Therefore, as the Role Context Transducer relies on either consecutive NEs39 or a single NE followed by an RE, only 1 out of 6 roles is identified by the Role Context Transducer.

In the second example, the NER processing has failed to identify Jeffrey Dean; hence, the RE Transducer is unable to identify any RE, and subsequently the Role Context Transducer fails to identify any roles.

This domino effect, initiated by the poor performance of the NER, was one of the most significant issues of the IE application. This limitation may be addressed by expanding the gazetteer and adding rules for the recognition of non-English NEs.

6.3 System Limitations

The following environment (Table 34) was used during the development and evaluation of

ExtConX2:

Table 34 - Development and Evaluation Environment

Nr. | Environment | Value
1 | Operating System | Windows 7 Home Edition 32-bit
2 | Database Server | MySQL 5.0
3 | Processor | Intel Core2 Solo 1.4 GHz
4 | Memory (RAM) | 2 GB
5 | JVM Maximum Memory | 512 MB

The following sections discuss a couple of specific software issues uncovered during the evaluation stage:

(1) URL Module

The current implementation for checking URL availability contains a bug, which is inherited from the Java API used for the check (i.e., HttpURLConnection). While the cause has not been conclusively confirmed, it seems to be servers which do not allow programmatic HTTP connections. This is assumed because none of the URLs that were manually checked were unavailable or had any syntactical issue. Furthermore, the API call freezes when trying to obtain a response code from the host to determine whether the URL is accessible. This issue can be solved by the use of threads: if no response is received within a certain amount of time, the thread can safely be terminated (without affecting any concurrent processes) and the URL can be marked for a manual check.
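The thread-based remedy suggested above could look roughly as follows. This is a minimal sketch under stated assumptions and not the ExtConX2 implementation: the class UrlAvailabilityChecker, the Status values, the HEAD request, the 5-second socket timeouts and the rule that 2xx/3xx response codes count as available are all illustrative choices. The probe is passed in as a Callable so that the timeout logic can be exercised without network access.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; not part of ExtConX2. */
final class UrlAvailabilityChecker {

    enum Status { AVAILABLE, UNAVAILABLE, MANUAL_CHECK }

    /** Builds a probe that asks the host for an HTTP response code. */
    static Callable<Integer> httpProbe(String address) {
        return () -> {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // assumed value, in milliseconds
            conn.setReadTimeout(5000);    // assumed value, in milliseconds
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        };
    }

    /** Runs the probe on a worker thread and gives up after timeoutMillis. */
    static Status check(Callable<Integer> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Integer> pending = executor.submit(probe);
        try {
            int code = pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            // Assumption: any 2xx or 3xx response counts as available.
            return (code >= 200 && code < 400) ? Status.AVAILABLE : Status.UNAVAILABLE;
        } catch (TimeoutException e) {
            pending.cancel(true);         // interrupt the stalled worker
            return Status.MANUAL_CHECK;   // no answer in time: flag for manual check
        } catch (ExecutionException e) {
            return Status.UNAVAILABLE;    // assumption: a connection error means unavailable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Status.MANUAL_CHECK;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated unresponsive host, so the demo needs no network access.
        Callable<Integer> stalled = () -> {
            Thread.sleep(10_000);
            return 200;
        };
        System.out.println(check(stalled, 500)); // prints MANUAL_CHECK
    }
}
```

A real check would call check(httpProbe(url), timeout). Setting the connect and read timeouts on HttpURLConnection may already prevent most freezes; the worker thread acts as an outer guard for cases where those timeouts do not apply.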

39 Consecutive NEs must be separated by commas or the word token: and.

(2) IE Module

The IE Application, which handles the text pre-processing, is unable to process acknowledgment paragraphs of more than 200 words in the environment used: a java.lang.OutOfMemoryError: Java heap space error is thrown because the Java Virtual Machine (JVM) heap size is insufficient. This is a known issue with the GATE API (Cunningham et al. 2010, p.35). Due to the environment used, the JVM maximum memory could not be increased; to address this issue, the Java maximum heap size needs to be set to 768MB or more.
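A simple guard can report whether the JVM was started with enough heap before the IE Application is run. The following is a minimal sketch, assuming the 768MB figure stated above as the threshold; the class HeapCheck and its methods are hypothetical and not part of ExtConX2 or GATE.

```java
/** Hypothetical start-up check; not part of ExtConX2 or GATE. */
final class HeapCheck {

    /** Threshold taken from the text above (768MB). */
    static final long RECOMMENDED_HEAP_MB = 768;

    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    static boolean isSufficient(long maxHeapBytes) {
        return toMegabytes(maxHeapBytes) >= RECOMMENDED_HEAP_MB;
    }

    public static void main(String[] args) {
        // maxMemory() is the most the JVM will try to use; on some JVMs it
        // reports slightly less than the -Xmx value, so treat it as approximate.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Maximum heap: " + toMegabytes(maxHeap) + " MB");
        if (!isSufficient(maxHeap)) {
            System.out.println("Heap may be too small; consider starting the JVM with -Xmx768m or more.");
        }
    }
}
```

The maximum heap size itself is set when the JVM is launched, using the standard -Xmx option (for example, java -Xmx768m HeapCheck).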

7. Conclusion

The aim of this project was to develop a text mining system (ExtConX2) to enable:

(1) the exploration of acknowledgements of individuals and organisations, and

(2) analysis of URL decay and most often referenced online resources.

Table 35 summarises the project aims, which have all been fully met.

Table 35 – Accomplished Project Aims

Nr. | Project Aims
1 | Design and implement a relational database (Db) schema to store extracted data.
2 | Design and implement a module to extract URLs from documents, determine if the given URL is accessible or not, determine type of resource (or URL) extracted/referenced and insert this data into a database.
3 | Design and implement a module to identify and extract funders and collaborators (i.e., persons/organisations and their respective roles) from acknowledgements and insert this data into a database.
4 | Design and implement a GUI that will facilitate exploration of system functionalities and which provides general statistics.
5 | Evaluation of the proposed methodology.

TM techniques were used to achieve the main functional requirements of the system. In particular, NLP techniques such as lexical, syntactic, and semantic level processing were used for acknowledgement extraction. In addition, a rule-based approach (JAPE) was used for semantic level processing to enable the IE task of role extraction. We differentiated between two classes of roles: funders and contributors. Finally, a combination of regular expressions and keyword lists was used for the extraction of URLs and the classification of these resources into four classes (i.e., Databank, Document, Organisation, and Software).
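The combination of regular expressions and keyword lists can be illustrated with a short Java sketch. It is an approximation under stated assumptions, not the ExtConX2 code: the URL pattern, the first-match decision rule, the UNKNOWN fallback and the class and method names are hypothetical, and only a small subset of the keywords from Table 36 is included.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch of URL extraction and keyword-based resource typing. */
final class UrlResourceSketch {

    enum ResourceType { DATABANK, DOCUMENT, SOFTWARE, ORGANISATION, UNKNOWN }

    // Assumed pattern: an HTTP(S) URL runs up to the next whitespace, bracket or quote.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s<>\"()]+", Pattern.CASE_INSENSITIVE);

    // A small subset of the Table 36 keywords, lower-cased; checked in insertion order.
    private static final Map<ResourceType, List<String>> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put(ResourceType.DATABANK,
                Arrays.asList("databank", "database", "genbank", "ncbi.nlm.nih.gov/geo"));
        KEYWORDS.put(ResourceType.DOCUMENT,
                Arrays.asList(".pdf", ".doc", "article", "journal"));
        KEYWORDS.put(ResourceType.SOFTWARE,
                Arrays.asList("software", "sourceforge", "r-project.org", "blast"));
        KEYWORDS.put(ResourceType.ORGANISATION,
                Arrays.asList("organisation", "organization", "institute", "foundation"));
    }

    /** Returns every URL found in the text, without trailing sentence punctuation. */
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group().replaceAll("[.,;:]+$", ""));
        }
        return urls;
    }

    /** Assigns the first resource type whose keyword occurs in the URL. */
    static ResourceType classify(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (Map.Entry<ResourceType, List<String>> entry : KEYWORDS.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (lower.contains(keyword)) {
                    return entry.getKey();
                }
            }
        }
        return ResourceType.UNKNOWN;
    }

    public static void main(String[] args) {
        String text = "See http://www.example.org/manual.pdf, and https://sourceforge.net/projects/demo.";
        for (String url : extractUrls(text)) {
            System.out.println(url + " -> " + classify(url));
        }
    }
}
```

With a first-match rule the order of the keyword lists matters, since a URL may contain keywords from more than one class; a scoring scheme over all matching keywords would be one way to soften that decision.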

As part of the project, we have processed a set of 190,000 full-text journal articles from PubMed Central (see footnote 40). A subset of 50 documents was manually checked to evaluate ExtConX2's performance. For URL extraction, the system achieved 98.6% precision and 96% recall. For URL resource classification, the system correctly classified 81.1% of URLs (recall) with a precision of 88.7%. For role extraction, the system achieved 92.7% precision, 67.6% recall and an F-measure of 77.7%.

Using this data, we have analysed some trends in URL decay and acknowledgments. For example, we found that URL decay can be described as a function of publication year: the older the publication, the less accessible the resources it references. We also found that most funding acknowledgements were associated with the National Institutes of Health.

40 However, the full dataset was not available in XML format; hence, roughly 120,000-130,000 articles were processed.

While prior research has produced applications similar to ExtConX2, this project has extended the scope of that research by analysing larger datasets and adopting more sophisticated approaches. For instance, Wren's (2004, 2008) study was confined solely to PubMed citations, while ExtConX2 has enabled the analysis of URL decay within full-text articles. This has enabled us to draw a more holistic conclusion with regard to the scope of URL decay within the biomedical domain. In addition, ExtConX2 is the first system to enable acknowledgement extraction within PMC.

7.1. Limitations and Future Work

The following list defines ExtConX2's limitations and provides suggestions for future enhancements:

1. The URL Module is currently only able to check HTTP URLs (i.e., http:// and https://) for availability. Additional implementation is needed for the File Transfer Protocol (FTP).

2. The IE Module extracts an organisation's name and its abbreviation as separate NEs, hence resulting in two separate roles. This could be handled by implementing additional acronym detection.

3. Soft decision and keywords for resource classification may be further studied and improved. For instance, an additional category type, laboratory tools and equipment, ought to be added.

4. Concurrent processing should be implemented to speed up the check of resource availability and to handle non-responding URLs, addressing the system issues discussed in Section 6.3.

5. Currently, the implementation only analyses acknowledgements within defined acknowledgements sections. However, other

6. The facts presented are quite limited; from the extracted data, other patterns/relationships may be uncovered, e.g., (1) the resource types and journals which are most affected by URL decay, and (2) the relationship between funding organisations and the disciplines of research most often sponsored.
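For the acronym detection suggested in item 2, one simple heuristic is to compare a candidate abbreviation with the initials of the full organisation name. The following is a minimal sketch of that heuristic only; the class AcronymSketch, its methods and the list of skipped function words are assumptions, and real acknowledgements would need a more tolerant matcher.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative initials-based acronym check; not part of ExtConX2. */
final class AcronymSketch {

    // Assumed list of function words that do not contribute a letter.
    private static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("of", "and", "for", "the", "in"));

    /** Builds the upper-case initials of a name, ignoring skipped words. */
    static String initials(String name) {
        StringBuilder builder = new StringBuilder();
        for (String word : name.trim().split("\\s+")) {
            if (word.isEmpty() || SKIPPED.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            builder.append(Character.toUpperCase(word.charAt(0)));
        }
        return builder.toString();
    }

    /** True if the candidate equals the initials of the full name. */
    static boolean isAcronymOf(String candidate, String fullName) {
        return !candidate.isEmpty() && candidate.equalsIgnoreCase(initials(fullName));
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(isAcronymOf("NIH", "National Institutes of Health"));
    }
}
```

Two NEs for which isAcronymOf returns true could then be merged into a single role instead of being stored as two.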

In addition, other topics of interest were identified during the course of this project:

1. Document representation seems to be changing. More and more documents do not provide visible/printable URLs; instead, hyperlinks encapsulating URL strings are provided.

2. It would be interesting to analyse the type of applications referenced within PMC. For instance, what types of software are referenced and what are their uses?

References

Ananiadou, S. & McNaught, J., 2006. Text Mining for Biology and Biomedicine. Artech House: London.

Ananiadou, S. et al., 2005. The National Centre for Text Mining: Aim and Objectives. Ariadne, [online] 30 Jan., (42). Available at: http://www.ariadne.ac.uk/issue42/ananiadou/ [Accessed 13 April 2010].

Appelt, E.D. & Israel, J.D., 1999. Introduction to Information Extraction Technology: A Tutorial Prepared for IJCAI-99. [Online] Available at: http://user.phil-fak.uni-duesseldorf.de/~rumpf/SS2005/Informationsextraktion/Pub/AppIsr99.pdf [Accessed 1 May 2010].

Automatic Content Extraction (ACE), 2004. Automatic Content Extraction 2004 Evaluation (ACE04). [Online] Available at: http://www.itl.nist.gov/iad/mig//tests/ace/2004/ [Accessed 10 May 2010].

Baeza-Yates, R. & Ribeiro-Neto, B., 1999. Modern Information Retrieval. Pearson Education Limited. ACM Press, New York.

Bennet, S., McRobb, S. & Farmer, R., 2006. Object-Oriented Systems Analysis and Design, 3rd ed. McGraw-Hill: London.

Berners-Lee, T., Fielding, R. & Frystyk, H., 1996. Hypertext Transfer Protocol -- HTTP/1.0. [Online] Available at: http://www.ietf.org/rfc/rfc1945.txt [Accessed 4 September 2010].

Black, J.W. et al., 2005. CAFETIERE: Conceptual Annotation for Facts, Events, Terms, Individual Entities, and Relations. Parmenides Technical Report TR-U4.3.1. [Online] Available at: http://ilk.uvt.nl/~kzervanou/dwn/TRU431.pdf [Accessed 4 September 2010].

Chinchor, N. & Sundheim, B., 1993. MUC-5 Evaluation Metrics. Proceedings of the 5th Conference of Message Understanding. Baltimore, Maryland, USA 25-27 August 1993. [Online] Available at: http://www.aclweb.org/anthology-new/M/M93/M93-1007.pdf [Accessed 9 May 2010].

Cunningham, H. et al., 2010. Developing Language Processing Components with GATE Version 5 (a User Guide). [Online] Available at: http://Gate.ac.uk/sale/tao/tao.pdf [Accessed 9 May 2010].

Cunningham, H., 2006. Information Extraction, Automatic. In: Brown, K., ed. Encyclopedia of Language & Linguistics, 2nd ed. Oxford: Elsevier.

Fayyad, U. Piatetsky-Shapiro, G. & Smyth, P., 1996. Knowledge Discovery and Data Mining: Towards a Unifying Framework. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Portland, Oregon, USA, 2-4 August 1996. [Online] Available at: http://www.aaai.org/Papers/KDD/1996/KDD96-014.pdf [Accessed 21 April 2010].

Frankling, S., 2010. XML Parser: DOM and SAX Put to the Test. [Online] Available at: http://www.devx.com/xml/Article/16922/1954 [Accessed 27 August 2010].

Frantzi, K., Ananiadou, S. & Mima, H., 2000. Automatic Recognition of Multi-word Terms. International Journal of Digital Libraries, 3(2), p.117-132. [Online] Available at: http://personalpages.manchester.ac.uk/staff/sophia.ananiadou/IJODL2000.pdf

Gerner, M. Nenadic, G. & Bergman, C. M., 2010. An Exploration of Mining Gene Expression Mentions and their Anatomical Locations from Biomedical Text. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. Uppsala, Sweden, 15 July 2010. [Online] Available at: http://www.aclweb.org/anthology/W/W10/W10-1909.pdf [Accessed 4 September 2010].

Giles, C.L. & Councill, G.I., 2004. Who gets acknowledged: Measuring scientific contribution through automatic acknowledgment indexing. PNAS, 101(51), pp.599-604.

Hahn, U. & Wermter, J., 2006. Levels of Natural Language Processing for Text Mining. In: Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.

Hearst, M.A., 1999. Untangling Text Data Mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics. College Park, Maryland, USA 20-26 June 1999. [Online] Available at: http://www.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html [Accessed 14 April 2010].

Hearst, M.A., 2003. What is Text Mining? [Online] Available at: http://www.ischool.berkeley.edu/~hearst/text-mining.html [Accessed 14 April 2010].

Hotho, A. Nurnberger, A. & Paaß, G., 2005. A Brief Survey of Text Mining. LDV-Forum, 20(1), pp.19-62.

JISC, 2006. Text Mining: Briefing Paper. [Online] Available at: http://www.jisc.ac.uk/media/documents/publications/textminingbp.pdf [Accessed 16 April 2010].

Kim, J. & Tsujii, J., 2006. Corpora and Their Annotation. In: Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.

McNaught, J. & Black, W.J., 2006. Information Extraction. In: Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.

National Institute of Health (NIH), 2010. [Online] Available at: http://www.nih.gov/icd/ [Accessed 6 August 2010].

National Library of Medicine (NLM), 2010a. Fact Sheet. [Online] Available at: http://www.nlm.nih.gov/pubs/factsheets/pubmed.html [Accessed 13 April 2010].

National Library of Medicine (NLM), 2010b. [Online] Available at: http://dtd.nlm.nih.gov/publishing/ [Accessed 25 August 2010].

National Library of Medicine (NLM), 2010c. [Online] Available at: http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/tags.html [Accessed 25 August 2010].

National Library of Medicine (NLM), 2009. Key MEDLINE® Indicators. [Online] Available at: http://www.nlm.nih.gov/bsd/bsd_key.html [Accessed 13 April 2010].

National Library of Medicine (NLM), 2008. Fact Sheet: MEDLINE®. [Online] Available at: http://www.nlm.nih.gov/pubs/factsheets/medline.html [Accessed 13 April 2010].

Polajnar, T., 2006. Survey of Text Mining of Biomedical Corpora. [Online] Available at: http://www.dcs.gla.ac.uk/~tamara/surveyoftm.pdf [Accessed 10 May 2010].

Sommerville, I., 2004. Software Engineering. 7th ed. London: Pearson.

Tateisi, Y., 2004. GENIA Corpus. [Online] Available at: http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/ [Accessed 13 May 2010].

Tsuruoka, Y. et al., 2005. Developing a Robust Part-of-Speech Tagger for Biomedical Text. Advances in Informatics: 10th Panhellenic Conference on Informatics. Volos, Greece 11-13 November 2005. [Online] Available at: http://www.springerlink.com/content/3275150j32h61345/fulltext.pdf [Accessed 14 May 2010].

Uramoto, N. et al., 2004. A text-mining System for Knowledge Discovery from Biomedical Documents. IBM Systems Journal, 43(3), pp.516-533.

Wikipedia, 2009. Extensibility. [Online] Available at: http://en.wikipedia.org/wiki/Extensibility [Accessed 22 August 2010].

Wikipedia, 2010. Research Funding. [Online] Available at: http://en.wikipedia.org/wiki/Research_funding [Accessed 6 August 2010].

Wren, D.J., 2004. 404 not found: the stability and persistence of URLs published in MEDLINE. Bioinformatics, 20(5), pp.668-672.

Wren, D.J., 2008. URL decay in MEDLINE—a 4-year follow-up study. Bioinformatics, 24(11),

pp.1381-1385.

Zelenko, D., Aone, C. & Richardella, A., 2003. Kernel Methods for Relation Extraction. Journal of Machine Learning Research, 2003(3), pp.1083-1106.

Zhou, G. Su, Jian. Zhang, Jie. & Zhang, Min., 2005. Exploring Various Knowledge in Relation Extraction. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Ann Arbor, Michigan, USA 25-30 June 2005, pp.419-426. [Online] Available at: http://www.aclweb.org/anthology-new/P/P05/P05-1053.pdf [Accessed 10 May 2010].

Appendix A – System Architecture and Design

Figure 13 - System Db EER Diagram

System Architecture

Figure 14 - ExtConX2 Architectural Design

Default ANNIE Modules

Figure 15 - ANNIE Default IE Modules (www.gate.ac.uk)

Appendix B – Implementation

Table 36 – List of Keywords for Resource Type Identification

Databank | Document | Software | Organisation
Annotate | .doc | Algorithm | Organisation
Data bank | .pdf | Application | Organization
Databank | .txt | BLAST | Institute
Database | Article | Interface | Foundation
Genbank | Artikel | Program | International Agency
geneontology.org | Biomedcentral.com | r-project.org |
ncbi.nlm.nih.gov/biosystems/ | Book | Software |
ncbi.nlm.nih.gov/cancerchromo | Chapter | Sourceforge |
ncbi.nlm.nih.gov/cdd | Conclu | System |
ncbi.nlm.nih.gov/dbEST | Content | Tool |
ncbi.nlm.nih.gov/dbvar | Data | |
ncbi.nlm.nih.gov/domains | Dictionary | |
ncbi.nlm.nih.gov/epigenomics | Doc | |
ncbi.nlm.nih.gov/gap | Document | |
ncbi.nlm.nih.gov/gds | dx.doi.org | |
ncbi.nlm.nih.gov/Genbank/ | Elsevie | |
ncbi.nlm.nih.gov/gene | Facts | |
ncbi.nlm.nih.gov/genome/ | Genomebilogogy | |
ncbi.nlm.nih.gov/genomes/FLU/ | Gudeline | |
ncbi.nlm.nih.gov/geo | Icmje | |
ncbi.nlm.nih.gov/homologene | Info | |
ncbi.nlm.nih.gov/nuccore | Interscience/wiley | |
ncbi.nlm.nih.gov/nucest | Issue | |
ncbi.nlm.nih.gov/nucgss | Journal | |
ncbi.nlm.nih.gov/omia | Molvis.org | |
ncbi.nlm.nih.gov/omim | News | |
ncbi.nlm.nih.gov/pcassay | Overview | |
ncbi.nlm.nih.gov/pccompound | Paper | |
ncbi.nlm.nih.gov/pcsubstance | Publication | |
ncbi.nlm.nih.gov/pcsubstance | Report | |
ncbi.nlm.nih.gov/peptidome | Result | |
ncbi.nlm.nih.gov/popset | Review | |
ncbi.nlm.nih.gov/probe | statistic | |
ncbi.nlm.nih.gov/projects/CCDS/ | stats | |
ncbi.nlm.nih.gov/projects/gensat/ | table | |
ncbi.nlm.nih.gov/projects/sky/ | Vol | |
ncbi.nlm.nih.gov/projects/SNP | Volume | |
ncbi.nlm.nih.gov/protein | Wikipedia.org | |
ncbi.nlm.nih.gov/proteinclusters | | |
ncbi.nlm.nih.gov/RefSeq/ | | |
ncbi.nlm.nih.gov/SNP | | |
ncbi.nlm.nih.gov/Structure/ | | |
ncbi.nlm.nih.gov/Structure/VAST/ | | |
ncbi.nlm.nih.gov/taxonomy | | |
ncbi.nlm.nih.gov/unigene | | |
ncbi.nlm.nih.gov/unists | | |
ncbi.nlm.nih.gov/VecScreen/ | | |
pubchem.ncbi.nlm.nih.gov/ | | |

Appendix C – Evaluation Data

Table 37 – URL Extraction Data

PMCID | Total Nr. URLs | Extracted URLs | Duplicate URLs | Correct Resource Type Identified (of Extracted URLs)

PMC2413013 4 4 0 2

PMC2761731 4 3 1 1

PMC1988857 9 9 0 1

PMC2752617 4 3 0 3

PMC2764095 9 9 0 9

PMC2661364 3 4 1 3

PMC2111041 2 2 0 2

PMC1919404 41 41 0 40

PMC2533341 2 2 0 2

PMC1779804 6 4 0 4

PMC1525208 3 3 0 1

PMC2768983 6 6 0 3

PMC2731543 4 4 0 4

PMC2801496 2 2 0 0

PMC1624845 1 1 0 1

PMC1839892 1 1 0 1

PMC2206495 4 3 0 3

PMC2239252 1 1 0 1

PMC2685015 6 4 0 1

PMC2440928 2 2 0 2

PMC2478650 5 5 0 4

PMC2793031 1 1 0 1

PMC1994066 3 3 0 3

PMC2515323 2 2 0 2

PMC2765943 4 4 0 3

PMC1599749 5 5 0 4

PMC2570968 8 9 1 9

PMC2787492 7 7 0 7

PMC2806257 5 5 0 3

PMC1805747 2 2 0 2

PMC2276520 7 7 0 6

PMC2600755 4 4 0 4

PMC2071966 2 2 0 1

PMC1266361 1 1 0 1

PMC2755136 4 4 0 2

PMC2600409 2 2 0 2

PMC2405930 1 1 0 1

PMC1851970 2 2 0 2

PMC1698487 6 5 0 5

PMC2671451 2 2 0 2

PMC2759026 4 4 0 2

PMC2627827 1 1 0 1

PMC441568 6 6 0 5

PMC1797064 6 6 0 4

PMC2657239 4 4 0 4

PMC151303 3 3 0 3

PMC2018828 10 10 0 10

PMC1790700 4 4 0 1

PMC2791112 2 2 0 1

PMC2740322 1 1 0 1

Table 38 – Role Extraction Data

PMCID | Nr. Relevant Roles | True Positives | Nr. Partially Extracted Roles | False Positives

PMC2750102 5 0 1 2

PMC2761731 3 2 0 0

PMC2246224 2 2 0 0

PMC519127 14 13 0 0

PMC2293642 4 2 0 0

PMC2759026 2 2 0 1

PMC2688212 2 2 0 0

PMC2588630 7 2 0 0

PMC2718519 5 2 1 0

PMC1885552 5 2 1 0

PMC545072 4 1 0 0

PMC1940049 3 3 0 0

PMC2528195 7 4 0 0

PMC1819381 5 5 0 0

PMC1805747 2 2 0 0

PMC2442612 7 5 1 0

PMC1712367 8 7 0 0

PMC2453772 13 8 0 0

PMC2672046 5 5 0 0

PMC2734341 1 1 0 0

PMC2779906 2 2 0 0

PMC2291575 2 2 0 0

PMC2533119 9 8 0 0

PMC2764095 6 4 0 0

PMC2082466 9 2 0 0

PMC2709726 8 1 0 0

PMC102553 3 2 1 0

PMC2121139 8 4 0 0

PMC2658886 4 4 0 0

PMC2734340 2 1 0 0

PMC2186343 2 0 1 0

PMC166148 5 2 0 0

PMC1616969 5 4 0 0

PMC222959 5 5 0 0

PMC2246224 2 2 0 0

PMC2702309 3 2 0 0

PMC1379658 3 2 1 1

PMC102419 4 3 0 0

PMC2391254 5 4 0 0

PMC2751461 3 3 0 1

PMC128935 4 1 0 0

PMC2427038 4 4 0 0

PMC546163 8 5 0 0

PMC2759976 1 1 0 0

PMC2714901 4 4 0 0

PMC2532720 5 4 1 0

PMC1481595 6 3 0 0

PMC2671166 3 3 0 0

PMC1459217 4 4 0 0

PMC2738522 5 5 0 0

Table 39 – Role Expression Extraction Data

PMCID | Nr. Relevant REs | True Positives | Nr. Partially Extracted REs | False Positives

PMC2750102 4 3 0 0

PMC2761731 3 2 0 0

PMC2246224 2 2 0 0

PMC519127 5 4 0 0

PMC2293642 4 2 0 1

PMC2759026 2 3 0 0

PMC2688212 2 2 0 0

PMC2588630 3 2 0 0

PMC2718519 5 3 0 0

PMC1885552 4 3 0 0

PMC545072 4 1 0 0

PMC1940049 3 3 0 0

PMC2528195 7 2 0 0

PMC1819381 5 5 0 0

PMC1805747 2 2 0 0

PMC2442612 6 5 0 0

PMC1712367 4 3 0 0

PMC2453772 7 5 0 0

PMC2672046 2 2 0 0

PMC2734341 1 1 0 0

PMC2779906 2 2 0 0

PMC2291575 2 2 0 0

PMC2533119 3 2 0 0

PMC2764095 3 1 0 0

PMC2082466 3 1 0 0

PMC2709726 1 1 0 0

PMC102553 3 2 1 0

PMC2121139 4 2 0 0

PMC2658886 1 1 0 0

PMC2734340 2 1 0 0

PMC2186343 1 0 1 0

PMC166148 3 2 0 0

PMC1616969 4 3 0 0

PMC222959 2 2 0 0

PMC2246224 1 1 0 0

PMC2702309 1 1 0 0

PMC1379658 2 1 0 0

PMC102419 3 3 0 0

PMC2391254 4 3 0 0

PMC2751461 1 1 0 0

PMC128935 1 1 0 0

PMC2427038 3 3 0 0

PMC546163 6 5 0 0

PMC2759976 1 1 0 0

PMC2714901 3 3 0 0

PMC2532720 4 3 0 0

PMC1481595 5 3 0 0

PMC2671166 2 2 0 0

PMC1459217 3 3 0 0

PMC2738522 2 2 0 0

Table 40 – Named Entity Extraction Data

PMCID | Nr. Relevant NEs | True Positives | Nr. Partially Extracted NEs | False Positives

PMC2750102 5 0 1 3

PMC2761731 3 2 0 0

PMC2246224 2 2 0 0

PMC519127 14 13 0 0

PMC2293642 4 2 0 0

PMC2759026 2 2 0 1

PMC2688212 2 2 0 0

PMC2588630 7 2 0 0

PMC2718519 5 3 0 0

PMC1885552 5 2 1 0

PMC545072 4 1 0 0

PMC1940049 3 3 0 0

PMC2528195 7 4 0 0

PMC1819381 5 5 0 0

PMC1805747 2 2 0 0

PMC2442612 7 6 0 0

PMC1712367 8 7 0 0

PMC2453772 13 8 0 0

PMC2672046 5 5 0 0

PMC2734341 1 1 0 0

PMC2779906 2 2 0 0

PMC2291575 2 2 0 0

PMC2533119 9 8 0 0

PMC2764095 6 4 0 0

PMC2082466 9 1 0 1

PMC2709726 8 1 0 0

PMC102553 3 3 0 0

PMC2121139 8 4 0 0

PMC2658886 4 4 0 0

PMC2734340 2 1 0 0

PMC2186343 2 1 0 0

PMC166148 5 2 0 0

PMC1616969 5 4 0 0

PMC222959 5 5 0 0

PMC2246224 2 2 0 0

PMC2702309 3 2 0 0

PMC1379658 3 3 0 1

PMC102419 4 3 0 0

PMC2391254 5 4 0 0

PMC2751461 3 3 0 1

PMC128935 4 1 0 0

PMC2427038 4 4 0 0

PMC546163 8 5 0 0

PMC2759976 1 1 0 0

PMC2714901 4 4 0 0

PMC2532720 5 5 0 0

PMC1481595 6 3 0 0

PMC2671166 3 3 0 0

PMC1459217 4 4 0 0

PMC2738522 5 5 0 0