
A Software Engineering and Modeling Perspective of Biomedical Ontologies

Morris Perro Rosa
Division of Information Systems, Gaggle

[email protected]

Abstract In health and biomedical informatics, ontologies have gained immense attention over the past decade because they capture semantic knowledge spanning multiple health care domains such as genomics, proteins, diseases, symptoms, treatments, and procedures. As these ontologies encapsulate extensive, machine-processable medical knowledge, diverse health care applications rely on them for their core functionality. However, a fundamental gap exists between biomedical ontology development and its resulting ontological models on the one hand, and the software design process and its resulting models on the other. This article analyzes this gap and proposes a scalable solution by reviewing five highly visible biomedical ontologies from the perspective of software modeling and engineering. The analysis examines clinical biomedical ontologies that are prominent (particularly in the United States) and evaluates them against the identified fundamental software modeling concepts. Finally, the article provides an in-depth analysis of the evaluation and draws conclusions about the drawbacks biomedical ontologies face and possible solutions from the perspective of software design and modeling.

1. INTRODUCTION

The term ontology was coined in philosophy, where it is defined as “a philosophical study of the nature of being, categorization of being and their relations” (Smith, Ontology and Information Systems, 2003; Fonseca, 2007). Later, the information sciences borrowed the term and defined it as “a specification of conceptualization” (Gruber, 1995), further refining it as “a consensus (partial) agreement on semantics of the domain conceptualization” [1–3]. The term “ontology” is also frequently used to refer to diverse knowledge representations such as classifications, terminologies, hierarchies, taxonomies, class definitions and relationships, and first-order-logic-based representations. In the biomedical domain, ontologies are developed to capture biomedical knowledge semantics that are interoperable, i.e., the semantics must be processable (machine understandable and executable), readable (possessing a human-readable representation), and shareable (exchangeable and reusable across multiple applications without loss of semantic definition).

With the introduction of the HITECH (Health Information Technology for Economic and Clinical Health) Act in 2009 [4–6], medical institutions were required to transition from paper-based patient records to Electronic Health Records (EHR), primarily to improve patient care and substantially reduce healthcare costs. This digitization of medical records has given rise to interoperability issues at various levels. For example, consider a simple scenario where a patient, Mr. Jones, visits his primary care physician, who notices that Mr. Jones is suffering from high blood pressure. In the patient’s medical notes the physician writes “HBP” and prescribes the drug “Dyazide” for the condition. On a later visit to a cardiac specialist, who reads the notes made by Mr. Jones’ physician, decoding the term “HBP” may be problematic and, in many cases, the patient may not be able to provide detailed information either. This creates semantic interoperability and interpretation issues between medical professionals when medical data is exchanged, and the effect compounds if the patient suffers from a chronic illness and visits multiple physicians who try multiple medications, treatment plans and procedures over time. To overcome such scenarios, the medical community has developed biomedical ontologies that capture the semantics related to healthcare, which can be shared uniformly between multiple health professionals and, as technology advances, between interacting software application systems.

In the past decade, numerous biomedical ontologies have been developed that capture semantic knowledge on diverse medical domains. Examples include: International Classification of Diseases (ICD) (ICD, 2010) – the standard for epidemiology, health management and clinical purposes, including the analysis of the general health situation of population groups; Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT) – the most comprehensive, multilingual clinical healthcare terminology, which contributes to the improvement of patient care by underpinning the development of EHRs that record clinical information in ways that enable meaning-based retrieval; Unified Medical Language System (UMLS) – a comprehensive tool that integrates multiple standard terminologies, classifications and coding standards to create more effective and interoperable health terminology and services [6–8]; Logical Observation Identifiers Names and Codes (LOINC) (LOINC, 2013) – a universal code system for identifying laboratory and clinical observations; Disease Ontology [9] – a standardized ontology for human disease; and Gene Ontology (GO)¹ – a standardized controlled vocabulary of terms for describing gene and gene product attributes across species and databases. However, most of these biomedical ontologies are not disjoint. Saripalle and Demurjian (Saripalle & Demurjian, Attaining Semantic Enterprise Interoperability through Ontology Architectural Patterns, 2014) have shown that most medical ontologies cut across each other by representing (both structurally and semantically) medical knowledge about the same domain (Disease, Symptom, Procedure, Treatment, etc.) quite differently, thus creating interoperability issues among the standards. Figure 1 [6] supports this argument by illustrating the domain knowledge overlap between the above-stated ontologies.

Notice that the ICD and DSM ontologies overlap on the domain of Mental Disorders, while SNOMED-CT and ICD have a major knowledge overlap on multiple domains such as Disease, Symptom, Procedure, and Findings. Further, SNOMED-CT, ICD and OMIM² overlap on the domain of Disease. More significantly, UMLS attempts to encompass all of these standards under one umbrella via its own theory, but fails to reuse existing ontological models or to provide modular domain models for the respective domains that could be reused independently of the biomedical application. Failing to reuse existing knowledge sources and developing new ontologies that target the same domain knowledge thus leads to chaotic structural and semantic interoperability issues [3]. One approach that reduces interoperability issues and promotes reuse is to design abstract ontological models that can be reused in multiple medical ontologies. For example, if modular ontological models existed for describing the domains of Disease and Symptom, as well as their interactions, the biomedical ontologies SNOMED-CT, ICD, and DSM (from Figure 1) could use the same models for their ontological definitions, thus reducing interoperability issues among them. This argument can be extended to other biomedical ontologies spanning multiple domains.

1 http://www.geneontology.org/
2 http://www.ncbi.nlm.nih.gov/omim


Figure 1: Knowledge overlap between biomedical ontologies.

In the history of computing, software engineering has established modeling as a de facto fixed step that has to be completed successfully before moving on to the implementation of the application. In this context, a domain model [10] is an abstract conceptualized solution, i.e., a blueprint for a software application system, similar to an architectural blueprint for buildings or a mechanical diagram for automobile development. To support domain modeling, the software community has contributed along multiple dimensions: standardized modeling frameworks (Unified Modeling Language (UML) (Booch, Rumbaugh, & Jacobson, 2005), Entity Relationship Diagrams (ERD), XML, etc.), standardized domain-specific modeling frameworks (VHSIC Hardware Description Language (VHDL), HyperText Markup Language (HTML), Structured Query Language (SQL), etc.), software modeling processes (spiral model, agile methodology, scrum, etc.), design patterns [11], extensible modeling tools (Eclipse, IBM Rational Rose, and NetBeans) and their frameworks (Eclipse EMF, GMF, etc.). For example, in database development, ERDs are designed first and require consensual agreement before they are converted into executable SQL scripts; in object-oriented programming, object-oriented models are designed before they are converted into executable software code; and on the internet, XML schemas are designed to provide the structure and semantics of the information to be exchanged.

The research presented in this article evaluates a selected number of highly visible clinical biomedical ontologies, i.e., ontologies that are currently employed in real healthcare applications (practice management systems, EHRs, Personal Health Records (PHR), laboratory management systems, etc.) and that are constantly monitored, maintained and researched in the healthcare and biomedical community. The primary intent of this discussion is to gauge these clinical ontologies against fundamental software modeling concepts such as meta-models, domain-specific meta-models, design patterns, and domain models/schemas, all of which promote reuse and increased interoperability [6,11]. To achieve this goal, the rest of this article is organized as follows. Section 2 discusses the software modeling and engineering process using an example and identifies fundamental software modeling concepts. Section 3 identifies and describes in detail the selected clinical ontologies. Based on Section 2 and Section 3, Section 4 evaluates the selected clinical biomedical ontologies against the identified fundamental software concepts. Section 5 concludes the article.

2. Fundamental Concepts of Software Modeling

This section provides an overview of the software modeling process, the steps/phases involved and the foundational concepts of software design. In more detail, Section 2.1 provides background on the software modeling process and the general steps a software developer takes while following it, and Section 2.2 identifies, defines and explains the foundational software concepts involved in the modeling process. These concepts are later employed to evaluate the selected biomedical ontologies.

2.1. Software Modeling

For decades, engineers, scientists, financial analysts, and other professionals who build complex structures or systems have been producing designs/models of their creations. Sometimes the models are physical, such as scaled mock-ups of airplanes or houses, while other models are less tangible, as seen in business financial models, electrical circuit diagrams, optimization models, etc. In all cases, a model serves as an abstraction: an approximate representation of the real item that is being built. In the domain of software systems, modeling provides software engineers with the means to better understand the problem at hand and permits assessment of different options and paths for crafting a smart solution. Furthermore, it allows visualizing the entire system and communicating its design to the target audience, thereby aiding testing and checking against requirements before an actual instance of a system has been built, and consequently well before technical, financial, and resource-consuming risks have been taken.

Software modeling takes a top-down approach that involves the following steps. First, choosing the modeling paradigm: the engineer determines a software paradigm to adopt for designing a viable solution, for example, object-oriented modeling, a graph paradigm, a logic-based language, database diagrams, or NoSQL structures. Second, choosing the modeling framework: a structured model with well-defined semantics is chosen based on the software modeling paradigm for building a viable abstract solution, for example, UML (an object-oriented modeling framework) or ERD (for database diagrams). Third, conceptual modeling: designing conceptual domain models using the modeling framework that capture the structure and semantics of the intended solution based on the application requirements. Note that the domain models in this stage are still platform/language independent. Finally, implementation: realizing the designed models in a specific target language such as Java, SQL, or JavaScript. Figure 2 renders the discussed software modeling approach. The core phase that primarily influences the proposed solution and the application itself is conceptual domain modeling, as this step designs the abstract domain models that capture the structure and semantics of the application entities, which, together with their attributes, roles, relationships, and constraints, govern the software application.

Figure 2: The general steps of a software modeling process.

To illustrate the process shown in Figure 2, consider an example from the academic domain (e.g., a university or college), where the goal is to model an academic institution’s personnel and their roles, attributes and interactions. For the sake of simplicity, we consider only a limited set of roles. The following requirements shall be captured by a well-structured model: all personnel must be described, identifiable and associated with a department; students must be able to enroll in classes that are defined by the faculty; and administrative personnel can change personal information. Based on these business requirements, a software developer is likely to adopt an object-oriented paradigm (Step 1) for modeling the solution. Currently, the de facto standard for designing object-based abstract domain models is the Unified Modeling Language (UML) (Step 2). UML, a meta-model, is a modeling framework providing modeling abstractions that can be instantiated to define domain models.

Figure 3 shows a UML-based class diagram or domain model (Step 3) for the aforementioned academic problem, aggregated from three packages. First, the Person Package (an instance of the UML Package modeling concept) has the concept Person (an instance of the UML Class modeling entity) as its top class, described with various attributes (name, id, tax-id, address, phone, etc., which are instances of the UML Property modeling entity). This top class is specialized to define the different personnel involved in academia. Second, the Department Package contains the domain model required to capture the structure and semantics of the various departments associated with an academic institution. The domain model has the concept Department as its top class, described with various attributes (name, id, location, etc.), and is specialized to define multiple academic departments. Third, the Course Package holds the domain model for describing courses and has the concept Course as its top class, which is described with attributes such as courseId and courseName and is further specialized to define other courses. These packages are imported and interconnected using associations (also called interactions or relationships) such as advises, teaches, and attends (instances of the UML Association modeling entity) to define the abstract solution model for the illustrated academic problem. The designed domain models can now be implemented in any object-oriented programming language (e.g., Java or .NET) and deployed as a software application for use in multiple academic institutions (Step 4). While implementing the designed model, the developer might also leverage software design patterns. Design patterns are general reusable solutions to a commonly occurring problem within a given context. They are not finished designs or implementations that can be transformed directly into source or machine code, but rather a description or a template for how to solve a design problem that can be used in many different situations. Design patterns are discussed in detail in Section 2.2.3. For example, in the rendered design model (Figure 3), all students enrolled in a course must be able to see materials distributed by the respective faculty. This task can be solved with the Observer pattern, which describes how to automatically notify observers of any state change in content they have subscribed to. Applying this pattern to the academic context, the students (the observers) should be notified when any new material or assignment (an event) is published by the faculty (the publisher). Similarly, other categories of design patterns such as architectural patterns, creational patterns, and behavioral patterns can be used for implementing the described domain model(s).
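The Observer idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the class names Student and Course and the methods enroll, publish and notify are hypothetical and are not taken from Figure 3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Student:
    """Observer: stores every notification it receives (hypothetical class)."""
    name: str
    inbox: List[str] = field(default_factory=list)

    def notify(self, course_name: str, material: str) -> None:
        self.inbox.append(f"{course_name}: {material}")


@dataclass
class Course:
    """Subject: tracks enrolled students and notifies them of new material."""
    name: str
    students: List[Student] = field(default_factory=list)
    materials: List[str] = field(default_factory=list)

    def enroll(self, student: Student) -> None:
        # Subscribing: the student will be told about future publications.
        self.students.append(student)

    def publish(self, material: str) -> None:
        # State change on the subject triggers a notification of all observers.
        self.materials.append(material)
        for student in self.students:
            student.notify(self.name, material)


course = Course("Software Engineering")
alice, bob = Student("Alice"), Student("Bob")
course.enroll(alice)
course.publish("Assignment 1")           # only Alice is enrolled at this point
course.enroll(bob)
course.publish("Lecture notes, week 2")  # both students are notified
```

In this sketch a student only receives material published after enrollment; a real application might instead replay earlier material on enrollment, which is a design choice the text leaves open.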

The primary points that have to be noticed from this discussion are:

A sound meta-model such as UML, ERD, or the XML Schema specifications is required to define a well-structured abstract domain model. The meta-model is more acceptable if developers can customize it, i.e., define new meta-modeling constructs by extending core meta-modeling constructs based on the developer's domain requirements and specifications.

The packaged modular domain models (Research Module, Course Module, Person Module, and Department Module) can be reused in any other application. For example, the Research Package can be reused in any academic application or industrial research application.

Software design patterns provide reusable solutions that can be applied to multiple problems within a given context, irrespective of the domain.

The domain model provides a schema/model with well-defined structure and semantics to capture the real data of the domain problem. At no point in the design have the domain instances (e.g., John Smith, David, Software Engineering, Algorithms, etc.) been involved in, or influenced, the design of the domain models.

Figure 3: An object-oriented domain model for the academic domain.
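As a rough illustration of Step 4, the fragment below sketches how a small part of such a domain model might be realized in Python. It is an assumption-laden sketch rather than a transcription of Figure 3: the field names, the method names define_course and enroll, and the restriction to Person, Faculty, Student, Department and Course are all chosen here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Department:
    dept_id: str
    name: str


@dataclass
class Course:
    course_id: str
    course_name: str


@dataclass
class Person:
    """Top class: every person is identifiable and belongs to a department."""
    person_id: str
    name: str
    department: Department


@dataclass
class Faculty(Person):
    teaches: List[Course] = field(default_factory=list)

    def define_course(self, course_id: str, course_name: str) -> Course:
        # Requirement from the text: classes are defined by the faculty.
        course = Course(course_id, course_name)
        self.teaches.append(course)
        return course


@dataclass
class Student(Person):
    attends: List[Course] = field(default_factory=list)
    advisor: Optional[Faculty] = None   # the "advises" association

    def enroll(self, course: Course) -> None:
        self.attends.append(course)


# Instance data only appears after the model has been defined.
cs = Department("D1", "Computer Science")
prof = Faculty("P1", "John Smith", cs)
algorithms = prof.define_course("CS201", "Algorithms")
david = Student("P2", "David", cs, advisor=prof)
david.enroll(algorithms)
```

The classes are written without reference to any particular person or course, which mirrors the point made above that domain instances do not influence the design of the model.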

2.2. Fundamental Concepts of Software Modeling and Engineering

This section identifies, describes, and exemplifies the fundamental concepts involved in software modeling, based on the software modeling process. To standardize the discussion, the presentation of each concept follows a consistent structure: Description – provides the definition or a brief explanation of the modeling concept; Explanation & Examples – provides a detailed explanation of the modeling concept; and Implementation – exemplifies implementations of the modeling concept. The identified concepts are as follows.

2.2.1. Meta-Model

Description: A Meta-Model (MM) is an abstract or simplified descriptive model for defining other descriptive domain model(s).

Explanation and Examples: A meta-model is a framework that assists in designing abstract domain models (e.g., the elements of a domain model such as the one in Figure 3 are instances of the meta-model entities). Each research specialization may have one or more well-defined meta-modeling frameworks that can be exploited to define abstract domain models. The object-modeling methodology has the Meta Object Facility (MOF), which allows engineers to define other sound object-oriented modeling frameworks. The object-oriented paradigm has the Unified Modeling Language (UML), a modeling framework for defining object-oriented abstract domain models (Figure 3). These domain models are implemented or realized using object-oriented programming languages such as Java, .NET, or PHP. The database research specialization has Entity Relationship Diagrams (ERD), a conceptual framework that allows designers to define database diagrams that are later converted into SQL scripts, which in turn can be executed on diverse database platforms, e.g., MS SQL, MySQL, and Oracle. The domain of modeling tool development has the Eclipse Modeling Framework (EMF), a modeling and code generation framework for building tools in the Eclipse environment based on the Eclipse data model. The domain of model transformation is governed by MOF Query/View/Transformation (QVT), a set of standard languages for model transformation defined by the Object Management Group (OMG). The domain of software requirements has the Software Requirements Specification (SRS)³, a standard for software requirement definition, analysis and design. The domain of information exchange has eXtensible Markup Language (XML) and DTD⁴, to name a few. Software designers can exploit these meta-modeling frameworks to define a sound domain model, i.e., one with well-defined structure and semantics, that can encapsulate the domain problem.

Implementation: The domain model rendered in Figure 3 is a model designed using the UML meta-model for solving the academic domain problem discussed in Section 2.1. The HL7 CDA [5,12], a standard for representing and exchanging health information, and XACML (Anderson, 2008), a standard for defining and enforcing access control policies, are based on the XML Schema specifications. Similarly, the object meta-models UML and Common Warehouse Metamodel (CWM) [13], a specification for modeling data warehouse environments, are both instances of the MOF model.
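The relationship between a meta-model, a domain model and instance data can be made concrete with a toy example. The sketch below is a deliberately simplified assumption, not UML or MOF: MetaClass and MetaProperty are hypothetical stand-ins for meta-model entities, the Person and Student objects are domain model elements created by instantiating them, and the final record is instance data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MetaProperty:
    """Toy meta-model entity standing in for a property/attribute construct."""
    name: str
    type_name: str


@dataclass
class MetaClass:
    """Toy meta-model entity standing in for a class construct."""
    name: str
    properties: List[MetaProperty] = field(default_factory=list)
    parent: Optional["MetaClass"] = None

    def all_properties(self) -> List[MetaProperty]:
        # Properties of the parent come first, mimicking specialization.
        inherited = self.parent.all_properties() if self.parent else []
        return inherited + self.properties

    def instantiate(self, **values: object) -> Dict[str, object]:
        # Instance data must supply exactly the properties the model declares.
        expected = {p.name for p in self.all_properties()}
        if set(values) != expected:
            raise ValueError(f"{self.name} expects properties {sorted(expected)}")
        return {"__class__": self.name, **values}


# Domain model level: elements are instances of the meta-model entities.
person = MetaClass("Person", [MetaProperty("name", "String"),
                              MetaProperty("id", "String")])
student = MetaClass("Student", [MetaProperty("major", "String")], parent=person)

# Instance level: real data that conforms to the domain model.
record = student.instantiate(name="David", id="S1", major="Computer Science")
```

The three levels stay separate: changing the record does not change the Student model, and changing the Student model does not change what MetaClass is.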

2.2.2. Profile

Description: A Profile is a meta-model tailored to a specific domain that extends the entities of a core/foundational meta-model. The Profile is primarily defined to create new meta-modeling constructs based on established domain requirements in a strictly additive manner, without conflicting with the core meta-model semantics. As the Profile acts as a meta-model (Step 2, Figure 2), it is generally employed to define domain models related to the profile's domain.

Explanation and Examples: A Profile is a specialized meta-model customized to the requirements and specifications of a domain. Among object-oriented meta-models, UML is one powerful meta-modeling framework that provides this functionality, through UML Profiles. This extension mechanism allows software designers to define a domain-specific UML meta-model (Fuentes-Fernández & Vallecillo-Moreno, 2004). A UML Profile is defined in terms of three basic constructs: Stereotypes – extend the UML meta-modeling concepts in order to create new model elements; Constraints – associated with stereotypes, impose restrictions on the user-defined meta-modeling entities; and Tagged Values – additional meta-attributes that are attached to a Stereotype. The Object Management Group (OMG), the developer and maintainer of UML, defines and maintains a few standard UML Profiles for various domains, such as the EJB UML Profile – a UML profile for modeling with Enterprise Java Beans (EJB), SoaML – a UML profile for modeling SOA-based web services, and the UML Testing Profile – a UML profile for defining testing models. Figure 4a shows a snapshot from the SoaML UML profile, where the concepts Participant and Agent are profile classes (stereotypes) identified by the keyword <<stereotype>>. The stereotype Participant extends the core UML Class meta-modeling construct, identified by the keyword <<metaclass>>, and is specialized to define another stereotype, Agent. These SoaML stereotypes are later used for developing a domain model as shown in Figure 4b, where the domain class Requestor is of type Agent and represents instance data such as Web Request. Booch et al. and Fuentes-Fernández & Vallecillo-Moreno provide detailed semantics of UML Profiles, their design and implementation. Currently, meta-models in other domains do not provide a comparable mechanism.

3 http://ieeexplore.ieee.org/xpl/abstractCitations.jsp?arnumber=720574
4 http://www.w3.org/TR/xhtml1/dtds.html

Figure 4: A snapshot of SoaML UML Profile and modeling using the same.

Implementation: As stated in the explanation, the OMG defines standard UML Profiles for various domains. For example, TelcoML⁵ – a UML profile for telecommunication standards and services, SPTP⁶ – a UML profile for Schedulability, Performance and Time, and MARTE⁷ – a UML profile for modeling and analysis of real-time and embedded systems.
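To make the three profile constructs (stereotypes, constraints and tagged values) more tangible, the following Python sketch mimics them in a very reduced form. It is not an implementation of UML Profiles or of SoaML: the stereotype names Participant and Agent and the class Requestor follow the text, but the tag names endpoint and autonomous and the non-empty-endpoint constraint are purely hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A constraint is modeled as a predicate over the tagged values of an element.
Constraint = Callable[[Dict[str, object]], bool]


@dataclass
class Stereotype:
    name: str
    extends: str                                   # extended metaclass, e.g. "Class"
    parent: Optional["Stereotype"] = None          # specialization of a stereotype
    tag_definitions: List[str] = field(default_factory=list)
    constraints: List[Constraint] = field(default_factory=list)

    def all_tags(self) -> List[str]:
        inherited = self.parent.all_tags() if self.parent else []
        return inherited + self.tag_definitions

    def all_constraints(self) -> List[Constraint]:
        inherited = self.parent.all_constraints() if self.parent else []
        return inherited + self.constraints


@dataclass
class StereotypedClass:
    """A domain model class to which a stereotype has been applied."""
    name: str
    stereotype: Stereotype
    tagged_values: Dict[str, object] = field(default_factory=dict)

    def is_valid(self) -> bool:
        # Only declared tags may be used, and every constraint must hold.
        if not set(self.tagged_values) <= set(self.stereotype.all_tags()):
            return False
        return all(check(self.tagged_values)
                   for check in self.stereotype.all_constraints())


# Hypothetical tag and constraint: a participant needs a non-empty endpoint.
participant = Stereotype(
    name="Participant",
    extends="Class",
    tag_definitions=["endpoint"],
    constraints=[lambda tags: bool(tags.get("endpoint"))],
)
agent = Stereotype(name="Agent", extends="Class", parent=participant,
                   tag_definitions=["autonomous"])

requestor = StereotypedClass(
    "Requestor", agent,
    {"endpoint": "http://example.org/request", "autonomous": True},
)
```

The extension is additive in the sense described above: Agent adds a tag to what Participant already defines and inherits its constraint, while the extended metaclass itself is left untouched.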

2.2.3. Design Patterns

Description: A design pattern can be defined as a general, interoperable solution to a problem that commonly occurs in multiple domains within a given context. Design patterns are not solutions that can readily be transformed into source or machine code, but templates (sometimes expressed in natural language) for how to solve a common and recurring problem.

Explanation and Examples: In software engineering, design patterns are formalized best practices that a software designer can follow while designing and implementing an application. The most popular design patterns are object-oriented and express relationships/interactions between classes/objects using fundamental concepts of the object-oriented paradigm such as polymorphism and inheritance. They achieve this without specifying the target application classes/objects, i.e., they are domain independent. Design patterns are frequently grouped into three categories: Creational Patterns – provide object creation mechanisms suited to a given situation; Structural Patterns – ease software design by identifying simple ways to realize relationships/interactions between classes/objects; and Behavioral Patterns – identify and realize common communication mechanisms between classes/objects. Additionally, some patterns are grouped in the Architectural Patterns category, which contains patterns for solving problems at the software architecture level.

5 UML Profile For Advanced And Integrated Telecommunication Services, http://www.omg.org/spec/TelcoML/
6 UML Profile For Schedulability, Performance, And Time, http://www.omg.org/spec/SPTP/
7 UML Profile For Modeling And Analysis Of Real-Time Embedded Systems, http://www.omg.org/spec/MARTE/

Some examples of commonly used design patterns follow. Model–View–Controller (MVC) is an architectural pattern for implementing user interfaces that divides an application into three interconnected components: a Model that manages the behavior and data of the application; a View that manages the UI of the application; and a Controller that interprets user actions and relays them to the model and/or the view. Depending on the application, the domain models take the place of the respective MVC components. The Singleton pattern is a creational pattern that restricts the instantiation of a class to a single object. This pattern is useful when exactly one object is needed to coordinate actions across the system, such as a database connection, network connectivity, or collaborative user sessions. The Facade pattern is a structural pattern that provides a unified higher-level interface to a set of sub-interfaces in a system, easing their use. The Publisher/Subscriber pattern is a behavioral pattern that allows a loose dependency relationship between entities (publishers and subscribers), where a state change in one object (the publisher) results in all its dependents (the subscribers) being notified and updated automatically.

Implementation: As design patterns are domain and platform/language independent, they are generally realized during the model implementation using programming languages (e.g. Java, .NET, etc.) based on the application requirements (Step 4, Figure 2). For example, Section 2.1 employs Observer/Publisher-Subscriber Pattern to implement interactions between faculty and students.
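As one possible realization of a pattern at implementation time, the Singleton pattern mentioned above can be sketched in Python as follows. The class name DatabaseConnection and its execute method are hypothetical, the sketch only records queries instead of talking to a database, and it makes no attempt at thread safety.

```python
class DatabaseConnection:
    """Singleton sketch: every construction returns the same shared object."""

    _instance = None

    def __new__(cls):
        # Create the single instance lazily on first use, then reuse it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.queries = []
        return cls._instance

    def execute(self, sql: str) -> None:
        # Stand-in for real database access: just remember the statement.
        self.queries.append(sql)


first = DatabaseConnection()
second = DatabaseConnection()
first.execute("SELECT 1")
# Both names refer to the same object, so the query is visible through either.
```

Other languages typically realize the same idea with a private constructor and a static accessor; the pattern itself is independent of that choice.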

2.2.4. Domain Model/Schema(s)

Description: A domain model (a term that also covers schemas) is a conceptual representation that describes the entities, their attributes, roles, and relationships, and optionally the constraints that govern the model. A well-thought-out domain model/schema acts as a blueprint for the domain problem and is invaluable for ensuring stakeholder agreement, testing requirement fulfillment, and verifying correctness.

Explanation and Example: Domain modeling has established itself in software engineering to the point that no software application is designed, developed and deployed without it. Domain models are designed by instantiating the fundamental constructs of a meta-model. For example, the HL7 Reference Information Model (RIM) [14] is a well-defined and widely accepted object model for capturing clinical data that is represented using the UML meta-model and uses XML for its reference implementation. The Data Distribution Service (DDS)⁸ model, a machine-to-machine middleware standard that aims to enable scalable, real-time, dependable, high-performance and interoperable data exchanges between publishers and subscribers, is represented using the UML meta-model. The XACML model, the Simple Object Access Protocol (SOAP) model – a specification for exchanging information through web services, and the Vector Markup Language (VML) model – a standard for representing two-dimensional vector graphics, are all based on XML specifications.

Implementation: Domain models/schemas are generally platform-independent conceptual models that can be translated to any target platform based on the application requirements and specifications. The translation can be partially or even fully automated and can be repeated after model changes in order to update the platform-dependent instances. For example, the HL7 RIM model represented using UML can be translated into XML or into a machine-executable programming language such as Java or .NET. Similarly, the OMG DDS standard is a domain model that can be implemented in any programming language, such as Java (developed by RTI technologies⁹) or C++ (developed by the OpenDDS¹⁰ organization).

8 Data Distribution Service, http://www.omg.org/spec/DDS/
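The automated translation of a platform-independent model into platform-specific artifacts can be illustrated with a small generator. The sketch below rests on several assumptions: the dictionary-based MODEL format, the type mapping in SQL_TYPES and the XML element names are invented for illustration and do not correspond to HL7 RIM, DDS or any real model-driven toolchain.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical platform-independent model: entity name -> {attribute: abstract type}.
MODEL: Dict[str, Dict[str, str]] = {
    "Course": {"courseId": "string", "courseName": "string"},
    "Department": {"id": "integer", "name": "string"},
}

# Assumed mapping from abstract types to one SQL dialect's column types.
SQL_TYPES = {"string": "VARCHAR(255)", "integer": "INTEGER"}


def to_sql(model: Dict[str, Dict[str, str]]) -> List[str]:
    """Translate the model into CREATE TABLE statements (one target platform)."""
    statements = []
    for entity, attributes in model.items():
        columns = ", ".join(f"{name} {SQL_TYPES[kind]}"
                            for name, kind in attributes.items())
        statements.append(f"CREATE TABLE {entity} ({columns});")
    return statements


def to_xml(model: Dict[str, Dict[str, str]]) -> str:
    """Translate the same model into an XML description (another target)."""
    root = ET.Element("model")
    for entity, attributes in model.items():
        entity_element = ET.SubElement(root, "entity", name=entity)
        for name, kind in attributes.items():
            ET.SubElement(entity_element, "attribute", name=name, type=kind)
    return ET.tostring(root, encoding="unicode")


sql_statements = to_sql(MODEL)
xml_document = to_xml(MODEL)
```

Because both outputs are derived from MODEL, rerunning the two functions after a model change regenerates the platform-dependent artifacts, which is the repeatability the paragraph above refers to.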

2.2.5. Package

Description: A package groups meta-modeling/domain modeling entities together and provides a namespace for referencing these entities across the application. This allows designers to divide complex domain models into modules that are reusable across diverse applications.

Explanation and Examples: In software engineering, the use of packages, also referred to as modules, is an emphasized and recommended practice in which a large, complex software model is divided into components that are grouped into packages. Each package represents a separation of concerns (functional, behavioral, creational, etc.), improving the usability, testing, verification, and maintainability of the model by enforcing logical boundaries between models and between the packages themselves. The ideal goal is to reduce dependencies and enforce optimal communication between the packages, promoting independence and interoperability through reusability. For example, when the MVC pattern (Section 2.2.3) is applied to a software application, the three components (Model, View and Controller) are generally grouped into three packages, and the components in these packages are designed to interact with components of the other packages without any direct dependency between the model entities. The UML meta-model provides the UML Package with the primary intent of grouping similar modeling entities and providing a namespace for the grouped entities. A UML Package can be imported into another package, which can then use the encapsulated domain modeling entities. Similarly, the XML specification provides the XML namespace mechanism for referring to schema entities defined in an XML schema. An XML schema can refer to another XML schema (similar to a UML import), and the referenced schema's modeling entities can then be used under their original namespace. The Eclipse Modeling Framework (EMF) provides a packaging mechanism very similar to that of UML. The ERD modeling framework does not have a well-defined package mechanism comparable to those of UML or XML.

Implementation: The organization and namespace implementation of UML itself is a good illustration of the package concept. For instance, the UML Core (OMG, 2011) is an aggregation of PrimitiveTypes, Abstractions, Basic, and Constructs packages which themselves are an aggregation of other subordinate packages. The UML meta-model package construction also shows reusability, as packages are reused in designing other packages. For instance, the PrimitiveTypes Package is used in designing both UML Basic and Constructs packages.
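The reuse relationship described above can be illustrated with a minimal Python sketch. It is not part of the UML specification: the Package class, its visible_entities method, and the short entity lists are hypothetical names chosen for illustration, and only a small subset of the entities in each UML package is shown.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """Hypothetical stand-in for a package: a namespace plus its imports."""
    name: str
    entities: set = field(default_factory=set)
    imports: list = field(default_factory=list)

    def visible_entities(self):
        """Qualified names of own entities and of all (transitively) imported ones."""
        seen = {}
        stack = [self]
        while stack:
            pkg = stack.pop()
            if pkg.name in seen:
                continue  # each package is visited once, even if imported twice
            seen[pkg.name] = pkg
            stack.extend(pkg.imports)
        return {f"{p.name}::{e}" for p in seen.values() for e in p.entities}


# Illustrative subset only: PrimitiveTypes is reused by both Basic and Constructs.
primitive_types = Package("PrimitiveTypes", {"Boolean", "Integer", "String"})
basic = Package("Basic", {"Class", "Property"}, [primitive_types])
constructs = Package("Constructs", {"Association"}, [primitive_types])

print(sorted(basic.visible_entities()))
print(sorted(constructs.visible_entities()))
```

Both Basic and Constructs see the PrimitiveTypes entities under their original namespace, while neither depends on the other, which is the kind of logical boundary the package concept is meant to enforce.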

3. Evaluated Clinical Biomedical Ontologies

For the past two decades, numerous biomedical ontologies have been developed to capture the semantics of various aspects of the healthcare domain. Of these, only a few were accepted by the biomedical and health organizations/communities, and even fewer were employed for providing clinical semantics in healthcare settings. However, enough active clinical ontologies remain that evaluating all of them is infeasible within the scope of this work. Therefore, this article focuses on five highly visible clinical biomedical ontologies, i.e., ontologies that are active, widely utilized in medical applications such as EHRs and PHRs, constantly researched, monitored, and accepted at various levels in the biomedical community. The selected ontologies are: International Classification of Diseases (ICD), Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT), Disease Ontology15, Logical Observation Identifiers Names and Codes (LOINC), and Unified Medical Language Systems (UMLS)8,16. The rest of this section explains these selected ontologies in detail in terms of description, domain model(s) if any, and implementation details.

9 https://www.rti.com/products/dds/
10 http://www.opendds.org/

3.1. International Classification of Diseases (ICD)

Description – ICD (ICD, 2010) is a health care classification system that provides diagnostic codes for classifying diseases, including nuanced classifications of a wide variety of signs, symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or disease. The system is the United Nations-sponsored World Health Organization (WHO) standard for clinical classifications. The ICD classification was formally introduced to the medical community in 1898 and established itself as a medical coding standard in 1949. This makes the ICD classification one of the most mature and influential medical knowledge sources. The ICD classification covers diverse clinical domains, e.g., Disease, Symptoms, Procedure, and Injury.

Domain Model(s) & Implementation – The current distribution of ICD 10 (the 10th revision of ICD) is copyrighted by the World Health Organization (WHO), which owns and publishes the classification every year. The organization distributes the classification in two formats: a textual format and the Classification Markup Language (ClaML)11. In the textual distribution format, the clinical classification is divided into three files, namely chapters.txt, blocks.txt, and codes.txt. The complete ICD classification is divided into multiple chapters, where the information about each chapter is captured in chapters.txt in two columns: chapter number and name. Each chapter is further divided into blocks to further classify the clinical knowledge. The blocks.txt file holds this information, also in two columns: block code and name. Each of these blocks captures the actual clinical vocabulary of the ICD classification. The codes.txt file holds the ICD vocabulary information in 15 columns, such as code classification, code title, code chapter, code block code, etc. The second format, the ClaML distribution, has two files: claml.dtd, the schema of the ICD clinical vocabulary, and icd102010en.xml, the XML instance that holds the clinical vocabulary.

Based on the textual distribution format, we can render an abstract domain model for the ICD clinical vocabulary, where each file is converted into a class and its columns are translated into attributes of the respective class. Figure 5 shows the domain model, where chapters.txt is converted into a Chapters class with attributes chapterNumber and chapterName, blocks.txt is rendered as a Blocks class with attributes blockCode and blockName, and codes.txt is converted into a Codes class with its respective attributes and referential associations to the Blocks and Chapters classes. The abstract model shown in Figure 5 is reverse engineered from the textual distribution of the clinical vocabulary and can be viewed as an object-oriented model of it.

11 The "Classification Mark-up Language (ClaML)" is an XML based format designed specifically for classifications. It was accepted in 2007 as European norm (CEN/TS 14463).

Figure 5: A conceptual domain model of ICD clinical vocabulary based on the textual format distribution.
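As a rough illustration of the file-to-class mapping behind Figure 5, the following Python sketch turns rows of the three text files into objects. Several things here are assumptions rather than part of the ICD distribution: the semicolon delimiter, the parse_row helper, the snake_case attribute names (standing in for chapterNumber, blockCode, etc.), and the reduction of the 15 columns of codes.txt to four. The sample rows are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chapters:
    chapter_number: str
    chapter_name: str


@dataclass(frozen=True)
class Blocks:
    block_code: str
    block_name: str


@dataclass(frozen=True)
class Codes:
    # Only a subset of the 15 columns of codes.txt is modeled here.
    code: str
    title: str
    block: Blocks      # referential association to Blocks
    chapter: Chapters  # referential association to Chapters


def parse_row(line, delimiter=";"):
    """Split one text row into cells; the delimiter is an assumption."""
    return [cell.strip() for cell in line.split(delimiter)]


# Illustrative rows standing in for chapters.txt, blocks.txt and codes.txt.
chapter_rows = ["IX;Diseases of the circulatory system"]
block_rows = ["I10-I15;Hypertensive diseases"]
code_rows = ["I10;Essential (primary) hypertension;I10-I15;IX"]

chapters = {r[0]: Chapters(*r) for r in map(parse_row, chapter_rows)}
blocks = {r[0]: Blocks(*r) for r in map(parse_row, block_rows)}
codes = [
    Codes(code, title, blocks[block_code], chapters[chapter_number])
    for code, title, block_code, chapter_number in map(parse_row, code_rows)
]

print(codes[0].title, "->", codes[0].chapter.chapter_name)
```

Resolving the block and chapter columns into object references at load time is what turns the flat files into the associations drawn in the figure.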

3.2. Disease Ontology

Description – The Disease Ontology9 is a standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary. The vocabulary is the result of collaborative efforts of researchers at the Northwestern University Center for Genetic Medicine and the University of Maryland School of Medicine. The DO is a relatively new ontology that is beginning to gain a reputation due to its representation in a machine-processable language (OWL, a description logic-based framework)17 from which logical conclusions or inferences can also be derived.

Conceptual Model – The Disease Ontology (DO) is an ontological artifact that is designed in assorted formats targeting multiple healthcare application requirements. The two prominent formats are the OBO (Open Biomedical Ontologies) format and the OWL format. OWL is a knowledge representation format based on description logic (discussed in Section 4.1). Figure 6 shows the implementation of the DO in the OWL framework, where, on the left-hand side, DO concepts such as ‘disease’, ‘disease by infectious agent’, ‘disease of anatomical entity’, ‘disease of mental health’, etc. are defined as classes (yellow dots). The meta-data about these ontological classes, such as comment, has_exact_synonym, has_alternative_id, etc., are captured as annotations, and the interactions or associations between these ontological entities, such as complicated_by, occurs_with, realized_by, etc., are captured as properties (blue boxes), as shown in Figure 6. The DO is also implemented in the OBO framework, making it compliant with OBO ontologies and their application systems.

Figure 6: OWL implementation of the Disease Ontology.

3.3. Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT)

Description – SNOMED Clinical Terms18 is a systematically organized, computer-processable collection of medical terms providing codes, terms, synonyms and definitions used in clinical documentation and reporting. The primary purpose of SNOMED CT is to encode medical semantics that are used in health information and to support the effective clinical recording of data. SNOMED CT provides the core general terminology that includes various clinical domains such as clinical findings, symptoms, diagnoses, procedures, body structures, organisms and other etiologies, substances, pharmaceuticals, etc.

Conceptual Model – SNOMED CT can be characterized as a thesaurus with an ontological foundation. The abstract logical models of SNOMED CT and its components are illustrated in Figure 7. The model is centered on the representation of concepts, their associated relationships and descriptions. As shown in Figure 7, the Description class captures the set of terms (e.g., fully specified names, preferred terms, synonyms in each supported language, etc.) that describe the Concept class. The Relationship class captures relationships between concepts represented using the Concept class. These three classes inherit from the Component class, which is a subclass of the VersionedComponent class that handles the versioning of medical terminology. SNOMED CT also defines a domain model for axioms or logic expressions on its concepts. For further technical details on SNOMED’s abstract models, please refer to the SNOMED technical specifications.

Figure 7: The abstract logical model of SNOMED concepts.
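The inheritance structure described for Figure 7 can be sketched in Python as follows. Only the five class names come from the text; every attribute name, the terms_for helper, and the placeholder identifiers are assumptions made for illustration and should not be read as the actual SNOMED CT release format.

```python
from dataclasses import dataclass


@dataclass
class VersionedComponent:
    # Versioning information (attribute names are assumed).
    effective_time: str
    active: bool


@dataclass
class Component(VersionedComponent):
    component_id: str


@dataclass
class Concept(Component):
    pass


@dataclass
class Description(Component):
    concept_id: str        # the concept this term describes
    term: str
    description_type: str  # e.g. fully specified name, synonym


@dataclass
class Relationship(Component):
    source_id: str
    destination_id: str
    relationship_type: str


def terms_for(concept, descriptions):
    """Active terms that describe the given concept."""
    return [
        d.term
        for d in descriptions
        if d.concept_id == concept.component_id and d.active
    ]


# Placeholder identifiers, not real SNOMED CT codes.
mi = Concept(effective_time="2014-01-31", active=True, component_id="c-1")
descriptions = [
    Description("2014-01-31", True, "d-1", "c-1",
                "Myocardial infarction (disorder)", "fully specified name"),
    Description("2014-01-31", True, "d-2", "c-1", "Heart attack", "synonym"),
    Description("2014-01-31", False, "d-3", "c-1", "Retired term", "synonym"),
]

print(terms_for(mi, descriptions))
```

Because Concept, Description and Relationship all derive from VersionedComponent, each record carries its own versioning state, which is how an inactive description can be filtered out without touching the concept.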

3.4. Logical Observation Identifiers Names and Codes (LOINC)

Description – The Logical Observation Identifiers Names and Codes (LOINC)19 is a universally accepted standard for identifying medical laboratory observations. It is developed and maintained by the Regenstrief Institute, a US non-profit medical research organization. LOINC was created in response to the demand for an electronic database for clinical care and management and is publicly available at no cost.

Conceptual Model – The current version of LOINC (version 2.44) contains 72,625 terms and is distributed in various formats, similar to other clinical vocabularies. The primary component of the LOINC vocabulary is the LOINC Table, which holds all the information about the various laboratory observations. The LOINC system also provides a table to represent mappings between the LOINC concepts. Figure 8 shows the conceptual model of the LOINC system, where the LOINC class represents the LOINC Table, the columns of the table are represented as attributes (not all attributes could be displayed due to the extensive list of columns), and the LOINCMap class represents the mapping table.

Figure 8: The conceptual model of LOINC system.

3.5. Unified Medical Language Systems (UMLS) Metathesaurus

Description – The Unified Medical Language System (UMLS)7,16,20 is a software suite that combines a large medical vocabulary with various software tools. The vocabulary, UMLS Metathesaurus or UMLS-META, is a compendium of many controlled vocabularies/ontologies (e.g., ICD, SNOMED, NCBI, and DSM) in the biomedical sciences. It may also be viewed as a comprehensive thesaurus and ontology of biomedical concepts. The UMLS-META also provides a mapping structure among the aggregated vocabularies, thus allowing one to navigate between various terminologies. The complete system is designed and maintained by the US National Library of Medicine (NLM). In the rest of the paper, any reference to UMLS means UMLS-META.

Conceptual Model – The UMLS is primarily distributed as a read-only relational database schema. The UMLS system provides developers with SQL scripts that can be executed in a database environment to create and populate the tables, thus building the UMLS-META. As the database is read-only, there are minimal or no constraints, such as referential integrity, on or between the tables. Currently, the UMLS system provides database scripts for MySQL and Oracle databases. The complete UMLS database schema has 23 tables, of which the following are the primary tables: mrconso – contains all the information about the contained concepts, i.e., concept name, identification, synonyms, source vocabulary, etc.; mrrel – holds all the relationships or interactions between the concepts; mrhier – holds all hierarchical relationships between concepts; mrmap and mrsmap – hold the mapping relationships between the concepts; mrsat – contains the attributes of the concepts; and mrdef – contains the definitions of the concepts. Figure 9 shows the partial database schema with the mrconso, mrrel, mrsat, mrdef, and mrhier tables. The concepts across all the tables are linked (dotted lines) using the CUI (Concept Unique Identifier) attribute, a unique identifier for concepts and primary key for the tables in UMLS.

Figure 9: A snapshot of UMLS database tables.
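To make the CUI-based linkage concrete, the sketch below builds a toy, in-memory version of two of the tables with Python's sqlite3 module. It is a simplification under stated assumptions: the real distribution targets MySQL and Oracle and has many more columns, the three columns kept per table are a reduced subset, and all CUIs, source names and strings are made-up toy data. As in the description above, no constraints are declared between the tables.

```python
import sqlite3


def build_toy_umls():
    """Create a tiny stand-in for mrconso and mrdef without any constraints."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE mrconso (cui TEXT, sab TEXT, str TEXT);
        CREATE TABLE mrdef   (cui TEXT, sab TEXT, def TEXT);
    """)
    conn.executemany("INSERT INTO mrconso VALUES (?, ?, ?)", [
        ("C0000001", "SRC_A", "Heart attack"),
        ("C0000001", "SRC_B", "Myocardial infarction"),
        ("C0000002", "SRC_A", "Hypertension"),
    ])
    conn.executemany("INSERT INTO mrdef VALUES (?, ?, ?)", [
        ("C0000001", "SRC_B", "Toy definition text."),
        # Accepted although no such concept exists: nothing enforces integrity.
        ("C9999999", "SRC_B", "Definition of a missing concept."),
    ])
    return conn


def names_and_definitions(conn, cui):
    """Join concept names with definitions through the shared CUI column."""
    return conn.execute(
        """SELECT c.sab, c.str, d.def
           FROM mrconso AS c LEFT JOIN mrdef AS d ON d.cui = c.cui
           WHERE c.cui = ? ORDER BY c.sab, c.str""",
        (cui,),
    ).fetchall()


def orphan_definition_cuis(conn):
    """CUIs that appear in mrdef but not in mrconso."""
    rows = conn.execute(
        """SELECT DISTINCT d.cui
           FROM mrdef AS d LEFT JOIN mrconso AS c ON c.cui = d.cui
           WHERE c.cui IS NULL ORDER BY d.cui"""
    ).fetchall()
    return [cui for (cui,) in rows]


conn = build_toy_umls()
print(names_and_definitions(conn, "C0000001"))
print(orphan_definition_cuis(conn))
```

The second query shows the practical consequence of the missing referential constraints: consistency between tables has to be checked by the application rather than guaranteed by the schema.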

4. Evaluation against Fundamental Concepts of Software Modeling

This section evaluates the five clinical biomedical ontologies discussed in Section 3 against the fundamental concepts of software modeling identified in Section 2.2, with supporting arguments to justify the presented evaluation. To provide a comprehensive analysis and evaluation, the attainment of each fundamental software concept is first discussed in general for the domain of ontologies; second, the attainment of the same concept by the selected ontologies is discussed.

4.1. Meta-Model vs. Clinical Biomedical Ontologies

The first questions we have to answer before proceeding with the evaluation of ontologies against meta-models are: what well-defined and widely accepted meta-modeling frameworks are available for designing ontological models? If there are any, in what state of maturity (e.g., research, experimental and testing, or industrial) are they? Do they have tools to support modeling with these meta-modeling frameworks? The term ontology encompasses a wide spectrum of knowledge representations, ranging from simple hierarchies and classifications through frame-based ontologies to complex description logic-based representations, which are utilized in various domains based on the application requirements.

Hierarchies are simple “is-a” or “super-sub type” based knowledge representations where a super concept encompasses all sub concepts and which generally stem from a single root concept. An abstract, domain-independent solution for these hierarchies can be designed using existing software meta-models such as UML, ERD and XML, based on the application requirements. Classifications can be modeled using the Classification Markup Language (ClaML) (used by ICD, Section 3.1), and any tool that supports XML can assist in designing ClaML-based classifications, although it may not support verification and validation of the designed classification. The ClaML specification was first published as a technical specification (CEN/TS 14463:2003) and is now under the supervision of the ISO Technical Committee (TC) on health informatics, known as ISO/TC 215 WG3. Another example is biomedical dictionaries, which, like general dictionaries, are look-up sources for finding medical terminology. However, they have faded with time due to advances in technology, knowledge representation capabilities with enriched semantics, and intelligent tools.

Frame- and logic-based ontological representations have a long history, and various frameworks have been designed along the way for capturing semantic knowledge. KL-ONE (Brachman & Schmolze, 1985) and its family have been among the most influential and imitated knowledge representation systems in the ontological community. KL-ONE is a frame-based language that formalized the notion of a structured concept as a collection of elements in a specific relationship; it first appeared in 1977. Similar to KL-ONE, FLogic is a powerful frame-based framework for designing ontologies that combines the advantages of conceptual modeling with object-oriented, frame-based languages and offers a declarative, compact and simple syntax, as well as the well-defined semantics of a logic-based language. The Knowledge Interchange Format (KIF) is a declarative semantics framework primarily designed to exchange knowledge among disparate applications. The Ontology Inference Language (OIL) is based on concepts developed in description logic and frame-based systems, and the DARPA Agent Markup Language (DAML) was an initial effort to design a markup language for efficiently structuring and attaching semantic knowledge to information represented across the web. DAML+OIL is a semantic markup language for ontological representation that combines the capabilities of OIL and DAML (Horrocks, 2002). The Resource Description Framework (RDF) is a general method for the conceptual description or modeling of information, primarily targeting information represented in web resources, and the Web Ontology Language (OWL)17 is a knowledge representation language for authoring ontologies or knowledge bases with formal semantics that are based on description logic. In the rest of the paper, any reference to OWL means OWL based on description logic (OWL DL) semantics.

The OWL framework has grown out of RDF, DAML+OIL and other description logic-based languages. Among the family of frame- and logic-based frameworks, the KIF framework was the most expressive, but it never became a standard due to the high complexity of its reasoning algorithms. The KL-ONE family is still supported but is not actively developed or researched, while the OIL, DAML, and DAML+OIL projects are currently inactive. RDF and OWL are currently the most active and preferred frameworks for capturing ontological knowledge due to their knowledge representation capabilities. As RDF and OWL have gained popularity, the OMG group has defined a standard meta-model named the Ontology Definition MetaModel (ODM), which has the semantics of RDF and OWL and is employed for visual conceptual modeling of ontologies, similar to UML. These frameworks/meta-models are supported by both open source and community-driven tools such as Java Ontology Editor, Protégé and Protégé Frames, Swoop (Kalyanpur, Parsia, Sirin, Grau, & Hendler, 2006), etc. Protégé is the most popular tool in the ontological community for OWL/RDF-based ontologies.

From the evaluation perspective, how many of these ontological and software meta-modeling frameworks satisfy the conceptual modeling requirements of the aforementioned clinical biomedical ontologies? Starting with ICD, the standard is formally supported by the Classification Markup Language (ClaML), an XML-based representation, making it the default meta-modeling framework for the ICD vocabulary. However, as the XML specification can be mapped to the UML, ODM and ERD meta-models without loss of semantic expressiveness, the classification schema (claml.dtd) can be transformed into domain models based on these meta-models. The derived domain models can then be translated into executable machine code such as Java, OWL or RDF, and SQL, respectively. The Disease Ontology (DO) (Figure 6), which is the latest attempt to capture knowledge on human diseases, is by default developed using the OWL and OBO frameworks. The same OWL representation can be transformed into a domain model based on ODM. However, as the OWL framework is semantically more expressive than other software, logic-based or frame-based meta-models, transforming the DO OWL representation into domain models based on those meta-models is difficult or unacceptable due to the loss of domain model semantics.

The SNOMED-CT domain model and its semantics are represented using the UML meta-model (Figure 7) and EL++ (Parsia, 2008), a profile (or subset) of OWL semantics. As with DO, transforming SNOMED domain models into domain models based on other meta-models is not advisable due to the loss of domain model semantics. An exception is the OWL meta-model, as EL++ is a subset of OWL. The LOINC standard, which has a distribution methodology similar to that of ICD (i.e., a textual format), can be represented using both software and ontological meta-models, e.g., UML, ODM, ERD, and XML, based on the application requirements. These domain models can later be translated into executable machine code using Java, OWL or RDF, SQL and XML, respectively. Finally, the UMLS system is officially distributed as executable SQL scripts, and the system also provides domain models in terms of ERD diagrams. As the UMLS system is primarily a read-only database with no or minimal constraints on or between the database tables and their columns, the system can also be represented using the UML, ODM, and XML meta-models. However, due to the large size of the data set (~2 million concepts in roughly 3 GB of disk space), the transformation has to be evaluated carefully. Table 1 summarizes the above discussion.

Table 1: Summary of meta-models that support the selected clinical biomedical ontologies.

| Clinical Biomedical Ontologies | ICD | Disease Ontology | SNOMED-CT | LOINC | UMLS |
|---|---|---|---|---|---|
| Supported by default | ClaML/XML | OWL and OBO | UML, EL++ | - | ERD |
| Supported through transformation | UML, ERD, ODM | - | OWL | UML, ERD, XML, ODM | UML, XML, ODM |

From the above analysis, discussion and summary, we can conclude that one or more software or ontological meta-models fully satisfy the modeling requirements of the previously discussed clinical biomedical ontologies and also of the domain of ontologies in a broader scope.

4.2. Profile vs. Clinical Biomedical Ontologies

As defined in Section 2.2.2, a Profile is a specialized meta-model whose meta-modeling entities are defined based on domain requirements and application needs. As a Profile extends a core meta-model, this extension mechanism creates a dependency between them, wherein the Profile can only exist with the meta-model.

Similar to the meta-model argument, we first investigate the questions: how many ontological meta-models provide Profile mechanisms for designing domain-specific ontological meta-models? Do the ontological tools support such an extension mechanism? The ontological domain has diverse meta-models for designing an ontological model, ranging from simple XML (ClaML) through frame-based (KL-ONE, KIF) to description logic frameworks (OWL and ODM). Among the software meta-models, UML is the only one that supports Profiles, i.e., the UML Profile, which is tightly integrated with the UML architecture and is completely supported by the OMG group. Unfortunately, none of the ontological meta-models supports or defines an extension mechanism similar to that of UML for defining an ontological Profile. One must note that the semantics of Profile in this context is completely different from that of Profile in the OWL framework. The current version of the OWL framework, informally OWL 2, defines the OWL 2 Profile (a fragment or sublanguage in computational logic), which is a trimmed-down version of OWL 2, i.e., an OWL 2 Profile trades off some expressive power for better reasoning efficiency. The OWL 2 working group has defined three OWL 2 Profiles: OWL 2 EL, OWL 2 QL, and OWL 2 RL. These profiles possess a subset of the OWL 2 knowledge representation features. This allows ontology developers to select the framework based on the domain requirements and required representational capabilities. Thus, the primary difference between a Profile in the context of this article and an OWL 2 Profile is that the former is an extension mechanism for defining a customized meta-model that depends on a fundamental meta-model, whereas the latter is an independent framework containing a subset of OWL 2 functionality in order to obtain better efficiency.

From the evaluation perspective, do any of the ontology standards provide a biomedical or clinical Profile for ontological modeling? The reason for asking this question, even after arguing that none of the ontological meta-models provides an extension mechanism for designing a Profile, is that some of the biomedical ontological domain models (ICD, LOINC, SNOMED-CT and UMLS) can be represented using the UML meta-model. However, the organizations, technical committees and medical community that develop and maintain these standard ontologies have not designed any biomedical or clinical Profile to support ontological modeling. Thus, the selected ontologies and their organizations provide no Profile support.

4.3. Design Pattern vs. Clinical Biomedical Ontologies

Design Patterns (Freeman, 2004) are defined as general reusable solutions to a commonly occurring problem within a given domain context, as discussed in Section 2.2.3, and Section 2.1 demonstrated the usage of the Observer Pattern in implementing faculty-student interactions in an academic domain model. The same pattern can be used in any application that involves a publisher and a subscriber in the problem context.

First, does the domain of ontologies define design patterns? How are they represented? The ontological community has recognized the dominance and influence of design patterns in software modeling and has been focusing on designing and developing ontological patterns to capture the semantics of commonly recurring ontological modeling issues. Gangemi has proposed the Conceptual Ontology Design Pattern (CODeP)21,22 to capture a generalized use case scenario that acts as a template for solving domain knowledge design issues. For example, Time Indexed Participation21 is a CODeP that represents time indexing for the relation between persons and the roles they play, as shown in Figure 10a. The Task Role Pattern is also a CODeP, representing temporary roles that objects can play and the tasks that events/actions are allowed to execute, as shown in Figure 10b. The Participation Pattern is a CODeP extracted from the DOLCE ontology that illustrates the participation relation between objects and events, as shown in Figure 10c. Gangemi’s CODePs are represented in the OWL framework.

Figure 10: CODeP Patterns proposed by Gangemi.

Similarly, Clark, Thompson and Porter (Clark, Thompson, & Porter, 2003) have proposed the concept of knowledge patterns, defined as semantic structures representing recurring patterns similar to design patterns, but morphing the knowledge pattern entities onto domain classes instead of instantiating them. Knowledge patterns are templates that can be translated into any target framework. The Semantic Sensor Network ontology (SSN) focuses on a number of perspectives such as sensors, features, data or observations, networked system(s), etc. The SSN ontology, designed and maintained by the W3C community, is built on the Stimulus-Sensor-Observation pattern23 and implemented in the OWL framework. Further, the ontological community maintains a global, community-driven pattern repository to which researchers across the globe can contribute.

Following this argument, an ontology modeling engineer might ask: does the biomedical community have any biomedical design patterns or CODePs that can be used or referred to by a biomedical specialist for designing ontological domain models? Or does the community have any agreed-on templates (code or text) that can be reused for designing ontological model(s)? It is perplexing to realize that a domain dominated by standard ontologies that express knowledge on diverse domains (Figure 1) has not established standard biomedical patterns that can be aligned with software design patterns or with CODePs. Further, the fundamentals, best practices and implementations of design patterns were first experimented with around 1987, but it was not until 1994 that they proved themselves. In contrast, WHO started publishing the ICD classification in 1965, the LOINC terminology was initiated in 1994, SNOMED CT was first released in 1998, and the UMLS knowledge system project was initiated in 1998; all of these ontology efforts were initiated or designed after software design patterns had established and proved themselves. The Disease Ontology, developed in 2011, has not used or defined any biomedical pattern(s), even though both software and ontological design patterns were well known by then and the domain of diseases was well captured by the previous standards. This indicates that the biomedical community has not defined any biomedical pattern(s) or equivalent CODePs, even after acquiring vast knowledge by designing and developing standard ontologies that overlap with each other in multiple domains (Figure 1). Meanwhile, enormous collaborative research efforts are devoted to defining deterministic algorithms and machine learning techniques that identify patterns in existing large biomedical data sets and text corpora. However, these patterns have a very limited scope, as they are extracted from a dataset with a specific context and little or no problem statement, and therefore may not be universally applicable.

From the above discussion, we can state that none of the discussed biomedical ontologies refers to, implements or defines any standard biomedical pattern whose semantics can be reused to define other similar biomedical ontologies.

4.4. Domain Model(s) & Packages vs. Clinical Biomedical Ontologies

The domain model(s) and packages are the primary output of a software modeling process (Step 3, Figure 2), as they represent a platform-independent abstract model for the given domain problem. The merging of packages with domain model(s) emphasizes the design of modular solutions that promote reuse of the abstract conceptual domain model(s) for multiple domain problems based on a given context. One must note that the concept of modularity in software is rather informal, with no official definition or methodology to achieve it. However, software professionals generally agree that modularity is primarily intended to logically separate the software functionality such that individual modules can work independently of one another. Thus, instead of creating a monolithic application (one large software application), several smaller modules are built, compiled, and composed together to construct the whole. This makes modularly designed systems, if built correctly, far more reusable than a traditional monolithic design, since ideally modules can be reused (without change) in other projects. As exemplified in Figure 2, the Research Package can be reused in designing any other academic domain model, if the contexts of the former and the latter are aligned with one another.

As in the preceding evaluations, we ask: does the ontological domain have meta-models, tools and techniques for designing abstract conceptual ontological model(s), and does it also support the concept of packages? As described in Section 2.2.4, the design and development of domain model(s) is always tied to a meta-model, i.e., the meta-model constructs are instantiated to define the entities of the domain model(s). The software meta-models, such as UML, ERD, XML, EMF, etc., are discussed in Section 2.2.1, and the meta-models in the ontological domain, such as KL-ONE, KIF, DAML+OIL, RDF, OWL, ODM, etc., are examined in Section 4.1. The existence of diverse meta-models with well-established structure and semantics in both the software and ontological domains allows ontology designers to define conceptual ontological domain models. Further, the mentioned meta-models support packages through import mechanisms (package and class imports, schema imports, etc.), with the exception of ERD, which does not support packaging. In parallel, there are also tools such as Eclipse, Protégé, JOE, etc. that can assist designers in developing domain models using these meta-models. Finally, the ontological community has defined various ontology life cycle models (processes for defining ontologies), such as Methontology, Enterprise Ontology Project, TOVE project, Ontology 101, UPON, HOD2LC24, etc., that are similar to software life cycle models.

Based on the above discussion, the next logical questions for the evaluation are: do the selected biomedical ontologies have abstract domain model(s)? Do they promote modular design, so that one standard’s domain model(s) can be reused by other standards or committees, thereby reducing the current interoperability issues? To begin with, the WHO committee distributes the ICD vocabulary in text format and in ClaML format. In the former, the vocabulary is described in text files using natural language (English and other languages) without any uniform standard semantic representation (i.e., a domain model based on a meta-model). Figure 4 presents a UML-based model reverse engineered from these text files; it is influenced by the experience of the authors and is not in any way standardized or approved by the WHO ICD technical committee. Thus, the text format can’t act as a domain model for the ICD vocabulary. The other distribution format, ClaML, consists of two files, claml.dtd and icd102010en.xml, where the former is the schema/model that explains the medical vocabulary captured in the latter using XML. On further inspection of the ICD domain XML schema, the question becomes: does the claml.dtd schema support modularity so that it can be reused in other standards or ontologies? The ICD vocabulary identifies and classifies various concepts related to the clinical domain such as “Diseases of the nervous system”, “Diseases of the circulatory system”, “Diseases of the respiratory system”, “Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified”, etc. The current ICD 10 has 22 such classifications, which are further sub-classified, and all this knowledge is represented in a single monolithic file (icd102010en.xml) and schema (claml.dtd).
If a medical professional is interested only in the “Diseases of the circulatory system” classification, he/she has to write an application program on top of the vocabulary to extract the required classification, because ICD doesn’t provide any modular model/schema representing the individual classifications that the professional could utilize/re-use in different application scenarios. This argument can be extended to other classifications such as Symptoms, Injuries and Poisoning, etc., and it can be concluded that the lack of modularity in the ICD vocabulary hinders the reuse of its classifications in other standards or ontology models (Figure 1). Thus, the ICD vocabulary has a ClaML-based domain model, but lacks modularity by design.
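To make the extraction burden concrete, the following minimal Python sketch shows the kind of program a professional would have to write to pull one chapter out of a monolithic ClaML file. The XML fragment is a hypothetical, heavily simplified stand-in for icd102010en.xml (the element and attribute names Class, SubClass, Rubric, Label, code and kind are assumed here for illustration), and extract_subtree is an illustrative helper, not part of any ICD tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ClaML-like fragment; the real file is far larger
# and its exact layout is assumed here for illustration only.
SAMPLE = """
<ClaML version="2.0.0">
  <Class code="IX" kind="chapter">
    <SubClass code="I00-I02"/>
    <Rubric kind="preferred"><Label xml:lang="en">Diseases of the circulatory system</Label></Rubric>
  </Class>
  <Class code="I00-I02" kind="block">
    <SuperClass code="IX"/>
    <Rubric kind="preferred"><Label xml:lang="en">Acute rheumatic fever</Label></Rubric>
  </Class>
  <Class code="X" kind="chapter">
    <Rubric kind="preferred"><Label xml:lang="en">Diseases of the respiratory system</Label></Rubric>
  </Class>
</ClaML>
"""


def extract_subtree(root, top_code):
    """Collect {code: preferred label} for top_code and all of its descendants."""
    by_code = {cls.get("code"): cls for cls in root.iter("Class")}
    result = {}
    stack = [top_code]
    while stack:
        code = stack.pop()
        cls = by_code.get(code)
        if cls is None or code in result:
            continue  # unknown code or already visited
        result[code] = cls.findtext("Rubric[@kind='preferred']/Label", default="")
        # Follow the SubClass links to walk down the hierarchy.
        stack.extend(sub.get("code") for sub in cls.findall("SubClass"))
    return result


chapter = extract_subtree(ET.fromstring(SAMPLE), "IX")
print(chapter)
```

The whole document must be parsed and indexed before a single chapter can be isolated, which is the practical cost of the monolithic design discussed above.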

The Disease Ontology (DO) is an attempt to represent the domain of human diseases using the OWL framework and OBO. Figure 7 shows a snapshot of the ontology developed using OWL and the Protégé tool. The top concept of the ontology (all concepts are represented as OWL Classes) is “disease”, which is divided into 8 subclasses, namely: “disease by infectious agent”, “disease of anatomical entity”, “disease of cellular proliferation”, “disease of mental health”, “disease of metabolism”, “genetic disease”, “medical disorder”, and “syndrome”, each of which is further divided to define other diseases. The ontology also defines binary relationships or associations (OWL objectProperty) such as “complicated_by”, “composed_of”, “has_symptom”, etc. and annotation properties such as “date”, “id”, “definition”, etc. When the DO OWL structure shown in Figure 7 is compared to the sample academic UML model shown in Figure 3, the UML classes (e.g., Person, Faculty, etc.) are equivalent to OWL classes and the UML associations (e.g., advices, teaches, etc.) are equivalent to DO OWL objectProperties. However, a UML association interconnects two or more classes, which defines the domain and range of that association. For example, the association “advices” connects the Faculty and Student classes, thus defining the complete semantics of the association in the domain model. In the DO OWL ontology, objectProperties are defined, but they are not assigned any valid domain and range, i.e., they don’t connect any DO classes, which causes ambiguity in their definition and leaves the designed OWL model incomplete. Because of this disconnect, the classes and properties are two separate components in the current OWL structure (Figure 7) rather than one tightly integrated unit that gives sound structural and semantic meaning to the complete model.
Further, the DO OWL class structure is similar to the ICD classification: the disease concepts are buried deep in the class hierarchy, making it difficult to extract the required concepts. Based on this argument, the Disease Ontology has an incomplete OWL domain model and a poor modular design, which makes the reuse of Disease Ontology concepts very difficult.
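The difference between an association that connects two classes and a property declared without domain and range can be illustrated with a small, library-free Python sketch. The ObjectProperty record and the unanchored helper are hypothetical names introduced only for this illustration; the property names are the ones quoted above, and the DO entries are modeled as having no domain or range in line with the preceding argument.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectProperty:
    """Illustrative stand-in for a UML association or an OWL objectProperty."""
    name: str
    domain: Optional[str] = None   # class the property starts from
    range_: Optional[str] = None   # class the property points to


def unanchored(properties: List[ObjectProperty]) -> List[str]:
    """Return the names of properties that do not connect two classes."""
    return [p.name for p in properties if p.domain is None or p.range_ is None]


# UML-style association from the academic example: fully specified.
uml_style = [ObjectProperty("advices", domain="Faculty", range_="Student")]

# DO-style properties as characterized in the text: declared but unconnected.
do_style = [ObjectProperty("has_symptom"), ObjectProperty("complicated_by")]

print(unanchored(uml_style))
print(unanchored(do_style))
```

A check of this kind reports nothing for the UML-style association and flags every DO-style property, which mirrors the incompleteness described for the OWL model.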

The LOINC vocabulary is a universal standard for identifying medical laboratory observations and is used for annotating concepts in electronic medical documents. The LOINC vocabulary is distributed primarily in Excel format (text structure), which is similar to the ICD text distribution. Figure 7 is a UML domain model reverse engineered from the Excel file, as was done for ICD, but it is not standardized or approved by LOINC. Thus, the Excel format can’t act as a conceptual domain model for the LOINC system. As the LOINC distribution is not in a machine-executable modeling format (e.g., OWL, XML or UML), the medical professional has to write software code to use the LOINC vocabulary and to extract the required laboratory observations.
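As a rough sketch of the software code mentioned above, the following Python fragment filters a tabular export for one class of observations. The column names (LOINC_NUM, COMPONENT, SYSTEM, CLASS) are assumed to match the LOINC table layout, while the rows, codes and the observations_in_class helper are made up for illustration.

```python
import csv
import io

# Made-up rows in an assumed LOINC-like tabular layout; the codes are placeholders.
SAMPLE_CSV = """LOINC_NUM,COMPONENT,SYSTEM,CLASS
0000-0,Hypothetical analyte A,Bld,CHEM
0000-1,Hypothetical analyte B,Ser/Plas,CHEM
0000-2,Hypothetical analyte C,Urine,UA
"""


def observations_in_class(text, loinc_class):
    """Return the codes of all rows whose CLASS column equals loinc_class."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["LOINC_NUM"] for row in reader if row["CLASS"] == loinc_class]


chem_codes = observations_in_class(SAMPLE_CSV, "CHEM")
print(chem_codes)
```

Every consumer has to re-derive structure from column names in this way, because the distribution carries no explicit model of how the columns relate to one another.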

SNOMED-CT is one of the most influential medical vocabularies in the US health care domain. Figure 8 shows the domain model of the ontology designed and maintained by the SNOMED group. The domain model has three main classes, namely Concept, Description, and Relationships, all derived from the Component class. These classes are connected by various associations such as sourceId, destinationId, conceptId, etc., with well-defined semantics attached to these associations. Before moving further, one question must be answered: is the content of Figure 6 the accurate domain model for the SNOMED-CT ontology? Does it really capture the intended medical domain knowledge? The primary reason for asking these questions is the semantic definition of the model’s primary classes and their respective associations. According to the SNOMED-CT technical documentation, the captured medical terminology (Heart failure, Typhoid Fever, etc.) is an instance of the Concept class, the relations (isa, inverse_isa, etc.) between these concepts are instances of the Relationship class, and the description of a concept is an instance of the Description class. When the biomedical domain is removed from the domain model context, the same conceptual model (Figure 6) can be reused to define knowledge about any other domain such as Finance, Automobile, Academics, Toys, etc. For example, applying the conceptual model to the domain of Finance, concepts such as Stocks, Bonds, CDs, ETFs, etc. can be instances of the Concept class, the relations between the concepts such as isInvestment, isGovtBacked, isVolatile, isa, etc. can be instances of the Relationship class, and any description of the concepts is an instance of the Description class. This shows that the conceptual model is generic enough to design an ontology for diverse domains and is not specific to the biomedical or clinical domain.
From the software modeling perspective, the current SNOMED conceptual domain model can be viewed as a reference model or even as a software Profile for designing domain ontological models. Further, when the semantic definitions of the SNOMED CT classes (Concept, Description and Relationships) are compared to the UML meta-modeling entities (Class, Comment, and Association) respectively, they align with each other, which further justifies the presented argument. Thus, SNOMED CT provides a conceptual domain model that is generic in nature, which raises ambiguity about its design. Similar to ICD and LOINC, the SNOMED-CT domain model also lacks modularity, as the complete terminology is captured under a single primary class.
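The genericity argument can be made tangible with a short Python sketch in which one and the same three-class structure is populated first with clinical terms and then with finance terms. The class and field names are loosely modeled on the Concept, Description and Relationship classes discussed above, while the identifiers ("c1", "f1", etc.) and the build helper are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str


@dataclass(frozen=True)
class Description:
    concept_id: str
    term: str


@dataclass(frozen=True)
class Relationship:
    source_id: str
    type_name: str
    destination_id: str


def build(terms, relations):
    """Populate the generic model from {id: term} and (source, type, destination) triples."""
    concepts = [Concept(cid) for cid in terms]
    descriptions = [Description(cid, term) for cid, term in terms.items()]
    relationships = [Relationship(s, t, d) for s, t, d in relations]
    return concepts, descriptions, relationships


# Clinical content (identifiers are placeholders, not SNOMED-CT identifiers).
clinical = build({"c1": "Heart failure", "c2": "Heart disease"}, [("c1", "isa", "c2")])

# Finance content fits the same structure without any change to the classes.
finance = build({"f1": "Bonds", "f2": "Investment"}, [("f1", "isa", "f2")])

print(clinical[2][0])
print(finance[2][0])
```

Nothing in the class definitions refers to medicine, so the model constrains the shape of the data but none of its clinical meaning, which is the sense in which it behaves like a meta-model rather than a domain model.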

Finally, the UMLS system is a comprehensive effort to integrate biomedical ontologies and standards into a unifying standard for use in the biomedical domain. The UMLS system integrates the biomedical ontologies discussed above and other standards such as Gene Ontology, NCBI, etc. The UMLS is distributed as a database with well-defined executable SQL scripts and abstract ERD diagrams. As the UMLS database is a read-only database, the constraints on the database tables and their columns are minimal. The availability of an ERD conceptual model with limited constraints allows the software designer to transform the model into other meta-model based domain models such as UML, ODM, etc. The structural design of UMLS is very similar to the SNOMED CT conceptual model: the mrconso table (similar to the SNOMED Concept class) holds all the medical terminology, the mrdef table (similar to the SNOMED Description class) contains the definitions of the concepts, the mrrel table (similar to the SNOMED Relationship class) holds all the relationships that can exist between the concepts, the mrmap table holds the mappings between concepts from different sources, etc. Thus, the UMLS system provides an abstract domain model in terms of ERD diagrams and executable SQL scripts, but lacks modularity.
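The table layout described above can be sketched with an in-memory SQLite database in Python. The three tables carry only a small, assumed subset of columns (cui, sab, str, def, rel), the concept identifiers CUI_A and CUI_B and the definition text are placeholders, and terms_for is a hypothetical helper; this is a sketch of the structure, not the actual UMLS load scripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mrconso (cui TEXT, sab TEXT, str TEXT);            -- terms per concept and source
CREATE TABLE mrdef   (cui TEXT, sab TEXT, def TEXT);            -- definitions
CREATE TABLE mrrel   (cui1 TEXT, rel TEXT, cui2 TEXT, sab TEXT); -- relationships
""")

# Placeholder rows: every source vocabulary lands in the same tables.
conn.executemany("INSERT INTO mrconso VALUES (?, ?, ?)", [
    ("CUI_A", "ICD10", "Heart failure"),
    ("CUI_A", "SNOMEDCT_US", "Heart failure (disorder)"),
    ("CUI_B", "SNOMEDCT_US", "Heart disease"),
])
conn.execute("INSERT INTO mrdef VALUES (?, ?, ?)",
             ("CUI_A", "SNOMEDCT_US", "placeholder definition text"))
conn.execute("INSERT INTO mrrel VALUES (?, ?, ?, ?)",
             ("CUI_A", "PAR", "CUI_B", "SNOMEDCT_US"))


def terms_for(connection, cui):
    """Return (source, term, definition) rows for one concept, ordered by source."""
    return connection.execute(
        """
        SELECT c.sab, c.str, d.def
        FROM mrconso AS c
        LEFT JOIN mrdef AS d ON d.cui = c.cui AND d.sab = c.sab
        WHERE c.cui = ?
        ORDER BY c.sab
        """,
        (cui,),
    ).fetchall()


rows = terms_for(conn, "CUI_A")
print(rows)
```

Because all sources share the same few tables, isolating the content of one vocabulary is a matter of filtering on a source column rather than importing a separate module, which is the lack of modularity noted above.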

Thus, from the above discussion we can state that not all of the discussed biomedical ontologies fully satisfy the domain model requirement, whether by providing a language-independent model (UML, ERD, ODM, etc.) or a language-dependent model (XML, RDF, OWL, etc.), and none of them provides a modular design that promotes the reuse of model structure and semantics, which could reduce interoperability issues.

4.5. Summary of Clinical Biomedical Ontologies vs. Software Modeling Fundamentals

Table 2 summarizes the above discussion using three qualifiers: Full (the ontology supports the modeling concept), Partial (the ontology partially fulfils the modeling concept) and None (the ontology doesn’t support the software modeling concept).

Table 2: Summary of Biomedical Ontologies vs. Software Modeling Fundamentals.

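| Biomedical Ontology | Meta-Models | Profile | Design Patterns | Domain Models/Schemas | Modularity/Package |
|---------------------|-------------|---------|-----------------|-----------------------|--------------------|
| ICD                 | Full        | None    | None            | Full                  | None               |
| DO                  | Full        | None    | None            | Full                  | None               |
| SNOMED-CT           | Full        | None    | None            | Partial               | None               |
| LOINC               | Full        | None    | None            | None                  | None               |
| UMLS                | Full        | None    | None            | Partial               | None               |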

Briefly, all the evaluated biomedical ontologies are supported by well-defined meta-models (Section 2.2.1 and 4.1), but none of these ontological meta-models supports Profiles (Section 4.2); only UML does (Section 2.2.2). Further, none of the biomedical ontology committees has exploited the UML Profile to define a meta-model for the biomedical domain. None of the biomedical ontologies either defines a design pattern or refers to/uses existing software or biomedical ontology patterns (Section 2.2.3 and 4.3). Most of the biomedical ontologies provide either a platform-independent model or an implemented model, but none of them supports the concept of modularity, which is a crucial feature for domain model reuse and could alleviate interoperability issues (Section 2.2.4 and 4.4).

5. Conclusion

This paper has illustrated a serious disconnect between ontology development and software modeling, where the former is primarily focused on encoding the captured domain knowledge concepts and their respective relationships, while the latter is focused on platform-independent domain models that provide an abstract view of the solution by employing various software modeling fundamentals. The current approach of developing ontologies for a very specific application or purpose involving the domain instance data causes potential structural and semantic interoperability conflicts when one attempts to integrate two or more ontologies. This is a serious barrier to achieving interoperability when two or more healthcare applications that employ diverse biomedical ontologies need to share data and knowledge. The research presented in this article examines, justifies and exemplifies a series of fundamental gaps between the ontology and software development processes and their fundamentals. We must note that a similar style of evaluation and justification can be applied to other biomedical ontologies such as DSM, OMIM, NCBI Taxonomies, Gene Ontology, etc. that cut across each other in various biomedical domains (Figure 1). However, the primary takeaway from this article is that ontology design needs an upgrade that focuses on abstraction, similar to the software engineering process. Through this upgrade, ontologies can be defined more clearly and abstractly, with the potential for reuse in multiple settings.

Towards this objective, Section 2 focuses on the software design process, explained in Section 2.1, and on the fundamental software modeling concepts involved in that process, namely Meta-Models, Profiles, Design Patterns, Domain Models/Schemas and Packages, explained in Section 2.2. Section 3 identifies the highly visible clinical biomedical ontologies, defines them and presents their domain models. Using Section 2 and Section 3 as a basis, Section 4 evaluates the selected clinical biomedical ontologies against the identified software modeling fundamentals as follows: Section 4.1 examines the availability of ontological meta-models and their support in designing ontology models, and evaluates the selected ontologies against both software and ontological meta-models; Section 4.2 examines the ability of ontological meta-models to be extended to define Profiles and the usage of both software and ontological Profiles in the design of the selected ontologies; Section 4.3 discusses the concept of ontological design patterns and their usage in the development of the selected ontologies; and finally, Section 4.4 discusses domain modeling and packages for improving modularity and evaluates the selected ontologies against these software fundamentals.
With a strong commitment and agreement from domain experts, knowledge designers and ontology users, coupled with an ontology design process that embraces a conceptual design perspective for common knowledge sources, there is great potential for biomedical ontologies to be more easily integrated, leading to an ability to share information across a domain without interoperability issues.

REFERENCES

1. Guarino N. Formal Ontology in Information Systems: Proceedings of the 1st International Conference June 6-8, 1998, Trento, Italy. 1st ed. Amsterdam, The Netherlands: IOS Press, 1998.

2. Saripalle RK, Demurjian S, Algarin A. A Software Modeling Approach to Ontology Design via Extensions to ODM and OWL. 9. Epub ahead of print 2013. DOI: 10.4018/jswis.2013040103.

3. Saripalle RK, Demurjian S, Behre S. Towards Software Design Process for Ontologies. In: 1st International Conference on Software and Intelligent Information. San Juan, 2011.

4. Blumenthal D. Launching HITECH. N Engl J Med 2010; 362: 382–385.

5. Blechner M, Saripalle R, Demurjian S. A proposed star schema and extraction process to enhance the collection of contextual & semantic information for clinical research data warehouses. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops. 2012, pp. 798–805.

6. Saripalle RK, Demurjian S. Attaining Semantic Enterprise Interoperability through Ontology Architectural Patterns. In: Yannis C, Fenareti F, Ricardo CJ (eds) Revolutionizing Enterprise Interoperability through Scientific Foundations. 2014, p. 216.

7. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004; 32: D267–D270.

8. Saripalle RK. UMLS Semantic Network as a UML Metamodel for Improving Biomedical Ontology and Application Modeling. Int J Healthc Inf Syst Informatics; 10.

9. Schriml LM, Arze C, Nadendla S, et al. Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res 2012; 40: 940.

10. Booch G, Rumbaugh J, Jacobson I. The Unified Modeling Language User Guide. 2nd ed. Addison-Wesley Professional, 2005.

11. Saripalle RK, Demurjian S. Semantic Design Patterns Using the OWL Domain Profile. In: International Conference on Information and Knowledge Engineering. 2012.

12. Boone KW. The CDA™ Book. Springer, 2011.

13. Ulrich H, Kock AK, Duhm-Harbeck P, et al. Metadata Repository for Improved Data Sharing and Reuse Based on HL7 FHIR. Stud Health Technol Inform 2016; 228: 162–166.

14. Saripalle R. Extending HL7 RIM Model to Capture PhysicalActivity Data. In: 29th International Conference on Software Engineering & Knowledge Engineering. Pittsburg, 2017. Epub ahead of print 2017. DOI: 10.18293/SEKE2017-053.

15. Schriml LM, Arze C, Nadendla S, et al. Disease ontology: A backbone for disease semantic integration. Nucleic Acids Res. Epub ahead of print 2012. DOI: 10.1093/nar/gkr972.

16. Saripalle RK. UMLS visualization for biomedical and health science classroom teaching and student learning. In: 2017 IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2017. 2017. Epub ahead of print 2017. DOI: 10.1109/BHI.2017.7897259.

17. Allemang D, Hendler J. Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL. 2nd ed. Morgan Kaufmann, 2011.

18. SNOMED. Systematized Nomenclature of Medicine. 2016, http://www.ihtsdo.org/snomed-ct (2007).

19. McDonald C, Huff S, Deckard J, et al. Logical Observation Identifiers Names and Codes (LOINC®): User’s Guide. LOINC, http://loinc.org/downloads/files/LOINCManual.pdf (2016).

20. Saripalle R. Representing UMLS knowledge using FHIR Terminological Resources. 2020. Epub ahead of print 2020. DOI: 10.1109/bibm47256.2019.8983305.

21. Gangemi A. Ontology Design Patterns for Semantic Web Content. In: Gil Y, Motta E, Benjamins VR, et al. (eds). Springer Berlin Heidelberg, pp. 262–276.

22. Gangemi A, Presutti V. Ontology Design Patterns. In: Handbook on Ontologies. 2009. Epub ahead of print 2009. DOI: 10.1007/978-3-540-92673-3_10.

23. Janowicz K, Compton M. The stimulus-sensor-observation ontology design pattern and its integration into the semantic sensor network ontology. In: CEUR Workshop Proceedings. 2010.

24. Saripalle RK, Demurjian SA, Blechner M, et al. HOD2MLC: Hybrid ontology design and development model with lifecycle. Int J Inf Technol Web Eng. Epub ahead of print 2015. DOI: 10.4018/IJITWE.2015040102.
