Vocabulary Interoperability in the Semantic Web

Why Linked Data will Transform the Taxonomy Profession

James R Morris, Associate Principal Information Scientist, AstraZeneca. Oct 21, 2013

This article was originally published in Taxonomy Times : Bulletin of the SLA Taxonomy Division, issue 16 (October 2013). Visit http://taxonomy.sla.org to learn more about becoming a member.

Introduction: why this matters

The management of controlled vocabularies is a traditional librarian specialty. These vocabularies, these networks of concepts, that librarians have created, managed, and used for decades, are more important than ever in today’s highly networked and complex information world. (This article uses the term "vocabularies" to refer to any type of controlled and structured collection of terms or concepts: thesauri, taxonomies, ontologies, glossaries, etc.)

We live in a networked world--a “linked” world. The World Wide Web is more than a technology; it’s a fact of life. No one entity controls the content on the web, and that’s a good thing. In the same way, we can’t control all the information and data that we might need to do our jobs. We can’t import it all, we can’t recreate it all, and we can't buy it all. More critically, we can't predict what information will help us answer our questions until we ask.

Vocabularies (taxonomies, thesauri, etc.), as every taxonomist knows, are essential for finding, navigating, and connecting the information or data we need. But just like the rest of the information we need, we don’t control those either (unless we own them). We can’t import them all and we can’t buy them all. And we can’t recreate them all--which is often what we end up doing.

To continue to build our profession's capability in building, managing, and using taxonomies, thesauri and other vocabularies, we need an approach that recognizes the reality of today’s highly networked information world. This approach includes Linked Data principles and the technology of the Semantic Web, which are designed to harness the inherent open-endedness of the World Wide Web. It's the right platform on which to continue building our professional expertise in vocabulary management.

The opportunities that this field presents to our profession are very exciting. We need more discussions about Linked Data vocabulary management among librarians and information scientists. To an equal extent, the values, skills, and experiences of librarians and information scientists are needed to make this emerging field successful.

Background: The “hyper-connected enterprise”

The phrase “hyper-connected enterprise” was highlighted in a 2007 Gartner report, “Gartner Predicts 2008: CIO Success Factors” (http://www.gartner.com/id=564014. License required). Many organizations are moving towards that model and the progression usually looks like this:

1. One Enterprise. It starts with a very self-contained organization, with ready access to all the information it needs.

2. Connected Enterprise. Then an increasing number of suppliers, outsourced providers, and new partners become part of how the organization operates day-to-day.

3. Hyper-connected Enterprise. Then, the relationship to these providers and partners begins to evolve. Positions of the players shift, they come and go more rapidly, and boundaries between the center and periphery begin to blur.

To use Gartner's language, “Ecosystems widen, Complexity rises, Coordination increases.” The bottom line is that coordination is the key--not control. Of course your own information must be well controlled, but you cannot succeed with that alone.

As mentioned earlier, this is about more than a hyper-connected enterprise--it’s about a hyper-connected world. It’s an open webbed world; we can’t predict what information we’ll need to answer--or even ask--our questions. But in this world how do we effectively combine resources and work with others to ultimately achieve our goals?

Vocabulary Interoperability and the Semantic Web

The Semantic Web is about building a data architecture that enables us to ask questions and solve business problems across a heterogeneous information landscape extending beyond the traditional boundaries of one organization. Traditionally, mainly professional indexers and searchers have used controlled vocabularies. But today they are often used in a text-analytics or informatics context:

Adding structure to unstructured text; essentially tagging unstructured content with vocabulary term identifiers, by looking for strings that match a wide-array of synonyms, and by other more sophisticated algorithms.

Adding additional structure to already indexed content so that we can organize information in ways specific to our requirements.

Adding vocabulary metadata to the content or to a search index to connect information across repositories of information by connecting the concepts embedded in that information.

Let’s review the hyper-connected enterprise again, this time in the context of controlled vocabularies:

1. One Enterprise. For one organization with good control of its information, an enterprise taxonomy or set of enterprise vocabularies can be created, and their usage enforced. A governance structure for maintaining it can be established. Enterprise taxonomies and master data management are still important--companies do need to manage their own information well.

2. Connected Enterprise: To use external information sources, use their native vocabularies. Delivering information to outsourced providers and/or regulatory agencies requires the use of certain vocabularies. For example, in the pharmaceutical and healthcare field, MedDRA, ICD-9-CM, ICD-10, and SNOMED are a few common ones. In the case of a merger or strategic alliance, another company might have its own enterprise taxonomy or product thesaurus that can be integrated with yours without too high a cost.

3. Hyper-connected enterprise: Other partners, universities, companies, divisions, and subsidiaries connected to the enterprise are spread across the globe, and they come and go with increasing frequency. Embarking on a new taxonomy integration project for every partner that appears in the network is not feasible.

A small biotech might say, “Enterprise vocabulary? We’re too small – we just use terms that we need; it’s not complicated. You're too bureaucratic.”

A large corporate partner or regulatory agency might say, “Over here, yes we follow standard vocabularies. They’re not the same as your standards, but…”

An information supplier or database provider could say, “Our proprietary databases use specially designed vocabs optimized for our software. You should really be using our software, you know. We have the information you need.”

Traditional solutions to taxonomy development and management fall short of meeting the challenges of the hyper-connected enterprise. Semantic Web principles and technology offer a new approach to the management of our thesauri, taxonomies, and ontologies--one of vocabulary interoperability. Vocabularies need to be treated as part of a network--just like the web itself. Part of the solution is a network of controlled vocabularies that facilitates data interoperability and query expansion across disparate information: structured and unstructured, internal and external. A “network” implies that the vocabularies can be rapidly adopted, reused, mapped, and versioned using standardized, sustainable technologies and approaches.

Principles

The following principles were drafted in my organization. They are presented as points for discussion within the librarian/taxonomist community.

Don’t create a new vocabulary where one already exists. The preference is for authoritative, trustworthy, well-managed sources. Apply standard library criteria used to evaluate resources: authority, accuracy, currency, point of view (objectivity), coverage, relevance, and format.

Extend those sources if necessary, but even then they should be extended by making associations with other authoritative sources. Whenever possible leverage cross-vocabulary mappings already available publicly.

When necessary, extend or map an external vocabulary with internal or bespoke terminology. This could include mapping obvious synonyms between vocabularies, or could entail more nuanced relationships. For example, mapping a published vocabulary to a list of values from a proprietary or internal database. By strategically connecting the concepts, the data is connected.

Do not corrupt authoritative vocabulary sources. It will always be possible to identify the source vocabulary, including what version, in the network. It's just like a library. A page is not taken out of one book and pasted into another; sections are not crossed out here and added there to the point where the books can’t be put back together again--that shouldn’t be done with vocabularies either. However, using Linked Data standards, vocabularies can be repurposed. For example, perhaps a deeply hierarchical taxonomy just needs to be rendered as a flat list, or only some concepts are relevant to the purpose at hand. Linked Data can enable these types of reuse without corrupting the source.

Obtain vocabularies directly from the authority that produces them whenever practical. Vocabularies will be managed as linked data. Preference is for the “SKOS” standard. SKOS stands for "Simple Knowledge Organization System" and is a lightweight W3C data standard that can be applied to any controlled vocabulary. Focusing on this simple standard allows us to standardize curation, management, and delivery processes.

Capabilities

The following is a proposed set of linked data vocabulary management capabilities. They are presented as an impetus for further discussion within the librarian/taxonomist community.

Vocabulary Mapping. Reusing, creating, and maintaining the relationships between vocabularies holds the power of the “vocabulary network” at the heart of Linked Data Vocabulary Management. Processes and technology need to be developed for lightweight, sustainable, semantics-based vocabulary mapping. Challenges include:

o Mapping two or more vocabularies together using Semantic technology.

o Facilitating the review of automatic mapping; approving, rejecting, adjusting, and amending mapping.

o Applying traditional taxonomy heuristics in a Linked Data environment as quality measures.

o Leveraging the mapping already done in published vocabularies.

o Publishing mapped “bridging” vocabularies to customers.

Vocabulary Slicing. Sometimes a vocabulary consumer (system or project) needs only a subset of a vocabulary that is relevant to their needs; for example, only the terms that are of interest to a particular project or organizational unit. Rather than create a duplicate, we need to identify terms or branches of relevance, and make those available--all while retaining their link back to the authority, so that updates to the authority can continue to be leveraged (new synonyms, new mappings, new child terms, new notes).

Vocabulary Version Management. As the focus is the reuse of already published vocabularies, successfully managing the impact of new versions on the network is critical. For example:

o Comparing a new version to an existing version.

o Identifying impact of vocabulary changes on the network; that is, to other vocabularies that map to it.

o Identifying impact of vocabulary changes on consuming systems or related datasets. Then, once we identify the impact, how do we address it?

Vocabulary modeling and conversions. The vocabularies in the network need to be accessible as Linked Data, using the data standards of the Semantic Web. And, until external suppliers deliver or make their vocabularies available as linked data, we will need to be extremely proficient in converting them from their native formats.

Vocabulary Inventory. Managing a network of vocabularies requires a sophisticated inventory. Ideally managed as Linked Data itself, the inventory can directly inform queries using the vocabulary network. For example, identify all the vocabularies that cover concept X, the datasets that reference those vocabularies, and how to access them. There are developing standards in the Semantic Web specifically for managing these types of inventories.
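As a rough illustration of the slicing and version-management capabilities described above, the following Python sketch treats a vocabulary as a plain set of (subject, predicate, object) statements. The "ex:" identifiers, the sample terms, and the function names are all hypothetical, and a real implementation would more likely sit on an RDF store than on Python sets.

```python
# A vocabulary as a set of (subject, predicate, object) statements.
# All "ex:" identifiers and labels are hypothetical sample data.
BROADER = "skos:broader"

vocab_v1 = {
    ("ex:Neoplasm", "skos:prefLabel", "Neoplasm"),
    ("ex:BladderNeoplasm", "skos:prefLabel", "Bladder Neoplasm"),
    ("ex:BladderNeoplasm", BROADER, "ex:Neoplasm"),
    ("ex:BladderCarcinoma", "skos:prefLabel", "Bladder Carcinoma"),
    ("ex:BladderCarcinoma", BROADER, "ex:BladderNeoplasm"),
    ("ex:Fracture", "skos:prefLabel", "Fracture"),
}

# A later release from the authority adds a synonym and a new child term.
vocab_v2 = vocab_v1 | {
    ("ex:BladderNeoplasm", "skos:altLabel", "Tumor of Bladder"),
    ("ex:BladderPapilloma", "skos:prefLabel", "Bladder Papilloma"),
    ("ex:BladderPapilloma", BROADER, "ex:BladderNeoplasm"),
}


def branch(triples, root):
    """Return the root concept plus every concept below it in the hierarchy."""
    members = {root}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == BROADER and o in members and s not in members:
                members.add(s)
                changed = True
    return members


def slice_vocabulary(triples, root):
    """Keep only statements about concepts in the branch; identifiers are unchanged."""
    members = branch(triples, root)
    return {t for t in triples if t[0] in members}


def compare_versions(old, new):
    """Return (added, removed) statements between two versions."""
    return new - old, old - new


slice_v1 = slice_vocabulary(vocab_v1, "ex:BladderNeoplasm")
slice_v2 = slice_vocabulary(vocab_v2, "ex:BladderNeoplasm")
added, removed = compare_versions(vocab_v1, vocab_v2)
# Statements from the new release that affect the slice a consumer uses.
impact_on_slice = added & slice_v2
```

Because the slice keeps the authority's own identifiers rather than copying terms into a new list, re-running the slice against a new release picks up the new synonym and child term without any manual re-curation.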

Vocabulary Mapping using Linked Data--in more detail.

For this article, just one of these capabilities, Vocabulary Mapping, will be explored in more detail. How, and why, would two vocabularies be brought together using Linked Data? First, a few Semantic Web fundamentals need to be covered.

The Semantic Web is a web of data--not pages.

Pages need to be read by a person, or a program mimicking a person, in order to obtain the meaning of the data on the page. For example, if we want to learn the times that local bars offer “happy hour”, we need to find the web page for each bar, scan pages, follow links, and finally read content that says something like “we have happy hour M-F from 4-7”. It doesn’t matter that that information might originally be in a database that builds the page, because to interpret the information we need to read the page. Linked Data challenges us to apply the principles of the web--unique identification of resources and links between those resources--to the data itself.

URIs, RDF, and Ontologies

URIs. URI stands for “uniform resource identifier”. URLs are “uniform resource locators”, a type of URI used to uniquely identify pages and locate them. You "point" your browser to them to locate the unique page. In the semantic web, things in the real world are also given URIs. In fact they use the same syntax. For example, a local bar, The Artful Dodger, might have a webpage with a URL. But in the semantic web the bar itself--as a unique thing, not a webpage--would have a URI. The URI could be used to differentiate it from the Dickens character of the same name, which might also have a URI.

RDF. RDF stands for Resource Description Framework. RDF is used to encode the most basic statement about a thing in the form of what is called a "triple": subject--predicate--object, which is often abbreviated S--P--O. For example:

o The Artful Dodger (subject), Is a (predicate), Bar (object).

o The Artful Dodger (subject), Has a (predicate), Happy Hour (object).

Ontologies. Ontologies are collections of information about classes of things and their relationships. For example, an ontology might state that a Bar (subject) is a Type of (predicate) Restaurant (object). So if we have data that states that the Artful Dodger (S), Is a (P), Bar (O), then, by referencing our ontology, we can infer that the Artful Dodger (S) is a (P), restaurant (O). Inference is a very important concept in the Semantic Web that will be covered more closely in the next example.
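The triple and inference ideas above can be sketched in a few lines of Python. This is only an illustration: the "ex:" identifiers are made up, the article's "Is a" and "Type of" predicates are assumed here to correspond to rdf:type and rdfs:subClassOf, and real systems would normally use an RDF library and a reasoner rather than hand-written loops.

```python
# Data: statements about a specific thing (hypothetical "ex:" identifiers).
data = {
    ("ex:ArtfulDodger", "rdf:type", "ex:Bar"),
    ("ex:ArtfulDodger", "ex:hasA", "ex:HappyHour"),
}

# Ontology: statements about classes of things.
ontology = {
    ("ex:Bar", "rdfs:subClassOf", "ex:Restaurant"),
}


def infer_types(data, ontology):
    """Return the rdf:type triples implied by subclass statements but not asserted."""
    subclass_of = {(s, o) for s, p, o in ontology if p == "rdfs:subClassOf"}
    known = {(s, o) for s, p, o in data if p == "rdf:type"}
    frontier = set(known)
    while frontier:
        newly_found = set()
        for thing, cls in frontier:
            for sub, sup in subclass_of:
                if sub == cls and (thing, sup) not in known:
                    newly_found.add((thing, sup))
        known |= newly_found
        frontier = newly_found
    return {(thing, "rdf:type", cls) for thing, cls in known} - data


inferred = infer_types(data, ontology)
```

The data never states that the Artful Dodger is a restaurant; that statement exists only as the output of the inference step, which is the point the example in the text makes.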

The following is how these fundamentals apply to vocabularies, and specifically to the capability of mapping two vocabularies together.

Concepts are Things

Just like The Artful Dodger is a thing, the concept of a “Bar” is also a thing. Vocabularies, taxonomies, thesauri, etc. are all collections of concepts.

Concepts have URIs

As discussed above, things are identified in the semantic web with URIs. A vocabulary is about concepts, their properties, and their relationships.

SKOS is for vocabularies

SKOS stands for “Simple Knowledge Organization System”, and is a very popular Semantic Web ontology. It is a standardized set of classes and properties for representing “knowledge organization systems”, a.k.a. vocabularies. Unlike other thesaurus and taxonomy standards, the fact that SKOS is Linked Data means it allows for the management of vocabularies in a distributed, linkable, interoperable way; in other words, as part of the open-ended web of data. The following are some of the most common SKOS elements:

skos:Concept and skos:ConceptScheme. These are what are called "classes" in the Semantic Web. Every "term" in a vocabulary is of the class skos:Concept. A skos:ConceptScheme designates a particular thesaurus, or part of a thesaurus. Concepts and ConceptSchemes are both things, so each is represented by a unique URI. Concepts can belong to one or more ConceptSchemes.

skos:prefLabel, skos:altLabel, and skos:definition. These are used to designate fundamental properties of classes like skos:Concept. skos:prefLabel is the preferred name for a concept, and skos:altLabel is any synonym or other type of entry term, e.g. “see from”, “used for”, etc. An important feature of SKOS, as well as most Semantic Web ontologies, is that the classes and properties are extensible. For example, we can create a sub-property unique to our context. We might want a specific type of skos:altLabel to differentiate acronyms from other synonyms. In this case, we would create a new property for acronym, but make it a sub-property of skos:altLabel. In this way, we can identify the acronyms specifically if we need to, but they can also be referenced as generic synonyms. This is particularly useful when working across multiple vocabularies where custom properties can all resolve to generic SKOS standards.

skos:broader, skos:narrower, skos:related. These are the essential properties that serve as the building blocks for a hierarchical or ontological structure, within a single vocabulary.

skos:exactMatch, skos:closeMatch. All of the above properties are intended for use within a particular vocabulary or skos:ConceptScheme. These matching relationships, on the other hand, are used to relate Concepts across vocabularies. For example, the concept for Bladder Cancer in one vocabulary might be an exactMatch to the concept for Bladder Neoplasm in another vocabulary.
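The acronym sub-property idea described above can be sketched as follows. The property name "ex:acronym", the concept identifier, and the labels are hypothetical, and the lookup handles only one level of sub-property; it is meant to show the principle, not how a SKOS tool implements it.

```python
# Hypothetical custom property declared as a sub-property of skos:altLabel.
SUB_PROPERTY_OF = {"ex:acronym": "skos:altLabel"}

triples = [
    ("ex:C1", "skos:prefLabel", "Magnetic Resonance Imaging"),
    ("ex:C1", "skos:altLabel", "MR Imaging"),
    ("ex:C1", "ex:acronym", "MRI"),
]


def values(triples, subject, prop, include_subproperties=True):
    """Return the values of a property, optionally including its direct sub-properties."""
    wanted = {prop}
    if include_subproperties:
        wanted |= {sub for sub, sup in SUB_PROPERTY_OF.items() if sup == prop}
    return sorted(o for s, p, o in triples if s == subject and p in wanted)


acronyms_only = values(triples, "ex:C1", "ex:acronym")
all_synonyms = values(triples, "ex:C1", "skos:altLabel")
```

Asking for the custom property returns only the acronym, while asking for the generic skos:altLabel returns the acronym together with the ordinary synonym.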

Now that some basics about Linked Data and SKOS have been covered, we’ll look at a specific example of using SKOS to deliver the core capability of Mapping. Here are two concepts from two popular biomedical thesauri: the National Cancer Institute (NCI) Thesaurus or NCIt, and the National Library of Medicine’s Medical Subject Headings or MeSH, encoded in SKOS:

NCI Thesaurus – “Bladder Neoplasm”

URI = http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C2901 (Shorthand: ncit:C2901)

Subject (S)    Predicate (P)      Object (O)
ncit:C2901     a                  skos:Concept
               skos:prefLabel     “Bladder Neoplasm”
               skos:altLabel      “Tumor of Bladder”
               skos:definition    “A benign or malignant…”
               skos:broader       ncit:C2900 (“Bladder Disorder”)
               skos:broader       ncit:C3431 (“Urinary System Neoplasm”)

MeSH – “Urinary Bladder Neoplasms”

URI = http://purl.bioontology.org/ontology/MSH/D001749 (Shorthand: mesh:D001749)

Subject (S)     Predicate (P)      Object (O)
mesh:D001749    a                  skos:Concept
                skos:prefLabel     “Urinary Bladder Neoplasms”
                skos:altLabel      “Cancer of Bladder”
                skos:definition    “Tumors or cancer of the URINARY BLADDER…”
                skos:broader       mesh:D014571 (“Urogenital Neoplasms”)
                skos:broader       mesh:D001745 (“Urinary Bladder Diseases”)

Once we can uniquely identify each of these concepts with URIs we can record some basic things about them:

They're both concepts.

They both have preferred labels

They each have synonyms or entry terms (many more, in fact, than just those listed here)

They have definitions.

They have one or more parents because both of these thesauri are poly-hierarchical.

A couple critical points here: If we look at the full record for this concept in MeSH (http://www.ncbi.nlm.nih.gov/mesh/68001749), it does in fact have the NCIt preferred label, “Bladder Neoplasm”, as a synonym. However, NCIt does not have the MeSH preferred name as a synonym. Note especially that alternate labels of “Cancer of the Bladder” in MeSH and “Tumor of the Bladder” in NCIt are unique to each vocabulary.

But what if we know that mesh:D001749 is the same concept as ncit:C2901? We map those two concepts together by creating a new “triple” that creates a relationship between the two URIs.


ncit:C2901 skos:exactMatch mesh:D001749

So now we've encoded the relationship that the MeSH concept labeled “Urinary Bladder Neoplasms” equals the NCIt concept labeled “Bladder Neoplasm”. How do we realize the power of this relationship? What if we’re searching two carefully indexed repositories--one indexed using MeSH and another using NCIt--and we want to produce a joined-up view of these two repositories, leveraging the power of the two separate indexing methods? What if we’re searching across multiple repositories and we want as rich a synonym set for “Bladder Cancer” as we can get? As we see in this example, by using only one of these vocabularies, we could end up missing things.
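The joined-up view described above might look something like the following Python sketch. The concept identifiers come from the example in the text, but the two repository indexes, the document names, and the function names are invented for illustration.

```python
EXACT_MATCH = "skos:exactMatch"

# The mapping triple from the example in the text.
mappings = {("ncit:C2901", EXACT_MATCH, "mesh:D001749")}

# Hypothetical repositories: document id -> concepts it was indexed with.
repo_indexed_with_ncit = {"doc-A": {"ncit:C2901"}, "doc-B": {"ncit:C2900"}}
repo_indexed_with_mesh = {"doc-X": {"mesh:D001749"}, "doc-Y": {"mesh:D014571"}}


def equivalents(concept, mappings):
    """Collect every concept reachable through exactMatch, in either direction."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in mappings:
            if p != EXACT_MATCH:
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in found:
                    found.add(b)
                    frontier.append(b)
    return found


def search(concept, mappings, *repositories):
    """Return documents indexed with the concept or any of its exact matches."""
    wanted = equivalents(concept, mappings)
    return sorted(
        doc
        for repo in repositories
        for doc, concepts in repo.items()
        if concepts & wanted
    )


hits = search("ncit:C2901", mappings, repo_indexed_with_ncit, repo_indexed_with_mesh)
```

The mapping is followed in both directions because SKOS defines skos:exactMatch as symmetric, so a search that starts from the MeSH identifier would find the same two documents.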

Here’s how we use Linked Data to do really interesting things. Now we’ll create a rule that builds upon our new matching relationship to bring more intelligence to our vocabularies. For this we use SPARQL. SPARQL is akin to SQL for relational databases in that it is the language with which to query and manipulate Linked Data in the Semantic Web.

SPARQL Rule:

CONSTRUCT ?s skos:altLabel ?o .
WHERE ?s skos:exactMatch ?match .
      ?match skos:prefLabel | skos:altLabel ?o .

(Note that this is not executable SPARQL; the syntax has been simplified for the clarity of this example.)

Creating these rules is key. When we hear experts talk about making your data “more intelligent”--this is what they mean! This three-line SPARQL command takes that new matching relationship and reasons that the preferred labels or synonyms in the matched MeSH concept can be inferred as additional synonyms (marked as inferred below) in the NCIt concept. It would do this for all the concepts in the NCIt that have been matched to MeSH terms.

Subject (S)    Predicate (P)     Object (O)
ncit:C2901     skos:prefLabel    “Bladder Neoplasm”
               skos:altLabel     “Urinary Bladder Neoplasms” (inferred)
               skos:altLabel     “Cancer of the Bladder” (inferred)

Those terms “reasoning” and “inferred” are very important. First, we’re making a rule that a computer can use to infer additional information. Second, that’s all this has to be--an inference. We are not necessarily hard-coding additional vocabulary data. We are not manipulating the source vocabularies at all. Every inferred triple can be managed completely separately from either source vocabulary!

A fair question to ask is, "What about meta-thesauri and large vocabulary mapping efforts? Aren't they intended to do the same thing? For example, isn’t that what the National Cancer Institute's Metathesaurus and the National Library of Medicine's UMLS (Unified Medical Language System) initiatives are for?” Yes, to a point. However:

1. MeSH and NCIt are just two familiar examples. What if we try to map vocabularies or search across repositories that use less standard vocabularies: proprietary, internal, or custom vendor vocabs? Or what if we need to search and view information in an interface using the richness of NCIt, but against a repository that uses something different--a folksonomy perhaps, or just values selected from a drop-down list?

2. What if the mapping we wanted to do was proprietary to a project or a company? Every taxonomist will agree that how concepts are arranged and mapped to each other has power. However, what if that mapping was relevant today--but not tomorrow? How can this be accomplished by leveraging these large published vocabularies without corrupting them, or worse--by building yet another one?
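For readers who find it easier to follow in a general-purpose language, the simplified SPARQL rule in this section can be approximated in plain Python. This is a sketch under stated assumptions, not the author's implementation: the triples are typed in by hand from the two concept tables earlier in the article, the function name is invented, and the rule is applied only in the direction in which it is written.

```python
# Source vocabularies, copied from the concept tables in the article.
ncit = {
    ("ncit:C2901", "skos:prefLabel", "Bladder Neoplasm"),
    ("ncit:C2901", "skos:altLabel", "Tumor of Bladder"),
}
mesh = {
    ("mesh:D001749", "skos:prefLabel", "Urinary Bladder Neoplasms"),
    ("mesh:D001749", "skos:altLabel", "Cancer of Bladder"),
}
# The mapping is kept in its own set, separate from both sources.
mapping = {("ncit:C2901", "skos:exactMatch", "mesh:D001749")}


def infer_alt_labels(triples):
    """For each ?s exactMatch ?match, infer ?s altLabel ?o from the labels of ?match."""
    inferred = set()
    for s, p, match in triples:
        if p != "skos:exactMatch":
            continue
        for s2, p2, o in triples:
            if s2 == match and p2 in ("skos:prefLabel", "skos:altLabel"):
                candidate = (s, "skos:altLabel", o)
                if candidate not in triples:
                    inferred.add(candidate)
    return inferred


inferred = infer_alt_labels(ncit | mesh | mapping)
```

The function returns the inferred statements as a separate set and leaves the ncit and mesh sets untouched, which mirrors the point that inferred triples can be managed apart from either source vocabulary.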

It is important to be clear that identifying those initial matches (and they are not always as straightforward as an “exactMatch”) is the hard intellectual work of vocabulary management that is, and will continue to be, a hallmark of our profession. The semantic web doesn’t solve the problem of knowing which terms to match. But what it does do--unlike any other taxonomy, thesaurus, vocabulary management system--is give us the standards and tools to encode those matches and create rules that computers can use to make additional inferences. And very importantly, it does not require that we merge or integrate or otherwise corrupt original authoritative sources. So this approach doesn’t make it easy to manage vocabularies in a hyper-connected world--that’s not the point. The point is it makes it possible; possible, sustainable and extensible.

Linked Data Vocabulary Management Roles

In thinking about how to make this all work and to develop and deliver vocabulary services based on these new approaches, the following roles could be assigned.

Vocabulary Developer. Uses Semantic Web technologies and informatics toolkit to ingest, model, bridge, slice, quality check, and version-control vocabularies. This would require some programming skills, and in-depth knowledge of the technology.

Vocabulary Manager. Provides hands-on management, development and enhancement of vocabularies. They would have a deep understanding of vocabulary processes; and internal/external vocabulary landscape and best practices, often with subject specialty. This role is most like our taxonomist role of today.

Vocabulary Administrator. Provides first-line support to help customers access, use, and exploit available vocabularies. They are accountable for management of service processes, documentation, and inventories.

Conclusion: The Semantic Web needs us

The Promise and the Reality. Linked Data approaches are the sustainable way to work with vocabularies in an increasingly federated and linked information world. The reality is that not enough vocabularies are available, from the authority that produced them, natively in SKOS or another form of RDF. This needs to change. But we can start to instill these practices where we work. This field is rapidly developing; it needs librarians and information scientists with their deep knowledge of the importance and power of controlled vocabularies and the information sources that use them.

This is not the future--it’s now. Leading organizations in the US, like OCLC and the Library of Congress, are adopting these standards. This is even more true in Europe. The field of biomedical ontologies is very exciting right now, and firmly moving toward a Linked Data paradigm. The nature of business and of information demands new approaches. Even taking the Semantic Web’s promise of universally sharing data across the world out of the picture, SKOS and RDF--even just applied to vocabularies--are opportunity enough. These vocabularies can be published in whatever form people or systems need them in. However, managing a web of vocabularies requires tools and techniques designed for the web.

This is not easy. Breaking down information, including vocabularies, into their most basic constructs is what RDF does. It is elegant in design, but can quickly become very complex in execution. Writing the SPARQL queries that do complete transformations of vocabularies, manage versions, etc. can become very sophisticated. But isn't it a challenge worth tackling? Aren’t librarians and information scientists poised to make this happen?

SKOS is a great example of a simplified model with practical application. RDF can be used to model ontologies in linguistically accurate, but extraordinarily complex, ways as well. The discussions at a Bioontology or Semantic Web conference can get very deep. But the beauty of RDF is that those complex models can be transformed, again without corrupting the source, into a simple model like SKOS. Guarding against unnecessary complexity is something that librarians are especially good at.

Managing unconnected silos of information, including taxonomies, will not harness the power of a networked, collaborative world. Vocabulary interoperability, enabled by the Linked Data principles of the Semantic Web will--if further developed as a discipline within our profession.

Acknowledgements

This article was derived from a presentation given by James R Morris at the Special Libraries Association’s 2013 PHT and BIO Division Spring Meeting, April 14-16, 2013.

The author thanks:

The AstraZeneca R&D Vocabulary Team and Linked Data Community of Practice members: Courtland Yockey, Sorana Popa, Rob Hernandez, Dana Crowley, Tom Plasterer, Mike Westaway, Kerstin Forsberg, Johan Turnqvist, Tracy Wolski, Dan Ringenbach, Rajan Desai, Pete Edwards, Simon Rakov, Jeff Saxe, and Ian Dix.

Vendors and Advisors:

o Scott Henninger, Bob Ducharme – TopQuadrant

o Dean Allemang – Working Ontologist

References

• Dean Allemang, Jim Hendler. Semantic Web for the Working Ontologist, 2nd edition. Morgan Kaufmann, 2011. The first few chapters of this book provide one of the best overviews of the semantic web that you’ll find, and are very readable. It also has a whole chapter on SKOS. Available through ScienceDirect and Safari.

• Bob Ducharme. Learning SPARQL, 2nd edition. O’Reilly Media, 2013. Invaluable to have on hand when writing SPARQL.

• World Wide Web Consortium (W3C). SKOS Simple Knowledge Organisation System Primer. http://www.w3.org/TR/skos-primer Straight from the source, it is technically oriented, but once you understand the basics, this is essential reading.

• National Center for Biomedical Ontology. BioPortal. http://bioportal.bioontology.org. A phenomenally deep resource providing access to many biomedical vocabularies, queryable through SPARQL. The NCBO also has several webinars available.

• Lee Harland, et al. “Empowering industrial research with shared biomedical vocabularies.” Drug Discovery Today, Volume 16, Issues 21-22, November 2011, Pages 940-947, ISSN 1359-6446, 10.1016/j.drudis.2011.09.013. Excellent overview of the challenges and opportunities for using biomedical vocabularies in pharmaceutical research.

• Kerstin Forsberg. Linked Data and URI:s for Enterprises (Blog). http://kerfors.blogspot.com. A thought-leader and evangelist for linked data, especially in the clinical domain.
