A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

DESCRIPTION

The large-scale analysis of scholarly artifact usage is constrained primarily by current practices in usage data archiving, privacy issues concerned with the dissemination of usage data, and the lack of a practical ontology for modeling the usage domain. As a remedy to the third constraint, this article presents a scholarly ontology that was engineered to represent those classes for which large-scale bibliographic and usage data exists, supports usage research, and whose instantiation is scalable to the order of 50 million articles along with their associated artifacts (e.g. authors and journals) and an accompanying 1 billion usage events. The real world instantiation of the presented abstract ontology is a semantic network model of the scholarly community which lends the scholarly process to statistical analysis and computational support. We present the ontology, discuss its instantiation, and provide some example inference rules for calculating various scholarly artifact metrics.

TRANSCRIPT

Page 1: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

JCDL 2007, June 18-23 2007, Vancouver, Canada

A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Marko A. Rodriguez (1)

Johan Bollen

Herbert Van de Sompel

Digital Library Research & Prototyping Team

Los Alamos National Laboratory - Research Library

(1) [email protected]

Acknowledgements:

Lyudmila L. Balakireva (LANL), Wenzhong Zhao (LANL), Aric Hagberg (LANL)

MESUR is supported by the Andrew W. Mellon Foundation.

Page 2: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Overview

The MESUR project

A quick RDF/RDFS/OWL tutorial

Modeling the scholarly community

Practical applications of the model

Conclusion

Page 3: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Overview

The MESUR project

A quick RDF/RDFS/OWL tutorial

Modeling the scholarly community

Practical applications of the model

Conclusion

Page 4: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

What is the MESUR project?

• MEtrics from Scholarly Usage of Resources

o http://www.mesur.org

• The MESUR project is currently gathering publication, citation, and usage data from providers world-wide in order to engineer a large-scale scholarly model.

o Publication and citation data from bibliographic databases.

o Usage data logged by institutions, publishers, and aggregators.

Page 5: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Journal and Article data

• Journal-level Bibliographic Data

o Thomson Scientific JCR: > 8,000 indexed journals

o Thomson Scientific JCR: > 50,000,000 journal citations

o Thomson Scientific JCR: > 100,000 journal classifications

o Ex Libris SFX Journal Master List: > 300,000 journal identifiers (title, abbreviated title, ISSN, eISSN) clustered in groups

o Ex Libris SFX: > 85,000 journal classifications

• Article-level Bibliographic Data

o Thomson Scientific Citation Databases: > 37,500,000 articles

o Thomson Scientific Citation Databases: > 550,000,000 article citations

Page 6: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Usage data

• Institutions (link resolvers, proxies)

o Los Alamos: > 350,000 (1 year)

o CalState: > 3,500,000 (2 years)

o UTexas: > 2,500,000 (5 years)

o PennState: …

o …

• Aggregators

o anonymous: > 2,500,000 (1 year)

o anonymous: > 50,000,000 (1 week)

o …

• Publishers

o BioMed Central: > 24,000,000 (2 years)

o Elsevier

o …

Page 7: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The Primary Data Representation

• So how are we going to represent all this data in one model such that this model can be analyzed computationally?

ANSWER: A semantic network.

• Why are we using a semantic network?

o Able to represent a heterogeneous set of entities related to one another by a heterogeneous set of relationships.

o All “actors” are represented in the same substrate.

o Existing technologies and standards to support the representation: triple stores, RDF, semantic network analysis algorithms, etc.

[Figure: example network with nodes A, B, and C]

Page 8: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Example Scholarly Relationships

• <marko, wrote, practical_ontology>

• <johan, wrote, practical_ontology>

• <herbertv, wrote, practical_ontology>

• <practical_ontology, publishedIn, jcdl>

• <practical_ontology, cites, rdf_specification>

• <rdf_specification, downloadedBy, 127.0.0.1>

• <127.0.0.1, from, LANL>

• <LANL, contains, herbertv>

• <herbertv, coauthorsWith, marko>
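The relationships above can be held as plain subject-predicate-object tuples. The following is a minimal sketch in Python, not the MESUR implementation; the `match` helper is a hypothetical name introduced here to show wildcard lookups over such a set.

```python
# Each statement is a (subject, predicate, object) tuple, as on the slide.
triples = {
    ("marko", "wrote", "practical_ontology"),
    ("johan", "wrote", "practical_ontology"),
    ("herbertv", "wrote", "practical_ontology"),
    ("practical_ontology", "publishedIn", "jcdl"),
    ("practical_ontology", "cites", "rdf_specification"),
    ("rdf_specification", "downloadedBy", "127.0.0.1"),
    ("127.0.0.1", "from", "LANL"),
    ("LANL", "contains", "herbertv"),
    ("herbertv", "coauthorsWith", "marko"),
}


def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who wrote the paper?
authors = [s for s, _, _ in match(triples, p="wrote", o="practical_ontology")]
print(authors)
```

Because people, articles, venues, and IP addresses all sit in the same set, one lookup function serves every kind of relationship.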

Page 9: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

What is the Purpose of an Ontology?

Page 10: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The MESUR Data Flow

Page 11: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Overview

The MESUR project

A quick RDF/RDFS/OWL tutorial

Modeling the scholarly community

Practical applications of the model

Conclusion

Page 12: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

RDF, RDFS, OWL

• The Resource Description Framework (RDF)

o A data model for representing a semantic network: URIs connected to one another by a URI, e.g. <lanl:marko, lanl:worksWith, lanl:johan>

• The Resource Description Framework Schema (RDFS)

o A simple ontology language for defining classes and their relationships to one another (provides basic class hierarchy construction).

• The Web Ontology Language (OWL)

o A more advanced ontology language (this is the ontology language used in MESUR).

Page 13: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

RDF, RDFS

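[Figure: an RDF graph split into an ontology layer and an instance layer]

ontology: <ex:isEating, rdfs:domain, ex:Human>, <ex:isEating, rdfs:range, ex:Food>

instance: <ex:marko, rdf:type, ex:Human>, <ex:cookie, rdf:type, ex:Food>, <ex:marko, ex:isEating, ex:cookie>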
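The rdfs:domain and rdfs:range statements in this slide's graph let a reasoner derive the rdf:type edges from the ex:isEating edge alone. Below is a minimal Python sketch of that inference, assuming at most one domain and one range per property; `infer_types` is a hypothetical helper, not part of any RDF library.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

# Ontology layer: what ex:isEating connects.
ontology = {
    ("ex:isEating", RDFS_DOMAIN, "ex:Human"),
    ("ex:isEating", RDFS_RANGE, "ex:Food"),
}

# Instance layer: a single asserted statement.
instance = {("ex:marko", "ex:isEating", "ex:cookie")}


def infer_types(ontology, instance):
    """Derive rdf:type triples from domain/range declarations.

    Simplification: each property has at most one domain and one range.
    """
    domains = {s: o for s, p, o in ontology if p == RDFS_DOMAIN}
    ranges = {s: o for s, p, o in ontology if p == RDFS_RANGE}
    inferred = set()
    for s, p, o in instance:
        if p in domains:
            inferred.add((s, RDF_TYPE, domains[p]))  # subject is in the domain
        if p in ranges:
            inferred.add((o, RDF_TYPE, ranges[p]))   # object is in the range
    return inferred


inferred = infer_types(ontology, instance)
for triple in sorted(inferred):
    print(triple)
```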

Page 14: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

RDF, RDFS, OWL

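[Figure: an RDF graph split into an ontology layer and an instance layer, extended with an OWL restriction]

ontology: <ex:hasOwner, rdfs:domain, ex:Pet>, <ex:hasOwner, rdfs:range, ex:Human>, <ex:Pet, rdfs:subClassOf, _:0123>, <_:0123, rdf:type, owl:Restriction>, <_:0123, owl:onProperty, ex:hasOwner>, <_:0123, owl:maxCardinality, “1”>

instance: <ex:fluffy, rdf:type, ex:Pet>, <ex:marko, rdf:type, ex:Human>, <ex:fluffy, ex:hasOwner, ex:marko>, plus a second ex:hasOwner edge to ex:bob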
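An owl:maxCardinality restriction of 1 on ex:hasOwner says a pet has at most one owner. The sketch below does a simple closed-world count in Python, assuming the second ex:hasOwner edge in the figure runs from ex:fluffy to ex:bob; `over_cardinality` is a hypothetical helper. An OWL reasoner would typically not report an error here: under the open-world assumption it may instead conclude that ex:marko and ex:bob denote the same individual, unless they are stated to be different.

```python
from collections import defaultdict


def over_cardinality(triples, prop, max_card):
    """Map each subject to its values of `prop` when it has more than max_card."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: sorted(v) for s, v in values.items() if len(v) > max_card}


# Assumed instance data: ex:fluffy has two distinct ex:hasOwner values.
instance = {
    ("ex:fluffy", "ex:hasOwner", "ex:marko"),
    ("ex:fluffy", "ex:hasOwner", "ex:bob"),
}

flagged = over_cardinality(instance, "ex:hasOwner", 1)
print(flagged)
```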

Page 15: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The Triple Store

SELECT ?a ?c
WHERE
  ( ?a type human )
  ( ?a wrote ?b )
  ( ?b type article )
  ( ?c wrote ?b )
  ( ?c type human )
  ( ?a != ?c )

• The triple store is to semantic networks what the relational database is to data tables.

• Storing and querying triples in a triple store

• SPARQL query language

o like SQL, but for triple stores
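To make the query on this slide concrete, here is a minimal Python sketch that evaluates the same pattern (two different humans who wrote the same article) over a small in-memory set of triples. The data and the `coauthor_pairs` helper are illustrative assumptions; a real triple store would answer this through its query engine.

```python
from itertools import permutations

triples = {
    ("marko", "type", "human"),
    ("johan", "type", "human"),
    ("herbertv", "type", "human"),
    ("practical_ontology", "type", "article"),
    ("marko", "wrote", "practical_ontology"),
    ("johan", "wrote", "practical_ontology"),
    ("herbertv", "wrote", "practical_ontology"),
}


def coauthor_pairs(store):
    """Return ordered pairs (a, c) of distinct humans who wrote the same article."""
    humans = {s for s, p, o in store if p == "type" and o == "human"}
    articles = {s for s, p, o in store if p == "type" and o == "article"}
    authors_of = {}
    for s, p, o in store:
        if p == "wrote" and s in humans and o in articles:
            authors_of.setdefault(o, set()).add(s)
    pairs = set()
    for authors in authors_of.values():
        # permutations never pairs an author with themselves (?a != ?c)
        pairs.update(permutations(sorted(authors), 2))
    return sorted(pairs)


pairs = coauthor_pairs(triples)
print(pairs)
```

Like the query, the result contains both (a, c) and (c, a) for every coauthoring pair.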

Page 16: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Overview

The MESUR project

A quick RDF/RDFS/OWL tutorial

Modeling the scholarly community

Practical applications of the model

Conclusion

Page 17: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The Problem of Scale

• High-end triple stores reasonably support 1+ billion triples.

• The MESUR solution is to not include all artifact metadata in the triple store.

• MESUR leverages relational database and triple store technology.

o The triple store is for relationships.

o The relational database is for metadata.
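The split described above can be sketched as follows: relationships are traversed in the triple store, and the resulting URIs are then used as keys into a relational table holding metadata. This is an illustration under stated assumptions, not the MESUR architecture itself. The article URIs, the `metadata` table, and the titles are all hypothetical, and a Python set stands in for the triple store.

```python
import sqlite3

# Stand-in for the triple store: relationships only.
triples = {
    ("urn:article:1", "publishedIn", "urn:issn:0028-0836"),
    ("urn:article:2", "publishedIn", "urn:issn:0028-0836"),
}

# Relational database: descriptive metadata keyed by the same URIs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata (uri TEXT PRIMARY KEY, title TEXT)")
db.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [
        ("urn:article:1", "Hypothetical title one"),
        ("urn:article:2", "Hypothetical title two"),
    ],
)


def titles_published_in(journal_uri):
    """Find articles via relationships, then fetch their titles from SQL."""
    uris = sorted(
        s for s, p, o in triples if p == "publishedIn" and o == journal_uri
    )
    titles = []
    for uri in uris:
        row = db.execute(
            "SELECT title FROM metadata WHERE uri = ?", (uri,)
        ).fetchone()
        if row is not None:
            titles.append(row[0])
    return titles


print(titles_published_in("urn:issn:0028-0836"))
```

Keeping titles, abstracts, and similar literals out of the triple store leaves its triple budget for the relationships that the analysis algorithms actually traverse.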

Page 18: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Relational Database & Triple Store

Page 19: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The MESUR Class Hierarchy

Page 20: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The Context Classes

Inspired by OntologyX: http://www.ontologyx.com

Page 21: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The Publishes Context

Page 22: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The Uses Context

Page 23: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Overview

The MESUR project

A quick RDF/RDFS/OWL tutorial

Modeling the scholarly community

Practical applications of the model

Conclusion

Page 24: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Analysis Algorithms

• ISI Impact Factor

• Usage Impact Factor

o Bollen, J., Van de Sompel, H., “Usage Impact Factor: The Effects of Sample Characteristics on Usage-based Impact Metrics”, [in review], 2007.

• H-Index

o Hirsch, J.E., “An index to quantify an individual's scientific research output”, Proceedings of the National Academy of Sciences, 102:46, 2005.

• Y-Factor

o Bollen, J., Rodriguez, M.A., Van de Sompel, H., “Journal Status”, Scientometrics, 69:3, 2006.

• Other social network metrics

o Eccentricity, Betweenness, Closeness, PageRank, …
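Of the metrics listed, the H-Index is the simplest to state: an author has index h if h of their papers have at least h citations each. The Python sketch below computes it from a list of per-paper citation counts; the function name and the sample counts are illustrative only.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break  # counts only get smaller from here on
    return h


print(h_index([10, 8, 5, 4, 3]))
```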

Page 25: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Journal Citation and Usage

Page 26: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Calculating the 2007 Impact Factor

SELECT ?x
WHERE
  ( ?x rdf:type mesur:Citation )
  ( ?x mesur:hasSource ?a )
  ( ?x mesur:hasSink urn:issn:0028-0836 )
  ( ?x mesur:hasSourceTime ?u ) AND (?u == 2007)
  ( ?x mesur:hasSinkTime ?t ) AND (?t > 2004 AND ?t < 2007)

SELECT ?y
WHERE
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasGroup urn:issn:0028-0836 )
  ( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)

INSERT < _123 rdf:type mesur:ImpactFactor >
INSERT < _123 mesur:hasObject urn:issn:0028-0836 >
INSERT < _123 mesur:hasStartTime 2007 >
INSERT < _123 mesur:hasEndTime 2007 >
INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) >

The 2007 impact factor of journal A is the total number of citations to articles published in A in 2005 and 2006 from articles published in 2007 in any journal B, divided by the total number of articles published by journal A in 2005 and 2006.
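The definition above reduces to two counts and a division. The following Python sketch works over hypothetical citation and publication records; the dictionary keys (`sink_journal`, `source_year`, `sink_year`, `journal`, `year`) are assumptions loosely modeled on the query's properties, and the sample data is invented.

```python
def impact_factor(citations, publications, journal, year):
    """Citations made in `year` to the journal's articles from the two
    preceding years, divided by the number of those articles."""
    window = {year - 2, year - 1}
    cited = sum(
        1 for c in citations
        if c["sink_journal"] == journal
        and c["source_year"] == year
        and c["sink_year"] in window
    )
    published = sum(
        1 for p in publications
        if p["journal"] == journal and p["year"] in window
    )
    return cited / published if published else None


NATURE = "urn:issn:0028-0836"
citations = [
    {"sink_journal": NATURE, "source_year": 2007, "sink_year": 2005},
    {"sink_journal": NATURE, "source_year": 2007, "sink_year": 2006},
    {"sink_journal": NATURE, "source_year": 2007, "sink_year": 2006},
    {"sink_journal": NATURE, "source_year": 2006, "sink_year": 2005},  # citing year is not 2007
    {"sink_journal": NATURE, "source_year": 2007, "sink_year": 2003},  # cited article too old
]
publications = [
    {"journal": NATURE, "year": 2005},
    {"journal": NATURE, "year": 2006},
    {"journal": NATURE, "year": 2007},  # outside the 2005-2006 window
]

nature_if = impact_factor(citations, publications, NATURE, 2007)
print(nature_if)
```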

Page 27: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Calculating the 2007 Usage Impact Factor

SELECT ?x
WHERE
  ( ?x rdf:type mesur:Uses )
  ( ?x mesur:hasUnit ?a )
  ( ?x mesur:hasGroup ?b )
  ( ?b mesur:partOf urn:issn:1082-9873 )
  ( ?x mesur:hasTime ?t ) AND (?t == 2007)
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasUnit ?a )
  ( ?y mesur:hasTime ?u ) AND (?u > 2004 AND ?u < 2007)

SELECT ?y
WHERE
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasGroup ?a )
  ( ?a mesur:partOf urn:issn:1082-9873 )
  ( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)

INSERT < _123 rdf:type mesur:UsageImpactFactor >
INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
INSERT < _123 mesur:hasStartTime 2007 >
INSERT < _123 mesur:hasEndTime 2007 >
INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) >

The 2007 usage impact factor of journal A is the total number of 2007 usage events of articles published in A in 2005 and 2006 divided by the total number of articles published by journal A in 2005 and 2006.
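The usage variant replaces citation counts with usage events on the same set of recently published articles. A minimal Python sketch under assumed record shapes follows; the keys `article`, `journal`, and `year`, the article identifiers, and the sample events are all hypothetical.

```python
def usage_impact_factor(usage_events, publications, journal, year):
    """Usage events in `year` on the journal's articles from the two
    preceding years, divided by the number of those articles."""
    window = {year - 2, year - 1}
    eligible = {
        p["article"] for p in publications
        if p["journal"] == journal and p["year"] in window
    }
    if not eligible:
        return None
    uses = sum(
        1 for u in usage_events
        if u["year"] == year and u["article"] in eligible
    )
    return uses / len(eligible)


JOURNAL = "urn:issn:1082-9873"
publications = [
    {"article": "a1", "journal": JOURNAL, "year": 2005},
    {"article": "a2", "journal": JOURNAL, "year": 2006},
    {"article": "a3", "journal": JOURNAL, "year": 2007},  # too recent to count
]
usage_events = [
    {"article": "a1", "year": 2007},
    {"article": "a1", "year": 2007},
    {"article": "a2", "year": 2007},
    {"article": "a2", "year": 2007},
    {"article": "a3", "year": 2007},  # article outside the window
    {"article": "a1", "year": 2006},  # usage outside the target year
]

uif = usage_impact_factor(usage_events, publications, JOURNAL, 2007)
print(uif)
```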

Page 28: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Overview

The MESUR project

A quick RDF/RDFS/OWL tutorial

Modeling the scholarly community

Practical applications of the model

Conclusion

Page 29: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Contributions

• Uniting the RDF/Semantic Web community technology with scholarly modeling.

o Semantic network model of the scholarly community.

• Architectural set-up supports a massive data set

o Triple-store/relational database coupling.

• Open ontology for the scholarly community

o Open source ontology to represent many aspects of the scholarly communication process including publication, citation, and usage.

Page 30: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Some Related Publications

Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007

Marko A. Rodriguez. Grammar-based random walkers in semantic networks. (LAUR-06-7791)

Marko A. Rodriguez and Jennifer H. Watkins. Grammar-based geodesics in semantic networks. (LAUR-07-4042)

Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (arxiv.org: cs.DL/0610154)

Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.

Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.

Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030)

Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.

Page 31: A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

Questions

MESUR is at http://www.mesur.org

MESUR ontology is at http://www.mesur.org/schemas/2007-01/mesur/

Many thanks to the Andrew W. Mellon Foundation for their support