A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage
DESCRIPTION
The large-scale analysis of scholarly artifact usage is constrained primarily by current practices in usage data archiving, privacy issues concerned with the dissemination of usage data, and the lack of a practical ontology for modeling the usage domain. As a remedy to the third constraint, this article presents a scholarly ontology that was engineered to represent those classes for which large-scale bibliographic and usage data exist, to support usage research, and to scale in its instantiation to the order of 50 million articles along with their associated artifacts (e.g. authors and journals) and an accompanying 1 billion usage events. The real-world instantiation of the presented abstract ontology is a semantic network model of the scholarly community which lends the scholarly process to statistical analysis and computational support. We present the ontology, discuss its instantiation, and provide some example inference rules for calculating various scholarly artifact metrics.
TRANSCRIPT
Marko A. Rodriguez, Johan Bollen, Herbert Van de Sompel
JCDL 2007, June 18-23 2007, Vancouver, Canada
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage
Marko A. Rodriguez (1)
Johan Bollen
Herbert Van de Sompel
Digital Library Research & Prototyping Team
Los Alamos National Laboratory - Research Library
Acknowledgements:
Lyudmila L. Balakireva (LANL), Wenzhong Zhao (LANL), Aric Hagberg (LANL)
MESUR is supported by the Andrew W. Mellon Foundation.
Overview
The MESUR project
A quick RDF/RDFS/OWL tutorial
Modeling the scholarly community
Practical applications of the model
Conclusion
The MESUR project
What is the MESUR project?
• MEtrics from Scholarly Usage of Resources
o http://www.mesur.org
• The MESUR project is currently gathering publication, citation, and usage data from providers world-wide in order to engineer a large-scale scholarly model.
o Publication and citation data from bibliographic databases.
o Usage data logged by institutions, publishers, and aggregators.
Journal and Article data
• Journal-level Bibliographic Data
o Thomson Scientific JCR: > 8,000 indexed journals
o Thomson Scientific JCR: > 50,000,000 journal citations
o Thomson Scientific JCR: > 100,000 journal classifications
o Ex Libris SFX Journal Master List: > 300,000 journal identifiers (title, abbreviated title, ISSN, eISSN) clustered in groups
o Ex Libris SFX: > 85,000 journal classifications
• Article-level Bibliographic Data
o Thomson Scientific Citation Databases: > 37,500,000 articles
o Thomson Scientific Citation Databases: > 550,000,000 article citations
Usage data
• Institutions (link resolvers, proxies)
o Los Alamos: > 350,000 (1 year)
o CalState: > 3,500,000 (2 years)
o UTexas: > 2,500,000 (5 years)
o PennState: …
o …
• Aggregators
o anonymous: > 2,500,000 (1 year)
o anonymous: > 50,000,000 (1 week)
o …
• Publishers
o BioMed Central: > 24,000,000 (2 years)
o Elsevier
o …
The Primary Data Representation
• So how are we going to represent all this data in one model such that the model can be analyzed computationally?
ANSWER: A semantic network.
• Why are we using a semantic network?
o Able to represent a heterogeneous set of entities related to one another by a heterogeneous set of relationships.
o All “actors” are represented in the same substrate.
o Existing technologies and standards support the representation: triple stores, RDF, semantic network analysis algorithms, etc.
Example Scholarly Relationships
• <marko, wrote, practical_ontology>
• <johan, wrote, practical_ontology>
• <herbertv, wrote, practical_ontology>
• <practical_ontology, publishedIn, jcdl>
• <practical_ontology, cites, rdf_specification>
• <rdf_specification, downloadedBy, 127.0.0.1>
• <127.0.0.1, from, LANL>
• <LANL, contains, herbertv>
• <herbertv, coauthorsWith, marko>
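The relationships above can be held directly as (subject, predicate, object) tuples. The following is a minimal Python sketch of that idea, not MESUR's implementation; the helper names `build_index` and `objects_of` are hypothetical, and the in-memory dictionary only stands in for a real triple store.

```python
from collections import defaultdict

# Triples from the slide, written as (subject, predicate, object).
TRIPLES = [
    ("marko", "wrote", "practical_ontology"),
    ("johan", "wrote", "practical_ontology"),
    ("herbertv", "wrote", "practical_ontology"),
    ("practical_ontology", "publishedIn", "jcdl"),
    ("practical_ontology", "cites", "rdf_specification"),
    ("rdf_specification", "downloadedBy", "127.0.0.1"),
    ("127.0.0.1", "from", "LANL"),
    ("LANL", "contains", "herbertv"),
    ("herbertv", "coauthorsWith", "marko"),
]


def build_index(triples):
    """Index triples by (subject, predicate) for direct edge lookups."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def objects_of(index, subject, predicate):
    """Return every object reachable from subject over one predicate edge."""
    return index.get((subject, predicate), set())


INDEX = build_index(TRIPLES)

# Everyone with a "wrote" edge into the paper.
authors = sorted(s for s, p, o in TRIPLES if p == "wrote" and o == "practical_ontology")
```

Because authors, papers, venues, IP addresses, and institutions all sit in one list, a single lookup function serves every relationship type, which is the "same substrate" property the talk relies on.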
What is the Purpose of an Ontology?
The MESUR Data Flow
A quick RDF/RDFS/OWL tutorial
RDF, RDFS, OWL
• The Resource Description Framework (RDF)
o A data model for representing a semantic network: URIs connected to one another by a URI, e.g. <lanl:marko, lanl:worksWith, lanl:johan>
• The Resource Description Framework Schema (RDFS)
o A simple ontology language for defining classes and their relationships to one another (provides basic class hierarchy construction).
• The Web Ontology Language (OWL)
o A more advanced ontology language (this is the ontology language used in MESUR).
RDF, RDFS
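[Figure: an RDF graph drawn in two layers. Ontology layer: ex:isEating has rdfs:domain ex:Human and rdfs:range ex:Food. Instance layer: ex:marko (rdf:type ex:Human) is connected by ex:isEating to ex:cookie (rdf:type ex:Food).]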
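The rdfs:domain and rdfs:range statements on this slide are what let a reasoner type the instance data. As a rough Python sketch of the two standard RDFS entailment rules (a subject takes the property's domain class, an object takes its range class), with the function name `infer_types` being hypothetical and each property assumed to have a single domain and a single range:

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"


def infer_types(triples):
    """Apply the RDFS domain and range rules once; return only the new rdf:type triples."""
    # Simplifying assumption: at most one domain and one range per property.
    domains = {s: o for s, p, o in triples if p == RDFS_DOMAIN}
    ranges = {s: o for s, p, o in triples if p == RDFS_RANGE}
    inferred = set()
    for s, p, o in triples:
        if p in domains:
            inferred.add((s, RDF_TYPE, domains[p]))
        if p in ranges:
            inferred.add((o, RDF_TYPE, ranges[p]))
    return inferred - set(triples)


ontology = {
    ("ex:isEating", RDFS_DOMAIN, "ex:Human"),
    ("ex:isEating", RDFS_RANGE, "ex:Food"),
}
instance = {("ex:marko", "ex:isEating", "ex:cookie")}

new_facts = infer_types(ontology | instance)
```

Starting from the single instance triple, the sketch derives that ex:marko is an ex:Human and ex:cookie is an ex:Food, matching the rdf:type edges shown on the slide.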
RDF, RDFS, OWL
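[Figure: an RDF graph drawn in two layers. Ontology layer: ex:hasOwner has rdfs:domain ex:Pet and rdfs:range ex:Human; ex:Pet is rdfs:subClassOf a blank node _:0123 of rdf:type owl:Restriction, with owl:onProperty ex:hasOwner and owl:maxCardinality “1”. Instance layer: ex:fluffy (rdf:type ex:Pet) is connected by ex:hasOwner to ex:marko (rdf:type ex:Human); a second ex:hasOwner edge points to ex:bob.]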
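One way to read the owl:maxCardinality restriction on this slide is as a limit on how many ex:hasOwner values a pet may have. The Python sketch below takes that closed-world reading and simply flags subjects that exceed the limit. It assumes the second ex:hasOwner edge starts at ex:fluffy, and the function name `over_limit` is hypothetical. An OWL reasoner behaves differently: without a unique-name assumption it would conclude that ex:marko and ex:bob denote the same individual unless they are declared different.

```python
from collections import defaultdict


def over_limit(triples, prop, max_cardinality):
    """Return subjects that have more than max_cardinality distinct objects for prop."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: sorted(objs) for s, objs in values.items() if len(objs) > max_cardinality}


# Assumed instance data: both owner edges start at ex:fluffy.
instance = [
    ("ex:fluffy", "ex:hasOwner", "ex:marko"),
    ("ex:fluffy", "ex:hasOwner", "ex:bob"),
]

conflicts = over_limit(instance, "ex:hasOwner", 1)
```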
The Triple Store
• The triple store is to semantic networks what the relational database is to data tables.
• Storing and querying triples in a triple store
• SPARQL query language
o like SQL, but for triple stores
SELECT ?a ?c
WHERE
( ?a type human )
( ?a wrote ?b )
( ?b type article )
( ?c wrote ?b )
( ?c type human )
( ?a != ?c )
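To make the coauthor query on this slide concrete, here is a small Python sketch of how a conjunction of triple patterns can be evaluated by binding variables one pattern at a time. It is a naive nested-loop matcher written for illustration, not how a production triple store or SPARQL engine works; the function names and the toy data are assumptions.

```python
def is_var(term):
    """Variables are written with a leading question mark, as in the slide."""
    return term.startswith("?")


def match(pattern, triple, bindings):
    """Try to extend bindings so that pattern equals triple; return None on failure."""
    result = dict(bindings)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or fail if it is already bound to something else.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def query(triples, patterns):
    """Evaluate a conjunction of triple patterns with a nested-loop join."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


TRIPLES = [
    ("marko", "type", "human"),
    ("johan", "type", "human"),
    ("marko", "wrote", "practical_ontology"),
    ("johan", "wrote", "practical_ontology"),
    ("practical_ontology", "type", "article"),
]
PATTERNS = [
    ("?a", "type", "human"),
    ("?a", "wrote", "?b"),
    ("?b", "type", "article"),
    ("?c", "wrote", "?b"),
    ("?c", "type", "human"),
]

# The final ( ?a != ?c ) condition is applied as a filter over the solutions.
coauthors = sorted(
    (s["?a"], s["?c"]) for s in query(TRIPLES, PATTERNS) if s["?a"] != s["?c"]
)
```

The result lists each coauthor pair in both directions, since nothing in the query orders ?a and ?c.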
Modeling the scholarly community
The Problem of Scale
• High-end triple stores reasonably support 1+ billion triples.
• The MESUR solution is to not include all artifact metadata in the triple store.
• MESUR leverages relational database and triple store technology.
o The triple store is for relationships.
o The relational database is for metadata.
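The split described above can be sketched in Python with a list of triples standing in for the triple store and an in-memory SQLite table standing in for the relational database. The table layout, the `urn:example:` article URIs, the `publishedIn` predicate, and the titles are all hypothetical; only the division of labor (relationships in triples, metadata in rows, joined by URI) comes from the slide.

```python
import sqlite3

# Relationships only: which article belongs to which journal (hypothetical URIs).
TRIPLES = [
    ("urn:example:article:1", "publishedIn", "urn:issn:0028-0836"),
    ("urn:example:article:2", "publishedIn", "urn:issn:0028-0836"),
    ("urn:example:article:3", "publishedIn", "urn:issn:1082-9873"),
]


def make_metadata_db():
    """Create an in-memory relational table holding per-artifact metadata."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE artifact (uri TEXT PRIMARY KEY, title TEXT)")
    db.executemany(
        "INSERT INTO artifact VALUES (?, ?)",
        [
            ("urn:example:article:1", "Hypothetical title one"),
            ("urn:example:article:2", "Hypothetical title two"),
            ("urn:example:article:3", "Hypothetical title three"),
        ],
    )
    return db


def titles_in_journal(db, triples, journal_uri):
    """Resolve the relationship in the triples, then fetch metadata by URI."""
    uris = sorted(s for s, p, o in triples if p == "publishedIn" and o == journal_uri)
    titles = []
    for uri in uris:
        row = db.execute("SELECT title FROM artifact WHERE uri = ?", (uri,)).fetchone()
        if row is not None:
            titles.append(row[0])
    return titles


db = make_metadata_db()
titles = titles_in_journal(db, TRIPLES, "urn:issn:0028-0836")
```

Keeping titles and similar literals out of the graph means the triple count grows with the number of relationships rather than with every metadata field, which is the point of the coupling.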
Relational Database & Triple Store
The MESUR Class Hierarchy
The Context Classes
Inspired by OntologyX: http://www.ontologyx.com
The Publishes Context
The Uses Context
Practical applications of the model
Analysis Algorithms
• ISI Impact Factor
• Usage Impact Factor
o Bollen, J., Van de Sompel, H., “Usage Impact Factor: The Effects of Sample Characteristics on Usage-based Impact Metrics”, [in review], 2007.
• H-Index
o Hirsch, J.E., “An index to quantify an individual's scientific research output”, Proceedings of the National Academy of Sciences, 102:46, 2005.
• Y-Factor
o Bollen, J., Rodriguez, M.A., Van de Sompel, H., “Journal Status”, Scientometrics, 69:3, 2006.
• Other social network metrics
o Eccentricity, Betweenness, Closeness, PageRank, …
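Of the metrics listed, the H-Index is the simplest to state in code: an author has index h if h of their papers have at least h citations each. The Python sketch below computes it from a plain list of per-paper citation counts; how those counts would be pulled out of the MESUR model is not shown, and the function name is an assumption.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            # Counts are sorted in descending order, so no later paper can qualify.
            break
    return h


example = h_index([10, 8, 5, 4, 3])
```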
Journal Citation and Usage
Calculating the 2007 Impact Factor
SELECT ?x
WHERE
( ?x rdf:type mesur:Citation )
( ?x mesur:hasSource ?a )
( ?x mesur:hasSink urn:issn:0028-0836 )
( ?x mesur:hasSourceTime ?u ) AND (?u == 2007)
( ?x mesur:hasSinkTime ?t ) AND (?t > 2004 AND ?t < 2007)

SELECT ?y
WHERE
( ?y rdf:type mesur:Publishes )
( ?y mesur:hasGroup urn:issn:0028-0836 )
( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)

INSERT < _123 rdf:type mesur:ImpactFactor >
INSERT < _123 mesur:hasObject urn:issn:0028-0836 >
INSERT < _123 mesur:hasStartTime 2007 >
INSERT < _123 mesur:hasEndTime 2007 >
INSERT < _123 mesur:hasNumbericValue (COUNT(?x) / COUNT(?y)) >
The 2007 impact factor of journal A is the total number of citations to articles published in A in 2005 and 2006 from articles published in 2007 (in any journal B), divided by the total number of articles published by journal A in 2005 and 2006.
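The same calculation can be sketched in Python over plain records. The two dataclasses loosely mirror the properties used in the queries (source, sink, and their times for a citation; group, unit, and time for a publication), but their exact shape, the toy data, and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source: str       # citing journal
    sink: str         # cited journal
    source_time: int  # publication year of the citing article
    sink_time: int    # publication year of the cited article


@dataclass(frozen=True)
class Publishes:
    group: str  # journal
    unit: str   # article
    time: int   # publication year


def impact_factor(citations, publications, journal, year):
    """Citations made in `year` to the journal's previous two years, per article published."""
    window = (year - 2, year - 1)
    cites = sum(
        1 for c in citations
        if c.sink == journal and c.source_time == year and c.sink_time in window
    )
    items = sum(1 for p in publications if p.group == journal and p.time in window)
    if items == 0:
        raise ValueError("journal published nothing in the two-year window")
    return cites / items


J = "urn:issn:0028-0836"
publications = [Publishes(J, "a1", 2005), Publishes(J, "a2", 2006), Publishes(J, "a3", 2003)]
citations = [
    Citation("other", J, 2007, 2005),
    Citation("other", J, 2007, 2006),
    Citation(J, J, 2007, 2006),
    Citation("other", J, 2006, 2005),  # excluded: citing article is not from 2007
    Citation("other", J, 2007, 2004),  # excluded: cited article is outside the window
]

value = impact_factor(citations, publications, J, 2007)
```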
Calculating the 2007 Usage Impact Factor
SELECT ?x
WHERE
( ?x rdf:type mesur:Uses )
( ?x mesur:hasUnit ?a )
( ?x mesur:hasGroup ?b )
( ?b mesur:partOf urn:issn:1082-9873 )
( ?x mesur:hasTime ?t ) AND (?t == 2007)
( ?y rdf:type mesur:Publishes )
( ?y mesur:hasUnit ?a )
( ?y mesur:hasTime ?u ) AND (?u > 2004 AND ?u < 2007)

SELECT ?y
WHERE
( ?y rdf:type mesur:Publishes )
( ?y mesur:hasGroup ?a )
( ?a mesur:partOf urn:issn:1082-9873 )
( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)

INSERT < _123 rdf:type mesur:UsageImpactFactor >
INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
INSERT < _123 mesur:hasStartTime 2007 >
INSERT < _123 mesur:hasEndTime 2007 >
INSERT < _123 mesur:hasNumbericValue (COUNT(?x) / COUNT(?y)) >
The 2007 usage impact factor of journal A is the total number of 2007 usage events of articles published in A in 2005 and 2006 divided by the total number of articles published by journal A in 2005 and 2006.
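As a rough Python counterpart to the definition above, the sketch below counts usage events in the target year for articles the journal published in the previous two years, then divides by the number of those articles. The input shapes (a list of (article, year) usage events and a dictionary from article to (journal, publication year)) and the toy data are assumptions, not the MESUR data model.

```python
def usage_impact_factor(uses, published, journal, year):
    """Usage events in `year` on the journal's previous two years of articles, per article."""
    window = {year - 2, year - 1}
    eligible = {
        article
        for article, (article_journal, pub_year) in published.items()
        if article_journal == journal and pub_year in window
    }
    if not eligible:
        raise ValueError("journal published nothing in the two-year window")
    events = sum(1 for article, used_in in uses if used_in == year and article in eligible)
    return events / len(eligible)


J = "urn:issn:1082-9873"
published = {
    "a1": (J, 2005),
    "a2": (J, 2006),
    "a3": (J, 2003),        # too old to count
    "a4": ("other", 2006),  # different journal
}
uses = [
    ("a1", 2007), ("a1", 2007), ("a1", 2007),
    ("a2", 2007),
    ("a3", 2007),
    ("a4", 2007),
    ("a2", 2006),  # excluded: usage event is not from 2007
]

value = usage_impact_factor(uses, published, J, 2007)
```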
Conclusion
Contributions
• Uniting RDF/Semantic Web technology with scholarly modeling.
o Semantic network model of the scholarly community.
• Architectural set-up supports a massive data set.
o Triple-store/relational database coupling.
• Open ontology for the scholarly community.
o Open source ontology to represent many aspects of the scholarly communication process including publication, citation, and usage.
Some Related Publications
Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007
Marko A. Rodriguez. Grammar-based random walkers in semantic networks. (LAUR-06-7791)
Marko A. Rodriguez and Jennifer H. Watkins. Grammar-based geodesics in semantic networks. (LAUR-07-4042)
Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (arxiv.org: cs.DL/0610154)
Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030)
Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.
Questions
MESUR is at http://www.mesur.org
MESUR ontology is at http://www.mesur.org/schemas/2007-01/mesur/
Many thanks to the Andrew W. Mellon Foundation for their support