Digital Research Reports

Discovery and Analysis of Global Research Trends Using GRID

The Global Research Identifier Database

Martin Szomszor and Jonathan Adams

FEBRUARY 2017

Introduction

In this Digital Research Report, we explore the question of “location” and, in particular, how we consistently identify the organisations that host and facilitate the research process. Using a corpus of sample research articles from the leading open-access journal PLOS ONE, we demonstrate how much attribution to research organisations varies in author affiliations, how this kind of data can be aligned with a reference dataset (GRID), and the potential of analyses resulting from the data integration.

Research is a global operation sustained by the collaborative efforts of a variety of research organisations. They coordinate funding and activity to generate a wealth of research outputs and achieve societal and economic benefits through the impact of their research. Universities lead this effort alongside hospitals, government institutions, funding bodies, and private companies in a complex interplay that transcends national, political, and cultural borders (Adams, 2013). In recent years, open science has catalysed greater international collaboration, creating a more open research environment where data can be shared more freely and research outputs are readily accessible (Adams, 2017).

Growth in international collaboration has been supported and mirrored in research information management (Porter, 2016). Bibliographic information is supplemented by an increasing variety of research data that offers glimpses into many more aspects of the typical research lifecycle. Funding databases (such as Dimensions from ÜberResearch), data sharing platforms (for example, Figshare, Dryad), collaborative authoring tools (Overleaf and ShareLaTeX), discovery applications (Readcube), institutional repositories (ePrints, DSpace), and attention tracking mechanisms (Altmetric, Plum Analytics) operate alongside open standards (ORCID, Crossref) to create a rich network of information about research activities.

Insight into these research activities requires careful data coordination. In academia, the axes by which these activities are universally structured are: date (when the research happened), subject (in what field the research occurred), individual (who conducted the research), and location (where the research happened). While the first of these is relatively straightforward to model, the latter three are non-trivial.

To address the issue of location, Henderson (2007) shows how publishers require a ‘single schema’ to feed into the publishing pipeline. Current solutions are outlined by DeRidder (2011) and MacEwan et al. (2013), with Ferguson et al. (2015) and Bilder et al. (2016) providing a comprehensive review evaluated against use cases common to researchers, publishers, funders and research organisations.

Solutions include the International Standard Name Identifier (ISNI, www.isni.org), the Ringgold Identify Database (www.ringgold.com), and Orgref (www.orgref.org). While ISNI's database is large, its coverage is not research focussed because it includes records for both individuals and organisations. Ringgold is a proprietary product with a database accessible under a commercial license. Orgref has global scope, but the dataset is small compared to the others and its coverage does not extend to sectors such as national research networks and private companies.

" Stakeholder groups have recognised their own need for a reference database of research organisations for aggregation purposes."

About the Authors

Martin Szomszor is Consultant Data Scientist at Digital Science. Previously Head of Data Science and founder of the Global Research Identifier Database (GRID), Martin has worked on a range of research metrics projects, applying his extensive knowledge of machine learning, data integration, and visualisation techniques to uncover novel insights and inform the academic research lifecycle. He was previously Deputy Head of Centre at the City eHealth Research Centre (2009-2011) where he led research on the use of social media for epidemic intelligence and was Chair of the 4th International Conference on Electronic Healthcare for the 21st Century. Martin was also a Research Fellow at the University of Southampton (2006-2009) where he worked on various Linked Data, Semantic Web, and Social Network analysis projects. Martin has a BSc and PhD in Computer Science.

http://orcid.org/0000-0003-0347-3527

Jonathan Adams joined Digital Science as Chief Scientist in October 2013. Previously he was the lead founder of Evidence Ltd (2000-2009) and Director of Research Evaluation for Thomson Reuters (2009-2013). Jonathan led the 2008 review of research evaluation in New Zealand and was a member of the Australian Research Council (ARC) indicators development group for its research excellence assessment (ERA). In 2004 he chaired the EC Evaluation Monitoring Committee for Framework Programme 6. In 2006 he chaired the Monitoring Group of the European Research Fund for Coal & Steel. In 2010 he was an Expert Advisor to the interim evaluation of the EU's 7th Framework Programme for Research (FP7).

http://orcid.org/0000-0002-0325-4431

About Digital Science

Digital Science is a technology company serving the needs of research. At the centre of our mission is the support of researchers within research institutions, funding bodies, publishers and governments, for whom we provide a range of software, content and consultancy solutions. We believe passionately that tomorrow’s research will be different and better than today’s. Visit www.digital-science.com

About GRID

The Global Research Identifier Database (GRID) is a free (CC0), manually curated catalogue of research organisations designed to support disambiguation, integration and analysis of data associated with research activities and research outputs. The GRID database can be downloaded at www.grid.ac/downloads in a variety of formats including JSON, CSV, and RDF. Services to automatically disambiguate data using bespoke matching algorithms that process structured and unstructured data are also available. For more information, please email [email protected]

About Consultancy

Our Consultancy team delivers custom reporting and analysis to help you make better decisions faster. With in-depth knowledge of the historical and current research ecosystem, our unique perspective helps get the most value from data on the research lifecycle. Our team of data scientists are experts in using innovative analytical techniques to develop revealing visualisations and powerful insights. We understand the changing research landscape, and we can help you develop an evidence base on which to build the best research management and policy decisions.

Acknowledgements

This report has been published by Digital Science, which is operated by global media company the Holtzbrinck Publishing Group.

The Campus, 4 Crinan Street, London N1 9XW

[email protected]

Copyright © 2017 Digital Science

ISBN 978-0-9956245-2-8

The cumulative totals for all affiliations and for unique affiliations have been growing steadily in line with the increasing number of published articles. This has been observed by a number of analysts (Caroline Wagner, Ohio State Univ., pers. comm.). To measure the growth of author affiliations over time in the PLOS ONE data set, we process each article in date order, extracting and normalising each affiliation string (removing punctuation, replacing diacritic characters and translating to lowercase), tracking total and unique occurrences for each month (Figure 1).
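The normalisation and counting steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the report: the function names, the `(month, affiliation)` input shape and the choice to strip only ASCII punctuation are all assumptions.

```python
import string
import unicodedata


def normalise_affiliation(text):
    """Lowercase, replace diacritics with base letters, strip ASCII punctuation."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces so adjacent words do not fuse together.
    table = str.maketrans({p: " " for p in string.punctuation})
    cleaned = no_marks.translate(table).lower()
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())


def cumulative_counts(records):
    """records: iterable of (month, affiliation) pairs, month as "YYYY-MM".

    Returns {month: (cumulative_total, cumulative_unique)} in date order.
    """
    seen = set()
    total = 0
    result = {}
    for month, affiliation in sorted(records, key=lambda r: r[0]):
        total += 1
        seen.add(normalise_affiliation(affiliation))
        # Later records in the same month overwrite the running values.
        result[month] = (total, len(seen))
    return result
```

With this sketch, "University of Oxford" and "university of oxford." collapse to the same normalised string, so they add two to the total count but only one to the unique count.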

For reference, the right axis shows the number of unique organisations for which affiliations were ultimately matched in GRID. Notably, this count shows signs of tailing off in the last two years. This may indicate that the variation is not so much a reflection of the total number of organisations referenced, but is more likely due to a consistent churn in how authors express their affiliation. Such turnover would, of course, disrupt any analysis that does not draw on a continuously refreshed address database.

To reveal the impact of this variation, we calculate the number of previously unseen affiliations in each month and plot this as a percentage of the existing stock over time. One might expect this to rapidly stabilise at a point that reflects the churn rate of institutions mentioned. However, this is not the case, even for a relatively large dataset. Instead, the percentage of new affiliations seen each month reduces only very slowly over time, with no indication of reaching an equilibrium (Figure 2).
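As a rough sketch of this calculation, assuming that "existing stock" means the set of unique normalised affiliations seen in all earlier months (the function name and input shape are hypothetical):

```python
def new_affiliation_rate(monthly):
    """monthly: list of (month, set_of_normalised_affiliations) in date order.

    Returns a list of (month, percent_new), where percent_new is the number
    of affiliations not seen in any earlier month, expressed as a percentage
    of the stock accumulated before that month. The first month is skipped
    because there is no earlier stock to compare against.
    """
    stock = set()
    rates = []
    for month, affiliations in monthly:
        unseen = set(affiliations) - stock
        if stock:
            rates.append((month, 100.0 * len(unseen) / len(stock)))
        stock |= unseen
    return rates
```

If the set of affiliation variants were finite and stable, this percentage would fall quickly towards zero; a slow decline is what indicates continuing churn.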

Global Research Identifier Database

Digital Science conceived the Global Research Identifier Database (GRID, grid.ac; Szomszor and Mori, 2016) in the first place to support its own software and database development needs. It became clear that other analysts and the wider community could benefit from Digital Science’s approach and an open data model seemed most appropriate.

GRID is a manually curated catalogue of research organisations that supports disambiguation, integration and analysis of data associated with research activities and research outputs. The use of persistent identifiers supports unambiguous attribution even when names, locations and organisational structures change over time.

A key component of GRID is a well-structured set of policies to ensure consistency across a wide range of organisation types. These cover how to name organisations, how to categorise them by sector, where to locate them, and how they relate to other research organisations. GRID also models locale-specific structures, such as university systems, large national research networks, and multinational companies.

GRID provides automatic disambiguation algorithms based on a combination of geographic entity recognition and a large database of manually curated mappings. Interfaces are provided to process data that is structured (for example tabular data labelled by name, city, and country) and unstructured (such as author affiliations published in journal articles).

GRID is made available under a CC0 license, enabling others to reuse the database for commercial applications without attribution. New datasets are made available on a monthly basis and provided in a variety of formats including JSON, CSV, and RDF. Access to disambiguation services can be provided for both commercial and research purposes.

Analysis Dataset

To demonstrate the value of consistent modelling of research organisations and the potential analysis benefit, we make use of the research articles published in the leading Open Access journal PLOS ONE from the Public Library of Science (www.plos.org). We choose this particular data set because it covers a broad range of disciplines and subjects, it has a high volume (173K articles since it started in 2006) and it provides full-text content in a structured format that enables simple extraction of author affiliations.

Automatic disambiguation was performed on all author affiliations (unstructured text), and fractional counts were calculated for each GRID organisation referenced. Fractional count is the division of credit to organisations (or countries as appropriate) based on the number of authors that list them in their article affiliations. The total credit available for each article (1.0) is divided equally between contributing authors. For example, an article with two authors from A, one author from B, and one author from C would have the following fractional counts: A[0.5], B[0.25], C[0.25].
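The worked example above can be reproduced with a short sketch. Two details are assumptions rather than statements from the report: an author who lists several organisations has their share split equally between them, and an author with no matched organisation leaves their share unallocated. The function name and input shape are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_counts(author_affiliations):
    """author_affiliations: one list of organisation IDs per author.

    Each article carries a total credit of 1, divided equally between its
    authors. Assumption: an author's share is split equally across the
    distinct organisations they list, and an empty list (no matched
    organisation) leaves that author's share unallocated.
    """
    counts = defaultdict(Fraction)
    if not author_affiliations:
        return {}
    author_share = Fraction(1, len(author_affiliations))
    for orgs in author_affiliations:
        distinct = set(orgs)
        for org in distinct:
            counts[org] += author_share / len(distinct)
    return dict(counts)


# Two authors from A, one from B, one from C.
example = fractional_counts([["A"], ["A"], ["B"], ["C"]])
print({org: float(share) for org, share in example.items()})
```

Exact fractions are used so that per-article credits can be summed over a large corpus without accumulating floating-point error; they can be converted to floats for reporting.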

" GRID has clear policies, persistent identifiers and automatic disambiguation available under a CC0 license."

" The number of unique affiliations has been growing steadily in line with the increasing number of published articles."

Figure 2. The percentage of author address affiliations in PLOS ONE articles for each month that had not been seen in previous months

GRID Statistics

• 71,727 research organisations from 220 countries (Release 2017-01-31)

• 100% with links to GeoNames cities

• 95% with an institutional type assigned

• 74% with exact geographic coordinates

• CC0 license

PLOS ONE Statistics

• 172,897 articles from 2006-2016

• 1,395,211 author affiliations

• 1,061,019 matched to GRID (76%)

• 86.8% of fractional count allocated to a GRID organisation

Figure 1. Cumulative totals for the observed number of affiliations in PLOS ONE articles since 2006 (total count - solid green line; unique count - dashed green line) and the number of unique GRID organisations to which these were matched (red line)

The sudden increase in early 2015 in Figure 2 is an anomaly caused by large variations in the number of articles published each month (Dec 2014: 2,176, Jan 2015: 637, Feb 2015: 1,541, and Mar 2015: 3,354).

Closer inspection of the affiliation strings reveals the underlying cause of the growth in Figure 2. Authors typically reference departments, schools, faculties and other organisational elements that are not stable over time. Further, there is little editorial process applied within an institution, so individuals are free to express their affiliation in any way that suits them, often with little conformity even to others in the same units. Some choose to include organisational sub-units, others do not. Some write separate affiliations if they have joint appointments (for example a medical school and teaching hospital) while others overload a single affiliation and mention both. This variability is a key argument for using a dynamically updated reference database such as GRID that incorporates matching technology.

Collaboration Profiles

To visualise the difference in collaboration patterns between countries, we create a series of collaboration profile plots for Brazil, China, Germany and Australia. In each plot, all institutions from the country are assigned a position and size. The size is proportional to the total fractional count assigned to the institution, and the position is determined by two metrics:

• Reach (x-axis) - The average proportion of domestic (-1) and foreign (+1) collaborators

• Collaboration (y-axis) - The average number of collaborators

The mean value for each metric is also indicated with a dotted line along with mini plots showing how these metrics vary over time (Figure 3).
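The report does not give exact formulas for the two metrics, so the following is only one plausible reading. It assumes that Reach is computed per article as (foreign minus domestic) divided by the number of collaborating organisations and then averaged, that Collaboration is the mean number of collaborating organisations per article, and that articles with no collaborators are ignored. All names are hypothetical.

```python
from statistics import mean


def collaboration_profile(org, country_of, articles):
    """Return (reach, collaboration) for one organisation, or None.

    org        -- organisation ID
    country_of -- dict mapping organisation ID to a country code
    articles   -- list of sets of organisation IDs, one set per article

    Assumption: only articles on which `org` has at least one partner
    contribute to either metric.
    """
    home = country_of[org]
    reach_scores = []
    partner_counts = []
    for orgs in articles:
        if org not in orgs:
            continue
        partners = set(orgs) - {org}
        if not partners:
            continue
        foreign = sum(1 for p in partners if country_of[p] != home)
        domestic = len(partners) - foreign
        # -1 means all partners are domestic, +1 means all are foreign.
        reach_scores.append((foreign - domestic) / len(partners))
        partner_counts.append(len(partners))
    if not reach_scores:
        return None
    return mean(reach_scores), mean(partner_counts)
```

Under this reading, an institution with one purely domestic article and one purely foreign article sits at the centre of the x-axis, which matches the description of the axis as running from domestic (-1) to foreign (+1).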

These plots provide intriguing insights not only about collaboration, but also about author behaviour. Chinese organisations are concentrated more towards the lower left, meaning they are less collaborative, particularly with foreign institutions, than are organisations in Germany or Australia. While international reach has remained stable for Germany and Australia over time (see mini plots), both China and Brazil have seen a slow drop. This would suggest that during the early years of PLOS ONE, Brazil and China published more through fringe collaborations with ‘foreign’ organisations. Over time, their average geographical reach for PLOS ONE articles dropped as more and more of their domestic research was published in this journal.

Collaboration Mapping

The GRID database is not just a reference and disambiguation set but also provides extensive coverage of geographic metadata. By linking the locations of all organisations to the free worldwide geographical database GeoNames (www.geonames.org), GRID makes it possible to introduce a wealth of free topographical data to other research analyses, extending management information.

Country Collaboration Types

Unilateral 109,019

Bilateral 31,952

Trilateral 6,699

Multilateral 2,436
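The report does not define these four labels. Assuming they refer to the number of distinct countries represented in an article's matched affiliations (one, two, three, and four or more), the classification could be sketched as follows; the function name and the use of country codes are illustrative.

```python
from collections import Counter


def collaboration_type(countries):
    """Classify an article by the number of distinct countries it involves.

    Assumption: 1 country is Unilateral, 2 Bilateral, 3 Trilateral,
    and 4 or more Multilateral.
    """
    n = len(set(countries))
    if n == 0:
        raise ValueError("article has no matched country")
    labels = {1: "Unilateral", 2: "Bilateral", 3: "Trilateral"}
    return labels.get(n, "Multilateral")


# Tally a handful of made-up articles, one country code per affiliation.
sample = [["GB"], ["GB", "GB"], ["GB", "US"], ["BR", "CN", "DE"], ["AU", "BR", "CN", "DE"]]
tally = Counter(collaboration_type(c) for c in sample)
```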

" Chinese organisations are concentrated more towards the lower left meaning they are less collaborative, particularly with foreign institutions, than are organisations in Germany or Australia."

Figure 4 shows the global collaboration network derived from the complete set of PLOS ONE articles. All organisations with a fractional count >= 1 are shown. Node size is proportional to the fractional count awarded to the organisation, and each is coloured according to continent. Country names and continents are derived from links between GRID and GeoNames. Edges link organisations that have collaborated on articles, where the edge weight is proportional to the shared fractional count.

By aligning GeoNames administrative regions with their corresponding spatial features in the free public domain map dataset from Natural Earth (www.naturalearthdata.com), it is straightforward to create a map showing publication intensity by region (Figure 5). This is particularly useful when analysing publication data from the USA (Figure 5a) where individual States often produce as many research outputs as entire countries elsewhere in the world. The geographical spread across the US States is complementary to the spread across, for example, the European Union.

" Growth in affiliations is driven by constant reorganisation."

Figure 3. Collaboration profiles for Brazil, China, Germany and Australia. Mini plots show how the Collaboration and Reach metrics evolve over time. Those appearing to the left (tending to -1) collaborate more domestically, and those appearing to the right (tending to +1) collaborate more with foreign organisations. The position on the y axis is determined by the total number of organisations that it collaborates with, irrespective of country. So, for example, an organisation that is highly collaborative internationally will tend to the upper right quadrant.

Sector Comparison

The GRID database has an organisational typology with eight categories: archive, company, education, facility, government, healthcare, nonprofit, and other. Many different entity types exist within these groupings (for example: education includes universities, colleges, and schools; healthcare includes hospitals, clinics and foundation trusts). These entities are defined and managed inconsistently between countries, so precision in the typology is challenging at a fine-grained level. A policy decision was therefore made to group these entities at a relatively high level of abstraction for the purposes of global consistency and practicality. The selection of the stated types modelled in GRID was based on a range of research metrics use-cases and these are in line with contemporary analytical requirements.

Filtering the global collaboration network (Figure 4) to include only collaboration between particular types yields fine-grained insights. For example, this approach can be used to identify research precincts (cities where research intensive universities draw on the strengths of their nearest neighbours; Mcgilvray, 2016). By filtering the set of PLOS ONE articles to only those where authors from education and healthcare collaborate, it is possible to draw out a ranking of universities according to how much research they publish with hospitals. Table 1 is a selection of eight leading education entities with information on the geographic distribution of their healthcare collaborators (see mini plots). Their 10 most frequent healthcare partners (right table) are ordered by the total fractional publication count shared with the collaborator.

Table 1 highlights the dependency of universities on local collaborators for effective medical research. In all cases, the hospitals with which they collaborate most frequently are also the nearest. The specific distribution varies: sometimes a single collaborator is responsible for the majority of research outputs (e.g. Karolinska Institute); in other cases the collaboration is more evenly split (e.g. University of Melbourne). Since most GRID organisations have a precise location (specified using latitude and longitude), calculating the actual distances is straightforward.
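One common way to compute such distances from latitude and longitude is the haversine great-circle formula, sketched below. The report does not say which distance measure was used, so this is an assumption; the function name and the spherical Earth radius are illustrative choices.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of latitude along a meridian is roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
```

For the local band discussed in this section (under 10 km) the spherical approximation is more than adequate; an ellipsoidal model would only matter for much finer comparisons.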

A scan of the charts showing the geographic distribution of collaborators makes it apparent that, while collaboration is consistently high locally (<10 km), collaboration beyond that is largely dependent on geography. For example, Australia has far fewer countries within a 1,000 km radius than any European country. A more subtle proxy involving both proximity and nearest neighbour analysis would have to be developed to make a meaningful comparison related to the separation between collaborators.

Number of Institutes by Type (Release 2017-01-31)

Company 18,917

Education 17,212

Healthcare 9,948

Nonprofit 6,908

Facility 5,478

Government 4,129

Other 4,056

Archive 1,999

" GRID highlights clusters of research dependency."

Data Sources for GRID

Journal Articles

Conference Proceedings

Book Chapters

Books

Monographs

Editorial Board memberships

Grants

Patents

Clinical Trials

Datasets

Discussion

The GRID technology has enabled us to check almost 1.5 million variant author address affiliations on PLOS ONE articles and match more than three-quarters of these to around 14,000 verifiable research organisations across the globe.

Remarkably, the number of variant author addresses continues to proliferate and the rate at which new affiliations are added to the database is declining only slowly. This shows both the scale of the problem for research analysts and the benefits of a well-structured global solution focussing on research organisations. Manual processing of this volume of constantly renewed variants is infeasible, so matching to a well-curated database is essential. Of course, the adoption of address conventions (a location ID) would also solve the problem but a global standard for this does not yet exist.

GRID provides clear and consistent geographical metadata, which allows rapid analysis of, for example, domestic and international collaboration patterns and their trajectory. For the PLOS ONE dataset, this confirms previously reported national differences but it also shows the evolution of PLOS ONE authorship. For China, earlier papers were driven by international co-authorship whereas recent papers are generated by a more domestic authorship. This reflects a shift in the engagement of China’s researchers with the opportunities that PLOS ONE provides.

Global collaboration networks have been subject to much scrutiny, notably by Caroline Wagner at Ohio State University and Loet Leydesdorff in Amsterdam (Wagner et al., 2017). As our understanding of the evolving collaboration network improves, engagement in local, regional, national and global collaboration will become a more important strategic topic. A better understanding of how to attach to the network, develop new links, and exploit existing relationships could lead to new policy developments. This analysis of PLOS ONE articles demonstrates how GRID can be used as a fundamental building block to support this kind of analysis.

GRID applications are not limited to metadata from publications. The article-address analysis can, in practice, be seen as a relatively clean training set, despite its variety and continuing expansion. It provides the basis for working up the reference data that can then be exploited with the even more variable address information in other forms of publication, such as conference proceedings and monographs, and other forms of data such as research grants, patents and clinical trials as well as datasets themselves.

In a more open research environment, the need for a system that allows research location to be unequivocally and consistently identified becomes increasingly important. The provenance of research outcomes, whether publications or source data, will be essential in verification. Without a clear understanding of where the evidence has come from, how can reliance be placed upon the research claims?

References

Adams, J. (2013). Collaborations: The fourth age of research. Nature 497, 557–560. doi:10.1038/497557a

Adams, J. (2017). Research in an Open, Global Landscape. Pp 149-170, in, New Languages and Landscapes of Higher Education, ed. P. Scott, J. Gallacher and G. Parry. Oxford University Press. ISBN 987872082

Bilder, G., Brown, J., & Demeranville, T. (2016) Organisation identifiers: current provider survey. Available from https://orcid.org/sites/default/files/ckfinder/userfiles/files/20161031%20OrgIDProviderSurvey.pdf

DeRidder, J. (2011) Improving the Information Supply Chain with Standard Institutional Identifiers. Information Standards Quarterly, 22(3), 26-29.

Ferguson, N., Moore, R. & Schmoller, S. (2015) Review of selected organisational IDs and development of use cases for the Jisc. CASRAI-UK Organisational Identifiers Working Group.

Henderson, H. (2007) Institutional identifiers and the Journal Supply Chain Efficiency Improvement Pilot. Serials 20(3), 180-183.

MacEwan, A., Angjeli, A., & Gatenby, J. (2013) The International Standard Name Identifier (ISNI): The Evolving Future of Name Authority Control. Cataloging & Classification Quarterly, 51(1-3), 55-71. doi: 10.1080/01639374.2012.730601

Mcgilvray, A. (2016) Sydney & Melbourne: A tale of two cities. Nature. 538 (7626), S58-S65. doi:10.1038/538S58a

Porter, S. (2016) Digital Science White Paper: A New ‘Research Data Mechanics’. doi:10.6084/m9.figshare.3514859.v1

Szomszor, M. and Mori, A. (2016) The Global Research Identifier Database GRID – Persistent IDs for the World’s Research Organisations. Proceedings of the 21st International Conference on Science and Technology Indicators, València, Spain.

Wagner, C.S., Whetsell, T.A. & Leydesdorff, L. (2017) Scientometrics, 110, 1633. doi:10.1007/s11192-016-2230-9
