
  • Cinzia Daraio

    July 2015

    EUR 27389 EN

    The Added Value of the European Map of Excellence and Specialization (EMES) for R&I Policy Making

  • EUROPEAN COMMISSION

    Directorate-General for Research and Innovation, Directorate A (Policy Development and Coordination), Unit A6 (Science Policy, Foresight and Data)

    Contact: Emanuele Barbarossa, Katarzyna Bitka

    E-mail: [email protected]

    [email protected]

    [email protected]

    [email protected]

    European Commission B-1049 Brussels


    The Added Value of the European Map of Excellence and Specialization (EMES)

    for R&I Policy Making

    Cinzia Daraio

    The document is based on projects carried out by the ONTORES research group at Sapienza University of Rome and on the Smart.CI.EU project (Sapienza microdata architecture for education, research and technology studies: a competence-based data infrastructure on European universities). The contributions of Marco Angelini, Alessandro Daraio, Flavia di Costa, Maurizio Lenzerini, Claudio Leporelli, Henk F. Moed, Gabriele Petrotta, and Giuseppe Santucci are gratefully acknowledged.

    Directorate-General for Research and Innovation
    Research, Innovation, and Science Policy Experts High Level Group
    2015
    EUR 27389 EN

  • LEGAL NOTICE

    This document has been prepared for the European Commission; however, it reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

    More information on the European Union is available on the internet (http://europa.eu).

    Luxembourg: Publications Office of the European Union, 2015.

    ISBN 978-92-79-50354-2

    doi: 10.2777/553985

    ISSN 1831-9424

    © European Union, 2015. Reproduction is authorised provided the source is acknowledged.

    EUROPE DIRECT is a service to help you find answers

    to your questions about the European Union

    Freephone number (*): 00 800 6 7 8 9 10 11

    (*) The information given is free, as are most calls (though some operators, phone boxes or hotels may charge you)


    Table of contents

    EXECUTIVE SUMMARY .................................................................................................... 5


    TABLE OF CONTENTS ..................................................................................................... 3

    1. INTRODUCTION AND CONTENT OF THE STUDY ........................................................... 12

    Introduction ........................................................................................................ 12

    Content of the study ............................................................................................ 12

    2. POLICY RELEVANCE OF AN EMES FOR R&I POLICY MAKING .......................................... 14

    3. DEFINING CRITERIA FOR EMES ................................................................................. 17

    4. GEO-REFERENCING INFORMATION ON EUROPEAN UNIVERSITIES ................................. 17

    Drawbacks and limitations: multi-site institutions .................................................... 19

    5. INTEGRATING BIBLIOMETRIC DATA AT THE LEVEL OF INDIVIDUAL UNIVERSITIES .......... 19

    State of the art .................................................................................................... 19

    SCImago Institutions Rankings ...................................................... 20

    Global Research Benchmarking System ......................................................... 20

    Leiden Ranking ........................................................................................... 21

    Altmetrics, webometrics and other complementary information ........................ 22

    Coverage of the European university landscape .............................................. 24

    6. LOCATING PUBLICATIONS OF UNIVERSITIES AND PROS ON A GEOGRAPHIC MAP ........... 25

    Towards an authority file for PROs ......................................................................... 26

    Breakdown by discipline........................................................................................ 26

    7. TOWARDS A EUROPEAN MAP OF EXCELLENCE AND SPECIALIZATION............................. 28

    Geo-referencing data on publications...................................................................... 28

    Integrating information from other projects: the case of U-Multirank .......................... 31

    Integrating other socio-economic indicators ............................................................ 32

    Feasibility of selected indicators ............................................................................. 34

    8. CONCORDANCE TABLES OF DIFFERENT SUBJECT CLASSIFICATION SYSTEMS ................. 40

    Introduction ........................................................................................................ 40

    Results from a survey ........................................................................................... 41

    Approaches and systems developed in the past. Correspondence tables between International Patent Classification (IPC) and Fields of Science (FoS); and between IPC and industrial classification..................................................................... 43

    Correspondence tables between Fields of Education (FoE) and Fields of Science (FoS) .. 43

    Correspondence tables from the Eumida project ............................................. 43

    Correspondence tables from the ETER Project ................................................ 45

    Conclusions and recommendations ......................................................................... 47

    9. VISUAL ANALYTICS FOR A PILOT EMES ...................................................................... 47

    General Design .................................................................................................... 48

    Proof-of-concept prototypal application ................................................................... 51

    10. ASSESSMENT ........................................................................................................ 55

    11. EXPLORATION OF POSSIBLE BUSINESS MODELS AND BUDGET ................................... 56

    1. Model Supported by the European Commission .................................................... 56

    2. Public-private sponsorship Model........................................................................ 57

    3. Science 2.0 Model ............................................................................................ 57

    Linking data in an open platform ............................................................................ 57

    Automation and maintenance of the infrastructural data system ................................ 58

    A real options approach to estimate the investment in an OBDM approach .................. 60

    An estimate of the needed budget.......................................................................... 61


    12. RECOMMENDATIONS .............................................................................................. 62

    REFERENCES .............................................................................................................. 65

    APPENDICES ............................................................................................................... 68

    Appendix 1: Authority file of European universities ................................................... 68

    Appendix 2: Concordance tables .......................................................................... 101

    Appendix 3: Possible User Groups ........................................................................ 107


    EXECUTIVE SUMMARY

    This study examines the feasibility of constructing a European Map of Excellence and Specialization (EMES) by offering a proof of concept and illustrating its potential for policy making.

    The term 'Map of Excellence and Specialization' refers to a geographical information system (GIS) that combines and georeferences information from various sources at different geographic scales (Nomenclature of Units for Territorial Statistics (NUTS) levels 2 and possibly 3) and provides indicators intended for policy use.
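    In practice, geo-referencing an institution means resolving its coordinates to the NUTS code of the region that contains them. A minimal sketch of this step follows; the region polygons below are simplified illustrative rectangles, whereas a real implementation would use the official Eurostat NUTS boundary files.

```python
# Resolve an institution's coordinates to a NUTS-2 region code.
# The region polygons here are illustrative rectangles; a real system
# would load the official Eurostat NUTS boundary files instead.

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon (list of (x, y))?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Two NUTS-2 regions approximated by bounding rectangles of (lon, lat) points.
NUTS2_REGIONS = {
    "ITI4": [(11.7, 41.4), (13.3, 41.4), (13.3, 42.3), (11.7, 42.3)],  # around Rome
    "NL33": [(4.0, 51.8), (4.8, 51.8), (4.8, 52.3), (4.0, 52.3)],      # around Leiden
}

def georeference(lon, lat):
    """Return the NUTS-2 code of the region containing the point, or None."""
    for code, polygon in NUTS2_REGIONS.items():
        if point_in_polygon(lon, lat, polygon):
            return code
    return None
```

    The same lookup, run against the full NUTS-3 boundary set, would yield the finer-grained codes where available.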

    Drawing a Map of Excellence and Specialization entails many challenges: actors in the European Science and Technology (S&T) field are heterogeneous (for example, universities and Public Research Organisations (PROs)), their output is composite (for example, education, publications, and patents), and the location of their activities is not fully disclosed.

    A number of S&T policy decisions depend on assumptions about the impact of public expenditure on national, regional or local variables such as employment, productivity, and growth. These assumptions are rarely based on sound empirical evidence, however. Furthermore, they tend to ignore the magnitude of knowledge spillovers and tend to assume a simplistic view of agglomeration.

    What is needed is a robust empirical base at geographic level in which data on knowledge production are firmly established. Such a base would then offer the opportunity to integrate other data using the same unit of reference at geographic level.

    The current study also explores the policy implications of using such a Map of Excellence and Specialization.

    The main criteria for assessing the successful implementation of the EMES have been identified as:

    Availability of data on publications (an adequate economic and legal framework for the use of commercial data on publications, including sources, commercial conditions, and updates over a medium-to-long period);

    Standardization (consideration of existing standards in the science, technology and higher-education funding fields and adequate solutions to import relevant standards, e.g. ORCID, CERIF, euroCRIS, CASRAI), openness, and interoperability with other available data sources;

    Compliance with state-of-the-art data quality techniques;

    Continuity (adequate organizational solutions for the continuous update, maintenance and improvement of the map);

    Extensibility and scalability (explicit solutions to make the map suitable for the future integration of new actors, new or updated data sources, and indicators);

    Expertise in the access and analysis of publicly available data; interactivity (the ability of the map to support the automatic generation of new indicators, with explicit solutions for controlling the statistical properties of new indicators and their potential misuse); and usability (the reader should be able to explore the map without entering into technical details);

    Existence of concordance tables among different subject classifications.

    On the basis of the study carried out, and taking into account the established criteria for assessing the feasibility of the EMES, we suggest that the Commission proceed with a full-scale study for the realization of the European Map of Excellence and Specialization.

    The EMES should be designed following an Ontology-Based Data Management (OBDM) approach, to ensure a sustainable, up-to-date map that remains interoperable and extensible over time.

    The Map should integrate in a GIS-format at least the following groups of indicators:

    structural indicators for higher education institutions (HEIs)


    structural indicators for Public Research Organisations (PROs)

    publications of HEIs and PROs

    patents assigned to HEIs and PROs

    academic staff at HEIs

    undergraduate students

    PhD students

    undergraduate degrees

    PhD degrees.

    All this information should be geo-referenced and supported by extensive metadata.

    The data should have a breakdown by discipline (Field of Science, or subject categories) and by Field of Education.
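    Such a breakdown presupposes a concordance between the education and science classifications (treated in Section 8). As a minimal sketch, a concordance table can be stored as a many-to-many mapping; the ISCED-F narrow-field codes and Frascati FoS labels below are an illustrative subset, not an official table.

```python
# Illustrative concordance between Fields of Education (ISCED-F narrow
# fields) and Fields of Science (Frascati FoS). A real table would come
# from the official correspondence work described in Section 8.

FOE_TO_FOS = {
    "054": ["1.1 Mathematics"],                        # Mathematics and statistics
    "061": ["1.2 Computer and information sciences"],  # ICTs
    "053": ["1.3 Physical sciences", "1.4 Chemical sciences",
            "1.5 Earth and related environmental sciences"],
    "091": ["3.1 Basic medicine", "3.2 Clinical medicine",
            "3.3 Health sciences"],                    # Health
}

def fos_for_education_field(foe_code):
    """Return the Fields of Science associated with an ISCED-F code."""
    return FOE_TO_FOS.get(foe_code, [])
```

    Because the mapping is many-to-many, indicators broken down by Field of Education can only be approximated from Field of Science data (and vice versa), which is why the study treats concordance tables as a criterion in their own right.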

    In addition, data should be integrated with relevant indicators at regional (NUTS 2 and, where possible, NUTS 3) level. These should include industrial, employment, GDP, social, and demographic data.

    The proposed Map should specify the procedures for the updating of data, offering solutions for the automatic update as frequently as possible.

    The proposed Map should also demonstrate the sustainability of the organization or business model in the future, by addressing issues such as provision of commercially available data, cost of update, IPRs, standardization and robustness issues.

    With respect to higher education institutions, the census established by the ETER project, funded by DG Education and Culture in collaboration with DG Research and EUROSTAT, should be adopted as the official source. Consequently, the ID system proposed by the ETER project should be used as a reference in all documents. The project should provide the list of all affiliation names, and their possible variations, found in publication affiliations.

    With respect to Public Research Organisations, the project should issue a similar ID system, organized in a hierarchical way as suggested by the ontology model. For instance:

    organization name at the country level (e.g. Max Planck Society, CNRS, or CSIC)

    first-tier sub-organization name (e.g. institute, or department)

    second-tier organization name (e.g. institute within a department, laboratory within an institute)

    bottom level organization (e.g. research group, research team, laboratory).
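    A sketch of how such a hierarchical ID system might be represented, with a code built by concatenating a suffix at each level of the hierarchy (the codes and sub-organization names below are invented for illustration):

```python
# Illustrative hierarchical ID scheme for PROs, mirroring the levels
# described above. All codes and sub-organization names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class OrgNode:
    code: str                 # e.g. "FR-CNRS" or "FR-CNRS.017.002"
    name: str
    children: list = field(default_factory=list)

    def add_child(self, suffix, name):
        """Create a sub-organization whose code extends the parent's code."""
        child = OrgNode(f"{self.code}.{suffix}", name)
        self.children.append(child)
        return child

# Country-level organization (top of the hierarchy).
cnrs = OrgNode("FR-CNRS", "Centre national de la recherche scientifique")
# First-tier sub-organization: an institute (illustrative).
institute = cnrs.add_child("017", "Institute of Mathematical Sciences (illustrative)")
# Second-tier: a laboratory within the institute (illustrative).
lab = institute.add_child("002", "Mathematics Laboratory (illustrative)")
```

    The stable country-level code would belong to the Authority File, while the lower-level suffixes could be revised as the first- and second-tier lists are cleaned over time.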

    The system should aim at maintaining stability at the level of the organization name, establishing a permanent list as a standard reference. This would be an Authority File to be maintained at official level.

    With respect to first-tier and second-tier organizations, the system should provide a reliable mapping structure that can be managed automatically. This means that the system should provide a full list of all possible names and abbreviations, in all possible combinations, so that they can be automatically matched to publication data. Each occurrence should be unambiguously related to the Authority File.
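    A minimal sketch of such automatic matching: affiliation strings are normalized and looked up in a table of known variants, each pointing to a single Authority File entry (the variants and the ID format below are invented examples, not actual ETER identifiers):

```python
# Sketch of automatic matching of affiliation strings to Authority File
# entries: normalize the string, then look it up in a variant table.
# The variant strings and the "NL0056" ID are invented examples.

import re

AUTHORITY = {
    "univ leiden": "NL0056",
    "leiden university": "NL0056",
    "universiteit leiden": "NL0056",
    "dept astronomy univ leiden": "NL0056",
}

def normalize(affiliation):
    """Lowercase, strip punctuation, collapse whitespace."""
    s = re.sub(r"[^\w\s]", " ", affiliation.lower())
    return re.sub(r"\s+", " ", s).strip()

def match(affiliation):
    """Return the Authority File ID for an affiliation string, or None."""
    return AUTHORITY.get(normalize(affiliation))
```

    In a production system the variant table would be generated from the full list of names and abbreviations described above, and unmatched strings would be queued for manual curation rather than silently dropped.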

    The system must specify the procedures by which the lists of first-tier and second-tier names are updated, corrected, and cleaned over time.1 It must also specify under which conditions the list of first-tier names could eventually become an Authority File in its own right, becoming an official and stable source. The same should be examined for the second-tier list of names.

    1 One should distinguish two approaches in which the Authority File can be used in the affiliation de-duplication process. In the first, the Authority File is used to assign author-affiliation strings in scientific publications to organization names. For instance, if articles from Leiden carry the affiliation "Dept Astronomy" while the name "University of Leiden" is missing from those strings, a rich Authority File containing information on first- and second-tier sub-organizations indicates that the University of Leiden does contain a Department of Astronomy, so that these affiliation strings should be assigned to the University of Leiden. But this does not necessarily mean that one can obtain a reliable estimate of the publication output of the Department of Astronomy at Leiden by counting only articles containing "Dept Astronomy" in their affiliation data, since there may be many papers from this department that omit it. The reason is that the affiliation information in scientific publications is often too

    In order to supervise the development of the project, a Steering group should be formed, involving

    DG Research

    DG Education

    DG REGIO

    Eurostat

    OECD

    European Parliament, Science and Technology Options Assessment (STOA).

    The Steering Group might meet regularly in order to review the development of the project and to define and refine requirements for the construction of indicators, on the basis of their own needs.

    The Commission might also consider whether to invite, separately, ERC and EIT representatives, as well as members of national governments.

    In parallel, a Committee representing the PROs should be created. It should include at least one representative for each of the largest PROs in Europe (i.e. Max Planck, Leibniz and Helmholtz in Germany; CNRS, INSERM, INRA, INRIA and CEA in France; CSIC in Spain; CNR and INFN in Italy) and a number of representatives from other PROs.

    This Committee should meet regularly to supervise the development of the analysis aimed at the geo-referencing of scientific publications of PROs. The Committee should validate the allocation of specific research outputs to teams, laboratories or institutes that can be located geographically. Provisions for fractional allocation should be examined and validated.

    Meanwhile, other issues should be discussed (e.g. collection of data on patents) for future activities of indicator construction.

    All future calls for research projects of the Commission should include a mandatory provision requiring full coverage of ORCID identifiers for all researchers involved.

    In all future documentation the standards established in the project and/or available at international level should be adopted:

    ID numbers of HEIs

    ID numbers of PROs

    CERIF ID numbers of funding agencies

    ORCID ID numbers for researchers

    In future studies commissioned by the Commission there should be a provision for establishing linkages with the platform produced under the project. In particular, data should be delivered to the Commission in such a way that they can be integrated seamlessly into the platform. This requires that the ontology model developed as the basis of the data integration process become a standard reference point.

    In addition, the ontology suggested by the project should be published, and an interactive consultation with producers and users of indicators should be opened. After a given period, the ontology should be published in an official version and should become a standard reference point. Future releases should be published in due time.

    incomplete or inaccurate to achieve de-duplication at lower levels. The recall of such a process tends to be low, even though the precision may be high. This problem is even worse for PROs than for academic institutions.


    The feasibility study has shown the enormous potential for reliable, effective and efficient construction of indicators provided by the creation of Authority Files.

    An Authority File is an authoritative source established in the form of a list, associated with complete definitions and rules for inclusion and exclusion, and with explicit rules for updating over time. These processes could, in principle, be executed automatically.

    Once an Authority File is established, the same information propagates through the entire information system without ambiguity.

    This means that the same piece of information gets more value, since it is appropriately used in many contexts.

    In the context of this study the following Authority Files should be established:

    official list of Higher Education Institutions

    official list of Public Research Organisations

    author ID

    publication ID

    funding agency ID

    The official list of HEIs is available under ETER; we recommend its adoption as a standard.

    The official list of PROs is to be constructed under a dedicated project.

    With respect to the IDs, we recommend that the Commission actively support all international efforts to establish and maintain standards such as ORCID and CERIF. The Commission might issue a document asking Member States to adopt these standards in their administrative activities.


    1. INTRODUCTION AND CONTENT OF THE STUDY

    Introduction

    This study examines the feasibility of constructing a European Map of Excellence and Specialization (EMES) by offering a proof of concept and illustrating its potential for policy making.

    The term 'Map of Excellence and Specialization' refers to a geographical information system (GIS) that combines and georeferences information from various sources at different geographic scales (Nomenclature of Units for Territorial Statistics (NUTS) levels 2 and 3)1 and provides indicators intended for policy use.

    Drawing an EMES entails many challenges: actors in the European Science and Technology (S&T) field are heterogeneous (for example, universities and Public Research Organisations (PROs)), their output is composite (for example, education, publications, and patents), and the location of their activities is not fully disclosed.

    A number of S&T policy decisions depend on assumptions about the impact of public expenditure on national, regional or local variables such as employment, productivity, and growth. These assumptions are rarely based on sound empirical evidence, however. Furthermore, they tend to ignore the magnitude of knowledge spillovers and tend to assume a simplistic view of agglomeration.

    What is needed is a robust empirical base at geographic level in which data on knowledge production are firmly established. Such a base would then offer the opportunity to integrate other data using the same unit of reference at geographic level.

The current study also explores the policy implications of using such a Map of Excellence and Specialization.

    Content of the study

The main analyses carried out in this study are reported below.

Firstly, the study examines the feasibility of geo-referencing information pertaining to excellence in S&T at the NUTS 2 and NUTS 3 levels for European universities, based on the results of the ETER (European Tertiary Education Register) study. Where data are missing, the latest EUMIDA data should be used. The task is based on those higher education institutions that deliver the PhD degree (i.e. universities).

Secondly, the study examines the feasibility of integrating data on scientific publications at the level of individual universities, with a breakdown by discipline, employing the most recent data and using the following indicators as examples: number of publications; number of citations; percentage of publications in top journals; percentage of citations from top journals; percentage of publications with international collaboration.

Thirdly, the study examines the feasibility of locating publications of universities on a geographic map of Europe, by integrating all data coming from the various universities at NUTS 2 and NUTS 3 level. The feasibility assessment is based on data on all European universities, as defined by the EUMIDA and ETER studies. With respect to Public Research Organisations (PROs), the study discusses the list extracted from the affiliations of publications in commercial databases and its comparison with the list currently maintained by DG RTD, in light of the experience gathered from previous research projects carried out at Sapienza. The study also offers a proof of concept by building up a sample of regions and/or universities and locating them on a GIS computer platform.


Fourthly, the study analyses how other socio-economic indicators could be integrated in the GIS in order to build up a Map of Excellence. It explores the feasibility of integrating official statistics, including population, economic, industrial and infrastructure statistics, in the GIS at the appropriate level of granularity.

Fifthly, in order to prepare for the future integration of data on the specialisation of European countries and regions, the study reports on the state of the art in the literature regarding the correspondence between different classifications in S&T and industrial statistics. In particular, it extensively examines the correspondences between Fields of Education (FoE) and Fields of Science (FoS), between FoS and the International Patent Classification (IPC), and between the IPC and industrial classifications.

    In discussing the feasibility of a European Map of Excellence and Specialization (EMES), the indicators reported in Box 1 have been taken into account.

    BOX 1: Indicators considered in the feasibility study on the EMES.

    Specialization indicators

    - Revealed Scientific Advantage of regions (NUTS 2), normalized both at EU and country level;

- Position of NUTS 2 or possibly NUTS 3 territory in European ranking by discipline, based on the number of publications, both in absolute terms and normalized against socio-economic indicators, e.g. population;

- Position of NUTS 2 or possibly NUTS 3 territory in European ranking by discipline, based on the number of citations (including derived indicators, such as the share of publications among the 1% most cited).

    Excellence indicators

    - Composite indicator including Number of publications, Number of citations, and Percentage of publications and citations in top journals.

    Critical mass indicators

Indicators stating whether the territory (NUTS 2 or possibly NUTS 3) has or has not reached a given threshold of publications in given disciplines, using the indicators developed by the EC as test cases.

    Research productivity indicators

    - Number of publications per unit of academic staff

    - Number of excellent publications per unit of academic staff

    - Number of citations per unit of academic staff

    - Number of excellent citations per unit of academic staff

    Research intensity indicators

- Number of publications per 1,000 inhabitants

- Number of publications per million euro GDP

- Number of citations per 1,000 inhabitants

- Number of citations per million euro GDP


    Research Excellence indicators

- Number of excellent publications per 1,000 inhabitants

- Number of excellent publications per million euro GDP

- Number of excellent citations per 1,000 inhabitants

- Number of excellent citations per million euro GDP
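The Revealed Scientific Advantage listed in Box 1 is conventionally computed as a region's share of publications in a field divided by the corresponding EU-wide share; values above 1 indicate relative specialization. A minimal sketch follows; the publication counts are invented for illustration.

```python
# Hedged sketch of a Revealed Scientific Advantage (RSA) indicator:
# (region's share of field f) / (EU's share of field f).
# All counts below are hypothetical.

region_pubs = {"physics": 200, "economics": 50}      # one NUTS 2 region
eu_pubs = {"physics": 10_000, "economics": 10_000}   # EU reference totals

def rsa(region, reference, field):
    """RSA > 1 means the region is relatively specialized in the field."""
    region_share = region[field] / sum(region.values())
    reference_share = reference[field] / sum(reference.values())
    return region_share / reference_share

physics_rsa = rsa(region_pubs, eu_pubs, "physics")  # (200/250) / (10000/20000) = 1.6
```

The same function, with country totals as the reference, gives the country-level normalization also mentioned in Box 1.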

The study unfolds as follows. Section 2 illustrates the relevance of the EMES for policy making in the research and innovation sector in Europe. Section 3 introduces the criteria for feasibility, which are assessed in Section 10. Section 4 describes the feasibility and limitations of geo-referencing universities' activities on a European map. Section 5 deals with the localization of bibliometric data, taking into account three example projects and discussing problems and further issues for a full-scale exercise. Sections 6 and 7 further analyse the steps towards a multidimensional EMES, for which a proof-of-concept prototype application is described in Section 9. Section 8 reviews the literature on concordance tables for research and innovation analysis. Section 10 summarizes the assessment of the feasibility of an EMES, while Section 11 explores possible business models for a sustainable EMES over time and outlines a budget for a full-scale project on the realization of the EMES. The study closes with a set of recommendations for future developments, reported in Section 12. Three appendices complete the study.

2. POLICY RELEVANCE OF AN EMES FOR R&I POLICY MAKING

    The main goal of this feasibility study is to illustrate the potential of a platform that allows advanced levels of integration, analysis, automation, and scalability of indicators in the field of science, technology and growth at European level.

    The needs underlying the concept of such a platform can be described as follows:

- integration of data from heterogeneous sources;

- levels of aggregation (geographic, institutional, disciplinary);

- updating and scalability.

From a policy point of view, there are a number of relevant issues that require a novel approach to data management. Consider, for example, the following.

    Research investment, innovation and growth in laggard regions

There is growing evidence that investment in R&D funded by public resources is not a sufficient condition for the catching-up effort of laggard regions in Europe, in both Southern and Eastern countries. The key concept here is complementarity: investment in R&D only produces effects on growth if the specialization in science matches the identification of niches of technological specialization, and if both are supported by complementary investment in education and human capital. While the theoretical arguments underlying this point are clear, there is a lack of data allowing the integration, at regional or even urban scale, of indicators on research, technology, education, and growth, with a breakdown by discipline, field of technology, and industry.

An important pillar of European policies therefore rests on weak grounds from an empirical point of view.

    Excellence of European universities

There is a recurring debate in European countries, fuelled year after year by the publication of university rankings. Why is the overall volume of European publications roughly comparable to, or even larger than, that of the US, while at the same time so few European universities are able to compete at the top? And does it really matter to be in the top league, or is it only a matter of visibility and prestige?

Although, thanks to recent projects on the matter, microdata at the level of individual universities can be obtained from public sources (i.e. ETER for Higher Education Institutions), they are not integrated with other data on inputs and complementary outputs (publications, patents, third-mission indicators). Thus, the debate rests on aggregate notions of excellence, or the lack thereof, without helping progress in policy making.

We argue that the current debate is incorrectly framed.

    First of all, the meaning of excellence should be clearly defined and formally conceptualized in order to derive a coherent set of indicators able to monitor its evolution and its dynamic changes over time.

Secondly, excellence should be contextualized: it depends on the missions of the institutions (teaching, research, third mission), which are often complementary and/or rival, on the external environmental context, and on the comparison set, and again it can change and evolve over time.

In order to address the issue of the excellence of European universities on a sound empirical base, there is a need to develop a long-lasting data infrastructure, interoperable with existing available data sources, that can be updated and extended over time without having to start from scratch each time a new policy need appears.

    Role of Public Research Organisations

The European landscape benefits from a type of large actor that is missing in other countries or has a different mission there, namely large PROs. Their aggregate role in the production of science and technology is well known, yet there is a lack of empirical evidence regarding the complementarity between PROs and universities and their impact on regional and national competitiveness.

In all these examples (and other issues could be added) there is a distinct need for the integration of heterogeneous data from various sources, without having to start a new study, and a new database, each time a policy issue appears.

General relevance of data integration and platform for science of science policy

Clearly, the information needs and the analytic goals of the research community interested in the science of science are of a historical nature. The ability to compare research institutions in their evolution over time is of paramount importance. This is best, and most economically, obtained if some institutions, or networks of institutions, take charge of maintaining a certified repository of the relevant data.

In other words, science of science should be included among the domains that national and regional statistical offices cover in their regular surveys and census activities, using methodologies that maximize the ability to understand, use, and compare data. Moreover, these institutions should publish not only statistical summaries but also microdata sources. In particular, the administrative databases of research institutions could become a privileged source of research-oriented information.

The social benefits of building such an infrastructure greatly exceed its social costs.

In this study we argue that, in order to design an EMES as a long-lasting data infrastructure, an Ontology-Based Data Management (OBDM) approach (Poggi et al., 2008; Lenzerini, 2011, 2015a; Daraio, Lenzerini et al., 2015) should be followed.

    The main idea of the OBDM approach is the organization of a three-level architecture, constituted by:

The ontology: a conceptual, formal description of the domain of interest (expressed in terms of relevant concepts, attributes of concepts, relationships between concepts, and logical assertions characterizing the domain knowledge).

The sources: the repositories, accessible by the organization, where data concerning the domain are stored. In the general case, such repositories are numerous and heterogeneous, each managed and maintained independently of the others.


The mapping: a precise specification of the correspondence between the data contained in the data sources and the elements of the ontology.

    An illustration follows in the next figure.

    Figure 1. An illustration of the basic idea of the OBDM approach.

    Source: Daraio (2015).

    The main purpose of an OBDM approach is to allow information consumers to query the data using the elements in the ontology as predicates. It can be seen as a form of information integration, where the usual global schema is replaced by the conceptual model of the application domain, formulated as an ontology expressed in a logic-based language.

    The main advantages of an OBDM approach are:

    Users can access the data by using the elements of the ontology.

    By making the representation of the domain explicit, we gain re-usability of the acquired knowledge.

The mapping layer explicitly specifies the relationships between the domain concepts and the data sources. It is also useful for documentation and standardization purposes.

Flexibility of the system: you do not have to merge and integrate all the data sources at once, which could be extremely costly.

    Extensibility of the system: you can incrementally add new data sources or new elements (ability to follow the incremental understanding of the domain) when they become available.
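The three-level idea can be sketched very compactly. The fragment below is only a toy illustration of the architecture, not an OBDM implementation: the concept, the two source schemas, and all field names are hypothetical.

```python
# Toy sketch of the three-level OBDM architecture: an ontology concept,
# two heterogeneous sources, and a mapping layer between them.

# Ontology level: the concept "University" with its attributes.
ONTOLOGY = {"University": ["name", "nuts2_region"]}

# Source level: two independently maintained repositories with different schemas.
source_a = [{"inst_name": "Univ A", "region_code": "ITC4"}]
source_b = [{"denomination": "Univ B", "nuts2": "FR10"}]

# Mapping level: each source declares how its fields correspond to ontology attributes.
mappings = [
    (source_a, {"inst_name": "name", "region_code": "nuts2_region"}),
    (source_b, {"denomination": "name", "nuts2": "nuts2_region"}),
]

def query_universities():
    """Answer a query posed in ontology terms, not in source terms."""
    results = []
    for source, field_map in mappings:
        for record in source:
            results.append({field_map[k]: v for k, v in record.items() if k in field_map})
    return results

unis = query_universities()
```

Note that adding a third source only requires a new entry in the mapping layer, which is precisely the extensibility property claimed above.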


    3. DEFINING CRITERIA FOR EMES

The study reported here is based on the competences, previous experience and current research projects of the University of Rome La Sapienza, as well as on the existing literature on the related topics.

    The current study can be seen as a proof-of-concept, needed in order to verify the feasibility and provide further insights for a full scale exercise.

    In the following, we report the proposed criteria to assess the feasibility of the EMES.

- Availability of data on publications (adequate economic and legal framework for the use of commercial data on publications, including sources, commercial conditions and updates for a medium-long period of time);

- Standardization (consideration of existing standards in the science, technology and higher education funding fields and adequate solutions to import relevant standards, e.g. ORCID, CERIF, EUROCRIS, CASRAI, etc.), openness and interoperability with other available data sources;

- Compliance with state-of-the-art data quality techniques;

- Continuity (adequate organizational solutions for the continuous update, maintenance and improvement of the map);

- Extension and scalability (explicit solutions to make the map suitable for the future integration of new actors, new or updated data sources, and indicators);

- Expertise in the access and analysis of publicly available data; interactivity (ability of the map to allow the automatic generation of new indicators, with explicit solutions for controlling the statistical properties of new indicators and their potential misuse); and usability (the reader should be able to explore the map without entering into technical details);

- Existence of concordance tables among different subject classifications.

    4. GEO-REFERENCING INFORMATION ON EUROPEAN UNIVERSITIES

The first analytical step in evaluating the feasibility of geo-referencing information pertaining to excellence in S&T consists in locating the activities of universities in European regions.

Recent efforts toward building a European tertiary education register, promoted by the European Commission, have ensured a big advance towards the possibility of geo-referencing information pertaining to the excellence in S&T of European universities. We specifically refer to the results of two projects promoted by the European Commission:

    EUMIDA - Feasibility Study for Creating a European University Data Collection (Contract No. RTD/C/C4/2009/0233402) completed in 2010;

    ETER - European Tertiary Education Register (Contract No. EAC2013038) completed in July 2015. It will be continued up to 2017 with the project Implement and Disseminate the European Tertiary Education Register (Contract No. EAC-2015-0280).

EUMIDA demonstrated the feasibility of collecting data at the Higher Education Institution (HEI) level, covering both input (staff and finance) and output (education, research) dimensions, along with a set of descriptors allowing for a more precise profiling at European level (institution category, legal status, foundation, region of establishment, etc.).

ETER further extended the data collection coverage to all ERA countries, adding at the same time more details and variable breakdowns.

Focusing our attention on the information useful for a systematic process of geo-referencing, the situation is as follows:

ETER published the results of the first data collection (reference year 2011) in 2014 (eter.joanneum.at/imdas-eter). The dataset includes the perimeter (list of HEIs with ID code and name) for all 36 ERA countries and data for 29 of the 36 countries. Within the European Union, information on Hungary, Romania and Slovenia is missing. Among other variables, ETER includes the following geographical information:

Country of establishment: the country where the institution is established (i.e. where the institution develops most of its activities, for example where the largest part of the staff is located, even if this is not the legal seat of the institution). Official ISO 3166 country codes are used to identify the country;

Region of establishment (NUTS2 and NUTS3): the region where the institution's main seat is located. The official NUTS classification (Nomenclature of Territorial Units for Statistics) is adopted. This information is not reported when an HEI's activities are distributed over more than one NUTS3 region, so that no main seat can be identified;

    City: the name in English of the city/town where the main seat and most of the activities are located;

Postal code: the postal code of the official address of the HEI (a postcode system is not in force in Ireland);

    Geographic coordinates: longitudes and latitudes, based on the postcode of the official address.

EUMIDA published the results of the DC1 collection in 2011 (reference year 2008) and covered the whole European Union (except Denmark and France) plus Norway and Switzerland. The only geographical information collected was the region of establishment at NUTS2 level.

In order to test the feasibility of the geo-referencing process, we created an inventory list of higher education institutions that deliver the PhD degree in the European Union and in the EFTA countries (Iceland, Liechtenstein, Norway, Switzerland) by integrating the information provided by the two projects and filling in the few remaining information gaps. In detail:

retrieval of the most recent information contained in ETER (reference year 2011) for the majority of the countries concerned by this study; Hungary, Romania and Slovenia are missing;

retrieval of the missing data for Hungary and Romania from EUMIDA DC1 (reference year 2008), integrated with ad hoc online investigation. The list of institutions has been retrieved from ETER in order to be up to date and consistent with the other countries;

retrieval of the missing data for Slovenia (reference year 2011) from Sapienza internal resources and ad hoc online investigation, on the basis of the list of institutions contained in ETER;

revision and updating of the NUTS codes to the latest version in order to allow for full interoperability with the Eurostat regional database (see below).
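The fill-in logic described above can be sketched as a simple merge: keep the ETER list as the perimeter and complete missing fields from a fallback source. The record structures, ID codes and values below are hypothetical.

```python
# Sketch of completing the inventory list: ETER defines the perimeter,
# and gaps are filled from EUMIDA (or ad hoc sources). Data are invented.

eter = {
    "IT0001": {"name": "Univ X", "nuts3": "ITI43", "city": "Rome"},
    "HU0001": {"name": "Univ Y", "nuts3": None, "city": None},  # missing in ETER
}
eumida = {
    "HU0001": {"nuts3": "HU101", "city": "Budapest"},
}

def build_authority_file(perimeter, fallback):
    """Keep the perimeter list; fill None fields from the fallback source."""
    out = {}
    for code, rec in perimeter.items():
        merged = dict(rec)
        for field, value in merged.items():
            if value is None and code in fallback:
                merged[field] = fallback[code].get(field)
        out[code] = merged
    return out

authority = build_authority_file(eter, eumida)
```

The same pattern extends to a second fallback (e.g. ad hoc online investigation) by applying the function again with another source.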

The file lists 1,131 institutions, which represent roughly 50% of all institutions contained in the European tertiary education register (the remaining ones are institutions not awarding doctoral (ISCED 8) degrees). For each institution the file reports the following information:

    ETER code

    the institution name in national language and in English

    the NUTS codes at level 2 and 3

    the name of the city

Additional geographical information (postcode, GIS coordinates) is available in the ETER database and can easily be integrated when relevant, thanks to the interoperability of the datasets ensured by the use of a unique ID code.

With the information contained in ETER it is possible to locate each HEI within its city on a map of Europe. GIS coordinates have been calculated from postal codes automatically from Google Maps through the website http://www.doogal.co.uk/BatchGeocoding.php. It should be borne in mind that the postal code and address refer to an institution's main seat and that activities are usually spread over a number of different buildings with different addresses. The information reported therefore cannot be used for micro-localization within a single city or neighbourhood, but rather for a broader localization at regional, national and European level.
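The workflow above (batch-geocode the postcodes, then attach the coordinates to the HEI records) can be sketched as a simple lookup join. The postcode table and records below are invented; in a real run the table would come from the batch geocoding service's export.

```python
# Sketch of attaching batch-geocoded coordinates to HEI records.
# Postcodes, IDs and coordinates are hypothetical.

postcode_coords = {
    "00185": (41.901, 12.501),   # e.g. a Rome postcode
    "20123": (45.464, 9.180),    # e.g. a Milan postcode
}

heis = [
    {"eter_id": "IT0001", "postcode": "00185"},
    {"eter_id": "IT0002", "postcode": "20123"},
    {"eter_id": "XX0001", "postcode": "99999"},  # no match: left unlocated
]

def attach_coordinates(records, coords):
    """Add (lat, lon) to each record whose postcode is in the lookup table."""
    located = []
    for rec in records:
        rec = dict(rec)
        rec["coords"] = coords.get(rec["postcode"])  # None if postcode unknown
        located.append(rec)
    return located

geo = attach_coordinates(heis, postcode_coords)
```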


    Drawbacks and limitations: multi-site institutions

In principle, the location of universities within NUTS2 regions is quite straightforward, given that multi-site institutions with activities spreading across regional borders are not very widespread. Nevertheless, almost 10% of the HEIs in the Authority File have secondary branches located in two or more regions (very rarely abroad).

This share increases to 24% at the NUTS3 level. The share of multi-site institutions is actually increasing around Europe as a consequence of two opposite phenomena: the spread of universities' activities from the original seat into the surrounding region (decentralisation and wider regional coverage), and institutional concentration through the merger of small institutions and the creation of larger and more comprehensive HEIs, which usually maintain, at least partially, the original seats and locations in different cities and regions.

At this stage it is not possible to disentangle information for multi-site institutions, and all of a university's activities are located in the region of the main seat. This could create a slight distortion, especially when data are analysed by disciplinary area, since all functions in a disciplinary area may be located in a secondary campus outside the region (e.g. the medical department of Università Cattolica del Sacro Cuore is located in Rome (ITI43) while the main seat is in Milan (ITC4): in the map of excellence the figures of the medical department will be attributed to the ITC4 region instead of ITI43).

    In ETER multi-site institutions are treated in two different ways depending on their typology:

multi-site HEIs at national level (usually different establishments in different regions within one country) are treated as a unique HEI and no disaggregated data are collected. A dummy variable for multi-site institutions is collected and the NUTS3 codes of the local branches are requested;

international multi-site HEIs (secondary branches abroad): foreign campuses, consistently with UOE guidelines, are treated as self-standing HEIs in the country where they are established. In this case data are disaggregated at the level of the campus (although the inclusion/exclusion of figures in the parent country is not always easy to verify). There are only two such cases in ETER: Webster University Vienna Private University in Austria, and the Branch of the University of Bialystok 'Faculty of Economics and Informatics' in Lithuania.

This step of the analysis confirms the possibility of locating universities' activities at regional level across Europe, with the limitations related to multi-site institutions recalled above.

    The Authority File is attached to this report (Appendix 1).

5. INTEGRATING BIBLIOMETRIC DATA AT THE LEVEL OF INDIVIDUAL UNIVERSITIES

    State of the art

In recent years several projects related to the use of the bibliometric output of universities (and other research-performing organisations) have been launched. Generally speaking, they have analysed the research output (i.e. publications) of a sample of institutions around the world in order to publish a benchmark and/or a ranking of their performance.

We briefly describe below three examples that we consider the most representative for the European context (the Scimago Institutions Rankings (SI Ranking), the Global Research Benchmarking System (GRBS), and the Leiden Ranking) in order to gain some insights on the feasibility of a full-scale exercise of integrating data on scientific publications at the level of individual universities.

In addition, we also describe altmetric and webometric information as possible additional information to consider.


    Scimago Institutions Rankings

The Scimago Institutions Rankings (SI Ranking) is published yearly by the SCImago Research Group and takes into account organizations from any country with at least 100 documents published in the last year of a five-year period. Data come from the Scopus database of scientific literature, containing mainly scholarly journals and conference proceedings. The research group ensured the identification and disambiguation of institutions through the institutional affiliations of the documents included in Scopus, through the definition and unique identification of institutions (addressing issues related to institution mergers or splits and name changes) and the attribution of publications and citations to each institution (a manual and automatic system for the attribution of affiliations to one or more institutions).

For the purpose of this feasibility study, we refer to the 2013 Scimago world report, based on the scientific production of the period 2007-2011. The following indicators are available:

    Output: total number of documents published in scholarly journals indexed in Scopus;

International Collaboration: the ratio of the institution's output produced in collaboration with foreign institutions. The values are computed by analyzing an institution's output whose affiliations include more than one country address;

Normalized Impact: computed using the methodology established by the Karolinska Institutet in Sweden, where it is named the 'Item-oriented field-normalized citation score average'. The normalization of the citation values is done at the individual article level. The values show the relationship between an institution's average scientific impact and the world average, set to a score of 1; i.e. an NI score of 0.8 means the institution is cited 20% less than the world average and 1.3 means the institution is cited 30% more than the world average;

    High Quality Publications: ratio of publications that an institution publishes in the most influential scholarly journals of the world, those ranked in the first quartile (25%) in their categories as ordered by SCImago Journal Rank (SJRII) indicator;

Specialization Index: indicates the extent of thematic concentration/dispersion of an institution's scientific output. Values range between 0 and 1, indicating generalist vs. specialized institutions respectively. This indicator is computed according to the Gini index used in economics;

Excellence Rate: indicates the share (in %) of an institution's scientific output that is included in the set of the 10% most cited papers in their respective scientific fields. It is a measure of the high-quality output of research institutions;

Scientific Leadership: indicates an institution's output as main contributor, that is, the number of papers in which the corresponding author belongs to the institution;

Excellence with Leadership: indicates the number of documents in the Excellence Rate in which the institution is the main contributor.
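The item-oriented field normalization behind the Normalized Impact indicator can be sketched as follows: each paper's citations are divided by the world average of its field, and the ratios are averaged. The field baselines and paper data below are invented; the actual Karolinska/SCImago implementation involves further refinements (document type, publication year) not shown here.

```python
# Hedged sketch of an item-oriented field-normalized citation score,
# in the spirit of the Normalized Impact indicator described above.

world_avg_citations = {"physics": 10.0, "economics": 4.0}  # hypothetical baselines

papers = [
    {"field": "physics", "citations": 12},   # 1.2x the world average
    {"field": "economics", "citations": 2},  # 0.5x the world average
]

def normalized_impact(papers, baselines):
    """Average, over papers, of citations divided by the field's world average."""
    scores = [p["citations"] / baselines[p["field"]] for p in papers]
    return sum(scores) / len(scores)

ni = normalized_impact(papers, world_avg_citations)  # (1.2 + 0.5) / 2 = 0.85
```

A score of 0.85, as here, would mean the institution is cited 15% less than the world average.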

The 2013 report ranks 2,744 institutions worldwide, in five sectors: government, higher education, health, private, and others. 680 of them are higher education institutions located in Europe, within the perimeter of the present feasibility study.

    Global Research Benchmarking System

The Global Research Benchmarking System (GRBS) was intended to provide objective data and analyses to benchmark research performance in traditional disciplinary subject areas and in interdisciplinary areas, for the purpose of strengthening the quality and impact of research. The GRBS is an open collaborative effort of the academic community. Data come from Elsevier's Scopus database, including titles of the types Journal, Conference Proceedings, and Book Series.

The scope of the first release of the GRBS (with reference year 2011) was limited to three world macro-regions: US and Canada, Asia-Pacific, and the European Union (plus Norway and Switzerland). Universities were selected for inclusion in the GRBS by examining research output in a 4-year window (2007-2010) at two levels:


First, in each of the third-level subject areas, the universities with the highest number of publications are identified. For Asia-Pacific the top 50 are taken and for US & Canada the top 40. The different depths are due to the differences in the size of the regions.

Then, in each category, a minimum cut-off of 50 publications is applied, so that universities with fewer than 50 publications in that subject area in the 4-year window are not included in the list.

In addition, the 200 universities with the highest number of publications are identified in the broader second-level categories. The reason is to include universities that have a significant publication output in a broad category but not in any of its sub-areas. This results in only a few additional universities, since most are captured by searching the third-level areas.

The set of universities included in the GRBS is then the union of all these resulting subject-area lists. Any university that appears in at least one list is included in the GRBS and analysed in all subject areas. In this way the GRBS is able to recognize universities that have particular niche strengths.
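The selection procedure described above (top-N per subject area, minimum cut-off, union of the lists) can be sketched compactly. The university names and publication counts below are invented; the top-N and cut-off values follow the text.

```python
# Sketch of a GRBS-style selection: top-N universities per subject area,
# a minimum publication cut-off, then the union across areas.

pubs_by_subject = {
    "materials_science": {"U1": 300, "U2": 120, "U3": 40},
    "ecology": {"U3": 90, "U4": 55, "U5": 10},
}

def select_universities(pubs_by_subject, top_n=50, min_pubs=50):
    selected = set()
    for counts in pubs_by_subject.values():
        # Rank universities by publication count, keep the top N...
        ranked = sorted(counts, key=counts.get, reverse=True)[:top_n]
        # ...then drop those below the minimum cut-off.
        selected |= {u for u in ranked if counts[u] >= min_pubs}
    return selected

chosen = select_universities(pubs_by_subject)  # U3 qualifies via ecology only
```

Note how U3, below the cut-off in one area, still enters through its niche strength in another, which is exactly the property of the union-based selection claimed above.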

From the disciplinary point of view, the GRBS covers 23 of the 27 top-level subject areas and 251 of the sub-areas of the All Science Journal Classification (ASJC), focusing on Science and Technology.

For the purpose of this feasibility study, we refer to the 2011 release, which uses Scopus publication and citation data in the time window 2007-2010. The following indicators are available:

    Total Pubs: Total number of publications during a 4-year time window.

%Pubs in Top 10% SNIP: percentage of Total Pubs published in source titles that are within the top 10% of that subject area, based on the SNIP value [Ranked Outlets] of the last year in the time window. For the window 2007-2010, the SNIP values for 2010 are used.

%Pubs in Top 25% SNIP: percentage of Total Pubs published in source titles that are within the top 25% of that subject area, based on the SNIP value of the last year in the time window.

Total Cites: total number of citations within a 4-year time window to papers published in that time window. All citation counts used in the GRBS exclude self-citations.

    %Cites from Top 10% SNIP: Percentage of Total Cites received from publications in journals that are within top 10% based on SNIP value.

%Cites from Top 25% SNIP: Percentage of Total Cites received from publications in journals that are within the top 25% based on SNIP value.

    4-year H-Index: A university having a 4-year H-index of X means that at least X of its publications (during that 4-year window) have no fewer than X publications citing them (during that window). The 4-year H-index of a university is computed for a particular subject area.
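    The definition above translates directly into code. The sketch below is a generic H-index computation (not GRBS's actual implementation); the citation counts are invented.

```python
# Sketch of the 4-year H-index defined above: the largest X such that at least
# X publications in the window each received at least X citations within the
# window.

def h_index(citation_counts):
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # the rank-th most cited paper still has >= rank citations
        else:
            break
    return h

# Citations received (within the window) by one university's papers in a subject area:
print(h_index([10, 8, 5, 4, 3, 0]))  # 4: four papers with at least 4 citations each
```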

    The 2011 release includes 1,355 universities worldwide, of which 606 affiliations are located in Europe within the perimeter of the present feasibility study (in some cases multiple affiliations correspond to the same university).

    Leiden Ranking

    The Leiden Ranking is produced by the Centre for Science and Technology Studies (CWTS) and measures the scientific performance of major universities worldwide, based on data from the Web of Science bibliographic database produced by Thomson Reuters.

    A sophisticated methodology has been developed for identifying the perimeter of the universities, tracking the evolution of their configuration over time, and handling affiliated institutions (e.g. hospitals).

    The universities that appear in the Leiden Ranking have been selected based on their contribution to articles and review articles published in international scientific journals in a 4-year period (only publications in core journals are included). A minimum threshold of the equivalent of 1,000 papers was required for a university to be ranked.


    The CWTS Leiden Ranking provides statistics at the level of science as a whole and at the level of the following seven broad fields of science:

    Cognitive and health sciences

    Earth and environmental sciences

    Life sciences

    Mathematics, computer science, and engineering

    Medical sciences

    Natural sciences

    Social sciences

    The above fields have been defined using a unique bottom-up approach at the level of individual publications rather than at the journal level. Using a computer algorithm, each publication in the Web of Science database has been assigned to one of these seven fields. This has been done based on a large-scale analysis of hundreds of millions of citation relations between publications.

    For the purpose of this feasibility study, we refer to the 2014 (most recent) edition, based on publications in Thomson Reuters' Web of Science database (Science Citation Index Expanded, Social Sciences Citation Index, and Arts & Humanities Citation Index) in the period 2009-2012. The following indicators are available, in addition to absolute numbers of publications and citations:

    MCS (mean citation score): The average number of citations of the publications of a university.

    MNCS (mean normalized citation score): The average number of citations of the publications of a university, normalized for field differences and publication year. An MNCS value of two, for instance, means that the publications of a university have been cited twice the world average.

    PP(top 10%) (proportion of top 10% publications): The proportion of the publications of a university that, compared with other publications in the same field and in the same year, belong to the top 10% most frequently cited.

    PP(collab) (proportion of interinstitutional collaborative publications): The proportion of the publications of a university that have been co-authored with one or more other organizations.

    PP(int collab) (proportion of international collaborative publications): The proportion of the publications of a university that have been co-authored by authors from two or more countries.

    PP(UI collab) (proportion of collaborative publications with industry): The proportion of the publications of a university that have been co-authored with one or more industrial partners.

    PP(


    such as Twitter and Facebook, reader libraries such as Mendeley and ResearchGate, scholarly blogs and mass media. A dedicated effort is required to consider their inclusion taking into account their specificities, and particularly, data quality aspects.

    Another source to be explored is the Webometrics ranking of universities (for more information see Box n. 2), whose data could be integrated and reported in the European Map of Excellence and Specialization.

    BOX 2: Alternatives to bibliometric data: the case of the Webometrics Ranking of World Universities

    The Webometrics Ranking of World Universities is an initiative of the Cybermetrics Lab, a research group belonging to the Consejo Superior de Investigaciones Científicas (CSIC), the largest public research body in Spain. Using quantitative methods, the Cybermetrics Lab has designed and applied indicators that measure scientific activity on the Web. The cybermetric indicators are useful to evaluate science and technology as an alternative or complement to traditional bibliometric indicators.

    The Webometrics Ranking is the largest academic ranking of Higher Education Institutions worldwide (more than 5,000 institutions covered), published every six months since 2004 with the aim of providing reliable, multidimensional, updated and useful information about the performance of universities based on their web presence and impact. The original aim of the Ranking was to promote Web publication (Open Access initiatives, electronic access to scientific publications and to other academic material). However, web indicators are useful for ranking purposes too, as they are based not on the number of visits or on page design but on the global performance and visibility of the universities.

    Webometrics claims a number of strengths with respect to bibliometric-based rankings: reduced field-of-science distortion and better (indirect) inclusion of other missions such as teaching or the so-called third mission; link analysis as a more powerful tool than citation analysis for quality evaluation; and a wider perimeter of research output, including also informal scholarly communication.

    In the Webometrics Ranking the unit of analysis is the institutional domain, so only universities and research centres with an independent web domain are considered.

    The first web indicator, the Web Impact Factor (WIF), was based on link analysis combining the number of external inlinks and the number of pages of the website in a 1:1 ratio between visibility and size. This ratio is used for the ranking, adding two new indicators to the size component: the number of documents, measured from the number of rich files in a web domain, and the number of publications collected by the Google Scholar database. Four indicators were obtained from the quantitative results provided by the main search engines, as follows:

    Size (S). Number of pages recovered from the main search engines: Google, Yahoo, and Bing.

    Visibility (V). The total number of unique external links received (inlinks) by a site, according to Yahoo Site Explorer.

    Rich Files (R). After evaluation of their relevance to academic and publication activities, and considering the volume of the different file formats, the following were selected: Adobe Acrobat (.pdf), Adobe PostScript (.ps), Microsoft Word (.doc) and Microsoft PowerPoint (.ppt). These data were extracted using Google, Yahoo and Bing.

    Scholar (Sc). The data is a combination of items published between 2006 and 2010 included in Google Scholar and the global output (2004-2008) obtained from Scimago SIR.

    The four ranks were combined according to a formula where each one has a different weight but maintaining the 1:1 ratio.
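    The combination of the four component ranks can be illustrated as a weighted sum. The weights below are assumptions chosen only to respect the stated 1:1 ratio between visibility and the size-related components; the actual weights are not given in this passage.

```python
# Illustrative combination of the four Webometrics component ranks (V, S, R, Sc)
# into a composite score. The weights are assumptions: they are chosen so that
# visibility (V) equals the sum of the size-related components (S + R + Sc),
# preserving the 1:1 ratio. Lower rank positions are better, so a lower
# composite score is better too.

WEIGHTS = {"V": 0.50, "S": 0.20, "R": 0.15, "Sc": 0.15}  # assumed; V = S + R + Sc

def composite_rank(ranks):
    """ranks: per-component rank positions, e.g. {"V": 12, "S": 30, ...}."""
    return sum(WEIGHTS[k] * ranks[k] for k in WEIGHTS)

print(composite_rank({"V": 12, "S": 30, "R": 25, "Sc": 40}))  # 21.75
```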

    For more information: www.webometrics.info


    Coverage of the European university landscape

    All three bibliometric databases analysed (and ETER itself) have a perimeter defined by a minimum size threshold, which impacts their coverage of the European university landscape. In this respect, Scimago has the widest coverage (58%), followed by GRBS (49%) and finally the Leiden Ranking with only 24%. Given the uneven distribution of research output among HEIs and its concentration in larger and more research-oriented institutions, we can assume that the three databases have a better coverage in terms of total publications. The problem of coverage is well known in the literature, but still not solved. Three kinds of interpretational problems have to be taken into account:

    To the extent that indicators are derived from bibliographic databases such as Web of Science or Scopus, the outcomes depend upon the adequacy of coverage of such databases, and on differences therein between countries and subject fields. Although in the past years both databases have dedicated large efforts to expanding their coverage with conference proceedings and scholarly books, they are predominantly journal indexes. As a consequence, their coverage of fields in which sources other than journals play an important role in scientific communication may be limited. Also, technologically oriented institutions aiming to develop new products and processes tend to produce outputs that are not well reflected in scientific-scholarly publications in the serial, refereed literature.

    Results for an individual institution can only be interpreted properly when one takes into account the structure of the national academic system in which it is embedded, and the particular role of the university therein. Historical, political and cultural factors, including national or regional rivalry, different religious traditions or different concepts of academic education, may account for structural differences across national research systems (Moed 2006; Lambert and Butler 2006). Also, an institution's performance should be viewed from the perspective of its position in the national and international research collaboration network.

    In order to obtain a meaningful interpretation of bibliometric and other types of output indicators of research institutions, other types of information are needed. Three important types are the following: information on the degree of variability within institutions as a complement to indicators at the level of institutions as a whole, such as their degree of disciplinary specialization (e.g., Lopez-Illescas et al. 2011; Calero et al. 2008); insight into the relation between output and input, based on econometric models of research efficiency (Daraio, Bonaccorsi and Simar, 2015a,b); and, last but not least, information on the mission of research institutions.

    (1) Problem: Bibliographic databases may reveal differences in adequacy of coverage of research output among research fields.
        Examples: Limited coverage of research publications in the social sciences and humanities; journal articles are not the primary output in technological institutions.
        Solution: Collect information from institutions themselves and add this to the annotation/background file.

    (2) Problem: Interpretation should take into account the structure of the national academic system in which an institution is embedded.
        Examples: Historical, political and cultural factors may account for structural differences across national research systems.
        Solution: Make results available at distinct aggregation levels: per research discipline within an institution; per institution; per country; and by scientific collaboration patterns.

    (3) Problem: Proper interpretation requires information on the degree of variability within institutions, on the relation between output and input, and on their mission.
        Examples: Indicators of the degree of disciplinary specialization and of research efficiency; information on mission and regional function.
        Solution: Include such indicators in a comprehensive information system on academic institutions and public research organizations.

    Table 1: Major problems in coverage and their solutions. Source: Moed, 2015.

    Table 1 summarises problems of coverage listed above and reports some suggested solutions.


    Nevertheless, full geographical coverage at the European level would require an additional effort to reach a higher coverage of the ETER perimeter. This would allow a less biased comparison between national and regional higher education systems, with different average institution sizes, different levels of concentration of activities and different subject specialization patterns.

    6. LOCATING PUBLICATIONS OF UNIVERSITIES AND PROS ON A GEOGRAPHIC MAP

    A full-scale exercise to build a map of research requires a preliminary process of disambiguation of affiliations and a clear definition of the perimeter of HEIs and PROs. The three examples of projects described above have faced similar problems and solved them in different ways. Different methodological choices (full vs. fractional counting of publications, different data sources, different time coverage, different coverage of publication typologies, etc.) nevertheless hamper the comparability of final results and consequently the assessment of the impact of the disambiguation process. Generally speaking, the problem of disambiguation of affiliations in bibliometric exercises can be detailed in the following issues:

    Researchers do not indicate their institutional affiliations in a uniform manner (different names, acronyms, abbreviations and truncations, evolving naming conventions);

    Bibliographic database producers apply data-capturing rules that re-format and modify the original affiliation data in the scientific articles they index. This activity might be a source of errors and distortions;

    Additional information sources are needed to comprehensively identify research institutions or organizations in a particular country, especially large countries with complex and articulated research systems;

    The definition of the institutional perimeter of a research institution might be problematic (affiliated institutions, university hospitals, presence of umbrella organisations);

    Author affiliations may not reflect the whole research process (i.e. the affiliation usually refers to the organization where the author works, but the research might be coordinated and/or funded by different institutions).

    a. Problem: Researchers do not indicate their institutional affiliations in a uniform manner.
       Examples: Leads to variations in institutional names; incomplete names.
       Solution: Use a validated thesaurus or authority file; consult national experts; use advanced disambiguation software.

    b. Problem: Data capturing by bibliographic database producers is useful but may contain errors.
       Examples: Database producers may change the order of components of an affiliation structure.
       Solution: Start from the raw affiliation data; do not use ill-understood features implemented by indexers.

    c. Problem: Additional information sources are needed to comprehensively identify research institutions.
       Examples: Analysts of national affiliation data need to know what they are looking for.
       Solution: Use a validated thesaurus or authority file; consult national experts.

    d. Problem: Problems may arise with the definition of research institutions.
       Examples: Position of academic or teaching hospitals and of umbrella organizations such as university systems.
       Solution: Consult national and institutional experts; create an annotation file with background information on how an institution is defined.

    e. Problem: Traditional author affiliations may not properly express the increasing complexities of the research system.
       Examples: Responsibility for a research programme and for research infrastructure may be split amongst different organizations.
       Solution: Initiate further research into this issue; examine the use of information from funding acknowledgements.

    Table 2: Major problems in affiliation disambiguation and their solutions. Source: Moed, 2015.

    Table 2 presents a summary of these problems with examples and their solutions.


    Towards an authority file for PROs

    In most countries academic institutions constitute by far the most important type of research entity. A second important group of (mainly) publicly funded research institutions is labelled Public Research Organizations (PROs). Public Research Organizations can be divided into four categories or ideal types (OECD Innovation Policy Platform 2011):

    Mission Oriented Centres (MOCs), owned by government departments or ministries at a national level (e.g., INSERM in France; CIEMAT in Spain).

    Public Research Centres (PRCs), publicly funded overarching research institutions such as CNRS in France, CNR in Italy, and the Max Planck Gesellschaft in Germany.

    Research Technology Organizations (RTOs), often in the public sphere, private but not-for-profit, such as Fraunhofer Gesellschaft in Germany and TNO in the Netherlands.

    Independent Research Institutes (IRIs), often at the boundary between the public and the private sector, typically denoted as centres of excellence, and usually recently founded.

    While the scientific output of universities is the object of a large effort in the dedicated literature and several exercises have been carried out, much less is known, on a systematic and comparable basis, about the scientific output of PROs. This is because several issues are still open: (a) establishing a comprehensive list of PROs; (b) disambiguating the innumerable ways in which PROs show up in bibliometric information and normalizing affiliations; (c) locating the scientific production of PROs at the geographic level.

    The creation of an Authority file for the European PROs requires a dedicated research effort, aimed at identifying and characterizing the perimeter of the institutions to be involved, which is beyond the objective of this feasibility study.

    For the purpose of this study, and on the basis of a bibliometric study carried out and currently in progress at Sapienza (the Elsevier Bibliometric Research Project "Assessing the Scientific Performance of Regions and Countries at Disciplinary level by means of Robust Nonparametric Methods: new indicators to measure regional and national Scientific Competitiveness"), we consider as a first rough starting point the list resulting from the combination of the institutions included in the RPOs (Research Performing Organizations) Inventory provided by the European Commission, the list of the RPOs and RFOs (Research Funding Organizations) to which the ERA Surveys were submitted, and finally the lists of participants in the Framework Programme initiatives and in the Horizon 2020 programme.

    Breakdown by discipline

    The breakdown of bibliometric data at the disciplinary level is indeed crucial for a sound analysis of S&T activities, but requires an additional and dedicated effort.

    Two alternative approaches have been proposed: a top-down approach focusing on journal classification, and a bottom-up approach working at the level of the individual publication.

    Following the first approach, journals are assigned to one or more research areas, bringing along all the publications that appeared in the journal. GRBS follows this approach and makes reference to Elsevier's Scopus All Science Journal Classification (ASJC), which maps source titles into a structured hierarchy of disciplines and sub-disciplines, allowing research activity to be categorized according to the field of research. ASJC classifies about 30,000 source titles into a two-level hierarchy. The top level contains 27 subject areas, including a Multidisciplinary category, and the second level contains 309 subject areas.
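    The top-down mechanism can be sketched in a few lines: each journal carries one or more ASJC codes, and every publication inherits the codes of the journal it appeared in. The journal names and code assignments below are invented for illustration.

```python
# Illustrative sketch of the top-down approach: a publication is classified by
# looking up the ASJC codes of its source title. Journal names and code
# assignments here are invented, not real ASJC entries.

ASJC = {
    "Journal of Examples": [1700],        # e.g. a Computer Science source title
    "Example Letters B": [1703, 2207],    # a journal may map to several codes
}

def classify(publications):
    """publications: (pub_id, journal) pairs -> {pub_id: list of ASJC codes}."""
    return {pub_id: ASJC.get(journal, []) for pub_id, journal in publications}

pubs = [("p1", "Journal of Examples"), ("p2", "Example Letters B")]
print(classify(pubs))  # {'p1': [1700], 'p2': [1703, 2207]}
```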

    The second approach works at the level of individual publications, assigning each publication to a research area through large-scale bibliometric analysis. The Leiden Ranking follows this approach, using a computer algorithm to assign each publication in the Web of Science database to a research area on the basis of citation relations. The lowest-level research areas are organized in a hierarchical structure that leads to the seven top-level fields (Waltman and van Eck 2012).

    An alternative bottom-up approach has been experimented with in GRBS to deal with interdisciplinary and emerging areas, which are not fully captured by a top-down approach. The GRBS pilot effort, with a focus on areas of sustainable development, was based on keywords: candidate keyword lists were automatically generated and then checked by domain experts; the resulting keyword lists were matched against keywords in the title, abstract, and list of author-defined keywords of each publication.
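    The keyword-matching step just described can be sketched as follows. The keyword list and publication record are invented for illustration; the actual GRBS pilot used expert-checked lists.

```python
# Minimal sketch of the keyword-matching step: an expert-checked keyword list
# for an area is matched against the title, abstract and author-defined
# keywords of each publication. All records below are invented.

def matches_area(pub, keywords):
    text = " ".join(
        [pub["title"], pub["abstract"], " ".join(pub["author_keywords"])]
    ).lower()
    return any(kw.lower() in text for kw in keywords)

sustainable_dev = ["renewable energy", "water scarcity"]  # hypothetical list
pub = {
    "title": "Grid integration of renewable energy",
    "abstract": "We study storage options for intermittent sources.",
    "author_keywords": ["smart grids"],
}
print(matches_area(pub, sustainable_dev))  # True
```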

    A second order of issues arises when aiming at developing indicators that combine bibliometric data with data on other dimensions of a university's profile. The European Tertiary Education Register contains a breakdown of variables related to education (enrolled students and graduates), research activity (doctoral students and graduates) and resources (academic staff) by field of education. Data are broken down into the following eleven broad fields (plus one unclassified category) according to the 2011 ISCED-F classification system:

    00 Generic programmes and qualifications

    01 Education

    02 Arts and humanities

    03 Social sciences, journalism and information

    04 Business, administration and law

    05 Natural sciences, mathematics and statistics

    06 Information and Communication Technologies (ICTs)

    07 Engineering, manufacturing and construction

    08 Agriculture, forestry, fisheries and veterinary

    09 Health and welfare

    10 Services

    Although at the broadest level all the classification systems explored here converge to large disciplinary fields, there is no perfect match between the categories used in bibliometric databases and the ISCED fields of education. This is due partially to different scopes (ISCED-F aims at classifying educational rather than research activities) and partially to different classification choices.
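    The imperfect match between classifications can be illustrated with a partial, many-to-many concordance. The mappings below are plausible assumptions for illustration only, not an official concordance table; the fallback category is likewise assumed.

```python
# Illustrative (partial, many-to-many) concordance between ASJC top-level areas
# and the ISCED-F broad fields listed above. The mappings are assumptions, not
# an official concordance table.

ASJC_TO_ISCED = {
    "Mathematics": ["05 Natural sciences, mathematics and statistics"],
    "Computer Science": ["06 Information and Communication Technologies (ICTs)"],
    "Medicine": ["09 Health and welfare"],
    "Business, Management and Accounting": [
        "04 Business, administration and law",
        "03 Social sciences, journalism and information",
    ],
}

def isced_fields(asjc_area):
    # fall back to an (assumed) unclassified category when no mapping exists
    return ASJC_TO_ISCED.get(asjc_area, ["99 Not classified"])

print(isced_fields("Computer Science"))
```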

    See Section 8 for further discussion on concordance tables of different subject classification systems.


    7. TOWARDS A EUROPEAN MAP OF EXCELLENCE AND SPECIALIZATION

    Geo-referencing data on publications

    Given the availability of an authority file of European universities containing their localization based on ETER, the easiest way of geo-referencing bibliometric data consists in matching the affiliations reported in publication databases to ETER IDs.

    An alternative approach could start from the information on affiliations contained in each publication (an affiliation usually contains the name of the institution and the address, with the name of the city). This approach might lead to more accurate localizations, making reference to the actual address of the branch where the author works (instead of the institution's main seat). At the same time, it requires a substantially larger effort for disambiguating affiliation addresses and cities and for interpreting the proper placement of the affiliation within its institutional framework. These issues have been addressed by Sapienza through a pilot application of a new algorithm for the disambiguation of RPO affiliations (see Lenzerini, 2015b).

    Although a classification of research institutions through their URL is difficult to carry out, the information provided by the URL reported in the Webometrics ranking of universities (see Box n. 2) could be extremely useful to map and disambiguate institutional affiliations from the publication databases.

    For the scope of this feasibility study we explored the first option and matched the list of universities in our authority file with the lists of institutions reported in Scimago, GRBS and Leiden. In most cases the same university appears in the three databases under different names, and these denominations only in a minority of cases correspond exactly to the ETER institution name (either in the national language or in English). Nevertheless, a non-ambiguous correspondence between name variants can be found with a dedicated effort.
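    One simple building block for such a dedicated effort is normalized string similarity. The sketch below uses `difflib` from the Python standard library; the threshold and institution names are illustrative, and a real pipeline would add curated synonym lists and expert validation, as described above.

```python
# Hedged sketch of matching name variants from bibliometric databases to ETER
# institution names via normalized string similarity. The threshold and names
# are illustrative assumptions, not the method actually used in the study.
import difflib

def normalize(name):
    return " ".join(name.lower().replace(".", " ").split())

def match_to_eter(variant, eter_names, threshold=0.8):
    best, best_score = None, 0.0
    for official in eter_names:
        score = difflib.SequenceMatcher(
            None, normalize(variant), normalize(official)
        ).ratio()
        if score > best_score:
            best, best_score = official, score
    # reject weak matches so ambiguous variants are flagged for manual review
    return best if best_score >= threshold else None

eter_names = ["Sapienza University of Rome", "University of Bologna"]
print(match_to_eter("Univ. of Bologna", eter_names))  # University of Bologna
```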

    The results are very satisfactory:

    100% of universities in the Leiden ranking found a correspondence with an ETER ID;

    Over 99% of GRBS affiliations found a match