ELIXIR BIOINFORMATICS USER SURVEY

FINAL REPORT JUNE 2009

S. Palcy¹ & A. de Daruvar¹

¹ Centre Bioinformatique de Bordeaux (CBIB) – Université Victor Segalen Bordeaux 2, 146 rue Léo Saignat, 33076 Bordeaux cedex, France. Email contact: [email protected]

ELIXIR Bioinformatics User Survey - Final Report

S. Palcy and A. de Daruvar, University of Bordeaux, France – June 2009 2

Table of Contents

EXECUTIVE SUMMARY

INTRODUCTION

USER SURVEY DEVELOPMENT
    A. Survey design
    B. Coordination with other ELIXIR WPs

USER SURVEY RESULTS

PART I – ONLINE SURVEY RESULTS
    A. Overview
    B. Respondent groups’ profiles
    C. Responses from frequent and occasional users

PART II – INTERVIEW RESULTS
    A. Candidates’ profile
    B. Long-term sustainability of bioinformatics infrastructures
    C. Working with bioinformatics databases
    D. Working with bioinformatics tools

LIMITATIONS OF THE USER SURVEY
    A. Survey scope
    B. Sampling method
    C. Community strata

CONCLUSION
    A. Development of a survey strategy
    B. Survey findings
    C. Strategy for future bioinformatics infrastructures

ACKNOWLEDGEMENTS

REFERENCE

APPENDIX I – SURVEY QUESTIONNAIRES

APPENDIX II – SUPPLEMENTARY TABLES

EXECUTIVE SUMMARY

During the ELIXIR preparatory phase, all relevant stakeholders, including bioinformatics resource providers and users, were consulted to ensure that the future European infrastructure meets their needs. As part of WP3 “Coordination and Participation”, the ELIXIR bioinformatics user survey was collaboratively developed to help understand the bioinformatics infrastructure requirements of the user community represented by individual research groups. The survey was specifically designed to assess users’ current practices with existing bioinformatics resources (i.e. bioinformatics databases and tools) as well as their needs and priorities for future bioinformatics infrastructures. An online questionnaire was developed and tested in a pilot study. The user survey was launched online in July 2008 and closed after four months of data collection. The online study was complemented by interviews of selected experts (i.e. group leaders in life science) from several countries, conducted from September 2008 to February 2009.

In total, 804 respondents (i.e. respondent groups) completed the online questionnaire and 9 experts (from France, Germany, Spain, the Netherlands and the UK) volunteered for an interview. For the analysis of the online survey, only forms including a response to the central question about the sustainability of EU bioinformatics infrastructures were selected. This reduced the total number of respondents considered in the analysis to 754.

Analysis of respondents’ profiles indicated that most of them (85.0% of answering respondents) were frequent users of bioinformatics resources. Altogether they represented 318 organizations from 34 different countries. Unfortunately, the private sector was underrepresented and the vast majority of respondents (89.4% of answering respondents) were from the academic/non-profit sector. Respondents’ research activities were distributed across different research domains, predominantly bioinformatics (62.7% of answering respondents) and biology (58.4% of answering respondents), as expected. The bioinformatics environment of the respondents (i.e. support for bioinformatics expertise, resources, training and education as well as relationships with the bioinformatics research community) was variable and depended on criteria such as user category (e.g. frequent or occasional users), country location and research domain.

Analysis of responses about bioinformatics resources focused on frequent and occasional users, as they represented 97.9% of total respondents. In these user categories, 67.2% of answering respondents considered the long-term sustainability of European bioinformatics infrastructure essential for their research activities. Respondents used bioinformatics databases principally for biological data/information searching and data analysis (91.4% and 79.1% of answering respondents, respectively). Working with molecular sequence data was of prime interest for 76.2% of answering respondents. Other data of general interest (about 60% of answering respondents) were genomics, protein functional annotation, gene expression and literature data. The top-rated databases included the popular databases PubMed, EMBL/GenBank/Entrez Nucleotide and UniProt/Swiss-Prot/TrEMBL/Entrez Protein. Specialized databases of importance included metabolomics (KEGG), human genetic disease (OMIM) and transcriptomics (GEO and ArrayExpress) databases. Database literature citation was a regular practice for 59.2% of answering respondents and 54.6% of them had interacted with database providers.

In general, respondents reported variable satisfaction with current database resources. The most frequent challenges were transparent querying across databases, format compatibility and database website usability. Hence, improvements in database interoperability/integration and in database functionalities were the main concerns (87.2% and 85.6% of answering respondents, respectively). Improvement in data quality was also strongly recommended (78.3% of answering respondents). In addition, priorities for new database development were principally the expansion of the scope of existing databases, new API-web services and public web portals (82.4%, 66.3% and 62.4% of answering respondents, respectively).

Regarding respondents’ usage patterns and experience with bioinformatics tools, the number of tools used was highly variable and depended on user category, research domain and bioinformatics environment. In general, tools were provided by academic/non-profit organizations and accessed through a combination of online use and in-house installation. 85.9% of answering respondents reported investing some to significant effort in combining data resources and/or tool inputs/outputs. Investment of significant effort was mainly reported by respondents active in maths and computer sciences and, to a lesser extent, in bioinformatics and medicine. Finally, development priority for tool resources was generally given to the dissemination and standardization of bioinformatics tool benchmarking.

In conclusion, this survey made it possible to consult a large and highly diverse community of users. The participation of 804 individual research groups and the subsequent analysis of the collected information provided insights into this user community’s needs and priorities with respect to bioinformatics infrastructures. In particular, the study (i.e. online questionnaire and interviews) highlighted broad and strong support for sustainable bioinformatics infrastructures, which were perceived as critical for the advance of research in life science. Results from the survey also provided guidance for the improvement and new development of database and tool resources. Finally, information on respondents’ user profiles identified several indicators (i.e. user category, bioinformatics environment, research domain and country location) that reflect this community’s complexity.

Altogether, the user survey provided an information basis, with regard to the largest stakeholder community, for strategy planning of future bioinformatics infrastructures in Europe.

INTRODUCTION

A new discipline has become increasingly important and critical for the future of life science. Bioinformatics, which has the challenging tasks of collecting, preserving and making available the biological data generated in hundreds of research laboratories, has become central to all future research efforts in life science. In 2007, a literature-based study documented the general contribution and increasing impact of bioinformatics in biomedical research (Perez-Iratxeta et al.). Although bioinformatics is becoming increasingly important in life science researchers’ activities, public investment to ensure the development and sustainability of bioinformatics infrastructures (e.g. data and tool resources, education and training, and computing facilities) in Europe has not kept pace with the rapidly growing needs. Furthermore, the upcoming challenges in bioinformation (e.g. data volume, computing capacity and integration of multiple resources) require coordination and collaboration of the scientific communities at a supra-national level, which cannot be achieved without building major bioinformatics infrastructures.

The ELIXIR initiative has drawn together European researchers from all fields of life science, in both academia and industry, to build a pan-European bioinformatics infrastructure with a strategy for permanent funding.

During the ELIXIR preparatory phase, all relevant stakeholders were consulted to ensure that the future European infrastructure meets their needs and priorities. As part of this consultation process, a user survey was developed to address the part of the user community that principally includes individual research groups (from both academia and industry) with research activities spanning all fields of life science. Consulting such a wide and disparate community was a challenge, made greater by additional discrepancies in knowledge and use of bioinformatics resources as well as in bioinformatics expertise.

The user survey helped document the general as well as the specific needs and priorities of individual research groups. It also provided a unique opportunity for the larger part of the user community to give feedback on existing infrastructures and recommendations for future developments.

USER SURVEY DEVELOPMENT

The development of the survey was a collaborative effort involving both the University of Bordeaux (UB2) (namely Antoine de Daruvar and Sandrine Palcy) and EBI (namely Janet Thornton, Graham Cameron, Peter Stoehr and Dominic Clark). UB2 was in charge of the questionnaire and interview design (in close interaction with EBI) as well as the collection and analysis of survey data. EBI (with the support of UB2) was in charge of the strategy to solicit principal investigators (PIs) and select candidates for interviews.

Additional contributions to the online questionnaire design were made by Søren Brunak (Chair of the WP12 “Infrastructure for Tools Integration” Committee), Rodrigo Gouveia-Oliveira (member of the WP12 “Infrastructure for Tools Integration” Committee), Chris Southan (member of the WP2 “The ELIXIR Strategy for Data Resources” Committee and Coordinator of the ELIXIR Database Provider Survey) and Rafael Najmanovich (Research Associate, EBI).

Finally, representatives of funding agencies and national bodies (namely Elmar Nimmesgern, representative of BMBF on ELIXIR; Alix de la Coste, representative of French NCP Infrastructures; Rosa R. Bernabé, Deputy Director General for International Programmes, Spanish Ministry of Science and Innovation; Dr Adrian Pugh, representative of BBSRC, ELIXIR Programme Manager; Work Package 5; Dr. Elod Nemerkenyi, Assistant of International Affairs, Hungarian Scientific Research Fund and Antoine van Kampen, Scientific Director, Netherlands Bioinformatics Centre) were instrumental in the selection of key national scientists as potential interview candidates.

A. Survey design

The design of the user survey (scope, sample, solicitation strategy, tool, support and questionnaire) was addressed in a kick-off meeting on January 10th 2008. Participants included Dominic Clark (EBI), Peter Stoehr (EBI), Rafael Najmanovich (EBI) and Sandrine Palcy (UB2).

1. Goal

The user survey aimed to identify priorities for long-term support of infrastructures for biological information (i.e. core and specialized biomolecular databases) based on users’ requirements.

2. Objective

The user survey collected information about users’ current use of bioinformatics resources (i.e. databases and tools) as well as their needs and priorities for the future.

3. Content

The survey addressed three main topics: (i) users’ current practices, (ii) users’ current issues with bioinformatics resources and (iii) users’ priorities for improvement or future development of bioinformatics resources. To correlate users’ profiles with usage patterns and requirements, the survey included information on respondents’ location and affiliation as well as working sector, research domain and bioinformatics environment.

4. Sample

The targeted population consisted of users of bioinformatics resources, represented by experimental (i.e. wet-bench scientists in life science) and bioinformatics research groups in Europe, from both the academic and private sectors.

a. Sample size

The minimum size of the sample (number of research groups to be contacted) was determined from the initial objective of 300 responses. With a maximum estimated response rate of 20%, the sample therefore had to include at least 1500 research groups (300 / 0.20 = 1500).
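
The sizing rule above reduces to a single ceiling division. The sketch below is illustrative only: the function name `minimum_sample_size` and the use of whole-number percentages are assumptions made for this example, not part of the survey protocol.

```python
def minimum_sample_size(target_responses: int, response_rate_percent: int) -> int:
    """Smallest number of groups to contact so that the expected number of
    replies reaches `target_responses` (hypothetical helper)."""
    if target_responses < 0:
        raise ValueError("target_responses must be non-negative")
    if not 0 < response_rate_percent <= 100:
        raise ValueError("response_rate_percent must be in (0, 100]")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_responses * 100 // response_rate_percent)


# Figures from the text: 300 expected responses at a 20% response rate.
sample_size = minimum_sample_size(300, 20)
print(sample_size)
```

With the report's figures this gives 1500 groups. Because 20% was taken as a maximum, any lower actual response rate would call for a larger sample; at 10%, for instance, the same rule gives 3000 groups.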

b. Sampling Method

Research groups were invited to participate in the online survey by email and by other advertising means: website posting (ELIXIR website and ELIXIR partners’ websites), newsletter advertising (ELIXIR partners’ community) and communication at congresses (oral communications, flyers and exhibition booth).

Funding agencies and national bodies were solicited for the identification of interview candidates. Candidates were then invited to participate via email.

c. Sample identification

The following sources of contacts were considered:

Funding agencies and national bodies

EMBL (Alumni)

Members of WP3 Bioinformatics Communities Committee

International and European consortia in “OMICS” research
    o ProteomeBinders
    o EuPA

European Biotechnology - Bio-industry associations

5. Methods

The survey was conducted mainly via an online questionnaire and was complemented by phone interviews of selected experts.

a. Tool and Support

The online survey was conducted using the SurveyMonkey online survey tool (http://www.surveymonkey.com). Access to a professional account on this online survey platform was kindly provided by EBI.

b. Solicitation strategy

A strategy was established to solicit the participation of research groups (e.g. group leaders/principal investigators) in the online survey and to select candidates for interviews.

i. Advertising of the online survey

A survey invitation message was provided, as a template, to members of the WP3 Bioinformatics Communities Committee for solicitation of their national community.

Furthermore, support for survey advertising was requested from other community contacts such as the EMBL alumni association, some European scientific consortia in “OMICS” research and national biotechnology associations.

One response per research group was expected; therefore only one representative of each group, such as the principal investigator (or group leader) or a lead user, was invited to participate on behalf of the group.

ii. Selection of interview candidates

Assistance from representatives of funding agencies and national bodies was requested for the identification of potential candidates for interviews. Based on these recommendations, requests for interview participation were individually sent to candidates.

iii. Following up on non-respondents

Following up on survey non-respondents is essential to maximize the response rate. A second round of survey advertising was therefore organized midway through the survey roll-out period (month 11, September 2008).

6. Anonymity and confidentiality

The survey was anonymous (except for the interview candidates) and any personal data collected was kept confidential in order to protect respondents' privacy and facilitate participation in the ELIXIR survey. Email addresses of participants who requested the final survey report were not (and will not be) disclosed to any third party without the participant’s proper authorization.

7. The online questionnaire

The online questionnaire was developed in three phases.

a. Questionnaire design and internal test

The online questionnaire was designed based on the survey scope. Given the surveying method, the online questionnaire was kept as short as possible and was designed to be completed in about 10-15 min to minimize drop-out.

Considering the number of topics to be assessed, menu choice questions were preferred, when possible, to shorten the survey completion time. In addition, considering the background discrepancies among the assessed community, questions were designed to allow maximum answer flexibility, for example by including “other” categories in multiple choice questions or "Do not know" response options when the respondent’s opinion was requested. Finally, free text boxes were included, when suitable, to allow respondents to voluntarily provide additional comments.

Effort was made to design precise, clear and simple questions (e.g. avoiding double questions). Given the discrepancies in scientific culture among the assessed community, particular attention was paid to the vocabulary used. When needed, examples were given to make the question clearer. The wording of the questions was tested internally by volunteers from either the bioinformatics (UB2 and EBI) or the experimental biology (i.e. wet bench scientists) communities (UB2).

b. Pilot study

The objective of the pilot study was to identify problems on a larger scale with question wording, instructions (e.g. for response selection of multiple choice), etc. In general, the purpose was to test that the respondents understood questions and returned useful answers.

To this end, the pilot study was conducted on a sample in which wet bench scientists and bioinformaticians were, as much as possible, equally represented. The sample source included members of the EMBL alumni association and participants in the First ELIXIR Stakeholder Meeting (April 10th, 2008).

c. Questionnaire revision

Based on the results from the pilot study, the online questionnaire was revised. Attention was paid to: (i) question misunderstanding, (ii) possibility of biased answers, (iii) generally skipped questions and (iv) questionnaire drop out.

In addition, presentation of the user survey progress at the First ELIXIR Stakeholder Meeting led to additional comments and suggestions, which were also taken into account for the questionnaire revision.

The final questionnaire included the following topics:

User and data resources
    o Importance and use of data resources for user’s research
    o Data and data resources currently used by the user
    o Challenges encountered with data resources
    o Priorities for improvement and new development of data resources

User and tools
    o Current practices with tools
    o Challenges encountered using tools
    o Relevant tool resources to be developed

User profile
    o Country location
    o Affiliation & sector
    o Bioinformatics environment (i.e. infrastructure, expertise, scientific interaction, education and training)

Furthermore, the final questionnaire design included three sub-questionnaires in order to separately assess users of different levels or non-users:

o Sub-questionnaire 1: for frequent and occasional users of bioinformatics resources. This sub-questionnaire was designed to assess users with sufficient background expertise in bioinformatics resources to provide relevant feedback.

o Sub-questionnaire 2: for respondents who barely use bioinformatics resources. This sub-questionnaire was designed to offer less technically oriented questions. In comparison with sub-questionnaire 1, the question list was simplified (in terms of vocabulary and question subjects) and shortened.

o Sub-questionnaire 3: for non-users. This sub-questionnaire was a very short list of questions to capture the reason why bioinformatics resources are of no use to the respondents and whether or not they would still consider bioinformatics infrastructures a research infrastructure priority.

The questionnaire can be viewed at the following URL address: http://www.surveymonkey.com/s.aspx?sm=6AV4N219x0r851WtLHNnJQ_3d_3d

8. The Interviews

The interview design was based on the online questionnaire, supplemented by open text fields to allow additional comments. An interview lasted about 45 min to 1 h on average.

9. Survey duration

The online questionnaire was open between month 9 (July 2008) and month 13 (November 2008). Interviews of candidates started in month 11 (September 2008) and lasted until month 16 (February 2009).

B. Coordination with other ELIXIR WPs

1. Coordination with the WP3 - Bioinformatics Communities Committee

Survey advertising to the national scientific communities was achieved with the coordination, support and, in many cases, active participation (i.e. direct solicitation) of the members of the WP3 Bioinformatics Communities Committee.

2. Coordination with the WP12 - Infrastructure for Tools Integration Committee

Members of WP12 were involved in the questionnaire revision based on the comments from the First Stakeholder Meeting.

3. Coordination with other ELIXIR survey efforts

The User Survey was designed in coordination with other ELIXIR surveys, which included the Database Developer Survey (WP2 - The ELIXIR Strategy for Data Resources) and the Industry Survey (WP3 - Industry stakeholder committee).

USER SURVEY RESULTS

Part I – Online Survey Results

A. Overview

1. Response to the survey

In total, 804 responses were collected during the period of the online survey (i.e. about 4 months; closing date: November 7, 2008). An overview of the results is presented in Table 1.

Table 1 – Overview of the online survey results

Total respondents               804
Estimated response rate (*)     16%
Total completed forms           575 (71.5%)

(*) The estimated response rate is calculated based on a sample size of 5000 research groups.

Out of the 804 collected responses, 71.5% were complete (i.e. no drop-out before the end of the questionnaire). To estimate the survey response rate, we could reasonably assume that the sample solicited via email included at least 5000 groups (according to the data provided by the different contacts for email advertising). To this number should be added the numbers of respondents solicited by website posting, advertising in community newsletters and communication at congresses. However, these additional numbers are difficult to estimate. Consequently, the response rate could only be estimated, as an upper bound, from the estimated email solicitation sample (i.e. 5000 research groups) and is given here only as a rough indication.
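
The two percentages in Table 1 can be reproduced from the counts given in the text. This is a minimal sketch: the helper name `percentage` and the variable names are assumptions, and the 5000-group denominator is the report's own lower-bound estimate for email solicitation only.

```python
def percentage(part: int, whole: int, digits: int = 1) -> float:
    """Return `part` as a percentage of `whole`, rounded to `digits` decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, digits)


total_responses = 804        # responses collected online
completed_forms = 575        # forms with no drop-out
emailed_groups_min = 5000    # minimum number of groups reached by email

completion_rate = percentage(completed_forms, total_responses)
# The denominator is a minimum, so this value is an upper bound on the true rate.
response_rate_upper_bound = percentage(total_responses, emailed_groups_min)

print(f"Completed forms: {completion_rate}%")
print(f"Estimated response rate (at most): {response_rate_upper_bound}%")
```

The second figure comes out slightly above 16 and is reported as 16% in Table 1. Any groups reached through websites, newsletters or congresses would enlarge the denominator and lower the true rate.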

2. Analysis of the results

a. Result filtering

To proceed with the result analysis, the 804 collected responses were filtered (Filter #1) to select forms that included at least a response to the central question about long-term support of EU bioinformatics infrastructures, i.e. “We view the long-term sustainability of European bioinformatics infrastructure for our research activities as:” with the options “essential”, “important” or “not relevant” (see Appendix I, sub-questionnaires 1 and 2 - Question #4 or sub-questionnaire 3 - Question #5). Applying this filter reduced the number of respondents to 754, which was the total number of respondents (i.e. total respondents) considered in this analysis.

To analyze responses to the three sub-questionnaires separately, a second filter (Filter #2), based on responses about the use of bioinformatics resources (see Appendix I, common questionnaire - Question #3), was applied in addition to Filter #1. Filter #2 yielded 738 respondents (97.9% of total respondents) who were frequent or occasional users, 10 respondents (1.3% of total respondents) who barely use bioinformatics resources and 6 respondents (0.8% of total respondents) who never use them.
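
The two filtering steps can be sketched as follows. Everything about the data layout here is an assumption: the report does not describe the export format, so the field names, answer labels and toy forms are invented purely for illustration.

```python
# All field names and answer labels below are hypothetical; the report does
# not describe how the survey export was structured.
SUSTAINABILITY = "sustainability_view"   # central question on EU infrastructures
USAGE = "usage_frequency"                # common questionnaire, Question #3


def apply_filter_1(forms):
    """Filter #1: keep forms that answered the central sustainability question."""
    return [form for form in forms if form.get(SUSTAINABILITY)]


def apply_filter_2(forms):
    """Filter #2: split forms by sub-questionnaire according to reported usage."""
    groups = {
        "sub_questionnaire_1": [],  # frequent and occasional users
        "sub_questionnaire_2": [],  # respondents who hardly ever use resources
        "sub_questionnaire_3": [],  # non-users
    }
    for form in forms:
        usage = form.get(USAGE)
        if usage in ("Frequently", "Occasionally"):
            groups["sub_questionnaire_1"].append(form)
        elif usage == "Hardly ever":
            groups["sub_questionnaire_2"].append(form)
        elif usage == "Never":
            groups["sub_questionnaire_3"].append(form)
    return groups


# Toy input, invented purely to exercise the two filters.
toy_forms = [
    {USAGE: "Frequently", SUSTAINABILITY: "essential"},
    {USAGE: "Occasionally", SUSTAINABILITY: "important"},
    {USAGE: "Frequently", SUSTAINABILITY: None},        # dropped by Filter #1
    {USAGE: "Hardly ever", SUSTAINABILITY: "important"},
    {USAGE: "Never", SUSTAINABILITY: "not relevant"},
]

retained = apply_filter_1(toy_forms)
groups = apply_filter_2(retained)
print(len(retained), {name: len(members) for name, members in groups.items()})
```

Applying Filter #2 to the output of Filter #1, rather than to the raw forms, mirrors the order described in the text, so the three group sizes always add up to the filtered total (738 + 10 + 6 = 754 in the report).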

b. Cross tabulation of results

For deeper analysis, responses about users’ practices, challenges and priorities were cross tabulated with several criteria such as:

o User category (e.g. frequent or occasional user)
o Country location
o Activity sector
o Research domain
o Bioinformatics environment (i.e. support, scientific interactions, education and training)
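
As an illustration of what cross tabulation means here, the sketch below counts responses for each pair of values of two fields. It is not the tool used for the report (the analysis software is not named in this section); the records and field names are invented.

```python
from collections import Counter


def cross_tabulate(records, row_key, column_key):
    """Count records for each (row value, column value) pair."""
    return Counter((record[row_key], record[column_key]) for record in records)


# Invented records; the field names are assumptions, not the survey's variables.
toy_records = [
    {"user_category": "frequent", "challenge": "format compatibility"},
    {"user_category": "frequent", "challenge": "website usability"},
    {"user_category": "frequent", "challenge": "format compatibility"},
    {"user_category": "occasional", "challenge": "website usability"},
]

table = cross_tabulate(toy_records, "user_category", "challenge")
for (category, challenge), count in sorted(table.items()):
    print(f"{category:<12}{challenge:<24}{count}")
```

A `Counter` returns 0 for pairs that never occur, which keeps empty cells from raising errors when the table is read back.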

c. Calculation of response frequency

Response frequency was either based on the number of answering respondents (i.e. total number of responses collected for a given question) or on the total number of respondents (i.e. 754), as indicated.
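
The choice of denominator matters whenever some respondents skip a question. The following sketch, with a hypothetical function name and invented answers, computes both frequencies side by side.

```python
def response_frequencies(answers, choice):
    """Return (% of answering respondents, % of total respondents) for `choice`.

    `answers` holds one entry per respondent; None marks a skipped question.
    """
    total = len(answers)
    answering = sum(1 for answer in answers if answer is not None)
    if answering == 0:
        raise ValueError("no respondent answered this question")
    selected = sum(1 for answer in answers if answer == choice)
    return (round(100 * selected / answering, 1), round(100 * selected / total, 1))


# Invented answers: 10 respondents, 2 of whom skipped the question.
toy_answers = ["yes"] * 6 + ["no"] * 2 + [None] * 2

of_answering, of_total = response_frequencies(toy_answers, "yes")
print(of_answering, of_total)
```

In this toy case the same six "yes" answers amount to 75.0% of answering respondents but 60.0% of total respondents, which is why each figure in the report states the base it uses.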

d. Calculation of rating average

For rating scale questions, the data are presented as a “rating average”. Each choice was assigned a value (e.g. “Essential”=2, “Important”=1, “Not relevant”=0) and the rating average was calculated as follows: [(value of rating scale choice A * number of collected responses for choice A) + (value of rating scale choice B * number of collected responses for choice B) + etc.] / (total number of collected responses for all the rating scale choices).
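
The rating average is a weighted mean of the choice values. A minimal sketch, using the example values from the text and invented response counts (the function and variable names are assumptions):

```python
# Choice values as given in the text's example.
RATING_VALUES = {"Essential": 2, "Important": 1, "Not relevant": 0}


def rating_average(counts, values=RATING_VALUES):
    """Weighted mean of choice values, weighted by the number of responses."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses collected for this question")
    return sum(values[choice] * n for choice, n in counts.items()) / total


# Invented counts for illustration: (2*6 + 1*3 + 0*1) / 10
example_counts = {"Essential": 6, "Important": 3, "Not relevant": 1}
average = rating_average(example_counts)
print(average)
```

On this 0 to 2 scale, an average close to 2 means most respondents chose “Essential”, and an average close to 0 means most chose “Not relevant”.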

B. Respondent groups’ profiles

Users of bioinformatics resources form a very complex community that encompasses a large (and increasing) number of individual groups from extremely diverse research domains. Adding to the complexity of this community mosaic, bioinformatics research and education infrastructures are not equally developed and implemented in all EU countries. Furthermore, within the same country, access to bioinformatics research and education infrastructures can also differ according to the local affiliation of each individual research group.

Hence, it was imperative to collect information about respondent groups’ profiles in order to distinguish general usage patterns and requirements from those specific to individual groups due to their research activities and/or environment.

1. Who completed the questionnaire? (See Appendix I common questionnaire - Question #1)

The user survey aimed to capture information about research groups and not about individual scientists. Therefore, members with a general understanding of the group’s bioinformatics use and needs, such as principal investigators (or group leaders) or bioinformatics lead users, were encouraged to complete the survey on behalf of their group.

As shown in Figure 1 and Table 2, the survey questionnaire was mainly completed by principal investigators (or group leader or grant holder) and bioinformatics lead users (51.3% and 33.0%, respectively), as recommended. Other respondents were principally students and post-doc members of the groups.

Figure 1 – Representatives of respondent groups (pie chart): a grant holder, principal investigator or group leader, 51.3%; a lead user of bioinformatics resources, 33.0%; other, 15.6%. Response frequency = % of total respondents (i.e. 754).

Table 2 - Other representatives of respondent groups

Representatives of respondent groups    Response count
Student (PhD & Master)                  31
Postdoc                                 10
Consultant                              3
Manager                                 3
Coordinator                             2
Database curator                        2
Developer                               2
Programmer                              2
Bioinformatic helpdesk                  1
Clinical Research Associate             1
Dataminer                               1
Grid expert                             1
Institute Executive director            1
Lecturer                                1
Modeler                                 1
Quality Engineer                        1
Scientific secretary                    1

2. Interest in survey results (See Appendix I, common questionnaire - Question #2)

To further associate the assessed community to the ELIXIR effort, respondents were invited to leave an email address in order to be informed of the survey results.

The great interest in the survey conclusions was demonstrated by the large number of respondents (557 of 754, i.e. 73.9%) who provided an email address for the distribution of the survey results.

3. Use of bioinformatics resources (See Appendix I, common questionnaire - Question #3)

The use of bioinformatics resources for research activities is highly variable among the user community. The frequency of use, which can depend on several factors such as research activities, bioinformatics expertise and support, had to be documented as part of the user profile. This information was of key importance in order to reconcile users’ perception of bioinformatics resources with their experience.

The large majority of respondents (85.0% of total respondents) were frequent users of bioinformatics resources (see Figure 2). Occasional users accounted for 12.9% of respondents, while users who hardly ever or never use bioinformatics resources were very poorly represented (1.3% and 0.8%, respectively).

Figure 2 - Use of bioinformatics resources

Frequently: 85.0%
Occasionally: 12.9%
Hardly ever: 1.3%
Never: 0.8%

Response frequency = % of total respondents (i.e. 754).


4. Country location of respondent groups (See Appendix I, Sub-questionnaire 1 - Question #29; Sub-questionnaire 2 - Question #27; Sub-questionnaire 3 - Question #7)

Country location of respondent groups may impact on their usage pattern and challenges with bioinformatics resources due to the uneven development and implementation of bioinformatics research and education infrastructures among countries.

As shown in Figure 3 and Table 3, respondent groups were distributed among 34 countries. However, there were important variations in the number of respondent groups representing these different countries. The most represented countries were The Netherlands (10.6% of answering respondents), France (10.2%), United Kingdom (9.5%), Sweden (9.3%) and Israel (7.4%). Besides these top five countries, which together represented 47% of the respondent groups, there were countries with fairly good representation such as Belgium, Germany, Italy and Finland. Other countries corresponded to fewer than 20 respondent groups.

Figure 3 – Country location of respondents

Axis: response frequency (0% to 15%) per country.
Countries (response count): Austria (6), Belgium (34), Cyprus (5), Czech Republic (11), Denmark (19), Estonia (8), Finland (23), France (58), Germany (30), Greece (16), Hungary (4), Italy (24), Latvia (1), Lithuania (7), Luxembourg (5), Malta (2), Poland (5), Portugal (10), Romania (8), Slovakia (10), Spain (3), Sweden (57), The Netherlands (60), UK (55), Iceland (2), Israel (43), Norway (17), Switzerland (36), Other (21).

Response frequency = % of answering respondents (i.e. 580). For each country, the corresponding response count is indicated.


Table 3 - Other country location

Country: Response count
Canada: 1
India: 3
Russia: 1
Singapore: 1
South Africa: 1
USA: 9

5. Organization affiliation of respondent groups (See Appendix I, Sub-questionnaire 1 - Question #30; Sub-questionnaire 2 - Question #28; Sub-questionnaire 3 - Question #8)

Identification of research groups’ affiliation gives information on the local research infrastructures provided to the respondent groups.

72.9% of total respondents indicated their main affiliation. Altogether the indicated affiliations represented 318 organizations. Among them were 17 research organizations (EU and national), 153 universities, 44 research institutes, 21 research centres and 37 companies (pharmaceutical and biotechnology). The most cited organizations are listed in Table 4.

Table 4 – Affiliation of respondent groups

Organization names: Response count
INRA, France: 16
Swiss Institute of Bioinformatics, Switzerland: 14
Karolinska Institute, Sweden: 10
Uppsala University, Sweden: 9
University of Helsinki, Finland: 8
Tel Aviv University, Israel: 8
The Hebrew University of Jerusalem, Israel: 8
Wageningen University and Research Center (WUR), The Netherlands: 7
Lund University, Sweden: 7
GlaxoSmithKline Pharmaceuticals, UK: 7
University of Manchester, UK: 7
University of Tartu, Estonia: 6
CNRS, France: 6
Weizmann Institute of Science, Israel: 6
CMBI, Radboud University, The Netherlands: 6
Université Libre de Bruxelles, Belgium: 5
Biomedical Research Foundation of the Academy of Athens, Greece: 5
Ben Gurion University of the Negev, Israel: 5
Stockholm University, Sweden: 5
University of Copenhagen, Denmark: 4
University of Turku, Finland: 4
INSERM, France: 4
ETH Zurich, Switzerland: 4
Leiden University Medical Center, The Netherlands: 4
University of Amsterdam, The Netherlands: 4
University of Bergen, Norway: 4
Ghent University, Belgium: 3
Institut Curie, France: 3
EMBL, Germany: 3
Technion, Israel: 3
Institute of Biotechnology, Lithuania: 3
University of Luxembourg, Luxembourg: 3
Umeå University, Sweden: 3
Erasmus MC, The Netherlands: 3
King's College London, UK: 3
Wellcome Trust Sanger Institute, UK: 3

On average, respondent groups in each country were affiliated with eleven different organizations (maximum 34, minimum 1; see Figure 4).

Figure 4 – Number of organizations per country

Axis: number of organizations (0 to 40), shown for each country and for the "Other" category.

Number of answering respondents was 545.


6. Working sector of respondent groups (See Appendix I, Sub-questionnaire 1 - Question #31; Sub-questionnaire 2 - Question #29; Sub-questionnaire 3 - Question #9)

The user community includes research groups from both academia and industry. Information about respondent groups’ working sector was collected as an indicator of specific needs and priorities pertaining to one or the other sector of activities.

Unfortunately, the private sector was underrepresented in this study: the vast majority of respondents (89.4% of answering respondents) were from the academic/non-profit sector (see Figure 5).

Figure 5 – Working sector of respondents

Academic/Non-profit: 89.4%
Industry/Commercial/SME: 10.6%

Response frequency = % of answering respondents (i.e. 577).

7. Research domains of respondent groups (See Appendix I, Sub-questionnaire 1 - Question #32; Sub-questionnaire 2 - Question #30; Sub-questionnaire 3 - Question #10)

The nature of research activities is a key determinant of bioinformatics resource requirements. Life science is vast and, within each field, the diversity of research activities is very difficult to comprehend. Nevertheless, in an attempt to draw general features in bioinformatics requirements, respondent groups were asked to classify their activities into main scientific research domains.

The most cited research domains were, as expected, bioinformatics (62.7% of answering respondents) and biology (58.4% of answering respondents). In addition, closely related domains such as medicine and computer sciences accounted for 27.4% and 15.4% of answering respondents, respectively (see Figure 6 and Table 5).


Figure 6 – Research domains of respondent groups

Axis: response frequency (0% to 100%) per research domain.
Research domains (response count): Agriculture (39), Environment (27), Biology (331), Chemistry (56), Maths (23), Medicine (154), Computer sciences (88), Bioinformatics (361), Nutrition (20), Other (85).

Multiple answer options were authorized with recommendation for a maximum of three choices. Response frequency = % of answering respondents (i.e. 580). For each research domain, the corresponding response count is indicated.

Table 5 - Other research domains

Research domain: Response count
Biotechnology: 3
Systems Biology: 3
Computational biology: 2
Pharmacology: 2
Pharmacy: 2
Toxicology: 2
Biomedical informatics: 1
Biostatistics: 1
Chemoinformatics: 1
Consumer goods company, Safety and Environment Assurance Centre: 1
Pharmacognosy: 1
Statistics: 1
Physics: 1
Wastewater treatment: 1

a. Research domains of respondent groups according to user categories

The distribution of respondent groups’ activities across research domains was examined for the different categories of users and non-users. Figure 7 shows that frequent users had research activities mostly in bioinformatics (67.0% of answering respondents) and biology (58.6% of answering respondents), whereas occasional users had research activities principally in biology (54.4% of answering respondents). The proportion of occasional users with activities in maths and chemistry (respectively, 8.8% and 19.3% of answering respondents) was higher than the proportion of frequent users (respectively, 3.5% and 8.8% of answering respondents). In the other research domains, the proportions of frequent and occasional users were quite similar.


Despite the very small number of respondent groups with poor or no use of bioinformatics resources, it can still be noted that their research activities were distributed in biology, bioinformatics, chemistry and environment.

Figure 7 – Research domains of respondent groups according to user categories

Axis: response frequency (0% to 100%) per research domain (Agriculture, Environment, Biology, Chemistry, Maths, Medicine, Computer sciences, Bioinformatics, Nutrition, Other).
User categories (answering respondents): Frequently (512), Occasionally (57), Hardly ever (8), Never (5).

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 582.

8. Bioinformatics support (See Appendix I, Sub-questionnaire 1 - Question #33-35; Sub-questionnaire 2 - Question #31-33)

a. Access to bioinformatics support

Respondent groups’ profiles would not be complete without a good understanding of their bioinformatics environment. An essential part of this environment is access to bioinformatics support such as bioinformatics expertise, resource development (i.e. engineering) and management (i.e. installation and maintenance of databases and software), as well as IT infrastructure (i.e. system network and server).

As shown in Figure 8, the large majority of the answering respondent groups had access to bioanalysis expertise, support for databases and applications, and private server resources (respectively, 75.6%, 77.6% and 70.3% of answering respondents). However, access to more advanced bioinformatics support such as engineering and computational grids was slightly more restricted (respectively, 58.1% and 51.3% of answering respondents). Some respondent groups indicated other types of bioinformatics support such as software research infrastructures, computer clusters and cloud computing.

Figure 8 – Access to bioinformatics support

Axis: response frequency (0% to 100%), with responses “Yes”, “No” and “I do not know”.
Support categories (answering respondents): Bioanalyses (546), Database and/or software support (550), Authenticated access to server resources (536), Engineering (532), Computational grids (522).

Multiple answer options were authorized. Response frequency = % of answering respondents (as indicated for each support category).

b. Access to bioinformatics support according to user categories

Access to bioinformatics support could be linked to user category. As shown in Figure 9, the proportion of occasional users with access to any type of bioinformatics support was generally lower than the proportion of frequent users, and this difference increased for the more advanced types of support (i.e. “Engineering” and “Computational grids”).

When considering respondent groups with poor use of bioinformatics resources, one-fourth had access to bioanalyses, database and/or software support and computational grids and only one-eighth benefited from authenticated access to server resources and engineering support, (data not shown).

Figure 9 – Bioinformatics support according to user categories (i.e. frequent and occasional users)

Axis: “Yes” response frequency (0% to 100%) for each type of support (Bioanalyses, Database and/or software support, Authenticated access to server resources, Engineering, Computational grids).
User categories (answering respondents): Frequently (499), Occasionally (56).

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 555.


c. Access to bioinformatics support according to countries

To detect discrepancies in bioinformatics support among countries, responses on bioinformatics support were analyzed according to respondent groups’ location (i.e. Top 15 countries). As expected, Table 6 shows variations in bioinformatics support among countries.

Access to bioanalyses was largely provided to respondent groups (at least 80% of answering respondents) in more than half of the countries (i.e. Belgium, Switzerland, Germany, The Netherlands, Israel, UK, Finland and Norway). In other countries, at least 50% of the answering respondents had access to this type of support.

Database and/or software support was also provided to at least 80% of answering respondents in a good number of countries (9 out of 15 countries) but the set of countries was slightly different (i.e. Switzerland, Portugal, UK, Belgium, Germany, Israel, The Netherlands, France, and Czech Republic). In other countries, at least 50% of the answering respondents benefited from this type of support.

More discrepancies between countries were observed when considering authenticated access to server resources, engineering support and access to computational grids. These types of support were available to at least 80% of the answering respondents in only two to three countries (always including Switzerland). In other countries the proportion of supported respondent groups varied from about 78% to 23% of answering respondents.

Table 6 – Access to bioinformatics support according to countries (Top 15 countries)

Countries are grouped by “Yes” response frequency band (≥80%; <80% and ≥60%; <60% and ≥50%; <50%).

Bioanalyses
≥80%: Belgium, Switzerland, Germany, The Netherlands, Israel, UK, Finland, Norway
<80% and ≥60%: France, Italy, Portugal, Denmark, Greece
<60% and ≥50%: Sweden, Czech Republic

Database and/or software support
≥80%: Switzerland, Portugal, UK, Belgium, Germany, Israel, France, The Netherlands, Czech Republic
<80% and ≥60%: Italy, Finland, Norway, Greece, Denmark
<60% and ≥50%: Sweden

Authenticated access to server resources
≥80%: Switzerland, UK, Portugal
<80% and ≥60%: Norway, France, Germany, Belgium, The Netherlands, Finland, Israel, Sweden
<60% and ≥50%: Italy, Denmark, Czech Republic
<50%: Greece

Engineering
≥80%: Switzerland, Portugal
<80% (higher band): Italy, Germany, The Netherlands, UK, Finland, France, Israel
<80% (lower band): Belgium, Greece, Czech Republic, Sweden, Norway, Denmark

Computational grids
≥80%: Finland, Switzerland
<80% and ≥60%: Portugal, UK, Israel, Greece, The Netherlands, Czech Republic
<60% and ≥50%: Italy, Norway
<50%: Germany, Sweden, Denmark, France, Belgium

“Yes” - Response frequency = % of answering respondents to country location. Total number of answering respondents considered for this cross analysis was 467.


d. Access to bioinformatics support according to research domains

Research domains are not evenly supported in bioinformatics. Analysis of responses among the different research domains revealed variations in access to bioinformatics support (see Figure 10). These variations were more pronounced for authenticated access to server resources, engineering and computational grids. The best supported research domains were agriculture, bioinformatics, computer sciences and maths.

Figure 10 – Access to bioinformatics support according to research domains

Panels: Bioanalyses; Database and/or software support; Authenticated access to server resources; Engineering; Computational grids.
Axis: response frequency (0% to 100%), with responses “Yes”, “No” and “I do not know”.
Research domains (answering respondents), identical in each panel: Agriculture (36), Environment (25), Biology (318), Chemistry (55), Maths (22), Medicine (147), Computer sciences (83), Bioinformatics (342), Nutrition (20), Other (80).

Response frequency = % of answering respondents (as indicated for each research domain) about research activities. Total number of answering respondents considered for this cross analysis was 555.


e. Sources of bioinformatics support

To assess how bioinformatics support is provided, respondent groups were asked whether they could count on internal and/or local bioinformatics support or needed to request assistance from external and/or remote sources of support.

As reported in Figure 11, bioinformatics support was principally provided within respondent groups (63.8% of answering respondents) or their local organization (53.9% of answering respondents).

Figure 11 – Sources of bioinformatics support

Axis: response frequency (0% to 100%) per source of support.
Sources (response count): Within my own group (324), Within my organization (276), Outsourced with an external service platform (58), Other (42).

Multiple answer options were authorized. Response frequency = % of answering respondents (i.e. 511). For each category, the corresponding response count is indicated.

The other sources of bioinformatics support cited were external to the respondent groups. They are summarized in Table 7.

Table 7 - Other sources of bioinformatics support

Sources of bioinformatics support: Response count
National bioinformatics centre/platform: 5
European bioinformatics infrastructure network: 4
Collaboration with external group: 2
National supercomputer center: 1
Community networks: 1
European network of Excellence: 1
Commercial support: 1
Centre for Scientific Computing (MGRID applications): 1
Collaboration with external group: 1
Non profit organizations: 1
Tool developers: 1
Joint national academic effort: 1
Database helpdesk: 1
Distributed network of programmers and an e-Science approach: 1
Publicly available resources and tools: 1
Open source integration platforms (e.g. KNIME and CDK): 1


f. Sources of bioinformatics support according to user categories

Sources of bioinformatics support were examined among user categories (i.e. frequent and occasional users). Figure 12 indicates that frequent users were more often supported within their own groups whereas occasional users relied more on support provided by their organization.

Figure 12 – Sources of bioinformatics support according to user categories

Axis: response frequency (0% to 75%) per source of support (Within my own group, Within my organization, Outsourced with an external service platform, Other).
User categories (answering respondents): Frequently (463), Occasionally (45).

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 508.

g. Access limitation to bioinformatics support

There are several reasons why individual research groups can be limited in their access to bioinformatics support. In this respect, respondent groups predominantly cited a lack of infrastructure and of resources (e.g. funding) as limiting factors (see Table 8 for a summary of the collected responses).

Table 8 - Access limitation to bioinformatics support

Limiting factors: Response count
Lack of infrastructure: 13
Lack of resources (e.g. funding): 7
Lack of available expertise: 2
Lack of information: 1
Language issues in interdisciplinary interactions: 1
Available support not relevant or helpful: 1

9. Bioinformatics education and training (See Appendix I, Sub-questionnaire 1 - Question #36-37; Sub-questionnaire 2 - Question #34-35)

a. Access to bioinformatics education and training

Access to bioinformatics education and training is a key factor for individual research groups to acquire and/or expand their bioinformatics expertise. Figure 13 shows that the large majority of respondent groups (i.e. 92.3% of answering respondents) had access to training opportunities. However, half of them (i.e. 53.4% of answering respondents) indicated that their access to education and training was limited.

Figure 13 – Access to bioinformatics education and training

Axis: response frequency (0% to 100%) per answer.
Answers (response count): Yes, on a regular basis (241); Yes, but limited (276); No (48).

Response frequency = % of answering respondents (i.e. 565). For each category, the corresponding response count is indicated.

b. Access to bioinformatics education and training according to user categories

Access to bioinformatics education and training is one of the limiting factors for the use of bioinformatics resources. Figure 14 shows discrepancies between frequent and occasional users. Regular access to bioinformatics education and training was more common among frequent users (45% of answering respondents) than among occasional users (24.6% of answering respondents). In addition, the proportion of respondent groups with no access at all was more than twice as high among occasional users as among frequent users (respectively, 19.3% and 7.0% of answering respondents).

Figure 14 – Access to bioinformatics education and training according to user categories

Axis: response frequency (0% to 75%) per answer (Yes, on a regular basis; Yes, but limited; No).
User categories (answering respondents): Frequently (500), Occasionally (57).

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 557.
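
The comparison between frequent and occasional users can be made explicit as ratios of the reported percentages. The following is a minimal Python sketch that uses only the percentages quoted in the text for this question; the variable names are illustrative and are not part of the survey analysis itself.

```python
# Percentages quoted in the text (share of answering respondents
# within each user category).
no_access = {"frequent": 7.0, "occasional": 19.3}
regular_access = {"frequent": 45.0, "occasional": 24.6}

# How many times more common is "no access" among occasional users?
no_access_ratio = no_access["occasional"] / no_access["frequent"]

# How many times more common is regular access among frequent users?
regular_access_ratio = regular_access["frequent"] / regular_access["occasional"]

print(f"No access, occasional vs frequent: {no_access_ratio:.2f}x")
print(f"Regular access, frequent vs occasional: {regular_access_ratio:.2f}x")
```

Ratios of this kind compare proportions within each user category, so they are not affected by the much smaller number of occasional users among the respondents.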


c. Access to bioinformatics education and training according to countries

Bioinformatics education and training is unevenly developed and supported among countries. As shown in Figure 15, there were important disparities among the Top 15 countries with respect to bioinformatics knowledge support. About 50% to 80% of answering respondents had regular access to training and education in Belgium, Finland, Israel, Portugal, Switzerland, The Netherlands and UK, about 40% in France, Germany and Norway, and only about 17% to 30% in Czech Republic, Denmark, Greece, Italy and Sweden.

Figure 15 – Access to bioinformatics education and training according to countries (Top 15 countries)

Axis: response frequency (0% to 100%), with answers “Yes, on a regular basis”, “Yes, but limited” and “No”.
Countries (answering respondents): Portugal (10), Switzerland (36), Finland (23), The Netherlands (58), UK (53), Israel (40), Belgium (33), France (56), Norway (16), Germany (29), Czech Republic (10), Greece (15), Italy (23), Denmark (17), Sweden (51).

Response frequency = % of answering respondents (as indicated for each country) about country location. Total number of answering respondents considered for this cross analysis was 470.

d. Access to bioinformatics education and training according to research domains

To examine whether access to bioinformatics knowledge support differs between research domains, responses on bioinformatics education and training were analyzed according to the nature of research activities. Figure 16 indicates that about 50% to 60% of answering respondents with research activities in maths, bioinformatics, computer sciences and agriculture had regular access to education and training. However, in the other research domains the proportion of answering respondents with regular access to education and training did not exceed 40%.


Figure 16 – Access to bioinformatics education and training according to research domains

Axis: response frequency (0% to 100%), with answers “Yes, on a regular basis”, “Yes, but limited” and “No”.
Research domains (answering respondents): Maths (22), Bioinformatics (350), Computer sciences (88), Agriculture (38), Medicine (152), Biology (327), Environment (27), Nutrition (20), Other (84), Chemistry (56).

Response frequency = % of answering respondents (as indicated for each research domain) about research activities. Total number of answering respondents considered for this cross analysis was 563.

e. Access limitation to bioinformatics training and education

Access restrictions to education and training generally include a poor ratio between training supply and demand as well as financial and logistical issues.

As shown in Figure 17, limited access to training appeared to be mainly due to a lack of training opportunities and a lack of dedicated financial support (each cited by 53.5% of answering respondents).

Figure 17 – Access limitation to training and education

Axis: response frequency (0% to 100%) per limiting factor.
Limiting factors (response count): Lack of opportunities (151), Lack of financial support (151), Impractical location (55), Other reason (37).

Multiple answer options were authorized. Response frequency = % of answering respondents (i.e. 282). For each category, the corresponding response count is indicated.
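
The response frequencies reported for a multiple-answer question can be reproduced from the response counts. The minimal Python sketch below uses the counts given for Figure 17 and the 282 answering respondents; the helper name `response_frequency` is a hypothetical choice made for illustration and does not come from the survey tooling. Because respondents could select several options, the selections add up to more than the number of answering respondents, so the percentages do not sum to 100%.

```python
# Response counts given for Figure 17 (multiple answers were allowed).
counts = {
    "Lack of opportunities": 151,
    "Lack of financial support": 151,
    "Impractical location": 55,
    "Other reason": 37,
}
answering_respondents = 282  # denominator stated under Figure 17


def response_frequency(count, n_respondents):
    """Share of answering respondents, as a percentage rounded to one decimal."""
    return round(100.0 * count / n_respondents, 1)


frequencies = {label: response_frequency(count, answering_respondents)
               for label, count in counts.items()}

for label, pct in frequencies.items():
    print(f"{label}: {pct}%")

# Several options could be ticked by the same respondent, so the total
# number of selections exceeds the number of answering respondents.
total_selections = sum(counts.values())
print(f"{total_selections} selections from {answering_respondents} respondents")
```

The denominator is the number of respondents who answered the question, not the number of selections, which is why the two leading factors can each exceed 50%.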

The other limiting factors mentioned by the respondent groups were principally time constraints and the fact that bioinformatics knowledge support was not considered a priority (see Table 9).


Table 9 - Other limiting factors

Limiting factors: Response count
Time constraint: 8
Not considered as a priority: 5
Limited training information: 2
Not relevant: 2
Bad training coordination: 1
Lack of training infrastructure: 1
Training expertise not available locally: 1
Limited number of training opportunities with appropriate level (e.g. advanced level): 1
Limited number of training opportunities of good quality: 1

10. Respondent groups’ relationships with bioinformatics research groups (See Appendix I, Sub-questionnaire 1 - Question #38; Sub-questionnaire 2 - Question #36)

To complete the picture of respondent groups’ bioinformatics environment, information about scientific interactions with bioinformatics research groups was collected. Only about one third of the answering respondents (36.3%) indicated strong relationships with bioinformatics research groups (see Figure 18). Most commonly, respondent groups rated their level of interaction as medium (45.2% of answering respondents).

Figure 18 – Respondent groups’ relationships with bioinformatics research groups

Axis: response frequency (0% to 100%) per level of relationship.
Levels (response count): Strong (210), Medium (261), Poor (107).

Response frequency = % of answering respondents (i.e. 578). For each category, the corresponding response count is indicated.
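
For a single-answer question such as this one, the response counts should add up to the number of answering respondents, which gives a simple consistency check on the reported figures. The following minimal Python sketch applies that check to the counts given for Figure 18; the variable names are illustrative only.

```python
# Response counts given for Figure 18 (one answer per respondent).
counts = {"Strong": 210, "Medium": 261, "Poor": 107}

# With a single-choice question, the counts add up to the number of
# answering respondents stated under the figure.
answering_respondents = sum(counts.values())

percentages = {level: round(100.0 * n / answering_respondents, 1)
               for level, n in counts.items()}

for level, pct in percentages.items():
    print(f"{level}: {pct}% of {answering_respondents} answering respondents")
```

The computed shares for "Strong" and "Medium" agree with the 36.3% and 45.2% quoted in the text.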

a. Relationships with bioinformatics groups according to user categories

The level of scientific interaction with bioinformatics research groups was examined among user categories (i.e. frequent and occasional users). As shown in Figure 19, about 45% to 50% of answering respondents among both frequent and occasional users indicated a medium level of interaction with bioinformatics research groups. However, about 40% of frequent users indicated strong relationships, whereas the same proportion of occasional users indicated poor relationships.


Figure 19 – Relationships with bioinformatics groups according to user categories

Axis: response frequency (0% to 100%), with levels “Poor”, “Medium” and “Strong”.
User categories (answering respondents): Frequently (508), Occasionally (57).

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 565.

b. Relationships with bioinformatics groups according to research domain

To identify potential variations in scientific interactions according to research activities, responses were analyzed by research domain. As shown in Figure 20, strong relationships were indicated by about 47% of answering respondents with research activities in maths, computer sciences and bioinformatics, but by only 25% to 30% of answering respondents with research activities in biology, nutrition, environment and medicine, and by less than 25% of answering respondents with research activities in chemistry and agriculture.

Figure 20 – Relationships with bioinformatics groups according to research domain

Axis: response frequency (0% to 100%), with levels “Strong”, “Medium” and “Poor”.
Research domains (answering respondents): Maths (23), Computer sciences (88), Bioinformatics (355), Other (83), Biology (328), Nutrition (20), Environment (27), Medicine (153), Agriculture (39), Chemistry (56).

Response frequency = % of answering respondents (as indicated for each research domain) about research activities. Total number of answering respondents considered for this cross analysis was 562.

c. Bioinformatics support and relationships with bioinformatics research groups

Cross analysis of scientific interactions according to bioinformatics support showed that access to bioinformatics support seemed to be linked with stronger relationships with bioinformatics research groups (see Figure 21). The proportion of answering respondents having strong interactions with bioinformatics groups was at least 2-fold higher in respondent groups with support for authenticated access to server resources, or with access to engineering expertise and computational grids, and 1.5-fold higher for the other types of support (i.e. bioanalysis expertise and database and software support).

Figure 21 – Bioinformatics support and relationships with bioinformatics research groups

[Chart: five panels, one per type of bioinformatics support, each showing stacked bars of Strong, Medium and Poor relationships; horizontal axis: response frequency, 0% to 100%. Number of answering respondents per category:
Bioanalyses: Yes (406), No (97), I do not know (29)
Authenticated access to server resources: Yes (371), No (109), I do not know (42)
Database and/or software support: Yes (421), No (99), I do not know (16)
Engineering: Yes (304), No (170), I do not know (44)
Computational grids: Yes (263), No (166), I do not know (79)]

Response frequency = % of answering respondents (as indicated for each category) about bioinformatics support. Total number of answering respondents considered for this cross analysis was 549.

d. Relationships with bioinformatics research groups and bioinformatics education and training

The observation that well-supported respondent groups had a higher level of interaction with bioinformatics research groups also applied to bioinformatics knowledge support. As shown in Figure 22, strong interactions with bioinformatics groups were indicated by 56.4% of answering respondents with regular access to training, but by only 23.4% and 15.2% of answering respondents with limited and no access to knowledge support, respectively.

Figure 22 – Relationships with bioinformatics research groups and bioinformatics education and training

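[Chart: stacked bars of Strong, Medium and Poor relationships for each level of access to education and training: Yes, on a regular basis (236); Yes, but limited (269); No (46). Horizontal axis: response frequency, 0% to 100%.]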

Response frequency = % of answering respondents (as indicated for each category) about bioinformatics education and training. Total number of answering respondents considered for this cross analysis was 551.

C. Responses from frequent and occasional users

Analysis of responses about bioinformatics resources focused on the frequent and occasional user categories, as they represented 97.8% of total respondents (i.e. 738 out of 754 respondent groups).

1. Importance of long-term sustainability of European bioinformatics infrastructures (See Appendix I, Sub-questionnaire 1 - Question #4)

One of the fundamental issues with bioinformatics infrastructures in Europe is the establishment of funding processes to ensure long term support of resources.

When consulted on this critical point, 67.2% of the answering respondents considered that long-term sustainability of European bioinformatics infrastructures was essential for their research activities, 31.2% thought that it was important and 1.6% did not find it relevant to their activities (see Figure 23).

Figure 23 – Importance of long-term sustainability of European bioinformatics infrastructures

[Pie chart: Essential 67.2%, Important 31.2%, Not relevant 1.6%.]

Response frequency = % of answering respondents (i.e. 738).

168 answering respondents commented on their responses. These comments are summarized in Table 16, Table 17 and Table 18 (see Appendix II), and the comments added to the responses “essential” or “important” are presented in Figure 24 according to response count. Long-term sustainability of bioinformatics infrastructures was perceived principally as a support to research activities and bioinformatics resources, as well as a means to foster EU bioinformatics competitiveness worldwide.

Figure 24 – Summary of comments to responses “essential” or “important”

[Chart: horizontal bars of response count (axis 0 to 60) for each comment category: For the advance of science; As support to research activities; For knowledge dissemination; To sustain EU bioinformatics competitiveness world wide; For bioinformatics resources support; To foster bioinformatics resources development; To support funding; For industry support.]

Number of answering respondents was 168.

a. Importance of long-term sustainability of European bioinformatics infrastructures according to user categories

Responses about long-term support of bioinformatics infrastructures were further analysed according to user categories. Figure 25 shows that, as expected, sustainability of EU bioinformatics infrastructures was not perceived with the same level of importance by frequent and occasional users. The proportion of answering respondents considering long-term support of EU bioinformatics infrastructures as essential was higher in frequent users than in occasional users (71.1% and 41.2% of answering respondents, respectively).

Further analysis of responses to sustainability of bioinformatics infrastructures according to other criteria such as country location, working sector, research domain and bioinformatics environment, did not reveal noticeable differences (data not shown).

Figure 25 – Importance of long-term sustainability of European bioinformatics infrastructures according to user categories

[Chart: stacked bars of Not relevant, Important and Essential for each user category, Frequently (641) and Occasionally (97); vertical axis: response frequency, 0% to 100%.]

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 738.
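As a rough consistency check, the per-category figures above can be pooled and compared with the overall share of “essential” answers reported for Figure 23. The sketch below is illustrative only: the function name `pooled_share` is an assumption, and the inputs are the rounded percentages and group sizes quoted in this section, so the result is approximate.

```python
def pooled_share(groups):
    """Combine per-group percentages into one overall percentage.

    groups: list of (percent_in_group, group_size) tuples.
    """
    total = sum(size for _, size in groups)
    # Convert each percentage back to an (approximate) number of respondents.
    positives = sum(pct / 100 * size for pct, size in groups)
    return round(100 * positives / total, 1)


# "Essential" answers: 71.1% of 641 frequent users, 41.2% of 97 occasional users.
overall_essential = pooled_share([(71.1, 641), (41.2, 97)])
print(overall_essential)
```

With these inputs the pooled value comes out at about 67.2%, in line with the overall figure reported for the 738 answering respondents.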

2. Working with bioinformatics databases (See Appendix I, Sub-questionnaire 1 - Question #5-18)

Bioinformatics databases were used by 97.1% of frequent/occasional users for their research activities.

a. Purpose for using bioinformatics databases

Interest in bioinformatics databases is motivated by different research needs. As shown in Figure 26, bioinformatics databases were primarily used by respondent groups for biological information searching and data analysis (respectively 91.4% and 79.1% of answering respondents). Data manipulation (i.e. download of large data sets) was less common (66.8% of answering respondents).

Figure 26 – Purpose for using bioinformatics databases

[Chart: horizontal bars of response frequency (0% to 100%) for each purpose, with response counts: search for specific biological information (631); analyse our research data (546); download large sets of data for subsequent use in computational biology (461); Other (61).]

Multiple answer options were authorized. Response frequency = % of answering respondents (i.e. 690). For each category, the corresponding response count is indicated.
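The response frequencies quoted for Figure 26 follow from the response counts and the number of answering respondents. The short sketch below reproduces that arithmetic; the function name and the shortened option labels are illustrative choices, while the counts and the total of 690 are those reported above. Because multiple answers were allowed, the percentages are not expected to sum to 100%.

```python
def response_frequencies(counts, n_respondents):
    """Return each option's share of answering respondents, in percent."""
    return {option: round(100 * count / n_respondents, 1)
            for option, count in counts.items()}


# Response counts and number of answering respondents as reported for Figure 26.
figure_26_counts = {
    "search for specific biological information": 631,
    "analyse our research data": 546,
    "download large sets of data": 461,
    "other": 61,
}

frequencies = response_frequencies(figure_26_counts, 690)
for option, pct in frequencies.items():
    print(f"{option}: {pct}%")
```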

Other purposes for using bioinformatics databases are listed in Table 10. They were mainly related to bioinformatics resources development and integration.

Table 10 – Other purposes for using bioinformatics databases

Purpose                                    Response count
Bioinformatics resources development       17
Bioinformatics resources integration       10
Biological experiment and/or tool design   3
Laboratory management                      3
Data sharing                               2
Education and training                     2
Data management                            1
Generation of in silico data               1
Publication                                1
User support                               1

i. Purpose for using bioinformatics databases according to user categories

To examine whether frequent and occasional users work with bioinformatics databases for the same purposes, responses were analyzed by user category. Figure 27 reveals a small variation in response frequency for data analysis between the two user categories. However, for a more advanced use such as data manipulation, the response frequency was almost 3-fold higher in frequent users than in occasional users. Response frequencies for biological information searching were almost equal in the two user categories.

Figure 27 – Purpose for using bioinformatics databases according to user categories

[Chart: horizontal bars of response frequency (0% to 100%) for Frequently (616) and Occasionally (74), for each purpose: search for specific biological information; analyse our research data; download large sets of data for subsequent use in computational biology; Other.]

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 690.

ii. Purpose for using bioinformatics databases and bioinformatics support

The bioinformatics environment may affect the use of bioinformatics databases. As shown in Figure 28, response frequency for data manipulation decreased when respondent groups were not supported. However, this difference was less noticeable when considering access to bioanalyses support. In contrast, little or no variation was observed in response frequency for information searching and data analysis, whether respondent groups were supported or not.

Figure 28 – Purpose for using bioinformatics databases and bioinformatics support

[Chart: five panels, one per type of bioinformatics support, each showing horizontal bars of response frequency (0% to 100%) for the purposes: search for specific biological information; analyse our research data; download large sets of data for subsequent use in computational biology; Other (please specify). Number of answering respondents per category:
Bioanalyses: Yes (404), No (95), I do not know (27)
Database and/or software support: Yes (417), No (96), I do not know (17)
Authenticated access to server resources: Yes (376), No (110), I do not know (42)
Engineering: Yes (304), No (165), I do not know (42)
Computational grids: Yes (266), No (169), I do not know (79)]

Response frequency = % of answering respondents (as indicated for each category) about bioinformatics support. Total number of answering respondents considered for this cross analysis was 542.

b. Biological data of interest

The general interest of the whole life science community in molecular information is confirmed in Figure 29. Molecular sequence was the biological data type of greatest interest (76.2% of answering respondents). Genomics, protein functional annotation and gene expression are other data types that are shared among several research domains and are at the heart of today’s research interests. They appeared as a second group of biological data of high interest (respectively 60.9%, 57.8% and 57.1% of answering respondents). As expected, literature information was also of interest to more than half of the answering respondents (57%). More specialized types of data such as molecular structure, cell signalling, genetics, ontology and proteomics were of interest to less than half of the answering respondents (respectively 43.3%, 42.6%, 40.6%, 39.7% and 35.3%). Finally, a smaller fraction of the answering respondents (less than a quarter for each data type) considered that taxonomy, metabolic, metabolomics and chemical data were of interest for their research activities.

Figure 29 – Biological data of interest

[Chart: horizontal bars of response frequency (0% to 100%) for each type of data, with response counts: Molecular sequence (524), Genomics (419), Protein functional annotation (398), Gene expression (393), Literature (392), Molecular structure (298), Cell signaling (293), Genetic (279), Ontology (273), Proteomics (243), Taxonomy (159), Metabolic (129), Metabolomics (120), Chemical (108), Other (61).]

Multiple answer options were authorized. Response frequency = % of answering respondents (i.e. 688). For each type of data, the corresponding response count is indicated.

Other biological data cited by the answering respondents are listed in Table 11.

Table 11 - Other biological data of interest

Other                         Response count   Other                         Response count
Evolution data                7                Epigenetics                   1
Clinical data                 3                Glycomics data                1
Metagenomics                  3                High content microscopy data  1
Pathway data                  2                Immunogenetics                1
Phenotype data                2                Immunoinformatics             1
Anatomopathologic data        1                Meteorological                1
Antibody engineering          1                Molecular binding kinetics    1
Biological networks           1                Oceanographic                 1
Biospecimen annotation        1                Organism predictive model     1
Development data              1                Pharmacology                  1
Disease data                  1                Phylogeny                     1
Drug design                   1                Physiology                    1
Electron microscopy data      1                Protein disorder              1
Environmental data            1                Species specific              1
Epigenomics                   1                Toxicology                    1

i. Biological data of interest according to user categories

Cross analysis according to user categories revealed variations in response frequency between frequent and occasional users for the different biological data types, except for molecular sequence, cell signalling and chemical data (see Figure 30). In general, the response frequency for each data type was higher in frequent users than in occasional users, with the exception of molecular structure data, for which the opposite was observed.

Figure 30 – Biological data of interest according to user categories

[Chart: horizontal bars of response frequency (0% to 100%) for Frequently (613) and Occasionally (75), for each type of data: Molecular sequence, Genomics, Protein functional annotation, Gene expression, Literature, Cell signaling, Genetic, Ontology, Molecular structure, Proteomics, Taxonomy, Metabolic, Metabolomics, Chemical, Other.]

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 688.

ii. Biological data of interest according to research domains

Biological data of interest and research domain are clearly linked. However, it was worth analysing how interest in specific biological data is distributed among research domains. This distribution is presented in Figure 31.

Results demonstrated the truly transversal nature of some biological data, such as molecular sequence, literature and protein functional annotation, which were of interest across all research domains. In addition, some research domains showed specific data interests, such as chemistry (i.e. molecular structure and chemical data) and nutrition (i.e. metabolic and metabolomics data).

Figure 31 – Biological data of interest according to research domain

[Chart: five panels of grouped bars, one bar per research domain for each type of data; vertical axis: response frequency, 0% to 100%.
Panel 1: Molecular sequence, Protein functional annotation, Literature
Panel 2: Genetic, Genomics, Gene expression
Panel 3: Cell signaling, Proteomics, Ontology
Panel 4: Metabolic, Metabolomics, Taxonomy
Panel 5: Molecular structure, Chemical
Research domains (number of answering respondents): Agriculture (37), Environment (24), Biology (322), Chemistry (53), Maths (21), Medicine (148), Computer sciences (84), Bioinformatics (356), Nutrition (20), Other (84).]

Response frequency = % of answering respondents (as indicated for each research domain) about research activities. Total number of answering respondents considered for this cross analysis was 553.

c. Bioinformatics databases of interest

To identify users’ favourite bioinformatics databases, respondent groups were asked to rate selected resources. The Top 3 rated databases were the literature and general molecular sequence (nucleotide and protein) databases (see Figure 32). As expected, more specific databases (e.g. specific to molecular families or organisms) received a lower rating than general resources. However, the rating of the selected databases could also be affected by the history and visibility of the resources, as shown by the large discrepancy between the literature database PubMed and other similar resources.

Figure 32 – Importance of selected bioinformatics databases

The rating scale was as follows: “Essential”=3, “Important”=2, “Useful”=1. For each database, the corresponding response count is indicated. Total number of answering respondents was 619.

185 answering respondents added a comment to their database rating (see Appendix II, Table 19). Respondent groups’ comments included recommendations to database providers (e.g. issues with increasing volume of data, database scope, information standards, quality of database functionality and interface), indications of database use (e.g. tool development, computational analysis, biological tool design, research focus and teaching) and remarks about a lack of database knowledge.

[Figure 32 chart: horizontal bars of rating average (axis 1.0 to 3.0) for each database. Databases (response count), in the order extracted: PubMed (594), EMBL/GenBank/Entrez Nucleotide (557), UniProt/Swiss-Prot/TrEMBL/Entrez Protein (535), Entrez Gene (463), Ensembl (466), MSD/PDB (371), Pfam (444), KEGG (428), GOA (428), GEO (320), OMIM (383), InterPro (414), ArrayExpress (336), dbSNP (346), Reactome (311), NEWT (311), IntAct (264), GO (264), PDBsum (282), PubChem (267), DSSP (265), IPI (242), BioModels (240), ChEBI (240), BRENDA (275), EBIMed (244), SWISS-2DPAGE (241), PRIDE (212), Genome Reviews (278), Integr8 (218), IntEnz (219), CluSTr (238), CiteXplore (214), FlyBase (259), GPCRDB (213), RESID (204), ASTD (215), CSA (198), IMGT/LIGM (197), IMGT/HLA (196).]
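The rating average used in Figure 32 can be read as a weighted mean on the scale “Essential”=3, “Important”=2, “Useful”=1. The sketch below shows one plausible way to compute it; it assumes the average is taken over the respondents who rated a given database, which the report does not state explicitly. The counts in the example are hypothetical and are not survey data.

```python
# Rating scale as stated in the note to Figure 32.
SCALE = {"Essential": 3, "Important": 2, "Useful": 1}


def rating_average(counts):
    """Weighted mean rating; counts maps a rating label to a response count."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no ratings given")
    return sum(SCALE[label] * count for label, count in counts.items()) / n


# Hypothetical split of ratings for one database (not survey data).
example = {"Essential": 300, "Important": 200, "Useful": 100}
print(round(rating_average(example), 2))
```

Under this reading, a database rated mostly “Essential” approaches 3.0, while one rated mostly “Useful” stays close to 1.0.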

To inquire further about users’ favourite databases, respondent groups were invited to freely name other resources of interest. “Essential”, “Important” and “Useful” data resources were indicated by 231, 105 and 36 respondent groups, respectively. A summary of the most cited resources is presented in Table 12. The Top 3 resources were, in decreasing order, UCSC Genome Browser, HapMap and Structural Classification of Proteins (SCOP).

Table 12 - Other databases of interest

Essential                                                                              Response count
UCSC Genome Browser - http://genome.ucsc.edu/                                          27
HapMap - http://www.hapmap.org                                                         9
Structural Classification of Proteins (SCOP) - http://scop.mrc-lmb.cam.ac.uk/scop/     8
NCBI RefSeq - http://www.ncbi.nlm.nih.gov/RefSeq/                                      7
Saccharomyces Genome Database (SGD) - http://www.yeastgenome.org/                      7
TAIR - http://www.arabidopsis.org/                                                     7
GeneCards - http://www.genecards.org/                                                  6
CAZy-Carbohydrate-Active enzymes - http://www.cazy.org/                                5
Rfam - http://www.sanger.ac.uk/Software/Rfam/                                          5
TIGR - http://www.tigr.org/tdb/e2k1/ath1/                                              5
Wormbase - http://www.wormbase.org/                                                    5
microRNADB - http://bioinfo.au.tsinghua.edu.cn/micrornadb/                             4
miRBase - http://microrna.sanger.ac.uk/sequences/                                      4
Unigene - http://www.ncbi.nlm.nih.gov/unigene                                          4

Note: 193 resources (mostly including specialized databases) were cited as “Essential” by only one respondent, showing the vast diversity in databases of interest.

Important                                                                              Response count
Clusters of Orthologous Groups of proteins (COGs) - http://www.ncbi.nlm.nih.gov/COG/   4
SCOP (Structural Classification of Proteins) - http://scop.mrc-lmb.cam.ac.uk/scop/     4
TRANSFAC - http://www.gene-regulation.com/                                             4

Note: 105 resources (mostly including specialized databases) were cited as “Important” by only one respondent.

Useful                                                                                 Response count
GeneCards - http://www.genecards.org/                                                  4

Note: 30 resources (mostly including specialized databases) were cited as “Useful” by only one respondent.

d. Citation of bioinformatics databases in publication

To coordinate the efforts of the ELIXIR user and database provider surveys, practices regarding literature citation of databases were documented. As shown in Figure 33, 59.2% of answering respondents indicated that the databases used for their research activities were systematically cited in the corresponding publications. Database literature citation was irregular for 36.3% of answering respondents. A very small number of answering respondents (4.4%) acknowledged that database references were never included in their publications.

Figure 33 – Database literature citation

[Chart: bars of response frequency (vertical axis, 0% to 100%) for each answer: Yes (362), Sometimes (222), No (27).]

Response frequency = % of answering respondents (i.e. 611). For each category, the corresponding response count is indicated.

i. Bioinformatics databases citation according to user categories

Frequency of database literature citation varied among user categories. Systematic citation was less than half as common in occasional users as in frequent users (27.0% and 63.0% of answering respondents, respectively; see Figure 34).

Figure 34 – Database literature citation according to user categories

[Chart: grouped bars of response frequency (vertical axis, 0% to 100%) for the answers Yes, Sometimes and No, for each user category: Frequently (548) and Occasionally (63).]

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 611.

ii. Bioinformatics databases citation according to research domains

Analysis of responses about database literature citation according to research domains showed that practices varied with respondent groups’ research activities. Nevertheless, systematic database literature citation was indicated by at least half of the answering respondents in all research domains (see Figure 35). The highest response frequency for this practice was observed for respondent groups with research activities in maths (81.0% of answering respondents). In all research domains, the proportion of answering respondents who never included database references in their publications was below 10%.

Figure 35 – Database literature citation according to research domain

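[Chart: grouped horizontal bars of response frequency (0% to 100%) for the answers Yes, Sometimes and No, for each research domain (number of answering respondents): Maths (21), Environment (23), Computer sciences (84), Bioinformatics (353), Medicine (147), Biology (320), Nutrition (20), Chemistry (54), Other (83), Agriculture (36).]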

Response frequency = % of answering respondents (as indicated for each research domain) about research activities. Total number of answering respondents considered for this cross analysis was 547.

iii. Database literature citation and relationships with bioinformatics research groups

When considering respondent groups’ interactions with bioinformatics research groups, it appeared that database literature citation varied with the strength of relationships (see Figure 36). Systematic literature citation of databases was indicated by 75.5% of answering respondents with strong relationships, but by only 53.8% and 48.9% of answering respondents with medium and poor relationships, respectively. These observations suggest that tighter interactions with the bioinformatics community may foster literature citation of bioinformatics resources.

Figure 36 – Database literature citation and relationships with bioinformatics research groups

[Chart: grouped bars of response frequency (vertical axis, 0% to 100%) for the answers Yes, Sometimes and No, for each strength of relationships: Strong (204), Medium (247), Poor (92).]

Response frequency = % of answering respondents (as indicated for each category) about relationships with bioinformatics research groups. Total number of answering respondents considered for this cross analysis was 543.

e. Interaction with database providers

Communication between database providers and users is required for user feedback and support, as well as for identifying future user needs. Figure 37 shows that 54.6% of the answering respondents had interacted with database providers.

Figure 37 – Interaction with database provider

[Chart: bars of response frequency (vertical axis, 0% to 100%) for each answer: Yes (326), No (271).]

Response frequency = % of answering respondents (i.e. 597). For each category, the corresponding response count is indicated.

63 answering respondents added a comment to their response. These comments are summarized in Table 13. They included the names of databases or groups with which respondent groups had been in contact, and the motives and types (mode, level and frequency) of the interaction.

Table 13 – Comments to interaction with database providers

Name of databases or groups

ADDA http://ekhidna.biocenter.helsinki.fi:9801/sqgraph/pairsdb.

NCBI/Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/

BioModels http://www.ebi.ac.uk/biomodels-main/

NTI http://www.nti.org/db/nistraff/

BRENDA http://www.brenda-enzymes.org/

Pathway Commons http://www.pathwaycommons.org/pc/

CAMERA http://camera.calit2.net/

PDBe http://www.ebi.ac.uk/pdbe/

CBU http://www.cbu.uib.no/

Pfam http://pfam.sanger.ac.uk/

CLC http://www.clcbio.com/

RCSB-PDB http://www.rcsb.org/pdb/home/home.do

Clustr http://www.ebi.ac.uk/clustr/

Reactome http://www.reactome.org/

EBI/ArrayExpress http://www.ebi.ac.uk/microarray-as/ae/

Rfam http://www.sanger.ac.uk/Software/Rfam/

Ensembl http://www.ensembl.org/index.html

SABIO-RK http://sabio.villa-bosch.de/

EntrezGene http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene

Sanger http://www.sanger.fr/

GeneDB http://www.genedb.org/

EBI http://www.ebi.ac.uk/

GOLD http://www.genomesonline.org/

SEED http://www.theseed.org/wiki/Main_Page

HPRD http://www.hprd.org/

SGD http://www.yeastgenome.org/

HUGO http://www.genenames.org/

SubtiList http://genolist.pasteur.fr/SubtiList/

IGBMC http://www.igbmc.fr/

SwissProt/UniProt http://www.ebi.ac.uk/uniprot/

IntAct http://www.ebi.ac.uk/intact/site/index.jsf

Sym'Previus http://www.symprevius.net/index.php?rub=mentions_legales_2

JAX http://jaxmice.jax.org/strain/002329.html

TCDB http://www.tcdb.org/

MaGe https://www.genoscope.cns.fr/agc/mage/wwwpkgdb/Login/log.php?pid=7

UCSC http://genome.ucsc.edu/

NCBI/Genbank http://www.ncbi.nlm.nih.gov/Genbank/

Motives: Report of errors; Report of bugs; Data submission; Suggestion of new features; Information request about API interfaces; General feedback; Data download issue; Data comparison; Information request about the database next release; Report of parsing issues

Modes: Via a scientific computing group; In the frame of a European project

Level: Feedback communication; Contribution to databases annotation content; Collaboration with database developer groups

Frequency: Regular/constant interactions; Limited; Infrequent

Selected comments

“With ENSEMBL and HapMap my experience is quite positive concerning questions to datasets. Normally this is unfortunately not the case. Normally, you wait very long to get an answer if you get one at all.”

“SWISS-PROT has always been very helpful.”

“It is not clear that database providers wish to receive any feedback; there are usually no feedback forms.”

“I have always found the support staff at the EBI to be very helpful.”

i. Interaction with database providers according to user categories

Analysis of responses by user category showed a large discrepancy between frequent and occasional users. As presented in Figure 38, interaction with database providers was more than four times less common in occasional users (13.1% of answering respondents) than in frequent users (59.3% of answering respondents).

Figure 38 – Interaction with database providers according to user categories

[Chart: grouped bars of response frequency (vertical axis, 0% to 100%) for the answers Yes and No, for each user category: Frequently (536) and Occasionally (61).]

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered for this cross analysis was 597.
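Comparisons between user categories such as the one above amount to a ratio of two response frequencies. The sketch below computes that ratio for the percentages reported for Figure 38 and, for comparison, for the systematic citation figures reported for Figure 34; the function name `frequency_ratio` is an illustrative choice.

```python
def frequency_ratio(pct_a, pct_b):
    """Return how many times larger pct_a is than pct_b, to one decimal."""
    if pct_b == 0:
        raise ValueError("reference percentage must be non-zero")
    return round(pct_a / pct_b, 1)


# Interaction with database providers: frequent vs occasional users (Figure 38).
interaction_ratio = frequency_ratio(59.3, 13.1)
# Systematic database citation: frequent vs occasional users (Figure 34).
citation_ratio = frequency_ratio(63.0, 27.0)
print(interaction_ratio, citation_ratio)
```

With the reported percentages, the ratios are about 4.5 for interaction with database providers and about 2.3 for systematic citation.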

ii. Interaction with database providers and relationships with bioinformatics research groups

Interaction with database providers was cross analyzed with strength of relationships with bioinformatics groups. As shown in Figure 39, positive response frequency increased with the strength of relationships.

Figure 39 – Interaction with database providers and relationships with bioinformatics research groups

[Bar chart. Y-axis: response frequency (0% to 100%). Answers: Yes, No. Series: Strong (198), Medium (240), Poor (90).]

Response frequency = % of answering respondents (as indicated for each category) about relationships with bioinformatics research groups. Total number of answering respondents considered for this cross analysis was 528.

f. Users’ perception of existing database resources
Respondent groups were asked for a general statement of their satisfaction with the bioinformatics databases currently available for their research. Figure 40 shows that a large majority of the answering respondents (82.4%) were satisfied with some existing bioinformatics databases but not with others.

Figure 40 – Satisfaction with existing databases

[Bar chart. Y-axis: response frequency (0% to 100%). Categories: Yes, totally (103); Some yes, some no (523); Not at all (9).]

Response frequency = % of answering respondents (i.e. 635). For each category, the corresponding response count is indicated.

3. Challenges with bioinformatics databases (See Appendix I, Sub-questionnaire 1 - Question #19)

Common challenges encountered when working with bioinformatics databases were proposed to respondent groups for rating according to frequency. As presented in Figure 41, the Top 3 most frequent challenges were, in decreasing order, transparent queries across databases, compatibility of file format (i.e. for subsequent use of data obtained from databases) and database website usability. On the other hand, challenges about database online access and local installation were rated with the lowest frequency by answering respondents.

The Top 5 examples of databases cited by the answering respondents to illustrate the different challenges are presented in Figure 42. They included the databases Ensembl, UniProt/Swiss-Prot/TrEMBL, ArrayExpress, InterPro and Pfam. The numerous citations of these well-used data resources were likely due in part to their respective popularity.

Other issues encountered by the answering respondents when working with bioinformatics databases are summarized in Table 20 (see Appendix II). They included issues with databases (i.e. database content, access, functionalities, integration, support continuity, information and knowledge – total response count = 45), data (i.e. data access and visualization – total response count = 6) and user support (response count = 2).

Figure 41 – Challenges with bioinformatics databases

[Bar chart. Legend: Never, Barely, Sometimes, Frequently.]

For each challenge, the corresponding response count is indicated. The rating scale was as follows: frequently = 3; sometimes = 2; barely = 1; never = 0 (the “I do not know” response was not taken into account when calculating the rating average). Total number of answering respondents was 425.
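The rating average described in the note to Figure 41 can be illustrated with a short sketch. It is written in Python and assumes that the average is a count-weighted mean of the scores over the four scored answers, which the report does not state explicitly. The function name and the example counts are hypothetical and are not survey data.

```python
# Scores from the rating scale given for Figure 41.
# "I do not know" carries no score and is left out of the average.
SCALE = {"frequently": 3, "sometimes": 2, "barely": 1, "never": 0}


def rating_average(counts):
    """Return the mean score over scored answers, or None if there are none.

    `counts` maps an answer label to the number of respondents who chose it.
    Labels that are not in SCALE (e.g. "I do not know") are ignored.
    """
    scored = {label: n for label, n in counts.items() if label in SCALE}
    total = sum(scored.values())
    if total == 0:
        return None
    return sum(SCALE[label] * n for label, n in scored.items()) / total


# Hypothetical counts for one challenge (illustration only, not survey data).
example = {
    "frequently": 10,
    "sometimes": 20,
    "barely": 5,
    "never": 5,
    "I do not know": 7,
}
print(rating_average(example))
```

Under this assumption, the response count shown next to each challenge would correspond to the number of scored answers, which is why it can be lower than the total number of answering respondents.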

Figure 42 – Challenges with bioinformatics databases: database examples (Top 5)

[Bar chart. X-axis: response count (0 to 35). Challenges: Database webpage usability; Compatibility of file format for data submission to databases; Compatibility of file format for subsequent use of data downloaded from databases; Submission or retrieval of large volumes of data; Transparent queries across databases; Working with databases via online access; Continuity in database support; Local installation of databases. Series: Ensembl, Uniprot/Swiss-Prot/TrEMBL, ArrayExpress, InterPro, Pfam.]

Cross analysis of responses to challenges with bioinformatics databases according to criteria such as user categories, research domain and bioinformatics environment did not return noticeable differences (data not shown).

[Figure 41 bar chart. X-axis: rating average (0.0 to 3.0). Challenges, with response counts: Transparent queries across databases (292); Compatibility of file format for subsequent use of data downloaded from databases (368); Database webpage usability (402); Submission or retrieval of large volumes of data (344); Compatibility of file format for data submission to databases (321); Continuity in database support (339); Working with databases via online access (357); Local installation of databases (269).]

4. Improvement of existing bioinformatics databases (See Appendix I, Sub-questionnaire 1 - Question #20)

Respondent groups were consulted about improvements needed for current bioinformatics databases. Database interoperability/integration, database functionalities and data quality were the main concerns (respectively, 87.2%, 85.6% and 78.3% of answering respondents). In addition, improvement needs for public web portals and procedures for local installation of databases were indicated by at least half of the answering respondents (see Figure 43).

Figure 43 – Improvement of existing databases

[Bar chart. X-axis: response frequency (0% to 100%). Categories: Data quality (397); Database interoperability/integration (406); Database functionality (402); Public web portals (372); Procedures for local installation of databases (356). Series: Yes, No, I do not know.]

Multiple answer options were authorized. Response frequency = % of answering respondents (i.e. 430). For each category, the corresponding response count is indicated.

The Top 6 examples of databases cited by the answering respondents to illustrate the different improvement needs are presented in Figure 44. They included the databases Ensembl, Uniprot/Swiss-Prot/TrEMBL, ArrayExpress, Reactome, GO and MSD/PDB.

Other needs for improvement are summarized in Table 21 (see Appendix II). They included improvement needs about databases (i.e. database content, access, graphical user interface, support continuity, standard, information and knowledge – total response count = 27), data (i.e. data access – total response count = 1) and user support (total response count = 1).

Cross analysis of responses about improvement of existing bioinformatics databases according to criteria such as user categories, research domain and bioinformatics environment did not return noticeable differences (data not shown).

Figure 44 – Improvement with bioinformatics databases: database examples (Top 6)

[Bar chart. X-axis: response count (0 to 14). Databases: Ensembl; Uniprot/Swiss-Prot/TrEMBL; ArrayExpress; Reactome; GO; MSD/PDB. Series: Data quality; Database interoperability/integration; Database functionality (e.g. for data query, visualization, download and upload); Public web portals; Procedures for local installation of databases.]

5. New development for bioinformatics databases (See Appendix I, Sub-questionnaire 1 - Question #21)

Information about needs for new database developments was of critical importance, as improvement of existing resources will likely not be sufficient to meet all of users’ current and/or future needs. To this end, respondent groups’ views on several development suggestions were collected. Figure 45 shows that all development suggestions were considered a development need by at least half of the answering respondents. The Top 3 development needs were, in decreasing order, expansion of the scope of existing databases, development of application programming interface (API)-web services and public web portals (respectively, 82.4%, 66.3% and 62.4% of answering respondents).

Other needs for development are summarized in Table 22 (see Appendix II). They included development needs about databases (i.e. database content, functionalities/tools, access, graphical user interface, support continuity, standard, integration, information and knowledge as well as new databases – total response count = 44) and data (i.e. data quality and access – total response count = 7).

Cross analysis of responses to bioinformatics databases new development according to criteria such as user categories, research domain and bioinformatics environment did not return noticeable differences (data not shown).

Figure 45 – New development for bioinformatics databases

[Bar chart. X-axis: response frequency (0% to 100%). Categories: Expanding the scope of existing database (408); New databases (378); Public web portals (375); Procedures for local installation of databases (367); API-web services (368); Other (142). Series: Yes, No, I do not know.]

Multiple answer options were authorized. Response frequency = % of answering respondents (i.e. 430). For each category, the corresponding response count is indicated.

6. Working with bioinformatics tools (See Appendix I, Sub-questionnaire 1 - Question #22-27)

When assessing users’ requirements for the exploitation of data resources, users’ needs and priorities toward bioinformatics tools should be also documented.

Bioinformatics tools were used by 96.2% of frequent/occasional users (i.e. answering respondents) for their research activities.

a. Number of bioinformatics tools used for research activities
Numbers of bioinformatics tools used by respondent groups varied from 2 (or fewer) to more than 25. Just over a quarter (27.0%) of answering respondents used between 6 and 10 tools. Similar proportions of answering respondents (about one fifth each) reported using 3 to 5 tools (22.8%), 11 to 25 tools (19.4%) or more than 25 tools (22.6%) (see Figure 46).

i. Number of bioinformatics tools used according to user categories The number of tools used was analyzed according to user categories. As expected, Figure 47 clearly indicates that, in general, frequent users worked with a higher number of tools than occasional users. Answering respondents that used between 2 (or less) and 5 tools corresponded to 66% of occasional users and to only 22.4% of frequent users. In contrast, answering respondents that used from 11 to more than 25 tools corresponded to 50.3% of frequent users but to only 10.6% of occasional users.

Figure 46 – Number of bioinformatics tools used for research activities

[Bar chart. X-axis: response frequency (0% to 30%). Categories: 2 or fewer (18); Between 3 and 5 (126); Between 6 and 10 (149); Between 11 and 25 (107); More than 25 (125); I do not know (27).]

Response frequency = % of answering respondents (i.e. 552). For each category, the corresponding response count is indicated.
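As a check on the arithmetic, the response frequencies quoted for Figure 46 can be recomputed from the response counts. The sketch below is in Python; the counts are those indicated for Figure 46, while the function name and the rounding to one decimal place are assumptions made for illustration.

```python
# Response counts indicated for each category of Figure 46.
counts = {
    "2 or fewer": 18,
    "Between 3 and 5": 126,
    "Between 6 and 10": 149,
    "Between 11 and 25": 107,
    "More than 25": 125,
    "I do not know": 27,
}


def response_frequencies(counts):
    """Return each count as a percentage of all answering respondents."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


freqs = response_frequencies(counts)
for label, pct in freqs.items():
    print(f"{label}: {pct}%")
```

The counts sum to the 552 answering respondents stated in the figure note, and the recomputed values agree with the percentages quoted in the text (27.0%, 22.8%, 19.4% and 22.6%).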

Figure 47 – Number of bioinformatics tools used for research activities according to user categories

[Bar chart. X-axis: response frequency (0% to 75%). Categories: 2 or fewer; Between 3 and 5; Between 6 and 10; Between 11 and 25; More than 25; I do not know. Series: Frequently (505), Occasionally (47).]

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered in this cross analysis was 552.

ii. Number of bioinformatics tools used according to research domains
Analysis of numbers of tools used according to research activities showed that in research domains such as agriculture, environment, biology, chemistry, medicine and nutrition, the highest response frequencies were observed for a number of tools ranging from 3 to 10 (see Figure 48). For research domains such as maths, computer sciences and bioinformatics, the highest response frequencies were observed for a number of tools greater than 25.

Figure 48 – Number of bioinformatics tools used according to research domain

[Bar chart. X-axis: response frequency (0% to 45%). Categories: 2 or fewer; Between 3 and 5; Between 6 and 10; Between 11 and 25; More than 25; I do not know. Series: Agriculture (36); Environment (25); Biology (313); Chemistry (54); Maths (22); Medicine (143); Computer sciences (83); Bioinformatics (351); Nutrition (20); Other (82).]

Response frequency = % of answering respondents (as indicated for each research domain) about research activities. Total number of answering respondents considered in this cross analysis was 541.

iii. Number of bioinformatics tools used and bioinformatics support
To detect whether the bioinformatics environment could be linked to the number of tools used, responses were analyzed according to access to bioinformatics support (see Figure 49). In general, the use of a large number of tools (i.e. 11 or more) was more frequently associated with respondent groups having access to bioinformatics support. Conversely, the use of a low number of tools (i.e. 10 or fewer) was more frequently associated with respondent groups having no access to bioinformatics support. Frequency differences between supported and unsupported groups were less striking for access to support such as bioanalyses.

Figure 49 – Number of different bioinformatics tools used according to bioinformatics support

[Five bar-chart panels. X-axis of each panel: response frequency (0% to 45%). Categories in each panel: 2 or fewer; Between 3 and 5; Between 6 and 10; Between 11 and 25; More than 25; I do not know. Panels and series:
Bioanalyses: Yes (397), No (89)
Database and/or software support: Yes (408), No (91)
Authenticated access to server resources: Yes (361), No (106)
Access to engineering: Yes (299), No (161)
Computational grids: Yes (259), No (159)]

Response frequency = % of answering respondents (as indicated for each category) about bioinformatics support. Total number of answering respondents considered in this cross analysis was 528.

iv. Number of bioinformatics tools used and bioinformatics education and training
Similar observations were made when considering bioinformatics knowledge support. Figure 50 shows that the use of high numbers of tools (i.e. from 11 to more than 25) was predominantly associated with respondent groups having regular access to training and education. In contrast, low numbers of tools (i.e. fewer than 11) were mostly associated with respondent groups having restricted or no access to training and education.

Figure 50 – Number of bioinformatics tools used and bioinformatics education and training

[Bar chart. X-axis: response frequency (0% to 45%). Categories: 2 or fewer; Between 3 and 5; Between 6 and 10; Between 11 and 25; More than 25; I do not know. Series: Yes, on a regular basis (237); Yes, but limited (254); No (39).]

Response frequency = % of answering respondents (as indicated for each category) about bioinformatics education and training. Total number of answering respondents considered in this cross analysis was 530.

v. Number of bioinformatics tools used and relationships with bioinformatics research groups

To complete the analysis according to users’ bioinformatics environment, the number of tools used was examined according to the strength of relationships with bioinformatics research groups. Figure 51 shows that large numbers of tools were principally used by respondent groups having close relationships with the bioinformatics scientific community. Conversely, answering respondents having poor scientific interactions with bioinformatics groups predominantly used low numbers of tools.

Figure 51 – Number of bioinformatics tools used and relationships with bioinformatics research groups

[Bar chart. X-axis: response frequency (0% to 45%). Categories: 2 or fewer; Between 3 and 5; Between 6 and 10; Between 11 and 25; More than 25; I do not know. Series: Strong (204); Medium (244); Poor (88).]

Response frequency = % of answering respondents (as indicated for each category) about relationships with bioinformatics research groups. Total number of answering respondents considered in this cross analysis was 536.

b. Sources of bioinformatics tools

In general, tools were provided mainly by academic/non-profit organizations (see Figure 52). However, 29.1% of answering respondents used tools from both academic/non-profit organizations and commercial vendors.

Figure 52 – Sources of tools

[Bar chart. X-axis: response frequency (0% to 75%). Categories: Mainly academic/non-profit organizations (381); Mainly commercial vendors (9); A mixture of both (162); Not relevant (4).]

Response frequency = % of answering respondents (i.e. 556). For each category, the corresponding response count is indicated.

i. Sources of tools according to research domains Variations of tool sources were observed according to research areas (see Figure 53). For instance, a majority of respondent groups (60% of answering respondents) with activities in nutrition used a mix of public and commercial tools. On the contrary, 80.7% of answering respondents with computer sciences activities used mainly tools from academic/non-profit sources.

Figure 53 – Sources of tools according to research domain

[Bar chart. Y-axis: response frequency (0% to 100%). Categories: Mainly academic/non-profit organizations; Mainly commercial vendors; A mixture of both; Not relevant. Series: Nutrition (20); Medicine (145); Chemistry (55); Maths (22); Other (83); Agriculture (36); Biology (316); Environment (25); Bioinformatics (353); Computer sciences (83).]

Response frequency = % of answering respondents (as indicated for each research domain) about research activities. Total number of answering respondents considered in this cross analysis was 545.

ii. Sources of tools according to sector
Responses were analyzed according to sector (see Figure 54). While the tools used in the public sector were mainly from academic/non-profit organizations (74.4% of answering respondents), in the private sector (i.e. industry) the majority of respondents reported using a mix of public and commercial tools (73.8% of answering respondents).

Note: this observation was the first noticeable difference between respondents’ responses from public and private sectors in this study.

Figure 54 – Sources of tools according to sector

[Bar chart. X-axis: response frequency (0% to 100%). Categories: Mainly academic/non-profit organizations; Mainly commercial vendors; A mixture of both; Not relevant. Series: Academic/Non-profit (481), Industry/Commercial/SME (61).]

Response frequency = % of answering respondents (as indicated for each sector). Total number of answering respondents considered in this cross analysis was 542.

Cross analysis of responses to bioinformatics tools sources according to other criteria such as user categories and bioinformatics environment did not return noticeable differences (data not shown).

c. Access to bioinformatics tools
Means of access to bioinformatics resources are important parameters for the future infrastructure design. When consulted on this point, 53.6% of the answering respondents indicated that they accessed bioinformatics tools through a combination of internet access and in-house installation (see Figure 55). Still, 29.0% of the answering respondents worked with bioinformatics tools mainly through internet access.

Figure 55 – Access to bioinformatics tools

[Bar chart. X-axis: response frequency (0% to 100%). Categories: Mainly via the internet (161); Mainly using in-house installation (including intranets) (94); A mixture of both (298); Not relevant (3).]

Response frequency = % of answering respondents (i.e. 556). For each category, the corresponding response count is indicated.

Cross analysis of responses to bioinformatics tools access according to criteria such as user categories, research domain, bioinformatics environment and working sector did not return noticeable differences (data not shown).

d. Challenges with bioinformatics tools
One of the most frequent challenges encountered by users when working with bioinformatics resources is combining bioinformatics databases and/or tools, because their input/output formats are poorly compatible. Hence, respondent groups were consulted about their current experience with such tasks. As reported in Figure 56, a large majority of respondent groups reported investing either some (48.1% of answering respondents) or significant (37.8% of answering respondents) effort in order to combine the inputs/outputs of data resources and/or tools. Only 14.1% of answering respondents reported that this was not an issue for them (i.e. no or very little effort was necessary).

Figure 56 – Challenges with bioinformatics tools

[Bar chart. X-axis: response frequency (0% to 60%). Categories: No or very little effort (77); Some effort (262); Significant effort (206).]

Response frequency = % of answering respondents (i.e. 545). For each category, the corresponding response count is indicated.

51 answering respondents added a comment to their response (see Appendix II, Table 23).

i. Challenges with bioinformatics tools according to user categories
A larger proportion of frequent users than of occasional users (respectively, 38.9% and 26.1%) indicated that significant effort was required to combine bioinformatics resources (see Figure 57). Conversely, minor or no effort was indicated by a higher proportion of occasional users than frequent users (respectively, 23.9% and 13.2%).

Figure 57 – Challenges with bioinformatics tools according to user categories

[Bar chart. X-axis: response frequency (0% to 75%). Categories: No or very little effort; Some effort; Significant effort. Series: Frequently (499), Occasionally (46).]

Response frequency = % of answering respondents (as indicated for each user category) about use of bioinformatics resources. Total number of answering respondents considered in this cross analysis was 545.

ii. Challenges with bioinformatics tools according to research domain
Analysis of responses according to research activities showed that significant effort was reported more frequently by respondent groups with activities in maths and computer sciences and, to a lesser extent, in bioinformatics and medicine (see Figure 58).

Figure 58 – Challenges with bioinformatics tools according to research domain

[Bar chart. Y-axis: response frequency (0% to 100%). Categories: No or very little effort; Some effort; Significant effort. Series: Maths (22); Computer sciences (83); Bioinformatics (350); Medicine (142); Biology (313); Chemistry (54); Nutrition (20); Other (81); Environment (25); Agriculture (36).]

Response frequency = % of answering respondents (as indicated for each research domain) about research activities. Total number of answering respondents considered in this cross analysis was 534.

e. Development of bioinformatics tool resources
To identify priorities for tool resources, respondents were asked to rate the development priority of several tool resources (see Figure 59). A general demand for broader access to standardized benchmarking of bioinformatics tools was observed. Similarly, more programmatic access to bioinformatics tools also appeared to be of high concern.

Figure 59 – Importance of some tool resources development

[Bar chart. X-axis: rating average (0.0 to 2.0). Categories: A single portal for bioinformatics tools (532); More widespread and standardized benchmarking of bioinformatics tools (525); Programmatic access to bioinformatics tools (504). Legend: Not relevant, Important, Essential.]

Multiple answer options were authorized. The rating scale was as follows: “Essential” = 2, “Important” = 1, “Not relevant” = 0. Response frequency = % of answering respondents (i.e. 546). For each category, the corresponding response count is indicated.

37 answering respondents added a comment to their response (see Appendix II, Table 24).

7. General comments (See Appendix I, Sub-questionnaire 1 - Question #28)

To conclude with the user survey questionnaire, respondents were invited to add a general comment about the topics addressed or the questionnaire itself. 73 answering respondents added a general comment (see Appendix II, Table 25).

a. About user needs
There are two distinct classes of users in terms of how they work with bioinformatics resources. One class corresponds to the “power” users, who are able to customize resources (and/or their access) to their specific needs. The other, larger class corresponds to the “non power” users (also called end-users), who rely entirely on the available graphical user interface (GUI) to access and work with resources. The level and mode of interaction of the two classes with bioinformatics resources differ and imply specific requirements. However, end-users (mainly represented by groups in experimental biology) so outnumber “power” users that addressing the needs of the latter is sometimes not a priority for developers. Comments in Table 25 included, on the one hand, a call for more consideration of “power” users’ needs: “I think, I've banged on enough about how poorly the power user is being treated with respect to tools and databases” and, on the other hand, recommendations for a strong focus on biologists’ issues: “By all this work, please do keep your focus on the needs of biologists. […]”.

Based on these comments, it is clear that significant and equal efforts have to be made to respond to the needs and priorities of these two distinct user communities.

b. About importance of bioinformatics infrastructures
The essential role of bioinformatics infrastructures for research in life sciences was clearly stated in Table 25, and the geographical distribution of major bioinformatics centres across Asia, the US and the EU was perceived as important for the advance of science and innovation. Sustainable bioinformatics infrastructures were seen as instrumental in addressing challenges in bioinformatics (i.e. increasing volume of data and needs for bioinformatics resources integration). These points are critical for today’s research in life science and related domains: “[…]. Setting up a sustainable, harmonized and integrated bioinformatics infrastructure is an essential initiative to guarantee progress in life science and health care”.

c. About role and features of bioinformatics infrastructures Bioinformatics infrastructures were stated as core resources for bioinformatics (and computational biology) research and education. They were also seen as potential players to foster scientific networking (e.g. conference organization) between the involved communities. Finally, bioinformatics infrastructures were identified as a good environment for the establishment of community standards: “Promoting Open Source solutions in bioinformatics both in Academia as well as in industry, and the establishment of Open Standards for data exchange should be paramount in all efforts to future development of the area.”

Motivations for new bioinformatics infrastructures included improvement of the set of features currently offered by the existing ones: “For a European Infrastructure to be justified, it needs to offer more than already existing installations, say at the NCBI. There is no point in having a copy of functionality. […]”. Among the suggestions for additional features, were control of data reliability, bioinformatics resources (i.e. databases and tools world wide) interoperability and access to large computational capacity. To ensure common use and high visibility, new infrastructures were also recommended to be “[…] as convenient and user friendly as possible”.

d. About resources for the future infrastructure Identification of suitable resources for the future infrastructure is a complex task. Transversal as well as specific requirements of research domains have to be considered along with quality, comprehensiveness and technological constraints. Several comments pointed out that some important data types for research activities needed to be better represented/structured (e.g. human gene mutation, disease and phenotypic data) or that representation of model organisms in databases should be more generally extended (e.g. to prokaryote). Addition of resources for bioinformatics tools (e.g. benchmarking of commercial solutions) and education (e.g. online courses and biology basic knowledge information) was also advised. More specifically, some respondent groups indicated examples of suitable resources (e.g. sequence workbench - as provided by Genetic Data Environment and MRC - mrs.cmbi.ru.nl) or made already a proposal for contribution: “Our group would warmly support the incorporation of its GRISSOM portal (http://195.251.6.234/ biodatagrid/new/userlogin.php currently under development - end Dec 2008), […]”.

e. About ELIXIR initiative ELIXIR initiative was congratulated and warmly supported. Wishes for further contribution to the ELIXIR effort were also expressed.

f. About this survey
The survey approach and content were well received by some respondent groups: “Very good survey, hope it will serve to improve the European bioinformatics”, and some offered further contribution: “Relevant and interesting questions. I would be happy to provide further input.” However, some negative points were also raised: the collection of only one response per group, missing topics, missing answer options and the poor readability of some questions.

PART II – INTERVIEW RESULTS
Based on the recommendation from representatives of funding agencies and national bodies, 23 researchers were contacted for a phone interview and 9 of them agreed.

A. Candidates’ profile
All candidates were from academia and represented the following countries: Germany, France, Spain, UK and The Netherlands. Candidates’ name, position, affiliation and research interest are listed in Table 14.

Table 14 - List of interview candidates

Name and affiliation Research focus

Prof. Hans Lehrach, Head of Department of Vertebrate Genomics, Max-Planck-Institute for Molecular Genetics, Berlin, Germany.

Genomics - Molecular Embryology - Bioinformatics - Comparative and Functional Genomics - Genetic Variation Group – Oligofinger printing / cell arrays - In vitro ligand screening - etc…

Prof. Ralf Hofestädt, Head of Department of Bioinformatics / Medical Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany.

Metabolic network analysis, database integration and parallel computing.

Dr. Emmanuel Barillot, Director of Bioinformatics and IT, Inserm Unit 900 / Ecole des Mines / Institut Curie, Paris, France

Cancer research using mathematical and computational approaches. Analysis of large volume of data generated by high throughput technologies.

Dr. Laurent Duret, Group Leader, Bioinformatics and Evolutionary Genomics Group. UMR CNRS 5558, Université Claude Bernard - Lyon 1, Villeurbanne, France

Study of genome organization and evolution. Development of bioinformatics tools for comparative genomic analysis.

Prof. Francisco E. Rodríguez Valera, Group leader, Facultad de Farmacia, Universidad Miguel Hernandez de Elche, Elche, Spain.

Metagenomics - DNA sequencing from environment.

Prof. David W Burt, Group leader, The Roslin Institute, University of Edinburgh, Roslin, United Kingdom.

Chicken Genomics: development of the chicken embryo, host pathogens interactions, transcription, regulation and genes.

Prof. Raymond A Dixon, Group leader, Department of Molecular Microbiology, John Innes Centre, Norwich, United Kingdom.

Signal transduction and regulation in bacteria.

Prof. Ottoline Leyser, Group leader, Department of Biology, University of York, Heslington York, United Kingdom.

Plant networks and their role in plant developmental plasticity.

Prof. Ritsert Jansen, Head of Groningen Bioinformatics Centre (GBIC), Groningen Biomolecular Sciences and Biotechnology Institute (GBB), Faculty of Mathematical and Natural Sciences (FWN), Haren, The Netherlands.

System genetics - studying the living organism with focus on the genetic component.

The interview candidates’ research groups were frequent users of bioinformatics resources, and most of them had a comfortable bioinformatics environment, including access to bioinformatics support, education and training (although only 3 groups had regular access) as well as strong interaction with the bioinformatics research community.

B. Long-term sustainability of bioinformatics infrastructures

All the interview candidates strongly supported the long-term sustainability of bioinformatics infrastructures. In their view, bioinformatics infrastructures were essential to (i) cope with the ever-growing volume of data that must be safely stored for information exchange and further exploitation (e.g. biological system modelling), (ii) address the challenge of biological data integration, and (iii) support research activities which rely critically on the availability of bioinformatics resources (e.g. systems biology, evolutionary genomics, etc.).

“This is essential because we aim for integrating information to improve our holistic understanding of the biological organism. So it means that all data collected at different level from genome to Transcriptome, proteome, metabolome, phenome, etc…, have to be integrated across organisms, but certainly within organisms, across different studies carried out by lots and lots of people, to really get return on the major investments made by European Union and others funding agencies. […]” – Prof. R. Jansen

“It on depends on genome research, functional genomics that wish to make models of biological systems. So without EBI (i.e. core databases maintained at EBI), it would be very very very difficult to do any of that work.” – Prof. D. W. Burt

To date, strategies for long-term funding of such infrastructures have been sorely lacking in Europe.

“Today, funding resources ensure only partially information preservation and sharing because most of the fundings are linked to large centres, in Europe EBI. But outside of EBI, there are also many tools and databases which are very rich, and much more detailed (in terms of annotation for example) and whose sustainability is not guarantied. The maintenance of such resources is depending on European or national fundings which are not necessarily dedicated to infrastructures. Things are often started with a research project funding and then after for the maintenance that should be ensured, additional funding dedicated to infrastructure, must be obtained. This should be part of the European policy.” - Dr. E. Barillot

“This is extremely important. See what the US is doing, they are giving money for the most important databases. We are not doing this in Europe and this is a big mistake. This is a problem in Europe and also in Germany that we do not take care of very good databases. You can have a grant for three years and then it is over. So nobody is taking care of the best tools and databases after three years. It is not the case in Japan and US.” – Prof. R. Hofestädt

Like some respondents to the online questionnaire, Prof. F. Rodriguez-Valera expressed concerns about the European framing of ELIXIR and suggested addressing the question from a global perspective instead. Furthermore, according to Prof. F. Rodriguez-Valera, the key priority for Europe in terms of bioinformatics was the expansion of bioinformatics expertise.

“To have an effort on bioinformatics infrastructure at the European level is not the key thing. Efforts have to be made at a more global level. The key issue is the human factor. When they are trying to fund research the most important component of research is the human resources. The human factor is totally forgotten most of the time. So we need programmers and we need people with a mixed background between biology and computer sciences.” - Prof. F. Rodriguez-Valera

C. Working with bioinformatics databases

As the 9 candidates were frequent users, their responses were comparable to those collected from the same user category through the online questionnaire.

All the candidates indicated that their group used bioinformatics databases for information searching, data analysis and manipulation purposes.

Their biological data of interest were primarily molecular sequence, protein functional annotation, genomics and literature data. Genetic data was also found to be of general interest.

Table 15 lists some of the essential, important and useful data resources as indicated by the 9 candidates.

Table 15 - Biological databases of interest

Essential Response count Important Response count Useful Response count

EMBL 4 PubMed 2 Brenda 1 Ensembl 3 BIOBASE 1 InterPro 1 UniProt 3 BioPax 1 PDB 1 KEGG 2 CAMERA 1 PubChem 1

Pubmed 2 dbSNP 1 RDB (Ribosomal DB project) 1

SMART 2 GEO 1 UniProt, 1 InterPro 1 GO 1 MGD, 1 GOA, 1 MSD 1 HapMap, 1 MSD-PISA 1 HPRD 1 MSD-PQS 1 MRB (mouse), 1 NCBI-nrDB, 1 PubChem 1 OMIM 1 Reactome 1

PDBSum 1 System Biology Ontology 1

Protein Databases 1 Transfac (from Biobase) 1

Structure-Pfam, 1 Transfac, 1 UCSC Genome Browser 1 XGap,

As in the online questionnaire results, the databases most often cited as essential resources included molecular sequence, genomics and literature databases.

7 out of 9 candidates indicated that the databases used for their research activities were systematically cited in the corresponding publications. All of them had contacts with database providers in several contexts: interaction at meetings, database feedback, reporting of technical problems and data submission.

The 9 candidates were partially satisfied with the current bioinformatics databases. Even though most of them acknowledged that the existing resources already offered substantial support to their research, they felt that progress was needed in several areas.

“In general, I am satisfied of the availability of the existing resources. They are critical for our activities. However, there are some points which need to get improved, especially the dissemination of genome data. For each genome project, the centre in charge of the sequencing and annotation provides online access to the information but this information is disseminated among several resources. There is not a centralised centre for genome information. For our activities, this is a big lack. There have not been efforts for the collection of the genome data provided by the different sequencing centres.” – Dr. L. Duret

“One of the reasons why we invest ourselves in development is because the software infrastructure that we need is not at hand.” – Prof. R. Jansen

When working with bioinformatics databases, the 3 challenges most often encountered by the 9 candidates were transparent querying across databases, the file format compatibility of data downloaded from databases, and the file format compatibility of data for submission to databases.

“There are problems with gene expression data and we had to create our own software to provide data in a correct format. So I would say, some databases it’s easy to submit data, for others it remains a challenge. If the database providers want data submitted, then really it should be up to them, to provide softwares to make life easy for the user.” - Prof. D. W. Burt

The candidates’ recommendations for database improvement covered database interoperability, functionality (e.g. visualization tools) and data quality. Development efforts were recommended for the expansion of current data resources, new data resources and API-web services.

D. Working with bioinformatics tools

8 out of 9 candidates indicated that their research groups used between 11 and 25 (or more) bioinformatics tools. This was again consistent with the responses of frequent users collected through the online questionnaire.

5 candidates used tools mainly supplied by academic/non-profit organisations and 4 used tools supplied by both public and commercial organisations.

Working with a combination of bioinformatics resources (i.e. databases and/or tools) was a task that required some to significant effort for all the candidates’ groups.

Finally, among the suggested developments for bioinformatics tools, programmatic access was rated essential by 6 candidates and important by the 3 others. Standardized benchmarking of bioinformatics tools was considered important by 8 candidates.

Limitations of the user survey

A. Survey scope

As stated in the study title, the targeted population was the community of bioinformatics resource “users” and, as intended, information was collected mostly from research groups who felt they belonged to that population. This was shown by the high proportion of frequent users among the respondents (85% of total respondents). Indeed, frequent users were the user category best placed to provide relevant feedback on the current use of bioinformatics resources and on needs and priorities. This information was critical for the design and implementation of future bioinformatics infrastructures. However, the survey was not designed to document the proportion of bioinformatics resource users within the life sciences, nor to identify what promoted or hampered the integration of bioinformatics resources into individual groups’ research activities.

B. Sampling method

Although the sampling method was efficient enough to gather a significant number of respondents, there was little control over the actual number of people solicited, and therefore the actual response rate to the survey could only be estimated. Moreover, only one response per group was requested, and the analysis assumed that each survey response represented a different research group. However, since the online questionnaire was anonymous, there was no means of evaluating the extent to which this recommendation had been followed by the respondents, except that the survey results showed that 51.3% of the respondents were PIs/group leaders.

C. Community strata

Significant efforts were made during the survey solicitation phase to obtain an even representation of countries, sectors, research domains and user categories (i.e. frequent and occasional users). However, the results on respondents’ profiles showed substantial imbalances across these criteria, which limited the interpretation of some data comparisons.

CONCLUSION

A. Development of a survey strategy

To support the objective of the ELIXIR preparatory phase, different strategies were undertaken to consult the stakeholder community at large. WP3 coordinated the organization of three stakeholder meetings where ELIXIR partners and other members of the scientific community were invited to explain their views and voice their recommendations for building a European bioinformatics infrastructure. In addition, several work groups representing EU bioinformatics communities (i.e. industry, international collaborators and data providers) were established. Finally, a survey strategy was developed in order to address the large and diverse community of bioinformatics resource users represented by individual research labs.

The ELIXIR Bioinformatics User Survey was designed to collect information on usage patterns of bioinformatics resources as well as needs and priorities for bioinformatics infrastructures in Europe. The targeted population encompassed individual research groups working in diverse areas of life science and using bioinformatics resources at different levels and for different purposes. Assessing such a large and complex population was a real challenge in terms of sampling and questionnaire design. The sampling issue was addressed using different survey advertisement strategies (e.g. email communication, website posting, newsletter advertising and communication at congresses). The strong and active support from representatives of national bioinformatics communities (WP3 bc members) and funding agencies was decisive for sample identification. The principal issue with the questionnaire design was to write and format a set of questions that would be readable and relevant for different categories of users as well as non-users. Hence, the questionnaire included a common questionnaire and three sub-questionnaires specifically designed to assess users of different levels and non-users separately. This consultation approach was very well received by the assessed community, as indicated by the great interest that respondent groups showed in the survey results (73.8% of respondent groups).

B. Survey findings

1. Users of bioinformatics resources: a community mosaic.

The information collected about respondents’ profile reflected the expected complexity of the user community. The survey respondents represented research groups from 318 organizations, located in 34 different countries and with research activities pertaining to more than 20 different areas of life science.

Analysis of the captured data highlighted a few indicators related to community diversity, such as user category (e.g. frequent or occasional user), research domain, bioinformatics environment and country location. Noticeable variations in the usage pattern of bioinformatics resources were observed according to these criteria. In contrast, the working sector did not seem to have a similar impact on respondents’ usage of bioinformatics resources. However, this observation should be treated with caution considering the under-representation of industry (61 groups, i.e. 10.6% of answering respondents) in this study.

An interesting observation was the interrelationships between the different indicators revealing the intricacy of the user population. For instance, the nature and/or quality of respondents’ bioinformatics environment were linked to the country location, research domain and user category. Similarly, the user category was, to some extent, linked to the research domain.

Analysis of the responses about bioinformatics resources made it possible to distinguish general usage patterns from those specifically linked to users’ profiles. For instance, searching for biological data and data analysis were common purposes for working with bioinformatics databases (respectively 91.4% and 79.1% of answering respondents). As general practices, they varied little or not at all with criteria such as user category or access to bioinformatics support. Other general trends included interest in molecular sequence data (76.2% of answering respondents) and the popular databases PubMed, EMBL/GenBank/Entrez Nucleotide and UniProt/Swiss-Prot/TrEMBL/Entrez Protein. Regarding bioinformatics tools, a trait common to all respondent groups was the limited use of commercial tools (i.e. only 1.6% of answering respondents used predominantly commercial tools). Considering the working sector of most of the respondent groups (i.e. academic/non-profit), this was not surprising since academic researchers have limited access to commercial products due to limited financial resources. Nevertheless, this clearly showed (if it were still needed) how dependent these respondent groups were on public tool resources.

Specific usage patterns of bioinformatics databases and tools were more strikingly linked to the user category. One such practice was the use of bioinformatics databases for data manipulation (i.e. downloading large sets of data for subsequent use in computational biology), which was very limited among occasional users (19.6% of answering respondents) but common practice among frequent users (69% of answering respondents). A greater interest in genomics, ontology and genetics data was also observed in frequent users than in occasional users. Furthermore, interactions with database providers and database literature citations were also more common among frequent users (respectively 54.6% and 59.2% of answering respondents) than among occasional users (respectively 13.1% and 27.0% of answering respondents). Finally, the number of tools used by respondents was closely linked to the user category. Occasional users mostly used between 3 and 5 tools whereas almost half of frequent users used between 11 and 25 tools (or more).

Altogether these observations indicated that frequent users were associated with a more advanced use of bioinformatics resources and were in a closer proximity to the bioinformatics community. It is noteworthy that research activities of frequent users pertained almost equally to biology (58.6% of answering respondents) and bioinformatics (67.0% of answering respondents). Furthermore, a small proportion (less than 10%) of frequent users indicated that their research activities were outside of both biology and bioinformatics domains. Hence, simply drawing a line between bioinformatics and biology research domains in order to predict usage patterns would be far too simple and misleading.

2. Users’ perception of bioinformatics infrastructures

There was a genuine community consensus that the sustainability of bioinformatics infrastructures is essential. More than 90% of respondent groups (i.e. 726 answering respondents) agreed that long-term support of EU bioinformatics infrastructures was either essential or at least important.

“Without a persistent and maintained infrastructure then access to and exploitation of data and information is impossible. In fact, failure to invest is not only a lost opportunity but a diminishing return on the large investment already committed.” - Anonymous.

“[…]. We really need those data to be available for a long time, well stored, well documented and that is really essential. If not then, we throw away the baby with bath water and we lose lots of our investments.” - Prof. R. Jansen.

“You can have a grant for three years and then it is over. So nobody is taking care of the best tools and databases after three years. It is not the case in Japan and US.” – Prof. R. Hofestädt

“It is very essential. We are generating a huge amount of data if you do not make them available, we might better not spend that much effort and time in generating them.” - Prof. H. Lehrach

The role of such infrastructures was further described as critical for respondent groups’ research activities in terms of support (e.g. 15 respondent groups commented that their research activities relied specifically on EU bioinformatics resources), development, exchange and competitiveness. From the collected comments, bioinformatics infrastructures were expected to play a major role in resource maintenance, high-capacity data storage, architecture stability, data quality, data centralization and integration as well as data access and user support.

“We are experiencing a flood of data from genomics experiments, which will be impossible to process and understand without a large bioinformatics effort and infrastructure.” - Anonymous

“Centralized resources insure: robustness and stability, data quality, curation quality” - Anonymous

Such an infrastructure initiative was expected to foster bioinformatics resource development and research, and to provide a core platform for knowledge dissemination as well as mechanisms for long-term funding. Altogether, the hope was that such a large bioinformatics infrastructure would significantly increase EU bioinformatics competitiveness on the global scene.

Nevertheless, some concerns were expressed about a European infrastructure initiative: what is the relevance of such a community effort on a European scale?

“Worldwide sustainability should be the primary goal although I understand this may be outside the scope of Elixir.” - Anonymous

“[…] So, if we have things that are specifically European and are working well and are successful, then by all means we should support them, I agree. Databases that are being used and are being important for research should be kept since they are valuable resources. I agree with that. But what I do not like is the idea of different compartments in Europe, US and Japan, because actually at the level of communication that we have today, there is no actual borderlines, they are all the same thing.” – Prof. F. Rodriguez-Valera

Other concerns were about the diversity of labs’ requirements and the ability of such a large (and maybe remote) infrastructure to address them.

“Always the issue of how useful large consortium initiatives are for your own specific research application.” - Anonymous

The results of this study leave no doubt that molecular sequence information was of central interest to the assessed respondent groups. This was clearly confirmed by the high rating of the corresponding general resources (e.g. EMBL-Bank and UniProt). In addition, genomics, protein function and gene expression data were of main interest to a majority of respondent groups, which was also illustrated by the high rating of the data resources Entrez Gene, Ensembl, Pfam, InterPro, GEO and ArrayExpress.

Surprisingly, only 57% of answering respondents indicated an interest for literature data. Nevertheless, the literature database PubMed was rated first among all the resources, which was in agreement with the very transversal nature of literature data resources.

Despite these general trends, respondent groups indicated very specific but non-negligible interests in specialized molecular data (e.g. glycomics, immunogenetics, molecular binding sites and protein disorders) and genomics data (e.g. epigenomics and metagenomics), non-molecular data (e.g. phenotype data) as well as data from biology-related domains such as pharmacology, clinics, oceanography and meteorology. This high diversity in data of interest was also apparent from respondents’ database ratings: specialized databases were typically rated lower than their corresponding general resources. Moreover, when respondent groups were asked to name essential resources for their activities, 207 database names were cited, of which 193 were cited by only one respondent group.

In general, respondent groups indicated that they were not fully satisfied with the existing data resources. The most frequently encountered challenges with bioinformatics databases related to the lack of database interoperability (i.e. transparent querying across databases), the lack of standards for data formats (i.e. compatibility of data file formats for subsequent use) and the quality of the graphical user interface (i.e. database website usability).

“To me, the biggest issue is about integration. So, one might be very satisfied with individual databases but it would be much better if they talk to the other databases more straightforwardly. I think that is the key point.” - Prof. O. Leyser

The lack of database interoperability is a major issue that systems biology researchers have to overcome in order to combine large volumes of data of different natures and from highly variable sources.

“Basically, database integration and interoperability is easy if the data have been generated in a high throughput project, and difficult if they have not. You can build up a sophisticated system to try to integrate data which are hard to integrate and you will get further than if you do not try at all, but you will never compensate for the mistake you have made in generating the data in the first place.” - Prof H. Lehrach

“It is the scope of EU CASIMIR and EU GEN2PHEN projects as well. That is why these projects are funded.” - Prof R. Jansen

Yet, working with only one data resource might also be challenging when database web sites are not adequately designed for ease of use.

“I would not say difficulties but there are preferences. We stick to certain sites dependent upon familiarity (continuity), usability and range of facilities on offer. The interface is not so important to some but is crucial for the less experienced members of the community.” – Prof R. A. Dixon

“Sometimes websites have too much information and it is quite difficult to find what you are after until you have spent a bit of time getting use to the website and that is where the feedback comes in. When you can get some feedback to the database provider and when they respond, that is good. Now sometimes they do not, but usually they do because they want people to use the website efficiently” – Prof D. W. Burt

Graphic visualization is also extremely important for data exploitation in some areas.

“For our activities, it is important to get a graphic visualization when analyzing a genome. It is really critical to visualize the location of the gene of interest and to know its chromosomal environment. Ensembl provides this type of functionality.” – Dr. L. Duret

When consulted on their experience with bioinformatics tools, respondent groups clearly stated how difficult it was to work with a combination of several resources (i.e. bioinformatics tools and/or databases). The burdensome and time-consuming nature of the additional programming work needed was strongly expressed in comments collected via the online questionnaire. The interview candidates were also unanimous on this point.

“This step takes up about 90% of my research time. Enough said.” – Anonymous

“This is something we do not like because it really adds to our research.” – Prof. R. Jansen

“At the moment it is a significant effort. You have to do a lot of programming, write scripts, to combine things from different databases to get whatever you want. That is quite a significant amount of time.” – Prof. H. Lehrach

“Most of the in-house tool development for my group is focused on transferring data from one tool to another tool.” - Prof. O. Leyser
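The kind of glue script these comments describe can be illustrated with a minimal sketch. It assumes two hypothetical text exports from unrelated resources that use different delimiters and different names for the accession column; the data, column names and function names are invented for illustration and do not come from the survey or from any real database.

```python
import csv
import io

# Hypothetical exports from two unrelated resources; the column names and
# values are invented for illustration only.
SEQUENCE_EXPORT = "accession\tgene\nP00001\tgeneA\nP00002\tgeneB\n"
ANNOTATION_EXPORT = "id;function\nP00002;kinase\nP00003;transporter\n"


def read_table(text, delimiter, key):
    """Index the rows of a delimited text export by the given key column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row[key]: row for row in reader}


def merge_exports(sequences, annotations):
    """Join the two exports on their accession, keeping unmatched rows."""
    seq = read_table(sequences, "\t", "accession")
    ann = read_table(annotations, ";", "id")
    merged = []
    for acc in sorted(set(seq) | set(ann)):
        merged.append({
            "accession": acc,
            # Missing values are left empty rather than guessed.
            "gene": seq.get(acc, {}).get("gene", ""),
            "function": ann.get(acc, {}).get("function", ""),
        })
    return merged


if __name__ == "__main__":
    for record in merge_exports(SEQUENCE_EXPORT, ANNOTATION_EXPORT):
        print(record)
```

Even in this toy case, the script has to know each export's delimiter and key column in advance, which is the per-resource effort that shared file formats and programmatic access would be expected to reduce.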

In response to the identified challenges with bioinformatics resources, respondent groups’ recommendations and suggestions for future bioinformatics infrastructures focused on three main points: (i) development for database and/or tool integration (e.g. API-web services and standardization of file formats), (ii) improvement of database functionalities (e.g. database user interface and associated tools) and (iii) control of data quality.

“My main critic is: who is taking care of the quality of the data? No one. This is the main problem of all the databases. So I am happy to have the database but I am unhappy what is about the quality.” – Prof. R. Hofestädt

“There is a lack of data quality and traceability. This is typical, when working for example, with gene expression data or patients clinical data. In GEO, the sample annotation is poor. We can find the expression data but very few details about the characteristics of the analyzed samples.” – Dr. E. Barillot

“A measure of data quality will be useful, so I can actually tell whether the data are good or not. […] Generally, the data are good, but it would be nice to have some data quality measurement attached to the data that is in there.” - Prof. D. W. Burt

For new data resource development, respondents favoured expanding the current databases over systematically building new ones. However, there are major issues with the development time frame of resources.

“New types of data are difficult to be mixed with existing types of data because public databases are lagging behind in development but they do not offer option to take their product and make your own extension to it.” - Prof. R. Jansen

Regarding database models, development of object (or model) oriented databases was strongly suggested.

“Things can be integrated better and I think that we will have to develop new databases of biological object which are directly able to be used in modelling processes. I think that we will have to move to databases of objects, which can be directly used. But that does not mean that what is available is not already very good.” - Prof. H. Lehrach

“Years ago, EMBL already said that we will come up with object oriented databases soon. Until now nothing happened. But I think that object oriented databases are normally very interesting for biology because if you take a look to a cell you will see a lot of objects finally. So I hope that in the future we will see more object oriented database systems.” - Prof. R. Hofestädt

To assist with the use of bioinformatics tools, development of standardized benchmarking was perceived as an excellent initiative. Yet, some concerns were raised about the potential impact of such an initiative on tool innovation.

“There must also be room for 'improvisation'. It would not do to have an overly rigid, virtually castrating entity interested in standardizing to the point of stifling innovation.” – Anonymous

“Tools evolve rapidly and this allows to improve ways of working or to face new needs. This is very dynamic and tool sources are very diverse. The diversity of tool providers is crucial.” - Dr. L. Duret

C. Strategy for future bioinformatics infrastructures

Bioinformatics infrastructures must serve the needs and priorities of a very complex community of users. One dilemma for future infrastructures will be to respond simultaneously to the very distinct demands of two discrete classes of users: “power users” and “non-power users” (or end-users). Another important challenge will be to define the appropriate scope of such infrastructures. The study showed that, despite general trends regarding genetic and molecular information, the respondent groups’ interest in other biological (or biology-related) data was very specific. Hence, future bioinformatics infrastructures should provide a biological information environment that acknowledges this diversity of user interests. The requirement for a comprehensive and multi-disciplinary data environment will be even more critical for researchers from integrative disciplines (e.g. systems biology, metagenomics, drug discovery). Furthermore, to provide proficient infrastructures, several bottlenecks in the exploitation of bioinformatics resources have to be overcome: the lack of resource interoperability, programmatic access, input/output format standardization and user-friendly web interfaces. Besides these efforts in resource development, an optimal community synergy should be established between resource providers/developers and users: involvement of future users during the resource development phase, efficient capture of user feedback, and development of resource documentation and tutorials.

“The databases we use are absolutely essential in terms of tools and resources for our research. However, there is not much opportunity for users to interact with the providers. It would be useful to have either user group meetings or web based forums, to enable users to interact with the providers of individual data resources.” - Prof. R. A. Dixon

“The databases I enjoy using, I would like that they have a user friendly interface that provides help in terms of tutorials, especially for the large databases, to get the most out of them. It is quite useful to have some pdf files, some power point displays where you can learn how to use it properly: online tutorials” – Prof. D. W. Burt

Finally, parallel developments in bioinformatics education and training will be fundamental to maximizing the benefits of such infrastructures.

“It appears to me essential to promote bioinformatics education in university programs for biologists. Nowadays, biology is using these tools and all biologists have to be trained in the use of bioinformatics tools.” - Prof. L. Duret

“Bioinformatics education is very important. Every biological degree should include a bioinformatics section. And also decent mathematics and statistics should be included as well.” - Prof. D. W. Burt

“With the next generation of sequencing, data generation is going to be gigantic, so we need to have a huge step forward in capabilities of analysis. My advice to anybody who has the possibility of deciding about the scientific policy is to invest into these capabilities of analysis as much as they can. So this means of course, bioinformatics and education of biologists in using bioinformatics to have a gradient of expertise. It would be great to have at the European scale, an effort that promotes the development of a critical mass of experts for data analysis. This is important for peer-review or selection of personnel. ELIXIR should synergize European efforts in order to enlarge the community of experts in bioinformatics.” - Prof. F. Rodriguez-Valera

The present study documented that, from the user perspective, the development and sustainability of leading-edge bioinformatics infrastructures have become vital to enable competitive and collaborative research in the life sciences. Survey respondents clearly stated the need for coordinated organization and funding of bioinformatics infrastructures in Europe. Such initiatives will provide the life sciences with essential research infrastructures and the European scientific community with essential e-infrastructure components to enable e-Science.

ACKNOWLEDGEMENTS

We wish to acknowledge the support and collaborative effort of the User Survey Working group (namely, Janet Thornton, Graham Cameron, Dominic Clark, Peter Stoehr and Rafael Najmanovich), the members of the WP3 Bioinformatics Community Committee (namely, Michael Ashburner, Carsten Carlberg, Bernhard de Bono, Jan Gorodkin, Elmars Grens, Sampsa Hautaniemi, Des Higgins, Inge Jonassen, Lubos Klucar, Sophia Kossida, José Leal, Michal Linial, Laszlo Patthy, Bengt Persson, Andrei Petrescu, Vasilis Promponas, Björgvin Richardsson, Ron Appel, Leszek Rychlewski, Dietmar Schomburg, Zlatko Trajanoski, Anna Tramontano, Alfonso Valencia, Yves Van de Peer, Antoine van Kampen, Ceslovas Venclovas, Jaak Vilo, Jiri Vohradsky and Blaz Zupan), the members of the WP12 “Infrastructure for Tools Integration” Committee (namely, Søren Brunak and Rodrigo Gouveia-Oliveira), Chris Southan (Member of the WP2 “The ELIXIR Strategy for Data Resources” Committee and Coordinator of the ELIXIR Database Provider Survey), the representatives of funding agencies and national bodies (namely Elmar Nimmesgern, representative of BMBF on ELIXIR; Alix de la Coste representative of French NCP Infrastructures; Rosa R. Bernabé, Deputy Director General for International Programmes, Spanish Ministry of Science and Innovation; Dr Adrian Pugh, representative of BBSRC, ELIXIR Programme Manager; Work Package 5; Dr. Elod Nemerkenyi, Assistant of International Affairs, Hungarian Scientific Research Fund and Antoine van Kampen, Scientific Director, Netherlands Bioinformatics Centre).

We are thankful to all the online survey respondents and interview candidates for their participation in this community effort.

REFERENCE

Perez-Iratxeta C, Andrade-Navarro MA, Wren JD (2007) Evolving research trends in bioinformatics. Briefings in Bioinformatics 8:88-95

APPENDIX I – Survey questionnaires

A. ELIXIR User Survey - Online questionnaire

1. ELIXIR User Survey - Common questionnaire

2. ELIXIR User Survey - Sub-questionnaire 1

3. ELIXIR User Survey - Sub-questionnaire 2

4. ELIXIR User Survey - Sub-questionnaire 3

B. ELIXIR User Survey – Interview questionnaire

APPENDIX II – Supplementary Tables

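Table 16 - Summarized comments to “Essential” responses

| Category | Indications of bioinformatics infrastructure needs | Response count | Number of countries |
|---|---|---|---|
| Research activities | To support research activities | 29 | 16 |
| Research activities | Current research activities are dependent on EU bioinformatics resources | 15 | 10 |
| Research activities | To support newly developing research approaches depending on bioinformatics resources | 6 | 5 |
| Research activities | To foster EU Biological research exchanges | 4 | 3 |
| Research activities | Current research activities are dependent on bioinformatics resources | 1 | 1 |
| Research activities | To foster advanced research in bioinformatics | 1 | 1 |
| Research activities | To foster research competitiveness | 1 | 1 |
| Bioinformatics resources support | To ensure resources maintenance | 4 | 4 |
| Bioinformatics resources support | To ensure stability of bioinformation architecture | 4 | 4 |
| Bioinformatics resources support | To face data volume and volume expansion | 3 | 1 |
| Bioinformatics resources support | To facilitate access to biological data | 3 | 3 |
| Bioinformatics resources support | To facilitate data exploitation | 2 | 2 |
| Bioinformatics resources support | To ensure continuity in database availability | 2 | 2 |
| Bioinformatics resources support | To overcome data/resources scattering | 2 | 2 |
| Bioinformatics resources support | To capture research data and allow analysis | 1 | 1 |
| Bioinformatics resources support | To centralize resources and ensure robustness, stability and quality | 1 | 1 |
| Bioinformatics resources support | To facilitate EU resources exploitation | 1 | 1 |
| Bioinformatics resources support | To support users of bioinformatics resources | 1 | 1 |
| Bioinformatics resources support | To facilitate access to bioinformatics resources | 1 | 1 |
| Bioinformatics resources support | To promote quality standard | 1 | 1 |
| Bioinformatics resources support | To promote database integration | 1 | 1 |
| Bioinformatics resources development | To support bioinformatics resources development | 5 | 5 |
| Bioinformatics resources development | To ensure resources development | 2 | 2 |
| Funding | To provide public financial support to EU DB | 2 | 2 |
| Funding | To respond to bioinformatics infrastructure funding issues | 1 | 1 |
| International competitiveness in bioinformatics | To provide another alternative to US and Japan international resources | 6 | 6 |
| International competitiveness in bioinformatics | To foster EU competitiveness and independency in bioinformatics domain | 5 | 3 |
| International competitiveness in bioinformatics | EBI is a unique and competitive international facility in EU | 4 | 4 |
| International competitiveness in bioinformatics | To keep the advantage of EU diversity and international competitiveness | 3 | 2 |
| Industry support | To ensure industry access to DB of quality | 1 | 1 |
| Industry support | To support industry activity | 1 | 1 |
| Knowledge dissemination | To support training and knowledge network | 2 | 2 |
| Knowledge dissemination | To support primary education | 1 | 1 |
| Advance of science | It is a priority | 2 | 2 |
| Advance of science | To promote EU science | 2 | 2 |
| Advance of science | To support Science | 1 | 1 |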

Table 16 - Summarized comments to “Essential” responses – (continued)

| Category | Recommendations about bioinformatics infrastructure | Response count | Number of countries |
|---|---|---|---|
| Infrastructure design | Should be a distributed infrastructure | 3 | 3 |
| Infrastructure design | Should include unique EU resources such as IMGT | 1 | 1 |
| Infrastructure design | Should focus on API development | 1 | 1 |
| Infrastructure design | Should be built on existing resources | 1 | 1 |
| Infrastructure design | Should address database comprehensiveness | 1 | 1 |
| Infrastructure design | Should address easy-to-use resources | 1 | 1 |
| Infrastructure design | Should include biodiversity informatics resources | 1 | 1 |
| Infrastructure design | Coordination with other infrastructure (e.g. international, related domains) | 1 | 1 |
| Infrastructure design | Should include software repositories | 1 | 1 |
| Infrastructure design | Should address standardization issues | 1 | 1 |
| Infrastructure design | Parallel computing is recommended | 1 | 1 |
| Infrastructure financing | EU financing system such as the Framework programs | 1 | 1 |

Table 17 - Summarized comments to “Important” responses

| Category | Indications of bioinformatics infrastructure needs | Response count | Number of countries |
|---|---|---|---|
| Research activities | Current research activities are dependent on EU bioinformatics resources | 2 | 2 |
| Research activities | Current research activities are dependent on bioinformatics resources | 1 | 1 |
| Research activities | To foster EU Biological research exchanges | 1 | 1 |
| Research activities | To foster research exchanges | 1 | 1 |
| Bioinformatics resources support | To face data volume and volume expansion | 1 | 1 |
| Bioinformatics resources support | To centralize data | 1 | 1 |
| Bioinformatics resources support | To ensure stability of bioinformation architecture | 1 | 1 |
| Funding | To provide financing resources to support national bioinformatics infrastructure | 1 | 1 |
| International competitiveness in bioinformatics | To provide another alternative to US and Japan international resources | 1 | 1 |
| Long-term perspective | In the long run only | 3 | 2 |
| Advance of science | It is a priority | 1 | 1 |

| Category | Recommendations about bioinformatics infrastructure | Response count | Number of countries |
|---|---|---|---|
| Infrastructure design | Should include resources such as PRIDE | 1 | 1 |
| Infrastructure design | Should cooperate with other international resources | 1 | 1 |
| Infrastructure design | Should address easy-to-use resources | 1 | 1 |
| Infrastructure design | Should be limited to databank repositories | 1 | 1 |

| Concerns expressed about EU bioinformatics infrastructure | Response count | Number of countries |
|---|---|---|
| The EU dimension is not required to respond to bioinformatics infrastructure needs | 2 | 2 |
| How EU infrastructure will address labs specific needs? | 1 | 1 |
| EBI is focused on some data type | 1 | 1 |
| Infrastructures are not the first priority | 1 | 1 |

Table 18 - Summarized comments to “Not relevant” responses

| “Not relevant” since | Response count | Number of countries |
|---|---|---|
| Rely on in-house resources | 1 | 1 |
| Rely on US resources | 1 | 1 |

Table 19 – Comments to databases of interest

About nucleotides databases

Recommendation to database providers

“In the light of large-scale new generation sequencing, databases to hold large-scale alignment of (re-sequenced) genomes will be essential.”

“Genome reviews will be really interesting for us if extended to many viridiplantae organisms.”

“Wouldn't it be a perfect world if we all would speak the same language: i.e. controlled vocabularies for biology?! With respect to the IMGT DBs, please involve informaticians to set-up these kind of databases”

“Comprehensive genome databases for bacterial and archaeal genomes and for Metagenome datasets would be useful (sensu CAMERA or IMG)”

Purpose of use

“For tool development”

“ENSEMBL is essential for our efforts in pathway mapping (wikipathways, pathvisio, GenMAPP)”

“These are where we retrieve from all the DNA sequence info for designing expression vectors for structural studies and also this is how we design primers for PCR amplification.”

Database knowledge

“Never heard of some of them!”

“I have never used any of these databases, and I am not familiar with many of them. I am typically using nucleotide databases at the NCBI.”

“I don't know/use IMGT/LIGM, IMGT/LIGM and ASTD”

“All but NCBI databases were unknown to me before seeing this. I had heard of FlyBase.”

Other comments

“Unfortunately we have no access to IMGT (since we are a commercial entity).”

“For gene indexes, names and identifiers Entrez Gene wins completely for usability. Ensembl wins for its API access. This is one exception to my previous comment, where competition has not greatly helped. The proposal for our group to move from using EG to Ensembl is probably going to make analysis/interpretation harder in future.”

About protein databases

Recommendation to database providers

“It would be helpful to further develop the concept of protein function in a temporal, tissue, and within-cell local (organelle, co-localization) context. At the moment, all evidence per protein is thrown together in an unstructured way in one record. This does not allow capturing information that one protein might have multiple roles, depending on space and time.”

Purpose of use

“I study sequences of SDR enzyme family”

“For tool development”

“These are where we get enzyme info from about different enzymes which are in our circle of interest and also sequence info to do multiple sequence alignments or pairwise alignments to check for single/multiple amino acid mutations we introduce for structure-function studies.”

Database knowledge

“I don't know the others”

“I don't know/use CluSTr, CSA, IntEnz”

Table 19 – Comments to databases of interest – (continued)

About protein databases

Comments about the questionnaire

“I lack the option to say which databases are in my view not useful!”

Other comments

“I am using GenBank (nr + env databases) rather than any of the European resources. I could use European resources if there was a direct counterpart of nr and env, and if the access was as easy as via the NCBI website. For me, the most important issue is to be able to run PSI-BLAST interactively on user-defined databases (e.g. nr + env) and download batch data (e.g. up to 1.000.000 sequences resulting from interactive PSI-BLAST searches).”

“No access to BRENDA.”

“The proteomics facility within our hospital could give more specific answers to this. Not a real "protein" database but we use The Protein Atlas a lot”

“There are other relevant databases of smaller size which are very useful for our research.”

“Essential in immunogenetics and immunoinformatics: Translation of IMGT/LIGM-DB IMGT/DomainDisplay”

“There are additional resources that I see as critical on protein levels”

About structure databases

Purpose of use

“For tool development”

“These are what we use to retrieve essential structure related info and where we publish our results to share it with the research community.”

Database knowledge

“Don't know and don't use these databases”

“I don't know/use PubChem and RESID”

“I have not yet worked at this level, and did not know any of these resources before seeing this.”

Other comments

“DBali is useful, too”

“I do not use them, but I may use them in the future”

“PubChem will grow in importance once the data content improves in size as well as quality (more annotated target proteins in particular)”

“Essential in immunogenetics and immunoinformatics: IMGT/ 3Dstructure-DB”

“The ASTRAL datasets of PDB entries is extremely useful.”

About pathway and network databases

Recommendation to database providers

“Reactome would be essential if it was functioning better and there were no constant unannounced modifications.”

“Better "high quality" interaction and pathway databases are essential in the public domain. Currently my group has to make use of private pathway analysis platforms such as those provided by GeneGo and Ingenuity as they have greater pathway and network content over what is available in the public domain. This is not a satisfactory situation!”

Purpose of use

“For teaching purposes”

Table 19 – Comments to databases of interest – (continued)

About pathway and network databases

Other comments

“I have only used KEEG”

“MetaCyc is also used frequently in our group”

“Also NCI-Nature Pathway Interaction Database is valuable. I have no clear idea of the overall overlap with others.”

“Biocyc also”

“Lacking in interoperability and re use is hampered by it”

“They all could be essential if they were filled with useful data. Now it merely a bunch of maybe pathways or odd collections of genes/proteins aimed to confuse us rather than help us.”

“BioCarta”

“We use Ingenuity Pathway Analysis”

“Panther”

“I expect these databases to become increasingly important as the volume and annotation of protein interaction data increases and the different interaction databases begin routinely sharing data with each other.”

“I have never used biomodels and intact.”

“I don't use BioModels, Reactome”

“string.embl”

“What about Cytoscape?”

About biology ontology databases

Recommendation to database providers

“If further integration and reasoning by ontologies is to be achieved much more effort has to be put into the ontologies themselves”

Purpose of use

“Essential in immunogenetics and immunoinformatics: IMGT-ONTOLOGY”

“For tool development”

Database knowledge

“My background as a naturalist taught me the essential parts of taxonomic procedures in science and its communication, together with the need to keep these concepts flexible. I did not know of any of these databases before seeing this.”

Other comments

“NEWT = NCBI taxonomy (almost)?”

“We use Gominer”

“We use a lot of other ontologies, such as anatomies, phenotypes, etc.”

“They all could be essential if they were filled with useful data. Now it merely almost random information aimed to confuse us rather than help us.”

“I would like to see more Ontologies”

“we miss disease and phenotype ontology”

“All are fundamental for functional/ chemical knowledge”

About literature databases

Recommendation to database providers

“Here we would like to see more efforts. We are missing disease databases (OMIM is essential, but in bad shape). A fact database (extracted facts from pubmed) would also be important). Signal transduction is also a field with a good database”

Purpose of use

“For teaching, training and references”

“This is what we use for all research article searches and full-text mining.”

Table 19 – Comments to databases of interest – (continued)

About literature databases

Database knowledge

“Didn't know about CiteXplore and EBIMED, but now that I have read their functions, am keen to see if they can be useful, so as I have not used them yet I can only say they sound useful from descriptions”

“I hadn't heard of CiteXplore before, but will be looking at it after this...”

“I only knew before seeing this PubMed and OMIM, the latter I used only once (but gave me important information).”

Other comments

“The interface to PubMed is terrible. There's so much more that could be done with it, including more intelligent ways of doing merged searches.”

“We are not into text mining very much right now. However, references are essential as evidence to observations and as such essential.”

“WEB of Knowledge, WEB of science, Medline, Food Sciences & technology, CAB abstracts”

“It would not surprise me if text mining would be the next major confusion factor in omics.”

“ISI web of knowledge regularly used”

“other databases used SCOPUS”

“The information contained in OMIM is essential but it is impossible to do any computational analysis, so it needs a major overhaul”

“Not currently used. Used are databases of publishers primarily.”

“ISI Web of knowledge is used frequently”

“I also use GOpubmed and connotea/citeulike. These two are not primary databases, but they have derived data and so they could be considerate databases.”

About genome, proteome and transcriptome databases

Recommendation to database providers

“Integr8 needs a major overhaul to the way it presents the information into a more helpful view for the ordinary biologist.”

“Improve ArrayExpress usability. Synchronize content across GEO and ArrayExpress DBs”

Purpose of use

“For teaching”

Other comments

“I have only used Integr8”

“More public and free ones would be great - it seems this is an area where the companies are ahead of academia.”

“PRIDE and SWISS-2DPAGE are probably useful for future, but at the moment, even if we know these databases and are implicated in several proteomics projects, we don't know how to use them efficiently.”

“The proteomics resources are now developing and PRIDE and SW-2D are critical for the field”

Table 20 – Other (summarized) challenges with bioinformatics databases

| Category | Challenges | Database examples | Response count |
|---|---|---|---|
| Database content | Database identifier instability | Pfam, UniProt | 3 |
| Database content | Data quality (e.g. annotation errors and incorrect data) | | 3 |
| Database content | Data "noise" and fragments in sequence databases | trEMBL | 1 |
| Database content | Data cleaning | | 1 |
| Database content | Incomplete or patchy databases | | 1 |
| Database content | Intelligible annotation and interpretation of results would be useful for many non-expert users | | 1 |
| Database content | Naming issues | | 1 |
| Database content | Outdated databases cross references | PDB vs. | 1 |
| Database content | Truncation of gene/protein sequence which are never clearly indicated in databases | UniProt | 1 |
| Database content | Correlation between data in databases e.g. ENSEMBL, dbSNP and GO. | | 1 |
| Database content | Data consistency | | 1 |
| Database content | Database updating | | 1 |
| Database access | Automated access, for example via web services | | 1 |
| Database access | Computational capacity for online request | | 1 |
| Database access | Computer readability | | 1 |
| Database access | Lack of access to data (via ftp) | | 1 |
| Database access | Lack of access to database table schema | ENSEMBL | 1 |
| Database access | Parsing issues - use of improper delimiters, changing database structure, changing support and update frequencies | | 1 |
| Database functionalities | A simple automatic tool for database download (avoiding issues FTP file names tracking from one release to the next) | ENSEMBL | 1 |
| Database functionalities | Downloading full databases (smaller ones) | | 1 |
| Database functionalities | Limited database query functionalities | | 1 |
| Database functionalities | Software bugs | | 1 |
| Database information and knowledge | Lack of database documentation | | 3 |
| Database information and knowledge | The knowledge and know-how to keep up with the latest developments | | 1 |
| Database information and knowledge | Too many databases to be aware of | | 1 |
| Database integration | Data integration across databases | HPRD or BioGRID, | 4 |
| Database integration | Accession code incompatibility with other (non EU) databases | | 1 |
| Database integration | Cross-database queries | | 1 |
| Database integration | Format compatibility across similar databases | EBI and NCBI | 1 |
| Database integration | Lack of md5/crc based validation of correct database cross references | | 1 |
| Database integration | Lack of resources compatibility and integration from main providers | EBI and NCBI | 1 |
| Database integration | Proliferation of new specialized databases without clean integration in existing resources and without long term stability | | 1 |

Table 20 – Other (summarized) challenges with bioinformatics databases – (continued)

| Category | Challenges | Database examples | Response count |
|---|---|---|---|
| Database integration | Redundancy across similar databases | | 1 |
| Database integration | Residue numbering across sequence databases | PDB, SwissProt | 1 |
| Database integration | Semantic integration. | | 1 |
| Database support | Continuity of support to newly developed resources of interest | | 1 |
| Data access | Accessing the databases through web services | | 1 |
| Data access | Data dissemination | | 1 |
| Data access | Data sharing | | 1 |
| Data access | Mirroring the most important databases locally. | | 1 |
| Data access | Slow online access | | 1 |
| Data visualization | Database query or analysis tool output not user-friendly | | 1 |
| User support | Dead helpdesk | | 1 |
| User support | Lack of recognition for reporting errors | | 1 |

Note: most of the comments (45 out of 57) were given by respondents belonging to bioinformatics, computer sciences and/or maths. Other comments were given by respondents from agriculture, biology and/or medicine.

Table 21 - Other (summarized) improvements needed for bioinformatics databases

| Category | Improvements | Database examples | Response count |
|---|---|---|---|
| Database content | Clear description of cross-references for identical sequences between different or within the same database (e.g. UniProt) | UniProt | 1 |
| Database content | Clear description of the origin of the sequence in particular in term of taxonomy (Phylum, genus, species...) (e.g. UniProt) | UniProt | 1 |
| Database content | Continuity in curation effort | | 1 |
| Database content | Data coverage (e.g. chemical database, number of prokaryote organisms) | PubChem, ChEBI | 4 |
| Database content | Improvement of literature data structure such as disease and mammalian phenotype ontologies | PubMed | 1 |
| Database content | Updating frequency | | 1 |
| Database access | High quality accesses to the databases to avoid local installations | | 1 |
| Database access | Database access is a priority | | 1 |
| Database access | Open access policy | | 1 |
| Database access | Remote access with web services and local caching | | 1 |
| Database access | Standardized APIs for database access. | | 1 |
| Database user interface (GUI) | Intuitive GUIs | SRS | 3 |
| Database information and knowledge | Navigation and advanced query instruction | | 1 |
| Database information and knowledge | Tools for keeping up-to-date with existing databases and servers | | 1 |
| Database standard | Universal naming | | 1 |
| Database standard | Common standards for ontologies. | | 1 |
| Database standard | Definition of quality standards for functional annotation | RefSeq | 1 |
| Database standard | Standardized formats/interchange formats | | 2 |
| Database support | Lack of support for the primary sources | | 1 |
| Database support | “Niche-databases” should be maintained 'somewhere' once the developers lost stamina | | 1 |
| Data access | Improvement of data exchange | | 1 |
| User support | User support: ability to give simple feedback | Ensembl | 1 |

Note: most of the comments (42 out of 51) were given by respondents belonging to bioinformatics, computer sciences and/or maths. Other comments were given by respondents from agriculture, biology and/or medicine.

Table 22 - Other (summarized) development needed for bioinformatics databases

Comments | Response count

Database content
More comprehensive database | 1
More data coverage | 1
Improve genomics annotation | 1
Insert links to all the papers which use that set of data for benchmarking purposes | 1
More types of published data should be captured in databases | 1

Database functionalities/tools
Better search option | 1
Biodiversity databases associated tools | 1
Expanding the capacity of data mining, text mining | 1
Flexible rdf dumps | 1
High throughput storage and analysis | 1
Statistics | 1
Tools for using XML/RDF/OWL formats (e.g., BioPax) | 1
Novel method for the visualization of data mining results which would allow understanding of non-specialists | 1
Result from data query in standardized tabular format, that can easily be integrated into in-house solutions | 1

Database user interface (GUI)
Intuitive user interface | 1

Database access
Access to several databases from similar interfaces (e.g. Biomoby) | 1
Easier access to databases for untrained people | 1

Database information and knowledge
Add recurrent services with notification by e-mail | 1
Better advertisement of databases | 1
Documentation procedures for database changes (e.g. format) | 1

Database integration
Database integration is a priority | 1
API development and web services | 4
Connectivity between model species and non-model species | 1
Data mining across databases | 1
Databases international collaboration | 1
Improvement of existing resources and interoperability rather than new developments | 1
Integrate databases with tools (central DB server <--> mirror DBs with specific tools) | 1
Integration of existing databases | 2
Standardized data access method | 1
Support of a common indexing system (SRS or MRS like) allowing for distributed querying with redundancy integrated to avoid downtime | 1

Database standard
Standardized formats and identifiers | 1
Standardized output files | 1


Table 22 – Other (summarized) development needed for bioinformatics databases – (continued)

Comments | Response count

New databases
Community-driven databases and tools that respond to the specific needs of research groups | 1
Database for translational research needs | 1
Database result-oriented for biologists!!! | 1
For functional data (e.g. RNAi, Assays, effects of compounds etc) | 1
For metagenomics | 1
For nuclear microsatellites | 1
New databases using appropriate standards (formats, description terms, export formats, etc.) | 1
Should only be allowed (i.e. published by journals) if they genuinely provide something that an existing database can/will/does not. | 1

Data quality
Data quality is a priority | 3
High quality databases with a wide range of options for the power user | 1
Serious curation and validation | 1
Data quality and traceability | 1

Data access
Make data available using easy scripts | 1

Note: most of the comments (44 out of 55) were given by respondents belonging to bioinformatics, computer sciences and/or maths. Other comments were given by respondents from agriculture, biology and/or medicine.


Table 23 – Comments to challenges with bioinformatics tools

Comments

In general
“A large part of the bioinformatics process is to change the output of one program, combine it with another program and pass the result as input into a third program. No portal or single application will ever be able to deal with the diversity required when solving problems.”
“This step takes up about 90% of my research time. Enough said.”
“Formats are very inter-convertible in general.”
“Sometimes it is a big issue. We try to avoid doing it, but it is not always possible.”
“This is often the major challenge”
“I co-chair the Semantic Web Health Care and Life Sciences Interest Group, which gathers researchers who are addressing this problem. (http://www.w3.org/2001/sw/hcls/)”
“It depends of the tools, if its output is standard.”
“In recent years this aspect was improved but it is still not available for 90% of the tools”
“More than 50% of my time...”

Challenges
“Transformation is just one part of the problem. In large scale work, substantial consistency checks are required to detect corruption in tool output etc.”
“The file formats in certain preferred semi-commercial tools are sometimes very difficult to convert to non-profit tools.”
“Problematic especially with the proprietary software”
“Most painful step in data-mining piping”
“Gene IDs/annotations have to be changed almost always”
“Depending of the database input format needs some preparation”
“To export data from DIAMIC to tumorotek is very complicated ...”
“Frequently data format changes or mistakes are observed that we feed back to database providers”
“File formats often need to be adjusted in very specific ways”
“Most of the effort is accession number linking/conversion. […]”
“The problem is many people program python/perl script to transform data to different formats, but they don't test the tools they have written, so there is no way to know if their result is good.”
“EMBOSS is particularly bad in this regard.”
“Sequence format conversions are still a bit of a pain - especially with respect to limitations on sequence names. This problem is passed on to phylogenetic format limitations on names (Newick format).”
“Convenient, open-source conversion and analysis utilities are not always a hand. Moreover, most bioinformatic/crystallographic software we use, need some specialised input to run, not always well documented (but this is more the problem with scientific software than with databases).”

Solutions brought by methods and tools
“No problems for the 15 on-line IMGT tools available on Internet as they use the same ontology (IMGT-ONTOLOGY).”
“With the availability of EMBOSS this efforts has been reduced to a minimum.”
“What about developing some common format (like XML based)? A good example - fasta format. Common and easily understood. Of course, outputs from different databases/queries/tools differ, but some common XML format could be thought of. Some fields could be redundant, but it would be easier to remove them, then each time to convert from one format to another.”
“We have a software (BioMAJ, http://biomaj.genouest.org) which do the job automatically (post-processes provides all needed different formats)”


Table 23 – Comments to challenges with bioinformatics tools – (continued)

Solutions brought by methods and tools
“We had to develop our own system of data formatting in order to be able to transfer data between databases and tools (e.g. from GenBank to programs for alignment to programs for tree calculation to programs that display phylogenetic distribution).”
“This is a faithful question. Usually we have to write computer programs that does this job. It is especially relevant for genetics programs.”
“With Perl.”
“Perl is a necessity to retrieve output from one program and streamline it for subsequent analyses. I wonder if it would be possible to make a more universal format.”
“I use Perl programming to connect Genbank Results, MySQl Databases, BioNumerics software, ClustalW, local BLAST....”
“We tend to use perl, including BioPerl, Emboss (especially seqret) or simple shell commands to do the majority of the necessary transformations.”
“Perl scripts are in place to do just that.”
“Workflows are a function we are testing.”
“e.g. I have written several scripts to transform IntAct xml format into tabulated, etc, formats - but maybe this is part of what we call bioinformatics, after all.”
“Although it costs some effort, writing parser script is almost always easy and quickly done.”
“We have a database of scripts for parsing one output format to another format.”

Solutions brought by appropriate expertise
“If necessary we have access to support entities to help us using/updating tools and databases.”
“We have a software research infrastructure doing it for us. Any workflow consisting of 'external' (not our own) tools will contain almost as many format converters. It is a central issue and a bottleneck for efficient analysis method development.”
“We are bioinformaticians. What may be little effort for us may be unfeasible for others.”
“Not a big problem, since we know how to program ourselves.”
“My students and I are highly skilled programmers.”
“Installing different databases and software tools in an integrated fashion is precisely an important activity of our group (the Belgian EMBnet Node).”
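
Several comments in Table 23 describe small format-conversion scripts and point out that such scripts are rarely tested. As an illustration only, and not something taken from the survey responses, the following minimal Python sketch converts FASTA text to a tab-separated layout and checks the result with a round trip. The function names (`parse_fasta`, `to_tabular`, `from_tabular`) and the two-column layout are assumptions made for this example.

```python
"""Minimal FASTA <-> tabular conversion with a round-trip consistency check."""


def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_tabular(records):
    """Serialize records as one 'identifier<TAB>sequence' line per record."""
    for identifier, _ in records:
        if "\t" in identifier:
            # A tab inside the identifier would corrupt the column layout.
            raise ValueError(f"identifier contains a tab: {identifier!r}")
    return "\n".join(f"{identifier}\t{sequence}" for identifier, sequence in records)


def from_tabular(text):
    """Parse the tabular form back into (identifier, sequence) pairs."""
    return [tuple(line.split("\t", 1)) for line in text.splitlines() if line]


if __name__ == "__main__":
    example = ">seq1 example record\nACGT\nTTGA\n>seq2\nGGCC\n"
    records = parse_fasta(example)
    table = to_tabular(records)
    # Round-trip check: converting back must reproduce the parsed records.
    assert from_tabular(table) == records
    print(table)
```

The round-trip assertion is the kind of consistency check the respondents ask for: it does not prove the converter correct, but it catches silent truncation or column corruption before the output is passed to the next tool.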


Table 24 – Comments to resources development for bioinformatics tools

Comment

In general
“Format compatibility is essential”
“A strong effort should be made to make these tool Open Source projects, allowing for adaptability and tailoring to the users needs.”
“For my group it is essential that our tools will also have a visibility. There are many groups who develop software, and their use should also be promoted. Funding for new methods and tools development, as well as maintenance needs to be somehow solved.”
“Such facilities already exist via the national nodes of the EMBL network, but this is not widely known or used.”
“Often weak/absent is the possibility to export results in simple tabular format for reformatting and input to the next tool (html format can be a pain, I do not speak of JAVA graphical outputs or PDF-tables).”
“Parity between different bioinformatic tools would be very helpful. Another concern I currently have is the widespread duplication of very similar efforts e.g. TOPP, TPP, GAPP, which overall do pretty similar tasks but with differing perspectives.”
“Make the existing tools more available, or share them”
“It would be great to have a unified platform to do most of the bioinformatics.”
“We use mostly in-house code.”
“My research group develops statistical models and uses other bioinformatic tools only for comparisons.”
“EBI and friends are providing very useful databases (with downloading options etc.) already, so the current status is also reason for some praise I would say.”
“We generally make our own tools and pipelines because even though there are lots of existing tools out there, they are not always easy to find, use and/or understand exactly what they do. The number of small applications can be quite overwhelming when looking for a new tool.”
“Better integration of information, e.g. Ensembl could adapt some nice features of flymine.org: saving gene list, testing gene lists for enrichment, etc.”
“Experience with Taverna has shown that even though a fascinating concept, SOA for bioinformatics services is not mature yet and causes way too much unnecessary overhead.”
“Integrated databases that could be freely queried on the basis of their schema without being limited to very few predefined queries.”
“The dream is: to have all bioinformatic, crystallographic, NMR, EM processing algorithms implemented as free, open source libraries, with easy possibility to integrate any algorithms and methods with any database, search tool, visualisation software... In house, on the fly, on my own computer.”

About single portal for bioinformatics tools
“Again, I'm worried that a huge amount of effort will be committed to the "single web page" which has all its parameters hidden away so as not to scare the occasional wet lab person who deigns to use a tool he/she probably doesn't understand in any case, while power user tools are always an afterthought. In reality, these are where the effort should be focussed upon.”
“We design software, so if something small is missing it often is faster to rewrite it than to find it. A single portal would only be useful if it is really a single portal that lives for a long time. Pedro's list in the mid 90's was useful because it was complete, lived for a few years, and was without competition.”
“We are working in the aforementioned by developing our own open portal for microarray experiments analysis called GRISSOM (http://195.251.6.234/biodatagrid/new/), which will also offer links to certain other open biological databases providing either data information or tools for further meta-analysis of the results of microarray experiments. A significant novelty regarding our implementation, has to do with the adjustment of the algorithms of our pipeline and of the database in Grid Environment, that is in operational environments that comply with the rules and philosophy of distributed computing, thus securing sheer acceleration of the computational procedure.”
“I don't believe a single portal ever will exist - it would be nice but all "one shop" attempts have failed because competition is good! NCBI tries to be that, but I go to Genome UCSC or Ensembl. for some information, and to NCBI for others.”


Table 24 – Comments to resources development for bioinformatics tools – (continued)

About single portal for bioinformatics tools
“Some portals have overlapping tool. Some time It is difficult to choose the proper tool.”
“A single portal is not necessarily the best solution; the experience of IMGT has shown that the development of high quality tools for immunoinformatics in constant evolution was only possible on the IMGT platform. As a user, I do not mind to go to EBI for nucleotide; to UniProt for protein and to IMGT for immunoinformatics.”
“I don't believe that a single portal will reflect the real functionality of the tools. Tools are developed by different groups, have different functionalities and options. In order to maintain the originality and flexibility, there is a clear need of independence of the developer groups, but of standardization of the exchange formats and a concrete step towards inter-operability is needed.”
“Whether everything is located one place doesn't matter as long as "they" speak the same language.”
“A database/portal of all bioinformatic tools sounds great but will be difficult to maintain. I would like it, but don't know if it is feasible. I understand here a portal that provides the code, not a portal that provides the tools to run.”
“A single portal for bioinformatics tools is a 'BEAUTIFUL' idea, one place, all the data and tools available, its very nice thing.”

Standardized benchmarking of bioinformatics tools
“There must also be room for 'improvisation'. It would not do to have an overly rigid, virtually castrating entity interested in standardizing to the point of stifling innovation.”
“Already carefully cleaned benchmark data sets would be fantastic. A major issue with these is that a "gold" standard is very hard to have for any question of biological relevance, particularly reliable NEGATIVE records are usually not provided (e.g., for protein-protein interaction, not just what proteins are known to interact, but also, a set of proteins that are known to NOT interact)”
“Brilliant program. Add to this some standards around analysis with each tool are the value would be difficult to gauge, but enormous.”
“What would a benchmarking for bioinformatics mean???? How do you benchmark applicability? Benchmarking often focuses only on informatics characteristics.”
“What also could be of interest is a rating-system for tools, by users and citation in application papers.”
“I'm not sure what you mean by 'standardized benchmarking' what is needed is standardized data formats for measurements and data representation.”

Programmatic access to bioinformatics tools
“Programmatic access is absolutely essential but we only needed this once.”
“Programmatic access is one of the most important things to speed up use of these databases/tools.”


Table 25 – General comments

Comment

User needs
“I think I've banged on enough about how poorly the power user is being treated with respect to tools and databases.”
“Always keep thinking about your "clients". They are biologists!!!!!!!!! There is a general tendency that bioinformatics is slowly taken over by technical oriented people (like statisticians, and chemometrists). As such there is an increasing gap between the day-to-day practice of a biological researcher and the tool developer. This is understandable, as it is scary to work in biology since omics made our life quite complicated. But the answers are not in improved tools, but in a new approach to biological research per se. First we have to know what we want to do, then develop the tools we need.”
“By all this work, please do keep your focus on the needs of biologists. I have a feeling that the bioinformatics field slowly but surely is hijacked by technically oriented people who may know a lot about all kind of fancy data analysis, but very little about biology. Because their interest is obviously technical oriented, there is an increasing gap between the needs of the biologists and the actual work and (needed) infrastructure for bioinformaticians. Just as an example, try to find "simple" things like all rRNA sequences of a specific organism, or the length of all genes in a prokaryote, or all time-series experiments with > 20 arrays in GEO etc. You will experience that the bioinformatics infrastructure is focused on complicated questions, methods and tools, totally ignoring that biologist often have quite simple, but essential questions. Also, all bioinformatics success depends on the quality of experimental data. So, I focus on data curation and data quality control should become available as soon as possible. And defining just (utterly complicated) standards is not enough. Well I could go on for quite a bit about controlled vocabularies, text mining in out-dated literature, non-sense GO categories, etc, but I think you'll get the picture.”

Bioinformatics infrastructures importance
“Bioinformatics brought a cultural revolution in biology and medicine and it is essential for the progress of these sciences. It is so rapidly expanding that often biologist experts in other fields have difficulties to use informatics tools.”
“Existence of several main bioinformatics centres with a critical mass in Asia, Europe, and the US seems to be vital for innovation in and advancement of the field. Very comprehensive survey, but some sections could be structured better.”
“With new large scale data generation like next gene sequencing, data-bases and connection to them will be a big challenge. New intuitive ways to display biological systems need to be developed.”
“The challenge is not only technological but also cultural. Bioinformatics has become an integrated scientific skill base within genomics and biology; the key goal is to align bioinformatics and "wet" biology. Standardization and harmonization is required to keep bioinformatics (and biology) affordable. Setting up a sustainable, harmonized and integrated bioinformatics infrastructure is an essential initiative to guarantee progress in life sciences and health care.”
“I think standardization and integration of databases and tools is vital for bioinformatics and this can only really be achieved through large, well-supported centres such as EBI.”

Infrastructures role and features
“Bioinformatics is essential in education of young scientists. I would like to see more structure in the development of educational tools.”
“Yes, I think there should be a consortium of all such database users appreciating and acknowledging open access. Common conferences/meet-ups would really be wonderful to exchange and mask bioinformatics research. That should also be welcoming for biologists. Regards. Prash”
“Reliability of the data is of the utmost importance.”


Table 25 – General comments – (continued)

Infrastructures role and features
“Promoting Open Source solutions in bioinformatics both in Academia as well as in industry, and the establishment of Open Standards for data exchange should be paramount in all efforts to future development of the area.”
“For a European Infrastructure to be justified, it needs to offer more than already existing installations, say at the NCBI. There is no point in having a copy of functionality. Points to consider are: - fully transparent operation - involving stakeholders / users - support of local installations - integration of add-ons developed in the community - quality and release management”
“It must be as convenient and user friendly as possible to allow wide spread use.”
“We need a hosting service for online data sources so that individual labs do not have to host their own servers and everything is maintained in one place. Perhaps charging industry to use public databases should be considered to augment funding.”
“More efforts should be put in elaborating good practices in using bioinformatics tools. I think that the 'in silico' part of an experiment, is not different from the 'traditional' one: so you should test your tools by using control samples, and elaborate test units to be sure that you are not messing with it, or at least know how to individuate errors. In the future I would like to put on a society to provide bioinformatics support as if it were 'technical support'. In shorts, I believe scientist shouldn't use bioinformatics tools too much, if they are not properly trained, because they are not able to prove if they are making errors in their workflows.”
“Despite the huge number of bioinformatics databases out there, there is still both ample room for improved interoperability and the urgent need to do so, esp. for systems-oriented approaches that seek to integrate very diverse kinds of biological data. The process of collecting and (re-)storing this information locally is time consuming and could be much improved and sped up by centralized portals and standardized data formats facilitating subsequent processing.”
“There exist a plethora of databases and it is very difficult to keep up with their similarities/differences pros and cons. It would be great if DBs were linked at a higher level that would allow semantic queries that would be answered by the most appropriate DB transparently to the user. This sort of meta-layer is missing and it is a big problem. Furthermore, when a data analysis paper is published e.g. in proteomics, the corresponding dataset should become available in some data base for people to use and compare their methods against. Even today it is still very difficult to find annotated proteomics datasets (e.g. 2DGE gels with ground truth).”
“DB is only one side of the equation. Application framework, unified way for running algorithms and unified way for visualizing and reproducing a bioinformatics workflow on large datasets (without downloading them locally, i.e. with distributed file systems via remote access) is the second.”
“We give access to international databases on our system and are interested in a simple and quick way to retrieve and update these databases (local / continental mirrors). A standardized database versioning system would be useful. The software portal could also provide related tutorials.”
“A big problem is the computing power. For example, analyses in MrBayes takes days or weeks for one dataset (we have 8 nodes available). If we want to analyze different datasets, we have not enough nodes to do that in an acceptable time. Computer power available somewhere else that can be used can be interesting.”
“Standardization and interoperability should not be restricted to intra European tools. It should include indispensable tools such as PubMed, MGD, KEGG, etc.”


Table 25 – General comments – (continued)

Infrastructures role and features
“It is essential that EBI can support bioinformatics across Europe in a coordinated fashion; however, it is useful if not all resources have their homes in Cambridge UK, so that local expertise can be taken advantage of.”
“A better way to port output from one tool to use as input in other web based tools would be an important step in making things simpler for a majority of users who do not have significant background.”
“Some support of "networking" with other researchers would be a useful part of the infrastructure.”
“The remark about funding made earlier is important.”

Resources for the future infrastructure
“Our group would warmly support the incorporation of its GRISSOM portal (http://195.251.6.234/biodatagrid/new/userlogin.php currently under development - end Dec 2008), plus the interfaces it has developed for exploitation of the Grid infrastructure, which is unfortunately pretty barren of algorithmic tools or database solutions regarding processing of biological data, as well as of the lessons it took from this implementation, to the network of European Bioinformatic Infrastructures.”
“There is a direct need throughout Europe to standardize and ease access to human mutation databases (also called locus-specific DBs, used in molecular diagnostics).”
“A pity that only mammalian research is inquired. What about bacteria?”
“A public, well maintained sequence workbench like former GDE (Genetic Data Environment) would be most welcome. I also would love to have a mini-SWISSPROT that I can take with me on my laptop (maybe selected species only). In the old days that was available, at least for the EMBL Nucleotide Library. Mobile (simplified or condensed) versions for other databases would be nice as well, including PDB.”
“In general, disease data is somewhat neglected and is mainly available as unstructured literature articles. We are concerned that lots of results are published but the underlying data is not (or only as supplementary info). Think e.g. of mutations and their effects on function or structure. There are no databases for these data.”
“An evaluation of commercial providers of bioinformatic tools would be of advantage. For example: applied maths. On line offered courses would be of advantage. Computer language is still a problem for non informaticians”
“There should also be some details such as structural biology, basics of gene, protein sequence included it will become added advantage.”
“The analysis of sequence diversity, allele variability --> phenotypic variation is under investigated. Most research is focusing on 1 dimension of sequence (the genome once), rather than the 2nd dimension (the haplotypes/alleles across individuals). The HapMap project is teaching the relevance of this issue.”
“Suggestions for "my perfect world" of DB: * DB may be housed at different institutes (for example for different focus) * DB may be available in mirrors (for massive data download) * standardization in terms of: - data formats - API access - import/export format - inclusion of own data (project data)”
“Have a look at MRS: mrs.cmbi.ru.nl. Very nice and efficient tool for databank management.”
“Please consider existing database layout & design before inventing something new.”

ELIXIR initiative
“Excellent initiative.”
“Please keep going and help us survive and evolve in the W(j)ungle.”


Table 25 – General comments – (continued)

ELIXIR initiative
“As we are also heavily involved in bioinformatics database development we greatly appreciated to become more involved/connected.”
“This is a good initiative. As a mostly naive but heavy user of certain databases and tools I believe standardization is vital if we wish to progress further. Now the situation is that only the people with access to the knowledge can use the databases. You need a bioinformatics person only to start touching the databases.”
“ELIXIR can create a dynamic for bioinformatics at the European level. Getting funding for European infrastructure is indeed crucial. Whereas EBI is at the core of that infrastructure and should be supported, specialized systems such as IMGT (6 databases, 15 on-line tools) with a unique expertise (immunogenetics and immunoinformatics) and a successful story of 20 years cannot be put aside and should be acknowledged as part of that European infrastructure.”

This survey
“It is very nice survey; I am looking forward to see the ELIXIR comes into life.”
“Very good survey, hope it will serve to improve the European bioinformatics.”
“The survey is needed and welcome. I hope it is used to improve the situation, which is not bad and improving.”
“Beginning was a little bit "too slow" - single questions/page. And then the main content was quite a bit more serious.”
“Relevant and interesting questions. I would be happy to provide further input.”
“I believe that asking a single member of a group to respond to this survey, (and I assume that you are then analyzing the results based on this assumption), will deter people from responding at all, and may possibly lead to misleading results as those that do respond may well be from the same group anyway. Some of the questions needed more options for answers. Although this makes the survey harder to analyse in some ways, the lack of any appropriate answer will likely lead to misleading results or end up with quite a bit of missing data.”
“I think some of my people need to fill in this survey, I don't think my answers are very useful”
“Simply an apology for not adding more information... there are things I could have said, but nothing burning, and I have a pile of deadlines. I would, however, have liked "Yes, mostly" as an answer to the question on database satisfaction - it is far nearer my opinion than either "Yes, totally" (interpreted as 100%) or "Yes and no".”
“I got the impression there were some weird omissions.”
“Some questions were difficult to understand.”
“"Bioinformatic tools" is not very specific; could you give a list of categories?”
“It is not entirely clear what is defined as bioinformatics resources as used with the first question. It is also a problem that declaring intentions of new databases may be in conflict with the wish to publish these prior to discussing them.”
“I would add some questions for additional users: i.e., 1. Which DB and tools are more important for Bioinformatics teaching. 2. Are you using DB that are commercial or will you pay for such tools. 3. I would ask on tools/ querying system like BIOMAT”