deliverable 9.5 - phenomenal · msconvert and the integration into the continuous integration...



Deliverable 9.5.1

Project ID: 654241
Project Title: A comprehensive and standardised e-infrastructure for analysing medical metabolic phenotype data.
Project Acronym: PhenoMeNal
Start Date of the Project: 1st September 2015
Duration of the Project: 36 Months
Work Package Number: 9
Work Package Title: WP9 Tools, Workflows, Audit and Data Management
Deliverable Title: D9.5.1 Updated Preprocess Virtual Machine Image
Delivery Date: M32
Work Package leader: IPB
Contributing Partners: IPB, ICL, EMBL-EBI, SIB
Authors: Evangelos Chandakas, Tim Ebbels, Pablo Moreno, Steffen Neumann, Kristian Peters, Rico Rueedi, Daniel Schober

Abstract

This deliverable reports on the development and use of container images to enable data producers to (pre-)process raw data into standard community-supported formats, locally or in the cloud. This deliverable is an update to “D9.2.1 PhenoMeNal-Preprocess VM”.

Table of Contents

1. Executive Summary
2. Contribution towards the project objectives
3. Detailed report on the deliverable
3.1 Background
3.2 MS Data Preprocessing
3.3 NMR Data Preprocessing
3.3.1 nmrmlconv Container
3.3.2 User documentation and training
3.3.3 BATMAN
4. Delivery and Schedule
5. Conclusion


1. Executive Summary

The PhenoMeNal project supports several of the most common workflows in metabolomics covering, amongst others, NMR and Mass Spectrometry data processing. The diversity of instrument vendor-specific data formats requires preprocessing tools to convert data files from proprietary formats into standardized open formats – for instance, converting raw MS and NMR data files into the mzML or nmrML formats, respectively.

This deliverable reports on the development and use of container images to enable data producers to (pre-)process raw data into standard community-supported formats, locally or in the cloud. This deliverable is an update to “D9.2.1 PhenoMeNal-Preprocess VM”.

2. Contribution towards the project objectives

The deliverable has contributed towards the following project objectives for WP9:

● Specify and integrate software pipelines and tools utilised in the PhenoMeNal e-Infrastructure into VMIs, adhering to data standards developed in WP8 and supporting the interoperability and federation middleware developed in WP5. We will develop new applications only to complete ‘missing links’ in pipelines. We will use public repositories and continuous integration to always provide development snapshots of the infrastructure VMIs.

● Develop methods to scale-up software pipelines for high-throughput analysis, supporting distributed execution on e.g. local clusters, private clouds, federated clouds, or GRIDs.

3. Detailed report on the deliverable

3.1 Background

The PhenoMeNal project supports several of the most common workflows in metabolomics, covering NMR, Mass Spectrometry and downstream statistical analysis. The diversity of instrument vendor-specific data formats creates incompatibilities between processing tools and strategies that can be avoided by using community-accepted open standard data formats starting from the instrument level. Hence, PhenoMeNal supports preprocessing tools to convert data files from proprietary formats into standardized open formats – for instance, converting raw MS and NMR data files into the mzML or nmrML formats, respectively.


This deliverable is an update to “D9.2.1 PhenoMeNal-Preprocess VM: Virtual Machine Images to enable data producers to locally process raw data into standard formats supported in PhenoMeNal” [1].

3.2 MS Data Preprocessing

For the conversion of vendor formats to mzML we use the open source tool msconvert, developed by the ProteoWizard team (http://proteowizard.sourceforge.net) [2], which is one of the reference implementations for mzML. It can convert to mzML from Sciex, Bruker, Thermo, Agilent, Shimadzu and Waters formats as well as from earlier open formats such as mzData and mzXML, and is consequently widely used.

The msconvert tool and its integration into the Jenkins continuous integration system and into Galaxy as a tool have already been described in D9.2.1. Since then, we have worked mainly on the container image. The container was rebased from the seemingly unmaintained suchja/wine:dev image, which ships a pre-installed WINE (v1.8.0) Windows compatibility layer, to the regularly updated i386/debian:stretch-backports image that is part of the docker-library/official-images collection. This also allowed us to move to WINE version 3.6.0, which supports a wider range of Windows applications and offers better stability. The build file was cleaned up, and several workarounds related to the earlier WINE version could be removed. ProteoWizard was updated from version 3.0.9098 to 3.0.18110. The Dockerfile was also updated to conform to the latest PhenoMeNal guidelines, i.e. it has updated LABELs that are used in the build process, on the Jenkins continuous integration [3] and in the app library [4].

A new data testing strategy was added: we have started to collect a range of MS files in different vendor formats, which are then converted. A Jenkins job [5] checks that each converted mzML file matches a known-good output.
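The comparison against a known-good output can be sketched as follows. This is a minimal illustration and not the actual Jenkins job: it assumes that two conversions of the same raw file may differ in a few attributes (here `startTimeStamp` and `location`, chosen as plausible examples), removes those attributes, and compares a digest of the canonicalised XML. The function names and the attribute set are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Attributes assumed to vary between otherwise identical conversions.
# This set is an assumption for the sketch; adjust it to the data at hand.
VOLATILE_ATTRIBUTES = frozenset({"startTimeStamp", "location"})


def normalized_digest(xml_text, volatile=VOLATILE_ATTRIBUTES):
    """Return a SHA-256 digest of an XML document without volatile attributes."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        for name in volatile & set(element.attrib):
            del element.attrib[name]
    # Canonical XML (C14N) sorts attributes and, with strip_text, ignores
    # whitespace-only formatting differences.
    canonical = ET.canonicalize(ET.tostring(root, encoding="unicode"),
                                strip_text=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_reference(converted_xml, reference_xml):
    """True if both documents are identical after normalisation."""
    return normalized_digest(converted_xml) == normalized_digest(reference_xml)
```

Comparing digests rather than raw bytes keeps the test robust against attribute order and indentation, while any change in spectra or metadata still causes a mismatch.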

[1] http://phenomenal-h2020.eu/home/wp-content/uploads/2016/09/D9.2.1PhenoMeNal-Preprocess.pdf
[2] Chambers, Matthew C et al. "A cross-platform toolkit for mass spectrometry and proteomics." Nature biotechnology 30.10 (2012): 918-920.
[3] http://phenomenal-h2020.eu/jenkins/job/container-pwiz/
[4] https://portaldev.phenomenal-h2020.eu/app-library/pwiz
[5] http://phenomenal-h2020.eu/jenkins/view/%20B.-%20Container%20data%20tests/job/test-container-pwiz-bruker/


Figure 1: Screenshot of MSconvertGUI converting Thermo RAW data in a docker container on Ubuntu/Linux.

It is now possible to convert Thermo RAW files in this container using the GUI under Linux/X11, which has been a long-standing wishlist item in the metabolomics community (see the screenshot in Figure 1). However, it is not yet possible to perform this conversion with the command line msconvert.exe tool. We are in contact with the ProteoWizard team and have identified the underlying reason, which lies in the complex interaction between Visual Studio 2012 C++ and the Windows CLR runtime used in the msconvert.exe tool and the vendor libraries. At the HUPO PSI 2018 meeting (18-20 April 2018) in Heidelberg, Germany, we were in contact with Jim Shofstahl from Thermo Scientific and received a version of the Thermo RawFileReader that is based on .NET and avoids the C++ runtime issues under WINE/Linux. We have created a local container-RawFileReader that successfully runs a command line program extracting the required information from a Thermo RAW file under Linux, and have informed the ProteoWizard team. We therefore expect that an upcoming version of ProteoWizard will be able to convert Thermo data on the command line as well, and consequently also in Galaxy.

We also worked on license compliance. Due to the nature of the container registries and the Kubernetes cloud setup, it is not possible to request acknowledgement of the license terms during a download step, as is done for the ProteoWizard binaries [6]. Instead, we use a mechanism that is also used in the pwiz build process, where adding “--i-agree-to-the-vendor-licenses” while building acknowledges the vendor licenses. By analogy, the resulting container is called “phnmnl/pwiz--i-agree-to-the-vendor-licenses”. All licensing information was added to the container repository and to the container itself.
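As an illustration of how the license acknowledgement becomes part of every invocation, the following sketch assembles a `docker run` command line for converting one raw file. Only the image name and the msconvert options `--mzML` and `-o` are taken as given; the way msconvert is started inside the container (`wine msconvert`), the `/data` mount point and the helper name are assumptions made for this example and should be checked against the container documentation.

```python
from pathlib import Path

# Image name as given in the text; using it implies accepting the vendor licenses.
IMAGE = "phnmnl/pwiz--i-agree-to-the-vendor-licenses"


def msconvert_command(raw_file, image=IMAGE, tool=("wine", "msconvert")):
    """Build a `docker run` argument list that converts one raw file to mzML.

    `tool` is an assumption about how msconvert is launched inside the
    container; the image may define a different entry point.
    """
    raw_path = Path(raw_file).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{raw_path.parent}:/data",  # mount the directory holding the raw file
        image,
        *tool,
        f"/data/{raw_path.name}",
        "--mzML",                          # msconvert output format switch
        "-o", "/data",                     # write the result next to the input
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`; building it as a list rather than a string avoids shell quoting problems with file names that contain spaces.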

[6] http://proteowizard.sourceforge.net/downloads.shtml#


Several tools in the PhenoMeNal workflows can process metabolomics data in mzML, including XCMS, CAMERA and MetFrag (via MSnbase), and the OpenMS tools.

3.3 NMR Data Preprocessing

The nmrML standard is an open XML-based exchange and storage format for NMR spectral data. The nmrML format is intended to be fully compatible with existing NMR data for chemical, biochemical, and metabolomics experiments. nmrML can capture raw NMR data, spectral data acquisition parameters and, where available, spectral metadata such as chemical structures associated with spectral assignments. Figure 2 gives an overview of the role of nmrML. The format is compatible with pure-compound NMR data for reference spectral libraries as well as NMR data from complex biomixtures, i.e. metabolomics experiments. The manuscript “nmrML: A Community Supported Open Data Standard for the Description, Storage, and Exchange of NMR Data” has been published recently [7].

Figure 2: Overview of the nmrML data flow.

The main route to nmrML-formatted data is using our open source converter nmrmlconv, which is part of the nmrML package. The converter is also packaged as a container image, together with the required Java runtime environment, and is available as a Galaxy tool.

[7] https://pubs.acs.org/doi/10.1021/acs.analchem.7b02795


Since Deliverable D9.2.1, we have updated the container image to the latest version of nmrmlconv and created a Galaxy module and a Galaxy workflow that use nmrmlconv to process NMR RAW data (see Figure 3). The module can be used with either a single NMR RAW acquisition or many acquisitions bundled in a dataset collection. When the nmrmlconv module is executed, the Galaxy workflow engine starts the nmrmlconv container [8], which is launched within the PhenoMeNal e-infrastructure and is described further below. In the process, we have raised several GitHub issues [9] to further improve the nmrML converter.

Figure 3: Screenshot of the NMR workflow running in Galaxy. The module “nmrmlconv” is highlighted and is now an integral part of the NMR RAW processing workflow.

3.3.1 nmrmlconv Container

The nmrmlconv container integrates the official nmrML converter sources [10] and packages them into a Docker container. This container is launched within the PhenoMeNal e-infrastructure. To ensure that NMR RAW data is processed reproducibly, our Continuous Integration Framework automatically tests that the container launches correctly [11] and that it works with actual data [12].

[8] https://github.com/phnmnl/container-nmrmlconv
[9] https://github.com/nmrML/nmrML/issues/179
[10] https://github.com/nmrML/nmrML/tree/master/tools/Parser_and_Converters/Java

nmrmlconv can currently process NMR RAW data in the Bruker, Varian and Jeol vendor formats.
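When acquisitions from several instruments are mixed, it can help to sort them by vendor before conversion. The sketch below is a heuristic based on conventional file names only (an `acqus` parameter file for Bruker, a `procpar` file for Varian, `.jdf` files for Jeol); it is not part of nmrmlconv, and the function name is hypothetical.

```python
from pathlib import Path


def guess_nmr_vendor(acquisition_dir):
    """Guess the vendor format of one NMR acquisition directory.

    Heuristic on conventional file names; returns "bruker", "varian",
    "jeol", or None when nothing matches.
    """
    directory = Path(acquisition_dir)
    names = {p.name.lower() for p in directory.iterdir() if p.is_file()}
    if "acqus" in names:
        return "bruker"
    if "procpar" in names:
        return "varian"
    if any(name.endswith(".jdf") for name in names):
        return "jeol"
    return None
```

Returning `None` for unrecognised directories lets a calling script skip or report them instead of passing unsupported data to the converter.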

The vendor-to-nmrML converter itself was annotated using the EDAM ontology and published in the bio.tools repository (see https://bio.tools/nmrML_converter6742). bio.tools now makes it possible to find this tool via appropriate controlled vocabularies describing a tool's function and I/O formats.

We are also investigating how nmrML can be used as a vendor-independent NMR raw data standard within the newly emerging NMReData Record.zip NMR assignment data standard (see http://nmredata.org). This standard could gain momentum at the late stages of NMR processing workflows, as it is supported by many tools that assign molecules to spectral features (e.g. Mnova, TopSpin, …) and by small molecule databases (e.g. NMRShiftDB, C6H6.org, ...). The NMReData standard currently stores the vendor-native formats in the Record.zip file, which impairs cross-vendor data access, e.g. as desired in molecule analytics databases. Because of this drawback, repository providers (e.g. the https://nps-datahub.com/main drug data repository) have indicated interest in switching to nmrML and would therefore welcome its inclusion.

3.3.2 User documentation and training

The nmrML conversion is part of the NMR Workflow MTBLS1 Tutorial [13], was featured in a YouTube tutorial [14], and was covered at the CloudMET workshop [15] and at the MetaboMeeting [16].

[11] http://phenomenal-h2020.eu/jenkins/view/%20A.-%20Container%20tools/job/container-nmrmlconv/
[12] http://phenomenal-h2020.eu/jenkins/view/%20B.-%20Container%20data%20tests/job/test-container-nmrmlconv/
[13] https://portal.phenomenal-h2020.eu/help/NMR1d-Workflow
[14] https://www.youtube.com/watch?v=pHB9pN2jXMA
[15] http://phenomenal-h2020.eu/home/2017/07/05/cloudmet-2017/
[16] http://phenomenal-h2020.eu/home/2017/11/07/phenomenal-metabomeeting-2017/


Figure 4: User tutorial for the NMR workflow including the nmrML conversion.

3.3.3 BATMAN

Bayesian AuTomated Metabolite ANalyzer (BATMAN) for NMR is a protocol for automated metabolite deconvolution and quantification from complex NMR spectra. BATMAN deconvolves resonances from 1-dimensional NMR spectra, assigns them to specific metabolites from a target list and obtains concentration estimates. It applies a Markov Chain Monte Carlo (MCMC) algorithm to sample from the joint posterior distribution of the model parameters. The resulting concentration estimates have a lower error than those from conventional numerical integration and are comparable to manual deconvolution by experienced spectroscopists. BATMAN is available to PhenoMeNal Galaxy users.
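The idea of estimating a concentration by MCMC sampling can be illustrated with a deliberately small toy model. This is not BATMAN's model or code: it assumes a single metabolite with one Lorentzian peak of known position and width, Gaussian noise of known standard deviation, and a flat prior on non-negative concentrations, and it uses a plain random-walk Metropolis sampler. All function names and parameters are hypothetical.

```python
import math
import random


def lorentzian(x, centre, width):
    """Unit-height Lorentzian line shape."""
    return 1.0 / (1.0 + ((x - centre) / width) ** 2)


def log_posterior(conc, xs, ys, sigma, centre, width):
    """Log posterior of the concentration under a flat prior on conc >= 0."""
    if conc < 0:
        return -math.inf
    sse = sum((y - conc * lorentzian(x, centre, width)) ** 2
              for x, y in zip(xs, ys))
    return -sse / (2 * sigma ** 2)


def metropolis(xs, ys, sigma, centre, width,
               steps=5000, step_size=0.05, seed=0):
    """Random-walk Metropolis sampler; returns post-burn-in samples."""
    rng = random.Random(seed)
    conc = 1.0
    current = log_posterior(conc, xs, ys, sigma, centre, width)
    samples = []
    for _ in range(steps):
        proposal = conc + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, xs, ys, sigma, centre, width)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            conc, current = proposal, candidate
        samples.append(conc)
    return samples[steps // 5:]  # discard the first 20% as burn-in
```

The mean of the returned samples serves as the concentration estimate and their spread as its uncertainty. A real deconvolution has to handle many overlapping multiplets, peak shifts and baseline effects jointly, which is what makes the full BATMAN model considerably more involved.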

The BATMAN workflow performs a 1D NMR spectra analysis using NMR raw data, coming from e.g. the MetaboLights database. The user can connect to this database via PhenoMeNal Galaxy and import data using the tool “Metabolights downloader”. The raw NMR data is then converted to multiple nmrML files using the tool “nmrmlconv”. As a next step, the tool “ZIP nmrML Collection” creates a zip archive of the nmrML collection, which is imported via the “nmrML2BATMAN Converter”. This is followed by the remaining BATMAN data processing workflow, which will be covered in the upcoming “D9.5.2 Updated Data Processing”.
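Outside Galaxy, the bundling step could be approximated as shown below. This is a sketch rather than the implementation of the “ZIP nmrML Collection” tool: the flat archive layout, the `*.nmrML` file pattern and the function name are assumptions.

```python
import zipfile
from pathlib import Path


def zip_nmrml_collection(nmrml_dir, archive_path):
    """Bundle all *.nmrML files of a directory into one zip archive.

    Returns the sorted list of file names that were written.
    """
    files = sorted(Path(nmrml_dir).glob("*.nmrML"))
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store files without directory components (flat layout).
            archive.write(path, arcname=path.name)
    return [path.name for path in files]
```

Sorting the file list keeps the archive content order deterministic, which is convenient when archives are compared in automated tests.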

Figure 5: NMR workflow that includes the nmrML2BATMAN conversion and the BATMAN tool.

4. Delivery and Schedule

The delivery is delayed: No

5. Conclusion

In PhenoMeNal, we have worked to improve the preprocessing of metabolomics data, which starts with the conversion of raw vendor file formats to the open formats supported in PhenoMeNal. We now support conversion to the mass spectrometry data format mzML and to nmrML on non-Linux systems. Our efforts have been acknowledged by other developers in the mass spectrometry and workflow communities.