Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst

Jianguo Xia¹ & David S Wishart¹⁻³

¹Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. ²Department of Biological Sciences, University of Alberta, Edmonton, Alberta, Canada. ³National Research Council, National Institute for Nanotechnology, Edmonton, Alberta, Canada. Correspondence should be addressed to D.S.W. ([email protected]).

Published online 5 May 2011; doi:10.1038/nprot.2011.319

© 2011 Nature America, Inc. All rights reserved. PROTOCOL. NATURE PROTOCOLS | VOL.6 NO.6 | 2011 | 743

MetaboAnalyst is an integrated web-based platform for comprehensive analysis of quantitative metabolomic data. It is designed to be used by biologists (with little or no background in statistics) to perform a variety of complex metabolomic data analysis tasks. These include data processing, data normalization, statistical analysis and high-level functional interpretation. This protocol provides a step-wise description of how to format and upload data to MetaboAnalyst, how to process and normalize data, how to identify significant features and patterns through univariate and multivariate statistical methods and, finally, how to use metabolite set enrichment analysis and metabolic pathway analysis to help elucidate possible biological mechanisms. The complete protocol can be executed in ~45 min.

INTRODUCTION
Metabolomics is primarily concerned with the comprehensive analysis of all small-molecule compounds that can be found in biological samples, such as cells, tissues or biofluids¹. Because of its utility in identifying biomarkers of disease and in measuring biochemical phenotypes, the field of metabolomics has grown rapidly in recent years. This growth has also been aided by advances in analytical technologies, such as high-resolution nuclear magnetic resonance (NMR) spectroscopy, mass spectrometry (MS) and various compound separation techniques²,³. As with other ‘omics’ technologies, bioinformatics has a key role in facilitating the storage, dissemination and interpretation of metabolomic data. In particular, bioinformaticians have developed a number of comprehensive spectral, compound and biofluid databases⁴⁻⁶, as well as a variety of software tools for data processing, compound identification and compound quantification⁷⁻¹². With these basic bioinformatics tools now in place, the focus in metabolomic software development has gradually shifted away from basic compound identification and toward functional interpretation and pathway analysis (i.e., systems biology).

There are two general approaches to performing a metabolomics study: chemometric approaches and quantitative approaches¹³. Chemometric approaches (also known as nontargeted or untargeted methods) use raw, unannotated peak lists, binned spectral data or aligned spectral profiles in combination with multivariate statistics to identify spectral features that are statistically different between two (or more) sample populations. Those features (peaks, retention times, masses, chemical shifts) found to be significant may or may not be assigned to compounds in subsequent analysis steps. Because chemometric methods do not make compound identification a priority, a major challenge with this approach is the subsequent identification step and the handling and elimination of false positives or spectral noise. In contrast, quantitative metabolomics (also known as targeted profiling) requires compound identification and quantification before any further analysis. Multivariate statistical methods are then applied to the resulting concentration data to identify metabolites that are statistically different between two (or more) sample populations.

In quantitative metabolomics, compound identification and quantification are usually achieved by comparing the MS or NMR spectra of the biological samples of interest with a set of chemical standards or a reference spectral library. A key limitation of quantitative metabolomics is therefore its dependence on accurate identification and quantification of compounds, especially in complex mixtures.

Although still in use today, chemometric approaches were more widespread when compound identification was hampered by the lack of comprehensive spectral databases and appropriate compound identification/quantification software. However, as many metabolomics researchers learned, without a list of named compounds, it is extremely difficult to identify the affected pathways, to infer a mechanism of action or to develop any kind of biological understanding. It is also very difficult to patent an unknown peak or an unnamed spectral feature. With the availability of several comprehensive metabolomic databases and improved spectral analysis tools⁴⁻⁶,¹⁴,¹⁵, compound identification has become much easier, and quantitative metabolomics is now becoming much more widely used in the metabolomics community¹⁶⁻¹⁹.

In response to this trend toward quantitative metabolomics, as well as the growing community shift toward using open-access, web-based tools in many ‘omics’ applications, we have developed a web-based software tool called MetaboAnalyst²⁰. MetaboAnalyst was specifically designed to address a wide variety of common metabolomic research and educational needs, including conventional biomarker identification, the extraction of diagnostic or prognostic metabolite patterns, general metabolite annotation, putative pathway identification, functional or biological interpretation of metabolomic data, general data exploration, online class instruction for multivariate statistics, general data visualization, the creation of plots/figures for publications and presentations, MS and/or NMR data normalization and large-scale error-checking of MS and NMR metabolomic data. Although MetaboAnalyst can be used for standard chemometric applications, it is mainly designed to support quantitative metabolomics. MetaboAnalyst is unique among metabolomic analysis tools in that it provides comprehensive support for multiple data types (NMR, gas chromatography-MS (GC-MS) and liquid chromatography-MS (LC-MS) data), multiple data processing procedures, a wide range of statistical and machine learning methods, and tools for high-level functional interpretation. MetaboAnalyst also provides a user-friendly interface that guides non-experts through the data analysis process. In addition, it offers intuitive visualization tools and generates a detailed analysis report at the end of each session.

Since its release in 2009, MetaboAnalyst has been heavily used by researchers in the metabolomics community. Currently, the server is being accessed by an average of ~50 unique users per day. This has necessitated multiple server upgrades and the development of a very extensive set of frequently asked questions and tutorials. On the basis of user feedback, MetaboAnalyst has also undergone several updates to improve its support for binary (two-group) analysis and to extend its support for multiple-group analysis. One of the most recent enhancements has been the incorporation of metabolite set enrichment analysis (MSEA)²¹ and metabolic pathway analysis²² into MetaboAnalyst to assist in the high-level functional interpretation of quantitative metabolomic data. These additions should make MetaboAnalyst a true ‘one-stop shop’ for metabolomic data analysis.

Comparison with other available tools for metabolomic data analysis
Perhaps the most widely used tool in metabolomics data analysis today is SIMCA-P+ (Umetrics). SIMCA-P+ is a commercial desktop application with a nicely designed graphical user interface that supports a wide variety of data transformations and multivariate statistical analyses, including principal component analysis (PCA), partial least squares–discriminant analysis (PLS-DA) and orthogonal projections to latent structures (OPLS; see Box 1 for glossary). SAS (Statistical Analysis System from the SAS Institute) is another stand-alone commercial software package that is commonly used in many metabolomics studies. Similar to SIMCA-P+, SAS supports a wide range of data transformations as well as sophisticated univariate and multivariate analyses. Unlike SIMCA-P+, SAS lacks a graphical interface and is generally accessed through application programming interfaces. Generally speaking, the normalization, clustering, multivariate statistics and many of the graphs generated by means of MetaboAnalyst (and the accompanying protocols) could be generated using SIMCA-P+ and/or SAS. However, neither SIMCA-P+ nor SAS supports metabolomic-specific data processing (for NMR and/or MS data), nor does either offer high-level functional interpretation through automated metabolite annotation, MSEA or metabolic pathway analysis. Furthermore, MetaboAnalyst is a freely available, web-based application with extensive graphical output and an easy-to-use graphical user interface. This makes it somewhat more accessible, easier to learn and far easier to use than either SIMCA-P+ or SAS. To the best of our knowledge, there are only two other freely available web-based metabolomic data processing tools: MeltDB²³ and the metaP-Server²⁴. However, neither would be able to perform most of the data processing or interpretive steps described in this protocol. MeltDB was primarily built for MS-based metabolomics data storage, administration, analysis and annotation, whereas metaP-Server was designed to support exploratory metabolomic data analysis using mainly univariate summary statistics. A detailed feature comparison for these five tools is given in Table 1.

Limitations of the protocol and software
Because of space restrictions, the procedures outlined in this paper cannot illustrate all of the functional capabilities of MetaboAnalyst. In particular, the clustering, classification and machine learning tools for data processing will not be discussed here. Similarly, some of the metabolite annotation functions will not be presented either. Beyond these limitations of the protocol, the software itself also has some shortcomings. In particular, MetaboAnalyst has relatively limited metabolite annotation capabilities, it does not support or integrate other kinds of ‘omics’ data and it has limited capabilities for processing and visualizing raw MS spectral files. This limitation with MS spectral files is primarily the result of hardware restrictions, both with respect to the MetaboAnalyst server and with respect to the speed of Internet data transfers (bandwidth). Raw MS spectral files are often too large (hundreds of Mb or more) to be routinely or rapidly uploaded to a remote server. Furthermore, spectral processing (including peak picking, alignment and annotation) is a computationally intensive exercise that usually requires multiple iterations and careful manual inspection to achieve optimal results. Consequently, we believe that these tasks are better handled by locally installed software packages rather than through a web-based application. Indeed, many freely available tools have been developed for MS spectral processing, including MetAlign¹¹, MZmine²⁵, Met-IDEA²⁶, MSFACTS²⁷, Tagfinder²⁸ and XCMS⁸, to name just a few. By avoiding this data transfer bottleneck and by limiting its preferred input format to partially processed data, such as peak lists or concentrations, MetaboAnalyst is able to offer much more efficient data analysis and visualization services to a much wider user base.

Box 1 | GLOSSARY
ANOVA: analysis of variance
CSF: cerebrospinal fluid
CSV: comma separated values
EBAM: empirical Bayesian analysis of microarrays³⁸
FAQ: frequently asked questions
FDR: false discovery rate
GSEA: gene set enrichment analysis
HSD: Tukey’s honestly significant difference
LOOCV: leave-one-out cross-validation
LSD: Fisher’s least significant difference
GC/LC-MS: gas chromatography/liquid chromatography-mass spectrometry
MSEA: metabolite set enrichment analysis²¹
NMR: nuclear magnetic resonance
ORA: over-representation analysis
PCA: principal component analysis
PLS-DA: partial least squares–discriminant analysis
OPLS: orthogonal projections to latent structures³⁹
QEA: quantitative enrichment analysis²¹
SAM: significance analysis of microarrays³⁰
SNP: single nucleotide polymorphism
SSP: single sample profiling²¹
VIP: variable importance in projection

Table 1 | Comparison of different metabolomics data analysis/interpretation programs.

Tool | MetaboAnalyst | MeltDB | metaP-Server | SIMCA-P | SAS
Software type | Web-based | Web-based | Web-based | Stand-alone | Stand-alone
License | Free | Free (registry required) | Free | Commercial | Commercial
Data input | Data table, NMR, MS, GC-MS data, compound/peak lists | Raw mass spectral files | Data table | Data table | Data table

Graphical interface + + + + + + + + + + + / −

Normalization + + + + + + + + +

Univariate analysis + + + + + + + + + + +

Multivariate analysis + + + + + + + + + + + +

Clustering + + + + + + +

Classification + + + +

Enrichment analysis + +

Pathway analysis + + + +

Pathway visualization + +

Integration with other omics data +

Peak annotation + + + + +

The level of support for a particular feature is rated by the number of ‘+’, with ‘+++’ as the highest.

Figure 1 | Flowchart for MetaboAnalyst. MetaboAnalyst is composed of three main functional modules responsible for data processing, statistical analysis and high-level functional interpretation. Different data inputs are first processed to produce appropriate data matrices. A wide array of univariate and multivariate statistical analyses can then be performed on these data matrices. If compound identities are known, users can perform enrichment analysis or pathway analysis after compound name standardization. Corresponding PROCEDURE step numbers are indicated in the figure.
Flowchart labels:
Data inputs: MS spectra, spectral bins, peak lists, concentration tables, compound name lists.
Data processing and normalization: data pre-processing (peak detection/alignment, retention time correction, noise filtering); data integrity check (Step 3); missing value imputation; outlier removal; data normalization (Steps 4–6; row-wise procedures, column-wise procedures).
Statistical analysis: univariate analysis (fold changes, t-tests, volcano plots, ANOVA (Step 8A), correlations (Step 9), SAM (Step 8B), EBAM); multivariate analysis (PCA (Steps 11–14), PLS-DA (Steps 15–19)); clustering (hierarchical cluster, SOM, K-means); classification (random forests, SVM).
High-level functional interpretation: compound name standardization (Step 7); pathway analysis (Steps 21–31; 15 organisms, 1,173 pathways; pathway enrichment analysis, pathway topology analysis, interactive visualization); enrichment analysis (Steps 32–35; 6,295 metabolite sets in 7 categories; over-representation analysis, single sample profiling, quantitative enrichment analysis).
Result download: analysis report, images, processed data.

Analysis overview
The procedure described here provides a step-by-step protocol for using MetaboAnalyst to fully analyze quantitative metabolomic data. It begins with a general overview of the program, followed by a detailed description of how to format and upload data, how to ‘cleanse’ the data, how to normalize it and how to identify significant features or generate lists of ‘important metabolites’. It concludes with a description of how to perform MSEA and how to perform metabolic pathway analysis. Although the protocol is specific to MetaboAnalyst, many of the early stage statistical steps can be readily adapted to other statistical analysis packages (such as SIMCA-P+ and SAS). As noted earlier, not all of MetaboAnalyst’s options or data analysis paths can be discussed in detail. However, the protocol described here should be applicable to many common data analysis scenarios in metabolomics.

MetaboAnalyst consists of three main modules: (i) a data processing module; (ii) a statistics module; and (iii) a high-level functional interpretation module. The data processing module is responsible for data input, data processing and data normalization. The statistics module supports a number of statistical (univariate, multivariate) and machine learning methods for feature selection, clustering and classification. The high-level functional interpretation module includes enrichment analysis and pathway analysis. The enrichment analysis offers MSEA using several comprehensive metabolite-set libraries. The pathway analysis offers pathway enrichment analysis and pathway topology analysis through a Google Maps–style interactive pathway visualization system. As illustrated in Figure 1, the data processing module is the entryway to the other two modules. The statistics module, which is perhaps the most important module in MetaboAnalyst, is designed for general-purpose metabolomic data analysis and can be used to analyze a number of different data types, including compound concentration data, peak lists or binned spectral data (i.e., both targeted and non-targeted data). For high-level functional interpretation, only quantitative metabolomic data (i.e., compound concentration data or a list of metabolite names) can be accepted.

It is important to note that MetaboAnalyst’s high-level functional analysis is organism specific, as dictated by MetaboAnalyst’s underlying knowledgebase. For enrichment analysis, the collection of ~6,300 metabolite sets was compiled primarily from human studies. Therefore, users need to provide their own custom metabolite sets if they wish to perform enrichment analysis for other organisms. MetaboAnalyst’s pathway analysis currently supports 15 model organisms with ~1,200 precompiled Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Before using this option, users need to decide whether these predefined libraries are applicable to their organism(s) under study.

To perform high-level functional analysis, one critical step is to match compound names between users’ data and MetaboAnalyst’s knowledgebase. As there is currently no universally accepted set of metabolite names or IDs, we have implemented an automated compound ‘disambiguator’ to convert various compound IDs and synonyms to Human Metabolome Database (HMDB) compound names for MSEA and to KEGG compound names for pathway analysis. In some cases, there will be redundancies and conflicts due to the different naming schemes adopted by different databases. Those compounds with name conflicts will be highlighted for subsequent manual inspection. We recommend that users try the recently released Chemical Translation Service²⁹ (http://cts.fiehnlab.ucdavis.edu) to clarify these ambiguities before performing any kind of high-level analysis.

Figure 2 | Data upload view. This screenshot shows MetaboAnalyst’s available data analysis modules, with the ‘Statistical Analysis’ module being selected for data upload. Clicking the tab labeled ‘Enrichment Analysis’ or ‘Pathway Analysis’ will allow users to upload data for the corresponding data analysis. The navigation tree is located on the left with the current step (‘Upload’) highlighted.

Box 2 | DATA FORMATTING
Comma separated values (.csv)
Input data values must be numeric, such as concentrations, peak intensities or areas of spectral bins. Missing values should be left blank or marked as NA. Samples can be in rows or columns, with group labels immediately following the sample names. The group label can be binary or multigroup. For enrichment analysis or pathway analysis, users can also use continuous labels for regression analysis.
Peak list files
Each peak list file should be saved in CSV format, with the first row reserved for column labels. An NMR peak list file must contain two columns, with the first column for peak positions (p.p.m. or parts per million) and the second column for peak intensities; mass spectrometry (MS) peak lists can be saved in either two-column (mass and intensities) or three-column format (mass, retention time and intensities) but not as a mixture of both. These files should be organized into separate folders named by their group labels and then uploaded as a single ZIP file.
MS spectra
MS spectra must be saved in one of two open exchange formats (netCDF or mzXML), put into different folders named by their group labels and uploaded as a single ZIP file. Because of Internet bandwidth constraints, the maximum size (after compression) allowed for an upload is 50 Mb. Larger spectral data should be first processed into other formats such as spectral bins or peak lists. Many spectral processing tools are freely available. For MS spectra processing, users can use MetAlign¹¹, MZmine²⁵ or XCMS⁸; for NMR spectra processing, HiRes⁹ or Automics⁴⁰ can be used.
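The comma separated layout described in Box 2 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: samples are placed in rows, the group label sits in the column immediately after the sample name, and missing values are written as NA. The header names "Sample" and "Group", the helper names and the example sample names are hypothetical and are not taken from MetaboAnalyst.

```python
import csv
import io


def build_concentration_csv(samples, compounds):
    """Write a samples-in-rows table: sample name, group label, then values.

    samples: list of (sample_name, group_label, {compound: value or None}).
    Missing values (None) are written as "NA", one of the two options in Box 2.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # "Sample" and "Group" are hypothetical header names chosen for this sketch.
    writer.writerow(["Sample", "Group"] + list(compounds))
    for name, group, values in samples:
        row = [name, group]
        for compound in compounds:
            value = values.get(compound)
            row.append("NA" if value is None else str(float(value)))
        writer.writerow(row)
    return buf.getvalue()


def read_concentration_csv(text):
    """Parse the table back into names, group labels, compounds and a matrix."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    names = [row[0] for row in body]
    groups = [row[1] for row in body]
    # Blank cells and "NA" are both treated as missing.
    matrix = [
        [None if cell in ("", "NA") else float(cell) for cell in row[2:]]
        for row in body
    ]
    return names, groups, header[2:], matrix


demo = [
    ("cow_01", "0", {"Glucose": 1.5, "Valine": None}),
    ("cow_02", "15", {"Glucose": 2.0, "Valine": 0.25}),
]
csv_text = build_concentration_csv(demo, ["Glucose", "Valine"])
print(csv_text)
```

Reading the file back gives a matrix with samples in rows and features in columns, which is the shape that the later processing steps operate on.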

MetaboAnalyst uses a navigation tree to guide users through its different analysis procedures (Fig. 2). All the available functions are represented as tree nodes and these nodes are organized into different branches or functional categories. Users may click the corresponding nodes to navigate among different MetaboAnalyst functions. Depending on the context, some tree nodes may be disabled when the required preliminary steps have not been performed by the user. The current node is always highlighted during the analysis, as shown in Figure 2.

This protocol is organized into five sections: (i) data format-ting, uploading and processing; (ii) identifying important features using univariate analysis; (iii) multivariate statistical analysis; (iv) MSEA; and (v) metabolic pathway analysis. Two compound concentration data sets are provided to demonstrate these pro-cedures. The first data set contains metabolite concentrations of 39 bovine rumen samples measured by proton NMR. The rumen

Box 3 | DATA PRoCESSING Handling missing valuesDepending on the type of metabolomics experiment being analyzed, there may be a substantial number of missing values present in the data set. A variety of methods have been implemented to deal with missing values. By default, MetaboAnalyst treats missing values as being present but with low signal intensity (below the detection limit). Consequently, they are replaced by half of the minimum positive value detected in the user data. Users are also allowed to manually or automatically exclude samples with too many missing values. Alternately, a user can choose several computational methods, including replace by mean/median, probabilistic principal component analysis (PCA), Bayesian PCA or singular value decomposition, to impute the missing values41.outlier identification and removalOutliers are defined as those few (usually one or two) data points that stand out from the majority of the other data points. They can be caused by sample degradation, instrumental errors, changes in measurement conditions or faulty measurements due to human error. An outlier can be either a sample outlier or a feature outlier. Outliers can usually be visually identified based on some graphical summaries of the data. For instance, the PCA score plot is often used to identify sample outliers. With this kind of plot, the outlier(s) should be located far away from main cluster. Alternately, hierarchical clustering can also be used to identify outliers, as they usually form a distant branch that joins the main cluster at a very high level. Box plots or box-and-whisker plots are commonly used to help detect feature outliers. After being identified, outliers can be removed using the DataEditor interface provided by MetaboAnalyst.Please note, after missing value replacement or outlier removal, users should redo their normalization before moving to any further downstream analysis. 
Many data analysis methods are quite sensitive to outliers; therefore, the results may be quite different after these procedures.

Box 4 | DATA NORMALIZATION

Eight commonly used procedures have been implemented for data normalization in MetaboAnalyst. Depending on whether they are to be performed on samples (rows) or features (columns), these methods are organized into two categories as described below.

Row-wise normalization
This procedure aims to reduce systematic bias during sample collection. The normalization by sum method is often used for binned spectral data in which the total spectral area is assumed to be constant. The normalization by a feature can be used to adjust the feature values (i.e., concentrations) of each sample against a spike-in, an internal standard or a physiological constant (e.g., urinary creatinine). The normalization by a sample, also known as probabilistic quotient normalization42, is a robust method to account for different dilution effects during sample preparation. This method rescales each sample by the most probable dilution factor, which is calculated as the median of the quotients between all corresponding features of the sample and the reference. It is very useful as an alternative procedure for urinary dilution adjustment when creatinine is unsuitable (e.g., in kidney disease). The sample-specific normalization allows users to manually specify a normalization value for each sample (e.g., on the basis of tissue volume or dry weight).

Column-wise normalization
These procedures aim to reduce the impact of very large feature values and to make all features more comparable or normally distributed by using different centering, scaling and log transformations. One disadvantage of this approach is that it inflates measurement errors or noise, which usually affect small values. A detailed discussion of these different normalization procedures is available in the paper by van den Berg et al.43.

Please note that row-wise and column-wise normalizations are usually performed sequentially. For data already normalized before upload, this step can be skipped.
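The two row-wise procedures described in Box 4 can be illustrated with a minimal Python sketch. It is not MetaboAnalyst's own code: the function names and the toy values are hypothetical, and the quotient step is shown on its own, without any preliminary scaling that a full implementation might apply first.

```python
from statistics import median

def normalize_by_sum(sample):
    """Scale one sample (row) so that its feature values sum to 1."""
    total = sum(sample)
    return [value / total for value in sample]

def pqn_normalize(sample, reference):
    """Probabilistic quotient normalization of one sample against a reference.

    The dilution factor is the median of the feature-wise quotients
    sample / reference; features that are zero in the reference are skipped.
    """
    quotients = [s / r for s, r in zip(sample, reference) if r != 0]
    factor = median(quotients)
    return [value / factor for value in sample]

# Hypothetical values: the second sample is the reference at half the concentration.
reference = [10.0, 20.0, 5.0, 40.0]
diluted = [5.0, 10.0, 2.5, 20.0]
restored = pqn_normalize(diluted, reference)
```

In this toy case every quotient equals 0.5, so dividing by the median quotient brings the diluted sample back onto the reference profile.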

samples were collected from dairy cows fed with different proportions of barley grain. The samples are labeled in four groups (0, 15, 30 and 45), indicating different percentages of barley in the diet. The second data set contains metabolite concentrations of 77 urine samples from cancer patients, also measured by proton NMR. The samples are divided into two groups: control or cachexic (significant muscle loss).

Data formatting, processing and normalization
This section describes how to upload various data types into MetaboAnalyst, followed by explanations on how to perform data ‘cleansing’ and normalization. The basic idea is to transform any uploaded data into a matrix, with samples in rows and features in columns. Three basic data formats are supported by MetaboAnalyst (Box 2). The most common type is a data table containing compound concentrations, peak intensities or spectral bins. These kinds of data can be easily viewed and edited using any spreadsheet program. The second data type consists of multiple peak lists, as picked from multiple spectra (NMR, MS or GC-MS). These kinds of data can be obtained from most spectral processing programs. The third data type corresponds to raw MS spectra saved in open exchange formats, such as netCDF or mzXML. More detailed information regarding data input formats, including example data sets, is available on the MetaboAnalyst website under the ‘Data Formats’ and ‘FAQs’ links.

[Figure 3 panels: ‘Before normalization’ and ‘After normalization’. Top: box plots for each compound (y axis, Compounds; x axis, Concentration or Normalized concentration). Bottom: kernel density plots (y axis, Density; x axis, Concentration or Normalized concentration).]

Figure 3 | Data normalization view. The graph summarizes the distribution of input data values before and after normalization. The box plots on the top show the concentration distributions of individual compounds, whereas the bottom plots show the overall concentration distribution based on kernel density estimation.

Box 5 | SIGNIFICANT FEATURE IDENTIFICATION

Identification of features similar to a known biomarker
In this case, researchers are looking for features (metabolites or peaks) showing similarities in their intensity or concentration changes to a feature of interest (co-expression). Users can directly perform correlation analysis against the target feature to identify those peaks or metabolites that are either positively or negatively correlated. Users can also use hierarchical clustering. Features located in the same cluster as the target feature are most similar in terms of intensity or concentration changes. MetaboAnalyst supports many similarity measures, including Euclidean distance, Pearson’s correlation, Spearman’s rank correlation and Kendall’s τ-test.

Identification of features following a particular pattern
In this case, researchers are looking for features that have shown particular patterns of changes under multiple (>2) conditions or through a range (>2) of time points. MetaboAnalyst uses a template matching approach to address this situation, as described by Pavlidis44. Template matching is part of the correlation analysis suite of MetaboAnalyst. Users can either use a predefined pattern or specify a new pattern to perform template matching. The template patterns must be specified as a series of numbers corresponding to concentration levels expected in different groups or at different time points.

Identification of features significantly different between ‘case’ and ‘control’
MetaboAnalyst offers a number of approaches, including classical univariate methods such as the t-test and ANOVA, which are commonly used to compare means or medians of one variable across two or more groups. Because of the multiple testing issue, false discovery rate (FDR) or Bonferroni-corrected P values are provided. Significance analysis of microarrays (SAM)30 and empirical Bayesian analysis of microarrays (EBAM)38 are also available for high-dimensional data based on moderated t-statistics. In addition, machine learning approaches such as Random forests45 also provide feature importance measures based on their contribution to classification performance. Feature selection and assessment using multivariate approaches are discussed in Box 6.
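As a rough illustration of the multiple testing corrections mentioned in Box 5, the sketch below adjusts a list of raw P values in two ways. The text does not say which FDR procedure MetaboAnalyst uses; the Benjamini-Hochberg step-up procedure is assumed here as a common choice, and the function names and P values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values (assumed FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that each adjusted value is the
    # minimum over all larger ranks, which keeps the result monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.02, 0.03, 0.5]          # hypothetical per-feature P values
by_bonferroni = bonferroni(raw)
by_fdr = benjamini_hochberg(raw)
```

With these four values, the Bonferroni correction keeps only the first feature below 0.05, whereas the FDR adjustment keeps the first three, which shows why the Bonferroni correction is considered the stricter of the two.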

Depending on the type of uploaded data, different preprocessing procedures may be used to convert the raw data into an appropriate data matrix. Compound concentration data (measured by NMR, GC-MS or LC-MS) are usually of high quality and do not normally need a preprocessing step. Binned spectral data usually contain a great deal of baseline noise and often require baseline filtering. For NMR and MS peak lists, MetaboAnalyst will first align the peaks across all samples. For GC/LC-MS spectra, MetaboAnalyst performs peak detection, peak alignment and retention time correction sequentially using the XCMS package8. MetaboAnalyst also supports some limited peak annotation/identification from raw peak lists. This annotation function can be accessed by going to the ‘Other Utilities’ tab and clicking on the ‘NMR/MS peak search’ tool bar. This particular function uses the HMDB peak search tools to score and identify peaks. MetaboAnalyst’s MS peak search also identifies common adducts.

After the data have been converted into a data matrix, a data integrity check is performed to ensure that the data are valid and suitable for subsequent analysis. This data integrity check includes checking for data values, sample size, group labels and other data features. Box 3 describes some of the approaches available in MetaboAnalyst for dealing with missing values and outliers.

It is often necessary to normalize metabolomic data before starting any kind of statistical analysis for several reasons. First, normalization can reduce systematic bias or technical variation. Second, metabolite concentrations or peak intensities usually span several orders of magnitude (sub-micromolar to millimolar). Consequently, the variance from the more abundant metabolites will tend to dominate the variance-covariance matrix and obscure small but potentially significant signals. This can lead to misidentification of significant changes or a failure to identify significant changes, particularly with conventional multivariate statistical approaches. In addition, many statistical methods assume that data values follow a Gaussian distribution. Therefore, it is important to perform appropriate data transformations to make the data look like a ‘bell curve’. MetaboAnalyst provides many useful methods for data normalization (Box 4). The effect of these normalization procedures on users’ data can be visualized with a diagnostic plot (Fig. 3).

Significant feature identification using univariate methods
This section provides the detailed steps on how to identify features of interest using classical univariate statistical methods, such as the Student’s t-test, analysis of variance (ANOVA) or correlation analysis. It also describes how to use a method developed for

Box 6 | MULTIVARIATE STATISTICAL ANALYSIS

Principal component analysis
PCA is an unsupervised clustering or classification method. It projects complex, high-dimensional data to a new coordinate system with fewer dimensions. The projection direction is calculated to maximize the data variance in just the first few dimensions (called principal components). The values in the remaining dimensions may be ignored with minimal loss of information. PCA is very good at revealing the internal structure of a data set with respect to variance. The results of a PCA are usually discussed in terms of ‘scores’ and ‘loadings’. The scores represent the original data in the new coordinate system and the loadings are the weights applied to the original data during the projection process. Note that in PCA there is no guarantee that the directions of maximum variance will contain the best features for discrimination. It is also important to remember that PCA is very sensitive to outliers. Therefore, data normalization and outlier removal are usually needed in order to obtain good PCA results.

Partial least squares-discriminant analysis
PLS-DA is a supervised clustering or classification method. This means that previous knowledge about the class labels (Y) is used during the classification process. PLS-DA projects the data (X) into a low-dimensional space that maximizes the separation between different groups of data in the first few dimensions (also called latent variables). These latent variables are ranked by how well they explain the Y-variance.

An important issue with PLS-DA is deciding on the number of latent variables to be used to build the model. MetaboAnalyst supports two common approaches: (i) assessing the sum of squares captured by the model (R2) or the cross-validated R2 (also known as Q2), and (ii) assessing the prediction accuracies based on cross-validation, with different numbers of components. By default, the optimal number determined by Q2 is used (Fig. 4b).
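The relationship between R2 and Q2 can be sketched with one small Python function. This is only an illustration under the usual definition (one minus the ratio of the residual to the total sum of squares); the function name and the toy predictions are hypothetical and no PLS-DA model is fitted here.

```python
def explained_fraction(observed, predicted):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    center = sum(observed) / len(observed)
    total_ss = sum((y - center) ** 2 for y in observed)
    residual_ss = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - residual_ss / total_ss

# Hypothetical class labels coded as 0/1, with two sets of predictions.
y = [0.0, 0.0, 1.0, 1.0]
fitted = [0.1, 0.2, 0.9, 0.8]           # predictions for the training samples
cross_validated = [0.3, 0.4, 0.6, 0.7]  # predictions for held-out samples

r2 = explained_fraction(y, fitted)           # R2: fit to the training data
q2 = explained_fraction(y, cross_validated)  # Q2: same formula on CV predictions
```

The same formula gives R2 when applied to fitted values and Q2 when applied to cross-validated predictions, which is why Q2 is normally the lower of the two.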

A common problem with PLS-DA is its propensity to overfit the data. This occurs when the algorithm appears to achieve good separation but has done so by picking up random noise rather than real signals. It has been shown that this problem cannot always be detected through cross-validation, but it can be detected using permutation tests46. A permutation test involves randomly reassigning the class labels and performing PLS-DA on the newly relabeled data set. The process is repeated hundreds or thousands of times, and the performance measures are plotted on a histogram for visual assessment. From the resulting histogram, it is possible to determine whether the original class assignment is significantly different from, or a part of, the distribution based on the permuted class assignments (Fig. 4c). An empirical P value is often calculated by determining the number of times the permuted data yielded a better result than the one using the original labels. For example, if none of the permuted classes is better than the observed one in 2,000 permutations, the P value is reported as P < 0.0005 (less than 1/2,000). We have implemented both cross-validation and permutation tests, as suggested by Bijlsma et al.36.
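The logic of the permutation test can be sketched in a few lines of Python. To keep the example self-contained, the PLS-DA performance measure is replaced by a much simpler stand-in (the absolute difference between two group means); the function names, the seed and the data are hypothetical.

```python
import random

def separation(values, labels):
    """Stand-in performance measure: absolute difference of the two group means."""
    ones = [v for v, g in zip(values, labels) if g == 1]
    zeros = [v for v, g in zip(values, labels) if g == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

def permutation_test(values, labels, n_permutations=2000, seed=0):
    """Count how often shuffled labels give a better result than the original."""
    rng = random.Random(seed)
    observed = separation(values, labels)
    shuffled = list(labels)
    better = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # reassign class labels at random
        if separation(values, shuffled) > observed:
            better += 1
    return observed, better

def report_p(better, n_permutations):
    """Report 'P < 1/n' when no permutation beats the observed statistic."""
    if better == 0:
        return f"P < {1 / n_permutations:g}"
    return f"P = {better / n_permutations:g}"

values = [1.0, 1.2, 0.9, 1.1, 1.3, 5.0, 5.2, 4.9, 5.1, 5.3]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
observed, better = permutation_test(values, labels)
summary = report_p(better, 2000)
```

Because the two toy groups do not overlap, no relabeling can exceed the observed separation, so the result is reported as an upper bound rather than as zero.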

PLS-DA also produces variable importance measures. Two variable importance measures are available in MetaboAnalyst. The first, variable importance in projection (VIP), is a weighted sum of squares of the PLS loadings that takes into account the amount of explained Y-variance of each component (Fig. 4d). The other importance measure is based on a weighted sum of the PLS-regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components.

analyzing high-dimensional data, namely, significance analysis of microarrays (SAM)30. Metabolomic data sets are intrinsically high dimensional, with the number of features (peaks, metabolites) ranging from a few dozen to hundreds or even thousands. They represent snapshots of global biochemical profiles of individual organisms. Most of these features are expected to be within normal physiological variations, and only a few may be significantly associated with the conditions or phenotypes of interest. The identification of those ‘key’ features is the first step toward finding useful biomarkers or explaining the underlying biological process. Depending on the specific questions being asked or the information already known, MetaboAnalyst offers a number of different strategies to perform feature identification and assessment (Box 5). MetaboAnalyst also supports feature (or peak) annotation after significant features (peaks or bins) have been identified. This utility can be accessed under the ‘Peak search’ node located near the bottom of the navigation tree.

Multivariate data analysis
Multivariate statistics involves the simultaneous observation and analysis of more than two statistical variables. Because metabolomic data usually consist of dozens of features (compounds, peaks), many of which change as a function of time, phenotype or experimental conditions, multivariate data analysis is ideal for analyzing metabolomic data. Multivariate analysis includes a number of techniques, such as multivariate ANOVA, multivariate regression analysis, PCA, factor analysis and discriminant analysis. MetaboAnalyst supports two widely used multivariate methods: PCA and PLS-DA. These two methods are very useful for exploratory data analysis through dimensional reduction and data visualization (Box 6). MetaboAnalyst is also able to generate a variety of colorful, two- or three-dimensional graphs, such as score plots, loading plots and other kinds of

[Figure 4 panels: (a) 3D score plot with axes Component 1 (15.9%), Component 2 (6.6%) and Component 3 (12.1%), for groups 0, 15, 30 and 45. (b) Performance (Accuracy, R2 and Q2) against Number of components. (c) Histogram of Permutation test statistics (y axis, Frequency), with the observed statistic marked at P < 5e–04. (d) Compounds ranked by VIP scores.]

Figure 4 | Multivariate analysis using PLS-DA. (a) PLS-DA 3D score plot. (b) Bar plots showing the three performance measures (prediction accuracy, R2 and Q2) using different number of components. The red ‘*’ indicates the best values of the currently selected measures (Q2). (c) The result of permutation tests summarized by a histogram. (d) The top 15 compounds ranked by VIP scores.

Figure 5 | Results from metabolite set enrichment analysis. (a) The result table summarizing the matched metabolite sets ranked by their P values. (b) The detailed view of a matched metabolite set (accessed by clicking the corresponding bar icon on the last table column).

diagnostic plots (Fig. 4). This section describes the detailed steps needed to perform PCA and PLS-DA using the example data sets and how to interpret the results.

Metabolite set enrichment analysis
This section describes the detailed steps to perform MSEA. MSEA is the metabolomic counterpart of the gene set enrichment analysis (GSEA)31, which has been widely used in gene expression data analysis. The key idea behind GSEA is to investigate the enrichment of predefined groups of functionally related genes (or gene sets) instead of individual genes. This approach has been shown to be good at identifying significant as well as subtle but coordinated expression changes among a group of related genes. As groups of genes are usually associated with biological functions or biological pathways, GSEA also greatly facilitates higher-level functional interpretation. MSEA has been implemented in MetaboAnalyst, using the same concepts underlying GSEA (Fig. 5). Similar to GSEA, there are two essential components for MSEA: (i) the algorithms for enrichment analysis and (ii) the comprehensive libraries of functionally related metabolite sets. Box 7 provides more details about these two components.

Metabolic pathway analysis
This section describes the basic steps to perform metabolic pathway analysis and visualization of the results. Pathway analysis has proven to be an invaluable tool in understanding complex relationships among genes and proteins in genomics and proteomics studies32–35. Most pathway analysis tools focus on visually displaying and highlighting matched genes, proteins or metabolites and do not support more quantitative or statistical analysis. To address this issue, we have integrated two pathway analysis approaches: pathway enrichment analysis and pathway topology analysis. The results can be visualized intuitively using a Google Maps–style visualization system (Fig. 6). Box 8 provides additional details on the main features offered by MetaboAnalyst’s pathway analysis utilities.

Box 7 | METABOLITE SET ENRICHMENT ANALYSIS

Types of enrichment analysis
Overrepresentation analysis (ORA). This algorithm requires a list of compound names as input. Such a list can be obtained by various feature selection or clustering analysis techniques, such as ANOVA, PCA or PLS-DA. A hypergeometric test is used to evaluate whether a particular metabolite set is represented more than expected by chance within the given compound list. The P value indicates the probability of observing at least a particular number of metabolites from a certain metabolite set in a given compound list.

Single-sample profiling (SSP). For this approach, the required input is a list of compound concentrations measured from common human biofluids, such as cerebrospinal fluid (CSF), blood and urine. The concentrations must be provided using standard concentration units (µmol for blood and CSF, and µmol per mmol creatinine for urine). The method first identifies those metabolites with concentrations deviating significantly from the reported normal reference ranges. These metabolites are then subject to overrepresentation analysis.

Quantitative enrichment analysis (QEA). For this algorithm, the required input is a compound concentration table. QEA is based on the globaltest47 algorithm, which uses a generalized linear model to estimate the association between concentration profiles of a matched metabolite set and the class label. The P value tests the null hypothesis that none of the matched compounds in the metabolite set is associated with the class label.

Metabolite set libraries
We consider a group of metabolites as a metabolite set if there are established, empirically observed or theoretically predicted functional associations among them. On the basis of these criteria, we have collected a total of 6,292 metabolite sets organized into seven categories: 83 pathway-associated metabolite sets, 742 disease-associated metabolite sets, which were further divided into three groups on the basis of the type of biofluid (CSF, blood or urine) from which they were reported, 4,501 single-nucleotide polymorphism (SNP)-associated metabolite sets, 921 model-predicted metabolite sets and 57 metabolite sets on the basis of tissue or cellular co-localization.

Please note: Users should always be aware of the technological limitations of their metabolomics platform(s) when interpreting the results from overrepresentation analysis. Current analytical technologies only offer partial metabolome coverage, which makes metabolomic studies intrinsically biased toward metabolite sets containing compounds that are more abundant or more easily detected by a given technology platform. MetaboAnalyst supports the application of a platform-specific reference metabolome to correct for this potential bias.
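The hypergeometric test behind overrepresentation analysis can be written down directly. The sketch below is illustrative only; the function and parameter names are hypothetical, and `universe_size` stands for the size of the reference metabolome (for example, a platform-specific one as mentioned in the note above).

```python
from math import comb

def ora_p_value(hits, list_size, set_size, universe_size):
    """Probability of seeing at least `hits` members of a metabolite set
    in a compound list drawn at random from the reference metabolome."""
    upper = min(list_size, set_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: a reference metabolome of 20 compounds, a metabolite
# set of 5 compounds, and a list of 4 significant compounds, 2 of them in the set.
p = ora_p_value(hits=2, list_size=4, set_size=5, universe_size=20)
```

Shrinking `universe_size` to the compounds that a platform can actually detect changes the P value, which is the bias correction the note in Box 7 refers to.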

MATERIALS
EQUIPMENT SETUP

• A PC with an Internet connection
• Browser requirements: MetaboAnalyst has been tested on all modern web browsers that are JavaScript enabled, including Mozilla Firefox 3.0+, Safari 4.0+, Chrome 5.0+ (Google), Opera 10.0+ and Internet Explorer 8.0 (Microsoft).

Data files: MetaboAnalyst has a number of example data sets for format illustration purposes as well as for testing purposes. Users can directly select a testing data set in MetaboAnalyst’s data upload page without actually downloading it. For this protocol, we will download a concentration data set and then re-upload it to better illustrate how local or user-generated data files may be handled. First, go to the MetaboAnalyst home page and then click the ‘Data Formats’ link on the left menu bar. In the Data Formats page, under the ‘Comma Separated Value (CSV) format’, click and download the first concentration file, ‘Compound concentration data set—cow, four groups’, and save it as ‘cow_diet.csv’. The second concentration file to be retrieved is ‘Compound concentration data set—human, two groups’. Save this file as ‘human_cachexia.csv’.

PROCEDURE
Data upload, processing and normalization ● TIMING 5–10 min
1| Starting up: Go to the MetaboAnalyst home page and click the ‘click here to start’ link to enter the data upload page.
CRITICAL STEP Most browsers support multiple tabs, but do not access MetaboAnalyst from more than one tab during an analysis. Opening multiple connections to MetaboAnalyst within the same browser will cause problems because the session data will be overwritten.
? TROUBLESHOOTING

2| Data upload: Depending on the type of analysis that a user wishes to perform, they can upload their data using any of the three available tab options: Statistical Analysis, Enrichment Analysis or Pathway Analysis (Fig. 2). Here we show how to upload data from the ‘Statistical Analysis’ tab, which is selected by default (data upload instructions for Enrichment Analysis are provided in Steps 21–24, and data upload directions for Pathway Analysis are given in Step 32). In the ‘Upload your data’ section, users can upload either a comma-separated value (CSV) file or a compressed (ZIP) file (see Box 2 for more details). For the example we use here, choose ‘Concentrations’ as the data type and ‘Samples in rows (unpaired)’ as the data format. Click the ‘Browse’ button to locate the ‘cow_diet.csv’ file and click the ‘Submit’ button.
CRITICAL STEP Users must specify the correct data type and data format that match their data. Failure to do so will result in MetaboAnalyst launching the wrong data processing procedure.
CRITICAL STEP Users can also easily perform paired analysis in MetaboAnalyst. For any kind of paired data comparison, the number of samples (n) must be even. For CSV formatted data, the pairwise information must be given by the class labels as integer values between −1 and −n/2 and between 1 and n/2. Samples with class labels having the same absolute integer values are considered to be pairs (e.g., −18 is paired with +18). For ZIP formatted data, users need to upload a separate text file (.txt) to give the pair information. Each pair is specified as two sample names (without a suffix) separated by a colon, with one pair per row.
? TROUBLESHOOTING
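The class label convention for paired CSV data in Step 2 can be checked before upload with a short script. The following Python sketch is a hypothetical helper, not part of MetaboAnalyst; the sample names are invented and the checks simply restate the rules given in the CRITICAL STEP above.

```python
def pair_samples(class_labels):
    """Group sample names into pairs from signed integer class labels.

    class_labels maps each sample name to a label in -n/2..-1 or 1..n/2,
    where n is the number of samples; -k and +k form one pair.
    """
    n = len(class_labels)
    if n % 2 != 0:
        raise ValueError("paired analysis needs an even number of samples")
    pairs = {}
    for name, label in class_labels.items():
        if label == 0 or abs(label) > n // 2:
            raise ValueError(f"label {label} of sample {name} is out of range")
        members = pairs.setdefault(abs(label), {})
        side = "positive" if label > 0 else "negative"
        if side in members:
            raise ValueError(f"label {label} is used more than once")
        members[side] = name
    return [(pairs[k]["negative"], pairs[k]["positive"]) for k in sorted(pairs)]

# Hypothetical sample names for two subjects measured twice.
labels = {"s1_before": -1, "s1_after": 1, "s2_before": -2, "s2_after": 2}
matched = pair_samples(labels)
```

Running such a check locally can catch an odd sample count or a reused label before the file is submitted.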

Figure 6 | Metabolic pathway analysis and visualization. (a) The ‘metabolome view’ showing all metabolic pathways arranged according to the scores from enrichment analysis (y axis) and from topology analysis (x axis). (b) The ‘pathway view’ showing the corresponding metabolic pathway after clicking any node in the ‘metabolome view’. The matched metabolites are highlighted according to their P values. Users can zoom or drag the pathway map to view a subset of the compounds. (c) The ‘compound view’ showing the concentration distribution of the corresponding metabolite after clicking any matched compound node. The P value and the node importance are indicated below.

3| Data integrity checking: If the data have been uploaded successfully, a data integrity check is performed. After this check is completed, MetaboAnalyst will provide a summary of the data characteristics. Two common issues that often arise with metabolomic data are missing values and outliers (see Box 3 for more details). To handle missing values, users can click the ‘Missing value imputation’ button to use a variety of options to either exclude or replace these values. Outlier identification and removal is an iterative process and is usually performed in combination with preliminary exploratory data analysis. See Step 28 for an example. For this particular data set, we accept the data ‘as is’ and so we will click the ‘Skip’ button to go to the normalization step.

4| Data normalization: There are two normalization procedures: row-wise normalization and column-wise normalization. The characteristics of the different normalization procedures are discussed in Box 4. In the data normalization page, choose ‘normalization by a reference sample’ and then select the first sample name ‘0-1-1’ for row-wise normalization.
CRITICAL STEP The choice for a reference sample is generally the sample in the control group with the fewest missing values. Alternatively, users can choose to use a pseudo-reference sample created by averaging all samples in the control group. For high-quality data in which samples in the same groups are very homogeneous, the effects of either procedure should be very similar.

5| Select ‘auto-scaling’ for column-wise normalization.

6| After the normalization steps have been completed, click ‘next’ to view a graphic summary of the normalization effects on the data (Fig. 3).
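The auto-scaling selected in Step 5 is commonly defined as mean-centering each column and dividing it by its standard deviation. The Python sketch below assumes that definition and uses the sample standard deviation; the function name and the toy matrix are hypothetical.

```python
from statistics import mean, stdev

def auto_scale(matrix):
    """Mean-center each column and divide by its sample standard deviation.

    matrix is a list of rows (samples); each row is a list of feature values.
    """
    columns = list(zip(*matrix))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return [
        [(value - c) / s for value, c, s in zip(row, centers, spreads)]
        for row in matrix
    ]

# Two features on very different scales (hypothetical values).
data = [
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 2000.0],
]
scaled = auto_scale(data)
```

After scaling, both columns have mean 0 and standard deviation 1, so the feature with the larger raw values no longer dominates the variance.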

7| Compound name standardization (optional): This step is only applicable for compound concentration data. Click the ‘Name check’ node under the ‘Processing’ branch. The results of the name conversion process will be shown as a table. Compounds without an exact match in MetaboAnalyst’s name library will be highlighted in either yellow (approximate match found) or red (no match found). Users should manually examine the compounds with approximate matches and choose the correct one. Otherwise, the first match in the candidate name list will be used. Click the ‘Submit’ button to finish the name checking. Note that after this step, all three major nodes on the navigation tree (‘Statistics’, ‘Enrichment’ and ‘Pathway’) should be enabled. Also note that if the data are uploaded under the ‘Enrichment Analysis’ or ‘Pathway Analysis’ tab, the compound name mapping will be performed by default. The data are now processed, normalized and ready for a variety of downstream analysis procedures.

Identification of significant features with univariate methods ● TIMING ~10 min
8| Identification of significantly different features: MetaboAnalyst directly supports significant feature (metabolite) identification using several methods including t-tests, ANOVA, volcano plots, SAM and others. Use option A for ANOVA-based feature selection or option B for SAM-based selection.

Box 8 | METABOLIC PATHWAY ANALYSIS

Pathway enrichment analysis
Pathway enrichment analysis offers both over-representation analysis (Fisher’s exact tests or hypergeometric tests) as well as quantitative enrichment analysis (globaltest47 and GlobalAncova48). The main characteristics of the two types of enrichment analysis are given in Box 7.

Pathway topology analysis
The importance of a compound within a given metabolic network can be estimated by its centrality measure. There are two commonly used centrality measures: ‘degree’ centrality and ‘betweenness’ centrality. The former measures the number of connections the node of interest has to other nodes and the latter measures the number of shortest paths going through the node of interest. As metabolic pathways are directed graphs, the relative betweenness centrality and out-degree centrality measures are used for calculating compound importance. For more information on graph-based methods, please refer to the paper by Aittokallio et al.49.

Pathway visualization
The metabolic pathways used by MetaboAnalyst are obtained from the KEGG database and presented as networks of chemical compounds, with metabolites as nodes and reactions as edges. MetaboAnalyst’s pathway visualization system supports lossless zooming, dragging, and linking operations. Relevant information can be obtained by clicking on the appropriate graphical elements. For instance, clicking each compound node on a metabolic pathway will display a more detailed view of the concentration distributions of the metabolite together with the node importance score and P value calculated by t-test, ANOVA or linear regression, as determined by the analysis type (Fig. 6).
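The two centrality measures in Box 8 can be computed on a small directed graph as follows. This Python sketch uses Brandes' algorithm for betweenness on an unweighted graph; it returns raw counts and leaves out the rescaling to relative values, whose exact form is not given in the text. The toy pathway and its node names are hypothetical.

```python
from collections import deque

def out_degree(graph):
    """Number of outgoing edges for each node of an adjacency dict."""
    return {node: len(targets) for node, targets in graph.items()}

def betweenness(graph):
    """Brandes' algorithm for an unweighted directed graph.

    graph maps every node to the list of nodes it points to.
    """
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {node: [] for node in graph}
        n_paths = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate path dependencies, starting from the farthest nodes.
        dependency = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    return score

# Hypothetical pathway: A -> B, B -> C and D, C and D -> E.
pathway = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": []}
degrees = out_degree(pathway)
between = betweenness(pathway)
```

In this toy pathway, compound B lies on every route from A to the downstream compounds, so it receives the highest score under both measures.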

(A) ANOVA-based feature selection
(i) As the data in ‘cow_diet.csv’ contain four groups, one can use ANOVA methods to select important features. Click the ANOVA node on the navigation tree to enter the ‘One-way ANOVA and post hoc analysis’ page.
(ii) Significant features are identified with the default P value threshold of 0.05. As the ANOVA F-test only indicates that at least two groups differ, the post hoc analysis further tests which groups differ from each other. MetaboAnalyst offers two commonly used methods: Fisher’s least significant difference (LSD) and Tukey’s honestly significant difference (HSD). Tukey’s HSD is generally more conservative than Fisher’s LSD.
(iii) Click the ‘view details’ link to see a data table from the ANOVA and post hoc tests using Fisher’s LSD (the default). Users can click any compound name to view a box plot summary of its concentrations in different groups.

(B) SAM-based feature selection
(i) SAM is designed to control the false positives when running multiple tests on high-dimensional data. To use the SAM method, click the ‘SAM’ node on the MetaboAnalyst navigation tree.
(ii) The default view is the Step 1 tab, which contains two plots to help users select a suitable delta value. The left plot shows the false discovery rate (FDR) change with different delta values and the right plot shows the number of significant compounds identified given different delta values. For example, using the default delta value 0.6 will identify ~25 compounds with an FDR ~0.3; using a delta value of 1.0 will identify ~20 significant compounds with the FDR less than 0.1. Enter 1.0 as the new delta value and click ‘Submit’.
(iii) The Step 2 tab shows a typical SAM plot with the delta value equaling 1.0. Click the ‘View details …’ link to see the SAM results table. A total of 21 compounds were identified above the chosen threshold. Note that the top ten compounds are almost exactly the same as those identified using ANOVA.
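For option A of Step 8, the one-way ANOVA F statistic for a single compound can be sketched as below. This is an illustration only: the function name and the concentrations are hypothetical, and the conversion of F into a P value (which needs the F distribution) and the post hoc tests are left out.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical concentrations of one compound in diet groups 0, 15, 30 and 45.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0], [6.0, 7.0, 8.0]]
f_stat, df_between, df_within = one_way_anova_f(groups)
```

A large F value only says that the group means are not all equal; identifying which groups differ is the job of the post hoc tests described in option A.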

9| Identification of other features with patterns of interest: This step allows users to investigate trends or patterns in metabolite concentration changes. Click the ‘Correlations’ node on the navigation tree to enter the ‘Correlation Analysis’ page. There are two types of correlation analysis that can be performed in MetaboAnalyst: correlation with a defined pattern (option A) or correlation with a specific feature (option B).

(A) Correlation with a defined pattern
(i) Here we will attempt to identify those metabolites whose concentrations increase with the percentage of grain in the diet. Choose the predefined pattern ‘1–2–3–4’ from the ‘select a predefined pattern’ drop-down list, which corresponds to a linear concentration increase across groups 0, 15, 30 and 45. Alternatively, users can specify their own patterns in the ‘define your own pattern’ text field.

(ii) Click the ‘Submit’ button beside the drop-down list used in the previous step. The result is shown in Figure 7a. The light blue bars indicate metabolites with a negative correlation and the light pink bars indicate those with a positive correlation with the given pattern of change.

(iii) Click the ‘View details’ link to see a table of all the compounds listed as well as their correlation coefficients. Clicking any compound name will generate a graphic summary of its concentration distribution within each group (Fig. 7b).

Figure 7 | Correlation analysis to identify compounds with a specific pattern. (a) Correlation plot showing the top 25 compounds that are significantly associated with a given pattern ‘1–2–3–4’ (a linear concentration increase under different conditions). The compounds are represented as horizontal bars plotted against their correlation coefficients, with light pink indicating positive correlations and light blue indicating negative correlations. Users can click the ‘view details’ link to see a detailed table. (b) Box plots summarizing the concentration distributions of a selected compound (endotoxin) in the 0, 15, 30 and 45 groups.

(B) Correlation with a specific feature
(i) On the basis of the above analysis and a review of the literature, we know that elevated levels of endotoxin are important for initiating certain inflammatory responses. We are interested in identifying other metabolites with patterns of change similar to endotoxin. We will use the default ‘Pearson r’ as the distance measure and then select ‘Endotoxin’ from the ‘Select a feature’ drop-down list.

(ii) Click the ‘Submit’ button. The resulting image shows a number of other features that are either positively or negatively correlated with endotoxin levels. The details can be obtained by following the ‘view details’ link.
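Both options in this step come down to ranking features by a correlation coefficient, either against a template such as ‘1–2–3–4’ or against another feature. The sketch below shows the idea in plain Python with Pearson's r. The helper names (`pearson_r`, `expand_pattern`, `rank_by_correlation`) and all concentration values are hypothetical and only illustrate the calculation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant sequence
    return sxy / math.sqrt(sxx * syy)


def expand_pattern(pattern, group_order, sample_groups):
    """Assign each sample the pattern value of its group, e.g. 1-2-3-4."""
    level = dict(zip(group_order, pattern))
    return [level[g] for g in sample_groups]


def rank_by_correlation(table, template):
    """Sort features from most positively to most negatively correlated."""
    scores = {name: pearson_r(values, template) for name, values in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: two samples per diet group (0, 15, 30 and 45% grain).
sample_groups = ["0", "0", "15", "15", "30", "30", "45", "45"]
template = expand_pattern([1, 2, 3, 4], ["0", "15", "30", "45"], sample_groups)
table = {
    "rising": [1.0, 1.2, 2.0, 2.1, 3.0, 3.2, 4.0, 4.1],
    "falling": [4.0, 3.9, 3.0, 3.1, 2.0, 1.9, 1.0, 1.1],
    "flat": [2.0, 2.1, 1.9, 2.0, 2.1, 2.0, 1.9, 2.0],
}
for name, r in rank_by_correlation(table, template):
    print(f"{name}: r = {r:.2f}")
```

For option B, the same `rank_by_correlation` call can be reused by passing the concentrations of the feature of interest as the template instead of the expanded pattern.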

10| Report generation and result download: Click the ‘Download’ node on the navigation tree. MetaboAnalyst will generate a detailed analysis report based on the steps that the user has previously executed. The report contains a brief description of each method used, followed by the graphical and textual results based on the last parameter set. The normalized data, as well as any graphs generated during the analysis, are also available for download.
? TROUBLESHOOTING

Multivariate data analysis ● TIMING ~10 min
11| Data exploration and visualization with PCA: PCA summarizes data into a few components that explain most of the data variance. The main characteristics of PCA are discussed in Box 6. Click the ‘PCA’ node on the navigation tree to enter the PCA page. This page shows six main output panels from MetaboAnalyst’s PCA analysis. The default view is a pairwise score plot from the top five PCs, with the diagonal panels showing the explained variance.

12| Click the ‘2D score plot’ tab to see a detailed scores plot using PC1 and PC2. The samples are labeled and colored according to their group memberships. In this view, users should look first for outliers; if there are obvious outliers, use the ‘DataEditor’ under the ‘Processing’ navigation tree to exclude them. Outlier removal should be carried out with considerable care, and outliers should be removed only if there is some clear justification (sample stability problems, sample collection issues, instrument problems, typographical errors and so on). Next, users should investigate sample dispersion; if the data points in the score plot are not well dispersed or show a high degree of skewing, this may be due to insufficient normalization. Click the ‘Normalization’ node under the ‘Processing’ branch to choose a different normalization procedure. In particular, autoscaling or range scaling can be very effective for correcting severely skewed data.
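The two scaling options mentioned at the end of this step can be written down in a few lines. The sketch below assumes the usual definitions (see ref. 43): autoscaling mean-centers a feature and divides by its standard deviation, whereas range scaling mean-centers and divides by the range. The function names and values are hypothetical, and the code is not taken from MetaboAnalyst.

```python
import statistics


def autoscale(column):
    """Mean-center a feature and divide by its sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        raise ValueError("constant feature cannot be autoscaled")
    return [(v - mean) / sd for v in column]


def range_scale(column):
    """Mean-center a feature and divide by its range (max - min)."""
    mean = statistics.mean(column)
    span = max(column) - min(column)
    if span == 0:
        raise ValueError("constant feature cannot be range scaled")
    return [(v - mean) / span for v in column]


# Hypothetical concentrations of one compound across five samples,
# including one extreme value that skews the raw distribution.
raw = [0.8, 1.1, 0.9, 1.0, 6.2]
print([round(v, 2) for v in autoscale(raw)])
print([round(v, 2) for v in range_scale(raw)])
```

Both transformations are applied feature by feature (column-wise), so that abundant and scarce metabolites contribute on a comparable scale to the PCA.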

13| In our case, no obvious outliers or skewed distribution can be detected. Furthermore, some modest separation or clustering is noticed among different groups. There are also some clusters that appear to overlap with each other. Users can click the ‘3D score plot’ to see whether a better separation can be identified with an extra dimension or an extra principal component.

14| Identification of influential or important features: If good separation patterns are seen in a scores plot, users should go to the ‘Loading plot’ as well as the ‘Biplot’ views to identify those features that are most responsible for the separation. The loading plot can be viewed either as a scatter plot or a bar plot, as specified by the user. In this particular case, as there are no clear separations, it is very difficult to identify the features that are important. We will use a supervised method, PLS-DA, for this purpose.

15| Data exploration and visualization with PLS-DA: PLS-DA can perform both classification and feature selection. The main characteristics of PLS-DA are discussed in Box 6. Click the ‘PLS-DA’ node on the navigation tree to start this analysis. The default view is a pairwise summary of the score plots of the top five components.

16| Click the ‘2D Score plot’ for a detailed view of the separation patterns. A much better separation is obtained with PLS-DA compared with the PCA result obtained in Steps 12 and 13. The 3D Score plot shows an almost perfect separation with the first three components (Fig. 4a).

17| Choosing the optimal number of components: MetaboAnalyst calculates R² and Q², which are two common performance measures in assessing PLS-DA models. R² corresponds to the sum of squares captured by the model, whereas Q² is the cross-validated R². MetaboAnalyst also calculates prediction accuracies through cross-validation. Click the ‘Cross Validation’ tab to start the process. Users can choose ‘10-fold cross validation’ or ‘Leave-one-out cross validation (LOOCV)’. In this case, we will choose ‘LOOCV’ and click the ‘Submit’ button. The result indicates that using the top two components gives the best performance based on Q² measures (Fig. 4b). Click the ‘view details …’ link to get a detailed table of the calculated values.
? TROUBLESHOOTING
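The relationship between R² and Q² can be illustrated without fitting a full PLS-DA model. The sketch below swaps in a one-variable least-squares line as a stand-in model (an assumption made purely to keep the example short) and computes R² from the fitted residuals and Q² from leave-one-out predictions. The function names and data are hypothetical, and the exact Q² definition used by MetaboAnalyst may differ in detail.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def r2_and_q2(x, y):
    """Return (R2, Q2): R2 from the full fit, Q2 from leave-one-out predictions."""
    n = len(y)
    my = sum(y) / n
    ss_tot = sum((b - my) ** 2 for b in y)
    slope, intercept = fit_line(x, y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    press = 0.0  # predicted residual sum of squares
    for i in range(n):
        # Refit without sample i, then predict the held-out sample.
        s, c = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (s * x[i] + c)) ** 2
    return 1 - ss_res / ss_tot, 1 - press / ss_tot


# Hypothetical predictor and response values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
r2, q2 = r2_and_q2(x, y)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

Because each held-out sample is predicted by a model that never saw it, Q² cannot exceed R² in this setting; a large gap between the two is a typical sign of overfitting.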

18| Result validation: As noted earlier, PLS-DA tends to overfit the data and this can often lead to false separations or incorrect classification. As a result, PLS-DA models need to be validated to see whether the separation is statistically significant or is due to random noise. This can be carried out using permutation tests. In each permutation, a PLS-DA model is built between the data (X) and the permuted class labels (Y) using the optimal number of components determined in the previous step. MetaboAnalyst provides two kinds of performance measures. The first is the separation distance, which is defined as the ratio of the between-group sum of squares and the within-group sum of squares (B/W ratio), as suggested by Bijlsma et al.36. The second is the prediction accuracy. This is the default approach used by MetaboAnalyst. Click the ‘Permutation’ button to view the results. The resulting histogram summarizes the distribution of the permutation test scores, with the red arrow indicating the performance based on the original labels. The further the arrow is to the right of the distribution, the more significant the separation between the groups. Figure 4c shows a typical permutation result based on separation distance. As seen in this figure, the performance with the original class assignment is highly significant and lies outside the distribution obtained using the permuted data. A P value < 0.0005 is reported on the basis of 2,000 permutations.
? TROUBLESHOOTING
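The logic of the permutation test can be sketched on a single score per sample. The example below computes the B/W ratio directly on hypothetical one-dimensional scores rather than refitting a PLS-DA model in every permutation, which is a simplification; the function names, the seed and the values are assumptions for illustration only.

```python
import random


def bw_ratio(values, labels):
    """Between-group over within-group sum of squares for one score per sample."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = sum(values) / len(values)
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    between = sum(len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items())
    within = sum((v - means[label]) ** 2 for label, g in groups.items() for v in g)
    return between / within  # assumes some scatter remains within the groups


def permutation_test(values, labels, n_perm=1000, seed=0):
    """Count how often shuffled labels separate the groups at least as well."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    observed = bw_ratio(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if bw_ratio(values, shuffled) >= observed:
            hits += 1
    return observed, hits


def format_p(hits, n_perm):
    """Report an upper bound when no permutation reaches the observed value."""
    if hits == 0:
        return f"P < {1 / n_perm:g}"
    return f"P = {hits / n_perm:g}"


# Hypothetical scores for two well-separated groups of six samples each.
scores = [1.0, 1.3, 0.8, 1.1, 0.9, 1.2, 3.0, 3.2, 2.9, 3.1, 2.8, 3.3]
labels = ["A"] * 6 + ["B"] * 6
observed, hits = permutation_test(scores, labels, n_perm=500)
print(f"B/W = {observed:.1f}, {format_p(hits, 500)}")
```

When none of the permuted label sets reaches the observed value, the result can only be bounded by one over the number of permutations, which is why 2,000 permutations lead to a reported P value < 0.0005.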

19| Identification of important features: Click the ‘Var. Importance’ tab to see a list of important features identified based on the variable importance in projection (VIP) score (Fig. 4d). For multiple group analysis, the VIP score is calculated for each component. The overall VIP score shown in the figure is the average across all the selected components. Users can also use the coefficient-based importance measure by clicking on the corresponding radio button and then pressing the ‘Submit’ button. For multiple-group discriminant analysis, the same number of predictors will be built with one for each group. The overall coefficient-based importance is the average of feature coefficients in all predictors. Click the ‘View details …’ link to see the individual VIP scores in each selected component or the coefficients in each group predictor if the coefficient-based importance is used.
? TROUBLESHOOTING

20| Report generation and result download: Click the ‘Download’ node to download all the data, tables and figures produced from this particular analysis.
? TROUBLESHOOTING

Metabolite set enrichment analysis ● TIMING 5–10 min
21| In the Upload page, click the ‘Enrichment Analysis’ tab.

22| There are three drop-down panels for three different types of enrichment analysis (see Box 7 for more details). Each method accepts a different data type: a list of compound names entered in a single-column format for over-representation analysis; a list of compound concentrations entered as a two-column table for single-sample profiling (SSP); and a concentration table (CSV) with samples in rows and metabolites in columns for quantitative enrichment analysis (QEA). The phenotype information must be placed in the second column and can be binary, multiclass or continuous. Click the third drop-down panel, ‘A concentration table (quantitative enrichment analysis)’.

23| In the open page, click ‘Browse’ to locate the ‘human_cachexia.csv’ data file.

24| Ensure that the selected compound label type is ‘compound names’ and the phenotype label is ‘Discrete (Classification)’, and then click ‘Submit’.
? TROUBLESHOOTING

25| Compound name conversion: The purpose of this step is to compare and convert the compound names to common compound names used in the HMDB. The compound identities can be specified by common names or major database IDs (e.g., KEGG, PubChem, HMDB, METLIN and BiGG). MetaboAnalyst’s compound name/ID conversion is based on a name-mapping table from the HMDB. Each HMDB compound ID is associated with a common name, a set of synonyms and compound IDs used in other major metabolomic databases. Any naming inconsistency is flagged and displayed to users for manual inspection and correction (see Step 7 for more details).
CRITICAL STEP Users must label compounds with either common compound names or common database IDs. Abbreviated names usually cannot be recognized. Unmatched or unidentified compounds will be excluded from downstream analyses.
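The name conversion described in this step is essentially a lookup against a table of common names, synonyms and database IDs. The following Python sketch illustrates that idea with a tiny, made-up mapping table; the table layout, the function names and the placeholder IDs (`EXAMPLE_ID_1`, `EXAMPLE_ID_2`) are hypothetical and do not reproduce the actual HMDB mapping table.

```python
# Hypothetical mapping table: common name -> synonyms and database IDs.
EXAMPLE_TABLE = {
    "D-Glucose": {"synonyms": ["Dextrose", "Glucose"], "ids": ["EXAMPLE_ID_1"]},
    "L-Alanine": {"synonyms": ["Alanine"], "ids": ["EXAMPLE_ID_2"]},
}


def build_lookup(mapping_table):
    """Index every common name, synonym and ID by its lower-cased form."""
    lookup = {}
    for common_name, record in mapping_table.items():
        for alias in [common_name, *record["synonyms"], *record["ids"]]:
            lookup[alias.strip().lower()] = common_name
    return lookup


def convert_names(queries, lookup):
    """Split user labels into matched (label -> common name) and unmatched."""
    matched, unmatched = {}, []
    for query in queries:
        hit = lookup.get(query.strip().lower())
        if hit is None:
            unmatched.append(query)  # flagged for manual inspection
        else:
            matched[query] = hit
    return matched, unmatched


lookup = build_lookup(EXAMPLE_TABLE)
matched, unmatched = convert_names(["dextrose", "Ala", "EXAMPLE_ID_2"], lookup)
print(matched)
print(unmatched)
```

In this toy table the abbreviation ‘Ala’ is not listed as a synonym and therefore ends up in the unmatched list, which mirrors the warning above that abbreviated names usually cannot be recognized.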

26| Concentration comparison (optional): This step is only applicable when the uploaded data is a list of compound concentrations used for SSP. The basic idea behind SSP is to compare the measured concentration values of each compound with its normal reference ranges in the corresponding biofluid. For common human biofluids, such as blood, urine or cerebrospinal fluid, normal concentration ranges are known for many metabolites. In clinical metabolomic studies, it is often desirable to know whether certain metabolite concentrations in a given sample are higher or lower than their normal ranges. This procedure is designed to provide this kind of analysis. Click ‘Conc. check’ to start concentration comparison. By default, only compounds with concentrations above or below all the known or reported normal ranges will be selected for further investigation. Users should manually select or deselect compounds to override this default selection by inspecting the concentration comparison plots, as well as the original reports, by clicking the image icon in the ‘Details’ column.
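The default selection rule in this step (keep only compounds lying above or below all reported normal ranges) can be expressed compactly. The sketch below is an illustration under that reading of the rule; the function name, the concentration value and the reference ranges are hypothetical.

```python
def compare_to_reference(measured, reference_ranges):
    """Classify a concentration against all reported normal ranges.

    reference_ranges is a list of (low, high) tuples, one per report.
    """
    if not reference_ranges:
        return "no reference"
    if all(measured > high for _, high in reference_ranges):
        return "high"  # above every reported range
    if all(measured < low for low, _ in reference_ranges):
        return "low"  # below every reported range
    return "within"  # consistent with at least one reported range


# Hypothetical reference ranges from two reports (arbitrary units).
ranges = [(1.0, 5.0), (2.0, 8.0)]
for value in (0.5, 6.0, 12.0):
    print(value, compare_to_reference(value, ranges))
```

Only compounds classified as ‘high’ or ‘low’ would be preselected under this rule; a value such as 6.0, which exceeds one report's range but not the other's, is left for the user to judge manually.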


27| Data normalization (optional): This step is only applicable when the uploaded data is a concentration table. In this case, we select ‘Normalization by a reference sample’, and then choose ‘create a pooled average sample’ from the ‘control’ group. Choose ‘Autoscaling’ for column-wise normalization. See Box 4 for more details.
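A rough picture of what normalization to a pooled reference sample does is given below. The sketch assumes that this option behaves like probabilistic quotient normalization (ref. 42), which this step does not state explicitly: a reference profile is built by averaging the control samples, and each sample is then divided by the median of its feature-wise quotients to that reference. Function names and values are hypothetical.

```python
import statistics


def pooled_reference(control_samples):
    """Average the control samples feature by feature."""
    n_features = len(control_samples[0])
    return [statistics.mean(s[j] for s in control_samples) for j in range(n_features)]


def normalize_to_reference(sample, reference):
    """Divide a sample by the median quotient between it and the reference."""
    quotients = [v / r for v, r in zip(sample, reference) if v > 0 and r > 0]
    if not quotients:
        raise ValueError("no positive feature pairs to compute quotients from")
    factor = statistics.median(quotients)  # estimated dilution factor
    return [v / factor for v in sample]


# Hypothetical control samples (rows) with three metabolite concentrations each.
controls = [[1.0, 2.0, 4.0], [3.0, 2.0, 4.0]]
reference = pooled_reference(controls)
diluted = [1.0, 1.0, 2.0]  # a sample measured at half the reference concentration
print(normalize_to_reference(diluted, reference))
```

Using the median quotient rather than the total signal makes the estimated dilution factor less sensitive to a few metabolites that change strongly between groups. Column-wise scaling such as autoscaling would then be applied as a separate step.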

28| Data visualization and outlier detection (optional): The purpose of this step is to check whether the data values are relatively homogeneous and to detect outliers. Click the ‘PCA’ node to open the PCA page. On the 2D score plot, a clear outlier ‘PIF_115’ is noticeable as it is far away from all other data points. This particular outlier is due to sample deterioration/contamination. Follow the route ‘Processing → DataEditor’ and select ‘PIF_115’ under the ‘Sample Editor’ tab, click ‘Remove’ and then click ‘Finish’ to go back to the normalization page. Perform the data normalization as done in Step 27. Recheck the PCA score plot. This time, no obvious outlier should be detected. Follow ‘Enrichment → Set param.’ to specify the parameters for enrichment analysis.

29| Set parameters for enrichment analysis: In this step, users must specify a metabolite set library (or upload a custom metabolite set library) to start the analysis (see Box 7 for details). Users can also indicate whether a filter should be applied to exclude metabolite sets containing very few compounds. In this case, we use the default ‘Pathway-associated metabolite sets’ and click the ‘Next’ button to view the result.

30| View the MSEA results: The MSEA result is presented both graphically and in a detailed table (Fig. 5a). The horizontal bar graph summarizes the most significant metabolite sets identified during the analysis. The bars are colored on the basis of their P values and the bar length is based on the fold enrichment, calculated as the actual matched number / expected number of matches (for over-representation analysis) or as the calculated statistic / expected statistic (for QEA). The Bonferroni corrected P value and FDR are also provided. Users can click the image icon in the ‘Details’ column of each matched metabolite set to view all its constituent metabolites with matched ones highlighted in red (Fig. 5b), as well as SMPDB pathway images37 (when available).
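For over-representation analysis, the fold enrichment defined above and an accompanying P value can be computed by hand. The sketch below assumes a one-sided hypergeometric test, which is a common choice for over-representation analysis but is not specified in this step; the function names and the counts in the example are hypothetical.

```python
from math import comb


def fold_enrichment(hits, set_size, query_size, background_size):
    """Observed matches divided by the number expected by chance."""
    expected = set_size * query_size / background_size
    return hits / expected


def hypergeom_p(hits, set_size, query_size, background_size):
    """Probability of seeing at least `hits` matches by chance."""
    total = comb(background_size, query_size)
    upper = min(set_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts: 5 of 20 query compounds fall in a 10-compound set,
# against a background of 100 compounds.
print(fold_enrichment(5, 10, 20, 100))
print(hypergeom_p(5, 10, 20, 100))
```

The Bonferroni helper shows the simplest form of the multiple-testing correction reported in the results table; the FDR column would require a separate procedure that is not shown here.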

31| Report generation and result download: Click the ‘Download’ node to download the analysis report, images and the processed data.
? TROUBLESHOOTING

Metabolic pathway analysis ● TIMING ~10 min
32| Data upload and processing: In the ‘Upload’ page, click the ‘Pathway Analysis’ tab to get started with the ‘human_cachexia.csv’ data. Users can enter either a list of compound names or a concentration table. The data upload and processing steps are similar to those involved in the enrichment analysis. Please see Steps 21–25 for more details.
? TROUBLESHOOTING

33| Set parameters for pathway analysis: Three parameters must be specified for pathway analysis. These include the pathway library, the algorithm for pathway enrichment analysis and the algorithm for topology analysis (see Box 8 for more details). Users can also supply a reference metabolome to correct for any potential bias in the enrichment analysis. The reference metabolome is specified as a list of KEGG compound IDs. In this case, we select the ‘Homo sapiens’ library and use the default ‘Global Test’ and ‘Relative Betweenness Centrality’ for pathway enrichment analysis and pathway topology analysis, respectively.

34| Result visualization: The results from the pathway analysis are presented in two parts: a graphical output in the top section and a table containing all the numerical results at the bottom. Users can intuitively explore the results by pointing and clicking on various graphic elements. There are three types of views (Fig. 6). The left panel is the ‘metabolome view’, which displays all the matched pathways as circles (Fig. 6a). The color and size of each circle are based on P values and pathway impact values, respectively. Pointing the mouse over different nodes will show the corresponding pathway names. Clicking the nodes of interest will launch the corresponding ‘pathway view’ on the right panel (Fig. 6b). Users can zoom or drag to focus on a particular section of the pathway. Clicking on any matched compound node (with highlighted background) will show the corresponding ‘compound view’, which contains a detailed summary of the compound concentrations, importance measure, as well as the P value (Fig. 6c).
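To give a feel for how a topology-based pathway impact value can arise, the sketch below computes betweenness centrality on a small directed graph (using Brandes' algorithm for unweighted graphs) and then sums the relative centrality of the matched compounds. Treating pathway impact as the matched share of total centrality is an assumption made for illustration, as are the function names and the toy pathway; this section does not spell out the exact formula used by MetaboAnalyst.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for a directed, unweighted graph.

    graph maps each node to a list of its successor nodes.
    """
    nodes = set(graph) | {w for targets in graph.values() for w in targets}
    score = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        order = []  # nodes in order of non-decreasing distance from source
        preds = {v: [] for v in nodes}
        n_paths = dict.fromkeys(nodes, 0.0)
        n_paths[source] = 1.0
        dist = dict.fromkeys(nodes, -1)
        dist[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = dict.fromkeys(nodes, 0.0)
        while order:  # accumulate dependencies from the farthest nodes back
            w = order.pop()
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    return score


def pathway_impact(graph, matched):
    """Share of total betweenness carried by the matched compounds."""
    score = betweenness(graph)
    total = sum(score.values())
    if total == 0:
        return 0.0
    return sum(score[m] for m in matched if m in score) / total


# Hypothetical pathway: M1 is converted to M2, which branches to M3 and M4.
toy_pathway = {"M1": ["M2"], "M2": ["M3", "M4"], "M3": [], "M4": []}
print(pathway_impact(toy_pathway, {"M2"}))
print(pathway_impact(toy_pathway, {"M4"}))
```

In the toy pathway, a change in the hub compound M2 gives a much larger impact than a change in the terminal compound M4, which is the intuition behind combining topology with enrichment P values in the metabolome view.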

35| Report generation and result download: Click the ‘Download’ node to get the complete analysis report as well as the processed data and images produced during the analysis.
? TROUBLESHOOTING


? TROUBLESHOOTING
Troubleshooting advice can be found in Table 2.

● TIMING
The time required to perform the steps described in the protocol depends on the data set size as well as the number of active users connected to the web server. For the test data sets used for these protocols, most results should be returned within a few seconds after a user has selected the appropriate parameters. The most time-consuming computational step is probably the permutation test used by PLS-DA (15–20 s for 1,000 permutations). The most time-consuming non-computational step is typically the data visualization or data inspection step. Data upload, processing and normalization (Steps 2–7) should take about 5–10 min; feature selection using univariate analysis (Steps 8–10) usually takes around 3–5 min; and multivariate analysis (Steps 11–20) takes ~10 min. For high-level functional analysis, MSEA (Steps 21–31) should take 5–10 min, whereas metabolic pathway analysis (Steps 32–35) should take ~10 min. Once the data have been uploaded, a modestly experienced user should be able to execute the complete protocol in 30–40 min.

ANTICIPATED RESULTS
Graphical output
The graphical outputs produced during the analysis procedures are given in Figures 1–7. Some of MetaboAnalyst’s algorithms use time-dependent random number generators to calculate certain statistical values, so the results may vary slightly among runs.

Data processing results
The data integrity check for the data in ‘cow_diet.csv’ will detect four groups with a total of 51 zero values and no missing values. The data integrity check for ‘human_cachexia.csv’ will yield two groups with no zero or missing values.

Feature selection using univariate methods
In MetaboAnalyst’s ANOVA analysis of the ‘cow_diet.csv’ data, the top five compounds identified with the default threshold should be endotoxin, 3-PP, glucose, isobutyrate and methylamine. The top five compounds identified using the SAM method will be the same. In correlation analysis using the predefined ‘1–2–3–4’ pattern, endotoxin and alanine are the top two compounds that will be positively correlated with this pattern, whereas 3-PP and aspartate are the top two compounds that will be negatively correlated with this pattern. The same compounds should be identified as being correlated or anticorrelated with endotoxin, using ‘Pearson r’.

Table 2 | Troubleshooting table.

Steps: 1
Problem: The content of the home page does not show up
Possible reason: JavaScript is disabled in your browser
Possible solution: For Mozilla Firefox 3.0+, go to Tools → Options → Content, then select the checkbox beside ‘Enable JavaScript’. For Internet Explorer 8.0, go to Tools → Internet options → Security, then select ‘Internet’ from the Zone icons. Click the ‘Custom level …’ button. From the list of available options, make sure the ‘Disable’ radio button is not selected under the ‘Active scripting’ item. For Safari 4.0+, go to Edit → Preferences → Security, then select the checkbox beside ‘Enable JavaScript’. Please check the documentation for other browsers on how to enable JavaScript

Steps: 2, 24 and 32
Problem: Fail to upload data
Possible reason: Non-unique or unusual names; small sample size; wrong data formats; unrecognized zip format
Possible solution: Make sure sample or feature (peak/compound) names are unique and consist of a combination of English letters, underscores or numbers; the names should contain no spaces or other special characters; make sure there are at least three samples per group; make sure the selected data format matches your data; for Microsoft Excel users, choose ‘CSV (Macintosh)’ to generate a .csv file; for WinZip (v12.0) users, choose the ‘Legacy compression (Zip 2.0 compatible)’ for compression

Steps: 17–19
Problem: No image is generated
Possible reason: The sample size is too small
Possible solution: These procedures require a minimum of five samples per group

Steps: 10, 20, 31, 35
Problem: No PDF report is generated
Possible reason: Some of the expected data were not generated
Possible solution: Set appropriate parameter values to make sure the resulting images are generated; make sure there are a minimum of five samples per group for PLS-DA analysis
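The naming and sample-size requirements listed in Table 2 for data upload can be checked locally before submitting a file. The sketch below is a hypothetical pre-upload check written against the wording of Table 2 (unique names made of English letters, numbers or underscores, and at least three samples per group); the function name and the example labels are assumptions, and the server's own validation may be stricter or more lenient.

```python
import re
from collections import Counter

# Names may contain only English letters, digits and underscores.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def check_upload(sample_names, feature_names, group_labels, min_per_group=3):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for kind, names in (("sample", sample_names), ("feature", feature_names)):
        duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
        if duplicates:
            problems.append(f"duplicate {kind} names: {duplicates}")
        unusual = [n for n in names if not NAME_PATTERN.match(n)]
        if unusual:
            problems.append(f"{kind} names with spaces or special characters: {unusual}")
    for group, count in Counter(group_labels).items():
        if count < min_per_group:
            problems.append(f"group '{group}' has only {count} sample(s)")
    return problems


# Hypothetical labels with one duplicate, one name containing a space,
# and two groups that are too small.
for problem in check_upload(["s1", "s1", "s 3"], ["glucose"], ["a", "a", "b"]):
    print(problem)
```

For the PLS-DA steps, the same check can be run with `min_per_group=5`, in line with the minimum stated in Table 2 for Steps 17–19.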

Multivariate data analysis
The score plot from the PCA analysis of the ‘cow_diet.csv’ data should not show a clear separation, with groups 1 and 2 overlapping substantially and group 3 slightly overlapping with groups 2 and 4. A much better group separation will be achieved through PLS-DA. Using PLS-DA, the five most important compounds identified by VIP will be endotoxin, 3-PP, alanine, methylamine and glucose. The best PLS-DA model will use just the top two components, based on the Q² score estimated from LOOCV (0.814). The permutation test with 2,000 permutations should yield P < 5e-04, which is very significant.

Metabolite set enrichment analysis
All compound names from the ‘human_cachexia.csv’ data set should be found to have an exact match during the name conversion step. The PCA score plot should not show a clear separation, although it should show PIF_115 as being a clear outlier. In the enrichment analysis using the pathway-based metabolite sets, the top five metabolic pathways that appear to be associated with cachexia will be pyrimidine metabolism, beta-alanine metabolism, ketone body metabolism, purine metabolism and glutamate metabolism.

Metabolic pathway analysis
The top five pathways from the ‘human_cachexia.csv’ data set that should be identified by pathway enrichment analysis alone are pyrimidine metabolism, pantothenate and CoA biosynthesis, beta-alanine metabolism, synthesis and degradation of ketone bodies and propanoate metabolism. Note that three of these pathways are similar to those previously identified by MSEA. The top three pathways identified by topology analysis alone should be glycine, serine and threonine metabolism; pyruvate metabolism; and taurine and hypotaurine metabolism. Overall, three pathways (pantothenate and CoA biosynthesis; citrate cycle (TCA cycle); and alanine, aspartate and glutamate metabolism) appear to be perturbed as a consequence of cachexia, as these will be located in the diagonal area of the plot with relatively good scores from both analyses.

ACKNOWLEDGMENTS We thank the Canadian Institutes for Health Research (CIHR) and the Alberta Ingenuity Fund (AIF; now part of Alberta Innovates - Technology Futures) for financial support.

AUTHOR CONTRIBUTIONS J.X. and D.S.W. prepared and tested the protocol and wrote the article.

COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.

Published online at http://www.natureprotocols.com/. Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Fiehn, O. Metabolomics—the link between genotypes and phenotypes. Plant. Mol. Biol. 48, 155–171 (2002).

2. Wishart, D.S. Quantitative metabolomics using NMR. Trends Analyt. Chem. 27, 228–237 (2008).

3. Dunn, W.B. & Ellis, D.I. Metabolomics: current analytical platforms and methodologies. Trends Analyt. Chem. 24, 285–294 (2005).

4. Wishart, D.S. et al. HMDB: the human metabolome database. Nucleic Acids Res. 35, D521–D526 (2007).

5. Lundberg, P. et al. MDL—The Magnetic Resonance Metabolomics Database http://mdl.imv.liu.se (European Society for Magnetic Resonance in Medicine and Biology, ESMRMB, 2005).

6. Smith, C.A. et al. METLIN—a metabolite mass spectral database. Ther. Drug Monit. 27, 747–751 (2005).

7. Weljie, A.M., Newton, J., Mercier, P., Carlson, E. & Slupsky, C.M. Targeted profiling: quantitative analysis of 1H NMR metabolomics data. Anal. Chem. 78, 4430–4442 (2006).

8. Smith, C.A., Want, E.J., O′Maille, G., Abagyan, R. & Siuzdak, G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal. Chem. 78, 779–787 (2006).

9. Zhao, Q., Stoyanova, R., Du, S., Sajda, P. & Brown, T.R. HiRes—a tool for comprehensive assessment and interpretation of metabolomic data. Bioinformatics 22, 2562–2564 (2006).

10. Xia, J., Bjorndahl, T.C., Tang, P. & Wishart, D.S. MetaboMiner—semi-automated identification of metabolites from 2D NMR spectra of complex biofluids. BMC Bioinformatics 9, 507 (2008).

11. Lommen, A. MetAlign: interface-driven, versatile metabolomics tool for hyphenated full-scan mass spectrometry data preprocessing. Anal. Chem. 81, 3079–3086 (2009).

12. Katajamaa, M., Miettinen, J. & Oresic, M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22, 634–636 (2006).

13. Wishart, D.S. Current Progress in computational metabolomics. Brief. Bioinform. 8, 279–293 (2007).

14. Cui, Q. et al. Metabolite identification via the Madison Metabolomics Consortium Database. Nat. Biotechnol. 26, 162–164 (2008).

15. Wishart, D.S. et al. HMDB: a knowledgebase for the human metabolome. Nucleic Acids Res. 37, D603–D610 (2009).

16. Henderson, J.P. et al. Quantitative metabolomics reveals an epigenetic blueprint for iron acquisition in uropathogenic Escherichia coli. PLoS Pathog. 5, e1000305 (2009).

17. Altmaier, E. et al. Variation in the human lipidome associated with coffee consumption as revealed by quantitative targeted metabolomics. Mol. Nutr. Food Res. 53, 1357–1365 (2009).

18. Ewald, J.C., Heux, S. & Zamboni, N. High-throughput quantitative metabolomics: workflow for cultivation, quenching, and analysis of yeast in a multiwell format. Anal. Chem. 81, 3623–3629 (2009).

19. Zulak, K.G., Weljie, A.M., Vogel, H.J. & Facchini, P.J. Quantitative 1H NMR metabolomics reveals extensive metabolic reprogramming of primary and secondary metabolism in elicitor-treated opium poppy cell cultures. BMC Plant Biol. 8, 5 (2008).


20. Xia, J., Psychogios, N., Young, N. & Wishart, D.S. MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res. 37, W652–W660 (2009).

21. Xia, J. & Wishart, D.S. MSEA: A web-based tool to identify biologically meaningful patterns in quantitative metabolomics data. Nucleic Acids Res. 38, W71–W77 (2010).

22. Xia, J. & Wishart, D.S. MetPA: a web-based metabolomics tool for pathway analysis and visualization. Bioinformatics 26, 2342–2344 (2010).

23. Neuweger, H. et al. MeltDB: a software platform for the analysis and integration of metabolomics experiment data. Bioinformatics 24, 2726–2732 (2008).

24. Kastenmuller, G., Romisch-Margl, W., Wagele, B., Altmaier, E. & Suhre, K. metaP-server: a web-based metabolomics data analysis tool. J. Biomed. Biotechnol. 2011, (2010).

25. Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11, 395 (2010).

26. Broeckling, C.D., Reddy, I.R., Duran, A.L., Zhao, X. & Sumner, L.W. MET-IDEA: data extraction tool for mass spectrometry-based metabolomics. Anal. Chem. 78, 4334–4341 (2006).

27. Duran, A.L., Yang, J., Wang, L.J. & Sumner, L.W. Metabolomics spectral formatting, alignment and conversion tools (MSFACTs). Bioinformatics 19, 2283–2293 (2003).

28. Luedemann, A., Strassburg, K., Erban, A. & Kopka, J. TagFinder for the quantitative analysis of gas chromatography—mass spectrometry (GC-MS)-based metabolite profiling experiments. Bioinformatics 24, 732–737 (2008).

29. Wohlgemuth, G., Haldiya, P.K., Willighagen, E., Kind, T. & Fiehn, O. The Chemical Translation Service—a web-based tool to improve standardization of metabolomic reports. Bioinformatics 26, 2647–2648 (2010).

30. Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, 5116–5121 (2001).

31. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).

32. Salomonis, N. et al. GenMAPP 2: new features and resources for pathway analysis. BMC Bioinformatics 8, 217 (2007).

33. Goffard, N., Frickey, T. & Weiller, G. PathExpress update: the enzyme neighbourhood method of associating gene-expression data with metabolic pathways. Nucleic Acids Res. 37, W335–W339 (2009).

34. Hu, Z. et al. VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. Nucleic Acids Res. 37, W115–W121 (2009).

35. Goffard, N. & Weiller, G. PathExpress: a web-based tool to identify relevant pathways in gene expression data. Nucleic Acids Res. 35, W176–W181 (2007).

36. Bijlsma, S. et al. Large-scale human metabolomics studies: a strategy for data (pre-) processing and validation. Anal. Chem. 78, 567–574 (2006).

37. Frolkis, A. et al. SMPDB: the small molecule pathway database. Nucleic Acids Res. 38, D480–D487 (2010).

38. Efron, B., Tibshirani, R., Storey, J.D. & Tusher, V. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160 (2001).

39. Trygg, J. & Wold, S. Orthogonal projections to latent structures (O-PLS). J. Chemom. 16, 119–128 (2002).

40. Wang, T. et al. Automics: an integrated platform for NMR-based metabonomics spectral processing and data analysis. BMC Bioinformatics 10, 83 (2009).

41. Stacklies, W., Redestig, H., Scholz, M., Walther, D. & Selbig, J. pcaMethods—a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23, 1164–1167 (2007).

42. Dieterle, F., Ross, A., Schlotterbeck, G. & Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 78, 4281–4290 (2006).

43. van den Berg, R.A., Hoefsloot, H.C., Westerhuis, J.A., Smilde, A.K. & van der Werf, M.J. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142 (2006).

44. Pavlidis, P. Using ANOVA for gene selection from microarray studies of the nervous system. Methods 31, 282–289 (2003).

45. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

46. Westerhuis, C.A. et al. Assessment of PLSDA cross validation. Metabolomics 4, 81–89 (2007).

47. Goeman, J.J., van de Geer, S.A., de Kort, F. & van Houwelingen, H.C. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20, 93–99 (2004).

48. Hummel, M., Meister, R. & Mansmann, U. GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics 24, 78–85 (2008).

49. Aittokallio, T. & Schwikowski, B. Graph-based methods for analysing networks in cell biology. Brief Bioinform. 7, 243–255 (2006).
