
Visual Analysis of Gel-free Proteome Data

Lars Linsen, Julia Locherbach, Matthias Berth, Dorte Becher, Jorg Bernhardt

Abstract— We present a visual exploration system supporting protein analysis when using gel-free data acquisition methods. The data to be analyzed is obtained by coupling liquid chromatography (LC) with mass spectrometry (MS). LC-MS data have the properties of being non-equidistantly distributed in the time dimension (measured by LC) and being scattered in the mass-to-charge ratio dimension (measured by MS). We describe a hierarchical data representation and visualization method for large LC-MS data. Based on this visualization we have developed a tool that supports various data analysis steps. Our visual tool provides a global understanding of the data, intuitive detection and classification of experimental errors, and extensions to LC-MS/MS, LC/LC-MS, and LC/LC-MS/MS data analysis. Due to the presence of randomly occurring rare isotopes within the same protein molecule, several intensity peaks may be detected that all refer to the same peptide. We have developed methods to unite such intensity peaks. This deisotoping step is visually documented by our system, such that misclassification can be detected intuitively. For differential protein expression analysis we compute and visualize the differences in protein amounts between experiments. In order to compute the differential expression, the experimental data need to be registered. For registration we perform a non-rigid warping step based on landmarks. The landmarks can be assigned automatically using protein identification methods. We evaluate our methods by comparing protein analysis with and without our interactive visualization-based exploration tool.

Index Terms— Interactive visual exploration, hierarchical data representation, visualization in bioinformatics, proteomics.

I. INTRODUCTION

Proteomics is the study of the function of proteins [29]. The goal of proteomics is to determine how much of which protein is present under which conditions. Thus, protein analyses include

• protein identification and qualitative analysis, i. e., which proteins are present in a given sample,

• quantitative analysis, i. e., how much of a protein is present in a given sample, and

• differential protein expression analysis, i. e., what is the difference in protein expression under changing conditions.

Differential protein expression experiments are typically performed for test vs. control sample. The computed differences indicate up- or down-regulation of certain proteins.

Various approaches exist to quantitatively measure occurrence of proteins in a given sample. A common approach is to use 2D gel electrophoresis, where whole or intact proteins are separated in two orthogonal directions. This setup results in a 2D image, which can be analyzed using image processing approaches. A relatively new approach is to couple liquid chromatography (LC) with mass spectrometry (MS). The LC step is used to separate proteins (or parts thereof that are generated by a cleavage process) and the subsequent MS step is used to determine the masses of the individual parts called peptides. More experimental background on the deployed methods is given in Section III.

Lars Linsen and Julia Locherbach are with the Department of Mathematics and Computer Science, Ernst-Moritz-Arndt-Universität Greifswald, Matthias Berth is with DECODON GmbH, and Dorte Becher and Jorg Bernhardt are with the Department of Microbiology, Ernst-Moritz-Arndt-Universität Greifswald, all located in Greifswald, Germany. E-Mail addresses: [email protected], {loeba,berth}@decodon.com, {dbecher,Joerg.Bernhardt}@uni-greifswald.de.

The structure of the obtained LC-MS data is rather complicated. While the LC step leads to sample locations that are non-equidistantly distributed in the time dimension, the MS step creates scans for each time step with unstructured or scattered samples. Thus, the visualization of LC-MS data is not as trivial as generating a 2D image. The first approaches to visualize LC-MS data in an accurate and interactive way have recently been introduced by us, Linsen et al. [17], and by de Corral and Pfister [7].

In this paper, we present a system for visual exploration of LC-MS data and its extensions to support interactive protein analysis. For visualization purposes, we create a hierarchical representation of LC-MS data. The hierarchy generation follows a one-dimensional resampling step and assures that intensity peaks are preserved throughout the different levels of resolution. The hierarchical data representation and visualization methods are described in Section IV.

When representing the LC-MS data over the described two-dimensional domain, the peptides (i. e., parts of a cleaved protein) correspond to individual intensity peaks. However, due to the presence of different isotopes of the same chemical element, one kind of peptide may be represented by several intensity peaks. To determine the peptide's and, thus, the protein's quantity in the sample, the intensity peaks of one kind of peptide need to be united. Our system supports such a deisotoping step and visually shows its effect.

In order to identify the proteins in our sample, the LC-MS experiment is extended by another MS step. This is called tandem mass spectrometry. The second MS step determines the amino acid sequence of a peptide, such that the proteins can be identified by matching the sequence to data obtained from an online database. Our visual exploration tool also supports the visualization of such LC-MS/MS data, as well as further extensions to LC/LC-MS and LC/LC-MS/MS experiments.

For the computation of differential protein expression, the results of two experiments (typically test vs. control) need to be compared quantitatively. As the LC step of LC-MS experiments is not reproducible with sufficient precision, a registration step is required prior to the difference computations. We execute a non-rigid registration step based on landmarks. The landmarks can be obtained automatically by performing a protein identification step for both experiments and matching the equally identified intensity peaks. After registration, the computed differential expressions are visualized by our system in a 3D setup. The visual exploration methods supported by our system are described in Section V.

This work is an extension of our previous efforts [17]. The extended system supports further important features for visual protein analysis. The new features include

• the extension to visual data exploration of data obtained by tandem mass spectrometry,

• the deisotoping step to unite scattered intensity peaks belonging to one peptide,

• the description of an automated domain warping step based on protein identification to register data sets allowing for differential expression computation, and

• the quantitative depiction of up- and down-regulation for a test vs. control experiment in our 3D interactive environment.

Moreover, we conduct a thorough evaluation of our method. We apply it to a real data set examining the protein expression in Bacillus subtilis and compare visual protein analysis using our interactive exploration tool with methods used prior to the introduction of our tool. The evaluation is described in Section VI.

II. RELATED WORK

LC-MS data can be displayed directly using one 2D plot of intensity over time (output of the LC step) and many thousands of 2D plots of intensity over mass-to-charge ratio (output of the MS step). Obviously, exploring the data by looking at some of these 2D plots does not lead to a good comprehension of the data as a whole. Instead one would like to have a global view on the entire data, i. e., looking at intensities over a two-dimensional domain. A first approach has been taken by Li et al. [14]. The output of the visualization method, however, restricts itself to two-dimensional images, where intensity is displayed using different shades of gray in a linear or logarithmic mapping. The images clearly exhibit the positions of intensity peaks, but the different shades of gray are hard to distinguish, which limits the perception of the actual intensity values.

The usage of 2D images is quite common for data visualization in proteomics, e. g., when using gel electrophoresis [2]. However, if we generate such images for LC-MS data, we run into the problem that the samples are not structured over a Cartesian grid. In the work by Li et al. [14], the resolution of the resulting images can be chosen by the user. To deal with the challenges of not operating on a Cartesian grid, the authors averaged the intensities collected within each pixel. Due to the averaging step, intensity peaks may not have been displayed with their maximum intensity but with a much lower value. Significant intensity peaks may have been reduced to smaller peaks that do not seem to be noteworthy. From a biological point of view, one is particularly interested in the height of the high intensity peaks. Thus, a reduction of the intensities would lead to a significant distortion of the actual data.

We have developed a 3D visualization technique for LC-MS data, where intensities are displayed over the 2D domain [17]. The 3D view allows for a global understanding of the absolute and relative intensities. For data representation, we do not perform any averaging steps but preserve all the intensity peaks. The approach is revisited in Section IV. Such maximum-preserving operations have been used in other contexts, e. g., by Shinagawa and Kunii [26].

Meanwhile, de Corral and Pfister [7] have presented a visualization method for LC-MS data similar to ours. Their approach is motivated by the adaptive representation of height fields in terrain visualization applications. They decided to adjust the method by Losasso and Hoppe [19] to their needs. The main focus of de Corral and Pfister's work is real-time display of large data sets using sophisticated data management and hardware-accelerated methods. The driving biological questions, including the need for preserving the maximum intensities throughout all levels of resolution, are not discussed in depth.

To deal with large data sets, multiresolution methods are commonly applied in visualization. Many different approaches exist for two-dimensional and even higher-dimensional domains [4], [11], [18], [23]. Hierarchical methods are also common in terrain visualization [5], [6], [8], [12], [16], [19], which was the motivation for de Corral and Pfister [7] to use such approaches. A common technique to build multiresolution hierarchies is the use of wavelets [27]. By storing data values explicitly for the lowest resolution only and computing higher-resolution representations by successively adding details, wavelet-based hierarchies do not require any additional storage space (compared to the storage space of the original data at highest resolution). In the work by Andersson et al. [1], one-dimensional wavelets have been applied to LC data for data reduction and de-noising.

III. GEL-FREE PROTEOMICS

A. Liquid Chromatography-Mass Spectrometry (LC-MS)

Liquid chromatography-fed mass spectrometry (LC-MS) is an analytical technique that combines physical separation via liquid chromatography with mass analysis via mass spectrometry. It has recently received a lot of attention in the field of proteomics. It has been demonstrated that LC-MS-based methods are very powerful and in certain aspects superior or complementary to other approaches such as 2D electrophoresis, see for example [21]. In particular, LC-MS-based methods are capable of capturing both intracellular proteins and membrane proteins and seem to perform especially well for the latter.

Figure 1 illustrates the individual LC-MS processing steps. Proteins are macromolecules consisting of hundreds or thousands of amino acids. A biological sample, in turn, can be a mix of thousands of different proteins. In the first step of our processing pipeline, protein molecules are cut into smaller fragments (called peptides), e. g., by the enzyme trypsin. Trypsin cuts at well-defined positions in the amino acid chain (after lysine and also after arginine if not followed by proline), such that the sequences of potential fragments are known when a protein's sequence is known.

Fig. 1. Liquid chromatography-mass spectrometry (LC-MS) workflow: After growing and isolating the biological material (A), the sample is taken and its proteins are digested using peptidases (B). This process leads to cleaved proteins called peptides, which are fed to the liquid chromatograph, where the peptides are separated. This diagram shows the utilization of two consecutive LC steps. The peptides are separated by loading the peptide mixture onto an ion exchange column (C), eluting the column step by step using an ammonium chloride gradient, directly loading the eluate onto a reverse phase column (D), and eluting the reverse phase column using a continuous acetonitrile gradient (E). The eluate directly enters the mass spectrometer to determine the masses of the separated peptides. The MS ionizes the peptides, deflects them using a quadrupole, and detects the ions (F). Finally, the data is delivered to a computer, where our visual data analysis tool is applied (G).

In order to examine the peptides individually, we need to separate them. Peptide separation is done by liquid chromatography (LC). A solvent containing the peptides is forced through a separation column (loading). The column contains the stationary phase that binds the peptides. Afterwards, the peptides are washed out of the column by the mobile phase (eluting). The diagram in Figure 1 shows the utilization of two consecutive LC steps, which is described below. The weaker a peptide is bound to the substrate, the faster it gets washed out. Thus, peptides can be separated by their binding properties (e. g., hydrophobicity). The output data of the LC step can be displayed using a 2D plot, where intensity in counts per second is plotted over time, see Figure 2.

Fig. 2. Liquid chromatography outputs intensity values over time.

The masses of the separated peptides can be determined individually using mass spectrometry (MS). Mass spectrometry separates ions by their mass-to-charge ratios (m/z-ratios). Thus, we need to ionize the peptides. Different approaches for ionization exist. Among the most popular are electrospray ionization (ESI) [9] and matrix-assisted laser desorption ionization (MALDI) [13]. When using electrospray ionization, one may have to normalize the results due to decreasing ionization. After ionization the molecules are accelerated and handed to the mass analyzer. The mass analyzer uses electric (time-of-flight) or magnetic fields (quadrupole) to deflect the charged particles, while the kinetic energy imparted by motion gives the particles inertia dependent on their mass. The mass analyzer steers the particles to a detector based on their m/z-ratio. The detector measures intensity in counts per second. The MS output can be displayed by one 2D plot for each time step. The 2D plot shows intensity over mass-to-charge ratio (or m/z-ratio), see Figure 3.

The data coming out of a liquid chromatograph is a function over time. When delivering the data from the detector to a computer system, the values are given at discrete points in time $t_1, \dots, t_n \in \mathbb{R}$. The points in time $t_i$, $i = 1, \dots, n$, are not distributed equidistantly. The number $n$ can be in the range of many thousands. Figure 2 shows a 2D plot.

The data coming out of the mass spectrometer is a function over the m/z-ratio. The intensity is measured in counts per time and stored as an intensity list for discrete m/z-ratios $m_1, \dots, m_p \in \mathbb{R}$. The m/z-ratios $m_j$, $j = 1, \dots, p$, are not distributed equidistantly, either. The number $p$ depends on the experimental setup and can be in the range of tens to hundreds of thousands. Figure 3 shows a 2D plot of such an intensity list. The plot exhibits several high peaks.

Fig. 3. Mass spectrometry outputs intensity values over m/z-ratios.

Instead of generating such two-dimensional graphs for each point in time $t_i$, $i = 1, \dots, n$, we use a three-dimensional setup, where the intensity is shown as a heightfield over the dimensions time and m/z-ratio. Unfortunately, the m/z-ratios $m_1, \dots, m_p$ vary from one point in time $t_i$ to the subsequent point in time $t_{i+1}$, and even the number $p$ of values in the m/z-ratio dimension varies significantly. Thus, when looking at a two-dimensional domain with the dimensions being m/z-ratio and time, data positions are scattered in one dimension and non-equidistant in the other dimension. Figure 4 sketches the discrete data positions for LC-MS in the two-dimensional domain.

Fig. 4. Structure of LC-MS data: Data value positions are scattered in m/z-dimension and non-equidistant in t-dimension. (Axes of the sketch: time t with samples $t_1, t_2, \dots, t_n$, and m/z.)
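The data layout sketched in Figure 4 can be mirrored by a simple container: one scan per time step, each carrying its own m/z list of varying length. The following Python sketch is illustrative only; the class `Scan`, the function `describe`, and the sample values are assumptions and not part of the described system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scan:
    """One MS scan: intensities at scattered m/z positions for one time step."""
    time: float             # point in time t_i (not equidistant across scans)
    mz: List[float]         # m/z-ratios m_1..m_p, sorted; p varies per scan
    intensity: List[float]  # counts per second, same length as mz


def describe(run: List[Scan]) -> dict:
    """Summarize how irregular the sampling of an LC-MS run is."""
    dts = [b.time - a.time for a, b in zip(run, run[1:])]
    sizes = [len(s.mz) for s in run]
    return {
        "n_scans": len(run),
        "min_dt": min(dts), "max_dt": max(dts),  # non-equidistant in t
        "min_p": min(sizes), "max_p": max(sizes),  # varying number of m/z values
    }


# Made-up toy data: three scans with different time gaps and m/z positions.
run = [
    Scan(0.0, [400.1, 400.9, 523.3], [10.0, 4.0, 7.0]),
    Scan(1.3, [400.2, 523.4], [12.0, 6.0]),
    Scan(2.1, [399.8, 401.0, 450.5, 523.2], [3.0, 5.0, 2.0, 9.0]),
]
info = describe(run)
```

Storing each scan with its own m/z list avoids forcing the data onto a common grid before the resampling step.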

B. Tandem Mass Spectrometry

For tandem mass spectrometry (also referred to as LC-MS/MS) we attach another MS step to the end of the pipeline shown in Figure 1(A-F). The goal of the second MS step is to determine the amino acid sequence of selected peptides. Of interest are the peptides with the highest peaks in the first-order MS spectrum. The respective ions can be selected using the quadrupole of the mass spectrometer and fragmented using collision-induced dissociation. The fragments are detected by a TOF analyzer. Since peptide bonds (bonds between amino acids) are the weakest bonds within a peptide, they are the first to break during fragmentation. Thus, the intensity plot of the second-order MS spectrum (i. e., the peak list we obtain by the second MS step) exhibits peaks whose peptide fragments differ by the number of amino acids. The mass differences of the peaks allow for the determination of the amino acid sequence.

For protein identification, the determined amino acid sequence can be matched against amino acid sequences retrieved from online databases using analysis systems such as Mascot [22].
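How mass differences between neighboring peaks of a second-order spectrum translate into amino acids can be sketched as follows. This is a simplified illustration, not the procedure used by Mascot or by the described system: it assumes an already extracted, complete fragment ladder of a single ion series, uses standard monoisotopic residue masses, and the function name `read_sequence` and the tolerance are hypothetical. The reading direction of the resulting string depends on the ion series, which the sketch does not resolve.

```python
# Monoisotopic residue masses in Da (standard values; L and I are isobaric,
# so only L is listed).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_sequence(ladder_mz, charge=1, tol=0.01):
    """Map spacings between consecutive ladder peaks to amino acid residues.

    Spacings that match no residue within tol are reported as '?'.
    """
    peaks = sorted(ladder_mz)
    seq = []
    for lo, hi in zip(peaks, peaks[1:]):
        delta = (hi - lo) * charge  # mass difference in Da
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - delta))
        seq.append(best if abs(RESIDUE_MASS[best] - delta) <= tol else "?")
    return "".join(seq)


# Synthetic ladder: an arbitrary start value extended by G, A, and F.
ladder = [175.119]
for aa in "GAF":
    ladder.append(ladder[-1] + RESIDUE_MASS[aa])
sequence = read_sequence(ladder)
```

Real spectra contain noise peaks, missing fragments, and mixed ion series, which is why identification is delegated to dedicated analysis systems.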

C. Multi-dimensional Protein Identification

One problem of the described LC-MS approach is that the simultaneous digest of the protein mixture results in a highly complex collection of thousands of peptides. Thus, a single LC step may not be capable of separating all of them. This problem is typically solved by adding another LC step preceding the one described above. This method is referred to as multi-dimensional protein identification technology (MudPIT) or LC/LC-MS [30]. The diagram in Figure 1 illustrates this two-step LC using an ion exchange column and a reverse phase column. The ion exchange column is eluted using a stepwise increasing ammonium chloride gradient. Certain peptides can only be eluted using certain fractions of ammonium chloride concentration. Thus, we elute the peptides in several steps, while the fraction increases from one step to the next. For each fraction, we perform a reverse phase LC step (the peptides eluted from the ion exchange column are directly loaded into the reverse phase column) and an MS step as before.

We can also couple MudPIT with tandem mass spectrometry, leading to so-called LC/LC-MS/MS experiments. In terms of data representation, both MudPIT and tandem mass spectrometry add another dimension to our LC-MS data.

IV. HIERARCHICAL DATA REPRESENTATION AND VISUALIZATION

A. Resampling

Since LC-MS data is scattered in the m/z-dimension, a direct data visualization method would have to apply scattered data interpolation techniques or domain triangulation (e. g., Delaunay triangulation). Both approaches do (in their general form) not account for the non-scattered structure in the time-dimension. Moreover, scattered data interpolation leads to a loss of precision, while domain triangulation can be computationally expensive.

To allow for an efficient visualization with an acceptable amount of preprocessing, we decided to perform a resampling step. Since the time-dimension is already structured, we only apply a one-dimensional resampling in the m/z-direction. We generate a structured domain with non-equidistant samples in the t-direction and equidistant samples in the m/z-direction.

Resampling should be done such that all intensity peaks are preserved (with their maximum intensity). The only way to fulfill this condition is to use a sufficiently high resampling rate. If the resolution res of the mass spectrograph is known, it is best to resample with rate 1/(2 · res). If the resolution of the mass spectrograph is unknown, it can be estimated by determining the minimum distance between any two measurements.
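A minimal Python sketch of such a peak-preserving one-dimensional resampling is given below. It rests on several assumptions that are not stated in the text: the resolution is taken to be the smallest m/z distance between neighboring measurements, the grid spacing is chosen as half of that distance, each sample is assigned to its nearest grid cell, and empty cells are set to zero. The function names are hypothetical.

```python
import math


def estimate_resolution(scans_mz):
    """Smallest gap between neighboring m/z values within any scan."""
    gaps = [b - a for mz in scans_mz for a, b in zip(mz, mz[1:]) if b > a]
    return min(gaps)


def resample_max(mz, intensity, mz_min, mz_max, step):
    """Resample one scan onto an equidistant m/z grid.

    Each sample goes to its nearest grid cell; a cell keeps the maximum
    of the samples assigned to it, so no peak is lowered by averaging.
    """
    n = int(math.ceil((mz_max - mz_min) / step)) + 1
    grid = [0.0] * n  # assumption: cells without samples hold zero
    for m, v in zip(mz, intensity):
        k = min(n - 1, max(0, int(round((m - mz_min) / step))))
        grid[k] = max(grid[k], v)
    return grid


# Made-up scan with one narrow gap that determines the grid spacing.
mz = [400.0, 400.1, 400.5]
intensity = [5.0, 20.0, 7.0]
res = estimate_resolution([mz])
grid = resample_max(mz, intensity, 400.0, 400.5, res / 2.0)
```

With a spacing of half the minimum distance, two distinct measurements never share a cell, so every original intensity value survives at the price of many empty cells.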

Obviously, we generate a lot of redundant information. For displaying data visually on a computer screen, there is no need to go beyond the screen's resolution. We reduce the amount of data by merging adjacent data values. However, we still want to be able to retrieve the highest resolution data for display when zooming into regions of interest and when outputting data to peak quantification tools. We generate a hierarchical data representation that allows for multiresolution visualization and data access.

B. Hierarchical Data Representation

A hierarchical data representation scheme stores a data set at various resolutions. For downsampling, the sample positions of resolution $L_n$ are split into two groups: the ones that belong to the next coarser resolution $L_{n-1}$ (called even vertices) and the ones that belong to $L_n \setminus L_{n-1}$ (called odd vertices). The values at the even vertices are computed from the values at the sample positions $\in L_n$. When using wavelet-based techniques, the values at the odd vertices store the "difference" between the values at the even vertices and the values at the sample positions $\in L_n$. Thus, resolution $L_n$ can be reconstructed from resolution $L_{n-1}$ at any time by adding the differences. Only resolution $L_{n-1}$ and the difference set $L_n \setminus L_{n-1}$ need to be stored, which is the same amount of data storage as needed to store $L_n$. Thus, setting up a multiresolution hierarchy using a wavelet scheme does not require additional data storage.

The simplest and, thus, most widely used wavelets are Haar wavelets, see [27]. One-dimensional Haar wavelets compute the values $f^{n-1}_i \in L_{n-1}$ at the even vertices by averaging the values $f^n_i$ and $f^n_{i+1}$ at the respective sample point pairs $\in L_n$. The values $f^{n-1}_{i+1}$ at the odd vertices store the differences $f^n_i - f^{n-1}_i$.
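One level of this averaging scheme can be written down in a few lines of Python. The sketch uses the unnormalized form described in the text, assumes an even number of samples, and the function names are illustrative. The example shows that the coarse level lowers an isolated peak while the stored differences still allow exact reconstruction.

```python
def haar_level(values):
    """One Haar level: even vertices hold pair averages,
    odd vertices hold the difference f_i - average."""
    evens = [(a + b) / 2.0 for a, b in zip(values[0::2], values[1::2])]
    odds = [a - e for a, e in zip(values[0::2], evens)]
    return evens, odds


def haar_inverse(evens, odds):
    """Rebuild the finer level from averages and differences."""
    out = []
    for e, d in zip(evens, odds):
        out += [e + d, e - d]  # f_i = avg + d, f_{i+1} = avg - d
    return out


fine = [0.0, 10.0, 1.0, 1.0]  # made-up signal with one isolated peak
coarse, detail = haar_level(fine)
restored = haar_inverse(coarse, detail)
```

In this example the peak of height 10 appears with height 5 at the coarser level, which is the effect that motivates a maximum-preserving variant.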

We adopt the ideas from wavelet-based multiresolution hierarchy generation. However, averaging data values would cause intensity peaks to lose their maximum intensity or even to vanish. To maintain the maximum intensity of all peaks, we set the values at the even vertices to

$$f^{n-1}_i = \max(f^n_i, f^n_{i+1}) .$$

The values at the odd vertices store the differences

$$|f^{n-1}_{i+1}| = f^{n-1}_i - \min(f^n_i, f^n_{i+1}) .$$

The sign of $f^{n-1}_{i+1}$ is used to indicate whether $f^n_i$ or $f^n_{i+1}$ was the larger value. Figure 5 shows an example for our peak-preserving multiresolution hierarchy generation.
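The two formulas can be turned into a small decomposition and reconstruction routine. The Python sketch below is one possible reading: a non-negative stored difference is taken to mean that the first value of a pair was the larger one, and a negative difference that the second was. This sign convention, the function names, and the requirement of a power-of-two sample count are assumptions. Integer counts are used in the example because the reconstruction is exact for them, whereas floating-point values may suffer rounding.

```python
def peak_level(values):
    """One peak-preserving level: even vertices hold the pair maximum,
    odd vertices hold max - min, signed by which value was larger."""
    evens, odds = [], []
    for a, b in zip(values[0::2], values[1::2]):
        hi, lo = max(a, b), min(a, b)
        evens.append(hi)
        odds.append(hi - lo if a >= b else -(hi - lo))
    return evens, odds


def peak_inverse(evens, odds):
    """Rebuild the finer level from maxima and signed differences."""
    out = []
    for e, d in zip(evens, odds):
        out += [e, e - d] if d >= 0 else [e + d, e]
    return out


def build_hierarchy(values):
    """Return the coarsest level and the detail lists, coarsest first."""
    details = []
    while len(values) > 1:
        values, odds = peak_level(values)
        details.insert(0, odds)
    return values, details


def reconstruct(coarsest, details):
    values = coarsest
    for odds in details:
        values = peak_inverse(values, odds)
    return values


fine = [3, 9, 4, 1]  # made-up intensity counts
top, details = build_hierarchy(fine)
stored = len(top) + sum(len(d) for d in details)
```

For the four input values the hierarchy stores one maximum and three signed differences, that is, exactly as many numbers as the original level, and the coarsest value equals the global maximum.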

C. Interactive Visualization System

We have developed a three-dimensional visualization system for interactive exploration of LC-MS data. We render heightfields over the two-dimensional resampled domain with dimensions time and m/z-ratio. Figure 6 shows a resulting image when displaying intensity over the 2D domain. The system allows for visualization of the entire data set on a global scale and zoomed views into regions of interest. To fulfill the real-time requirements, the appropriate resolution is selected from the multiresolution data hierarchy described in the previous section. Figure 7 shows how the system switches from a low-resolution data visualization (Figure 7(a)) to a higher-resolution one during zooming (Figure 7(b)).

Fig. 5. Hierarchical data representation: Multiresolution scheme is peak-preserving ($f^{n-2}_0 = \max\{f^n_0, f^n_1, f^n_2, f^n_3\}$) and does not require additional storage space (stores only $[\{f^{n-2}_0\}; \{f^{n-2}_2\}; \{f^{n-1}_1, f^{n-1}_3\}]$).

Fig. 6. Visualization of LC-MS data: Intensities are shown over time and m/z-ratio dimensions.
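How a resolution level might be picked for the current view can be sketched as follows. The selection criterion is an assumption, since the text only states that an appropriate resolution is selected: the sketch chooses the coarsest level that still provides at least one sample per screen pixel along the m/z-direction, with each coarser level halving the number of samples. The function name and parameters are hypothetical.

```python
import math


def select_level(finest_level, samples_in_view, pixels):
    """Pick the coarsest hierarchy level with at least one sample per pixel.

    finest_level    -- index of the full-resolution level L_n
    samples_in_view -- number of finest-level samples inside the visible range
    pixels          -- screen pixels available along the same direction
    """
    if samples_in_view <= pixels:
        return finest_level
    # Each coarser level halves the sample count.
    drop = int(math.floor(math.log2(samples_in_view / pixels)))
    return max(0, finest_level - drop)


global_view = select_level(10, 1_000_000, 1000)  # whole data set visible
zoomed_view = select_level(10, 4000, 1000)       # small region of interest
```

Zooming shrinks the number of visible samples, so the selected level moves toward the finest one, which matches the switching behavior shown in Figure 7.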

Standard interaction mechanisms are provided: The color scheme for the visualization can be changed interactively. A one-dimensional transfer function is used to map intensity values to RGB color values such that the color of a peak indicates its intensity. The material properties and the shading method can be chosen by the user, and the system supports both parallel and perspective projection. To filter, i. e., select and display only the most significant peaks, a thresholding method is provided that culls all peaks beneath a chosen threshold. The system also allows for peak labeling.¹

Fig. 7. Multiresolution LC-MS data visualization: When zooming into regions of interest, resolution switches from coarse for global view (a) to fine for a more detailed view (b).
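The transfer function and the thresholding filter can be illustrated with a short Python sketch. The blue-to-red color ramp is a made-up example of a one-dimensional transfer function, not the color scheme of the described system, and culled values are simply set to zero; both choices and the function names are assumptions.

```python
def transfer(intensity, max_intensity):
    """Hypothetical 1D transfer function: blue for low, red for high values.

    Returns an (r, g, b) tuple with components in [0, 1].
    """
    if max_intensity <= 0:
        t = 0.0
    else:
        t = min(1.0, max(0.0, intensity / max_intensity))
    return (t, 0.0, 1.0 - t)


def cull(heights, threshold):
    """Keep values at or above the threshold; set all others to zero."""
    return [h if h >= threshold else 0.0 for h in heights]


colors = [transfer(v, 100.0) for v in (0.0, 50.0, 100.0)]
visible = cull([1.0, 50.0, 7.0], 5.0)
```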

V. VISUAL DATA EXPLORATION

A. Deisotoping

When investigating LC-MS data using our visual exploration tool, it becomes apparent that there are intensity peaks that seem to form groups. A frequent pattern shows three to five peaks in a row. They all belong to the same time step and exhibit an equal spacing between them. Often, the second intensity peak is the highest (depending on the peptide/protein), and the maximum intensities decrease monotonically for the subsequent peaks. Figure 8 shows a typical example for such an intensity peak pattern. Further investigation leads to the observation that the distance between successive intensity peaks of a group is about 1/z Da, where z is the charge of the peptide. Thus, the group of intensity peaks represents the (stable) isotopic distribution of one peptide. For example, carbon is known to primarily exist in the form of the stable isotope $^{12}$C, but about 1.1% of all carbon atoms are of the form $^{13}$C, where the numbers 12 and 13 denote the masses (in Dalton). Stable isotopes are also known for hydrogen ($^{2}$H), nitrogen ($^{15}$N), oxygen ($^{17}$O, $^{18}$O), and sulfur ($^{32}$S, $^{33}$S, $^{34}$S, $^{36}$S). The larger the peptide, the higher the probability that it contains one or more rarely occurring stable isotopes that result in a mass shift of one Dalton.
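The dependence of the isotope pattern on peptide size can be made concrete with a binomial model. The Python sketch below is a simplification that considers carbon only, uses the abundance of about 1.1% for the heavy carbon isotope quoted above, and treats all atoms as independent; the function names and the atom counts in the example are illustrative assumptions.

```python
from math import comb


def isotope_pattern(n_atoms, p_heavy=0.011, k_max=4):
    """Binomial probability of exactly k heavy atoms among n_atoms, k = 0..k_max."""
    return [
        comb(n_atoms, k) * p_heavy**k * (1.0 - p_heavy) ** (n_atoms - k)
        for k in range(k_max + 1)
    ]


def p_at_least_one(n_atoms, p_heavy=0.011):
    """Probability that a molecule contains one or more heavy atoms."""
    return 1.0 - (1.0 - p_heavy) ** n_atoms


small = p_at_least_one(50)      # peptide with 50 carbon atoms
large = p_at_least_one(100)     # peptide with 100 carbon atoms
pattern = isotope_pattern(100)  # relative heights of the first five peaks
```

Under this model a peptide with 100 carbon atoms is more likely to carry exactly one heavy carbon than none, which is consistent with the observation that the second peak of a group is often the highest.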

When trying to quantify the amount of a protein present in a given sample, one kind of protein may exist in various forms containing different isotopes. To determine the amount of protein, we should count all occurrences of the protein, not just the ones that contain exactly the same isotopes. Hence, when looking at the group of intensity peaks, the individual intensity peaks should not be considered separately. Instead, their intensities should be summed up to form one major intensity peak, which represents the number of all peptides of one kind. This step is called deisotoping.

Fig. 8. Visualization of LC-MS data exhibits characteristic patterns due to isotopes. Shown is a group of adjacent peaks forming a short chain.

¹ A movie showing features of the visual exploration tool accompanies the paper.

We perform the deisotoping step by scanning the MS spectrum for such patterns of intensity peaks. If we find three or more intensity peaks that show the described behavior and whose distances are 1 Da (with a small error tolerance), we classify them as candidates for deisotoping. We unite the intensity peaks at the position of their highest representative. The height of the united intensity peak is the sum of the intensities of the individual peaks. Figure 9 shows the effect of deisotoping. The MS spectrum shown in red color exhibits the characteristic pattern. In the MS spectrum shown in blue color, the group of peaks has been replaced by one major intensity peak. To assure highest precision, all these calculations are conducted on the original data, i. e., before any resampling step is applied.

Fig. 9. Deisotoping: Intensity peaks forming the characteristic isotope pattern (shown in red color) are united to form one major intensity peak each (shown in blue color).

Various other approaches exist for deisotoping, including sophisticated ones using database queries [15] to retrieve mass information for isotopes. These approaches are computationally expensive but can deal with overlapping patterns. Our approach is a simple and fast one. We have chosen this approach, as it can be incorporated in our interactive visual system, providing an immediate and intuitive understanding of the applied modifications. Thus, potential misclassifications would immediately be noticeable.
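A simple pattern scan of the kind described above can be sketched in Python as follows. The sketch is not the system's implementation: it works on an already extracted peak list of one scan, checks only the spacing criterion (the check of the intensity profile is omitted), proceeds greedily, and cannot handle overlapping patterns. The function name, the default tolerance, and the example values are assumptions; for peptides of charge z the spacing on the m/z-axis would be 1/z.

```python
def deisotope(peaks, spacing=1.0, tol=0.02, min_len=3):
    """Unite chains of equally spaced peaks into one peak each.

    peaks -- list of (mz, intensity) tuples of one scan, sorted by mz
    A chain of at least min_len peaks is replaced by a single peak at
    the m/z of its highest member, carrying the summed intensity.
    """
    out, i = [], 0
    while i < len(peaks):
        j = i
        # Extend the chain while the next peak follows at the expected spacing.
        while (j + 1 < len(peaks)
               and abs(peaks[j + 1][0] - peaks[j][0] - spacing) <= tol):
            j += 1
        chain = peaks[i:j + 1]
        if len(chain) >= min_len:
            top_mz = max(chain, key=lambda p: p[1])[0]
            out.append((top_mz, sum(p[1] for p in chain)))
        else:
            out.extend(chain)  # too short: keep the peaks unchanged
        i = j + 1
    return out


# Made-up peak list: one isotope group of four peaks and one isolated peak.
peaks = [(500.0, 40), (501.0, 100), (502.0, 60), (503.0, 20), (650.3, 30)]
united = deisotope(peaks)
```

Because input and output are plain peak lists, both can be drawn on top of each other, as in the red and blue spectra of Figure 9, to make misclassifications visible.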

B. Quantification

Since 2D gel electrophoresis is a widespread approach for proteome analysis, many existing quantification methods operate on 2D images. Since biologists are already familiar with these algorithms and have successfully applied them to their own data sets, we want our system to support their integration. Therefore, we want to export our data to 2D grey-scale images. We allow the user to pick any resolution up to the highest resolution supported by our hierarchical data representation for the image export. We compute the image from the underlying data representation and not from the visualization. The exported image could represent the entire data set or a selected region of interest.

Our visualization uses higher precision for storing the data values than is supported by the common image formats. Thus, during image export we are losing precision. Since most of the intensity values fall into the low intensity range, while only a few intensity peaks exhibit large values, we are using a logarithmic scale to map the high-precision values to the lower-precision ones. Using a logarithmic scale, we can still distinguish between the different low-intensity values. When using a linear scale, many low-intensity values would be mapped to the same exported value. Nevertheless, we support image export using both logarithmic and linear scales.
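The difference between the two mappings can be illustrated with a short Python sketch that quantizes intensities to 8-bit gray values. The use of log(1 + v) to handle zero intensities, the number of gray levels, and the function name are assumptions made for illustration.

```python
import math


def to_gray(values, levels=256, log_scale=True):
    """Map non-negative intensities to integer gray values 0..levels-1."""
    vmax = max(values)
    if vmax <= 0:
        return [0 for _ in values]
    if log_scale:
        scale = math.log1p(vmax)  # log(1 + v) keeps zero at gray value 0
        return [int(round((levels - 1) * math.log1p(v) / scale)) for v in values]
    return [int(round((levels - 1) * v / vmax)) for v in values]


# Made-up intensities: two low values next to one very high peak.
values = [0.0, 10.0, 20.0, 100000.0]
linear = to_gray(values, log_scale=False)
logarithmic = to_gray(values, log_scale=True)
```

In the linear mapping the two low intensities collapse onto the same gray value, whereas the logarithmic mapping keeps them apart.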

C. Visual Exploration of Tandem Mass Spectrometry Data

When using tandem mass spectrometry, a second MS spectrum is determined for some selected peptides, which are the peptides that exhibit high intensity peaks in the first MS spectrum. The visual exploration of tandem mass spectrometry data is based on our general framework visualizing intensities over time and m/z-ratios of the first MS step. In addition, we can switch on labeling of intensity peaks. We label those intensity peaks that correspond to peptides for which a second MS step has been executed. When clicking a label, a new window appears, which displays the second-order MS spectrum. Figure 10 illustrates this interactive exploration method.²

The second-order MS spectrum can be used to support the identification of the peptides and, thus, the proteins. The intensity peaks in the second-order MS spectrum differ by the masses of the amino acids of the peptides' sequences. By computing the mass differences of the peaks, the amino acid sequence can be determined. The determined amino acid sequence can then be matched against amino acid sequences retrieved from online databases (using analysis systems such as Mascot [22]) to identify the peptide.

The labels in Figure 10 show identity tags. After the identification step, they can be replaced by peptide names or other properties obtained from the database. Our system also provides the option to display additional available information about intensity peaks. Figure 11 shows a table with all properties next to the 3D visualization system. When selecting an intensity peak from the first- or the second-order MS spectrum by mouse click, the corresponding rows in the table are highlighted. This interaction mechanism allows the user to immediately obtain exact values for the measured quantities.

D. Registration

In order to get insight into multiple LC-MS experiments, we would like to display their visualization over the same domain and to allow for quantitative comparisons as well as interpolations and animations. The multiple experiments may have been taken under varying conditions (for example, test vs. control to compute differential protein expressions) or may be different fractions of a MudPIT experiment.

We have to solve several problems in this regard, which are imposed by the structure of LC-MS data. Not only do we get different sampling locations when executing two different experiments; the results also vary when the same experiment is executed multiple times. LC runs are not reproducible with sufficient precision. We observed that the variation may be quite high, as intensity peaks for one peptide exhibit significant shifts from one output to another. The intensity peaks in Figure 12 illustrate the shift: test data (colored in red) and control data (colored in green) are visualized as two heightfields over the same domain. The positions of the most dominant peaks in the two heightfields are supposed to coincide, but they exhibit a severe shift.

² A movie showing interactive exploration of tandem mass spectrometry data accompanies the paper.

Fig. 10. Visual exploration of tandem mass spectrometry data. Intensity peaks of peptides for which a second-order MS spectrum is available are labeled. Clicking on the labels opens a new window showing the second-order MS spectrum.

Fig. 11. Tables next to the 3D visualization provide exact quantities and properties of interactively selected peaks.

Thus, we need to register (align) the multiple experiments if we want a meaningful visual or quantitative comparison of multiple data sets. The registration step warps the domain of the test data set onto the domain of the control data set.

The warping transformation is computed using a landmark-based approach. The positions of the most significant intensity peaks make for good landmarks. While landmark-based approaches are known to produce good results, their drawback is the necessity of user interaction. Typically, substantial expertise is required to manually determine matching landmarks in two data sets. Moreover, for large data sets, this user interaction may become a rather tedious task. For our system, however, we can make use of the tandem mass spectrometry analysis described in the previous section. The second MS step allows us to determine which peptides are associated with which intensity peaks. When performing this peptide identification step for both data sets, we automatically obtain matching landmarks. The landmarks are the positions of the intensity peaks to which the same identifications have been assigned.

Fig. 12. Visualization of two LC-MS data sets. The samples have been taken under different conditions. Left: Locations of the maximum intensity peaks (colored in red and green) are supposed to coincide but clearly exhibit shifts. Right: Registration by landmark-based domain warping corrects the local shift.
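The automatic landmark assignment amounts to intersecting the sets of identified peptides. The following Python sketch illustrates this under the assumption that each data set has already been reduced to a mapping from a peptide identification to its (time, m/z) peak position. The function name `match_landmarks` and the example identifiers are hypothetical.

```python
def match_landmarks(test_peaks, control_peaks):
    """Pair peak positions of peptides identified in both data sets.

    test_peaks, control_peaks: dicts mapping a peptide identification to the
    (time, m/z) position of its intensity peak in the respective data set.
    Returns two lists of positions in matching order.
    """
    shared = sorted(set(test_peaks) & set(control_peaks))
    test_landmarks = [test_peaks[name] for name in shared]
    control_landmarks = [control_peaks[name] for name in shared]
    return test_landmarks, control_landmarks


# Hypothetical identifications; only peptides found in both runs are used.
test_run = {"pepA": (12.1, 455.3), "pepB": (30.4, 812.9), "pepC": (51.0, 640.2)}
control_run = {"pepA": (13.0, 455.3), "pepC": (49.2, 640.2)}
print(match_landmarks(test_run, control_run))
```

Peptides identified in only one of the two runs are ignored, since they provide no correspondence.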

Once the landmarks are given, we can warp the domain of a test data set onto the domain of the control data set. We cannot use a rigid transformation, as the shifts may vary locally. Thus, we use a localized method: In a first step, we partition the domain of the control data set using a 2D Delaunay triangulation, where the landmarks (and the corners of the bounding box) are used as vertices. The blue lines on the left-hand side of Figure 13 illustrate the domain partitioning step. The same partition is induced onto the test data set using the matching landmarks as vertices (right-hand side of Figure 13). Based on the partition, we can warp every point p from the domain of the test data set to the domain of the control data set. We determine in which triangle point p lies and compute its barycentric coordinates within that triangle. The same barycentric coordinates are used within the matching triangle of the control data set to compute the warped position of p, see Figure 13.
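The warping of a single point can be sketched in a few lines of Python. This is an illustrative sketch, not our implementation. It assumes that the triangulation of the control landmarks has been computed beforehand (for example, by any Delaunay library) and is passed in as index triples. The names `barycentric` and `warp_point` are hypothetical.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c).

    Returns None for a degenerate (zero-area) triangle.
    """
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        return None
    l1 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    l2 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return l1, l2, 1.0 - l1 - l2


def warp_point(p, test_pts, control_pts, triangles, eps=1e-9):
    """Warp p from the test domain to the control domain.

    test_pts, control_pts: matching landmark positions (same order).
    triangles: index triples (i, j, k) of the triangulation that was computed
    on the control landmarks and is induced onto the test landmarks.
    Returns the warped position, or None if p lies in no triangle.
    """
    for i, j, k in triangles:
        coords = barycentric(p, test_pts[i], test_pts[j], test_pts[k])
        if coords is None:
            continue
        l1, l2, l3 = coords
        if min(l1, l2, l3) >= -eps:  # p lies inside or on the triangle
            a, b, c = control_pts[i], control_pts[j], control_pts[k]
            return (l1 * a[0] + l2 * b[0] + l3 * c[0],
                    l1 * a[1] + l2 * b[1] + l3 * c[1])
    return None


# One triangle whose second vertex is shifted along the first axis.
test_landmarks = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
control_landmarks = [(0.0, 0.0), (8.0, 0.0), (0.0, 4.0)]
print(warp_point((1.0, 1.0), test_landmarks, control_landmarks, [(0, 1, 2)]))
```

The linear search over all triangles keeps the sketch short. For large numbers of landmarks, a point-location structure would be the natural replacement.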

Fig. 13. Domain warping for registration of two LC-MS experiments: Peptides with intensity peak locations p1, p2, and p3 are identified, matched, and used as landmarks. Delaunay triangulation is used to decompose the 2D domain. Point p is warped to its new position by using barycentric coordinates.

The right-hand side of Figure 12 shows the result of our registration step applied to the two experiments of the example shown on the left-hand side of the figure. The matching intensity peaks have been moved to the same position.

E. Visual Differential Protein Expression Analysis

Differential protein expression analysis can be used to determine how the occurrences of proteins change when altering the conditions under which the samples are taken. Typically, one experimental setup serves as control data, while the other setups provide the test data. For example, the test and control data may have been taken from healthy and diseased organisms, respectively. The difference in expression can lead to hypotheses about which proteins are responsible for the occurrence of a disease. Another question is whether and how protein expression changes for different cell populations. To answer such questions, LC-MS data from two or more experiments need to be compared. We provide means to visually explore the quantitative differences in the two data sets. In addition to exploring each data set on its own, we use an integrated view of both experiments by visualizing the differential expression.

As mentioned in the previous section, the two experiments need to be registered. After registration, differential expression can be computed by subtracting the intensity values for the test data set from the intensity values for the control data set. The resulting differential expression is visualized as a heightfield over the 2D domain, i.e., over the time and m/z dimensions. Figure 14 shows an example of differential expression visualization. The visual exploration of the resulting data is supported just like for data from a single experiment.
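As a minimal Python sketch of the subtraction step, assume that both registered data sets have been sampled onto the same regular grid over time and m/z. This simplifies the scattered data our system handles. The function name `differential_expression` is hypothetical.

```python
def differential_expression(control, test):
    """Cell-wise difference (control minus test) of two registered grids.

    control, test: lists of rows of intensity values over the same
    (time, m/z) grid. Both grids must have identical shapes.
    """
    same_rows = len(control) == len(test)
    if not same_rows or any(len(c) != len(t) for c, t in zip(control, test)):
        raise ValueError("grids must have the same shape")
    return [[c - t for c, t in zip(c_row, t_row)]
            for c_row, t_row in zip(control, test)]


# Hypothetical 2x2 intensity grids after registration.
control_grid = [[5.0, 2.0], [0.0, 1.0]]
test_grid = [[3.0, 2.0], [4.0, 0.0]]
print(differential_expression(control_grid, test_grid))
```

The sign of each cell indicates the direction of the change, and its magnitude gives the height of the resulting heightfield.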

F. Visual Exploration of MudPIT Data

To visualize MudPIT data, we generate a heightfield for each fraction. We provide a slider for the user to switch between the heightfield renderings of the individual fractions. Figure 15 shows the visualization of three fractions exhibiting changes in intensity.

Fig. 14. Visual differential protein analysis for a test vs. control experiment. Up-regulation is shown in red, down-regulation in green. The peaks’ values display the quantitative difference of the two experiments over the dimensions time and m/z-ratio.

An integrated view of all or selected fractions of a MudPIT data set may be of interest. However, results obtained by the MudPIT process also suffer from not being (precisely) reproducible. Thus, the locations of intensity peaks representing peptides present in subsequent fractions do not coincide. When trying to integrate MudPIT data from various fractions into one setup, we need to perform a registration step again. We proceed as described above. Figure 16 shows the integrated visualization of various fractions after registration. Each fraction is assigned one color. Thus, colors indicate in which fraction each intensity peak is highest.

Fig. 16. Integrated visualization of various fractions from MudPIT: Colors indicate in which fraction each peak has maximum intensity. The color scheme assigns colors cyan, white, magenta, yellow, red, green, and blue to fractions in the order of increasing ammonium chloride concentration.
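The color assignment of the integrated view can be sketched as a per-location maximum over the registered fractions. The following Python sketch assumes that all fractions have been sampled onto one common grid. The function name `dominant_fraction_colors` is hypothetical, and the color order follows the caption of Figure 16.

```python
# Colors in the order of increasing ammonium chloride concentration.
FRACTION_COLORS = ["cyan", "white", "magenta", "yellow", "red", "green", "blue"]


def dominant_fraction_colors(fractions):
    """For every grid cell, return the color of the fraction with the
    highest intensity at that cell.

    fractions: list of equally shaped intensity grids (lists of rows),
    ordered by increasing salt concentration. Ties go to the first fraction.
    """
    rows, cols = len(fractions[0]), len(fractions[0][0])
    colors = []
    for r in range(rows):
        color_row = []
        for c in range(cols):
            best = max(range(len(fractions)), key=lambda f: fractions[f][r][c])
            # Reuse colors cyclically if there are more fractions than colors.
            color_row.append(FRACTION_COLORS[best % len(FRACTION_COLORS)])
        colors.append(color_row)
    return colors


# Two hypothetical fractions over a 1x2 grid.
print(dominant_fraction_colors([[[1.0, 5.0]], [[3.0, 2.0]]]))
```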

Using the registered fractions, we can also interpolate between the intensities of subsequent fractions. Instead of using a slider to switch between visualizations of different fractions as in Figure 15, we generate an animation over the fraction dimension, where intensity values are animated over fractions in the order of increasing ammonium chloride concentration.³ For smooth transitions we use linear interpolation of the heightfields. The animation allows for an even better perception of intensity changes with changing fractions.

³ A movie of an animation accompanies the paper.

Fig. 15. Visualization of three fractions obtained by using MudPIT. A slider allows the user to interactively switch between fractions.
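The linear interpolation between subsequent fractions can be sketched as follows in Python. The sketch assumes registered fractions sampled onto a common grid. The function names and the number of intermediate steps are hypothetical choices.

```python
def interpolate_heightfields(h0, h1, t):
    """Linear blend (1 - t) * h0 + t * h1 of two equally shaped heightfields."""
    return [[(1.0 - t) * a + t * b for a, b in zip(row0, row1)]
            for row0, row1 in zip(h0, h1)]


def animation_frames(fractions, steps_per_transition):
    """Generate frames that move through all fractions in order.

    Each pair of subsequent fractions contributes steps_per_transition frames,
    and the last fraction is appended as the final frame.
    """
    frames = []
    for h0, h1 in zip(fractions, fractions[1:]):
        for step in range(steps_per_transition):
            t = step / steps_per_transition
            frames.append(interpolate_heightfields(h0, h1, t))
    frames.append([row[:] for row in fractions[-1]])
    return frames


# Halfway between two hypothetical 1x2 heightfields.
print(interpolate_heightfields([[0.0, 10.0]], [[10.0, 20.0]], 0.5))
```

With k fractions and s steps per transition, this produces (k - 1) * s + 1 frames.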

VI. RESULTS AND EVALUATION

For the explanations of our visual exploration system in the previous sections, we have chiefly applied our methods to data acquired from the human cell line SiHa, which is used to model cervical cancer [24]. The cell line was grown under normal conditions and showed no perturbation. The MudPIT experiment was done using liquid chromatography (LC) with a reverse phase (RP) column and mass spectrometry (MS) with electrospray ionization (ESI). Eleven fractions with 0 mMol, 20 mMol, 40 mMol, 60 mMol, 80 mMol, 100 mMol, 150 mMol, 200 mMol, 300 mMol, 500 mMol, and 900 mMol ammonium chloride were used. The retention times during liquid chromatography were in the range from 0 to 85 minutes. The m/z-ratios measured by mass spectrometry were in the range of 300 to 1500. The measured intensities were as high as 10^9 counts per second.

Moreover, we applied our tools to our own data in order to identify and quantify proteins and their differential expression in Bacillus subtilis. The data acquired consist of seven LC-MS/MS cycles. The first cycle was a tandem mass spectrometry experiment as described in Section III, which is also referred to as “flow through”. Cycles two to seven describe a MudPIT (LC/LC-MS/MS) experiment (cf. Section III). Each cycle corresponds to one fraction. For the individual fractions we used 2.5 mMol, 7.5 mMol, 12.5 mMol, 17.5 mMol, 25 mMol, and 37.5 mMol ammonium chloride.

Figure 17 shows the 3D visualization of the “flow through”. Our first observation from the visualization is that the biological experiment was successful in the sense that no failures occurred during data acquisition. In a successful experiment, a large number of intensity peaks spread over the 2D domain without forming clusters or patterns. An intuitive and immediate check for errors is particularly important, as the data acquisition methods are not yet routinely conducted due to their novelty.

Fig. 17. 3D visualization for global understanding and error check.

For comparison, we show two examples where experiments failed. In Figure 18(a), we observe that the intensity peaks form a vertical band. Apparently, the LC step did not work properly, such that peptides have not been separated appropriately but agglomerate in this narrow band. In Figure 18(b), the sample is “empty”, i.e., only very few values have been detected. This may be a detection problem during the MS step. Thus, our visualization method not only provides an intuitive and immediate validation check but can even hint at what problem may have caused the experiment to fail.

Next, we look at our data set in more detail. We detect a very high intensity peak and turn on the labeling to obtain more information about the peak. This visual exploration step is shown in Figure 11. The high intensity peak is labeled with the identity tag “1109”. The corresponding spectrum of this MS scan is plotted in Figure 3. When selecting this peak we can retrieve further information. The quantities of the selected intensity peak are highlighted in the given table, see Figure 19. In the table we list the identity tag of the scan, the number of intensity peaks in the scan, the order of the MS scan, and the maximum intensity, which is the intensity of the selected peak. Moreover, the subsequent rows give the information of the second MS step. Five intensity peaks of scan 1109 have undergone further investigation (scans 1110 to 1114). In the last column the value of their m/z-ratio is given, which allows us to detect the second-order MS spectrum that corresponds to the selected intensity peak.

Fig. 18. Two failed experiments (bird’s eye view): (a) Measured intensities agglomerate in a narrow vertical band, possibly caused by an LC error. (b) Few peptides have been detected, possibly caused by an MS error.

Fig. 19. Quantitative information about scan 1109 is shown and highlighted when the respective intensity peak is selected.

We can use the second-order MS spectrum to determine the amino acid sequence of a peptide and make a database query to identify the corresponding peptide.

The feedback from the biologists in our group was that our tool provides intuitive visual exploration mechanisms, which allow them to quickly obtain precise quantitative and qualitative information from LC-MS runs. Our visual exploration tool helps to significantly accelerate their workflow and makes their analysis steps more intuitive, transparent, and comprehensible.

Of great value to the experimenter is the possibility to immediately detect problems in the experimental protocol indicated by abnormal 3D charts.

A further impact of our work on the proteomics community is that our visual analysis system allows for differential expression analysis for the entire LC-MS data set, which previously had not been solved adequately. The consideration of deisotoping for the graphical and quantitative representation of LC-MS data gets us closer to a reliable quantitation of whole monoisotopic peak sets of the same peptide within an LC-MS proteome data set.

VII. CONCLUSIONS AND FUTURE WORK

We have presented a system for visual exploration of data from gel-free protein experiments. We provide interactive 3D visualizations of LC-MS data, tandem mass spectrometry data, and MudPIT data. Interactivity is achieved by using a hierarchical data representation scheme. Care has been taken to ensure that the biological data (in terms of maximum intensities and sample locations) are preserved with high precision.

We provide methods for deisotoping and registration and support protein quantification and identification. These algorithms can be used for computing and visualizing differential protein expression. We have evaluated our approach by illustrating how protein analysis becomes much more targeted toward the retrieval of significant data when interacting with our visualization-based exploration tool.

In our current implementation we have not yet integrated the database queries and the intensity peak quantification step. Integrating the database search results would eliminate the manual database queries. Integrating the intensity peak quantification step would also improve the accuracy of our system. By exporting the intensities to a 2D image, we not only lose precision in the intensity values, we also need to resample the time steps to obtain equidistantly sampled values, which fit the regular pattern of a pixel image. This resampling step may introduce some inaccuracies.

We also plan on coupling our deisotoping step with a deconvolution step. Currently, we are only detecting patterns with distances of 1 Da, i.e., with charge z = 1. During the MS ionization step, ions with higher charge may be generated such that the distances in the patterns would be proportionally smaller.
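The relation between charge and pattern distance suggests how such an extension could look. The following Python sketch is a hypothetical illustration, not part of our current system. It assumes an isotope spacing of about 1.00335 Da (the mass difference between 13C and 12C) divided by the charge z, and the function name, tolerance, and maximum charge are chosen for illustration.

```python
ISOTOPE_SPACING_DA = 1.00335  # approximate mass difference 13C - 12C


def estimate_charge(mz_peaks, max_charge=4, tolerance=0.01):
    """Estimate the charge z of an isotope pattern from its peak distances.

    mz_peaks: m/z positions of the peaks of one candidate isotope pattern.
    Returns the smallest z for which all consecutive distances match
    ISOTOPE_SPACING_DA / z within the tolerance, or None if no z fits.
    """
    peaks = sorted(mz_peaks)
    if len(peaks) < 2:
        return None
    gaps = [upper - lower for lower, upper in zip(peaks, peaks[1:])]
    for z in range(1, max_charge + 1):
        expected = ISOTOPE_SPACING_DA / z
        if all(abs(gap - expected) <= tolerance for gap in gaps):
            return z
    return None


# Hypothetical pattern with peak distances of about 0.5 m/z units.
print(estimate_charge([500.0, 500.5017, 501.0034]))
```

The tolerance must stay well below the difference between the spacings of neighboring charge states, which shrinks quickly as z grows.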

REFERENCES

[1] F. O. Andersson, R. Kaiser, and S. P. Jakobsson. Data preprocessing by wavelets and genetic algorithms for enhanced multivariate analysis of lc peptide mapping. Journal of Pharmaceutical and Biomedical Analysis, 34:531–541, 2004.

[2] J. Bernhardt, K. Buttner, C. Scharf, and M. Hecker. Dual channel imaging of two-dimensional electropherograms in bacillus subtilis. Electrophoresis, 20(11):2225–2240, 1999.

[3] J. Bernhardt, J. Weibezahn, C. Scharf, and M. Hecker. Bacillus subtilis during feast and famine: visualization of the overall regulation of protein synthesis during glucose starvation by proteome analysis. Genome Res., 13(2):224–237, 2003.

[4] P. Cignoni, C. Montani, E. Puppo, and R. Scopigno. Multiresolution modeling and visualization of volume data. IEEE Transactions on Visualization and Computer Graphics, 3(4):352–369, 1997.

[5] P. Cignoni, E. Puppo, and R. Scopigno. Representation and visualization of terrain surfaces at variable resolution. The Visual Computer, 13(5):199–217, 1997.

[6] D. Cohen-Or and Y. Levanoni. Temporal continuity of levels of detail in delaunay triangulated terrain. In IEEE Visualization 1996, pages 37–42. IEEE Computer Society Press, 1996.

[7] J. de Corral and H. Pfister. Hardware-accelerated 3d visualization of mass spectrometry data. In C. Silva, E. Groller, and H. Rushmeier, editors, IEEE Visualization 2005, pages 439–446. IEEE Computer Society Press, 2005.

[8] M. Duchaineau, M. Wolinski, D. E. Sigeti, M. Miller, C. Aldrich, and M. B. Mineev-Weinstein. Roaming terrain: Real-time optimally adapting meshes. In IEEE Visualization 1997, pages 81–88. IEEE Computer Society Press, 1997.

[9] J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrospray ionization for mass spectrometry of large biomolecules. Science, 246(64), 1989.

[10] D. R. Gilbert, M. Schroeder, and J. van Helden. Interactive visualization and exploration of relationships between biological objects. Trends Biotechnol., 18(12):487–494, 2000.

[11] R. Grosso, C. Lurig, and T. Ertl. The multilevel finite element method for adaptive mesh optimization and visualization of volume data. In R. Yagel and H. Hagen, editors, Proceedings of IEEE Conference on Visualization 1997, pages 135–142. IEEE, IEEE Computer Society Press, 1997.

[12] H. Hoppe. Smooth view-dependent level-of-detail control and its application to terrain rendering. In IEEE Visualization 1998, pages 35–42. IEEE Computer Society Press, 1998.

[13] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding 10 000 daltons. Anal Chem, 60:2299–2301, 1988.

[14] X.-J. Li, P. G. A. Pedrioli, J. E. J, D. Martin, E. C. Yi, H. Lee, and R. Aebersold. A tool to visualize and evaluate data obtained by liquid chromatography/electrospray ionization/mass spectrometry. Anal Chem, 76:3856–3860, 2004.

[15] X.-J. Li, H. Zhang, J. R. Ranish, and R. Aebersold. Automated statistical analysis of protein abundance ratios from data generated by stable isotope dilution and tandem mass spectrometry. Anal Chem, 75:6648–6657, 2003.

[16] P. Lindstrom, D. Koller, W. Ribarsky, L. Hodges, N. Faust, and G. Turner. Real-time continuous level of detail rendering of height fields. In SIGGRAPH 1996, pages 109–118. ACM SIGGRAPH, 1996.

[17] L. Linsen, J. Locherbach, M. Berth, J. Bernhardt, and D. Becher. Differential protein expression analysis via liquid-chromatography/mass-spectrometry data visualization. In C. Silva, E. Groller, and H. Rushmeier, editors, IEEE Visualization 2005, pages 447–454. IEEE Computer Society Press, 2005.

[18] L. Linsen, V. Pascucci, M. A. Duchaineau, B. Hamann, and K. I. Joy. Wavelet-based multiresolution with ⁿ√2 subdivision. Journal on Computing, 71(1+2), 2004.

[19] F. Losasso and H. Hoppe. Geometry clipmaps: Terrain rendering using nested regular grids. ACM Transaction on Graphics, 24(3):769–776, 2004.

[20] S. Luhn, M. Berth, M. Hecker, and J. Bernhardt. Using standard positions and image fusion to create proteome maps from collections of two-dimensional gel electrophoresis images. Proteomics, 3(7):1117–1127, 2003.

[21] D. M. Maynard, J. Masuda, X. Yang, J. A. Kowalak, and S. P. Markey. Characterizing complex peptide mixtures using a multi-dimensional liquid chromatography-mass spectrometry system: Saccharomyces cerevisiae as a model system. Journal of Chromatography B, 810(1):69–76, 2004.

[22] D. N. Perkins, D. J. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.

[23] D. Pinskiy, E. Brugger, H. R. Childs, and B. Hamann. An octree-based multiresolution approach supporting interactive rendering of very large volume data sets. In H. Arabnia, R. Erbacher, X. He, C. Knight, B. Kovalerchuk, M. Lee, Y. Mun, M. Sarfraz, J. Schwing, and H. Tabrizi, editors, Proceedings of the 2001 International Conference on Imaging Science, Systems, and Technology (CISST 2001), Volume 1, pages 16–22. Computer Science Research, Education, and Applications Press (CSREA), Athens, Georgia, 2001.

[24] J. T. Prince, M. W. Carlson, R. Wang, P. Lu, and E. M. Marcotte. The need for a public proteomics repository (commentary). Nature Biotechnology, 22:471–472, 2004.

[25] N. Shah, V. Filkov, B. Hamann, and K. I. Joy. GeneBox: interactive visualization of microarray data sets. In F. Valafar and H. Valafar, editors, International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS ’03), pages 10–16. Computer Science Research, Education, and Applications Press (CSREA), Athens, Georgia, 2003.

[26] Y. Shinagawa and T. L. Kunii. Unconstrained automatic image matching using multiresolutional critical-point filters. IEEE Trans. Pattern Anal. Mach. Intell., 20(9):994–1010, 1998.

[27] E. J. Stollnitz, T. D. DeRose, and D. H. Salesin. Wavelets for Computer Graphics: Theory and Applications. The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling, Brian A. Barsky (series editor), Morgan Kaufmann Publishers, San Francisco, U.S.A., 1996.

[28] C. Tang, L. Zhang, and A. Zhang. Interactive visualization and analysis for gene expression data. In Hawaii International Conference on System Sciences, 2002.

[29] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422:193–197, 2003.

[30] M. P. Washburn, D. Wolters, and J. R. Yates III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 19:242–247, 2001.

[31] M. R. Wilkins, C. Pasquali, R. D. Appel, K. Ou, O. Golaz, J. C. Sanchez, J. X. Yan, A. A. Gooley, G. Hughes, I. Humphery-Smith, K. L. Williams, and D. F. Hochstrasser. From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology, 14:61–65, 1996.

Lars Linsen is an assistant professor of computer science at the Department of Mathematics and Computer Science of the Ernst-Moritz-Arndt-Universitat Greifswald, Germany. He received a B.S. and an M.S. (Diplom) in computer science from the Universitat Karlsruhe (TH), Germany, as well as a Ph.D. in 2001. He was awarded the 2002 “Preis des Fordervereins des Forschungszentrum Informatik” for an outstanding dissertation. He spent three years as a post-doctoral researcher and lecturer at the Institute for Data Analysis and Visualization (IDAV) and the Department of Computer Science of the University of California, Davis, U.S.A. He joined the Ernst-Moritz-Arndt-Universitat Greifswald in October 2004. His research interests are in the areas of scientific and information visualization, multiresolution methods, computer graphics, and geometric modeling. He is a member of ACM and ACM SIGGRAPH.

Julia Locherbach is a Ph.D. student at the Department of Mathematics and Computer Science of the Ernst-Moritz-Arndt-Universitat Greifswald, Germany. She received a B.S. and an M.S. (Diplom) in biomathematics from the Ernst-Moritz-Arndt-Universitat. She spent one year at the Heriot-Watt University Edinburgh, Scotland. She also works as a part-time software developer for Decodon GmbH, Greifswald, Germany.

Matthias Berth received an M.S. (Diplom) in 1995 as well as a Ph.D. in Mathematics in 1999 from Ernst-Moritz-Arndt-Universitat Greifswald, Germany. He is co-founder and CTO of DECODON GmbH where he is responsible for research and development.

Dorte Becher is a postdoctoral research fellow at the Institute for Microbiology of the Ernst-Moritz-Arndt-Universitat Greifswald, Germany. From 1987 to 1992 she studied chemistry and received her M.S. (Diplom) in chemistry in 1992 and her Ph.D. in microbiology in 1998. During this time she worked as a guest student at the University of Aberdeen for some months. Since 1999 she has been one of the persons responsible for mass spectrometry in the Institute for Microbiology. Her research interests are in the areas of mass spectrometry in proteomics, in particular protein identification, investigation of post-translational modifications, methods of gel-free protein identification/quantitation, and characterization of protein complexes.

Jorg Bernhardt is a postdoctoral research fellow at the Institute for Microbiology of the Ernst-Moritz-Arndt-Universitat Greifswald, Germany. In 1994 he received his M.S. (Diplom) in Microbial Physiology, and his Ph.D. in 2000. During this time (1999) he was awarded the poster prize of the Japanese Electrophoresis Society in Tokyo and the Research Award of the University of Greifswald for his dissertation on the proteome of Bacillus subtilis. In December 2000 he became cofounder and CSO of DECODON, a company engaged in the development of software tools for functional genomics. His research interests are in the areas of functional genomics, proteomics, bacterial physiology, and image analysis.
