novel analytical strategy for the 2d-dige (us 20100046813-a1)

Novel strategy for the 2D-DIGE gel analysis algorithms

Summary

2D-DIGE is an important technology for top-down proteomic analysis strategies. Image analysis of 2D gels is mostly performed by spot detection followed by image segmentation and density integration. In principle, this approach has some major limitations. First, the quantification of spot volumes often suffers from heavily overlapping spots, which can produce extremely unreliable and inaccurate density estimates. Second, low-abundance protein spots tend to be buried within the sea of proteins, making detection of biologically important but low-copy-number proteins (such as transcription factors) virtually impossible. As mass spectrometer performance has improved, detection of proteins in these classes should be technically possible. Although MS sensitivity may be sufficient for detecting and quantifying such a protein, the protein first needs to be detected in the gel image, and, as mentioned above, that is unlikely with current approaches.

I have developed a novel strategy to address these issues by introducing a few technological breakthroughs. The first is an inter-channel normalization protocol for the Cy2/3/5 channels that produces “differential images” among channels. This strategy brings some advantages over other approaches: (1) elimination of background intensity by channel-to-channel image subtraction after the normalization protocol, and (2) detection of “hidden” spots that are buried within high-abundance spots. In addition to the inter-channel normalization strategy, there is another change in the algorithms. Instead of spot detection followed by image segmentation, detected spots are modeled with a physically reasonable mathematical model, a 10-parameter skewed 2D-Gaussian spot density function. This function simulates the tailing in both the IEF and PAGE directions often observed in real gel images and gives exact quantification results from mathematical integration of the function itself, even with heavily overlapped spots. With these improvements, the accuracy of spot quantification and the detection of low-abundance proteins are significantly improved.

Inter-channel normalization

The images are subjected to “normalization” to the maximum intensity observed in the gel, or to a given value when all images have relatively low intensities. These processes are linear transformations of the images that match the baseline and overall image intensities by linear regression between images (minimizing the MSE). Note: this is not normalization between low and high points but minimization of overall intensity differences, and it assumes that no more than 50% of the pixels change in intensity between the images. An additional step examines the baseline (background) intensity level and corrects the overall normalization using this baseline information.

Differential images

The normalized images are subjected to image subtraction to generate inter-channel “differential” images. There is nothing special about the process; it simply produces a “subtraction image”. This image contains information about “changing” spots only and, in principle, zero background intensity. Care needs to be taken when interpreting extremely high-abundance proteins such as some serum proteins (immunoglobulins, albumins) or RuBisCo (plant) proteins, because the physical electrophoretic processes (both isoelectric focusing and PAGE) can be affected by too much protein concentrated in a small region, so the intensity information can be unreliable (non-smooth intensity transitions, rough intensity distributions caused by physical disturbance during migration, or a general intensity reduction due to self-quenching of the dyes during fluorescent detection). Also, for some small proteins in the lower area of the gel, a shift in mobility among the channels may occasionally be observed, possibly due to physicochemical differences among the dyes: although they are quite similar in structure, their overall molecular lengths differ, which could affect mobility in the PAGE process.

Inter-channel normalization of DIGE images: the original images differ in appearance between channels, but after normalization they look the same, and image subtraction leaves very little intensity.

Ratio image

After the normalization process, “ratio” images are also produced. This is also a very simple process that calculates the ratio of each pixel between channels. The same precautions need to be taken when interpreting the image and data.

Some properties of differential and ratio images

The differential image is very sensitive to absolute changes of image intensity among channels. Because it omits the original intensity information of the parent images, it shows the “absolute change of signal intensity” and can be slightly misleading when judging the “percent change of abundance”. The ratio image is rather insensitive to that issue and gives the “ratio” of two channels, and is thus more intuitive. Although there are some major differences in their behavior, the two appear somewhat similar as images. If there is no significant overlap, the ratio image directly gives the “change” ratio, because the value at the spot center is exactly the ratio of the intensities of the target channels, without any quantitative calculation or data modeling by complicated skewed 2D-Gaussians.
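As an illustration of the normalization, subtraction, and ratio steps described above, the following is a minimal NumPy sketch. It assumes two channel images held as floating-point arrays and models the normalization as a single gain and offset fitted by least squares. The function names and the `floor` guard against division by zero are hypothetical and not taken from the original software, and the baseline correction step is not shown.

```python
import numpy as np

def normalize_to_reference(img, ref):
    """Rescale img with one gain and one offset so that the mean squared
    error against ref is minimal (ordinary linear regression)."""
    gain, offset = np.polyfit(img.ravel(), ref.ravel(), 1)
    return gain * img + offset

def differential_image(ch_a, ch_b):
    """Pixel-wise subtraction of two normalized channels."""
    return ch_b - ch_a

def log2_ratio_image(ch_a, ch_b, floor=1.0):
    """Pixel-wise log2 ratio; `floor` (an assumption) avoids dividing by
    zero or taking the log of non-positive pixels."""
    a = np.maximum(ch_a, floor)
    b = np.maximum(ch_b, floor)
    return np.log2(b / a)

# Toy data: one Gaussian spot on a flat background.
yy, xx = np.mgrid[0:64, 0:64]
spot = 1000.0 * np.exp(-(((xx - 32) / 5.0) ** 2 + ((yy - 32) / 5.0) ** 2))
cy3 = 100.0 + spot
cy5_raw = 0.5 * cy3 + 20.0          # same sample, different gain and offset

cy5 = normalize_to_reference(cy5_raw, cy3)
diff = differential_image(cy3, cy5)
ratio = log2_ratio_image(cy3, cy5)
print(float(np.abs(diff).max()), float(np.abs(ratio).max()))
```

In this toy case the second channel is only a rescaled copy of the first, so both the differential and the ratio image come out essentially flat; a spot that genuinely changes between channels would remain as a positive or negative feature.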

Why does the ratio image seem similar to the differential image?

Images are composed of sums of many 2D-Gaussians.

$$I_1 = B_1 + \sum_{i}^{N} A_{1i}\, e^{-f_i(x,y)}, \qquad I_2 = B_2 + \sum_{i}^{N} A_{2i}\, e^{-f_i(x,y)},$$

$$f_i(x,y) = \left\{ \left( \frac{x - x_{ci}}{w_{xi}} \right)^{2} + \left( \frac{y - y_{ci}}{w_{yi}} \right)^{2} \right\}$$

Ratio image is expressed as the logarithm of image ratio between I2 and I1.

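$$I_{rt} = \log_2\!\left(\frac{I_2}{I_1}\right) = \log_2\!\left[ \frac{B_2 + \sum_{i}^{N} A_{2i}\, e^{-f_i(x,y)}}{B_1 + \sum_{i}^{N} A_{1i}\, e^{-f_i(x,y)}} \right] = \log_2\!\left[ B_2 + \sum_{i}^{N} A_{2i}\, e^{-f_i(x,y)} \right] - \log_2\!\left[ B_1 + \sum_{i}^{N} A_{1i}\, e^{-f_i(x,y)} \right]$$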

If there is no significant overlap between the spots around the spot centers [xci, yci], and with the background level adjusted so that B2 ≈ B1 ≈ 0, the intensity contribution from a nearby spot is negligible beyond twice the distance of its spot width parameters wxi, wyi from its center, thus

$$\therefore\; I_{rt}(\sim x_{ci}, \sim y_{ci}) \approx \log_2\!\left[ A_{2i}\, e^{-f_i(x,y)} \right] - \log_2\!\left[ A_{1i}\, e^{-f_i(x,y)} \right] = \log_2\!\left[ \frac{A_{2i}}{A_{1i}} \right]$$

Thus the region near each spot center appears to have the intensity ratio of the spot at the center, and this gradually changes to the intensity ratios of the adjacent (surrounding) spots. Therefore, if there is no significant overlap among the spots, the spot intensity ratio can be obtained directly from the ratio image. With overlap, the ratio value at the spot center tends to be underestimated. The ratio value at the center of a spot is more accurately expressed by the following equation:

$$I_{rt}(\sim x_{ci}, \sim y_{ci}) \approx \log_2\!\left[ \frac{A_{2i} + \varepsilon_2}{A_{1i} + \varepsilon_1} \right]$$

where ε1 and ε2 are the contributions from neighboring overlapping spots. These decrease the absolute value of the ratio to a degree that depends on the extent of overlap and on noise.
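The effect of the overlap terms ε1 and ε2 can be checked numerically. The sketch below is only an illustration under stated assumptions: zero background, simple (non-skewed) 2D-Gaussian spots, a target spot that is 4-fold higher in channel 2, and an unchanged neighbor. The helper names and all numeric values are hypothetical.

```python
import numpy as np

def gaussian_spot(xx, yy, amp, xc, yc, wx, wy):
    """Simple 2D-Gaussian spot, amp * exp(-f) with the f_i defined above."""
    return amp * np.exp(-(((xx - xc) / wx) ** 2 + ((yy - yc) / wy) ** 2))

yy, xx = np.mgrid[0:101, 0:201]      # rows are y, columns are x

def center_log2_ratio(neighbor_x):
    """log2(I2/I1) at the target spot center (x=50, y=50) when an
    unchanged neighbor spot sits at x=neighbor_x on the same row."""
    neighbor = gaussian_spot(xx, yy, 100.0, neighbor_x, 50, 6.0, 6.0)
    i1 = gaussian_spot(xx, yy, 100.0, 50, 50, 6.0, 6.0) + neighbor
    i2 = gaussian_spot(xx, yy, 400.0, 50, 50, 6.0, 6.0) + neighbor
    return float(np.log2(i2[50, 50] / i1[50, 50]))

isolated = center_log2_ratio(150)    # neighbor 100 pixels away
overlapped = center_log2_ratio(58)   # neighbor 8 pixels away
print(isolated, overlapped)
```

With the distant neighbor the center value reproduces log2(A2/A1) = 2, while the close neighbor adds the same ε to both channels and pulls the value toward zero, which is the underestimation described in the text.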

Spot detection

The differential image is calculated in the first step of image processing. It includes information about spots that cannot be detected by conventional spot detection algorithms. Thus, in order to maximize the sensitivity of spot detection to such spots and improve the detection limits, spot detection is performed on the differential image.

The spot detection algorithm has a few steps. First, the program calculates a modified second derivative image. As the differential image contains both positive and negative values, the spot centers in the second derivative image may be either positive or negative. Therefore, in order to emphasize the degree of change and to account for the sign of the intensity values, the second derivative is multiplied by the image intensity.

Modified second derivative = (second derivative) × (differential image)

In this way, all local minima at spot centers in the modified second derivative are guaranteed to be negative values. This makes spot center detection easier and also prevents detection of false-positive spots (such as a dent in the curvature of a spot density distribution). Spot detection can then be done simply by finding local minima in the negative value range. After the initial search for local minima, they are filtered by user-specified criteria in order to remove noise and artifacts.
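The detection step can be sketched in NumPy as follows. This is a minimal illustration, not the original program: it assumes the "second derivative" is a discrete Laplacian, uses a 3x3 neighborhood for the local-minimum test, and replaces the user-specified filtering criteria with a single hypothetical `threshold`.

```python
import numpy as np

def laplacian(img):
    """Discrete second derivative: sum of the second differences in x and y."""
    p = np.pad(img, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * img)

def modified_second_derivative(diff_img):
    """(second derivative) x (differential image); negative at the center
    of both positive and negative spots."""
    return laplacian(diff_img) * diff_img

def detect_spot_centers(diff_img, threshold):
    """Return (row, col) of strict 3x3 local minima below -threshold."""
    m = modified_second_derivative(diff_img)
    h, w = m.shape
    p = np.pad(m, 1, mode="constant", constant_values=np.inf)
    is_min = np.ones(m.shape, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            is_min &= m < p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    keep = is_min & (m < -abs(threshold))
    return [(int(r), int(c)) for r, c in zip(*np.nonzero(keep))]

# Toy differential image: one increased and one decreased spot.
yy, xx = np.mgrid[0:60, 0:60]
diff_img = (100.0 * np.exp(-(((xx - 30) / 4.0) ** 2 + ((yy - 20) / 4.0) ** 2))
            - 80.0 * np.exp(-(((xx - 15) / 4.0) ** 2 + ((yy - 40) / 4.0) ** 2)))
centers = detect_spot_centers(diff_img, threshold=100.0)
print(centers)
```

Both the positive and the negative spot are found with the same negative-minimum search, which is the point of multiplying by the image intensity.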

After spot centers are detected in the second derivative image and filtered, a third derivative image is calculated for spot parameter estimation. This step is necessary for good spot fitting results. As spot fitting is a non-linear optimization problem, the initial parameters need to be as close to the optimum as possible. If the starting parameters are too far from the optimum, the optimization may not converge to a reasonable local minimum. As skewness parameters are hard to calculate, the program uses a simple Gaussian parameter estimation based on the third derivative image. The third derivative image is used to detect spot width information and to separate two closely located spots by calculating the slope change in the second derivative. These estimated parameters are used to create a synthetic image, which is compared with the real differential image, and then another refinement of the parameter estimation is done.

Modified second derivative image

Third derivative image with detected spot edges

Spot fitting

Once the spots are detected and their parameters are estimated, spot parameter optimization is carried out. This is an extremely computationally expensive process. The calculation time is proportional to N² (N is the number of parameters). Thus, optimization with a large number of parameters is prohibitive. The current algorithm uses a multi-threaded Levenberg-Marquardt algorithm. Although it takes advantage of dual CPUs with dual cores (4 cores in total), fitting 25 spots (250 parameters) still takes roughly 10-15 minutes. We are taking a few approaches to speed up the calculation. One is to use a cluster to calculate in parallel. Another is to use a GPU for the curvature matrix calculation; this has been shown to improve the calculation speed of the matrices by a factor of 10 to 30. A third is to divide the image into small pieces with a maximum of 5 to 10 spots within a single chunk of the image. This makes the calculation grow linearly with the number of spots rather than quadratically, which shortens the calculation time enormously (calculating 1000 spots with the image divided into 100 areas costs 100 × 10² = 10⁴, instead of 1000² = 10⁶). As long as each region within the image is independent of the other regions, the divided images are reasonably large, and enough overlap is taken among the divided images, optimization with this strategy works fine.
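The divide-and-fit idea and its cost arithmetic can be sketched as below. This is an illustration under assumptions: the image is cut into a regular grid, every tile is extended by a hypothetical `margin` on each side (so neighboring tiles share 2 × margin pixels), and cost is counted in the same relative units as the example in the text. The function names are not from the original software.

```python
def tile_bounds(height, width, tile, margin):
    """Yield (row0, row1, col0, col1) for a grid of tiles of nominal size
    `tile`, each extended by `margin` pixels and clipped to the image."""
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            yield (max(r - margin, 0), min(r + tile + margin, height),
                   max(c - margin, 0), min(c + tile + margin, width))

def relative_cost(n_spots, n_tiles=1):
    """Quadratic cost model: n_tiles independent fits, each scaling with
    the square of the number of spots it contains."""
    per_tile = n_spots / n_tiles
    return n_tiles * per_tile ** 2

# A 600x600 region cut into 9 sub-images that overlap by 50 pixels.
tiles = list(tile_bounds(600, 600, tile=200, margin=25))
print(len(tiles), tiles[0], tiles[4])

# 1000 spots fitted at once versus 100 tiles of about 10 spots each.
print(relative_cost(1000), relative_cost(1000, 100))
```

Each tile can then be fitted independently (for example on separate cores or cluster nodes), and spots that fall in the shared margins can be reconciled afterwards; how the original program reconciles them is not stated in the text.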

Parameter estimations

Simplified spot density function (just oval shape)

1st partial derivatives

2nd partial derivatives

Synthetic image with estimated spot parameters

Differential image and detected spots with estimated spot width parameters (color is flipped from synthetic image)

Spot density function (10 parameters)

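$$y = A \cdot e^{-f(X,Z)}$$

$$\begin{bmatrix} X \\ Z \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x - x_c \\ z - z_c \end{bmatrix}$$

$$f(X,Z) = \left\{ \frac{1}{sk_{1x}} \left( \frac{X}{w_x} \right)^{4} + sk_{2x} \left( \frac{X}{w_x} \right)^{3} + sk_{1x} \left( \frac{X}{w_x} \right)^{2} \right\} + \left\{ \frac{1}{sk_{1z}} \left( \frac{Z}{w_z} \right)^{4} + sk_{2z} \left( \frac{Z}{w_z} \right)^{3} + sk_{1z} \left( \frac{Z}{w_z} \right)^{2} \right\}$$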
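The sketch below evaluates one reading of the 10-parameter spot density function in NumPy: coordinates are shifted to the spot center (xc, zc), rotated by θ, scaled by the widths wx and wz, and passed through a quartic, cubic, and quadratic term in each direction. Treating sk1 as a divisor of the quartic term and a multiplier of the quadratic term, and using a standard rotation matrix, are assumptions about the typeset equations; the function and argument names are illustrative only.

```python
import numpy as np

def spot_density(x, z, A, xc, zc, theta, wx, wz, sk1x, sk2x, sk1z, sk2z):
    """Skewed 2D-Gaussian-like spot: A * exp(-f(X, Z))."""
    dx, dz = x - xc, z - zc
    # Rotate into the spot's own axes.
    X = np.cos(theta) * dx + np.sin(theta) * dz
    Z = -np.sin(theta) * dx + np.cos(theta) * dz
    u, v = X / wx, Z / wz
    f = ((u ** 4) / sk1x + sk2x * u ** 3 + sk1x * u ** 2
         + (v ** 4) / sk1z + sk2z * v ** 3 + sk1z * v ** 2)
    return A * np.exp(-f)

params = dict(A=100.0, xc=32.0, zc=32.0, theta=0.0, wx=6.0, wz=4.0,
              sk1x=1.0, sk2x=1.0, sk1z=1.0, sk2z=0.0)
zz, xx = np.mgrid[0:64, 0:64]        # rows are z, columns are x
img = spot_density(xx, zz, **params)

# Spot volume approximated by summing pixel values (unit pixel area).
volume = float(img.sum())
print(img[32, 32], img[32, 26], img[32, 38], volume)
```

Under this reading each bracket equals u²(u²/sk1 + sk2·u + sk1), so f stays non-negative for sk1 > 0 and |sk2| < 2. A non-zero sk2 produces the one-sided tail: with sk2x = 1 the spot is higher one width to the left of the center than one width to the right, while the z direction (sk2z = 0) stays symmetric.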

For the calculation of derivatives, the sum of these derivatives is used.

In the image, at the pixel P, (A) is expressed as …, (C) is expressed as …, and (B), (D) are expressed as …

F = (A) + (B) + (C) + (D)

Here, (B) and (D) are symmetric along the z-axis, so (X − Xc) is in the opposite direction. Thus, if (B) is expressed as …, then (D) is … Thus, the “summed” 2nd derivative for spot detection is …

At F = 0, …

Define rx = x − xc. As Wx, Wz > 0, … In the same way, …

∵ at the spot center, x = xc, z = zc

As … and …

Since xc, zc are already known, all unknown parameters can be calculated from observed values.

The actual calculation is done as follows:

(Intensity of pixel P − pixel 4) + (Intensity of pixel P − pixel 5)
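To make the idea of computing starting parameters from observed pixel values concrete, here is a minimal sketch for the simplified (non-skewed, unrotated) Gaussian A·exp(−(rx/wx)² − (rz/wz)²). At the spot center its second derivative in x is −2A/wx², so a width follows from the center intensity and a finite second difference of neighboring pixels. This is an assumption-based illustration, not a reconstruction of the derivation in the text, and the function name is hypothetical.

```python
import numpy as np

def estimate_gaussian_params(img, row, col):
    """Estimate amplitude and widths of a simple Gaussian spot from the
    center pixel and its four direct neighbors."""
    A = img[row, col]
    # Finite second differences: (left + right - 2*center), and likewise in z.
    d2x = img[row, col - 1] + img[row, col + 1] - 2.0 * A
    d2z = img[row - 1, col] + img[row + 1, col] - 2.0 * A
    # From d2/dx2 = -2A / wx**2 at the center; also valid for negative spots.
    wx = np.sqrt(-2.0 * A / d2x)
    wz = np.sqrt(-2.0 * A / d2z)
    return float(A), float(wx), float(wz)

zz, xx = np.mgrid[0:41, 0:41]
img = 100.0 * np.exp(-(((xx - 20) / 5.0) ** 2 + ((zz - 20) / 8.0) ** 2))
A_est, wx_est, wz_est = estimate_gaussian_params(img, 20, 20)
print(A_est, wx_est, wz_est)
```

The estimates are slightly larger than the true widths (5 and 8 here) because a one-pixel finite difference only approximates the derivative, which is acceptable for values that serve only as starting points for the non-linear fit.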

Differential image with spot boundaries

Synthetic image with spot boundaries (modeled with simple 2D-Gaussian and parameter estimation), before the optimization with 10 parameter skewed 2D-Gaussian.

The fitting results using a real gel image. The upper panels show a 250x250 pixel region with 25 spots. The lower panels show a 600x600 pixel region divided into 9 sub-images that overlap each other by 50 pixels.

Spot Matching

Detected spots in different gel images need to be matched for further processing such as statistical analyses. The matching is performed with an algorithm specifically designed for 2-D gel image analysis. The global matching is based on pattern recognition using the directions and distances among “landmark” spots. The landmark spots are chosen by the following criteria: well resolved, separated from each other, and neither too intense nor too faint. The candidate spots are tentatively paired between the sets of detected spots from two images (different gels for replication). This pairing is done with the following algorithm.

(1) Initially, spots are marked with their angle and distance from the top left of the image. This is the acidic/high-molecular-weight corner. It is chosen because in 2-D gels the acidic pI range has better reproducibility among experiments, and the high-molecular-weight region also shows smaller variation in mobility.

(2) These “angles and distances” are compared between the two sets of detected spots, and “similar” spots within the sets are tentatively paired as potential landmark spots.

(3) Within each set, these spots are marked with the angles and distances among them.

(4) Each candidate pair is judged by the total number and ratio of matches with the other candidate spots. If it meets the criteria, it goes to the next step; otherwise it is rejected.

(5) All detected spots in the vicinity of the candidate landmark spots are marked with their angles and distances from the candidate spot.

(6) These spots are then subjected to a local matching check to confirm or reject the pairing. The global check eliminates “obviously wrong” candidate pairs, but it has difficulty eliminating pairs that are wrong yet close enough to pass the global check.

(7) These local spots are then compared between the two sets for the candidate spots. If they match the criteria (total number of matches, percentage of matches, etc.), the two spots are determined to be landmark spots.

Globally matched landmark spots (numbered in red) and locally paired spots. The vector field between the two sets of spots is indicated by blue lines.
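Steps (1) to (4) above can be sketched as follows. This is a simplified illustration, not the original algorithm: spots are (x, y) pixel coordinates with the origin at the top left, "similar" means within hypothetical tolerances on distance and angle, and the support score only compares inter-spot distances. All names and tolerance values are assumptions.

```python
import numpy as np

def polar_signature(spots):
    """Distance and angle of each (x, y) spot from the top-left corner."""
    s = np.asarray(spots, dtype=float)
    return np.hypot(s[:, 0], s[:, 1]), np.arctan2(s[:, 1], s[:, 0])

def tentative_pairs(spots_a, spots_b, max_dist_diff=10.0, max_angle_diff=0.05):
    """Pair spot i of gel A with spot j of gel B when j is the only spot
    with a similar distance and angle (steps 1 and 2)."""
    da, aa = polar_signature(spots_a)
    db, ab = polar_signature(spots_b)
    pairs = []
    for i in range(len(da)):
        ok = (np.abs(db - da[i]) <= max_dist_diff) & \
             (np.abs(ab - aa[i]) <= max_angle_diff)
        candidates = np.nonzero(ok)[0]
        if len(candidates) == 1:         # keep only unambiguous candidates
            pairs.append((i, int(candidates[0])))
    return pairs

def pair_support(spots_a, spots_b, pairs, tol=5.0):
    """For each candidate pair, count the other pairs whose mutual
    distance agrees between the two gels (steps 3 and 4)."""
    a = np.asarray(spots_a, dtype=float)
    b = np.asarray(spots_b, dtype=float)
    support = {}
    for i, j in pairs:
        support[(i, j)] = sum(
            abs(np.linalg.norm(a[i] - a[k]) - np.linalg.norm(b[j] - b[l])) <= tol
            for k, l in pairs if (k, l) != (i, j))
    return support

gel_a = [(100, 120), (300, 80), (220, 400)]
gel_b = [(103, 118), (303, 78), (223, 398), (500, 500)]  # shifted, one extra spot
pairs = tentative_pairs(gel_a, gel_b)
support = pair_support(gel_a, gel_b, pairs)
print(pairs, support)
```

A candidate pair with low support relative to the number of candidates would be rejected before the local check of steps (5) to (7); the acceptance thresholds are left to the user, as in the text.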

After the landmark spots are determined and paired, the vector field is calculated for the landmark spots and the nearby “local” spots that were paired in the previous steps. This vector field is used to interpolate the vectors for other spots that are not yet paired. The interpolation is based on the following principle. Electrophoresis is a physical process, and spot locations within the image change gradually between two images. There are no “crossing” vectors among the spots and no “sharp turns” of vectors, and the length of the vectors changes gradually. The newly paired spots are checked by local matching again in order to make sure they are correctly matched. At the end, the overall vector field is examined for smoothness in both length and angle. If there are vectors that do not satisfy the criteria, the local matching processes are repeated until all criteria are satisfied.
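The text does not say which interpolation scheme is used, so the sketch below substitutes a common one, inverse-distance weighting, purely as an illustration of how displacement vectors of paired spots can predict the position of an unpaired spot. The function name, the `power` exponent, and the sample coordinates are all assumptions.

```python
import numpy as np

def interpolate_vectors(known_xy, known_vec, query_xy, power=2.0, eps=1e-9):
    """Inverse-distance-weighted average of known displacement vectors,
    evaluated at each query position."""
    known_xy = np.asarray(known_xy, dtype=float)     # shape (K, 2)
    known_vec = np.asarray(known_vec, dtype=float)   # shape (K, 2)
    query_xy = np.asarray(query_xy, dtype=float)     # shape (Q, 2)
    d = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    w = 1.0 / (d + eps) ** power                     # nearer spots weigh more
    return (w @ known_vec) / w.sum(axis=1, keepdims=True)

# Paired landmark spots in gel A and their displacement to gel B.
landmarks = [(100, 120), (300, 80), (220, 400), (450, 300)]
vectors = [(3.0, -2.0), (3.0, -2.0), (3.0, -2.0), (3.0, -2.0)]

unpaired = [(250, 250)]
predicted_shift = interpolate_vectors(landmarks, vectors, unpaired)
predicted_pos = np.asarray(unpaired, dtype=float) + predicted_shift
print(predicted_shift, predicted_pos)
```

The predicted position would then be compared with the detected spots of the second gel and the resulting pair passed through the local matching check again; because the weights vary smoothly with position, the interpolated field has no sharp turns as long as the landmark vectors themselves are smooth.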

Locally matched spots with examination of angles and distances to judge the pairing.
