Novel analytical strategy for the 2D-DIGE (US 20100046813-A1)


Upload: keiji-takamoto

Post on 16-Jul-2015


Novel strategy for the 2D-DIGE gel analysis algorithms

Summary

2D-DIGE is an important technology for top-down proteomic analysis strategies. Image analysis of 2D gels is mostly performed by spot detection followed by image segmentation and density integration. In principle, this approach has some major limitations. First, the quantification of spot volumes often suffers from heavily overlapping spots, which can produce extremely unreliable and inaccurate density estimates. Second, low-abundance protein spots tend to be buried within the sea of proteins, making detection of biologically important but low-copy-number proteins (such as transcription factors) virtually impossible. As mass spectrometer performance has improved, detection of proteins in these classes should be technically possible. However, even if MS sensitivity is sufficient for detecting and quantifying a protein, the protein first needs to be detected in the gel image, and as mentioned above, that is unlikely with current approaches.

I have developed a novel strategy to address these issues by introducing a few technological breakthroughs. The first is an inter-channel normalization protocol for the Cy2/3/5 channels that produces “differential images” among channels. This strategy brings some advantages over other approaches: (1) elimination of background intensity by channel-to-channel image subtraction after the normalization protocol, and (2) detection of “hidden” spots that are buried within high-abundance spots.

In addition to the inter-channel normalization strategy, there is another change in the algorithms. Instead of spot detection followed by image segmentation, the detected spots are modeled with a physically reasonable mathematical model, a 10-parameter skewed 2D-Gaussian spot density function. This function simulates the tailing in both the IEF and PAGE directions that is often observed in real gel images, and it gives exact quantification results from mathematical integration of the function itself, even for heavily overlapped spots. With these improvements, the accuracy of spot quantification and the detection of low-abundance proteins are significantly improved.

Inter-channel normalization

The images are “normalized” to the maximum intensity observed in the gel, or to a given value when all images have relatively low intensities. This process is a linear transformation of the images that matches the baseline and overall image intensities by linear regression between images (minimizing the MSE). Note: this is not a normalization between low and high points but a minimization of overall intensity differences, and the assumption is that no more than 50% of the pixels change in intensity between the images. An additional step examines the baseline (background) intensity level and corrects the overall normalization using the baseline information.

Inter-channel normalization of DIGE images: the original images differ in appearance between channels, but after normalization they are the same and image subtraction leaves very little intensity.

Differential images

The normalized images are subjected to image subtraction to generate inter-channel “differential” images. There is nothing special about the process; it simply produces a “subtraction image”. This image contains information on “changing” spots only and, in principle, zero background intensity. Care needs to be taken when interpreting extremely high-abundance proteins, such as some serum proteins (immunoglobulins, albumins) or RuBisCO (plant), because the physical electrophoretic processes (both isoelectric focusing and PAGE) can be affected by too much protein concentrated in a small region. The intensity information can therefore be unreliable: non-smooth intensity transitions, rough intensity distributions caused by physical disturbance during migration, or a general intensity reduction due to self-quenching of the dyes during fluorescent detection. Also, for some small proteins in the lower area of the gel, a shift in mobility among the channels may occasionally be observed, possibly due to physicochemical differences among the dyes: although they are quite similar in structure, their overall molecular lengths differ, which could affect mobility in the PAGE process.

Ratio image

After the normalization process, “ratio” images are also produced. This is also a very simple process that calculates the ratio of each pixel between channels. The same precautions need to be taken when interpreting the image and data.

Some properties of differential and ratio images

The differential image is very sensitive to absolute changes in image intensity among channels. Because it omits the original intensity information of the parent images, it shows the “absolute change of signal intensity” and is somewhat misleading for judging the “percent change of abundance”. The ratio image is rather insensitive to that issue and gives the “ratio” of two channels, so it is more intuitive. Although there are some major differences in their behavior, the two appear somewhat similar as images. If there is no significant overlap, the ratio image directly gives the “change” ratio, because the value at the spot center is exactly the ratio of the intensities of the target channels, without any quantitative calculation or data modeling by complicated skewed 2D-Gaussians.
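The normalization, subtraction, and ratio steps described above can be sketched as follows. This is only a minimal illustration, not the actual implementation: images are flattened to plain lists of pixel values, the function names are hypothetical, and the `use` index list is an assumed stand-in for whatever selection of unchanged pixels the real procedure relies on (the text only states that no more than 50% of pixels are assumed to change). The baseline correction step is not modeled.

```python
import math

def fit_linear(source, target, use=None):
    """Least-squares gain and offset so that gain * source + offset ~ target.

    `use` is an optional list of pixel indices to fit on (hypothetical helper
    argument: pixels assumed not to change between channels).
    """
    idx = range(len(source)) if use is None else use
    s = [source[i] for i in idx]
    t = [target[i] for i in idx]
    mean_s = sum(s) / len(s)
    mean_t = sum(t) / len(t)
    var_s = sum((a - mean_s) ** 2 for a in s)
    cov = sum((a - mean_s) * (b - mean_t) for a, b in zip(s, t))
    gain = cov / var_s
    return gain, mean_t - gain * mean_s

def differential(image_a, image_b):
    """Pixel-wise subtraction image (image_a - image_b)."""
    return [a - b for a, b in zip(image_a, image_b)]

def log2_ratio(image_a, image_b, floor=1e-6):
    """Pixel-wise log2 ratio; `floor` avoids taking the log of zero."""
    return [math.log2(max(a, floor) / max(b, floor))
            for a, b in zip(image_a, image_b)]

# Toy data: Cy5 has twice the gain and an offset of 5 relative to Cy3,
# and the spot at index 3 is doubled in abundance in Cy5.
cy3 = [10.0, 12.0, 50.0, 200.0, 11.0, 80.0]
cy5 = [25.0, 29.0, 105.0, 805.0, 27.0, 165.0]
unchanged = [0, 1, 2, 4, 5]

gain, offset = fit_linear(cy5, cy3, use=unchanged)
cy5_norm = [gain * v + offset for v in cy5]
diff_image = differential(cy5_norm, cy3)
ratio_image = log2_ratio(cy5_norm, cy3)
```

In this toy case the differential image is zero everywhere except at the changed spot, and the ratio image reads 1 (a two-fold change) at that pixel. Fitting on all pixels instead would let the changed spot pull the regression, which is one reason the method has to assume that most pixels do not change.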

Why does the ratio image seem similar to the differential image?

Images are composed of sums of many 2D-Gaussians.

$$I_1 = B_1 + \sum_{i}^{N} A_{1i}\, e^{-f_i(x,y)}, \qquad I_2 = B_2 + \sum_{i}^{N} A_{2i}\, e^{-f_i(x,y)}$$

$$f_i(x,y) = \left\{ \left( \frac{x - x_{ci}}{w_{xi}} \right)^2 + \left( \frac{y - y_{ci}}{w_{yi}} \right)^2 \right\}$$

The ratio image is expressed as the logarithm of the image ratio between I2 and I1.

$$I_{rt} = \log_2\left(\frac{I_2}{I_1}\right) = \log_2\left[ \frac{B_2 + \sum_{i}^{N} A_{2i}\, e^{-f_i(x,y)}}{B_1 + \sum_{i}^{N} A_{1i}\, e^{-f_i(x,y)}} \right] = \log_2\left[ B_2 + \sum_{i}^{N} A_{2i}\, e^{-f_i(x,y)} \right] - \log_2\left[ B_1 + \sum_{i}^{N} A_{1i}\, e^{-f_i(x,y)} \right]$$

Suppose there is no significant overlap between the spots around the spot centers [x_ci, y_ci] and the background level is adjusted, so that B2 ≈ B1 ≈ 0. Beyond twice the spot width parameters w_xi, w_yi from a spot center, the intensity contribution from a nearby spot is negligible, thus

$$\therefore\; I_{rt}(\sim x_{ci}, \sim y_{ci}) \approx \log_2\left[ A_{2i}\, e^{-f_i(x,y)} \right] - \log_2\left[ A_{1i}\, e^{-f_i(x,y)} \right] = \log_2\left[ \frac{A_{2i}}{A_{1i}} \right]$$

Thus the region near each spot center appears to have the intensity ratio of the spot at that center, and the value gradually changes toward the intensity ratios of the adjacent (surrounding) spots. If there is no significant overlap among the spots, the spot intensity ratio can therefore be obtained directly from the ratio image. The ratio value at the spot center tends to be underestimated. It is more accurately expressed with the following equation, where ε1 and ε2 are the contributions from neighboring overlapping spots. These decrease the absolute value of the ratio to a degree that depends on the degree of overlap and noise.

$$I_{rt}(\sim x_{ci}, \sim y_{ci}) \approx \log_2\left[ \frac{A_{2i} + \varepsilon_2}{A_{1i} + \varepsilon_1} \right]$$
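A small numeric example may help show why the ratio at the spot center is underestimated. The sketch below evaluates log2((A2 + ε2)/(A1 + ε1)) with and without neighbor contributions. The amplitudes and ε values are made-up numbers, and the function name is hypothetical.

```python
import math

def center_log_ratio(a1, a2, eps1=0.0, eps2=0.0):
    """log2 ratio at a spot center with neighbor contributions eps1, eps2."""
    return math.log2((a2 + eps2) / (a1 + eps1))

# Isolated spot: a four-fold increase reads as exactly 2 in log2 units.
ideal = center_log_ratio(100.0, 400.0)

# Same spot with an unchanged neighbor adding 20 intensity units to both channels.
overlapped = center_log_ratio(100.0, 400.0, eps1=20.0, eps2=20.0)
```

With these numbers the overlapped value is log2(420/120), about 1.81 instead of 2, so the apparent change is pulled toward the ratio of the neighboring spot (here 0, meaning no change).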

   

Spot detection

The differential image is calculated in the first step of image processing. It includes information about spots that cannot be detected by conventional spot detection algorithms. Thus, in order to maximize the sensitivity of spot detection for such spots and improve the detection limits, spot detection is performed on the differential image.

The spot detection algorithm has a few steps. First, the program calculates a modified second derivative image. As the differential image contains both positive and negative values, the second derivative at a spot center may be either positive or negative. Therefore, to emphasize the degree of change and to account for the sign of the intensity values, the second derivative is multiplied by the image intensity.

Modified second derivative = (second derivative) × (differential image)

In this way, every spot center becomes a local minimum with a negative value in the modified second derivative. This makes spot center detection easier and also prevents detection of false-positive spots (such as a dent in the curvature of a spot density distribution). Spot detection can then be done simply by finding local minima in the negative value range. After the initial search, the local minima are filtered by criteria that the user can specify, in order to remove noise and artifacts.
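The detection step above can be illustrated with a small sketch. It is written under several assumptions: the second derivative is approximated by a discrete 4-neighbor Laplacian, the user-specified filtering is reduced to a single `threshold`, and the test image is a synthetic differential image with one positive and one negative Gaussian spot. None of the names come from the original program.

```python
import math

def gaussian_image(size, spots):
    """Synthetic differential image; spots are (amplitude, row, col, width)."""
    return [[sum(a * math.exp(-((r - rc) ** 2 + (c - cc) ** 2) / w ** 2)
                 for a, rc, cc, w in spots)
             for c in range(size)]
            for r in range(size)]

def modified_second_derivative(img):
    """(discrete Laplacian) x (image intensity); border pixels are left at 0."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            lap = (img[r - 1][c] + img[r + 1][c] +
                   img[r][c - 1] + img[r][c + 1] - 4.0 * img[r][c])
            out[r][c] = lap * img[r][c]
    return out

def detect_spot_centers(mod, threshold):
    """Strict local minima (8 neighbors) that lie below -threshold."""
    centers = []
    for r in range(1, len(mod) - 1):
        for c in range(1, len(mod[0]) - 1):
            v = mod[r][c]
            if v >= -threshold:
                continue
            neighbours = [mod[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
            if all(v < n for n in neighbours):
                centers.append((r, c))
    return centers

# One spot that increases (+100) and one that decreases (-80) between channels.
diff_img = gaussian_image(20, [(100.0, 5, 5, 2.0), (-80.0, 14, 14, 2.0)])
centers = detect_spot_centers(modified_second_derivative(diff_img), threshold=1.0)
```

Both the positive and the negative spot are found by the same search for negative local minima, which is the point of multiplying the second derivative by the intensity.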

After the spot centers are detected in the second derivative image and filtered, a third derivative image is calculated for spot parameter estimation. This step is necessary for good spot fitting results. As spot fitting is a non-linear optimization problem, the initial parameters need to be as close to the optimum as possible. If the starting parameters are too far from the optimum, the optimization may not converge to a reasonable local minimum. As skewness parameters are hard to calculate, the program uses simple Gaussian parameter estimation based on the third derivative image. The third derivative image is used to detect spot width information and to separate two closely located spots by calculating the slope change in the second derivative. These estimated parameters are used to create a synthetic image, which is compared with the real differential image, and then another refinement of the parameter estimates is done.

Modified second derivative image

Third derivative image with detected spot edges

Spot fitting

Once the spots are detected and their parameters are estimated, spot parameter optimization is carried out. This is an extremely computationally expensive process. The calculation time is proportional to N² (N is the number of parameters). Thus, optimization with a large number of parameters is prohibitive. The current algorithm uses a multi-threaded Levenberg-Marquardt algorithm. Although it takes advantage of dual CPUs with dual cores (4 cores in total), fitting 25 spots (250 parameters) still takes roughly 10-15 minutes. We are taking a few approaches to speed up the calculation. One is to use a cluster to calculate in parallel. The next is to use a GPU for the curvature matrix calculation, which has been shown to improve the calculation speed of the matrices by a factor of 10 to 30. Another is to divide the image into small pieces with a maximum of 5 to 10 spots within a single chunk of the image. This brings the calculation down to a linear increase with the number of spots rather than a quadratic one, which improves the calculation time enormously (calculation of 1000 spots with the image divided into 100 areas costs 100 × 10², instead of 1000²). As long as each region within the image is independent of the other regions, the divided images are reasonably large, and enough overlap is taken among the divided images, optimization with this strategy works fine.
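The cost comparison in the paragraph above can be checked with a few lines of arithmetic. The sketch assumes, as the text does, that the cost of one joint fit grows with the square of the number of spots (equivalently, of parameters), and it ignores the extra work caused by overlapping the chunks. The function names are illustrative only.

```python
def fit_cost(n_spots):
    """Relative cost of one joint fit, assumed proportional to n_spots squared."""
    return n_spots ** 2

def chunked_cost(n_spots, n_chunks):
    """Relative cost when the spots are split evenly over independent chunks."""
    per_chunk = n_spots // n_chunks
    return n_chunks * fit_cost(per_chunk)

whole = fit_cost(1000)               # 1000^2
chunked = chunked_cost(1000, 100)    # 100 chunks of 10 spots: 100 x 10^2
speedup = whole / chunked
```

For a fixed chunk size of 10 spots, the chunked cost is 100 per chunk, so the total grows linearly with the number of spots, and for 1000 spots the relative cost drops by a factor of 100.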

Parameter estimations

Simplified spot density function (simple oval shape), with its 1st and 2nd partial derivatives (equations not recoverable here)

Synthetic image with estimated spot parameters

Differential image and detected spots with estimated spot width parameters (color is flipped from synthetic image)

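Spot density function (10 parameters)

$$y = A \cdot e^{-f(X,Z)}$$

$$\begin{bmatrix} Z \\ X \end{bmatrix} = \begin{bmatrix} -\sin\theta & \cos\theta \\ \cos\theta & \sin\theta \end{bmatrix} \begin{bmatrix} z - z_c \\ x - x_c \end{bmatrix}$$

$$f(x,z) = \left\{ \frac{1}{sk_{1x}} \left( \frac{X}{w_x} \right)^4 + sk_{2x} \left( \frac{X}{w_x} \right)^3 + sk_{1x} \left( \frac{X}{w_x} \right)^2 \right\} + \left\{ \frac{1}{sk_{1z}} \left( \frac{Z}{w_z} \right)^4 + sk_{2z} \left( \frac{Z}{w_z} \right)^3 + sk_{1z} \left( \frac{Z}{w_z} \right)^2 \right\}$$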
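As an illustration, the 10-parameter spot density function can be evaluated as below. This is a sketch under an explicit assumption: the exponent is read as a quartic, a cubic, and a quadratic term in X/wx and Z/wz with coefficients 1/sk1, sk2, and sk1. The layout of the formula in this transcript is damaged, so the placement of the skew coefficients may differ from the original. The function and argument names are hypothetical.

```python
import math

def spot_density(x, z, amp, xc, zc, theta, wx, wz, sk1x, sk2x, sk1z, sk2z):
    """Skewed 2D spot density y = A * exp(-f(X, Z)) (assumed reading of f)."""
    dz, dx = z - zc, x - xc
    # Rotated coordinates, following the matrix given for [Z, X].
    big_z = -math.sin(theta) * dz + math.cos(theta) * dx
    big_x = math.cos(theta) * dz + math.sin(theta) * dx

    def axis_term(u, sk1, sk2):
        # Quartic + cubic (skew) + quadratic contribution along one axis.
        return u ** 4 / sk1 + sk2 * u ** 3 + sk1 * u ** 2

    f = axis_term(big_x / wx, sk1x, sk2x) + axis_term(big_z / wz, sk1z, sk2z)
    return amp * math.exp(-f)

# Example parameters (made up): theta = pi/2 aligns X with the x direction.
params = dict(amp=100.0, xc=10.0, zc=10.0, theta=math.pi / 2,
              wx=2.0, wz=2.0, sk1x=1.0, sk2x=0.3, sk1z=1.0, sk2z=0.0)
at_center = spot_density(10.0, 10.0, **params)
right = spot_density(12.0, 10.0, **params)
left = spot_density(8.0, 10.0, **params)
```

With a positive cubic coefficient the density falls off faster on one side of the center than on the other, which is the tailing behavior the function is meant to capture. The ten parameters are A, xc, zc, θ, wx, wz, sk1x, sk2x, sk1z, and sk2z.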

 

 

 

For the calculation of derivatives, the sum of these derivatives is used.

In the image, at the pixel P, the terms (A), (B), (C), and (D) are defined by expressions that are not recoverable here, and

F = (A) + (B) + (C) + (D)

Here, (B) and (D) are symmetric about the z-axis, so (X − Xc) is in the opposite direction, and the expression for (D) follows from that for (B). Together these give the “summed” 2nd derivative used for spot detection.

The derivation then considers F = 0, defines rx = x − xc (with Wx, Wz > 0), treats the z direction in the same way, and evaluates the result at the spot center, x = xc, z = zc (equations not recoverable here).

Since xc and zc are already known, all unknown parameters can be calculated from observed values.

The actual calculation is done as follows:

(intensity of pixel P − pixel 4) + (intensity of pixel P − pixel 5)
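To show how widths can be calculated from observed values once the center is known, here is a minimal sketch for the simple (unskewed) Gaussian A·exp(−(rx/wx)² − (rz/wz)²). For that function the second partial derivative with respect to x at the center equals −2A/wx², so wx = sqrt(−2A / I_xx), and likewise for z. The sketch uses only the two axis-aligned second differences rather than the four-direction sum described above, and every name in it is an assumption made for illustration.

```python
import math

def simple_spot(x, z, amp, xc, zc, wx, wz):
    """Simple oval 2D-Gaussian spot density."""
    return amp * math.exp(-(((x - xc) / wx) ** 2 + ((z - zc) / wz) ** 2))

def estimate_widths(image, xc, zc):
    """Estimate (wx, wz) at a known center from discrete second differences.

    image is indexed as image[z][x]. The same formula works for negative
    spots in a differential image, because the signs cancel.
    """
    center = image[zc][xc]
    d2x = image[zc][xc - 1] + image[zc][xc + 1] - 2.0 * center
    d2z = image[zc - 1][xc] + image[zc + 1][xc] - 2.0 * center
    return math.sqrt(-2.0 * center / d2x), math.sqrt(-2.0 * center / d2z)

# Synthetic spot with known widths wx = 4 and wz = 6 pixels.
image = [[simple_spot(x, z, 100.0, 10, 10, 4.0, 6.0) for x in range(21)]
         for z in range(21)]
wx_est, wz_est = estimate_widths(image, 10, 10)
```

The estimates come out slightly larger than the true widths because a one-pixel finite difference underestimates the curvature, which is acceptable for starting values that are refined later by the optimization.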

 

Differential image with spot boundaries

Synthetic image with spot boundaries (modeled with simple 2D-Gaussian and parameter estimation), before the optimization with the 10-parameter skewed 2D-Gaussian.

The fitting results using a real gel image. Upper panels show a 250x250 pixel region with 25 spots. Lower panels show a 600x600 pixel region divided into 9 sub-images that overlap each other by 50 pixels.
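The division of a region into overlapping sub-images, as in the 600x600 example, could be computed along these lines. This is only a sketch: it assumes a regular grid and that each interior boundary is widened by half the overlap on both sides, which is one of several ways to obtain a 50-pixel overlap. The function names are hypothetical.

```python
def tile_bounds(length, n_tiles, overlap):
    """Split [0, length) into n_tiles intervals whose neighbors share `overlap` pixels."""
    step = length / n_tiles
    half = overlap // 2
    bounds = []
    for i in range(n_tiles):
        start = max(0, round(i * step) - half)
        stop = min(length, round((i + 1) * step) + half)
        bounds.append((start, stop))
    return bounds

def tile_image(width, height, n_per_side, overlap):
    """Return (x0, x1, z0, z1) for every sub-image of a regular grid."""
    xs = tile_bounds(width, n_per_side, overlap)
    zs = tile_bounds(height, n_per_side, overlap)
    return [(x0, x1, z0, z1) for (z0, z1) in zs for (x0, x1) in xs]

tiles = tile_image(600, 600, 3, 50)
```

Each sub-image can then be fitted independently, and spots that fall in an overlap band would need to be reconciled between neighboring tiles afterwards.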

Spot Matching

Detected spots in different gel images need to be matched for further processing, such as statistical analyses. The matching is performed with an algorithm specifically designed for 2-D gel image analysis. The global matching is based on pattern recognition using the directions and distances among “landmark” spots. The landmark spots are chosen by the following criteria: well resolved, separated from each other, and neither too intense nor too faint. The candidate spots are tentatively paired between the sets of detected spots from two images (different gels for replication). This pairing is done with the following algorithm.

(1) Initially, spots are marked with their angle and distance from the top left of the image. This is the acidic/high-molecular-weight direction. It is chosen because, in 2-D gels, the acidic pI range has better reproducibility among experiments and the high-molecular-weight region also shows smaller variation in mobility.

(2) These “angles and distances” are compared between the two sets of detected spots, and “similar” spots within the sets are tentatively paired as potential landmark spots.

(3) These spots are marked with the angles and distances among them for each set.

(4) Each candidate pair is judged by the total number and ratio of matches with the other candidate spots. If the pair meets the criteria, it goes to the next step; otherwise it is rejected.

(5) All detected spots within the vicinity of the candidate landmark spots are marked with their angles and distances from the candidate spot.

(6) These spots are then subjected to a local matching check to confirm or reject the pairing. The global check eliminates “obviously wrong” candidate pairs, but it has difficulty eliminating pairs that are wrong yet close enough to pass the global check.

(7) These local spots are then compared between the two sets for the candidate spots. If they match the criteria (total number of matches, percentage matching, etc.), the two spots are determined to be landmark spots.

Globally matched landmark spots (numbered in red) and locally paired spots. The vector field between the two sets of spots is indicated by blue lines.
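Steps (1) and (2) above can be sketched as follows. The sketch assumes image coordinates with the origin at the top left, and it uses made-up tolerances (`max_angle` in radians, `max_dist` in pixels) and a simple best-score rule for choosing among several candidates. The later checks in steps (3) to (7) are not included, and the names are illustrative.

```python
import math

def polar_signature(spot):
    """Angle and distance of a spot (x, z) measured from the top-left corner."""
    x, z = spot
    return math.atan2(z, x), math.hypot(x, z)

def tentative_pairs(spots_a, spots_b, max_angle=0.05, max_dist=10.0):
    """Pair each spot in A with the most similar spot in B, if any is close enough."""
    pairs = []
    for i, a in enumerate(spots_a):
        ang_a, dist_a = polar_signature(a)
        best = None
        for j, b in enumerate(spots_b):
            ang_b, dist_b = polar_signature(b)
            d_ang, d_dist = abs(ang_a - ang_b), abs(dist_a - dist_b)
            if d_ang <= max_angle and d_dist <= max_dist:
                score = d_dist + dist_a * d_ang   # rough positional mismatch
                if best is None or score < best[0]:
                    best = (score, j)
        if best is not None:
            pairs.append((i, best[1]))
    return pairs

gel_a = [(100.0, 50.0), (300.0, 400.0), (520.0, 120.0)]
gel_b = [(303.0, 404.0), (98.0, 52.0), (900.0, 900.0)]
pairs = tentative_pairs(gel_a, gel_b)
```

Here the first two spots of gel A find partners in gel B despite small shifts, and the third stays unpaired. In the full procedure these tentative pairs would still have to pass the global and local consistency checks before being accepted as landmarks.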

After the landmark spots are determined and pairing is done, the vector field is calculated for the landmark spots and the nearby “local” spots that were paired in the previous steps. This vector field is used to interpolate the vectors for other spots that are not yet paired. The interpolation is based on the following principle: electrophoresis is a physical process, and spot locations within the image change gradually between two images. There are no “crossing” vectors among any spots and no “sharp turns” of vectors, and the vector length changes gradually. The newly paired spots are checked by local matching again to make sure they are correctly matched. At the end, the overall vector field is examined for smoothness in both length and angle. If there are vectors that do not satisfy the criteria, the local matching processes are repeated until all criteria are satisfied.

Locally matched spots with examination of angles and distances to judge the pairing.