
Because researchers typically identify differentially expressed genes and carry them forward for biological pathway analysis or use them as the basis for mechanistic inferences, we propose that a reasonable first step is to present methods that will enhance the reproducibility or stability of a list of differentially expressed genes. In summary, reproducibility can and should be used as a quality measurement, but it is not the only quality measurement (that is, sensitivity and specificity are also important).

We also agree with Klebanov et al. that a method is sound if “its properties…can be guaranteed by theoretical considerations or at least be supported by computer simulations.” We believe that using theoretical constructs and computer simulations will help elucidate scenarios where different rules for gene selection have advantages and disadvantages related to sensitivity, specificity, reproducibility and even relevance. To that end, we performed several computer simulations during the initial MAQC study in which the truth concerning differential gene expression was known, and sensitivity, specificity and reproducibility were measured. Although the results from these simulations will be provided in a separate paper, we can state that our current simulation results substantially emulate the results of the MAQC study1,2. One important contrast between the MAQC’s simulations and those of Klebanov et al. is that ours specifically measure gene-list commonality (and thus reproducibility) between simulated instances of experiments with the same known differential expression, whereas the simulations of Klebanov et al. do not. When the coefficient of variation (CV) of replicates is large (>100%) and n is modest (e.g., n = 5), our simulations show that all previously referenced gene selection methods generally lack reproducibility, regardless of sensitivity concerns. This would be true for the simulations of Klebanov et al., where the CV values for half (100) of the differentially expressed genes typically range from 50% to over 100%, much higher than what we usually see in practice.
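As an illustration of the kind of gene-list commonality measurement described above, the following sketch simulates two independent replicate experiments with the same known differentially expressed genes and reports the overlap of their top-gene lists. This is a minimal Python sketch, not the MAQC simulation code; all sizes and parameters (gene counts, CV, fold change, n) are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N_GENES, N_DE, N_REP = 1000, 100, 5   # illustrative sizes, not MAQC's
FOLD, CV = 2.0, 1.0                   # CV = 100%: the high-noise regime

def simulate_experiment():
    """One two-group experiment; genes 0..N_DE-1 are truly DE at FOLD."""
    base = rng.uniform(100, 1000, N_GENES)
    mean_a, mean_b = base, base.copy()
    mean_b[:N_DE] *= FOLD
    # Gaussian noise scaled so each gene's sd equals CV * mean
    a = mean_a * (1 + CV * rng.standard_normal((N_REP, N_GENES)))
    b = mean_b * (1 + CV * rng.standard_normal((N_REP, N_GENES)))
    return np.clip(a, 1, None), np.clip(b, 1, None)

def top_list(a, b, k=N_DE):
    """Select the k genes with the smallest two-sample t-test P values."""
    p = stats.ttest_ind(a, b, axis=0).pvalue
    return set(np.argsort(p)[:k])

list1 = top_list(*simulate_experiment())
list2 = top_list(*simulate_experiment())
print(f"gene-list overlap: {len(list1 & list2) / N_DE:.0%}")
```

In runs of this toy setup, with CV near 100% and n = 5 the printed overlap is typically well below 100%, and it rises sharply as the CV shrinks, consistent with the qualitative point above.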

In addition, Klebanov et al. do not follow the MAQC recommendation that a nonstringent level of statistical significance (e.g., P < 0.05 or 0.01) be used in conjunction with fold change. As a result, when they include 400 very noisy ‘null’ genes (CV ~ 300% or more) in their simulations, large numbers of false positives occur. The simulations of Klebanov et al. demonstrate sensitivity of differential detection using P values, which we accept. Nevertheless, we restate our view that results that show sensitivity to any degree but that are not reproducible are of no utility in the scientific and regulatory environment. In fact, it is the apparent lack of reproducibility of gene lists that has been used as evidence for criticizing microarray technology. We have demonstrated that the apparent lack of reproducibility of results may come from the common practice of using a stringent P-value cutoff to determine differentially expressed genes. This difficulty in reproducing gene lists is a methodological issue and should not be inherently associated with microarray technology itself.
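To make the last point concrete, the sketch below (a continuation of the simulation above, reusing simulate_experiment; again illustrative, not the MAQC analysis) compares the between-experiment overlap of gene lists selected by a stringent P-value cutoff alone with that of lists selected by fold-change ranking plus a nonstringent P < 0.05 filter, in the spirit of the MAQC recommendation.

```python
def select_stringent_p(a, b, alpha=0.001):
    """Rule 1: stringent P-value cutoff alone."""
    p = stats.ttest_ind(a, b, axis=0).pvalue
    return set(np.flatnonzero(p < alpha))

def select_fc_with_p(a, b, k=N_DE, alpha=0.05):
    """Rule 2: rank by fold change, keep top k passing a nonstringent P."""
    p = stats.ttest_ind(a, b, axis=0).pvalue
    fc = np.abs(np.log2(b.mean(axis=0) / a.mean(axis=0)))
    ranked = [g for g in np.argsort(-fc) if p[g] < alpha]
    return set(ranked[:k])

def list_overlap(rule):
    """Overlap of the lists a rule selects in two independent experiments."""
    s1 = rule(*simulate_experiment())
    s2 = rule(*simulate_experiment())
    return len(s1 & s2) / max(len(s1 | s2), 1)   # Jaccard overlap

print(f"stringent P only: {list_overlap(select_stringent_p):.0%}")
print(f"FC + P < 0.05:    {list_overlap(select_fc_with_p):.0%}")
```

In this toy setting the fold-change-plus-filter rule tends to produce the more reproducible lists, though the exact numbers depend on the noise model; the sketch is meant only to show how two selection rules can be compared on reproducibility, not to reproduce the MAQC results.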

Although CV is an important metric for examining reproducibility of signal, it is not the only measure associated with microarray measurement and should not be examined in isolation. No single measure associated with microarrays captures all aspects of various problems that microarrays are trying to address. That is why we examined repeatability and reproducibility of signal, repeatability and reproducibility of detection, and reproducibility and compression/expansion of differential gene expression1. These and other measurements are needed to understand the different facets of performance of microarrays. Finally, we examined the reproducibility of results in terms of gene ranking, not because gene ranking is the selection procedure itself, but because gene ranking provides inferences related to a range of thresholds that may be used in a selection procedure.

1. Shi, L. et al. Nat. Biotechnol. 24, 1151–1161 (2006).

2. Guo, L. et al. Nat. Biotechnol. 24, 1162–1169 (2006).

MAQC papers over the cracks

To the editor:
I argue here that the findings of the MAQC consortium reported in the September issue by Shi et al. (Nat. Biotechnol. 24, 1039–1176, 2006) are seriously flawed in experimental design. I also believe they raise conflict-of-interest issues for the scientists involved.

First, their choice of samples, human brain RNA and a universal human reference RNA, is like comparing apples and oranges—it would rarely happen in a real research setting. It is estimated that >50% of genes expressed in the brain are brain specific and not found in any other tissue in the human body1. So simply by chance, any gene probe lit up by the brain-specific cDNA probe has a >50% probability of representing a gene expressed only in that tissue. In addition, the large coefficients of variation (up to 20%) associated with the study results suggest measurements no more accurate than random guessing.

Second, even a 1–5% error rate could translate into noise for 100–500 genes, given that a typical mammalian cell expresses ~10,000–15,000 genes. When one compares ‘apples and oranges’ as in these studies, the difference or ‘signal’ (>4,000 genes) is of course significantly higher than the noise. It is not surprising, then, that the MAQC researchers saw high intra- and interplatform concordances—a result that is in stark contrast to the observations of Cam and colleagues2 at the US National Institutes of Health, who compared two closely related RNA samples. Because most comparative gene expression analyses in biomedical research focus on differences in gene expression that are significantly smaller than, or similar to, the intrinsic noise of a microarray platform, the conclusion about the reliability of the method portrayed by these articles is very misleading, if not deceiving!

Several years ago, Arthur Pardee and I pointed out the root problem that plagues the use of DNA microarrays in gene expression analysis—the complexity of the cDNA probes used3. We also outlined several simple controls to test it.

The complexity of a cDNA probe specifies the number of cDNA species (or mRNA species) and their relative concentrations within that probe. DNA microarrays are reverse northern blot approaches, in which a cDNA probe is made by reverse transcription of all the mRNAs expressed in a cell or tissue specimen using an oligo-deoxythymidine primer, which targets the polyadenosine tails present in most eukaryotic mRNAs. In fact, these cDNA probes are so complex that each one may consist of as many as 10,000 different species, each ranging from a few to thousands of copies per cell. Clearly, one of the main challenges for DNA microarrays using such complex cDNA probes is how to ascertain whether a hybridization signal is specific and quantitative for a known gene sequence arrayed on a chip.

One simple control experiment that could address this problem would be to label only one mRNA at a time, for several highly expressed as well as rarely expressed genes present on an array, with a corresponding gene-specific primer (e.g., a primer that anneals just upstream of the poly(A) tail of a gene of interest), rather than following the common approach of labeling all mRNAs by reverse transcription with oligo-deoxythymidine primers. These single gene-specific probes, when hybridized to the DNA microarray individually, would provide an accurate glimpse of the actual sensitivity and specificity of a DNA microarray.

In my opinion, a technique that fails to detect even a single gene probed with a gene-specific cDNA probe is unlikely to provide reliable measurements of the other tens of thousands of genes on the array; how, then, can it reflect an accurate and sensitive snapshot of global gene expression within a cell? I find it very surprising that none of the array manufacturers appears to have tried these real controls to prove that their methods actually work—perhaps they did, but would rather keep the findings to themselves?

To my knowledge, DNA microarrays have never been proven to accurately measure the expression level of one gene at a time—spiking with a control mRNA does not count, because it does not address how many other genes are wrongly detected on the chip as a result of the spiking. Yet the misconception has been that these platforms can simultaneously detect the expression of tens of thousands of genes.

Third, I was frankly shocked to see US Food and Drug Administration (FDA; Rockville, MD) scientists as co-authors with researchers from commercial entities whose interest is clearly to sell the technology under scrutiny. The FDA would never let drug companies evaluate their own drug candidates in collaboration with FDA researchers in clinical trials. The same principle should obviously apply to the evaluation of microarray technology, particularly if the FDA is seriously considering whether or not to allow microarray data to be admitted in support of a sponsor’s clinical trial data or in medical diagnostics. In essence, most of the authors of these papers are acting as ‘athletes’ and ‘referees’ at the same time, which casts serious doubt on everything done and concluded by the MAQC. It is obvious that every author of these papers had little to gain, but much to lose, had the conclusions drawn by the MAQC been contrary to what was finally reported. After all, every researcher in the MAQC group was a microarray proponent or practitioner.

In my opinion, the most scientific, balanced and transparent way to determine whether the DNA microarray is a truly reliable and accurate technology for detecting changes in gene expression in a physiologically relevant, biological context would be for the FDA or NIH to form an external scientific panel, with no conflicts of interest, to coordinate a trial in which none of the array manufacturers would know in advance the nature of a pair of RNA samples to be compared. Upon completing the array analysis, each company would submit its results independently to the panel. The experts would then determine whether there was indeed good intra- and interplatform concordance for the method and, most importantly, whether the different microarray platforms were actually able to track down the real genes whose expression is truly different, as the panel would expect based on the nature of the test RNA samples. Other technologies, such as serial analysis of gene expression (SAGE) and differential display, should also be included in this trial.

Perhaps a cash prize could be offered to the technology/platform that can track down most accurately the true differences in gene expression between the two related RNA samples. After all, a small investment by the government could mean billions of dollars saved down the road, if a flawed technology continues to be used without fundamental improvement.

ACKNOWLEDGMENTS
Peng Liang is a cofounder of GenHunter Corporation, which offers cloning and differential display products for sale that compete with DNA microarray technology.

Peng Liang

Vanderbilt-Ingram Cancer Center, 658 Preston Building, Nashville, Tennessee 37232-6838, USA.
e-mail: [email protected]

1. Shippy, R. et al. Nat. Biotechnol. 24, 1123–1131 (2006).

2. Tan, P.K. et al. Nucleic Acids Res. 31, 5676–5684 (2003).

3. Liang, P. & Pardee, A.B. Nat. Rev. Cancer 3, 869–876 (2003).

On behalf of MAQC, Leming Shi, Wendell D Jones, Roderick V Jensen, Russell D Wolfinger, Ernest S Kawasaki, Damir Herman, Lei Guo, Federico M Goodsaid and Weida Tong reply:

Liang’s comments reveal several misconceptions about the MAQC study1,2 and the current state of the art of DNA microarray technology.

In his first point, Liang appears to confuse measures of precision and accuracy in evaluating gene expression technologies. In practice, precision is easily evaluated using technical replicates to answer the question: “How variable are probe-by-probe results when I repeatedly measure the same biological sample?” Using standardized protocols and RNA samples, the MAQC study demonstrated that the intrasite precision of the commercial whole-genome microarray platforms, measured by coefficient of variation (CV), ranged in median value from 5% to 10%1. Accuracy, on the other hand, is determined by the ability of the technology to correctly measure changes in gene expression between different samples: “If a gene expression level changes, can I correctly detect and measure it?” To best answer this second question, the MAQC consortium deliberately chose two distinct reference RNA samples with a large number of differentially expressed genes, so that the technical accuracy of the different platforms could be evaluated across many thousands of genes with a wide range of expression levels and fold changes. Moreover, the titration studies using 3:1 and 1:3 mixtures of the two distinct samples provided sensitive measures of the relative accuracy of the different microarray platforms3. In fact, these results showed that, owing to the small CV values, the commercial microarrays could reliably detect gene expression changes in >90% of the genes, with changes as small as 25% (Fig. 2 of Shippy et al.3).
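As a hypothetical illustration of how a titration design constrains relative accuracy, the sketch below checks whether the measured signal of a simulated 3:1 mixture of samples A and B agrees, gene by gene, with 0.75·A + 0.25·B to within the technical CV. This is our own toy check in Python, not the analysis of Shippy et al.3; it assumes the mixture signal is the weighted sum of the pure-sample signals (ignoring total-mRNA normalization effects), and the ~7% noise level is simply a value within the 5–10% CV range quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
N_GENES = 500

def cv(reps):
    """Per-gene coefficient of variation across technical replicates."""
    return reps.std(axis=0, ddof=1) / reps.mean(axis=0)

def measure(truth, n_rep=3, noise_cv=0.07):
    """Simulate technical replicates with ~7% CV (mid of the 5-10% range)."""
    return truth * (1 + noise_cv * rng.standard_normal((n_rep, N_GENES)))

true_a = rng.uniform(100, 1000, N_GENES)
true_b = true_a * rng.choice([0.5, 1.0, 2.0], N_GENES)  # some genes differ

reps_a, reps_b = measure(true_a), measure(true_b)
reps_c = measure(0.75 * true_a + 0.25 * true_b)         # the 3:1 titration

expected = 0.75 * reps_a.mean(axis=0) + 0.25 * reps_b.mean(axis=0)
observed = reps_c.mean(axis=0)
rel_err = np.abs(observed - expected) / expected

tol = 2 * np.median(cv(reps_c))     # allow roughly two technical CVs
print(f"median technical CV: {np.median(cv(reps_c)):.1%}")
print(f"titration-consistent genes: {(rel_err < tol).mean():.0%}")
```

The same comparison run on real pure-sample and mixture data is what makes titration a sensitive probe of accuracy: a platform with small CVs has little room to deviate from the expected weighted sum without the deviation being detectable.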

With respect to Liang’s second point, we would like to emphasize the importance and value of the common practices in analytical chemistry of using reference or standard samples for evaluating the performance of analytical instruments. The reference RNA samples were selected for the MAQC study after careful analysis of pilot data and, more importantly, were expected to be widely used by the scientific community to evaluate the technical performance of gene expression platforms and protocols. We feel that it is important to distinguish biological variability and technical noise in evaluating gene
