A Metabolomic Data Uncertainty Budget for the Plant Arabidopsis thaliana
In “Statistics and Metabolomics,” David Banks discusses five places for collaboration between statisticians and biologists collecting and interpreting metabolomic data. Here, we illustrate the first of those: the construction of an uncertainty budget.

Our example comes from plant metabolomics. In plant metabolomics, the measurements are the same as in human metabolomics: the concentrations of cellular metabolites, usually with a molecular weight less than 500. Although it could be used in the same way the human metabolome is used (as a fingerprint for rapid identification of disease), the primary motivation for studying the plant metabolome is its usefulness to basic science. The metabolome is the intermediary between enzyme activity, which ultimately is a consequence of the plant genome, and the phenotype, the observable characteristics of individual plants.

The metabolome provides a tool for understanding the function of a gene, even if that gene has minimal or no effect on the phenotype. Using reverse genetic techniques, it is possible to create a knockout mutant, in which the DNA sequence for a specific gene is changed and the gene product is disabled. Plants with the knockout mutant are then compared to wild-type controls. The knockout sometimes kills the plant, sometimes changes the visible phenotype, and sometimes produces plants that look identical to the wild-type. When the knockout is not lethal, comparing metabolomes of knockout and wild-type individuals provides a way to discover whether the gene of interest has a function and to understand the metabolic origin of phenotypic changes.

The Data Set

The data used here are part of Geng Ding’s investigation of knockout mutants for an enzyme that degrades an amino acid. The plants being studied are Arabidopsis thaliana, a model organism widely used in plant science.
For each of two mutants, Ding has plants of three genotypes, differing in the number of copies (zero, one, or two) of the knockout DNA. These genotypes have subtle differences in the phenotype, but the differences are tiny during vegetative growth. Ding’s biological goal is to compare mean concentrations of each of 18 amino acids among the six combinations of two mutants and three genotypes, called “id’s” henceforth. Two plants of each id were grown in a homogeneous environment. At harvest, tissue from each of the 12 plants was split into two containers, yielding 24 samples. Because it wasn’t feasible to extract amino acids from all 24 samples at the same time, extraction was done in two batches. The first 12 containers, from one plant of each of the six id’s, were extracted in one batch. Then, the remaining 12 samples (and six plants) were extracted. Each of the 24 samples (6 id’s × 2 plants per id × 2 extracts per plant) was then measured.

Amino acid concentrations were measured by a gas chromatograph with a flame ionization detector (GC-FID). This detector measured the amount of carbon-containing compounds coming out of the GC every few seconds. Specific amino acids were identified by comparing their retention time in the GC to known standards. Each amino acid was quantified by integrating the signal from a distinct peak, normalizing by an internal standard, and using a calibration curve to determine the amino acid concentration. Concentrations were expressed as micromoles of amino acid per gram of plant fresh weight.

This data collection scheme, a complete block design, provides a measure of variability between extraction batches, a measure of the biological variability between plants, and measures of the differences among id’s. The variability between the two samples from the same plant includes the within-plant variability, the variability between extractions, and the variability between measurements.
The entire study was then repeated to give a total of 48 samples. The variability between repetitions provides a measure of the repeatability of the results, including long-term drift in the measuring process and the growth environment. One extract was measured twice. The two measurements for this extract provide an estimate of the variability between measurements. Because only one extract was measured twice, the second measurement is omitted from most of the analyses described here.

Philip M. Dixon and Geng Ding, Vol. 21, No. 2, 2008



Components of the Uncertainty Budget

Our goal here is to quantify components of the uncertainty budget. Each level of replication (two repetitions of the study, two extraction batches, two plants per repetition and treatment, and two extractions/measurements per plant) is a component of the uncertainty budget that can be quantified by estimating variance components. Extraction batches are nested within repetitions, id’s are crossed with extraction batches, plants are nested within extraction batches, and extractions/measurements are nested within plants. Although Banks indicates many specific reasons for sampling and measurement variability in his article, most are confounded in this study and cannot be separated. Ding’s sampling design provides an estimate of the biological variability between plants, which is crucial for the comparison of genotypes and mutants. It also provides an estimate of the variability between extracts. The single extract measured twice provides an estimate of the variability between measurements.
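The nesting and crossing just described can be made concrete by enumerating the sampling layout. The following is a minimal Python sketch, not the authors' code; the mutant labels "m1" and "m2" and all field names are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import product

REPLICATES = (1, 2)      # two repetitions of the whole study
BATCHES = (1, 2)         # extraction batches, nested within each replicate
MUTANTS = ("m1", "m2")   # hypothetical labels for the two mutants
COPIES = (0, 1, 2)       # copies of the knockout DNA (three genotypes)
EXTRACTS = (1, 2)        # two extracts (containers) per plant


def build_design():
    """List one record per extract for the 48-sample layout."""
    rows = []
    for rep, batch, mutant, copies, extract in product(
        REPLICATES, BATCHES, MUTANTS, COPIES, EXTRACTS
    ):
        rows.append(
            {
                "replicate": rep,
                "batch": (rep, batch),                   # batch nested in replicate
                "id": (mutant, copies),                  # id crossed with batch
                "plant": (rep, batch, mutant, copies),   # one plant per id per batch
                "extract": extract,                      # extract nested in plant
            }
        )
    return rows


design = build_design()
n_samples = len(design)
n_plants = len({row["plant"] for row in design})
n_batches = len({row["batch"] for row in design})
# Every batch should contain every id, which is what makes the design blocked.
ids_per_batch = Counter((row["batch"], row["id"]) for row in design)
```

Under these assumptions the layout has 48 extracts from 24 plants in 4 extraction batches, and each of the six id’s appears in every batch.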

Many designs to estimate variance components use only nested sampling. One example would be a design that grows plants of one genotype in three pots. Three plants are individually harvested and extracted from each pot, and then each extract is measured twice. Ding’s design introduces crossed effects because of the blocking by replicate and extraction batch. This blocking provides more precise estimates of differences between genotypes and mutants, but it complicates the analysis of variance components. We will first illustrate a typical analysis of nested effects by considering data from only one id. Then, we will illustrate the analysis of the entire data set. These are illustrated using data for one of the 18 amino acids measured by Ding, Threonine.

A Model for Nested Random Effects

The data for a single id include only one plant per extraction batch, so there are three nested random effects: between repetitions, between plants in a repetition, and between confounded extracts and measurements. One commonly used model for nested random effects can be written as:

$$Y_{ijk} = \mu + \alpha_i + \beta_{ij} + \gamma_{ijk}. \qquad (1)$$

The concentration of Threonine in replicate $i$, plant $j$, and measurement $k$ is denoted $Y_{ijk}$. The overall mean Threonine concentration is denoted $\mu$. The deviation from the mean associated with replicate $i$ is $\alpha_i$. The deviation from the mean of replicate $i$ associated with plant $ij$ is $\beta_{ij}$. Within plant $ij$, the deviation of measurement $ijk$ about the plant mean is $\gamma_{ijk}$. The terms $\alpha_i$, $\beta_{ij}$, and $\gamma_{ijk}$ are considered random effects when the goal of the analysis is to estimate the magnitude of their variability. It is common to assume all random effects are independent normal random variables with constant variance: $\alpha_i \sim N(0, \sigma^2_{\text{rep}})$, $\beta_{ij} \sim N(0, \sigma^2_{\text{plant}})$, and $\gamma_{ijk} \sim N(0, \sigma^2_{\text{meas}})$.

The variance between observations from a randomly chosen replicate, plant, and measurement is the sum of the three variance components, $\mathrm{Var}\,Y_{ijk} = \sigma^2_{\text{rep}} + \sigma^2_{\text{plant}} + \sigma^2_{\text{meas}}$. In this sense, the variance components partition the random variation among observations into components associated with each source of uncertainty.

The data for one id are shown in Figure 1. The two replicate averages are similar, the four plant averages are quite different (even when compared within the replicate), and the measurements from the same plant are very similar. The pooled variance between measurements on the same plant estimates $\sigma^2_{\text{meas}}$, but the pooled variance between plant averages ($\bar{Y}_{ij\cdot}$), using dots as subscripts to indicate averaging (i.e., $\bar{Y}_{ij\cdot} = (Y_{ij1} + Y_{ij2})/2$), overestimates $\sigma^2_{\text{plant}}$. This is because, within a replicate (i.e., conditional on $\alpha_i$), the variance between plant averages, $\mathrm{Var}\,\bar{Y}_{ij\cdot}$, is $\sigma^2_{\text{plant}} + \sigma^2_{\text{meas}}/2$, which is larger than $\sigma^2_{\text{plant}}$ if $\sigma^2_{\text{meas}} > 0$. Similarly, the variance between replicate averages, $\mathrm{Var}\,\bar{Y}_{i\cdot\cdot} = \sigma^2_{\text{rep}} + \sigma^2_{\text{plant}}/2 + \sigma^2_{\text{meas}}/4$, overestimates $\sigma^2_{\text{rep}}$.

[Figure 1: dot plot of THR concentration (nm/mg), in three columns labeled Replicate, Plants, and Measurements.]

Figure 1. Variability between replicates, plants within replicates, and measurements within plants for one id. The two dots labeled “Replicate” are the averages for replicate 1 and replicate 2. The four dots labeled “Plants” are the averages for the two plants from each replicate, sorted by replicate average and labeled by the replicate number. The eight dots labeled “Measurements” are the measurements from each plant, sorted by plant average and labeled by replicate and plant number. Each column average is indicated by a dash.

Table 1: ANOVA Table and Expected Mean Squares for Data From a Single Id

Source                       d.f.   Sum-of-Squares   Mean Square   Expected Mean Square
Replicates                   1      0.004857         0.004857      $\sigma^2_{\text{meas}} + 2\sigma^2_{\text{plant}} + 4\sigma^2_{\text{rep}}$
Plants(Reps)                 2      0.109532         0.054766      $\sigma^2_{\text{meas}} + 2\sigma^2_{\text{plant}}$
Measurements(Plants, Reps)   4      0.001480         0.000370      $\sigma^2_{\text{meas}}$
Corrected Total              7      0.115869
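The claim that averages overestimate the variance components can be checked by simulation. The sketch below draws data from model (1) with many replicates and compares the observed variances of plant and replicate averages with the formulas in the text. The variance component values are arbitrary illustrative choices, not estimates from Ding’s data, and the function name is hypothetical.

```python
import numpy as np


def simulate_nested(n_rep, n_plant, n_meas, var_rep, var_plant, var_meas,
                    mu=0.0, seed=0):
    """Draw Y[i, j, k] = mu + alpha_i + beta_ij + gamma_ijk from model (1)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, np.sqrt(var_rep), size=(n_rep, 1, 1))
    beta = rng.normal(0.0, np.sqrt(var_plant), size=(n_rep, n_plant, 1))
    gamma = rng.normal(0.0, np.sqrt(var_meas), size=(n_rep, n_plant, n_meas))
    return mu + alpha + beta + gamma  # broadcasts to (n_rep, n_plant, n_meas)


# Illustrative (assumed) variance components; many replicates so that the
# simulated variances are close to their expected values.
VAR_REP, VAR_PLANT, VAR_MEAS = 0.01, 0.03, 0.004
y = simulate_nested(20000, 2, 2, VAR_REP, VAR_PLANT, VAR_MEAS)

# Variance between the two plant averages within each replicate, pooled.
plant_means = y.mean(axis=2)
within_rep_var = plant_means.var(axis=1, ddof=1).mean()

# Variance between replicate averages.
rep_means = y.mean(axis=(1, 2))
between_rep_var = rep_means.var(ddof=1)

expected_plant = VAR_PLANT + VAR_MEAS / 2                 # 0.032, not 0.03
expected_rep = VAR_REP + VAR_PLANT / 2 + VAR_MEAS / 4     # 0.026, not 0.01
```

In this sketch both simulated variances land near the inflated values rather than near the variance components themselves, which is why the mean squares have to be corrected before they can serve as estimates.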

Estimators of Variance Components

Although there are many estimators of the variance components ($\sigma^2_{\text{rep}}$, $\sigma^2_{\text{plant}}$, and $\sigma^2_{\text{meas}}$), the two most commonly used are the ANOVA and REML estimators. The ANOVA, or method-of-moments, estimator starts with an ANOVA table quantifying the observed variability for each component. The variance components are estimated by equating the observed mean squares to their expected values (the expected mean squares) and solving for the variance components. For the Threonine data in Figure 1, the ANOVA table and expected mean squares are given in Table 1.

The estimated variance components are $\hat{\sigma}^2_{\text{meas}} = 0.00037$, $\hat{\sigma}^2_{\text{plant}} = 0.027$, and $\hat{\sigma}^2_{\text{rep}} = -0.012$. The variance component for plants is much larger than that for measurements, consistent with the pattern in Figure 1. The negative estimate for replicates is disconcerting, since a variance must be non-negative. Negative ANOVA estimates often occur when the parameter is close to zero, when the degrees of freedom for the effect are small, when there are outliers, or when the model is wrong. However, ANOVA estimates are unbiased when the model is correct and robust to the assumption of normality because they are computed from variances only.
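The method-of-moments step can be written out as a few lines of arithmetic. This is a minimal sketch for the balanced nested layout of Table 1 (two plants per replicate, two measurements per plant); the function name and argument names are hypothetical.

```python
def anova_nested_components(ms_rep, ms_plant, ms_meas, n_plant=2, n_meas=2):
    """Solve the expected-mean-square equations of Table 1.

    E[MS_meas]  = var_meas
    E[MS_plant] = var_meas + n_meas * var_plant
    E[MS_rep]   = var_meas + n_meas * var_plant + n_meas * n_plant * var_rep
    """
    var_meas = ms_meas
    var_plant = (ms_plant - ms_meas) / n_meas
    var_rep = (ms_rep - ms_plant) / (n_meas * n_plant)
    return {"meas": var_meas, "plant": var_plant, "rep": var_rep}


# Mean squares taken from Table 1.
estimates = anova_nested_components(
    ms_rep=0.004857, ms_plant=0.054766, ms_meas=0.000370
)
```

With the Table 1 mean squares this gives roughly 0.00037, 0.027, and -0.012, matching the values quoted in the text. Nothing in the arithmetic prevents the replicate estimate from being negative, because it is a difference of two mean squares.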

REML, restricted maximum likelihood, estimates are always non-negative because the estimates are constrained to lie within the parameter space for a variance. REML differs from standard maximum likelihood (ML) in correctly accounting for the estimation of any fixed effects. As a simple example, if $Y_1, \ldots, Y_n$ are independent $N(\mu, \sigma^2)$, the ML estimator of the variance of a single sample, $\sum (Y_i - \bar{Y})^2 / n$, is biased. The REML estimator, $\sum (Y_i - \bar{Y})^2 / (n - 1)$, is the usual unbiased variance estimator. However, when data have multiple levels of variation, REML estimates of variance components are often biased. The bias arises for two reasons: the constraint that an estimate is non-negative and the adjustment to other variance components that occurs when a negative ANOVA estimate is shifted to zero. For example, the REML estimates for the data in Figure 1 are $\hat{\sigma}^2_{\text{meas}} = 0.00037$, $\hat{\sigma}^2_{\text{plant}} = 0.019$, and $\hat{\sigma}^2_{\text{rep}} = 0$. The replicate variance is estimated as zero, but that forces a shift in the plant-plant variance component (from 0.027 to 0.019). However, the replicate variance is estimated from only two replicates (one degree of freedom), so one should expect a poor estimate.
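Both points in this paragraph can be illustrated numerically. The first function below contrasts the divisors n and n - 1 for a single sample. The second shows one way to see where the shifted plant estimate of 0.019 may come from: it assumes that, in this balanced nested layout, setting the replicate variance to zero amounts to pooling the replicate and plant sums of squares before solving for the plant component. That pooling rule is an assumption of this sketch rather than something stated in the article, and the toy sample and function names are hypothetical.

```python
import numpy as np


def ml_and_reml_variance(y):
    """Single-sample variance with divisor n (ML) and n - 1 (REML)."""
    y = np.asarray(y, dtype=float)
    return y.var(ddof=0), y.var(ddof=1)


def pooled_plant_variance(ss_rep, df_rep, ss_plant, df_plant, ms_meas, n_meas=2):
    """Plant variance when the replicate variance is fixed at zero.

    Assumption: replicate and plant sums of squares are pooled into one
    mean square whose expectation is var_meas + n_meas * var_plant.
    """
    pooled_ms = (ss_rep + ss_plant) / (df_rep + df_plant)
    return (pooled_ms - ms_meas) / n_meas


# Toy sample (hypothetical values) for the single-sample comparison.
ml_var, reml_var = ml_and_reml_variance([1.0, 2.0, 3.0, 4.0])

# Sums of squares and degrees of freedom from Table 1.
plant_var_at_boundary = pooled_plant_variance(
    ss_rep=0.004857, df_rep=1, ss_plant=0.109532, df_plant=2, ms_meas=0.000370
)
```

Under the pooling assumption the plant component comes out at about 0.019, in line with the REML value quoted above.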

There is no consensus among statisticians as to which estimator is better. I prefer the ANOVA estimates because they are less dependent on a model and because estimates at one level are not adjusted because of insufficient data at another level. Others prefer REML estimators.

The previous analysis uses only one-sixth of the data: four plants and eight measurements. The entire data set includes 24 plants and 48 measurements. Pooled estimates of variance components using all the data will be more precise, which may eliminate the problem of a negative estimated variance component, provided it is reasonable to assume variance components are the same for all id’s. We will separately consider the measurement variance and the plant-plant variance.

Characteristics of the Measurement Variance

The assumption of equal measurement variance is easy to assess using a plot of the average of the two measurements per plant against the standard deviation of those two measurements (Figure 2). There is a lot of variability because each standard deviation is computed from two measurements, but it is clear the measurement standard deviation tends to increase with the average. When this happens, using log Y instead of Y often equalizes the variances. As Banks indicates in his article, metabolomic data are usually log-transformed because of the biological focus on ratios naturally expressed on a log scale.
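The quantities plotted in this kind of diagnostic are simple to compute. The sketch below takes pairs of measurements, one pair per plant, and returns the average and standard deviation of each pair, optionally after a log transformation. The three pairs are made-up numbers with a constant ratio between duplicates, chosen so the effect of the transformation is easy to see; they are not Ding’s data.

```python
import numpy as np


def pair_mean_sd(pairs, log_transform=False):
    """Average and s.d. (one d.f.) of the two measurements in each row."""
    pairs = np.asarray(pairs, dtype=float)
    if log_transform:
        pairs = np.log(pairs)
    return pairs.mean(axis=1), pairs.std(axis=1, ddof=1)


# Hypothetical duplicate measurements; the second is always 4% above the first.
pairs = [[0.50, 0.52], [1.00, 1.04], [2.00, 2.08]]

raw_mean, raw_sd = pair_mean_sd(pairs)
log_mean, log_sd = pair_mean_sd(pairs, log_transform=True)
```

For these made-up pairs the raw standard deviation grows in proportion to the average, while the standard deviation of the log values is the same for every pair. Real data will be much noisier because each standard deviation has a single degree of freedom.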

Figure 2. Plot of the standard deviation (s.d.) and average of the two measurements per plant. Both X and Y axes are log scaled. [Axes: average of measurements (x), s.d. of measurements (y).]

Figure 3. Plot of the standard deviation and average of the log-transformed measurements per plant. The y-axis is log scaled; the x-axis is not because some averages are less than zero. [Axes: average of log-transformed measurements (x), s.d. of log-transformed measurements (y).]

Figure 4. Plot of the plant-plant standard deviation (s.d.) and average, after log transforming the measurements. [Axes: plant average, log(Y) (x); plant s.d., log(Y) (y).]

Figure 5. Plot of the plant-plant standard deviation (s.d.) and average, after using a 1/Y transformation of each measurement. [Axes: plant average, 1/Y (x); plant s.d., 1/Y (y).]

Table 2: ANOVA Table and Expected Mean Squares for Data From All Six Id’s

Source            d.f.   Sum-of-Squares   Mean Square   Expected Mean Square
Replicates        1      4.077            4.0770        $\sigma^2_{\text{meas}} + 2\sigma^2_{\text{plant}} + 12\sigma^2_{\text{batch}} + 24\sigma^2_{\text{rep}}$
Extraction        2      0.428            0.2138        $\sigma^2_{\text{meas}} + 2\sigma^2_{\text{plant}} + 12\sigma^2_{\text{batch}}$
Id                5      1.922            0.3842        $\sigma^2_{\text{meas}} + 2\sigma^2_{\text{plant}} + 8\sum \delta_k^2 / 5$
Plants            15     2.908            0.1939        $\sigma^2_{\text{meas}} + 2\sigma^2_{\text{plant}}$
Measurements      24     0.540            0.0225        $\sigma^2_{\text{meas}}$
Corrected Total   47

The Threonine data illustrate another reason for a transformation: to equalize variances.

A useful characteristic of a random variable with a lognormal distribution is that the coefficient of variation is a function of the log-scale variance. If $\log Y \sim N(\mu, \sigma^2)$, then the mean and the variance of the untransformed $Y$ are $E\,Y = e^{\mu + \sigma^2/2}$ and $\mathrm{Var}\,Y = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}$, so the coefficient of variation is

$$\frac{\sqrt{\mathrm{Var}\,Y}}{E\,Y} = \sqrt{e^{\sigma^2} - 1}.$$

Hence, assuming a constant variance on the log scale is equivalent to assuming a constant coefficient of variation for the untransformed values.
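The step from the mean and variance to the coefficient of variation is short enough to write out. Squaring the mean gives $(E\,Y)^2 = e^{2\mu + \sigma^2}$, which is a common factor of both terms in the variance, so $\mu$ cancels:

$$
\begin{aligned}
\frac{\mathrm{Var}\,Y}{(E\,Y)^2}
  &= \frac{e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}}{e^{2\mu + \sigma^2}}
   = e^{\sigma^2} - 1, \\
\frac{\sqrt{\mathrm{Var}\,Y}}{E\,Y}
  &= \sqrt{e^{\sigma^2} - 1} \approx \sigma \quad \text{when } \sigma^2 \text{ is small.}
\end{aligned}
$$

The final approximation uses $e^{\sigma^2} \approx 1 + \sigma^2$ for small $\sigma^2$, so for small log-scale variances the coefficient of variation is roughly the log-scale standard deviation.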

After using a transformation, one should check that it worked as intended. This can be done by plotting the average and standard deviation of the two log-transformed measurements per plant (Figure 3). While there is much less pattern after the transformation, there is still a tendency for the standard deviation to increase with the mean. A stronger transformation in the Box-Cox family, perhaps 1/Y, would do a better job of equalizing the measurement variances for this specific data set. However, a transformation of Y affects all aspects of the model. Before making a final choice, it would be good to assess the characteristics of the plant-plant variability.

Characteristics of Plant-Plant Variation

It is harder to assess the characteristics of plant-plant variability (or any variability other than the residual variation) because the plant-plant variation is not directly observed. The only direct information about characteristics of the plant-plant variation comes from averages of the two measurements for each plant. Because these are averages of measurements, characteristics of the plant-plant variation are confounded with those of the measurement variation.


Table 3: Standard Error (s.e.) of the Difference of Two Treatment Means for Different Choices of Sample Size, Assuming $\sigma^2_{\text{plant}} = 0.086$, $\sigma^2_{\text{extract}} = 0.022$, and $\sigma^2_{\text{tech}} = 0.00034$

Number of Plants   Extracts per Plant   Measurements per Extract   s.e. of Difference
4                  2                    1                          0.220
4                  2                    10                         0.220
4                  4                    1                          0.214
8                  2                    1                          0.156

Two approaches can be used to investigate plant-plant variation. One is to assume a model and, based on that model, calculate the best linear unbiased predictor (BLUP) of each random effect (i.e., predict the random effect, $\beta_{ij}$, associated with plant $ij$). The other is to ignore the measurement variability and use traditional diagnostics to evaluate the averages for each plant. The second approach is reasonable when the contribution of the measurement variance is approximately the same for all plants. This is the case here for log-transformed data, so we use plant averages to investigate the plant-plant variability.

If observations are log transformed, the standard deviation (s.d.) between plant averages is approximately constant (Figure 4). But, if observations are transformed using the stronger 1/Y transformation, the s.d. between plant averages is clearly not constant (Figure 5). Hence, the analysis will use a log transformation because it provides an approximately constant measurement variance and a constant plant-plant variance.

A Model for All Observations

The model for all 48 observations is then:

$$\log Y_{ijkl} = \mu + \alpha_i + \theta_{ij} + \delta_k + \beta_{ijk} + \gamma_{ijkl}. \qquad (2)$$

The Threonine concentration in replicate $i$, extraction batch $j$, id $k$, and measurement $l$ is $Y_{ijkl}$. The overall mean Threonine concentration is denoted $\mu$. The deviation from the mean associated with replicate $i$ is $\alpha_i$. The deviation from the mean of replicate $i$ associated with extraction batch $ij$ is $\theta_{ij}$. The deviation from the mean associated with id $k$ is $\delta_k$. The deviation associated with each plant $ijk$ for each id $k$ in extraction batch $ij$ is $\beta_{ijk}$. Within each plant $ijk$, $\gamma_{ijkl}$ is the deviation of the observation $ijkl$ from the plant mean. The variability described by the $\gamma_{ijkl}$ includes the variability among extracts and variability among measurements because there is only one measurement per extract in the 48-observation data set. All the random effects are assumed to be independent and normally distributed. Each source of variation has its own variance component: $\alpha_i \sim N(0, \sigma^2_{\text{rep}})$, $\theta_{ij} \sim N(0, \sigma^2_{\text{batch}})$, $\beta_{ijk} \sim N(0, \sigma^2_{\text{plant}})$, and $\gamma_{ijkl} \sim N(0, \sigma^2_{\text{meas}})$.

Fitting model (2) to the Threonine data gives the ANOVA table in Table 2. The estimated variance components are $\hat{\sigma}^2_{\text{rep}} = 0.16$, $\hat{\sigma}^2_{\text{batch}} = 0.0017$, $\hat{\sigma}^2_{\text{plant}} = 0.086$, and $\hat{\sigma}^2_{\text{meas}} = 0.022$. The REML estimates of the variance components, in this case, are exactly the same because the data are balanced and all estimated variance components are positive.
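As a check on these values, the ANOVA estimates can be recovered by equating each mean square in Table 2 to its expected mean square and solving from the bottom of the table upward. The labels $MS_{\text{meas}}$, $MS_{\text{plant}}$, $MS_{\text{batch}}$, and $MS_{\text{rep}}$ are shorthand introduced here for the Measurements, Plants, Extraction, and Replicates rows:

$$
\begin{aligned}
\hat{\sigma}^2_{\text{meas}} &= MS_{\text{meas}} = 0.0225, \\
\hat{\sigma}^2_{\text{plant}} &= \frac{MS_{\text{plant}} - MS_{\text{meas}}}{2} = \frac{0.1939 - 0.0225}{2} \approx 0.0857, \\
\hat{\sigma}^2_{\text{batch}} &= \frac{MS_{\text{batch}} - MS_{\text{plant}}}{12} = \frac{0.2138 - 0.1939}{12} \approx 0.0017, \\
\hat{\sigma}^2_{\text{rep}} &= \frac{MS_{\text{rep}} - MS_{\text{batch}}}{24} = \frac{4.0770 - 0.2138}{24} \approx 0.161.
\end{aligned}
$$

These agree with the rounded estimates quoted in the text. The Id row is not used because the id effects $\delta_k$ are fixed, not random.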

Estimating the Variability Between Measurements of the Same Extract

Ding re-measured one of the 48 extracts used in the above analysis. The two measurements are 2.244 and 2.187. The variance of these values estimates the technical measurement variance (i.e., the variability between measurements made on the same extract). Using log-transformed values, this is $\hat{\sigma}^2_{\text{tech}} = 0.00034$, which is two orders of magnitude less than the combination of measurement and extraction variability. Given an estimate of the technical measurement variance, it is possible to estimate the contribution to the error due to extraction. Because each of the 48 extracts in the original data set was measured once, $\sigma^2_{\text{meas}} = \sigma^2_{\text{tech}} + \sigma^2_{\text{extract}}$, where $\sigma^2_{\text{extract}}$ is the variance component between extracts of the same plant. The estimated variance component is $\hat{\sigma}^2_{\text{extract}} = \hat{\sigma}^2_{\text{meas}} - \hat{\sigma}^2_{\text{tech}} = 0.022 - 0.00034 = 0.022$. Although $\hat{\sigma}^2_{\text{tech}}$ is not precise because it is a one degree of freedom (d.f.) estimate, it is clear that essentially all the variability between measurements is due to variability between different extractions of a single plant. Almost none of the variability comes from the instrument measurement.
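The two calculations in this section are small enough to reproduce directly. The sketch below computes the one-degree-of-freedom variance of the two log-transformed measurements and then subtracts the published technical variance from the measurement mean square in Table 2. Function and variable names are hypothetical.

```python
import math


def log_scale_variance_of_pair(y1, y2):
    """Sample variance (divisor n - 1 = 1) of log(y1) and log(y2).

    For two values the sample variance reduces to d**2 / 2, where d is
    the difference between them.
    """
    d = math.log(y1) - math.log(y2)
    return d * d / 2.0


# The two repeat measurements of one extract quoted in the text.
var_tech_from_pair = log_scale_variance_of_pair(2.244, 2.187)

# Extraction component: measurement mean square from Table 2 minus the
# published technical variance.
MS_MEAS = 0.0225
VAR_TECH_PUBLISHED = 0.00034
var_extract = MS_MEAS - VAR_TECH_PUBLISHED
```

With the measurements as printed (three decimal places) the pair variance comes out near 0.00033, which agrees with the published 0.00034 to within the rounding of the printed values. The subtraction leaves the extraction component at about 0.022, unchanged to two significant figures.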


The Uncertainty Budget

Consistent with the earlier results for one id, the biological variance between plants is about four times larger than the variance between extractions and two orders of magnitude larger than the technical variability between measurements. The variability between extraction batches is small, but the variability between the two replicates of the study is surprisingly large. The data indicate why the replicate variance component is so large. The average Threonine concentrations are 0.64 and 0.76 nm/mg for the two extractions in the first replicate and 1.23 and 1.52 nm/mg for the two extractions in the second replicate. The large variance component between replicates makes sense, but the biological reasons for such a large variation are, as of yet, unknown.

Since the model assumes log-transformed values are normally distributed, the variance components can be converted into coefficients of variation for each component of error, as described previously. The technical measurement c.v. is $\sqrt{\exp(0.00034) - 1} = 1.8\%$, the extraction c.v. is 15.1%, the plant-plant c.v. is 29.9%, the batch c.v. is 4.1%, and the replicate c.v. is 42%.
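The conversion from a log-scale variance to a coefficient of variation can be applied to every component at once. The sketch below uses the rounded variance components quoted in the text, so its results can differ from the published percentages in the last digit; the published values were presumably computed from unrounded estimates. The function name and dictionary keys are hypothetical.

```python
import math


def cv_from_log_variance(var_log):
    """Coefficient of variation of Y when log Y is normal with this variance."""
    return math.sqrt(math.exp(var_log) - 1.0)


# Rounded log-scale variance components quoted in the text.
components = {
    "technical": 0.00034,
    "extraction": 0.022,
    "plant": 0.086,
    "batch": 0.0017,
    "replicate": 0.16,
}

cv = {name: cv_from_log_variance(v) for name, v in components.items()}

for name, value in cv.items():
    print(f"{name:>10s}: c.v. = {100 * value:.1f}%")
```

Each result agrees with the corresponding percentage in the text to within about half a percentage point.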

The uncertainty budget and estimated variance components provide useful information for designing subsequent studies. The goal of Ding’s work is to compare metabolite concentrations among genotypes and mutants. Blocking by extraction and replicate (i.e., measuring all id’s [combinations of genotypes and mutants] in the same extraction and same replicate) increases the precision of comparisons among id’s. When the average metabolite concentration is calculated from $r$ replicates, $b$ batches per replicate, $e$ extractions per plant, and $m$ measurements per extract, the variance of the average difference between two id’s is:

$$\mathrm{Var}\left(\bar{Y}_{\cdot\cdot 1\cdot} - \bar{Y}_{\cdot\cdot 2\cdot}\right) = 2\left(\frac{\sigma^2_{\text{plant}}}{rb} + \frac{\sigma^2_{\text{extract}}}{rbe} + \frac{\sigma^2_{\text{tech}}}{rbem}\right).$$

When comparisons are made within blocks, neither the replicate nor batch variances contribute to the variance of the difference. The only variance components that matter are those for plants, extracts, and measurements.

Increasing the number of plants, by increasing either the number of replicates, $r$, or the number of extraction batches, $b$, decreases the contribution of all three variance components, $\sigma^2_{\text{plant}}$, $\sigma^2_{\text{extract}}$, and $\sigma^2_{\text{tech}}$. This effect is sometimes called hidden replication because increasing the number of plants also increases the numbers of extracts and measurements. An alternative is to retain the same number of plants, but increase the number of extracts or measurements per plant. Assuming the variance components estimated from these data apply to a new study, the expected precision can be calculated for various combinations of the number of plants, the number of extractions per plant, and the number of measurements per extract (Table 3).

Because the technical measurement variance is so small, relative to the other sources of variability, increasing the number of measurements per extract tenfold has essentially no effect on the precision. Doubling the number of extracts per plant leads to a small increase in precision, but doubling the number of plants markedly increases the precision of the difference. The general advice for designing a study with multiple sources of error would be to replicate “as high up as possible.” In this study, that means increasing the number of plants, since the plant-plant variance is the largest of the three components that affect the comparison.
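The entries in Table 3 follow from the variance formula for the difference of two id means, with the number of plants per id playing the role of $rb$. The sketch below is one way to reproduce them; the function and argument names are hypothetical, and the default variance components are the rounded values from the Table 3 caption.

```python
import math


def se_difference(n_plants, n_extracts, n_meas,
                  var_plant=0.086, var_extract=0.022, var_tech=0.00034):
    """Standard error of the difference between two id means.

    n_plants   : plants per id (r * b in the text)
    n_extracts : extracts per plant (e)
    n_meas     : measurements per extract (m)
    """
    var_mean = (
        var_plant / n_plants
        + var_extract / (n_plants * n_extracts)
        + var_tech / (n_plants * n_extracts * n_meas)
    )
    # The difference of two independent means has twice the variance of one.
    return math.sqrt(2.0 * var_mean)


# The four sample-size choices listed in Table 3.
designs = [(4, 2, 1), (4, 2, 10), (4, 4, 1), (8, 2, 1)]
table3 = {d: round(se_difference(*d), 3) for d in designs}

for (plants, extracts, meas), se in table3.items():
    print(f"plants={plants} extracts={extracts} meas={meas}: s.e. = {se:.3f}")
```

A function like this makes it easy to try other allocations, for example to compare the gain from a ninth plant per id with the gain from a third extract per plant, under the assumption that the estimated variance components carry over to the new study.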

Final Thoughts

Plant metabolomics has given us new biological data for studying the relationship between genotype and phenotype, thereby learning about basic scientific processes. Using data from one metabolite, we have explored the characteristics of measurement and plant-plant variability, constructed an uncertainty budget, and used the estimated variance components to evaluate design choices. We found that the biological variability between plants is larger than the variability between extractions, and considerably larger than the variability between measurements of the same extract. Similar sorts of evaluations are possible whenever there are replicated observations for each important source of variability, but the details of the statistical model will depend on the experimental design (i.e., whether random effects are crossed or nested). Estimating variance components and identifying the important parts of the uncertainty budget help design more precise and cost-effective studies.

Further Reading

Variance components analysis is described in many intermediate-level applied statistics books. Two of many good chapter-length treatments are in Angela M. Dean and Daniel Voss’ Design and Analysis of Experiments and George E. P. Box, J. Stuart Hunter, and William G. Hunter’s Statistics for Experimenters. Details and many extensions of what has been described here are presented in Shayle R. Searle, George Casella, and Charles E. McCulloch’s book, Variance Components, and D. R. Cox and P. J. Solomon’s book, Components of Variance.