
Model performance, model robustness, and model fitness scores: A new method for identifying good land-surface models

Lindsey E. Gulden,1 Enrique Rosero,1 Zong-Liang Yang,1 Thorsten Wagener,2 and Guo-Yue Niu1

1 Department of Geological Sciences, University of Texas at Austin, Austin, Texas, USA.
2 Department of Civil and Environmental Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.

Received 22 February 2008; revised 17 April 2008; accepted 1 May 2008; published 14 June 2008.

[1] We introduce three metrics for rigorous evaluation of land-surface models (LSMs). This framework explicitly acknowledges perennial sources of uncertainty in LSM output. The model performance score (ζ) quantifies the likelihood that a representative model ensemble will bracket most observations and be highly skilled with low spread. The robustness score (ρ) quantifies the sensitivity of performance to parameter and/or data error. The fitness score (φ) combines performance and robustness, ranking models' suitability for broad application. We demonstrate the use of the metrics by comparing three versions of the Noah LSM. Using time-varying ζ for hypothesis testing and model development, we show that representing short-term phenological change improves Noah's simulation of surface energy partitioning and subsurface water dynamics at a semi-humid site. The least complex version of Noah is most fit for broad application. The framework and metrics presented here can significantly improve the confidence that can be placed in LSM predictions.

Citation: Gulden, L. E., E. Rosero, Z.-L. Yang, T. Wagener, and G.-Y. Niu (2008), Model performance, model robustness, and model fitness scores: A new method for identifying good land-surface models, Geophys. Res. Lett., 35, L11404, doi:10.1029/2008GL033721.

1. Introduction

[2] The increasing reliance of scientists, engineers, and policymakers on the predictions of land-surface models (LSMs) demands more rigorous evaluation of LSM parameterizations. Most LSMs are assessed using limited, localized, often semi-quantitative approaches [e.g., Chen et al., 2007]. With few exceptions, model intercomparisons and evaluations of modified parameterizations neglect an assessment of uncertainty that extends beyond simple end-member sensitivity analyses [e.g., Niu et al., 2005]. This incomplete approach is due in part to a dearth of observations and in part to evaluation procedures that are no longer state-of-the-art with respect to available computing resources. The development of robust metrics for comprehensive model evaluation is in its infancy [Randall et al., 2007]. Here, we present a simple method for increasing the rigor of LSM assessment.

[3] The simplest way to assess an LSM is to evaluate performance at a single site using default parameters [e.g., Henderson-Sellers et al., 1996]. LSM performance varies widely when parameters are shifted within reasonable ranges [e.g., Gulden et al., 2007]. At a given site, the parameter set resulting in the best performance may significantly differ from the default. That one model equipped with default parameters does better than another is likely fortuitous. Parameters tend to be effective values, not physical quantities [Wagener and Gupta, 2005]. A more thorough evaluation method is to first minimize parameter error by calibrating all models and to then compare model output generated with the best parameter set [e.g., Nijssen and Bastidas, 2005]. Using optimal parameters does not represent the way in which LSMs are generally applied; calibration against certain criteria may worsen the simulation of other, equally important criteria [Leplastrier et al., 2002]. After calibration, most equivalently complex models perform equivalently well [e.g., Beven, 2006]. Additional methods for LSM evaluation (e.g., the use of neural networks to benchmark LSMs [Abramowitz, 2005]) show promise, but to our knowledge, none has been widely adopted.

[4] Even in the rare case when we can estimate individual parameter ranges, parameter interaction and discontinuous model responses to even small shifts in parameter values undercut confidence in the realism of simulations [e.g., Gulden et al., 2007; E. Rosero et al., Evaluating enhanced hydrological representations in Noah-LSM over transition zones: Part 1. Traditional model intercomparison, submitted to Journal of Hydrometeorology, 2008a]. The dearth of extensive validation datasets makes this limitation unlikely to change soon. To assess model performance in 'real life' settings, evaluation frameworks such as the one we present here must explicitly acknowledge these sources of uncertainty. Here we treat only parameter uncertainty, but we stress that our framework can and should be applied to incorporate uncertainty in observations. The dependence of model performance on parameter and forcing error supports the use of a probabilistic approach to evaluate LSMs. To evaluate LSMs, we propose three metrics that harness the information contained in ensemble runs of an individual model. This paper introduces the metrics themselves; E. Rosero et al. (Evaluating enhanced hydrological representations in Noah-LSM over transition zones: Part 2. Ensemble-based model evaluation, submitted to Journal of Hydrometeorology, 2008b) apply the metrics presented here as part of an in-depth model intercomparison.

2. Example Application

[5] To demonstrate the new framework for LSM evaluation, we use an example application. We run three versions of the Noah LSM [Ek et al., 2003] using meteorological forcing data from the 2002 International H2O Project (IHOP) [LeMone et al., 2007] at sites covered by dry grassland (site 2), semi-arid pasture (site 4), and semi-humid grassland (site 8) (mean annual precipitation [MAP] = 540, 740, and 880 mm yr−1, respectively). The standard version of Noah ('STD') is the benchmark against which we evaluate two newer versions: one augmented with a lumped, unconfined aquifer model ('GW') [Niu et al., 2007] and a second augmented with a short-term dynamic phenology module ('DP') that allows leaf area to change in response to environmental variation on daily to seasonal time scales [Dickinson et al., 1998].

[6] We evaluate model performance against independent objectives: 3-hour running-mean evaporative fraction (EF); top 30-cm soil wetness (W30); and 24-hour change in wetness (ΔW30). We define EF as:

$$\mathrm{EF}_t = \frac{LE_t}{LE_t + H_t} \qquad (1)$$

where LE_t and H_t are, respectively, the latent and sensible heat flux averaged over 30-minute time interval t. We compute W30 as:

$$W_{30} = \sum_{i=1}^{N_{\mathrm{layers}}} \theta_i z_i \bigg/ \sum_{i=1}^{N_{\mathrm{layers}}} w_i z_i \qquad (2)$$

where θ_i, z_i, and w_i are, respectively, the volumetric soil moisture, thickness, and porosity of the ith layer of the soil column, which has N_layers layers (for the observations, N_layers = 4; for the models, N_layers = 2). We represent the 24-hour change in soil moisture as:

$$\Delta W_{30,t} = W_{30,t} - W_{30,t-47} \qquad (3)$$
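For readers who prefer code to notation, the three objectives in equations (1)–(3) can be computed directly from half-hourly series. The sketch below is illustrative only; the array shapes, function names, and lag handling are our assumptions, not the authors' code:

```python
import numpy as np

def evaporative_fraction(le, h):
    """EF_t = LE_t / (LE_t + H_t), equation (1); le and h are half-hourly flux arrays (W m-2)."""
    return le / (le + h)

def running_mean(x, window=6):
    """Running mean; window = 6 half-hourly steps gives the 3-hour-mean EF used as an objective."""
    return np.convolve(x, np.ones(window) / window, mode="same")

def top_30cm_wetness(theta, dz, porosity):
    """W30 = sum_i(theta_i z_i) / sum_i(w_i z_i), equation (2).

    theta:    (n_layers, n_times) volumetric soil moisture
    dz:       (n_layers,) layer thicknesses within the top 30 cm
    porosity: (n_layers,) layer porosities
    """
    return (theta * dz[:, None]).sum(axis=0) / (porosity * dz).sum()

def delta_w30(w30, lag=47):
    """24-hour change in wetness, equation (3): W30_t - W30_{t-47} on half-hourly steps."""
    dw = np.full_like(w30, np.nan)
    dw[lag:] = w30[lag:] - w30[:-lag]
    return dw
```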

[7] For each site, we generate two 150-member ensembles: (1) a 'calibrated ensemble,' generated using parameters defined by the Markov Chain Monte Carlo sampler of Vrugt et al. [2003] while simultaneously minimizing five RMSE objectives (LE, H, ground heat flux, 5-cm soil temperature and moisture); and (2) an 'uncalibrated ensemble,' composed of runs from a subset of 15,000 generated by random sampling of uniform independent parameter distributions; the subset was defined as the group that obtained scores within one standard deviation of the mode for each RMSE (i.e., the most frequent error) (Figure 1). In this example, model performance varies widely when parameters are selected within reasonable ranges (Figure 1a); after calibration, STD, DP, and GW perform equivalently well (Figure 1b). When generating ensembles, for simplicity, we neglected data uncertainty. All realizations used a 30-minute time step to simulate 01/01/2000–06/25/2002. We treated the first 2.5 years of simulation as model spin-up; only the last 45 days of each simulation were scored.
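A minimal sketch of one way to assemble the uncalibrated ensemble described above: score the 15,000 random-parameter runs, estimate the modal RMSE with a histogram, and keep runs that fall within one standard deviation of that mode for every objective. The histogram-based mode estimate and all names below are our assumptions about the procedure, not the authors' implementation:

```python
import numpy as np

def modal_value(x, bins=50):
    """Estimate the most frequent value of x from a histogram (one reading of the 'mode')."""
    counts, edges = np.histogram(x, bins=bins)
    k = np.argmax(counts)
    return 0.5 * (edges[k] + edges[k + 1])

def uncalibrated_subset(rmse):
    """rmse: (n_runs, n_objectives) array of RMSE scores for the random-parameter runs.

    Returns indices of runs whose RMSE lies within one standard deviation of the
    modal RMSE for every objective (our interpretation of the selection rule).
    """
    keep = np.ones(rmse.shape[0], dtype=bool)
    for j in range(rmse.shape[1]):
        mode = modal_value(rmse[:, j])
        keep &= np.abs(rmse[:, j] - mode) <= rmse[:, j].std()
    return np.flatnonzero(keep)
```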

3. Performance, Robustness, and Fitness Scores

[8] To define a metric that identifies the 'best' model or 'best' parameterization, we first define a good model. Given a representative ensemble, when the LSM is good:

[9] 1. The ensemble brackets most of the observations.

[10] 2. The ensemble is centered on the observations.

[11] 3. The ensemble has low spread when bracketing observations but high spread when not (i.e., the ensemble does not resolutely cling to an incorrect value).

[12] 4. The model is relatively insensitive to parameters that are not observable and is also not significantly affected by errors in meteorological forcing data.

Figure 1. The range of scores obtained by varying effective parameters. (a) RMSE for 5-cm volumetric soil moisture and LE for 15,000 runs of STD at IHOP site 4. Each simulation used unique parameter sets randomly selected from uniform distributions. Also shown are calibrated-model scores. (b) Taylor diagram of LE simulated by DP, STD, and GW at 3 IHOP sites (2, 4, 8). 'Best' is the set that minimizes the L2 norm of the 5 objectives.


[13] 5. The model performance is consistently good (as defined by Descriptors 1–3) across sites.

[14] Descriptors 1–3 describe a model that is well suited to a given location; 4 and 5 describe a model that is robust [Carlson and Doyle, 2002]. The ideal model for use over regional and global domains will match all five Descriptors.

[15] At time step t, the ensemble of the best-scoring model minimizes the performance score, ζ_t:

$$\zeta_t = \big(\mathrm{CDF}_{\mathrm{ens},t} - \mathrm{CDF}_{\mathrm{obs},t}\big) \big/ \big(1 - \mathrm{CDF}_{\mathrm{obs}+c}\big) \qquad \text{(4a)}$$

where CDF_ens,t, CDF_obs,t, and CDF_obs+c are, respectively, the cumulative distribution functions of the variable simulated by the ensemble of models, of the observation at time t, and of all values of ō_t shifted by arbitrary constant c (to prevent division by zero). ō_t is the mean of all realizations of the observation at time step t. When observational uncertainty is unknown, for simplicity, ζ_t can be expressed in deterministic form as:

$$\zeta_t = \sum_{i=1}^{N_{\mathrm{ens}}} \left| x_{i,t} - o_t \right| \bigg/ \sum_{i=1}^{N_{\mathrm{ens}}} \left| \bar{o} - c \right| \qquad \text{(4b)}$$

where x_i,t is ensemble member i at time t, o_t is the observation at time t, N_ens is the number of ensemble members, c is an arbitrary constant that is less than all values of o_t, and ō is the mean of the observations. ζ_t is lowest (best) when the ensemble brackets most observations, has low spread, and is centered on the observations (Figure 2). Figure 3 shows the ensemble's time-varying performance and the corresponding ζ_t. Table 1 demonstrates that ζ_t encompasses both the commonly used ensemble spread and skill [e.g., Talagrand et al., 1997].
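In its deterministic form (4b), the performance score is a one-pass computation over the ensemble matrix. A minimal sketch, assuming an (N_ens × N_times) array layout and a default choice of c below all observations (both assumptions are ours):

```python
import numpy as np

def performance_score(ensemble, obs, c=None):
    """Deterministic performance score, equation (4b); lower is better.

    ensemble: (n_members, n_times) simulated values x_{i,t}
    obs:      (n_times,) observations o_t
    c:        arbitrary constant smaller than every o_t (default: below obs.min())
    """
    if c is None:
        c = obs.min() - 0.1 * max(abs(obs.min()), 1.0)  # any value below all observations
    n_members = ensemble.shape[0]
    numerator = np.abs(ensemble - obs[None, :]).sum(axis=0)  # sum_i |x_{i,t} - o_t|
    denominator = n_members * abs(obs.mean() - c)            # sum_i |o_bar - c|
    return numerator / denominator
```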

[16] Overall insensitivity to factors that may significantly alter performance (e.g., poorly known parameters or errors in meteorological forcing data) can be expressed as:

$$\rho = \frac{\left|\,\bar{\zeta}_{e1} - \bar{\zeta}_{e2}\,\right|}{\bar{\zeta}_{e1} + \bar{\zeta}_{e2}} \qquad \text{(5)}$$

where ζ̄_e1 and ζ̄_e2 are the time means of the performance scores for the first and second ensembles, respectively. The two ensembles should significantly differ in the way(s) in which modelers wish to assess robustness. For example, to test robustness with respect to parameter variation, the ensemble members should use parameter sets that come from distinct distributions (as described above); to test robustness with respect to data error, the ensembles should differ in the type or level of noise by which their input data is perturbed.
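Given performance-score series for two contrasting ensembles (e.g., calibrated vs. uncalibrated parameters, or clean vs. perturbed forcing), equation (5) reduces to a few lines. A sketch, assuming both series cover the same scoring period:

```python
import numpy as np

def robustness_score(zeta_e1, zeta_e2):
    """Robustness, equation (5): |mean(zeta_e1) - mean(zeta_e2)| / (mean(zeta_e1) + mean(zeta_e2)).

    zeta_e1, zeta_e2: time series of performance scores from two ensembles that differ
    in the factor being tested (parameters or forcing noise). Lower rho = more robust.
    """
    m1, m2 = np.mean(zeta_e1), np.mean(zeta_e2)
    return abs(m1 - m2) / (m1 + m2)
```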

Figure 2. Distribution of EF simulated for Site 2 at 2:00 PM on Julian day 161 (2002). (a) CDFs of the calibrated and uncalibrated ensembles. (b)–(d) Performance scores. The performance score improves as the model is better described by Descriptors 1–3 (see text).


Figure 3. Relation between time-varying ensemble simulations of EF and the time-varying performance score (ζ). EF simulated by each ensemble is shown for Site 2 for Julian day 161 (2002). Black circles are observations; gray lines are individual ensemble members; white bars are the ensemble mean; gray bars are the ensemble interquartile range.

Table 1. Ensemble Spread, Skill, and Performance Score for EF Simulation at IHOP Site 2 at 2:00 PM Local Time on Julian Day 161 of 2002 (a)

            Spread (b)                   Skill (c)                    Performance Score (ζ)
        Calibrated   Uncalibrated    Calibrated   Uncalibrated    Calibrated   Uncalibrated
STD     0.00207      0.00370         0.00226      0.00470         0.114        0.154
DP      0.0105       0.0113          0.0109       6.07e-7         0.232        0.183
GW      0.00291      0.00504         9.22e-6      0.00293         0.0815       0.152

(a) See also Figure 2.
(b) Ensemble spread: $p_t = \frac{1}{N_{\mathrm{ens}}-1}\sum_{i=1}^{N_{\mathrm{ens}}}(x_{i,t}-\bar{x}_t)^2$, where x̄_t is the ensemble mean at time t, x_i,t is the ith ensemble member at time t, and N_ens is the number of ensemble members.
(c) Ensemble skill: $k_t = (\bar{x}_t - o_t)^2$, where o_t is the observed value at time t.
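The spread and skill columns in Table 1 follow the footnote definitions; a short sketch with the same assumed (N_ens × N_times) array layout as above:

```python
import numpy as np

def ensemble_spread(ensemble):
    """p_t = 1/(N_ens - 1) * sum_i (x_{i,t} - xbar_t)^2  (Table 1, footnote b)."""
    return ensemble.var(axis=0, ddof=1)

def ensemble_skill(ensemble, obs):
    """k_t = (xbar_t - o_t)^2  (Table 1, footnote c)."""
    return (ensemble.mean(axis=0) - obs) ** 2
```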


[17] Given the assumption that spatially varying characteristics of the land surface shape surface-to-atmosphere fluxes and near-surface states [e.g., Dickinson, 1995], we note that there is an inherent contradiction between Descriptors 4 and 5 above. A tradeoff exists between a model that is completely insensitive to parameter variation and one that consistently does well across sites. A 'compromise ideal' model is insensitive to the parameters that cannot be easily identified; it is at least somewhat sensitive to parameters that are physically realistic (e.g., vegetation type) when model and measurement scales are similar.

[18] For a given site and criterion, a model's overall fitness score, φ, quantifies its suitability for broad application. We define φ as:

$$\phi = \bar{\zeta}\,\rho \qquad \text{(6)}$$

where ζ̄ is the mean performance score (Equation 4) of the most representative ensemble and ρ is robustness (Equation 5). The more sites and criteria used, the more confident we can be of model performance (Descriptor 5). The best model m from a set of M models tested across N sites minimizes $\frac{1}{N}\sum_{i=1}^{N}\phi_{i,m}$. For fair cross-criterion comparison, modelers should first rank φ for a given criterion and should then compare average fitness rankings of the models.
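Equation (6) and the cross-site, cross-criterion ranking rule can be combined into a short routine. The sketch below assumes fitness scores are collected in a (models × cases) array; the ranking helper is ours, mirroring the "rank per criterion, then average ranks" recommendation in the text:

```python
import numpy as np

def fitness_score(zeta_t, rho):
    """phi = mean(zeta_t) * rho, equation (6); lower phi = fitter model."""
    return np.mean(zeta_t) * rho

def mean_fitness_rank(phi):
    """phi: (n_models, n_cases) fitness scores, one column per site/criterion pair.

    Ranks models within each column (1 = best, i.e., lowest phi; ties broken arbitrarily),
    then averages the ranks across columns for cross-criterion comparison.
    """
    ranks = np.argsort(np.argsort(phi, axis=0), axis=0) + 1  # per-column ranks, 1 = lowest phi
    return ranks.mean(axis=1)
```

The fittest model overall is then the one with the smallest site-mean fitness (or, when mixing criteria, the smallest mean rank).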

[19] Figures 4a and 4b show the utility of this framework for improving LSM physical structure. DP, which is different from benchmark model STD only in that it allows leaf area index to vary over short time scales in response to environmental variation, consistently better simulates EF at semi-humid site 8 (Figure 4a), a result that is consistent with the hypothesis that short-term phenology can alter surface energy partitioning (see Rosero et al. (submitted manuscript, 2008b) for detailed analysis). DP tends to have the best ζ_t when simulating 24-hour change in wetness (Figure 4b and Table 2). When used with other model output characteristics (e.g., anomalies, bias), ζ_t helps determine when and where the model is most likely to succeed or fail. It can also be used across criteria (Figure 4c); the combination of ζ and ρ yields an assessment of overall model fitness (Figure 4d and Table 2). Overall, GW performs best but is less robust than STD. STD is the fittest model, but tradeoffs in fitness between criteria make all models equivalently fit at the most-humid site 8 (Table 2).

4. Summary and Implications

[20] We introduce three metrics for rigorous and realistic evaluation of LSM performance within a framework that explicitly acknowledges perennial sources of LSM output uncertainty (e.g., sensitivity of output to parameters that are impossible to specify and to errors in meteorological forcing data). The model performance score (ζ) quantifies the likelihood that a representative model ensemble will be highly skilled with low spread; the robustness score (ρ) quantifies the sensitivity of model performance to changes in parameters (as shown in the example here) and/or perturbations to meteorological forcing. Our framework treats the relative insensitivity of an LSM to both parameter variability and to forcing error as beneficial characteristics in the face of the less-than-perfect settings in which LSMs are applied. The fitness score (φ) combines the concepts of good performance and robustness and is used to rank models' suitability for broad application.

Figure 4. Utility of metrics. Time-varying performance score (ζ) of STD, DP, and GW for (a) EF and (b) 24-hour change in wetness (ΔW30). (c) Model performance when simulating EF, ΔW30, and W30; the change between the calibrated and uncalibrated ensemble performance scores indicates model robustness. (d) Model fitness scores for each model, criterion, and site.


[21] We demonstrate the use of the metrics using three versions of the Noah LSM to simulate summer in Oklahoma. Our example shows that the least complex version of Noah is most fit for broad application. We use the time-varying ζ (a tool for model evaluation and development) to show that allowing leaf area index to vary on short time scales improves Noah's simulation of surface energy partitioning and subsurface water dynamics at the semi-humid site. Standard computational resources are now such that the presented framework can be applied to several models (or a single model with candidate parameterizations) and numerous flux tower sites as a means for more thorough and informative model evaluation: on a single 2.66 GHz processor, running one model for 2.5 years 15,000 times (as we did for each Monte Carlo sampling) required less than 2 hours of computing time.

[22] Researchers are often quick to assume that a model is performing well because it is more complex (i.e., has more parameters), a characteristic often equated with increased physical realism. This method for evaluation should not be used to 'prove' that one representation is more physically realistic than another. Although it is likely that improved conceptual realism will improve model performance, the converse is not necessarily valid. Regardless, the models examined here have so many degrees of freedom that it is difficult to parse whether strong model performance is the result of cancelling errors or the result of physical correctness; however, the performance and fitness scores presented here allow for hypothesis testing and model development that gives a realistic treatment to uncertainty. Because the results are obtained using ensemble simulations, we can be more confident that such an improvement is indeed the result of the altered model structure and is not the simple result of a lucky guess of parameters.

[23] Land-surface modelers are unlikely to ever know the 'right' parameters for any site on which their models are applied; they will not be able to eliminate error in meteorological forcing data. Evaluation of LSM performance must take a realistic view of these limitations. The metrics proposed here enable more objective comparison of LSM performance in a framework that more accurately represents the ways in which LSMs are applied. Use of this framework and these metrics will strengthen modelers' conclusions and will improve confidence in LSM predictions.

[24] Acknowledgments. We thank the Texas Advanced Computing Center, NOAA (grant NA07OAR4310216), the NSF, the NWS/OHD, and the Jackson School of Geosciences; we are also grateful for the insightful comments of two anonymous reviewers.

References

Abramowitz, G. (2005), Towards a benchmark for land surface models, Geophys. Res. Lett., 32, L22702, doi:10.1029/2005GL024419.

Beven, K. (2006), A manifesto for the equifinality thesis, J. Hydrol., 320, 18–36.

Carlson, J. M., and J. Doyle (2002), Complexity and robustness, Proc. Natl. Acad. Sci. U. S. A., 99, Suppl. 1, 2538–2545.

Chen, F., et al. (2007), Description and evaluation of the characteristics of the NCAR high-resolution land data assimilation system, J. Appl. Meteorol. Climatol., 46, 694–713, doi:10.1175/JAM2463.1.

Dickinson, R. E. (1995), Land-atmosphere interaction, Rev. Geophys., 33, 917–922.

Dickinson, R. E., et al. (1998), Interactive canopies for a climate model, J. Clim., 11, 2823–2836.

Ek, M. B., K. E. Mitchell, Y. Lin, E. Rogers, P. Grunmann, V. Koren, G. Gayno, and J. D. Tarpley (2003), Implementation of Noah land surface model advances in the National Centers for Environmental Prediction operational mesoscale Eta model, J. Geophys. Res., 108(D22), 8851, doi:10.1029/2002JD003296.

Gulden, L. E., E. Rosero, Z.-L. Yang, M. Rodell, C. S. Jackson, G.-Y. Niu, P. J.-F. Yeh, and J. Famiglietti (2007), Improving land-surface model hydrology: Is an explicit aquifer model better than a deeper soil profile?, Geophys. Res. Lett., 34, L09402, doi:10.1029/2007GL029804.

Henderson-Sellers, A., K. McGuffie, and A. J. Pitman (1996), The Project for Intercomparison of Land-Surface Parameterization Schemes (PILPS): 1992 to 1995, Clim. Dyn., 12, 849–859.

LeMone, M. A., et al. (2007), NCAR/CU surface, soil, and vegetation observations during the International H2O Project 2002 Field Campaign, Bull. Am. Meteorol. Soc., 88, 65–81.

Leplastrier, M., A. J. Pitman, H. Gupta, and Y. Xia (2002), Exploring the relationship between complexity and performance in a land surface model using the multicriteria method, J. Geophys. Res., 107(D20), 4443, doi:10.1029/2001JD000931.

Nijssen, B., and L. A. Bastidas (2005), Land-atmosphere models for water and energy cycle studies, in Encyclopedia of Hydrological Sciences, vol. 5, part 17, edited by M. G. Anderson, chap. 201, pp. 3089–3102, John Wiley, Hoboken, N. J.

Niu, G.-Y., Z.-L. Yang, R. E. Dickinson, and L. E. Gulden (2005), A simple TOPMODEL-based runoff parameterization (SIMTOP) for use in global climate models, J. Geophys. Res., 110, D21106, doi:10.1029/2005JD006111.

Niu, G.-Y., Z.-L. Yang, R. E. Dickinson, L. E. Gulden, and H. Su (2007), Development of a simple groundwater model for use in climate models and evaluation with Gravity Recovery and Climate Experiment data, J. Geophys. Res., 112, D07103, doi:10.1029/2006JD007522.

Randall, D. A., et al. (2007), Climate models and their evaluation, in Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, edited by S. Solomon et al., Cambridge Univ. Press, Cambridge, U. K.

Table 2. Model Fitness and Average Ranking (a)

Criterion          STD          DP           GW
Site 2
  EF               0.0085 (1)   0.0278 (3)   0.0214 (2)
  ΔW30             0.0329 (2)   0.0635 (3)   0.0155 (1)
  W30              0.0030 (1)   0.1485 (3)   0.0424 (2)
  Site-mean rank   1.33         3            1.67
Site 4
  EF               0.0041 (1)   0.0252 (3)   0.0164 (2)
  ΔW30             0.1997 (2)   0.0901 (1)   0.6172 (3)
  W30              0.0004 (1)   0.0085 (2)   0.0256 (3)
  Site-mean rank   1.33         2            2.67
Site 8
  EF               0.0346 (2)   0.0241 (1)   0.0546 (3)
  ΔW30             0.5246 (3)   0.3220 (2)   0.2905 (1)
  W30              0.0235 (1)   0.0497 (3)   0.0403 (2)
  Site-mean rank   2            2            2
Mean rank          1.55         2.33         2.11
Variance of rank   0.53         0.75         0.61

(a) The rank of the model fitness score among the cohort for each site and criterion is shown in parentheses.


Talagrand, O., R. Vautard, and B. Strauss (1997), Evaluation of probabilistic prediction systems, paper presented at ECMWF Workshop on Predictability, Eur. Cent. for Med. Range Weather Forecasts, Reading, U. K.

Vrugt, J. A., H. V. Gupta, L. A. Bastidas, W. Bouten, and S. Sorooshian (2003), Effective and efficient algorithm for multiobjective optimization of hydrologic models, Water Resour. Res., 39(8), 1214, doi:10.1029/2002WR001746.

Wagener, T., and H. V. Gupta (2005), Model identification for hydrological forecasting under uncertainty, Stochastic Environ. Res. Risk Assess., 19, 378–387.

L. E. Gulden, E. Rosero, Z.-L. Yang, and G.-Y. Niu, Department of Geological Sciences, 1 University Station C1100, University of Texas at Austin, Austin, TX 78712-0254, USA. ([email protected])

T. Wagener, Department of Civil and Environmental Engineering, The Pennsylvania State University, 212 Sackett Building, University Park, PA 16802, USA.
