A framework for benchmarking land models

Land models, which have been developed by the modeling community in the past few decades to predict future states of ecosystems and climate, have to be critically evaluated for their performance skills in simulating ecosystem responses and feedback to climate change. Benchmarking is an emerging procedure to measure performance of models against a set of defined standards. This paper proposes a benchmarking framework for evaluation of land model performance and, meanwhile, highlights major challenges at this infant stage of benchmark analysis. The framework includes (1) targeted aspects of model performance to be evaluated, (2) a set of benchmarks as defined references to test model performance, (3) metrics to measure and compare performance skills among models so as to identify model strengths and deficiencies, and (4) model improvement. The first challenge is to define a set of benchmarks that effectively evaluate land model performance. The second challenge is to develop metrics for measuring mismatches between models and benchmarks. The metrics may include (1) a priori thresholds of acceptable model performance and (2) a scoring system to combine data-model mismatches for various processes at different temporal and spatial scales. The benchmark analyses should identify clues of weak model performance to guide future development, thus enabling improved predictions of future states of ecosystems and climate. The near-future research effort should be on development of a set of widely acceptable benchmarks that can be used to objectively, effectively, and reliably evaluate fundamental properties of land models to improve their prediction performance skills.


Y. Q. Luo et al.: A framework for benchmarking land models

Introduction
Over the past two decades, tremendous progress has been achieved in the development of land models and their inclusion in Earth system models (ESMs). State-of-the-art land models now account for biophysical processes (exchanges of water and energy) and biogeochemical cycles of carbon, nitrogen, and trace gases (Oleson, 2010; Wang et al., 2010). They also simulate vegetation dynamics (Sitch et al., 2003) and disturbances (Thonicke et al., 2010). When coupled to ESMs, land models now allow simulation of land-atmosphere physical interactions (Bonan, 2008) and climate-carbon feedbacks (Bonan and Levis, 2010; Friedlingstein et al., 2006). These models are now widely used for policy-relevant assessment of climate change and its impact on ecosystems or terrestrial resources, and more recently of allowable anthropogenic CO2 emissions compatible with a given concentration pathway (Arora et al., 2011). However, there is still very limited knowledge of the performance skills of these land models, especially when embedded in ESMs. Without quantification of the performance skills of land models, their predictions of future states of ecosystems and climate cannot be widely accepted.
Model performance has traditionally been evaluated via comparison with common knowledge, observed data sets, and other models. "Validation" against observed data is the most common approach to model evaluation (Oreskes, 2003; Rykiel, 1996). However, a land model typically simulates hundreds or thousands of biophysical, biogeochemical, and ecological processes at regional and global scales over hundreds of years. It would be unrealistic to expect validation of so many processes at all spatial and temporal scales independently, even if observations were available. The complex behavior of these interacting processes can only be realistically understood if we holistically assess land models and their major components. As a consequence, there have been many international model intercomparison projects. For example, the Project for Intercomparison of Land surface Parameterization Schemes (PILPS) focused on simulation of the water and energy balance (Pitman, 2003). The Carbon Cycle Model Linkage Project (CCMLP) evaluated simulation of the terrestrial carbon cycle (McGuire et al., 2001). The Coupled Carbon Cycle Climate Model Intercomparison Project (C4MIP) compared simulation of the climate-carbon cycle coupling among 11 models (Friedlingstein et al., 2006). Nevertheless, there have been very few, if any, attempts to systematically evaluate land models against data from a range of observation networks and experiments in a comprehensive, objective, and transparent manner (Cadule et al., 2010; Randerson et al., 2009).
The International Land Model Benchmarking (ILAMB) project (http://www.ilamb.org/) has recently been launched to promote model-data comparison to evaluate and improve the performance of land models. ILAMB aims to (1) develop internationally accepted benchmarks for land model performance, (2) promote the use of these benchmarks by the international community for model comparison, (3) strengthen linkages between experimental, remote sensing, and climate modeling communities, (4) design new model tests, and (5) support the design and development of a new, open source, benchmarking software system for use by the international community. ILAMB has the potential to stimulate observation and experimental communities to design new measurement campaigns to improve models and reduce uncertainties associated with key processes in land models.
As a part of the ILAMB project, here we propose a framework for benchmark analysis and highlight its major challenges and future research opportunities. The framework is intended to define terms related to benchmark analysis and to facilitate communication among practitioners in this area of research, as well as with those who are entering into this field of research. The framework for benchmark analysis we propose consists of four major elements, which are (1) identification of key aspects of land models that require evaluation, (2) definition of benchmarks against which model performance skills can be quantified, (3) creation of metrics to measure model performance, and (4) approaches to identify and rectify model deficiencies. The most central but challenging part of developing this framework is to define a set of a few yet effective benchmarks with a metrics system to measure model performances. A stepwise procedure to conduct individual benchmark analysis can follow relevant published papers, such as Randerson et al. (2009).

Benchmark analysis: a general framework
In a general sense, benchmark analysis is a standardized evaluation of one system's performance against defined references (i.e., benchmarks) that can be used to diagnose the system's strengths and deficiencies for future improvement. Benchmark analyses have been widely applied in economics, meteorology, computer sciences, business, and engineering.

In business, for example, benchmark analysis provides a systematic approach to improving production efficiency and profitability through identifying, understanding, and adapting the successful business practices and processes used by other companies in terms of quality, time, and cost (Fifer, 1988). In engineering, benchmark analysis is used to measure efficiency, productivity, and quality against a reference or benchmark performance of a standardized instrument (Jamasb and Pollitt, 2003). In meteorology, benchmark analysis facilitates testing the accuracy, efficiency, and efficacy of meteorological model formulations and assumptions against measurements (Bryan and Fritsch, 2002). In computer sciences, benchmark analysis is used to examine the performance of a processor, code structure, features of processor architecture, and optimization of a compiler against a number of standard tests to gain insight into how the processor or code compares with alternative approaches and how it can be improved (Simon and McGalliard, 2009; Ghosh and Sonakiya, 1998).
Benchmark analysis is urgently needed to evaluate land models against observations and experimental manipulations, as it allows us to identify uncertainties in predictions as well as guide the priorities for model development (Blyth et al., 2011). Several land model benchmarking studies have been attempted but have used only a subset of available observations or have been applied to a small number of models. For example, the Carbon-LAnd Model Intercomparison Project (C-LAMP) compared two biogeochemistry models integrated within the Community Land Model (CLM), Carnegie-Ames-Stanford Approach (CASA) and carbon-nitrogen (CN), with nine different classes of observations (Randerson et al., 2009). The Joint UK Land Environment Simulator (JULES) was evaluated for its performance against surface energy flux measurements from 10 flux network (FLUXNET) sites spanning a range of climate conditions and biome types (Blyth et al., 2011). Three global models of the coupled carbon-climate system were evaluated against atmospheric CO2 concentrations from a network of stations to quantify each model's ability to reproduce the global growth rate, the seasonal cycle, the El Niño-Southern Oscillation (ENSO)-forced interannual variability of atmospheric CO2, and the sensitivity to climatic variations (Cadule et al., 2010). The evaluation procedures so far have been developed independently by small groups of researchers, and as a consequence have emphasized different types of observational constraints and evaluation metrics. It is essential to develop a widely accepted, consistent, and comprehensive framework for benchmark analysis.
A comprehensive benchmarking framework has at least four elements: (1) targeted aspects of model performance to be evaluated, (2) benchmarks as defined references to evaluate model performance, (3) a scoring system of metrics to measure relative performances among models, and (4) diagnostic approaches to identification of model strengths and deficiencies for future improvement (Fig. 1). [Fig. 1: the framework comprises (1) defining model aspects to be evaluated, (2) selecting benchmarks as standardized references to test models, (3) developing a scoring system to measure model performance skills, and (4) stimulating model improvement.] First, a land model
typically simulates biophysical processes, hydrological processes, biogeochemical cycles, and vegetation dynamics. For each of the component processes, the land model has to represent basic system dynamics well (i.e., baseline simulation) and simulate their responses and feedback to climate change and disturbances (i.e., response simulation). Any benchmark analysis has to be clear on what aspects of the land models are being evaluated. Second, the most critical component of any benchmark analysis is to define benchmarks, which have to be objective, effective, and reliable for evaluating model performance. Third, a scoring system is needed to set criteria for a model to pass the benchmark test and measure relative performance among models. Fourth, benchmark analysis should identify needed model improvements and areas where the model is sufficiently robust for accurate simulations. The four elements of the benchmarking framework are discussed in detail in the following sections.

Aspects of land models to be evaluated by means of benchmarking
Land models typically simulate the surface energy balance, hydrological processes, biogeochemical cycles, and vegetation dynamics. Although individual studies may evaluate a few aspects of model performance, a comprehensive framework is required to evaluate all of these major components when land models are integrated with Earth system models (ESMs). Unlike models used for weather prediction, the land components of ESMs are usually designed to predict longer-term future states of ecosystems and climate. The performance of a model should therefore be evaluated for its baseline simulations over broad spatial and temporal scales, and include evaluations of modeled responses and feedbacks of land processes to global change and different types of disturbance. Scientists have to establish some level of confidence in land models' baseline simulations of pre-industrial ecosystem processes before the models can be used to study ecosystem responses and feedback to climate change. The baseline state for biogeochemical cycles includes simulated global totals, spatial distributions, and temporal dynamics of gross primary production, net primary production, vegetation and soil carbon stocks, ecosystem respiration, litter production, litter mass, net ecosystem production, and land-use and land-cover patterns. The baseline state for biophysical processes includes shortwave and longwave radiation, sensible and latent heat fluxes, surface temperature, evaporation, transpiration, snow cover and snow depth, active layer dynamics in permafrost regions, and runoff. The baseline state for vegetation dynamics includes pre-industrial vegetation distributions and changes in vegetation distribution from the last glacial maximum through the Holocene.
Most baseline preindustrial control simulations are validated against common knowledge and evaluated against benchmarks, for example, for their representation of diurnal and seasonal variations (Fig. 2). Another key baseline performance requirement is that land processes reach and maintain steady state, usually through spin-up, before the models are used to simulate ecosystem responses and feedback to climate change.
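The spin-up requirement described above can be illustrated with a minimal sketch: a one-pool carbon model is integrated until the annual pool change falls below a tolerance. The input and turnover values are illustrative assumptions, not taken from any particular land model.

```python
# Sketch: spin-up of a one-pool carbon model to steady state.
# Illustrative values: annual input u (kg C m^-2 yr^-1) and
# first-order turnover rate k (yr^-1) are assumptions, not from the paper.

def spin_up(u=0.5, k=0.02, dt=1.0, tol=1e-6, max_years=100000):
    """Integrate dX/dt = u - k*X until the annual change falls below tol."""
    x = 0.0
    for year in range(max_years):
        dx = (u - k * x) * dt
        x += dx
        if abs(dx) < tol:
            return x, year
    return x, max_years

x_eq, years = spin_up()
# Analytically, the steady state is X* = u / k = 0.5 / 0.02 = 25 kg C m^-2.
print(round(x_eq, 2), years)
```

Real land models spin up many coupled pools over millennia of forcing, but the convergence criterion is the same idea: stocks stop drifting before response simulations begin.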
To reliably predict future states of ecosystems under a changed environment, land models have to realistically simulate responses of land processes to disturbances and global change. Natural and anthropogenic disturbances can significantly alter biogeochemical processes, biophysical properties, and vegetation dynamics. Several land models have incorporated algorithms to simulate individual events of fire and land-use change (Thonicke et al., 2010; Prentice et al., 2011). Natural disturbances occur at different frequencies with varying severity on diverse spatial scales in different regions and thus can be characterized by disturbance regimes. Climate change can regulate and, in turn, be affected by disturbance regimes. How to simulate and benchmark the responses and feedback of disturbance regimes to climate change still remains a great challenge (see Weng et al., 2012). In this context, improved regional- to global-scale time series of burned area, insect outbreaks, hurricane damage, wind blowdowns, and logging are needed to reduce uncertainties in existing parameterizations.
Major global change factors include rising atmospheric CO2 concentration, increasing land use, rising surface air temperature, altered precipitation amounts and patterns, and nitrogen (N) deposition. Most land models use the Farquhar leaf photosynthesis model (Farquhar et al., 1980) and a stomatal conductance formulation to simulate instantaneous increases in carbon influx in response to increasing [CO2], but there is much greater variation in the extent to which current models account for long-term acclimation of photosynthetic and respiratory parameters to global change. [Fig. 2: The annual cycle of CO2 is regulated by plant phenology, photosynthesis, allocation, and decomposition processes. A well-functioning model has to match the observations, but it is possible to get the right answer for the wrong reasons; thus, multiple constraints and parallel use of functional relationships are needed for benchmark analysis (adapted from Randerson et al., 2009).] Almost all land models simulate ecosystem responses to climate warming primarily via the kinetic sensitivity of photosynthesis and respiration to temperature and have not fully considered warming-induced changes in phenology and the length of growing seasons, nutrient availability, ecosystem water dynamics, and species composition. Expected changes in the precipitation regime, including changes in frequency, intensity, amount, and spatial distribution as predicted by climate models, will modify species composition and ecosystem function through multiple interacting pathways (Knapp et al., 2008), few of which are currently represented in land models. A few global land models have been designed to simulate ecosystem responses to nitrogen deposition (Thornton et al., 2007; Wang et al., 2010), mainly via stimulation of plant growth or modification of decomposition rates.
Many indirect effects of nitrogen on ecosystem structure and function, and long-term changes in total ecosystem nitrogen content (Lu et al., 2011a; Yang et al., 2011), have not been integrated into most land models.
Feedbacks occur among land processes themselves and between ecosystems and the atmosphere. For example, soil nitrogen availability influences leaf area expansion, plant growth, and the ecosystem carbon cycle. Carbon sequestration in plant biomass and soil feeds back not only to short-term mineral nitrogen availability but potentially also stimulates long-term accumulation of total ecosystem nitrogen content (Luo et al., 2006). Nitrogen availability may also influence albedo (Ollinger et al., 2008) and thus land surface energy and water balances and, ultimately, feedbacks with the climate system. There are numerous feedback processes within land models and in their coupling with climate models. However, it is not straightforward to disentangle these processes and therefore to evaluate feedback mechanisms in benchmark analysis.
While complex land models have numerous aspects to be evaluated, our understanding of their common structures and fundamental properties can make benchmark analysis much more effective. Take the carbon cycle as an example. Land models share some common structures despite their vast differences. Virtually all models simulate four common properties of carbon cycling: (1) photosynthesis as the primary pathway of C entering an ecosystem, (2) compartmentalization of the carbon cycle into distinct pools, (3) donor-pool-dominated C transfers, and (4) first-order decay of litter and soil organic matter to release CO2. The four properties can be well described by a first-order linear differential equation:

dX(t)/dt = ξ(t) A X(t) + B U(t), with X(0) = X0,

where X(t) is the vector of C pool sizes, A is the C transfer matrix, U is the photosynthetic input, B is a vector of partitioning coefficients, X(0) is the initial value of the C pools, and ξ is an environmental scalar. With these equations, ecosystem carbon storage capacity equals carbon inputs multiplied by residence time (Fig. 3) (Xia et al., 2012), and thus carbon-cycle feedbacks to climate change can be quantified by analyzing relative changes in carbon influx into ecosystems and residence times. Thus, C input flow and residence times are critical parameters to consider in benchmark analysis. It will substantially simplify benchmark analysis if we can develop similar analytical frameworks for biophysical processes and dynamic vegetation model components.
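The pool structure described above can be sketched in a few lines. This is a minimal two-pool (litter and soil) illustration of dX/dt = ξAX + BU; all rate constants, the transfer fraction, and the partitioning vector are chosen purely for illustration, not drawn from any published model.

```python
# Minimal sketch of the common carbon-cycle structure
#   dX/dt = xi(t) * A X(t) + B U(t)
# with two pools (litter, soil). All numbers are illustrative assumptions.

def step(x, u, xi, dt=1.0):
    # A encodes first-order decay and donor-pool-dominated transfer:
    # litter decays at k1; a fraction f21 of that flux enters soil,
    # which itself decays at k2.
    k1, k2, f21 = 0.5, 0.02, 0.3
    b = [1.0, 0.0]  # all photosynthetic input partitioned to litter here
    dx_litter = xi * (-k1 * x[0]) + b[0] * u
    dx_soil = xi * (f21 * k1 * x[0] - k2 * x[1]) + b[1] * u
    return [x[0] + dt * dx_litter, x[1] + dt * dx_soil]

x = [0.0, 0.0]
for _ in range(2000):  # approach steady state under constant forcing
    x = step(x, u=0.5, xi=1.0)
# At steady state: litter = U/k1 = 1.0; soil = f21*k1*litter/k2 = 7.5,
# i.e., storage capacity = influx times residence time for each pool.
print([round(v, 2) for v in x])
```

The steady-state stocks are exactly influx multiplied by residence time, which is why C input and residence times are the critical parameters named above.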

Benchmarks as defined references
A comprehensive benchmarking framework has a set of defined benchmarks against which land model performance can be evaluated (Table 1). It is challenging to define a few benchmarks that can be used objectively, effectively, and reliably to evaluate model performance.

Criteria of benchmarks
What qualifies as a benchmark has not been carefully discussed in the research community, although several studies have evaluated performances of land models against available data. In general, a benchmark has to meet the following criteria: objectivity, effectiveness, and reliability for evaluating model performance. First, an objective benchmark likely derives from data or data products, because data can objectively reflect the biogeochemical, biophysical, and vegetation processes in the real world that land models attempt to simulate. In some instances, previous versions of models or statistical models can be used as benchmarks to gauge improvements in model performance. Second, a benchmark should be effective for evaluating model performance. Such a benchmark usually reflects fundamental properties of the systems. Carbon influxes and residence times, for example, determine carbon storage capacity in an ecosystem (Fig. 3; Xia et al., 2012). Thus, long-term and large-scale data sets of carbon influx (e.g., net primary production, NPP) and ecosystem residence times would be very effective in evaluating the performance skill of carbon cycle models. Third, benchmarks should also be reliable. In general, the more variable a data set, the less reliable the benchmark. It is therefore important to evaluate the uncertainty of any data set that will be used as a benchmark.
In addition, benchmarks should be selected to reduce equifinality as much as possible. Although extensive data sets are available for benchmarking land models, equifinality remains a major issue in model evaluation (Tang and Zhuang, 2008; Luo et al., 2009). That is, the available data streams are insufficient to constrain model parameterization (Wang et al., 2001; Carvalhais et al., 2010) or to distinguish between different model structures (Frank et al., 1998). Increases in the number, type, and location of observations used in model calibration and evaluation would ideally mitigate the equifinality issue. Therefore, effective benchmarks should draw upon a broad set of independent observations spanning multiple temporal and spatial scales (Randerson et al., 2009; Zhou and Luo, 2008).
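The equifinality problem can be illustrated with the storage-capacity relationship noted earlier (steady-state stock = influx times residence time): very different parameter pairs reproduce the same stock observation, and only an additional independent observation separates them. All values below are synthetic.

```python
# Equifinality sketch: steady-state carbon storage equals influx times
# residence time (X* = U * tau), so very different parameter pairs can
# reproduce the same stock observation. Values are illustrative.

def steady_stock(u, tau):
    return u * tau

obs_stock = 10.0                                      # observed stock (kg C m^-2)
candidates = [(0.5, 20.0), (1.0, 10.0), (2.0, 5.0)]   # (U, tau) pairs
fits = [(u, tau) for u, tau in candidates
        if abs(steady_stock(u, tau) - obs_stock) < 1e-9]
print(len(fits))          # all three candidates fit the stock alone

# An independent influx observation (e.g., NPP) breaks the tie:
obs_influx = 1.0
constrained = [(u, tau) for u, tau in fits if abs(u - obs_influx) < 1e-9]
print(constrained)        # only the (1.0, 10.0) pair remains
```

This is the sense in which adding independent data streams, here an influx alongside a stock, mitigates equifinality.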

Sources of benchmarks
Benchmarks may comprise direct observations (Mittelmann and Preussner, 2006), results from manipulative experiments, data-model products, or functional relationships or patterns derived from data (Table 1). Direct observations and experimental results reflect recorded states of ecosystems when the measurements were made and are generally accepted to be the most reliable benchmarks for model performance. Direct measurements include atmospheric CO2 mixing ratios, biomass, litter, soil carbon stocks, species composition, streamflow, snow cover, and soil water content. Comparisons with models need to recognize that even the most direct measurements have involved some level of processing, up-scaling, and assumptions to generate the final estimates.
For example, biomass data for trees are usually derived from allometric equations applied to measured diameter at breast height and tree height (Chave et al., 2005).
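Such an allometric estimate can be sketched as follows. The power-law form mirrors Chave-style models, but the coefficients below are hypothetical placeholders, not the published fits, and serve only to show the processing step between a raw diameter measurement and a "biomass observation".

```python
# Illustrative allometric estimate of aboveground biomass from diameter at
# breast height (d_cm, cm), height (h_m, m), and wood density (rho, g cm^-3).
# The power-law form follows Chave-style allometry, but coefficients a and b
# are hypothetical placeholders, not published values.

def biomass_kg(d_cm, h_m, rho, a=0.05, b=1.0):
    return a * (rho * d_cm**2 * h_m) ** b

# A 30 cm DBH, 25 m tall tree with rho = 0.6:
print(round(biomass_kg(30.0, 25.0, 0.6), 1))  # 0.05 * (0.6*900*25) = 675.0
```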
Direct measurements are usually made at specific points in time and space. Evaluating land model performance over the globe and hundreds of years requires benchmarks with extensive spatiotemporal representation of many processes (Sitch et al., 2008). Data-model products with well-quantified errors, which are generated according to functional relationships that extend the data's spatial and temporal coverage via interpolation and extrapolation, can be useful for benchmarking. For example, evapotranspiration (ET) estimates derived from remote sensing measurements of various energy components together with the energy balance equation (Fisher et al., 2008; Mu et al., 2007; Vinukollu et al., 2011; Jin et al., 2011) offer broad spatial and long temporal data sets for benchmark analysis.
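Before model output can be compared with such ET products, units typically need harmonizing. A minimal sketch of converting a latent heat flux into an ET depth, assuming a constant latent heat of vaporization, is:

```python
# Sketch: converting a latent heat flux (W m^-2) into an evapotranspiration
# depth, the kind of unit harmonization needed before comparing model output
# with remote-sensing ET products. lambda_v is the latent heat of
# vaporization (~2.45 MJ kg^-1); 1 kg of water per m^2 equals 1 mm of depth.

def le_to_et_mm_per_day(le_w_m2, lambda_v=2.45e6):
    joules_per_day = le_w_m2 * 86400.0
    return joules_per_day / lambda_v  # kg m^-2 day^-1 == mm day^-1

print(round(le_to_et_mm_per_day(100.0), 2))  # ~3.53 mm per day
```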
Land models can also be evaluated against benchmarks on simulated patterns or relationships rather than on absolute values of particular variables. This approach is particularly effective when uncertainties in data due to both random and systematic errors are unknown, or when a prognostic climate may induce biases in ecosystem function. For example, the south-north increase in the amplitude of the seasonal cycle in atmospheric CO2 (Prentice et al., 2000) and latitudinal gradients in the satellite-observed fraction of absorbed radiation both give information about the geographic distribution of vegetation production. Similarly, the spatial relationship between annual NPP and annual precipitation in a global network of monitoring stations provides more information about the sensitivity of NPP to climate than a comparison of these data on the basis of vegetation types (Randerson et al., 2009) (Fig. 4). Correlations between El Niño-related climate anomalies and the growth rate of atmospheric CO2 can be used to examine consistency between observed and simulated ecosystem responses to climate change (Cadule et al., 2010) (Fig. 5).
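Pattern-based evaluation of this kind can be sketched by comparing the fitted NPP-precipitation slope of a model against that of observations rather than comparing absolute values. The site data below are synthetic.

```python
# Sketch: evaluating a model on a functional relationship (NPP sensitivity
# to precipitation across sites) rather than on absolute values.
# All site data below are synthetic illustrations.

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

precip = [300.0, 600.0, 900.0, 1200.0]   # mm yr^-1, synthetic sites
obs_npp = [150.0, 300.0, 450.0, 600.0]   # obs: 0.5 g C m^-2 per mm
mod_npp = [250.0, 400.0, 550.0, 700.0]   # model: same slope, constant bias

# The model is biased in absolute NPP but captures the observed sensitivity:
print(slope(precip, obs_npp), slope(precip, mod_npp))
```

A model that matched absolute NPP at one site but had the wrong slope would fail this pattern test even while passing a site-level comparison.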
Model performance is also sometimes evaluated against standardized simulation results of a well-accepted model (Dai et al., 2003), the model ensemble mean (Chen et al., 1997), or statistically based model results (Abramowitz, 2005). For example, a statistically based artificial neural network has been used to compare the performance of process-based land models and can help define a benchmark level of performance that land models can be targeted to achieve, relative to the information contained in the meteorological forcing of the surface fluxes (Abramowitz, 2005).

Candidate benchmarks for evaluation of various aspects of land models
Benchmarks are needed to evaluate biophysical processes, biogeochemical cycles, and vegetation dynamics of land models. Exchange of water and energy between the land surface and atmosphere exerts controls on regional and global climate. In general, the available net radiation at the land surface is partitioned into ground, sensible, and latent heat fluxes, the last of which drives the hydrological cycle. Benchmarking energy and water exchange requires estimates of precipitation, shortwave and longwave radiation components, latent and sensible heat fluxes, runoff, and soil moisture and temperature. Examples of global-scale reference data sets are shown in Table 2. Manipulative experiments can also be used to evaluate modeled responses of water and energy to global change. Data sets on soil and permafrost conditions and active layer depths from over 100 sites in the Circumpolar Active Layer Monitoring (CALM; http://nsidc.org/data/ggd313.html) program (Brown et al., 2003) are candidate benchmarks for evaluating model simulation of high-latitude ecosystems.
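The surface energy partitioning described above implies a simple closure check, Rn ≈ H + LE + G, of the kind applied when benchmarking surface fluxes against tower data. The numbers here are illustrative half-hourly values.

```python
# Sketch: a simple energy-balance closure check for benchmarking surface
# fluxes, Rn ≈ H + LE + G. The flux values below are illustrative.

def closure_residual(rn, h, le, g):
    """Residual of net radiation not accounted for by turbulent and ground fluxes."""
    return rn - (h + le + g)

# Example half-hourly values in W m^-2:
res = closure_residual(rn=500.0, h=150.0, le=280.0, g=40.0)
print(res)  # 30.0 W m^-2 left unclosed
```

Observed flux-tower data themselves rarely close the energy balance exactly, so a benchmark based on these fluxes has to account for such residuals.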
Data sets that are often used for benchmarking biogeochemical cycle models include atmospheric CO2 records on seasonal to centennial time scales (Dargaville et al., 2002; Heimann et al., 1998) and satellite data at seasonal or longer time scales (Blyth et al., 2010; Maignan et al., 2011; Randerson et al., 2009). Other available data sets for biogeochemical cycle benchmarking include global gross primary production (GPP), NPP, soil respiration, ecosystem respiration, plant biomass, litter pool, litter decomposition rate, and soil carbon data products (Table 3). Recently, better estimates of high-latitude soil carbon stocks have been assembled (Tarnocai et al., 2009). Data sets of methane emissions at various sites have been used to test a methane model (Riley et al., 2011). Preference is always given, where possible, to longer time series data sets, as they offer the potential to detect how the land surface responds to low-frequency modes of climate variation (e.g., Piao et al., 2011, on normalized difference vegetation index (NDVI) greening and browning in boreal areas). Data sets on nutrient cycling and state variables at site, regional, and global scales can be used to benchmark global carbon-nitrogen models (Wang et al., 2010). In addition, global change experiments offer the potential to benchmark biogeochemical cycle responses to elevated CO2, warming, altered precipitation, and nitrogen fertilization or deposition (Table 3). Free-air CO2 enrichment (FACE) experiments are a good example of manipulative experiments that have provided useful benchmarks for land surface models (Randerson et al., 2009). These experiments provided integrative measures of ecosystem response to future concentrations of atmospheric CO2 (e.g., NPP, N uptake, stand transpiration) over multiple years, as well as detailed descriptions of contributory processes (e.g., photosynthesis, fine-root production, stomatal conductance) (Norby and Zak, 2011).
The average response of the 11 models in the C4MIP project (Friedlingstein et al., 2006) was consistent with the FACE results, although individual models varied widely. However, most of the experiments may not have been run long enough to quantify slow feedback processes, such as progressive N limitation that may down-regulate NPP (Norby et al., 2010). Vegetation is usually represented in land models by some combination of 7-17 plant functional types (PFTs). The composition and abundance of PFTs can either be prescribed as time-invariant fields or can evolve with time as a result of changes in disturbance, mortality, recruitment, competition, or land-use change. Although different land models have their own sets of PFTs, pre-industrial vegetation types are very important for benchmarking model performance (Table 4). In addition, it is also critical to have data sets of vegetation responses to disturbance and global change. There are some limited data available for quantifying vegetation responses to warming, N deposition, fire, and land-use change (Table 4).
While many of the available data sets described above may be suitable candidates for benchmarks, they have to be effective and reliable for evaluating model performance by the international science community. In this context, it is essential to develop a consensus by experts on defining and selecting benchmarks for use by the international community.

Benchmarking metrics
A comprehensive benchmarking study usually scrutinizes model performance from multiple perspectives. Thus, a suite of metrics across several variables should be synthesized to holistically measure model performance at the relevant spatial and temporal scales at which the model operates (Abramowitz et al., 2008;Cadule et al., 2010;Randerson et al., 2009;Taylor, 2001). Choices of which measures of performance to use and how to synthesize the measures can significantly affect the outcome of measuring performance skills among models. Defining a metrics system, therefore, is a key step in any benchmark analysis.
Many statistical measures (e.g., continental-scale daily root-mean-square error (RMSE), global mean annual deviation from observed values, and global monthly correlations) are available to quantify mismatches between modeled and multiple observed variables (Janssen and Heuberger, 1995; Smith and Rose, 1995). For example, Schwalm et al. (2010) used Taylor skill, bias, and observational uncertainty to measure the performance of 22 terrestrial ecosystem models against observations from 44 FLUXNET sites (Fig. 6). How to combine such measures to holistically represent model performance skill is still an unresolved issue in benchmark analysis. Many techniques have been explored by the data assimilation research community to combine metrics measuring mismatches of modeled variables with multiple observations (Trudinger et al., 2007). Some of these techniques may be very useful for benchmark analysis. It is essential to define a cost function that describes data-model mismatches using multiple observations for data assimilation (Table 5). Standard deviations of individual observations have been used as weights for model mismatches with data sets whose absolute values differed by several orders of magnitude, and also successfully in regional data assimilation with spatially distributed data (Zhou and Luo, 2008). Normalization by standard deviations of various data sets can effectively account for uncertainties in reference data sets. Other weighting functions include a simple sum of mismatches between modeled and observed variables, the standard deviation of residuals after a preliminary run of the calculation, the average value of observations, and a linear function of the observation values (Trudinger et al., 2007).
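The normalization idea discussed above can be sketched as a cost function in which each data stream's mismatch is scaled by the standard deviation of its observations before summing, so streams differing by orders of magnitude contribute comparably. All values below are synthetic.

```python
# Sketch of a cost function combining model-data mismatches across data
# streams whose magnitudes differ by orders of magnitude, normalizing each
# mismatch by the standard deviation of the observations. Values are synthetic.

def cost(streams):
    """streams: list of (modeled, observed, obs_std) tuples per data stream."""
    total = 0.0
    for modeled, observed, std in streams:
        total += sum(((m - o) / std) ** 2 for m, o in zip(modeled, observed))
    return total

npp = ([480.0, 510.0], [500.0, 500.0], 50.0)   # g C m^-2 yr^-1
soil_c = ([11000.0], [10000.0], 2000.0)        # g C m^-2
print(round(cost([npp, soil_c]), 3))
```

Without the normalization, the soil-carbon mismatch (1000 g C m^-2) would dominate the NPP mismatches (tens of g C m^-2 yr^-1) purely because of its units.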
Besides the statistical methods, the C-LAMP system (Randerson et al., 2009) gave metrics for model performance that depended on a qualitative assessment of the importance of the process being tested. To make such an assessment more objective, an analytic framework has recently been developed to trace modeled ecosystem carbon storage capacity to (1) a product of NPP and ecosystem residence time (τE). The latter is further traced to (2) baseline carbon residence times, (3) environmental scalars (ξ) modifying baseline carbon residence time into actual ecosystem residence time, and (4) environmental forcings (Xia et al., 2012). The framework has the potential to help define weighting factors for various benchmarks in a metrics system for measuring carbon cycle model performance.

Fig. 6. Model skill metrics for 22 terrestrial ecosystem models. Skill metrics are Taylor skill (S), normalized mean absolute error (NMAE), and the reduced chi-squared statistic (χ²). Taylor skill is used to represent the degree to which simulations matched the temporal evolution of monthly NEE; NMAE quantifies bias, i.e., the "average distance" between observations and simulations in units of observed mean NEE; χ² is used to quantify the squared difference between paired model and data points over observational error, normalized by degrees of freedom. Better model-data agreement corresponds to the upper left corner. The benchmark represents perfect model-data agreement: S = 1, NMAE = 0, and χ² = 1. A gray interpolated surface was added and model names attached to improve readability. Model details are given in Schwalm et al. (2010).
The research community may also decide upon a priori threshold levels of model performance, i.e., minimal requirements to be met before a benchmark analysis of multiple models is conducted. Such a threshold would need to be justified according to criteria of why a model below the threshold is not acceptable. Such thresholds may be viewed as a necessary, but not sufficient, condition for a fully functioning model, because complex models may perform well on particular metrics as a result of compensating errors (that is, getting the right answers for the wrong reasons).
The ranking of land models should be tailored to the specific objective of the benchmark analysis. For instance, land surface models operating within mesoscale meteorology or weather forecast models must be particularly robust at simulating energy and moisture fluxes, while land models coupled to Earth system models should not only simulate those energy and water fluxes but also accurately represent ecosystem responses to changes in atmospheric composition and climate over decadal to centennial time scales. Thus, metrics that measure disagreements between simulated and observed energy and water fluxes should be weighted more heavily in a mesoscale meteorological study than in a decadal to centennial climate change study.
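The objective-dependent weighting described above can be sketched with a minimal example. All process names, scores, and weights below are hypothetical; the point is only that the same per-process skill scores can produce different overall rankings under different weighting schemes.

```python
# Hypothetical per-process skill scores for one model (0 = worst, 1 = best)
scores = {"energy_flux": 0.9, "water_flux": 0.8,
          "carbon_cycle": 0.6, "vegetation_dynamics": 0.5}

# Weights reflect the objective of the benchmark analysis (must sum to 1)
weights = {
    "mesoscale_meteorology": {"energy_flux": 0.4, "water_flux": 0.4,
                              "carbon_cycle": 0.1, "vegetation_dynamics": 0.1},
    "centennial_climate":    {"energy_flux": 0.2, "water_flux": 0.2,
                              "carbon_cycle": 0.3, "vegetation_dynamics": 0.3},
}

def overall_skill(scores, w):
    """Weighted aggregate of per-process performance scores."""
    return sum(scores[k] * w[k] for k in scores)

meso = overall_skill(scores, weights["mesoscale_meteorology"])
clim = overall_skill(scores, weights["centennial_climate"])
# The same model scores higher under the mesoscale objective, which
# rewards its strong energy and water fluxes.
```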

The role of benchmarking in model improvement
One of the ultimate goals of a benchmark analysis is to provide clues for diagnosing systematic model errors and thereby aid model development, although it need not be an essential part of a benchmarking activity. The clues for model improvement usually come from identified poor performance of a land model in its simulations of processes and/or ecosystem composition at different temporal and spatial scales. Model improvement is usually implemented through changes in model structure, parameterization, initial values, or input variables.
The average physiological properties of plant functional types are traditionally conceived as model "parameters". Parameter error may therefore arise when the values chosen for model parameters do not correspond to true underlying values. Thus, benchmarking land models against plant trait data sets might be useful in assessing whether model parameters fall within realistic ranges. Such data sets include the GLOPNET leaf trait data set (Reich et al., 2007; Wright et al., 2005) and the TRY data set (Kattge et al., 2009). The TRY data set, for example, provides probability density functions of photosynthetic capacity based on 723 data points for observed carboxylation capacity (Vcmax) and 1966 data points of observed leaf nitrogen. Implementing these new, higher values of observationally constrained Vcmax in the CLM4.0 model resulted in a significant overestimate of canopy photosynthesis, compared to estimates of photosynthesis derived from FLUXNET observations (Bonan et al., 2011). The magnitude of the overestimation of GPP (~500 g C m−2 yr−1, between 30° and 60° latitude) identified several fundamental issues related to the formulation of the canopy model in CLM4.0.
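A minimal sketch of such a parameter plausibility check follows. The Vcmax values below are invented for illustration (a real analysis would draw on the TRY or GLOPNET distributions), and the two-standard-deviation criterion is one simple choice among many.

```python
import statistics

# Hypothetical observed Vcmax values (umol m-2 s-1) for one plant
# functional type; real analyses would use TRY/GLOPNET trait data.
vcmax_obs = [45.0, 52.0, 60.0, 58.0, 49.0, 63.0, 55.0, 47.0]
mu = statistics.mean(vcmax_obs)
sd = statistics.stdev(vcmax_obs)

def within_observed_range(param_value, mu, sd, n_sd=2.0):
    """Flag whether a model parameter falls inside mu +/- n_sd
    standard deviations of the observed trait distribution."""
    return abs(param_value - mu) <= n_sd * sd

model_vcmax = 90.0  # hypothetical model default, well above the data
ok = within_observed_range(model_vcmax, mu, sd)  # flagged as unrealistic
```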
Model structure error arises when key causal dependencies in the system being modeled are missing or represented incorrectly in the model. Based on biogeochemical principles of carbon-nitrogen coupling, for example, Hungate et al. (2003) conducted a plausibility analysis to illustrate that carbon sequestration may be considerably overestimated without the inclusion of nitrogen processes (Fig. 7). Without the carbon-nitrogen feedback, models fail to capture the experimentally observed positive responses of NPP to warming in cool climates. Generally, model structural errors are likely to reveal themselves through sufficiently comprehensive benchmarking and usually cannot be resolved by tuning or optimizing parameter values (Abramowitz, 2005; Abramowitz et al., 2006, 2007). Nevertheless, over-parameterization of related processes may mask structural deficiencies. A poor representation of the seasonal cycle of heterotrophic respiration in high latitudes by the Hadley Centre model (Cadule et al., 2010) was caused by soil temperature becoming much too low in the winter. Simply improving the seasonal cycle by adjusting the temperature function of respiration would have given the right answer for the wrong reason and materially affected the sensitivity to future changes. Understanding the process (too little insulation of soil temperatures by the snowpack) enabled resolving the error without changing the long-term sensitivity. The C-LAMP benchmark analysis of CLM-CASA and CN against atmospheric CO2 measurements, eddy-flux data, MODIS observations, and TRANSCOM results suggested the need to improve model representation of seasonal and interannual variability of the carbon cycle (Fig. 2).
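The "right answer for the wrong reason" risk can be illustrated with a toy Q10 respiration example (not the Hadley Centre formulation; all parameter values are invented). Two parameterizations tuned to agree at present-day soil temperature diverge once the climate warms, so tuning alone can hide a structural error while corrupting the long-term sensitivity.

```python
def respiration(t_soil, base_rate, q10, t_ref=10.0):
    """Simple Q10 temperature response of heterotrophic respiration."""
    return base_rate * q10 ** ((t_soil - t_ref) / 10.0)

# Two parameterizations tuned to agree at a present-day soil
# temperature of 10 degC (illustrative values, not from any model):
r_a = respiration(10.0, base_rate=2.0, q10=2.0)  # matches observations
r_b = respiration(10.0, base_rate=2.0, q10=3.0)  # also matches today

# Under 4 degC of warming the two tuned models diverge, because the
# fitted temperature sensitivities differ:
w_a = respiration(14.0, base_rate=2.0, q10=2.0)
w_b = respiration(14.0, base_rate=2.0, q10=3.0)
```

Both parameter sets reproduce the present-day flux exactly, yet the higher-Q10 variant respires substantially more under warming; only benchmarking against observations spanning a range of temperatures could separate them.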

Relevant issues
There are a few general issues worth discussing regarding benchmark analysis. One issue concerns model predictions vs. performance skill as measured by a benchmark analysis. While an increase in performance gained through benchmark analysis will likely lead to an increase in the predictive ability of a model for short-range predictions, it might not be sufficient to guarantee improved long-term projections of ecosystem responses to climate change, because observations of past ecosystem dynamics cannot fully constrain model responses to future climate conditions that have never been observed. Nevertheless, comparing models and observations over a wide range of conditions increases the chance of capturing important nonlinearities and contingent responses that may control future behavior. Also, future states of land ecosystems are determined not only by internal processes, which are usually evaluated by benchmark analysis, but also by external forces. The latter dominate long-term land dynamics, so predictions are clearly bounded by scenario-based, what-if analysis. Embedding land models within Earth system models, however, can help assess feedbacks between internal processes of land ecosystems and various scenarios of climate and land-use changes.

Fig. 7. The analysis by Hungate et al. (2003) was based on biogeochemical principles to reveal major deficiencies in global biogeochemical models. It may not be considered a typical benchmark analysis, but it played a role in stimulating global modeling groups to incorporate nitrogen processes into their models. However, relative performance skills of land models as measured by benchmark analysis vary with additional considerations of data sets, as illustrated in the analysis of the flexibility of the C:N ratio by Wang and Houlton (2009). Moreover, nitrogen capital in terrestrial ecosystems is considerably dynamic in response to rising atmospheric CO2 concentration (Luo et al., 2006), rendering less limitation of ecosystem carbon sequestration.
Another issue is related to the feasibility of building a community-wide benchmarking system. Land model benchmarking has reached a critical juncture, with several recent parallel efforts to evaluate different aspects of model performance. One future direction that may minimize duplication of effort is to develop a community-wide benchmarking system supported by multiple modeling and experimental teams. For a community-wide system to function well, it will need to be built using open source software and using only freely available observations with a traceable lineage. The software system could be used to diagnose impacts of model development, guide synthesis efforts, identify gaps in existing observations needed for model validation, and reduce the human capital costs of making future model-data comparisons (Randerson et al., 2009). This is the approach being taken by the International Land Model Benchmarking Project (ILAMB) that will initially develop benchmarks for CMIP5 models participating in the IPCC 5th Assessment Report. An expectation of the first ILAMB benchmark is that it will be modified and expanded for use in future model intercomparison projects. Ultimately, a robust benchmarking system, when combined with information on model feedback strengths, may reduce uncertainties associated with emissions estimates required for greenhouse gas stabilization over the 21st century or other future climate projections. Such an open source, community-wide platform for model-data intercomparison also speeds up model development and strengthens ties between modeling and measurement communities. Important next steps include the design and analysis of land-use change simulations (in both uncoupled and coupled modes), and the entrainment of additional ecological and Earth system observations.
Lastly, benchmark analysis shares objectives and procedures with data assimilation in many ways (Table 5). Data assimilation is a formal approach to infuse data into models for improving parameterization and adjusting model structures (Peng et al., 2011; Raupach et al., 2005). Data assimilation projects a misfit between modeled and observed quantities into the space of parameters, and quantifies the level of constraint on each parameter with associated uncertainties. It provides quantitative information to decide whether a model behaves satisfactorily, rather than performance criteria to be met when comparing model output with data. However, data assimilation is computationally very costly and, as a consequence, cannot be easily implemented to directly improve comprehensive, global-scale land models. A combination of benchmarking and data assimilation may facilitate land model improvement: benchmarking can be used to pinpoint model deficiencies, which can then become targeted aspects of a model to be improved via data assimilation.

Concluding remarks
This paper proposed a four-component framework for benchmarking land models. The components are: (1) identification of aspects of models to be evaluated, (2) selection of benchmarks as standardized references to test models, (3) a scoring system to measure model performance skills, and (4) evaluation of model strengths and deficiencies for model improvement. This framework consists of mostly commonsense principles. To implement it effectively, however, we have to address a few challenging issues. First, land models have incorporated more and more relevant processes to simulate land responses to global change as realistically as possible. As a consequence, it has become almost impossible to evaluate so many processes individually. We have to understand fundamental properties of the models to crystallize key aspects of models or identify a few traceable components (e.g., Xia et al., 2012) to be evaluated. Second, global networks of observational and experimental studies offer more and more long-term, spatially extensive data sets, which become candidate benchmarks for model evaluation. Even so, many data sets have limited information content, leading to equifinality issues for model evaluation. We have to evaluate various data sets to develop widely acceptable benchmarks against which model performance can be reliably, effectively, and objectively evaluated. Third, a robust scoring system is essential to compare performance skills among models. It is still challenging to develop a scoring system that can effectively synthesize various aspects of model performance skills. Development of an effective scoring system has to use various statistical approaches to evaluate the relative importance of the evaluated processes toward the targeted performances of the models.
Fourth, benchmark analysis will become much more effective in identifying model strengths and deficiencies for model improvement when combined with other model analysis and improvement approaches, such as model intercomparison and data assimilation.
Benchmark analysis has the potential to rank land models according to their performance skills and thus to convey confidence to the public, to improve land models for more realistic simulations and accurate predictions, and to stimulate closer interactions and collaboration between modeling and observation communities.