Towards a more objective evaluation of modelled land-carbon trends using atmospheric CO 2 and satellite-based vegetation activity observations

Terrestrial ecosystem models used for Earth system modelling show a significant divergence in future patterns of ecosystem processes, in particular the net land– atmosphere carbon exchanges, despite a seemingly common behaviour for the contemporary period. An in-depth evaluation of these models is hence of high importance to better understand the reasons for this disagreement. Here, we develop an extension for existing benchmarking systems by making use of the complementary information contained in the observational records of atmospheric CO 2 and remotely sensed vegetation activity to provide a novel set of diagnostics of ecosystem responses to climate variability in the last 30 yr at different temporal and spatial scales. The selection of observational characteristics (traits) specifically considers the robustness of information given that the uncertainty of both data and evaluation methodology is largely unknown or difficult to quantify. Based on these considerations, we introduce a baseline benchmark – a minimum test that any model has to pass – to provide a more objective, quantitative evaluation framework. The benchmarking strategy can be used for any land surface model, either driven by observed meteorology or coupled to a climate model. We apply this framework to evaluate the offline version of the MPI Earth System Model’s land surface scheme JSBACH. We demonstrate that the complementary use of atmospheric CO2 and satellite-based vegetation activity data allows pinpointing of specific model deficiencies that would not be possible by the sole use of atmospheric CO 2 observations.


Introduction
The terrestrial and oceanic biospheres currently absorb almost half of the fossil-fuel emissions, and thereby buffer the atmospheric CO 2 increase and reduce the rate of climate change (Cox et al., 2000;Raupach et al., 2008;Le Quéré et al., 2009).Because of the strong interactions between the biosphere net carbon (C) uptake and climate in particular on land (Cox et al., 2000;Friedlingstein et al., 2006, Arora et al.2013), projections of future climate changes from Earth system models (ESMs) need to accurately simulate the processes that control the evolution of the terrestrial net C balance.However, despite a seemingly common behaviour of C cycle models for the contemporary period, estimates of the future C land balance by different terrestrial biosphere models (TBMs) diverge significantly.This divergence contributes strongly to the overall uncertainty in the future evolution of the global carbon cycle (Friedlingstein et al., 2006;Sitch et al., 2008;Arora et al., 2013).The apparently contradictory behaviour underlines the difficulty of constraining future projections of terrestrial models with current observations.This calls for an in-depth model evaluation that focusses on the model's capacity to simulate key features of Ccycle-related processes rather than simply ensuring that the easily diagnosed simulated net land-atmosphere C exchange agrees with estimates inferred from observations.Several global model evaluation analyses have been published in the last decades with respect to land model performances of the carbon cycle (Anav, et al., 2013;Cadule et al., 2009;Blyth et al., 2009;Randerson et al., 2009;Heimann et al., 1998).However, they differ with respect to reference dataset used, selection of the observational traits as well as D. Dalmonech and S. Zaehle: Towards a more objective evaluation of modelled land-carbon trends their computation, and mathematical formulations used to quantify the data-model mismatch.These differences cause uncertainty when it comes to ranking several land surface models or to analyse the outcome from different evaluation works.Recent model benchmarking initiatives (Randerson et al., 2009;Luo et al., 2012) have therefore underlined the need for the development of a standard set of tests and metrics applicable to any land surface model at different spatial and temporal scale.
In addition to a lack of standards, a key challenge in evaluating global biosphere models comes from the uncertainties in observations.From a perspective of data-model mismatch quantification, given uncertainties in data and observation, operators to link model and data exist.However, data error and structural errors are often not known or provided quantitatively (e.g.Raupach et al., 2005).
This study is an attempt to move toward a more robust and a more objective evaluation framework by defining novel tests/diagnostics and quantitative model performance measures that are robust against these mentioned unquantifiable uncertainties.We first selected a parsimonious number of reference datasets that are as much as possible direct observations.In first instance, upscaled products such as that from Beer et al. (2009) were not used as the fraction of gap-filled information is not quantified.Atmospheric CO 2 and remote sensing data of vegetation activity were selected to take advantage of their spatial and temporal coverage and the complementarity of their information content.
Atmospheric CO 2 measurements and transport modelling that links surface fluxes to these measurements are a valuable approach to evaluate TBMs since the atmospheric CO 2 retains the signature of terrestrial ecosystem response to climate variability (Heimann et al., 1998;Randerson et al., 2009;Cadule et al., 2010).However, atmospheric CO 2 observations alone do not allow inference of the contribution of vegetation and soil components to the observed signal, such that a good fit might hide compensating model errors.Remote-sensing observations of vegetation activity may provide complementary information as they reflect the climateand disturbance-related seasonal and interannual trends of vegetation greenness (Peñuelas et al., 2009;Richardson et al., 2009).
Rather than comparing average quantities, the analyses presented here analyse how much relevant and robust information, which helps constraining model projections, can be extracted from observations.Hence, we select traits, in particular with respect to vegetation activity, that are based on the information of changes with time, correlations with covariates, and the sign of the changes, as well as based on metrics that are sensitive to difference in sign and phase.Phasing and extent of the climate variability simulated by Earth system models (ESMs) often differs from observed climate because of unforced variability (Deser et al., 2010).To circumvent the resulting mismatch from a direct comparison of ESM simulations and modern observations, and to make key characteristics of the observations useful for the evaluation of ESMs, priority was given to traits and metrics that describe the relationship between climate variables and carbon cycle processes rather than direct comparison of observed and modelled time series.
The second innovation of our studies is that we impose a lower acceptable model performance measure (baseline benchmark) based on the assumption of a null model, i.e. a model that does not show any trend in the quantity under investigation.This lower boundary for each metric helps to avoid misleading interpretation of the number returned by the scores, and to provide a more informative and intuitively interpretable analysis of the model performance.With respect to the atmospheric CO 2 traits, the aim is to quantify how much information the land surface model adds to the signal of ocean and anthropogenic fossil-fuel emissions and thus to quantify how good the model is relative to the null hypothesis (null model).The working line is thus as follows: the analyses were performed on a seasonal and a de-seasonalized signal to better identify C-cycle patterns and the relationship between C-cycle-related processes and climate variability.As detailed in Sect.2, we selected several characteristics (traits) of the observational data that are relevant to the biosphere's response to climate variability in terms of terrestrial C cycling patterns.We focussed on the the last three decades  since this is the period with the best data availability (Tables 2 and 3).For the selected tests, a list of comprehensive metrics was selected to quantify model performances according to the information content of identified traits.We then compared this metric to the reference value of the metric obtained according to the baseline benchmark to arrive at a final score for the model.
In Sect. 3 we discuss the potential strengths and limitations of the evaluation framework at the example of the the JSBACH land surface model of the MPI-ESM (Raddatz et al., 2007;Giorgetta et al., 2013) driven by reconstructed meteorology.

Atmospheric CO 2
Atmospheric CO 2 concentration recorded at remote measuring stations were obtained from the flask data/continuous measurements provided by different institutions (e.g.flask data of NOAA/CMDL's sampling network, update of Conway et al., 1994, Japan Meteorological Agency (JMA), Meteorological Service of Canada (MSC), and many others; see Rödenbeck, 2005).Simulated net land-atmosphere CO 2 fluxes for the period 1980 to 2009 were transported together with estimated net ocean CO 2 fluxes (Jacobson et al., 2007;Mikaloff Fletcher et al., 2006, 2007 -one of the best available products based on Takahashi ocean dataset and involving several biogeophysical ocean models) and fossil-fuel fluxes (EDGARv.4.0, Olivier et al., 2001, http://edgar.jrc.ec.europa.eu/faq.php)by means of an atmospheric transport model (TM) to estimate atmospheric CO 2 record at the measuring stations.For our analysis, we used the TM3 model, version 3.7.22(Rödenbeck et al., 2003), with a spatial resolution of 4 • × 5 • and driven by interannually varying wind fields of the NCEP reanalysis (Kalnay et al., 1996).
The model-based time series of CO 2 at the measuring stations were based on sampling simulated CO 2 abundance at the same time in which measurements were available in order to reduce the representation bias.The temporal resolution of CO 2 data is the original resolution as recorded at the monitoring stations (hourly to daily/weekly) and dependent of the specific station.
Stations were selected in order to cover representatively a latitudinal gradient (Table 1).Latitudinal and vertical transport of CO 2 differs among TMs (Yang et al., 2007), but these differences are difficult to quantify and attribute to particular model features (Gurney et al., 2003;Peylin et al., 2005).In remote stations with simple topography, different TMs tend to agree better and are expected to have less error.The selection of monitoring stations takes account of this by including mainly oceanic/island stations as these remote stations have a lower uncertainty and are only marginally influenced by local C sources or sinks (MPI Biogeochemistry, technical reports 5-6: http://www.bgc-jena.mpg.de/bgc-systems/pmwiki2/pmwiki.php/Publications/TechnicalReports).
Two estimates of the net land-atmosphere CO 2 flux obtained from inverting the observed atmospheric concentrations using atmospheric transport modelling (hereafter referred to as standard fluxes) were also transported using the same protocol as for the simulated TBM fluxes.These fluxes were taken from the Jena inversion system, which relies on the same TM3 transport model (Jena inversion version 3.7.22,available at http://www.bgc-jena.mpg.de/∼ christian.roedenbeck/download-CO2/, update of Rödenbeck et al., 2003;Rödenbeck, 2005Rödenbeck, , covering the periods 1996Rödenbeck, -2008Rödenbeck, and 1981Rödenbeck, -2008, respectively), respectively).The standard fluxes were not used to derive an absolute benchmark sensu strictu but as reference to compute additional traits as reported in Sects.2.4.1 and 2.4.5.

Vegetation activity datasets
To characterize seasonal and interannual changes in vegetation activity, we rely on two satellite-based products: the SeaWiFS-FAPAR (Gobron et al., 2006a, b), the fraction of photosynthetically active radiation absorbed by vegetation, and the longer GIMMS-NDVI collection g (http://glcf.umd.edu/library/guide/GIMMSdocumentationNDVIg GLCF.pdf), which is the normalized difference vegetation index, retrieved from the AVHRR sensor records (Tucker et al., 2005;Beck et al., 2011).Both FAPAR and NDVI provide a measure of greenness integrating canopy functioning.It has been previously shown that these quantities are nearly linearly related (Myneni and Williams, 1994).The selected FAPAR data were provided as 10-dayaggregated time series from September 1997 until June 2006 at a nominal spatial resolution of 2 km and were used to analyse the seasonal cycle of vegetation activity (Table 2).The GIMMS dataset contains biweekly data at a spatial resolution of 8 km from 1981 until 2006 and was used to estimate long-term changes in vegetation activity (Table 3).
Satellite data were aggregated at the spatial resolution of the TBM, including grid cells that are partially covered by bare soils.With this approach, the aggregated signal indirectly accounts for changes in vegetation activity and density.A simple gap-filling procedure based on 2nd degree polynomial interpolation in time was applied to replace bad-quality flag data.All data were aggregated at the monthly temporal resolution.In the case of GIMMS-NDVI, the maximum value composite (MVC) method was used (Holben, 1986).It is assumed that the process of temporal and spatial aggregation of satellite-based vegetation activity smoothes out noise in the data, and the uncertainty induced by the aggregation might be considered negligible for our purpose.Tropical areas were excluded from the analysis due to the high uncertainty in the interpretation of the satellite signal (Asner and Alencar, 2010) and high uncertainties in NDVI datasets in these regions (Huete et al., 2002;Brown et al., 2006).

The JSBACH model
JSBACH is the land surface model of the Max Planck Institute's Earth System Model (MPI-ESM) (Raddatz et al., 2006;Giorgetta et al., 2013) In this study we use the version that was used for the CMIP5 activity (JSBACH version 2.0).JSBACH considers 11 plant functional types, which occupy annually varying fractions (tiles) of a model grid cell, prescribed from land-use data (see Sect. 2.2).Phenology and C cycling is simulated explicitly for each tile, while the halfhourly fluxes of energy and water are calculated for each grid cell, based on the relevant average properties of vegetation and soils across the tiles.The land-use emissions are computed according to the method reported in Reick et al. in  and were aggregated via conservative regridding to the T63 resolution of the MPI-ESM grid at daily resolution.These data were used as model forcing as well as for the climate correspondence analysis.The standardized precipitation index SPI was computed from the precipitation record of the CRU observational dataset (Mckee et al., 1993;Lloyd-Hughes and Saunders, 2002).SPI is suitable as indicator of both dry and wet soil conditions.Irrespective of biomes or region, the 6-month cumulated precipitation data was used to compute the SPI for each grid cell (see Appendix A for more details).Land-cover and land-use change transition maps were derived from Hurtt et al. (2006).

Evaluation methodology
The analyses in this study focus on seasonal and interannual/decadal time scales.To identify these components from the observed and simulated atmospheric CO 2 , as well as vegetation activity and climatic drivers, a seasonal component (up to annual time scale) and an interannual time scale component were isolated using a filter implemented in the Fourier space.We followed the method and the cut-off values presented in Thoning et al. (1989), using Gaussian spectral weights (Rödenbeck et al., 2003).The outcome of the filtering is (i) a seasonal component with a mean of zero, which retains information up to the annual frequency with the very high frequency (daily to biweekly) removed, and (ii) a de-seasonalized signal, which includes all the frequencies lower than the annual cycle -i.e. the interannual to decadal time scales.In terms of interannual variability, this approach of filtering is more advantageous than consideration of monthly anomalies since a de-seasonalized signal provides a better measure of the strength and persistence of interannual variability related to climatic and natural events as El Niño events and volcanic eruptions.The analysis of seasonal patterns aims not only at the relative phasing of vegetation growth and ecosystem respiration and modelled phenology that affects the seasonal phasing of the net land-atmosphere C exchange (Prentice et al., 2000) but also at biogeophysical effects such as the water and energy exchanges (Notaro et al., 2007;Peñuelas et al., 2009).Interannual variability and long-term trends of net land-C exchanges and vegetation activity are an important and crucial aspect of the terrestrial ecosystem in a climate change context.Changes of vegetation activity might have implications to long-term potential for retaining more C in the system, contributing hence to the biosphere-atmosphere feedbacks and internal plant-soil feedbacks (Bonan, 2008).
In the following sections, we describe key features of the atmospheric CO 2 and vegetation activity obtained from the decomposed signals (Table 2: seasonal time scales; Table 3: interannual time scale).These traits are used to assess the capacity of the model to reproduce climate-variability-induced effects on terrestrial ecosystems.In addition, traits characterizing the co-variability of vegetation features/atmospheric CO 2 and land climatic patterns are defined.Some of the selected traits were analysed separately in three time intervals (1982-1991, 1992-1997, and 1998-2006) according to two breakpoint events: the Mount Pinatubo eruption in 1991, and the El Niño event in 1997 -two of the most relevant natural events occurred in the last three decades.
The systematic quantitative assessment of the correspondence of anomalies and trends in simulated vegetation activity and net C exchange is performed using normalized metrics (see Appendix B for the mathematical description).The proposed selected traits and metrics are suitable to be applied to land surface models run in either offline or fully coupled mode because they are based on reproducing variability and/or statistical relationships with the driving climate rather than focusing on the absolute correspondence of the variables.This strategy reduces potential biases in the assessment due to uncertainty in the predicted climatic variability (Deser et al., 2010).
Geographical regions at the continental scale consistent with the regions used for the Transcom3 project (Gurney et al., 2002;Fig. 1) were used to determine the influence of net land-atmosphere CO 2 fluxes from a particular region to the signal at the monitoring stations following the procedure reported in Cadule et al. (2010).The characterization of vegetation activity was performed at grid-cell level and at regional level according to the same Transcom3 regions.The Transcom3 region maps were further intersected with the dominant vegetation map obtained from the Synmap vegetation classification of Jung et al. (2006) (see Appendix D).Grid cells with dominance of bare soil or ice, as well as grid cells with no valid observations, were excluded from the analyses.

Seasonality of Atmospheric CO 2
The model's capacity to simulate phase and amplitude of the mean seasonal cycle of atmospheric CO 2 (MSC) was evaluated using the Taylor score (Taylor, 2001).The selected metric gives more weight to the correspondence in phase instead of amplitude (Taylor, 2001), which is the more reliable feature of transport models (Stephens et al., 2007).Additional information on the land net C exchange is contained in the latitudinal gradient of the amplitude of the mean seasonal cycle (MSClg), which increases from the South Pole northwards because of the relatively higher land masses fraction in the Northern Hemisphere (NH).A metric based on the variance of the amplitude data was used to assess the model performance (Table 2 and Appendix B).
The relative contribution of the C fluxes from land (and ocean) Transcom3 regions to the seasonal cycle amplitude (MSCc) was computed using the atmospheric CO 2 record obtained by transporting the standard fluxes constrained on the period 1996-2008 as reference.This choice was made so as to overlap with the time period for which the SeaWiFS-FAPAR data are available (see Sect. 2.4.2).The relative contribution of each region to each single monitoring station in both standard fluxes and modelled fluxes was compared using the Pearson correlation coefficient.This trait checks thus also for the existence of potential inconsistencies between the regional and seasonal distribution of net land-C fluxes from the model and estimated by the inversion of atmospheric observations.
Changes in the seasonal cycle over time, referred to as the monthly CO 2 trend (MT), are quantified as the year-to-year change in CO 2 concentration for each month.Previous works analysed solely the change in amplitude of the seasonal cycle in Mauna Loa as response to land surface warming (Myneni et al., 1997;Angert et al., 2005;Buermann et al., 2007), while we focus on decadal trends in long-term northern stations, which exhibit a clearer signal.This trait summarizes the seasonal change in the trend of land-C sink/sources in response to climatic drivers and natural disturbances in the extratropical latitudinal band.The model-data correspondence is analysed using the Pearson correlation coefficient.
The trend in the seasonal onset of net land-C uptake (C-dd) was computed as follows: for each year, the algorithm looks for the downward zero-crossing point of the seasonal time series of atmospheric CO 2 .The trend is thereinafter computed on the extracted dates.This feature characterizes in particular the observed high-latitude ecosystem responses to recent land surface warming and it is indirectly linked to the beginning of the growing season (Keeling et al., 1996;Myneni et al., 1997).Because the years 1991-1993 -i.e. the years following the Mount Pinatubo eruption -are an anomaly in this trend (Lucht et al., 2002), these three years were excluded from the analysis.The analyses for the MT and C-dd traits focus on the stations in the extratropical latitudinal band with a clear signal from land and low contamination of the trends due to uncertainties in the fossil-fuel emissions (Table 2).

Seasonality of vegetation activity
A direct comparison of absolute values of remote sensing data such as NDVI or FAPAR and their corresponding modelled variable might be not a viable strategy, first and foremost because of different retrieval and post-processing algorithms used to compute the final estimated FAPAR/NDVI in different satellites products, and to remove, for example, cloud contamination and atmospheric corruption, etc. (e.g. the intercomparison study of Dahlke et al., 2013).This implies that the outcome of a direct model-data comparison is dependent on the reference dataset used.In addition, the radiances recorded by satellites differ in the way that radiation extinction is computed at the land surface in land surface models.This difference does not allow a priori for a perfect match between data and model.the TransCom-3 regions (Gurney et al., 2002) over which the estimated fluxes a at neighbouring land and ocean regions actually overlap each other beyond th being done separately for the individual flux components, according to whethe he apparent coast lines shown here correspond to a land/water map at the stan 43 Fig. 1.Map of the land regions used for the regional benchmark of phenology and the analysis of the biosphere fluxes, as defined in the TransCom intercomparison studies (Gurney et al., 2002).The map shows the regions at TM3 resolution.Code: North American Boreal (NAB), North American Temperate (NATe), South American Tropical (SATr), South American Temperate (SATe), Northern Africa (NA), Southern Africa (SA), Eurasian Boreal (EAB), Eurasian Temperate (EATe), Tropical Asia (TrA), Australia (AUS), Europe (EUR).The ocean was considered as a single region.
However, as shown in Dahlke et al. (2013) for the seasonal information, the temporal evolution of the recorded signal is likely to be a robust feature among datasets and the temporal evolution of the modelled signal should resemble the reference dataset such that they can be evaluated by a metric that is independent from the absolute values of the time series.Because of the aforementioned reasons, we focused on metrics based on information on time and sign of changes as indicated by the satellite data.
With respect to the seasonal signal, as a first step, grid cells with only one detected growing season per year were selected by analysing the autocorrelation of the seasonal record and its significance.The shape of the seasonality of vegetation activity was then characterized by two robustly identifiable and meaningful phases of the phenological cycle: the time of the beginning of the vegetative growing season, hereafter referred to as time of onset (t-onset), and the time of the maximum FAPAR signal (t-max) (Randerson et al., 2009).Data and model signals characterized by mean amplitude of the seasonal record within 1 % of total FAPAR range were ex-cluded from the analyses.The definition of the beginning of the growing season is a subjective matter and a direct and precise link to ground-level observation is difficult to identify (Lucht et al., 2002;Maignan et al., 2008;Verstraete et al., 2008).Analogously to the method of estimating the beginning of the net CO 2 uptake reported in Sect.2.4.1, the proxy of the time of onset of vegetation activity is calculated on the seasonal signal, and corresponds to the point in time of the upward zero-crossing point of the seasonal curve (see Fig. A1).
Linear differences of the most frequent month of time of onset or maximum of FAPAR were computed between model and data.Consequently this metric ranges between one (no difference) to zero (6-month difference).The length of the growing season was not used as additional trait because it is poorly defined from satellite data as autumnal leaf colouring and the simultaneous presence of living and dead leaves confounds the satellite signal, in particular in temperate regions (Estrella and Menzel, 2006;Menzel et al., 2006).

Interhemispheric gradient and trend of atmospheric CO 2
The long-term trend in atmospheric CO 2 (C-LTT), given known fossil fuel, land-use change emissions and net ocean carbon fluxes, is an indication of the long-term net C balance of the terrestrial biosphere (Prentice et al., 2000;le Quere et al., 2009).The trend was computed from the mean annual values of the de-seasonalized signals and compared directly to the observations for stations covering the period 1982-2008(Table 1).The interhemispheric gradient in atmospheric CO 2 abundance (IHG) measures the north-south differences in atmospheric CO 2 caused by changing balance of the increasing fossil-fuel emissions in industrialized regions and the net ocean and land-C uptake.For each year, this trait was computed by subtracting the observed and modelled annual CO 2 concentration at the South Pole station (SPO) from the respective station concentrations, as in Cadule et al. (2010).
The metric was based on the comparison of the standard deviation of modelled and observed data.

Trend of vegetation activity
Similar to atmospheric CO 2 , vegetation activity trends were computed from modelled and reference data.Beck et al. 2011 indicated that the GIMMS-NDVI dataset is suitable for assessing temporal changes of vegetation activity.However, due to the unknown uncertainty of the absolute NDVI values, the selected trait does not compare numerical trends.Instead, the selected metric focusses on the robust trends in the data and determines the spatial patterns of positive, negative, or no significant trend in vegetation signal from the GIMMS-NDVI dataset and compares this to the pattern in modelled FAPAR (Table 3).For each grid cell, the metric calculation was performed on annual values of the de-seasonalized vegetation time series.The non-parametric Mann-Kendall test was used to determine whether a positive (greening), negative (browning) trend or no significant trend was detected (two-tailed statistic).The advantage of this approach is that it is robust against satellite drift and high-model internal variability that is, for instance, induced by high variability in the climate simulated by an Earth system model.At the grid-cell level, the metric is a binary score which measures whether the model and data show a significant trend of the same sign.The global-scale metric is then a ranking of a percentage agreement for cells of a particular trend class.

Quantification of interannual variability: atmospheric CO 2 and vegetation activity relationship with land climate pattern
The relationship between the seasonality of phenology and local climatic drivers at grid-cell level was explored using the annual variations of the time of beginning of the growing season (t-onset; Table 2).The time series for the SeaWiFS-FAPAR data is too short to allow for a trend analysis.Therefore the correlation of the t-onset with the annual temperature, given the annual SPI as conditional variable, was taken as a proxy.A ranking metric, analogous to the vegetation activity trend metric, was computed according to cell-by-cell agreement in terms of sign of the statistics, hence according to significantly positive, negative, or non-existent correlation.
Interannual variability in vegetation activity was assessed using de-seasonalized signals obtained from the GIMMS-NDVI/modelled FAPAR aggregated to the Transcom3 land region.Cross correlations between monthly records of vegetation activity and regional climatic variables, temperature and SPI, were computed with lags up to 24 months (Table 3).The South American Tropical region, Tropical Asia regions, and grid cells with dominance of tropical forests in Africa are excluded by the analysis (see Sect. 2.1.2).
The same approach was used to measure the relationship between atmospheric CO 2 growth rate and land surface climate (Table 3).The atmospheric CO 2 growth rate is well known to provide information on the interannual variability of the biospheric response to climate variability and in particular land response at the ENSO time scale (Keeling et al., 1995;Le Quere et al., 2003;Peylin et al., 2005).However, most of the land surface climate shows some coherence with this large-scale climatic feature (Buermann et al., 2003), such that the CO 2 signal in the atmosphere could be perfectly correlated, instantaneously or lagged, with climate over most of the land regions.To reduce this problem, an empirical orthogonal function (EOF) decomposition of the atmospheric CO 2 records, obtained by transporting the "inverted fluxes" from each land region, was computed.The three most contributing land regions (to at least 80 % of the variability in the observed total signal) for selected monitoring stations were determined and only these were used in the analysis (see Appendix C).
The obtained statistically significant cross correlations from data and model (vegetation and atmospheric CO 2 growth versus regional climate) were compared with a correlation metric in order to test if the model is able to return the coupled patterns with time lags (see Appendix C).
The use of inverted fluxes to determine the most contributing regions at interannual time scale and for the EOF decomposition does not affect significantly the results in terms of model behaviour evaluation.However, it changes the degree to which the observations can effectively constrain the model if in the model domain a region contributes less than inferred from the inverted fluxes.
The last selected feature of the carbon cycle uses the CO 2 growth rate to compute an apparent land-C cycle sensitivity to global temperature anomalies, defined as the slope of the annual CO 2 growth rate versus the aggregated annual land surface temperature.The record at the station of Mauna Loa (MLO) was used as proxy of evolution of globally averaged atmospheric CO 2 concentration (Zeng et al., 2005).D. Dalmonech and S. Zaehle: Towards a more objective evaluation of modelled land-carbon trends

The baseline benchmark and the final scores
The reference minimum (baseline benchmark) concept applied in this study compares the skill of the model under investigation with the score of the metric obtained assuming a land biosphere that does not systematically contribute to any signal.For the C-cycle analyses, the baseline benchmark is set to be a biosphere without a terrestrial C-cycle ecosystem, implying that the signal or trend in the observations is driven by fluxes of fossil fuel and net ocean only (no-land case).Since this lower benchmark is applied based on the same TM for all the simulations, this further reduces the potential errors introduced by transport modelling uncertainties.Scaling the metric to the lower benchmark highlights the contribution of modelled land fluxes to match the observed trait of the data under consideration.In other words, the final score number is the metric for an individual trait, cleaned by the contribution of other CO 2 source/sink other than the modelled land fluxes.Only for the CO 2 drawdown test (C-dd; Table 2) is the baseline benchmark set as zero trend (i.e.there is no trend on land).A similar concept is applied for the vegetation activity traits: the lower benchmark is provided by the case with constant vegetation (no-change case).Only in the case of timing of the vegetation onset and maximum (t-onset and t-max traits; Table 2) is the baseline benchmark set as the maximal possible difference (6 months).
The final global model metrics M for each trait are computed as follows: first, the metrics are computed for the CO 2 signal in each monitoring station and in correspondence of each Transcom3 region for the vegetation-related traits (M or in Eq. 1).The same statistic is also applied for the null-model case to return the numerical metric value of the trait for the baseline benchmark case (M base in Eq. 1).The original metric is then scaled to a new, normalized metric (score) between 0 and 1 according to Eq. ( 1), where 1 indicates perfect datamodel match and 0 indicates that the model is not able to perform better than a system without the representation of the land biosphere.
Secondly, the model performances are summarized in a polar plot that goes radially from 0 (less skillful model), in the centre, to 1 (skillful).The global scores are derived as follows: for the satellite-based scores, the global score is the average of the scores computed for each Transcom3 region, with the exception of the ranking-based scores, which are already computed at global scales.For the CO 2 -station-based scores, the scores for each station were first averaged by latitudinal band, and the global score was then derived as the average of the scores computed by latitudinal band.

Results and discussion
In the following, we discuss the results of the above framework at the example of the JSBACH model.The results for the individual traits are summarized in Fig. 2. Table A1 reports the results of the baseline benchmarking for comparison.Table 4 reports results per latitudinal band with regards to CO 2 traits, and global scores for the vegetation traits.
In this section an in-depth analysis of the mechanisms behind data-model mismatch is not performed, but what we can learn from observations and how can we use them to quantify data-model differences is shown, as well as what the benchmarking framework can tell about potential areas of model deficiencies.

Seasonality of atmospheric CO 2
The Taylor diagram (Fig. 3a) reports the data-model correspondence in terms of phase and amplitude of the mean seasonal cycle (MSC).JSBACH is in general capable of simulating the phase of the seasonal cycle of CO 2 , with the exception of the stations south of the equator, which tend to be out of phase.At those stations, ocean fluxes dominate the signal, which can be seen in the large difference between the low original and the higher scaled metric (Table 4).The anticorrelation of the model's seasonality might further indicate either (or both) a high contribution of the signal from the Northern Hemisphere or reveal effective out-of-phase seasonal land-C fluxes.The in-depth analysis of the regional contribution to the mean seasonal cycle (the MSCc trait) indicates that the Eurasian Boreal and Eurasian Temperate regions slightly but systematically contribute more to the signal in the stations above the 50 • N than inferred from observations (Fig. A2).At the southern stations, the model signal from the South American Temperate region clearly dominates the ocean signal (Fig. A2), suggesting that this region has a seasonal cycle of net land-atmosphere C fluxes inconsistent with the atmospheric record.This inconsistency leads to the low scores in the southern latitudinal band (Fig. 2, Table 4).The model clearly overestimates the amplitude of the MSC across the global network of stations, as can be seen in the latitudinal gradient of the amplitude of the mean seasonal cycle (Fig. 3b).Although uncertainties in the transport model could partially contribute to this, the steep drop of the CO 2 concentration during the summer months (data not shown) are an indication that an overestimation of spring C uptake (i.e.too large global gross primary productivity) is responsible for the overestimation of the amplitude.2 and 3.The polar plot goes radially from 0 (less skillful model), in the centre, to 1 (skillful).Since we only consider one model here, we refer to the threshold value of 0.5 to indicate the model good/high performances and less good/low performances.

Seasonality of vegetation activity
Figure 4 shows that JSBACH simulates the time of onset with a systematic lag of 1 to 2 months over large areas of the Northern Hemisphere (NH).A major exception is the east and south of the North American Temperate region, where the model tends to lead the observed growing season.Given the monthly temporal resolution of the analyses, these results in the NH are still in line with the good performance in terms of phase correspondence of the MSC of CO 2 at the northern D. Dalmonech and S. Zaehle: Towards a more objective evaluation of modelled land-carbon trends stations (Fig. 3a).However, results indicate that there is space to improve the modelled phenology.
In large parts of the tropical latitudinal band, most of the modelled signal is flat, in contrast to the detected seasonal cycle recorded in the SeaWiFS data in seasonally dry topical areas (Fig. 4).In these areas, which are dominated by raindeciduous vegetation, the occurrence of one growing season is driven by seasonality in rainfall.Similarly, the model signal in the Australian scrubland does not show any clear seasonality in contrast of the observations.The flat tropical signal and the detected differences up to 3-4 months in some southern regions are responsible for the low aggregated global model performance (Fig. 2, Table 4).The vegetation classes contributing most to the lower performances are deciduous broadleaved forests and grassland, probably mostly due to their geographical distribution and presence in drought-prone areas (see Fig. A3).
The timing of the maximum analysis (t-max; data not shown) returns similar geographical pattern to the "t-onset" trait, but the differences are generally slightly higher.This aspect partly relates to the less well defined nature of the timing of the maximum in regions with several months of full foliar coverage.At the global scale, there are no discernible differences between the two scores (Fig. 2, Table 4).These results show that the seasonality in the model is slightly lagged in time, but without strong distortions in the signal in the first period of the growing season in the Northern Hemisphere.An improvement in phenology parameterization in areas dominated by raingreen vegetation in the seasonally dry tropical latitudinal band and drought-prone shrublands is necessary.It is unclear from the analysis, however, whether a too low sensitivity of raingreen vegetation to soil moisture stress or a too low seasonal cycle of simulated soil moisture as a result of problems with the modelled soil hydrology is the cause of this phenomenon.

Monthly CO 2 trend
As example for the trend in the monthly CO 2 signal (MT), Fig. 5a displays the trend computed for observed signal at the Alert station (ALT) together with the contributions of the net land and ocean C fluxes and fossil-fuel emissions.For the selected northern stations, the observational analysis shows that, in particular in the summer months (June-July), the land is the most dominant contributor to the tendency towards a more pronounced seasonal cycle.That is to say, increased monthly land-C uptake rather than changes in ocean fluxes and fossil-fuel emissions are responsible for this trend.This feature is particularly strong in the period 1982-1991 and consistent across the selected stations, although this trend is not always statistically significant for all the months (Fig. 5a).The monthly CO 2 trend in the period 1992-1997 is less clear (data not shown), while the negative trend of the summer uptake occurs in that of 1998-2006, albeit weaker than for 1982-1991.The latter pattern likely reflects the weakening of the positive land warming effect on phenology during the growing season, which was particularly apparent in the 1980s (Myneni et al., 1997).
Using an additional TM simulation, we verified that the observed weakening of the negative trend in summer is indeed mainly land induced and not induced by the interannual wind fields used in the transport model.The experimental results with constant wind (data not shown) confirmed that interannually varying transport can contribute but does not overwhelm the land-based trends in monthly CO 2 concentrations.Potential trends in the seasonality of fossil-fuel emissions (Blasing et al., 2005) are unlikely to strongly affect this trend (data not shown).
Figure 5b exemplarily shows that JSBACH is able to qualitatively return the seasonal-like shape of the monthly CO 2 trend and the detected land-C uptake weakening, but it is not able to fully explain the observed signal (Fig. 2 and Table 4).Since the selected metric analyses the correspondence of phase of the monthly trend, the non-perfect match could be attributable to divergence in observed and modelled climate sensitivities of photosynthesis and respiration.

Interhemispheric gradient and long-term trend of atmospheric CO 2 (Table 3)
The interhemispheric gradient trait (IHG), which evaluates the interannual variability of the net land-atmosphere C exchange, agrees well between JSBACH and the observations (results not shown, but see Fig. 2).However, the analysis on the long-term C balance trend (C-LTT) shows that JSBACH substantially overestimates the long-term trend compared to observation (Fig. 6a), such that its score is actually lower than the baseline benchmark at all stations (Fig. 2, Table 4, Table A1).Since this detected data-model difference is unlikely to be due to uncertainties in fossil-fuel emissions or ocean net carbon fluxes (le Quere et al., 2009), this result is due to a substantial underestimation of net land-C uptake.

Vegetation activity trend (Table 3)
Figure 6b displays the decadal patterns of the normalized annual vegetation activity time series (GIMMS-NDVI and JSBACH-FAPAR), excluding evergreen tropical forests, glaciers, and desert areas.There appears to be a good qualitative global agreement, suggesting that phenological limitations are not likely the cause for the aforementioned too low increase in land C.However, the good agreement of the global vegetation pattern is partly due to the compensation of errors (Fig. 7).The observed, spatially extensive positive trend in vegetation greenness in the period 1982-1991 is not fully captured by the model because several areas have either no trend or even a negative trend (in parts of the South America, Australia, and South East Asia).During the years 1992-1997, no clear geographical pattern is detected (data not shown).observed positive trend in the period 1982-1991 appear to have no or even negative trends.This phenomenon is only partly reproduced by JSBACH: in the northern boreal regions and in the Southern Hemisphere, particularly in the South American Temperate region, the negative trends are simulated.
The observed large-scale positive trends in vegetation activity during the period 1982-1991 is consistent with previous results (Myneni et al., 1997;Zhou et al., 2003).However, our analysis underlines that the observed positive warming effect on greening has not been persistent in time, but switched toward a neutral effect in the years 1992- served negative pattern in the SH is generally consistent with the trends in evapotranspiration and in particular soil moisture reported in Jung et al. (2010) even though our analyses ends in 2006, while theirs ends in 2008.Several factors might contribute to the observed overall behaviour following the El Niño event in 1997.These include recurrent drought events, D. Dalmonech and S. Zaehle: Towards a more objective evaluation of modelled land-carbon trends pest outbreaks, and severe fire events over several regions responsible for the detected negative trends in boreal areas and the weakening of the summer C uptake that we reported in Sect. 3.2 (van der Werf et al., 2004;Angert et al., 2005;Goetz et al., 2005).
The low final score of JSBACH in this metric (Fig. 2, Table 4) is in particular the result of the recurrent large-scale negative trends in several areas in the SH and in south-east Asia during the years of 1982-1991and 1998-2006 (Fig. 7 (Fig. 7).
The non-quantitative nature of this comparison prohibits a too strict interpretation of the mechanisms behind the modeldata differences.It is unclear whether these differences are caused by the phenological scheme of the model, land-use change protocol, or other factors such as the drought response or fire processes.However, the disagreement in the sign of the trend can be attributed to model deficiencies, and the ranking metric provides a quantitative measurement of the detected disagreement.
As aforementioned, despite the spatial model-data disagreement, at global scale the errors in the model compensate to return a positive vegetation activity trend.Assuming that vegetation activity is linked to plant productivity, the underestimation of the net land-C uptake in JSBACH (Sect.3.3) is likely the consequence of a too high soil-C turnover rate.

Growing season response to local climate (Table 2)
The timing of the CO 2 drawdown point (C-dd) and the onset of vegetation greening (t-onset) represent two independent proxies to measure the effects of land warming on spring phenology (Badeck et al., 2004;Menzel et al., 2006).There is a tendency towards earlier CO 2 drawdown at the stations of STM, BRW, and ALT (Fig. 8a), although this trend is statistically significant only for the latter two stations (P < 0.10).Such a negative trend in time is consistent with the advance of spring phenology induced by land surface warming (Fig. 8b): the correlation between climate variability and the timing of vegetation onset is significantly negative with annual temperature.Despite this trait constituting an emerging empirical relationship, the negative correlation mainly in the boreal areas, as clearly shown in Fig. 8b, is consistent to an earlier green-up in warmer years.
JSBACH does not show any discernible trend in any of the three stations (Fig. 8a, example for BRW), despite the fact that it returns a similar correlation pattern at the start of the growing season with local temperature (Fig. 8c), in particular in the extratropical northern areas.The final, global score for this trait is very low, despite the good visual matching, because of the low cell-by-cell correspondence (Fig. 2, Table 4).These two analyses underline that the model, although it realistically simulates the beginning of the growing season (Sect.3.1), is likely to respond too weakly to land surface temperature anomalies.

Interannual variability of vegetation activity regional climate (Table 3)
The vegetation activity is analysed separately for each climatic driver.It is not possible to clearly disentangle temperature and precipitation effects.Nonetheless, the analysis suggests that the NDVI at high latitudes is mainly correlated with surface air temperature, where plant growth is mainly limited by temperature.An exception to this pattern is Eurasia Boreal (EAB), which shows a higher co-variation of vegetation activity with precipitation pattern.NDVI in regions with dominance of shrubs/grassland is mainly driven by precipitation anomalies -in agreement with previous studies (Groeneveld and Baugh, 2007).Figure 9a-b presents exemplary the computed cross correlograms for Eurasia Temperate (EATe) and the North American Boreal (NAB).The pattern returned in NAB, which is common to NATe and EUR, reveals a strong covariation of vegetation activity and temperature in both data and model.However, the model behaviour suggests a strong correlation with temperature even in areas where the observations suggest a stronger covariation with precipitation (measures as SPI), as for instance in the EATe.One notable feature in these regions is that JSBACH shows a larger delay in the response of vegetation activity to SPI than observed, with differences of the order of 2-3 months (EAB, NA, SA).The final JSBACH score is good for this trait (Fig. 2, Table 4) when considering an average performance over all the regions.Low scores are obtained in precipitation-driven areas mainly due to the different time lag of the response, which corresponds well with the aforementioned too low sensitivity of raingreen vegetation to seasonal drought.
The selected trait underlines the tendency of the system to respond in a specific way to external forcing/climate, or to respond instantaneously or with some lag, and the metric is selected in order to be sensitive to model-data difference of phase rather than absolute differences of climate and vegetation activity.An important aspect emerging from this simple trait is that the detected delay could hide an incorrect representation of the effects of soil drought on vegetation growth or soil hydrology.The same regions in which the model shows a delayed response to precipitation also show a persistent negative trend in vegetation activity (Sect.3.4, Fig. 7).This pattern is evident in particular in South East Asia, South America Temperate, and Australia, which are mainly dominated by grasslands, shrub lands, or crops.Even if other non-climatic effects at smaller spatial scales (i.e.land degradation and management practices, and fire recurrence) might affect vegetation cover and activity (Foley et al., 2005), the longer lag in the co-variation of vegetation and precipitation might be caused by the same model fault responsible for the mismatch in the vegetation trends.From a biogeophysical point of view, this model feature could also indicate a less reliable capability of the land surface model to return memory effects of the vegetation-precipitation pattern Table 3. List of atmospheric CO 2 and vegetation activity traits used for the analyses at the interannual time scale (higher than annual frequency).A detailed explanation of metrics can be found in Sect. 2 and Appendices B and C).  4 (1982-91/1992-97/1998-2006) negative/no trends Veg.activity ∼ regional V-CL covariance with time lag Pearson correlation r 2.4.5 3.5.2drivers relationships emerging in the real Earth system (Alessandri and Navarra, 2008;Hirschi et al., 2010) in a coupled Earth system model setting.

Interannual variability of CO 2 growth rate and regional climate (Table 3)
The analysis of the CO 2 growth rate revealed distinctly different behaviour in two latitudinal bands: in tropical latitudes, the correlation structure is similar between observations and model.However, JSBACH performs less well in particular where the CO 2 growth rate is mainly correlated to temperature anomalies, as for instance in North American Boreal and North American Temperate regions (Fig. 9d).
It is noteworthy that this model deficiency occurs despite the good correspondence in terms of vegetation temperature (Fig. 9b).One potential reason for this phenomenon might be modelled temperature sensitivities of ecosystem respiration parameterization, particularly soil-C decomposition -inconsistent with the observations.However, it is also possible that the CO 2 signal at the monitoring station is influenced by net land-atmosphere C fluxes in other extratropical regions, Fig. 7. Vegetation activity trend according to the Mann-Kendall statistics for the period of references is reported for GIMMS-NDVI and modelled FAPAR.Red: positive monotonic trend (P < 0.10); blue: negative monotonic trend (P < 0.1); white: no significant trend; grey: areas masked out from the analysis (grid cells with dominance of tropical forests, dominance of desert and ice).
Table 4. Final scores of atmospheric CO 2 and vegetation activity.Atmospheric CO 2 scores are reported per latitudinal band.The numerical values prior of the scaling to the baseline benchmark are reported in brackets where they differ from the final scores.For the acronyms refer to Tables 2 and 3.
Atmospheric obscuring the local relationship.In general, the observed weak correspondence for the station of BRW is also observed for the station of ALT, while for the stations between 60 • N and 25 • N, no statistically significant co-variations were found in observations (data not shown).In all stations, where most of the contribution to the observed concentrations is from tropical regions (e.g.South American Tropical, Northern and Southern Africa), the results reveal a good correspondence of the pattern of the covariance.However, in contrast to the observations the modelled correlation is weaker and sometimes not significant (Fig. 9c).A comparison of the time series of atmospheric CO 2 and land surface climate (data not shown) reveals that the modelled time series exhibits more variability than observed and explained by, for instance, ENSO-related events.The apparent global land-C sensitivity to land surface temperature anomalies (C-Clsens) computed for the model is not significant and very shallow (Fig. 10), in contrast to the observed sensitivity (4.2 Pg C yr −1 K −1 ) (P < 0.01).It is not possible to determine to what extent the missing fire module in the current version of the model or the use of a specific transport model contribute to the observed-modelled trait mismatch involving the CO 2 growth rates.However, the very low sensitivity returned by the model is comparable to the baseline benchmark (assuming a neutral biosphere; see Table 4), suggesting a deficiency in the model rather than a conceptual error in the methodology.q q q q q q q 2010 q q q q q q q q q q q q q q q q q q q q q ALT BRW STM Y] q q q q q q q 2010 q q q q q q q q q q q q q q Obs JSBACH T q q q q q q q q q q q q q q q q q q q q q q q q q q q 1980 1990 2000 2010 140 150 160 170 180 190 years DOY q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Obs JSBACH C-dd [doy] BRW a q q q q q q q q q q q q q q q q q q q q q q q q q q q 1980 1990 2000 2010 140 150 160 170 180 190 years DOY q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Obs JSBACH The CO 2 growth rate is the result of several concurrent biospheric and anthropogenic signals with dominance of land contributions, coming from several areas on the global land.Because of this large contribution from land, and the detailed regional analysis we have performed, the CO 2 -growth-rate- based traits are a useful diagnostic to indicate potential conflicts between model and observations that deserve further investigation, even though a process attribution is not possible without the use of further data streams.
As suggested by Rafelski et al. (2009), a similarity in the climate sensitivity of the underlying C processes at interannual and decadal time scales is likely to exist and to be mostly attributable to the land biosphere.This would imply that the poor results obtained from the JSBACH model in the land-C sensitivity trait could also indicate a potential for a model deficiency at longer temporal scales with respect to the net land-C exchange.
Common to most of the evaluation schemes, data and model errors are not considered explicitly in the mathematical formulation of the metrics.This constitutes a major limitation of this and other evaluation frameworks.As stated in the introduction, uncertainties in observations or reference datasets are not always provided or quantified, and this poses challenges for the computation of model-data/model-model differences and the significance of these distances.Structural model errors can only be assessed with a dedicated study investigating the effect of alternative model structures on surface fluxes, which is beyond the scope of a benchmarking scheme.Conversely, the scheme proposed here can help to quantify the model structural error if different model variants are available.Where possible, we have minimized the conceptual difference between model and observation by only considering those features (traits) that can be robustly compared and have isolated the contribution of the land versus oceanic and anthropogenic influences.The proposed evaluation framework defines bounded metrics that allows stating whether and how much the model adds information to the simulation of carbon cycle trends and thus whether the model lies in the range of acceptable performances.This provides an additional constraint to the model performance.

Concluding remarks
Pertinent information on current C-cycle-related processes contained in the atmospheric CO 2 record and the satellitebased records of vegetation activity were compiled and synthesized into easily identifiable traits and a framework of intuitively comparable metrics.The results of the exploratory analysis of C-related processes and climate variability were presented with emphasis on the robustness of the information content of the observations, making use of both atmospheric CO 2 concentration and vegetation activity at the appropriate time-and spatial scale of a global land surface model.The results show that the simultaneous use of the atmospheric CO 2 record and satellite-based vegetation activity as two independent datasets help to identify the sources of datamodel mismatch in terms of regional source of errors, or to detect potential compensation errors.In particular, the separate analysis of the atmospheric CO 2 and vegetation activity circumvent the problem that the atmospheric CO 2 retains the net effect of both vegetation activity (i.e.photosynthesis) and ecosystem C release response.
The use of a baseline benchmark with a clear ecological meaning was shown to be a valuable approach to provide a more robust and objective quantification of data-model disagreement.In addition, scaling the metric against a reference case allows more independence by the section of a specific metric and avoidance of misleading interpretation of the numerical score.
A key component of the evaluation framework developed here is that it is designed to be suitable and sensitive to evaluate global land surface models both in offline modei.e. when driven by observed climate variability -and fully coupled to Earth system models with a different climate and climate variability.Therefore, in addition to providing metrics for key traits that describe climatological mean variables, we use a range of correlational metrics to analyse the climate sensitivity of key carbon cycle traits.We demonstrate that these metrics provide insight into the realism of the carbon cycle simulation that go beyond an evaluation of mean states and trends.In this paper, we described the framework and applied it to an example model.The next step will be the use of this framework to evaluate online and offline versions of JSBACH.Nonetheless, even application of the benchmarking framework for the evaluation of the JSBACH model in offline mode already allows certain conclusions particular to the model: -The traits at seasonal time scales showed that highlatitude terrestrial ecosystem patterns are a major strength of JSBACH, with good performance both in terms of mean vegetation activity and mean seasonal CO 2 cycle in the high-latitudinal stations.Lower performance of mean pattern of phenology occurs in the Southern Hemisphere, in particular in shrub-dominated areas and in deciduous broadleaved forests in Southern Africa.A systematic overestimation of the seasonal cycle of CO 2 points to a too high magnitude of the seasonal land gross C fluxes.
-The observed weakening of the positive warming effect in vegetation in the NH and the trend toward a neutral/negative effect in the SH pronounced in last decade are not fully captured by the model, both in CO 2 and vegetation activity traits.The analysis of vegetationclimate covariance revealed that the modelled ecosystem response is primarily driven by temperature anomalies, suggesting that this discrepancy might be associated with an incorrect sensitivity of vegetation to precipitation anomalies at interannual time scales.
-While the analysis of CO 2 growth rate and climate drivers returned a weak covariation of the atmospheric signal with climate on selected regions on land, the model deviates strongly from the observations both in terms of the long-term trend of the atmospheric CO 2 , and therefore the implied net land-C uptake, and the apparent interannual land-C sensitivity to temperature anomalies.The combined analysis of CO 2 with the vegetation trend analysis suggests that a too high soil-C turnover rate might be responsible for the underestimation of net land-C uptake.

Appendix A Computation of SPI index
The SPI is the transformation of the precipitation time series into a standardized normal distribution (z distribution).First, a gamma distribution is fitted to the cumulative precipitation frequency distribution.The gamma distribution has been used to fit the empirical frequency of data.Since the gamma distribution is undefined for null values of the variables, the cumulative probability has been corrected according to Lloyd-Hughes and Sanders (2002).Using an equiprobable transformation, the cumulative probability function of the gamma distribution is then transformed into the normal distribution function.In addition to the classical statistics as the Pearson correlation coefficient (r), the squared correlation coefficient (r 2 ), cross correlation, and standard deviation statistics (σ ), metrics were selected as combination of some of the previous statistics and built ad hoc for the specific trait analysed.
As reported in Taylor (2000): In this metric, more weight is given to the capability of the model to return the right phase of the trait rather than the amplitude.σf = σ m /σ 0 is the ratio between the modelled standard deviation and the observed standard deviation of the trait of interest.R 0 is the maximum correlation achievable and assumed to be 1.
Comparison of variability of the signal via standard deviation: (B2) Linear differences metric: where O is the observed value and M is the modelled value.
It is applied to the most frequent month of the variable observed (0 when the maximum difference of the variables is six months, 1 when no differences occur).Single value comparison metric: where O is the observed value and M is the modelled value.At exception of the Taylor statistics, all the other metrics are symmetric.
Map cell-by-cell comparison metric: The ranking metric specifies the number of agreement cells against the total observed cell belonging to a specific class.The final score is the average over the selected classes.Three classes were used in our framework: no statistically significant relationship (i.e.no correlation, no trend detected), positive relationship (i.e.correlation/trend), and negative relationships (i.e.correlation/trend) detected.
In terms of lower benchmark, the case of constant vegetation has been used.This is the equivalent to analysing the returned trend against a null hypothesis of non-changing vegetation.The average score obtained under this setting is equal to 0.3, considering the agreement cell-by-cell to each single class.The score of the model is thereinafter scaled to this lower benchmark.

Fig. 2 .
Fig. 2. Global atmospheric CO 2 and vegetation activity scores for the JSBACH model according to the list of traits in Tables2 and 3.The polar plot goes radially from 0 (less skillful model), in the centre, to 1 (skillful).Since we only consider one model here, we refer to the threshold value of 0.5 to indicate the model good/high performances and less good/low performances.

Fig. 4 .
Figure6bdisplays the decadal patterns of the normalized annual vegetation activity time series (GIMMS-NDVI and JSBACH-FAPAR), excluding evergreen tropical forests, glaciers, and desert areas.There appears to be a good qualitative global agreement, suggesting that phenological limitations are not likely the cause for the aforementioned too low increase in land C.However, the good agreement of the global vegetation pattern is partly due to the compensation of errors (Fig.7).The observed, spatially extensive positive trend in vegetation greenness in the period 1982-1991 is not fully captured by the model because several areas have either no trend or even a negative trend (in parts of the South America, Australia, and South East Asia).During the years 1992-1997, no clear geographical pattern is detected (data not shown).For the years 1998-2006, large areas with an

Fig. 5 .
Fig. 5. Monthly CO 2 trend in the station of Alert (Canada, ALT) for the periods 1982-1991 and 1998-2008.(a) Observations, as well as simulated contribution from fossil-fuel emission and net ocean fluxes (**P < 0.01).(b) Monthly record for observations and modelled data.Negative values for a specific month indicate a decrease of seasonal atmospheric CO 2 , indirectly linked to an increase of biosphere C uptake, and vice versa.
Fig. 6.(a) Long-term pattern of atmospheric CO 2 at the station of Mauna Loa (MLO); (b) normalized annual values of vegetation activity (excluding tropical, desert, and ice areas) for GIMMS-NDVI and modelled FAPAR.Period of reference 1982-2006.Dotted lines represent the linear trend computed on the normalized data (qualitative analysis).

Fig. 8 .
Fig. 8. (a) Atmospheric CO 2 drawdown points (C-dd) as computed at the station of Barrow (BRW) for observations and model.(b) and (c) partial correlation between time of onset and mean annual temperature computed for observations and JSBACH, for the pe-riod 1998-2005.Red: positive correlations (P < 0.1); blue: negative correlations (P < 0.1); white: no significant correlations; grey: areas masked out from the analysis (see text).
Fig. 9. (a) Cross correlation between precipitation pattern (SPI) and vegetation activity in Eurasia Temperate (EATe).(b) Cross correlation between temperature and vegetation activity in North American Boreal (NAB).(c) Atmospheric CO 2 growth rate in the station of Mauna Loa (MLO) and temperature pattern in the South American Tropical (SATr).(d) Atmospheric CO 2 growth rate in Barrow (BRW) and temperature patter in NAB.Dotted lines are confidence intervals at significant level of P < 0.05 (two-tailed statistics).

Fig. 10 .
Fig. 10.Apparent land-C sensitivity: CO 2 growth rate in Mauna Loa (MLO) versus global land surface temperature.Regression is significant at P < 0.01 for observations.Annual data points are omitted for clarity.
Fig. A1.Exemplar of determination of time of onset of the seasonal signal, zero centred, of the vegetation activity (SeaWiFS-FAPAR).Time series extracted from one grid cell located in North American Boreal.a b

Table 1 .
List of selected atmospheric CO 2 monitoring stations and satellite-based vegetation activity datasets used in the analyses, as well as the time period used for elaborations.

Table 2 .
List of atmospheric CO 2 and vegetation activity traits used for the analyses at the seasonal time scale.A detailed explanation of metrics can be found in Sect. 2 and Appendices B and C).
* Trait applied to the stations ALT, BRW, STM.

Table A1 .
Final scores of atmospheric CO 2 and vegetation activity for the baseline benchmark.Atmospheric CO 2 scores are reported per latitudinal band.