Numerical Terradynamic Simulation Group 7-2015 Assessment of model estimates of land-atmosphere CO 2 exchange across Northern Eurasia

A warming climate is altering land-atmosphere exchanges of carbon, with a potential for increased vegeta­ tion prodnctivity as well as the mobilization of permafrost soil carbon stores. Here we investigate land-atmosphere car­ bon dioxide (CO2) cycling throngh analysis of net ecosystem prodnctivity (NEP) and its component llnxes of gross pri­ mary prodnctivity (GPP) and ecosystem respiration (ER) and soil carbon residence time, simnlated by a set of land snrface models (LSMs) over a region spanning the drainage basin of NorthemEnrasia. The retrospective simnlations cover the pe­ riod 1960-2009 at 0.5° resolntion, which is a scale common among many global carbon and climate model simnlations. Model performance benchmarks were drawn from compar­ isons against both observed CO2 llnxes derived from sitebased eddy covariance measnrements as well as regionalscale GPP estimates based on satellite remote-sensing data. Published by Copernicus Publications on behalf of the European Geoseienees Union. 4386 M. A. Rawlins et al.: CO2 Exchange Across Northem Enrasia The site-based comparisons depict a tendency for overesti­ mates in GPP and ER for several of the models, particularly at the two sites to the south. For several models the spatial pattern in GPP explains less than half the variance in the MODIS MOD 17 GPP product. Across the models NEP in­ creases by as little as 0.01 to as much as 0 .79gC m “^yr“ ,̂ equivalent to 3 to 340% of the respective model means, over the analysis period. For the multimodel average the in­ crease is 135 % of the mean from the first to last 10 years of record (1960-1969 vs. 2000-2009), with a weakening CO2 sink over the latter decades. Vegetation net primary produc­ tivity increased by 8 to 30 % from the first to last 10 years, contributing to soil carbon storage gains. The range in re­ gional mean NEP among the group is twice the multimodel mean, indicative of the uncertainty in CO2 sink strength The models simulate that inputs to the soil carbon pool exceeded losses, resulting in a net soil carbon gain amid a decrease in residence time. Om analysis points to improvements in model elements controlling vegetation productivity and soil respiration as being needed for reducing uncertainty in landatmosphere CO2 exchange. These advances will require col­ lection of new field data on vegetation and soil dynamics, the development of benchmarking data sets from measurements and remote-sensing observations, and investments in fiitme model development and intercomparison studies.

The site-based comparisons depict a tendency for overestimates in GPP and ER for several of the models, particularly at the two sites to the south. For several models the spatial pattern in GPP explains less than half the variance in the MODIS MOD17 GPP product. Across the models NEP increases by as little as 0.01 to as much as 0.79 g C m −2 yr −2 , equivalent to 3 to 340 % of the respective model means, over the analysis period. For the multimodel average the increase is 135 % of the mean from the first to last 10 years of record (1960-1969 vs. 2000-2009), with a weakening CO 2 sink over the latter decades. Vegetation net primary productivity increased by 8 to 30 % from the first to last 10 years, contributing to soil carbon storage gains. The range in regional mean NEP among the group is twice the multimodel mean, indicative of the uncertainty in CO 2 sink strength. The models simulate that inputs to the soil carbon pool exceeded losses, resulting in a net soil carbon gain amid a decrease in residence time. Our analysis points to improvements in model elements controlling vegetation productivity and soil respiration as being needed for reducing uncertainty in landatmosphere CO 2 exchange. These advances will require collection of new field data on vegetation and soil dynamics, the development of benchmarking data sets from measurements and remote-sensing observations, and investments in future model development and intercomparison studies.

Introduction
Northern boreal regions are known to play a major role in the land-atmosphere exchange of CO 2 at high latitudes (Graven et al., 2013). During the Holocene the Arctic is believed to have been a net sink of carbon (Pries et al., 2012). During modern times, often referred to as the anthropocene (Crutzen, 2006), warming across the high northern latitudes has occurred at a faster rate than the rest of the globe (Serreze et al., 2006). The enhanced warming is attributable to feedbacks involving biogeochemical and biogeophysical processes (Chapin III et al., 2005;Serreze and Barry, 2011;Schuur et al., 2015). Warming may increase soil microbial decomposition, placing the large permafrost carbon pool at greater risk for being mobilized and transferred to the atmosphere as greenhouse gases (GHGs), thus providing a positive feedback to global climate (Dutta et al., 2006;Vogel et al., 2009;Schuur et al., 2009). Warming may also lead to longer growing seasons, contributing to increased plant productivity and ecosystem carbon sequestration (Melillo et al., 1993;Euskirchen et al., 2006). At the same time, warming may also lead to respiration increases through enhanced microbial activity and/or increased input of plant photosynthates into the soil (Högberg et al., 2001), offsetting any productivity increases and resulting in relatively low net carbon uptake (Parmentier et al., 2011). Satellite observations show broad greening trends in tundra regions (Myneni et al., 1997;Goetz et al., 2005;Zhang et al., 2008), suggesting a potential increase in the land sink of atmospheric CO 2 . Some areas, however, are browning (Goetz et al., 2006).
Research studies point to uncertainty in the sign, magnitude and temporal trends in contemporary land-atmosphere exchanges of CO 2 . A recent synthesis of observations and models by McGuire et al. (2012) suggests that tundra regions across the pan-Arctic were a sink for atmospheric CO 2 and a source of CH 4 from 1990-2009. However, a meta-analysis of 40 years of CO 2 flux observations from 54 studies spanning 32 sites across northern high latitudes found that tundra was an annual CO 2 source from the mid-1980s until the 2000s, with the data suggesting an increase in winter respiration rates, particularly over the last decade (Belshe et al., 2013). In an analysis of outputs from several models from recent terrestrial biosphere model intercomparison projects, Fisher et al. (2014) found that spatial patterns in carbon stocks and fluxes over Alaska in 2003 varied widely, with some models showing a strong carbon sink, others a strong carbon source, and some showing the region as carbon neutral. It is critical to understand the net carbon sink as recent studies suggest that with continued warming the Arctic may transition from a net sink of atmospheric CO 2 to a net source over the coming decades (Hayes et al., 2011;Koven et al., 2011;Schaefer et al., 2011;MacDougall et al., 2013;Oechel et al., 2014). In a study using a process model which included disturbances, Hayes et al. (2011) estimated a 73 % reduction in the strength of the pan-Arctic land-based CO 2 sink over 1997-2006 vs. previous decades in the late 20th century.
Recent studies have provided new insights into model uncertainties relevant to our understanding of the land-based CO 2 sink across Northern Eurasia. Examining several independent estimates of the carbon balance of Russia including two dynamic global vegetation models (DGVMs), two atmospheric inversion methods, and a landscape-ecosystem approach (LEA) incorporating observed data, Quegan et al. (2011) concluded that estimates of heterotrophic respiration were biased high in the two DGVMs, and that the LEA appeared to give the most credible estimates of the fluxes. In an analysis of the terrestrial carbon budget of Russia using inventory-based, eddy covariance, and inversion methods, Dolman et al. (2012) noted good agreement in net ecosystem exchange among these bottom-up and top-down methods, estimating an average CO 2 sink across the three methods of 613.5 Tg C yr −1 . Their examination of outputs from a set of DGVMs, however, showed a much lower sink of 91 Tg C yr −1 . Graven et al. (2013) point to specification of vegetation dynamics and nitrogen cycling in a subset of CMIP5 models as a potential cause for their underestimation of changes in net productivity over the past 50 years. These analyses highlight the need for comprehensive assessments of numerical model estimates of spatial and temporal variations in land-atmosphere CO 2 exchange against independent benchmarking data. A lack of direct flux measurements across northern land areas presents considerable challenges for model validation efforts (Fisher et al., 2014).
In this study we examine model estimates of net ecosystem productivity (NEP) and component fluxes gross primary productivity (GPP) and ecosystem respiration (ER) across the arctic basin of Northern Eurasia from a series of retrospective simulations for the period 1960-2009. Our analysis for the region is unique in its synthesis of a large suite of land-surface models, available site-level data, and a remotesensing product. Study goals are two-fold. First, using the available in situ data derived from tower-based measurements and the remote-sensing GPP product we seek to assess model efficacy in simulating spatial and temporal variations in GPP, ER, and NEP across the region. In doing so we elucidate issues complicating evaluations of model carbon cycle estimates across Northern Eurasia and, by extension, other areas of the northern high latitudes. Second, we estimate time changes in NEP and soil organic carbon (SOC) residence time and its controls as an indicator of climate sensitivity and potential vulnerability of soil carbon stocks. We focus the analysis and discussion on assessing how well the models capture the seasonal cycle and spatial patterns in GPP and ER flux rates, evaluating uncertainties in the net CO 2 exchange given reported biases in respiration rates, and in advancing understanding of the land-atmosphere cycling of CO 2 over recent decades.

Study Region
The spatial domain is the arctic drainage basin of Northern Eurasia which comprises all land areas draining to the Arctic Ocean, a region of some 13.5 million km 2 (Fig. 1). The basin covers roughly half of the Northern Eurasian Earth Science Partnership Initiative (NEESPI) study area, generally defined as the region between 15 • E in the west, the Pacific Coast in the east, 40 • N in the south, and the Arctic Ocean coastal zone in the north . Warming and associated environmental changes to this region are among the most pronounced globally (Groisman and Bartalev, 2007;Groisman and Soja, 2009). Tundra vegetation is common across northern areas, with boreal forest and taiga comprising much of the remainder of the region. Steppes and grasslands are found across a relative small area in the extreme southwest. Continuous permafrost underlies over half of the region. Sporadic and relic permafrost comprise the southwest portion of the domain. West to east, the Ob, Yenisei, Lena, and Kolyma rivers drain a large fraction of the total river discharge from the Northern Eurasian basin.

Modeled data
We used outputs from retrospective simulations of nine models participating in the model integration group of the Per-  (Table 3). Gridded PFTs are from the MODIS MOD12 product (Oak Ridge National Laboratory, 2014). Permafrost classes for each grid are drawn from the CAPS data set (International Permafrost Association Standing Committee on Data Information and Communication (comp.), 2003). mafrost Carbon Network. All simulation outputs available at the time of writing were included in the analysis (http: //www.permafrostcarbon.org). The simulation protocol allowed for the choice of a model's driving data sets for atmospheric CO 2 , N deposition, climate, disturbance, and other forcings (Tables 1 and 2). Simulations were run at daily or sub-daily time steps in some models and at 0.5 • resolution over all land areas north of 45 • N latitude. The present study focuses on analysis of spatial patterns and temporal changes in land-atmosphere CO 2 fluxes over the period . Quantities analyzed are GPP, ER, and NEP, defined here as NEP = GPP−ER, where a positive value represents a net sink of CO 2 into the ecosystem. ER is the sum of  heterotrophic respiration and autotrophic respiration as estimated by the models. In this study we follow the conceptual framework for NEP and related terms as described in Chapin III et al. (2005). For this Permafrost Carbon Network activity modeling groups are providing gridded data for permafrost regions of the northern hemisphere. The nine models examined here (full model names in Table 1) are the (1) CLM version 4.5 (hereafter CLM4.5, Oleson et al., 2013); (2) CoLM (Ji et al., 2014); (3) ISBA (Decharme et al., 2011); (4) JULES Clark et al., 2011); (5) LPJ Guess WHyMe (hereafter LPJG, Smith et al., 2001;Wania et al., 2009bWania et al., , a, 2010Miller and Smith, 2012); (6) MIROC-ESM (Watanabe et al., 2011); (7) ORCHIDEE-IPSL (Koven et al., 2009(Koven et al., , 2011Gouttevin et al., 2012); (8) UVic (Avis et al., 2011;MacDougall et al., 2013); and (9) UW-VIC (Bohn et al., 2013). Table 2 lists the model elements most closely related to CO 2 source and sink dynamics. These include model land cover initialization, time series forcings, light use efficiency, and CO 2 and nitrogen fertilization. Among the models there is a wide range of accounting for processes related to disturbances such as fire and land use change ( Table 2). All but two of the nine models (ISBA and UW-VIC) are considered to be dynamic global vegetation models (DGVMs), possessing the ability for vegetation to change over the model simulation. For ORCHIDEE, dynamic vegetation was not enabled in the simulation examined in this study. While studies that examine the overall ecosystem carbon balance (i.e. the net ecosystem carbon balance, NECB) are elemental to our understanding of the carbon cycle of Northern Eurasia, the present study focuses on the patterns in NEP and component fluxes GPP and ER, common in all of the models, in order to avoid the uncertainties given the range of model formulations related to the full carbon balance. Outputs from several of the nine models have been examined in other recent studies. The LPJG and ORCHIDEE were used in the synthesis of data and models presented by McGuire et al. (2012). JULES, LPJG, ORCHIDEE, and CLM4.5 participated in the TRENDY MIP . CLM4.5, ORCHIDEE, and LPJG were three of the eight models examined in the study of Dolman et al. (2012).

Flux tower eddy covariance data
Model estimates for GPP, ER, and NEP are evaluated against data from six eddy covariance flux towers in four research areas located across Russia. The data are contained in the La Thuile global FLUXNET data set (Baldocchi, 2008).    Adam and Lettenmaier (2003) and Adam et al. (2006) for precipitation adjustments, 9 Kalnay et al. (2006) for wind speed, 10 WATCH used for 190110 WATCH used for -1978WFDEI used for 1978WFDEI used for -2009 here by their locations: Chersky (CHE), Chokurdakh (COK), Hakasija (HAK), and Zotino (ZOT). Data from three towers are available for Hakasija; HAK1 is in an area of grasslandsteppe; HAK2 is grassland; HAK3 an abandoned agricultural field. Chersky and Chokurdakh are in northeast Russia in the general zone of tundra vegetation. Hakasija and Zotino are in an area of generally higher productivity in southern Siberia ( Fig. 1). Data are available for years at Chersky, Hakasija and Zotino, and 2003 at Chokurdakh. General characteristics of these sites are summarized in Table 3. In this data set GPP and ER are derived from an empirical model driven by field-based eddy covariance measurements of net ecosystem CO 2 exchange (NEE) using methodologies described in Reichstein et al. (2005).

Satellite-based estimates of GPP
Satellite-data-driven estimates of annual total GPP are also obtained from the MODIS (Moderate Resolution Imaging Spectroradiometer) MOD17 operational product (Running et al., 2004;Zhao et al., 2005). The MOD17 product has been derived operationally from the NASA EOS MODIS sensors since 2000 and provides a globally consistent and continuous estimation of vegetation productivity at 1-km resolution and 8-day intervals. MOD17 uses a light use efficiency algorithm driven by global land cover classification and canopy fractional photosynthetically active radiation (FPAR) inputs from MODIS. The product also uses daily surface meteorology inputs from global reanalysis data (Zhao and Running, 2010), and land cover class specific biophysical response functions to estimate the conversion efficiency of canopy absorbed photosynthetically active radiation to vegetation biomass (g C MJ −1 ) and GPP (Running et al., 2004). The MOD17 algorithms and productivity estimates have been extensively evaluated for a range of regional and global applications, including northern, boreal and Arctic domains (Heinsch et al., 2006;Turner et al., 2006;Zhang et al., 2008;Zhao and Running, 2010). We use the MOD17 Collection 5 product, which has undergone five major reprocessing improvements since 2000. The MOD17 data are used in this study as a consistent satellite-derived baseline for evaluating GPP simulations from the detailed carbon process models.

Site-level evaluations
Confident assessment of uncertainties in land-atmosphere CO 2 fluxes is dependent on robust comparisons of model estimates against consistent benchmarking data. We begin by assessing the seven models which provided estimates through 2005, along with MOD17 GPP product. Monthly GPP from the models and MOD17 are compared with the cumulative monthly tower values by extracting the model values for the grid cell encompassing each tower site. Error measures that are based on absolute values of differences, like the mean absolute error (MAE) and mean bias error (MBE) are preferable to those based on squared differences (Willmott and Matsuura, 2005;Willmott et al., 2011). Model performance is evaluated here using the MBE, defined as the difference between the model and observed values: j = C j −C obs , where C j is GPP, ER or NEP for model j and C obs is the observed tower value. As shown in (Fig. 2), MOD17 GPP agrees well with the tower estimates for Chersky and Chokurdakh, with MBE over the 3 years of −2 and −11 g C m −2 month −1 , respectively (Table 4). MOD17 GPP broadly agrees with the observations at Hakasija and Zotino. Average MBEs are 13 and 10 g C m −2 month −1 , respectively, for these sites with higher productivity than Chersky and Chokurdakh. Averaged across all models the error in GPP is 7, 34, 34, and 13 g C m −2 month −1 for Chersky, Chokurdakh, Hakasija and Zotino, respectively. The MBE for ER are 8, 35, 43, and 33 g C m −2 month −1 , respectively.
Overall the models simulate fairly well the seasonal cycle in GPP (Fig. 2) and ER (Fig. 3), including the timing of peak CO 2 drawdown. Modest overestimates are noted near growing season peak at Hakasija and Zotino. However, for all four sites significant over-and under-estimates in GPP and ER are also noted (Table 4). For the two sites in the south there is a tendency for overestimation in GPP and ER. All models overestimate both GPP and ER at Hakasija. Seven of the nine models overestimate GPP and ER at Zotino, with ER overestimated by a considerable degree. Overestimates in ER for Hakasija and Zotino during late summer and autumn are particularly noteworthy. An ANOVA test was carried out to determine whether model errors in ER exceed the errors in GPP. The tests confirm that ER errors are greater on average than the GPP errors for comparisons where (i) ER errors for all sites are pooled together and compared against GPP pooled across all sites and (ii) ER and GPP errors for the two northern sites are pooled and compared against ER and GPP errors from the two southern sites.
The tendency to overestimate ER leads to discrepancies in net CO 2 source (negative NEP) at Hakasija and Zotino, particularly in autumn (Fig. 4). Average NEP errors are −11 and −20 g C m −2 month −1 for Hakasija and Zotino, respectively (Table 4). Errors in the magnitude and timing of NEP prior to and following the dormant season are much smaller at Chersky, and to some extent Chokurdakh. However, a lack of available tower-based data during the colder months limits the robustness of our assessments during that time of year. We further evaluate model performance through two additional error metrics, the refined index of agreement (d r ) (Willmott et al., 2011) and the Nash-Sutcliffe coefficient of efficiency (E) (Nash and Sutcliffe, 1970). As described by Willmott et al. (2011), the refined index of agreement (d r ) involves the sum of the magnitudes of the differences between the model-predicted and observed deviations about the observed mean, relative to the sum of the magnitudes of the perfect-model (model predicted = observed) and observed deviations about the observed mean. It is bounded between −1 and +1. When d r equals 0.0, it signifies that the sum of the magnitudes of the errors and the sum of the perfect-model-deviation and observed-deviation magnitudes are equivalent. Like d r , the Nash-Sutcliffe E considers observed deviations within the basis of comparison. For both metrics, values closer to 1 indicate higher model accuracy. Nash-Sutcliffe's E is also positively correlated with d r . Values of E less than zero occur when the residual model variance is larger than the data variance. Table 3. Flux tower sites from the LaThuile data set (Baldocchi, 2008) used in this study. Site Hakasija consists of records from 3 sub-sites which all fall within the same RCN model grid. Each sub-site is represented with a different symbol in Figs. 2c, 3c, 4c. GPP and ER in the La Thuile data set are calculated using methodologies described in Reichstein et al. (2005  A wide range of model performance is evident from Table 5. As with the mean errors shown in Table 4, agreements with observations are generally better at Chersky and Chokurdakh than Hakasija and Zotino. ER errors are also greater than GPP errors. Nash-Sutcliffe Es are negative for all models for both GPP and ER at Hakasija, and for most of the comparisons at Chokurdakh. Models CLM4.5, ISBA and UW-VIC exhibit the largest disagreements among the seven models for which estimates are available over the 2002-2005 period.

Regional-level evaluation of model GPP
Estimates from the MOD17 product provide a temporally and spatially continuous benchmark to assess model simulated GPP over the study domain. Average annual-total GPP from MOD17 over the period 2000-2009 is shown in Fig. 5. The MOD17 product clearly captures three distinct land cover zones over the region, representing: (i) grasslands across the south; (ii) boreal forests in the center of the region; and (iii) tundra to the north. Highest production occurs in the western forests where mean annual temperatures are higher. Both the steppe and tundra areas show annual GPP of less than 300 g C m −2 yr −1 . Areas of low productivity in high elevation areas to the north are well delineated. The spatially averaged mean across the region is approximately 470 g C m −2 yr −1 . In most of the models the patterns in GPP broadly represent the major biome areas captured in the MODIS land cover product (Fig. 1a). The east to west gradient is broadly captured in most of the models. However, grid-based correlations with the MOD17 GPP estimates (upper left of map panels in Fig. 5) show a wide range of agreement across the models. Spatial averages of the correlations across the domain range from r = 0.92 (ISBA) to r = 0.48 (ORCHIDEE). Four of the nine (LPJG, MIROC, ORCHIDEE, UVic) simulate a GPP field that explains less than 44 % of the variability in GPP found within the MOD17 product. Annual GPP in the LPJG is notably low across the eastern half of the region. The CLM4.5 tends to predict lower GPP than MOD17 over tundra areas and higher productivity in the boreal zone. As estimated by the coefficient of variation (CV, upper right panel of Fig. 5), agreement in GPP is best across the higher productivity taiga biome. Figure 6 Biogeosciences, 12, 4385-4405, 2015 www.biogeosciences.net/12/4385/2015/   (1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000); and JULES (1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000). Spatial correlations between MOD17 GPP and each model GPP for all grids is shown at upper left in each map panel. Map panel at upper right is coefficient of variation (CV) for GPP. At each grid the CV is estimated from the mean and standard deviation across the nine models (MOD17 not included).
shows the distribution of GPP for all grids of each model. In general, the models bracket the MOD17 estimates, with several models showing a larger spread and several showing Figure 6. Distributions for mean annual GPP from the models and the MOD17 product over the averaging period listed in Fig. 5. The rectangles bracket the 25th and 75th percentiles. Whiskers extend to the 5th and 95th percentiles. Thick and thin horizontal lines mark the mean and median respectively. a reduced spread. Regional averages from each model fall within ±20 % of the MOD17 average of 468 g C m −2 yr −1 , with the exception of the LPJG model for which annual GPP is 40 % lower than MOD17. For each model the spatial pattern in ER (not shown) closely matches the pattern in GPP, consistent with the strong dependence of autotrophic respiration and litterfall on vegetation productivity (Waring et al., 1998;Bond-Lamberty et al., 2004). Area-averaged GPP and ER are highly correlated (r = 0.99, Fig. 7). That is, models which simulate low (high) GPP also simulate low (high) ER.

Spatial patterns and area averages
In this study net ecosystem productivity (NEP) represents the net exchange of CO 2 between the land surface and the atmosphere. NEP is defined as the difference between GPP and ER. We do not examine other emission components of landatmosphere CO 2 exchange (Hayes and Turner, 2012) regions (e.g. fire and forest harvest). The multimodel mean NEP is highest over the south-central part of the region and lowest in the tundra to the north (Fig. 8a). Only 0.3 % of the region is a net annual source of CO 2 , notably two small areas in Scandinavia. Tundra areas are a net sink of approximately 15 g C m −2 yr −1 based on the multimodel mean NEP. As measured by the coefficient of variation (CV), the agreement in NEP among the models is highest across the boreal region and lowest in the tundra to the north and grasslands to the south (Fig. 8b). The multimodel mean NEP is approximately 20 g C m −2 yr −1 or 270 Tg C yr −1 over the simulation period (Fig. 9). Among the models, NEP varies from 4 (UVic) to 48 (JULES) g C m −2 yr −1 , a range that is double the multimodel mean. The UVic simulates a negative NEP (CO 2 source) for nearly half of the region, and the CoLM and MIROC for nearly 25 % of the region.      [1960][1961][1962][1963][1964][1965][1966][1967][1968][1969] and last decade (2000-2009) we estimate that the NEP change ranges from 10 to 400 % of the first decade mean, with a nine model average of 135 %. For each model the GPP trend magnitude exceeds the ER trend magnitude (Table 6), hence the increase in NEP over time. The increases from the first to last decade of the simulations range from 9-35 % of the early decade average for GPP and 8-30 % for ER. Total cumulative NEP over the 50-year period and averaged across all models is approximately 12 (range 3-20) Pg C (Fig. 11). Averaged across the models, NEP exhibits an increase during mainly the earliest decades that tends to weaken over the latter decades (Fig. 12). The uncertainty range for the multimodel mean shows that the region has been a net sink for CO 2 over the simulation period. Interestingly the uncertainty range reflects relatively better model agreement in annual NEP (lower variance) during the years 1960-1965 and in the low NEP years 1978 and 1996. Amid this increase there is evidence of a deceleration in NEP. The deceleration is apparent when examining trend magnitude and significance across all time intervals (minimum 20-year interval) over the simulation period (Fig. 13). Here several models (ISBA, LPJG, ORCHIDEE) exhibit weaker linear trends over time and all models show a lack of significant positive trends for time intervals spanning the latter decades (e.g. 1980-1999 or 1982-2009  varies by an order of magnitude across the models. Over the period examined all of the models simulate a statistically significant (p < 0.01) decrease in the regionally-averaged RT. Across the models the decrease from first to last decade of the study period ranges from −5 to −16 % of each model's mean. The decline occurs amid an increase in SOC storage over time. All models with the exception of CoLM simulate a statistically significant increase in soil carbon and all exhibit an increase in R h . The increases in carbon storage range from 0.2 to 3.6 % while the increases in R h range from 7 to 22 %. Likewise the models simulate an increase in the rate of net primary productivity (NPP) of 8 to 30 %. Across the model group the change in RT is highly correlated (r = 0.99) with change in R h . In essence, higher rates in R h and NPP led to a decrease in soil carbon RT, with increased soil carbon storage resulting from enhanced vegetation productivity and litterfall inputs. The spatial pattern in RT changes suggests that controlling influences are leading to both decreases and increases over different parts of the region. The largest decreases are found across north-central Russia and the eastern third of the domain (Fig. 14a). The decline in RT is statistically significant (p < 0.01) for just over 46 % of the region. exceeding −20 % for approximately 16 % of the region. An increase in RT is noted for less than 5 % of the region, including a small area in the far north and across extreme southern parts of the region. The change, however, is not significant in those areas. The CV map (Fig. 14b) lends further confidence to the RT decreases across much of the center of the region. High uncertainties (CVs > 10) are noted in the areas where the multimodel average suggests an increase in RT.

Uncertainties in tower-based measurements
The potential for alterations to the terrestrial sink of atmospheric CO 2 across the high northern latitudes motivates our examination of model estimates of land-atmosphere exchanges of CO 2 across the arctic drainage basin of Northern Eurasia. Validation of model estimates through comparisons to measured flux tower data is hindered by several factors. The limited extent of available measurements from a sparse regional tower network clearly challenges the validation of model estimates and, in turn, identification of model processes which require refinement. There are also inherent uncertainties in GPP and ER data derived from net ecosystem exchange (NEE) measurements at the eddy covariance tower sites. ER is generally assumed to equal NEE during nighttime hours (Lasslop et al., 2010). An empirical relationship is derived to estimate ER during that time and it is extrapolated into the daylight hours. GPP is then generally calculated as the difference between NEE and ER (accounting for appropriate signs). Since there is generally daylight for photosynthesis during the middle of the summer, ER could potentially be underestimated if primary production had occurred during the hours used for ER model calibration. Direct validation of the partitioning of measured NEE flux to GPP and ER is not possible. However, in a recent sensitivity study Lasslop et al. (2010) compared two independent methods for partitioning and found general agreement in the results. This agreement across methods increases our confidence in the partitioned GPP and ER estimates in the LaThuile FLUXNET data set. When measurements come from nearly-ideal sites the error bound on the net annual exchange of CO 2 has been esti-  1960-1979, 1960-1980, . . ., 1990-2009. Intervals for which the trend is significant are marked with a line from the start to end year of the interval and shaded by the trend magnitude. As an example, one time interval is identified with a significant NEP trend for UW-VIC, from 1964-1993 mated to be less than ±50 g C m −2 yr −1 (Baldocchi, 2003). Systematic errors in eddy covariance fluxes due to non-ideal observation conditions are uncertain at this time. Total error is likely below the value of 200 g C m −2 yr −1 that has been conservatively estimated (Reichstein et al., 2007). The model errors estimated in this present study often exceed that level for site Hakasija and, for a few models, Zotino as well.

Biogeosciences
Lastly, any conclusions about the CO 2 sink strength drawn from such a limited number of eddy covariance sites should be viewed with caution.

Model uncertainties contributing to errors in net CO 2 sink/source activity
Regionally averaged GPP is within 20 % of the MOD17 average (470 g C m −2 yr −1 ) for 8 of the 9 models. While the models broadly capture the three major biomes across the region, a wide range in spatial GPP estimates is evident. This result may reflect differences in model forcings, initial conditions, parameterization and the dynamic vs static nature of vegetation and LAI (Table 2). While these differences make it difficult to unambiguously determine the underlying causes for many of the mismatches, the evaluations, in the context of prior studies, point to particular biases. The timing of peak summer GPP is generally well captured in most of the models (Fig. 4). Despite the agreement in peak GPP (and ER) timing, several models overestimate the small source of CO 2 before, and to some degree after, winter dormancy at the Hakasija sites and Zotino. Overestimates in GPP and ER are more common than underestimates (Table 4). Indeed, all errors are positive for site Hakasija and five of the seven models show relatively large overestimates in ER at Zotino. The tendency to overestimate GPP suggests that parametrizations and process specifications controlling primary production (e.g. # 1, 2, 3, 4, 6, 8 in Table 2) may require refinement. It should be noted that large seasonal flux errors (e.g. Keenan et al., 2012;Richardson et al., 2012;Schaefer et al., 2012) will appear as more modest monthly errors such as those noted in our analysis. While it is not possible to evaluate sources of error separately for R h and autotrophic respiration (R a ), our results and those from prior studies implicating R h in the model uncertainties ( a need for further investigation of model processes controlling respiration. Only one of the nine models, the CLM4.5, simulated limits on productivity due to nitrogen availability. None account for competition for nitrogen. Lack of accounting for nitrogen limits on photosynthesis may be leading to overestimates in simulated GPP, since nitrogen availability limits terrestrial carbon sequestration in boreal regions (Zaehle, 2013). While accounting for fire is important for estimates of impacts on recently disturbed areas, and may be contributing to the wide range in GPP exhibited by CLM4.5, CoLM, and LPJG (Fig. 6), climate variability is a more dominant influence on regional fluxes (Yi et al., 2013). Regarding errors in respiration rates, models with the highest soil carbon amounts (CLM4.5 and UW-VIC) exhibit relatively high ER rates when compared to the observations at several sites ( Fig. 3). This tendency is consistent with results described by Exbrayat et al. (2013), who suggest that initial carbon pool size is the main driver of the response to warming, with the magnitude of the carbon pool strongly controlling the sensitivity of R h to changes in temperature and moisture. While all of the models incorporate temperature and moisture in their formulations for R h , only three of the nine account for the effect of vegetation type on soil thermal dynamics. A wide range in process specifications for soil thermal dynamics is present across the models. In a study of nine models from the TRENDY project, Peng et al. (2015) found that the models overestimate both GPP and ER, and underestimate NEE at most of the flux sites examined, and for the Northern Hemisphere based on upscaled measurements. A low NEE, or NEP, may be attributable to model biases in respiration exceeding those in productivity. Averaged across the nine models and the region of the present study, NEP of approximately 20 g C m −2 yr −1 (Fig. 9) (270 Tg C yr −1 ) is broadly consistent with inventory assessments for Eurasian forests, which range between 93 and 347 Tg C yr −1 (Hayes et al., 2011). Quegan et al. (2011) concluded that NPP simulated by two DGVMs examined was nearly balanced by the models' estimate of R h . Dolman et al. (2012) found that GPP increased during the years 1920 to 2008, with the GPP increase in the DGVMs balanced equally by increases in respiration. They reported NEP over the Russian territory as an average of three methods at nearly 30 g C m −2 yr −1 . The DGVM average, however, was only 4.4 g C m −2 yr −1 and so low that the authors chose to remove it from their final carbon budget. This underestimate was attributed to an excess in R h . While the mean NEP of 20 g C m −2 yr −1 in the present study is more consistent with the three-method average of Dolman et al. (2012) than their lower DGVM estimates, our comparisons against tower-based data and results of other studies suggest the sink strength is underestimated. Of the three models common to that study and the present one, the CLM4.5 and ORCHIDEE rank on the low end of model NEP magnitudes (Fig. 9).
Recent research points to phenology as one of the principle sources of error in model simulations of land-atmosphere exchanges of CO 2 . Graven et al. (2013) found that the change in NEP simulated by a set of CMIP5 models could not account for the observed increase in the seasonal cycle amplitude in atmospheric CO 2 concentrations. They point to data showing that boreal regions have experienced greening and shifting age composition which strongly influence NEP and suggest that process models under-represent the observed changes. Model inability to capture canopy phenology has been identified as a major source of model uncertainty leading to large seasonal errors in carbon fluxes such as GPP (Keenan et al., 2012;Richardson et al., 2012;Schaefer et al., 2012). Indeed, evaluated against flux tower data across the eastern USA, current state-of-the-art terrestrial biosphere models have been found to mis-characterize the temperature sensitivity of phenology, which contributes to poor model  (Keenan et al., 2014). Two recent studies using eight land surface models from the TRENDY comparison  (several examined in the present study) and 11 coupled carbon-climate models  have found that models consistently overestimate leaf area index (LAI) and have a longer growing season, mostly due to a later autumn dormancy, compared to satellite data. However, when estimated using model GPP, dormancy was much earlier than previously predicted using LAI. The authors conclude that the models are keeping inactive leaves for longer than they should, but with little impact on carbon cycle fluxes. Anav et al. (2013) further suggested that it was unlikely that differences in climate in the coupled models were solely responsible for the positive bias. Fisher et al. (2014) also concluded that variability in land model fluxes was driven primarily by differences in model physics rather than differences in forcing data. Simulated R h estimates among the DGVMs analyzed by Dolman et al. (2012) vary in the range between 200 to 225 g C m −2 yr −1 . In the present study the nine model average is 190 g C m −2 yr −1 . Dolman et al. (2012) point to lower estimates from Kurganova and Nilsson (2003) of 139 g C m −2 yr −1 and Schepaschenko et al. (2013) of 174 g C m −2 yr −1 as being more representative for the region. Our benchmark comparisons of ER against tower-based data are consistent with these recent studies and suggest that several models are overestimating R h , particularly over the boreal forest zone. Among the model examined in this study a wide range in soil carbon parameterizations is noted (Table 2). Not surprisingly the effects of active layer depth on the availability of soil organic carbon for decomposition and combustion has been recognized as a key sensitivity in process models . Regarding below-ground processes, model parameterizations and processes controlling carbon storage and turnover such as litter decomposition rates and biological activity in frozen soils (Hobbie et al., 2000) require close examination as well. Model simulations of R h during the non-growing season are sensitive to the presence or absence of snow (McGuire et al., 2000), suggesting that future studies of mechanisms controlling winter CO 2 emissions in tundra may help resolve uncertainties in processes within land surface models and provide a means to connect a warming climate with vegetation changes, permafrost thaw and CO 2 dynamics.

Uncertainties in temporal trend estimates
Uncertainties exist as to whether tundra areas are presently a net sink or source of CO 2 . Across tundra regions, process models indicate a stronger sink in the 2000s compared with the 1990s, attributable to a greater increase in vegetation net primary production than heterotrophic respiration in response to warming Belshe et al., 2013. The spatial pattern in multimodel mean NEP in this study points to small areas in Scandinavia (< 1 % of the do-main) as sources of CO 2 . Broadly, areas classified as tundra are a modest CO 2 sink of approximately 15 g C m −2 yr −1 . Across-model standard deviations in areas of small positive and negative NEP are a factor of ten or more greater than the multimodel mean in some areas, and are generally high across the tundra (Fig. 8b). Estimates of NEP sink magnitudes must be interpreted with caution given that the models in general possess inadequate representation of disturbances which are an important component of the overall carbon balance (Hayes et al., 2011). Among this model group, four of the nine account for fire. The nature of model initialization and spinup is also a strong influence on simulated NEP changes. For example, spin-up procedures can explain some of the discrepancies. ISBA, for instance, was equilibrated using the 10 coldest years of the WATCH forcing repeatedly to emulate preindustrial climate. As a result, soil and vegetation carbon were fairly low at the beginning of the 20th century run, much lower than the equilibrium that would result from the 1960s climate. Due to the large characteristic timescale of soil carbon, part of ISBA's large trend during the 1960-2009 period (Fig. 11) can be traced to the climate used for the model spinup procedure.
Previous studies have pointed to changes in the seasonal drawdown and release of CO 2 across the northern high latitudes (Graven et al., 2013). A change in the seasonal cycle of GPP and ER is also noted (figure not shown), with the models analyzed in this study simulating a relatively higher productivity rate from late spring to mid-summer. Indeed, increased productivity did not occur uniformly across the growing season, as most of the models show little change in August or September NEP over time. The models also simulate little change in NEP over the cold season. Greater productivity in spring and early summer may be due in part to earlier spring thawing and temporal advance in growing season initiation (McDonald et al., 2004), whereas GPP and NEP are more strongly constrained by moisture limitations later in the growing season (Yi et al., 2014). Extension of the growing season is therefore attributed more to a regional warming driven advance in spring thaw than a delay in autumn freezeup Euskirchen et al., 2006;Kim et al., 2012) which correlates with regional annual evapotranspiration for the region above 40 • N (Zhang et al., 2011). There are, however, signs of a delay in the timing of the fall freeze (−5.4 days decade −1 ) across Eurasia over the period 1988(Smith et al., 2004 consistent with fall satellite snow cover (SCE) increases, and attributed to greater fall/winter snowfall and regional cooling (Cohen et al., 2012). Consistent with the advance in spring thaw, the models examined here show a greater NEP increase in spring compared to autumn.
Soil carbon storage across the region increased significantly over the study period in eight of the nine models. A relatively larger increase in R h is correlated strongly with the associated decline in soil carbon residence time. This suggests that amid recent warming, vegetation carbon inputs to www.biogeosciences.net/12/4385/2015/ Biogeosciences, 12, 4385-4405, 2015 the soil were greater than the enhancement in decomposition. In a recent study involving CMIP5 models, Carvalhais et al. (2014) found that while the coupled climate/carbon-cycle models reproduce the latitudinal patterns of carbon turnover times, differences between the models of more than one order of magnitude were also noted. The authors suggest that more accurate descriptions of hydrological processes and watercarbon interactions are needed to improve the model estimates of ecosystem carbon turnover times. The reduction in soil carbon residence time may at least partially be a direct response to increasing NEP, rather than through warming effects on respiration. A recent study  using a set of simulations from five CMIP5 models found that, because heterotrophic respiration equilibrates faster to the increasing NPP than the soil carbon stocks, increased productivity leads to reductions in inferred residence times even when there are no changes to the environmental controls on decomposition rates, a process they refer to as false priming. Because the experimental protocol analyzed here does not include a fixed-climate simulation, it is not possible to unambiguously separate the contribution from the false priming effect from that due to warming-related respiration increases, but the fact that soil C stocks increase over the period of simulation suggests that it is the dominant effect. Apart from climatological factors, vegetation growth is also dependent on biological nitrogen availability. Failure to account for nitrogen limitation may thus impart a bias in the modeled carbon flux estimates. However, more process models are incorporating linkages between carbon and nitrogen dynamics (Thornton et al., 2009). Given the broad range in spatial patterns in GPP across the models, a closer examination of processes related to nitrogen limitations and primary production is needed. The lower rate of NEP increase over the latter decades of the simulation period suggests a weakening of the land CO 2 sink, driven by increased R h from warming, associated permafrost thaw, and an upward trend in fire emissions (Hayes et al., 2011). As the climate warms, the amount of carbon emitted as CH 4 and CO 2 will depend on whether soils become wetter or drier. A synthesis of observations and models points to intensification of the pan-Arctic hydrological cycle over recent decades (Rawlins et al., 2010), manifested prominently by increasing river discharge from Northern Eurasia (Peterson et al., 2002). In addition to hydrological cycle intensification and deepening soil active layer (Romanovsky et al., 2010), rapid thaw and ground collapse will also likely alter the landscape and impact land-atmosphere carbon exchanges. Land surface models are now beginning to implement new process formulations to account for these fine scale perturbations. Several of the models examined in this study incorporate the effect of soil freeze-thaw state on decomposition of organic carbon (Table 2). Only four of the nine models, however, account for methane emissions. Six simulate talik formation, and among these a variety of approaches are employed to compute snow insulation type.

Conclusions
Outputs from a suite of land surface models were evaluated against independent data sets and used to investigate elements of the land-atmosphere exchange of CO 2 across Northern Eurasia over the period 1960-2009. The models exhibit a wide range in spatial patterns and regional mean magnitudes. Compared to tower-based data, overestimates in both GPP and ER are noted in several of the models, with larger errors in ER relative to GPP, particularly for the comparisons at the southern higher productivity sites. Regarding agreement in the spatial pattern in GPP, less than half of the variance in GPP expressed in the MOD17 product is explained by the GPP pattern from four of the nine models. The NEP increases range from 3 to 340 % of the model means, further illustrating uncertainties in sink strength. The models exhibit a decrease in residence time of the soil carbon pool that is driven by an increase in R h , simultaneous with an increase in soil carbon storage. This result suggests that net primary productivity (NPP) inputs to the pool increased more than R h fluxes out. Among the quantities examined, uncertainties are lowest for GPP across the forest/taiga biome and highest for residence time over tundra and steppe areas. Amid the uncertainty in NEP magnitude, the results of this study and others suggests that the CO 2 sink of the region is underestimated.
Several recommendations are made as a result of this analysis. The range in area and climatological mean NEP across the models, more than double the mean value, illustrates the considerable uncertainty in the magnitude of the contemporary CO 2 sink. The results of the site-level comparison point to a need to better understand the connections between model-simulated productivity rates, soil dynamics controlling heterotrophic respiration rates, and associated uncertainties in total ER. Given the strong connections between soil thermal and hydrological variations and soil respiration, we recommend that model improvements are targeted at processes and parameterizations controlling respiration with depth in the soil profile. These validation efforts are especially important given the likelihood of net carbon transfer from ecosystems to the atmosphere from permafrost thaw (Schuur and Abbott , 2012;Schuur et al., 2015). Model responses to CO 2 fertilization and nitrogen limitation, processes largely underrepresented in the models, should be evaluated in the context of ecosystem productivity. While insights have been gained by examining the model estimates of GPP, ER, and NEP, an improved understanding of net CO 2 sink/source dynamics will require the continued development and application of model formulations for carbon emissions from fire and other disturbances. The limited number of measured site data across this important region clearly hampers model assessments, highlighting the critical need for new field, tower, and aircraft data for model validation and parametrization. Specifically, new observations in the boreal zone are required to better evaluate model bi -Biogeosciences, 12, 4385-4405, 2015 www.biogeosciences.net/12/4385/2015/ ases documented in this and in other recent studies. Moreover, our finding of biases in CO 2 source activity during the shoulder seasons points to a critical need for observations during autumn, winter, and spring. Given our results, conclusions drawn from studies which use a single model should be viewed cautiously in the absence of rigorous validation against observations across the region of interest. New observations from current and upcoming field campaigns such as Carbon in Arctic Reservoirs Vulnerability Experiment (CARVE) and the Arctic Boreal Vulnerability Experiment (ABoVE) should be used to confirm the results of this study. Future model evaluations will benefit from continued development of consistent benchmarking data sets from field measurements and remote sensing. Regarding tower data, any new measurements must be supported by refinements in the models used to partition the measured NEE flux into GPP and ER components. Regarding these and similar model intercomparisons, investments must be made which will minimize or eliminate differences in a priori climate forcings used in the simulations. At a programmatic level support for these activities should lead to well-designed model intercomparisons which minimize, to the extent possible, differences in model spinup, forcings and other elements which confound model intercomparisons.