Evaluating the agreement between measurements and models of net ecosystem exchange at different times and timescales using wavelet coherence : an example using data from the North American Carbon Program Site-Level Interim Synthesis

Earth system processes exhibit complex patterns across time, as do the models that seek to replicate these processes. Model output may or may not be significantly related to observations at different times and on different frequencies. Conventional model diagnostics provide an aggregate view of model–data agreement, but usually do not identify the time and frequency patterns of model–data disagreement, leaving unclear the steps required to improve model response to environmental drivers that vary on characteristic frequencies. Wavelet coherence can quantify the times and timescales at which two time series, for example time series of models and measurements, are significantly different. We applied wavelet coherence to interpret the predictions of 20 ecosystem models from the North American Carbon Program (NACP) Site-Level Interim Synthesis when confronted with eddy-covariance-measured net ecosystem exchange (NEE) from 10 ecosystems with multiple years of available data. Models were grouped into classes with similar approaches for incorporating phenology, the calculation of NEE, the inclusion of foliar nitrogen (N), and the use of model–data fusion. Models with prescribed, rather than prognostic, phenology often fit NEE observations better on annual to interannual timescales in grassland, wetland and agricultural ecosystems. Models that calculated NEE as net primary productivity (NPP) minus heterotrophic respiration (HR) rather than gross ecosystem productivity (GPP) minus ecosystem respiration (ER) fit better on annual timescales in grassland and wetland ecosystems, but models that calculated NEE as GPP minus ER were superior on monthly to seasonal timescales in two coniferous forests. Models that incorporated foliar nitrogen (N) data were successful at capturing NEE variability on interannual (multiple year) timescales at Howland Forest, Maine. 
The model that employed a model–data fusion approach often, but not always, resulted in improved fit to data, suggesting that improving model parameterization is important but not the only step for improving model performance. Combined with previous findings, our results suggest that the mechanisms driving daily and annual NEE variability tend to be correctly simulated, but the magnitude of these fluxes is often erroneous, suggesting that model parameterization must be improved. Few NACP models correctly predicted fluxes on seasonal and interannual timescales where spectral energy in NEE observations tends to be low, but where phenological events, multi-year oscillations in climatological drivers, and ecosystem succession are known to be important for determining ecosystem function. Mechanistic improvements to models must be made to replicate observed NEE variability on seasonal and interannual timescales.


Introduction
Land surface models represent our understanding of how terrestrial ecosystems function in the climate system. It is critical to test, compare and improve these models as new information and methods become available, especially because numerous recent syntheses have demonstrated a considerable lack of model skill (Schwalm et al., 2010; Wang et al., 2010; Schaefer et al., 2012). Models are commonly diagnosed using statistical metrics that can be combined for a more complete view of model performance (Taylor, 2001). Such model diagnostics are able to identify whether a different model, different model parameterization, or different subroutine represents an improvement (Akaike, 1974), but are not intended to identify the symptoms of model failure across times and time scales in order to identify the conditions that result in poor performance. Residual analyses and detailed investigations of model performance during different time periods give important insight into the mechanisms underlying model failure, but are rarely interpreted with respect to patterns of model/measurement mismatch (see however Dietze et al., 2011; Mahecha et al., 2010; Vargas et al., 2010). In this paper, we present a formal analysis of model/measurement mismatch across times and frequencies. Such an analysis may also provide insight into how improvements to model structure and/or parameterization should be made (Williams et al., 2009).
Improving individual models is a noteworthy goal, but modern efforts combine multiple observations and model simulations, i.e. multiple databases, to arrive at a synthesis (Friedlingstein et al., 2006; Schwalm et al., 2010). In other words, such studies adopt a data-intensive approach to scientific inference (Gray, 2009), and techniques from nonlinear time series analysis and knowledge discovery in databases may provide important insights into the aggregate or divergent behavior of these model and observational databases. In this study, we quantify significant relationships among twenty ecosystem models and ten multi-year time series of eddy covariance NEE measurements from the North American Carbon Program (NACP) Site-Level Interim Synthesis (Schwalm et al., 2010) using a technique called wavelet coherence (Grinsted et al., 2004; Torrence and Webster, 1999). Wavelet coherence is conceptually similar to a measure of correlation between data series across time and time scale (related to frequency). Like correlation, significant values of wavelet coherence can be quantified, in this case by comparison against appropriate synthetic null spectra. Unlike simple correlation, statistical significance can be quantified across both time and time scales simultaneously. We use wavelet coherence to determine the times and time scales when NACP models and measurements are significantly related and, importantly, when they are not. Notably, wavelet coherence can quantify significance in the time and time scale domains even when common power (i.e. shared variability) among time series on these scales is low (Grinsted et al., 2004), and may offer an improvement over residual analyses for this reason.
Wavelet coherence has found applications in comparing ecological models and measurements for the goal of model improvement (Williams et al., 2009), but not across multiple model and observational time series to date.
Previous studies of ecosystem models in the time scale domain have demonstrated that models tend to miss patterns in flux observations on intermediate (i.e. weekly to monthly) and interannual time scales (Siqueira et al., 2006; Stoy et al., 2005). Biological responses to variability in climate often dominate flux variability on these time scales, and models tend to replicate these biological responses poorly. Such responses include weekly-to-monthly shifts in leaf out/leaf drop phenology and the multitude of factors, including lagged responses, known to contribute to interannual carbon flux variability. With respect to the NACP, findings to date have identified superior model fit when phenology is prescribed by remote sensing observations as opposed to prognostic via a phenology model, when a sub-daily (i.e. half hourly or hourly) rather than a daily time step is used, and when net ecosystem exchange (NEE) is calculated as the difference between gross primary productivity (GPP) and ecosystem respiration (ER) rather than the difference between net primary productivity (NPP) and heterotrophic respiration (HR) (Schwalm et al., 2010; Richardson et al., 2012). Schwalm et al. (2010) also found that model performance was poorer during spring and fall, when phenological events dominate surface flux, and during dry periods within the growing season. Less certain is how models match measurements on multiple time scales as they respond to climatic and biological forcings that act on multiple time scales (Dietze et al., 2011). Quantifying such model-measurement relationships contributes to the NACP objective to measure and understand the sources and sinks of CO2 in North America.
Following previous studies, we hypothesize that models will tend to match flux patterns on daily and annual time scales, and we focus our investigation on time scales between weeks and multiple months as well as interannual time scales, where we postulate that models will replicate observations more poorly.

Eddy covariance data and ecosystem models
Half hourly (or hourly) micrometeorological and eddy covariance measurements were collected by site principal investigators and research teams, and these data were provided to the AmeriFlux and Fluxnet-Canada consortia to create the NACP Site Level Interim Synthesis product (Schwalm et al., 2010). For this analysis we examine 20 ecosystem models against measurements of the net ecosystem exchange of CO2 (NEE) from the ten eddy covariance research sites investigated by Dietze et al. (2011) (Table 1). These sites were chosen because the length of the observation period tended to be longer and more continuous, allowing us to investigate interannual (multiple year) variability, and because more models tended to be run for these ecosystems (Schwalm et al., 2010; Schaefer et al., 2012). Missing meteorological data were gap-filled using National Oceanic and Atmospheric Administration (NOAA) meteorological [...] (or hourly) NEE values were filtered to remove periods of insufficient turbulence determined using friction velocity (u*) thresholds, and despiked to remove outliers (Papale et al., 2006; Reichstein et al., 2005). Missing NEE data were then gap-filled following Barr et al. (2004). Model runs at each site followed a prescribed protocol for intercomparison described by Schwalm et al. (2010). Ancillary biological, disturbance, edaphic, and management data used by model runs for each site were given by the AmeriFlux BADM templates (Law et al., 2008). The ecosystem models explored here are listed in Table 2 and described in more detail in Schwalm et al. (2010) and the original publications.

Wavelet coherence
The times and time scales at which two corresponding data series (here time series) have high common power can be quantified using the wavelet cospectrum. Wavelet coherence uses wavelet spectral and cospectral calculations to quantify correlations in the time and time scale domains (Grinsted et al., 2004; Torrence and Webster, 1999). Briefly, following Grinsted et al. (2004), wavelet coherence is defined in a similar manner to the coefficient of determination (r^2), using instead wavelet coefficients:

$$ R_n^2(s) = \frac{\left| S\left( s^{-1} W_n^{XY}(s) \right) \right|^2}{S\left( s^{-1} \left| W_n^X(s) \right|^2 \right) \cdot S\left( s^{-1} \left| W_n^Y(s) \right|^2 \right)} \tag{1} $$
where $W_n^X(s)$ and $W_n^Y(s)$ are the wavelet coefficients from the measurement ($X$) and model ($Y$) time series at time $n$ on time scale $s$, $W_n^{XY}(s)$ is the cross wavelet transform ($W_n^X(s)$ times the complex conjugate of $W_n^Y(s)$), and $S$ is a smoothing operator for the Morlet wavelet following Torrence and Webster (1999) and Grinsted et al. (2004). Grinsted et al. (2004) noted that many geophysical time series are characterized by red (Brownian) noise, which can be modeled as a first-order autoregressive process (AR1). These patterns can be used as a null model by generating synthetic data with AR1 coefficients to quantify significant wavelet coherence at the 95 % confidence level. Eddy covariance time series approximate pink noise (1/f noise) (Richardson et al., 2008), which is a class of autoregressive noise, and Grinsted et al. (2004) demonstrated that the color of noise has little impact on the determination of the significance level. Wavelet coherence values above 0.7 were found by Grinsted et al. (2004) to be significant against synthetic data sets across a wide range of scales when 10 scales per octave (i.e. per doubling or halving of frequency) were chosen in the scale-wise smoothing, although higher coherence values (ca. 0.8 or higher) should be chosen at very high and low frequencies. We used 10 scales per octave and also chose the commonly used 0.7 wavelet coherence threshold for determining significance. We de-emphasize the interpretation of high frequency coherence (e.g. on hourly and sub-daily time scales) to focus on the longer time scales (i.e. > one day) where models often fail. Wavelet coefficients on very long time scales (years to multiple years) often exceed the so-called cone of influence beyond which the coherence calculation is dominated by edge effects because of incomplete time-locality across frequencies (Torrence et al., 1998). Wavelet coefficients outside the cone of influence are unreliable and will not be interpreted.

Results are presented with two different representations of time scale in mind. For the demonstration of the wavelet coherence technique, we interpret all relevant scales from twice the observation time step (usually 1 to 2 h) to one half the length of the truncated time series. For the comparison of model output against flux observations, we interpret wavelet coherence on time scales longer than one day to enable the comparison of models that operate on daily and sub-daily time steps and to focus our analysis on the longer time scales (e.g. seasonal or interannual time scales) on which models often fail. By definition, some wavelet coherence values will exceed the significance threshold by chance. To avoid over-interpreting the outcome, we discuss only large regions in the time/time scale wavelet half-plane for which the coherence between model and measurements is adjudged to be statistically significant or not significant following Grinsted et al. (2004).
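Two ingredients of the procedure described above, the AR1 null model and the grid of time scales, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this study: the function names (`lag1_autocorrelation`, `ar1_surrogate`, `scale_grid`) are hypothetical, the AR1 coefficient is assumed to be estimated as the lag-1 autocorrelation, and the scale grid simply spans twice the time step to half the series length with 10 scales per octave. The wavelet transform, the smoothing operator, and the Monte Carlo significance test themselves are not shown.

```python
import numpy as np


def lag1_autocorrelation(x):
    """Estimate an AR1 coefficient as the lag-1 autocorrelation of x."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d * d))


def ar1_surrogate(n, phi, rng):
    """Simulate n samples of x[t] = phi * x[t-1] + unit-variance white noise."""
    noise = rng.standard_normal(n)
    x = np.empty(n)
    # Start from the stationary distribution, whose variance is 1 / (1 - phi^2).
    x[0] = noise[0] / np.sqrt(1.0 - phi**2)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x


def scale_grid(dt, n, scales_per_octave=10):
    """Time scales from 2*dt up to at most n*dt/2, evenly spaced in log2."""
    n_octaves = np.log2(n / 4.0)  # (n*dt/2) / (2*dt) = n/4
    j = np.arange(int(np.floor(n_octaves * scales_per_octave)) + 1)
    return 2.0 * dt * 2.0 ** (j / scales_per_octave)


# Hypothetical usage: surrogates matched to a placeholder series.
rng = np.random.default_rng(42)
observed = ar1_surrogate(2000, 0.8, rng)      # stand-in for an NEE record
phi_hat = lag1_autocorrelation(observed)       # AR1 coefficient for the null model
surrogates = [ar1_surrogate(observed.size, phi_hat, rng) for _ in range(5)]
scales_h = scale_grid(dt=0.5, n=observed.size)  # in hours for half-hourly data
```

In a full analysis, pairs of such surrogates would be passed through the same coherence calculation as the data, and the 95th percentile of the resulting coherence at each scale would serve as the significance level.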

Combined wavelet coherence significance analysis
A wavelet half-plane of significance values can be created for each model-measurement combination for each site. As such, significance values from wavelet half-planes that represent different models run for a single site can be combined for an aggregate view of model performance. The approach that we explore is to sum wavelet half-planes that represent significance values ($i$) for models that possess a given attribute A ($A_i$), divide by the number of models with A ($N_A$), then subtract the sum of the wavelet half-planes of significance values for models that possess the opposing model attribute B ($B_i$) divided by the number of models with B ($N_B$), using:

$$ \frac{1}{N_A} \sum_{i=1}^{N_A} A_i - \frac{1}{N_B} \sum_{i=1}^{N_B} B_i \tag{2} $$

The purpose of this calculation is to provide a simple metric between −1 and 1 for cases where $N_A$ and $N_B$ may be different but are weighted equally to simplify comparison. The goal is to identify regions in time and time scale at different sites where a certain model attribute outperforms the other (or others) across all models investigated here, with the goal of interpreting the success or failure of different model formulations across time and time scale for different ecosystem types. To avoid over-interpreting results, we only plot absolute values of Eq. (2) that exceed 0.5 to focus our study on times and time scales where the first and second terms of Eq. (2) differ by more than [...]

[...] and modeled NEE from ED2 demonstrate common power on these time scales. Some multi-day to multi-month (seasonal) periods likewise have high wavelet coherence, but wavelet coherence is generally low (< 0.7) on time scales longer than one year. Wavelet coherence coefficients were converted to binary significance values as demonstrated in Fig. 2. Here, regions in the time/scale wavelet half-plane that have significant coherence at the 95 % level (i.e. wavelet coherence coefficients > 0.7 following Grinsted et al., 2004) are given the value of one and appear in white in the figure, and non-significant regions are given the value of zero and appear in black in the figure. Figure 2 reveals that ED2-modeled NEE is significantly related to the NEE measurements on daily time scales during the growing season (i.e. the white areas in Fig. 2), on the annual time scale, and on seasonal time scales during the earlier part of the measurement period, but not during most of the remaining times and time scales. Smaller regions of the wavelet half-plane with significant coherence should not be over-interpreted as these occur in some 5 % of cases by chance.
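The thresholding step and the combined metric of Eq. (2) can be illustrated with a short NumPy sketch. It assumes that each coherence half-plane is stored as a 2-D array (scale by time) on a grid shared by all models; the function names and the toy arrays are hypothetical and are not taken from the NACP analysis.

```python
import numpy as np


def significance_map(coherence, threshold=0.7):
    """Binary map: 1 where coherence exceeds the threshold, 0 elsewhere."""
    return (np.asarray(coherence) > threshold).astype(int)


def attribute_difference(maps_a, maps_b):
    """Eq. (2): mean significance of models with attribute A minus that of B."""
    mean_a = np.mean(np.stack(maps_a), axis=0)
    mean_b = np.mean(np.stack(maps_b), axis=0)
    return mean_a - mean_b


def mask_weak(diff, cutoff=0.5):
    """Keep only values whose absolute value exceeds the cutoff (others NaN)."""
    return np.where(np.abs(diff) > cutoff, diff, np.nan)


# Toy 2 x 2 (scale x time) coherence half-planes: two models with A, one with B.
coh_a = [np.array([[0.9, 0.2], [0.8, 0.75]]), np.array([[0.8, 0.1], [0.3, 0.9]])]
coh_b = [np.array([[0.1, 0.95], [0.9, 0.8]])]

diff = attribute_difference(
    [significance_map(c) for c in coh_a],
    [significance_map(c) for c in coh_b],
)
plotted = mask_weak(diff)  # values in [-1, 1]; NaN where |diff| <= 0.5
```

Averaging within each group before subtracting is what keeps the result between −1 and 1 even when the two groups contain different numbers of models.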

Wavelet coherence significance testing of multiple models at a single site
Comparing significant wavelet coherence among US-Ha1 NEE and the output of multiple models (choosing SiBCASA, ED2, LoTEC, and ORCHIDEE, Fig. 3) reveals that the observed annual variability of NEE tends to be well-replicated by models. This finding is expected given the dominant role of orbital motions in controlling climate and flux in the temperate zone on these time scales. We note that Fig. 3 and subsequent figures ignore time scales smaller than one day to facilitate comparison between models that run on the daily and sub-daily time steps, and to emphasize longer time scales in the wavelet coherence significance tests. Figure 3 also demonstrates that results from some models are significantly related to measurements from different regions of the time/scale half-plane. LoTEC in particular is frequently related to observations on weekly and monthly time scales, but LoTEC is the only NACP model that implemented a data assimilation procedure, and should be expected to have a stronger relationship to measurements (Schwalm et al., 2010).
LoTEC results are discussed further in Appendix A. Significant wavelet coherence exists among US-Ha1 NEE measurements and the SiBCASA, ED2 and LoTEC models, but not ORCHIDEE, on the seasonal time scale (one to several months) before 2002. Such findings raise the question of whether common model attributes (Table 2) are responsible for good fit or poor fit during these times and time scales. A major advantage of converting the wavelet coherence half-planes into binary significance maps is that the output of different models for the different measurement sites can be averaged or summed to explore aggregate model performance (e.g. via Eq. (2) or other metrics). We can begin by summing the significance patterns of all 15 models that were run for US-Ha1 (Fig. 4, see Dietze et al., 2011). Figure 4 demonstrates that all models are related to NEE on the annual time scale for at least part of the measurement period. More than ten models are significantly related to the measurements on seasonal time scales, and frequent periods when multiple models are significantly related to measurements appear on weekly and monthly time scales. These features may be related to model structural attributes that can guide model testing and interpretation. We demonstrate such an approach by first exploring further the NEE observations and model output for US-Ha1. We then proceed to interpret results from the other nine research sites evaluated in this analysis (Table 1).
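Summing binary significance maps across models, as done for the 15 models at US-Ha1, amounts to a stack-and-sum over arrays of equal shape. The sketch below is illustrative only; `count_significant_models` and `scales_where_all_agree` are hypothetical helper names, and the three toy maps stand in for real model half-planes.

```python
import numpy as np


def count_significant_models(maps):
    """Number of models with significant coherence at each (scale, time) cell."""
    return np.sum(np.stack(maps), axis=0)


def scales_where_all_agree(counts, n_models):
    """Per scale: True if all models are significant during at least one time."""
    return (counts == n_models).any(axis=1)


# Toy binary maps (2 scales x 3 times) for three hypothetical models.
maps = [
    np.array([[1, 0, 0], [1, 1, 0]]),
    np.array([[0, 0, 1], [1, 1, 1]]),
    np.array([[1, 0, 1], [0, 1, 1]]),
]
counts = count_significant_models(maps)
all_agree = scales_where_all_agree(counts, n_models=len(maps))
```

A count map of this kind corresponds to the aggregate view in Fig. 4, where each cell reports how many models are significantly related to the measurements.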

The role of common model features in determining significant wavelet coherence
Models in the NACP synthesis share features in common [...] (2010) and include comparisons between prescribed versus prognostic canopy phenology, the calculation of NEE as GPP − ecosystem respiration (ER) or as net primary productivity (NPP) − heterotrophic respiration (HR), and the inclusion of foliar nitrogen in the model (Table 2). The results of the combined wavelet coherence significance analysis for US-Ha1 are shown in Fig. 5. In Fig. 5a, regions in the wavelet half-plane for which coherence between NEE measurements from US-Ha1 and all models with prognostic phenology is significant are given the value of 1 (see Eq. 2). From this, the significant regions in the half-plane for all models with prescribed phenology are subtracted. If all models with prognostic phenology are significantly related to NEE measurements for a given region in time and scale, and none with prescribed phenology are significant, the value of Eq. (2) equals 1 − 0 = 1 (dark blue). If the opposite holds, then the region equals −1 and is shown in dark red. This procedure is repeated for the different model attributes investigated (Table 2). For example, from Fig. 5a, all (or most) models with prognostic phenology are often significantly related to NEE observations from US-Ha1 during seasonal time scales, especially earlier in the measurement period. These results suggest that phenology models are working well in simulating the seasonal patterns that they seek to replicate, but prescribing phenology using remote sensing observations results in model fits at these times and time scales that are not significant. Schwalm et al. (2010) found that models with prognostic phenology and (to a lesser degree) those that calculate NEE as GPP − ER tend to show better performance across sites (Fig. 5a, b). When this holds at US-Ha1, it is on time scales between ca. 10^2.9 h (i.e. one month) and 10^3.5 h (i.e. 3 [...] (Tables 1 and 2).

Phenology
Ecosystem models often fail to replicate the timing of spring green up and autumn leaf senescence (Richardson et al., 2012) and, interestingly, incorporating satellite remote sensing data (i.e. prescribing phenology in models) may not represent an improvement in capturing phenological events (Fisher et al., 2007). However, NACP model results from the ten study sites indicate that prescribing the phenology of leaf area index (LAI) often improves modeled carbon fluxes on seasonal and annual time scales at the cold, non-forest sites (i.e. CA-Let and CA-Mer), the deciduous forests US-UMB and US-Ha1, and the agricultural ecosystem US-Ne1 (Figs. 5 and 6). The creation of effective prognostic phenology models for grasslands and croplands remains challenging, especially when cropping systems often depend on manager decisions. Remote sensing is often unsuccessful for capturing grassland phenology (Reed et al., 1994) due to the fact that the shift from green to brown biomass is critical for modeling NEE but can be subtle and difficult to ascertain remotely (Sus et al., 2010). Despite many successes, prescribing phenology resulted in erroneous model fit in some ecosystems, times, and time scales (Fig. 6), in agreement with Richardson et al. (2012), who found that model biases of two weeks or more were common for deciduous forests. Predicting phenology in the coniferous forests (CA-Obs and US-Ho1) is a superior strategy for modeling NEE on seasonal time scales. This makes sense given the difficulty of using remote sensing to detect seasonal changes in leaf area and photosynthetic activity in evergreen canopies.

NEE calculation
Models calculate ecosystem carbon uptake and loss in different ways and the NACP models can roughly be categorized as those that calculate NEE as GPP − ER and those that calculate NEE as NPP − HR (Schwalm et al., 2010). Models that calculate NEE as NPP − HR tend to fit better than models that calculate NEE as GPP − ER on the annual time scale at the Canadian grassland (CA-Let) and bog (CA-Mer) sites, which are characterized by short-statured vegetation and pronounced seasonality (Fig. 7). Models that calculate NEE as NPP − HR also represent an improvement on seasonal and annual time scales at the deciduous forests CA-Oas and US-UMB, and at daily to weekly time scales at the coniferous forests CA-Obs, US-Ho1, and US-Me2. Models that calculate NEE as GPP − ER tend to be better on monthly to seasonal time scales at the coniferous forests CA-Obs and US-Ho1. In general, simulating NPP and HR results in poorer NEE model fit at seasonal and annual time scales in coniferous stands, and simulating GPP and ER presents more of a challenge in grasslands, wetlands, and deciduous forests. Many of the wavelet half-planes in Fig. 7 show a scale-wise shift (from red to blue or vice versa) as one moves to longer scales in time, suggesting that the responses of GPP, NPP, ER and HR to environmental drivers that act on different time scales need to be examined carefully for proper frequency response.

Nitrogen
Models utilizing measurements of foliar N show better fits on interannual time scales than models that exclude N at a coniferous forest (US-Ho1; Fig. 8f) and, to a lesser degree, at a deciduous forest (CA-Oas; Fig. 8d). This finding supports the incorporation of canopy N as an important component for accurately modeling spatial and temporal patterns in NEE (Ollinger et al., 2005, 2008). However, it is discouraging that incorporating N improves interannual model fit for only a couple of sites rather than for all sites; note for example the poor fit of models that include N on time scales shorter than the interannual time scale at CA-Oas (Fig. 8d). Climatic variables tend to be unrelated or poorly related to observed NEE on interannual time scales (Stoy et al., 2009), and variability in biological drivers like canopy N is thought to be a principal driver of NEE variability on interannual time scales. The role of biological lags (e.g. growth and NPP lagging behind C uptake) tends to be poorly represented in the current generation of ecosystem models (Keenan et al., 2012), as are the dynamics of the non-structural carbohydrates that can contribute to such lags (Gough et al., 2009, 2010). Modeling the biological responses to interannual climatic variability continues to be a major research challenge (Siqueira et al., 2006), and it appears that modeling N improves models of NEE, but only in certain instances. Including foliar N improves model fit on certain time scales for different sites; for example, including N appears to improve models in CA-Let, CA-Oas and CA-Obs during summer months in 2006. The summer of 2006 was at the time the second warmest on record in Canada, but the role of N in improving modeled NEE during these conditions is difficult to interpret.

The analysis of models at multiple time scales
We used wavelet coherence as a criterion for model/measurement comparison in this study. Spectral analyses can also be used to discriminate among model subroutines and inputs (Stoy et al., 2005) [...] 2009), and it is for these purposes that wavelet coherence may find the most application in the biogeosciences. Wang et al. (2011) recently used wavelet analysis to quantify patterns of CABLE model output (Kowalczyk et al., 2006) and demonstrated how model improvements improved predictions of NEE, latent heat and sensible heat on multiple time scales, although observed patterns in interannual variability in NEE remained difficult for CABLE to resolve. We suggest that any comprehensive model diagnostic toolkit should explore model frequency response, and we demonstrate the application of wavelet coherence as a model-measurement comparison technique that is also visually intuitive. It is important to note that wavelet coherence tests for matches in patterns, rather than magnitudes, and by itself is an incomplete metric for model fidelity.
Future research efforts should compare wavelet-based approaches with other time series decomposition techniques including singular systems analysis (Mahecha et al., 2010) and spectral analysis of model residuals (Dietze et al., 2011; Vargas et al., 2010), and/or should quantify causal relationships among measurements and models across time and spectra using the Granger definition (Detto et al., 2012).

Conclusions
We demonstrated an application of wavelet coherence for testing significant relationships between flux observations and the output of multiple ecosystem models run at multiple study sites. Models with prognostic phenology were often significantly related to NEE measurements on seasonal time scales in coniferous sites, but models with prescribed phenology improved seasonal and annual model fit in grassland and wetland study sites, and to a lesser degree in the deciduous forests US-Ha1 and US-UMB. The inclusion of foliar N improved model performance on interannual time scales at US-Ho1.
Model patterns tended to match observed NEE on diurnal time scales during the growing season and on annual time scales (e.g. Figs. 1 and 2), but previous analyses indicate that models often misrepresent the magnitude of fluxes on these highly energetic time scales (Dietze et al., 2011). Despite correct frequency responses on growing-season diurnal and annual time scales as we find here, Dietze et al. (2011) demonstrated that proper parameterization of flux magnitude on these scales should remain a focus of modeling efforts. LoTEC results (Fig. 2) hint that data assimilation can improve model fit on the intermediate weekly to seasonal time scales during many periods, but modeled flux variability on diurnal and interannual time scales was not significantly related to measurements, suggesting that mechanistic model responses still need improvement. Mechanistic explanations for describing interannual NEE variability still elude most models, although correctly modeling N dynamics may be a strategy for progressing on this problem in some ecosystems (Fig. 8).

Wavelet coherence adds an additional diagnostic tool to a modeler's conceptual toolbox for evaluating the performance of single models or suites of models (Grinsted et al., 2004; Torrence et al., 1998; Williams et al., 2009). Future efforts should determine the benefits and drawbacks of wavelet, Fourier, and Singular Systems Analysis approaches for model/measurement comparisons (Katul et al., 2001; Mahecha et al., 2010; Siqueira et al., 2006; Vargas et al., 2010), use the outcomes of multiple spectral analyses to provide insight into how and why models fail, and use this information to improve model performance at the multiple times and time scales at which biogeochemical fluxes vary.

Data assimilation for formally fusing observations and models has gained increased attention in the biogeosciences (Hill et al., 2011; Rastetter et al., 2010; Raupach et al., 2005; Williams et al., 2005). LoTEC applied a data assimilation procedure in the NACP modeling exercise, and output in many instances represented a striking improvement against the aggregate output of other models (Fig. 9). Namely, LoTEC output is sig[...] and US-Me2. Results suggest that the optimized parameters computed in the LoTEC data assimilation procedure can improve fit across times and time scales, especially for some of the ecosystems that exhibit pronounced seasonality in canopy dynamics (i.e. some deciduous forests, and the agricultural ecosystem). Results also demonstrate that the data assimilation routine does not always result in significant relationships between measurements and models; there are many periods, often on time scales between a day and about a month and a half (10^3 h, Fig. 3), where LoTEC is not significantly related to measurements. Such findings demonstrate the importance of data assimilation, but also demonstrate that data assimilation should not take the place of efforts to improve model structure.

Acknowledgements. First and foremost we wish to thank all of the researchers involved in eddy covariance data collection and curation, and in model development and testing. PCS thanks Matteo Detto and Aslak Grinsted for helpful discussions on wavelet coherence and the National Science Foundation (Scaling ecosystem function: Novel Approaches from MaxEnt and Multiresolution TM, DBI #1021095). We would like to thank the North American Carbon Program [...]

[...] (2). Areas of dark blue represent times and scales where all models that include prognostic phenology (A), the NEE calculation as GPP − ER (B), and the inclusion of nitrogen (C) are significantly related to NEE observations, and when none of the models with the opposing model strategy listed in Table 2 is significant. Areas of dark red represent periods when the opposite holds.

Fig. 6. The ratio of wavelet coherence significance tests for models with prognostic versus prescribed phenology for nine sites in the North American Carbon Program Interim Synthesis. Regions in the time/scale wavelet half-plane for which models that use prognostic leaf area index (LAI) are significantly related to NEE measurements and those that use prescribed LAI are not significantly related to NEE measurements are shown as dark blue. Regions for which models that use prescribed LAI are significant and those that use prognostic LAI are not significant are shown as dark red. The colorbar follows Fig. 5.

Fig. 9. The ratio of wavelet coherence significance tests for the model that uses a data assimilation procedure (LoTEC) versus other models that do not use data assimilation. Regions in the time/scale wavelet half-plane for which LoTEC is significantly related to NEE measurements and other models are not significantly related to NEE measurements (on average) are shown as dark blue. Regions for which models other than LoTEC are significantly related to NEE measurements and LoTEC is not significantly related to NEE measurements are shown as dark red. The colorbar follows Fig. 5.