A model inter-comparison study to examine limiting factors in modelling Australian tropical savannas

. The savanna ecosystem is one of the most dom-inant and complex terrestrial biomes, deriving from a dis-tinct vegetative surface comprised of co-dominant tree and grass populations. While these two vegetation types co-exist functionally, of savanna vegetation and its response to rainfall interannual variability. We believe that this study is the ﬁrst to assess how well TBMs simulate savanna ecosystems and that these results will be used to improve the representation of savannas ecosystems in future global climate model studies.

Abstract. The savanna ecosystem is one of the most dominant and complex terrestrial biomes, deriving from a distinct vegetative surface comprised of co-dominant tree and grass populations. While these two vegetation types co-exist functionally, demographically they are not static but are dynamically changing in response to environmental forces such as annual fire events and rainfall variability. Modelling savanna environments with the current generation of terrestrial biosphere models (TBMs) has presented many problems, particularly describing fire frequency and intensity, phenology, leaf biochemistry of C 3 and C 4 photosynthesis vegetation, and root-water uptake. In order to better understand why TBMs perform so poorly in savannas, we conducted a model inter-comparison of six TBMs and assessed their performance at simulating latent energy (LE) and gross primary productivity (GPP) for five savanna sites along a rainfall gradient in northern Australia. Performance in predicting LE and GPP was measured using an empirical benchmarking system, which ranks models by their ability to utilise meteorological driving information to predict the fluxes. On aver-age, the TBMs performed as well as a multi-linear regression of the fluxes against solar radiation, temperature and vapour pressure deficit but were outperformed by a more complicated nonlinear response model that also included the leaf area index (LAI). This identified that the TBMs are not fully utilising their input information effectively in determining savanna LE and GPP and highlights that savanna dynamics cannot be calibrated into models and that there are problems in underlying model processes. We identified key weaknesses in a model's ability to simulate savanna fluxes and their seasonal variation, related to the representation of vegetation by the models and root-water uptake. We underline these weaknesses in terms of three critical areas for development. First, prescribed tree-rooting depths must be deep enough, enabling the extraction of deep soil-water stores to maintain photosynthesis and transpiration during the dry season. Second, models must treat grasses as a co-dominant interface for water and carbon exchange rather than a secondary one to trees. Third, models need a dynamic representation of LAI that encompasses the dynamic phenology

Introduction
Savanna ecosystems are a diverse and important biome that play a significant role in global land-surface processes (van der Werf et al., 2008). Globally, they occupy regions around the wet-dry tropical to sub-tropical equatorial zone, covering approximately 15 to 20 % of the terrestrial surface and contribute ∼ 30 % to global net primary production (Grace et al., 2006;Lehmann et al., 2014). Savannas are water-limited ecosystems where rainfall is often seasonal or monsoonal and have a spatial extent that can cover an area with annual rainfall in the range of 500 to 2000 mm (Bond, 2008;Kanniah et al., 2010;Sankaran et al., 2005). The variability in the amount and timing of annual rainfall, coupled with local topo-edaphic properties, and the frequency and intensity of seasonal fires strongly influence the structure and function of savanna vegetation Kanniah et al., 2010;Ma et al., 2013;Sankaran et al., 2005). Savannas are characterised by a multi-layer stratum of vegetation, where an open and discontinuous canopy overstorey is seasonally dominated by understorey grasses (Scholes and Archer, 1997). These tree and grass layers are distinctly and functionally different, fixing carbon using different photosynthetic pathways: C 3 and C 4 photosynthesis respectively (Bond, 2008;Scholes and Archer, 1997; R. J. . The canopy overstorey can be either evergreen or deciduous (depending on the evolutionary history), while the grass understorey is annual: active only in the wet season and senescing at the end of this period (R. J. . Consequently, water, carbon, and nutrient cycling in savannas is largely determined from the balance and co-existence of these two life forms (Lehmann et al., 2009;Sankaran et al., 2005).
Given the complex nature of savannas, modelling the land surface exchange and vegetation dynamics for this biome is challenging for terrestrial biosphere models (TBMs). Here we define TBMs to broadly encompass stand, land surface, and dynamic global vegetation models (Pitman, 2003). Most land surface schemes that feed into larger earth system models use simplistic representations of vegetation, and these will have difficulty describing the complex structure of savanna ecosystems. Such issues may be simplistic assumptions in relation to rooting depth and inadequate responses to drought (De Kauwe et al., 2015;Li et al., 2012); ignoring the multilayered nature of savannas and the differing structural (including radiation), functional (including different plant functional types), and phenological differences (Whitley et al., 2011); and in some cases neglecting the C 4 photosynthetic pathway entirely (Parton et al., 1983;Schymanski et al., 2007). It is therefore critical that TBMs meet the challenges that savanna dynamics present if water and carbon exchange are to be correctly simulated in response to global change.
Despite these issues, there have been significant advances in modelling savanna dynamics in recent years, and these have been focused on integrating important features specific to savanna ecosystems, namely frequent fire and tree-grass competitive interactions, which are processes that shape savanna structure and function (Haverd et al., 2016;Higgins and Scheiter, 2012;Scheiter and Higgins, 2007;Scheiter et al., 2014;Simioni et al., 2003). Nevertheless, little work has been undertaken to critically evaluate the performance and processes of TBMs when used to capture water and carbon cycling in savannas, most notably in West Africa (Simioni et al., 2000) and Australia (Schymanski et al., 2007(Schymanski et al., , 2008(Schymanski et al., , 2009Whitley et al., 2011). Many global ecosystem models moreover use broad plant functional types (PFTs) with single parameter values to describe whole biomes (Pitman, 2003), making them unable to represent changing vegetation structure (tree : grass ratio) in the continuum of grassland to woodland savanna. Approaches have been developed that can account for savanna dynamics, such as using mixed tiles, whereby trees and grasses are simulated as separate surfaces that are then aggregated together (Kowalczyk et al., 2006). However, this approach fails to capture the competition between trees and grasses for light, water, and nutrient resources.
In this study, we take six TBMs of distinctly different conceptual frameworks and assess their ability to simulate savanna water and carbon exchange along the North Australian Tropical Transect (NATT), which is defined by a strong rainfall gradient. Australian tropical savannas can be considered largely intact compared to South American and African savannas and provide a "living laboratory" to understand the links between vegetation structure and function and how it responds to environmental change . We challenge the models by evaluating them along the rainfall gradient, which extends over a broad biogeographical extent and strong interannual variability in climate (Koch et al., 1995). The aim of this study is to highlight critical processes that may be missing in current TBMs and are required to adequately simulate savanna ecosystems. Specifically, we examine whether a TBM's structural framework, such as the representation of the understorey grasses (C 4 photosynthesis), tree rooting depth, and description of phenology (prescribed vs. dynamic), can adequately replicate observed carbon and water fluxes. To achieve this we measure the performance of each TBM by comparing its predictions to a set of empirical benchmarks that describe a priori expected levels of model performance. We identify regions of low performance among sites and seasons to diagnose under what climate conditions reduced model performance occurs. We then infer what processes (present or missing) may be the cause for reduced per-formance when applied to savanna ecosystems. Our intention is that these results can be used to flag high priorities for future development by the terrestrial biosphere modelling community.

Observational data
The NATT is a sub-continental rainfall gradient in the wetdry tropical climate zone of northern Australia, covering a distance of approximately 1000 km over a latitudinal range of −12 to −23 • S and a decline in mean annual precipitation (MAP) from 1700 to 300 mm . It is one of three savanna transects established in the mid-1990s, forming part of the International Geosphere Biosphere Program (IGBP) along with the SAvannas in the Long Term (SALT) transect in West Africa and the Kalahari Transect in southern Africa (Koch et al., 1995). Soils range from sand-dominated red kandosols to black, cracking clay soils that are more extensive in the southern end of the NATT that are limiting to woody plant growth . Kandosols are ancient and weathered, such that they have been leached of nutrients by the large monsoonal rainfall (McKenzie et al., 2004). Close to the northern coastline, vegetation is comprised primarily of evergreen Eucalyptus and Corymbia tree species that overly an understorey of C 4 Sorghum and Heteropogon spp. grasses. Inland, tree biomass, leaf area index (LAI), and cover tends to decline and by −18 • S savanna vegetation transitions to less dense Acacia woodlands, shrublands, and grasslands that are dominated by Astrebla grass species . Fires occur regularly in these environments, increasing in frequency with higher rainfall (MAP > 1000 mm), and are fuelled by the accumulation of understorey C 4 grasses that cure in the dry season Russell-Smith and Edwards, 2006).
The five flux tower sites along the NATT used in this study are outlined in Table 1, which describes stand soil and vegetation characteristics, as well as a summary of local meteorology . These sites represent a sampling of savanna environments covering a wide range of MAP and a much smaller range of mean annual temperature ( Fig. 1). At each site, an eddy covariance system was used to measure the ecosystem-atmosphere exchange of radiation, heat, water, and CO 2 . Quality assurance and control and corrections on the fluxes were carried out on the 30 min data set using the OzFlux QC/QA protocol (v2.8.5), developed by the OzFlux community under creative commons licensing (www.ozflux.org.au; see Eamus et al., 2013). Missing or rejected data were gap-filled using the DINGO (Dynamic INtegrated Gap filling and partitioning for Ozflux) system (see Moore et al., 2016). Gross primary productivity (GPP) was not observed but determined from the differ-ence between measured net ecosystem exchange (NEE) and modelled ecosystem respiration (Re). Values of Re were determined by assuming nocturnal NEE equals Re under the conditions for sufficient turbulent transport. Values that meet these requirements are then used to make daytime predictions of Re, using an artificial neural network (ANN), with soil moisture and temperature, air temperature, and the normalised difference vegetation index used as predictors. Additionally, the effect of fire on the water and carbon fluxes are quantified and incorporated into the data sets accounting for the nonlinear response in productivity (becoming a carbon source) during the post-fire recovery period . Because the TBMs used here do not attempt to simulate stochastic fire events (and other disturbance regimes), these post-fire recovery periods were removed when determining the benchmarks and model performance as described below.
Finally, we use the definitions for water and carbon exchange as outlined by Chapin et al. (2006), whereby the subdaily rate of GPP is expressed in µmol m −2 s −1 and uses a negative sign (−) to denote the removal of CO 2 from the atmosphere. Similarly, latent energy (LE) is expressed in terms of energy as W m −2 and uses a positive sign to denote the addition of H 2 O to the atmosphere.

Terrestrial biosphere models
The six TBMs used in this study cover a wide spectrum of characteristics of operation, scale and function, and include differences in operational time step (30 min vs. daily), scope of simulated processes (soil hydrology, static or dynamic vegetation, multi-layer or big leaf description of the canopy), and intended operational use (coupled to earth system models, offline prediction, driven by remote sensing products). These characteristics along with what we define as a model "functional class" are given in Table 2 and are defined as follows. Stand models (SMs) give detailed multi-layer descriptions of canopy and soil processes for a particular point, operating at a sub-daily time step (Soil-Plant-Atmosphere model (SPA) and MAESPA). Land-surface models operate at the same temporal resolution as SMs, but they adopt a simpler representation of canopy processes, allowing them to be applied spatially (Community Atmosphere Biosphere Land Exchange model (CABLE) and BIOS2, a modified version of CABLE). Dynamic global vegetation models (DGVMs) simulate water and carbon much like the other models, but they simulate dynamic rather than static vegetation that changes in response to climate and disturbance (Lund-Potsdam-Jena General Ecosystem Simulator, LPJGUESS). Lastly, remote sensing models are driven by remotely sensed atmospheric products and infer water stress of vegetation through changes in fractional cover rather than detailed soil hydrological processes (Breathing Earth System Simulator, BESS). Some of the TBMs share similar structural frameworks in parts: for example, both SPA and MAESPA use similar belowground soil hydrology and root-water uptake schemes, while BIOS2 is a fine spatial resolution (0.05 • ) offline modelling environment for Australia, in which predictions of CABLE (with alternate parameterisations of drought response and soil hydrology) are constrained by multiple observation types (see Haverd et al., 2013). Although these similarities reduce the number of truly functionally independent models used in the experiment, the presence of such overlap can be useful in identifying whether particular frameworks are the cause for model success or failure.

Experimental protocol
All TBMs were parameterised for each of the five savanna sites using standardised information on vegetation and soil profile characteristics (Table 1). For TBMs that required them, parameter values pertaining to leaf biochemistry, such as maximum Rubisco activity (V cmax ) and leaf nitrogen content per leaf area (N area ), were assigned from , who undertook a physiological measurement campaign during the SPECIAL program . Parameters relating to soil sand and clay content were taken from the Australian Soil Classification (Isbell, 2002), while root profile information was sourced from Chen et al. (2003) and Eamus et al. (2002). Each TBM was set up to describe a C 3 evergreen overstorey with an underlying C 4 grass understorey and conforms well with the characteristics of savannas in northern Australia (Bowman and Prior, 2005). All TBMs (excluding LPJGUESS) prescribed LAI as an input in order to characterise the phenology of vegetation at each site. In these cases LAI was determined from MODIS-derived approximations that were well matched to ground-based estimations of LAI at the SPECIAL sites (Sea et al., 2011).
The fraction of C 3 to C 4 vegetation was handled differently by each model and was determined for each as follows. For MAESPA and SPA, the models allowed for time-varying tree and grass fractions to be assigned as direct inputs, and these time-varying fractions were determined using the method of Donohue et al. (2009). BIOS2 similarly used the same method to extract time-varying fractions, while CABLE used a static fraction that did not change. The BESS model derived the C 3 : C 4 fraction from the C 3 and C 4 distribution map of Still et al. (2003), while for LPJGUESS this fraction is a prognostic determination resulting from the competition between trees and grasses (see Smith et al., 2001). Model simulations were driven using observations of solar radiation, air temperature, relative humidity (or vapour pressure deficit, VPD), rainfall, atmospheric CO 2 concentration, and LAI (if prescribed), and they included a spin-up period of 5 years to allow internal states, such as the soil-water balance and soil temperature to reach equilibrium. The exception to the above was the BIOS2 model, which was run using gridded meteorological inputs and had its model parameters optimised through a model-data fusion process (see Haverd et al., 2013).
Simulations for each savanna site covered a period of 2 to 10 years, depending on the availability of data from each flux site (Table 1), and results were standardised to the ALMA (Assistance for Land-surface Modelling Activities) convention. Model predictions of LE and GPP were evaluated against local observations at each site from the eddy covariance data sets and benchmarked following the methodology proposed by the Protocol for the Analysis of Land-surface models (PALS) and the PALS Land SUrface Model Benchmarking Evaluation PRoject (PLUMBER) (Abramowitz, 2012;Best et al., 2015) as described below.

Empirical benchmarking
The paradigm for model assessment outlined by PALS (Abramowitz, 2012) suggests that model assessment is more meaningful when a priori expectations of performance in any given metric can be defined. Such benchmarks can be created using simple empirical models, built on statistical relationships between the fluxes and drivers, and establish the de- gree to which models utilise the information available in their driving data about the fluxes they aim to predict. Additionally, these empirical models are simple in the sense that they are purely instantaneous response to time-varying meteorological forcing and contain no internal states or expression of ecophysiological processes. This is in comparison to TBMs that are complex, having some 20+ soil and vegetation parameters, internal states, partitioning of light, as well as soil and vegetation, carbon, and nitrogen pools (Abramowitz et al., 2008). We created a set of three empirical models of increasing complexity following the procedure of Abramowitz (2012), which we compared with the TBMs. The first benchmark (emp1) is simply a linear relationship between a turbulent flux (LE or GPP) and downward short-wave radiation (R s ). The second benchmark (emp2) is slightly more complex, and is a multi-linear regression between a flux and R s , air temperature (T a ), and VPD. Finally, the third benchmark (emp3) is the most complex and is a nonlinear regression of the fluxes against R s , T a , VPD, and LAI, determined from an ANN. This benchmark is constructed using a self-organising linear output map that clusters the four covariates into 10 2 distinct nodes and performs a multi-linear regression between the fluxes and the four covariates at each node, resulting in a nonlinear (piece-wise linear) response to the meteorological forcing data (Abramowitz et al., 2008;Hsu, 2002). In a departure from Abramowitz (2012), we include LAI as an additional covariate, as the seasonal variance of savanna water and carbon exchange is strongly coupled to the phenology of the grasses and to the deciduous and semi-deciduous woody species (Moore et al., 2016). The seasonal behaviour of the empirical benchmark drivers along the NATT can be referred to in the Supplement. Empirical benchmarks are created for each of the five flux sites using non gap-filled data and are parameterised out of sample, such that they use data from all sites except the one in question. For example, the Howard Springs empirical benchmark models would use information from Adelaide River, Daly Uncleared, Dry River, and Sturt Plains to establish their parameter values but would exclude Howard Springs itself. Constructing the benchmark out of sample results in what is effectively a generalised response to an independent data set. Once the empirical models were calibrated for each site, benchmarks were then created for both fluxes using the same meteorological forcing used to run the TBMs. Finally, we assess ecosystem model performance in terms of a ranking system, following the PLUMBER methodology of Best et al. (2015). The performance of each individual ecosystem model in predicting both LE and GPP at each site was determined using four statistical metrics that describe the mean and variability of a model compared to the observations. These metrics included the correlation coefficient (r), standard deviation, normalised mean error, and mean bias error (see Table A1 in Appendix A). Similarly, the same metrics were determined for each of the three benchmarks at each savanna site. Each TBM was then ranked against the benchmarks (independently of the other models) for each of the metrics listed above, where the ranking is between 1 and 4 (1 model + 3 benchmarks) and the best performing model for a given metric is ranked as 1. An average ranking is then determined across all metrics for each TBM and all benchmarks to give a final ranking of performance for each savanna site. The ranks denote the number of metrics being met by the models and are not a measure of the smallest absolute error. In determining the average ranks, the metrics were evaluated at the daily timescale, as this was the lowest temporal resolution common amongst the six TBMs. Additionally, days where either driver or flux had been gap-filled were removed. Here we use the term performance to relate to how well the TBMs compare to the benchmarks as expressed by the ranks. Figure 2 shows the daily time course of LE and GPP from the flux tower, models, and benchmarks at each of the five savanna sites. Models, benchmarks, and observations are represented as a smoothed time series (7-day running mean) and have been aggregated into an ensemble year to express the typical seasonality of savanna water and carbon exchange. Visually, the TBMs showed varying levels of performance across the rainfall gradient. None of the models showed a clear consistency in simulating either flux, and each responded differently to the meteorological drivers across sites. Additionally, some of the models, such as CABLE and LPJGUESS, showed difficulty in simulating the seasonality of the fluxes across the transect, particularly GPP. Differences among model-simulated LE and GPP were larger in the wet season than the dry season. However, modelled LE and GPP appeared to co-vary quite strongly; overall both fluxes were underestimated across sites by most models. Simulations by SPA and MAESPA were the exception to this, broadly capturing tower GPP despite consistently underestimating LE across sites. Figure 3 shows the probability density functions (PDFs) for the wet (November-April) and dry season (May-October) fluxes at each site. Tower and model PDFs were determined by binning each flux into the respective seasons and using kernel density estimation (Bashtannyk and Hyndman, 2001) to determine smoothed distributions. The shape and mean position of the distributions indicate the ability of the models to capture the extremes (day-to-day variability) and the seasonality of the fluxes respectively, highlighting possible predictive biases (i.e. the over-or underestimation of the tower fluxes). Across the NATT, the PDFs for the tower fluxes tended to shift to low values and became narrower as annual rainfall declined, and this was most prominent in the dry season. A change in the spread and mean position of the flux tower PDFs demonstrate the strong seasonality of water and carbon exchange at all sites. The PDFs of the model simulations did not replicate this trend, having high densities and being mostly stationary across sites. Regarding savanna water use, the distributions of the BIOS2 and SPA models were similar to those of the flux towers. The BESS model also showed a similar distribution of LE, despite the fact that it did not simulate soil-water extraction. The LPJGUESS model, which had the shallowest simulated tree rooting depth, displayed PDFs of high density that were biased towards low LE (20-40 W m −2 ) across all sites and seasons. The MAESPA model showed a similar behaviour, despite this model having a much deeper simulated rooting depth and a root-water extraction scheme that is equivalent to the SPA model. The distributions for the CABLE and BIOS2 models were largely disparate despite these models being functionally equivalent. Notably, CABLE wet season LE was more broadly distributed (5-200 W m −2 ) than the flux towers and other models at all sites, while dry season LE was narrower. In relation to savanna carbon uptake, all models showed wet and dry season PDFs of high density that became more closely aligned with the flux tower distributions as the sites became drier. The behaviour of the modelled GPP distributions were otherwise similar to those of the modelled LE distributions. The differences among TBM and flux tower PDFs indicated possible issues in simulated processes that are active during the wet season.

Model predictions
The benchmarks set low to high levels of expected TBM performance across the NATT. Additionally, they also demonstrated the level of model complexity that is required to simulate water and carbon exchange at these sites. The simplest of the benchmarks, represented as a linear regression of the fluxes against R s (emp1), which was capable of predicting the magnitude and daily time course of the tower fluxes (data not shown), but there was not enough information in R s to capture the seasonality or the distribution of the fluxes expressed by the tower data. The intermediate benchmark that included additional meteorological information on T a and VPD (emp2) demonstrated an improved capability in capturing the flux distributions but could not replicate the full seasonality of the fluxes across the NATT. It was only by including additional phenological information (LAI) together with site meteorology (R s , T a , and VPD) that the seasonality and distribution of the fluxes could be captured, as demonstrated by the most complex benchmark (emp3). This indicated that in order for the TBMs to achieve the best possible performance at simulating water and carbon exchange along the NATT, the correct implementation and utilisation of phenological information by the models was required. All TBMs used in this study utilised this breadth of information, but only some of the models were capable of meeting the expected level of performance set by the emp3 benchmark, and then only for specific sites and seasons.

Residual analysis
An analysis of the model residuals was conducted to show how model structure affects the prediction of savanna fluxes across the rainfall gradient. To do this we examined the standardised model residuals from each TBM, determined by expressing the residual error in terms of its standard deviation. Figure 4 shows the residual time series for model-predicted LE and GPP at each savanna site and provides an effective way of examining how a model responds to progressive changes in the environment through the expression of model bias and error (Medlyn et al., 2005).  The model residuals demonstrated that there was significant bias and heteroscedasticity in predicted LE and GPP in almost all cases. The residual time series showed that model error was largest in the wet season but declined with the transition into the dry season. Additionally, the models underestimated LE and GPP more significantly during the wet sea-son . A possible explanation for this behaviour is that during the wet season, multiple land-surface components: the soil surface, the understorey grasses, and the tree canopy (i.e. three sources for potential error) contribute to the bulk fluxes, while during the dry season only the tree canopy contributes (i.e. one source for potential error). It is likely that  the reduction in residual error between wet and dry seasons was a result of the declining influence of the grasses and the soil surface to ecosystem land-surface exchange during the latter period (via senescence and low surface soil moisture respectively). The bias towards the underestimation of wet season fluxes was more pronounced at the mesic sites (Howard Springs, Adelaide River), despite some models simulating relatively deep root profiles (e.g. BIOS2, MAESPA).
Differences in how the TBMs simulated root-water extraction also had no effect on reducing this bias (e.g. MAESPA, SPA). Given that soil-water was not a limiting factor at the mesic sites during this period, deep root profiles offered limited advantage towards model performance. Nonetheless, the simulated tree root zone appeared to be an important factor for all sites during the dry season, with shallow root depths (LPJGUESS: 2 m) and/or inadequate root-water up-  Table S1. The rankings are determined individually for latent energy (LE) and gross primary productivity (GPP). The coloured lines represent each of the six models in the study, while the grey lines represent the empirical benchmarks. The average ranking for each model was determined for (a) a complete year, (b) the wet season, and (c) the dry season. take schemes (CABLE: concentrated in the upper soil profile) the likely cause for underestimation during this period. However, as the sites became drier (e.g. Sturt Plains) a shallow root profile was suitable to give flux estimates of a reasonably low error. Despite model error reducing with the increase in ecosystem water limitation that occurs in both space (down the NATT) and time (wet to dry season), there are still patterns of model bias that may be unrelated to simulated soil-water dynamics. This is particularly obvious during the wet-to-dry transition periods (e.g. BIOS2, SPA) when the C 4 grass understorey senesces, indicating possible problems with the how the models translate information on phenology. Figure 5 shows a comparison of individual TBM performance ordered by site from wettest (Howard Springs) to driest (Sturt Plains) and in terms of their annual, wet, and dry season predictions for each flux. Despite differences in model complexity (Table 1), the TBMs showed a similar performance across sites and seasons. For almost all sites, the TBMs outperformed the emp1 benchmark for annual flux predictions (Fig. 5a). However, there were some exceptions to this, and good performance in one flux did not necessarily result in good performance in the other. For example, MAESPA was unable to beat the emp1 benchmark for LE at sites where MAP > 1000 mm but performed better than the emp2 benchmark for GPP. In general, there was a slight pattern of increased model performance as annual rainfall declined, though with a degree of site-to-site variability in the rankings for some of the TBMs. In order to examine how seasonal changes affect model performance, we additionally determined the metrics and rankings for the wet and dry season periods (Fig. 5b-c). Seasonal differences were immediately obvious. Model performance for wet season LE and GPP was low to moderate, and the majority of the TBMs showed a performance that ranged between the emp1 and emp2 benchmarks. In contrast, there were noticeable improvements to dry season model performance amongst the TBMs. For dry season LE, half the models (BIOS2, BESS, and SPA) were able to consistently outperform the emp2 benchmark and come close to meeting the same number of metrics as the emp3 benchmark particularly at the drier sites. In comparison, predicted dry season GPP saw a larger enhancement in model performance, with TBMs more frequently outperforming the emp2 benchmark and even some outperforming the emp3 benchmark (LPJGUESS, BESS, and SPA at the Daly Uncleared site). The exception to all this was the CABLE model, which showed surprisingly little loss or gain in performance despite the season. The results give an indication that, as a whole, input information was better utilised by each TBM at drier sites and in the dry season, suggesting that there are problems in wet season processes.

Discussion
The NATT, which covers a marked rainfall gradient, presents a natural "living laboratory" with which a model's ability to simulate fluxes in savanna ecosystems may be assessed. Our results have highlighted that there is a clear failure of the models to adequately perform at predicting wet season dynamics, as compared to the dry season, and suggests that modelled processes relating to the C 4 grass understorey are insufficient. This highlights a key weakness of this group of TBMs, which likely extends to other models outside of this study. The inability of these TBMs to capture wet season dynamics is highlighted by the benchmarking, where the performance for many of the models was at best equivalent to that of a multi-linear regression against R s , T a , and VPD (emp2) and in some cases no better than a linear regression against R s (emp1). Given that this subset of TBMs are sophisticated process-based models that represent our best understanding of land surface-atmospheric exchange processes, we would expect them to perform as well as a neural network prediction (emp3). Consequently there is an evident underutilisation of the driving information (i.e. a failure to describe the underlying relationships in the data) impeding the performance of these models when predicting savanna fluxes. However, there were instances where some of the TBMs were able to reach similar levels of performance with the emp3 benchmark, which strongly suggests that each of these models is capable of replicating savanna dynamics under certain conditions (e.g. during the dry season).
Our results suggest that errors among models are likely to be systematic, rather than related to calibration of existing parameters. For example, BIOS2 had previously optimised model parameters for Australian vegetation (see Haverd et al., 2013) but was still unable to out-perform the emp3 benchmark in most cases, although it performed better than an un-calibrated CABLE, to which it is functionally similar. Similarly, MAESPA and SPA, which used considerable site characteristic information to parameterise their simulations, did not significantly outperform un-calibrated models (e.g. CABLE). Additionally, despite these models using the same leaf, root, and soil parameterisations, both SPA and MAESPA displayed markedly different performances in predicting LE. Consequently, improving how models represent key processes that drive savanna dynamics is critical to improving model performance across this ecosystem.
There is certainly enough information in the time-varying model inputs to be able to adequately simulate wet and dry season dynamics, as is evidenced by the benchmarks. We therefore consider the implications of our results, present possible reasons below for why this group of TBMs is failing to capture water and carbon exchange along the NATT, and make suggestions as to how this could be improved.  . Average year outputs of vegetation transpiration (grass + trees) and soil evaporation, as well as their percentage contributions to total latent energy (LE) for each of the six terrestrial biosphere models at each of the five savanna sites.

Water access and tree rooting depth
During the late dry season surface soil moisture in the sandy soils declines to less than 3 % volumetric water content, with an equivalent matric potential of 3 to 4 MPa (Prior et al., 1997). During this seasonal phase, the grass understorey becomes inactive and LE can be considered as equivalent to tree transpiration, such that it is the only active component during this period (O'Grady et al., 1999). Using this equivalence, one can infer the relative effect that rooting depth has on LE during this period. Previous studies have shown that for these savanna sites along the NATT, tree transpiration is maintained throughout the dry season by deep root systems that access deep soil-water stores, which in turn are recharged over the wet season Hutley et al., 2001;Kelley et al., 2007;O'Grady et al., 1999). In order for models to perform well they will need to set adequate rooting depths and distributions, along with root-water uptake process, to enable a model response to such seasonal variation. Examining performance across the models, we can infer this to be a key deficiency. As expected, TBMs that prescribed shallow rooting depths (e.g. LPJGUESS) did not simulate this process well and underestimated dry season LE at three of the five savanna sites by up to 30 to 40 %. The two sites at Adelaide River and Sturt Plains were an exception to this with the TBMs displaying a low residual error, which is likely to be a consequence of heavier textured soils and trees at these sites having shallow root profiles. At Adelaide River shallow root profiles are a consequence of shallow, heavier textured soils; however, dry season transpiration is sustained due to the presence of saturated yellow hydrosol soils. Sturt Plains is a grassland (the end member of the savanna continuum) where C 4 grasses dominate and no trees are present such that transpiration is close to zero in the dry season. The few small shrubs that are established have shallow root profiles that have adapted to isolated rainfall events driven by convective storms . Consequently, the TBMs would be expected to perform better at these sites, as water and carbon exchange will be modulated by the soil-water status of the sub-surface soil layers. For the other sites, models which assumed a root depth > 5 m (BIOS2, SPA, and MAESPA) showed the most consistent performance in predicting dry season LE, and we suggest that for seasonally water-limited ecosystems, such as savanna, deeper soil-water access is critical. Our results highlight the need for data with which to derive more mechanistic approaches to setting rooting depth, such as that of Schymanski et al. (2009). Interestingly, a low residual error for LE in the dry season did not translate as good performance in the overall model ranking. This suggests that other processes along the soilvegetation-atmosphere continuum need to be considered to improve simulated woody transpiration. Such processes may include root-water uptake (distribution of roots and how water is extracted), and the effect of water stress and increased atmospheric demand at the leaf level (adjustment of stomatal conductance due to changes in leaf water potential). More detailed model experiments that examine how each TBM simulates these processes would help identify how they can be improved.
An exception to the above is the BESS model, which forgoes simulating belowground processes of soil hydrology and root-water uptake entirely. Rather, this model assumes that the effects of soil-moisture stress on water and carbon exchange is expressed through changes in LAI (and by extension V cmax ), which acts as a proxy for changes in soil moisture content (Ryu et al., 2011). The fact that BESS performed moderately well along the NATT, coupled with the fact that tree transpiration continues through the dry season, suggests that there may be enough active green material for remote sensing proxies of water stress to generally work rather well for savanna ecosystems. It is notable that BESS overestimated both GPP and ET in dry season at the driest site, Sturt Plains (Fig. 2e), implying that greenness detected by satellite remote sensing might not capture carbon and water dynamics well in such a dry site.

Savanna wet season dynamics
The relative performance of the TBMs at predicting LE was much poorer in the wet season compared to the dry season. The reason for this difference is that wet season LE is the sum of woody and herbaceous transpiration (E veg ) as well as soil and wet-surface evaporation (E soil ); in contrast, dry season LE is predominantly woody transpiration as described previ-ously. During the wet season, up to 75 % of total LE arises from understorey herbaceous transpiration and soil evaporation Hutley et al., 2000;Moore et al., 2016) and of this fraction the C 4 grasses contribute a significant daily amount . In the absence of observations of understory LE it can be difficult to determine whether grass transpiration is being simulated correctly. However, separating out the components of wet season LE into soil and vegetation can help identify which of these components are causes for error.
Separating the outputs of simulated E veg and E soil from each TBM (excluding BESS which did not determine these as outputs during the study) shows that simulated wet season E veg was particularly low for a lot of the models, despite high LAI and non-limiting soil-water conditions (Fig. 6). A previous study at Howard Springs by Hutley et al. (2000) observed that, during the wet season, the grass understorey could transpire ∼ 2.8 mm d −1 , while the tree canopy transpired only 0.9 mm d −1 (E veg = 3.7 mm d −1 ). Of the six TBMs at Howard Springs, only CABLE and SPA were able to predict an E veg close to this level, while the other models predicted values closer to tree transpiration (i.e. an underestimate). This pattern is similar for other NATT sites, where predicted wet season E veg remained low and was dominated by E soil at the southern end of the NATT. An underestimation of wet season LE could be due to underestimated E soil in some of the models. Conversely, CABLE and BIOS2 predicted a higher E soil than the other models, and this could be a reason for their higher LE performance during the wet season. Although E soil has been reported to reach as high as 2.8 mm d −1 at Howard Springs , predicted E soil by these models may still be overestimated, given that vegetation cover during this period is at a seasonal peak (limiting energy available at the soil surface) and transpiration is only limited by available energy, not water Ma et al., 2013;Schymanski et al., 2009;Whitley et al., 2011). Given the limited data for E soil along the NATT, it is difficult to determine how large E soil should be. However, the ratios displayed by the TBMs appear to be reasonable, with vegetation acting as the predominant pathway for surface water flux.
Grass transpiration is thus clearly being underrepresented by most of the TBMs, and reasons for this could be due to multiple factors. The evolution of C 4 grasses to fix carbon under low light, low CO 2 concentrations, and high temperatures has resulted in a gas-exchange process that is highly water-use efficient (von Caemmerer and Furbank, 1999). Consequently, this life form is abundant in tropical, waterlimited ecosystems, where it can contribute to more than 50 % of total LAI (2.0 to 2.5), particularly at high rainfall sites (Sea et al., 2011). The annual strategy of the C 4 grasses at these sites is to indiscriminately expend all available resources to maximise productivity during the monsoon period, for growth and to increase leaf area. This therefore allows grass transpiration to exceed tree transpiration during the peak wet season as evergreen trees will be more conservative in their water use, allowing them to remain active in the dry season Hutley et al., 2000;Scholes and Archer, 1997). Following this logic, our results suggest that the TBMs are either (i) incorrectly ascribing leaf area to the understorey (i.e. the C 4 fractional cover is too low), (ii) incorrectly describing the C 4 leaf-gas exchange physiology, (iii) incorrectly describing the understory micro climatic environment (R s , T a , VPD), or (iv) a combination of these causes. Furthermore, it should be noted that the TBMs used in this study are not truly modelling grasses, but approximating them. Grasses are effectively simulated as "stemless" trees, and the distinction between the two life forms is reliant on different parameter sets (e.g. V cmax , height) and slight modifications of the same process (e.g. rate of assimilation, respiration). While our results and the tower data do not allow us to directly determine how C 4 grasses may be misrepresented in these TBMs, they clearly indicate that future development and evaluation should be focused on these issues. Eddy covariance studies of understorey savanna vegetation as conducted by Moore et al. (2016) will be critical to this process.

Savanna phenology
The results from this study have shown that to simulate savanna fluxes, TBMs must be able to simulate the dynamics of savanna phenology, expressed by LAI. This was highlighted by the empirical benchmarks, where the results showed that while R s , T a , and VPD were important drivers, LAI was required to capture the seasonality and magnitude of the fluxes to achieve good performance. LAI integrates the observed structural changes of the savanna as annual rainfall declines with reduced woody stem density, driving water and carbon exchange as a result (Kanniah et al., 2010;Ma et al., 2013;Sea et al., 2011). When LAI is prescribed in a model, it is important that leaf area is partitioned correctly between the trees and grass layers to describe their respective phenology. This partitioning is important, as the C 4 grass understorey explains most of the seasonal variation in LAI and is a consequence of an annual phenology that exhibits rapid growth at the onset of the wet season and senescence at the onset of the dry (R. J. . By contrast, the evergreen eucalypt canopy shows modest reductions in canopy leaf area during the dry season, especially as mean annual rainfall declines (Bowman and Prior, 2005;Kelley et al., 2007). The strong seasonal dynamics of the grasses result in large changes in LAI, with levels varying between 0.7 and 2.5 at high rainfall sites (Sea et al., 2011). The phenological strategy of the C 4 grasses also changes with rainfall interannual variability, with the onset of the greening period becoming progressively delayed as sites become drier, to become eventually rain-pulse driven as the monsoonal influence weakens (Ma et al., 2013).
With the exception of LPJGUESS, all models prescribed LAI as an input driver. Prescribing LAI can be problematic depending on the timescale and how it is partitioned between trees and grass layers. At large time steps (months) it will fail to capture the rapidly changing dynamics of vegetation during the transition periods, and this is particularly true for the onset of the wet season (September-November) especially at drier sites that are subject to larger interannual rainfall variability . Additionally, as the sites become drier the tree : grass ratio will become smaller and this dynamic can be difficult to predict, although methods do exist (see Donohue et al., 2009). From the results, we infer that TBMs that prescribe LAI and allow for a dynamic representation of tree and grass ratios are better able to capture the changing dynamics of the savanna system. This is a possible explanation for the better performance of the BIOS2, MAESPA and SPA models in simulating GPP as these models dynamically partition leaf area between trees and grasses at the sub-monthly timescale, rather than using a bulk value. However, models that prescribe LAI have limited capability in simulating the land-surface response of savannas to changing climate, as tree and grass cover is the outcome of the environmental forcing (particularly rainfall variability and disturbance) and not a driver of the system (Sankaran et al., 2005). DGVMs that consider dynamic vegetation and use a prognostic LAI can simulate the feedback between the climate and the relative cover of trees and grasses, which shapes the savanna continuum. This feedback allows the simulated savanna structure to potentially shift to alternate states (e.g. grassland or forest) in response to changes in annual rainfall and fire severity Higgins, 2007, 2009). While LPJGUESS was the only TBM to use a prognostic LAI in our study, it achieved only moderate performance, and this may be due to how carbon is allocated from the pool on an annual time step, such that it is not as dynamic as it could be. However, its capability to simulate the feedback between climate and LAI is critical for simulating how savanna dynamics may change from year to year. There may also be issues with how phenology is simulated, particularly as it is determined from empirical formulations, which are (i) not specifically developed for savanna environments and (ii) calculated before the growing season begins. Such formulations are therefore not mechanistic and do not respond to actual season dynamics (e.g. limiting soil water), but they are empirically determined (Richardson et al., 2013).

Conclusions
This study set out to assess how well a set of functionally different, state-of-the-art TBMs perform at predicting the bulk exchanges of carbon and water over savanna land surfaces. Our model inter-comparison has identified key weaknesses in the assumptions of biosphere-atmosphere processes, which do not hold for savanna environments. Our benchmarking has identified low model performance by TBMs is likely a result of incorrect assumptions related to (i) deep soil-water access, (ii) a systematic underestimation of the contribution of the grass understorey in the wet season, and (iii) the use of static phenology to represent dynamic vegetation. Our results showed that these assumptions, as they currently exist in TBMs, are not wholly supported by "observations" of savanna water and carbon exchange and need to be addressed if more reliable projections are to be made on how savannas respond to environmental change. Despite this, our benchmarking has shown that all TBMs could potentially operate well for savanna ecosystems provided that the above issues are developed. We suggest that further work investigate how particular processes in the models may be affecting overall predicted water and carbon fluxes and may include testing variable rooting depths, alternate root-water uptake schemes and how these might affect leaf-level outputs (e.g. stomatal conductance, leaf water potential) among TBMs, and different phenology schemes. The issues highlighted here also have scope beyond savanna environments, and are relevant to other water-limited ecosystems. The results from this study provide a foundation for improving how savanna ecosystem dynamics are simulated.

Data availability
Half-hourly eddy-covariance data sets pertaining to each of the savanna sites used in this study are available from http: //data.ozflux.org.au (Isaac and van Gorsel, 2016). Soil descriptions for each savanna site are derived from the Digital Atlas of Australian Soils available at www.asris.csiro.au (Isbell, 2016).
Appendix A   Table A1. Definition of common metrics used to determine ranks against the empirical benchmarks. The terms M and O stand for model and observations respectively, while n denotes the length of the data, and i is the datum.

Statistical metric Definition
Correlation coefficient (r) The Supplement related to this article is available online at doi:10.5194/bg-13-3245-2016-supplement.