Water, Energy, and Carbon with Artificial Neural Networks (WECANN): A statistically-based estimate of global surface turbulent fluxes and gross primary productivity using solar-induced fluorescence

A new global estimate of surface turbulent fluxes, latent heat flux (LE) and sensible heat flux (H), and gross primary production (GPP) is developed using a machine learning approach informed by novel remotely sensed Solar-Induced Fluorescence (SIF) and other radiative and meteorological variables. This is the first study to jointly retrieve LE, H and GPP using SIF observations. The approach uses an artificial neural network (ANN) with a target dataset generated from three independent data sources, weighted based on triple collocation (TC) algorithm. The new retrieval, named Water, Energy, and Carbon with Artificial 20 Neural Networks (WECANN), provides estimates of LE, H and GPP from 2007 to 2015 at 1° × 1° spatial resolution and on monthly time resolution. The quality of ANN training is assessed using the target data, and the WECANN retrievals are evaluated using eddy covariance tower estimates from FLUXNET network across various climates and conditions. When compared to eddy covariance estimates, WECANN typically outperforms other products, particularly for sensible and latent heat fluxes. Analysing WECANN retrievals across three extreme drought and heatwave events demonstrates the capability of the retrievals in capturing 25 the extent of these events. Uncertainty estimates of the retrievals are analysed and the inter-annual variability in average global and regional fluxes show the impact of distinct climatic events – such as the 2015 El Niño on surface turbulent fluxes and GPP.


Introduction
Turbulent fluxes from the land surface to the atmosphere, particularly sensible heat flux (H) and latent heat flux (LE), and plants carbon uptake characterized by gross primary production (GPP) are key to understanding ecosystem response to climate and the 30 feedback on the overlying atmosphere, as well as constraining the global carbon, water and energy cycles. In recent years, there has been substantial effort towards estimating these variables from remote sensing observations at a global scale (see e.g. Fisher et al., 2008;Jiang and Ryu, 2016;Jiménez et al., 2009Jiménez et al., , 2011Jung et al., 2009;Miralles et al., 2011a;Mu et al., 2007;Mueller et al., 2011). Two typical approaches have been used to estimate these from remote sensing information. The first approach uses physically-based or semi-empirical models (e.g. the Priestley-Taylor or Penman-Monteith equations in the case of ET, or a light 35 use efficiency model in the case of GPP) informed by remote sensing information (e.g. vegetation indices, infrared temperature, microwave soil moisture), often in combination with reanalysis meteorological forcing data (Fisher et al., 2008;Miralles et al., 2011a;Mu et al., 2007;Zhang et al., 2016b;Zhao et al., 2005;Zhao and Running, 2010). These approaches are sensitive to the https://ntrs.nasa.gov/search.jsp?R=20170009899 2020-05-08T02:18:04+00:00Z assumptions and imperfections of the underlying flux models. The second approach, uses machine learning (e.g. a model tree ensemble) to determine LE, H, and GPP from meteorological drivers and optical remote sensing data (Tramontana et al., 2016).
Like all supervised machine learning models, this approach relies on a training dataset to determine the non-linear statistical relationships. In this case, in situ turbulent flux and GPP estimates from eddy-covariance towers are used (Beer et al., 2010;Jung et al., 2011). Such an approach relies implicitly on an assumption that a long temporal record of these variables at a small number 5 of sites captures the full range of behavior and sensitivities of terrestrial ecosystems around the globe. In addition, extreme and therefore rare events may be difficult to capture based on the limited data availability.
Alternatively, one can use a machine learning approach, such as an Artificial Neural Network (ANN), trained on globallyrepresentative but imperfect estimates of the fluxes (such as those from models) to parameterize the non-linear statistical relationships between remote sensing observations and surface fluxes. This approach has been successfully used for global soil 10 moisture retrieval (Aires et al., 2012;Kolassa et al., 2013Kolassa et al., , 2016Rodriǵuez-Fernández et al., 2015) and surface heat flux retrieval (Jiménez et al., 2009). Such ANNs require a target dataset for training. Climate model simulations of the relevant geophysical variable are usually used as the training dataset to facilitate subsequent data assimilation efforts (Aires et al., 2012;Kolassa et al., 2013Kolassa et al., , 2016. However, the downside of this approach is that the resulting fluxes estimated by the ANN often exhibit some of the same biases as the simulations used to train the network (Rodriǵuez-Fernández et al., 2015), even if improvements can be achieved 15 such as a more realistic seasonal cycle as it is informed by the seasonal cycle of the remote sensing data (Jiménez et al., 2009).
Previous studies show a strong relationship between the rate of photosynthesis and SIF observations and indicate that the plant fluorescence measurements can be a useful proxy for photosynthesis estimation (Flexas et al., 2002;Govindjee et al., 1981;Havaux and Lannoye, 1983;van Kooten and Snel, 1990;Krause and Weis, 1991;McFarlane et al., 1980;Toivonen and Vidaver, 1988;van der Tol et al., 2009). Recently, satellite observations of SIF have become available, opening new possibilities for the global 20 monitoring of photosynthesis (Frankenberg et al., 2011(Frankenberg et al., , 2014Guanter et al., 2012;Joiner et al., 2013;Schimel et al., 2015;Xu et al., 2015). SIF observations from the Global Ozone Monitoring Experiment-2 (GOME-2) instrument are shown to better track the seasonal cycle of GPP compared to typical high-resolution optically-based vegetation index estimates Walther et al., 2016). SIF has also been shown to be a pertinent indicator of vegetation water stress (Lee et al., 2013). 25 Moreover, a near-linear relationship between monthly SIF retrievals and GPP is found for different vegetation types which suggests that SIF estimates can strongly constrain GPP retrievals (Frankenberg et al., 2011).
Recently, a new SIF product was developed from observations of the GOME-2 satellite using a new retrieval algorithm that disentangles three components from multispectral observations (Joiner et al., 2013). SIF retrievals are shown not to be strongly affected by cloud contamination and seasonal variabilities in aerosol optical depth (Frankenberg et al., 2014). More recently, 30 remotely sensed SIF retrievals have been used to successfully provide estimates of GPP in cropland and grassland ecosystems Zhang et al., 2016a). SIF retrievals are also integrated with photosynthesis estimates from National Center for Atmospheric Research Community Land Model version 4 (NCAR CLM4) which result in significant improvement of the photosynthesis simulation (Lee et al., 2015). As GPP relates to plant transpiration through stomata regulation (Damour et al., 2010;DeLucia and Heckathorn, 1989;Dewar, 2002), and transpiration water fluxes dominate continental ET (Jasechko et al., 2013), the 35 use of remotely sensed SIF has the potential to also better constrain estimates of the continental water (LE), and energy (H) cycles, in addition to carbon (GPP) cycle.
In this study, we develop an ANN approach to retrieve monthly estimates of LE, H, and GPP at the global scale. The network uses remotely sensed solar-induced fluorescence (SIF) estimates in addition to other data including precipitation, temperature, soil moisture, snow cover, and net radiation as inputs (predictor). To our knowledge, this is the first study that uses remotely-sensed SIF estimates at the global scale to retrieve LE and H surface turbulent fluxes along with GPP.
Moreover, to reduce any errors, we introduce a Bayesian perspective to generate the target dataset for the ANN. Multiple estimates of LE, H and GPP are selected according to an a priori probability that reflects the quality and information content of the dataset at the particular pixel of interest (details are provided in Section 3.2). This approach enables us to generate a robust target dataset for 5 remote sensing observations along with a statistical algorithm for the retrieval, while bypassing the need for a land surface model and radiative transfer scheme. We use the triplet of GLEAM, ECMWF and FLUXNET-MTE for training of LE and H; and the triplet of MODIS-GPP, ECMWF and FLUXNET-MTE for GPP training.
This new global product is named WECANN (Water, Energy, and Carbon Cycle with Artificial Neural Networks). WECANN monthly estimates for the period 2007 -2015 are provided on a 1° × 1° resolution grid and with units of W m -2 for LE and H, and 10 gC m -2 day -1 for GPP. The spatial coverage of WECANN is presented in Figure S1. It includes all the land areas, except for Greenland, Antarctica, and any 1° × 1° pixel that has more than 75% water, snow or ice permanently. To estimate the fraction of water, snow and ice in each pixel we used the 0.05° × 0.05° MODIS-based Land Cover Type product (MCD12C1 v051) (NASA LP DAAC, 2016).

Data 15
The inputs of WECANN include six remotely sensed variables introduced in Section 2.2 and Table 2: SIF, net radiation, air temperature, soil moisture, precipitation, and snow water equivalent. These are used to retrieve LE, H, and GPP. Different observation and/or model based datasets are used as the training dataset, and are explained in Section 2.1 and summarized in Table   1. All the data presented here are projected and gridded on a 1° × 1° geographic grid and averaged at monthly temporal resolution.
Finally, independent datasets used for evaluation of the ANN retrievals are presented in Section 2.3. 20

Training Datasets
Four products are introduced in this section, and a triplet of them is used for training of each of the LE, H, and GPP (Section 3.2).
For LE and H, training is performed based on GLEAM, FLUXNET-MTE, and ECWMF ERA HTESSEL. For GPP, training is performed on FLUXNET-MTE, ECWMF ERA HTESSEL, and MODIS-GPP. Table 1 summarizes the characteristics of the training datasets used here. 25

GLEAM
The Global Land Evaporation Amsterdam Model (GLEAM) is a set of algorithms to estimate terrestrial evapotranspiration using satellite observations (Martens et al., 2016;Miralles et al., 2011a). GLEAM is a physically-based model composed of 1) a rainfall interception scheme, driven by rainfall and vegetation cover observations; 2) a potential evaporation scheme, calculated from the Priestley and Taylor (1972) equation and driven by satellite observations; and 3) a stress factor attenuating potential evaporation, 30 based on a semi-empirical relationship between microwave VOD observations and root-zone soil moisture estimates (based on a running water balance for rainfall and assimilating satellite soil moisture). The data is provided on a 0.25° × 0.25° spatial resolution and daily temporal resolution and starts in 1980. GLEAM data have been used for studying land-atmosphere interactions, and the global water cycle (Guillod et al., 2014, Miralles et al., 2011a, 2014a, 2014b. In this study, we use LE and H estimates from the latest version v3.0a (Martens et al., 2016). 35

FLUXNET-MTE
The FLUXNET-MTE (Multi-Tree Ensemble) provides global surface fluxes at 0.5° × 0.5° spatial resolution derived from empirical upscaling of eddy-covariance measurements from the FLUXNET global network (Baldocchi et al., 2001). The MTE method used is an ensemble learning algorithm that enables learning diverse sequence of different model trees by perturbing the base learning algorithm (Jung et al., 2009(Jung et al., , 2011. The data covers the period from January 1982 to December 2012 and can be used for 5 benchmarking land surface models and assessment of biosphere gas exchange. We use LE, H, and GPP estimates from FLUXNET-MTE.

ECMWF ERA HTESSEL
The European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis (ERA) is a global 3D variational data assimilation (3DVAR) product that uses the Hydrology Tiled ECMWF Scheme for Surface Exchanges over Land (HTESSEL) in 10 the forecast system. HTESSEL has a surface runoff component and accounts for a global non-uniform soil texture unlike the old TESSEL model (Balsamo et al., 2009). This is an offline model simulation, and HTESSEL is driven by meteorological forcing output from the forecast runs. Photosynthesis in the model is computed independently (i.e. with its own canopy conductance) from LE, so that the carbon cycle does not interact with the water cycle at the stomata level, adding errors. We use LE, H, and GPP estimates from ERA HTESSEL provided on a 0.25° × 0.25° geographic grid with daily temporal resolution. 15

MODIS-GPP
The Moderate Resolution Imaging Spectroradiometer (MODIS) sensor is onboard the sun-synchronous NASA satellites Terra algorithm is based on a light-use efficiency approach proposed by (Monteith and Moss, 1977), which states that GPP is proportional to the product of incoming Photosynthetically Active Radiation (PAR), fraction of Absorbed PAR (fAPAR) and efficiency of radiation absorption in photosynthesis. We use the monthly MOD17A2 GPP product (Running et al., 2004;Zhao et al., 2005;Zhao and Running, 2010). MOD17A2 is available from 2000 until 2015, and provided on a 0.05° × 0.05° spatial resolution. 25

Input Datasets
Six sets of observations are used as input to the WECANN retrieval algorithm. These are selected in a way to provide necessary physical constraints on the estimates from the ANN. Table 2 lists the characteristics of each of the datasets, and they are briefly introduced in the following.

Solar-Induced Fluorescence 30
The GOME-2 instrument is an optical spectrometer onboard Meteorological Operational Satellite Program, MetOp-A and MetOp-B satellites, which were launched by the European Space Agency (ESA). GOME-2 was designed to monitor atmospheric ozone profile as wells as other trace gases and water vapor content. It senses Earth backscatter radiance and solar irradiance at a 40×40 km spatial resolution (prior to July 2013 the spatial resolution was 40×80 km). Recently, the retrieval of Solar-Induced chlorophyll Fluorescence (SIF) using GOME-2 observations in the 650-800 nm spectrum has been investigated (Joiner et al., 2013(Joiner et al., , 2016. We 5 use version 26 of the daily SIF product that uses the MetOp-A GOME-2 channel 4 with a ~0.5 nm spectral resolution and wavelengths between 734 and 758 nm. SIF estimates are provided on a geographic grid with 0.5° × 0.5° grid spacing.

Net Radiation
Net radiation is the main control of the rates of sensible and latent heat in wet environments and is closely related to PAR. The Clouds and Earth's Radiation Energy System (CERES) is a suite of instruments which measure radiometric properties of solar 5 reflected and Earth emitted radiation from the Top Of Atmosphere (TOA) to Earth surface, from three broadband channels at 0.3 -100 . The CERES sensors are on board the Earth Observation Satellites (EOS) including Terra, Aqua and TRMM (Kato et al., 2013;Loeb et al., 2009). We use the net radiation estimates from Synoptic Radiative Fluxes and Clouds (SYN) product of CERES which are provided on a 1° × 1° geographic grid with monthly time resolution.

Air Temperature 10
The Atmospheric Infrared Sounder (AIRS) is a high spectral resolution spectrometer onboard the NASA Aqua satellite launched in 2002. It provides hyperspectral (visible and thermal infrared) observations for monitoring process changes in the Earth's atmosphere and land surface, as well as for improving weather prediction. The AIRS instrument was designed to obtain atmospheric temperature and humidity profiles of every 1 km layer of the atmosphere. The accuracy of AIRS temperature observations is typically better than 1°C in the lower troposphere under clear sky condition (Aumann et al., 2003). We use daily 15 temperature estimates from the lowest layer of AIRS level-3 standard product that is provided on a 0.5° × 0.5° geographic grid.

Precipitation
The Global Precipitation Climatology Project (GPCP) provides global daily precipitation estimates at 1° × 1° spatial resolution 25 from Oct. 1996 to near present (Huffman et al., 2001). Global precipitation estimates from infrared and microwave instruments are combined with monthly gauge measurements to produce the daily estimates. In this study, v1.2 of the one-Degree Daily (1DD) product of GPCP is used and daily estimates are aggregated to monthly scales. Several studies have evaluated the GPCP 1DD product at global or regional scales, and results show that it has high accuracy and good agreement with independent in situ measurements and other global precipitation estimates (Gebremichael et al., 2005;Joshi et al., 2012;McPhee et al., 2005;Rubel 30 et al., 2002).

Snow Water Equivalent
The GlobSnow project is developed by ESA, and provides long-term snow-related variables: Snow Water Equivalent (SWE) and areal Snow Extent (SE). It combines microwave-based retrievals of snow information (including Nimbus-7 SMMR, DMSP F8/F11/F13/F17 SSM/I(S) observations) and ground based station data through a data assimilation process and provides the SWE 6 and SE products at different temporal resolutions: daily, weekly and monthly (Pulliainen, 2006). Here, we use v2 of the daily L3A SWE product which is posted on a 25 km × 25 km EASE grid.

Eddy-Covariance Tower Estimates
FLUXNET is a network of regional tower sites, which measure turbulent flux exchanges (water vapor, energy fluxes and carbon 5 dioxide) between ecosystems and atmosphere (Baldocchi et al., 2001). FLUXNET comprises over 750 sites covering five continents. Measurements from the FLUXNET towers provide valuable information for evaluating satellite based retrievals of surface fluxes. In this study, FLUXNET measurements from the FLUXNET 2015, the La Thuile Synthesis dataset and the Largescale Biosphere-Atmosphere (LBA) experiment in Brazil are used for evaluation (details are provided in section 4.4).
From the Large-scale Biosphere-Atmosphere (LBA) experiment in Brazil, we use data from sites in Rondônia at the edge of a deforested region (BR-Ji1 and BR-Ji2) and near São-Paulo (BR-Sp1). As the data did not span recent years, we instead use a climatology of the fluxes for comparison from 1999 to 2003. We note that, of course, the inter-annual variability in the region 15 (such as El Niño and La Niña) could alter the seasonality and magnitude of the fluxes in the region.
We also use data from the La Thuile Synthesis Dataset (http://fluxnet.fluxdata.org/data/la-thuile-dataset/) covering 24 sites. These data are part of the free-fair use version of the dataset.

Basin Scale ET
We use estimates of an independent water budget closure model across five major basins to evaluate WECANN retrievals at 25 regional scales (Aires, 2014;Munier et al., 2014) . ET estimates from the budget closure approach satisfy a water budget closure with no residual; therefore, they can be used as a reference to evaluate WECANN ET estimates at basin scale. These basins include Amazon (4,680,000 km 2 ), Colorado (618,715 km 2 ), Congo (3,475,000 km 2 ), Mississippi (2,964,255 km 2 ) and Orinoco (836,000 km 2 ). Details of the water budget estimate are provided in Munier and Aires, 2017, but in a nutshell they combine estimates of precipitation, evaporation, water storage and runoff to define a best estimate of the different fluxes and changes in storage, 30 constrained by the water budget over the basin. Their analysis is carried out from 2002 through 2010.

Artificial Neural Network Setup
We developed an ANN retrieval algorithm to estimate the surface fluxes (LE and H) and GPP based on our six sets of input observations: SIF, net radiation, air temperature, soil moisture, precipitation, and SWE (as described in Section 2.2). The ANN 35 used here is a feedforward network consisting of three layers: (1) an input layer that directly connects to the input data, (2) one hidden layer and (3) an output layer that produces the 3 output estimates. The number of neurons in the input and output layer is determined by the number of input and output variables, whereas for the hidden layer it has to be chosen according to the complexity of the problem (see below). The neuron output from each layer is fed to neurons in the subsequent layer through weighted connections. Each neuron output is the weighted sum of its inputs plus a bias, which is then subjected to a transfer function. In this study, we chose a tangent sigmoid transfer function for neurons in the hidden layer and a linear transfer function 5 in the output layer. The change of the transfer function for the hidden layer (log sigmoid or tangent sigmoid) did not produce any significant changes in the retrievals (not shown), so we used the more common method. A schematic of the ANN architecture is provided in Fig. 1.
The training step of the ANN aims at estimating the weights for each of the neuron connections, such that the mismatch between the ANN outputs and target estimates is minimized. For this, we used the mean squared error (MSE) as the cost function and a 10 backpropagation algorithm to adjust the ANN weights. During the training, the network implicitly learns the coupling of the LE, H and GPP by using one set of neurons (with their respective weights and biases) to estimate the three variables. This is an advantage of using a machine learning technique that eliminates the need to define physical relationships between different variables.
For the purpose of training, the target data is divided into three subsets: training, validation and testing constituting 60%, 20% and 15 20% of the target data, respectively. In each iteration, the training subset is used to estimates the weights in the network, and the convergence of the training towards the target data is checked using the validation subset. When overfitting of the network weights to the training data occurs, the validation estimates start diverging from the target data and the training is stopped (early stopping).
The weights from the last iteration before the occurrence of the divergence represent the final solution. The testing subset are used to assess the ANN performance after the training phase. 20 As an additional measure to avoid overfitting, we repeated the training for several ANNs with an increasing number of neurons in the hidden layer (1 to 15). For 1 to 5 neurons, the R 2 value between the target data and the ANN estimates increased with an increasing number of neurons. For more than 5 neurons, little change in the skill was observed when increasing the number of hidden layer neurons (Fig. S3). Thus an ANN with 5 hidden layer neurons represents the simplest ANN that can converge to a solution and model the non-linear relationship between the satellite inputs and the surface flux estimates. 25 To train the ANN, we used LE, H and GPP estimates from the years 2008-2010. The target dataset was generated through a triple collocation based merging of triplets of the flux estimates introduced in Section 2.1 (details are discussed in Section 3.2). After completion of the training, the performance of the ANN and its ability to generalize was evaluated using the LE, H and GPP target data from 2011. Finally, WECANN retrievals are evaluated against other global products and eddy covariance tower data. Results of these comparisons are presented in section 4. 30

Target Dataset: A Bayesian prior using Triple Collocation
One of the key issues in the design of an ANN to retrieve any geophysical variable is defining a good target dataset. One practice has been to use outputs from a land surface model as the target (Aires et al., 2005;Jiménez et al., 2013;Kolassa et al., 2013;Rodriǵuez-Fernández et al., 2015). However, all observations and models contain random errors and biases. Therefore, the retrieval based on the ANN exhibits some of the biases of the original target dataset even if the ANN is able to make corrections to its 35 original target data (e.g. correction of an imperfect seasonal cycle, as demonstrated by Jiménez et al., 2009). To address this issue, we use three datasets, which are sufficiently independent so that the training can learn from each dataset and benefit from all of them, synergistically. We implement a pseudo Bayesian training by probabilistically weighting the occurrence of each training dataset by its likelihood, and define a target dataset. The three datasets are listed in Table 1 for each variable.
To define this prior distribution, we use the triple collocation (TC) technique. TC is a method to estimate the Root Mean Square Errors (RMSE) (and, if desired, correlation coefficients) of three spatially and temporally collocated measurements by assuming a linear error model between the measurements (McColl et al., 2014;Stoffelen, 1998). This methodology has been widely used in error estimation of land and ocean parameters, such as wind speed, sea surface temperature, soil moisture, evaporation, precipitation, fAPAR, and in the rescaling of measurement systems to reference system for data assimilation purposes 5 (Alemohammad et al., 2015;D'Odorico et al., 2014;Gruber et al., 2016;Hain et al., 2011;Lei et al., 2015;Miralles et al., 2010Miralles et al., , 2011bParinussa et al., 2011), as well as in validating categorical variables such as the soil freeze/thaw state (McColl et al., 2016).
The relationship between each measurement and the true value is assumed to follow a linear model: in which is the ( ℎ , ℎ ) element of the covariance matrix between the three measurements. Since the triplet of datasets used 20 for training each of the fluxes (see Table 1) is derived through different semi-empirical approaches with different sources of errors, the assumption of uncorrelated errors is more likely to be met. In the following, we will calculate the standard deviation of random error component of Equation (1)  There are several likely causes for these errors. For the FLUXNET-MTE data, the regions which are not covered by (many) FLUXNET eddy-covariance stations may result in larger uncertainties, and those regions for which interception is a large component of the LE flux as well (Michel et al., 2016). For the GLEAM and ECMWF data thick vegetation generally induces biases compared to the satellite observations, especially in tropical regions (Anber et al., 2015).
Finally, we use the TC-based RMSE estimates at each pixel to compute the a priori probability ( ) of selecting a particular dataset 35 in each pixel, if that pixel is used as part of the training dataset: in which is the probability of selecting dataset i when sampling from three measurements. We assume that these probabilities are time independent as we are limited by the currently available duration of the input data; however, future versions will explore the use of seasonally varying probabilities.  over deserts, such as the Sahara and Arabian Peninsula, as expected (Fig. 3). Wet tropical forests exhibit subtle seasonal variability 15 in LE. These spatial variabilities in the seasonal cycle reflect changes in the radiation, temperature, water availability during the dry season, soil nutrient, soil type conditions as well as leaf flushing (Anber et al., 2015;Morton et al., 2014Morton et al., , 2016Restrepo-Coupe et al., 2013;da Rocha et al., 2009;Saleska et al., 2016). In contrast, seasonal variability dominated by radiation availability are noticeable in wet mid-latitude regions for both Northern and Southern Hemisphere, i.e., East Asia, Eastern U.S. and Australian North and East Coast with over 60 W m -2 difference between winter and summer months. One exceptional case is South Asia, 20

Results and Discussion
where LE does not significantly rise in spring, likely due to the effects of the monsoonal climate. In Eastern South America, the ET estimates are relatively high compared to GPP estimates. This difference can be caused by either low water use efficiency or significant rain re-evaporation and soil evaporation. Moreover, the SIF relationship with GPP likely changes in C4 plants. However, we did not impose the C4/C3 delimitation in the ANN as it would be highly dependent on the quality of the classification map used. We note that all training products used here include C3/C4 delimitation and therefore the C3/C4 delimitation is implicit in 25 the training dataset; therefore, can be learnt by the network.
Seasonal variabilities in H (Fig. 4) are distributed in opposite pattern to LE, as expected. Deserts and dry regions i.e., the Sahara, Southwestern U.S. and Western Australia demonstrate much more seasonal variability than the rest of the world -given the strong water limitations there, the available energy converted into H becomes dictated by the seasonal cycle of solar radiation. In contrast, tropical rainforests (Amazon, Congo, Indonesia) exhibit limited seasonal variability. In mid-latitude energy-limited regions 30 (Central/Eastern Europe, Easter US), H also reflects the course of available energy, and in more water-limited regimes (e.g. Western US and Mediterranean Europe), it reflects the interplay between soil dryness and available energy, with a peak between spring and summer for dry regions.
The seasonal variability of GPP (Fig. 5) in Northern latitudes follows the availability of radiation in wet regions with a peak in summer and another in spring for dry regions, corresponding to both soil water availability and high incoming radiation. A clear 35 East-West transition conditioned by water availability is observed in continental U.S. In tropics and subtropics, the response is diverse. The Amazon rainforest exhibits high GPP throughout the year with a peak between September and February in the wetter part of the basin, following the dry season, consistent with the observations at eddy-covariance towers near Manaus and Santarem (Restrepo-Coupe et al., 2013;da Rocha et al., 2009). Compared to LE, substantial geographical variability is observed in the Amazon, because of the strong variabilities in soil type, green up, biodiversity and rooting depth. In the drier part of the basin, water availability controls the seasonal cycle of photosynthesis and the peak in GPP is observed in the wet season (DJFMA). In the Congo rainforest, GPP exhibits four seasons, with two wet and two dry ones, with substantial decrease in GPP during those 5 dry spells. In Indonesia, GPP is steadier throughout the year, exhibiting high values year-round. Monsoonal climates over India, South-East Asia, Northern Australia and Central-Northern America are well captured with rapid rise in GPP following water availability. The highest GPP are observed in rainforests and the US agricultural Great Plains, in JJA for the latter. Northern latitude regions mainly exhibit substantial GPP in the summer and late spring, and small values throughout the rest of the year.

Evaluation with FLUXNET Data 10
Direct validation of the WECANN retrievals is challenging by the fact that no global, error-free estimates of LE, H and GPP are available. Remote sensing or model products such as those used for training have their own errors. When three datasets with uncorrelated errors (commonly assumed to be true if the sources of error in each dataset have no common physical origin) are available, triple collocation provides a valuable technique to evaluate large-scale datasets in the absence of a known truth. However, WECANN's use of different training datasets will cause the presence of some correlated errors between WECANN retrievals and 15 any of the datasets used for the training. Instead, we evaluate the retrievals by comparing them to data from a set of FLUXNET eddy-covariance towers. WECANN uses three training datasets which one of them (FLUXNET-MTE) is based on upscaling FLUXNET eddy-covariance tower estimates. This might cause some dependence between WECANN retrievals and the tower estimates. However, WECANN learns from all three training datasets collectively, and uses remote sensing observations as input.
Therefore, this dependence is negligible. In situ estimates from eddy covariance towers with a footprint of a few 100 m to km may 20 not be representative of the entire 1° × 1° pixel, and are known to have problems with energy closure (Foken et al., 2010). However, in the comparison against tower data the impact of large-scale climate variability and seasonality can still be seen even at different spatial scales. For instance, the phenology has a strong impact on the seasonal cycle of the LE, H and GPP and in the following examples, it is clearly highlighted when comparing different products to flux tower estimates.
Summary of statistics across 85 FLUXNET sites are provided in Tables S1 -S3. Overall, WECANN performs better than other 25 alternative global products. In particular, WECANN has the highest correlation for 76% of sites for LE, 54% of sites for H, and 53% of sites for GPP. This high R 2 reflects the capacity of WECANN to correctly capture the seasonal cycle and interannual variability, as it is largely imposed by the remote sensing observations rather than by the statistical retrieval (Jiménez et al., 2009).
One of the reasons for this is the presence of the SIF information in the ANN retrieval, which is directly related to GPP and plant transpiration (Frankenberg et al., 2011). The RMSE of WECANN is lower than all other products at 71% of sites for LE, 46% of 30 sites for H, and 51% of the sites for GPP. The bias is also reduced compared to other retrievals, even if some variability can be seen from site to site. Figure 6 shows a summary of the correlation coefficients presented in Tables S1 -S3 for each group of Plant Functional Types (PFTs). Each class has between 6 and 22 sites. WECANN has the best mean within each PFT class, and the smallest variability in most of the classes for all three variables. 35 Figure 7 shows the comparison of monthly WECANN retrievals and three other global products' estimates with the tower estimates across 5 select sites that span a range of climatic and vegetation coverage conditions. At the Oklahoma agricultural site (US-ARM), H and LE are well reproduced, yet dry year H is underestimated (Fig. Figure 7a). The GPP reported at the site very rapidly decays at the end of the spring whereas the region is highly agricultural with sustained agriculture in the summer. The difference between the reported GPP and WECANN retrievals might be again due to the difference in the footprint of the two estimates.
At the Brasschaat site, BE-Bra, Belgium (Fig. Figure 7b), LE is very well captured by WECANN, which captures the seasonal cycle well, yet misses some of the interannual variability. WECANN outperforms the other retrievals of LE and GPP and captures the GPP seasonal cycle very well compared to other products, which display too early GPP rise and overestimate summer GPP. 5 Again, the SIF data provides independent useful data compared to other environmental information (radiation, temperature, vegetation indices) used by the other retrieval schemes. All retrievals strongly underestimate the reported eddy-covariance H. At this humid site though, the magnitude of the measured H is often higher or on the same order in the summer as LE. Given the high degree of urbanization around the site, it is most likely a reflection of the footprint of the eddy-covariance and the fact that it observes urbanized surfaces with high H. Indeed, the surface energy budget is not locally balanced and turbulent fluxes are higher 10 than the observed net radiation minus ground heat flux.
At the cold Finland site (FI-Hyy), WECANN very well captures the seasonal cycle of GPP and LE, as well as to a less extent H.
WECANN better reproduces the seasonality, amplitude and interannual variability compared to other retrievals (Fig. Figure 7c).
It also reflects the difficulties of retrieving fluxes in snow dominated regions. SIF has the great advantage that it is not directly sensitive to snow compared to vegetation indices for instance, which incorrectly attribute snowmelt and changes in observed 15 ground color to photosynthesis onset (Jeong et al., 2017).
At the monsoonal grassland site of Santa Rita, AZ, WECANN correctly captures the complex dynamics of H and LE at the site with some rain periods preceding the Monsoon period (Fig. Figure 7d). Yet, WECANN slightly underestimates LE and overestimates GPP. In fact, all products overestimate GPP in the dry and cold seasons. The landscape in the region is highly heterogeneous with denser vegetation in riparian zones, away from the tower location, which may explain the lower GPP value at 20 the site compared to estimates of the larger-scale values.
Finally, at the South African Mediterranean site, ZA-Kru, WECANN reproduces some of the dynamics of the observed H, yet is typically smoother (Fig. Figure 7e). It reasonably captures the LE dynamics, except for the suspect cold season increase reported at the tower in 2013 (like other products). All products overestimate the reported GPP, though WECANN is closest to the observations and better captures the seasonal dynamics compared to other products. 25 Overall, across the different sites, the WECANN retrieval performs better than other products, especially in terms of the seasonality of LE, H and GPP. Several factors contribute to the improved retrieval of WECANN compared to other products, even at those smaller footprint sites. First, the SIF measurements that are directly correlated with GPP provide a better constraint on estimating LE, H and GPP. The ANN approach in WECANN also uses a novel training technique based on probabilistically merging different datasets to remove outliers from its target dataset. Therefore, WECANN retrievals learn collectively from the different datasets 30 (and remote sensing observations) and are closer to the truth than each of the individual target datasets.

Comparison against other remote-sensing based products
In this section, we compare the WECANN-based estimates to other datasets used in the training to better understand how WECANN differs from those training data. Figure 8 shows the comparisons for LE, and indicates that our product has a relatively similar R 2 with the three products (R 2 = 0.96 with FLUXNET-MTE and ECMWF, and R 2 = 0.94 with GLEAM). However, the 35 scatterplot with FLUXNET-MTE is more concentrated and aligned along the 1:1 line, further emphasizing the consistency between the two datasets (RMSD of 6.42 W m -2 for FLUXNET MTE versus 8.47 W m -2 and 9.72 W m -2 for GLEAM and ECMWF, respectively). Differences in spatial patterns shown in Fig. 8a-c reflect that WECANN exhibits smaller spatial differences with FLUXNET-MTE than GLEAM or ECMWF and such differences exhibit a narrower range between -10 and 10 W m -2 . FLUXNET-MTE overestimates LE compared to WECANN in transitional tropical and subtropical regions and particularly over India, which are regions with few eddy-covariance towers. GLEAM exhibits substantial differences with our product particularly in regions dominated by seasonal water stress such as Brazilian savannas, the Horn of Africa, Central America, India and the subtropical humid part of Africa south of the Congo. In the Sahel, GLEAM LE is higher than our estimate and FLUXNET-MTE. The LE estimate of ECMWF is nearly always higher than our estimate with much higher values in the Congo, the Amazon, Southern Brazil, 5 and Northern Canada. In Europe, where the ECMWF estimate should be best because of the frequent weather operational forecast checks and model adjustment in the region, the estimates are more similar. The differences and similarities of WECANN retrievals with the three target datasets is consistent with the error estimates from TC. For example, Fig. S4 shows that FLUXNET-MTE has the smallest error in LE estimates globally compared to GLEAM and ECMWF, other than across India. WECANN retrievals also have better agreement with FLUXNEWT-MTE. 10 The differences in H estimates are more complex (Fig. 9). First, the R 2 between WECANN and the other datasets are slightly lower than for LE. ECMWF and FLUXNET-MTE yield higher R 2 with WECANN (0.92) while GLEAM has an R 2 of 0.87. GLEAM exhibits lower H in most of the Northern hemisphere, especially in seasonally dry regions, potentially due to its simple formulation of ground heat flux (G). H estimates are relatively higher over the Amazon and Congo but lower over Indonesia for GLEAM. In  Figure S5 shows 20 that ECMWF has higher errors in the Sahel, Southern Congo, and Brazilian Savana and GLEAM has higher errors in the Amazon, East Asia and Central Africa.
The comparison between the GPP estimates shows significant differences (Fig. 10). WECANN compares the best against FLUXNET-MTE (R 2 = 0.93), with MODIS (R 2 = 0.91) and ECMWF (R 2 = 0.90) following. While all three products have similar R 2 , their spatial differences are distinct. In the Amazon, ECMWF and FLUXNET-MTE have larger GPP estimates compared to 25 WECANN, while MODIS estimates are much smaller. In cold northern latitude regions of Siberia and Northern Canada, all three products have higher GPP than WECANN. In Congo, MODIS and FLUXNET-MTE have higher GPP, while ECMWF has a lower one. In Central and Southwestern US, all three products tend to yield lower GPP. Comparison of these findings with the error estimates from TC (Fig. S6) shows that FLUXNET-MTE has the lowest errors globally, while ECMWF has the largest errors in the Amazon. values differ. Anomalies in GPP have better agreement across different products compared to LE and H. Evaluation of the discrepancies between these anomalies are beyond the scope of this paper and will be characterized in detail in a future study.

Extreme Events Assessment
In order to further assess WECANN at regional scales, we analyse its capacity in capturing extreme events. We thus selected three major heatwave and drought events that occurred during the temporal coverage of WECANN product. These events are the Russia 2010 heat wave, Texas 2011 drought and US Corn Belt 2012 drought. Figure 11 shows the percentage of average monthly anomalies with respect to mean values, for LE, H and GPP, in each of the three cases. The patterns reveal significant anomalies in 5 all fluxes which is consistent with reported patterns. In summer 2010, a historical heatwave occurred over western Russia and resulted in all-time maximum temperature record in many locations (Dole et al., 2011). The extent of reduction in LE and increase in H derived from WECANN retrievals is consistent with estimates reported in the literature (Lau and Kim, 2012), with 10-15% increase in H and 15-20% reduction in LE. In early 2011, drought conditions developed in southern US, particularly in the states of Texas and Louisiana (Luo and Zhang, 2012). By April, most of Texas, Oklahoma, Louisiana and Arkansas were classified in 10 the D4 drought condition (exceptional drought), and the situation continued throughout the summer and fall of 2011 as reported by US Drought Monitor (Svoboda et al., 2002). As Figure 11 reveals, the same spatial pattern is pronounced in the monthly anomalies derived from WECANN retrievals, emphasizing massive reduction in LE and GPP accompanied by high H over the region.
Finally, an intense drought in the central US, particularly in the Corn Belt, occurred in 2012 and reduced maize yields by about 15 25% and increased prices by 17-24% (Boyer et al., 2013;USDA, 2013). By mid-September 2012 almost two-thirds of the continental US was covered by drought, and different parts of US Corn Belt was categorized as either D3 (extreme drought) or D4 (exceptional drought) condition. Figure 11 shows similar patterns in LE, H and GPP with significant positive anomaly in H (~20%) and reductions in LE and GPP (~20%), consistent with crop yield decrease.

Basin Scale Evaluation 20
We also assess the accuracy of WECANN ET retrievals using ET estimates from the independent water budget closure model introduced in Section 2.3.2. The analysis is carried out for years 2007 to 2010 that overlaps between WECANN retrievals and the water budget closure study. Figure 12 shows the relative absolute difference in ET estimates from WECANN compared to the ET estimates from the water budget study for each of the five basins (mean value and one standard deviation across 48 months). Mean absolute differences vary between a low of 5% in the Amazon and a larger 24% value in Colorado, while other three basins have 25 a mean difference of 9%, 17%, and 20%. While the differences vary between a low and moderate range, it should be noted that the coarse spatial resolution of WECANN product causes a difference in the spatial averaging to get the basin level estimates of ET.
Moreover, in the budget closure estimates only a single runoff (at the outlet of the considered basin) is used over the entire basin; therefore, large heterogeneous basins such as the Colorado and Mississippi have large uncertainties associated with them, as runoff does not correctly constrain the flux distribution over the entire basin. It is over those basins that the WECANN retrieval compare 30 less favorably with these large-scale estimates. Downscaled version of those estimates would further help in the evaluation of ET products.

Uncertainty Analysis of WECANN Retrievals
One of the advantages of a statistical retrieval algorithm, in particular of ANNs, is that the run time is extremely fast, after the training step. This enables us to characterize the uncertainty of the retrievals by propagating the uncertainties in the input variables 35 through the network. For this purpose, we set up a 10,000 bootstrap experiment and run the WECANN retrieval by adding error to input variables. The errors are normally distributed with mean zero and a standard deviation that depends on the input variable.
For SIF, air temperature and soil moisture, we use the error estimates or standard deviations reported in their associated products.
These errors are spatially and temporally varying and we used the associated value for each time and space data point. For net radiation, we use a constant standard deviation of 34.58 W m -2 based on the analysis by . For precipitation and SWE estimates, we use a conservative 10% of the estimates themselves as a standard deviation for error. For each bootstrap replicate, we sample from the error distribution of each input variable and add that to the input. Figure 13 shows the results of the bootstrap for each of the LE, H and GPP globally and in different climatic zones (defined in 5 Section 4.3). Each panel in Figure 13 shows the uncertainty derived from the bootstrap experiment, relative to the interannual variability of the retrievals. GPP estimates are provided in units of PgC yr -1 as total productivity in each region. LE and H are As expected, the variability of LE and H are anti-correlated. We note that while WECANN is trained on three independent estimates of LE, H and GPP, its inter-annual variability is driven by remote sensing observations that are input to the ANN.

Impact of SIF on the retrieval of surface fluxes and GPP
Satellite SIF observations are relatively new, and have not been used to estimate LE and H at the global scale previously. Therefore, 20 we want to assess the information content of SIF observations in the WECANN retrievals by replacing them with more typical optical/near-infrared indices of vegetation (NDVI or EVI).
To do so, we trained two different ANNs with NDVI and EVI instead of SIF data on each of the three variables (LE, H and GPP) and evaluated the retrievals against the same FLUXNET tower measurements used in Section 4.2 for evaluating WECANN retrievals. Tables S4 -S6 show the results of evaluations of these three retrievals against the tower measurements for LE, H and 25 GPP, respectively. In terms of correlation coefficient, on average all three retrievals have relatively similar performance except in regions where phenology (and incident radiation) is not the main contributor to the flux variability such as in Spain (ES-LgS).
Indeed, in such regions changes in canopy structure is more limited and changes in response to water stress (through changes in light and water use efficiency) are the primary reason for the seasonal variability. This emphasizes, similarly to current thinking on the SIF signal, that the monthly SIF signal is dominated by incident radiation and canopy structure but that in some conditions 30 light use efficiency changes are detected by SIF, but not optical vegetation indices (Lee et al., 2013). We also point out that current SIF retrievals (such as those from GOME-2 used here) are still noisy as they were not obtained by satellites designed to measure SIF. Future SIF designated missions such as Fluorescence Explorer (FLEX) will have higher accuracy and finer spatial and temporal resolution (Drusch et al., 2016). We expect they will further enhance the retrievals of surface fluxes and GPP such as those from WECANN. 35 This study introduces a new statistical approach to retrieve global surface latent and sensible heat fluxes as well as gross primary productivity using remotely sensed observations at a monthly time scale. The methodology is developed based on an Artificial Neural Network (ANN) that uses six input datasets including solar induced fluorescence (SIF), precipitation, net radiation, soil moisture, snow water equivalent, and air temperature. Moreover a Bayesian approach is implemented to optimally integrate 5 information from three target datasets for training the ANN using Triple Collocation to calculate a priori probabilities for each of the three target datasets based on their uncertainty estimates.
The new global product, referred to as WECANN, is evaluated using target datasets as well as FLUXNET tower observations. The evaluation results comparing with training datasets show that our retrieval has similar correlation with the three products while it has the smallest RMSD with FLUXNET-MTE for LE (RMSD=6.42 W m -2 ), H (RMSD=7.84 W m -2 ) and GPP (RMSD=0.88 gC 10 m -2 day -1 ), which is believed to be one of the most realistic global datasets and it has the lowest RMSE based on our TC error estimates ( Fig. S4 -Fig. S6), despite its reported underestimated inter-annual variability due to the use of climatological values for several meteorological drivers (Miralles et al., 2014a(Miralles et al., , 2016. Such tendency also can be summarized from the global difference maps, which show that FLUXNET-MTE has the best agreement with WECANN retrievals. The WECANN and FLUXNET-MTE approaches are both based on machine learning, although the FLUXNET-MTE retrievals use a regression tree rather than an ANN. 15 Nevertheless, this commonality of methods may also contribute to the greater correspondence between these two datasets.
The retrieval maps indicate that LE, H and GPP have similar seasonal variability and distribution which are determined by annual phenological cycle in energy limited Northern latitude regions, dryness in Mediterranean and Monsoonal climates and by light availability in rainforests. Seasonal radiation has great impact on some regions for all three variables, such as Eastern U.S., Europe and East Asia, which have wet conditions, are highly vegetated and located in mid-latitudes. As opposed to this, the seasonal 20 variability of LE, H and GPP in some low-latitude and wet condition regions, such as Amazon rainforest, Southern Africa and Southeast Asia, as well as some low-latitude arid regions, such as Southwest U.S., Western Australia, North Africa and Western Asia are not significant, as there is less seasonal solar radiation variability in aforementioned regions. Comparison between the LE, H, and GPP, they all demonstrate generally similar patterns of seasonal variability through time.
We also assessed the impact of SIF on retrieval quality. In comparison to optical-based vegetation indices, SIF has better 25 performance in regions where phenology and incident radiation are not the main contributor to flux variability, while it has similar performance in other regions.
From the evaluation results comparing with FLUXNET tower observations, it is noted that WECANN has better performance compared to other global products. LE and H estimates from WECANN are more consistent with tower observations compared to GPP. WECANN retrievals have better correlation with tower observations in 76% of site for LE, 54% of sites for H, and 53% of 30 sites for GPP compared to other products. Moreover, retrievals from WECANN outperform other global products in capturing the seasonality of surface fluxes and GPP across a wide range of sites with different climatic and biome conditions. We also assessed the performance of WECANN in capturing extreme heatwave and drought events, and showed that in the case of Russia 2010 heatwave, Texas 2011 drought and US Corn Belt 2012 drought WECANN properly captures the extent of the anomalies in LE, H and GPP. Moreover, an independent ET estimate from a water budget closure model was used to evaluate 35 WECANN ET estimates across five large basins, and it showed small to moderate errors for WECANN retrievals.

Data Availability
WECANN product is publicly available for download on Aura Validation Data Center (AVDC) at Goddard Space Flight Center via https://avdc.gsfc.nasa.gov/pub/data/project/WECANN/

Competing Interests
The authors declare that they have no conflict of interest. 5 Figure 1: Architecture of the ANN layers. Input layer provides the matrix P of the inputs to the Hidden layer. Hidden layer has a matrix W of weights and b of biases for the neurons, and the f1 transfer function. The output of the Hidden layer (a = f1(WP +b) ) is an input to the Output layer that applies the transfer function f2 to the estimates and generates final outputs O.