The sources of systematic error responsible for introducing significant biases in the eddy covariance (EC) flux computation are manifold, and their correct identification is made difficult by the lack of reference values, by the complex stochastic dynamics, and by the high level of noise characterizing raw data. This work contributes to overcoming such challenges by introducing an innovative strategy for EC data cleaning. The proposed strategy includes a set of tests aimed at detecting the presence of specific sources of systematic error, as well as an outlier detection procedure aimed at identifying aberrant flux values. Results from tests and outlier detection are integrated in such a way as to leave a large degree of flexibility in the choice of tests and of test threshold values, ensuring scalability of the whole process. The selection of best performing tests was carried out by means of Monte Carlo experiments, whereas the impact on real data was evaluated on data distributed by the Integrated Carbon Observation System (ICOS) research infrastructure. Results evidenced that the proposed procedure leads to an effective cleaning of EC flux data, avoiding the use of subjective criteria in the decision rule that specifies whether to retain or reject flux data of dubious quality. We expect that the proposed data cleaning procedure can serve as a basis towards a unified quality control strategy for EC datasets, in particular in centralized data processing pipelines where the use of robust and automated routines ensuring results reproducibility constitutes an essential prerequisite.
In the last decades, the number of eddy covariance (EC) stations for measuring biosphere–atmosphere exchanges of energy and greenhouse gases (mainly
The use of the EC technique involves a set of complex choices. Selection of the measurement site and of the instrumentation, design of the data acquisition strategy, deployment and maintenance of the EC system, and design of the data processing pathway are only some examples of such choices. Over time, the EC community has developed guidelines and best practices aimed at “standardizing” the methodology, with the overarching goal of increasing comparability and integrability of flux datasets across different stations, thereby improving the robustness and accuracy of the resulting syntheses, analyses and models. Examples of efforts in this direction can be found in the EC handbooks by
An integral part of the EC method is the definition of quality assurance (QA) and quality control (QC) procedures. Quality assurance refers to the set of measures aimed at preventing errors and therefore concerns design of the experimental setup, selection of the site, choice of instrumentation and its deployment, and maintenance scheduling. Quality control, instead, refers to the ensemble of procedures aimed at (1) identifying and eliminating errors in resulting datasets (i.e. data cleaning) and (2) characterizing the uncertainty of flux measurements. This paper is concerned with the definition of QC procedures for EC datasets, in particular for error identification and data cleaning.
In the context of EC, a thorough QC scheme should aim at detecting errors caused by instrumental issues as well as by violations of the assumptions underlying the method
It is worth noting that EC measurements, like any measurement process, are also subject to a number of unavoidable sources of random error that cause noise in flux data, due, for example, to sampling a 3D stochastic process at a single point, to the finite precision of the measuring devices, or to the variability of the source area (the so-called flux footprint) within the flux averaging timescale. By definition, random errors cannot be eliminated in single-measurement experiments such as EC measurements. However, their effect can be minimized by careful QA procedures (e.g. selection of station location and of instrumentation, relative to the intended application) and quantified by characterizing the random error distribution through an appropriate probability density function (PDF), most commonly assumed to be well-approximated by a normal or Laplace distribution
Following
In practice, it is difficult to distinguish between random and systematic errors because some sources of error can have both a random and a systematic component, there are no reference values to quantify the bias, and there are no replicates to consistently quantify the random uncertainty. To avoid confusion, however, it is worth stressing that the difference between random and systematic errors should not be linked to the intrinsic characteristics of the source of error but rather to the effect it has on the quantity of interest.
Following this rule, if some source of error is responsible for over-/underestimating the true target value, then the source of error is systematic, even if it manifests itself as a noise term showing characteristics similar to a random error component. In the ideal case of
It follows from those definitions that long-term biases in flux time series (say, a systematic underestimation) can only be caused by sources of systematic error. However, it cannot be said that a specific source of systematic error acts constantly over time, because its effect can vary both in sign and in magnitude. Sources of random error, instead, never lead to long-term biases
QC procedures developed to identify EC fluxes affected by significant errors can be broadly classified into partially and completely data-driven (or automatic) approaches. The former entail at least some degree of subjective evaluation in the decision process. They rely on the ability of the analyst to make a final call on whether a data point should be retained or rejected, allowing the researcher to exploit the accumulated knowledge about the site and the dataset, in order to discern what is physically or ecologically implausible. Such a call is usually made on the basis of some preliminary error detection algorithm and, in practice, is typically performed via visual inspection. As an example,
Usually, a QC procedure comprises multiple routines, each of which evaluates data with respect to a specific source of systematic error. In the case of EC, we therefore have routines to identify instrumental issues, severe violations of the method assumptions, or issues with the data processing pipeline. Obviously, there may be several routines for each category, for instance routines that look at specific types of instrumental issues (e.g. spikes, reached detection limit, implausible discontinuities in the time series).
Once a set of QC tests has been selected, the results of these tests must be combined, in order to derive an actionable
Two aspects of this approach to QC classification are of concern. Firstly, the combination of the results of individual tests into an overall flag appears to be somewhat arbitrary and no directives are provided – nor are they easily imagined – as to how to extend the combination to integrate results of additional or alternative tests. More fundamentally, with the methods above, the final aim of the process is to attach a quality statement to the data via a flag, as opposed to identifying data points affected by severe errors, therefore confounding the two processes – that we deem distinct – of cleaning the dataset and of characterizing its quality. We suggest, instead, that a
Bearing this in mind, the aim of this paper is thus to present a robust data cleaning procedure which (i) includes only completely data-driven routines and is therefore suitable for automatic and centralized data processing pipelines, (ii) guarantees result reproducibility, (iii) is flexible and scalable to accommodate addition or removal of routines from the proposed set, and (iv) results in a binary label such as retain or reject for each data point, decoupling the data cleaning procedure from its quality evaluation.
Since quantifying bias affecting EC fluxes at a half-hourly scale is not possible in the absence of reference values, the only option is to ascertain the occurrence of specific sources of systematic error which, in turn, are assumed to introduce biases in flux estimates. Therefore, the proposed data cleaning procedure includes (i) a set of tests aimed at detecting the presence of a specific source of systematic error and (ii) an outlier detection procedure aimed at identifying aberrant flux values. As described below, results from tests and outlier detection are integrated in such a way as to leave a large degree of flexibility in the choice of tests and of test threshold values without loss of efficacy, while striving to avoid the use of subjective criteria in the decision rule that specifies whether to ultimately retain or reject flux data of dubious quality.
For each test, two threshold values are defined, designed to minimize false-positive and false-negative error rates. Comparing the test statistic with such thresholds, each test returns one of three possible statements:
Threshold values can be set either based on the sampling distribution of the test statistic (e.g. when the sampling distribution of the statistic is standard normal, tabulated
Test results are used as inputs to the data cleaning procedure, which includes two stages (see Fig.
Schematic summary of the proposed data cleaning procedure. SevEr and ModEr indicate strong and weak evidence about the presence of a specific source of systematic error, respectively.
In the second stage, flux data that inherited no SevEr statement are subject to an outlier detection procedure, and only flux data that are both detected as outliers and inherited at least one ModEr statement are conclusively rejected. This implies that data points that inherited any number of ModEr statements but were not detected to be outliers, as well as outliers which showed no evidence of systematic errors, are retained in the dataset and can be used for any analysis or modelling purposes. In other words, only data points that provide strong evidence of error (either because of a SevEr statement or because of a ModEr statement combined with being identified as an outlier) are rejected, while peculiar data points, which would look suspicious on visual inspection and are possibly identified as outliers, but only inherited NoEr statements, are retained.
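The two-stage decision rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the RFlux implementation: the names `Statement`, `classify` and `retain` are hypothetical, and `classify` assumes that larger values of a test statistic indicate stronger evidence of error, with the two thresholds supplied by the user.

```python
from enum import IntEnum


class Statement(IntEnum):
    """Possible outcomes of a single QC test (names are illustrative)."""
    NOER = 0   # negligible evidence of systematic error
    MODER = 1  # weak evidence
    SEVER = 2  # strong evidence


def classify(stat, moder_threshold, sever_threshold):
    """Map a test statistic to a statement.

    Assumption: larger values mean stronger evidence of error and
    moder_threshold <= sever_threshold.
    """
    if stat > sever_threshold:
        return Statement.SEVER
    if stat > moder_threshold:
        return Statement.MODER
    return Statement.NOER


def retain(statements, is_outlier):
    """Return True if a flux value is kept, False if it is rejected.

    Stage 1: reject if any test returned SevEr.
    Stage 2: reject if the value is an outlier AND at least one test
             returned ModEr.
    Outliers with only NoEr statements are retained.
    """
    worst = max(statements, default=Statement.NOER)
    if worst == Statement.SEVER:
        return False
    if worst == Statement.MODER and is_outlier:
        return False
    return True
```

Because the rule only looks at the worst statement and at the outlier label, adding or removing a test amounts to changing the length of `statements`, which is the scalability property the procedure is designed for.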
The following section details the set of tests used in the proposed data-cleaning procedure and summarized in Table
Sources of systematic error, test statistics and adopted threshold values to define NoEr, ModEr and SevEr statements. Details on how the threshold values have been set are provided in Sect. 2.2.
The proposed data cleaning procedure is implemented in the RFlux R package
The possible sources of systematic errors are divided and analysed into three categories: (1) instrumental issues, (2) poorly developed turbulence regimes and (3) non-stationary conditions.
Modern EC instruments can detect and report malfunctions occurring during the measurement process via diagnostic variables. However, there are situations where a measurement is affected by an error but it is still valid from the instrument's perspective, and for this reason it is not flagged by the instrument diagnostics. As an example, a physical quantity may have variations that are too small to detect, given the settings or the specifications of the instrument. This is the case of a time series of temperature that varies very little during a calm, stable night; the measurements may be affected by a low-resolution problem, where the quantization of the measurement due to the intrinsic resolution of the instrument becomes evident and leads to a reduced variability of the underlying signal. In this case, from the measurement perspective, there is nothing wrong with the measured quantity and diagnostics would not indicate any issues. In addition, with some instruments, especially older models, or often when collecting data in analogue format, diagnostic information is simply not available
EC fluxes are calculated starting from the covariance between the vertical wind speed,
Data records can be unavailable for covariance computation for a variety of reasons.
First, records may simply be missing because of problems during data acquisition. In addition, specific values may be eliminated if
(i) instrumental diagnostics flag a problem with the measurement system; (ii) individual high-frequency data points are outside their physically plausible range or are identified as spikes (in this work we used the despiking algorithm proposed by VM97); (iii) data were recorded during periods when the wind was blowing from directions known to significantly affect the turbulent flow reaching the sonic anemometer sampling volume, e.g. due to the interference of the anemometer structure itself (C-clamp models) or to the presence of other obstacles; (iv) the angle-of-attack of individual wind measurements is beyond the calibration range specified by the sonic anemometer manufacturer.
Note that the criteria above may apply to single variables, groups of variables (e.g. anemometric variables) or entire records.
Although covariances can also be computed on time series with gaps, some of the procedures involved in the typical data processing do require continuous time series
Typically, gaps in raw time series are filled using linear interpolation. While this algorithm provides obvious computational and implementation advantages, it should be noted that it only performs satisfactorily when the time series is dominated by low-frequency components, while it can introduce biases in time series characterized by significant high-frequency components, as is the case with EC data. Its application should thus be limited to very short gaps. The pattern of missing data plays a role too: when gaps occur simultaneously across all variables, linear interpolation can lead to biases in resulting covariances even for short gaps. Such biases are linearly proportional to the amount of missing data and relatively larger for smaller fluxes. These considerations also apply to other interpolation methods such as splines and the last-observation-carried-forward method.
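The effect of linear interpolation on covariances can be illustrated with a small numerical experiment. The sketch below is only indicative: it uses two correlated white-noise series (an extreme case of a signal dominated by high-frequency components), and the gap layout (5 consecutive records missing every 20, simultaneously in both variables), the seed and the function names are arbitrary assumptions made for illustration.

```python
import random


def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def fill_linear(x, gaps):
    """Fill gaps by linear interpolation between the neighbouring values.

    gaps: list of (start, stop) index pairs, stop exclusive. Each gap is
    assumed to have a valid value immediately before and after it.
    """
    y = list(x)
    for start, stop in gaps:
        left, right = y[start - 1], y[stop]
        span = stop - start + 1
        for k in range(start, stop):
            y[k] = left + (right - left) * (k - start + 1) / span
    return y


def simulate_pair(n, seed=1):
    """Two correlated white-noise series standing in for w and a scalar."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    c = [0.5 * wi + rng.gauss(0.0, 1.0) for wi in w]
    return w, c


n = 18000  # 30 min at 10 Hz
w, c = simulate_pair(n)
gaps = [(s, s + 5) for s in range(10, n - 10, 20)]  # simultaneous short gaps

cov_reference = covariance(w, c)
cov_filled = covariance(fill_linear(w, gaps), fill_linear(c, gaps))
print(f"reference covariance: {cov_reference:.3f}")
print(f"after interpolation:  {cov_filled:.3f}")
```

In this setting the interpolated covariance is damped with respect to the reference value, because the interpolated segments carry less joint variability than the data they replace.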
With this in mind, by evaluating the fraction of missing records (FMR) and the longest gap duration (LGD) in time series involved in the covariance estimation, we suggest the following classification criteria:
SevEr if FMR ModEr if 5 % NoEr if FMR
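As an illustration, FMR and LGD can be computed from a raw time series as sketched below. The function name, the convention that missing records are encoded as `None` or NaN, and the default sampling frequency of 10 Hz are assumptions made for this example; the resulting values would then be compared with the thresholds listed above.

```python
import math


def missing_stats(series, sampling_hz=10.0):
    """Return (FMR, LGD) for one raw time series.

    FMR: fraction of missing records, in percent.
    LGD: longest gap duration, in seconds.
    Missing records are assumed to be encoded as None or NaN.
    """
    n = len(series)
    if n == 0:
        raise ValueError("empty series")
    missing = 0   # total number of missing records
    longest = 0   # length of the longest run of missing records
    run = 0       # length of the current run
    for value in series:
        is_missing = value is None or (
            isinstance(value, float) and math.isnan(value)
        )
        if is_missing:
            missing += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return 100.0 * missing / n, longest / sampling_hz
```

When several variables enter the covariance, one option is to mark a record as missing whenever any of them is missing and apply the same function to the combined series.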
High-frequency EC data can be affected by low-signal-resolution problems (see VM97 for a detailed description). Resolution problems are mainly caused by a limited digitalization of the signal during data acquisition, when signal fluctuations approach the resolution of the instruments. This kind of problem causes a step ladder in the distribution of sampled data, and time series are characterized by the presence of repeated contiguous values.
Instrumental faults can lead to a time series that remains fixed at a constant value for a period of time (dropout), analogously introducing artificial repeated values, though with a very different pattern of repetition.
Repeated values are always to be considered an artefact since even in the (unlikely) event of a signal maintaining a constant value for an extended period of time, its measured values would still vary on account of the random error. In this particular scenario, contiguous repeated values would not actually lead to a bias in the flux estimate, because neglecting the presence of random error (as defined in the introductory section) does not affect covariance estimates. Instead, when repeated values do not reflect the true dynamics of the underlying signal, they can lead to a significant bias in flux estimates.
To disentangle these two situations, we evaluate the discrepancy between the SevEr if ModEr if NoEr if
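Before any formal test, the pattern of repetition can be inspected with a simple run-length summary. The sketch below is purely descriptive and is not the LSR test statistic; the function names are hypothetical. Many short runs are typical of the step-ladder pattern of a resolution problem, while a single long run is typical of a dropout.

```python
from itertools import groupby


def repeated_runs(x):
    """Lengths of runs of identical contiguous values (length >= 2 only)."""
    lengths = (sum(1 for _ in group) for _, group in groupby(x))
    return [n for n in lengths if n >= 2]


def repetition_summary(x):
    """Summarize contiguous repeated values in a time series."""
    if not x:
        raise ValueError("empty series")
    runs = repeated_runs(x)
    # In a run of length r, r - 1 values merely repeat their predecessor.
    n_repeated = sum(r - 1 for r in runs)
    return {
        "fraction_repeated": n_repeated / len(x),
        "n_runs": len(runs),
        "longest_run": max(runs, default=0),
    }
```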
EC time series can be subject to regime changes such as sudden shifts in the mean and/or in the variance. In some circumstances, those are imputable to natural causes as in cases of intermittent turbulence
The first test takes into consideration the homogeneity of the distribution of fluctuations (
The second test makes use of the same rule, but it evaluates the homogeneity of the distribution of first-order differenced data, SevEr, if HF ModEr, if 2 % NoEr, if HF
As a robust estimate of
The third test consists of evaluating the kurtosis index on the differenced data
To eliminate the sensitivity of the KID to the presence of repeated values (which become zeros in the differenced variable), such values were not included in KID estimation. Bearing in mind that the kurtosis index for a Gaussian and a Laplace PDF is equal to 3 and 6, respectively, we choose reasonably large threshold values to make sure we select only time series characterized by heavy-tailed distributions, as is the case for data contaminated by extreme events representative of the aforementioned problems. Namely, we suggest that the following criteria be applied:
SevEr if KID ModEr if 30 NoEr if KID
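A minimal sketch of the KID computation is given below. It assumes the (non-excess) Pearson kurtosis, i.e. the fourth central moment divided by the squared variance, which is the convention under which a Gaussian PDF yields 3; the function name is hypothetical, and zeros in the differenced series are discarded as described above.

```python
def kurtosis_index_of_differences(x):
    """Kurtosis index computed on first-order differenced data (KID).

    Zero differences, which arise from contiguous repeated values, are
    excluded before estimating the moments.
    """
    diffs = [b - a for a, b in zip(x, x[1:])]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n < 4:
        raise ValueError("not enough non-zero differences")
    mean = sum(diffs) / n
    m2 = sum((d - mean) ** 2 for d in diffs) / n  # second central moment
    m4 = sum((d - mean) ** 4 for d in diffs) / n  # fourth central moment
    return m4 / (m2 * m2)
```

For Gaussian noise the index stays close to 3, whereas a handful of large spikes is enough to inflate it by one or two orders of magnitude, which is why large threshold values remain selective.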
One of the assumptions underlying the EC method is the occurrence of well-developed turbulence conditions. A widely used test to assess the quality of the turbulence regime is the integral turbulence characteristics (ITC) test proposed by
The criteria adopted to assign SevEr, ModEr and NoEr statements are based on threshold values proposed by SevEr if ITC ModEr if 30 NoEr if ITC
The working equation of turbulent fluxes as the (appropriately scaled)
To avoid biases, a possible approach is to preliminarily remove the source of non-stationarity before calculating covariances. This way, the amount of data rejected for non-stationary conditions can be limited. To this end, procedures based on linear detrending or running mean filtering are often used during the raw data processing stage. However, their application can be ineffective, for example when linear detrending is used on nonlinear trends (see the Supplement, Sect. S2), and can even introduce further biases when data are truly stationary or when non-stationarity is of more complex nature
A few tests have been proposed for EC data, of which we discuss two popular ones. A widely used statistic was introduced by FW96 and relies on the comparison between the covariance computed over the flux averaging period
A major problem with Eq. (
An alternative test was proposed by
With respect to the FW96 test, the non-stationarity ratio by M98 is always well-behaved with the denominator strictly positive (average of standard deviations). For this reason, the test proposed by M98 was selected for this work.
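To make the contrast concrete, the FW96-type statistic can be sketched as follows, in the form in which it is commonly applied: the relative difference between the mean of the covariances computed over sub-periods and the covariance computed over the whole averaging period. The number of sub-periods (six, i.e. 5 min blocks within 30 min) and the function names are assumptions of this example. The explicit check on the denominator shows where such a ratio becomes ill-defined, namely when the full-period covariance approaches zero.

```python
def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def fw96_ratio(w, c, n_sub=6):
    """Relative non-stationarity of the covariance, FW96-style.

    Returns |mean(sub-period covariances) - full covariance| divided by
    |full covariance|, as a fraction (not a percentage).
    """
    size = len(w) // n_sub
    if size < 2:
        raise ValueError("series too short for the requested sub-periods")
    end = size * n_sub  # drop a possible incomplete last block
    cov_full = covariance(w[:end], c[:end])
    if cov_full == 0:
        raise ZeroDivisionError("full-period covariance is zero")
    cov_sub = [
        covariance(w[i * size:(i + 1) * size], c[i * size:(i + 1) * size])
        for i in range(n_sub)
    ]
    mean_sub = sum(cov_sub) / n_sub
    return abs((mean_sub - cov_full) / cov_full)
```

For small fluxes the denominator is close to zero and the ratio can become arbitrarily large even under stationary conditions, which is not the case for a statistic whose denominator is an average of standard deviations.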
Based on a performance evaluation described later in this paper (see Sect. SevEr if ModEr if NoEr if
As described in Sect. 2.1, the second step in the data cleaning procedure consists of detecting outlying fluxes and removing them if they inherited at least one ModEr statement. Outlying fluxes can be caused by a variety of sources involving instrument issues and natural causes (e.g. non-stationary conditions as often occur during post-dawn transition from a stable to a growing/convective boundary layer).
The outlier detection proceeds as follows: first, a signal extraction is performed, to obtain an estimate
Condition (i) can be assessed by examining the degree of serial correlation in the residual time series. When a model is correctly specified, residuals should not show any serial correlation structure and resemble a white noise process. Distinguishing between conditions (ii) and (iii) is not trivial and would require an in-depth analysis of the causes generating the anomalies. We propose considering the presence of at least a ModEr statement as a symptom of condition (ii) and thus rejecting flux data identified as outliers and flagged with at least a ModEr statement. Otherwise, if all tests return NoEr statements, the outlying data are retained irrespective of the magnitude of the anomaly. Details about the modelling framework and how to take into account the heteroscedastic behaviour of the residual component are provided in the following.
For the purpose of signal extraction, we considered the following multiplicative model:
The estimation of each component in Eq. (
The main steps of the STL algorithm are as follows. To separate out the diurnal cycle component, STL fits a smoothing curve to each sub-series that consists of the points in the same phase of the cycles in the time series (i.e. all points at the same half hour of day, for diurnal cycles). After removing the diurnal cycle, it fits another smoothing curve to all the points consecutively to get the trend components. This step can be executed iteratively in the presence of outliers. In particular, the STL algorithm deals with outliers by down-weighting them and iterating the procedure of diurnal cycle and trend component estimation. As the diurnal cycle and trend are smoothed, outliers tend to aggregate in the irregular component. The span parameters of the loess functions used for the extraction of the
The ability of the model to adequately describe the dynamics of the time series under investigation is determined by analysing the statistical properties of the residual time series. As mentioned above, a properly fitted model must produce residuals that are approximately uncorrelated in time. Any pattern in residuals, in fact, indicates the fitted model is misspecified and, consequently, some kind of bias in fitted values is introduced. To this end, spectral properties of the incomplete residuals are assessed through the Lomb–Scargle (LS) periodogram
Previous studies focusing on flux random uncertainty quantification have evidenced a heteroscedastic behaviour of the random error
It is important to note here that a significant limitation of the proposed outlier detection method is the inability to detect systematic errors whose sources act constantly across a flux time series, for long periods of time. As an example, a miscalibration of the instruments that persists for days would most likely not be identified as outlying fluxes by the proposed outlier detection procedure. Such sources of systematic errors should be prevented via appropriate QA actions or, at least, specific QC tests should be devised, able to mark those time series with SevEr statements.
This work makes extensive use of Monte Carlo experiments based on simulated time series mimicking raw EC datasets. The main purpose of the simulations is to create pairs of reference time series with known covariance structure that, after being contaminated with a specific source of systematic error, allow a quantitative and objective evaluation of (i) the bias induced by the source of systematic error and (ii) the ability of the proposed tests to correctly detect it. We note that, for these purposes, it is not strictly required to simulate medium- or long-term time series with realistic joint probability distributions from which to generate half-hourly fluxes with typical diurnal and seasonal cycles. In fact, it is reasonable to assume that QC tests exhibiting poor performance on simulated data have little chance of success when applied to observed time series, which can exhibit more complex structures of the underlying signal and of the (random) measurement error process, and which can be contaminated by the simultaneous occurrence of several sources of systematic errors.
Based on the main properties of EC time series (summarized in Sect. S1), synthetic time series were created from two first-order autoregressive processes (hereafter denoted AR; see Sect. S2 for an overview of their stochastic properties) representative of the vertical wind speed and of the atmospheric scalar. The procedure adopted to ensure that simulated AR processes have a pre-fixed correlation structure is described in Sect. S3. All simulations were executed in the R programming environment
Several scenarios have been considered to simulate stationary time series with different degrees of temporal dependence and with pre-fixed correlation structures in order to simulate fluxes of different magnitude. All simulated time series have 18 000 data points as in EC raw high-frequency time series sampled at 10 Hz and collected in 30 min files. Once simulated, time series were then contaminated with specific sources of systematic error, the details of which will be provided in Sect.
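One standard way of obtaining two AR(1) processes with a pre-fixed contemporaneous correlation is to drive them with correlated Gaussian innovations. For AR coefficients \phi_w and \phi_c and innovation correlation \rho_e, the correlation between the two stationary processes is \rho = \rho_e \sqrt{(1-\phi_w^2)(1-\phi_c^2)} / (1-\phi_w\phi_c), which can be inverted to obtain \rho_e from the target \rho. The Python sketch below follows this route; it is not necessarily the procedure of Sect. S3, and the function names, the seed and the burn-in length are assumptions.

```python
import math
import random


def simulate_ar1_pair(n, phi_w, phi_c, rho, seed=0, burn_in=1000):
    """Simulate two AR(1) series whose correlation is approximately rho."""
    scale = math.sqrt((1 - phi_w ** 2) * (1 - phi_c ** 2))
    rho_e = rho * (1 - phi_w * phi_c) / scale  # innovation correlation
    if abs(rho_e) > 1:
        raise ValueError("target correlation not attainable")
    rng = random.Random(seed)
    w = c = 0.0
    ws, cs = [], []
    for t in range(n + burn_in):
        e_w = rng.gauss(0.0, 1.0)
        e_c = rho_e * e_w + math.sqrt(1 - rho_e ** 2) * rng.gauss(0.0, 1.0)
        w = phi_w * w + e_w
        c = phi_c * c + e_c
        if t >= burn_in:  # discard the transient from the zero start
            ws.append(w)
            cs.append(c)
    return ws, cs


def correlation(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

For example, `simulate_ar1_pair(18000, 0.7, 0.5, 0.4)` produces a pair of series of the same length as a 30 min raw file sampled at 10 Hz; the sample correlation fluctuates around the target value from one realization to the next.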
In this study, data from 10 EC sites that are part of the ICOS network
The lack of knowledge of the reference flux value and of the contributions of systematic and random error drastically limits the use of field data in test performance evaluation. However, field data are crucial for evaluating the impact of the proposed data cleaning procedure and, in particular, for understanding which is the main source of systematic error affecting flux variables. Although the performance evaluation of the proposed tests is mainly carried out via Monte Carlo simulations, a selection of raw data was also used to provide representative examples of test application.
The turbulent fluxes of
The net ecosystem exchange of
A simulation study was carried out with the twofold aim of quantifying the bias caused by low-signal-resolution problems and evaluating the performance of the LSR test described in Sect. 2.2.1.
To this end, error-free AR processes of length
In summary, the following six macro-scenarios, AR processes with AR processes with AR processes with AR processes with AR processes with AR processes with
Three realizations of AR processes of length 1000 with
Bias affecting correlation estimates was quantified as the difference in absolute percentage between the correlation estimated on error-free (
Bias effect of low-signal-resolution problems and test performance evaluation in the
The ability of the LSR test to disentangle among these situations can be appreciated by looking at the distribution of the
We applied both LSR and amplitude resolution testing procedures to the selected field datasets described in Sect. 2.5 (results are reported in Sect. S4). For
To aid in comparison, some illustrative examples are shown in Fig.
Comparison of low-signal-resolution (LSR) test and amplitude resolution test by VM97 on observed data collected at the SE-Nor site. Sonic temperature time series collected in 30 min (left panels) and in a shorter temporal window of length equal to 1000 time steps (middle panels). The right panels show the cross-correlation function (CCF) between
The Monte Carlo experiment designed for the evaluation of the performance of the proposed structural changes tests described in Sect. 2.2.1 involved six scenarios, a stochastic trend; a deterministic linear trend; an abrupt change in the mean level whose duration and shift were fixed at 3000 time steps and 3 times the interquartile distance (IQD) of the data, respectively; multiplying a block of consecutive data of size 6000 (corresponding to 10 min in EC raw data sampled at 10 Hz) by a cosine function to mimic episodic burst events as often observed in real data; introducing 0.5 % of spiky data generated by adding replacing 15 % of the data with the original values multiplied by 5, to simulate changes in variance.
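Two of these contaminations, the abrupt change in the mean level and the change in variance, can be sketched as follows. The function names and the start index of the shift are hypothetical, and the example assumes that the 15 % of data to be inflated are drawn at random positions, which the description above does not specify.

```python
import random
import statistics


def add_mean_shift(x, start, duration=3000, n_iqd=3.0):
    """Shift a block of data by n_iqd times the interquartile distance."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    shift = n_iqd * (q3 - q1)
    stop = start + duration
    return [v + shift if start <= i < stop else v for i, v in enumerate(x)]


def inflate_variance(x, fraction=0.15, factor=5.0, seed=0):
    """Multiply a random subset of the data by a constant factor.

    Assumption: affected positions are chosen uniformly at random.
    """
    rng = random.Random(seed)
    k = round(fraction * len(x))
    chosen = set(rng.sample(range(len(x)), k))
    return [v * factor if i in chosen else v for i, v in enumerate(x)]
```

Applying such functions to an error-free simulated series yields reference and contaminated pairs, on which both the induced bias and the detection ability of a test can be evaluated.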
Although these scenarios only cover a fraction of the problems encountered in real observations, the experiment aims at evaluating the test sensitivity in the presence of aberrant structural changes which, in most cases, can only be imputable to malfunction of the measurement system. Note that scenarios
An illustrative example of the simulated AR process mimicking sonic temperature time series contaminated by structural changes according to the
Illustrative example of an autoregressive (AR) process
Distribution of results for the structural change detection tests in the
We observe that all tests exhibited a low false-positive error rate (compare scenarios
The application of the testing procedures on actual EC time series has shown a higher sensitivity of the VM97 tests, compared to the newly proposed tests, with a tendency to assign hard flags even in cases in which no evidence of instrumental error was supported by visual inspection. Conversely, the proposed tests were more selective at identifying data affected by structural changes, although in some cases such structural changes were not necessarily attributable to instrument malfunction. In most of these occurrences, however, structural changes are indicative of non-stationary conditions, which are another source of systematic error that introduces bias in covariance estimates (see Sect. 3.1.3). This is an example where two tests are not fully independent and could identify the same issue, which is why the ModEr statements are not combined.
Illustrative examples of application of the testing procedures on raw data are shown in Fig.
Application of tests for structural change detection on a selection of
We compared the performance of the stationarity tests by FW96 and by M98 via Monte Carlo simulation making use of synthetic bivariate pairs of AR processes, as in as in as in as in
Notice that the
To mimic trend dynamics of magnitude similar to those observed in real cases, the slope of the deterministic linear trend function
Stochastic trend components were generated as
Representative realizations of the simulated AR processes and their CCFs, estimated either on a single run or averaged over multiple runs (i.e. ensemble CCF), are illustrated in Fig.
Illustrative examples of simulated AR processes,
Results of applying the FW96 and M98 stationarity tests for each of the eight scenarios considered are shown in Fig.
Performance evaluation of the stationarity tests by
In the
In both
The performance considerations described above are confirmed when tests are applied to observed data. Some examples are shown in Fig.
Application of stationarity tests to a selection of EC raw data collected at the FI-Hyy site. From left to right: vertical wind speed (
In this section we report the results of the data cleaning procedure based on the workflow depicted in Fig.
Illustrative example of the sequential data cleaning procedure applied to NEE fluxes at the FI-Sii site.
Percentages of
Figure
Severe low-signal-resolution issues were identified only sporadically (in no more than 2.4 % of flux values), while a ModEr statement for the HD
Figure
Example of STL decomposition applied to NEE time series collected at the FI-Sii site.
Panel
Previous research
Illustrative example of the outlier detection procedure applied to NEE fluxes at the FI-Sii site. NEE time series for the whole period (black lines) and selected flux values according to three deciles of the extracted signal (coloured points) are shown in
Quality control of eddy covariance flux datasets is challenging. The sources of systematic error responsible for introducing significant biases in the flux computation are manifold, and their correct identification is often made difficult by the masking effect induced both by the intrinsic stochastic properties (e.g. high degree of serial dependence, heteroscedasticity) and by the high level of noise characterizing raw data.
To take into account these features, new tests have been developed and included in a robust data cleaning procedure where the data rejection is articulated in two stages: the first stage involves the removal of any flux data for which at least one of the QC tests returned strong evidence of a specific source of systematic error (SevEr); the second stage consists of the removal of outlying fluxes, provided that at least one of the QC tests returned weak evidence of systematic error (ModEr) for the same flux value. Any flux data where all QC tests provided only negligible evidence of systematic error (NoEr) are never removed, even if they are later identified as outliers in the flux time series.
Although there is a strict relationship between the value of test statistics and the amount of bias affecting individual flux data, SevEr, ModEr and NoEr statements are interpreted in probabilistic terms, as the chance that a source of systematic error is introducing bias in the flux estimate. Consequently, the choice of threshold values used to assign the SevEr, ModEr and NoEr statements is to be interpreted as indicative of the margin of error associated with the result of a statistical test.
Compared to the existing classification schemes, the proposed approach does not aim at assigning a quality flag to flux data by combining the results of different QC tests. Rather, it is designed to be scalable, so as to facilitate the inclusion of new tests beyond those proposed in this paper. As for assigning a quality level to the retained data, we maintain that, given an unbiased flux estimate, a robust quantification of the random uncertainty would be the appropriate metric. Indeed, as a general principle, the larger the random uncertainty, the larger the amount of measurement error and, consequently, the lower the quality of the data. Assuming that flux data affected by systematic error have been avoided or removed via appropriate QA–QC procedures, the use of random uncertainty estimates as a quality indicator (1) would not constrain the QC test development, (2) would not preclude a classification of the data quality, if needed, and (3) would meet the requirements of advanced methods of analysis where interval estimates are more important than individual point estimates, such as in studies based on data assimilation techniques.
To this end, (global) uncertainty estimation procedures representative of the contribution of all possible sources of random error are required. They should include not only the contribution of random error caused by temporal and spatial sampling, but also that due to post-field data processing (e.g. the uncertainty related to the estimates of the spectral correction factors or that linked to the imputation model used to replace missing data). For such analyses, however, an essential prerequisite to achieve consistent results is the availability of unbiased, cleaned datasets. In this perspective, therefore, the data cleaning procedure proposed in this work constitutes an essential step in reducing the uncertainty of the results of subsequent analyses. Among these, analyses aiming at providing a posteriori information about the presence of undetected or unknown sources of systematic error (e.g. those based on the evaluation of energy balance closure) can also benefit from less biased data.
In this study, the performance of each proposed test was evaluated mainly by means of Monte Carlo simulations because they allow full control of (1) the time series dynamics (since the simulated autoregressive processes are stationary with a pre-fixed, reference correlation), (2) the presence of a specific source of systematic error and (3) the uncertainty due to the random error component. As a consequence, a proper evaluation of the bias caused by systematic errors and a more objective performance evaluation of the tests used for their detection become feasible. Such evaluations are difficult to achieve with real EC field data for two reasons: the reference “true” value is unknown, so the bias cannot be properly quantified, and replicates are not available, which makes it difficult to evaluate the uncertainty associated with the estimates (both covariances and test statistics) due to the random error component. Developing a stochastic simulation model complex enough to include all the sources of error, both systematic and random, that are present in real-world EC time series data is a difficult task and is a target of ongoing work. Nevertheless, although the model generating the simulated data has simple dynamics that can reproduce, at least in part, the behaviour of real EC data, if a QC test exhibits high false-positive and/or false-negative rates in such simple scenarios, it is most unlikely to work properly with real data. In this case, the test would likely be methodologically robust but unusable in practice, affected by inflated false-positive and false-negative error rates.
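The logic of such an experiment can be illustrated with a small, self-contained Python sketch. It is not the simulation design used in this study: it assumes two AR(1) processes sharing the same coefficient `phi`, with innovations correlated so that the reference correlation equals `rho`, and it injects a hypothetical systematic error in the form of a misalignment of `lag` samples between the two series. All function names and parameter values are illustrative assumptions.

```python
import math
import random
import statistics


def simulate_pair(n, phi, rho, rng):
    """Simulate two stationary AR(1) series whose true correlation is rho."""
    sd0 = 1.0 / math.sqrt(1.0 - phi ** 2)  # stationary standard deviation
    mix = math.sqrt(1.0 - rho ** 2)
    # Start from the stationary distribution so that no burn-in is needed.
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    w, c = [sd0 * z1], [sd0 * (rho * z1 + mix * z2)]
    for _ in range(n - 1):
        e = rng.gauss(0, 1)
        u = rho * e + mix * rng.gauss(0, 1)  # innovation correlated with e
        w.append(phi * w[-1] + e)
        c.append(phi * c[-1] + u)
    return w, c


def correlation(x, y):
    """Sample Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def monte_carlo(n_rep=100, n=2000, phi=0.9, rho=0.5, lag=0, seed=1):
    """Return (bias, spread) of the correlation estimate over n_rep replicates.

    lag > 0 injects the systematic error: the second series is misaligned
    by `lag` samples before the correlation is computed.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rep):
        w, c = simulate_pair(n + lag, phi, rho, rng)
        estimates.append(correlation(w[:n], c[lag:]))
    bias = statistics.fmean(estimates) - rho     # known reference makes bias measurable
    spread = statistics.stdev(estimates)         # replicates give the random uncertainty
    return bias, spread


bias_clean, spread_clean = monte_carlo(lag=0)
bias_lagged, spread_lagged = monte_carlo(lag=5)
```

Under these assumptions the expected correlation with a misalignment of k samples is `rho * phi**k`, so the bias of the lagged estimate has a known target against which a detection test could be scored. With real data neither the reference value nor the replicates would be available, which is the limitation discussed above.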
Although there is still room for improvement, particularly in the development of better-performing QC tests aimed at detecting violations of the main assumptions underlying the EC technique, we believe that the proposed data cleaning procedure can serve as a basis toward a unified QC strategy suitable for centralized data processing pipelines, where the use of completely data-driven procedures that guarantee objectivity and reproducibility of the results constitutes an essential prerequisite.
RFlux software package
Access to benchmark data described in Sect.
The supplement related to this article is available online at:
DV and DP conceived the study. DV organized the structure; selected, proposed and implemented the methodologies; and performed all the simulations, discussing the results with the coauthors. DV and GF wrote all the sections with the supervision of DP and with the contribution of all the coauthors. All authors reviewed the final manuscript, approved it and agreed on the submission.
The authors declare that they have no conflict of interest.
This research has been supported by the European Commission, H2020 Research Infrastructures (ENVRI PLUS (grant no. 654182), RINGO (grant no. 730944) and ENVRI-FAIR (grant no. 824068)).
This paper was edited by Martin De Kauwe and reviewed by Andrew Kowalski and one anonymous referee.