Remote sensing the sea surface CO 2 of the Baltic Sea using the SOMLO methodology

Abstract. Studies of coastal seas in Europe have noted the high variability of the CO2 system. This high variability, generated by the complex mechanisms driving the CO2 fluxes, complicates the accurate estimation of these mechanisms. This is particularly pronounced in the Baltic Sea, where the mechanisms driving the fluxes have not been characterized in as much detail as in the open oceans. In addition, the joint availability of in situ measurements of CO2 and of sea-surface satellite data is limited in the area. In this paper, we used the SOMLO (self-organizing multiple linear output; Sasse et al., 2013) methodology, which combines two existing methods (i.e. self-organizing maps and multiple linear regression) to estimate the ocean surface partial pressure of CO2 (pCO2) in the Baltic Sea from the remotely sensed sea surface temperature, chlorophyll, coloured dissolved organic matter, net primary production, and mixed-layer depth. The outputs of this research have a horizontal resolution of 4 km and cover the 1998–2011 period. These outputs give a monthly map of the Baltic Sea at a very fine spatial resolution. The reconstructed pCO2 values over the validation data set have a correlation of 0.93 with the in situ measurements and a root mean square error of 36 μatm. Removing any of the satellite parameters degraded this reconstructed CO2 flux, so we chose to supply any missing data using statistical imputation. The pCO2 maps produced using this method also provide a confidence level of the reconstruction at each grid point. The results obtained are encouraging given the sparsity of available data, and we expect to be able to produce even more accurate reconstructions in coming years, given the predicted acquisition of new data.


Introduction
The ocean plays an important role in the global carbon budget. It acts as a major carbon sink for anthropogenic carbon dioxide (CO 2 ) emitted to the atmosphere from fossil fuel burning, cement production, biomass burning, deforestation, and various land use changes. The ocean is currently slowing the rate of climate change, having absorbed approximately 30 % of human emissions of CO 2 to the atmosphere since the Industrial Revolution (Stocker et al., 2013). The exchange of CO 2 between coastal environments and the atmosphere is a significant part of the global carbon budget (e.g. Borges et al., 2005;Chen and Borges, 2009;Laruelle et al., 2010). While continental shelves represent only 7 % of the oceanic surface area and less than 0.5 % of the ocean volume, the estimated overall sink of CO 2 in the continental shelf sea is −0.22 PgC yr −1 (Laruelle et al., 2010), corresponding to 16 % of the open oceanic sink (Takahashi et al., 2009). These estimates are subject to great uncertainty related to sparse data coverage in time and space. Monitoring the oceanic partial pressure of CO 2 (pCO 2 ) at monthly and seasonal timescales is essential for estimating the regional and global air-sea CO 2 fluxes and reducing this uncertainty. For technical and budgetary reasons, in situ measurements of marine pCO 2 are sparsely distributed in time and space. However, over the last decade, technical improvements and cooperation with the shipping industry have allowed for the installation of several autonomous monitoring systems aboard commercial vessels routinely crossing the ocean basins. These instruments make quasi-continuous measurements, allowing regional analysis of the highly variable spatial and temporal distributions of pCO 2 (e.g. Lefèvre et al., 2004;Lüger et al., 2004;Corbière et al., 2007;Schneider et al., 2003). The Baltic Sea, a semi-enclosed sea in northern Europe, is relatively well monitored and has been studied for several decades (Meier et al., 2014). Despite the increased number of measurements made in the Baltic Sea, assessing the carbon fluxes in the Baltic Sea remains particularly challenging due to the non-linearity of the emission and absorption system. This non-linearity is complicated by a combination of varying salinity, varying river input of dissolved organic carbon, and large general variability due to the strong seasonal cycle in the region. Using new methodologies could generate additional information from the relatively limited number of existing measurement data. Neural network techniques are empirical statistical tools that somewhat resolve the non-linear and often discontinuous relationships among proxy parameters without any a priori assumptions. In the past decade, several authors have reported the application of a neural network technique to basin-scale pCO 2 sea analysis (Lefèvre et al., 2005;Jamet et al., 2007;Friedrich and Oschlies, 2009;Telszewski et al., 2009;Landschützer et al., 2013;Nakaoka et al., 2013;Schuster et al., 2013), concentrating mainly on the North Atlantic Ocean. Most recently, Telszewski et al. (2009) successfully applied a neural network technique based on a self-organizing map (SOM) to reconstruct the seawater pCO 2 distribution in the North Atlantic (10.5 to 75.5 • N, 9.5 • E to 75.5 • W) for three years (i.e. 2004 to 2006) by examining the non-linear/discontinuous relationship between pCO 2 sea and the ocean parameters of sea surface temperature (SST), mixed-layer depth (MLD), and chlorophyll a concentration (Chl a). In this paper, we applied the SOMLO methodology (self-organizing multiple linear output), which creates an SOM classification of the available explicative oceanic parameters in the Baltic Sea and then calculates multiple linear regression (MLR) parameters for estimating the pCO 2 from the elements belonging to each class separately. The major benefit of this methodology is that it allows the use of a linear model (i.e. the MLR) despite the non-linear relationship between pCO 2 and its explicative parameters, by using the SOM classifier. The SOM classifier can determine the region of the multidimensional data space in which to perform the linear regression. If the classification is fine enough, each region will represent a single type of relationship between the pCO 2 and the explicative data, a type that can be reproduced by an MLR. Due to the temporal and spatial limitations of the in situ pCO 2 data, the satellite data can help estimate the air-sea CO 2 fluxes over the entire Baltic Sea. The satellite data have a higher spatial coverage of the Baltic Sea, allowing estimation of the pCO 2 from in situ data using the SOMLO method. Chierici et al. (2009) demonstrate that, in the North Atlantic Ocean, SST, Chl a, and MLD contributed significantly to the estimation of pCO 2 from a linear relationship. Based on this idea, we applied these parameters to the Baltic Sea, adding two other parameters, i.e. net primary production (NPP) and coloured dissolved organic matter (CDOM), that provide information about the biological activity occurring in summer. From this, we develop pCO 2 algorithms applicable to the Baltic Sea using in situ pCO 2 values; remotely sensed SST, Chl a, and CDOM; modelled MLD and NPP; and time.
The manuscript is structured in four parts. First, we present a synopsis of the problem studied, including existing studies of pCO 2 reconstruction in other maritime regions. Next we present the available data and briefly describe the methodology used. In the third part of the article we present our results, namely, the topological maps obtained and the reconstructions performed with them. We conclude the article by discussing the results obtained and future possible improvements of the method used.

Study area
The Baltic Sea is a semi-enclosed sea with limited exchange with the North Atlantic through the North Sea-Skagerrak system. Previous investigations of the Baltic Proper found large temporal and spatial variability of pCO 2 . The amplitude of the annual pCO 2 cycle varies significantly depending on the region, ranging from 400 µatm in the northeastern Baltic Proper to 120 µatm in the transition areas to the North Sea (Schneider and Kaitala, 2006). The Baltic Sea receives significant river run-off from surrounding land (a total of approximately 15 000 m 3 s −1 (Bergstrom, 1994) and net precipitation of approximately 1500 m 3 s −1 (Omstedt et al., 2004)). This large freshwater addition brings large amounts of nutrients and inorganic and organic carbon to the Baltic Sea basin (Omstedt et al., 2004;Hjalmarsson et al., 2008). The biogeochemical processes in the Baltic Sea marine environment are controlled mainly by the biological production and decomposition of organic matter occurring in the context of the region's hydrography (Siegel and Gerth, 2012). Physical forcing controls the water transport, stratification, temperature, and salinity in the Baltic Sea; these factors then influence the nutrient and carbon distribution, thereby affecting biogeochemical processes. We divide the Baltic Sea into three basins, i.e. the central part (CP), the Gulf of Bothnia (GB), and Gulf of Finland (GF), as shown in Fig. 1. The Baltic Sea has an average depth of 55 m and a maximum depth of 460 m at the Landsort Deep (Wesslander, 2011).

pCO 2 observations
To compile the pCO 2 maps, we use measured data from three sources.
1. The Östergarnsholm site: this site is located next to the small island of Östergarnsholm in the central Baltic Sea and is further described by Rutgersson et al. (2008) and  Norman et al. (2013a). The island is situated 4 km from the east coast of the larger island of Gotland. SST and pCO 2 are measured semi-continuously 4 m below the sea surface using a submersible autonomous moored instrument (SAMI) CO 2 sensor moored at a buoy 1 km southeast of the tower situate on the island. In addition, SST is also measured using a wave rider buoy (operated by the Finnish Meteorological Institute) at 0.5 m depth situated approximately 4 km southeast of the tower.
2. Cargo ship: this data set derives from continuous measurements of the surface water pCO 2 made in the Baltic Sea using a fully automated measurement system deployed on a cargo ship. The Leibniz Institute for Baltic Sea Research, Warnemünde, Germany (IOW -Institut für Ostseeforschung Warnemünde), has made continuous measurements of pCO 2 at 5 m depth aboard the cargo vessel F innpartner. This ship crosses between Lübeck and Helsinki at a 2-day interval, alternately crossing the eastern and western Gotland Sea (Schneider and Kaitala, 2006;Schneider et al., 2009). Data from F innpartner were acquired between July 2003 and December 2005.
3. The Swedish Meteorological and Hydrological Institute (SMHI) database Svenskt Havsarkiv (SHARK): pH (measured using the method of Grasshoff et al., 1999) and total alkalinity (TA) (measured using potentiometric titration as described by Grasshoff et al., 1999) are measured continuously at a monthly or semi-monthly resolution in the Baltic Sea at various stations. All measurements are made at a depth of 5 m depth. The uncertainty of the pH is ±0.03 pH units and of the TA is ±5 % . pCO 2 is estimated from the pH, TA, salinity, and temperature measurements using the standard CO2SYS program (Lewis and Wallace, 1998) with the equilibrium constant from Weiss (1974) and Merbach et al. (1973) as refitted by Dickson and Millero (1987), as in Wesslander et al. (2009). The pCO 2 estimation was compared with the other source of pCO 2 data, all the data were compared in time (less than 1 h) and position (around less than 0.2 • ), the correlation coefficient (R) gives 0.98, and the standard deviation (SD) is 9 µatm. The high standard deviation (SD) is explained by the presence of upwelling with high pCO 2 present near the coast in the SAMI sensor measurement; when we remove the upwelling event, the SD is 1.5 µatm.

Remote sensing data
The satellite data used in this study are from various sources. We use a monthly temporal resolution and a spatial resolution based on the lowest spatial resolution of our data sets, i.e. based on the lower spatial resolution CDOM data set.
We obtain values for five parameters from various sources: and 2011 (R = 0.99 and MD = 0.14). The two SST data sets used between 2007 and 2011 were also compared with the SAMI sensor data at a daily resolution, giving a good correlation for the BSH data (R = 0.95 and MD = 0.06 • C) and the GRHSST data (R = 0.95 and MD = 0.08 • C).
Chl a: this data set consists of monthly averages from the following sensors: Sea-viewing Wide Field-of-view Sensor (SeaWiFS) (September 1998to December 2002 with 4 km spatial and monthly temporal resolutions and Moderate-Resolution Imaging Spectroradiometer (MODIS-Aqua) (July 2002 to June 2011) with 4 km spatial and monthly temporal resolutions (Casey et al., 2010). A log-normal distribution was assumed for the Chl a data. Comparison with SMHI measurement data and in situ data (personal communication from Dr. Tiit Kutser) gives R = 0.67 and MD = 7 mg m −3 . The chlorophyll levels from the satellite data seem to be overestimated compared with the in situ data. Since we use the same data set over the whole study period, this bias was learned during the classification process and and the MLR parameters are calculated considering that any further input will include that bias.
CDOM values come from MODIS-Aqua 4 km monthly average data. The CDOM index quantifies the deviation in the relationship between the CDOM and Chl a concentrations, where 1.0 represents the mean relationship for Morel and Gentili (2009) case 1 waters, and values above or below 1.0 indicate an excess or deficit, respectively, in CDOM relative to the mean relationship. The algorithm and its application are fully described by Morel and Gentili (2009). In situ CDOM data (T. Kutser, personal communication, 2014) give a lower correlation coefficient and a low average difference (R = 0.48, MD = 2.3). As for the chlorophyll, the bias is applied to all years so it does not affect the estimation of pCO 2 .
NPP values come from two data sources. The first data set comes from the Environmental Marine Information System (EMIS): the EMIS model is depth-integrated but allows for depth-dependent variability in the diffuse attenuation coefficient, which is calculated from a multiple-component semi-analytical inversion algorithm (Lee et al., 2005). The primary production calculation is based on the formulation obtained through dimensional analysis by Platt and Sathyendranath (1993). The photosynthetic parameters are assigned by the combined use of a temperature-dependent relationship for the maximum growth rate Eppley (1972) and a variable formulation to retrieve the C : Chl a ratio following the empirical relationship of Cloern et al. (1995). The EMIS data set comprises monthly average values from October 1997 to September 2008. The second data set, for 2009-2011, uses the Vertically Generalized Production Model (VGPM) of Behrenfeld and Boss (2006) as the standard algorithm. VGPM is a "chlorophyllbased" model that estimates net primary production from chlorophyll using a temperature-dependent description of chlorophyll-specific photosynthetic efficiency. For VGPM, net primary production is a function of chlorophyll, available light, and photosynthetic efficiency. VGPM uses MODIS-Aqua chlorophyll and temperature data, SeaWiFS photosynthetically active radiation (PAR) data, and estimates of the euphotic zone depth from a model developed by Morel and Berthon (1989)  In the Baltic Sea, there are many gaps in the satellite data; this is due to the high proportion of coastal waters, where satellite data are less reliable, and to the frequent large-scale cloud coverage. To increase the total number of our data, Biogeosciences, 12, 3369-3384, 2015 www.biogeosciences.net/12/3369/2015/ we used a monthly temporal resolution and a method to improve the spatial distribution of the data. For statistical analysis, the irregularly spaced density of the measurements were first uniformly resampled. To this end, Gaussian grinding was used, as described by Greengard and Lee (2004); Dutt and Rokhlin (1995). The data points of the original series are convolved using a Gaussian kernel. As a result, the data points are smeared over their neighbouring equi-spaced points, which are more densely distributed. This method produces more realistic values than does simple interpolation, particularly when there are many data gaps (Schomberg and Timmer, 1995). There is no discontinuity between the different data sets, but NPP, CDOM, and Chl a data are missing for January and December, so no reconstruction can be performed for these months.

Data available
The positions and values of all the in situ pCO 2 data are shown in Fig. 1. We use the spatial resolution of the parameter with the lowest resolution for the final data set chosen (i.e. CDOM). A monthly temporal resolution is used for this study. Rutgersson et al. (2008) demonstrate that the agreement between SAMI sensor data and the ship data from F innpartner (near the mooring maximum of 23 km) is quite good. This good agreement is confirmed by the comparison between pCO 2 data from the SAMI sensor, the data surrounding the mooring (0.2 • ), and the other data sets. These analyses give a good correlation factor of 0.98. The in situ data are available mainly for the central basin, but the number of data for the Gulf of Bothnia is very low, coming from two SMHI stations. The in situ pCO 2 data are well distributed over the 12 months (Fig. 2). January is the month for which the number of data is lower (i.e. below 80), but the other months have 110-155 data points each. In our case, each in situ data point is characterized by SST, Chl a, CDOM, NPP, and MLD as well as information on the date the measurements were made. This temporal information was normalized by sine and cosine, as follows: where "Days" represents the Julian day. This definition of time is used to render the values continuous over the course of the year, sidestepping the artificial numerical transition from the last day of one year to the first day of the next, to be able to situate the process in relation to its seasonality.
Although time itself is not affecting the pCO 2 , the inclusion of time (time(cosine) and time(sine)) as a parameter is important since, in the database, some situations have similar values for SST, Chl a, CDOM, NPP, and MLD but different pCO 2 values. For example, in May we had a situation (SST = 9.3 • C, Chl a = 0.1 mg m −3 , CDOM = 4.3, NNP = 1.7 mg C m −2 d −1 , and MLD = 10.8 m) whose parameters were very close to those of a situation in November (SST = 9.1 • C, Chl a = 0.1 mg m −3 , CDOM = 4.3, NNP = 1.6 g C m −2 d −1 , and MLD = 10.3 m); however, the pCO 2 values differed greatly, being pCO 2 = 214 and 444 µatm respectively, during the two situations. This dissimilarity in the pCO 2 values informs us that these situations are generated by other drivers than SST, Chl a, CDOM, NPP, and MLD. Since these drivers can be related to seasonal patterns, we included information on the time of the year, as a proxy that allows us to fine-tune our classification. The inclusion of two temporal values instead of only one, even though those are highly correlated, was unavoidable in order to preserve continuity in the values obtained when changing years.
A principal component analysis (PCA) was conducted to highlight the importance of the parameters in the pCO 2 variability (Fig. 3). PCA is a method of analysis which involves finding the linear combination of a set of variables that has maximum variance and removing its effect, repeating this successively (Jolliffe, 2002). The percent of variance explained by each axis of the PCA is shown in Table 1. The results of this PCA indicate that the percentages of variance explained by all axes beyond the first do not differ greatly, with the notable exception of the last two axes, indicating that most parameters can be discriminant in the definition of the SOM states. The need to maintain the totality of these parameters is further demonstrated by the projection of the explicative parameters in the correlation circle, where all values are close to the boundaries of the correlation circle. This informs us that they are all tied to the phenomena explaining 62 % of the total variation of our data. While we could reduce the dimension of the problem by using only the projection of these variables on the first few axes, we chose to maintain all the explicative parameters presented when applying SOMLO to estimate the pCO 2 in order to be able to not lose any information when performing the SOM classifications.  In total, 1445 pCO 2 data are used in this study, after having removed the outliers from our data set. These 170 oultiers were beyond 3 standard deviations away from at least one of the explicative parameters. All parameters (i.e. SST, Chl a, CDOM, NPP, and MLD) are located around each pCO 2 datum. In winter (i.e. October to March), more data are missing ( Table 2, column 1), particularly for Chl a, CDOM, and NPP, winter being the period when it is more difficult to measure or estimate these parameters. Between April and September, the number of missing SST, Chl a, CDOM, and MLD data is relatively low compared with the total number of data (Table 2, column 2), i.e. fewer than 3 % of the total. To increase the number of data available, we completed the data by training the topological map. Further details on this are presented in Sect. 2.5.

Methodology
The relationship between pCO 2 and the environmental parameters is highly non-linear: a slight variation in some of the environmental parameters could correspond to significant variations in pCO 2 . We chose to use the SOMLO methodology, which combines two statistical approaches: SOMs (Kohonen, 1990) and linear regression. SOMs are a subfamily of neural network algorithms used to perform multidimensional classification. A defining characteristic of SOMs is that their  On the left we have the data set used to train the SOM, which discretizes it into classes. Each class contains a referent vector containing, for each parameter, the average value of the elements comprising it as well as an index of the class that informs us of its location on the topological map.
classes can represent a Gaussian distribution centred around the typical profile of environmental parameters, if there is high discretization of the training data set (Dreyfus, 2005). We use this hypothesis to classify the environmental parameter data set and then estimate the parameters of a linear regression for each class. In the following section, we present a brief overview of the two statistical algorithms and their application to our data sets.

Self-organizing maps
Self-organizing topological maps comprise a clustering method based on neural networks. They cluster a learning data set into a reduced number of subsets, called classes, with common statistical characteristics. Generating a SOM requires the creation of a training database that contains homogenous vectors. After a training phase, we obtain a SOM. The term "map" corresponds to a 2-D matrix that stores, for each class, its referent vector, r i , which approximates the mean value of the elements belonging to it, and its index, which positions it in the map matrix used to situate it in relation to the other classes (Fig. 4). SOMs are also called self-organizing topological maps, "topological" indicating that the SOM training algorithm forces a topological ordering of the classes in the matrix, meaning that any two neighbouring classes C i and C j on the map matrix have referent vectors r i and r j that are close in the Euclidean sense in the data space.

Biogeosciences
Let us consider a vector x that is of the same dimensions and nature as the data used to generate the topological map; we can find the index of the class to which it is classified by choosing index = arg max i (||x − r i ||) (where arg max is the argument maximal), therefore assigning it to the class whose referent is closest to it in the Euclidean sense (Fig. 5). A classified vector x will be represented by its class index, C index . If we are trying to classify a vector that has some missing values, the comparison is performed between the existing values of x and the corresponding values of each r i .
As a version of the expectation-maximization algorithm, the SOM algorithm performs an iterative training. During the early phases of this training, the referent vectors of each class are strongly affected by the changes imparted on their neighbours' referent vectors in order to capture the shape of the data cloud. Depending on the training parameters of the SOM, in the latter phases of the training, the effect of the neighbouring vectors on the determination of the referent vector can be considered null. In these cases, each referent vector approximates, locally, the mean value of the multidimensional Gaussian random distribution that generated the training data assigned to that class (Dreyfus, 2005).

Multiple linear regression
MLR is a modelling method that expresses the value of one response variable, V (in our study pCO 2 ), as a linear function of other explicative variables, i.e. X = X 1 , X 2 , . . ., X i (in our study SST, Chl a, CDOM, NPP, MLD, time sin , and time cos ). An MLR is generally performed either to interpret the relationship between the variable y and each of the other predictive variables X i or else to predict, from a data set of vectors containing the values of X, the corresponding value of y. In this paper, we used both aspects of MLR.
However, to perform MLRs in the present case, we had to take into account their limitations and the nature of the problem. Specifically, to perform an MLR we are obliged to assume that the relationship between the predictor variables and the response variable is linear. However, this is not the case in our data sets: pCO 2 is not linearly related to the variables presented when considering the entirety of the problem. However, as noted above in Sect. 2.5.1, if we consider the classes created by the SOM, they are very localized regions of the combined explicative and response data space that can be considered to approximate, locally, the mean value of a multidimensional Gaussian random distribution. We therefore assume that, if performed in the reduced neighbourhood of a SOM class, the relationships between pCO 2 and the explicative variables are linear.

Statistical imputation
As described in Sect. 2.4, both the satellite and measured data available for the application present missing values. To complete these data sets, we chose to use imputation methods similar to those described by Schafer and Graham (2002) and Malek et al. (2008). The main idea of these methods is to use the classifying abilities of the SOMs to regroup the data in typical situations and replace the missing explicative data values with the corresponding values of the referent vector of the class to which it belongs.
We first selected the database containing SST, Chl a, CDOM, NPP, MLD, time sine , and time cosine . The vectors were sorted according to the number of values missing from each vector and noting the locations of these missing values. We chose all complete data vectors and the first 5 % of the sorted vectors containing missing data, and we trained a SOM. We proceeded by replacing the missing values of these first 5 % of vectors with the corresponding values of the referent vector of the class to which they each belong. We then included the next 5 % of vectors with missing data in a new training data set and created a new SOM. Based on this new SOM, we again filled the new missing values with the corresponding values of the referent vector of the class of the new SOM to which they each belong. In addition, we deleted the www.biogeosciences.net/12/3369/2015/ Biogeosciences, 12, 3369-3384, 2015 Figure 6. A schematic of the imputation method used. We initially sort the data depending on the amount of missing values present, then progressively train SOMs on the data set, steadily including more vectors for the training and completing and updating the missing data during each iteration.
values we added to the first 5 % of vectors and replaced them with the values of the referent vector of the class of the new SOM to which they each belong. The training parameters of this method, such as the number of classes of the topological map and the number of iterations, were selected by parameter tuning. We then continued iterating this process, updating the previously filled missing values with the values of their corresponding referents belonging to the most recently trained SOM, until all missing values were filled. The updating of the previously filled values allows the method to progressively incorporate information and to rectify the previous errors. A schematic of the imputation method used can be seen in Fig. 6. After this imputation of the missing data through iterative training, the reconstructed data represent the original data well. Figures 7 and 8 show data for six variables before and after the reconstruction respectively. The main difference is observed for the values of Chl a, where the peak of over 200 individuals occurs at 0 mg m −3 because, at the initialization of the imputation process, we decided to replace all null values. The repartitioning of pCO 2 (Fig. 8a) is very representative of the data variability with a large range of values. Some very high values occur during local events, such as coastal upwelling. Most of the data range between 180 µatm (value observed in summer) and 550 µatm (observed in winter). The SST (Fig. 8b) is very representative of the variability in the Baltic Sea, with a maximum occurring between July and September in all basins around 18 • C (Siegel and Gerth, 2012). The NPP variability is fairly homogenous, except for the peak at 10 mg C m −2 d −1 . This peak occurs because the first model providing us the NPP values has a set maximum of NPP at 10 mg C m −2 d −1 ; therefore, the correction of the NPP satellite data takes this maximum into account.  The variability of chlorophyll results in one subset of the data with a low value and another subset with a value higher than 6 mg m −3 , which can be explained by the fact that the Baltic Sea is a narrow sea, with important coastal regions, and that two blooms take place, in spring and in summer. The chlorophyll value can be very high during these periods, and the reconstruction gives a mean value for this characteristic. A peak at 10 mg m −3 is observed in the chlorophyll data, not due to the reconstruction but to the maximum value in the satellite data file. The low MLD occurs in summer, and in the model the minimum is 10 m deep, which appears to be around the minimum value observed in Fig. 8f

Topological map
We classified the explicative variables (i.e. SST, Chl a, CDOM, NPP, MLD, time sin , and time cos ) into classes that share similar characteristics. In order to optimize the SOM map size for the method and to calculate the method's performances, we randomly sampled 90 % of our completed data set (1300 vectors) to be used for the training phase, keeping 10 % (144 vectors) to be used for computing the performances of the method. We iterated this process, selecting many different random samplings for each map size, and selected the map size with the best average reconstruction when applying the SOMLO methodology. At the end of our optimization, we selected a SOM consisting of 77 classes. The number of observations captured by each class ranges from 0 to 38 (Fig. 9). The order of magnitude of the number of observations is constant throughout the SOM, and we can regard the classes as having spread in multidimensional space in order to accurately represent the data space of the explanatory parameters. The presence of classes that did not capture any elements can be justified as preventive: they preserve the topological aspect of the SOM by preventing classes that are not similar enough from becoming neighbours.
To estimate the average concentration of pCO 2 in each class, the measurements of pCO 2 associated with vectors consisting of SST, CDOM, NPP, MLD, and CHL a components were presented to the already-trained SOM as input data (Fig. 10). The average value computed for the vectors belonging to each class corresponds to the average value of pCO 2 for that class.
In the final map, the distribution of pCO 2 is strongly dependent on the SST distribution, with low values of pCO 2 correlating with high values of SST (Fig. 10). This is in agreement with the seasonal pCO 2 cycle, which is characterized by a large amplitude, ranging from a high value in winter (≈ 500 µatm) to a low value in summer (≈ 150 µatm), described, for example, by Wesslander (2011). According to Schneider and Kaitala (2006), the high winter value of pCO 2 is a consequence of mixing with a deeper water layer enriched in CO 2 , which is in agreement with the distribution of the MLD (Fig. 10h), with the higher value in winter and autumn correlating with the high value of pCO 2 . High values can also be explained by the mineralization, which exceeds production in winter (Wesslander, 2011). Biological production starts in spring when sunlight and nutrients are sufficient. The chlorophyll begins to increase in March-April due to the spring phytoplankton bloom, which reduces the pCO 2 level during this period. The more intensive decrease occurs in April and May, which is consistent with the higher value of NPP (Fig. 10). Studies in the central Baltic Sea identify two summer minima, the first in April-May and the second in July-August, resulting from a second production period.  . For (a) and (b), the numbers inside the neurons correspond to the number of data that construct the characteristics of the neuron. Higher variability is observed during this period, with a standard deviation between 39 and 50 µatm for different regions (Wesslander, 2011;Schneider and Kaitala, 2006).

Linear regression in the neurons
To perform an MLR, we must assume that the relationship between the predictor variables and the response variable is linear. We could take this to be a valid hypothesis only when performing the MLR in the reduced neighbourhood of a SOM class, where the relationship between pCO 2 and the explicative variables can be assumed to be linear.
For each class j a separate training data set was created containing all the vectors assigned to that class and to all its adjacent classes. Based on that data set, we computed the linear regression coefficient parameters for every explicative parameter and for a constant value. The calculated linear regression coefficient parameter values for each class are shown in Fig. 11. Note that all parameters are important in specific regions of the SOM, having both positive and negative correlations in different classes.
More importantly, the fact that each parameter has a significantly varying range of values over the different classes demonstrates that each parameter is important in reconstructing the pCO 2 in the Baltic Sea, even though a parameter may be highly significant in some classes and relatively stable in other regions of the topological map.
The addition of vectors belonging to adjacent classes did not generally perturb the estimation of the coefficient parameters because, as seen in Fig. 10, the values of all parameters are generally organized coherently on the map. The assumption that they are close in the data space is not as robust as it would have been had we solely considered the vectors belonging to each class, but, given the limited number of data available for modelling this highly non-linear and complex system, we would not have sufficient elements to correctly estimate the linear regression coefficients. Given the projected increase in available data in coming years, further applications of this approach will limit themselves to the elements belonging to each class.

Validation of reconstruction
To validate our results, we calculated the difference and SD between the value of pCO 2 reconstructed in each neuron and the observation defining that neuron (Fig. 9). On average, the SD is approximately 38 µatm and the difference observed is 25-30 µatm. By cross-validating a data set (divided by 10 sequences), we obtained a mean SD of 48 µatm with a variation of 32-57 µatm and R equal to 0.9 with a variation of 0.86-0.96. Nevertheless some points indicating higher values can be identified (shown in red in Fig. 9). These values are explained by the positions of these points, which are at the edges of the cloud and therefore more likely to include  outliers that disturb the estimation of the MLR coefficients. For the reconstruction of pCO 2 , with this identifiable point, it is quite easy to organize a flag system. The flag can give information about the quality of the reconstructions based on the root mean square errors (RMSEs) of the neurons used for the reconstruction. The difference obtained for pCO 2 in each neuron ranges from 0 to 56 µatm (Table 3), but 58 % of the values observed are under 30 µatm. The difference can be quite high for a parameter such as SST, with a maximum value of 1.9 • C, but most of the values are lower than 1 • C and CDOM ranges from 0 to 5.15 (Table 3). The other parameters have quite low variability, such as MLD, which ranges from 0 to 9.7 m. The average is 2 to 3 times lower than the maximum value observed, which gives a low value for all the satellite parameters.
The pCO 2 validation data set gives a quite good correlation (R = 0.93) with the results of the reconstruction method (Fig. 12), the root mean square (rms) being 36.7 µatm: 12 % of the validation data have a value higher than 20 µatm, and 45 % have a value between 20 and 30 µatm (Fig. 13). The characteristics of time, SST, MLD, CDOM, Chl a, and NPP do not explain the difference observed in the reconstruction. A reconstruction has been done using the satellite data from 1998 to 2011. The seasonal cycle of pCO 2 is well reproduced and in agreement with the results of other studies. The maximum is observed in winter, with a pCO 2 of 437 µatm on average, while the level is 274 µatm in summer. These values are comparable to the averages estimated in the central Baltic Sea of 500 µatm in summer and 150 µatm in winter (Wesslander, 2011). The pCO 2 decreases in April due to the biological activity and increases slowly in September (Fig. 14).
We also evaluate these results by comparing them with modelling results. The model output used in the present study is from a process-oriented biogeochemical ocean model in which the Baltic Sea is divided into 13 natural sub-basins (e.g. Omstedt et al., 2009;Norman et al., 2013b). The properties of each sub-basin are horizontally averaged and vertically resolved, and the various sub-basins are horizontally coupled to each other using strait flow models. The model is forced by meteorological gridded data with a 3 h temporal resolution and by river run-off and net precipitation data with a monthly resolution (Omstedt et al., 2005). To compare the output of the model with our results, we couple the 13 basins in optic to reduce the number to three basins, corresponding to the three basins defined in Fig. 1. The modelled and estimated pCO 2 are compared for the entire Baltic Sea and for the three basins from 1998 to 2009 (Fig. 14). The seasonal cycle for the entire Baltic Sea is well reproduced with a quite good correlation (R = 0.7) between the modelled and estimated pCO 2 values (Table 4), whose standard deviations differ by 74 µatm. The modelled and estimated pCO 2 values for the gulfs of Finland and Bothnia are not as correlated (R = 0.6), while the order of magnitude of the variability of pCO 2 in the Gulf of Finland as calculated with SOMLO is closer to the model estimate (122 µatm for the modelled and 142 µatm for the estimated pCO 2 ). This lower correlation could be due to the lower number of data in this region available for these basins. The central basin is well reproduced, but the amplitude of the seasonal pCO 2 cycle is lower in the simulation. In the southwest part of the Baltic Sea, SOMLO underestimates the pCO 2 concentration by 60 µatm compared with the model. In the eastern and western parts of the basin, SOMLO produces good estimates of pCO 2 compared with the model, with an average difference of 20 µatm. In Omstedt et al. (2009), the simulated pCO 2 agrees quite well with the calculated values based on observations in the eastern Gotland Basin. A simple flag was constructed to monitor the reconstruction quality and give an idea of the confidence in the estimated pCO 2 . The difference between the estimated and neural values is computed. The flag equals 1 for classes in which the average difference is less than 20 µatm, equals 2 for an average difference of 20-30 µatm, and equals 3 for higher average differences. In the example shown here, the flag values are high (i.e. 3), so the confidence in the reconstruction is low, but some points have flag values of 1 or 2 (Fig. 15d, e, and f) so the reconstruction is more reliable. On the geographic map (Fig. 15d, e, and f), the values of 4 correspond to the presence of ice, which is estimated using the satellite data of the National Snow and Ice Data Center based on NOAA level 3 data Njoku (2007). The flag gives confidence in our reconstruction; for example, in March 2010 (Fig. 15a), the southern portion of the map (i.e. the Bornholm and Arkona basins) shows lower pCO 2 values than does the northern portion and than in February (not show here). In March 2010, this region corresponds to a flag value of 2, which was ascribed medium confidence. In July 2010, the flag value is quite good and the variability of pCO 2 seems to be in line with the monthly variability ( Fig. 15b and e). In September 2010, the value of pCO 2 has a good order of magnitude when the flag is 2 but seems slightly too high when there is poor confidence (i.e. a flag value of 3) ( Fig. 15c and f).
In conclusion, the reconstruction of pCO 2 needs to be improved to increase the confidence in the reconstruction data, particularly in the gulfs.

Sensitivity analysis
In order to estimate the sensitivity of the reconstruction of the pCO 2 to noisy data, white noise was added on all incomplete data before the reconstruction. We performed three tests by adding white noise to each parameter, which was related to the SD of that parameter. The tested configurations were 1 × sigma, 2 × sigma, and 0.5 × sigma.   The results as seen in Table 5 imply that the method is sensitive to noisy data, since when increasing the noise we progressively get worse reconstructions. It is important to note that the values obtained when using 1 × sigma give an average rms of 77 and an average R coefficient of 0.80, which indicate that the method does not degrade too much.
Another area where errors might appear is the completion of missing data. In the data set there were vectors with missing parameters. We kept only those missing up to three parameters. The difference observed between the pCO 2 measured and reconstructed in function of month and of number of missing parameters can be seen in Fig. 16. We chose to present the data this way to also give a better understanding of the seasonal variation of the reconstruction. As seen in Table 6 the imputation method does introduce errors. When reconstructing the pCO 2 with data that have no missing data, we obtain a correlation of 0.96 and an rms of 25.7 µatm; when reconstructing the pCO 2 with data missing three parameters, the results are less reliable, with a correlation of 0.81 and an rms of 51.4 µatm. The rms we obtain in the case of the reconstruction using data that contain one missing value is higher than the one obtained when using data containing two missing values, while retaining a higher cor-  Figure 16. Absolute difference between pCO 2 reconstructed and pCO 2 measured in function of the month. The colour bar correspond to the number of empty parameter before the reconstruction (blue: 0 missing data; light blue: 1 missing datum; yellow: 2 missing data; red: 3 missing data). relation. This could be due to the higher number of vectors containing only one missing value.

Discussion and conclusions
In this paper, we used the SOMLO methodology to reconstruct the pCO 2 from satellite data for the Baltic Sea. SOMLO was used to accommodate the non-linearity of the mechanics driving the pCO 2 . It uses artificial neural networks to classify data into situations and then performs a reconstruction by using an MLR in each class. The process involves classifying the explicative parameters (i.e. SST, CDOM, Chl a, time, NPP, and MLD) and then using the linear regression coefficients corresponding to that class in order to reconstruct the pCO 2 . The satellite data used were also completed using an iterative process of SOM training. We also performed a statistical analysis of the reconstructions obtained, which allowed us to add a flag to each class, informing us of the quality of the reconstruction obtained. This could both influence further numerical modelling of other phenomena depending on the pCO 2 and allow for an informed interpretation of the reconstructions obtained.
The current results obtained using this method, based on 1445 vectors, gave a high correlation coefficient of 0.93 % and an rms of 36 µatm. These results are promising given the conditions under which we obtained them since, in addition to having a limited number of in situ pCO 2 measurements, the co-localized satellite data were frequently incomplete.
In comparison, existing studies were performed over the North Atlantic and North Pacific, based on a minimum of 10 000 data points (which take into account all the data from SOCAT) to a maximum of 800 000 data points (e.g. Friedrich and Oschlies, 2009;Hales et al., 2012;Landschützer et al., 2013). Friedrich and Oschlies (2009) obtained an RMSE of 19 µatm. A similar study over the totality of the Atlantic Ocean obtained an RMSE of 17 µatm for independent time series . Hales et al. (2012) obtained an RMSE of 20 µatm with a correlation coefficient of 0.81. The RMSE we obtained here was higher than that obtained in a previous study of the Atlantic Ocean but, taking into account the much smaller number of data available, and a stronger spatial pCO 2 variability might in the Baltic Sea compared to the open ocean. the results presented are promising.
The organization of the values of various MLR coefficients over each class indicated that all the satellite data parameters are important for reconstructing the pCO 2 in the Baltic Sea, even if only in certain cases. Improved satellite data availability could therefore also improve the performance of our reconstruction.
This study could be further developed so as to reconstruct the spatial fields of pCO 2 . Specifically, one could imagine a Bayesian approach that would select which class to use for the MLR by also taking into account the potential classes attributed to the neighbouring grid points of a geographic study area. This, however, remains dependent on the acquisition of additional in situ measurements to allow for the robust estimation of such Bayesian probabilities.
Many programs exist for the acquisition of new data. Data from the Östergarnsholm site are still being acquired; 2012 did not yield much data from this site, and the data from 2013 and 2014 still need to be validated. In time, the SMHI station could also supply additional data. The cargo ship transect data are not yet available for 2012-2014, but these measurements will continue, and some data will soon be available. Data are also being gathered from ferries sailing the Gothenburg-Kemi-Oulu-Lübeck-Gothenburg route. This Gothenburg transect is weekly (see http://www.hzg.de/imperia/md/content/ ferryboxusergroup/presentations/fb-ws2011_karlson.pdf). The first tests of these data were conducted in 2010 and 2011, so some data should soon be available. In addition, new measurements of pCO 2 began in 2012 at the Utö Atmospheric and Marine Research Station (see http://en.ilmatieteenlaitos.fi/GHG-measurement-sites#Uto).
Given the amount of new data soon to be available, we remain optimistic that comprehension and statistical modelling of pCO 2 in the Baltic Sea will continue to improve in coming years.