An evaluation of ocean color model estimates of marine primary productivity in coastal and pelagic regions across the globe

Nearly half of the earth's photosynthetically fixed carbon derives from the oceans. To determine global and region-specific rates, we rely on models that estimate marine net primary productivity (NPP); thus, it is essential that these models be evaluated to determine their accuracy. Here we assessed the skill of 21 ocean color models by comparing their estimates of depth-integrated NPP to 1156 in situ 14C measurements encompassing ten marine regions including the Sargasso Sea, pelagic North Atlantic, coastal Northeast


Introduction
Large-scale estimates of marine net primary productivity (NPP) are an essential component of global carbon budget analyses, as nearly half of the earth's photosynthetically fixed carbon derives from the global ocean. Understanding the rate of marine fixed-carbon production can only be accomplished using models, due to the spatial and temporal limitations of in situ measurements. It is therefore critical that these models be carefully evaluated by comparing their estimates of NPP to in situ measurements collected from various regions across the globe, both to better understand which types of systems are most challenging to model and to better constrain the model uncertainties.
The most commonly applied NPP models are ocean color models, which use input data derived from satellites (e.g. surface chlorophyll-a concentration and sea surface temperature) to estimate NPP over large areas. Ocean color models have been used to assess contemporary trends in global NPP, relationships between sea-ice variability and NPP in the Southern Ocean (Arrigo et al., 2008), bottom-up forcing on leatherback turtles (Saba et al., 2008), and fisheries management (Zainuddin et al., 2006).
Ocean color models vary in both their type (carbon-based versus chlorophyll-based) and complexity (depth- and wavelength-integrated versus resolved); thus a context is required in which these models can be evaluated. The Primary Productivity Algorithm Round Robin (PPARR) provides this framework such that the skill and sensitivities of ocean color models can be assessed in multiple types of comparisons. Early PPARR studies compared a small number of model estimates to in situ NPP data at 89 stations from various marine ecosystems (Campbell et al., 2002). Global fields of NPP estimated by 31 satellite-based ocean color models and coupled biogeochemical ocean general circulation models were contrasted to understand why and where models diverge in their estimates (Carr et al., 2006). A study comparing NPP estimates of 30 models to in situ data from nearly 1000 stations over 13 years in the tropical Pacific Ocean revealed that ocean color models did not capture a broad-scale shift from low biomass-normalized productivity in the 1980s to higher biomass-normalized productivity in the 1990s. Most recently, 36 models were evaluated to assess their ability to estimate multidecadal trends in NPP at two time-series stations in the northern subtropical gyres of the Atlantic and Pacific Oceans (Saba et al., 2010). A multiregional PPARR analysis comparing output from multiple models to in situ NPP across various regions has not been conducted since the study by Campbell et al. (2002), and a larger sample size of in situ measurements would strengthen the assessment of model skill and provide insights into the relationships between region type, quality of the input variables, quality of the NPP measurement, and model error.
Here we assess the skill of 21 ocean color models ranging from simple integrated models to complex resolved models. This is accomplished by comparing model output to 1156 in situ NPP measurements that encompass ten different marine regions. We first assess both average and individual model skill on a region-specific basis using the root-mean-square difference, which measures a model's ability to estimate the observed mean and variability of NPP. Next, we determine how ocean color model skill is affected by uncertainties in both the input variables and the in situ measurements of NPP. This is followed by a correlation analysis to determine which station parameters (i.e. depth, latitude, surface chlorophyll-a) have the largest influence on model-data misfit. Finally, we assess model skill regardless of region and highlight the water characteristics that are most challenging to the models.

Data
We collected data from various projects (Table S1) that incorporated ship-based measurements of NPP profiles covering ten regions (Fig. 1) and spanning multiple time periods between 1984 and 2007 (Table 1). Although each dataset included NPP, the overarching goals and purposes of these field studies were diverse, and their sampling designs were not optimized to assess ocean color models. However, in situ measurements of marine NPP are not common; we therefore had to use a diverse group of datasets. All 1156 NPP measurements were based on the 14C tracer method; incubation times and type (in situ or on-deck) were dependent upon time of year and region, respectively (Table 1). Each station's NPP profile was measured to the 1% light level at various depth intervals. We extracted each station's NPP datum at every depth of measurement and used trapezoid integration to compute daily NPP (mg C m−2 day−1) to the greatest isolume measured (1% light level). Because 24-h incubations are more accurate measurements of NPP (Campbell et al., 2002), we adjusted NPP measurements that were based on shorter incubation times.
Ocean color models require specific input data to estimate NPP; the suite of input data is dependent upon model type, although all ocean color models require surface measurements of chlorophyll-a (Chl-a) (Table 2). For each station, we used in situ surface fluorometric Chl-a and in situ sea surface temperature (SST) from the programs listed in Table 1 (surface = 0-5 m). Reanalysis estimates of shortwave radiation were obtained from the National Centers for Environmental Prediction (http://www.cdc.noaa.gov) and transformed to photosynthetically active radiation (PAR; Einsteins m−2 d−1) using a conversion factor of 0.43 (Olofsson et al., 2007).
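As a concrete sketch of the two processing steps just described (trapezoid integration of a volumetric NPP profile and the 0.43 shortwave-to-PAR factor), the following Python fragment illustrates the arithmetic. This is not the authors' code: the function names are ours, and any further unit conversion for PAR is deliberately omitted.

```python
import numpy as np

def integrate_npp_profile(depths_m, npp_mg_c_m3_d):
    """Trapezoid-integrate a volumetric NPP profile (mg C m^-3 day^-1)
    over depth, yielding daily depth-integrated NPP (mg C m^-2 day^-1)
    down to the deepest measured isolume (1% light level)."""
    z = np.asarray(depths_m, dtype=float)
    p = np.asarray(npp_mg_c_m3_d, dtype=float)
    order = np.argsort(z)                  # ensure surface-to-depth order
    z, p = z[order], p[order]
    # trapezoid rule: mean rate within each layer times layer thickness
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(z)))

def shortwave_to_par(shortwave):
    """Apply the 0.43 shortwave-to-PAR conversion factor used in the
    text (Olofsson et al., 2007); conversion of the result to
    Einsteins m^-2 day^-1 is not shown here."""
    return 0.43 * shortwave
```

For example, a uniform profile of 10 mg C m−3 day−1 over the top 20 m integrates to 200 mg C m−2 day−1.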
Mixed-layer depths (MLD) were derived either from in situ measurements using the surface offset method (σ = 0.125 kg m−3) (Levitus, 1982) or from model results (WAP: Dinniman and Klinck, 2004; BATS and NEA: Doney, 1996; Doney et al., 2007; Black Sea: Kara et al., 2005; Mediterranean Sea: D'Ortenzio et al., 2005).
Depth data for each station were extracted from the British Oceanographic Data Centre (http://www.bodc.ac.uk/data/online_delivery/gebco) using the one arc-minute grid from the gridded bathymetric data sets. The complete dataset used for this analysis can be found in the online Supplement (Supplement file S1).

Models
Output from a total of 21 satellite-based ocean color models was contributed to this study (Table 2). Model complexity ranged from relatively simple depth-integrated and/or wavelength-integrated models to more complex depth-resolved and wavelength-resolved models. Specific details for each of the 21 models are given in Appendix A of the Supplement. Model participants were provided with input fields (Chl-a, SST, PAR, MLD, latitude/longitude, date, and day length) and returned estimates of NPP integrated to the base of the euphotic zone (1% light level). Although skill comparison results for the carbon-based models (Behrenfeld et al., 2005; Westberry et al., 2008) appear in Saba et al. (2010), these approaches are not included in the analyses presented here. One of the primary inputs for the carbon-based models is particulate backscattering, which is not included in the dataset described in Sect. 2.1 and whose absence severely handicaps these models for the purposes of this type of evaluation. Satellite surrogates for particulate backscatter are available for some of the data assembled here, but not for the subset of data collected before the modern ocean color satellite era (prior to 1997).

Model performance
To assess overall model performance in terms of both bias and variability with a single statistic, we used the root-mean-square difference (RMSD) calculated for each model's N estimates of NPP:

\mathrm{RMSD} = \left[ \frac{1}{N} \sum_{i=1}^{N} \Delta(i)^{2} \right]^{1/2}    (1)

where the model-data misfit in log10 space, Δ(i), is defined as:

\Delta(i) = \log_{10}\left(\mathrm{NPP}_{m}(i)\right) - \log_{10}\left(\mathrm{NPP}_{d}(i)\right)

and where NPP_m(i) is modeled NPP and NPP_d(i) is the in situ datum for each sample i. The RMSD statistic assesses model skill such that models with lower values have higher skill. The use of log-normalized RMSD to assess overall model performance is consistent with prior PPARR studies (Campbell et al., 2002; Carr et al., 2006; Friedrichs et al., 2009; Saba et al., 2010). To assess model skill more specifically (whether a model over- or underestimated NPP), we calculated each model's bias (B) for each region, where:

B = \frac{1}{N} \sum_{i=1}^{N} \Delta(i)

We determined whether certain model types or individual models had significantly higher skill than others (based on RMSD) by applying an ANOVA with a 95% confidence interval.
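In code, the two statistics above reduce to a few lines. This is an illustrative sketch (not the authors' implementation; function names are ours), with NPP values in mg C m−2 day−1:

```python
import numpy as np

def misfit(npp_model, npp_data):
    # Model-data misfit in log10 space: Delta(i) = log10(NPPm(i)) - log10(NPPd(i))
    return (np.log10(np.asarray(npp_model, dtype=float))
            - np.log10(np.asarray(npp_data, dtype=float)))

def rmsd(npp_model, npp_data):
    # Root-mean-square difference; lower RMSD indicates higher skill
    return float(np.sqrt(np.mean(misfit(npp_model, npp_data) ** 2)))

def bias(npp_model, npp_data):
    # Mean misfit B; positive = overestimate, negative = underestimate
    return float(np.mean(misfit(npp_model, npp_data)))
```

For instance, a model that is uniformly ten times the observations has Δ(i) = 1 everywhere, hence RMSD = 1 and B = 1.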
Model performance was also illustrated using Target diagrams. These diagrams decompose the RMSD such that:

\mathrm{RMSD}^{2} = B^{2} + \mathrm{uRMSD}^{2}

where the unbiased RMSD squared (uRMSD^2) is defined as:

\mathrm{uRMSD}^{2} = \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \log \mathrm{NPP}_{m}(i) - \overline{\log \mathrm{NPP}_{m}} \right) - \left( \log \mathrm{NPP}_{d}(i) - \overline{\log \mathrm{NPP}_{d}} \right) \right]^{2}

Target diagrams show multiple statistics on a single plot: bias on the y-axis and the signed unbiased RMSD (uRMSD) on the x-axis, where:

\mathrm{uRMSD}^{*} = \mathrm{sign}(\sigma_{m} - \sigma_{d})\,\mathrm{uRMSD}

and σ_m = standard deviation of log NPP_m and σ_d = standard deviation of log NPP_d. The Target diagram thus enables one to easily visualize whether a model over- or underestimates the mean and variability of NPP. By normalizing the bias and uRMSD by σ_d and plotting a circle with radius equal to one, the Target diagrams also illustrate whether models perform better than the mean of the observations. Models that perform better than the mean of the observations are defined to have a Model Efficiency (ME) greater than zero (Stow et al., 2009):

\mathrm{ME} = 1 - \frac{\sum_{i=1}^{N} \left( \log \mathrm{NPP}_{d}(i) - \log \mathrm{NPP}_{m}(i) \right)^{2}}{\sum_{i=1}^{N} \left( \log \mathrm{NPP}_{d}(i) - \overline{\log \mathrm{NPP}_{d}} \right)^{2}}

Symbols for models with ME > 0 fall inside the reference circle on the Target diagrams. A model with ME < 0 is typically of limited use, because the data mean provides a better fit to the observations than the model predictions. In the NPP comparisons presented here, models produce the lowest RMSD for the regional data sets characterized by the least variability, yet at the same time these models can have ME < 0: when data sets have low variability, it is difficult for models to do better than the mean of the observations. To be consistent with previous PPARR results, we typically equate higher model skill with lower RMSD, but we also discuss ME as a secondary indicator of skill. Finally, to determine the effect of various station parameters on the NPP model estimates, the Pearson correlation coefficient was calculated between model-data misfit (Δ(i)) and each of the following parameters: Chl-a, SST, PAR, MLD, NPP, absolute latitude (i.e. distance from the equator in degrees), and depth.
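A minimal sketch of the Target-diagram statistics follows; the function name is ours, and population standard deviations are assumed, consistent with the mean-based definitions above:

```python
import numpy as np

def target_stats(npp_model, npp_data):
    """Normalized bias, signed normalized uRMSD, and model efficiency
    (ME) for one model at one region, computed in log10 space."""
    m = np.log10(np.asarray(npp_model, dtype=float))
    d = np.log10(np.asarray(npp_data, dtype=float))
    b = m.mean() - d.mean()                                     # bias B
    urmsd = np.sqrt(np.mean(((m - m.mean()) - (d - d.mean())) ** 2))
    # uRMSD is signed: positive when modeled variability exceeds observed
    signed_urmsd = urmsd if m.std() >= d.std() else -urmsd
    sigma_d = d.std()
    # ME = 1 - RMSD^2 / sigma_d^2 (Stow et al., 2009), using the exact
    # decomposition RMSD^2 = B^2 + uRMSD^2
    me = 1.0 - (b ** 2 + urmsd ** 2) / sigma_d ** 2
    return b / sigma_d, signed_urmsd / sigma_d, me
```

Because the normalized bias and uRMSD satisfy (B/σ_d)^2 + (uRMSD/σ_d)^2 = 1 − ME, a plotted point lies inside the unit circle exactly when ME > 0, matching the description above.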

Uncertainty analysis
When comparing ocean color model estimates of NPP, it is important to consider uncertainty in the input fields and in the NPP data, both of which can affect the assessment of a model's ability to accurately estimate NPP. For each measurement of NPP, we assumed an uncertainty in the measurement such that values less than or equal to 50 mg C m−2 day−1 were subject to a ±50% error, while values greater than or equal to 2000 mg C m−2 day−1 were subject to a ±20% error (W. Smith, unpublished data). Error in values between 50 and 2000 mg C m−2 day−1 therefore ranged from 50% to 20%, respectively, and was calculated using a linear function of log(NPP).

Table 3. Uncertainties in each input variable (Chl-a, SST, PAR, MLD) at each region, based on differences between satellite, modeled, and in situ data sources. Ocean color models were provided with 81 perturbations of input data for each NPP measurement based on these region-specific uncertainties.

Ocean color models use satellite-derived input data; it is therefore important to understand how their estimates of NPP can be affected by error in these data. For that purpose, we compared each station's in situ Chl-a and modeled PAR to 8-day, Level-3 SeaWiFS 9-km data from the NASA Ocean Color Website (http://oceancolor.gsfc.nasa.gov). SeaWiFS measurements of Chl-a and PAR were averaged over the 3×3 grid-point window (27 km × 27 km) that encompassed each NPP measurement location. This was done for each 8-day SeaWiFS image that contained the respective date of each measurement. Comparing in situ Chl-a to 8-day Level-3 SeaWiFS data was not ideal, but it was a compromise solution that yielded a maximum uncertainty estimate for each region. Moreover, because ocean color models typically use Level-3, monthly or 8-day, satellite-derived Chl-a, this comparison also gave us an idea of the sensitivity of RMSD to in situ versus satellite-derived inputs. For SST, we used an error of ±1 °C, which was found to be a conservative estimate of the difference between in situ and satellite-derived data and thus represented a maximum possible uncertainty. We compared MLD to the Thermal Ocean Prediction Model (TOPS) (http://www.science.oregonstate.edu/ocean.productivity/mld.html), which is based on the Navy Coupled Data Assimilation. We extracted 8-day TOPS MLD data for each station using the same method as for SeaWiFS Chl-a and PAR. There are no SeaWiFS or TOPS data prior to September of 1997; we therefore used only NPP measurements collected since 1997 to calculate uncertainty. Uncertainties in each input variable were calculated for each region (Table 3).

Table 4. Mean RMSD (model skill), depth, in situ NPP, and input data (± standard deviation) for each of the ten regions.
Each of the four input variables can have three possible values for each NPP measurement (original value, original value + uncertainty, original value -uncertainty). Similarly, each NPP measurement could also have three values (the original value and the observed ± uncertainty). Therefore, for each NPP measurement (N = 1156) there are 81 perturbations of input data and three possible values of NPP. Model participants were provided with 1156 × 81 perturbations of input data and the uncertainty analysis was conducted as follows: for each NPP measurement, we examined the 81 perturbations and selected the perturbation that produced the lowest RMSD using (a) uncertainty in individual input variables, (b) uncertainty in all input variables, (c) uncertainty in observed NPP, and (d) uncertainty in all input variables and in observed NPP.
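The perturbation bookkeeping described above (three possible values per input variable, hence 3^4 = 81 combinations per station) can be sketched as follows. Variable names and the fractional-uncertainty interface are ours; the example uncertainties are illustrative only (in the study, SST uncertainty was an absolute ±1 °C):

```python
from itertools import product

def perturbations(chl, sst, par, mld, err):
    """Enumerate the 81 (3^4) input perturbations for one NPP station:
    each variable takes its original value, value + uncertainty, or
    value - uncertainty. `err` maps variable name to its fractional
    regional uncertainty (cf. Table 3)."""
    base = {"chl": chl, "sst": sst, "par": par, "mld": mld}
    out = []
    for signs in product((-1, 0, 1), repeat=4):
        out.append({k: v * (1 + s * err[k])
                    for (k, v), s in zip(base.items(), signs)})
    return out

cases = perturbations(0.9, 18.0, 40.0, 35.0,
                      {"chl": 0.60, "sst": 0.07, "par": 0.20, "mld": 0.41})
assert len(cases) == 81   # one unperturbed case plus 80 perturbed ones
```

For each NPP measurement, the uncertainty analysis then keeps whichever of the 81 perturbations yields the lowest RMSD.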

Observed data
Measurements of NPP ranged from as low as 18 mg C m−2 day−1 in the Ross Sea to as high as 5038 mg C m−2 day−1 in the West Antarctic Peninsula (WAP). The region with the highest mean NPP was the Ross Sea (1274 mg C m−2 day−1), while the region with the lowest mean NPP was the Black Sea (341 mg C m−2 day−1) (Table 4). The region with the highest variability in NPP was the Mediterranean Sea, while the North Atlantic Bloom Experiment (NABE) and the Antarctic Polar Frontal Zone (APFZ) had the lowest variability (Table 4).

Total RMSD
In terms of the average skill of the 21 ocean color models, RMSD differed significantly (P < 0.0001) among the ten regions (Table 4; Fig. 2). Average ocean color model skill was significantly lower (P < 0.0001) in the Black and Mediterranean Seas (mean RMSD = 0.44 and 0.42, respectively) than in the other eight regions (0.27) (Table 4; Fig. 2). Among the other eight regions, there were significant differences between specific groups. The hierarchy of average model skill (highest to lowest; P < 0.005) for groups of regions with statistically significant differences in RMSD was as follows: the NABE and APFZ (0.15); the Arabian Sea and HOT; BATS, NEA, the Ross Sea, and the WAP (0.33); and finally the Black and Mediterranean Seas (0.43) (Table 4; Fig. 2). Within each of these four groups of regions, model skill was not significantly different. (In Fig. 2, models with an RMSD below the solid black line have a Model Efficiency > 0 and thus estimate NPP more accurately than using the mean of the observed data.)
In terms of individual model skill, certain models performed better than others in specific regions (Fig. 3). Model 16 (Antoine and Morel, 1996) was among the best models (in terms of lowest RMSD) in eight of ten regions and had ME > 0 in three regions (Fig. 3). Models 9 (Behrenfeld and Falkowski, 1997; Eppley, 1972) and 12 (Armstrong, 2006) were among the best models in seven of ten regions and had ME > 0 in five regions (Fig. 3). Model 7 (Kameda and Ishizaka, 2005) was among the best models in six of ten regions and had ME > 0 in four regions (Fig. 3). (Fig. 4 caption fragment: DIWIs (11 models), DRWIs (4 models), and DRWRs (6 models). Bias* and uRMSD* are normalized by the standard deviation of the in situ NPP data (σ_d) at each region. The solid circle is the normalized standard deviation of the in situ NPP data at each region; symbols falling within the circle indicate that models estimate NPP more accurately than using the mean of the observed data (Model Efficiency > 0). Red symbols are pelagic regions and blue symbols are coastal.)
The ME statistic was not consistent between regions (Figs. 3 and 4). In the Ross Sea, all models estimated NPP more accurately than using the mean of the observed data (ME > 0) whereas none of the models did better than the observed data mean in BATS and the Black Sea (ME < 0) (Figs. 3 and 4).

Bias and variance
Target diagrams were used to illustrate the ability of ocean color models to estimate NPP more accurately than using the observed mean for each region (values in Table 4) such that symbols within the solid circle were successful (ME > 0) and those lying on the circle or outside were not (ME ≤ 0). This ability was a function of both the type of ocean color model and the region (Fig. 4). The depth/wavelength integrated models fell within the solid circle for the Arabian and Ross Seas, WAP, and the APFZ; the depth resolved/wavelength integrated models for the Mediterranean and Ross Seas, and WAP; and the depth/wavelength resolved models for NABE, the Mediterranean, Arabian and Ross Seas, and APFZ (Fig. 4).
In terms of average bias, the models either overestimated the observed mean NPP or estimated it with no bias in the five shallowest regions (NEA, Black Sea, Mediterranean Sea, Ross Sea, WAP; Fig. 4). Conversely, the models all underestimated the observed mean NPP at BATS, the Arabian Sea, and HOT (Fig. 4). However, at NABE and APFZ, the sign of bias depended on whether depth and wavelength were resolved.
In contrast to the bias results discussed above, the ability of the models to reproduce NPP variability was not a function of depth but rather of model type. The depth/wavelength resolved models underestimated NPP variability in all regions except the APFZ. On average, the three types of models overestimated the variance at the APFZ while underestimating the variance in the Black and Mediterranean Seas, HOT, the Ross Sea, and the WAP (Fig. 4). In the other four regions, the sign of uRMSD depended on whether depth and wavelength were resolved. Finally, total RMSD was not a function of whether or not depth and wavelength were resolved (Fig. 4).

Uncertainty analysis
The range of uncertainty in NPP measurements across all regions (N = 1156) was from ±11 to ±629 mg C m−2 day−1, with an average uncertainty of ±31% (±175 mg C m−2 day−1). Average uncertainty for Chl-a was ±60% (±0.54 mg m−3), for MLD ±41% (±17 m), for PAR ±20% (±8 E m−2 day−1), and for SST ±7% (±1 °C). When the uncertainties in both the input variables and the NPP measurements were considered at each of the ten regions, average RMSD decreased significantly, by nearly 72% (P < 0.0005), in every region (Figs. 2 and 5). Uncertainties in Chl-a and in NPP measurements accounted for the largest individual reductions in RMSD across all regions (35% and 36%, respectively) (Fig. 5). The uncertainty in NPP measurements had the smallest influence (23%) on RMSD in the Mediterranean Sea but the largest influence (46% to 60%) for NABE, HOT, and APFZ (Fig. 5). Uncertainty in Chl-a had the smallest influence (21%) on RMSD at BATS but the largest influence (44% and 53%) at the APFZ and NABE (Fig. 5). Among the uncertainties in the other individual input variables (PAR, MLD, and SST), the average reduction in RMSD was only 6% (Fig. 5).

Individual model skill
When individual model skill was averaged over all ten regions, there were no significant differences in mean RMSD among the 21 ocean color models (Fig. 6). Average RMSD for the 21 models was 0.30 (±0.02; 2 × standard error). There were also no significant differences among the three types of ocean color models (Fig. 6): average RMSD for DIWI, DRWI, and DRWR models was 0.30 (±0.02), 0.30 (±0.02), and 0.28 (±0.04), respectively.

Relationship between model-data misfit and station parameters
The behavior of these models was investigated further by examining the correlation of model-data misfit with various parameters across all regions (Fig. 7). The strongest correlation was with station depth (mean correlation = −0.39), followed by observed NPP (−0.33), latitude (0.33), and SST (−0.32) (Figs. 7 and 8). The strongest correlation between model-data misfit and station depth was for Model 17 (−0.65) and the weakest was for Model 20 (0.01) (Fig. 7). The weakest correlation between model-data misfit and observed NPP was for Model 2 (−0.05), while Model 20 had the strongest (−0.77) (Fig. 7). For both latitude and SST, Model 3 had the weakest correlations (0.04 and −0.04) while Model 17 had the strongest (0.73 and −0.72) (Fig. 7). Although Chl-a, MLD, and PAR did not produce correlations with model-data misfit stronger than |0.30| for the models as a group, some individual models did stand out for Chl-a (Models 17 and 20) and MLD (Model 2) (Fig. 7). For PAR, no individual model had a correlation stronger than |0.30| (Fig. 7).
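The underlying computation is a straightforward Pearson correlation between the per-station misfit Δ(i) and each station parameter; a sketch (our function name, illustrative parameter names):

```python
import numpy as np

def misfit_correlations(misfit_values, station_params):
    """Pearson correlation between model-data misfit (log10 space) and
    each station parameter (e.g. depth, SST, absolute latitude)."""
    return {name: float(np.corrcoef(misfit_values, values)[0, 1])
            for name, values in station_params.items()}
```

A negative correlation with depth, as reported above, indicates that the misfit Δ(i) decreases as stations deepen, i.e. models tend to overestimate NPP at shallow stations and underestimate it at deep ones.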
The general relationship between model-data misfit and station depth was such that the models overestimated NPP at shallow stations, underestimated NPP at deep stations, and had the greatest skill at stations in 2500-3500 m water depth (Fig. 8). The models generally produced a smaller range of NPP values than observed: they overestimated NPP when NPP was low and underestimated NPP when NPP was high, with optimal model-data fit at NPP ∼900-1000 mg C m−2 day−1 (Fig. 8). For SST, the models tended to overestimate NPP at SST below 5 °C, overestimate NPP at SST between 5 and 15 °C, and underestimate NPP at SST above 20 °C (Fig. 8). The correlation between model-data misfit and latitude (Fig. 7) was likely driven by the high correlation between SST and latitude (−0.96).

Model performance as a function of water column depth
In terms of average RMSD for the 21 models, skill was significantly higher (P < 0.01) at stations with depths greater than 250 m (Fig. 9). When the uncertainties of both the input variables and the NPP measurements were considered, model skill increased significantly across the three depth ranges, but the relationship between them was unchanged. When only the stations shallower than 250 m were considered, those <125 m had significantly lower skill (mean RMSD = 0.44 ± 0.05 standard deviation) than those between 125 and 250 m (0.39 ± 0.05). However, stations between 125 and 250 m had significantly lower skill than those greater than 250 m. In terms of the performance of individual models within these depth intervals (Fig. 10), only Model 7 (Kameda and Ishizaka, 2005) had no substantial change in skill (relative to the change in skill of the other models) as a function of station depth (Fig. 10a). Within each of the three depth ranges, model skill (as a group) was a function of either SST (Fig. 10b) or surface Chl-a, but not both (Fig. 10c). For stations with depths between 0-250 m, the models had significantly higher skill (P < 0.0001) at SST > 20 °C than at SST < 20 °C (Fig. 10b) but showed no difference in skill across the three ranges of surface Chl-a (Fig. 10c). For depths between 250-750 m, model skill was highest at SST > 20 °C, intermediate at SST < 10 °C, and lowest at SST between 10-20 °C (P < 0.0001; Fig. 10b), with no difference in model skill across the three ranges of surface Chl-a (Fig. 10c). For depths greater than 750 m, there was no difference in model skill across the three ranges of SST (Fig. 10b), although model skill was highest (P < 0.005) at surface Chl-a concentrations <0.5 mg m−3 and >1.0 mg m−3 (Fig. 10c). Although these statistical comparisons (ANOVA) were based on groups of models, a few individual models did not follow the group statistics.
For example, within the 250-750 m depth range, Models 2 (Howard and Yoder, 1997) and 21 (Ondrusek et al., 2001) had a wide range of skill at the three surface Chl-a ranges whereas all of the models as a group did not (Fig. 10c).

Region-specific model performance
The average skill of the ocean color models assessed in this study varied substantially from region to region. Although the sample size of in situ NPP measurements and the number of ocean color models tested were much higher than in the previous multi-regional PPARR study (Campbell et al., 2002), our results were similar in that model skill was a strong function of region. Although we cannot compare results in most cases because of sample size differences, the NABE NPP measurements compared in this study were identical to those used in Campbell et al. (2002), so in this case we can compare ocean color model skill between the two studies. The average RMSD among 12 ocean color models at NABE in Campbell et al. (2002) was 0.31, whereas the average RMSD in our study was 0.14. Our results suggest that this increase in skill of nearly 50% in the NABE region is due to either or both of: (1) improvements to particular algorithms used both here and in Campbell et al. (2002); and (2) the larger number of better-performing models contributed since the Campbell et al. (2002) study.
Ocean color models were most challenged in the Black and Mediterranean Seas; these two regions also had the largest proportion of station depths less than 250 m (Black Sea 44%; Mediterranean Sea 36%), where average model skill was lowest. These results suggest that the shallow depths of the Black and Mediterranean Seas resulted in lower model skill there.
There was no difference in mean RMSD between the NABE and APFZ, the two regions where models had the highest skill. These regions shared multiple characteristics that may have led to the high skill of the models: they were among the deepest stations in the study, mean NPP was between 900-1000 mg C m−2 day−1, and, most importantly, the NPP measurements were obtained over one month of a single year that sampled the spring phytoplankton bloom and were thus characterized by low variability. If longer temporal coverage of the NABE and APFZ had been available, the seasonal variability of NPP would have been stronger, possibly further challenging the models. The NABE and APFZ had the lowest observed variability in NPP (±24% standard deviation of the mean), followed by the Arabian Sea and HOT (±33%). The Arabian Sea and HOT followed the NABE and APFZ in the hierarchy of model skill; one might therefore suspect that model skill is driven by the level of NPP variability in a region. However, we found this not to be the case among the remaining regions (BATS, NEA, Ross Sea, and WAP), where models performed equally although NPP variability was not consistent. There may be a threshold of NPP variability (<35%) that affects model skill; however, the four regions where models had the highest skill were also among the deepest stations. Therefore, the high performance of the models at the NABE, APFZ, Arabian Sea, and HOT may be driven by a combination of low NPP variability (<35%), deep station depth (>2000 m), and moderate NPP (900-1000 mg C m−2 day−1). Of the eight regions investigated by Campbell et al.
(2002), ocean color models had the lowest skill in those characterized by High-Nitrate Low-Chlorophyll (HNLC) regions, i.e. the equatorial Pacific and the Southern Ocean. In their comparison of globally modeled NPP using satellitederived input variables, Carr et al. (2006) found that modeled NPP significantly diverged in HNLC regions.
Using the results from the present study along with a recent PPARR study that compared ocean color model NPP estimates to in situ data in the tropical Pacific, where 60% of the stations were in HNLC waters (Friedrichs et al., 2009), we can further assess model estimates in HNLC regions. Mean RMSD of the 21 ocean color models tested in the tropical Pacific was 0.29, similar to the skill in the Arabian Sea (0.22) and HOT (0.26), where the 21 models tested here performed relatively well (skill was higher only in the NABE and APFZ regions). The average RMSD for the three Southern Ocean regions tested here (Ross Sea, WAP, and APFZ) was 0.28. Comparing these regions to the average RMSD of the other four regions in this study (BATS, NEA, the Black Sea, and the Mediterranean Sea; 0.38), ocean color models performed better in HNLC regions such as the Southern Ocean and tropical Pacific. It therefore appears that the set of ocean color model algorithms tested here and in Friedrichs et al. (2009) may represent an improvement over those used in Campbell et al. (2002), specifically in that NPP model estimates in HNLC regions perform as well as, if not better than, those in non-HNLC regions.
Contrary to expectations, the ocean color models tested here were not particularly challenged by extreme conditions of Chl-a and SST. The three Southern Ocean regions had an average SST of 1 °C and a wide range of Chl-a values, yet the models had higher skill there than in regions with much warmer SST and average Chl-a concentrations. Our results agree with Carr et al. (2006) in that the relationship between SST and ocean color model-data misfit is a function of SST range: at SST less than 10 °C, model-data misfit increases with increasing SST, while at SST greater than 10 °C, misfit decreases with increasing SST (Fig. 8). Carr et al. (2006) showed that model estimates of NPP diverged the most in the Southern Ocean, at SST < 10 °C, and at Chl-a concentrations above 1 mg m−3. Our results were similar: the standard deviations among the 21 ocean color model estimates tested here were significantly higher (P < 0.0001) in areas with SST < 10 °C (±684 mg C m−2 day−1) versus SST > 10 °C (±554 mg C m−2 day−1), at Chl-a > 1 mg m−3 (±818 mg C m−2 day−1) versus Chl-a < 1 mg m−3 (±315 mg C m−2 day−1), and in the Southern Ocean (±692 mg C m−2 day−1) versus areas outside the Southern Ocean (±548 mg C m−2 day−1). However, if we consider individual regions, the highest divergence (P < 0.0001) in model estimates of NPP was in the Mediterranean Sea (±1021 mg C m−2 day−1). Thus in the Mediterranean Sea, ocean color models are not only highly challenged in terms of model skill but also produce the greatest divergence in NPP estimates.
It is important to note that model divergence is not always associated with low model skill in terms of model-data misfit or RMSD. For example, RMSD among the 21 models was exactly the same between waters with Chl-a concentrations less than 1 mg m−3 (mean RMSD = 0.34) and those above this concentration (mean RMSD = 0.34).

Model-type performance across all regions
Some of the models tested here were originally developed and tuned for specific regions included in our analysis, and this may explain their higher performance in those regions. Surprisingly, even though certain models performed significantly better than others in specific regions, the ocean color models generally performed equally well in terms of their average model skill across all ten regions. The simplest empirical relationship performed no worse than the most complex depth and wavelength resolved models. These results are consistent with Friedrichs et al. (2009) who also reported no effect of ocean color model complexity on model skill.
The most striking result among the models was their performance in the Southern Ocean, where the extremely low temperatures should have not only affected model skill, but also challenged models that did not use SST as an input variable. Surprisingly, there was no statistically significant difference in model skill in the three Southern Ocean regions between models that used SST (17 models; mean RMSD = 0.27 (±0.09)) and those that did not (4 models; mean RMSD = 0.31 (±0.14)). Given the wide SST range of the regions tested here, one might expect models that used SST to outperform those that did not, owing to the temperature-dependent maximum carbon fixation rate of phytoplankton (Eppley, 1972). Across all regions, models that used SST performed no differently (mean RMSD = 0.30 (±0.12)) than those that did not (mean RMSD = 0.30 (±0.13)); however, model-data misfit among models that did not use SST had a correlation to station SST of −0.58, compared to −0.26 for models that used SST. Therefore, although model-data misfit was correlated with SST for models that did not use SST, the correlation was not strong enough to produce a significant difference in skill relative to the models that used SST.
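The Eppley (1972) curve referenced above is commonly written as μ_max(T) = 0.59 e^(0.0633T) d⁻¹, although coefficients vary slightly among formulations. A hedged sketch of this form shows how strongly the exponential weights SST across the range sampled here:

```python
import math

def eppley_max_growth(sst_celsius):
    """Temperature-dependent maximum phytoplankton growth rate (d^-1),
    in one commonly cited form of the Eppley (1972) curve; the exact
    coefficients differ among NPP model formulations."""
    return 0.59 * math.exp(0.0633 * sst_celsius)

# From ~1 degC (the Southern Ocean average SST noted above) to ~20 degC,
# the Eppley maximum rises by more than a factor of three:
ratio = eppley_max_growth(20.0) / eppley_max_growth(1.0)
```

This factor-of-three span is why one might expect models lacking SST as an input to be at a disadvantage across such a wide temperature range, making the observed parity in skill notable.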

Uncertainties in input variables and NPP measurements
When uncertainties in both the input variables and NPP measurements were considered, RMSD was reduced by 72%. The largest influence among the input variables was from Chl-a (35% reduction in RMSD). As Friedrichs et al. (2009) found in the tropical Pacific, uncertainties in SST, PAR, and MLD had a relatively small influence on RMSD. The region-specific uncertainty values used for Chl-a were based on differences between in situ data and SeaWiFS data, in order to assess the sensitivity of model estimates of NPP to error in satellite data. This was an essential analysis given that ocean color models were designed to use satellite-derived input data to estimate NPP over large areas and long time-scales. However, because we perturbed in situ input data rather than satellite-derived data, the reduction in RMSD from uncertainty in Chl-a would likely not have been as high as 35% had we based the perturbations on error in the in situ measurements. The PPARR tropical Pacific study based its Chl-a perturbations on in situ measurement error, with uncertainties ranging from ±50% at the minimum concentration (±0.01 mg m⁻³) to ±15% at the maximum concentration (±0.11 mg m⁻³), resulting in a 24% increase in ocean color model skill. Uncertainty in Chl-a for our study averaged ±60% (±0.54 mg m⁻³) across all regions, explaining why the ocean color models here had a greater sensitivity to Chl-a uncertainty. Our goal was to describe the sensitivity of RMSD to differences between in situ and satellite-derived data, given that models typically use the latter. If our estimates of 14C measurement uncertainties are correct, then a 36% reduction in RMSD is substantial enough that these errors should be considered when estimating NPP. Assuming that the change in RMSD based on Chl-a uncertainties is closer to that found in Friedrichs et al. (2009) (24%) than to our value (35%), our estimate of the RMSD reduction when uncertainties in both the input variables and NPP measurements are considered would be lower than 72%, but not likely less than 50%. Therefore, our study confirms the importance of both input variable (primarily Chl-a) and NPP uncertainty when using ocean color models to estimate NPP. However, in situ NPP data are not always available to assess the error associated with ocean color NPP estimates; therefore, input variable uncertainty may be a more practical approach to characterizing the expected range of estimated NPP. We recommend that ocean color NPP estimates, whether published or made available online, be accompanied by the magnitude of uncertainty in the estimates due to uncertainty in input variables such as SeaWiFS Chl-a and MLD.
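A perturbation analysis of this kind can be sketched as follows. This is a hypothetical illustration, not the paper's exact procedure: a generic `npp_model` is rerun with Chl-a perturbed within its fractional uncertainty, and the misfit that can be eliminated by perturbations within that range is credited to input error.

```python
import math
import random

def perturbation_rmsd(npp_model, stations, chl_uncertainty, n_trials=100, seed=0):
    """Illustrative input-uncertainty analysis: perturb each station's
    Chl-a within +/- chl_uncertainty (fractional), rerun the model, and
    keep the smallest squared log10 misfit achievable per station.
    `npp_model` and the station keys are hypothetical names."""
    rng = random.Random(seed)
    best_sq_misfits = []
    for st in stations:
        best = float("inf")
        for _ in range(n_trials):
            factor = 1.0 + rng.uniform(-chl_uncertainty, chl_uncertainty)
            est = npp_model(chl=st["chl"] * factor, sst=st["sst"])
            d = math.log10(est) - math.log10(st["npp_insitu"])
            best = min(best, d * d)
        best_sq_misfits.append(best)
    return math.sqrt(sum(best_sq_misfits) / len(best_sq_misfits))

# Toy model that systematically underestimates NPP by 40%:
toy_model = lambda chl, sst: 600.0 * chl
stations = [{"chl": 0.5, "sst": 15.0, "npp_insitu": 500.0},
            {"chl": 2.0, "sst": 5.0, "npp_insitu": 2000.0}]
```

With `chl_uncertainty=0.0` this reduces to the ordinary RMSD; with a ±60% Chl-a uncertainty (the average used in this study), part of the toy model's bias can be absorbed by the perturbations and the residual RMSD falls.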

Water column depth and model performance
One of the clearest patterns emerging from this study was the relationship between station depth and average model skill: for stations with water column depths greater than 4000 m, ocean color models typically underestimated NPP, whereas they overestimated NPP at depths shallower than 750 m. This positive NPP bias was even greater at depths shallower than 250 m. Interestingly, the relationship between model skill and SST/surface Chl-a was also a function of depth. The models performed significantly better at SST >20 °C at depths less than 750 m, whereas SST made no difference to model skill at stations deeper than 750 m. Surface Chl-a concentration only affected model skill at stations deeper than 750 m, where skill was highest at concentrations <0.5 mg m⁻³ and >1.0 mg m⁻³. However, one must note that the sample sizes (N) were not consistent for each depth range and each SST/surface Chl-a range. Because the RMSD statistic depends in part on N, these inconsistencies in sample size may have biased our results.
Model skill was significantly lower at the shallow stations, and thus the effect of station depth was by far the greatest variable driving model skill. The reason for this relationship is not completely clear. If satellite-derived chlorophyll concentrations had been used in the algorithms, we would have expected the algorithms to perform better in deep Case-1 waters (defined as waters where Chl-a is considered the main driver of optical properties; Morel and Prieur, 1977), because the standard satellite chlorophyll algorithms are known to have difficulty in shallow Case-2 waters where other optically significant constituents dominate. Here, however, we used in situ chlorophyll concentrations, which are not likely to be associated with greater errors in shallower waters. Moreover, surface Chl-a concentration did not affect model skill at depths less than 250 m, where Case-2 waters exist.
Most of the models tested here were developed based on in situ data collected in Case-1 waters, a likely explanation for their lower skill in Case-2 waters. A possible reason for the relationship between model bias and water column depth is that the models were overestimating the euphotic zone depth in Case-2 waters and underestimating it in Case-1 waters. The model contributors, however, did not provide their estimates of euphotic zone depth, so we presently have no way of confirming this.
In addition to obtaining estimated euphotic zone depth, another way of possibly resolving this would be to obtain depth-specific output from the depth-resolved models. Our study only required contributors to provide us with integrated NPP. We suggest that future NPP model assessment studies require model contributors to provide detailed output that includes euphotic zone depth estimates in addition to depth-specific NPP estimates from the depth-resolved ocean color models. Results from such studies may help explain the relationship between model skill/bias and water column depth.

Summary and conclusions
The ocean color models tested in this study were not limited by their algorithm complexity in their ability to estimate NPP across all regions. However, model improvement is required to eliminate the poor performance of the ocean color models at shallow depths, or possibly in Case-2 waters close to coastlines. Additionally, ocean color chlorophyll-a algorithms are challenged by optically complex Case-2 waters (Gordon and Morel, 1983); therefore, using satellite-derived Chl-a to estimate NPP in coastal areas would likely further reduce the skill of ocean color models. The reason for the correlation between station depth and model skill is unknown; we can only surmise that it is because the algorithms were developed from data in pelagic waters. A more detailed analysis of ocean color model output is required to address this, i.e. one that includes model output at specific depths along with estimates of euphotic zone depth.
Ocean color model performance was highly limited by the accuracy of input variables. Roughly half of the model-data misfit could be attributed to uncertainty in the four input variables, with the largest contributor being uncertainty in Chl-a. Moreover, another 22% of misfit could be attributed to uncertainties in the NPP measurements. These results suggest that ocean color models are capable of accurately estimating NPP if errors in measurements of input data and NPP are considered. Therefore, studies that use ocean color models to estimate NPP should note the degree of error in their estimates based on both the input data they use and the region where NPP is being estimated.
The intent of this study was not to identify the single best NPP ocean color model; indeed, we clearly illustrated that no one best model existed for all conditions. The results provided here can, however, be used to determine which set of ocean color models might be best for any given application. For example, in shallower regions (<250 m), a modeler might want to consider using Model 7 (Kameda and Ishizaka, 2005) rather than Model 2 (Howard and Yoder, 1997) or Model 21 (Ondrusek et al., 2001). In deeper waters, Model 16 (Antoine and Morel, 1996) might be an excellent choice. Model 3 (Carr, 2002) has much greater skill in warm open-ocean regions than in warm shallow regions closer to shore, whereas Model 5 (Scardi, 2001) has greater skill in warm shallow regions than in warm deep-ocean regions. We hope that our results (e.g. Fig. 10) can help future investigators make informed selections of the most appropriate NPP ocean color model for their particular purpose.
Finally, partly in an effort to be consistent with past NPP comparison efforts, this study assessed model skill based on RMSD, which reflects a model's ability to estimate the mean and variability of NPP. Another method of assessing model skill, however, is Model Efficiency, which determines whether a model can reproduce observations with skill greater than that of the mean of the observations. When comparing total RMSD across a variety of regions, sites with relatively low variability may appear to perform best; yet in these regions the Model Efficiency may be low, since the mean of the observations will produce low RMSD values that are difficult to "beat".
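The Model Efficiency statistic described here is commonly the Nash-Sutcliffe form. A minimal sketch in raw units (the PPARR skill statistics themselves are computed on log10-transformed NPP) illustrates the low-variability point:

```python
import math

def model_efficiency(model, obs):
    """Nash-Sutcliffe-style model efficiency: ME = 1 is a perfect fit,
    ME > 0 means the model beats the mean of the observations,
    ME < 0 means the observation mean is the better predictor."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - m) ** 2 for o, m in zip(obs, model))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def rmsd(model, obs):
    """Plain root-mean-square difference in the units of the data."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))

# A low-variability site: the absolute misfit (RMSD) is small, yet the
# model never beats the observation mean, so Model Efficiency is negative.
obs = [100.0, 101.0, 99.0, 100.0]
model = [102.0, 103.0, 101.0, 102.0]
```

Here the RMSD is only 2.0 in the units of the observations, but the Model Efficiency is strongly negative, which is exactly the situation described above for regions with low NPP variability.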
Another type of model skill assessment addresses how well models estimate trends in NPP over various temporal and spatial scales. The only way to determine this is to compare model estimates of NPP to stations where in situ measurements are taken year-round over multiple years, unlike the majority of the stations in this study. A recent study by Saba et al. (2010) assessed both ocean color model and biogeochemical circulation model skill at the BATS and HOT regions, where single-station time-series of NPP data exist. It found that ocean color models did not accurately estimate the magnitude of multidecadal NPP trends, and were even more challenged over shorter time periods, especially when the models used satellite-derived Chl-a. Therefore, until longer satellite ocean color time-series become available, the use of ocean color models may be more applicable to studies interested in estimating the magnitude and variability of NPP rather than long-term trends.