Challenges associated with modeling low-oxygen waters in Chesapeake Bay: a multiple model comparison

As three-dimensional (3-D) aquatic ecosystem models are used more frequently for operational water quality forecasts and ecological management decisions, it is important to understand the relative strengths and limitations of existing 3-D models of varying spatial resolution and biogeochemical complexity. To this end, 2-year simulations of the Chesapeake Bay from eight hydrodynamic-oxygen models have been statistically compared to each other and to historical monitoring data. Results show that although models have difficulty resolving the variables typically thought to be the main drivers of dissolved oxygen variability (stratification, nutrients, and chlorophyll), all eight models have significant skill in reproducing the mean and seasonal variability of dissolved oxygen. In addition, models with constant net respiration rates independent of nutrient supply and temperature reproduced observed dissolved oxygen concentrations about as well as much more complex, nutrient-dependent biogeochemical models. This finding has significant ramifications for short-term hypoxia forecasts in the Chesapeake Bay, which may be possible with very simple oxygen parameterizations, in contrast to the more complex full biogeochemical models required for scenario-based forecasting. However, models have difficulty simulating correct density and oxygen mixed layer depths, which are important ecologically in terms of habitat compression. Observations indicate a much stronger correlation between the depths of the top of the pycnocline and oxycline than between their maximum vertical gradients, highlighting the importance of the mixing depth in defining the region of aerobic habitat in the Chesapeake Bay when low-oxygen bottom waters are present. Improvement in hypoxia simulations will thus depend more on the ability of models to reproduce the correct mean and variability of the depth of the physically driven surface mixed layer than the precise magnitude of the vertical density gradient. Published by Copernicus Publications on behalf of the European Geosciences Union. 2012 I. D. Irby et al.: Multiple modeling low-oxygen waters


Introduction
Since the middle of the last century, anthropogenic impacts have dramatically decreased water quality throughout the Chesapeake Bay (Boesch et al., 2001), one of the largest estuaries in North America. Land-use change along with the industrialization and urbanization of the Chesapeake Bay watershed have caused dramatic increases in nutrient inputs to the bay (Kemp et al., 2005), spurring additional primary production and phytoplankton abundance (Harding Jr. and Perry, 1997). Because increased primary production leads to more organic matter throughout the water column that is eventually decomposed by bacteria, these increased nutrient inputs to the bay have led to a corresponding decrease in dissolved oxygen (DO) concentrations (Hagy et al., 2004). Hypoxia, generally defined as the condition in which DO concentrations are below 2 mg L −1 , usually initiates seasonally in the northern portion of the bay and expands southward as summer develops (Kemp et al., 2009;. Although hypoxia in the Chesapeake Bay has likely existed since European colonization Brush, 1991, 1993), recent studies have highlighted an accelerated rise in the number and spatial extent of hypoxic, as well as anoxic (DO concentrations < 0.2 mg L −1 ), events in the bay since the 1950s, primarily attributed to increased anthropogenic nutrient input (Hagy et al., 2004;Kemp et al., 2005;Gilbert et al., 2010). These impacts are likely to be exacerbated by future climate change (Najjar et al., 2010;Meire et al., 2013;Harding Jr. et al., 2015).
Interest in the ecological impacts of reduced DO concentrations has been elevated due to the observed proliferation of hypoxic events in the world's coastal oceans, creating vast dead zone areas that compress suitable habitats for many marine species (Diaz, 2001;Diaz and Rosenberg, 2008;Pierson et al., 2009). Low-DO waters can greatly impact the abundance and health of important ecological species, potentially resulting in suffocation and major kills of fish, crabs, and shellfish (Breitburg, 2002;Ekau et al., 2010;Levin et al., 2009). While the presence of DO concentrations < 2 mg L −1 have been shown to decrease the abundance of fish larvae (Keister et al., 2000), some species can incur negative health impacts and modify their behavior at significantly higher DO concentrations (Vaquer-Sunyer and Duarte, 2008). DO concentrations of ∼ 4 mg L −1 have been found to compress demersal fish habitat as fish seek out more oxygenated waters (Buchheister et al., 2013). Zooplankton, a crucial food source for valuable species, have also been found to exhibit changes in distribution and predation when subject to large volumes of low-DO water, potentially leading to further impacts along the food chain (Breitburg et al., 1997;Pierson et al., 2009). Invertebrates have similarly been found to alter their behavior under low-DO conditions (Riedel et al., 2014). In the Chesapeake Bay, multiple regulated fish species, such as striped bass and American shad, require oxygen restoration targets as high as 5 mg L −1 Figure 1. Map of the Chesapeake Bay and its watershed. (USEPA, 2010). The greatest impact of low DO concentrations spatially will depend on the specific living resource; however, temporally, late spring to early fall is of most concern. As a result of the significant ecological importance of oxygen on living resources in the bay, DO concentrations are used as a primary indicator in assessing water quality for Chesapeake Bay regulations (Keisman and Shenk, 2013).
Improving the health of the Chesapeake Bay has become a priority for the Environmental Protection Agency (EPA) along with the six states and Washington, DC that make up the bay watershed ( Fig. 1), and together they have committed to utilizing a suite of regulatory models to inform their management decisions (USEPA, 2010). The Chesapeake Bay Program (CBP), a regional partnership that has led and directed the restoration of the Chesapeake Bay since 1983, has undertaken an extensive modeling effort of the bay (Cerco and Cole, 1993;Cerco et al., 2002;Noel, 2004, 2013). This modeling system is being used by the CBP to estimate the aggregate effect of changes in management practices, including land use, atmospheric deposition, animal populations, and fertilizer and manure application. Recently, the modeling system has been used to conduct scenario simulations to assess management actions needed to achieve desired bay water quality standards (USEPA, 2010). Ultimately this model was used to establish a regulatory set of total maximum daily loads of nutrients and sediment delivered from the watershed, with the goal of significantly improving water quality throughout the bay (USEPA, 2010). Many 3-D hydrodynamic-oxygen models of varying complexity stemming from the academic research community have also been used to simulate DO concentrations throughout the Chesapeake Bay (Scully, 2010(Scully, , 2013Hong and Shen, 2013;Feng et al., 2015;Y. Li et al., 2015). Bever et al. (2013) specifically demonstrated that multiple models of varying complexity are able to generate skillful estimates of hypoxic volume in the bay. Some of these models are being used in the bay to simulate short-term and/or seasonal forecasts of DO conditions. Furthermore, some models are also being used to generate scenario forecasts, or projections, that assess the impact of changes in management practices on estuarine DO concentrations, in some cases taking into account the impacts of future changes in climate.

Biogeosciences
As ecosystem and water quality models are increasingly used for operational forecasts as well as scenario-based management decisions by the regulatory and academic research communities, it is important to understand the relative strengths and limitations of existing models of varying complexity. The ability to discern which variables must be most accurately simulated in order to adequately reproduce the temporal and spatial variability of bay oxygen concentrations is a necessary prerequisite for fully understanding how volumes of low-DO water are initiated and sustained within water quality models. The utilization of multiple models can also inform projections by providing independent confidence bounds for management decisions. To those ends, the overarching goals of this research are to compare the relative skill of various 3-D Chesapeake Bay models characterized by different levels of biogeochemical complexity and spatial resolution, to better understand factors limiting their ability to reproduce observed DO distributions, and to suggest approaches for the continued improvement of these models.

Participating Chesapeake Bay models
Eight 3-D models were evaluated in this study (Table 1), each of which includes hydrodynamic and DO components. Among the eight models, there are four different hydrodynamic base models. Models B, C, D, F, and G utilize the Regional Ocean Modeling System (ROMS; Shchepetkin and McWilliams, 2005;Haidvogel et al., 2008) that employs a structured grid with sigma layers in the vertical dimension. Specifically, Models B, C, and F use a ROMS implementation developed for the Chesapeake Bay based on Xu et al. (2012;ChesROMS). Model D employs a ROMS implementation for the Chesapeake Bay based on M. Li et al. (2005), while Model G uses the ROMS-based Chesapeake Bay Operational Forecast System (CBOFS; Lanerolle et al., 2011). Models A, E, and H each use a different hydrodynamic base model: the Curvilinear Hydrodynamics in Three Dimensions model (CH3D; Cerco et al., 2010), the Finite-Volume Community Ocean Model (FVCOM; Jiang and Xia, 2016), and the Hydrodynamic Eutrophication Model -Environmental Fluid Dynamics Code (EFDC; Park et al., 1995;Hong and Shen, 2012;Du and Shen, 2015), respectively. The only model that employs a non-sigma vertical grid is Model A and the only model utilizing an unstructured horizontal grid is Model E. While Model E contains 10 sigma vertical layers, all of the other sigma grids use 20 layers. All of the grids vary in terms of their horizontal resolution, with Models A and G utilizing the highest resolution horizontal grids.
These four hydrodynamic models are coupled to five different models used to simulate DO (Table 1). Models A, B, C, D, and E utilize full biogeochemical models that include various combinations of oxygen, phytoplankton, zooplankton, and multiple inorganic and organic nutrients as state variables. Specifically, Models A and E employ a version of the Integrated Compartment Model (ICM; Cerco et al., 2010;Jiang et al., 2015), Model B uses the Estuarine Carbon Biogeochemistry model (ECB; Feng et al., 2015), Model C uses the Biogeochemistry model (BGC; Brown et al., 2013), and Model D uses the Row-Column AESOP model (RCA; . In terms of food web complexity the models vary considerably: Models B and C employ a single phytoplankton group whereas Model D uses two phytoplankton groups, Model E uses three, and Model A, the most complex of the participating models, uses five. In contrast to the full biogeochemical models discussed above (Models A through E), Models F, G, and H represent oxygen dynamics as simply as possible and therefore do not utilize a full biogeochemical component. Rather, the models impose a biological oxygen consumption rate that is modelspecific, but constant in both space and time. This component is referred to as a constant-respiration model (CRM). In this model, DO is introduced to the estuary via the river and ocean boundaries and is set to saturation at the estuarine surface. This constant-respiration oxygen parameterization (Scully, 2010) is simplistic, yet has been shown to adequately represent Chesapeake Bay oxygen dynamics (Scully, 2010(Scully, , 2013Bever et al., 2013).
The major difference in forcing between the eight model implementations is that Models A and B use riverine input derived from watershed models, whereas Models C-H used the measured flow from United States Geological Survey gauging stations, extrapolated using various techniques. Model A utilized the CBP's regulatory watershed model (Shenk and Linker, 2013), while Model B utilized the Dynamic Land Ecosystem Model (Yang et al., 2015a, b;Tian et al., 2015). At the open boundary with the Atlantic Ocean, Models B, C, D, F, G, and H utilize a sub-tidal elevation extrapolated from tidal stations on either side of the open boundary. Model E uses the TPXO tidal model, while Model A uses a mix of observational and model forcing (Cerco et al., 2010). While Model B utilizes wind forcing based on observations from the Thomas Point Light, Mod-   Du and Shen (2015) els C through H use wind estimates from the North American Regional Reanalysis (NARR).
The eight models used in this analysis have been developed for a variety of purposes. Model A is a governmental regulatory model developed by the CBP that has been extensively calibrated specifically to examine water quality issues in the Chesapeake Bay (Cerco and Cole, 1993;Noel, 2004, 2013;Cerco et al., 2010) and has been used in the development of the 2010 Chesapeake Bay Total Maximum Daily Load (USEPA, 2010). The National Oceanic and Atmospheric Administration employs the hydrodynamic component of Model F for operational forecasts of a variety of physical estuarine parameters for the Chesapeake Bay (http: //www.tidesandcurrents.noaa.gov/ofs/cbofs/cbofs.html). The other six models are academic models used in diverse research efforts focused on the Chesapeake Bay but not necessarily specifically on DO dynamics.
Finally, a ninth model is calculated as the mean of the results from the eight models described above, and is referred to here as Model Mean, or Model M.

Available Chesapeake Bay observations
Model simulations were compared to cruise data from the CBP for 2004 and 2005 from 13 stations along the main stem of the bay (Table 2, Fig. 2). The years 2004 and 2005 were selected to represent relatively wet and average years, respectively, and the 13 stations were chosen as they have been found to offer optimal estimates of bay-wide hypoxic volume (Bever et al., 2013). Stations were sampled on up to 34 cruises over the 2 years (Table 2), generally twice a month from April to August and once a month for the remainder of the year. Observational data can be downloaded from the CBP Water Quality Database (http://www.chesapeakebay.net/data/ downloads/cbpwaterqualitydatabase1984present). Variables downloaded from the CBP website and used in this study were temperature, salinity, DO, nitrate + nitrite (hereafter abbreviated as "nitrate"), and chlorophyll a (hereafter abbreviated as "chlorophyll"). For most cruises, observations of temperature, salinity, and DO were made at roughly 1 m intervals throughout the water column, whereas observations of chlorophyll and nitrate were generally made only at the surface, bottom, and sometimes one or two mid-water column locations. For further information on available water quality observations, please see USEPA (2012). While these observations were publicly available for model assessment during calibration of all of the models, they represent a very small subset of the 30 years of EPA observations across over 100 bay stations. The models compared here were calibrated based on access to the larger data set and for conditions in the bay in general, not specifically for the 13 stations and 2 years considered here.

Calculation of stratification and mixed layer depth
Stratification of the density and oxygen fields was examined to identify the maximum gradient of the pycnocline and oxycline as well as the depth of the top of the pycnocline and oxycline. In open ocean studies, the depth of the top of stratification is commonly referred to as the mixed layer depth (MLD), although this term is less frequently used in the estu-arine literature. As the research presented here distinguishes between the depths of the top of the pycnocline and that of the oxycline, these will be referred to respectively as the density (ρ) mixed layer depth (MLD ρ ) and the oxygen mixed layer depth (MLD O ). Density was calculated via a classical density formula that is also utilized by the CBP for use in the Chesapeake Bay (Fofonoff and Millard, 1983;USEPA, 2004) and is a function of temperature and salinity. The CBP defines the top and bottom of stratification in order to distinguish individual designated use areas for water quality management purposes (USEPA, 2004). They suggest that the top of the pycnocline be defined as the shallowest occurrence of a density gradient of 0.1 kg m −4 or greater as resolved by CBP profile observations, which are typically spaced at 0.5-2 m depth intervals. If density gradients throughout the water column are less than 0.1 kg m −4 , they define the water to be unstratified. The 0.1 kg m −4 threshold definition is designed to identify any initiation of stratification that may serve to cut off vertical mixing from a nearly perfectly well-mixed layer.
While the CBP definition described above delineates between designated use boundaries according to density, our research focuses on the relationship between the pycnocline and oxycline, requiring an alternate definition that can be applied to both the density and oxygen distributions. In addition, the CBP definition often generates estimates for the depth of the top of the pycnocline that are too shallow compared to the maximum depth of surface mixing (Fig. 3). As a result, a percentage threshold criterion was developed that identifies the bottom of the reasonably well-mixed layer, rather than perfectly mixed layer, and is used in this analysis. The percentage threshold method defines a density or DO profile as being stratified if a change of 10 % of the difference between the profile's maximum and minimum values occurs within a single meter (Fig. 3). For example, if the maximum DO concentration throughout the water column on an individual sampling date is 10 mg L −1 and the minimum concentration is 1 mg L −1 , stratification is defined to be present if a difference of 0.9 mg L −1 is present within 1 m. As recommended by the CBP, the uppermost meter of the water column is not considered (USEPA, 2004). The mixed layer depth is therefore defined as the shallowest level (below 1 m depth) where stratification is identified. The minimum stratification criterion utilized in this analysis requiring a profile to pass the 10 % threshold also ensures that observations where very little stratification exists do not bias the stratification results while also allowing for a single criterion to be used across multiple stratification variables.

Model skill metrics
Simulations of the Chesapeake Bay from the eight models described above were statistically compared to historical monitoring data using a variety of skill metrics including root-mean squared difference (RMSD), bias, standard deviation, and correlation coefficient. These metrics are illustrated on Taylor and target diagrams (Taylor, 2001;Hofmann et al., 2008;Jolliff et al., 2009), which offer a compact way of assessing model skill by displaying a number of different skill metrics. Target diagrams illustrate the bias and total RMSD of model output, which Taylor diagrams do not. Taylor diagrams include quantitative information on the standard deviations and correlations between the model output and the observations, which target diagrams do not. Both diagrams, however, represent unbiased RMSD, sometimes called "centered-pattern RMSD". On target diagrams, a model symbol above the horizontal axis overestimates the mean of the observations and a model symbol to the right of the vertical axis overestimates the variability of the observations. (See Hofmann et al. (2008) and Jolliff et al. (2009) for a more detailed description of these diagrams.) On Taylor diagrams, a model symbol lying on the horizontal axis exactly correlates to the observations and a model symbol further from the origin than the observation symbol overestimates the standard deviation of the observations. (See Taylor (2001) for a more detailed description of these diagrams.) Taylor and target diagrams presented here are normalized to the standard deviation of the observations, allowing multiple variables be represented on the same plot. This also conveniently allows the unit circle on a target diagram to represent the skill of a model defined as the mean of the observations. In effect, this means that if a model falls within the unit circle, it exhibits a skill that is greater than the skill obtained if one were to simply use the mean of the observations. The Taylor and target plots are either temporal (displaying model skill at a single station over the study period) or spatial (displaying model skill during a single month over the en-tire set of study stations). In addition, summary diagrams are presented which combine both temporal (examining the seasonal changes at each individual station) and spatial (examining differences across the bay during an individual month) variability.
Model skill was assessed using the hourly model output (daily for CH3D-ICM chlorophyll and nitrate) that was nearest in time to that of the observation and from the grid cell that encompassed the observation location. For months with two observations, each observation was individually matched to the model output and the skill statistics from those comparisons were averaged for that month. The native horizontal resolution and bathymetry of the individual model grids was preserved in the comparison so as not to bias the analysis through varying interpolation methodologies. For stratification variables, the models and observations were interpolated to a 1 m vertical grid that extended only as deep as the individual models' bathymetry or deepest observation in order to preserve the differences in bathymetric grids while allowing for a direct comparison of the observations to the models. Model-data comparisons at the bottom of the water column were not necessarily based on the same depths, since in many cases the modeled bathymetry was shallower (or at times, deeper) than the deepest data point at a given station. In order to avoid issues with extrapolation and/or grid stretching, data at the bottom of the water column were always compared with model estimates from the deepest grid cell provided by each particular model. Model-data comparisons for stratification and mixed layer depths only included stations and times for which stratification was defined to exist in both the observed and simulated fields. An analysis of model skill of the combined temporal and spatial variability of DO at the surface and bottom of the water column, as well as at the observed MLD O , indicates that all models, regardless of biogeochemical complexity or spatial resolution, exhibit a high degree of skill in reproducing observed DO (Fig. 4). Specifically, all models produce DO concentrations at the surface and bottom that have a normalized total RMSD less than 1. The same is true for nearly all models for DO at the observed MLD O . However, most models underestimate observed DO both at the surface and at the MLD O (Fig. 4a). The correlation between the observed and modeled DO is relatively constant with depth ( Fig. 4b), though on average slightly higher at the bottom (0.85) than at the surface (0.80). Further, on average, the models simulate DO at the surface and bottom better than they do at the MLD O . No statistical difference exists between the skill of models that utilize a full biogeochemical component and those that utilize the simple constant-respiration oxygen parameterization. Based on an analysis of variance (ANOVA) comparing the full biogeochemical models to the CRM models, the two model types do not perform differently in terms of their ability to reproduce the combined temporal and spatial variability of bottom DO as measured by total RMSD (p = 0.48). Overall, Model M (the mean of the eight models) consistently performs better than any individual model across all depths examined (Fig. 4).
The monthly temporal variability of bottom DO at each station over the 2 years studied is resolved similarly well by all of the models (Fig. 5a), but the models have difficulty simulating spatial DO variability during each month (Fig. 5b). Due to the stations chosen for this analysis (Fig. 2), the spatial variability being examined here is essentially the north to south variability. Most models exhibit a latitudinal gradient with respect to their skill in reproducing the temporal variability of bottom DO, with models overestimating DO at the more northern stations (Fig. 5a). Some models differ in their ability to reproduce summer (May-September) DO concentrations and winter (October-April) DO concentrations (Fig. 5b). Models B, F, and G all distinctively overestimate mean DO in the summer compared to the winter. In contrast, Models A and C perform similarly well in both seasons (Fig. 5b). In addition, all three constant respiration models, as well as Models D and E, substantially underestimate DO at several stations in the winter.
All eight models generally resolve the pycnocline and oxycline with similar skill (Fig. 6). All models consistently underestimate the mean and standard deviation of the maximum strength of stratification within the pycnocline and oxycline, defined herein as the maximum vertical gradients of density and oxygen (Fig. 6a). All models, except for Model A (see Sect. 4.2), also underestimate the mixed layer depth, regardless of whether it is computed in terms of density or oxygen. (Note that these model symbols in Fig. 6a are located above the y axis despite this negative bias in MLD because the vertical coordinate system is oriented upwards.) Thus, the models are producing stratification that is both weaker than observed and higher (shallower) in the water column. The correlation coefficient for these metrics is low, ranging 0.1-0.6, and indicates that all models are missing the majority of variability associated with the magnitude and location of the pycnocline and oxycline (Fig. 6b). However, there is slightly more consistency and better correlation coefficients among the models for the strength of stratification than the depth of the mixed layers.
All eight models are also characterized by similar skill in representing the temporal and spatial variability of density stratification and MLD ρ (Fig. 7). There is a latitudinal difference in skill of the models in reproducing the magnitude of the pycnocline and MLD ρ , with model skill generally lower at the northern stations (Fig. 7a). Contrary to the pattern shown for bottom DO (Fig. 5b), none of the models exhibit a significant seasonal pattern between summer and winter in reproducing spatial variability of dρ/dz or MLD ρ (Fig. 7b). However, Model A differentiates itself from the rest of the models in its pattern of skill at reproducing the spatial and temporal variability of the MLD ρ (see Sect. 4.2). Temporal and spatial patterns for oxycline stratification (dO/dz) and MLD O closely match those of dρ/dz and MLD ρ (not shown).
All eight models reproduce the variability of bottom DO better than the variables that are generally thought of as being the primary drivers of hypoxic conditions, including stratification (Fig. 6), salinity, chlorophyll, and nitrate (Fig. 8, Table 3). However, all models reproduce patterns in temperature across the bay and through time better than any of the other variables in this model comparison (Fig. 8). All eight models, as well as the Model Mean, are characterized by very low bias in modeled temperature, and correlation coefficients of approximately 0.99; this high skill results from the very strong and predictable seasonal temperature variability.
Even though the five models with full biogeochemical components (Models A, B, C, D, E) are characterized by large differences in their mechanistic approaches to modeling nitrate and chlorophyll, they produce similar total RMSDs for all of the variables examined at both the surface and the bottom (Table 3).
The mean of the eight models (Model M) has a higher model skill (lower RMSD) than any individual model across nearly every variable examined (Table 3). In addition, for nearly all observations at all stations, the 95 % confidence interval of all model hindcasts encapsulates the observed bottom DO concentration (Fig. 9), even though any individual model may overestimate or underestimate observed DO. Models generally fall into greater agreement during the summer, when DO is low, and into lesser agreement in the winter when DO is replete. While this study does not allow for a true interannual comparison, it is interesting that at station CB4.1C the model ensemble closely matches the timing of the drawdown of DO in the spring of 2004 (Fig. 9), whereas it produces a summer rather than spring initiation of hypoxic conditions in 2005. In addition, the model ensemble produces a premature relaxing of hypoxic conditions for both years at this observation station.
In order to better understand the impact of stratification on DO concentrations throughout the water column, the relationship between the observed pycnocline strength and MLD ρ were compared to the observed oxycline strength and MLD O . Observations from 1998 to 2006 demonstrate that while there is not a strong correlation between the strengths of the pycnocline and oxycline, there is a very strong correlation between MLD ρ and MLD O (Fig. 10). Depending on the criteria used for defining the existence of stratification (see Sect. 2.3), the correlation of the pycnocline and oxycline strengths range r 2 = 0.18-0.26 and the correlations of MLD ρ and MLD O range r 2 = 0.51-0.82 (Table 4). Furthermore, correlation of the relationship between the MLD ρ and MLD O is stronger for more severe stratification (Table 4). The relationship between the two mixed layer depths is biased towards the MLD O being located slightly deeper in the water column than the MLD ρ . As the cut-off criteria for the existence of stratification becomes more stringent, the relationship becomes closer to 1 : 1.

How does the skill of various hydrodynamically based DO models compare?
In examining the eight 3-D models in this study, there is not a statistical difference between the ability of simple and complex models to simulate the mean and monthly variability of bottom DO; in addition, models with higher spatial resolution do not necessarily produce better estimates of DO.
Models currently simulating hypoxia throughout Chesapeake Bay compute oxygen concentrations in essentially two distinct ways: they either utilize a simple constant respiration model or a full biogeochemical model. In this study, the relative skill of both types of models is compared. Specifically, in examining results of the comparison between five biogeochemical models (A, B, C, D, E) and three simplistic constant respiration models (F, G, H), the two groups of models performed statistically similar in their skill of reproducing bottom DO concentrations (Fig. 3, Table 3). These results support those of Bever et al. (2013) who compared three constant respiration models with the CBP regulatory model (Model A) and similarly found that all four of the models were equally skillful in terms of reproducing the seasonal variability in bottom DO throughout the bay in 2004 and 2005. Consistent with the results of Scully (2013), this result implies that the seasonal variability of DO in the Chesapeake Bay is primarily dependent on underlying hydrodynamic mechanisms which are nearly identical for all eight models, rather than on aspects related to the biogeochemical cycling which vary dramatically between models and in fact are constant in three of the eight models. It should be noted, however, that the 2 years studied here were relatively wet years and an analysis of dry years may offer different results.
Many previous studies have examined the costs and benefits of adding complexity to biogeochemical models. For example, increasing biogeochemical complexity has been found to improve skill in some biogeochemical data assimilative parameter optimization studies (Friedrichs et al., 2006(Friedrichs et al., , 2007Lehmann et al., 2009;Bagniewski et al., 2011;Ward et al., 2013;Xiao and Friedrichs, 2014). The additional parameters associated with increased complexity generally provide more parameters that are available for additional tuning and subsequent improved model-data agreement. This is in contrast to the results of this analysis demonstrating that increased biogeochemical complexity does not neces-   sarily improve model-data agreement. In this case, the increase in model complexity has likely outpaced the ability of the researchers to fully tune the model to the available observations. However, even past studies that have invoked formal parameter optimization methodologies, such as genetic algorithms and variational adjoint methods (Friedrichs et al., 2007;Ward et al., 2010;Xiao and Friedrichs, 2014), have found that under certain conditions, adding too much complexity does not necessarily improve model skill and in fact can decrease model skill and portability, primarily due to artifacts resulting from overtuning. This mirrors findings from the larger ecosystem modeling community where the best-fit models are often those with intermediate complexity (Fulton et al., 2003). In this study, horizontal grid resolution differed significantly between model implementations, with the most highly resolved grid (Model G) including more than 9 times more grid cells than the lower resolution grids (Table 1). A certain degree of resolution is clearly required to successfully simulate dynamic processes, and a model with 8-10 km resolution will not be able to correctly simulate the hydrodynamic processes within the bay (Feng et al., 2015). However, an increase in horizontal grid resolution from ∼ 1.8 to ∼ 0.6 km, which results in a run-time change of a factor of 9, or possibly of 27 if the time step is accordingly decreased by a factor of 3, does not necessarily result in a significant improvement in simulation skill of either stratification or bottom oxygen. Although not shown here, additional sensitivity experiments with Model G revealed that doubling the vertical resolution of this model had no significant effect on the model's ability to resolve the depth of stratification or the maximum magnitude of stratification. Thus, when selecting the optimal model resolution for a simulation, it is critical to weigh the advantages of increased resolution with the increased time required for simulation. With a given level of computational resources, fewer sensitivity experiments can be conducted with a model using a more highly resolved grid.
Accurately simulating the observed spatial variability of DO (Fig. 4b) was a greater challenge than simulating the temporal variability of DO (Fig. 4a) for all eight models participating in this intercomparison. This is especially true in the winter months when the vast majority of the bay is oxy- gen replete and the models have difficulty representing the observed variability from station to station. The majority of the models tend to slightly overestimate mean bottom DO in the summer whereas multiple models (e.g., Models D, E, F, and G) exhibit a strong negative bias during January and/or February of 2005, primarily at stations in the middle to southern portion of the bay's deep channel. Interestingly, increased biological complexity and higher grid resolution do not completely resolve this issue, as this is true for models utilizing full biogeochemical models (Models D, E) as well as those using highly resolved model grids (Model G). This is likely due to the ephemeral nature of the biological divers of DO.
The strong performance of the constant respiration models implies that these models may be excellent candidates for providing short-term bottom oxygen forecasts. The high DO skill of the CRM models primarily results from the fact that seasonal variations in physical processes (primarily wind mixing and temperature) play a dominant role in controlling the seasonal cycle of oxygen (Scully, 2013). Because the underlying hydrodynamic models all use similar physical forcing, the constant respiration models are able to simulate the seasonal cycle of DO with similar skill as the more complex biogeochemical models. As a result, these simple models that are easier to tune and require less in the way of computational resources than full biogeochemical models, may be efficiently used to produce short-term (on the order of days) DO forecasts. On the contrary, the more complex full biogeochemical models will be necessary for scenario-based and long-term (on the order of months to years) forecasting which requires that models respond to prescribed changes in the biogeochemical environment, such as increased rates of nutrient loading due to changes in land use, land cover, and/or climate.

How does model skill of DO compare to that of the primary drivers of DO variability?
Overall, model DO skill is greater than that of the variables generally considered to drive DO variability, such as stratification, salinity, mixed layer depth, chlorophyll, and nitrate; only modeled temperature has higher skill than modeled DO.
Since dissolved oxygen concentrations in the Chesapeake Bay are controlled by physical processes (e.g., advection, wind mixing, heating/cooling, and stratification), as well as biological processes (e.g., photosynthesis and respiration), it is critical to understand the skill of the models in terms of how well they reproduce the many factors influencing oxygen concentrations. As expected, the five models containing a specific biogeochemical model component had more difficulty simulating the observed chlorophyll and nitrate concen- trations than the physical variables (temperature and salinity), both at the surface (Table 3) and the bottom (Fig. 8).
Replicating the correct location, magnitude, and timing of phytoplankton blooms and nutrient cycling is a complex issue, and as a result, these features are generally not well simulated in the models. While the models generally simulate the total amount of chlorophyll adequately, it is more uniformly spatially distributed in the models rather than in patchy blooms as in nature, leading to the underestimation of chlorophyll variability across all models. Although all models produced a relatively high correlation between observed and modeled temperature and salinity (Fig. 8), the correlation coefficients for chlorophyll and nitrate were much lower. The correlations for observed vs. modeled DO was more similar to that of the physical variables (temperature, salinity) than the biological variables (chlorophyll and nitrate), highlighting that the seasonal variability in bottom DO is regulated I. D. Irby et al.: Multiple modeling low-oxygen waters 2023 more by physical than biological factors. This also explains the success of the constant respiration models, which by definition contain no biological variability yet reproduce DO variability nearly as well as the most complex biogeochemical models.
In this study, model skill was also considerably higher for bottom oxygen than it was for the vertical gradient of stratification and mixed layer depths (Figs. 6, 8). The underestimation of the vertical gradient across all models is largely due to the numerical diffusion that characterizes all of these hydrodynamic models, but may also be partially due to an underestimation of the winds or a lack of diffuse freshwater input around the bay. Even though the models all underestimated the strength of stratification (Figs. 4, 6), modeled stratification in summer was strong enough to prevent mixing with the relatively well-oxygenated surface waters. This result suggests, somewhat surprisingly, that simulating the correct vertical gradient of stratification is not absolutely necessary for skillful bottom DO simulations. Models need only simulate enough stratification to effectively cut off vertical mixing in order to develop an isolated bottom layer that can then experience a draw down in oxygen via respiration. In addition, the models must also correctly simulate the horizontal advection of oxygen (Scully, 2013;Y. Li et al., 2015). The fact that bottom DO is simulated so well by the eight models analyzed here suggests that not only is the advection of oxygen well represented in the models but also the strength of stratification, i.e., the maximum vertical gradients of density and oxygen, produced by these models is sufficient. Thus, although novel and somewhat unexpected, these results are not contradictory to previous studies demonstrating the importance stratification plays in initiating summer hypoxia in the Chesapeake Bay (Murphy et al., 2011).
Model skill in terms of reproducing observed mixed layer depths was likewise much lower than model skill of reproducing observed oxygen concentrations. All models, except Model A, produced mixed layer depths (MLD O and MLD ρ ) that were generally too shallow in the water column (Fig. 6a). Note that Model A is a regulatory model that has been used for many years by the Chesapeake Bay Program, and has thus undergone more extensive calibration aimed at matching the mean salinity and oxygen characteristics of the bay (Cerco and Cole, 1993). Furthermore, Model A employs a z grid that matches the bathymetry in trench areas better than the sigma grids used by the other models. Although Model A produced mixed layer depths that were generally in the correct location within the water column (Fig. 6a), they were too variable (Fig. 6b). This variability may partly be a result of the 1.5 m z grid employed by Model A causing large jumps between vertical grid cells and hence resulting in overestimates of MLD variability. All other models use sigma grids typically with more highly resolved vertical resolution at the depth of maximum stratification.
The two variables for which the models have greatest skill are DO and temperature (Fig. 8). This is because oxygen variability is driven primarily by seasonal variability in physical processes such as solubility and wind mixing and to a lesser degree by variability in oxygen consumption (Scully, 2013). As a result, the models using a constant mean respiration rate produce as realistic hypoxia simulations as the biogeochemically complex models. Observations clearly show this strong seasonal variability in bottom DO (Fig. 11a) and, to a slightly lesser extent, clear seasonal variability in DO at the bottom of the bottom of the oxygen mixed layer (MLD O ; Fig. 11b). However, a seasonal cycle is not manifested in the MLD O itself (Fig. 11c). The lack of such a strong seasonal cycle in the observed mixed layer depths makes this a more difficult variable for the models to simulate. As a result, the models can relatively skillfully simulate the combined spatial and temporal variability of DO while simultaneously missing the MLD O .

Why is it important for DO models to simulate the MLD O correctly?
Most of the aerobic habitat in the bay during the summer is located above the MLD O , thus it is critical for living resource managers to use models that accurately simulate this variable.
On average, the models miss the observed depth of the MLD O by 3.4 m, which equates to roughly a 60 % error in the modeled mixed layer depths. While the models have difficulty simulating the MLD O throughout the entire year (Figs. 6, 7b), the summer months are when the mismatch has the greatest potential to impact the available habitat for oxygen-dependent species. Each year during this time period low-oxygen waters occupy nearly the entire water column below the mixed layer. At station CB4.1C, a representative mesohaline deep trough station, the contours of low-oxygen (5 mg L −1 ) and hypoxic (2 mg L −1 ) waters are located just below the MLD O from late spring until late fall (Fig. 12). The severe depletion of oxygen below the mixed layer compresses the habitable space at this station to roughly 10 m (from a maximum of 32 m) during the annual low-oxygen event.
The impact of habitat compression can be substantial, as many bay species require DO concentrations well above the traditional hypoxic threshold (USEPA, 2010). While not all of the main stem stations develop hypoxic water each year, most mesohaline stations experience a dramatic drawdown of oxygen to levels during the summer that effectively remove a large portion of the bay from habitable space (Murphy et al., 2011;Schlenger et al., 2013). Studies have shown that some species modify their behavior based on the oxycline depth, which acts to constrict the habitable space in the water column (Prince and Goodyear, 2006;Pierson et al., 2009;Elliott et al., 2013). Since species can be negatively impacted by low-DO concentrations as high as 5 mg L −1 (Breitburg, 2002;Vaquer-Sunyer and Duarte, 2008;USEPA, 2010), the location of the oxycline is not only important for habitat compression in the summer months but can also be important in the winter months when an occasional lack of vertical mixing can substantially decrease bottom DO concentrations. Furthermore, in order to accurately estimate hypoxic volume, models must correctly simulate the depth of the mixed layer, since the MLD O closely follows the depth of the 2 mg L −1 contour.

How can DO simulations in the bay be improved for management of water quality and living resources?
To better simulate DO conditions and summer habitat compression due to low-DO water, simulations of the depth of the top of the pycnocline (MLD ρ ) must be improved.
Although the suite of models examined reproduce DO concentrations relatively well overall (Fig. 4), the models typically overestimate summer habitat compression by producing low DO concentrations too high in the water column (Fig. 6). Observations from the Chesapeake Bay Program show a strong correlation between the depths of the oxygen and density-defined mixed layers (Fig. 10b). The models analyzed here also clearly exhibit a close relationship between their skill in simulating the depths of the oxygen and density-defined mixed layers (Fig. 6). These strong relationships between the depths of the oxygen and density-defined mixed layers result from the fact that the pycnocline represents the physical barrier that leads to the development of the oxycline. Therefore, the inability of the models to accurately simulate habitat compression is an artifact of their lack of skill in simulating the depth of the density-defined mixed layer. In contrast, the strength of density stratification is not well correlated to the strength of oxygen stratification. This is because a relative wide range of intensities of density stratification is still sufficient to cut off vertical mixing, leading to the observed draw-down in bottom DO. Thus, even though all models underestimate the strength of the pycnocline, they still produce enough stratification to greatly reduce mixing. The results from this paper thus indicate that to further improve DO simulations and better estimate summertime habitat compression, it is even more critical for models to accurately simulate the depth of the top of the pycnocline than to accurately simulate the absolute strength of the pycnocline.

What is the utility of the multi-model ensemble and Model Mean?
The multi-model ensemble approach allows for the development of a model mean, which taken as its own model, is the most skilled model when examining the combined suite of variables analyzed in this study.
The model skill assessment presented here demonstrates that the average of all eight models, or five models in the case of chlorophyll and nitrate, does better than any individual model if looking across the suite of variables analyzed. This finding is similar to that of other studies that examined the value of the Model Mean from a multi-model ensemble (e.g., Gneiting and Raftery, 2005;Hagedorn et al., 2005). While the concept of using a multi-model ensemble has been most extensively employed by atmospheric, climatic, and global circulation modelers, such as the Intergovernmental Panel on Climate Change (e.g., Collins et al., 2013), the tool's utility for aquatic ecosystem modeling is gaining traction (Meier et al., 2012;Trolle et al., 2014;Janssen et al., 2015). As models are increasingly used in regulatory decisions regarding aquatic ecosystems, a cohort of similarly skilled models can be used to help inform a set of confidence bounds around an environmental forecast. Due to the restrictions placed on models used in regulatory actions, utilization of a multi-model ensemble may not be realistic for all environmental and resource managers; however, multiple models can be integrated into the decision-making process even when the ultimate decision must be based on a single model. For example, a confidence interval plot could help identify where regulatory model output might be acting out of sync with other skilled water quality models of the same system, thereby informing managers of the potential shortfalls associated with the regulatory model. Furthermore, if the models tend to be predicting similar DO concentrations, a cohort of models could enhance the confidence in regulatory decisions based on a single regulatory model (Friedrichs et al., 2012;Weller et al., 2013). Comparing multiple models can also help inform how to better improve models in the future, as this study has aimed to do.

Conclusions
All models analyzed here exhibited a high degree of skill in simulating dissolved oxygen concentrations within the main stem of the Chesapeake Bay in 2 years corresponding to relatively wet and average years. Their high skill results from the fact that physical processes (e.g., solubility, wind-mixing, and advection) exert a first order influence on the seasonal cycle of oxygen. As a result, the models' ability to reproduce dissolved oxygen concentrations is independent of the complexity of the biogeochemical parameterizations: the simplest constant respiration models were found to reproduce observed oxygen concentrations as well as the most biologically complex models. Essentially, all models are equally capable of respiring most of the available oxygen in the lower water column during summer.
This study also suggests that for use as management tools for water quality and living resources, it is more critical for these models to adequately resolve the depth of the mixed layer than the absolute strength of stratification (as long as modeled stratification is strong enough to limit vertical mixing). This is critical because observations show that during warmer months, oxygen-depleted water fills the water column to where stratification limits further mixing, which effectively cuts off waters below the mixed layer for use by the majority of the Chesapeake Bay's most recognized and valued living resources. These results furthermore suggest that modelers should focus their efforts on improving the hydrodynamics of their models in an effort to improve simulations of mixed layer depth dynamics and variability.
These findings have significant ramifications for shortterm bottom DO forecasts, which may be successful with very simple oxygen parameterizations embedded in hydrodynamic models. In contrast, scenario-based water quality forecasts are likely to benefit from more complex models, which must adequately reproduce the longer-term response of the oxygen field to changes in nutrient and organic matter loads.
This study also helps to demonstrate how multiple community models from governmental agencies and academic institutions may be used together to provide a model mean and a set of confidence bounds for regulatory model results that could be used to inform management decisions.

Data Availability
Observations used in this analysis can be downloaded from the Chesapeake Bay Program's Water Quality Database website at (http://www.chesapeakebay.net/data/downloads/ cbpwaterqualitydatabase1984present). Model output for the individual stations examined in this analysis can be obtained by contacting Marjorie Friedrichs (marjy@vims.edu) or downloaded from the Coastal & Ocean Modeling Testbed -Estuarine Hypoxia THREDDS server (http://comt.sura. org/thredds/catalog/comt2/cb_hypoxia/catalog.html).