1. Introduction
The water quality of Saemangeum Lake has been deteriorating through eutrophication since the Saemangeum Seawall (33 km) was completed in April 2006: nutrient-rich freshwater flows in from nearby industrial complexes, while seawater exchange is insufficient. Seawater is exchanged through the Sinsi Gate and Garyeok Gate installed on the seawall. The resulting environment is suitable for phytoplankton growth, and there have been cases of phytoplankton blooms that exceed the algae management standards. Damage from the green and red tides caused by these blooms is a concern.
Several studies have examined the distribution of phytoplankton in Saemangeum Lake. Kim et al. (2009) reported changes in phytoplankton communities and distinct seasonal cycles due to semi-diurnal tidal coupling in the lower section of the Mankyeong River before the construction of the Saemangeum Seawall (1999-2000). Jang et al. (2009), based on surveys at a fixed station near Mankyeong Bridge immediately after the completion of the seawall (2006-2007), reported fewer species and higher phytoplankton abundance than in previous studies. Yeo (2010) monitored phytoplankton biomass, which is the core of the green and red tide problem in the study area, in terms of abundance (cells/mL) over a long period (2001-2010), and examined temporal and spatial variability by dividing the study area into rivers, artificial lakes, and sea. Frequent algal blooms were reported in the streams flowing into Saemangeum Lake, and the planned waters of Mankyeong Lake and Dongjin Lake experienced rapid changes in phytoplankton abundance due to changes in freshwater and seawater inputs and to seasonal changes (Yeo, 2012). Despite these studies, research on predicting and analyzing phytoplankton in relation to seawater exchange is lacking.
Various water quality problems, including harmful algal blooms, occur worldwide in river-type lakes when sufficient nutrients are supplied while water temperature and light conditions are suitable for algal growth. Direct problems caused by massive algal proliferation include the toxicity of species such as cyanobacteria (Codd et al., 2005; Lehman et al., 2005), increased production of volatile organic compounds (VOCs) by algae, which gives water supplies a bad taste (Watson, 2004), clogging of filters by diatoms (Jun et al., 2001), and threats to human health and aesthetic impacts due to toxins (Lee et al., 2013; Dencheva, 2010; Li et al., 2011).
In Korea, the algae warning system was piloted in Daecheong Lake in 1996 and expanded to 22 lakes nationwide in 2012 (Lee et al., 2012a), and since 2012, a water quality forecasting system has been implemented for the main stretches of the four major rivers for the purpose of proactive water quality management by predicting short-term changes in water temperature and Chl-a concentration (Lee et al., 2012b). Since May 2020, an algae prediction system has been implemented and operated by integrating the water quality forecasting system and the algae warning system to predict changes in water quality and algae outbreaks in public waters (MOE Order No. 1456).
To prevent such problems, a number of studies have addressed phytoplankton prediction in rivers and lakes. Internationally, Recknagel et al. (1994) built an algal bloom prediction model using 12 years of observed water quality data as input to an artificial neural network; Wilson and Recknagel (2001) built a similar model from 12 years of observed water quality data and validated it; and Karula et al. (2000) built a neural network eutrophication model with a Levenberg-Marquardt (tangent-sigmoid) structure to analyze and predict Chl-a from various water quality factors. Singh et al. (2009) built back-propagation neural network (BPNN) models to predict DO and BOD for water quality management in rivers.
In Korea, Ahn et al. (2001) performed monthly water quality predictions for DO, BOD, and TN at the Gongju branch of the Geumgang River basin using a BP-algorithm neural network model and examined its applicability by comparing it with an ARIMA model. Oh et al. (2002) built an optimal water quality prediction model through monthly predictions for each water quality element, using a BP-algorithm neural network model with DO, BOD, TN, and TP data from the Yeongsan River basin. Lee and Seo (2002) made monthly predictions of BOD, TN, and TP concentrations using the WASP5 model to identify the effects on the inflow water quality of Daecheong Lake. Park and Ha (2003) used a Genetic Algorithm and Neural Network (GANN) to predict the monthly DO, BOD, TN, and TP concentrations at the Naju branch of the Yeongsan River, and Cho et al. (2004) used a BP-algorithm neural network model to predict BOD, TN, TP, and TOC concentrations in real time in the Naesacheon and Pyeongchang River basins within the Chungju Lake basin. Ahn et al. (2000) used a BP-algorithm neural network model to build an intelligent monthly water quality prediction model from the water quality data of the Dalcheon branch of the Han River basin and verified its applicability. Oh et al. (2008) developed a daily prediction model for runoff, TOC, and TOC load at the Naju branch of the Yeongsan River basin using a BPNN model. Algae simulation techniques based on chlorophyll-a concentration and cell counts by algal species have also been developed and applied to Lake Uiam in the midstream of the Bukhan River (Choi et al., 2015). However, these approaches are difficult to apply to Saemangeum, where seawater is exchanged through gates.
Meanwhile, Park et al. (2023) used taxonomic statistics to show that salinity, including phototrophic salinity, is linked to the presence of phytoplankton. They therefore inferred that the likelihood of algal blooms could be affected by shifts in salinity caused by operation of the drainage gates.
This study does not aim to predict the abundance or biomass of phytoplankton quantitatively. Instead, it uses a classification approach to predict the probability of an algal bloom. Evaluating that probability over a range of values of each controlling factor makes it possible to calculate the salinity at which algal blooms are inhibited. Algal blooms are also strongly influenced by nutrients, so predicting bloom probability with machine learning algorithms likewise makes it possible to calculate the nutrient concentration at which blooms are suppressed. In summary, this study aims to propose the most effective and efficient bloom suppression measures for each phytoplankton species at each point in Saemangeum, based on scientific prediction techniques.
2. Materials and Methods
2.1 Algal bloom control model design
The model is designed to accommodate data accumulated in the future. Data are collected in real time or intermittently, preprocessed, and stored in a data archive. The user selects a target species and a training dataset for predicting it. The model predicts the probability of an algal bloom, which is then calibrated according to the model's confidence for the target species. Once the confidence has been established, the variables to be controlled are selected and the quantitative level of each variable that suppresses algal growth is calculated. Finally, a decision is made on whether to intervene, and the variable is controlled according to the result.
In this study, data observed in 2021 were preprocessed and stored in the data archive, and a training dataset was created to fit the model with an artificial neural network algorithm. The training dataset consists of 2,556 rows and 45 columns, including station, observation date, water quality, species abundance, month, temperature, precipitation, insolation, and evaporation. The model was fitted for each species, and the value of each target variable (DIN, salinity) that reduces the probability of a phytoplankton bloom was predicted by substituting values of the explanatory variables and the target variables into the fitted model.
2.2 Data and preprocessing
The data used in the model were observed once or twice a month from January to November 2021. They comprise regular monthly surveys (January 25, February 22, March 24, April 26, May 19, June 29, July 13, August 9, September 8, October 13, and November 1) and additional surveys during summer rainfall (August 29 and September 30). A total of seven observation locations (Fig. 2) were selected based on the water quality measurement network points in Saemangeum Lake, which are being investigated by the Jeonbuk Provincial Environment Agency.
An Ocean Seven 310 CTD from Idronaut (Italy) was used for the observations, and the specifications of the instrument are presented in Table 1.
The target phytoplankton are Skeletonema spp., Cyclotella atomus, Stephanodiscus, Chaetoceros spp., and Phormidium tenue. To compensate for the lack of data prior to model design, piecewise cubic Hermite polynomial interpolation was used, which follows the shape of the data well while suppressing exaggerated values (overshoot) as much as possible. If the piecewise cubic polynomial is P(x), then in a two-dimensional coordinate system consisting of (x, y), hk and δk are defined as follows.
In addition, the slope of P(x) at xk can be expressed as dk = P'(xk), and if x is in the range xk ≤ x ≤ xk+1, the cubic polynomial P(x) can be expressed in terms of the local variables s = x - xk and h = hk as follows.
As above, the cubic polynomial P(x) expressed in terms of s, and hence x, is called a piecewise cubic Hermite interpolation polynomial. The above equation requires four interpolation conditions, which are given by two function values and two derivative values at the ends of the interval as follows.
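For reference, the quantities described in this subsection can be written in the standard notation for piecewise cubic Hermite interpolation. The symbols below ($y_k$ for the data value at $x_k$, $d_k$ for the slope at $x_k$, and the local variables $s$ and $h$) follow the usual textbook presentation and are assumed here rather than taken from the study's own equations.

```latex
% Interval length and first divided difference
h_k = x_{k+1} - x_k, \qquad \delta_k = \frac{y_{k+1} - y_k}{h_k}

% Cubic on x_k <= x <= x_{k+1}, with s = x - x_k and h = h_k
P(x) = \frac{3hs^2 - 2s^3}{h^3}\, y_{k+1}
     + \frac{h^3 - 3hs^2 + 2s^3}{h^3}\, y_k
     + \frac{s^2 (s - h)}{h^2}\, d_{k+1}
     + \frac{s (s - h)^2}{h^2}\, d_k

% Four interpolation conditions: two function values, two derivative values
P(x_k) = y_k, \quad P(x_{k+1}) = y_{k+1}, \quad
P'(x_k) = d_k, \quad P'(x_{k+1}) = d_{k+1}
```

Setting $s = 0$ and $s = h$ in $P(x)$ and in its derivative reproduces the four conditions directly. In the shape-preserving variant of this interpolant, the slopes $d_k$ are chosen from the neighbouring $\delta_k$ values so that the curve does not overshoot the data, which matches the behaviour described above.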
2.3 Prediction of phytoplankton overgrowth potential
2.3.1 Summary
Because quantitative prediction of algal abundance has limitations, efficiency and accuracy can be improved by reducing the problem to whether or not algae proliferate. The response variable is therefore qualitative (categorical) rather than quantitative (continuous), and this study used a classification algorithm, the type of machine learning algorithm that predicts qualitative variables.
This study adopted an artificial neural network algorithm. Because little data has been accumulated so far, we focused on the design of the model and did not separate the data into training and test sets. An artificial neural network is a machine learning algorithm inspired by biological neurons. A multilayer artificial neural network generally has three kinds of layers, an input layer, a hidden layer, and an output layer, each composed of nodes. The input layer consists of supply neurons that receive the values of the predictor variables used to derive a predicted value; if there are n input values, the input layer has n nodes. The hidden layer consists of computational neurons: each receives the values from the input nodes, calculates a weighted sum, applies a transfer function to it, and passes the result to the output layer. For a single input signal x and output y, this can be written as y = wx + b, where w is a weight and b is a bias. A general artificial neuron with n inputs is expressed as follows.
An artificial neural network uses an activation function to convert the sum of the input signals into an output signal. This study used the ReLU (Rectified Linear Unit) function, which is now widely used.
In this study, the number of hidden nodes was set to 20, and the weights were initialized to 0 for consistency in prediction.
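The computation performed by such a network can be sketched in a few lines of Python. This is a minimal illustration, not the fitted model: the function names, the sigmoid output unit, and the toy weights in the usage example are assumptions introduced here. Only the weighted sum plus bias and the ReLU hidden activation come from the text.

```python
import math


def relu(z):
    """Rectified Linear Unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes a real number into (0, 1); assumed here for the output node."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, weights, bias):
    """Weighted sum of n inputs plus a bias: w1*x1 + ... + wn*xn + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with ReLU, one output node read as a probability.

    w_hidden[j] holds the weights from every input to hidden node j.
    """
    hidden = [relu(neuron(x, w_j, b_j)) for w_j, b_j in zip(w_hidden, b_hidden)]
    return sigmoid(neuron(hidden, w_out, b_out))
```

With two inputs and two hidden nodes, `forward([1.0, 2.0], [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], [1.0, 2.0], -3.0)` gives hidden activations of 0 and 1.5, an output sum of 0, and therefore a probability of 0.5. The network described in this study would use six inputs and 20 hidden nodes in the same way.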
2.3.2 Determination of explanatory and response variables
Phytoplankton can proliferate under the influence of physical factors such as water temperature and salinity, chemical factors such as nutrients and trace elements, and biological factors such as symbiosis and predation pressure (Kim et al., 2018). Therefore, water temperature, salinity, and nutrients (DIN, DIP) are the most basic factors to be considered.
Since insolation affects the photosynthesis of phytoplankton and rainfall determines the transport of nutrients in lakes such as Saemangeum, these two meteorological factors were included. Because bloom control requires variables that can be monitored in real time, or observed by an equivalently quick and simple method, biological factors were excluded. The explanatory variables were therefore the environmental factors water temperature and salinity, the nutrients DIN and PO4-P, and the meteorological conditions insolation and rainfall. For rainfall, the total over the 24 hours preceding the observation was used.
The response variable is categorical. Because simpler categories make the model more efficient and improve prediction accuracy, it was reduced to two classes, Normal and Caution. A sample was classed as Caution when the algal abundance was 1,000 cells/mL or more. The prediction targets are Skeletonema spp., Cyclotella atomus, Stephanodiscus, Chaetoceros spp., and Phormidium tenue.
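The two preprocessing rules in this subsection, the Caution threshold and the 24-hour rainfall total, are simple enough to sketch directly. The function names and the assumption that rainfall is available as an hourly series ending at the observation time are illustrative; only the 1,000 cells/mL threshold and the 24-hour window come from the text.

```python
CAUTION_THRESHOLD = 1000.0  # cells/mL, as stated in the text


def label_bloom(abundance_cells_per_ml, threshold=CAUTION_THRESHOLD):
    """Return 'Caution' at or above the threshold, otherwise 'Normal'."""
    return "Caution" if abundance_cells_per_ml >= threshold else "Normal"


def rainfall_prev_24h(hourly_mm):
    """Sum the last 24 hourly rainfall values (mm).

    Assumes the series is ordered in time and ends at the observation time.
    If fewer than 24 values are supplied, all of them are summed.
    """
    return sum(hourly_mm[-24:])
```

For example, `label_bloom(1000)` returns `"Caution"` and `label_bloom(999.9)` returns `"Normal"`, since the threshold itself counts as a bloom.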
2.3.3 Performance indicators of the model
As shown in Table 2, when each cell of the confusion matrix is defined as a, b, c, and d, the definition of each performance indicator is as follows.
Kappa is a statistical metric that measures the agreement between actual and predicted classes; a value of 0 indicates agreement no better than chance and a value of 1 indicates perfect agreement. Intuitively, the Kappa coefficient corrects the observed agreement for the probability that the actual and predicted values match by chance, and a common interpretation of the Kappa coefficient is shown in Table 3. Balanced accuracy is the average of Sensitivity, the proportion of actual positives that are correctly predicted, and Specificity, the proportion of actual negatives that are correctly predicted. N.I.R. (No Information Rate) is the accuracy obtained by always predicting the most frequent class (the negative class when blooms are rare), so Accuracy should be higher than N.I.R.
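The indicators named above can be computed from the four cells of a two-class confusion matrix as in the following sketch. Because the layout of Table 2 is not reproduced here, the cells are named explicitly (true positives, false negatives, false positives, true negatives) instead of a, b, c, and d; that naming and the function name are assumptions. The formulas themselves are the standard definitions.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard two-class metrics from confusion-matrix counts.

    Assumes both classes occur at least once, so no denominator is zero.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # actual positives predicted as positive
    specificity = tn / (tn + fp)   # actual negatives predicted as negative
    balanced_accuracy = (sensitivity + specificity) / 2
    # Agreement expected by chance, from the row and column totals
    p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    # Accuracy of always predicting the most frequent actual class
    no_information_rate = max(tp + fn, fp + tn) / n
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
        "no_information_rate": no_information_rate,
    }
```

As a hypothetical example, 40 true positives, 10 false negatives, 5 false positives, and 45 true negatives give an accuracy of 0.85, a balanced accuracy of 0.85, a chance agreement of 0.5, and therefore a Kappa of 0.7, against a No Information Rate of 0.5.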
2.4 Variable importance
The model first identifies the importance of each explanatory variable and then repeatedly calculates the probability of an algal bloom while linearly increasing or decreasing the values of the more important variables. The importance of an explanatory variable can be estimated from the connection strengths between the input nodes and the hidden nodes using the Garson algorithm (Garson, 1991). If the input is i, the output is o, and the relative importance is R, Garson's algorithm is as follows.
Here, ni is the number of input nodes, nh is the number of hidden nodes, and no is the number of output nodes. wji is the weight between the input node i and the hidden node j, and woj is the weight between the hidden node j and the output node o.
The relative importance of the explanatory variables is identified in this way, and the controllable factors among them are used as control variables. In this model, DIN and salinity were used as control factors.
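The weight-based importance calculation can be sketched as follows for a network with one hidden layer and one output node. This implements one commonly cited form of Garson's algorithm, in which each input's share of the absolute weights into a hidden node is scaled by the absolute hidden-to-output weight and then normalized over all inputs. Whether this matches the exact variant used in the study is an assumption, as are the function name and the data layout.

```python
def garson_importance(w_input_hidden, w_hidden_output):
    """Relative importance of each input for a single-output network.

    w_input_hidden[j][i] is the weight from input i to hidden node j.
    w_hidden_output[j] is the weight from hidden node j to the output.
    Returns a list of importances that sums to 1.
    """
    n_inputs = len(w_input_hidden[0])
    scores = [0.0] * n_inputs
    for w_j, w_oj in zip(w_input_hidden, w_hidden_output):
        total = sum(abs(w) for w in w_j)
        if total == 0.0:
            continue  # a hidden node with no input weight contributes nothing
        for i, w in enumerate(w_j):
            # share of input i in hidden node j, scaled by the output weight
            scores[i] += abs(w) / total * abs(w_oj)
    grand_total = sum(scores)
    if grand_total == 0.0:
        raise ValueError("all weights are zero; importance is undefined")
    return [s / grand_total for s in scores]
```

With the toy weights `[[1, -3], [2, 2]]` and `[2, -1]`, the first input receives a score of 0.5 + 0.5 and the second 1.5 + 0.5, so the relative importances are 1/3 and 2/3. Only magnitudes are used, so the result ranks variables by influence but says nothing about the direction of their effect.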
2.5 Initial conditions for species-specific prediction
Among the observations with high abundance, the conditions on an observation day for which the model predicted Caution were set as the initial conditions (Table 4). Starting from these conditions, DIN and salinity were increased or decreased in the specified direction (dir.) at regular intervals (int.) between the minimum (min.) and maximum (max.) values listed in Table 5, and the possibility of a phytoplankton bloom was predicted at each step.
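The stepping procedure described here amounts to a one-variable scenario sweep: hold the initial conditions fixed, move one control variable by a constant interval, and record the predicted bloom probability at each step. The sketch below assumes a callable `predict_proba` that maps a dictionary of conditions to a probability; that callable, the function names, and the 0.5 cut-off are all hypothetical stand-ins for the fitted model and its decision rule.

```python
def sweep_control_variable(predict_proba, initial, name, start, stop, step):
    """Vary one variable from start to stop in fixed steps.

    All other conditions in `initial` are held constant. Returns a list of
    (value, probability) pairs in the order they were evaluated.
    """
    n_steps = int(round(abs(stop - start) / step))
    direction = 1.0 if stop >= start else -1.0
    results = []
    for k in range(n_steps + 1):
        value = start + direction * k * step
        scenario = dict(initial, **{name: value})  # copy with one change
        results.append((value, predict_proba(scenario)))
    return results


def first_value_below(results, threshold=0.5):
    """First swept value whose probability drops below the threshold."""
    for value, probability in results:
        if probability < threshold:
            return value
    return None
```

A usage example with a toy stand-in for the model, `toy = lambda c: min(1.0, c["DIN"] / 4.0)`: calling `sweep_control_variable(toy, {"DIN": 4.0, "Salinity": 1.9}, "DIN", 4.0, 0.0, 1.0)` evaluates DIN at 4, 3, 2, 1, and 0, and `first_value_below` on the result returns 1.0, the first level at which the toy probability falls under 0.5.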
3. Results and Analysis
3.1 Artificial Neural Network Algorithm Fit Result
Fig. 3 shows the fitting result of this neural network model, which consists of 6 input nodes, 20 hidden nodes, and 1 output node. Each input node, hidden node, and output node is connected to a network with a weight, which is expressed as the connection strength. In this study, the ReLU (Rectified Linear Unit) function was used as the activation function to convert the sum of input signals into output signals.
Meanwhile, Fig. 4 is the confusion matrix showing how well the fitted model predicted caution and normal. Table 6 shows the performance metrics of the model calculated based on the confusion matrix. The balanced accuracy of the fitted model is 0.9014, 0.8980, 1.0000, 1.0000, and 0.9330 for Skeletonema spp., Cyclotella atomus, Stephanodiscus, Chaetoceros spp. and Phormidium tenue, respectively. In addition, Kappa values ranged from 0.7889 to 1.0000, indicating good or excellent agreement.
3.2 Importance of explanatory variables
The Garson algorithm was used to determine the importance of each explanatory variable. The results are shown in Fig. 5.
3.2.1 Skeletonema spp.
The initial conditions under which Skeletonema spp. proliferated in large quantities are given in Table 4. The importance of the variables was in the order PO4 > Salinity > DIN > Solar Radiation > Water Temperature > Rainfall. However, as shown in Table 4, PO4-P was at a very low level whenever any of the species proliferated in large quantities, so controlling it is not meaningful. Therefore, the probability of mass growth was calculated as a function of DIN and salinity.
In addition, according to Park et al. (2023), mass growth of Skeletonema spp. is suppressed when there is no influx of salt and occurs when salt is introduced, so salinity was varied in the decreasing direction.
3.2.2 Cyclotella atomus
When Cyclotella atomus proliferated in large quantities, salinity was about 1.879 ppt, which is close to fresh water, DIN was 4.149 mg/L, and PO4-P was 0.025 mg/L. The importance of the variables was in the order PO4 > DIN > Salinity > Water Temperature > Insolation > Rainfall. Salinity was varied in the increasing direction, and because DIN was very high, limiting its inflow is expected to strengthen the inhibition of phytoplankton growth.
3.2.3 Stephanodiscus
At the time of the Stephanodiscus bloom, salinity was about 10.5000 ppt (brackish water conditions), DIN was 10.1640 mg/L, and PO4-P was 0.0100 mg/L. The order of importance of the variables was Water Temperature > DIN > Salinity > Insolation > PO4 > Rainfall. Salinity was varied in the increasing direction, and because the DIN concentration was very high, limiting it is expected to strengthen the inhibition of phytoplankton proliferation.
3.2.4 Chaetoceros spp.
When Chaetoceros spp. proliferated in large quantities, salinity was about 24.110 ppt (brackish water conditions), DIN was 0.626 mg/L, and PO4-P was 0.004 mg/L. The importance of the variables was in the order DIN > Water Temperature > Rainfall > Salinity > Solar Radiation > PO4. Salinity was varied in the decreasing direction.
3.2.5 Phormidium tenue
At the time of the Phormidium tenue bloom, salinity was about 0.1319 ppt (freshwater conditions), DIN was 3.1288 mg/L, and PO4-P was 0.0050 mg/L. The order of importance of the variables was PO4 > DIN > Salinity > Water Temperature > Insolation > Rainfall. Salinity can be increased by opening the gates, and because DIN was very high, restricting its inflow is expected to inhibit phytoplankton growth.
3.3 Algal bloom control model prediction result
For each species, we quantitatively predicted the levels of salinity and DIN that should be maintained to inhibit phytoplankton blooms, starting from randomly selected conditions under which blooms occurred (Fig. 6). Although PO4 is an important variable, it was excluded from the calculation because it is present at too low a level (less than 0.1 mg/L); the nutrient effect was therefore evaluated as the change in bloom probability when DIN was increased or decreased.
3.3.1 Skeletonema spp.
For Skeletonema spp., the probability of mass proliferation decreased from about 63.3% to 49.9% when DIN was lowered from 0.634 mg/L to 0.130 mg/L. Mass proliferation was predicted to be inhibited when salinity was between 6.039 and 8.439 ppt and below 1.839 ppt.
3.3.2 Cyclotella atomus
For Cyclotella atomus, lowering DIN from 4.149 mg/L to 0.165 mg/L reduced the probability of mass proliferation from about 100.0% to 7.8%. Mass proliferation was predicted to be inhibited at salinities above about 4.379 ppt.
3.3.3 Stephanodiscus
For Stephanodiscus, lowering DIN from 10.164 mg/L to 8.364 mg/L reduced the probability of mass proliferation from about 100.0% to 50.0%. Mass growth was predicted to be inhibited at salinities above about 19.5 ppt.
3.3.4 Chaetoceros spp.
For Chaetoceros spp., the probability of mass proliferation decreased from about 99.9% to 40.2% when DIN was lowered from 0.626 mg/L to 0.296 mg/L. Mass proliferation was predicted to be inhibited at salinities below about 22.310 ppt.
3.3.5 Phormidium tenue
For Phormidium tenue, lowering DIN from 3.129 mg/L to 0.420 mg/L reduced the probability of mass growth from approximately 100.0% to 0.0%. Mass proliferation was predicted to be inhibited at salinities above about 2.932 ppt.
4. Conclusion
Using an artificial neural network algorithm, we predicted the probability of blooms for each phytoplankton species and estimated the levels of DIN and salinity needed to suppress them, which provides a basis for efficient and effective countermeasures against phytoplankton blooms. However, with only one year of observations the reliability of the model is limited, and a more sophisticated model can be built as additional data accumulate. The phytoplankton bloom control model is expected to contribute to predicting and warning of phytoplankton blooms in large artificial lakes such as Saemangeum, and to suppressing them efficiently.