1.Introduction
Maritime traffic flow is affected by traffic volume, tidal currents, wave height, and other factors. Analyzing maritime traffic flow is important for evaluating the hazard of each route and the probability of collision. Estimating the probability density function (pdf) of the traffic distribution is therefore crucial to enhancing the safety of maritime traffic.
In previous research, Silveira et al.(2013) studied collision probability and traffic patterns off the coast of Portugal, but they only drew histograms of navigation speed and location distribution and calculated the traffic volume. Giuliana et al.(2013) detected anomalies by applying kernel density estimation to traffic density off the Italian coast. Fangliang et al.(2012) analyzed elements such as navigation speed and traffic distance in waterways of the Netherlands and Shanghai and fitted them with normal and log-normal distribution functions. Qiang et al.(2014) analyzed traffic characteristics by fitting navigation speed in a Singaporean channel with beta and Weibull distributions. Liu et al.(2013) examined traffic flow using normal and exponential distribution functions derived from the distributions of traffic time and speed.
Some studies estimated the collision probability for ships in head-on or passing situations by assuming a normal distribution (Fujii et al., 1974). In addition, the proximity to a hazard, as defined by AASHTO (American Association of State Highway and Transportation Officials) and in the regulations on maritime traffic safety audits, has been calculated from the navigation distance to estimate the collision probability with a normal distribution function (Yim, 2010; Yim and Kim, 2010; AASHTO, 2014). Studies usually assume that the probability density function of vessel traffic is a normal distribution function. However, complex distributions such as multimodal ones are very hard to estimate with a normal distribution function alone, and errors may also arise when a distribution function other than the normal is used.
The Gaussian Mixture Model (GMM), which combines multiple normal distributions, is very useful for analyzing complex distributions such as multimodal ones. The GMM has been used as an analysis tool in fields such as biology, economics, business administration, physics, astronomy, and engineering. In particular, it is widely used to estimate the probability density function of multivariate data (Ravindra et al., 2010; Gonzalez-Longatt et al., 2012). This study adopts the GMM to examine the frequency distribution of vessels and to estimate its parameters.
2.
2.1.Scope of Study Area
This study used AIS observation data collected at Mokpo port over one year, from January 1 to December 31, 2013. As shown in Fig. 1, the study area is Mokpogu, through which vessels pass.
2.2.Procedure of Study
The procedure of this study is shown in Fig. 2. Vessels were classified as entering or departing, the average vessel position (34.7656°N, 126.2926°E) was set as the center point, and the distance between each vessel's location and the center point was then calculated.
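The center-point step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the haversine formula, the Earth radius of 6,371,000 m, the function names center_point and haversine_m, and the sample fixes are all assumptions, since the text does not state how the distance was computed.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius; not given in the source


def center_point(positions):
    """Return the arithmetic mean (lat, lon) of a list of (lat, lon) pairs in degrees."""
    n = len(positions)
    return (sum(lat for lat, _ in positions) / n,
            sum(lon for _, lon in positions) / n)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical AIS fixes (lat, lon); real values would come from the AIS records.
fixes = [(34.7650, 126.2920), (34.7660, 126.2930), (34.7658, 126.2928)]
center = center_point(fixes)
distances = [haversine_m(lat, lon, center[0], center[1]) for lat, lon in fixes]
```

The distances returned here are unsigned. The sample data used later take both positive and negative values, so a sign convention (which side of the center point a vessel lies on) would also be needed; the text does not specify it, and it is left out of this sketch.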
To test goodness of fit, we applied the chi-squared (χ2) test. The χ2 test indicated that the GMM fits these data, so we applied GMMs with different numbers of components and selected the best-fitting model using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The preferred GMM was chosen through this process.
3.Estimation of Probability Distribution Function
3.1.Examining Probability Distribution Function for Test
There are various types of probability distribution function. Because the sample data x in this study take both positive (+) and negative (-) values, distributions whose support is restricted to x>0 or 0≦x≦1, such as the gamma and beta distributions, were excluded. The sample data were analyzed using the extreme value distribution (EV), the generalized extreme value distribution (GEV), the logistic distribution, the normal distribution, and the Gaussian mixture model (GMM).
Fig. 4 was drawn using the fitdist and fitgmdist functions in MATLAB(2014a). Formulas (1) to (5) give the probability density function (pdf) of each of the five distribution functions, following the MATLAB documentation (MATLAB, 2014a; MATLAB, 2014b; MATLAB, 2014c).
fev (extreme value pdf for sample data x) can be described as formula (1).
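Assuming the type 1 (minimum) extreme value parameterization used in the MATLAB documentation cited above, formula (1) can be written as:

```latex
f_{ev}(x \mid \mu, \sigma) = \frac{1}{\sigma}
  \exp\!\left(\frac{x-\mu}{\sigma}\right)
  \exp\!\left(-\exp\!\left(\frac{x-\mu}{\sigma}\right)\right)
  \tag{1}
```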
Here μ and σ denote the location parameter and the scale parameter, respectively.
fgev (generalized extreme value pdf for x) can be depicted as formula (2).
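Assuming the parameterization of the MATLAB documentation cited above, with the shape parameter written as ξ, formula (2) can be written for ξ ≠ 0 and 1 + ξ(x-μ)/σ > 0 as:

```latex
f_{gev}(x \mid \xi, \mu, \sigma) = \frac{1}{\sigma}
  \exp\!\left(-\left(1+\xi\,\frac{x-\mu}{\sigma}\right)^{-1/\xi}\right)
  \left(1+\xi\,\frac{x-\mu}{\sigma}\right)^{-1-1/\xi}
  \tag{2}
```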
Here ξ denotes the shape parameter of the pdf.
Also, flogistic (logistic pdf for x) can be depicted as formula (3).
fnormal (normal pdf for x) can be depicted as formula (4).
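Assuming the same location and scale notation (μ, σ) as above, with μ and σ being the mean and standard deviation in the normal case, formulas (3) and (4) take their standard forms:

```latex
f_{logistic}(x \mid \mu, \sigma) =
  \frac{\exp\!\left(\frac{x-\mu}{\sigma}\right)}
       {\sigma\left(1+\exp\!\left(\frac{x-\mu}{\sigma}\right)\right)^{2}}
  \tag{3}

f_{normal}(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
  \tag{4}
```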
And fgmm (GMM pdf for x) can be depicted as formula (5).
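In its standard one-dimensional form, formula (5) is a weighted sum of normal densities. The symbol M for the number of Gaussian components is an assumption introduced here for illustration:

```latex
f_{gmm}(x) = \sum_{m=1}^{M} c_m \,\frac{1}{\sigma_m\sqrt{2\pi}}
  \exp\!\left(-\frac{(x-u_m)^{2}}{2\sigma_m^{2}}\right),
  \qquad \sum_{m=1}^{M} c_m = 1,\quad c_m \ge 0
  \tag{5}
```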
Here um and σm are the mean and standard deviation of the mth Gaussian distribution. cm is the mixture coefficient of the mth Gaussian distribution; it represents the proportion of the data belonging to that component, that is, the probability that a sample is drawn from the mth Gaussian distribution.
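A short sketch of how such a mixture density can be evaluated is given below. The function names and the two-component parameter values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def normal_pdf(x, u, sigma):
    """Density of a normal distribution with mean u and standard deviation sigma."""
    z = (x - u) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gmm_pdf(x, coeffs, means, sigmas):
    """Mixture density: sum over m of c_m * N(x; u_m, sigma_m)."""
    return sum(c * normal_pdf(x, u, s) for c, u, s in zip(coeffs, means, sigmas))


# Hypothetical two-component mixture (illustrative values only).
coeffs = [0.3, 0.7]      # c_m, must sum to 1
means = [-200.0, 300.0]  # u_m in metres
sigmas = [50.0, 80.0]    # sigma_m in metres

# Riemann-sum check that the mixture integrates to about 1 over a wide range.
step = 1.0
area = sum(gmm_pdf(-1500.0 + i * step, coeffs, means, sigmas) * step
           for i in range(3001))
```

Because the coefficients sum to 1 and each component is itself a density, the mixture is again a valid pdf, which the numerical check on area illustrates.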
3.2.Goodness of Fit
The χ2 values were compared to evaluate the GMM, EV, GEV, logistic, and normal distribution functions. First, the range of the estimated distribution is divided into k intervals, i.e., [a0, a1), [a1, a2), ⋯, [ak-1, ak), and the count Nj (j=1, 2, ⋯, k), the number of samples Xi falling in the jth interval, is computed for each interval. Assuming that the samples follow the hypothesized distribution, pj (the expected proportion of Xi in the jth interval) is calculated, and the test statistic is obtained with formula (6) (Wikipedia, 2015).
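The binning and test statistic can be sketched as follows, assuming formula (6) is the usual Pearson statistic, χ2 = Σj (Nj - n·pj)² / (n·pj), with n the total number of samples. The function names, the interval edges, the sample values, and the standard normal used as the hypothesized distribution are illustrative assumptions.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bin_counts(samples, edges):
    """Count samples N_j in the half-open intervals [edges[j], edges[j+1])."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for j in range(len(counts)):
            if edges[j] <= x < edges[j + 1]:
                counts[j] += 1
                break
    return counts


def chi_square_statistic(counts, probs):
    """Pearson statistic: sum of (N_j - n*p_j)^2 / (n*p_j) over all intervals."""
    n = sum(counts)
    return sum((nj - n * pj) ** 2 / (n * pj) for nj, pj in zip(counts, probs))


# Hypothetical samples tested against a standard normal distribution.
edges = [float("-inf"), -1.0, 0.0, 1.0, float("inf")]
samples = [-1.7, -0.4, -0.2, 0.1, 0.3, 0.8, 1.2, -0.9, 0.5, 2.1]
counts = bin_counts(samples, edges)
probs = [normal_cdf(b, 0.0, 1.0) - normal_cdf(a, 0.0, 1.0)
         for a, b in zip(edges, edges[1:])]
chi2 = chi_square_statistic(counts, probs)
```

A smaller statistic means the observed counts are closer to the counts expected under the hypothesized distribution, which is the sense in which the candidate distributions are compared.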
For the sample data of outbound vessels in July, the χ2 value of each distribution function is shown in Table 1. The GMM performs best, since its χ2 value is lower than those of the other distribution functions. As shown in Fig. 3, the GMM also follows the sample data more closely than the other models, which confirms that the GMM is an appropriate fit.
3.3.Selecting a Suitable Gaussian Mixture Model and Estimating Parameters
Gaussian mixture models with different numbers of components were applied to select the optimal model. They are shown in Fig. 4, where GMM2 denotes a mixture of 2 Gaussian distributions, GMM3 of 3, GMM4 of 4, GMM5 of 5, and GMM6 of 6.
However, the more Gaussian distributions are mixed, the more parameters are created, which raises an overfitting problem. For this reason, formulas (7) and (8) were used to calculate the AIC and BIC, which penalize model complexity and thereby guard against overfitting (Akaike, 1974; Schwarz, 1978).
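In their standard forms, which agree with the penalty terms 2k and k·ln(n) discussed in this section, formulas (7) and (8) read:

```latex
\mathrm{AIC} = 2k - 2\ln L \tag{7}

\mathrm{BIC} = k\ln(n) - 2\ln L \tag{8}
```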
n: sample size
k: number of estimated parameters in the model
L: maximized value of the likelihood function for the model
A model can be built with unnecessarily many parameters. To discourage such overly complex models, both criteria impose a penalty on the number of parameters, which is the so-called principle of parsimony. The penalty is 2k for AIC and k·ln(n) for BIC. The BIC penalty is therefore heavier than that of AIC whenever n ≥ 8 (ln(n) > 2), and the gap grows as n increases. For these reasons, we adopted BIC.
The AIC and BIC of each GMM for the outbound vessels in July are compared in Table 2. Under the AIC criterion, GMM6 is preferred. However, because of the overfitting problem, we chose BIC as the criterion and selected GMM3 as the best-fitting model.
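The selection procedure can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the MATLAB fitgmdist workflow used in the study: the EM fitting routine, the quantile-based starting values, the parameter count k = 3m - 1 for a one-dimensional mixture of m components (m means, m standard deviations, m - 1 free coefficients), and the synthetic bimodal sample are all assumptions.

```python
import math
import random


def normal_pdf(x, u, sigma):
    z = (x - u) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(xs, coeffs, means, sigmas):
    """Sum of log mixture densities over all samples."""
    total = 0.0
    for x in xs:
        p = sum(c * normal_pdf(x, u, s) for c, u, s in zip(coeffs, means, sigmas))
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total


def fit_gmm_1d(xs, m, iters=200, min_sigma=1e-3):
    """Fit an m-component 1-D GMM by EM; returns (coeffs, means, sigmas)."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, common spread.
    means = [ordered[int((j + 0.5) * n / m)] for j in range(m)]
    mean_all = sum(xs) / n
    spread = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n) or 1.0
    sigmas = [spread] * m
    coeffs = [1.0 / m] * m
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [c * normal_pdf(x, u, s) for c, u, s in zip(coeffs, means, sigmas)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate coefficients, means, and standard deviations.
        for j in range(m):
            nj = sum(r[j] for r in resp) or 1e-300
            coeffs[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), min_sigma)
    return coeffs, means, sigmas


def aic_bic(xs, m, params):
    """AIC and BIC with k = 3m - 1 free parameters (assumed count for a 1-D GMM)."""
    k = 3 * m - 1
    ll = log_likelihood(xs, *params)
    return 2 * k - 2 * ll, k * math.log(len(xs)) - 2 * ll


# Synthetic bimodal sample standing in for the vessel position data.
rng = random.Random(0)
sample = ([rng.gauss(0.0, 1.0) for _ in range(200)]
          + [rng.gauss(10.0, 1.0) for _ in range(200)])

fits = {m: fit_gmm_1d(sample, m) for m in (1, 2, 3)}
scores = {m: aic_bic(sample, m, fits[m]) for m in fits}
best_m = min(scores, key=lambda m: scores[m][1])  # smallest BIC wins
```

Each entry of scores holds the (AIC, BIC) pair for one candidate, and best_m is the number of components with the smallest BIC. Because BIC charges k·ln(n) per model rather than 2k, it tends to select fewer components than AIC on the same data, which mirrors the choice of GMM3 over GMM6 reported here.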
GMM3 consists of Gaussian distributions centered at 256 m, 27 m, and 348 m, with mixture coefficients of 0.07, 0.59, and 0.35, respectively. The data are thus clustered around u2 (27 m) and u3 (348 m), which have the large mixture coefficients of 0.59 and 0.35. Therefore, the parameters u1, u2, ⋯, um can be regarded as the commonly used main routes.
Table 3
Tables 4 and 5 show, for each month, the GMM selected by the BIC criterion and its parameters, separately for inbound and outbound vessels. From April to June, GMM4 was the best fit for both inbound and outbound vessels; from October to December, GMM4 was the best fit for inbound vessels and GMM3 for outbound vessels.
For the sample data of inbound vessels in May, GMM4 is the best fit, as shown in Fig. 5. Its Gaussian distributions have means (μ) of -820 m, -77 m, 206 m, and 422 m, with mixture coefficients (c) of 0.04, 0.08, 0.37, and 0.51, respectively.
The goal of modeling is to obtain a probability distribution that expresses the distribution of the given sample well. In practice, however, it is rarely possible to describe a sample distribution with a single simple model. The alternative is to use a GMM, which can approximate a wide variety of data sets with multiple Gaussian distribution functions; the GMM is therefore considered a suitable model for maritime traffic distribution.
4.Conclusions
The normal distribution is widely assumed as the probability distribution function. However, complex distributions such as multimodal ones are very hard to estimate with the normal distribution function alone, and errors may arise when single distribution functions, including the normal distribution, are used.
In this study, we sought a suitable probability distribution function for an area with multimodal traffic, using AIS observation data gathered at Mokpo port throughout 2013.
The results show that the GMM fits better than the other distribution functions, namely the EV, GEV, logistic, and normal distributions, and that it is a suitable model for the multimodal data of maritime traffic flow distribution. The data were clustered around the means (μ) with large mixture coefficients (c), so the parameters u1, u2, ⋯, um can be regarded as the commonly used main routes.
With this approach, the probability density functions used for collision probability and traffic flow distribution can be calculated more precisely in the future. We hope this advance will help enhance navigation safety and vessel traffic services.
Table 5